Nick Benthem

Databricks provides documentation for hooking Spark up to Redshift using the raw Spark libraries, but not with sparklyr, which gives you some great features on top (notably - dplyr syntax). This post shows how to connect the two.

Unsurprisingly, you need to use the sparklyr::spark_read_jdbc function, but critically, you need the Redshift JDBC 4.2 driver installed - you can install it with Maven or follow the instructions at AWS (https://docs.aws.amazon.com/redshift/latest/mgmt/configure-jdbc-connection.html#download-jdbc-driver). Installing the driver is easy in Databricks - I usually attach it to the cluster - but you can install it from the notebook if you want.
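If you go the Maven route, the driver is published to Maven Central under the com.amazon.redshift group - in the cluster's Libraries > Install New > Maven dialog, a coordinate along these lines should work (the version below is just an example; pick whatever is current):

com.amazon.redshift:redshift-jdbc42:2.1.0.9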

Once it’s installed - you can use the following function to get your data - the critical component here is the driver = "com.amazon.redshift.jdbc42.Driver" piece. Your jdbcUrl needs to match that driver, so it should be of the form jdbc:redshift://endpoint:port/database

i.e.,

jdbc_url <- "jdbc:redshift://examplecluster.abc123xyz789.us-west-2.redshift.amazonaws.com:5439/dev"

Then you can create a function to load in your data. Note that dbtable gets a parenthesized subquery - Spark drops it into a SELECT * FROM ..., and Redshift (like Postgres) requires an alias on a subquery in FROM:

read_from_redshift_spark <- function(sql_string, sc, table_name, jdbcUrl, redshift_password, redshift_user) {
  df <- sparklyr::spark_read_jdbc(sc,
                    name = table_name,  # name the result will have in the Spark catalog
                    options = list(url = jdbcUrl,
                                   user = redshift_user,
                                   driver = "com.amazon.redshift.jdbc42.Driver",
                                   password = redshift_password,
                                   dbtable = sprintf("(%s) AS t", sql_string)))  # subquery needs an alias
  return(df)
}

Wrapping it all together -

library(sparklyr)
library(dplyr) # order matters - https://www.ucl.ac.uk/~uctqiax/PUBLG100/2015/faq/masking.html
config <- spark_config()
sc <- sparklyr::spark_connect(method = "databricks", config = config) # Connect to the Databricks cluster's Spark
read_from_redshift_spark <- function(sql_string, sc, table_name, jdbcUrl, redshift_password, redshift_user) {
  df <- sparklyr::spark_read_jdbc(sc,
                    name = table_name,
                    options = list(url = jdbcUrl,
                                   user = redshift_user,
                                   driver = "com.amazon.redshift.jdbc42.Driver",
                                   password = redshift_password,
                                   dbtable = sprintf("(%s) AS t", sql_string)))
  return(df)
}

jdbc_url <- "jdbc:redshift://examplecluster.abc123xyz789.us-west-2.redshift.amazonaws.com:5439/dev"
redshift_password <- "passwordhere"
redshift_user <- "userhere"
dataset_test <- read_from_redshift_spark("select * from my_table", sc, "my_table_name_spark_sql", jdbc_url, redshift_password, redshift_user)

dataset_test %>% select("column1")
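
From here the usual dplyr verbs get translated to Spark SQL for you. As a quick sketch (the column names are just placeholders for whatever is in your table):

dataset_test %>%
  filter(column1 > 100) %>%   # runs in Spark, not in R
  group_by(column2) %>%
  summarise(n = n()) %>%
  collect()                   # pull the small, aggregated result back into R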

And bam - you have dplyr syntax in Spark.