Databricks provides documentation for hooking Spark up to Redshift using the raw Spark libraries, but not with sparklyr, which gives you some great functionality (notably, dplyr syntax). This post shows how to connect the two.
Unsurprisingly, you need to use the sparklyr::spark_read_jdbc
command, but critically, you also need the Redshift JDBC42 driver installed - you can install it with Maven or follow the instructions at AWS (https://docs.aws.amazon.com/redshift/latest/mgmt/configure-jdbc-connection.html#download-jdbc-driver). Installing the driver is easy in Databricks - I usually install it on the cluster - but you can install it in the notebook if you want.
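Before you try to read anything, it's worth sanity-checking that the driver actually made it onto the classpath. One quick way (just a sketch - it assumes you already have a Spark connection sc, which gets created further down) is to ask the JVM for the driver class directly:
# Throws a ClassNotFoundException if the Redshift driver isn't installed
sparklyr::invoke_static(sc, "java.lang.Class", "forName", "com.amazon.redshift.jdbc42.Driver")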
Once it’s installed, you can use the following function to get your data - the critical component here is the driver = "com.amazon.redshift.jdbc42.Driver"
piece. Your jdbcUrl needs to be of the form jdbc:redshift://endpoint:port/database
i.e.,
jdbc_url <- "jdbc:redshift://examplecluster.abc123xyz789.us-west-2.redshift.amazonaws.com:5439/dev"
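If you'd rather not hand-assemble that string, you can build it from its pieces - the host, port, and database below are placeholders, not real values:
# Hypothetical cluster details - substitute your own endpoint, port, and database
redshift_host <- "examplecluster.abc123xyz789.us-west-2.redshift.amazonaws.com"
redshift_port <- 5439
redshift_db <- "dev"
jdbc_url <- sprintf("jdbc:redshift://%s:%s/%s", redshift_host, redshift_port, redshift_db)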
Then you can write a function to load your data:
read_from_redshift_spark <- function(sql_string, sc, table_name, jdbcUrl, redshift_password, redshift_user) {
  # Wrap the query in parentheses so Spark treats it as a derived table;
  # Redshift requires derived tables to be aliased, hence the "AS subquery".
  df <- sparklyr::spark_read_jdbc(sc,
                                  name = table_name,
                                  options = list(url = jdbcUrl,
                                                 user = redshift_user,
                                                 driver = "com.amazon.redshift.jdbc42.Driver",
                                                 password = redshift_password,
                                                 dbtable = sprintf("(%s) AS subquery", sql_string)))
  return(df)
}
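A quick aside on credentials: hardcoding a password in a notebook (as the demo below does) is fine for experimenting, but for anything real you'll want to pull it from a Databricks secret scope instead. A sketch, assuming a scope named "redshift" with user/password keys (both names are made up):
# Hypothetical secret scope and keys - create them with the Databricks CLI first
redshift_user <- dbutils.secrets.get(scope = "redshift", key = "user")
redshift_password <- dbutils.secrets.get(scope = "redshift", key = "password")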
Wrapping it all together -
library(sparklyr)
library(dplyr) # order matters - https://www.ucl.ac.uk/~uctqiax/PUBLG100/2015/faq/masking.html
config <- spark_config()
sc <- sparklyr::spark_connect(method = "databricks", config = config) # Connect to the Databricks cluster
read_from_redshift_spark <- function(sql_string, sc, table_name, jdbcUrl, redshift_password, redshift_user) {
  # Wrap the query in parentheses so Spark treats it as a derived table;
  # Redshift requires derived tables to be aliased, hence the "AS subquery".
  df <- sparklyr::spark_read_jdbc(sc,
                                  name = table_name,
                                  options = list(url = jdbcUrl,
                                                 user = redshift_user,
                                                 driver = "com.amazon.redshift.jdbc42.Driver",
                                                 password = redshift_password,
                                                 dbtable = sprintf("(%s) AS subquery", sql_string)))
  return(df)
}
jdbc_url <- "jdbc:redshift://examplecluster.abc123xyz789.us-west-2.redshift.amazonaws.com:5439/dev"
redshift_password <- "passwordhere"
redshift_user <- "userhere"
dataset_test <- read_from_redshift_spark("select * from my_table", sc, "my_table_name_spark_sql", jdbc_url, redshift_password, redshift_user)
dataset_test %>% select("column1")
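Worth knowing: what comes back is a remote Spark table, not a local data frame - dplyr verbs get translated to Spark SQL and nothing runs until you ask for results. For example (column1 is a placeholder; swap in one of your own columns):
# filter/count execute on the cluster; collect() brings the result into R
dataset_test %>%
  dplyr::filter(!is.na(column1)) %>%
  dplyr::count() %>%
  dplyr::collect()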
And bam - you have dplyr syntax in Spark.