Short tips and tricks that I’ve come across and want to remember.
PySpark Catalyst Optimization on large DataFrames
Running some operations on DataFrames with a large number of columns, or a large set of data, can get slow. PySpark does lazy evaluation, which means the com...
Connecting Databricks to Redshift with SparklyR
Databricks documents how to connect Spark to Redshift using the raw Spark libraries, but not with SparklyR, which gives some great functions you want...
Generating a non-self-signed SSL Certificate for Kubernetes
We had an issue with our Kubernetes cluster running Astronomer: we had an SSL certificate for the cluster, but it had a 90-day expiration. The configura...
Turning a CSV into a list for SQL
This is something I’ve been doing for a long time: if you have a CSV of values and don’t want to write a Python script to read in the results and execute...
Connect AWS EMR to Redshift
How to connect your Spark Cluster to Redshift
Killing long-running Redshift connections
We recently ran into an issue where we needed to kill database connections to Redshift. To do so, we just wrote the following script. Nothing fancy, but usef...