PySpark Catalyst Optimization on large DataFrames
Running some operations on DataFrames with a large number of columns, or a large set of data, can get slow. PySpark uses lazy evaluation, which means the computationally expensive work is deferred until an action actually needs the result (often when writing the data, or calling .collect() or .count()).
What actually happens when you COPY in Postgres?
I recently had someone ask me why the COPY command is more performant than INSERT INTO. While coming up with an answer, I discovered I was starting from a deficit: I didn’t know how COPY works under the hood. Any answer I gave was at best a guess. Through this post, I hope to narrow that knowledge gap and help myself and others get a deeper understanding of my favorite database.
Modifying Nginx settings on Elastic Beanstalk with Docker
We run our stack on Elastic Beanstalk and have potentially large payloads. The default maximum request body size in nginx is 1MB, which was too small for us. Here’s how to raise that limit in Elastic Beanstalk if you deploy with a Dockerfile.
Pixel Art Challenge
This one was pretty fun. During a team social, we were asked to draw a picture in Google Sheets given a color palette: 1 was red, 2 was blue, 3 was black, etc.
Connecting Databricks to Redshift with SparklyR
Databricks documents how to hook up Spark with Redshift using the raw Spark libraries, but not with SparklyR, which provides some great functions you will want (notably dplyr syntax). This post shows how to connect the two.
Airflow & Kubernetes
I’ve seen a lot of people confused about the difference between the KubernetesExecutor and the KubernetesPodOperator. They are similarly named and both use Kubernetes Pods, yet they are very different in how they run. The goal of this post is to lay out the differences and help you decide which to use.
Generating a non-self-signed SSL Certificate for Kubernetes
We had an issue with our Kubernetes cluster running Astronomer: the cluster had an SSL certificate, but it had a 90-day expiration. The configuration was managed through Terraform, but due to some version skew, as well as complicated dependency trees we didn’t want to address, we needed to generate a valid SSL certificate today and update a secret that containers in EKS were using. Typically you should automate this process somehow (an Airflow job or some similarly scheduled task), but if you need the manual version, here’s how we did it.
Turning csv into list for SQL
So this is something I’ve been doing for a long time: if you have a CSV of values and don’t want to write a Python script to read in the results and execute SQL, you can use Excel/Google Sheets/{your favorite spreadsheet program} to turn it into a comma-separated list.
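For comparison, here is roughly what the scripted route looks like. This is only a minimal sketch, not the spreadsheet method the post describes: the helper name csv_to_sql_list, the sample data, and the table name my_table are all hypothetical, and it assumes the values are trusted and can be quoted as string literals.

```python
import csv
import io


def csv_to_sql_list(csv_text: str, column: int = 0, has_header: bool = False) -> str:
    """Turn one column of CSV text into a quoted, comma-separated SQL list."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if has_header:
        rows = rows[1:]
    # Skip blank rows and empty cells so they don't become '' entries.
    values = [row[column].strip() for row in rows
              if len(row) > column and row[column].strip()]
    # Double any single quotes so values like O'Brien stay valid SQL literals.
    quoted = ["'" + value.replace("'", "''") + "'" for value in values]
    return "(" + ", ".join(quoted) + ")"


# Hypothetical sample data: a header line followed by three values.
sample = "id\n101\n102\nO'Brien\n"
in_list = csv_to_sql_list(sample, has_header=True)
print(f"SELECT * FROM my_table WHERE id IN {in_list};")
```

The resulting string can be pasted into an IN clause the same way as the spreadsheet output. For values that come from untrusted input, a parameterized query is the safer choice.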
Connect AWS EMR to Redshift
How to connect your Spark Cluster to Redshift
Killing long-running Redshift connections
We recently ran into an issue where we needed to kill database connections to Redshift. To do so, we wrote the following script. Nothing fancy, but useful.