BigQuery Slot Contention – Slow BigQuery Jobs

BigQuery

Are you managing a BigQuery environment at scale and dealing with slow BigQuery job execution? Does your company have hundreds or thousands of jobs running simultaneously and sometimes experience wildly slow query execution? This happens. It’s growing pains. We’ll go over a couple of methods of bringing your query speeds back up which might eventually … Read more

How to Make Your SQL Queries Blazing Fast on BigQuery

BigQuery

In this blog post we’re going to cover how to make your SQL queries blazing fast. So, BigQuery is an incredible tool for wrangling massive datasets and running SQL queries at scale. But let’s be honest: just because it can handle huge queries doesn’t mean you should throw inefficiency at it. The faster your queries run, … Read more

Update DAGs on EC2

This past weekend I decided to spin up a quick Airflow deployment for some personal scripts I wanted executed on a schedule. I didn’t have scale in mind and I didn’t have robustness in mind. I had speed. I’ll review how I’m deploying my version-controlled DAG code to my EC2 instance. I want to highlight, this … Read more

Partition Existence In BigQuery – Cheapest & Fastest Method

BigQuery

We’ll show you the cheapest & fastest way to find partition existence in BigQuery. Use BigQuery’s INFORMATION_SCHEMA to find the most recent partition ID. So instead of this: Do this: Need to translate the partition_id back into a date? Do this: Depending on your table size, this will save you both a ton of money … Read more
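The teaser above drops the post's own snippets, so here is a rough Python sketch of the idea rather than the author's code. It assumes a time-partitioned table, where partition IDs look like YYYY, YYYYMM, YYYYMMDD, or YYYYMMDDHH. The helper names `latest_partition_sql` and `partition_id_to_datetime` are made up for illustration, and so are the dataset and table names in the usage note. `INFORMATION_SCHEMA.PARTITIONS` and its `table_name` and `partition_id` columns are real BigQuery metadata.

```python
from datetime import datetime

# Assumed mapping from partition_id length to its time format
# (yearly, monthly, daily, hourly partitioning).
_FORMATS = {4: "%Y", 6: "%Y%m", 8: "%Y%m%d", 10: "%Y%m%d%H"}


def latest_partition_sql(dataset: str, table: str) -> str:
    """Build a metadata query for the most recent partition ID.

    Hypothetical helper: the names are interpolated directly, so only
    pass trusted dataset and table names.
    """
    return (
        "SELECT MAX(partition_id) AS latest_partition_id\n"
        f"FROM {dataset}.INFORMATION_SCHEMA.PARTITIONS\n"
        f"WHERE table_name = '{table}'\n"
        "  AND partition_id NOT IN ('__NULL__', '__UNPARTITIONED__')"
    )


def partition_id_to_datetime(partition_id: str) -> datetime:
    """Translate a time-based partition_id back into a datetime."""
    fmt = _FORMATS.get(len(partition_id))
    if fmt is None or not partition_id.isdigit():
        raise ValueError(f"not a time-based partition_id: {partition_id!r}")
    return datetime.strptime(partition_id, fmt)
```

For example, `latest_partition_sql("my_dataset", "events")` returns the query text to run with whatever BigQuery client you use, and `partition_id_to_datetime("20240131")` turns the result into a date you can compare against.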

Comms As A Data Engineer

Comms as a Data Engineer can be tough. Should you email a group of people? Should you dump a message in a public Slack channel they frequent? Should you follow up daily, weekly, etc? It’s a lot of manual labor. I don’t like manual work. Also, I hate email. This seems to be a common … Read more

Apache Airflow DAG Factories

What in the world are Apache Airflow DAG Factories and why should you use them? Let’s go into what they are, why they’re used, and how they could make your life easier. We’ll also go into the nitty-gritty of how to design and build one. Also, before I jump into this post, shout … Read more

How To Clone A Git Repo In Python – Updated

Python

So, a loooong time ago I wrote this post on how to clone a Git repo in Python 3. I used subprocess that first time around to run git commands. I was essentially trying to run git commands in Python explicitly. But there’s a better way to do this. It’s prettier, it’s easier to read. There’s … Read more

Pull A Domain From A Full Website Path In BigQuery

BigQuery

This post will show you how to pull a domain from a full website path in BigQuery. So let’s set the stage for a hypothetical. You own a URL shortener company. You want to partner with a website for whatever reason. You decide that you want to do analysis over the data you’ve streamed or … Read more

AWS CloudFormation, PHP, and WordPress Issues

Background Info: This blog post will discuss AWS CloudFormation, PHP, and WordPress issues. So, a couple of years back I decided to leverage a CloudFormation template to scaffold a WordPress blog. Do note that this link is close to what I used, but it isn’t exactly what I used. It spun up a load balancer, … Read more