Are you managing a BigQuery environment at scale and dealing with slow BigQuery job execution? Does your company have hundreds or thousands of jobs running simultaneously and sometimes experience wildly slow query execution? This happens. It’s growing pains. We’ll go over a couple of methods for bringing your query speeds back up, which might eventually save you a couple of dollars too. We’ll start with temporary fixes and then move into automated management.
What Causes Slot Contention Or Slow Queries?
At its core, slot contention means you don’t have enough compute (slots) to run your queries promptly. That shortfall is usually due to one of the following:
- Natural company and analytics growth.
- Someone deployed a bad query that runs on a cadence. Or several people wrote bad queries whose costs compounded.
So now you know you’re short on compute a couple of times during the day. What do you do? First, figure out whether it’s due to bad SQL or natural analytics growth. The “bytes processed” graph on the BigQuery monitoring page is generally good for this. I don’t have much traffic in my personal account (I’ve mostly done this with corporate accounts), so please excuse the fake graph.
Natural growth in bytes processed should generally look like a smoothish upward trend. It might look like:
Yours might be a little spikier here and there. But you’d be seeing a general upwards trend of bytes processed. There wouldn’t be massive growth from one day to the next. Even the following might be fine.
Your usage might actually more closely resemble what I show above, because people love to schedule jobs around 12:00 am. The peaks are the times of day when bytes processed is at its highest.
If people are complaining about slow queries and you see either of the graphs above, the slowdown might be due to natural growth.
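If you’d rather pull this trend with SQL than eyeball the monitoring page, a quick sketch like the one below works too. It assumes your jobs live in a single project and a us-central1 location (swap in your own project name and region), and it rolls bytes processed and slot milliseconds up by day over the last 30 days:

SELECT
  DATE(creation_time) AS job_date,
  SUM(total_bytes_processed) AS bytes_processed,
  SUM(total_slot_ms) AS slot_ms
FROM
  `yourProjectName.region-us-central1.INFORMATION_SCHEMA.JOBS_BY_PROJECT`
WHERE
  creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
GROUP BY
  job_date
ORDER BY
  job_date

A smooth upward slope here is the natural-growth story; a sudden step from one day to the next points at a new job.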
Now, how can we identify whether somebody ran a select * from ReallyBigTable where name like '%foo%'? We look for huge spikes in the bytes processed graph. The graphs and spikes might look something like:
Where from one day to the next, we see massive jumps in bytes processed. The massive spike above lands somewhere between October 25th and October 27th. So the job ran for the first time around October 26th, but it could have been deployed any time prior. Now that you know when the anomaly started, let’s debug further.
What Do I Do About Natural Growth?
It depends how cash strapped your company is. Generally speaking, your employees running analytics is good for the firm. People making decisions based on data make better decisions. I’d caution against stifling that growth; data democratization is hard to facilitate in the first place. But if you’re cash strapped and have to pick between staying open and letting some random person run a query, just remove access.
If you have the money, throw cash at the problem. Doing this at scale and want to make sure every last dollar is being used intelligently? We’ll talk about this a bit more later. But the general recommendation here is to buy more compute if there aren’t any crazy anomalies in your consumption.
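If you’re on reservations, “buy more compute” usually means bumping slot capacity or standing up another reservation for the noisy workload. Here’s a rough sketch of what that DDL can look like, assuming an admin project in the US multi-region and made-up reservation names; the exact options (like edition) depend on your setup:

-- Carve out additional slots in a new reservation (names are placeholders).
CREATE RESERVATION `adminProject.region-us.extra-analytics-slots`
OPTIONS (
  edition = 'ENTERPRISE',
  slot_capacity = 100);

-- Point the busy project at the new reservation.
CREATE ASSIGNMENT `adminProject.region-us.extra-analytics-slots.analytics-assignment`
OPTIONS (
  assignee = 'projects/yourProjectName',
  job_type = 'QUERY');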
What Do I Do About An Anomaly In My Bytes Processed?
Figure out what’s causing the anomaly. We can do this by using the bytes processed graph to narrow down when the first instance of the anomaly happened, then digging into the jobs that ran in that window. From the graph I showed above, it looks like the first instance happened on the 26th at around 12:00 am Pacific. Make sure to keep your timezones in mind; mine is PST.
So now you know what general time period to look into. I’ll check out 10/25 23:00 to 10/26 01:00. This gives me a little wiggle room in case there was a delay between when the job was created and when it started running. Now let’s head over to the BigQuery console. We’re going to use the INFORMATION_SCHEMA to dig into what’s been running.
So, pull up your BigQuery query console. Save the window you identified somewhere safe like Google Keep. You’ll need it in a second.
Make sure you have the correct project selected up top. You can query jobs by organization, but I don’t have an organization set up in my personal accounts, so we’ll keep this project specific for the moment.
Since I’ll be using “JOBS_BY_PROJECT”, the step above is important; otherwise, you may not find the job that’s causing you issues. If you’re using a multi-project design, you can leverage the “JOBS_BY_ORGANIZATION” view instead. That view contains less information per job but covers every BigQuery job run at your firm.
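For reference, the organization-wide version of the query looks almost identical; here’s a sketch, assuming a US multi-region location and that your account has the organization-level permissions needed to read this view:

SELECT * FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_ORGANIZATION LIMIT 100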
Now that the query console is set up, let’s run some SQL. Copy and paste the following:
SELECT * FROM `yourProjectName.region-us-central1.INFORMATION_SCHEMA.JOBS_BY_PROJECT` LIMIT 100
Hit run. This will give you visibility into the table’s schema. Want a better description of the fields you’re seeing? Check out Google’s schema for the table. Hopefully you’re seeing a “total_bytes_processed” field in that table, or “total_slot_ms.” My personal jobs are tiny, so they’re not registering as having any bytes processed, but total_slot_ms has data. Now let’s filter the table down to the window in question:
SELECT
  *
FROM
  `yourProjectName.region-us-central1.INFORMATION_SCHEMA.JOBS_BY_PROJECT`
WHERE
  DATETIME(start_time) BETWEEN "2024-10-25T23:00:00.000000"
  AND "2024-10-26T01:00:00.000000"
ORDER BY
  total_slot_ms DESC,
  total_bytes_processed DESC
Now we’ve narrowed our list of jobs down to the window where we saw our anomaly, sorted by total_slot_ms and total_bytes_processed from largest to smallest. The top few jobs are the likely culprits. I might then copy and paste some of the SQL from the “query” field into GitHub search to find the corresponding code that runs the query on a cadence. Can’t find the job in your corporate GitHub? It might be a scheduled query! In either case, you can now contact the team and ask whether they’re aware of the anomaly and whether they can explain the huge amount of resources they’re consuming. If they can’t, I’d push to kill the job. If they can, it might make sense to create a reservation specifically for their team so they can throw cash at BigQuery compute themselves. That’s an extreme case; in most instances it’s someone accidentally scheduling a bad job.
What If I Queried The Window But Can’t Figure Out Which Job Is Causing Problems?
Sometimes it isn’t a single culprit that’s causing slot contention. This might push you towards job management at scale, i.e., manual changes aren’t going to help. But staring at the table and determining which job is new and bad can be tough. I’d recommend visualizing the data to get better insight; it’ll let you see the huge growth by user_email. So first, widen the window you flagged by a day or two. We want to see trends over time. Do something like:
SELECT
  *
FROM
  `yourProjectName.region-us-central1.INFORMATION_SCHEMA.JOBS_BY_PROJECT`
WHERE
  DATETIME(start_time) BETWEEN "2024-10-22T23:00:00.000000"
  AND "2024-10-29T01:00:00.000000"
ORDER BY
  total_slot_ms DESC,
  total_bytes_processed DESC
Now your window has increased a bit, and you should be able to see a jump by caller. You can use the “Explore Data” button in the query console to poke at the data in Looker, or use whatever visualization tool your firm decided to buy. It makes no difference.
You’re going to want a line graph with “start_time” on your x axis, either “total_slot_ms” or “total_bytes_processed” on your y axis, and you’re going to want the lines colored by caller.
Now we’ll be able to see which “user_email” caused a spike on the dates in question. If the “user_email” we identified is a service account for something like Airflow or a visualization tool, we’ll need to drill down further.
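If you’d rather do that rollup in SQL before charting, a sketch like this (same placeholder project and region as the earlier queries) sums slot milliseconds and bytes processed per caller per hour across the wider window:

SELECT
  TIMESTAMP_TRUNC(start_time, HOUR) AS hour,
  user_email,
  SUM(total_slot_ms) AS slot_ms,
  SUM(total_bytes_processed) AS bytes_processed
FROM
  `yourProjectName.region-us-central1.INFORMATION_SCHEMA.JOBS_BY_PROJECT`
WHERE
  DATETIME(start_time) BETWEEN "2024-10-22T23:00:00.000000"
  AND "2024-10-29T01:00:00.000000"
GROUP BY
  hour, user_email
ORDER BY
  hour, slot_ms DESC

Plotting that output gives you one clean line per user_email instead of thousands of raw job rows.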
In the cases where you find a service account, it helps to send labels over from these services when creating BigQuery jobs. You can do this with the BigQueryInsertJobOperator in Airflow.
In tools like Mode, labels might not be feasible but you can create service accounts on a per team basis to make this exploration and diagnosis easier for you.
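Once labels are flowing through, they land in the labels column of the JOBS views, so you can roll usage up by team instead of by raw service account. Here’s a sketch, using a hypothetical team label key that you’d set in Airflow or your other tooling:

SELECT
  job_label.value AS team,
  SUM(total_slot_ms) AS slot_ms
FROM
  `yourProjectName.region-us-central1.INFORMATION_SCHEMA.JOBS_BY_PROJECT`,
  UNNEST(labels) AS job_label
WHERE
  job_label.key = 'team'  -- hypothetical label key
  AND DATETIME(start_time) BETWEEN "2024-10-22T23:00:00.000000"
  AND "2024-10-29T01:00:00.000000"
GROUP BY
  team
ORDER BY
  slot_ms DESC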
Slot Contention Is Happening All The Time, Help
Cool, so the debugging strategy described above works well during incident response. Has BigQuery crawled to a stop? Need to find something to kill? It will work wonders. But if you’re getting big enough, there’s a good chance a “bad job” is accidentally getting generated daily. We’ll go into what automated BigQuery job management might look like in the next blog post.
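Before then, one quick trick for the “need to find something to kill” moment: once you’ve pulled a job_id out of the JOBS view, you can cancel the running job straight from SQL (there’s also bq cancel on the command line). A sketch, with a placeholder job ID:

-- The job ID here is a placeholder; use one you found in the JOBS view.
CALL BQ.JOBS.CANCEL('yourProjectName.your_job_id');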
Happy coding everybody! New to BigQuery? Want something a little more bite-sized? Check out this post on efficient partition discovery.