BigQuery’s Having Clause

BigQuery’s Having Clause

In what situation would you want to use BigQuery’s having clause outside of an interview? We’ll go over a couple of use cases and how I use it as a Data Engineer for Reddit.

My Setup

  • Doesn’t matter! Everything is done on our browser.

What Is BigQuery?

BigQuery is a data warehouse as a service. Google handles your compute, your storage, and does a pretty good job of it. It can scale to petabytes and give you some out of the box optimizations without even lifting a finger. BigQuery isn’t a good choice for application databases. It’s good for BIG data analytics and data storage.

What is BigQuery’s Having Clause?

The HAVING SQL clause isn’t specific to BigQuery. It’s common syntax that can be found across flavors of SQL. HAVING clauses let you interact with aggregates without having to use a CTE. Let’s go over what working around a having clause looks like.

Example Of No Having Clause

WITH aggregated_data as (select _PARTITIONDATE as pt, count(*) as the_count from yourTable group by 1)

select * from aggregated_data where the_count>=1000

So, we had to create a common table expression with our with statement and then filter on the aggregate in our core SQL. If you tried filtering on the_count you’d get an error that looks like Unrecognized name: the_count at [1:xyz]. Throwing your aggregate function into your where statement where it looks somehting like the following throws a different error.

select _PARTITIONDATE as pt, count(*) as the_count from yourTable where count(*)>100 group by 1

The query above will throw a Syntax error: Expected end of input but got keyword AGGREGATE at [2:1]error.

Example WITH having clause

The having clause let’s us bypass the CTE therebye making our SQL prettier and more succint.

select _PARTITIONDATE as pt, count(*) as the_count from yourTable group by 1 HAVING the_count>=1000

Conclusion

Is the having clause a game changer? No. But SQL get’s more difficult to manage the larger it gets. If you have an oppotunity to make code more succint, you should generally take it. This doesn’t mean you should give up readability for less SQL, but I might argue that having in the instances mentioned above actually achieves both. Having issues deduping arrays in BigQuery? Check out this article for a short good read. Happy coding folks!