Removing Duplicate Elements From An Array in BigQuery

Removing Duplicate Elements From An Array in BigQuery

Removing Duplicate Elements From An Array in BigQuery

We’ll be removing duplicate elements from an array in BigQuery in this post. Let’s set the stage. It’s a chilly night in the office and marketing reaches out asking for data. PANIK. But they’ve given you a general location of where that data is located. CALM. You do a quick select * of the table and notice that one of the columns has arrays with repeating values which makes the dataset look ugly. PANIK! What. Are. You. Going. To. Do.

Setup

  • engine & warehouse = BigQuery
  • My machine = doesn’t matter.

Big Query Pros And Cons

So there are pros and cons to BigQuery. Huge plus is that it’s fully managed. Can BigQuery handle your company’s load? Yes, probably. It’s Google. A better question might be, can you afford it? Want to delve deeper on the tool? Here are the docs. What are some cons? There are things you don’t immediately get out of the box with tools like Spark or Presto. In this case it would be the array_distinct() function.

The Solution

So, we’ll be using a UDF or a user-defined function to remake the array_distinct function. Not familiar with the concept? Here are more docs on what a UDF is when to use them. At their simplest, they use Javascript and SQL to manipulate data. Do note that UDFs can be persistent and shared!

# declare this at the top of your query
CREATE TEMP FUNCTION array_distinct(value ANY TYPE) AS ((
  SELECT ARRAY_AGG(a.b)
  FROM (SELECT DISTINCT * FROM UNNEST(value) b) a
));

So let’s cover what this does. It takes in an array as value. It unnests that array and selects distinct values from it. Those values are then smushed back into an array and returned.

Let’s test this out with data as we should always do. Let’s setup the dataset first.

# lets setup the dataset
with test_data as (
select [1,2,3,4,4,4,4,4,4,4,4,4,4,4,4,5,6,7,23,45,34] as column_1)

# Now we have a sort of temp table with repeating values
select * from test_data


Now here’s the fix.

# declare this at the top of your query
CREATE TEMP FUNCTION array_distinct(value ANY TYPE) AS ((
  SELECT ARRAY_AGG(a.b)
  FROM (SELECT DISTINCT * FROM UNNEST(value) b) a
));

# lets setup the dataset
with test_data as (
select [1,2,3,4,4,4,4,4,4,4,4,4,4,4,4,5,6,7,23,45,34] as column_1)

# Lets remove those values with our user-defined function
select array_distinct(column_1) from test_data

And now you should have clean beautiful data. CALM. Feel free to reach out if you have technical questions or comment on this post. Working with Airflow and getting weird errors or no errors at all? Check this post out!

Prefer to pay someone instead of figuring this type of stuff out yourself? Here’s my LinkedIn.