This post will show you how to pull a domain from a full website path in BigQuery. So let’s set the stage for a hypothetical. You own a URL shortener company. You want to partner with a website for whatever reason.
You decide that you want to do analysis over the data you’ve streamed or imported into BigQuery. And now, you want to see what the base url for these paths are.
But, all the links are complicated. For example it might look like:
https://feralcat.xyz/blog/2024/02/14/aws-cloudformation-php-and-wordpress-issues/
So we can’t count how many feralcat.xyz links we have because of all the extra nonsense.
Can we regex? When can’t we regex. But why rebuild the wheel?!? Let’s use a BigQuery function to pull a domain from a full website path in BigQuery
Solution
Use a BigQuery Net function. There’s a bunch of them. Check out there docs for more info. But here’s a short list of what I found to save a couple of clicks.
Name | Summary |
---|---|
NET.HOST | Gets the hostname from a URL. |
NET.IP_FROM_STRING | Converts an IPv4 or IPv6 address from a STRING value to a BYTES value in network byte order. |
NET.IP_NET_MASK | Gets a network mask. |
NET.IP_TO_STRING | Converts an IPv4 or IPv6 address from a BYTES value in network byte order to a STRING value. |
NET.IP_TRUNC | Converts a BYTES IPv4 or IPv6 address in network byte order to a BYTES subnet address. |
NET.IPV4_FROM_INT64 | Converts an IPv4 address from an INT64 value to a BYTES value in network byte order. |
NET.IPV4_TO_INT64 | Converts an IPv4 address from a BYTES value in network byte order to an INT64 value. |
NET.PUBLIC_SUFFIX | Gets the public suffix from a URL. |
NET.REG_DOMAIN | Gets the registered or registrable domain from a URL. |
NET.SAFE_IP_FROM_STRING | Similar to the NET. , but returns NULL instead of producing an error if the input is invalid. |
In our case we’ll want the host or reg_domain functions to get the base of this website. Let’s see what using it looks like:
select net.reg_domain("https://www.feralcat.xyz/blog/2024/02/14/aws-cloudformation-php-and-wordpress-issues/")
In your situation you’d have a table column inside of the reg_domain function. I’m just using a hard coded string as an example. My output looks like:
Now we’ll be able to count the number of instances of the domain and figure out which website to reach out to. Happy coding folks! If you’re a BigQuery geek then check out this post.