Pull A Domain From A Full Website Path In BigQuery

Pull A Domain From A Full Website Path In BigQuery

This post will show you how to pull a domain from a full website path in BigQuery. So let’s set the stage for a hypothetical. You own a URL shortener company. You want to partner with a website for whatever reason.

You decide that you want to do analysis over the data you’ve streamed or imported into BigQuery. And now, you want to see what the base url for these paths are.

But, all the links are complicated. For example it might look like:

https://feralcat.xyz/blog/2024/02/14/aws-cloudformation-php-and-wordpress-issues/

So we can’t count how many feralcat.xyz links we have because of all the extra nonsense.

Can we regex? When can’t we regex. But why rebuild the wheel?!? Let’s use a BigQuery function to pull a domain from a full website path in BigQuery

Solution

Use a BigQuery Net function. There’s a bunch of them. Check out there docs for more info. But here’s a short list of what I found to save a couple of clicks.

NameSummary
NET.HOSTGets the hostname from a URL.
NET.IP_FROM_STRINGConverts an IPv4 or IPv6 address from a STRING value to a BYTES value in network byte order.
NET.IP_NET_MASKGets a network mask.
NET.IP_TO_STRINGConverts an IPv4 or IPv6 address from a BYTES value in network byte order to a STRING value.
NET.IP_TRUNCConverts a BYTES IPv4 or IPv6 address in network byte order to a BYTES subnet address.
NET.IPV4_FROM_INT64Converts an IPv4 address from an INT64 value to a BYTES value in network byte order.
NET.IPV4_TO_INT64Converts an IPv4 address from a BYTES value in network byte order to an INT64 value.
NET.PUBLIC_SUFFIXGets the public suffix from a URL.
NET.REG_DOMAINGets the registered or registrable domain from a URL.
NET.SAFE_IP_FROM_STRINGSimilar to the NET.IP_FROM_STRING, but returns NULL instead of producing an error if the input is invalid.

In our case we’ll want the host or reg_domain functions to get the base of this website. Let’s see what using it looks like:

select net.reg_domain("https://www.feralcat.xyz/blog/2024/02/14/aws-cloudformation-php-and-wordpress-issues/")

In your situation you’d have a table column inside of the reg_domain function. I’m just using a hard coded string as an example. My output looks like:

Now we’ll be able to count the number of instances of the domain and figure out which website to reach out to. Happy coding folks! If you’re a BigQuery geek then check out this post.