Dropping duplicates from a Pandas DataFrame landed on my plate when I noticed that a script doing a ton of networking was sloooooooow. Unfortunately, the endpoint didn’t support asynchronous API calls, so I was stuck getting the job done the ol’ fashioned way.
I had tons of data in a Pandas DataFrame that was getting sent over to a web application, and the part I cared about was full of duplicates. Since networking is so taxing, cutting those duplicate calls was where I could save the most time. Python, while not known to be the fastest of languages, is still very efficient! Should you needlessly loop through random shit? No. But can you, and potentially never notice that you did? Yes! Include a hefty API call in that loop and you’ll most definitely feel the effect. So let’s get to chopping down those API calls by removing duplicates from my starting data set.
# this is what's doing the heavy lifting
import pandas as pd
# define the dataframe
df = pd.DataFrame()
# the data
first_names = ['Bob', 'Sandra', 'Amanda', 'Jill', 'Sandy', 'Bob', 'Sandra', 'Sandy']
last_names = ['Bobby', 'Sanders', 'Apricot', 'Jillerson', 'Sanders', 'Bobby', 'Sanders', 'Sanders']
important_data = [1,2,3,4,5,1,2,5]
unimportant_data = ['a','b','c','d','e','f','g','h']
# adding data to the dataframe
df['first_names'] = first_names
df['last_names'] = last_names
df['important_data'] = important_data
df['unimportant_data'] = unimportant_data
df
# What's below is output of the data in the dataframe
  first_names last_names  important_data unimportant_data
0         Bob      Bobby               1                a
1      Sandra    Sanders               2                b
2      Amanda    Apricot               3                c
3        Jill  Jillerson               4                d
4       Sandy    Sanders               5                e
5         Bob      Bobby               1                f
6      Sandra    Sanders               2                g
7       Sandy    Sanders               5                h
Soooooo let’s say that this data represents people’s reviews of a restaurant and that each first_names and last_names pair represents one person. The important_data values repeat across rows that differ only in unimportant_data, maybe due to a shitty SQL query. If we wanted to send data over to another service and simply looped through this dataframe, we’d be sending the same reviews over multiple times. Bob Bobby’s review would be sent over twice (notice indexes 0 and 5). So, let’s remove duplicates based on columns ‘first_names’ and ‘last_names’.
# Bam, you no longer have duplicates
df1 = df.drop_duplicates(subset=['first_names', 'last_names'])
df1
# The following is output
  first_names last_names  important_data unimportant_data
0         Bob      Bobby               1                a
1      Sandra    Sanders               2                b
2      Amanda    Apricot               3                c
3        Jill  Jillerson               4                d
4       Sandy    Sanders               5                e
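If you want to double-check what got thrown away, or keep a different copy of each person, here is a minimal sketch using the same sample data. The names key_columns, is_repeat, dropped_rows, last_kept and unique_only are just mine for illustration; duplicated() and the keep argument are standard pandas.

```python
import pandas as pd

# Same sample data as above, built in one go
df = pd.DataFrame({
    'first_names': ['Bob', 'Sandra', 'Amanda', 'Jill', 'Sandy', 'Bob', 'Sandra', 'Sandy'],
    'last_names': ['Bobby', 'Sanders', 'Apricot', 'Jillerson', 'Sanders', 'Bobby', 'Sanders', 'Sanders'],
    'important_data': [1, 2, 3, 4, 5, 1, 2, 5],
    'unimportant_data': ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'],
})

key_columns = ['first_names', 'last_names']

# True for every row that repeats an earlier first/last name pair
is_repeat = df.duplicated(subset=key_columns)
dropped_rows = df[is_repeat]  # the rows drop_duplicates() removes by default

# keep='last' keeps the final occurrence of each pair instead of the first
last_kept = df.drop_duplicates(subset=key_columns, keep='last')

# keep=False drops every row whose pair shows up more than once
unique_only = df.drop_duplicates(subset=key_columns, keep=False)
```

The default is keep='first', which is why indexes 0 through 4 survive in the output above. Which copy you keep only matters when the other columns differ between duplicates, like unimportant_data does here.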
You can now loop through this Pandas DataFrame, create a dictionary from each row, and send that data over to wherever you’d like. What did we learn from this example? That DataFrames are powerful as f*#$%. Happy coding! Let us know in the comments where else dropping duplicates from a Pandas DataFrame has helped you get shit done!
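For anyone who wants that last loop spelled out, here is a rough sketch. send_review is a hypothetical stand-in I made up for whatever API call you actually need; it just hands the payload back so the example runs without a network.

```python
import pandas as pd

df = pd.DataFrame({
    'first_names': ['Bob', 'Sandra', 'Amanda', 'Jill', 'Sandy', 'Bob', 'Sandra', 'Sandy'],
    'last_names': ['Bobby', 'Sanders', 'Apricot', 'Jillerson', 'Sanders', 'Bobby', 'Sanders', 'Sanders'],
    'important_data': [1, 2, 3, 4, 5, 1, 2, 5],
    'unimportant_data': ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'],
})
df1 = df.drop_duplicates(subset=['first_names', 'last_names'])


def send_review(payload):
    """Hypothetical placeholder for the real API call; it only echoes the payload."""
    return payload


# orient='records' yields one dictionary per row, keyed by column name
sent = [send_review(row) for row in df1.to_dict(orient='records')]
```

With the duplicates gone, that is five calls instead of eight for this sample.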