Dropping Duplicates From A Pandas DataFrame

Dropping duplicates from a Pandas DataFrame came into my purview when I noticed that a script making a ton of network calls was sloooooooow. Unfortunately, the endpoint didn’t support asynchronous API calls, so I was stuck getting the job done the ol’ fashioned way: one request at a time.

My Setup

Python 3.x
macOS

The Problem

I had tons of data in a Pandas DataFrame which was getting sent over to a web application, but I only cared about a subset of it, and that subset contained duplicates. Since the network calls were the slow part, cutting them down was where I could save the most time. Python, while not known to be the fastest of languages, is usually fast enough that a wasteful loop goes unnoticed. Should you needlessly loop through random shit? No. But can you, and potentially never notice that you did? Yes! Put a hefty API call inside that loop, though, and you’ll most definitely feel the effect. So let’s get to chopping down those API calls by removing duplicates from my starting data set.

The Solution

# this is what's doing the heavy lifting
import pandas as pd
# define the dataframe
df = pd.DataFrame()
# the data
first_names = ['Bob', 'Sandra', 'Amanda', 'Jill', 'Sandy', 'Bob', 'Sandra', 'Sandy']
last_names = ['Bobby', 'Sanders', 'Apricot', 'Jillerson', 'Sanders', 'Bobby', 'Sanders', 'Sanders']
important_data = [1,2,3,4,5,1,2,5]
unimportant_data = ['a','b','c','d','e','f','g','h']
# adding data to the dataframe
df['first_names'] = first_names
df['last_names'] = last_names
df['important_data'] = important_data
df['unimportant_data'] = unimportant_data
df
# What's below is output of the data in the dataframe
  first_names last_names  important_data unimportant_data
0         Bob      Bobby               1                a
1      Sandra    Sanders               2                b
2      Amanda    Apricot               3                c
3        Jill  Jillerson               4                d
4       Sandy    Sanders               5                e
5         Bob      Bobby               1                f
6      Sandra    Sanders               2                g
7       Sandy    Sanders               5                h

Soooooo let’s say that this data represents people’s reviews of a restaurant and that first_names plus last_names identifies one person. The important_data values are repeated on rows that differ only in unimportant_data, maybe due to a shitty SQL query. If we wanted to send this data over to another service and simply looped through the dataframe, we’d be sending some reviews over multiple times. Bob Bobby’s review would be sent over twice (notice indexes 0 and 5). So, let’s remove duplicates based on the columns ‘first_names’ and ‘last_names’.
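Before dropping anything, it can help to eyeball which rows Pandas actually treats as repeats. Here's a minimal sketch using `DataFrame.duplicated` on the same sample data; the variable names (`name_cols`, `all_copies`, `would_be_dropped`) are just mine for illustration.

```python
import pandas as pd

# Same sample data as above, built in one go
df = pd.DataFrame({
    'first_names': ['Bob', 'Sandra', 'Amanda', 'Jill', 'Sandy', 'Bob', 'Sandra', 'Sandy'],
    'last_names': ['Bobby', 'Sanders', 'Apricot', 'Jillerson', 'Sanders', 'Bobby', 'Sanders', 'Sanders'],
    'important_data': [1, 2, 3, 4, 5, 1, 2, 5],
    'unimportant_data': ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'],
})

# The columns that identify one person
name_cols = ['first_names', 'last_names']

# keep=False flags every copy of a repeated name pair, not just the later ones
all_copies = df[df.duplicated(subset=name_cols, keep=False)]

# The default (keep='first') flags only the later repeats,
# which are the rows drop_duplicates would remove
would_be_dropped = df[df.duplicated(subset=name_cols)]

print(all_copies)
print(would_be_dropped)
```

With this data, `would_be_dropped` holds the rows at indexes 5, 6, and 7, so the first row seen for each person is the one that survives.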

# Bam, you no longer have duplicates
# (by default the first row seen for each name pair is the one that's kept)
df1 = df.drop_duplicates(subset=['first_names', 'last_names'])
df1
# The following is output
  first_names last_names  important_data unimportant_data
0         Bob      Bobby               1                a
1      Sandra    Sanders               2                b
2      Amanda    Apricot               3                c
3        Jill  Jillerson               4                d
4       Sandy    Sanders               5                e

You can now loop through this Pandas DataFrame, create dictionaries from each row, and send that data over to wherever you’d like. What did we learn from this example? That DataFrames are powerful as f*#$%. Happy coding! Let us know in the comments where else dropping duplicates from a Pandas DataFrame has helped you get shit done!
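For the loop-and-send step, here's a rough sketch of one way it could look. `send_review` is a hypothetical stand-in for whatever API call you're actually making (it only collects payloads in a list here), and I've rebuilt the de-duplicated frame by hand so the snippet runs on its own.

```python
import pandas as pd

# The de-duplicated data from the example above
df1 = pd.DataFrame({
    'first_names': ['Bob', 'Sandra', 'Amanda', 'Jill', 'Sandy'],
    'last_names': ['Bobby', 'Sanders', 'Apricot', 'Jillerson', 'Sanders'],
    'important_data': [1, 2, 3, 4, 5],
})

# Collects what would have gone over the wire
sent = []

def send_review(payload):
    """Hypothetical placeholder for the real API call; it only records the payload."""
    sent.append(payload)

# orient='records' gives one dictionary per row, keyed by column name
for row in df1.to_dict(orient='records'):
    send_review(row)

print(len(sent))
```

That's five calls instead of the eight you'd make by looping over the original data.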
