Let’s work on editing all elements of a DataFrame according to a condition. You have a horde of data you just imported from a CSV or an Excel doc, and you’ve managed to get it into a Pandas DataFrame using one of the built-in import methods, such as read_csv. Then you notice that one of your data people manually entered a name incorrectly: all your Pams are Pims. Before we jump in, here’s what my setup looks like.
My Computer Setup
- Python 3 (finally getting used to adding parentheses around my print statements)
- macOS
Getting your Environment Setup
You can find some info on installing pip here. The article actually covers pip installation for Python 2, and the Python 3 pip installation differs slightly. I’ll be sure to add and link another article in the near future. Assuming you have pip for Python 3 installed, execute the following to get the latest Pandas package up and running:
sudo pip3 install pandas
Creating your Test DataFrame
So, let’s set up the DataFrame that our hypothetical data entry person made the mistake in. Remember, don’t be too hard on hypothetical him/her!
import pandas as pd
# Instantiate the dataframe
df = pd.DataFrame()
# Add Values to the df
df['a'] = ['Bob', 'Bill', 'Pim']
df['b'] = ['Jim', 'Pim', 'Terry']
Cool, so now we have a 3×2 DataFrame (three rows, two columns) with some crap data that we need to fix. Remember that both of your test columns need to be the same length. You can’t create a DataFrame from arrays of different sizes, or you’ll get the “Length of values does not match length of index” error. I cover that problem here. Execute the following if you ever need to know the shape of your DataFrame. It’s much more useful when you can’t easily count the rows and columns.
df.shape
***output*** <--not code
(3, 2) <-- output and not code
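If you want to see the length mismatch error mentioned above for yourself, here is a minimal sketch. It assumes the same column-by-column setup as this article but deliberately gives column b one value too few. The exact wording of the message can vary a little between pandas versions, so the code just prints whatever it gets.

```python
import pandas as pd

df = pd.DataFrame()
df['a'] = ['Bob', 'Bill', 'Pim']  # three rows

try:
    # Only two values for a three-row DataFrame, so pandas refuses.
    df['b'] = ['Jim', 'Pim']
    error_message = None
except ValueError as err:
    error_message = str(err)

# Prints the pandas complaint about mismatched lengths.
print(error_message)
```

Catching the ValueError is only there so the script keeps running. In a notebook you would normally just read the traceback.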
Okay, so let’s get back to the problem at hand. You have a dataframe that looks like this.
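In case the picture of the DataFrame is not showing up for you, printing it gives the same view. This sketch rebuilds the test data from a dict instead of column by column, which is just a shortcut on my part and produces the same frame.

```python
import pandas as pd

# Same test data as above, built in one step from a dict.
df = pd.DataFrame({
    'a': ['Bob', 'Bill', 'Pim'],
    'b': ['Jim', 'Pim', 'Terry'],
})

print(df)
#       a      b
# 0   Bob    Jim
# 1  Bill    Pim
# 2   Pim  Terry
```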
We need to replace Pim with Pam. And how do we do that?
Replacing Pim with Pam
We’re going to use the where method of the pandas DataFrame in conjunction with its optional second parameter! This method kinda threw me off my first time around, so it might be useful to first execute it without the second parameter.
df.where(df=='Pim')
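Here is a small sketch of what that call gives back, assuming the same test data as above. The comparison df == 'Pim' produces a True/False mask, and where keeps only the cells where the mask is True. The variable names mask and kept are mine, purely for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'a': ['Bob', 'Bill', 'Pim'],
    'b': ['Jim', 'Pim', 'Terry'],
})

# Element-wise comparison: True wherever the cell is exactly 'Pim'.
mask = df == 'Pim'
print(mask)
#        a      b
# 0  False  False
# 1  False   True
# 2   True  False

# where keeps cells where the mask is True and blanks out the rest.
kept = df.where(mask)
print(kept)
#      a    b
# 0  NaN  NaN
# 1  NaN  Pim
# 2  Pim  NaN
```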
What the method’s second parameter does is replace every NaN you see in that output with the value you pass. When I had my first go at using it, I thought the logic would be: where the DataFrame equals Pim, replace with my second parameter. Instead it’s the opposite: where keeps the values where the condition is true and replaces the ones where it is false. So we need to flip the condition to not-equal, using != instead of ==. Let’s see what that looks like.
df.where(df!='Pim')
Now we can see NaN where our Pims used to be. Those are the values that get replaced. So let’s go ahead and pass the second parameter to resolve our misspelled name.
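A quick way to double check that only the Pims got blanked out is to count the NaN cells and compare that with the number of Pims in the original data. This is a sketch under the same test data as before. The names blanked, nan_count, and pim_count are just illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'a': ['Bob', 'Bill', 'Pim'],
    'b': ['Jim', 'Pim', 'Terry'],
})

# Keep everything that is NOT 'Pim'; the Pims become NaN.
blanked = df.where(df != 'Pim')
print(blanked)
#       a      b
# 0   Bob    Jim
# 1  Bill    NaN
# 2   NaN  Terry

# Count the blanked cells and the original Pims; the two should agree.
nan_count = int(blanked.isna().sum().sum())
pim_count = int((df == 'Pim').sum().sum())
print(nan_count, pim_count)  # 2 2
```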
df.where(df!='Pim', 'Pam')
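One thing worth knowing: where returns a new DataFrame and leaves the original alone, so you need to assign the result if you want to keep the fix. Here is a minimal sketch with the same test data. The name fixed is mine. The last lines compare against DataFrame.replace, which is not the approach this article uses but is another pandas way to do a straight value swap like this one.

```python
import pandas as pd

df = pd.DataFrame({
    'a': ['Bob', 'Bill', 'Pim'],
    'b': ['Jim', 'Pim', 'Terry'],
})

# Keep every cell that is not 'Pim'; fill the rest with 'Pam'.
fixed = df.where(df != 'Pim', 'Pam')
print(fixed)
#       a      b
# 0   Bob    Jim
# 1  Bill    Pam
# 2   Pam  Terry

# The original still contains the typo until you reassign it.
still_has_pim = bool((df == 'Pim').any().any())
df = fixed

# Alternative (not from the article): replace does the same swap directly.
same_as_replace = fixed.equals(fixed.replace('Pim', 'Pam'))
```

The where approach earns its keep when the condition is more than a simple equality, since any boolean DataFrame of the same shape works as the first argument.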
And our data is now cleansed and ready for consumption. No need to send the data back for costly manual editing. We can write something for that! If you enjoyed learning how to edit all elements of a DataFrame according to a condition, try out this article for more Python. Happy programming!