Upload a Pandas DataFrame to DynamoDB using Python

So you’re trying to upload a Pandas DataFrame to DynamoDB using Python? Let’s take a step back first. Why are we using DynamoDB? What is DynamoDB?

What is DynamoDB?

DynamoDB is a non-relational, fully managed database product offered by Amazon's cloud computing arm, AWS. So why would you go the DynamoDB route vs MySQL, Postgres, Firebase Real-Time DB, etc.? Scale, scale, scale. DynamoDB can be a bad choice if you need to write complex queries or only need a small amount of storage, but it's great for applications you plan to scale. DynamoDB goes as far as preventing bad database design so that you get your data back at the same speed whether you have a couple of megabytes of data or several terabytes. Did I mention that it's fully managed? No servers to update, and less surface area prone to errors.

What is a Pandas DataFrame?

A Pandas DataFrame is a data structure that helps you easily manipulate tabular data. Each column can hold a different data type, much like a Google Sheet or an Excel doc. Need to enrich your data? A DataFrame could come in handy. It saves a developer from having to loop over objects and is, in my opinion, a big reason why Python is the preferred language for so many data folks.
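
As a quick illustration, here's a minimal sketch (with made-up data) of transforming a whole column at once instead of looping over rows:

import pandas as pd

# A tiny DataFrame; each column can hold a different data type
df = pd.DataFrame({'phrase': ['hello world', 'python'], 'count': [2, 1]})

# Enrich the data in one vectorized step -- no explicit loop needed
df['word_count'] = df['phrase'].str.count(' ') + 1
print(df)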

My Computer Setup

  • Python3 (Finally getting used to adding parentheses around my print statements)
  • macOS

Let’s Get to Sending Data

So, I’ll start off with an Excel doc. That’s where I originally had my data. I’m graduating to DynamoDB so that I can add a ton more data on top.

import openpyxl
import pandas as pd

# Load the workbook and grab the sheet that holds your data
path = "path/to/your/excel/doc.xlsm"
wb_obj = openpyxl.load_workbook(path)
rw = wb_obj['your_excel_sheet_name_here']

# Dump the sheet's cell values into a DataFrame
df = pd.DataFrame(rw.values)

We’re using openpyxl to access our Excel data. Make sure to head over to their docs if you have any specific questions. openpyxl can deal with most, if not all, of the Excel formats currently on the market, like macro-enabled .xlsm docs or your typical .xlsx docs.
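
One option worth knowing about: if your sheet contains formulas, load_workbook hands back the formula strings by default. Passing data_only=True (a real openpyxl parameter) returns the values Excel last cached instead:

import openpyxl

# data_only=True returns cached cell values instead of formula strings
wb_obj = openpyxl.load_workbook("path/to/your/excel/doc.xlsm", data_only=True)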

From there, we load the values into the DataFrame using the pd alias we previously defined. Why alias? Because we’re lazy! No other reason.
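
One gotcha: rw.values hands back every row, including the header row if your sheet has one, and the DataFrame's columns end up named 0, 1, 2, and so on. Assuming your first row really is a header, here's a small sketch to promote it to column names:

# Promote the first row to column names and drop it from the data
rows = list(rw.values)
df = pd.DataFrame(rows[1:], columns=rows[0])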

Boto3 and AWS CLI

From what I’ve learned, the boto3 package and the AWS CLI are your go-to tools for interacting with AWS products programmatically using Python. The AWS CLI requires that you set up a user in the IAM section of AWS. The AWS configuration page walks you through that setup pretty well. Then you’ll need to pip3 install the boto3 package and you should be good to go.

No Auth Keys Needed

One thing I was initially confused about when seeing boto3 example scripts was that there were no passwords or keys passed to the package. Why is that? Because boto3 reads the same credentials file (~/.aws/credentials) that aws configure writes. So if you have your AWS CLI set up, you don’t need to auth in explicitly.
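
Want to double-check that boto3 is finding those credentials? A quick sanity check using STS (a standard AWS API call, nothing DynamoDB-specific):

import boto3

# If this prints your account ID, boto3 found your CLI credentials
sts = boto3.client('sts')
print(sts.get_caller_identity()['Account'])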

SOLUTION: Creating the Table and Sending Your Data

import uuid

import boto3
import openpyxl
import pandas as pd

# Load the Excel sheet into a DataFrame, just like before
path = "path/to/your/excel/doc.xlsm"
wb_obj = openpyxl.load_workbook(path)
rw = wb_obj['your_excel_sheet_name_here']
df = pd.DataFrame(rw.values)

# The resource interface lets us put plain Python dicts as items
dynamodb = boto3.resource('dynamodb')

table_name = 'your_table_name_here'

# 'uuid' is our partition (HASH) key; every attribute named in the
# key schema must also appear in the attribute definitions
key_schema = [
    {
        'AttributeName': 'uuid',
        'KeyType': 'HASH'
    }
]
attribute_definitions = [
    {
        'AttributeName': 'uuid',
        'AttributeType': 'S'
    }
]

table = dynamodb.create_table(
    TableName=table_name,
    KeySchema=key_schema,
    AttributeDefinitions=attribute_definitions,
    ProvisionedThroughput={
        'ReadCapacityUnits': 10,
        'WriteCapacityUnits': 10
    }
)

# Table creation is asynchronous; wait until it's ready for writes
table.wait_until_exists()

for index, row in df.iterrows():
    # Assumes column 0 holds a string; anything with a space is a phrase
    if row[0].count(' ') > 0:
        phraseType = 'Phrase'
    else:
        phraseType = 'Word'
    chunk = {'phrase': row[0], 'type': phraseType, 'uuid': str(uuid.uuid1())}
    print(chunk)
    table.put_item(Item=chunk)

So, we first created the DataFrame using the Excel doc like we mentioned before. dynamodb = boto3.resource('dynamodb') instantiates the boto3 resource, which does the heavy lifting of talking to AWS. The next few lines, key_schema and attribute_definitions, create the table and define its structure; note that every attribute named in the key schema (here just uuid, our partition key) also has to appear in the attribute definitions, and that we wait for the table to finish creating before writing to it.

Then there’s the loop, which goes through each DataFrame row, does a bit of logic on the row’s data, and creates the chunk object we send to DynamoDB with put_item.
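
A quick design note: put_item makes one network round trip per row. The Table resource also offers batch_writer, which buffers puts and flushes them in batches behind the scenes; here's a minimal sketch of swapping it into the same loop:

# batch_writer buffers puts and flushes them to DynamoDB in batches
with table.batch_writer() as batch:
    for index, row in df.iterrows():
        phraseType = 'Phrase' if row[0].count(' ') > 0 else 'Word'
        batch.put_item(Item={'phrase': row[0], 'type': phraseType,
                             'uuid': str(uuid.uuid1())})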

So that’s it! Hope we helped you upload a Pandas DataFrame to DynamoDB using Python. Happen to be interested in making a Twitter bot? Check out this post! Happy coding!
