Update DAGs on EC2

This past weekend I decided to spin up a quick Airflow deployment for some personal scripts I wanted executed on a schedule. I didn’t have scale in mind and I didn’t have robustness in mind. I had speed. I’ll review how I’m deploying my version-controlled DAG code to my EC2. I want to highlight that this isn’t best practice if your deployment does anything remotely important. Luckily mine doesn’t. I just wanted to get code from GitHub onto my EC2 as fast as possible. Note, we’ll breeze over how to get Airflow standing and focus on getting code quickly updated after the initial deployment is stood up.

Setting Up Your Airflow Deployment

I used a t3.medium EC2 on AWS, which has 2 virtual CPUs and 4 GiB of memory. Airflow can be memory and CPU intensive, and I wanted to make sure the machine didn’t immediately fall over. I might explore eventually cutting my machine size down to save a couple more dollars. But first things first, spin up your EC2 with the Ubuntu AMI. Then create a venv and pip install apache-airflow. We won’t go into the nuance here on how to get Airflow deployed. Check out this other dev’s blog post which digs into the details instead!
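For reference, here’s a rough sketch of what that setup looks like on Ubuntu. The venv name is a placeholder, and I’m skipping the constraints file the Airflow docs recommend for pinning dependencies:

    # on the EC2, after SSHing in
    sudo apt update && sudo apt install -y python3-pip python3-venv
    python3 -m venv ~/airflow-venv
    source ~/airflow-venv/bin/activate
    pip install apache-airflow
    # dev-only shortcut: starts the webserver and scheduler and creates an admin user
    airflow standalone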

Update DAGs on EC2

My Thought Process

So, again, I had speed in mind. The EC2-deployed Airflow is, by design, fragile. I would have loved to spin up my deployment on k8s, but a sitting EKS cluster is fairly expensive even without having it scale up during DAG execution. So for the sake of saving a couple of bucks, I’m avoiding Kubernetes. As an aside, I now kind of understand why some devs spin up home labs. K8s costs kind of push you towards a home lab if you want to freely experiment with it.

So Kubernetes was ruled out, and with it went the option of baking DAGs into a Docker image.

Now, since I have a standalone EC2, my mind then went to copying the files directly to the EC2. I’m using GitHub Actions at the moment for CI/CD. I’d need to whitelist where calls were coming from to ensure my security group let my GitHub Actions through. I started researching if there was a defined IP range. The internet eventually said there was an API I could call to get the GitHub Actions IP addresses. I saw it mentioned elsewhere that this was finicky.
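For what it’s worth, the API in question is GitHub’s meta endpoint. Something like this dumps the Actions IP ranges (assuming you have jq installed); the list is long and changes over time, which is why people call it finicky:

    curl -s https://api.github.com/meta | jq '.actions'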

So, I ultimately ended up at scheduling a DAG that runs every hour and git pulls from a repo with my DAG code in it. After the git pull happens, I copy the files to my DAGs location in their entirety. A cleaner solution would’ve probably been to cleanly clone the DAGs folder to the Airflow DAGs folder destination. But, again, through a mixture of laziness and need for speed I ended up at this solution.

The Solution

So we’re assuming your Airflow is up and accessible on your EC2.

  1. Create a repo in GitHub. The free version of GitHub will be fine.
  2. Clone the repo locally.
  3. Create a folder named “dags”.
  4. Create a “.gitignore” file and add .venv to the file. We’re going to create a venv so we can do local dev more easily.
  5. Make a file in the dags folder named “git_pull.py” or whatever you’d like.
  6. You’ll need to SSH into the EC2 now. We’ll need to create the public/private key pair whose public half we’ll upload into GitHub. I SSH into my instance with something that looks like:
    ssh -i "yourKeyFileFromAWS.pem" ec2-user@ec2-66-66-666-66.us-east-6.compute.amazonaws.com
  7. So now you’re in your EC2. Let’s generate SSH public/private keys. Execute the following:
    ssh-keygen -t ed25519 -C "your_email@example.com"
  8. Smush enter until you get through all the prompts. We don’t want a passphrase and we want the public and private keys stored in the default location.
  9. Now we’ll need to extract our public key so we can throw it into GitHub. Change dirs into the .ssh folder. Something like:
    cd ~/.ssh
    should work. Next run:
    ls
    There should be a file in that folder that has a .pub suffix. Let’s cat out its contents. So run:
    cat yourPubFile.pub
  10. Now grab the contents from the cat command and copy them into the GitHub prompt after clicking “New SSH key” on this page.
  11. Hit save in GitHub. You should now be able to clone stuff from your EC2. Note that once your server gets killed, your key does too. Also, this is kinda sketchy: if your EC2 gets compromised, the attackers get access to your GitHub account. Navigate to your Airflow folder or wherever you’d like to clone your repo and run:
    git clone git@github.com:githubStuffMetadataBlah
  12. Now, back in your local repo, let’s create a DAG that will run every hour. Open the “git_pull.py” file you made earlier and add the following:
from datetime import datetime, timedelta

from airflow.models import DAG
from airflow.operators.bash import BashOperator

# Defaults applied to every task in the DAG.
default_args = {
    'owner': 'Carlos',
    'depends_on_past': False,
    'retries': 3,
}

dag = DAG(
    dag_id="git_pull",
    default_args=default_args,
    start_date=datetime(2024, 10, 17),
    schedule_interval=timedelta(minutes=60),
    catchup=False,
    max_active_runs=1,  # don't let two syncs run at the same time
)

# Pull the latest main, then mirror the repo's dags folder into Airflow's DAGs folder.
bash_task = BashOperator(
    task_id='sync_code',
    bash_command="cd /path/to/where/you/cloned/your/dags && git pull origin main && rsync -a --delete dags/ /home/ubuntu/airflow/dags",
    dag=dag,
)

The DAG above runs every 60 minutes and pulls from `main` in the repo. Note that you’ll need to identify where DAGs are stored for your Airflow deployment and update the paths in the bash command to match your infra.
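If you’re not sure where your deployment reads DAGs from, something like this (run on the EC2, with the venv activated) should print it:

    airflow config get-value core dags_folder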

  13. Assuming you wrote the code above in your local repo, push the code to a separate branch, create a PR, and merge to main.

  14. Maneuver back to the terminal that is SSH’d into your EC2, move into the folder you cloned in step 11, and git pull from main. Give Airflow a few minutes to pick up the DAG in the Airflow UI. You’ll need to turn it on in the UI. If it’s taking a while, remember you can run “airflow dags reserialize” to speed the process up.

  15. Turn the DAG on. Now every hour the code will get synced. If you want it done faster, just manually trigger a DAG run (see the CLI equivalent below).
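If you’d rather skip the UI, something like this from the EC2 (with the venv activated) should do the same thing; the DAG id matches the dag_id in the file:

    # enable the DAG and kick off an immediate run
    airflow dags unpause git_pull
    airflow dags trigger git_pull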

Remember, this isn’t for production! This is just to get deploys and version control working. Feel free to increase the sync cadence if need be.