When Airbnb was scaling rapidly, it faced the problem of organizing complex data pipelines. To address this and become a data-driven organization, Airbnb built Airflow, a platform for managing complex workflows, and open-sourced it in 2015. It later became an Apache project.
In simple words, Apache Airflow is a platform where you can create, schedule, and monitor complex workflows using simple Python code.
In this explainer post, we will dive into Apache Airflow while discussing how to get started with this novel platform.
Let’s get started.
Why Should You Use Apache Airflow?
In this age of clicks, your business performance relies on how you leverage data to make operational decisions. However, orchestrating complex data becomes a nightmare as your business scales.
Enter Apache Airflow!
Apache Airflow allows you to organize complex workflows and big data batch-processing jobs with simple Python code. Whatever you can do in Python, you can do in Airflow. Airflow is also extensible: it has readily available plugins for connecting to external systems, and you can create your own custom plugins. Best of all, Airflow is scalable, so you can run thousands of tasks per day.
Apache Airflow Concepts
In this section, we will discuss some concepts that you should know before you start creating your first Airflow workflow.
- Directed Acyclic Graph (DAG)
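A DAG is a data pipeline written in Python. In Apache Airflow, each DAG represents a set of tasks you want to run, together with the relationships (dependencies) between those tasks. Because the graph is directed and has no cycles, there is always a valid order in which to run the tasks.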
Tasks are nodes of a DAG. Tasks show the work being done at every step of the workflow.
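To make the "directed acyclic" part concrete, here is a minimal sketch in plain Python. It is not Airflow code: it uses the standard library's graphlib module (Python 3.9+) to order a hypothetical three-task pipeline, and the task names extract, transform, and load are assumptions made up for illustration.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical pipeline: each key maps a task to the tasks it depends on.
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
}


def execution_order(deps):
    """Return one valid run order in which every task follows its dependencies."""
    return list(TopologicalSorter(deps).static_order())


def has_cycle(deps):
    """Return True if the graph contains a cycle and therefore is not a DAG."""
    try:
        execution_order(deps)
    except CycleError:
        return True
    return False


print(execution_order(dependencies))
print(has_cycle({"a": {"b"}, "b": {"a"}}))
```

A scheduler needs exactly this property: if two tasks depended on each other, neither could ever start.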
Operators determine the actual work done.
There are three main types of operators:
- Action Operators – Perform an action, such as running a Python function or a Bash command (PythonOperator, BashOperator)
- Transfer Operators – Move data from a source to a destination (S3ToRedshiftOperator)
- Sensor Operators – Wait for a condition to be met, such as another task finishing (ExternalTaskSensor)
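The sensor idea can be sketched without Airflow: check a condition at a fixed interval until it holds or a timeout expires. The wait_for helper below is hypothetical. Its parameter names poke_interval and timeout mirror the ones Airflow sensors accept, but this is only an illustration of the pattern, not how Airflow implements it.

```python
import time
from typing import Callable


def wait_for(
    condition: Callable[[], bool],
    poke_interval: float = 1.0,
    timeout: float = 10.0,
) -> bool:
    """Poll `condition` until it returns True or `timeout` seconds elapse.

    Returns True if the condition was met, False if the wait timed out.
    """
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return True
        # Give up if the next check would fall past the deadline.
        if time.monotonic() + poke_interval > deadline:
            return False
        time.sleep(poke_interval)


# Stand-in condition that succeeds on the third check.
attempts = {"count": 0}


def upstream_finished() -> bool:
    attempts["count"] += 1
    return attempts["count"] >= 3


print(wait_for(upstream_finished, poke_interval=0.01, timeout=1.0))
```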
As the name suggests, Hooks connect Apache Airflow to external APIs and databases like Hive, S3, GCS, MySQL, Postgres, etc.
Plugins are a combination of Hooks and Operators that are used to accomplish certain tasks.
Connections are Airflow’s information storage house. The information stored in connections allows you to connect to external systems.
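Airflow can represent a connection as a single URI. As a rough illustration of what such a record holds, the sketch below splits a URI into its parts using only the standard library. The URI, the credentials, and the describe_connection helper are all made up for this example, and Airflow's own parsing handles more cases than this.

```python
from urllib.parse import unquote, urlsplit


def describe_connection(uri: str) -> dict:
    """Split a connection URI into the fields a connection typically stores."""
    parts = urlsplit(uri)
    return {
        "conn_type": parts.scheme,
        "host": parts.hostname,
        "port": parts.port,
        # Credentials are percent-encoded inside a URI, so decode them.
        "login": unquote(parts.username) if parts.username else None,
        "password": unquote(parts.password) if parts.password else None,
        "schema": parts.path.lstrip("/") or None,
    }


# Hypothetical Postgres connection; never hard-code real credentials.
conn = describe_connection(
    "postgres://analyst:s3cr%40t@db.example.com:5432/warehouse"
)
print(conn)
```

Airflow also reads connections from environment variables named AIRFLOW_CONN_<CONN_ID> that hold a URI of this form.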
Installation and Set-Up
Now that you know the importance and core concepts of Apache Airflow, let’s install it right away.
Apache Airflow is Python-based, so make sure you have a Python version supported by the Airflow release you plan to install (the newest Python release is not always supported right away). Also, make sure pip is installed.
1. Install Apache Airflow
#Set the Airflow home directory; ~/airflow is the default
$ export AIRFLOW_HOME=~/airflow
#Now let's install Apache Airflow
$ pip3 install apache-airflow
#Check that the installation succeeded; this prints the installed Airflow version
$ airflow version
#Another way to verify your installation
$ pip3 list
#This command lists all installed packages. You should see apache-airflow in the list.
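If you prefer to check from Python rather than the shell, the standard library can report the version of an installed package. This is a small sketch; the installed_version helper is a hypothetical name, and it only confirms that the package metadata is present in the current environment.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str = "apache-airflow") -> Optional[str]:
    """Return the installed version of `package`, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


found = installed_version()
if found is None:
    print("apache-airflow is not installed in this environment")
else:
    print(f"apache-airflow {found} is installed")
```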
2. Initialize The Airflow Database
#Initialize the database
$ airflow db init
#Once the database is initialized, it’s time to create an Airflow user account. The details below are placeholders.
$ airflow users create \
    --username admin \
    --firstname Bob \
    --lastname Potter \
    --role Admin \
    --email firstname.lastname@example.org
#After you enter the details, the terminal will ask you to create a password. Set the password and this step is complete.
3. Start The Airflow Webserver
#Let’s start the Airflow webserver
$ airflow webserver --port 8080
Now, open your browser and go to localhost:8080.
Once the page loads, you will be asked for a username and password. Enter the login details you created in the terminal, and the Apache Airflow dashboard will be displayed.
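Besides opening the dashboard, you can check from a script that the webserver is up. The sketch below assumes an Airflow 2.x webserver, which exposes a /health endpoint reporting component status as JSON. The helper names fetch_health and summarize are hypothetical, and the sample payload is invented to show the expected shape.

```python
import json
import urllib.error
import urllib.request
from typing import Optional


def fetch_health(
    base_url: str = "http://localhost:8080", timeout: float = 3.0
) -> Optional[dict]:
    """Return the parsed /health payload, or None if the server is unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return json.load(resp)
    except (urllib.error.URLError, OSError, ValueError):
        return None


def summarize(health: Optional[dict]) -> str:
    """Turn a health payload into a one-line status summary."""
    if health is None:
        return "webserver unreachable"
    return ", ".join(
        f"{component}: {details.get('status', 'unknown')}"
        for component, details in sorted(health.items())
    )


# Invented sample payload; the scheduler shows as unhealthy until it is started.
sample = {
    "metadatabase": {"status": "healthy"},
    "scheduler": {"status": "unhealthy"},
}
print(summarize(sample))

# Against a running webserver: print(summarize(fetch_health()))
```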
4. Starting The Scheduler
#Open a new terminal and start the Airflow scheduler. The scheduler monitors all tasks and DAGs.
$ airflow scheduler
Upon executing these commands, Airflow will create the $AIRFLOW_HOME folder along with an “airflow.cfg” file inside it. This file contains the default settings required to get you started with Apache Airflow.
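Since airflow.cfg is an INI-style file, you can inspect it with Python's configparser. This is a minimal sketch: read_setting is a hypothetical helper, and the [core] dags_folder key is what Airflow 2.x uses, so treat the section and key names as assumptions if you run a different version.

```python
import configparser
import os
from pathlib import Path
from typing import Optional


def read_setting(cfg_path: Path, section: str, key: str) -> Optional[str]:
    """Return one value from an INI-style config file, or None if it is absent."""
    # interpolation=None keeps literal '%' characters in values from being expanded.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read(cfg_path)  # silently skips files that do not exist
    return parser.get(section, key, fallback=None)


# Resolve the same home directory the shell commands use.
airflow_home = Path(os.environ.get("AIRFLOW_HOME", "~/airflow")).expanduser()
print(read_setting(airflow_home / "airflow.cfg", "core", "dags_folder"))
```

The printed folder is where Airflow looks for your DAG files; the call prints None if the file or key is missing.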
5. Creating First Task Instance
#Now that the scheduler is running, let’s run our first task instance
$ airflow tasks run example_bash_operator runme_0 2021-03-12
#Run a backfill over 2 days
$ airflow dags backfill example_bash_operator \
    --start-date 2021-03-12 \
    --end-date 2021-03-13
Visit localhost:8080 in your browser, and you should be able to see the status of the task you ran.
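Once the bundled example works, a natural next step is a DAG of your own. The sketch below writes a small DAG file into the dags folder under your Airflow home, which is where Airflow looks for DAGs by default. The DAG source assumes Airflow 2.x import paths and arguments; the DAG name hello_airflow, the task names, and the write_starter_dag helper are all hypothetical.

```python
from pathlib import Path

# Source of a tiny two-task DAG, assuming the Airflow 2.x API.
STARTER_DAG = '''\
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator


def greet():
    print("hello from Airflow")


with DAG(
    dag_id="hello_airflow",  # hypothetical name
    start_date=datetime(2021, 3, 12),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    print_date = BashOperator(task_id="print_date", bash_command="date")
    say_hello = PythonOperator(task_id="say_hello", python_callable=greet)

    # print_date must finish before say_hello starts.
    print_date >> say_hello
'''


def write_starter_dag(airflow_home: Path) -> Path:
    """Write the starter DAG into <airflow_home>/dags and return its path."""
    dags_dir = airflow_home / "dags"
    dags_dir.mkdir(parents=True, exist_ok=True)
    target = dags_dir / "hello_airflow.py"
    target.write_text(STARTER_DAG)
    return target


# Example: write_starter_dag(Path("~/airflow").expanduser())
```

After the scheduler picks up the file, the new DAG should appear in the dashboard alongside the bundled examples.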
The Bottom Line
Airflow is now a popular choice for data engineering. It is used by, and receives public contributions from, big-name organizations like Bloomberg, Lyft, and Robinhood, to name a few.
Apache Airflow can be used for virtually any batch data pipeline. If you are someone whose job is to orchestrate jobs with complex dependencies, then this platform is for you.