Airflow for Automation
An open-source tool to programmatically author, schedule, and monitor workflows
Introduction
Automation is the use of technology to perform tasks with reduced human assistance. Any task that needs to be performed repeatedly on a regular basis is a candidate for automation.
In other words, automation reduces human interaction for a task that is the same every time you run it.
One such automation tool that I have worked with is Apache Airflow.
What is Apache Airflow?
Apache Airflow is open-source software, built by the community, that lets you programmatically author, schedule, and monitor workflows.
It allows you to write workflows as code that is easily scalable, dynamically generated, and flexible.
For example: suppose I need to delete database entries that are more than a year old. With Airflow, we can write a workflow that runs this task on a daily schedule rather than triggering the program manually each day.
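Here is a minimal sketch of what such a workflow could look like, assuming Airflow 2.x. The delete_old_rows function is a hypothetical placeholder; the real database connection and delete query would go inside it.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

def delete_old_rows():
    # Hypothetical cleanup logic: connect to the database and
    # remove every row older than one year.
    cutoff = datetime.now() - timedelta(days=365)
    print(f"Deleting rows older than {cutoff:%Y-%m-%d}")

with DAG(dag_id='daily_db_cleanup',
         start_date=datetime(2024, 3, 7),
         schedule_interval='@daily',  # run once a day instead of triggering manually
         catchup=False) as dag:
    cleanup = PythonOperator(
        task_id='delete_old_rows',
        python_callable=delete_old_rows,
    )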
Why Apache Airflow?
Apache Airflow is open source, free to use, and actively developed.
Here are some reasons why I feel Apache Airflow is a good choice.
Workflow Orchestration: Airflow models multiple interdependent tasks as a directed acyclic graph (DAG), which makes circular dependencies easy to spot and helps order tasks efficiently.
It also ensures that a task runs only after all the tasks it depends on have completed.
Parallel Execution: Tasks that do not depend on each other can run in parallel, which greatly reduces the time needed to complete a workflow (see the sketch after this list).
Alerting: Airflow can send an email alert whenever a task fails. This keeps the user carefree: no mail means today's run completed successfully, and a mail means something needs to be fixed and rerun.
Scalability: Airflow can scale dynamically to handle large amounts of data and CPU-intensive tasks.
Web UI: Airflow provides a simple web UI to monitor workflows and view the progress of each DAG and task.
Pure Python: Tasks and DAGs are written purely in Python, even when they wrap command-line instructions.
Extensibility: Airflow has a modular architecture and provides a rich set of APIs, allowing users to extend its functionality by creating custom operators, sensors, and hooks.
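As a small illustration of the orchestration, parallelism, and alerting points above, here is a minimal sketch assuming Airflow 2.x. The task functions and the email address are hypothetical placeholders, and email delivery assumes an SMTP server is configured for Airflow.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Alerting: these defaults ask Airflow to email the owner when any task fails.
default_args = {
    'owner': 'Aaron',
    'email': ['owner@example.com'],  # hypothetical address
    'email_on_failure': True,
}

def extract():
    print('extracting')

def load_a():
    print('loading A')

def load_b():
    print('loading B')

with DAG(dag_id='parallel-example',
         default_args=default_args,
         start_date=datetime(2024, 3, 7),
         schedule_interval='@daily',
         catchup=False) as dag:
    start = PythonOperator(task_id='extract', python_callable=extract)
    branch_a = PythonOperator(task_id='load_a', python_callable=load_a)
    branch_b = PythonOperator(task_id='load_b', python_callable=load_b)

    # load_a and load_b do not depend on each other,
    # so Airflow can run them in parallel once extract finishes.
    start >> [branch_a, branch_b]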
Although it has many advantages, Airflow also has some drawbacks:
Learning Curve: Users need to learn Airflow's terminology and concepts, which can take considerable time.
Resource Intensive: Airflow itself consumes a significant amount of computational power to run tasks.
Testing: Airflow pipelines are harder to test and review outside of production deployments.
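That said, a basic sanity check is still possible outside production. A common pattern is a unit test that loads every DAG file through Airflow's DagBag and fails if any of them cannot be imported; this is a minimal sketch assuming Airflow 2.x and a test runner such as pytest.
from airflow.models import DagBag

def test_dags_import_cleanly():
    # Parse every DAG file in the configured DAGs folder.
    dag_bag = DagBag(include_examples=False)
    # Any syntax error or broken import in a DAG file shows up here.
    assert dag_bag.import_errors == {}, dag_bag.import_errors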
Here is a basic code snippet defining an Airflow DAG and its tasks.
# Step 1: Importing modules
# To initiate the DAG object
from airflow import DAG
# Importing datetime for scheduling the DAG
from datetime import datetime
# Importing operators
from airflow.operators.python import PythonOperator

# Step 2: Initiating the default_args
default_args = {
    'owner': 'Aaron',
    'start_date': datetime(2024, 3, 7),
}

def count_numbers():
    for i in range(10):
        print(i)

def multiply_numbers():
    res = 1
    for i in range(1, 10):
        res = res * i
    print(res)

# Step 3: Creating the DAG object
with DAG(dag_id='DAG-1',
         default_args=default_args,
         schedule_interval='@daily',  # '@daily' runs once a day; can be any interval
         catchup=False
         ) as dag:
    # Step 4: Creating tasks
    # Creating the first task
    start = PythonOperator(
        task_id='start',
        python_callable=count_numbers  # calls the function defined above
    )
    # Creating the second task
    end = PythonOperator(
        task_id='end',
        python_callable=multiply_numbers  # calls the function defined above
    )
    # Step 5: Setting up dependencies
    # >> means end depends on start:
    # start runs first, and end runs only after start finishes.
    start >> end
    # Without this line, start and end would have no dependency
    # and could run in parallel.
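To try this out locally (assuming Airflow 2.x is installed and the file sits in your DAGs folder), you can run a single test instance of the DAG from the command line:
airflow dags test DAG-1 2024-03-07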
Conclusion
Apache Airflow offers robust workflow automation capabilities, enabling efficient task management, parallel execution, and scalability. While it may pose challenges such as a learning curve and resource intensity, its benefits in streamlining operations and ensuring reliable task execution make it a valuable tool for organizations.
Note: For more information, please visit the official website at https://airflow.apache.org.
Note: For any corrections, do reach out to me via the social links in the navbar.