March 26th, 2020
AWS Step Functions for Data Orchestration
By Pat Reilly

Apache Airflow has quickly become the de facto data orchestration tool for managing multiple big data pipelines. Companies running many data pipelines appreciate Airflow for its Directed Acyclic Graphs (DAGs) which allow for pipelines as code, its scalability, and its elegant user interface for tracking ETL job status. These are all good reasons to put some thought into choosing your orchestration tool, and Airflow is certainly among the best.

AWS Step Functions, however, is a serverless workflow service that lets you stitch together multiple AWS services into a single execution. For example, a PUT event for a specific S3 location could trigger a Glue job that transforms the raw data and writes it to a new location, then trigger a SageMaker training job. All of the steps in a particular workflow are declared in JSON and comprise what’s called a State Machine. This allows for reusability and versioning of the workflow in your code repository, which also makes it a great tool for data orchestration.
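To make the Glue-then-SageMaker example concrete, here is a minimal sketch of what such a State Machine definition might look like, built as a Python dictionary and serialized to JSON. The function name, state names, and all argument values are hypothetical placeholders, and the SageMaker parameters are a bare minimum that you would adjust for a real training job. The S3 PUT trigger itself lives outside the definition (for example, in an event rule) and is not shown.

```python
import json


def build_etl_training_definition(glue_job_name, role_arn, training_image, output_s3_path):
    """Return a State Machine definition as a plain dict.

    All arguments are hypothetical placeholders supplied by the caller.
    """
    return {
        "Comment": "Transform raw data with Glue, then train a model with SageMaker",
        "StartAt": "TransformRawData",
        "States": {
            "TransformRawData": {
                "Type": "Task",
                # The .sync suffix makes the workflow wait for the Glue job to finish.
                "Resource": "arn:aws:states:::glue:startJobRun.sync",
                "Parameters": {"JobName": glue_job_name},
                "Next": "TrainModel",
            },
            "TrainModel": {
                "Type": "Task",
                "Resource": "arn:aws:states:::sagemaker:createTrainingJob.sync",
                "Parameters": {
                    # Reuse the execution name so each run gets a unique job name.
                    "TrainingJobName.$": "$$.Execution.Name",
                    "RoleArn": role_arn,
                    "AlgorithmSpecification": {
                        "TrainingImage": training_image,
                        "TrainingInputMode": "File",
                    },
                    "OutputDataConfig": {"S3OutputPath": output_s3_path},
                    # Assumed instance settings; size these for your workload.
                    "ResourceConfig": {
                        "InstanceCount": 1,
                        "InstanceType": "ml.m5.large",
                        "VolumeSizeInGB": 10,
                    },
                    "StoppingCondition": {"MaxRuntimeInSeconds": 3600},
                },
                "End": True,
            },
        },
    }


if __name__ == "__main__":
    definition = build_etl_training_definition(
        glue_job_name="example-transform-job",
        role_arn="arn:aws:iam::123456789012:role/example-sagemaker-role",
        training_image="123456789012.dkr.ecr.us-east-1.amazonaws.com/example:latest",
        output_s3_path="s3://example-bucket/model-output/",
    )
    # The JSON string is what you would commit to your repository and deploy.
    print(json.dumps(definition, indent=2))
```

Because the definition is just data, it can be reviewed, diffed, and versioned like any other file in the repository.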

On any new AWS data project, the team is often tasked with determining how to orchestrate their new data processes. Oftentimes these processes have been refactored from SQL Server Integration Services (SSIS) to work with Lambda, AWS Glue, EMR, Athena, or a combination of these. Teams are rightfully curious about what life will be like in AWS with regard to troubleshooting and monitoring ETL/ELT jobs. Inevitably, they ask, “Should we use Airflow to monitor our jobs?” My answer is always the same: “Do you want to manage the infrastructure?”

In order to run an Airflow cluster effectively, you’ll need to host it on EC2 instances or a containerized compute infrastructure, which means you’ll need a team to manage, patch, and capacity plan for that infrastructure. AWS Step Functions is a low-maintenance alternative. For small data teams, this means more time can be spent on ingestion and transformation, and less time on infrastructure uptime.

Here are the top 10 reasons you should consider Step Functions over Airflow in your AWS data ecosystem:

  1. Serverless. Step Functions requires no user-maintained infrastructure. This means no EC2 to maintain, no Celery tasks to maintain, and no CPU bottlenecks with your Airflow instance.
  2. Integrations are numerous and growing. Step Functions integrates directly with several AWS services, meaning you can reference those resources directly in your state machine without using Lambda. For any service not natively supported in Step Functions, a Lambda function using Boto3 accomplishes the same thing. For a full list of services, see the Step Functions documentation.
  3. Visualize your workflow. The visualizations in Step Functions are limited, but they offer insight into failures and the data being passed between tasks. Airflow offers better visualizations, which is one of the first reasons teams opt for it over Step Functions.
  4. DAG-like pipelines. You can author pipelines as code, version them in your repo, and deploy them using CloudFormation. However, Step Functions uses JSON instead of Python.
  5. Limitless scheduling. You can schedule Step Functions workflows to be triggered in a number of different ways: events from S3, CloudWatch, SNS, SQS, etc.
  6. Wait for Callbacks. You can wait for up to a year for your workflow to complete, and you aren’t paying for that wait time. With Airflow, you would pay for the EC2 time while you wait for the process to complete.
  7. Parallelism. You can build dynamically parallel fanout and scatter-gather patterns with less code. Fanout patterns dispatch a list of identical tasks in parallel to simplify workflows such as order processing and instance patch management.
  8. Inexpensive. The service is extremely inexpensive in both resources and maintenance overhead. Charges begin only after the first 4,000 state transitions in any given month, which suggests AWS isn’t interested in making money on the service itself but rather on the services you’re orchestrating with it.
  9. No single point of failure. Since Step Functions is a managed service, it’s built with High Availability natively, which means you don’t need to procure a load balancer or weighted routing to orchestrate your pipelines. There’s also never a worry that the EC2 instance size is too small to handle the job processing.
  10. It scales. Airflow doesn’t scale natively and requires one to either containerize the software or leverage EC2 Auto Scaling and load balancing to meet demand. Step Functions is a managed service, so it scales to meet orchestration demands and concurrency.
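The callback and parallelism points above (reasons 6 and 7) can be sketched in a single definition. The example below is a minimal illustration, not a production workflow: it fans out over a list of files with a Map state, then pauses on a task token until an external process reports back. The function name, state names, the `$.files` input path, the Lambda ARN, and the queue URL are all assumptions made for the example.

```python
import json


def build_fanout_callback_definition(worker_lambda_arn, approval_queue_url, max_concurrency=5):
    """Return a definition that fans out over a list, then waits for a callback.

    The arguments are hypothetical placeholders supplied by the caller.
    """
    return {
        "StartAt": "ProcessEachFile",
        "States": {
            "ProcessEachFile": {
                "Type": "Map",
                # Assumes the execution input looks like {"files": [...]}.
                "ItemsPath": "$.files",
                "MaxConcurrency": max_concurrency,
                # Each list item runs through this nested workflow in parallel.
                "Iterator": {
                    "StartAt": "ProcessFile",
                    "States": {
                        "ProcessFile": {
                            "Type": "Task",
                            "Resource": worker_lambda_arn,
                            "End": True,
                        }
                    },
                },
                "Next": "WaitForApproval",
            },
            "WaitForApproval": {
                "Type": "Task",
                # The workflow pauses here until the task token is returned.
                "Resource": "arn:aws:states:::sqs:sendMessage.waitForTaskToken",
                "Parameters": {
                    "QueueUrl": approval_queue_url,
                    "MessageBody": {
                        "TaskToken.$": "$$.Task.Token",
                        "Results.$": "$",
                    },
                },
                # Assumed one-day limit so a lost token cannot stall the run forever.
                "TimeoutSeconds": 86400,
                "End": True,
            },
        },
    }


if __name__ == "__main__":
    definition = build_fanout_callback_definition(
        worker_lambda_arn="arn:aws:lambda:us-east-1:123456789012:function:example-worker",
        approval_queue_url="https://sqs.us-east-1.amazonaws.com/123456789012/example-queue",
    )
    print(json.dumps(definition, indent=2))
```

Whatever consumes the queue message would resume the workflow by returning the token, for instance with the Boto3 Step Functions client's `send_task_success` call. Until then the execution is idle, and no compute is billed for the wait.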

The data orchestration space will continue to evolve as data engineers look for ways to easily deploy and monitor their data processes. Expect AWS to continue to roll out new features and support for more of their services in Step Functions. The less time your team needs to spend on orchestrating ETL, the more time they can spend delivering value for the end users of the business. If you need hands-on assistance with Step Functions or any other AWS tools, reach out to info@1strategy.com; one of our AWS experts will be there to help! In addition, AWS offers several programs which are available through 1Strategy. Contact us if you think a Well-Architected Review might be right for you.