Event-based Dataflow Job Orchestration with Cloud Composer, Airflow, and Cloud Functions

Apache Beam and Google Dataflow help data engineers adopt the Kappa architecture by providing a unified programming model and a managed, production-scale runner. In enterprise solutions, data processing jobs are part of complex pipelines, and fully productionizing them means integrating with existing Continuous Integration (CI) systems and automating Continuous Deployment (CD). This article explores an event-based approach to Dataflow job automation using Cloud Composer, Airflow, and Cloud Functions.

Motivation

Google Cloud Platform (GCP) documentation provides reference solutions for setting up a CI/CD pipeline and for scheduling Dataflow jobs. However, these solutions neither offer a simple interface nor abstract away the underlying orchestration system. By decoupling intent from execution with a simple manifest as the entry point, we can create a consistent interface for ad-hoc and production runs, and for both batch and streaming workloads. The event-driven architecture also makes the system easy to plug into CI and to extend.

System

The Dataflow job automation system has two main functions: rolling out code changes and scheduling recurring jobs. As input, it takes rollout and schedule manifests that reference job templates and allow overriding environment and pipeline options. It manages the lifecycle of jobs with workflows modeled as Airflow DAGs. A hypothetical rollout manifest is sketched below.
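For illustration, a rollout manifest might look like the following once parsed. The field names are hypothetical and only meant to show the kind of information the system consumes.

```python
# Hypothetical rollout manifest, shown as the parsed dict the deployer DAG
# would receive via dag_run.conf; field names are illustrative only.
rollout_manifest = {
    "job_name": "orders-enrichment",          # job to roll out
    "update_strategy": "restart",             # "restart" or "rolling_update"
    "template_gcs_path": "gs://my-templates/orders-enrichment/v2",
    "environment": {                          # environment overrides
        "region": "us-central1",
        "maxWorkers": 20,
    },
    "pipeline_options": {                     # pipeline option overrides
        "inputSubscription": "projects/my-project/subscriptions/orders",
        "outputTable": "analytics.orders_enriched",
    },
}
```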

Rollout flow

The rollout flow is responsible for pushing Dataflow job code and runtime parameter changes. There are multiple possible strategies for updating a job that is running with the old version and parameters; the two main ones are rolling update and restart. In a rolling update, the new and old versions of the job run in parallel while the new job is validated. In the restart strategy, the old job is stopped first.

  1. The user/CI uploads the rollout config to the GCS bucket.
  2. A Cloud Function is triggered by the create operation; it passes the config contents to the Cloud Composer (Airflow) REST API to trigger the deployer DAG externally (see the sketch after this list).
  3. The deployer DAG stops the job that is running with the old version.
  4. The deployer DAG starts the job with the new spec and waits on autoscaling events, metrics, and messages from the job using sensors (a sketch of the deployer DAG appears further below).
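
As a rough illustration of steps 1 and 2, a GCS-triggered Cloud Function can read the uploaded rollout manifest and trigger the deployer DAG through the Airflow REST API that Cloud Composer exposes. The web server URL and DAG id below are assumptions for this sketch.

```python
import json

import google.auth
from google.auth.transport.requests import AuthorizedSession
from google.cloud import storage

# Assumed values for this sketch: the Composer environment's Airflow web server
# URL and the deployer DAG id.
AIRFLOW_WEB_SERVER = "https://example-composer-env.composer.googleusercontent.com"
DEPLOYER_DAG_ID = "dataflow_deployer"


def on_rollout_config(event, context):
    """Background Cloud Function fired by an object-create event on the rollout bucket."""
    # Read the uploaded rollout manifest from GCS.
    blob = storage.Client().bucket(event["bucket"]).blob(event["name"])
    manifest = json.loads(blob.download_as_text())

    # Call the Airflow stable REST API with the function's default credentials.
    credentials, _ = google.auth.default(
        scopes=["https://www.googleapis.com/auth/cloud-platform"]
    )
    session = AuthorizedSession(credentials)
    response = session.post(
        f"{AIRFLOW_WEB_SERVER}/api/v1/dags/{DEPLOYER_DAG_ID}/dagRuns",
        json={"conf": manifest},  # manifest becomes dag_run.conf in the deployer DAG
    )
    response.raise_for_status()
```

The same pattern can serve the schedule flow by pointing at the controller DAG instead of the deployer DAG.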
Figure: Rollout Flow (Restart Strategy)
Figure: Dataflow Deployer DAG
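
For the restart strategy, the deployer DAG itself could look roughly like the sketch below. It uses operators and sensors from the Airflow Google provider; the job name, template path, and region are placeholders standing in for values taken from the rollout manifest.

```python
import pendulum
from airflow import DAG
from airflow.providers.google.cloud.hooks.dataflow import DataflowJobStatus
from airflow.providers.google.cloud.operators.dataflow import (
    DataflowStopJobOperator,
    DataflowTemplatedJobStartOperator,
)
from airflow.providers.google.cloud.sensors.dataflow import DataflowJobStatusSensor

# Placeholder values that would come from the rollout manifest (dag_run.conf).
JOB_NAME = "orders-enrichment"
TEMPLATE_PATH = "gs://my-templates/orders-enrichment/v2"
REGION = "us-central1"

with DAG(
    dag_id="dataflow_deployer",
    schedule=None,  # triggered externally through the REST API
    start_date=pendulum.datetime(2023, 1, 1, tz="UTC"),
    catchup=False,
) as dag:
    # Restart strategy, step 1: drain the job running with the old version.
    stop_old_job = DataflowStopJobOperator(
        task_id="stop_old_job",
        job_name_prefix=JOB_NAME,
        location=REGION,
        drain_pipeline=True,
    )

    # Step 2: launch the new version from the rolled-out template.
    start_new_job = DataflowTemplatedJobStartOperator(
        task_id="start_new_job",
        template=TEMPLATE_PATH,
        job_name=JOB_NAME,
        parameters={"inputSubscription": "projects/my-project/subscriptions/orders"},
        location=REGION,
    )

    # Step 3 (basic validation): wait until the new job reports a running state.
    # The start operator returns the created job via XCom; the exact XCom shape
    # can vary across provider versions, so treat this lookup as a sketch.
    wait_until_running = DataflowJobStatusSensor(
        task_id="wait_until_running",
        job_id="{{ task_instance.xcom_pull('start_new_job')['id'] }}",
        expected_statuses={DataflowJobStatus.JOB_STATE_RUNNING},
        location=REGION,
    )

    stop_old_job >> start_new_job >> wait_until_running
```

Stopping (or draining) the old job before starting the new one is what distinguishes this from the rolling update strategy, where both versions run in parallel during validation.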

Schedule flow

The schedule flow is responsible for creating and updating recurring Dataflow jobs. The DAGs for these jobs need to be registered with Airflow dynamically by a DAG creator, as shown below.

  1. The user/CI uploads the schedule config to the schedules GCS bucket.
  2. A Cloud Function is triggered by the create operation in the schedules config bucket.
  3. The controller DAG is triggered externally via the Cloud Composer REST API, and it updates a global Airflow Variable to add the new spec.
  4. On every dags folder scan, the dynamic DAG creator checks this variable to create or update batch DAGs (see the sketch after this list).
  5. Based on the defined start/end dates and interval, the batch DAG is triggered by the Airflow scheduler; each run launches a Dataflow job for its schedule slot.
Figure: Schedule Flow
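
As a minimal sketch of steps 3 and 4, the controller DAG can persist each schedule spec in an Airflow Variable, and a small module in the dags folder can generate one batch DAG per spec on every scan. The Variable name and spec fields below are assumptions.

```python
import json

import pendulum
from airflow import DAG
from airflow.models import Variable
from airflow.providers.google.cloud.operators.dataflow import (
    DataflowTemplatedJobStartOperator,
)

# Hypothetical Airflow Variable maintained by the controller DAG; it maps a
# schedule name to the spec taken from the schedule manifest.
schedule_specs = json.loads(Variable.get("dataflow_schedules", default_var="{}"))


def build_batch_dag(name: str, spec: dict) -> DAG:
    """Build one batch DAG per schedule spec; each run launches a Dataflow job."""
    with DAG(
        dag_id=f"dataflow_batch_{name}",
        schedule=spec["interval"],  # e.g. "0 2 * * *"
        start_date=pendulum.parse(spec["start_date"]),
        end_date=pendulum.parse(spec["end_date"]) if spec.get("end_date") else None,
        catchup=True,  # backfill every slot between the start and end dates
    ) as dag:
        DataflowTemplatedJobStartOperator(
            task_id="run_dataflow_job",
            template=spec["template_gcs_path"],
            job_name=f"{name}-{{{{ ds_nodash }}}}",  # one job per schedule slot
            parameters=spec.get("pipeline_options", {}),
            location=spec.get("region", "us-central1"),
        )
    return dag


# Register the generated DAGs so the scheduler picks them up on every dags folder scan.
for name, spec in schedule_specs.items():
    globals()[f"dataflow_batch_{name}"] = build_batch_dag(name, spec)
```

Because the specs live in a Variable rather than in the DAG files themselves, the controller DAG only needs to update the Variable for a new schedule to appear on the next scan.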

Recap

In this article, we discussed automating Dataflow job updates and scheduling with a simple interface that hides the complexities from the user. The solution is scoped to the restart strategy with basic validation and basic scheduling. It can be extended with the following features:

  • Additional validation with a test operator to verify the health of the new job.
  • Rolling update strategy.
  • Unscheduling DAGs.
