Which scheduler should I use for dbt jobs?
As an analyst or data engineer, you’ve probably used dbt ad hoc, launching the shell and running commands against a data warehouse. That works just fine during the development and testing stages. But what happens when you need to move your dbt jobs to production?
Running dbt in production means setting up a system that runs dbt jobs on a schedule, rather than on demand. Production dbt jobs create the tables and views that your business intelligence tools and end users query.
Scheduling dbt jobs involves some complexity, even if you use a simple cron job. The process of scheduling includes monitoring, retrying failed jobs, viewing logs, caching, and receiving notifications about job execution stages. At a certain point of scale, the best practice is to use a workflow orchestration tool for scheduling.
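To make that complexity concrete, here is a minimal sketch of what a hand-rolled wrapper around a cron-triggered dbt command might look like, covering only retries and logging. The function names, the retry count, and the delay are assumptions made for illustration; `dbt run` is the only part taken from dbt itself, and a working dbt project and profile are assumed.

```python
import logging
import subprocess
import time
from typing import Callable, Sequence

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("dbt_cron")


def run_command(cmd: Sequence[str]) -> int:
    """Run a command as a subprocess and return its exit code."""
    return subprocess.run(list(cmd)).returncode


def run_with_retries(
    cmd: Sequence[str],
    max_attempts: int = 3,          # hypothetical default
    delay_seconds: float = 30.0,    # hypothetical default
    runner: Callable[[Sequence[str]], int] = run_command,
) -> bool:
    """Run cmd until it exits with code 0 or max_attempts is reached."""
    for attempt in range(1, max_attempts + 1):
        log.info("attempt %d/%d: %s", attempt, max_attempts, " ".join(cmd))
        if runner(cmd) == 0:
            log.info("succeeded on attempt %d", attempt)
            return True
        log.warning("attempt %d failed", attempt)
        if attempt < max_attempts:
            time.sleep(delay_seconds)  # wait before the next try
    log.error("giving up after %d attempts", max_attempts)
    return False
```

A cron entry would then invoke a script that calls `run_with_retries(["dbt", "run"])`. Notifications, log storage, and dependency ordering would still have to be built on top, which is the gap that orchestration tools fill.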
The official dbt documentation explains how you can schedule dbt jobs in production using tools like Apache Airflow, Dagster, Prefect, and dbt Cloud. This article goes beyond the basics and looks at each tool in detail, covering its pros and cons so that you can decide which one fits your needs.
Some context around dbt
Dbt is an open-source command-line tool for running, testing, and documenting SQL queries. When you're looking to bring software-engineering-style discipline to your data analysis work, dbt is your go-to. It performs the "T" in ELT (Extract, Load, Transform) processes. Dbt doesn’t extract or load data, but it’s extremely good at transforming the data already loaded into your warehouse.
Dbt helps analysts and engineers transform data in their warehouse by writing SELECT statements. The founders of dbt were inspired by programming practices like encapsulation and refactoring. Dbt helps with splitting SELECT statements into reusable code blocks and keeping them version-controlled. It turns these queries into tables and views that BI tools can query later.
Dbt scheduler options
Apache Airflow has been the first choice for many data engineers for workflow orchestration. Written in Python, Airflow’s core abstraction is the DAG (directed, acyclic graph), a collection of tasks connected via execution dependencies.
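The DAG idea itself can be sketched without Airflow. The example below is not Airflow code; it uses Python's standard-library `graphlib` module (Python 3.9+) to show how a set of tasks and their execution dependencies determines a valid run order. The task names are hypothetical stages of a dbt pipeline.

```python
from graphlib import TopologicalSorter

# Each key is a task; its value is the set of tasks that must finish first.
# All task names are hypothetical.
dependencies = {
    "load_raw_data": set(),
    "dbt_run": {"load_raw_data"},
    "dbt_test": {"dbt_run"},
    "dbt_docs_generate": {"dbt_run"},
    "refresh_dashboard": {"dbt_test"},
}


def execution_order(deps):
    """Return one order in which every task runs after its dependencies."""
    return list(TopologicalSorter(deps).static_order())


order = execution_order(dependencies)
print(order)
```

Because the graph is acyclic, at least one such order always exists. An orchestrator adds scheduling, retries, and monitoring on top of this ordering.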
As one of the oldest options, Airflow is a battle-tested project. It is entirely open-source, backed by the Apache Software Foundation, and is used by at least a thousand organizations today, including Disney and Zoom.
Airflow has been around for over a decade and boasts a rich ecosystem of operators and integrations. Vendors like Astronomer provide hosted Airflow-as-a-service options, freeing you from the burden of maintaining Airflow in-house.
Airflow was built a decade ago to serve a specific purpose: scheduling, ordering, and monitoring deployed workflows. However, the data landscape has continued to evolve, with new use cases popping up related to data science, machine learning, and data lakes.
As DAGs have grown more complicated, Airflow users have run up against its limitations, among them:
- Lack of support for locally testing and debugging DAGs.
- Lack of support for moving data between related tasks; instead, the data has to be stored in external storage, and information about where it is stored is passed between tasks using a mechanism called XComs.
- Lack of support for ad hoc and event-driven scheduling of DAGs (all DAGs demand some kind of schedule).
- Parallel runs of a DAG with the same execution time are impossible.
Dagster is a relatively young project. It was started in April 2018 by Nick Schrock, who previously co-created GraphQL at Facebook. Dagster goes beyond traditional orchestration to “think” holistically about the challenges of making data applications. In particular, Dagster offers a rich feature set that aligns with general software engineering principles, including:
- workflow as code
- parameterized functions with type annotations
- a rich metadata system
- strong data dependencies among functions
Dagster models a workflow (job) as a graph of Ops: metadata-rich, parameterizable functions connected via gradually typed data dependencies. To define a job, the developer writes vanilla Python functions that define the computations and the graph structure. This functional data processing API makes it simple to pass data between functions, eliminating the need for an external medium for data exchange.
Dagster Ops (or functions) are parameterized, allowing them to expect different values during execution. That is a significant change compared to Airflow, where DAGs remain static. Parameterized functions also enable Dagster to decouple compute and storage for each function, making local testing and debugging incredibly fast by passing different parameter values for each environment (e.g., DEV, TEST, PROD, etc.).
Dagster expects you to explicitly declare your function’s inputs and outputs so that it can provide typing guarantees for them. You can add additional metadata by annotating functions with required configurations, documentation, and so on. As in Airflow, though, you can place any Python logic inside the function's body.
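The idea of declared inputs and outputs with typing guarantees can be illustrated in plain Python. The sketch below is not Dagster's API; the `checked` decorator and the two functions are hypothetical, and the check only handles simple class annotations. It shows data passing directly from one function to the next, with each hand-off validated against the declared types.

```python
import inspect
from functools import wraps


def checked(func):
    """Validate arguments and the return value against class annotations."""
    signature = inspect.signature(func)
    hints = func.__annotations__

    @wraps(func)
    def wrapper(*args, **kwargs):
        bound = signature.bind(*args, **kwargs)
        for name, value in bound.arguments.items():
            expected = hints.get(name)
            if isinstance(expected, type) and not isinstance(value, expected):
                raise TypeError(
                    f"{func.__name__}: input {name!r} must be {expected.__name__}"
                )
        result = func(*args, **kwargs)
        expected = hints.get("return")
        if isinstance(expected, type) and not isinstance(result, expected):
            raise TypeError(f"{func.__name__}: output must be {expected.__name__}")
        return result

    return wrapper


@checked
def count_rows(rows: list) -> int:
    return len(rows)


@checked
def describe(table: str, row_count: int) -> str:
    return f"{table}: {row_count} rows"


# The output of one function feeds the next directly, with no external storage.
print(describe("orders", count_rows([{"id": 1}, {"id": 2}])))  # orders: 2 rows
```

Calling `count_rows("not a list")` raises a `TypeError` before the function body runs, the kind of early failure that declared types make possible.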
Dagster uses its rich metadata system to surface better, more plentiful insights. Dagit, the Dagster UI, shows the input and output data types for functions annotated during development. Those type annotations can also drive deeper data quality checks, schema validation, and other guarantees.
Like Dagster, Prefect is also a relatively young project. It was founded in 2018 by Jeremiah Lowin.
Prefect’s design is built around reducing "negative engineering," the defensive work of anticipating and handling failures. Prefect assumes that the developer knows how to code and makes it as simple as possible to build that code into a distributed pipeline backed by its scheduling and orchestration engine. The aim is to be minimally invasive when things go right and maximally helpful when they go wrong.
The tool introduces functional APIs for writing workflows (Prefect Flows) in pure Python code. Prefect tasks behave like functions. You can call them with inputs and work with their outputs. That makes converting existing code or scripts into full-fledged Prefect workflows trivial.
Prefect transparently manages the input-output dependency. Unlike in Airflow, Prefect tasks can directly exchange data, enabling complicated branching logic, richer task states, and a stricter contract between tasks and runners within a flow.
Like Dagster, Prefect supports parameterized workflows. It also provides a beautiful real-time UI, available in open-source and cloud-hosted versions.
Last but not least, dbt Cloud is a hosted service for deploying dbt jobs in production. It has turnkey support for scheduling jobs, CI/CD, serving documentation, monitoring and alerting, and an integrated development environment (IDE).
Compared to other tools, dbt Cloud is relatively new. It lacks certain advanced features offered by other orchestrators, like support for environment variables and dynamic workflows.
Which scheduler do you use?
There’s no hard and fast rule for choosing the right scheduler for dbt jobs. Here are some user opinions, straight from the source:
“Airflow is great for batch jobs on a fixed schedule but starts to show a lot of pain when you are moving to event-based or ad hoc scheduling. Dagster aims to make testing and development easier, and Prefect seemed more focused on data science workflows. I trialed Dagster back at 0.7 and it felt incomplete for me, but I know it's come a long way. If batch/scheduling cases fit 95% of your use cases, my recommendation would be Airflow. The ecosystem around it is very strong and well-tested.”
–Pedram Navid, Head of Data @ Hightouch
“I think the general heuristic we used is, once we wanted to start breaking out of a typical hourly/nightly/weekly deployment cycle, we had to integrate with our other pipelines in Airflow. Then it was just a decision between triggering cloud jobs or running dbt directly. Another advantage is being able to track and upload metadata to the warehouse more effectively in Airflow.
Dbt Cloud is the easiest out of the box, but if you already have one of the other schedulers set up, then deploying dbt is pretty easy. There are a few ways to do it, but my preference is to create a Docker image for our dbt repo and execute dbt commands in Airflow using k8s pod operators using that image. Since we already had Airflow set up, this took me about a day of work.”
–Jonathan Talmi, Senior data platform manager @ Snapcommerce
“Dagster pipelines are more structured and constrained. This allows us to have a lot of additional features (a type system, a config management system, a system-managed context object that flows through the compute, among other things). By contrast, Prefect pipelines are more minimal and dynamic.
Another way of framing this difference is that dagster is very interested in what the computations are doing rather than only how they are doing it. We consider ourselves the application layer for data applications (rich metadata, type system, structured events with semantic meaning etc), whereas prefect frames their software in terms of “negative” and “positive” engineering. This negative/positive framing is more exclusively about the “how” of pipelines: retries, operational matters etc.”
–Nick Schrock, founder @ Dagster
Schema change detection