Prevent data engineering burnout
The most effective data engineering teams are those that successfully shift from a toil culture (a series of manual one-off tasks, firefights, and moments of heroism) to a reliability culture with planned, repeatable, and scalable processes.
Data engineers play an increasingly important role in our economy. These days, every business is pushing to leverage its data to remain competitive.
In addition to standing up the pipelines that deliver data to the analysts, data scientists, and machine learning engineers who put it to work, data engineers are often also responsible for a slew of (usually thankless) maintenance tasks to keep those pipelines running. Data documentation, data quality, data observability, fixing pipeline issues, ETL code review, and a ton of other reliability work all fall on the already-full plate of the data engineer.
The most effective data engineering teams are those who successfully shift from a toil culture—where work happens in a series of manual one-off tasks, firefights, and moments of heroism—to a reliability culture with planned, repeatable, and scalable processes that keep things running smoothly without burning out the team.
They do it by borrowing time-tested best practices of modern software engineering and DevOps and applying them to the data engineering field.
Best practices for data engineering teams
1. Assign clear ownership whenever possible
Whether it’s a simple maintenance task, an update to the data model, a backfill, or an investigation into a performance issue, when the task (or, more often, its associated object) has no owner, everyone is nominally the owner, which in practice means no one is. Collective ownership and no ownership are the fastest routes to problems being ignored and growing into bigger problems.
Your team will expand, shrink, or simply rotate its members over the course of quarters or years. It’s important to establish and document owners in a publicly visible way, and just as important to hand off ownership as the team naturally turns over.
Not every owner will deliver the same level of attention and diligence to the assets for which they’re responsible. But declaring ownership allows the data engineering team to manage the care of higher- or lower-priority assets. This prioritization ensures the proper level of reliability for each piece of infrastructure, data pipeline, and downstream analytics or machine learning application.
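One way to keep ownership visible is a small registry that can be queried for gaps. The sketch below is illustrative only: the `Asset` fields, the priority labels, and the helper names are assumptions, not part of any particular catalog tool.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical priority labels; lower rank means more urgent.
PRIORITY_RANK = {"high": 0, "medium": 1, "low": 2}


@dataclass
class Asset:
    name: str
    priority: str                 # "high", "medium", or "low"
    owner: Optional[str] = None   # None means nobody has claimed it


def unowned_assets(assets: List[Asset]) -> List[Asset]:
    """Return assets with no owner, highest priority first."""
    missing = [a for a in assets if not a.owner]
    return sorted(missing, key=lambda a: PRIORITY_RANK[a.priority])


def hand_off(assets: List[Asset], departing: str, successor: str) -> List[str]:
    """Reassign everything owned by `departing` to `successor`.

    Returns the names of the assets that changed hands.
    """
    moved = []
    for asset in assets:
        if asset.owner == departing:
            asset.owner = successor
            moved.append(asset.name)
    return moved
```

Running `unowned_assets` on a schedule gives the team a prioritized list of gaps, and `hand_off` makes turnover an explicit step rather than something that happens by accident.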
2. Document processes and make them discoverable
Process often gets put on the back burner in exchange for more time to spend on “real” work. Work like tackling a request to change the data model, reducing a pipeline’s runtime, or other actions that deliver immediate value.
But good processes can pay dividends by reducing time spent. Good processes can also eliminate confusion and waste. For repeatable future tasks that will come up over and over again, good processes help you win big.
Some examples of high-value processes are:
- An incident response process when a data pipeline or data quality issue comes up
- A data sunsetting process for deciding when and how unused data will be removed
- A process for communicating upcoming changes to the data model to prevent surprises to downstream consumers
- A process for making schema changes in data sources without breaking ingestion to the data warehouse or to pipelines that depend on that data
Aside from having a clearly documented process, it’s also important to make that process discoverable by anyone who will either execute it or be impacted by it. Identify those people with a RACI matrix: who’s responsible for the actions in the process, who’s accountable for the success of the process, who is actively consulted during the process, and who is merely informed of the execution or completion of the process.
For each category of stakeholder, consider how and when they need to be aware of the process. Use things like links to documents, reminders in Slack or email, or, in the extreme case, a straight-up meeting.
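A RACI matrix can also live next to the process documentation as plain data, so that tooling can look up who needs to hear about a given step. This is a minimal sketch; the process name and team names are made up for illustration.

```python
# Hypothetical RACI matrix keyed by process name.
RACI = {
    "schema_change": {
        "responsible": ["data-eng-oncall"],
        "accountable": ["data-eng-lead"],
        "consulted": ["analytics"],
        "informed": ["ml-team", "finance-bi", "analytics"],
    },
}


def audience(raci, process, roles=("responsible", "accountable", "consulted", "informed")):
    """Return everyone in the given roles for a process, without duplicates.

    Order follows the order of `roles`, so the most involved people come first.
    """
    entry = raci[process]
    seen = []
    for role in roles:
        for person in entry.get(role, []):
            if person not in seen:
                seen.append(person)
    return seen
```

For example, `audience(RACI, "schema_change", roles=("consulted", "informed"))` returns the groups that need a heads-up before a change but do not execute it.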
3. Plug leaks before running the bilge pump
Boats have a bilge pump to push water back out of the ship’s hull. The bigger the leaks, the harder the bilge pump works to prevent sinking. When you have a growing backlog, fix the source first, in the same way you’d plug leaks on a boat.
Once new work isn’t appearing anymore, you can attack your backlog. Clearly communicate to stakeholders that the backlog is no longer growing. Then you can set deadlines. Use time-boxing and other techniques to work through the backlog as needed.
If you don’t stop the leak first, every item you clear from your backlog might be replaced by a new incoming issue that prevents you from making progress.
4. Automate to reduce toil
Columns, tables, and ETL jobs need to be more like cattle and less like pets. Keep a sharp eye out for “that one table” that sucks up a lot of time because it has to be carefully handled. It’s natural to make exceptions for high-importance tables used all around the company, or for high-criticality tasks like finance. Even so, a scalable data engineering team finds ways to apply the same maintenance tasks to those items as to any other tables or pipeline jobs in its environment.
As a rule, batch any action that you’d do more than once manually. If you find yourself repeating the same task on multiple objects, or at regular intervals, it’s time to automate!
Idempotency is your friend. Once you have a script or an endpoint that performs your task in batch across all the objects (e.g. tables or Airflow jobs), you need to be able to run it without worrying. Idempotent automations can be run again and again to get your desired objects into an end state, without changing them from there.
In this state, you can run your automation and validate that all objects were acted on. If they weren't, you can smash the button again and again until they all receive the action. There's no need to hand-pick objects for automation. Once your automation is successful, those objects will simply stop being modified upon reaching their desired state.
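The idea can be sketched with a batch job that sets a default retention period on every table that lacks one. The in-memory `tables` dictionary stands in for a real catalog or warehouse API, and the table names and the 90-day default are assumptions chosen for illustration.

```python
def apply_retention(tables, default_days=90):
    """Set a retention period on every table that lacks one.

    Tables that already have a value are left untouched, so the function
    can be re-run safely. Returns the names changed on this run.
    """
    changed = []
    for name, meta in tables.items():
        if meta.get("retention_days") is None:
            meta["retention_days"] = default_days
            changed.append(name)
    return changed


# Hypothetical catalog: one table has an explicit policy, two have none.
tables = {
    "events": {"retention_days": None},
    "orders": {"retention_days": 365},
    "tmp_scratch": {},
}

first_run = apply_retention(tables)    # touches the two tables without a policy
second_run = apply_retention(tables)   # nothing left to do, so nothing changes
```

Because the job checks the current state before acting, a partial failure is handled by simply running it again: objects already in the desired state are skipped, and the existing 365-day policy on `orders` is never overwritten.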
5. Prevent sprawl
The above techniques help you be more efficient, but you also need to slow down the rate of growth of things that need your attention. Preventing sprawl eliminates things that don’t need to hang around, so you have less to worry about on a given day.
Implement retention policies to limit the size of fast-growing tables. Many data science teams only need aggregates after a certain length of time has passed. Roll up an event log table into daily aggregates to massively reduce dataset sizes. That action in turn will massively reduce the compute and time costs needed to work with them. Converting three-year-old events into daily or even hourly aggregates may be totally fine for your consumers, while saving time and money for the company.
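As a rough sketch of such a rollup, the function below keeps recent events row by row and collapses anything older than a cutoff date into daily counts per event type. The event shape (a timestamp plus a type string) and the function name are assumptions; a real warehouse would typically do this in SQL.

```python
from collections import Counter
from datetime import date, datetime


def roll_up_daily(events, cutoff):
    """Split events into recent rows and daily aggregates for older rows.

    `events` is an iterable of (timestamp, event_type) tuples.
    Events dated before `cutoff` are reduced to counts keyed by
    (day, event_type); everything else is returned unchanged.
    """
    recent = []
    daily = Counter()
    for ts, event_type in events:
        if ts.date() < cutoff:
            daily[(ts.date(), event_type)] += 1
        else:
            recent.append((ts, event_type))
    return recent, dict(daily)


# Hypothetical event log: three old events, one recent one.
sample_events = [
    (datetime(2020, 1, 1, 9, 0), "click"),
    (datetime(2020, 1, 1, 17, 30), "click"),
    (datetime(2020, 1, 2, 8, 15), "view"),
    (datetime(2024, 6, 1, 12, 0), "click"),
]
recent_rows, daily_counts = roll_up_daily(sample_events, cutoff=date(2023, 1, 1))
```

Here the three old events shrink to two aggregate rows while the recent event keeps its full detail.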
Also, create a policy for removing data or pipeline jobs that aren’t used after a certain length of time. Ensure users have an appeal process. Consider making the data inaccessible first, to ensure that nobody is impacted before you permanently drop the data. Consider much shorter retention policies for non-production tables as well.
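A removal policy like this can be expressed as a small planning function that decides, per table, whether to keep it, make it inaccessible, or drop it. The sketch below assumes you can obtain a last-accessed date for each table; the 90-day and 180-day thresholds and the action names are illustrative, not recommendations.

```python
from datetime import date


def plan_sunset(last_accessed, today, revoke_after_days=90, drop_after_days=180):
    """Decide what to do with each table based on how long it has been idle.

    `last_accessed` maps table name -> date of last read.
    Tables are made inaccessible first and only dropped after a longer
    idle period, which leaves time for users to appeal.
    """
    plan = {}
    for table, last in last_accessed.items():
        idle_days = (today - last).days
        if idle_days >= drop_after_days:
            plan[table] = "drop"
        elif idle_days >= revoke_after_days:
            plan[table] = "revoke_access"
        else:
            plan[table] = "keep"
    return plan


# Hypothetical access log.
sunset_plan = plan_sunset(
    {
        "daily_revenue": date(2024, 6, 20),
        "old_experiment": date(2024, 3, 1),
        "legacy_export": date(2023, 1, 1),
    },
    today=date(2024, 7, 1),
)
```

Non-production tables could use the same function with much shorter thresholds.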
Best practices in the real world
Here’s a real-world example of how a company used these best practices to overcome some obstacles it had developed as it grew.
Airbnb used five tools to build, store, orchestrate, and manage its data pipelines: Hadoop, Presto, Airflow, Hive, and Spark. Their foundational data stores for modeling and analytics evolved organically, with information being appended to tables as data scientists saw fit. There was no overall master plan aligning or uniting the tables, so they grew increasingly large and hard to manage.
From a data engineering perspective:
- Some data pipelines took days to run, and any failure would take a week or more to fix
- Pipeline changes were difficult to plan, manage, and do
- Pipelines constantly ran into resource constraints, causing bottlenecks in connected systems and delays to downstream processes and workflows
- Data engineers were scared to delete unnecessary data because they didn’t know how it would affect unknown downstream dependencies
- Data quality suffered as vital machine learning features often went to null and nobody noticed for weeks
To solve these problems and alleviate the company’s stress, Airbnb formed a new data team. The data engineers on the team handled three main data reliability tasks: building new foundational tables from scratch, implementing regular quality checks for data models and pipelines, and deploying data reliability SLAs that helped identify issues faster so they could take corrective action more efficiently.
- The data engineers redesigned the data lakes, stores, and pipelines.
- As they connected tools and applications, they got more visibility into how data scientists and other business data users were consuming their data. That empowered them to make meaningful changes to the data and tech stack while ensuring downstream users weren’t negatively impacted.
- Foundational data tables and schemas were changed to be more efficient from a storage and transfer perspective. That meant their machine learning applications had to be retrained to find, process, and deliver the correct data.
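To make the idea of regular quality checks and reliability SLAs concrete, here is a minimal sketch of a check that flags a column whose null rate is too high and a table that has not been refreshed recently. It is not Airbnb’s implementation; the function names, the 5% null threshold, and the 24-hour freshness window are all assumptions.

```python
from datetime import datetime, timedelta


def null_rate(rows, column):
    """Fraction of rows where `column` is missing or None (0.0 for no rows)."""
    if not rows:
        return 0.0
    nulls = sum(1 for row in rows if row.get(column) is None)
    return nulls / len(rows)


def check_table(rows, column, last_loaded, now,
                max_null_rate=0.05, max_staleness=timedelta(hours=24)):
    """Return a list of human-readable problems; an empty list means healthy."""
    problems = []
    rate = null_rate(rows, column)
    if rate > max_null_rate:
        problems.append(
            f"{column}: null rate {rate:.0%} exceeds {max_null_rate:.0%}"
        )
    if now - last_loaded > max_staleness:
        problems.append(f"table not refreshed since {last_loaded.isoformat()}")
    return problems
```

Run on a schedule and wired to alerts, a check like this catches a feature going to null within a day rather than weeks later.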
Your data pipelines and data engineering teams were set up one way, but that doesn’t mean they’ll stay that way forever. As technology and approaches evolve, so can your data engineering practices.
These best practices are not unique to data engineering; in fact, they’re widely used in managing infrastructure and traditional software applications. They can help teams manage complexity at scale and be more efficient with their time.
All of the above can be implemented on Day One by the first data engineer. Or, like at Airbnb, teams can refine and optimize the data engineering function well into its maturity cycle. Either way, your data engineering management becomes easier and more efficient. You also help the rest of your company get the data they need to push the company forward.