Thought leadership
-
December 2, 2022

Prevent data engineering burnout

The most effective data engineering teams are those who successfully shift from a toil culture— a series of manual one-off tasks, firefights, and moments of heroism—to a reliability culture with planned, repeatable, and scalable processes.

Kyle Kirwan

Data engineers play an increasingly important role in our economy. These days, every business is pushing to leverage its data to remain competitive.

In addition to standing up the pipelines that deliver data to the analysts, data scientists, and machine learning engineers who put it to work, data engineers are often also responsible for a slew of (usually thankless) maintenance tasks to keep those pipelines running. Data documentation, data quality, data observability, fixing pipeline issues, ETL code review, and a ton of other reliability work falls on the already-full plate of the data engineer.

The most effective data engineering teams are those who successfully shift from a toil culture—where work happens in a series of manual one-off tasks, firefights, and moments of heroism—to a reliability culture with planned, repeatable, and scalable processes that keep things running smoothly without burning out the team.

They do it by borrowing time-tested best practices of modern software engineering and DevOps and applying them to the data engineering field.

Best practices for data engineering teams

1. Assign clear ownership whenever possible

Whether it’s a simple maintenance task, an update to the data model, a backfill, or an investigation into a performance issue, when the task (or, more often, its associated object) has no owner, everyone is the owner. Collective ownership (which amounts to no ownership) is the fastest way for problems to be ignored until they turn into bigger problems.

Your team will expand, shrink, or simply rotate its team members over the course of quarters or years. Not only is it important to establish and document owners in a publicly visible way; you must hand off ownership as the team naturally turns over.

Not every owner will deliver the same level of attention and diligence to the assets for which they’re responsible. But declaring ownership allows the data engineering team to prioritize the care of higher- and lower-priority assets. This prioritization ensures each piece of infrastructure, data pipeline, and downstream analytics or machine learning application gets the level of reliability it needs.
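
As a loose illustration, ownership can live in something as simple as a version-controlled registry that gets checked automatically. The Python sketch below assumes an illustrative table_owners.yml file and a hard-coded table list standing in for a real catalog query; adapt it to however your team actually records ownership (dbt meta tags, a data catalog, etc.).

```python
# Minimal sketch: flag warehouse tables that have no documented owner.
# The registry format, file name, and sample table list are illustrative.
import yaml  # pip install pyyaml


def load_owners(path="table_owners.yml"):
    """Registry maps fully qualified table names to an owning person or team."""
    with open(path) as f:
        return yaml.safe_load(f) or {}


def find_unowned(tables, owners):
    """Return tables that nobody has claimed so they can be triaged."""
    return [t for t in tables if t not in owners]


if __name__ == "__main__":
    owners = load_owners()
    # Stand-in for a real query against your warehouse or catalog.
    tables = ["analytics.orders", "analytics.sessions", "staging.events_raw"]
    for table in find_unowned(tables, owners):
        print(f"UNOWNED: {table} - assign an owner before it becomes a problem")
```

Running a check like this in CI or on a schedule keeps ownership documented in a publicly visible way and makes handoffs explicit when the team turns over.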

2. Document processes and make them discoverable

Process often gets put on the back burner in exchange for more time to spend on “real” work: tackling a request to change the data model, reducing a pipeline’s runtime, or other actions that deliver immediate value.

But good processes pay dividends: they reduce time spent, eliminate confusion and waste, and deliver the biggest wins on tasks that will come up over and over again.

Some examples of high-value processes are:

  • An incident response process when a data pipeline or data quality issue comes up
  • A data sunsetting process for deciding when and how unused data will be removed
  • A process for communicating upcoming changes to the data model to prevent surprises to downstream consumers
  • A process for making schema changes in data sources without breaking ingestion to the data warehouse or to pipelines that depend on that data

Aside from having a clearly documented process, it’s also important to make that process discoverable by anyone who will either execute it or be impacted by it. Identify those people with a RACI matrix: who’s responsible for the actions in the process, who’s accountable for its success, who is actively consulted along the way, and who is merely informed of its execution or completion.

For each category of stakeholder, consider how and when they need to be aware of the process, and use things like links to documents, reminders in Slack or email, or, in the extreme case, a straight-up meeting.
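
One lightweight way to keep a RACI matrix actionable is to store it as data next to the process documentation. The Python sketch below is illustrative only; the process name, group names, and notify() helper are placeholders rather than a real Slack or email integration.

```python
# Minimal sketch of a RACI matrix as data, so notification rules live next to
# the process doc instead of in someone's head. All names are placeholders.
RACI = {
    "schema_change": {
        "responsible": ["data-eng-oncall"],   # executes the change
        "accountable": ["data-eng-lead"],     # owns the outcome
        "consulted":   ["analytics-eng"],     # reviewed before the change
        "informed":    ["#data-consumers"],   # told when it ships
    },
}


def notify(process, stage):
    """Decide who hears about a process at each stage of its execution."""
    matrix = RACI[process]
    if stage == "proposed":
        return matrix["responsible"] + matrix["accountable"] + matrix["consulted"]
    if stage == "completed":
        return matrix["informed"] + matrix["accountable"]
    return matrix["responsible"]


print(notify("schema_change", "proposed"))
```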

3. Plug leaks before running the bilge pump

Boats have a bilge pump to push water that seeps in back out of the hull. The bigger the leaks, the harder the bilge pump has to work to keep the boat from sinking. When you have a growing backlog, fix the source of the incoming work first, the same way you’d plug the leaks before relying on the pump.

Once new work stops appearing, you can attack the backlog. Clearly communicate to stakeholders that it’s no longer growing, then set deadlines and use time-boxing or other techniques to work through it as needed.

If you don’t stop the leak first, every item you clear from your backlog might be replaced by a new incoming issue that prevents you from making progress.

4. Automate to reduce toil

Columns, tables, and ETL jobs need to be more like cattle and less like pets. Keep a sharp eye out for “that one table” that sucks up a lot of time because it has to be handled carefully. It’s natural to make exceptions for high-importance tables used all around the company, or for high-criticality tasks like finance. But a scalable data engineering team finds ways to apply the same maintenance processes to those items as to any other table or pipeline job in its environment.

As a rule, batch any action that you’d do more than once manually. If you find yourself repeating the same task on multiple objects, or at regular intervals, it’s time to automate!

Idempotency is your friend. Once you have a script or an endpoint that performs your task in batch across all the relevant objects (e.g., tables, Airflow jobs), you need to be able to run it without worrying. Idempotent automations can be run again and again to bring objects into the desired end state without changing the ones that are already there.

In this state, you can run your automation, validate that every object was acted on, and, if some weren’t, smash the button again until they all are. There’s no need to hand-pick which objects still need the action; once they reach the desired state, the automation simply stops modifying them.
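
To make this concrete, here’s a minimal Python sketch of an idempotent batch action. The warehouse client and its get_tables(), get_retention(), and set_retention() methods are hypothetical; the point is that re-running the script only touches objects that haven’t yet reached the desired state.

```python
# Minimal sketch of an idempotent batch action: enforce a retention setting on
# every table in a schema. The `client` and its methods are hypothetical
# wrappers around your warehouse's API or SQL commands.
DESIRED_RETENTION_DAYS = 90


def ensure_retention(client):
    for table in client.get_tables(schema="analytics"):
        current = client.get_retention(table)
        if current == DESIRED_RETENTION_DAYS:
            continue  # already in the desired end state; re-running changes nothing
        client.set_retention(table, DESIRED_RETENTION_DAYS)
        print(f"updated {table}: {current} -> {DESIRED_RETENTION_DAYS} days")


# Safe to run on a schedule, or to "smash the button" after a partial failure:
# tables that already match the target are skipped, and the rest converge.
```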

5. Prevent sprawl

The above techniques help you be more efficient, but you also need to slow the growth of things that need your attention. Preventing sprawl eliminates things that don’t need to hang around, so you have less to worry about on a given day.

Implement retention policies to limit the size of fast-growing tables. Many data science teams only need aggregates once a certain amount of time has passed. Rolling up an event log table into daily aggregates massively reduces dataset sizes, which in turn reduces the compute and time needed to work with them. Converting three-year-old events into daily or even hourly aggregates may be totally fine for your consumers while saving time and money for the company.
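
As a rough sketch of what that rollup might look like, the Python snippet below assumes a hypothetical warehouse client with an execute() method, plus illustrative table and column names; the real SQL will depend on your schema and warehouse dialect.

```python
# Minimal sketch of an event-log rollup: summarize old raw events into a
# daily aggregate table, then prune the raw rows. Table, column, and client
# names are illustrative.
ROLLUP_SQL = """
INSERT INTO analytics.events_daily (event_date, event_name, event_count, unique_users)
SELECT event_date,
       event_name,
       COUNT(*) AS event_count,
       COUNT(DISTINCT user_id) AS unique_users
FROM analytics.events_raw
WHERE event_date < CURRENT_DATE - INTERVAL '3 years'
GROUP BY event_date, event_name
"""

PRUNE_SQL = """
DELETE FROM analytics.events_raw
WHERE event_date < CURRENT_DATE - INTERVAL '3 years'
"""


def roll_up_old_events(client):
    client.execute(ROLLUP_SQL)  # keep the daily aggregates consumers actually use
    client.execute(PRUNE_SQL)   # drop the raw rows once they're summarized
```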

Also, create a policy for removing data or pipeline jobs that haven’t been used for a certain length of time. Ensure users have an appeal process. Consider making the data inaccessible first, to confirm that nobody is impacted before you permanently drop it. Consider much shorter retention policies for non-production tables as well.
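
A sunsetting flow along those lines might look like the following sketch, where the client methods and the deprecation log are hypothetical; the point is the two-step pattern of revoking access first and dropping only after a grace period with no appeals.

```python
# Minimal sketch of a two-step sunsetting flow: lock the table first, then
# drop it only after a grace period. The client methods and deprecation_log
# structure are hypothetical.
from datetime import datetime, timedelta, timezone

GRACE_PERIOD = timedelta(days=30)


def sunset(client, table, deprecation_log):
    marked_at = deprecation_log.get(table)
    if marked_at is None:
        client.revoke_all_access(table)  # step 1: make it inaccessible
        deprecation_log[table] = datetime.now(timezone.utc)
        print(f"{table} locked; drop scheduled in {GRACE_PERIOD.days} days")
    elif datetime.now(timezone.utc) - marked_at > GRACE_PERIOD:
        client.drop_table(table)         # step 2: nobody appealed, drop it
        del deprecation_log[table]
```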

Best practices in the real world

Here’s a real-world example of how one company used these best practices to overcome obstacles that had accumulated as it grew.

Airbnb used five tools to build, store, orchestrate, and manage its data pipelines (Hadoop, Presto, Airflow, Hive, and Spark). Its foundational data stores for modeling and analytics evolved organically, with information being appended to tables as data scientists saw fit. There was no overall master plan aligning or uniting the tables, so they grew increasingly large and hard to manage.

The challenges

From a data engineering perspective:

  • Some data pipelines took days to run, and any failure would take a week or more to fix
  • Pipeline changes were difficult to plan, manage, and execute
  • Pipelines constantly ran into resource constraints, causing bottlenecks in connected systems and delays to downstream processes and workflows
  • Data engineers were scared to delete unnecessary data because they didn’t know how it would affect unknown downstream dependencies
  • Data quality suffered as vital machine learning features often went to null and nobody noticed for weeks

The solution(s)

To solve these problems and alleviate the company’s stress, Airbnb formed a new data team. The data engineers on the team handled three main data reliability tasks: building new foundational tables from scratch, implementing regular quality checks for data models and pipelines, and deploying data reliability SLAs that helped identify issues faster so they could take corrective action more efficiently.

The result

  1. The data engineers redesigned the data lakes, stores, and pipelines.
  2. As they connected tools and applications, they got more visibility into how data scientists and other business data users were consuming their data. That empowered them to make meaningful changes to the data and tech stack while ensuring downstream users weren’t negatively impacted.
  3. Foundational data tables and schemas were changed to be more efficient from a storage and transfer perspective. That meant their machine learning applications had to be retrained to find, process, and deliver the correct data.

Final thought

Your data pipelines and data engineering teams were set up one way, but that doesn’t mean they’ll stay that way forever. As technology and approaches evolve, so can your data engineering practices.

These best practices are not unique to data engineering; in fact, they’re widely used in managing infrastructure and traditional software applications. They can help teams manage complexity at scale and be more efficient with their time.

All of the above can be implemented on Day One by the first data engineer. Or, like at Airbnb, teams can refine and optimize the data engineering function well into its maturity cycle. Either way, managing your data engineering work becomes easier and more efficient, and you help the rest of your company get the data it needs to push the business forward.

