Data in Practice: Data reliability tips from a former Airbnb data engineer
We spoke with Dzmitry Kishylau, a former member of Airbnb’s Trust and Safety team, to learn how they approached data reliability and get from-the-trenches tips.
In the "Data in Practice" series, we talk to real data engineers who have put data reliability engineering concepts into practice, learning from their challenges and successes in the real world.
Every growing company will eventually face an important existential question. That is, “how do we maintain data quality?”
Ignore this question, and it will rear its head at a later date, potentially with catastrophic consequences. Tackle it in good time, and you’ll reap the rewards: meaningful analysis, happy business teams, better bottom lines, and beyond. But it’s easier said than done.
Companies that want to use data and make data-driven decisions (spoiler alert: that’s everyone!) will need to take a good hard look in the mirror, and address the data quality question head on, ASAP.
If managed correctly, data observability systems have a clear ROI, because they provide a comprehensive view of what’s happening inside the tables and pipelines of your data stack. Analogous to observability in software engineering, data reliability comprises instrumentation, monitoring, and alerting. While data problems can’t be prevented entirely, Data Reliability Engineering provides a framework for detecting and addressing issues quickly and with a minimal blast radius.
It’s one thing to theorize about “data reliability,” but what does it actually look like in practice? What sorts of scenarios lead to data disasters, and how do real-life data teams mitigate them?
Let’s look at one example from one of Airbnb’s teams between 2019 and 2020.
Airbnb’s Trust and Safety team: Setting the context
The Trust and Safety team at Airbnb is responsible for everything from payments fraud to real-life incidents. They often build machine learning models to predict and detect problematic users or listings, before those issues have a chance to do severe damage.
“There was a lot of data, a lot of demand for accurate labels,” says Dzmitry Kishylau, a software engineer on the team in 2019, who eventually went on to lead it. The data in question was on the order of 1.5 petabytes at the time the team was running into reliability issues.
The Trust and Safety team’s data stack
The team’s data stack was mostly Hadoop/Presto with Airflow for orchestration. Data pipelines were written initially in Hive, and later in Spark.
In terms of foundational tables for modeling and analytics, those evolved organically over Airbnb’s existence. Data scientists appended information to tables as they saw fit, creating, in effect, a patchwork quilt of tables.
There was a table for payments fraud, another for fake listings. There was no overall master plan or through line drawing these tables together. In fact, some grew increasingly large and difficult to wrangle.
The Trust and Safety team bumps up against reliability issues and a coming IPO
By 2019 and 2020, the haphazard table setup was becoming impossible to maintain. Some of the pipelines took days to run by default; any pipeline breakage or failure would take a week to fix. This meant that all critical metrics were at least a day behind on accuracy.
In addition to data not landing in a timely fashion, data reliability problems in that time period included:
- Data pipeline changes were very difficult to make, for instance when engineers wanted to change the logic that detected whether a certain account had been taken over by fraudsters.
- Pipelines were constantly running into resource constraints.
- The cost of pipelines was extremely high.
- Engineers did not feel empowered to delete unnecessary data because they didn’t know what was used by unknown downstream dependencies.
- Many important tables lived in dev environments; as the company grew, there were thousands of tables in dev, and no one knew which ones could be deleted.
- Data quality issues went undetected, such as certain important machine learning features going to null without anyone noticing for a month.
Airbnb’s approaching IPO precipitated action at the company. In getting their ducks in a row, the team realized that the state of data was not up to “IPO level.” They leapt into action at the opportunity to declare data bankruptcy and start over.
The approach: A new data engineering team
Airbnb stood up a new data team within Trust and Safety that accomplished a few main tasks in data reliability. These tasks were:
1. Building new, correct foundational tables from scratch
For each table, the team created a detailed specification of what was represented and what each of the data fields meant. This document was reviewed and approved for each table.
Generally speaking, the new foundational tables were smaller than the old ones. To minimize the potential impact of late upstream dependencies, new tables were designed to have as few dependencies as possible.
2. Implementing data quality checks for data models and pipelines
The team implemented data quality checks on all inputs and outputs to data pipelines. These checks were written in Spark and Airflow, and were fairly basic. For example, they checked that tables feeding data contained some rows, and that certain data fields were never null.
Implementing even simplistic checks like these caught a surprising number of data reliability issues. For example, in one case, the team found a broken upstream dependency.
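To make the two kinds of checks described above concrete, here is a minimal sketch in plain Python. The team wrote its checks in Spark and Airflow; this example is not their code. The table name, column names, sample rows, and helper functions are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Sequence

Row = Dict[str, Any]


@dataclass
class CheckResult:
    name: str
    passed: bool
    detail: str


def check_not_empty(table_name: str, rows: Sequence[Row]) -> CheckResult:
    """Pass only if the table contains at least one row."""
    return CheckResult(
        name=f"{table_name}: has rows",
        passed=len(rows) > 0,
        detail=f"{len(rows)} rows",
    )


def check_never_null(table_name: str, rows: Sequence[Row], column: str) -> CheckResult:
    """Pass only if no row has a missing or None value in `column`."""
    null_count = sum(1 for row in rows if row.get(column) is None)
    return CheckResult(
        name=f"{table_name}.{column}: never null",
        passed=null_count == 0,
        detail=f"{null_count} null values out of {len(rows)} rows",
    )


def run_checks(table_name: str, rows: Sequence[Row], required_columns: Sequence[str]) -> List[CheckResult]:
    """Run the row-count check, then one null check per required column."""
    results = [check_not_empty(table_name, rows)]
    results.extend(check_never_null(table_name, rows, c) for c in required_columns)
    return results


# Hypothetical input table: one feature value has silently gone null.
payments = [
    {"payment_id": 1, "account_age_days": 120},
    {"payment_id": 2, "account_age_days": None},
]

results = run_checks("payments", payments, ["payment_id", "account_age_days"])
failed = [r for r in results if not r.passed]

for r in results:
    print("PASS" if r.passed else "FAIL", r.name, "-", r.detail)
```

In a real pipeline, a non-empty `failed` list would typically stop the downstream task or raise an alert, so that a nulled-out feature is caught on the day it happens rather than a month later.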
Not all data quality checks were automatable. Engineers occasionally conducted manual tests in production. When quality issues in production databases were passed along to data pipelines, data engineers would partner with engineering teams to deemphasize the culture of testing in production.
3. Driving data reliability SLAs
For each new foundational table, the data team set a freshness SLA specifying whether the data was expected to land 4 hours, 12 hours, or 24 hours after its generation. The team then tracked how many times in the past month or quarter the data landed late, with a goal of no more than 5% of the time.
These data freshness SLAs ensured that even in instances where the data was late or inaccurate, corrections were put in place to fix issues within 24 hours, instead of taking a week or a month, as they had prior to the SLAs.
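The freshness tracking described above comes down to simple arithmetic: count the runs that landed later than the SLA window, divide by the total number of runs, and compare the result to the 5% goal. The sketch below shows one way to do that in plain Python. The function names and the landing history are hypothetical, and it assumes each run records when the data was generated and when it landed.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

# Each entry is (generated_at, landed_at) for one pipeline run.
Landing = Tuple[datetime, datetime]


def late_fraction(landings: List[Landing], sla: timedelta) -> float:
    """Fraction of runs whose data landed more than `sla` after generation."""
    if not landings:
        return 0.0
    late = sum(1 for generated_at, landed_at in landings if landed_at - generated_at > sla)
    return late / len(landings)


def meets_target(landings: List[Landing], sla: timedelta, max_late_fraction: float = 0.05) -> bool:
    """True if late runs stay at or below the allowed fraction (5% by default)."""
    return late_fraction(landings, sla) <= max_late_fraction


# Hypothetical month of daily runs: data usually lands 3 hours after generation,
# but one run lands 15 hours after generation.
start = datetime(2020, 1, 1)
history: List[Landing] = []
for day in range(30):
    generated_at = start + timedelta(days=day)
    delay = timedelta(hours=15) if day == 7 else timedelta(hours=3)
    history.append((generated_at, generated_at + delay))

sla = timedelta(hours=12)
fraction = late_fraction(history, sla)
print(f"late runs: {fraction:.1%}")
print("within target:", meets_target(history, sla))
```

Passing the SLA window in as a parameter lets the same function cover the 4-hour, 12-hour, and 24-hour tiers without any change to the logic.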
Results and unexpected challenges
During the implementation phase of their new reliability strategy, the team ran into some unexpected challenges. Let’s look at what those were, and how they fixed them.
1. The manner in which data scientists like to consume tables differs from the best way to produce tables from a “software reliability perspective”
New foundational tables were often smaller than the original tables, averaging 5 columns versus the prior 500, and mapped cleanly to a smaller number of dependencies. While this may have been “better” from a software engineering perspective – easier to maintain and keep clean – it wasn’t necessarily better for the data scientists who were then consuming these tables.
Data scientists preferred to work with a single enormous table with all the columns. This issue was resolved by creating appropriate views in the data warehouse. These views joined the foundational tables under the hood and gave the data scientists the denormalized tables that they preferred to work with.
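As a rough illustration of the view-based approach, the sketch below uses Python's built-in sqlite3 module as a stand-in for the warehouse. Airbnb's stack was Hive and Presto, not SQLite. The table names, columns, and rows are hypothetical, and a real wide view would join many more foundational tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Two small, hypothetical foundational tables, each with its own narrow scope,
# plus a wide view that joins them so consumers can query one object.
conn.executescript(
    """
    CREATE TABLE payment_details (
        payment_id INTEGER PRIMARY KEY,
        amount_usd REAL NOT NULL,
        country    TEXT NOT NULL
    );
    CREATE TABLE payment_fraud_labels (
        payment_id INTEGER PRIMARY KEY,
        is_fraud   INTEGER NOT NULL
    );

    INSERT INTO payment_details VALUES (1, 25.0, 'US'), (2, 310.5, 'FR'), (3, 12.0, 'DE');
    INSERT INTO payment_fraud_labels VALUES (1, 0), (2, 1);

    CREATE VIEW payments_wide AS
    SELECT d.payment_id, d.amount_usd, d.country, l.is_fraud
    FROM payment_details AS d
    LEFT JOIN payment_fraud_labels AS l ON l.payment_id = d.payment_id;
    """
)

# Consumers query the view as if it were a single denormalized table.
rows = conn.execute(
    "SELECT payment_id, amount_usd, country, is_fraud FROM payments_wide ORDER BY payment_id"
).fetchall()

for row in rows:
    print(row)
```

The LEFT JOIN keeps payments that have no label yet, so the view exposes them with a null `is_fraud` rather than dropping them. Whether that is the right behavior depends on how the consumers use the label.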
2. Data definition changes in the tables forced the team to retrain downstream models.
Data labels like “fraudulent payment”, “non-friendly fraud”, and “friendly fraud” were sometimes redefined in the new tables. This issue required that machine learning models, which depended on these tables, be retrained.
Since his time at Airbnb, Dzmitry Kishylau has consulted for a number of companies using what’s called “the modern data stack,” and he believes that tools like dbt, Snowflake, and Bigeye would have made the data team’s work significantly easier. However, using the tools at hand, they were able to formulate a data reliability strategy that worked within their existing infrastructure and delivered measurable improvements in the data experience for teams all over Airbnb.