Why data observability?
Businesses want to use data in exciting, high-risk-high-reward ways that help them deliver the best products and services. Slow manual processes can be automated with machine learning, critical decisions can be made quickly with self-service analytics, and data can be shared or sold to help partners.
Within a business, data teams want to ship quick changes to their data model without negatively impacting their end users. New sources of data need to be onboarded, new transformations written, existing transformations extended, and (not often enough!) useless data end-of-life’d.
But every change in the data pipeline risks breaking something and impacting end users like data scientists, analysts, executives, and the business’ customers. It’s hard to mitigate that risk when you have blind spots inside your data model: you can’t see how your data (including downstream dependencies) is changing, or who those changes will impact.
Data observability solves this by creating visibility over the entire data model, enabling data teams to detect upstream changes, predict how changes in one part of the model will cascade downstream, and trace problems upstream to their sources. Data observability is used at companies like Uber, Airbnb, and Netflix to cover that blind spot, and to help achieve these goals:
Data reliable enough to use in high-risk/reward applications
Faster, safer, easier changes to the data model + infra
Better use of time for expensive data engineering teams
Org-wide confidence in the production data
How does it do that?
Data Observability tools — like their counterparts in DevOps — help data engineering teams achieve these goals by providing continuous context about the data model’s behavior.
These tools allow data engineers to understand what’s happening inside any table in their data stores, when they’re being written to or read from, and where they come from and where they end up going. Data observability tools give a bird’s eye view of the entire data model, supplying both the foresight to predict and an early-warning-system to detect changes to the data (e.g. changing a schema) or infrastructure (e.g. reducing how often an ELT job runs) that will impact end users.
Data observability is ultimately built on three core blocks that provide the raw information to create context for data engineers: metrics, metadata, and lineage.
(As Jean Yang from Akita points out in her blog on DevOps observability, the ultimate goal of an observability system should be a higher level understanding of what’s happening in the system—not limited to the exposure of metrics, logs, and traces—but without these underlying raw inputs, that level of abstraction can’t be built.)
Metrics are numeric results that come from directly querying the dataset. Examples would be: the table’s row count, the average value of a numeric column, or the number of duplicate user_uuids. Metrics should seem familiar to anyone who has run a SQL query; they’re what you get back from any query with aggregated results. In an observability context, the goal of the metric is to quantify what’s happening inside the dataset, rather than to answer a business question.
Metrics help answer questions about the internal state of each table across the data model like:
Count of duplicate user_uuids: are we recording duplicate user records?
Percent valid formatted user_emails: are we recording valid email addresses for our users?
Skew of transaction_amount: did the distribution of our payments suddenly shift?
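As a rough illustration, the three example metrics above could be computed along these lines. This is a minimal sketch, not how any particular tool does it: the `users` table, its column layout, and the `collect_metrics` helper are all assumptions made for the example, and the email check is a loose shape test rather than real validation.

```python
import sqlite3


def collect_metrics(conn):
    """Compute three example metrics for a hypothetical `users` table."""
    cur = conn.cursor()

    # Non-null user_uuid rows beyond the first occurrence of each value.
    (duplicate_count,) = cur.execute(
        "SELECT COUNT(user_uuid) - COUNT(DISTINCT user_uuid) FROM users"
    ).fetchone()

    # Share of rows whose email has the rough shape something@something.tld.
    (percent_valid,) = cur.execute(
        "SELECT 100.0 * SUM(user_email LIKE '%_@_%._%') / COUNT(*) FROM users"
    ).fetchone()

    # Skewness (third standardized moment) of transaction_amount.
    amounts = [
        row[0]
        for row in cur.execute(
            "SELECT transaction_amount FROM users "
            "WHERE transaction_amount IS NOT NULL"
        )
    ]
    skew = None
    if amounts:
        mean = sum(amounts) / len(amounts)
        m2 = sum((x - mean) ** 2 for x in amounts) / len(amounts)
        m3 = sum((x - mean) ** 3 for x in amounts) / len(amounts)
        skew = m3 / m2 ** 1.5 if m2 > 0 else 0.0

    return {
        "duplicate_user_uuids": duplicate_count,
        "percent_valid_user_emails": percent_valid,
        "transaction_amount_skew": skew,
    }
```

Running something like this on a schedule and storing the results gives a time series per metric, which is what makes a sudden shift visible.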
Because the metrics describe the behavior of what’s inside the table, collecting them requires an understanding of the table being observed. Each table will need a unique set of metrics to accurately describe its behavior. This can be a big barrier to instrumentation (imagine hand-picking the right set of metrics for 100+ tables with 50–75 columns each), and something that Data Observability tools seek to automate with techniques like data profiling and their own secret sauce of picking the right metrics.
(Author’s plug: at Bigeye we put a lot of work into getting this part right because of how critical it is to reducing uncaught outages. We built over 70 metrics into Bigeye and our recommender can infer semantic concepts from the data profile results like UUIDs, ZIP codes, etc. to pick the right metrics for each table)
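To make the idea of profile-driven metric selection concrete, here is a deliberately simple sketch. It is not Bigeye's recommender or any vendor's actual method: the patterns, the 90% threshold, the metric names, and the `infer_concept` and `recommend_metrics` helpers are all hypothetical, chosen only to show the shape of the technique (sample a column, guess its semantic type, map that type to a set of metrics).

```python
import re

# Hypothetical patterns for a few semantic concepts.
PATTERNS = {
    "uuid": re.compile(
        r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}",
        re.IGNORECASE,
    ),
    "email": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
    "zip_code": re.compile(r"\d{5}(-\d{4})?"),
}

# Hypothetical mapping from concept to metric names.
SUGGESTED = {
    "uuid": ["null_rate", "duplicate_count", "percent_valid_uuid"],
    "email": ["null_rate", "percent_valid_email"],
    "zip_code": ["null_rate", "percent_valid_zip_code"],
    "numeric": ["null_rate", "mean", "min", "max", "skew"],
    "unknown": ["null_rate", "distinct_count"],
}


def infer_concept(sample, threshold=0.9):
    """Guess what a column holds from a sample of its values."""
    values = [v for v in sample if v is not None]
    if not values:
        return "unknown"
    if all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in values):
        return "numeric"
    texts = [str(v) for v in values]
    for concept, pattern in PATTERNS.items():
        hits = sum(1 for t in texts if pattern.fullmatch(t))
        if hits / len(texts) >= threshold:
            return concept
    return "unknown"


def recommend_metrics(profile):
    """Map each column name to suggested metrics, given sampled values."""
    return {col: SUGGESTED[infer_concept(sample)] for col, sample in profile.items()}
```

One caveat of this sketch: ZIP codes stored as integers would be classified as plain numeric columns, which is exactly the kind of edge case a real profiler has to handle more carefully.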
Metadata contains (but isn’t limited to!) information about physical data or related concepts like ELT jobs. Examples would be: the log of all queries run against a given Snowflake table, or the logs produced by an Airflow job. Instead of telling you anything about the data itself, metadata can tell you what’s being done to the data by the infrastructure, e.g. running an INSERT to append new rows to a table, or an Airflow job that failed and didn’t write anything to its intended destination.
Metadata answers questions about what’s happening TO the data, rather than about what’s IN the data. Questions like:
How long has it been since this table was written to?
How many rows were inserted when that happened?
How long did this ELT job take to run?
Collecting metadata is a bit simpler than collecting metrics (which have to be configured uniquely for every table) or lineage (which has to be merged together from multiple sources). Some tools like Snowflake make this metadata queryable, just like any other table, and Fivetran dumps metadata into the destination schema where it can be similarly queried. All that’s needed after that is a little parsing to pull out the relevant statistics from the logged queries, and you have tracking on time-since-write, rows-inserted, job-run-duration, etc.
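The "little parsing" step might look something like the sketch below. It assumes the query log has already been loaded into a list of dictionaries; the field names (`table`, `query_type`, `rows_affected`, `start_time`, `end_time`) and the `summarize_writes` helper are hypothetical and do not correspond to any specific warehouse's log schema.

```python
from datetime import datetime, timedelta

# Query types treated as writes in this sketch.
WRITE_TYPES = {"INSERT", "UPDATE", "DELETE", "MERGE", "COPY"}


def summarize_writes(query_log, now):
    """Return freshness and volume stats for the latest write to each table."""
    latest = {}
    for entry in query_log:
        if entry["query_type"] not in WRITE_TYPES:
            continue  # reads tell us nothing about freshness
        prev = latest.get(entry["table"])
        if prev is None or entry["end_time"] > prev["end_time"]:
            latest[entry["table"]] = entry

    return {
        table: {
            "time_since_write": now - e["end_time"],
            "rows_inserted": e["rows_affected"],
            "run_duration": e["end_time"] - e["start_time"],
        }
        for table, e in latest.items()
    }
```

Passing `now` in explicitly, rather than calling `datetime.now()` inside the function, keeps the calculation reproducible when the same log is replayed later.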
Lineage is the term we’ve collectively chosen in the DataOps-space instead of “traces” in DevOps-space. It’s the path the data took from creation, through any databases and transformation jobs, all the way down to final destinations like analytics dashboards or ML feature stores. While lineage is often constructed by parsing logs — though this isn’t the only way to construct it — it stands on its own as a concept because of the role it plays in understanding the behavior of the overall data model, showing how data flows, and where both problems and improvements will eventually have impacts.
Lineage helps to answer questions about where something happened, or where something is going to end up having an impact:
If I change the schema of this table, what other tables will start having problems?
If I see a problem in this table, how do I know whether it flowed here due to a problem elsewhere?
If I have an accuracy issue in this table, who’s looking at the dashboards that ultimately depend on it?
Lineage is most often collected by parsing the logs of queries that write into each table. By modeling what’s happening inside the query, you can see which tables are being read from, and which tables are being written into. But lineage can (and should) go further than just table-to-table or column-to-column relationships. Companies like Airbnb and Uber have been modeling lineage all the way upstream to the source database or Kafka topic, and all the way downstream to the user level, so they can communicate data problems or changes all the way up to the relevant humans.
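A toy version of table-level lineage extraction is sketched below. It uses regular expressions, which is a big simplification: a real implementation would need a proper SQL parser to handle subqueries, CTEs, quoting, and dialect differences. The function names `build_lineage` and `impacted_by` are made up for this example. The first builds source-to-target edges from logged write queries, and the second walks the graph to answer "if I change this table, what else is affected?"

```python
import re
from collections import defaultdict, deque

# Table being written by INSERT INTO ... or CREATE [OR REPLACE] TABLE ...
TARGET_RE = re.compile(
    r"\b(?:INSERT\s+INTO|CREATE\s+(?:OR\s+REPLACE\s+)?TABLE)\s+([\w.]+)",
    re.IGNORECASE,
)
# Tables being read via FROM or JOIN (subqueries are not handled).
SOURCE_RE = re.compile(r"\b(?:FROM|JOIN)\s+([\w.]+)", re.IGNORECASE)


def build_lineage(queries):
    """Map each source table to the set of tables written from it."""
    downstream = defaultdict(set)
    for sql in queries:
        target = TARGET_RE.search(sql)
        if not target:
            continue  # read-only query: no lineage edge
        for source in SOURCE_RE.findall(sql):
            if source != target.group(1):
                downstream[source].add(target.group(1))
    return downstream


def impacted_by(downstream, table):
    """Return every table reachable downstream of `table` (breadth-first)."""
    seen, queue = set(), deque([table])
    while queue:
        for child in downstream.get(queue.popleft(), ()):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen
```

Reversing the edges gives the upstream view, which is what you would walk to trace a problem back toward its source.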
Merging these three sources of information tells the data operator: what’s going on INSIDE my tables and is that changing over time, what’s happening TO my tables via queries and ELT jobs, and what’s the relationship BETWEEN my tables (and other concepts like users) and how will problems or changes flow across the graph. There are other aspects that have to be built right for a data observability tool to be useful — good monitoring and alerting interfaces, for instance — but these three building blocks are the primary sources of information that enable everything else.
Data operators equipped with observability products can use the combined views of this information to understand the state of their complete data model, ship improvements faster, and deliver more reliable data to their end users. As the field of data engineering continues to evolve, I predict we’ll see common nomenclature and awareness form around these concepts, much as they did in DevOps.
P.S. - If you're interested in our lineage product, which is currently in Beta, get in touch and take it for a spin!