Deploying data observability: wide or deep?
Borrowing patterns from Site Reliability Engineering (SRE) and DevOps, data observability tools help data teams to understand the internal state and behavior of their data. With data observability, data teams can prevent problems that would impact how their organization uses data.
When rolling out a data observability platform like Bigeye, you’ll need to decide which data to monitor and how. In this post, we’ll take a look at two different strategies for deploying data observability across your data lake or warehouse: narrow-and-deep and wide-and-shallow.
How narrow-and-deep works
Narrow-and-deep deployments focus on tracking a narrow but important group of tables with as many data quality attributes as possible. This can include monitoring for characteristics like null or missing fields, malformatted strings, sudden distribution changes, outliers, and cross-column relationships.
Monitoring these attributes requires a metrics-based approach, because the data must be queried directly to observe them (as opposed to inferring them from database logs). Without automation, this metrics-based approach would be time-consuming to configure: a user would have to manually write and run a SQL query for each metric they wanted to monitor.
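To make the idea of "one query per metric" concrete, here is a minimal sketch of metrics-based monitoring in Python. It is an illustration under stated assumptions, not how Bigeye implements it: the `usage_events` table, its columns, and the helper names `null_rate` and `malformed_rate` are all hypothetical, and an in-memory SQLite database stands in for a real warehouse.

```python
import sqlite3

def null_rate(conn, table, column):
    """Fraction of rows where `column` is NULL, measured by querying the data directly."""
    # Identifiers are interpolated for brevity; only pass trusted table/column names.
    total, nulls = conn.execute(
        f"SELECT COUNT(*), SUM(CASE WHEN {column} IS NULL THEN 1 ELSE 0 END) FROM {table}"
    ).fetchone()
    return (nulls or 0) / total if total else 0.0

def malformed_rate(conn, table, column, like_pattern):
    """Fraction of non-NULL values that do not match a SQL LIKE pattern."""
    total, bad = conn.execute(
        f"SELECT COUNT({column}), "
        f"SUM(CASE WHEN {column} NOT LIKE ? THEN 1 ELSE 0 END) FROM {table}",
        (like_pattern,),
    ).fetchone()
    return (bad or 0) / total if total else 0.0

# Hypothetical sample data standing in for a warehouse table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE usage_events (user_email TEXT, plan TEXT)")
conn.executemany(
    "INSERT INTO usage_events VALUES (?, ?)",
    [
        ("a@example.com", "pro"),
        (None, "free"),
        ("not-an-email", "pro"),
        ("b@example.com", None),
    ],
)

# One query per metric: this is the manual work that automation removes.
metrics = {
    "user_email.null_rate": null_rate(conn, "usage_events", "user_email"),
    "plan.null_rate": null_rate(conn, "usage_events", "plan"),
    "user_email.malformed_rate": malformed_rate(
        conn, "usage_events", "user_email", "%_@_%"
    ),
}
print(metrics)
```

Each new attribute means another query, which is why this approach becomes tedious to maintain by hand as the number of columns and tables grows.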
An example of this approach is Impira. Impira leverages Bigeye to continually monitor customer usage data, which product management uses to power the company’s self-service go-to-market strategy. Because Impira makes updates and changes continually, deep monitoring is key to ensuring that one issue doesn’t balloon into many issues.
The biggest advantage of a narrow-and-deep strategy is reducing the number of expensive outages that slip past the observability system and affect business-critical applications – the kind that impact customers, confuse executive decision-making, result in fines, or otherwise hurt the business if they break.
Of all the ways to handle data quality, a data observability system is unique in its ability to detect not only the things you know could go wrong, but also the “unknown unknowns” that can be difficult to detect and impossible to anticipate. This level of detection, however, requires more extensive monitoring on the datasets that matter most to ensure that the time teams spend responding to alerts is high ROI for their organization.
Key benefits of a narrow-and-deep approach include:
- Efficacy: Higher chance to detect potential outages on the datasets that matter most, and lower chance of false alerts coming from datasets that aren’t important to the business.
- Accuracy: If implemented with metrics-based monitoring (as opposed to metadata-based), a narrow-and-deep approach always reflects exactly what data consumers will see when they query the table.
- Visibility: A narrow-and-deep approach implemented with metrics-based monitoring is completely extensible. Because it queries the data directly, it can monitor any attribute of the data that a data consumer would see themselves.
For most data teams, a narrow-and-deep approach will provide the best coverage for business-critical datasets and applications. There are, however, some potential drawbacks, which require planning and careful design optimizations.
- Speed: The point of the strategy is to focus monitoring on a subset of datasets, which will leave some data unmonitored (at least initially). This will require an understanding of what data, and applications, are most important to the business and may require conversations with some stakeholders.
- Cost: If data quality attributes are collected by querying the target data directly (i.e. metrics-based monitoring), this approach will produce higher compute costs than inferring them from logs (metadata-based monitoring).
Because narrow-and-deep boosts visibility on some datasets (at the expense of others), it works best in use cases where key data applications need to be kept defect-free, like executive dashboards and production machine learning models. A few example use cases are:
- Making sure high-priority analytics dashboards are fed with high-quality data.
- Keeping training datasets for in-production ML models safe.
- Ensuring data being sent or sold to partners or customers is defect-free and within any contractual SLAs.
How wide-and-shallow works
The wide-and-shallow strategy works by enabling basic monitoring for a few key attributes on most or all of your datasets. Freshness and volume are the most common attributes monitored this way because they can be easily estimated by parsing the logs stored in data warehouses like Snowflake.
This approach requires practically zero manual configuration because the parser simply reads the logs for each table to see when it was inserted into, and with how many records, and stores the results.
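As a rough sketch of what such a parser produces, the Python below derives freshness and volume per table from already-parsed log entries. Everything here is an assumption made for illustration: the `LoadEvent` record shape, the `summarize` and `stale_tables` helpers, and the sample events are hypothetical, and real warehouse log schemas differ by vendor.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class LoadEvent:
    # Hypothetical shape of one parsed warehouse log entry.
    table: str
    finished_at: datetime
    rows_inserted: int

def summarize(events, now):
    """Per-table freshness (time since last insert) and volume (rows inserted in the log window)."""
    latest = {}
    for e in events:
        s = latest.setdefault(e.table, {"last_load": e.finished_at, "rows": 0})
        s["last_load"] = max(s["last_load"], e.finished_at)
        s["rows"] += e.rows_inserted
    return {
        table: {"freshness": now - s["last_load"], "rows": s["rows"]}
        for table, s in latest.items()
    }

def stale_tables(summary, max_age):
    """Tables whose most recent insert is older than `max_age`."""
    return sorted(t for t, s in summary.items() if s["freshness"] > max_age)

# Hypothetical sample of parsed log entries.
now = datetime(2024, 1, 2, 12, 0)
events = [
    LoadEvent("orders", datetime(2024, 1, 2, 11, 0), 500),
    LoadEvent("orders", datetime(2024, 1, 2, 6, 0), 450),
    LoadEvent("users", datetime(2023, 12, 31, 9, 0), 20),
]

summary = summarize(events, now)
stale = stale_tables(summary, timedelta(hours=24))
print(summary)
print(stale)
```

Note that nothing in this sketch touches the table contents themselves, which is why the approach is cheap and needs no per-table configuration, and also why it cannot see problems inside the rows.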
An example of this approach is Udacity. Udacity’s data engineering team uses Bigeye to create a single place to understand the quality of their data pipelines, monitoring hundreds of datasets in a cloud-native data lake. With Bigeye in place, Udacity’s data engineering team can feel confident in the data they are providing to business analysts and data scientists.
The wide-and-shallow strategy is great at getting a lot of basic monitoring coverage put down rapidly and with minimal tax on the data source.
- Speed: Adds monitoring to a lot of datasets instantaneously, which can be great for quickly inspiring trust within the organization.
- Cost: If attributes are gathered using metadata, the performance impact on the data source is minimal, and little to no user configuration is required.
At first glance, wide-and-shallow monitoring might seem like the obvious choice. All of your data is important, right? In reality, most organizations heavily leverage a fraction of their data (between 5% and 20% based on conversations we’ve had with hundreds of data teams), with the rest being only occasionally used or not ready for production use cases. Deeply understanding the data that matters is usually more effective than getting a basic understanding of data that is less critical to the business.
Drilling in, most of the drawbacks are a result of the most common method used to achieve a wide-and-shallow deployment: metadata-based monitoring, which infers data quality attributes from logs.
- Efficacy: Regardless of the technical implementation, this approach increases the chance of your team getting alerts about relatively unimportant data and decreases the chance of catching a potential outage on an important dataset.
- Accuracy: Because it infers attributes from logs rather than interrogating the actual data, metadata-based monitoring is inherently less accurate. This inaccuracy can create both false alarms and missed outages.
- Visibility: If collected via metadata, the attributes available to track are limited to what the warehouse vendor exposes in their logs, and cannot be extended.
The most common use cases for wide-and-shallow deployments involve checking just the basic operational statuses across tons and tons of datasets at once, which may make sense if a data engineering or data ops team is mainly focused on keeping the infrastructure up and running, rather than ensuring any specific use case is working.
- Ingestion: Making sure hundreds of raw datasets are all landing into the warehouse on time.
- Redundancy: A backup monitoring system for ensuring dbt or Airflow jobs are running on time and producing rows.
Wide-and-shallow can certainly be the right choice for some teams, but we most often recommend starting with a narrow-and-deep deployment strategy, because of its focus on protecting critical data rather than casting a wide net.
That said, many customers start with narrow-and-deep and eventually add wide-and-shallow, or vice versa. Starting with one type of deployment doesn’t lock you out of the other, and many customers end up with a T-shaped deployment: narrow-and-deep monitoring on critical datasets and wide-and-shallow monitoring across less critical data.
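One way to picture a T-shaped deployment is as a simple mapping from each table to a set of monitored attributes. The sketch below is purely illustrative: the table names, the attribute labels, and the `plan_monitoring` helper are hypothetical and do not reflect any particular tool's configuration format.

```python
# Basic attributes applied everywhere (the wide, shallow bar of the T).
SHALLOW = ["freshness", "volume"]
# Extra attributes applied only to critical tables (the narrow, deep stem of the T).
DEEP = SHALLOW + ["null_rate", "string_format", "distribution_shift", "outliers"]

def plan_monitoring(tables, critical):
    """Assign deep monitoring to critical tables and shallow monitoring to the rest."""
    return {t: (DEEP if t in critical else SHALLOW) for t in tables}

# Hypothetical inventory: one business-critical table among raw datasets.
plan = plan_monitoring(
    ["raw_clicks", "raw_orders", "exec_dashboard_facts"],
    critical={"exec_dashboard_facts"},
)
for table, attributes in plan.items():
    print(table, attributes)
```

Moving a table from one tier to the other is then a matter of adding it to or removing it from the critical set, which mirrors how teams can start with either strategy and grow into the other.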
Where you start depends on your initial goals for data observability: Do you want to ensure that key applications are error-free? Or is a broad view of just basic freshness and volume more important?
Whichever deployment strategy is right for you, Bigeye helps you get great monitoring coverage and prevent data outages.