Anomaly detection part 1: The key to effective data observability
Anomaly detection is critical to effective data observability. In this blog, Henry Li, senior data scientist at Bigeye, dives into three of the most important aspects of anomaly detection. In part 2 of this series, Henry will explain how the Bigeye data observability platform addresses each aspect.
Data teams are often stuck between a rock and a hard place. On one side, data platforms, even at young companies, are growing at speeds that would have been unheard of just a few years ago. On the other, data is being used in increasingly critical applications — the kind that cost money and customers when they break. As a result, data teams are under greater pressure to move quickly while keeping the data fresh, high quality, and reliable.
Data observability relieves the pressure created by those competing needs. It helps data teams make changes within their data platforms quickly while still feeling confident: they’ll find out fast if something breaks, and they can stop pipelines, roll back changes, or take other action before anything downstream is impacted. Key to this confidence are accurate alerts that fire as soon as the breakage occurs.
Unless data engineers want to spend time hand-tuning hundreds of rules or thresholds, they’ll likely be using anomaly detection to get those alerts.
Thankfully, data observability has become more accessible than ever. Before joining Bigeye, I led the development of Uber’s Data Quality Monitor (DQM). DQM was an important part of Uber’s overall data observability effort; it took a team of data scientists and engineers 12 months to develop and still requires ongoing investment to maintain. Now there is a growing selection of data observability tools available as off-the-shelf SaaS products. These tools help data teams leapfrog the build-it-yourself process and reach a practical level of data observability in a fraction of the time and at a fraction of the cost.
But when evaluating a data observability platform, buyers should beware: anomaly detection isn’t a simple box to check on a feature list. Quality anomaly detection is the difference between an effective data observability platform and a giant noise machine.
Evaluating the efficacy of an anomaly detection system comes down to a few key areas:
- Accurate detection: Accurate anomaly detection is fundamental to effective data observability. It ensures that data teams aren’t overwhelmed by false positives, and keeps them from scrambling to figure out what went wrong when something inevitably does. Detection quality is typically measured by the false positive and false negative rates (a short sketch after this list shows how they are computed). The two always trade off to some degree, but detection systems that perform well on both deliver better results and help prevent alert blindness.
- Intelligent adaptation: Your business is dynamic and so is the behavior of the data that your business is producing. Anomaly detection that can identify and adapt to these pattern changes removes the need for the data team to continually anticipate pattern shifts in the business and manually tune the data observability system.
- Continual improvement: Out of the box, no data observability system is going to completely understand your business, or what your team cares about. In healthcare, a 5x jump in null IDs could lead to a SEV1 situation, but in a machine learning pipeline, it might be a SEV5. Anomaly detection should learn from your team’s feedback and become more accurate over time, through capabilities like reinforcement learning and the intelligent treatment of bad metric values.
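To make that trade-off concrete, here is a minimal Python sketch of how a team might score a detector’s alert history against confirmed incidents. The function name and the labeled data are invented for illustration, not part of any particular platform; the two rates it returns are the numbers worth comparing across systems.

```python
def alert_quality(alerts: list[bool], incidents: list[bool]) -> dict[str, float]:
    """Compute false positive and false negative rates from a labeled history.

    alerts[i] is True if the detector fired on check i; incidents[i] is True
    if check i was later confirmed to be a real data issue.
    """
    false_pos = sum(a and not i for a, i in zip(alerts, incidents))
    false_neg = sum(i and not a for a, i in zip(alerts, incidents))
    negatives = sum(not i for i in incidents) or 1   # avoid divide-by-zero
    positives = sum(incidents) or 1
    return {
        "false_positive_rate": false_pos / negatives,
        "false_negative_rate": false_neg / positives,
    }

# Ten checks: the detector fired three times, only two were real issues,
# and it missed one real issue entirely.
alerts    = [True, False, True, False, False, True, False, False, False, False]
incidents = [True, False, False, False, False, True, True, False, False, False]
print(alert_quality(alerts, incidents))
# {'false_positive_rate': 0.14..., 'false_negative_rate': 0.33...}
```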
In this blog, I’m going to cover these three important aspects of anomaly detection and provide examples to explain why each is critical for effective data observability — and so difficult to solve well.
Accurate Detection
While most data observability systems use time series forecasting for anomaly detection, there is a science to finding the best forecasts to use for each potential data issue. While there is plenty of existing research on forecasting performance, none of the widely available models are purpose-built for detecting anomalies in the metric patterns found in data observability applications. If the anomaly detector can’t accurately find issues, data teams will either be overwhelmed by false positives or miss critical issues — or both.
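To ground the idea, here is a deliberately naive sketch of forecast-based detection, assuming daily row counts as the monitored metric. The trailing-mean “forecast” and z-score band are stand-ins for the purpose-built models a real system would use; every name and threshold below is illustrative.

```python
import numpy as np

def detect_anomaly(history: np.ndarray, new_value: float, z_threshold: float = 3.0) -> bool:
    """Flag new_value as anomalous if it falls outside a band derived from a
    naive forecast of the recent history.

    A sketch only: the "forecast" is just the trailing mean, and the band is a
    z-score on past variation.
    """
    forecast = history.mean()                 # naive point forecast
    spread = max(history.std(ddof=1), 1e-9)   # variation in past values
    return bool(abs(new_value - forecast) / spread > z_threshold)

# Daily row counts for a table, with today's load suspiciously low.
row_counts = np.array([10_120, 10_340, 9_980, 10_200, 10_450, 10_310, 10_280])
print(detect_anomaly(row_counts, new_value=6_500))  # True -> raise an alert
```

The gap between this sketch and a production detector (handling seasonality, trends, sparse history, and metric-specific patterns) is exactly where accuracy is won or lost.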
One of the most obvious symptoms of poor anomaly detection is a flood of false positive alerts. When inundated with alerts that don’t actually need attention, the data team quickly succumbs to alert blindness. At that point, alerts that do need attention go unnoticed, and the data observability system ceases to be the safety net it should be.
Poor forecasting choices can also lead to missed issues (false negatives). To illustrate this point, let’s take a look at one of the most sinister problems in data quality: slow degradation.
Slow degradation
Slow degradation refers to an issue in the data pipeline that appears to be low severity at first but snowballs over time, resulting in huge costs. For example, let’s say that two months ago a new process started inserting a small number of duplicate records into a table. If the root cause analysis window is only a couple of weeks long, the problematic values won’t be caught and removed; the data team simply isn’t looking back far enough. Unknowingly, these degraded data batches get used in downstream processes, such as an artificial intelligence model pipeline, where the bad values persist and ruin even the best-performing algorithm.
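A rough sketch of why the lookback window matters: the hypothetical check below compares the recent duplicate rate against a baseline taken from the oldest retained history. A detector that only keeps a couple of weeks of history would treat the already-degraded values as normal and never fire. The numbers and thresholds are invented for illustration.

```python
import pandas as pd

def slow_degradation_alert(daily_dup_rate: pd.Series,
                           baseline_days: int = 30,
                           recent_days: int = 14,
                           ratio_threshold: float = 2.0) -> bool:
    """Compare the recent duplicate rate against a baseline taken from the
    oldest part of the retained history. A detector that only keeps a couple
    of weeks of history has no healthy baseline left to compare against.
    """
    baseline = daily_dup_rate.iloc[:baseline_days].median()
    recent = daily_dup_rate.iloc[-recent_days:].median()
    return bool(recent > ratio_threshold * max(baseline, 1e-9))

# Duplicate rate creeps upward over three months -- small day to day, but far
# above the long-run baseline by the end.
rates = pd.Series([0.001 + 0.00015 * day for day in range(90)])
print(slow_degradation_alert(rates))  # True
```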
Once in the system, slow degradation issues are difficult to root cause. The issue persists, accumulating cost and hampering reliability and efficiency efforts. Even when teams look further into the past for problematic code and configuration changes, confounding factors that influence data quality make it hard to pinpoint the real issue.
Intelligent Adaptation
Accurate detection is important for alerting data teams to real issues without overwhelming them with false positives, but the anomaly detector must also adapt to valid changes in the business and in the context of the data.
Without intelligent adaptation, data teams are stuck in a cycle of manual tuning every time there is a change to the business. Over hundreds or thousands of tables, manual tuning quickly eats up engineering time and takes focus away from growing and improving the data platform. Let’s take a look at one common example of intelligent adaptation: dynamic alert thresholds.
Dynamic thresholds
Alert threshold boundaries are important, and sometimes difficult, to get right. When thresholds are too narrow, data teams get overwhelmed with alerts, which leads to alert blindness. When thresholds are too broad, data teams won’t be alerted to issues that need attention. This is a tougher problem than it seems on the surface because you can’t simply set thresholds once and be done with it: thresholds need to adapt to changes in business behavior that show up as seasonality, trends, and pattern shifts.
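One simplified way to picture dynamic thresholds: recompute the alert band from a rolling window after removing a weekly pattern, so the band follows trend and seasonality on its own. The sketch below is illustrative only (the whole-series weekday median even peeks at future data); production systems use proper forecasting models, but the adaptive idea is the same.

```python
import numpy as np
import pandas as pd

def dynamic_thresholds(metric: pd.Series, window: int = 28, width: float = 3.0) -> pd.DataFrame:
    """Recompute alert bounds from a rolling window so they follow trend and
    weekly seasonality instead of staying where they were hand-tuned on day one.
    """
    # Remove the weekly pattern first so weekends don't look anomalous.
    weekday_effect = metric.groupby(metric.index.dayofweek).transform("median")
    deseasonalized = metric - weekday_effect

    # Rolling center and spread adapt as the underlying trend shifts.
    center = deseasonalized.rolling(window, min_periods=7).median()
    spread = deseasonalized.rolling(window, min_periods=7).std()

    return pd.DataFrame({
        "lower": weekday_effect + center - width * spread,
        "upper": weekday_effect + center + width * spread,
    })

# Daily row counts with a weekend bump and some noise; a new observation
# outside [lower, upper] would raise an alert.
idx = pd.date_range("2023-01-01", periods=120, freq="D")
rng = np.random.default_rng(0)
values = 10_000 + 2_000 * (idx.dayofweek >= 5) + rng.normal(0, 150, len(idx))
bounds = dynamic_thresholds(pd.Series(values, index=idx))
```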
Continual Improvement
Out of the box, no data observability platform can completely understand the nuances of your business behavior. An effective anomaly detector needs to be able to learn from your team’s feedback and improve over time. I’m going to discuss two features that assist with this process: reinforcement learning and anomaly exclusion.
Reinforcement learning
Reinforcement learning collects input from the data team to help fine-tune detection and alerting. Perhaps the data team is more interested in knowing about extreme changes to data batches than small fluctuations. Or perhaps the team wants to know about any fluctuation in the machine learning feature store so that downstream automations run with the most consistent inputs. Reinforcement learning can help reduce false positives and make the data observability system more effective for the way the data team works.
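As a toy illustration of the feedback loop (not Bigeye’s implementation), the sketch below widens the alert band after false positives and tightens it after confirmed anomalies. Real systems learn richer models from feedback, but the principle is the same: the team’s responses steer the detector’s sensitivity.

```python
class FeedbackTunedThreshold:
    """Toy feedback loop: widen the alert band after false positives and
    tighten it after confirmed anomalies. The band width (in standard
    deviations) is the knob the team's feedback keeps nudging.
    """

    def __init__(self, width: float = 3.0, step: float = 0.1,
                 min_width: float = 1.0, max_width: float = 6.0):
        self.width = width          # current band width, in standard deviations
        self.step = step            # how much each piece of feedback moves it
        self.min_width = min_width
        self.max_width = max_width

    def record_feedback(self, was_true_anomaly: bool) -> None:
        if was_true_anomaly:        # real issue caught -> allow a tighter band
            self.width = max(self.min_width, self.width * (1 - self.step))
        else:                       # false positive -> widen the band to cut noise
            self.width = min(self.max_width, self.width * (1 + self.step))

# Two noisy alerts followed by one real incident leave the band slightly wider.
tuner = FeedbackTunedThreshold()
for verdict in [False, False, True]:
    tuner.record_feedback(verdict)
print(round(tuner.width, 2))  # 3.27 -- wider than the 3.0 it started at
```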
Anomaly exclusion
The way the anomaly detection system treats bad values also matters for continual improvement. When a data metric changes drastically, say during an incident in which its values drop significantly, the goal is to resolve the issue and return the metric to normal. The values from the start of the incident through the recovery are not representative of the healthy state of the data infrastructure.
The detection system needs to decide intelligently whether to exclude or include these values, or it risks becoming tuned to an abnormal state. When complex trends and seasonality are added into the mix, it is often not obvious how to handle potentially bad metric values. Handled effectively, though, these bad values add another layer of intelligence to the anomaly detector and tune the data observability system to your business and workflows.
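A minimal sketch of the exclusion idea, assuming the detector keeps a per-point anomaly flag: drop the flagged points plus a short recovery window before refitting, so the baseline reflects the healthy state. The names and the padding length are illustrative.

```python
import pandas as pd

def healthy_history(metric: pd.Series, anomaly_flags: pd.Series, pad: int = 3) -> pd.Series:
    """Drop anomalous points plus a short recovery window after each one, so
    the next model fit reflects the healthy state rather than the incident.

    anomaly_flags marks the points the detector fired on (or the team labeled
    as bad); pad is how many subsequent points to treat as recovery.
    """
    # Extend each flag forward to cover the recovery period after the incident.
    exclude = anomaly_flags.astype(int).rolling(pad + 1, min_periods=1).max().astype(bool)
    return metric[~exclude]

# An incident at index 3: the drop and its recovery are excluded from the
# history used to retrain the detector.
values = pd.Series([100, 101, 99, 10, 15, 60, 98, 100])
flags = pd.Series([False, False, False, True, False, False, False, False])
print(healthy_history(values, flags).tolist())  # [100, 101, 99, 100]
```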
The Future of Data Observability
While data observability is relatively new to many data teams, a handful of data engineering powerhouses have built data observability systems in-house, including LinkedIn, Netflix, and Uber. Those companies collectively spent several years and no shortage of resources getting their anomaly detection right. They understand clearly that anomaly detection is core to effective data observability and that the difference between an 80% and a 95% solution is massive in terms of tangible results.
When evaluating a data observability platform — whether to improve trust for self-service analytics, protect pipelines for machine learning, or validate third-party data — it's important to consider the strength of the anomaly detection under the hood.
In my next blog, I’ll dive into some of the details of how I’m designing the anomaly detection strategy at Bigeye, based on my previous experience at Uber.