Anomaly Detection Part 2: The Bigeye Approach
This is the second blog in a series on anomaly detection from Henry Li, Senior Data Scientist at Bigeye. The first blog dove into why quality anomaly detection is critical for effective data observability. In this blog, Henry will describe Bigeye’s approach to anomaly detection.
Anomaly detection is core to detecting data issues and ensuring data pipeline reliability with data observability. In my previous blog, I discussed three important aspects of effective anomaly detection:
- Accurate detection: Accurate anomaly detection is fundamental to effective data observability. It ensures that data teams aren’t overwhelmed by false positives, and keeps them from scrambling to figure out what went wrong when there is an issue.
- Intelligent adaptation: Your business is dynamic and so is the data it’s consuming and producing. Intelligent adaptation eliminates the need for the data team to constantly and manually tune the data observability system to business changes.
- Continual improvement: No data observability system is going to completely understand your team’s preferences out of the box. Anomaly detection should learn from user feedback to become more accurate over time, using techniques like reinforcement learning and properly handling data from outage periods.
While I had the opportunity to work on these problems at a very large scale at Uber, here at Bigeye I face a new challenge: solving them for customers of different sizes and in different industries. Our anomaly detection strategy must work for large companies like Instacart and for smaller, fast-moving teams like Clubhouse. In this post, I’ll walk through the three areas that are critical to great anomaly detection and explain how we are designing our anomaly detection strategy to address them.
Most data observability systems leverage time series forecasting for anomaly detection: metrics are collected from the data periodically, forming a set of time series, and large deviations from the forecasts are flagged as anomalies, which may trigger an alert depending on the alerting tolerance.
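To make the forecast-based idea concrete, here is a minimal sketch (not Bigeye’s actual implementation, and assuming a simple rolling-mean forecaster): each point is forecast from recent history, and values that deviate from the forecast by more than a few standard deviations are flagged.

```python
# Minimal forecast-based anomaly detection: forecast each point with a
# rolling mean and flag values deviating by more than k standard deviations.
from statistics import mean, stdev

def detect_anomalies(series, window=5, k=3.0):
    """Return indices whose value deviates from the rolling forecast by > k sigma."""
    anomalies = []
    for i in range(window, len(series)):
        history = series[i - window:i]
        forecast = mean(history)          # naive one-step-ahead forecast
        spread = stdev(history) or 1e-9   # guard against flat history
        if abs(series[i] - forecast) > k * spread:
            anomalies.append(i)
    return anomalies

# A mostly flat metric with one large spike: only the spike is flagged.
metric = [100, 101, 99, 100, 102, 100, 250, 101, 100]
print(detect_anomalies(metric))  # [6]
```

Real systems swap the rolling mean for richer forecasters (seasonal, trend-aware) and tune `k` per metric, but the flag-on-deviation structure is the same.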
At Bigeye, we have invested heavily into our machine learning models to improve and tune what they can detect. We now train thousands of models at scale for our customers, each consuming over a hundred features, and use both forecast and non-forecast models to pick up different patterns present in the metrics.
Working with numerous customers has helped improve our models. Our anomaly detection algorithms have detected a wide range of issues for our existing customer base, which in turn has grown the repertoire of data quality issues we can detect out of the box. Each new customer improves our ability to help the next identify data issues on day one. We’ve also notably improved performance compared to conventional forecast-only anomaly detection techniques, reducing false-positive alerts and helping to prevent alert fatigue.
To allow our anomaly detection to adapt to a wide range of metric patterns, like a smooth weekly-seasonal sinusoid or a three-hourly stair-stepping pattern, we have built a custom framework that continuously re-evaluates and selects from among our library of models. This method works not only for identifying each pattern but also for detecting pattern changes over time.
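The selection framework itself is proprietary, but the core idea can be sketched with two hypothetical candidate forecasters: backtest each candidate on a held-out tail of the series and keep whichever has the lowest error. Re-running this on a schedule lets the chosen model track pattern changes over time.

```python
# Toy model selection by backtesting: pick the candidate forecaster with the
# lowest mean absolute error on the most recent points.
def naive_last(history):
    return history[-1]                      # predict "same as last point"

def seasonal_weekly(history, period=7):
    return history[-period]                 # predict "same as one week ago"

def select_model(series, candidates, holdout=7):
    """Backtest each model on the last `holdout` points; return the best one."""
    best_model, best_err = None, float("inf")
    for model in candidates:
        errs = [abs(model(series[:i]) - series[i])
                for i in range(len(series) - holdout, len(series))]
        mae = sum(errs) / len(errs)
        if mae < best_err:
            best_model, best_err = model, mae
    return best_model

# A weekly-seasonal series: the seasonal model wins the backtest.
series = [10, 20, 30, 40, 50, 60, 70] * 4
best = select_model(series, [naive_last, seasonal_weekly])
print(best.__name__)  # seasonal_weekly
```

If the series later loses its seasonality, the next scheduled backtest would favor a different candidate, which is how re-evaluation handles pattern changes.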
To see how this approach works in action, let’s revisit one of the most difficult issues for anomaly detection: slow degradation.
Slow degradation refers to an issue in the data pipeline that appears to be low severity at first but snowballs over time. Slow degradation can be especially sinister for data teams, resulting in the terrible realization that for weeks or months an undetected issue has been eroding data pipelines.
I have yet to see another data observability platform that can successfully detect slow degradation issues. I suspect this is due to the fundamental problem with detecting these types of issues: the longer they go undetected, the more difficult root-cause analysis becomes. It's like walking across a glacier and coming upon a gaping crevasse. While you can’t miss the crevasse now, its origin was a tiny crack that was easily missed somewhere along the way. Slipping into a crevasse is a vastly different experience than being inconvenienced by a crack in the ice.
To accurately capture slow degradation, we configure a training schedule for our anomaly detection models so that new metric values are not immediately fed into the modeling pipeline. In other words, a learning rate (the pace of training) is established to ensure that past data patterns are intelligently incorporated into our Autothresholds. As a result, significant local trends trigger relevant alerts. With those alerts logged, Bigeye makes it easier to perform root cause analysis and identify the subtle culprits of slow degradation.
Figure 1. An example data metric with values slowly creeping up. A slower learning rate is able to help propagate historical patterns (mostly constant over time) to alert on a local trend that may be hard to identify with conventional forecasting methodologies.
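Why a slower learning rate exposes slow degradation can be illustrated with a toy baseline model (this is an illustration, not Bigeye’s Autothresholds): a baseline updated with a small learning rate retains the historical level, so a gentle upward creep eventually drifts outside a fixed tolerance band, while a fast-updating baseline chases the creep and never alerts.

```python
# Compare a slow and a fast learning rate on a slowly creeping metric.
def first_alert(series, alpha, tolerance):
    """Return the first index where the value leaves baseline +/- tolerance."""
    baseline = series[0]
    for i, value in enumerate(series[1:], start=1):
        if abs(value - baseline) > tolerance:
            return i                                       # alert fires here
        baseline = (1 - alpha) * baseline + alpha * value  # EWMA update
    return None                                            # never alerted

# A metric creeping up by 1 per step from a baseline of 100.
creep = [100 + i for i in range(60)]
print(first_alert(creep, alpha=0.05, tolerance=10))  # slow learner alerts
print(first_alert(creep, alpha=0.9, tolerance=10))   # fast learner never does
```

With `alpha=0.05` the baseline lags the ramp enough for the gap to exceed the tolerance; with `alpha=0.9` the baseline tracks the creep almost point-for-point, so the degradation stays invisible.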
Without intelligent adaptation, the onus is on the data team to continually anticipate changes in the business and data usage and tune their machine learning models appropriately. More likely, the team will be hit with an unexpected alert storm when there is a change in the business that invalidates their current alerting configuration. This quickly becomes untenable, especially in dynamic environments with large volumes of data.
This makes adjusting alerting thresholds deceptively hard but incredibly important to a data observability system. Set the thresholds too narrow and data teams are flooded with false-positive alerts; set them too broad and real anomalies get missed. Setting thresholds gets even more complicated when factoring in changes in the business: as the business changes, previously set thresholds may no longer work.
To address this challenge, Bigeye creates dynamic boundaries that examine historical data to determine whether a pattern has changed and what is anomalous and what isn’t. Some metrics are allowed to fluctuate more than others, so threshold sensitivity varies throughout the monitoring process. Bigeye offers both this automation (Autothresholds) and manual advanced settings for managing boundaries. Our customers find it useful to let Bigeye do the heavy lifting while retaining the option to adjust alerting sensitivity when needed.
Figure 2. Several examples of how Bigeye identifies pattern changes in the data metric time series, using only the current pattern for model training.
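A much-simplified sketch of dynamic boundaries (Autothresholds are more sophisticated than this): derive the band from recent history so that a volatile metric gets wider thresholds than a stable one, with a user-tunable sensitivity factor to narrow or widen the band.

```python
# Derive alert boundaries from recent history, so the band's width adapts
# to how much each metric normally fluctuates.
from statistics import mean, stdev

def dynamic_bounds(history, sensitivity=3.0):
    """Return (lower, upper) thresholds based on recent mean and spread."""
    mu, sigma = mean(history), stdev(history)
    return mu - sensitivity * sigma, mu + sensitivity * sigma

stable = [100, 101, 100, 99, 100, 101]
volatile = [100, 140, 70, 130, 80, 120]
lo_s, hi_s = dynamic_bounds(stable)
lo_v, hi_v = dynamic_bounds(volatile)
# The volatile metric is allowed to fluctuate more.
print(hi_s - lo_s < hi_v - lo_v)  # True
```

Lowering `sensitivity` narrows the band for teams that want to be paged on smaller deviations; raising it widens the band to suppress noise.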
Since every business and data environment is unique, invariably any anomaly detection will produce some false positives, ours included. What’s important is that the data observability system improves and adjusts to the data team’s preferences and alert sensitivities.
Perhaps the data team is more interested in knowing if there are extreme changes to the data batches rather than small fluctuations. Or, the data team might want to understand if there’s any fluctuation in features within the machine learning feature store so that downstream automation is run with the most consistent inputs.
Bigeye is designed to do the heavy lifting through advanced automation. To ensure that our automation is tailored to our customer’s workflows, our platform collects user feedback and uses it to improve to fit the needs of the data team. Let’s touch on Bigeye’s use of reinforcement learning and treatment of bad values to give you a glimpse of how Bigeye continually improves over time.
Bigeye has a simple built-in system for user inputs. We have designed the experience to make it easy to tell the application if there are false positives or if the user wants a different alert sensitivity. For example, when a data issue notification is fired, and the user thinks that the data batch in question is good in practice, the user can tell Bigeye that the underlying data state is tolerable or that a false positive alert is present. Bigeye will take this information into account so that a similar behavior in the future will not trigger an alert.
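The production feedback mechanism is not public, but the feedback loop can be sketched with a toy tuner: each “false positive” mark widens the sensitivity threshold so similar deviations stop alerting, and each confirmed issue tightens it back.

```python
# Toy feedback-driven sensitivity tuning: user feedback moves the alerting
# threshold (measured in standard deviations from the forecast).
class FeedbackTuner:
    def __init__(self, sensitivity=3.0, step=1.2):
        self.sensitivity = sensitivity
        self.step = step

    def mark_false_positive(self):
        self.sensitivity *= self.step    # tolerate larger deviations

    def mark_true_positive(self):
        self.sensitivity /= self.step    # tighten back toward the default

    def is_anomaly(self, deviation_sigmas):
        return deviation_sigmas > self.sensitivity

tuner = FeedbackTuner()
print(tuner.is_anomaly(3.5))  # True: 3.5 sigma exceeds the 3.0 threshold
tuner.mark_false_positive()   # user says this batch was actually fine
print(tuner.is_anomaly(3.5))  # False: the threshold is now 3.6 sigma
```

A reinforcement-learning treatment would generalize this from a single scalar to a learned policy over many metric features, but the reward signal is the same user feedback.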
How the data observability platform treats bad data metric values is an important consideration. This is the “garbage in, garbage out” scenario in data monitoring that few out-of-the-box tools address well.
Bigeye automatically decides the treatment of bad data metrics or anomalous values so that future thresholds are tracking the optimal state(s) of the data. Bigeye removes bad data values based on past historical data patterns and time series values for each metric. The figure below shows a degradation that Bigeye caught and removed from model training automatically. If this bad-value handling wasn’t in place, the thresholds would adapt to the spike in the metric time series and become too wide, reducing their ability to catch future problems.
Figure 3. In this figure, the data metric became degraded during an incident (with values exceeding the upper Autothreshold for a period of time), but these values do not propagate into the downstream anomaly model training.
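A minimal sketch of this bad-value handling (a stand-in for whatever Bigeye actually does): values far from the median, measured in median-absolute-deviation (MAD) units, are dropped before the next training pass, so thresholds keep tracking the metric’s normal state instead of widening around the incident.

```python
# Exclude incident values from the next training pass using a robust
# median/MAD filter, which the outlier itself cannot inflate.
from statistics import median

def clean_training_data(series, k=5.0):
    """Drop values more than k MADs from the median before retraining."""
    med = median(series)
    mad = median(abs(v - med) for v in series) or 1e-9  # guard flat series
    return [v for v in series if abs(v - med) <= k * mad]

incident = [100, 104, 98, 101, 500, 99, 103, 97, 102]  # 500 is the incident
cleaned = clean_training_data(incident)
print(500 in cleaned)  # False: the spike won't inflate future thresholds
```

The median and MAD are used instead of the mean and standard deviation because a single large spike drags the mean toward itself and inflates the standard deviation, which can leave the spike inside a mean-based cutoff.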
Our Vision for Anomaly Detection
In this article, I’ve just scratched the surface of the work we have done to achieve industry-leading anomaly detection. While we already have hundreds of features available and train thousands of models at scale for our customers, we are not resting on our laurels. In the near future, we look forward to unveiling new machine learning automation built on top of the existing Bigeye framework.
My work is motivated by the words of our customers, who have the assurance that they will know about data or data pipeline issues before the business is affected, thanks to our continued investment in world-class anomaly detection:
“With Bigeye, I sleep well at night knowing that there is a system checking the quality of the data.” — Tony Ho, Director of Engineering, SignalFire
“With Bigeye, we have an integrated, one-stop solution for monitoring the health of our data — the ultimate answer as to whether our data is good or not.” — Simon Dong, Head of Data Engineering, Udacity
“On day two of using Bigeye, we were putting checks in place to prevent issues that could have otherwise negatively impacted our business.” — Yuda Borochov, CDO, ZipCo