The difference between data observability and legacy data quality platforms
The data quality products of yore originated in the on-prem era. Modern data observability is a whole different ballgame. From complexity to scale to the cloud, here's the difference between data quality systems of today and those of yesterday.
Unlike legacy data quality products, which originated in the on-premise era, modern data observability platforms like Bigeye were built from the ground up to handle the complexity and scale of cloud data environments.
In this blog post, we’ll explore how data observability platforms outperform legacy data quality products, and why organizations should make the switch.
What is Informatica?
Informatica, founded in 1993, is an all-in-one data integration platform that provides an interface for building ETL pipelines. The platform helps organizations integrate data from various sources into a data warehouse or data lake. Informatica's no-code interface allows non-technical users to build ETL pipelines without writing any code. In spirit, it's comparable to a tool like UiPath.
Informatica was founded in the pre-cloud era, and many large companies still have on-premise Informatica deployments. Over the years, Informatica has also expanded into adjacent products in data quality and governance.
How do Informatica’s data quality products work?
Legacy data quality products like Informatica use rule-based approaches to ensure data quality. That means users define and implement rules for validating, cleansing, and transforming data from different sources.
But rule-based approaches have limitations:
- They require users to write hundreds or thousands of custom checks, one set for each data source or pipeline
- They do not adapt well to changes in data schemas, formats, or business requirements
- They do not provide comprehensive visibility into data health and performance across the entire data ecosystem
- As a result, they require heavy professional services resources to set up and maintain
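To make the first two limitations concrete, here is a minimal Python sketch of what hand-written rules tend to look like. It is an illustration only: the `Rule` class, the `validate` function, and the column names are hypothetical and do not represent Informatica's API.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Rule:
    """One hand-written check bound to a single column (hypothetical type)."""
    name: str
    column: str
    check: Callable[[object], bool]


def validate(rows, rules):
    """Return (row_index, rule_name) for every check that fails."""
    failures = []
    for i, row in enumerate(rows):
        for rule in rules:
            if not rule.check(row.get(rule.column)):
                failures.append((i, rule.name))
    return failures


# Every column needs its own rule, written and maintained by hand.
rules = [
    Rule("order_id_not_null", "order_id", lambda v: v is not None),
    Rule("amount_non_negative", "amount",
         lambda v: isinstance(v, (int, float)) and v >= 0),
    Rule("currency_is_iso_code", "currency",
         lambda v: isinstance(v, str) and len(v) == 3 and v.isupper()),
]

rows = [
    {"order_id": 1, "amount": 19.99, "currency": "USD"},
    {"order_id": None, "amount": -5, "currency": "usd"},
]

failures = validate(rows, rules)
for row_index, rule_name in failures:
    print(f"row {row_index} failed {rule_name}")
```

Three columns already need three rules. If a column is renamed or a new source is added, each affected rule has to be found and rewritten by hand, which is where the maintenance cost comes from.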
By contrast, modern data observability products like Bigeye take a much more automated and proactive approach. They enable users to monitor and improve data quality at scale using intelligent methods. The features described below are the ones that matter most for this comparison.
One of the key features of Bigeye’s data observability product is its Autometrics capability. Autometrics are smart recommendations for monitoring coverage based on an analysis of the user’s data, including column type, semantic details, and formatting. When users connect their database to Bigeye, it automatically indexes their source to generate the catalog and suggest basic autometrics. Customers can then go through and enable the autometrics they want.
When you deploy a standard autometric, Bigeye builds in "Autothresholds" computed from historical data through machine learning models. These autothresholds are dynamic and configurable. You can set them to "narrow," "normal," or "wide." You can also give feedback to the machine learning models on the "Issues" page, if you find that the autothresholds are inaccurate.
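The idea of a threshold learned from history, with a width setting, can be sketched in a few lines of Python. This is not Bigeye's model, which the post describes only as machine learning: the sketch swaps in a plain mean and standard deviation band, and the multipliers in `WIDTH_MULTIPLIER` are assumed values chosen for illustration.

```python
import statistics

# Assumed multipliers for illustration; the real settings are not documented here.
WIDTH_MULTIPLIER = {"narrow": 2.0, "normal": 3.0, "wide": 4.0}


def auto_threshold(history, width="normal"):
    """Derive (lower, upper) bounds from past values of a metric."""
    mean = statistics.fmean(history)
    spread = statistics.stdev(history)  # sample standard deviation
    k = WIDTH_MULTIPLIER[width]
    return mean - k * spread, mean + k * spread


def is_anomalous(value, history, width="normal"):
    """True if the value falls outside the band derived from history."""
    lower, upper = auto_threshold(history, width)
    return value < lower or value > upper


# Daily row counts observed so far for one table (made-up numbers).
history = [1000, 1020, 980, 1010, 990, 1005, 995]

for width in ("narrow", "normal", "wide"):
    lower, upper = auto_threshold(history, width)
    print(f"{width}: {lower:.1f} to {upper:.1f}")
```

With this history, a new value of 1035 falls outside the "narrow" band but inside the "normal" one, which is the trade-off the width setting controls: tighter bands catch more issues and raise more alerts.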
Anomaly detection is the process of identifying deviations from expected values or trends. Bigeye has sophisticated anomaly detection algorithms that understand trends and seasonality. These algorithms even recognize hard-to-detect anomalies like "slow degradation." Bigeye can identify and adjust to pattern changes in data, removing the need for manual intervention.
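A small Python sketch shows why "slow degradation" is hard to catch. It assumes a simple window comparison rather than Bigeye's actual algorithms, and the function names, window size, and tolerances are hypothetical. A metric that shrinks 2% per day never trips a day-over-day check, yet drifts far from its baseline within two weeks.

```python
from statistics import fmean


def point_alerts(values, max_step_change=0.05):
    """Indices where a value moved more than max_step_change (relative)
    from the previous value. Assumes no value is zero."""
    return [
        i for i in range(1, len(values))
        if abs(values[i] - values[i - 1]) / abs(values[i - 1]) > max_step_change
    ]


def slow_degradation(values, window=5, max_drift=0.10):
    """True if the mean of the last `window` values has drifted more than
    max_drift (relative) from the mean of the first `window` values."""
    if len(values) < 2 * window:
        return False  # not enough history to compare two windows
    baseline = fmean(values[:window])
    recent = fmean(values[-window:])
    return abs(recent - baseline) / abs(baseline) > max_drift


# Made-up metric: a table's daily row count shrinking by 2% each day.
daily_row_counts = [round(10_000 * 0.98 ** day) for day in range(15)]

print("step alerts:", point_alerts(daily_row_counts))
print("slow degradation:", slow_degradation(daily_row_counts))
```

The step check stays silent because each daily change is small, while the window comparison flags the cumulative drift. A production detector would also need to account for trend and seasonality, which this sketch ignores.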
Bigeye’s anomaly detection improves over time through reinforcement learning and anomaly exclusion. The system learns from user feedback to fine-tune its detection and alerting. Bigeye also includes root-cause and impact analysis to help engineers resolve detected anomalies.
In Bigeye, you can access metadata metrics immediately upon connecting your data warehouse. Examples include "data freshness" and "volume," which indicate the general success or failure of a data pipeline. These metrics primarily focus on whether a table has been updated and/or accessed. The table below presents some examples:
| Metadata Metric Name | API Name | Description |
| --- | --- | --- |
| Hours since last load | HOURS_SINCE_LAST_LOAD | The time elapsed (in hours) since a table was last modified with an INSERT, COPY, or MERGE operation. Recommended as an automatic metric for each table. |
| Rows inserted | ROWS_INSERTED | The total number of rows added to the table through INSERT, COPY, or MERGE statements in the last 24 hours. Recommended as an automatic metric for each table. |
| Read queries | COUNT_READ_QUERIES | The total number of SELECT queries executed on a table in the last 24 hours. Recommended as an automatic metric for each table. |
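The definitions of the three metrics in the table can be sketched in Python against a list of query-log events. This is an illustration under assumptions: the event format (`at`, `op`, `rows`) and the function names are hypothetical, and Bigeye's own implementation reads warehouse metadata in a way this post does not describe.

```python
from datetime import datetime, timedelta, timezone

WRITE_OPS = {"INSERT", "COPY", "MERGE"}


def hours_since_last_load(events, now):
    """Hours since the most recent INSERT, COPY, or MERGE, or None if none."""
    writes = [e["at"] for e in events if e["op"] in WRITE_OPS]
    if not writes:
        return None
    return (now - max(writes)).total_seconds() / 3600


def rows_inserted(events, now):
    """Rows added by INSERT, COPY, or MERGE in the last 24 hours."""
    cutoff = now - timedelta(hours=24)
    return sum(e["rows"] for e in events
               if e["op"] in WRITE_OPS and e["at"] >= cutoff)


def count_read_queries(events, now):
    """Number of SELECT queries in the last 24 hours."""
    cutoff = now - timedelta(hours=24)
    return sum(1 for e in events if e["op"] == "SELECT" and e["at"] >= cutoff)


# Hypothetical query log for one table.
now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
events = [
    {"at": now - timedelta(hours=30), "op": "COPY", "rows": 5000},
    {"at": now - timedelta(hours=6), "op": "INSERT", "rows": 1200},
    {"at": now - timedelta(hours=3), "op": "MERGE", "rows": 300},
    {"at": now - timedelta(hours=2), "op": "SELECT", "rows": 0},
    {"at": now - timedelta(hours=1), "op": "SELECT", "rows": 0},
]

print(hours_since_last_load(events, now))
print(rows_inserted(events, now))
print(count_read_queries(events, now))
```

Because all three values come from the query log rather than from scanning table contents, they can be computed for every table as soon as the warehouse is connected.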
Metadata metrics are a crucial component of Bigeye's T-shaped monitoring approach, which advises tracking basic metrics across all data while implementing more in-depth monitoring for the most critical datasets, such as those used in financial planning, machine learning models, and high-level executive dashboards.
In conclusion, rule-based data quality solutions like Informatica (or even more modern tools like dbt tests) require users to manually define each metric and its associated thresholds. These rules can be complex and time-consuming to maintain and update as the data changes over time. Additionally, because these products lack sophisticated anomaly detection algorithms, the anomalies they flag may be false positives or may not be actionable. By comparison, Bigeye provides instant coverage for the entire data warehouse from the moment customers connect, with the option to go deeper with little additional effort.
If you want to learn more about how Bigeye can help you achieve better data quality at scale, request a demo.
Schema change detection