Data observability in the modern data stack
Over the past five years, there's been a paradigm shift in data engineering. Here's how data observability fits into a “modern” data stack today.
While data observability is often talked about in isolation, it is in fact an outgrowth of the last five years of evolution in data engineering, which has seen trends like the rise of the data engineer, the decline of the Hadoop cluster, and the importance of high-quality data as inputs to machine learning systems.
Data observability tools like Bigeye wouldn’t be possible without near real-time ingestion, cheap computing in data warehouses, and demand from increasingly sophisticated data engineers and analysts.
In this blog post, we talk about the main paradigm shifts in data engineering over the last five years, what a “modern” data stack looks like today, and how data observability fits into the stack.
The last five years
The last five years in data engineering have seen an evolution in tooling, roles, and best practices. This includes:
1. The rise of data engineering as a distinct role and field
As companies hired data scientists, only to realize that the data needed to be aggregated and cleaned first before the data scientists could do their jobs, data engineering emerged as a distinct role.
2. The transition from Hadoop and batch processing to streaming data and tools like Spark, Flink, Kafka, etc.
The Hadoop/MapReduce paradigm of the mid-2010s was always less than ideal: its programming model was hard to work with and required specialized talent, and it was tricky to predict how long jobs would run and to sequence them correctly.
Today, streaming tools like Spark, Flink, and Kafka have addressed many of these issues and enabled real-time data use. This has moved data consumption from just looking at day-old data in dashboards to using machine learning to make real-time decisions that affect the user experience, like offering point-of-sale promos or determining creditworthiness.
3. The rise of cloud data warehouses like Snowflake, BigQuery, and Redshift
Modern data warehouses are much more scalable and cost-effective compared to their predecessors.
Since they are hosted in the cloud and fully managed, teams can scale up and down as needed without having to buy expensive hardware. Furthermore, the pricing for storage and compute is separated. Teams only pay for the compute resources they use.
From a usability perspective, modern data warehouses offer a SQL interface, allowing anyone who knows SQL to query and analyze data. These queries are also fast: modern data warehouses can handle high volumes of concurrent queries with minimal latency.
Because storage is scalable, compute and storage are separated, and SQL queries are fast, data warehouses have also enabled the rise of ELT (extract, load, transform) over ETL (extract, transform, load). It’s now cost-effective to load raw data directly into the warehouse and perform transformations inside the warehouse.
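To make the load-then-transform order concrete, here is a minimal sketch. It assumes Python's built-in sqlite3 module as a stand-in for a cloud warehouse, and the table and column names (raw_orders, orders_clean, amount_cents) are hypothetical. The raw rows are loaded untouched, and the cleanup happens afterwards in SQL.

```python
import sqlite3

# sqlite3 stands in for a cloud warehouse; all names here are hypothetical.
conn = sqlite3.connect(":memory:")

# Load: raw records land in the warehouse exactly as they arrived.
conn.execute(
    "CREATE TABLE raw_orders (id INTEGER, amount_cents INTEGER, status TEXT)"
)
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [(1, 1250, " Paid "), (2, 400, "REFUNDED"), (3, 9900, "paid")],
)

# Transform: cleanup runs afterwards, in SQL, inside the warehouse.
conn.execute(
    """
    CREATE TABLE orders_clean AS
    SELECT id AS order_id,
           amount_cents / 100.0 AS amount_usd,
           LOWER(TRIM(status)) AS status
    FROM raw_orders
    """
)

clean_rows = conn.execute(
    "SELECT order_id, amount_usd, status FROM orders_clean ORDER BY order_id"
).fetchall()
print(clean_rows)
```

Because the raw table is kept, the transformation can be rewritten and rerun later without re-extracting anything from the source system.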
4. The emergence of tools like dbt, Airflow, and orchestration engines that have made working with data more robust, scalable, and repeatable
Before dedicated orchestration tools like Airflow, data teams relied on job orchestration techniques like cron jobs, manual triggers, and homegrown internal tools. These techniques were error-prone, difficult to maintain, and often insecure.
Modern orchestration engines like Airflow, by contrast, offer a way to build much more repeatable and robust data pipelines. They provide greater visibility and control. With orchestration engines like Airflow, teams can immediately see the status and progress of all data jobs in their pipelines. Jobs are automatically run in the correct order according to their declared dependencies.
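The idea of sequencing jobs from their declared dependencies can be sketched in a few lines. This is not Airflow's API; it is a simplified illustration that assumes Python 3.9+ and its standard-library graphlib module, and the job names are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each job maps to the set of jobs it depends on.
dependencies = {
    "extract_orders": set(),
    "extract_customers": set(),
    "join_orders_customers": {"extract_orders", "extract_customers"},
    "build_revenue_report": {"join_orders_customers"},
}

# static_order() yields each job only after all of its dependencies.
run_order = list(TopologicalSorter(dependencies).static_order())

for job in run_order:
    print(f"running {job}")
```

A real orchestration engine adds scheduling, retries, and status tracking on top of this ordering, but the dependency graph is what lets it decide what can run next.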
Meanwhile, tools like dbt have popularized SQL-based data transformation, lowering the bar for building and managing ELT data pipelines. This has catalyzed the rise of analytics engineering roles.
5. A shift to focusing on data quality, monitoring, and observability
As data systems have become more complex, these practices help ensure that data is reliable and that the value gained from data is maximized.
New tools like Bigeye and dbt tests have turned maintaining data quality from a manual, reactive debugging process into a much more proactive, automated one.
6. The rise of metadata tools, data catalogs, and data discovery to help understand what data is available and how it is used
With the prior “big data” mentality of just throwing everything into the data lake, data users didn’t always know what data had been collected, whether it was useful, or how it was used. This often led to thousands of unused, abandoned, or duplicate tables within companies, which in turn resulted in ignored alerts and inconsistent metric definitions.
With the rise of metadata tools like data catalogs, users can now find the data they need much more efficiently, and trust that it comes from the gold-standard source.
7. The application of machine learning to data engineering with tools for anomaly detection, entity resolution, etc.
Machine learning has benefitted from investments in data engineering, and vice versa. For example, data observability tools today take advantage of sophisticated anomaly detection algorithms, while the principles of data engineering have produced much higher-quality data, which ultimately determines the quality of the machine learning model outputs.
Data engineers will often also build data platforms that make it easier for other engineers to apply machine learning, even if they’re not experts in the underlying theories and formulas.
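As a rough illustration of anomaly detection applied to a data pipeline, the sketch below flags a daily row count that falls far from its recent history. It uses a plain z-score, which is far simpler than the algorithms production observability tools rely on and is not a description of how any particular tool works. The function name, the sample counts, and the threshold of 3.0 are all assumptions.

```python
from statistics import mean, stdev


def is_anomalous(history, latest, threshold=3.0):
    """Return True if `latest` is more than `threshold` sample standard
    deviations away from the mean of `history`."""
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # A perfectly flat history: any change at all is unusual.
        return latest != mu
    return abs(latest - mu) / sigma > threshold


# Hypothetical row counts for a table over the past seven days.
daily_row_counts = [1000, 1020, 990, 1010, 980, 1005, 995]

print(is_anomalous(daily_row_counts, 1015))  # a typical day
print(is_anomalous(daily_row_counts, 400))   # a sudden drop
```

A fixed z-score ignores seasonality and trends, which is one reason real tools use more sophisticated models.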
8. A move to close the loop by enabling data to flow back to source systems
This completes the data cycle and allows the value added to data to benefit source systems. For example, there are now “reverse ETL” tools that allow you to update your HubSpot or Salesforce records with information about customers’ buying patterns, without it having to be a manual process. This creates a data cycle rather than just a linear journey from data source to data warehouse.
What a modern data stack looks like today
As you can see in the diagram below, a common data stack for companies today involves data flowing from microservices/online services/production databases, into a streaming service like Kafka. That data can then be sent either to a data lake for long-term archival, or to a data warehouse for further transformation.
Separately, customer data can be pushed from SaaS services like HubSpot and Salesforce into the data warehouse on a more batch/manual basis.
Within the data warehouse, data analysts use dbt to create SQL transformations that turn the raw data tables into clean, complete tables that are consumable by data scientists or data applications. These transformations will check for formatting inconsistencies, rename columns, and make other necessary adjustments.
On the consumption end, data catalogs pull in metadata from both the data warehouse and the data lake and provide users with a UI showing which data lives where. Finally, machine learning applications pull data from the data warehouse (and sometimes the data lake!) for training purposes.
How does a data observability tool fit into the modern data stack?
As you can see, data observability tools like Bigeye typically sit on top of the data warehouse, periodically running SQL queries on columns to calculate statistics on the values of the column, and alerting engineers and data scientists when those values go out of bounds. They can also be used to monitor production databases like Postgres and MySQL.
Modern data observability tools are optimized to keep data warehouse compute costs low, batching and caching the queries.
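The check described above, running SQL over a column, computing statistics, and alerting when they go out of bounds, might look roughly like the following. This is a hedged sketch rather than how Bigeye or any specific tool is implemented: sqlite3 stands in for the warehouse, and the payments table, the metric names, and the bounds are hypothetical. Computing several statistics in one query is one simple form of batching, since the table is scanned once instead of once per metric.

```python
import sqlite3

# sqlite3 stands in for the warehouse; table and data are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE payments (id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO payments VALUES (?, ?)",
    [(1, 10.0), (2, None), (3, 30.0), (4, None)],
)

# One query computes several column statistics in a single scan.
row_count, null_rate, max_amount = conn.execute(
    """
    SELECT COUNT(*),
           AVG(CASE WHEN amount IS NULL THEN 1.0 ELSE 0.0 END),
           MAX(amount)
    FROM payments
    """
).fetchone()

metrics = {
    "row_count": row_count,
    "null_rate": null_rate,
    "max_amount": max_amount,
}

# Hypothetical (low, high) bounds that each metric is expected to stay within.
bounds = {
    "row_count": (1, 1000),
    "null_rate": (0.0, 0.25),
    "max_amount": (0.0, 500.0),
}

# Collect the name of every metric that falls outside its bounds.
alerts = [
    name
    for name, value in metrics.items()
    if not (bounds[name][0] <= value <= bounds[name][1])
]
print(alerts)
```

In practice the bounds would usually be learned from each metric's history rather than hard-coded, and an alert would notify someone rather than print.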
With all the “modern data stack” terms flying around today, it can be difficult to understand where data observability fits in. But ultimately, data observability is focused on a simple goal: maximizing the value of your data.
By monitoring data in use, data observability provides visibility into issues that could undermine data products, AI models, compliance, and more. While few companies these days hesitate to adopt data warehouses, Airflow for orchestration, or dbt for transformation, data observability is often considered (wrongly) more of a “nice-to-have”. But the reality is that without data observability, you have little visibility into whether your data is even “fit for use.” Data issues can have serious consequences, from compliance violations to faulty AI. The modern data stack provides a strong foundation, but observability cements its impact.
Schema change detection