Navigating A Non-Modern Data Stack: Achieving Analytics Reliability and Data Observability
Unless you're a young startup, your data infrastructure is unlikely to be purely a modern data stack. So, how do you work with what you've got?
Data is at the heart of digital transformation and data-driven decision-making. However, managing and gaining insights from data involves navigating a complex “data stack” comprising various technologies across the data lifecycle. For many organizations, this data stack contains a mix of both modern cloud-based tools and traditional on-premise systems that have been built up over many years.
In this post, we'll take a look at what a Non-Modern Data Stack looks like, how you can implement data observability and analytics reliability in that stack, and the key considerations for operating these stacks alongside their modern counterparts.
Understanding the Non-Modern Data Stack
In recent years, there has been considerable hype surrounding the "modern data stack." This typically includes cloud data warehouses like Snowflake or Redshift, transformation and orchestration tools like dbt and Airflow, and BI tools like Looker and Tableau.
However, unless you are a relatively young startup, your data infrastructure is unlikely to be purely a modern data stack. Instead, it will likely comprise a blend of modern data stack tools and what we might call the non-modern data stack.
When we talk about the non-modern data stack, there are a few categories to think about: databases, ETL, and BI. A few vendors have offerings across more than one of these, and companies will often buy the whole stack from them. For example, Microsoft has SQL Server as a database, SSIS as an ETL tool, and Power BI as a BI tool. IBM has Db2 as a database, DataStage as an ETL tool, and Cognos as a BI tool. But just as often, companies will mix and match. Informatica is a very popular ETL tool, and it is used with almost any combination of databases.
For many companies, the non-modern stack still powers large parts of their data and analytics workloads. While these technologies might seem outdated compared to the modern data stack, companies continue to use them because they work well for core business functions, and replacing them would require massive time, cost, and effort.
Challenges of Non-Modern Data Stacks
In non-modern data stacks, many of the components are often siloed and owned by specific teams, making end-to-end observability very difficult. Arriving at a clear data lineage and understanding of how data flows through the systems, then, is both a technical and organizational challenge.
The technical challenge is finding a tool that can connect to the variety of these legacy systems and automate as much of the lineage collection as possible. Most tools focus on more modern data platforms that expose lineage and other metadata more openly, whereas legacy tools often have less open, more bespoke APIs.
The organizational challenge is identifying who is responsible for these systems. Oftentimes there are technical owners (e.g. database administrators, infrastructure engineers, SREs) who are responsible for the actual system (like a specific SQL Server instance), as well as business owners (e.g. project managers, analysts) who are responsible for the application on top of the systems (e.g. risk reporting framework, financial forecasts). Having multiple different views into the same system not only makes it much harder to capture lineage information, but also makes it harder to know who to talk to if something goes wrong.
Analytics Reliability and Data Observability
Why Does Analytics Reliability Matter?
To take a step back, why do analytics reliability and observability in your data systems even matter? Well, analytics reliability can have a huge impact on decision-making. In cases where analytics are unreliable, and data quality is subpar, stakeholders are unlikely to have confidence in data-driven recommendations and insights, resulting in a loss of credibility for data within the organization.
The Significance of Data Observability
While data observability is often defined in terms of knowing the state of your various data systems at all times, another way to look at it is the ability to answer any questions that come up about your data systems. This includes questions like:
- Is my data arriving on time?
- Is my data high quality?
- Is the data outage that happened in the upstream system going to affect the downstream?
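The last question is essentially a graph traversal over lineage. As a minimal sketch, assuming you have already collected lineage as a mapping from each asset to the assets it feeds, you can list everything downstream of a failing system. The asset names and the `downstream_of` helper below are hypothetical and invented for illustration; they are not part of any particular tool.

```python
from collections import deque

# Hypothetical lineage graph: each key feeds the assets listed as its value.
# All asset names are invented for illustration.
LINEAGE = {
    "sqlserver.orders": ["ssis.load_orders"],
    "ssis.load_orders": ["warehouse.fact_orders"],
    "warehouse.fact_orders": ["powerbi.sales_dashboard", "cognos.risk_report"],
    "db2.customers": ["warehouse.dim_customers"],
    "warehouse.dim_customers": ["powerbi.sales_dashboard"],
}


def downstream_of(asset, lineage):
    """Return every asset reachable from `asset`, in breadth-first order."""
    seen = {asset}  # guards against cycles and repeated visits
    order = []
    queue = deque([asset])
    while queue:
        current = queue.popleft()
        for child in lineage.get(current, []):
            if child not in seen:
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


if __name__ == "__main__":
    # List what an outage in the source orders table could affect.
    for name in downstream_of("sqlserver.orders", LINEAGE):
        print(name)
```

In practice the hard part is populating the graph, especially across legacy systems; once it exists, impact analysis is a simple walk like this one.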
The Importance of Continuous Monitoring
A data observability platform like Bigeye makes it easy to continuously monitor the health of both your data pipeline and your actual data targets. Bigeye sits on top of your data warehouse or databases and periodically polls your tables and schemas to check that data quality metrics are falling within bounds.
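To make the idea of polling a table and checking that metrics fall within bounds concrete, here is a rough sketch using Python's built-in `sqlite3` module as a stand-in for your database. It is not Bigeye's implementation or API. The function names, the table and column names, and the thresholds are all assumptions chosen for illustration.

```python
import sqlite3
from datetime import datetime, timedelta, timezone


def collect_metrics(conn, table, ts_column, nullable_column, now):
    """Compute row count, null rate, and hours since the newest row."""
    # Identifiers are interpolated directly, so they must come from trusted config.
    row_count, null_count, latest = conn.execute(
        f"SELECT COUNT(*), "
        f"SUM(CASE WHEN {nullable_column} IS NULL THEN 1 ELSE 0 END), "
        f"MAX({ts_column}) FROM {table}"
    ).fetchone()
    null_rate = (null_count or 0) / row_count if row_count else 0.0
    if latest is None:
        hours_stale = float("inf")  # an empty table is treated as infinitely stale
    else:
        age = now - datetime.fromisoformat(latest)
        hours_stale = age.total_seconds() / 3600
    return {"row_count": row_count, "null_rate": null_rate, "hours_stale": hours_stale}


def out_of_bounds(metrics, bounds):
    """Return the names of metrics that fall outside their (low, high) bounds."""
    return [
        name
        for name, (low, high) in bounds.items()
        if not low <= metrics[name] <= high
    ]


if __name__ == "__main__":
    now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
    loaded = (now - timedelta(hours=30)).isoformat()

    # Hypothetical table: timestamps are stored as ISO 8601 strings in UTC.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER, amount REAL, loaded_at TEXT)")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)",
        [(1, 10.0, loaded), (2, None, loaded), (3, 7.5, loaded), (4, 3.0, loaded)],
    )

    # Illustrative thresholds only; real bounds would come from your data SLAs.
    bounds = {
        "row_count": (1, float("inf")),
        "null_rate": (0.0, 0.1),
        "hours_stale": (0.0, 24.0),
    }
    metrics = collect_metrics(conn, "orders", "loaded_at", "amount", now)
    print(out_of_bounds(metrics, bounds))
```

A scheduler such as cron would run a check like this at a fixed interval. Static thresholds are the simplest option; a dedicated platform would typically track metric history as well, rather than comparing against hand-set bounds.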
Unlike other data observability tools, Bigeye offers integrations not only with modern data warehouses like Snowflake and Databricks, but also with legacy databases like SAP HANA, and has ambitions to go deeper into the non-modern data stack. Its recent acquisition of Data Advantage Group, which has been in business for two decades and boasts outstanding technology around metadata collection and data lineage, is part of that strategy. The acquisition and subsequent integration of the Data Advantage product allows Bigeye to automatically map data lineage across transactional databases, ETL platforms, data lakes, data warehouses, and business intelligence tools.
Bigeye can plug into any data system you use, whether it's a cutting-edge warehouse or a legacy platform.
Maximizing Existing Infrastructure
Our general recommendation regarding non-modern data stacks is that if the existing legacy system works, then keep it. Many older applications have been running for years and "just work". If you don't foresee needing to update or expand functionality for these applications, then there's no need to switch technologies.
That said, switching technologies can be useful if there is a company-wide effort to consolidate to a new platform (e.g. migrate to GCP), or if you are using a piece of technology that's lacking some sort of key functionality you absolutely need.
Legacy Software Implementation Checklist
One area where you will likely want to strengthen your non-modern data stack is analytics reliability and data observability. In some cases, you may be able to find modern data observability tools that integrate with your legacy database or ETL tool. Perhaps even more important, though, is the process of auditing your systems and data sources and sitting down with the different data stakeholders. Some rough implementation steps to follow include:
- Data source assessment: Conduct an inventory of existing data sources, databases, and software tools to understand data origins, volumes, and flows.
- Data lineage analysis: Map out the entire flow of data within a system, tracing its journey from the initial source right through to its ultimate destination. This provides clarity on how data moves and transforms and also highlights potential problem areas.
- Data SLAs: Define service level requirements for data availability, latency, and accuracy. Similar to software SLAs, these are agreements between producers and consumers of data that give your data quality efforts something to measure against.
- Data quality metrics: Determine key data quality indicators like accuracy, completeness, consistency, reliability, and timeliness. To monitor these metrics and ensure that they stay within bounds, you can establish automated checks for data anomalies or discrepancies. These may initially take the form of SQL tests, like dbt tests, but should gradually be further automated.
- Data migration monitoring plan: Companies running non-modern data stacks frequently migrate from database A to database B, or from reporting tool X to reporting tool Y. Bigeye Deltas can help in situations like this by checking that data has been faithfully migrated or replicated.
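As a rough illustration of the last item, a basic migration check compares row counts and a content hash for the same table in the source and target databases. The sketch below uses two in-memory SQLite databases as stand-ins. It is a generic technique, not how Bigeye Deltas works, and the `table_fingerprint` and `compare_tables` helpers and the `customers` table are hypothetical.

```python
import hashlib
import sqlite3


def table_fingerprint(conn, table, key_column):
    """Return (row_count, sha256 hex digest) over rows ordered by key_column."""
    # Identifiers are interpolated directly, so they must come from trusted config.
    digest = hashlib.sha256()
    row_count = 0
    for row in conn.execute(f"SELECT * FROM {table} ORDER BY {key_column}"):
        digest.update(repr(row).encode("utf-8"))
        row_count += 1
    return row_count, digest.hexdigest()


def compare_tables(source, target, table, key_column):
    """Compare one table across two connections and describe any mismatch."""
    src_count, src_hash = table_fingerprint(source, table, key_column)
    tgt_count, tgt_hash = table_fingerprint(target, table, key_column)
    if src_count != tgt_count:
        return f"row count mismatch: {src_count} vs {tgt_count}"
    if src_hash != tgt_hash:
        return "row contents differ"
    return "match"


if __name__ == "__main__":
    rows = [(1, "Ada"), (2, "Grace")]
    source = sqlite3.connect(":memory:")
    target = sqlite3.connect(":memory:")
    for conn in (source, target):
        conn.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
        conn.executemany("INSERT INTO customers VALUES (?, ?)", rows)

    print(compare_tables(source, target, "customers", "id"))

    # Simulate a value that was corrupted during migration.
    target.execute("UPDATE customers SET name = 'Grce' WHERE id = 2")
    print(compare_tables(source, target, "customers", "id"))
```

Note that hashing `repr(row)` only works when both drivers return the same Python types for each column. Across two different database engines you would normally normalize values first (for example, casting decimals and timestamps to canonical strings), and for large tables you would compare per-partition aggregates rather than streaming every row.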
Even if your data stack isn't cutting edge, staying up to date on industry best practices will help you identify incremental improvements. Small changes like adding metadata, improving data validation, and increasing monitoring can improve reliability and agility. With the right observability strategy, your non-modern data stack can still enable data-driven decision-making. The key is taking a thoughtful approach to implementing changes that work within your technical constraints.
Schema change detection