Thought leadership
November 3, 2022

20 questions for data observability readiness

Is your company ready to implement data observability? How will you do it? Get started by asking yourself these 20 important questions that will help you orient yourself in the data observability universe.

Kyle Kirwan

The data basics

1. Where is your data?

Is your data in a data warehouse, a production database, spreadsheets, an S3 bucket, or Kafka? Most data observability tools, like Bigeye, monitor data at rest in the data warehouse. They also support traditional databases like Postgres and MySQL. Make sure your data storage is compatible with the data observability tools on the market.

2. How much data do you have?

Understand how much data is being generated by your production applications every day, and what kind of data that is (for example: images, text, or video).

3. Is your volume of data increasing steadily?

As the volume of data increases, it becomes harder to query (queries take longer and cost more compute), and harder to keep track of its state and quality. If your data is steadily growing, you may be ready for data observability investments.

4. How big is your data team?

Your data team size will help determine the optimal way you tackle data quality. If you have a data team of one, it probably makes more sense to add some basic data checks with open-source tools (and leverage their communities) instead of buying something off the shelf. Once you have a small data team (5-7 people), you should begin to think about testing risky assumptions with SQL, setting up continuous integration, and implementing basic data monitoring.
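For example, a risky assumption like "every order points at a real customer" can be tested with one SQL query. Here's a minimal Python sketch; the orders and customers tables, and the SQLite database standing in for your warehouse, are hypothetical:

```python
# A risky assumption ("every order points at a real customer") tested with
# plain SQL. The orders/customers tables and the SQLite database standing
# in for a warehouse are hypothetical.
import sqlite3

def orphaned_order_count(conn: sqlite3.Connection) -> int:
    """Count orders that reference a customer that doesn't exist."""
    (count,) = conn.execute(
        """
        SELECT COUNT(*)
        FROM orders o
        LEFT JOIN customers c ON c.id = o.customer_id
        WHERE c.id IS NULL
        """
    ).fetchone()
    return count

if __name__ == "__main__":
    conn = sqlite3.connect("analytics.db")  # placeholder path
    orphans = orphaned_order_count(conn)
    assert orphans == 0, f"{orphans} orders reference a missing customer"
```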

5. What are the most commonly-accessed datasets?

There are probably a few key tables upon which your business is heavily reliant. Identify them as candidates for more granular monitoring. For other tables that are less frequently accessed, you might just monitor a few metrics. This bifurcation is referred to as T-shaped monitoring.
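One way to picture T-shaped monitoring is as a tiered configuration: deep metric coverage on the handful of critical tables (the stem of the T), and a thin baseline across everything else (the bar). A hypothetical sketch:

```python
# Hypothetical sketch of a T-shaped monitoring plan: the few critical
# tables get deep metric coverage; every other table gets a thin baseline.
DEEP_METRICS = ["freshness", "row_count", "null_rate", "duplicates", "outliers"]
BASELINE_METRICS = ["freshness", "row_count"]

CRITICAL_TABLES = {"orders", "payments", "users"}  # your heavily-used tables

def metrics_for(table: str) -> list[str]:
    return DEEP_METRICS if table in CRITICAL_TABLES else BASELINE_METRICS

print(metrics_for("orders"))      # deep coverage
print(metrics_for("email_logs"))  # baseline only
```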

6. Who are the major consumers of the data and what are they using it for?

Understand who is consuming the different tables in your dataset; whether it's analytics dashboards for executives, or a machine learning model for fraud detection. Identifying these consumers will also help you get answers to the next set of questions.

The cost of bad data

7. Do executives and engineers currently trust the data? Do you have an NPS score for that?

To quantify the general trust and user perception of the data, send out a single-question survey to engineers and executives to measure NPS (Net Promoter Score).

The question to ask is likely some variation of: How easily and confidently can you answer your questions using company data?
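Scoring works the same as any NPS survey, assuming the standard 0-10 scale: the share of promoters (9-10) minus the share of detractors (0-6). A quick sketch:

```python
# NPS from 0-10 survey responses: percent promoters (9-10) minus percent
# detractors (0-6); passives (7-8) count toward the total only.
def nps(scores: list[int]) -> float:
    promoters = sum(1 for s in scores if s >= 9)
    detractors = sum(1 for s in scores if s <= 6)
    return 100.0 * (promoters - detractors) / len(scores)

print(nps([10, 9, 8, 7, 6, 3]))  # 0.0: two promoters cancel two detractors
```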

8. How many data outages have you had in the last quarter? What was their cost?

If you’ve had data incidents in the past quarter, write those down and determine their user-facing cost.

9. How sensitive are your machine learning models to out-of-date data?

Talk to your machine learning engineers and figure out how sensitive the models are to out-of-date or missing data. The less robust the model is, the more important it is to make sure it’s given solid inputs.
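One cheap safeguard is a freshness gate in front of training or inference: refuse to score if the inputs are older than the model can tolerate. A sketch, where the features table, its updated_at column, and the six-hour tolerance are all assumptions:

```python
# Sketch of a freshness gate in front of a model: refuse to score if the
# input table is stale. The table name, column, and 6-hour tolerance are
# assumptions; timestamps are assumed to be ISO-8601 strings in UTC.
import sqlite3
from datetime import datetime, timedelta, timezone

MAX_STALENESS = timedelta(hours=6)

def ok_to_score(conn: sqlite3.Connection) -> bool:
    (ts,) = conn.execute("SELECT MAX(updated_at) FROM features").fetchone()
    latest = datetime.fromisoformat(ts).replace(tzinfo=timezone.utc)
    return datetime.now(timezone.utc) - latest <= MAX_STALENESS
```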

10. Are you planning on an imminent IPO or exit?

If the company is planning to IPO or exit soon, this often serves as a catalyst to get the company's data in better shape. There's a general understanding that mistakes in top-line metric reporting that are tolerated at a startup are not tolerated in the public markets, and may even have legal ramifications.

11. Are you doing a data migration soon?

Similarly, investments in data observability are often triggered by large engineering initiatives, like a data migration from one data warehouse to another. If you are planning on doing one of these soon, now may be a good time to invest in data observability, to lower the risk that you accidentally delete historical data that you shouldn’t, or migrate tables over incorrectly.

The data observability status quo

12. Have you set up data tests?

Simple data tests, like dbt tests or Great Expectations suites, are easy to set up and pay big dividends. They should be among your top data observability priorities.
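If you're not ready to adopt a framework yet, even hand-rolled checks in the same spirit catch a lot. A minimal sketch, with hypothetical table and column names, where each check's SQL returns a count of violating rows:

```python
# Minimal hand-rolled data tests in the spirit of dbt tests or Great
# Expectations suites. Table and column names are hypothetical; each
# check's SQL returns the number of violating rows.
import sqlite3

CHECKS = {
    "order_id is never null":
        "SELECT COUNT(*) FROM orders WHERE order_id IS NULL",
    "order_id is unique":
        "SELECT COUNT(order_id) - COUNT(DISTINCT order_id) FROM orders",
    "amount is non-negative":
        "SELECT COUNT(*) FROM orders WHERE amount < 0",
}

def run_checks(conn: sqlite3.Connection) -> None:
    failures = []
    for name, sql in CHECKS.items():
        (violations,) = conn.execute(sql).fetchone()
        if violations:
            failures.append(f"{name}: {violations} violating rows")
    if failures:
        raise AssertionError("; ".join(failures))
```

Wired into CI (see the next question), checks like these fail the build before a bad change ships.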

13. Do you have change management and CI around your data?

Data tests should be paired with CI/CD to run every time a change is made to the data pipeline.

14. Do you have a staging environment for your data? How consistent is it with your production environment?

Along with the CI and the data tests, you should have a separate staging environment where data engineers can test queries before having others review them. This prevents breaking changes from polluting the production database.

15. Do you find yourself trying to extend/schedule your data tests?

If you find yourself trying to extend your data tests, and build frameworks and visualization platforms on top of them, you may actually be ready for data observability. Data tests and data observability differ in that tests only cover what you've already asserted to be correct, while observability is about knowing the complete state of the system.
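The difference shows up in code: a test asserts a rule you already know, while observability tracks metrics over time and flags unexpected change. A toy sketch of the latter, with the anomaly threshold and sample row counts made up for illustration:

```python
# Sketch of the observability side: keep a history of a table metric (daily
# row counts, here) and flag values far from the recent mean (simple z-score).
import statistics

def is_anomalous(history: list[float], latest: float, z: float = 3.0) -> bool:
    """Flag `latest` if it sits more than `z` standard deviations from the mean."""
    if len(history) < 5:  # not enough history to judge
        return False
    mean = statistics.mean(history)
    stdev = statistics.stdev(history) or 1e-9  # guard against zero variance
    return abs(latest - mean) / stdev > z

row_counts = [10_120, 10_340, 10_290, 10_410, 10_380]  # prior daily loads
print(is_anomalous(row_counts, 4_200))  # True: volume dropped unexpectedly
```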

16. How long does it take for the data scientists to find the data they need?

When data scientists begin a new project, how long does it take for them to find where the necessary data is? Are the datasets well documented, or is it mostly institutional knowledge that requires a number of Slack messages and emails to get to?

17. Are you able to definitively answer questions about key metrics at the company in under ten minutes?

For topline metrics like revenue, orders, website traffic, and search traffic, how long does it take for a product manager to get a number that they trust and are willing to make product and business decisions on? If it's more than ten minutes, there's something wrong.

18. Do your engineers understand how data flows through the system and how it is transformed?

If a new data engineer joins the team, would each of the tenured engineers be able to draw an architecture diagram and explain how data flows through the system, from the production boxes, to the data warehouse? If not, why not? If the reason is that the data flows are too complicated, it may be time to look into data lineage solutions.
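Even a hand-maintained lineage map beats having none. A toy sketch of the idea, with hypothetical table names (a real lineage tool would derive this automatically from query logs):

```python
# Toy lineage map: which upstream tables feed each downstream table.
LINEAGE = {
    "raw_events":     [],
    "raw_orders":     [],
    "stg_orders":     ["raw_orders"],
    "fct_revenue":    ["stg_orders", "raw_events"],
    "exec_dashboard": ["fct_revenue"],
}

def upstream(table: str) -> set[str]:
    """All transitive upstream dependencies of a table."""
    deps = set(LINEAGE.get(table, []))
    for parent in list(deps):
        deps |= upstream(parent)
    return deps

# Everything the exec dashboard ultimately depends on:
print(upstream("exec_dashboard"))
```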

Specific data observability decisions

19. Do you want to monitor at the source or at the destination?

We generally recommend that you monitor for data quality at the destination (in the data warehouse, once all the transformations have occurred) rather than the source (the web applications where the data is actually being generated). However, one reason to monitor at the source is that you will catch problems before they percolate downstream.

20. Is data observability technology core to your business?

Once you've figured out that you want to invest in data observability, you should document the business case for it. We generally recommend that unless data observability technology is actually core to your business, you buy something off the shelf.

For context on the build-versus-buy question, here's an illustrative estimate of what staffing an in-house build can cost:

Resource | Monthly cost ($) | Number of resources | Time (months) | Total cost ($)
Software/Data engineer | 15,000 | 3 | 12 | 540,000
Data analyst | 12,000 | 2 | 6 | 144,000
Business analyst | 10,000 | 1 | 3 | 30,000
Data/product manager | 20,000 | 2 | 6 | 240,000
Total cost | | | | 954,000
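Each line item in that estimate is simply monthly cost x headcount x months; a quick sanity check of the arithmetic:

```python
# Each line item is monthly cost x headcount x months; figures from the table.
roles = [
    ("Software/Data engineer", 15_000, 3, 12),
    ("Data analyst",           12_000, 2,  6),
    ("Business analyst",       10_000, 1,  3),
    ("Data/product manager",   20_000, 2,  6),
]
total = 0
for name, monthly, headcount, months in roles:
    cost = monthly * headcount * months
    total += cost
    print(f"{name}: ${cost:,}")
print(f"Total: ${total:,}")  # -> Total: $954,000
```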
Different roles also get different things out of data observability. Mapping each role's goals to its common needs can help scope the rollout:

Role | Goals | Common needs
Data engineers | Overall data flow. Data is fresh and operating at full volume. Jobs are always running, so data outages don't impact downstream systems. | Freshness and volume monitoring; schema change detection; lineage monitoring
Data scientists | Specific datasets in great detail. Looking for outliers, duplication, and other, sometimes subtle, issues that could affect their analysis or machine learning models. | Freshness monitoring; completeness monitoring; duplicate detection; outlier detection; distribution shift detection; dimensional slicing and dicing
Analytics engineers | Rapidly testing the changes they're making within the data model. Move fast without breaking things, and without spending hours writing tons of pipeline tests. | Lineage monitoring; ETL blue/green testing
Business intelligence analysts | The business impact of data. Understand where they should spend their time digging in, and when they have a red herring caused by a data pipeline problem. | Integration with analytics tools; anomaly detection; custom business metrics; dimensional slicing and dicing
Other stakeholders | Data reliability. Customers and stakeholders don't want data issues to bog them down, delay deadlines, or provide inaccurate information. | Integration with analytics tools; reporting and insights
