Product
February 8, 2023

Using Bigeye Collections to organize and scale your monitoring

There are four different ways to use Bigeye Collections to organize and scale your monitoring. Let's explore each one.

Jon Hsieh

You’ve onboarded your tables, deployed Autometrics, and you're monitoring your data. Over time, your inner Marie Kondo may feel the need to get even more organized. How can Bigeye help you achieve this organized state?

Bigeye Collections give you the ability to bundle metrics and route notifications. They present a shared view that a team can own. Different users may want to track different things, so owners and consumers can mark a collection as a favorite to quickly surface its status.

Each collection has an owner, or a team of owners, responsible for keeping it healthy and communicating with its consumers. The owners decide how to respond when a notification flags a problem.
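Conceptually, a collection is just a named bundle of metrics with owners and a notification route. Here is a minimal Python sketch of that idea; the class and field names are illustrative assumptions, not Bigeye's API.

```python
from dataclasses import dataclass, field

@dataclass
class Collection:
    """Illustrative model of a collection: a named bundle of metrics,
    a set of owners, and a notification route (not Bigeye's actual API)."""
    name: str
    owners: list[str]                    # team or individuals responsible
    notification_channel: str            # e.g. a Slack channel or email alias
    metric_ids: list[int] = field(default_factory=list)
    favorited_by: set[str] = field(default_factory=set)

    def favorite(self, user: str) -> None:
        # Owners and consumers can favorite a collection to surface its status.
        self.favorited_by.add(user)
```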

How do people organize their collections in practice? Here are a few strategies:

1. By team and project: One ETL silo vs. another ETL silo

2. By source system: Stitch, Fivetran, and dbt operations

3. By metric type: Data ops / data reliability engineer vs. data analyst

4. By urgency: Route notifications to PagerDuty vs. specific Slack channels vs. email

By team and project

Most commonly, users employ Bigeye Collections to divide monitoring by team. Each team is responsible for data that corresponds to a subset of the data warehouse or data lake, and notifications get routed specifically to them. Users can go one level deeper with a collection per project or dataset. This extra organizational level keeps short-lived datasets and experiments separate from long-lived production datasets.

These datasets are generally owned by a line of business or by different functions within a company. For example, Marketing may have metrics about campaigns, Engineering about app reliability, Sales about the sales funnel, and Product about usage analytics. Each group cares primarily about its own datasets, because those datasets drive its business decisions and actions. When anomalies are detected in these datasets, the owning group wants to be informed and to understand the issue so they can act on it.

By federating responsibility for data quality, long-lived datasets are easier to share across teams. Bigeye’s checks provide part of a data contract, enabling data engineers and analysts to trust data as they produce new insights by combining different datasets. For example, a Product team may need data from Engineering, Sales, and Marketing to segment feature usage by customer vertical, customer pipeline status, or marketing campaign.

By source system

Some organizations give one team responsibility for operating data ingest, while other teams handle transformations and reporting. The ingest team enables the lines of business that have the domain expertise to use the data. They may not know the data itself, but they own the processes and operations of tools like Fivetran, Stitch, and Airbyte that bring the data in.

In these situations, the Bigeye user is a data ops person who creates collections with the operational metrics (freshness, row count) for each group of ingested tables. These metrics typically cover a wide set of tables and schemas. Bigeye notifications for these collections may be sent to the same channels where the ingest tools send their own notifications. A sketch of this pattern follows the examples below.

Examples:

  • Engineering wants to analyze data from an application’s database. They create a collection for the application’s Postgres database and for the data warehouse tables that a Stitch job replicates into. In this case, a Bigeye Deltas job can further validate that the replicated data matches the source.
  • Sales and Marketing wants funnel data to project revenue growth. They create a collection for the data warehouse replica of Hubspot or Salesforce data that Fivetran has ingested.
  • Product wants analytics data from Heap. They create a collection with operational metrics against a data set shared via a Snowflake share.
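As a rough illustration of this pattern, the sketch below builds one operational collection per ingest source and attaches freshness and row-count checks to every replicated table. The table names, channel naming, and Check structure are assumptions for illustration, not Bigeye's API.

```python
from dataclasses import dataclass

# The two operational metrics this article highlights for ingest monitoring.
OPERATIONAL_METRICS = ("freshness", "row_count")

@dataclass
class Check:
    table: str    # fully qualified table name in the warehouse
    metric: str   # one of OPERATIONAL_METRICS

def operational_collection(source: str, tables: list[str]) -> dict:
    """Bundle freshness and row-count checks for every table a source ingests."""
    checks = [Check(table, metric) for table in tables for metric in OPERATIONAL_METRICS]
    return {
        "name": f"{source} ingest ops",
        # Route alerts to the same channel where the ingest tool posts its own notifications.
        "notification_channel": f"#data-ops-{source.lower()}",
        "checks": checks,
    }

# Hypothetical Fivetran-replicated tables for the Sales and Marketing example above.
fivetran_tables = ["salesforce.opportunity", "salesforce.account", "hubspot.contacts"]
print(operational_collection("Fivetran", fivetran_tables))
```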

By metric type

Some customers split collections to focus on different facets of their monitoring: “pipeline reliability” collections, “data quality” collections, and “business insight” collections, each corresponding to different stakeholders. Problems flagged by the operational collections need to be dealt with before schema-constraint problems, and schema-constraint problems need to be resolved before the data can produce clean business metrics. Splitting these into separate collections helps prioritize which issues to handle first and routes notifications to the team responsible for each tier.

A “pipeline reliability” collection is responsible for making sure that data pipelines are connected and that data is flowing. A collection that tracks only freshness and row count metrics gives a data ops person a concise, at-a-glance health summary. The pipeline reliability collection often ensures that data from an app makes it into the data warehouse. It can also ensure that data pipelines (dbt, Airflow) continue to flow, or that backfills have returned data to a steady state. These collections are commonly slightly larger than average, with 20-100 metrics corresponding to a few dozen tables spread across several schemas.

A “data quality” collection is responsible for enforcing that pipeline data is clean. This is tracked with data-constraint metrics such as null rates or primary-key uniqueness. These metrics are most relevant to the data engineers who built the pipelines and to the people consuming the data. These collections typically have fewer than 30 metrics on tables in a schema.

Finally, a “business insight” collection is responsible for identifying when the values of the data stray from the norm. A detected anomaly may indicate a semantic change, a bug in the pipeline, or a new trend that can impact data-driven decision making. For example, one could track usage by customer segment for different features or different engagement patterns. You’d use a grouped metric for dimension tracking, or perhaps relative metrics to ensure values are monotonically increasing or decreasing. These collections typically have fewer than 30 metrics on a few tables in a schema.
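One way to picture this split is a simple mapping from metric type to tier, as sketched below. The metric names and the mapping itself are illustrative assumptions; in practice you would group whichever Bigeye metrics your teams rely on.

```python
# Illustrative mapping of metric types to the three collection tiers described above.
# The metric names are placeholders, not Bigeye identifiers.
TIERS = {
    "pipeline_reliability": {"freshness", "row_count"},            # is data flowing?
    "data_quality": {"null_rate", "primary_key_uniqueness"},       # is data clean?
    "business_insight": {"grouped_usage", "relative_change"},      # do values look normal?
}

def tier_for(metric_type: str) -> str:
    """Return the collection tier a metric belongs to (default: business insight)."""
    for tier, metric_types in TIERS.items():
        if metric_type in metric_types:
            return tier
    return "business_insight"

assert tier_for("row_count") == "pipeline_reliability"
assert tier_for("null_rate") == "data_quality"
```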

By urgency

Some Bigeye users divide their metric collections by alert urgency. For example, some users have an “urgent” SLA that sends email to PagerDuty to page the on-call person. They also have a “normal” SLA that sends messages to Slack for regular business-hours processing.

Collections labeled “urgent” tend to contain a few critical metrics on production tables. Conversely, collections labeled “non-critical” tend to have hundreds of metrics and are treated as informational rather than actionable.
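The routing logic itself is simple, as the sketch below suggests. The destination strings are placeholders, and in Bigeye the notification channel is configured on the collection rather than in code.

```python
# Illustrative urgency-based routing for collection notifications.
# Destinations are placeholders; Bigeye configures channels per collection.
def route_alert(collection: str, urgent: bool) -> tuple[str, str]:
    """Return (collection, destination) so the alert message can name its collection."""
    if urgent:
        # A few critical metrics on production tables page the on-call person.
        return collection, "pagerduty:data-oncall"
    # Everything else goes to Slack for business-hours review.
    return collection, "slack:#data-quality-feed"

print(route_alert("prod revenue facts [urgent]", urgent=True))
print(route_alert("staging experiments", urgent=False))
```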

Combining conventions

The strategies above do not need to be used in isolation. They can be combined to create collections at manageable scales. In fact, roughly 70% of our users’ collections follow a naming convention and combine at least two of the techniques into their collection names; a small naming helper is sketched after the templates below.

Example naming templates:

  • [team] [project] [ops|schema|biz] [priority]
  • [source] [ops|schema] [priority]
  • [team] [priority]
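A tiny helper like the one below can keep those names consistent. The field values are placeholders, and the separator and vocabulary are assumptions you would adapt to your own convention.

```python
# Illustrative helper that composes collection names from the templates above.
def collection_name(*parts: str) -> str:
    """Join the non-empty naming fields with a single space."""
    return " ".join(part for part in parts if part)

print(collection_name("Marketing", "campaigns", "ops", "P1"))  # [team] [project] [ops|schema|biz] [priority]
print(collection_name("Fivetran", "schema", "P2"))             # [source] [ops|schema] [priority]
print(collection_name("Sales", "P1"))                          # [team] [priority]
```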

Most Bigeye Collections have fewer than 30 metrics in them. This grouping lets data platform owners and leaders manage data pipelines at a coarser grain.

Summary

Keeping your Bigeye Collections organized and human-readable helps keep your metrics healthy and your monitoring reliable. There's no one right way to manage your Bigeye Collections, but hopefully the strategies above give you inspiration for the routes you might take, depending on your team, goals, and time frames. Ultimately, if your collections facilitate an easier handoff of responsibility between teams, your inner Marie Kondo can be happy.

