April 27, 2023

The complete guide to understanding data SLAs

An SLA is a "service level agreement." If you have questions about what an SLA means, what it is for, or what it contains, this post will walk you through it.

Liz Elfman

In the world of software engineering, companies like Slack, Stripe, and Zoom ensure 24/7 service availability by measuring performance and publishing SLAs that define the expected behavior of their software. These companies maintain high levels of reliability despite making rapid changes to their services. For example, Stripe reported 99.99% uptime over 90 days for their API, even while deploying code changes multiple times per day.

Data Service Level Agreements (data SLAs) are analogous to SLAs for software. They guarantee a certain quality and availability for data assets. This comprehensive guide will provide insights into what data SLAs are, their importance, implementation, examples, and when it's time to establish them in your organization.

What are data SLAs?

Data SLAs are agreements between data providers and data consumers that outline the expected level of data quality and data observability.

Data SLAs come from the broader concept of Service Level Agreements (SLAs), which are formal commitments between service providers and their stakeholders, or between different departments within an organization. SLAs define the expected level of service, along with the consequences if these expectations are not met.

Who uses data SLAs?

Data SLAs should be joint projects between the data platform teams, which are responsible for providing data to different departments within a company, and the teams that are consuming the data, such as the product, finance, and marketing teams.

Why are data SLAs important?

Data SLAs help bridge the gap between data engineers and consumers by providing clear expectations and accountability for data quality. They also help mediate the needs of different consumers of the data.

For example, the product team might want to move fast and make changes to their data whenever necessary, while other teams like marketing and finance might expect the data to remain stable and reliable. The SLA is the pre-defined source of truth for those opposing viewpoints.

Implementing data SLAs

Implementing data SLAs is a multi-step process that starts with identifying the data assets that require SLAs, such as executive-facing dashboards, core machine-learning models, or core tables. The next step is assembling the constituent components of a data SLA: the SLIs and SLOs. The components are as follows:

Service Level Indicators (SLIs)

An SLI is a quantifiable, agreed-upon measurement of the data. For example, a team might decide to measure the duplicate rate of user records in a users table. This measurement can then be monitored and used to evaluate the health of the data. In addition to duplicate rate, there are various other aspects of data that could be measured using SLIs, such as:

  • data freshness
  • nulls and blanks
  • out-of-range values
  • formatting issues

By establishing a set of SLIs, data platform teams can avoid time-consuming back-and-forth conversations and focus on clearly quantified measurements of data quality.
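
As a minimal sketch of what two such SLIs could look like in code, the snippet below computes a duplicate rate and a null-or-blank rate over plain Python lists. The function names `duplicate_rate` and `null_rate` are hypothetical, and a real pipeline would more likely run the equivalent query in the warehouse.

```python
from collections import Counter
from typing import Iterable, Optional


def duplicate_rate(user_ids: Iterable[str]) -> float:
    """Share of rows whose user ID repeats an ID seen on another row."""
    ids = list(user_ids)
    if not ids:
        return 0.0
    counts = Counter(ids)
    # Every occurrence beyond the first one counts as a duplicated row.
    extra_rows = sum(n - 1 for n in counts.values())
    return extra_rows / len(ids)


def null_rate(values: Iterable[Optional[str]]) -> float:
    """Share of values that are missing (None) or blank."""
    vals = list(values)
    if not vals:
        return 0.0
    missing = sum(1 for v in vals if v is None or v.strip() == "")
    return missing / len(vals)
```

For instance, `duplicate_rate(["a", "b", "a", "c"])` counts one repeated row out of four.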

Service Level Objectives (SLOs)

SLOs are targets set for the performance of the various attributes measured by SLIs. These targets help define what is considered normal or acceptable for a given data aspect. For instance, a team may decide that a 0.25% duplicate user ID rate is tolerable, but anything above 1-2% would negatively impact other processes or teams, such as finance or machine learning models.
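
One way to encode such a target is sketched below. The class `DuplicateRateSLO`, the function `evaluate`, and the three status labels are hypothetical. The example above does not say how to treat rates between 0.25% and 1%, so the "warning" band is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DuplicateRateSLO:
    # 0.25% duplicate user IDs is the tolerable level from the example.
    target: float = 0.0025
    # Assumption: use the lower end of the "1-2%" range as the harm threshold.
    harmful: float = 0.01


def evaluate(rate: float, slo: DuplicateRateSLO = DuplicateRateSLO()) -> str:
    """Classify a measured duplicate rate against the SLO."""
    if rate <= slo.target:
        return "ok"
    if rate <= slo.harmful:
        return "warning"  # assumed middle band, not defined in the example
    return "critical"
```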

Service Level Agreements (SLAs)

In the final step, SLIs and SLOs are packaged up into SLAs. An SLA not only commits to keeping the SLI within the SLO, but also defines what happens when that SLO is not met.

For example, suppose the data team is tracking the duplicate rate of user UUIDs and aiming internally to stay within its SLO 99.5% of the time. In the SLA, however, they commit to a slightly lower level of reliability: 99%. Over the trailing 30 days, they commit to meeting the duplicate rate SLO 99% of the time, allowing up to 7.2 hours of downtime.

In the SLA, it’s also agreed that if this threshold is exceeded, the data team will halt all changes to the ELT jobs that feed the users table and stop changes to all upstream services. This commitment ensures that the data infrastructure remains stable and that upstream changes do not disrupt the users table.

Finally, SLAs can include escalation procedures if disagreements arise. For instance, in case of a disagreement over downtime, the issue could be escalated to the VP of Infrastructure for resolution. The SLA serves as a binding commitment to ensure that all stakeholders work together to maintain data quality and reliability.
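
The mechanics of such an agreement can be sketched as a compliance check over the trailing window. This is an illustration only: `sla_compliance` and `sla_actions` are hypothetical names, the committed level is passed in rather than fixed, and the returned actions simply restate the consequences described above.

```python
from typing import List, Sequence


def sla_compliance(check_results: Sequence[bool]) -> float:
    """Fraction of checks in the window where the SLI met its SLO."""
    if not check_results:
        raise ValueError("no checks recorded in the window")
    return sum(check_results) / len(check_results)


def sla_actions(check_results: Sequence[bool], committed_level: float) -> List[str]:
    """Return the agreed consequences if the commitment was missed."""
    if sla_compliance(check_results) >= committed_level:
        return []
    return [
        "halt changes to the ELT jobs that feed the users table",
        "stop changes to upstream services",
    ]
```

Escalation is left out of the returned list on purpose, since the SLA reserves it for disagreements rather than for every missed commitment.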

Examples of common data SLAs

  • Freshness: Guaranteeing that data is no more than a certain number of hours or days old.
  • Completeness: Ensuring a specific percentage of data is present and accurate.
  • Accuracy: Defining acceptable error rates for data values.
  • Availability: Ensuring a certain level of uptime for data storage and retrieval systems.
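
As a rough sketch, a freshness guarantee from the list above reduces to comparing the time of the last successful load against a maximum age. The function name `is_fresh` and its parameters are hypothetical, and how the load timestamp is obtained depends on the pipeline.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def is_fresh(
    last_loaded_at: datetime,
    max_age: timedelta,
    now: Optional[datetime] = None,
) -> bool:
    """True if the newest data is no older than max_age."""
    # Allow the caller to inject "now" so the check is easy to test.
    current = now if now is not None else datetime.now(timezone.utc)
    return current - last_loaded_at <= max_age
```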

For instance, let's consider a duplicate rate SLI with a 99.5% reliability target. We would measure this SLI every 30 minutes and track the results over a 30-day window. During this period, the data team is allowed a total of 3.6 hours when the duplicate rate exceeds the set threshold. If the duplicate rate surpasses the limit for more than 3.6 hours, the data team has not met its commitment to the company in terms of dataset reliability.
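
The same budget can be expressed in checks rather than hours. The sketch below assumes one check every 30 minutes (48 per day) and takes the reliability target as a string so that exact fractions avoid floating-point rounding. The function names are hypothetical.

```python
import math
from fractions import Fraction


def allowed_failed_checks(
    reliability: str, window_days: int = 30, checks_per_day: int = 48
) -> int:
    """Number of checks that may violate the threshold within the window."""
    budget = 1 - Fraction(reliability)          # e.g. "0.995" -> 1/200
    total_checks = window_days * checks_per_day  # 30 * 48 = 1440
    # Round down: a partial check cannot be "spent".
    return math.floor(budget * total_checks)


def remaining_budget(failed_checks: int, reliability: str) -> int:
    """Checks left before the commitment is missed (negative if already missed)."""
    return allowed_failed_checks(reliability) - failed_checks
```

With a 99.5% target, 1440 checks leave room for 7 failing checks, which is 3.5 hours at 30 minutes each and so stays just inside the 3.6-hour allowance.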

SLOs can be set at different levels of reliability, depending on the specific requirements and priorities of a dataset. Examples include:

  • 99.9% reliability (three nines): 43 minutes of downtime in a 30-day window.
  • 99.5% reliability: 3.6 hours of downtime (as shown in our example).
  • 99% reliability (two nines): 7.2 hours of downtime.
  • 95% reliability: 1.5 days of downtime.
  • 90% reliability: 3 days of downtime.
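
The figures in this list follow from multiplying the window length by one minus the reliability target. A small sketch that reproduces them, assuming a 30-day window (the function name `allowed_downtime` is hypothetical):

```python
from datetime import timedelta
from fractions import Fraction


def allowed_downtime(
    reliability: str, window: timedelta = timedelta(days=30)
) -> timedelta:
    """Time the SLI may be out of bounds within the window."""
    budget = 1 - Fraction(reliability)  # exact arithmetic, e.g. "0.999" -> 1/1000
    return timedelta(seconds=float(budget * int(window.total_seconds())))


for level in ("0.999", "0.995", "0.99", "0.95", "0.90"):
    print(level, allowed_downtime(level))
# 0.999 0:43:12
# 0.995 3:36:00
# 0.99 7:12:00
# 0.95 1 day, 12:00:00
# 0.90 3 days, 0:00:00
```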

Setting stricter SLOs signifies a stronger commitment to stakeholders regarding dataset reliability. For example, a 99.9% reliability target means that the SLI could be violated for roughly a minute and a half per day on average (43 minutes spread over 30 days), which is generally acceptable for applications like analytics dashboards.

Common signs it's the right time for data SLAs

It's always prudent to create an SLA between two teams. As with so many contracts, creating an SLA is a "better safe than sorry" action with little downside. However, there are a few common signs that it's time for your organization to build SLAs into the workflow right away. Those signs are:

  1. Frequent data quality issues impacting data consumers and their ability to trust the data
  2. Disagreements between different consumers on the definitions of data quality metrics
  3. A growing data engineering team that requires clear priorities and guidelines for managing data quality

Data SLAs are essential for maintaining high-quality data and ensuring that data consumers can trust the data they are working with. Implementing data SLAs can lead to better communication, prioritization, and accountability for data quality within an organization. By understanding and establishing data SLAs, businesses can optimize their data-driven decision-making processes and maximize the value of their data assets.

