April 27, 2023

The complete guide to understanding data SLAs

Liz Elfman

In the world of software engineering, companies like Slack, Stripe, and Zoom ensure 24/7 service availability by measuring performance and publishing SLAs that define the expected behavior of their software. These companies maintain high levels of reliability despite making rapid changes to their services. For example, Stripe reported 99.99% uptime over 90 days for their API, even while deploying code changes multiple times per day.

Data Service Level Agreements (data SLAs) are analogous to SLAs for software. They guarantee a certain quality and availability for data assets. This comprehensive guide will provide insights into what data SLAs are, their importance, implementation, examples, and when it's time to establish them in your organization.

What are data SLAs?

Data SLAs are agreements between data providers and data consumers that outline the expected level of data quality and data observability.

Data SLAs come from the broader concept of Service Level Agreements (SLAs), which are formal commitments between service providers and their stakeholders, or between different departments within an organization. SLAs define the expected level of service, along with the consequences if these expectations are not met.

Who uses data SLAs?

Data SLAs should be joint projects between the data platform teams, which are responsible for providing data to different departments within a company, and the teams that are consuming the data, such as the product, finance, and marketing teams.

Why are data SLAs important?

Data SLAs help bridge the gap between data engineers and consumers by providing clear expectations and accountability for data quality. They also help mediate the needs of different consumers of the data.

For example, the product team might want to move fast and make changes to their data whenever necessary, while other teams like marketing and finance might expect the data to remain stable and reliable. The SLA is the pre-defined source of truth for those opposing viewpoints.

Implementing data SLAs

Implementing data SLAs is a multi-step process. It starts with identifying the assets that require SLAs, such as executive-facing dashboards, core machine-learning models, or core tables. The next step is assembling the building blocks of a data SLA, the SLIs and SLOs, which are then packaged into the SLA itself. The components are as follows:

Service Level Indicators (SLIs)

SLIs are a quantifiable and agreed-upon measurement of the data. For example, a team might decide to measure the duplicate rate of user records in a users table and set a limit on the acceptable percentage of duplicated records. This measurement can then be monitored and used to evaluate the health of the data. In addition to duplicate rate, there are various other aspects of data that could be measured using SLIs, such as:

data freshness
nulls and blanks
out-of-range values
formatting issues

By establishing a set of SLIs, data platform teams can avoid time-consuming back-and-forth conversations and focus on clearly quantified measurements of data quality.
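
As a minimal sketch of what such measurements could look like, the Python below computes three of the SLIs mentioned above (duplicate rate, null rate, and freshness) over a small in-memory list of rows. The function names, the column names, and the sample `users` rows are all hypothetical and chosen only for illustration; a real implementation would more likely run equivalent queries against the warehouse.

```python
from datetime import datetime, timedelta, timezone


def duplicate_rate(rows, key):
    """Fraction of rows whose key value repeats an earlier row's key."""
    if not rows:
        return 0.0
    seen = set()
    duplicates = 0
    for row in rows:
        value = row[key]
        if value in seen:
            duplicates += 1
        else:
            seen.add(value)
    return duplicates / len(rows)


def null_rate(rows, column):
    """Fraction of rows where the column is None or an empty string."""
    if not rows:
        return 0.0
    missing = sum(1 for row in rows if row.get(column) in (None, ""))
    return missing / len(rows)


def freshness_hours(rows, timestamp_column, now):
    """Hours elapsed since the most recently updated row."""
    latest = max(row[timestamp_column] for row in rows)
    return (now - latest).total_seconds() / 3600


# Hypothetical sample data: one duplicated user_id, two missing emails.
now = datetime(2023, 4, 27, 12, 0, tzinfo=timezone.utc)
users = [
    {"user_id": "a1", "email": "a@example.com", "updated_at": now - timedelta(hours=5)},
    {"user_id": "b2", "email": None, "updated_at": now - timedelta(hours=3)},
    {"user_id": "a1", "email": "a@example.com", "updated_at": now - timedelta(hours=2)},
    {"user_id": "c3", "email": "", "updated_at": now - timedelta(hours=1)},
]

slis = {
    "duplicate_rate": duplicate_rate(users, "user_id"),
    "email_null_rate": null_rate(users, "email"),
    "freshness_hours": freshness_hours(users, "updated_at", now),
}
print(slis)
```

Each function returns a single number, which is what makes an SLI easy to agree on, monitor, and compare against a target.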

Service Level Objectives (SLOs)

SLOs are targets set for the attributes measured by SLIs. These targets define what is considered normal or acceptable for a given aspect of the data. For instance, a team may decide that a 0.25% duplicate user ID rate is tolerable, but that anything above 1-2% would negatively impact downstream processes or teams, such as finance or machine learning models.
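
One way to picture an SLO is as a small object that holds the target and classifies an observed SLI value against it. The sketch below uses the 0.25% figure from the example above; the class name, the status labels, and the choice of 1% (the lower end of the 1-2% range) as the "harmful" threshold are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DuplicateRateSLO:
    """Hypothetical SLO for the duplicate user ID rate."""

    target: float = 0.0025        # 0.25%: the tolerable rate from the example
    harmful_above: float = 0.01   # assumption: lower end of the 1-2% range

    def classify(self, observed_rate: float) -> str:
        """Return a status label for an observed duplicate rate."""
        if observed_rate <= self.target:
            return "within_slo"
        if observed_rate <= self.harmful_above:
            return "violated"
        return "violated_harmful"


slo = DuplicateRateSLO()
for rate in (0.001, 0.005, 0.02):
    print(f"{rate:.2%} -> {slo.classify(rate)}")
```

Keeping the thresholds in one place gives data providers and consumers a single definition to review and change together.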

Service Level Agreements (SLAs)

In the final step, SLIs and SLOs are packaged into SLAs. An SLA is an agreement that the SLI will stay within the SLO, and it also defines what happens when the SLO is not met.

For example, suppose the data team is tracking the duplicate rate of user UUIDs and internally aiming for 99.5% reliability. In the SLA, however, they commit to a slightly lower level of reliability: 99%. Over the trailing 30 days, they aim to keep the duplicate rate SLI within its target 99% of the time, allowing up to 7.2 hours of downtime.

In the SLA, it’s also agreed that if this threshold is exceeded, the data team will halt all changes to the ELT jobs that feed the users table and stop changes to all upstream services. This commitment ensures that the data infrastructure remains stable and that upstream changes do not disrupt the users table.

Finally, SLAs can include escalation procedures if disagreements arise. For instance, in case of a disagreement over downtime, the issue could be escalated to the VP of Infrastructure for resolution. The SLA serves as a binding commitment to ensure that all stakeholders work together to maintain data quality and reliability.
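
The pieces of this example (a downtime allowance, a consequence when it is exceeded, and an escalation path) can be captured in a small record. The following is only a sketch under stated assumptions: the `DataSLA` class, its field names, and the `evaluate` function are hypothetical, and the violation hours would in practice come from monitoring rather than being passed in by hand.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataSLA:
    """Hypothetical record of the terms agreed in a data SLA."""

    dataset: str
    sli: str
    window_days: int
    allowed_violation_hours: float
    breach_action: str
    escalation_contact: str


def evaluate(sla: DataSLA, violation_hours: float) -> str:
    """Compare hours spent in violation during the window with the allowance."""
    if violation_hours > sla.allowed_violation_hours:
        return f"BREACH: {sla.breach_action}"
    remaining = sla.allowed_violation_hours - violation_hours
    return f"OK: {remaining:.1f} hours of budget remaining"


users_sla = DataSLA(
    dataset="users",
    sli="duplicate rate of user UUIDs",
    window_days=30,
    allowed_violation_hours=7.2,
    breach_action="halt changes to the ELT jobs and upstream services feeding users",
    escalation_contact="VP of Infrastructure",
)

print(evaluate(users_sla, violation_hours=5.0))
print(evaluate(users_sla, violation_hours=9.5))
```

Writing the consequence and the escalation contact into the same record as the allowance keeps the agreement from being reduced to a number alone.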

Examples of common data SLAs

Freshness: Guaranteeing that data is no more than a certain number of hours or days old.
Completeness: Ensuring a specific percentage of data is present and accurate.
Accuracy: Defining acceptable error rates for data values.
Availability: Ensuring a certain level of uptime for data storage and retrieval systems.

For instance, let's consider a duplicate rate SLI with a 99.5% reliability target. We would measure this SLI every 30 minutes and track the results over a 30-day window. During this period, the data team is allowed a total of 3.6 hours when the duplicate rate exceeds the set threshold. If the duplicate rate surpasses the limit for more than 3.6 hours, the data team has not met its commitment to the company in terms of dataset reliability.
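
To make the arithmetic concrete: checking every 30 minutes for 30 days yields 1,440 checks, and a 99.5% target leaves room for 0.5% of them, or 7.2 checks, to fail. The sketch below works this out and evaluates a list of pass/fail results. The function names are hypothetical, and rounding the allowance down to whole checks (7 checks, or 3.5 hours) is an assumption about how a team might apply the budget.

```python
import math


def check_budget(reliability, window_days=30, interval_minutes=30):
    """Return (total checks in the window, failed checks the target allows)."""
    total_checks = window_days * 24 * 60 // interval_minutes
    # The small epsilon guards against float error pushing an exact
    # integer product (for example 10.0) just below itself before flooring.
    allowed_failures = math.floor(total_checks * (1 - reliability) + 1e-9)
    return total_checks, allowed_failures


def sla_met(check_results, reliability):
    """check_results holds one boolean per check: True if the SLI was within its limit."""
    failures = sum(1 for passed in check_results if not passed)
    allowed = math.floor(len(check_results) * (1 - reliability) + 1e-9)
    return failures <= allowed


total, allowed = check_budget(0.995)
print(total, allowed)  # 1440 checks, 7 of which may fail

# Hypothetical month: 7 failed checks is within budget, 8 is not.
print(sla_met([False] * 7 + [True] * (total - 7), 0.995))
print(sla_met([False] * 8 + [True] * (total - 8), 0.995))
```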

SLOs can be set at different levels of reliability, depending on the specific requirements and priorities of a dataset. Examples include:

99.9% reliability (three nines): 43 minutes of downtime in a 30-day window.
99.5% reliability: 3.6 hours of downtime (as shown in our example).
99% reliability (two nines): 7.2 hours of downtime.
95% reliability: 1.5 days of downtime.
90% reliability: 3 days of downtime.
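
The figures in this list follow from multiplying the 30-day window by the fraction of time the target leaves uncovered. A short sketch (the function name `downtime_budget` is hypothetical) reproduces them:

```python
from datetime import timedelta


def downtime_budget(reliability_percent, window_days=30):
    """Time the SLI may be out of bounds while still meeting the target."""
    window = timedelta(days=window_days)
    return window * (1 - reliability_percent / 100)


for level in (99.9, 99.5, 99.0, 95.0, 90.0):
    hours = downtime_budget(level).total_seconds() / 3600
    print(f"{level}% reliability: {hours:.2f} hours per 30 days")
```

Changing `window_days` shows how the same percentage translates into a different allowance over, say, a 7-day or 90-day window.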

Setting stricter SLOs signifies a stronger commitment to stakeholders regarding dataset reliability. For example, a 99.9% reliability target means that the SLI could be violated for roughly a minute and a half per day on average (43 minutes spread over 30 days), which is generally acceptable for applications like analytics dashboards.

Common signs it's the right time for data SLAs

It's always prudent to create an SLA between two teams. As with so many contracts, creating an SLA is a "better safe than sorry" action that doesn't have a downside. However, there are a few common signs that it's time for your organization to build SLAs into the workflow right away. Those signs are:

Frequent data quality issues impacting data consumers and their ability to trust the data
Disagreements between different consumers on the definitions of data quality metrics
A growing data engineering team that requires clear priorities and guidelines for managing data quality

Data SLAs are essential for maintaining high-quality data and ensuring that data consumers can trust the data they are working with. Implementing data SLAs can lead to better communication, prioritization, and accountability for data quality within an organization. By understanding and establishing data SLAs, businesses can optimize their data-driven decision-making processes and maximize the value of their data assets.
