Incident management for data teams
Software engineers use a framework for incident management to resolve issues. Here's how to apply a few core incident management concepts to data teams.
Incident management is a critical practice for data teams to quickly identify and resolve pipeline and data reliability issues before they harm business operations. While traditionally used in software engineering, the core concepts also apply to managing data systems.
What is incident management?
For software companies, an incident is an unplanned interruption to service or reduction in quality of service. For data teams, an incident could arise from issues in data pipelines that lead to incorrect or missing data in dashboards and applications. For example, when executives see dashboards with missing or clearly erroneous data, the resulting furor tends to cascade back through the organization to data engineers and scientists.
Implementing robust incident management practices allows data teams to identify issues early, mobilize to diagnose and resolve them, and prevent major business impacts. While incident management has been standard in software engineering and DevOps for years, many data teams are still building out these capabilities.
The incident management process
Incident management is a multistep process with the goal of detecting, diagnosing, and resolving incidents before they lead to significant disruption or impact.
The first step is detecting when an incident has occurred, ideally through automated monitoring and alerting rather than manual checks. Tools like Bigeye can automatically send alerts to the relevant stakeholders if the volume of data being output from a job suddenly drops, for example.
Monitoring for data systems typically falls into two types:
1. Data pipeline monitoring - This focuses on monitoring the health and performance of the data pipelines themselves. It involves tracking metrics like:
- Data freshness - When each data table/asset was last updated
- Data volume - The number of rows flowing through the pipeline
- Job durations - How long each ETL job takes to run
Data pipeline monitoring helps ensure smooth data flow and identify bottlenecks. It is usually handled by data engineering teams.
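As a rough illustration, the three pipeline metrics above can be checked with a few comparisons. This is a minimal sketch, not how any particular tool works: the `PipelineSnapshot` fields, the `check_pipeline` function, and all thresholds (24 hours, 50% of recent volume, 2x recent duration) are assumptions chosen for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class PipelineSnapshot:
    """Hypothetical metadata recorded for one table after a pipeline run."""
    table: str
    last_updated: datetime
    row_count: int
    job_duration_s: float


def check_pipeline(snapshot, now, recent_row_counts, recent_durations_s,
                   max_staleness=timedelta(hours=24),
                   min_volume_ratio=0.5,
                   max_duration_ratio=2.0):
    """Return a list of alert messages; an empty list means no check fired."""
    alerts = []

    # Freshness: how long ago was the table last updated?
    if now - snapshot.last_updated > max_staleness:
        alerts.append(
            f"{snapshot.table}: stale since {snapshot.last_updated.isoformat()}"
        )

    # Volume: compare this run's row count with the recent average.
    if recent_row_counts:
        baseline_rows = mean(recent_row_counts)
        if snapshot.row_count < min_volume_ratio * baseline_rows:
            alerts.append(
                f"{snapshot.table}: volume dropped to {snapshot.row_count} rows"
            )

    # Job duration: flag runs far slower than the recent average.
    if recent_durations_s:
        baseline_s = mean(recent_durations_s)
        if snapshot.job_duration_s > max_duration_ratio * baseline_s:
            alerts.append(
                f"{snapshot.table}: job took {snapshot.job_duration_s:.0f}s"
            )

    return alerts
```

In practice the thresholds would be tuned per table, since a weekly table and an hourly table have very different notions of "stale."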
2. Data quality monitoring - This focuses on monitoring the actual contents of the data. It involves assessing aspects like:
- Data accuracy - Are values in the expected range?
- Completeness - Are there missing or null values?
- Duplication - Are there duplicate rows?
- Consistency - Does data conform to expected formats?
Data quality monitoring helps identify issues in the data itself that may impact analytics and decisions. It is often handled by data science and analytics teams.
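The four quality dimensions above can each be expressed as a simple count of violations. The sketch below assumes rows are plain dictionaries; the `quality_report` function, the column names, and the rules passed to it are all hypothetical and exist only to illustrate the idea.

```python
import re
from collections import Counter


def quality_report(rows, key, required, ranges, patterns):
    """Count violations per quality dimension for a list of dict rows."""
    report = {"missing": 0, "out_of_range": 0, "bad_format": 0, "duplicates": 0}

    for row in rows:
        # Completeness: required columns must not be null.
        for col in required:
            if row.get(col) is None:
                report["missing"] += 1

        # Accuracy: numeric values must fall inside the expected range.
        for col, (low, high) in ranges.items():
            value = row.get(col)
            if value is not None and not (low <= value <= high):
                report["out_of_range"] += 1

        # Consistency: values must match the expected format.
        for col, pattern in patterns.items():
            value = row.get(col)
            if value is not None and not re.fullmatch(pattern, str(value)):
                report["bad_format"] += 1

    # Duplication: count every extra row that repeats an existing key.
    key_counts = Counter(row.get(key) for row in rows)
    report["duplicates"] = sum(n - 1 for n in key_counts.values() if n > 1)

    return report


# Hypothetical sample data with one violation of each kind.
rows = [
    {"claim_id": "C1", "amount": 120.0, "service_date": "2024-01-05"},
    {"claim_id": "C2", "amount": -5.0, "service_date": "2024-01-06"},
    {"claim_id": "C2", "amount": None, "service_date": "01/07/2024"},
]
report = quality_report(
    rows,
    key="claim_id",
    required=["claim_id", "amount"],
    ranges={"amount": (0, 1_000_000)},
    patterns={"service_date": r"\d{4}-\d{2}-\d{2}"},
)
```

An alert would then fire when any count in `report` exceeds whatever tolerance the team has agreed on for that table.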
Generally speaking, once a human has looked at an alert and verified that it is a real incident, that alert will be “promoted” into an incident, kicking off the incident management process.
Incident management solutions like FireHydrant and Lightstep give each incident a shared page that everyone inside the company can reference.
Once an incident is detected, the relevant team members are assembled to investigate and resolve it. While the exact roles will differ from company to company, here are some common ones:
- Incident Commander - The person in charge. They don't necessarily need to be an expert on the problematic system, but will generally be a more senior IC (individual contributor). This person decides what the team does next and who gets pulled in to help. They guide the process of solving the problem rather than solving the problem themselves.
- Scribe - Documents everything that happens for a complete record.
- Liaison - Handles communication with stakeholders, leadership, and potentially customers. This prevents the subject matter experts from being inundated (and distracted) with requests for updates from executives.
- Subject Matter Experts - Engineers and data scientists who can diagnose and solve issues.
With clearly defined responsibilities, team members can execute the response plan without confusion, even under pressure.
During the diagnosis phase, the team determines the scope, impact, and cause of the incident. A key aspect of this phase is assigning the appropriate severity level based on business impact. Examples of severity levels include:
- Sev 5 - Minor issue, fixed later. No user impact.
- Sev 4 - Some degradation to internal tools. Minimal user impact.
- Sev 3 - Significant degradation to internal tools. Notable user impact.
- Sev 2 - Partial outage of customer-facing services. High user impact.
- Sev 1 - Complete outage of critical services. Extreme user impact.
For a company like Uber, for example, data incidents that prevent the pricing algorithm from functioning properly could easily escalate to Sev 1 given the potential millions in lost revenue per hour. Appropriate severity levels allow the team to escalate and pull in resources as needed.
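To make classification consistent across responders, the severity ladder can be written down as code. This is only a sketch: the `classify` function, its input labels, and the paging policy in `should_page_on_call` are assumptions for illustration, and a real team would define impact in its own terms.

```python
from enum import IntEnum


class Severity(IntEnum):
    """Severity levels from the list above; a lower number is more severe."""
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3
    SEV4 = 4
    SEV5 = 5


def classify(customer_outage: str, internal_degradation: str) -> Severity:
    """Map observed impact to a severity level.

    customer_outage: "none", "partial", or "complete" (hypothetical labels)
    internal_degradation: "none", "some", or "significant" (hypothetical labels)
    """
    if customer_outage == "complete":
        return Severity.SEV1
    if customer_outage == "partial":
        return Severity.SEV2
    if internal_degradation == "significant":
        return Severity.SEV3
    if internal_degradation == "some":
        return Severity.SEV4
    return Severity.SEV5


def should_page_on_call(severity: Severity) -> bool:
    """Hypothetical policy: page someone immediately for Sev 1 and Sev 2."""
    return severity <= Severity.SEV2
```

Checking customer-facing impact first means an incident that touches both customers and internal tools is always rated by its worst effect.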
During the resolution phase, the team implements changes to resolve the root cause identified during diagnosis. For data teams, this can involve steps like:
- Rolling back dbt model changes
- Rerunning failed extractions
- Restarting Airflow tasks
- Backfilling tables to correct missing data
- Setting up fallbacks
Monitoring should be used to confirm resolution and verify that the appropriate business metrics have been restored.
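For the backfilling step, a useful first move is working out exactly which partitions are missing. The sketch below assumes a table partitioned by day and a list of dates that loaded successfully; `missing_partitions` is a hypothetical helper, not part of dbt or Airflow.

```python
from datetime import date, timedelta


def missing_partitions(loaded_dates, start, end):
    """Return every date in [start, end] that has no loaded partition."""
    loaded = set(loaded_dates)
    total_days = (end - start).days + 1
    all_days = (start + timedelta(days=offset) for offset in range(total_days))
    return [day for day in all_days if day not in loaded]


# Hypothetical example: two days were skipped during an outage.
loaded = [date(2024, 1, 1), date(2024, 1, 2), date(2024, 1, 5)]
to_backfill = missing_partitions(loaded, date(2024, 1, 1), date(2024, 1, 5))
```

Rerunning the same function after the backfill gives a quick confirmation that the gap is closed.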
During the closure step, the team documents the incident and identifies process improvements. The focus should be on learning rather than assigning blame. Post-incident analysis helps strengthen capabilities like monitoring, documentation, and runbooks that reduce the likelihood and impact of future incidents.
Example of data incident at a health tech company
Jake’s team at a health tech company relies on the timely delivery of clinical and financial data from various third-party vendors to power their analytics platform. One day, Jake notices that the latest claims data from one of their largest vendors has not arrived on schedule.
Jake has monitoring and alerting set up to notify him if daily claims files are not delivered within a two-hour window of the expected time. This allows rapid detection of a potential issue.
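A delivery check like Jake's could look roughly like the following. It is a sketch under assumptions: the window is interpreted as a grace period of two hours after the expected time, and the `delivery_status` function and its status labels are invented for this example.

```python
from datetime import datetime, timedelta


def delivery_status(expected_at, arrived_at, now, grace=timedelta(hours=2)):
    """Classify a daily file delivery; arrived_at is None if nothing has landed."""
    deadline = expected_at + grace
    if arrived_at is not None:
        # The file exists; it is late only if it landed after the deadline.
        return "on_time" if arrived_at <= deadline else "late"
    # No file yet; alert only once the grace period has passed.
    return "missing" if now > deadline else "pending"
```

A scheduler would call this on a short interval and send an alert when the status is `"missing"` or `"late"`.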
Seeing the alert, Jake assembles an incident response team including the data analyst who wrote the ETL responsible for the claims data pipeline, a data engineer familiar with the vendor's API, and himself as incident commander.
The team first confirms that the missing data is only impacting the claims data, and that other sources are updating as expected. They check the vendor's API logs and see errors indicating connection issues on the vendor side. Based on the magnitude of the claims data and the number of downstream reports impacted, they classify this as a Severity 2 incident.
Jake contacts his counterpart at the vendor to escalate the connection issue. Meanwhile, the data engineer verifies that stale claims data would be acceptable for the machine learning models, then manually reruns the data pipeline on it. He also adds a warning to the executive dashboards explaining that there has been a vendor outage.
Once claims data is back up to date, the incident is resolved. Jake's team documents the incident and the actions taken for future reference. Follow-up improvements might include an automatic fallback to stale data when appropriate.
An incident response process can enable data teams to promptly detect issues, mobilize the right personnel, diagnose the problem, implement fixes, and capture learnings, ultimately minimizing disruption to data consumers. As data teams increasingly borrow principles from software teams, incident management is one practice that shouldn’t be skipped.
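An automatic fallback of that kind might be sketched as below. Everything here is hypothetical: the `select_claims_data` function, the 26-hour freshness threshold, and the three-day fallback limit are placeholders that a team would replace with limits their downstream models can actually tolerate.

```python
from datetime import datetime, timedelta


class NoUsableDataError(Exception):
    """Raised when the newest data is too old even for a fallback."""


def select_claims_data(latest_loaded_at, now,
                       fresh_within=timedelta(hours=26),
                       fallback_within=timedelta(days=3)):
    """Decide whether to run on the latest data and whether to warn users."""
    age = now - latest_loaded_at
    if age <= fresh_within:
        # Normal case: data is current, no warning needed.
        return {"use_data": True, "show_warning": False}
    if age <= fallback_within:
        # Fallback case: run on stale data, but flag it on dashboards.
        return {"use_data": True, "show_warning": True}
    # Too old to trust: stop and let a human decide.
    raise NoUsableDataError(f"latest claims data is {age} old")
```

Raising an error past the fallback limit keeps the pipeline from silently serving very old data, which would otherwise turn a vendor outage into a data quality incident.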