Adrian Vidal
June 12, 2026

Klarna's AI customer service deployment | AI Autopsy 002


TL;DR: Klarna's AI customer service deployment, announced in February 2024, processed 2.3 million conversations in its first month and cut average handle time from 11 minutes to under 2 minutes. By May 2025, the CEO was admitting lower quality and beginning to rehire human agents. The deployment didn't fail because AI can't do customer service. Klarna's cost per transaction dropped 40% over two years, and by June 2026 the company was describing its hybrid model as working. What went wrong was the evaluation framework: the metrics used to declare success (handle time, aggregate CSAT) measured speed and average satisfaction rather than whether complex customer problems were being resolved. The signals were in the data the AI was producing all along. This article examines what the metrics missed and what production monitoring for AI customer service requires.


On February 27, 2024, Klarna published a press release announcing that its AI customer service assistant had handled 2.3 million conversations in its first month, doing "the equivalent work of 700 full-time agents." Average resolution time had dropped from 11 minutes to under 2 minutes. Repeat inquiry rates were down 25 percent. The AI was operating in 23 markets across 35 languages, around the clock. Brad Lightcap, COO of OpenAI, called Klarna "at the very forefront among our partners in AI adoption and practical application."

Fourteen months later, on May 8, 2025, CEO Sebastian Siemiatkowski told Bloomberg: "As cost unfortunately seems to have been a too predominant evaluation factor when organizing this, what you end up having is lower quality." Klarna was reversing course and beginning to rehire human agents.

It's worth understanding what that reversal tells us, and what it doesn't. Let's investigate.

What Klarna's February 2024 announcement actually claimed

The press release was precise and carefully worded in ways that didn't survive the secondary coverage. Its $40 million figure wasn't an audited cost saving; it was described as a projected "profit improvement to Klarna in 2024," a forward-looking estimate at the time of announcement. The "700 full-time agents" figure was a productivity equivalence claim, not a headcount action: it described the volume of work handled, not a number of people who had been fired at that moment. The CSAT claim ("customer satisfaction score on par with human agents") appeared without a published score, a sample size, or a breakdown by interaction type.

By November 2025, Klarna was still publishing AI productivity figures: the AI was by then doing the "equivalent work of 853 employees" and projected to save $60 million. The story that Klarna performed a clean reversal from AI to humans is inaccurate. What actually happened is that the company discovered its success metrics had described the average while the damage was happening in the tails.

Average resolution time measures speed, not resolution

The headline metric from Klarna's announcement was average resolution time, in effect a handle-time measure: under 2 minutes, down from 11. That's a real and significant improvement on one dimension. It's not a measure of whether the customer's problem was solved.

In customer service, the metric that correlates most closely with satisfaction and retention is first-contact resolution rate: did the customer's problem get resolved without them having to call back? A 2-minute interaction that doesn't solve the problem is worse than an 11-minute interaction that does, because the customer still has the problem, has now invested time in an unsuccessful resolution attempt, and will contact the company again in a worse state of mind. Each failed first contact generates a repeat contact, and repeat contact rates are both a cost and a satisfaction driver.
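To make first-contact resolution concrete, here is a minimal sketch of how it could be computed from a contact log. Everything about the data is an assumption for illustration: the record fields (customer ID, issue type, timestamp), the seven-day repeat window, and the simplification that one customer plus one issue type equals one issue. None of it reflects Klarna's actual schema or definition.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Hypothetical contact log entries: (customer_id, issue_type, timestamp).
# The 7-day repeat window is an assumed definition, not an industry standard.
REPEAT_WINDOW = timedelta(days=7)

def first_contact_resolution_rate(contacts, window=REPEAT_WINDOW):
    """Share of issues whose first contact was not followed by another
    contact from the same customer on the same issue type within `window`."""
    by_issue = defaultdict(list)
    for customer_id, issue_type, ts in contacts:
        by_issue[(customer_id, issue_type)].append(ts)

    resolved_first_time = 0
    for timestamps in by_issue.values():
        timestamps.sort()
        first = timestamps[0]
        repeats = [t for t in timestamps[1:] if t - first <= window]
        if not repeats:
            resolved_first_time += 1
    return resolved_first_time / len(by_issue) if by_issue else 0.0

contacts = [
    ("c1", "order_status", datetime(2024, 2, 1, 9, 0)),
    ("c2", "billing_dispute", datetime(2024, 2, 1, 10, 0)),
    ("c2", "billing_dispute", datetime(2024, 2, 3, 15, 0)),  # repeat contact
    ("c3", "refund", datetime(2024, 2, 2, 11, 0)),
    ("c4", "fraud_report", datetime(2024, 2, 2, 12, 0)),
]
fcr = first_contact_resolution_rate(contacts)
print(f"FCR: {fcr:.0%}")
```

In this toy log, the two-minute handle time on customer c2's first contact would look like a win, while the repeat contact two days later is what the resolution metric picks up.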

Klarna's press release reported a 25 percent drop in repeat inquiries as a secondary metric. That's the closest proxy to first-contact resolution in the public data. But we don't know whether that figure held across interaction types, how it was calculated, or whether it was sustained beyond the first month. What we do know is that by 2025, multiple analysts and Klarna's own CEO were describing quality degradation, which implies the repeat contact picture had changed.

Kate Leggett, VP Principal Analyst at Forrester, described the pattern directly: "They overpivoted to cost containment, without thinking about the longer-term impact of customer experience."4

The distribution the headline CSAT didn't show

Customer satisfaction is a distribution, not a single number. A system that handles simple interactions at 9/10 and complex ones at 3/10 can produce an aggregate score that looks entirely acceptable, assuming most interactions are simple. The problem surfaces when you look at which interactions the 3/10 score is attached to.
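The arithmetic behind that masking effect is worth seeing once. The sketch below uses the 9/10 and 3/10 scores from the example above; the 90/10 volume split between simple and complex interactions is an assumption chosen for illustration, not a figure from Klarna.

```python
# Scores come from the example in the text; the volume shares are assumed.
segments = {
    "simple":  {"share": 0.90, "csat": 9.0},
    "complex": {"share": 0.10, "csat": 3.0},
}

# The aggregate is a volume-weighted average, so the large segment dominates.
aggregate = sum(s["share"] * s["csat"] for s in segments.values())

print(f"aggregate CSAT: {aggregate:.1f}/10")
for name, s in segments.items():
    print(f"{name:8s} share={s['share']:.0%}  csat={s['csat']:.1f}/10")
```

Under these assumed shares the aggregate comes out at 8.4 out of 10, a number that would pass most dashboards while one interaction in ten is scoring 3.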

AI customer service systems typically perform well on structured, high-frequency requests: order status, return initiation, password resets, policy questions. They struggle with what analysts and practitioners variously describe as "novel scenarios, nuanced exceptions, and high emotionality."5 For Klarna specifically, that category includes billing disputes, fraud reports, and account closures: exactly the interactions where a customer's relationship with the company is most at risk.

A customer contacting Klarna about a fraudulent charge is not a routine inquiry. Their state is already elevated. They need the problem resolved, not processed. A 2-minute interaction that fails to resolve a fraud report, or produces a generic response that doesn't address their specific situation, doesn't show up as a resolved contact. It shows up as an escalation, a repeat contact, or a churn event. None of those signals are visible in an aggregate CSAT score measured at the interaction level in the month of deployment.

Klarna's own spokesperson, in May 2025, acknowledged: "Some customers get an amazing agent, some a less engaged agent, prompting repeated contacts and higher costs."6 The vocabulary there ("some customers") is describing a distribution problem. The aggregate metrics had obscured what the distribution looked like by interaction type.

The quality problem that predated the AI deployment

A January 2024 Sifted investigation, published approximately four weeks before Klarna's AI press release, documented a different quality problem already underway.7 In late 2023, Klarna had outsourced approximately 750 customer service roles to Foundever and Accenture. According to former employees and internal data surfaced by Sifted, unresolved queries had quadrupled following that transition. Merchant wait times extended from under 24 hours to, in some cases, up to one month. Ticket queues had grown from around 200 to thousands.

This matters for attributing the causes of what happened in 2024 and 2025. The AI deployment in February 2024 landed into an environment where customer service quality was already degraded from the preceding outsourcing. Separating the contribution of the AI from the contribution of the outsourcing transition is not straightforward, and the public record doesn't support a clean account of which failure modes originated where. The Klarna story is not simply a story about what AI did to customer service quality. It's a story about what happens when multiple transformations to a customer service function compound, and the aggregate metrics don't capture which intervention caused which outcome.

What the CEO actually said, and what he attributed it to

Siemiatkowski's public statements on the reversal are worth quoting precisely, because they've been paraphrased in ways that shift the causal story.

On the quality failure: "As cost unfortunately seems to have been a too predominant evaluation factor when organizing this, what you end up having is lower quality."

On the path forward: "Really investing in the quality of the human support is the way of the future for us."

On the role of humans: "From a brand perspective, a company perspective, I just think it's so critical that you are clear to your customer that there will always be a human if you want."

The framing Siemiatkowski used is notable: he attributed the quality problem to decision-making (specifically, to over-weighting cost as an evaluation factor) rather than to AI capability. The failure he described was organizational: the wrong objective function produced the wrong outcome. That's a different diagnosis than "the AI couldn't do the job."

The efficiency gains Klarna claimed were real on one dimension. The company's cost per customer service transaction dropped 40 percent over two years, from $0.32 to $0.19, across a period during which AI handled the majority of contact volume. The financial case for the deployment existed. What didn't exist was the monitoring layer to catch where the deployment was failing alongside the dimension it was succeeding on.

What both diagnoses share is a monitoring problem. Whether the root cause was AI capability or organizational cost-focus, either explanation suggests that no production monitoring system surfaced the quality degradation before it became visible in press coverage. If escalation rates by issue type, repeat contact rates broken down by interaction category, and CSAT distributions rather than aggregates had been tracked in something close to real time, the signal would have arrived earlier and with enough specificity to inform a targeted intervention rather than a public reversal.

AI outputs flowing into operations need monitoring, same as any other data

This observation applies to the Klarna case without requiring access to any internal Klarna systems, because it describes a general architectural requirement.

When an AI system's outputs flow into customer service operations, those outputs are data. They're transcripts, resolution labels, CSAT scores, escalation flags, case categories, and contact reasons. They live in systems of record. They feed downstream processes: staffing decisions, training feedback, compliance review, churn modeling. In that context, monitoring for quality signals in the data those AI interactions produce is structurally identical to monitoring any other operational data flow for anomalies.

Escalation rates broken down by issue category tell you where the AI is failing on interaction type. Repeat contact rates by case type tell you where first-contact resolution is degrading. CSAT score distributions (not averages) show you whether the score is uniform across interaction types or whether it's being held up by high performance on simple queries while complex ones degrade. Contact volume patterns tell you whether resolution quality has changed: when the AI handles the same problem in a way that doesn't resolve it, contact volume tends to rise, because customers contact again. All of these are observable in the data the AI produces, without access to the model itself.
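The breakdowns described above can be sketched in a few lines. The records and field names here (category, escalated, repeat, csat) are hypothetical; a real deployment would pull them from whatever system of record stores the AI's interaction outputs.

```python
from collections import Counter, defaultdict

# Hypothetical interaction records; field names and values are assumptions.
interactions = [
    {"category": "order_status", "escalated": False, "repeat": False, "csat": 9},
    {"category": "order_status", "escalated": False, "repeat": False, "csat": 10},
    {"category": "order_status", "escalated": False, "repeat": True, "csat": 8},
    {"category": "fraud_report", "escalated": True, "repeat": True, "csat": 2},
    {"category": "fraud_report", "escalated": True, "repeat": False, "csat": 4},
    {"category": "billing_dispute", "escalated": False, "repeat": True, "csat": 3},
]

def summarize_by_category(rows):
    """Per-category escalation rate, repeat-contact rate, and CSAT histogram."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["category"]].append(row)

    summary = {}
    for category, items in grouped.items():
        n = len(items)
        summary[category] = {
            "n": n,
            "escalation_rate": sum(r["escalated"] for r in items) / n,
            "repeat_rate": sum(r["repeat"] for r in items) / n,
            # Keep the full score distribution rather than only the mean.
            "csat_histogram": dict(Counter(r["csat"] for r in items)),
        }
    return summary

summary = summarize_by_category(interactions)
for category, stats in summary.items():
    print(category, stats)
```

The design choice that matters is the grouping key. The same three measures computed without it collapse back into the aggregate view.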

The Starbucks inventory case and the Klarna customer service case share an underlying architecture: AI outputs flowing into operational systems and being treated as equivalent to the data those systems previously received from humans. In both cases, the question of whether those outputs had the same reliability properties as their human-generated predecessors was not being answered systematically. An anomaly detection layer applied to the data the AI produces (operating on the outputs rather than the model) is where data observability connects directly to what actually happens when AI deployments run into production problems.
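As a rough illustration of what an anomaly detection layer over AI outputs might look like, the sketch below flags days when a metric departs sharply from its recent history. It uses a simple trailing-window z-score, chosen here for brevity and not drawn from any particular product or from Klarna's systems. The daily escalation rates, the seven-day window, and the threshold of 3 are all assumptions.

```python
from statistics import mean, stdev

def flag_anomalies(series, window=7, threshold=3.0):
    """Return indices where a value deviates from the mean of the preceding
    `window` values by more than `threshold` sample standard deviations."""
    flagged = []
    for i in range(window, len(series)):
        history = series[i - window:i]
        mu, sigma = mean(history), stdev(history)
        if sigma > 0 and abs(series[i] - mu) / sigma > threshold:
            flagged.append(i)
    return flagged

# Hypothetical daily escalation rates for a single issue category.
daily_escalation_rate = [0.10, 0.11, 0.09, 0.10, 0.12, 0.10, 0.11, 0.10, 0.11, 0.25]
anomalies = flag_anomalies(daily_escalation_rate)
print(anomalies)
```

Run per issue category rather than on the overall rate, a check like this would surface a jump in, say, fraud-report escalations even when total escalation volume barely moves.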

The harder question for enterprise teams deploying AI in customer-facing roles

Klarna's story is more specific than a general warning about AI in customer service. By November 2025, the company was still running significant AI-powered customer service volume and reporting material productivity gains. The lesson is narrower: the metrics used to evaluate the deployment at launch didn't measure the things that ultimately determined whether it was working.

By June 2026, Siemiatkowski had settled on a new framing for the model Klarna had arrived at: "In a world where AI can do the most simplistic customer service, we believe that human customer service will almost be seen as a VIP thing." That framing (AI handles routine, humans handle complex and premium) is where the deployment landed after its detour through the quality problems of 2024 and 2025. Whether it was the destination Klarna had in mind when it announced that its AI was doing the equivalent work of 700 agents is a question the February 2024 press release doesn't answer.

For any AI system currently handling customer interactions in your organization: what's the breakdown of CSAT by interaction complexity tier? What's your first-contact resolution rate for AI-handled contacts versus human-handled ones, separated by issue type? What's the escalation rate from AI to human, and how has it trended since deployment? What are the repeat contact patterns for customers whose first interaction was AI-handled?

If the honest answer is that you're tracking aggregate handle time and aggregate CSAT, you're measuring what Klarna measured in February 2024. The customers finding out what those metrics missed are the ones with the complex problems, which tends to mean the highest-value customers, the ones most likely to have something worth resolving, and the ones most likely to leave if they don't get it resolved.

about the author

Adrian Vidal

Writer and Content Strategist, Bigeye

Adrian Vidal is a writer and content strategist at Bigeye, where they explore how organizations navigate the practical challenges of scaling AI responsibly. With over 10 years of experience in communications, they focus on translating complex AI governance and data infrastructure challenges into actionable insights for data and AI leaders.

At Bigeye, their work centers on AI trust: examining how organizations build the governance frameworks, data quality foundations, and oversight mechanisms that enable reliable AI at enterprise scale.

Adrian's interest in data privacy and digital rights informs their perspective on building AI systems that organizations, and the people they serve, can actually trust.
