How Starbucks' AI inventory tool went from 99% accuracy to retired in nine months | AI Autopsy Issue 01
AI Autopsy is a Bigeye series examining enterprise AI failures and the data conditions that made them possible. Today, we're looking at the public rollback of Starbucks' AI inventory tool.
NomadGo's AI inventory tool claimed 99% accuracy. By February 2026, five months into its rollout across more than 11,000 Starbucks locations in North America, Reuters was reporting that the system was frequently miscounting and mislabeling items, confusing similar milk types, and sometimes failing to detect products on shelves entirely. Starbucks retired the tool in May 2026. The deployment lasted only nine months in total.
It's worth understanding what this rollback tells us, and what it doesn't. Let's investigate.
The errors happening months before Starbucks pulled the tool
Starbucks deployed NomadGo's computer vision inventory tool in September 2025 as part of CEO Brian Niccol's effort to address inventory shortages he had tied to declining same-store sales. The tool used tablets running on-device AI to automate counts of beverage components, syrups, and milk that had previously been done manually. NomadGo described the technology as a "unique synthesis of on-device 3D spatial intelligence, computer vision, and augmented reality" (NomadGo press release, September 2025). The company's website claims 99% accuracy.
According to Reuters, baristas and managers reported persistent errors: the system confused similar products, failed to recognize items during automated scans, and contributed to both shortages and cases of overstocking. An internal Starbucks newsletter confirmed the discontinuation with a single line: "Effective immediately, Automated Counting will be discontinued."
While we don't have all the answers or internal information from Starbucks directly, the challenges this deployment ran into are familiar territory for the data teams we work with, and we can offer some context on what these rollouts typically require in order to succeed.
The 99% accuracy claim describes test conditions, not store conditions
NomadGo's accuracy figure describes performance in a test environment. In practice, a computer vision test environment means consistent lighting, standardized shelf configurations, a representative product sample the model was trained to recognize, and no employees rushing a shelf scan between customer orders.
A 2019 study by researchers at MIT CSAIL and the MIT-IBM Watson AI Lab, published at the NeurIPS conference, quantified what happens when those conditions change. The researchers built a dataset called ObjectNet to test computer vision models under uncontrolled real-world conditions: images with varied backgrounds, rotations, and viewpoints the models hadn't seen in training. Computer vision models that scored 97% on ImageNet, the standard controlled benchmark, dropped to 50–55% accuracy on ObjectNet (Barbu et al., NeurIPS, 2019). The authors weren't criticizing the models. They were pointing out that standard evaluation methodology was systematically optimistic about how those models would perform outside the conditions they were tested in.
A 2021 Stanford study extended this finding across ten real-world deployment domains, from medical imaging to wildlife monitoring, and found the same gap persisting across all of them, even after applying existing robustness techniques (Koh, Sagawa et al., WILDS benchmark, ICML, 2021).
But take the 99% at face value for a moment. Across 11,000 stores counting inventory on a regular cycle, a 1% error rate still produces a large absolute number of miscounts every cycle, whatever the exact number of items each store counts. How consequential that is depends on count frequency, how miscounts feed into ordering decisions, and whether errors compound across cycles. None of those specifics are public in Starbucks' case. The point is that the same accuracy percentage means something different at enterprise scale than it does on a benchmark dataset, and the evaluation framework that works for one doesn't automatically transfer to the other.
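To make the scale effect concrete, here is a minimal back-of-the-envelope sketch. Only the store count and the 99% claim come from the reporting above. The items-per-store and counts-per-week figures are hypothetical placeholders, not Starbucks data, and the calculation assumes the accuracy figure applies independently to each item count.

```python
# Back-of-the-envelope estimate of miscounts at scale.
STORES = 11_000            # from the article
CLAIMED_ACCURACY = 0.99    # vendor's published claim
ITEMS_PER_STORE = 100      # ASSUMPTION: illustrative only, not a Starbucks figure
COUNTS_PER_WEEK = 7        # ASSUMPTION: illustrative only


def expected_miscounts(stores, items_per_store, accuracy, cycles=1):
    """Expected number of wrong item counts, treating each count as an independent trial."""
    error_rate = 1.0 - accuracy
    return stores * items_per_store * error_rate * cycles


per_cycle = expected_miscounts(STORES, ITEMS_PER_STORE, CLAIMED_ACCURACY)
per_week = expected_miscounts(STORES, ITEMS_PER_STORE, CLAIMED_ACCURACY, COUNTS_PER_WEEK)

print(f"{per_cycle:,.0f} expected miscounts per cycle")
print(f"{per_week:,.0f} expected miscounts per week")
```

Under these assumed inputs, a tool that fully lives up to its 99% claim would still produce on the order of 11,000 miscounts per cycle system-wide. Swapping in real item counts and count frequency changes the number, not the shape of the problem.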
Pilots and production are different environments
The Starbucks case fits a pattern that has less to do with any specific tool (like NomadGo) and more to do with how enterprise AI deployments are structured before they scale.
A pilot is set up to succeed. It runs in a small number of locations, usually the ones most likely to cooperate. The implementation team pays close attention. Edge cases that fall outside the model's training distribution get caught manually, because a dedicated person is watching. The conditions are as close to the training environment as they'll get at any point during deployment.
An 11,000-store rollout looks nothing like that. Locations vary in layout, lighting, and staffing. Seasonal items rotate in and out. Packaging changes. A video Starbucks released during the original NomadGo announcement showed the system struggling to identify a peppermint syrup bottle sitting among adjacent bottles on a standard store shelf, a failure that appeared not in a stress test but in the company's own promotional material (as reported by Reuters, May 2026, citing the Starbucks product video).
McDonald's ran a comparable experiment with IBM's AI-powered drive-thru ordering system across more than 100 U.S. locations from 2021 until terminating it in July 2024. The drive-thru is a controlled environment by retail standards, with a fixed menu and a more predictable range of items. The system's accuracy plateaued at 80–85%, below the 90%+ achieved by human workers, and documented failures included a case where it added 260 Chicken McNuggets and nine sweet teas to a single order (CNBC, June 2024).
According to S&P Global Market Intelligence's 2025 Voice of the Enterprise survey of 1,006 IT and business professionals, 42% of organizations reported abandoning most of their AI initiatives in the past year, up from 17% the year before. Among the reasons cited: inaccurate outputs and inflated vendor expectations.
Frontline adoption patterns are a leading indicator
A Starbucks barista who posted on Reddit during the rollout described roughly half the stores in their district reverting to manual counting rather than using the tool. That's one account from one district, and we don't know how representative it is across the broader network. But the pattern it points to is common in large-scale AI deployments: the people using a tool daily accumulate evidence about how it actually performs in their specific conditions well before that evidence reaches the teams with authority to act on it.
Frontline workers encounter the edge cases, the product configurations the model consistently misreads, the workflows the tool disrupts rather than supports. That evidence rarely gets formalized. It shows up in workarounds, declining adoption rates, or eventually in press coverage.
For any enterprise AI rollout, consider whether there's a formal mechanism to capture those signals and route them to the people making continuance decisions. Adoption rate by location, frequency of manual overrides, and task completion time compared to baseline are all measurable without access to model internals. They're the kind of indicators that shorten the gap between when a tool starts underperforming and when leadership has the information to decide what to do about it.
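The three indicators named above can be computed from a simple log of counting events. The sketch below is one possible shape for that calculation. The `CountEvent` record, its field names, and the sample data are all hypothetical, invented here for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class CountEvent:
    """One inventory count at one store (hypothetical schema)."""
    store_id: str
    used_tool: bool    # False means the team counted manually instead
    overridden: bool   # True means staff corrected the tool's result
    minutes: float     # time taken to finish the count


def adoption_signals(events, baseline_minutes):
    """Per-store adoption rate, override rate, and time relative to the manual baseline."""
    by_store = defaultdict(list)
    for event in events:
        by_store[event.store_id].append(event)

    report = {}
    for store, store_events in by_store.items():
        tool_events = [e for e in store_events if e.used_tool]
        n_tool = len(tool_events)
        report[store] = {
            "adoption_rate": n_tool / len(store_events),
            # Override rate and timing are only defined where the tool was used.
            "override_rate": sum(e.overridden for e in tool_events) / n_tool if n_tool else None,
            "time_vs_baseline": (
                sum(e.minutes for e in tool_events) / n_tool / baseline_minutes
                if n_tool else None
            ),
        }
    return report


# Illustrative data only.
events = [
    CountEvent("store-A", used_tool=True, overridden=False, minutes=20.0),
    CountEvent("store-A", used_tool=True, overridden=True, minutes=40.0),
    CountEvent("store-B", used_tool=False, overridden=False, minutes=45.0),
    CountEvent("store-B", used_tool=True, overridden=True, minutes=60.0),
]
report = adoption_signals(events, baseline_minutes=30.0)
for store, signals in sorted(report.items()):
    print(store, signals)
```

A store whose adoption rate is falling while its override rate climbs is telling leadership something well before the complaints reach the press.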
AI outputs flowing into operations need monitoring, same as any other data
This is a general prescription, not a diagnosis of what Starbucks specifically did or didn't do.
When an AI tool's outputs flow directly into operational decisions, those outputs are data, and they need to be treated with the same rigor as any other data in your systems: monitored for accuracy, validated against ground truth on a regular cadence, and measured against standards that reflect your operating conditions rather than the vendor's test environment. An AI-generated inventory count looks identical to a human-generated one in a system of record. If the count is off and nobody is comparing it against physical reality on a meaningful sample basis, the error accumulates and flows into ordering decisions and supply chain actions downstream.
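As a rough illustration of validating against ground truth on a sample basis, the sketch below draws a random sample of AI-generated counts and compares each one to a physical recount. The function name, the `(store, item)` keys, and the sample data are hypothetical. In practice the physical count would come from a person on the floor, not a dictionary lookup.

```python
import random


def sample_audit(ai_counts, physical_count_fn, sample_size, tolerance=0, seed=None):
    """Compare a random sample of AI counts against physical recounts.

    ai_counts maps (store, item) -> count recorded by the tool.
    physical_count_fn returns the hand-counted value for a (store, item) key.
    Returns (observed accuracy on the sample, dict of mismatches).
    """
    rng = random.Random(seed)
    keys = sorted(ai_counts)
    sampled = rng.sample(keys, min(sample_size, len(keys)))

    mismatches = {}
    for key in sampled:
        truth = physical_count_fn(key)
        if abs(ai_counts[key] - truth) > tolerance:
            mismatches[key] = (ai_counts[key], truth)

    accuracy = 1 - len(mismatches) / len(sampled) if sampled else None
    return accuracy, mismatches


# Illustrative data only: the tool's counts versus a manual recount.
ai_counts = {
    ("store-A", "oat-milk"): 12,
    ("store-A", "whole-milk"): 8,
    ("store-B", "oat-milk"): 5,
    ("store-B", "peppermint-syrup"): 0,
}
physical = {
    ("store-A", "oat-milk"): 12,
    ("store-A", "whole-milk"): 8,
    ("store-B", "oat-milk"): 6,
    ("store-B", "peppermint-syrup"): 3,
}

accuracy, mismatches = sample_audit(ai_counts, physical.__getitem__, sample_size=4, seed=0)
print(f"Observed accuracy on sample: {accuracy:.0%}")
for key, (ai_value, truth) in sorted(mismatches.items()):
    print(key, "tool:", ai_value, "physical:", truth)
```

The number this produces is the one worth tracking over time: accuracy measured in your own stores, against your own shelves, not the vendor's benchmark figure.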
In environments where AI outputs feed into an ERP or back-office system, monitoring those downstream data flows for unexpected behavior is possible: counts that deviate from historical ranges, variance that spikes across store cohorts, or outputs that stop updating entirely. Anomaly detection on these kinds of data flows, applied to what AI tools are producing rather than to the model itself, is where data observability connects directly to AI adoption.
That visibility doesn't require access to the model itself. It applies to the data the model produces, in the systems your team already monitors.
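Two of the checks described above, counts outside their historical range and outputs that stop updating, can be sketched in a few lines. This is a deliberately simple z-score check with assumed thresholds, not a description of how any particular observability product works. Cohort-level variance monitoring is left out for brevity.

```python
from datetime import datetime, timedelta
from statistics import mean, stdev


def is_out_of_range(history, latest, z_threshold=3.0):
    """Flag a count that sits far outside its own history (simple z-score test)."""
    if len(history) < 2:
        return False  # not enough history to judge
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return latest != mu
    return abs(latest - mu) / sigma > z_threshold


def is_stale(last_updated, now, max_age=timedelta(hours=36)):
    """Flag a feed that has not produced a new count within max_age (assumed threshold)."""
    return now - last_updated > max_age


# Illustrative data only: recent daily counts for one item at one store.
history = [40, 42, 38, 41, 39]
print(is_out_of_range(history, latest=12))
print(is_out_of_range(history, latest=41))
print(is_stale(datetime(2026, 1, 1, 6, 0), now=datetime(2026, 1, 3, 6, 0)))
```

Both checks run entirely on the recorded counts and their timestamps, which is why they work regardless of which model or vendor produced the numbers.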
The governance question that usually goes unanswered is who owns production accuracy monitoring once the tool is live, and at what threshold poor performance triggers escalation. That conversation often defaults to the vendor, who has limited visibility into your operating conditions, or goes unassigned entirely.
The harder question on the future of AI in retail operations
A commenter on Reddit in the wake of the shutdown put it well: "The question the broader industry should be asking is how many AI deployments currently running at scale in retail, logistics, and healthcare are producing the same kind of quiet, compounding errors."
According to RAND Corporation research published in August 2024, based on structured interviews with 65 data scientists and engineers, more than 80% of AI projects fail to reach meaningful production deployment. The ones that do reach production don't come with automatic visibility into whether they're working once they're there.
For any AI tool currently running in your environment: do you know what accuracy threshold you've defined for your specific operating conditions, who is measuring against it, and how often? If the honest answer is that you're relying on the vendor's benchmark, it may be time to reconsider.
Why did Starbucks stop using its AI inventory tool?
Starbucks retired NomadGo's AI-powered inventory counting system in May 2026, nine months after rolling it out across more than 11,000 North American stores. According to Reuters, the tool frequently miscounted and mislabeled items, including confusing different types of milk and sometimes failing to detect products on shelves entirely. Rather than reducing the inventory shortages CEO Brian Niccol had cited as a priority, the system contributed to both shortages and overstocking. Starbucks confirmed the shutdown in an internal newsletter.
Did NomadGo's 99% accuracy claim hold up in production?
NomadGo's website still claims 99% accuracy for its inventory AI. Reuters documented persistent miscounting and mislabeling across Starbucks stores as early as February 2026, five months into the rollout. A video Starbucks released at launch showed the system failing to identify a peppermint syrup bottle on a standard store shelf (as reported by Reuters, May 2026). The gap between benchmark accuracy and production performance is a documented challenge in computer vision deployment broadly. NomadGo has stated it is "continuously learning from customer and user feedback."
Why do AI tools that work in pilots fail at scale?
Pilots run in controlled conditions: a small number of locations, close oversight, and environments close to the conditions the model was trained on. When a rollout scales to thousands of locations, those conditions diverge. Lighting varies, packaging changes, and employees who weren't part of the pilot use the tool differently. A 2019 study by researchers at MIT CSAIL and the MIT-IBM Watson AI Lab found that computer vision models scoring 97% on standard benchmarks dropped to 50–55% when tested on images taken under uncontrolled real-world conditions (Barbu et al., NeurIPS, 2019). The benchmark measures performance in optimal conditions. Production is everything else.