Bigeye Staff
June 1, 2026

What is data discovery and classification?

9 min read

TL;DR: Data discovery and data classification are distinct but inseparable practices. Discovery answers "what data do we have and where is it?" Classification answers "how sensitive is it and what controls apply?" You can't classify data you haven't found, and data that has been found but never labeled can't be governed. Most enterprise programs treat both as periodic compliance exercises, which worked well enough when humans queried data through defined interfaces. AI agents broke that assumption: they access data continuously, across dozens of sources, without waiting for the next scheduled scan. In an AI environment, unclassified data stops being a compliance gap and becomes an uncontrolled access problem running at machine speed. This article explains how discovery and classification work, what methods each uses, what regulations require both, and why connecting them to real-time enforcement has become an operational requirement in a way it never was before.


Most organizations have data in more places than any single team can track. A CRM holds customer records. A data warehouse holds copies pulled for analysis. A marketing analytics platform holds another copy. A development environment holds a copy made for testing, probably years ago and never deleted. Shadow data (copies created by business units, exports from deprecated systems, data loaded into unauthorized SaaS integrations) has accumulated over years in locations that governance programs never formally inventoried. Data discovery and classification are how organizations find that data and determine what controls apply to it.

IBM's 2024 Cost of a Data Breach Report found shadow data was involved in approximately 35% of breaches, raising the average cost for those incidents 16% above the global average. The exposure exists because the data was there; it was just never found.

Data discovery is the practice that finds it. Data classification is the practice that determines what to do with it once it's found. The two form a dependency chain that everything else in data governance builds on.

What is data discovery?

Data discovery is the systematic process of scanning and cataloging all data assets across an organization's environment: production databases, data warehouses, cloud storage, SaaS applications, file shares, and the copies and legacy systems that accumulate over time. The goal is a complete, accurate inventory of what data exists, where it lives, who owns it, and what it contains, including data no one is actively managing.

The "systematic" part matters. Point-in-time scans of approved systems don't capture the full picture. A complete discovery program covers approved infrastructure as well as the shadow systems business units have spun up without central coordination, cloud buckets that were never cataloged, and datasets that predate the current governance program by years. Roughly 55% of enterprise data is estimated to be dark: stored, but unanalyzed, untagged, and invisible to any program trying to manage it. Unstructured data, currently around 80% of enterprise data by volume, is forecast to nearly double by 2028. The scope of what needs to be discovered is growing faster than most manual programs can keep pace with.

Discovery also surfaces data duplication: the same sensitive dataset copied across five systems, each with different access controls and different classification states. A copy of a customer record in a dev environment often has fewer controls than the production record it was made from. Discovery is what reveals that the gap exists.
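One way to picture the duplication check is a content fingerprint per dataset, grouped across locations. The sketch below is a minimal illustration, not a description of any particular product: the function names, the `(rows, controls)` input shape, and the use of exact-match SHA-256 hashing are all assumptions, and exact hashing only catches byte-identical copies, not partial or transformed ones.

```python
import hashlib
from collections import defaultdict


def content_hash(rows):
    """Order-independent fingerprint of a dataset's rows (exact match only)."""
    h = hashlib.sha256()
    for row in sorted(rows):
        h.update(repr(row).encode())
    return h.hexdigest()


def find_inconsistent_copies(datasets):
    """Return groups of locations that hold identical data under different controls.

    `datasets` maps a location name to (rows, controls); both the shape and
    the free-form `controls` string are hypothetical.
    """
    groups = defaultdict(list)
    for location, (rows, controls) in datasets.items():
        groups[content_hash(rows)].append((location, controls))

    flagged = []
    for members in groups.values():
        distinct_controls = {controls for _, controls in members}
        if len(members) > 1 and len(distinct_controls) > 1:
            flagged.append(sorted(location for location, _ in members))
    return flagged
```

A production copy and a dev copy with the same rows but different controls would come back as one flagged group, which is the gap the paragraph above describes.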

What is data classification?

Data classification assigns structured labels to the output of discovery based on sensitivity, regulatory status, and risk. Discovery answers "this data exists here." Classification answers "this data is Restricted, it contains PII, HIPAA applies to it, and these specific controls govern it."

Without classification, discovery produces an inventory. With classification, it produces a governed inventory: every data asset carries a label that determines who can access it, how it can be used, what protection it requires, and what happens if it's exposed.

Enterprise programs typically use a four-tier model: Public (no confidentiality requirement), Internal Only (internal use, low harm if disclosed), Confidential (business-sensitive, restricted to authorized roles), and Restricted (highest sensitivity: regulated data including PII, PHI, and payment card data covered by PCI DSS, or information that creates legal exposure if disclosed). Each tier maps to specific control requirements: encryption standards, access restrictions, audit logging, and masking or tokenization before data enters analytics pipelines or AI systems.
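The tier-to-control mapping can be sketched as a small lookup table. This is an illustration only: the tier names come from the paragraph above, but the `Controls` fields, the role names, and which controls each tier switches on are assumptions that a real program would set by policy.

```python
from dataclasses import dataclass
from enum import IntEnum


class Tier(IntEnum):
    """Four-tier model, ordered from least to most sensitive."""
    PUBLIC = 0
    INTERNAL_ONLY = 1
    CONFIDENTIAL = 2
    RESTRICTED = 3


@dataclass(frozen=True)
class Controls:
    encrypt_at_rest: bool
    audit_logging: bool
    mask_before_analytics: bool
    allowed_roles: frozenset


# Hypothetical policy table; the specific choices are placeholders.
POLICY = {
    Tier.PUBLIC: Controls(False, False, False, frozenset({"anyone"})),
    Tier.INTERNAL_ONLY: Controls(False, False, False, frozenset({"employee"})),
    Tier.CONFIDENTIAL: Controls(True, True, False, frozenset({"finance", "legal"})),
    Tier.RESTRICTED: Controls(True, True, True, frozenset({"privacy_officer"})),
}


def controls_for(tier: Tier) -> Controls:
    """Look up the controls a label implies."""
    return POLICY[tier]
```

Using an ordered enum means tiers can be compared directly (`Tier.RESTRICTED > Tier.CONFIDENTIAL`), which is convenient when an access rule is expressed as "at most this tier."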

Data discovery vs. data classification: how they differ and why you need both

Discovery and classification are often treated as interchangeable, but they solve distinct problems and require different capabilities.

Discovery is a coverage problem: have you found all the places where data lives, including the ones your governance program didn't formally inventory? Gaps in discovery mean an organization is governing a portion of its data while leaving the rest unmanaged. Unmanaged data is what creates both breach exposure and regulatory liability.

Classification is a context problem: once you've found the data, do you understand what it is and what protections it requires? A database that has been discovered but not classified is an entry in an inventory without controls. The organization knows it exists; it doesn't know whether it holds public information or Social Security Numbers.

The dependency runs in one direction: discovery comes first, then classification, and only then can enforcement apply. An organization that has run a discovery exercise without classifying the results has a detailed map of its data landscape and no mechanism for acting on it. Every access restriction, encryption policy, and AI agent access rule depends on a classification label to know what it applies to.

How data discovery works: four methods

Automated scanning traverses networks, databases, cloud storage, and endpoints systematically, reading file contents and metadata to build an inventory. It's the most thorough method for structured data at scale, capable of covering millions of columns across dozens of sources without analyst involvement per table. Automated scanning is the foundation of any discovery program that needs to keep pace with data growth.
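The content-reading step of a scan can be sketched with two simple detectors: a pattern for US Social Security Numbers and a card-number pattern confirmed with the Luhn checksum. This is a minimal sketch, not how any specific scanner works; the regular expressions and the `scan_values` function are illustrative assumptions, and real detectors use broader patterns plus validation to reduce false positives.

```python
import re

# Illustrative patterns only; production detectors are considerably stricter.
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
CARD_RE = re.compile(r"\b(?:\d[ -]?){13,19}\b")


def luhn_ok(digits: str) -> bool:
    """Luhn checksum: double every second digit from the right, sum, check mod 10."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def scan_values(values):
    """Return the set of sensitive data types seen in a sample of field values."""
    findings = set()
    for value in values:
        if SSN_RE.search(value):
            findings.add("ssn")
        for match in CARD_RE.finditer(value):
            digits = re.sub(r"\D", "", match.group())
            if 13 <= len(digits) <= 19 and luhn_ok(digits):
                findings.add("card_number")
    return findings
```

The checksum step is the design point: a 16-digit order number matches the card pattern but usually fails Luhn, so it is not reported.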

Metadata crawling reads schema information, column names, table descriptions, and tags without reading full data content. It's faster and lower-cost than full scanning but less accurate: a column named `id` or `field_47` doesn't reveal whether it contains customer SSNs or inventory numbers. Metadata crawling works well as a first pass for scoping but shouldn't be the sole discovery method for high-risk data environments.
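The limitation is easy to see in a small sketch of name-based crawling. The hint patterns and the `crawl` function below are hypothetical; the point is only that a name heuristic returns nothing useful for columns like `id` or `field_47`, whatever they actually contain.

```python
import re
from typing import Dict, List, Optional

# Hypothetical name hints; real programs tune these per environment.
NAME_HINTS = {
    "ssn": re.compile(r"(^|_)(ssn|social_security)(_|$)", re.IGNORECASE),
    "email": re.compile(r"(^|_)e?mail(_|$)", re.IGNORECASE),
    "phone": re.compile(r"(^|_)(phone|mobile)(_|$)", re.IGNORECASE),
}


def guess_from_name(column_name: str) -> Optional[str]:
    """Guess a data type from the column name alone; None means 'no signal'."""
    for label, pattern in NAME_HINTS.items():
        if pattern.search(column_name):
            return label
    return None


def crawl(schema: Dict[str, List[str]]) -> Dict[str, Optional[str]]:
    """Map each 'table.column' to a guessed type without reading any row data."""
    return {
        f"{table}.{column}": guess_from_name(column)
        for table, columns in schema.items()
        for column in columns
    }
```

Columns that come back as `None` are not known to be safe; they are simply unresolved, which is why they are natural candidates for a content scan.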

ML-based pattern detection uses machine learning and NLP models to identify sensitive data patterns that regex rules and metadata alone miss. The cases where it makes the difference include free-text PII embedded in notes fields, sensitive codes in generic columns, partial values, and contextual sensitivity (the same field name carrying different sensitivity depending on which schema it lives in). For organizations with legacy data or inconsistent naming conventions, it's where most of the previously invisible sensitive data surfaces.

Manual tagging applies human judgment to data that automated methods can't fully evaluate: highly ambiguous edge cases, novel data types, and high-stakes decisions where the classification has significant consequences. Manual tagging doesn't scale to modern data volumes and shouldn't be the primary method, but it fills the gaps that automation leaves and handles escalations when automated classifiers flag a result for human review.

Dark data and shadow data require specific attention in discovery programs. Dark data is stored but never used or analyzed: log files, archived emails, old backups, legacy system exports that no one ever deleted. Shadow data exists outside sanctioned governance processes: copies made by business units, data in personal cloud accounts, exports from deprecated systems. Both categories hold sensitive data with few or no controls, and both are systematically underrepresented in discovery programs that only scan approved infrastructure. A complete discovery program scans beyond the perimeter: federated scanning across business-unit environments, cloud account enumeration, and coverage of SaaS integrations where data may have been loaded without central coordination.

What regulations require data discovery and classification

GDPR Article 30 requires controllers and processors to maintain a Record of Processing Activities: a structured inventory of every way the organization processes personal data, covering purposes, categories of data subjects, recipients, and retention periods. This is operationally impossible without data discovery. Organizations can't document processing they haven't mapped, and supervisory authorities can request the RoPA at any time. Data discovery is the work that makes Article 30 compliance achievable; the RoPA is its output.

CCPA and CPRA don't explicitly mandate a data inventory, but complying with consumer rights requests (access, deletion, correction) across all systems holding personal information requires knowing where that information lives. CPRA's deletion obligation is practically unenforceable without a maintained, current inventory of every system holding the relevant data. Discovery is the operational prerequisite even where the statute doesn't name it directly.

ISO 27001 Annex A 5.12 requires organizations to classify information assets by their legal, strategic, and operational value, with named information owners and periodic review. Discovery is the prerequisite: you can't classify assets you haven't inventoried. The standard doesn't prescribe a specific classification scheme, but it requires one to exist, be maintained, and be applied consistently.

NIST SP 800-60 provides a data categorization methodology aligned with FIPS 199 impact levels (Low, Moderate, High) and explicitly positions data categorization as a prerequisite to selecting security controls under the NIST Risk Management Framework. The sequence (discover, categorize, select controls, implement) runs through the entire NIST framework.
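As a rough illustration of the categorize step, FIPS 199 assigns an impact level to each of confidentiality, integrity, and availability, and the overall level for a system is commonly taken as the highest of the three (the "high-water mark"). The sketch below assumes that convention; the function names and input shapes are illustrative.

```python
from enum import IntEnum


class Impact(IntEnum):
    """FIPS 199 impact levels, ordered so max() picks the most severe."""
    LOW = 1
    MODERATE = 2
    HIGH = 3


def overall_impact(confidentiality: Impact, integrity: Impact, availability: Impact) -> Impact:
    """High-water mark across the three security objectives."""
    return max(confidentiality, integrity, availability)


def system_impact(information_types) -> Impact:
    """High-water mark across every information type a system holds.

    `information_types` is an iterable of (confidentiality, integrity,
    availability) triples; the shape is an assumption for this sketch.
    """
    return max(overall_impact(*triple) for triple in information_types)
```

The categorization result is what control selection keys off, which is why an asset that was never discovered never reaches this step at all.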

Why data discovery and classification programs fall short

Most enterprise discovery and classification programs were designed for a world where humans query data through defined interfaces. A quarterly scan, a risk review, an updated data inventory: this cadence was adequate when the primary concern was making sure the right humans had access to the right systems.

Two failure modes existed before AI but are now more consequential. The first is coverage drift: new data sources get added between discovery cycles without being added to the scan scope, leaving newly created or copied data unclassified until the next scheduled review. The second is label staleness: classification labels assigned at a point in time become inaccurate as data moves, schemas change, and fields get repurposed. A column that was Internal six months ago may now contain Restricted data because a new data feed was connected to it. Static labels don't update on their own.
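Both failure modes can be checked mechanically if each label records what the asset looked like when it was classified. The sketch below is one possible approach under stated assumptions: `LabelRecord`, `fingerprint`, and `find_gaps` are hypothetical names, and a schema fingerprint only catches structural change. A new feed writing different content into an unchanged column would still need a content rescan to detect.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class LabelRecord:
    tier: str
    schema_fingerprint: str  # schema as it looked when the label was assigned


def fingerprint(columns) -> str:
    """Stable hash of a schema given as (column_name, data_type) pairs."""
    joined = "|".join(sorted(f"{name}:{dtype}" for name, dtype in columns))
    return hashlib.sha256(joined.encode()).hexdigest()


def find_gaps(live_schemas, labels):
    """Compare what exists now against what has been labeled.

    Returns (unlabeled, stale):
      unlabeled - assets present in the environment with no label (coverage drift)
      stale     - labeled assets whose schema changed since labeling (label staleness)
    """
    unlabeled = sorted(set(live_schemas) - set(labels))
    stale = sorted(
        asset
        for asset, columns in live_schemas.items()
        if asset in labels and labels[asset].schema_fingerprint != fingerprint(columns)
    )
    return unlabeled, stale
```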

For most compliance use cases, these failures were manageable: periodic discovery could catch up, and stale labels could be corrected in the next review cycle. For AI agent access governance, neither failure mode is acceptable.

Why AI agents change the discovery and classification requirement

AI agents don't query data through predefined interfaces. They access data dynamically, based on task context, at machine speed, and they inherit the permissions of whatever service accounts they run under, which are typically broad. Proofpoint's 2025 Data Security Landscape Report found that 32% of organizations identify unsupervised AI agent data access as a critical threat. A Kiteworks study found that 57% of organizations have fragmented controls over how AI systems access sensitive data, and 63% of organizations involved in data breaches lacked AI governance policies.

The discovery and classification gap is the root cause. If a data store hasn't been discovered, no policy applies to it, and an AI agent that can reach it will use it. If a data asset has been discovered but not classified, no access rule can be configured against it. And if a classification label is stale because the underlying data changed after the last scan, the access control based on that label is enforcing a rule against the wrong sensitivity tier.

The correct architecture for AI environments is continuous: automated discovery running incrementally to catch new and modified data, labels refreshing as data changes, and access controls operating in real time at the point of agent query rather than after the fact. With a nightly scan, a shadow data store created just after one run can expose sensitive data to an AI agent for hours before the next run catches it. A quarterly classification review means agents spend three months operating on labels that may have become inaccurate the week after they were set.
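A query-time check along these lines can be sketched in a few lines. This is not a description of any vendor's enforcement layer: the `decide` function, the label store shape, the 24-hour freshness window, and the deny-by-default choice for unknown assets are all assumptions made for illustration.

```python
from datetime import datetime, timedelta

# Tier names follow the four-tier model; the ranking is used for comparison.
TIER_RANK = {"Public": 0, "Internal Only": 1, "Confidential": 2, "Restricted": 3}


def decide(asset, agent_clearance, labels, now, max_label_age=timedelta(hours=24)):
    """Decide whether an agent may read `asset` at the moment of the query.

    `labels` maps asset name -> (tier, classified_at). Anything missing from
    it is treated as undiscovered or unclassified and denied by default.
    """
    record = labels.get(asset)
    if record is None:
        return "deny: not discovered or not classified"

    tier, classified_at = record
    if now - classified_at > max_label_age:
        return "deny: label stale"

    if TIER_RANK[tier] > TIER_RANK[agent_clearance]:
        return "deny: tier above clearance"

    return "allow"
```

Denying on a missing or stale label inverts the failure described above: instead of an agent using data because nothing tells it not to, the absence of a current label is itself the reason to refuse.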

That's the gap between discovery and classification as a compliance program and discovery and classification as operational infrastructure for AI governance. The practices are the same; the cadence and the enforcement integration are what change.

Bigeye's Data Classification product runs automated scanning continuously (full scans for the initial baseline, incremental scans for ongoing coverage) and sends sensitivity signals directly into AI Guardian, which enforces access controls at the point of agent query in real time. Data lineage context is attached to every finding, so teams can trace which pipelines carry classified data and where it flows. Data governance workflows connect classification labels to policies, ownership, and access decisions across the organization. Bigeye's article on the Agent Trust Hub architecture gives a broader view of how discovery, classification, and enforcement connect. A free trial is available.


What is the difference between data discovery and data classification?

Data discovery is the process of finding where all data assets live across an organization's environment, including systems that weren't formally inventoried and copies that accumulated over time. Data classification takes discovery's output and labels each asset by its sensitivity tier (Public, Internal Only, Confidential, or Restricted) and regulatory status, with controls applied based on the label. Discovery answers "what data do we have and where?" Classification answers "how sensitive is it and what governs it?" Both are required: you can't classify data you haven't found, and discovery results without classification produce an inventory with no enforcement mechanism.

What is data discovery used for?

Data discovery is used to build a complete, accurate inventory of all data assets across an organization, covering approved infrastructure and shadow data stores that haven't been formally cataloged. In compliance contexts, GDPR Article 30 requires a Record of Processing Activities that discovery enables; CCPA/CPRA deletion obligations require knowing every system where personal data lives. In security contexts, discovery surfaces unmanaged copies of sensitive data, legacy exports, and shadow data stores that represent breach exposure. In AI governance contexts, discovery is the prerequisite for classification, which in turn is the prerequisite for configuring access controls that govern what AI agents can reach.

How does data classification work?

Data classification applies structured labels to data assets based on their sensitivity, regulatory status, and risk profile. Modern classification uses a combination of content-based detection (scanning field values for SSN patterns, credit card formats, health codes, and other sensitive data types), context-based detection (using schema relationships and lineage to assess sensitivity from how data connects to other data), and automated ML-based classifiers that catch sensitive data in fields with generic or misleading names. After an initial full scan, incremental scanning evaluates only new or modified data to keep classification current without rescanning everything. Those labels feed the access controls that determine who (and which AI systems) can reach each data asset.

Why do data discovery and classification matter for AI governance?

AI agents access data dynamically and continuously, without human review at each query. Any data asset an agent can reach that hasn't been discovered and classified is outside any governance policy; the agent will use it because nothing tells it not to. Proofpoint's 2025 research found 32% of organizations identify unsupervised AI agent data access as a critical threat, and a Kiteworks study found 57% have fragmented controls over how AI systems access sensitive data. The root cause in most cases is a discovery and classification gap: sensitive data that exists but hasn't been labeled, so no access rule applies. Continuous discovery and classification, connected to real-time AI agent access controls, is what enterprises need.

about the author

Bigeye Staff

Bigeye Staff represents the collective voice of the Bigeye team. Each article is informed by the expertise of individual contributors and strengthened through collaboration across our engineers, data experts, and product leaders, reflecting our shared mission to help teams build trust in their data.
