Bigeye Staff
June 1, 2026

What is data classification?

TL;DR: Data classification is the process of labeling data according to its sensitivity, risk profile, and regulatory obligations. Most enterprise programs use a four-tier model: Public (L0), Internal Only (L1), Confidential (L2), and Restricted (L3). Classification enables security controls, compliance documentation, and access governance. The gap most programs leave open: classification labels that aren't connected to AI systems leave those systems ungoverned. As AI agents consume enterprise data at scale, the classification status of every field those agents can reach determines whether they act on public information or walk out with protected health records. This article covers the four levels, four classification types, policy framework requirements under NIST and ISO 27001, and how automated classification feeds real-time AI governance, the step most data classification programs don't take.

Data classification isn't a new practice. Regulated industries have run classification programs for decades, mapping their data to sensitivity tiers required by HIPAA, PCI DSS, and SOX. What's changed is the consequence of getting it wrong and the scale at which it now has to be done correctly.

IBM's 2025 Cost of a Data Breach Report puts the global average breach cost at $4.44M, with the U.S. average at $10.22M. Shadow data (unclassified or poorly tracked data that organizations don't know they have) was involved in 35% of breaches, raising the average cost for those incidents to $5.27M, well above the global average. The most expensive breaches aren't always the ones where an attacker exploited a known vulnerability. They're often the ones where unclassified data sat in a place no one was watching.

The second shift is AI. Roughly 55% of enterprise data is estimated to be dark: unclassified, unused, and invisible to governance programs. When AI agents are pointed at enterprise data at scale, that dark data stops being just a compliance risk. It becomes a direct input to systems making autonomous decisions. An AI agent accessing an unclassified column containing Social Security Numbers isn't a hypothetical. It's the default state when classification hasn't been operationalized. Gartner found in April 2026 that organizations with successful AI initiatives invest up to four times more in data governance foundations (including classification) compared to organizations with poor AI outcomes.

The four levels of data classification

Most enterprise classification programs use a four-tier sensitivity model. Labels vary by organization, but the underlying structure is consistent.

Public (L0) covers information with no confidentiality requirement: press releases, published financial results, product documentation intended for public consumption. Exposure creates no legal risk or competitive harm.

Internal Only (L1) covers information intended for internal use that would cause little harm if inadvertently exposed. Internal process documentation, general employee communications, and aggregate operational metrics typically belong here. Controls are lighter: access is limited to the organization, but no obligation exists to encrypt in transit or restrict within teams.

Confidential (L2) covers business-sensitive information where unauthorized disclosure could create competitive harm, contractual liability, or operational damage. Customer data that doesn't rise to regulatory sensitivity, pricing strategies, contracts, and strategic plans fall in this tier. Access is restricted to authorized roles, and access requests are tracked.

Restricted (L3) covers the highest-sensitivity information: regulated personal data (PII, PHI, PCI), trade secrets, executive communications, and credentials. Unauthorized disclosure creates legal exposure, regulatory liability, or safety risk. This tier requires the strongest controls: encryption at rest and in transit, strict access controls, audit logging, and in most cases masking or tokenization before data enters analytics pipelines.

NIST SP 800-60 maps this structure to impact levels (Low, Moderate, High) across confidentiality, integrity, and availability. It applies to federal information systems and is widely adopted as a reference by enterprise programs. ISO 27001 Annex A 5.12 requires organizations to classify information assets by their legal, strategic, and operational value, with periodic review by named information owners.
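
As a rough illustration, the four tiers and the controls described above could be encoded so that tooling can check an asset against its tier. This is a minimal sketch, not a standard: the control names (`org_only_access`, `audit_logging`, and so on) are hypothetical labels paraphrased from the tier descriptions, and masking is treated as always required for Restricted data even though the text says "in most cases".

```python
from enum import IntEnum


class Sensitivity(IntEnum):
    """Four-tier model; a higher value means more sensitive."""
    PUBLIC = 0         # L0
    INTERNAL_ONLY = 1  # L1
    CONFIDENTIAL = 2   # L2
    RESTRICTED = 3     # L3


# Hypothetical control labels, paraphrased from the tier descriptions.
REQUIRED_CONTROLS = {
    Sensitivity.PUBLIC: set(),
    Sensitivity.INTERNAL_ONLY: {"org_only_access"},
    Sensitivity.CONFIDENTIAL: {
        "org_only_access", "role_based_access", "access_request_tracking",
    },
    Sensitivity.RESTRICTED: {
        "org_only_access", "role_based_access", "access_request_tracking",
        "encryption_at_rest", "encryption_in_transit", "audit_logging",
        "masking_or_tokenization",  # assumption: treated as mandatory here
    },
}


def missing_controls(level, applied):
    """Return the controls an asset still lacks for its tier."""
    return REQUIRED_CONTROLS[level] - set(applied)


gaps = missing_controls(
    Sensitivity.RESTRICTED,
    {"org_only_access", "role_based_access", "encryption_at_rest"},
)
print(sorted(gaps))
```

Using an ordered enum means tiers can be compared directly (`Sensitivity.RESTRICTED > Sensitivity.CONFIDENTIAL`), which is convenient when a rule says "anything at or above Confidential".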

Four types of data classification

How data gets classified matters as much as what it gets classified as. Four approaches are in common use.

Content-based classification examines what the data contains. Pattern matching against field values (detecting Social Security Number formats, credit card structures, email patterns, health record codes) is the most direct method. Traditional content-based classification relied on regex patterns; modern implementations use ML-based detectors that identify sensitive data in fields with inconsistent formatting, partial values, or encoded representations that regex misses. For known sensitive data types at scale, content-based classification is the most reliable approach.
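
To make the regex-style approach concrete, here is a minimal sketch of a content-based detector. The patterns are deliberately simplified, the 80% threshold is an arbitrary assumption, and the function names (`detect`, `classify_column`) are illustrative rather than any product's API. The Luhn checksum is a standard way to cut false positives on card-like digit strings.

```python
import re
from collections import Counter
from typing import Optional

# Simplified patterns for illustration; production detectors are broader.
SSN_RE = re.compile(r"^\d{3}-\d{2}-\d{4}$")
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
CARD_RE = re.compile(r"^\d{13,19}$")


def luhn_valid(digits: str) -> bool:
    """Luhn checksum: double every second digit from the right."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def detect(value: str) -> Optional[str]:
    """Return a sensitive-type label for one value, or None."""
    v = value.strip()
    if SSN_RE.match(v):
        return "ssn"
    compact = v.replace(" ", "").replace("-", "")
    if CARD_RE.match(compact) and luhn_valid(compact):
        return "card_number"
    if EMAIL_RE.match(v):
        return "email"
    return None


def classify_column(values, threshold=0.8):
    """Label a column if enough non-empty sampled values share one type."""
    non_empty = [v for v in values if v and v.strip()]
    if not non_empty:
        return None
    counts = Counter(detect(v) for v in non_empty)
    counts.pop(None, None)  # ignore values that matched nothing
    if not counts:
        return None
    label, hits = counts.most_common(1)[0]
    return label if hits / len(non_empty) >= threshold else None


print(classify_column(["123-45-6789", "987-65-4321", ""]))
```

Exact-format patterns like these are what the ML-based detectors mentioned above improve on: a value with stray characters or partial masking would slip past every rule here.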

Context-based classification examines where data lives and how it flows, rather than what it contains. A field named `customer_id` in a table joined to a financial transaction schema carries a different risk profile than the same field in a marketing analytics table, even if neither contains obviously sensitive values. Context-based classification draws on schema relationships, table ownership, and data lineage to make those distinctions. It's particularly useful for indirect sensitivity: data that becomes sensitive in combination, even when no individual field triggers a content-based classifier.
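
A small sketch of the idea, under stated assumptions: tables carry a hypothetical `domain` tag, joins are recorded in a simple catalog, and only direct (one-hop) joins are considered. Real implementations would draw on richer lineage and ownership metadata.

```python
from dataclasses import dataclass, field


@dataclass
class Table:
    name: str
    domain: str  # hypothetical tag, e.g. "finance" or "marketing"
    joins_to: list = field(default_factory=list)


# Assumption: which domains count as sensitive is a policy decision.
SENSITIVE_DOMAINS = {"finance", "healthcare"}


def context_level(table_name, catalog):
    """Return 'elevated' if the table, or a table it joins to directly,
    sits in a sensitive domain; otherwise 'baseline'."""
    table = catalog[table_name]
    reachable = [table] + [catalog[n] for n in table.joins_to if n in catalog]
    if any(t.domain in SENSITIVE_DOMAINS for t in reachable):
        return "elevated"
    return "baseline"


catalog = {
    "transactions": Table("transactions", "finance"),
    "account_customers": Table("account_customers", "core", ["transactions"]),
    "campaign_audience": Table("campaign_audience", "marketing"),
}

# The same `customer_id` field gets a different context in each table.
print(context_level("account_customers", catalog))
print(context_level("campaign_audience", catalog))
```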

User-based (manual) classification assigns labels based on human judgment from data owners, stewards, or subject matter experts. It's the traditional approach for unstructured content: documents, emails, and records that don't conform to structured schemas. Manual classification doesn't scale to modern data volumes. At millions of columns across enterprise data platforms, it can only be applied to a fraction of what needs to be labeled.

Automated classification combines content-based and context-based methods, runs on a schedule or continuously, and covers structured data at scale without analyst involvement per field. After an initial full scan, incremental scanning evaluates only new or modified data, making continuous classification operationally sustainable rather than a periodic project that produces a report and goes stale.
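
The incremental step can be sketched as a comparison between when each asset last changed and when it was last scanned. This assumes the platform exposes a reliable last-modified timestamp per table or column, which not every source does; the function and argument names are illustrative.

```python
from datetime import datetime


def select_for_scan(modified_at, last_scanned):
    """Return assets that are new or changed since their last scan.

    modified_at:  dict of asset name -> last modification time
    last_scanned: dict of asset name -> time of the last completed scan
    """
    due = []
    for name, changed in modified_at.items():
        scanned = last_scanned.get(name)
        if scanned is None or changed > scanned:
            due.append(name)
    return sorted(due)


modified_at = {
    "orders": datetime(2026, 5, 20),
    "patients": datetime(2026, 5, 30),
    "new_feed": datetime(2026, 5, 31),
}
last_scanned = {
    "orders": datetime(2026, 5, 25),    # scanned after last change: skip
    "patients": datetime(2026, 5, 25),  # changed since scan: rescan
}

print(select_for_scan(modified_at, last_scanned))
```

Assets never seen before are always scanned, which is what lets a new data feed get labeled without waiting for the next full scan.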

Sensitive data classification: PII, PHI, PCI, and special categories

PII (personally identifiable information), PHI (protected health information), and PCI (payment card data) represent the primary sensitivity clusters driving most classification programs.

PII includes any data that can identify an individual: full name combined with SSN, driver's license number, financial account numbers, biometric identifiers, and geolocation tied to identity. GDPR Article 9 defines an additional "special categories" tier covering data warranting the highest protection: racial or ethnic origin, political opinions, religious or philosophical beliefs, trade union membership, genetic data, biometric data used to uniquely identify a person, health data, and data concerning sex life or sexual orientation. Most four-tier classification models map special category data to Restricted (L3) by default.

PHI is PII in a healthcare context: patient records, diagnoses, treatment histories, and insurance information. HIPAA requires covered entities to classify and protect PHI with technical safeguards including access controls, audit logs, and encryption. PCI DSS applies to payment card data: card numbers, CVVs, PINs, and cardholder names. PCI DSS prohibits storing CVVs after authorization and requires restricted-tier controls across all in-scope card data.

The practical problem: these data types often appear in columns whose names give no hint of what they contain. A column named `field_47` in a legacy data warehouse can hold Social Security Numbers. A column named `notes` can hold free-text entries embedding health information. Classification programs that rely only on column names miss the sensitivity living in the values themselves.

Data classification policy: what it needs to cover

A data classification policy is the document that governs how classification is done, who owns it, and how it's enforced. ISO 27001 Annex A 5.12 requires one as part of any information security management system. NIST SP 800-60 provides a methodology and pre-mapped catalog for federal information systems, widely adopted as a reference by private sector programs as well.

A functional policy addresses five areas. The classification scheme itself: what tiers exist, what criteria govern assignment, and who has authority to assign or change a label. Information ownership: each data asset needs an accountable owner responsible for classification decisions and periodic review. Handling requirements per tier: the specific access controls, encryption standards, and audit logging that apply to each level. Review cadence: ISO 27001 expects at least annual review, with additional review triggered when data changes materially. Exceptions: the process for data that needs to cross tiers, leave its classified environment, or be shared with third parties.
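
Two of these areas, ownership and review cadence, are easy to check mechanically. The sketch below assumes a simple asset record with an owner and a last-review date, and treats "annual" as 365 days; the record fields and finding messages are hypothetical.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional

REVIEW_INTERVAL = timedelta(days=365)  # assumption: "annual" means 365 days


@dataclass
class AssetRecord:
    name: str
    tier: str
    owner: Optional[str]
    last_reviewed: date


def policy_findings(asset, today):
    """Return policy gaps for one asset: missing owner, overdue review."""
    findings = []
    if not asset.owner:
        findings.append("no accountable owner")
    if today - asset.last_reviewed > REVIEW_INTERVAL:
        findings.append("review overdue")
    return findings


asset = AssetRecord("claims_history", "Restricted", None, date(2025, 1, 15))
print(policy_findings(asset, today=date(2026, 6, 1)))
```

Passing `today` explicitly rather than reading the clock inside the function keeps the check reproducible in tests and audits.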

Most organizations have a policy on paper. The gap is in enforcement: policies without automated controls are aspirational documents. Classification labels not connected to access controls, data masking rules, or AI system inputs don't do the governance work the policy was written to accomplish.

How to implement data classification

A practical implementation follows five phases.

Inventory first. Build a list of all data sources: production databases, analytics platforms, data warehouses, SaaS applications, and downstream copies. Shadow data (sensitive information copied into dev environments, analytics sandboxes, or vendor systems) is often where the highest-risk unclassified data lives. The inventory doesn't need to be complete to start; it needs to be scoped well enough to prioritize the first wave.

Prioritize by entity and risk. Start with the data subjects most likely to hold regulated data: customers, account holders, employees. Production tables in financial and healthcare systems come first; analytics copies and dev environments follow. The goal of the initial phase is surfacing the highest-risk findings, not achieving 100% coverage immediately.

Configure classifiers to your environment. Pre-built classifier bundles for HIPAA, PCI, and GDPR cover standard regulatory data types. Custom classifiers handle organization-specific identifiers: proprietary account numbers, internal employee IDs, and product codes that encode sensitive attributes. Classifiers should combine column name matching with value-level detection; relying on column names alone misses the large share of sensitive data in fields with generic or misleading names.
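
A minimal sketch of combining the two signals, assuming a single SSN classifier and an arbitrary 50% match threshold. It reports which signal fired, so a reviewer can see that a column like `field_47` was flagged by its values rather than its name. The hint and pattern tables are illustrative, not a pre-built bundle.

```python
import re

# Illustrative single-classifier tables; real bundles cover many types.
NAME_HINTS = {"ssn": re.compile(r"ssn|social_security", re.IGNORECASE)}
VALUE_PATTERNS = {"ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b")}


def classify(column_name, sample_values, min_share=0.5):
    """Return {label: [signals]} where signals are 'name' and/or 'value'."""
    signals = {}
    for label, hint in NAME_HINTS.items():
        if hint.search(column_name):
            signals.setdefault(label, []).append("name")
    values = [v for v in sample_values if v]
    for label, pattern in VALUE_PATTERNS.items():
        if not values:
            continue
        share = sum(1 for v in values if pattern.search(v)) / len(values)
        if share >= min_share:
            signals.setdefault(label, []).append("value")
    return signals


# Generic column name, sensitive values: caught by value detection only.
print(classify("field_47", ["123-45-6789", "078-05-1120"]))
# Free-text column with an embedded identifier in half the sampled rows.
print(classify("notes", ["SSN 123-45-6789 on file", "n/a"]))
```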

Run a full scan to establish the baseline, then switch to incremental. A full scan surfaces what sensitive data exists, where it lives, and at what sensitivity level. After the baseline is established, incremental scanning (evaluating only new or modified data) keeps classification current without repeating full-table scan costs on every run. Continuous incremental scanning is what converts a point-in-time inventory into ongoing visibility.

Connect classification to controls. Classification findings that live only in a report aren't governing anything. The operational step is connecting findings to the systems that enforce them: access controls, data masking policies, and AI system inputs. Labels need to travel with data through lineage so controls apply consistently as data moves across pipelines.
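
One way to picture labels traveling through lineage is as a graph walk in which every downstream asset inherits the highest tier of anything feeding it. The sketch below makes a conservative assumption that no transformation reduces sensitivity; it does not model masking or tokenization steps that would legitimately lower a label. Tiers are the integers 0 to 3, matching L0 through L3.

```python
from collections import deque


def propagate_labels(labels, edges):
    """Push sensitivity tiers downstream through lineage.

    labels: dict of asset -> tier (0-3) for assets already classified
    edges:  list of (upstream, downstream) lineage pairs
    Returns a new dict where each asset carries the highest tier
    found among itself and everything upstream of it.
    """
    downstream = {}
    for src, dst in edges:
        downstream.setdefault(src, []).append(dst)

    result = dict(labels)
    queue = deque(labels)
    while queue:
        node = queue.popleft()
        for child in downstream.get(node, []):
            # Only raise a label, never lower it; this also guarantees
            # the loop terminates even if the lineage graph has cycles.
            if result.get(child, -1) < result[node]:
                result[child] = result[node]
                queue.append(child)
    return result


labels = {"raw.patients.ssn": 3, "raw.web.page_views": 0}
edges = [
    ("raw.patients.ssn", "staging.patient_profile"),
    ("raw.web.page_views", "staging.patient_profile"),
    ("staging.patient_profile", "analytics.churn_features"),
]
print(propagate_labels(labels, edges))
```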

Where data classification breaks down in AI environments

Two failure modes are specific to AI contexts, which traditional classification frameworks weren't designed for.

The first is classification drift. Data moves, pipelines transform it, new columns get added, and existing fields get repurposed. A column classified as Internal six months ago may now contain Restricted data because of a schema change or a new data feed. Static classification (labels assigned once and not revisited) becomes inaccurate as environments evolve. Research from Knostic found that 91% of production models experience measurable data drift over time. Classification labels that don't update as data changes become increasingly unreliable as AI systems depend on them for access decisions.
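
Detecting this kind of drift amounts to comparing the label on record with what a fresh scan finds. A minimal sketch, assuming tiers are stored as integers 0 to 3 and that only upward changes (or newly appeared columns) need attention:

```python
def find_drift(stored, detected):
    """Return columns whose freshly detected tier exceeds the stored one.

    stored:   dict of column -> tier on record (0-3)
    detected: dict of column -> tier found by the latest scan
    Result maps column -> (old_tier, new_tier); old_tier is None for
    columns that had no label at all.
    """
    drift = {}
    for column, new_tier in detected.items():
        old_tier = stored.get(column)
        if old_tier is None or new_tier > old_tier:
            drift[column] = (old_tier, new_tier)
    return drift


stored = {"crm.accounts.ref_code": 1, "crm.accounts.region": 1}
detected = {
    "crm.accounts.ref_code": 3,   # repurposed field now holds Restricted data
    "crm.accounts.region": 1,     # unchanged
    "crm.accounts.field_47": 3,   # new column, never labeled
}
print(find_drift(stored, detected))
```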

The second is scale. An analyst reviewing classification findings for 200 tables is feasible. Reviewing the 20,000 columns an AI agent touches across a dozen data sources in a single workflow is not. Manual review processes that work for human data access can't keep pace with the volume and speed of AI data consumption. Classification programs that depend on human review at the point of access weren't designed for the environment they're now operating in.

Both failure modes point to the same requirement: continuous, automated classification with labels that refresh as data changes and findings connected to the systems that enforce access policy, not just documented in a quarterly report.

Classification as the entry point to AI governance

When an AI agent runs a query, it doesn't ask for permission. It accesses whatever columns it can reach. The question a classification program needs to answer goes beyond "what sensitivity tier does this data belong to?" to "should this AI system be allowed to touch it at all?"

Real-time AI governance connects classification findings directly to enforcement: sensitivity signals that flow from the classification layer into AI access controls, blocking agents from accessing Restricted fields without requiring a policy review at the moment of each query. Gartner predicts that 80% of unauthorized AI agent transactions through 2028 will stem from internal policy violations, not external attacks. In most cases, those violations happen because the AI system wasn't told what it wasn't allowed to access. Labels from a connected classification program are what inform those access decisions.
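
In its simplest form, that enforcement is a lookup performed before the query runs. The sketch below is a generic illustration, not a description of any particular product: each agent is assumed to have a maximum permitted tier, labels are integers 0 to 3, and columns with no label are blocked by default, which is a design assumption rather than something the text above prescribes.

```python
def authorize(agent_max_tier, requested_columns, labels):
    """Split requested columns into (allowed, blocked) for one agent.

    agent_max_tier:    highest tier (0-3) this agent may read
    requested_columns: columns the agent's query would touch
    labels:            dict of column -> classified tier
    Unclassified columns are blocked (deny-by-default assumption).
    """
    allowed, blocked = [], []
    for column in requested_columns:
        tier = labels.get(column)
        if tier is None or tier > agent_max_tier:
            blocked.append(column)
        else:
            allowed.append(column)
    return allowed, blocked


labels = {
    "orders.total": 1,
    "orders.discount_code": 2,
    "customers.ssn": 3,
}
allowed, blocked = authorize(
    agent_max_tier=2,
    requested_columns=[
        "orders.total", "orders.discount_code",
        "customers.ssn", "legacy.field_47",
    ],
    labels=labels,
)
print("allowed:", allowed)
print("blocked:", blocked)
```

Blocking unlabeled columns is the stricter choice: it turns classification coverage gaps into visible access denials instead of silent exposure.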

When classification findings include full lineage context (tracking which upstream sources fed the classified data and which downstream systems it flows into), the governance chain becomes traceable and auditable. Compliance teams can see not just "this column is Restricted" but "this column is Restricted, it feeds these AI pipelines, and these are the controls currently enforced on it." That's the difference between a classification program as a documentation exercise and one that actively governs AI behavior.

If your team is building this out, Bigeye's Data Classification product provides automated discovery, continuous scanning, and sensitivity signals that feed directly into AI Guardian for real-time agent access enforcement. Data lineage and data governance capabilities are included in the Agent Trust Hub. Teams evaluating how classification connects to guardian agent capabilities or the broader AI trust hub infrastructure will find both covered in depth. A free trial is available.

What is data classification?

Data classification is the process of labeling data assets by their sensitivity, regulatory status, and risk profile, then applying appropriate controls based on those labels. Enterprise programs typically use a four-tier model: Public (no confidentiality requirement), Internal Only (internal use, low harm if disclosed), Confidential (business-sensitive, restricted access), and Restricted (highest sensitivity, including regulated personal data such as PII, PHI, and PCI). Classification is the foundational step for access control, compliance documentation, and AI governance programs: each label defines what a data asset is and what controls apply to it.

What are the four levels of data classification?

The four standard levels are Public (L0), Internal Only (L1), Confidential (L2), and Restricted (L3). Public data has no confidentiality requirement. Internal Only data is meant for use inside the organization and causes little harm if exposed. Confidential data is business-sensitive and restricted to authorized roles. Restricted data carries the highest sensitivity: regulated data including PII, PHI, and PCI, trade secrets, and credentials. NIST SP 800-60 maps these levels to impact tiers across confidentiality, integrity, and availability. ISO 27001 requires organizations to define and maintain these classifications with named information owners and at least annual review.

What are the four types of data classification?

The four common classification types are content-based (detecting sensitive patterns in field values), context-based (assessing sensitivity from schema relationships and data lineage), user-based or manual (human assignment by data owners and stewards), and automated (combining content and context methods to classify data at scale without per-field analyst involvement). In enterprise environments with millions of columns and continuous data movement, only automated classification keeps labels accurate at the pace data changes. Manual (user-based) methods remain useful for unstructured content, but they can't cover structured data at AI scale.

What is a data classification policy?

A data classification policy defines an organization's classification tiers, the criteria for assigning data to each tier, information ownership responsibilities, handling requirements per tier (access controls, encryption standards, audit logging), and the review cadence for existing classifications. ISO 27001 Annex A 5.12 requires one as part of any information security management system. NIST SP 800-60 provides a structured methodology for federal and enterprise programs. A classification policy without automated enforcement backing it (connected to access controls and AI system inputs) is a documentation exercise rather than a working governance program.

about the author

Bigeye Staff

Bigeye Staff represents the collective voice of the Bigeye team. Each article is informed by the expertise of individual contributors and strengthened through collaboration across our engineers, data experts, and product leaders, reflecting our shared mission to help teams build trust in their data.
