Thought leadership
-
August 23, 2023

What is data discovery?

How do you break the cycle of confusing, unusable data and use your vast data stores to drive business value? "Data discovery" is key.

Liz Elfman

With data volumes growing exponentially each year, many companies are drowning in their data. New business leaders don’t know which tables answer basic questions around revenue and usage. Often, due to duplicated tables, they can’t even figure out a single trustworthy answer. How do you break this cycle and tap into vast data assets to drive real business value?

That's where data discovery comes in. In this post, we explore what data discovery entails, its key capabilities, tooling, and best practices.

Data discovery: Objectives and benefits

Is your data searchable and accessible? "Data discovery" encompasses the solutions, tools, and practices that allow users to easily search, access, understand and analyze an organization's data assets.

One of the primary objectives of data discovery is to provide users with more efficient data access. Organizations deal with vast volumes of data spread across multiple sources, and understanding all of that data's overall structure and quality is challenging. Data discovery cuts down the time spent searching for and prepping data.

Data discovery also plays a pivotal role in improving data reliability. Through data discovery, teams profile data to assess its accuracy, consistency, and completeness. If duplicates or discrepancies occur, the process of data discovery roots them out.

Finally, data discovery supports data governance strategies by providing a clear view of data's location, lineage, and usage. This transparency makes it easier to establish accountability for data quality and protection, enforce data standards, and manage data access rights. Better data governance leads to higher trust in data and its derived insights.

The five concepts of data discovery

Make sure you understand these foundational concepts associated with data discovery:

1. Data source identification: The process of finding and recognizing where data is located within an organization's systems and infrastructure. These data sources can be databases, data warehouses, data lakes, cloud storage, external data feeds, and more. Understanding these sources is critical to performing data discovery.

2. Data profiling: Not all data matters equally: some data is more critical than the rest. Data profiling is the process of examining and summarizing a dataset to understand its structure. It includes statistical analysis, data type recognition, and detection of anomalies like missing values, outliers, or duplicates. Profiling provides enough understanding of the data to determine its reliability and fitness for various analytical purposes.

For example, Uber introduced a tiering system for its data assets, classifying them from Tier 1 (highly critical) to Tier 5 (individually-owned data in staging with no guarantees). Out of over 130k datasets, only 2,500 were identified as Tier 1 and Tier 2. This significantly cut down on the amount of high-priority maintenance work that needed to be done.
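In code, a basic profiling pass can be as simple as computing per-column summary statistics. The sketch below is illustrative rather than taken from any particular tool: it counts missing values, distinct values, and the numeric range of a column, which is often enough to start judging fitness for use.

```python
def profile_column(values):
    """Summarize one column: row count, missing values, distinct values, numeric range."""
    non_null = [v for v in values if v is not None]
    numeric = [v for v in non_null if isinstance(v, (int, float))]
    return {
        "count": len(values),          # total rows observed
        "missing": len(values) - len(non_null),
        "distinct": len(set(non_null)),
        "min": min(numeric) if numeric else None,
        "max": max(numeric) if numeric else None,
    }

# Example: profile a small order-amount column with one missing value.
stats = profile_column([1, 2, 2, None, 10])
```

Real profiling tools add type inference, histograms, and outlier detection on top of summaries like this, but the shape of the output is similar.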

3. Data classification: In data discovery, classification is used to categorize data based on predefined criteria. This process helps understand data types, like whether it is sensitive (like PII or confidential business information) or non-sensitive, structured or unstructured. Classification can also help identify data subject to regulatory compliance like GDPR, CCPA, or HIPAA.
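A common starting point for classification is pattern matching over sampled values. The sketch below is a toy example (the patterns and threshold are invented for illustration; production classifiers combine column-name hints, value patterns, and manual review) that tags a column as a PII type when most sampled values match:

```python
import re

# Illustrative patterns only; real systems cover many more PII types.
PII_PATTERNS = {
    "email": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def classify_column(sample_values, threshold=0.5):
    """Tag a column as a PII type if enough sampled values match a pattern."""
    for label, pattern in PII_PATTERNS.items():
        hits = sum(bool(pattern.fullmatch(str(v))) for v in sample_values)
        if sample_values and hits / len(sample_values) >= threshold:
            return label
    return "non-sensitive"
```

A classification like this can then drive access controls or flag tables that fall under GDPR, CCPA, or HIPAA scope.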

4. Data cataloging: A data catalog is a structured and enriched inventory of data assets. Cataloging involves the creation of metadata that describes various aspects of the data, like source, owner, format, and usage. This not only improves data accessibility but also supports data governance initiatives.

Maintaining a comprehensive data catalog makes it easier to search and discover relevant data for specific analytical tasks. Then, analysts and data scientists can access the right data easily and make informed decisions.

The implementation details of a data catalog service differ from company to company. But, looking at examples from Shopify, Meta, and others, they typically involve (1) an ingestion pipeline, (2) an indexing service, and (3) a front-end component for end users. The ingestion pipeline fetches metadata from a variety of data stores, then processes and stores it in a search index such as Elasticsearch. That index is then exposed to end users, for example through GraphQL APIs via an Apollo client.
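A toy version of that flow — ingest metadata, index it, search it — might look like the sketch below. All names here are invented for illustration, and the in-memory inverted index stands in for a real search service such as Elasticsearch:

```python
from dataclasses import dataclass

@dataclass
class CatalogEntry:
    """Minimal metadata record for one data asset."""
    name: str
    source: str
    owner: str
    description: str

class CatalogIndex:
    """In-memory stand-in for the indexing service behind a data catalog."""
    def __init__(self):
        self._docs = []       # all ingested entries
        self._inverted = {}   # token -> set of doc ids containing it

    def ingest(self, entry):
        """Ingestion pipeline step: tokenize the entry and index it."""
        doc_id = len(self._docs)
        self._docs.append(entry)
        for token in f"{entry.name} {entry.description}".lower().split():
            self._inverted.setdefault(token, set()).add(doc_id)

    def search(self, term):
        """Front-end step: look up assets matching a search term."""
        return [self._docs[i] for i in sorted(self._inverted.get(term.lower(), set()))]

# Example: ingest two assets, then search as an analyst would.
idx = CatalogIndex()
idx.ingest(CatalogEntry("fct_revenue", "warehouse", "finance", "daily revenue rollup"))
idx.ingest(CatalogEntry("dim_users", "warehouse", "growth", "user dimension table"))
matches = idx.search("revenue")
```

The production versions of each piece are far richer (schemas, quality metrics, relevance ranking), but the ingest–index–search shape is the same.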

5. Data lineage: Understanding data lineage – data's origins, transformations, and where it moves over time – is key in data discovery. It assists in root-cause analysis, data quality tracking, and impact analysis, and it is essential for compliance and auditing. How does a change in data impact the rest of the organization? Data lineage can tell you.

At Shopify, data lineage is built on top of a graph database. This allows users to search and filter the dependencies by source, direction (upstream vs. downstream), and lineage distance (direct vs. indirect). The lineage data offers users insight into how other teams use a data asset and informs owners about potential downstream impacts due to modifications.
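Shopify's implementation details aren't fully public, but as an illustration, a lineage graph can be modeled as an adjacency map and filtered by direction and distance. The table names below are invented; the traversal is a plain breadth-first search:

```python
from collections import deque

# Hypothetical lineage edges: each table maps to the tables it feeds (downstream).
EDGES = {
    "raw_orders": ["stg_orders"],
    "stg_orders": ["fct_revenue", "fct_orders"],
    "fct_revenue": ["revenue_dashboard"],
}

def downstream(table, max_distance=None):
    """Return downstream dependencies via BFS, optionally capped by lineage distance."""
    seen, out = {table}, []
    queue = deque([(table, 0)])
    while queue:
        node, dist = queue.popleft()
        if max_distance is not None and dist >= max_distance:
            continue  # stop expanding past the requested lineage distance
        for child in EDGES.get(node, []):
            if child not in seen:
                seen.add(child)
                out.append(child)
                queue.append((child, dist + 1))
    return out
```

With `max_distance=1` this returns only direct dependencies; with no cap it returns the full downstream impact set — the kind of query an owner would run before modifying a table.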

The relationship between data discovery, data governance and data quality

A lot of concepts in data management can sound and feel similar. So let's explore some key questions surrounding the intricacies of data discovery and its related practices and frameworks.

How are data discovery and data governance related?

Data discovery and data governance are like two sides of the same coin in managing a company's data.

Data discovery is about finding and understanding your data. It involves figuring out where your data comes from, what it's about, and how it can be sorted and understood. It's like making a detailed map of all your data and using tools to better understand what's in it.

Data governance is different in that it serves as a rulebook for your data. It outlines the rules for who can use what data, when, and how they can use it. It also involves setting up procedures to ensure everyone follows these rules.

These processes work closely together. Once you've found and understood your data with data discovery, you need to make sure it's managed properly according to the rules set out by data governance. This partnership ensures the data is used in the right way and that the company stays within any legal or regulatory rules. The process of finding and understanding data can also highlight areas where the rules can improve.

How are data discovery and data quality related?

Data quality involves checking whether your data is reliable, complete, and accurate. The connection between data discovery and data quality is crucial: if your data is high-quality, the insights you get from the discovery process will be more accurate and useful.

The discovery process can also improve data quality, as it generally involves things like assigning ownership and generating metadata. It also surfaces and exposes more data to business users. Because these users typically have business context and expectations about what the data should look like, data discovery lets them raise questions or flag potential anomalies and inaccuracies with the data quality team.

Data discovery tools

Several commercial and open source tools exist to enable data discovery capabilities. The following tools will help you organize data, enable data observability, manage details about the data, create visual data representations, and perform flexible data analyses.

  • Cloud services like AWS Glue Data Catalog and Azure Purview play a vital role in cataloging, classifying, and organizing metadata. They provide a centralized repository to store information about various data assets, including data sources, schemas, and data quality metrics. These catalogs streamline data discovery by allowing users to search and discover relevant data assets easily.
  • Data governance tools like Collibra and Alation provide comprehensive solutions for data lineage, quality, security, and lifecycle management. They facilitate collaboration among data stakeholders, ensuring a consistent understanding of data assets and promoting data-driven decision-making.
  • ETL/ELT platforms like dbt and Airflow play a crucial role in data profiling, preparation, and catalog integration. They automate the extraction, transformation, and loading of data from various sources, ensuring data consistency and quality. They often also come with auto-documentation features that make it easier to keep track of data job metadata.

Data discovery best practices

  • Data profiling and catalog maintenance: Utilize data profiling tools to understand data characteristics and maintain an up-to-date data catalog for efficient search and discovery.
  • Data lineage tracking: Implement data lineage to trace data transformations and maintain data accuracy across processes.
  • Metadata management and governance: Establish robust metadata management and data governance practices to promote consistency and compliance.
  • Data observability and quality: Implement data pipeline and quality monitoring and a remediation process with the help of a tool like Bigeye. This will help teams quickly identify and rectify any issues that come up during the discovery process and ensure data is reliable moving forward.

In addition to these practices, adopt an iterative approach to data analysis that allows for step-by-step exploration, helping uncover hidden trends and relationships. Don't ignore data security and privacy safeguards. Finally, document your data discovery processes and provide cross-functional training.

Final thoughts

To begin your data discovery journey, start by asking and addressing these questions:

  1. Where does your data reside? Identify all the sources of your data. This includes databases, cloud storage, data lakes, external APIs, and more.
  2. What is the quality of your current data? Sort the data into different tiers that need to be maintained at different levels of quality.
  3. What tools and resources do you currently have? Take inventory of the existing tools, platforms, and resources at your disposal. Determine if you will need to purchase an off-the-shelf data discovery tool.
  4. How will the results and insights be communicated and implemented? Decide on the channels and methods for disseminating the insights derived from data discovery to relevant stakeholders, whether it will be a UI, a data catalog, or simply deleting a bunch of tables.

Bigeye's data observability tool can play a critical role in enacting your data discovery framework. Want to learn more? Try Bigeye today.
