Thought leadership - August 17, 2023

Data vault modeling: Everything you need to know

What is data vault modeling? From common use cases to best practices to framework architecture, this post has you covered. Read on.

Liz Elfman

Data vault is an agile data modeling technique and architecture designed specifically for building scalable enterprise data warehouses.

First conceptualized in the 1990s by Dan Linstedt, the Data Vault methodology separates a source system's data structure from its attributes. Instead of conveying business logic through facts, dimensions, or extensively normalized tables, a Data Vault employs a direct approach, transferring data from source systems to a small set of specifically designed tables.

In recent years, data vaults have made a comeback thanks to the rising popularity of data lakehouses. Data lakehouses are a new type of data platform that combines elements of both data lakes and data warehouses. They typically store both raw data and transformed analytic models and tables, using schema-on-read to avoid needing upfront schema definitions.

Data vault modeling is well suited to the lakehouse approach: the data vault model is easily extensible, and ETL changes can be implemented with minimal rework.

This guide will provide an overview of data vault: its concepts, advantages, key considerations, best practices, and tooling.

What is data vault modeling?

At its core, the data vault model uses a simple architecture consisting of three types of tables: hubs, links, and satellites. Here’s a summary of each:

  • Hubs store unique business keys, essentially the unique identifiers for business concepts or objects.
  • Links represent relationships between the unique business keys stored in hubs. Links establish the connections or associations between different business objects.
  • Satellites hold all the descriptive data or attributes related to hubs or links, like the textual descriptions, timestamps, or numerical values. They capture the context, details, and history of business keys and their relationships.

In summary, a hub identifies a business concept, a link maps out how that concept relates to others, and a satellite provides rich details about either the concept or its relationships.

Hubs, links, and satellites: The building blocks of data vault modeling

Hubs

Hubs represent core business concepts by storing unique business keys, and they act as the foundational entities upon which the entire Data Vault structure is based. An example of a hub might be a Customer table that contains the unique identifier for each customer entity.

At a minimum, a Hub typically contains:

  • A surrogate key, which is a system-generated unique identifier.
  • The natural business key, which is the actual unique identifier from the source system.
  • Load date or load-end date timestamps indicating when the key was first loaded and possibly when it was superseded.

Hubs are highly stable as they capture the unchanging essence of a business concept. This means they aren't frequently modified even when other aspects of the data model change.
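To make this concrete, here's a minimal sketch of a hub in Python, using the standard library's sqlite3 module and an MD5 hash as the surrogate key. The table and column names (hub_customer, customer_hk) are illustrative, and the record_source column, while not listed above, is common in practice:

```python
import hashlib
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE hub_customer (
        customer_hk   TEXT PRIMARY KEY,      -- surrogate (hash) key
        customer_id   TEXT NOT NULL UNIQUE,  -- natural business key
        load_date     TEXT NOT NULL,         -- when the key was first seen
        record_source TEXT NOT NULL          -- originating system
    )
""")

def hash_key(*parts: str) -> str:
    """Derive a deterministic surrogate key from one or more business keys."""
    return hashlib.md5("||".join(parts).encode()).hexdigest()

def load_hub_customer(customer_id: str, source: str) -> None:
    # INSERT OR IGNORE keeps the load idempotent: each business key is
    # inserted at most once, which is what makes hubs so stable.
    conn.execute(
        "INSERT OR IGNORE INTO hub_customer VALUES (?, ?, ?, ?)",
        (hash_key(customer_id), customer_id,
         datetime.now(timezone.utc).isoformat(), source),
    )

load_hub_customer("C-1001", "crm")
load_hub_customer("C-1001", "billing")  # same key again: no new row
```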

Links

Links model the relationships or associations between business concepts (hubs). They encapsulate many-to-many relationships between keys, expressing how business keys relate to each other across the organization. An example of a link might be a CustomerOrder table that links customer_id to order_id, representing the relationship between customers and orders.

A link contains:

  • A surrogate key for the link itself (e.g., a customer_order_id).
  • Surrogate keys from the related Hubs to represent the relationship (customer_id, order_id).
  • Load date or load-end date timestamps capturing when the relationship was first and last recognized.

Links are dynamic, mirroring the evolving nature of business relationships. They can quickly adjust to reflect changes in the way different business concepts are interrelated.
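Continuing the hub sketch above, a link for the customer-order relationship could look like the following; an order hub is assumed but not shown, and as before the names are illustrative:

```python
conn.execute("""
    CREATE TABLE link_customer_order (
        customer_order_hk TEXT PRIMARY KEY,  -- surrogate key for the link
        customer_hk       TEXT NOT NULL,     -- key from hub_customer
        order_hk          TEXT NOT NULL,     -- key from an order hub (not shown)
        load_date         TEXT NOT NULL,
        record_source     TEXT NOT NULL
    )
""")

def load_link_customer_order(customer_id: str, order_id: str, source: str) -> None:
    # The link's own key is derived from the combined business keys, so
    # the same customer-order relationship is never recorded twice.
    conn.execute(
        "INSERT OR IGNORE INTO link_customer_order VALUES (?, ?, ?, ?, ?)",
        (hash_key(customer_id, order_id),
         hash_key(customer_id),
         hash_key(order_id),
         datetime.now(timezone.utc).isoformat(), source),
    )
```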

Satellites

Satellites store the contextual, descriptive, and temporal data associated with hubs or links. This is where the "meat" of the business information resides: attributes, textual descriptions, and numerical values. An example of a satellite might be a CustomerDetails table that contains descriptive attributes about each customer, like first_name, last_name, email, and address, with customer_id as a foreign key mapping to the Customer hub.

A satellite contains:

  • The surrogate key of the parent hub or link that the satellite describes
  • Descriptive attributes about the business key (from hubs) or the relationship (from links)
  • Load date and load-end date timestamps for each record, which help in tracking changes to attribute values over time
  • A possible record source indicating where each piece of descriptive data came from, useful in auditing scenarios

Satellites are highly volatile compared to hubs and links. They capture all the changes and variations in business data attributes, allowing a historical and audit-friendly view of the data's evolution.
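Continuing the running example, here's a sketch of such a satellite. Note the insert-driven pattern: each attribute change appends a new row, and only the load_end_date of the superseded row is closed out, which is what preserves full history:

```python
conn.execute("""
    CREATE TABLE sat_customer_details (
        customer_hk   TEXT NOT NULL,  -- parent key from hub_customer
        load_date     TEXT NOT NULL,
        load_end_date TEXT,           -- NULL marks the current record
        first_name    TEXT,
        last_name     TEXT,
        email         TEXT,
        record_source TEXT NOT NULL,
        PRIMARY KEY (customer_hk, load_date)
    )
""")

def load_sat_customer_details(customer_id, first, last, email, source):
    now = datetime.now(timezone.utc).isoformat()
    hk = hash_key(customer_id)
    # End-date the currently open record, if any, then append the new one.
    conn.execute(
        "UPDATE sat_customer_details SET load_end_date = ? "
        "WHERE customer_hk = ? AND load_end_date IS NULL",
        (now, hk),
    )
    conn.execute(
        "INSERT INTO sat_customer_details VALUES (?, ?, NULL, ?, ?, ?, ?)",
        (hk, now, first, last, email, source),
    )

load_sat_customer_details("C-1001", "Ada", "Lovelace", "ada@example.com", "crm")
load_sat_customer_details("C-1001", "Ada", "Lovelace", "ada@newmail.com", "crm")
```

A production loader would typically compare a hash of the incoming attributes against the current row and skip the insert when nothing changed; that check is omitted here for brevity.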

Data vault metadata management and scalability considerations

In the context of data vault, metadata provides the essential descriptive information about the various elements like hubs, links, and satellites, as well as about the lineage, load processes, and business context. The types of data vault metadata are as follows:

  1. Structural metadata includes definitions of tables, columns, data types, keys, indexes, constraints, and relationships. It provides insight into the schema design, the relationships between entities, and the general layout of the data within the data vault.
  2. Operational metadata includes data load timestamps, ETL job logs, transformation rules, source system identifiers, and data quality metrics. It ensures traceability and auditability, enabling teams to understand data lineage, transformation logic, and source-to-target mappings, and to troubleshoot data issues (a sketch of an audit record follows this list).
  3. Business metadata captures business-centric definitions, rules, and context of the data elements in the data vault. It ensures that data can be interpreted, contextualized, and utilized effectively by business users, bridging the gap between technical data structures and business semantics.
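As a hypothetical illustration of operational metadata, the audit table below (continuing the SQLite session from the earlier sketches) records one row per load run; meta_load_audit and its columns are assumptions, not a standard:

```python
conn.execute("""
    CREATE TABLE meta_load_audit (
        load_id       INTEGER PRIMARY KEY AUTOINCREMENT,
        target_table  TEXT NOT NULL,   -- hub, link, or satellite loaded
        record_source TEXT NOT NULL,   -- source system identifier
        rows_loaded   INTEGER NOT NULL,
        started_at    TEXT NOT NULL,
        finished_at   TEXT NOT NULL
    )
""")

def audit_load(target: str, source: str, rows: int,
               started: str, finished: str) -> None:
    # One audit row per ETL run supports lineage and troubleshooting.
    conn.execute(
        "INSERT INTO meta_load_audit "
        "(target_table, record_source, rows_loaded, started_at, finished_at) "
        "VALUES (?, ?, ?, ?, ?)",
        (target, source, rows, started, finished),
    )
```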

Scalability considerations

Metadata repositories are specialized storage systems (databases or platforms) that house and manage all the metadata components. Repositories contain tables, views, APIs, or services to capture, access, and interact with various metadata types. They provide a single point of reference for both technical teams and business users. Here's what to consider in terms of scalability:

  • Volume and complexity: As a data vault amasses a large volume of metadata, ensure the database can scale both vertically and horizontally.
  • Load and transformation metadata: As ETL processes and logic evolve, metadata accumulates; use efficient storage structures like columnar storage and optimized indexing.
  • Versioning and auditability: Implement a versioning mechanism for metadata, allowing rollback, comparison, and audit of changes.

Advantages of data vault

Data vault offers several benefits across multiple areas, depending on the size, context, and goals of the organization. In the following paragraphs, we'll walk through some of them.

Scalable architecture

At the heart of data vault's design lies its modular structure, built upon hubs, links, and satellites. This granular and modular setup allows for the efficient scaling of data structures as the data grows, both in volume and complexity.

Adaptive to changes

Data vault is designed to absorb changes seamlessly. Whether a business undergoes shifts in its processes, introduces new data sources, or modifies existing ones, the Data Vault model can easily adapt without major overhauls or disruptions to the existing system.

Agile data integration and faster time-to-market

  • Incremental data loading: The model supports the integration of new data incrementally. This means that as new data sources or attributes emerge, they can be added without a complete redesign of the warehouse.
  • Parallelization: The inherent separation between different types of data (business keys, relationships, attributes) allows for parallel loading processes. This parallelism ensures faster data ingestion and integration (see the sketch after this list).
  • Reduced dependencies: The compartmentalized structure means that changes in one part of the model don't mandate changes in other sections. This decoupling fosters an environment for agile development and faster deployments, significantly reducing the time-to-market for new features or data sources.
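One way to realize that parallelism, sketched here under the assumption that each load job is wrapped in a zero-argument callable: hubs load first because links and satellites reference their keys, after which links and satellites can run concurrently:

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Sequence

Job = Callable[[], None]

def load_batch(hub_jobs: Sequence[Job],
               link_jobs: Sequence[Job],
               sat_jobs: Sequence[Job]) -> None:
    with ThreadPoolExecutor() as pool:
        # Stage 1: hubs have no dependencies on each other, so they
        # can all load in parallel.
        list(pool.map(lambda job: job(), hub_jobs))
        # Stage 2: links and satellites need only the hub keys from
        # stage 1, not each other, so they also run in parallel.
        list(pool.map(lambda job: job(), [*link_jobs, *sat_jobs]))
```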

Support for historical data tracking and auditing requirements

  • Immutable data storage: Data vault inherently stores historical changes, particularly within Satellites. This ensures that the system maintains an immutable record of all data versions over time.
  • Auditing and compliance: The ability to track historical data in an immutable fashion supports various auditing and regulatory compliance needs. Organizations can reliably produce data snapshots from any point in time, meeting stringent data retention requirements (see the query sketch after this list).
  • Time-variant data: Data vault's structures inherently support time-variant data, capturing precisely when specific changes or additions were made, providing a robust foundation for temporal data analyses.
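For instance, a point-in-time query against the satellite sketched earlier might look like this; it relies on the ISO-8601 timestamps used above, which compare correctly as plain strings when they share a UTC offset:

```python
AS_OF_SQL = """
    SELECT h.customer_id, s.first_name, s.last_name, s.email
    FROM hub_customer h
    JOIN sat_customer_details s ON s.customer_hk = h.customer_hk
    WHERE s.load_date <= :as_of
      AND (s.load_end_date IS NULL OR s.load_end_date > :as_of)
"""

# Each customer's state exactly as it stood at the requested moment.
rows = conn.execute(AS_OF_SQL, {"as_of": "2023-06-01T00:00:00+00:00"}).fetchall()
```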

Enhanced data lineage

  • Transparent data flow: The data vault methodology inherently documents how data moves from source systems to consumption layers. This clear pathway ensures that the origin and the various transformations data undergoes are well-tracked.
  • Building trust: With clear data lineage, data consumers can place more trust in the data, fully understanding where it originates and how it's been processed.
  • Impact analysis: A well-defined lineage allows for effective impact analyses. Teams can understand the repercussions of changes in source systems, transformations, or business logic, ensuring informed decision-making.

Top five use cases for data vault

Below are the primary use cases where the Data Vault approach shines:

  1. Large-scale data integration: Merges data from diverse systems, such as during company mergers or when integrating legacy platforms with new ones, using hubs, links, and satellites for seamless integration.
  2. Evolving data sources: Adapts to rapid changes in source system schemas, making it ideal for startups or tech-driven firms thanks to its incremental modeling flexibility.
  3. Historical data analysis: Tracks and stores data changes over time, crucial for sectors like finance or healthcare where trend analysis and compliance are paramount.
  4. Data warehousing in agile environments: Supports iterative data warehousing in agile methodologies, allowing continuous integration and delivery through its modular design.
  5. Enhanced data governance and lineage: Maintains clear data origins and transformations, essential for industries requiring transparency in data flow and impact analysis.

Data vault best practices

Done correctly, your data vault architecture is a foundation that can support business users and drive business value. Below are some key best practices to follow when implementing a data vault data model:

  • Business keys identification: Pinpoint the unique identifiers in source systems that will form the foundation of your hubs.
  • Raw data layer: Maintain a "staging area" for unaltered source data for validation and transformation checks.
  • Data lineage: Prioritize comprehensive tracking of data's origin and transformations for transparency and compliance.
  • Satellite design: Structure Satellites for efficient capture and querying of historical changes.
  • Soft links: Use them judiciously, reserving them for truly complex scenarios to avoid performance issues.
  • Historical tracking: Standardize mechanisms in Satellites for uniform date and source tracking.
  • Optimized reporting: Build structures like denormalized views atop the data vault for efficient querying and reporting (a view sketch follows this list).
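As a minimal example of such a reporting structure, the view below (built on the hub and satellite sketched earlier) exposes each customer's current attributes in one denormalized shape:

```python
conn.execute("""
    CREATE VIEW customer_current AS
    SELECT h.customer_id, s.first_name, s.last_name, s.email
    FROM hub_customer h
    JOIN sat_customer_details s
      ON s.customer_hk = h.customer_hk
     AND s.load_end_date IS NULL   -- only the open (current) record
""")
```

Reporting tools can then query customer_current like any ordinary table, without knowing anything about hubs, links, or satellites.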

Tooling

Proper tooling can significantly simplify the complexity of implementing and managing a Data Vault project. Many data integration tools now provide capabilities to support data vault style modeling.

  • Oracle: Supports data vault concepts in Oracle SQL Developer Data Modeler. Provides data modeling, lineage, and governance features.
  • Databricks: Supports data vault patterns on the Databricks Lakehouse Platform.
  • Azure Data Factory: Mapping data flows enables implementing Data Vault patterns. Built-in data profiling helps identify hub entities.
  • Erwin: Models, maps, and automates the creation, population, and maintenance of data vault solutions on Snowflake.
  • VaultSpeed: Automates the creation of data vaults.
  • Talend: An ETL tool that also offers components for managing hub, link, and satellite tables, business keys, and relationship life cycles.
  • WhereScape RED: Automates development of data vault models and ETL processing. Provides templates to accelerate development.

Final thoughts

While data vault offers many benefits, it also introduces complexities worth weighing before implementation. It requires consistent key definitions across source systems, and it can result in complex link-table relationships to map.

If you're going to implement a data vault framework, you'll need extensive metadata management, and the additional tables and keys can increase ETL processing time.

Understanding these nuances upfront lets you balance the trade-offs against a traditional model. As data complexity grows, organizations need more adaptable modeling approaches. By combining standardization, modularity, and detailed historical tracking, a data vault provides a flexible way to structure enterprise data and serve changing analytics needs.

For more ways to optimize your data management, take Bigeye for a spin.
