Pythian Blog: Technical Track

To Consider: Data Vault, IoT, and Real-Time Reporting


The Advantages of Data Vault in Agile Data Modeling

Data Vault offers a highly agile approach to data modeling, transforming the traditional methodology of building data architecture. The Data Vault approach allows for incremental development of the data model, eliminating the need to create a fully comprehensive model from the outset. In contrast to traditional methodologies, which require the complete design of the data model during the initial planning phases, the Data Vault enables flexibility in data pipeline implementation.

Challenges with Traditional Approaches

  1. A complete data model must be developed before implementation can begin. A lack of domain knowledge or subject matter expertise in even a small area can delay the entire project.
  2. Small mistakes can introduce significant risks, which become costly and complex to correct once the system is built. This often results in quick fixes, leading to suboptimal, “Band-Aid” solutions.
  3. Introducing new systems or making design changes can become an overwhelming challenge for data architects, requiring significant rework.

How Data Vault Addresses These Challenges

Data Vault's agile structure, with its classification of entities into Hubs, Links, and Satellites, enables iterative development. The initial phases focus on accurately identifying business keys and performing thorough data profiling to ensure data quality. The model's success hinges on robust data quality controls.

Because development proceeds incrementally, minor mistakes can be corrected without affecting other areas of the system, as long as they don’t involve primary keys. Integrating new systems and accommodating design enhancements is also much simpler than with traditional approaches.

Example: IoT Integration
Consider an IoT integration in Snowflake with the following architecture. Events are ingested in real time through Azure IoT Hub, landed in Blob Storage, staged in Snowflake as a VARIANT column, and transformed into a structured data warehouse for reporting.
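
As a rough illustration of the staging-to-structured step, the sketch below parses one raw JSON event, the kind of payload that would sit in a VARIANT column, into a flat record. The field names (device_id, event_id, ts, readings) and the sample values are hypothetical; the actual payload shape depends on the device.

```python
import json

# Hypothetical raw event, as it might be staged in a semi-structured column.
RAW_EVENT = (
    '{"device_id": "MD-001", "event_id": "E-1001", '
    '"ts": "2024-01-01T00:00:00Z", '
    '"readings": {"voltage": 11.9, "current": 0.42}}'
)


def flatten_event(raw: str) -> dict:
    """Turn one nested JSON event into a flat, column-like record."""
    event = json.loads(raw)
    record = {
        "device_id": event["device_id"],
        "event_id": event["event_id"],
        "event_ts": event["ts"],
    }
    # Promote each sensor reading to its own column; readings the device
    # does not send are simply absent from the record.
    for name, value in event.get("readings", {}).items():
        record[name] = value
    return record


if __name__ == "__main__":
    print(flatten_event(RAW_EVENT))
```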



The Evolution of IoT Devices and Challenges in Data Vault Integration

IoT devices evolve over time. For example, a sophisticated medical device equipped with IoT capabilities may initially transmit a set of basic attributes. As the device undergoes enhancements—such as the integration of new sensors—the scope of its event data expands.

For instance:

  • Phase-1: The medical device may emit attributes such as Voltage, Current, and Shutdown time.
  • Phase-2: With the addition of a temperature sensor, the device would then emit Temperature, Voltage, Current, and Shutdown time.

At a high level, we could model the following Hubs and Satellites in Data Vault:

  • Device Hub: Captures the business keys of individual devices, with a corresponding Satellite holding device attributes.
  • Event Hub: Represents individual events streamed from the device, with a Satellite recording sensor values like Voltage, Current, etc. To accommodate new attributes, an additional Satellite for events would be created to capture the new data, such as Temperature.
  • Device Event Link: Bridges the relationship between Device and Event.
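
The model above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the table and column names are hypothetical, and deriving keys by hashing the business key is an assumed convention that the text does not prescribe. The point it shows is that the Phase-2 attribute arrives as a new Satellite while the Phase-1 Satellite stays untouched.

```python
import hashlib


def hash_key(*business_keys: str) -> str:
    """Deterministic key from one or more business keys (assumed convention)."""
    normalized = "||".join(k.strip().upper() for k in business_keys)
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


# Attributes owned by each event Satellite (hypothetical names). The Phase-2
# Satellite is added next to the Phase-1 one; nothing existing is altered.
SATELLITES = {
    "sat_event_power": ("voltage", "current", "shutdown_time"),  # Phase-1
    "sat_event_temperature": ("temperature",),                   # Phase-2
}


def to_vault_rows(event: dict) -> dict:
    """Split one flat event into Hub, Link, and Satellite rows."""
    device_hk = hash_key(event["device_id"])
    event_hk = hash_key(event["event_id"])
    rows = {
        "hub_device": {"device_hk": device_hk, "device_id": event["device_id"]},
        "hub_event": {"event_hk": event_hk, "event_id": event["event_id"]},
        "link_device_event": {
            "link_hk": hash_key(event["device_id"], event["event_id"]),
            "device_hk": device_hk,
            "event_hk": event_hk,
        },
    }
    for sat_name, columns in SATELLITES.items():
        payload = {c: event[c] for c in columns if c in event}
        if payload:  # write a Satellite row only if the event carries its attributes
            rows[sat_name] = {"event_hk": event_hk, **payload}
    return rows


phase1 = {"device_id": "MD-001", "event_id": "E-1", "voltage": 11.9,
          "current": 0.42, "shutdown_time": "2024-01-01T02:00:00Z"}
phase2 = {**phase1, "event_id": "E-2", "temperature": 36.6}

if __name__ == "__main__":
    print(sorted(to_vault_rows(phase1)))
    print(sorted(to_vault_rows(phase2)))
```

A Phase-1 event produces no row for the temperature Satellite, while a Phase-2 event produces rows for both, and both events resolve to the same Device Hub key.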

A star-schema-like view, materialized view, or dynamic table is then built by joining Hubs, Links, and Satellites to produce the dimensions and measures required for reporting. These joins tend to be complex.
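
To give a feel for the joins involved, here is a small sketch that uses SQLite (through Python's standard sqlite3 module) as a stand-in for Snowflake. All table names, column names, and sample rows are hypothetical, and a real Data Vault would also carry load timestamps and record sources, which are omitted here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE hub_device (device_hk TEXT PRIMARY KEY, device_id TEXT);
CREATE TABLE hub_event (event_hk TEXT PRIMARY KEY, event_id TEXT);
CREATE TABLE link_device_event (device_hk TEXT, event_hk TEXT);
CREATE TABLE sat_event_power (event_hk TEXT, voltage REAL, current_amps REAL);
CREATE TABLE sat_event_temperature (event_hk TEXT, temperature REAL);

INSERT INTO hub_device VALUES ('d1', 'MD-001');
INSERT INTO hub_event VALUES ('e1', 'E-1'), ('e2', 'E-2');
INSERT INTO link_device_event VALUES ('d1', 'e1'), ('d1', 'e2');
INSERT INTO sat_event_power VALUES ('e1', 11.9, 0.42), ('e2', 12.1, 0.40);
INSERT INTO sat_event_temperature VALUES ('e2', 36.6);

-- Flattened, fact-like view: one row per event, built from four joins.
CREATE VIEW v_device_event AS
SELECT d.device_id, e.event_id, p.voltage, p.current_amps, t.temperature
FROM link_device_event l
JOIN hub_device d ON d.device_hk = l.device_hk
JOIN hub_event e ON e.event_hk = l.event_hk
JOIN sat_event_power p ON p.event_hk = l.event_hk
-- LEFT JOIN keeps Phase-1 events that have no temperature reading.
LEFT JOIN sat_event_temperature t ON t.event_hk = l.event_hk;
""")

rows = conn.execute(
    "SELECT device_id, event_id, voltage, current_amps, temperature "
    "FROM v_device_event ORDER BY event_id"
).fetchall()

if __name__ == "__main__":
    for row in rows:
        print(row)
```

Even this toy view needs four joins for a single event grain; each additional Satellite adds one more.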

Challenges in Real-Time IoT Reporting with Data Vault

While this approach works well for modeling processes, there are significant challenges in using Data Vault for real-time IoT reporting:

  1. Data Quality: Data Vault relies heavily on natural keys rather than surrogate keys, making accurate data profiling to identify business keys critical. Any errors in profiling can break the model’s integrity.
  2. Complex Reporting Integration: Traditional BI tools are designed to work with Dimensional Models, not Data Vault. This necessitates the creation of intermediate layers, such as views or materialized views, to bridge the gap. These views often involve complex SQL queries, and when additional logic is layered on top, it can result in highly complex queries.
  3. Cloud Cost Considerations: Platforms like Snowflake charge for compute, so computationally intensive operations translate directly into cost. Complex SQL can significantly increase compute costs, especially as refresh intervals shrink toward near real-time reporting. For example, we've observed a dramatic increase in Snowflake compute credits when reducing the batch ingest interval from one hour to one minute.
  4. Complexity of Schema Design: Data Vault’s structure is more intricate than traditional normalized schemas or dimensional models, so data consumers and analysts may find it hard to query. ETL developers must also fully understand Data Vault principles, particularly its focus on data quality and the relationships between Hubs, Links, and Satellites. This means unlearning some habits of traditional dimensional modeling while keeping in mind that the ultimate goal is still effective reporting.
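
The cost effect described in point 3 can be approximated with back-of-the-envelope arithmetic. The sketch below is a deliberately simplified model, not a billing calculator: it assumes every refresh resumes a warehouse, that each resume is billed for at least 60 seconds, and that the warehouse consumes one credit per hour. The 20-second run time is a made-up figure, and idle time before auto-suspend is ignored.

```python
def daily_credits(interval_minutes: float,
                  run_seconds: float,
                  credits_per_hour: float = 1.0,
                  min_billed_seconds: float = 60.0) -> float:
    """Estimate credits per day for a refresh that runs every interval."""
    runs_per_day = 24 * 60 / interval_minutes
    # Each run is billed for at least the minimum, even if it finishes sooner.
    billed_per_run = max(run_seconds, min_billed_seconds)
    # A run cannot be billed for longer than the interval between runs.
    billed_per_run = min(billed_per_run, interval_minutes * 60)
    return runs_per_day * billed_per_run / 3600 * credits_per_hour


if __name__ == "__main__":
    hourly = daily_credits(interval_minutes=60, run_seconds=20)
    every_minute = daily_credits(interval_minutes=1, run_seconds=20)
    print(f"hourly refresh:     {hourly:.2f} credits/day")
    print(f"one-minute refresh: {every_minute:.2f} credits/day")
```

Under these assumptions the hourly schedule costs about 0.4 credits per day, while the one-minute schedule keeps the warehouse effectively always on at about 24 credits per day, a 60-fold increase for the same query.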

Conclusion

While Data Vault presents considerable advantages, especially regarding agility, its application for real-time IoT reporting requires careful evaluation. It is crucial to assess the trade-offs, including design complexity, computational costs, and the necessary human resources, before proceeding.

If the trade-offs are deemed unfavorable, it may be prudent to adhere to the traditional Dimensional Model for Business Intelligence (BI) reporting. The primary benefits of the Dimensional Model include:

  • Simplicity in Schema Design: The schema is straightforward, with many queries being simple SELECT ... FROM ... GROUP BY statements.
  • Effective View and Aggregation Logic: This simplifies performance optimization.
  • Rigid Dimensions with Flexible Measures: While dimensions are fixed, new measures can be added through additional fact tables, offering a degree of agility.
  • Effective Representation of Business Structure: The logical design mirrors the business structure effectively.
  • Reduced Computational Costs: SQL queries involve fewer joins, and optimizations such as denormalization and aggregate tables can significantly lower compute costs compared to Data Vault.
  • Compatibility with Machine Learning: The model is nearly ready for machine learning applications, with dimensions corresponding to features and measures corresponding to determinants.
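
For comparison with the Data Vault joins, a typical dimensional query might look like the sketch below, again using SQLite through Python's sqlite3 module as a stand-in. The dim_device and fact_event tables and their contents are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_device (device_key INTEGER PRIMARY KEY, device_id TEXT, model TEXT);
CREATE TABLE fact_event (device_key INTEGER, voltage REAL, temperature REAL);

INSERT INTO dim_device VALUES (1, 'MD-001', 'A'), (2, 'MD-002', 'B');
INSERT INTO fact_event VALUES (1, 12.0, 36.0), (1, 14.0, 38.0), (2, 10.0, 35.0);
""")

# One join from the fact table to the dimension, then a plain aggregation.
summary = conn.execute("""
    SELECT d.model, COUNT(*) AS events, AVG(f.voltage) AS avg_voltage
    FROM fact_event f
    JOIN dim_device d ON d.device_key = f.device_key
    GROUP BY d.model
    ORDER BY d.model
""").fetchall()

if __name__ == "__main__":
    for row in summary:
        print(row)
```

The reporting query needs a single join and a GROUP BY, which is the simplicity the first bullet refers to.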
