Pythian Blog: Technical Track

Discussing Data Lineage: Its Definition, Use, and Value

Previously, we discussed metadata and how it has become the connective glue in modern data architectures, giving different technologies common layers of reference for process and access automation. Centralized metadata storage through common data catalogs and feature stores ensures a corporate-wide view into data, complexity, risk, and value, and enables separate engineering teams to implement their preferred technologies while maintaining common reference points centrally.
 
One specialized type of metadata is lineage. The dictionary defines lineage as “descent from an ancestor,” but in a data context it describes the stages of movement and transformation that data passes through in an enterprise. These stages of creation, integration, transformation, and consumption can be complex and numerous in modern enterprises, with the same element moving dozens of times across different systems for consumption or enrichment. This near-constant change makes it difficult to trace the path data has taken when something goes wrong, or when regulatory requirements dictate that data be handled in specific ways, masked, or deleted.
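One way to picture these stages is as a directed graph, where each edge records a movement from one dataset to another. The sketch below is a minimal illustration, not a description of any particular tool: the dataset names, the stage labels, and the helper functions are all hypothetical.

```python
from collections import defaultdict

# Hypothetical lineage edges: (source, target, stage). All names are illustrative.
EDGES = [
    ("crm.customers", "staging.customers", "integration"),
    ("staging.customers", "warehouse.dim_customer", "transformation"),
    ("web.orders", "warehouse.fact_orders", "integration"),
    ("warehouse.dim_customer", "reports.revenue_by_customer", "consumption"),
    ("warehouse.fact_orders", "reports.revenue_by_customer", "consumption"),
]


def build_parents(edges):
    """Map each dataset to the datasets it was directly derived from."""
    parents = defaultdict(list)
    for source, target, _stage in edges:
        parents[target].append(source)
    return parents


def ancestors(dataset, parents):
    """Return every upstream dataset that feeds `dataset`, directly or indirectly."""
    seen = set()
    stack = [dataset]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


if __name__ == "__main__":
    parents = build_parents(EDGES)
    # List everything the report ultimately depends on.
    print(sorted(ancestors("reports.revenue_by_customer", parents)))
```

Walking the graph upstream like this answers the question "where did this data come from?", which is the starting point for most of the uses discussed in the rest of this post.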

Lineage data is often collected centrally with third-party tools that draw on the different logging methods used across the enterprise to provide a consolidated view of data, movement, dates, users, and transformations. Collecting the complete enterprise picture can be complex and time-consuming because of the multitude of systems in use today and the different ways they expose the underlying metadata needed to extract and unify lineage details.
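To make the unification step concrete, here is a small sketch of normalizing two differently shaped logs into one common lineage record. Both log formats are invented for illustration (a dictionary-style ETL job log and a pipe-delimited warehouse audit line), as are the field names; real systems will expose this metadata in their own ways.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class LineageEvent:
    """Common record covering data, movement, date, user, and transformation."""
    source: str
    target: str
    operation: str
    user: str
    occurred_at: datetime


def from_etl_log(record):
    """Hypothetical ETL job log: a dict with an epoch-seconds finish time."""
    return LineageEvent(
        source=record["input_table"],
        target=record["output_table"],
        operation=record["job_type"],
        user=record["run_by"],
        occurred_at=datetime.fromtimestamp(record["finished_epoch"], tz=timezone.utc),
    )


def from_warehouse_audit(line):
    """Hypothetical audit line: 'ISO_TIMESTAMP|user|operation|source|target'."""
    timestamp, user, operation, source, target = line.strip().split("|")
    return LineageEvent(source, target, operation, user, datetime.fromisoformat(timestamp))


def consolidate(etl_records, audit_lines):
    """Merge both feeds into a single timeline ordered by event time."""
    events = [from_etl_log(record) for record in etl_records]
    events += [from_warehouse_audit(line) for line in audit_lines]
    return sorted(events, key=lambda event: event.occurred_at)
```

Each additional system needs only its own small adapter function, which is one reason the work tends to grow with the number of systems in use.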

Who uses data lineage information?

While many organizations are in the early stages of capturing complete lineage data and making it accessible, a growing number of business teams want this data to effect organizational change. These data consumers span the traditional technology teams, as well as back-office support functions, corporate compliance, and product teams.

The first consumers of lineage data in most organizations are compliance and legal teams. They use lineage information to ensure compliance with consumer-facing obligations for the deletion or approved use of data. In many enterprises, these obligations arise from regulations such as CCPA, CPRA, GDPR, and GLBA. Lineage data is especially useful for compliance because it provides clear proof of data creation, destruction, and consumption across the enterprise.

Product teams will often use data lineage information to determine whether there are value-add data sets or data derivatives that can be added to existing data products to enhance value and usability. Lineage information can give product management teams insight into how their data is used, which uses remain untapped, and which data is consumed most often, all of which inform product investment decisions.

Meanwhile, engineering teams will often use data lineage as a measure of complexity in the data engineering environment. Strong engineering teams will leverage lineage data to simplify pipelines, enabling lower compute costs for transformations and minimizing steps that could lead to pipeline failures or performance bottlenecks. Engineering teams can leverage this information to define more complete data models, systems roadmaps, and transformation pipelines that meet the evolving needs of their data consumers.
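As a rough illustration of lineage as a complexity measure, the sketch below counts the longest chain of steps between any original source and a given dataset. The dataset names and the choice of "longest chain" as the metric are assumptions made for this example; teams may prefer other measures, such as the number of distinct systems touched.

```python
from functools import lru_cache

# Hypothetical lineage: each dataset mapped to the datasets it is built from.
PARENTS = {
    "staging.customers": ("crm.customers",),
    "warehouse.dim_customer": ("staging.customers",),
    "warehouse.fact_orders": ("web.orders",),
    "reports.revenue_by_customer": ("warehouse.dim_customer", "warehouse.fact_orders"),
}


@lru_cache(maxsize=None)
def depth(dataset):
    """Longest chain of steps from any original source to `dataset`.

    Assumes the lineage graph has no cycles.
    """
    parents = PARENTS.get(dataset, ())
    if not parents:
        return 0  # an original source system has no upstream steps
    return 1 + max(depth(parent) for parent in parents)


def by_complexity(datasets):
    """Order datasets from the longest upstream chain to the shortest."""
    return sorted(datasets, key=depth, reverse=True)


if __name__ == "__main__":
    for name in by_complexity(PARENTS):
        print(name, depth(name))
```

Datasets at the top of such a ranking are natural candidates for pipeline simplification, since every extra step is another place for cost or failure to accumulate.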

DataOps teams and other groups tasked with the operational stability of data pipelines will often use lineage information for troubleshooting and early event detection, identifying transformation steps that may not have completed properly or points where errant data entered data products and systems. This early warning of adverse effects allows rapid engagement, troubleshooting, and remediation in highly complex, ever-evolving environments.
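A simple version of this kind of impact analysis is a downstream walk from the step that failed. The sketch below is illustrative only; the edge list and the function name are hypothetical, and it assumes lineage has already been captured as source-to-target pairs.

```python
from collections import defaultdict, deque

# Hypothetical lineage edges as (source, target) pairs.
EDGES = [
    ("crm.customers", "staging.customers"),
    ("staging.customers", "warehouse.dim_customer"),
    ("web.orders", "warehouse.fact_orders"),
    ("warehouse.dim_customer", "reports.revenue_by_customer"),
    ("warehouse.fact_orders", "reports.revenue_by_customer"),
]


def impacted_by(failed_dataset, edges):
    """Return downstream datasets affected by a failure, nearest first."""
    children = defaultdict(list)
    for source, target in edges:
        children[source].append(target)

    impacted = []
    seen = {failed_dataset}
    queue = deque([failed_dataset])
    while queue:
        current = queue.popleft()
        for child in children.get(current, []):
            if child not in seen:
                seen.add(child)
                impacted.append(child)
                queue.append(child)
    return impacted


if __name__ == "__main__":
    # Which data products should be flagged if the staging load fails?
    print(impacted_by("staging.customers", EDGES))
```

Because the walk is breadth-first, the result lists the closest affected datasets first, which is usually the order in which a team would want to check them.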

Lineage can often indicate the level of effort needed to take data from upstream business systems and ready it for consumption by automated processes, analytical models, or third parties. These measures can identify areas for business process improvement, UI enhancements, or improved policies that enable the collection of more complete, higher-quality data. Business process improvement and transformation teams will often use lineage data to understand how data flows through the organization and its systems, and where business process changes will have the greatest positive impact.

As organizations look to build holistic data governance programs, lineage is becoming a key focal point to drive visibility, ensure compliance, and give business teams the insight into data usage they need to make informed product decisions. Tools like DataHub are bringing this idea to the forefront with modern approaches to capturing lineage details for programmatic consumption.

Lineage data will only continue to grow and evolve as our data landscapes become larger, more complex, and more dynamic. By capturing lineage data programmatically early in your data governance program, you build capabilities for teams to consume while accelerating decision-making and the fulfillment of the company’s operational needs.

In our next discussion, we’ll explore the geospatial components of data governance. We’ll also examine the changing regulatory requirements based on where the data is created, as well as the value of capturing geospatial metadata, and the evolving technology platforms to successfully leverage geospatial data. Make sure to sign up for updates so you don’t miss it.
