Pythian Blog: Technical Track

Defining Data Retention

Previously, we discussed the value of geospatial data and how it drives much of the personalization and value found in today's mobile applications. With that value comes risk, which can be managed if addressed holistically across the organization through strong data governance programs. Automating enforcement, by leveraging policies defined by legal teams together with user context carried with the data, helps ensure compliance across systems.

Moving beyond geospatial data, we turn to data retention: the policies necessary to ensure data availability while minimizing risk, and the technical implementations needed to enforce those policies.

Retention is defined as "holding onto or keeping possession of something." That is a solid starting point for data retention, but it is important to note that the definition has no element of time: neither how long the data is retained nor how long it remains accessible while in your possession. Both considerations must be included in the organization's data retention policies and systems design assets.

The reasons to retain data are endless and vary by organization and industry. While the list of potential retention reasons is long, organizations should work to shorten the list to the specific reasons why and when they retain data, and what data is in-scope for the policy. If there is no defined business value to data, it should be purged.

  • Legal Requirements: There's a long list of legal requirements to retain specific types of records for set durations. These span financial documents, data about mergers and acquisitions, employee details, and contracts. This is the most common driver of data retention policies and often sets the minimums and maximums for retention across broad sets of record types.
  • Historical Context: Organizations often maintain records of products sold, warranty details, or repair data to preserve a long-running history of customer engagements and to ensure support can be provided. The useful retention period varies widely by industry: retaining details about a mobile phone purchase beyond five years is of little help, as consumers replace devices frequently, while retaining car repair records for 10+ years is common.
  • Predictions: Many organizations retain historical data to assist in predicting future needs. This could include product demand, consumer behavior, or financial performance. The value of this data evolves over time, diminishing as economic and customer conditions shift.
  • Personalization: Many consumer-facing organizations retain data for set periods to personalize the buyer experience. This can be as simple as leveraging historical data to show past purchases, or as complex as recommending future purchases.
  • Lazy: Many organizations retain data simply because they have no policy defining when it must be deleted, or no enforcement of the policy they have. Data retention policies must include both conditions for when data is retained and explicit definitions of when data must be purged, so storage costs do not skyrocket for data with little or no value to the organization.

Data retention policies should be easy to understand, easy to implement programmatically, and should enable engineering teams to operate independently most of the time when working with datasets that are known and already leveraged by the organization. In addition to policy definitions, data governance leaders should ensure changes are part of data literacy plans for training and rollout, to ensure awareness across the organization.
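As a sketch of what "easy to implement programmatically" can look like, a retention policy can be expressed as structured data rather than prose. The schema, data types, and day counts below are hypothetical illustrations, not a standard:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RetentionPolicy:
    """A machine-readable retention rule (hypothetical schema)."""
    data_type: str   # e.g. "financial_record", "purchase_history"
    min_days: int    # minimum retention (legal/business floor)
    max_days: int    # maximum retention, after which data must be purged
    reason: str      # documented business or legal justification

# Example policies an organization might define (illustrative values only)
POLICIES = {
    "financial_record": RetentionPolicy(
        "financial_record", min_days=7 * 365, max_days=10 * 365,
        reason="Legal requirement"),
    "purchase_history": RetentionPolicy(
        "purchase_history", min_days=90, max_days=5 * 365,
        reason="Personalization"),
}

def retention_action(data_type: str, age_days: int) -> str:
    """Return the action the policy dictates for a record of the given age."""
    policy = POLICIES.get(data_type)
    if policy is None:
        return "escalate"        # unknown data type: route to data stewards
    if age_days > policy.max_days:
        return "purge"           # past the maximum: must be deleted
    if age_days < policy.min_days:
        return "retain"          # still inside the mandatory window
    return "retain_or_purge"     # discretionary zone between min and max
```

The "escalate" path is where the human process comes in: a dataset with no matching policy is exactly the case that should be routed to data stewards rather than silently retained.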

  • Retain and purge: Policies should include both minimums for data retention time, and maximums that drive purges of unnecessary or costly data.
  • Examples: Strong policies include examples: specific types of data, what records look like, and how to properly handle different environments or consumption models. Examples enable engineering teams to understand not just data definitions and types, but also what is considered acceptable use when subtle changes may be present in complex data sets.
  • Regularly reported: Enterprise-wide reporting should be part of your data governance program and include the volume of data by type across the organization, the systems it is stored and processed in, and whether that data is in compliance with retention and purging policies.
  • Reviewed for cost: Data retention has costs, both the infrastructure cost of storing data and the risk to the organization for the data being accessible. Policies should define regular reviews to evaluate the cost of storing specific age and data types against the business value to determine if policies need to be updated to purge data sooner than previously required.
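The reporting and review points above can be sketched as a periodic compliance summary over a dataset inventory. The inventory fields and dates here are assumptions for illustration only:

```python
from datetime import date

# Hypothetical inventory rows:
# (system, data_type, oldest_record_date, max_retention_days)
INVENTORY = [
    ("crm",       "purchase_history", date(2015, 1, 1), 5 * 365),
    ("warehouse", "financial_record", date(2019, 6, 1), 10 * 365),
]

def compliance_report(inventory, today):
    """Flag datasets whose oldest records exceed the policy's maximum retention."""
    report = []
    for system, data_type, oldest, max_days in inventory:
        age_days = (today - oldest).days
        report.append({
            "system": system,
            "data_type": data_type,
            "age_days": age_days,
            "compliant": age_days <= max_days,
        })
    return report
```

Run on a schedule, non-compliant entries become the work queue for purge jobs and steward review, and the aggregate counts feed the enterprise-wide report.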

Our discussion has focused on primary data generated and stored. Complexity is introduced when we begin to evaluate and build policies for derivative data. Today's analytical environments provide endless methods to combine data in new and unique ways. Data retention policies should account for this by maintaining a running policy document showing which new data combinations have been evaluated and had policies created. Data governance programs should define fast-track processes for engineers to bring new data combinations forward for review by legal, archive, compliance, and architecture teams, so the policy for the new derivative data type can be determined quickly and analysis can proceed.

One common method used to attempt simplifying data retention policies is retaining "all data forever." While on the surface this sounds like an easy solution, it presents growing costs and risks that most organizations are unwilling to accept long term. The need for all historical data is limited: consumer behavior changes, economic indicators shift, and industry regulations evolve, all pointing toward the need to purge data that no longer has value or creates unmanageable risk to the organization.

Data retention is a key component of your data governance programs. Policies must be defined early, shared via data literacy programs, and technical controls built to automate the retention, protection, and purging of data per policy. Controls should combine technical elements with human processes, so unexpected events are escalated to your data stewards to define new policies where warranted and update existing policies when required.

Next up, we’ll discuss the role of analytical models in our data governance programs. As more organizations have moved towards the use of predictive modeling and machine learning, the need to govern our analytical models, training sets and outputs has become critical to ensuring repeatability, eliminating bias from decision making and protecting organizational intellectual property. Make sure to sign up for updates so you don’t miss the next post.

 
