Pythian Blog: Technical Track

Rethinking the Definition of Production in Data-Centric Environments

 

The IT world often speaks in terms of production (PROD) and non-production (NPROD), where non-production can cover an endless set of functions including user acceptance testing, quality assurance, development, validation, or staging. This separation typically distinguishes production environments, which hold highly sensitive data and demand high levels of uptime and stability, from non-production environments, which hold subsets of data: enough for testing, but limited to minimize risk.

This IT-centric language is often adopted by business partners in finance and accounting, customer support, and logistics organizations to plan working schedules and systems availability. As organizations become more data-centric, the need for data grows more urgent in order to drive effective business processes and decisions. For certain types of work, especially in the data science world, the idea of NPROD data doesn’t exist. Specifically, the work done by data scientists and data engineers often can’t be done on the scrubbed, de-identified, or otherwise incomplete data sets that are common in NPROD environments.

Data-driven environments have a fundamentally different set of needs around testing, deployment, and visibility than traditional business applications. They need access to fresh, frequently updated data so that data engineers and data scientists can deliver outputs and recommendations on a timeline that has a positive impact on business decisions and customer experiences.

IT organizations can continue to maintain tiers of environments to facilitate the testing of new services, patches, and vendor tools. But as an organization grows its data investments, it also needs environments that house unrestricted data sets, are protected from outside threats, and are flexible in their services and compute capabilities.

For data engineering and data science work products, the key measure of availability is whether the products, and the output data they present, are available to users for consumption. Where IT teams test key functionality, reliability, and integration before releasing new capabilities to users, data-centric products and capabilities focus on testing to evaluate:

  1. Model Bias: Testing and third-party validation to ensure that new models are not reinforcing biases found in training data sets.
  2. Model Accuracy: Comparison of new model outputs against existing baselines to confirm that accuracy has not decreased.
  3. Responsiveness: Measurement of the response time for new models and for any microservices created to facilitate querying the model and its outputs.
  4. Data Completeness: Testing to ensure that new data products have all necessary data elements, meet minimums for completeness of records, and match key counts based on experience with previous revisions of similar data products.
  5. Drift & Performance Over Time: Many models perform differently over time as input data changes or as seasonal events occur. Testing over a period of time is needed to ensure that no adverse behavior affects user experiences.
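Two of the checks above, model accuracy against a baseline and data completeness, can be sketched as simple release gates. This is a minimal illustration in Python, not a prescribed implementation: the names `GateResult`, `accuracy_gate`, and `completeness_gate`, along with every threshold, are hypothetical and would be replaced by whatever an organization's own standards call for.

```python
from dataclasses import dataclass


@dataclass
class GateResult:
    name: str
    passed: bool
    detail: str


def accuracy_gate(y_true, baseline_pred, candidate_pred, tolerance=0.0):
    """Pass when candidate accuracy is not below baseline accuracy minus tolerance."""
    def accuracy(pred):
        return sum(p == t for p, t in zip(pred, y_true)) / len(y_true)

    base = accuracy(baseline_pred)
    cand = accuracy(candidate_pred)
    return GateResult(
        "accuracy",
        cand >= base - tolerance,
        f"baseline={base:.3f} candidate={cand:.3f}",
    )


def completeness_gate(records, required_fields, min_complete_ratio,
                      expected_count, count_tolerance=0.1):
    """Pass when enough records carry every required field and the
    record count is close to what previous revisions produced."""
    complete = sum(
        all(r.get(f) is not None for f in required_fields) for r in records
    )
    ratio = complete / len(records) if records else 0.0
    count_ok = abs(len(records) - expected_count) <= count_tolerance * expected_count
    return GateResult(
        "completeness",
        ratio >= min_complete_ratio and count_ok,
        f"complete_ratio={ratio:.3f} count={len(records)} expected={expected_count}",
    )


# Hypothetical labels and predictions, purely for illustration.
labels = [1, 0, 1, 1]
acc = accuracy_gate(labels, baseline_pred=[1, 0, 0, 1], candidate_pred=[1, 0, 1, 1])

# Hypothetical records; the second one is missing a required field.
rows = [
    {"id": 1, "region": "east"},
    {"id": 2, "region": None},
    {"id": 3, "region": "west"},
]
comp = completeness_gate(rows, ["id", "region"], min_complete_ratio=0.9, expected_count=3)

for result in (acc, comp):
    print(result.name, "PASS" if result.passed else "FAIL", result.detail)
```

Bias, responsiveness, and drift would need their own gates (for example, per-group error rates, latency measurements, and repeated evaluation over time), which this sketch does not attempt.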

While IT will often deploy one version of an application, or one version of an upgrade, the data-centric side of an organization will often deploy multiple versions of a data model or data product for consumption so that results and outcomes can be compared and measured, and their impact and value determined. Architecture for data platforms must support this environment, in which multiple parallel paths for data transformation and model deployment are present and simultaneously feed data, decisions, recommendations, and personalization to users.
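One way to picture parallel versions serving users at the same time is a deterministic assignment of each user to a version, with outcomes logged per version so they can be compared later. The following Python sketch assumes a hash-based split; the function `assign_version`, the class `OutcomeLog`, and the version labels are hypothetical names introduced only for illustration.

```python
import hashlib
from collections import defaultdict


def assign_version(user_id, versions):
    """Map a user to one of the deployed versions, stably across calls."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return versions[int(digest, 16) % len(versions)]


class OutcomeLog:
    """Collect an outcome metric per version so versions can be compared."""

    def __init__(self):
        self._outcomes = defaultdict(list)

    def record(self, version, value):
        self._outcomes[version].append(value)

    def mean_by_version(self):
        return {v: sum(vals) / len(vals) for v, vals in self._outcomes.items()}


# Hypothetical version labels for two models deployed side by side.
deployed = ["recommender_v1", "recommender_v2"]
log = OutcomeLog()

# Hypothetical (user, outcome) observations, e.g. 1.0 = user clicked.
for user, clicked in [("alice", 1.0), ("bob", 0.0), ("carol", 1.0), ("alice", 0.0)]:
    log.record(assign_version(user, deployed), clicked)

print(log.mean_by_version())
```

Hashing the user identifier, rather than choosing at random per request, keeps a given user on the same version, which makes the measured outcomes easier to attribute to one version.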

These needs for multiple parallel paths supporting data products and models can be enabled through data mesh architectures. Such an architecture lets an organization define data landing zones in terms of bronze, silver, and gold, and then define the quality, completeness, and readiness of the data landed in each zone. A data mesh architecture also gives an organization the flexibility to deploy different services that enable data consumption from a variety of platforms, services, or integration points.
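As a rough illustration of defining zones by quality, completeness, and readiness, the Python sketch below declares a standard for each zone and reports the highest zone a data set qualifies for. The specific criteria and thresholds (`min_completeness`, `schema_validated`, `deduplicated`) are assumptions made for this example; the post does not specify what each zone requires.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ZoneStandard:
    name: str
    min_completeness: float   # minimum share of complete records (assumed metric)
    schema_validated: bool    # zone requires schema validation (assumed criterion)
    deduplicated: bool        # zone requires deduplication (assumed criterion)


# Ordered from least to most refined; thresholds are illustrative only.
ZONES = (
    ZoneStandard("bronze", min_completeness=0.0, schema_validated=False, deduplicated=False),
    ZoneStandard("silver", min_completeness=0.95, schema_validated=True, deduplicated=False),
    ZoneStandard("gold", min_completeness=0.99, schema_validated=True, deduplicated=True),
)


def highest_zone(completeness, schema_validated, deduplicated):
    """Return the name of the most refined zone whose standard the data meets."""
    qualified = None
    for zone in ZONES:
        meets = (
            completeness >= zone.min_completeness
            and (schema_validated or not zone.schema_validated)
            and (deduplicated or not zone.deduplicated)
        )
        if not meets:
            break
        qualified = zone.name
    return qualified


print(highest_zone(completeness=0.97, schema_validated=True, deduplicated=False))
```

Keeping the standards in one declared structure means each landing zone's expectations are explicit and can be checked the same way by every pipeline that lands data there.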

Data-driven companies must think differently about their approach to technology landscapes. The traditional PROD and NPROD separation is not enough to ensure that data engineers and data scientists have access to complete data sets for exploration, feature engineering and modeling.

Bringing in new design paradigms, including data meshes, enables organizations to create parallel sets of data pipelines, data products, and data models for consumption by diverse data consumers while ensuring that data privacy and protection standards are maintained.

I hope you found this post useful. Let me know in the comments if you have any questions, and don’t forget to sign up for the next post.
