Pythian Blog: Technical Track

Data profile: better knowing your data

Have you ever needed to start a new data analysis project, or create a report for business users, by querying a database you have never worked with before? Or did you simply need to understand the data distribution of a database so you could create better indexing strategies?

Working as a consultant, I constantly face this challenge: I have to work with a customer’s database that I don’t know very well. For instance, does that “Gender” column store data as “M” and “F” or as “Male” and “Female”? Or do they even use a bit column for that? (Yes, I have seen that a lot.) Does that “Surname” column accept NULL values? If so, what percentage of the rows contain NULL for that specific column? In a date/time column, what are the minimum and maximum values, so I can create my “Time” dimension in a data warehouse?

This data discovery process, where I need an overview of the data, usually takes a lot of time and a lot of query writing: running DISTINCT, MIN, MAX and AVG queries and analyzing the result of each one individually. Even with the many really good third-party code-completion tools out there, it is a cumbersome task, and sometimes the customer is not willing to wait while I learn everything about their environment before expecting results.
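To give an idea of the kind of query writing I mean, here is a minimal sketch in Python. It uses an in-memory SQLite database instead of SQL Server so that it runs anywhere, and the Person table, its columns, the sample rows and the profile_column helper are all made up for illustration.

```python
import sqlite3


def profile_column(conn, table, column):
    """Return row count, NULL percentage, distinct count, min and max for one column.

    Table and column names are interpolated into the SQL text,
    so only pass names you trust.
    """
    total, nulls, distinct, min_val, max_val = conn.execute(
        f"""
        SELECT COUNT(*),
               SUM(CASE WHEN "{column}" IS NULL THEN 1 ELSE 0 END),
               COUNT(DISTINCT "{column}"),
               MIN("{column}"),
               MAX("{column}")
        FROM "{table}"
        """
    ).fetchone()
    nulls = nulls or 0  # SUM returns NULL on an empty table
    null_pct = 100.0 * nulls / total if total else 0.0
    return {
        "rows": total,
        "null_pct": null_pct,
        "distinct": distinct,
        "min": min_val,
        "max": max_val,
    }


# Hypothetical sample data, only here to make the sketch runnable.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Person (Gender TEXT, Surname TEXT, BirthDate TEXT)")
conn.executemany(
    "INSERT INTO Person VALUES (?, ?, ?)",
    [
        ("M", "Silva", "1980-05-17"),
        ("F", None, "1992-11-02"),
        ("Male", "Souza", "1975-01-30"),
        ("F", "Lima", None),
    ],
)

for col in ("Gender", "Surname", "BirthDate"):
    print(col, profile_column(conn, "Person", col))
```

With these sample rows, Surname comes back with 25% NULLs and Gender with three distinct values (“M”, “F” and “Male”), which is exactly the kind of surprise you want to find early. Now imagine repeating this for every column of every table.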

Today I want to show you a not-so-new feature in SQL Server that will help with the data discovery process. The feature is the Data Profiling Task in SQL Server Integration Services, together with the Data Profile Viewer.

Now is the time when you ask me, “Data what?!”

It’s easy, you’ll see. One of the several tasks in SQL Server Integration Services that you never use, and never took the time to google what it is used for, is called the Data Profiling Task. This task allows you to select a table and the kind of data analysis you want to run on that table or its columns. When you run the SSIS package, it analyzes the table and generates an XML file. Once you have the XML file, all you need to do is open it with the Data Profile Viewer, which takes care of presenting the XML in a nice user interface, as you can see in Figure 1.

 

DataProfile-Image1

Figure 1: Data Profile Viewer

Cool, now let’s see how to create our own analysis.

Step 1: Open SQL Server Data Tools (SSDT), or Business Intelligence Development Studio (BIDS) if you’re using SQL Server 2008 R2 or earlier

Step 2: Create a new SSIS project

Step 3: Add the Data Profiling Task to your project

DataProfile-Image2

Step 4: Double-click the Data Profiling Task so we can configure it. In the General tab we have to set the Destination, that is, the location where you want to save the XML file. You can save it directly to the file system using a File Connection, or store it in an XML variable inside your package in case you want to do something else with the XML, such as storing it in a database. Let’s leave the default FileConnection value for the Destination Type option and click New File Connection in the Destination option.

DataProfile-Image3

Step 5: Now we can choose the file location. In my example I am using one of the most used folders ever on Windows: the “tmp” folder, sometimes also called “temp” or just “stuff”. (Note: the author doesn’t recommend storing everything in folders called temp, nor saving everything on the desktop.)

DataProfile-Image4

Step 6: OK, we’re back at the main window. Now we have to choose the database, the table and the kind of analysis we want to run. We have two options. The first is to use the Profile Requests tab and choose each data analysis, table and column one by one. The other option, and also the simplest, is to use the Quick Profile tab. With this option we pick one specific table and the analyses we want to run on it. If you want to run the analysis on multiple tables, you will have to go through the Quick Profile option once for each table (nothing in this world is perfect).

DataProfile-Image5

As you can see in the image above, I have chosen the Production.Product table of the AdventureWorks2012 database. In the Compute option you choose which data analyses you want to run. The names of the options more or less explain what they do, but if you want a detailed explanation of each one you can check the product documentation at this link: https://technet.microsoft.com/en-us/library/bb895263.aspx

Now all you have to do is to run the SSIS package to create the XML file. Once you’re done, you can use the Data Profile Viewer tool to open the XML and analyze its results.

DataProfile-Image6

The Data Profile Viewer is a simple tool that doesn’t need much explanation. Just try it for yourself and you’ll certainly like the result.
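If you would rather take a first look at the XML from a script, for example before storing it in a database, a small sketch like the one below can help. It deliberately assumes nothing about the schema of the file the Data Profiling Task writes: it only counts how often each element name appears, ignoring namespaces. The summarize_elements helper and the sample XML are hypothetical and are not the real output format.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def local_name(tag):
    """Strip an ElementTree '{namespace}' prefix from a tag name."""
    return tag.rsplit("}", 1)[-1]


def summarize_elements(xml_text):
    """Count how many times each element name occurs in an XML document."""
    root = ET.fromstring(xml_text)
    return Counter(local_name(el.tag) for el in root.iter())


# Made-up XML, only here to make the sketch runnable.
# It is NOT the schema produced by the Data Profiling Task.
SAMPLE = """<Report xmlns="urn:example:profile">
  <Item name="Color"><Value>Red</Value><Value>Blue</Value></Item>
  <Item name="Size"><Value>M</Value></Item>
</Report>"""

summary = summarize_elements(SAMPLE)
for name, count in summary.most_common():
    print(name, count)
```

To try it on a real profile, read the file you chose in Step 5 and pass its contents to summarize_elements. The element names you get back give you a quick idea of which analyses ended up in the file.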

I hope this can help you to save some time when you need to quickly learn more about the data inside a database. If you have any questions or want to share what your approach is when you need to complete this task, just leave a comment!

 
