4) Now via DQS client, let’s create a Knowledge Base!
5) I’ve created following domains:
And a composite domain: Full Address which contains domains: Address line, city, state and zip in the composite domains.
6) And for the composite domain Full Address – this is how you configure reference data-sets:
7) After creating the Knowledge Base, start a new DQS project. Here the cleaning happens at the composite domain level and this is a demo of how DQS uses third party reference data-set to classify records as suggested, new, invalid, corrected or correct:
You can see that DQS corrected few records and mentions that the reason in this case was Melissa data’s address verify reference data-set:
When you see a price of an item as -10 (negative 10) – you know it’s not right! It’s BAD data!
When you see some person’s height as 120 inches (10ft!) – you know that’s just not true! Again, BAD data!
It’s not uncommon to see such values in REAL data sets. In this post, we’ll see how SQL Server 2012’s Data Quality Services would help us clean unrealistic a.k.a invalid values in our datasets and bring us a step closer to a CLEAN data!
You’re an analyst for professional basketball team Dallas Mavericks and You need a Player Roster of your Basketball players having CLEAN data for analysis purposes. Someone recently passed you a player roster, great! you have data!
Some of the Data about Height of players just doesn’t look right! A player with named Dominique Jones had height of just 5 inches.
That’s just not right. So for now, you may clean this data-set manually by looking up right values but wouldn’t it be great if you had a Data Cleaning solution that flags unrealistic values for your review every time you get such data-sets? So the analyst decided to build a solution in Data Quality Services – Here are the steps:
Steps to create a Knowledge Base in DQS and clean the invalid values:
Now, First let’s create the Knowledge Base:
You decide to create a rule that checks for player heights that’s less than 5ft and greater than 7ft 6inches for your review. So here’s what the Knowledge Base would look like:
Now after the Knowledge base (KB) has been published, it’s time to use this KB for a Data Quality Project.
Note: This KB can be re-used for creating KB’s for other teams too.
Here’s the result of running the results on our Data – Set:
Note that it says that height has 3 invalid records. Let’s correct them!
In the next tab, you can see the invalid records:
You can correct this invalid values one by one and then approve them:
After that in the next steps, you can see the results – I have highlighted records that we cleaned!
And here’s our clean data-set:
Yay! An analyst of a professional basketball team was able to quickly clean a data-set. And he can re-use the knowledge base he had built to clean other data-sets too in the future! Isn’t that nice?
In this post, we saw how to clean unrealistic or invalid records in a data set by using domain rules in SQL Server 2012 Data Quality Services.
I was playing with Data Quality Services when I though that it would be nice if I could implement what I had learned and create a quick demo that shows how to standardize data using the technology. So here it goes:
Note: This is not a step by step Guide – I am assuming that you are aware of different pieces of DQS. I am going to post a brief outline of how did I go about implementing a quick demo.
1) SQL Server 2012 Data Quality Services
2) SQL Server Integration services to show the DQS cleansing component in action.
We want to standardize to Gender to have only two values M & F – but now data coming from two files creates data inconsistency issues:
Two files that look like this:
File 1: (Excel)
File 2: (CSV)
Let’s see how SQL server 2012’s data quality services would help us solve this problem:
1) I created a Knowledge base with a domain Gender and created domain values that looks like this:
2) Tested the Knowledge base quickly by creating a Data Quality Project. It worked!
3) Switched to SSIS and created a Data Flow which uses following components:
What would be interesting to you to see mapping tab in the DQS cleansing transformation component, so here it is:
5) And let me show you the Target File – note that you can choose to NOT have Gender_Source, Gender_Status & Record Status. The column of interest is Gender_Output