In this post, we’ll see the matching activity in action. For demo purposes, I’ll be using data sets that I found via the Movies Sample & EIM tutorial.
Situation 1:
We have a list of movies and we want to identify “matching” movie titles.
Solution 1:
Create a Matching Policy
1) Mapping the Domains:
2) Configuring the Matching Policy:
Note: You can have Matching Rules on more than one domain. I used one domain for demo purposes.
3) View Matching results:
4) Publish the KB (which stores the matching policy)
Once you have the matching policy, you can use this in a Data Quality Project:
5) See how I ran a Data Quality Project (w/ matching activity) in the image below.
Note: You can export the clean data-set via Data Quality Project.
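To build some intuition for what a matching rule does, here’s a minimal Python sketch of fuzzy title matching using difflib’s similarity ratio. This only illustrates the idea – DQS uses its own internal matching algorithm – and the titles and threshold below are made up:

```python
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical movie titles; DQS uses its own matching algorithm internally.
titles = ["The Godfather", "Godfather, The", "The God Father", "Jaws"]

def similarity(a: str, b: str) -> float:
    """Normalized similarity score between 0 and 1."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Pair up titles whose similarity clears a threshold, roughly analogous
# to the minimum matching score you configure in a DQS matching rule.
THRESHOLD = 0.8
for a, b in combinations(titles, 2):
    score = similarity(a, b)
    if score >= THRESHOLD:
        print(f"match: {a!r} ~ {b!r} (score {score:.2f})")
```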
Situation 2:
We have a list of supplier names and we want to identify “matching” supplier names.
Note that in this situation, you will see how to use more than one domain to create a matching rule.
Solution 2:
Most of the steps remain the same as in Situation 1, but I want to show you the matching policy & matching results:
Matching Policy:
Matching results:
Also, I want to show that the matching policy (DQS KB) we created earlier can be used in Master Data Services too! For more details, check out the resource: DQS, MDS & SSIS for EIM.
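Since this situation used more than one domain in the matching rule, it helps to picture how that works conceptually: each domain contributes a similarity score, weighted the way you configure the rule. A rough sketch of that idea (the records, fields and weights are invented for illustration; this is not DQS’s actual algorithm):

```python
from difflib import SequenceMatcher

def sim(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Two hypothetical supplier records with two fields (domains) each.
r1 = {"name": "Acme Corp", "city": "Seattle"}
r2 = {"name": "ACME Corporation", "city": "Seattle"}

# Per-domain weights, analogous to the weights in a DQS matching rule.
weights = {"name": 0.7, "city": 0.3}

score = sum(w * sim(r1[field], r2[field]) for field, w in weights.items())
print(f"composite matching score: {score:.2f}")  # treat as a match above some threshold
```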
Conclusion:
In this post, we saw how DQS can be used to clean “matching” records. For a step-by-step tutorial, please refer to the Movies Sample & EIM tutorial.
Data Profiling in Data Quality Services happens at the following stages:
1) While performing Knowledge Discovery activity
1a. In the Discover step:
1b. Also in the Manage Domain Values step:
While profiling gives you statistics at various stages of the data cleaning or matching process, it is important to understand what you can do with them. With that, here are the statistics that we can garner during the knowledge discovery activity (a rough sketch of computing them follows the list):
Newness
Uniqueness
Validity
Completeness
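DQS doesn’t publish its exact formulas, but conceptually these four statistics are simple ratios over a column. Here’s a minimal Python sketch of one plausible interpretation – the sample values, known-values list and stand-in domain rule are all made up:

```python
# Hypothetical column of values; None represents a missing entry.
values = ["USA", "US", None, "Canada", "USA", "M3xico"]
known_values = {"USA", "Canada"}  # values already present in the KB

def is_valid(v):
    # Stand-in domain rule: letters only.
    return v is not None and v.isalpha()

total = len(values)
present = [v for v in values if v is not None]

completeness = len(present) / total                       # non-null ratio
uniqueness = len(set(present)) / total                    # distinct ratio
validity = sum(is_valid(v) for v in values) / total       # passes the domain rule
newness = sum(v not in known_values for v in present) / total  # unseen values

print(f"completeness={completeness:.0%} uniqueness={uniqueness:.0%} "
      f"validity={validity:.0%} newness={newness:.0%}")
```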
2) While Performing Cleansing activity:
2a. In the Cleansing step:
2b. Also in the Manage and View Results step:
Here the profiler gives you the following statistics:
Corrected values
Suggested values
Completeness
Accuracy
Note the invalid records under “Source Statistics” on the left side. In this case, 3 records didn’t pass the domain rule.
3) While performing Matching Policy activity (Knowledge Base Management)
3a. Matching policy step:
3b. Matching Results step:
Here the profiler gives the following statistics (see the sketch after this list):
Newness
Uniqueness
Number of clusters
% of matched and unmatched records
Avg, min & max cluster size
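The cluster-level numbers are easy to picture: given each record’s cluster assignment, you can derive all of them. A small sketch, with an invented assignment list (None means the record matched nothing):

```python
from statistics import mean

# Hypothetical result of a matching run: a cluster id per record,
# with None meaning the record matched nothing.
cluster_of = [1, 1, None, 2, 2, 2, None, None]

sizes = {}
for c in cluster_of:
    if c is not None:
        sizes[c] = sizes.get(c, 0) + 1

matched = sum(sizes.values())
total = len(cluster_of)

print("number of clusters:", len(sizes))
print(f"matched: {matched/total:.0%}, unmatched: {(total-matched)/total:.0%}")
print("avg/min/max cluster size:",
      mean(sizes.values()), min(sizes.values()), max(sizes.values()))
```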
4) While performing Matching activity (Data Quality Project)
4a. Matching step:
4b. Export step:
Here the profiler gives the following statistics:
Newness
Uniqueness
Completeness
Number of clusters
% of matched and unmatched records
Avg, min & max cluster size
Conclusion:
In this post, I listed the statistics provided by the profiler while performing the knowledge discovery, cleansing, matching policy and matching activities in SQL Server 2012 Data Quality Services.
In this post, I’ll show you how composite domains can help you create cross-domain rules in Data Quality Services.
Scenario:
You have a data set of employee name, employee category and yearly salary. You want to validate the value in the yearly salary column based on the employee category. Here are the business rules:
Note: for the purpose of the demo, every number is in dollars.
Now, the rule in the Table can be read as:
If employee category is A then yearly salary should be greater than 100000 and less than 200000.
Note: I have kept it simple for demo purposes.
Now here is our Data-Set before we set out to validate it:
Employee Name     | Employee Category | Yearly Salary
Jon V Yang        | A                 | 127000
Eugene L Huang    | B                 | 90000
Ruben Torres      | C                 | 83000
Christy Zhu       | D                 | 70000
Elizabeth Johnson | A                 | 90000
Julio Ruiz        | C                 | 65000
Janet G Alvarez   | D                 | 43000
Marco Mehta       | B                 | 81000
*Names are taken from the AdventureWorks database; the category and salary values are purely fictional.
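Before the DQS walkthrough, here’s a minimal Python sketch of what the cross-domain check amounts to. Only category A’s range comes from the rule above; the B, C and D ranges are placeholders I invented for the sketch:

```python
# Salary range (in dollars) per employee category. Only category A's range
# comes from the post's rule; B, C and D are invented for this sketch.
salary_rules = {
    "A": (100_000, 200_000),  # from the rule above
    "B": (80_000, 100_000),   # assumption
    "C": (60_000, 85_000),    # assumption
    "D": (40_000, 75_000),    # assumption
}

employees = [
    ("Jon V Yang", "A", 127_000),
    ("Elizabeth Johnson", "A", 90_000),  # violates rule A
]

for name, category, salary in employees:
    low, high = salary_rules[category]
    status = "valid" if low < salary < high else "INVALID"
    print(f"{name} ({category}, {salary}): {status}")
```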
Solution:
(This is just an overview; it’s not covered in a step-by-step fashion.)
1. Created a KB > created three domains: Employee Category, Employee Name and Yearly Salary
2. Created a composite domain:
3. Under Composite Domain (CD) Rules Tab:
I started out by defining the rules for category A:
And I finished by specifying business rules for all four categories.
4. Published KB
5. Created a New DQS project > Selected the KB created above
6. Selected the data source > Mapped domains
7. I also selected the composite domain from the list of available composite domains:
8. After seeing the cleansing statistics, I switched to the invalid tab to see the records that didn’t match the rules:
9. So by now, we have identified records that do not match the rules. A data steward can now correct them if he/she wants to, or leave them as they are. Notice the Approve/Reject check boxes.
Note: Not only can you update the yearly salary, but you can also update the employee category. So if you think that an employee has been wrongly categorized, you can change that.
10. After this, you can export the data set, which now has records that match the business rules, and the data set would be ready to be consumed!
Conclusion:
In this post, we saw how to create cross-domain rules using composite domains w/ an example of Employee Category and Yearly Salary.
After data profiling, you realize that there are records such as “my Company Inc.” and “my Company Incorporated” – so you set out to automatically find these mismatched terms inside a value and correct them.
BEFORE cleaning the data, the report showed that “my Company Inc” revenue is less than that of Google:
AFTER cleaning the data, the report correctly shows that “my Company Inc” revenue is more than that of Google:
Steps taken to clean data:
(Just an overview, not covered in a step-by-step fashion)
1. Created the Knowledge Base w/ two domains: Company Names & Revenue
2. Term Based Relations Tab of Company Names domain:
3. Published the KB > Let’s create a DQS project
4. Mapped the domains:
5. DQS cleaned the following records:
6. Exported the data and created a report out of clean data-set!
The business user is happy 🙂
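If you’re curious what a Term Based Relation boils down to, here’s a tiny Python sketch of the idea: replace a term wherever it appears inside a value, leaving the rest of the value alone. The pairs below mirror this post’s example; the code is an illustration, not DQS’s implementation:

```python
import re

# Term-to-correction pairs, like entries on the Term Based Relations tab.
term_relations = {"Incorporated": "Inc", "Inc.": "Inc"}

def apply_term_relations(value: str) -> str:
    """Replace whole terms inside a value, leaving the rest untouched."""
    for term, correction in term_relations.items():
        value = re.sub(rf"\b{re.escape(term)}(?!\w)", correction, value)
    return value

print(apply_term_relations("my Company Incorporated"))  # -> my Company Inc
print(apply_term_relations("my Company Inc."))          # -> my Company Inc
```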
Conclusion:
In this post, we saw how to correct a word/term within a value in a domain. The example we used was Inc., Inc and Incorporated. The same approach can be used to correct terms like Jr. and Junior, or Sr. and Senior. Things like this are difficult to catch during data entry – but using Term Based Relations, a person who knows the data can clean it so that it generates correct reports.
After all, reports like this are of little to no use, aren’t they? So let’s NOT create confusing reports anymore.
Please note:
The revenue figures shown are just for demo purposes. I pulled these numbers from Wikipedia. Please don’t make any financial decisions based on these reports – and if you do, I am not responsible for that.
The name “my Company Inc” is a fictional firm. It’s not any firm that I am or was associated with in the past, present or future. It’s a fictional name!
And writing disclaimers like this is NO fun – it sucks the joy out of “Thinking out Loud” 🙂
In this blog post, I’ll share a quick demo of how you can use Regular Expressions in Data Quality Services to clean records. For the purpose of the demo, I’ll show how to perform a preliminary cleaning procedure on email records.
Before we begin, just a primer on Regular Expressions:
“a regular expression provides a concise and flexible means to match (specify and recognize) strings of text, such as particular characters, words, or patterns of characters. Common abbreviations for “regular expression” include regex and regexp.” – source: Wikipedia, Regular Expressions
Let’s take an example:
using Regular Expressions you can:
check if a text starts with A or a
check if a text contains the word “data”
check if a text follows the following pattern: something@something.something
among other things…
Now, note the power of Regular Expressions to check for a pattern like something@something.com – can we not use that to perform basic cleaning on email records? I think we can! And so, let’s try that:
For the purpose of this blog post, let me quickly show how you can create a domain in a Knowledge Base that uses Regular Expressions to clean email records:
1) Open Data Quality Client > New Knowledge Base > Domain Management > create a domain
2) Switch to Domain Rules Tab
3) Create a new domain rule which uses the Regular Expression [A-Za-z0-9._-]+@[A-Za-z0-9.-]+.[A-Za-z] to perform basic cleaning on email records:
4) You can use this Knowledge Base to clean email records in a Data Quality Project, but for now, let’s test our Regular Expression > click Test Domain Rule > enter a few records > test them!
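Outside of DQS, you can try the same pattern in a few lines of Python. I’m using fullmatch here; DQS’s .NET matching semantics may differ slightly, and the test values are made up:

```python
import re

# The same pattern used in the domain rule above.
EMAIL_RE = re.compile(r"[A-Za-z0-9._-]+@[A-Za-z0-9.-]+.[A-Za-z]")

for value in ["jon@example.com", "not-an-email", "a@b.c", "@missing.local"]:
    verdict = "passes" if EMAIL_RE.fullmatch(value) else "fails"
    print(f"{value}: {verdict}")
```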
Note: This is just a basic demo to show Regular Expressions in action. For cleaning email records, there are other options, like using third-party data sets or writing a more advanced regular expression. The RegEx I shared was just meant for a quick demo, but if you want to read more about Regular Expressions used to clean emails, then I strongly recommend reading the email regular expressions that I got to know from Erik Eckhardt via the discussion on a Beyond Relational Tip that I had posted.
Conclusion:
In this blog post, we saw how to do basic cleaning on email records using regular expressions in Data Quality Services.
2) Go to datamarket.azure.com > I subscribed to “Address Check – Verify, Correct, Geocode US and Canadian Addresses Data”; we’ll see how to use this in the next steps.
3) Note that as of now, we can only have reference data sets from Azure Data Market. However, the MSDN thread http://social.msdn.microsoft.com/Forums/hu-HU/sqldataqualityservices/thread/750faef8-dd69-4a71-b0c1-18ca2f93d59d suggests that we’ll have the ability to provide our own (private/self-provided) reference data sets in future service updates. So for now, we’ll have to connect to Azure Data Market for reference data sets, and for the purpose of the demo, I connected to Melissa Data’s Address Check.
4) Now via DQS client, let’s create a Knowledge Base!
5) I’ve created the following domains:
Address Line
City
State
Zip
And a composite domain, Full Address, which contains the domains Address Line, City, State and Zip.
6) And for the composite domain Full Address – this is how you configure reference data-sets:
7) After creating the Knowledge Base, start a new DQS project. Here the cleansing happens at the composite domain level, and this is a demo of how DQS uses a third-party reference data set to classify records as suggested, new, invalid, corrected or correct:
You can see that DQS corrected a few records and notes that the reason in this case was Melissa Data’s address verification reference data set:
In Data Quality Services, a knowledge base (KB) consists of domains. And a domain has: domain rules, term based relations, domain values and reference data.
So, what is the difference between Term Based Relations and Domain Values – and when to use which? Here is the answer:
Domain Values: allow us to correct the entire value in a domain. E.g.:
USA -> United States
US -> United States
United States of America -> United States
Note that the entire value in the domain got changed.
Term Based Relations: allow us to correct a word/term within a value in a domain. E.g.:
John Senior -> John Sr.
John Junior -> John Jr.
George Sr -> George Sr.
Mister Smith -> Mr. Smith
Note that only PART of the domain value got changed.
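The difference is easy to see in code. A minimal sketch – the lookup tables mirror the examples above, and DQS’s actual implementation is of course more sophisticated:

```python
import re

# Whole-value corrections (Domain Values).
domain_values = {
    "USA": "United States",
    "US": "United States",
    "United States of America": "United States",
}

# In-value term corrections (Term Based Relations).
term_relations = {"Senior": "Sr.", "Junior": "Jr.", "Mister": "Mr."}

def correct_domain_value(value: str) -> str:
    # The whole value is looked up and replaced.
    return domain_values.get(value, value)

def correct_terms(value: str) -> str:
    # Only the matching terms inside the value are replaced.
    for term, fix in term_relations.items():
        value = re.sub(rf"\b{re.escape(term)}\b", fix, value)
    return value

print(correct_domain_value("USA"))   # -> United States (entire value changed)
print(correct_terms("John Senior"))  # -> John Sr. (only part of the value changed)
```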
When you see a price of an item as -10 (negative 10) – you know it’s not right! It’s BAD data!
When you see some person’s height as 120 inches (10ft!) – you know that’s just not true! Again, BAD data!
It’s not uncommon to see such values in REAL data sets. In this post, we’ll see how SQL Server 2012’s Data Quality Services can help us clean unrealistic (a.k.a. invalid) values in our data sets and bring us a step closer to CLEAN data!
Situation:
You’re an analyst for the professional basketball team Dallas Mavericks, and you need a player roster with CLEAN data for analysis purposes. Someone recently passed you a player roster – great, you have data!
Problem:
Some of the data about the height of players just doesn’t look right! A player named Dominique Jones has a height of just 5 inches.
That’s just not right. For now, you could clean this data set manually by looking up the right values, but wouldn’t it be great if you had a data cleaning solution that flags unrealistic values for your review every time you get such a data set? So the analyst decided to build a solution in Data Quality Services – here are the steps:
Steps to create a Knowledge Base in DQS and clean the invalid values:
Now, First let’s create the Knowledge Base:
You decide to create a rule that flags player heights less than 5 ft or greater than 7 ft 6 in for your review. So here’s what the Knowledge Base would look like:
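In plain code, that domain rule is essentially a range check on height. A minimal sketch (heights in inches; the roster rows are invented, except for the 5-inch Dominique Jones value from this post):

```python
# Domain rule: a height is plausible only between 5 ft (60 in) and 7 ft 6 in (90 in).
MIN_HEIGHT_IN, MAX_HEIGHT_IN = 60, 90

# Hypothetical roster rows (player, height in inches); only the 5-inch
# Dominique Jones value comes from the post.
roster = [("Dominique Jones", 5), ("Dirk Nowitzki", 84), ("Typo McTall", 120)]

for player, height in roster:
    if not (MIN_HEIGHT_IN <= height <= MAX_HEIGHT_IN):
        print(f"INVALID: {player} listed at {height} in – flag for review")
```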
Now after the Knowledge base (KB) has been published, it’s time to use this KB for a Data Quality Project.
Note: This KB can be re-used for creating KBs for other teams too.
Here’s the result of running the project on our data set:
Note that it says that height has 3 invalid records. Let’s correct them!
In the next tab, you can see the invalid records:
You can correct these invalid values one by one and then approve them:
After that, in the next step, you can see the results – I have highlighted the records that we cleaned!
And here’s our clean data-set:
Conclusion:
Yay! An analyst of a professional basketball team was able to quickly clean a data set. And he can re-use the knowledge base he built to clean other data sets in the future! Isn’t that nice?
In this post, we saw how to clean unrealistic or invalid records in a data set by using domain rules in SQL Server 2012 Data Quality Services.
I was playing with Data Quality Services when I thought that it would be nice if I could implement what I had learned and create a quick demo that shows how to standardize data using the technology. So here it goes:
Note: This is not a step-by-step guide – I am assuming that you are aware of the different pieces of DQS. I am going to post a brief outline of how I went about implementing a quick demo.
Technology used:
1) SQL Server 2012 Data Quality Services
2) SQL Server Integration Services to show the DQS Cleansing component in action.
Problem:
We want to standardize Gender to have only two values, M & F – but right now, data coming from two files creates data inconsistency issues:
Two files that look like this:
File 1: (Excel)
File 2: (CSV)
Solution:
Let’s see how SQL server 2012’s data quality services would help us solve this problem:
1) I created a Knowledge Base with a domain Gender and created domain values that look like this (a sketch of the mapping idea follows the steps):
2) Tested the Knowledge base quickly by creating a Data Quality Project. It worked!
3) Switched to SSIS and created a Data Flow which uses the following components:
4) What would be interesting to you is the Mapping tab in the DQS Cleansing transformation component, so here it is:
5) And let me show you the Target File – note that you can choose to NOT have the Gender_Source, Gender_Status & Record Status columns. The column of interest is Gender_Output.
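For the curious, the Gender domain values from step 1 boil down to a synonym-to-leading-value mapping. Here’s a minimal Python sketch of the idea (the exact synonym list is my assumption; DQS would additionally surface unmatched values for review):

```python
# Synonyms map to the leading values M and F, mirroring the Gender domain
# values set up in step 1 (the exact synonym list here is my assumption).
SYNONYMS = {"m": "M", "male": "M", "f": "F", "female": "F"}

def standardize_gender(value: str) -> str:
    # Unmatched values fall through unchanged; DQS would flag them as "New".
    return SYNONYMS.get(value.strip().lower(), value)

for raw in ["M", "male", "Female", "f ", "unknown"]:
    print(f"{raw!r} -> {standardize_gender(raw)!r}")
```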