5 tests to validate the quality of your data:


Missing Data:

  • Descriptive statistics could be used to find missing data
  • Tools  like SQL/Excel/R can also be used to look for missing data
  • Some of the attributes of a field are missing: Like Postal Code in an address field


  • Check if all the values are standardized: Google, Google Inc & Alphabet might need to be standardized and categorized as Alphabet
  • Different Date formats used in the same field (MM/DD/YYYY and DD/MM/YYYY)


  • Total size of data (# of rows/columns): Sometimes you may not have all the rows that you were expecting (for e.g. 100k rows for each of your 100k customers) and if that’s not the case then that tells us that we don’t complete dataset at hand


  • Outlier: If someone;s age is 250 then that’s an outlier but also it’s an error somewhere in the data pipeline that needs to be fixed; outliers can be detected using creating quick data visualization
  • Data Type mismatch: If a text field is in a field where other entries are integer that’s also an error


  • Duplicates can be introduced in the data e.g. same rows duplicated in the dataset so that needs to be de-duplicated

Hope that helps!

Paras Doshi

This post is sponsored by MockInterview.co, If you are looking for data science jobs, check out 75+ data science interview questions!

What is the title these days for a person that assures data quality?



What is the title these days for a person that assures data quality?
(I need to hire a person to make sure my data is as good as it can be. They need to inspect the data for issues, create logic for how it can be found and fixed, and finally, court the project through application development for a robust solution to stop it from occurring in the first place.)


Quality of the data shouldnt be a responsibility of just one person — ideally, you want all members of the team (and broader business community) to care and own some part of it. But i like the idea of one person owning the “co-ordination” of how this gets done. It might not be a full time gig in a small org but can see this as a full time role in bigger orgs and enterprises. Some titles:

  1. data co-ordinator
  2. Data quality analyst (or just data analyst)
  3. Data steward
  4. Master data management analyst
  5. Data quality engineer (or just data engineer)
  6. Project manager (data quality)
  7. Manager, data quality and master data management

Read the original question on Quora

How to solve common Data Quality Problems using Data Quality Services? (Part 1)


New Journal Article: First article of 2, where you will be able to see how you can use SQL Server 2012’s DQS to solve common data quality problems. http://bit.ly/172Kh5L

Topics covered:
– Data standardization
– Identifying and correcting unrealistic or invalid values
– Validation and correcting records using Regular Expressions

Read Here: How to solve common Data Quality Problems using Data Quality Services (Part 1)

Matching activity in Data Quality Services in action!


In this post, we’ll see matching activity in action. For the demo purpose, I’ll be using Data-Sets that I’ve found via Movies Sample & EIM tutorial .

Situation 1:

we’ve a list of movies and we want to identify “matching” movie titles.

Solution 1:

Create a Matching Policy

1) Mapping the Domains:

2 Data Quality Services matching policy

2) Configuring the Matching Policy:

3 Data Quality Services matching policy

Note: You can have Matching Rules on more than one domain. I used one domain for demo purposes.

3) View Matching results:

4 Data Quality Services matching policy de duplication

4) Publish the KB (which stores the matching policy)

Once you have the matching policy, you can use this in a Data Quality Project:

5) See How I ran a Data Quality Project (w/ matching activity) in the image below.

5 Data Quality Project matching activity policy de duplication

Note: You can export the clean data-set via Data Quality Project.

Situation 2:

we’ve a list of Supplier Names and we want to identify “matching” supplier names.

Note that in this situation, you would see how to use more than one domain to create a matching rule.

Solution 2:

Most of the steps would remain same as situation 1, But I want to show you Matching Policy & Matching Results

Matching Policy:

6 supplier ID and name two domains in matching rule data quality services

Matching results:

7 supplier ID and name two domains in matching rule data quality services

Also, I want to show that, the matching policy (DQS KB) that we created earlier can be used in Master Data Services too! For more details check out the resource: DQS, MDS & SSIS for EIM

DQS MDS Data quality services and master data services


In this post, we saw how DQS can be used to clean “matching” records. For step by step tutorial, please refer to Movies Sample & EIM tutorial .

Data Profiling and SQL Server 2012 Data Quality Services


Data Profiling in Data Quality Services happens at following stages:

1) While performing Knowledge Discovery activity

1A: In the Discover step:

1 knowledge discovery profiling data quality services sql server

1b. Also in the manage domain values step:

1b knowledge discovery profiling data quality services sql server

While profiling gives you statistics at the various stages in the Data Cleaning or Matching process, it is important to understand what you can do with it. With that, Here are the statistics that we can garner at the knowledge discovery activity:

  • Newness
  • Uniqueness
  • Validity
  • Completeness

2) While Performing  Cleansing activity:

2A: on the cleansing step:

2 cleansing profiling data quality services sql server

2b: Also on the mange and view results step:

2b cleansing profiling data quality services sql server

Here the profiler gives you following statistics:

  • Corrected values
  • Suggested Values
  • Completeness
  • Accuracy

Note the Invalid records under the “source statistics” on left side. In this case 3 records didn’t pass the domain rule.

3) While performing Matching Policy activity (Knowledge Base Management)

3a. Matching policy step:

3a matching policy data quality services microsoft sql

3b. Matching Results step:

3b matching policy data quality services microsoft sql

Here the profiler gives following statistics:

  • newness
  • uniqueness
  • number of clusters
  • % of matched and unmatched records
  • avg, min & max cluster size

4) While performing Matching activity (Data Quality Project)

4a. Matching step:

4a matching activity data quality services microsoft sql

4b. Export step:

4b matching activity data quality services microsoft sql export step

Here Profiler gives following statistics:

  • Newness
  • uniqueness
  • completeness
  • number of clusters
  • % of matched and unmatched records
  • avg, min & max cluster size


In this post, I listed the statistics provided by Profiler while performing Knowledge Discovery, cleansing, matching policy and matching activity in SQL Server 2012 Data Quality Services.


Data Quality Service’s Composite Domains in action!


In this post, I’ll show you how composite domains can help you create cross domain rules in Data Quality Services.


You have a data set of employee name, employee category and yearly salary. you want to validate the value in the yearly salary column based on the employee category. Here are the business rules:

Note: for the purpose of the demo, every number is a dollar.

Now, the rule in the Table can be read as:

If employee category is A then yearly salary should be greater than 100000 and less than 200000.

1 composite domains data quality services

Note: I have kept it simple for demo purposes.

Now here is our Data-Set before we set out to validate it:

Employee Name Employee Category Yearly Salary
Jon V Yang A 127000
Eugene L Huang B 90000
Ruben  Torres C 83000
Christy  Zhu D 70000
Elizabeth  Johnson A 90000
Julio  Ruiz C 65000
Janet G Alvarez D 43000
Marco  Mehta B 81000

*Names are taken from Adventure works database. The values in the names and salary column are purely fictional.


It’s just an overview, It’s not covered in step by step fashion:

1. Create a KB > created three domains: Employee Category, Employee Name and Yearly Salarly

2. created a composite domain:

2 created a composite domain data quality services

3. Under Composite Domain (CD) Rules Tab:

I started out with defining the rules for category A:

3 create composite domains rules data quality services

And I completed w/ specifying business rules for all four categories

4 create composite domains SQL server 2012

4.  Published KB

5. Created a New DQS project > Selected the KB created above

6.  Selected the data source > Mapped domains

7. I also selected from the list of selected composite domains:

5 view select composite domains data quality project

8. After seeing the cleaning statistics, I switched to the invalid tab to see the records that didn’t match the record:

6 composite domain invalid tab new tab corrected tab correct tab

9. So by now, we have identified records that do not match the rules. A data steward can now correct them if he/she wants to or leave them as it is. Notice the Approve/reject check boxes.

Note that: Not only can you update the yearly salary but you can also update the employee category. So if you think that the employee has been wrongly categorized, you can change that.

10. After this, you can export the data-set which has records that match the business rules and the data-set would be ready to be consumed!


In this post, we saw how to create cross domain rules using composite domains w/ an example of Employee Category and Yearly Salary.


SQL Server 2012 Data Quality Services Term based Relation’s in action!


In data quality services, a knowledge base (KB) consists of domains. And domains has: domain rules, term based relations, domain values and reference data. In this post, we’ll see a use case of Term based relations. But before we do that, you can consider reading: Difference between Term based relations and Domain values in SQL server 2012 Data Quality Services

Now, Here’s Term Based Relation’s in action


After data profiling you realize that there are records such as “my Company Inc.” and “my Company Incorporated” – so you set out to automatically find these mismatches in terms inside a value and correct them.

BEFORE cleaned data, the report showed that “my company Inc” revenue is less than that of Google:

Before data cleaning company names and revenues sql server

AFTER cleaned data, the report correctly shows that “my company Inc” revenue is more than that of Google:

After data cleaning company names and revenues sql server

Steps taken to clean data:

(Just an overview, not covered in a step by step fashion)

1. Created the Knowledge Base w/ Two domains Company Names & Revenue

2. Term Based Relations Tab of Company Names domain:

2 term based relations data quality sql server

3. Published the KB > Let’s create a DQS project

4. Mapped the domains:

3. DQS Project Mapping Domain Names

5. DQS cleaned following records:

4 term based relations inc incorporated corrected records

6. Exported the data and created a report out of clean data-set!

Business user is happy 🙂


In this post, we saw how to correct a word/term within a value in a domain. The example we used was Inc. , Inc and Incorporated . It can be used to correct terms like Jr. and Junior. Sr. and Senior. etc. Things like this are difficult to catch during data entry – But using Term Based Relations, a person who knows the Data can clean it so that it generates correct reports.

After all reports like this are of little to no use, are they? So Let’s NOT create confusing reports anymore.

Please note:

  • The revenue figures shown are just for demo purposes. I pulled up these numbers from Wikipedia. Please don’t make any financial decision based on these reports and if you do, I am not responsible for that.
  • The name “my Company Inc” is a fictional firm. It’s not any firm that I am/was associated with in past, future of present. It’s a fictional name!

And Writing disclaimers like this are NO fun – sucks the joy out of “Thinking out Loud” 🙂