In 2012, people from 162 countries visited ParasDoshi.com!

Visualizing data is powerful! Thanks to WordPress.com for sending me the 2012 report. A statistic that I found very encouraging was that people from 162 countries visited this blog! All thanks to the power of the interwebs!

Thanks, everyone, for the support. I appreciate it!

And here’s a beautiful data visualization:

[Image: 162 countries blog visitors, Paras Doshi, 2012]

SQL Server 2012 Data Quality Services: Term Based Relations in action!

In Data Quality Services, a knowledge base (KB) consists of domains, and a domain has domain rules, term based relations, domain values and reference data. In this post, we’ll see a use case for term based relations. But before we do that, consider reading: Difference between Term based relations and Domain values in SQL server 2012 Data Quality Services

Now, here’s Term Based Relations in action:

Scenario:

After data profiling you realize that there are records such as “my Company Inc.” and “my Company Incorporated” – so you set out to automatically find these mismatches in terms inside a value and correct them.

BEFORE cleaning the data, the report showed that “my company Inc” revenue is less than that of Google:

[Image: company names and revenues before data cleaning, SQL Server]

AFTER cleaning the data, the report correctly shows that “my company Inc” revenue is more than that of Google:

[Image: company names and revenues after data cleaning, SQL Server]

Steps taken to clean data:

(Just an overview, not a step-by-step walkthrough)

1. Created the knowledge base with two domains: Company Names and Revenue

2. Term Based Relations Tab of Company Names domain:

[Image: Term Based Relations tab, Data Quality Services, SQL Server]

3. Published the KB, then created a DQS project

4. Mapped the domains:

[Image: DQS project, mapping domain names]

5. DQS cleaned the following records:

[Image: term based relations, Inc and Incorporated, corrected records]

6. Exported the data and created a report from the clean data set!

Business user is happy 🙂

Conclusion:

In this post, we saw how to correct a word/term within a value in a domain. The example we used was Inc., Inc and Incorporated. The same approach can correct terms like Jr. and Junior, Sr. and Senior, etc. Things like this are difficult to catch during data entry, but using Term Based Relations, a person who knows the data can clean it so that it generates correct reports.
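To make the effect on the report concrete, here is a minimal Python sketch of the idea (it is not how DQS is implemented). The relation table, the choice of “Inc” as the corrected term, and the revenue numbers are all made up for illustration; the real relations live in the Term Based Relations tab of the domain.

```python
from collections import defaultdict

# Hypothetical term based relations: term found in a value -> term to correct it to.
# Correcting everything to "Inc" is an assumption made for this sketch.
TERM_RELATIONS = {
    "Inc.": "Inc",
    "Incorporated": "Inc",
}


def apply_term_relations(value, relations=TERM_RELATIONS):
    """Correct individual terms inside a value and leave the other terms alone."""
    return " ".join(relations.get(term, term) for term in value.split())


def total_revenue_by_company(rows):
    """Sum revenue per company name after cleaning the name."""
    totals = defaultdict(int)
    for name, revenue in rows:
        totals[apply_term_relations(name)] += revenue
    return dict(totals)


# Made-up rows: three spellings of the same company plus one other firm.
rows = [
    ("my Company Inc.", 10),
    ("my Company Incorporated", 20),
    ("my Company Inc", 30),
    ("Other Firm", 40),
]

print(total_revenue_by_company(rows))
```

Without the cleaning step, the three spellings would show up as three separate companies with smaller totals, which is the kind of confusing report the scenario describes.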

After all, reports like this are of little to no use, are they? So let’s NOT create confusing reports anymore.

Please note:

  • The revenue figures shown are just for demo purposes. I pulled these numbers from Wikipedia. Please don’t make any financial decisions based on these reports; if you do, I am not responsible for that.
  • “my Company Inc” is a fictional name. It’s not any firm that I am or was associated with in the past, present or future.

And writing disclaimers like this is NO fun. It sucks the joy out of “Thinking out Loud” 🙂

See what went into building WATSON, an advanced machine learning & natural language processing system powered by Big Data!

Do you know about the Jeopardy! quiz show in which a computer named Watson beat world champions? No? Go watch it! Yes? Nice! Isn’t it a feat as grand as the one achieved by Deep Blue (the chess computer), if not more so?

I am always interested in how such advanced computers were built. In the case of Watson, it’s fascinating how technologies such as natural language processing, machine learning and artificial intelligence, backed by massive compute and storage power, were able to beat two human world champions. As a person interested in analytics and Big Data, I would classify this technology under Big Data and advanced data analytics, where a computer analyzes lots of data to answer a question asked in natural language. It also uses advanced machine learning algorithms. To that end, if you’re interested in getting an overview of what went into building WATSON, watch this:

If you’re as amazed as I am, consider sharing what amazed you about this technology in the comments section.

Seven Interesting Google Projects that a Data Professional may not have heard about:

Here’s the list:

1. Google Refine

2. Google Prediction API

3. Google Trends

4. Google Chart Tools

5. Google BigQuery

6. Google Correlate

7. Google Fusion Tables

Note: These projects may not be ready for use in your production environment, as some of them are in Beta/Experimental stages and their support/development may be discontinued in the future.

Thanks: I thought of writing this blog post after a discussion I had with Parth Acharya about Google and its projects for data professionals. He pointed me to some of the most interesting samples that used Google Fusion Tables, and here’s one of his blog posts on a related topic: Google Fusion Table & Data Visualization

How to clean data using Regular Expressions in Data Quality Services?

In this blog post, I’ll share a quick demo of how you can use regular expressions in Data Quality Services to clean records. For the purpose of the demo, I’ll show how to perform a preliminary cleaning procedure on email records.

Before we begin, just a primer on Regular Expressions:

“a regular expression provides a concise and flexible means to match (specify and recognize) strings of text, such as particular characters, words, or patterns of characters. Common abbreviations for “regular expression” include regex and regexp.” (source: Wikipedia, Regular Expressions)

Let’s take an example:

Using regular expressions, you can:

  • check if a text starts with A or a
  • check if a text contains the word “data”
  • check if a text follows the following pattern:  something@something.something.
  • among other things..
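As a rough illustration of the three checks in this list, here is a small Python sketch using the standard `re` module. The function names and the exact patterns are my own choices for this example (for instance, treating “data” as a whole word), not something DQS prescribes.

```python
import re


def starts_with_a(text):
    """True if the text starts with 'A' or 'a'."""
    return re.match(r"[Aa]", text) is not None


def contains_word_data(text):
    """True if the text contains 'data' as a whole word (an assumption of this sketch)."""
    return re.search(r"\bdata\b", text) is not None


def looks_like_something_at_something(text):
    """True if the whole text has the shape something@something.something."""
    return re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", text) is not None


print(starts_with_a("Azure"))
print(contains_word_data("clean data fast"))
print(looks_like_something_at_something("someone@example.com"))
```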

Now, note the power of regular expressions to check for a pattern like something@something.com. Can we not use it to perform a basic cleaning on email records? I think we can! So let’s try that.

For the purpose of this blog post, let me quickly show how you can create a domain in a knowledge base that uses regular expressions to clean email records:

1) Open Data Quality Client > New Knowledge Base > Domain Management > create a domain

2) Switch to Domain Rules Tab

3) Create a new domain rule that uses the regular expression [A-Za-z0-9._-]+@[A-Za-z0-9.-]+.[A-Za-z] to perform a basic cleaning on email records:

[Image: SQL Server Data Quality Services, cleaning email with regular expressions]

4) You can use this knowledge base to clean email records in a Data Quality project. But for now, let’s test our regular expression: click Test Domain Rule > enter a few records > test them!

[Image: SQL Server Data Quality Services, testing the email regular expression]

Note: This is just a basic demo to show regular expressions in action. For cleaning email records, there are other options, like using third-party data sets or writing a more advanced regular expression. The regex I shared was just meant for a quick demo, but if you want to read more about regular expressions used to clean emails, I strongly recommend reading email regular expressions, which I got to know via the discussion on a Beyond Relational tip that I had posted.
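If you want to try the pattern outside the Data Quality Client, here is a hedged Python sketch. DQS uses its own regex engine, and I am assuming the rule is applied to the whole value, so `re.fullmatch` is only an approximation. The second pattern is my own slightly stricter variant (escaped dot, at least two letters at the end), not the one from the demo.

```python
import re

# The pattern as printed in step 3. The dot before the last character class is
# unescaped, so it matches any character, not only a literal ".".
POST_PATTERN = r"[A-Za-z0-9._-]+@[A-Za-z0-9.-]+.[A-Za-z]"

# A hypothetical stricter variant: literal dot and a top-level part of 2+ letters.
STRICTER_PATTERN = r"[A-Za-z0-9._-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"


def is_valid(pattern, value):
    """True if the whole value matches the pattern."""
    return re.fullmatch(pattern, value) is not None


samples = ["user@example.com", "user@examplecom", "user@@example.com"]
for value in samples:
    print(value, is_valid(POST_PATTERN, value), is_valid(STRICTER_PATTERN, value))
```

The record without a dot in the domain part is accepted by the printed pattern but rejected by the stricter one, which is one reason to treat the demo regex as a starting point only.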

Conclusion:

In this blog post, we saw how to do basic cleaning on email records using regular expressions in Data Quality Services.

How to clean address records using third-party reference data-sets in Data Quality Services?

In this post, we’ll see how to clean address records using third party reference data-sets in Data Quality Services.

You need to perform the following steps to do so.

Note that this is not a step-by-step procedure, but a high-level overview of how DQS cleans address records using third-party reference data-sets:

1) Configure Data Quality Services to use reference data sets. Tutorial here: http://msdn.microsoft.com/en-us/library/hh213070.aspx

2) Go to datamarket.azure.com and subscribe to a reference data-set. I subscribed to “Address check – verify, correct, Geocode US and canadian Addresses Data”; we’ll see how to use this in the next steps.

3) Note that as of now, we can only use reference data-sets from Azure Data Market. However, the MSDN thread http://social.msdn.microsoft.com/Forums/hu-HU/sqldataqualityservices/thread/750faef8-dd69-4a71-b0c1-18ca2f93d59d suggests that we’ll be able to provide our own (private/self-provided) reference data-sets in upcoming service updates. So for now we have to connect to Azure Data Market for reference data-sets, and for the purpose of the demo, I connected to Melissa Data’s Address Check.

4) Now via DQS client, let’s create a Knowledge Base!

5) I’ve created the following domains:

  • Address Line
  • City
  • State
  • Zip

And a composite domain, Full Address, which contains the domains Address Line, City, State and Zip.

[Image: SQL Server 2012 Data Quality Services, Domain Management]

6) And for the composite domain Full Address – this is how you configure reference data-sets:

[Image: SQL Server 2012 Data Quality Services, reference data services]

7) After creating the knowledge base, start a new DQS project. Here the cleaning happens at the composite domain level, and this is a demo of how DQS uses a third-party reference data-set to classify records as suggested, new, invalid, corrected or correct:

You can see that DQS corrected a few records and gives the reason, which in this case was Melissa Data’s address verify reference data-set:

[Image: SQL Server 2012 Data Quality Services, data quality project]

That’s about it for this post. I hope it helps.

Difference between Term based relations and Domain values in SQL server 2012 Data Quality Services

In Data Quality Services, a knowledge base (KB) consists of domains, and a domain has domain rules, term based relations, domain values and reference data.

So, what is the difference between Term Based Relations and Domain values – and when to use which? Here is the answer:

Domain Values | Term Based Relations
It allows us to correct the entire value in a domain | It allows us to correct a word/term within a value in a domain
E.g. | E.g.
USA -> United States | John Senior -> John Sr.
US -> United States | John Junior -> John Jr.
United States of America -> United States | George Sr -> George Sr.
 | Mister Smith -> Mr. Smith
Note that the entire value in the domain got changed. | Note that only PART of the domain value got changed.
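The two behaviours can be contrasted with a small Python sketch. It uses the example values from the comparison above; the dictionaries and function names are hypothetical and only mimic the idea, since DQS stores and applies these through the knowledge base.

```python
# Domain values: the ENTIRE value must match for a correction to apply.
DOMAIN_VALUES = {
    "USA": "United States",
    "US": "United States",
    "United States of America": "United States",
}

# Term based relations: a single term INSIDE the value is corrected.
TERM_RELATIONS = {
    "Senior": "Sr.",
    "Junior": "Jr.",
    "Sr": "Sr.",
    "Mister": "Mr.",
}


def correct_by_domain_value(value):
    """Replace the value only when the whole value is a known synonym."""
    return DOMAIN_VALUES.get(value, value)


def correct_by_term_relation(value):
    """Replace matching terms and keep the rest of the value unchanged."""
    return " ".join(TERM_RELATIONS.get(term, term) for term in value.split())


print(correct_by_domain_value("USA"))
print(correct_by_domain_value("USA Today"))
print(correct_by_term_relation("Mister Smith"))
```

Note how “USA Today” is left untouched by the domain-value lookup because the whole value does not match, while the term relation happily changes just one word of “Mister Smith”.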

Further reading: Data Quality Services concepts: http://msdn.microsoft.com/en-us/library/hh213015.aspx

That’s about it for this post. I hope it helps.