Seven Demo’s for SQL Server 2012 Data Quality Services:

Standard

Here are the seven Demo’s that I had posted over the past few weeks, listing them here:

Detecting invalid values in Price domain or unrealistic values in Height domain

How to standardize data using DQS

How to clean records using third-party reference data-sets from Azure Data Market

How to use regular expressions to validate records?

cleaning records like “my company Inc.” and “my company incorporated” using Term Based Relations

How to use cross-domain rules using composite domains

Matching activity in Data Quality Services in action!

Data Profiling and SQL Server 2012 Data Quality Services

Standard

Data Profiling in Data Quality Services happens at following stages:

1) While performing Knowledge Discovery activity

1A: In the Discover step:

1 knowledge discovery profiling data quality services sql server

1b. Also in the manage domain values step:

1b knowledge discovery profiling data quality services sql server

While profiling gives you statistics at the various stages in the Data Cleaning or Matching process, it is important to understand what you can do with it. With that, Here are the statistics that we can garner at the knowledge discovery activity:

  • Newness
  • Uniqueness
  • Validity
  • Completeness

2) While Performing  Cleansing activity:

2A: on the cleansing step:

2 cleansing profiling data quality services sql server

2b: Also on the mange and view results step:

2b cleansing profiling data quality services sql server

Here the profiler gives you following statistics:

  • Corrected values
  • Suggested Values
  • Completeness
  • Accuracy

Note the Invalid records under the “source statistics” on left side. In this case 3 records didn’t pass the domain rule.

3) While performing Matching Policy activity (Knowledge Base Management)

3a. Matching policy step:

3a matching policy data quality services microsoft sql

3b. Matching Results step:

3b matching policy data quality services microsoft sql

Here the profiler gives following statistics:

  • newness
  • uniqueness
  • number of clusters
  • % of matched and unmatched records
  • avg, min & max cluster size

4) While performing Matching activity (Data Quality Project)

4a. Matching step:

4a matching activity data quality services microsoft sql

4b. Export step:

4b matching activity data quality services microsoft sql export step

Here Profiler gives following statistics:

  • Newness
  • uniqueness
  • completeness
  • number of clusters
  • % of matched and unmatched records
  • avg, min & max cluster size

Conclusion:

In this post, I listed the statistics provided by Profiler while performing Knowledge Discovery, cleansing, matching policy and matching activity in SQL Server 2012 Data Quality Services.

 

Let’s Install R & RStudio on Windows Machine!

Standard

I was recently searching for a way to do some text mining on Twitter Data. I was interested in a tool that has some “library” that helps to fetch twitter data & later, I wanted to create visualization like say word cloud, time series. etc. Turns out that “R” perfectly suited my needs because of libraries/packages such as TwitteR and ggplot2 – And so, I downloaded and installed R and RStudio on my windows machine. Here are the steps (I am using Windows Server 2008 R2 machine 64 bit):

1. Download R for Windows:

Install R for windows twitter analytics

2. After downloading it > Install it by leaving all options to default.

3. Download RStudio Desktop for windows:

install R studio for windows desktop

4. Install RStudio > leave all options to default.

5. Open RStudio > In the Bottom Right Pane, switch to Packages Tab > Click on Install Packages > In the packages box, type in ggplot2 and > click on Install.

ggplot2 package R Rstudio

5. Check that ggplot2 successfully unpacked and installed > Now similarly install the package: twitteR > make sure it is successfully unpacked and installed.

twitteR package R Rstudio windows analytics6. And I quickly created a chart of Twitter UserName vs Number of Tweets for #sqlpass:

we can do much mire but just wanted to show how you can do social media analytics with R!

Twitter Analytics with R Studio windows Bar Plot

Conclusion:

In this blog post, we saw a step by step process to download and install R and R studio on a windows machine.

 

How I think of what is KnowledgeBase in SQL Server 2012 Data Quality Services

Standard

This Blog Post is meant to share how I think of what is KnowledgeBase in SQL Server 2012 Data Quality Services (DQS) – In my mind, Knowledge Base (KB) captures:

1. WHAT needs cleaning

2. HOW to clean what needs cleaning.

Let’s dive a little deeper, In DQS – a Knowledge Base let’s you do three things: Knowledge Discovery, Domain Management & Matching Policy.

1. Knowledge Discovery:

This activity helps you find “WHAT” needs cleaning. DQS has inbuilt algorithms that helps in analyzing errors, inconsistencies and data quality issues in the sample data-set.

2. Domain management:

This activity helps in defining the rules that will be applied to create “HOW” to clean the data.

3. Matching Policy:

This activity helps in identifying “WHAT” needs to be De-Duplicated (De-Dup) and then it goes about helping create the “HOW” to De-DUP the data.

Conclusion:

In this short blog post, I shared how I think of what is Knowledge Base in SQL Server 2012 Data Quality Services. And here’s the official resource if you want to continue learning: DQS Knowledge Bases and Domains

Back to basics: What is the difference between Data Analysis and Data Mining?

Standard

What is the difference between Data Analaysis and Data Mining:

1) One view is that: Data Mining is one particular form of Data Analysis.

difference between data mining and data analysis

One of the reason I researched about the difference between Data Analysis and Data Mining because I find that the terms are used Interchangeably and now I know why. It’s because Data Mining is considered as a particular form of Data Analysis.

2) I found other view that says:

Data Analysis is meant to support decision-making, support conclusions & Highlight note-worthy information. So when “Analyzing data” – we know what we want; we want answers to support our hypothesis; we want data in summarized form to highlight useful information.

While

Data Mining is meant for “Knowledge discovery” and “predictions”. So when “Mining data” – we look for undefined insights; We want the data to tell us something we didn’t knew before; We want to find patterns in the data that we had not anticipated.

Sources:

http://www-stat.stanford.edu/~jhf/ftp/dm-stat.pdf

http://stats.stackexchange.com/questions/5026/what-is-the-difference-between-data-mining-statistics-machine-learning-and-ai

http://stats.stackexchange.com/questions/1521/data-mining-and-statistical-analysis

http://en.wikipedia.org/wiki/Data_analysis