Data Profiling in Data Quality Services happens at following stages:
1) While performing Knowledge Discovery activity
1A: In the Discover step:
1b. Also in the manage domain values step:
While profiling gives you statistics at the various stages in the Data Cleaning or Matching process, it is important to understand what you can do with it. With that, Here are the statistics that we can garner at the knowledge discovery activity:
Newness
Uniqueness
Validity
Completeness
2) While Performing Cleansing activity:
2A: on the cleansing step:
2b: Also on the mange and view results step:
Here the profiler gives you following statistics:
Corrected values
Suggested Values
Completeness
Accuracy
Note the Invalid records under the “source statistics” on left side. In this case 3 records didn’t pass the domain rule.
3) While performing Matching Policy activity (Knowledge Base Management)
3a. Matching policy step:
3b. Matching Results step:
Here the profiler gives following statistics:
newness
uniqueness
number of clusters
% of matched and unmatched records
avg, min & max cluster size
4) While performing Matching activity (Data Quality Project)
4a. Matching step:
4b. Export step:
Here Profiler gives following statistics:
Newness
uniqueness
completeness
number of clusters
% of matched and unmatched records
avg, min & max cluster size
Conclusion:
In this post, I listed the statistics provided by Profiler while performing Knowledge Discovery, cleansing, matching policy and matching activity in SQL Server 2012 Data Quality Services.
I was recently searching for a way to do some text mining on Twitter Data. I was interested in a tool that has some “library” that helps to fetch twitter data & later, I wanted to create visualization like say word cloud, time series. etc. Turns out that “R” perfectly suited my needs because of libraries/packages such as TwitteR and ggplot2 – And so, I downloaded and installed R and RStudio on my windows machine. Here are the steps (I am using Windows Server 2008 R2 machine 64 bit):
1. Download R for Windows:
2. After downloading it > Install it by leaving all options to default.
3. Download RStudio Desktop for windows:
4. Install RStudio > leave all options to default.
5. Open RStudio > In the Bottom Right Pane, switch to Packages Tab > Click on Install Packages > In the packages box, type in ggplot2 and > click on Install.
5. Check that ggplot2 successfully unpacked and installed > Now similarly install the package: twitteR > make sure it is successfully unpacked and installed.
6. And I quickly created a chart of Twitter UserName vs Number of Tweets for #sqlpass:
we can do much mire but just wanted to show how you can do social media analytics with R!
Conclusion:
In this blog post, we saw a step by step process to download and install R and R studio on a windows machine.
This Blog Post is meant to share how I think of what is KnowledgeBase in SQL Server 2012 Data Quality Services (DQS) – In my mind, Knowledge Base (KB) captures:
1. WHAT needs cleaning
2. HOW to clean what needs cleaning.
Let’s dive a little deeper, In DQS – a Knowledge Base let’s you do three things: Knowledge Discovery, Domain Management & Matching Policy.
1. Knowledge Discovery:
This activity helps you find “WHAT” needs cleaning. DQS has inbuilt algorithms that helps in analyzing errors, inconsistencies and data quality issues in the sample data-set.
2. Domain management:
This activity helps in defining the rules that will be applied to create “HOW” to clean the data.
3. Matching Policy:
This activity helps in identifying “WHAT” needs to be De-Duplicated (De-Dup) and then it goes about helping create the “HOW” to De-DUP the data.
Conclusion:
In this short blog post, I shared how I think of what is Knowledge Base in SQL Server 2012 Data Quality Services. And here’s the official resource if you want to continue learning: DQS Knowledge Bases and Domains
Awesome Blog Posts about DQS written by fine Community folks:
What is the difference between Data Analaysis and Data Mining:
1) One view is that: Data Mining is one particular form of Data Analysis.
One of the reason I researched about the difference between Data Analysis and Data Mining because I find that the terms are used Interchangeably and now I know why. It’s because Data Mining is considered as a particular form of Data Analysis.
2) I found other view that says:
Data Analysis is meant to support decision-making, support conclusions & Highlight note-worthy information. So when “Analyzing data” – we know what we want; we want answers to support our hypothesis; we want data in summarized form to highlight useful information.
While
Data Mining is meant for “Knowledge discovery” and “predictions”. So when “Mining data” – we look for undefined insights; We want the data to tell us something we didn’t knew before; We want to find patterns in the data that we had not anticipated.