Seven Demos for SQL Server 2012 Data Quality Services:


Here are the seven demos that I posted over the past few weeks:

Detecting invalid values in Price domain or unrealistic values in Height domain

How to standardize data using DQS

How to clean records using third-party reference data-sets from Azure Data Market

How to use regular expressions to validate records?

Cleaning records like “my company Inc.” and “my company incorporated” using Term Based Relations

How to use cross-domain rules using composite domains

Matching activity in Data Quality Services in action!

How to create an MDS entity via the Excel add-in?


In this post, we’ll see how I (as an administrator) created a Master Data Services entity via the MDS add-in for Excel:

1. Create a model via the MDS web application. I named it “CustomerInformation”.

[Screenshot: creating a model in Master Data Services]

2. Switch to Excel

3. Open the file that has the data that you want to load into MDS

4. Switch to the Master Data tab in Excel
[Resource: Steps I followed to install SQL Server 2012 Master Data Services (MDS) on my demo machine, and then installed the Excel add-in]

5. Connect to the MDS server (via the Excel add-in)

6. Select the CustomerInformation model

7. Under Build Model, select Create Entity

[Screenshot: Create Entity under Build Model in the MDS Excel add-in]

8. Configure the values in the “Create Entity” dialog > click OK

[Screenshot: the Create Entity dialog in the MDS Excel add-in]

9. Switch to the MDS web application to see the new entity:

[Screenshot: the new entity in the MDS web application Explorer]

I hope this helps! Your comments are very welcome!

How to view error(s) that occur during Master Data Services’ staging process?


To view error(s) that occur during MDS’s staging process, we have two views: stg.viw_name_MemberErrorDetails & stg.viw_name_RelationshipErrorDetails (where name is the name you gave the staging table). For the purpose of this blog post, let me show you how I (as an administrator) view information from stg.viw_name_MemberErrorDetails.

1) The entity is named supplier and the model is named Suppliers.

2) I imported data into the staging table stg.supplier_leaf via SSIS.
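For reference, here’s a minimal T-SQL sketch of the kind of row the SSIS package loads into that staging table (the Code/Name values and the batch tag are mine, for illustration; in the standard MDS 2012 leaf staging schema, ImportType 0 means create/update and ImportStatus_ID 0 marks the row as ready to be staged):

-- load one member into the leaf staging table
INSERT INTO stg.supplier_leaf (ImportType, ImportStatus_ID, BatchTag, Code, Name)
VALUES (0, 0, N'Batch1', N'SUP001', N'Contoso Supplies'); -- plus any attribute columns your entity defines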

3) Now here’s how you can see the errors that occur during MDS’s staging process: go to the MDS web application to initiate the staging process > start the batch > after completion, you can see the status as well as whether it had any errors:

[Screenshot: staging batch status, including errors, in the MDS web application]

4) Now if you see that there’s an error, you can go to stg.viw_name_MemberErrorDetails to see what the errors are. In my case, I am going to run this query:

SELECT * FROM stg.viw_supplier_MemberErrorDetails WHERE Batch_ID = 2

You can get the above query via the MDS web application too:

[Screenshot: getting the error-details query from the MDS web application]

5) And as you can imagine, you can get access to this error data via SSIS (SQL Server Integration Services) too. So if you have a workflow that (a) loads data into MDS and (b) initiates the batch process via a stored procedure, then you can program it to read errors from the stg.viw_name_MemberErrorDetails & stg.viw_name_RelationshipErrorDetails views (a sketch of step (b) follows below).

[Screenshot: SQL Server Integration Services workflow with Master Data Services]
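Here’s a minimal T-SQL sketch of step (b), assuming the supplier entity from this post (MDS generates a stg.udp_name_Leaf stored procedure for each leaf staging table; the batch tag is mine, and VERSION_1 is the default model version name, so adjust both to your environment):

-- kick off the staging process for the rows tagged 'Batch1'
EXEC stg.udp_supplier_Leaf
    @VersionName = N'VERSION_1',  -- the model version to stage into
    @LogFlag = 1,                 -- 1 = log transactions
    @BatchTag = N'Batch1';        -- the tag used when loading stg.supplier_leaf

-- then read any errors the batch produced (use the Batch_ID shown in the web UI)
SELECT * FROM stg.viw_supplier_MemberErrorDetails WHERE Batch_ID = 2;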

That’s about it for this post! I hope this helps. Your comments are very welcome!

Hadoop on Windows: How to Browse the Hadoop Filesystem?


This blog post applies to the Microsoft® HDInsight Preview for Windows. In this blog post, we’ll see how you can browse HDFS (the Hadoop filesystem).

1. I am assuming the Hadoop services are working without issues on your machine.

2. Now, can you see the Hadoop Name Node Status icon on your desktop? Yes? Great! Open it (it opens in your browser).

3. Here’s what you’ll see:

[Screenshot: the Hadoop NameNode status page]

4. Can you see the “Browse the filesystem” link? Click on it. You’ll see:

[Screenshot: browsing the filesystem from the NameNode status page]

5. I’ve used the /user/data directory lately, so let me browse to see what’s inside it:

[Screenshot: contents of /user/data]

6. You can also type the location into the text box labeled “Goto”.

7. If you’re on the command line, you can do so via the command:

hadoop fs -ls /

[Screenshot: hadoop fs -ls output on the Hadoop command line]

And if you want to browse the files inside a particular directory, pass its path to -ls.
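For example, using the /user/data directory from earlier in this post:

hadoop fs -ls /user/data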

[Screenshot: listing a directory’s contents on the Hadoop command line]

Official Resource:

HDFS File System Shell Guide

Conclusion:

In this post, we saw how to browse the Hadoop file system via the Hadoop command line & the Hadoop NameNode status page.


Visualizing MapReduce Algorithm with an Example: Finding Max Temperature


Problem Statement: Find the maximum temperature for each city from the input data.

Step 1: Input Files

File 1:

New-york, 25

Seattle, 21

New-york, 28

Dallas, 35

File 2:

New-york, 20

Seattle, 21

Seattle, 22

Dallas, 23

File 3:

New-york, 31

Seattle, 33

Dallas, 30

Dallas, 19

Step 2: Map Function

Let’s say Map1, Map2 & Map3 run on File1, File2 & File3 in parallel. Here is their output:

(Note how each map outputs “key – value” pairs; the key will be used by the reduce function later to do a “group by”. Also note that each map here emits only the maximum per city for its own file, as a combiner would, which is why Map 1’s output does not include the New-york, 25 record.)

Map 1:

Seattle, 21

New-york, 28

Dallas, 35

Map 2:

New-york, 20

Seattle, 22

Dallas, 23

Map 3:

New-york, 31

Seattle, 33

Dallas, 30

Step 3: Reduce Function

The Reduce function takes the output of Map1, Map2 & Map3 as its input and produces:

New-york, 31

Seattle, 33

Dallas, 35
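By the way, since this blog talks SQL a lot: the reduce step here is doing the equivalent of a GROUP BY aggregate. Here’s a minimal T-SQL sketch (the table and column names are mine, for illustration):

-- hypothetical table holding all the (city, temperature) records from the input files
CREATE TABLE dbo.Temperatures (City varchar(50), Temp int);

-- the whole MapReduce job above collapses into a single aggregate query
SELECT City, MAX(Temp) AS MaxTemp -- swap MAX for MIN to find the minimum temperature
FROM dbo.Temperatures
GROUP BY City; -- "group by" the key, just like the reduce function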

Conclusion:

In this post, we visualized the MapReduce programming model with an example: finding the max temperature for a city. And as you can imagine, you can extend this post to visualize:

1) Finding the minimum temperature for a city.

2) In this post, the key was the city, but you could substitute any other relevant real-world entity to solve similar-looking problems.

I hope this helps.

Related Articles:

Visualizing MapReduce Algorithm with WordCount Example

Microsoft® HDInsight Preview for Windows: How to use Sqoop to load data into HDFS from SQL Server?


In this post, we’ll see how to use Sqoop to load data into HDFS from SQL Server.

With that, here are the steps:

1. Make sure you have the Microsoft® HDInsight Preview for Windows installed on your machine. Here’s a tutorial: Installing HDInsight (Microsoft’s Hadoop) on Windows 7

2. Make sure that the cluster is up & running! To check this, I click on the “Microsoft HDInsight Dashboard” or open http://localhost:8085/ on my machine.

Did you get any “wait for cluster to start..” message? No? Great! Hopefully, all your services are working perfectly and you are good to go now!

3. Before we begin, decide on three things:

3a: The username and password that Sqoop will use to log in to the SQL Server database. If you create a new username and password, test them via SSMS before you proceed (see the T-SQL sketch after this list).

3b: Select the table that you want to load into HDFS.

In my case, it’s this table:

[Screenshot: the SQL table to be loaded into HDFS from SQL Server]

3c: The target directory in HDFS. In my case, I want it to be /user/data/sqoopstudent1.

You can create it via the command: hadoop fs -mkdir /user/data/sqoopstudent1

[to learn about how to create directory, read: How to create a directory in Hadoop File System? ]
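As promised in 3a, here’s a minimal T-SQL sketch for creating such a login (the sqoop/UniversityDB names match the import command below; adjust them to your environment, and note that your SQL Server must allow SQL Server authentication):

-- create a SQL Server login for Sqoop and give it read access to the database
CREATE LOGIN sqoop WITH PASSWORD = '**********'; -- pick a strong password
USE UniversityDB;
CREATE USER sqoop FOR LOGIN sqoop;
ALTER ROLE db_datareader ADD MEMBER sqoop; -- read access is enough for an import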

4. Now let’s start the Hadoop Command Line (can you see the icon on the desktop? Yes? Great! Open that!)

5. Navigate to: c:\Hadoop\sqoop-1.4.2\bin

*This path may change in the future, but navigate to the bin folder under SQOOP_HOME.

6. Run the dir command to see the various files under this directory.

[Screenshot: files under the Sqoop bin directory]

You can also run sqoop help for more information on the commands that we are about to run.

[Screenshot: output of sqoop help]

7. Now here’s the command to load data from SQL Server to HDFS (-m 1 tells Sqoop to use a single map task):

c:\Hadoop\sqoop-1.4.2\bin>sqoop import --connect "jdbc:sqlserver://localhost;database=UniversityDB;username=sqoop;password=**********" --table student --target-dir /user/data/sqoopstudent1 -m 1

[Screenshot: output of the Sqoop import command]

8. After successfully running the above command, let’s browse the file in HDFS! (With -m 1, Sqoop writes its output as a single file, typically named part-m-00000, under the target directory.)

[Screenshot: the contents of the imported file in HDFS]

That’s about it for this post!

Thanks

Thanks to Aviad Ezra, who answered my question on this MSDN thread: An error while trying to use Sqoop on HDInsight to import data from SQL server to HDFS

Conclusion:

In this post, we saw how to load data into Hadoop from SQL Server using Sqoop (SQL-to-Hadoop).


Microsoft® HDInsight Preview for Windows: How to create a directory in Hadoop File System?


In this post, we’ll see how to create a directory in the Hadoop file system for HDInsight’s Windows version.

Here are the steps:

1. Make sure you have the Microsoft® HDInsight Preview for Windows installed on your machine. Here’s a tutorial: Installing HDInsight (Microsoft’s Hadoop) on Windows 7

2. Make sure that the cluster is up & running! To check this, I click on the “Microsoft HDInsight Dashboard” or open http://localhost:8085/ on my machine.

Did you get any “wait for cluster to start..” message? No? Great! Hopefully, all your services are working perfectly and you are good to go now!

3. Let’s start the Hadoop Command Line (can you see the icon on the desktop? Yes? Great! Open that!)

4. Here’s what the command to create a directory looks like:

hadoop fs -mkdir /user/data/input

The above command creates /user/data/input

5. Let’s verify that the input directory was created under /user/data:

hadoop fs -ls /user/data

[Screenshot: hadoop fs -ls /user/data showing the new directory]

Conclusion:

In this post, we saw how to create a directory in the Hadoop (on Windows) file system, and we also saw how to list files/directories using the -ls command.


Neologism is the new challenge for IT professionals. Here’s why:


What is Neologism?

Neologism means the coining or use of new words, and I believe it’s one of the challenges faced by IT professionals. Nowadays, we put our time & energy into trying to get our heads around new terms/words/trends.

Let’s take a couple of examples:

Some time back, we had cloud computing. Nowadays, it’s Big Data. In my mind, Big Data has been coined to mean the following technologies/techniques under different contexts:

[Image: technologies grouped under “Big Data”: unstructured data, external data, text, public data]

Note: The above image is just for illustration purposes. It does not comprehensively cover every technology that is now called “Big Data”. Feel free to point out anything important that you think I missed.

And neologism is a challenge because:

1) Generally, it’s a new trend and there is little to no consensus on what it “exactly” means.

2) It means different things in different contexts.

3) Every person can have their own “interpretation” and no one is wrong.

4) It’s a moving target. The definition used today will change in the future, so we always need a “working” definition for these terms.

Now, don’t get me wrong, it’s fun trying to figure out what it all means and trying to gauge whether it matters to me and my organization or not! What do you think? As a person in information technology, do you think that neologism is one of the challenges we face? Consider leaving a reply in the comment section!

Related Articles:

Want to learn about Big Data? Read O’Reilly’s book “Planning for Big Data”

Quote for Big-Data / Data-Science/ Data-Analysis enthusiasts:

Who on earth is creating “Big data”?

Examples to help clarify what’s unstructured data and what’s structured?

Sentiment Analysis in R w/ Twitter data feeds


I followed the instructions on this site to perform sentiment analysis about Starbucks from Twitter data feeds.

Here are the data visualizations:

1. Sentiment Analysis: Starbucks on Twitter

[Image: sentiment analysis of Starbucks on Twitter]

2. Comparison cloud:

[Image: comparison cloud]

That’s about it for this post. Here are some related tutorials:

If you want to install R on a Windows machine, here’s a tutorial: http://parasdoshi.com/2012/11/13/lets-install-r-rstudio-on-windows-machine/

If you want to try out Hadoop on Windows, Hive, and the Hive Excel add-in w/ Twitter data, here’s a tutorial: http://parasdoshi.com/2012/11/16/how-to-load-twitter-data-into-hadoop-on-azure-cluster-and-then-analyze-it-via-hive-add-in-for-excel/

If you want to grab Twitter search data using R and export it to a tab-delimited file, here’s a tutorial: http://parasdoshi.com/2012/11/24/grab-twitter-search-data-using-r-and-export-to-a-tab-delimited-file/

Matching activity in Data Quality Services in action!


In this post, we’ll see the matching activity in action. For demo purposes, I’ll be using data sets that I found via the Movies Sample & the EIM tutorial.

Situation 1:

We’ve got a list of movies and we want to identify “matching” movie titles.

Solution 1:

Create a Matching Policy

1) Mapping the Domains:

[Screenshot: mapping the domains for a matching policy in Data Quality Services]

2) Configuring the Matching Policy:

[Screenshot: configuring the matching policy in Data Quality Services]

Note: You can have Matching Rules on more than one domain. I used one domain for demo purposes.

3) Viewing the matching results:

[Screenshot: matching results (de-duplication) in Data Quality Services]

4) Publish the KB (which stores the matching policy)

Once you have the matching policy, you can use it in a Data Quality Project:

5) See how I ran a Data Quality Project (w/ the matching activity) in the image below.

[Screenshot: a Data Quality Project running the matching activity]

Note: You can export the cleaned data set via the Data Quality Project.

Situation 2:

We’ve got a list of supplier names and we want to identify “matching” supplier names.

Note that in this situation, you’ll see how to use more than one domain to create a matching rule.

Solution 2:

Most of the steps remain the same as in Situation 1, but I want to show you the matching policy & the matching results.

Matching Policy:

[Screenshot: a matching rule using two domains, Supplier ID and Supplier Name]

Matching results:

[Screenshot: matching results with two domains in the matching rule]

Also, I want to show that the matching policy (the DQS KB) we created earlier can be used in Master Data Services too! For more details, check out the resource: DQS, MDS & SSIS for EIM

[Screenshot: a Data Quality Services KB used in Master Data Services]

Conclusion:

In this post, we saw how DQS can be used to clean “matching” records. For a step-by-step tutorial, please refer to the Movies Sample & the EIM tutorial.