To view errors that occur during MDS’s staging process, we have two views: stg.viw_name_MemberErrorDetails & stg.viw_name_RelationshipErrorDetails. For the purposes of this blog post, let me show you how I (as an administrator) view information from stg.viw_name_MemberErrorDetails.
1) The entity is named supplier and the model is named Suppliers
2) I imported data into staging table stg.supplier_leaf via SSIS
3) Now here’s how you can see the errors that occur during MDS’s staging process:
Go to the MDS web application to initiate the staging process > start the batch > after it completes, you can see the status as well as whether there are any errors:
4) If you see that there’s an error, you can query stg.viw_name_MemberErrorDetails to see what the errors are. In my case, I am going to run this query:
SELECT * FROM stg.viw_supplier_MemberErrorDetails WHERE Batch_ID = 2
You can get the above query via the MDS web application too:
5) And as you can imagine, you can access this error data via SSIS (SQL Server Integration Services) too. So if you have a workflow that (a) loads data to MDS and (b) initiates the batch process via a stored procedure, you can program it to read errors from the stg.viw_name_MemberErrorDetails & stg.viw_name_RelationshipErrorDetails views.
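If your workflow kicks off the batch from T-SQL (step b above), a minimal sketch might look like the following. I’m assuming the auto-generated staging procedure for the supplier entity is stg.udp_supplier_Leaf and that the batch tag and version name below are illustrative; adjust the names and the Batch_ID to match your own model.
-- Initiate the staging process for a batch (names here are placeholders)
EXEC stg.udp_supplier_Leaf
     @VersionName = N'VERSION_1',        -- target model version
     @LogFlag     = 1,                   -- log transactions
     @BatchTag    = N'supplier_batch_1';
-- After the batch completes, read any member errors for that batch
SELECT *
FROM   stg.viw_supplier_MemberErrorDetails
WHERE  Batch_ID = 2;   -- use the Batch_ID shown for your batch in the MDS web application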
That’s about it for this post! I hope this helps. Your comments are very welcome!
This blog post applies to the Microsoft® HDInsight Preview for a Windows machine. In this blog post, we’ll see how you can browse HDFS (the Hadoop file system).
1. I am assuming the Hadoop services are running without issues on your machine.
2. Now, can you see the Hadoop Name Node Status icon on your desktop? Yes? Great! Open it (it opens in your browser).
3. Here’s what you’ll see:
4. Can you see the “Browse the filesystem” link? Click on it. You’ll see:
5. I’ve used /user/data lately, so let me browse to see what’s inside this directory:
6. You can also type the location into the text box labeled Goto.
7. If you’re on the command line, you can do the same via this command:
hadoop fs -ls /
And if you want to browse files inside a particular directory:
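For example, to list what’s inside the /user/data directory we browsed in step 5, pass that path to -ls:
hadoop fs -ls /user/data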
Problem Statement: Find the maximum temperature for each city from the input data.
Step 1: Input Files:
File 1:
New-york, 25
Seattle, 21
New-york, 28
Dallas, 35
File 2:
New-york, 20
Seattle, 21
Seattle, 22
Dallas, 23
File 3:
New-york, 31
Seattle, 33
Dallas, 30
Dallas, 19
Step 2: Map Function
Let’s say Map 1, Map 2 & Map 3 run on File 1, File 2 & File 3 in parallel. Here is their output:
(Note how each map outputs “key – value” pairs. Also note that each map emits only the highest temperature it sees for a city within its own file; that’s why “New-york, 25” from File 1 does not appear in Map 1’s output. The key will be used by the reduce function later to do a “group by”.)
Map 1:
Seattle, 21
New-york, 28
Dallas, 35
Map 2:
New-york, 20
Seattle, 22
Dallas, 23
Map 3:
New-york, 31
Seattle, 33
Dallas, 30
Step 3: Reduce Function
The reduce function takes the output of Map 1, Map 2 & Map 3, groups it by key (the city), and emits the maximum temperature for each city:
New-york, 31
Seattle, 33
Dallas, 35
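To make the data flow concrete, here is a minimal Python sketch that simulates the example above. It is not real Hadoop code, just an illustration of the map and reduce steps; the function names are mine.
# Simulate the MapReduce flow above: each "map" processes one file and emits
# (city, local max temperature); the "reduce" groups by city and keeps the overall max.
files = [
    [("New-york", 25), ("Seattle", 21), ("New-york", 28), ("Dallas", 35)],  # File 1
    [("New-york", 20), ("Seattle", 21), ("Seattle", 22), ("Dallas", 23)],   # File 2
    [("New-york", 31), ("Seattle", 33), ("Dallas", 30), ("Dallas", 19)],    # File 3
]
def map_max(records):
    # Map step: emit (city, temperature) pairs holding the highest value seen in this file
    local_max = {}
    for city, temp in records:
        local_max[city] = max(temp, local_max.get(city, temp))
    return list(local_max.items())
def reduce_max(mapped_pairs):
    # Reduce step: group the mapped pairs by city and keep the overall maximum
    result = {}
    for city, temp in mapped_pairs:
        result[city] = max(temp, result.get(city, temp))
    return result
map_outputs = [pair for records in files for pair in map_max(records)]
print(reduce_max(map_outputs))  # {'New-york': 31, 'Seattle': 33, 'Dallas': 35}
Running the three maps on the three files produces the same pairs as the Map 1/Map 2/Map 3 outputs listed above, and the reduce produces the final result shown in Step 3.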
Conclusion:
In this post, we visualized the MapReduce programming model with an example: finding the maximum temperature for each city. And as you can imagine, you can extend this post to visualize:
1) Finding the minimum temperature for each city.
2) In this post, the key was the city, but you could substitute any other relevant real-world entity to solve similar-looking problems.
2. Make sure that the Cluster is up & running! To check this, I click on the “Microsoft HDInsight Dashboard” or open http://localhost:8085/ on my machine
Did you get a “wait for cluster to start..” message? No? Great! Hopefully all your services are working and you are good to go.
3. Before we begin, decide on three things:
3a: The username and password that Sqoop will use to log in to the SQL Server database. If you create a new username and password, test it via SSMS before you proceed.
3b: Select the table that you want to load into HDFS.
In my case, it’s this table:
3c: The target directory in HDFS. In my case, I want it to be /user/data/sqoopstudent1
You can create it with the command: hadoop fs -mkdir /user/data/sqoopstudent1
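Once the username/password, source table, and target directory are decided, the import itself is a single Sqoop command run from the Hadoop command line. Below is only a rough sketch: MyDatabase, Student, and the credentials are placeholder names I’ve made up, and the exact JDBC connection string depends on how your SQL Server instance is configured.
sqoop import --connect "jdbc:sqlserver://localhost;databaseName=MyDatabase" --username sqoopuser --password YourPassword --table Student --target-dir /user/data/sqoopstudent1 -m 1
The -m 1 option asks Sqoop to use a single map task, which keeps things simple for a small table.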
2. Make sure that the Cluster is up & running! To check this, I click on the “Microsoft HDInsight Dashboard” or open http://localhost:8085/ on my machine
Did you get a “wait for cluster to start..” message? No? Great! Hopefully all your services are working and you are good to go.
3. Let’s start the Hadoop Command Line (can you see the icon on the desktop? Yes? Great! Open it!)
4. Here’s what the command to create a directory looks like:
hadoop fs -mkdir /user/data/input
The above command creates the /user/data/input directory.
5. Let’s verify that the input directory was created under /user/data:
hadoop fs -ls /user/data
Conclusion: In this post, we saw how to create a directory in the Hadoop file system (on Windows), and we also saw how to list files/directories using the -ls command.
Neologism means the coining or use of new words, and I believe it’s one of the challenges faced by IT professionals. Nowadays, we put our time & energy into trying to get our heads around new terms/words/trends.
Let’s take a couple of examples:
Some time back, it was cloud computing. Nowadays, it’s Big Data. In my mind, Big Data has been coined to mean the following technologies/techniques in different contexts:
Note: The above image is just for illustration purposes. It does not comprehensively cover every technology that is now called “Big Data”. Feel free to point out anything important you think I missed.
And neologism is a challenge because:
1) Generally, it’s a new trend, and there is little to no consensus on what it “exactly” means.
2) It means different things in different contexts.
3) Every person can have their own “interpretation” and no one is wrong.
4) It’s a moving target: the definition used today will change in the future, so we always need a “working” definition for these terms.
Now, don’t get me wrong, it’s fun trying to figure out what it all means and to gauge whether it matters to me and my organization! What do you think? As a person in information technology, do you think neologism is one of the challenges we face? Consider leaving a reply in the comment section!
In this post, we’ll see the matching activity in action. For demo purposes, I’ll be using data sets that I found via the Movies Sample & the EIM tutorial.
Situation 1:
We have a list of movies and we want to identify “matching” movie titles.
Solution 1:
Create a Matching Policy
1) Mapping the Domains:
2) Configuring the Matching Policy:
Note: You can have Matching Rules on more than one domain. I used one domain for demo purposes.
3) View Matching results:
4) Publish the KB (which stores the matching policy)
Once you have the matching policy, you can use it in a Data Quality Project:
5) See how I ran a Data Quality Project (with the matching activity) in the image below.
Note: You can export the clean data-set via Data Quality Project.
Situation 2:
We have a list of supplier names and we want to identify “matching” supplier names.
Note that in this situation, you’ll see how to use more than one domain to create a matching rule.
Solution 2:
Most of the steps remain the same as in Situation 1, but I want to show you the matching policy & the matching results.
Matching Policy:
Matching results:
Also, I want to point out that the matching policy (DQS KB) we created earlier can be used in Master Data Services too! For more details, check out this resource: DQS, MDS & SSIS for EIM
Conclusion:
In this post, we saw how DQS can be used to identify and clean up “matching” records. For a step-by-step tutorial, please refer to the Movies Sample & the EIM tutorial.