This tutorial will teach you how to analyze log files using Hadoop tools like MapReduce, Hive, and Sqoop. Check it out! It works with both HDInsight on a local Windows machine and Hadoop on Azure:
Conclusion:
I hope this resource helps you get started on building an end-to-end solution with Hadoop on Windows/Azure.
This blog post applies to Microsoft® HDInsight Preview on a Windows machine. In this post, we’ll see how you can browse HDFS (the Hadoop file system).
1. I am assuming Hadoop Services are working without issues on your machine.
2. Now, can you see the Hadoop Name Node Status icon on your desktop? Yes? Great! Open it (it opens in a browser).
3. Here’s what you’ll see:
4. Can you see the “Browse the filesystem” link? Click on it. You’ll see:
5. I’ve used /user/data lately, so let me browse to see what’s inside this directory:
6. You can also type the location in the text box labeled Goto.
7. If you’re on the command line, you can do the same via the command:
hadoop fs -ls /
And if you want to browse files inside a particular directory:
Problem Statement: Find Maximum Temperature for a city from the Input data.
Step 1: Input Files
File 1:
New-york, 25
Seattle, 21
New-york, 28
Dallas, 35
File 2:
New-york, 20
Seattle, 21
Seattle, 22
Dallas, 23
File 3:
New-york, 31
Seattle, 33
Dallas, 30
Dallas, 19
Step 2: Map Function
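Let’s say Map 1, Map 2 & Map 3 run on File 1, File 2 & File 3 in parallel. Here is their output:
(Note how each map outputs “Key – Value” pairs and keeps only the highest temperature per city within its own file; for example, File 1 has New-york at 25 and 28, and Map 1 outputs only 28. The key is used later by the reduce function to do a “group by”.)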
Map 1:
Seattle, 21
New-york, 28
Dallas, 35
Map 2:
New-york, 20
Seattle, 22
Dallas, 23
Map 3:
New-york, 31
Seattle, 33
Dallas, 30
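The map step above can be sketched in JavaScript. This is only an illustration under assumptions: `mapMaxTemperature` is a hypothetical helper, not part of any Hadoop API, and it receives the whole content of one input file as a string. It emits one pair per city, which matches the Map 1, Map 2 and Map 3 outputs shown above.

```javascript
// Hypothetical helper (not a Hadoop API): simulates one map task
// over the text of one input file. Each line looks like "City, temperature".
function mapMaxTemperature(fileText) {
  const maxByCity = new Map();
  for (const line of fileText.split("\n")) {
    if (line.trim() === "") continue; // skip blank lines
    const [city, temp] = line.split(",").map((s) => s.trim());
    const value = parseInt(temp, 10);
    // Keep only the highest temperature seen for this city in this file.
    if (!maxByCity.has(city) || value > maxByCity.get(city)) {
      maxByCity.set(city, value);
    }
  }
  // Emit key-value pairs: [city, highest temperature in this file].
  return Array.from(maxByCity.entries());
}

// File 1 from Step 1.
const file1 = "New-york, 25\nSeattle, 21\nNew-york, 28\nDallas, 35";
const map1Output = mapMaxTemperature(file1);
console.log(map1Output);
```

The order of the emitted pairs does not matter here, because the reduce step groups them by key.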
Step 3: Reduce Function
The reduce function takes the output of Map 1, Map 2 & Map 3, groups it by city, and outputs the maximum temperature for each city:
New-york, 31
Seattle, 33
Dallas, 35
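Here is a minimal JavaScript sketch of the grouping and reduce step, assuming the three map outputs listed in Step 2 as input. The helper names `groupByKey` and `reduceMax` are made up for this illustration and are not Hadoop functions.

```javascript
// Outputs of Map 1, Map 2 and Map 3, copied from Step 2.
const mapOutputs = [
  [["Seattle", 21], ["New-york", 28], ["Dallas", 35]],
  [["New-york", 20], ["Seattle", 22], ["Dallas", 23]],
  [["New-york", 31], ["Seattle", 33], ["Dallas", 30]],
];

// Hypothetical helper: collect all values that share a key (the "group by").
function groupByKey(outputs) {
  const groups = new Map();
  for (const pairs of outputs) {
    for (const [key, value] of pairs) {
      if (!groups.has(key)) groups.set(key, []);
      groups.get(key).push(value);
    }
  }
  return groups;
}

// Hypothetical reduce: for one city, keep the highest temperature.
function reduceMax(city, temperatures) {
  return [city, Math.max(...temperatures)];
}

const result = Array.from(groupByKey(mapOutputs), ([city, temps]) =>
  reduceMax(city, temps)
);
console.log(result);
```

Swapping `Math.max` for `Math.min` in both the map and the reduce step would give the minimum temperature per city instead.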
Conclusion:
In this post, we visualized the MapReduce programming model with an example: finding the maximum temperature for a city. As you can imagine, you can extend this example to:
1) Find the minimum temperature for a city.
2) Use a different key. In this post, the key was the city, but you could substitute any other relevant real-world entity to solve similar-looking problems.
2. Make sure that the Cluster is up & running! To check this, I click on the “Microsoft HDInsight Dashboard” or open http://localhost:8085/ on my machine
Did you get any “wait for cluster to start..” message? No? Great! Hopefully, all your services are working perfectly and you are good to go now!
3. Before we begin, decide on three things:
3a. The username and password that Sqoop will use to log in to the SQL Server database. If you create a new username and password, test them via SSMS before you proceed.
3b. The table that you want to load into HDFS.
In my case, it’s this table:
3c. The target directory in HDFS. In my case, I want it to be /user/data/sqoopstudent1
You can create it with the command: hadoop fs -mkdir /user/data/sqoopstudent1
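To show how these three decisions typically feed into a Sqoop import, here is a hedged JavaScript sketch that only assembles the argument list. `buildSqoopImportArgs` is a hypothetical helper, and the server, database, login, and table values are placeholders, not the ones used in this post. Only the target directory comes from 3c. The flags used (`--connect`, `--username`, `--password`, `--table`, `--target-dir`) are standard `sqoop import` arguments.

```javascript
// Hypothetical helper: assembles arguments for "sqoop import"
// from the three decisions above. It does not run Sqoop.
function buildSqoopImportArgs(options) {
  // JDBC URL for SQL Server; server and database are placeholders.
  const connect =
    "jdbc:sqlserver://" + options.server + ";databaseName=" + options.database;
  return [
    "import",
    "--connect", connect,
    "--username", options.username,    // 3a
    "--password", options.password,    // 3a
    "--table", options.table,          // 3b
    "--target-dir", options.targetDir, // 3c
  ];
}

const sqoopArgs = buildSqoopImportArgs({
  server: "localhost",                   // assumption: SQL Server on this machine
  database: "exampledb",                 // placeholder database name
  username: "sqoopuser",                 // placeholder login
  password: "example-password",          // placeholder password
  table: "student",                      // placeholder table name
  targetDir: "/user/data/sqoopstudent1", // target directory from 3c
});
console.log(sqoopArgs);
```

If you paste the resulting arguments into a command line by hand, quote the `--connect` value, since it contains a semicolon.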
2. Make sure that the Cluster is up & running! To check this, I click on the “Microsoft HDInsight Dashboard” or open http://localhost:8085/ on my machine
Did you get any “wait for cluster to start..” message? No? Great! Hopefully, all your services are working perfectly and you are good to go now!
3. Let’s start the Hadoop Command Line (can you see the icon on the desktop? Yes? Great! Open that!)
4. The command to create a directory looks like this:
hadoop fs -mkdir /user/data/input
The above command creates /user/data/input
5. Let’s verify that the input directory was created under /user/data
hadoop fs -ls /user/data
Conclusion: In this post, we saw how to create a directory in the Hadoop file system (on Windows) and how to list files and directories using the -ls command.
1. Upload Twitter Text Data into Hadoop on Azure cluster
2. Create a Hive Table and load the data uploaded in step 1 to the Hive Table
3. Analyze data in Hive via Excel Add-in
Before we begin, I assume you have access to Hadoop on Azure, have your sample data (don’t have any? learn how from a blog post), are familiar with the Hadoop ecosystem, and know your way around the Hadoop on Azure Dashboard.
Now, here are the steps involved:
Step 1: Upload Twitter Text Data into Hadoop on Azure cluster
1. Have the data to be uploaded ready! I am just going to copy and paste the file from my host machine to the RDP’ed machine. In this case, the machine that I am connecting to is the Hadoop on Azure cluster.
For the purpose of this blog post, I have a text file containing 1500 tweets:
2. Open web browser > Go to your cluster in Hadoop on Azure
3. RDP into your Hadoop on Azure cluster
4. Copy and paste the file. It’s a small data file, so this approach works for now.
Step 2: Create a Hive Table and load the data uploaded in step 1 to the Hive Table
1. Stay on the machine that you connected to via Remote Desktop (RDP).
2. Open the Hadoop command line (you’ll see an icon on your desktop).
3. Switch to Hive:
4. Use the following Hive Commands:
DROP TABLE IF EXISTS TweetSampleTable;
CREATE TABLE TweetSampleTable ( id string, text string, favorited string, replyToSN string, created string, truncated string, replyToSID string, replyToUID string, statusSource string, screenName string );
LOAD DATA LOCAL INPATH 'C:\apps\dist\examples\data\tweets.txt' OVERWRITE INTO TABLE TweetSampleTable;
Note that for the purpose of this blog post, I’ve chosen string as the data type for all fields. The right choice depends on the data that you have. If I were building a solution, I would spend some more time choosing the right data types.
Step 3: Analyze data in Hive via Excel Add-in
1. Switch to Hadoop on Azure Dashboard
2. Go to the Hive Console and run show tables to verify that there is a tweetsampletable.
3. Now, if you haven’t already, download and install the Hive ODBC Driver from the Downloads section of your Hadoop on Azure Dashboard.
In this blog post, we will visualize how the MapReduce algorithm operates to perform a word count on a text input:
First of all, for all the programmers out there, here is the code (JavaScript):
[sourcecode language="javascript"]
var map = function (key, value, context) {
    var words = value.split(/[^a-zA-Z]/);
    for (var i = 0; i < words.length; i++) {
        if (words[i] !== "") {
            context.write(words[i].toLowerCase(), 1);
        }
    }
};

var reduce = function (key, values, context) {
    var sum = 0;
    while (values.hasNext()) {
        sum += parseInt(values.next());
    }
    context.write(key, sum);
};
[/sourcecode]
Courtesy: Microsoft Hadoop on Azure Samples
Now, let’s visualize this using an example.
Suppose the text is “Hadoop on Azure sample Hadoop is on Windows Azure Hadoop is on Windows server”. Then this is how you can think of what happens to your input when it is processed first by the map function and then by the reduce function:
INPUT:
Hadoop on Azure sample
Hadoop is on Windows Azure
Hadoop is on Windows server
MAP (input line 1):
Hadoop, 1
On, 1
Azure, 1
Sample, 1
MAP (input line 2):
Hadoop, 1
Is, 1
On, 1
Windows, 1
Azure, 1
MAP (input line 3):
Hadoop, 1
Is, 1
On, 1
Windows, 1
Server, 1
REDUCE:
Hadoop, 3
On, 3
Azure, 2
Sample, 1
Is, 2
Windows, 2
Server, 1
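If you want to try this flow without a cluster, the following sketch runs the same map and reduce functions locally on the three input lines. The `makeContext` and `makeIterator` helpers are hypothetical stand-ins for what the Hadoop on Azure runtime provides. They are written only for this illustration and assume that `context` needs just a `write` method and that `values` needs just `hasNext` and `next`.

```javascript
// map and reduce as given in the sample code above.
var map = function (key, value, context) {
  var words = value.split(/[^a-zA-Z]/);
  for (var i = 0; i < words.length; i++) {
    if (words[i] !== "") {
      context.write(words[i].toLowerCase(), 1);
    }
  }
};

var reduce = function (key, values, context) {
  var sum = 0;
  while (values.hasNext()) {
    sum += parseInt(values.next());
  }
  context.write(key, sum);
};

// Hypothetical local stand-in: a context that collects written pairs.
function makeContext() {
  var pairs = [];
  return { pairs: pairs, write: function (k, v) { pairs.push([k, v]); } };
}

// Hypothetical local stand-in: an iterator with hasNext()/next().
function makeIterator(items) {
  var i = 0;
  return {
    hasNext: function () { return i < items.length; },
    next: function () { return items[i++]; },
  };
}

var lines = [
  "Hadoop on Azure sample",
  "Hadoop is on Windows Azure",
  "Hadoop is on Windows server",
];

// Map phase: one call per input line.
var mapContext = makeContext();
lines.forEach(function (line, index) { map(index, line, mapContext); });

// Group the mapped values by key.
var grouped = Object.create(null);
mapContext.pairs.forEach(function (pair) {
  (grouped[pair[0]] = grouped[pair[0]] || []).push(pair[1]);
});

// Reduce phase: one call per distinct word.
var reduceContext = makeContext();
Object.keys(grouped).forEach(function (word) {
  reduce(word, makeIterator(grouped[word]), reduceContext);
});

console.log(reduceContext.pairs);
```

Because the map function calls `toLowerCase()`, the keys come out in lowercase (for example `hadoop` rather than `Hadoop`); the counts match the REDUCE output above.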
Conclusion:
In this blog post, we visualized how the MapReduce algorithm operates for a word count example.