Visualizing MapReduce Algorithm with an Example: Finding Max Temperature

Standard

Problem Statement: Find Maximum Temperature for a city from the Input data.

Step 1) Input Files:

File 1:

New-york, 25

Seattle, 21

New-york, 28

Dallas, 35

File 2:

New-york, 20

Seattle, 21

Seattle, 22

Dallas, 23

File 3:

New-york, 31

Seattle, 33

Dallas, 30

Dallas, 19

Step 2: Map Function

Let’s say Map1, Map2 & Map3 run on File1, File2 & File3 in parallel, Here is their output:

(Note how it outputs the “Key – Value” pair. The key would be used by the reduce function later to do a “group by“)

Map 1:

Seattle, 21

New-york, 28

Dallas, 35

Map 2:

New-york, 20

Seattle, 22

Dallas, 23

Map 3:

New-york, 31

Seattle, 33

Dallas, 30

Step 3: Reduce Function

Reduce Function takes the input from Map1, Map2 & Map3, to give an output:

New-york, 31

Seattle, 33

Dallas, 35

Conclusion:

In this post, we visualized MapReduce Programming Model with an example: Finding Max Temp. for a city.  And as you can imagine you can extend this post, to visualize:

1) Find Minimum Temperature for a city.

2) In this post, the key was City, But you could substitute it by other relevant real world entity to solve similar looking problems.

I hope this helps.

Related Articles:

Visualizing MapReduce Algorithm with WordCount Example

Microsoft® HDInsight Preview for Windows: How to use Sqoop to load data into HDFS from SQL Server?

Standard

In this post, we’ll see how to use Sqoop to load data into HDFS from SQL Server?

With that, here are the steps:

1. You have the Microsoft® HDInsight Preview for Windows Installed on your machine. Here’s a tutorial: Installing HDInsight (Microsoft’s Hadoop) on windows 7

2. Make sure that the Cluster is up & running! To check this, I click on the “Microsoft HDInsight Dashboard” or open http://localhost:8085/ on my machine

Did you get any “wait for cluster to start..” message? No? Great! Hopefully, all your services are working perfectly and you are good to go now!

3. Before we begin, decide on three things:

3a: Username and Password that Sqoop would use to login to the SQL Server database. If you create a new username and pasword, test it via SSMS before you proceed.

3b. select the table that you want to load into HDFS

In my case, it’s this table:

sql table to be loaded into hadoop hdfs from sql server3c: The target directory in HDFS. in my case I want it to be /user/data/sqoopstudent1

You can create by command: hadoop fs -mkdir /user/data/sqoopstudent1

[to learn about how to create directory, read: How to create a directory in Hadoop File System? ]

4. Now Let’s start the Hadoop Command Line (can you see the Icon on the Desktop? Yes? Great! Open that!)

5. Navigate to: c:Hadoopsqoop-1.4.2bin>

*This path may change in future, but navigate to the bin folder under the SQOOP_HOME.

6. Run dir command to see various files under this directory.

sqoop list files under the HOMe directory import export

Also you can run sqoop help for more information on the command that we are about to run.

sqoop list of commands help

7. Now here’s the command to Load data from SQL Server to HDFS:

c:Hadoopsqoop-1.4.2bin>sqoop import –connect “jdbc:sqlserver://localhost;dat
abase=UniversityDB;username=sqoop;password=**********” –table student –tar
get-dir /user/data/sqoopstudent1 -m 1

sqoop command to load data from sql server to hadoop file system

8. After successfully running the above command, let’s browse the file in HDFS!

sqoop see the content of the file

That’s about it for this post!

Thanks

Thanks Aviad Ezra who answered my question on this MSDN thread: An error while trying to use Sqoop on HDInsight to import data from SQL server to HDFS

Conclusion:

In this post, we saw how to load data into Hadoop from SQL Server using Sqoop (SQL Hadoop)

Related Articles:

Microsoft® HDInsight Preview for Windows: How to create a directory in Hadoop File System?

Standard

In this post, we’ll see how to create a directory in the Hadoop File System for HDInsight’s windows version.

Here are the steps:

1. You have the Microsoft® HDInsight Preview for Windows Installed on your machine. Here’s a tutorial: Installing HDInsight (Microsoft’s Hadoop) on windows 7

2. Make sure that the Cluster is up & running! To check this, I click on the “Microsoft HDInsight Dashboard” or open http://localhost:8085/ on my machine

Did you get any “wait for cluster to start..” message? No? Great! Hopefully, all your services are working perfectly and you are good to go now!

3. Let’s start the Hadoop Command Line (can you see the Icon on the Desktop? Yes? Great! Open that!)

4. Here the command to create a directory looks like:

hadoop fs -mkdir /user/data/input

The above command creates /user/data/input

5. Let’s verify that the input directory was created under /user/data

hadoop fs -ls /user/data

hadoop file system list files in a directory create directory

Conclusion:
In this post, we saw how to create a directory in Hadoop (on windows) file system and also we saw how to list files/directory using the -ls command.

Related Articles:

 

Sentiment Analysis in R w/ Twitter data feeds

Standard

I followed instructions on this site to perform sentiment analysis about Starbucks from Twitter data feeds.

Here are data visualizations:

1. Sentiment Analysis: Starbucks on Twitter

sentiment analysis starbucks on twitter

2. Comparison cloud:

comparison cloud data visualization

That’s about it for this post, Here are some related tutorials:

If you want to Install R on windows machine, here’s a Tutorial: http://parasdoshi.com/2012/11/13/lets-install-r-rstudio-on-windows-machine/

If you want to try out out Hadoop on windows, Hive and Hive excel add-in w/ Twitter Data, Here’s a Tutorial: http://parasdoshi.com/2012/11/16/how-to-load-twitter-data-into-hadoop-on-azure-cluster-and-then-analyze-it-via-hive-add-in-for-excel/

If you want to Grab Twitter search data using R and export to a tab delimited file. Here’s a tutorial: http://parasdoshi.com/2012/11/24/grab-twitter-search-data-using-r-and-export-to-a-tab-delimited-file/

There’s been a growing interest in Hadoop & Big Data, Here’s the Proof:

Standard

I like to keep an eye on Technology Trends. One of the ways I do that is by subscribing to leading magazines for articles – I may not always read the entire article but I definitely read the headlines to see what Industry is talking about. during last 12 months or so I have seen a lot of buzz around Big Data and I thought to myself – It would be nice to see a Trend line for Big Data. Taking it a step further, I am also interested in seeing if there is a correlation between growing trend in “Hadoop” and “Big Data”. Also, I wanted to see how it compares with the Terms like Business Intelligence and Data Science. With this, I turned to Google Trends to quickly create a Trend report to see the results.

Here’s the report:

Big Data Hadoop Business Intelligence

Here are some observations:

1) There’s a correlation between Trend of Big Data and Hadoop. In fact, it looks like growing interest in Hadoop fueled interest in “Big Data”.

2) Trend line of Big Data and Hadoop overtook that of Business Intelligence in Oct 2012 and sep 2012 respectively.

3) Decline in Trend line of Business Intelligence.

4) There seems to be a steady increase in Trend line for Business Analytics and Data Science.

And Here’s the Google Trend report URL: http://www.google.com/trends/explore#q=Big%20Data%2C%20Hadoop%2C%20Business%20Intelligence%2C%20Business%20Analytics%2C%20Data%20Science&cmpt=q

What do you think about these trends?

Hadoop on Azure’s Javascript Interactive Console has basic graphing functions:

Standard

The Hadoop on Azure’s Javascript console has basic graphing functions: Bar, Line & Chart. I think this is great becuase it gives an opportunity to visualize data that’s in HDFS directly from the Interactive Javascript Console! Here’s a screenshot:

hadoop on azure bar and line graph javascript

In the console, I ran the help(“graph”) command to see how I can use this function:
Draw a graph of data
graph.bar(data, options) Bar graph
graph.line(data, options) Line graph
graph.pie(data, options) Pie chart

Parameters
data (array) Array of data objects
options (object) Options object, with
x (string) Property to use for x-axis values
y (string) Property to use for y-axis values
title (string) Graph title
orientation (number) x-axis label orientation in degrees
tickInterval (number) x-axis tick interval

Conclusion:

In this blog-post, I posted that Hadoop on Azure’s Javascript Interactive Console has basic graphing functions.

Related articles:

How to Load Twitter data into Hadoop on Azure cluster and then analyze it via Hive add-in for excel?

Standard

In this blog post, we would:

1. Upload Twitter Text Data into Hadoop on Azure cluster

2. Create a Hive Table and load the data uploaded in step 1 to the Hive Table

3. Analyze data in Hive via Excel Add-in

Before we begin, I assume you have access to Hadoop on azure, Have your sample data (don’t have one? learn from a blog post), familiar with Hadoop ecosystem and know your way around the Hadoop on Azure Dashboard.

Now, Here are the steps involved:

STEP 1: Upload Twitter Text Data into Hadoop on Azure cluster

1. Have your data to be uploaded ready! I am just going to Copy Paste the File from my host machine to the RDP’ed machine. In this case, the machine that I am going is the Hadoop on Azure cluster.

For the purpose of this blog post, I have a text file having 1500 tweets:

upload twitter text data to hadoop on azure

2. Open web browser > Go to your cluster in Hadoop on Azure

3. RDP into your Hadoop on Azure cluster

Remote Desktop into Hadoop on Azure cluster

4. Copy-Paste the File. It’s a small data file so this approach works for now.

uploading twitter text data to hadoop on azure hdfs cluster

Step 2: Create a Hive Table and load the data uploaded in step 1 to the Hive Table

1. Stay on the machine that you Remote Desktop (RDP’ed) into.

2. Open the Hadoop command line (you’ll see a icon on your Desktop)

3. switch to Hive:

write hive commands in hadoop on azure

4. Use the following Hive Commands:

DROP TABLE IF EXISTS TweetSampleTable;

CREATE TABLE TweetSampleTable (
id string,
text string,
favorited string,
replyToSN string,
created string,
truncated string,
replyToSID string,
replyToUID string,
statusSource string,
screenName string
);

LOAD DATA LOCAL INPATH ‘C:appsdistexamplesdatatweets.txt’ OVERWRITE INTO TABLE TweetSampleTable;

Note that for the purpose of this blog-post, I’ve chose string as data type for all fields. This is something that depends on the data that you have. If I were building a solution, I would spend some more time choosing the right data type.

Step 3. Analyze data in Hive via Excel Add-in

1. Switch to Hadoop on Azure Dashboard

2. Go to the Hive Console and run the show tables to verify that there is a tweetsampletable.

show all tables in hive hadoop on azure

3. Now if you haven’t, Download and Install the Hive ODBC Driver from the Downloads section of your Hadoop on Azure Dashboard.

4. I setup  a ODBC connection to Hive by following the instructions here: How To Connect Excel to Hadoop on Azure via HiveODBC (en-US)

5. After that, Open Excel. I have Excel 2010 64 bits.

6. Switch to Data Tab > Hive Pane

7. Choose the Hive connection > select Table > Select Columns > And off you go!

you have Hive Data in Excel!

Hadoop on azure Hive Excel addin

Now go Analyze!

Conclusion:

In this blog-post, we saw How to Load Twitter data into Hadoop on Azure cluster and then analyze it via Hive add-in for excel?