Recapping my social media activities during Jan 1 – Feb 24 2013:

Recapping my social media activities during Jan 1 – Feb 20 2013:

That’s about it for this post.

If you want to read related past posts, here they are:

OCT 3 – OCT 10 2012

OCT 11 – OCT 18 2012

OCT 19 – NOV 11 2012

NOV 12 – DEC 31 2012

Let’s connect and converse on any of these people networks!

Resource: A great tutorial for Hadoop on local Windows and Azure.

Here’s the resource: http://gettingstarted.hadooponazure.com/gettingStarted.html > “HDInsight Jumpstart”

The tutorial teaches you how to analyze log files using Hadoop tools such as MapReduce, Hive and Sqoop. Check it out! It works with both HDInsight for local Windows and Hadoop on Azure:

[Screenshot: HDInsight getting-started tutorial for Hadoop on Windows]

Conclusion:

I hope this resource helps you get started on building an end-to-end solution with Hadoop on Windows/Azure.

How to Install Microsoft .NET SDK for Hadoop?

There are two main steps:

1. Installing NuGet Package Manager, if you haven’t already.

2. Installing Microsoft .NET SDK for Hadoop

Installing NuGet Package Manager

1. Open Visual Studio

2. Tools menu > Extensions Manager > Search online gallery > NuGet

3. Download and install NuGet:

[Screenshot: NuGet Package Manager in the Extensions Manager]

4. Restart Visual Studio

Installing Microsoft .NET SDK for Hadoop

1. Tools menu > Library Package Manager > Package Manager Console

2. Install the MapReduce, LINQ to Hive and WebHDFS components by running the install command for each package at the Package Manager prompt.

Example for MapReduce:

install-package Microsoft.Hadoop.MapReduce -pre

[Screenshot: installing the Microsoft.Hadoop.MapReduce package from the Package Manager Console]

Conclusion:

In this post, we saw how to install Microsoft .NET SDK for Hadoop.

Resource:

Continue learning: Programming MapReduce Jobs with HDInsight Server for Windows

Inner workings of HDFS and MapReduce in a nutshell:

HDFS and MapReduce inner workings in a nutshell.

[Image: diagram of the inner workings of HDFS and MapReduce]

Microsoft® HDInsight Preview for Windows: How to create a directory in Hadoop File System?

In this post, we’ll see how to create a directory in the Hadoop File System on the Windows version of HDInsight.

Here are the steps:

1. Make sure Microsoft® HDInsight Preview for Windows is installed on your machine. Here’s a tutorial: Installing HDInsight (Microsoft’s Hadoop) on Windows 7

2. Make sure that the cluster is up and running! To check this, I click on the “Microsoft HDInsight Dashboard” or open http://localhost:8085/ on my machine.

Did you get any “wait for cluster to start..” message? No? Great! Hopefully, all your services are working perfectly and you are good to go now!

3. Let’s start the Hadoop Command Line (can you see the icon on the Desktop? Yes? Great! Open that!)

4. Here’s the command to create a directory:

hadoop fs -mkdir /user/data/input

The above command creates /user/data/input

5. Let’s verify that the input directory was created under /user/data

hadoop fs -ls /user/data

[Screenshot: creating a directory and listing files in the Hadoop file system]

Conclusion:
In this post, we saw how to create a directory in the Hadoop file system (on Windows) and how to list files and directories using the -ls command.

Sentiment Analysis in R w/ Twitter data feeds

I followed instructions on this site to perform sentiment analysis about Starbucks from Twitter data feeds.

Here are data visualizations:

1. Sentiment Analysis: Starbucks on Twitter

[Image: sentiment analysis of Starbucks on Twitter]

2. Comparison cloud:

[Image: comparison cloud data visualization]

That’s about it for this post. Here are some related tutorials:

If you want to install R on a Windows machine, here’s a tutorial: http://parasdoshi.com/2012/11/13/lets-install-r-rstudio-on-windows-machine/

If you want to try out Hadoop on Windows, Hive and the Hive Excel add-in with Twitter data, here’s a tutorial: http://parasdoshi.com/2012/11/16/how-to-load-twitter-data-into-hadoop-on-azure-cluster-and-then-analyze-it-via-hive-add-in-for-excel/

If you want to grab Twitter search data using R and export it to a tab-delimited file, here’s a tutorial: http://parasdoshi.com/2012/11/24/grab-twitter-search-data-using-r-and-export-to-a-tab-delimited-file/

There’s been a growing interest in Hadoop & Big Data. Here’s the proof:

I like to keep an eye on technology trends. One of the ways I do that is by subscribing to leading magazines. I may not always read the entire article, but I definitely read the headlines to see what the industry is talking about. During the last 12 months or so, I have seen a lot of buzz around Big Data, and I thought it would be nice to see a trend line for it. Taking it a step further, I also wanted to see whether there is a correlation between the growing trends in “Hadoop” and “Big Data”, and how they compare with terms like Business Intelligence and Data Science. So I turned to Google Trends to quickly create a trend report.

Here’s the report:

[Chart: Google Trends report for Big Data, Hadoop and Business Intelligence]

Here are some observations:

1) There’s a correlation between the trends for Big Data and Hadoop. In fact, it looks like growing interest in Hadoop fueled interest in “Big Data”.

2) The trend lines for Big Data and Hadoop overtook that of Business Intelligence in Oct 2012 and Sep 2012 respectively.

3) The trend line for Business Intelligence is declining.

4) There seems to be a steady increase in Trend line for Business Analytics and Data Science.

And Here’s the Google Trend report URL: http://www.google.com/trends/explore#q=Big%20Data%2C%20Hadoop%2C%20Business%20Intelligence%2C%20Business%20Analytics%2C%20Data%20Science&cmpt=q
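
As a side note, the report URL above can be rebuilt from the list of search terms. Here’s a minimal sketch, assuming the URL format stays exactly as in the link above. The variable names are mine, and I haven’t checked whether Google Trends still accepts this format:

```javascript
// Search terms compared in the report.
var terms = [
  "Big Data",
  "Hadoop",
  "Business Intelligence",
  "Business Analytics",
  "Data Science"
];

// Join the terms with ", " and percent-encode them
// (space becomes %20, comma becomes %2C).
var query = encodeURIComponent(terms.join(", "));

// Base URL and the cmpt=q suffix are copied from the link in this post.
var url = "http://www.google.com/trends/explore#q=" + query + "&cmpt=q";

console.log(url);
```

Adding or removing a term in the array is then enough to get a URL for a different comparison.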

What do you think about these trends?

Hadoop on Azure’s JavaScript Interactive Console has basic graphing functions:

The Hadoop on Azure JavaScript console has basic graphing functions: bar, line and pie. I think this is great because it gives an opportunity to visualize data that’s in HDFS directly from the Interactive JavaScript Console! Here’s a screenshot:

[Screenshot: bar and line graphs in the Hadoop on Azure JavaScript console]

In the console, I ran the help("graph") command to see how I can use this function:

Draw a graph of data
    graph.bar(data, options)     Bar graph
    graph.line(data, options)    Line graph
    graph.pie(data, options)     Pie chart

Parameters
    data (array)        Array of data objects
    options (object)    Options object, with
        x (string)               Property to use for x-axis values
        y (string)               Property to use for y-axis values
        title (string)           Graph title
        orientation (number)     x-axis label orientation in degrees
        tickInterval (number)    x-axis tick interval
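
Based on the help text above, a call could look something like the following. This is only a sketch: the sample data, the property names word and count, and the option values are made up for illustration. The graph object exists only inside the Interactive JavaScript Console, so the snippet falls back to printing the arguments when it runs anywhere else.

```javascript
// Made-up sample data: an array of data objects, one per bar.
var data = [
  { word: "hadoop", count: 3 },
  { word: "azure", count: 2 },
  { word: "server", count: 1 }
];

// Options object using only the properties listed in the help text.
var options = {
  x: "word",            // property used for x-axis values
  y: "count",           // property used for y-axis values
  title: "Word counts", // graph title
  orientation: 45       // x-axis label orientation in degrees (arbitrary value)
};

// `graph` is provided by the console; outside of it we only print the call.
if (typeof graph !== "undefined") {
  graph.bar(data, options);
} else {
  console.log("graph.bar(data, options) with options:", JSON.stringify(options));
}
```

Swapping graph.bar for graph.line or graph.pie should work the same way, since the help text lists the same (data, options) parameters for all three.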

Conclusion:

In this blog post, we saw that the Hadoop on Azure JavaScript Interactive Console has basic graphing functions.

How to Load Twitter data into Hadoop on Azure cluster and then analyze it via Hive add-in for Excel?

In this blog post, we will:

1. Upload Twitter Text Data into Hadoop on Azure cluster

2. Create a Hive Table and load the data uploaded in step 1 to the Hive Table

3. Analyze data in Hive via Excel Add-in

Before we begin, I assume you have access to Hadoop on Azure, have your sample data ready (don’t have any? learn from a blog post), are familiar with the Hadoop ecosystem and know your way around the Hadoop on Azure Dashboard.

Now, here are the steps involved:

Step 1: Upload Twitter Text Data into Hadoop on Azure cluster

1. Have the data to be uploaded ready! I am just going to copy and paste the file from my host machine to the RDP’ed machine. In this case, the machine that I am connecting to is the Hadoop on Azure cluster.

For the purpose of this blog post, I have a text file containing 1500 tweets:

[Screenshot: Twitter text data to upload to Hadoop on Azure]

2. Open web browser > Go to your cluster in Hadoop on Azure

3. RDP into your Hadoop on Azure cluster

[Screenshot: Remote Desktop into the Hadoop on Azure cluster]

4. Copy and paste the file. It’s a small data file, so this approach works for now.

[Screenshot: uploading the Twitter text data to the Hadoop on Azure cluster]

Step 2: Create a Hive Table and load the data uploaded in step 1 to the Hive Table

1. Stay on the machine that you connected to via Remote Desktop (RDP).

2. Open the Hadoop command line (you’ll see an icon on your Desktop)

3. Switch to Hive:

[Screenshot: writing Hive commands in Hadoop on Azure]

4. Use the following Hive Commands:

DROP TABLE IF EXISTS TweetSampleTable;

CREATE TABLE TweetSampleTable (
id string,
text string,
favorited string,
replyToSN string,
created string,
truncated string,
replyToSID string,
replyToUID string,
statusSource string,
screenName string
);

LOAD DATA LOCAL INPATH 'C:\apps\dist\examples\data\tweets.txt' OVERWRITE INTO TABLE TweetSampleTable;

Note that for the purpose of this blog post, I’ve chosen string as the data type for all fields. The right choice depends on the data that you have. If I were building a solution, I would spend some more time choosing the right data types.

Step 3: Analyze data in Hive via Excel Add-in

1. Switch to the Hadoop on Azure Dashboard

2. Go to the Hive Console and run the show tables command to verify that tweetsampletable exists.

[Screenshot: show tables output in the Hive console on Hadoop on Azure]

3. If you haven’t already, download and install the Hive ODBC Driver from the Downloads section of your Hadoop on Azure Dashboard.

4. I set up an ODBC connection to Hive by following the instructions here: How To Connect Excel to Hadoop on Azure via HiveODBC (en-US)

5. After that, open Excel. I have the 64-bit version of Excel 2010.

6. Switch to the Data tab > Hive Pane

7. Choose the Hive connection > select the table > select columns > and off you go!

You have Hive data in Excel!

[Screenshot: Hive data in Excel via the Hive add-in]

Now go Analyze!

Conclusion:

In this blog post, we saw how to load Twitter data into a Hadoop on Azure cluster and then analyze it via the Hive add-in for Excel.

Visualizing MapReduce Algorithm with WordCount Example:

In this blog post, we will visualize how the MapReduce algorithm operates to perform a word count on a text input.

First of all, for all programmers out there, here is the code (JavaScript):

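[sourcecode language="javascript"]
// Map: split each input line into words and emit (word, 1) for every word.
var map = function (key, value, context) {
    // Every non-letter character is treated as a separator.
    var words = value.split(/[^a-zA-Z]/);
    for (var i = 0; i < words.length; i++) {
        // Skip the empty strings produced by consecutive separators.
        if (words[i] !== "") {
            context.write(words[i].toLowerCase(), 1);
        }
    }
};

// Reduce: add up all the counts emitted for one word.
var reduce = function (key, values, context) {
    var sum = 0;
    while (values.hasNext()) {
        sum += parseInt(values.next(), 10);
    }
    context.write(key, sum);
};
[/sourcecode]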

Courtesy: Microsoft Hadoop on Azure Samples

Now, let’s visualize this using an example.

Suppose the text is “Hadoop on Azure sample Hadoop is on Windows Azure Hadoop is on Windows server”. This is how you can think of what happens to your input when it is processed first by the Map function and then by the Reduce function:

INPUT                       | MAP       | REDUCE
Hadoop on Azure sample      | hadoop 1  | hadoop 3
                            | on 1      | on 3
                            | azure 1   | azure 2
                            | sample 1  | sample 1
Hadoop is on Windows Azure  | hadoop 1  | is 2
                            | is 1      | windows 2
                            | on 1      | server 1
                            | windows 1 |
                            | azure 1   |
Hadoop is on Windows server | hadoop 1  |
                            | is 1      |
                            | on 1      |
                            | windows 1 |
                            | server 1  |
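
If you’d like to reproduce these counts on your own machine, here is a rough sketch that runs the same map and reduce functions outside of Hadoop. The runWordCount harness and its fake context and values objects are hypothetical helpers that I wrote for illustration. They only imitate the write, hasNext and next calls used in the code, and are not how Hadoop itself schedules or shuffles the work:

```javascript
// Word-count map and reduce functions, as in the post's listing.
var map = function (key, value, context) {
  var words = value.split(/[^a-zA-Z]/);
  for (var i = 0; i < words.length; i++) {
    if (words[i] !== "") {
      context.write(words[i].toLowerCase(), 1);
    }
  }
};

var reduce = function (key, values, context) {
  var sum = 0;
  while (values.hasNext()) {
    sum += parseInt(values.next(), 10);
  }
  context.write(key, sum);
};

// Hypothetical local harness (not part of Hadoop): runs map over each input
// line, groups the emitted values by key, then calls reduce once per key.
function runWordCount(lines) {
  var grouped = {};
  var mapContext = {
    write: function (k, v) {
      if (!Object.prototype.hasOwnProperty.call(grouped, k)) {
        grouped[k] = [];
      }
      grouped[k].push(v);
    }
  };
  lines.forEach(function (line, offset) {
    map(offset, line, mapContext);
  });

  var counts = {};
  var reduceContext = {
    write: function (k, v) { counts[k] = v; }
  };
  Object.keys(grouped).forEach(function (k) {
    var i = 0;
    // Minimal iterator exposing the hasNext/next calls that reduce expects.
    var values = {
      hasNext: function () { return i < grouped[k].length; },
      next: function () { return grouped[k][i++]; }
    };
    reduce(k, values, reduceContext);
  });
  return counts;
}

var counts = runWordCount([
  "Hadoop on Azure sample",
  "Hadoop is on Windows Azure",
  "Hadoop is on Windows server"
]);
console.log(counts);
```

Note that the keys come out in lowercase because the map function calls toLowerCase() before emitting each word.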

Conclusion:

In this blog post, we visualized how MapReduce Algorithm operates for a WordCount Example.