In this blog post, we would:
1. Upload Twitter Text Data into Hadoop on Azure cluster
2. Create a Hive Table and load the data uploaded in step 1 to the Hive Table
3. Analyze data in Hive via Excel Add-in
Before we begin, I assume you have access to Hadoop on azure, Have your sample data (don’t have one? learn from a blog post), familiar with Hadoop ecosystem and know your way around the Hadoop on Azure Dashboard.
Now, Here are the steps involved:
STEP 1: Upload Twitter Text Data into Hadoop on Azure cluster
1. Have your data to be uploaded ready! I am just going to Copy Paste the File from my host machine to the RDP’ed machine. In this case, the machine that I am going is the Hadoop on Azure cluster.
For the purpose of this blog post, I have a text file having 1500 tweets:
2. Open web browser > Go to your cluster in Hadoop on Azure
3. RDP into your Hadoop on Azure cluster
4. Copy-Paste the File. It’s a small data file so this approach works for now.
Step 2: Create a Hive Table and load the data uploaded in step 1 to the Hive Table
1. Stay on the machine that you Remote Desktop (RDP’ed) into.
2. Open the Hadoop command line (you’ll see a icon on your Desktop)
3. switch to Hive:
4. Use the following Hive Commands:
DROP TABLE IF EXISTS TweetSampleTable;
CREATE TABLE TweetSampleTable (
id string,
text string,
favorited string,
replyToSN string,
created string,
truncated string,
replyToSID string,
replyToUID string,
statusSource string,
screenName string
);
LOAD DATA LOCAL INPATH ‘C:appsdistexamplesdatatweets.txt’ OVERWRITE INTO TABLE TweetSampleTable;
Note that for the purpose of this blog-post, I’ve chose string as data type for all fields. This is something that depends on the data that you have. If I were building a solution, I would spend some more time choosing the right data type.
Step 3. Analyze data in Hive via Excel Add-in
1. Switch to Hadoop on Azure Dashboard
2. Go to the Hive Console and run the show tables to verify that there is a tweetsampletable.
3. Now if you haven’t, Download and Install the Hive ODBC Driver from the Downloads section of your Hadoop on Azure Dashboard.
4. I setup a ODBC connection to Hive by following the instructions here: How To Connect Excel to Hadoop on Azure via HiveODBC (en-US)
5. After that, Open Excel. I have Excel 2010 64 bits.
6. Switch to Data Tab > Hive Pane
7. Choose the Hive connection > select Table > Select Columns > And off you go!
you have Hive Data in Excel!
Now go Analyze!
Conclusion:
In this blog-post, we saw How to Load Twitter data into Hadoop on Azure cluster and then analyze it via Hive add-in for excel?
4 thoughts on “How to Load Twitter data into Hadoop on Azure cluster and then analyze it via Hive add-in for excel?”