First Impression: Google’s BigData offering called BigQuery

As part of an assignment for the University of Washington's (UW) cloud class, I played with Google's BigData offering, BigQuery, and I am writing this blog post to share what I think of it. Please note that the views are my own and do not represent those of the instructor or fellow students at UW. Also, I am not a BigData "expert"; think of me as a student trying to get my head around the various offerings out there. So if you feel otherwise about anything I have written, just let me know in the comments section. Anywho, read along to see what I think of BigQuery:

First up: what is BigQuery?

It's a platform to analyze your data (lots of it) by running SQL-like queries. And it really is SQL-like, so if you come from the SQL world like me, you will have no trouble getting up and running in no time by referring to the nicely written documentation.

Another point to consider is that even though it's only SQL-like, you can analyze a considerable number of rows in a few seconds. Let me give you an example: I played with a sample dataset (called gsod) that had 115M rows, and in my experiments I was able to get answers to simple computations like max and avg in less than a couple of seconds, and slightly more complex queries with WHERE, JOIN, and GROUP BY clauses in around 5-6 seconds. Your results may vary depending on the type of query you run, but the BOTTOM LINE is that it is FAST. That's good news!
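
To make that concrete, here is the kind of query I mean. This is just a sketch against the public gsod weather sample; I am writing the table reference and column names (year, mean_temp) from memory, so treat them as assumptions and verify them against the sample's schema in the BigQuery browser before running it.

-- Illustrative only: average and maximum temperature per year on the gsod sample
-- (table and column names assumed; check the sample's schema first)
SELECT year,
       AVG(mean_temp) AS avg_temp,
       MAX(mean_temp) AS max_temp
FROM [publicdata:samples.gsod]
WHERE year >= 2000
GROUP BY year

Nothing exotic here, just ordinary aggregate SQL with a WHERE and a GROUP BY, which is exactly why people from the SQL world feel at home so quickly.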

BigQuery is Fast!

But what bothers me is this: how am I supposed to UPLOAD lots of data to the Google CLOUD? It takes time, right? I guess that's an issue with every cloud-based BigData offering. But here's what I am thinking: if your data is already in the cloud, e.g. Amazon's or Microsoft's, doesn't it make more sense to run analytics on Amazon's or Microsoft's cloud instead of porting your data over to Google's?

[Sidenote: I like that Hadoop on Azure allows Amazon S3 as a data source. Nice move!]

My concern: the time spent uploading a truckload of data to Google's cloud just so we can use it with BigQuery.

And even if you have your data in the GAE datastore, you'll have to upload it to BigQuery separately. Source

Zooming out for a moment, I feel the goal of BigQuery was to offer an easy-to-use BigData platform, and I feel that's what they have delivered:

An easy-to-use + easy-to-set-up "Hadoop + Hive"-like offering.

[Update, Aug 20th 2012: I have been thinking about it more, and I realized that BigQuery is more about satisfying real-time Big Data scenarios, whereas Hadoop/Hive/MapReduce is more about batch-oriented analysis and is great if you need to pre-process tons and tons of data.]

But this "easiness" means that it is NOT as advanced as a full Hadoop installation (or Hadoop on Azure, or Amazon's Elastic MapReduce). Then again, it's easier and faster to get started with BigQuery. It just depends on what you are trying to achieve, and based on that you'll have to figure out which is the right tool for your scenario. No generic answer here, sorry!

And BTW, BigQuery only supports CSV as an input format. Talk about variety (one of the V's of BigData!), but let's not get into that. I just wanted to point it out, because if you're looking to analyze datasets that cannot be converted to CSV for running SQL-like queries on top of them, then BigQuery is not for you.

Conclusion:

Try out BigQuery. It's easy to get started, and it's powerful if SQL-like queries are all you'll need to analyze your data. If you are a BigData enthusiast/expert/student, it'll be a nice exercise to mentally compare other BigData offerings with BigQuery.

If you decide to try BigQuery or have already tried it out, I'd love to hear what you think of it. Please leave a comment!

UPDATE (based on Michael Manoochehri's comment): I didn't mean to imply that it is prohibitively expensive to upload data to BigQuery, because I know it's NOT! Here is the result that Michael Manoochehri shared: "As a test I once ingested about 350 Gb of CSV data (split into 10gb raw files, then I gzipped each one into ~1Gb). I ingested the entire batch using the bq command line tool, and had the entire dataset in BigQuery in just a few hours. I agree that it's not 100% trivial to move 300 Gb of data from a local cluster into Google's cloud – but it's not really that difficult."

[Update, Aug 20th 2012: If you are interested in the mechanics behind BigQuery, search for the "Google Dremel" whitepaper. It's an amazing read.]

 

We now have 3 (three) options to run SQL Server on the CLOUD

Following the announcements at the "Meet Windows Azure" event, we now have three options to run SQL Server on the CLOUD. They are:

1. SQL Azure, which is now called Windows Azure SQL Database

2. SQL Server on Windows Azure VM Roles (Nice addition, in my opinion!)

3. SQL Server on Amazon Web Services RDS

And apart from these options, if you fire up a VM on a cloud and decide to run SQL Server on it, that's also SQL Server on the CLOUD.

Update 22 June 2012:

Naveen commented about running SQL Server on Amazon EC2.

Quick updates from the "Meet Windows Azure" event for Data Professionals

1. SQL Azure Reporting is now generally available and backed by an SLA

2. You can now run SQL Server on VM roles

3. Azure was rebranded a while back, but as a quick reminder: SQL Azure was renamed Windows Azure SQL Database, so in the "new" portal you'll see "SQL Database" instead of "SQL Azure".

I'll blog about these features as and when I get a chance to play with them.

Read all updates here: Now Available: New Services and Enhancements to Windows Azure

And I updated http://parasdoshi.com/whats-new-in-sql-azure/

Get started on Windows Azure: attend the "Meet Windows Azure" event online

On June 7th, 2012, there's an online event called "Meet Windows Azure" where Scott Gu and his Windows Azure team will introduce the Windows Azure platform. You can register here: http://register.meetwindowsazure.com/

If you're planning to attend, there's a very interesting tweet-up planned, called "Social meet up on Twitter for MEET Windows Azure on June 7th". All you have to do is follow #MeetAzure and #WindowsAzure on Twitter and interact. Simple!

There's also an unofficial blog relay: if you write a post, tweet it to @noopman. Here is the blog relay:

Played with Microsoft Research's "Project Daytona" – MapReduce on Windows Azure

Recently, I played with Project Daytona, a MapReduce runtime for Windows Azure.

It seems like a great "data analytics as a service". I tried the k-means and word-count sample applications that come bundled with the project runtime download: http://parasdoshi.visibli.com/share/z14Ty2

The documentation that ships with the project guides you step by step through setting up the environment, but for those who are curious, here is a brief description of how I set it up:

1) Uploaded the sample datasets to Azure Storage

2) Edited the configuration file (ServiceConfiguration.cscfg) to point to the correct Azure Storage account

3) Chose the instance size and the number of instances for the deployment

4) Deployed the binaries to Windows Azure (.cspkg and .cscfg)

5) Ran the Word Count Sample

6) Ran the K-means Sample

Conclusion: It was pretty amazing to run MapReduce on Windows Azure. If you are into BigData, MapReduce, or data analytics, then check out "Project Daytona".

That's about it for this post. And what do you think about Project Daytona – MapReduce on Windows Azure?

One more way to run SQL Server on the cloud: SQL Server on AWS RDS

Up until April 2012, the only way to run SQL Server in the cloud was "SQL Azure". But recently AWS announced SQL Server support on RDS. Good news? Probably – it's always good to have more than one option. So for those who are new to the world of AWS, here are a few tips before you get hands-on:

1) The way RDS works is that you spin up "DB instances", where you specify the machine size that will "power" your database. And remember that the type of instance you choose directly affects your bill.

2) Spend some time understanding the billing structure. Since AWS gives you a lot of options, their billing structure is not simple. Don't get me wrong, I am not saying that having a lot of options in AWS is bad; it's just that the billing is not simple and it's not one-dimensional (there are various dimensions that shape your billing structure). And why should you invest the time? Because in the pay-as-you-go model, those dimensions directly affect your bill (see the rough worked example after this list).

3) Understand costs like the cost to back up the database PLUS the data-transfer cost.

4) Understand the difference between the "bring your own license" and "license included" (Express, Standard, and Web editions only; the Enterprise edition is currently not included) models in RDS SQL Server.

5) And unlike SQL Azure, RDS SQL Server charges on a per-hour basis.
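
To see why I keep harping on the billing dimensions, here is a rough, purely illustrative calculation. The rates below are hypothetical placeholders, not actual AWS prices, so plug in the numbers from the official pricing page before drawing any conclusions:

  Instance hours:      730 hours/month x $0.80/hour = $584
  Provisioned storage: 100 GB x $0.10/GB-month      = $10
  Data transfer out:   50 GB x $0.12/GB             = $6
  Approximate monthly total                         ~ $600

The point is simply that each dimension (instance hours, storage, I/O, backups, data transfer) adds its own line item, so no single number on the pricing page tells you what your bill will be.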

Note the date of this post: 15th May 2012. Things change very fast, so readers-from-the-future, please refer to the official documentation.

BTW, here are a few blog posts from the web-o-sphere:

1. Expanding the Cloud for Windows Developers

2. First Look – SQL Server on Amazon Web Services RDS

3. Official resource: AWS RDS SQL Server

That’s about it for this post.

How do you reduce the network “latency” between application and SQL Azure?

I was at SQL Rally recently (10-11 May 2012), where I happened to have a nice conversation about SQL Azure with a fellow attendee. They were considering porting a database (one that supports one of their apps) to Microsoft's cloud service. One of the concerns they had was: "How do we reduce the network latency between SQL Azure and our app?" Since I knew the answer, I shared it with them, and I am sharing it here so others can benefit too.

Now, one of the first questions I asked the attendee was: are you also porting your app to Azure along with the database?

It turns out they were considering hosting the app on the Azure cloud too. Technically, that's called a "code near" scenario, and in this case the application and the database both *should* reside in the same data center. If you do so, the network latency between your app and the database is minimal.

Now, if your app stays on-premises and you are considering SQL Azure, then select the data-center location that has the minimal network latency between your app and SQL Azure. Technically, that's called a "code far" scenario. I have written about one way you can measure this; here's the post: Testing latency between client and SQL Azure via client statistics in SSMS
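
If you want a quick sense of what that measurement looks like, here is a minimal sketch of the general idea (the linked post has the actual walkthrough; consider this my shorthand version). In SSMS, enable Query > Include Client Statistics (Shift+Alt+S), connect to a SQL Azure server in each candidate data center, and run a trivial batch a few times:

-- The server-side work here is negligible, so the "Wait time on server replies"
-- value on the Client Statistics tab is dominated by the network round trip
-- between your machine and that data center.
SELECT 1 AS ping

Run it several times per data center so a one-off network spike doesn't mislead you, and pick the region with the lowest average.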

That’s about it for this post.

Official Resource: SQL Azure and Data access