First Impression: Google’s BigData offering called BigQuery



As part of an assignment for the University of Washington’s (UW) cloud class, I played with BigQuery, Google’s BigData offering, and I am writing this blog post to share what I think about it. Please note that the views are my own and do not represent those of the instructor or my fellow students at UW. Also, I am not a BigData “expert”; think of me as a student trying to get my head around the various offerings out there. So if you disagree with what I have written, just let me know in the comments section. Anyhow, read along to find out what I think of BigQuery:

First up what is BigQuery?

It’s a platform to analyze your data (lots of it) by running SQL-like queries. And it really is SQL-like, so if you come from the SQL world like me, you should have no trouble getting up and running in no time by referring to the nicely written documentation.

Another point to consider is that even though it’s SQL-like, you’ll be able to analyze a considerable number of rows in a few seconds. Let me give you an example: I played with a sample dataset (called gsod) which had 115M rows, and in my experiments I was able to get answers to simple aggregates like max and mean (avg) in less than a couple of seconds. Slightly more complex queries with where clauses, joins and group by took around 5-6 seconds. Your results may vary depending on the type of query you run, but the BOTTOM LINE is that it is FAST. That’s good news!
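To give a feel for the kind of SQL-like queries I mean, here is a minimal sketch. Note that it does NOT talk to BigQuery: it uses Python’s built-in sqlite3 module as a stand-in, and the table and column names (gsod_sample, station, year, mean_temp) are made up for illustration. The shape of the queries is the point, not the engine or the timings.

```python
import sqlite3

# In-memory stand-in table. The names gsod_sample, station, year and
# mean_temp are hypothetical, not the real gsod schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE gsod_sample (station TEXT, year INTEGER, mean_temp REAL)"
)
rows = [
    ("A", 2008, 50.0),
    ("A", 2009, 54.0),
    ("B", 2008, 70.0),
    ("B", 2009, 74.0),
]
conn.executemany("INSERT INTO gsod_sample VALUES (?, ?, ?)", rows)

# Simple aggregates over the whole table.
overall = conn.execute(
    "SELECT MAX(mean_temp), AVG(mean_temp) FROM gsod_sample"
).fetchone()

# A slightly more complex query with WHERE and GROUP BY.
per_station = conn.execute(
    "SELECT station, AVG(mean_temp) FROM gsod_sample "
    "WHERE year >= 2008 GROUP BY station ORDER BY station"
).fetchall()

print(overall)
print(per_station)
```

On four rows this is instant anywhere, of course. What impressed me was getting the same kind of answer over 115M rows in seconds.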

BigQuery is Fast!

But what bothers me is this: how am I supposed to “UPLOAD” lots of data to Google’s cloud? It takes time, right? I guess that’s an issue with every cloud-based BigData offering. But here’s what I am thinking: if your data is already in a cloud, e.g. Amazon’s or Microsoft’s, doesn’t it make more sense to run analytics on Amazon’s or Microsoft’s cloud instead of porting your data to Google’s?

[Sidenote: I like that Hadoop on Azure allows Amazon S3 as a data source. Nice move!]

My concern: the time spent uploading a truckload of data to Google’s cloud just so that we can use it with BigQuery.

And even if you have your data in the GAE datastore, you’ll have to upload your data to BigQuery separately. Source

Zooming out for a moment, I feel the goal of BigQuery was to offer an easy-to-use BigData platform, and I feel that’s what they have delivered:

An easy-to-use + easy-to-set-up “Hadoop+Hive”-like offering.

[Update: Aug 20th 2012: I have been thinking about it more and I realized that BigQuery is more about satisfying real-time Big Data scenarios, while Hadoop/Hive/MapReduce is more about batch-oriented analysis, which is great if you need to pre-process tons and tons of data.]

But this “easiness” means that it is NOT as advanced as a Hadoop installation (or Hadoop on Azure or Amazon’s Elastic MapReduce). Then again, it’s easier and faster to get started with BigQuery. I guess it just depends on what you are trying to achieve, and based on that you’ll have to figure out which is the right tool for your scenario. No generic answer here, sorry!

And BTW, BigQuery supports only CSV. Talk about Variety (one of the V’s of BigData!). Let’s not get into that. I just wanted to point it out because if you’re looking to analyze datasets that cannot be converted to CSV for running SQL-like queries on top of them, then BigQuery is not for you.


Try out BigQuery. It’s easy to get started. It’s powerful if SQL-like queries are all you need to analyze your data. If you are a BigData enthusiast/expert/student, it’ll be a nice exercise to mentally compare other BigData offerings with BigQuery.

If you decide to try BigQuery or have already tried it out, I’d love to hear what you think of it. Please leave a comment!

UPDATE (based on Michael Manoochehri’s comment): I didn’t mean to imply that it is prohibitively expensive to upload data to BigQuery, because I know it’s NOT! Here is the result that Michael Manoochehri shared: “As a test I once ingested about 350 Gb of CSV data (split into 10gb raw files, then I gzipped each one into ~1Gb). I ingested the entire batch using the bq command line tool, and had the entire dataset in BigQuery in just a few hours. I agree that it’s not 100% trivial to move 300 Gb of data from a local cluster into Google’s cloud – but it’s not really that difficult.”
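For the curious, the “split into smaller files, then gzip each one” step that Michael describes could look roughly like the sketch below. This is my own illustration in Python, not the procedure he actually used: the function name split_and_gzip, the chunk file naming, and the choice to repeat the header row in every chunk are all my assumptions. It also splits on line boundaries, so it would break CSV records that contain embedded newlines. The demo at the bottom runs on a tiny temporary file.

```python
import gzip
import tempfile
from pathlib import Path


def split_and_gzip(csv_path, out_dir, lines_per_chunk):
    """Split a CSV into gzipped chunks, repeating the header in each chunk.

    Hypothetical helper for illustration only. Splits on line boundaries,
    so quoted fields containing newlines are not handled.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    chunk_paths = []
    with open(csv_path, "r", newline="") as src:
        header = src.readline()
        out = None
        count = 0
        for line in src:
            # Start a new chunk on the first row or when the current one is full.
            if out is None or count >= lines_per_chunk:
                if out is not None:
                    out.close()
                path = out_dir / f"chunk_{len(chunk_paths):04d}.csv.gz"
                out = gzip.open(path, "wt", newline="")
                out.write(header)
                chunk_paths.append(path)
                count = 0
            out.write(line)
            count += 1
        if out is not None:
            out.close()
    return chunk_paths


# Tiny demo: 5 data rows, 2 rows per chunk.
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "data.csv"
    source.write_text("id,value\n1,a\n2,b\n3,c\n4,d\n5,e\n")
    chunks = split_and_gzip(source, Path(tmp) / "out", lines_per_chunk=2)
    chunk_names = [p.name for p in chunks]
    with gzip.open(chunks[-1], "rt") as f:
        last_chunk_text = f.read()

print(chunk_names)
```

For real data you would split by size rather than by a fixed number of rows, and the compressed chunks would then be handed to whatever upload tool you use.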

[Update: Aug 20th 2012: If you are interested in the mechanics behind BigQuery, search for “Google Dremel Whitepaper”. It’s an amazing read.]


Seven software delivery models and why software as a service (SaaS) is the future!


This is a reflection piece I thought of writing after watching a demo lecture from a cloud computing course offered at Stanford by professor Timothy Chou. URL of the demo lecture is here: Now, I would suggest you spare 60 minutes and watch the video, but if you do not want to, you can read this blog post. In it I aim to summarize the lecture, which was pretty much about seven software delivery models, and show why SaaS (software as a service) makes sense and why it’s our future. So here we go!

There are seven software delivery models:

1. Software licensing model

2. Open source model

3. Outsourcing model

4. Hybrid Model

5. Hybrid+ Model

6. SaaS (software as a service) Model

7. Internet Model

To truly understand the models in detail, I urge you to watch the video. Here I am just going to reflect upon what I heard, so there might be gaps in what I have to say, or you may not understand it completely since the discussion here is brief. Anyhow, here is my reflection on each software delivery model after watching the video:

(1) Software license model: This has been the traditional way in which software companies made money. They develop software, and if you wish to use it, you buy a license. Generally, if 10 users are going to use the software, you buy ten licenses. This is a one-time cost. Optionally, you also pay a support charge, generally on a yearly basis. Support may include help with installation problems, figuring out a configuration problem, installing updates/patches, etc. On top of that, significant money is poured into “managing” the software. Managing software requires manpower, which is not cheap. Management includes making sure that the software is up and running (ideally) 24×7, making sure it delivers the required performance, keeping it secure, taking backups, etc. And it costs money. Lots of it!

E.g. Oracle Database: you buy the Oracle database software, you pay for support, and you invest in managing it. So that was the software license model for you. Now let’s move on to the next model.

(2) Open source model: Here the software is “free”, but you need money to manage it. Optionally, you can opt in to receive support.

(3) Outsourcing model: In this model, the software is bought for a one-time fee, and optionally a support cost is paid year by year. But the management work is outsourced; in other words, the management side of the software is handled by some other company. So how does it help? And more importantly, how does it save money? As discussed earlier, managing software requires manpower, which is not cheap. Say a company in the USA hires a software administrator to manage the software for them. He would be paid $100k per year to do so. Now what if we could bring this cost down? This is exactly what companies that accept outsourcing work do. They hire an individual in a country where system administrators are paid $50k per year (or even less!) and thus bring the cost down. Obviously it is not as simple as I said, but you get my point, don’t you!?

E.g. IBM accepts outsourcing work. Infosys (an Indian IT giant) accepts outsourcing work. And there is a long list of companies that do this work.

(4) Hybrid Model: Here the company that sells the software also manages it for the client. Yes! They do that. The client may not want to have the data center on their site, so the software company sets up the data center for the client and manages it too. The client may also choose to have the data center on their site while the software is still managed by the company that sold it. It saves money. How, you may ask?! For this I’ll quote the professor: “standardize, specialize and repeat”. If you do not understand what that means, watch the video!

(5) Hybrid+ Model: It is the hybrid model with a simplified cost structure. Just compare the values for the hybrid and hybrid+ models and I believe you are smart enough to understand the difference! I believe that all readers of my blog are smart!

(6) SaaS: This is an extension of the hybrid+ model in which the client does not get to choose where to run the software. They just get the software via the Internet, no questions asked!! Every other detail is managed by the software vendor. All the customer needs to do is sign up for the service and start consuming it via the Internet. End of story!

Now, why is SaaS our future? Well, as you can see from the tables themselves, it lets companies save money. Secondly, it lets companies focus on what they do best and not worry much about the compute infrastructure they may need to support their business. For a mid-size firm, SaaS does not involve upfront investment. Yes! You subscribe to a piece of software for a month, and if you do not like it, you can simply say bye-bye to the SaaS vendor or switch vendors. I do realize that it’s not smooth, but at least it’s smoother than with the traditional models. Moreover, we see SaaS vendors that specialize in a particular kind of software. So there may exist a SaaS vendor who specializes in CRM for hospitals. Yes! CRMs only for hospitals. So hospitals that are looking to automate their processes can just give it a spin. End of story! The beauty of it all is that it simplifies the process and is more effective.

E.g. just search for “software as a service” and you will find a list of companies that do this today. The video also has a slide listing companies that are into SaaS, so watch the video!

(7) Internet Model: Instead of trying to write about it, I am just going to say “Google”. Do you pay Google to use the search engine? No, right?! Is Google a not-for-profit organization? No! Then how do they make money? The answer is advertisements. Internet services (or should I call them businesses) like Facebook, Google, Gmail, etc. rely on advertising, and thus the cost of the software is not charged to the user.

Now, if you love seeing things from a money perspective, here it is:

For first four Models:

| | Software License | Open source | Outsourcing | Hybrid |
|---|---|---|---|---|
| Software | $5000/user/year (One Time) | $0/user/year | $5000/user/year (One Time) | $5000/user/year (One Time) |

For remaining three models:

| | Hybrid+ | SaaS | Internet |
|---|---|---|---|
| Software + Support + Management | $300/user/Month | $150/user/Month | Advertisement. No cost from the end-user perspective. |

Disclaimer: the values entered for each model are just for the sake of comparing the different models and may not necessarily hold true for all real-world scenarios.
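If you want to play with the numbers, here is a small sketch of the arithmetic for the two all-in-one monthly prices. I am assuming the $300 figure belongs to Hybrid+ and the $150 figure to SaaS, following the order in which the models are listed. The 10-user company and the function name annual_cost are made up for illustration, and the same disclaimer applies: these are comparison values, not real prices.

```python
# Illustrative per-user monthly prices taken from the comparison above.
# Mapping $300 to Hybrid+ and $150 to SaaS is an assumption based on
# the order in which the models are listed.
MONTHLY_COST_PER_USER = {"Hybrid+": 300, "SaaS": 150}


def annual_cost(model, users):
    """Yearly cost in dollars for the given model and number of users."""
    return MONTHLY_COST_PER_USER[model] * 12 * users


users = 10  # hypothetical company size
hybrid_plus_yearly = annual_cost("Hybrid+", users)
saas_yearly = annual_cost("SaaS", users)
savings = hybrid_plus_yearly - saas_yearly

print(f"Hybrid+: ${hybrid_plus_yearly}/year")
print(f"SaaS:    ${saas_yearly}/year")
print(f"Saved:   ${savings}/year")
```

With these made-up values, ten users cost $36,000 a year on Hybrid+ and $18,000 on SaaS, so the SaaS price is half.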

So yes. That’s it. This was my reflection piece. Do post your comments/feedback/suggestions. And yes, watch the video!