How many websites in the USA exceed the data collection limits of Google Analytics?


A little bit of background:

– I was researching the limitations of Google Analytics

– After reading the limitations, I wanted to know: how many websites in the USA exceed the limitations of Google Analytics?

So Here’s the Short Answer:

Only 108 sites exceed this limitation

(as of today)

And Here’s the long answer:

Limitations of Google Analytics. Here’s the URL:

And I am quoting from the above URL:

Data Collection limit: You should not send more than 10 million hits per month. If you exceed this limit, there is no assurance that the excess hits will be processed.
Data Freshness limit: Sending more than 200,000 visits per day to Google Analytics will result in your reports being refreshed only once per day

To take it further, I wanted to know how many websites in the USA get more than 10 million hits per month. It turns out only 108 websites in the US get that much traffic.

So from a data collection limit standpoint, only these 100-odd sites would exceed the limitations of Google Analytics.

To put things in perspective, MySpace does not exceed the Google Analytics data collection limit:

[Image: MySpace can use Google Analytics]


Knowing the data collection limit on its own was not that interesting, but once I combined it with data from another source, it became very interesting to me! Anyhoo, in this post I shared:

> Limitations of Google Analytics

> An answer to: how many websites in the USA exceed the limitations of Google Analytics?

[UPDATE Feb 10th 2013] I made a mistake in correlating data from Quantcast and Google Analytics. Lesson learned: double-check the units when comparing data from two different sources.

Florin Dumitrescu pointed out that Quantcast uses people/month while Google uses hits/month, and the two may NOT always be the same. Sorry about this.
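
To make the units lesson concrete, here is a minimal Python sketch. The site numbers and the hits-per-person factor are hypothetical (they are not from Quantcast or Google Analytics); only the 10 million hits per month limit comes from the quote above. It shows how the same site can look fine when you compare people/month against the limit, yet exceed it once people are converted to hits.

```python
# Limit quoted above from the Google Analytics limitations page.
GA_MONTHLY_HIT_LIMIT = 10_000_000


def estimated_monthly_hits(people_per_month: int, hits_per_person: float) -> float:
    """Convert an audience count (people/month) into an estimated hit count (hits/month)."""
    return people_per_month * hits_per_person


def exceeds_collection_limit(monthly_hits: float) -> bool:
    """True if the given hits/month figure is above the data collection limit."""
    return monthly_hits > GA_MONTHLY_HIT_LIMIT


# Hypothetical site: 4 million people per month, each generating an assumed 5 hits.
people = 4_000_000
hits = estimated_monthly_hits(people, hits_per_person=5)

# Wrong comparison: people/month against a hits/month limit.
print(exceeds_collection_limit(people))  # False
# Right comparison: hits/month against a hits/month limit.
print(exceeds_collection_limit(hits))    # True
```

The real hits-per-person ratio varies from site to site, so the factor of 5 is only there to show why the two units cannot be compared directly.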

Back to basics: Data Mining and Knowledge Discovery Process


Once in a while I go back to basics to revisit some of the fundamental technology concepts that I’ve learned over the past few years. Today, I want to revisit the Data Mining and Knowledge Discovery Process:

Here are the steps:

1) Raw Data

2) Data Preprocessing (cleaning, sampling, transformation, integration, etc.)

3) Modeling (Building a Data Mining Model)

4) Testing the Model, a.k.a. assessing the Model

5) Knowledge Discovery
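
The five steps above can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the data points are made up, the "model" is a straight line fitted by least squares, and the test is a single held-out record. Real projects would use far more data and proper tooling.

```python
from statistics import mean

# 1) Raw data: hypothetical (x, y) observations; None marks a missing value.
raw = [(1, 2.1), (2, 3.9), (3, None), (4, 8.2), (5, 9.8), (6, 12.1)]

# 2) Data preprocessing: clean the data by dropping incomplete records.
clean = [(x, y) for x, y in raw if y is not None]

# Hold out the last record so the model can be tested on data it has not seen.
train, test = clean[:-1], clean[-1:]


# 3) Modeling: fit y = a * x + b by ordinary least squares.
def fit_line(points):
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x_bar, y_bar = mean(xs), mean(ys)
    a = sum((x - x_bar) * (y - y_bar) for x, y in points) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    b = y_bar - a * x_bar
    return a, b


a, b = fit_line(train)


# 4) Testing the model: mean absolute error on the held-out data.
def mean_abs_error(points, slope, intercept):
    return mean(abs(y - (slope * x + intercept)) for x, y in points)


error = mean_abs_error(test, a, b)

# 5) Knowledge discovery: state the pattern that was found in plain terms.
print(f"y grows by about {a:.2f} per unit of x (held-out error {error:.2f})")
```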

Here is the visualization:

[Image: knowledge discovery process in data mining]

Additional Note:

In the world of Data Mining and Knowledge Discovery, we’re looking for a specific type of intelligence from the data: patterns. This is important because patterns tend to repeat, so if we find patterns in our data, we can predict/forecast that similar things may happen in the future.


In this blog post, we saw the Knowledge Discovery and Data Mining process.

Things I shared on Social Media Networks during Nov 12 – Dec 31 (2012)


Big Data: The Coming Sensor Data Driven Productivity Revolution

Check out some nice getting started tutorials at beyondrelational site:

Complexity is your enemy. Any fool can make something complicated. It is hard to make something simple – Richard Branson

— via Paras Doshi – Blog

“The success of companies like Google, Facebook, Amazon, and Netflix, not to mention Wall Street firms and industries from manufacturing to retail and healthcare, is increasingly driven by better tools for extracting meaning from very large quantities of data,” says Tim O’Reilly

— via Paras Doshi – Blog

Nice collection of about 20+ videos around the topic of “Data Science”:

Nice collection of videos by Berkeley school of information: #Information #Data

Just found Facebook’s data team’s page:

via V Talk Tech – A Parth Acharya Blog – Nice HeatMap of stocks!

What’s the biggest fear about cloud computing? via Windows Azure

Resource: Presentations from the Sentiment Analysis Symposium

If I switched to the newest “holiday” theme on WordPress, this is how it would look:

Nice! Code School now has R programming language! I have been playing with R for a while now and definitely want to learn more – here’s the link to learn R:

Interesting tool from Google to optimize and analyze web page speeds:

Performed #sentiment #Analysis on #starbucks twitter data using #R ! It was fun!

In 2002, The Data Warehousing Institute estimated that data quality problems cost U.S. businesses more than $600 billion a year. And of course, over the past 10 years, this number has probably grown.

Reading: Business Analytics vs Business Intelligence?

Big data is a nickname for the recent increase in largely external and unstructured business and consumer information. How are businesses across industries harnessing traditional enterprise information management functions and systems to translate big data into useful business intelligence?

For business analytics professionals: 12 webcasts on Jan 30th 2013 #sqlpass #analytics #24hop

Some nice insights about how to build an Internet platform, from the founder of Zipcar:

Let’s connect and converse on any of these people networks!

Paras Doshi on Facebook, Twitter, Google Plus and LinkedIn

Seven Interesting Google Projects that a Data Professional may not have heard about:


Here’s the list:

1. Google Refine

2. Google Prediction API

3. Google Trends

4. Google Chart Tools

5. Google BigQuery

6. Google Correlate

7. Google Fusion Tables

Note: These projects may not be ready to be used in your production environment, as some of them are in Beta/Experimental stages and their support/development may be discontinued in the future.

Thanks: I thought of writing this blog post after a discussion I had with Parth Acharya about Google and its projects for Data Professionals. He pointed me to some of the most interesting samples that used Google Fusion Tables, and here’s one of his blog posts on a related topic: Google Fusion Table & Data Visualization

There’s been a growing interest in Hadoop & Big Data, Here’s the Proof:


I like to keep an eye on technology trends. One of the ways I do that is by subscribing to leading magazines. I may not always read the entire article, but I definitely read the headlines to see what the industry is talking about. During the last 12 months or so I have seen a lot of buzz around Big Data, and I thought to myself: it would be nice to see a trend line for Big Data. Taking it a step further, I was also interested in seeing whether there is a correlation between the growing trends in “Hadoop” and “Big Data”. I also wanted to see how they compare with terms like Business Intelligence and Data Science. With this, I turned to Google Trends to quickly create a trend report.

Here’s the report:

[Image: Google Trends report for Big Data, Hadoop and Business Intelligence]

Here are some observations:

1) There’s a correlation between Trend of Big Data and Hadoop. In fact, it looks like growing interest in Hadoop fueled interest in “Big Data”.

2) The trend lines of Big Data and Hadoop overtook that of Business Intelligence in Oct 2012 and Sep 2012, respectively.

3) The trend line of Business Intelligence is declining.

4) There seems to be a steady increase in Trend line for Business Analytics and Data Science.
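
If you want to put a number on observation 1, one option is the Pearson correlation coefficient between the two trend lines. Here is a minimal Python sketch. The two series below are hypothetical interest scores that I made up for illustration; they are NOT the actual values from the Google Trends report.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of the same length (at least 2 points)")
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / (spread_x * spread_y)


# Hypothetical interest scores over six periods (illustrative only).
big_data = [10, 14, 19, 27, 38, 52]
hadoop = [20, 23, 29, 35, 44, 55]

r = pearson(big_data, hadoop)
print(f"correlation between the two trend lines: {r:.3f}")
```

A value close to 1 means the two lines rise and fall together. It does not by itself show that interest in one fueled interest in the other.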

And here’s the Google Trends report URL:

What do you think about these trends?

Two ideas to make your social network activities “Searchable”:


Some time back, I wanted to find one of my own social network posts. It was a resource I had shared, and somehow I was not able to “google” it (again). I eventually found it, but it took me 15-odd minutes of scrolling down my Twitter feed. It was NOT fun! And I thought to myself: there’s got to be a better way! It would be great to solve this not just for Twitter but for all my social network activities, including LinkedIn, Facebook Pages and Google+. So here are a couple of things that are working for me; I hope they help someone out there too:

Now, before we begin: when I say “Searchable”, I mean searchable by YOU (or a human being) and not necessarily by search engines. But it turns out that both my ideas also increase the chances of your social media activities getting indexed! With that, here are the ideas:

1) Syndicate your Social Network Activities (Posts/Images/Updates) to Tumblr/Blogger

I use IFTTT to syndicate my Twitter, Facebook and LinkedIn activities to Blogger

2) Create a post about your social network activities on your blog:

Here’s an Example: Things I shared on Social Media Networks during Oct 19 – Nov 11

Idea #2’s main goal is to keep my blog readers updated about my social network activities, but it also acts as a good way to make my social media posts “searchable”.

And remember I said earlier that the chances of your social network posts getting indexed by search engines increase? That’s because posts on WordPress, Tumblr and Blogger are accessible to Google (unless you choose to block it). So that’s about it for this post. If you like the idea(s), please let me know! And if you have other ideas, let me know too. I am always looking for ways to make my social media activities easily searchable for me as well as for anyone else.


I played with the Twitter Firehose for a couple of hours, and here’s how you can too:


First up: what’s the Twitter Firehose?

It’s a real-time stream of tweets! I had pointed out in an earlier post that Twitter gets 340 million tweets per day!

[Image: Twitter Firehose, 340 million tweets per day] Image courtesy

Why did I want access to the Firehose?


I had heard it’s expensive. Is it?

For an Individual: Absolutely! For companies: Not if they know how to create business value out of it.

Note the words “couple of hours” in the title. I’ll Explain that part later.

How did you get access?

Via DataSift. They had a free trial with a $10 credit, and I tried that. Check them out if you want to play with the Twitter Firehose. It’s fun!

What did I do with it?

I collected 15,000 tweets containing the words “Google” OR “Microsoft” over a period of 2 hours.

Total cost for me: $3-4

Note: I added the cost just so that you get a general Idea. Look at the pricing page of DataSift for more details.
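
To give a feel for what a “Google” OR “Microsoft” filter does, here is a minimal Python sketch that applies the same idea to a small list of made-up tweets. Please note the assumptions: this is plain local Python and not DataSift’s own filtering syntax, and the sample tweets are hypothetical.

```python
import re

# Keywords from the experiment above; a tweet is kept if it contains either one.
KEYWORDS = ("google", "microsoft")

# Match whole words only, ignoring case.
PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(word) for word in KEYWORDS) + r")\b",
    re.IGNORECASE,
)


def matches(text: str) -> bool:
    """True if the text contains at least one of the keywords as a whole word."""
    return PATTERN.search(text) is not None


# Hypothetical tweets standing in for a real stream.
sample_tweets = [
    "Google just updated its Trends tool",
    "Trying out MICROSOFT Azure today",
    "Coffee time!",
    "I googled it yesterday",
]

kept = [tweet for tweet in sample_tweets if matches(tweet)]
print(f"kept {len(kept)} of {len(sample_tweets)} tweets")
```

Because the pattern matches whole words only, “googled” is not counted as a match for “Google”. Whether you want that behaviour depends on what you plan to analyze.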

Are there other Twitter data resellers?

Yes. As of now, it’s DataSift, GNIP and Topsy. Search for “Twitter Certified Data Reseller Products” to find the list. I was able to find a free trial from DataSift, and that’s why I tried DataSift.

If I just want to play with Twitter Data, what are the alternatives?

You can work with their streaming API, which gives 1% of tweets. You can find an example here: Grab Twitter search data using R and export to a tab delimited file


In this post, I discussed how you can try the Twitter Firehose. I also pointed you to the alternative of using the streaming API, which gives 1% of tweets. I hope that helps.