Where can we find datasets that we can play with for Business Intelligence, Data Mining, Data Analysis Projects?


Update 1st August: I found this too: UCI MAchine Learning Repository http://archive.ics.uci.edu/ml/

Update 12 Nov 2012: I found this! Link to 400 datasets! http://www.datawrangling.com/some-datasets-available-on-the-web

Update 19 Dec 2012: Lynn Langit has a list here: http://lynnlangit.wordpress.com/public-datasets/


Recently on SQL Server Data Mining Forum, I answered a question about where to find DataSets for Business Intelligence Project.

Apart from Datasets AdventureWorks and Contoso data-sets, there are places where you can download data-sets to play with for your Business Intelligence, Data Mining or Data Analysis Projects.

Here is the List of data-sets that I have collected:

1. KDNuggests: Datasets for Data Mining

2. Quora: Where can I get large datasets open to the public?

3. Windows Azure Data Market

4. National council of Teachers of Mathematics

5. Introduction to Data Science: Data Sets

6. Hilary Mason’s Data-Set Bundle: https://bitly.com/bundles/hmason/1  (Also featured in Quora Link that I shared earlier)

7. And If you can’t find the data-set, ask it here: http://getthedata.org/  

Have I missed anything? Do comment! I’ll add the link with due credit.

SQL Azure’s Journey from Inception in Year 2008 to June 2012!


SQL Azure has been evolving at an amazing pace. Here is a list that summarizes the evolution till June 2012:

SQL Azure, June 2012 (During Meet Windows Azure event)

SQL Azure reporting is now generally available and is backed by SLA

SQL Azure now called Windows Azure SQL Database

Run SQL Server on Azure VM role

SQL Azure, May 2012

SQL Azure now available in one more data-center: “East USA”

SQL Azure Data SYNC, April 2012

Data SYNC Service Update notes

SQL Azure, February 2012:

Updated pricing model

SQL Azure Labs, February 2012:

SQL Azure security services

Microsoft Codename “Trust Services”

SQL Azure Labs, January 2012:

Project codenamed “Cloud Numerics”

Project codenamed “SQL Azure Compatibility Assessment”

Service Update 8, December 2011:

Increased Max size of Database [Previous: 50 GB. Now: 150 GB]

SQL Azure Federations

SQL Azure import/export updated

SQL Azure management portal gets a facelift.

Expanded support for user defined collations

And there are no additional cost when you go above 50 GB. [so cost of 50 GB database = cost of 150 GB DB = ~500$ per month]

SQL Azure Labs, November 2011:

Microsoft project codename “Data Transfer”

upcoming SQL Azure Q4 2011 service release announced!

SQL Azure federations

150 GB database

SQL Azure management portal will get a facelift with metro styled UI among other great additions

Read more

SQL Azure LABS, October 2011:

Data Explorer preview

Social Analytics preview

New registrations for SQL Azure ODATA stopped.

News at SQLPASS, October 2011:

SQL Azure reporting services CTP is now open for all!

SQL Azure DATA SYNC CTP is now open for all!

Upcoming: 150 GB database!

Upcoming: SQL Azure Federations

SQL Server Developer Tools, Codename “Juneau” supports SQL Azure

Following updates where not added in July 2011 and where later added in September 2011:

New SQL Azure Management Portal

Foundational updates for scalability and performance

Service Update 7, July 2011:

Creating SQL Azure firewall rules with IP detect

SQL Azure Co-Administrator support

Enhanced spatial data types support

Service Update 6, May 2011:

Support for Multiple servers per subscription

SQL Azure database management REST API

JDBC driver

Support for Upgrading DAC packages

Service Update 5, October 2010:

Support for SQL Azure sync

Support for SQL Azure reporting

SQL Azure Error messages

Support for sp_tableoption system stored procedure

Service Update 4, August 2010:

Update on project Houston, A silverlight based app to manage SQL Azure databases

Support for TSQL command to COPY databases

Books Online: Added How-To topics for SQL Azure

Service Update 3, June 2010:

Support for database sizes up to 50 GB

Support for spatial datatype

Added East Asia and Western Europe datacenter

Service Update 2, April 2010:

Support for renaming databases

DAC (Data Tier Applications) support added

Support for SQL Server management studio (SSMS) and Visual Studio (VS)

Service Update 1, February 2010:

Support for new DMV’s

Support for Alter Database Editions

Support for longer running Queries

Idle session Time outs increased from 5 min to 30 min

SQL Server Data Services / SQL Data Services Got a new Name: SQL Azure, July 2009

SQL Server Data Services was Announced, April 2008


PowerPivot Model: How to TEST or EDIT existing connections?


Have you ever had someone sent you a PowerPivot model and asked you to do something with it? And if so – may be, you would have to see what data source(s) the model is using and if applicable, you test the existing connections. If you find yourselves in such a situation, this blog post is for you:

1) Open the excel file and switch to PowerPivot Window

2) Now, switch to Design Tab > Click on Existing Connections:

existing connection design tab of powerpivot model

3) Here you’ll find the list of connections under “PowerPivot Data Connections” > Select the connection you wish to TEST or EDIT > Click on EDIT button

4)  Now here you can edit the data source. And if you click on “Test Connection”, you’ll be able to test it too.

edit data source connection power pivot model

5) After you’re done, click on SAVE.

And you’ll now see the “Existing Connections” Box again:

powerpivot select a connection to a data source that contains the data you want to import

Click CLOSE and you’re done, you have successfully edited or tested the existing connection.



PowerPivot Model: Why am I not seeing “Month Names” in correct logical order?


This blog post is for people who have seen the reports built on PowerPivot model where the Month Names are not in correct logical order. So instead of  “January, February, March, April …” (which is correct logical order), the order in the report would be displayed as “April, February, January, March..” (which is NOT correct).

This is what I am talking about:

powerpivot model month name not sorted correctly

Here, Month names are not sorted correctly, right? So how do we solve this? Let’s see this in this blog post!

Understanding the Sample Data-set

Optional: Download a sample data-set to practice what’s described in this blog-post: Download – Paras Doshi Blog’s sample data set

Now the data looks like this:

Date Daily New number Month Month Name
1/1/2012 0:00 0 1 January
1/2/2012 0:00 0 1 January
1/3/2012 0:00 0 1 January
1/4/2012 0:00 0 1 January
1/5/2012 0:00 0 1 January
1/6/2012 0:00 0 1 January
1/7/2012 0:00 0 1 January
….. …… …… ……

Note that “Daily new number” is used for the demo purpose. I had to anonymize the data before I could share it with you all!

Originally, the table had just two columns but since I wanted to add “time Intelligence” to the model. I added two columns Month and Month Name.

Now here’s the DAX behind these two columns:

Column DAX
Month MONTH(‘Sample Table'[Date])
Month Name RELATED(Month[MonthName])

There’s a relationship between Month Column of “Sample Table” and Month column of “Month” table

And here’s what Month Table looks like:

Month MonthName
1 January
2 February
3 March
4 April
5 May
6 June
7 July
8 August
9 September
10 October
11 November
12 December

Creating a Report on Top of Sample Data-set.

Now if you create the report of Month Name vs Daily New Number, it will look like:

powerpivot model month name not sorted correctly

Oops, Problem detected: Month name are not in correct order.

So now let’s solve it.

Solution to sort month name correctly

Let’s solve this issue in our PowerPivot Model. Here’s what you do:

1. Go to PowerPivot model, Select “Month Name” column from the sample table

2 Toolbar > Home > Sort By column > click on “Sort by column”

3. And set the properties in the dialog box as:

powerpivot model sort by column dialog box

Now, go back to the report and refresh the connection to the model.


powerpivot model month name are sorted correctly

This is so because by default since the month name is “Text” – it was sorted in A to Z format. But you saw how we can use the sort by column property in PowerPivot model to fix this issue.

That’s about it for the post! your feedback is welcome!

Want to Read More? Here are few links:

Back to basics: What is the difference between Data Analysis and Data Mining?


What is the difference between Data Analaysis and Data Mining:

1) One view is that: Data Mining is one particular form of Data Analysis.

difference between data mining and data analysis

One of the reason I researched about the difference between Data Analysis and Data Mining because I find that the terms are used Interchangeably and now I know why. It’s because Data Mining is considered as a particular form of Data Analysis.

2) I found other view that says:

Data Analysis is meant to support decision-making, support conclusions & Highlight note-worthy information. So when “Analyzing data” – we know what we want; we want answers to support our hypothesis; we want data in summarized form to highlight useful information.


Data Mining is meant for “Knowledge discovery” and “predictions”. So when “Mining data” – we look for undefined insights; We want the data to tell us something we didn’t knew before; We want to find patterns in the data that we had not anticipated.







Data Mining: Classification VS Clustering (cluster analysis)


For someone who is new to Data mining, classification and clustering can seem similar because both data mining algorithms essentially “divide” the datasets into sub-datasets; But there is difference between them and this blog-post, we’ll see exactly that:

  • We have a Training set containing data that have been previously categorized
  • Based on this training set, the algorithms finds the category that the new data points belong to
  • We do not know the characteristics of similarity of data in advance
  • Using statistical concepts, we split the datasets into sub-datasets such that the Sub-datasets have “Similar” data
Since a Training set exists, we describe this technique as Supervised learning Since Training set is not used, we describe this technique as Unsupervised learning
Example:We use training dataset which categorized customers that have churned. Now based on this training set, we can classify whether a customer will churn or not. Example:We use a dataset of customers and split them into sub-datasets of customers with “similar” characteristics. Now this information can be used to market a product to a specific segment of customers that has been identified by clustering algorithm

If you want to learn about Data Mining, check out the “free Book in PDF format: Mining the massive data-sets”.

How to Solve Error: After Reboot, SharePoint sites says “Service Unavailable”


I had to reboot my Dev Machine on which I have my SQL Server 2012 Business Intelligence Setup. After reboot, It turns out that when I wanted to open the SharePoint Site (for Power View related work) – it gives me error:

“Service Unavailable

HTTP Error 503. The Service is unavailable”

Now, I did quick searches found this and this

Turns out, for my scenario – the default application pool had stopped:

default application pool stopped

So this is what I did to solve the error:

1) Verified that the account and password for the App Pool is correct

2) Started it.

This is how you do it:

Start > IIS Manager > Application pools > Locate Default Application Pool that stopped > Verify that Account and Password are correct. > Start it.

After doing that, the SharePoint site started working again!

Related articles: