Where can we find datasets that we can play with for Business Intelligence, Data Mining, Data Analysis Projects?

Standard

Update 1st August: I found this too: UCI MAchine Learning Repository http://archive.ics.uci.edu/ml/

Update 12 Nov 2012: I found this! Link to 400 datasets! http://www.datawrangling.com/some-datasets-available-on-the-web

Update 19 Dec 2012: Lynn Langit has a list here: http://lynnlangit.wordpress.com/public-datasets/

——————————————-

Recently on SQL Server Data Mining Forum, I answered a question about where to find DataSets for Business Intelligence Project.

Apart from Datasets AdventureWorks and Contoso data-sets, there are places where you can download data-sets to play with for your Business Intelligence, Data Mining or Data Analysis Projects.

Here is the List of data-sets that I have collected:

1. KDNuggests: Datasets for Data Mining

2. Quora: Where can I get large datasets open to the public?

3. Windows Azure Data Market

4. National council of Teachers of Mathematics

5. Introduction to Data Science: Data Sets

6. Hilary Mason’s Data-Set Bundle: https://bitly.com/bundles/hmason/1  (Also featured in Quora Link that I shared earlier)

7. And If you can’t find the data-set, ask it here: http://getthedata.org/  

Have I missed anything? Do comment! I’ll add the link with due credit.

SQL Azure’s Journey from Inception in Year 2008 to June 2012!

Standard

SQL Azure has been evolving at an amazing pace. Here is a list that summarizes the evolution till June 2012:

SQL Azure, June 2012 (During Meet Windows Azure event)

SQL Azure reporting is now generally available and is backed by SLA

SQL Azure now called Windows Azure SQL Database

Run SQL Server on Azure VM role

SQL Azure, May 2012

SQL Azure now available in one more data-center: “East USA”

SQL Azure Data SYNC, April 2012

Data SYNC Service Update notes

SQL Azure, February 2012:

Updated pricing model

SQL Azure Labs, February 2012:

SQL Azure security services

Microsoft Codename “Trust Services”

SQL Azure Labs, January 2012:

Project codenamed “Cloud Numerics”

Project codenamed “SQL Azure Compatibility Assessment”

Service Update 8, December 2011:

Increased Max size of Database [Previous: 50 GB. Now: 150 GB]

SQL Azure Federations

SQL Azure import/export updated

SQL Azure management portal gets a facelift.

Expanded support for user defined collations

And there are no additional cost when you go above 50 GB. [so cost of 50 GB database = cost of 150 GB DB = ~500$ per month]

SQL Azure Labs, November 2011:

Microsoft project codename “Data Transfer”

upcoming SQL Azure Q4 2011 service release announced!

SQL Azure federations

150 GB database

SQL Azure management portal will get a facelift with metro styled UI among other great additions

Read more

SQL Azure LABS, October 2011:

Data Explorer preview

Social Analytics preview

New registrations for SQL Azure ODATA stopped.

News at SQLPASS, October 2011:

SQL Azure reporting services CTP is now open for all!

SQL Azure DATA SYNC CTP is now open for all!

Upcoming: 150 GB database!

Upcoming: SQL Azure Federations

SQL Server Developer Tools, Codename “Juneau” supports SQL Azure

Following updates where not added in July 2011 and where later added in September 2011:

New SQL Azure Management Portal

Foundational updates for scalability and performance

Service Update 7, July 2011:

Creating SQL Azure firewall rules with IP detect

SQL Azure Co-Administrator support

Enhanced spatial data types support

Service Update 6, May 2011:

Support for Multiple servers per subscription

SQL Azure database management REST API

JDBC driver

Support for Upgrading DAC packages

Service Update 5, October 2010:

Support for SQL Azure sync

Support for SQL Azure reporting

SQL Azure Error messages

Support for sp_tableoption system stored procedure

Service Update 4, August 2010:

Update on project Houston, A silverlight based app to manage SQL Azure databases

Support for TSQL command to COPY databases

Books Online: Added How-To topics for SQL Azure

Service Update 3, June 2010:

Support for database sizes up to 50 GB

Support for spatial datatype

Added East Asia and Western Europe datacenter

Service Update 2, April 2010:

Support for renaming databases

DAC (Data Tier Applications) support added

Support for SQL Server management studio (SSMS) and Visual Studio (VS)

Service Update 1, February 2010:

Support for new DMV’s

Support for Alter Database Editions

Support for longer running Queries

Idle session Time outs increased from 5 min to 30 min

SQL Server Data Services / SQL Data Services Got a new Name: SQL Azure, July 2009

SQL Server Data Services was Announced, April 2008

 

PowerPivot Model: How to TEST or EDIT existing connections?

Standard

Have you ever had someone sent you a PowerPivot model and asked you to do something with it? And if so – may be, you would have to see what data source(s) the model is using and if applicable, you test the existing connections. If you find yourselves in such a situation, this blog post is for you:

1) Open the excel file and switch to PowerPivot Window

2) Now, switch to Design Tab > Click on Existing Connections:

existing connection design tab of powerpivot model

3) Here you’ll find the list of connections under “PowerPivot Data Connections” > Select the connection you wish to TEST or EDIT > Click on EDIT button

4)  Now here you can edit the data source. And if you click on “Test Connection”, you’ll be able to test it too.

edit data source connection power pivot model

5) After you’re done, click on SAVE.

And you’ll now see the “Existing Connections” Box again:

powerpivot select a connection to a data source that contains the data you want to import

Click CLOSE and you’re done, you have successfully edited or tested the existing connection.

 

 

PowerPivot Model: Why am I not seeing “Month Names” in correct logical order?

Standard

This blog post is for people who have seen the reports built on PowerPivot model where the Month Names are not in correct logical order. So instead of  “January, February, March, April …” (which is correct logical order), the order in the report would be displayed as “April, February, January, March..” (which is NOT correct).

This is what I am talking about:

powerpivot model month name not sorted correctly

Here, Month names are not sorted correctly, right? So how do we solve this? Let’s see this in this blog post!

Understanding the Sample Data-set

Optional: Download a sample data-set to practice what’s described in this blog-post: Download – Paras Doshi Blog’s sample data set

Now the data looks like this:

DateDaily New numberMonthMonth Name
1/1/2012 0:0001January
1/2/2012 0:0001January
1/3/2012 0:0001January
1/4/2012 0:0001January
1/5/2012 0:0001January
1/6/2012 0:0001January
1/7/2012 0:0001January
…..………………

Note that “Daily new number” is used for the demo purpose. I had to anonymize the data before I could share it with you all!

Originally, the table had just two columns but since I wanted to add “time Intelligence” to the model. I added two columns Month and Month Name.

Now here’s the DAX behind these two columns:

ColumnDAX
MonthMONTH(‘Sample Table'[Date])
Month NameRELATED(Month[MonthName])

There’s a relationship between Month Column of “Sample Table” and Month column of “Month” table

And here’s what Month Table looks like:

MonthMonthName
1January
2February
3March
4April
5May
6June
7July
8August
9September
10October
11November
12December

Creating a Report on Top of Sample Data-set.

Now if you create the report of Month Name vs Daily New Number, it will look like:

powerpivot model month name not sorted correctly

Oops, Problem detected: Month name are not in correct order.

So now let’s solve it.

Solution to sort month name correctly

Let’s solve this issue in our PowerPivot Model. Here’s what you do:

1. Go to PowerPivot model, Select “Month Name” column from the sample table

2 Toolbar > Home > Sort By column > click on “Sort by column”

3. And set the properties in the dialog box as:

powerpivot model sort by column dialog box

Now, go back to the report and refresh the connection to the model.

Done!

powerpivot model month name are sorted correctly

This is so because by default since the month name is “Text” – it was sorted in A to Z format. But you saw how we can use the sort by column property in PowerPivot model to fix this issue.

That’s about it for the post! your feedback is welcome!

Want to Read More? Here are few links:

Back to basics: What is the difference between Data Analysis and Data Mining?

Standard

What is the difference between Data Analaysis and Data Mining:

1) One view is that: Data Mining is one particular form of Data Analysis.

difference between data mining and data analysis

One of the reason I researched about the difference between Data Analysis and Data Mining because I find that the terms are used Interchangeably and now I know why. It’s because Data Mining is considered as a particular form of Data Analysis.

2) I found other view that says:

Data Analysis is meant to support decision-making, support conclusions & Highlight note-worthy information. So when “Analyzing data” – we know what we want; we want answers to support our hypothesis; we want data in summarized form to highlight useful information.

While

Data Mining is meant for “Knowledge discovery” and “predictions”. So when “Mining data” – we look for undefined insights; We want the data to tell us something we didn’t knew before; We want to find patterns in the data that we had not anticipated.

Sources:

http://www-stat.stanford.edu/~jhf/ftp/dm-stat.pdf

http://stats.stackexchange.com/questions/5026/what-is-the-difference-between-data-mining-statistics-machine-learning-and-ai

http://stats.stackexchange.com/questions/1521/data-mining-and-statistical-analysis

http://en.wikipedia.org/wiki/Data_analysis

 

Data Mining: Classification VS Clustering (cluster analysis)

Standard

For someone who is new to Data mining, classification and clustering can seem similar because both data mining algorithms essentially “divide” the datasets into sub-datasets; But there is difference between them and this blog-post, we’ll see exactly that:

CLASSIFICATIONCLUSTERING
  • We have a Training set containing data that have been previously categorized
  • Based on this training set, the algorithms finds the category that the new data points belong to
  • We do not know the characteristics of similarity of data in advance
  • Using statistical concepts, we split the datasets into sub-datasets such that the Sub-datasets have “Similar” data
Since a Training set exists, we describe this technique as Supervised learningSince Training set is not used, we describe this technique as Unsupervised learning
Example:We use training dataset which categorized customers that have churned. Now based on this training set, we can classify whether a customer will churn or not.Example:We use a dataset of customers and split them into sub-datasets of customers with “similar” characteristics. Now this information can be used to market a product to a specific segment of customers that has been identified by clustering algorithm

If you want to learn about Data Mining, check out the “free Book in PDF format: Mining the massive data-sets”.

How to Solve Error: After Reboot, SharePoint sites says “Service Unavailable”

Standard

I had to reboot my Dev Machine on which I have my SQL Server 2012 Business Intelligence Setup. After reboot, It turns out that when I wanted to open the SharePoint Site (for Power View related work) – it gives me error:

“Service Unavailable

HTTP Error 503. The Service is unavailable”

Now, I did quick searches found this and this

Turns out, for my scenario – the default application pool had stopped:

default application pool stopped

So this is what I did to solve the error:

1) Verified that the account and password for the App Pool is correct

2) Started it.

This is how you do it:

Start > IIS Manager > Application pools > Locate Default Application Pool that stopped > Verify that Account and Password are correct. > Start it.

After doing that, the SharePoint site started working again!

Related articles:

Visualize FourSquare Checkins on Google Maps:

Standard

This is short post written for FUN!

I just figured a way to visualize FourSquare checkins on Google Maps (Courtesy: LifeHacker Post).

This is my check-ins represented on Google Maps when I visited Boston recently:

paras doshi foursquare checins on google maps

Here’s how you do it:

1) Go to https://foursquare.com/feeds/ and login if you haven’t

2) Copy the URL of the KML file which looks like: https://feeds.foursquare.com/history/randomcharacters.kml

3) Go to Google Maps and paste the URL of KML file in search box and hit ENTER!

You’re done!

Happy Tracking! Happy Data Visualization!

Message:

1) This is power of Web 2.0!

2) Mix and Match is the new way of doing things!

What’s the benefit of columnar databases?

Standard

I hear you ask: “why are you writing about columnar databases?”. To answer that, I have spent some time researching about PowerPivot and Tabular Model these days which is powered by what Microsoft calls xVelocity (previously called Vertipaq) engine. It’s a columnar in memory Engine. And curiosity got better of me when I read the word: Columnar and I wondered what’s the benefit of Columnar Databases (aka column oriented databases) ? And why do they just not use the row-oriented database that powers OLTP workload and to answer the questions I did some research and here’s what I found:

First of all, let’s understand the difference between how row-oriented database and column-oriented database stores data:

Consider that we have a Table like this:

IDQuarter…..Sales
1Q1…..100
2Q1…..120
3Q1…..110
4Q1…..130
5Q2…..150
6Q2…..100

Now, in a row-oriented databases, it’s stored like this:

1Q1….100
2Q1….120

..

So on.

 

And in a column oriented databases, it’s stored like this:

123456

 

Q1Q1Q1Q1Q2Q2

 

100120110130150100

 

Benefits of columnar approach:

Imagine that we want to compute the aggregation for column “Sales”. So if we have row-oriented database then we have to get access to every “page” and then extract the value of column from each row. Lot of work, right? But in columnar databases, if we want to compute aggregations over Sales column then we just access the page that has ALL values of sales column. Does it not reduce the IO by not accessing lot of pages?

I know there’s more to the story and this is just a over-simplification of the process – But you get the point, don’t you?

Thus remember that:

Column-oriented systems are more efficient when an aggregate needs to be computed over many rows but only for a notably smaller subset of all columns of data, because reading that smaller subset of data can be faster than reading all data. – wikipedia

 

The OTHER benefit is in the level of compression that can be achieved. Let’s see this concept in simplest of terms.

consider the following data stored in columnar database:

Q1Q1Q1Q1Q1……30 more timesQ2Q2Q2…….40 more times

The above data can be compressed by using the form:

valueBegin positionEnd position
Q1134
Q23576

And as you can see, we stored 76 values using just 9 values. So when they say we do 10x compression – this is how they do it!

This was an example of Run-length encoding.

Technically there are algorithms like: Run Length Encoding, Dictionary Encoding, GZIP, LZ compression that are used to compress data. The engine (like xVelocity) decides the best algorithm to use.

so let’s connect the dots.

In-memory technologies need that ENTIRE data is loaded into MEMORY before processing. It’s beneficial IF we can compress the data and so more data can be loaded into the same amount of memory. Also, in the realm of OLAP, aggregating  needs to be as efficient as it can be and as we have seen columnar databases are efficient at computing aggregations over many rows of a given column. So there’s certainly some serious benefits of using columnar databases in OLAP scenario’s and NOW I GET IT that why xVelocity is columnar and not row-oriented. Do not get me wrong, I am not saying that row oriented databases are bad, but in context of OLAP, column stores offer some benefits. And in general, the row oriented databases and column oriented databases have their set of pros and cons and there’s no superior way of doing things. Also, column stores does not need to be in-memory or in-memory does not need be to columnar. In case of xVelocity – it’s in-memory + columnar. But again In-memory does NOT equal to columnar databases.

Conclusion

Thus, in this blog post:

we’ve seen:

1) How do columnar databases store data?

2) Benefits of column stores

What we’ve not seen:

1) we’ve not explored the disadvantages and caveats of columnar databases.

Related Articles:

Who on earth is creating “Big data”?

Standard

With all the news about “Big Data”, I had a question:

Where does Big Data come from?

So I researched and here are the Big Data “Sources” that I found:

1. Enterprise data (emails, word documents, pdf’s, etc)

2. Transactions

3. Social Media

4. Sensor Data

5. Public Data (energy, world resource, labor statistics etc)

where does Big Data Come from / Big Data Sources

Am I missing anything? Please feel free to point those out!