Apart from Datasets AdventureWorks and Contoso data-sets, there are places where you can download data-sets to play with for your Business Intelligence, Data Mining or Data Analysis Projects.
Here is the List of data-sets that I have collected:
Have you ever had someone sent you a PowerPivot model and asked you to do something with it? And if so – may be, you would have to see what data source(s) the model is using and if applicable, you test the existing connections. If you find yourselves in such a situation, this blog post is for you:
1) Open the excel file and switch to PowerPivot Window
2) Now, switch to Design Tab > Click on Existing Connections:
3) Here you’ll find the list of connections under “PowerPivot Data Connections” > Select the connection you wish to TEST or EDIT > Click on EDIT button
4) Now here you can edit the data source. And if you click on “Test Connection”, you’ll be able to test it too.
5) After you’re done, click on SAVE.
And you’ll now see the “Existing Connections” Box again:
Click CLOSE and you’re done, you have successfully edited or tested the existing connection.
This blog post is for people who have seen the reports built on PowerPivot model where the Month Names are not in correct logical order. So instead of “January, February, March, April …” (which is correct logical order), the order in the report would be displayed as “April, February, January, March..” (which is NOT correct).
This is what I am talking about:
Here, Month names are not sorted correctly, right? So how do we solve this? Let’s see this in this blog post!
Note that “Daily new number” is used for the demo purpose. I had to anonymize the data before I could share it with you all!
Originally, the table had just two columns but since I wanted to add “time Intelligence” to the model. I added two columns Month and Month Name.
Now here’s the DAX behind these two columns:
Column
DAX
Month
MONTH(‘Sample Table'[Date])
Month Name
RELATED(Month[MonthName])
There’s a relationship between Month Column of “Sample Table” and Month column of “Month” table
And here’s what Month Table looks like:
Month
MonthName
1
January
2
February
3
March
4
April
5
May
6
June
7
July
8
August
9
September
10
October
11
November
12
December
Creating a Report on Top of Sample Data-set.
Now if you create the report of Month Name vs Daily New Number, it will look like:
Oops, Problem detected: Month name are not in correct order.
So now let’s solve it.
Solution to sort month name correctly
Let’s solve this issue in our PowerPivot Model. Here’s what you do:
1. Go to PowerPivot model, Select “Month Name” column from the sample table
2 Toolbar > Home > Sort By column > click on “Sort by column”
3. And set the properties in the dialog box as:
Now, go back to the report and refresh the connection to the model.
Done!
This is so because by default since the month name is “Text” – it was sorted in A to Z format. But you saw how we can use the sort by column property in PowerPivot model to fix this issue.
That’s about it for the post! your feedback is welcome!
What is the difference between Data Analaysis and Data Mining:
1) One view is that: Data Mining is one particular form of Data Analysis.
One of the reason I researched about the difference between Data Analysis and Data Mining because I find that the terms are used Interchangeably and now I know why. It’s because Data Mining is considered as a particular form of Data Analysis.
2) I found other view that says:
Data Analysis is meant to support decision-making, support conclusions & Highlight note-worthy information. So when “Analyzing data” – we know what we want; we want answers to support our hypothesis; we want data in summarized form to highlight useful information.
While
Data Mining is meant for “Knowledge discovery” and “predictions”. So when “Mining data” – we look for undefined insights; We want the data to tell us something we didn’t knew before; We want to find patterns in the data that we had not anticipated.
For someone who is new to Data mining, classification and clustering can seem similar because both data mining algorithms essentially “divide” the datasets into sub-datasets; But there is difference between them and this blog-post, we’ll see exactly that:
CLASSIFICATION
CLUSTERING
We have a Training set containing data that have been previously categorized
Based on this training set, the algorithms finds the category that the new data points belong to
We do not know the characteristics of similarity of data in advance
Using statistical concepts, we split the datasets into sub-datasets such that the Sub-datasets have “Similar” data
Since a Training set exists, we describe this technique as Supervised learning
Since Training set is not used, we describe this technique as Unsupervised learning
Example:We use training dataset which categorized customers that have churned. Now based on this training set, we can classify whether a customer will churn or not.
Example:We use a dataset of customers and split them into sub-datasets of customers with “similar” characteristics. Now this information can be used to market a product to a specific segment of customers that has been identified by clustering algorithm
If you want to learn about Data Mining, check out the “free Book in PDF format: Mining the massive data-sets”.
I had to reboot my Dev Machine on which I have my SQL Server 2012 Business Intelligence Setup. After reboot, It turns out that when I wanted to open the SharePoint Site (for Power View related work) – it gives me error:
I hear you ask: “why are you writing about columnar databases?”. To answer that, I have spent some time researching about PowerPivot and Tabular Model these days which is powered by what Microsoft calls xVelocity (previously called Vertipaq) engine. It’s a columnar in memory Engine. And curiosity got better of me when I read the word: Columnar and I wondered what’s the benefit of Columnar Databases (aka column oriented databases) ? And why do they just not use the row-oriented database that powers OLTP workload and to answer the questions I did some research and here’s what I found:
First of all, let’s understand the difference between how row-oriented database and column-oriented database stores data:
Consider that we have a Table like this:
ID
Quarter
…..
Sales
1
Q1
…..
100
2
Q1
…..
120
3
Q1
…..
110
4
Q1
…..
130
5
Q2
…..
150
6
Q2
…..
100
Now, in a row-oriented databases, it’s stored like this:
1
Q1
….
100
2
Q1
….
120
…
..
So on.
And in a column oriented databases, it’s stored like this:
1
2
3
4
5
6
Q1
Q1
Q1
Q1
Q2
Q2
100
120
110
130
150
100
Benefits of columnar approach:
Imagine that we want to compute the aggregation for column “Sales”. So if we have row-oriented database then we have to get access to every “page” and then extract the value of column from each row. Lot of work, right? But in columnar databases, if we want to compute aggregations over Sales column then we just access the page that has ALL values of sales column. Does it not reduce the IO by not accessing lot of pages?
I know there’s more to the story and this is just a over-simplification of the process – But you get the point, don’t you?
Thus remember that:
Column-oriented systems are more efficient when an aggregate needs to be computed over many rows but only for a notably smaller subset of all columns of data, because reading that smaller subset of data can be faster than reading all data. – wikipedia
The OTHER benefit is in the level of compression that can be achieved. Let’s see this concept in simplest of terms.
consider the following data stored in columnar database:
Q1
Q1
Q1
Q1
Q1……30 more times
Q2
Q2
Q2…….40 more times
The above data can be compressed by using the form:
value
Begin position
End position
Q1
1
34
Q2
35
76
And as you can see, we stored 76 values using just 9 values. So when they say we do 10x compression – this is how they do it!
Technically there are algorithms like: Run Length Encoding, Dictionary Encoding, GZIP, LZ compression that are used to compress data. The engine (like xVelocity) decides the best algorithm to use.
so let’s connect the dots.
In-memory technologies need that ENTIRE data is loaded into MEMORY before processing. It’s beneficial IF we can compress the data and so more data can be loaded into the same amount of memory. Also, in the realm of OLAP, aggregating needs to be as efficient as it can be and as we have seen columnar databases are efficient at computing aggregations over many rows of a given column. So there’s certainly some serious benefits of using columnar databases in OLAP scenario’s and NOW I GET IT that why xVelocity is columnar and not row-oriented. Do not get me wrong, I am not saying that row oriented databases are bad, but in context of OLAP, column stores offer some benefits. And in general, the row oriented databases and column oriented databases have their set of pros and cons and there’s no superior way of doing things. Also, column stores does not need to be in-memory or in-memory does not need be to columnar. In case of xVelocity – it’s in-memory + columnar. But again In-memory does NOT equal to columnar databases.
Conclusion
Thus, in this blog post:
we’ve seen:
1) How do columnar databases store data?
2) Benefits of column stores
What we’ve not seen:
1) we’ve not explored the disadvantages and caveats of columnar databases.
With all the news about “Big Data”, I had a question:
Where does Big Data come from?
So I researched and here are the Big Data “Sources” that I found:
1. Enterprise data (emails, word documents, pdf’s, etc)
2. Transactions
3. Social Media
4. Sensor Data
5. Public Data (energy, world resource, labor statistics etc)
Am I missing anything? Please feel free to point those out!
Update [11 July] : Claytonbingham adds “Academic Data” to the list, which I think sometimes can also be referred to as “scientific data” (Thanks Claytonbingham! please refer to comments section for his thoughts)