DateTime columns can be tricky for analysis purposes. They don’t work well with Pivot Tables because of the time part, each value seems unique to the Pivot Table & it also creates problems while creating relationships with Date Dimensions. And so, It’s a common need to convert them to just Date before analyzing data & also a common need to create a relationship between the Date (and not DateTime) with Date Dimension Table.
So if it’s possible, I try to do the data type conversion in the source system query. If your source system is SQL Server, you could use this piece of code:
[code language=”sql”] select [your-fields],cast([DateTime_Col] as date) as Date_Col from TableName [/code]
Doing the data type conversion upfront in the source system query is a good thing to do. And I hope this is helpful.
As a part of developing ETL packages, sometimes, I’ve to write T-SQL queries to pull data from SQL server source systems. But before I start doing that, it’s always good to know the version/edition of the source system. Why? because it can determine whether a TSQL operators are available for me to use or not. Case in point, I had a requirements where I could have written a query that uses Pivot & UnPivot operators. So I write a query & it doesn’t work! I spent about 5 minutes trying to debug the code. The code seems OK to me. So I thought of checking the “version”. And there you go, client’s source system was running SQL Server 2000. So that meant, I couldn’t use the Pivot & UnPivot operators.
This was my quick note on how select @@version helps me while I’m TSQL’ing. Next time, I’ll probably check this first, before writing the code. That could save me few minutes 🙂
Once in a while I write about back to basics topics to revisit some of the fundamental technology concepts that I’ve learned over past few years. Today, we’ll revisit why do we use OLAP and Data Warehouses for business reporting systems. Let me share some of the most common reasons and then I’ll point to resources that offer other reasons.
Let’s see some of the most common reason:
#1: Business Reports should not take lot of time to load.
From a Business User Perspective: They don’t want to wait for their report to populate data. Reports should be fast!
But if business users have to wait for data to show up on their report because of slow query response, then that would be bad for everyone involved. Business Intelligence solutions cannot permeate in an organization if the reports take a lot of time to load:
So from a Technology standpoint: What can you do? And also why did the problem arise in the first place?
Let’s first see why the problem occurred in the first place?
So we have a bunch of database tables. To create a report, we’ll have to summarize (aggregate) values in lots of rows (think millions) and join few tables – turns out that if you query a transactional system (Database / OLTP), then you’ll get a slow response. In some cases, if the data model + data size + queries are not complex, then you could just run a query to create operational business reports and you won’t see any performance issues. But that’s not the case always! So if data model + data size + query requirements for reporting are not simple for OLTP/databases to handle and you see a poor performance – in other words, database/OLTP system takes up a lot of time returning data that the business reports require and the business users would see bad performance. The issue goes beyond complex data model + data size + query. You see, transactional systems may be running other tasks in parallel to returning data to business reports. So there’s the issue of resource contention on the OLTP database.
so that’s no good, right? Not only is the OLTP system bad at running queries needed for business reports but it also does not dedicate it’s resource for us!
So let’s create a copy of databases and have them dedicated to answer questions to business reports. so there’s not an issue of resource contention as we have dedicated resources to handle that. And while, we are at it – why don’t change the data model so that it best suits the queries that are needed for business reporting and analysis?
Well, that’s exactly what OLAP database is. It’s a database that’s created for business reporting and analysis. It’s does some neat things like pre-aggregating some values PLUS the data model in OLAP is also best suited for reporting purposes. (Read more about Star schemas/ data mart / data warehouse / ETL if you’re curious to learn more).
OK – so that’s one reason: To improve performance! Now let’s see another one.
#2: Creating Business Reports over Transactional systems (OLTP) data is NOT developer-friendly:
Ok, so we already covered in the previous section that creating business reports over OLTP can cause performance issues. But there’s more to it then just performance. You see – the requirements of creating business reports is different then the requirements of transactions systems. So? well, that means that the data model used for OLTP is best suited for transactional systems and it is not an optimal data model for analysis and reporting purpose. for example: creating hierarchies, drill-down reports, year-over-year growth among other things are much more efficiently handles by OLAP systems. But if we were to use OLTP database, then it would take a lot of developer hours to write efficient (and correct!) SQL commands (mostly stored procedures) to get OLTP to give data that the business reports need. Also, some of the common business metrics that are used in reporting can be stored in a cube. so that each time a report get’s created, you can re-use the business metric stored in the OLAP cubes.
OK – so OLAP cube saves time (to create reports).
Not only, OLAP cubes perform better at returning data they also help us speed the process of creating reports.
That’s great! let’s see one more reason:
#3: Ad-hoc reporting over OLTP systems creates confusion!
This reason is more about why we should have a data-mart and data-warehouse.. So why do ad-hoc reporting over OLTP systems creates confusion among business users?
Imagine creating reports over a LIVE system that’s getting updated every seconds. If there are ad-hoc (as-needed basis) reports being created by different users – then everyone would see different results. so it’s important to have a common version for everyone. Also imagine, everyone combining data from different data sources. If they’re doing it differently then they would see different data. And not only that, if they’re creating derived (or calculated) columns and their formula’s are different then they would see different data. you see a common pattern here? There’s no conformity in data & formulas in the reports that gets created. What does it cause? Confusion! So what’s needed is what they call in the data-warehouse “single version of truth”. OLAP cubes (which gets data from data-warehouse) provide that common single data source for everyone and thus the conformity in data is maintained while creating business reports.
Also while we are at this, one more consideration that typically reports require historical data at aggregated level. So we don’t want to store each transaction over the last 10 years in an OLTP database, do we? NO! right? In such cases, the historical data is aggregated based on requirements and stored in datamarts (/data-warehouse) which is later consumed by the OLAP cubes and that way OLTP databases do not have to store lot of historical data.
Ok – that’s one more reason OLTP are bad w/ business reporting and analysis and that’s why we need data-marts (data warehouse) and OLAP Cubes.
That’s about it. for this post. Let me point you to Related Resources:
Data Explorer let’s you “Explore” (search) for web-based public data. This is a great way to combine data that you may have in your data-sources with public data sources for data analysis purposes. Sometimes your data might not tell you the reason behind the observed trends, so when that happens – you can try to see if a public data-set might give you the much-needed context. Let me give you an Example before we start hands-on w/ data explorer so that you have better understanding of importance of public datasets. Here’s a sample that I found here. So, Here’s a demo:
An auto company is seeing sales trends of Hybrid cars and SUV’s from the sales data-sources. But what is the reason behind that? company data does not show that. Someone hypothesizes that it might be because of gas prices. So they test out the hypothesis by combining gas prices information available via public data. And turns out gas prices might be the driving force of sales trends! SEE:
if the gas prices increase, then the sale of SUV go down and the sale of Hybrids go up:
You know that public data can be helpful! So how can you search for public data-sets? Well, You can manually search online, ask someone, browse through public data repositories like azure data market (and other data markets), there’s also a public data search engine! OR you can directly search for them via Data Explorer.
Here are the steps:
1) Excel 2010/2013 > Data Explorer Tab > Online Search > type “Tallest Buildings”
2) I selected one of the data-sets that said “Tallest completed building…. ”
3) Now let’s do some filtering and shaping. Here are the requirements:
– Hide columns: Image, notes & key
– clean columns that has heights data
– Show only city name in location
OK, let’s get to this one by one!
4) Hiding Columns:
Click on Filter & Shape button from the Query Settings:
Select Image Column > Right Click > Hide:
Repeat the steps for notes & key column.
Click on DONE
5) clean column that has heights data.
Click on Filter & Shape to open the query editor
A) let’s rename it. Select column > Right Click > Rename to Height > press ENTER
B) let’s remove the values in brackets. Select Column > right click > split column > By delimiter > At each occurrence of the delimiter > Custom and enter “(” > OK
This should transform the data like this:
Hide height.2 and rename the height.1 to height
Click on DONE
6) Let’s just have city names in the location column
click on Filter & shape to load query editor:
A) select location > right click > split column > by delimiter > Custom – Enter: ° in the text box like this:
SPEED is one of the important aspect of Data Analysis. Wouldn’t it be great if you query a data source, you get your answers as soon as possible? Yes? Right! Of course, it depends on factors like the size of the data you are trying to query but wouldn’t it be great if it’s at “SPEED OF THOUGHT“?
So Here’s the Problem:
Databases are mostly disk based and so the bottleneck here is the speed at which can get access to data off the disks.
So what can you do?
Let’s put data in a RAM (memory) because data-access via memory is faster.
If it’s sounds so easy, why didn’t people do it earlier? And why are we talking about “In Memory” NOW?
1) BIGGER Data Size/sets and so today with more data, it takes more time to query data from databases. And so researchers have looked at other approaches. One of the effective approach they found is: In-memory
(And I am not ignoring the advances in Database Technologies like Parallel databases, But for the purpose of understanding “Why In-memory”, it’s important to realize the growing size of data sets and a viable alternative we have to tackle the problem: In memory. And also I am not saying that it’s the ONLY way to go. I am just trying to understand the significance of in-memory technologies. We, as data professionals, have lot’s of choices! And only after evaluating project requirements, we can talk about tools and techniques)
2) PRICE of Memory: Was the price of RAM/memory higher than what it is today? So even though it was a great idea to put data in memory, it was cost-prohibitive.
So Let’s connect the dots: Data Analysis + In Memory Technologies:
What’s common between Microsoft’s PowerPivot, SAP HANA, Tableau and Qlikview?
1) Tools for Data-Analysis/Business-Intelligence 2) Their Back End data architecture is “In Memory”
So since Data Analysis needs SPEED and In-Memory Technologies solves this need – Data Analysis and Business Intelligence Tools adopted “In-memory” as their back-end data architecture. And next time, when you hear a vendor saying “in-memory”, you don’t have to get confused about what they’re trying to say. They’re just saying that we got you covered by giving you ability to query your data at “speed of thought” via our In-memory technologies so that you can go back to your (data) analysis.