A Business Intelligence (BI) system for Sales is being developed at a company. Here are the events that occur:
1) The requirements document states that the business needs to analyze sales numbers by product, month, customer, and employee.
2) While designing the system, IT learns that the data is stored at the invoice level, but since the requirements document doesn't mention invoice-level detail, they decide to aggregate the data before bringing it into their system.
3) They develop the BI system within the time frame and send it to the business for data validation.
4) Business analysts start reviewing the BI system, find numbers that don't look right for a few products, and need to see the invoices for those products to confirm the data is correct, so they ask IT for invoice-level data.
5) IT realizes that even though the business never explicitly requested invoice-level data, they do NEED the lowest-level data! It's crucial for passing data validation. They also talk with their business analysts and learn that they may sometimes need to drill down to the lowest-level data to find insights that are hidden at the aggregate level.
6) So IT decides to rework the solution. This blows the timeline and budget set for the project, and by missing both, they lose the opportunity to gain the confidence of the business.
7) They learn the lesson "Design the BI system to store the lowest level of data even if it isn't asked for!" and resolve never to make this mistake again.
To conclude: include the lowest level of data in your BI system even if it's not explicitly requested – it will save you time and build your credibility as a Business Intelligence developer/architect.
You have a Fact Sales and a Fact Target table in your data mart. Fact Sales stores values at the product subcategory level, while Fact Target stores values at the product category level because the business sets sales targets at a higher (rolled-up) level. How do you connect both facts to a single dimension at different granularities?
Solution:
Here's the table structure; I made it up for demo purposes (a rough T-SQL sketch follows the list):
1. Fact Sales
2. Fact Target
3. Dim Product Sub Category
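A minimal T-SQL sketch of what those three tables might look like (the columns and data types are my own assumptions for the demo):
[code language="sql"]
-- Made-up demo structures: the two facts sit at different granularities,
-- but both can be tied back to the product dimension.
CREATE TABLE DimProductSubCategory
(
    ProductSubCategoryKey  INT PRIMARY KEY,   -- integer surrogate key (good for join performance)
    ProductCategoryKey     INT NOT NULL,      -- rolled-up level that Fact Target joins to
    ProductSubCategoryName VARCHAR(50),
    ProductCategoryName    VARCHAR(50)
);

CREATE TABLE FactSales
(
    ProductSubCategoryKey INT NOT NULL,       -- stored at subcategory grain
    SalesAmount           MONEY
);

CREATE TABLE FactTarget
(
    ProductCategoryKey INT NOT NULL,          -- stored at category grain (targets are set here)
    TargetAmount       MONEY
);
[/code]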
So you go ahead and test this by creating relationships to a single dimension at different granularities in the cube:
Note how the relationship is specified between Fact Target and the Product Sub Category dimension – it's joined at a different granularity than Fact Sales. From a performance standpoint, it also helps if the fields used to join the fact and the dimension are integers.
So, you browse the cube and here’s what you get:
Note the problem: the target values are repeated for subcategories, which shouldn't happen – that's misleading to business users. To recap what we need to do: hide target values at the subcategory level, since targets aren't set at that granularity, but still show them when a business user pulls in product category.
So here’s a measure group property that comes to the rescue!
Go to the Fact Target measure group's properties > set IgnoreUnrelatedDimensions to False.
Deploy and browse your cube again; here's what you'll see now:
That's it! You have successfully joined facts at different granularities to a single dimension.
This is a beginner-level post targeted at developers who are new to SSIS and may not have worked on making an SSIS staging load package incremental. In this post, I'll share a design pattern I've used to make staging loads incremental so that they pull in only the new or changed rows from the source system.
Tutorial:
Before we begin, why would you want to make a staging load incremental when pulling data from source systems? Two main reasons: 1) the source system may not keep historical data, but your Business Intelligence system needs it; 2) an incremental pull is faster and puts less strain on the source system.
Since this is a beginner-level post, I'm going to show a design pattern for the case where the source system has a column that identifies new or changed rows. If the source system has no such column, the topic becomes advanced and is out of scope for now.
With that said, let's see the steps involved.
1) I have this kill-and-fill (a.k.a. full load) package in my SSIS dev environment:
2) Now, let's make it incremental. I'll go ahead and delete the Execute SQL Task that truncates the staging table.
3) Next, we need a way to pass a query into our Data Flow Task (DFT) that gets only the new or changed rows. The source system I'm using has a ModifiedDate field, and that's what I'll use to pull new or changed data.
4) Let's build that query with the help of variables, an Execute SQL Task, and a Script Task. (Later, we'll store the query in a variable and use that variable in the Data Flow Task.)
4a) Create ModifiedDate and Query variables.
4b) Create an Execute SQL Task that runs a query to get the max ModifiedDate and writes it to the ModifiedDate variable you created; a sample query follows.
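The query can be as simple as the following sketch. I'm assuming the high-water mark is read from a staging table named stg.SalesOrderDetail – a control/watermark table works just as well:
[code language="sql"]
-- Grab the most recent ModifiedDate already loaded into staging;
-- the Execute SQL Task maps this single-row result to the ModifiedDate variable.
-- Falls back to an early date on the very first (empty) run.
SELECT ISNULL(MAX(ModifiedDate), '1900-01-01') AS ModifiedDate
FROM stg.SalesOrderDetail;
[/code]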
4c) Create a Script Task that builds the extract query using the ModifiedDate variable. This query will pull only new or changed rows from the source system:
[code language="vb"]
' Read the last ModifiedDate captured by the Execute SQL Task.
Dim ModifiedDate As String
Dim sQuery As String
ModifiedDate = Dts.Variables("ModifiedDate").Value.ToString()

' Build the source query so that only new or changed rows are pulled.
sQuery = String.Concat("SELECT [SalesOrderID],[SalesOrderDetailID],[CarrierTrackingNumber],[OrderQty],[ProductID],[SpecialOfferID],[UnitPrice],[ModifiedDate] FROM [sales].[SalesOrderDetail] WHERE [ModifiedDate] >= '", ModifiedDate, "'")

' Show the dynamic query (handy while debugging) and hand it to the Query variable.
MsgBox(sQuery)
Dts.Variables("Query").Value = sQuery
[/code]
5) Now go to the Variables pane and give the user::Query variable a default value; if you don't, you won't be able to complete the next steps, because the OLE DB source needs a valid query at design time to resolve its metadata.
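For example, a default value that returns the right columns but no rows works well (just a suggestion; the column list matches the demo source above):
[code language="sql"]
-- Design-time default for the Query variable: valid metadata, zero rows.
SELECT [SalesOrderID],[SalesOrderDetailID],[CarrierTrackingNumber],[OrderQty],
       [ProductID],[SpecialOfferID],[UnitPrice],[ModifiedDate]
FROM [sales].[SalesOrderDetail]
WHERE 1 = 0;
[/code]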
6) Go to the Data Flow and change the OLE DB Source to use "SQL command from variable", pointing it at the user::Query variable.
7) Switch to the Control Flow and make sure your precedence constraints run the tasks in this order: Execute SQL Task > Script Task > Data Flow Task.
8) Run the package and you should see the dynamic query that gets generated. Tip: it's sometimes helpful to run the generated query against the source system for troubleshooting.
9) After a successful run, verify that only new or changed rows were added to the staging table. If duplicate rows end up in staging, they may need to be handled during the dimension or fact load; you can also put logic in place here to avoid duplicate records in the staging table, as sketched below.
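One way to clean up duplicates after the load is to keep only the most recent row per business key. This is just a sketch and assumes a staging table named stg.SalesOrderDetail:
[code language="sql"]
-- De-duplicate the staging table: keep only the latest row per business key.
WITH Ranked AS
(
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY SalesOrderID, SalesOrderDetailID
                              ORDER BY ModifiedDate DESC) AS rn
    FROM stg.SalesOrderDetail
)
DELETE FROM Ranked
WHERE rn > 1;
[/code]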
That’s it!
Conclusion:
In this post, you saw how to make a staging load package incremental.
Once in a while I write about back-to-basics topics to revisit some of the fundamental technology concepts I've learned over the past few years. Today, we'll revisit why we use OLAP and data warehouses for business reporting systems. Let me share some of the most common reasons and then point you to resources that offer others.
Let's look at some of the most common reasons:
#1: Business reports should not take a long time to load.
From a business user's perspective: they don't want to wait for their report to populate. Reports should be fast!
But if business users have to wait for data to show up on their reports because of slow query response, that's bad for everyone involved. Business Intelligence solutions cannot spread through an organization if the reports take a long time to load.
So from a technology standpoint, what can you do? And why did the problem arise in the first place?
Let's tackle the second question first.
We have a bunch of database tables. To create a report, we have to summarize (aggregate) values across lots of rows (think millions) and join a few tables – and it turns out that if you run such queries against a transactional system (an OLTP database), you get a slow response. In some cases, when the data model, data size, and queries are simple, you can run queries directly against the OLTP system for operational business reports without any performance issues. But that's not always the case! When the combination of data model, data size, and reporting queries is more than the OLTP database can comfortably handle, it takes a long time to return the data the reports need, and business users see poor performance. And the issue goes beyond complex data models, data sizes, and queries: transactional systems run other workloads in parallel with serving business reports, so there's also resource contention on the OLTP database.
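To make that concrete, here's the kind of query a report might fire at a normalized OLTP schema (illustrative only – the table and column names are made up):
[code language="sql"]
-- Aggregate millions of detail rows and join several tables on the live OLTP system.
SELECT p.ProductName,
       DATEPART(YEAR, o.OrderDate)  AS OrderYear,
       DATEPART(MONTH, o.OrderDate) AS OrderMonth,
       SUM(od.OrderQty * od.UnitPrice) AS SalesAmount
FROM dbo.OrderDetail AS od
JOIN dbo.[Order]     AS o ON o.OrderID   = od.OrderID
JOIN dbo.Product     AS p ON p.ProductID = od.ProductID
GROUP BY p.ProductName, DATEPART(YEAR, o.OrderDate), DATEPART(MONTH, o.OrderDate);
[/code]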
So that's no good, right? Not only is the OLTP system bad at running the queries that business reports need, it also doesn't dedicate its resources to us!
So let's create a copy of the data and dedicate it to answering business reporting questions – that removes the resource contention, since we now have dedicated resources. And while we're at it, why not change the data model so that it best suits the queries needed for business reporting and analysis?
Well, that's exactly what an OLAP database is: a database created for business reporting and analysis. It does some neat things like pre-aggregating values, and its data model is built for reporting purposes. (Read up on star schemas, data marts, data warehouses, and ETL if you're curious to learn more.)
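For comparison, the same question asked of a star schema in a dedicated data mart typically touches one fact table and a couple of small dimensions (again, the names below are made up for illustration):
[code language="sql"]
-- The same report against a star schema: one fact table, small dimension tables,
-- on a system dedicated to reporting.
SELECT dp.ProductName,
       dd.CalendarYear,
       dd.CalendarMonth,
       SUM(fs.SalesAmount) AS SalesAmount
FROM dbo.FactSales  AS fs
JOIN dbo.DimProduct AS dp ON dp.ProductKey = fs.ProductKey
JOIN dbo.DimDate    AS dd ON dd.DateKey    = fs.DateKey
GROUP BY dp.ProductName, dd.CalendarYear, dd.CalendarMonth;
[/code]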
OK – so that's one reason: to improve performance! Now let's look at another.
#2: Creating business reports over transactional (OLTP) data is NOT developer-friendly:
We already covered in the previous section that creating business reports over OLTP can cause performance issues. But there's more to it than performance. The requirements of business reporting are different from the requirements of transactional systems, which means the data model used for OLTP is tuned for transactions and is not an optimal model for analysis and reporting. For example, hierarchies, drill-down reports, and year-over-year growth, among other things, are handled much more efficiently by OLAP systems. If we used the OLTP database instead, it would take a lot of developer hours to write efficient (and correct!) SQL – mostly stored procedures – to get the data that business reports need. Also, common business metrics can be stored in the cube, so each time a report is created you can reuse the metrics already defined there.
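To give a feel for the "developer hours" point, here's roughly what a hand-written year-over-year growth calculation looks like in T-SQL (a sketch with made-up table and column names); in a cube this is typically a single reusable calculated measure:
[code language="sql"]
-- Year-over-year sales growth written by hand against a detail table.
WITH YearlySales AS
(
    SELECT ProductID,
           DATEPART(YEAR, OrderDate)  AS OrderYear,
           SUM(OrderQty * UnitPrice)  AS SalesAmount
    FROM dbo.SalesOrderDetail
    GROUP BY ProductID, DATEPART(YEAR, OrderDate)
)
SELECT cur.ProductID,
       cur.OrderYear,
       cur.SalesAmount,
       (cur.SalesAmount - prev.SalesAmount) * 100.0
           / NULLIF(prev.SalesAmount, 0) AS YoYGrowthPct
FROM YearlySales AS cur
LEFT JOIN YearlySales AS prev
       ON prev.ProductID = cur.ProductID
      AND prev.OrderYear = cur.OrderYear - 1;
[/code]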
OK – so OLAP cubes save development time: not only do they return data faster, they also speed up the process of creating reports.
That's great! Let's look at one more reason:
#3: Ad-hoc reporting over OLTP systems creates confusion!
This reason is more about why we should have a data mart or data warehouse. So why does ad-hoc reporting over OLTP systems create confusion among business users?
Imagine creating reports over a live system that's being updated every second. If different users build ad-hoc (as-needed) reports against it, everyone sees different results, so it's important to have one common version for everyone. Also imagine everyone combining data from different data sources: if they do it differently, they see different data. And if they create derived (calculated) columns with different formulas, again they see different data. See the common pattern? There's no conformity in the data and formulas behind the reports, and that causes confusion. What's needed is what data-warehouse folks call a "single version of the truth". OLAP cubes (which get their data from the data warehouse) provide that common, single data source for everyone, so conformity of data is maintained across business reports.
Also, while we're at it, one more consideration: reports typically need historical data at an aggregated level, and we don't want to store every transaction from the last 10 years in the OLTP database, do we? No! In such cases, the historical data is aggregated based on requirements and stored in data marts (or the data warehouse), which are later consumed by the OLAP cubes – so the OLTP database doesn't have to hold years of history.
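As a rough sketch (made-up names again), the periodic ETL might roll transaction-level history up to a monthly grain before storing it in the data mart:
[code language="sql"]
-- Aggregate detailed transactions to a monthly summary fact in the data mart.
INSERT INTO mart.FactMonthlySales (ProductKey, YearMonth, SalesAmount)
SELECT od.ProductID,
       DATEPART(YEAR, o.OrderDate) * 100 + DATEPART(MONTH, o.OrderDate) AS YearMonth,
       SUM(od.OrderQty * od.UnitPrice) AS SalesAmount
FROM dbo.OrderDetail AS od
JOIN dbo.[Order]     AS o ON o.OrderID = od.OrderID
GROUP BY od.ProductID, DATEPART(YEAR, o.OrderDate), DATEPART(MONTH, o.OrderDate);
[/code]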
OK – that's one more reason OLTP systems are a poor fit for business reporting and analysis, and why we need data marts (data warehouses) and OLAP cubes.
That's about it for this post. Let me point you to related resources: