A Business Intelligence (BI) system for Sales is being developed at a company. Here are the events that occur:
1) Based on the requirements, it is documented that the business needs to analyze sales numbers by product, month, customer & employee.
2) While designing the system, IT learns that the data is stored at the invoice level. But since the requirements document doesn't say anything about having details down to the invoice level, they decide to aggregate the data before bringing it into their system.
3) They develop the BI system within the time frame and send it to the business for data validation.
4) The business analysts start looking at the BI system and find numbers that don't look right for a few products. They need to see the invoices for those products to make sure the data is right, so they ask IT to give them invoice-level data.
5) IT realizes that even though the business had not requested invoice-level data explicitly, they do NEED the lowest-level data! It's crucial for passing data validation. Also, after talking with the business analysts, they find out that analysts may sometimes need to drill down to the lowest-level data to find insights that may be hidden at the aggregate level.
6) So IT decides to rework their solution. This increases the timeline and budget set for the project. Not only that, they have lost the opportunity to gain the confidence of the business by missing the budget and timeline.
7) They learn to "design the BI system to have the lowest-level data even if it's not asked for!" and decide never to make this mistake again.
This concludes the post. The takeaway: include the lowest-level data in your BI system even if it's not explicitly requested – this will save you time and build your credibility as a Business Intelligence developer/architect.
How to use the Execute SQL Task in SSIS to assign a value to a variable?
Solution:
This is a beginner-level post, so I'll show you how you can use the Execute SQL Task to assign a value to a variable. Note that a variable can also be given a full result set. With that said, here are the steps:
1. Create the query against the source system
Example: (Note the column name; this will be handy later! See the query sketch after these steps.)
2. Open SSIS Project > Create the variable
Example
3. Now, drag an Execute SQL Task to the Control Flow. Rename it, go to Edit, and configure the SQL Statement section.
4. Now, since we want to store a single value in the variable, change the Result Set property to Single Row.
5. One last step: go to the Result Set section and map the Result Name (remember the column name from #1?!) to the Variable Name.
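For step 1, here's a minimal sketch of what such a query could look like, assuming an AdventureWorks-style source table; the table name and the SourceRowCount alias are only illustrative:

[code language="sql"]
-- Hypothetical single-row, single-column query for step 1.
-- The column alias (SourceRowCount) is what gets mapped to the
-- SSIS variable in the Result Set section (step 5).
SELECT COUNT(*) AS SourceRowCount
FROM [Sales].[SalesOrderDetail];
[/code]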
This is a beginner-level post targeted at developers who are new to SSIS and may not have worked on making an SSIS staging load package incremental. In this post, I'll share a design pattern that I've used to make staging loads incremental so that they pull in just the new or changed rows from the source system.
Tutorial:
Before we begin, why would you want to make a staging load incremental when pulling data from source systems? Two main reasons: 1) the source system may not keep historical data, but your Business Intelligence system needs to have it; 2) it is faster and puts less strain on the source system during the data pull.
Since this is a beginner-level post, I am going to show you a design pattern for the case where you have a column in the source system that can identify new or changed rows. If you do not have such a column, this becomes an advanced topic and is out of scope for now.
With that said, let's see the steps involved.
1) I have this kill-and-fill (a.k.a. full load) package in my SSIS dev environment:
2) Now, let's make this incremental, so I'll go ahead and delete the Execute SQL Task that truncates the data.
3) Now, we need a way to pass a query into our Data Flow Task (DFT) that gets only the new or changed rows. The source system that I am using has a ModifiedDate field, and that's what I'll be using to pull in new or changed data.
4) Let's build the query with the help of variables, an Execute SQL Task, and a Script Task. (Later, we'll store the query in a variable and use that variable in the Data Flow Task.)
4a) Create the ModifiedDate and Query variables.
4b) Create an Execute SQL Task that runs a query to get the max ModifiedDate and writes it to the ModifiedDate variable that you created (see the query sketch after these steps).
4c) Create a Script Task that builds the query using the ModifiedDate variable. This query will extract only new or changed rows from your source system:
[code language=”vb”]
' Read the ModifiedDate variable, build the incremental extract query,
' and write it back to the Query variable.
Dim ModifiedDate As String
Dim sQuery As String

ModifiedDate = Dts.Variables("ModifiedDate").Value.ToString()

sQuery = String.Concat("SELECT [SalesOrderID],[SalesOrderDetailID],[CarrierTrackingNumber],[OrderQty],[ProductID],[SpecialOfferID],[UnitPrice],[ModifiedDate] FROM [Sales].[SalesOrderDetail] WHERE [ModifiedDate] >= '", ModifiedDate, "'")

' Optional: show the generated query while debugging.
MsgBox(sQuery)

Dts.Variables("Query").Value = sQuery
[/code]
5) Now, go to the Variables section and give a default value to the User::Query variable; if you do not do this, you won't be able to complete the next steps.
6) Go to the Data Flow and change the OLE DB source to use "SQL command from variable", pointing it at the User::Query variable.
7) Switch to the Control Flow and make sure your precedence constraints are set to run Execute SQL Task > Script Task > Data Flow Task.
8) Run the package and you should see the dynamic query that gets generated. Tip: sometimes it's helpful to run the generated query against the source system for troubleshooting purposes.
9) After a successful run of the package, verify that only new or changed rows got added to the staging table. Also, if there are duplicate rows in the staging table, they might need to be handled during the dimension or fact load; you can also consider putting logic in place here to avoid duplicate records in your staging table.
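For reference, here's a minimal sketch of the step 4b query, assuming the max ModifiedDate is taken from the staging (destination) table and that the staging table is named [stg].[SalesOrderDetail]; the table name and the fallback date are only illustrative:

[code language="sql"]
-- Hypothetical query for the step 4b Execute SQL Task (Result Set = Single Row).
-- Map the ModifiedDate result column to the User::ModifiedDate (string) variable.
SELECT ISNULL(CONVERT(VARCHAR(23), MAX([ModifiedDate]), 121), '1900-01-01') AS ModifiedDate
FROM [stg].[SalesOrderDetail];
[/code]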
That’s it!
Conclusion:
In this post, you saw how to make a staging load package incremental.
In a SQL Server data mart, it's common to have an identity column (an SK column) in a fact table. It's also common to add a -1 record to this table for "unknown" values. So if you want to insert an explicit value into an identity column in a SQL Server table, here are the steps:
1) In SSMS, select the table in Object Explorer > right-click > Script Table as > INSERT To > "New Query Editor Window"
2) This gives you a nice starting point to create your identity insert script:
3) Add the following code before the INSERT statement:
SET IDENTITY_INSERT <schema_name>.<table_name> ON
GO
4) Add the following code after the INSERT statement:
SET IDENTITY_INSERT <schema_name>.<table_name> OFF
GO
5) Now modify the "VALUES" section of the INSERT statement. Also, note that the identity column will not be in the scripted column list; you'll have to add it manually.
After the values are added, here’s what the code should look like:
[code language=”sql” gutter=”false”]
USE <database_name>
GO

SET IDENTITY_INSERT <schema_name>.<table_name> ON
GO

INSERT INTO <schema_name>.<table_name> ([Col1_sk], [col2])
VALUES (-1, 'N/A')
GO

SET IDENTITY_INSERT <schema_name>.<table_name> OFF
GO
[/code]
Conclusion: I hope this gives you a good starting point for creating an identity insert script for a SQL Server table.
The bus matrix is a very important (if not the most important) topic for a data warehouse developer/architect, and it was great to revisit and grow my knowledge of it at the Atlanta SQL BI meeting session presented by Bill Anton (b|t).
Here are some of my notes:
Dimensional modeling is more important than ever, especially with the rise of self-service BI; it's important to lay out data in a way that is intuitive to business users.
Dimensional modeling is great! But at the same time, it's important to communicate the design to business users. The solution isn't teaching users ER diagrams, showing them simplified designs, or offloading the work of mapping data to business processes onto the business users. So, what's the solution? That's where the bus matrix comes into the picture!
A bus matrix is a two-dimensional intersection of facts & dimensions.
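As a simple hypothetical illustration (not from the presentation), a bus matrix might look something like this, with an X marking each dimension that a business process/fact uses:

[code gutter="false"]
                Date   Product   Customer   Employee
Sales             X       X         X          X
Inventory         X       X
Purchasing        X       X                    X
[/code]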
The bus matrix not only helps in communicating with business users, it is also useful to IT for project planning and documentation purposes.
In the requirements gathering phase, a bus matrix could be a two-dimensional intersection of subject areas (Sales, Inventory, etc.) by source systems, with priorities shown at the intersections. It's great for road-mapping a BI/DW project.
The bus matrix is created during the requirements gathering stage (subject areas vs. source systems) and before development (facts vs. dimensions), and it needs to be kept updated over time. After the solution is live, it can become an important piece of end-user documentation.
Also, here's a great resource page posted by the presenter (it includes his slides): http://byobi.com/blog/bus-matrix/ – check it out!
There's one thing an Excel macro and a data warehouse have in common: they help an analyst automate tedious tasks. A macro helps automate common Excel tasks; a data warehouse helps analysts automate their "data cobbling/gathering" process. Recently I worked on a task to extract business logic from an Excel 2010 macro. Here's what I did:
3) Once the Microsoft Visual Basic for Applications window opens up, you should see the macro code. Do NOT press F5 to run the macro! Instead, go to the first line of code and press F8; this will run the macro one line at a time. Open up the Excel sheet that has this macro (on a second monitor) and watch what happens! [Productivity tip: you don't need a second monitor for this, but if you have one it will boost your productivity.]
The key, as you can imagine, is to execute the macro one line at a time and watch the effect on the second monitor. It gives you a good sense of what's happening (even if you don't know how to write a macro), and you should be able to understand the macro code or document it for data modelers or BI devs so that they can encapsulate the logic in the data warehouse/cubes. Or if you're an Excel pro trying to understand someone else's macro, this trick works there too!
Note:
1) I've shown the steps that I took in Excel 2010. I haven't tested whether they work in other versions.
2) I am not an Excel macro expert, so if there is a better way, I'd be happy to learn about it. I just happened to figure this out, and it helped me understand the logic hidden in 1300+ lines of macro code.
That's about it for this post. Your feedback is most welcome!
Once in a while I write about back-to-basics topics to revisit some of the fundamental technology concepts that I've learned over the past few years. Today, we'll revisit why we use OLAP and data warehouses for business reporting systems. Let me share some of the most common reasons, and then I'll point to resources that offer others.
Let's see some of the most common reasons:
#1: Business reports should not take a lot of time to load.
From a business user's perspective: they don't want to wait for their report to populate. Reports should be fast!
If business users have to wait for data to show up on their report because of slow query response, that is bad for everyone involved. Business Intelligence solutions cannot permeate an organization if the reports take a long time to load.
So from a technology standpoint: what can you do? And why did the problem arise in the first place? Let's tackle the "why" first.
We have a bunch of database tables. To create a report, we have to summarize (aggregate) values over lots of rows (think millions) and join a few tables – and it turns out that if you run that kind of query against a transactional system (an OLTP database), you often get a slow response. In some cases, when the data model, the data size, and the queries are simple, you can run queries for operational business reports directly against the OLTP system without performance issues. But that's not always the case! If the data model, data size, and query requirements for reporting are more than the OLTP database can comfortably handle, it takes a long time returning the data that the reports need, and business users see poor performance. And the issue goes beyond a complex data model, data size, and queries: the transactional system is usually running other workloads in parallel with the reporting queries, so there's also the issue of resource contention on the OLTP database.
So that's no good, right? Not only is the OLTP system not well suited to running the queries needed for business reports, it also cannot dedicate its resources to them!
So let's create a copy of the databases and dedicate it to answering the questions that business reports ask. Now there's no resource contention, because we have dedicated resources handling the reporting load. And while we are at it – why not change the data model so that it best suits the queries needed for business reporting and analysis?
Well, that's exactly what an OLAP database is. It's a database created for business reporting and analysis. It does some neat things like pre-aggregating values, PLUS the data model in OLAP is best suited for reporting purposes. (Read more about star schemas, data marts, data warehouses, and ETL if you're curious to learn more.)
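To make that concrete, here's a minimal sketch of what a typical report query against a star schema tends to look like; the table and column names (FactSales, DimDate, DimProduct and so on) are hypothetical:

[code language="sql"]
-- Hypothetical report query over a star schema: the fact table is already at
-- the grain the report needs, so the query is a simple join plus GROUP BY.
SELECT d.CalendarYear,
       d.MonthName,
       p.ProductCategory,
       SUM(f.SalesAmount) AS TotalSales
FROM dbo.FactSales AS f
JOIN dbo.DimDate AS d ON f.DateKey = d.DateKey
JOIN dbo.DimProduct AS p ON f.ProductKey = p.ProductKey
GROUP BY d.CalendarYear, d.MonthName, p.ProductCategory;
[/code]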
OK – so that's one reason: to improve performance! Now let's see another one.
#2: Creating business reports over transactional system (OLTP) data is NOT developer-friendly:
OK, so we already covered in the previous section that creating business reports over OLTP can cause performance issues. But there's more to it than just performance. You see, the requirements of business reporting are different from the requirements of transactional systems. So? Well, that means the data model used for OLTP is best suited for transactional work and is not an optimal data model for analysis and reporting. For example, hierarchies, drill-down reports, and year-over-year growth, among other things, are handled much more efficiently by OLAP systems. If we were to use the OLTP database instead, it would take a lot of developer hours to write efficient (and correct!) SQL, mostly stored procedures, to get the OLTP system to produce the data that the business reports need. Also, common business metrics used in reporting can be stored in a cube, so that each time a report gets created, you can re-use the business metrics stored in the OLAP cube.
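As an illustration of the kind of hand-written SQL this pushes onto report developers, here's a hedged sketch of a year-over-year growth calculation against the same hypothetical star schema; in a cube, a metric like this would typically be defined once and re-used across reports:

[code language="sql"]
-- Hypothetical year-over-year growth query (SQL Server 2012+ for LAG).
WITH YearlySales AS (
    SELECT d.CalendarYear,
           SUM(f.SalesAmount) AS TotalSales
    FROM dbo.FactSales AS f
    JOIN dbo.DimDate AS d ON f.DateKey = d.DateKey
    GROUP BY d.CalendarYear
)
SELECT CalendarYear,
       TotalSales,
       TotalSales - LAG(TotalSales) OVER (ORDER BY CalendarYear) AS YoYGrowth
FROM YearlySales
ORDER BY CalendarYear;
[/code]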
OK – so an OLAP cube saves time in creating reports. Not only do OLAP cubes perform better at returning data, they also speed up the process of building reports.
That's great! Let's see one more reason:
#3: Ad-hoc reporting over OLTP systems creates confusion!
This reason is more about why we should have a data mart and data warehouse. So why does ad-hoc reporting over OLTP systems create confusion among business users?
Imagine creating reports over a LIVE system that's getting updated every second. If different users create ad-hoc (as-needed) reports, everyone sees different results, so it's important to have a common version for everyone. Also imagine everyone combining data from different data sources: if they do it differently, they see different data. And on top of that, if they create derived (or calculated) columns and their formulas differ, they again see different data. Do you see the common pattern here? There's no conformity in the data and formulas in the reports that get created. What does that cause? Confusion! So what's needed is what data warehousing people call a "single version of the truth". OLAP cubes (which get their data from the data warehouse) provide that common single data source for everyone, and thus conformity in the data is maintained when creating business reports.
Also, while we are at it, one more consideration: reports typically require historical data at an aggregated level. We don't want to store every transaction from the last 10 years in an OLTP database, do we? No, right? In such cases, the historical data is aggregated based on the requirements and stored in data marts (or the data warehouse), which are later consumed by the OLAP cubes; that way the OLTP database does not have to store a lot of historical data.
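As a rough illustration (the table names are hypothetical, and the real logic would live in your ETL process), that kind of summarization might look like this:

[code language="sql"]
-- Hypothetical monthly summarization of transaction-level history into a data mart table.
INSERT INTO mart.MonthlySalesSummary (CalendarYear, CalendarMonth, ProductID, TotalSales)
SELECT YEAR(soh.OrderDate),
       MONTH(soh.OrderDate),
       sod.ProductID,
       SUM(sod.LineTotal)
FROM oltp.SalesOrderHeader AS soh
JOIN oltp.SalesOrderDetail AS sod ON sod.SalesOrderID = soh.SalesOrderID
GROUP BY YEAR(soh.OrderDate), MONTH(soh.OrderDate), sod.ProductID;
[/code]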
OK – that's one more reason OLTP systems are a poor fit for business reporting and analysis, and why we need data marts (a data warehouse) and OLAP cubes.
That's about it for this post. Let me point you to related resources:
Business people may not realize that a business analytics project can involve significant effort on under-the-hood technical tasks like data cleaning, data integration, building a data warehouse, creating ETL processes, and gathering business requirements, among others. That explains the title of this post: a business analytics project is like an iceberg. A business person may see only the tool used to visualize the data and not realize the work that went into making it "analytics-ready". From a project management standpoint, before a project is initiated, these different aspects of the project need to be communicated to the business stakeholders so that they are aware of the effort involved in building an analytics solution. And with that, here's the summary of this discussion in the form of an image:
[Resource] Getting Started with Microsoft HDInsight http://bit.ly/Wru0zv #Hadoop #Microsoft
Reading: “Next Generation Data Warehouse Platforms: #TDWI 2009 Q4 Best Practices Report”; Interesting stats on the total data volume that organizations manage: http://on.fb.me/ZnQ0O1
Presentation Tips: When you move, they look at you. When you stop, they look at the screen – http://bit.ly/uQ5Amf
“In pioneer days they used oxen for heavy pulling, and when one ox couldn’t budge a log, they didn’t try to grow a larger ox. We shouldn’t be trying for bigger computers, but for more systems of computers”