The bus matrix is a very important (if not the most important) topic for a Data Warehouse Developer/Architect, and it was great to revisit and grow my knowledge around this topic at the Atlanta SQL BI meeting presented by Bill Anton (b|t).
Here are some of my notes:
Dimensional modeling is more important than ever, especially with the rise of self-service BI. It's important to lay out data in a way that's intuitive to business users.
Dimensional modeling is great! But at the same time, it's important to communicate the model to business users. The solution doesn't involve teaching ER diagrams to users, simplifying the design, or offloading the work of mapping data to business processes onto business users. So what's the solution? That's where the bus matrix comes into the picture!
A bus matrix is a two-dimensional intersection of facts and dimensions.
A bus matrix not only helps in communicating with business users, it can also be useful to IT for project planning and documentation purposes.
In the requirements-gathering phase, a bus matrix could be a two-dimensional intersection of subject areas (Sales, Inventory, etc.) by source systems, with priorities shown at each intersection. It's great for road-mapping a BI/DW project.
The bus matrix is created during the requirements-gathering stage (subject areas vs. source systems) and before development (facts vs. dimensions), and it needs to be kept updated over time. After the solution is live, it can serve as an important piece of end-user documentation.
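To make the idea concrete, here's a minimal sketch (not from the presentation; the facts and dimensions are made up for illustration) of a bus matrix built with Python and pandas: business-process rows, conformed-dimension columns, and an X wherever a dimension participates in a fact.

```python
import pandas as pd

# Hypothetical business processes (facts) and conformed dimensions
dimensions = ["Date", "Product", "Customer", "Store"]
facts = {
    "Sales":     ["Date", "Product", "Customer", "Store"],
    "Inventory": ["Date", "Product", "Store"],
    "Returns":   ["Date", "Product", "Customer", "Store"],
}

# Build the bus matrix: an "X" marks every fact/dimension intersection
bus_matrix = pd.DataFrame(
    [["X" if d in dims else "" for d in dimensions] for dims in facts.values()],
    index=list(facts),
    columns=dimensions,
)
print(bus_matrix)
```

The same grid could just as easily carry priorities (subject areas by source systems) during requirements gathering.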
Also, here's a great resource page posted by the presenter (it includes his slides): http://byobi.com/blog/bus-matrix/ – check it out! It has some great resources in there.
Power BI is an exciting new technology in the business analytics space from Microsoft. I've played with its current preview version and attended a couple of sessions on Power BI at PASS Summit 2013. Based on my first impressions, I noted down the problems that Power BI solves. Note that as of today it's in preview, so information around cost is not available yet, but I'm trying to learn and understand as much as I can today about how Power BI is going to help business users and power users in the future. As a part of that, I'm attending the Business Analytics VC's session on "Power BI Info Management and Data Stewardship" by Matthew Roche & Ofer Ashkenazi on Nov 7th at 12 PM EST.
Topic: Power BI Info Management and Data Stewardship
“Business intelligence tools continue to improve, letting users shorten their time to insight and take that insight to more devices in more places. But this evolution of BI doesn’t change one fundamental fact of information management: You can’t gain insight from data you can’t access.
In this session, Matthew Roche and Ofer Ashkenazi will introduce the role of the data steward and the self-service information management capabilities included in Power Query and Power BI for Office 365, focusing on how Power BI empowers business users to add value to the organization.”
Closing note:
I recently volunteered at the Business Analytics VC as VP of Marketing, so it's in my interest to spread the word about the event, but I would not spread the word about something unless it personally excites me! 🙂
I hope to see you at the session, and if for some reason you cannot make it, we usually record sessions, so you can check out the meeting archives section of the PASS BA VC site after the event.
Data Explorer lets you "explore" (search) for web-based public data. This is a great way to combine the data you have in your own data sources with public data sets for analysis. Sometimes your data might not tell you the reason behind the observed trends; when that happens, you can see whether a public data set gives you the much-needed context. Before we start hands-on with Data Explorer, let me give you an example so you have a better understanding of the importance of public data sets. Here's a sample that I found here. So, here's a demo:
An auto company sees sales trends for hybrid cars and SUVs in its sales data sources. But what is the reason behind them? The company's own data does not show that. Someone hypothesizes that it might be gas prices, so they test the hypothesis by combining gas price information available via public data. And it turns out gas prices might be the driving force behind the sales trends! See:
If gas prices increase, then sales of SUVs go down and sales of hybrids go up:
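To show how you might test a hypothesis like this once the two data sets are combined, here's a small sketch with made-up numbers (the real gas prices would come from the public data set); it simply checks the correlation between gas prices and sales.

```python
import pandas as pd

# Made-up monthly figures purely to illustrate the hypothesis test;
# real gas prices would come from a public data set
df = pd.DataFrame({
    "gas_price":    [3.20, 3.45, 3.60, 3.85, 4.10, 4.30],
    "suv_sales":    [950, 910, 870, 800, 760, 700],
    "hybrid_sales": [400, 430, 470, 520, 580, 640],
})

# Negative correlation: SUV sales fall as gas prices rise
print(df["gas_price"].corr(df["suv_sales"]))
# Positive correlation: hybrid sales rise with gas prices
print(df["gas_price"].corr(df["hybrid_sales"]))
```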
Now you know that public data can be helpful! So how can you search for public data sets? You can manually search online, ask someone, or browse through public data repositories like Azure Data Market (and other data markets); there's also a public data search engine. Or you can search for them directly via Data Explorer.
Here are the steps:
1) Excel 2010/2013 > Data Explorer Tab > Online Search > type “Tallest Buildings”
2) I selected one of the data-sets that said “Tallest completed building…. ”
3) Now let’s do some filtering and shaping. Here are the requirements:
– Hide columns: Image, Notes & Key
– Clean the column that has height data
– Show only the city name in the Location column
OK, let’s get to this one by one!
4) Hiding Columns:
Click on the Filter & Shape button in the Query Settings pane:
Select Image Column > Right Click > Hide:
Repeat the steps for the Notes & Key columns.
Click on DONE
5) Clean the column that has height data.
Click on Filter & Shape to open the query editor
A) Let's rename it. Select the column > right-click > Rename to Height > press ENTER
B) Let's remove the values in brackets. Select the column > right-click > Split Column > By Delimiter > At each occurrence of the delimiter > Custom and enter "(" > OK
This should transform the data like this:
Hide Height.2 and rename Height.1 to Height.
Click on DONE
6) Let's keep just the city names in the Location column.
Click on Filter & Shape to load the query editor:
A) Select Location > right-click > Split Column > By Delimiter > Custom and enter ° in the text box, like this:
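If it helps to see the same shaping expressed as code, here's a rough pandas equivalent of the steps above. The column names and sample values are assumptions for illustration only, not the actual "tallest buildings" data set.

```python
import pandas as pd

# A tiny stand-in for the "Tallest completed buildings" result;
# column names and values are assumptions for illustration only
df = pd.DataFrame({
    "Building": ["Burj Khalifa", "Shanghai Tower"],
    "Image":    ["img1", "img2"],
    "Height":   ["828 m (2,717 ft)", "632 m (2,073 ft)"],
    "Location": ["Dubai 25°11′50″N", "Shanghai 31°14′08″N"],
    "Notes":    ["...", "..."],
    "Key":      [1, 2],
})

# Hide columns: Image, Notes & Key
df = df.drop(columns=["Image", "Notes", "Key"])

# Clean the height column: split on "(" and keep the part before it
df["Height"] = df["Height"].str.split("(").str[0].str.strip()

# Keep only the city name in Location: split on the degree symbol,
# then drop the trailing coordinate digits
df["Location"] = (
    df["Location"].str.split("°").str[0]
      .str.replace(r"\s*\d+$", "", regex=True)
      .str.strip()
)
print(df)
```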
Visual analytics is amazing – it helps "data enthusiasts" save time in answering questions using data. Let's see one such example. For the purpose of this blog post, I am going to show how to do it in Excel 2010:
Problem:
Here's the business question: What were the sales of Tea in the North region in 2012 Q1?
Here’s the data:
SALES DATA (2012 Q1)

             East         West         Central      North        South
Coffee       $7,348.00    $7,238.00    $1,543.00    $9,837.00    $1,823.00
Tea          $9,572.00    $8,235.00    $3,057.00    $8,934.00    $13,814.00
Herbal Tea   $5,782.00    $8,941.00    $9,235.00    $392.00      $1,268.00
Espresso     $9,012.00    $2,590.00    $4,289.00    $7,848.00    $340.00
So it's easy to give the answer using the data: $8,934.
But let me CHANGE the business question:
WHICH Products in WHAT regions are doing the best?
Now this question is not as easy as the previous one. Why? Because you'll have to manually go through each number in a linear fashion to answer it. Now imagine a bigger data set; it'll take even more time.
Solution
What can Excel power users and data enthusiasts do to answer the new business question in an efficient way? Well, let's see what conditional formatting can do:
Now with the data bars, it's easier to just glance at the report and see the best-performing products and regions. For instance, it's very easy to spot that Tea in the South is performing best across all products and regions.
So how do you create data bars?
1. Select the data
2. Home > Conditional Formatting > Data Bars
3. Done! You'll see this:
4. You can play with the other options here to see what suits your needs best. But I just wanted to point out that there is a way to highlight the data that helps you save time in answering business questions.
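If you ever need to apply the same data-bar formatting programmatically (say, when a report workbook is regenerated on a schedule), here's a minimal sketch using the openpyxl library; the output file name and cell range are assumptions that match the sample table above, not part of the original demo.

```python
from openpyxl import Workbook
from openpyxl.formatting.rule import DataBarRule

wb = Workbook()
ws = wb.active

# Sample sales data from the table above
rows = [
    ["", "East", "West", "Central", "North", "South"],
    ["Coffee", 7348, 7238, 1543, 9837, 1823],
    ["Tea", 9572, 8235, 3057, 8934, 13814],
    ["Herbal Tea", 5782, 8941, 9235, 392, 1268],
    ["Espresso", 9012, 2590, 4289, 7848, 340],
]
for row in rows:
    ws.append(row)

# Add data bars over the numeric range (equivalent to
# Home > Conditional Formatting > Data Bars in Excel)
rule = DataBarRule(start_type="min", end_type="max", color="638EC6")
ws.conditional_formatting.add("B2:F5", rule)

wb.save("sales_data_bars.xlsx")  # hypothetical output file
```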
Conclusion:
Visual analytics is a great way to quickly analyze data. In most cases, the human brain is much faster at interpreting visual results than text and numbers, so why not use that to your advantage? And tools like Excel have built-in functionality to help you do just that!
Business people may not realize that a business analytics project can involve significant effort for under-the-hood technical tasks like data cleaning, data integration, building a data warehouse, creating ETL processes, and gathering business requirements, among other tasks. That explains the title of this blog post: a business analytics project is like an iceberg. A business person may see only the tool used to visualize the data, but may not realize the work that went into making it "analytics-ready". From a project management standpoint, before a project is initiated these different aspects of the project need to be communicated to the business stakeholders so that they know the effort involved in building an analytics solution. With that, here's a summary of this discussion in the form of an image:
One of the key things I've learned is the importance of differentiating the concepts of "Data Reporting" and "Data Analysis". So let's first see them visually:
Here's the logic for putting Data Reporting INSIDE Data Analysis: if you need to do "analysis", then you need reports. But you do not necessarily have to do data analysis if you want to do data reporting.
From a process standpoint, Here’s how you can visualize Data Reporting and Data Analysis:
Let's think about this for a moment: why do we need "analysis"?
We need it because TOOLS are really great at generating data reports. But it requires a HUMAN BRAIN to translate those "data points/reports" into "business insights". This process of seeing the data points and translating them into business insights is the core of data analysis. Here's how it looks visually:
Note that after performing data analysis, we have information like trends and insights, action items or recommendations, and estimated impact on the business, all of which creates business value.
I just researched machine-generated data in the context of "Big Data"; here's the list I compiled:
– Data sent from Satellites
– Temperature sensing devices
– Flood Detection/Sensing devices
– Web logs
– Location data
– Data collected by toll sensors (context: road tolls)
– Phone call records
– Financial data
And a Futuristic one:
Imagine sensors on human bodies that continuously "monitor" health. What if we used them to detect diabetes, cancer, and other diseases in their early phases? Possible? Maybe!
Interesting Fact:
Machines can generate data "faster" than humans. This characteristic makes it interesting to think about how to analyze machine-generated data and, in some cases, how to analyze it in real time or near real time.
Ending Note:
Search for machine-generated data and you'll find much more; it's worth reading about in the context of Big Data.
In this post, we'll see the matching activity in action. For demo purposes, I'll be using data sets that I found via the Movies Sample & EIM tutorial.
Situation 1:
We have a list of movies and we want to identify "matching" movie titles.
Solution 1:
Create a Matching Policy
1) Mapping the Domains:
2) Configuring the Matching Policy:
Note: You can have Matching Rules on more than one domain. I used one domain for demo purposes.
3) View Matching results:
4) Publish the KB (which stores the matching policy)
Once you have the matching policy, you can use this in a Data Quality Project:
5) See how I ran a Data Quality Project (w/ matching activity) in the image below.
Note: You can export the clean data-set via Data Quality Project.
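DQS does the matching for you, but if it helps to see the underlying idea, here's a conceptual sketch (not the DQS algorithm) that flags movie titles as "matching" when a simple similarity score crosses a threshold, using Python's difflib. The titles and threshold are made up for illustration.

```python
from difflib import SequenceMatcher

# Hypothetical movie titles with slight variations
titles = ["The Godfather", "Godfather, The", "The God Father", "Jaws"]

def similarity(a: str, b: str) -> float:
    """Rough 0-1 similarity score between two strings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Flag pairs whose similarity crosses a threshold (a matching rule in DQS
# similarly lets you set a minimum matching score)
THRESHOLD = 0.8
for i in range(len(titles)):
    for j in range(i + 1, len(titles)):
        score = similarity(titles[i], titles[j])
        if score >= THRESHOLD:
            print(f"match: {titles[i]!r} ~ {titles[j]!r} ({score:.2f})")
```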
Situation 2:
We have a list of supplier names and we want to identify "matching" supplier names.
Note that in this situation, you'll see how to use more than one domain to create a matching rule.
Solution 2:
Most of the steps remain the same as in Situation 1, but I want to show you the matching policy & matching results:
Matching Policy:
Matching results:
Also, I want to show that the matching policy (DQS KB) we created earlier can be used in Master Data Services too! For more details, check out this resource: DQS, MDS & SSIS for EIM.
Conclusion:
In this post, we saw how DQS can be used to identify "matching" records. For a step-by-step tutorial, please refer to the Movies Sample & EIM tutorial.
Data profiling in Data Quality Services happens at the following stages:
1) While performing Knowledge Discovery activity
1a. In the Discover step:
1b. Also in the Manage Domain Values step:
While profiling gives you statistics at various stages in the data cleansing or matching process, it is important to understand what you can do with them. With that, here are the statistics we can gather during the knowledge discovery activity:
Newness
Uniqueness
Validity
Completeness
2) While Performing Cleansing activity:
2a. On the Cleansing step:
2b. Also on the Manage and View Results step:
Here the profiler gives you the following statistics:
Corrected values
Suggested Values
Completeness
Accuracy
Note the invalid records under "Source Statistics" on the left side. In this case, 3 records didn't pass the domain rule.
3) While performing Matching Policy activity (Knowledge Base Management)
3a. Matching policy step:
3b. Matching Results step:
Here the profiler gives the following statistics:
newness
uniqueness
number of clusters
% of matched and unmatched records
avg, min & max cluster size
4) While performing Matching activity (Data Quality Project)
4a. Matching step:
4b. Export step:
Here the profiler gives the following statistics:
newness
uniqueness
completeness
number of clusters
% of matched and unmatched records
avg, min & max cluster size
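For readers who want a feel for what statistics like completeness and uniqueness mean, here's a small sketch computing rough equivalents on a single column with pandas; the sample data and formulas are illustrative, not DQS's internal calculations.

```python
import pandas as pd

# Illustrative column with a missing value and a duplicate
values = pd.Series(["WA", "OR", None, "CA", "CA"], name="State")

total = len(values)
completeness = values.notna().sum() / total        # share of non-empty records
uniqueness = values.nunique(dropna=True) / total   # share of distinct values
print(f"completeness: {completeness:.0%}, uniqueness: {uniqueness:.0%}")
```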
Conclusion:
In this post, I listed the statistics provided by the profiler while performing the knowledge discovery, cleansing, matching policy, and matching activities in SQL Server 2012 Data Quality Services.
2) Go to datamarket.azure.com. I subscribed to "Address Check – Verify, Correct, Geocode US and Canadian Addresses Data"; we'll see how to use this in the next steps.
3) Note that as of now, we can only use reference data sets from Azure Data Market. However, this MSDN thread: http://social.msdn.microsoft.com/Forums/hu-HU/sqldataqualityservices/thread/750faef8-dd69-4a71-b0c1-18ca2f93d59d suggests that we'll have the ability to provide our own (private/self-provided) reference data sets in upcoming service updates. So for now we'll have to connect to Azure Data Market for reference data sets, and for the purpose of this demo, I connected to Melissa Data's Address Check.
4) Now via DQS client, let’s create a Knowledge Base!
5) I've created the following domains:
Address Line
City
State
Zip
And a composite domain, Full Address, which contains the Address Line, City, State, and Zip domains.
6) And for the composite domain Full Address, this is how you configure the reference data set:
7) After creating the knowledge base, start a new DQS project. Here the cleansing happens at the composite domain level, and this is a demo of how DQS uses a third-party reference data set to classify records as suggested, new, invalid, corrected, or correct:
You can see that DQS corrected a few records and notes that the reason in this case was Melissa Data's address verification reference data set:
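Conceptually, this classification works like comparing each incoming value against a trusted reference list. Here's a toy sketch (not how DQS or Melissa Data actually works, and the reference values are made up) just to illustrate the idea of "correct" vs. "corrected" vs. "new" records.

```python
# Toy reference list of "known good" city names (a stand-in for a
# reference data service; not how DQS/Melissa Data actually works)
reference = {"seattle": "Seattle", "portland": "Portland"}

def classify(value: str) -> tuple[str, str]:
    key = value.strip().lower()
    if key in reference:
        canonical = reference[key]
        # Exact match -> correct; match after normalization -> corrected
        return ("correct", value) if value == canonical else ("corrected", canonical)
    return ("new", value)

for v in ["Seattle", "portland ", "Spokane"]:
    status, output = classify(v)
    print(v, "->", status, output)
```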