There are two types of things I learned in graduate school:
1. Useful!
2. Not useful! (useless)
This post is NOT about the useless learnings! So let me share one of the most useful things I learned in my two years at a school of management: how to solve business problems. Sounds clichéd, but that was, I think, one of the most important skills I picked up there. In particular, I learned about frameworks used to solve business problems. One of them is called “MECE”, and that’s what I want to share with you in this post.
(Side-note: Most folks learn this at some strategy consulting firm like McKinsey but unlike them, I learned about it in school)
Before we begin, I want to share why you should care, and then I’ll talk about what it is.
No matter which team you work for, you are solving problems. You wouldn’t have a job if you’re not doing that — so why not get better at it?
If you want to find the root cause of a business problem (& find the solution faster!), you need to break it down. To break it down, you need to structure it. Now, there are many ways (or frameworks) to structure a problem; MECE is one of the most effective frameworks out there. So let’s learn about it:
(side-note: MECE framework may sound like a simple idea BUT it’s NOT easy to apply!)
What is MECE?
It’s an acronym that stands for “Mutually Exclusive and Collectively Exhaustive”, which means that when you break a problem into sub-items, those sub-items should:
1. Not overlap with each other (mutually exclusive)
2. Add up to cover every possibility (collectively exhaustive)
Let’s take an example:
Say that you are asked to analyze “why is Profitability declining?”
Here’s a non-MECE way:
Find the top 10% most profitable products [does NOT pass the collectively exhaustive test]
Out of them, find products that have declining profits
Try to find reasons why those products would have declining profits
Here’s a MECE way:
Visual for MECE principle
Break it down into Revenue & Cost
Let’s start with cost and say it’s constant, so revenue must be going down for profits to decline
Further break down revenue into 1) revenue from all non-USA locations and 2) revenue from USA locations (note the use of the MECE principle here)
Let’s say that revenue for non-USA locations is increasing; then it must be the USA locations that are the problem! (Note how effectively we are able to narrow down and find the root cause faster!)
Let’s further break down USA locations by product category… Continue breaking down the sub-items in a MECE way till you find the root cause
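The drill-down above can be sketched in a few lines of code. All of the numbers here are invented for illustration; the point is how a MECE breakdown (Profit = Revenue − Cost, Revenue = USA + non-USA) lets you eliminate slices until only the root cause remains:

```python
# Hypothetical figures ($M), broken down in a MECE way:
# Profit = Revenue - Cost; Revenue = USA + non-USA (no overlap, nothing missed).
profit_tree = {
    "revenue": {"usa": {"2013": 5.0, "2014": 4.2},       # declining
                "non_usa": {"2013": 3.0, "2014": 3.4}},  # increasing
    "cost": {"2013": 6.0, "2014": 6.0},                  # constant -> not the problem
}

def declining(series):
    """True if the 2014 figure dropped versus 2013."""
    return series["2014"] < series["2013"]

# Cost is flat, so inspect each MECE slice of revenue.
suspects = [region for region, series in profit_tree["revenue"].items()
            if declining(series)]
print(suspects)  # ['usa'] -- only the declining slice remains
```

Because the slices are mutually exclusive and collectively exhaustive, whatever survives the elimination must contain the root cause.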
I hope that gives you a good overview of the MECE principle.
MECE is one of the few truly effective frameworks you can use to solve a business problem. If you want to get better at structuring your ideas (to solve business problems), consider practicing MECE; there are ample resources available online to help you master it!
There are many techniques to analyze data. In this post, we’re going to talk about two techniques that are critical for good data analysis! They are called “Benchmarking” and “Segmentation” techniques – Let’s talk a bit more about them:
Benchmarking means that when you analyze your numbers, you compare them against some point of reference. This quickly adds context to your analysis and helps you assess whether a number is good or bad. This is super important! It adds meaning to your data!
Let’s look at an example. A CEO wants to see revenue numbers for 2014, and an analyst is tasked with creating the report. If you were the analyst, which report do you think would resonate more w/ the CEO? Left or right?
I hope the above example helped you understand the importance of providing context w/ your data.
Now, let’s briefly talk about where you get the data for a benchmark.
There are two main sources: 1) Internal & 2) External
The example that you saw above was using an Internal source as a benchmark.
An example of an external benchmark could be subscribing to industry news/data so that you understand how your business is running compared to similar businesses. If your business sees a huge spike in sales, you need to know whether it’s just your business or an industry-wide phenomenon. For instance, in Q4 most e-commerce sites see a spike in their sales – they would only understand what’s driving it by looking at industry data and realizing that it’s shopping season!
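Here’s a minimal sketch of the idea, with all figures invented: a raw revenue number means little until it’s compared against an internal benchmark (last year) and an external one (industry growth):

```python
# Hypothetical figures: this year's revenue, last year's revenue (internal
# benchmark), and industry growth (external benchmark).
revenue_2014 = 12.0      # $M, this year
revenue_2013 = 10.0      # $M, internal benchmark
industry_growth = 0.05   # 5%, external benchmark

# Year-over-year growth against the internal benchmark.
our_growth = (revenue_2014 - revenue_2013) / revenue_2013
print(f"Revenue grew {our_growth:.0%}")  # Revenue grew 20%

# Context from the external benchmark: is that good for our industry?
print("Beat the industry" if our_growth > industry_growth else "Lagged the industry")
```

The same $12M reads very differently if the industry grew 5% versus 50%; the benchmark is what turns the number into a judgment.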
Now, let’s shift gears and talk about technique #2: Segmentation.
Segmentation means that you break your data into categories (a.k.a. segments) for analysis. So why do we want to do that? Looking at data at an aggregated level is certainly helpful and helps you figure out the direction for your analysis, but the real magic & powerful insights are usually derived by analyzing the segments (or subsets of the data).
Let’s take a look at an example.
Let’s say the CEO of a company looks at profitability numbers. He sees $6.5M, which is $1M more than last year’s – so that’s great news, right? But does that mean everything is fine and there’s no scope for optimization? That can only be found out if you segment your data. So he asks his analyst to look at the data for him. The analyst goes back and, after some experimentation & interviews w/ business leaders, finds an interesting insight by segmenting the data by customer & sales channel! He finds that even though the company is profitable, there is a huge opportunity to optimize profitability for customer segment #1 across all sales channels (especially channel #1, where there’s a $2M+ loss!) Here’s a visual:
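The kind of analysis the analyst did can be sketched like this (the transactions below are invented to match the story): group profit by customer segment and sales channel, and a loss that was invisible in the aggregate pops out:

```python
from collections import defaultdict

# Toy transactions: (customer_segment, sales_channel, profit in $M). Invented numbers.
transactions = [
    ("segment_1", "channel_1", -2.5),
    ("segment_1", "channel_2", 0.5),
    ("segment_2", "channel_1", 4.0),
    ("segment_2", "channel_2", 4.5),
]

# Segment the data: total profit per (segment, channel) pair.
profit_by_segment = defaultdict(float)
for segment, channel, profit in transactions:
    profit_by_segment[(segment, channel)] += profit

total = sum(profit_by_segment.values())
losses = {k: v for k, v in profit_by_segment.items() if v < 0}
print(f"Total profit: ${total}M")  # healthy in aggregate...
print(losses)                      # ...but one segment/channel is losing money
```

The aggregate ($6.5M) looks great, yet the segmented view exposes the $2.5M loss in segment 1 / channel 1 that is worth fixing.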
I hope that helps to show that segmentation is a very important technique in data analysis!
In this post, we saw benchmarking & segmentation techniques that you can apply in your daily data analysis tasks!
A Business Intelligence (BI) system for Sales is being developed at a company. Here are the events that occur:
1) Based on the requirements, it is documented that the business needs to analyze sales numbers by product, month, customer & employee.
2) While designing the system, IT learns that the data is stored at the invoice level, but since the requirements document doesn’t say anything about having details down to the invoice level, they decide to aggregate the data before bringing it into their system.
3) They develop the BI system within the time frame and send it to the business for data validation.
4) Business analysts start looking at the BI system and find some numbers that don’t look right for a few products. They need to see the invoices for those products to make sure the data is right, so they ask IT to give them invoice-level data.
5) IT realizes that even though the business had not requested invoice-level data explicitly, they do NEED the lowest-level data! It’s crucial for passing data validation. Also, they talk with the business analysts and find out that they may sometimes need to drill down to the lowest-level data to find insights that may be hidden at the aggregate level.
6) So IT decides to rework their solution. This increases the timeline & budget set for the project. Not only that, they have lost an opportunity to gain the confidence of the business by missing the budget and timeline.
7) They learn to “design the BI system to have the lowest-level data even if it’s not asked for!” and decide never to make this mistake again.
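The reason the lesson holds is that aggregation is a one-way street, which a few lines of code can show (the invoices here are invented for illustration):

```python
from collections import defaultdict

# Invented invoice-level rows: (invoice_id, month, product, amount).
invoices = [
    (1, "2014-01", "widget", 100.0),
    (2, "2014-01", "widget", 250.0),
    (3, "2014-02", "gadget", 400.0),
]

# Rolling up to the monthly grain the requirements asked for is easy...
monthly = defaultdict(float)
for _, month, _, amount in invoices:
    monthly[month] += amount
print(dict(monthly))  # {'2014-01': 350.0, '2014-02': 400.0}

# ...but given only `monthly`, there is no way to recover which invoices
# made up the 350.0 -- which is exactly what data validation needed.
```

Keep the lowest grain and every higher-level view can be derived from it; keep only the aggregate and the detail is gone for good.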
This concludes the post: include the lowest-level data in your BI system even if it’s not explicitly requested – this will save you time & build your credibility as a Business Intelligence developer/architect.
Classification algorithms are commonly used to build predictive models. Here’s what they do (simplified!):
Now, here’s the difference between multi-class and two-class:
If your test data needs to be classified into two classes, then you use a two-class classification model.
1. Is it going to Rain today? YES or NO
2. Will the buyer renew his soon-to-expire subscription? YES or NO
3. What is the sentiment of this text? Positive OR Negative
As you can see from the above examples, the test data needs to be classified into two classes.
Now, look at example #3 – what is the sentiment of the text? What if you also want an additional class called “neutral”? Now there are three classes, and we’ll need to use a multi-class classification model. So, if your test data needs to be classified into more than two classes, then you use a multi-class classification model.
1. Sentiment analysis of customer reviews? Positive, Negative, Neutral
2. What is the weather prediction for today? Sunny, Cloudy, Rainy, Snow
I hope the examples helped. Next time you have to choose between multi-class and two-class classification models, ask yourself – does the problem ask you to predict two classes or more? Based on that, you’ll pick your model.
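That decision rule is simple enough to write down. Here’s a toy sketch (the label sets are from the examples above) that maps the number of distinct classes to the model family:

```python
def model_type(labels):
    """Pick the model family from the number of distinct classes to predict."""
    n_classes = len(set(labels))
    return "two-class" if n_classes == 2 else "multi-class"

# Sentiment with two possible outcomes -> two-class model
print(model_type(["positive", "negative"]))             # two-class

# Add a "neutral" outcome -> three classes -> multi-class model
print(model_type(["positive", "negative", "neutral"]))  # multi-class
```

The same check works for the weather example: {Sunny, Cloudy, Rainy, Snow} has four classes, so it calls for a multi-class model.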
Take a look at the following chart, do you see any issues with it?
Notice that the month values are shown as “discrete” values instead of “continuous” values, and this misleads the person looking at the chart. Agree? Great! You already know, based on your instincts, what continuous and discrete values are; we just need to label what you already know.
In the example above, the “Date & Time” shown as “Sales Date” is a continuous value, since you can never say the “exact” time the event occurred… 1/1/2008, 22 hours, 15 minutes, 7 seconds, 5 milliseconds… and it goes on… it’s continuous.
But let’s say you wanted to see Number of Units Sold vs. Product Name. Now that’s countable, isn’t it? You can say that we sold 150 units of product X and 250 units of product Y. In this case, units sold is a discrete value.
The chart shown above was treating Sales Date as discrete values and hence causing confusion… let’s fix it, now that you know the difference between continuous and discrete variables:
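The same confusion shows up in code. Treated as discrete string labels, dates sort (and therefore plot) in the wrong order; parsed as actual dates on a continuous axis, they line up chronologically. A small illustration with made-up dates:

```python
from datetime import datetime

sales_dates = ["1/10/2008", "1/2/2008", "1/1/2008"]

# Treated as discrete string labels, the order is alphabetical -- misleading:
print(sorted(sales_dates))  # ['1/1/2008', '1/10/2008', '1/2/2008']

# Parsed as continuous date values, the order is chronological:
parsed = sorted(sales_dates, key=lambda d: datetime.strptime(d, "%m/%d/%Y"))
print(parsed)  # ['1/1/2008', '1/2/2008', '1/10/2008']
```

Note how Jan 10 lands before Jan 2 in the string sort, which is exactly the kind of misordering the chart above suffered from.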
To develop effective data visualizations, it’s important to understand the data types of your data. In this post, you saw the difference between continuous and discrete variables and their importance in data visualization.
Once in a while I write about back-to-basics topics to revisit some of the fundamental technology concepts that I’ve learned over the past few years. Today, we’ll revisit why we use OLAP and data warehouses for business reporting systems. Let me share some of the most common reasons and then point you to resources that offer others.
Let’s see some of the most common reasons:
#1: Business reports should not take a lot of time to load.
From a business user’s perspective: they don’t want to wait for their report to populate. Reports should be fast!
But if business users have to wait for data to show up on their reports because of slow query responses, that’s bad for everyone involved. Business intelligence solutions cannot permeate an organization if the reports take a lot of time to load:
So, from a technology standpoint: what can you do? And why did the problem arise in the first place?
Let’s first see why the problem occurs.
We have a bunch of database tables. To create a report, we have to summarize (aggregate) values in lots of rows (think millions) and join a few tables. It turns out that if you run such queries against a transactional system (database/OLTP), you’ll often get a slow response. In some cases, if the data model, data size, and queries are simple, you can just run a query to create operational business reports and you won’t see any performance issues. But that’s not always the case! If the data model, data size, and query requirements for reporting are not simple for the OLTP database to handle, the database takes a long time to return the data that the business reports require, and business users see bad performance. And the issue goes beyond complex data models, data sizes, and queries: transactional systems may be running other tasks in parallel with returning data to business reports. So there’s also the issue of resource contention on the OLTP database.
So that’s no good, right? Not only is the OLTP system bad at running the queries needed for business reports, it also does not dedicate its resources to us!
So let’s create a copy of the databases and dedicate it to answering questions from business reports. Then there’s no issue of resource contention, since we have dedicated resources to handle the load. And while we’re at it – why don’t we change the data model so that it best suits the queries needed for business reporting and analysis?
Well, that’s exactly what an OLAP database is. It’s a database created for business reporting and analysis. It does some neat things, like pre-aggregating some values, PLUS the data model in OLAP is also best suited for reporting purposes. (Read about star schemas, data marts, data warehouses, and ETL if you’re curious to learn more.)
OK – so that’s one reason: To improve performance! Now let’s see another one.
#2: Creating business reports over transactional system (OLTP) data is NOT developer-friendly:
OK, so we already covered in the previous section that creating business reports over OLTP can cause performance issues. But there’s more to it than just performance. You see, the requirements of creating business reports are different from the requirements of transactional systems. So? Well, that means the data model used for OLTP is best suited for transactional systems, and it is not an optimal data model for analysis and reporting purposes. For example: creating hierarchies, drill-down reports, and year-over-year growth, among other things, is handled much more efficiently by OLAP systems. If we were to use the OLTP database, it would take a lot of developer hours to write efficient (and correct!) SQL commands (mostly stored procedures) to get OLTP to return the data that the business reports need. Also, some of the common business metrics used in reporting can be stored in a cube, so that each time a report gets created, you can reuse the business metrics stored in the OLAP cube.
OK – so OLAP cubes save time (in creating reports).
Not only do OLAP cubes perform better at returning data, they also help us speed up the process of creating reports.
That’s great! let’s see one more reason:
#3: Ad-hoc reporting over OLTP systems creates confusion!
This reason is more about why we should have a data mart and data warehouse. So why does ad-hoc reporting over OLTP systems create confusion among business users?
Imagine creating reports over a LIVE system that’s getting updated every second. If ad-hoc (as-needed) reports are being created by different users, everyone sees different results, so it’s important to have a common version for everyone. Also imagine everyone combining data from different data sources: if they do it differently, they see different data. And if they create derived (or calculated) columns and their formulas differ, they again see different data. Do you see the common pattern here? There’s no conformity in the data & formulas in the reports that get created. What does that cause? Confusion! So what’s needed is what the data warehouse world calls a “single version of the truth”. OLAP cubes (which get their data from the data warehouse) provide that common single data source for everyone, and thus conformity in the data is maintained while creating business reports.
Also, while we’re at it, one more consideration: reports typically require historical data at an aggregated level. We don’t want to store every transaction from the last 10 years in an OLTP database, do we? NO! Right? In such cases, the historical data is aggregated based on requirements and stored in data marts (/data warehouse), which are later consumed by the OLAP cubes; that way, OLTP databases do not have to store a lot of historical data.
OK – that’s one more reason OLTP systems are bad at business reporting and analysis, and why we need data marts (data warehouses) and OLAP cubes.
That’s about it for this post.
SPEED is one of the most important aspects of data analysis. Wouldn’t it be great if, when you query a data source, you got your answers as soon as possible? Yes? Right! Of course, it depends on factors like the size of the data you’re querying, but wouldn’t it be great if it happened at the “SPEED OF THOUGHT”?
So Here’s the Problem:
Databases are mostly disk-based, so the bottleneck is the speed at which we can get data off the disks.
So what can you do?
Let’s put the data in RAM (memory), because data access via memory is faster.
If it sounds so easy, why didn’t people do it earlier? And why are we talking about “in-memory” NOW?
1) BIGGER data sets: with more data today, it takes more time to query data from databases. So researchers have looked at other approaches, and one of the most effective they found is in-memory.
(I’m not ignoring advances in database technologies like parallel databases. But for the purpose of understanding “why in-memory”, it’s important to recognize the growing size of data sets and a viable alternative we have to tackle the problem: in-memory. I’m also not saying it’s the ONLY way to go; I’m just trying to understand the significance of in-memory technologies. We, as data professionals, have lots of choices! Only after evaluating project requirements can we talk about tools and techniques.)
2) The PRICE of memory: the price of RAM used to be much higher than it is today. So even though it was a great idea to put data in memory, it was cost-prohibitive.
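Here’s a toy sketch of the in-memory idea using SQLite, which happens to support both disk-backed and in-memory databases (the table and data are invented): load the data into RAM once, then serve queries from there instead of going back to disk:

```python
import os
import sqlite3
import tempfile

# A disk-backed database stands in for the traditional setup.
path = os.path.join(tempfile.mkdtemp(), "sales.db")
disk = sqlite3.connect(path)
disk.execute("CREATE TABLE sales (region TEXT, amount REAL)")
disk.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("east", 100.0), ("west", 200.0)])
disk.commit()

# The in-memory approach: copy the data into RAM once...
mem = sqlite3.connect(":memory:")
disk.backup(mem)  # sqlite3's built-in copy into the in-memory database

# ...then answer queries from memory, avoiding disk reads entirely.
rows = mem.execute("SELECT region, amount FROM sales ORDER BY region").fetchall()
print(rows)  # [('east', 100.0), ('west', 200.0)]
```

Products like PowerPivot or SAP HANA are vastly more sophisticated, but the trade is the same one this sketch makes: spend RAM to avoid the disk bottleneck.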
So Let’s connect the dots: Data Analysis + In Memory Technologies:
What’s common between Microsoft’s PowerPivot, SAP HANA, Tableau and Qlikview?
1) They are tools for data analysis/business intelligence, and 2) their back-end data architecture is “in-memory”.
So, since data analysis needs SPEED and in-memory technologies meet that need, data analysis and business intelligence tools adopted “in-memory” as their back-end data architecture. Next time you hear a vendor say “in-memory”, you don’t have to be confused about what they’re trying to say. They’re just saying: we’ve got you covered, with the ability to query your data at the “speed of thought” via our in-memory technologies, so you can get back to your (data) analysis.
Once in a while I go back to basics to revisit some of the fundamental technology concepts that I’ve learned over the past few years. Today, I want to revisit the Data Mining and Knowledge Discovery process:
Here are the steps:
1) Raw Data
2) Data preprocessing (cleaning, sampling, transformation, integration, etc.)
3) Modeling (Building a Data Mining Model)
4) Testing the Model a.k.a assessing the Model
5) Knowledge Discovery
Here is the visualization:
In the world of data mining and knowledge discovery, we’re looking for a specific type of intelligence in the data: patterns. This is important because patterns tend to repeat, so if we find patterns in our data, we can predict/forecast that similar things will happen in the future.
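The five steps can be sketched end-to-end on a toy problem (the sales figures are invented): find a repeating weekday pattern in raw sales and turn it into a forecastable piece of knowledge:

```python
# 1) Raw data: (weekday, units_sold) -- two-plus invented weeks of sales.
raw = [("Mon", 10), ("Sat", 30), ("Mon", 12), ("Sat", 34), ("Mon", 11), ("Sat", 29)]

# 2) Preprocessing: group the observations by weekday.
by_day = {}
for day, units in raw:
    by_day.setdefault(day, []).append(units)

# 3) Modeling: a trivial "model" -- the average units sold per weekday.
model = {day: sum(units) / len(units) for day, units in by_day.items()}

# 4) Testing/assessing: the per-day averages separate cleanly here.
# 5) Knowledge discovery: the pattern (Saturdays sell ~3x Mondays) repeats,
#    so it can be used to forecast future Saturdays.
best_day = max(model, key=model.get)
print(best_day, model[best_day])  # Sat 31.0
```

A real pipeline would use proper sampling, held-out test data, and a real mining algorithm, but the shape of the process is the same.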
In this blog post, we saw the Knowledge Discovery and Data Mining process.
Some years ago, I got introduced to SQL. At that time, I recall, I was sitting in a lab, and one of the first exercises we did was to create a table in a database and add data to it. In the next lab, we ran SQL commands that updated records and deleted a few. After we were done, our instructor told us that what we had learned were the most basic programming functions, i.e. CRUD operations.
CRUD stands for Create, Read, Update and Delete.
Let’s see the SQL equivalent of CRUD operations:
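Here’s a sketch of those SQL equivalents, run from Python via the built-in sqlite3 module (the table and rows are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Create: CREATE TABLE / INSERT
conn.execute("CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO students (name) VALUES ('Alice')")
conn.execute("INSERT INTO students (name) VALUES ('Bob')")

# Read: SELECT
names = conn.execute("SELECT name FROM students ORDER BY id").fetchall()
print(names)  # [('Alice',), ('Bob',)]

# Update: UPDATE
conn.execute("UPDATE students SET name = 'Alicia' WHERE name = 'Alice'")

# Delete: DELETE
conn.execute("DELETE FROM students WHERE name = 'Bob'")

remaining = conn.execute("SELECT name FROM students").fetchall()
print(remaining)  # [('Alicia',)]
```

Create maps to CREATE/INSERT, Read to SELECT, Update to UPDATE, and Delete to DELETE.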
Is the concept of CRUD just applicable to SQL?
No. In fact, if you start learning programming or web development, one of the first things you learn is how to run CRUD operations in that particular language.
In this blog post, I documented the four (4) basic programming functions, i.e. Create, Read, Update and Delete.