The goal of this newsletter is to promote continuous learning for data science and engineering professionals. To achieve this goal, I’ll be sharing articles from various sources that I found interesting. The following 5 articles made the cut for today’s newsletter.
Why Dropbox picked Apache Superset as its data exploration tool
Apache Superset is gaining momentum, and if you want to understand the reasons behind that, you can start by reading this article here.
Growth: Adjacent User Theory
I love the framing in this LinkedIn post here, where Nimit Jain says that great Growth PM output looks like “We discovered 2 new user segments who are struggling to proceed at 2 key steps in the funnel and simplified the product for them via A/B experiments. This lead to conversion improvement of 5-10% at these steps so far. We are now working to figure the next segment of users to focus on.” You can read about the Adjacent User Theory here.
A really good matrix comparing 2 popular ETL workflow platforms. Read it here.
A data engineer’s point of view on data democratization
If more people can easily access data that was previously not accessible to them, then that’s a good thing. This is a good read on the various things to consider. Read it here.
I love mental models and frameworks. I have shared some frameworks on this blog already, like the 3 W’s (What, Why, What’s next) and the 3 P’s (Platform, People, Process), focused on helping analytics leaders figure out what their analytics roadmap should be. I was reading the book ‘Competing on Analytics’ and came across the 5 stages of analytical competition, which seemed like another framework worth sharing.
The two ends of the spectrum are an org that is flying blind and an org that is competing through analytics. The stages are:
If you’re looking for a career change, it’s never too late!
If you’re looking to learn something new, it’s never too late!
If you’re looking to continue learning and go deeper in data science, it’s never too late!
If you don’t like software engineering and want to switch to something else, it’s never too late!
…
But if you are after the “Data Science” gold rush, then you did miss the first wave! You are late.
…
But seriously, you should apply first-principles thinking to your career strategy and ideally not jump to whatever’s “hot” because by the time you get on that train, it’s usually too late.
As a data scientist, I am not dissatisfied. I love what I do!
But I might have gotten lucky since I got into this for the right reasons. I was looking for a role that had a little bit of both tech & business, and so a few years back, Business Intelligence and Data Analysis seemed like a great place to start. So I did that for a while. Then the industry evolved, and the analytics maturity of the companies that I worked for also evolved, so I worked on building predictive models and became what they now call a “data scientist”.
It doesn’t mean that data science is the right role for everyone.
One of my friends feels that it’s not that “technical” and doesn’t like this role. He is more than happy with a data engineer role where he gets to build stuff and dive deeper into technologies.
One of my other friends doesn’t like that you don’t own business/product outcomes and prefers a product manager role (even though he has worked as a data analyst for a while now and is working on transitioning away).
So, just based on the empirical data that I have, data science might not be an ideal path for everyone.
It depends on your target industry & where it is in its life cycle.
An industry life cycle has four stages: Startup, Growth, Maturity, Decline.
Generalization is great in the earlier stages. If you are targeting jobs at startups, generalize: you should know enough about a lot of things.
T-shaped professionals are great for the Growth stage. They specialize in something but still know enough about a lot of things. E.g., a Sr. Growth/Marketing Analyst knows enough about analytics & data science to be dangerous but specializes in marketing.
Specialization is great for mature industries. Specialists know a lot about a few things. E.g., statisticians in the insurance industry have made careers out of building risk models.
This post is focused on basic concepts in linear regression and I will share how to calculate baseline prediction, SSE, SST, R2 and RMSE for a single variable linear regression.
Dataset:
The following figure shows three data points and the best-fit regression line: y = 3x + 2.
The x-coordinate, or “x”, is our independent variable and the y-coordinate, or “y”, is our dependent variable.
Baseline Prediction:
Baseline prediction is just the average of the values of the dependent variable. So in this case:
(2 + 2 + 8) / 3 = 4
It doesn’t take the independent variable into account and always predicts the same outcome. We’ll see in a minute why baseline prediction is important.
Here’s what the baseline model would look like:
SSE:
SSE stands for Sum of Squared errors.
Error is the difference between actual and predicted values.
So SSE in this case:
= (2 – 2)^2 + (2 – 5)^2 + (8 – 5)^2
= 0 + 9 + 9
= 18
SST:
SST stands for Total Sum of Squares.
Step 1 is to take the difference between the actual values and the baseline value of the dependent variable.
Step 2 is to square each difference and add them up.
So in this case:
= (2 – 4)^2 + (2 – 4)^2 + (8 – 4)^2
= 4 + 4 + 16
= 24
R2:
Now R2 is 1 – (SSE/SST)
So in this case:
= 1 – (18/24)
= 0.25
RMSE:
RMSE is Root mean squared error. It can be computed using:
Square root of (SSE/N), where N is the number of observations (data points). So in this case: square root of (18/3) ≈ 2.45.
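To double-check the arithmetic above, here is a minimal Python sketch that recomputes each metric. The actual values (2, 2, 8) and predicted values (2, 5, 5) are read off the SSE calculation in this post; the variable names are my own.

```python
import math

# Actual values of the dependent variable and the predictions from
# the regression line y = 3x + 2, as used in the SSE calculation above.
actual = [2, 2, 8]
predicted = [2, 5, 5]

n = len(actual)                      # number of observations
baseline = sum(actual) / n           # baseline prediction = mean of actual values

# Sum of squared errors: actual vs. model prediction
sse = sum((a - p) ** 2 for a, p in zip(actual, predicted))

# Total sum of squares: actual vs. baseline prediction
sst = sum((a - baseline) ** 2 for a in actual)

r2 = 1 - sse / sst                   # share of baseline error the model removes
rmse = math.sqrt(sse / n)            # root mean squared error

print(baseline, sse, sst, r2, round(rmse, 3))
# prints: 4.0 18 24.0 0.25 2.449
```

An R2 of 0.25 means the regression line removes a quarter of the squared error that the baseline model leaves behind, which is why the baseline prediction matters.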
Question (on Quora): Is the R data science course from DataCamp worth the money?
Answer:
It depends on your learning style.
If you like watching videos then Coursera/Udacity might be better.
If you like reading then a book/e-book might be better.
If you like hands-on learning then something like DataCamp is a great choice. I think they have monthly plans so it’s much cheaper to try them out. When I subscribed to it, it was around $30/month. I found it was worth it. Also, if you want to see if “hands-on” is how you learn best, try this: swirl: Learn R, in R. It’s free! DataCamp has a free course on R too, so you could try that as well.
Business intelligence, data analysis and data science are often used interchangeably since all of them involve working with data to find actionable insights. But I like to differentiate them based on the type of question you’re asking:
What:
What are my sales numbers for this quarter?
What is the profit for this year to date?
What are my sales numbers over the past 6 months?
What did the sales look like in the same quarter last year?
…
All of these questions report on facts, and the tools that help you build data models and reports can be classified as “Business Intelligence” tools.
Why:
Why is my sales number higher for this quarter compared to last quarter?
Why are we seeing an increase in sales over the past 6 months?
Why are we seeing a decrease in profit over the past 6 months?
Why is the profit this quarter lower compared to the same quarter last year?
…
All of these questions try to figure out why something happened. A data analyst typically takes a stab at this. They might use an existing Business Intelligence platform to pull data and/or merge in other data sets. They then apply data analysis techniques to the data to answer the “why” question and help the business user get to an actionable insight.
What’s next:
What will be my sales forecast for next year?
What will be our profit next year for Scenario A, B & C?
Which customers will cancel/churn next quarter?
Which new customers will convert to a high-value customer?
…
All of these questions try to “predict” what will happen next (based on historical data/patterns). Sometimes you don’t know the questions in the first place, so there’s a lot of proactive thinking going on, and it’s usually a “data scientist” doing that. Sometimes you start with a high-level business problem and form “hypotheses” to drive your analysis. All of these can be classified under “data science”.
Now, as you can see as we progressed from What -> Why -> What’s next, the level of sophistication needed to do the analysis also increased. So you need a combination of people, process and technology platform in an organization to go from having a Business Intelligence maturity all the way to achieving data science capabilities.
What-if analysis is a pretty common analysis done by decision makers. Often, they just create simple Excel tables and adjust their variables manually until they get an answer that works. But instead of doing it manually, there are features available in Excel that will make your life much easier and your analysis much more accurate. So, the goal of this blog post is to introduce you to the Goal Seek and Solver features to help you do what-if analysis in Excel.
#1. Goal Seek:
Let’s say you are the CEO of an e-commerce startup and are wondering what factors you need to focus on to increase revenue. Here’s what the data (*assume per month) looks like when you start out:
And you want to increase the Revenue to $150K from $125K. The three levers you can pull are website visitors, conversion and revenue per customer.
Now you could manually tweak the values for these variables till you get to $150K but, as I promised earlier, there’s a better way!
Let’s start with Goal Seek.
You need to set two variables for Goal Seek.
a. Your goal — which in this case is 150K
b. The variable that needs to be changed to achieve that goal — note that you can specify just one variable to do so. So you need to choose out of the three above what you would like to focus on. Let’s say you want to focus on conversion rate.
So once you have these two things — from the Data Tab in Excel, Go To What-if Analysis, Goal Seek:
Now, specify the values. For this example, we want to figure out what should be the new conversion rate so that our revenue will be $150K. So here’s an example of how that would look on Goal-seek:
After entering the values, you will see the status — you can click “OK” to keep the solution and cancel to go back to what you had:
Perfect! So you need to increase the conversion rate from 1.25% to 1.5% to get to the goal that you had set!
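To make the idea concrete outside of Excel, here is a minimal Python sketch of what a goal seek does: it searches for the input value that makes a formula hit a target. I am assuming revenue = website visitors × conversion rate × revenue per customer, and I am assuming 1,000,000 visitors and $10 per customer as starting values (they are consistent with $125K at 1.25%, but they are my illustration, not values stated in this post). The search uses simple bisection, which is not necessarily the method Excel uses internally.

```python
def revenue(visitors, conversion_rate, revenue_per_customer):
    """Monthly revenue model assumed for this example."""
    return visitors * conversion_rate * revenue_per_customer


def goal_seek(func, target, low, high, tol=1e-9, max_iter=200):
    """Find x in [low, high] where func(x) is close to target.

    Uses bisection, so it assumes func is increasing on the interval.
    """
    for _ in range(max_iter):
        mid = (low + high) / 2
        if func(mid) < target:
            low = mid       # revenue too low: search the upper half
        else:
            high = mid      # revenue high enough: search the lower half
        if high - low < tol:
            break
    return (low + high) / 2


# Assumed starting values (hypothetical, chosen to match $125K at 1.25%)
VISITORS = 1_000_000
REVENUE_PER_CUSTOMER = 10.0

# Only the conversion rate is allowed to change, as in Goal Seek.
needed = goal_seek(
    lambda rate: revenue(VISITORS, rate, REVENUE_PER_CUSTOMER),
    target=150_000,
    low=0.0,
    high=1.0,
)
print(f"Conversion rate needed: {needed:.2%}")
```

Under these assumptions the search lands on roughly 1.5%, the same answer Goal Seek gives above.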
#2: Solver add-in
So, you worked on improving the conversion rate for the next month or two, and you & your team found out that it’s getting really hard to increase it above 1.35%. You also found that with less effort you can move the needle on the other variables (website visitors & revenue per customer). Now, Goal Seek allows you to set just one variable, so if you have more variables then it doesn’t serve the purpose that well! That is where the Solver add-in helps.
Think of Solver as advanced Goal seek where you can set more than one cell that can change. You can also set constraints on what the values could be for all the variables that can change.
Now, for our scenario, the conversion rate is at 1.35% but you want to see the possible changes that you can make for website visitors and revenue per customer to reach $150K.
You also know that you can’t go above 1,100,000 website visitors per month and that revenue per customer needs to stay at or below $11.
You will need to enable the Solver add-in in Excel and once you do that you will see that in the Data Tab.
Once you have it, open it and fill up the information needed in the dialog box:
a. Set objective to Total Revenue with value of 150000
b. By changing cells to: Website Visitors and Revenue per Customer
c. Constraints: Website Visitors <= 1,100,000 and Revenue Per Customer <= $11
After that click on Solve.
If it finds a solution, it will show you that in Excel and also give you the option to either keep the Solver solution or restore the original values:
For our scenario, it is suggesting that with website visitors at 1,010,101 and revenue per customer at $11, we should hit our goal.
Click on OK when you’re done.
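The Solver result can be sanity-checked with a short Python sketch. It again assumes revenue = website visitors × conversion rate × revenue per customer, with the conversion rate fixed at 1.35%, and treats both limits as "at or below". This is not how Solver searches; it just works out how many visitors are needed for a few hypothetical revenue-per-customer values and checks them against the constraints.

```python
CONVERSION_RATE = 0.0135           # fixed at 1.35% for this scenario
TARGET_REVENUE = 150_000
MAX_VISITORS = 1_100_000           # constraint on website visitors per month
MAX_REVENUE_PER_CUSTOMER = 11.0    # constraint on revenue per customer


def visitors_needed(revenue_per_customer):
    """Visitors required to hit the target at a given revenue per customer."""
    return TARGET_REVENUE / (CONVERSION_RATE * revenue_per_customer)


def is_feasible(visitors, revenue_per_customer):
    """True when both constraints from the Solver dialog are satisfied."""
    return (
        visitors <= MAX_VISITORS
        and revenue_per_customer <= MAX_REVENUE_PER_CUSTOMER
    )


# Hypothetical candidate values for revenue per customer
for rpc in (10.0, 10.5, 11.0):
    visitors = visitors_needed(rpc)
    status = "feasible" if is_feasible(visitors, rpc) else "violates a constraint"
    print(f"${rpc:.2f} per customer -> {visitors:,.0f} visitors ({status})")
```

At $11 per customer this gives about 1,010,101 visitors, matching the Solver output above. Other combinations within the limits also reach $150K, so Solver's answer is one feasible solution rather than the only one.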
Conclusion:
In this post, we saw how you can use Goal Seek and the Solver add-in using an e-commerce scenario, but these techniques can be applied to a wide variety of data analysis problems that can be solved using “what-if” techniques.
Hope this was helpful! I would love to hear from you about how you will use this in your work. Or, if you use it already, what do you use it for?
As a data analyst, you should work with the CEO (or other decision makers) on a quarterly basis (or more frequently if possible) to #1 learn about strategic objectives and initiatives and #2 figure out together how analytics could help these initiatives.
So why is learning about strategic initiatives from the executives important?
Because analytics could be applied to a lot of problems but you and your team might just have limited bandwidth.
Also, executives want to stay focused on what’s important now, so if your priorities align then you are much more likely to succeed in the role.
Let’s take an example:
Scenario 1: As a data analyst, you create a bunch of reports from, let’s say, Google Analytics and throw them at the CEO! They have everything: visitor stats, acquisition stats, retention stats, behavior stats, conversion stats, among others! Now, by doing so, executives might get what they asked for, but they will still have to go through the reports, map them back to their strategic initiatives and figure out the recommendations on their own. Also, executives might not have the time to do this and may miss critical insights.
Scenario 2: You know that one of the strategic initiatives for the quarter is to improve the conversion rate from landing pages to the order-complete page from 1.25% to 1.40%. So the analysis that you send to the executive would not only be focused on just that but also include “recommendations”, like: it seems that there is a significant drop-off after customers learn about the shipping cost. Then the executive could use those recommendations to drive actions. There’s also another benefit: any ad-hoc data request that doesn’t align with the strategic objectives can be postponed (or de-prioritized), which lets you focus on what’s most important for the company.
I prefer scenario #2, and I try to create this culture wherever I am working. Executives should be open to sharing strategic initiatives at a high level with everyone in the company and help align everyone’s priorities.
Note: This doesn’t mean that you don’t create reports. You still do that for broader consumption, especially for the Key Performance Indicators that are key for success, but you should look at automating most of that and focus on data analysis and finding recommendations that the executives could take some action on.