This post is focused on basic concepts in linear regression and I will share how to calculate baseline prediction, SSE, SST, R2 and RMSE for a single variable linear regression.
Dataset:
The following figure shows three data points and the best-fit regression line: y = 3x + 2.
The x-coordinate, or “x”, is our independent variable and the y-coordinate, or “y”, is our dependent variable.
Baseline Prediction:
Baseline prediction is just the average of values of dependent variables. So in this case:
(2 + 2 + 8) / 3 = 4
It won’t take into account the independent variables and just predict the same outcome. We’ll see in a minute why baseline prediction is important.
Here’s what the baseline model would look like:
SSE:
SSE stands for Sum of Squared errors.
Error is the difference between actual and predicted values.
So SSE in this case:
= (2 – 2)^2 + (2 – 5)^2 + (8 – 5)^2
= 0 + 9 + 9
= 18
SST:
SST stands for Total Sum of Squares.
Step 1 is to take the difference between Actual values and Baseline values of the dependent variables.
Step 2 is to Square them each and add them up.
So in this case:
= (2 – 4)^2 + (2 – 4)^2 + (8 – 4)^2
= 24
R2:
Now R2 is 1 – (SSE/SST)
So in this case:
= 1 – (18/24)
= 0.25
RMSE:
RMSE is Root mean squared error. It can be computed using:
Square Root of (SSE/N) where N is the # of dependent variables.
Tree-like structures are common in our world: Company Hierarchy, File System on your computer, Product category map among others so you might run into a task of creating a Hierarchical level in your SQL query — In this blog post, I will show you how you can do that using couple of approaches. These approaches can also be used to map “parent – child” relationships.
Two approaches are:
When you know the Tree Depth
When you don’t know the Tree Depth
#1: When you know tree-depth:
When you know the tree-depth (and if it’s not too deep) then you could consider simple CTE’s to come up with the Hierarchical Levels field in your query.
Let’s take an example:
Input:
EmployeeID
FirstName
LastName
Title
ManagerID
1
Ken
Sánchez
Chief Executive Officer
NULL
16
David
Bradley
Marketing Manager
273
273
Brian
Welcker
Vice President of Sales
1
274
Stephen
Jiang
North American Sales Manager
273
285
Syed
Abbas
Pacific Sales Manager
273
Query: (On SQL Server)
with lvl1 as (select [EmployeeID] ,[FirstName] ,[LastName] ,[Title] ,[ManagerID] ,1 as Level FROM [dbo].[employees] where ManagerID is null ) , lvl2 as ( select [EmployeeID] ,[FirstName] ,[LastName] ,[Title] ,[ManagerID] ,2 as Level FROM [dbo].[employees] where ManagerID IN (Select EmployeeID from lvl1) ), lvl3 as ( select [EmployeeID] ,[FirstName] ,[LastName] ,[Title] ,[ManagerID] ,3 as Level FROM [dbo].[employees] where ManagerID IN (Select EmployeeID from lvl2) ) select * from lvl1 union select * from lvl2 union select * from lvl3
Output:
EmployeeID
FirstName
LastName
Title
ManagerID
Level
1
Ken
Sánchez
Chief Executive Officer
NULL
1
273
Brian
Welcker
Vice President of Sales
1
2
16
David
Bradley
Marketing Manager
273
3
274
Stephen
Jiang
North American Sales Manager
273
3
285
Syed
Abbas
Pacific Sales Manager
273
3
#2: When you do NOT know tree-depth:
In other words, if the tree is N-level deep then you are out of luck using option #1. In this case, you should consider the RECURSIVE CTE approach. Here’s an example:
Input: (with the idea that this table will grow over time)
EmployeeID
FirstName
LastName
Title
ManagerID
1
Ken
Sánchez
Chief Executive Officer
NULL
16
David
Bradley
Marketing Manager
273
23
Mary
Gibson
Marketing Specialist
16
273
Brian
Welcker
Vice President of Sales
1
274
Stephen
Jiang
North American Sales Manager
273
275
Michael
Blythe
Sales Representative
274
276
Linda
Mitchell
Sales Representative
274
285
Syed
Abbas
Pacific Sales Manager
273
286
Lynn
Tsoflias
Sales Representative
285
Query: (On SQL Server that supports Recursive CTE)
with HierarchyLvl as ( SELECT [EmployeeID] ,[FirstName] ,[LastName] ,[Title] ,[ManagerID] ,1 as Level FROM [dbo].[employees] where ManagerID is null UNION ALL SELECT e.[EmployeeID] ,e.[FirstName] ,e.[LastName] ,e.[Title] ,e.[ManagerID] ,Level + 1 FROM [dbo].[employees] e INNER JOIN HierarchyLvl d on e.ManagerID = d.EmployeeID ) select * from HierarchyLvl
Output:
EmployeeID
FirstName
LastName
Title
ManagerID
Level
1
Ken
Sánchez
Chief Executive Officer
NULL
1
273
Brian
Welcker
Vice President of Sales
1
2
16
David
Bradley
Marketing Manager
273
3
274
Stephen
Jiang
North American Sales Manager
273
3
285
Syed
Abbas
Pacific Sales Manager
273
3
286
Lynn
Tsoflias
Sales Representative
285
4
275
Michael
Blythe
Sales Representative
274
4
276
Linda
Mitchell
Sales Representative
274
4
23
Mary
Gibson
Marketing Specialist
16
4
Conclusion:
Even if I know tree-depth I will go with option #2 as it’s much easier to read and can accommodate future updates to the table. If you are interested in learning more about this and search for “Recursive Query using Common Table Expression” and you should find technical articles that talk about why it does what it does.
Question (on Quora) Is the R data science course from datacamp worth the money?
Answer:
It depends on your learning style.
If you like watching videos then coursera/udacity might be better.
If you like reading then a book/e-book might be better.
If you like hands-on then something like Data Camp is a great choice. I think they have monthly plans so it’s much cheaper to try them out. When I subscribed to it, it was like 30$/Month or so. I found it was worth it. Also, if you want to see if “hands-on” is how you learn best. Try this: swirl: Learn R, in R. — it’s free! Also, Data Camp has a free course on R too so you could try that as well.
They are used interchangeably since all of them involve working with data to find actionable insights. But I like to differentiate them based on the type of the question you’re asking:
What:
What are my sales number for this quarter?
What is the profit for this year to date?
What are my sales number over the past 6 months?
What did the sales look like same quarter last year?
…
All of these questions are used to report on facts and tools that help you build data models and reports can be classified as “Business Intelligence” tools.
Why:
Why is my sales number higher for this quarter compared to last quarter?
Why are we seeing increase in sales over the past 6 months?
Why are we seeing decrease in profit over the past 6 months?
Why does the profit this quarter less compared to same quarter last year?
…
All of these questions try to figure why something happened? A data analyst typically takes a stab at this. He might use existing Business Intelligence platform to pull data and/or also merge other data sets. He/she then applies data analysis techniques on the data to answer the “why” question and help business user get to the actionable insight.
What’s next:
What will be my sales forecast for next year?
What will be our profit next year for Scenario A, B & C?
Which customers will cancel/churn next quarter?
Which new customers will convert to a high-value customer?
…
All of these questions try to “predict” what will happen next (based on historical data/patterns). Sometimes, you don’t know the questions in the first place so there’s a lot of pro-active thinking going on and usually a “data scientist” are doing that. Sometimes you start with a high level business problem and form “hypothesis” to drive your analysis. All of these can be classified under “data science”.
Now, as you can see as we progressed from What -> Why -> What’s next, the level of sophistication needed to do the analysis also increased. So you need a combination of people, process and technology platform in an organization to go from having a Business Intelligence maturity all the way to achieving data science capabilities.
a. Quora: You could ask questions here and A2A — Build a network and someone may offer to mentor you offline
b. Mooc: You could join courses on MOOC’s like coursera and udacity — they have good forum support so you could use it for getting your questions answered
c. Cold email: There are lot of analytics/data-science professionals active in the community (linkedin groups, blogs, etc) and if you cold email them, you might find one!
d. local meetups: go to local meetups, meet people and find your mentor.
Stepping back, having a mentor helps and accelerates your progress – but not having one, shouldn’t stop you from achieving what you want.