Journal of statistical software paper on tidying data:

Standard

Data cleaning takes up a lot of time during a data science process; it’s not necessarily a bad thing and time spent on cleaning data is worthwhile in most cases; To that end, I was researching some framework that might help me make this process a little bit faster. As a part of my research, I found the Journal of statistical software paper written by Hadley Wickham which had a really good framework to “tidy” data — which is part of data cleaning process.

Author does a great job of defining tidy data:

1. Each variable forms a column.
2. Each observation forms a row.
3. Each type of observational unit forms a table.

And then applying it to 5 examples:

 1. Column headers are values, not variable names.
2. Multiple variables are stored in one column.
3. Variables are stored in both rows and columns.
4. Multiple types of observational units are stored in the same table.
5. A single observational unit is stored in multiple tables

It also contains some sample R code; You can read the paper here: http://vita.had.co.nz/papers/tidy-data.pdf

Single variable linear regression: Calculating baseline prediction, SSE, SST, R2 & RMSE:

Standard

Introduction:

This post is focused on basic concepts in linear regression and I will share how to calculate baseline prediction, SSE, SST, R2 and RMSE for a single variable linear regression.

Dataset:

The following figure shows three data points and the best-fit regression line: y = 3x + 2.

The x-coordinate, or “x”, is our independent variable and the y-coordinate, or “y”, is our dependent variable.

Baseline Prediction:

Baseline prediction is just the average of values of dependent variables. So in this case:

(2 + 2 + 8) / 3 = 4

It won’t take into account the independent variables and just predict the same outcome. We’ll see in a minute why baseline prediction is important.

Here’s what the baseline model would look like:

regression baseline model

SSE:

SSE stands for Sum of Squared errors.

Error is the difference between actual and predicted values.

So SSE in this case:

= (2 – 2)^2 + (2 – 5)^2 + (8 – 5)^2

= 0 + 9 + 9

= 18

SST:

SST stands for Total Sum of Squares.

Step 1 is to take the difference between Actual values and Baseline values of the dependent variables.

Step 2 is to Square them each and add them up.

So in this case:

= (2 – 4)^2 + (2 – 4)^2 + (8 – 4)^2

= 24

R2:

Now R2 is 1 – (SSE/SST)

So in this case:

= 1 – (18/24)

= 0.25

RMSE:

RMSE is Root mean squared error. It can be computed using:

Square Root of (SSE/N) where N is the # of dependent variables.

So in this case, it’s:

SQRT (18/3) = 2.44

 

How Marketable is R programming?

Standard

Someone asked this on Quora: How Marketable is R programming?

Answer:

Let’s step back!

Why do you want to learn R? OR why do people learn R?

To solve problems that R can address. Right?

What problems do you have? OR what problems does your COMPANY have? OR what PROBLEMS your Dream company that you want to join have?

<< LIST THEM DOWN HERE>>

example:

  • I want to predict customers that are going to churn next quarter.
  • I want to identify Marketing channel that drove the revenue growth last quarter.
  • etc..

What’s Next?

NOW, take all of these problems and find ways to solve them.

R may or may not help.

You could just do it in Excel. Then do that.

OR R helps you a little bit in the process but you need something else.

In some case, R is a perfect solution! Like building a model to predict customer churn!

So, What?

you see, learning R is important and you might get a job by showing that you have “R” chops but that will not be enough for career growth. You should be focused on learning to solve business problems using data. use R sometimes. use Excel sometimes. use Python sometimes. use SQL. use Tableau. use << INSERT A TOOL HERE>>. Learn them. Apply them. Figure out their strengths and weakness. BUT learn to use all of these technology platforms to solve problems! Solve problems that are thorny. Solve problems that move the business needle. Solve problems that get your bosses boss promoted.

If you do that, marketing your skills wouldn’t be a concern anymore.

It’s NOT easy. And it WILL take time.

TL;DR: Go for it! Learn R! But more importantly, learn to solve problems with data.

VIEW QUESTION ON QUORA

Book Giveaway: Head First Data Analysis — Ends 07/22/2016

Standard

<< THIS GIVEAWAY IS CLOSED NOW! Thanks for Participating! >>

Head First Data Analysis

Book Giveaway: Head First Data Analysis — A learner’s guide to big numbers, statistics and good decisions!

I love Head First series — if you haven’t read one of these books, you should — it’s great! So when I learned that they had a Data Analysis one, I had to read it. So I bought one and skimmed through it.

Now, Instead of letting it sit on my shelf, I think it might better serve its life purpose if more people read it so I have decided to do this little experiment.

Rules:

  1. You need to have an US-based address so that I can ship it to you (no cost to you!)
  2. You need to comment on this blog post on or before 07/22/2016 — just put your name & email. I’ll contact you if you win*

*Random selection!

Go!

How to create a Histogram in Excel?

Standard

Histogram is a powerful data analysis technique — it let’s you quickly see the distribution of the data you have. So in this post, I am going to list the steps to create histogram in Excel.

It’s a two-step process.

  1. Install “Data Analysis Tool Pak” (free Excel add-in)
  2. Format the data and build the histogram

Step 1: Install Data Analysis Tool Pak.

One of the most useful data analysis add-in in excel is not available by default! It’s called “Analysis ToolPak”

To activate it. Go to File > Excel options > Addins > For the manage field, select Excel add-ins

Histogram Manage Excel add-insMake sure that ToolPak is activated and click OK.

Histogram analysis toolpak excel(Also, Solver is a great add-in as well! It’s not in the scope of this article to discuss that add-in but it’s a powerful add-in as well. For instance, it let’s you work on optimization problems)

Step 2: Format Data and build the Histogram

So now let’s format the data.

You need two things to create a Histogram. 1) Data 2) Range

Here’s an example: (I have about 3000 numbers and need to see the distribution)

You could have other fields on the sheet as well but you need at least the data field. Range is optional but I recommend that you specify the Range so that your histogram would have the bins that you specified — otherwise you could have just used a bar chart!

Note that both of them are numerical.

Data Histogram

Now go to Menu Bar > Data > Data Analysis

Data Analysis HistogramOut of the options available, click on Histogram and select the Input Range and Bin Range > after you’re done, click OK.

Data Analysis Histogram ToolpakYou should see a new worksheet with raw data (ready for charting!). Now, create a Bar chart using the raw data and you have your histogram:

Histogram Excel Data AnalysisConclusion:

In this post I listed the steps you can take to create a Histogram in Excel. Note that there are other options as well — like R (hist function) that let’s you build histogram as well so you do have choice of tools but if you want to stick with excel and it’s good enough then you now know how. Cheers!

Related Post: What is the difference between Histogram & Bar Chart?

What data are data scientists at startups actually analyzing? How is it collected?

Standard

Question: What data are data scientists at startups actually analyzing? How is it collected?
(Coming from a web analytics background I’m wondering what data are data scientist at IT companies actually analyzing. Is it server-side or client-side? Is it collected internally or using some external tool?)

Answer:

Part 1: What are startups analyzing?

It depends on the Business Model and the Stage that they are at.

Business Models: Marketplace, Ecom, SaaS, Media, etc.

Stage: Early, Mid, Late

So let’s say you have a SaaS model and you’re in Mid-stage (post product-market fit stage) then you would tend to be focused on things like: Engagement, Churn, etc…and ideally they should be focused on measuring what aligns best with the strategy (instead of capturing everything!)

Let’s take another example. Let’s say you are a Marketplace in late-stage. So you would tend to be focused more on the “money” and so you can measure things like: transactions, commissions, etc…

I recommend reading “lean analytics” book as it goes much deeper and it’s a great starting point for anyone to understand how analytics could help a startup.

Part 2: How is it collected?

Now this also depends on your product. Assuming you’re a tech startup, you would have Web App and/or Desktop app and/or Mobile app. And now depending on your delivery approach plus your measurement needs, the “how” part will be determined. It would invariably be a combination of your transactions data source, web/mobile events stack (like Google analytics/other-Vendor or Custom), finance data source among others.

This post points to 10 other blogs which lists their “data” stack: The Data Infrastructure Meta-Analysis: How Top Engineering Organizations Built Their Big Data Stacks – The Data Point

View Question on Quora

What is the difference between Histogram & Bar Chart?

Standard
Histogram Bar Chart
 Histogram Bar Chart
The x-axis represents bins. So if you have a continuous variable like age which has values from 0-100 then you can create bins like 0-10, 10-20 and so on (and here bin size = 10). You can change the bin size to analyze the distribution of the data.
X-axis has a numerical (quantitative) variable.
The x-axis represents distinct categories from your data.
The variable on the x-axis is usually qualitative
The order of the bins is important since it is used to understand the distribution of the data. The order of the categories in the bar chart doesn’t matter. We can sort it if we want but it’s not needed.