Data cleaning takes up a lot of time in a data science project. That is not necessarily a bad thing, and time spent cleaning data is worthwhile in most cases. Still, I was researching frameworks that might help me make this process a little faster. As part of my research, I found the Journal of Statistical Software paper written by Hadley Wickham, which has a really good framework for “tidying” data, which is one part of the data cleaning process.
The author does a great job of defining tidy data:
1. Each variable forms a column.
2. Each observation forms a row.
3. Each type of observational unit forms a table.
He then applies the definition to five common problems found in messy datasets:
1. Column headers are values, not variable names.
2. Multiple variables are stored in one column.
3. Variables are stored in both rows and columns.
4. Multiple types of observational units are stored in the same table.
5. A single observational unit is stored in multiple tables.
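To make the first problem (column headers are values, not variable names) concrete, here is a minimal sketch in plain Python. The table, the column names and the `melt` helper are all made up for illustration; they are not taken from the paper, which works in R.

```python
# Hypothetical "wide" table: the income brackets are values of an
# income variable, but here they appear as column headers.
wide_rows = [
    {"region": "North", "<$10k": 12, "$10-20k": 30, "$20-30k": 25},
    {"region": "South", "<$10k": 8, "$10-20k": 22, "$20-30k": 31},
]


def melt(rows, id_column, variable_name, value_name):
    """Turn every non-id column into its own (variable, value) row."""
    tidy = []
    for row in rows:
        for column, value in row.items():
            if column == id_column:
                continue  # the id column is repeated on every output row
            tidy.append({
                id_column: row[id_column],
                variable_name: column,
                value_name: value,
            })
    return tidy


# After melting, each variable is a column and each observation is a row.
tidy_rows = melt(wide_rows, "region", "income", "count")
for row in tidy_rows:
    print(row)
```

The two wide rows become six tidy rows, one per (region, income bracket) observation.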
This post focuses on basic concepts in linear regression. I will show how to calculate the baseline prediction, SSE, SST, R2 and RMSE for a single-variable linear regression.
Dataset:
The following figure shows three data points and the best-fit regression line: y = 3x + 2.
The x-coordinate, or “x”, is our independent variable and the y-coordinate, or “y”, is our dependent variable.
Baseline Prediction:
The baseline prediction is just the average of the values of the dependent variable. So in this case:
(2 + 2 + 8) / 3 = 4
It doesn’t take the independent variable into account and simply predicts the same outcome for every data point. We’ll see in a minute why the baseline prediction is important.
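As a quick check, the same calculation can be sketched in a few lines of Python. The variable names are my own; the three values are the actual y-values used above.

```python
# Actual values of the dependent variable (y) for the three data points.
actual = [2, 2, 8]

# The baseline model ignores x and always predicts the mean of y.
baseline = sum(actual) / len(actual)
baseline_predictions = [baseline] * len(actual)

print(baseline)  # 4.0
```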
Here’s what the baseline model would look like:
SSE:
SSE stands for Sum of Squared Errors.
An error is the difference between an actual value and the value predicted by the regression line.
So SSE in this case:
= (2 – 2)^2 + (2 – 5)^2 + (8 – 5)^2
= 0 + 9 + 9
= 18
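Here is a small Python sketch of the same arithmetic. The predicted values (2, 5, 5) are the ones used in the calculation above; the variable names are just for illustration.

```python
# Actual y-values and the values predicted by the regression line.
actual = [2, 2, 8]
predicted = [2, 5, 5]

# Square each error (actual minus predicted), then add them up.
squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
sse = sum(squared_errors)

print(squared_errors, sse)  # [0, 9, 9] 18
```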
SST:
SST stands for Total Sum of Squares.
Step 1 is to take the difference between each actual value of the dependent variable and the baseline value.
Step 2 is to square each difference and add them up.
So in this case:
= (2 – 4)^2 + (2 – 4)^2 + (8 – 4)^2
= 4 + 4 + 16
= 24
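The two steps can be sketched in Python like this (the variable names are my own):

```python
# Actual y-values for the three data points.
actual = [2, 2, 8]

# Baseline value: the mean of the actual values.
baseline = sum(actual) / len(actual)

# Step 1: difference between each actual value and the baseline.
# Step 2: square each difference and add them up.
squared_deviations = [(a - baseline) ** 2 for a in actual]
sst = sum(squared_deviations)

print(squared_deviations, sst)  # [4.0, 4.0, 16.0] 24.0
```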
R2:
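Now R2 is 1 – (SSE/SST). It compares the error of the regression model (SSE) with the error of the baseline model (SST), which is why the baseline prediction matters.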
So in this case:
= 1 – (18/24)
= 0.25
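Putting the pieces together, a minimal Python sketch of the R2 calculation could look like this. The function name `r_squared` is hypothetical, not from any library.

```python
def r_squared(actual, predicted):
    """Return 1 - SSE/SST for paired actual and predicted values."""
    baseline = sum(actual) / len(actual)
    sse = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    sst = sum((a - baseline) ** 2 for a in actual)
    return 1 - sse / sst


# Actual values and regression-line predictions from the example above.
r2 = r_squared([2, 2, 8], [2, 5, 5])
print(r2)  # 0.25
```

If the predictions were no better than the baseline, SSE would equal SST and R2 would be 0; a perfect fit gives SSE = 0 and R2 = 1.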
RMSE:
RMSE is Root Mean Squared Error. It can be computed using:
Square root of (SSE/N), where N is the number of observations (data points).
So in this case:
= Square root of (18/3)
= Square root of 6
≈ 2.45
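For the three data points in this post, the RMSE can be sketched in Python as follows (the variable names are my own, and N here is the number of data points):

```python
import math

# Actual values and regression-line predictions for the three data points.
actual = [2, 2, 8]
predicted = [2, 5, 5]

# SSE divided by the number of data points gives the mean squared error.
sse = sum((a - p) ** 2 for a, p in zip(actual, predicted))
n = len(actual)
rmse = math.sqrt(sse / n)

print(round(rmse, 2))  # 2.45
```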
Question (on Quora): Is the R data science course from DataCamp worth the money?
Answer:
It depends on your learning style.
If you like watching videos, then Coursera/Udacity might be better.
If you like reading, then a book/e-book might be better.
If you like hands-on learning, then something like DataCamp is a great choice. I think they have monthly plans, so it’s much cheaper to try them out. When I subscribed, it was around $30/month, and I found it was worth it. Also, if you want to find out whether “hands-on” is how you learn best, try swirl: Learn R, in R. It’s free! DataCamp has a free course on R too, so you could try that as well.
Someone asked this on Quora: How marketable is R programming?
Answer:
Let’s step back!
Why do you want to learn R? OR why do people learn R?
To solve problems that R can address. Right?
What problems do you have? OR what problems does your COMPANY have? OR what PROBLEMS does the dream company that you want to join have?
<< LIST THEM DOWN HERE>>
Example:
I want to predict which customers are going to churn next quarter.
I want to identify the marketing channel that drove revenue growth last quarter.
etc.
What’s Next?
NOW, take all of these problems and find ways to solve them.
R may or may not help.
Maybe you could just do it in Excel. Then do that.
OR R helps you a little bit in the process, but you need something else.
In some cases, R is a perfect solution! Like building a model to predict customer churn!
So, What?
You see, learning R is important, and you might get a job by showing that you have “R” chops, but that will not be enough for career growth. You should focus on learning to solve business problems using data. Use R sometimes. Use Excel sometimes. Use Python sometimes. Use SQL. Use Tableau. Use << INSERT A TOOL HERE>>. Learn them. Apply them. Figure out their strengths and weaknesses. BUT learn to use all of these technology platforms to solve problems! Solve problems that are thorny. Solve problems that move the business needle. Solve problems that get your boss’s boss promoted.
If you do that, marketing your skills won’t be a concern anymore.
It’s NOT easy. And it WILL take time.
TL;DR: Go for it! Learn R! But more importantly, learn to solve problems with data.
Some time back I worked on a research project that involved writing some R code. We were searching for tools and ways to pull data from multiple social networks, perform text analysis and create effective data visualizations. R seemed like a great tool, so I was searching for books and guides that would teach me the fundamentals I needed to get a few R-related things done. One of the books that I used often during the research project was “R in a Nutshell”. I didn’t read it cover to cover, but it was a great reference book for me. I would read online guides and other books and then combine them with information from this book to get stuff done. The section I liked the most was the one on data visualization, which included some great code snippets for creating effective data visualizations using the ggplot2 library. I would take code snippets from this book and apply them to the datasets that I had.
Fun stuff!
Also, I liked that the book has some end-to-end examples that cover the entire life cycle of data analysis and statistical analysis.
Summary:
I recommend this book as a “reference” for someone who has started working with R.
Note:
I received a copy of this book as part of O’Reilly’s Blogger program. Thanks, O’Reilly! If you are a blogger, you should check out that program!
I was recently searching for a way to do some text mining on Twitter data. I wanted a tool with a library that helps fetch Twitter data, and later I wanted to create visualizations such as word clouds and time series. It turns out that R suited my needs perfectly because of packages such as twitteR and ggplot2, so I downloaded and installed R and RStudio on my Windows machine. Here are the steps (I am using a 64-bit Windows Server 2008 R2 machine):
1. Download R for Windows:
2. After downloading it > Install it, leaving all options at their defaults.
3. Download RStudio Desktop for Windows:
4. Install RStudio > Leave all options at their defaults.
5. Open RStudio > In the bottom right pane, switch to the Packages tab > Click on Install Packages > In the packages box, type in ggplot2 > Click on Install.
6. Check that ggplot2 was successfully unpacked and installed > Now install the twitteR package in the same way > Make sure it was successfully unpacked and installed.
7. And I quickly created a chart of Twitter UserName vs Number of Tweets for #sqlpass:
We can do much more, but I just wanted to show how you can do social media analytics with R!
Conclusion:
In this blog post, we saw a step-by-step process to download and install R and RStudio on a Windows machine.