Data cleaning takes up a lot of time during a data science process; it’s not necessarily a bad thing and time spent on cleaning data is worthwhile in most cases; To that end, I was researching some framework that might help me make this process a little bit faster. As a part of my research, I found the Journal of statistical software paper written by Hadley Wickham which had a really good framework to “tidy” data — which is part of data cleaning process.
Author does a great job of defining tidy data:
1. Each variable forms a column.
2. Each observation forms a row.
3. Each type of observational unit forms a table.
And then applying it to 5 examples:
1. Column headers are values, not variable names.
2. Multiple variables are stored in one column.
3. Variables are stored in both rows and columns.
4. Multiple types of observational units are stored in the same table.
5. A single observational unit is stored in multiple tables
Someone asked this on Quora: How Marketable is R programming?
Let’s step back!
Why do you want to learn R? OR why do people learn R?
To solve problems that R can address. Right?
What problems do you have? OR what problems does your COMPANY have? OR what PROBLEMS your Dream company that you want to join have?
<< LIST THEM DOWN HERE>>
I want to predict customers that are going to churn next quarter.
I want to identify Marketing channel that drove the revenue growth last quarter.
NOW, take all of these problems and find ways to solve them.
R may or may not help.
You could just do it in Excel. Then do that.
OR R helps you a little bit in the process but you need something else.
In some case, R is a perfect solution! Like building a model to predict customer churn!
you see, learning R is important and you might get a job by showing that you have “R” chops but that will not be enough for career growth. You should be focused on learning to solve business problems using data. use R sometimes. use Excel sometimes. use Python sometimes. use SQL. use Tableau. use << INSERT A TOOL HERE>>. Learn them. Apply them. Figure out their strengths and weakness. BUT learn to use all of these technology platforms to solve problems! Solve problems that are thorny. Solve problems that move the business needle. Solve problems that get your bosses boss promoted.
If you do that, marketing your skills wouldn’t be a concern anymore.
It’s NOT easy. And it WILL take time.
TL;DR: Go for it! Learn R! But more importantly, learn to solve problems with data.
<< THIS GIVEAWAY IS CLOSED NOW! Thanks for Participating! >>
Book Giveaway: Head First Data Analysis — A learner’s guide to big numbers, statistics and good decisions!
I love Head First series — if you haven’t read one of these books, you should — it’s great! So when I learned that they had a Data Analysis one, I had to read it. So I bought one and skimmed through it.
Now, Instead of letting it sit on my shelf, I think it might better serve its life purpose if more people read it so I have decided to do this little experiment.
You need to have an US-based address so that I can ship it to you (no cost to you!)
You need to comment on this blog post on or before 07/22/2016 — just put your name & email. I’ll contact you if you win*
One of the most useful data analysis add-in in excel is not available by default! It’s called “Analysis ToolPak”
To activate it. Go to File > Excel options > Addins > For the manage field, select Excel add-ins
Make sure that ToolPak is activated and click OK.
(Also, Solver is a great add-in as well! It’s not in the scope of this article to discuss that add-in but it’s a powerful add-in as well. For instance, it let’s you work on optimization problems)
Step 2: Format Data and build the Histogram
So now let’s format the data.
You need two things to create a Histogram. 1) Data 2) Range
Here’s an example: (I have about 3000 numbers and need to see the distribution)
You could have other fields on the sheet as well but you need at least the data field. Range is optional but I recommend that you specify the Range so that your histogram would have the bins that you specified — otherwise you could have just used a bar chart!
Note that both of them are numerical.
Now go to Menu Bar > Data > Data Analysis
Out of the options available, click on Histogram and select the Input Range and Bin Range > after you’re done, click OK.
You should see a new worksheet with raw data (ready for charting!). Now, create a Bar chart using the raw data and you have your histogram:
In this post I listed the steps you can take to create a Histogram in Excel. Note that there are other options as well — like R (hist function) that let’s you build histogram as well so you do have choice of tools but if you want to stick with excel and it’s good enough then you now know how. Cheers!
Question: What data are data scientists at startups actually analyzing? How is it collected? (Coming from a web analytics background I’m wondering what data are data scientist at IT companies actually analyzing. Is it server-side or client-side? Is it collected internally or using some external tool?)
Part 1: What are startups analyzing?
It depends on the Business Model and the Stage that they are at.
Business Models: Marketplace, Ecom, SaaS, Media, etc.
Stage: Early, Mid, Late
So let’s say you have a SaaS model and you’re in Mid-stage (post product-market fit stage) then you would tend to be focused on things like: Engagement, Churn, etc…and ideally they should be focused on measuring what aligns best with the strategy (instead of capturing everything!)
Let’s take another example. Let’s say you are a Marketplace in late-stage. So you would tend to be focused more on the “money” and so you can measure things like: transactions, commissions, etc…
I recommend reading “lean analytics” book as it goes much deeper and it’s a great starting point for anyone to understand how analytics could help a startup.
Part 2: How is it collected?
Now this also depends on your product. Assuming you’re a tech startup, you would have Web App and/or Desktop app and/or Mobile app. And now depending on your delivery approach plus your measurement needs, the “how” part will be determined. It would invariably be a combination of your transactions data source, web/mobile events stack (like Google analytics/other-Vendor or Custom), finance data source among others.
The x-axis represents bins. So if you have a continuous variable like age which has values from 0-100 then you can create bins like 0-10, 10-20 and so on (and here bin size = 10). You can change the bin size to analyze the distribution of the data. X-axis has a numerical (quantitative) variable.
The x-axis represents distinct categories from your data. The variable on the x-axis is usually qualitative
The order of the bins is important since it is used to understand the distribution of the data.
The order of the categories in the bar chart doesn’t matter. We can sort it if we want but it’s not needed.
Prescriptive — How can I use all these things to take business decisions/actions.
With that overview, let’s look at Descriptive analysis a little bit more. This is usually the first step for any organization to start getting value out of all their data. They should be able to answer questions like: What were my sales last quarter? How about same quarter last year? Then compare them to see if they made progress. They can also report on sales (Actual vs goals) for last n months and see if they are trending in the right direction. Things like this! Once you have a good process and platform to get this right then the organization is ready to advance to next step which is diagnostic and this is where you start analyzing the key drivers and underlying reasons to figure “why” it’s happening. But you need start at Descriptive!
Asset (Volunteers, Field offices & Equipments) management & planning for a global crisis response team.
Working in a team, we created statistical surveys for field works to collect data about current state & estimated future needs. We also helped them with data gathering & cleaning tasks. After that, we helped them analyze & visualize the data to find actions for executives leading the global crisis response team.
Here’s a mockup of one of the ten data visualization created for them:
2. make sure to check the “analysis toolpak” checkbox.
3. Now you should see a “data analysis” option under the “Data” pane:
4. Now click on “Data Analysis” and select one of the following options:
Anova, Correlation, Covariance, Descriptive Statistics, Exponential Smoothing, F-Test Two-Sample for Variances, Fourier Analysis, Histogram, Moving Average, Random Number Generation, Rank and Percentile, Regression, Sampling, t-Test, z-Test.
in this case, let’s go with descriptive statistics but you can see that you can perform other tasks as well.
5. Once you click on the descriptive statistics, a dialog box will show up and you will have to enter some data like your input range to generate descriptive statistics. Once you have filled the data needed, click on OK and it should generate descriptive statistics for you in EXCEL!
I hope that helps!
In this post, we saw how to generate descriptive statistics in Microsoft Excel.