As a student preparing for data analyst & data science roles, should I generalize or specialize?


This question was posted on the Springboard forum.

Here’s my answer:

It depends on your target industry & where it is in its life cycle.

An industry life cycle has four stages: Startup, Growth, Maturity, Decline.

[Figure: Industry lifecycle]

Generalization is great in the earlier stages. If you are targeting jobs at startups, generalize. You should know enough about a lot of things.

T-shaped professionals are great for the Growth stage. They specialize in something but still know enough about a lot of things. E.g. a Sr. Growth/Marketing Analyst knows enough about analytics & data science to be dangerous but specializes in marketing.

Specialization is great for mature industries. Specialists know a lot about a few things. E.g. statisticians in the insurance industry, who have made careers out of building risk models.

Any advice for moving into data science from business intelligence?


This was asked on Reddit: Any advice for moving into data science from business intelligence?

Here’s my answer:

I come from a “Business Intelligence” background and currently work as a Sr. Data Scientist. I found that you need two things to transition into data science:

Data Culture: You need a company where the data culture is such that managers/executives ask big questions that need a data science approach to solve. If your end-consumers are still asking a bunch of “what” questions, then your company might NOT be ready for data science. But if your CEO comes to you and says “hey, I got the customer list with the info I asked for but can you help me understand which of these customers might churn next quarter?”, then you have a data science problem at hand. So, try to find companies that have this culture.

Skills: You also need to upgrade your skills to be able to solve data science problems. BI is focused heavily on technology and automation, so you may need to unlearn a few things. For example, automation is not always important, since you might work on problems where a model is needed to predict just a couple of times. Trying to automate wouldn’t be optimal in that case. Also, BI relies heavily on tools, but in data science you’ll need deeper domain knowledge & a problem-solving approach along with technical skills.

Also, I personally moved from BI (as a consultant) -> Analytics (as Analytics Manager) -> Data science (Sr. Data Scientist), and this path has been super helpful for me. I recommend transitioning into Analytics first and then eventually breaking into data science.

Hope that helps!

VIEW THREAD ON REDDIT

In how many dimensions (Vs) is Big Data commonly defined?


Asked on Quora:

When reading about Big Data, the discussion usually starts with the definition from Gartner analyst Doug Laney (3 Vs). IBM often uses 4 dimensions by adding veracity. Some people use 6 or even up to 12 dimensions. I am wondering what the most frequently used definition is.

Answer:

Here’s my “working” definition of Big Data: if your existing 1) Tools & 2) Processes don’t support your data analysis needs, then you have a Big Data problem.

You can add as many V’s as you want, but it all ties back to the notion that you need bigger and better tools and processes to support your data analysis needs as you grow.

Example:

#1: Social media data is BIG! It’s text (variety), it’s much bigger in size (volume), and it’s all coming in very fast (velocity), AND the business wants to analyze customer sentiment on social. OK, we have a 3 V’s problem and need a solution to support this. Maybe Hadoop is the answer. Maybe not. But you do have a “Big Data” problem.

#2: Your customer database is broken. It doesn’t have the right addresses. Google and Alphabet are showing up as two separate companies when they should be just one. The employee counts are outdated. All of these problems are confusing your business users, and they don’t TRUST the data anymore. You have a veracity problem, and so you have a BIG Data problem.

Everyone has a BIG DATA problem. It just depends on what their “V’s” are, AND in most cases “tools” alone will not solve the issue. You need PEOPLE and PROCESS to solve that. Here’s my ranking of the ingredients that are key to solving BIG Data problems: 1) PEOPLE 2) PROCESS 3) PLATFORM (tools).

VIEW QUESTION ON QUORA

How do I learn #SQL for #data analysis?


Step 1:

This is a good starting point: SQL School Table of Contents

OR, this: Learn SQL

Both of these resources were put together by analytics vendors and are targeted at beginners.

Step 2:

Review this Quora Thread: How do I learn SQL?

Participate in competitions like this: Solve SQL Code Challenges

Step 3:

If you’d like to go more in-depth, then check out a few books:

  1. Head First SQL
  2. Learn SQL the Hard Way
  3. Certification books/material from a database vendor

Hope that helps!

VIEW QUESTION ON QUORA

Single variable linear regression: Calculating baseline prediction, SSE, SST, R2 & RMSE:


Introduction:

This post is focused on basic concepts in linear regression and I will share how to calculate baseline prediction, SSE, SST, R2 and RMSE for a single variable linear regression.

Dataset:

The following figure shows three data points and the best-fit regression line: y = 3x + 2.

The x-coordinate, or “x”, is our independent variable and the y-coordinate, or “y”, is our dependent variable.

Baseline Prediction:

Baseline prediction is just the average of the values of the dependent variable. So in this case:

(2 + 2 + 8) / 3 = 4

It doesn’t take the independent variable into account and just predicts the same outcome for every data point. We’ll see in a minute why baseline prediction is important.

Here’s what the baseline model would look like:

[Figure: regression baseline model]

SSE:

SSE stands for Sum of Squared errors.

Error is the difference between actual and predicted values.

The regression line predicts 2, 5 and 5 for the three data points, so SSE in this case:

= (2 – 2)^2 + (2 – 5)^2 + (8 – 5)^2

= 0 + 9 + 9

= 18
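The SSE arithmetic can be checked with a few lines of Python. This is only a minimal sketch: the y-values and the line y = 3x + 2 come from the post, but the x-values (0, 1, 1) are an assumption inferred from the predictions 2, 5 and 5 used in the calculation, since the figure itself is not reproduced here.

```python
# y-values are from the post; x-values are ASSUMED (inferred from the
# predictions 2, 5, 5 that appear in the SSE calculation).
xs = [0, 1, 1]
ys = [2, 2, 8]


def predict(x):
    """Regression line stated in the post: y = 3x + 2."""
    return 3 * x + 2


predictions = [predict(x) for x in xs]
# Error = actual value minus predicted value, one per data point.
errors = [y - p for y, p in zip(ys, predictions)]
# SSE = sum of the squared errors.
sse = sum(e ** 2 for e in errors)

print(predictions)  # [2, 5, 5]
print(sse)          # 18
```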

SST:

SST stands for Total Sum of Squares.

Step 1 is to take the difference between the actual values and the baseline value of the dependent variable.

Step 2 is to Square them each and add them up.

So in this case:

= (2 – 4)^2 + (2 – 4)^2 + (8 – 4)^2

= 4 + 4 + 16

= 24

R2:

Now R2 is 1 – (SSE/SST)

So in this case:

= 1 – (18/24)

= 0.25
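To see how the baseline feeds into R2, here is a small Python sketch that recomputes the baseline, SSE, SST and R2 from the actual and predicted values used above. The variable names are just illustrative choices, not part of any library.

```python
actual = [2, 2, 8]      # dependent variable values from the post
predicted = [2, 5, 5]   # predictions of the regression line used in the SSE step

# Baseline model: always predict the mean of the actual values.
baseline = sum(actual) / len(actual)

# SSE: squared errors of the regression model.
sse = sum((a - p) ** 2 for a, p in zip(actual, predicted))

# SST: squared errors of the baseline model.
sst = sum((a - baseline) ** 2 for a in actual)

# R2: share of the baseline's error that the regression model removes.
r2 = 1 - sse / sst

print(baseline, sse, sst, r2)  # 4.0 18 24.0 0.25
```

An R2 of 0 would mean the model is no better than the baseline, and an R2 of 1 would mean it predicts every point exactly. Here the model removes a quarter of the baseline's squared error.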

RMSE:

RMSE is Root Mean Squared Error. It can be computed using:

Square Root of (SSE/N) where N is the number of observations (data points).

So in this case, it’s:

SQRT(18/3) = SQRT(6) ≈ 2.45
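The same RMSE calculation as a short Python sketch, using the SSE of 18 and the three data points from the example above. The square root of 6 is roughly 2.449, which rounds to 2.45.

```python
import math

sse = 18  # sum of squared errors from the SSE step
n = 3     # number of observations (data points)

# RMSE = square root of the mean squared error.
rmse = math.sqrt(sse / n)

print(round(rmse, 2))  # 2.45
```

Unlike SSE, RMSE is in the same units as the dependent variable, so it reads as a typical prediction error per data point.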

 

Is the R data science course from datacamp worth the money?



Question (on Quora): Is the R data science course from DataCamp worth the money?

Answer:

It depends on your learning style.

If you like watching videos then Coursera/Udacity might be better.

If you like reading then a book/e-book might be better.

If you like hands-on learning then something like DataCamp is a great choice. I think they have monthly plans, so it’s much cheaper to try them out. When I subscribed, it was about $30/month. I found it was worth it. Also, if you want to see whether “hands-on” is how you learn best, try this: swirl: Learn R, in R. It’s free! DataCamp has a free course on R too, so you could try that as well.

Also, if you want to have free unlimited access for 2 days then try this link: https://www.datacamp.com/invite/G8yVkTrwR3Khn

VIEW QUESTION ON QUORA

Data analytics vs. Data science vs. Business intelligence: what are the key differences/distinctions?


These terms are often used interchangeably since all of them involve working with data to find actionable insights. But I like to differentiate them based on the type of question you’re asking:

  • What:

What are my sales numbers for this quarter?

What is the profit for this year to date?

What are my sales numbers over the past 6 months?

What did the sales look like in the same quarter last year?

All of these questions report on facts, and tools that help you build data models and reports can be classified as “Business Intelligence” tools.

  • Why:

Why is my sales number higher for this quarter compared to last quarter?

Why are we seeing an increase in sales over the past 6 months?

Why are we seeing a decrease in profit over the past 6 months?

Why is the profit this quarter lower compared to the same quarter last year?

All of these questions try to figure out why something happened. A data analyst typically takes a stab at this. He/she might use an existing Business Intelligence platform to pull data and/or merge in other data sets. He/she then applies data analysis techniques to the data to answer the “why” question and help the business user get to an actionable insight.

  • What’s next:

What will be my sales forecast for next year?

What will be our profit next year for Scenario A, B & C?

Which customers will cancel/churn next quarter?

Which new customers will convert to a high-value customer?

All of these questions try to “predict” what will happen next (based on historical data/patterns). Sometimes you don’t know the questions in the first place, so there’s a lot of proactive thinking going on, and usually a “data scientist” is doing that. Sometimes you start with a high-level business problem and form “hypotheses” to drive your analysis. All of these can be classified under “data science”.

Now, as you can see, as we progressed from What -> Why -> What’s next, the level of sophistication needed to do the analysis also increased. So you need a combination of people, process and technology platform in an organization to go from Business Intelligence maturity all the way to data science capabilities.

Here’s a related blog post that I wrote on this a while back: Business Analytics Continuum (Insight Extractor – Blog)


And you can check out other stuff I write about here: Insight Extractor – Blog – Paras Doshi’s Blog on Analytics, Data Science & Business Intelligence.

VIEW QUESTION ON QUORA