What data are data scientists at startups actually analyzing? How is it collected?


Question: What data are data scientists at startups actually analyzing? How is it collected?
(Coming from a web analytics background, I’m wondering what data the data scientists at IT companies are actually analyzing. Is it server-side or client-side? Is it collected internally or with some external tool?)

Answer:

Part 1: What are startups analyzing?

It depends on the Business Model and the Stage that they are at.

Business Models: Marketplace, Ecom, SaaS, Media, etc.

Stage: Early, Mid, Late

So let’s say you have a SaaS model and you’re at the mid stage (post product-market fit). Then you would tend to focus on things like engagement and churn. Ideally, you should focus on measuring what aligns best with your strategy instead of capturing everything.

Let’s take another example. Say you are a late-stage marketplace. You would tend to focus more on the “money”, so you would measure things like transactions and commissions.
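To make the two examples concrete, here is a minimal sketch of one metric for each. The function names, the sample numbers, and the exact definitions (churn as customers lost divided by customers at the start of the period, commission as a flat rate on every transaction) are assumptions for illustration; real startups define these metrics in many different ways.

```python
def churn_rate(customers_at_start, customers_lost):
    """Fraction of the starting customers who left during the period (assumed definition)."""
    if customers_at_start <= 0:
        raise ValueError("customers_at_start must be positive")
    return customers_lost / customers_at_start


def total_commission(transaction_amounts, commission_rate):
    """Marketplace revenue, assuming a flat rate is taken from every transaction."""
    return sum(transaction_amounts) * commission_rate


# Made-up numbers, purely for illustration.
print(churn_rate(customers_at_start=200, customers_lost=10))
print(total_commission([100.0, 250.0, 50.0], commission_rate=0.1))
```

The point is less the arithmetic than the choice: a mid-stage SaaS team would track the first number closely, while a late-stage marketplace would care more about the second.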

I recommend reading the book “Lean Analytics”: it goes much deeper and is a great starting point for anyone who wants to understand how analytics can help a startup.

Part 2: How is it collected?

This also depends on your product. Assuming you’re a tech startup, you would have a web app, a desktop app, a mobile app, or some combination. Your delivery approach and your measurement needs then determine the “how”. In practice it is usually a combination of your transactional data source, a web/mobile events stack (Google Analytics, another vendor, or a custom build), and your finance data source, among others.

This post points to 10 other blog posts that list their “data” stacks: The Data Infrastructure Meta-Analysis: How Top Engineering Organizations Built Their Big Data Stacks – The Data Point

View Question on Quora

What is the difference between Histogram & Bar Chart?

Histogram:

  1. The x-axis represents bins. So if you have a continuous variable like age which has values from 0-100 then you can create bins like 0-10, 10-20 and so on (and here bin size = 10). You can change the bin size to analyze the distribution of the data.
  2. The x-axis has a numerical (quantitative) variable.
  3. The order of the bins is important since it is used to understand the distribution of the data.

Bar Chart:

  1. The x-axis represents distinct categories from your data.
  2. The variable on the x-axis is usually qualitative.
  3. The order of the categories in the bar chart doesn’t matter. We can sort it if we want but it’s not needed.
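The difference shows up in how the counts behind each chart are computed. The sketch below is a minimal illustration in plain Python; the function names and the sample data are made up, and a plotting library would normally do this work for you.

```python
from collections import Counter


def histogram_counts(values, bin_size):
    """Count values per fixed-width bin, returned in numeric order (empty bins included)."""
    counts = Counter((v // bin_size) * bin_size for v in values)
    starts = range(min(counts), max(counts) + bin_size, bin_size)
    return [(f"{start}-{start + bin_size}", counts[start]) for start in starts]


def bar_chart_counts(labels):
    """Count occurrences per category; the order is first-seen and carries no meaning."""
    return list(Counter(labels).items())


ages = [3, 7, 12, 18, 25, 27, 29, 41]               # made-up sample data
cities = ["Austin", "Boston", "Austin", "Denver"]   # made-up sample data

print(histogram_counts(ages, bin_size=10))  # five bins, including an empty 30-40 bin
print(histogram_counts(ages, bin_size=20))  # same data, coarser view of the distribution
print(bar_chart_counts(cities))             # one bar per distinct category
```

Changing `bin_size` changes the shape of the histogram, while the bar chart has no equivalent knob: its categories come straight from the data.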

[Video] AI, Deep Learning and Machine Learning


I watched this video over the weekend and wanted to share this very well done presentation by a venture capital (VC) firm. That’s why I love following VCs (especially ones who invest in the data/analytics theme): they tend to share some amazing insights on where the industry is going.

Abstract:
“One person, in a literal garage, building a self-driving car.” That happened in 2015. Now to put that fact in context, compare this to 2004, when DARPA sponsored the very first driverless car Grand Challenge. Of the 20 entries they received then, the winning entry went 7.2 miles; in 2007, in the Urban Challenge, the winning entries went 60 miles under city-like constraints.

Things are clearly progressing rapidly when it comes to machine intelligence. But how did we get here, after not one but multiple “A.I. winters”? What’s the breakthrough? And why is Silicon Valley buzzing about artificial intelligence again?

From types of machine intelligence to a tour of algorithms, a16z Deal and Research team head Frank Chen walks us through the basics (and beyond) of AI and deep learning in this slide presentation.

URL: http://a16z.com/2016/06/10/ai-deep-learning-machines/

What is the title these days for a person that assures data quality?


Question:

What is the title these days for a person that assures data quality?
(I need to hire a person to make sure my data is as good as it can be. They need to inspect the data for issues, create logic for how it can be found and fixed, and finally, court the project through application development for a robust solution to stop it from occurring in the first place.)

Answer:

Data quality shouldn’t be the responsibility of just one person. Ideally, you want all members of the team (and the broader business community) to care about it and own some part of it. But I like the idea of one person owning the “coordination” of how this gets done. It might not be a full-time gig in a small org, but I can see it as a full-time role in bigger orgs and enterprises. Some titles:

  1. Data coordinator
  2. Data quality analyst (or just data analyst)
  3. Data steward
  4. Master data management analyst
  5. Data quality engineer (or just data engineer)
  6. Project manager (data quality)
  7. Manager, data quality and master data management

Read the original question on Quora

What is the difference between Row_Number(), Rank() and Dense_Rank() in SQL?


If the database you work with supports window (analytic) functions, then chances are you have run into SQL use cases where you wondered about the difference between Row_Number(), Rank() and Dense_Rank(). In this post, I’ll show you the difference.

So, let’s just run all of them together and see what the output looks like.

Here’s my query: (Thanks StackExchange!)

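-- Compare the three ranking functions over the same ordering
SELECT DisplayName,
       Reputation,
       ROW_NUMBER() OVER (ORDER BY Reputation DESC) AS RowNumber,
       RANK()       OVER (ORDER BY Reputation DESC) AS Rank,
       DENSE_RANK() OVER (ORDER BY Reputation DESC) AS DenseRank
FROM users
ORDER BY RowNumber  -- make the display order explicit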

Which gives the following output:

DisplayName          Reputation RowNumber Rank DenseRank 
-------------------- ---------- --------- ---- --------- 
Hardik Mishra        9999       1         1    1         
Alex                 9997       2         2    2         
Omnipresent          9997       3         2    2         
Sergei Basharov      9993       4         4    3         
Oleg Pavliv          9991       5         5    4         
Jason Creighton      9991       6         5    4         
Aniko                9991       7         5    4         
Notlikethat          9990       8         8    5         
ZeMoon               9989       9         9    6         
Carl                 9987       10        10   7   
...
...
...     

Note that all three functions are essentially “ranking” your rows, but there are subtle differences:

  1. Row_Number() doesn’t care whether two values are the same; it still gives each row a different number. Note rows #2 and #3: they both have the value 9997 but were assigned 2 and 3 respectively. Which of the tied rows gets which number is arbitrary unless the ORDER BY includes a tie-breaker.
  2. Rank(), unlike Row_Number(), treats equal values as ties and gives them the same rank. Note rows #2 and #3: they both have the value 9997, so both were assigned rank 2. BUT notice that rank 3 is missing! In other words, Rank() introduces “gaps” after ties.
  3. Dense_Rank() is like Rank() but it doesn’t leave any gaps! Notice that rank 3 is present in the DenseRank column.
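If it helps to see the logic outside of SQL, the three numbering schemes can be sketched in a few lines of Python. This is only an illustration of the semantics described above, not how any database implements them; the function name is made up, and the input is the Reputation column from the output above.

```python
def rank_rows(scores):
    """Return (score, row_number, rank, dense_rank) tuples for scores sorted descending."""
    result = []
    rank = 0
    dense_rank = 0
    previous = None
    for row_number, score in enumerate(sorted(scores, reverse=True), start=1):
        if score != previous:
            rank = row_number   # RANK jumps to the current row position, leaving gaps after ties
            dense_rank += 1     # DENSE_RANK advances by exactly one per distinct value
            previous = score
        result.append((score, row_number, rank, dense_rank))
    return result


reputations = [9999, 9997, 9997, 9993, 9991, 9991, 9991, 9990, 9989, 9987]
for row in rank_rows(reputations):
    print(row)
```

Running it on the ten reputation values reproduces the RowNumber, Rank and DenseRank columns shown in the query output.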

I hope this clarified the differences between these SQL ranking functions. Let me know your thoughts in the comments section.

Paras Doshi

What are the differences between big data developer and data analyst?


It depends on how the Analytics & Data Science team is structured in an org, but usually you will see the following pattern:

  1. “Big Data Developer” usually rolls up under the Engineering org. They build the data pipelines that feed data to the “data platform” using things like Hadoop, Spark, custom code, and ETL tools, and they are responsible for developing and maintaining the data platform. To succeed in this role you need deep technical chops. Other titles for this role: data engineer, software engineer, etc.
  2. “Data Analyst” usually rolls up under a “business” team like strategy, operations, growth, product, marketing, or sales. Data analysts are the link between the “data platform” and the “business”: they are the primary consumers of the “data platform” (sometimes you might see shared ownership of the data platform between engineering and analytics). They pull data from the “data platform” and use it to help solve business problems. They need a good balance of business and technical skills to be successful in this role.

View the question on Quora.