Doing Data Science at Twitter — A great read!

Why is “Doing Data Science at Twitter” a great read?

This is an insider’s perspective from someone working at a company that I classify as having the highest level of analytics maturity. In other words, Twitter is known to apply the knowledge it gains from data science to its products and business processes.

It’s also important to recognize that every company is different, and the analytics and data science tools, techniques, and processes it implements will vary with its analytics maturity. I love that this was one of the key insights shared in the article.

The article also talks about two types of data scientists. I thought that was a great way to classify them, because there’s a lot of confusion in the industry about what a data scientist does. With that, here’s the URL:

My two-year journey as a data scientist at Twitter

Paras Doshi

PS: If you like articles like this, don’t forget to sign up for the newsletter!

Examples to help you differentiate between Business Intelligence and Data Science problems:


In this post, I’ll list a few examples from various industries to help you differentiate between business intelligence and data science problems.

Some time back, I blogged about the “Business Analytics Continuum”, and in that post we saw that every organization has data, but organizations use their business data at different levels depending on their maturity. Excel (or another transactional reporting tool) is usually the starting point for any organization: it helps them see WHAT happened. They then advance to the next stage, where they get the capability to slice and dice their data to find out WHY, and this capability is usually delivered using business intelligence tools and techniques. Once the data culture spreads, thanks to a successful business intelligence project, they soon start to outgrow their business intelligence capabilities by asking questions that need predictive capabilities. This is the advanced analytics and data science stage. To that end, here are 5 examples to help you differentiate between business intelligence and data science problems:

Bike Rentals
  Business Intelligence (WHAT & WHY):
    1. How many bikes did we rent in Q3 2014? How does that compare to Q3 2013?
    2. What is the trend of total bike rentals at week level? Can you break it down by geography?
  Data Science & advanced analytics:
    Can you predict bike rentals on an hourly basis?

Credit Risk
  Business Intelligence (WHAT & WHY):
    1. How many customers have a credit risk of ‘C’?
    2. Can you rank customers that have a credit risk of ‘C’ by their payments due amount?
  Data Science & advanced analytics:
    Can you predict the credit risk of the customer during the contract negotiations stage?

Customer relationship management
  Business Intelligence (WHAT & WHY):
    1. How many account cancellations occurred this year (broken down by month and customer segmentation)?
    2. How does the percentage of account cancellations this year compare to the previous year?
  Data Science & advanced analytics:
    Can you predict customer churn?

Flight Delays
  Business Intelligence (WHAT & WHY):
    1. What is the trend of % of flights delayed this year?
    2. Can you break down flight delays this year by their reasons?
  Data Science & advanced analytics:
    Can you predict whether a scheduled flight will be delayed by more than 15 minutes?

Customer feedback
  Business Intelligence (WHAT & WHY):
    1. What is the customer satisfaction % trend this year?
    2. What is the customer satisfaction % broken down by customer segments and product segments?
  Data Science & advanced analytics:
    Can you classify a customer feedback comment into “positive”, “negative” or “neutral”?
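
To make the contrast concrete, here is a minimal Python sketch built on the bike rentals example. The rental records, the function names, and the "average for that hour" stand-in model are all assumptions made up for illustration; a real forecasting model would use far more data and features.

```python
from statistics import mean

# Hypothetical rental log: (year, quarter, hour_of_day, bikes_rented).
rentals = [
    (2013, 3, 8, 40), (2013, 3, 17, 55),
    (2014, 3, 8, 50), (2014, 3, 17, 65),
]

def total_for_quarter(rows, year, quarter):
    """BI-style question (WHAT happened?): total rentals in one quarter."""
    return sum(n for y, q, _, n in rows if y == year and q == quarter)

def predict_hourly(rows, hour):
    """Data-science-style question: estimate rentals for a future hour.

    Stand-in model (an assumption): the historical average for that hour of day.
    """
    history = [n for _, _, h, n in rows if h == hour]
    return mean(history)

q3_2014 = total_for_quarter(rentals, 2014, 3)
q3_2013 = total_for_quarter(rentals, 2013, 3)
change = q3_2014 - q3_2013            # how Q3 2014 compares to Q3 2013
forecast_8am = predict_hourly(rentals, 8)
```

The first function only summarizes what already happened, while the second produces a number for something that has not happened yet. That is the line between the two kinds of problems.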

I hope this helps!

How does Internet of Things (#IoT) impact data professionals?


The Internet enabled computers to be connected with each other.

The Internet enabled mobile devices to be connected with each other.

Now, the Internet will be used to enable physical things to be connected with each other. This is what is called the “Internet of Things” (IoT).

So what happens?

Since more devices are connected to the Internet, we will be able to generate more data! This is usually good, provided there’s a business vision for how to make sense of that data to increase the efficiency of all these things.

Here’s a nice case study from Microsoft (focus on the business case: the “thing” in this case is an elevator, and the goal is to drive reliability)


This is all good news for data professionals! There will be increased demand for professionals who can help businesses make sense of data generated via IoT.

Also, beware of the “hype” around this technology. It’s important to take incremental steps to achieve the vision. Instead of trying to analyze data from ALL devices in your organization, start with the one physical thing that matters most to your organization, or start with the data that you already have, and take incremental steps to spread a data culture in your organization!

Now that Big Data has become a mainstream term in IT and business, we have a new buzzword to learn and talk about: IoT. But remember, it’s all about making sense of data, and your skills will be more valuable than ever!

Back to basics: Multi Class Classification vs Two class classification.


Classification algorithms are commonly used to build predictive models. Here’s what they do (simplified!):

[Image: introduction to machine learning predictive algorithms]

Now, here’s the difference between multi-class and two-class classification:

If your test data needs to be classified into two classes, then you use a two-class classification model.

Examples:

1. Is it going to rain today? YES or NO

2. Will the buyer renew his soon-to-expire subscription? YES or NO

3. What is the sentiment of this text? Positive OR Negative

As you can see from the examples above, the test data needs to be classified into two classes.

Now, look at example #3: What is the sentiment of this text? What if you also want an additional class called “neutral”? Now there are three classes, and we’ll need to use a multi-class classification model. So, if your test data needs to be classified into more than two classes, then you use a multi-class classification model.

Examples:

1. Sentiment analysis of customer reviews? Positive, Negative, Neutral

2. What is the weather prediction for today? Sunny, Cloudy, Rainy, Snow

I hope the examples helped. Next time you have to choose between multi-class and two-class classification models, ask yourself: does the problem ask you to predict two classes, or more than two? Based on that, pick your model.
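
The same question can be asked of your data directly. Here is a small Python sketch that counts the distinct target labels and tells you which family of model to reach for. The function name `classification_kind` is hypothetical and only illustrates the rule of thumb above.

```python
def classification_kind(labels):
    """Return 'two-class' or 'multi-class' based on the distinct target labels."""
    distinct = set(labels)
    if len(distinct) < 2:
        raise ValueError("need at least two distinct classes to classify")
    return "two-class" if len(distinct) == 2 else "multi-class"

# Labels taken from the examples in this post.
rain = classification_kind(["YES", "NO", "NO", "YES"])
sentiment = classification_kind(["Positive", "Negative", "Neutral"])
weather = classification_kind(["Sunny", "Cloudy", "Rainy", "Snow"])
```

Here `rain` comes out as two-class, while `sentiment` and `weather` come out as multi-class.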

Example: Azure Machine Learning (AzureML) studio’s classifier list:

[Image: Azure Machine Learning classifiers list]

I hope this helps!

PASS Business Analytics VC: 7 Ideas on Encouraging Advanced Analytics by Mark Tabladillo #sqlpass


Thu, Jul 17, 2014 12:00 PM – 1:00 PM EDT

Many companies are starting or expanding their use of data mining and machine learning. This presentation covers seven practical ideas for encouraging advanced analytics in your organization.

Mark Tabladillo is a Microsoft MVP and SAS expert based in Atlanta, GA. His Industrial Engineering doctorate (including applied statistics) is from Georgia Tech. Today, he helps teams become more confident in making actionable business decisions through the use of data mining and analytics. Mark provides training and consulting for companies in the US and around the world. He has spoken at major conferences including Microsoft TechEd, PASS Summit, PASS Business Analytics Conference, Predictive Analytics World, and SAS Global Forum. He tweets @marktabnet and blogs at


Hope to see you there!

Paras Doshi
Business Analytics Virtual Chapter’s Co-Leader

PASS Business Analytics Conference – Live Blogging: Day #1


I’m at the PASS Business Analytics Conference, and I thought I’d share the news that I get to hear here!

On day #1, Kamal Hathi & Amir Netz are the keynote speakers.

They started with the progress made during the past few months (Power Query, Power BI, Power Map, SQL Server 2014, Azure HDInsight…)

Then they shared some user adoption data…

Power Pivot & Power Query:

They also shared user adoption data about Power BI:

They use Power BI to track user adoption of Power BI.

Power BI demo contest: if you haven’t seen some of the amazing demos that were submitted to the Power BI demo contest, you can read about them here:

Mobile BI:

Microsoft is committed to having native Power BI apps on different platforms and enabling BI on any device.

SSRS with Power BI:

BI for the masses

It’s great to see Microsoft committed to creating easy-to-use tools!
The Age of Classic BI -> The Age of Self Service BI -> The Age of Data.
In the new age, everyone in the organization who is curious will have tools that they can use to get to the answers!

Amir’s Demo:

Analysis of Tourism in Hawaii. It was really entertaining 🙂

New features in Power BI:

Create dashboards using natural language (KPI editor)

Forecasting in Power View:

Tree maps in Power View

And I just saw tree maps in Power View!

Drag items from one chart to another!

You should now be able to drag items from one chart to another chart!

Combine two charts into one!

Nice interactivity feature


The journey to DATA CULTURE begins today…

Business Metrics #2 of N: Customer Retention Rate


In this post, we’ll explore a business metric called “Customer Retention Rate”.

What is it?

It is a metric that helps an organization monitor the % of customers retained.

Let me give you an example:

Year    Number of Customers    Retention Rate

Do you notice the third column, which keeps tabs on the percentage of customers retained? This is the basic idea behind customer retention rate.
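
As a rough sketch, the calculation can be written in a few lines of Python. The function name and the figures (200 customers at the start of the year, 170 of them still customers at the end) are hypothetical and chosen only to show the arithmetic.

```python
def retention_rate(customers_at_start, customers_retained):
    """Percentage of the starting customers who are still customers at period end."""
    if customers_at_start <= 0:
        raise ValueError("customers_at_start must be positive")
    if not 0 <= customers_retained <= customers_at_start:
        raise ValueError("customers_retained must be between 0 and customers_at_start")
    return 100.0 * customers_retained / customers_at_start

# Hypothetical figures: 200 customers at the start of the year, 170 retained.
rate = retention_rate(200, 170)
```

Note that only customers who were already there at the start of the period count as retained. Customers acquired during the year are left out of this calculation.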

How is it used?

This metric correlates with other key business performance measures such as customer service, product quality, and customer loyalty. Think about it. If the customer retention rate is high, then the organization must be doing “something” right. That something could be a great loyalty program, great customer service, or great product quality! If it’s low, then it requires some action from decision makers: they would want to know the reasons so that they can fix the situation.

In an earlier post, we talked about Customer Lifetime Value. A higher customer retention rate also helps us achieve a higher customer lifetime value.

It’s also important to realize that the cost of acquiring a new customer is typically higher than the cost of keeping an existing customer, so organizations that sell products or services like to measure their customer retention rate.

Also, if you have customer data, then you can drill down to find trends in the retention rate, with questions like: Which age group has the highest retention rate? Which has the lowest? What is the retention rate for male customers? And can we predict the retention rate for new customers?


In this post, we learned about a business metric “customer retention rate”.

And as a reminder, this series is meant to help us understand business metrics from an analytics perspective.

What’s “Naive” about Naive Bayes Machine Learning Algorithm?


In this post, I’ll explain why the “Naive Bayes” machine learning algorithm has the word “Naive” in it.

So here is the short answer:

It “assumes” that the features are independent. (In other words: There’s no relation between the features that are used while building the model)

Let’s go a little deeper:

First up, a few basic pointers.

> It’s a machine learning algorithm used for classification

> It’s based on Bayesian Statistics.

> You can read about it here:

Now, what do I mean when I say that it is naive because it assumes that the features are independent?

Let’s take an example:

Suppose you are building a “credit card approval” model based on Income and CreditScore.

(Side note: for those who do not know what a credit score is, here you go:)

And you have the following columns in the training data (note: in machine learning, think of these columns as features):


Here the features are Income & CreditScore and the target of the classification model is Approved.

In the real world, there’s some relation between “Income” and “CreditScore”. Agree? Great! But Naive Bayes doesn’t think so. Let me reiterate the point of this blog post and see if it makes more sense now: it assumes that the features are “independent”, and that’s why it is naive!
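
Here is a small Python sketch of where that assumption shows up in the arithmetic. The training rows and the function name are made up for illustration, the features are simplified to categories, and there is no smoothing, so treat it as a toy and not as a production classifier.

```python
from collections import Counter

# Hypothetical training rows: (Income, CreditScore, Approved).
rows = [
    ("high", "good", "yes"),
    ("high", "good", "yes"),
    ("high", "poor", "no"),
    ("low",  "good", "yes"),
    ("low",  "poor", "no"),
    ("low",  "poor", "no"),
]

def naive_bayes_scores(rows, income, credit_score):
    """Unnormalized P(Approved) * P(Income | Approved) * P(CreditScore | Approved)."""
    class_counts = Counter(approved for _, _, approved in rows)
    scores = {}
    for label, n in class_counts.items():
        prior = n / len(rows)
        p_income = sum(1 for i, _, a in rows if a == label and i == income) / n
        p_score = sum(1 for _, c, a in rows if a == label and c == credit_score) / n
        # The "naive" step: multiply the per-feature probabilities as if
        # Income and CreditScore were independent given the class.
        scores[label] = prior * p_income * p_score
    return scores

scores = naive_bayes_scores(rows, income="high", credit_score="good")
prediction = max(scores, key=scores.get)
```

The model never looks at how often a high income and a good credit score occur together. It only looks at each feature on its own within each class and multiplies the results.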

I hope this helps. Your comments are very welcome!

Sentiment Analysis using LingPipe on windows 7:


In this post, I’ll point you to a resource you can use to perform sentiment analysis using LingPipe on Windows. Along with that, I’ll share a couple of issues that I ran into when I was trying to run this demo on Windows 7:

So first up, here’s the resource:

Now here are a couple of issues that I had:

1. Error: could not find or load the main class PolarityBasic

[Screenshot: LingPipe error “could not find or load main class PolarityBasic”]

To solve this error, you’ll need to build the files under C:\lingpipe-4.1.0\demos\tutorial\sentiment. We use Ant for this. Let’s see how to do that:

2. Building sentiment.jar using ant jar

After successfully downloading Ant on Windows and setting the ANT_HOME variable to C:\apache-ant-1.8.4, I was still getting the error that ant is not a recognized command.

So I ran the following commands:

C:\>set ANT_HOME=C:\apache-ant-1.8.1
C:\>set JAVA_HOME=C:\jdk1.6.0_24
C:\>set PATH=%ANT_HOME%\bin;%JAVA_HOME%\bin
C:\>ant -version
// it worked!


Now I ran the following command:

[Screenshot: building sentiment.jar with ant]

3. The tutorial uses POLARITY_DIR. I didn’t use that; instead, I just entered c:\review_polarity, because that’s where I unzipped the movie review dataset:

[Screenshot: movie review polarity dataset for sentiment analysis]

Here’s a screenshot of the command that does basic polarity analysis:

[Screenshot: sentiment analysis with LingPipe on Windows]

And Thanks:

Back to basics: Data Mining and Knowledge Discovery Process


Once in a while, I go back to basics to revisit some of the fundamental technology concepts that I’ve learned over the past few years. Today, I want to revisit the Data Mining and Knowledge Discovery Process:

Here are the steps:

1) Raw Data

2) Data preprocessing (cleaning, sampling, transformation, integration, etc.)

3) Modeling (Building a Data Mining Model)

4) Testing the Model, a.k.a. assessing the Model

5) Knowledge Discovery

Here is the visualization:

[Image: knowledge discovery process in data mining]

Additional Note:

In the world of data mining and knowledge discovery, we’re looking for a specific type of intelligence from the data: patterns. This is important because patterns tend to repeat, so if we find patterns in our data, we can predict or forecast that such things may happen in the future.
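
The five steps listed above can be sketched end to end in a few lines of Python. Everything in this example is an assumption made up for illustration: the records (hours studied, passed or not), the holdout rule, and the very simple threshold "model".

```python
# 1) Raw data: hypothetical (hours_studied, passed) records, one with a missing value.
raw = [(1, "no"), (2, "no"), (None, "yes"), (3, "no"),
       (6, "yes"), (7, "yes"), (8, "yes"), (2, "no")]

# 2) Preprocessing: drop incomplete rows, then hold out every 4th record for testing.
clean = [(x, y) for x, y in raw if x is not None]
test = clean[::4]
train = [row for i, row in enumerate(clean) if i % 4 != 0]

# 3) Modeling: learn a threshold halfway between the two class averages.
def fit_threshold(rows):
    yes = [x for x, y in rows if y == "yes"]
    no = [x for x, y in rows if y == "no"]
    return (sum(yes) / len(yes) + sum(no) / len(no)) / 2

threshold = fit_threshold(train)

def predict(hours):
    return "yes" if hours > threshold else "no"

# 4) Testing: share of held-out records that the model gets right.
accuracy = sum(predict(x) == y for x, y in test) / len(test)

# 5) Knowledge discovery: the pattern we found, stated in plain words.
pattern = f"Records above {threshold:.1f} hours tend to be labeled 'yes'"
```

The interesting output is the last line: a pattern that held on data the model had not seen, which is what lets us use it to predict future cases.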


In this blog post, we saw the Knowledge Discovery and Data Mining process.