[Resource] 8 Methods to calculate CLV:


There are lot of ways to apply a CLV (customer lifetime value) model. But I hadn’t seen a single document that would summarize all of them — Until I saw this: http://srepho.github.io/CLV/CLV

If you are building a CLV model, one of first things that you might want to figure out is whether you have a contractual model or non-contractual model. And then figure out which methodology would work best for you. Here are 8 methods that were summarized in the link that I shared with you:

  • Naive
  • Recency Frequency Monetary (RFM) Summaries
  • Markov Chains
  • Hazard Functions
  • Survival Regression
  • Supervised Machine Learning using Random Forest


  • Management Heuristics
  • Distribution Based Approaches

Hope that helps!

How does rise of Power BI & Tableau affect SSRS?


It does affect SSRS adoption but SSRS (sql server reporting service) still has a place as long as there’s need for printer-friendly reporting and self-service vendors don’t have a good solution to meet this need.

Also, SSRS is great for automating operational reports that sends out emails with raw data (list of customers, products, sales transaction etc).

I advocate an analytics strategy where we think about satisfying data needs using “self-service”-first (Power BI, tableau, qlik) but if thats not the optimal solution (for cases like need to print it, I just need you to send me raw data in excel, etc) then I’ll mark it as SSRS project. And this architecture is supported by a central data model (aka operational data store, data mart, data warehouse) which makes it much easier to swap in/out any reporting tools that we need and we are not locked in by one vendor.

About 10–20% data requests that I see are SSRS projects and if the self-service platforms start adding features that compete with SSRS, I know I would start using those capabilities and phase out SSRS. But if that doesn’t happen, I will continue using SSRS 🙂


Let me know what you think in the comments section!

Paras Doshi

This post is sponsored by MockInterview.co, If you are looking for data science jobs, check out 75+ data science interview questions!

5 tests to validate the quality of your data:


Missing Data:

  • Descriptive statistics could be used to find missing data
  • Tools  like SQL/Excel/R can also be used to look for missing data
  • Some of the attributes of a field are missing: Like Postal Code in an address field


  • Check if all the values are standardized: Google, Google Inc & Alphabet might need to be standardized and categorized as Alphabet
  • Different Date formats used in the same field (MM/DD/YYYY and DD/MM/YYYY)


  • Total size of data (# of rows/columns): Sometimes you may not have all the rows that you were expecting (for e.g. 100k rows for each of your 100k customers) and if that’s not the case then that tells us that we don’t complete dataset at hand


  • Outlier: If someone;s age is 250 then that’s an outlier but also it’s an error somewhere in the data pipeline that needs to be fixed; outliers can be detected using creating quick data visualization
  • Data Type mismatch: If a text field is in a field where other entries are integer that’s also an error


  • Duplicates can be introduced in the data e.g. same rows duplicated in the dataset so that needs to be de-duplicated

Hope that helps!

Paras Doshi

This post is sponsored by MockInterview.co, If you are looking for data science jobs, check out 75+ data science interview questions!

Journal of statistical software paper on tidying data:


Data cleaning takes up a lot of time during a data science process; it’s not necessarily a bad thing and time spent on cleaning data is worthwhile in most cases; To that end, I was researching some framework that might help me make this process a little bit faster. As a part of my research, I found the Journal of statistical software paper written by Hadley Wickham which had a really good framework to “tidy” data — which is part of data cleaning process.

Author does a great job of defining tidy data:

1. Each variable forms a column.
2. Each observation forms a row.
3. Each type of observational unit forms a table.

And then applying it to 5 examples:

 1. Column headers are values, not variable names.
2. Multiple variables are stored in one column.
3. Variables are stored in both rows and columns.
4. Multiple types of observational units are stored in the same table.
5. A single observational unit is stored in multiple tables

It also contains some sample R code; You can read the paper here: http://vita.had.co.nz/papers/tidy-data.pdf