What are the must-know software skills for a career in data analytics after an MBA?

Standard

SQL, Excel & Tableau-like tools are good enough to start. Then add something like R eventually. And then there are tools that are specific to the industry – example: Google Analytics for the tech industry.

Other than that, you should know what do with these tools. You need to know following concepts and continuously build upon that as the industry use-cases and needs evolve:

  1. Spreadsheet modeling
  2. Forecasting
  3. Customer Segmentation
  4. Root cause Analysis
  5. Data Visualization and Dash-boarding
  6. Customer Lifetime value
  7. A/B testing
  8. Web Analytics

VIEW QUESTION ON QUORA

Is it too late to become a good Data Scientist?

Standard

If you’re looking for career change, that’s never too late!

If you’re looking to learn something new, that’s never too late!

If you’re looking to continue learning and go deeper in data science, that’s never too late!

If you don’t like Software engineering and want to switch to something else, that’s never too late!

But if you are after the “Data Science” gold rush, then you did miss the first wave! You are late.

But seriously, you should apply first-principles thinking to your career strategy and ideally not jump to whatever’s “hot” because by the time you get on that train, it’s usually too late.

VIEW QUESTION ON QUORA

How do you become a good data analyst?

Standard

This was asked on quora and here’s my reply:

You can become a great data analyst by continuously improving the analytics maturity of the company/start-up that you work for:

[Go to my blog for more context on the picture above]

If you create bunch of reports and help answer what happened— then try to help business users with why it happened. [Example: Instead of just sending website traffic info, add why the traffic spikes (up/downs) are happening]

If you are working on building bunch of models that answer why questions then try to help build predictive models next [Example: You have been working on a model that helped you answer why customers churned. Now built upon that and predict which customers will churn next]

If you do analytics and data science well and are already answering what, why, what’s next questions and you’re killing it! Then figure out how can you help business owners take action. Or make it easier than ever before to take actions on your data/recommendations.


Other answers for questions are directly/indirectly covered if you do this:

  1. You will have to pick the right tool for the job
  2. You will have to continuously keep learning (by taking online courses and/or you-tube)
  3. Don’t just be a data analyst, be a thought partner to business owners and if possible, transition into role that help you own business outcomes.

Hope that helps!

VIEW QUESTION ON QUORA

How does rise of Power BI & Tableau affect SSRS?

Standard

It does affect SSRS adoption but SSRS (sql server reporting service) still has a place as long as there’s need for printer-friendly reporting and self-service vendors don’t have a good solution to meet this need.

Also, SSRS is great for automating operational reports that sends out emails with raw data (list of customers, products, sales transaction etc).

I advocate an analytics strategy where we think about satisfying data needs using “self-service”-first (Power BI, tableau, qlik) but if thats not the optimal solution (for cases like need to print it, I just need you to send me raw data in excel, etc) then I’ll mark it as SSRS project. And this architecture is supported by a central data model (aka operational data store, data mart, data warehouse) which makes it much easier to swap in/out any reporting tools that we need and we are not locked in by one vendor.

About 10–20% data requests that I see are SSRS projects and if the self-service platforms start adding features that compete with SSRS, I know I would start using those capabilities and phase out SSRS. But if that doesn’t happen, I will continue using SSRS 🙂

VIEW QUESTION ON QUORA


Let me know what you think in the comments section!

Paras Doshi

This post is sponsored by MockInterview.co, If you are looking for data science jobs, check out 75+ data science interview questions!

Any advice for moving into data science from business intelligence?

Standard

This was asked on Reddit: Any advice for moving into data science from business intelligence?

Here’s my answer:

I come from “Business Intelligence” background and currently work as Sr. Data Scientist. I found that you need two things to transition into data science:

Data Culture: A company where the data culture is such that managers/executives ask big questions that need a data science approach to solve it. If your end-consumers are still asking bunch of “what” questions then your company might NOT be ready for data science. But if your CEO comes to you and says “hey, I got the customer list with the info I asked for but can you help me understand which of these customers might churn next quarter?” — then you have a data science problem at hand. So, try to find companies that have this culture.

Skills: And you need to upgrade your skills to be able to solve data science problems. BI is focused too much on technology and automation and so may need to unlearn few things. For example: Automation is not always important since you might work on problems where a model is needed to predict just a couple of times. Trying to automate wouldn’t be optimal in that case. Also, BI relies heavily on tools but in Data science, you’ll need deeper domain knowledge & problem-solving approach along with technical skills.

Also, I personally moved from BI (as a consultant) -> Analytics (as Analytics Manager) -> Data science (Sr Data Scientist) and this has been super helpful for me. I recommend to transition into Analytics first and then eventually breaking into data science.

Hope that helps!

VIEW THREAD ON REDDIT

SQL: How to add Hierarchical Levels in your Query?

Standard

Tree-like structures are common in our world: Company Hierarchy, File System on your computer, Product category map among others so you might run into a task of creating a Hierarchical level in your SQL query — In this blog post, I will show you how you can do that using couple of approaches. These approaches can also be used to map “parent – child” relationships.

Two approaches are:

  1. When you know the Tree Depth
  2. When you don’t know the Tree Depth

SQL Hierarchical

#1: When you know tree-depth:

When you know the tree-depth (and if it’s not too deep) then you could consider simple CTE’s to come up with the Hierarchical Levels field in your query.

Let’s take an example:

Input:

EmployeeIDFirstNameLastNameTitleManagerID
1KenSánchezChief Executive OfficerNULL
16DavidBradleyMarketing Manager273
273BrianWelckerVice President of Sales1
274StephenJiangNorth American Sales Manager273
285SyedAbbasPacific Sales Manager273

Query: (On SQL Server)


with lvl1 as
(select
[EmployeeID]
,[FirstName]
,[LastName]
,[Title]
,[ManagerID]
,1 as Level
FROM [dbo].[employees]
where ManagerID is null
)
,
lvl2 as
(
select
[EmployeeID]
,[FirstName]
,[LastName]
,[Title]
,[ManagerID]
,2 as Level
FROM [dbo].[employees]
where ManagerID IN (Select EmployeeID from lvl1)
),
lvl3 as
(
select
[EmployeeID]
,[FirstName]
,[LastName]
,[Title]
,[ManagerID]
,3 as Level
FROM [dbo].[employees]
where ManagerID IN (Select EmployeeID from lvl2)
)
select * from lvl1
union
select * from lvl2
union
select * from lvl3

Output:

EmployeeIDFirstNameLastNameTitleManagerIDLevel
1KenSánchezChief Executive OfficerNULL1
273BrianWelckerVice President of Sales12
16DavidBradleyMarketing Manager2733
274StephenJiangNorth American Sales Manager2733
285SyedAbbasPacific Sales Manager2733

#2: When you do NOT know tree-depth:

In other words, if the tree is N-level deep then you are out of luck using option #1. In this case, you should consider the RECURSIVE CTE approach. Here’s an example:

Input: (with the idea that this table will grow over time)

EmployeeIDFirstNameLastNameTitleManagerID
1KenSánchezChief Executive OfficerNULL
16DavidBradleyMarketing Manager273
23MaryGibsonMarketing Specialist16
273BrianWelckerVice President of Sales1
274StephenJiangNorth American Sales Manager273
275MichaelBlytheSales Representative274
276LindaMitchellSales Representative274
285SyedAbbasPacific Sales Manager273
286LynnTsofliasSales Representative285

Query: (On SQL Server that supports Recursive CTE)


with HierarchyLvl as
(
SELECT [EmployeeID]
,[FirstName]
,[LastName]
,[Title]
,[ManagerID]
,1 as Level
FROM [dbo].[employees]
where ManagerID is null
UNION ALL
SELECT e.[EmployeeID]
,e.[FirstName]
,e.[LastName]
,e.[Title]
,e.[ManagerID]
,Level + 1
FROM [dbo].[employees] e INNER JOIN HierarchyLvl d on e.ManagerID = d.EmployeeID
)
select * from HierarchyLvl

 

Output:

EmployeeIDFirstNameLastNameTitleManagerIDLevel
1KenSánchezChief Executive OfficerNULL1
273BrianWelckerVice President of Sales12
16DavidBradleyMarketing Manager2733
274StephenJiangNorth American Sales Manager2733
285SyedAbbasPacific Sales Manager2733
286LynnTsofliasSales Representative2854
275MichaelBlytheSales Representative2744
276LindaMitchellSales Representative2744
23MaryGibsonMarketing Specialist164

Conclusion:

Even if I know tree-depth I will go with option #2 as it’s much easier to read and can accommodate future updates to the table. If you are interested in learning more about this and search for “Recursive Query using Common Table Expression” and you should find technical articles that talk about why it does what it does.

Hope this helps!

Data analytics vs. Data science vs. Business intelligence: what are the key differences/distinctions?

Standard

They are used interchangeably since all of them involve working with data to find actionable insights. But I like to differentiate them based on the type of the question you’re asking:

  • What:

What are my sales number for this quarter?

What is the profit for this year to date?

What are my sales number over the past 6 months?

What did the sales look like same quarter last year?

All of these questions are used to report on facts and tools that help you build data models and reports can be classified as “Business Intelligence” tools.

  • Why:

Why is my sales number higher for this quarter compared to last quarter?

Why are we seeing increase in sales over the past 6 months?

Why are we seeing decrease in profit over the past 6 months?

Why does the profit this quarter less compared to same quarter last year?

All of these questions try to figure why something happened? A data analyst typically takes a stab at this. He might use existing Business Intelligence platform to pull data and/or also merge other data sets. He/she then applies data analysis techniques on the data to answer the “why” question and help business user get to the actionable insight.

  • What’s next:

What will be my sales forecast for next year?

What will be our profit next year for Scenario A, B & C?

Which customers will cancel/churn next quarter?

Which new customers will convert to a high-value customer?

All of these questions try to “predict” what will happen next (based on historical data/patterns). Sometimes, you don’t know the questions in the first place so there’s a lot of pro-active thinking going on and usually a “data scientist” are doing that. Sometimes you start with a high level business problem and form “hypothesis” to drive your analysis. All of these can be classified under “data science”.

Now, as you can see as we progressed from What -> Why -> What’s next, the level of sophistication needed to do the analysis also increased. So you need a combination of people, process and technology platform in an organization to go from having a Business Intelligence maturity all the way to achieving data science capabilities.

Here’s a related blog post that I wrote on this a while back: Business Analytics Continuum: – Insight Extractor – Blog

Data Science

..And you can check out other stuff I write about here: Insight Extractor – Blog – Paras Doshi’s Blog on Analytics, Data Science & Business Intelligence.

VIEW QUESTION ON QUORA

Looker vs Tableau: How would you compare them in terms of price & capabilities?

Standard

[Update 6/10/2019: Looker has been acquired by Google and Tableau has been acquired by Salesforce]

Someone asked this on quora so here’s my response: This is a great question — one that I figured it out when I led Analytics at Kiva[.]Org last year so I am happy to add my perspective here on Looker vs Tableau:

Looker vs Tableau

Let’s talk about capabilities first and then price.

Capabilities — Looker vs Tableau

Even though both of these tools are classified under Business Intelligence, they have some pretty clear product differentiation so in this section, I will share that. I will share the three main components of Business Intelligence platforms and then map it back to core strengths of each product.

Business Intelligence platforms typically has three main components:

  1. Data Collection, Storage & Access
  2. Data Modeling
  3. Data Visualization

#1) Data Collection, Storage & Access: Both of these tools don’t do data collection & storage. You will need infrastructure to collect data and store it — typically it is stored in databases. And you can access data from databases using SQL. You will need to connect to these data sources from either of these tools and access data — Note that: On the surface, it might look like Tableau supports more data sources than Looker but there might be workaround to get your data into one of the data sources supported by Looker and take it from there and so I am not awarding extra points to Tableau for this. Also, I am personally a big proponent of using Analytic databases like redshift, vertica, bigquery & Azure DW for Analytics applications which Looker & Tableau both support so calling it a tie here!

#2) Data Modeling: This is Looker’s core strength by a wide margin! Why? This is because of their LookML which is their data modeling layer and I am super impressed by this after using it for a while now! So let’s chat about what data modeling layer means and why you should care.

Data modeling (in this context) means creating data models that take your raw data as input and then it’s cleaned, combined, curated & converted and made ready for data analysis.

Why is this important? Not everyone can clean, curate, combine & convert raw data into analysis-friendly data assets. That’s what data analysts are trained and specialize in. May in the future we will have tools that do that OR maybe we will see plug-and-play (aka turnkey) solutions for few key analysis needs but for now, you need data analysts that can create these data models.

Now there are two ways to create data models:

You can create them on-the-fly (ad-hoc) OR you can publish all of these data models on a platform (like Looker).

There are all sorts of issues with doing it on-the-fly — it works for small teams (<20–30 people) but more than that you need to have some process in place. For instance: You can’t automate data models that you need often so that’s wasted time, Also, you can’t share these models easily with others, creates a single point of failure and if the analyst person is sick or on vacation then no-one gets “insights” from data — the world stops spinning. Yada Yada Yada…So self-service is good after you have few business users who want to consume data.

So what does a self-service platform bring to the table? They help data analyst build these wonderful data-analysis friendly models and publish them so everyone who cares in an org can access it. So the consumer can focus on analysis part and not worry about doing the not-so-good part of making it ready for analysis. Also, this ensures all sorts of other benefits: standardized metric definitions, trusted data sources, better collaboration among analysts, speedier model-delivery process, get out of excel hell and what not!

Think of this way: If you have all key data model available on your self-service platform then your data analyst can focus on 1) advance stuff = more $$$ 2) building more data models (and so eventually they can do more advanced stuff later and more $$$!)

Looker vs Tableau

This is where Looker fits in. Looker is great at this data modeling thing — it’s platform is amazing for anyone looking to solve this problem. You can also do data visualization on top and build dashboards.

Alright, moving on:

#3) Data Visualization: This is Tableau’s forte! No one does data viz better than Tableau, at least right now. There are vendors that are investing significant resources on this and they are close but still Tableau is a leader in this space.

Having said that, let’s map it back to how it help business users & analysts:

Business users and self-service environments:

Tableau is not great at data modeling thing. Yes, you can do basic clean, combine, curate & convert thingy but it doesn’t work well with intermediate to advanced needs. So if you have a self-service data modeling layer already in place that Tableau can connect to and you are looking for a data visualization layer then go for Tableau! You would be able to create some amazing visuals, dashboards and stories that will WOW your business users! But to make sure this scales you need to seriously think about 1) how to overcome the limitations in tableau’s data modeling layer OR 2) use some other tool to build this data modeling layer and connect Tableau to it.

Pro Tip: I highly recommend trying out trials of these products and seeing what works best!

Analysts:

Tableau shines at data discovery! While this certainly helps business users, it’s best leveraged by analyst because whenever they are working on ad-hoc data analysis (one-time, strategic in nature) projects they can be much more effective and discover the underlying trends and patterns in their data by visualizing it using Tableau.

So with that context you might be wondering, What tool did I champion & Implement at Kiva?

This is public knowledge that Kiva is a Looker customer because it’s Logo is on their website so I can share this.

After evaluating about 30+ tools (including Tableau), I ended up championing and eventually leading the initial implementation sprints to implement Looker at Kiva because the goals & vision that we had for Kiva’s data & analytics platform aligned better with having the data modeling layer that met Kiva’s needs. So you need to figure out your goals and vision and then choose the tools with that framework.

Pro Tip #1: It’s insanely hard to figure out what your goals and vision for analytics in an org. To figure this out, you might want to chat with organizations in the same industry at the same size & stage and see what they use. Ask them about what they use and whether it worked for them. Ask them about their Return on investment. This is a great way to get external feedback but you still need to figure out internal needs and prioritize them.

Pro Tip #2: Both of these tools have amazing reviews! You will see them highly ranked in analyst reports too — this is great but it’s important that ever before to clearly define what your organization needs and then map it back to the core strengths of these products (or any other tool for that matter) and go from there!

[I am happy to help evaluate the right tool for you needs, feel free to contact me: Let’s Connect! – Insight Extractor – Blog ]

Pricing — Looker vs Tableau

I can’t talk about Looker’s pricing because it’s not public, I apologize! You need to contact them to get the quote.

However, you can anchor that with Tableau’s pricing which is public: Buy Tableau | Tableau Webstore

Your analyst and power users will need Tableau Desktop/Professional which is $1K and $2K respectively (one-time thing) and then depending on your deployment model: cloud or self-hosted — the price varies:

Looker Tableau Pricing

*Note that Tableau online is a subscription model so you can definitely start small. Let’s say 5 business users in a department and take it from there. If you grow then you can later look at other tools like Looker. (If you are rapidly growing, account for the non-trivial time needed to migrate from one platform to another and so it might be worth it to pick the right tool from the get-go)

Pro Tip: I will encourage you to think about building a ROI model too. You know use some analytics for your analytics projects 😉 — I apologize, couldn’t resist! Anyhow, the point is that instead of just thinking about the “cost”, think about the value-add and anchor your investment figure to that. There’s a reason some analytics tool are priced at let’s say $1000 vs some tools priced at $100,000 — both of them have different value proposition and if you know how to extract value of the tool and can project it then you can get better ROI!

Hope that helps! If I can be of any further help, email me or comment here! Let’s Connect! – Insight Extractor – Blog

VIEW QUESTION ON QUORA

Cheat Sheet to Pick the right graph or chart for your data:

Standard

I have two resources that I use sometimes to pick the right graph or chart for data visualization.

#1: Chart Suggestions:

chart data

#2: Online Tool

(By Juice Labs)

chart pick choose online tool

Introduction to SQL Performance Tuning for Data Analysts

Standard

Introduction:

SQL is a common language used by data analysts (and even business users!) for data analysis — one of the reasons is popular is because it’s not that hard to pick it up. Sure, there is some learning curve especially if you don’t have a computer programming background but once you learn some basic commands, you will be able to apply it and answer a lot of questions. So it does give you lot of power! But sometimes you run into issues where your SQL queries are taking forever to complete and you wonder why that’s the case. In this post, I am going to introduce you to performance tuning that will help to troubleshoot next time you run into performance problems.

Performance Tuning SQL

Performance Tuning SQL

Performance tuning Hierarchy

your queries are slow due to one of the three reasons listed below:

#1: SQL Query optimization

#2: Database software, environment & optimization

#3. Hardware

You should start at Level 1 which is query optimization and then work your way down to other levels. This post will focus on SQL Query optimization as that is something you can control and it is also the most common root cause. Let’s focus on this first and then we will explore other options.

Performance Tuning SQL Queries

Depending on your skill level, you can look at a lot of things. But for the purpose of this blog post lets say you have beginner – intermediate SQL Knowledge and with that, you can look at following things:

  1. Size of your tables: If you are querying tables with millions of rows then that is going to slow down your queries. You can start off with limiting the amount of rows that you are working with using SQL clauses like LIMIT/TOP (depends on database system you are using) that will reduce the number of rows that database works with. This is great for exploratory analysis where you don’t need to look at all rows. Also, you should consider using “where” clause when it’s applicable. So let’s say you just care about a particular product category OR a particular product category then put where clauses where it’s applicable.
  2. Complex joins & aggregations: If you trying to join tables in such a way that returns a large number of rows then it’s going to be slow! If possible you can apply step 1 (limit your rows) to this as well — so let’s say you have two tables that are trying to join but you don’t need everything from table 1 then consider putting where clause on table 1. That would help. Also, if you are using aggregations along with joining tables then you could consider doing aggregations on individual tables first and put them in a subquery and then use them to join with other tables. Here’s an example where you are aggregating the table before joining them with other tables:

    SELECT cat.product_category,
    sub.*
    FROM (
    SELECT p.product_name,
    COUNT(*) AS products
    FROM Product p
    GROUP BY 1
    ) sub
    JOIN ProductCategory cat
    ON cat.product_name = sub.product_name

  3. Query plan: Also, how do you know which statement is a bottleneck in your queries. You could consider running them individually and see but even if that doesn’t help then you can use something called a “Query plan” — depending on the database system you are using the commands can differ but you can try searching the help section for that. It’s called Query/Execution plans and they help you see the order in which the query will be executed. Also, it will have “estimated” time to run stats (which may or may not be accurate) but still good starting point to see how long it might take for complex queries especially as you make changes, you can continuously evaluate without having to run the query. There’s a bit of learning curve on how to read execution/query plans but they are a great way to check the bottlenecks in a query that you wrote. You can try using a command like “EXPLAIN” before your query and check your database help section to see if that’s not the command.

Database software, Environment, Optimization & Hardware:

Let’s say you have tried everything you could to tune your SQL queries then it’s move to explore other options:

#2: Database software, environment & optimization: This is usually owned by Dev Ops or IT team and you will have to work with them. Depending on your team size, there might be a DBA, System Admin or DevOps engineer responsible for this. Here’s few things you should check out along with the IT team:

  • Are there a lot of users running queries at the same time?
  • Upgrade to database software and/or perform database optimization (e.g. indexing)
  • Consider evaluating a database that supports analytics better like Vertica, Redshift, Teradata, Google BigQuery, Microsft Azure SQL Data Warehouse among others — if you have 20+ users hitting a database for querying and have tables with 25M+ rows then this is worth evaluating! Your mileage may vary (depending on your hardware) but I am sharing these thresholds just so you have a starting point. These databases are different in architecture compared to something like MySQL which will bog down as you start to scale analytics in your org.
  • Are you querying a production database that also used to support other apps? If so, consider requesting a copy of a database to work with. IT should be able to set up a copy that gets refreshed let’s say nightly and that should help. You should then route all SQL users to this database and restrict access to production database.

#3: Hardware:

Since database is a software, it is constrained by the resources that it is allocated at hardware level just like any other software. You should dive deeper into this if #1 & #2 don’t work out — This is not the most common root-cause but as a rule of thumb, you should be scaling your hardware resources as other systems are scaled too. if that’s not done regularly then you will hit hardware issues. Also, don’t just upgrade your hardware, as I referred to earlier in #2, consider looking into databases that are better for analytics like vertica, redshift, bigquery etc. Compare all options & do an ROI analysis as upgrading hardware is usually a “duct-tape” solution and you will run into it again if you continue to grow.

Conclusion:

So you now have a framework which should help you when you run into SQL performance problems! Now it’s your turn, I would love to hear about what you did when you ran into performance problems in any of your data analysis project.