Cheat Sheet to Pick the right graph or chart for your data:

Standard

I have two resources that I use sometimes to pick the right graph or chart for data visualization.

#1: Chart Suggestions:

chart data

#2: Online Tool

(By Juice Labs)

chart pick choose online tool

How can I start learning and exploring the field of Big Data Algorithms?

Standard

Someone asked this on Quora about how to learn & explore the field of Big Data Algorithms? Also, mentioned having some background in python already and wanted ideas to work on a good project so with that context, here is my reply:

There are two broad roles available in Data/Big-Data world:

  1. Engineering-oriented: Date engineers, Data Warehousing specialists, Big Data engineer, Business Intelligence engineer— all of these roles are focused on building that data pipeline using code/tools to get the data in some centralized location
  2. Business-oriented: Data Analyst, Data scientist — all of these roles involve using data (from those centralized sources) and helping business leaders make better decisions. *

*smaller companies (or startups) tend to have roles where small teams(or just one person) do it all so the distinction is not that apparent.

big data

Now given your background in python and programming, you might be a great fit for “Data engineer” roles and I would recommend learning about Apache spark (since you can use python code) and start building data pipelines. As you work with a little bit more than you can learn about how to build and deploy end-to-end machine learning projects with python & Apache spark. If you acquire these skills and keep learning — then I am sure you will end up with a good project.

Hope that helped and good luck!

VIEW QUESTION ON QUORA

Where do data scientist hang out online?

Standard

There are few places that I can think of where data scientist hang out online:

  1. Social media: Twitter, Youtube
  2. Q&A: Quora, Reddit (DS threads), Stats.StackExchange
  3. Competition sites: Kaggle, Analytics Vidhya
  4. Blogs: KDNuggets, DataTau
  5. MOOC’s: Coursera, Udacity, Springboard, edx, datacamp

VIEW QUESTION ON QUORA

What analytics data gives you the most actionable advice to improve your blog?

Standard

Someone asked on Quora: What analytics data gives you the most actionable advice to improve your blog? so here’s my answer:

I have been blogging about Analytics for past few years and this question is at the intersection of both so let me give it a shot:

It depends on two things: 1) Your goal for running the blog 2) Age of the blog

#1: Your goal

why blog analytics

First let’s talk about your goal for running the blog. It’s important to define this as this would help set the metrics that you will monitor and take actions to improve it.

Let’s say that the goal of your blog is to earn is to monetize using ads. So your key performance indicator (KPI) will be monthly ad revenue. In that case you can improve by one of the three things: Number of People visiting the blog x % of visitors clicking on ads x average revenue per ad click. You can work on marketing your blog to increase number of people visiting the blog. Then you can work on ad placement on your blog to increase % of visitors clicking the ad and then you can work on trying different ad networks to see which one pays you the most per click.

let’s take one more example. Like me if your goal is to use your blog for “exposure” which helps me build credibility in the field that I work in. In this case, the KPI i look at is Monthly New Visitors. I drill down further to see which marketing channels are driving that change. That helps me identify channels that I can double down on and reduce investments in other areas. For example: I found that Social is not performing that great but Search has been working great — I started investing time in following SEO principles and spent less time on posting on social.

So first step: Define your goal and your KPI needs to align with that.

#2: Age of your blog:

  • Early: Now at this stage, you will need to explore whether you can achieve what you set out to using blogging. So let’s say you wanted to earn money online. In first few weeks/months, you need to figure out if it’s possible. Can you get enough traffic to earn what you wanted? yes? Great! If not, blogging might not be the answer and eventually all your energy is being wasted. Figure this out sooner rather than later — and take first few weeks/months to make sure blogging helps you achieve your goal.
  • Mid: By this stage, you should know how blogging is helping you achieve your goal. So it’s time to pick one metric that matters! So if your goal was to earn money using ads then go for Monthly ad revenue and set up systems to track this. Google Analytics will be a great starting point. Also, at this stage, you should be asking for qualitative feedback. Ask your friends, ask on social, get comments, do guest blogging on popular platforms and see if you get engagement — basically focus on qualitative feedback since you won’t have enough visitors that you can analyze quantitative data.
  • Late: In this stage, you have the data and the blog is starting to get momentum. Don’t stop qualitative feedback loops but now start looking at quantitative data too. Figure out the underlying driving forces that move the needle on your KPI. Focus on improving those!

TL;DR: Define your “why” and then pick a metric— then use combination of qualitative and quantitative data to improve the underlying driving factors to improve the metric.

VIEW ON QUORA

Introduction to SQL Performance Tuning for Data Analysts

Standard

Introduction:

SQL is a common language used by data analysts (and even business users!) for data analysis — one of the reasons is popular is because it’s not that hard to pick it up. Sure, there is some learning curve especially if you don’t have a computer programming background but once you learn some basic commands, you will be able to apply it and answer a lot of questions. So it does give you lot of power! But sometimes you run into issues where your SQL queries are taking forever to complete and you wonder why that’s the case. In this post, I am going to introduce you to performance tuning that will help to troubleshoot next time you run into performance problems.

Performance Tuning SQL

Performance Tuning SQL

Performance tuning Hierarchy

your queries are slow due to one of the three reasons listed below:

#1: SQL Query optimization

#2: Database software, environment & optimization

#3. Hardware

You should start at Level 1 which is query optimization and then work your way down to other levels. This post will focus on SQL Query optimization as that is something you can control and it is also the most common root cause. Let’s focus on this first and then we will explore other options.

Performance Tuning SQL Queries

Depending on your skill level, you can look at a lot of things. But for the purpose of this blog post lets say you have beginner – intermediate SQL Knowledge and with that, you can look at following things:

  1. Size of your tables: If you are querying tables with millions of rows then that is going to slow down your queries. You can start off with limiting the amount of rows that you are working with using SQL clauses like LIMIT/TOP (depends on database system you are using) that will reduce the number of rows that database works with. This is great for exploratory analysis where you don’t need to look at all rows. Also, you should consider using “where” clause when it’s applicable. So let’s say you just care about a particular product category OR a particular product category then put where clauses where it’s applicable.
  2. Complex joins & aggregations: If you trying to join tables in such a way that returns a large number of rows then it’s going to be slow! If possible you can apply step 1 (limit your rows) to this as well — so let’s say you have two tables that are trying to join but you don’t need everything from table 1 then consider putting where clause on table 1. That would help. Also, if you are using aggregations along with joining tables then you could consider doing aggregations on individual tables first and put them in a subquery and then use them to join with other tables. Here’s an example where you are aggregating the table before joining them with other tables:

    SELECT cat.product_category,
    sub.*
    FROM (
    SELECT p.product_name,
    COUNT(*) AS products
    FROM Product p
    GROUP BY 1
    ) sub
    JOIN ProductCategory cat
    ON cat.product_name = sub.product_name

  3. Query plan: Also, how do you know which statement is a bottleneck in your queries. You could consider running them individually and see but even if that doesn’t help then you can use something called a “Query plan” — depending on the database system you are using the commands can differ but you can try searching the help section for that. It’s called Query/Execution plans and they help you see the order in which the query will be executed. Also, it will have “estimated” time to run stats (which may or may not be accurate) but still good starting point to see how long it might take for complex queries especially as you make changes, you can continuously evaluate without having to run the query. There’s a bit of learning curve on how to read execution/query plans but they are a great way to check the bottlenecks in a query that you wrote. You can try using a command like “EXPLAIN” before your query and check your database help section to see if that’s not the command.

Database software, Environment, Optimization & Hardware:

Let’s say you have tried everything you could to tune your SQL queries then it’s move to explore other options:

#2: Database software, environment & optimization: This is usually owned by Dev Ops or IT team and you will have to work with them. Depending on your team size, there might be a DBA, System Admin or DevOps engineer responsible for this. Here’s few things you should check out along with the IT team:

  • Are there a lot of users running queries at the same time?
  • Upgrade to database software and/or perform database optimization (e.g. indexing)
  • Consider evaluating a database that supports analytics better like Vertica, Redshift, Teradata, Google BigQuery, Microsft Azure SQL Data Warehouse among others — if you have 20+ users hitting a database for querying and have tables with 25M+ rows then this is worth evaluating! Your mileage may vary (depending on your hardware) but I am sharing these thresholds just so you have a starting point. These databases are different in architecture compared to something like MySQL which will bog down as you start to scale analytics in your org.
  • Are you querying a production database that also used to support other apps? If so, consider requesting a copy of a database to work with. IT should be able to set up a copy that gets refreshed let’s say nightly and that should help. You should then route all SQL users to this database and restrict access to production database.

#3: Hardware:

Since database is a software, it is constrained by the resources that it is allocated at hardware level just like any other software. You should dive deeper into this if #1 & #2 don’t work out — This is not the most common root-cause but as a rule of thumb, you should be scaling your hardware resources as other systems are scaled too. if that’s not done regularly then you will hit hardware issues. Also, don’t just upgrade your hardware, as I referred to earlier in #2, consider looking into databases that are better for analytics like vertica, redshift, bigquery etc. Compare all options & do an ROI analysis as upgrading hardware is usually a “duct-tape” solution and you will run into it again if you continue to grow.

Conclusion:

So you now have a framework which should help you when you run into SQL performance problems! Now it’s your turn, I would love to hear about what you did when you ran into performance problems in any of your data analysis project.