SQL is a common language used by data analysts (and even business users!) for data analysis — one of the reasons is popular is because it’s not that hard to pick it up. Sure, there is some learning curve especially if you don’t have a computer programming background but once you learn some basic commands, you will be able to apply it and answer a lot of questions. So it does give you lot of power! But sometimes you run into issues where your SQL queries are taking forever to complete and you wonder why that’s the case. In this post, I am going to introduce you to performance tuning that will help to troubleshoot next time you run into performance problems.
Performance tuning Hierarchy
your queries are slow due to one of the three reasons listed below:
#1: SQL Query optimization
#2: Database software, environment & optimization
You should start at Level 1 which is query optimization and then work your way down to other levels. This post will focus on SQL Query optimization as that is something you can control and it is also the most common root cause. Let’s focus on this first and then we will explore other options.
Performance Tuning SQL Queries
Depending on your skill level, you can look at a lot of things. But for the purpose of this blog post lets say you have beginner – intermediate SQL Knowledge and with that, you can look at following things:
- Size of your tables: If you are querying tables with millions of rows then that is going to slow down your queries. You can start off with limiting the amount of rows that you are working with using SQL clauses like LIMIT/TOP (depends on database system you are using) that will reduce the number of rows that database works with. This is great for exploratory analysis where you don’t need to look at all rows. Also, you should consider using “where” clause when it’s applicable. So let’s say you just care about a particular product category OR a particular product category then put where clauses where it’s applicable.
- Complex joins & aggregations: If you trying to join tables in such a way that returns a large number of rows then it’s going to be slow! If possible you can apply step 1 (limit your rows) to this as well — so let’s say you have two tables that are trying to join but you don’t need everything from table 1 then consider putting where clause on table 1. That would help. Also, if you are using aggregations along with joining tables then you could consider doing aggregations on individual tables first and put them in a subquery and then use them to join with other tables. Here’s an example where you are aggregating the table before joining them with other tables:
COUNT(*) AS products
FROM Product p
GROUP BY 1
JOIN ProductCategory cat
ON cat.product_name = sub.product_name
- Query plan: Also, how do you know which statement is a bottleneck in your queries. You could consider running them individually and see but even if that doesn’t help then you can use something called a “Query plan” — depending on the database system you are using the commands can differ but you can try searching the help section for that. It’s called Query/Execution plans and they help you see the order in which the query will be executed. Also, it will have “estimated” time to run stats (which may or may not be accurate) but still good starting point to see how long it might take for complex queries especially as you make changes, you can continuously evaluate without having to run the query. There’s a bit of learning curve on how to read execution/query plans but they are a great way to check the bottlenecks in a query that you wrote. You can try using a command like “EXPLAIN” before your query and check your database help section to see if that’s not the command.
Database software, Environment, Optimization & Hardware:
Let’s say you have tried everything you could to tune your SQL queries then it’s move to explore other options:
#2: Database software, environment & optimization: This is usually owned by Dev Ops or IT team and you will have to work with them. Depending on your team size, there might be a DBA, System Admin or DevOps engineer responsible for this. Here’s few things you should check out along with the IT team:
- Are there a lot of users running queries at the same time?
- Upgrade to database software and/or perform database optimization (e.g. indexing)
- Consider evaluating a database that supports analytics better like Vertica, Redshift, Teradata, Google BigQuery, Microsft Azure SQL Data Warehouse among others — if you have 20+ users hitting a database for querying and have tables with 25M+ rows then this is worth evaluating! Your mileage may vary (depending on your hardware) but I am sharing these thresholds just so you have a starting point. These databases are different in architecture compared to something like MySQL which will bog down as you start to scale analytics in your org.
- Are you querying a production database that also used to support other apps? If so, consider requesting a copy of a database to work with. IT should be able to set up a copy that gets refreshed let’s say nightly and that should help. You should then route all SQL users to this database and restrict access to production database.
Since database is a software, it is constrained by the resources that it is allocated at hardware level just like any other software. You should dive deeper into this if #1 & #2 don’t work out — This is not the most common root-cause but as a rule of thumb, you should be scaling your hardware resources as other systems are scaled too. if that’s not done regularly then you will hit hardware issues. Also, don’t just upgrade your hardware, as I referred to earlier in #2, consider looking into databases that are better for analytics like vertica, redshift, bigquery etc. Compare all options & do an ROI analysis as upgrading hardware is usually a “duct-tape” solution and you will run into it again if you continue to grow.
So you now have a framework which should help you when you run into SQL performance problems! Now it’s your turn, I would love to hear about what you did when you ran into performance problems in any of your data analysis project.