If you create bunch of reports and help answer what happened— then try to help business users with why it happened. [Example: Instead of just sending website traffic info, add why the traffic spikes (up/downs) are happening]
If you are working on building bunch of models that answer why questions then try to help build predictive models next [Example: You have been working on a model that helped you answer why customers churned. Now built upon that and predict which customers will churn next]
If you do analytics and data science well and are already answering what, why, what’s next questions and you’re killing it! Then figure out how can you help business owners take action. Or make it easier than ever before to take actions on your data/recommendations.
Other answers for questions are directly/indirectly covered if you do this:
You will have to pick the right tool for the job
You will have to continuously keep learning (by taking online courses and/or you-tube)
Don’t just be a data analyst, be a thought partner to business owners and if possible, transition into role that help you own business outcomes.
SQL is a common language used by data analysts (and even business users!) for data analysis — one of the reasons is popular is because it’s not that hard to pick it up. Sure, there is some learning curve especially if you don’t have a computer programming background but once you learn some basic commands, you will be able to apply it and answer a lot of questions. So it does give you lot of power! But sometimes you run into issues where your SQL queries are taking forever to complete and you wonder why that’s the case. In this post, I am going to introduce you to performance tuning that will help to troubleshoot next time you run into performance problems.
Performance Tuning SQL
Performance tuning Hierarchy
your queries are slow due to one of the three reasons listed below:
#1: SQL Query optimization
#2: Database software, environment & optimization
You should start at Level 1 which is query optimization and then work your way down to other levels. This post will focus on SQL Query optimization as that is something you can control and it is also the most common root cause. Let’s focus on this first and then we will explore other options.
Performance Tuning SQL Queries
Depending on your skill level, you can look at a lot of things. But for the purpose of this blog post lets say you have beginner – intermediate SQL Knowledge and with that, you can look at following things:
Size of your tables: If you are querying tables with millions of rows then that is going to slow down your queries. You can start off with limiting the amount of rows that you are working with using SQL clauses like LIMIT/TOP (depends on database system you are using) that will reduce the number of rows that database works with. This is great for exploratory analysis where you don’t need to look at all rows. Also, you should consider using “where” clause when it’s applicable. So let’s say you just care about a particular product category OR a particular product category then put where clauses where it’s applicable.
Complex joins & aggregations: If you trying to join tables in such a way that returns a large number of rows then it’s going to be slow! If possible you can apply step 1 (limit your rows) to this as well — so let’s say you have two tables that are trying to join but you don’t need everything from table 1 then consider putting where clause on table 1. That would help. Also, if you are using aggregations along with joining tables then you could consider doing aggregations on individual tables first and put them in a subquery and then use them to join with other tables. Here’s an example where you are aggregating the table before joining them with other tables:
SELECT cat.product_category, sub.* FROM ( SELECT p.product_name, COUNT(*) AS products FROM Product p GROUP BY 1 ) sub JOIN ProductCategory cat ON cat.product_name = sub.product_name
Query plan: Also, how do you know which statement is a bottleneck in your queries. You could consider running them individually and see but even if that doesn’t help then you can use something called a “Query plan” — depending on the database system you are using the commands can differ but you can try searching the help section for that. It’s called Query/Execution plans and they help you see the order in which the query will be executed. Also, it will have “estimated” time to run stats (which may or may not be accurate) but still good starting point to see how long it might take for complex queries especially as you make changes, you can continuously evaluate without having to run the query. There’s a bit of learning curve on how to read execution/query plans but they are a great way to check the bottlenecks in a query that you wrote. You can try using a command like “EXPLAIN” before your query and check your database help section to see if that’s not the command.
Let’s say you have tried everything you could to tune your SQL queries then it’s move to explore other options:
#2: Database software, environment & optimization: This is usually owned by Dev Ops or IT team and you will have to work with them. Depending on your team size, there might be a DBA, System Admin or DevOps engineer responsible for this. Here’s few things you should check out along with the IT team:
Are there a lot of users running queries at the same time?
Upgrade to database software and/or perform database optimization (e.g. indexing)
Consider evaluating a database that supports analytics better like Vertica, Redshift, Teradata, Google BigQuery, Microsft Azure SQL Data Warehouse among others — if you have 20+ users hitting a database for querying and have tables with 25M+ rows then this is worth evaluating! Your mileage may vary (depending on your hardware) but I am sharing these thresholds just so you have a starting point. These databases are different in architecture compared to something like MySQL which will bog down as you start to scale analytics in your org.
Are you querying a production database that also used to support other apps? If so, consider requesting a copy of a database to work with. IT should be able to set up a copy that gets refreshed let’s say nightly and that should help. You should then route all SQL users to this database and restrict access to production database.
Since database is a software, it is constrained by the resources that it is allocated at hardware level just like any other software. You should dive deeper into this if #1 & #2 don’t work out — This is not the most common root-cause but as a rule of thumb, you should be scaling your hardware resources as other systems are scaled too. if that’s not done regularly then you will hit hardware issues. Also, don’t just upgrade your hardware, as I referred to earlier in #2, consider looking into databases that are better for analytics like vertica, redshift, bigquery etc. Compare all options & do an ROI analysis as upgrading hardware is usually a “duct-tape” solution and you will run into it again if you continue to grow.
So you now have a framework which should help you when you run into SQL performance problems! Now it’s your turn, I would love to hear about what you did when you ran into performance problems in any of your data analysis project.
This question was asked on Quora and here’s my answer:
I will list resources broken down by three categories.
Business Knowledge: As a data analyst, you need to have at least basic knowledge of business areas that you are helping with. For example: if you are doing Marketing Analytics then you need to understand basic concepts in marketing and that will make you more effective. You can do so one of the three ways:
On-the-job: Pick up knowledge by interacting with business people and using internal knowledge bases.
Problem structuring: If you are able to break down the problem into core components to identify root cause, you will not only increase your speed to insight but your structure will also help you communicate it more effectively. Learn to break down your problems and use that in communicating your data analysis approach. Imagine this list without the three high-level categories — wouldn’t it look like I am throwing random resources at you? By giving it a structure — Tech, Biz, Communication, I am not only able to structure it but also communicate it to you more effectively. More here: Structure your Thoughts – Bootcamp | PrepLounge.com
Figure out where (location) you want to work and who (company) you want to work for.
Note the “skills” required in job Descriptions at companies in your desired location(s) > find common themes from job descriptions > Pick up those skills if you don’t have them already!
Getting a job is a function of Number of Job Applications and your conversion rate (Offers Received/#of Job Applications). Optimizing # of Job Applications is easy — you just need to apply to as many jobs as you could. To improve conversion rate, you would need to do number of things: clear HR/Culture-fit rounds, clear TECH rounds, create a portfolio of projects to talk about, etc.
You could also consider applying for internships to get experience. This should help you land full-time roles.
It depends on how the Analytics & Data Science team is structured in an org but usually you will see following trend:
“Big Data Developer” usually rolls up under the Engineering org. They are responsible for building the data pipelines that feed data to the “data platform” — they use things like Hadoop, Spark, Custom Code, ETL tools, etc to build data pipelines and are responsible developing and maintaining the data platform. And to succeed in this role you need to have deep technical chops. Other titles for this role: Data engineer, Software engineer, etc.
“Data Analyst” usually rolls up under some “business” team like strategy, operations, growth, product, marketing, sales, etc. Data Analyst are the link between the “data platform” and the “business” — these guys are primary consumer of the “data platform” (sometimes you might see shared ownership of data platform between engineering and analytics). They help solve business problems using data and pull data from the “data platform”. These guys need to have a good balance between business and technical skills to be successful in this role.
When I hire for Data Analyst (Jr. or Intern) positions, I look for three things:
1) Analytical mindset:
I would do this by sharing a hypothetical case study and seeing how you go about solving this. I would look for things like: a) Approach: How do you break down the problem? b) Effectiveness: How effectively can go about solving the case. I am NOT looking for the “Right” answer but just want to see how you go about solving the case.
(Search for “Management consulting case studies” — I usually pick a simple case)
2) Communication skills:
This is pretty standard across many roles but it’s important for data analysts to be able to communicate their recommendations/findings to stakeholders.
3) Basic hard/tech skills + Willingness to learn new tech skills:
I would ask you basic tech questions around SQL, Excel OR other “tech skills” that you might have mentioned in your resume. I am not looking for expert-level knowledge but just want to make sure you know things that you have listed on your resume or things that you studied. Also, I would ask you questions that would help me figure out whether you are open to learning new tech skills.
So now that I have shared the framework with you, let me try and answer your question: How do I answer the most challenging data analysis project that I have done?
a. If you had a good approach for your project then It would mean that you know how to break down data analysis problems and solve them. So solving a basic case study shouldn’t be difficult for you and I could check box #1!
b. If you can communicate the “complexity” of the project effectively then I think I would check the box #2: communication skills!
c. Since you solved a challenging project, I assume that you picked up some tech skills (Bonus points if you picked up new tech skills while solving this problem). Just let me know what tech you used to solve the problem so that I can ask questions around that — if you are able to answer them then I would check box #3!
It’s NOT about the challenging project but your learning/takeaways from that project that will be help you the most!
Now, assuming that the interview team think you are a good “culture fit” plus you came out on top compared to other candidates then you will get an offer to join the team as a Data Analyst!
Power Query is amazing! It takes the data analysis capabilities of Excel to whole new level! In this post, I am going to share three reasons:
1. it enables repeatable mash-up of data!
Have you every had to do your data analysis tasks repeatedly on the data with same structure? Do you get “new” data every other week and need to go through the same data transformation workflow to get to the data that you need?
What’s the solution? Well, you can look at MACRO’s! Or you can request your IT department to create a Business Intelligence platform. However, what if you need to modify your data mashup workflow then these solutions don’t look great, do they now?
Don’t worry! Power Query is here!
It enables repeatable mashup of data like you might have never seen before! You need to try it to believe.
It’s very easy to input new data to Power Query and it enables you to retrieve final output based on new data using a “refresh” feature.
Each data-mashup is recorded as steps which you can go back and edit if you need to.
2. It’s super-flexible!
Any data mashup performed using Power Query is expressed using its formula language called “M”. You can edit the code if you need to and as you can imagine such a platform enables much-needed flexibility for the analyst’s.
3. It has awesome advance features!
Do you want to Merge data? How about Join? Are you tired with VLOOKUP’s! Don’t worry! it’s super easy with Power Query! Here’s a post: Join Excel Tables in Power Query
How about searching for online & open data sets? Done!
How about connecting to data sources that “Data” section of Excel doesn’t support yet? (Example: Facebook) – DONE! Power Query makes that happen for you.
And That’s not a complete list!
Plus you can unlock the “Power” (pun intended) of Power Query by using it with other tools in Power BI Stack. (Power Pivot, Power View, etc…) OR you can use the your final output from Power Query with other tools too! After all it’s an excel file.
If you haven’t already then check out Power Query! it’s free and works with Excel 2010 and above.