Is it too late to become a good Data Scientist?

Standard

If you’re looking for career change, that’s never too late!

If you’re looking to learn something new, that’s never too late!

If you’re looking to continue learning and go deeper in data science, that’s never too late!

If you don’t like Software engineering and want to switch to something else, that’s never too late!

But if you are after the “Data Science” gold rush, then you did miss the first wave! You are late.

But seriously, you should apply first-principles thinking to your career strategy and ideally not jump to whatever’s “hot” because by the time you get on that train, it’s usually too late.

VIEW QUESTION ON QUORA

How do I learn #SQL for #data analysis?

Standard

Step 1:

This is a good starting point: SQL School Table of Contents

OR, this: Learn SQL

Both of these resources were put together by analytics vendor and is targeted towards beginners.

Step 2:

Review this Quora Thread: How do I learn SQL?

Participate in competitions like this: Solve SQL Code Challenges

Step 3:

If you like to go more in-depth then check out few books:

  1. Head First SQL
  2. Learn SQL the hard Way
  3. Certification books/material from a database vendor

Hope that helps!

VIEW QUESTION ON QUORA

SQL: How to add Hierarchical Levels in your Query?

Standard

Tree-like structures are common in our world: Company Hierarchy, File System on your computer, Product category map among others so you might run into a task of creating a Hierarchical level in your SQL query — In this blog post, I will show you how you can do that using couple of approaches. These approaches can also be used to map “parent – child” relationships.

Two approaches are:

  1. When you know the Tree Depth
  2. When you don’t know the Tree Depth

SQL Hierarchical

#1: When you know tree-depth:

When you know the tree-depth (and if it’s not too deep) then you could consider simple CTE’s to come up with the Hierarchical Levels field in your query.

Let’s take an example:

Input:

EmployeeIDFirstNameLastNameTitleManagerID
1KenSánchezChief Executive OfficerNULL
16DavidBradleyMarketing Manager273
273BrianWelckerVice President of Sales1
274StephenJiangNorth American Sales Manager273
285SyedAbbasPacific Sales Manager273

Query: (On SQL Server)


with lvl1 as
(select
[EmployeeID]
,[FirstName]
,[LastName]
,[Title]
,[ManagerID]
,1 as Level
FROM [dbo].[employees]
where ManagerID is null
)
,
lvl2 as
(
select
[EmployeeID]
,[FirstName]
,[LastName]
,[Title]
,[ManagerID]
,2 as Level
FROM [dbo].[employees]
where ManagerID IN (Select EmployeeID from lvl1)
),
lvl3 as
(
select
[EmployeeID]
,[FirstName]
,[LastName]
,[Title]
,[ManagerID]
,3 as Level
FROM [dbo].[employees]
where ManagerID IN (Select EmployeeID from lvl2)
)
select * from lvl1
union
select * from lvl2
union
select * from lvl3

Output:

EmployeeIDFirstNameLastNameTitleManagerIDLevel
1KenSánchezChief Executive OfficerNULL1
273BrianWelckerVice President of Sales12
16DavidBradleyMarketing Manager2733
274StephenJiangNorth American Sales Manager2733
285SyedAbbasPacific Sales Manager2733

#2: When you do NOT know tree-depth:

In other words, if the tree is N-level deep then you are out of luck using option #1. In this case, you should consider the RECURSIVE CTE approach. Here’s an example:

Input: (with the idea that this table will grow over time)

EmployeeIDFirstNameLastNameTitleManagerID
1KenSánchezChief Executive OfficerNULL
16DavidBradleyMarketing Manager273
23MaryGibsonMarketing Specialist16
273BrianWelckerVice President of Sales1
274StephenJiangNorth American Sales Manager273
275MichaelBlytheSales Representative274
276LindaMitchellSales Representative274
285SyedAbbasPacific Sales Manager273
286LynnTsofliasSales Representative285

Query: (On SQL Server that supports Recursive CTE)


with HierarchyLvl as
(
SELECT [EmployeeID]
,[FirstName]
,[LastName]
,[Title]
,[ManagerID]
,1 as Level
FROM [dbo].[employees]
where ManagerID is null
UNION ALL
SELECT e.[EmployeeID]
,e.[FirstName]
,e.[LastName]
,e.[Title]
,e.[ManagerID]
,Level + 1
FROM [dbo].[employees] e INNER JOIN HierarchyLvl d on e.ManagerID = d.EmployeeID
)
select * from HierarchyLvl

 

Output:

EmployeeIDFirstNameLastNameTitleManagerIDLevel
1KenSánchezChief Executive OfficerNULL1
273BrianWelckerVice President of Sales12
16DavidBradleyMarketing Manager2733
274StephenJiangNorth American Sales Manager2733
285SyedAbbasPacific Sales Manager2733
286LynnTsofliasSales Representative2854
275MichaelBlytheSales Representative2744
276LindaMitchellSales Representative2744
23MaryGibsonMarketing Specialist164

Conclusion:

Even if I know tree-depth I will go with option #2 as it’s much easier to read and can accommodate future updates to the table. If you are interested in learning more about this and search for “Recursive Query using Common Table Expression” and you should find technical articles that talk about why it does what it does.

Hope this helps!

Introduction to SQL Performance Tuning for Data Analysts

Standard

Introduction:

SQL is a common language used by data analysts (and even business users!) for data analysis — one of the reasons is popular is because it’s not that hard to pick it up. Sure, there is some learning curve especially if you don’t have a computer programming background but once you learn some basic commands, you will be able to apply it and answer a lot of questions. So it does give you lot of power! But sometimes you run into issues where your SQL queries are taking forever to complete and you wonder why that’s the case. In this post, I am going to introduce you to performance tuning that will help to troubleshoot next time you run into performance problems.

Performance Tuning SQL

Performance Tuning SQL

Performance tuning Hierarchy

your queries are slow due to one of the three reasons listed below:

#1: SQL Query optimization

#2: Database software, environment & optimization

#3. Hardware

You should start at Level 1 which is query optimization and then work your way down to other levels. This post will focus on SQL Query optimization as that is something you can control and it is also the most common root cause. Let’s focus on this first and then we will explore other options.

Performance Tuning SQL Queries

Depending on your skill level, you can look at a lot of things. But for the purpose of this blog post lets say you have beginner – intermediate SQL Knowledge and with that, you can look at following things:

  1. Size of your tables: If you are querying tables with millions of rows then that is going to slow down your queries. You can start off with limiting the amount of rows that you are working with using SQL clauses like LIMIT/TOP (depends on database system you are using) that will reduce the number of rows that database works with. This is great for exploratory analysis where you don’t need to look at all rows. Also, you should consider using “where” clause when it’s applicable. So let’s say you just care about a particular product category OR a particular product category then put where clauses where it’s applicable.
  2. Complex joins & aggregations: If you trying to join tables in such a way that returns a large number of rows then it’s going to be slow! If possible you can apply step 1 (limit your rows) to this as well — so let’s say you have two tables that are trying to join but you don’t need everything from table 1 then consider putting where clause on table 1. That would help. Also, if you are using aggregations along with joining tables then you could consider doing aggregations on individual tables first and put them in a subquery and then use them to join with other tables. Here’s an example where you are aggregating the table before joining them with other tables:

    SELECT cat.product_category,
    sub.*
    FROM (
    SELECT p.product_name,
    COUNT(*) AS products
    FROM Product p
    GROUP BY 1
    ) sub
    JOIN ProductCategory cat
    ON cat.product_name = sub.product_name

  3. Query plan: Also, how do you know which statement is a bottleneck in your queries. You could consider running them individually and see but even if that doesn’t help then you can use something called a “Query plan” — depending on the database system you are using the commands can differ but you can try searching the help section for that. It’s called Query/Execution plans and they help you see the order in which the query will be executed. Also, it will have “estimated” time to run stats (which may or may not be accurate) but still good starting point to see how long it might take for complex queries especially as you make changes, you can continuously evaluate without having to run the query. There’s a bit of learning curve on how to read execution/query plans but they are a great way to check the bottlenecks in a query that you wrote. You can try using a command like “EXPLAIN” before your query and check your database help section to see if that’s not the command.

Database software, Environment, Optimization & Hardware:

Let’s say you have tried everything you could to tune your SQL queries then it’s move to explore other options:

#2: Database software, environment & optimization: This is usually owned by Dev Ops or IT team and you will have to work with them. Depending on your team size, there might be a DBA, System Admin or DevOps engineer responsible for this. Here’s few things you should check out along with the IT team:

  • Are there a lot of users running queries at the same time?
  • Upgrade to database software and/or perform database optimization (e.g. indexing)
  • Consider evaluating a database that supports analytics better like Vertica, Redshift, Teradata, Google BigQuery, Microsft Azure SQL Data Warehouse among others — if you have 20+ users hitting a database for querying and have tables with 25M+ rows then this is worth evaluating! Your mileage may vary (depending on your hardware) but I am sharing these thresholds just so you have a starting point. These databases are different in architecture compared to something like MySQL which will bog down as you start to scale analytics in your org.
  • Are you querying a production database that also used to support other apps? If so, consider requesting a copy of a database to work with. IT should be able to set up a copy that gets refreshed let’s say nightly and that should help. You should then route all SQL users to this database and restrict access to production database.

#3: Hardware:

Since database is a software, it is constrained by the resources that it is allocated at hardware level just like any other software. You should dive deeper into this if #1 & #2 don’t work out — This is not the most common root-cause but as a rule of thumb, you should be scaling your hardware resources as other systems are scaled too. if that’s not done regularly then you will hit hardware issues. Also, don’t just upgrade your hardware, as I referred to earlier in #2, consider looking into databases that are better for analytics like vertica, redshift, bigquery etc. Compare all options & do an ROI analysis as upgrading hardware is usually a “duct-tape” solution and you will run into it again if you continue to grow.

Conclusion:

So you now have a framework which should help you when you run into SQL performance problems! Now it’s your turn, I would love to hear about what you did when you ran into performance problems in any of your data analysis project.

How to remove line feeds (lf) and character return (cr) from a text field in SQL Server?

Standard

I was doing some data cleaning the other day, I ran into the issue of text fields having line feeds (lf) and character returns (cr) — this creates a lot of issues when you do data import/export. I had run into this problem sometime before as well and didn’t remember what I did back then so I am putting the solution here so it can be referenced later if need be.

To solve this, you need to remove LF, CR and/or combination of both. here’s the T-SQL syntax for SQL Server to do so:

SELECT REPLACE(REPLACE(@YourFieldName, CHAR(10), ' '), CHAR(13), ' ')

if you’re using some other database system then you need to figure out how to identify CR and LF’s — in SQL Server, the Char() function helps do that and there should be something similar for the database system that you’re using.

How do I get experience as an entry-level Data analyst?

Standard

It’s a three-step process:

  1. Figure out where (location) you want to work and who (company) you want to work for.
  2. Note the “skills” required in job Descriptions at companies in your desired location(s) > find common themes from job descriptions > Pick up those skills if you don’t have them already!
  3. Start Applying!
    • Getting a job is a function of Number of Job Applications and your conversion rate (Offers Received/#of Job Applications). Optimizing # of Job Applications is easy — you just need to apply to as many jobs as you could. To improve conversion rate, you would need to do number of things: clear HR/Culture-fit rounds, clear TECH rounds, create a portfolio of projects to talk about, etc.
    • You could also consider applying for internships to get experience. This should help you land full-time roles.

Related Answer: Paras Doshi’s answer to How do I prepare myself to be a data analyst?

VIEW QUESTION ON QUORA

What is the difference between Row_Number(), Rank() and Dense_Rank() in SQL?

Standard

If the database that you work with supports Window/Analytic functions then the chances are that you have run into SQL use-cases where you have wondered about the difference between Row_Number(), Rank() and Dense_Rank(). In this post, I’ll show you the difference:

So, let’s just run all of them together and see what the output looks like.

Here’s my query: (Thanks StackExchange!)

select DisplayName,Reputation,
Row_Number() OVER (Order by Reputation desc) as RowNumber,
Rank() OVER (Order by Reputation desc) as Rank,
Dense_Rank() OVER (Order by Reputation desc) as DenseRank
from users

Which gives the following output:

DisplayName          Reputation RowNumber Rank DenseRank 
-------------------- ---------- --------- ---- --------- 
Hardik Mishra        9999       1         1    1         
Alex                 9997       2         2    2         
Omnipresent          9997       3         2    2         
Sergei Basharov      9993       4         4    3         
Oleg Pavliv          9991       5         5    4         
Jason Creighton      9991       6         5    4         
Aniko                9991       7         5    4         
Notlikethat          9990       8         8    5         
ZeMoon               9989       9         9    6         
Carl                 9987       10        10   7   
...
...
...     

Note that all the functions are essentially are “ranking” your rows but there are subtle differences:

  1. Row_Number() doesn’t care if the two values are same and it just ranks them differently. Note row #2 and #3, they both have value 9997 but they were assigned 2 and 3 respectively.
  2. Rank() — Now unlike Row_Number(), Rank() would consider that the two values are same and “Rank” them with same value. Note Row #2 and #3, they both have value 9997 and so both were assigned Rank “2” — BUT notice the Rank “3” is missing! In other words, it introduces some “gaps”
  3. Dense_Rank() — Now Dense_Rank() is like Rank() but it doesn’t leave any gaps! Notice that the Rank “3” in the DenseRank field.

I hope this clarified the differences between these SQL Ranking functions — let me know your thoughts in the comments section

Paras Doshi

How do I prepare myself to be a data analyst?

Standard

Originally published on Quora: How do I prepare myself to be a Data Analyst?

Based on how you are framing your question, it seems that you currently don’t have “Data Analysis” Background but want to build a career in this field. Here are three things you could do:

  1. Learn Tech Skills: You will need technical knowledge to be successful at analyzing data. SQL and Excel are a good starting point. You could do a lot with these tools — then depending on the bandwidth that you might have you could explore R. How do you learn this? Here’s a learning pathway: Learn #Data Analysis online – free curriculum ; Also search for free courses on Coursera or other platforms.
  2. Learn Soft/Business Skills: This is as important as tech skills (if not more!) when it comes to Data Analysis. Finding Insights from your data is half the battle, you will need to put the insights in a context/story and influence business decisions and sometimes influence business change. we know change is always hard! So your soft/business skills will be very important. Also, you will benefit a lot from learning about how to break down problems, communicate your solution by using “business” language vs tech-speak.
  3. Apply them (and keep improving): Now that you have picked up some tech and soft/biz skills, apply them! Get an internship, Help out a non-profit in your free time (Data Kind, Statistics Without borders, Volunteer Match are good resources to find a non-profit) and start applying your skills! It would also help you get some “Real” world experience and applying what you have learned while “learning-on-the-job” is arguably the BEST way to pick something up!

Hope that helps!

How to change the Data Source of a SQL Server Reporting Services Report (Native Mode)?

Standard

Problem:

You have your SQL Server Reporting Services environment in native mode — and you want to modify the data source of a report there.

Solution:

  1. Navigate to Report Manager.
  2. Navigate to the Report that you want to Manage and run it
  3. After the report renders, you will have a breadcrumb navigation on the top right
  4. Click on the Last Part of the Breadcrumb NavigationSSRS properties report native mode
  5. It should open up the “properties” section of this report
  6. On the properties section, you should be able to manage the data source
    SSRS Manage Data Source Native Mode Shared
  7. Make the changes that you wanted to the data source settings of this SSRS report — and don’t forget to click “apply”
  8. Done!

Author: Paras Doshi

Back to Basics — What is DDL, DML, DCL & TCL?

Standard

I was talking with a database administrator about different categories that SQL Commands fall into — and I thought it would be great to document here. So here you go:

ACRONYMDESCRIPTIONSQL COMMANDS
DMLData Manipulation Language: SQL Statements that affect records in a table.SELECT, INSERT, UPDATE, DELETE
DDLData Definition Language: SQL Statements that create/alter a table structureCREATE, ALTER, DROP
DCLData Control Language: SQL Statements that control the level of access that users have on database objectsGRANT, REVOKE
TCLTransaction Control Language: SQL Statements that help you maintain the integrity of data by allowing control over transactionsCOMMIT, ROLLBACK

BONUS (Advance) QUESTION:

Is Truncate SQL command a DDL or DML? Please use comment section!

Author: Paras Doshi