Missing Data:
- Descriptive statistics (e.g. the count of non-null values per column) can be used to find missing data
- Tools like SQL/Excel/R can also be used to look for missing data
- Some components of a field may be missing, like the Postal Code in an address field
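A quick way to act on the points above is to count missing values per column. Here is a minimal sketch in plain Python; the sample rows, the column names, and the `count_missing` helper are made up for illustration, and it assumes that both `None` and blank strings count as missing.

```python
# Hypothetical sample rows; None and "" both stand for a missing value here.
rows = [
    {"name": "Ann", "city": "Austin", "postal_code": "73301"},
    {"name": "Bob", "city": "Boston", "postal_code": ""},
    {"name": "Cy", "city": None, "postal_code": None},
]


def count_missing(rows):
    """Return {column: number of rows where the value is None or blank}."""
    counts = {}
    for row in rows:
        for column, value in row.items():
            is_missing = value is None or str(value).strip() == ""
            counts[column] = counts.get(column, 0) + int(is_missing)
    return counts


missing = count_missing(rows)
print(missing)  # {'name': 0, 'city': 1, 'postal_code': 2}
```

The same count can be produced with SQL, Excel, or R; the idea is only to get one number per column that you can eyeball.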
Non-standardized:
- Check if all the values are standardized: Google, Google Inc & Alphabet might need to be standardized and categorized as Alphabet
- Check for different date formats used in the same field (MM/DD/YYYY and DD/MM/YYYY)
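Both checks above can be roughed out in a few lines. The sketch below is only an illustration: the `CANONICAL` alias table and the helper names are assumptions, and in practice the alias list would come from whoever owns the data. Note that a date such as 03/04/2020 is valid in both formats, so the best a script can do is flag it as ambiguous.

```python
from datetime import datetime

# Hypothetical alias table: every known spelling maps to one canonical name.
CANONICAL = {
    "google": "Alphabet",
    "google inc": "Alphabet",
    "alphabet": "Alphabet",
}


def standardize_company(name):
    """Map a company name to its canonical form, or return it trimmed."""
    key = name.strip().lower().rstrip(".")
    return CANONICAL.get(key, name.strip())


def fits_format(text, fmt):
    """Return True if text parses as a date under the given strptime format."""
    try:
        datetime.strptime(text, fmt)
        return True
    except ValueError:
        return False


def classify_date(text):
    """Label a date string as 'MM/DD', 'DD/MM', 'ambiguous', or 'invalid'."""
    month_first = fits_format(text, "%m/%d/%Y")
    day_first = fits_format(text, "%d/%m/%Y")
    if month_first and day_first:
        return "ambiguous"
    if month_first:
        return "MM/DD"
    if day_first:
        return "DD/MM"
    return "invalid"


companies = [standardize_company(c) for c in ["Google", "Google Inc", "Alphabet"]]
labels = [classify_date(d) for d in ["12/25/2020", "25/12/2020", "03/04/2020"]]
print(companies)  # ['Alphabet', 'Alphabet', 'Alphabet']
print(labels)     # ['MM/DD', 'DD/MM', 'ambiguous']
```

If a single column produces both 'MM/DD' and 'DD/MM' labels, that is a strong hint that two formats were mixed.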
Incomplete:
- Total size of data (# of rows/columns): Sometimes you may not have all the rows that you were expecting (for e.g. 100k rows, one for each of your 100k customers), and if that’s the case then that tells us that we don’t have the complete dataset at hand
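One way to make the row-count check concrete is to compare the IDs you expected against the IDs that actually arrived. This is a rough sketch; the `customer_id` key and the `completeness_report` helper are hypothetical names, and it assumes you have a trusted list of expected IDs to compare against.

```python
def completeness_report(expected_ids, rows, key="customer_id"):
    """Compare expected IDs with the IDs present in rows."""
    present = {row[key] for row in rows}
    expected = set(expected_ids)
    return {
        "expected_rows": len(expected),
        "actual_rows": len(rows),
        "missing_ids": sorted(expected - present),
    }


# Hypothetical example: five customers expected, only three rows delivered.
expected_ids = [1, 2, 3, 4, 5]
rows = [{"customer_id": 1}, {"customer_id": 2}, {"customer_id": 4}]

report = completeness_report(expected_ids, rows)
print(report)
# {'expected_rows': 5, 'actual_rows': 3, 'missing_ids': [3, 5]}
```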
Erroneous:
- Outlier: If someone’s age is 250 then that’s an outlier, but it’s also an error somewhere in the data pipeline that needs to be fixed; outliers can be detected by creating a quick data visualization
- Data Type mismatch: If a text value appears in a field where the other entries are integers, that’s also an error
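Besides plotting, both kinds of error can be caught with a simple rule-based pass. The sketch below swaps the visualization for a plain range check, which is a different technique; the 0 to 120 bounds for a plausible age and the `find_errors` helper are assumptions made for the example.

```python
def find_errors(values, low=0, high=120):
    """Return (index, value, reason) for entries that are not plausible ages."""
    problems = []
    for index, value in enumerate(values):
        # bool is a subclass of int in Python, so exclude it explicitly.
        if not isinstance(value, int) or isinstance(value, bool):
            problems.append((index, value, "type mismatch"))
        elif not low <= value <= high:
            problems.append((index, value, "out of range"))
    return problems


# Hypothetical age column with an impossible value and a text entry.
ages = [34, 250, "forty", 28, -3]

problems = find_errors(ages)
for problem in problems:
    print(problem)
# (1, 250, 'out of range')
# (2, 'forty', 'type mismatch')
# (4, -3, 'out of range')
```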
Duplicates:
- Duplicates can be introduced into the data, e.g. the same rows repeated in the dataset, so the dataset needs to be de-duplicated
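De-duplication of exact repeats can be sketched like this. It assumes two rows are duplicates only when every column matches exactly and that all values are hashable; the `deduplicate` helper and the sample rows are made up for illustration. Near-duplicates (different spelling, extra whitespace) would need standardizing first.

```python
def deduplicate(rows):
    """Drop exact duplicate rows, keeping the first occurrence of each."""
    seen = set()
    unique = []
    for row in rows:
        # Sort the items so that column order does not affect the comparison.
        fingerprint = tuple(sorted(row.items()))
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique.append(row)
    return unique


# Hypothetical rows; the third repeats the first with columns in another order.
rows = [
    {"id": 1, "name": "Ann"},
    {"id": 2, "name": "Bob"},
    {"name": "Ann", "id": 1},
]

unique_rows = deduplicate(rows)
print(len(rows), "->", len(unique_rows))  # 3 -> 2
```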
Hope that helps!