Descriptive statistics could be used to find missing data
Tools like SQL/Excel/R can also be used to look for missing data
Some of the attributes of a field are missing: Like Postal Code in an address field
Check if all the values are standardized: Google, Google Inc & Alphabet might need to be standardized and categorized as Alphabet
Different Date formats used in the same field (MM/DD/YYYY and DD/MM/YYYY)
Total size of data (# of rows/columns): Sometimes you may not have all the rows that you were expecting (for e.g. 100k rows for each of your 100k customers) and if that’s not the case then that tells us that we don’t complete dataset at hand
Outlier: If someone;s age is 250 then that’s an outlier but also it’s an error somewhere in the data pipeline that needs to be fixed; outliers can be detected using creating quick data visualization
Data Type mismatch: If a text field is in a field where other entries are integer that’s also an error
Duplicates can be introduced in the data e.g. same rows duplicated in the dataset so that needs to be de-duplicated
Data cleaning takes up a lot of time during a data science process; it’s not necessarily a bad thing and time spent on cleaning data is worthwhile in most cases; To that end, I was researching some framework that might help me make this process a little bit faster. As a part of my research, I found the Journal of statistical software paper written by Hadley Wickham which had a really good framework to “tidy” data — which is part of data cleaning process.
Author does a great job of defining tidy data:
1. Each variable forms a column.
2. Each observation forms a row.
3. Each type of observational unit forms a table.
And then applying it to 5 examples:
1. Column headers are values, not variable names.
2. Multiple variables are stored in one column.
3. Variables are stored in both rows and columns.
4. Multiple types of observational units are stored in the same table.
5. A single observational unit is stored in multiple tables
I was doing some data cleaning the other day, I ran into the issue of text fields having line feeds (lf) and character returns (cr) — this creates a lot of issues when you do data import/export. I had run into this problem sometime before as well and didn’t remember what I did back then so I am putting the solution here so it can be referenced later if need be.
To solve this, you need to remove LF, CR and/or combination of both. here’s the T-SQL syntax for SQL Server to do so:
if you’re using some other database system then you need to figure out how to identify CR and LF’s — in SQL Server, the Char() function helps do that and there should be something similar for the database system that you’re using.
Are you trying to import an Excel file into SQL Server using SQL Server Integration services…And ran into error that has words like “Non unicode” and “unicode”? Then this blog is for you.
Why does this error occur?
Well it turns out that things like SQL Server and Excel have encoding standards that they follow which provides them a way to process, exchange & store data. BUT turns out that SQL Server and Excel use different standards.
So, the solution is simple right? Import the data from Excel into non-Unicode format because that’s what you need for SQL Server.
So how do you that? Between your Source and Destination tasks, include a task called “Data conversion” and do the following for all columns that have text:
And in the destination task, you’ll have to make sure that the mapping section using the new output aliases that you defined in the “data conversion” step.
In this post, we learned about how to solve a common error that pops up when you try to import excel file to sql server using SSIS. Hope that helps.
Asset (Volunteers, Field offices & Equipments) management & planning for a global crisis response team.
Working in a team, we created statistical surveys for field works to collect data about current state & estimated future needs. We also helped them with data gathering & cleaning tasks. After that, we helped them analyze & visualize the data to find actions for executives leading the global crisis response team.
Here’s a mockup of one of the ten data visualization created for them:
As a part of Business Intelligence projects, we spend a significant amount in extracting, transforming and loading data from source systems. So it’s always helpful to know as much as you can about the data sources like NULLS, keys, statistics among other things. One of the things that I like to do if the data is unknown is to make sure that I get the candidate keys correct to make sure the key used can uniquely identify the rows in the data. It’s really helpful if you do this upfront because it would avoid a lot of duplicate value errors in your projects.
So here’s a quick tutorial on how you can check the candidate key profile using data profiling task in SSIS, You need to perform two main tasks: 1. Generate the xml file using the Data profiling task in SSIS 2. View the content of the xml file using the Data Profile Viewer Tool or using the Open Profile Viewer option in the Data Profiling task editor in SSIS.
Here are the steps: 1a. Open SQL Server Data Tools (Visual Studio/BIDS) and the SSIS project type 1b. Bring in Data Profiling Task on Control Flow 1c. Open the Data Profiler Task editor and configure the destination folder that the tasks uses to create the XML file. You can either create a new connection or use an existing one. If you use an existing connection, make sure that you are setting the OverwriteDestination property to True if you want the file to be overwritten at the destination.
1d. Click on Quick Profile to configure the data source for the data profiler task
1e. In the quick profile form, you’ll need to select the connection, table/view and also specify what you to need to computer. For candidate key profile, make sure that the candidate key profile box is checked.
1f. Run the Task and a XML file should be placed at the destination you specified in step 1C.
Now, It’s time to view what profiler captured.
2a. you can open “Data Profile Viewer” by searching for its name in the start button.
2b. once it opens up, click on open and browse to the xml file generated by the data profiling task.
2c. once the file opens up, you can the candidate key profiles.
2d. Alternatively, You can also open the data profile viewer from the “Data Profiling Task” in SSIS. Go to the Editor > Open Profile Viewer:
Conclusion: In this post, you saw how to profile data using the Data Profiling Task in SSIS.