In this blog – post, I’ll share a quick demo of how you can use Regular Expressions in Data Quality services to clean records. For the purpose of the demo, I’ll show how to clean perform a preliminary cleaning procedure on email records.
Before we begin, just a primer on Regular Expressions:
“a regular expression provides a concise and flexible means to match (specify and recognize) strings of text, such as particular characters, words, or patterns of characters. Common abbreviations for “regular expression” include regex and regexp.” – source: WikiPedia Regular Expressions
Let’s take an example:
using Regular Expressions you can:
- check if a text starts with A or a
- check if a text contains the word “data”
- check if a text follows the following pattern: firstname.lastname@example.org.
- among other things..
Now, Note the power of Regular Expressions to check for a pattern like email@example.com – now can we not use it to perform a basic cleaning on email records? I think, we can! and so, let’s try that:
for the purpose of the blog post, let me quickly show how you can create a Domain in a Knowledge Base that uses Regular Expressions to clean email records:
1) Open Data Quality Client > New Knowledge Base > Domain Management > create a domain
2) Switch to Domain Rules Tab
3) create a new Domain domain rule which uses the Regular Expression: [A-Za-z0-9._-]+@[A-Za-z0-9.-]+.[A-Za-z] to perform a basic cleaning on email records:
4) You can use this Knowledge base to clean email records in Data Quality Records But for now, let’s test our Regular Expression > click Test Domain Rule > Enter few records > Test them!
Note: This is just a basic demo to show Regular Expressions in action. For cleaning Email records, there are other options like using third-party data sets or writing an advanced regular expression. The RegEx I shared was just meant for quick demo but if you want to read more about Regular Expressions used to clean emails then I will strongly recommend you to read email regular expressions that I got to know from Erik Eckhardt via the discussion on a Beyond Relational Tip that I had posted.
In this blog post, we saw how to do basic cleaning on email records using regular expressions in Data Quality Services
- How to detect unrealistic or invalid values using Data Quality Services? (parasdoshi.com)
- How to standardize data using Data Quality Services? (parasdoshi.com)
- How I think of what is KnowledgeBase in SQL Server 2012 Data Quality Services (parasdoshi.com)
- How to Delete a Knowledge Base in SQL Server 2012 Data Quality Services? (parasdoshi.com)
- How to clean address records using third-party reference data-sets in Data Quality Services?(parasdoshi.com)
7 thoughts on “How to clean data using Regular Expressions in Data Quality Services?”
Hi, excellent piece.
I want to create a domain rule to check if the 3rd character of a domain contains 1, 2, or A.
In sql i use substring(colname, 3,1) in (‘1′,’2′,’A’) which I want to implement that in DQS. I tried Pattern match and regular expression but I can’t find what are the valid expressions for Substring match?