Playing with the Occupational Employment Statistics Data Set:


I found some data sets on Occupational Employment Statistics on the Bureau of Labor Statistics site and played with them to see if I could find something interesting:

A few things about the data and visualizations that I am going to share:

  • US only
  • I downloaded the national-level data, but state-level data is also available if you’re interested in drilling down.
  • The reports that you see were created after I got a chance to “clean” the data set a bit and built a data model that suited basic reporting on top of it.
  • For this blog post, I am going to play with the May 2010 and May 2011 data.
  • With the help of the original data set, you can drill down to get statistics about a particular job category if you want. For this blog post, I am going to share visualizations at the job category level.
  • Click on an image to see the higher-resolution version.

With that, here are some visualizations:

1) Job Category VS mean hourly salary:

1 Job category vs hourly salary mean bureau of labour statistics

2) Job Category VS number of employees:

2 Job category vs number of employees bureau of labour statistics

3) Scatter Plot:

X-axis: Number of employees

Y-axis: Wage (Mean Hourly Salary, May 2011)

Size of bubble: Wage (Mean Hourly Salary, May 2011)

*Note: This may not be the best way to build the scatter plot, as I have used the same value (Mean Hourly Salary, May 2011) for both the Y-axis and the bubble size, so the bubble size adds no new information. But since I was just playing with it, I went with what I had in the model.

Here’s the visualization:

3 scatter plot number of employees vs mean hourly wage may 2011 employment statistics

Some of the things I observed:

1) I belong to an industry (Computer and Mathematical occupations) that has a relatively high mean hourly wage.

2) Relatively few people work in “farming, fishing & forestry occupations”, and they do not get paid much.

3) Lots of people work in “office administrative support occupations”, and they do not get paid much either.

4) Management occupations, legal occupations and computer & mathematical occupations have relatively high mean hourly wages.


In this post, I played with the Occupational Employment Statistics data sets and shared some visualizations.

Data Profiling and SQL Server 2012 Data Quality Services


Data profiling in Data Quality Services happens at the following stages:

1) While performing the Knowledge Discovery activity

1a. In the Discover step:

1 knowledge discovery profiling data quality services sql server

1b. Also in the Manage Domain Values step:

1b knowledge discovery profiling data quality services sql server

While profiling gives you statistics at various stages of the data cleansing or matching process, it is important to understand what you can do with them. With that, here are the statistics that we can gather during the Knowledge Discovery activity:

  • Newness
  • Uniqueness
  • Validity
  • Completeness
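To make these four statistics a bit more concrete, here is a minimal Python sketch of how such ratios could be computed for a single column. This is not how DQS computes them internally; the definitions below (completeness as the share of non-empty values, uniqueness as distinct over non-empty values, newness as the share of distinct values not yet in the knowledge base, validity as the share of values passing a rule) are my own simplified assumptions, and the function name `profile_column` and the sample values are hypothetical.

```python
def profile_column(values, known_values, is_valid):
    """Return simple profiling ratios for one column (illustrative definitions only)."""
    total = len(values)
    # Treat None and blank strings as missing values.
    present = [v for v in values if v is not None and str(v).strip() != ""]
    distinct = set(present)
    if total == 0 or not present:
        return {"completeness": 0.0, "uniqueness": 0.0, "newness": 0.0, "validity": 0.0}
    return {
        # Share of rows that actually contain a value.
        "completeness": len(present) / total,
        # Share of non-empty values that are distinct.
        "uniqueness": len(distinct) / len(present),
        # Share of distinct values not already known to the knowledge base.
        "newness": len(distinct - set(known_values)) / len(distinct),
        # Share of non-empty values that pass the (hypothetical) domain rule.
        "validity": sum(1 for v in present if is_valid(v)) / len(present),
    }


# Hypothetical sample column and knowledge base values.
known = {"A", "B", "C", "D"}
sample = ["A", "B", "A", None, "E"]
stats = profile_column(sample, known_values=known, is_valid=lambda v: v in known)
print(stats)
```

Low completeness would point at missing source data, while high newness would suggest the knowledge base has not seen much of this data yet.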

2) While performing the Cleansing activity:

2a. In the Cleansing step:

2 cleansing profiling data quality services sql server

2b. Also in the Manage and View Results step:

2b cleansing profiling data quality services sql server

Here the profiler gives you the following statistics:

  • Corrected values
  • Suggested values
  • Completeness
  • Accuracy

Note the invalid records under “Source Statistics” on the left side. In this case, 3 records didn’t pass the domain rule.

3) While performing the Matching Policy activity (Knowledge Base Management)

3a. Matching policy step:

3a matching policy data quality services microsoft sql

3b. Matching Results step:

3b matching policy data quality services microsoft sql

Here the profiler gives the following statistics:

  • Newness
  • Uniqueness
  • Number of clusters
  • % of matched and unmatched records
  • Avg, min & max cluster size

4) While performing the Matching activity (Data Quality Project)

4a. Matching step:

4a matching activity data quality services microsoft sql

4b. Export step:

4b matching activity data quality services microsoft sql export step

Here the profiler gives the following statistics:

  • Newness
  • Uniqueness
  • Completeness
  • Number of clusters
  • % of matched and unmatched records
  • Avg, min & max cluster size
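The cluster-related statistics in this list can be sketched in a few lines of Python. This is only an illustration of the arithmetic, assuming each record carries a cluster id (or `None` when it was not matched to any other record); how DQS itself counts clusters and matched records is not described in this post, and the function name `cluster_stats` and the sample assignments are hypothetical.

```python
from collections import Counter


def cluster_stats(cluster_ids):
    """Summarize matching output given one cluster id per record (None = unmatched)."""
    total = len(cluster_ids)
    # Count how many records fall into each cluster.
    sizes = Counter(c for c in cluster_ids if c is not None)
    matched = sum(sizes.values())
    size_list = list(sizes.values())
    return {
        "clusters": len(sizes),
        "matched_pct": 100 * matched / total if total else 0.0,
        "unmatched_pct": 100 * (total - matched) / total if total else 0.0,
        "avg_cluster_size": matched / len(sizes) if sizes else 0.0,
        "min_cluster_size": min(size_list, default=0),
        "max_cluster_size": max(size_list, default=0),
    }


# Hypothetical result: two clusters (ids 1 and 2) and three unmatched records.
assignments = [1, 1, 2, 2, 2, None, None, None]
summary = cluster_stats(assignments)
print(summary)
```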


In this post, I listed the statistics provided by the profiler while performing the Knowledge Discovery, Cleansing, Matching Policy and Matching activities in SQL Server 2012 Data Quality Services.


Data Quality Services’ Composite Domains in action!


In this post, I’ll show you how composite domains can help you create cross-domain rules in Data Quality Services.


You have a data set with employee name, employee category and yearly salary. You want to validate the value in the yearly salary column based on the employee category. Here are the business rules:

Note: for the purpose of the demo, every number is a dollar amount.

Now, the rule in the table can be read as:

If the employee category is A, then the yearly salary should be greater than 100000 and less than 200000.

1 composite domains data quality services

Note: I have kept it simple for demo purposes.

Now here is our data set before we set out to validate it:

Employee Name        Employee Category    Yearly Salary
Jon V Yang           A                    127000
Eugene L Huang       B                    90000
Ruben Torres         C                    83000
Christy Zhu          D                    70000
Elizabeth Johnson    A                    90000
Julio Ruiz           C                    65000
Janet G Alvarez      D                    43000
Marco Mehta          B                    81000

*Names are taken from the AdventureWorks database. The values in the name and salary columns are purely fictional.
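Before going through the DQS steps, the logic of such a cross-domain rule can be sketched in plain Python. This is only an illustration and not what DQS runs. It encodes just the category A rule, because that is the only rule spelled out in the text; the rules for B, C and D are left out rather than guessed. The names `salary_rules` and `check` are hypothetical.

```python
# (name, category, yearly salary) rows from the data set above.
employees = [
    ("Jon V Yang", "A", 127000),
    ("Eugene L Huang", "B", 90000),
    ("Ruben Torres", "C", 83000),
    ("Christy Zhu", "D", 70000),
    ("Elizabeth Johnson", "A", 90000),
    ("Julio Ruiz", "C", 65000),
    ("Janet G Alvarez", "D", 43000),
    ("Marco Mehta", "B", 81000),
]

# Only the category A rule is stated in the text: salary must be greater than
# 100000 and less than 200000. Rules for B, C and D are intentionally omitted.
salary_rules = {"A": (100000, 200000)}  # (exclusive lower bound, exclusive upper bound)


def check(category, salary):
    """Return True/False when a rule exists for the category, None when none is known."""
    bounds = salary_rules.get(category)
    if bounds is None:
        return None
    low, high = bounds
    return low < salary < high


results = {name: check(category, salary) for name, category, salary in employees}
invalid = [name for name, ok in results.items() if ok is False]
print(invalid)
```

The key point is that the salary is never validated on its own: the allowed range depends on the value in another column, which is exactly what a composite domain lets you express.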


This is just an overview; it is not covered in step-by-step fashion:

1. Created a KB > created three domains: Employee Category, Employee Name and Yearly Salary

2. Created a composite domain:

2 created a composite domain data quality services

3. Under Composite Domain (CD) Rules Tab:

I started out with defining the rules for category A:

3 create composite domains rules data quality services

Then I finished specifying the business rules for all four categories:

4 create composite domains SQL server 2012

4. Published the KB

5. Created a new DQS project > selected the KB created above

6. Selected the data source > mapped domains

7. I also selected the composite domain from the list of composite domains:

5 view select composite domains data quality project

8. After seeing the cleansing statistics, I switched to the Invalid tab to see the records that didn’t match the rules:

6 composite domain invalid tab new tab corrected tab correct tab

9. So by now, we have identified the records that do not match the rules. A data steward can now correct them or leave them as they are. Notice the Approve/Reject check boxes.

Note that you can update not only the yearly salary but also the employee category. So if you think that an employee has been wrongly categorized, you can change that.

10. After this, you can export the data set, which now contains records that match the business rules and is ready to be consumed!


In this post, we saw how to create cross-domain rules using composite domains, with an example of Employee Category and Yearly Salary.


Things I shared on Social Media Networks during Nov 12 – Dec 31 (2012)


Big Data: The Coming Sensor Data Driven Productivity Revolution

Check out some nice getting started tutorials at beyondrelational site:

Complexity is your enemy. Any fool can make something complicated. It is hard to make something simple – Richard Branson

— via Paras Doshi – Blog

“The success of companies like Google, Facebook, Amazon, and Netflix, not to mention Wall Street firms and industries from manufacturing to retail and healthcare, is increasingly driven by better tools for extracting meaning from very large quantities of data,” says Tim O’Reilly

— via Paras Doshi – Blog

Nice collection of about 20+ videos around the topic of “Data Science”:

Nice collection of videos by Berkeley school of information: #Information #Data

Just found Facebook’s data team’s page:

via V Talk Tech – A Parth Acharya Blog – Nice HeatMap of stocks!

what’s the biggest fear about cloud computing? via Windows Azure

Resource: Presentations from the Sentiment Analysis Symposium

If I switched to the newest “holiday” theme on WordPress, this is how it would look:

Nice! Code School now has R programming language! I have been playing with R for a while now and definitely want to learn more – here’s the link to learn R:

Interesting tool from Google to optimize and analyze web page speeds:

Performed #sentiment #Analysis on #starbucks twitter data using #R ! It was fun!

In 2002, The Data Warehousing Institute estimated that data quality problems cost U.S. businesses more than $600 billion a year. And of course, over the past 10 years, this number has likely grown.

Reading: Business Analytics vs Business Intelligence?

Big data is a nickname for the recent increase in largely external and unstructured business and consumer information. How are businesses across industries harnessing traditional enterprise information management functions and systems to translate big data into useful business intelligence?

For business analytics professionals: 12 webcasts on Jan 30th 2013 #sqlpass #analytics #24hop

Some nice insights about how to build an Internet platform, from the founder of Zipcar:
