How to use TSQL checksum to compare data between two tables?

Standard

In any BI project, data validation plays an important part. You want to make sure that the data is right! usually business helps in this validation. As a developer, you might also want to do some preliminary data validation. One of the techniques that I’ve learned recently is to use TSQL checksum to compare data between two tables. In this post , I’ll describe the technique & post a pseudo code.

we’ll create a pseudo code to compare all columns but you should be able to use that to tweak that if you need it.

1) Run checksum(*) on Tables:

On Table1:

select checksum(*) as chktb1 from table1
go

On Table 2:

select checksum(*) as chktb2 from table2
go

At this point, you should get two result sets each populated by checksum values computer over all columns since you passed * in the checksum function.

2) Now let’s join these tables & look at rows w/ different checksum: (in other words, it is going to list all rows that are different between table1 & table2)

select * from
(
select checksum(*) as chktb1 from table1
) as tb1
left join
(
select checksum(*) as chktb2 from table2
) as tb2
on tb1.someid=tb2.someid /* you can have more ids */
where tb1.chktb1 != tb2.chktb2

3) You can add individual column now to see what changed:

select * from
(
select checksum(*) as chktb1, columnname1, columnname2 from table1
) as tb1
left join
(
select checksum(*) as chktb2, columnname1, columnname2 from table2
) as tb2
on tb1.someid=tb2.someid
where tb1.chktb1 != tb2.chktb2

Conclusion:
I hope this helps especially if you don’t have rights to install 3rd party tools on your dev machine.

A quick note on how select @@version helps me while I’m T-SQL’ing:

Standard

As a part of developing ETL packages, sometimes, I’ve to write T-SQL queries to pull data from SQL server source systems. But before I start doing that, it’s always good to know the version/edition of the source system. Why? because it can determine whether a TSQL operators are available for me to use or not. Case in point, I had a requirements where I could have written a query that uses Pivot & UnPivot operators. So I write a query & it doesn’t work! I spent about 5 minutes trying to debug the code. The code seems OK to me. So I thought of checking the “version”. And there you go, client’s source system was running SQL Server 2000. So that meant, I couldn’t use the Pivot & UnPivot operators.

Select Version SQL serverThis was my quick note on how select @@version helps me while I’m TSQL’ing. Next time, I’ll probably check this first, before writing the code. That could save me few minutes 🙂

Do tables in a SQL Azure Database need to have a primary key?

Standard

Answer: No.

Though SQL Azure does need that a table has a clustered index. So to this end, Let’s write some TSQL code and cement this concept in our brain. So as a part of the test, we would create a table with no primary key – However we would certainly create a clustered index on one of the column. So let’s get started:

OK, so here’s the TSQL code to create a table named ‘InternationalStudentList’:

[code language=”sql”]create table InternationalStudentList (StudentName varchar(30),HomeCountry varchar(30), DegreeProgram varchar(30))
go
[/code]

Now let’s insert some data into this table. Here’s the query:

[code language=”sql”]
insert into internationalstudentlist values(‘Paras Doshi’,’India’,’Masters in IT and Management’)
go
[/code]

If you run the above query – you would get the error:

Msg 40054, Level 16, State 1, Line 1
Tables without a clustered index are not supported in this version of SQL Server. Please create a clustered index and try again.

So we know that a table in SQL Azure requires a clustered index, right? So, Let’s create one!
Let’s assume that the InternationalStudentList would be queried a LOT to answer the question: “List all StudentName whose HomeCountry is X” – So based on this information let’s create a clustered index on HomeCountry Column. Here’s the TSQL code:

[code language=”sql”]
create clustered index cix_internationalstudentlist_homecountry ON
InternationalStudentList(HomeCountry)
go
[/code]

Once a clustered index is created, Try inserting data again. And you would see that a row would get successfully Inserted!
Run a Select Command to verify that:

SQL Azure Result of a SQL select command

And as you can see, we were able to insert a row in a table. Remember that this table did not have a primary key But we did create a clustered index. The Goal of the Post is achieved here, But just want to show you the Query Plan for a Query that looks like:

[code language=”sql”]
select StudentName,HomeCountry,DegreeProgram from InternationalStudentList
where HomeCountry = ‘India’
go
[/code]

sql azure clustered index seek

Note that I have run this queries on Management portal for SQL Azure, You can run it in SSMS too. But the goal is to show you that we have clustered index seek and that’s one of the way to tune queries. Not going into details in this post – And That’s about it for this post. Your feedback is welcome.

And Let’s connect! Here are few people networks that I am active on:

paras doshi blog on facebookparas doshi twitterparas doshi google plusparas doshi linkedin

I Look forward to Interacting with you!