How to remove line feeds (lf) and character return (cr) from a text field in SQL Server?

Standard

I was doing some data cleaning the other day, I ran into the issue of text fields having line feeds (lf) and character returns (cr) — this creates a lot of issues when you do data import/export. I had run into this problem sometime before as well and didn’t remember what I did back then so I am putting the solution here so it can be referenced later if need be.

To solve this, you need to remove LF, CR and/or combination of both. here’s the T-SQL syntax for SQL Server to do so:

SELECT REPLACE(REPLACE(@YourFieldName, CHAR(10), ' '), CHAR(13), ' ')

if you’re using some other database system then you need to figure out how to identify CR and LF’s — in SQL Server, the Char() function helps do that and there should be something similar for the database system that you’re using.

What is the difference between Row_Number(), Rank() and Dense_Rank() in SQL?

Standard

If the database that you work with supports Window/Analytic functions then the chances are that you have run into SQL use-cases where you have wondered about the difference between Row_Number(), Rank() and Dense_Rank(). In this post, I’ll show you the difference:

So, let’s just run all of them together and see what the output looks like.

Here’s my query: (Thanks StackExchange!)

select DisplayName,Reputation,
Row_Number() OVER (Order by Reputation desc) as RowNumber,
Rank() OVER (Order by Reputation desc) as Rank,
Dense_Rank() OVER (Order by Reputation desc) as DenseRank
from users

Which gives the following output:

DisplayName          Reputation RowNumber Rank DenseRank 
-------------------- ---------- --------- ---- --------- 
Hardik Mishra        9999       1         1    1         
Alex                 9997       2         2    2         
Omnipresent          9997       3         2    2         
Sergei Basharov      9993       4         4    3         
Oleg Pavliv          9991       5         5    4         
Jason Creighton      9991       6         5    4         
Aniko                9991       7         5    4         
Notlikethat          9990       8         8    5         
ZeMoon               9989       9         9    6         
Carl                 9987       10        10   7   
...
...
...     

Note that all the functions are essentially are “ranking” your rows but there are subtle differences:

  1. Row_Number() doesn’t care if the two values are same and it just ranks them differently. Note row #2 and #3, they both have value 9997 but they were assigned 2 and 3 respectively.
  2. Rank() — Now unlike Row_Number(), Rank() would consider that the two values are same and “Rank” them with same value. Note Row #2 and #3, they both have value 9997 and so both were assigned Rank “2” — BUT notice the Rank “3” is missing! In other words, it introduces some “gaps”
  3. Dense_Rank() — Now Dense_Rank() is like Rank() but it doesn’t leave any gaps! Notice that the Rank “3” in the DenseRank field.

I hope this clarified the differences between these SQL Ranking functions — let me know your thoughts in the comments section

Paras Doshi

SQL Server Query Fundamentals: A Simple example of a Query that uses PIVOT:

Standard

Problem:

Convert the following source data into a schema shown below:

SQL SERVER TSQL PIVOTSolution:

Here’s the code that uses PIVOT function to get to the solution, please use this as a starting point.

Note the use of aggregation function avg – this will depend on the requirement. In the example, the Test_value need to be average if more than one tests were performed.

[code language=”SQL”]

— source data
SELECT [Product_ID],[Test_Desc],[Test_Val] FROM [dbo].[Address]
go

— Destination data using PIVOT function
select * from [dbo].[Address]
pivot( avg(test_val) for test_Desc IN (Test1,Test2,Test3,Test4,Test5))
as Tests

[/code]

TSQL – Quick note about numeric data type to solve “Arithmetic overflow error”

Standard

Problem:

You are working on a query where you are trying to convert source data to numeric data type and you get an “Arithmetic overflow error”.

Solution:

Let’s understand this with an example:

Here’s the source data: 132.56000000 and you want to store just 132.56 so write a query that looks like:

cast([source language=”column”][/source] as numeric(3,2)) as destination_column_name

and after you run the query its throws an error “Arithmetic Overflow Error” – so what’s wrong?

The issue is that you incorrectly specified the precision and scale – by writing the query that says numeric(3,2) you are saying I want 3 data places with 2 on the right (after decimal point) which leaves just 1 place for left.

what you need to write is numeric(5,2) – and this will have 2 places on the right and leaves 3 places for left.

so after you run this, it shouldn’t complain about the arithmetic overflow error. you just need to make sure that the precision and scale of the numeric data type is correct.

Conclusion:

In this post, you saw an example of how to correctly use the precision and scale in the numeric data type and that should help you solve the arithmetic overflow errors.

SQL server Integration services: How to solve “The value violated the integrity constraints for the column” error?

Standard

problem:

you are working on an SSIS package to load a table from a source system and you get an error “The value violated the integrity constraints for the column error” – how do you solve it?

solution:

one the things that the error message should also tell you would be column name. What you want to do is check the table definition of the destination table for any integrity constraints like NOT NULL or PRIMARY KEY. Now once you have that information, go back to your source and figure out if it’s trying to add NULL values in a column that has NOT NULL integrity constraint. Or may be ETL logic is trying to insert duplicate value to the column that has primary key constraint.

Also, the don’t alter the destination table to accept NULL’s or remove integrity constraint. You want to put a logic in your ETL OR fix the data integrity at source. You can use TSQL functions like NULLIF to handle NULL values while querying source systems.

Conclusion:
In this post, we saw how to solve the “The value violated the integrity constraints for the column” error in SSIS.

How to use TSQL checksum to compare data between two tables?

Standard

In any BI project, data validation plays an important part. You want to make sure that the data is right! usually business helps in this validation. As a developer, you might also want to do some preliminary data validation. One of the techniques that I’ve learned recently is to use TSQL checksum to compare data between two tables. In this post , I’ll describe the technique & post a pseudo code.

we’ll create a pseudo code to compare all columns but you should be able to use that to tweak that if you need it.

1) Run checksum(*) on Tables:

On Table1:

select checksum(*) as chktb1 from table1
go

On Table 2:

select checksum(*) as chktb2 from table2
go

At this point, you should get two result sets each populated by checksum values computer over all columns since you passed * in the checksum function.

2) Now let’s join these tables & look at rows w/ different checksum: (in other words, it is going to list all rows that are different between table1 & table2)

select * from
(
select checksum(*) as chktb1 from table1
) as tb1
left join
(
select checksum(*) as chktb2 from table2
) as tb2
on tb1.someid=tb2.someid /* you can have more ids */
where tb1.chktb1 != tb2.chktb2

3) You can add individual column now to see what changed:

select * from
(
select checksum(*) as chktb1, columnname1, columnname2 from table1
) as tb1
left join
(
select checksum(*) as chktb2, columnname1, columnname2 from table2
) as tb2
on tb1.someid=tb2.someid
where tb1.chktb1 != tb2.chktb2

Conclusion:
I hope this helps especially if you don’t have rights to install 3rd party tools on your dev machine.

TSQL Script: How to get list of all tables or view in a database?

Standard

I was documenting the list of tables/views in a data mart & staging databases & I found the following scripts useful:

TSQL To get list of all tables:

[sourcecode type=”sql” wraplines=”false”]
SELECT * FROM <DatabaseName>.information_schema.tables
where TABLE_TYPE=’BASE TABLE’
[/sourcecode]

TSQL To get list of all views:
[sourcecode type=”sql” wraplines=”false”]
SELECT * FROM <DatabaseName>.information_schema.tables
where TABLE_TYPE=’VIEW’
[/sourcecode]

Alternatives (for SQL 2005 onwards):
[sourcecode type=”sql” wraplines=”false”]
SELECT * FROM SYS.TABLES
[/sourcecode]

SQL Spatial Data Types: A Tip to fix STIntersects Method returning NULL values.

Standard

I was working on some queries that used SQL Spatial data types and ran into a problem where STIntersects function was returning NULL. I was expecting it to say 1 or 0, but it kept on returning NULL. Here’s how I solved it:

1. On Technet article for STIntersects, it said “This method always returns null if the spatial reference IDs (SRIDs) of the geometry instances do not match.

here’s my pseudo code for you:
– – Function
– – Input Parameters @Area Geometry, @Point1 varchar(10), @Point2 varchar(10)
DECLARE @p point;
SET @p = geometry::STGeomFromText(‘POINT(@Point1 @Point2)’, 0);
– -Note that the 0 in the above line is SRID
return(SELECT @p.STIntersects(@area));

2. So next step was to make sure that the two geometry instances that I was using had same SRID’s.

3. Since I was passing the @area (geometry data type) to the function, I had to check the SRID for @area.
so wrote a single line of code that used the STSrid function.

Select @area.STSrid

That told me that the SRID of the geometric data type was 4xxx

4. So now I modified the original code to change the SRID of @p variable to same as that of @area variable.

So here’s the updated code:
DECLARE @p point;
SET @p = geometry::STGeomFromText(‘POINT(@Point1 @Point2)’, 4xxx);
return(SELECT @p.STIntersects(@area));

5. Now since the SRID of @area and @p matched, it started returning 1 or 0 values as expected.

SQL Azure: Indexes are very helpful but they cost $Money$

Standard

This blog post is about SQL Azure and not SQL server. And it’s also not about importance of indexes. We know that indexes are important in the world of SQL server as well as in the world of SQL Azure. But in the world of SQL Azure, there is an important concern – when you create indexes, they would occupy “Space” and it will increase the database size. So what’s the point here?  The point is that the pricing model of SQL Azure is database size driven. So bigger the database size, bigger would be your bill. It’s that simple. So to figure out the cost of indexes, we need to know the size of indexes and, We do have a TSQL query to know the size of each index in a database/table.

So here is a TSQL Query that would list indexes in a given database along with its size in KB.

[code]
SELECT
SI.name, SUM(PS.reserved_page_count) * 8.0 as "Size in KB"
FROM sys.indexes AS SI
JOIN sys.dm_db_partition_stats AS PS
ON SI.object_id = PS.object_id
AND PS.index_id = SI.index_id
GROUP BY
SI.name
ORDER BY
2 DESC
[/code]

I ran the above query by logging into the SQL Azure database ‘AdventureWorksLTAZ2008R2’ (which is a sample database for SQL Azure) as a service administrator and here’s the result:

SQL Azure Findding size of IndexesYou can also list indexes along with its size for a particular table in a database. Here’s the TSQL code (Replace SalesLT.product name with your table name):

[code]
SELECT
SI.name, SUM(PS.reserved_page_count) * 8.0 as "Size in KB"
FROM sys.indexes AS SI
JOIN sys.dm_db_partition_stats AS PS
ON SI.object_id = PS.object_id
AND PS.index_id = SI.index_id
WHERE SI.object_id = object_id(‘SalesLT.product’)
GROUP BY
SI.name
ORDER BY
2 DESC
GO
[/code]

Now once you know the size of the index – the next task is to calculate the cost.
The cost of SQL Azure is dependent on the size of a database and here is the pricing model: https://www.windowsazure.com/en-us/pricing/details/#database

And one can definitely embed the pricing logic in the TSQL Query we just saw. check this article for reference: http://blogs.msdn.com/b/sqlazure/archive/2010/08/19/10051969.aspx – Just note that the pricing model is slightly different now. The pricing model is same for database size upto 50 GB. But when you above 50 GB (upto MaxSize 150 GB) there is NO additional cost. so essentially. 50 GB db = $499.95 per month and 150 GB = $499.95 per month. Update your TSQL queries accordingly!

So, Make sure that you use indexes that you have created And Drop indexes that are not used. you need to evaluate performance gain vs cost for your scenario.

Note:

  • I am not implying that Indexes on SQL server are “free”. They consume resources in the world of SQL server too. This post just happen to focus on SQL Azure side of it.
  • The Pricing Model of SQL Azure may change in future.
  • You also need to evaluate performance impact of creating indexes on create, update and delete queries

sys.dm_db_partition_stats : A SQL Azure Dynamic management view (DMV) to calculate database size

Standard

I have written a blog post on beyondrelational site about a SQL Azure DMV – sys.dm_db_partition_stats that can be used to extract information about database size and size of each individual database object. To read the article please go to: http://beyondrelational.com/blogs/parasdoshi/archive/2011/05/30/sys-dm-db-partition-stats-a-sql-azure-dynamic-management-view-dmv-to-calculate-database-size.aspx

TSQL code to know SQL Azure DB size