Collation settings [Server Configuration]

Prev: very slow SQL client
Next: SQL Server Output Buffer

From: Erland Sommarskog on 5 Dec 2009 05:19

Jay (spam(a)nospam.org) writes:
> OK, I'll bite. When is it good to use a binary collation?

When you care more about performance than case-sensitivity.

As Jeroen indicates, it may not be the best choice for the server
default or database default, but say that you have this query:

select * from persons where email like '%abd%'

and there are a lot of persons. It does not help if there is an index on
email, due to the initial wildcard. But not only does SQL Server have to
scan the index, it has to scan the full string for all rows as well. The
comparison rules for Unicode are quite complex, so this is costly. If
you force a binary collation:

select * from persons where email COLLATE Latin1_General_BIN2 LIKE '%abd%'

You can gain speed with a factor of 10. Email addresses are case-
insensitive, but it is perfectly acceptable to force them to lowercase
when you save them.

Notes:
1) If you have an SQL collation and email varchar, this is almost as good.
For varchar, an SQL collation only have 255 characters to deal with,
and comparisons are a lot simpler.
2) For even faster solutions to the problem get this book:
http://www.sqlservermvpdeepdives.com/ and read chapter 17. If you
buy this book you also help War Child International to help children
affected by war, because that's where our royalies go.

--
Erland Sommarskog, SQL Server MVP, esquel(a)sommarskog.se

Links for SQL Server Books Online:
SQL 2008: http://msdn.microsoft.com/en-us/sqlserver/cc514207.aspx
SQL 2005: http://msdn.microsoft.com/en-us/sqlserver/bb895970.aspx
SQL 2000: http://www.microsoft.com/sql/prodinfo/previousversions/books.mspx

From: Jay on 5 Dec 2009 10:28

OK, I got it. Thanks.

It also clears something up that has bugged me for a while: Unix clearly
uses a binary collation.

Thanks,
Jay

From: Jeroen Mostert on 5 Dec 2009 11:36

Jay wrote:
> It also clears something up that has bugged me for a while: Unix clearly
> uses a binary collation.
>
Oh dear, I feel another enormous post coming up. But I'll keep it short this
time.

"Unix" is a very broad moniker that could apply to just about anything, but
if (as I'm thinking) you mean that file systems mostly used by Unix
operating systems tend to be case- and accent-sensitive, then yes.

However, NTFS has the same property. The only reason the file system appears
case-insensitive to you is because the Win32 subsystem usually layered over
interactions with the OS compensates for htis. It is possible to have file
names differing only in case on an NTFS system, and the regular Win32
functions will not be able to properly distinguish between them.

It is certainly not true that "Unix" uses binary collations for everything.
For example, on most recent systems, the "ls" command will sort according to
the collation implied by your locale (or by LC_COLLATE, if you've set it),
not by a simple binary order.

--
J.

From: Erland Sommarskog on 5 Dec 2009 14:33

Jay (spam(a)nospam.org) writes:
> It also clears something up that has bugged me for a while: Unix clearly
> uses a binary collation.

As Jeroen say, not all do. Here is an authentic output:

+� ls
�lder �ldre arvode autoreply.pl* autoreserv.txt
auto-se.pl* bin/ brev/ crontab.save disapproved
emacs/ flimsy.pl forward.test lib/ Maildir/
mbox mbox.save mbox.tmp.4473 nail-11.25/ News/
�vre perllib/ Pnews.1 procmaillog procmailrc.bcl
progs/ public_html/ quiz/ Rnmail.1 slask
socket.pl spamfilter.log spamfilter.pl sql-grupper SQLMPUMS
testsh* trn.1 trn-4.0-test76/

That is clearly not binary. In that case you would have see "�ldre",
"�lder" and "�vre" last in the listin that order.

In SQL Server, there are 18 collations for each collation designator.
Taking Finnish_Swedish as an example, there are:

Finnish_Swedish_BIN
Finnish_Swedish_BIN2
Finnish_Swedish_CI_AI
Finnish_Swedish_CI_AI_WS
Finnish_Swedish_CI_AI_KS
Finnish_Swedish_CI_AI_KS_WS
Finnish_Swedish_CI_AS
Finnish_Swedish_CI_AS_WS
Finnish_Swedish_CI_AS_KS
Finnish_Swedish_CI_AS_KS_WS
Finnish_Swedish_CS_AI
Finnish_Swedish_CS_AI_WS
Finnish_Swedish_CS_AI_KS
Finnish_Swedish_CS_AI_KS_WS
Finnish_Swedish_CS_AS
Finnish_Swedish_CS_AS_WS
Finnish_Swedish_CS_AS_KS
Finnish_Swedish_CS_AS_KS_WS

That is two binary collations, and then 16 which are all combinations
of case- accent, kana- and width-sensitivity. There are no KI and WI
collations in the list, but that is implied by their absence.

--
Erland Sommarskog, SQL Server MVP, esquel(a)sommarskog.se

Links for SQL Server Books Online:
SQL 2008: http://msdn.microsoft.com/en-us/sqlserver/cc514207.aspx
SQL 2005: http://msdn.microsoft.com/en-us/sqlserver/bb895970.aspx
SQL 2000: http://www.microsoft.com/sql/prodinfo/previousversions/books.mspx

First | Prev |
Pages: 1 2 3
Prev: very slow SQL client
Next: SQL Server Output Buffer