From: SnapDive on
SQL Server 2008, server and database are both set to default Latin CI
collation. Some characters are in there that are not valid for
consuming systems. I would like to write a SQL-Update that will
replace the varchar value with a value with a "restricted collation"
like ASCII 7-bit so characters not on a standard US keyboard will be
removed.

How can I do that?

Thanks.

From: Jeroen Mostert on
On 2010-04-13 22:01, SnapDive wrote:
> SQL Server 2008, server and database are both set to default Latin CI
> collation. Some characters are in there that are not valid for
> consuming systems. I would like to write a SQL-Update that will
> replace the varchar value with a value with a "restricted collation"
> like ASCII 7-bit so characters not on a standard US keyboard will be
> removed.
>
> How can I do that?
>
You can't define a custom collation; the only real solution is to write a
function that filters the value and use that in triggers or CHECK constraints:

CREATE FUNCTION dbo.filter_non_ascii(@s VARCHAR(MAX)) RETURNS
VARCHAR(MAX) WITH SCHEMABINDING AS BEGIN
DECLARE @result VARCHAR(MAX) = '';
DECLARE @i INT = 1;
DECLARE @c INT;
WHILE @i <= LEN(@s) BEGIN
SET @c = ASCII(SUBSTRING(@s, @i, 1));
SET @result += CASE WHEN @c > 127 THEN '?' ELSE CHAR(@c) END;
SET @i += 1;
END;
RETURN @result;
END;

Writing INSTEAD OF triggers is a pain; I recommend checking and rejecting
the value (CHECK mycolumn = dbo.filter_non_ascii(mycolumn)) instead of
filtering it as part of updating/inserting, forcing the clients to filter
the value as they see fit before passing it to the database. This also
minimizes surprises.

T-SQL is none too fast at string manipulation, so if you're passing around
huge strings this might become a bottleneck. You may want to consider using
a CLR function in that case, but don't do this as a case of premature
optimization, as this involves some nontrivial work.

If you are limiting the range of characters to ASCII because you are going
to use the strings as machine-readable data, you may want to consider using
Latin1_General_BIN2 for the column's collation. The binary collation does
bitwise comparisons of strings (by code point, rather) and can offer a
performance benefit. Beware that strings must then match by exact
characters, no variation in case is allowed.

--
J.