regex and UTF characters [Shell]


From: Guillaume Dargaud on 30 May 2010 17:37

Hello all,
I'm playing with grep/sed on ISO-8859-1 encoded files, and I
notice that . (the dot) doesn't seem to match accented chars,
leading to some pretty unexpected results. I know that
internationalization and encodings are a hornet's nest, so I'm
seeking some advice here...
--
Guillaume Dargaud
http://www.gdargaud.net/

From: Ben Bacarisse on 30 May 2010 19:46

Guillaume Dargaud <use_my_web_form(a)www.gdargaud.net> writes:

> I'm playing with grep/sed on ISO-8859-1 encoded files,

There is a mismatch between the subject line and this remark. UTF
usually means UTF-8, which is a multi-byte encoding used for Unicode.
ISO-8859-1 is a single-byte character set.

> and I
> notice that . (the dot) doesn't seem to match accented chars,
> leading to some pretty unexpected results.

I don't see this problem with either UTF-8 encoded files or with
ISO-8859-1 files (GNU grep 2.5.4). It seems to correctly pick up the
character encoding from the environment (specifically LANG). I don't
say this to "show off", just to point out that it does seem to work.

It may simply be that you have a mismatch between the setting of LANG
and the encoding in the file. You can change LANG for just one command
like this:

LANG=en_GB.iso-8859-1 grep c.d data

> I know that
> internationalization and encodings are a hornet's nest, so I'm
> seeking some advice here...

It can be. Best start with an example. Probably the only way for
everyone to know exactly what you have in the data file is to post a hex
dump of it (keep it short). Post the value of $LANG and the command
line that does not do what you expect. Initially, avoid command lines
that use anything but "plain" characters.
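
As a rough sketch of what such a report could look like: the file names
latin1.txt and utf8.txt and the sample word are made up for
illustration, and od is only one of several tools that will produce a
hex dump.

```shell
# Hypothetical sample files: the word "café" in two encodings.
printf 'caf\351\n'     > latin1.txt   # é as the single byte 0xE9 (ISO-8859-1)
printf 'caf\303\251\n' > utf8.txt     # é as the two bytes 0xC3 0xA9 (UTF-8)

# One byte per column, in hex, without the offset column.
od -An -tx1 latin1.txt
od -An -tx1 utf8.txt

# The locale settings worth quoting alongside the dump.
printf 'LANG=%s LC_CTYPE=%s LC_ALL=%s\n' "${LANG-}" "${LC_CTYPE-}" "${LC_ALL-}"
```

A lone e9 byte in the dump points to a single-byte encoding such as
ISO-8859-1, while the pair c3 a9 points to UTF-8, which is usually
enough to tell whether the file and the locale disagree.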

--
Ben.

From: Thomas 'PointedEars' Lahn on 31 May 2010 05:26

Ben Bacarisse wrote:

> Guillaume Dargaud <use_my_web_form(a)www.gdargaud.net> writes:
>> I'm playing with grep/sed on ISO-8859-1 encoded files,
>
> There is a mismatch between the subject line and this remark. UTF
> usually means UTF-8

UTF (usually) means Unicode Transformation Format. Nothing more, nothing
less.

> which is a multi-byte encoding used for Unicode.

The accuracy of this statement is questionable. While most of the
characters in the Unicode character set require more than one UTF-8 code
unit (so more than 8 bits, or 1 byte) to be encoded, there are characters
(those below U+0080) that only require one UTF-8 code unit, so 8 bits, or
1 byte to be encoded.

<http://unicode.org/faq/>
<http://rishida.net/tools/conversion/>

> ISO-8859-1 is a single-byte character set.

Now you are obviously confusing character set and encoding.

Further good, less formal, reading on that (if you ignore some sentiments):
<http://www.joelonsoftware.com/articles/Unicode.html>

>> and I notice that . (the dot) doesn't seem to match accented chars,
>> leading to some pretty unexpected results.
>
> I don't see this problem with either UTF-8 encoded files or with
> ISO-8859-1 files (GNU grep 2.5.4). It seems to correctly pick up the
> character encoding from the environment (specifically LANG).

GNU grep(1) uses the character encoding specified by the environment
variables LC_ALL, LC_CTYPE, or LANG. RTFM.
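
The order of those three variables matters: LC_ALL wins if it is set
and non-empty, then LC_CTYPE, then LANG. A minimal sketch of that
lookup follows; effective_ctype is a made-up helper name, and the
fallback to "C" when nothing is set is an assumption, since the actual
default is implementation-defined.

```shell
# effective_ctype LC_ALL LC_CTYPE LANG
# Hypothetical helper: prints the first non-empty argument, or "C"
# (assumed default) if all three are empty or unset.
effective_ctype() {
    if [ -n "${1-}" ]; then printf '%s\n' "$1"
    elif [ -n "${2-}" ]; then printf '%s\n' "$2"
    elif [ -n "${3-}" ]; then printf '%s\n' "$3"
    else printf 'C\n'
    fi
}

# Report the value that applies in the current shell.
effective_ctype "${LC_ALL-}" "${LC_CTYPE-}" "${LANG-}"
```

One practical consequence: setting LANG for a single command has no
effect on the character encoding if LC_ALL or LC_CTYPE is already set
in the environment.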

PointedEars

From: pk on 31 May 2010 06:55

Ben Bacarisse wrote:

> Guillaume Dargaud <use_my_web_form(a)www.gdargaud.net> writes:
>
>> I'm playing with grep/sed on ISO-8859-1 encoded files,
>
> There is a mismatch between the subject line and this remark. UTF
> usually means UTF-8, which is a multi-byte encoding used for Unicode.
> ISO-8859-1 is a single-byte character set.
>
>> and I
>> notice that . (the dot) doesn't seem to match accented chars,
>> leading to some pretty unexpected results.
>
> I don't see this problem with either UTF-8 encoded files or with
> ISO-8859-1 files (GNU grep 2.5.4). It seems to correctly pick up the
> character encoding from the environment (specifically LANG). I don't
> say this to "show off", just to point out that it does seem to work.
>
> It may simply be that you have a mismatch between the setting of LANG
> and the encoding in the file. You can change LANG for just one command
> like this:
>
> LANG=en_GB.iso-8859-1 grep c.d data
>
>> I know that
>> internationalization and encodings are a hornet's nest, so I'm
>> seeking some advice here...
>
> It can be. Best start with an example. Probably the only way for
> everyone to know exactly what you have in the data file is to post a hex
> dump of it (keep it short). Post the value of $LANG and the command
> line that does not do what you expect. Initially, avoid command lines
> that use anything but "plain" characters.

I'm not sure this matters, but here's what info sed says (in the "bugs that
are not bugs" section):

`s/.*//' does not clear pattern space
This happens if your input stream includes invalid multibyte
sequences. POSIX mandates that such sequences are _not_ matched
by `.', so that `s/.*//' will not clear pattern space as you would
expect. In fact, there is no way to clear sed's buffers in the
middle of the script in most multibyte locales (including UTF-8
locales). For this reason, GNU `sed' provides a `z' command (for
`zap') as an extension.

To work around these problems, which may cause bugs in shell
scripts, set the `LC_COLLATE' and `LC_CTYPE' environment variables
to `C'.
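
A small sketch of the situation that passage describes, for anyone who
wants to try it. The sample input and the locale name en_US.UTF-8 are
assumptions (check `locale -a` for what is installed), and the result
of the second command may differ between sed implementations.

```shell
# Hypothetical sample: "caf" followed by the byte 0xE9 (é in ISO-8859-1),
# which on its own is not a valid UTF-8 sequence.
sample() { printf 'caf\351\n'; }

# C locale: every byte counts as one character, so .* consumes the line.
cleared=$(sample | LC_ALL=C sed 's/.*//')
if [ -z "$cleared" ]; then
    echo "C locale: pattern space cleared"
fi

# UTF-8 locale (assumed name): the stray byte may survive the
# substitution. Dump whatever is left in hex to see for yourself.
sample | LC_ALL=en_US.UTF-8 sed 's/.*//' 2>/dev/null | od -An -tx1
```

LC_ALL=C is used here because it overrides both LC_COLLATE and
LC_CTYPE for that one command, which is a blunter version of the
workaround the manual suggests.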

From: Janis Papanagnou on 31 May 2010 09:03

Thomas 'PointedEars' Lahn wrote:
> Ben Bacarisse wrote:
>
>> Guillaume Dargaud <use_my_web_form(a)www.gdargaud.net> writes:
>>> I'm playing with grep/sed on ISO-8859-1 encoded files,
>> There is a mismatch between the subject line and this remark. UTF
>> usually means UTF-8
>
> [...]
>
>> ISO-8859-1 is a single-byte character set.
>
> Now you are obviously confusing character set and encoding.

ISO/IEC 8859-1: "8-bit single-byte coded graphic character sets"

Omitting the word "coded" is no fault, because it is clear that the
coding is meant if Ben says "single byte"[*]. I am sure no one (but
you) is confusing anything here.

Every "character set" needs some encoding specified; all character sets
(ASCII, Latin 1, EBCDIC, etc.) do that. We're not talking about the
visual graphemes or abstract characters, but of the coupling of those
with their respective encoding when we speak of "character sets".

But, anyway, Ben's main point here was that there's a mismatch in the
OP's posting.

Janis

[*] There's a better nitpick here, BTW: a byte is not necessarily an
8-bit quantity, so it's better to say "8-bit byte" or "octet".
