From: kolmogolov on
On Mar 8, 6:30 pm, j...(a)toerring.de (Jens Thoms Toerring) wrote:
> kolmogo...(a)gmail.com <kolmogo...(a)gmail.com> wrote:
> > On Mar 8, 5:32 pm, "kolmogo...(a)gmail.com" <kolmogo...(a)gmail.com>
> > wrote:
> > (commenting my own follow-up)
> > > It turns out that my sed(1) got confused by the bytes 0xe2, 0x80
> > > and 0x93 in the description field.
>
> > > I recall that I just replaced the grep(1) in one of my scripts with
> > > some ``agrep(1)'' for files containing ISO-8859-1 characters...
> > That is, my GNU grep(1) did not match a German umlaut in ISO-8859-1
> > with a single dot.
>
> It might be an encoding issue - if you e.g. have set your shell
> to UTF-8 (or the file into which you put the grep command is
> in UTF-8) but the file you're running grep on is in iso-8859-1
> then grep will try to match the UTF-8 characters and, of course,
> miss the characters that are in a different encoding. In the
> iso-8859-1 file e.g. the character 'ä' will be represented by
> the value 0xe4 while in UTF-8 it's actually two bytes, 0xC3
> followed by 0xA4, so grep has no chance to figure out that
> they are supposed to represent the same character. The same
> holds, of course, for sed. That it still works with agrep is
> probably due to agrep also accepting approximate matches.
>
> Since I didn't find a way to match binary data (and you
> would have to know how the characters you're interested in are
> stored as byte values), the simplest solution might be to
> convert the file to the character encoding used for running sed
> and afterwards back to the original encoding. So if the file
> 'in.txt' is in ISO-8859-1 but you "operate" in UTF-8, then the
> following command will first convert the file to UTF-8, then
> run sed on it, replacing 'ä' by 'ö', and then re-convert the
> results back to ISO-8859-1:
>
> iconv -f ISO-8859-1 -t UTF-8 in.txt | sed 's/ä/ö/' | \
> iconv -f UTF-8 -t ISO-8859-1 - > out.txt
>

Thanks a lot!
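
To make the byte-level difference described above concrete, here is a
small sketch that prints the bytes of 'ä' in both encodings. It assumes
iconv(1) and od(1) are installed; the variable names are only for
illustration. The character is written with octal escapes so that the
result does not depend on the encoding of the script file itself.

```shell
# 'ä' in UTF-8 is the two bytes 0xC3 0xA4 (octal 303 244).
utf8_hex=$(printf '\303\244' | od -An -tx1 | tr -d ' \n')

# Converting those two bytes to ISO-8859-1 leaves the single byte 0xE4.
latin1_hex=$(printf '\303\244' | iconv -f UTF-8 -t ISO-8859-1 \
    | od -An -tx1 | tr -d ' \n')

echo "UTF-8:      $utf8_hex"     # c3a4
echo "ISO-8859-1: $latin1_hex"   # e4
```

A lone 0xE4 byte is not a valid UTF-8 sequence, which would explain why
a single dot fails to match it under a UTF-8 locale.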

So, I solved both grep(1) and sed(1) problems by
inserting a single line in my scripts:

LC_CTYPE=en_US.iso88591

to trick them into treating the data as single-byte
characters while leaving my shell working
environment intact.
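
A variation on the same idea, as a rough sketch: the locale can also be
overridden for a single command by prefixing it with the assignment.
LC_ALL=C is used here only because that locale exists everywhere; it is
a stand-in for en_US.iso88591, not the setting from my scripts. Note
that LC_ALL, when set, takes precedence over LC_CTYPE. The sample word
and the variable names are made up for the example.

```shell
# Remember the current setting to show that it is left untouched.
before=${LC_ALL-unset}

# "Mädchen" in ISO-8859-1: 'ä' is the single byte 0xE4 (octal 344).
# The VAR=value prefix applies to that one command only.
matches=$(printf 'M\344dchen\n' | LC_ALL=C grep -c 'M.dchen')
fixed=$(printf 'M\344dchen\n' | LC_ALL=C sed 's/M.dchen/Maedchen/')

after=${LC_ALL-unset}

echo "matching lines: $matches"
echo "after sed:      $fixed"
echo "LC_ALL before/after: $before / $after"
```

With a byte-oriented locale the dot matches the 0xE4 byte, so grep
counts the line and sed can replace the word.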

Since lots of my private documents are encoded in
the Big5 charset and my shell always has

LANG=C
LC_CTYPE=zh_TW.Big5
LC_NUMERIC="C"
....

I'm simply not yet ready to switch over...
Happy end!

regards
Rudi


