substitution for octal chars [Shell]


From: Lao Ming on 7 Jul 2010 22:44

When I download a text file containing octal characters

e.g. \342\200\231

as in: can\342\200\231t or don\342\200\231t

is there a way to replace these with their ASCII equivalent
from the shell with sed, perl or awk?
Thanks.

From: Janis Papanagnou on 8 Jul 2010 04:13

Lao Ming wrote:
> When I download a text file containing octal characters
>
> e.g. \342\200\231
>
> as in: can\342\200\231t or don\342\200\231t
>
> is there a way to replace these with their ascii equivalent
> from the shell with sed, perl or awk?

I fear there might not be an ASCII equivalent if some encoding
of a different character set has been used here instead of ASCII.
You'll have to find out what encoding has been used in the first
place. Then the program iconv may help you convert the data.

Janis

> Thanks.

From: Ben Bacarisse on 8 Jul 2010 08:23

Lao Ming <laomingliu(a)gmail.com> writes:

> When I download a text file containing octal characters
>
> e.g. \342\200\231
>
> as in: can\342\200\231t or don\342\200\231t
>
> is there a way to replace these with their ascii equivalent
> from the shell with sed, perl or awk?

The example is a useful one. \342\200\231 is the UTF-8 encoding of
U+2019, the "right single quotation mark", which Unicode recommends as
the character to use for an apostrophe. It is therefore very likely that
the file is UTF-8 encoded.

When you say the file contains octal characters it is not clear if you
are showing us the octal values for the characters or whether the file
really has the backslash followed by the three digits. In other words,
does \342\200\231 represent 3 or 12 octets?
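
One way to answer the 3-or-12 question is to look for a literal backslash
followed by three octal digits. The following is only a sketch: the
function name has_literal_octal_escapes is made up for illustration, and
it assumes a POSIX shell and grep.

```shell
# Hypothetical helper (the name is an assumption, not a standard tool).
# Succeeds (exit 0) when the input contains a literal backslash followed
# by three octal digits, i.e. the 12-character form of \342\200\231.
# Fails (exit 1) when only raw bytes are present (the 3-octet form).
# Reads the named files, or standard input when none are given.
has_literal_octal_escapes() {
    # LC_ALL=C makes grep work byte by byte, so raw UTF-8 bytes are harmless.
    LC_ALL=C grep -q '\\[0-3][0-7][0-7]' "$@"
}
```

If "has_literal_octal_escapes my-input-file" succeeds, the escapes are
really in the file as text; if it fails, what you saw was probably just
how some tool displayed the raw bytes.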

If (as is likely) it is the former then iconv (with //translit) is the
place to start. You may run into trouble when there are characters in
the file that have no obvious ASCII equivalent, but that is another
problem.

iconv --from=utf-8 --to=ascii//translit my-input-file

--
Ben.
