From: lihao on
Under vim, these non-ASCII characters are displayed as <97>, <92>, etc.,
and on the command line 'cat -A filename' shows 'M-BM-^W' and
'M-BM-^R' respectively. I know <97> is some kind of long dash '--' and
<92> is a single quotation mark. Under bash, how can I convert these
characters into proper ASCII, i.e.:

under vim:
<97> => -
<92> => '

many thanks,
lihao
From: Ben Bacarisse on
lihao <lihao0129(a)gmail.com> writes:

> under vim, these non-ascii characters are displayed as <97>, <92> etc.
> and on command line with 'cat -A filename' shows: 'M-BM-^W' , 'M-BM-
> ^R' respectively... I know <97> is some long dash '--' and <92> is a
> single quotation mark. under bash, How can I transfer these characters
> into proper ASCII, i.e.:

The command

tr "\227\222" "-'"

will translate octal 227 (hex 97) bytes into - and octal 222 (hex 92)
bytes into '.

This may not work because the cat -A output suggests that you may not
have these exact bytes in the file. I don't use cat -A so I can't
interpret the sequences you see.

I find the least ambiguous output is a hex dump ('hd' or 'od -t x1').
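
As a rough sketch of what such a dump can tell you, the two commands below
feed hand-written sample bytes (not the poster's actual file) to od. The
second case assumes the cat -A output uses GNU cat's notation, in which
'M-B' stands for byte 0xC2 and 'M-^W' for byte 0x97; if so, the file would
hold two-byte sequences rather than the single bytes 0x97 and 0x92.

```shell
# Case 1: the data really contains the single bytes 0x97 and 0x92.
# -A n suppresses the offset column, -t x1 prints one hex pair per byte.
printf '\227\222\n' | od -A n -t x1

# Case 2: each of those bytes is preceded by 0xC2, which is what
# 'M-BM-^W' and 'M-BM-^R' would mean in GNU cat's -A notation (assumption).
printf '\302\227\302\222\n' | od -A n -t x1
```

The first dump should list the bytes 97 92 0a and the second c2 97 c2 92 0a.
On a real file, 'od -A n -t x1 filename' shows which case applies.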

If you know the actual encoding used, the iconv utility can be an
excellent way to map between different file encodings. For example:

iconv --from-code=windows-1252 --to-code=ascii//translit

is a close match to what you want (but it doubles the - to --).

--
Ben.
From: Thomas 'PointedEars' Lahn on
lihao wrote:

> under vim, these non-ascii characters are displayed as <97>, <92> etc.
> and on command line with 'cat -A filename' shows: 'M-BM-^W' , 'M-BM-
> ^R' respectively... I know <97> is some long dash '--' and <92> is a
> single quotation mark.

It's "em dash" and "right single quotation mark".

> under bash, How can I transfer these characters into proper ASCII, i.e.:
>
> under vim:
> <97> => -
> <92> => '

(JFYI: This has nothing to do with bash, and little to do with shell-
scripting. You want to choose your forum more carefully next time.
<http://www.catb.org/~esr/faqs/smart-questions.html#forum>)

The original encoding must be one of the Windows-125x family, which are the
only encodings/character sets I know of that use the code point range
0x80..0x9F for printable characters (ISO/IEC 8859-x have no characters at
all there and ISO-8859-x/Unicode have only control characters there).

<http://en.wikipedia.org/wiki/Western_Latin_character_sets_%28computing%29>

There is a tool to TRanslate between user-defined sets of characters:

tr '\227\222' "-'" < filename

(227 and 222 are the octal representations of hexadecimal 97 and 92,
respectively.)
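
A minimal way to try this without touching a real file is to generate the
two bytes with printf. The sample text is made up for the demo, and
LC_ALL=C is an assumption added here so that tr treats its input as plain
bytes even when the current locale is UTF-8.

```shell
# "it<92>s here <97> now" as Windows-1252 bytes, written with octal escapes.
# tr maps byte 0x97 (octal 227) to "-" and byte 0x92 (octal 222) to "'".
printf 'it\222s here \227 now\n' | LC_ALL=C tr '\227\222' "-'"
```

This should print the line with a plain ASCII apostrophe and hyphen in
place of the two Windows-1252 bytes.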

To get approximate ASCII representations of all Windows-1252 characters
that have no exact ASCII equivalent, you can use

recode Windows-1252..ASCII filename

(I have tested that this works.) You might want to back up the original
file first, because recode overwrites it in place.

Both provide, of course, only a crude approximation of the original
characters, if any at all. There is no single-character approximation for
the Euro, Ellipsis, or Permille character in US-ASCII, for example, though
recode -f manages to represent some of them adequately.

However, you can convert the file to use an encoding for a character set
that contains equivalent characters:

iconv -f Windows-1252 -t UTF-8 < filename > filename-utf8

or

iconv -f Windows-1252 -t UTF-8 -o filename-utf8 filename

(or any other Unicode Transformation Format). I have tested that this works
on a console with a UTF-8 locale: I converted EM DASH [U+2014] followed by
RIGHT SINGLE QUOTATION MARK [U+2019] to Windows-1252 with iconv, opened the
file with vim and observed the described `<97><92>', then converted it back
to UTF-8 with iconv and got the same characters as in the original
UTF-8-encoded file.
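
That round trip can be sketched without vim by checking the bytes with od
instead. The variable name utf8 is made up for this illustration, and the
sketch assumes that the local iconv accepts the encoding name
"Windows-1252".

```shell
# U+2014 EM DASH and U+2019 RIGHT SINGLE QUOTATION MARK as UTF-8 bytes,
# written with octal escapes so the script itself stays plain ASCII.
utf8=$(printf '\342\200\224\342\200\231')

# UTF-8 -> Windows-1252: each character should become a single byte.
printf '%s' "$utf8" | iconv -f UTF-8 -t Windows-1252 | od -A n -t x1

# And back again: the result should be the original six UTF-8 bytes.
printf '%s' "$utf8" | iconv -f UTF-8 -t Windows-1252 |
    iconv -f Windows-1252 -t UTF-8 | od -A n -t x1
```

The first dump should show the bytes 97 92 and the second e2 80 94 e2 80 99.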


HTH

PointedEars
From: pk on
Thomas 'PointedEars' Lahn wrote:

> (JFYI: This has nothing to do with bash, and little to do with shell-
> scripting. You want to choose your forum more carefully next time.

And this is not a forum, for that matter.

From: Chris F.A. Johnson on
On 2010-04-03, pk wrote:
> Thomas 'PointedEars' Lahn wrote:
>
>> (JFYI: This has nothing to do with bash, and little to do with shell-
>> scripting. You want to choose your forum more carefully next time.
>
> And this is not a forum, for that matter.

Of course it's a forum. It's not a _web_ forum but it is a forum.


--
Chris F.A. Johnson, author <http://shell.cfajohnson.com/>
===================================================================
Shell Scripting Recipes: A Problem-Solution Approach (2005, Apress)
Pro Bash Programming: Scripting the GNU/Linux Shell (2009, Apress)
===== My code in this post, if any, assumes the POSIX locale =====
===== and is released under the GNU General Public Licence =====