From: Mik on
I have files with data and text in the Russian Windows encoding (CP1251). My
current locale is UTF-8 (Linux). My Fortran program parses strings in
these files and performs computations. I use a utility named 'recode' to
convert the text to UTF-8. The Windows version of the program works without
errors, but the Linux version can't parse these files, because Russian
characters in UTF-8 take two bytes each. What solutions are there?

Thanks
From: Mik on
Mik wrote:
> I have files with data and text in the Russian Windows encoding (CP1251). My
> current locale is UTF-8 (Linux). My Fortran program parses strings in
> these files and performs computations. I use a utility named 'recode' to
> convert the text to UTF-8. The Windows version of the program works without
> errors, but the Linux version can't parse these files, because Russian
> characters in UTF-8 take two bytes each. What solutions are there?
>
> Thanks

The strings look roughly like this:

| абвгд | 1 | 23.45 | 67.89 | опрст |
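
To make the size difference concrete, here is a minimal Python sketch (not the Fortran program itself) that encodes the sample record both ways and compares the byte counts. The variable names are illustrative only.

```python
# Sample record from the post above.
line = "| абвгд | 1 | 23.45 | 67.89 | опрст |"

cp1251_bytes = line.encode("cp1251")  # one byte per character
utf8_bytes = line.encode("utf-8")     # two bytes per Cyrillic letter

# Total record length in characters and in bytes for each encoding.
print(len(line), len(cp1251_bytes), len(utf8_bytes))

# Byte offset of the second '|' delimiter in each encoding: any parser
# that relies on fixed byte columns sees the fields shifted in UTF-8.
cp1251_col = cp1251_bytes.index(b"|", 1)
utf8_col = utf8_bytes.index(b"|", 1)
print(cp1251_col, utf8_col)
```

Note that the `|` delimiter is plain ASCII, and ASCII bytes never occur inside a multi-byte UTF-8 sequence, so splitting a record on `|` stays safe in either encoding. Only logic that counts bytes or assumes fixed column positions is affected.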
From: Terence on
The whole problem is the 2-byte encoding used for Russian.

I provide software which runs in many left-to-right languages by
providing external modules of message strings, in several languages,
for each internal message in the program.

Here I use ONLY a one-byte symbol and select the appropriate Microsoft
table for the language required. For Russian this would be the Cyrillic
table. For Polish it's the Slavic table and so on. For Greek I use a
complete Greek table, not the 10 or so top-table physics notation set.

So one solution that occurs to me is:

Write a program to read the data file and detect the leading byte of
the two-byte UTF-8 code (D0h=Cyrillic, for the Cyrillic coding
throughout the data), and convert the second byte to a new byte
corresponding to a 256-entry DOS Microsoft Cyrillic symbol table.

Then use a single-byte Cyrillic table when reading Russian data, if
this is possible in Linux, or else the nearest distinct Latin
equivalent to make the text understandable (R.N P F...).
It's obviously possible here in the Forum, as the Russian comes out
readably.
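
The byte-level conversion described above could be sketched roughly as follows. This is a hedged Python illustration, not a finished tool: the function name `utf8_cyrillic_to_cp1251` is hypothetical, the target table is assumed to be CP1251 (the encoding of the original files) rather than a DOS table, and both D0h and D1h are treated as lead bytes, since the basic Russian alphabet spans the two.

```python
def utf8_cyrillic_to_cp1251(data: bytes) -> bytes:
    """Collapse two-byte UTF-8 Russian letters into single CP1251 bytes.

    Assumption: only the Russian alphabet (U+0410..U+044F plus U+0401 and
    U+0451) needs mapping; all other bytes are copied through unchanged.
    """
    out = bytearray()
    i = 0
    while i < len(data):
        b0 = data[i]
        is_pair = (
            b0 in (0xD0, 0xD1)
            and i + 1 < len(data)
            and 0x80 <= data[i + 1] <= 0xBF
        )
        if is_pair:
            # Rebuild the Unicode code point from the two UTF-8 bytes.
            cp = ((b0 & 0x1F) << 6) | (data[i + 1] & 0x3F)
            if 0x0410 <= cp <= 0x044F:
                # CP1251 stores the same run of letters at C0h..FFh.
                out.append(cp - 0x0410 + 0xC0)
            elif cp == 0x0401:   # capital IO
                out.append(0xA8)
            elif cp == 0x0451:   # small io
                out.append(0xB8)
            else:
                out.append(ord("?"))  # other Cyrillic: not handled here
            i += 2
        else:
            out.append(b0)
            i += 1
    return bytes(out)


sample = "| абвгд | 1 | 23.45 | 67.89 | опрст |"
converted = utf8_cyrillic_to_cp1251(sample.encode("utf-8"))
print(len(sample.encode("utf-8")), len(converted))
```

After this pass every character occupies one byte again, so a parser written for the single-byte files should see the same layout. A general-purpose converter such as `iconv -f UTF-8 -t CP1251` does the same job for the whole character set.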

Another solution is to look up the Russian-coded string internally,
convert it to a word in your language of choice using single-byte
symbols, and store it back in what is now ample space for a one-byte
coding.
From: Terence on
I wrote a reply with two solutions. I don't see it.
I was about to comment that the first byte of UTF-8 for Cyrillic is D0h
AND D1h, not just D0h as I stated. The previous message SAYS it got
posted the simple way. This time there's a different screen!
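
The D0h/D1h point is easy to check. A small Python sketch, assuming the alphabet of interest is the basic Russian one (U+0410..U+044F plus the two forms of the letter io):

```python
# Basic Russian alphabet: U+0410..U+044F, plus capital and small io.
alphabet = [chr(cp) for cp in range(0x0410, 0x0450)] + ["Ё", "ё"]

# Collect the distinct first bytes of each letter's UTF-8 encoding.
lead_bytes = sorted({ch.encode("utf-8")[0] for ch in alphabet})
print([f"{b:02X}h" for b in lead_bytes])

# Letters whose UTF-8 form starts with D1h rather than D0h.
d1_letters = [ch for ch in alphabet if ch.encode("utf-8")[0] == 0xD1]
print("".join(d1_letters))
```

So a converter that tests only for D0h would miss the letters from "р" to "я" and the small io.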
From: Gerry Ford on

"Terence" <tbwright(a)cantv.net> wrote in message
news:1fd9e9b5-0c7a-45eb-906f-e7c9d2db6bb2(a)f63g2000hsf.googlegroups.com...
> The whole problem is the 2-byte encoding used for Russian.

That's one of the problems. Another is that the wall that used to divide
Berlin shifted east and has kept westerners--at least this westerner--from
communicating with Gospodin Putin's Russia on the internet, in particular
in newsgroups.

I could shed plenty of light on this question if the OP can help me, for
example, use the Cyrillic keys on my keyboard.

I could then replicate his data set and put his question in the crossfire.
--
"A belief in a supernatural source of evil is not necessary; men alone
are quite capable of every wickedness."

~~ Joseph Conrad (1857-1924), novelist