From: John Machin on
On Oct 24, 4:14 am, Ethan Furman <et...(a)stoneleaf.us> wrote:
> John Machin wrote:
> > On Oct 23, 3:03 pm, Ethan Furman <et...(a)stoneleaf.us> wrote:
>
> >>John Machin wrote:
>
> >>>On Oct 23, 7:28 am, Ethan Furman <et...(a)stoneleaf.us> wrote:
>
> >>>>Greetings, all!
>
> >>>>I would like to add unicode support to my dbf project.  The dbf header
> >>>>has a one-byte field to hold the encoding of the file.  For example,
> >>>>\x03 is code-page 437 MS-DOS.
>
> >>>>My google-fu is apparently not up to the task of locating a complete
> >>>>resource that has a list of the 256 possible values and their
> >>>>corresponding code pages.
>
> >>>What makes you imagine that all 256 possible values are mapped to code
> >>>pages?
>
> >>I'm just wanting to make sure I have whatever is available, and
> >>preferably standard.  :D
>
> >>>>So far I have found this, plus variations: http://support.microsoft.com/kb/129631
>
> >>>>Does anyone know of anything more complete?
>
> >>>That is for VFP3. Try the VFP9 equivalent.
>
> >>>dBase 5,5,6,7 use others which are not defined in publicly available
> >>>dBase docs AFAICT. Look for "language driver ID" and "LDID". Secondary
> >>>source: ESRI support site.
>
> >>Well, a couple hours later and still not more than I started with.
> >>Thanks for trying, though!
>
> > Huh? You got tips to (1) the VFP9 docs (2) the ESRI site (3) search
> > keywords and you couldn't come up with anything??
>
> Perhaps "nothing new" would have been a better description.  I'd already
> seen the clicketyclick site (good info there)

Do you think so? My take is that it leaves out most of the codepage
numbers, and these two lines are wrong:
65h Nordic MS-DOS code page 865
66h Russian MS-DOS code page 866


> and all I found at ESRI
> were folks trying to figure it out, plus one link to a list that was no
> different from the vfp3 list (or was it that the list did not give the
> hex values?  Either way, of no use to me.)

Try this:
http://webhelp.esri.com/arcpad/8.0/referenceguide/


>
> I looked at dbase.com, but came up empty-handed there (not surprising,
> since they are a commercial company).

MS and ESRI have docs ... does that mean that they are non-commercial
companies?

> I searched some more on Microsoft's site in the VFP9 section, and was
> able to find the code page section this time.  Sadly, it only added
> about seven codes.
>
> At any rate, here is what I have come up with so far.  Any corrections
> and/or additions greatly appreciated.
>
> code_pages = {
>      '\x01' : ('ascii', 'U.S. MS-DOS'),

All of the sources say codepage 437, so why ascii instead of cp437?

>      '\x02' : ('cp850', 'International MS-DOS'),
>      '\x03' : ('cp1252', 'Windows ANSI'),
>      '\x04' : ('mac_roman', 'Standard Macintosh'),
>      '\x64' : ('cp852', 'Eastern European MS-DOS'),
>      '\x65' : ('cp866', 'Russian MS-DOS'),
>      '\x66' : ('cp865', 'Nordic MS-DOS'),
>      '\x67' : ('cp861', 'Icelandic MS-DOS'),
>      '\x68' : ('cp895', 'Kamenicky (Czech) MS-DOS'),     # iffy

Indeed iffy. Python doesn't have a cp895 encoding, and it's probably
not alone. I suggest that you omit Kamenicky until someone actually
wants it.

>      '\x69' : ('cp852', 'Mazovia (Polish) MS-DOS'),      # iffy

Look 5 lines back. cp852 is 'Eastern European MS-DOS'. Mazovia
predates and is not the same as cp852. In any case, I suggest that you
omit Mazovia until someone wants it. Interesting reading:

http://www.jastra.com.pl/klub/ogonki.htm

>      '\x6a' : ('cp737', 'Greek MS-DOS (437G)'),
>      '\x6b' : ('cp857', 'Turkish MS-DOS'),
>      '\x78' : ('big5', 'Traditional Chinese (Hong Kong SAR, Taiwan)\

big5 is *not* the same as cp950. The products that create DBF files
were designed for Windows. So when your source says that LDID 0xXX
maps to Windows codepage YYY, I suggest you simply translate that to
the Python encoding cpYYY.

>                 Windows'),       # wag

What does "wag" mean?

>      '\x79' : ('iso2022_kr', 'Korean Windows'),          # wag

Try cp949.


>      '\x7a' : ('iso2022_jp_2', 'Chinese Simplified (PRC, Singapore)\
>                 Windows'),       # wag

Very wrong. iso2022_jp_2 is supposed to include basic Japanese, basic
(1980) Chinese (GB2312) and a basic Korean kit. However to quote from
"CJKV Information Processing" by Ken Lunde, "... from a practical
point of view, ISO-2022-JP-2 ..... [is] equivalent to ISO-2022-JP-1
encoding." i.e. no Chinese support at all. Try cp936.

>      '\x7b' : ('iso2022_jp', 'Japanese Windows'),        # wag

Try cp936.

>      '\x7c' : ('cp874', 'Thai Windows'),                 # wag
>      '\x7d' : ('cp1255', 'Hebrew Windows'),
>      '\x7e' : ('cp1256', 'Arabic Windows'),
>      '\xc8' : ('cp1250', 'Eastern European Windows'),
>      '\xc9' : ('cp1251', 'Russian Windows'),
>      '\xca' : ('cp1254', 'Turkish Windows'),
>      '\xcb' : ('cp1253', 'Greek Windows'),
>      '\x96' : ('mac_cyrillic', 'Russian Macintosh'),
>      '\x97' : ('mac_latin2', 'Macintosh EE'),
>      '\x98' : ('mac_greek', 'Greek Macintosh') }
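
For what it's worth, here is a minimal Python 3 sketch of wiring such a
table up. The 32-byte header prefix and the LDID byte at offset 29
follow the usual dBase header layout; `dbf_codec` and the trimmed-down
table are illustrative only.

```python
import io

# Illustrative subset of the table above, keyed by the raw LDID byte.
CODE_PAGES = {
    0x01: ('cp437', 'U.S. MS-DOS'),
    0x03: ('cp1252', 'Windows ANSI'),
    0xc9: ('cp1251', 'Russian Windows'),
}

def dbf_codec(stream):
    """Return (codec, description) for the DBF file open on *stream*."""
    header = stream.read(32)     # fixed-size header prefix
    ldid = header[29]            # language driver ID lives at offset 29
    try:
        return CODE_PAGES[ldid]
    except KeyError:
        raise ValueError('unrecognised LDID 0x%02x' % ldid)

# A fake header carrying LDID 0x03:
fake = io.BytesIO(bytes(29) + b'\x03' + bytes(2))
```

With the full table in place, the codec name it returns can be passed
straight to bytes.decode().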

HTH,
John
From: John Machin on
On Oct 27, 3:22 am, Ethan Furman <et...(a)stoneleaf.us> wrote:
> John Machin wrote:
> > On Oct 24, 4:14 am, Ethan Furman <et...(a)stoneleaf.us> wrote:
>
> >>John Machin wrote:
>
> >>>On Oct 23, 3:03 pm, Ethan Furman <et...(a)stoneleaf.us> wrote:
>
> >>>>John Machin wrote:
>
> >>>>>On Oct 23, 7:28 am, Ethan Furman <et...(a)stoneleaf.us> wrote:
>
> > Try this:
> >http://webhelp.esri.com/arcpad/8.0/referenceguide/
>
> Wow.  Question, though:  all those codepages mapping to 437 and 850 --
> are they really all the same?

437 and 850 *are* codepages. You mean "all those language driver IDs
mapping to codepages 437 and 850". A codepage merely gives an
encoding. An LDID is like a locale; it includes other things besides
the encoding. That's why many Western European languages map to the
same codepage, first 437 then later 850 then 1252 when Windows came
along.
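
In other words, once the LDID has led you to a codepage, the LDID
itself adds nothing to decoding; a quick illustration:

```python
# 0x01 (US), 0x0b (Finnish), 0x0f (German) and friends all name
# codepage 437, so the same byte decodes identically whichever
# LDID pointed you there:
assert b'\x81'.decode('cp437') == 'ü'   # 0x81 is u-umlaut in cp437
```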

> >>     '\x68' : ('cp895', 'Kamenicky (Czech) MS-DOS'),     # iffy
>
> > Indeed iffy. Python doesn't have a cp895 encoding, and it's probably
> > not alone. I suggest that you omit Kamenicky until someone actually
> > wants it.
>
> Yeah, I noticed that.  Tentative plan was to implement it myself (more
> for practice than anything else), and also to be able to raise a more
> specific error ("Kamenicky not currently supported" or some such).

The error idea is fine, but I don't get the "implement it yourself for
practice" bit ... practice what? You plan a long and fruitful career
implementing codecs for YAGNI codepages?
>
> >>     '\x7b' : ('iso2022_jp', 'Japanese Windows'),        # wag
>
> > Try cp936.
>
> You mean 932?

Yes.

> Very helpful indeed.  Many thanks for reviewing and correcting.

You're welcome.

> Learning to deal with unicode is proving more difficult for me than
> learning Python was to begin with!  ;D

?? As far as I can tell, the topic has been about mapping from
something like a locale to the name of an encoding, i.e. all about the
pre-Unicode mishmash and nothing to do with dealing with unicode ...

BTW, what are you planning to do with an LDID of 0x00?

Cheers,

John
From: Ethan Furman on
John Machin wrote:
> On Oct 27, 3:22 am, Ethan Furman <et...(a)stoneleaf.us> wrote:
>
>>John Machin wrote:
>>
>>>Try this:
>>>http://webhelp.esri.com/arcpad/8.0/referenceguide/
>>
>>Wow. Question, though: all those codepages mapping to 437 and 850 --
>>are they really all the same?
>
> 437 and 850 *are* codepages. You mean "all those language driver IDs
> mapping to codepages 437 and 850". A codepage merely gives an
> encoding. An LDID is like a locale; it includes other things besides
> the encoding. That's why many Western European languages map to the
> same codepage, first 437 then later 850 then 1252 when Windows came
> along.

Let me rephrase -- say I get a dbf file with an LDID of \x0f that maps
to a cp437, and the file came from a german oem machine... could that
file have upper-ascii codes that will not map to anything reasonable on
my \x01 cp437 machine? If so, is there anything I can do about it?


>>>> '\x68' : ('cp895', 'Kamenicky (Czech) MS-DOS'), # iffy
>>
>>>Indeed iffy. Python doesn't have a cp895 encoding, and it's probably
>>>not alone. I suggest that you omit Kamenicky until someone actually
>>>wants it.
>>
>>Yeah, I noticed that. Tentative plan was to implement it myself (more
>>for practice than anything else), and also to be able to raise a more
>>specific error ("Kamenicky not currently supported" or some such).
>
>
> The error idea is fine, but I don't get the "implement it yourself for
> practice" bit ... practice what? You plan a long and fruitful career
> implementing codecs for YAGNI codepages?

ROFL. Playing with code; the unicode/code page interactions. Possibly
looking at constructs I might not otherwise. Since this would almost
certainly (I don't like saying "absolutely" and "never" -- been
troubleshooting for too many years for that!-) be a YAGNI, implementing
it is very low priority.
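
If I do go down that road, the `codecs` module makes plugging in a
charmap codec fairly painless. A sketch of the shape it would take --
the high-byte mapping below is an illustrative placeholder, not a
verified Kamenicky table:

```python
import codecs

# ASCII passes through unchanged; 0x80 gets a placeholder mapping
# (NOT an authoritative Kamenicky value -- illustration only).
DECODE_TABLE = {i: chr(i) for i in range(128)}
DECODE_TABLE[0x80] = '\u010c'
ENCODE_TABLE = {ord(v): k for k, v in DECODE_TABLE.items()}

def _search(name):
    # Codec search function: only answers for the one name we register.
    if name != 'cp895':
        return None
    return codecs.CodecInfo(
        name='cp895',
        encode=lambda s, errors='strict':
            codecs.charmap_encode(s, errors, ENCODE_TABLE),
        decode=lambda data, errors='strict':
            codecs.charmap_decode(data, errors, DECODE_TABLE),
    )

codecs.register(_search)
```

After registration, b'...'.decode('cp895') works like any built-in
codec, and bytes outside the table raise UnicodeDecodeError as usual.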


>>>> '\x7b' : ('iso2022_jp', 'Japanese Windows'), # wag
>>
>>>Try cp936.
>>
>>You mean 932?
>
>
> Yes.
>
>
>>Very helpful indeed. Many thanks for reviewing and correcting.
>
>
> You're welcome.
>
>
>>Learning to deal with unicode is proving more difficult for me than
>>learning Python was to begin with! ;D
>
>
> ?? As far as I can tell, the topic has been about mapping from
> something like a locale to the name of an encoding, i.e. all about the
> pre-Unicode mishmash and nothing to do with dealing with unicode ...

You are, of course, correct. Once it's all unicode life will be easier
(he says, all innocent-like). And dbf files even bigger, lol.


> BTW, what are you planning to do with an LDID of 0x00?

Hmmm. Well, logical choices seem to be either treating it as plain
ascii, and barfing when high-ascii shows up; defaulting to \x01; or
forcing the user to choose one on initial access.

I am definitely open to ideas!


> Cheers,
>
> John

From: John Machin on
On Oct 27, 7:15 am, Ethan Furman <et...(a)stoneleaf.us> wrote:
> John Machin wrote:
> > On Oct 27, 3:22 am, Ethan Furman <et...(a)stoneleaf.us> wrote:
>
> >>John Machin wrote:
>
> >>>Try this:
> >>>http://webhelp.esri.com/arcpad/8.0/referenceguide/
>
> >>Wow.  Question, though:  all those codepages mapping to 437 and 850 --
> >>are they really all the same?
>
> > 437 and 850 *are* codepages. You mean "all those language driver IDs
> > mapping to codepages 437 and 850". A codepage merely gives an
> > encoding. An LDID is like a locale; it includes other things besides
> > the encoding. That's why many Western European languages map to the
> > same codepage, first 437 then later 850 then 1252 when Windows came
> > along.
>
> Let me rephrase -- say I get a dbf file with an LDID of \x0f that maps
> to a cp437, and the file came from a german oem machine... could that
> file have upper-ascii codes that will not map to anything reasonable on
> my \x01 cp437 machine?  If so, is there anything I can do about it?

ASCII is defined over the first 128 codepoints; "upper-ascii codes" is
meaningless. As for the rest of your question, if the file's encoded
in cpXXX, it's encoded in cpXXX. If either the creator or the reader
or both are lying, then all bets are off.
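
A concrete illustration of what "lying" costs, using only stock
codecs: 0x9B is 'ø' in cp850 but '¢' in cp437.

```python
# The same byte read with the wrong codepage silently becomes the
# wrong character -- classic mojibake, and no exception is raised.
raw = 'ø'.encode('cp850')            # b'\x9b'
assert raw.decode('cp850') == 'ø'    # reader agrees with writer: fine
assert raw.decode('cp437') == '¢'    # reader assumes cp437: silently wrong
```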

> > BTW, what are you planning to do with an LDID of 0x00?
>
> Hmmm.  Well, logical choices seem to be either treating it as plain
> ascii, and barfing when high-ascii shows up; defaulting to \x01; or
> forcing the user to choose one on initial access.

It would be more useful to allow the user to specify an encoding than
an LDID.

You need to be able to read files created not only by software like
VFP or dBase but also by scripts using third-party libraries. It would
be useful to allow an encoding to override an LDID that is incorrect,
e.g. the LDID implies cp1251 but the data is actually encoded in
koi8-r or koi8-u.

Read this: http://en.wikipedia.org/wiki/Code_page_437
With no LDID in the file and no encoding supplied, I'd be inclined to
make it barf if any codepoint not in range(32, 128) showed up.
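
Something like this hypothetical checker would do (range(32, 128)
being printable ASCII and nothing else):

```python
def require_plain_ascii(raw):
    # With LDID 0x00 and no user-supplied encoding, refuse to guess:
    # reject any byte outside the printable ASCII range.
    for offset, byte in enumerate(raw):
        if not 32 <= byte < 128:
            raise ValueError('byte 0x%02x at offset %d: '
                             'no LDID and no encoding supplied'
                             % (byte, offset))
    return raw.decode('ascii')
```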

Cheers,
John
From: Ethan Furman on
John Machin wrote:
> On Oct 27, 7:15 am, Ethan Furman <et...(a)stoneleaf.us> wrote:
>
>>Let me rephrase -- say I get a dbf file with an LDID of \x0f that maps
>>to a cp437, and the file came from a german oem machine... could that
>>file have upper-ascii codes that will not map to anything reasonable on
>>my \x01 cp437 machine? If so, is there anything I can do about it?
>
> ASCII is defined over the first 128 codepoints; "upper-ascii codes" is
> meaningless. As for the rest of your question, if the file's encoded
> in cpXXX, it's encoded in cpXXX. If either the creator or the reader
> or both are lying, then all bets are off.

My confusion is this -- is there a difference between any of the various
cp437s? Going down the list at ESRI: 0x01, 0x09, 0x0b, 0x0d, 0x0f,
0x11, 0x15, 0x18, 0x19, and 0x1b all map to cp437, and they have names
such as US, Dutch, Finnish, French, German, Italian, Swedish, Spanish,
English (Britain & US)... are these all the same?


>>>BTW, what are you planning to do with an LDID of 0x00?
>>
>>Hmmm. Well, logical choices seem to be either treating it as plain
>>ascii, and barfing when high-ascii shows up; defaulting to \x01; or
>>forcing the user to choose one on initial access.
>
> It would be more useful to allow the user to specify an encoding than
> an LDID.

I plan on using the same technique used in xlrd and xlwt, and allowing
an encoding to be specified when the table is opened. If not specified,
it will use whatever the table has in the LDID field.
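
That override logic might look like this sketch (`choose_encoding` and
the trimmed table are illustrative names, not xlrd's actual API):

```python
# Illustrative subset of the LDID-to-codec table.
CODE_PAGES = {0x01: 'cp437', 0x03: 'cp1252', 0xc9: 'cp1251'}

def choose_encoding(ldid, override=None):
    """An explicit encoding from the caller always wins; otherwise
    fall back on whatever the header's LDID says."""
    if override is not None:
        return override
    try:
        return CODE_PAGES[ldid]
    except KeyError:
        raise ValueError('LDID 0x%02x unknown; pass an encoding explicitly'
                         % ldid)
```

So choose_encoding(0xc9) gives 'cp1251', while
choose_encoding(0xc9, 'koi8-r') honours the caller's override.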


> You need to be able to read files created not only by software like
> VFP or dBase but also scripts using third-party libraries. It would be
> useful to allow an encoding to override an LDID that is incorrect e.g.
> the LDID implies cp1251 but the data is actually encoded in koi8[ru]
>
> Read this: http://en.wikipedia.org/wiki/Code_page_437
> With no LDID in the file and no encoding supplied, I'd be inclined to
> make it barf if any codepoint not in range(32, 128) showed up.

Sounds reasonable -- especially when the encoding can be overridden.

~Ethan~