From: Terence on
On Oct 26, 8:23 am, Ron Shepard <ron-shep...(a)NOSPAM.comcast.net>
wrote:
> In article <PVYEm.49864$ze1.16...(a)news-server.bigpond.net.au>,
>

This all seems to be about parsing (ASCII?) symbols off an input
medium.
I've worked in this area for a long time, and I've found that
the best way is to read the input medium in "suitable" chunks in
"binary" or "transparent" unformatted sequential mode, and use a "next
byte" routine to get the next byte or cause a new data "chunk" to be
obtained.
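
As an illustration only, here is a minimal sketch of such a "next byte"
routine in Python. The class name ByteSource, the method next_byte and
the chunk_size default are all made up for this example; the point is
just that the caller asks for one byte at a time and the refill of the
chunk happens behind the scenes.

```python
import io


class ByteSource:
    """Hands out one byte at a time, refilling from the stream in chunks."""

    def __init__(self, stream, chunk_size=4096):
        self.stream = stream          # any binary file-like object
        self.chunk_size = chunk_size  # hypothetical default; tune to the medium
        self.chunk = b""
        self.pos = 0

    def next_byte(self):
        """Return the next byte as an int in 0..255, or None at end of input."""
        if self.pos >= len(self.chunk):
            # Current chunk is used up: fetch the next one.
            self.chunk = self.stream.read(self.chunk_size)
            self.pos = 0
            if not self.chunk:
                return None
        value = self.chunk[self.pos]
        self.pos += 1
        return value


# Tiny chunk size so the refill path is exercised several times.
src = ByteSource(io.BytesIO(b"AB\r\n\xe9"), chunk_size=2)
values = []
while True:
    b = src.next_byte()
    if b is None:
        break
    values.append(b)
print(values)
```

The parser itself never sees chunk boundaries; it only calls next_byte
until it gets the end-of-input signal.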

Any error signal is to be treated as a signal for special initial
treatment of the last "chunk" to locate a credible EOF signal symbol,
knowing that the last chunk read into the same buffer does NOT fully
overwrite it, and therefore the later characters will appear to be
duplicated.
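
A rough sketch of that situation in Python, under two assumptions that
are not stated above: that the data ends with a Ctrl-Z (#1A) EOF
symbol, and that the buffer is 8 bytes long. Python's readinto does
report how many bytes arrived, but the sketch ignores that count and
relies only on finding the EOF symbol, which is the case when all you
get back is an error signal.

```python
import io

CHUNK = 8        # assumed buffer size, kept tiny for the demonstration
EOF_MARK = 0x1A  # assumption: Ctrl-Z marks the end of the data


def credible_length(buf):
    """Number of bytes before the first EOF symbol, or the whole buffer."""
    idx = buf.find(bytes([EOF_MARK]))
    return len(buf) if idx < 0 else idx


stream = io.BytesIO(b"ABCDEFGHij\x1a")
buf = bytearray(CHUNK)

stream.readinto(buf)  # first chunk fills the buffer: ABCDEFGH
stream.readinto(buf)  # short final chunk: only the first 3 bytes change

# The tail of the buffer still holds bytes from the previous chunk.
stale_tail = bytes(buf[3:])
last_data = bytes(buf[:credible_length(buf)])
print(bytes(buf), stale_tail, last_data)
```

After the second read the buffer holds the new bytes followed by the
old "DEFGH", so a parser that trusted the whole buffer would see those
characters twice. Cutting at the EOF symbol avoids that.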

The data captured with this method consists of eight-bit characters
whose numeric values can be from 0 to 255. My preferred technique is
to pick up each 8-bit character into the lower byte of a pre-zeroed
(once) 16-bit word and then determine in which section of the ASCII
table (#00-#1F, #20-#7F, and #80-#FF) the symbol falls as the first
step in parsing for sense or an EOR/EOF symbol. This reduces the size
of the action tables indexed by the numeric value of the symbol by
dividing the problem up into control, ASCII text and accented text.
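
The three-way split can be sketched as follows, again in Python and
again only as an illustration. The function names, the action names
and the contents of CONTROL_ACTIONS (LF and CR as EOR, Ctrl-Z as EOF)
are assumptions for the example, not a description of any particular
parser.

```python
def classify(value):
    """Map a byte value 0..255 to one of the three sections of the table."""
    if value < 0x20:
        return "control"   # #00-#1F
    if value < 0x80:
        return "ascii"     # #20-#7F
    return "accented"      # #80-#FF


# Hypothetical action table for the control section only; the other two
# sections would get their own, much simpler, tables.
CONTROL_ACTIONS = {0x0A: "EOR", 0x0D: "EOR", 0x1A: "EOF"}


def action_for(value):
    """Pick an action by first choosing the section, then looking it up."""
    section = classify(value)
    if section == "control":
        return CONTROL_ACTIONS.get(value, "skip")
    if section == "ascii":
        return "text"
    return "accented-text"


actions = [action_for(b) for b in b"Hi\n\xe9\x1a"]
print(actions)
```

Because the section is decided first, each table only has to cover its
own range (32, 96 or 128 entries) instead of all 256 values.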

This allows parsing any language that can use 8-bit characters, and
can be extended to 16-bit character sets. I've used this for a great many
left-to-right languages, including Romaji and Greek.