From: Mok-Kong Shen on

Let's assume a 6-bit printable coding alphabet Q of 64 symbols, e.g.
{ a-z, A-Z, 0-9, +, - }, and adopt the following convention for
grouping of codewords:

Group 1: two symbols, 1st in Q\{0-9}, 2nd in Q.

Group 2: three symbols, 1st in {1-9}, 2nd and 3rd in Q.

Group 3: four symbols, 1st symbol 0, 2nd in Q\{0}, 3rd and 4th in Q.
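
Since the first symbol already determines the group, codeword
boundaries can be found without lookahead. A minimal sketch in
Python of the length rule (the function name is mine, just for
illustration):

    import string

    Q = (string.ascii_lowercase + string.ascii_uppercase
         + string.digits + "+-")
    assert len(Q) == 64  # 6-bit alphabet

    def codeword_length(symbols):
        """Number of symbols in the codeword starting at symbols[0]."""
        if symbols[0] not in string.digits:  # group 1: 1st in Q\{0-9}
            return 2
        if symbols[0] != "0":                # group 2: 1st in {1-9}
            return 3
        if symbols[1] != "0":                # group 3: 2nd in Q\{0}
            return 4
        raise ValueError("'00' is reserved; see escape mechanism below")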

The cardinalities of these three sets are 3456 (= 54*64), 36864
(= 9*64^2) and 258048 (= 63*64^2) respectively, totalling 298368.
Considering that Basic English has 850 core words and that a
frequency count of books of Project Gutenberg involves some 40000
words (see http://en.wiktionary.org/wiki/Wiktionary:Frequency_lists),
the above coding scheme should be fairly sufficient for a dictionary
coding of English words in most common practical applications. For
efficiency, the most frequently used words should be assigned to
group 1 and the comparatively less frequent ones to group 2, with
group 3 containing the rarely used words. Note that, since the 2nd
symbol of a group 3 codeword excludes 0, no codeword begins with
"00"; this initial "00" is thus reserved as an escape mechanism for
verbatim coding of all exceptional words that may be required.
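
To illustrate the assignment, a sketch (continuing with Q from the
snippet above; the names and the zip-based codebook are merely one
possible realization):

    from itertools import product

    non_digits = [c for c in Q if c not in string.digits]
    group1 = ["".join(p) for p in product(non_digits, Q)]      # 3456
    group2 = ["".join(p) for p in product("123456789", Q, Q)]  # 36864
    group3 = ["".join(p) for p in
              product("0", [c for c in Q if c != "0"], Q, Q)]  # 258048
    codewords = group1 + group2 + group3                       # 298368

    def build_codebook(ranked_words):
        # ranked_words: most frequent word first
        return dict(zip(ranked_words, codewords))

The most frequent 3456 words then receive 12-bit codewords, the next
36864 receive 18 bits, and the remainder 24 bits.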

Very roughly I estimate that one could in this way code English text
with an average of about 14 bits per word (a group 1 codeword takes
12 bits, group 2 18 bits, group 3 24 bits). How would this compare
with ASCII coding followed by compression?
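
(For calibration of that figure, with purely assumed coverage
fractions: if group 1 words made up 80% of running text, group 2
words 15% and group 3 words 5%, the average would be
0.80*12 + 0.15*18 + 0.05*24 = 13.5 bits per word.)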

Thanks.

M. K. Shen