From: Paulo Marques on
Fiziwig wrote:
> [...]
> I shall have to rebuild the dictionary using the Huffman algorithm and
> see what it looks like.

After tens of posts in sci.crypt trying to help people who didn't want
any help and took every constructive criticism as an insult, this was
a real breath of fresh air. Thank you, Gary!

If you need help building the Huffman-optimized codebook, please send
me the frequency table you're using and I'll try to help you out. I'm
curious what kind of improvement in the average letters-per-word value
the pure Huffman will give us.

Thanks again,

--
Paulo Marques - www.grupopie.com

"All generalizations are false."
From: Fiziwig on
On Jul 22, 5:09 am, Paulo Marques <pmarq...(a)grupopie.com> wrote:
> Fiziwig wrote:
> > [...]
> > I shall have to rebuild the dictionary using the Huffman algorithm and
> > see what it looks like.
>
> After tens of posts in sci.crypt trying to help people who didn't want
> any help and took every constructive criticism as an insult, this was
> a real breath of fresh air. Thank you, Gary!
>
> If you need help building the Huffman-optimized codebook, please send
> me the frequency table you're using and I'll try to help you out. I'm
> curious what kind of improvement in the average letters-per-word value
> the pure Huffman will give us.
>
> Thanks again,

You're welcome.

My frequency data comes from the count of occurrences of each word in
the 1-million-word Brown Corpus. The resulting word list contains
about 14,000 distinct words, after removing bogus entries caused by
typos in the original source material. I'm curious myself to see the
results. I'm coding up a C++ program to accomplish the task. I started
to do it by hand, but that turned out to be too big a job with 14,000
words!
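
For what it's worth, producing the count itself is only a few lines of
C++. A rough sketch (the file name is just a placeholder, and real
corpus text would also need punctuation stripped):

#include <fstream>
#include <iostream>
#include <map>
#include <string>

int main() {
    std::ifstream in("brown_corpus.txt");  // placeholder path
    std::map<std::string, long> freq;      // word -> occurrence count
    std::string word;
    while (in >> word) ++freq[word];       // naive whitespace tokens
    for (const auto& wf : freq)
        std::cout << wf.first << " " << wf.second << "\n";
}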

As an aside, it might be fun to let each ply of the tree alternate
between consonant-headed subtrees and vowel-headed subtrees, so that
every code word would alternate vowels with consonants and be
pronounceable. (UGABUGA) Then it could be a spoken code as well. Or
maybe that's more properly called a language, in which case we've left
cryptography and entered linguistics. ;-)
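
A toy sketch of what I mean, drawing letters from the vowels on even
plies and the consonants on odd plies when turning a tree path into a
codeword. (Hypothetical code; note the branching factor would then
alternate 5/21 rather than a uniform 26, which changes code lengths.)

#include <iostream>
#include <string>
#include <vector>

// Turn a tree path (one branch index per ply) into a pronounceable
// codeword: vowels on even plies, consonants on odd plies.
std::string pathToCode(const std::vector<int>& path) {
    static const std::string VOWELS = "AEIOU";
    static const std::string CONSONANTS = "BCDFGHJKLMNPQRSTVWXYZ";
    std::string code;
    for (size_t ply = 0; ply < path.size(); ++ply) {
        const std::string& abc = (ply % 2 == 0) ? VOWELS : CONSONANTS;
        code += abc[path[ply] % abc.size()];
    }
    return code;
}

int main() {
    std::cout << pathToCode({4, 6, 0, 1}) << "\n";  // prints "UJAC"
}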

--gary


From: rossum on
On Wed, 21 Jul 2010 16:25:38 -0700 (PDT), WTShaw <lurens1(a)gmail.com>
wrote:

>On Jul 21, 3:59 pm, rossum <rossu...(a)coldmail.com> wrote:
>> On Tue, 20 Jul 2010 13:52:30 -0700 (PDT), Fiziwig <fizi...(a)gmail.com>
>> wrote:
>>
>> >I wanted to keep it as simple as possible for a human being using a
>> >hard copy code book and pencil and paper.
>>
>> Playfair.
>>
>> rossum
>
>Playfair, not so great. Yes, hard by hand, but vulnerable by machine.

Yes, it is breakable by machine, but it is easy to implement with just
pencil and paper.

If I wanted an easy-to-implement computer cypher I would pick RC4,
which is simple enough to be coded from memory.
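
From memory, the whole algorithm is about this much code. A sketch for
illustration only; RC4 has known weaknesses and shouldn't be used
where real security matters:

#include <cstdint>
#include <cstdio>
#include <string>
#include <utility>
#include <vector>

// RC4: key scheduling (KSA) followed by the keystream generator
// (PRGA), XORed against the data. Encrypting and decrypting are the
// same operation.
std::vector<uint8_t> rc4(const std::string& key, const std::string& data) {
    uint8_t S[256];
    for (int i = 0; i < 256; ++i) S[i] = i;
    for (int i = 0, j = 0; i < 256; ++i) {  // KSA
        j = (j + S[i] + static_cast<uint8_t>(key[i % key.size()])) & 0xFF;
        std::swap(S[i], S[j]);
    }
    std::vector<uint8_t> out;
    int i = 0, j = 0;
    for (char c : data) {                   // PRGA + XOR
        i = (i + 1) & 0xFF;
        j = (j + S[i]) & 0xFF;
        std::swap(S[i], S[j]);
        out.push_back(c ^ S[(S[i] + S[j]) & 0xFF]);
    }
    return out;
}

int main() {
    for (uint8_t b : rc4("Key", "Plaintext"))
        std::printf("%02X", b);             // BBF316E8D940AF0AD3
    std::printf("\n");
}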

rossum

From: John Nagle on
On 7/20/2010 10:22 AM, Fiziwig wrote:
> Just for the fun of it (I'm retired with nothing better to do with my
> time :) I've created a new code (not cipher) based on a variation of
> Huffman coding but using the Roman alphabet instead of binary bits to
> encode English words instead of bytes of data. The code words are
> variable length, but self-segregating so decoding is unambiguous in
> all cases.

Some special-purpose handheld devices used to compress text with
a codebook. This allowed low-end devices to store an entire Bible
or encyclopedia.

I once encountered a handheld "Bible" device with an off-by-one
error in the codebook. Long words were being replaced by different
words that were alphabetically adjacent to the correct word. Very
strange.
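
Presumably the decoder was doing something like this (a hypothetical
reconstruction, not the device's actual code):

#include <iostream>
#include <string>
#include <vector>

int main() {
    // A slice of a sorted codebook; long words are stored once and
    // referenced by index.
    std::vector<std::string> codebook = {
        "righteous", "righteousness", "rightly"};
    int code = 1;                             // encoder meant "righteousness"
    std::cout << codebook[code] << "\n";      // correct: righteousness
    std::cout << codebook[code - 1] << "\n";  // off-by-one: righteous
}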

John Nagle
From: Fiziwig on
On Jul 22, 5:09 am, Paulo Marques <pmarq...(a)grupopie.com> wrote:

> If you need help building the Huffman-optimized codebook, please send
> me the frequency table you're using and I'll try to help you out. I'm
> curious what kind of improvement in the average letters-per-word value
> the pure Huffman will give us.

Having never programmed the Huffman algorithm, I approached the
project with some trepidation. As it turned out, it was a piece of
cake. I used C++. My ONLY bugs were a couple of simple typos and one
uninitialized node pointer in the tree. Altogether it took me about an
hour to have it running, and another hour of fiddling with the format
of the output till I was happy with it. The only adaptation I had to
make to the basic Huffman algorithm was to use an alphabet of 26
letters instead of an alphabet of 2. So instead of having two child
nodes, each parent node had a single list under it with 26 linked
entries.
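
Stripped to its core, the program looks something like the sketch
below. (A simplified reconstruction, not my exact code, and the
frequency numbers are stand-ins. One wrinkle worth noting: for a
26-ary tree to come out full, the leaf count must satisfy
(leaves - 1) mod 25 == 0, so the table is padded with zero-frequency
dummies first.)

#include <algorithm>
#include <iostream>
#include <map>
#include <queue>
#include <string>
#include <vector>

struct Node {
    long freq;
    std::string word;             // non-empty for leaves only
    std::vector<Node*> children;  // up to 26 children per internal node
};

struct ByFreq {  // min-heap ordering on frequency
    bool operator()(const Node* a, const Node* b) const {
        return a->freq > b->freq;
    }
};

// Walk the tree, appending one letter A..Z per branch taken.
void emitCodes(const Node* n, const std::string& prefix,
               std::map<std::string, std::string>& codes) {
    if (n->children.empty()) {
        if (!n->word.empty()) codes[n->word] = prefix;
        return;
    }
    for (size_t i = 0; i < n->children.size(); ++i)
        emitCodes(n->children[i], prefix + char('A' + i), codes);
}

int main() {
    // Stand-in for the real Brown Corpus frequency table.
    std::map<std::string, long> freq = {
        {"the", 69971}, {"of", 36412}, {"and", 28853},
        {"to", 26158},  {"a", 23308},  {"in", 21341}};

    std::priority_queue<Node*, std::vector<Node*>, ByFreq> pq;
    for (const auto& wf : freq) pq.push(new Node{wf.second, wf.first, {}});

    // Pad so that (leafCount - 1) % 25 == 0 and the tree is full.
    while ((pq.size() - 1) % 25 != 0) pq.push(new Node{0, "", {}});

    // Standard Huffman merge, grouping 26 nodes at a time instead of 2.
    while (pq.size() > 1) {
        Node* parent = new Node{0, "", {}};
        for (int i = 0; i < 26 && !pq.empty(); ++i) {
            Node* c = pq.top();
            pq.pop();
            parent->freq += c->freq;
            parent->children.push_back(c);
        }
        // Most frequent subtree gets the letter 'A', and so on.
        std::sort(parent->children.begin(), parent->children.end(),
                  [](const Node* a, const Node* b) {
                      return a->freq > b->freq;
                  });
        pq.push(parent);
    }

    std::map<std::string, std::string> codes;
    emitCodes(pq.top(), "", codes);
    for (const auto& wc : codes)
        std::cout << wc.first << " -> " << wc.second << "\n";
}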

If you are curious, the whole completed tree is available at
http://fiziwig.com/temp_huff.txt

Here's the sentence I used as a sample on my web page:

Here is a sample of the code used to encrypt this very sentence.
IH CL CG OYM CD CC WUG LA CF GXZU DE HO UXS  <-- original hand-made code = 31 letters
AR PO E HXD I M CSF PXE F CSF ON CP DDI  <-- genuine Huffman algorithm = 27 letters = 13% less space

(Note: My word list did not originally include "encrypt", so I cheated
and used the code for "code" again. I had added "encrypt" by hand to
the first codebook, at the end of the already-assigned codes. That's
not so easy to do with a Huffman-coded book.)

Note that with the genuine Huffman algorithm the sentence had 4 words
that used one-letter codes. My hand-made codebook had no single-letter
codes.

If I encoded the entire 1-million-plus-word Brown Corpus, my results
would be:

Total words:              1,075,609
Total letters:            2,476,824
Average letters per word: 2.30

That is noticeably better than my hand-made code, which worked out at
3.35 letters per word; the Huffman version uses about 31% fewer
letters.

As a code for a living language there also needs to be room for
expansion within the coding scheme, without breaking the already-coded
words. That would mean salting the original word file with a lot of
nulls that could later be replaced by real words as the need arose.
I'm not sure how that would affect the efficiency of the coding. My
hand-made system left expansion gaps: for example, all codes ending in
"T", regardless of length, were unassigned and set aside for future
expansion.
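
One way to leave such gaps in a Huffman build might be to salt the
frequency table with low-frequency placeholder entries before
constructing the tree; they settle onto long codes that can later be
reassigned to real words. A guess at how that could look, not
something I've tested:

#include <iostream>
#include <map>
#include <string>

int main() {
    // Stand-in frequencies; the real table would come from the corpus.
    std::map<std::string, long> freq = {{"the", 69971}, {"of", 36412}};
    // Null placeholders: frequency 1 pushes them onto the longest
    // codes, reserving code space for future vocabulary.
    for (int i = 0; i < 500; ++i)
        freq["<null-" + std::to_string(i) + ">"] = 1;
    std::cout << freq.size() << " entries, including reserved slots\n";
}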

I also noticed that the Huffman code reached deeper into the alphabet
to do its coding. The highest code is PZZZ, leaving around 176,000
codes available for expansion. In my hand-made codebook the highest
code I used was GYVF, with about 335,000 available for expansion. Both
are perfectly adequate.

I have a hunch I could factor out morphemes from the word list and cut
it in half or better. That way words like "unfortunately" would be
coded as separate morphemes: "un + fortunate + ly". Of course that
would mean a minimum of 4 letters for that word (assuming at least 2
for "fortunate"), while the present Huffman code for "unfortunately"
is "AUN". So I'm not sure morpheme factoring would help compression,
although it would certainly help dictionary size.

I do like the genuine Huffman tree, though. It's clean and efficient.
Thanks for your help and suggestions.

I think next I'll play with alternating consonants and vowels so every
code word is pronounceable. :)

--gary