From: MrD on
Fiziwig wrote:
>
> Total Words 1,075,609
> Total Letters 2,476,824
> Average letters per word 2.30
>
> This is not that much better than my hand made code that worked out
> at 3.35 letters per word.

30% smaller is nothing to be sneezed at.
>
> As a code for a living language there also needs to be room for
> expansion within the coding scheme, without breaking the already
> coded words, which would mean salting the original word file with a
> lot of nulls that could later be replaced by real words as that
> became necessary.

Have you said how many *distinct* words there are in your corpus? Sorry
if you've already said, I couldn't find it.

I believe Dr. Seuss deliberately chose a lexicon of just 300 words; that
one can get by in English with about 1,000 words; most native English
speakers manage all their lives with a lexicon of about 3,000 words, and
that Winston Churchill commanded a lexicon of about 12,000 ("12,000
strong", as adacrypt might say).

How do you handle stemming? E.g. is each form[1] of a verb treated as
a distinct word? I can't see how else you'd do it (without some
complicated grammar rules). So without a stemmer, I suppose the standard
lexicon might grow to about 15,000 and the Churchill lexicon might grow
to about 60,000. The Seuss lexicon probably wouldn't grow that much,
because in addition to his miniature lexicon, Seuss also restricted
himself to a pretty rudimentary grammar (everything is present-tense
indicative, etc.)

Actually I don't have the first clue how much a lexicon shrinks/grows by
as a result of stemming; those numbers are unadulterated guesswork. But
people who work on text-retrieval tools would know the facts.

--
MrD.
[1] MrD is stuck for the right word; he tried "declension", but knew
that wasn't what he meant. MrD thinks Churchill would have known the
right word.
From: Paulo Marques on
Fiziwig wrote:
> [...]
> Having never programmed the Huffman Algorithm, I approached the
> project with some trepidation. As it turned out, it was a piece of
> cake. I used C++. My ONLY bugs were a couple of simple typos and one
> uninitialized node pointer in the tree. Altogether it took me about an
> hour to have it running, and another hour of fiddling with the format
> of the output till I was happy with it. The only adaptation I had to
> make to the basic Huffman algorithm was use an alphabet of 26 letters
> instead of an alphabet of 2 letters. So instead of having two child
> nodes, each parent node had a single list under it with 26 linked
> entries.
>
> If you are curious, the whole completed tree is available at
> http://fiziwig.com/temp_huff.txt

Nice work!

I'm afraid you still have a bug, though :(

When building N-ary Huffman trees you have to make sure the number of
elements fills up an N-branched tree _completely_.

This means that for a 26-ary tree, the number of starting elements needs
to be "25 * N + 26" for some integer N. If it isn't, you need to add
dummy zero-frequency elements to the end of the list before you start
building the tree.

For instance, if you have 12340 elements -> (12340 - 26) mod 25 = 14, so
you need to add (25 - 14) = 11 dummy elements before starting to
collapse the tree. I'm doing this all from memory, so I hope I'm getting
the math right...
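
In C++ terms the padding rule comes down to something like this (a
rough sketch, also from memory, so the names are mine):

#include <cstdio>

// Number of zero-frequency dummy symbols to append so that an n-ary
// Huffman build collapses into a completely filled n-branched tree.
// The total symbol count must be congruent to 1 (mod n-1).
static int dummiesNeeded(int symbols, int arity)
{
    int r = (symbols - 1) % (arity - 1);
    return (r == 0) ? 0 : (arity - 1 - r);
}

int main()
{
    // 12340 symbols in a 26-ary tree -> 11 dummies, as in the example above.
    printf("%d\n", dummiesNeeded(12340, 26));
    return 0;
}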

That is why, in the end, the algorithm only uses letters from A to P as
the first letter instead of going all the way up to Z. So the end
result, after inserting the dummy elements, should be even better than
what you have now.

> [...]
> Note that with the genuine Huffman algorithm the sentence had 4 words
> that used one-letter codes. My hand made codebook had no single-letter
> codes.

That is one thing I was curious about: whether the optimized version
would find words frequent enough to deserve their own single-letter
codes. Thanks for satisfying my curiosity ;)

> If I encoded the entire 1 million+ word Brown Corpus my results would
> be
>
> Total Words 1,075,609
> Total Letters 2,476,824
> Average letters per word 2.30
>
> This is not that much better than my hand made code that worked out at
> 3.35 letters per word.

The total "original letters including spaces" (i.e. roughly the total
size of the corpus) would be a nice statistic too, so that we could
appreciate the compression factor of both methods.

Anyway, as MrD pointed out, 30% is not that bad. I could even bend the
statistics the other way around and say your original method increased
the size by 46% ;)

> [...]
> I also noticed that the Huffman code reached deeper into the alphabet
> to do its coding. The highest code is PZZZ leaving around 176,000
> codes available for expansion. In my hand-made code book the highest
> code I used was GYVF, with about 335,000 available for expansion. Both
> are perfectly adequate.

I think that if you want to have expansion capability you shouldn't
leave it to chance. You can add "expansion symbols" explicitly to the
end of the table to allow for expansion, with some appropriate "expected
frequency" for them.

The number of symbols available for expansion is another thing: if you
don't expect any expansion, but want to "be prepared" for it, you might
even just reserve a single codeword for expansion. Then, whenever that
codeword appears, it means the next N letters describe the expansion.
With N=4 you would already be able to get 26^4 = 456,976 expansion
words, while wasting only a single symbol in the table.
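
Just to make the idea concrete, a rough C++ sketch (the escape codeword
and the table layout are completely made up):

#include <string>
#include <unordered_map>

// Sketch of the single-escape-symbol idea (all names are hypothetical).
// Words in the codebook get their normal code; anything else is sent as
// ESCAPE_CODE followed by a fixed 4-letter index, giving 26^4 = 456,976
// expansion slots for the cost of one reserved codeword.
using Codebook = std::unordered_map<std::string, std::string>;

static const std::string ESCAPE_CODE = "QZ";   // assumed reserved codeword
static const int EXPANSION_LETTERS = 4;

std::string encodeWord(const std::string& word, const Codebook& codebook,
                       std::unordered_map<std::string, int>& expansionIds)
{
    auto it = codebook.find(word);
    if (it != codebook.end())
        return it->second;

    // Give an unseen word the next free expansion slot (or reuse its slot).
    int id = expansionIds.emplace(word, (int)expansionIds.size()).first->second;

    // Spell the slot number as 4 base-26 letters, most significant first.
    std::string suffix(EXPANSION_LETTERS, 'A');
    for (int i = EXPANSION_LETTERS - 1; i >= 0; --i) {
        suffix[i] = (char)('A' + id % 26);
        id /= 26;
    }
    return ESCAPE_CODE + suffix;
}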

> I have a hunch I could factor out morphemes from the word list and cut
> it in half or better. That way words like "unfortunately" would be
> coded as separate morphemes: "un + fortunate + ly" Of course that
> would mean a minimum of 4 letters for that word (assuming at least 2
> for "fortunate"), while the present Huffman code for "unfortunately"
> is "AUN". So I'm not sure morpheme factoring would help compression,
> although it would certainly help dictionary size.
>
> I do like the genuine Huffman tree, though. It's clean and efficient.
> Thanks for your help and suggestions.

You're welcome.

> I think next I'll play with alternating consonants and vowels so every
> code word is pronounceable. :)

It will certainly be bigger, though ;)
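
A consonant+vowel pair only has 21 * 5 = 105 possibilities, versus
26 * 26 = 676 for two unrestricted letters, so the codes have to
stretch to compensate. A toy illustration (purely hypothetical, not
tied to either code table):

#include <string>

// Each syllable is one of 21 consonants followed by one of 5 vowels,
// i.e. 105 possibilities per two letters, versus 676 for two
// unrestricted letters.
static const std::string CONSONANTS = "BCDFGHJKLMNPQRSTVWXYZ";
static const std::string VOWELS     = "AEIOU";

// Turn a word's rank (0, 1, 2, ...) into a pronounceable code: "BA",
// "BE", ..., "ZU", then two-syllable codes such as "BABE" (base-105
// digits, least significant syllable first).
std::string pronounceableCode(unsigned rank)
{
    std::string code;
    do {
        unsigned syllable = rank % 105;
        code += CONSONANTS[syllable / 5];
        code += VOWELS[syllable % 5];
        rank /= 105;
    } while (rank > 0);
    return code;
}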

--
Paulo Marques - www.grupopie.com

"...so she told me it was either her or the ham radio, over."
From: Greg Rose on
In article <xs-dndaPSY2iF9TRnZ2dnUVZ8nudnZ2d(a)novis.pt>,
Paulo Marques <pmarques(a)grupopie.com> wrote:
>The total "original letters including spaces" (i.e. roughly the total
>size of the corpus) would be a nice statistic too, so that we could
>appreciate the compression factor of both methods.

Slightly out of context, I know, but I think it is
worth mentioning that with Huffman coding you can
also leave out the spaces. As you traverse the
tree, arriving at a leaf determines the end of the
word. That would make pronounceability a bit
tougher though.
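
A minimal sketch of that decoding walk, assuming a 26-child node along
the lines Fiziwig described (the types and names here are only
illustrative):

#include <string>
#include <vector>

// Hypothetical node type for the 26-ary code tree: a leaf carries a
// word, an internal node has up to 26 children indexed by letter.
struct Node {
    std::string word;              // non-empty only at a leaf
    Node* child[26] = {nullptr};
};

// Decode an unbroken letter stream: every time the walk reaches a leaf
// we have a complete codeword, so no spaces are needed between words.
std::vector<std::string> decode(const Node* root, const std::string& stream)
{
    std::vector<std::string> words;
    const Node* cur = root;
    for (char c : stream) {
        cur = cur->child[c - 'A'];
        if (!cur)
            break;                  // malformed stream
        if (!cur->word.empty()) {   // leaf reached: word boundary
            words.push_back(cur->word);
            cur = root;
        }
    }
    return words;
}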

Greg.
--
From: Fiziwig on
On Jul 23, 12:50 am, MrD <mrdemean...(a)jackpot.invalid> wrote:

>
> Have you said how many *distinct* words there are in your corpus? Sorry
> if you've already said, I couldn't find it.

No, I haven't worked that out. It would be nice to compress out the
trivial duplicates like plurals. What's the sense of having both
"apple" and "apples" as separate entries?

> I believe Dr. Seuss deliberately chose a lexicon of just 300 words; that
> one can get by in English with about 1,000 words; most native English
> speakers manage all their lives with a lexicon of about 3,000 words, and
> that Winston Churchill commanded a lexicon of about 12,000 ("12,000
> strong", as adacrypt might say).

Just this morning I found a nifty list of the 2,000 "most essential"
English words, and it even includes frequency data! http://jbauman.com/gsl.html


> How do you handle stemming? E.g. is each form[1] of a verb treated as
> a distinct word? I can't see how else you'd do it (without some
> complicated grammar rules). So without a stemmer, I suppose the standard
> lexicon might grow to about 15,000 and the Churchill lexicon might grow
> to about 60,000. The Seuss lexicon probably wouldn't grow that much,
> because in addition to his miniature lexicon, Seuss also restricted
> himself to a pretty rudimentary grammar (everything is present-tense
> indicative, etc.)

I haven't addressed that yet. First I need to enumerate all the
various ways that one word can be derived from another. (quick,
quickly, quicker, quickest, quickness, quickosity,
quickability, ... :)

> Actually I don't have the first clue how much a lexicon shrinks/grows by
> as a result of stemming; those numbers are unadulterated guesswork. But
> people who work on text-retrieval tools would know the facts.

I don't know either. It would be an interesting topic to explore. But
then, we're getting really far afield from cryptography now.

--gary

From: Fiziwig on
On Jul 23, 5:25 am, Paulo Marques <pmarq...(a)grupopie.com> wrote:

>
> I'm afraid you still have a bug, though :(
>
> When building N-ary Huffman trees you have to make sure the number of
> elements fills up an N-branched tree _completely_.
>
> This means that for a 26-ary tree, the number of starting elements needs
> to be "25 * N + 26" for some integer N. If it isn't, you need to add
> dummy zero-frequency elements to the end of the list before you start
> building the tree.

I looked it up. The number of starting elements needs to be congruent
to 1 mod n-1, so it has to be of the form 25X + 1.

I had 9315 words, so I added 11 DUMMY words with 0 probability to make
9326. The results were strange.

Total Words 1075617
Total Letters 1246124
Average letters per word 1.16

The program claims an average of 1.16 letters per word, but that
seemed suspiciously low to me so I looked at the numbers.

The top 14 words get 1-letter codes, and those top 14 words make up 26%
of a typical English text. The next 225 words get 2-letter codes, and
those top 239 words together make up 62% of a typical English text. So
with 26% * 1 + 36% * 2, and the remaining 38% being mostly 3-letter
codes, that works out to roughly 2.12 letters per word (approximate,
since I didn't account for the few 4-letter codes). So I suspect
there's something wrong with the calculation done by the program,
although I don't see what could be wrong. It's simply the total number
of code letters used divided by the total of all word frequencies, and
my original corpus was slightly more than a million words, so that
denominator looks right. I'll double-check it.
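
The double check amounts to something like the following (a simplified
sketch, not the actual program's data structures):

#include <string>
#include <vector>

// Simplified per-word record: corpus frequency plus the assigned code.
struct Entry {
    std::string code;
    long long   freq;
};

// Average letters per word = total code letters emitted over the whole
// corpus divided by total word occurrences. Dummy entries have
// freq == 0, so they cannot skew the result; if this disagrees with the
// reported 1.16, the problem is in how the two totals are accumulated
// rather than in the tree itself.
double averageLettersPerWord(const std::vector<Entry>& table)
{
    long long letters = 0, words = 0;
    for (const Entry& e : table) {
        letters += (long long)e.code.size() * e.freq;
        words   += e.freq;
    }
    return words ? (double)letters / (double)words : 0.0;
}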

Even so, 2.12 is about 8% better than the previous 2.30.

Thanks for spotting my oversight.

-gary