From: Paulo Marques on
Fiziwig wrote:
> On Jul 23, 5:25 am, Paulo Marques <pmarq...(a)grupopie.com> wrote:
>[...]
>> This means that for a 26-ary tree, the number of starting elements needs
>> to be "25 * N + 26" for some integer N. If it's not, you need to add
>> dummy zero-frequency elements to the end of the tree before you start
>> building it.
>
> I looked it up. The number of starting elements needs to be congruent
> to 1 mod n-1, so it has to be of the form 25X + 1.

My math didn't fail me too much, then :)
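
For what it's worth, the padding rule can be sketched in a few lines. This
is only an illustration in Python, not code from anyone's actual program;
the name dummy_count is made up for the example.

```python
def dummy_count(num_symbols: int, arity: int = 26) -> int:
    """Number of zero-frequency dummies to add so that the padded
    symbol count is congruent to 1 mod (arity - 1)."""
    if num_symbols < 1 or arity < 2:
        raise ValueError("need at least one symbol and arity >= 2")
    # Python's % always returns a non-negative result for a positive
    # modulus, so this is the smallest padding that satisfies the rule.
    return (1 - num_symbols) % (arity - 1)
```

With 27 starting elements, dummy_count(27) gives 24, for a padded total of
51 = 25 * 2 + 1. With exactly 26 elements no padding is needed.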

>[...]
> Although I don't see
> what could be wrong. It's simply the total number of code letters used
> divided by the total of all word frequencies, and my original corpus
> was slightly more than a million words, so that number looks right.
> I'll double check it.

It should be something like: sum_for_all_words(frequency * code_letters)
/ sum_for_all_words(frequency). I.e. the total number of letters used to
encode the corpus divided by the total number of words. This should give
the average letters per word used to encode the complete corpus.
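
As a rough sketch of that formula (in Python, and assuming the word list is
available as (frequency, code) pairs, which is my assumption and not
necessarily how your program stores it):

```python
def average_letters_per_word(words):
    """words: iterable of (frequency, code) pairs."""
    total_letters = 0
    total_words = 0
    for frequency, code in words:
        total_letters += frequency * len(code)  # letters spent on this word
        total_words += frequency                # occurrences of this word
    if total_words == 0:
        raise ValueError("corpus is empty")
    return total_letters / total_words


# Made-up frequencies and codes, purely for illustration.
sample = [(5, "A"), (2, "BA"), (1, "BB")]
```

For the made-up sample this is (5*1 + 2*2 + 1*2) / (5 + 2 + 1) = 11 / 8.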

If you need help debugging the code, you can send it to me privately.
I'm usually good at spotting other people's bugs. I just wish I could
use that superpower for my own programs :(

--
Paulo Marques - www.grupopie.com

"C++ : increment the value of C and use the old one"
From: Fiziwig on
On Jul 23, 11:06 am, Paulo Marques <pmarq...(a)grupopie.com> wrote:

>
> It should be something like: sum_for_all_words(frequency * code_letters)
> / sum_for_all_words(frequency). I.e. the total number of letters used to
> encode the corpus divided by the total number of words. This should give
> the average letters per word used to encode the complete corpus.
>
> If you need help debugging the code, you can send it to me privately.
> I'm usually good at spotting other people's bugs. I just wish I could
> use that superpower for my own programs :(

I found it. In my recursive display function I had misplaced one line of
code, so I was taking "strlen( tag )* lpScan->weight" before appending
the final letter for this branch to the tag. As a result, every tag was
counted as one letter shorter than it really was. I fixed that and got:

Total Words 1075617
Total Letters 2321741
Average letters per word 2.16

So overall, Huffman gets 2.16 vs my hand-made 2.35, an improvement of about 8%.
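
For anyone following along, here is a small sketch of that kind of
off-by-one. It is written in Python rather than the C of my actual program,
and the Node class, the total_letters function and the buggy flag are all
invented for the illustration, so don't read it as the real code.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    weight: int = 0                               # word frequency (leaves only)
    children: dict = field(default_factory=dict)  # branch letter -> child Node


def total_letters(node, tag="", buggy=False):
    """Sum of (code length * weight) over all leaves below node."""
    total = 0
    for letter, child in node.children.items():
        if child.children:
            # Internal node: descend with the branch letter appended.
            total += total_letters(child, tag + letter, buggy)
        elif buggy:
            # The mistake: length taken before the branch letter is appended.
            total += len(tag) * child.weight
        else:
            total += len(tag + letter) * child.weight
    return total


# Tiny made-up tree: codes A (weight 5), BA (weight 2), BB (weight 1).
tree = Node(children={
    "A": Node(weight=5),
    "B": Node(children={"A": Node(weight=2), "B": Node(weight=1)}),
})
```

On this toy tree the correct total is 11 and the buggy one is 3. The
difference, 8, is exactly the total weight, which is why the average came
out one letter per word too low.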

But more important, I learned a lot by doing this exercise. :)

BTW: as an alternative for making pronounceable codes, I discovered
the best approach is to build codes purely out of consonants, and then
add any old vowels you please when you use the codes. The human ear is
better at picking harmonious vowels than any program could be. So PTN
could be pronounced "patuma", or "aputiamu", or whatever you like,
without disturbing the self-segregating property. Adding a few rules
like "X" = "sh", "C" = "ch", and "Q" = "th" makes even oddballs like XQ
and CCN easy: "shathu", "chachani". Move over Apache Code Talkers. You
have met your match. :)
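
A toy sketch of the idea, in Python. It only covers the simplest case of
one vowel after each consonant, whereas a human speaker is free to put
vowels anywhere, and the names DIGRAPHS and pronounce are made up for this
example.

```python
# Spelling rules from the post: X = sh, C = ch, Q = th.
DIGRAPHS = {"X": "sh", "C": "ch", "Q": "th"}


def pronounce(code: str, vowels: str) -> str:
    """Follow each consonant of code with the matching vowel from vowels."""
    if len(code) != len(vowels):
        raise ValueError("need exactly one vowel per consonant")
    return "".join(
        DIGRAPHS.get(consonant, consonant.lower()) + vowel
        for consonant, vowel in zip(code, vowels)
    )
```

So pronounce("XQ", "au") gives "shathu" and pronounce("CCN", "aai") gives
"chachani". Dropping the vowels again recovers the consonant code.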

--gary
From: MrD on
Fiziwig wrote:
> So PTN could be pronounced "patuma", or "aputiamu", or whatever you
> like,

More like "potion" or "patina" or "epitonia", but not "Opountia". If I
understand you correctly.

--
MrD.
From: rossum on
On Fri, 23 Jul 2010 12:08:08 -0700 (PDT), Fiziwig <fiziwig(a)gmail.com>
wrote:

>Move over Apache Code Talkers.
I thought they were Navaho Code Talkers, or were the Apache used as
well?

rossum

From: Fiziwig on
On Jul 24, 5:01 am, rossum <rossu...(a)coldmail.com> wrote:
> On Fri, 23 Jul 2010 12:08:08 -0700 (PDT), Fiziwig <fizi...(a)gmail.com>
> wrote:
>
> >Move over Apache Code Talkers.
>
> I thought they were Navaho Code Talkers, or were the Apache used as
> well?
>
> rossum

I looked it up. I stand corrected. They were Navajo, Cherokee, Choctaw
and Comanche. No Apache. Even at my advanced age I learn something new
every day. :)

--gary