From: Mac McTeague on
First let me say that I am not a computer science student, so I am
not trying to get help with my homework. I am an IT professional who
supports unix, linux and some windows, mostly at the enterprise
server level. While I am between jobs, I decided to come up with
some projects to help improve my scripting skills. While working on
the one I will describe in a moment, I came to suspect that my
scripting approach may be inadequate for the task. I am just curious
how people with more scripting and programming knowledge would
approach the problem. What tools would they use? How would they make
it more efficient and correct?

The idea was to create a script that would automatically extract
from a text the terms that are likely to be unknown to the average
(college-educated) reader, that is, terms that might need to be
added to a glossary or vocabulary list. So I got a list of the
10,000 most common words in English. Then I got a list of the
20,000 most common words appearing in all Project Gutenberg texts.
After removing duplicates I ended up with a list of about 19,000
words. Since that is a fairly large list, I am going to assume that
if a word appears in a text file but is not on the list, then many
people may not know it, or it may be technically specific, and it
might therefore need to be put into a glossary and explained.

Now I can certainly write a basic loop that checks whether each word
in a text file is also in the big word list, and if not, prints it
to a separate file. But here is the problem: a typical book might
have 50,000 or more words, and the list I have is almost 20,000
words, so that could mean on the order of a billion string
comparisons just to extract the candidate glossary words. That
seems rather a lot for grep to handle, which was my original plan.
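
To make that concrete, the basic loop I have in mind is something
like this (file names here are just placeholders; book_words.txt is
the book's words one per line, wordlist.txt is the big list):

    # Naive approach: scan the whole word list once per book word.
    while read -r word; do
        grep -q -i -x -F "$word" wordlist.txt || printf '%s\n' "$word"
    done < book_words.txt > candidates.txt

That is the kind of thing I mean, and it clearly rescans the list
for every word, which is where the huge comparison count comes from.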

So I am just curious what approach actual programmers, which I am
not, might take to a problem of this type. Perhaps I should turn
the big word list into a database, or perhaps some sort of hash
lookup in perl would be better. I just don't think my original idea
of doing it as a shell script is sufficient given the number of
comparisons that need to be made; the script might take all day to
run. So what would be a more effective approach? Thanks for any
opinions and advice.
From: Moi on
On Sun, 16 May 2010 10:36:29 -0700, Mac McTeague wrote:


[snip]

>
> Now I can certainly write a basic loop that checks whether each word
> in a text file is also in the big word list, and if not, prints it
> to a separate file. But here is the problem: a typical book might
> have 50,000 or more words, and the list I have is almost 20,000
> words, so that could mean on the order of a billion string
> comparisons just to extract the candidate glossary words. That
> seems rather a lot for grep to handle, which was my original plan.
>
> So I am just curious what approach actual programmers, which I am
> not, might take to a problem of this type. Perhaps I should turn
> the big word list into a database, or perhaps some sort of hash
> lookup in perl would be better. I just don't think my original idea
> of doing it as a shell script is sufficient given the number of
> comparisons that need to be made; the script might take all day to
> run. So what would be a more effective approach? Thanks for any
> opinions and advice.

Hashing is the way to go.

A Patricia tree or trie would be another choice, or skip lists.

If you hash N items into N slots (using chaining), you'll need about
1.5 comparisons per item. A well-tuned algorithm will effectively run
at the speed of the disk I/O.

I don't know exactly what perl does; it might add some overhead when
it resizes its hash table, invokes GC, etc.
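
If you want to stay in the shell, note that awk's associative arrays
are hash tables under the hood. A minimal sketch (file names are
assumptions, and the word-splitting is deliberately crude):

    # Reduce the book to one lower-case word per line, then print the
    # words that are not keys in the hash built from the known-word list.
    tr -cs 'A-Za-z' '\n' < book.txt | tr 'A-Z' 'a-z' | sort -u |
    awk 'NR==FNR { known[tolower($1)]; next } NF && !($1 in known)' \
        wordlist.txt - > candidates.txt

That loads the ~20,000 known words once and then does a single hash
lookup per distinct word in the book, so the run time is dominated
by reading the files.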

HTH,
AvK
From: Ben Bacarisse on
Mac McTeague <mcteague409(a)gmail.com> writes:
<snip>
> The idea was to create a script that would automatically extract
> from a text the terms that are likely to be unknown to the average
> (college-educated) reader, that is, terms that might need to be
> added to a glossary or vocabulary list. So I got a list of the
> 10,000 most common words in English. Then I got a list of the
> 20,000 most common words appearing in all Project Gutenberg texts.
> After removing duplicates I ended up with a list of about 19,000
> words. Since that is a fairly large list, I am going to assume that
> if a word appears in a text file but is not on the list, then many
> people may not know it, or it may be technically specific, and it
> might therefore need to be put into a glossary and explained.

It can certainly be done with a script; in fact, it is a one-liner.
It is fun to work these things out, so I won't show you my solution
unless you ask.

Use man to read up on tr, sort, uniq and comm.
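
If you would rather have a spoiler than the puzzle, the general
shape (not necessarily the one-liner I have in mind, and assuming
you are happy to lower-case everything and treat anything
non-alphabetic as a word separator) is along these lines:

    # One lower-case word per line, sorted and de-duplicated.
    tr -cs 'A-Za-z' '\n' < book.txt | tr 'A-Z' 'a-z' | sort -u > book_words.txt
    tr 'A-Z' 'a-z' < wordlist.txt | sort -u > known_words.txt
    # Lines that appear only in the first file are the glossary candidates.
    comm -23 book_words.txt known_words.txt > candidates.txt

comm makes a single merge pass over the two sorted files, so the
whole thing is effectively limited by the sorts.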

--
Ben.
From: Keith Thompson on
Mac McTeague <mcteague409(a)gmail.com> writes:
[...]
> The idea was to create a script that would automatically extract
> from a text the terms that are likely to be unknown to the average
> (college-educated) reader, that is, terms that might need to be
> added to a glossary or vocabulary list. So I got a list of the
> 10,000 most common words in English. Then I got a list of the
> 20,000 most common words appearing in all Project Gutenberg texts.
> After removing duplicates I ended up with a list of about 19,000
> words. Since that is a fairly large list, I am going to assume that
> if a word appears in a text file but is not on the list, then many
> people may not know it, or it may be technically specific, and it
> might therefore need to be put into a glossary and explained.
[...]

What you're doing is almost identical to the Unix "spell" command.
I've read an old paper describing its extremely clever design,
but a couple of minutes of Googling haven't turned it up.
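
If your system still ships it, the interface is about as close to
what you want as it gets (the file name is just a placeholder):

    # Prints the words spell does not find in its own word list.
    spell book.txt > candidates.txt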

--
Keith Thompson (The_Other_Keith) kst-u(a)mib.org <http://www.ghoti.net/~kst>
Nokia
"We must do something. This is something. Therefore, we must do this."
-- Antony Jay and Jonathan Lynn, "Yes Minister"