From: [Jongware] on

Ben Bacarisse wrote:
> arnuld <sunrise(a)see.sigs.invalid> writes:
>
>> PROBLEM:
>>
>> I want to read words from standard input and then sort and print
>> the words in decreasing order of appearance along with the count
>> of how many times each word occurred. The language I am using C. I
>> have come up with following design idea:

[...]
>> ii.) any other data structure that I can use here ?
>
> Binary search trees and hash tables are the most obvious choices. For
> simplicity, you could try to keep a sorted array of string pointers.
> This will speed up the accesses when a word is read and will give you
> a simple way to print the words at the end.

I wrote exactly that same program, using only a binary tree. Printing
the words was done with a recursive routine -- which is also a pretty
simple way :-)
Despite heavy use of generalized C functions such as strdup, strcmp,
etc., its performance was astounding; a matter of seconds, and most of
this time was input/output. My test file was an ASCII version of the
King James Bible, so now I know how often people used to say "thee" in
those times.

[Jongware]
From: thomas.mertes on
On 6 Mai, 12:36, arnuld <sunr...(a)see.sigs.invalid> wrote:
> PROBLEM:
>
> I want to read words from standard input and then sort and print
> the words in decreasing order of appearance along with the count
> of how many times each word occurred. The language I am using C. I
> have come up with following design idea:
[snip]
From your description I recognized an example program that I
wrote in Seed7 (sorry I did not use C). The only difference is that
it writes a list of words in increasing order followed by the count.
The wordcnt.sd7 example can be found here:
http://seed7.sourceforge.net/examples/wordcnt.htm

Maybe you can get some inspiration from this example how your
problem can be solved with a hash table.

Greetings Thomas Mertes

Seed7 Homepage: http://seed7.sourceforge.net
Seed7 - The extensible programming language: User defined statements
and operators, abstract data types, templates without special
syntax, OO with interfaces and multiple dispatch, statically typed,
interpreted or compiled, portable, runs under linux/unix/windows.
From: arnuld on
> On Tue, 06 May 2008 19:00:10 -0700, Gene wrote:

> set dictionary D empty
> for each word W in the input
> look up [W, Count] in D
> if lookup succeeds, increment Count
> else /*lookup failed*/ insert [W,1] in D
> end for
> Print the pairs in the dictionary in sorted order

> For the dictionary, there are a couple of implementation choices. A
> balanced tree like AVL or Red-Black trees is a nice choice because you
> can print in sorted order merely by traversing in order. No sorting
> needed. You might look at libavl. I've never tried it, but it looks
> like an extremely high quality C library for balanced trees. Of
> course if you are trying to learn, you may want to roll your own.

> Another approach is a hash table, which makes the lookups and
> insertions faster, but requires a separate sorting pass.



I really don't understand the difference between a Hash-Table and a Binary
Tree because I never came across them. I want to understand and implement
them but when I tried to read it using "Introduction to Algorithms -
Corment et al. 2/e", I see he is using some weired words like "satellite
data"

I don't get it. In order to print the words in decreasing order of their
occurrences and doing it efficiently in C requires me to understand
satellite data and a thing called Big-O ?




> This is a perfect example of a problem that's best solved with a
> programming language designed to solve it. Perl is such a language.
> Here's a solution in perl.
>
> sub main {
> local $/ = undef;
> my %counts;
> foreach my $word (split(/\s+/, <>)) { $counts{$word}++ }
> print map { "$_->[0] $_->[1]\n" }
> sort { $b->[1] <=> $a->[1] }
> map { [$_, $counts{$_}] } keys %counts;
> }
> main;

only 9 lines of code, pretty impressive :)




--
http://lispmachine.wordpress.com/
my email ID is @ the above blog.
just check the "About Myself" page :)

From: arnuld on
> On Tue, 06 May 2008 15:32:58 +0200, [Jongware] wrote:



> I wrote exactly that same program, using only a binary tree. Printing
> the words was done with a recursive routine -- which is also a pretty
> simple way :-)
> Despite heavy use of generalized C functions such as strdup, strcmp,
> etc., its performance was astounding; a matter of seconds, and most of
> this time was input/output. My test file was an ASCII version of the
> King James Bible, so now I know how often people used to say "thee" in
> those times.


I have written the a program that is *nearly* like it before I even posted
my problem but that program stores and prints the words alphabetically and
it uses a binary tree for that.

But now I have to sort and print the input by the number of occurrences in
the input which is not known in advance ( unlike alphabetical sorting
where you already know p < s )


any ideas that you can give me from your code ?




--
http://lispmachine.wordpress.com/
my email ID is @ the above blog.
just check the "About Myself" page :)

From: Robert Maas, http://tinyurl.com/uh3t on
> From: arnuld <sunr...(a)see.sigs.invalid>
> PROBLEM:
> I want to read words from standard input and then sort and print
> the words in decreasing order of appearance along with the count
> of how many times each word occurred.

Here's how I do it:
- Load all the words into a linked list, never mind duplicates at this point.
- Sort the list per alphabetical order, thereby putting duplicate
words into blocks of adjacent positions within the list.
- Traverse the list once, merging adjacent duplicates into single entries.
- Sort the list per descending count, so that it'll be ready to output.
Of course merge-sort is the sorting algorithm used.

> The language I am using C.

I use Lisp, where merge-sort is built-in.

In C you need to find a library for building linked lists and for
doing merge-sort on such lists. Of course it's trivial to build the
linked lists, but it does no good if the way you build them isn't
compatible with the merge-sort algorithm you use, so you might as
well get the merge-sort package first and the either use a
list-builder within it or hand-code your own list-builder to
produce what the merge-sort package expects as input.
Or you can write merge-sort from scratch. It's not really difficult.
The trick is to a pre-pass where you convert a list of N records
into a list of N single-element lists of records. I.e. convert the
list ("hello" "this" "is" "a" "test") into the list-of-lists
(("hello") ("this") ("is") ("a") ("test")) which the main
merge-sort algorithm will interpret as N runs of length 1 which all
need to be merged into a single run. Since all initial runs are of
the same length, there's no need to compute Huffman-optimum merges,
just put all the runs into a revolving FIFO queue and repeat
pop-pop-merge-push until the queue is of length exactly 1.

> I have come up with following design idea:
> 1) Create a singly-linked list and store each word and the count
> in that structure. when the user enters a word which is already
> seen before, then we will just increase the count. That way we will
> have a list of unique words.

How does it know whether it's seen a word before or not? If it's
seen before, how does it know where to find it in the list? It seems to
me it takes order(N) per word to search the list, hence order(N**2) total.
Unless you have a very small amount of data, that is too slow.
Merge-sort is order(n*log(n)) total, much better for large datasets.
I'm not talking about premature micro-optimization. I'm talking
about using a decent algorithm that gives decent performance,
orders of magnitude faster than what you propose.

> I want this program to be fast and consume less memory.

That's not possible. You need to have a reasonable trade-off.
What's more important? Cramming into 2 kilobytes RAM but it takes
two hours to run, or using 2 megabytes RAM and it runs in a few seconds?
You have more than 2 megabytes on your machine, right?
What's a couple megabytes needed to run merge-sort?

Now if you have a really huge dataset, such that the linked list in
RAM is larger than all the RAM you have, so that it thrashes
virtual memory, then you need to modify the above algorithm
slightly to avoid thrashing. Let me know if that is true of your
situation and I'll give you an idea what modification you'll need.
But if the total time thrashing is less than 5 minutes, don't
bother changing the algorithm, because it'll take longer than 5
minutes to modify the code to avoid thrashing.
First  |  Prev  |  Next  |  Last
Pages: 1 2 3
Prev: string splitting
Next: Application for web log analysis