From: Dave B on
On Sunday 4 May 2008 13:26, Janis Papanagnou wrote:

>> Not if you use awk. With awk, you have a linear solution on the number of
>> input records (assuming awk's hash lookups are efficient).
>
> Are you sure awk use hash lookups? Occasionally done efficiency
> analysis makes me suspect that a "standard" awk will use a tree
> representation for associative arrays. The consequence would be
> a O(N logN) complexity for accessing every element individually.

Not sure, but I always thought that that would be a reasonable
implementation of associative arrays (or maps). Now that you mention trees,
however, I'm not so sure anymore, since it's true that implementing maps
with trees has its advantages too (see eg
http://en.wikipedia.org/wiki/Associative_array for a not-too-technical
discussion).

> A full traversal, OTOH, could at least in theory be done in O(N),
> even with a tree representation when additional measures would
> be taken.
>
> I'd be interested whether anyone knows about the implementation
> of associative arrays in the various existing awks. (The POSIX
> document seems not to specify any complexity requirements.)

FWIW, GNU awk (the only awk I have access to the sources ATM) seems to use
hash tables.
According to http://www.gnu.org/manual/gawk/html_node/internals.html,
lookups are performed using the assoc_lookup() function. Looking at the
sources (gawk 3.1.5), the function is in array.c, lines 483-562. The hash
function itself is at lines 331-414 in the same file.

--
D.
From: Dave B on
On Sunday 4 May 2008 19:35, Dave B wrote:

> According to http://www.gnu.org/manual/gawk/html_node/internals.html,

Sorry, the link is

http://www.gnu.org/manual/gawk/html_node/Internals.html

(uppercase I in "Internals.html").

--
D.
From: Stephane CHAZELAS on
2008-05-04, 19:35(+02), Dave B:
[...]
> FWIW, GNU awk (the only awk I have access to the sources ATM) seems to use
> hash tables.
> According to http://www.gnu.org/manual/gawk/html_node/internals.html,
> lookups are performed using the assoc_lookup() function. Looking at the
> sources (gawk 3.1.5), the function is in array.c, lines 483-562. The hash
> function itself is at lines 331-414 in the same file.
[..]

From reading the code in the heirloom toolchest, looks like the
original awk from 1979 (and nawk) by A, W and K also used hash
tables.

So does mawk.

--
St�phane
From: Janis Papanagnou on
Stephane CHAZELAS wrote:
> 2008-05-04, 19:35(+02), Dave B:
> [...]
>
>>FWIW, GNU awk (the only awk I have access to the sources ATM) seems to use
>>hash tables.
>>According to http://www.gnu.org/manual/gawk/html_node/internals.html,
>>lookups are performed using the assoc_lookup() function. Looking at the
>>sources (gawk 3.1.5), the function is in array.c, lines 483-562. The hash
>>function itself is at lines 331-414 in the same file.
>
> [..]
>
> From reading the code in the heirloom toolchest, looks like the
> original awk from 1979 (and nawk) by A, W and K also used hash
> tables.
>
> So does mawk.
>

And some brief timing tests with gawk seem to confirm linear complexity.

Janis
From: Dan Stromberg on
On Sun, 04 May 2008 10:32:33 +0200, Dave B wrote:

> On Sunday 4 May 2008 04:15, Dan Stromberg wrote:
>
>> On Sat, 03 May 2008 19:26:04 +0200, Dave B wrote:
>>
>>> On Saturday 3 May 2008 18:51, Dan Stromberg wrote:
>>>
>>>> Isn't this what join(1) is for?
>>>
>>> AFAIK join operates on sorted files, which doesn't seem to be the case
>>> here.
>>
>> It seems to me either you accept the nlogn sort, or you end up with an
>> n^2 pair of nested loops, like what the OP had.
>
> Not if you use awk. With awk, you have a linear solution on the number
> of input records (assuming awk's hash lookups are efficient). And,
> furthermore, I think it cannot be assumed that the files can be
> reordered at will.

Whether awk uses hashing or not, I realized shortly after I posted that,
and shortly before I canceled the post, that hashing would be an
interesting 3rd kind of solution.