From: Grant on
On Mon, 26 Oct 2009 07:36:13 -0500, Ed Morton <mortonspam(a)gmail.com> wrote:

>Hongyi Zhao wrote:
>> On Mon, 26 Oct 2009 15:08:37 +1100, Grant
>> <g_r_a_n_t_(a)bugsplatter.id.au> wrote:
>>
>>> I forgot about sorting IPs :(
>>>
>>> sort -t. -n -k1,1 -k2,2 -k3,3 -k4,4 file2 > file2-sorted
>>
>> In my case, whether I do the above sorting process or not, your scipt
>> will take the same time approximately,
>
>But it won't produce the same output. The binary search algorithm relies
>on the input data being sorted.

Sorry for confusion. I made a boo-boo thinking that the file wasn't
sorted 'cos I did a straight sort + diff -- when I did a proper sort
and diff, I discoverd OP's file was already sorted -- thus okay for
the binary search.


The problem now is whether awk can read the raw data file -- my
impression is that the binary data file contains indirect pointers
to the text strings, the java code OP referenced relies on random
file access read the data.

This is a bit tricky to teach an awk script to do? So awk would need
to read the file and build in-memory data structures then dump proper
unix style plain text data file/s in the END block.

Possible? Yes. Easy? Dunno...

Grant.
--
http://bugsplatter.id.au
From: Grant on
On Mon, 26 Oct 2009 08:03:55 -0500, Ed Morton <mortonspam(a)gmail.com> wrote:

>Hongyi Zhao wrote:
>> On Mon, 26 Oct 2009 07:36:13 -0500, Ed Morton <mortonspam(a)gmail.com>
>> wrote:
>>
>>>>> I forgot about sorting IPs :(
>>>>>
>>>>> sort -t. -n -k1,1 -k2,2 -k3,3 -k4,4 file2 > file2-sorted
>>>> In my case, whether I do the above sorting process or not, your scipt
>>>> will take the same time approximately,
>>> But it won't produce the same output. The binary search algorithm relies
>>> on the input data being sorted.
>>
>> Do you mean that I should always do the sorting process like the above
>> firstly?
>
>If you want to do a binary search, then yes you need to sort the input
>data first. The alternative is to not sort the data and do a linear (or
>other uninformed) search, so you can test to see which takes longer
>(sort+binary vs linear) and make your choice.

That's why I wrote a server daemon to hold database in memory, not
pay the data load time for each casual query. Of course which way
to go depends if one is processing some log or data file, or making
casual queries, for example, a log file pretty-printer or web .cgi
script.

I have both pretty-printer and .cgi scripts running. An hourly cron
job processing a log file will check if the daemon is running and
load the database files if required.

Grant.
--
http://bugsplatter.id.au
From: John W. Krahn on
Ed Morton wrote:
> Hongyi Zhao wrote:
>> On Mon, 26 Oct 2009 07:36:13 -0500, Ed Morton <mortonspam(a)gmail.com>
>> wrote:
>>
>>>>> I forgot about sorting IPs :(
>>>>>
>>>>> sort -t. -n -k1,1 -k2,2 -k3,3 -k4,4 file2 > file2-sorted
>>>> In my case, whether I do the above sorting process or not, your scipt
>>>> will take the same time approximately,
>>> But it won't produce the same output. The binary search algorithm
>>> relies on the input data being sorted.
>>
>> Do you mean that I should always do the sorting process like the above
>> firstly?
>
> If you want to do a binary search, then yes you need to sort the input
> data first. The alternative is to not sort the data and do a linear (or
> other uninformed) search, so you can test to see which takes longer
> (sort+binary vs linear) and make your choice.

Or you can use a database (Berkeley DB for example) which means no sort
and no linear search.



John
--
The programmer is fighting against the two most
destructive forces in the universe: entropy and
human stupidity. -- Damian Conway
From: Grant on
On Mon, 26 Oct 2009 20:52:44 +0800, Hongyi Zhao <hongyi.zhao(a)gmail.com> wrote:

>On Mon, 26 Oct 2009 22:20:27 +1100, Grant
><g_r_a_n_t_(a)bugsplatter.id.au> wrote:
>
>>Google translate does a fairly good job with Chinese :) Unfortunately
>>the diagrams have Chinese text embedded in the images -- so I cannot
>>follow the data layout :(
>
>A quick search in google give me the following codes related to the
>format of qqwry.dat, you can checkout it by the following command:
>
>svn checkout http://qqwry.googlecode.com/svn/trunk/ qqwry-read-only
>
>Then you can obtain the following subdirectory:
>qqwry-read-only\libqqwry
>
>I hope the source files on the above folder can give you some hints.

What's wrong with using the nali* applications included in that source?

>If not, I'll translate the webpage for you on which the format of
>qqwry.dat is given.

I've just about reversed engineered the data format from the source
code, but awk is too slow loading a binary data file. The in-place
file binary search written in C (that you have now) is very efficient
(apart from bugs). I can't see why you want to reinvent the wheel
here?

Grant.
--
http://bugsplatter.id.au
From: Hongyi Zhao on
On Wed, 28 Oct 2009 08:58:42 +1100, Grant
<g_r_a_n_t_(a)bugsplatter.id.au> wrote:

>What's wrong with using the nali* applications included in that source?
>
>>If not, I'll translate the webpage for you on which the format of
>>qqwry.dat is given.
>
>I've just about reversed engineered the data format from the source
>code, but awk is too slow loading a binary data file. The in-place
>file binary search written in C (that you have now) is very efficient
>(apart from bugs). I can't see why you want to reinvent the wheel
>here?

OK, Thanks a lot. I just want to konw whether awk will do this more
efficient or not.

Best regards.
--
..: Hongyi Zhao [ hongyi.zhao AT gmail.com ] Free as in Freedom :.