From: Hongyi Zhao on
On Tue, 20 Oct 2009 08:20:58 -0700 (PDT), Ed Morton
<mortonspam(a)gmail.com> wrote:

>The email address I use for netnews is just a spam trap, I don't read
>it. Post some SMALL sample input and expected output from that input,
>in particular including the IP addresses that are omitted from the
>output.

See the following minimal example:

1- The test.awk is as follows:

BEGIN{ FS="\t"; OFS="#" }
function ip2nr(ip,    nr,ipA) {
    # aaa.bbb.ccc.ddd
    split(ip,ipA,".")
    nr = ipA[1] * 1000000000 + ipA[2] * 1000000 + ipA[3] * 1000 + ipA[4]
    return nr
}
NR==FNR { addrs[$0] = ip2nr($0); next }
FNR>1 {
    start = ip2nr($1)
    end = ip2nr($2)
    for (ip in addrs) {
        if (addrs[ip] >= start && addrs[ip] <= end) {
            print ip,$3" "$4
            #delete addrs[ip]
            #next
        }
    }
}

file1 has the following content:

$ cat file1
128.83.194.98
129.21.126.99
129.21.136.140
140.180.130.93
140.180.163.6
161.53.160.104
18.127.1.91
18.181.0.128
18.246.2.48
18.246.2.79
18.246.2.83
18.246.2.88
18.251.7.53

file2 has the following content (the fields are tab-separated):

$ cat file2
StartIP EndIP Country Local
4.21.160.8 4.21.160.15 America MIT
18.0.0.0 18.255.255.255 America MIT
128.30.0.0 128.31.255.255 America MIT
128.52.0.0 128.52.255.255 America MIT
128.83.0.0 128.83.255.255 America The University of Texas at Austin
129.21.0.0 129.21.255.255 America Rochester
140.180.0.0 140.180.255.255 America Princeton
161.53.0.0 161.53.255.255 Croatia University of Zagreb university central computing
192.12.11.0 192.12.11.255 America MIT
192.54.222.0 192.54.222.255 America MIT
192.233.95.0 192.233.95.255 America MIT

The output from running test.awk:

$ awk -f test.awk file1 file2
18.251.7.53#America MIT
18.181.0.128#America MIT
18.246.2.83#America MIT
18.246.2.48#America MIT
18.246.2.88#America MIT
18.246.2.79#America MIT
18.127.1.91#America MIT
128.83.194.98#America The University of Texas at Austin
129.21.136.140#America Rochester
129.21.126.99#America Rochester
140.180.130.93#America Princeton
140.180.163.6#America Princeton
161.53.160.104#Croatia University of Zagreb university central computing

2- This time I use the revised version of your test.awk, with the
delete and next lines enabled:

BEGIN{ FS="\t"; OFS="#" }
function ip2nr(ip,    nr,ipA) {
    # aaa.bbb.ccc.ddd
    split(ip,ipA,".")
    nr = ipA[1] * 1000000000 + ipA[2] * 1000000 + ipA[3] * 1000 + ipA[4]
    return nr
}
NR==FNR { addrs[$0] = ip2nr($0); next }
FNR>1 {
    start = ip2nr($1)
    end = ip2nr($2)
    for (ip in addrs) {
        if (addrs[ip] >= start && addrs[ip] <= end) {
            print ip,$3" "$4
            delete addrs[ip]
            next
        }
    }
}

The output from running this version of test.awk is as follows:

$ awk -f test.awk file1 file2
18.251.7.53#America MIT
128.83.194.98#America The University of Texas at Austin
129.21.136.140#America Rochester
140.180.130.93#America Princeton
161.53.160.104#Croatia University of Zagreb university central computing

Any hints on this issue? Thanks in advance.

Best regards.
--
..: Hongyi Zhao [ hongyi.zhao AT gmail.com ] Free as in Freedom :.
From: Hongyi Zhao on
On Wed, 21 Oct 2009 07:48:18 +1100, Grant
<g_r_a_n_t_(a)bugsplatter.id.au> wrote:

>>> >     nr = ipA[1] * 1000000000 + ipA[2] * 1000000 + ipA[3] * 1000 + ipA[4]
>
>The weighting for converting dotquad IP to a number is 256, not
>1000 -- using 1000 will skip IP addresses in your range matching.
>
>Try
> nr = ipA[1] * 2^24 + ipA[2] * 2^16 + ipA[3] * 2^8 + ipA[4]
>
>or
> nr = ((ipA[1] * 256 + ipA[2]) * 256 + ipA[3]) * 256 + ipA[4]
>
>instead -- the second version is speed optimised for gawk.

I've tried all of the above three expressions for _nr_, and I _always_
get the same results. Could you please give some example to support
your point of view?

Best regards.
--
..: Hongyi Zhao [ hongyi.zhao AT gmail.com ] Free as in Freedom :.
From: Grant on
On Wed, 21 Oct 2009 13:18:47 +0800, Hongyi Zhao <hongyi.zhao(a)gmail.com> wrote:

>On Wed, 21 Oct 2009 07:48:18 +1100, Grant
><g_r_a_n_t_(a)bugsplatter.id.au> wrote:
>
>>>> >     nr = ipA[1] * 1000000000 + ipA[2] * 1000000 + ipA[3] * 1000 + ipA[4]
>>
>>The weighting for converting dotquad IP to a number is 256, not
>>1000 -- using 1000 will skip IP addresses in your range matching.
>>
>>Try
>> nr = ipA[1] * 2^24 + ipA[2] * 2^16 + ipA[3] * 2^8 + ipA[4]
>>
>>or
>> nr = ((ipA[1] * 256 + ipA[2]) * 256 + ipA[3]) * 256 + ipA[4]
>>
>>instead -- the second version is speed optimised for gawk.
>
>I've tried all of the above three expressions for _nr_, and I _always_
>get the same results. Could you please give some example to support
>your point of view?

grant(a)deltree:~$ echo 123.123.123.123 > dotquad

grant(a)deltree:~$ awk '{split($1,a,".");ip=((a[1]*256+a[2])*256+a[3])*256+a[4];\
xx=((a[1]*1000+a[2])*1000+a[3])*1000+a[4];print $1, ip, xx}' dotquad
123.123.123.123 2071690107 123123123123

grant(a)deltree:~$ ccfind 123.123.123.123
123.123.123.123 CN:China

grant(a)deltree:~$ ccfind 2071690107
123.123.123.123 CN:China

grant(a)deltree:~$ ccfind 123123123123
(bad query)
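
The two weightings in the session above can be reproduced with a small shell helper. This is only a sketch: the function name ip2nr and its optional second argument are made up for illustration, and printf "%.0f" is used so that large values are not printed in exponent notation by some awk implementations.

```shell
# Hypothetical helper: ip2nr DOTQUAD [SCALE]; SCALE defaults to 256
ip2nr() {
    awk -v ip="$1" -v scale="${2:-256}" '
    BEGIN {
        split(ip, q, ".")
        # Horner form: ((a*s + b)*s + c)*s + d
        printf "%.0f\n", ((q[1] * scale + q[2]) * scale + q[3]) * scale + q[4]
    }'
}

ip2nr 123.123.123.123         # 2071690107, the 32-bit value of the address
ip2nr 123.123.123.123 1000    # 123123123123, above the 32-bit limit of 4294967295
```

The 1000-weighted value does not fit in 32 bits, which may be why a lookup tool that expects a real decimal IP address rejects it.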

grant(a)deltree:~$ cat $(which ccfind)
#!/bin/bash
#
# ccfind 2006-03-05, last edit 2008-08-15
#
# returns '<query> cc:country name' for IP address input queries,
# using the ip2cn-server daemon.
#
# Copyright (C) 2006-2008 Grant Coady <http://bugsplatter.id.au> GPLv2
#
# 2008-08-13
# convert to ip2cn-server operation, no more access locking! :)
#

# check got query
[ -z "$1" ] && echo "
ccfind -- lookup country code and name for IP address
usage $0 aa.bb.cc.dd
" && exit

# get server listen port
port=$(gawk '/^inetport/ {print $2}' /etc/ip2cn-server.conf)

# make query, may be dotquad or numeric (decimal) IP address
echo "$@" | gawk -v port=$port '
BEGIN { service = "/inet/tcp/0/localhost/" port }
$1 == "0" { $1 = "0." }
{ print |& service; service |& getline; print }' 2>/dev/null

# end

Grant.
--
http://bugsplatter.id.au
From: Ed Morton on
Hongyi Zhao wrote:
> Any hints on this issue? Thanks in advance.
>
> Best regards.

It's the "next". It's causing the script to skip to the next range in
file2 whenever it finds 1 IP address from file1 in that range, but of
course there could be multiple IP addresses in that same range.
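
The effect of the "next" can be seen on a tiny input. The following is a sketch rather than the original script: the wrapper name match_ranges and its stop flag are invented here, and the data is a cut-down version of the files posted earlier in the thread.

```shell
# Hypothetical helper: match_ranges STOP ADDR_FILE RANGE_FILE
# STOP=1 mimics the "next" after the first hit; STOP=0 keeps scanning.
match_ranges() {
    awk -v stop="$1" '
    BEGIN { FS = "\t"; OFS = "#" }
    function ip2nr(ip,    ipA) {
        split(ip, ipA, ".")
        return ((ipA[1] * 256 + ipA[2]) * 256 + ipA[3]) * 256 + ipA[4]
    }
    NR == FNR { addrs[$0] = ip2nr($0); next }
    FNR > 1 {
        start = ip2nr($1); end = ip2nr($2)
        for (ip in addrs)
            if (addrs[ip] >= start && addrs[ip] <= end) {
                print ip, $3 " " $4
                delete addrs[ip]
                if (stop) next    # abandons the rest of addrs for this range
            }
    }' "$2" "$3"
}

tmp=$(mktemp -d)
printf '%s\n' 18.127.1.91 18.181.0.128 18.246.2.48 > "$tmp/file1"
printf 'StartIP\tEndIP\tCountry\tLocal\n18.0.0.0\t18.255.255.255\tAmerica\tMIT\n' > "$tmp/file2"

match_ranges 1 "$tmp/file1" "$tmp/file2" | wc -l   # one line: the scan stops at the first hit
match_ranges 0 "$tmp/file1" "$tmp/file2" | wc -l   # three lines: every address in the range
```

Which of the three addresses is printed in the STOP=1 case depends on the order in which awk walks the array, which is not specified.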

It's not causing this problem, but Grant may be right and you need to
use 256 instead of 1000 as a multiplier - I haven't thought about it
very much so maybe using 1000 will cause problems for some IP addresses.

Try this:

$ cat tst.awk
BEGIN{ FS="\t"; OFS="#"; scale=(scale ? scale : 256) }
function ip2nr(ip,    nr,ipA) {
    # aaa.bbb.ccc.ddd
    split(ip,ipA,".")
    nr = (((((ipA[1] * scale) + ipA[2]) * scale) + ipA[3]) * scale) + ipA[4]
    return nr
}
NR==FNR { addrs[$0] = ip2nr($0); next }
FNR>1 {
    start = ip2nr($1)
    end = ip2nr($2)
    for (ip in addrs) {
        if ((addrs[ip] >= start) && (addrs[ip] <= end)) {
            print ip,$3" "$4
            delete addrs[ip]
        }
    }
}
$ awk -f tst.awk file1 file2 > o1
$ awk -v scale=1000 -f tst.awk file1 file2 > o2
$ diff o1 o2

to see if it produces any difference in the output from your real, large
input files. If not, I'd go with 256 as the scale. If it does, think
about it and decide which is correct.
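
For anyone who wants to try the comparison without the real input files, here is a self-contained sketch of the same procedure on a few rows taken from the sample data in this thread. The wrapper name lookup and the temporary file layout are assumptions made for the example; the output is sorted so that the comparison does not depend on the order in which awk traverses the array.

```shell
# Hypothetical wrapper around the tst.awk logic: lookup SCALE ADDR_FILE RANGE_FILE
lookup() {
    awk -v scale="$1" '
    BEGIN { FS = "\t"; OFS = "#" }
    function ip2nr(ip,    ipA) {
        split(ip, ipA, ".")
        return ((ipA[1] * scale + ipA[2]) * scale + ipA[3]) * scale + ipA[4]
    }
    NR == FNR { addrs[$0] = ip2nr($0); next }
    FNR > 1 {
        start = ip2nr($1); end = ip2nr($2)
        for (ip in addrs)
            if (addrs[ip] >= start && addrs[ip] <= end) {
                print ip, $3 " " $4
                delete addrs[ip]
            }
    }' "$2" "$3" | sort    # sort: "for (ip in addrs)" order is unspecified
}

dir=$(mktemp -d)
printf '%s\n' 128.83.194.98 129.21.126.99 18.127.1.91 18.251.7.53 > "$dir/file1"
{
    printf 'StartIP\tEndIP\tCountry\tLocal\n'
    printf '18.0.0.0\t18.255.255.255\tAmerica\tMIT\n'
    printf '128.83.0.0\t128.83.255.255\tAmerica\tThe University of Texas at Austin\n'
    printf '129.21.0.0\t129.21.255.255\tAmerica\tRochester\n'
} > "$dir/file2"

lookup 256  "$dir/file1" "$dir/file2" > "$dir/o1"
lookup 1000 "$dir/file1" "$dir/file2" > "$dir/o2"
cmp -s "$dir/o1" "$dir/o2" && echo "same output for both scales"
```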

Ed.
From: Ed Morton on
Grant wrote:
> On Wed, 21 Oct 2009 13:18:47 +0800, Hongyi Zhao <hongyi.zhao(a)gmail.com> wrote:
>
>> On Wed, 21 Oct 2009 07:48:18 +1100, Grant
>> <g_r_a_n_t_(a)bugsplatter.id.au> wrote:
>>
>>>>>>     nr = ipA[1] * 1000000000 + ipA[2] * 1000000 + ipA[3] * 1000 + ipA[4]
>>> The weighting for converting dotquad IP to a number is 256, not
>>> 1000 -- using 1000 will skip IP addresses in your range matching.
>>>
>>> Try
>>> nr = ipA[1] * 2^24 + ipA[2] * 2^16 + ipA[3] * 2^8 + ipA[4]
>>>
>>> or
>>> nr = ((ipA[1] * 256 + ipA[2]) * 256 + ipA[3]) * 256 + ipA[4]
>>>
>>> instead -- the second version is speed optimised for gawk.
>> I've tried all of the above three expressions for _nr_, and I _always_
>> get the same results. Could you please give some example to support
>> your point of view?
>
> grant(a)deltree:~$ echo 123.123.123.123 > dotquad
>
> grant(a)deltree:~$ awk '{split($1,a,".");ip=((a[1]*256+a[2])*256+a[3])*256+a[4];\
> xx=((a[1]*1000+a[2])*1000+a[3])*1000+a[4];print $1, ip, xx}' dotquad
> 123.123.123.123 2071690107 123123123123
>
> grant(a)deltree:~$ ccfind 123.123.123.123
> 123.123.123.123 CN:China
>
> grant(a)deltree:~$ ccfind 2071690107
> 123.123.123.123 CN:China
>
> grant(a)deltree:~$ ccfind 123123123123
> (bad query)

I expect you're right and that multiplying by 256 does produce a
"better" representation of the IP address as a decimal number, but can
you think of an example where the range check Hongyi cares about would
fail if we used 1000 instead of 256 as the multiplier?
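
One way to probe that question is to compare how the two weightings order a pair of addresses. Assuming every octet is a valid value from 0 to 255, both 256 and 1000 are larger than any octet, so each weighting is a positional encoding and should order addresses the same way as an octet-by-octet comparison; the largest 1000-weighted value, 255255255255, is also small enough to be held exactly in awk's floating-point numbers. The helper name same_order below is made up for this sketch, which checks individual pairs and is not a proof.

```shell
# Hypothetical check: same_order IP_A IP_B
# Prints "agree" when both weightings order the pair the same way.
same_order() {
    awk -v a="$1" -v b="$2" '
    function ip2nr(ip, scale,    q) {
        split(ip, q, ".")
        return ((q[1] * scale + q[2]) * scale + q[3]) * scale + q[4]
    }
    function sign(x) { return x < 0 ? -1 : (x > 0 ? 1 : 0) }
    BEGIN {
        d256  = sign(ip2nr(a, 256)  - ip2nr(b, 256))
        d1000 = sign(ip2nr(a, 1000) - ip2nr(b, 1000))
        print (d256 == d1000 ? "agree" : "disagree")
    }'
}

same_order 18.255.255.255 128.0.0.0     # agree
same_order 129.21.0.255 129.21.1.0      # agree
same_order 140.180.163.6 140.180.163.6  # agree
```

The weightings would only diverge if an input contained an "octet" of 256 or more, which is not a valid dotted-quad address anyway.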

Ed.