From: Hongyi Zhao on
On Mon, 26 Oct 2009 03:25:57 +1100, Grant
<g_r_a_n_t_(a)bugsplatter.id.au> wrote:

> - - -
> # binary search to suit Ed's script:
>
> lo = 1; hi = range
> while (hi - lo > 1) {
> mid = int((lo + hi) / 2)
> if (start[mid] < curr) { lo = mid } else { hi = mid }
> }
> if (curr < start[hi]) --hi
>
> if (curr > end[hi]) { notFound[$0] } else { print $0,name[idx] }
> - - -


I replace your above code snip in Ed's script and get the following
one:

BEGIN{ FS="\t"; OFS="#"; scale=256 }
function ip2nr(ip, nr,ipA) {
# aaa.bbb.ccc.ddd
split(ip,ipA,".")
nr = (((((ipA[1] * scale) + ipA[2]) * scale) + ipA[3]) * scale) +
ipA[4]
return nr
}
NR==FNR {
start[range] = ip2nr($1)
end[range] = ip2nr($2)
name[range] = $3" "$4
range++
next
}
{
curr = ip2nr($0)
lo = 1; hi = range
while (hi - lo > 1) {
mid = int((lo + hi) / 2)
if (start[mid] < curr) { lo = mid } else { hi = mid }
}
if (curr < start[hi]) --hi

if (curr > end[hi]) { notFound[$0] } else { print $0,name[idx]
}
}
notFound[$0]
}
END {
for (ip in notFound) {
print ip,"Not Found." | "cat>&2"
}
}

But, run the above script will give me the the following errors:

awk: whatever.awk:27: }
awk: whatever.awk:27: ^ syntax error

Best regards.
--
..: Hongyi Zhao [ hongyi.zhao AT gmail.com ] Free as in Freedom :.
From: Hongyi Zhao on
On Sun, 25 Oct 2009 09:18:46 -0500, Ed Morton <mortonspam(a)gmail.com>
wrote:

>Don't know about multi-threading, but if your file of IP addresses is
>huge and your file of ranges and names isn't then this should be a lot
>more efficient:

In fact, the file of ranges and names in my case is more huge than the
file of IP addresses.

Best regards.
--
..: Hongyi Zhao [ hongyi.zhao AT gmail.com ] Free as in Freedom :.
From: Hongyi Zhao on
On Sun, 25 Oct 2009 09:40:46 -0500, Ed Morton <mortonspam(a)gmail.com>
wrote:

>> BEGIN{ FS="\t"; OFS="#"; scale=256 }
>> function ip2nr(ip, nr,ipA) {
>> # aaa.bbb.ccc.ddd
>> split(ip,ipA,".")
>> nr = (((((ipA[1] * scale) + ipA[2]) * scale) + ipA[3]) * scale) +
>> ipA[4]
>> return nr
>> }
>> NR==FNR {
>> start[FNR] = ip2nr($1)
>> end[FNR] = ip2nr($2)
>> name[FNR] = $3" "$4
>> range = FNR
>
>Change the above 4 lines to:
>
> start[range] = ip2nr($1)
> end[range] = ip2nr($2)
> name[range] = $3" "$4
> range++
>
>to account for the first line of file2 containing a header rather than data.
>
>> next
>> }
>> {
>> curr = ip2nr($0)
>> for (idx=1; idx<=range; idx++)
>> {
>> if ((curr >= start[idx]) && (curr <= end[idx]))
>> {
>> print $0,name[idx]
>> next
>> }
>> }
>> notFound[$0]
>> }
>> END {
>> for (ip in notFound) {
>> print ip,"Not Found." | "cat>&2"
>> }
>> }
>>
>> Run it as:
>>
>> awk -f whatever.awk file2 file1

I've tested it, in my test, the file2 include 36490 IP non-overlap IP
ranges and the file1 include 1627 lines IP addresses.

The result shows that the above script will spend more time than your
original one.

Best regards.
--
..: Hongyi Zhao [ hongyi.zhao AT gmail.com ] Free as in Freedom :.
From: Grant on
On Mon, 26 Oct 2009 10:16:22 +0800, Hongyi Zhao <hongyi.zhao(a)gmail.com> wrote:

>On Mon, 26 Oct 2009 03:25:57 +1100, Grant
><g_r_a_n_t_(a)bugsplatter.id.au> wrote:
>
>> - - -
>> # binary search to suit Ed's script:
>>
>> lo = 1; hi = range
>> while (hi - lo > 1) {
>> mid = int((lo + hi) / 2)
>> if (start[mid] < curr) { lo = mid } else { hi = mid }
>> }
>> if (curr < start[hi]) --hi
>>
>> if (curr > end[hi]) { notFound[$0] } else { print $0,name[idx] }
>> - - -
>
>
>I replace your above code snip in Ed's script and get the following
>one:
>
>BEGIN{ FS="\t"; OFS="#"; scale=256 }
>function ip2nr(ip, nr,ipA) {
> # aaa.bbb.ccc.ddd
> split(ip,ipA,".")
> nr = (((((ipA[1] * scale) + ipA[2]) * scale) + ipA[3]) * scale) +
>ipA[4]
> return nr
>}
>NR==FNR {
> start[range] = ip2nr($1)
> end[range] = ip2nr($2)
> name[range] = $3" "$4
> range++
> next
>}
>{
> curr = ip2nr($0)
> lo = 1; hi = range
> while (hi - lo > 1) {
> mid = int((lo + hi) / 2)
> if (start[mid] < curr) { lo = mid } else { hi = mid }
> }
> if (curr < start[hi]) --hi
>
> if (curr > end[hi]) { notFound[$0] } else { print $0,name[idx]

kill this one here? --> }

> }
> notFound[$0]
>}
>END {
> for (ip in notFound) {
> print ip,"Not Found." | "cat>&2"
> }
>}
>
>But, run the above script will give me the the following errors:
>
>awk: whatever.awk:27: }
>awk: whatever.awk:27: ^ syntax error

Oh come now, you got to play a bit :) And take a bit more care
with your indentaing so you can see the blocks.

Alright, I rewrote it.

BEGIN {
# FS = "\t" # this is whitespace, unnecessary?
OFS = "#" # is this used?
}

function ip2nr(ip, k) # dotquad IP to number
{
# aaa.bbb.ccc.ddd
split(ip, k, ".")
return ((k[1] * 256 + k[2]) * 256 + k[3]) * 256 + k[4]
}
# load lookup table
NR==FNR {
start[++range] = ip2nr($1) # using one-based array
end[range] = ip2nr($2)
name[range] = $3" "$4
next
}
# process IPs
{
addr = ip2nr($0); lo = 1; hi = range

# binary search
while (hi - lo > 1) {
mid = int((lo + hi) / 2)
if (start[mid] < addr) {
lo = mid
}
else {
hi = mid
}
}

# adjust to closest less than when no match (likely)
if (addr < start[hi])
--hi

if (addr > end[hi]) {
# if addr past end of this block we found undefined IP
print $0, "Not Found" | "cat>&2"
}
else {
# we found a match
print $0, name[hi]
}
}
# end

Untested but ported from working code.

Grant.
--
http://bugsplatter.id.au
From: Ed Morton on
Grant wrote:
> On Mon, 26 Oct 2009 10:16:22 +0800, Hongyi Zhao <hongyi.zhao(a)gmail.com> wrote:
>
>> On Mon, 26 Oct 2009 03:25:57 +1100, Grant
>> <g_r_a_n_t_(a)bugsplatter.id.au> wrote:
>>
>>> - - -
>>> # binary search to suit Ed's script:
>>>
>>> lo = 1; hi = range
>>> while (hi - lo > 1) {
>>> mid = int((lo + hi) / 2)
>>> if (start[mid] < curr) { lo = mid } else { hi = mid }
>>> }
>>> if (curr < start[hi]) --hi
>>>
>>> if (curr > end[hi]) { notFound[$0] } else { print $0,name[idx] }
>>> - - -
>>
>> I replace your above code snip in Ed's script and get the following
>> one:
>>
>> BEGIN{ FS="\t"; OFS="#"; scale=256 }
>> function ip2nr(ip, nr,ipA) {
>> # aaa.bbb.ccc.ddd
>> split(ip,ipA,".")
>> nr = (((((ipA[1] * scale) + ipA[2]) * scale) + ipA[3]) * scale) +
>> ipA[4]
>> return nr
>> }
>> NR==FNR {
>> start[range] = ip2nr($1)
>> end[range] = ip2nr($2)
>> name[range] = $3" "$4
>> range++
>> next
>> }
>> {
>> curr = ip2nr($0)
>> lo = 1; hi = range
>> while (hi - lo > 1) {
>> mid = int((lo + hi) / 2)
>> if (start[mid] < curr) { lo = mid } else { hi = mid }
>> }
>> if (curr < start[hi]) --hi
>>
>> if (curr > end[hi]) { notFound[$0] } else { print $0,name[idx]
>
> kill this one here? --> }
>
>> }
>> notFound[$0]
>> }
>> END {
>> for (ip in notFound) {
>> print ip,"Not Found." | "cat>&2"
>> }
>> }
>>
>> But, run the above script will give me the the following errors:
>>
>> awk: whatever.awk:27: }
>> awk: whatever.awk:27: ^ syntax error
>
> Oh come now, you got to play a bit :) And take a bit more care
> with your indentaing so you can see the blocks.
>
> Alright, I rewrote it.
>
> BEGIN {
> # FS = "\t" # this is whitespace, unnecessary?

Yes, it's necessary since the names contain non-tab whitespace. - Note 1.

> OFS = "#" # is this used?

Yes, OP wants # separating output fields, not a space. - Note 2.

> }
>
> function ip2nr(ip, k) # dotquad IP to number
> {
> # aaa.bbb.ccc.ddd
> split(ip, k, ".")
> return ((k[1] * 256 + k[2]) * 256 + k[3]) * 256 + k[4]
> }
> # load lookup table
> NR==FNR {
> start[++range] = ip2nr($1) # using one-based array

Yes, but it'll store the titles in start[1] which is undesirable. That's
why I incremented range after the assignments.


> end[range] = ip2nr($2)
> name[range] = $3" "$4

See Note 1 above.

> next
> }
> # process IPs
> {
> addr = ip2nr($0); lo = 1; hi = range
>
> # binary search
> while (hi - lo > 1) {
> mid = int((lo + hi) / 2)
> if (start[mid] < addr) {
> lo = mid
> }
> else {
> hi = mid
> }
> }
>
> # adjust to closest less than when no match (likely)
> if (addr < start[hi])
> --hi
>
> if (addr > end[hi]) {
> # if addr past end of this block we found undefined IP
> print $0, "Not Found" | "cat>&2"
> }
> else {
> # we found a match
> print $0, name[hi]

See Note 2 above.

Ed.
> }
> }
> # end
>
> Untested but ported from working code.
>
> Grant.