From: Hongyi Zhao on
On Mon, 19 Oct 2009 11:40:51 -0500, Ed Morton <mortonspam(a)gmail.com>
wrote:

>Adding a "delete" and a "next" would make the script more efficient if
>you have a large list of IP addresses in file1 and each range in file2
>is distinct:
>
>BEGIN{ FS="\t"; OFS="#" }
>function ip2nr(ip, nr,ipA) {
> # aaa.bbb.ccc.ddd
> split(ip,ipA,".")
> nr = ipA[1] * 1000000000 + ipA[2] * 1000000 + ipA[3] * 1000 + ipA[4]
> return nr
>}
>NR==FNR { addrs[$0] = ip2nr($0); next }
>FNR>1 {
> start = ip2nr($1)
> end = ip2nr($2)
> for (ip in addrs) {
> if (addrs[ip] >= start && addrs[ip] <= end) {
> print ip,$3" "$4
> delete addrs[ip]
> next
> }
> }
>}
>
> Ed.

Thanks a lot.

In my case, file2 (the IP database) is huge (373375 lines), and I
find that your revised awk script above omits some of the IP
addresses from file1 in the output.

Since it's not advisable to post attachments to this newsgroup, I've
emailed you about this issue, along with all the files I used and
generated.
--
..: Hongyi Zhao [ hongyi.zhao AT gmail.com ] Free as in Freedom :.
From: Grant on
On Tue, 20 Oct 2009 11:52:51 +0800, Hongyi Zhao <hongyi.zhao(a)gmail.com> wrote:

>On Mon, 19 Oct 2009 11:40:51 -0500, Ed Morton <mortonspam(a)gmail.com>
>wrote:
>
>>Adding a "delete" and a "next" would make the script more efficient if
>>you have a large list of IP addresses in file1 and each range in file2
>>is distinct:
>>
>>BEGIN{ FS="\t"; OFS="#" }
>>function ip2nr(ip, nr,ipA) {
>> # aaa.bbb.ccc.ddd
>> split(ip,ipA,".")
>> nr = ipA[1] * 1000000000 + ipA[2] * 1000000 + ipA[3] * 1000 + ipA[4]
>> return nr
>>}
>>NR==FNR { addrs[$0] = ip2nr($0); next }
>>FNR>1 {
>> start = ip2nr($1)
>> end = ip2nr($2)
>> for (ip in addrs) {
>> if (addrs[ip] >= start && addrs[ip] <= end) {
>> print ip,$3" "$4
>> delete addrs[ip]
>> next
>> }
>> }
>>}
>>
>> Ed.
>
>Thanks a lot.
>
>In my case, the file2, i.e., the IP database is a huge one (including
>373375 lines), and I find that your above revised awk script will omit
>some IP addresses in for the file1 in the output.

In that case it's probably easier to work with decimal IPs,
something like (gawk fragment):

function cc_lookup(addr,    a, i, l, m, h)
{
    ....
    # binary search ip2c-data for country code
    split(addr, a, "."); i = ((a[1]*256+a[2])*256+a[3])*256+a[4]
    l = 1; h = ipdatsize
    while (h - l > 1) {
        m = int((l + h) / 2)
        if (ipdata_str[m] < i) { l = m } else { h = m }
    }
    if (i < ipdata_str[h]) --h
    if (i > ipdata_end[h]) return "--:unassigned"

    # return country code and country name
    return sprintf("%s:%s", ipdata_cc[h], ipname[ipdata_cc[h]])
}

My lookup table is smaller, though, at 102k records: I'm only
interested in country code lookup, so adjacent blocks are merged
during database file creation.
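
For anyone who wants to try the binary search idea in isolation, here is
a minimal self-contained sketch. It is not the real ip2c lookup: the
three-row table, the country codes "AA"/"BB"/"CC", and the names
cc_lookup, ip2dec, add_block, lo, hi and code are all made up for
illustration, and it assumes the blocks are sorted by start address and
do not overlap. Unlike the fragment above, it also checks for addresses
that fall below the first block.

```sh
# Hypothetical helper: look up one dotted-quad address in a tiny,
# made-up table of sorted, non-overlapping address blocks.
cc_lookup() {
    awk -v addr="$1" '
    # Convert a dotted quad to its decimal value (weights of 256).
    function ip2dec(ip,    a) {
        split(ip, a, ".")
        return ((a[1] * 256 + a[2]) * 256 + a[3]) * 256 + a[4]
    }
    # Append one block; blocks must be added in ascending order.
    function add_block(first, last, cc) {
        n++
        lo[n] = ip2dec(first); hi[n] = ip2dec(last); code[n] = cc
    }
    # Binary search for the block containing ip, or "--" if none does.
    function lookup(ip,    i, l, m, h) {
        i = ip2dec(ip)
        l = 1; h = n
        while (h - l > 1) {
            m = int((l + h) / 2)
            if (lo[m] < i) { l = m } else { h = m }
        }
        if (i < lo[h]) --h
        if (h < 1 || i < lo[h] || i > hi[h]) return "--"
        return code[h]
    }
    BEGIN {
        add_block("10.0.0.0",    "10.255.255.255",  "AA")
        add_block("172.16.0.0",  "172.31.255.255",  "BB")
        add_block("192.168.0.0", "192.168.255.255", "CC")
        print lookup(addr)
    }'
}
```

Calling "cc_lookup 172.20.0.1" walks the table in a handful of steps
rather than scanning every range, which is where the time saving on a
large database would come from.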
>
>Considering that it's not advisable to post attachments to this news
>group, I've post you via mail about the following issue along with
>all files used and generated by me.

Can you make a very small test case that demonstrates the issue?

Grant.
--
http://bugsplatter.id.au
From: Hongyi Zhao on
On Tue, 20 Oct 2009 17:58:34 +1100, Grant
<g_r_a_n_t_(a)bugsplatter.id.au> wrote:

>Make a very limited system that demonstrates your issues?

I've also sent you a copy of that mail; perhaps it will give you
more information on this issue.

BTW, when the IP database is huge, the lookup takes a long time.
Are there ways to reduce the lookup time?

Best regards.
--
..: Hongyi Zhao [ hongyi.zhao AT gmail.com ] Free as in Freedom :.
From: Ed Morton on
On Oct 19, 10:52 pm, Hongyi Zhao <hongyi.z...(a)gmail.com> wrote:
> On Mon, 19 Oct 2009 11:40:51 -0500, Ed Morton <mortons...(a)gmail.com>
> wrote:
>
> >Adding a "delete" and a "next" would make the script more efficient if
> >you have a large list of IP addresses in file1 and each range in file2
> >is distinct:
>
> >BEGIN{ FS="\t"; OFS="#" }
> >function ip2nr(ip,      nr,ipA) {
> >     # aaa.bbb.ccc.ddd
> >     split(ip,ipA,".")
> >     nr = ipA[1] * 1000000000 + ipA[2] * 1000000 + ipA[3] * 1000 + ipA[4]
> >     return nr
> >}
> >NR==FNR { addrs[$0] = ip2nr($0); next }
> >FNR>1 {
> >     start = ip2nr($1)
> >     end   = ip2nr($2)
> >     for (ip in addrs) {
> >         if (addrs[ip] >= start && addrs[ip] <= end) {
> >             print ip,$3" "$4
> >             delete addrs[ip]
> >             next
> >         }
> >     }
> >}
>
> >     Ed.
>
> Thanks a lot.  
>
> In my case, the file2, i.e., the IP database is a huge one (including
> 373375 lines), and I find that your above revised awk script will omit
> some IP addresses in for the file1 in the output.
>
> Considering that it's not advisable to post attachments to this news
> group,  I've  post you via mail about the following issue along with
> all files used and generated by me.
> --
> .: Hongyi Zhao [ hongyi.zhao AT gmail.com ] Free as in Freedom :.

The email address I use for netnews is just a spam trap; I don't read
it. Post some SMALL sample input and the expected output from that
input, in particular including the IP addresses that are omitted from
the output.

Ed.
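
A small test case along those lines might look like the sketch below.
The data are made up (two addresses in file1 that fall inside the same
range in file2, plus a header line in file2 because the script skips
FNR==1), and count_matches, demo_dir and use_next are hypothetical names
introduced only for this demo. It runs the posted logic with and without
the "next" and counts the addresses reported. With the "next", the rule
stops at the first matching address for each range, so a second address
in the same range is never printed. Whether that is what happens with
the real 373375-line database is not known from this thread.

```sh
# Made-up sample data: two addresses that fall in the same range.
demo_dir=$(mktemp -d)
printf '10.0.0.1\n10.0.0.2\n' > "$demo_dir/file1"
printf 'start\tend\tcountry\tisp\n10.0.0.0\t10.0.0.255\tExampleland\tExampleNet\n' > "$demo_dir/file2"

# count_matches USE_NEXT: run the posted logic, with (1) or without (0)
# the "next", and print how many addresses were reported.
count_matches() {
    awk -v use_next="$1" '
    BEGIN { FS = "\t"; OFS = "#" }
    function ip2nr(ip,    ipA) {
        split(ip, ipA, ".")
        return ipA[1] * 1000000000 + ipA[2] * 1000000 + ipA[3] * 1000 + ipA[4]
    }
    # First file: remember every address with its numeric value.
    NR == FNR { addrs[$0] = ip2nr($0); next }
    # Second file: skip the header, then test each address against the range.
    FNR > 1 {
        start = ip2nr($1); end = ip2nr($2)
        for (ip in addrs) {
            if (addrs[ip] >= start && addrs[ip] <= end) {
                print ip, $3 " " $4
                delete addrs[ip]
                if (use_next) next   # leaves the loop after the first match
            }
        }
    }' "$demo_dir/file1" "$demo_dir/file2" | awk 'END { print NR }'
}
```

Here "count_matches 1" reports one address and "count_matches 0" reports
both.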
From: Grant on
On Tue, 20 Oct 2009 08:20:58 -0700 (PDT), Ed Morton <mortonspam(a)gmail.com> wrote:

>On Oct 19, 10:52 pm, Hongyi Zhao <hongyi.z...(a)gmail.com> wrote:
>> On Mon, 19 Oct 2009 11:40:51 -0500, Ed Morton <mortons...(a)gmail.com>
>> wrote:
>>
>> >Adding a "delete" and a "next" would make the script more efficient if
>> >you have a large list of IP addresses in file1 and each range in file2
>> >is distinct:
>>
>> >BEGIN{ FS="\t"; OFS="#" }
>> >function ip2nr(ip,      nr,ipA) {
>> >     # aaa.bbb.ccc.ddd
>> >     split(ip,ipA,".")
>> >     nr = ipA[1] * 1000000000 + ipA[2] * 1000000 + ipA[3] * 1000 + ipA[4]

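The usual weighting for converting a dotted-quad IP to a number is
256, not 1000. Weights of 1000 still keep addresses in the right
order (every octet is below 1000), but the result is not the real
32-bit value, so it won't match a database that stores decimal IPs.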

Try
nr = ipA[1] * 2^24 + ipA[2] * 2^16 + ipA[3] * 2^8 + ipA[4]

or
nr = ((ipA[1] * 256 + ipA[2]) * 256 + ipA[3]) * 256 + ipA[4]

instead -- the second version is speed optimised for gawk.
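
As a quick check that the two expressions agree, here is a small sketch
wrapped in a shell function. The name ip2dec and the variables q, powers
and nested are made up for the demo. It prints with "%.0f" because some
awk versions would otherwise show large integers in exponent notation.

```sh
# Hypothetical helper: print the decimal value of a dotted-quad address.
ip2dec() {
    awk -v ip="$1" 'BEGIN {
        split(ip, q, ".")
        # Powers-of-two form.
        powers = q[1] * 2^24 + q[2] * 2^16 + q[3] * 2^8 + q[4]
        # Nested (Horner) form: three multiplications, no exponentiation.
        nested = ((q[1] * 256 + q[2]) * 256 + q[3]) * 256 + q[4]
        # Both forms should give the same number.
        if (powers != nested) exit 1
        printf "%.0f\n", nested
    }'
}
```

For example, "ip2dec 192.168.1.1" prints 3232235777.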

>> >     return nr
>> >}
>> >NR==FNR { addrs[$0] = ip2nr($0); next }
>> >FNR>1 {
>> >     start = ip2nr($1)
>> >     end   = ip2nr($2)
>> >     for (ip in addrs) {
>> >         if (addrs[ip] >= start && addrs[ip] <= end) {
>> >             print ip,$3" "$4
>> >             delete addrs[ip]
>> >             next
>> >         }
>> >     }
>> >}

Grant.
--
http://bugsplatter.id.au