From: Ed Morton on
Grant wrote:
> On Sun, 25 Oct 2009 05:04:43 +0000 (UTC), Kaz Kylheku <kkylheku(a)gmail.com> wrote:
>
>> On 2009-10-25, Hongyi Zhao <hongyi.zhao(a)gmail.com> wrote:
>>> Another issue: if file1 is a huge one, say with several thousand
>>> entries in it, the above process will be time consuming. So is it
>>> possible to revise the above awk script using multithreading to
>>> improve its efficiency?
>> What kind of machine are you using that a file of a mere several thousand
>> entries causes a performance problem, yet threads somehow help?
>
> Dunno about threads being applicable, but his lookup file has
> 300k records -- no point re-reading that for each of the thousand
> input lines, which is what Ed's solution does.

No it doesn't; it reads each file once. It reads the first one into
memory and deletes records from it as they're found in the second
one.
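
Schematically it's the standard two-file idiom (a minimal sketch, with
illustrative names):

    # NR==FNR is only true while the first file is being read:
    # load it into memory once, then test each record of the
    # second file against (and delete from) the in-memory copy.
    NR==FNR { addrs[$0]; next }
    ($0 in addrs) { print; delete addrs[$0] }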

Ed.

> Grant.
From: Ed Morton on
Hongyi Zhao wrote:
> On Wed, 21 Oct 2009 02:47:54 -0500, Ed Morton <mortonspam(a)gmail.com>
> wrote:
>
>> $ cat tst.awk
>> BEGIN{ FS="\t"; OFS="#"; scale=(scale ? scale : 256) }
>> function ip2nr(ip, nr,ipA) {
>> # aaa.bbb.ccc.ddd
>> split(ip,ipA,".")
>> nr = (((((ipA[1] * scale) + ipA[2]) * scale) + ipA[3]) * scale) + ipA[4]
>> return nr
>> }
>> NR==FNR { addrs[$0] = ip2nr($0); next }
>> FNR>1 {
>> start = ip2nr($1)
>> end = ip2nr($2)
>> for (ip in addrs) {
>> if ((addrs[ip] >= start) && (addrs[ip] <= end)) {
>> print ip,$3" "$4
>> delete addrs[ip]
>> }
>> }
>> }
>
> Another issue: if file1 is a huge one, say with several thousand
> entries in it, the above process will be time consuming. So is it
> possible to revise the above awk script using multithreading to
> improve its efficiency?
>
> Thanks in advance.

Don't know about multi-threading, but if your file of IP addresses is
huge and your file of ranges and names isn't then this should be a lot
more efficient:

BEGIN{ FS="\t"; OFS="#"; scale=256 }

function ip2nr(ip,   nr,ipA) {
    # convert aaa.bbb.ccc.ddd to a single number
    split(ip,ipA,".")
    nr = (((((ipA[1] * scale) + ipA[2]) * scale) + ipA[3]) * scale) + ipA[4]
    return nr
}

# first file (the ranges and names): load it into memory
NR==FNR {
    start[FNR] = ip2nr($1)
    end[FNR] = ip2nr($2)
    name[FNR] = $3" "$4
    range = FNR
    next
}

# second file (the addresses): scan the stored ranges
{
    curr = ip2nr($0)
    for (idx=1; idx<=range; idx++) {
        if ((curr >= start[idx]) && (curr <= end[idx])) {
            print $0,name[idx]
            next
        }
    }
    notFound[$0]
}

END {
    for (ip in notFound) {
        print ip,"Not Found." | "cat>&2"
    }
}

Run it as:

awk -f whatever.awk file2 file1

Note the switch in order of input files.
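
For example, with made-up data (the file2 layout, tab-separated start,
end, and two name fields, is just inferred from FS="\t" and $3" "$4
above; this version treats every line of file2 as data):

    $ printf '1.0.0.0\t1.0.0.255\tfoo\tbar\n' > file2
    $ printf '1.0.0.5\n9.9.9.9\n' > file1
    $ awk -f whatever.awk file2 file1
    1.0.0.5#foo bar
    9.9.9.9#Not Found.

(the "Not Found." line arrives on stderr via the cat>&2 pipe).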

You could change the search algorithm (for (idx=...) ) to binary search
or something if you know any properties of "file2" that you could take
advantage of.

Ed.
From: Ed Morton on
Ed Morton wrote:
> Hongyi Zhao wrote:
>> On Wed, 21 Oct 2009 02:47:54 -0500, Ed Morton <mortonspam(a)gmail.com>
>> wrote:
>>
>>> $ cat tst.awk
>>> BEGIN{ FS="\t"; OFS="#"; scale=(scale ? scale : 256) }
>>> function ip2nr(ip, nr,ipA) {
>>> # aaa.bbb.ccc.ddd
>>> split(ip,ipA,".")
>>> nr = (((((ipA[1] * scale) + ipA[2]) * scale) + ipA[3]) * scale) + ipA[4]
>>> return nr
>>> }
>>> NR==FNR { addrs[$0] = ip2nr($0); next }
>>> FNR>1 {
>>> start = ip2nr($1)
>>> end = ip2nr($2)
>>> for (ip in addrs) {
>>> if ((addrs[ip] >= start) && (addrs[ip] <= end)) {
>>> print ip,$3" "$4
>>> delete addrs[ip]
>>> }
>>> }
>>> }
>>
>> Another issue: if file1 is a huge one, say with several thousand
>> entries in it, the above process will be time consuming. So is it
>> possible to revise the above awk script using multithreading to
>> improve its efficiency?
>>
>> Thanks in advance.
>
> Don't know about multi-threading, but if your file of IP addresses is
> huge and your file of ranges and names isn't then this should be a lot
> more efficient:
>
> BEGIN{ FS="\t"; OFS="#"; scale=256 }
> function ip2nr(ip, nr,ipA) {
> # aaa.bbb.ccc.ddd
> split(ip,ipA,".")
> nr = (((((ipA[1] * scale) + ipA[2]) * scale) + ipA[3]) * scale) + ipA[4]
> return nr
> }
> NR==FNR {
> start[FNR] = ip2nr($1)
> end[FNR] = ip2nr($2)
> name[FNR] = $3" "$4
> range = FNR

Change the above 4 lines to:

start[range] = ip2nr($1)
end[range] = ip2nr($2)
name[range] = $3" "$4
range++

to account for the first line of file2 containing a header rather than
data: since range starts out at zero, the header row lands in index 0,
which the search loop (idx starting at 1) never examines.
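
For clarity, the whole loader block then reads:

    NR==FNR {
        start[range] = ip2nr($1)
        end[range] = ip2nr($2)
        name[range] = $3" "$4
        range++
        next
    }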

> next
> }
> {
> curr = ip2nr($0)
> for (idx=1; idx<=range; idx++)
> {
> if ((curr >= start[idx]) && (curr <= end[idx]))
> {
> print $0,name[idx]
> next
> }
> }
> notFound[$0]
> }
> END {
> for (ip in notFound) {
> print ip,"Not Found." | "cat>&2"
> }
> }
>
> Run it as:
>
> awk -f whatever.awk file2 file1
>
> Note the switch in order of input files.
>
> You could change the search algorithm (for (idx=...) ) to binary search
> or something if you know any properties of "file2" that you could take
> advantage of.
>
> Ed.
From: Grant on
On Sun, 25 Oct 2009 09:18:46 -0500, Ed Morton <mortonspam(a)gmail.com> wrote:

>Hongyi Zhao wrote:
>> On Wed, 21 Oct 2009 02:47:54 -0500, Ed Morton <mortonspam(a)gmail.com>
>> wrote:
>>
>>> $ cat tst.awk
>>> BEGIN{ FS="\t"; OFS="#"; scale=(scale ? scale : 256) }
>>> function ip2nr(ip, nr,ipA) {
>>> # aaa.bbb.ccc.ddd
>>> split(ip,ipA,".")
>>> nr = (((((ipA[1] * scale) + ipA[2]) * scale) + ipA[3]) * scale) + ipA[4]
>>> return nr
>>> }
>>> NR==FNR { addrs[$0] = ip2nr($0); next }
>>> FNR>1 {
>>> start = ip2nr($1)
>>> end = ip2nr($2)
>>> for (ip in addrs) {
>>> if ((addrs[ip] >= start) && (addrs[ip] <= end)) {
>>> print ip,$3" "$4
>>> delete addrs[ip]
>>> }
>>> }
>>> }
>>
>> Another issue: if file1 is a huge one, say with several thousand
>> entries in it, the above process will be time consuming. So is it
>> possible to revise the above awk script using multithreading to
>> improve its efficiency?
>>
>> Thanks in advance.
>
>Don't know about multi-threading, but if your file of IP addresses is
>huge and your file of ranges and names isn't then this should be a lot
>more efficient:
>
>BEGIN{ FS="\t"; OFS="#"; scale=256 }
>function ip2nr(ip, nr,ipA) {
> # aaa.bbb.ccc.ddd
> split(ip,ipA,".")
> nr = (((((ipA[1] * scale) + ipA[2]) * scale) + ipA[3]) * scale) + ipA[4]
> return nr
>}
>NR==FNR {
> start[FNR] = ip2nr($1)
> end[FNR] = ip2nr($2)
> name[FNR] = $3" "$4
> range = FNR
> next
>}
>{
> curr = ip2nr($0)

And it's in this part that I use a binary search instead of your
linear scan, replacing an average of 150k compares per lookup with
~19 :)
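
(With 300k sorted ranges a binary search needs about ceil(log2(300000))
= 19 probes per lookup, versus n/2 = 150k on average for a linear scan.)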
- - -
> for (idx=1; idx<=range; idx++)
> {
> if ((curr >= start[idx]) && (curr <= end[idx]))
> {
> print $0,name[idx]
> next
> }
- - -
# binary search to suit Ed's script:

lo = 1; hi = range
while (hi - lo > 1) {
    mid = int((lo + hi) / 2)
    if (start[mid] < curr) { lo = mid } else { hi = mid }
}
if (curr < start[hi]) --hi

if (curr > end[hi]) { notFound[$0] } else { print $0,name[hi] }
- - -
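
(Note the splice replaces the whole linear scan, including the loop's
closing brace and the notFound[$0] line quoted below; the if/else above
does the not-found bookkeeping itself.)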
> }
> notFound[$0]
>}
>END {
> for (ip in notFound) {
> print ip,"Not Found." | "cat>&2"
> }
>}
>
>Run it as:
>
> awk -f whatever.awk file2 file1
>
>Note the switch in order of input files.
>
>You could change the search algorithm (for (idx=...) ) to binary search
>or something if you know any properties of "file2" that you could take
>advantage of.

I think the OP's data to be looked up is just a list of several thousand
IP addresses, so the binary search above would be closer to the solution
the OP needs?

Binary search assumes the lookup file (the one with 300k records) is
sorted.

Where I do a similar task with IPs, the database is normalised to two
files:

$ cat er-ip2c
+----------------+
| ip2c-index |
|----------------| +--------------+
|*IP block start*| | ip2c-names |
| IP block end | +--------------+
| Country code |---|*Country code*|
+----------------+ | Country name |
+--------------+

I don't know if the OP has repetitive data in their lookup table (third
field) that could be normalised like this; normalised lookup tables only
save some memory in this type of application.

Grant.
--
http://bugsplatter.id.au
From: Grant on
On Mon, 26 Oct 2009 03:25:57 +1100, Grant <g_r_a_n_t_(a)bugsplatter.id.au> wrote:

>On Sun, 25 Oct 2009 09:18:46 -0500, Ed Morton <mortonspam(a)gmail.com> wrote:
>
>>Hongyi Zhao wrote:
>>> On Wed, 21 Oct 2009 02:47:54 -0500, Ed Morton <mortonspam(a)gmail.com>
>>> wrote:
>>>
>>>> $ cat tst.awk
>>>> BEGIN{ FS="\t"; OFS="#"; scale=(scale ? scale : 256) }
>>>> function ip2nr(ip, nr,ipA) {
>>>> # aaa.bbb.ccc.ddd
>>>> split(ip,ipA,".")
>>>> nr = (((((ipA[1] * scale) + ipA[2]) * scale) + ipA[3]) * scale) + ipA[4]
>>>> return nr
>>>> }
>>>> NR==FNR { addrs[$0] = ip2nr($0); next }
>>>> FNR>1 {
>>>> start = ip2nr($1)
>>>> end = ip2nr($2)
>>>> for (ip in addrs) {
>>>> if ((addrs[ip] >= start) && (addrs[ip] <= end)) {
>>>> print ip,$3" "$4
>>>> delete addrs[ip]
>>>> }
>>>> }
>>>> }
>>>
>>> Another issue: if file1 is a huge one, say with several thousand
>>> entries in it, the above process will be time consuming. So is it
>>> possible to revise the above awk script using multithreading to
>>> improve its efficiency?
>>>
>>> Thanks in advance.
>>
>>Don't know about multi-threading, but if your file of IP addresses is
>>huge and your file of ranges and names isn't then this should be a lot
>>more efficient:
>>
>>BEGIN{ FS="\t"; OFS="#"; scale=256 }
>>function ip2nr(ip, nr,ipA) {
>> # aaa.bbb.ccc.ddd
>> split(ip,ipA,".")
>> nr = (((((ipA[1] * scale) + ipA[2]) * scale) + ipA[3]) * scale) + ipA[4]
>> return nr
>>}
>>NR==FNR {
>> start[FNR] = ip2nr($1)
>> end[FNR] = ip2nr($2)
>> name[FNR] = $3" "$4
>> range = FNR
>> next
>>}
>>{
>> curr = ip2nr($0)
>
>And it's in this part that I use a binary search instead of your
>linear scan, replacing an average of 150k compares per lookup with
>~19 :)
> - - -
>> for (idx=1; idx<=range; idx++)
>> {
>> if ((curr >= start[idx]) && (curr <= end[idx]))
>> {
>> print $0,name[idx]
>> next
>> }
> - - -
> # binary search to suit Ed's script:
>
> lo = 1; hi = range
> while (hi - lo > 1) {
> mid = int((lo + hi) / 2)
> if (start[mid] < curr) { lo = mid } else { hi = mid }
> }
> if (curr < start[hi]) --hi
>
> if (curr > end[hi]) { notFound[$0] } else { print $0,name[hi] }
> - - -
>> }
>> notFound[$0]
>>}
>>END {
>> for (ip in notFound) {
>> print ip,"Not Found." | "cat>&2"
>> }
>>}
>>
>>Run it as:
>>
>> awk -f whatever.awk file2 file1
>>
>>Note the switch in order of input files.
>>
>>You could change the search algorithm (for (idx=...) ) to binary search
>>or something if you know any properties of "file2" that you could take
>>advantage of.
>
>I think the OP's data to be looked up is just a list of several thousand
>IP addresses, so the binary search above would be closer to the solution
>the OP needs?
>
>Binary search assumes the lookup file (the one with 300k records) is
>sorted.

And it assumes the IP blocks are non-overlapping! An address that falls
into a gap between blocks comes back as unassigned.
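
Both assumptions are cheap to verify up front, e.g. (a minimal sketch,
assuming the same tab-separated, header-first file2 layout as in Ed's
script):

    # flag range starts that are out of order or inside the previous range
    function ip2nr(ip,   a) { split(ip, a, "."); return ((a[1]*256+a[2])*256+a[3])*256+a[4] }
    BEGIN { FS = "\t" }
    FNR > 1 {                     # skip the header line
        s = ip2nr($1)
        if (FNR > 2 && s <= prev_e)
            print "line " FNR ": unsorted or overlapping range" | "cat>&2"
        prev_e = ip2nr($2)
    }

Run as: awk -f check.awk file2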
>
>Where I do a similar task with IPs, the database is normalised to two
>files:
>
>$ cat er-ip2c
>+----------------+
>| ip2c-index |
>|----------------| +--------------+
>|*IP block start*| | ip2c-names |
>| IP block end | +--------------+
>| Country code |---|*Country code*|
>+----------------+ | Country name |
> +--------------+
>
>I don't know if the OP has repetitive data in their lookup table (third
>field) that could be normalised like this; normalised lookup tables only
>save some memory in this type of application.

Context where I'm doing a similar task to what the OP is after:
....
FILENAME == ip2c_index {
    # record format: <block start> <block end> <cc>
    ipdata_str[++ipdatsize] = $1
    ipdata_end[ipdatsize] = $2
    ipdata_cc[ipdatsize] = $3
    next
}
FILENAME == ip2c_names {
    # record format: <cc>:<country name>
    split($0, a, ":")
    ipname[a[1]] = a[2]
    next
}
....
function cc_lookup(addr,   a, i, l, m, h)
{
....
    # binary search ip2c-data for country code
    split(addr, a, "."); i = ((a[1]*256+a[2])*256+a[3])*256+a[4]
    l = 1; h = ipdatsize
    while (h - l > 1) {
        m = int((l + h) / 2)
        if (ipdata_str[m] < i) { l = m } else { h = m }
    }
    if (i < ipdata_str[h]) --h
    if (i > ipdata_end[h]) return "--:unassigned"

    # return country code and country name
    return sprintf("%s:%s", ipdata_cc[h], ipname[ipdata_cc[h]])
}
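
For reference: the loader stores $1/$2 as-is and cc_lookup() compares
them numerically, so ip2c-index holds the block boundaries already
converted to numbers. Made-up sample records:

    $ head -2 ip2c-index
    16777216 16777471 AU
    16777472 16778239 JP
    $ grep '^AU' ip2c-names
    AU:Australia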
>
>Grant.
--
http://bugsplatter.id.au