From: dougarobertson on
Hi,
I've got a large dataset (350,000,000 records) containing transactions,
from which I want to select a subset of around 250,000,000 records and
then dedupe on id, place and date, which takes it down to around
15,000,000 records. It's possible (just, given memory constraints) to
do this with a simple PROC SORT but, partly because some further data
manipulation may be desirable and partly to get to grips with hash
tables for the first time, I've tried using a hash object instead.
On the first couple of attempts it ran out of memory, but with a bit of
tinkering I got it to just below the limit; the (slightly edited) log
is below.
I am now trying to understand why so much memory is used for this
data step. I realise 15,000,000 records is a lot to hold in a hash
table in memory, but there are only 3 variables (all numeric, total
length 11) and the final data set that is output only takes about
160MB on disk, so why does the data step need 919MB of memory? Is
there any way I can reduce this?
I'll be rerunning the process with several similarly sized data sets,
so I can't really afford to be teetering so close to the edge of our
available memory.
I am working with 1GB of memory on a UNIX platform, accessed remotely
from a Windows-based PC through Enterprise Guide 4.

Any comments would be very welcome.

Thanks,

Doug



21 options fullstimer;
22 data _null_;
23 length hhd_id 4;
24 if 0 then set bigdata (rename=(hhd_id=c_hi));
25 declare hash hh (hashexp:16 /*, ordered: 'a'*/) ;
26 hh.definekey ('id','place','date') ;
27 hh.definedata ('id','place','date') ;
28 hh.definedone () ;
29 do until (eof) ;
30 set bigdata (where=(flag='Y') keep=id place date flag
rename=(id=c_id)) end = eof ;
31 id=input(c_id,10.);
32 hh.replace () ;
33 end ;
34 rc = hh.output (dataset: "visits") ;
35
36 run;

NOTE: The data set VISITS has 14374264 observations and 3 variables.
NOTE: There were 243729816 observations read from the data set bigdata
WHERE flag='Y';
NOTE: DATA statement used (Total process time):
real time 11:50.98
user cpu time 7:05.57
system cpu time 1:03.51
Memory 918868k
Page Faults 185
Page Reclaims 224845
Page Swaps 0
Voluntary Context Switches 0
Involuntary Context Switches 0
Block Input Operations 0
Block Output Operations 0

From: yzg9 on
Check this: http://support.sas.com/techsup/unotes/SN/016/016920.html

SN-V9-016920
"FATAL: Insufficient memory to execute data step program" message when
creating a large hash object

The following messages are written to the SAS log when you run out of
memory while adding keys and associated data items to a hash object:

FATAL: Insufficient memory to execute data step program. Aborted
during the EXECUTION phase.
NOTE: The SAS System stopped processing this step because of
insufficient memory.

In some cases, these messages may be generated even though there is
still memory available. These failures can occur on machines whose
XMRLMEM value overflows a 4-byte integer. The overflow produces an
excessively large memory allocation request, which in turn triggers
the spurious 'out of memory' condition.

The code below illustrates how to get the value of XMRLMEM.

data _null_;
   amt = getoption('xmrlmem');
   put amt=;
run;

If the value of amt is 2,147,483,647 or greater, you may be getting
the insufficient memory error incorrectly. This problem is fixed in
SAS 9.1.3 Service Pack 4.

Also, beginning with SP4, when the system option MSGLEVEL=I is
specified, a message is written to the SAS log reporting the number of
items added to the hash object when the memory shortage occurs.
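
For reference, turning that diagnostic on is just the one-line OPTIONS
statement below (a minimal sketch; nothing else in the step needs to
change):

options msglevel=i ;  /* with SP4+, reports the number of hash items added when memory runs out */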

*****

John Gerstle, MS
Biostatistician
Northrop Grumman
CDC Information Technological Support Contract (CITS)
NCHSTP \DHAP \HICSB \Research, Analysis, and Evaluation Section
Centers for Disease Control and Prevention

"Boss. We've got cats." "Meow"

"All truth passes through three stages:
First, it is ridiculed;
Second, it is violently opposed;
Third, it is accepted as being self-evident."
- Arthur Schopenhauer (1830)


From: dougarobertson on
John,
Thanks for getting back to me so quickly.
The XMRLMEM value on my system is 858,962,944, so I don't think this
problem applies in my case. Thanks for the suggestion though.

Doug


"Gerstle, John (CDC/CCID/NCHHSTP) (CTR)" wrote:

> Check this: http://support.sas.com/techsup/unotes/SN/016/016920.html=20
>
> SN-V9-016920=20
> "FATAL: Insufficient memory to execute data step program" message when
> creating a large hash object =09
>
>
> The following messages will be written to the SAS log when you run
> out
> of memory when adding keys and associated data items to a hash
> object.
> =20
> FATAL: Insufficient memory to execute data step program. Aborted
> during the EXECUTION phase.
> NOTE: The SAS System stopped processing this step because of
> insufficient memory.
> =20
> In some cases, these messages may be generated even though there is
> still memory available. These types of failures may occur on
> machines
> which have an XMRLMEM value which will overflow a 4-byte integer.
> This
> causes in an excessively large memory allocation request which
> results
> in the inaccurate 'out of memory' condition.
> =20
> The code below illustrates how to get the value of XMRLMEM.
> =20
> data _null_;
> amt =3D getoption('xmrlmem');
> put amt=3D;
> run;
> =20
> If the value of amt is 2,147,483,647 or greater then you may be
> getting
> the insufficient memory error incorrectly. This problem is fixed in
> SAS
> 9.1.3 Service Pack 4.
> =20
> Also beginning with SP4, when the system option MSGLEVEL=3DI is
> specified, a message is written to the SAS log reporting the number
> of
> items added to the hash object when the memory shortage occurs.
>
> *****
>
> John Gerstle, MS=20
> Biostatistician
> Northrop Grumman
> CDC Information Technological Support Contract (CITS)
> NCHSTP \DHAP \HICSB \Research, Analysis, and Evaluation Section
> Centers for Disease Control and Prevention
>
> "Boss. We've got cats." "Meow"
>
> "All truth passes through three stages:
> First, it is ridiculed;
> Second, it is violently opposed;
> Third, it is accepted as being self-evident."
> - Arthur Schopenhauer (1830)
>
>
> >>-----Original Message-----
> >>From: owner-sas-l(a)listserv.uga.edu
> [mailto:owner-sas-l(a)listserv.uga.edu]
> >>On Behalf Of dougarobertson(a)gmail.com
> >>Sent: Thursday, December 28, 2006 4:44 PM
> >>To: sas-l(a)uga.edu
> >>Subject: Hash Table Memory Usage
> >>
> >>Hi,
> >>I've got a large dataset (350,000,000 records) containing
> transactions,
> >>from which I want to select a subset of around 250,000,000 records and
> >>then dedupe on id, place and date, which takes it down to around
> >>15,000,000 records. It's (just, given memory constraints) possible to
> >>do this using a simple proc sort but, partly because some further data
> >>manipulation may be desirable and partly to get to grips with them for
> >>the first time, I've tried using hash tables.
> >>On the first couple of attempts it ran out of memory, but with a bit
> of
> >>tinkering I got it to just below the limit; the (slightly edited) log
> >>is below.
> >>I am now trying to understand why so much memory is used for this
> >>datastep. I realise 15,000,000 records is a lot to hold as a hash
> table
> >>in memory, but there are only 3 variables (all numeric, total length
> >>11) and the final data set that is output only uses about 160MB on the
> >>disk, so why does the datastep need 919MB of memory? Is there any way
> I
> >>can reduce this?
> >>I'll be rerunning the process with several similar sized data sets so
> I
> >>can't really afford to be teetering so close to the edge of our
> >>available memory.
> >>I am working with 1GB of memory on a UNIX platform, working remotely
> on
> >>a Windows based PC through Enterprise Guide 4.
> >>
> >>Any comments would be very welcome.
> >>
> >>Thanks,
> >>
> >>Doug
> >>
> >>
> >>
> >>21 options fullstimer;
> >>22 data _null_;
> >>23 length hhd_id 4;
> >>24 if 0 then set bigdata (rename=3D(hhd_id=3Dc_hi));
> >>25 declare hash hh (hashexp:16 /*, ordered: 'a'*/) ;
> >>26 hh.definekey ('id','place','date') ;
> >>27 hh.definedata ('id','place','date') ;
> >>28 hh.definedone () ;
> >>29 do until (eof) ;
> >>30 set bigdata (where=3D(flag=3D'Y') keep=3Did place date =
> flag
> >>rename=3D(id=3Dc_id)) end =3D eof ;
> >>31 id=3Dinput(c_id,10.);
> >>32 hh.replace () ;
> >>33 end ;
> >>34 rc =3D hh.output (dataset: "visits") ;
> >>35
> >>36 run;
> >>
> >>NOTE: The data set VISITS has 14374264 observations and 3 variables.
> >>NOTE: There were 243729816 observations read from the data set bigdata
> >> WHERE flag=3D'Y';
> >>NOTE: DATA statement used (Total process time):
> >> real time 11:50.98
> >> user cpu time 7:05.57
> >> system cpu time 1:03.51
> >> Memory 918868k
> >> Page Faults 185
> >> Page Reclaims 224845
> >> Page Swaps 0
> >> Voluntary Context Switches 0
> >> Involuntary Context Switches 0
> >> Block Input Operations 0
> >> Block Output Operations 0

From: Ken Borowiak on

Doug,

You could avoid gobbling up as much memory by not putting all the keys in
the data portion of the hash table. Check out this thread (and the
reference it cites):
http://listserv.uga.edu/cgi-bin/wa?A2=ind0611C&L=sas-l&P=R37784&m=217688
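
As a rough back-of-the-envelope check, your log shows roughly 919MB for
about 14.4 million items, i.e. around 65 bytes per item, which would be
consistent with the three numeric keys being carried in both the key and
data portions plus per-item overhead. Below is a minimal, untested sketch
of the leaner approach: only one variable goes in the data portion, and
the unique rows are written with an explicit OUTPUT statement instead of
hh.output(). (It assumes ID really is stored as character in BIGDATA, as
the RENAME/INPUT in your log suggests.)

data visits (keep=id place date) ;
   length id 8 ;
   declare hash hh (hashexp:16) ;
   hh.definekey ('id','place','date') ;
   hh.definedata ('id') ;              /* keep the data portion to a single variable */
   hh.definedone () ;
   do until (eof) ;
      set bigdata (where=(flag='Y') keep=id place date flag
                   rename=(id=c_id)) end=eof ;
      id = input(c_id, 10.) ;
      if hh.check() ne 0 then do ;     /* first time this key combination is seen */
         rc = hh.add() ;
         output ;                      /* write the unique record straight out */
      end ;
   end ;
   stop ;
run ;

The table still has to hold all ~14.4 million key combinations, so this
trims the footprint rather than eliminating it.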

HTH,
Ken
From: Ken Borowiak on

Doug,

I am still pondering a better hash-based solution than the one I already
gave, but in the meantime consider this index-based approach. IIRC,
Richard Devenezia provided a similar solution in response to a post
earlier this year.

data has_dupkeys ;
input a b c d ;
datalines ;
1 2 3 1
1 2 3 2
2 2 2 1
2 2 2 2
1 2 3 3
2 2 2 4
;
run ;

/*-- Define table with an index with unique keys --*/
proc sql ;
   create table nodupkeys
      ( a num,
        b num,
        c num ) ;

   create unique index abc on nodupkeys(a, b, c) ;
quit ;

proc append base=nodupkeys data=has_dupkeys force ;
run ;

/* ... can then drop the index from NODUPKEYS */

It won't run as fast, but it will alleviate your memory concerns (at the
expense of disk space).
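
Applied to your data, the shape would be roughly as below. This is an
untested sketch: the index name IPD is arbitrary, and the column types
are a guess (the INPUT(c_id,10.) in your log suggests ID is stored as
character in BIGDATA, so adjust the CHAR length, or switch it to NUM, to
match your data).

proc sql ;
   create table visits
      ( id    char(10),
        place num,
        date  num ) ;

   create unique index ipd on visits (id, place, date) ;
quit ;

/* rows whose (id, place, date) combination already exists in VISITS are
   rejected by the unique index; FORCE lets FLAG, which is not in the
   base table, be dropped                                               */
proc append base=visits
            data=bigdata (where=(flag='Y') keep=id place date flag)
            force ;
run ;

proc sql ;
   drop index ipd from visits ;   /* optional, once the dedup is finished */
quit ;

The duplicates are rejected as the append runs, so only the unique rows
(plus the index) end up on disk.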

HTH,
Ken