From: yzg9 on
Doug,

Here's a version of Ken's suggestion. (I've been playing with this kind of
code for the past two weeks and I think I'm getting the hang of it. :) )
It's also worth reading the recent hash papers from NESUG, SUGI, etc.

Something along these lines (this is untested code):

/** I found that I had to add this observation counter in order to get the point to work ***/
data bigdata; set bigdata; n=_n_; run;

data visits ;
length hhd_id 4;
if 0 then set bigdata (rename=(hhd_id=c_hi)); /* load variable attributes only */

/* hash keyed on (id, n), where n is the observation number to point back to */
dcl hash hh1 (hashexp:16);
hh1.definekey ('id','n');
hh1.definedata ('id','n');
hh1.definedone();

/* hash keyed on id alone, carrying the per-id counter */
declare hash hh2 (hashexp:16 /*, ordered: 'a'*/) ;
hh2.definekey ('id') ;
hh2.definedata ('n','_n_') ;
hh2.definedone () ;

/* pass 1: count occurrences of each id and record observation numbers */
do until (eof1);
   set bigdata end=eof1;
   if hh2.find() ne 0 then _n_ = 0; /* first time this id is seen */
   _n_ + 1;                         /* sum statement: per-id occurrence count */
   hh2.replace();
   n + 1; /** this will point to the observation when there's a match **/
   hh1.add();
end;

/* pass 2: direct access by observation number. POINT= cannot be     */
/* combined with WHERE= or END= on the SET statement, so the flag    */
/* test moves into the step and NOBS= supplies the loop bound.       */
do n = 1 to nrecs;
   set bigdata (keep=id place date flag rename=(id=c_id))
       point=n nobs=nrecs;
   if flag = 'Y' then do;
      id = input(c_id,10.);
      if hh1.find() = 0 then output; /* keep the row when (id, n) is in hh1 */
   end;
end;
stop;
run;
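
For the memory side of it, here's a rough sketch of what I take Ken's point
below to be -- keep the keys out of the data portion and write each new key
combination out as soon as it shows up, so the hash only carries one small
data item per key. Also untested, and the VISITS2 name is just for
illustration:

/* Leaner dedup: key on (id, place, date) but keep only one small item */
/* in the data portion, and output each new key combination directly.  */
data visits2 ;
keep id place date;          /* same three variables as VISITS */
length hhd_id 4;
if 0 then set bigdata (rename=(hhd_id=c_hi));

declare hash hh (hashexp:16);
hh.definekey ('id','place','date');
hh.definedata ('id');        /* one item instead of repeating all three keys */
hh.definedone ();

do until (eof);
   set bigdata (where=(flag='Y') keep=id place date flag
                rename=(id=c_id)) end=eof;
   id = input(c_id,10.);
   if hh.check() ne 0 then do; /* first time this (id, place, date) is seen */
      rc = hh.add();
      output;                  /* write it now; no need to pull it back out of the hash */
   end;
end;
stop;
run;

Since the rows are written out as they're first seen, there's no need for
hh.output() at the end, and PLACE and DATE never get stored twice.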


John Gerstle, MS
Biostatistician
Northrop Grumman
CDC Information Technological Support Contract (CITS)
NCHSTP \DHAP \HICSB \Research, Analysis, and Evaluation Section
Centers for Disease Control and Prevention

"Boss. We've got cats." "Meow"

"All truth passes through three stages:
First, it is ridiculed;
Second, it is violently opposed;
Third, it is accepted as being self-evident."
- Arthur Schopenhauer (1830)

>>-----Original Message-----
>>From: owner-sas-l(a)listserv.uga.edu [mailto:owner-sas-l(a)listserv.uga.edu]
>>On Behalf Of Ken Borowiak
>>Sent: Thursday, December 28, 2006 5:09 PM
>>To: SAS-L(a)LISTSERV.UGA.EDU
>>Cc: Ken Borowiak
>>Subject: Re: Hash Table Memory Usage
>>
>>On Thu, 28 Dec 2006 13:43:31 -0800, dougarobertson(a)GMAIL.COM wrote:
>>
>>>Hi,
>>>I've got a large dataset (350,000,000 records) containing transactions,
>>>from which I want to select a subset of around 250,000,000 records and
>>>then dedupe on id, place and date, which takes it down to around
>>>15,000,000 records. It's (just, given memory constraints) possible to
>>>do this using a simple proc sort but, partly because some further data
>>>manipulation may be desirable and partly to get to grips with them for
>>>the first time, I've tried using hash tables.
>>>On the first couple of attempts it ran out of memory, but with a bit of
>>>tinkering I got it to just below the limit; the (slightly edited) log
>>>is below.
>>>I am now trying to understand why so much memory is used for this
>>>datastep. I realise 15,000,000 records is a lot to hold as a hash table
>>>in memory, but there are only 3 variables (all numeric, total length
>>>11) and the final data set that is output only uses about 160MB on the
>>>disk, so why does the datastep need 919MB of memory? Is there any way I
>>>can reduce this?
>>>I'll be rerunning the process with several similar sized data sets so I
>>>can't really afford to be teetering so close to the edge of our
>>>available memory.
>>>I am working with 1GB of memory on a UNIX platform, working remotely on
>>>a Windows based PC through Enterprise Guide 4.
>>>
>>>Any comments would be very welcome.
>>>
>>>Thanks,
>>>
>>>Doug
>>>
>>>
>>>
>>>21 options fullstimer;
>>>22 data _null_;
>>>23 length hhd_id 4;
>>>24 if 0 then set bigdata (rename=(hhd_id=c_hi));
>>>25 declare hash hh (hashexp:16 /*, ordered: 'a'*/) ;
>>>26 hh.definekey ('id','place','date') ;
>>>27 hh.definedata ('id','place','date') ;
>>>28 hh.definedone () ;
>>>29 do until (eof) ;
>>>30 set bigdata (where=(flag='Y') keep=id place date flag
>>>rename=(id=c_id)) end = eof ;
>>>31 id=input(c_id,10.);
>>>32 hh.replace () ;
>>>33 end ;
>>>34 rc = hh.output (dataset: "visits") ;
>>>35
>>>36 run;
>>>
>>>NOTE: The data set VISITS has 14374264 observations and 3 variables.
>>>NOTE: There were 243729816 observations read from the data set bigdata
>>>      WHERE flag='Y';
>>>NOTE: DATA statement used (Total process time):
>>> real time 11:50.98
>>> user cpu time 7:05.57
>>> system cpu time 1:03.51
>>> Memory 918868k
>>> Page Faults 185
>>> Page Reclaims 224845
>>> Page Swaps 0
>>> Voluntary Context Switches 0
>>> Involuntary Context Switches 0
>>> Block Input Operations 0
>>> Block Output Operations 0
>>
>>Doug,
>>
>>You could avoid gobbling up as much memory by not putting all the keys in
>>the data portion of the hash table. Check out this thread (and reference):
>>http://listserv.uga.edu/cgi-bin/wa?A2=ind0611C&L=sas-l&P=R37784&m=217688
>>
>>HTH,
>>Ken
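
(A rough back-of-the-envelope on Ken's point: 14,374,264 items x 3 numeric
keys x 8 bytes is about 330MB, and carrying the same three variables again
in the data portion roughly doubles that to around 660MB; the hash object's
own per-item overhead presumably accounts for much of the rest of the 919MB
in the log. So trimming the data portion down to a single variable should
claw back a couple hundred MB.)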