From: bcubeb3 on
Suppose I have two data sets: CO and EA
and I want to find the subset of CO that are matches to EA
and likewise the subset of CO that aren't matches to EA
The following code works. However when CO and EA are over a million
observations
the proc sql procedure is very very very slow. I want to know if there
is an optimized way to do
what I am doing below:

/*************CO *********************/
data home.CO;
input fname $ lname $;
datalines;
John Smith
Mary Jane
Will Smith
Eric Manelow
Christina Aguilera
;

data home.CO;
set home.CO;
FL = CATS(fname,lname);
run;

data home.EA;
input fname $ lname $;
datalines;
Thomas Jefferson
Lady Gaga
Michael Jackson
Kobe Bryant
Will Smith
Eric Manelow
Christina Aguilera
;

data home.EA;
set home.EA;
FL = CATS(fname,lname);
run;

PROC SQL;
CREATE TABLE home.matches AS
SELECT *
FROM home.CO as b
where EXISTS
(SELECT *
FROM home.EA as a
WHERE compare(b.FL,a.FL,'I') EQ 0
)
;
QUIT;

PROC SQL;
CREATE TABLE home.nomatches AS
SELECT *
FROM home.CO as b
where NOT EXISTS
(SELECT *
FROM home.EA as a
WHERE compare(b.FL,a.FL,'I') EQ 0
)
;
QUIT;
From: Patrick on
Hi

Thanks that you provided your code and sample data. That makes it so
much easier to give an answer.


If you're processing this in SAS then I'd suggest to use a SAS hash
lookup instead of any SQL code.

Do you know this paper?
http://support.sas.com/rnd/base/datastep/dot/hash-getting-started.pdf

Using a hash lookup your code could then look like this:

data work.CO;
input fname $ lname $;
datalines;
John Smith
Mary Jane
Will Smith
Eric Manelow
Christina Aguilera
;

data work.EA;
input fname $ lname $;
datalines;
Thomas Jefferson
Lady Gaga
Michael Jackson
Kobe Bryant
Will Smith
Eric Manelow
Christina Aguilera
;

data CoMatches CoNonMatches;
length fname lname $ 8;
drop rc;
if _N_ = 1 then do;
declare hash h(dataset:'work.EA');
h.defineKey('fname','lname');
h.defineDone();
call missing(fname,lname);
end;
set work.co;
rc=h.check();
if rc=0 then output CoMatches;
else output CoNonMatches;
run;


Only one pass through the data is needed. I'm sure this will perform
much better than the code you've posted.

As your lookup table has 1M rows: Make sure to read the section about
memory usage in the paper I've sent the link for.

If you have to do it in SQL (i.e. because this should execute in a
Data Base) then I believe to remember that a subselect performs better
than EXIST.

HTH
Patrick
 | 
Pages: 1
Prev: Exact binomial test
Next: ORACLE XMLTYPE