From: theorbo on
Hi. I'm seeking advice on an alternative way to solve my problem.
Let me describe the problem:

First my two datasets:

ANCHOR GROUPS (AG) - 100 obs & 500 variables
NEW PERSONS (NP) - 5,000 obs & 500 variables (these are the same
variables as in dataset AG)


I have my anchor groups and need to figure out which group each of the
5000 people from NP are most similar to. Unfortunately data reduction
techniques with the variables is not an option currently.

To do this I've been doing the following ...

Iteratively going through NP and merging 1 observation's 500
variables' values to the AG dataset. Then I compute the difference
between VAR1_AG and VAR1_NP ... VAR500_AG and VAR500_NP. I then sum
the square of the 500 differences to get a "distance" measure. I then
know that the NP observation is most similar to the AG observation
where it has the minimum "distance" measure.

Iteratively going through these 5000 observations (merge, compute
distance, etc.) takes an extremely long time, especially if the number
of variables or observations increase.

Is there a different approach that you might suggest? The math is
relatively simple but like with most things in SAS there are always a
million and one ways to do something. And to reiterate, we considered
some data reduction techniques but they are not going to be feasible
at the current time.

Thank you. Please post if you have any questions or need any
clarification to help me.
From: Ya on
On Jun 1, 1:44 pm, theorbo <theo...(a)gmail.com> wrote:
> Hi.  I'm seeking advice on an alternative way to solve my problem.
> Let me describe the problem:
>
> First my two datasets:
>
> ANCHOR GROUPS (AG) - 100 obs & 500 variables
> NEW PERSONS (NP) - 5,000 obs & 500 variables (these are the same
> variables as in dataset AG)
>
> I have my anchor groups and need to figure out which group each of the
> 5000 people from NP are most similar to.  Unfortunately data reduction
> techniques with the variables is not an option currently.
>
> To do this I've been doing the following ...
>
> Iteratively going through NP and merging 1 observation's 500
> variables' values to the AG dataset.  Then I compute the difference
> between VAR1_AG and VAR1_NP ... VAR500_AG and VAR500_NP.  I then sum
> the square of the 500 differences to get a "distance" measure.  I then
> know that the NP observation is most similar to the AG observation
> where it has the minimum "distance" measure.
>
> Iteratively going through these 5000 observations (merge, compute
> distance, etc.) takes an extremely long time, especially if the number
> of variables or observations increase.
>
> Is there a different approach that you might suggest?  The math is
> relatively simple but like with most things in SAS there are always a
> million and one ways to do something.  And to reiterate, we considered
> some data reduction techniques but they are not going to be feasible
> at the current time.
>
> Thank you.  Please post if you have any questions or need any
> clarification to help me.

Instead of comparing 500 vars, you can transpose the data first, then
compare one variable,
it also makes the calculation of distance much easier (use by
processing):

data anchor;
input id1 x1 x2 x3;
cards;
1 23 34 65
2 45 67 89
3 32 43 76
;

data person;
input id2 x1 x2 x3;
cards;
1 24 32 64
2 34 42 74
3 145 167 189
4 30 40 70
;

proc transpose data=anchor out=tanchor;
by id1;
var x1-x3;
run;

proc transpose data=person out=tperson;
by id2;
var x1-x3;
run;


proc sql;
create table distance as
select distinct a.id1,b.id2,sum((a.col1-b.col1)**2) as ss
from tanchor a, tperson b
where a._name_=b._name_
group by id2,id1
order by id2,ss
;

data distance;
set distance;
by id2 ss;
if first.id2;
run;

proc print;
var id2 id1 ss;
run;

id2 id1 ss

1 1 6
2 3 9
3 2 30000
4 3 49

HTH

Ya
From: theorbo on
On Jun 1, 6:22 pm, Ya <huang8...(a)gmail.com> wrote:
> On Jun 1, 1:44 pm, theorbo <theo...(a)gmail.com> wrote:
>
>
>
>
>
> > Hi.  I'm seeking advice on an alternative way to solve my problem.
> > Let me describe the problem:
>
> > First my two datasets:
>
> > ANCHOR GROUPS (AG) - 100 obs & 500 variables
> > NEW PERSONS (NP) - 5,000 obs & 500 variables (these are the same
> > variables as in dataset AG)
>
> > I have my anchor groups and need to figure out which group each of the
> > 5000 people from NP are most similar to.  Unfortunately data reduction
> > techniques with the variables is not an option currently.
>
> > To do this I've been doing the following ...
>
> > Iteratively going through NP and merging 1 observation's 500
> > variables' values to the AG dataset.  Then I compute the difference
> > between VAR1_AG and VAR1_NP ... VAR500_AG and VAR500_NP.  I then sum
> > the square of the 500 differences to get a "distance" measure.  I then
> > know that the NP observation is most similar to the AG observation
> > where it has the minimum "distance" measure.
>
> > Iteratively going through these 5000 observations (merge, compute
> > distance, etc.) takes an extremely long time, especially if the number
> > of variables or observations increase.
>
> > Is there a different approach that you might suggest?  The math is
> > relatively simple but like with most things in SAS there are always a
> > million and one ways to do something.  And to reiterate, we considered
> > some data reduction techniques but they are not going to be feasible
> > at the current time.
>
> > Thank you.  Please post if you have any questions or need any
> > clarification to help me.
>
> Instead of comparing 500 vars, you can transpose the data first, then
> compare one variable,
> it also makes the calculation of distance much easier (use by
> processing):
>
> data anchor;
> input id1 x1 x2 x3;
> cards;
> 1 23 34 65
> 2 45 67 89
> 3 32 43 76
> ;
>
> data person;
> input id2 x1 x2 x3;
> cards;
> 1 24 32 64
> 2 34 42 74
> 3 145 167 189
> 4 30 40 70
> ;
>
> proc transpose data=anchor out=tanchor;
> by id1;
> var x1-x3;
> run;
>
> proc transpose data=person out=tperson;
> by id2;
> var x1-x3;
> run;
>
> proc sql;
> create table distance as
> select distinct a.id1,b.id2,sum((a.col1-b.col1)**2) as ss
> from tanchor a, tperson b
> where a._name_=b._name_
> group by id2,id1
> order by id2,ss
> ;
>
> data distance;
>  set distance;
> by id2 ss;
> if first.id2;
> run;
>
> proc print;
> var id2 id1 ss;
> run;
>
> id2    id1       ss
>
>  1      1         6
>  2      3         9
>  3      2     30000
>  4      3        49
>
> HTH
>
> Ya- Hide quoted text -
>
> - Show quoted text -

Thanks, Ya. I had thought about that but didn't know if the PROC SQL
portion would choke on the large number of variables, etc. Thanks
again for your suggestion.