calculating variables across repeated ids [SAS]

Prev: calculating variables across repeated ids
Next: Check if a Variable is Null or not

From: Muthia Kachirayan on 8 Mar 2010 12:51

On Mon, Mar 8, 2010 at 1:14 PM, Katy Seib <uralva(a)gmail.com> wrote:

> **Intro
> First, thanks To Ron for posting the SAS-L wiki this morning- very helpful!
> I too am relatively new to posting at the listserv but have enjoyed
> reading
> for a while. I'm a guest researcher at CDC and a grad student finishing up
> my thesis which is a longitudinal study - much of which is beyond the scope
> of my formal SAS education - I've had fun finding resources like this, UCLA
> and various SAS tips across the internet. I think my question today is
> relatively simple but a problem for me nonetheless.
> **Problem
> I have been calculating variables and have a LONG but effective algorithm
> for "filling in" variables that are constant across an observation. It's
> tedious and I still worry about errors. I know there must be a simpler way
> to fill in this information AS it's calculated. Below I have included an
> example and sample data. The SAM variable is a crude time variable and in
> reality there are 23 serotypes. The data step "correlate" gives me the
> yes/no answer for sam=3 only but I need it for all sam (the question being
> answered by calculating this variable is: "did this infant have a antibody
> titer greater than .35 at 10 weeks (sam=3)?". Here I've done it for
> Serotype_1 but will have to do it for all serotypes. I'm not pasting my
> fill
> in code because it's long and clumsy requiring 6 data steps. This is not a
> first occurrence var, so I'm having trouble applying previous similar
> solutions. Thanks all.
>
> data ALLFILLED;
> input ID SAM SEROTYPE_1 SEROTYPE_9V;
> cards;
> 1 0.2 8.22 36.1
> 1 0.3 1.81 22.1
> 1 1.0 .41 4.8
> 1 2.0 .33 3.2
> 1 3.0 .30 2.1
> 1 4.0 .23 2.0
> 1 5.0 .10 1.9
> 1 6.0 .03 0.9
> 2 0.2 38.1 100.3
> 2 0.3 25.81 95.2
> 2 1.0 22.3 66.2
> 2 2.0 21.5 54.2
> 2 3.0 18.3 49.5
> 2 4.0 6.2 31.1
> 2 5.0 6.1 23.3
> 2 6.0 3.1 22
> ;
> RUN;
>
> PROC PRINT DATA=ALLFILLED; RUN;
>
> DATA CORRELATE;
> SET ALLFILLED;
> IF SAM=3 AND SEROTYPE_1=. THEN CORR3_SER1=.;
> ELSE IF SAM=3 AND SEROTYPE_1>.35 THEN CORR3_SER1=1;
> ELSE IF SAM=3 AND SEROTYPE_1<.35 THEN CORR3_SER1=0;
> BY ID;
> RUN;
>
> PROC PRINT DATA=CORRELATE; RUN;
>

Katy,

Here is one way. Does this help you. Data has to be presorted by id and sam
as you have now. proc sort is shown here just in case your large data set is
not sorted.

proc sort data = ALLFILLED;
by id sam;
run;

data need;
do until(last.sam);
set allfilled;
by id sam;
if serotype_1 = . then corr_ser1 = .;
else if serotype_1 > .35 then corr_ser1 = 1;
else corr_ser1 = 0;
end;
run;
proc print data = need;
run;

From: Muthia Kachirayan on 8 Mar 2010 13:56

Does the sql solution help you? If you show the expected output for the data
set shown here it will replace several lines of explanation and there will
be several ways of doing it in SAS!

Muthia Kachirayan

On Mon, Mar 8, 2010 at 2:35 PM, Katy Seib <uralva(a)gmail.com> wrote:

> I tried this and again with a modification but still no dice. What you're
> doing answers my question for each sam (which in a another context is
> useful). However now I'm looking for just the timepoint sam=3, filled in
> for all sam. Later, when I get lab results back for colonization -
> regardless of when the colonization occurs, and later transpose the data (at
> the sam of colonization), I can see if titers at that critical time point
> are related (or not). I'm sure there are data management strategies that
> would probably render all this workaround unnecessary but that is way beyond
> the scope of this one problem. In fact, if anyone does know of a good data
> management resource - (publication/website/pdf/etc), I am interested in
> further developing that skill. Right now I have hundreds of files that I am
> managing (or trying to) for this one project. It has been an amazing
> experience and I am really grateful for this opportunity - I just hope I can
> live up to the expectations!
>
>
>> Katy,
>>
>> Here is one way. Does this help you. Data has to be presorted by id and
>> sam
>> as you have now. proc sort is shown here just in case your large data set
>> is
>> not sorted.
>>
>> proc sort data = ALLFILLED;
>> by id sam;
>> run;
>>
>> data need;
>> do until(last.sam);
>> set allfilled;
>> by id sam;
>> if serotype_1 = . then corr_ser1 = .;
>> else if serotype_1 > .35 then corr_ser1 = 1;
>> else corr_ser1 = 0;
>> end;
>> run;
>> proc print data = need;
>> run;
>>
>
>

From: Ya Huang on 8 Mar 2010 13:44

Is this what you want?

proc sql;
select *, max(case when SAM=3 AND SEROTYPE_1>.35 THEN 1
when SAM=3 AND SEROTYPE_1<.35 THEN 0
else . end) as CORR3_SER1
from ALLFILLED
group by id
order by id,sam
;

ID SAM SEROTYPE_1 SEROTYPE_9V CORR3_SER1
----------------------------------------------------
1 0.2 8.22 36.1 0
1 0.3 1.81 22.1 0
1 1 0.41 4.8 0
1 2 0.33 3.2 0
1 3 0.3 2.1 0
1 4 0.23 2 0
1 5 0.1 1.9 0
1 6 0.03 0.9 0
2 0.2 38.1 100.3 1
2 0.3 25.81 95.2 1
2 1 22.3 66.2 1
2 2 21.5 54.2 1
2 3 18.3 49.5 1
2 4 6.2 31.1 1
2 5 6.1 23.3 1
2 6 3.1 22 1

On Mon, 8 Mar 2010 12:14:41 -0500, Katy Seib <uralva(a)GMAIL.COM> wrote:

>**Intro
>First, thanks To Ron for posting the SAS-L wiki this morning- very helpful!
> I too am relatively new to posting at the listserv but have enjoyed
reading
>for a while. I'm a guest researcher at CDC and a grad student finishing up
>my thesis which is a longitudinal study - much of which is beyond the scope
>of my formal SAS education - I've had fun finding resources like this, UCLA
>and various SAS tips across the internet. I think my question today is
>relatively simple but a problem for me nonetheless.
>**Problem
>I have been calculating variables and have a LONG but effective algorithm
>for "filling in" variables that are constant across an observation. It's
>tedious and I still worry about errors. I know there must be a simpler way
>to fill in this information AS it's calculated. Below I have included an
>example and sample data. The SAM variable is a crude time variable and in
>reality there are 23 serotypes. The data step "correlate" gives me the
>yes/no answer for sam=3 only but I need it for all sam (the question being
>answered by calculating this variable is: "did this infant have a antibody
>titer greater than .35 at 10 weeks (sam=3)?". Here I've done it for
>Serotype_1 but will have to do it for all serotypes. I'm not pasting my
fill
>in code because it's long and clumsy requiring 6 data steps. This is not a
>first occurrence var, so I'm having trouble applying previous similar
>solutions. Thanks all.
>
>data ALLFILLED;
>input ID SAM SEROTYPE_1 SEROTYPE_9V;
>cards;
>1 0.2 8.22 36.1
>1 0.3 1.81 22.1
>1 1.0 .41 4.8
>1 2.0 .33 3.2
>1 3.0 .30 2.1
>1 4.0 .23 2.0
>1 5.0 .10 1.9
>1 6.0 .03 0.9
>2 0.2 38.1 100.3
>2 0.3 25.81 95.2
>2 1.0 22.3 66.2
>2 2.0 21.5 54.2
>2 3.0 18.3 49.5
>2 4.0 6.2 31.1
>2 5.0 6.1 23.3
>2 6.0 3.1 22
>;
>RUN;
>
>PROC PRINT DATA=ALLFILLED; RUN;
>
>DATA CORRELATE;
>SET ALLFILLED;
> IF SAM=3 AND SEROTYPE_1=. THEN CORR3_SER1=.;
>ELSE IF SAM=3 AND SEROTYPE_1>.35 THEN CORR3_SER1=1;
>ELSE IF SAM=3 AND SEROTYPE_1<.35 THEN CORR3_SER1=0;
>BY ID;
>RUN;
>
>PROC PRINT DATA=CORRELATE; RUN;

From: Katy Seib on 8 Mar 2010 13:35

I tried this and again with a modification but still no dice. What you're
doing answers my question for each sam (which in a another context is
useful). However now I'm looking for just the timepoint sam=3, filled in
for all sam. Later, when I get lab results back for colonization -
regardless of when the colonization occurs, and later transpose the data (at
the sam of colonization), I can see if titers at that critical time point
are related (or not). I'm sure there are data management strategies that
would probably render all this workaround unnecessary but that is way beyond
the scope of this one problem. In fact, if anyone does know of a good data
management resource - (publication/website/pdf/etc), I am interested in
further developing that skill. Right now I have hundreds of files that I am
managing (or trying to) for this one project. It has been an amazing
experience and I am really grateful for this opportunity - I just hope I can
live up to the expectations!

> Katy,
>
> Here is one way. Does this help you. Data has to be presorted by id and sam
> as you have now. proc sort is shown here just in case your large data set
> is
> not sorted.
>
> proc sort data = ALLFILLED;
> by id sam;
> run;
>
> data need;
> do until(last.sam);
> set allfilled;
> by id sam;
> if serotype_1 = . then corr_ser1 = .;
> else if serotype_1 > .35 then corr_ser1 = 1;
> else corr_ser1 = 0;
> end;
> run;
> proc print data = need;
> run;
>

From: Katy Seib on 8 Mar 2010 13:58

Very close, when I try to add a line to keep missing data =missing, I get
the same results (CORR3_SER1=0 INSTEAD OF .). Lot to learn about sql. I
added a third person below to simulate missing data (which there is plenty
of). I'm sure it's a syntax problem on my end.....

data ALLFILLED;
input ID SAM SEROTYPE_1 SEROTYPE_9V;
cards;
1 0.2 8.22 36.1
1 0.3 1.81 22.1
1 1.0 .41 4.8
1 2.0 .33 3.2
1 3.0 .30 2.1
1 4.0 .23 2.0
1 5.0 .10 1.9
1 6.0 .03 0.9
2 0.2 38.1 100.3
2 0.3 25.81 95.2
2 1.0 22.3 66.2
2 2.0 21.5 54.2
2 3.0 18.3 49.5
2 4.0 6.2 31.1
2 5.0 6.1 23.3
2 6.0 3.1 22
3 0.2 38.1 100.3
3 0.3 25.81 95.2
3 1.0 22.3 66.2
3 2.0 21.5 54.2
3 3.0 . .
3 4.0 6.2 31.1
3 5.0 6.1 23.3
3 6.0 3.1 22
;
RUN;
proc sql;
select *, max(case when SAM=3 AND SEROTYPE_1>.35 THEN 1
when SAM=3 AND SEROTYPE_1<.35 THEN 0
when sam=3 and serotype_1=. THEN .
else . end) as CORR3_SER1
from ALLFILLED
group by id
order by id,sam
;

On Mon, Mar 8, 2010 at 12:14 PM, Katy Seib <uralva(a)gmail.com> wrote:

> **Intro
> First, thanks To Ron for posting the SAS-L wiki this morning- very helpful!
> I too am relatively new to posting at the listserv but have enjoyed
> reading
> for a while. I'm a guest researcher at CDC and a grad student finishing up
> my thesis which is a longitudinal study - much of which is beyond the scope
> of my formal SAS education - I've had fun finding resources like this, UCLA
> and various SAS tips across the internet. I think my question today is
> relatively simple but a problem for me nonetheless.
> **Problem
> I have been calculating variables and have a LONG but effective algorithm
> for "filling in" variables that are constant across an observation. It's
> tedious and I still worry about errors. I know there must be a simpler way
> to fill in this information AS it's calculated. Below I have included an
> example and sample data. The SAM variable is a crude time variable and in
> reality there are 23 serotypes. The data step "correlate" gives me the
> yes/no answer for sam=3 only but I need it for all sam (the question being
> answered by calculating this variable is: "did this infant have a antibody
> titer greater than .35 at 10 weeks (sam=3)?". Here I've done it for
> Serotype_1 but will have to do it for all serotypes. I'm not pasting my
> fill
> in code because it's long and clumsy requiring 6 data steps. This is not a
> first occurrence var, so I'm having trouble applying previous similar
> solutions. Thanks all.
>
> data ALLFILLED;
> input ID SAM SEROTYPE_1 SEROTYPE_9V;
> cards;
> 1 0.2 8.22 36.1
> 1 0.3 1.81 22.1
> 1 1.0 .41 4.8
> 1 2.0 .33 3.2
> 1 3.0 .30 2.1
> 1 4.0 .23 2.0
> 1 5.0 .10 1.9
> 1 6.0 .03 0.9
> 2 0.2 38.1 100.3
> 2 0.3 25.81 95.2
> 2 1.0 22.3 66.2
> 2 2.0 21.5 54.2
> 2 3.0 18.3 49.5
> 2 4.0 6.2 31.1
> 2 5.0 6.1 23.3
> 2 6.0 3.1 22
> ;
> RUN;
>
> PROC PRINT DATA=ALLFILLED; RUN;
>
> DATA CORRELATE;
> SET ALLFILLED;
> IF SAM=3 AND SEROTYPE_1=. THEN CORR3_SER1=.;
> ELSE IF SAM=3 AND SEROTYPE_1>.35 THEN CORR3_SER1=1;
> ELSE IF SAM=3 AND SEROTYPE_1<.35 THEN CORR3_SER1=0;
> BY ID;
> RUN;
>
> PROC PRINT DATA=CORRELATE; RUN;
>

| Next | Last
Pages: 1 2
Prev: calculating variables across repeated ids
Next: Check if a Variable is Null or not