Difficulty with large dataset and "normalizing" variables Proc Univariate & Proc Means [SAS]

Prev: data step over proc sql
Next: Scorecard : Probabailities

From: kvasikonkav on 2 Jul 2010 16:39

Hi-

First, sorry for the long post. I don't want to annoy any of you,
especially since I am in desperate need of your help. I've noticed
that a lot of questions get answered effectively only after the OP
states what the "big picture" is, so I am going to try to do that.

I have an unbalanced (Different start and end dates, as well as left
and right censored) panel (Company x Year) dataset.

I am trying to test whether one specific variable, (call it X1) is a
significant predictor of
(1) the probability that an event, (call it X2 where X2=1,0) occurs
in a specific year (logistic or multinomial logistic)
(2) how long a company "lives" before the event occurs (PHREG or
competing hazards model)
(3) the characterstics of the company after the event occurs, (call
it a vector X3)

I am trying to "control for the usual suspects" (call it a vector X4)
in my regressions. Regardless of the wisdom of doing so, I am
following results from other published papers as my baseline model.

In addition to the X1 and X4, I would like to test the significance of
several other variables of interest.

I have a set of characteristics or levels (call it vector X5) for each
(company, year) observation. I woudl like to inclede these in my
regression (I actually know how to do this all on my own!)

I would like to create a a set of variables or differences (call it
vector X6) that measure how far from"THE MEAN" the company is
immediately before the event for each of the characteristics in X5.

"THE MEAN" is a bit hard. I would like to group the data by some
elements of X5 that make them "similar" (e.g. industry, year, size,
etc or combination of them, such as INDUSTRYxYEAR groupings) and take
the "MEAN" of the group. For "MEAN" itself, I think that a plain old
mean is not a great idea, so maybe a windsorized mean or some other
measure of central tendency. For robustness, I would like to use
several different measures of central tendency.

Once I have a decent set of different definitions of "the
mean" (generated by different groupings of the data), I am unsure if a
plain difference, or ratio, would be more informative in the
regression. Let's say I'm talking about copany j, and I have a group,
call it J, to which j belongs. Which ie better? Difference = X5(j) -
MEAN(X5|J). Ratio = X5(j) / MEAN(X5|J).

I would also like to see if "how the company got here" matters in the
regressions. for each company, the X5 vector is serially correlated
through time. Lagged X5 is insanely colinear with X5, and power is all
gone, and estimated variance is inflated. So, I am faced with a
similar problem as above - I can either do a difference = X5-
Lagged(X5), or a ratio = X5/Lagged(X5)

So, in summary, I have a hypothesis that most companies are in or
close to equilibrium most of the time. If a company deviates ( due to
some unobservable shock) from the average for at least some elements
of X5, the firm is more likely to have the defined event, and and the
event will move the firm back toward equilibrium. Any comments on how
I plan on testing the hypothesis, as described above, would REALLY be
appreciated.

Now, onto my specific problems:

In order to get a few "Group Mean" estimates, I have been running Proc
Univariate. Somehow, it seems insanely slow. I was able to speed it
up a bit by switching to "By" instead of "Class", but My code is
pasted below. Any suggestion on how to improve speed and/or be more
clever about finding measures of central tendency?

/*CODE*/

ods select WinsorizedMeans;
ods output WINSORIZEDmeans=pooledest;
proc univariate data=b01
trimmed=1 .02
winsorized=.02
robustscale
/*noprint*/;
Var
csGP csOIBDP csOIADP csPI csIB csNI csCHE
csACT csPPENT csINTAN csAT csDLC csLCT csDLTT csLT
csCSTK csCAPS csRE csTSTK csCEQ csPSTK csSEQ csLSE
csOANCF csCAPX csAQC csIVNCF csPRSTKC csSSTK csDV
csDLTIS csDLTR csDLCCH csFINCF csCHECH
SaleToAsset NWC SaleToNWC NetCapx ROA ROE DivRatio
MVEquity BtoM MLev IntCov
;
run;
data comp.pooledest; set pooledest; run;
quit;

/*
---------------------------------------------------------------------------
---------------------------------------------------------------------------
This section takes the winsorized mean of several variables and stores
them.
This data is partitioned by a single variable:
YEAR, FF12, SIZE, BEME
---------------------------------------------------------------------------
---------------------------------------------------------------------------
*/

proc sort data=b01; by year; run;
ods select WinsorizedMeans;
ods output WINSORIZEDmeans=Year;
proc univariate data=b01
trimmed=1 .02
winsorized=.02
robustscale
/*noprint*/;
by year /*class year*/;
Var
csGP csOIBDP csOIADP csPI csIB csNI csCHE
csACT csPPENT csINTAN csAT csDLC csLCT csDLTT csLT
csCSTK csCAPS csRE csTSTK csCEQ csPSTK csSEQ csLSE
csOANCF csCAPX csAQC csIVNCF csPRSTKC csSSTK csDV
csDLTIS csDLTR csDLCCH csFINCF csCHECH
SaleToAsset NWC SaleToNWC NetCapx ROA ROE DivRatio
MVEquity BtoM MLev IntCov
;
run;
data comp.year; set year; run;
quit;

/*
---------------------------------------------------------------------------
---------------------------------------------------------------------------
This section takes the winsorized mean of several variables and stores
them.
This data is partitioned by a pair of variable:
YEARxFF12, YEARxSIZE, YEARxBEME, FF12xSIZE, FF12xBEME, SIZExBEME
---------------------------------------------------------------------------
---------------------------------------------------------------------------
*/

proc sort data=b01;
by year ff12 ;
run;
ods select WinsorizedMeans;
ods output WINSORIZEDmeans=YRxFF;
proc univariate data=b01
trimmed=1 .02
winsorized=.02
robustscale
/*noprint*/;
by year ff12 ;
Var
csGP csOIBDP csOIADP csPI csIB csNI csCHE
csACT csPPENT csINTAN csAT csDLC csLCT csDLTT csLT
csCSTK csCAPS csRE csTSTK csCEQ csPSTK csSEQ csLSE
csOANCF csCAPX csAQC csIVNCF csPRSTKC csSSTK csDV
csDLTIS csDLTR csDLCCH csFINCF csCHECH
SaleToAsset NWC SaleToNWC NetCapx ROA ROE DivRatio
MVEquity BtoM MLev IntCov
;
run;
data comp.YRxFF; set YRxFF; run;

/*
---------------------------------------------------------------------------
---------------------------------------------------------------------------
This section takes the winsorized mean of several variables and stores
them.
This data is partitioned by a triplet of variable:
YEARxFF12xSIZE (no BEME), YEARxFF12xBEME (no SIZE),
YEARxSIZExBEME (no FF12), FF12xSIZExBEME (no YEAR)
---------------------------------------------------------------------------
---------------------------------------------------------------------------
*/

proc sort data=b01;
by year ff12 size ;
run;
ods select WinsorizedMeans;
ods output WINSORIZEDmeans=threenoBM;
proc univariate data=b01
trimmed=1 .02
winsorized=.02
robustscale
/*noprint*/;
by year ff12 size ;
Var
csGP csOIBDP csOIADP csPI csIB csNI csCHE
csACT csPPENT csINTAN csAT csDLC csLCT csDLTT csLT
csCSTK csCAPS csRE csTSTK csCEQ csPSTK csSEQ csLSE
csOANCF csCAPX csAQC csIVNCF csPRSTKC csSSTK csDV
csDLTIS csDLTR csDLCCH csFINCF csCHECH
SaleToAsset NWC SaleToNWC NetCapx ROA ROE DivRatio
MVEquity BtoM MLev IntCov
;
run;
data comp.threenoBM; set threenoBM; run;

---------------------------------------------------------------------------
---------------------------------------------------------------------------
This section takes the winsorized mean of several variables and stores
them.
This data is partitioned by ALL variables:
YEAR, FF12, SIZE, BEME
---------------------------------------------------------------------------
---------------------------------------------------------------------------
*/

proc sort data=b01;
by year ff12 size beme;
run;
ods select WinsorizedMeans;
ods output WINSORIZEDmeans=fourway;
proc univariate data=b01
trimmed=1 .02
winsorized=.02
robustscale
/*noprint*/;
by year ff12 size beme;
Var
csGP csOIBDP csOIADP csPI csIB csNI csCHE
csACT csPPENT csINTAN csAT csDLC csLCT csDLTT csLT
csCSTK csCAPS csRE csTSTK csCEQ csPSTK csSEQ csLSE
csOANCF csCAPX csAQC csIVNCF csPRSTKC csSSTK csDV
csDLTIS csDLTR csDLCCH csFINCF csCHECH
SaleToAsset NWC SaleToNWC NetCapx ROA ROE DivRatio
MVEquity BtoM MLev IntCov
;
run;
data comp.fourway; set fourway; run;

proc sort data= year; by year; run;
proc transpose data=year out=comp.year1; by year; id varname; var mean
stdmean; run;
PROC EXPORT DATA= WORK.year1
OUTFILE= "R:\Dissertation\ChartsAndTables\FinancialStatements.XLS"
DBMS=EXCEL REPLACE; SHEET="Year";
RUN;

From: Tom Abernathy on 2 Jul 2010 20:01

I don't know anything about Windsorized means, but you can eliminate
all but the last PROC SORT.
If a dataset is sorted by YEAR FF12 SIZE BEME then it is sorted by
YEAR and by YEAR FF12 ...

On Jul 2, 4:39 pm, kvasikonkav <kvasikon...(a)gmail.com> wrote:
> Hi-
>
> First, sorry for the long post. I don't want to annoy any of you,
> especially since I am in desperate need of your help. I've noticed
> that a lot of questions get answered effectively only after the OP
> states what the "big picture" is, so I am going to try to do that.
>
> I have an unbalanced (Different start and end dates, as well as left
> and right censored) panel (Company x Year) dataset.
>
> I am trying to test whether one specific variable, (call it X1) is a
> significant predictor of
> (1) the probability that an event, (call it X2 where X2=1,0) occurs
> in a specific year (logistic or multinomial logistic)
> (2) how long a company "lives" before the event occurs (PHREG or
> competing hazards model)
> (3) the characterstics of the company after the event occurs, (call
> it a vector X3)
>
> I am trying to "control for the usual suspects" (call it a vector X4)
> in my regressions. Regardless of the wisdom of doing so, I am
> following results from other published papers as my baseline model.
>
> In addition to the X1 and X4, I would like to test the significance of
> several other variables of interest.
>
> I have a set of characteristics or levels (call it vector X5) for each
> (company, year) observation. I woudl like to inclede these in my
> regression (I actually know how to do this all on my own!)
>
> I would like to create a a set of variables or differences (call it
> vector X6) that measure how far from"THE MEAN" the company is
> immediately before the event for each of the characteristics in X5.
>
> "THE MEAN" is a bit hard. I would like to group the data by some
> elements of X5 that make them "similar" (e.g. industry, year, size,
> etc or combination of them, such as INDUSTRYxYEAR groupings) and take
> the "MEAN" of the group. For "MEAN" itself, I think that a plain old
> mean is not a great idea, so maybe a windsorized mean or some other
> measure of central tendency. For robustness, I would like to use
> several different measures of central tendency.
>
> Once I have a decent set of different definitions of "the
> mean" (generated by different groupings of the data), I am unsure if a
> plain difference, or ratio, would be more informative in the
> regression. Let's say I'm talking about copany j, and I have a group,
> call it J, to which j belongs. Which ie better? Difference = X5(j) -
> MEAN(X5|J). Ratio = X5(j) / MEAN(X5|J).
>
> I would also like to see if "how the company got here" matters in the
> regressions. for each company, the X5 vector is serially correlated
> through time. Lagged X5 is insanely colinear with X5, and power is all
> gone, and estimated variance is inflated. So, I am faced with a
> similar problem as above - I can either do a difference = X5-
> Lagged(X5), or a ratio = X5/Lagged(X5)
>
> So, in summary, I have a hypothesis that most companies are in or
> close to equilibrium most of the time. If a company deviates ( due to
> some unobservable shock) from the average for at least some elements
> of X5, the firm is more likely to have the defined event, and and the
> event will move the firm back toward equilibrium. Any comments on how
> I plan on testing the hypothesis, as described above, would REALLY be
> appreciated.
>
> Now, onto my specific problems:
>
> In order to get a few "Group Mean" estimates, I have been running Proc
> Univariate. Somehow, it seems insanely slow. I was able to speed it
> up a bit by switching to "By" instead of "Class", but My code is
> pasted below. Any suggestion on how to improve speed and/or be more
> clever about finding measures of central tendency?
>
> /*CODE*/
>
> ods select WinsorizedMeans;
> ods output WINSORIZEDmeans=pooledest;
> proc univariate data=b01
> trimmed=1 .02
> winsorized=.02
> robustscale
> /*noprint*/;
> Var
> csGP csOIBDP csOIADP csPI csIB csNI csCHE
> csACT csPPENT csINTAN csAT csDLC csLCT csDLTT csLT
> csCSTK csCAPS csRE csTSTK csCEQ csPSTK csSEQ csLSE
> csOANCF csCAPX csAQC csIVNCF csPRSTKC csSSTK csDV
> csDLTIS csDLTR csDLCCH csFINCF csCHECH
> SaleToAsset NWC SaleToNWC NetCapx ROA ROE DivRatio
> MVEquity BtoM MLev IntCov
> ;
> run;
> data comp.pooledest; set pooledest; run;
> quit;
>
> /*
> ---------------------------------------------------------------------------
> ---------------------------------------------------------------------------
> This section takes the winsorized mean of several variables and stores
> them.
> This data is partitioned by a single variable:
> YEAR, FF12, SIZE, BEME
> ---------------------------------------------------------------------------
> ---------------------------------------------------------------------------
> */
>
> proc sort data=b01; by year; run;
> ods select WinsorizedMeans;
> ods output WINSORIZEDmeans=Year;
> proc univariate data=b01
> trimmed=1 .02
> winsorized=.02
> robustscale
> /*noprint*/;
> by year /*class year*/;
> Var
> csGP csOIBDP csOIADP csPI csIB csNI csCHE
> csACT csPPENT csINTAN csAT csDLC csLCT csDLTT csLT
> csCSTK csCAPS csRE csTSTK csCEQ csPSTK csSEQ csLSE
> csOANCF csCAPX csAQC csIVNCF csPRSTKC csSSTK csDV
> csDLTIS csDLTR csDLCCH csFINCF csCHECH
> SaleToAsset NWC SaleToNWC NetCapx ROA ROE DivRatio
> MVEquity BtoM MLev IntCov
> ;
> run;
> data comp.year; set year; run;
> quit;
>
> /*
> ---------------------------------------------------------------------------
> ---------------------------------------------------------------------------
> This section takes the winsorized mean of several variables and stores
> them.
> This data is partitioned by a pair of variable:
> YEARxFF12, YEARxSIZE, YEARxBEME, FF12xSIZE, FF12xBEME, SIZExBEME
> ---------------------------------------------------------------------------
> ---------------------------------------------------------------------------
> */
>
> proc sort data=b01;
> by year ff12 ;
> run;
> ods select WinsorizedMeans;
> ods output WINSORIZEDmeans=YRxFF;
> proc univariate data=b01
> trimmed=1 .02
> winsorized=.02
> robustscale
> /*noprint*/;
> by year ff12 ;
> Var
> csGP csOIBDP csOIADP csPI csIB csNI csCHE
> csACT csPPENT csINTAN csAT csDLC csLCT csDLTT csLT
> csCSTK csCAPS csRE csTSTK csCEQ csPSTK csSEQ csLSE
> csOANCF csCAPX csAQC csIVNCF csPRSTKC csSSTK csDV
> csDLTIS csDLTR csDLCCH csFINCF csCHECH
> SaleToAsset NWC SaleToNWC NetCapx ROA ROE DivRatio
> MVEquity BtoM MLev IntCov
> ;
> run;
> data comp.YRxFF; set YRxFF; run;
>
> /*
> ---------------------------------------------------------------------------
> ---------------------------------------------------------------------------
> This section takes the winsorized mean of several variables and stores
> them.
> This data is partitioned by a triplet of variable:
> YEARxFF12xSIZE (no BEME), YEARxFF12xBEME (no SIZE),
> YEARxSIZExBEME (no FF12), FF12xSIZExBEME (no YEAR)
> ---------------------------------------------------------------------------
> ---------------------------------------------------------------------------
> */
>
> proc sort data=b01;
> by year ff12 size ;
> run;
> ods select WinsorizedMeans;
> ods output WINSORIZEDmeans=threenoBM;
> proc univariate data=b01
> trimmed=1 .02
> winsorized=.02
> robustscale
> /*noprint*/;
> by year ff12 size ;
> Var
> csGP csOIBDP csOIADP csPI csIB csNI csCHE
> csACT csPPENT csINTAN csAT csDLC csLCT csDLTT csLT
> csCSTK csCAPS csRE csTSTK csCEQ csPSTK csSEQ csLSE
> csOANCF csCAPX csAQC csIVNCF csPRSTKC csSSTK csDV
> csDLTIS csDLTR csDLCCH csFINCF csCHECH
> SaleToAsset NWC SaleToNWC NetCapx ROA ROE DivRatio
> MVEquity BtoM MLev IntCov
> ;
> run;
> data comp.threenoBM; set threenoBM; run;
>
> ---------------------------------------------------------------------------
> ---------------------------------------------------------------------------
> This section takes the winsorized mean of several variables and stores
> them.
> This data is partitioned by ALL variables:
> YEAR, FF12, SIZE, BEME
> ---------------------------------------------------------------------------
> ---------------------------------------------------------------------------
> */
>
> proc sort data=b01;
> by year ff12 size beme;
> run;
> ods select WinsorizedMeans;
> ods output WINSORIZEDmeans=fourway;
> proc univariate data=b01
> trimmed=1 .02
> winsorized=.02
> robustscale
> /*noprint*/;
> by year ff12 size beme;
> Var
> csGP csOIBDP csOIADP csPI csIB csNI csCHE
> csACT csPPENT csINTAN csAT csDLC csLCT csDLTT csLT
> csCSTK csCAPS csRE csTSTK csCEQ csPSTK csSEQ csLSE
> csOANCF csCAPX csAQC csIVNCF csPRSTKC csSSTK csDV
> csDLTIS csDLTR csDLCCH csFINCF csCHECH
> SaleToAsset NWC SaleToNWC NetCapx ROA ROE DivRatio
> MVEquity BtoM MLev IntCov
> ;
> run;
> data comp.fourway; set fourway; run;
>
> proc sort data= year; by year; run;
> proc transpose data=year out=comp.year1; by year; id varname; var mean
> stdmean; run;
> PROC EXPORT DATA= WORK.year1
> OUTFILE= "R:\Dissertation\ChartsAndTables\FinancialStatements.XLS"
> DBMS=EXCEL REPLACE; SHEET="Year";
> RUN;

From: kvasikonkav on 2 Jul 2010 22:49

Tom-

Thanks!

I'm really tired, so this sort of obvious redundancy is beyond my pea-
brain.

This will help.

On Jul 2, 7:01 pm, Tom Abernathy <tom.aberna...(a)gmail.com> wrote:
> I don't know anything about Windsorized means, but you can eliminate
> all but the last PROC SORT.
> If a dataset is sorted by YEAR FF12 SIZE BEME then it is sorted by
> YEAR and by YEAR FF12 ...
>

|
Pages: 1
Prev: data step over proc sql
Next: Scorecard : Probabailities