From: kangtsui on
I've searched this forum for an answer to my question, but it seems
not to have been discussed.

I was trying to do variable selection with GLMSELECT (LASSO or LAR).
However, I have many categorical IVs in my pool. The manual says
that GLMSELECT splits the columns of those categorical IVs, which I
think would be an issue for me. I would like the columns for the same
variable to enter or exit the model together. Is there a way to get
around this?

Actually, my real problem is to build a model with a continuous DV and
a lot of continuous IVs. The reasons I don't want to run variable
selection on the original variables are:
1. There are missing values here and there. Sometimes I could replace
them with the mean, min, or max, but sometimes it does not make sense
to fill the hole with any number.
2. Often the relation (I'm looking for) between the DV and the IVs is
not linear, or even monotonic.

Therefore, I was thinking of applying some algorithm to bin all the IVs
(based on the size of each bin and also the relation with the DV) and
keeping missing values as one category, which makes perfect sense to
me. Then I ran into the problem of how to select the categorical
variables. I hate to use forward/backward/stepwise approaches since
they usually overfit a lot.
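
As a rough illustration of the binning idea described above, here is a
minimal Python sketch (standard library only). It is not the poster's
actual algorithm: it assumes simple equal-frequency bins and ignores
the relation with the DV. The function names, the "BINk"/"MISSING"
labels, and the sample data are all hypothetical.

```python
import math

def equal_frequency_bins(values, n_bins=4):
    """Assign each value a bin label; None/NaN go to a separate 'MISSING' bin."""
    def is_missing(v):
        return v is None or (isinstance(v, float) and math.isnan(v))

    observed = sorted(v for v in values if not is_missing(v))
    # Cut points at the interior quantiles of the observed values.
    cuts = ([observed[(len(observed) * k) // n_bins] for k in range(1, n_bins)]
            if observed else [])
    labels = []
    for v in values:
        if is_missing(v):
            labels.append("MISSING")
        else:
            # Bin index = number of cut points that are <= v.
            labels.append("BIN%d" % sum(v >= c for c in cuts))
    return labels

def dummy_columns(labels):
    """One 0/1 column per level; all columns belong to the same variable."""
    levels = sorted(set(labels))
    return {lev: [1 if lab == lev else 0 for lab in labels] for lev in levels}

# Hypothetical data: one continuous IV with two missing values.
x = [3.0, None, 1.0, 7.0, float("nan"), 5.0, 2.0, 8.0, 6.0, 4.0]
labels = equal_frequency_bins(x, n_bins=4)
columns = dummy_columns(labels)
```

Each original variable becomes a block of 0/1 columns, and it is this
block that one would want a selection method to treat as a single unit.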

Does anyone have an idea? Many thanks.

Jun
From: Paige Miller on
On Jul 20, 10:53 am, "kangt...(a)gmail.com" <kangt...(a)gmail.com> wrote:

> I was trying to do variable selection by GLMSELECT (LASSO or LAR).
> However, I have so many categorical IVs in my pool. The manual says
> that GLMSELECT would split the columns of those categorical IVs, but I
> think it would be an issue for me. I hope that columns for the same
> variable enter or exit the model together. Is there a way to get
> around this?
>
> Actually my real problem is to build a model with a continuous DV and
> a lot of continuous IVs. The reason I don't want to run variable
> selection on the original variables are
> 1. there are missing values here and there. sometimes I could replace
> it with mean, min, or max, but sometimes it does not make sense to
> fill the hole with any number
> 2. many times that the relation (I'm looking for) between the DV and
> IVs are not linear, or even monotonic.

It's hard to see how these two paragraphs go together. In paragraph 1,
you say that you have categorical IVs, but in paragraph 2, your real
problem has nothing to do with categorical IVs. Your real problem is
missing values, and furthermore you want non-linear modeling on top of
that (which means you shouldn't be using PROC GLMSELECT).

So mark me down as confused. Perhaps you could explain further?

--
Paige Miller
paige\dot\miller \at\ kodak\dot\com
From: kangtsui on
On Jul 20, 2:46 pm, Paige Miller <paige.mil...(a)kodak.com> wrote:
> It's hard to see how these two paragraphs go together. In paragraph 1,
> you say that you have categorical IVs, but in paragraph 2, your real
> problem has nothing to do with categorical IVs, your real problem is
> missing values, and furthermore you want non-linear modeling on top of
> that (which means you shouldn't be using PROC GLMSELECT).
>
> So mark me down as confused. Perhaps you could explain further?
>
> --
> Paige Miller
> paige\dot\miller \at\ kodak\dot\com

Thanks for your response. Let me try to make it clearer.
What I have is a continuous DV and a bunch of continuous IVs, which
have missing values to deal with. My goal is to build an interpretable
model on these variables, whether they're binned or not. There are two
approaches I could think of. One is to fill in the missing values
first for all IVs and run GLMSELECT (LASSO). The issue is that there's
no perfect way to replace missing values for so many variables, and
some useful variables might not have a linear effect on the DV. Then
the next approach came to mind, which is to bin the variables first
and run variable selection on the binned ones. It's simple to make
missing values one category for each variable; however, GLMSELECT will
split the categorical variables while doing selection. I would like
all the columns of the same variable to enter or exit the model
together. Grouped LASSO is not built into GLMSELECT, right?
Sorry for the confusion, but I really wanted to give the whole story
of what I was doing instead of asking one specific question. Thanks.
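
For readers unfamiliar with the grouped LASSO mentioned above: its
penalty acts on the Euclidean norm of each variable's whole block of
coefficients, so a block is either shrunk as a unit or set to zero as
a unit. The Python sketch below shows only that block shrinkage step
(block soft-thresholding), not a full fitting routine, and it leaves
out the group-size weights that are often used. It is not a GLMSELECT
feature; the column names, coefficients, and penalty value are
hypothetical.

```python
import math

def group_soft_threshold(coefs, groups, penalty):
    """Shrink each group's coefficient block toward zero by `penalty` in norm.

    coefs:  dict mapping column name -> coefficient
    groups: dict mapping variable name -> list of its column names
    """
    shrunk = {}
    for variable, cols in groups.items():
        norm = math.sqrt(sum(coefs[c] ** 2 for c in cols))
        # The whole block is zeroed when its norm does not exceed the penalty.
        scale = max(0.0, 1.0 - penalty / norm) if norm > 0 else 0.0
        for c in cols:
            shrunk[c] = scale * coefs[c]
    return shrunk

# Hypothetical coefficients for two binned variables.
coefs = {"age_BIN0": 0.9, "age_BIN1": -0.4, "age_MISSING": 0.2,
         "income_BIN0": 0.05, "income_BIN1": -0.03, "income_MISSING": 0.02}
groups = {"age": ["age_BIN0", "age_BIN1", "age_MISSING"],
          "income": ["income_BIN0", "income_BIN1", "income_MISSING"]}
result = group_soft_threshold(coefs, groups, penalty=0.5)
```

With these made-up numbers, all three "income" columns drop to zero
together while all three "age" columns stay in, each shrunk by the
same factor.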

Jun
From: Paige Miller on
On Jul 20, 10:02 pm, "kangt...(a)gmail.com" <kangt...(a)gmail.com> wrote:
> Thanks for your response. Let me try to make it more clear.
> What I have for the problem is a continuous DV and a bunch of
> continuous IVs, which have missing values to deal with. My goal is to
> build an interpretable model on these variables, no matter they're
> binned or not. There're two approaches I could think of. One is to
> filling the missing values first for all IVs and run GLMSELECT(LASSO).
> The issue is that there's no perfect way to replace missing values for
> such many variables and some useful variable might not have a linear
> effect to the DV. Then the next approach came into my mind, which is
> to bin the variables first and run variable selection on the binned
> ones. It's simple to make missing values as one category for each
> variable, however, GLMSELECT will split the categorical variables
> while doing selection. I hope all the columns of the same variable
> would enter or exit the model together. Grouped LASSO is not built
> into GLMSELECT right?
> Sorry for the confusing, but I really wanted to give the whole story
> of what I was doing instead of asking one specific question. Thanks.
>
> Jun

I never think binning is a good idea with continuous variables.

This whole question boils down to: how best to deal with missing
values in a complicated modeling situation, which may be nonlinear,
but I just don't see PROC GLMSELECT as an option here.

I don't think SAS has great tools for what may be a non-linear
modeling situation; however, there are tools for linear modeling. You
may want to look at PROC MI and PROC MIANALYZE. Also, Partial Least
Squares modeling (PROC PLS) has the ability to "impute" missing values
based upon the EM algorithm, so that may be an option as well. As far
as I know, these procedures only handle linear modeling situations.
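
For context on the multiple-imputation route: the model is fitted once
per imputed data set, and the per-imputation results are then combined
using Rubin's rules, which is the step PROC MIANALYZE performs. The
Python sketch below shows that combining step for a single
coefficient. It is not SAS code, and the function name and numbers are
made up for illustration.

```python
from statistics import mean, variance

def pool_rubin(estimates, variances):
    """Combine one coefficient's results from m imputed data sets (Rubin's rules)."""
    m = len(estimates)
    pooled = mean(estimates)        # combined point estimate
    within = mean(variances)        # average within-imputation variance
    between = variance(estimates)   # sample variance across imputations
    total = within + (1 + 1 / m) * between
    return pooled, total

# Hypothetical results from m = 5 imputations.
est = [1.0, 1.2, 0.8, 1.1, 0.9]
var = [0.04, 0.05, 0.04, 0.05, 0.04]
pooled, total = pool_rubin(est, var)
```

The total variance is larger than the average within-imputation
variance because it also carries the uncertainty due to the missing
values.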

--
Paige Miller
paige\dot\miller \at\ kodak\dot\com
From: Paige Miller on
On Jul 21, 11:14 am, Paige Miller <paige.mil...(a)kodak.com> wrote:
> I never think binning is a good idea with continuous variables.
>
> This whole question boils down to: how best to deal with missing
> values in a complicated modeling situation, which may be nonlinear,
> but I just don't see PROC GLMSELECT as an option here.
>
> I don't think SAS has great tools for what may be a non-linear
> modeling situation, however there are tools for linear modeling. You
> may want to look at PROC MI and PROC MIANALYZE. Also Partial Least
> Squares modeling (PROC PLS) has the ability to "impute" missing values
> based upon the EM algorithm, so that may be an option as well. As far
> as I know, these procedures only handle linear modeling situations.
>
> --
> Paige Miller
> paige\dot\miller \at\ kodak\dot\com

Clarification: when I say "I don't think SAS has great tools for what
may be a non-linear modeling situation, however there are tools for
linear modeling" I am referring to handling missing values in
non-linear modeling situations.

--
Paige Miller
paige\dot\miller \at\ kodak\dot\com