From: kangtsui on
On Jul 21, 2:22 pm, Paige Miller <paige.mil...(a)kodak.com> wrote:
> On Jul 21, 11:14 am, Paige Miller <paige.mil...(a)kodak.com> wrote:
>
>
>
> > On Jul 20, 10:02 pm, "kangt...(a)gmail.com" <kangt...(a)gmail.com> wrote:
>
> > > On Jul 20, 2:46 pm, Paige Miller <paige.mil...(a)kodak.com> wrote:
>
> > > > On Jul 20, 10:53 am, "kangt...(a)gmail.com" <kangt...(a)gmail.com> wrote:
>
> > > > > I was trying to do variable selection by GLMSELECT (LASSO or LAR).
> > > > > However, I have so many categorical IVs in my pool. The manual says
> > > > > that GLMSELECT would split the columns of those categorical IVs, but I
> > > > > think it would be an issue for me. I hope that columns for the same
> > > > > variable enter or exit the model together. Is there a way to get
> > > > > around this?
>
> > > > > Actually my real problem is to build a model with a continuous DV and
> > > > > a lot of continuous IVs. The reasons I don't want to run variable
> > > > > selection on the original variables are:
> > > > > 1. There are missing values here and there. Sometimes I could replace
> > > > > them with the mean, min, or max, but sometimes it does not make sense to
> > > > > fill the hole with any number.
> > > > > 2. Often the relation (I'm looking for) between the DV and
> > > > > IVs is not linear, or even monotonic.
>
> > > > It's hard to see how these two paragraphs go together. In paragraph 1,
> > > > you say that you have categorical IVs, but in paragraph 2, your real
> > > > problem has nothing to do with categorical IVs, your real problem is
> > > > missing values, and furthermore you want non-linear modeling on top of
> > > > that (which means you shouldn't be using PROC GLMSELECT).
>
> > > > So mark me down as confused. Perhaps you could explain further?
>
> > > > --
> > > > Paige Miller
> > > > paige\dot\miller \at\ kodak\dot\com
>
> > > Thanks for your response. Let me try to make it clearer.
> > > What I have is a continuous DV and a bunch of
> > > continuous IVs, which have missing values to deal with. My goal is to
> > > build an interpretable model on these variables, whether or not they're
> > > binned. There are two approaches I could think of. One is to
> > > fill in the missing values first for all IVs and run GLMSELECT (LASSO).
> > > The issue is that there's no perfect way to replace missing values for
> > > so many variables, and some useful variables might not have a linear
> > > effect on the DV. Then the next approach came to mind, which is
> > > to bin the variables first and run variable selection on the binned
> > > ones. It's simple to make missing values one category for each
> > > variable; however, GLMSELECT will split the categorical variables
> > > while doing selection. I hope all the columns of the same variable
> > > would enter or exit the model together. Grouped LASSO is not built
> > > into GLMSELECT, right?
> > > Sorry for the confusion, but I really wanted to give the whole story
> > > of what I was doing instead of asking one specific question. Thanks.
>
> > > Jun
>
> > I never think binning is a good idea with continuous variables.
>
> > This whole question boils down to: how best to deal with missing
> > values in a complicated modeling situation, which may be nonlinear,
> > but I just don't see PROC GLMSELECT as an option here.
>
> > I don't think SAS has great tools for what may be a non-linear
> > modeling situation, however there are tools for linear modeling. You
> > may want to look at PROC MI and PROC MIANALYZE. Also Partial Least
> > Squares modeling (PROC PLS) has the ability to "impute" missing values
> > based upon the EM algorithm, so that may be an option as well. As far
> > as I know, these procedures only handle linear modeling situations.
>
> > --
> > Paige Miller
> > paige\dot\miller \at\ kodak\dot\com
>
> Clarification: when I say "I don't think SAS has great tools for what
> may be a non-linear modeling situation, however there are tools for
> linear modeling" I am referring to handling missing values in
> non-linear modeling situations.
>
> --
> Paige Miller
> paige\dot\miller \at\ kodak\dot\com

Thanks for your help. Do you recommend PROC GAM for my case, if I
could handle missing values on my own? Is there a tool to do variable
selection for non-linear models? Thanks.

Jun
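
For readers following the binning idea in the quoted message above, here is a minimal sketch in plain Python (not SAS, and not how GLMSELECT works internally). It bins one continuous IV with missing values as their own category, expands the bins into indicator columns, and shows the group soft-thresholding step that a grouped LASSO relies on so that all columns of one variable enter or leave together. The function names, cut points, and data are hypothetical and chosen only for illustration.

```python
import math

def bin_with_missing(values, cut_points):
    """Map each value to a bin label; None (missing) gets its own bin."""
    labels = []
    for v in values:
        if v is None:
            labels.append("missing")
        else:
            # Count how many cut points v has reached to pick its bin.
            idx = sum(1 for c in cut_points if v >= c)
            labels.append(f"bin{idx}")
    return labels

def one_hot(labels):
    """Return (levels, rows) with one 0/1 indicator column per level."""
    levels = sorted(set(labels))
    rows = [[1 if lab == lev else 0 for lev in levels] for lab in labels]
    return levels, rows

def group_soft_threshold(coefs, lam):
    """Shrink one variable's whole group of coefficients together.

    This is the proximal operator of the penalty lam * ||coefs||_2:
    the group is either zeroed out entirely or scaled by a common
    factor, so its columns are kept or dropped as a unit.
    """
    norm = math.sqrt(sum(c * c for c in coefs))
    if norm <= lam:
        return [0.0 for _ in coefs]
    scale = 1.0 - lam / norm
    return [scale * c for c in coefs]

# Hypothetical IV with gaps (None marks a missing value).
x = [0.5, None, 2.5, 7.0, None, 4.0]
labels = bin_with_missing(x, cut_points=[1.0, 3.0, 5.0])
levels, design = one_hot(labels)

# A weak group is removed as a whole; a strong one is shrunk as a whole.
weak_group = group_soft_threshold([3.0, 4.0], lam=5.0)
strong_group = group_soft_threshold([3.0, 4.0], lam=2.5)
```

A full grouped LASSO would apply this thresholding step repeatedly inside an optimization loop; the sketch only shows why the columns of a group cannot be split.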
From: Paige Miller on
On Jul 21, 3:26 pm, "kangt...(a)gmail.com" <kangt...(a)gmail.com> wrote:
> Thanks for your help. Do you recommend PROC GAM for my case, if I
> could handle missing values on my own? Is there a tool to do variable
> selection for non-linear models? Thanks.
>
> Jun

I think I still wasn't clear.

SAS has good tools for linear and non-linear modeling.

SAS has good tools in the presence of missing values in linear models, using
PROC MI, PROC MIANALYZE or PROC PLS. SAS does not (as far as I know)
have good tools in the presence of missing values for non-linear modeling.

I don't see how PROC GAM handles non-linear models with continuous
variables. PROC NLIN is the procedure that will fit almost any
non-linear model you can devise; however, as far as I know, the only
missing-value handling for PROC NLIN is to remove from the fitting algorithm any
observations that have even one missing value in the IVs or DV.

--
Paige Miller
paige\dot\miller \at\ kodak\dot\com
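
The behaviour described above, dropping every observation that has even one missing value in the IVs or DV, is usually called complete-case (listwise) deletion. Here is a minimal sketch in plain Python of what that does to a data set. It illustrates the idea only, not PROC NLIN's implementation, and the function name and data are hypothetical.

```python
def complete_cases(rows):
    """Keep only the rows in which no field is missing (None)."""
    return [row for row in rows if all(v is not None for v in row)]

# Hypothetical observations as (iv1, iv2, dv); None marks a missing value.
data = [
    (1.0, 2.0, 3.0),
    (None, 1.0, 4.0),   # missing IV, so the whole row is dropped
    (2.0, 5.0, None),   # missing DV, so the whole row is dropped
    (3.0, 0.5, 1.5),
]

kept = complete_cases(data)
dropped = len(data) - len(kept)
```

With many IVs, even a small missing rate per variable can remove a large share of the rows, which is why the imputation question matters here.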
From: kangtsui on
On Jul 21, 4:00 pm, Paige Miller <paige.mil...(a)kodak.com> wrote:
> I think I still wasn't clear.
>
> SAS has good tools for linear and non-linear modeling.
>
> SAS has good tools in the presence of missing values in linear models, using
> PROC MI, PROC MIANALYZE or PROC PLS. SAS does not (as far as I know)
> have good tools in the presence of missing values for non-linear modeling.
>
> I don't see how PROC GAM handles non-linear models with continuous
> variables. PROC NLIN is the procedure that will fit almost any
> non-linear model you can devise; however, as far as I know, the only
> missing-value handling for PROC NLIN is to remove from the fitting algorithm any
> observations that have even one missing value in the IVs or DV.
>
> --
> Paige Miller
> paige\dot\miller \at\ kodak\dot\com

Sorry for confusing you. By mentioning GAM, I was thinking of applying
generalized additive models to my case. The model is additive but with
possibly non-linear forms of the IVs. I never thought of running models
with non-linear forms. I could have link functions on my DV, but the
form of the model, the right side of the equation in other words, should
be as simple as linear: additive in components of the IVs after
transformation, either polynomial or spline or some other form.
Thanks.

Jun
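
To make the kind of model described above concrete, here is a minimal sketch in plain Python of an additive model that is linear in its coefficients but non-linear in the IVs. It uses a quadratic polynomial in each IV as a simple stand-in for a spline basis and fits by ordinary least squares through the normal equations. This is not PROC GAM; the function names and the synthetic data are assumptions made for illustration.

```python
def design_row(x1, x2):
    """Additive basis: intercept plus a quadratic in each IV, no cross terms."""
    return [1.0, x1, x1 * x1, x2, x2 * x2]

def solve(a, b):
    """Solve the linear system a * beta = b by Gaussian elimination
    with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    beta = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * beta[c] for c in range(r + 1, n))
        beta[r] = (m[r][n] - s) / m[r][r]
    return beta

def fit_additive(xs1, xs2, ys):
    """Least-squares fit via the normal equations X'X beta = X'y."""
    X = [design_row(a, b) for a, b in zip(xs1, xs2)]
    p = len(X[0])
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * y for r, y in zip(X, ys)) for i in range(p)]
    return solve(xtx, xty)

# Synthetic, noise-free data from y = 1 + 2*x1 - x1^2 + 0.5*x2^2.
xs1 = [-2.0, -1.0, 0.0, 1.0, 2.0, 3.0, -3.0, 0.5]
xs2 = [1.0, -2.0, 3.0, 0.0, -1.0, 2.0, 0.5, -3.0]
ys = [1.0 + 2.0 * a - a * a + 0.5 * b * b for a, b in zip(xs1, xs2)]

# Coefficients in the order: intercept, x1, x1^2, x2, x2^2.
beta = fit_additive(xs1, xs2, ys)
```

Because the transformed terms of each IV form a natural group, this is the same setting in which grouped variable selection comes up.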
From: Paige Miller on
On Jul 22, 2:00 pm, "kangt...(a)gmail.com" <kangt...(a)gmail.com> wrote:
> Sorry for confusing you. By mentioning GAM, I was thinking of applying
> generalized additive models to my case. The model is additive but with
> possibly non-linear forms of the IVs. I never thought of running models
> with non-linear forms. I could have link functions on my DV, but the
> form of the model, the right side of the equation in other words, should
> be as simple as linear: additive in components of the IVs after
> transformation, either polynomial or spline or some other form.
> Thanks.
>
> Jun

If that's what you are considering, transformation of the IVs, then any
of the SAS modeling procedures might work. Again, I do think you
should investigate PROC PLS, as not only does it "impute" missing
values, as I mentioned, but the algorithm works well when you have
many correlated IVs.

In general, you would choose the modeling algorithm independently of
how you handle missing values. One doesn't determine the other. But as
a practical statement, you have only a limited number of software
options for handling the missings, and many options for modeling,
which is why PROC PLS looks good to me.

--
Paige Miller
paige\dot\miller \at\ kodak\dot\com