From: amw5gster on
Howdy,

Silly question that's likely to show I'm overlooking something simple,
but I'm stumped. I have a dset of approx 8M observations and I'm
trying to grow an EMiner decision tree on a binary target variable.
There are about 20 independent variables, mostly interval (dates), but
some nominal, a few binary and one ordinal. The proportion of true
events is about 12%. I have not set any prior probabilities, nor
profit/cost values.

The tree runs, but returns no splits. It just won't grow. I've tried
dropping the signif value to .00001, raising the maximum branches to as
many as 11 and setting my max depth to 10. I also tried having the tree
build on as few as 2 IVs.

I was able to build a tree when I took a sample of 100K records and
forced the %age of true events in the sample to be 50%. Naturally I
don't want to misrepresent the proportion, and I figured that 12%
wasn't terribly rare for a d-tree.

Am I outright doing something wrong or is this expected behavior?

From: Sigurd Hermansen on
Dropping the significance value may have the opposite of the effect that
you expect. A smaller value is a stricter test: generally it takes a
'purer' separation of true events from others to attain a 1% as opposed
to a 5% Type 1 error 'significance' level. A small proportion of true
events in the data makes it even harder (due to 'long-tail'
distributions of errors).
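
To make the direction of the threshold concrete, here is a minimal Python sketch (not Enterprise Miner code). It assumes a plain Pearson chi-square test on a single two-branch split; the counts, the helper names chi2_2x2 and p_value_1df, and the list of thresholds are all made up for illustration. The same split passes at 5% but is rejected at 1% and at .00001.

```python
import math

def chi2_2x2(a, b, c, d):
    """Pearson chi-square statistic for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

def p_value_1df(chi2):
    """Upper-tail p-value of a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2.0))

# Hypothetical candidate split: each tuple is (events, non-events) in one branch.
# The overall event rate is 12% (120 events out of 1000 records).
left = (72, 428)    # 14.4% events
right = (48, 452)   # 9.6% events

stat = chi2_2x2(*left, *right)
p = p_value_1df(stat)

# A smaller alpha is a stricter test, so fewer splits qualify.
for alpha in (0.05, 0.01, 0.00001):
    verdict = "split allowed" if p <= alpha else "split rejected"
    print(f"alpha={alpha:<8} p={p:.4f} -> {verdict}")
```

With real data the test and any adjustments applied by the tree node may differ; the point is only that lowering the threshold removes candidate splits rather than adding them.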

SAS/EM takes statistical significance seriously and won't produce
results in some situations unless the user explicitly increases the
acceptable level of Type 1 error. I would prefer a 'decision cost' basis
for exploratory data analyses that do not pretend to conduct a
hypothesis test, but I understand why SAS implements decision trees this
way.
Sig

-----Original Message-----
From: owner-sas-l(a)listserv.uga.edu [mailto:owner-sas-l(a)listserv.uga.edu]
On Behalf Of amw5gster(a)gmail.com
Sent: Wednesday, November 29, 2006 11:50 AM
To: sas-l(a)uga.edu
Subject: Decision Tree refuses to grow


From: Peter Flom on
I don't know how trees work in SAS, but in other software, this can
easily happen. It could be that none of the IVs are very good at
separating the DV.

Peter
From: Vadim Pliner on
I'm afraid you set too many branches for your independent variables
when some of those variables apparently have a lot of distinct
values. It looks like the number of competing splits at each level
would be astronomical in your case. The decision tree node of Eminer
adjusts all p-values for multiple comparisons, and since the
number of those comparisons looks to be huge from what you wrote, it
may produce very large adjusted p-values. To grow your tree, try
decreasing the "maximum number of branches from a node" to, say, 2 or 3
and increasing your significance level to, say, 0.05 or even 0.1.
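
A rough Python sketch of why the branch count matters so much. It assumes a Bonferroni-style adjustment (raw p-value multiplied by the number of candidate splits) and counts only splits of an ordered variable into exactly b contiguous groups; the adjustment Enterprise Miner actually applies may differ in detail. The 365 distinct values, the raw p-value, and the function names are hypothetical.

```python
import math

def candidate_splits(distinct_values, branches):
    """Ways to cut an ordered variable with `distinct_values` levels into
    exactly `branches` contiguous groups: C(k - 1, b - 1)."""
    return math.comb(distinct_values - 1, branches - 1)

def bonferroni(p_raw, n_comparisons):
    """Bonferroni-style adjustment: multiply by the number of comparisons, cap at 1."""
    return min(1.0, p_raw * n_comparisons)

p_raw = 1e-9   # hypothetical unadjusted p-value of the best split
k = 365        # hypothetical: a date variable with 365 distinct values

for b in (2, 3, 11):
    m = candidate_splits(k, b)
    print(f"branches={b:>2}  candidate splits={m:.3e}  adjusted p={bonferroni(p_raw, m):.3g}")
```

Under these assumptions the 2-branch split still clears a .00001 threshold, the 3-branch split clears 0.05 but not .00001, and with 11 branches the adjusted p-value is capped at 1, so no threshold can be met.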

HTH,
Vadim Pliner

From: David L Cassell on
In addition to the excellent advice from Sig and Vadim, let me add a
note about taking subsamples. SAS EM uses the same underlying
protocols as PROC SURVEYSELECT when it does a sample (the first
step in the SEMMA model of data mining). It treats your data as if
you requested a stratified sample, with your 0/1 DV providing the
strata, and it does a simple random sample of your data within each
stratum. [This may or may NOT be ideal for your setting.] And it
can (in theory) handle the weights that accrue as a result of this
sampling.

So cutting the data down to about 2 million records (roughly 1 million
'true' and 1 million 'false') will give you that 50% while keeping all
the 'true' events. Or subset even further. But the smaller you cut these
pieces, the more you run into problems with probabilities of missing
rare events that may be related to independent variables, which can
really be a source of misrepresentation, in a different way. At this
point, you need to think about how the sampling should really be done.
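
As a small illustration of the stratified idea and the weights it produces, here is a Python sketch. It is not PROC SURVEYSELECT or Enterprise Miner code; the function stratified_sample, the toy population, and the sample sizes are assumptions for the example. Each sampled row carries a weight equal to its stratum size divided by the number sampled from that stratum, so the balanced sample can still reproduce the original 12% event rate.

```python
import random

def stratified_sample(records, target_key, n_per_stratum, seed=0):
    """Simple random sample of up to n_per_stratum rows within each target class.
    Returns (row, weight) pairs, where weight = stratum size / sample size."""
    rng = random.Random(seed)
    strata = {}
    for row in records:
        strata.setdefault(row[target_key], []).append(row)
    sample = []
    for rows in strata.values():
        n = min(n_per_stratum, len(rows))
        weight = len(rows) / n
        sample.extend((row, weight) for row in rng.sample(rows, n))
    return sample

# Toy population of 10,000 rows with a 12% event rate.
population = [{"id": i, "target": 1 if i < 1200 else 0} for i in range(10_000)]
sample = stratified_sample(population, "target", n_per_stratum=500)

# Unweighted, the sample looks 50/50; weighted, it recovers the true rate.
raw_rate = sum(row["target"] for row, _ in sample) / len(sample)
weighted_rate = (sum(w * row["target"] for row, w in sample)
                 / sum(w for _, w in sample))
print(f"raw event rate: {raw_rate:.2f}, weighted event rate: {weighted_rate:.2f}")
```

The weights fix the overall proportion, but they cannot bring back rare combinations of independent variables that happened not to be drawn, which is the risk with very small subsamples.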

HTH,
David
--
David L. Cassell
mathematical statistician
Design Pathways
3115 NW Norwood Pl.
Corvallis OR 97330
