From: Aaron Bramson on
Hello Everybody,

I would like to perform a 2-sample k-s test. I've seen some posts on the
archive about the one-sample goodness-of-fit version of the
Kolgomorov-Smirnov test, but I'm interested in the 2-sample version.

Here's a description of the method:
http://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test#Two-sample_Kolmogorov.E2.80.93Smirnov_test

I think all the necessary components are available; e.g.
Accumulate[BinCounts[_list_]] to get the ecdf of both datasets, abs, max of
a list, etc. But the data management is a bit above my current skill
level. Also, since all other software packages seem to include this test
capability, I would be really surprised if there
wasn't a package somewhere that included it by now, but I've searched a lot
and can't find it. Can anybody help me locate this this?

Alternatively, would anybody like to work with me to build this in case it
can't be found?

Thanks in Advance,
Aaron


From: Ray Koopman on
On Jul 18, 11:10 pm, Aaron Bramson <aaronbram...(a)gmail.com> wrote:
> Hello Everybody,
>
> I would like to perform a 2-sample k-s test. I've seen some posts on the
> archive about the one-sample goodness-of-fit version of the
> Kolgomorov-Smirnov test, but I'm interested in the 2-sample version.
>
> Here's a description of the method:

> http://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test#Two-sample_Kolmogorov.E2.80.93Smirnov_test

>
> I think all the necessary components are available; e.g.
> Accumulate[BinCounts[_list_]] to get the ecdf of both datasets, abs, max of
> a list, etc. But the data management is a bit above my current skill
> level. Also, since all other software packages seem to include this test
> capability, I would be really surprised if there
> wasn't a package somewhere that included it by now, but I've searched a lot
> and can't find it. Can anybody help me locate this this?
>
> Alternatively, would anybody like to work with me to build this in case it
> can't be found?
>
> Thanks in Advance,
> Aaron

This is based on code by Thomas Waterhouse, taken from
https://stat.ethz.ch/pipermail/r-devel/2009-July/054106.html

The arguments y1 and y2 are lists of data. It returns {x,p},
where x is the usual K-S test statistic n1*n2*Max|cdf1-cdf2|,
and p is the corresponding p-value, conditional on
the distribution of ties in the pooled data.

ks2[y1_,y2_] := Block[{n1 = Length(a)y1, n2 = Length(a)y2,
pool = Sort(a)Join[y1,y2], x,n,u}, {x =
Max(a)Abs[n2*Tr(a)UnitStep[y1-#]-n1*Tr(a)UnitStep[y2-#]&/@Rest(a)Union@pool],
n = n1+n2; u = Table[0,{n2+1}]; Do[ Which[
i+j == 0, u[[j+1]] = 1,
i+j < n && pool[[i+j]] < pool[[i+j+1]] && Abs[n2*i-n1*j] >= x,
u[[j+1]] = 0,
i == 0, u[[j+1]] = u[[j]],
j > 0, u[[j+1]] += u[[j]]], {i,0,n1},{j,0,n2}];
N[1 - Last@u/Multinomial[n1,n2]]} ]

From: Darren Glosemeyer on
Aaron Bramson wrote:
> Hello Everybody,
>
> I would like to perform a 2-sample k-s test. I've seen some posts on the
> archive about the one-sample goodness-of-fit version of the
> Kolgomorov-Smirnov test, but I'm interested in the 2-sample version.
>
> Here's a description of the method:
> http://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test#Two-sample_Kolmogorov.E2.80.93Smirnov_test
>
> I think all the necessary components are available; e.g.
> Accumulate[BinCounts[_list_]] to get the ecdf of both datasets, abs, max of
> a list, etc. But the data management is a bit above my current skill
> level. Also, since all other software packages seem to include this test
> capability, I would be really surprised if there
> wasn't a package somewhere that included it by now, but I've searched a lot
> and can't find it. Can anybody help me locate this this?
>
> Alternatively, would anybody like to work with me to build this in case it
> can't be found?
>
> Thanks in Advance,
> Aaron
>
>
>

Hi Aaron,

Here is some code written by Andy Ross at Wolfram for the two sample
Kolmogorov-Smirnov test. KolmogorovSmirnov2Sample computes the test
statistic, and KSBootstrapPValue provides a bootstrap estimate of the
p-value given the two data sets, the number of simulations for the
estimate and the test statistic.

In[1]:= empiricalCDF[data_, x_] := Length[Select[data, # <= x
&]]/Length[data]

In[2]:= KolmogorovSmirnov2Sample[data1_, data2_] :=
Block[{sd1 = Sort[data1], sd2 = Sort[data2], e1, e2,
udat = Union[Flatten[{data1, data2}]], n1 = Length[data1],
n2 = Length[data2], T},
e1 = empiricalCDF[sd1, #] & /@ udat;
e2 = empiricalCDF[sd2, #] & /@ udat;
T = Max[Abs[e1 - e2]];
(1/Sqrt[n1]) (Sqrt[(n1*n2)/(n1 + n2)]) T
]

In[3]:= KSBootstrapPValue[data1_, data2_, nSamp_, T_] :=
Block[{udat = Union[Flatten[{data1, data2}]], n1 = Length[data1],
n2 = Length[data2], d1Samples, d2Samples, Tdist},
d1Samples = RandomChoice[udat, {nSamp, n1}];
d2Samples = RandomChoice[udat, {nSamp, n2}];
Tdist =
MapThread[
KolmogorovSmirnov2Sample[#1, #2] &, {d1Samples, d2Samples}];
1 - empiricalCDF[Tdist, T] // N
]

(* create two data sets *)
In[4]:= data1 = RandomReal[NormalDistribution[], 15];

In[5]:= data2 = RandomReal[NormalDistribution[], 15];

(* compute the test statistic *)
In[6]:= stat = KolmogorovSmirnov2Sample[data1, data2] // N

Out[6]= 0.329983

(* estimate the p-value from 10000 bootstrap samples *)
In[7]:= KSBootstrapPValue[data1, data2, 10000, stat]

Out[7]= 0.0213


Darren Glosemeyer
Wolfram Research

From: Bill Rowe on
On 7/19/10 at 2:11 AM, aaronbramson(a)gmail.com (Aaron Bramson) wrote:

>I would like to perform a 2-sample k-s test. I've seen some posts
>on the archive about the one-sample goodness-of-fit version of the
>Kolgomorov-Smirnov test, but I'm interested in the 2-sample version.

>Here's a description of the method:
>http://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test#Two-
>sample_Kolmogorov.E2.80. 93Smirnov_test

>I think all the necessary components are available; e.g.
>Accumulate[BinCounts[_list_]] to get the ecdf of both datasets, abs,
>max of a list, etc. But the data management is a bit above my
>current skill level. Also, since all other software packages seem
>to include this test capability, I would be really surprised if
>there wasn't a package somewhere that included it by now, but I've
>searched a lot and can't find it. Can anybody help me locate this
>this?

>Alternatively, would anybody like to work with me to build this in
>case it can't be found?

It is simple to create a function that will do what is needed.
For example,

ksTwoSampleTest[xdata_, ydata_] :=
Module[{nx, ny, k},
{nx, ny} = Length /@ {xdata, ydata};
k = Max[nx, ny];
Sqrt[nx ny/(nx + ny)] Max@
Table[Abs[Quantile[xdata, x/k] - Quantile[ydata, x/k]], {x, k}]]

Note, while this is a simple implementation it may not be
optimal for large data sets. My *guess* is by using Quantile and
not pre-sorting the data, there is more work being done by this
code than is really needed. i suspect that the approach I've
used here has a complexity of order n^2 which should't be a
problem for modest data sets but will certainly be an issue for
large data sets.


From: Bill Rowe on
On 7/20/10 at 3:41 AM, readnews(a)sbcglobal.net (Bill Rowe) wrote:

>On 7/19/10 at 2:11 AM, aaronbramson(a)gmail.com (Aaron Bramson) wrote:
>

>>Alternatively, would anybody like to work with me to build this in
>>case it can't be found?

>It is simple to create a function that will do what is needed. For
>example,

And then supplied an incorrect solution. Please ignore it.