The procedure below draws on NAG algorithms.
Consider two independent samples $X$ and $Y$, of sizes $n_1$ and $n_2$, denoted $x_1, x_2, \ldots, x_{n_1}$ and $y_1, y_2, \ldots, y_{n_2}$ respectively. Let $F(x)$ and $G(x)$ represent their respective, unknown distribution functions. Also let $S_1(x)$ and $S_2(x)$ denote the values of the sample empirical distribution functions.
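As an illustration of the definition above, an empirical distribution function can be sketched in a few lines of Python (a minimal sketch; the name `ecdf` is illustrative and not part of KS-test2):

```python
from bisect import bisect_right

def ecdf(sample):
    """Return the empirical distribution function S of a sample:
    S(x) is the fraction of observations less than or equal to x."""
    data = sorted(sample)
    n = len(data)
    def S(x):
        # bisect_right counts how many sorted observations are <= x
        return bisect_right(data, x) / n
    return S

S1 = ecdf([1.2, 3.4, 2.2, 5.0])
print(S1(2.2))  # 0.5: two of the four observations are <= 2.2
```

Sorting once and using binary search keeps each evaluation at $O(\log n)$.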
The null hypothesis $H_0$: $F(x) = G(x)$ is tested against one of the following alternatives:
$H_1$: $F(x) \neq G(x)$, for which the associated p-value is a two-tailed probability;
or
$H_2$: $F(x) > G(x)$, for which the associated p-value is an upper-tailed probability;
or
$H_3$: $F(x) < G(x)$, for which the associated p-value is a lower-tailed probability.
For the first case of $H_1$, the statistic $D_{n_1,n_2} = \sup_x |S_1(x) - S_2(x)|$ represents the largest absolute deviation between the two empirical distribution functions.
For the second case of $H_2$, the statistic $D^{+}_{n_1,n_2}$ represents the largest positive deviation between the empirical distribution function of the first sample and the empirical distribution function of the second sample, that is $D^{+}_{n_1,n_2} = \sup_x \{S_1(x) - S_2(x)\}$.
For the third case of $H_3$, the statistic $D^{-}_{n_1,n_2}$ represents the largest positive deviation between the empirical distribution function of the second sample and the empirical distribution function of the first sample, that is $D^{-}_{n_1,n_2} = \sup_x \{S_2(x) - S_1(x)\}$.
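All three deviation statistics can be computed directly from the two samples: since each empirical distribution function changes only at observed points, it suffices to evaluate both at every pooled observation. A minimal Python sketch (the name `ks_statistics` is illustrative):

```python
from bisect import bisect_right

def ks_statistics(x, y):
    """Return (D, D+, D-): the largest absolute deviation, the largest
    positive deviation S1 - S2, and the largest positive deviation S2 - S1
    of the two sample empirical distribution functions."""
    xs, ys = sorted(x), sorted(y)
    n1, n2 = len(xs), len(ys)
    d_plus = d_minus = 0.0
    for v in xs + ys:
        s1 = bisect_right(xs, v) / n1  # S1(v)
        s2 = bisect_right(ys, v) / n2  # S2(v)
        d_plus = max(d_plus, s1 - s2)    # candidate for D+
        d_minus = max(d_minus, s2 - s1)  # candidate for D-
    return max(d_plus, d_minus), d_plus, d_minus

d, d_plus, d_minus = ks_statistics([1, 2, 3], [2, 3, 4])
print(d, d_plus, d_minus)  # D = D+ = 1/3, D- = 0.0
```

Between pooled observations both step functions are constant, so scanning only the observed values is enough to attain the suprema.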
KS-test2 also returns the standardized statistic $Z = \sqrt{\frac{n_1 n_2}{n_1 + n_2}}\, D$, where $D$ may be $D_{n_1,n_2}$, $D^{+}_{n_1,n_2}$ or $D^{-}_{n_1,n_2}$, depending on the choice of the alternative hypothesis.
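The standardization is a simple rescaling of whichever deviation statistic was chosen; a short sketch (the name `standardized_z` is illustrative, not a KS-test2 routine):

```python
import math

def standardized_z(d, n1, n2):
    """Standardize a deviation statistic: Z = sqrt(n1*n2/(n1+n2)) * D,
    where D may be the two-sided or either one-sided deviation."""
    return math.sqrt(n1 * n2 / (n1 + n2)) * d

print(standardized_z(0.5, 100, 100))  # sqrt(50) * 0.5, about 3.536
```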
The distribution of the statistic $Z$ converges asymptotically to a distribution given by Smirnov as $n_1$ and $n_2$ increase. The probability, under the null hypothesis, of obtaining a value of the test statistic as extreme as that observed is computed.
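The limiting two-sided tail probability has the classical Smirnov series form $Q(z) = 2\sum_{j=1}^{\infty} (-1)^{j-1} e^{-2 j^2 z^2}$. The sketch below evaluates a truncation of this series; it illustrates the asymptotic distribution only and is not the NAG computation:

```python
import math

def smirnov_sf(z, terms=100):
    """Asymptotic two-sided tail probability P(Z > z) from Smirnov's
    alternating series; valid as n1 and n2 grow large."""
    if z <= 0:
        return 1.0  # the series form only applies for z > 0
    return 2.0 * sum((-1) ** (j - 1) * math.exp(-2.0 * j * j * z * z)
                     for j in range(1, terms + 1))

print(smirnov_sf(1.36))  # about 0.049, near the familiar 5% critical value
```

The terms decay like $e^{-2 j^2 z^2}$, so a modest truncation is accurate for any $z$ bounded away from zero.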
If $\max(n_1, n_2) \le 2500$ and $n_1 n_2 \le 10000$, then an exact method given by Kim and Jennrich (1973) is used. Otherwise, the p-value is computed using the approximations suggested by Kim and Jennrich (1973).
Note that the method used is only exact for continuous theoretical distributions.
This method computes the two-sided probability. The one-sided probabilities are estimated by halving the two-sided probability. This is a good estimate for small $p$, that is $p \le 0.10$, but it becomes very poor for larger $p$.
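Putting the pieces together, the following end-to-end sketch computes the two-sided statistic, the standardized $Z$, the asymptotic two-sided probability via the Smirnov series, and the halving estimate for the one-sided case. It illustrates the formulas above under the asymptotic approximation; it is not the NAG exact method, and all names are illustrative:

```python
import math
from bisect import bisect_right

def ks_2samp_asymptotic(x, y):
    """Return (D, Z, two-sided p, one-sided estimate) for two samples,
    using the Smirnov series for the asymptotic two-sided probability."""
    xs, ys = sorted(x), sorted(y)
    n1, n2 = len(xs), len(ys)
    # Largest absolute deviation of the two empirical CDFs
    d = max(abs(bisect_right(xs, v) / n1 - bisect_right(ys, v) / n2)
            for v in xs + ys)
    z = math.sqrt(n1 * n2 / (n1 + n2)) * d
    if z == 0:
        p_two = 1.0  # the series form only applies for z > 0
    else:
        p_two = min(1.0, 2.0 * sum((-1) ** (j - 1)
                                   * math.exp(-2.0 * j * j * z * z)
                                   for j in range(1, 101)))
    # One-sided probability estimated by halving; good only for small p
    return d, z, p_two, p_two / 2.0

d, z, p_two, p_one = ks_2samp_asymptotic([1, 2, 3, 4, 5], [6, 7, 8, 9, 10])
print(d, round(p_two, 4))  # D = 1.0; the disjoint samples give a small p
```

For sample sizes inside the exact-method thresholds, the Kim and Jennrich exact probability would differ from this asymptotic value.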
For more details of the algorithm, please refer to nag_2_sample_ks_test (g08cdc).