17.1.6.3 Algorithms (CrossTabs)




CrossTabs, also called contingency tables, is a tool for examining the existence or the strength of any association between variables.

CrossTabs Method

Frequency Counts

Define

\(X_i\) are the distinct values of the row variable in ascending order, i.e. \(X_1 < X_2 < \cdots < X_R \)
\(Y_j\) are the distinct values of the column variable in ascending order, i.e. \(Y_1 < Y_2 < \cdots < Y_C \)
\(f_{ij}\) is the frequency with respect to cell \((i,j)\)
\(r_i = \sum_{j=1}^{C}f_{ij}\) is subtotal of the \(i\)th row
\(c_j = \sum_{i=1}^{R}f_{ij}\) is subtotal of the \(j\)th column
\(N = \sum_{j=1}^{C}c_j = \sum_{i=1}^{R}r_i\) is the total number.
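
As an illustration of the definitions above, the following minimal Python sketch builds the frequency table \(f_{ij}\) and its subtotals from two raw data columns. The function names `crosstab` and `margins` and the sample data are hypothetical and are not part of the CrossTabs tool.

```python
from collections import Counter

def crosstab(rows, cols):
    """Return the sorted row values, sorted column values and count matrix f."""
    x_vals = sorted(set(rows))          # X_1 < X_2 < ... < X_R
    y_vals = sorted(set(cols))          # Y_1 < Y_2 < ... < Y_C
    counts = Counter(zip(rows, cols))   # missing pairs count as 0
    f = [[counts[(x, y)] for y in y_vals] for x in x_vals]
    return x_vals, y_vals, f

def margins(f):
    """Return row subtotals r_i, column subtotals c_j and the total N."""
    r = [sum(row) for row in f]
    c = [sum(col) for col in zip(*f)]
    return r, c, sum(r)

# Hypothetical sample data: one row value and one column value per observation.
rows = ["a", "b", "a", "b", "a", "c"]
cols = [1, 2, 1, 1, 2, 2]
x_vals, y_vals, f = crosstab(rows, cols)
r, c, n = margins(f)
```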

Marginal and Cell

Statistics Formula and Explanation
Count \[f_{ij}\]
Expected Count \[E_{ij} = \frac{r_i c_j}{N}\]
Row Percent \[100*\frac{f_{ij}}{r_i}\]
Column Percent \[100*\frac{f_{ij}}{c_j}\]
Total Percent \[100*\frac{f_{ij}}{N}\]
Residual \[R_{ij} = f_{ij} - E_{ij}\]
Std. Residual \[StdR_{ij} = \frac{R_{ij}}{\sqrt{E_{ij}}}\]
Adj. Residual \[AdjR_{ij} = \frac{R_{ij}}{\sqrt{E_{ij}\left(1-\frac{r_i}{N}\right)\left(1-\frac{c_j}{N}\right)}}\]
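
The cell statistics in the table above can be sketched in Python as follows. This is only an illustration under the assumption that every row and column subtotal is positive; the function name `cell_statistics` and the dictionary keys are hypothetical.

```python
import math

def cell_statistics(f):
    """Compute the per-cell statistics for a count matrix f (list of rows)."""
    r = [sum(row) for row in f]
    c = [sum(col) for col in zip(*f)]
    n = sum(r)
    out = []
    for i, row in enumerate(f):
        out_row = []
        for j, fij in enumerate(row):
            e = r[i] * c[j] / n                      # expected count E_ij
            resid = fij - e                          # residual R_ij
            out_row.append({
                "count": fij,
                "expected": e,
                "row_pct": 100 * fij / r[i],
                "col_pct": 100 * fij / c[j],
                "total_pct": 100 * fij / n,
                "residual": resid,
                "std_residual": resid / math.sqrt(e),
                "adj_residual": resid / math.sqrt(
                    e * (1 - r[i] / n) * (1 - c[j] / n)),
            })
        out.append(out_row)
    return out

table = [[10, 20], [30, 40]]   # hypothetical 2 x 2 counts
stats = cell_statistics(table)
```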

Chi-Square Statistics

Statistics Formula and Explanation Degree of Freedom
Pearson Chi-Square \[\chi_p^2 = \sum_{ij} \frac{(f_{ij}-E_{ij})^2}{E_{ij}}\] \[(R-1)(C-1)\]
Likelihood Ratio \[\chi_{LR}^2 = -2\sum_{ij} f_{ij} \ln (E_{ij}/f_{ij})\] \[(R-1)(C-1)\]
Linear Association \(\chi_{LA}^2 = (N-1)r^2\), where \(r\) is the Pearson correlation coefficient. \[1\]
Continuity Correction \(\chi_C^2 = \frac{N(|f_{11}f_{22}-f_{12}f_{21}|-0.5N)^2}{r_1r_2c_1c_2} I(|f_{11}f_{22}-f_{12}f_{21}|>0.5N)\), where \(I(\cdot)\) is the indicator function; it is calculated only for 2 x 2 tables \[1\]
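
A minimal Python sketch of the Pearson, likelihood ratio and continuity-corrected statistics is given below. It assumes positive expected counts and treats a cell with \(f_{ij}=0\) as contributing zero to the likelihood ratio sum; that convention and the function names are assumptions, not a description of the tool's internals. The linear association statistic is omitted because it needs the raw scores to compute \(r\).

```python
import math

def chi_square_statistics(f):
    """Return (Pearson chi-square, likelihood ratio chi-square, df)."""
    r = [sum(row) for row in f]
    c = [sum(col) for col in zip(*f)]
    n = sum(r)
    pearson = 0.0
    lr = 0.0
    for i, row in enumerate(f):
        for j, fij in enumerate(row):
            e = r[i] * c[j] / n
            pearson += (fij - e) ** 2 / e
            if fij > 0:                     # assumption: 0 * ln(0) counts as 0
                lr += -2 * fij * math.log(e / fij)
    df = (len(r) - 1) * (len(c) - 1)
    return pearson, lr, df

def continuity_corrected(f):
    """Continuity-corrected chi-square for a 2 x 2 table."""
    (f11, f12), (f21, f22) = f
    n = f11 + f12 + f21 + f22
    diff = abs(f11 * f22 - f12 * f21)
    if diff <= 0.5 * n:                     # indicator I(...) is zero
        return 0.0
    denom = (f11 + f12) * (f21 + f22) * (f11 + f21) * (f12 + f22)
    return n * (diff - 0.5 * n) ** 2 / denom

table = [[10, 20], [30, 40]]                # hypothetical counts
pearson, lr, df = chi_square_statistics(table)
corrected = continuity_corrected(table)
```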

Fisher's Exact Test

This test is useful when some expected cell counts are low (less than 5). It is calculated only for 2 x 2 tables. Suppose the table is as follows:

\[\begin{array}{c|cc|c} & X_1 & X_2 & \text{Subtotal/Total} \\ \hline Y_1 & n_1 & n_3 & n_1+n_3 \\ Y_2 & n_2 & n_4 & n_2+n_4 \\ \hline \text{Subtotal/Total} & n_1+n_2 & n_3+n_4 & N \end{array}\]

Under the null hypothesis (independence), the count of the first cell \(N_1\) follows a hypergeometric distribution with probability given by

\(Pr(N_1=n_1) = \frac{(n_1+n_2)!(n_3+n_4)!(n_1+n_3)!(n_2+n_4)!}{N!n_1!n_2!n_3!n_4!}\), \(\max(0,n_1-n_4)\leq N_1 \leq \min(n_1+n_2,n_1+n_3)\).

One-Sided Test

The one-sided test significance level is calculated by

p(left-sided test) =\( Pr(N_1\leq n_1)\)
p(right-sided test) =\( Pr(N_1\geq n_1)\)

Two-Sided Test

The two-sided significance level is

\[p_2 = p_1 + p_3\]

where

\(p_{1}= Pr(N_1\leq n_1)\), if \(n_{1}\leq (n_{1}+n_{2})(n_{1}+n_{3})/N\)
\(p_{1}= Pr(N_1\geq n_1)\), if \(n_{1}>(n_{1}+n_{2})(n_{1}+n_{3})/N\)


\[p_3 = \sum_{x:\text{ between }\min(n_1+n_2,n_1+n_3) \text{ and } (n_1+1); Pr(N_1=x) \leq Pr(N_1=n_1)} Pr(N_1=x)\]
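
The one-sided and two-sided significance levels can be sketched in Python with exact rational arithmetic, which avoids floating-point ties when comparing \(Pr(N_1=x)\) with \(Pr(N_1=n_1)\). The function name is hypothetical. The formula for \(p_3\) above is written for the upper tail; when \(p_1\) is the right-sided probability, this sketch assumes the mirrored rule and sums over the lower tail instead.

```python
from fractions import Fraction
from math import comb

def fisher_exact_2x2(n1, n2, n3, n4):
    """Return (left-sided p, right-sided p, two-sided p) as exact Fractions."""
    n = n1 + n2 + n3 + n4
    col1 = n1 + n2                  # X_1 subtotal
    row1 = n1 + n3                  # Y_1 subtotal
    lo = max(0, n1 - n4)            # smallest possible value of N_1
    hi = min(col1, row1)            # largest possible value of N_1

    def pmf(x):
        # Hypergeometric probability, equivalent to the factorial formula.
        return Fraction(comb(col1, x) * comb(n - col1, row1 - x), comb(n, row1))

    p_obs = pmf(n1)
    left = sum(pmf(x) for x in range(lo, n1 + 1))
    right = sum(pmf(x) for x in range(n1, hi + 1))
    if n1 * n <= col1 * row1:       # n1 <= (n1+n2)(n1+n3)/N
        p1, other_tail = left, range(n1 + 1, hi + 1)
    else:                           # assumption: mirrored rule for the lower tail
        p1, other_tail = right, range(lo, n1)
    p3 = sum((pmf(x) for x in other_tail if pmf(x) <= p_obs), Fraction(0))
    return left, right, p1 + p3

# Hypothetical table: n1 = 3, n2 = 1, n3 = 1, n4 = 3.
left, right, two_sided = fisher_exact_2x2(3, 1, 1, 3)
```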

Measures of Association

Define

\[D_r = N^2 - \sum_{i=1}^{R}r_i^2\]
\[D_c = N^2 - \sum_{j=1}^{C}c_j^2\]
\[C_{ij} = \sum_{h<i}\sum_{k<j}f_{hk}+\sum_{h>i}\sum_{k>j}f_{hk}\]
\[D_{ij} = \sum_{h<i}\sum_{k>j}f_{hk}+\sum_{h>i}\sum_{k<j}f_{hk}\]
\[P = \sum_{ij}f_{ij}C_{ij}\]
\[Q = \sum_{ij}f_{ij}D_{ij}\]
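
The quantities defined above can be computed directly from the count matrix. The following Python sketch uses a straightforward loop over all cell pairs, which is adequate for small tables; the function name `concordance_counts` is hypothetical.

```python
def concordance_counts(f):
    """Return (P, Q, D_r, D_c) for a count matrix f given as a list of rows."""
    n_rows, n_cols = len(f), len(f[0])
    p = q = 0
    for i in range(n_rows):
        for j in range(n_cols):
            # C_ij: counts in cells above-left or below-right of (i, j)
            c_ij = sum(f[h][k] for h in range(n_rows) for k in range(n_cols)
                       if (h < i and k < j) or (h > i and k > j))
            # D_ij: counts in cells above-right or below-left of (i, j)
            d_ij = sum(f[h][k] for h in range(n_rows) for k in range(n_cols)
                       if (h < i and k > j) or (h > i and k < j))
            p += f[i][j] * c_ij
            q += f[i][j] * d_ij
    r = [sum(row) for row in f]
    c = [sum(col) for col in zip(*f)]
    n = sum(r)
    d_r = n * n - sum(x * x for x in r)
    d_c = n * n - sum(x * x for x in c)
    return p, q, d_r, d_c

P, Q, D_r, D_c = concordance_counts([[10, 20], [30, 40]])  # hypothetical counts
```

For this table, \((P-Q)/(P+Q) = -0.2\).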
Statistics Formula and Explanation Standard Error
Phi Coefficient \(\phi = \sqrt{\chi_p^2/N}\), which is calculated for tables that are not 2 x 2. For a 2 x 2 table, it is equal to the Pearson correlation coefficient \(r\).

The value ranges over \([0,M]\), where \(M = \min(\sqrt{R-1},\sqrt{C-1})\).

Cramer's V \[V = \sqrt{\frac{\chi_p^2}{N\min\{R-1,C-1\}}}\]
Contingency Coefficient \[CC = \sqrt{\frac{\chi_p^2}{\chi_p^2+N}}\]
Gamma \[\gamma = \frac{P-Q}{P+Q}\] \[\frac{2}{P+Q}\sqrt{\sum_{ij}f_{ij}(C_{ij}-D_{ij})^2-\frac{1}{N}(P-Q)^2}\]
Kendall Tau-b \[\tau_b = \frac{P-Q}{\sqrt{D_rD_c}}\] \[2\sqrt{\frac{1}{D_rD_c}\left[\sum_{ij}f_{ij}(C_{ij}-D_{ij})^2-\frac{1}{N}(P-Q)^2\right]}\]
Tau-c \(\tau_c = \frac{(P-Q)q}{N^2(q-1)}\), where \(q = \min\{R,C\}\) \[\frac{2q}{N^2(q-1)}\sqrt{\sum_{ij}f_{ij}(C_{ij}-D_{ij})^2-\frac{1}{N}(P-Q)^2}\]
Somers' D C\(|\)R \[d_{C|R} = \frac{P-Q}{D_r}\] \[\frac{2}{D_r}\sqrt{\sum_{ij}f_{ij}(C_{ij}-D_{ij})^2-\frac{1}{N}(P-Q)^2}\]
R\(|\)C \[d_{R|C} = \frac{P-Q}{D_c}\] \[\frac{2}{D_c}\sqrt{\sum_{ij}f_{ij}(C_{ij}-D_{ij})^2-\frac{1}{N}(P-Q)^2}\]
Symmetric \[d = 2\frac{P-Q}{D_c+D_r}\] \[\frac{4}{D_c+D_r}\sqrt{\sum_{ij}f_{ij}(C_{ij}-D_{ij})^2-\frac{1}{N}(P-Q)^2}\]
Lambda C\(|\)R \(\lambda_{C|R} = \frac{1}{N-c_m}\left(\sum_{i=1}^{R}f_{im}-c_m\right)\), where \(f_{im}\) is the largest count in ith row, and \(c_m\) is the largest column subtotal. \(\sqrt{ \frac{ N - \displaystyle\sum_{i=1}^{R} f_{im} }{ (N-c_m)^3 } \left(\sum_{i=1}^{R} f_{im} + c_m -2\sum_{i=1}^{R} (f_{im}|l_i=l) \right) }\),

where \(l_i\) is the column index of \(f_{im}\), \(l\) is the index of column subtotal for \(c_m\).

R\(|\)C \(\lambda_{R|C} = \frac{1}{N-r_m}\left(\sum_{j=1}^{C}f_{mj}-r_m\right)\),

where \(f_{mj}\) is the largest count in jth column, and \(r_m\) is the largest row subtotal.

\(\sqrt{ \frac{ N - \displaystyle\sum_{j=1}^{C} f_{mj} }{ (N-r_m)^3 } \left(\sum_{j=1}^{C} f_{mj} + r_m -2\sum_{j=1}^{C} (f_{mj}|k_j=k) \right) }\),

where \(k_j\) is the row index of \(f_{mj}\), \(k\) is the index of row subtotal for \(r_m\).

Symmetric \[\lambda = \frac { \displaystyle \sum_{i=1}^{R}f_{im} + \sum_{j=1}^{C}f_{mj} - c_m - r_m }{2N-r_m-c_m}\] \(\frac{1}{w^2} \sqrt{ wvy - 2w^2\left( N-\sum_{i=1}^{R} (f_{im}|i=k_{l_i}) \right) - 2v^2(N-f_{kl}) }\)

where \(w=2N-r_m-c_m\), \(v = 2N - \sum_{i=1}^{R}f_{im} - \sum_{j=1}^{C}f_{mj}\), \(x = \sum_{i=1}^R (f_{im}|l_i=l) + \sum_{j=1}^C (f_{mj}|k_j=k) + f_{km} + f_{ml}\), and \(y = 8N - w - v - 2x\).

Uncertainty C\(|\)R \(U_{C|R} = \frac{U(X)+U(Y)-U(XY)}{U(Y)}\), where \(U(X) = -\sum_{i=1}^{R}\frac{r_i}{N}\ln\frac{r_i}{N}\), \(U(Y) = -\sum_{j=1}^{C}\frac{c_j}{N}\ln\frac{c_j}{N}\), and \(U(XY) = -\sum_{ij}\frac{f_{ij}}{N}\ln\frac{f_{ij}}{N}\) \(\frac{1}{NU(Y)}\sqrt{P-N\left(U(X)+U(Y)-U(XY)\right)^2}\), where \(P = \sum_{ij}f_{ij}\left(\ln\frac{r_ic_j}{f_{ij}N}\right)^2\) (this \(P\) is used only for the uncertainty coefficients and is unrelated to the concordance count \(P\) defined above)
R\(|\)C \[U_{R|C} = \frac{U(X)+U(Y)-U(XY)}{U(X)}\] \[\frac{1}{NU(X)}\sqrt{P-N\left(U(X)+U(Y)-U(XY)\right)^2}\]
Symmetric \[U = 2\frac{U(X)+U(Y)-U(XY)}{U(X)+U(Y)}\] \[\frac{2}{N(U(X)+U(Y))}\sqrt{P-N\left(U(X)+U(Y)-U(XY)\right)^2}\]
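
As a hedged illustration of the nominal measures in the table above, the Python sketch below computes the three Lambda values and the three uncertainty coefficients (point estimates only, without standard errors). In the return values, "column dependent" means the version whose denominator is \(N-c_m\) for Lambda and \(U(Y)\) for the uncertainty coefficient. Cells with zero count are skipped in the entropy sums, which is an assumption; the function names are hypothetical.

```python
import math

def lambda_measures(f):
    """Return (column dependent, row dependent, symmetric) Lambda."""
    r = [sum(row) for row in f]
    c = [sum(col) for col in zip(*f)]
    n = sum(r)
    row_max_sum = sum(max(row) for row in f)         # sum_i f_im
    col_max_sum = sum(max(col) for col in zip(*f))   # sum_j f_mj
    c_m, r_m = max(c), max(r)
    lam_col = (row_max_sum - c_m) / (n - c_m)
    lam_row = (col_max_sum - r_m) / (n - r_m)
    lam_sym = (row_max_sum + col_max_sum - c_m - r_m) / (2 * n - r_m - c_m)
    return lam_col, lam_row, lam_sym

def entropy(counts, n):
    """-sum (x/n) ln(x/n), skipping zero counts (assumption)."""
    return -sum(x / n * math.log(x / n) for x in counts if x > 0)

def uncertainty_coefficients(f):
    """Return (column dependent, row dependent, symmetric) coefficients."""
    r = [sum(row) for row in f]
    c = [sum(col) for col in zip(*f)]
    n = sum(r)
    ux = entropy(r, n)                                # U(X), rows
    uy = entropy(c, n)                                # U(Y), columns
    uxy = entropy([x for row in f for x in row], n)   # U(XY), cells
    shared = ux + uy - uxy
    return shared / uy, shared / ux, 2 * shared / (ux + uy)

lam = lambda_measures([[30, 10], [5, 25]])            # hypothetical counts
unc = uncertainty_coefficients([[5, 0], [0, 5]])      # perfectly associated table
```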

Measures of Agreement

This table is calculated only when two conditions are satisfied: (1) the table is square, i.e. \(R=C\), and (2) the row variable and the column variable have the same set of values.

The Kappa statistic is calculated by

\[ \kappa = \frac{N\sum_{i=1}^{R}f_{ii} - \sum_{i=1}^{R}r_ic_i}{N^2 - \sum_{i=1}^{R}r_ic_i}\]

The standard error is estimated by:

\(SE_1 = \frac{1}{1-p_e} \sqrt{ \frac{A+B-C}{N} }\),

where \(p_e = \frac{ \sum_{i=1}^R r_i c_i }{ N^2 }\), \( A = \sum_{i=1}^R \frac{f_{ii}}{N} \left( 1-\frac{(r_i+c_i)(1- \kappa)}{N} \right)^2\),
\(B = (1-\kappa)^2 \sum_{i=1}^R \sum_{j=1, j \ne i}^{C} \frac{f_{ij} (r_i+c_j)^2}{N^3}\) and \(C = \Bigl( \kappa - p_e( 1-\kappa ) \Bigr)^2\).

The corresponding asymptotic standard error under the null hypothesis \(\kappa = 0\) is given by

\[SE_0 = \sqrt{\frac{1}{N\left(N^2 - \sum_{i=1}^{R}r_ic_i\right)^2} \left[N^2\sum_{i=1}^{R}r_ic_i + \left(\sum_{i=1}^{R}r_ic_i\right)^2 - N \sum_{i=1}^{R}r_ic_i(r_i+c_i)\right]}\]
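
A minimal Python sketch of the Kappa statistic and its null-hypothesis standard error \(SE_0\) follows. It assumes a square table whose rows and columns are listed in the same category order; the function name `kappa_statistics` is hypothetical.

```python
import math

def kappa_statistics(f):
    """Return (kappa, SE_0) for a square count matrix f."""
    k = len(f)
    r = [sum(row) for row in f]
    c = [sum(col) for col in zip(*f)]
    n = sum(r)
    diag = sum(f[i][i] for i in range(k))             # sum of f_ii
    rc = sum(r[i] * c[i] for i in range(k))           # sum of r_i c_i
    kappa = (n * diag - rc) / (n * n - rc)
    bracket = (n * n * rc + rc * rc
               - n * sum(r[i] * c[i] * (r[i] + c[i]) for i in range(k)))
    se0 = math.sqrt(bracket / (n * (n * n - rc) ** 2))
    return kappa, se0

kappa, se0 = kappa_statistics([[20, 5], [10, 15]])    # hypothetical ratings
z = kappa / se0   # approximate z statistic for testing kappa = 0
```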

Another related statistic is Bowker's test, which is used to test \(H_0: p_{ij} = p_{ji}\) for all pairs \((i,j)\). If \(R>2\), the statistic is calculated as

\[Bo = \sum_{i=1}^{R} \sum_{j=1}^{i-1}\frac{(f_{ij}-f_{ji})^2}{f_{ij}+f_{ji}}\]

For large samples, \(Bo\) asymptotically follows a chi-square distribution with \(0.5R(R-1)\) degrees of freedom.

Note that for a 2 x 2 table, Bowker's test is equivalent to McNemar's test, so only Bowker's test is given.
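
The Bowker statistic can be sketched in Python as below. Pairs with \(f_{ij}+f_{ji}=0\) are skipped to avoid division by zero; that handling is an assumption of this sketch, and the function name is hypothetical.

```python
def bowker_statistic(f):
    """Return (Bo, degrees of freedom) for a square count matrix f."""
    size = len(f)
    bo = 0.0
    for i in range(size):
        for j in range(i):                       # all pairs with j < i
            total = f[i][j] + f[j][i]
            if total > 0:                        # assumption: skip empty pairs
                bo += (f[i][j] - f[j][i]) ** 2 / total
    df = size * (size - 1) // 2
    return bo, df

# Hypothetical 3 x 3 table of paired ratings.
bo, df = bowker_statistic([[10, 2, 3], [4, 10, 5], [1, 7, 10]])
```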

Odds Ratio and Relative Risk

These statistics are calculated only for 2 x 2 tables.

Odds Ratio

The Odds Ratio is calculated as

\[OR = \frac{f_{11}f_{22}}{f_{12}f_{21}}\]

Relative Risk

The Relative Risks are given by

\[P(Y_1|X_1)/P(Y_1|X_2) = \frac{f_{11}(f_{21}+f_{22})}{f_{21}(f_{11}+f_{12})}\]
\[P(Y_1|X_2)/P(Y_1|X_1) = \frac{f_{21}(f_{11}+f_{12})}{f_{11}(f_{21}+f_{22})}\]
\[P(Y_2|X_1)/P(Y_2|X_2) = \frac{f_{12}(f_{21}+f_{22})}{f_{22}(f_{12}+f_{11})}\]
\[P(Y_2|X_2)/P(Y_2|X_1) = \frac{f_{22}(f_{12}+f_{11})}{f_{12}(f_{21}+f_{22})}\]
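
The odds ratio and the four relative risks can be sketched in Python as follows, assuming all four cell counts are positive. The function name and the dictionary keys are hypothetical labels chosen for this illustration.

```python
def odds_ratio_and_relative_risks(f):
    """Return the odds ratio and the four relative risks of a 2 x 2 table."""
    (f11, f12), (f21, f22) = f
    r1, r2 = f11 + f12, f21 + f22                # row subtotals
    odds_ratio = (f11 * f22) / (f12 * f21)
    risks = {
        "Y1: X1 vs X2": (f11 * r2) / (f21 * r1),  # P(Y1|X1) / P(Y1|X2)
        "Y1: X2 vs X1": (f21 * r1) / (f11 * r2),  # P(Y1|X2) / P(Y1|X1)
        "Y2: X1 vs X2": (f12 * r2) / (f22 * r1),  # P(Y2|X1) / P(Y2|X2)
        "Y2: X2 vs X1": (f22 * r1) / (f12 * r2),  # P(Y2|X2) / P(Y2|X1)
    }
    return odds_ratio, risks

odds_ratio, risks = odds_ratio_and_relative_risks([[10, 20], [30, 40]])
```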

Cochran-Mantel-Haenszel

Let

\(K\) be the number of layers
\(f_{ijk}\) be the frequency in the ith row, jth column and kth layer
\(c_{jk} = \sum_{i=1}^{R} f_{ijk}\) be the jth column, kth layer subtotal
\(r_{ik} = \sum_{j=1}^{C} f_{ijk}\) be the ith row, kth layer subtotal
\(n_{k} = \sum_{i=1}^{R}\sum_{j=1}^{C} f_{ijk}\) be the kth layer subtotal
\(E_{ijk} = \frac{r_{ik}c_{jk}}{n_k}\) be the expected frequency of the ith row jth column kth layer cell
\[\hat{p}_{ik} = \frac{f_{i1k}}{r_{ik}}, d_k = \hat{p}_{1k} - \hat{p}_{2k}, \hat{p}_{k} = \frac{c_{1k}}{n_{k}}\]

Mantel-Haenszel statistic

The Mantel-Haenszel statistic is given by

\[MH = \left(\sum_{k=1}^{K}\frac{r_{1k}r_{2k}}{n_k-1} \hat{p}_{k}(1-\hat{p}_{k}) \right)^{-1/2}\left(\big|\sum_{k=1}^{K} (f_{11k}-E_{11k})\big|-0.5\right)sgn\left(\sum_{k=1}^{K} (f_{11k}-E_{11k})\right)\]

where sgn is the sign function \(sgn(x) = I(x>0)-I(x<0)+0*I(x=0)\).
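
The Mantel-Haenszel statistic for a 2×2×K table can be sketched in Python as below. Each layer is passed as a 2 x 2 list of counts, and every layer is assumed to have \(n_k > 1\); the function name and the sample layers are hypothetical.

```python
import math

def mantel_haenszel_statistic(layers):
    """Return MH for a list of 2 x 2 layers [[f11, f12], [f21, f22]]."""
    var_sum = 0.0
    diff_sum = 0.0
    for (f11, f12), (f21, f22) in layers:
        r1, r2 = f11 + f12, f21 + f22        # row subtotals r_1k, r_2k
        c1 = f11 + f21                       # first column subtotal c_1k
        n = r1 + r2                          # layer subtotal n_k
        p = c1 / n                           # p_k
        var_sum += r1 * r2 / (n - 1) * p * (1 - p)
        diff_sum += f11 - r1 * c1 / n        # f_11k - E_11k
    sign = (diff_sum > 0) - (diff_sum < 0)   # sgn(x)
    return (abs(diff_sum) - 0.5) * sign / math.sqrt(var_sum)

layers = [[[10, 5], [4, 11]], [[8, 6], [5, 9]]]   # hypothetical two layers
mh = mantel_haenszel_statistic(layers)
```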


Breslow-Day statistic

The Breslow-Day statistic is

\[BD = \sum_{k=1}^{K} V_k \left[f_{11k}-\hat{f}_{11k}\right]^2\]

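where \(V_k = \frac{1}{\hat{f}_{11k}}+\frac{1}{\hat{f}_{12k}}+\frac{1}{\hat{f}_{21k}}+\frac{1}{\hat{f}_{22k}}\), and \(\hat{f}_{ijk}\) is the expected frequency of cell \((i,j)\) in the kth layer under the assumption of a common odds ratio.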

Tarone’s Statistic

Tarone’s statistic is

\[T = \sum_{k=1}^{K} V_k \left[f_{11k}-\hat{f}_{11k}\right]^2- \frac{\left[\sum_{k=1}^{K}\left(f_{11k}-\hat{f}_{11k}\right)\right]^2}{\sum_{k=1}^{K}\frac {1}{V_k} }\]

where \(V_k = \frac{1}{\hat{f}_{11k}}+\frac{1}{\hat{f}_{12k}}+\frac{1}{\hat{f}_{21k}}+\frac{1}{\hat{f}_{22k}}\).

Common Odds Ratio

For a 2×2×K table, let \(OR_{k}\) denote the odds ratio at the kth layer. Assuming that a true common odds ratio exists, that is \(OR_{1}=OR_{2}=\cdots=OR_{K}\), the Mantel-Haenszel estimator of the common odds ratio is

\[\hat OR_{MH}=\frac{\sum_{k=1}^{K}\frac{f_{11k} f_{22k}}{n_{k}}}{\sum_{k=1}^{K}\frac{f_{12k} f_{21k}}{n_{k}}}\]

The asymptotic variance for \(ln(\hat OR_{MH})\) is:

\[\hat Var[ln(\hat OR_{MH})]=\frac{\sum_{k=1}^{K}\frac{(f_{11k}+f_{22k})f_{11k} f_{22k}}{n_{k}^2}}{2\left(\sum_{k=1}^{K}\frac{f_{11k} f_{22k}}{n_{k}}\right)^2}+\frac{\sum_{k=1}^{K}\frac{(f_{11k}+f_{22k})f_{12k} f_{21k}+(f_{12k}+f_{21k})f_{11k} f_{22k}}{n_{k}^2}}{2\sum_{k=1}^{K}\frac{f_{11k} f_{22k}}{n_{k}}\sum_{k=1}^{K}\frac{f_{12k} f_{21k}}{n_{k}}}+\frac{\sum_{k=1}^{K}\frac{(f_{12k}+f_{21k})f_{12k} f_{21k}}{n_{k}^2}}{2\left(\sum_{k=1}^{K}\frac{f_{12k} f_{21k}}{n_{k}}\right)^2}\]

The lower confidence limit (LCL) and upper confidence limit (UCL) for \(ln(\hat OR_{MH})\) are:

\(ln(\hat OR_{MH})-z(\alpha/2)\sqrt{\hat Var[ln(\hat OR_{MH})]}\) and \(ln(\hat OR_{MH})+z(\alpha/2)\sqrt{\hat Var[ln(\hat OR_{MH})]}\), where \(z(\alpha/2)\) is the upper \(\alpha/2\) quantile of the standard normal distribution.
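
The common odds ratio, its log-scale variance and the confidence limits can be sketched in Python as follows. The variance is coded in the Robins-Breslow-Greenland form, in which the denominators of the first and third terms are twice the squared sums \(\left(\sum_k f_{11k}f_{22k}/n_k\right)^2\) and \(\left(\sum_k f_{12k}f_{21k}/n_k\right)^2\). With this form a single layer reduces to the familiar \(1/f_{11}+1/f_{12}+1/f_{21}+1/f_{22}\). The function name, the default `alpha` and the sample layers are hypothetical.

```python
import math
from statistics import NormalDist

def common_odds_ratio(layers, alpha=0.05):
    """Return (OR_MH, Var[ln OR_MH], LCL, UCL) with limits on the log scale."""
    sum_r = sum_s = 0.0
    sum_pr = sum_ps_qr = sum_qs = 0.0
    for (f11, f12), (f21, f22) in layers:
        n = f11 + f12 + f21 + f22
        r_k = f11 * f22 / n
        s_k = f12 * f21 / n
        p_k = (f11 + f22) / n
        q_k = (f12 + f21) / n
        sum_r += r_k
        sum_s += s_k
        sum_pr += p_k * r_k
        sum_ps_qr += p_k * s_k + q_k * r_k
        sum_qs += q_k * s_k
    or_mh = sum_r / sum_s
    var_log = (sum_pr / (2 * sum_r ** 2)
               + sum_ps_qr / (2 * sum_r * sum_s)
               + sum_qs / (2 * sum_s ** 2))
    z = NormalDist().inv_cdf(1 - alpha / 2)     # upper alpha/2 normal quantile
    half_width = z * math.sqrt(var_log)
    log_or = math.log(or_mh)
    return or_mh, var_log, log_or - half_width, log_or + half_width

layers = [[[10, 5], [4, 11]], [[8, 6], [5, 9]]]   # hypothetical two layers
or_mh, var_log, lcl, ucl = common_odds_ratio(layers)
single = common_odds_ratio([[[10, 5], [4, 11]]])  # one-layer check
```

Exponentiating `lcl` and `ucl` gives the corresponding limits for the odds ratio itself.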