17.1.6.3 Algorithms (CrossTabs)
Contents
CrossTabs, also known as contingency table analysis, is used to examine whether an association exists between variables and how strong that association is.
CrossTabs Method
- Frequency Counts
- Marginal and Cell
- Chi-Square Tests Table
- Fisher's Exact Test Table (2 x 2 only)
- Measures of Association
- Measures of Agreement
- Odds Ratio and Relative Risk (2 x 2 only)
- Cochran-Mantel-Haenszel
Frequency Counts
Define
- \(X_i\) are the distinct values of the row variable in ascending order, i.e. \(X_1 < X_2 < \cdots < X_R \)
- \(Y_j\) are the distinct values of the column variable in ascending order, i.e. \(Y_1 < Y_2 < \cdots < Y_C \)
- \(f_{ij}\) is the observed frequency in cell \((i,j)\)
- \(r_i = \sum_{j=1}^{C}f_{ij}\) is the subtotal of the \(i\)th row
- \(c_j = \sum_{i=1}^{R}f_{ij}\) is the subtotal of the \(j\)th column
- \(N = \sum_{j=1}^{C}c_j = \sum_{i=1}^{R}r_i\) is the total count.
Marginal and Cell
| Statistics | Formula and Explanation |
|---|---|
| Count | \[f_{ij}\] |
| Expected Count | \[E_{ij} = \frac{r_i c_j}{N}\] |
| Row Percent | \[100*\frac{f_{ij}}{r_i}\] |
| Column Percent | \[100*\frac{f_{ij}}{c_j}\] |
| Total Percent | \[100*\frac{f_{ij}}{N}\] |
| Residual | \[R_{ij} = f_{ij} - E_{ij}\] |
| Std. Residual | \[StdR_{ij} = \frac{R_{ij}}{\sqrt{E_{ij}}}\] |
| Adj. Residual | \[AdjR_{ij} = \frac{R_{ij}}{\sqrt{E_{ij}\left(1-\frac{r_i}{N}\right)\left(1-\frac{c_j}{N}\right)}}\] |
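The cell statistics in the table above can be sketched in plain Python. This is a minimal illustration, not Origin's implementation: the function name `cell_statistics` and the dictionary keys are hypothetical, and the sketch assumes every row and column subtotal is non-zero.

```python
from math import sqrt

def cell_statistics(table):
    """Per-cell statistics for an R x C table given as a list of rows of counts."""
    row_totals = [sum(row) for row in table]          # r_i
    col_totals = [sum(col) for col in zip(*table)]    # c_j
    n = sum(row_totals)                               # N
    result = []
    for i, row in enumerate(table):
        out = []
        for j, f in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n   # E_ij
            residual = f - expected                        # R_ij
            out.append({
                "count": f,
                "expected": expected,
                "row_pct": 100 * f / row_totals[i],
                "col_pct": 100 * f / col_totals[j],
                "total_pct": 100 * f / n,
                "residual": residual,
                "std_residual": residual / sqrt(expected),
                "adj_residual": residual / sqrt(
                    expected * (1 - row_totals[i] / n) * (1 - col_totals[j] / n)
                ),
            })
        result.append(out)
    return result

cells = cell_statistics([[10, 20], [30, 40]])
print(cells[0][0])
```

For the 2 x 2 example, the first cell has row subtotal 30, column subtotal 40 and \(N = 100\), so its expected count is 12 and its residual is -2.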
Chi-Square Statistics
| Statistics | Formula and Explanation | Degrees of Freedom |
|---|---|---|
| Pearson Chi-Square | \[\chi_p^2 = \sum_{ij} \frac{(f_{ij}-E_{ij})^2}{E_{ij}}\] | \[(R-1)(C-1)\] |
| Likelihood Ratio | \[\chi_{LR}^2 = -2\sum_{ij} f_{ij} \ln (E_{ij}/f_{ij})\] | \[(R-1)(C-1)\] |
| Linear Association | \(\chi_{LA}^2 = (N-1)r^2\), where \(r\) is the Pearson correlation coefficient. | \[1\] |
| Continuity Correction | \(\chi_C^2 = \frac{N(|f_{11}f_{22}-f_{12}f_{21}|-0.5N)^2}{r_1r_2c_1c_2} I(|f_{11}f_{22}-f_{12}f_{21}|>0.5N)\), which is calculated only for 2 x 2 tables | \[1\] |
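As a rough sketch, the Pearson, likelihood ratio and continuity-corrected statistics can be computed as follows. The function name `chi_square_statistics` is hypothetical, and the treatment of empty cells in the likelihood ratio (they contribute zero) is an assumption made here, not something the formulas above specify.

```python
from math import log

def chi_square_statistics(table):
    """Chi-square statistics for an R x C table given as a list of rows of counts."""
    rows, cols = len(table), len(table[0])
    r = [sum(row) for row in table]           # row subtotals r_i
    c = [sum(col) for col in zip(*table)]     # column subtotals c_j
    n = sum(r)                                # total count N
    pearson = 0.0
    likelihood = 0.0
    for i in range(rows):
        for j in range(cols):
            f = table[i][j]
            e = r[i] * c[j] / n               # expected count E_ij
            pearson += (f - e) ** 2 / e
            if f > 0:                         # assumption: empty cells contribute 0
                likelihood += -2 * f * log(e / f)
    stats = {
        "pearson": pearson,
        "likelihood_ratio": likelihood,
        "dof": (rows - 1) * (cols - 1),
    }
    if rows == 2 and cols == 2:
        d = abs(table[0][0] * table[1][1] - table[0][1] * table[1][0])
        # The indicator I(d > 0.5 N) sets the statistic to 0 otherwise.
        stats["continuity"] = (
            n * (d - 0.5 * n) ** 2 / (r[0] * r[1] * c[0] * c[1]) if d > 0.5 * n else 0.0
        )
    return stats

print(chi_square_statistics([[10, 20], [30, 40]]))
```

The linear association statistic is omitted because it needs the Pearson correlation of the underlying row and column values, not only the counts.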
Fisher's Exact Test
This test is useful when some expected cell counts are low (less than 5). It is calculated only for 2 x 2 tables. Suppose the table is as follows:
| | \[X_1\] | \[X_2\] | Subtotal/Total |
|---|---|---|---|
| \[Y_1\] | \[n_1\] | \[n_3\] | \[n_1+n_3\] |
| \[Y_2\] | \[n_2\] | \[n_4\] | \[n_2+n_4\] |
| Subtotal/Total | \[n_1+n_2\] | \[n_3+n_4\] | \[N\] |
Under the null hypothesis of independence, the count in the first cell, \(N_1\), follows a hypergeometric distribution with probability
\(Pr(N_1=n_1) = \frac{(n_1+n_2)!(n_3+n_4)!(n_1+n_3)!(n_2+n_4)!}{N!n_1!n_2!n_3!n_4!}\), where \(N_1\) takes values in the range \(\max(0,n_1-n_4)\leq N_1 \leq \min(n_1+n_2,n_1+n_3)\).
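One-Sided Test
The one-sided significance levels are calculated by
- p(left-sided test) = \(Pr(N_1\leq n_1)\)
- p(right-sided test) = \(Pr(N_1\geq n_1)\)
Two-Sided Test
The two-sided significance level is
\[p_2 = p_1 + p_3\]
where
- \(p_{1}= Pr(N_1\leq n_1)\), if \(n_{1}\leq (n_{1}+n_{2})(n_{1}+n_{3})/N\)
- \(p_{1}= Pr(N_1\geq n_1)\), if \(n_{1}>(n_{1}+n_{2})(n_{1}+n_{3})/N\)
- \(p_3\) is the total probability of the outcomes in the opposite tail that are no more likely than the observed one: \[p_3 = \sum_{x \in T,\; Pr(N_1=x) \leq Pr(N_1=n_1)} Pr(N_1=x)\] where \(T\) is the set of integers from \(n_1+1\) to \(\min(n_1+n_2,n_1+n_3)\) when \(p_1\) is the left-tail probability, and from \(\max(0,n_1-n_4)\) to \(n_1-1\) when \(p_1\) is the right-tail probability.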
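The one-sided and two-sided probabilities can be sketched directly from the hypergeometric distribution. The function name `fisher_exact_2x2` is hypothetical. Two assumptions are made in this sketch: when the observed count lies above its expectation, the opposite tail is taken to be the values below \(n_1\), and a small relative tolerance is used when comparing probabilities so that ties are not lost to floating-point rounding.

```python
from math import comb

def fisher_exact_2x2(n1, n2, n3, n4):
    """Return (left-sided p, right-sided p, two-sided p) for the 2 x 2 layout above."""
    n = n1 + n2 + n3 + n4
    col1 = n1 + n2                  # subtotal of X_1
    row1 = n1 + n3                  # subtotal of Y_1
    lo = max(0, n1 - n4)            # smallest possible value of N_1
    hi = min(col1, row1)            # largest possible value of N_1

    def pmf(x):
        # Hypergeometric probability Pr(N_1 = x) with fixed margins.
        return comb(col1, x) * comb(n - col1, row1 - x) / comb(n, row1)

    p_left = sum(pmf(x) for x in range(lo, n1 + 1))
    p_right = sum(pmf(x) for x in range(n1, hi + 1))
    p_obs = pmf(n1)

    if n1 <= col1 * row1 / n:
        p1, opposite = p_left, range(n1 + 1, hi + 1)
    else:
        p1, opposite = p_right, range(lo, n1)
    # Assumption: relative tolerance so equal probabilities count as "<=".
    p3 = sum(pmf(x) for x in opposite if pmf(x) <= p_obs * (1 + 1e-9))
    return p_left, p_right, min(1.0, p1 + p3)

left, right, two_sided = fisher_exact_2x2(3, 1, 1, 3)
print(left, right, two_sided)
```

For the counts 3, 1, 1, 3 the possible values of \(N_1\) are 0 to 4 with probabilities 1/70, 16/70, 36/70, 16/70 and 1/70, so the right-sided probability is 17/70 and the two-sided probability is 34/70.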
Measures of Association
Define
- \[D_r = N^2 - \sum_{i=1}^{R}r_i^2\]
- \[D_c = N^2 - \sum_{j=1}^{C}c_j^2\]
- \[C_{ij} = \sum_{h<i}\sum_{k<j}f_{hk}+\sum_{h>i}\sum_{k>j}f_{hk}\]
- \[D_{ij} = \sum_{h<i}\sum_{k>j}f_{hk}+\sum_{h>i}\sum_{k<j}f_{hk}\]
- \[P = \sum_{ij}f_{ij}C_{ij}\]
- \[Q = \sum_{ij}f_{ij}D_{ij}\]
- \(r_i = \sum_{j=1}^{C}f_{ij}\) is the subtotal of the \(i\)th row
- \(c_j = \sum_{i=1}^{R}f_{ij}\) is the subtotal of the \(j\)th column
- \(N = \sum_{j=1}^{C}c_j = \sum_{i=1}^{R}r_i\) is the total count.
| Statistics | | Formula and Explanation | Standard Error |
|---|---|---|---|
| Phi Coefficient | | \(\phi = \sqrt{\chi_p^2/N}\) for tables other than 2 x 2, with values in \([0,M]\), where \(M = \min(\sqrt{R-1},\sqrt{C-1})\). For a 2 x 2 table, \(\phi\) is equal to the Pearson correlation coefficient \(r\). | |
| Cramer's V | | \[V = \sqrt{\frac{\chi_p^2}{N\min\{R-1,C-1\}}}\] | |
| Contingency Coefficient | | \[CC = \sqrt{\frac{\chi_p^2}{\chi_p^2+N}}\] | |
| Gamma | | \[\gamma = \frac{P-Q}{P+Q}\] | \[\frac{2}{P+Q}\sqrt{\sum_{ij}f_{ij}(C_{ij}-D_{ij})^2-\frac{1}{N}(P-Q)^2}\] |
| Kendall | Tau-b | \[\tau_b = \frac{P-Q}{\sqrt{D_rD_c}}\] | \[2\sqrt{\frac{1}{D_rD_c}\left[\sum_{ij}f_{ij}(C_{ij}-D_{ij})^2-\frac{1}{N}(P-Q)^2\right]}\] |
| | Tau-c | \(\tau_c = \frac{(P-Q)q}{N^2(q-1)}\), where \(q = \min\{R,C\}\) | \[\frac{2q}{N^2(q-1)}\sqrt{\sum_{ij}f_{ij}(C_{ij}-D_{ij})^2-\frac{1}{N}(P-Q)^2}\] |
| Somers' D | C \(\mid\) R | \[d_{C \mid R} = \frac{P-Q}{D_r}\] | \[\frac{2}{D_r}\sqrt{\sum_{ij}f_{ij}(C_{ij}-D_{ij})^2-\frac{1}{N}(P-Q)^2}\] |
| | R \(\mid\) C | \[d_{R \mid C} = \frac{P-Q}{D_c}\] | \[\frac{2}{D_c}\sqrt{\sum_{ij}f_{ij}(C_{ij}-D_{ij})^2-\frac{1}{N}(P-Q)^2}\] |
| | Symmetric | \[d = 2\frac{P-Q}{D_c+D_r}\] | \[\frac{4}{D_c+D_r}\sqrt{\sum_{ij}f_{ij}(C_{ij}-D_{ij})^2-\frac{1}{N}(P-Q)^2}\] |
| Lambda | C \(\mid\) R | \(\lambda_{C \mid R} = \frac{1}{N-c_m}\left(\sum_{i=1}^{R}f_{im}-c_m\right)\), where \(f_{im}\) is the largest count in the \(i\)th row and \(c_m\) is the largest column subtotal. | \(\sqrt{ \frac{ N - \displaystyle\sum_{i=1}^{R} f_{im} }{ (N-c_m)^3 } \left(\sum_{i=1}^{R} f_{im} + c_m -2\sum_{i=1}^{R} (f_{im} \mid l_i=l) \right) }\), where \(l_i\) is the column index of \(f_{im}\) and \(l\) is the column index of the subtotal \(c_m\). |
| | R \(\mid\) C | \(\lambda_{R \mid C} = \frac{1}{N-r_m}\left(\sum_{j=1}^{C}f_{mj}-r_m\right)\), where \(f_{mj}\) is the largest count in the \(j\)th column and \(r_m\) is the largest row subtotal. | \(\sqrt{ \frac{ N - \displaystyle\sum_{j=1}^{C} f_{mj} }{ (N-r_m)^3 } \left(\sum_{j=1}^{C} f_{mj} + r_m -2\sum_{j=1}^{C} (f_{mj} \mid k_j=k) \right) }\), where \(k_j\) is the row index of \(f_{mj}\) and \(k\) is the row index of the subtotal \(r_m\). |
| | Symmetric | \[\lambda = \frac { \displaystyle \sum_{i=1}^{R}f_{im} + \sum_{j=1}^{C}f_{mj} - c_m - r_m }{2N-r_m-c_m}\] | \(\frac{1}{w^2} \sqrt{ wvy - 2w^2\left( N-\sum_{i=1}^{R} (f_{im} \mid i=k_{l_i}) \right) - 2v^2(N-f_{kl}) }\), where \(w=2N-r_m-c_m\), \(v = 2N - \sum_{i=1}^{R}f_{im} - \sum_{j=1}^{C}f_{mj}\), \(x = \sum_{i=1}^R (f_{im} \mid l_i=l) + \sum_{j=1}^C (f_{mj} \mid k_j=k) + f_{km} + f_{ml}\), and \(y = 8N - w - v - 2x\). |
| Uncertainty | C \(\mid\) R | \(U_{C \mid R} = \frac{U(X)+U(Y)-U(XY)}{U(Y)}\), where \(U(X) = -\sum_{i=1}^{R}\frac{r_i}{N}\ln\frac{r_i}{N}\), \(U(Y) = -\sum_{j=1}^{C}\frac{c_j}{N}\ln\frac{c_j}{N}\), and \(U(XY) = -\sum_{ij}\frac{f_{ij}}{N}\ln\frac{f_{ij}}{N}\) | \(\frac{1}{NU(Y)}\sqrt{P-N\left(U(X)+U(Y)-U(XY)\right)^2}\), where, for the uncertainty coefficients only, \(P = \sum_{ij}f_{ij}\left[\ln\left(\frac{r_ic_j}{f_{ij}N}\right)\right]^2\) |
| | R \(\mid\) C | \[U_{R \mid C} = \frac{U(X)+U(Y)-U(XY)}{U(X)}\] | \[\frac{1}{NU(X)}\sqrt{P-N\left(U(X)+U(Y)-U(XY)\right)^2}\] |
| | Symmetric | \[U = 2\frac{U(X)+U(Y)-U(XY)}{U(X)+U(Y)}\] | \[\frac{2}{N(U(X)+U(Y))}\sqrt{P-N\left(U(X)+U(Y)-U(XY)\right)^2}\] |
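The measures built on concordant and discordant pairs share the quantities \(P\), \(Q\), \(D_r\) and \(D_c\). The following is a minimal sketch of how they might be computed; the function name `ordinal_measures` and the returned keys are hypothetical, and no guard is included for degenerate tables in which a denominator is zero.

```python
from math import sqrt

def ordinal_measures(table):
    """Gamma, Kendall's tau-b and tau-c, and Somers' D from a table of counts."""
    rows, cols = len(table), len(table[0])
    r = [sum(row) for row in table]
    c = [sum(col) for col in zip(*table)]
    n = sum(r)
    p = q = 0   # P (concordant) and Q (discordant) from the definitions above
    for i in range(rows):
        for j in range(cols):
            conc = sum(table[h][k] for h in range(rows) for k in range(cols)
                       if (h < i and k < j) or (h > i and k > j))   # C_ij
            disc = sum(table[h][k] for h in range(rows) for k in range(cols)
                       if (h < i and k > j) or (h > i and k < j))   # D_ij
            p += table[i][j] * conc
            q += table[i][j] * disc
    d_r = n * n - sum(x * x for x in r)
    d_c = n * n - sum(x * x for x in c)
    m = min(rows, cols)   # written q in the tau-c formula
    return {
        "P": p,
        "Q": q,
        "gamma": (p - q) / (p + q),
        "tau_b": (p - q) / sqrt(d_r * d_c),
        "tau_c": (p - q) * m / (n * n * (m - 1)),
        "somers_c_given_r": (p - q) / d_r,
        "somers_r_given_c": (p - q) / d_c,
        "somers_symmetric": 2 * (p - q) / (d_c + d_r),
    }

print(ordinal_measures([[10, 20], [30, 40]]))
```

For a 2 x 2 table, gamma reduces to \((f_{11}f_{22}-f_{12}f_{21})/(f_{11}f_{22}+f_{12}f_{21})\), which gives a quick sanity check: -0.2 for the example.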
Measures of Agreement
This table is calculated only when two conditions are satisfied: (1) the table is square, i.e. \(R=C\), and (2) the row variable and the column variable have the same set of values.
The Kappa statistic is calculated by
- \[ \kappa = \frac{N\sum_{i=1}^{R}f_{ii} - \sum_{i=1}^{R}r_ic_i}{N^2 - \sum_{i=1}^{R}r_ic_i}\]
The standard error is estimated by
- \[SE_1 = \frac{1}{1-p_e} \sqrt{ \frac{A+B-C}{N} }\]
where \(p_e = \frac{ \sum_{i=1}^R r_i c_i }{ N^2 }\), \( A = \sum_{i=1}^R \frac{f_{ii}}{N} \left( 1-\frac{(r_i+c_i)(1- \kappa)}{N} \right)^2\),
\(B = (1-\kappa)^2 \sum_{i=1}^R \sum_{j=1, j \ne i}^{C} \frac{f_{ij} (c_i+r_j)^2}{N^3}\), and \(C = \Bigl( \kappa - p_e( 1-\kappa ) \Bigr)^2\). Here \(C\) denotes this term only, not the number of columns.
The corresponding asymptotic standard error under the null hypothesis \(\kappa = 0\) is given by
- \[SE_0 = \sqrt{\frac{1}{N\left(N^2 - \sum_{i=1}^{R}r_ic_i\right)^2} \left[N^2\sum_{i=1}^{R}r_ic_i + \left(\sum_{i=1}^{R}r_ic_i\right)^2 - N \sum_{i=1}^{R}r_ic_i(r_i+c_i)\right]}\]
Another related statistic is Bowker's test of symmetry, which is used to test \(H_0: p_{ij} = p_{ji}\) for all pairs \((i,j)\). If \(R>2\), the statistic is calculated as
- \[Bo = \sum_{i=2}^{R} \sum_{j=1}^{i-1}\frac{(f_{ij}-f_{ji})^2}{f_{ij}+f_{ji}}\]
For large samples, \(Bo\) asymptotically follows a chi-square distribution with \(0.5R(R-1)\) degrees of freedom.
Note that for a 2 x 2 table, Bowker's test is equivalent to McNemar's test, so only Bowker's test is given.
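A short sketch of the Kappa statistic, its standard error under \(\kappa = 0\), and Bowker's statistic follows. The function name `kappa_and_bowker` is hypothetical, and skipping off-diagonal pairs whose combined count is zero is an assumption of this sketch, made to avoid division by zero.

```python
from math import sqrt

def kappa_and_bowker(table):
    """Kappa, SE_0 and Bowker's statistic for a square table of counts."""
    size = len(table)                                  # R = C
    r = [sum(row) for row in table]
    c = [sum(col) for col in zip(*table)]
    n = sum(r)
    diag = sum(table[i][i] for i in range(size))       # sum of f_ii
    rc = sum(r[i] * c[i] for i in range(size))         # sum of r_i c_i
    kappa = (n * diag - rc) / (n * n - rc)
    se0 = sqrt(
        (n * n * rc + rc * rc - n * sum(r[i] * c[i] * (r[i] + c[i]) for i in range(size)))
        / (n * (n * n - rc) ** 2)
    )
    bowker = 0.0
    for i in range(size):
        for j in range(i):                             # pairs with j < i
            total = table[i][j] + table[j][i]
            if total > 0:                              # assumption: skip empty pairs
                bowker += (table[i][j] - table[j][i]) ** 2 / total
    return {"kappa": kappa, "se0": se0, "bowker": bowker, "dof": size * (size - 1) // 2}

print(kappa_and_bowker([[20, 5], [10, 15]]))
```

In the example the observed agreement is 35/50 = 0.7 and the chance agreement is 1250/2500 = 0.5, so \(\kappa = 0.4\); Bowker's statistic is \((10-5)^2/15\).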
Odds Ratio and Relative Risk
These statistics are calculated only for 2 x 2 tables.
Odds Ratio
The Odds Ratio is calculated as
\[OR = \frac{f_{11}f_{22}}{f_{12}f_{21}}\]
Relative Risk
The Relative Risks are given by
- \[P(Y_1|X_1)/P(Y_1|X_2) = \frac{f_{11}(f_{21}+f_{22})}{f_{21}(f_{11}+f_{12})}\]
- \[P(Y_1|X_2)/P(Y_1|X_1) = \frac{f_{21}(f_{11}+f_{12})}{f_{11}(f_{21}+f_{22})}\]
- \[P(Y_2|X_1)/P(Y_2|X_2) = \frac{f_{12}(f_{21}+f_{22})}{f_{22}(f_{12}+f_{11})}\]
- \[P(Y_2|X_2)/P(Y_2|X_1) = \frac{f_{22}(f_{12}+f_{11})}{f_{12}(f_{21}+f_{22})}\]
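The odds ratio and the four relative risks can be sketched as below. The function name and dictionary keys are hypothetical, and the sketch assumes all four cell counts are positive so that no denominator is zero.

```python
def odds_ratio_and_relative_risks(table):
    """Odds ratio and relative risks for a 2 x 2 table [[f11, f12], [f21, f22]]."""
    (f11, f12), (f21, f22) = table
    r1 = f11 + f12      # subtotal of row X_1
    r2 = f21 + f22      # subtotal of row X_2
    return {
        "odds_ratio": (f11 * f22) / (f12 * f21),
        "rr_y1_x1_over_x2": (f11 * r2) / (f21 * r1),   # P(Y1|X1) / P(Y1|X2)
        "rr_y1_x2_over_x1": (f21 * r1) / (f11 * r2),   # P(Y1|X2) / P(Y1|X1)
        "rr_y2_x1_over_x2": (f12 * r2) / (f22 * r1),   # P(Y2|X1) / P(Y2|X2)
        "rr_y2_x2_over_x1": (f22 * r1) / (f12 * r2),   # P(Y2|X2) / P(Y2|X1)
    }

print(odds_ratio_and_relative_risks([[10, 20], [30, 40]]))
```

Each pair of relative risks consists of reciprocals, so their product is 1.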
Cochran-Mantel-Haenszel
Define
- \(K\) is the number of layers
- \(f_{ijk}\) is the frequency in the \(i\)th row, \(j\)th column and \(k\)th layer
- \(c_{jk} = \sum_{i=1}^{R} f_{ijk}\) is the subtotal of the \(j\)th column in the \(k\)th layer
- \(r_{ik} = \sum_{j=1}^{C} f_{ijk}\) is the subtotal of the \(i\)th row in the \(k\)th layer
- \(n_{k} = \sum_{i=1}^{R}\sum_{j=1}^{C} f_{ijk}\) is the total of the \(k\)th layer
- \(E_{ijk} = \frac{r_{ik}c_{jk}}{n_k}\) is the expected frequency of the cell in the \(i\)th row, \(j\)th column and \(k\)th layer
- \[\hat{p}_{ik} = \frac{f_{i1k}}{r_{ik}}, d_k = \hat{p}_{1k} - \hat{p}_{2k}, \hat{p}_{k} = \frac{c_{1k}}{n_{k}}\]
Mantel-Haenszel statistic
The Mantel-Haenszel statistic is given by
\[MH = \left(\sum_{k=1}^{K}\frac{r_{1k}r_{2k}}{n_k-1} \hat{p}_{k}(1-\hat{p}_{k}) \right)^{-1/2}\left(\Big|\sum_{k=1}^{K} (f_{11k}-E_{11k})\Big|-0.5\right)\operatorname{sgn}\left(\sum_{k=1}^{K} (f_{11k}-E_{11k})\right)\]
where \(\operatorname{sgn}\) is the sign function, \(\operatorname{sgn}(x) = I(x>0)-I(x<0)\), so that \(\operatorname{sgn}(0)=0\).
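The Mantel-Haenszel statistic can be sketched as a single pass over the layers. The function name `mantel_haenszel_statistic` is hypothetical; each layer is assumed to be a 2 x 2 table given as `[[f11, f12], [f21, f22]]` with more than one observation.

```python
from math import sqrt

def mantel_haenszel_statistic(layers):
    """Continuity-corrected Mantel-Haenszel statistic for a list of 2 x 2 layers."""
    diff = 0.0      # sum over k of (f_11k - E_11k)
    var = 0.0       # sum over k of r_1k r_2k / (n_k - 1) * p_k (1 - p_k)
    for (f11, f12), (f21, f22) in layers:
        r1, r2 = f11 + f12, f21 + f22
        c1 = f11 + f21
        n = r1 + r2
        p = c1 / n                          # p-hat_k
        diff += f11 - r1 * c1 / n
        var += r1 * r2 / (n - 1) * p * (1 - p)
    sign = (diff > 0) - (diff < 0)          # sgn(diff); 0 when diff == 0
    return (abs(diff) - 0.5) * sign / sqrt(var)

layers = [[[10, 20], [30, 40]], [[10, 20], [30, 40]]]
print(mantel_haenszel_statistic(layers))
```

With two identical layers, each contributes \(f_{11k}-E_{11k} = -2\), so the summed difference is -4 and the statistic is negative.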
Breslow-Day statistic
The Breslow-Day statistic is
\[BD = \sum_{k=1}^{K} V_k \left[f_{11k}-\hat{f}_{11k}\right]^2\]
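where \(V_k = \frac{1}{\hat{f}_{11k}}+\frac{1}{\hat{f}_{12k}}+\frac{1}{\hat{f}_{21k}}+\frac{1}{\hat{f}_{22k}}\), and \(\hat{f}_{ijk}\) is the expected frequency of cell \((i,j)\) in the \(k\)th layer under the hypothesis of a common odds ratio.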
Tarone’s Statistic
Tarone’s statistic is
- \[T = \sum_{k=1}^{K} V_k \left[f_{11k}-\hat{f}_{11k}\right]^2- \frac{\left[\sum_{k=1}^{K}\left(f_{11k}-\hat{f}_{11k}\right)\right]^2}{\sum_{k=1}^{K}\frac {1}{V_k} }\]
where \(V_k = \frac{1}{\hat{f}_{11k}}+\frac{1}{\hat{f}_{12k}}+\frac{1}{\hat{f}_{21k}}+\frac{1}{\hat{f}_{22k}}\).
Common Odds Ratio
For a 2×2×K table, let \(OR_{k}\) denote the odds ratio at the \(k\)th layer. Assuming that a true common odds ratio exists, that is \(OR_{1}=OR_{2}=\cdots=OR_{K}\), the Mantel-Haenszel estimator of the common odds ratio is
- \[\widehat{OR}_{MH}=\frac{\sum_{k=1}^{K}\frac{f_{11k} f_{22k}}{n_{k}}}{\sum_{k=1}^{K}\frac{f_{12k} f_{21k}}{n_{k}}}\]
The asymptotic variance of \(\ln(\widehat{OR}_{MH})\) is
- \[\widehat{Var}[\ln(\widehat{OR}_{MH})]=\frac{\sum_{k=1}^{K}\frac{(f_{11k}+f_{22k})f_{11k} f_{22k}}{n_{k}^2}}{2\left(\sum_{k=1}^{K}\frac{f_{11k} f_{22k}}{n_{k}}\right)^2}+\frac{\sum_{k=1}^{K}\frac{(f_{11k}+f_{22k})f_{12k} f_{21k}+(f_{12k}+f_{21k})f_{11k} f_{22k}}{n_{k}^2}}{2\sum_{k=1}^{K}\frac{f_{11k} f_{22k}}{n_{k}}\sum_{k=1}^{K}\frac{f_{12k} f_{21k}}{n_{k}}}+\frac{\sum_{k=1}^{K}\frac{(f_{12k}+f_{21k})f_{12k} f_{21k}}{n_{k}^2}}{2\left(\sum_{k=1}^{K}\frac{f_{12k} f_{21k}}{n_{k}}\right)^2}\]
The lower confidence limit (LCL) and upper confidence limit (UCL) for \(\ln(\widehat{OR}_{MH})\) are
- \(\ln(\widehat{OR}_{MH})-z(\alpha/2)\sqrt{\widehat{Var}[\ln(\widehat{OR}_{MH})]}\) and \(\ln(\widehat{OR}_{MH})+z(\alpha/2)\sqrt{\widehat{Var}[\ln(\widehat{OR}_{MH})]}\), where \(z(\alpha/2)\) is the upper \(\alpha/2\) critical value of the standard normal distribution.
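The common odds ratio estimate and its confidence limits can be sketched as follows. The function name `mantel_haenszel_odds_ratio` and the local variable names are hypothetical. The variance is written in the Robins-Breslow-Greenland form, in which the sums in the first and third denominators are squared; with a single layer it reduces to \(1/f_{11}+1/f_{12}+1/f_{21}+1/f_{22}\), which serves as a check.

```python
from math import log, sqrt
from statistics import NormalDist

def mantel_haenszel_odds_ratio(layers, alpha=0.05):
    """Common odds ratio, variance of its log, and confidence limits on the log scale."""
    sum_r = sum_s = 0.0             # sums of f11*f22/n and f12*f21/n
    sum_pr = sum_mix = sum_qs = 0.0 # numerators of the three variance terms
    for (f11, f12), (f21, f22) in layers:
        n = f11 + f12 + f21 + f22
        p_k = (f11 + f22) / n
        q_k = (f12 + f21) / n
        r_k = f11 * f22 / n
        s_k = f12 * f21 / n
        sum_r += r_k
        sum_s += s_k
        sum_pr += p_k * r_k
        sum_mix += p_k * s_k + q_k * r_k
        sum_qs += q_k * s_k
    or_mh = sum_r / sum_s
    var_log = (sum_pr / (2 * sum_r ** 2)
               + sum_mix / (2 * sum_r * sum_s)
               + sum_qs / (2 * sum_s ** 2))
    z = NormalDist().inv_cdf(1 - alpha / 2)   # upper alpha/2 critical value
    half_width = z * sqrt(var_log)
    return or_mh, var_log, (log(or_mh) - half_width, log(or_mh) + half_width)

or_mh, var_log, (lcl, ucl) = mantel_haenszel_odds_ratio([[[10, 20], [30, 40]]])
print(or_mh, var_log, lcl, ucl)
```

The limits returned are for \(\ln(\widehat{OR}_{MH})\); exponentiating them would give limits on the odds ratio scale.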