Inferential Statistics (4) - The Association Between Categorical Variables
1. Independence and Dependence (Association)
- Inferential statistics
  - Single sample/population: Significance testing & Confidence intervals.
  - Comparing two proportions/means: Significance testing & Confidence intervals.
  - Association between two categorical variables or two quantitative variables; for two categorical variables: the chi-squared test.
- r-square & chi-square
  - r-square: describes the relationship between two variables in the sample; it cannot be used to draw inferences about the population.
  - chi-square: used to infer the relationship between the variables in the population.
- chi-square
  - The chi-squared test is a widely used hypothesis test for count data. It belongs to the family of non-parametric tests and is mainly used to compare two or more sample proportions (composition ratios) and to analyze the association between two categorical variables. Its core idea is to assess how well the observed frequencies agree with the theoretical (expected) frequencies, i.e., goodness of fit.
  - Its applications in statistical inference for categorical data include: the chi-squared test comparing two proportions or composition ratios, the chi-squared test comparing several proportions or composition ratios, and association analysis of categorical data.
2. Testing Categorical Variables for Independence
- Expected cell count
  \[Expected \ cell \ count = \frac{(Row \ total) \times (Column \ total)}{Total \ sample \ size}\]
- Chi-squared Test Statistic (see the sketch after this list)
  \[\chi^2 = \sum{\frac{(observed \ count - expected \ count)^2}{expected \ count}}\]
- Chi-squared Distribution
  - Always positive.
  - Degrees of freedom from the numbers of rows and columns: \(df = (r-1)(c-1)\).
  - Mean equals \(df\); standard deviation equals \(\sqrt{2 \, df}\).
  - As \(df\) increases, the distribution approaches a bell shape.
  - A large \(\chi^2\) provides evidence against independence.
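As a quick check of the formulas above, here is a minimal Python sketch (the 2×3 table and its counts are invented for illustration) that computes the expected counts, the χ² statistic, its df, and the right-tail P-value by hand:

```python
import numpy as np
from scipy import stats

# Hypothetical 2x3 contingency table (rows: two groups, columns: three response categories).
observed = np.array([[30, 45, 25],
                     [20, 25, 55]])

row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
n = observed.sum()

# Expected cell count = (row total) * (column total) / (total sample size)
expected = row_totals @ col_totals / n

# Chi-squared statistic: sum over all cells of (observed - expected)^2 / expected
chi2 = ((observed - expected) ** 2 / expected).sum()

# Degrees of freedom: (r - 1) * (c - 1)
df = (observed.shape[0] - 1) * (observed.shape[1] - 1)

# Right-tail probability above the observed chi2 value
p_value = stats.chi2.sf(chi2, df)

print("expected counts:\n", np.round(expected, 1))
print(f"chi2 = {chi2:.2f}, df = {df}, P-value = {p_value:.4f}")
```

The same numbers can be reproduced with `scipy.stats.chi2_contingency(observed, correction=False)`.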
- Five Steps of the Chi-Squared Test
  - Assumptions
    - Two categorical variables
    - Randomization, such as
      - random sampling, or
      - a randomized experiment
  - Hypotheses
    - H0: The two variables are independent.
    - Ha: The two variables are dependent (associated).
  - Test statistic: \(\chi^2 = \sum{\frac{(observed \ count - expected \ count)^2}{expected \ count}}\), where
    \[Expected \ cell \ count = \frac{(Row \ total) \times (Column \ total)}{Total \ sample \ size}\]
  - P-value
    - Right-tail probability above the observed \(\chi^2\) value, for the chi-squared distribution with \(df = (r-1)(c-1)\).
  - Conclusion
    - Reject H0 when the P-value ≤ significance level.
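The five steps above map directly onto `scipy.stats.chi2_contingency`; a minimal sketch on an invented 2×2 table (treatment vs. control, success vs. failure):

```python
from scipy.stats import chi2_contingency

# Hypothetical 2x2 table; the counts are made up purely for illustration.
table = [[60, 40],   # treatment: success, failure
         [45, 55]]   # control:   success, failure

chi2, p_value, df, expected = chi2_contingency(table, correction=False)

print(f"chi2 = {chi2:.2f}, df = {df}, P-value = {p_value:.4f}")
print("expected counts:", expected)

# Conclusion: reject H0 (independence) when the P-value <= significance level.
alpha = 0.05
print("Reject H0" if p_value <= alpha else "Do not reject H0")
```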
- Misuse of the Chi-Squared Test
  - Some misuses are interpreting:
    - a small P-value as automatically providing evidence for a strong and practically meaningful association.
    - a large P-value as providing evidence for independence.
  - Other misuses are applying the chi-squared test:
    - when some of the expected frequencies are too small.
    - when separate rows or columns are dependent samples (use McNemar’s test instead), such as when each row of the table refers to the same subjects.
    - to data that do not result from a random sample or randomized experiment.
    - to data obtained by classifying quantitative variables into categories. This results in a loss of information; it is usually more appropriate to analyze such data with methods for quantitative variables.
3. The Strength of the Association
How strong is the association?
- Measure of association
  - A measure of association is a statistic or a parameter that summarizes the strength of the dependence between two variables.
- Difference of Proportions: \(p_1 - p_2\)
- Relative Risk: the ratio of the two proportions, \(p_1 / p_2\)
  - The relative risk is often more informative than the difference of proportions for comparing proportions that are both close to 0.
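A minimal sketch (with invented counts) that computes both measures from a hypothetical 2×2 table; note how proportions near 0 give a tiny difference but a relative risk of about 2:

```python
import numpy as np

# Hypothetical 2x2 table: rows = group 1 / group 2, columns = "yes" / "no".
table = np.array([[30, 970],
                  [15, 985]])

# Proportion of "yes" in each row.
p1, p2 = table[:, 0] / table.sum(axis=1)

difference = p1 - p2          # difference of proportions
relative_risk = p1 / p2       # relative risk

print(f"p1 = {p1:.3f}, p2 = {p2:.3f}")
print(f"difference = {difference:.3f}, relative risk = {relative_risk:.2f}")
```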
- Odds and Odds Ratio
  - The odds of success is \(p/(1-p)\). This equals \(p_1/(1-p_1)\) in the first row and \(p_2/(1-p_2)\) in the second row.
  - The odds ratio is then the ratio of these two odds: \(odds \ ratio = \frac{p_1/(1-p_1)}{p_2/(1-p_2)}\)
  - Values farther from 1 represent stronger associations.
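Continuing the same hypothetical table, the odds ratio is a one-liner; for a 2×2 table of counts it also equals the cross-product ratio (a·d)/(b·c):

```python
import numpy as np

# Same hypothetical 2x2 table as above (rows = groups, columns = "yes" / "no").
table = np.array([[30, 970],
                  [15, 985]])
p1, p2 = table[:, 0] / table.sum(axis=1)

# Odds in each row are p / (1 - p); the odds ratio is their quotient.
odds_ratio = (p1 / (1 - p1)) / (p2 / (1 - p2))
print(f"odds ratio = {odds_ratio:.2f}")  # equals (30 * 985) / (970 * 15)
```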
- Residual Analysis
  - Residual: the difference between an observed and expected count in a particular cell.
  - The standardized residual for a cell equals \((observed \ count - expected \ count)/se\).
  - se denotes the standard error of the sampling distribution of (observed count - expected count) when the variables are independent: \(se = \sqrt{expected \ count \times (1 - prob_{r \cdot})(1 - prob_{\cdot c})}\), where \(prob_{r \cdot}\) and \(prob_{\cdot c}\) are the row and column marginal proportions for that cell.
  - A standardized residual larger than 3 in absolute value provides strong evidence against independence in that cell.
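A minimal sketch computing standardized residuals for an invented 2×3 table, following the se formula above:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 2x3 table; counts invented for illustration.
observed = np.array([[30, 45, 25],
                     [20, 25, 55]])

n = observed.sum()
row_prop = observed.sum(axis=1, keepdims=True) / n   # row marginal proportions
col_prop = observed.sum(axis=0, keepdims=True) / n   # column marginal proportions

_, _, _, expected = chi2_contingency(observed, correction=False)

# se = sqrt(expected * (1 - row proportion) * (1 - column proportion))
se = np.sqrt(expected * (1 - row_prop) * (1 - col_prop))
standardized_residuals = (observed - expected) / se

# Cells with |standardized residual| > 3 give strong evidence against independence.
print(np.round(standardized_residuals, 2))
```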
- Cramér’s V
  - The chi-squared test says that there is a significant relationship between the variables, but it does not say how strong and important that relationship is. Cramér’s V is a post-test that gives this additional information.
  - Cramér’s V is a number between 0 and 1 that indicates how strongly two categorical variables are associated.
  - Formula
    \[\phi_c = \sqrt{\frac{\chi^2}{N(k - 1)}}\]
    - \(\phi_c\) denotes Cramér’s V;
    - \(\chi^2\) is the Pearson chi-squared statistic from the aforementioned test;
    - \(N\) is the sample size involved in the test;
    - \(k\) is the lesser of the number of categories of either variable.
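A minimal sketch of the formula (recent SciPy versions also expose this as `scipy.stats.contingency.association` with `method="cramer"`):

```python
import numpy as np
from scipy.stats import chi2_contingency

def cramers_v(observed):
    """Cramér's V = sqrt(chi2 / (N * (k - 1))), with k = min(number of rows, number of columns)."""
    observed = np.asarray(observed)
    chi2, _, _, _ = chi2_contingency(observed, correction=False)
    n = observed.sum()
    k = min(observed.shape)
    return np.sqrt(chi2 / (n * (k - 1)))

# Hypothetical 2x3 table (made-up counts).
table = [[30, 45, 25],
         [20, 25, 55]]
print(round(cramers_v(table), 3))
```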
4. Goodness of Fit Chi-Squared Tests
- Compares an observed frequency distribution with the frequency distribution expected on the basis of a theory.
  - The variable has to be discrete.
  - It can be of any measurement level (categorical, ordinal, numeric).
- Chi-squared test
  - N: the number of categories (at least 5)
  - df: N - 1
  - A large χ² gives a small P-value → reject H0.
- Difference between Goodness of Fit and Association
  - The variable has to be discrete.
- Notes for Goodness of Fit
  - Several categories can be grouped to meet the frequency requirements.
  - It can only be applied to compare an observed sample with a theoretical distribution; two observed samples can only be compared with the test for association.
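A minimal goodness-of-fit sketch with `scipy.stats.chisquare`, testing invented die-roll counts against the fair-die (uniform) distribution:

```python
import numpy as np
from scipy.stats import chisquare

# Hypothetical counts of 120 die rolls for faces 1..6.
observed = np.array([25, 17, 15, 23, 24, 16])

# Theoretical (fair die) distribution: equal expected counts in every category.
expected = np.full(6, observed.sum() / 6)

# df = number of categories - 1 = 5
chi2, p_value = chisquare(observed, f_exp=expected)
print(f"chi2 = {chi2:.2f}, P-value = {p_value:.3f}")
```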
5. Fisher’s Exact Test (permutation test)
- Assumptions:
  - Two binary categorical variables.
  - Randomization, such as random sampling or a randomized experiment.
  - Cell counts may be small (< 5).
- Hypotheses:
  - H0: The two variables are independent (H0: p1 = p2)
  - Ha: The two variables are associated (choose Ha: p1 ≠ p2, Ha: p1 > p2, or Ha: p1 < p2)
- Test statistic:
  - The first cell count (this determines the others, given the margin totals).
  - Computed by enumerating all tables consistent with the margins.
- P-value:
  - The probability that the first cell count equals the observed value or a value even more extreme than observed, in the direction predicted by Ha.
- Conclusion:
  - Report the P-value and interpret it in context. If a decision is needed, reject H0 when the P-value ≤ significance level (such as 0.05).
- Calculation for Fisher’s Exact Test
  - Example:
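A minimal sketch with `scipy.stats.fisher_exact` on an invented 2×2 table with small counts:

```python
from scipy.stats import fisher_exact

# Hypothetical 2x2 table with small cell counts (rows = groups, columns = "yes" / "no").
table = [[3, 1],
         [1, 4]]

# Two-sided test of H0: p1 = p2; use alternative="greater" or "less" for a one-sided Ha.
odds_ratio, p_value = fisher_exact(table, alternative="two-sided")
print(f"sample odds ratio = {odds_ratio:.2f}, P-value = {p_value:.3f}")
```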