Inferential Statistics(08)-Multiple Regression

 


1. Regression Analysis with Samples

The multiple regression model relates the mean μy of a quantitative response variable y to a set of explanatory variables x1, x2, …, xm.

  • \[\mu_y = \alpha + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \cdots + \beta_m x_m\]

  • Simple vs. multiple regression

    • Multiple regression: controlling for other variables.
      • The model assumes that the slope for a particular explanatory variable is identical at all fixed values of the other explanatory variables (i.e., no interaction).
        • Example: a cat's popularity (the response) relates to both its fur and its age, but fur thins as age increases, so the two predictors are entangled.
        • Example: a house's value (the response) relates to the number of bedrooms, its size, and so on, but bedroom count and size are closely correlated.
    • Simple regression: ignoring other variables.

  • scatterplot matrix

2. R and R² for Multiple Regression

  • Multiple Correlation R ∈ [0, 1]

    For a multiple regression model, the multiple correlation is the correlation between the observed y values and the predicted ŷ values (determined by the set of explanatory variables).

    • correlation matrix

  • R²

    R² denotes the proportion of variance in y accounted for by the model.

    • \[R^2 = \frac{\sum{(y-\overline{y})^2}-\sum{(y-\hat{y})^2}}{\sum{(y-\overline{y})^2}}\]

    • **Properties of R²**
      • R² falls in [0, 1].
      • R² gets larger, or at worst stays the same, whenever an additional explanatory variable is added to the model.
      • The value of R² does not depend on the units of measurement.
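The R² formula above can be checked numerically. A minimal Python sketch, using made-up observed and fitted values rather than a real fit:

```python
# Sketch: computing R^2 from observed y and fitted y-hat values.
# The numbers below are made up for illustration.
y    = [10.0, 12.0, 15.0, 11.0, 18.0, 14.0]
yhat = [10.5, 12.5, 14.0, 11.5, 17.0, 14.5]

n = len(y)
ybar = sum(y) / n
sst = sum((yi - ybar) ** 2 for yi in y)               # total sum of squares
sse = sum((yi - fi) ** 2 for yi, fi in zip(y, yhat))  # residual sum of squares
r2 = (sst - sse) / sst  # proportion of variance accounted for

# For an actual least-squares fit, the multiple correlation R is the
# correlation between y and y-hat and equals the square root of R^2.
print(round(r2, 4))  # 0.9308
```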

3. Inferences Using Multiple Regression

  • Overall test

    • F-test

      1. Assumptions

        • The regression equation truly holds for the population means.

        • The data were gathered using randomization.

        • The response variable y has a normal distribution at each combination of values of the explanatory variables, with the same standard deviation.

      2. Hypothesis

        • \[H_0:\beta_1=\beta_2=\cdots=\beta_m=0 \\ H_a: \text{At least one } \beta \text{ parameter is not equal to } 0.\]
      3. F-test

        • \[F = \frac{\text{Mean square for regression (MSR)}}{\text{Mean square error (MSE)}}\]

        • Degree of freedom

          • df1 = number of explanatory variables in the model (m).
          • df2 = n − number of parameters in the regression equation, i.e., n − (m + 1).
      4. P-Value

      5. Conclusion

        • The smaller the P-value, the stronger the evidence that at least one explanatory variable has an effect on y.
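Since MSR = SSR/df1 and MSE = SSE/df2, the F statistic can equivalently be written as (R²/df1) / ((1 − R²)/df2). A small sketch with illustrative numbers:

```python
# Sketch: the overall F statistic from R^2 and the degrees of freedom.
# All numbers are illustrative, not from a real data set.
r2 = 0.80  # proportion of variance explained by the model
m  = 3     # number of explanatory variables
n  = 25    # sample size

df1 = m            # numerator degrees of freedom
df2 = n - (m + 1)  # n minus the number of parameters (m slopes + intercept)

f_stat = (r2 / df1) / ((1 - r2) / df2)  # equals MSR / MSE
print(round(f_stat, 2))  # 28.0
```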
  • Single test

    Aim: to follow up from the F test to investigate which explanatory variables have a statistically significant effect on predicting y.

    • T-test

      1. Assumptions

        • The multiple regression model holds for the population mean. This implies that there is a linear relationship between the mean of y and each explanatory variable, holding the others constant.
        • The slope of this line is the same no matter what the values of the other predictors are (the explanatory variables are assumed not to interact!!).
          • Reflection: reality is often otherwise; a change in the number of bedrooms can change how people weigh the number of bathrooms.
      2. Hypotheses:

        • \[H_0:\beta_i=0, \text{ controlling for other predictors} \\ H_a: \beta_i \neq 0, \text{ controlling for other predictors}\]
      3. T-test:

        • \[t = \frac{b_i - 0}{se} \\ df = n - \text{number of parameters in the regression equation}\]
      4. P-value:

        Two-tail probability from the t distribution of values larger in absolute value than the observed t test statistic.

      5. Conclusion:

    • Confidence Interval: \(b_i \pm t_{\alpha/2} \cdot se_{b_i}, \\ df = n - \text{number of parameters in the regression equation}\)
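As a sketch of this interval, with an assumed slope estimate, standard error, and the t critical value t_.025 ≈ 2.060 for df = 25 (read from a t table):

```python
# Sketch: 95% confidence interval for a slope b_i.
# b_i, se, and t_crit are assumed values for illustration.
b_i = 1.20      # estimated slope
se = 0.40       # its standard error
t_crit = 2.060  # t_{alpha/2} for df = n - number of parameters = 25

lower = b_i - t_crit * se
upper = b_i + t_crit * se
print(round(lower, 3), round(upper, 3))  # 0.376 2.024
```

If the interval excludes 0, the t test of H0: βi = 0 would reject at the corresponding significance level.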

  • Interpretation

    • The F test is typically performed first before looking at the individual t inferences. The F test result tells us if there is sufficient evidence to make it worthwhile to consider the individual effects.

4. Checking a Regression Model Using Residual Plots

  • Assumptions:

    1. linearity

      • For each predictor x: x and y are linearly related at any fixed combination of the other predictors.

    2. normality

      • Residuals, which measure the deviations of y from the mean predicted by the regression equation, should be normally distributed.

      • Check the residuals of the overall model!!

      • Check: a histogram of the residuals.

        • Skewness is not a problem as long as the sample size is large.
    3. homoscedasticity (constant variance)

      • The variance of the residuals is the same over the entire range of x.

      • For example, incomes typically vary more among highly educated people, while incomes of less-educated people differ less (say, because of a minimum-wage law). That is a heteroscedasticity problem: the variance of the error e grows as education level x increases.

      • A violation of homoscedasticity is called heteroscedasticity.

    4. independence of errors

      • Residuals unrelated
      • Solution: random sampling/ random assignment
      • Caution: time-series samples often violate this assumption.
    5. sufficient observations

      • Enough observations:
        • n ≥ 10 × the number of predictors
    6. absence of outliers

      • Inspect standardized residuals more extreme than ±3.
      • Only remove outliers when there is a plausible explanation for them (e.g., a data-entry error).
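The outlier check above can be sketched as follows; statistical software would also adjust for leverage, so this is only a rough version with made-up residuals:

```python
# Sketch: flag observations whose standardized residual exceeds 3 in
# absolute value. The residuals below are made up for illustration.
residuals = [0.4, -0.6, 0.5, -0.3, 0.2, -0.5,
             0.6, -0.4, 0.3, -0.2, 0.1, 12.0]

n = len(residuals)
mean = sum(residuals) / n
sd = (sum((r - mean) ** 2 for r in residuals) / (n - 1)) ** 0.5
standardized = [(r - mean) / sd for r in residuals]

outliers = [i for i, z in enumerate(standardized) if abs(z) > 3]
print(outliers)  # [11]
```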

5. Categorical Predictors

  • Binary indicators

    Binary indicators represent not the quantity of a measured property, but its quality.

  • Multiple indicators(dummy variables)

    • Generally, a categorical explanatory variable in a regression model uses one fewer indicator variable than the number of categories.

    • Why can’t we specify the three categories merely by setting up a variable x that equals 1 for homes in good condition, 0 for homes in average condition, and say -1 for poor condition?

      • Because this would treat condition as quantitative rather than categorical. It would treat condition as if different categories corresponded to different amounts of the variable. But the variable measures which condition, not how much condition.

      • Treating it as quantitative is inappropriate.

      • Always distinguish categorical from quantitative!! Do not mix the two, blindly quantify, or worship numbers; conflating them creates endless confusion.

        • Counterexample: substituting quantitative for categorical, e.g., "objective attainment" evaluation at universities. Converting exam results into scores out of 100 is itself questionable (how big is the gap between a 60 and a 70?). Now each learning objective is additionally assigned a degree of attainment. An objective is simply attained or not; counting the attained versus unattained students does yield a percentage, but that is a proportion over binary categories. Instead, a student's score is used as the attainment ratio: a score of 70 means attainment 0.7 (70/100). What does 0.7 attainment mean? That the student mastered 70% of the knowledge? How does that differ from a classmate who mastered 80%? A categorical variable (attained/not attained) has been turned into a quantitative one (degree of attainment), converting a variable that cannot be precisely quantified into a percentage, which is mathematically wrong.

          • Related logical fallacy: the McNamara fallacy (the big-data or quantitative fallacy)

            The McNamara fallacy (also known as the quantitative fallacy), named for Robert McNamara, the US Secretary of Defense from 1961 to 1968, involves making a decision based solely on quantitative observations (or metrics) and ignoring all others. The reason given is often that these other observations cannot be proven.

            The first step is to measure whatever can be easily measured. This is OK as far as it goes. The second step is to disregard that which can’t be easily measured or to give it an arbitrary quantitative value. This is artificial and misleading. The third step is to presume that what can’t be measured easily really isn’t important. This is blindness. The fourth step is to say that what can’t be easily measured really doesn’t exist. This is suicide.

            — Daniel Yankelovich, “Corporate Priorities: A continuing study of the new demands on business” (1972).

            The fallacy refers to McNamara’s belief as to what led the United States to defeat in the Vietnam War—specifically, his quantification of success in the war (e.g., in terms of enemy body count), ignoring other variables.

        • Counterexample: substituting categorical for quantitative, e.g., university research evaluation. A faculty member's publication count is just a number, but an arbitrary threshold is drawn: above some value is "good" and below it is "bad", collapsing a quantitative relationship into a binary (categorical) split. Such cases abound. Humans lean toward simplified evaluation; right/wrong, warm/cold, good/bad and many other opposing word pairs are binary descriptions that help us decide quickly. But many things are not strictly binary, and the threshold at which quantity becomes quality is unclear and may not even be constant. For example, the decision to buy a house relates to its size, location, number of bedrooms, and so on, but once the bedroom count is high enough its influence on the decision shrinks: few people care about the difference between 20 and 25 bedrooms. This is why statistics measures the interaction between variables (see below). We should therefore be wary of Occam's razor. Although I am a faithful believer in "less is more", remember Einstein's words:

          “Everything should be made as simple as possible, but no simpler.”—Einstein
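The indicator-variable scheme discussed above (one fewer indicator than categories) can be sketched for a three-category condition variable; the category names are illustrative:

```python
# Sketch: two indicator variables encode three categories; the omitted
# category ("poor") serves as the reference level.
def encode_condition(condition):
    good = 1 if condition == "good" else 0
    average = 1 if condition == "average" else 0
    return good, average

print(encode_condition("good"))     # (1, 0)
print(encode_condition("average"))  # (0, 1)
print(encode_condition("poor"))     # (0, 0)
```

The two indicators are never both 1, so no ordering or spacing between categories is imposed, unlike a single -1/0/1 code.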

  • Interaction

    For two explanatory variables, interaction exists when the slope of the linear relationship between μy and one explanatory variable changes as the other explanatory variable changes.

    • How do we know whether the interaction shown by the sample is sufficiently large to indicate that there is interaction in the population?
      • There is a significance test for checking this, but it is beyond the scope of this book. In practice, it’s usually adequate to investigate interaction informally by using graphics. For example, suppose there are two predictors, one quantitative and one categorical. Then you can plot y against the quantitative predictor, identifying the data points by the category of the categorical variable, as we did in Example 11. Do the points seem to go up or go down at quite different rates, taking into account sampling error? If so, it’s safer to fit a separate regression line for each category, which then allows different slopes.
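The informal graphical check described above can also be done numerically: fit a separate simple-regression slope within each category and compare. A sketch with made-up data:

```python
# Sketch: least-squares slope of y on x, computed within each category.
def slope(xs, ys):
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    sxy = sum((x - xbar) * (yv - ybar) for x, yv in zip(xs, ys))
    sxx = sum((x - xbar) ** 2 for x in xs)
    return sxy / sxx

# Made-up (x, y) data for two levels of a categorical predictor.
slope_a = slope([1, 2, 3, 4], [2.0, 4.1, 5.9, 8.0])
slope_b = slope([1, 2, 3, 4], [5.0, 5.4, 6.1, 6.5])
print(round(slope_a, 2), round(slope_b, 2))  # 1.98 0.52

# Clearly different slopes hint at interaction: it is then safer to fit
# a separate regression line for each category.
```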

6. Modeling a Categorical Response

When y is categorical, a different regression model applies, called logistic regression.

  • Logistic regression equation

    A regression equation for an S-shaped curve for the probability of success p is

    • \[p = \frac{e^{\alpha+\beta x}}{1+e^{\alpha+\beta x}}\]

    • \[x = -\frac{\alpha}{\beta} \text{ when } p = 0.5; \text{ at that point the curve has its steepest slope, } \frac{\beta}{4}\]
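The two facts above (p = 0.5 at x = −α/β, with slope β/4 there) can be verified numerically; the values of α and β below are illustrative:

```python
import math

# Sketch: logistic curve p(x) = e^(a+bx) / (1 + e^(a+bx)).
a, b = -2.0, 0.5  # illustrative parameters

def p(x):
    return math.exp(a + b * x) / (1 + math.exp(a + b * x))

x_half = -a / b  # x value where p = 0.5
h = 1e-6
slope_at_half = (p(x_half + h) - p(x_half - h)) / (2 * h)  # numerical derivative

print(p(x_half))                # 0.5
print(round(slope_at_half, 4))  # 0.125, i.e. b / 4
```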
  • Inference for Logistic Regression: z-test \(z=\frac{b-0}{se}\), where b is the sample estimate of the slope β

  • checking the logistic regression model

    • classification table

    • specificity & sensitivity

      • sensitivity: P(E|H)
        • TPR, the true positive rate: the proportion of all actual positives that are correctly identified.
          • Among patients who truly have the disease, the fraction the test correctly flags as diseased.
      • specificity: P(¬E|¬H)
        • TNR, the true negative rate: the proportion of all actual negatives that are correctly identified.

          • Among people who truly do not have the disease, the fraction the test correctly clears.
        • Related reading: Bayes' Law

          • Sensitivity: how strongly the test reacts positively to true targets. The higher the sensitivity, the more easily targets are detected.
          • Specificity: how strongly the test reacts negatively to false targets. The higher the specificity, the fewer the false alarms: the test reacts positively only in the specific situation, i.e., it screens with strong targeting.
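Sensitivity and specificity follow directly from a classification table; the counts below are made up for illustration:

```python
# Sketch: classification table counts (rows = truth).
tp, fn = 40, 10  # truly positive: predicted positive / predicted negative
tn, fp = 85, 15  # truly negative: predicted negative / predicted positive

sensitivity = tp / (tp + fn)  # true positive rate, P(test + | truly +)
specificity = tn / (tn + fp)  # true negative rate, P(test - | truly -)

print(sensitivity, specificity)  # 0.8 0.85
```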

7. Example