Inferential Statistics (06) - Describing Quantitative Association
1. Regression Analysis with Samples
- The Correlation r
- r falls between -1 and 1.
- The larger the absolute value of r, the stronger the linear association.
- Least Squares Line
Let \(\overline{x}\) denote the mean of x, \(\overline{y}\) the mean of y, \(S_x\) the standard deviation of the x values, and \(S_y\) the standard deviation of the y values.
- \[\hat{y}=a + bx \\ b = r\left(\frac{S_y}{S_x}\right) \\ a= \overline{y}-b\overline{x}\]
- \[\text{sum of squared residuals} = \sum{(\text{residual})^2}=\sum(y-\hat{y})^2\]
- \[\mu_y=\alpha + \beta x, \ \text{where } \sigma \text{ denotes the standard deviation of the } y \text{ values at each fixed } x\]
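The slope and intercept formulas above can be checked numerically. A minimal sketch in pure Python; the data set is invented for illustration:

```python
import math

# Hypothetical sample data (for illustration only).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n

# Sample standard deviations (divisor n - 1).
s_x = math.sqrt(sum((xi - x_bar) ** 2 for xi in x) / (n - 1))
s_y = math.sqrt(sum((yi - y_bar) ** 2 for yi in y) / (n - 1))

# Correlation r.
r = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / ((n - 1) * s_x * s_y)

# Least squares slope and intercept: b = r * (S_y / S_x), a = y_bar - b * x_bar.
b = r * (s_y / s_x)
a = y_bar - b * x_bar

print(round(b, 4), round(a, 4))  # → 0.6 2.2
```

The slope from these formulas matches the direct least squares solution \(b = \sum(x-\overline{x})(y-\overline{y}) / \sum(x-\overline{x})^2\).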
- Residual
The difference \(y-\hat{y}\) between an observed outcome y and its predicted value \(\hat{y}\) is the prediction error, called a residual.
- A residual is the vertical distance between the data point and the regression line.
- The smaller the distance, the better the prediction.
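A quick sketch of computing residuals and their sum of squares; both the data and the fitted coefficients (a = 2.2, b = 0.6) are assumed for illustration:

```python
# Hypothetical data and a least squares fit assumed for these data.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
a, b = 2.2, 0.6

# Predicted values and residuals (observed minus predicted).
y_hat = [a + b * xi for xi in x]
residuals = [yi - yh for yi, yh in zip(y, y_hat)]

# Sum of squared residuals: what the least squares line minimizes.
sse = sum(e ** 2 for e in residuals)
print([round(e, 1) for e in residuals], round(sse, 4))  # → [-0.8, 0.6, 1.0, -0.6, -0.2] 2.4
```

Note that the residuals of a least squares fit always sum to zero, since the line passes through \((\overline{x}, \overline{y})\).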
2. Regression Model for the Population
- A model is a simple approximation for how variables relate in a population.
- Conditional distribution: the probability distribution of the y values at a fixed value of x.
- A regression model describes how the population mean \(\mu_y\) of each conditional distribution of the response variable depends on the value x of the explanatory variable.
- A straight-line model: \(\mu_y = \alpha + \beta x\)
3. Significance Test About a Population Slope β
- Assumptions:
- The relationship in the population satisfies the regression model.
- The data were gathered using randomization.
- The population y values at each x value have a normal distribution, with the same standard deviation at each x value.
- Hypotheses
- H0: β = 0 (x and y are statistically independent)
- Ha: β≠0
- Test statistic
- \[t = \frac{(b-0)}{se_b}\]
- P-value
Two-tail probability of a t test statistic value more extreme than the observed one, using the t distribution with df = n - 2.
- Conclusion
Interpret the P-value in context. If a decision is needed, reject H0 if P-value ≤ significance level.
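The test statistic above can be sketched in pure Python. The data are invented for illustration; the P-value itself would come from a t table or software, using df = n - 2:

```python
import math

# Hypothetical data (for illustration only).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Least squares fit.
sxx = sum((xi - x_bar) ** 2 for xi in x)
b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
a = y_bar - b * x_bar

# Residual standard deviation and the standard error of b.
sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
sd_res = math.sqrt(sse / (n - 2))
se_b = sd_res / math.sqrt(sxx)

# Test statistic for H0: beta = 0.
t = (b - 0) / se_b
df = n - 2
print(round(t, 4), df)  # → 2.1213 3; look up the two-tail P-value at df = 3
```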
4. Confidence Interval for β
\[b \pm t_{.025}(se_b) \\ df=n-2\]
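A sketch of the interval computation, assuming the same hypothetical data; the critical value \(t_{.025} = 3.182\) for df = 3 is taken from a t table:

```python
import math

# Hypothetical data (for illustration only).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

# Least squares slope and its standard error.
sxx = sum((xi - x_bar) ** 2 for xi in x)
b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
a = y_bar - b * x_bar
sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
se_b = math.sqrt(sse / (n - 2)) / math.sqrt(sxx)

# 95% CI for beta: b +/- t_.025 * se_b, with df = n - 2 = 3.
t_crit = 3.182  # t table value for df = 3 (assumption tied to this toy data set)
ci = (b - t_crit * se_b, b + t_crit * se_b)
print(tuple(round(v, 3) for v in ci))  # → (-0.3, 1.5)
```

Since this interval contains 0, the slope is not significantly different from 0 at the 95% level for these toy data.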
- Confidence Interval and Prediction Interval
- The residual standard deviation \(SD_{res} = \sqrt{\frac{\sum{(y-\hat{y})^2}}{n-2}}\)
- the numerator is the residual sum of squares
- df = n - 2 because two parameters, a and b, are estimated
- Confidence Interval
- \(CI_{\mu_y}= \hat{y}\pm2\frac{SD_{res}}{\sqrt{n}}\) estimates \(\mu_y\), the mean of all the y values at a fixed x (approximate 95% interval).
- Prediction Interval
- predicts an individual y value at a fixed x.
- wider than the CI: \(PI_{y_i}=\hat{y}\pm2SD_{res}\) (approximate 95% interval)
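The two intervals can be compared numerically; the data are hypothetical, and the factor 2 is the rough 95% multiplier used above:

```python
import math

# Hypothetical data (for illustration only).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

# Least squares fit and residual standard deviation.
sxx = sum((xi - x_bar) ** 2 for xi in x)
b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
a = y_bar - b * x_bar
sd_res = math.sqrt(sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y)) / (n - 2))

x0 = 3            # fixed x at which to predict (hypothetical choice)
y0 = a + b * x0   # point prediction y_hat

ci_half = 2 * sd_res / math.sqrt(n)  # rough 95% CI half-width for mu_y
pi_half = 2 * sd_res                 # rough 95% PI half-width for a single y
print(round(y0, 2), round(ci_half, 3), round(pi_half, 3))  # → 4.0 0.8 1.789
```

The PI half-width is \(\sqrt{n}\) times the CI half-width here, which is why predicting a single observation is so much less precise than estimating a mean.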
- r^2
- r^2 gives the proportion of the overall variability in y that can be attributed to the linear regression model.
5. The Strength of Association
- r, the correlation
- If an x value is a certain number of standard deviations from its mean, then the predicted y is r times that many standard deviations from its mean.
- r^2
\[r^2=\frac{\sum{(y-\overline{y})^2}-\sum{(y-\hat{y})^2}}{\sum{(y-\overline{y})^2}} \\ \text{where } \hat{y} = \text{predicted } y \text{ on the regression line}, \quad \overline{y} = \text{mean of } y\]
- r^2 is interpreted as the proportional reduction in error.
- For instance, if r^2 = 0.4, the error using \(\hat{y}\) to predict y is 40% smaller than the error using \(\overline{y}\) to predict y.
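The two ways of getting r^2 — squaring the correlation, and computing the proportional reduction in error directly — agree, as a quick Python sketch on invented data shows:

```python
import math

# Hypothetical data (for illustration only).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

sxx = sum((xi - x_bar) ** 2 for xi in x)
syy = sum((yi - y_bar) ** 2 for yi in y)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

# Least squares fit.
b = sxy / sxx
a = y_bar - b * x_bar

# Error using y_hat (residual SS) vs. error using y_bar (total SS).
sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
tss = syy

r = sxy / math.sqrt(sxx * syy)
r2_prop = (tss - sse) / tss  # proportional reduction in error
print(round(r ** 2, 4), round(r2_prop, 4))  # → 0.6 0.6
```

Here r^2 = 0.6: using the regression line cuts the prediction error 60% relative to using \(\overline{y}\) alone.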
- Properties of r^2
- r^2 falls between 0 and 1.
- r^2 = 1
- when \(y-\hat{y} = 0\) for every observation
- all the data points fall exactly on the regression line.
- There is no prediction error using x to predict y.
- r^2 = 0
- when \(y-\hat{y}=y-\overline{y}\)
- \(\hat{y}=\overline{y}\)
- slope b = 0
- the regression line and the mean give the same predictions.
- The closer r^2 is to 1, the stronger the linear association.
- the more effective the regression equation \(\hat{y}= a + bx\) is.
- Correlation r and its square r^2
- r
- r falls between -1 and 1.
- It represents by how many standard deviations y is predicted to change when x changes by one standard deviation.
- Because |r| ≤ 1, y is predicted to lie relatively closer to its mean than x is to its own mean: regression toward the mean.
- r^2
- r^2 falls between 0 and 1.
- It summarizes the reduction in the prediction error when using the regression equation rather than the mean of y.
6. Potential Problems with Regression
- Nonlinearity
- data sets with the same r can have very different shapes, so check a scatterplot.
- Outliers
- check standardized residuals
- Correlation ≠ Causation
- Inappropriate extrapolation
- Ecological fallacy
- if the subjects are grouped for the observations, such as when the data refer to county summaries instead of individual people, the correlation tends to increase in magnitude.
- Ecological fallacy: a common error in analyzing statistical data. In contrast to hasty generalization, it reasons from the whole to the part: drawing conclusions about individuals based solely on statistics for the group they belong to. The fallacy assumes that every individual in a group has the group's properties (stereotypes can commit the same error). Its opposite is reductionism.
- Restriction of range
- the size of the correlation depends on the range of x values sampled: the correlation tends to be smaller when we sample only a restricted range of x values than when we use the entire range.
7. The Analysis of Variance (ANOVA) Table
- Total SS = Regression SS + Residual SS: \(\sum(y-\overline{y})^2=\sum(\hat{y}-\overline{y})^2+\sum(y-\hat{y})^2\)
- Mean square (MS)
- Mean square error (MSE)
MSE is the residual sum of squares divided by its df value: \(s^2=\frac{\sum{(y-\hat{y})^2}}{n-2}\)
- \(MSE = SD_{res}^2\)
- ANOVA F statistic \(F=\frac{\text{Mean square for regression}}{\text{Mean square error (MSE)}}\), where the mean square for regression is the Regression SS divided by its df (1 for a single explanatory variable).
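The sum-of-squares decomposition and the F statistic can be verified numerically; the data are the same invented toy set used earlier:

```python
# Hypothetical data (for illustration only).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

# Least squares fit and predicted values.
sxx = sum((xi - x_bar) ** 2 for xi in x)
b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
a = y_bar - b * x_bar
y_hat = [a + b * xi for xi in x]

# ANOVA decomposition: Total SS = Regression SS + Residual SS.
total_ss = sum((yi - y_bar) ** 2 for yi in y)
reg_ss = sum((yh - y_bar) ** 2 for yh in y_hat)
resid_ss = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))

ms_reg = reg_ss / 1        # df = 1 for a single explanatory variable
mse = resid_ss / (n - 2)   # df = n - 2
f = ms_reg / mse
print(round(total_ss, 4), round(reg_ss + resid_ss, 4), round(f, 4))  # → 6.0 6.0 4.5
```

For a single explanatory variable, F equals the square of the t statistic for the slope (here 4.5 = 2.1213^2), so the two tests give the same P-value.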