Inferential Statistics(06)-Describing Quantitative Association


1. Regression Analysis with samples

  1. the Correlation r

    • r falls in [−1, 1]
    • The larger the absolute value of r, the stronger the linear association.
  2. Least Squares Line

    Let \(\overline{x}\) denote the mean of the x values, \(\overline{y}\) the mean of the y values, \(S_x\) the standard deviation of the x values, and \(S_y\) the standard deviation of the y values.

    • \[\hat{y}=a + bx \\ b = r(\frac{S_y}{S_x}) \\ a= \overline{y}-b\overline{x}\]
    • \[sum \ of \ squared \ residuals\ = \sum{(residual)^2}=\sum(y-\hat{y})^2\]
    • \[\mu_y=\alpha + \beta x \\ with \ \sigma = standard \ deviation \ of \ y \ at \ each \ fixed \ x\]
  3. Residual

    The difference \(y-\hat{y}\) between an observed outcome y and its predicted value \(\hat{y}\) is the prediction error, called a residual.

    • A residual is the vertical distance between the data point and the regression line.
    • The smaller the distance, the better the prediction.
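
The least-squares formulas above can be sketched in Python; the x and y values below are made up purely for illustration:

```python
# Least-squares line from r, S_x, S_y and the means (made-up data).
from statistics import mean, stdev

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

x_bar, y_bar = mean(x), mean(y)
s_x, s_y = stdev(x), stdev(y)

# correlation r from the sum of cross-products
r = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / ((n - 1) * s_x * s_y)

# slope b = r * (S_y / S_x), intercept a = y_bar - b * x_bar
b = r * (s_y / s_x)
a = y_bar - b * x_bar

# residuals y - y_hat and their sum of squares
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
ss_res = sum(e ** 2 for e in residuals)
```

Note that the residuals of a least-squares fit always sum to (approximately) zero; the line balances positive and negative prediction errors.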

2. Regression Model with population

  • A model is a simple approximation for how variables relate in a population.

  • Conditional distribution: the probability distribution of the y values at a fixed value of x.

  • A regression model describes how the population mean μ_y of each conditional distribution of the response variable depends on the value x of the explanatory variable.

  • a straight-line model: \(\mu_y = \alpha + \beta x\)

3. Significance Test About a Population Slope β

  1. Assumptions:

    1. Relationship in population satisfies regression model

    2. data gathered using randomization

    3. the population y values at each x value have a normal distribution, with the same standard deviation at each x value.

  2. Hypotheses

    • H0: β=0 (no linear association between x and y)
    • Ha: β≠0
  3. Test statistic

    • \[t = \frac{(b-0)}{se_b}\]
  4. P-value

    Two-tail probability of t test statistic value more extreme than observed, using t distribution with df=n-2.

  5. Conclusion

    Interpret the P-value in context. If a decision is needed, reject H0 when the P-value ≤ significance level.
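
The test statistic above can be computed by hand; a minimal sketch on made-up data, using the standard formula \(se_b = s/\sqrt{\sum(x-\overline{x})^2}\) for simple regression:

```python
# t statistic for H0: beta = 0 in simple linear regression (made-up data).
from math import sqrt
from statistics import mean

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

x_bar, y_bar = mean(x), mean(y)
s_xx = sum((xi - x_bar) ** 2 for xi in x)

# least-squares slope and intercept
b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
a = y_bar - b * x_bar

# residual standard deviation s = sqrt(SS_res / (n - 2))
ss_res = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
s = sqrt(ss_res / (n - 2))

# standard error of b and the test statistic; compare t against a
# t distribution with df = n - 2 to obtain the two-sided P-value
se_b = s / sqrt(s_xx)
t = (b - 0) / se_b
```

With n = 5 the reference distribution has df = 3, so the P-value is the two-tail probability beyond |t| in a t distribution with 3 degrees of freedom.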

4. Confidence Interval for β

\[b \pm t_{.025}(se_b) \\ df=n-2\]
  • Confidence Interval and Prediction Interval

    • The residual standard deviation \(SD_{res} = \sqrt{\frac{\sum{(y-\hat{y})}^2}{n-2}}\)

      • the numerator is the residual sum of squares
      • the divisor is n-2 because two parameters, a and b, are estimated
    • Confidence Interval

      • a rough 95% confidence interval for the conditional mean μ_y at a fixed x: \(CI_{\mu_y}= \hat{y}\pm2\frac{SD_{res}}{\sqrt{n}}\)
    • Prediction Interval

      • individual y variability at a fixed x.

      • wider than CI \(PI_{y_i}=\hat{y}\pm2SD_{res}\)

    • r^2

      • r^2 gives the proportion of the overall variability in y that can be attributed to the linear regression model.
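
The CI and PI formulas above can be sketched as follows; the data are made up, and x0 is an arbitrary fixed x value:

```python
# Rough 95% CI for mu_y and PI for an individual y at a fixed x (made-up data).
from math import sqrt
from statistics import mean

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

x_bar, y_bar = mean(x), mean(y)
b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) \
    / sum((xi - x_bar) ** 2 for xi in x)
a = y_bar - b * x_bar

# residual standard deviation SD_res = sqrt(sum (y - y_hat)^2 / (n - 2))
sd_res = sqrt(sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y)) / (n - 2))

x0 = 3              # an arbitrary fixed x value
y_hat = a + b * x0

# CI for the conditional mean mu_y: y_hat +/- 2 * SD_res / sqrt(n)
ci = (y_hat - 2 * sd_res / sqrt(n), y_hat + 2 * sd_res / sqrt(n))
# PI for an individual y: y_hat +/- 2 * SD_res  (always wider than the CI)
pi = (y_hat - 2 * sd_res, y_hat + 2 * sd_res)
```

The PI is wider than the CI by a factor of roughly √n, since a single y value varies far more than the mean of all y values at that x.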

5. The Strength of Association

  • r

    • correlation

    • If an x value is a certain number of standard deviations from its mean, then the predicted y is r times that many standard deviations from its mean.

  • r^2

    \[r^2=\frac{\sum{(y-\overline{y})^2}-\sum{(y-\hat{y})^2}}{\sum{(y-\overline{y})^2}}\\ where \\ \hat{y} = predicted \ y \ on\ the\ regression\ line \\ \overline{y} = mean \ of \ y\]
    • r^2 is interpreted as the proportional reduction in error.

      • For instance, if r^2 = 0.4, the error using \(\hat{y}\) to predict y is 40% smaller than the error using \(\overline{y}\) to predict y.
    • Properties of r^2

      • r^2 falls between 0 and 1

      • r^2=1

        • when \(y-\hat{y} = 0\) for every observation
        • all the data points fall exactly on the regression line.
        • There is no prediction error using x to predict y.
      • r^2=0

        • when \(y-\hat{y}=y-\overline{y}\)
        • \(\hat{y}=\overline{y}\)
        • slope b=0
        • the regression line and the mean give the same predictions.
      • the closer r^2 is to 1, the stronger the linear association.

        • the more effective the regression equation \(\hat{y}= a + bx\).
  • Correlation r and square r^2

    • r
      • r falls between -1 and 1.
      • It gives the number of standard deviations by which y is predicted to change when x changes by one standard deviation.
      • regression toward the mean.
    • r^2
      • r^2 falls between 0 and 1.
      • It summarizes the reduction in the prediction error when using the regression equation rather than the mean of y.
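
The proportional-reduction-in-error interpretation of r^2 can be sketched directly from the sums of squares (made-up data):

```python
# r^2 as proportional reduction in error (made-up data).
from statistics import mean

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

x_bar, y_bar = mean(x), mean(y)
b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) \
    / sum((xi - x_bar) ** 2 for xi in x)
a = y_bar - b * x_bar

# total error predicting every y with y_bar, vs. residual error using y_hat
ss_total = sum((yi - y_bar) ** 2 for yi in y)
ss_resid = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

r_squared = (ss_total - ss_resid) / ss_total
```

Here the error drops from 6 to 2.4, so r^2 = 0.6: using the regression line instead of \(\overline{y}\) cuts the prediction error by 60%.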

6. Potential Problems with Regression

  1. Nonlinearity

    • data sets with very different shapes can have the same r; r measures only linear association.
  2. Outliers

    • check standardized residuals

  3. Correlation≠ Causation

  4. Inappropriate extrapolation

  5. Ecological fallacy

    If the subjects are grouped for the observations, such as when the data refer to county summaries instead of individual people, the correlation tends to increase in magnitude.

    • Ecological fallacy (also called ecological inference fallacy): a common error in analyzing statistical data. It is the opposite of hasty generalization: rather than generalizing from a part to the whole, it infers properties of individuals from statistics about the group they belong to. The fallacy assumes that every individual in a group shares the group's properties (stereotypes can therefore commit the ecological fallacy). The opposite of the ecological fallacy is reductionism.

  6. Restriction of range

    the size of the correlation depends on the range of x values sampled: The correlation tends to be smaller when we sample only a restricted range of x values than when we use the entire range.

7. The Analysis of Variance(ANOVA) table

  • Total SS = Regression SS + Residual SS \(\sum(y-\overline{y})^2=\sum(\hat{y}-\overline{y})^2+\sum(y-\hat{y})^2\)

  • Mean square(MS)

    • Mean square error(MSE)

      MSE is the residual sum of squares divided by its df value \(s^2=\frac{\sum{(y-\hat{y})^2}}{n-2}\)

  • MSE= (SDres)^2

  • ANOVA F statistic \(F=\frac{Mean\ square\ for\ regression}{Mean\ square\ error\ (MSE)}\), where the mean square for regression is the Regression SS divided by its df
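
The ANOVA decomposition and F statistic can be verified numerically; a sketch on made-up data, with df = 1 for the regression mean square since there is one explanatory variable:

```python
# ANOVA decomposition and F statistic for simple regression (made-up data).
from statistics import mean

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

x_bar, y_bar = mean(x), mean(y)
b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) \
    / sum((xi - x_bar) ** 2 for xi in x)
a = y_bar - b * x_bar
y_hat = [a + b * xi for xi in x]

ss_total = sum((yi - y_bar) ** 2 for yi in y)
ss_resid = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))
ss_regr = sum((yh - y_bar) ** 2 for yh in y_hat)
# check: Total SS = Regression SS + Residual SS

mse = ss_resid / (n - 2)   # mean square error, df = n - 2
ms_regr = ss_regr / 1      # one explanatory variable, so df = 1
f_stat = ms_regr / mse     # ANOVA F statistic
```

With a single explanatory variable, F equals the square of the t statistic for the slope, so the F test and the two-sided t test give the same P-value.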

8. Exponential regression