Inferential Statistics(06)-Describing Quantitative Association


1. Regression Analysis with samples

  1. the Correlation r

    • r falls in [−1, 1]
    • The larger the absolute value of r, the stronger the linear association.
  2. Least Squares Line

    Let \(\overline{x}\) denote the mean of the x values, \(\overline{y}\) the mean of the y values, \(S_x\) the standard deviation of the x values, and \(S_y\) the standard deviation of the y values.

    • \[\hat{y}=a + bx \\ b = r(\frac{S_y}{S_x}) \\ a= \overline{y}-b\overline{x}\]
    • \[sum \ of \ squared \ residuals\ = \sum{(residual)^2}=\sum(y-\hat{y})^2\]
    • \[\mu_y=\alpha + \beta x \\ with \ \sigma = standard \ deviation \ of \ y \ at \ each \ fixed \ x\]
  3. Residual

    The difference \(y-\hat{y}\) between an observed outcome y and its predicted value \(\hat{y}\) is the prediction error, called a residual.

    • A residual is the vertical distance between the data point and the regression line.
    • The smaller the distance, the better the prediction.
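
The least-squares formulas above can be sketched in Python; the x and y values below are made up purely for illustration:

```python
# Least-squares line from r, S_x, S_y and the means (made-up data).
from statistics import mean, stdev

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

x_bar, y_bar = mean(x), mean(y)
s_x, s_y = stdev(x), stdev(y)

# correlation r from the sum of cross-products
r = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / ((n - 1) * s_x * s_y)

# slope b = r * (S_y / S_x), intercept a = y_bar - b * x_bar
b = r * (s_y / s_x)
a = y_bar - b * x_bar

# residuals y - y_hat and their sum of squares
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
ss_res = sum(e ** 2 for e in residuals)
```

Note that the residuals of a least-squares fit always sum to (approximately) zero; the line balances positive and negative prediction errors.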

2. Regression Model with population

  • A model is a simple approximation for how variables relate in a population.

  • Conditional distribution: the probability distribution of the y values at a fixed value of x.

  • A regression model describes how the population mean μ_y of each conditional distribution of the response variable depends on the value x of the explanatory variable.

  • a straight-line model: \(\mu_y = \alpha + \beta x\)

3. Significance Test About a Population Slope β

  1. Assumptions:

    1. Relationship in population satisfies regression model

    2. data gathered using randomization

    3. the population y values at each x value have a normal distribution, with the same standard deviation at each x value.

  2. Hypotheses

    • H0: β=0 (no linear association between x and y)
    • Ha: β≠0
  3. Test statistic

    • \[t = \frac{(b-0)}{se_b}\]
  4. P-value

    Two-tail probability of t test statistic value more extreme than observed, using t distribution with df=n-2.

  5. Conclusion

    Interpret the P-value in context. If a decision is needed, reject H0 when the P-value ≤ significance level.
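
The test statistic above can be computed by hand; a minimal sketch on made-up data, using the standard formula \(se_b = s/\sqrt{\sum(x-\overline{x})^2}\) for simple regression:

```python
# t statistic for H0: beta = 0 in simple linear regression (made-up data).
from math import sqrt
from statistics import mean

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

x_bar, y_bar = mean(x), mean(y)
s_xx = sum((xi - x_bar) ** 2 for xi in x)

# least-squares slope and intercept
b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
a = y_bar - b * x_bar

# residual standard deviation s = sqrt(SS_res / (n - 2))
ss_res = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
s = sqrt(ss_res / (n - 2))

# standard error of b and the test statistic; compare t against a
# t distribution with df = n - 2 to obtain the two-sided P-value
se_b = s / sqrt(s_xx)
t = (b - 0) / se_b
```

With n = 5 the reference distribution has df = 3, so the P-value is the two-tail probability beyond |t| in a t distribution with 3 degrees of freedom.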

4. Confidence Interval for β

\[b \pm t_{.025}(se_b) \\ df=n-2\]
  • Confidence Interval and Prediction Interval

    • The residual standard deviation \(SD_{res} = \sqrt{\frac{\sum{(y-\hat{y})}^2}{n-2}}\)

      • the numerator is the residual sum of squares
      • the divisor is n-2 because two parameters, a and b, are estimated
    • Confidence Interval

      • a rough 95% confidence interval for the conditional mean μ_y at a fixed x: \(CI_{\mu_y}= \hat{y}\pm2\frac{SD_{res}}{\sqrt{n}}\)
    • Prediction Interval

      • individual y variability at a fixed x.

      • wider than CI \(PI_{y_i}=\hat{y}\pm2SD_{res}\)

    • r^2

      • r^2 gives the proportion of the overall variability in y that can be attributed to the linear regression model.
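
The CI and PI formulas above can be sketched as follows; the data are made up, and x0 is an arbitrary fixed x value:

```python
# Rough 95% CI for mu_y and PI for an individual y at a fixed x (made-up data).
from math import sqrt
from statistics import mean

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

x_bar, y_bar = mean(x), mean(y)
b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) \
    / sum((xi - x_bar) ** 2 for xi in x)
a = y_bar - b * x_bar

# residual standard deviation SD_res = sqrt(sum (y - y_hat)^2 / (n - 2))
sd_res = sqrt(sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y)) / (n - 2))

x0 = 3              # an arbitrary fixed x value
y_hat = a + b * x0

# CI for the conditional mean mu_y: y_hat +/- 2 * SD_res / sqrt(n)
ci = (y_hat - 2 * sd_res / sqrt(n), y_hat + 2 * sd_res / sqrt(n))
# PI for an individual y: y_hat +/- 2 * SD_res  (always wider than the CI)
pi = (y_hat - 2 * sd_res, y_hat + 2 * sd_res)
```

The PI is wider than the CI by a factor of roughly √n, since a single y value varies far more than the mean of all y values at that x.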

5. The Strength of Association

  • r

    • correlation

    • If an x value is a certain number of standard deviations from its mean, then the predicted y is r times that many standard deviations from its mean.

  • r^2

    \[r^2=\frac{\sum{(y-\overline{y})^2}-\sum{(y-\hat{y})^2}}{\sum{(y-\overline{y})^2}}\\ where \\ \hat{y} = predicted \ y \ on\ the\ regression\ line \\ \overline{y} = mean \ of \ y\]
    • r^2 is interpreted as the proportional reduction in error.

      • For instance, if r^2 = 0.4, the error using \(\hat{y}\) to predict y is 40% smaller than the error using \(\overline{y}\) to predict y.
    • Properties of r^2

      • r^2 falls between 0 and 1

      • r^2=1

        • when \(y-\hat{y} = 0\) for every observation
        • all the data points fall exactly on the regression line.
        • There is no prediction error using x to predict y.
      • r^2=0

        • when \(y-\hat{y}=y-\overline{y}\)
        • \(\hat{y}=\overline{y}\)
        • slope b=0
        • the regression line and the mean give the same predictions.
      • the closer r^2 is to 1, the stronger the linear association.

        • the more effective the regression equation \(\hat{y}= a + bx\).
  • Correlation r and square r^2

    • r
      • r falls between -1 and 1.
      • It gives the number of standard deviations by which y is predicted to change when x changes by one standard deviation.
      • regression toward the mean.
    • r^2
      • r^2 falls between 0 and 1.
      • It summarizes the reduction in the prediction error when using the regression equation rather than the mean of y.
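
The proportional-reduction-in-error interpretation of r^2 can be sketched directly from the sums of squares (made-up data):

```python
# r^2 as proportional reduction in error (made-up data).
from statistics import mean

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

x_bar, y_bar = mean(x), mean(y)
b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) \
    / sum((xi - x_bar) ** 2 for xi in x)
a = y_bar - b * x_bar

# total error predicting every y with y_bar, vs. residual error using y_hat
ss_total = sum((yi - y_bar) ** 2 for yi in y)
ss_resid = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

r_squared = (ss_total - ss_resid) / ss_total
```

Here the error drops from 6 to 2.4, so r^2 = 0.6: using the regression line instead of \(\overline{y}\) cuts the prediction error by 60%.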

6. Potential Problems with Regression

  1. Nonlinearity

    • data sets with very different shapes can have the same r; r measures only linear association.
  2. Outliers

    • check standardized residuals

  3. Correlation≠ Causation

  4. Inappropriate extrapolation

  5. Ecological fallacy

    If the subjects are grouped for the observations, such as when the data refer to county summaries instead of individual people, the correlation tends to increase in magnitude.

    • Ecological fallacy (also called ecological inference fallacy): a common error in analyzing statistical data. It is the opposite of hasty generalization: rather than generalizing from a part to the whole, it infers properties of individuals from statistics about the group they belong to. The fallacy assumes that every individual in a group shares the group's properties (stereotypes can therefore commit the ecological fallacy). The opposite of the ecological fallacy is reductionism.

  6. Restriction of range

    the size of the correlation depends on the range of x values sampled: The correlation tends to be smaller when we sample only a restricted range of x values than when we use the entire range.

7. The Analysis of Variance(ANOVA) table

  • Total SS = Regression SS + Residual SS \(\sum(y-\overline{y})^2=\sum(\hat{y}-\overline{y})^2+\sum(y-\hat{y})^2\)

  • Mean square(MS)

    • Mean square error(MSE)

      MSE is the residual sum of squares divided by its df value \(s^2=\frac{\sum{(y-\hat{y})^2}}{n-2}\)

  • MSE= (SDres)^2

  • ANOVA F statistic \(F=\frac{Mean\ square\ for\ regression}{Mean\ square\ error\ (MSE)}\), where the mean square for regression is the Regression SS divided by its df
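
The ANOVA decomposition and F statistic can be verified numerically; a sketch on made-up data, with df = 1 for the regression mean square since there is one explanatory variable:

```python
# ANOVA decomposition and F statistic for simple regression (made-up data).
from statistics import mean

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

x_bar, y_bar = mean(x), mean(y)
b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) \
    / sum((xi - x_bar) ** 2 for xi in x)
a = y_bar - b * x_bar
y_hat = [a + b * xi for xi in x]

ss_total = sum((yi - y_bar) ** 2 for yi in y)
ss_resid = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))
ss_regr = sum((yh - y_bar) ** 2 for yh in y_hat)
# check: Total SS = Regression SS + Residual SS

mse = ss_resid / (n - 2)   # mean square error, df = n - 2
ms_regr = ss_regr / 1      # one explanatory variable, so df = 1
f_stat = ms_regr / mse     # ANOVA F statistic
```

With a single explanatory variable, F equals the square of the t statistic for the slope, so the F test and the two-sided t test give the same P-value.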

8. Exponential regression