Basic Statistics - (01) Introduction
1. Cases, variables and levels of measurement
- Variables
  - characteristics of something or someone
- Cases
  - something or someone
- Variable vs Constant
2. Data matrix and frequency table
- Data matrix
  - data table: the starting point of any statistical analysis
- Frequency table
  - Categorical
  - Ordinal
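A frequency table can be built by counting how often each value of a variable occurs. The sketch below is only a minimal illustration using Python's standard library; the `eye_color` data and the `frequency_table` helper are made up for this example.

```python
from collections import Counter

def frequency_table(values):
    """Return (value, count, percentage) rows, most frequent value first."""
    counts = Counter(values)
    total = sum(counts.values())
    return [(value, count, 100 * count / total)
            for value, count in counts.most_common()]

# Hypothetical nominal variable measured on six cases
eye_color = ["brown", "blue", "brown", "green", "blue", "brown"]

for value, count, percent in frequency_table(eye_color):
    print(f"{value:<6} {count:>3} {percent:6.1f}%")
```

Each row of the result corresponds to one category of the variable, with its absolute and relative frequency.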
3. Graphs and shapes of distributions
- Graphs
  - Nominal/ordinal
    - bar vs pie
      - pie chart
      - bar graph
    - advantages and contrast of bar/pie
  - Histogram
    - Interval/ratio
    - histogram
- Shapes of distributions
  - unimodal: bell shape
  - other shapes:
    - skewed to the right/left
    - bimodal: two peaks
4. Mode, median and mean
- Mode
  - the value that occurs most frequently
  - often used if a variable is measured on a nominal or ordinal level
- Median
  - the middle value of your observations when arranged from smallest to largest
  - with an even number of observations: the average of the two middle values
- Mean
  - the sum of all the values divided by the number of observations
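The three measures of central tendency can be checked on a small data set. This is a sketch using Python's standard `statistics` module; the `observations` list is hypothetical and chosen to have an even number of values so the median rule for two middle values applies.

```python
from statistics import mean, median, mode

# Hypothetical observations (e.g. number of siblings of six respondents)
observations = [1, 2, 2, 3, 4, 6]

most_frequent = mode(observations)  # 2 occurs twice, more often than any other value
middle = median(observations)       # even n: average of the two middle values, (2 + 3) / 2
average = mean(observations)        # sum of the values (18) divided by n (6)

print(most_frequent, middle, average)

# The mode also works for nominal data, where median and mean are not defined
print(mode(["red", "blue", "red", "green"]))  # red
```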
5. Range, interquartile range and box plot
- Range
  - the difference between the highest and lowest value
  - easy to understand
  - simple to compute
  - doesn’t give a good impression of the variability
    - only takes account of the extreme values
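The sensitivity of the range to extreme values can be seen in a short sketch. The helper name `value_range` and both data sets are assumptions made for this illustration.

```python
def value_range(values):
    """Range: difference between the highest and the lowest value."""
    return max(values) - min(values)

scores = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
# Same data, but the largest value is replaced by one extreme observation
scores_with_extreme = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]

print(value_range(scores))               # 9
print(value_range(scores_with_extreme))  # 99
```

Nine of the ten values are identical in both lists, yet the range changes from 9 to 99, which is why it gives a poor impression of the overall variability.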
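- Interquartile range
  - leaves out the extreme values
  - divides the distribution into four equal parts
  - Q1 & Q3
    - Tukey's method: split the sorted data at the median (Q2) into two halves, then take the median of each half as Q1 and Q3.
      - Data: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
        Q1 = 3; Q3 = 8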
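The split-at-the-median approach can be sketched as follows. The function name `tukey_quartiles` is hypothetical, and the handling of an odd number of observations (the median is included in both halves) is an assumption, since the notes only show an even-sized example.

```python
from statistics import median

def tukey_quartiles(values):
    """Q1, Q2, Q3 by splitting the sorted data at the median.

    Assumption: when n is odd, the median is included in both halves.
    """
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half + n % 2]  # lower half (plus the median when n is odd)
    upper = data[half:]          # upper half (plus the median when n is odd)
    return median(lower), median(data), median(upper)

q1, q2, q3 = tukey_quartiles(range(1, 11))
print(q1, q2, q3)  # 3 5.5 8
```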
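    - A more commonly used method: take the value at position (n+1)/4 of the sorted data as Q1, and the value at position 3(n+1)/4 as Q3.
      - Example data: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
        Position of Q1: (10+1)/4 = 2.75, i.e. the "2.75th" value, so Q1 = 2*0.25 + 3*0.75 = 2.75
        Likewise, position of Q3: 3*(10+1)/4 = 8.25, so Q3 = 8*0.75 + 9*0.25 = 8.25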
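The position-based calculation can be written as a linear interpolation between the two neighbouring values. This is a minimal sketch; the helper `quartile_by_position` is hypothetical. For this example it gives the same result as `statistics.quantiles` with `method="exclusive"`, which is shown for comparison.

```python
import math
from statistics import quantiles

def quartile_by_position(values, k):
    """k-th quartile (k = 1, 2, 3) at 1-based position k*(n+1)/4,
    interpolating linearly between the two neighbouring values."""
    data = sorted(values)
    n = len(data)
    pos = k * (n + 1) / 4
    pos = min(max(pos, 1), n)   # keep the position inside the data
    lo = math.floor(pos)        # 1-based index of the value just below
    frac = pos - lo             # how far we are towards the next value
    hi = min(lo + 1, n)
    return data[lo - 1] * (1 - frac) + data[hi - 1] * frac

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
print(quartile_by_position(data, 1))  # 2.75
print(quartile_by_position(data, 3))  # 8.25
print(quantiles(data, n=4, method="exclusive"))  # [2.75, 5.5, 8.25]
```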
    - Note: there are several ways to compute the interquartile range, and the results can differ slightly.
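  - IQR
    - IQR is short for interquartile range. For a sample, we compute the first quartile Q1 and the third quartile Q3; the IQR is their difference.
      \[Q_2=median\] \[IQR = Q_3 - Q_1\]
    - Outliers: the IQR is often used to screen out outliers. For example, values smaller than Q1 - 1.5 * IQR or larger than Q3 + 1.5 * IQR are treated as outliers.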
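The 1.5 * IQR rule can be sketched in a few lines. The function `iqr_outliers` and the data are hypothetical; the quartiles here come from `statistics.quantiles` with its default "exclusive" method, so a different quartile definition may give slightly different fences.

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Return the IQR, the (lower, upper) fences and the values outside them."""
    q1, _, q3 = quantiles(values, n=4)  # default method="exclusive"
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    outliers = [x for x in values if x < lower_fence or x > upper_fence]
    return iqr, (lower_fence, upper_fence), outliers

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
iqr, fences, outliers = iqr_outliers(data)
print(iqr, fences, outliers)  # 5.5 (-5.5, 16.5) [100]
```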
6. Variance and standard deviation
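- Variance
  - Variance measures how spread out a random variable or a set of data is. In probability theory, it measures how far a random variable deviates from its expected value (i.e. its mean). In statistics, the sample variance is the average of the squared differences between each sample value and the sample mean (for a sample, the sum of squares is divided by n - 1 rather than n).
  - larger variance -> larger variability -> the values are more spread out around the mean
  - disadvantage: the variance is expressed in the SQUARED unit of the variable under analysis
- Standard deviation
  \[S = \sqrt{\frac{\sum(x-\overline{x})^2}{n-1}}\]
  - most often used measure of dispersion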
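The formula for S can be followed step by step in code. This is a sketch with a hypothetical data set; `sample_variance` is an illustrative helper, and its result is compared with `statistics.variance` and `statistics.stdev`, which also divide by n - 1.

```python
import math
from statistics import stdev, variance

def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical observations, mean = 5

s2 = sample_variance(data)  # 32 / 7, in squared units of the variable
s = math.sqrt(s2)           # standard deviation, back in the original units

print(round(s2, 3), round(s, 3))  # 4.571 2.138
print(math.isclose(s2, variance(data)), math.isclose(s, stdev(data)))  # True True
```

Taking the square root is what removes the "squared metric" disadvantage of the variance.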
7. Z-score
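- the number of standard deviations a value is removed from the mean
- The z-score, also called the standard score, is obtained by subtracting the mean from a value and dividing the difference by the standard deviation. In statistics, it is the signed number of standard deviations by which an observation or data point lies above the mean of what is being observed or measured.
  The z-score answers the question: "How many standard deviations is a given score from the mean?" Scores above the mean get a positive z-score, and scores below the mean get a negative z-score. The z-score is a way to see the relative position of a score within its distribution.
  - \[Z = \frac{x-\overline{x}}{S}\]
- z-score and normal distribution
  - any distribution
  - standardization process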
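Standardization can be illustrated by converting every value of a data set to its z-score. This is a sketch under the assumption that S is the sample standard deviation (n - 1 in the denominator); the helper `z_scores` and the data are hypothetical.

```python
from statistics import mean, stdev

def z_scores(values):
    """Standardize values: subtract the mean, divide by the sample standard deviation."""
    x_bar = mean(values)
    s = stdev(values)
    return [(x - x_bar) / s for x in values]

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical observations, mean = 5
z = z_scores(data)

# Values below the mean get negative z-scores, values above it positive ones
print([round(v, 2) for v in z])

# After standardization the z-scores have mean 0 and standard deviation 1
print(round(mean(z), 6), round(stdev(z), 6))
```

This works for any distribution: standardizing changes the scale of the values, not the shape of their distribution.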