Basic Statistics - (09) R[4] Sampling Distribution
1. Sampling from the population
-
sample(data,size=)
In this lab we have access to the entire population. In real life, this is rarely the case. Often we gather information by taking a sample from a population. In the lectures, you’ve become familiar with the male beard length (in millimeters) of hipsters in Scandinavia. In this lab, we will be working with this example.
If we were interested in estimating the average male beard length of hipsters in Scandinavia, in R we can use the
sample()
function to sample from the population. For instance, to sample 50 inhabitants from our Scandinavian male hipster population which is included in the variablescandinavia_data
, we could do the following:sample(scandinavia_data, size = 50)
. This command collects a simple random sample of size 50. If we didn’t have access to the entire male hipster Scandinavian population, working with these 50 inhabitants would be considerably simpler than having to go through the entire Scandinavian male hipster population.-
Instruction
- Make sure not to remove the
set.seed(11225)
code. This makes sure that you will get the same results as the solution code - Sample 100 values from the dataset
scandinavia_data
and store this in a variablefirst_sample
- Calculate the mean of
first sample
and print the result.
- Make sure not to remove the
-
Answer
# variable scandinavia_data contains the beard lengths of scandinavian male population set.seed(11225) first_sample<-sample(scandinavia_data, size= 100) first_sample mean(first_sample)
> # variable scandinavia_data contains the beard lengths of scandinavian male population > set.seed(11225) > first_sample<-sample(scandinavia_data, size= 100) > first_sample [1] 27.99663 24.06417 23.48866 23.75189 27.29676 24.45529 28.23739 23.04652 [9] 23.74036 28.76409 28.12697 31.55525 22.05166 23.69069 27.32476 22.51063 [17] 27.79851 28.65205 19.44881 21.84980 31.13625 27.46308 26.22158 22.35466 [25] 20.81535 27.51356 22.84791 27.03847 20.50305 26.59806 24.03845 33.41994 [33] 25.07562 27.25052 27.96126 26.65947 20.68557 26.78273 31.56429 16.31928 [41] 23.27613 25.58413 23.68520 30.32883 24.99264 22.70653 22.07276 28.34237 [49] 31.62887 25.12103 22.62322 26.29557 22.99453 21.99805 28.44408 24.74419 [57] 20.63301 27.68588 26.60778 23.19137 22.59585 25.24416 20.86827 25.24960 [65] 31.63289 24.96406 26.42585 23.94020 23.19363 28.22559 23.36927 22.90029 [73] 24.71661 24.60919 29.18013 32.82084 23.33415 28.00259 29.57501 22.81163 [81] 30.19587 23.66761 20.68534 28.89661 23.27482 21.55286 28.16404 29.45489 [89] 24.22142 25.88669 22.46760 23.97159 25.77642 31.02772 26.19912 23.75873 [97] 24.82717 24.50296 21.66709 26.00112 > mean(first_sample) [1] 25.42916
-
2. Loop 循环
-
for(i in ){}
Not surprisingly, every time we take another random sample, we get a different sample mean. It’s useful to get a sense of just how much variability we should expect when estimating the population mean this way.
The distribution of sample means, or the sampling distribution, can help us understand this variability. However, before continuing with the sampling distribution, we will firstly introduce the concept of a for loop in R.
Every time some operation has to be repeated a specific number of times, a for loop may come in handy. We only need to specify how many times or upon which conditions those operations need execution: we assign initial values to a control loop variable, perform the loop and then, once finished, we typically do something with the results.
-
Instruction
Look at the example in the sample code and think what this code will do. Subsequently, press “Submit Answer”.
-
Answer
# initialize an empty vector new_number <- NULL # run an operation 10 times. # The ith position of new number will be set to i # at the end of the loop, the vector new_number is printed for (i in 1:10){ new_number[i] <- i } print(new_number)
# initialize an empty vector > new_number <- NULL > > # run an operation 10 times. > # The ith position of new number will be set to i > # at the end of the loop, the vector new_number is printed > for (i in 1:10){ new_number[i] <- i } > > print(new_number) [1] 1 2 3 4 5 6 7 8 9 10
-
3. Mean of the Sampling Distribution
-
mean(sample)
The mean of a sample that you take from the population will never be very far away from the population mean (provided that you randomly sample from the population). Furthermore, the mean of the sampling distribution, that is the mean of the mean of all the samples that we took from the population will never be far away from the population mean. Let’s observe this in practice.
-
Instruction
- Calculate the mean of the population and print it. Note that the population is included in the variable
scandinavia_data
. - Calculate the mean of the sample means and print it. Note that the sample means are included in the variable
sample_means
. - Note how close the two are.
- Calculate the mean of the population and print it. Note that the population is included in the variable
-
Answer
# set the seed such that you will get the same sample as in the solution code set.seed(11225) # empty vector sample means sample_means <- NULL # take 200 samples from scandinavia_data for (i in 1:500){ samp <- sample(scandinavia_data, 200) sample_means[i] <- mean(samp) } # calculate the population mean, that is, the mean of scandinavia_data and print it mean(scandinavia_data) # calculate the mean of the sample means, that is, sample_means mean(sample_means)
> # set the seed such that you will get the same sample as in the solution code > set.seed(11225) > > # empty vector sample means > sample_means <- NULL > > # take 200 samples from scandinavia_data > for (i in 1:500){ samp <- sample(scandinavia_data, 200) sample_means[i] <- mean(samp) } > > # calculate the population mean, that is, the mean of scandinavia_data and print it > mean(scandinavia_data) [1] 24.97331 > > # calculate the mean of the sample means, that is, sample_means > mean(sample_means) [1] 24.9644
-
4. Standard Deviation of the Sampling Distribution
-
σ/sqrt(n)
In the previous weeks you have become familiar with the concept of standard deviation. You may recall that this concept refers to the spread of a distribution. In R you can calculate the standard deviation using the function
sd()
.However, the standard deviation of the sampling distribution is called the standard error. The standard error is calculated slightly differently from the standard deviation. The formula for the standard error can be found below: \(\overline{x} = \frac{\sigma}{\sqrt{n}}\)
In this formula, the sigma refers to the standard deviation, while n refers to the sample size of the sample.
-
Instruction
- Calculate the standard deviation of the population and put it in the variable
population_sd
. Note that the population can be found in the variablescandinavia_data
. Printpopulation_sd
to the console. - Use
population_sd
to calculate the standard error of a sample of 200 cases and put it in the variablesampling_sd
. Printsampling_sd
to the console.
- Calculate the standard deviation of the population and put it in the variable
-
Answer
# standard deviation of the population population_sd<-sd(scandinavia_data) population_sd # standard deviation of the sampling distribution sampling_sd<-population_sd/sqrt(200) sampling_sd
> # standard deviation of the population > population_sd<-sd(scandinavia_data) > population_sd [1] 3.466054 > # standard deviation of the sampling distribution > > sampling_sd<-population_sd/sqrt(200) > sampling_sd [1] 0.245087
-
5 The Central Limit Theorem
Earlier we touched on the sampling distribution and its mean and standard deviations. Now, we will look at the central limit theorem, one of the most important theorems when it comes to inferential statistics. Briefly this theorem states the following:
“Provided that the sample size is sufficiently large, the sampling distribution of the sample mean is approximately normally distributed even if the variable of interest is not normally distributed in the population”
In this exercise we will take a look at a new population of simulated household income of citizens in the United States. The data is stored in a variable called household_income
. This population is right skewed. We will take a 1000 samples of n = 200 from this population and calculate the sample mean each time. You will see that the sampling distribution, just as the central limit theorem states, is normally distributed.
-
Instructions
- Make a histogram of the variable
household_income
and see how the population is skewed to the right. - Make a histogram of the variable sample_means. Can you see how this variable looks normally distributed?
- You can press on the button “Previous plot” to check the previous histogram.
- Make a histogram of the variable
-
Answer
# empty vector sample means sample_means <- NULL # take 200 samples from scandinavia_data for (i in 1:1000){ samp <- sample(household_income, 200) sample_means[i] <- mean(samp) } # make a histogram of household_income hist(household_income, xlab="INCOME", ylab="N", main="household income") # make a histogram of sample_means hist(sample_means, xlab="INCOME", ylab="N", main="sample mean")
6. Z-SCORE
-
pnorm(z-score, lower.tail=TRUE/FALSE)
A zscore by itself may not always be easy to interpret. Yes, it does indicate the amount of standard deviations away from the population mean, but this may sound like jibberish to many people. Wouldn’t it be great to translate a z score to a probability?
Z scores can be easily translated to probabilities. There are multiple ways to do so:
- Look up the z score in a table
- Calculate the probability using R
In R we can use the
pnorm()
function to calculate the probability of obtaining a given score or a more extreme score in the population. Basically this calculate an area under the bell curve. The functionpnorm()
has several parameters you can include such as:- q: The observation for which you want to calculate the probability
- mean: The population mean
- sd: The population standard deviation
- lower.tail: Indicates whether you want to calculate the area under the curve left of your observations or right of your observations
Let’s look at how to use
pnorm()
and let’s play around with thelower.tail
option.-
直接给出z值后,就不用再输入mean和sd了。
-
Instruction
- Recall that the z score for the scandinavian hipster in the previous exercise was 2.02. Calculate the area left of this observation by specifying
lower.tail = TRUE
in pnorm and print this probability. - Now calculate the area under the curve right of this observation by specifying
lower.tail = FALSE
and print this probability. - You can use
pnorm()
the following way:pnorm(2.02, lower.tail = TRUE)
- Recall that the z score for the scandinavian hipster in the previous exercise was 2.02. Calculate the area left of this observation by specifying
-
Answer
# calculate the area under the curve left of the observation pnorm(2.02,lower.tail= TRUE) # calculate the area under the curve right of the observation pnorm(2.02,lower.tail= FALSE)
> # calculate the area under the curve left of the observation > pnorm(2.02,lower.tail= TRUE) [1] 0.9783083 > # calculate the area under the curve right of the observation > pnorm(2.02,lower.tail= FALSE) [1] 0.02169169
7. Sample Distribution of Sampling Distribution
So far we have calculated the probabilities of observations using mean and standard deviation values from the population. However, we can also calculate these observation probabilities using mean and standard deviation values from the sample. For instance, we could have a question along the lines of what is the probability that the sample mean of the beard length of 50 Scandinavian hipsters is larger or equal to 26 millimeters. Because in this example we are talking about a specific sample from the population, we make use of the sampling distribution and not the population distribution.
Because we make use of the sampling distribution, we are now using the standard deviation of the sampling distribution which is calculated using the formula σ/sqrt(n).
-
Instruction
Calculate the probability that a sample mean of the beard length of 50 Scandinavian hipsters is larger or equal to 26 millimeters. Recall that the population is contained in the variable
scandinavia_data
. Take all the steps indicated by comments in the editor. -
Answer
# calculate the population mean population_mean <- mean(scandinavia_data) # calculate the population standard deviation population_sd <- sd(scandinavia_data) # calculate the standard deviation of the sampling distribution and put it in a variable sampling_sd sampling_sd<- population_sd/sqrt(50) sampling_sd # calculate the Z score and put in a variable called z_score z_score<- (26-population_mean)/sampling_sd z_score # cumulative probability calculation. Don't forget to set lower.tail to FALSE pnorm(z_score, lower.tail=FALSE)
> # calculate the population mean > population_mean <- mean(scandinavia_data) > > # calculate the population standard deviation > population_sd <- sd(scandinavia_data) > > # calculate the standard deviation of the sampling distribution and put it in a variable sampling_sd > sampling_sd<- population_sd/sqrt(50) > sampling_sd [1] 0.4901741 > # calculate the Z score and put in a variable called z_score > z_score<- (26-population_mean)/sampling_sd > z_score [1] 2.094534 > # cumulative probability calculation. Don't forget to set lower.tail to FALSE > pnorm(z_score, lower.tail=FALSE) [1] 0.01810623
8. Sample Distribution of Binominal Distribution
So let’s continue working with proportions. Imagine we took a random sample of 200 people from London’s general population, and a proportion of 0.13 of these people were hipsters. We however know that in the population of London, the proportion of hipsters is 0.10. What is the probability of finding a sample of 200 with a proportion of 0.13 or more hipsters?
Let’s break this problem into steps. Firstly we can calculate the standard deviation of the sampling distribution. The second step is using a function that may look familiar: pnorm()
. Although we do not have a mean, we can use our sampling and population proportions. Our sampling proportion will constitute the q argument here, while our population proportion will constitute the mean argument. Now let’s get going, what is the probability of finding a sample of 200 with a proportion of 0.13 or more hipsters?
- Instruction
- Calculate the probability of finding a sample of 200 with a proportion of 0.13 or more hipsters using the
pnorm()
function.
- Calculate the probability of finding a sample of 200 with a proportion of 0.13 or more hipsters using the
-
Answer
# calculate the standard deviation of the sampling distribution and put in a variable called sample_sd sample_sd = sqrt(0.1*0.9/200) # calculate the probability pnorm(0.13, mean=0.1, sd= sample_sd, lower.tail=FALSE)
> # calculate the standard deviation of the sampling distribution and put in a variable called sample_sd > sample_sd = sqrt(0.1*0.9/200) > > # calculate the probability > pnorm(0.13, mean=0.1, sd= sample_sd, lower.tail=FALSE) [1] 0.0786496