Basic Statistics - (07) R[3] Randomness & Probability
1. Probability mass and density functions
-
barplot()
From the lectures you may recall the concepts of probability mass and density functions. Probability mass functions relate to the probability distributions of discrete variables, while probability density functions relate to the probability distributions of continuous variables. Suppose we have the following probability mass function:
-
Instruction
Using the barplot function, make a probability histogram of the above probability mass function. Specify the height of the bars with the y variable and the names of the bars (names.arg), that is, the labels on the x axis, with the x variable in your dataframe.
Answer
# the data frame
data <- data.frame(outcome = 0:5, probs = c(0.1, 0.2, 0.3, 0.2, 0.1, 0.1))
# make a histogram of the probability distribution
barplot(data$probs, ylim = c(0, 1), names.arg = data$outcome)
-
-
The Normal Distribution
Density, distribution function, quantile function and random generation for the normal distribution with mean equal to mean and standard deviation equal to sd.
-
Usage
dnorm(x, mean = 0, sd = 1, log = FALSE)
pnorm(q, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)
qnorm(p, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)
rnorm(n, mean = 0, sd = 1)
-
Arguments
-
x, q
vector of quantiles.
-
p
vector of probabilities.
-
n
number of observations. If length(n) > 1, the length is taken to be the number required.
mean
vector of means.
-
sd
vector of standard deviations.
-
log, log.p
logical; if TRUE, probabilities p are given as log(p).
-
lower.tail
logical; if TRUE (default), probabilities are \(P[X \le x]\), otherwise \(P[X > x]\).
-
-
Details
If mean or sd are not specified they assume the default values of 0 and 1, respectively.
The normal distribution has density \(f(x) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-(x-\mu)^2/(2\sigma^2)}\) where \(\mu\) is the mean of the distribution and \(\sigma\) the standard deviation.
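As a quick check of the density formula above, the sketch below evaluates it by hand and compares the result with dnorm(). The helper name normal_density and the chosen x values are assumptions made for illustration only.

```r
# Hand-written normal density following the formula above.
# `normal_density` is a hypothetical helper, not part of base R.
normal_density <- function(x, mu = 0, sigma = 1) {
  1 / (sqrt(2 * pi) * sigma) * exp(-(x - mu)^2 / (2 * sigma^2))
}

# Illustrative x values (assumed), evaluated with mean 25 and sd 5
x <- c(15, 20, 25, 30, 35)
manual  <- normal_density(x, mu = 25, sigma = 5)
builtin <- dnorm(x, mean = 25, sd = 5)

# The two vectors should agree up to floating-point error
all.equal(manual, builtin)
```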
-
Value
dnorm gives the density, pnorm gives the distribution function, qnorm gives the quantile function, and rnorm generates random deviates.
The length of the result is determined by n for rnorm, and is the maximum of the lengths of the numerical arguments for the other functions.
The numerical arguments other than n are recycled to the length of the result. Only the first elements of the logical arguments are used.
For sd = 0 this gives the limit as sd decreases to 0, a point mass at mu. sd < 0 is an error and returns NaN.
dnorm() & pnorm() & qnorm() & rnorm()
-
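dnorm(data, mean, sd)
Plotting the distribution of a given set of numbers: this function gives the height of the probability density at each point for a given mean and standard deviation.
You first need to specify a range of values, and then draw the curve with plot(x = <range of data>, y = dnorm(data, mean, sd)).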
-
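pnorm(x, mean, sd)
Cumulative distribution function: this function gives the probability that a normally distributed random number is less than a given value. Use it to find the cumulative probability for a given x value.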
-
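qnorm(p, mean, sd)
This function takes a probability and returns the number whose cumulative probability matches it. Use it to find the x value for a given cumulative probability.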
-
rnorm(n)
This function generates n normally distributed random numbers. It takes the sample size as input and generates that many random numbers.
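The four functions described above can be tried side by side. This is a minimal sketch on the standard normal distribution; the grid of x values, the value 1.96 and the seed are arbitrary choices made for illustration.

```r
# dnorm: height of the density curve on a grid of x values
x <- seq(-4, 4, by = 0.1)
dens <- dnorm(x, mean = 0, sd = 1)

# pnorm: cumulative probability P(X <= 1.96)
p <- pnorm(1.96)

# qnorm: the inverse of pnorm, so this recovers 1.96
q <- qnorm(p)

# rnorm: draw 5 random values (seed fixed so the draws are reproducible)
set.seed(1)
draws <- rnorm(5)
```

Passing x and dens to plot(x, dens, type = "l") would draw the familiar bell curve.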
-
-
Exercise
dnorm()
For continuous variables, the values of a variable are associated with a probability density. To get a probability, you will need to consider an interval under the curve of the probability density function. Probabilities here are thus considered surface areas.
In this exercise, we will simulate some random normally distributed data using the rnorm() function. This data is contained within the data vector. You will then need to visualize the data.
Instruction
- Check the documentation of the dnorm function using help(dnorm)
- Now calculate the density of the data vector and store it in a vector called density
- Finally make a plot with as x variable the data vector and as y variable the density variable
-
Answer
# simulating data
set.seed(11225)
data <- rnorm(10000)
# check for documentation of the dnorm function
help(dnorm)
# calculate the density of data and store it in the variable density
density <- dnorm(data)
# make a plot with as x variable data and as y variable density
plot(x = data, y = density)
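The statement that probabilities of continuous variables are areas under the density curve can be checked numerically. The sketch below, which assumes a standard normal variable and the interval from -1 to 1 as an example, computes that area once with numerical integration and once with pnorm().

```r
# P(-1 <= X <= 1) for a standard normal variable, computed two ways
area_numeric <- integrate(dnorm, lower = -1, upper = 1)$value
area_pnorm   <- pnorm(1) - pnorm(-1)

# The density at a single point is a height, not a probability
height_at_zero <- dnorm(0)

c(area_numeric = area_numeric, area_pnorm = area_pnorm)
```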
-
-
The cumulative probability distribution
cumsum()
In the last two exercises, we saw the probability distributions of a discrete and a continuous variable. In this exercise we will jump into cumulative probability distributions. Let’s go back to our probability mass function of the first exercise:
All the probabilities in the table are included in the dataframe probability_distribution, which contains the variables outcome and probs. We could sum individual probabilities in order to get a cumulative probability of a given value. However, in some cases, the function cumsum() may come in handy. What cumsum() does is return a vector whose elements are the cumulative sums of the elements of its argument. For instance, if we had a vector containing the elements c(1, 2, 3), cumsum() would return c(1, 3, 6).
Instructions
- Calculate the probability that a variable x is smaller than or equal to two. Put the result in the variable prob. You can use the values from the table displayed above.
- Calculate the cumulative probability that a variable x is respectively 0, smaller than or equal to one, smaller than or equal to two, and smaller than or equal to three. Use the cumsum() function for this and print the output to the console.
-
Answer
# probability that x is smaller than or equal to two
prob <- 0.6
# probability that x is 0, smaller than or equal to one,
# smaller than or equal to two, and smaller than or equal to three
cumsum(probability_distribution$probs)
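The cumulative probabilities can also be looked up by outcome rather than read off by position. The sketch below rebuilds probability_distribution with the values of the first exercise (an assumption, since the table itself is not reproduced here) and labels each cumulative sum with its outcome.

```r
# Assumed to match the table of the first exercise
probability_distribution <- data.frame(
  outcome = 0:5,
  probs   = c(0.1, 0.2, 0.3, 0.2, 0.1, 0.1)
)

# Running total of the probabilities, labelled by outcome
cumulative <- cumsum(probability_distribution$probs)
names(cumulative) <- probability_distribution$outcome

# P(X <= 2) is the entry labelled "2"
prob <- cumulative[["2"]]
prob
```

The last entry of cumulative should be 1, up to floating-point error, because the probabilities of all outcomes sum to one.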
-
2. Summary statistics
-
The mean
One of the first things you would like to know about a probability distribution is a set of summary statistics that capture the essence of the distribution. One example of such a summary statistic is the mean. The mean of a probability distribution is calculated by taking the weighted average of all possible values that a random variable can take. In the case of a discrete variable, you calculate the sum of each possible value times its probability. Let’s go back to our probability mass function of the first exercise.
-
Instruction
- Calculate the expected value of the probability distribution and store this in the variable expected_score. expected_score should be a number rounded to 1 decimal. Note that you have a dataframe data available in your console that contains a vector of outcomes called outcome and a vector of probabilities called probs. This dataframe is the exact same as the table displayed above.
- Print the variable expected_score
-
Answer
# calculate the expected value and assign it to the variable expected_score
expected_score <- sum(data$outcome * data$probs)
# print the variable expected_score
expected_score
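One way to build intuition for the expected value is to simulate many draws from the distribution and compare their average with the weighted sum. This is a sketch only; the sample size and seed are arbitrary assumptions.

```r
# The probability mass function of the first exercise
data <- data.frame(outcome = 0:5, probs = c(0.1, 0.2, 0.3, 0.2, 0.1, 0.1))

# Expected value as a probability-weighted sum, rounded to 1 decimal
expected_score <- round(sum(data$outcome * data$probs), 1)

# Draw many outcomes with the given probabilities (seed chosen arbitrarily)
set.seed(42)
draws <- sample(data$outcome, size = 100000, replace = TRUE, prob = data$probs)

# The sample mean should be close to the expected value
simulated_mean <- mean(draws)
c(expected = expected_score, simulated = simulated_mean)
```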
-
-
Variance and the standard deviation
In addition to the mean, sometimes you would also like to know about the spread of the distribution. The variance is often taken as a measure of the spread of a distribution. It is the expected squared deviation of an observation from the mean. If you want to calculate it on the basis of a probability distribution, it is the sum of the squared differences between the individual outcomes and their mean, each multiplied by its probability. See the following formula: \(\mathrm{var}(X) = \sum_i (x_i - \bar{x})^2 \, P(x_i)\).
If we want to turn that variance into the standard deviation, all we need to do is to take its square root. Let’s go back to our probability mass function of the first exercise and see if we can get the variance.
-
Instruction
- Calculate the variance of the mass function displayed above and store this in a variable called variance. The mean of the probability mass function, displayed as \(\bar{x}\) in the formula, is stored in the variable expected_score. Note that you have a dataframe data available in your console that contains a vector of outcomes called outcome and a vector of probabilities called probs. This dataframe is the exact same as the table displayed above.
- Calculate the standard deviation of the mass function displayed above and store this in a variable called std.
-
Answer
# the mean of the probability mass function
expected_score <- sum(data$outcome * data$probs)
# calculate the variance and store it in a variable called variance
variance <- sum((data$outcome - expected_score)^2 * data$probs)
variance
# calculate the standard deviation and store it in a variable called std
std <- sqrt(variance)

> # the mean of the probability mass function
> expected_score <- sum(data$outcome * data$probs)
> expected_score
[1] 2.3
> # calculate the variance and store it in a variable called variance
> variance <- sum((data$outcome-expected_score)^2*data$probs)
> variance
[1] 2.01
> # calculate the standard deviation and store it in a variable called std
> std<-sqrt(variance)
> std
[1] 1.417745
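The variance can also be cross-checked with the standard identity var(X) = E[X^2] - (E[X])^2. The identity is not used in the exercise itself; the sketch below only confirms that both routes give the same number for this distribution.

```r
# The probability mass function of the first exercise
data <- data.frame(outcome = 0:5, probs = c(0.1, 0.2, 0.3, 0.2, 0.1, 0.1))
expected_score <- sum(data$outcome * data$probs)

# Variance from the definition: probability-weighted squared deviations
variance_definition <- sum((data$outcome - expected_score)^2 * data$probs)

# Variance from the shortcut: E[X^2] - (E[X])^2
variance_shortcut <- sum(data$outcome^2 * data$probs) - expected_score^2

# Standard deviation is the square root of the variance
std <- sqrt(variance_definition)
c(definition = variance_definition, shortcut = variance_shortcut, std = std)
```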
-
3. The Normal Distribution and Cumulative Probability
-
pnorm()
In the previous assignment we calculated probabilities according to the normal distribution by looking at an image. However, it is not always as simple as that. Sometimes we deal with cases where we want to know the probability that a normally distributed variable falls within a certain interval. Let’s work with an example of female hair length.
Hair length is considered to be normally distributed with a mean of 25 centimeters and a standard deviation of 5. Imagine we wanted to know the probability that a woman’s hair length is less than 30. We can do this in R using the pnorm() function. This function calculates the cumulative probability. We can use it the following way: pnorm(30, mean = 25, sd = 5). If you wanted to calculate the probability of a woman having a hair length larger than or equal to 30 centimeters, you can set the lower.tail argument to FALSE. For instance, pnorm(30, mean = 25, sd = 5, lower.tail = FALSE). Let’s visualize this. Note that the first example is visualized on the left, while the second example is visualized on the right:
Instruction
Calculate the probability of a woman having a hair length less than 20 centimeters using a mean of 25 and a standard deviation of 5. Use the pnorm() function and round the value to two decimals.
Answer
# probability of a woman having a hair length of less than 20 centimeters
lower <- pnorm(20, mean = 25, sd = 5)
round(lower, 2)
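The section mentions probabilities for an interval, which follow from subtracting two cumulative probabilities. A minimal sketch, assuming the interval from 20 to 30 centimeters as an example and using variable names chosen for illustration:

```r
mean_length <- 25
sd_length   <- 5

# Cumulative probabilities at both ends of the interval
p_below_30 <- pnorm(30, mean = mean_length, sd = sd_length)
p_below_20 <- pnorm(20, mean = mean_length, sd = sd_length)

# P(20 < X < 30) is the difference between the two
p_between <- p_below_30 - p_below_20

# The upper tail is the complement of the lower tail
p_above_30 <- pnorm(30, mean = mean_length, sd = sd_length, lower.tail = FALSE)

round(c(between = p_between, above_30 = p_above_30), 4)
```

Because 20 and 30 lie one standard deviation on either side of the mean, p_between is roughly 0.68.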
-
4. The normal distribution and quantiles
-
qnorm()
Sometimes we have a probability that we want to associate with a value. This is basically the opposite of the situation described in the previous question. Say we want the value of a woman’s hair length that corresponds with the 0.2 quantile (=20th percentile). Let’s consider visually what this means:
In the visualization, we are given a blue area with a probability of 0.2. We however want to know the value that is associated with the yellow dotted vertical line. This value is the 0.2 quantile (=20th percentile) and divides the curve into an area that contains the lower 20% of the scores and an area that contains the rest of the scores. If our variable is normally distributed, in R we can use the function qnorm() to do so. We can specify the probability as the first parameter, then specify the mean and then specify the standard deviation, for example, qnorm(0.2, mean = 25, sd = 5).
Instruction
Calculate the 85th percentile of the distribution of female hair length and round this value to two decimals. Note that the mean is 25 and the standard deviation is 5.
-
Answer
# 85th percentile of female hair length
quantile <- qnorm(0.85, mean = 25, sd = 5)
round(quantile, 2)
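Since qnorm() is the inverse of pnorm(), feeding a quantile back into pnorm() should return the original probability. The sketch below illustrates this round trip with the hair length example; the variable names are chosen for illustration.

```r
# 0.2 quantile of hair length (mean 25, sd 5)
q20 <- qnorm(0.2, mean = 25, sd = 5)

# Plugging the quantile back into pnorm() recovers the probability 0.2
p_back <- pnorm(q20, mean = 25, sd = 5)

# 85th percentile, as in the exercise
q85 <- qnorm(0.85, mean = 25, sd = 5)

round(c(q20 = q20, p_back = p_back, q85 = q85), 2)
```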
-
5. The normal distribution and Z scores
A special form of the normal probability distribution is the standard normal distribution, also known as the z-distribution. A z-distribution has a mean of 0 and a standard deviation of 1. Often you can transform variables to z values. You can transform the values of a variable to z-scores by subtracting the mean, and dividing this by the standard deviation. If you perform this transformation on the values of a data set, your transformed data set will have a mean of 0 and a standard deviation of 1. The formula to transform a value to a z-score is the following: \(z = \frac{x - \mu}{\sigma}\), where \(\mu\) is the mean and \(\sigma\) the standard deviation.
The Z-score represents how many standard deviations from the mean a value lies.
-
Instruction
Imagine we have a woman with a hair length of 38 centimeters and the average hair length was 25 centimeters and the standard deviation was 5 centimeters. Calculate the Z value for this woman and store it in the variable z_value. Round z_value to one decimal.
Answer
# calculate the z value and store it in the variable z_value
z_value <- (38 - 25) / 5
# round z_value to one decimal
z_value <- round(z_value, 1)
z_value
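The claim that standardized data have a mean of 0 and a standard deviation of 1 can be checked on simulated data. In this sketch the sample of hair lengths is simulated, and the sample size and seed are assumptions made for illustration.

```r
# Simulated hair lengths (hypothetical sample; seed chosen arbitrarily)
set.seed(123)
hair <- rnorm(1000, mean = 25, sd = 5)

# Standardize using the sample's own mean and standard deviation
z <- (hair - mean(hair)) / sd(hair)
c(mean_z = mean(z), sd_z = sd(z))

# A z-score on the standard normal gives the same probability
# as the raw value on the original scale
p_raw <- pnorm(38, mean = 25, sd = 5)
p_z   <- pnorm((38 - 25) / 5)
```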
6. The binomial distribution
The binomial distribution is important for discrete variables. There are a few conditions that need to be met before you can consider a random variable to be binomially distributed:
- There is a phenomenon or trial with two possible outcomes and a constant probability of success - this is called a Bernoulli trial
- All trials are independent
Other essential ingredients of a binomial distribution are that we observe a certain number of trials, let’s call this n, and that we count the number of successes in which we are interested, let’s call this x. Useful summary statistics for a binomial distribution are the same as for the normal distribution: the mean and the standard deviation.
The mean is calculated by multiplying the number of trials n by the probability of a success denoted by p. The standard deviation of a binomial distribution is calculated by the following formula: \(\sigma = \sqrt{n p (1 - p)}\).
-
Instruction
- Consider an example where we have made an exam consisting of 25 multiple choice questions. Each question has 5 possible answers. This means that the probability of answering a question correctly by chance is 0.2. Calculate the mean of this distribution and store it in a variable called mean_chance
- Calculate the standard deviation of this distribution and store it in the variable std_chance.
-
Answer
# calculate the mean and store it in the variable mean_chance
mean_chance <- 0.2 * 25
mean_chance
# calculate the standard deviation and store it in the variable std_chance
std_chance <- (25 * 0.2 * 0.8)^0.5
std_chance
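The formulas for the binomial mean and standard deviation can be compared with the general definitions used earlier for discrete distributions. The sketch below uses dbinom() to obtain the probability of every possible number of correct answers; the variable names are chosen for illustration.

```r
n <- 25    # number of questions
p <- 0.2   # probability of guessing one question correctly

# Probability of each possible number of correct answers, 0 to 25
x   <- 0:n
pmf <- dbinom(x, size = n, prob = p)

# Mean and standard deviation from the general discrete definitions
mean_from_pmf <- sum(x * pmf)
sd_from_pmf   <- sqrt(sum((x - mean_from_pmf)^2 * pmf))

# Mean and standard deviation from the binomial formulas
mean_chance <- n * p
std_chance  <- sqrt(n * p * (1 - p))

c(mean_from_pmf, mean_chance, sd_from_pmf, std_chance)
```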
7. Calculating probabilities of binomial distributions in R
-
dbinom() & pbinom()
-
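Binomial distribution
dbinom(x, size, prob)
pbinom(x, size, prob)
qbinom(p, size, prob)
rbinom(n, size, prob)
- x is a vector of numbers.
- p is a vector of probabilities.
- n is the number of observations.
- size is the number of trials.
- prob is the probability of success on each trial.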
Just as with the normal distribution, we can also calculate probabilities according to the binomial distribution. Let’s consider the example in the previous question. We had an exam with 25 questions and a 0.2 probability of guessing a question correctly. In contrast to the normal distribution, when we deal with a binomial distribution we can calculate the probability of answering exactly, say, 5 questions correctly. This is because a binomial distribution is a discrete distribution.
When we want to calculate the probability of answering 5 questions correctly, we can use the dbinom function. This function calculates an exact probability. If we would like to calculate the probability of an interval, say the probability of answering 5 or more questions correctly, we can use the pbinom function. We have already seen a similar function when we were dealing with the normal distribution: the pnorm() function.
Instruction
- Look at the documentation of the functions dbinom() and pbinom(). Calculate the exact probability of answering 5 questions correctly and store this in the variable five_correct
- Calculate the cumulative probability of answering at least 5 questions correctly and store this in the variable atleast_five_correct
-
Answer
# probability of answering 5 questions correctly
five_correct <- dbinom(5, size = 25, prob = 0.2)
# probability of answering at least 5 questions correctly
atleast_five_correct <- pbinom(4, size = 25, prob = 0.2, lower.tail = FALSE)
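The answer passes 4 rather than 5 to pbinom() because, with lower.tail = FALSE, pbinom() returns P(X > q), and for a discrete variable P(X >= 5) equals P(X > 4). The sketch below computes the same probability in three ways to make this explicit; the variable names are chosen for illustration.

```r
# P(X >= 5) as the upper tail beyond 4
upper_tail <- pbinom(4, size = 25, prob = 0.2, lower.tail = FALSE)

# The same probability as the complement of P(X <= 4)
complement <- 1 - pbinom(4, size = 25, prob = 0.2)

# The same probability as the sum of the exact probabilities for 5 to 25
summed <- sum(dbinom(5:25, size = 25, prob = 0.2))

c(upper_tail = upper_tail, complement = complement, summed = summed)
```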
-
-
qbinom()
Remember the concept of quantiles? If not, let me briefly recap it. Quantiles are used when you have a probability and you want to associate this probability with a value. In our last example we had 25 questions and the probability of guessing a question correctly was 0.2. Also, in our last example we wanted to know the probability of answering at least 5 questions correctly and used the pbinom() function to do so. With quantiles, we do the exact opposite; we want to calculate the value that is associated with, for instance, the 0.2 quantile (=20th percentile). In case we are working with a binomial distribution, we can use the function qbinom() for this.
Instructions
Calculate the 60th percentile of the binomial distribution of exam questions. Note that the number of questions is 25 and the probability of guessing a question correctly is 0.2.
-
Answer
# calculate the 60th percentile
qbinom(0.6, 25, 0.2)
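Because the binomial distribution is discrete, a cumulative probability of exactly 0.6 is usually not attained. qbinom() returns the smallest number of successes whose cumulative probability is at least the requested probability. The sketch below checks that property for the exam example; the variable names are chosen for illustration.

```r
# 60th percentile of the number of correctly guessed questions
q60 <- qbinom(0.6, size = 25, prob = 0.2)

# Cumulative probability at the quantile and just below it
p_at    <- pbinom(q60, size = 25, prob = 0.2)
p_below <- pbinom(q60 - 1, size = 25, prob = 0.2)

# p_at reaches 0.6 while p_below stays under it
c(q60 = q60, p_at = p_at, p_below = p_below)
```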
-