Inferential Statistics(2)-R[1]
1. Matrices
In R, a matrix is a collection of elements of the same data type (numeric, character, or logical) arranged into a fixed number of rows and columns. Since you are only working with rows and columns, a matrix is called two-dimensional.
You can construct a matrix in R with the matrix() function. Consider the following example: matrix(1:9, byrow = TRUE, nrow = 3, ncol = 3)
In the matrix() function:
-
The first argument is the collection of elements that R will arrange into the rows and columns of the matrix. Here, we use 1:9 which constructs the vector c(1, 2, 3, 4, 5, 6, 7, 8, 9).
-
The argument byrow indicates that the matrix is filled by the rows. This means that the matrix is filled from left to right and when the first row is completed, the filling continues on the second row. If we want the matrix to be filled by the columns, we just place byrow = FALSE.
-
The third argument nrow indicates that the matrix should have three rows.
-
The fourth argument ncol indicates the number of columns that the matrix should have
-
Instructions
- Construct a matrix with 5 rows and 4 columns containing the numbers 1 up to 20 and assign it to the variable
m
. Specify thebyrow
argument to beTRUE
- Print m to the console
- Construct a matrix with 5 rows and 4 columns containing the numbers 1 up to 20 and assign it to the variable
-
Answer
# Construction of a matrix with 5 rows that contain the numbers 1 up to 20 and assign it to m m<-matrix(1:20,byrow = TRUE, nrow=5) # print m to the console m
> # Construction of a matrix with 5 rows that contain the numbers 1 up to 20 and assign it to m > m<-matrix(1:20,byrow = TRUE, nrow=5) > > # print m to the console > m [,1] [,2] [,3] [,4] [1,] 1 2 3 4 [2,] 5 6 7 8 [3,] 9 10 11 12 [4,] 13 14 15 16 [5,] 17 18 19 20
2. Factors
因子被用来表示类别数据,因此也被称为“类别变量”。
The term factor refers to a statistical data type used to store categorical variables. The difference between a categorical variable and a continuous variable is that a categorical variable can belong to a limited number of categories. A continuous variable, on the other hand, can correspond to an infinite number of values.
It is important that R knows whether it is dealing with a continuous or a categorical variable, as the statistical models you will develop in the future treat both types differently.
A good example of a categorical variable is the variable student_status
. An individual can either be “student” or “not student”. This means that “student” and “not student” are two values of the categorical variable student_status
and every observation can be assigned one of these values. We can do this using the factor
function.
-
Instructions
- Turn the vector
student_status
into a factor and put this in a variable calledcategorical_student
- Print the variable
categorical_student
- Turn the vector
-
Answers
# a vector called student_status student_status <- c("student", "not student", "student", "not student") # turn student_status into a factor and save it in the variable categorical_student categorical_student<-factor(student_status) # print categorical_student to the console categorical_student
> student_status <- c("student", "not student", "student", "not student") > > # turn student_status into a factor and save it in the variable categorical_student > categorical_student<-factor(student_status) > > # print categorical_student to the console > categorical_student [1] student not student student not student Levels: not student student
3. Dataframes
You may remember the matrix, a multi-dimensional object that we discussed earlier. All the elements that you put in a matrix should be of the same type. However, when performing a market research survey, you often have questions such as:
- ‘Are your married?’ or ‘yes/no’ questions (= boolean data type)
- ‘How old are you?’ (= numeric data type)
- ‘What is your opinion on this product?’ or other ‘open-ended’ questions (= character data type)
The output, namely the respondents’ answers to the questions formulated above, is a data set of different data types. You will often find yourself working with data sets that contain different data types instead of only one. A data frame has the variables of a data set as columns and the observations as rows. This will be a familiar concept for those coming from different statistical software packages such as SAS or SPSS.
3.1 Inspecting dataframes
There are several functions you can use to inspect your dataframe. To name a few
-
head
: this by default prints the first 6 rows of the dataframe -
tail
: this by default prints the last 6 rows to the console -
str
: this prints the structure of your dataframe -
dim
: this by default prints the dimensions, that is, the number of rows and columns of your dataframe -
colnames
: this prints the names of the columns of your dataframe -
Instructions
- Print the first 6 rows of mtcars
- Print the structure of the dataframe mtcars
- Print the dimensions of the dataframe mtcars
-
Answers
# print the first 6 rows of mtcars head(mtcars) tail(mtcars) # print the structure of mtcars str(mtcars) # print the dimensions of mtcars dim(mtcars) colnames(mtcars)
> # print the first 6 rows of mtcars > head(mtcars) mpg cyl disp hp drat wt qsec vs am gear carb Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 > tail(mtcars) mpg cyl disp hp drat wt qsec vs am gear carb Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.7 0 1 5 2 Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.9 1 1 5 2 Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.5 0 1 5 4 Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.5 0 1 5 6 Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.6 0 1 5 8 Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.6 1 1 4 2 > # print the structure of mtcars > str(mtcars) 'data.frame': 32 obs. of 11 variables: $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ... $ cyl : num 6 6 4 6 8 6 8 4 4 6 ... $ disp: num 160 160 108 258 360 ... $ hp : num 110 110 93 110 175 105 245 62 95 123 ... $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ... $ wt : num 2.62 2.88 2.32 3.21 3.44 ... $ qsec: num 16.5 17 18.6 19.4 17 ... $ vs : num 0 0 1 1 0 1 0 1 1 1 ... $ am : num 1 1 1 0 0 0 0 0 0 0 ... $ gear: num 4 4 4 3 3 3 3 4 4 4 ... $ carb: num 4 4 1 1 2 1 4 2 2 4 ... > > # print the dimensions of mtcars > dim(mtcars) [1] 32 11 > colnames(mtcars) [1] "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear" [11] "carb"
3.2 Constructing a dataframe yourself
Since using built-in data sets is not even half the fun of creating your own data sets, the rest of this chapter is based on your personally developed data set.
As a first goal, you want to construct a data frame that describes the main characteristics of eight planets in our solar system. The main features of a planet are:
- The type of planet (Terrestrial or Gas Giant).
- The planet’s diameter relative to the diameter of the Earth.
- The planet’s rotation across the sun relative to that of the Earth.
- If the planet has rings or not (TRUE or FALSE).
You construct a data frame with the data.frame()
function. As arguments, you should provide the above mentioned vectors as input that should become the different columns of that data frame. Therefore, it is important that each vector used to construct a data frame has an equal length. But do not forget that it is possible (and likely) that they contain different types of data.
-
Instructions
- Use the function
data.frame()
to construct a data frame. Call this variableplanet_df
.
- Use the function
-
Answers
# planets vector planets <- c("Mercury", "Venus", "Earth", "Mars", "Jupiter", "Saturn", "Uranus", "Neptune") # type vector type <- c("Terrestrial planet", "Terrestrial planet", "Terrestrial planet", "Terrestrial planet", "Gas giant", "Gas giant", "Gas giant", "Gas giant") # diameter vector diameter <- c(0.382, 0.949, 1, 0.532, 11.209, 9.449, 4.007, 3.883) # rotation vector rotation <- c(58.64, -243.02, 1, 1.03, 0.41, 0.43, -0.72, 0.67) # rings vector rings <- c(FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE) # construct a dataframe planet_df from all the above variables planet_df<-data.frame(planets, type, diameter, rotation, rings) planet_df
> # planets vector > planets <- c("Mercury", "Venus", "Earth", "Mars", "Jupiter", "Saturn", "Uranus", "Neptune") > > # type vector > type <- c("Terrestrial planet", "Terrestrial planet", "Terrestrial planet", "Terrestrial planet", "Gas giant", "Gas giant", "Gas giant", "Gas giant") > > # diameter vector > diameter <- c(0.382, 0.949, 1, 0.532, 11.209, 9.449, 4.007, 3.883) > > # rotation vector > rotation <- c(58.64, -243.02, 1, 1.03, 0.41, 0.43, -0.72, 0.67) > > # rings vector > rings <- c(FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE) > > # construct a dataframe planet_df from all the above variables > planet_df<-data.frame(planets, type, diameter, rotation, rings) > planet_df planets type diameter rotation rings 1 Mercury Terrestrial planet 0.382 58.64 FALSE 2 Venus Terrestrial planet 0.949 -243.02 FALSE 3 Earth Terrestrial planet 1.000 1.00 FALSE 4 Mars Terrestrial planet 0.532 1.03 FALSE 5 Jupiter Gas giant 11.209 0.41 TRUE 6 Saturn Gas giant 9.449 0.43 TRUE 7 Uranus Gas giant 4.007 -0.72 TRUE 8 Neptune Gas giant 3.883 0.67 TRUE
3.3 Indexing and selecting columns from a dataframe
In the same way as you indexed your vectors, you can select elements from your dataframe using square brackets. Different from dataframes however, you now have multiple dimensions: rows and columns. That’s why you can use a comma in the middle of the brackets to differentiate between rows and columns. For instance, the following code planet_df[1,2]
would select the element in the first row and the second column from the dataframe planet_df
.
You can also use the $
operator to select an entire column from a dataframe. For instance, planet_df$planets
would select the entire planets column from the dataframe planet_df.
-
Instructions
- Select the elements in the first row, and the second and third column from
planet_df
- Select the entire third column from
planet_df
- Select the elements in the first row, and the second and third column from
-
Answers
# select the values in the first row and second and third columns planet_df[1,c(2,3)] # select the entire third column planet_df$diameter
> # select the values in the first row and second and third columns > planet_df[1,c(2,3)] type diameter 1 Terrestrial planet 0.382 > # select the entire third column > planet_df$diameter [1] 0.382 0.949 1.000 0.532 11.209 9.449 4.007 3.883
4. List
A list in R is similar to your to-do list at work or school: the different items on that list most likely differ in length, characteristic, type of activity that has to do be done.
A list in R allows you to gather a variety of objects under one name (that is, the name of the list) in an ordered way. These objects can be matrices, vectors, data frames, even other lists, etc. It is not even required that these objects are related to each other.
You can easily construct a list using the list()
function. In this function you can wrap the different elements like so: list(item1, item2, item3)
.
-
Instructions
- Put the objects
my_vector
,my_matrix
andmy_df
into a list calledmy_list
- Make sure to print
my_list
- Put the objects
-
Answers
# Vector with numerics from 1 up to 10 my_vector <- 1:10 # Matrix with numerics from 1 up to 9 my_matrix <- matrix(1:9, ncol = 3) # First 10 elements of the built-in data frame 'mtcars' my_df <- mtcars[1:10,] # Construct my_list with these different elements: my_list<-list(my_vector,my_matrix,my_df) # print my_list to the console
># Vector with numerics from 1 up to 10 >my_vector <- 1:10 > ># Matrix with numerics from 1 up to 9 >my_matrix <- matrix(1:9, ncol = 3) > > # First 10 elements of the built-in data frame 'mtcars' > my_df <- mtcars[1:10,] > > # Construct my_list with these different elements: > my_list<-list(my_vector,my_matrix,my_df) > > # print my_list to the console > my_list [[1]] [1] 1 2 3 4 5 6 7 8 9 10 [[2]] [,1] [,2] [,3] [1,] 1 4 7 [2,] 2 5 8 [3,] 3 6 9 [[3]] mpg cyl disp hp drat wt qsec vs am gear carb Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
4.1 Selecting elements from a list
Your list will often be built out of numerous elements and components. Therefore, getting a single element, multiple elements, or a component out of it is not always straightforward. One way to select a component is using the numbered position of that component. For example, to “grab” the first component of my_list
you type my_list[[1]]
Another way to check is to refer to the names of the components: my_list[["my_vector"]]
selects the my_vector
vector.
A last way to grab an element from a list is using the $
sign. The following code would select my_df
from my_list
: my_list$my_df
.
Besides selecting components, you often need to select specific elements out of these components. For example, with my_list[[1]][1]
you select from the first component of my_list
the first element. This would select the number 1.
-
Instructions
- Grab the second element of my_list and print it to the console
- Grab the first column of the third component of
my_list
and print it to the console
-
Answers
# Vector with numerics from 1 up to 10 my_vector <- 1:10 # Matrix with numerics from 1 up to 9 my_matrix <- matrix(1:9, ncol = 3) # First 10 elements of the built-in data frame 'mtcars' my_df <- mtcars[1:10,] # Construct list with these different elements: my_list <- list(my_vector, my_matrix, my_df) my_list # Grab the second element of my_list and print it to the console my_list[2] my_list[[2]] # Grab the first column of the third component of `my_list` and print it to the console my_list[3] my_list[[3]] my_list[[3]][,1]
> # Vector with numerics from 1 up to 10 > my_vector <- 1:10 > > # Matrix with numerics from 1 up to 9 > my_matrix <- matrix(1:9, ncol = 3) > > # First 10 elements of the built-in data frame 'mtcars' > my_df <- mtcars[1:10,] > > # Construct list with these different elements: > my_list <- list(my_vector, my_matrix, my_df) > my_list [[1]] [1] 1 2 3 4 5 6 7 8 9 10 [[2]] [,1] [,2] [,3] [1,] 1 4 7 [2,] 2 5 8 [3,] 3 6 9 [[3]] mpg cyl disp hp drat wt qsec vs am gear carb Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 > # Grab the second element of my_list and print it to the console > my_list[2] [[1]] [,1] [,2] [,3] [1,] 1 4 7 [2,] 2 5 8 [3,] 3 6 9 > my_list[[2]] [,1] [,2] [,3] [1,] 1 4 7 [2,] 2 5 8 [3,] 3 6 9 > > # Grab the first column of the third component of `my_list` and print it to the console > my_list[3] [[1]] mpg cyl disp hp drat wt qsec vs am gear carb Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 > my_list[[3]] mpg cyl disp hp drat wt qsec vs am gear carb Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 > my_list[[3]][,1] [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2
5. Function
5.1 Mean
1.So far we have seen many datatypes in R. The next thing to learn about is functions. We have already seen many functions when working with vectors, dataframes and lists. For instance, when making a list, we used the function list()
to make one.
In programming, functions are used to incorporate sets of instructions that we want to use repeatedly. A function is actually a piece of code written to carry out a specified task; it may accept arguments or parameters (or not) and it may return one or more values (or not!).
Let’s look at a pre-programmed function in R: mean
. To consult the R documentation on this function, you can use the following commands:
help(mean)
?mean
Try these commands out in the console. If you do so, you’ll be redirected to www.RDocumentation.org. If you would type this function into you R studio console, a help tab would automatically open in R studio.
There is another way of getting help on a function. For instance, if you want to know which parameters need to be provided, you can use the R function args
on the specified function. An example of using args
on a function is the following: args(mean)
-
Instructions
- Consult the documentation of the
mean()
function by typing?mean
into the console. Also type the code to get the documentation of themean()
function in your R script. - Consult the arguments that the
mean()
function takes by typingargs(mean)
into the console. Also type the code to get the arguments of themean()
function in your R script.
- Consult the documentation of the
-
Answers
# ask for help on the mean function ?mean # ask for the arguments used by the mean function args(mean)
> # ask for help on the mean function > ?mean > > # ask for the arguments used by the mean function > args(mean) function (x, ...) NULL
2.When getting help on the mean function, you saw that it takes an argument x. X here is just an arbitrary name for the object that you want to find the mean of. Usually this object will be an R vector. We also saw the ...
. This is called an elipsis and is used to provide a number of optional arguments to the function.
Remember that R can match arguments both by position and by name. Let’s say we want to find the mean of a vector called temperature. An example of matching by name is the following:
mean(x = temperature)
An example of matching by position is the following:
mean(temperature)
In this exercise, we have provided you with a vector of 5 numbers. There are the grades that you got during the semester.
- Instructions
- Calculate the mean of the vector
grades
using matching by position - Calculate the mean of the vector
grades
using matching by name
- Calculate the mean of the vector
-
Answer
# a grades vector grades <- c(8.5, 7, 9, 5.5, 6) # calculate the mean of grades using matching by name mean(x=grades) # calculate the mean of grades using matching by position mean(grades)
> # a grades vector > grades <- c(8.5, 7, 9, 5.5, 6) > > # calculate the mean of grades using matching by name > mean(x=grades) [1] 7.2 > > # calculate the mean of grades using matching by position > mean(grades) [1] 7.2
3.When we looked at the documentation of mean
. The documentation showed us the following method:
mean(x, trim = 0, na.rm = FALSE, ...)
其中,参数x为计算对象,可以是向量、矩阵、数组或数据框;trim用于设置计算均值前去掉两端数据的百分比,即计算结尾均值,取值在0~0.5之间;na.rm为逻辑值,指示是否允许有缺失值(NA)的情况,默认为FALSE(不允许)
As you can see, both trim
and na.rm
have default values. However, x
doesn’t. This makes x a required argument. That means that the function mean will throw an error if x hasn’t been specified. Trim and na.rm are however optional arguments with default values and can be changed or specified by the user.
Na.rm can be changed by the user if a given vector contains missing values. For instance, if a the aforementioned vector called temperature would have missing values, calling mean on it would throw an output of NA
. If you want the mean function to exclude the NA values when calculating the mean, you can specify na.rm = TRUE
. Let’s bring this into practice:
-
Instructions
- Calculate the mean of the grades vector without removing
NA
values. - Calculate the mean of the grades vector with removing
NA
values and observe the difference.
- Calculate the mean of the grades vector without removing
-
Answers
# a grades vector grades <- c(8.5, 7, 9, NA, 6) mean(x=grades) mean(grades, na.rm = TRUE) mean(grades, na.rm= TRUE, trim=0.3)
> # a grades vector > grades <- c(8.5, 7, 9, NA, 6) > > mean(x=grades) [1] NA > mean(grades, na.rm = TRUE) [1] 7.625 > mean(grades, na.rm= TRUE, trim=0.3) [1] 7.75
5.2 Making your own functions
During the last 3 exercises, we have been using existing functions. However, you can also write your own functions. You can define a function using function()
code chunk. For instance, look at the code below to see a function that takes 2 parameters, a
and b
, sums them, and returns them.
sum_a_b <- function(a, b){
return (a + b)
}
You could call this function and assign its result to the variable result
, using the following code: result = sum_a_b(4, 5)
This would put the value of 9 into the variable result
. Off you go; now it’s your turn !
-
Instructions
- Look at the construction of the function
sum_a_b
. Make a similar function that multiplies a and b and call thismultiply_a_b
- Call
multiply_a_b
with the valuesa = 3
andb = 7
and store the result into a variable calledresult
- Look at the construction of the function
-
Answers
# make a function called multiply_a_b multiply_a_b<-function(a,b){ return (a*b) } # call the function multiply_a_b and store the result into a variable result multiply_a_b(3,7)
> # make a function called multiply_a_b > multiply_a_b<-function(a,b){ return (a*b) } > > # call the function multiply_a_b and store the result into a variable result > multiply_a_b(3,7) [1] 21
6. Getting your data into R
6.1.One important thing before you actually do analyses on your data, is that you will need to get your data into R. R contains many functions to read in data from different formats. To name only a few:
read.table
: Reads in tabular data such as txt filesread.csv
: Read in data from a comma-separated file formatreadWorksheetFromFile
: Reads in an excel worksheetread.spss
: Reads in data from .sav SPSS format.
For the current exercise, we have put the R mtcars dataset into a csv file format and put this on github. The data can be found on the following link: http://s3.amazonaws.com/assets.datacamp.com/course/uva/mtcars.csv
-
Instructions
- Load in the mtcars dataset using the
read.csv
function. Your code should look something like: read.csv(“dataset_url”). Store the result into a variable calledcars
- Print the first 6 rows of the dataset using the
head
function
- Load in the mtcars dataset using the
-
Answers
# load in the data and store it in the variable cars cars<-read.csv("http://s3.amazonaws.com/assets.datacamp.com/course/uva/mtcars.csv") # print the first 6 rows of the dataset using the head() function head(cars)
> # load in the data and store it in the variable cars > cars<-read.csv("http://s3.amazonaws.com/assets.datacamp.com/course/uva/mtcars.csv") > # print the first 6 rows of the dataset using the head() function > head(cars) mpg cyl disp hp drat wt qsec vs am gear carb car 1 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 Mazda RX4 2 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 Mazda RX4 Wag 3 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 Datsun 710 4 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 Hornet 4 Drive 5 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 Hornet Sportabout 6 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 Valiant
6.2.In the last exercise, you just read in your first dataset. All you needed to specify was the “address” where the dataset could be found. However, sometimes data isn’t stored into the most convenient format. For instance, sometimes the separator that separates all the different cells is different than you would expect.
You can specify the separator in your read.csv
function using the sep
argument. By default, this argument for csv files is a comma. You can however easily change this to a tab by using the following code: sep = '\t'
. In the current exercise, you will be working with the following url:
http://s3.amazonaws.com/assets.datacamp.com/course/uva/mtcars_semicolon.csv
-
Instructions
- For this exercise we’ll be working with the same dataset found at the following url specified above. The dataset however has a “;” as separator. Load in the dataset by specifying the “;” as separator. Save it in the variable
cars
- Print the first 6 rows of cars to the console using the
head()
function
- For this exercise we’ll be working with the same dataset found at the following url specified above. The dataset however has a “;” as separator. Load in the dataset by specifying the “;” as separator. Save it in the variable
-
Answers
# load in the dataset cars<-read.csv("http://s3.amazonaws.com/assets.datacamp.com/course/uva/mtcars_semicolon.csv", sep=";") # print the first 6 rows of the dataset cars
> # load in the dataset > cars<-read.csv("http://s3.amazonaws.com/assets.datacamp.com/course/uva/mtcars_semicolon.csv", sep=";") > > # print the first 6 rows of the dataset > cars mpg cyl disp hp drat wt qsec vs am gear carb 1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 3 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 4 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 5 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 6 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 7 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 8 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 9 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 10 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 11 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4 12 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3 13 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3 14 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3 15 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4 16 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4 17 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4 18 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1 19 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2 20 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1 21 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1 22 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2 23 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2 24 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4 25 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2 26 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1 27 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2 28 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2 29 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4 30 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6 31 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8 32 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
6.3. get your own dataset from your computer
In the previous assignments, we practised reading in files in R. So far, all of these files were on the internet. However, if you would work with R studio on your own computer, you would probably like to read in local files.
When reading in local files, it’s good to have an idea what your working directory is. Your working directory is basically the part of your file system that R will look for files. Usually this is something along the lines of C:/Users/Username/documents. Of course this working directory is not static and can be changed by the user.
6.4.准备工作
在R中进行任何操作和分析工作之前,先需要读取数据。保存在工作目录中的数据可以直接读取,非工作目录的其他位置在读取时需要指明路径。因此第一步工作是了解R的工作目录。下面是具体的代码,输入getwd函数,R返回当前的工作目录。
#查看工作目录``getwd``()``[1] ``"C:/Users/Documents"
你也可以对R的工作目录进行更改,使用setwd函数可以更改工作目录的路径。下面是具体的代码。
#设置工作目录``setwd``(``"C:\\Users\\ r"``)
设置好工作目录后,开始读取数据,并创建数据表。我们的数据在工作目录下,因此直接读取并命名为loandata。
#读取并创建数据表``loandata=``data.frame``(``read.csv``(``'loan_data.csv'``,header = 1))
In R there are two important functions:
getwd()
: This function will retrieve the current working directory for the user-
setwd()
: This functions allows the user to set her own working directory -
Instructions
- Retrieve your current working directory
-
Answers
# retrieve the current working directory getwd()
> # retrieve the current working directory > getwd() [1] "/home/repl"
6.5. Checking files in your working directory
R has some great convenience functions for checking the files that exist in your current working directory. For instance, list.files()
lists all the files that exists in your working directory.
-
Instructions
- List all the files that exist in your working directory
- Read in the
cars.csv
file. Specify the separator to be a;
-
Answers
# list all the files in the working directory list.files() # read in the cars dataset and store it in a variable called cars cars<-data.frame(read.csv("cars.csv", sep=";")) # print the first 6 rows of cars head(cars)
> # list all the files in the working directory > list.files() [1] "cars.csv" > > # read in the cars dataset and store it in a variable called cars > cars<-data.frame(read.csv("cars.csv", sep=";")) > > # print the first 6 rows of cars > head(cars) mpg cyl disp hp drat wt qsec vs am gear carb 1 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 2 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 3 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 4 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 5 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 6 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
7. ggplot2
Although base R comes with a lot of useful functions, you will not be able to fully leverage the full power of R without being able to import R modules developed by others. Imagine we want to do some great plotting and we want to use ggplot2 for it. If we want to do so, we need to take 2 steps:
- Install the package ggplot2 using
install.packages("ggplot2")
- Load the package ggplot2 using
library(ggplot2)
orrequire(ggplot2)
In datacamp however, most packages are already installed and readily available. As such, you won’t need to run install.packages()
-
Instructions
- Load the ggplot2 package using the
library
function - Load the ggplot2 package using the
require
function
- Load the ggplot2 package using the
-
Answers
# load the ggplot2 package using the library function library(ggplot2) # load the ggplot2 package using the require function require(ggplot2) ?ggplot2