This notebook provides an introduction to programming in R. R is a programming language that is widely used for statistical analysis in research and industry. It is also an important tool to create data products, like presentations, (automated) reports, applications and software packages.
Alternatives to R include, for example, Python, Julia, MATLAB, … and their extensions.
Before we start, we would like to ask you about your background.
R itself comes with relatively little built-in functionality. Also, the standard GUI is not very appealing, as you can see in Figure 1.
2+2
[1] 4
R uses different classes of objects, e.g. lists, matrices, vectors, data frames…
Class | Example |
---|---|
numeric | 2.2,c(5,2) |
character | 'Hello' |
logical | TRUE |
list | list('Hello',5) |
matrix | matrix(5,3,2) |
You can generate a new object, e.g. a vector, with the assignment operator <- (or using =).
Create a vector a that collects all integers from 1 to 5. Use the R command c().
a <- c(1,2,3,4,5)
a
[1] 1 2 3 4 5
# Alternatively (shorter)
a <- c(1:5)
# Even shorter
a <- 1:5
Create a matrix M with 3 rows and 2 columns that lists all integers from 1 to 6.
M <- matrix(c(1,2,3,4,5,6), nrow=3, ncol=2) # define a 3x2 matrix
M
[,1] [,2]
[1,] 1 4
[2,] 2 5
[3,] 3 6
R supports many operations for vectors and matrices.
Standard calculation | Operator |
---|---|
Addition | + |
Subtraction | - |
Multiplication | * |
Division | / |
Exponentiation | ^ |
Divide a and M by 2.
a/2
[1] 0.5 1.0 1.5 2.0 2.5
M/2
[,1] [,2]
[1,] 0.5 2.0
[2,] 1.0 2.5
[3,] 1.5 3.0
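The operators in the table above work element by element. As a small illustration, the following sketch re-creates a and M and contrasts elementwise multiplication with the matrix product operator %*%, which is part of base R but is not listed in the table. The object names a_squared, M_times_M and cross are chosen only for this example.

```r
# Re-create the objects from the examples above
a <- 1:5
M <- matrix(1:6, nrow = 3, ncol = 2)

# Elementwise operations: each entry is treated separately
a_squared <- a^2      # 1 4 9 16 25
M_times_M <- M * M    # every entry of M squared, still a 3x2 matrix

# Matrix product: t(M) is 2x3 and M is 3x2, so the result is 2x2
cross <- t(M) %*% M
cross
```

Note that * and %*% give different results whenever both are defined, so it is worth checking the dimensions with dim() when in doubt.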
A strength of R is that many add-on packages are available. These packages extend the capabilities of R. A package is a file that contains a collection of related functions and variables. Most packages are hosted on CRAN, the Comprehensive R Archive Network, and can be installed in R/RStudio with the command install.packages().
# Install package with name "hdm"
install.packages("hdm")
To use the hdm package, you have to load the library via
library("hdm")
You can load data, e.g., from .csv files or R data files (.rda). You have to make sure that you set the right working directory. You can check and change the working directory with
# Check
getwd()
# Change working directory
setwd("INSERT_YOUR_PATH_HERE")
An example data file is available for download. If you work in the right directory (i.e., a directory with a subdirectory data, where you saved the data set), you can load the data with the load() command.
# It's good to start with a clean desk:
# Type the following command to remove all previously created objects
rm(list=ls())
# Load the data
load("data/data_counterfactual.rda")
# Type ls() to show all objects in your session
ls()
[1] "data_counterfactual"
Now, you can work with the data set called data_counterfactual. Try out the basic commands
dim(data_counterfactual)
summary(data_counterfactual)
plot(data_counterfactual$Y_a1)
table(data_counterfactual)
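The text above mentions .csv files, but only the .rda case is shown. The following is a minimal sketch of the .csv case using read.csv() from base R. The data frame toy and the temporary file are made up for this illustration so that the example runs without any download.

```r
# Hypothetical toy data, written to a temporary file so the example is self-contained
toy <- data.frame(id = 1:3, y = c(2.5, 3.1, 4.8))
csv_file <- tempfile(fileext = ".csv")
write.csv(toy, csv_file, row.names = FALSE)

# Read the .csv file back into a data frame
toy_csv <- read.csv(csv_file)

# The same basic commands as above work on the result
dim(toy_csv)
summary(toy_csv$y)
```

With your own data, you would replace csv_file by the path to your file, relative to the working directory.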
R has a comprehensive built-in help system. To get help for the function lm() provided by the stats package, which estimates a linear regression, you can use any of the following commands
- help.start(): general help and extensive introduction to R
- help(lm): help for function lm()
- ?lm: same result
- apropos("lm"): list all functions containing string "lm"
- ??lm: extensive search on all documents containing the string "lm"
- example(lm): show an example of function lm() (if available)
- RSiteSearch("lm"): search for lm in help manuals and archived mailing lists
After the short introduction above, you will find a more detailed version in this section.
We start by emptying the workspace and specifying the current directory.
rm(list=ls())
directory <- "YOUR_DIRECTORY"
setwd(directory)
R is based upon different classes of objects. An object is easily created using the assignment operator <- or the equals sign =.
x <- 5
x
[1] 5
y = 5
y
[1] 5
The most basic object type in R is a vector. A vector is an ordered collection of simple objects of the same type, for instance numbers or letters. A vector is created using the c() call. Even a scalar (i.e. a single number) is effectively stored as a vector of length 1.
# Vector of numbers
vector1 <- c(2,4,1,4,5)
vector1
[1] 2 4 1 4 5
# Vector of characters
vector2 <- c("a","b,","c","d")
vector2
[1] "a" "b," "c" "d"
In the case of strings (or characters, as they are called in R), the value to be stored must be enclosed in quotation marks. Otherwise, R looks for objects with the corresponding names:
# We declared x and y already
vector3 <- c(y,x)
vector3
[1] 5 5
# We haven't created objects called a,b,c,d before, this is why we get an error
vector4 <- c(a,b,c,d)
Error in eval(expr, envir, enclos): object 'a' not found
Logical statements can be collected in vectors, as well.
vector.logic = c(TRUE, FALSE, TRUE, FALSE)
vector.logic
[1] TRUE FALSE TRUE FALSE
You can use vectors to create a new vector, too. However, remember that the elements of a vector need to be of the same class. If they are not, R coerces them to a common class. In the second example below, the numeric elements become characters.
vector5 <- c(vector1, vector3)
vector5
[1] 2 4 1 4 5 5 5
vector6 <- c(vector1, vector2)
vector6
[1] "2" "4" "1" "4" "5" "a" "b," "c" "d"
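To make the coercion rule more concrete, here is a small sketch with three mixed vectors. The names mixed1, mixed2 and mixed3 are chosen only for this illustration. In base R, logical values are converted to numbers, and numbers (or logicals) are converted to characters when a character element is present.

```r
# Coercion follows the order logical -> numeric -> character
mixed1 <- c(TRUE, 2)      # the logical becomes numeric: 1 2
mixed2 <- c(2, "a")       # the number becomes a character: "2" "a"
mixed3 <- c(TRUE, "a")    # the logical becomes a character: "TRUE" "a"

# class() shows the common class R has chosen
class(mixed1)
class(mixed2)
class(mixed3)
```

If such a conversion is not intended, a list is usually the better container, since it keeps each element's class.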
A helpful way to create vectors in many applications is to use the functions rep() and seq(). rep() repeats an object as many times as specified and seq() creates a sequence of numbers.
# Repeat "a" five times
vector7 <- rep("a", 5)
vector7
[1] "a" "a" "a" "a" "a"
# Repeat 5 five times
vector8 <- rep(5,5)
vector8
[1] 5 5 5 5 5
# Repeat a vector 5 times
vector9 <- rep(vector3, 5)
vector9
[1] 5 5 5 5 5 5 5 5 5 5
# Create a sequence from 0 to 10
vector10 <- seq(0,10)
vector10
[1] 0 1 2 3 4 5 6 7 8 9 10
# Create a sequence from 0 to 20 in steps of 5
vector11 <- seq(0,20,5)
vector11
[1] 0 5 10 15 20
# Create a sequence from 5 to 15 in 7 steps
seq(5,15, length.out=7+1)
[1] 5.000000 6.428571 7.857143 9.285714 10.714286 12.142857 13.571429
[8] 15.000000
# A sequence of integers is also created easily as
vector12 <- 1:5
vector13 <- 5:-4
Moreover, empty vectors can be created by specifying the length and type of the vector.
empty.vector <- vector(mode="numeric", length=4)
empty.vector
[1] 0 0 0 0
empty.vector2 <- vector(mode = "character", length=2)
empty.vector2
[1] "" ""
More details can be found in the help files. You can access them by typing ?vector.
Basic R provides a set of operations that can be performed on vectors. If you want to access a particular element of a vector, this can be done with square brackets.
# Access the first element of vector1
vector1[1]
[1] 2
# Access element 4 of vector1
vector1[4]
[1] 4
It is possible to access more than one element at the same time. In this case you need to create a vector with an index of the elements you want to access.
# Remember we already created a vector called "vector1"
vector1
[1] 2 4 1 4 5
# Access element 1 and 4 of vector1
vector1[c(1,4)]
[1] 2 4
# The index vector can also be specified before
index = c(1,3,5)
vector1[index]
[1] 2 1 5
# It is also possible to declare which elements should NOT be accessed by -
vector1[-index]
[1] 4 4
Alternatively, it is also possible to use logical statements.
vector1[vector1<3]
[1] 2 1
vector1[vector1==4]
[1] 4 4
# "OR" operation with |
vector1[vector1==4 | vector1<2]
[1] 4 1 4
# "AND" operation with &
vector1[vector1>1 & vector1<4]
[1] 2
Sometimes it’s also helpful to state conditions on the values of a vector in order to select elements. For instance, the next command returns a logical vector that indicates, for each element, whether the condition is TRUE or FALSE.
vector1 > 2
[1] FALSE TRUE FALSE TRUE TRUE
Now let’s show the entries that satisfy the condition (as we did before).
vector1[vector1 > 2]
[1] 4 4 5
Sometimes we want to know the positions of the entries in the vector that satisfy the condition. We can use the which() function.
which(vector1 > 2)
[1] 2 4 5
Hence, we find that the 2nd, 4th, and 5th entries of the vector satisfy the condition (i.e., they are greater than two). To see this, remember what vector1 looks like.
vector1
[1] 2 4 1 4 5
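Logical vectors like the one produced by vector1 > 2 can also be summarised directly. The following sketch uses the base R functions sum(), any(), all() and which.max(); the object names cond, n_true, any_true, all_true and pos_max are only illustrative.

```r
vector1 <- c(2, 4, 1, 4, 5)
cond <- vector1 > 2

# TRUE counts as 1 and FALSE as 0, so sum() counts the matches
n_true   <- sum(cond)
# Is there at least one element that satisfies the condition?
any_true <- any(cond)
# Do all elements satisfy the condition?
all_true <- all(cond)
# Position of the (first) largest element
pos_max  <- which.max(vector1)
```

Here three elements exceed two, so n_true is 3, any_true is TRUE, all_true is FALSE, and the largest value sits at position 5.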
In R it is possible to calculate with vectors. Many mathematical functions that are applied to single numbers can also be applied to the elements of a vector.
# Power of 2
4^2
[1] 16
5^2
[1] 25
10^2
[1] 100
# Can also be applied to a vector with the same elements
# the operation is executed elementwise
c(4,5,10)^2
[1] 16 25 100
# Alternatively subtract a number from each of the elements
c(4,5,10) - 10
[1] -6 -5 0
R provides mathematical functions to calculate the sum, sum(), and the mean, mean(), of the elements of a vector. Moreover, length() returns the number of elements of a vector.
sum(vector1)
[1] 16
mean(vector1)
[1] 3.2
length(vector1)
[1] 5
It is also possible to put vectors together with append() or simply c(), to reverse a vector with rev(), and to extract its distinct elements with unique().
append(vector1, vector3)
[1] 2 4 1 4 5 5 5
rev(vector1)
[1] 5 4 1 4 2
unique(vector1)
[1] 2 4 1 5
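Since unique() returns the distinct values rather than their number, a short sketch of how to count them may be useful. It combines length() with unique() and also shows table() and sort() from base R; the names n_distinct, counts and sorted are chosen for this example only.

```r
vector1 <- c(2, 4, 1, 4, 5)

# Number of distinct values
n_distinct <- length(unique(vector1))

# Frequency of each distinct value
counts <- table(vector1)
counts

# Values in increasing order
sorted <- sort(vector1)
sorted
```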
Vectors are probably the most important objects in R. However, in most analyses the data are multivariate, i.e. they comprise more than one variable. First, let us consider matrices.
# By default matrices are `filled' up by columns
mat1 = matrix(c("a","b","c","d"), ncol = 2, nrow = 2)
mat1
[,1] [,2]
[1,] "a" "c"
[2,] "b" "d"
# They can also be `filled' up by rows
mat2 = matrix(c("a","b","c","d"), ncol = 2, nrow = 2, byrow = TRUE)
mat2
[,1] [,2]
[1,] "a" "b"
[2,] "c" "d"
# Check that mat1 and the transposed mat2 are the same
mat1 == t(mat2) # (elementwise comparison)
[,1] [,2]
[1,] TRUE TRUE
[2,] TRUE TRUE
A matrix has two dimensions, rows and columns. These can be accessed with square brackets, where the first entry in brackets refers to rows and the second entry (separated by a ,) to columns.
# First row of matrix 1
mat1[1,]
[1] "a" "c"
# First column of matrix 1
mat1[,1]
[1] "a" "b"
A row or a column of a matrix is a vector, again.
# To check this
class(mat1[1,])
[1] "character"
class(mat1[,1])
[1] "character"
And, the other way around… Matrices can be constructed from binding vectors column-wise or row-wise together.
mat2 = cbind(vector1, rev(vector1))
mat2
vector1
[1,] 2 5
[2,] 4 4
[3,] 1 1
[4,] 4 4
[5,] 5 2
mat3 = rbind(vector1, rev(vector1))
mat3
[,1] [,2] [,3] [,4] [,5]
vector1 2 4 1 4 5
5 4 1 4 2
Matrices have two dimensions. However, one might think of more dimensions than rows and columns. For instance, imagine you want to draw random numbers, store them in a matrix, and repeat this 4 times. Then arrays are useful.
# Create an empty array with 3 dimensions, for instance
# an arrangement of 4 2x2 matrices
arr1 <- array(NA, dim = c(2,2,4))
arr1
, , 1
[,1] [,2]
[1,] NA NA
[2,] NA NA
, , 2
[,1] [,2]
[1,] NA NA
[2,] NA NA
, , 3
[,1] [,2]
[1,] NA NA
[2,] NA NA
, , 4
[,1] [,2]
[1,] NA NA
[2,] NA NA
# Say you want to draw four random numbers that you store in a matrix
# and you want to repeat this 4 times. The four numbers drawn at the same
# time are stored in one matrix
# A seed is set to make this example reproducible
set.seed(1234)
# Note: You have to run the two previous lines of code before!
# We can implement this in a loop (although this is not the most elegant way)
for (i in 1:dim(arr1)[3]){
  arr1[,,i] <- rnorm(4)
}
arr1
, , 1
[,1] [,2]
[1,] -1.2070657 1.084441
[2,] 0.2774292 -2.345698
, , 2
[,1] [,2]
[1,] 0.4291247 -0.5747400
[2,] 0.5060559 -0.5466319
, , 3
[,1] [,2]
[1,] -0.5644520 -0.4771927
[2,] -0.8900378 -0.9983864
, , 4
[,1] [,2]
[1,] -0.77625389 0.9594941
[2,] 0.06445882 -0.1102855
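The comment in the code above notes that the loop is not the most elegant way. One possible alternative, sketched here, is to draw all 16 numbers at once and let array() arrange them. Arrays are filled in the same order as the loop fills them (first matrix first, column by column), so with the same seed both versions should give the same result. The names arr_loop and arr_direct are chosen only for this comparison.

```r
# Loop version, as in the example above
set.seed(1234)
arr_loop <- array(NA, dim = c(2, 2, 4))
for (i in 1:dim(arr_loop)[3]) {
  arr_loop[, , i] <- rnorm(4)
}

# Direct version: draw all numbers in one call and reshape them
set.seed(1234)
arr_direct <- array(rnorm(16), dim = c(2, 2, 4))

# Compare the two arrays
identical(arr_loop, arr_direct)
```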
Another common object type in R is a list. The great advantage of a list is that it allows you to collect items of different classes.
mylist <- list("hello", 1, arr1, vector1, mat1)
mylist
[[1]]
[1] "hello"
[[2]]
[1] 1
[[3]]
, , 1
[,1] [,2]
[1,] -1.2070657 1.084441
[2,] 0.2774292 -2.345698
, , 2
[,1] [,2]
[1,] 0.4291247 -0.5747400
[2,] 0.5060559 -0.5466319
, , 3
[,1] [,2]
[1,] -0.5644520 -0.4771927
[2,] -0.8900378 -0.9983864
, , 4
[,1] [,2]
[1,] -0.77625389 0.9594941
[2,] 0.06445882 -0.1102855
[[4]]
[1] 2 4 1 4 5
[[5]]
[,1] [,2]
[1,] "a" "c"
[2,] "b" "d"
A very useful property of lists is that the collected items can be accessed by their names via the $ operator.
# First let us assign names to the elements of the list
names(mylist) = c("Word", "Number", "Array", "Vector", "Matrix")
# Now we can call the elements by their names
mylist$Word
[1] "hello"
mylist$Matrix
[,1] [,2]
[1,] "a" "c"
[2,] "b" "d"
Alternatively, you can access the elements of a list via their position using [[]].
mylist[[1]]
[1] "hello"
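A common source of confusion is the difference between single and double square brackets for lists. The sketch below uses a smaller, made-up list (named mylist_small here to avoid overwriting the list above): single brackets return a sub-list, double brackets return the element itself.

```r
# A small named list, for illustration only
mylist_small <- list(Word = "hello", Number = 1, Vector = c(2, 4, 1, 4, 5))

# Single brackets keep the list structure
sub_list <- mylist_small["Word"]
class(sub_list)

# Double brackets (or $) extract the element itself
element <- mylist_small[["Word"]]
class(element)

# Single brackets can select several elements at once
first_two <- mylist_small[1:2]
length(first_two)
```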
Let us now turn to data frames, the class of object that is probably used most frequently in applied analyses. Data frames share the benefits of both matrices, i.e. the rows and columns of a data set can be accessed via square brackets, and lists, i.e. it is possible to use the $ operator and exploit the naming of variables. Moreover, the different variables collected in a data frame can be of different classes, e.g. one characteristic is a numeric variable (e.g. age) and another is categorical (e.g. occupation). A data frame can be constructed from a matrix or from vectors.
# Age, occupation and income
age = c(38, 20, 45, 20)
occ = c("bus driver", "barkeeper", "nurse", "student")
inc = c(2000, 1400, 1400, 800)
# stringsAsFactors = TRUE stores occ as a factor, as in the output shown below
# (since R 4.0.0 the default is FALSE)
df = data.frame(age, occ, inc, stringsAsFactors = TRUE)
df
age occ inc
1 38 bus driver 2000
2 20 barkeeper 1400
3 45 nurse 1400
4 20 student 800
# A matrix would not work here (try!)
#matrix = as.matrix(cbind(age, occ))
# Now variables can be accessed via the dimension operators
df[1,]
age occ inc
1 38 bus driver 2000
df[2,]
age occ inc
2 20 barkeeper 1400
df[,1]
[1] 38 20 45 20
df[,2]
[1] bus driver barkeeper nurse student
Levels: barkeeper bus driver nurse student
# Or via the $ operator
df$age
[1] 38 20 45 20
df$occ
[1] bus driver barkeeper nurse student
Levels: barkeeper bus driver nurse student
# It is also possible to access the data via logical statements
df[df$age <40,]
age occ inc
1 38 bus driver 2000
2 20 barkeeper 1400
4 20 student 800
df[df[,3]>1000,]
age occ inc
1 38 bus driver 2000
2 20 barkeeper 1400
3 45 nurse 1400
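Beyond selecting rows and columns, two frequent tasks are adding a variable and sorting a data frame. The following sketch rebuilds the example data and uses the $ operator, order() and str() from base R. The new column name high_inc and the object df_sorted are made up for this illustration.

```r
# Rebuild the example data frame
df <- data.frame(
  age = c(38, 20, 45, 20),
  occ = c("bus driver", "barkeeper", "nurse", "student"),
  inc = c(2000, 1400, 1400, 800)
)

# Add a new logical variable via the $ operator
df$high_inc <- df$inc > 1000

# Sort the rows by age; order() returns the row positions in sorted order
df_sorted <- df[order(df$age), ]
df_sorted

# Compact overview of the variables and their classes
str(df)
```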
Loops are an integral part of basic programming. A loop is a sequence of instructions that is repeated until a certain condition is reached. For instance, one might be interested in the cumulative maximum of the square of every element in the following vector v.
v <- c(1,4,2,10,5)
For our task “compute the cumulative maximum of the square of every element in the vector v”, the loop might look like
# Begin the loop with a `for` statement and choose running index with starting
# and ending value
for (i in 1:length(v)) {
  # For each element in the vector, compute the square
  square <- v[i]^2

  # Compute the cumulative maximum
  if (i == 1) {
    # The very first element in the loop is the
    # cumulative maximum by definition
    maxsq = square
  }
  if (i > 1) {
    # Cumulative maximum for the
    # second, third, ...., last element in
    # the loop
    maxsq <- max(square, maxsq)
  }
  # Print (i.e. show) the cumulative maximum
  print(maxsq)
}
[1] 1
[1] 16
[1] 16
[1] 100
[1] 100
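The loop above prints its results but does not store them, and the index 1:length(v) would run over c(1, 0) if v were empty. The following sketch shows one possible variant that stores the results in a pre-allocated vector and uses seq_along(), which yields an empty index for an empty vector. The function name cum_max_sq is hypothetical and only used here.

```r
# Hypothetical helper: cumulative maximum of the squares, computed in a loop
cum_max_sq <- function(v) {
  out <- numeric(length(v))       # pre-allocate the result vector
  for (i in seq_along(v)) {       # safe even if v has length 0
    square <- v[i]^2
    # The first square is the cumulative maximum by definition;
    # afterwards compare with the previous cumulative maximum
    out[i] <- if (i == 1) square else max(square, out[i - 1])
  }
  out
}

cum_max_sq(c(1, 4, 2, 10, 5))
cum_max_sq(numeric(0))
```

For the vector v from above, the function returns 1, 16, 16, 100, 100, which matches the printed values of the loop.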
“A function is a collection of statements that you can execute wherever and whenever you want” (Langtangen 2011, 1:93). Functions are central to statistical programming as they allow you to perform any kind of operation that takes an object as input and returns another object as output. The user can define the operation themselves. Suppose you want to implement the sum of squares of two numbers, i.e., doing something like the following
2^2 + 4^2
[1] 20
You can write a function, which takes any two values as input and returns the sum of the squared input variables as output.
# x and y are the input objects
myfunc <- function(x, y){
  # this is the operation we want to implement
  z = x^2 + y^2
  # this is the output of the function
  return(z)
}
# Apply function to test if it does what we want it to do
myfunc(2, 4)
[1] 20
Although the function implementation might be a bit artificial in this example, the great advantage of the self-written function is that it can now be applied to any input values. Thereby, we know that nothing else will happen than what we told the function to do (except for programming errors!).
# Run the function for another pair of values
myfunc(4, 6)
[1] 52
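Functions can also have default values for their arguments and can check their inputs. As a minimal sketch, the following generalises the sum of squares to an arbitrary power with a default of 2. The function name sum_of_powers and the argument power are made up for this illustration; stopifnot() is a base R function that raises an error if a condition is not met.

```r
# Hypothetical generalisation of the function above
sum_of_powers <- function(x, y, power = 2) {
  # Stop with an error if any input is not numeric
  stopifnot(is.numeric(x), is.numeric(y), is.numeric(power))
  x^power + y^power
}

# Uses the default power = 2, so this is the sum of squares
sum_of_powers(2, 4)

# The default can be overridden by naming the argument
sum_of_powers(2, 4, power = 3)
```

The first call returns 20 as before, and the second returns 2^3 + 4^3 = 72.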
R is a vectorized language, i.e. most functions can be applied not only to a single object but be executed for a collection of them. As in the examples above, it is possible to compute not only the square of a single number, but to calculate it for all elements of a vector.
In some cases, a loop is not the best implementation. For instance, writing a loop from scratch every time is error-prone, and loops can be slow in terms of computation time. The task we want to execute can be implemented in an alternative way which makes use of vectorization in R. Thereby, you can take away two R lessons:
Vectorization in R means that a function can be applied to a whole vector of elements. For instance, v^2 computes the square of each element of the vector, which makes it possible to replace the first line in the loop.
# Use vectorized function instead of a loop
v2 <- v^2
Moreover, basic R provides a function to compute a cumulative maximum. Therefore, there is no need to use a loop at all. Using implemented R functions is helpful since they can help you to save time and to avoid mistakes. Hence, doing some research for common problems can be very rewarding in many cases.
cummax(v2)
[1] 1 16 16 100 100
# or, equivalently
cummax(v^2)
[1] 1 16 16 100 100
apply() and lapply()
In many cases, the task to be implemented in an R program/function is more complicated than the previous example. The increased complexity already comes from using objects more complex than vectors, like matrices or lists.
R provides the apply() family of functions, which allows you to apply a function to a collection of objects. Suppose, now, we want to compute the cumulative maximum of the squares not only for one vector but for 5 vectors.
u <- c(5, 8, 1, 0, 2)
w <- c(1, 2, 1, 5, 1)
y <- c(8, 9, 1, 9, 8)
z <- c(10, 10, 10, 12, 10)
The function apply() makes it possible to execute a function along one dimension of a matrix, say on all columns. We could implement our task as follows.
# Bind the vectors to a matrix (the vectors correspond to the columns)
mat <- cbind(u, v, w, y, z)
# apply the function to each column in the matrix (MARGIN = 2)
apply(mat, 2, function(x) cummax(x^2))
u v w y z
[1,] 25 1 1 64 100
[2,] 64 16 4 81 100
[3,] 64 16 4 81 100
[4,] 64 100 25 81 144
[5,] 64 100 25 81 144
Alternatively, we can redefine our self-written function myfunc() and call it via apply(), too.
# x is the input object here
myfunc <- function(x){
  # this is the operation we want to implement
  y <- cummax(x^2)
  # this is the output of the function
  return(y)
}
# Apply the redefined self-written function myfunc()
apply(mat, 2, myfunc)
u v w y z
[1,] 25 1 1 64 100
[2,] 64 16 4 81 100
[3,] 64 16 4 81 100
[4,] 64 100 25 81 144
[5,] 64 100 25 81 144
Alternatively, one can bind the vectors together as the rows of a matrix and apply the function to the rows. We obtain the same results.
# Bind the vectors to a matrix (the vectors correspond to the rows)
mat = rbind(u, v, w, y, z)
# apply the function to each row in the matrix (MARGIN = 1)
apply(mat, 1, function(x) cummax(x^2))
u v w y z
[1,] 25 1 1 64 100
[2,] 64 16 4 81 100
[3,] 64 16 4 81 100
[4,] 64 100 25 81 144
[5,] 64 100 25 81 144
Combining the vectors into a matrix works in this case. However, a problem arises when the vectors have different lengths. Suppose that, in addition, we have a sixth vector x, which is longer than the others. Then, the matrix approach would fail without an appropriate extension of the other vectors. A solution is to collect the vectors in a list and execute our function via lapply().
x = c(1,2,3,5,6,1,5,6)
list1 = list(u, v, w, y, z, x)
# lapply() executes the function for each element of a list
lapply(list1, function(x) cummax(x^2))
[[1]]
[1] 25 64 64 64 64
[[2]]
[1] 1 16 16 100 100
[[3]]
[1] 1 4 4 25 25
[[4]]
[1] 64 81 81 81 81
[[5]]
[1] 100 100 100 144 144
[[6]]
[1] 1 4 9 25 36 36 36 36
Alternatives to lapply(): Map(), mapply(), sapply()
An alternative and equivalent implementation can be achieved using the Map() function or, similarly, the mapply() function
Map(function(x) cummax(x^2), list1)
[[1]]
[1] 25 64 64 64 64
[[2]]
[1] 1 16 16 100 100
[[3]]
[1] 1 4 4 25 25
[[4]]
[1] 64 81 81 81 81
[[5]]
[1] 100 100 100 144 144
[[6]]
[1] 1 4 9 25 36 36 36 36
mapply(function(x) cummax(x^2), list1)
[[1]]
[1] 25 64 64 64 64
[[2]]
[1] 1 16 16 100 100
[[3]]
[1] 1 4 4 25 25
[[4]]
[1] 64 81 81 81 81
[[5]]
[1] 100 100 100 144 144
[[6]]
[1] 1 4 9 25 36 36 36 36
The Map() function allows for a yet more general way of vectorization using multiple arguments.
For instance, suppose we are given a second list of the same length. For each of the six elements, we would like to know the maximum of the cumulative maximum of the squared entries across the two lists.
list2 = list(c(1:5), c(5:8), c(10:-1), c(0:5), c(2:8), c(9:4))
mapply(function(x,y) max(cummax(x^2), cummax(y^2)), list1, list2)
[1] 64 100 100 81 144 81
# Check
Map(function(x) cummax(x^2), list1)
[[1]]
[1] 25 64 64 64 64
[[2]]
[1] 1 16 16 100 100
[[3]]
[1] 1 4 4 25 25
[[4]]
[1] 64 81 81 81 81
[[5]]
[1] 100 100 100 144 144
[[6]]
[1] 1 4 9 25 36 36 36 36
Map(function(x) cummax(x^2), list2)
[[1]]
[1] 1 4 9 16 25
[[2]]
[1] 25 36 49 64
[[3]]
[1] 100 100 100 100 100 100 100 100 100 100 100 100
[[4]]
[1] 0 1 4 9 16 25
[[5]]
[1] 4 9 16 25 36 49 64
[[6]]
[1] 81 81 81 81 81 81
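To see how Map() and mapply() differ when used with two arguments, consider the following reduced sketch with two short, made-up lists (list_a and list_b are not part of the example above). Map() always returns a list, whereas mapply() simplifies the result to a vector when each call returns a single value. Both keep the names of the first list.

```r
# Two small illustrative lists of equal length
list_a <- list(first = c(1, 5, 2), second = c(3, 3))
list_b <- list(c(4, 1, 1), c(0, 6))

# Map() passes the i-th elements of both lists to the function
# and returns a list
pair_max <- Map(function(x, y) max(x^2, y^2), list_a, list_b)
pair_max

# mapply() does the same but simplifies the result to a named vector
pair_max_vec <- mapply(function(x, y) max(x^2, y^2), list_a, list_b)
pair_max_vec
```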
The function vapply() provides the opportunity to execute a specified function on the elements of a vector. However, in our example this would not simplify our code. Vectorization provides elegant and efficient implementations, which pays off with large data sets. Moreover, vectorization paves the way for parallelization, i.e. executing the task on a number of nodes/cores at the same time. For instance, the R package parallel provides parallelized versions of lapply (mclapply) and Map (mcMap).
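As a brief illustration of sapply() and vapply(), the following sketch computes a single number per list element, namely the largest squared entry. The list list_small is a shortened, made-up version of the list used above. sapply() simplifies the result to a vector automatically, while vapply() additionally requires the expected type and length of each result via FUN.VALUE, which makes the code more robust.

```r
# A shortened list for illustration
list_small <- list(c(5, 8, 1, 0, 2), c(1, 4, 2, 10, 5), c(1, 2, 3, 5, 6, 1, 5, 6))

# sapply() returns a numeric vector because every result has length 1
max_sq_s <- sapply(list_small, function(x) max(x^2))
max_sq_s

# vapply() checks that every result is a single numeric value
max_sq_v <- vapply(list_small, function(x) max(x^2), FUN.VALUE = numeric(1))
max_sq_v
```

Both calls return 64, 100 and 36 for the three vectors.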