4.5 Data Structures

4.5.1 Vectors

Data structures in R (and other languages) are ways of storing and organizing more than one value together. The most basic data structure in R is a one dimensional sequence of values called a vector:

# the c() function creates a vector
x <- c(1,2,3)
x # typing the name of the vector prints its values
[1] 1 2 3

Vectors in R have a special property: all values contained in a vector must have the same type, from the list described above. When constructing a vector from values of different types, R will coerce all of them to the most general type present:

c(1,2,"3")
[1] "1" "2" "3"
c(1,2,TRUE,FALSE)
[1] 1 2 1 0
c(1,2,NA) # note missing values stay missing
[1]  1  2 NA
c("1",2,NA,NaN) # NA stays, NaN is cast to a character type
[1] "1"   "2"   NA    "NaN"
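
To check which type R settled on after coercion, the base function typeof() can be applied to the result. The short sketch below is only an illustration; the variable names mixed_chr and mixed_num are made up for this example:

```r
# values of mixed types are coerced to the most general type present
mixed_chr <- c(1, 2, "3")   # numbers and a string
mixed_num <- c(1, 2, TRUE)  # numbers and a logical

typeof(mixed_chr)  # "character"
typeof(mixed_num)  # "double"

# an explicit conversion turns the strings back into numbers
as.numeric(mixed_chr)  # 1 2 3
```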

In addition to having a single type, vectors also have a length, which is defined as the number of elements in the vector:

x <- c(1,2,3)
length(x)
[1] 3

Internally, R operates on whole vectors much more efficiently than on individual elements one at a time. With numeric vectors, you can perform arithmetic operations on vectors of compatible size just as easily as on individual values:

c(1,2) + c(3,4)
[1] 4 6
c(1,2) * c(3,4)
[1] 3 8
c(1,2) * c(3,4,5) # lengths 2 and 3 do not divide evenly: R warns, but still returns a result
[1] 3 8 5
Warning message:
In c(1, 2) * c(3, 4, 5) :
  longer object length is not a multiple of shorter object length

In the example above, we multiplied a vector of length 2 with a vector of length 3:

c(1,2) * c(3,4,5) # lengths 2 and 3 do not divide evenly: R warns, but still returns a result
[1] 3 8 5
Warning message:
In c(1, 2) * c(3, 4, 5) :
  longer object length is not a multiple of shorter object length

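Rather than raising an error and aborting, R merely emits a warning that the length of the longer vector is not a multiple of the length of the shorter one. So how did R decide the third value should be 5? R recycles the shorter vector, starting again from its first value whenever it runs out, and multiplies element-wise until every value of the longer vector has been used. When the longer length is an exact multiple of the shorter one, as in the second example below, no warning is emitted at all: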

c(1,2) * c(3,4,5) # yields: 1*3 2*4 1*5
[1] 3 8 5
Warning message:
In c(1, 2) * c(3, 4, 5) :
  longer object length is not a multiple of shorter object length
c(1,2) * c(3,4,5,6) # yields: 1*3 2*4 1*5 2*6
[1]  3  8  5 12
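
The recycling behavior can be made explicit with the base function rep_len(), which repeats a vector until it reaches a requested length. This is a small sketch of the idea rather than a description of R's internals; the names short, long, and short_recycled are chosen only for this example:

```r
short <- c(1, 2)
long <- c(3, 4, 5, 6)

# repeat the shorter vector until it is as long as the longer one
short_recycled <- rep_len(short, length(long))
short_recycled  # 1 2 1 2

# the explicit version gives the same result as the implicit one
identical(short * long, short_recycled * long)  # TRUE
```

Writing the recycling out this way can make code easier to read when the two lengths are intentionally different.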

R will sometimes work in ways you don’t expect. Be careful to read warnings and check that your code does what you expect!

4.5.2 Matrices

A matrix in R is simply the 2 dimensional version of a vector. That is, it is a rectangle of values that all have the same type, e.g. number, character, logical, etc. A matrix may be constructed using the vector notation described above and specifying the number of rows and columns the matrix should have. Rather than being described by a length alone like a vector, it has \(m \times n\) dimensions:

# create a matrix with two rows and three columns containing numbers
# byrow = TRUE fills the matrix one row at a time
A <- matrix(c(1,2,3,4,5,6),
       nrow = 2, ncol = 3, byrow = TRUE
      )
A
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
dim(A) # the dim function returns the dimensions of the matrix, rows first
[1] 2 3
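
As a complement to the example above, the sketch below shows what happens when byrow is left at its default: matrix() fills the values column by column instead. It also shows the [<row>,<column>] indexing that matrices support. The names values and B are chosen only for this example:

```r
values <- c(1, 2, 3, 4, 5, 6)

# without byrow = TRUE, matrix() fills column by column
B <- matrix(values, nrow = 2, ncol = 3)

B[1, ]     # first row: 1 3 5
B[, 2]     # second column: 3 4
B[2, 3]    # single element in row 2, column 3: 6
length(B)  # total number of elements: 6
```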

Because a matrix is 2 dimensional, it can be transposed from \(m \times n\) to be \(n \times m\) using the t() function:

# A defined above as a 2 x 3 matrix
t(A)
     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6
dim(t(A))
[1] 3 2
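
One way to convince yourself of what t() does is to compare individual elements before and after transposing. The following is a minimal check, with A redefined here so the snippet runs on its own:

```r
A <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 2, ncol = 3, byrow = TRUE)

# element [i, j] of A becomes element [j, i] of t(A)
A[1, 3] == t(A)[3, 1]  # TRUE

# transposing twice restores the original matrix
identical(t(t(A)), A)  # TRUE
```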

4.5.3 Lists and data frames

Vectors and matrices have the special property that all items must be of the same type, e.g. numbers. Lists and data frames are data structures that do not have this requirement. Similar to vectors, lists and data frames are both one dimensional sequences of values, but the values can be of mixed types. For instance, the first item of a list may be a vector of numbers, while the second is a vector of character strings. These are the most flexible data structures in R, and are among the most commonly used.

Lists can be created using the list() function:

my_list <- list(
  c(1,2,3),
  c("A","B","C","D")
)
my_list
[[1]]
[1] 1 2 3

[[2]]
[1] "A" "B" "C" "D"
my_list[[1]] # access the first item of the list
[1] 1 2 3
my_list[[2]] # access the second item of the list
[1] "A" "B" "C" "D"

The arguments passed to list() define the values of the list and their order. In the above example, the list has two elements: one vector of 3 numbers and one vector of 4 character strings. Note that you can access individual items of the list using the [[N]] syntax, where N is the 1-based index of the element.
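
A common source of confusion is the difference between double and single brackets on a list. The sketch below, which redefines my_list so it runs on its own, illustrates that [[N]] extracts the element itself while [N] returns a shorter list that still wraps it:

```r
my_list <- list(c(1, 2, 3), c("A", "B", "C", "D"))

# [[1]] extracts the element itself, here a numeric vector
class(my_list[[1]])  # "numeric"

# [1] keeps the list wrapper and returns a list of length one
class(my_list[1])    # "list"
length(my_list[1])   # 1

# indexing can be chained: third value of the second element
my_list[[2]][3]      # "C"
```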

Lists can also be defined and indexed by name:

my_list <- list(
  numbers=c(1,2,3),
  categories=c("A","B","C","D")
)
my_list
$numbers
[1] 1 2 3

$categories
[1] "A" "B" "C" "D"
my_list$numbers # access the first item of the list
[1] 1 2 3
my_list$categories # access the second item of the list
[1] "A" "B" "C" "D"

The elements of the list have been assigned the names numbers and categories when creating the list, though any valid R identifier names can be used. When elements are associated with names they can be accessed using the list$name syntax.
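
Names can also be inspected and used programmatically. The following sketch uses only base R; the variable wanted and the element name flags are invented for this illustration:

```r
my_list <- list(numbers = c(1, 2, 3), categories = c("A", "B", "C", "D"))

names(my_list)     # "numbers" "categories"

# [[ ]] accepts a name as a string, which helps when the name is stored in a variable
wanted <- "categories"
my_list[[wanted]]  # "A" "B" "C" "D"

# assigning to a new name appends an element to the list
my_list$flags <- c(TRUE, FALSE)
length(my_list)    # 3
```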

Lists and data frames share the same underlying data structure, but differ in one important respect: the elements of a data frame must all have the same length, while the elements of a list need not. You may create a data frame with the data.frame() function:

my_df <- data.frame( # recall '.' has no special meaning in R
  numbers=c(1,2,3),
  categories=c("A","B","C","D")
)
Error in data.frame(c(1, 2, 3), c("A", "B", "C", "D")) :
  arguments imply differing number of rows: 3, 4
my_df <- data.frame(
  numbers=c(1,2,3),
  categories=c("A","B","C")
)
my_df
  numbers categories
1       1          A
2       2          B
3       3          C
my_df$numbers
[1] 1 2 3
my_df[1] # numeric indexing also works, and returns a subset data frame
  numbers
1       1
2       2
3       3
my_df[1]$numbers
[1] 1 2 3
# this syntax is [<row>,<column>]; if either is omitted, all rows or all columns are returned
my_df[,1] # return all rows of the first column as a vector
[1] 1 2 3
my_df$categories
[1] "A" "B" "C"
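
Because a data frame is built on a list of equal-length columns, several list functions work on it directly. The sketch below redefines my_df so it runs on its own and uses only base R functions:

```r
my_df <- data.frame(numbers = c(1, 2, 3), categories = c("A", "B", "C"))

is.list(my_df)  # TRUE: a data frame is built on a list of columns
length(my_df)   # 2: length counts columns, as it would count list elements
dim(my_df)      # 3 2: rows first, then columns
nrow(my_df)     # 3

# list-style [[ ]] access by name returns the column itself
my_df[["categories"]]
```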

Note that the data frame is printed like a matrix, with the element names as column headers and automatically numbered rows. You may access specific elements of a data frame in a number of ways:

my_df$numbers[1] # extract the first value of the numbers column
[1] 1
my_df[1,1] # same as above, recall the [<row>,<column>] syntax
[1] 1
my_df$categories[3] # extract the third value of the categories column
[1] "C"
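
Beyond single positions, the brackets also accept vectors of positions, negative positions, and logical conditions. The sketch below shows a few of these forms under the same my_df definition as above; the name big is chosen only for this example:

```r
my_df <- data.frame(numbers = c(1, 2, 3), categories = c("A", "B", "C"))

my_df$numbers[c(1, 3)]            # positions 1 and 3: 1 3
my_df$numbers[-1]                 # everything except position 1: 2 3
my_df$numbers[my_df$numbers > 1]  # values where the condition is TRUE: 2 3

# a logical condition in the row slot keeps the matching rows and all columns
big <- my_df[my_df$numbers > 1, ]
nrow(big)                         # 2
```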

The operation shown in the examples above, extracting different parts of a vector, matrix, list, or data frame, is called subsetting. R provides many different ways to subset a data structure, and discussing all of them is beyond the scope of this book. However, mastering subsetting will help your code be more concise and correct. See the Read More link on Subsetting below:

Advanced R - Subsetting