4.4 Basic Types of Values

The most common type of value in R is the number, e.g. 1.0 or 1e-5 for \(10^{-5}\). For most practical purposes, R does not distinguish between numbers with fractional parts (e.g. 1.123) and integers (e.g. 1); a number is a number. In addition to numbers, there are some other types of values that are special in R:

  • logical or boolean values - TRUE or FALSE. Internally, R stores TRUE as the number 1 and FALSE as the number 0. Generally, R interprets non-zero numbers as TRUE and 0 as FALSE, but it is good practice to supply the tokens TRUE or FALSE when an argument expects a logical value.
  • missing values - NA. NA is a special value that indicates a value is missing.
  • missing vectors - NULL. Similar to NA, NULL indicates that a vector, rather than a single value, is missing. Vectors are described in the next section on data structures.
  • factors - Factors are a complex type used in statistical models and are covered in greater detail later in this section.
  • infinity - Inf and -Inf. These values encode what R understands to be positive or negative infinity, such as the result of dividing a non-zero number by 0.
  • impossible values - NaN. This value corresponds to the mathematically ‘impossible’ or undefined value of 0/0.
  • character data - "value". R can store character data in the form of strings. Note R does not interpret string values by default, so "1" and 1 are distinct.
  • dates and times - R has basic types to store dates and times (together termed a datetime, which includes both components). Internally, R stores dates (class Date) as the number of days since January 1, 1970, and datetimes (class POSIXct) as the number of seconds since the start of that day in UTC, using negative numbers for earlier dates.
  • complex numbers - R can store complex numbers using the complex function.
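
The behavior of several of these values can be checked directly at the console. The following is a minimal sketch; the variable names are illustrative only and are not used elsewhere in this chapter.

```r
# Logical values behave like 1 and 0 in arithmetic
n_true <- TRUE + TRUE                  # 2

# Division by zero: a non-zero numerator gives Inf or -Inf, 0/0 gives NaN
pos_inf   <- 1 / 0                     # Inf
neg_inf   <- -1 / 0                    # -Inf
undefined <- 0 / 0                     # NaN

# NA and NULL are detected with dedicated functions
na_check    <- is.na(NA)               # TRUE
null_check  <- is.null(NULL)           # TRUE
null_length <- length(NULL)            # 0

# The string "1" and the number 1 are different objects
same_object <- identical("1", 1)       # FALSE

# A complex number built with complex()
z <- complex(real = 1, imaginary = 2)  # 1+2i

# A Date is stored as a number of days since January 1, 1970
days <- as.numeric(as.Date("1970-01-02"))  # 1
```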

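R cannot compute a meaningful numeric result when NA or NaN values are present, and Inf values propagate through most arithmetic. Each of these values has an ‘infectious’ quality: if one is mixed in with other values, the result of the computation is usually that special value rather than an ordinary number. When NA and NaN are both present, which of the two is returned is not guaranteed and may depend on their order and on the platform: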

# this is how to create a vector of 4 values in R
x <- c(1,2,3,NA)
mean(x) # compute the mean of values that includes NA
[1] NA
mean(x,na.rm=TRUE) # remove NA values prior to computing mean
[1] 2
mean(c(1,2,3,NaN))
[1] NaN
mean(c(NA,NaN,1))
[1] NA

If your code produces one of these values where you expected a number, this suggests that one of these values is present in your input and needs to be handled explicitly.
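
One way to handle these values explicitly is to detect them with is.na(), is.nan(), and is.finite() before computing. The following is a minimal sketch on a made-up vector; whether dropping such values is appropriate depends on your analysis.

```r
# A made-up vector containing each kind of special value
x <- c(1, 2, NA, NaN, Inf)

is.na(x)      # FALSE FALSE  TRUE  TRUE FALSE  (is.na() is also TRUE for NaN)
is.nan(x)     # FALSE FALSE FALSE  TRUE FALSE
is.finite(x)  #  TRUE  TRUE FALSE FALSE FALSE

# Keep only ordinary finite numbers, then compute the mean
clean <- x[is.finite(x)]
mean(clean)   # 1.5
```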

4.4.1 Factors

Factors are objects that R uses to handle categorical variables, i.e. variables that can take one of a distinct set of values for each sample. For example, a variable indicating whether a subject had a disease or was a control could be encoded using a factor with values Disease or Control. Consider an example dataset with six subjects where three are disease and three are control, and we create a factor from a corresponding variable of character strings using the factor() function:

case_status <- factor(
  c('Disease','Disease','Disease',
    'Control','Control','Control'
  )
)
case_status
[1] Disease Disease Disease Control Control Control
Levels: Control Disease

The factor case_status prints as a vector of labels, either Disease or Control. The distinct values in the factor are called levels, and this factor has two: Control and Disease. Internally, a factor is stored as a vector of integers where all elements with the same level share the same integer value:

as.numeric(case_status)
[1] 2 2 2 1 1 1
str(case_status)
 Factor w/ 2 levels "Control","Disease": 2 2 2 1 1 1

By default, R assigns integers to levels in alphanumeric order; since “Control” comes lexicographically before “Disease,” the Control level is assigned the integer 1 and Disease is assigned 2. Each element of the factor is stored as one of these integers, and since the Disease samples come before the Control samples in the input vector, the numeric values of the factor are (2, 2, 2, 1, 1, 1). The integer values assigned to each level allow the factor to be sorted:

sort(case_status)
[1] Control Control Control Disease Disease Disease
Levels: Control Disease

Note the order of the elements has changed so that controls, which have a value of 1, precede disease samples, which have a value of 2; the levels themselves are unchanged. The integers assigned to each level can be specified explicitly with the levels argument when creating the factor:

case_status <- factor(
  c('Disease','Disease','Disease','Control','Control','Control'),
  levels=c('Disease','Control')
)
case_status
[1] Disease Disease Disease Control Control Control
Levels: Disease Control
str(case_status)
 Factor w/ 2 levels "Disease","Control": 1 1 1 2 2 2
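
Because a factor is stored as integer codes plus a set of level labels, converting a factor to numbers returns the codes, not the labels. The following minimal sketch inspects both parts of the factor and shows a pitfall with number-like labels; the dose factor is a hypothetical example, not part of the dataset above.

```r
case_status <- factor(
  c('Disease','Disease','Disease','Control','Control','Control'),
  levels=c('Disease','Control')
)
levels(case_status)      # "Disease" "Control"
as.integer(case_status)  # 1 1 1 2 2 2

# Hypothetical factor whose labels look like numbers
dose <- factor(c("10", "20", "10"))
as.numeric(dose)                # 1 2 1    (the integer codes)
as.numeric(as.character(dose))  # 10 20 10 (the labels, converted to numbers)
```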

Before R 4.0.0, the base R functions for reading in CSV files loaded columns with character values as factors by default; since R 4.0.0 the default is stringsAsFactors=FALSE, and you may pass stringsAsFactors=TRUE to read.csv() to restore the old behavior. In other situations you may have factors created by other functions whose integer values need to be changed. This process is called releveling the factor, and may be accomplished by passing a factor into the factor() function and specifying new levels:

str(case_status)
 Factor w/ 2 levels "Disease","Control": 1 1 1 2 2 2
str(factor(case_status, levels=c("Control","Disease")))
 Factor w/ 2 levels "Control","Disease": 2 2 2 1 1 1
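
Base R also provides relevel(), which moves a single level to the first position. The following minimal sketch contrasts it with assigning to levels() directly, which renames the levels rather than reordering them and therefore silently changes the data; the variable names releveled and renamed are illustrative.

```r
case_status <- factor(
  c('Disease','Disease','Disease','Control','Control','Control'),
  levels=c('Disease','Control')
)

# relevel() makes "Control" the first level; the labels stay the same
releveled <- relevel(case_status, ref="Control")
levels(releveled)        # "Control" "Disease"
as.integer(releveled)    # 2 2 2 1 1 1
as.character(releveled)  # "Disease" "Disease" "Disease" "Control" "Control" "Control"

# Caution: assigning to levels() renames the levels and swaps the labels
renamed <- case_status
levels(renamed) <- c("Control","Disease")
as.character(renamed)    # "Control" "Control" "Control" "Disease" "Disease" "Disease"
```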

Controlling the order of levels in a factor is important in a number of situations. One is specifying the reference category for categorical variables when constructing model matrices to pass to statistical models, the details of which are beyond the scope of this book. A second is controlling the order in which categorical variables are displayed when passed to ggplot, which is covered in greater detail in [Reordering 1-D Data Elements] in the Grammar of Graphics chapter. The forcats tidyverse package provides more powerful functions for working with categorical variables stored in factors.
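
Without going into the details of model matrices, the effect of level order on the reference category can be glimpsed with base R's model.matrix(). The following is a minimal sketch assuming R's default treatment contrasts, under which the first level becomes the reference and gets no column of its own; the variable name flipped is illustrative.

```r
case_status <- factor(
  c('Disease','Disease','Disease','Control','Control','Control')
)

# Default levels (Control, Disease): Control is the reference category
colnames(model.matrix(~ case_status))  # "(Intercept)" "case_statusDisease"

# With Disease listed first, Disease becomes the reference category
flipped <- factor(case_status, levels=c("Disease","Control"))
colnames(model.matrix(~ flipped))      # "(Intercept)" "flippedControl"
```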