The most common type of value in R is the number, e.g. 1.0 or 1e-5 for
\(10^{-5}\). For most practical purposes, R does not distinguish between numbers
with fractional parts (e.g. 1.123) and integers (e.g. 1); a number is a
number. In addition to numbers, there are some other types of values that are
special in R:
- TRUE or FALSE. Internally, R stores TRUE as the number 1 and FALSE as the
  number 0. Generally, R interprets non-zero numbers as TRUE and 0 as FALSE,
  but it is good practice to supply the tokens TRUE or FALSE when an argument
  expects a logical value.
- NA. NA is a special value that indicates a value is missing.
- NULL. Similar to NA, NULL indicates that a vector, rather than a value, is
  missing. Vectors will be described in the next section on data structures.
- Inf and -Inf. These values encode what R understands to be positive or
  negative infinity, such as the result of dividing a non-zero number by 0.
- NaN. This value corresponds to the mathematically ‘impossible’ or undefined
  value of 0/0.
- "value". R can store character data in the form of strings. Note R does not
  interpret string values by default, so "1" and 1 are distinct.
- Complex numbers, which can be created with the complex function.

Unsurprisingly, R cannot produce an ordinary numeric result from most
computations that involve NA, NaN, or Inf values.
Each of these values has an ‘infectious’ quality: if one is mixed in with
other values, the result of the computation generally reverts to the first
of these values encountered:
# this is how to create a vector of 4 values in R
x <- c(1,2,3,NA)
mean(x) # compute the mean of values that includes NA
[1] NA
mean(x,na.rm=TRUE) # remove NA values prior to computing mean
[1] 2
mean(c(1,2,3,NaN))
[1] NaN
mean(c(NA,NaN,1))
[1] NA

If your code produces NA, NaN, or Inf where you expect ordinary numbers, this suggests that one of these values is present in your input and needs to be handled explicitly.
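
One way to handle these values explicitly is to test for them before computing. The following is a minimal sketch using the base R predicates is.na(), is.nan(), and is.finite(); the vector x and the name clean are illustrative assumptions and do not come from a particular dataset.

```r
# illustrative vector containing several kinds of special values
x <- c(1, 2, NA, NaN, Inf)

is.na(x)      # TRUE for both NA and NaN
is.nan(x)     # TRUE only for NaN
is.finite(x)  # FALSE for NA, NaN, Inf, and -Inf

# keep only the ordinary, finite numbers before summarizing
clean <- x[is.finite(x)]
mean(clean)   # 1.5
```

Filtering with is.finite() is only one option; whether dropping such values is appropriate depends on why they appear in the data.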
Factors are objects that R uses to handle categorical variables, i.e. variables
that can take one of a distinct set of values for each sample. For example,
a variable indicating whether a subject had a disease or was a control could be
encoded using a factor with values Disease or Control. Consider an example
dataset with six subjects where three are disease and three are control, and
we create a factor from a corresponding variable of character strings using the
factor() function:
case_status <- factor(
  c('Disease','Disease','Disease',
    'Control','Control','Control'
  )
)
case_status
[1] Disease Disease Disease Control Control Control
Levels: Control Disease

The factor case_status prints as a vector of labels, either Disease or
Control. The distinct values in the factor are called levels, and this
factor has two: Control and Disease. Internally, a factor is stored as a
vector of integers, where every occurrence of a given level is stored as the
same integer:
as.numeric(case_status)
[1] 2 2 2 1 1 1
str(case_status)
Factor w/ 2 levels "Control","Disease": 2 2 2 1 1 1

By default, R assigns integers to levels in alphanumeric order; since “Control”
comes lexicographically before “Disease,” the Control level is assigned the
integer 1 and Disease is assigned 2. Each value of the factor corresponds
to one of these integers, and since the Disease samples came before the Control
samples in the input, the numeric values of the factor are (2, 2, 2, 1, 1, 1).
The integer values assigned to each level allow the factor to be sorted:
sort(case_status)
[1] Control Control Control Disease Disease Disease
Levels: Control Disease

Note the order of the factor's elements has changed so that controls, which have a value of 1, precede disease, which has a value of 2. The integers assigned to each level can be specified explicitly when creating the factor:
case_status <- factor(
  c('Disease','Disease','Disease','Control','Control','Control'),
  levels=c('Disease','Control')
)
case_status
[1] Disease Disease Disease Control Control Control
Levels: Disease Control
str(case_status)
Factor w/ 2 levels "Disease","Control": 1 1 1 2 2 2

In versions of R before 4.0.0, the base R functions for reading in CSV files
load columns with character values as factors by default (you may turn this
off by passing stringsAsFactors=FALSE to read.csv()); since R 4.0.0, the
default is stringsAsFactors=FALSE. In other situations you may have
factors created by other functions that need to have their integer values
changed. This process is called releveling the factor, and may be accomplished
by passing a factor into the factor() function and specifying new levels:
str(case_status)
Factor w/ 2 levels "Disease","Control": 1 1 1 2 2 2
str(factor(case_status, levels=c("Control","Disease")))
Factor w/ 2 levels "Control","Disease": 2 2 2 1 1 1

Controlling the order of levels in a factor is important in a number of situations. One is specifying the reference category for categorical variables when constructing model matrices to pass to statistical models, the details of which are beyond the scope of this book. A second is controlling the order in which categorical variables are displayed when passed to ggplot, which is covered in greater detail in [Reordering 1-D Data Elements] in the Grammar of Graphics chapter. The forcats tidyverse package provides more powerful functions for working with categorical variables stored in factors.
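
As a small illustration of setting a reference category, base R also provides relevel(), which moves one level to the first position and leaves the others in their existing order. The sketch below reuses the case_status example from above; treating Control as the reference, and the name releveled, are assumptions made for illustration.

```r
# recreate the factor with Disease as the first level
case_status <- factor(
  c('Disease','Disease','Disease','Control','Control','Control'),
  levels=c('Disease','Control')
)
levels(case_status)    # "Disease" "Control"

# make Control the first (reference) level
releveled <- relevel(case_status, ref='Control')
levels(releveled)      # "Control" "Disease"
as.integer(releveled)  # 2 2 2 1 1 1

# the labels themselves are unchanged; only the integer codes differ
all(as.character(releveled) == as.character(case_status))  # TRUE
```

Unlike passing a full levels vector to factor(), relevel() only requires naming the level that should come first, which can be convenient when a factor has many levels.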