The most common type of value in R is the number, e.g. 1.0 or 1e-5 for \(10^{-5}\). For most practical purposes, R does not distinguish between numbers with fractional parts (e.g. 1.123) and integers (e.g. 1); a number is a number. In addition to numbers, there are some other types of values that are special in R:
- Logical values: TRUE or FALSE. Internally, R stores TRUE as the number 1 and FALSE as the number 0. Generally, R interprets non-zero numbers as TRUE and 0 as FALSE, but it is good practice to supply the tokens TRUE or FALSE when an argument expects a logical value.
- NA. NA is a special value that indicates a value is missing.
- NULL. Similar to NA, NULL indicates that a vector, rather than a value, is missing. Vectors will be described in the next section on data structures.
- Inf and -Inf. These values encode what R understands to be positive or negative infinity, such as the result of dividing a non-zero number by 0.
- NaN. This value corresponds to the mathematically 'impossible' or undefined value of 0/0.
- "value". R can store character data in the form of strings. Note R does not interpret string values by default, so "1" and 1 are distinct.
- Complex numbers. R can create complex numbers using the complex function.

Unsurprisingly, R cannot perform meaningful computations on NA, NaN, or Inf values. Each of these values has an 'infectious' quality: if one is mixed in with other values, the result of the computation becomes that special value. When several are present, the result is typically the first of these values encountered, although R does not guarantee which is returned when NA and NaN are mixed:
# this is how to create a vector of 4 values in R
x <- c(1,2,3,NA)
mean(x) # compute the mean of values that includes NA
[1] NA
mean(x,na.rm=TRUE) # remove NA values prior to computing mean
[1] 2
mean(c(1,2,3,NaN))
[1] NaN
mean(c(NA,NaN,1))
[1] NA
If your code produces NA, NaN, or Inf where you expect numbers, this suggests that one of these values is present in your input and needs to be handled explicitly.
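As a minimal sketch of how these special values arise and how they might be detected, the following uses the base R functions is.na(), is.nan(), and is.finite(). The variable names x and clean_mean are chosen here for illustration only.

```r
# Special values arising from arithmetic
1 / 0     # Inf
-1 / 0    # -Inf
0 / 0     # NaN

# Tests for special values in a vector
x <- c(1, NA, NaN, Inf)
is.na(x)       # TRUE for both NA and NaN
is.nan(x)      # TRUE only for NaN
is.finite(x)   # TRUE only for ordinary numbers

# Logical values behave as 1 and 0 in arithmetic
n_true <- sum(c(TRUE, FALSE, TRUE))

# Strings are not interpreted as numbers unless converted explicitly
identical("1", 1)      # FALSE
as.numeric("1") == 1   # TRUE

# One way to handle special values explicitly: keep only finite values
clean_mean <- mean(x[is.finite(x)])
```

Filtering with is.finite() drops NA, NaN, and infinite values in one step, whereas na.rm=TRUE removes NA and NaN but leaves Inf in place.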
Factors are objects that R uses to handle categorical variables, i.e. variables that can take one of a distinct set of values for each sample. For example, a variable indicating whether a subject had a disease or was a control could be encoded using a factor with values Disease or Control. Consider an example dataset of six subjects, three with disease and three controls. We can create a factor from a corresponding vector of character strings using the factor() function:
case_status <- factor(
  c('Disease','Disease','Disease',
    'Control','Control','Control'
  )
)
case_status
[1] Disease Disease Disease Control Control Control
Levels: Control Disease
The factor case_status prints as a vector of labels, either Disease or Control. The distinct values in the factor are called levels, and this factor has two: Control and Disease. Internally, a factor is stored as a vector of integers, where every occurrence of a given level is stored as the same integer:
as.numeric(case_status)
[1] 2 2 2 1 1 1
str(case_status)
 Factor w/ 2 levels "Control","Disease": 2 2 2 1 1 1
By default, R assigns integers to levels in alphanumeric order; since "Control" comes lexicographically before "Disease", the Control level is assigned the integer 1 and Disease is assigned 2. Each value of the factor is stored as the integer of its level, and since the Disease samples came before the Control samples in the input, the numeric values of the factor are (2, 2, 2, 1, 1, 1). The integer values assigned to each level allow the factor to be sorted:
sort(case_status)
[1] Control Control Control Disease Disease Disease
Levels: Control Disease
Note the order of the factor's values has changed so that controls, which have a value of 1, precede disease samples, which have a value of 2. The levels themselves are unchanged. The integers assigned to each level can be specified explicitly when creating the factor:
case_status <- factor(
  c('Disease','Disease','Disease','Control','Control','Control'),
  levels=c('Disease','Control')
)
case_status
[1] Disease Disease Disease Control Control Control
Levels: Disease Control
str(case_status)
 Factor w/ 2 levels "Disease","Control": 1 1 1 2 2 2
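The relationship between level order and the stored integers can be inspected directly. The following is a small sketch using the base R functions levels() and as.integer(); the names codes, labels, and sorted are introduced here for illustration.

```r
case_status <- factor(
  c('Disease','Disease','Disease','Control','Control','Control'),
  levels=c('Disease','Control')
)

# Level labels, in the order supplied to factor()
levels(case_status)

# Integer code of each value: the position of its label in levels()
codes <- as.integer(case_status)

# Indexing the levels by the codes maps the integers back to labels
labels <- levels(case_status)[codes]

# Sorting follows the integer codes, so Disease (level 1) now comes first
sorted <- sort(case_status)
```

Because sorting uses the integer codes rather than the labels, changing the level order changes the sort order even though the labels themselves are the same.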
In versions of R before 4.0.0, the base R functions for reading in CSV files loaded columns with character values as factors by default (this could be turned off by passing stringsAsFactors=FALSE to read.csv()); since R 4.0.0, the default is stringsAsFactors=FALSE. In other situations you may have factors created by other functions that need to have their integer values changed. This process is called releveling the factor, and may be accomplished by passing a factor into the factor() function and specifying new levels:
str(case_status)
 Factor w/ 2 levels "Disease","Control": 1 1 1 2 2 2
str(factor(case_status, levels=c("Control","Disease")))
 Factor w/ 2 levels "Control","Disease": 2 2 2 1 1 1
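Releveling is easy to confuse with renaming the levels. The sketch below contrasts the two, assuming the case_status factor from above with levels Disease then Control. The names releveled and renamed are illustrative.

```r
case_status <- factor(
  c('Disease','Disease','Disease','Control','Control','Control'),
  levels=c('Disease','Control')
)

# Releveling: each sample keeps its label, only the integer codes change
releveled <- factor(case_status, levels=c('Control','Disease'))
as.character(releveled)   # Disease x3, then Control x3
as.integer(releveled)     # 2 2 2 1 1 1

# Renaming: assigning to levels() keeps the codes and replaces the labels,
# so the first three samples are now (incorrectly) labeled Control
renamed <- case_status
levels(renamed) <- c('Control','Disease')
as.character(renamed)     # Control x3, then Disease x3
```

Passing the factor back through factor() with new levels is therefore the safer choice when only the order should change.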
Controlling the order of levels in a factor is important in a number of situations. One is specifying the reference category for categorical variables when constructing model matrices to pass to statistical models, the details of which are beyond the scope of this book. A second is controlling the order in which categorical variables are displayed when passed to ggplot, which is covered in greater detail in [Reordering 1-D Data Elements] in the Grammar of Graphics chapter. The forcats tidyverse package provides more powerful functions for working with categorical variables stored in factors.
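Without going into the details of model fitting, the effect of level order on the reference category can be sketched with the base R functions relevel() and model.matrix(). This assumes R's default treatment contrasts; the names design and disease_ref are illustrative.

```r
case_status <- factor(
  c('Disease','Disease','Disease','Control','Control','Control')
)

# With default alphanumeric levels, Control is first and acts as the reference
levels(case_status)
design <- model.matrix(~ case_status)
colnames(design)   # the non-reference level, Disease, gets its own column

# relevel() moves the chosen level to the front, making it the reference
disease_ref <- relevel(case_status, ref='Disease')
levels(disease_ref)
colnames(model.matrix(~ disease_ref))   # now Control gets its own column
```

The first level is absorbed into the intercept, so the remaining columns describe each other level relative to it.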