In programming, iteration refers to stepping sequentially through a set or
collection of objects, be it a vector of numbers, the columns of a matrix, etc.
In non-functional languages like Python, C, etc. there are particular control
structures that implement iteration, commonly called loops. If you have
worked with these languages, you may be familiar with for and while loops,
which are some of these iteration control structures. However, R was designed to
execute iteration in a different way than these other languages, and provides
two forms of iteration: vectorized operations, and functional programming with
apply().
Note that R does support for and while loops. However, these loop structures
often perform poorly in R, and should generally be avoided in favor of the
functional style of iteration described below.
If you really, really want to learn how to use for loops in R, read this, but don’t say I didn’t warn you when your code slows to a crawl for unknown reasons:
The simplest form of iteration in R comes in vectorized computation. This sounds fancy, but it just means R intrinsically knows how to perform many operations on vectors and matrices as well as individual values. We have already seen examples of this above when performing arithmetic operations on vectors:
x <- c(1,2,3,4,5)
x + 3 # add 3 to every element of vector x
[1] 4 5 6 7 8
x * x # elementwise multiplication, 1*1 2*2 etc
[1]  1  4  9 16 25
x_mat <- matrix(c(1,2,3,4,5,6),nrow=2,ncol=3)
x_mat + 3 # add 3 to every element of matrix x_mat
     [,1] [,2] [,3]
[1,]    4    6    8
[2,]    5    7    9
# the * operator always means element-wise
x_mat * x_mat
     [,1] [,2] [,3]
[1,]    1    9   25
[2,]    4   16   36
In addition to simple arithmetic operations, R also has syntax for vector-vector, matrix-vector, and matrix-matrix operations, like matrix multiplication and dot products:
# the %*% operator stands for matrix multiplication
x_mat %*% c(1,2,3) # [ 2x3 ] * [ 3 ]
     [,1]
[1,]   22
[2,]   28
x_mat %*% t(x_mat) # recall t() is the transpose function, making [ 2x3 ] * [ 3x2 ]
     [,1] [,2]
[1,]   35   44
[2,]   44   56
These forms of implicit iteration are very powerful, and the R program has been highly optimized to perform these operations very quickly. If you can cast your iteration into a vector or matrix multiplication, it is a good idea to do so. For other more complex or custom iteration, we must first talk briefly about functional programming.
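To make the contrast concrete, here is a minimal sketch that computes a sum of squares three ways: with an explicit for loop, with a vectorized expression, and as a dot product. The variable names total_loop, total_vec, and total_mat are illustrative and do not appear elsewhere in this chapter.

```r
x <- c(1, 2, 3, 4, 5)

# explicit loop: accumulate the square of each element one at a time
total_loop <- 0
for (value in x) {
  total_loop <- total_loop + value * value
}

# vectorized: square every element at once, then add them up
total_vec <- sum(x * x)

# matrix form: the dot product of x with itself (returned as a 1x1 matrix)
total_mat <- x %*% x

total_loop             # 55
total_vec              # 55
as.numeric(total_mat)  # 55
```

All three give the same answer, but the vectorized and matrix forms leave the iteration to R's optimized internals and are shorter to write.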
R is a functional programming language at its core, which means it is designed around the use of functions. In the previous section, we saw that functions are defined and assigned to names just like variables. This means that functions can be passed to other functions just like variables! Consider the following example.
Let’s consider a general formulation of vector transformation:
\[ \bar{\mathbf{x}} = \frac{\mathbf{x} - t_r(\mathbf{x})}{s(\mathbf{x})} \]
Here, \(\mathbf{x}\) is a vector of real numbers, and \(\bar{\mathbf{x}}\) is defined as a vector of the same length where each value has had some average or central value \(t_r(\mathbf{x})\) subtracted from it, and is divided by a scaling factor \(s(\mathbf{x})\) to control the range of resulting values. Both \(t_r(\mathbf{x})\) and \(s(\mathbf{x})\) are scalars (i.e. individual numbers) and dependent upon the values of \(\mathbf{x}\). If \(t_r\) is arithmetic mean and \(s\) is standard deviation, we have defined the standardization transformation mentioned in earlier examples:
x <- rnorm(100, mean=20, sd=10)
x_zscore <- (x - mean(x))/sd(x)
However, there are many different ways to define the central value of a set of numbers:
Each of these central value methods accepts a vector of numbers, but their behaviors are different, and are appropriate in different situations. Likewise, there are many possible scaling strategies we might consider:
We may wish to explore these different methods without writing entirely new code for each combination when trying out different transformation techniques.
In R and other functional languages, we can easily accomplish this by passing functions as arguments to other functions. Consider the following R function:
# note R already has a built in function named "transform"
my_transform <- function(x, t_r, s) {
  return((x - t_r(x))/s(x))
}
This should look similar to the equation presented earlier, except that in the
code t_r and s are functions passed as arguments. If we wished to transform
using a Z-score normalization, we could call my_transform as follows:
x <- rnorm(100,mean=20,sd=10)
x_zscore <- my_transform(x, mean, sd)
mean(x_zscore)
[1] 0
sd(x_zscore)
[1] 1
In the my_transform function call, the second and third arguments are the
names of the mean and sd functions, respectively. In the definition of
my_transform we use the syntax t_r(x) and s(x) to indicate that these
arguments should be treated as functions. Using this strategy, we could just as
easily define a transformation using median and sum for t_r and s if we
wished to:
x <- rnorm(100,mean=20,sd=10)
x_transform <- my_transform(x, median, sum)
median(x_transform)
[1] 0
sum(x_transform) # this quantity does not have an a priori known value (or meaning for that matter, it's just an example)
[1] 0.013
We can also write our own functions and pass them to my_transform to get the
desired behavior. The following scales the values of x to have a range of
\([0,1]\):
data_range <- function(x) {
  return(max(x) - min(x))
}
# my_transform computes: (x - min(x))/(max(x) - min(x))
x_rescaled <- my_transform(x, min, data_range)
min(x_rescaled)
[1] 0
max(x_rescaled)
[1] 1
The data_range function simply subtracts the minimum value of x from the
maximum value and returns the result.
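The functions passed in do not need to be defined and named ahead of time. As a sketch, the following passes the built-in median() and mad() (median absolute deviation) functions, and then an anonymous function written inline. The choice of these functions and the small fixed input vector are illustrative assumptions, not part of the examples above.

```r
# same definition as in the text above, repeated so this example runs on its own
my_transform <- function(x, t_r, s) {
  return((x - t_r(x)) / s(x))
}

x <- c(2, 4, 4, 5, 7, 9, 100)  # small vector with one outlier

# center on the median and scale by the median absolute deviation,
# both of which are less sensitive to the outlier than mean and sd
x_robust <- my_transform(x, median, mad)
median(x_robust)   # 0, since the median was subtracted

# an anonymous function can be written inline, here a 20% trimmed mean
x_trimmed <- my_transform(x, function(v) mean(v, trim = 0.2), sd)
length(x_trimmed)  # 7, the same length as x
```

Nothing in my_transform had to change to support these new combinations, which is the point of passing functions as arguments.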
This feature of passing functions as arguments to other functions is a fundamental property of functional programming languages. Now we are ready to finally talk about how iteration is performed in R.
apply() and friends

When working with lists and matrices in R, there are often times when you want
to perform a computation on every row or every column separately. A common
example of this in data science mentioned above is feature standardization.
Earlier we wrote a Z-score
transformation that accepts a
vector, subtracts the mean from each element, and divides the result by the
standard deviation of the data. This ensures the data has a mean and standard
deviation of 0 and 1, respectively. However, this function only operates on a
single vector of numbers. Large datasets have many features, each of which may
be individual vectors, that we desire to perform this same Z-score
transformation on separately. In other words, we have one function that we wish
to execute on either every row or every column of a matrix and return the
result. This is a form of iteration that can be implemented in a functional
style using the apply function.
This is the signature of the apply function, from the RStudio help(apply) page:
apply(X, MARGIN, FUN, ..., simplify = TRUE)
Here, X is a matrix (i.e. a rectangle of numbers) that we wish to perform a
computation on for either each row or each column. MARGIN indicates whether
the matrix should be traversed by rows (MARGIN=1) or columns (MARGIN=2).
FUN is the name of a function that accepts a vector and returns either a
vector or a scalar value that we wish to execute on either the rows or columns.
apply() then executes FUN on each row or column of X and returns the
result. For example:
zscore <- function(x) {
  return((x-mean(x))/sd(x))
}
# construct a matrix of 50 rows by 100 columns with samples drawn from a normal distribution
x_mat <- matrix(
  rnorm(100*50, mean=20, sd=5),
  nrow=50,
  ncol=100
)
# z-transform the columns of x_mat (MARGIN=2), so that each column has mean,sd of 0,1
x_mat_zscore <- apply(x_mat, 2, zscore)
# we can check that all the columns of x_mat_zscore have mean close to zero with apply too
x_mat_zscore_means <- apply(x_mat_zscore, 2, mean)
# note: due to machine precision errors, these results will not be exactly zero, but are very close
# note: abs() is needed so that a large negative mean would also fail the check
# note: the all() function returns TRUE if all of its arguments are TRUE
all(abs(x_mat_zscore_means) < 1e-10)
[1] TRUE
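The effect of MARGIN is easiest to see on a matrix small enough to check by hand. The sketch below uses the same 2x3 values as the matrix from the vectorization examples; the names small_mat, row_sums, col_sums, and col_means are illustrative. It also shows that any extra arguments given after FUN are forwarded to FUN through the ... in the signature.

```r
small_mat <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 2, ncol = 3)

# MARGIN=1: call sum() once per row, giving one value per row
row_sums <- apply(small_mat, 1, sum)
row_sums   # 9 12

# MARGIN=2: call sum() once per column, giving one value per column
col_sums <- apply(small_mat, 2, sum)
col_sums   # 3 7 11

# arguments after FUN are passed along to FUN, here na.rm=TRUE for mean()
small_mat[1, 2] <- NA
col_means <- apply(small_mat, 2, mean, na.rm = TRUE)
col_means  # 1.5 4.0 5.5
```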
The same approach can be used when X is a list or data frame rather than a
matrix using the lapply() function (hint: the l in lapply stands for
“list”). Here is the function signature for lapply:
lapply(X, FUN, ...)
Recall that lists and data frames can be thought of as vectors where each
element can be its own vector. Therefore, there is only one axis along which to
iterate over the elements, and there is no MARGIN argument as in apply. This
function returns a new list of the same length as the original list with
elements returned by FUN:
x <- list(
  feature1=rnorm(100,mean=20,sd=10),
  feature2=rnorm(100,mean=50,sd=5)
)
x_zscore <- lapply(x, zscore)
# check that the means are close to zero
x_zscore_means <- lapply(x_zscore, mean)
# unlist() turns the list of means into a plain numeric vector so abs() can be applied
all(abs(unlist(x_zscore_means)) < 1e-10)
[1] TRUE
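Because lapply() always returns a list, checking or plotting its results can take an extra step. As a sketch of two related base R functions, sapply() and vapply() apply a function to each element in the same way but return a plain vector when each result is a single value. The small fixed vectors and variable names below are chosen only for illustration.

```r
features <- list(
  feature1 = c(1, 2, 3),
  feature2 = c(10, 20, 30, 40)
)

# lapply() returns a list with one element per input element
means_list <- lapply(features, mean)
class(means_list)  # "list"

# sapply() simplifies the result to a named numeric vector when it can
means_vec <- sapply(features, mean)
means_vec          # feature1 = 2, feature2 = 25

# vapply() does the same, but requires stating the expected type and length
# of each result (here a single number) and raises an error otherwise
means_checked <- vapply(features, mean, numeric(1))
```

vapply() is the more defensive choice inside functions, since a result of unexpected shape produces an error rather than a silently different return type.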
This functional programming pattern might be counterintuitive at first, but it is well worth your while to learn.