R is a programming language that’s especially designed for data analysis and data visualization. In some cases, it’s more convenient to use R than C++ or Java, making R a key data analysis tool. In this chapter, we describe similarities and differences between R and its close relatives: Matlab and Python. We then delve into the R programming language to learn about data types, control flow, interfacing with C++, etc.
Note: The examples below are abridged; the book contains more details.
Here’s a quick introduction to the R programming language:
# This is a comment
a = 4 # single statement
a = 4; b = 3; c = b # multiple statements
# R has a dynamic type system; you can define a variable without declaring its
# type explicitly and you can change the variable's type in the same session:
a = 3.2
a = "foo"
a <- 4 # another way to assign value
# Here are a few different ways to display the value of a variable:
a = 4
print(a)
a # same thing
cat(a) # same thing
# cat can print multiple variables one after the other
cat("The value of the variable a is: ", a)
# Here are some important commands with short explanations:
x = 3 # assign value 3 to variable x
y = 3 * x + 2 # basic variable assignment and arithmetic
ratio.of.x.and.y = x / y # divide x by y and assign result
ls() # list variable names in workspace memory
ls(all.names = TRUE) # list all variables including hidden ones
ls.str() # print annotated list of variable names
save.image(file = "file-name") # save all variables to a file
save(x, y, file = "varFile") # save specified variables
rm(x, y) # clear variables x and y from memory
rm(list = ls()) # clear all variables in workspace memory
load("varFile") # load variables from file back to the workspace
cat(x)
The function system(command)
executes the given shell command:
# change directory to home directory
setwd("~")
# display all files in current directory
dir(path = ".", all.files = TRUE)
# execute bash command ls -al (in Linux)
system("ls -al")
R features easy installation of both core R and third party packages:
# install package ggplot2
install.packages("ggplot2")
# install package from a particular mirror site
install.packages("ggplot2", repos = "http://cran.r-project.org")
# install a package from source, rather than binary
install.packages("ggplot2", type = "source")
library(ggplot2) # bring package into scope
# display all datasets in the package ggplot2
data(package = "ggplot2")
installed.packages() # display a list of installed packages
update.packages() # update currently installed packages
The code below demonstrates masking the built-in constant pi
by a global
environment variable with the value 3, and retrieving the original value after
clearing it:
pi # a built-in constant whose value is 3.141593
pi = 3 # redefines variable pi
pi # .GlobalEnv match
rm(pi) # removes masking variables
pi # back to 3.141593
The following shows how to use files for running commands (input) and output:
# A sink records output to the given file.
# To print to both the screen and to a file (tee), use the following variation:
sink(file = "outputFile", split = TRUE)
source("chapter7/foo.R") # executes R code written in a text file
system("cat outputFile") # prints what was captured by the sink above
print("this is a command in foo.R")
Scalar types include numeric, integer, logical, string, dates, and factors:
rm(list = ls()) # clears all variables
a = 3.2; b = 3 # double types
b
typeof(b) # function returns type of object
c = as.integer(b) # cast to integer type
c
typeof(c)
c = 3L # alternative to casting: L specifies integer
d = TRUE
d
e = as.numeric(d) # casting to numeric
e
f = "this is a string" # string
f
ls.str() # show variables and their types
# Factor variables assume values in a predefined set of possible values.
# The code below demonstrates the use of factors in R:
current.season = factor("summer",
levels = c("summer", "fall", "winter", "spring"),
ordered = TRUE) # ordered factor
current.season
levels(current.season) # display factor levels
my.eye.color = factor(
"brown",
levels = c("brown", "blue", "green"),
ordered = FALSE) # unordered factor
my.eye.color
Vectors, arrays, lists, and dataframes are collections that hold multiple scalar values:
# c() concatenates arguments to create a vector
x = c(4, 3, 3, 4, 3, 1)
x
length(x)
2 * x + 1 # element-wise arithmetic
# Boolean vector (default is FALSE)
y = vector(mode = "logical", length = 4)
y
# numeric vector (default is 0)
z = vector(length = 3, mode = "numeric")
z
q = rep(3.2, times = 10) # repeat value multiple times
q
w = seq(0, 1, by = 0.1) # values in [0,1] in 0.1 increments
w
# 11 evenly spaced numbers between 0 and 1
w = seq(0, 1, length.out = 11)
w
# create an array of booleans reflecting whether condition holds
w <= 0.5
any(w <= 0.5) # is it true for some elements?
all(w <= 0.5) # is it true for all elements?
which(w <= 0.5) # for which elements is it true?
w[w <= 0.5] # extracting from w entries for which w<=0.5
subset(w, w <= 0.5) # an alternative with the subset function
w[w <= 0.5] = 0 # zero out all components smaller or equal to 0.5
w
# Arrays are multidimensional generalizations of vectors
z = seq(1, 20,length.out = 20) # create a vector 1, 2, ..., 20
x = array(data = z, dim = c(4, 5)) # create a 2-d array
x
x[2, 3] # refer to the second row and third column
x[2, ] # refer to the entire second row
x[-1, ] # all but the first row - same as x[c(2, 3, 4), ]
y = x[c(1, 2), c(1, 2)] # 2x2 top left sub-matrix
2 * y + 1 # element-wise operation
y %*% y # matrix product (both arguments are matrices)
# inner product (both vectors have the same dimensions)
x[1, ] %*% x[1, ]
t(x) # matrix transpose
outer(x[, 1], x[, 1]) # outer product
rbind(x[1, ], x[1, ]) # vertical concatenation
cbind(x[1, ], x[1, ]) # horizontal concatenation
# Multidimensional arrays
m = matrix(c(1, 2, 3, 4), nrow = 2, ncol = 2)
m
m[3] # counting by columns A[3] = A[1, 2]
# Lists are ordered collections which permit positions to hold variables
# of different types
x = list(name = "John", age = 55, no.children = 2, children.ages = c(15, 18))
names(x) # displays all position names
x[[2]] # second element
x[2] # list containing second element
x$name # value in list corresponding to name
x["name"] # same thing
x$children.ages[2] # same as L[[4]][2]
# Smaller arrays are extended as needed
a = c(1, 2)
b = c(10, 20, 30, 40, 50)
a + b
b[7] = 70
b
# A dataframe is an ordered sequence of lists sharing the same signature
vecn = c("John Smith", "Jane Doe")
veca = c(42, 45)
vecs = c(50000, 55000)
df = data.frame(name = vecn, age = veca, salary = vecs)
df
names(df) = c("NAME", "AGE", "SALARY") # modify column names
df
The core R package datasets
contains many interesting and demonstrative
datasets, such as the iris
dataset, whose first four dimensions are numeric
measurements describing flower geometry and whose last dimension is a string
describing the flower species. The below example explores the iris
dataset:
library(datasets)
names(iris) # lists the dimension (column) names
head(iris, 4) # show first four rows
iris[1,] # first row
iris$Sepal.Length[1:10] # sepal length of first ten samples
# allow replacing iris$Sepal.Length with the shorter Sepal.Length
attach(iris, warn.conflicts = FALSE)
mean(Sepal.Length) # average of Sepal.Length across all rows
colMeans(iris[, 1:4]) # means of all four numeric columns
# extract all rows whose Sepal.Length variable is less than 5
# and whose species is not setosa
subset(iris, Sepal.Length < 5 & Species != "setosa")
# count number of rows corresponding to setosa species
dim(subset(iris, Species == "setosa"))[1]
# Provides a statistical summary of the different dataframe columns
summary(iris)
# Read text file into dataframe
# df = read.table("irisFile.txt", header = TRUE)
# same but from Internet location
# df = read.table("http://www.exampleURL.com/irisFile.txt", header = TRUE)
# The following commands require an editor (e.g., Vim)
# edit(iris) # examine data as spreadsheet
# iris = edit(iris) # edit dataframe/variable
# newIris = edit(iris) # edit dataframe/variable but keep original
Functions in R are similar to those in C++ and Java. When calling a function, the arguments flow into the function according to their order at the call site:
# foo is a user-defined function
foo = function(x = 1, y = 2, z = 3) {
return(x + y + z)
}
foo(10, 20, 30) # parameter bindings by order
foo(y = 20, x = 10, z = 30) # (potentially) out of order parameter bindings
foo(x = 10, y = 20, z = 30) # passing 3 parameters
foo(z = 30) # x and y are missing and are assigned default values
foo(10) # in-order parameter binding with last two parameters missing
# myPower(.,.) raises the first argument to the power of the second; the former
# is named bas and has a default value of 10; the latter is named pow and has
# a default value of 2
myPower = function(bas = 10, pow = 2) {
return(bas ^ pow) # raise base to a power
}
myPower(2, 3) # 2 is bound to bas and 3 to pow (in-order)
myPower(pow = 3, bas = 2) # same binding as above (out-of-order parameter names)
myPower(bas = 3) # default value of pow is used
# R passes variables by value: Changing passed arguments inside a function
# does not modify their respective values in the calling environment
x = 2
myPower2 = function(x) { x = x ^ 2; return(x) }
y = myPower2(x) # does not change x outside the function
x
y
The flow of control of R code is very similar to that of other programming languages. Below are some examples:
a = 10
b = 5
c = 1
if (a < b) {
d = 1
} else if (a == b) {
d = 2
} else {
d = 3
}
d
Similaryly, below are some examples of loops:
sm = 0
# repeat for 100 iteration, with num taking values 1:100
for (num in seq(1, 100, by = 1)) {
sm = sm + num
}
sm # same as sum(1:100)
repeat {
sm = sm - num
num = num - 1
if (sm == 0) break # if sm == 0 then stop the loop
}
sm
a = 1
b = 10
# continue the loop as long as b > a
while (b > a) {
sm = sm + 1
a = a + 1
b = b - 1
}
sm
It is best to avoid loops when programming in R. There are two reasons for this: simplifying code and computational speed-up. Many mathematical computations on lists, vectors, or arrays may be performed without loops using component-wise arithmetic. The code below demonstrates the computational speedup resulting from replacing a loop with vectorized code:
a = 1:10
c = 0 # compute sum of squares using a for-loop
for (e in a) c = c + e^2
c
sum(a ^ 2) # same operation using vector arithmetic
a = 1:1000000
c = 0
# time comparison with a million elements
system.time(for (e in a) c = c + e ^ 2)
system.time(sum(a ^ 2))
# The sapply function simplifies code but the computational speed-up may not
# apply in the same way as it did above
a = seq(0, 1, length.out = 10)
b = 0
c = 0
for (e in a) {
b = b + exp(e)
}
b
c = sum(sapply(a, exp))
c
# sapply with an anonymous function f(x) = exp(x ^ 2)
sum(sapply(a, function(x) { return(exp(x ^ 2)) }))
# or simply
sum(sapply(a, function(x) exp(x ^ 2)))
Consider the code below, which compares two implementations of matrix
multiplication: The first uses R’s internal matrix multiplication
and the second implements it through three nested loops,
each containing a scalar multiplication. The first matrix multiplication is
faster by several orders of magnitude even for a relatively small n
.
The key difference is that the built-in matrix multiplication runs compiled C++
code:
n = 100
nsq = n * n
# generate two random matrices
a = matrix(runif(nsq), nrow = n, ncol = n)
b = matrix(runif(nsq), nrow = n, ncol = n)
system.time(a %*% b) # built-in matrix multiplication
matMult = function(a, b, n) {
m = matrix(data = 0, nrow = n, ncol = n)
for (i in 1:n)
for (j in 1:n)
for (k in 1:n)
m[i, j] = m[i, j] + a[i, k] * b[k, j]
return(m)
}
system.time(matMult(a, b, n)) # nested loops implementation
For example, consider the task of computing some function given two vectors
a
and b
of size n
. The C code to compute the result appears below
(fooC.c
); note the presence of the pre-processor directive include <R.h>
.
You may execute code from fooC.R
after running the command R CMD SHLIB fooC.c
; the runnable example below runs that command for you:
#include <R.h>
#include <math.h>
void fooC(double* a, double* b, int* n, double* result) {
int i, j;
for (i = 0; i < (*n); i++) {
result[i] = 0;
for (j = 0; j < (*n); j++)
result[i] += pow(a[j] + i + 1, b[j]);
}
}
dyn.load("chapter7/fooC.so") # load the compiled C code
a = seq(0, 1, length = 10)
b = seq(0, 1, length = 10)
c = rep(0, times = 10)
x = .C("fooC", a, b, as.integer(10), c) # calls the fooC function
x[[4]] # the 4th element contains the result
fooR = function(a, b, n) {
result = rep(0, times = n)
for (i in 1:n)
for (j in 1:n)
result[i] = result[i] + (a[j] + i) ^ b[j]
return(result)
}
fooR(a, b, 10)
sizes = seq(10, 1000, length = 10)
rTime = rep(0, 10)
cTime = rTime
i = 1
for (n in sizes) {
a = seq(0, 1, length = n)
b = seq(0, 1, length = n)
c = rep(0, times = n)
cTime[i] = system.time(.C("fooC", a, b, as.integer(n), c))
rTime[i] = system.time(fooR(a, b, n))
i = i + 1
}
df = stack(list(C = cTime, R = rTime))
names(df) = c("system.time", "language")
df$size = sizes
# plot run time as a function of array size for R and .C implementations
qplot(x = size,
y = system.time,
lty = language,
color = language,
data = df,
size = I(1.5),
geom = "line",
xlab = "array size",
ylab = "computation time (sec)")
The .Call
function offers a higher degree of flexibility than .C
in terms
of passing and returning R arrays or dataframes. The example below shows how
to use .Call
to implement the same functionality as the .C
example above.
The type SEXP
represents an R object. The REAL
macro returns a pointer
to the corresponding memory for reading or writing to it. Here’s an example:
#include <R.h>
#include <Rinternals.h>
#include <Rmath.h>
#include <math.h>
SEXP fooC2(SEXP aR, SEXP bR)
{
int i, j, n = length(aR);
double *a = REAL(aR), *b = REAL(bR);
SEXP Rval = allocVector(REALSXP, n);
for (i = 0; i < n; i++) {
REAL(Rval)[i] = 0;
for (j = 0; j < n; j++)
REAL(Rval)[i] += pow(a[j] + i + 1, b[j]);
}
return Rval;
}
dyn.load("chapter7/fooC2.so") # load the compiled C code
a = seq(0, 1, length = 10)
b = seq(0, 1, length = 10)
.Call("fooC2", a, b)
When R starts it executes the function .First
in ~/.Rprofile
(if it exists).
Similarly, R executes the function .Last
in the same file at the end of any R.
Here’s an example of how to customize your R environment:
print(pi) # prints 3.14159265358979 thanks to options in ~/.Rprofile
.First = function() {
options(prompt = 'R >', digits = 15)
library('ggplot2')
}
.Last = function() {
cat(date(), 'Bye')
}