Chapter preview

Chapter 7

Learning R

R is a programming language that’s especially designed for data analysis and data visualization. In some cases, it’s more convenient to use R than C++ or Java, making R a key data analysis tool. In this chapter, we describe similarities and differences between R and its close relatives: Matlab and Python. We then delve into the R programming language to learn about data types, control flow, interfacing with C++, etc.

Note: The examples below are abridged; the book contains more details.

  1. Getting Started
  2. Executing Shell Commands
  3. Installing Packages
  4. Scope
  5. Input and Output Files
  6. Scalar Data Types
  7. Vectors, Arrays, Lists, and Dataframes
  8. The Iris Dataset
  9. Functions
  10. Conditionals
  11. Loops
  12. Vectorization
  13. Running Compiled Code
  14. Interfacing with C Code
  15. Modern Interfaces to C Code
  16. Customization

Getting Started

Here’s a quick introduction to the R programming language:

# This is a comment
a = 4  # single statement 
a = 4; b = 3; c = b  # multiple statements

# R has a dynamic type system; you can define a variable without declaring its
# type explicitly and you can change the variable's type in the same session:
a = 3.2
a = "foo"
a <- 4  # another way to assign value

# Here are a few different ways to display the value of a variable:
a = 4
print(a)
a  # same thing
cat(a)  # same thing
# cat can print multiple variables one after the other 
cat("The value of the variable a is: ", a)

# Here are some important commands with short explanations:
x = 3  # assign value 3 to variable x
y = 3 * x + 2  # basic variable assignment and arithmetic
ratio.of.x.and.y = x / y  # divide x by y and assign result
ls()  # list variable names in workspace memory
ls(all.names = TRUE)  # list all variables including hidden ones
ls.str()  # print annotated list of variable names
save.image(file = "file-name")  # save all variables to a file
save(x, y, file = "varFile")  # save specified variables
rm(x, y)  # clear variables x and y from memory
rm(list = ls())  # clear all variables in workspace memory
load("varFile")  # load variables from file back to the workspace
cat(x)

Executing Shell Commands

The function system(command) executes the given shell command:

# change directory to home directory
setwd("~")  
# display all files in current directory
dir(path = ".", all.files = TRUE)
# execute bash command ls -al (in Linux)
system("ls -al")  

Installing Packages

R features easy installation of both core R and third party packages:

# install package ggplot2
install.packages("ggplot2")
# install package from a particular mirror site
install.packages("ggplot2", repos = "http://cran.r-project.org")
# install a package from source, rather than binary
install.packages("ggplot2", type = "source")
library(ggplot2)  # bring package into scope
# display all datasets in the package ggplot2
data(package = "ggplot2")
installed.packages()  # display a list of installed packages
update.packages()  # update currently installed packages

Scope

The code below demonstrates masking the built-in constant pi by a global environment variable with the value 3, and retrieving the original value after clearing it:

pi  # a built-in constant whose value is 3.141593
pi = 3  # redefines variable pi
pi  # .GlobalEnv match
rm(pi)  # removes masking variables
pi  # back to 3.141593

Input and Output Files

The following shows how to use files for running commands (input) and output:

input-and-output-files.R

# A sink records output to the given file.
# To print to both the screen and to a file (tee), use the following variation:
sink(file = "outputFile", split = TRUE)

source("chapter7/foo.R")  # executes R code written in a text file
system("cat outputFile")  # prints what was captured by the sink above

foo.R

print("this is a command in foo.R")

Scalar Data Types

Scalar types include numeric, integer, logical, string, dates, and factors:

rm(list = ls())  # clears all variables

a = 3.2; b = 3  # double types
b
typeof(b)  # function returns type of object
c = as.integer(b)  # cast to integer type
c
typeof(c)
c = 3L  # alternative to casting: L specifies integer
d = TRUE
d
e = as.numeric(d)  # casting to numeric
e
f = "this is a string"  # string
f
ls.str()  # show variables and their types

# Factor variables assume values in a predefined set of possible values.
# The code below demonstrates the use of factors in R:
current.season = factor("summer",
          levels = c("summer", "fall", "winter", "spring"),
          ordered = TRUE)  # ordered factor
current.season
levels(current.season)  # display factor levels
my.eye.color = factor(
    "brown",
    levels = c("brown", "blue", "green"),
    ordered = FALSE)  # unordered factor
my.eye.color

Vectors, Arrays, Lists, and Dataframes

Vectors, arrays, lists, and dataframes are collections that hold multiple scalar values:

# c() concatenates arguments to create a vector
x = c(4, 3, 3, 4, 3, 1)  
x
length(x)
2 * x + 1  # element-wise arithmetic
# Boolean vector (default is FALSE)
y = vector(mode = "logical", length = 4)
y
# numeric vector (default is 0)
z = vector(length = 3, mode = "numeric")
z
q = rep(3.2, times = 10) # repeat value multiple times
q
w = seq(0, 1, by = 0.1)  # values in [0,1] in 0.1 increments
w
# 11 evenly spaced numbers between 0 and 1
w = seq(0, 1, length.out = 11)
w
# create an array of booleans reflecting whether condition holds
w <= 0.5
any(w <= 0.5) # is it true for some elements?
all(w <= 0.5) # is it true for all elements?
which(w <= 0.5) # for which elements is it true?
w[w <= 0.5] # extracting from w entries for which w<=0.5
subset(w, w <= 0.5) # an alternative with the subset function
w[w <= 0.5] = 0 # zero out all components smaller or equal to 0.5
w

# Arrays are multidimensional generalizations of vectors
z = seq(1, 20,length.out = 20)  # create a vector 1, 2, ..., 20
x = array(data = z, dim = c(4, 5))  # create a 2-d array
x
x[2, 3]  # refer to the second row and third column
x[2, ]  # refer to the entire second row
x[-1, ]  # all but the first row - same as x[c(2, 3, 4), ]
y = x[c(1, 2), c(1, 2)]  # 2x2 top left sub-matrix
2 * y + 1  # element-wise operation
y %*% y  # matrix product (both arguments are matrices)
# inner product (both vectors have the same dimensions)
x[1, ] %*% x[1, ]
t(x)  # matrix transpose
outer(x[, 1], x[, 1])  # outer product
rbind(x[1, ], x[1, ])  # vertical concatenation
cbind(x[1, ], x[1, ])  # horizontal concatenation

# Multidimensional arrays
m = matrix(c(1, 2, 3, 4), nrow = 2, ncol = 2)
m
m[3]  # counting by columns A[3] = A[1, 2]

# Lists are ordered collections which permit positions to hold variables
# of different types
x = list(name = "John", age = 55, no.children = 2, children.ages = c(15, 18))
names(x)  # displays all position names
x[[2]]  # second element
x[2]  # list containing second element
x$name  # value in list corresponding to name
x["name"]  # same thing
x$children.ages[2]  # same as L[[4]][2]

# Smaller arrays are extended as needed
a = c(1, 2)
b = c(10, 20, 30, 40, 50)
a + b
b[7] = 70
b

# A dataframe is  an ordered sequence of lists sharing the same signature
vecn = c("John Smith", "Jane Doe")
veca = c(42, 45)
vecs = c(50000, 55000)
df = data.frame(name = vecn, age = veca, salary = vecs)
df
names(df) = c("NAME", "AGE", "SALARY")  # modify column names
df

The Iris Dataset

The core R package datasets contains many interesting and demonstrative datasets, such as the iris dataset, whose first four dimensions are numeric measurements describing flower geometry and whose last dimension is a string describing the flower species. The below example explores the iris dataset:

library(datasets)
names(iris)  # lists the dimension (column) names
head(iris, 4)  # show first four rows
iris[1,]  # first row
iris$Sepal.Length[1:10]  # sepal length of first ten samples
# allow replacing iris$Sepal.Length with the shorter Sepal.Length
attach(iris, warn.conflicts = FALSE)
mean(Sepal.Length)  # average of Sepal.Length across all rows
colMeans(iris[, 1:4])  # means of all four numeric columns

# extract all rows whose Sepal.Length variable is less than 5 
# and whose species is not setosa
subset(iris, Sepal.Length < 5 & Species != "setosa")
# count number of rows corresponding to setosa species
dim(subset(iris, Species == "setosa"))[1]

# Provides a statistical summary of the different dataframe columns
summary(iris)

# Read text file into dataframe
# df = read.table("irisFile.txt", header = TRUE)  
# same but from Internet location
# df = read.table("http://www.exampleURL.com/irisFile.txt", header = TRUE)

# The following commands require an editor (e.g., Vim)
# edit(iris)  # examine data as spreadsheet
# iris = edit(iris)  # edit dataframe/variable
# newIris = edit(iris)  # edit dataframe/variable but keep original

Functions

Functions in R are similar to those in C++ and Java. When calling a function, the arguments flow into the function according to their order at the call site:

# foo is a user-defined function
foo = function(x = 1, y = 2, z = 3) {
    return(x + y + z)
}
foo(10, 20, 30)  # parameter bindings by order
foo(y = 20, x = 10, z = 30)  # (potentially) out of order parameter bindings
foo(x = 10, y = 20, z = 30)  # passing 3 parameters
foo(z = 30)  # x and y are missing and are assigned default values
foo(10)  # in-order parameter binding with last two parameters missing

# myPower(.,.) raises the first argument to the power of the second; the former
# is named bas and has a default value of 10; the latter is named pow and has
# a default value of 2
myPower = function(bas = 10, pow = 2) {
    return(bas ^ pow) # raise base to a power
}
myPower(2, 3)  # 2 is bound to bas and 3 to pow (in-order)
myPower(pow = 3, bas = 2)  # same binding as above (out-of-order parameter names)
myPower(bas = 3)  # default value of pow is used

# R passes variables by value: Changing passed arguments inside a function
# does not modify their respective values in the calling environment
x = 2
myPower2 = function(x) { x = x ^ 2; return(x) }
y = myPower2(x)  # does not change x outside the function
x
y

Conditionals

The flow of control of R code is very similar to that of other programming languages. Below are some examples:

a = 10
b = 5
c = 1
if (a < b) {
    d = 1
} else if (a == b) {
    d = 2
} else {
    d = 3
}
d

Loops

Similaryly, below are some examples of loops:

sm = 0
# repeat for 100 iteration, with num taking values 1:100
for (num in seq(1, 100, by = 1)) {
    sm = sm + num
}
sm  # same as sum(1:100)

repeat {
    sm = sm - num
    num = num - 1
    if (sm == 0) break  # if sm == 0 then stop the loop
}
sm

a = 1
b = 10
# continue the loop as long as b > a
while (b > a) {
    sm = sm + 1
    a = a + 1
    b = b - 1
}
sm

Vectorization

It is best to avoid loops when programming in R. There are two reasons for this: simplifying code and computational speed-up. Many mathematical computations on lists, vectors, or arrays may be performed without loops using component-wise arithmetic. The code below demonstrates the computational speedup resulting from replacing a loop with vectorized code:

a = 1:10
c = 0  # compute sum of squares using a for-loop
for (e in a) c = c + e^2
c

sum(a ^ 2) # same operation using vector arithmetic

a = 1:1000000
c = 0
# time comparison with a million elements
system.time(for (e in a) c = c + e ^ 2)
system.time(sum(a ^ 2))

# The sapply function simplifies code but the computational speed-up may not
# apply in the same way as it did above
a = seq(0, 1, length.out = 10)
b = 0
c = 0
for (e in a) {
    b = b + exp(e)
}
b
c = sum(sapply(a, exp))
c
# sapply with an anonymous function f(x) = exp(x ^ 2)
sum(sapply(a, function(x) { return(exp(x ^ 2)) }))
# or simply
sum(sapply(a, function(x) exp(x ^ 2)))

Running Compiled Code

Consider the code below, which compares two implementations of matrix multiplication: The first uses R’s internal matrix multiplication and the second implements it through three nested loops, each containing a scalar multiplication. The first matrix multiplication is faster by several orders of magnitude even for a relatively small n. The key difference is that the built-in matrix multiplication runs compiled C++ code:

n = 100
nsq = n * n
# generate two random matrices
a = matrix(runif(nsq), nrow = n, ncol = n)
b = matrix(runif(nsq), nrow = n, ncol = n)

system.time(a %*% b)  # built-in matrix multiplication

matMult = function(a, b, n) {
    m = matrix(data = 0, nrow = n, ncol = n)
    for (i in 1:n)
      for (j in 1:n)
          for (k in 1:n)
              m[i, j] = m[i, j] + a[i, k] * b[k, j]
    return(m)
   }
system.time(matMult(a, b, n)) # nested loops implementation

Interfacing with C Code

For example, consider the task of computing some function given two vectors a and b of size n. The C code to compute the result appears below (fooC.c); note the presence of the pre-processor directive include <R.h>. You may execute code from fooC.R after running the command R CMD SHLIB fooC.c; the runnable example below runs that command for you:

fooC.c

#include <R.h>
#include <math.h>

void fooC(double* a, double* b, int* n, double* result) {
  int i, j;
  for (i = 0; i < (*n); i++) {
    result[i] = 0;
    for (j = 0; j < (*n); j++)
      result[i] += pow(a[j] + i + 1, b[j]);
  }
}

fooC.R

dyn.load("chapter7/fooC.so")  # load the compiled C code
a = seq(0, 1, length = 10)
b = seq(0, 1, length = 10)
c = rep(0, times = 10)
x = .C("fooC", a, b, as.integer(10), c)  # calls the fooC function
x[[4]]  # the 4th element contains the result

fooR = function(a, b, n) {
    result = rep(0, times = n)
    for (i in 1:n)
        for (j in 1:n)
            result[i] = result[i] + (a[j] + i) ^ b[j]
    return(result)
}
fooR(a, b, 10)

sizes = seq(10, 1000, length = 10)
rTime = rep(0, 10)
cTime = rTime
i = 1
for (n in sizes) {
    a = seq(0, 1, length = n)
    b = seq(0, 1, length = n)
    c = rep(0, times = n)
    cTime[i] = system.time(.C("fooC", a, b, as.integer(n), c))
    rTime[i] = system.time(fooR(a, b, n))
    i = i + 1
}

df = stack(list(C = cTime, R = rTime))
names(df) = c("system.time", "language")
df$size = sizes
# plot run time as a function of array size for R and .C implementations
qplot(x = size,
      y = system.time,
      lty = language, 
	  color = language, 
      data = df,
      size = I(1.5), 
	  geom = "line", 
      xlab = "array size", 
	  ylab = "computation time (sec)")

Modern Interfaces to C Code

The .Call function offers a higher degree of flexibility than .C in terms of passing and returning R arrays or dataframes. The example below shows how to use .Call to implement the same functionality as the .C example above. The type SEXP represents an R object. The REAL macro returns a pointer to the corresponding memory for reading or writing to it. Here’s an example:

fooC2.c

#include <R.h>
#include <Rinternals.h>
#include <Rmath.h>
#include <math.h>

SEXP fooC2(SEXP aR, SEXP  bR)
{ 
  int i, j, n = length(aR);
  double *a = REAL(aR), *b = REAL(bR);
  SEXP Rval = allocVector(REALSXP, n);
  
  for (i = 0; i < n; i++) {
    REAL(Rval)[i] = 0;
    for (j = 0; j < n; j++)
      REAL(Rval)[i] += pow(a[j] + i + 1, b[j]);
  }
  return Rval;
}

fooC2.R

dyn.load("chapter7/fooC2.so")  # load the compiled C code
a = seq(0, 1, length = 10)
b = seq(0, 1, length = 10)
.Call("fooC2", a, b)

Customization

When R starts it executes the function .First in ~/.Rprofile (if it exists). Similarly, R executes the function .Last in the same file at the end of any R. Here’s an example of how to customize your R environment:

pi.R

print(pi)  # prints 3.14159265358979 thanks to options in ~/.Rprofile

.Rprofile

.First = function() {
    options(prompt = 'R >', digits = 15)
    library('ggplot2')
}
.Last = function() {
    cat(date(), 'Bye')
}