Chapter preview

Chapter 8

Visualizing Data in R and Python

Visualizing data is key in effective data analysis: to perform initial investigations, to confirm or refuting data models, and to elucidate mathematical or algorithmic concepts. In this chapter, we explore different types of data graphs using the R programming language, which has excellent graphics functionality; we end the chapter with a description of Python’s matplotlib module — a popular Python tool for data visualization.

Note: The examples below are abridged; the book contains more details.

  1. Graphing Data in R
  2. Datasets
  3. Packages
  4. Strip Plots
  5. Histograms
  6. Line Plots
  7. Kernel Functions
  8. Smoothing Histograms Using Gaussian Kernels
  9. Smoothing Histograms Using qplot
  10. Smoothing Histograms Using ggplot
  11. Scatter Plots
  12. Smoothing Scatter Plots
  13. Facets
  14. All-Pairs Relationships
  15. Contour Plots
  16. Box Plots
  17. qq-Plots
  18. Devices
  19. Data Preparation
  20. Graphing Data in Python
  21. Line Plots
  22. Histograms
  23. Contour Plots
  24. Surface Plots

Graphing Data in R

We focus on two R graphics packages: graphics and ggplot2. To install the latter and to bring it to scope, type the following commands:

install.packages("ggplot2")
library(ggplot2)

Datasets

We use three datasets to explore data graphs:

names(faithful)  # variable names
summary(faithful)  # variable summary
summary(mtcars)
names(ggplot2::mpg)

Packages

The following code displays a scatter plot of the columns of an arbitrary dataframe df containing two variables col1 and col2 using the graphics and ggplot2 packages:

set.seed(1)  # constant seed for reproducibility
df = data.frame(replicate(2, sample(0:100, 100, rep = TRUE)))
names(df) = c("col1", "col2")

# Using the plot function from the graphics package
plot(x = df$col1, y = df$col2, xlab = "x", ylab = "y")
title(main = "Using plot")  # add title

# Using the two main functions in ggplot2: qplot and ggplot
qplot(x = col1, y = col2, data = df, geom = "point") + ggtitle("Using qplot")
ggplot(df, aes(x = col1, y = col2)) + geom_point() + ggtitle("Using ggplot")

Strip Plots

To plot a strip plot using the graphics package, call plot with a single numerical dataframe column:

plot(faithful$eruptions, 
     xlab = "sample number",
     ylab = "eruption times (min)",
     main = "Old Faithful Eruption Times")

Histograms

A histogram divides the range of numeric values into bins and displays the number of data values falling within each bin. The width of a bin influences the level of detail:

# Using the hist function to show Old Faithful's eruption times
hist(faithful$eruptions, 
     breaks = 20, 
     xlab = "eruption times (min)", 
     ylab = "count", 
     main = "Using hist")

# Old Faithful's waiting time to the next eruption using qplot
qplot(x = waiting, 
      data = faithful, 
      binwidth = 3, 
      main = "Using qplot - Waiting Time to Next Eruption (min)")

# The same histogram but using ggplot this time
ggplot(faithful, aes(x = waiting)) + geom_histogram(binwidth = 1) +
      ggtitle("Using ggplot - Waiting Time to Next Eruption (min)")

# Histograms can show density where the height of each bin times its width
# equals the count of samples falling in the bin divided by the total count
ggplot(faithful, aes(x = waiting, y = ..density..)) + 
      geom_histogram(binwidth = 4) +
      ggtitle("Using ggplot to Show Density")

Line Plots

A line plot is a graph displaying a relation between x and y as a line in a Cartesian coordinate system. The relation may correspond to an abstract mathematical function or to relation present between two variables in a specific dataset:

# Using the curve function
sinc = function(x) {
  return(sin(pi * x) / (pi * x))
}
curve(sinc, -7, +7, main = "Using curve")

# Using plot
s = sort.int(mpg$cty, index.return = T)
#  S$x holds the sorted values of city mpg
#  S$ix holds the indices of the sorted values of city mpg
#  First plot the sorted city mpg values with a line plot
plot(s$x,
     main = "Using plot",
     type = "l", 
     lty = 2, 
     xlab = "sample number (sorted by city mpg)",
     ylab = "mpg")
lines(mpg$hwy[S$ix] ,lty = 1)  # add dashed line of hwy mpg
legend("topleft", c("highway mpg", "city mpg"), lty = c(1, 2))

# Using qplot
x = seq(-2, 2, length.out = 30)
y = x ^ 2
qplot(x, y, geom = "line", main = "Using qplot")
qplot(x, y, geom = c("point", "line"))  # multiple geometries can be present

# Using ggplot
x = seq(-2, 2, length.out = 30)
y = x ^ 2
dataframe = data.frame(x = x, y = y)
ggplot(dataframe, aes(x = x, y = y)) + geom_line() + geom_point() +
  ggtitle("Using ggplot")

Kernel Functions

Kernel functions are used for smoothing histograms. The following R code displays the use of four popular kernel choices (Gaussian, triangular, tricube, and uniform):

x_grid = seq(-3, 3, length.out = 100)
gaussian = function(x) dnorm(x)
triangular = function(x) {
  ind = abs(x) > 1
  x = 1 - abs(x)
  x[ind] = 0
  return(x)  
}
tricube = function(x) {
  ind = abs(x) > 1
  x = (1 - abs(x)^3)^3 
  x[ind] = 0
  return(x)
}
uniform = function(x) {
  ind = abs(x) > 1
  x = x * 0 + 1/2
  x[ind] = 0
  return(x)
}
df = stack(list("uniform" = uniform(x_grid),
               "triangular" = triangular(x_grid),
               "gaussian" = gaussian(x_grid),
               "tricube" = tricube(x_grid),
               "uniform" = uniform(x_grid / 2) / 2,
               "triangular" = triangular(x_grid / 2) / 2,
               "gaussian" = gaussian(x_grid / 2) / 2,
               "tricube" = tricube(x_grid / 2) / 2))
head(df)  # first six lines
names(df) = c("kernel.value", "kernel.type")
df$x = x_grid
df$h[1:400] = "$h = 1$"
df$h[401:800] = "$h = 2$"
head(df)  # first six lines
qplot(x,
      kernel.value,
      data = df,
      facets = kernel.type~h,
      geom = "line",
      xlab = "$x$",
      ylab = "$K_h(x)$")

Smoothing Histograms Using Gaussian Kernels

The R code below graphs the smoothed histogram of the data {-1, 0, 0.5, 1, 2, 5, 5.5, 6} using the Gaussian kernel:

data = c(-1, 0, 0.5, 1, 2, 5, 5.5, 6)
data_size = length(data)
x_grid = seq(-3, data_size, length.out = 100)
kernel_values = x_grid %o% rep(1, data_size)
f = x_grid * 0
for (i in 1:data_size) {
  kernel_values[, i] = dnorm(x_grid, data[i], 1/6) / data_size
  f = f + kernel_values[, i]
}
plot(x_grid, f, xlab = "$x$", ylab = "$f_h(x)$", type = "l")
for (i in 1:data_size)
    lines(x_grid, kernel_values[, i] / 2, lty = 2)
title("Smoothed Histogram ($h=1/6$)", font.main = 1)

f = x_grid * 0
for(i in 1:data_size) {
  kernel_values[, i] = dnorm(x_grid, data[i], 1/3) / data_size
  f = f + kernel_values[, i]
}
plot(x_grid, f, xlab = "$x$", ylab = "$f_h(x)$", type = "l")
for (i in 1:data_size)
    lines(x_grid, kernel_values[, i] / 2, lty = 2)
title("Smoothed Histogram ($h=1/3$)", font.main = 1)

f = x_grid * 0
for (i in 1:data_size) {
  kernel_values[, i] = dnorm(x_grid, data[i], 1) / data_size
  f = f + kernel_values[, i]
}
plot(x_grid, f, xlab = "$x$" , ylab = "$f_h(x)$", type = "l")
for (i in 1:data_size)
    lines(x_grid, kernel_values[,i] / 2, lty = 2)
title("Smoothed Histogram ($h=1$)", font.main = 1)

Smoothing Histograms Using qplot

The ggplot2 package incorporates smoothed histogram graphing into qplot. The value of h is controlled via the adjust parameter:

qplot(x = c(2, 3, 6, 7),
      y = ..density..,
      geom = c("density"),
      kernel = "gaussian",
      adjust = 0.05,
      main = "adjust = 0.05",
      xlab = "$x$",
      ylab = "$f_h(x)$",
      xlim = c(0, 9))

# Increasing the value of h causes more overlap
qplot(x = c(2,3,6,7),
      y = ..density..,
      geom = c("density"),
      kernel = "gaussian",
      adjust = 0.2,
      main = "adjust = 0.2",
      xlab = "$x$",
      ylab = "$f_h(x)$",
      xlim = c(0, 9))

# Increasing the value of h further aggregates the four peaks
# into two, each responsible for a corresponding pair of nearby points
qplot(x = c(2, 3, 6, 7),
      y = ..density..,
      geom = c("density"),
      kernel = "gaussian",
      adjust = 0.5,
      main = "adjust = 0.5",
      xlab = "$x$",
      ylab = "$f_h(x)$",
      xlim = c(0, 9))

# Finally, further increasing the value of h results in a histogram
# with very wide bins in which all points fall in the same bin
qplot(x = c(2, 3, 6, 7),
      y = ..density..,
      geom = c("density"),
      kernel = "gaussian",
      adjust = 10,
      main = "adjust = 10",
      xlab = "$x$",
      ylab = "$f_h(x)$",
      xlim = c(0, 9))

Smoothing Histograms Using ggplot

The figure below contrasts a histogram with a smoothed histogram using the ggplot function. To enhance the visualization, we made the histogram semi-transparent using the alpha argument (which takes a value between 0 and 1 indicating the transparency level):

ggplot(faithful, aes(x = waiting, y = ..density..)) +
  geom_histogram(alpha = 0.3, bins = 30) +
  geom_density(size = 1.5, color = "red")

Scatter Plots

A scatter plot graphs the relationships between two numeric variables. It graphs each pair of variables as a point in a two-dimensional space whose coordinates are the corresponding (x, y) values:

# To create a scatter plot with the graphics package,
# call plot with two dataframe columns
plot(faithful$waiting,
     faithful$eruptions,
     main = "Using plot",
     xlab = "waiting time (min)",
     ylab = "eruption time (min)")

# The arguments pch, col, and cex modify the marker's shape, color, and size
plot(faithful$waiting,
     faithful$eruptions,
     pch = 17,
     col = 2,
     cex = 1.2,
     main = "Specifying pch, col, and cex",
     xlab = "waiting times (min)",
     ylab = "eruption time (min)")

# Distinguish the points based on the value of another dataframe column
plot(mtcars$hp,
     mtcars$mpg,
     pch = mtcars$am,
     cex = 1.2,
     main = "MPG vs. HP by Transmission",
     xlab = "horsepower",
     ylab = "miles per gallon")
legend("topright", c("automatic", "manual"), pch = c(0, 1))

# Using qplot
qplot(x = waiting,
      y = eruptions,
      data = faithful,
      main = "Using qplot - Waiting Times (sec) vs. Eruptions (min)")

# The graph below shows a scatter plot of car weights vs mpg
qplot(x = wt, 
      y = mpg, 
      data = mtcars, 
      main = "Using qplot - MPG vs. Weight (x1000 lbs)")

# Denoting the number of cylinders by size using the size argument
qplot(x = wt,
      y = mpg,
      data = mtcars,
      size = cyl,
      main = "Using qplot with Size - MPG vs. Weight (x1000 lbs) by Cylinder")

# Alternatively, color can be used to encode the number of cylinders
qplot(x = wt,
      y = mpg,
      data = mtcars,
      color = cyl,
      main = "Using qplot with Color - MPG vs. Weight (x1000 lbs) by Cylinder")

Smoothing Scatter Plots

We demonstrate smoothing scatter plots with several examples below:

# To adjust h (see smoothing histograms), we modify the value of span
qplot(disp,
      mpg,
      data = mtcars,
      main = "MPG vs Eng. Displacement (span = 0.2)") +
  stat_smooth(method = "loess",
              method.args = list(degree = 0),
              span = 0.2,
              se = FALSE)

# Increasing the value of span results in a less wiggly curve
qplot(disp,
      mpg,
      data = mtcars,
      main = "MPG vs Eng. Displacement (span = 1)") +
  stat_smooth(method = "loess",
              method.args = list(degree = 0),
              span = 1,
              se = FALSE)

# Selecting an even larger h results in a nearly constant line
qplot(disp,
      mpg,
      data = mtcars,
      main = "MPG vs Eng. Displacement (span = 10)") +
  stat_smooth(method = "loess",
              method.args = list(degree = 0),
              span = 10,
              se = FALSE)

# Omitting span reverts to a default value
qplot(disp,
      mpg,
      data = mtcars,
      main = "MPG vs Eng. Displacement (default span)") +
  stat_smooth(method = "loess",
              method.args = list(degree = 0),
              se = FALSE)

Facets

In some cases, we want to examine multiple plots with the same x or y axis in different side-by-side panels; qplot enables this using the facet_wrap() function or the facets argument.

# Visualize the relationship between city mpg and engine displacement
# for different classes of vehicles
ggplot(mpg, aes(displ, cty)) + geom_point() +
  facet_wrap(~class)

# The facet argument is helpful when we want to arrange the sub-plots 
# in a specific order; below, we add new dataframe columns with more appropriate
# names for better labeling
mtcars$amf[mtcars$am==0] = "automatic"
mtcars$amf[mtcars$am==1] = "manual"
mtcars$vsf[mtcars$vs==0] = "flat"
mtcars$vsf[mtcars$vs==1] = "V-shape"

qplot(x = wt,
      y = mpg,
      facets = .~amf,
      data = mtcars,
      main = "MPG vs. Weight by Transmission")

qplot(x = wt,
      y = mpg,
      facets = vsf~.,
      data = mtcars,
      main = "MPG vs. Weight by Engine") +
  stat_smooth(se = FALSE)

qplot(x = wt,
      y = mpg,
      data = mtcars,
      facets = vsf~amf,
      main = "MPG vs. Weight by Transmission and Engine")

All-Pairs Relationships

In other cases, we want to examine all-pairs relationships:

# Explore all-pairs relationships between city mpg, highway mpg,
# and engine displacement volume
df = mpg[, c("cty", "hwy", "displ")]
plot(df, main = "City MPG vs. Highway MPG vs. Engine Displacement Volume")

# Alternatively, we can use the ggpairs function (from the GGallly package);
# it also displays smoothed histograms of all variables in the diagonal panels
# and the correlation coefficients in the upper triangle
GGally::ggpairs(df)

Contour Plots

The most convenient way to graph a two-dimensional function f(x, y) is graphing its equal height contours:

x_grid = seq(-1, 1, length.out = 100)
y_grid = x_grid
# Create a dataframe containing all possible combinations of (x, y)
df = expand.grid(x_grid, y_grid)
dim(df) #  number of rows is 100 x 100 - one for each combination
names(df) = c("x", "y")  # modify column names for better labeling
df$z = df$x ^ 2 + df$y ^ 2
head(df)
ggplot(df, aes(x = x, y = y, z = z)) + stat_contour()

Box Plots

Histograms are very useful for summarizing numeric data in that they show a rough distribution of values. An alternative that is often used in conjunction with histograms is box plots. A box plot is composed of a box, an inner line bisecting the box, whiskers that extend to either side of the box, and outliers:

# Display 0 through 100 percentiles at 0.1 increments
# for the dataset containing 1, 2, 3, 4.
quantile(c(1, 2, 3, 4), seq(0, 1, length.out = 11))

ggplot(mpg, aes("", hwy)) +
  geom_boxplot() +
  coord_flip() +
  scale_x_discrete("") +
  ggtitle("Highway MPG")

# Plot several box plots side-by-side in order to compare data
# corresponding to different values of a factor variable
ggplot(mpg, aes(reorder(class, -hwy, median), hwy)) +
  geom_boxplot() +
  coord_flip() +
  scale_x_discrete("class") +
  ggtitle("Highway MPG by Class")


qq-Plots

Quantile-quantile plots, also known as qq-plots, are useful for comparing two datasets, one of which may be sampled from a certain distribution. They are essentially scatter plots of the quantiles of one dataset vs. the quantiles of another:

set.seed(1)  # constant seed for reproducibility
df = data.frame(samples = c(rnorm(200, 1, 1),
                            rnorm(200, 0, 1),
                            rnorm(200, 0, 2)))
df$parameter[1:200]   = "$\\mathcal{N}(1, 1)$"
df$parameter[201:400] = "$\\mathcal{N}(0, 1)$"
df$parameter[401:600] = "$\\mathcal{N}(0, 2)$"
qplot(samples,
      bins = 30,
      facets = parameter~.,
      geom = "histogram",
      main = "Datasets",
      data = df)

# We compute below the qq-plots of these three datasets (y axis)
# vs. a sample drawn from the N(0, 1) distribution (x axis)
ggplot(df, aes(sample = samples)) +
  stat_qq() +
  facet_grid(.~parameter) +
  ggtitle("Quantile-Quantile of Datasets")

# Compare two distributions and show how the $t$-distribution has tails that
# are heavier than the Gaussian N(0, 1) distribution
x_grid = seq(-6, 6, length.out = 200)
df = data.frame(density = dnorm(x_grid, 0, 1))
df$tdensity = dt(x_grid, 1.5)
df$x = x_grid
ggplot(df, aes(x = x, y = density)) +
  geom_area(fill = I("grey")) +
  geom_line(aes(x = x, y = tdensity)) +
  ggtitle("$\\mathcal{N}(0, 1)$ (shaded) and $t$-distribution with 1.5 Degrees of Freedom")

df$samples = rnorm(200, 0, 1)
pm = list(df = 1.5)
ggplot(df, aes(sample = samples)) +
  stat_qq(distribution = qt, dparams = pm) +
  ggtitle("Quantile-Quantile of the Two Distributions Above")

Devices

The following examples are useful when saving plots to disk:

# Save the active graphics window to a file
ggsave(file = "plot.pdf")

# Save all future graphics to file myplots.pdf
pdf("plot", height = 5, width = 5, pointsize = 11)
# Graphics plotting
qplot(...)
qplot(...)
# Close graphics file and return to display driver
dev.off()

Data Preparation

We emphasize in this book graphing data by first creating a dataframe with the appropriate data and informative column names; this approach is better than keeping the data in arrays. To create a dataframe, use the data.frame function, for example:

vec1 = c("foo", "bar")
vec2 = c(31, 42)
vec3 = c(1, 2)
df = data.frame(name = vec1, ages = vec2, salary = vec3)
df
# You can change column names using the names function
names(df) = c("last.name", "age", "annual.income")
df

x = list(a = c(1, 2), b = c(3, 4), c = c(5, 6))
x
stack(x)  # this dataframe above is ready for graphing

# Plot normal density for different sigma values
x_grid = seq(-8, 8, length.out = 100)
gaussian_function =
    function(x, s) exp(-x ^ 2 / (2 * s ^ 2)) / (sqrt(2 * pi) * s)
df = stack(list("$\\sigma = 1$" = gaussian_function(x_grid, 1),
                "$\\sigma = 2$" = gaussian_function(x_grid, 2),
                "$\\sigma = 3$" = gaussian_function(x_grid, 3),
                "$\\sigma = 4$" = gaussian_function(x_grid, 4)))

names(df) = c("y", "sigma");
df$x = x_grid
head(df) 
qplot(x,
      y,
      color = sigma,
      lty = sigma,
      geom = "line",
      data = df,
      main = "Normal Density for Different $\\sigma$ Values",
      xlab = "$x$",
      ylab = "$f(x)$")

Graphing Data in Python

Despite the fact that R has excellent graphics capabilities, it’s sometimes desirable to graph data in another programming language. For example, when the entire data analysis process is in Python, it may make sense to graph the data within Python as well. Below and in the rest of this chapter, we explore matplotlib — a popular Python module for graphing data.

import matplotlib.pyplot as plt

# Working with figures in matplotlib
f1 = plt.figure()  # open a figure
f2 = plt.figure()  # open a second figure
plt.savefig('f2.pdf')  # save fig 2 - the active figure
plt.close(f2)
plt.savefig('f1.pdf')  # save fig 2 - the active figure
plt.close(f1)

Line Plots

The code below displays three line plots: linear (black solid line), quadratic (black dashed line), and cubic (black dotted line).

import matplotlib.pyplot as plt
from numpy import array

x_grid = array(range(1, 100)) / 30.0
plt.figure()  # open a figure
plt.plot(x_grid, x_grid, "k-")  # draws a solid line
plt.plot(x_grid, x_grid ** 2, "k--")  # draws a dashed line
plt.plot(x_grid, x_grid ** 3, "k.")  # draws a dotted line
plt.xlabel(r"$x$")
plt.ylabel(r"$y=f(x)$")
plt.title("Linear, Quadratic, and Cubic Growth")
plt.show()

Histograms

The function hist(x, n) creates a histogram of the data in x using n bins:

import matplotlib.pyplot as plt
from numpy.random import randn

plt.hist(randn(10000), 50)
plt.xlabel(r"$x$")
plt.ylabel(r"$count$")
plt.title(r"Histogram of 10000 Gaussian $\mathcal{N}(0, 1)$ Samples")
plt.xlim([-4, 4])
plt.show()

Contour Plots

The function contourf displays z = f(x, y) as a function of x, y, quantizing the z values into several constant values. The code below shows how to use contourf to display a contour plot of z = 3 * x ^ 2 + 5 * y ^ 2:

import matplotlib.pyplot as plt
import numpy as np

def f(x, y):
    return 3 * x ** 2 + 5 * y ** 2


x_grid = np.linspace(-2, 2, 100)
y_grid = np.linspace(-2, 2, 100)
# Create two ndarrays xx, yy containing x and y coordinates
xx, yy = np.meshgrid(x_grid, y_grid)
zz = f(xx, yy)
# Draw contour graph with 6 levels using gray colormap
#   (white=high, black=low)
plt.contourf(xx,
             yy,
             zz,
             6,
             cmap="gray")
# Add black lines to highlight contours levels
plt.contour(xx,
            yy,
            zz,
            6,
            colors = "black",
            linewidth = .5)
plt.show()

Surface Plots

The function plot_surface is similar to contourf except that it displays a 3D surface plot; the code below shows how to use it:

import matplotlib.pyplot as plt
import numpy as np
from mpl_toolkits.mplot3d import Axes3D

def f(x, y):
    return 3 * x ** 2 + 5 * y ** 2


x_grid = np.linspace(-2, 2, 30)
y_grid = np.linspace(-2, 2, 30)
xx, yy = np.meshgrid(x_grid, y_grid)
zz = f(xx, yy)
surf_figure = plt.figure()
figure_axes = Axes3D(surf_figure)
figure_axes.plot_surface(xx, yy, zz, rstride=1, cstride=1, cmap="Blues")
plt.show()