Visualizing data is key in effective data analysis: to perform initial investigations, to confirm or refuting data models, and to elucidate mathematical or algorithmic concepts. In this chapter, we explore different types of data graphs using the R programming language, which has excellent graphics functionality; we end the chapter with a description of Python’s matplotlib module — a popular Python tool for data visualization.
Note: The examples below are abridged; the book contains more details.
We focus on two R graphics packages: graphics and ggplot2. To install the latter and to bring it to scope, type the following commands:
install.packages("ggplot2")
library(ggplot2)
We use three datasets to explore data graphs:
names(faithful) # variable names
summary(faithful) # variable summary
summary(mtcars)
names(ggplot2::mpg)
The following code displays a scatter plot of the columns of an arbitrary
dataframe df
containing two variables col1
and col2
using
the graphics
and ggplot2
packages:
set.seed(1) # constant seed for reproducibility
df = data.frame(replicate(2, sample(0:100, 100, rep = TRUE)))
names(df) = c("col1", "col2")
# Using the plot function from the graphics package
plot(x = df$col1, y = df$col2, xlab = "x", ylab = "y")
title(main = "Using plot") # add title
# Using the two main functions in ggplot2: qplot and ggplot
qplot(x = col1, y = col2, data = df, geom = "point") + ggtitle("Using qplot")
ggplot(df, aes(x = col1, y = col2)) + geom_point() + ggtitle("Using ggplot")
To plot a strip plot using the graphics package, call plot
with a single
numerical dataframe column:
plot(faithful$eruptions,
xlab = "sample number",
ylab = "eruption times (min)",
main = "Old Faithful Eruption Times")
A histogram divides the range of numeric values into bins and displays the number of data values falling within each bin. The width of a bin influences the level of detail:
# Using the hist function to show Old Faithful's eruption times
hist(faithful$eruptions,
breaks = 20,
xlab = "eruption times (min)",
ylab = "count",
main = "Using hist")
# Old Faithful's waiting time to the next eruption using qplot
qplot(x = waiting,
data = faithful,
binwidth = 3,
main = "Using qplot - Waiting Time to Next Eruption (min)")
# The same histogram but using ggplot this time
ggplot(faithful, aes(x = waiting)) + geom_histogram(binwidth = 1) +
ggtitle("Using ggplot - Waiting Time to Next Eruption (min)")
# Histograms can show density where the height of each bin times its width
# equals the count of samples falling in the bin divided by the total count
ggplot(faithful, aes(x = waiting, y = ..density..)) +
geom_histogram(binwidth = 4) +
ggtitle("Using ggplot to Show Density")
A line plot is a graph displaying a relation between x
and y
as a line
in a Cartesian coordinate system. The relation may correspond to an abstract
mathematical function or to relation present between two variables
in a specific dataset:
# Using the curve function
sinc = function(x) {
return(sin(pi * x) / (pi * x))
}
curve(sinc, -7, +7, main = "Using curve")
# Using plot
s = sort.int(mpg$cty, index.return = T)
# S$x holds the sorted values of city mpg
# S$ix holds the indices of the sorted values of city mpg
# First plot the sorted city mpg values with a line plot
plot(s$x,
main = "Using plot",
type = "l",
lty = 2,
xlab = "sample number (sorted by city mpg)",
ylab = "mpg")
lines(mpg$hwy[S$ix] ,lty = 1) # add dashed line of hwy mpg
legend("topleft", c("highway mpg", "city mpg"), lty = c(1, 2))
# Using qplot
x = seq(-2, 2, length.out = 30)
y = x ^ 2
qplot(x, y, geom = "line", main = "Using qplot")
qplot(x, y, geom = c("point", "line")) # multiple geometries can be present
# Using ggplot
x = seq(-2, 2, length.out = 30)
y = x ^ 2
dataframe = data.frame(x = x, y = y)
ggplot(dataframe, aes(x = x, y = y)) + geom_line() + geom_point() +
ggtitle("Using ggplot")
Kernel functions are used for smoothing histograms. The following R code displays the use of four popular kernel choices (Gaussian, triangular, tricube, and uniform):
x_grid = seq(-3, 3, length.out = 100)
gaussian = function(x) dnorm(x)
triangular = function(x) {
ind = abs(x) > 1
x = 1 - abs(x)
x[ind] = 0
return(x)
}
tricube = function(x) {
ind = abs(x) > 1
x = (1 - abs(x)^3)^3
x[ind] = 0
return(x)
}
uniform = function(x) {
ind = abs(x) > 1
x = x * 0 + 1/2
x[ind] = 0
return(x)
}
df = stack(list("uniform" = uniform(x_grid),
"triangular" = triangular(x_grid),
"gaussian" = gaussian(x_grid),
"tricube" = tricube(x_grid),
"uniform" = uniform(x_grid / 2) / 2,
"triangular" = triangular(x_grid / 2) / 2,
"gaussian" = gaussian(x_grid / 2) / 2,
"tricube" = tricube(x_grid / 2) / 2))
head(df) # first six lines
names(df) = c("kernel.value", "kernel.type")
df$x = x_grid
df$h[1:400] = "$h = 1$"
df$h[401:800] = "$h = 2$"
head(df) # first six lines
qplot(x,
kernel.value,
data = df,
facets = kernel.type~h,
geom = "line",
xlab = "$x$",
ylab = "$K_h(x)$")
The R code below graphs the smoothed histogram of the data
{-1, 0, 0.5, 1, 2, 5, 5.5, 6}
using the Gaussian kernel:
data = c(-1, 0, 0.5, 1, 2, 5, 5.5, 6)
data_size = length(data)
x_grid = seq(-3, data_size, length.out = 100)
kernel_values = x_grid %o% rep(1, data_size)
f = x_grid * 0
for (i in 1:data_size) {
kernel_values[, i] = dnorm(x_grid, data[i], 1/6) / data_size
f = f + kernel_values[, i]
}
plot(x_grid, f, xlab = "$x$", ylab = "$f_h(x)$", type = "l")
for (i in 1:data_size)
lines(x_grid, kernel_values[, i] / 2, lty = 2)
title("Smoothed Histogram ($h=1/6$)", font.main = 1)
f = x_grid * 0
for(i in 1:data_size) {
kernel_values[, i] = dnorm(x_grid, data[i], 1/3) / data_size
f = f + kernel_values[, i]
}
plot(x_grid, f, xlab = "$x$", ylab = "$f_h(x)$", type = "l")
for (i in 1:data_size)
lines(x_grid, kernel_values[, i] / 2, lty = 2)
title("Smoothed Histogram ($h=1/3$)", font.main = 1)
f = x_grid * 0
for (i in 1:data_size) {
kernel_values[, i] = dnorm(x_grid, data[i], 1) / data_size
f = f + kernel_values[, i]
}
plot(x_grid, f, xlab = "$x$" , ylab = "$f_h(x)$", type = "l")
for (i in 1:data_size)
lines(x_grid, kernel_values[,i] / 2, lty = 2)
title("Smoothed Histogram ($h=1$)", font.main = 1)
The ggplot2
package incorporates smoothed histogram graphing into qplot
.
The value of h
is controlled via the adjust
parameter:
qplot(x = c(2, 3, 6, 7),
y = ..density..,
geom = c("density"),
kernel = "gaussian",
adjust = 0.05,
main = "adjust = 0.05",
xlab = "$x$",
ylab = "$f_h(x)$",
xlim = c(0, 9))
# Increasing the value of h causes more overlap
qplot(x = c(2,3,6,7),
y = ..density..,
geom = c("density"),
kernel = "gaussian",
adjust = 0.2,
main = "adjust = 0.2",
xlab = "$x$",
ylab = "$f_h(x)$",
xlim = c(0, 9))
# Increasing the value of h further aggregates the four peaks
# into two, each responsible for a corresponding pair of nearby points
qplot(x = c(2, 3, 6, 7),
y = ..density..,
geom = c("density"),
kernel = "gaussian",
adjust = 0.5,
main = "adjust = 0.5",
xlab = "$x$",
ylab = "$f_h(x)$",
xlim = c(0, 9))
# Finally, further increasing the value of h results in a histogram
# with very wide bins in which all points fall in the same bin
qplot(x = c(2, 3, 6, 7),
y = ..density..,
geom = c("density"),
kernel = "gaussian",
adjust = 10,
main = "adjust = 10",
xlab = "$x$",
ylab = "$f_h(x)$",
xlim = c(0, 9))
The figure below contrasts a histogram with a smoothed histogram using
the ggplot
function. To enhance the visualization, we made the histogram
semi-transparent using the alpha
argument (which takes a value between 0 and 1
indicating the transparency level):
ggplot(faithful, aes(x = waiting, y = ..density..)) +
geom_histogram(alpha = 0.3, bins = 30) +
geom_density(size = 1.5, color = "red")
A scatter plot graphs the relationships between two numeric variables.
It graphs each pair of variables as a point in a two-dimensional space whose
coordinates are the corresponding (x, y)
values:
# To create a scatter plot with the graphics package,
# call plot with two dataframe columns
plot(faithful$waiting,
faithful$eruptions,
main = "Using plot",
xlab = "waiting time (min)",
ylab = "eruption time (min)")
# The arguments pch, col, and cex modify the marker's shape, color, and size
plot(faithful$waiting,
faithful$eruptions,
pch = 17,
col = 2,
cex = 1.2,
main = "Specifying pch, col, and cex",
xlab = "waiting times (min)",
ylab = "eruption time (min)")
# Distinguish the points based on the value of another dataframe column
plot(mtcars$hp,
mtcars$mpg,
pch = mtcars$am,
cex = 1.2,
main = "MPG vs. HP by Transmission",
xlab = "horsepower",
ylab = "miles per gallon")
legend("topright", c("automatic", "manual"), pch = c(0, 1))
# Using qplot
qplot(x = waiting,
y = eruptions,
data = faithful,
main = "Using qplot - Waiting Times (sec) vs. Eruptions (min)")
# The graph below shows a scatter plot of car weights vs mpg
qplot(x = wt,
y = mpg,
data = mtcars,
main = "Using qplot - MPG vs. Weight (x1000 lbs)")
# Denoting the number of cylinders by size using the size argument
qplot(x = wt,
y = mpg,
data = mtcars,
size = cyl,
main = "Using qplot with Size - MPG vs. Weight (x1000 lbs) by Cylinder")
# Alternatively, color can be used to encode the number of cylinders
qplot(x = wt,
y = mpg,
data = mtcars,
color = cyl,
main = "Using qplot with Color - MPG vs. Weight (x1000 lbs) by Cylinder")
We demonstrate smoothing scatter plots with several examples below:
# To adjust h (see smoothing histograms), we modify the value of span
qplot(disp,
mpg,
data = mtcars,
main = "MPG vs Eng. Displacement (span = 0.2)") +
stat_smooth(method = "loess",
method.args = list(degree = 0),
span = 0.2,
se = FALSE)
# Increasing the value of span results in a less wiggly curve
qplot(disp,
mpg,
data = mtcars,
main = "MPG vs Eng. Displacement (span = 1)") +
stat_smooth(method = "loess",
method.args = list(degree = 0),
span = 1,
se = FALSE)
# Selecting an even larger h results in a nearly constant line
qplot(disp,
mpg,
data = mtcars,
main = "MPG vs Eng. Displacement (span = 10)") +
stat_smooth(method = "loess",
method.args = list(degree = 0),
span = 10,
se = FALSE)
# Omitting span reverts to a default value
qplot(disp,
mpg,
data = mtcars,
main = "MPG vs Eng. Displacement (default span)") +
stat_smooth(method = "loess",
method.args = list(degree = 0),
se = FALSE)
In some cases, we want to examine multiple plots with the same x
or y
axis
in different side-by-side panels; qplot
enables this using the facet_wrap()
function or the facets
argument.
# Visualize the relationship between city mpg and engine displacement
# for different classes of vehicles
ggplot(mpg, aes(displ, cty)) + geom_point() +
facet_wrap(~class)
# The facet argument is helpful when we want to arrange the sub-plots
# in a specific order; below, we add new dataframe columns with more appropriate
# names for better labeling
mtcars$amf[mtcars$am==0] = "automatic"
mtcars$amf[mtcars$am==1] = "manual"
mtcars$vsf[mtcars$vs==0] = "flat"
mtcars$vsf[mtcars$vs==1] = "V-shape"
qplot(x = wt,
y = mpg,
facets = .~amf,
data = mtcars,
main = "MPG vs. Weight by Transmission")
qplot(x = wt,
y = mpg,
facets = vsf~.,
data = mtcars,
main = "MPG vs. Weight by Engine") +
stat_smooth(se = FALSE)
qplot(x = wt,
y = mpg,
data = mtcars,
facets = vsf~amf,
main = "MPG vs. Weight by Transmission and Engine")
In other cases, we want to examine all-pairs relationships:
# Explore all-pairs relationships between city mpg, highway mpg,
# and engine displacement volume
df = mpg[, c("cty", "hwy", "displ")]
plot(df, main = "City MPG vs. Highway MPG vs. Engine Displacement Volume")
# Alternatively, we can use the ggpairs function (from the GGallly package);
# it also displays smoothed histograms of all variables in the diagonal panels
# and the correlation coefficients in the upper triangle
GGally::ggpairs(df)
The most convenient way to graph a two-dimensional function f(x, y)
is
graphing its equal height contours:
x_grid = seq(-1, 1, length.out = 100)
y_grid = x_grid
# Create a dataframe containing all possible combinations of (x, y)
df = expand.grid(x_grid, y_grid)
dim(df) # number of rows is 100 x 100 - one for each combination
names(df) = c("x", "y") # modify column names for better labeling
df$z = df$x ^ 2 + df$y ^ 2
head(df)
ggplot(df, aes(x = x, y = y, z = z)) + stat_contour()
Histograms are very useful for summarizing numeric data in that they show a rough distribution of values. An alternative that is often used in conjunction with histograms is box plots. A box plot is composed of a box, an inner line bisecting the box, whiskers that extend to either side of the box, and outliers:
# Display 0 through 100 percentiles at 0.1 increments
# for the dataset containing 1, 2, 3, 4.
quantile(c(1, 2, 3, 4), seq(0, 1, length.out = 11))
ggplot(mpg, aes("", hwy)) +
geom_boxplot() +
coord_flip() +
scale_x_discrete("") +
ggtitle("Highway MPG")
# Plot several box plots side-by-side in order to compare data
# corresponding to different values of a factor variable
ggplot(mpg, aes(reorder(class, -hwy, median), hwy)) +
geom_boxplot() +
coord_flip() +
scale_x_discrete("class") +
ggtitle("Highway MPG by Class")
Quantile-quantile plots, also known as qq-plots, are useful for comparing two datasets, one of which may be sampled from a certain distribution. They are essentially scatter plots of the quantiles of one dataset vs. the quantiles of another:
set.seed(1) # constant seed for reproducibility
df = data.frame(samples = c(rnorm(200, 1, 1),
rnorm(200, 0, 1),
rnorm(200, 0, 2)))
df$parameter[1:200] = "$\\mathcal{N}(1, 1)$"
df$parameter[201:400] = "$\\mathcal{N}(0, 1)$"
df$parameter[401:600] = "$\\mathcal{N}(0, 2)$"
qplot(samples,
bins = 30,
facets = parameter~.,
geom = "histogram",
main = "Datasets",
data = df)
# We compute below the qq-plots of these three datasets (y axis)
# vs. a sample drawn from the N(0, 1) distribution (x axis)
ggplot(df, aes(sample = samples)) +
stat_qq() +
facet_grid(.~parameter) +
ggtitle("Quantile-Quantile of Datasets")
# Compare two distributions and show how the $t$-distribution has tails that
# are heavier than the Gaussian N(0, 1) distribution
x_grid = seq(-6, 6, length.out = 200)
df = data.frame(density = dnorm(x_grid, 0, 1))
df$tdensity = dt(x_grid, 1.5)
df$x = x_grid
ggplot(df, aes(x = x, y = density)) +
geom_area(fill = I("grey")) +
geom_line(aes(x = x, y = tdensity)) +
ggtitle("$\\mathcal{N}(0, 1)$ (shaded) and $t$-distribution with 1.5 Degrees of Freedom")
df$samples = rnorm(200, 0, 1)
pm = list(df = 1.5)
ggplot(df, aes(sample = samples)) +
stat_qq(distribution = qt, dparams = pm) +
ggtitle("Quantile-Quantile of the Two Distributions Above")
The following examples are useful when saving plots to disk:
# Save the active graphics window to a file
ggsave(file = "plot.pdf")
# Save all future graphics to file myplots.pdf
pdf("plot", height = 5, width = 5, pointsize = 11)
# Graphics plotting
qplot(...)
qplot(...)
# Close graphics file and return to display driver
dev.off()
We emphasize in this book graphing data by first creating a dataframe with
the appropriate data and informative column names; this approach is better than
keeping the data in arrays. To create a dataframe, use the data.frame
function, for example:
vec1 = c("foo", "bar")
vec2 = c(31, 42)
vec3 = c(1, 2)
df = data.frame(name = vec1, ages = vec2, salary = vec3)
df
# You can change column names using the names function
names(df) = c("last.name", "age", "annual.income")
df
x = list(a = c(1, 2), b = c(3, 4), c = c(5, 6))
x
stack(x) # this dataframe above is ready for graphing
# Plot normal density for different sigma values
x_grid = seq(-8, 8, length.out = 100)
gaussian_function =
function(x, s) exp(-x ^ 2 / (2 * s ^ 2)) / (sqrt(2 * pi) * s)
df = stack(list("$\\sigma = 1$" = gaussian_function(x_grid, 1),
"$\\sigma = 2$" = gaussian_function(x_grid, 2),
"$\\sigma = 3$" = gaussian_function(x_grid, 3),
"$\\sigma = 4$" = gaussian_function(x_grid, 4)))
names(df) = c("y", "sigma");
df$x = x_grid
head(df)
qplot(x,
y,
color = sigma,
lty = sigma,
geom = "line",
data = df,
main = "Normal Density for Different $\\sigma$ Values",
xlab = "$x$",
ylab = "$f(x)$")
Despite the fact that R has excellent graphics capabilities, it’s sometimes
desirable to graph data in another programming language. For example, when
the entire data analysis process is in Python, it may make sense to graph
the data within Python as well. Below and in the rest of this chapter, we
explore matplotlib
— a popular Python module for graphing data.
import matplotlib.pyplot as plt
# Working with figures in matplotlib
f1 = plt.figure() # open a figure
f2 = plt.figure() # open a second figure
plt.savefig('f2.pdf') # save fig 2 - the active figure
plt.close(f2)
plt.savefig('f1.pdf') # save fig 2 - the active figure
plt.close(f1)
The code below displays three line plots: linear (black solid line), quadratic (black dashed line), and cubic (black dotted line).
import matplotlib.pyplot as plt
from numpy import array
x_grid = array(range(1, 100)) / 30.0
plt.figure() # open a figure
plt.plot(x_grid, x_grid, "k-") # draws a solid line
plt.plot(x_grid, x_grid ** 2, "k--") # draws a dashed line
plt.plot(x_grid, x_grid ** 3, "k.") # draws a dotted line
plt.xlabel(r"$x$")
plt.ylabel(r"$y=f(x)$")
plt.title("Linear, Quadratic, and Cubic Growth")
plt.show()
The function hist(x, n)
creates a histogram of the data in x
using n
bins:
import matplotlib.pyplot as plt
from numpy.random import randn
plt.hist(randn(10000), 50)
plt.xlabel(r"$x$")
plt.ylabel(r"$count$")
plt.title(r"Histogram of 10000 Gaussian $\mathcal{N}(0, 1)$ Samples")
plt.xlim([-4, 4])
plt.show()
The function contourf
displays z = f(x, y)
as a function of x, y
,
quantizing the z
values into several constant values. The code below shows
how to use contourf
to display a contour plot of z = 3 * x ^ 2 + 5 * y ^ 2
:
import matplotlib.pyplot as plt
import numpy as np
def f(x, y):
return 3 * x ** 2 + 5 * y ** 2
x_grid = np.linspace(-2, 2, 100)
y_grid = np.linspace(-2, 2, 100)
# Create two ndarrays xx, yy containing x and y coordinates
xx, yy = np.meshgrid(x_grid, y_grid)
zz = f(xx, yy)
# Draw contour graph with 6 levels using gray colormap
# (white=high, black=low)
plt.contourf(xx,
yy,
zz,
6,
cmap="gray")
# Add black lines to highlight contours levels
plt.contour(xx,
yy,
zz,
6,
colors = "black",
linewidth = .5)
plt.show()
The function plot_surface
is similar to contourf
except that it displays
a 3D surface plot; the code below shows how to use it:
import matplotlib.pyplot as plt
import numpy as np
from mpl_toolkits.mplot3d import Axes3D
def f(x, y):
return 3 * x ** 2 + 5 * y ** 2
x_grid = np.linspace(-2, 2, 30)
y_grid = np.linspace(-2, 2, 30)
xx, yy = np.meshgrid(x_grid, y_grid)
zz = f(xx, yy)
surf_figure = plt.figure()
figure_axes = Axes3D(surf_figure)
figure_axes.plot_surface(xx, yy, zz, rstride=1, cstride=1, cmap="Blues")
plt.show()