## EDA

*May 11, 2018*

# Motivation

*In my graduate program we had an excellent text on Regression Methods in Biostatistics (which I will refer to as **VGSM**). Unfortunately, all of the examples and code from the courses, labs, and text were in Stata. As an R user, I kept having to translate the topics from Stata into R. But this actually turned into a valuable exercise because it forced me to separate the underlying statistical concepts from the programming languages.*

*I've decided to post various parts of the text/topics because I'm sure there will be future R users in the same boat :)*

## Exploratory Data Analysis (and a quick story)

"Exploratory data analysis can never be the whole story, but nothing else can serve as the foundation stone--as the first step." - John W. Tukey, Exploratory Data Analysis, 1977

I think the **"first step"** is the most important part point of exploratory data analysis (EDA)--it should precede any modeling, inferences, or fancy machine learning algorithm. Without looking at the predictors and outcomes in a graph, modeling data can be a waste of time or misleading.

The first homework problem in one of my biostatistics courses came with a data set and a handful of questions about modeling. There was no formal instruction on how (or even if) to perform any visual inspection of the data before attempting to answer the questions about modeling, interactions, etc. I'd worked as an analyst long enough to know I needed to look at the data before attempting to figure out any of the homework questions. After some basic histograms and box-plots, I found a value that was physically impossible (a body-mass-index of 600+). I documented this discovery in my write-up, then ran the model with and without this data point, and explained my thought process.

I don't recall what the lesson was on (or if I even chose the right model), but I *do* remember the feeling of pride when I received praise for visually inspecting the data before trying to fit a model or make inferences.

I'll be the first to admit, I'm not great at remembering tests, assumptions, transformations, or even how to write equations on a whiteboard. I'm OK at math, but definitely not exceptional at it. But *I love* visualizing all kinds of data, and that passion (and a lot of general curiosity) has served me well in solving a variety of data-related problems.

## Essential graphs for EDA

The chapter that addresses exploring data in VGSM is chapter 2 "Exploratory and Descriptive Methods". The authors recommend three graphs for numerical data: box-plots, histograms, and Q-Q plots.

## The Data

The data we will be using for this post comes from the "Western Collaborative Group Study (WstClbGrpStdy)". You can read more about it here. These data are available in the native `wcgs`

data frame, but I'll be loading the version that came with my text because it has a few additional variables.

I also name this imported file `WstClbGrpStdy`

because I like to capitalize and disemvowel (yes, that **is** a thing, thank you, James Joyce) names of data frames to distinguish them from other R objects (vectors, lists, etc.).

These data come from an epidemiological study of behavioral attributes and coronary heart disease. The findings were published in 1964, and ironically there isn't a single graph used to display these data in their manuscript. It's hard to know how familiar the authors were with EDA (John Tukey didn't published Exploratory Data Analysis until 1977), so we can't really hold the lack of graphs and figures against them. The code to import this data set is below:

```
WstClbGrpStdy <- readxl::read_xls("Data/wcgs.xls")
```

## Words and their meanings

Scientific jargon and language isn't just a way to make people avoid you at parties--it also exists so we can have precise and unambiguous terms for certain ideas. I try to choose my language carefully when describing a process, function, object, etc. so I don't confuse anyone who reads what I'm writing (and that person is usually me!).

The tidyverse makes it easier to have some specificity and precision with the tools you use in R. All `tidyverse`

packages and functions have a unified language and well defined terms/ideas.

A simple example of this is `utils::str()`

vs. `dplyr::glimpse()`

. Viewing the shape and structure of an object is a common starting point, but it can also turn into a place of confusion for newcomers to the R language. The `str()`

function gives a lot of information (and jargon) that isn't always necessary to start exploring your data.

I prefer using `dplyr::glimpse()`

because it 1) puts more of the data on the screen than `str()`

, 2) it displays the format for each variable in the data set in a consistent way, and 3) it omits the additional information that occasionally gets printed with `str()`

.

See the image below for an example:

I add `78`

to ensure everything prints nicely to the screen width.

```
glimpse(WstClbGrpStdy, 78)
## Observations: 3,154
## Variables: 22
## $ age <dbl> 50, 51, 59, 51, 44, 47, 40, 41, 50, 43, 59, 54, 48, 39, ...
## $ arcus <dbl> 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,...
## $ behpat <chr> "A1", "A1", "A1", "A1", "A1", "A1", "A1", "A1", "A1", "A...
## $ bmi <dbl> 31.32, 25.33, 28.69, 22.15, 22.31, 27.12, 23.24, 22.96, ...
## $ chd69 <chr> "No", "No", "No", "No", "No", "No", "No", "No", "No", "N...
## $ chol <dbl> 249, 194, 258, 173, 214, 206, 190, 212, 130, 233, 181, 2...
## $ dbp <dbl> 90, 74, 94, 80, 80, 76, 78, 84, 70, 80, 86, 76, 78, 74, ...
## $ dibpat <chr> "Type A", "Type A", "Type A", "Type A", "Type A", "Type ...
## $ height <dbl> 67, 73, 70, 69, 71, 64, 70, 70, 71, 68, 72, 67, 71, 70, ...
## $ id <dbl> 2343, 3656, 3526, 22057, 12927, 16029, 3894, 11389, 1268...
## $ lnsbp <dbl> 4.883, 4.787, 5.063, 4.836, 4.836, 4.754, 4.804, 4.868, ...
## $ lnwght <dbl> 5.298, 5.257, 5.298, 5.011, 5.075, 5.063, 5.088, 5.075, ...
## $ ncigs <dbl> 25, 25, 0, 0, 0, 80, 0, 25, 0, 25, 10, 0, 20, 0, 4, 0, 0...
## $ sbp <dbl> 132, 120, 158, 126, 126, 116, 122, 130, 112, 120, 130, 1...
## $ smoke <chr> "Yes", "Yes", "No", "No", "No", "Yes", "No", "Yes", "No"...
## $ t1 <dbl> -1.6334, -4.0634, 0.6397, 1.1218, 2.4250, -0.7875, -0.60...
## $ time169 <dbl> 1367, 2991, 2960, 3069, 3081, 2114, 2929, 3010, 3104, 28...
## $ typchd69 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,...
## $ uni <dbl> 0.48607, 0.18595, 0.72780, 0.62446, 0.37898, 0.73550, 0....
## $ weight <dbl> 200, 192, 200, 150, 160, 158, 162, 160, 195, 187, 206, 1...
## $ wghtcat <chr> "170-200", "170-200", "170-200", "140-170", "140-170", "...
## $ agec <chr> "46-50", "51-55", "56-60", "51-55", "41-45", "46-50", "3...
```

“Exploratory data analysis is detective work–in the purest sense–finding and revealing the clues.” - John W. Tukey, Exploratory Data Analysis, 1977

The first variable we will look at in this data frame is the systolic blood pressure (`sdp`

). In R, we want to make sure this variable is formatted correctly before putting it in a graph. We can also use `glimpse()`

on a single variable.

```
WstClbGrpStdy$sbp %>% glimpse()
## num [1:3154] 132 120 158 126 126 116 122 130 112 120 ...
```

## Base R histograms with `hist()`

A histogram takes a numeric variable and displays the range of values of across the horizontal scale (x-axis). Histograms also divide the horizontal scale into segments, called “bins.” The left-hand side of a histogram displays how many observations fall into each bin (y-axis).

Base R comes with a set of `graphics::`

functions for plotting data (this just means you won’t have to install any packages to use them). We can build a histogram using base R with the `graphics::hist()`

. This function takes the data you want to plot as the first argument (`WstClbGrpStdy$sbp`

).

```
# base R histogram of sbp
graphics::hist(WstClbGrpStdy$sbp)
```

Notice this plot lists “Histogram of WstClbGrpStdy$sbp” as the title and labels the x-axis as “WstClbGrpStdy$sbp”.

Technically this is all the function needs, but I recommend supplying something to the `main =`

title and `xlab =`

label arguments. Titles and labels help describe what we’re seeing (or what we expected to see) when R renders the plot. Future us will thank present us if take the extra time to add a title and label our plots.

```
graphics::hist(x = WstClbGrpStdy$sbp,
main = "Histogram of systolic blood pressure",
xlab = "Systolic blood pressure in mm Hg (sbp)")
```

Curious what to include in a title or label? I tend to stick with the type of graph and the variable I’m displaying in plain English (no jargon). For labels, I include the name of the measurement (no acronyms) and the units. A title and label for every graph might seem like an arduous level of detail, but if we end up investigating a data set with many abbreviated, nebulous variable names, we are going to appreciate

being able to look at a plot or graph and know precisely what it contains.

## How many bins?

The `graphics::hist()`

command will automatically divide up the data into bins, which usually isn’t a bad option. We can also specify the number of bins with the `breaks`

argument. This changes the number and width of the vertical bars. As suggested in VGSM, the

“rule of thumb is to choose the number of bins to be about

`1 + 3.3 log10(n)`

”

We can calculate this below:

```
wcgs_sbp_bins <- 1 + 3.3*log10(nrow(WstClbGrpStdy))
wcgs_sbp_bins
## [1] 12.55
```

This new numerical vector can be supplied to the `breaks`

argument in the `hist()`

graph.

## Adding reference lines to a `hist()`

plot

Add a reference line with a measure of central tendency to the histogram graph with the `abline()`

function. The first argument is either `h`

or `v`

(for a horizontal or vertical line), followed by the values you want the line to represent (in our case the `median()`

of `sbp`

).

I add two additional arguments to designate the color (`col = "blue"`

)

and width of the line (`lwd = 4`

).

```
graphics::hist(x = WstClbGrpStdy$sbp,
main = "Histogram of systolic blood pressure",
xlab = "Systolic blood pressure in mm Hg (sbp)",
breaks = wcgs_sbp_bins)
abline(v = median(WstClbGrpStdy$sbp),
col = "blue",
lwd = 4)
```

## Histogram quick plots with `qplot`

The next option for making a histogram comes from the `ggplot2`

package. The `qplot()`

function (short for quick plot) is a fast way to plot a single variable. Titles can be added using `ggtitle()`

and the same `xlab()`

function from before.

```
ggplot2::qplot(sbp, data = WstClbGrpStdy) +
ggplot2::ggtitle("Histogram of systolic blood pressure") +
xlab("Systolic blood pressure in mm Hg (sbp)")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
```

The `qplot`

function will create a histogram or bar chart depending on the format of the variable you provide (numeric or factor). As the message tells us, the default number of bins is 30. This is more than twice the number of bins we calculated before, and we can see this has a

profound effect on how these data are represented in the graph (*more on this later*).

If you’re like me and prefer to use the pipes from the `magrittr`

package, this can be re-written using a little syntactic sugar:

```
WstClbGrpStdy %>% ggplot2::qplot(sbp, data = .) +
ggplot2::ggtitle("Histogram of systolic blood pressure") +
xlab("Systolic blood pressure in mm Hg (sbp)")
```

## Why did we have to specify a period (`.`

) in the `data =`

argument?

`magrittr`

interprets the period argument as “*take the named parameter as the object on the left of the %>% operator.*”

So `Data %>% function(variable, named_argument = .)`

is equivalent to

`function(variable, named_argument = Data)`

Notice we also get a message from this graph about the number of `bins`

being used. The `stat_bin()`

argument is using a default of `30`

, but we can adjust this with `bins`

argument.

## Adding the median to a `qplot()`

We can also add the median to this plot by adding a `geom`

, specifically a `ggplot2::geom_vline()`

. This function takes an aesthetic mapping argument (`aes(xintercept = median(sbp))`

), a color argument (`col`

), and a `size`

argument that will determine the width of the vertical line.

```
WstClbGrpStdy %>%
ggplot2::qplot(sbp, data = ., bins = wcgs_sbp_bins) +
ggplot2::geom_vline(aes(xintercept = median(sbp)), # add median
col = 'blue',
size = 2) +
ggplot2::ggtitle("Histogram of systolic blood pressure") +
xlab("Systolic blood pressure in mm Hg (sbp)")
```

This histogram now looks more like the histogram we made using `graphics::hist()`

above (but with fewer bins).

`ggplot`

histogram plots

Below is an example of how to get a histogram using `ggplot2`

. I’ve covered using `ggplot2`

elsewhere, so I won’t repeat myself here. But know that `ggplot2`

allows users to build a plot layer by layer, starting with a `data`

argument, mapping variables to aesthetics (`aes(x = sbp)`

), and then supplying a `geom`

. We got a small introduction to this method in the example above when we added the `ggplot2::geom_vline()`

layer to the `ggplot2::qplot()`

.

Again, we can use `magrittr`

to increase the readability on these functions.

```
WstClbGrpStdy %>% # Data
ggplot2::ggplot(aes(x = sbp)) + # variable
ggplot2::geom_histogram(bins = wcgs_sbp_bins) + # geom
ggplot2::geom_vline(aes(xintercept = median(sbp)), # add median
col = 'blue',
size = 2) +
ggplot2::ggtitle("Histogram of systolic blood pressure") + # title
xlab("Systolic blood pressure in mm Hg (sbp)") # label
```

I don’t use `qplot()`

s very often, but only because I like the having the ability to customize and build my plots layer-by-layer.

## What are we seeing in a histogram?

These histograms are showing us the frequency of each value in `sbp`

. We can see above that most systolic blood pressures appear to be relatively close (although *slightly* lower than) to the median. We can also see the distribution extends to the right of the median (higher `sbp`

values) all the way out to ~230 mm Hg. This is called a “right-skewed” distribution–the tail extends out to the right side (vs a “left-skewed” distribution where the tail extends out to the left side).

```
median(WstClbGrpStdy$sbp)
## [1] 126
```

It’s also a good idea to view multiple histograms with different levels of `bins`

to get an idea for how this variable’s values are distributed.

```
# Create some histograms -----
WCGS_hist_15 <- WstClbGrpStdy %>% # Data
ggplot(aes(x = sbp)) + # variable
geom_histogram(bins = 15) + # geom
geom_vline(aes(xintercept = median(sbp)),
col = 'blue',
size = 1) +
ggtitle("Histogram of sbp (15 bins)") + # title
xlab("Systolic blood pressure in mm Hg (sbp)") # label
WCGS_hist_20 <- WstClbGrpStdy %>% # Data
ggplot(aes(x = sbp)) + # variable
geom_histogram(bins = 20) + # geom
geom_vline(aes(xintercept = median(sbp)),
col = 'blue',
size = 1) +
ggtitle("Histogram of sbp (20 bins)") + # title
xlab("Systolic blood pressure in mm Hg (sbp)") # label
WCGS_hist_25 <- WstClbGrpStdy %>% # Data
ggplot(aes(x = sbp)) + # variable
geom_histogram(bins = 25) + # geom
geom_vline(aes(xintercept = median(sbp)),
col = 'blue',
size = 1) +
ggtitle("Histogram of sbp (25 bins)") + # title
xlab("Systolic blood pressure in mm Hg (sbp)") # label
WCGS_hist_30 <- WstClbGrpStdy %>% # Data
ggplot(aes(x = sbp)) + # variable
geom_histogram(bins = 30) + # geom
geom_vline(aes(xintercept = median(sbp)),
col = 'blue',
size = 1) +
ggtitle("Histogram of sbp (30 bins)") + # title
xlab("Systolic blood pressure in mm Hg (sbp)") # label
# arrange these plots ----
library(ggpubr)
ggarrange(WCGS_hist_15, WCGS_hist_20, WCGS_hist_25, WCGS_hist_30,
ncol = 2, nrow = 2)
```

```
ggsave(filename = "Images/02_ggplot2_hists.png")
```

When we look at all four levels of `bins`

, we can see that the distributions have a collection of scores on the left-hand side, but the tail extends to the higher or more positive scores. The is an example of **right-skewness**.

When building histograms, we’re looking for the highs, lows, and overall spread of the data. The trick is to look at enough bin values to give us an idea about the variation of values in our variable, without making it too granular to interpret.

## What aren’t we seeing in a histogram?

The histogram doesn’t tell us anything about the summary statistics we might be using for inferences (such as the mean, standard deviation, median, etc.). It also doesn’t give us a clear picture of the extreme cases–it’s more of a “bird’s eye view” of a variable’s distribution. For more descriptive information, we will have to move onto box-plots.

## Percentiles, Medians, and Box-plots

Box-plots were originally created by John W. Tukey (although he called them “box-and-whisker” plots). Box-plots are useful for understanding the shape of variable’s distribution because they divide up the values into distinct proportions. For example, if we order all the values in a variable from lowest to highest, the middle number is the 50th percentile (also called the median). I know half the values lie below this number, and the other half are above it.

*Why use the median and not the mean (or average)?*

The median isn’t as sensitive to extreme values as the mean. I’ll demonstrate this with a quick example. Take a vector (`x`

) with 10 values and get the `median()`

and `mean()`

.

```
x <- c(4, 7, 9, 11, 13, 15, 17, 19, 21)
median(x)
## [1] 13
mean(x)
## [1] 12.89
```

These two numbers are relatively close to one another. Now we can take a quick look at the distribution with `hist(x)`

.

```
hist(x)
```

This distribution looks symmetrical–more importantly, it looks symmetrical around the mean and the median (`12.89`

and `13`

). This is important because many modelling techniques assume the distribution of the variables are “normal” or approximately bell-shaped. Now we create a second vector (`y`

), which is identical to `x`

, but the lowest and highest values have been changed substantially (`1`

to `0.0001`

and `21`

to `21000`

). Calculate the `median()`

and `mean()`

of `y`

.

```
y <- c(0.0001, 3, 7, 9, 11, 13, 15, 17, 21000)
median(y)
## [1] 11
mean(y)
## [1] 2342
```

Notice how the median didn’t change at all, but the mean has increased from `12.89`

to `2342`

. Take a look at the distribution of `y`

with `hist()`

:

```
hist(y)
```

The bell-shape is gone, replaced by just two bars (one for the 8 values under 5,000, and one for the single value above 20,0000).

## Base R box-plots with `boxplot()`

Box-plots use quartiles to divide the data into the 25th, 50th (median), and 75th percentiles. We can create a box-plot in base R using the `graphics::boxplot()`

function.

```
graphics::boxplot(WstClbGrpStdy$sbp,
main = "Box-plot of Systolic blood pressure in mm Hg",
ylab = "Systolic blood pressure in mm Hg (sbp)")
```

According to the authors in VGSM, a box-plot displays the following information:

- Location, as measured by the median
- Spread, as measured by the height of the box (this is called the interquartile range or IQR)
- Range of the observations
- Presence of outliers
- Some information about shape

I’ve created a graphic that adds a little detail for each element of the box-plot.

## Relationship between histograms and box-plots

Each of these imaginary box-plots has a corresponding histogram. These box plots weren’t created in R (and aren’t drawn perfectly to scale), but I’ve included this figure to illustrate the information captured in

a box-plot might be hard to see in a histogram:

Extreme values (sometimes called outliers) are made visible with a box-plot–they show up as dots or values beyond the 1.5x thresholds of each “whisker”.

We can also see the location changes by the shifting line within the box (the median).

## Box-plots with `qplot()`

The code to create a box-plot is very similar to creating a histogram with `qplot()`

. We supply a data frame (`WstClbGrpStdy`

) to the `qplot()`

function, but we also need to add an x argument (`x = " "`

), because a box-plot geom needs to map the values of `sbp`

onto the `y`

axis. We add our title and label, but also set the `xlab()`

label to `NULL`

so the `x`

doesn’t display.

```
WstClbGrpStdy %>%
qplot(x = "", y = sbp, data = ., geom = "boxplot") +
ggtitle("Box-plot of Systolic blood pressure in mm Hg") +
ylab("Systolic blood pressure in mm Hg (sbp)") +
xlab(NULL)
```

`ggplot`

box-plots

Finally, we can extend the `ggplot2`

functions to the layered grammar by piping a data frame to the `ggplot(aes(`

functions, mapping `x`

to an empty value (`" "`

), `y`

to the `sbp`

variable, then adding the

`geom_boxplot()`

.

The title and labels get added in the same way of as the `qplot()`

function.

```
WstClbGrpStdy %>%
ggplot(aes(x = "", y = sbp)) +
geom_boxplot() +
ggtitle("Box-plot of Systolic blood pressure in mm Hg") +
ylab("Systolic blood pressure in mm Hg (sbp)") +
xlab(NULL)
```

## What are we seeing in a boxplot?

To summarize, the box displays the upper and lower quartiles as boundaries for three percentiles of the data: 1) the bottom of the box is the 25th percentile, meaning one-quarter of the values are below this value 2) the median (the line in the middle of the box) tells us how far the middle value is from the 25th and 75th percentile, and 3) the top of the box is the 75th percentile, so one-quarter of the values are above this value.

The entire length of the box is the interquartile range (IQR). The whiskers show 1) the maximum values within 1.5 times the IQR from the upper (75th) quartile, and 2) the minimum values that are within 1.5x the IQR from the lower (25th) quartile. The points beyond the whiskers are the extreme values.

With regard to these data, I thought this description from VGSM was enlightening,

“right-skewness will be indicated if the upper whisker is longer than the lower whisker or if there are more outliers in the upper range. Both the boxplot and the histogram show evidence for right-skewness in the SBP data.”

```
require(ggplot2)
require(ggpubr)
sbp_boxplot = WstClbGrpStdy %>%
ggplot(aes(x = "",
y = sbp)) +
geom_boxplot(show.legend = FALSE) +
coord_flip() +
scale_y_continuous(expand = c(0,5),
limit = c(90, 250)) +
ggtitle("Box-plot & Histogram of systolic blood pressure") +
xlab(NULL) +
ylab(NULL)
sbp_histogram = WstClbGrpStdy %>% # Data
ggplot(aes(x = sbp)) + # variable
geom_histogram(bins = 20) + # geom
geom_vline(aes(xintercept = median(sbp)),
col = 'blue',
size = 1) +
scale_x_continuous(expand = c(0,5),
limit = c(90, 250))
ggarrange(sbp_boxplot,
sbp_histogram,
heights = c(2.5, 3),
align = "hv",
ncol = 1,
nrow = 2)
```

By combining these graphs, we can see how the median (blue line) in the histogram splits the distribution in the box, and how the skewness in the histogram is represented in the whiskers and points of the box-plot.

## Base R Q-Q plots

This brings us to the final plot we will cover: the Q-Q plot, or quantile-quantile plot. This plot is helpful when we want to know how closely our data aligns with the bell-shaped “normal” distribution.

We can create a Q-Q plot in base R using the following functions. The `qqnorm()`

function needs a variable we want to plot (`sbp`

).

```
qqnorm(WstClbGrpStdy$sbp,
main = "Systolic blood pressure QQ plot",
ylab = "Systolic Blood Pressure mm Hg (sbp)")
```

The interpretation of the Q-Q plot is the following from VGSM,

“a normal Q–Q plot is constructed so that the data points fall along an approximately straight line when the data are from a normal distribution, and deviate systematically from a straight line when the data are from other distributions.”

## What does it mean if data are normally distributed?

In a broader sense, if I were to randomly sample a data point from a variable with a symmetrical, bell-shaped distribution, that point has an approximately equal probability of coming from either side of the distribution’s mean (or median).

The `sbp`

data do not seem to fall along a straight line, but it would be helpful if we had a straight line for reference. We can add a straight line with `qqline()`

–this function needs the same variable we provided to `qqnorm()`

, but I also specified the line width in the `lwd = 2`

argument. I also included a `col`

argument to color the systolic blood pressure values, and a `grid()`

argument to give our coordinate system some gridlines.

```
qqnorm(WstClbGrpStdy$sbp,
main = "Systolic blood pressure QQ plot",
col = "blue",
ylab = "Systolic Blood Pressure mm Hg (sbp)")
qqline(WstClbGrpStdy$sbp,
lwd = 2)
grid(lty = "dotted",
col = "gray75")
```

We can see these values definitely do not fall along the straight reference line of “theoretical quantiles”.

## Q-Q quick plots with `stat_qq()`

The Q-Q plots in `ggplot2`

can be built in a variety of ways. The first is by setting the `sample`

aesthetic to the variable we want to see (`sbp`

) and then adding a `stat_qq()`

layer.

```
WstClbGrpStdy %>%
ggplot(aes(sample = sbp)) +
stat_qq() +
ggtitle("Q-Q plot of systolic blood pressure")
```

If we want to add a reference line to the `stat_qq()`

plot, `ggplot2`

version 2.2.1.9000 is now equipped with a function `stat_qq_line()`

for a straight reference line. I use the geom for creating the points on the graph (`geom_qq()`

) along with some opacity and color options, then add the reference line with `stat_qq_line()`

.

```
WstClbGrpStdy %>%
ggplot(aes(sample = sbp)) +
geom_qq(alpha = 0.3,
color = "blue") +
stat_qq_line() +
ggtitle("Q-Q plot of systolic blood pressure")
```

Q-Q plots are also helpful for determining the shape of the distribution. See the following quote from VGSM text:

“The shape and direction of the curvature can be used to diagnose the deviation from normality. Upward curvature is indicative of right-skewness, while downward curvature is indicative of left-skewness.”

The upward curve of the systolic blood pressure shows these data are right-skewed (which we knew from looking at the other two plots, but it’s always good to verify this again!).

## Heavy and light tailed distributions

We are going to diverge from the `sbp`

variable in the `WstClbGrpStdy`

data frame and use the `Tails`

data frame to show how two different distributions look in the three graphs we covered thus far.

```
Tails %>% glimpse(78)
## Observations: 1,000
## Variables: 2
## $ heavy_tail <dbl> -0.56, -0.40, -0.03, -0.77, -0.26, -0.38, -0.47, 0.32,...
## $ light_tail <dbl> 0.04615, 0.12579, 0.02782, 0.03540, 0.02829, 0.02750, ...
```

This data frame only has two variables, `heavy_tail`

and `light_tail`

. As the VGSM text notes, Q-Q plots can tell us a lot about how a distribution deviates from the normal “bell-curve”.

“The other two common patterns are S-shaped. An S-shape as in [see below] indicates a heavy-tailed distribution,”

We can plot the `heavy_tail`

data below and arrange all three plots to display together.

```
Heavy_qq <- Tails %>%
ggplot(aes(sample = heavy_tail)) +
geom_qq(alpha = 0.3,
color = "blue") +
stat_qq_line() +
ggtitle("Q-Q plot of heavy tail distribution")
# ylab(NULL)
Heavy_box <- Tails %>%
ggplot(aes(x = "y",
y = heavy_tail)) +
geom_boxplot(show.legend = FALSE) +
coord_flip() +
scale_y_continuous(expand = c(0, 1),
limit = c(-6, 6)) +
ggtitle("Box-plot of heavy tail distribution") +
xlab(NULL) +
ylab(NULL)
Heavy_hist <- Tails %>%
ggplot(aes(x = heavy_tail)) +
geom_histogram() +
geom_vline(aes(xintercept = median(heavy_tail)),
col = 'blue',
size = 1) +
scale_x_continuous(expand = c(0, 1),
limit = c(-6, 6)) +
ggtitle("Histogram of heavy tail distribution") +
xlab(NULL) +
ylab(NULL)
# arrange these plots ----
library(ggpubr)
ggarrange(Heavy_hist,
Heavy_qq,
Heavy_box,
align = "hv",
ncol = 2,
nrow = 2,
widths = c(5, 5, 5))
```

The S-curve being referred to is the blue line in the Q-Q plot, and the distribution has a heavy tail if the line of values curves above the reference line on the positive (higher) end of the x-axis, but below the line on the negative (or lower) end of the x-axis.

We can also see the box-plot has many outliers outside the whiskers.

“while an S-shape like [see below] is indicative of a light-tailed distribution.”

```
Light_qq <- Tails %>%
ggplot(aes(sample = light_tail)) +
geom_qq(alpha = 0.3,
color = "blue") +
stat_qq_line() +
ggtitle("Q-Q plot of light tail distribution")
Light_box <- Tails %>%
ggplot(aes(x = "y",
y = light_tail)) +
geom_boxplot(show.legend = FALSE) +
coord_flip() +
ggtitle("Box-plot of light tail distribution") +
xlab(NULL) +
ylab(NULL)
Light_hist <- Tails %>%
ggplot(aes(x = light_tail)) +
geom_histogram() +
geom_vline(aes(xintercept = median(light_tail)),
col = 'blue',
size = 1) +
ggtitle("Histogram of light tail distribution") +
xlab(NULL) +
ylab(NULL)
# arrange these plots ----
library(ggpubr)
ggarrange(Light_hist,
Light_qq,
Light_box,
align = "hv",
ncol = 2,
nrow = 2,
widths = c(5, 5, 5))
```

This S-curve crosses the reference line, but in opposing directions from the heavy tail data (the line of values curves *below* the reference line on the positive (higher) end of the x-axis, and *above* the reference line on the negative (or lower) end of the x-axis).

The box-plot shows fewer outliers beyond the whiskers here, too.

## Data Transformations

Many times data will require transformations before they can be entered into a statistical model. We have seen that variables can deviate from a normal distribution in their skewness or kurtosis, so visualization is important both before and after any transformations are made.

Common transformations are natural log (`log`

) and `log10`

. I liked the description of what a log transformation does to a skewed distribution in the VGSM,

“…a log transformation deemphasizes differences at the upper end of the scale and emphasizes those at the lower end.”

We will create two new variables of `sbp`

and look at their distributions using histograms (set at 25 bins each).

```
WstClbGrpStdy <- WstClbGrpStdy %>%
dplyr::mutate(log_sbp = log(sbp))
WCGS_log_sbp_hist_15 <- WstClbGrpStdy %>%
ggplot(aes(x = log_sbp)) +
geom_histogram(bins = 25) +
geom_vline(aes(xintercept = median(log_sbp)),
col = 'blue',
size = 1) +
geom_vline(aes(xintercept = mean(log_sbp)),
col = 'red',
size = 1) +
ggtitle("Histogram of log(sbp) (25 bins)") +
xlab("Ln of Systolic blood pressure (log_sbp)") # label
WCGS_hist_15 <- WstClbGrpStdy %>% # Data
ggplot(aes(x = sbp)) + # variable
geom_histogram(bins = 25) + # geom
geom_vline(aes(xintercept = median(sbp)),
col = 'blue',
size = 1) +
geom_vline(aes(xintercept = mean(sbp)),
col = 'red',
size = 1) +
ggtitle("Histogram of sbp (25 bins)") + # title
xlab("Systolic blood pressure in mm Hg (sbp)") # label
# arrange these plots ----
library(ggpubr)
ggarrange(WCGS_hist_15,
WCGS_log_sbp_hist_15,
align = "hv",
ncol = 2,
widths = c(5, 5))
```

We can see the `log_sbp`

distribution is more symmetrical around the median/mean.

### Doesn’t transforming my data make it harder to interpret when I model or graph them?

Maybe, but we just showed `sbp`

before and after a transformation, so it’s possible to present both distributions without the reader losing track of the plot. The authors also point “a difference on the transformed scale is still a difference.”

## Categorical Variables

Categorical or count data has less options for graphing, but it’s still possible to create a bar graph and see how these numbers differ within a variable. Below is a bar graph of `behpat`

, which is the behavior type as a factor with levels `A1`

, `A2`

, `B3`

, and `B4`

.

```
# convert behpat to factor -----
WstClbGrpStdy$behpat <- factor(WstClbGrpStdy$behpat)
# create bar chart -----
WstClbGrpStdy %>% ggplot(aes(x = behpat)) +
geom_bar() +
ggtitle("Bar chart of behavior types") +
xlab("Behavior types behpat")
```

## Outcome and Predictor Variable Scatter Plots

This section introduces multivariate plots with a plot of systolic blood pressure (`sbp`

) versus weight. I reproduce these plots below in base R with the `plot()`

function.

```
plot(WstClbGrpStdy$sbp, WstClbGrpStdy$weight,
main = "Systolic blood pressure vs. weight",
xlab = "Weight",
ylab = "Systolic blood pressure")
```

And we can create a similar chart in `ggplot2`

with the `geom_point()`

function.

```
WstClbGrpStdy %>%
ggplot(aes(x = weight, y = sbp)) +
geom_point() +
ggtitle("Systolic blood pressure vs. weight") +
xlab("Weight") +
ylab("Systolic blood pressure")
```

```
ggsave(filename = "Images/02_scatter_wtXsbp.png")
```

## Adding a `LO`

cally `WE`

ighted `S`

catterplot `S`

moother (or `LOWESS`

)

Sometimes it helpful to fit a line through a scatter plot of two continuous variables to see if their relationship is linear. The text uses the `lowess sbp weight, bw(0.25)`

from Stata, and I’ve reproduced this graph below in `ggplto2()`

with a `ggplot2::geom_smooth(method = "loess")`

. Note that this includes a standard error (in gray).

```
WstClbGrpStdy %>%
ggplot(aes(x = weight, y = sbp)) +
ggplot2::geom_point() +
ggplot2::geom_smooth(method = "loess") +
ggplot2::ggtitle("Systolic blood pressure vs. weight") +
xlab("Weight") +
ylab("Systolic blood pressure")
```

“This is all just a fancy way of drawing a flexible curve through a cloud of points.”

The upward trends are due to the lack of data at these ends of the distributions.

## Multivariate box-plots

Box-plots are also useful when we need to look at the distribution of a variable’s values across the levels or values of another variable. For example, if we wanted to examine systolic blood pressure `sbp`

across behavior types (in the `behpat`

variable), we could do this with the code below:

```
WstClbGrpStdy %>%
ggplot(aes(x = behpat, y = sbp)) +
geom_boxplot() +
ggtitle("Box-plot of Systolic blood pressure and behavior types") +
ylab("Systolic blood pressure in mm Hg (sbp)") +
xlab("Behavior types (behpat)")
```

*BONUS:* What if you want to use different statistics in your box-plot?

The four box plots above show the median, interquartile range, and extreme values in `sbp`

across the behavior types in `behpat`

. But what if we wanted to compare these summary statistics to the mean and standard deviation of `sbp`

across the behavior categories?

Fortunately `ggplot2`

will let me customize each element of the box-plot using the `ggplot2::stat_summary()`

and `ggplot2::stat_boxplot()`

functions.

```
WstClbGrpStdy %>%
group_by(behpat) %>%
summarise(mean_sbp = mean(sbp),
med_sbp = median(sbp))
## # A tibble: 4 x 3
## behpat mean_sbp med_sbp
## <fct> <dbl> <dbl>
## 1 A1 129. 126
## 2 A2 130. 128
## 3 B3 128. 124
## 4 B4 127. 126
```

```
WstClbGrpStdy %>%
ggplot(aes(x = behpat,
y = sbp,
color = behpat)) + # different color box for each level of wghtcat
geom_boxplot(aes(x = behpat,
y = sbp),
show.legend = FALSE) + # remove legend
stat_summary(fun.y = mean, # plots mean sbp per level of wghtcat
color = "darkslategray",
geom = "point",
size = 1.5) +
stat_summary(fun.data = mean_sd, # plots sd of sbp per level of wghtcat
geom = "errorbar", # as error bars
linetype = 2, # makes them dashed
color = "darkslategray", # and differnt color
width = 0.3,
size = 0.5) +
stat_boxplot(aes(x = behpat,# customizes boxplot error bar
y = sbp),
show.legend = FALSE,
geom = "errorbar",
linetype = 1,
width = 0.2) +
ggtitle("Box-plot of Systolic blood pressure and behavior types") +
ylab("Systolic blood pressure in mm Hg (sbp)") +
xlab("Behavior types (behpat)")
```

This shows us how systolic blood pressure varies according to the type of behavior. The box plot shows the distribution of `sbp`

, while the two `stat_summary()`

functions show the mean and standard deviations over the four levels of `behpat`

.

The next section shows how to create a cross-tabulation of weight categories and behavior types in Stata using `tabulate behpat wghtcat, column`

.

We can do this in R, but first we should convert the `wghtcat`

variable to a factor and order the levels.

```
WstClbGrpStdy$wghtcat <- factor(WstClbGrpStdy$wghtcat,
levels = c("< 140", "140-170",
"170-200", "> 200"))
```

To create cross-tabs for factor variables, we could use base R’s `table`

function.

```
table(WstClbGrpStdy$behpat, WstClbGrpStdy$wghtcat)
##
## < 140 140-170 170-200 > 200
## A1 20 125 98 21
## A2 100 612 514 99
## B3 90 610 443 73
## B4 22 191 116 20
```

But I recommend learning some of `dplyr`

s handy programming syntax, because then you’ll be able to create your own custom table functions. These were adapted from this thread on StackOverflow.

```
freq_crosstab <- function(data, var1, var2) {
var1 <- rlang::enquo(var1)
var2 <- rlang::enquo(var2)
data %>%
dplyr::count(!!var1, !!var2) %>%
tidyr::spread(!!var2, n, fill = 0) %>%
dplyr::mutate(Total := rowSums(dplyr::select(., -!!var1)),
Freq = Total / sum(Total),
Freq = paste0(round(Freq*100, 2), "%")) %>%
dplyr::bind_rows(dplyr::bind_cols(!!rlang::quo_name(var1) := "Total",
dplyr::summarize_if(., is.numeric, sum)))
}
freq_crosstab(WstClbGrpStdy, behpat, wghtcat)
## # A tibble: 5 x 7
## behpat `< 140` `140-170` `170-200` `> 200` Total Freq
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 A1 20 125 98 21 264 8.37%
## 2 A2 100 612 514 99 1325 42.01%
## 3 B3 90 610 443 73 1216 38.55%
## 4 B4 22 191 116 20 349 11.07%
## 5 Total 232 1538 1171 213 3154 <NA>
```

Or add some percents to the frequency columns and make it tall.

```
freq_tall_table <- function(data,
group_var,
prop_var) {
group_var <- enquo(group_var)
prop_var <- enquo(prop_var)
data %>%
group_by(!!group_var, !!prop_var) %>%
summarise(n = n()) %>%
mutate(freq = n / sum(n),
freq = paste0(round(freq*100, 2), "%")) %>%
ungroup
}
freq_tall_table(WstClbGrpStdy, behpat, wghtcat)
## # A tibble: 16 x 4
## behpat wghtcat n freq
## <fct> <fct> <int> <chr>
## 1 A1 < 140 20 7.58%
## 2 A1 140-170 125 47.35%
## 3 A1 170-200 98 37.12%
## 4 A1 > 200 21 7.95%
## 5 A2 < 140 100 7.55%
## 6 A2 140-170 612 46.19%
## 7 A2 170-200 514 38.79%
## 8 A2 > 200 99 7.47%
## 9 B3 < 140 90 7.4%
## 10 B3 140-170 610 50.16%
## 11 B3 170-200 443 36.43%
## 12 B3 > 200 73 6%
## 13 B4 < 140 22 6.3%
## 14 B4 140-170 191 54.73%
## 15 B4 170-200 116 33.24%
## 16 B4 > 200 20 5.73%
```

Knowing how to use the `tidyr`

functions is always helpful for rearranging columns and rows.

```
freq_tall_table(WstClbGrpStdy, behpat, wghtcat) %>%
tidyr::spread(wghtcat, n)
## # A tibble: 16 x 6
## behpat freq `< 140` `140-170` `170-200` `> 200`
## <fct> <chr> <int> <int> <int> <int>
## 1 A1 37.12% NA NA 98 NA
## 2 A1 47.35% NA 125 NA NA
## 3 A1 7.58% 20 NA NA NA
## 4 A1 7.95% NA NA NA 21
## 5 A2 38.79% NA NA 514 NA
## 6 A2 46.19% NA 612 NA NA
## 7 A2 7.47% NA NA NA 99
## 8 A2 7.55% 100 NA NA NA
## 9 B3 36.43% NA NA 443 NA
## 10 B3 50.16% NA 610 NA NA
## 11 B3 6% NA NA NA 73
## 12 B3 7.4% 90 NA NA NA
## 13 B4 33.24% NA NA 116 NA
## 14 B4 5.73% NA NA NA 20
## 15 B4 54.73% NA 191 NA NA
## 16 B4 6.3% 22 NA NA NA
```

## Multivariable Descriptions

Finally, this chapter concludes with a discussion of graphing and displaying pairwise combinations of variables. The scatter plot matrix is shown following a correlation matrix in Stata (`correlate sbp age weight height (obs=3154)`

). To create a correlation matrix in R, we could use the `cor()`

function. I like to keep everything as `tibble()`

or `data.frame`

in my working environment, so I do a little reformatting to the output.

```
CorrTable <- WstClbGrpStdy %>%
dplyr::select(sbp,
age,
weight,
height) %>%
cor() %>%
as_tibble() %>%
rownames_to_column() %>%
mutate(rowname =
case_when(rowname == 1 ~ 'sbp',
rowname == 2 ~ 'age',
rowname == 3 ~ 'weight',
rowname == 4 ~ 'height'))
CorrTable
## # A tibble: 4 x 5
## rowname sbp age weight height
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 sbp 1 0.166 0.253 0.0184
## 2 age 0.166 1 -0.0344 -0.0954
## 3 weight 0.253 -0.0344 1 0.533
## 4 height 0.0184 -0.0954 0.533 1
```

A graph is displayed that shows `weight`

and `sbp`

across categories of behavior type. `ggplot2`

can do this type of graphing with `facet_wrap`

ing. We can take what we’ve learned about the `geom_smooth(method = "loess")`

and apply it here, only now we will add a `group = behpat`

argument and a `facet_wrap()`

function.

```
WstClbGrpStdy %>%
ggplot(aes(x = weight,
y = sbp,
group = behpat)) + # separate by this factor
geom_point(aes(color = behpat), # color aesthetic added to this layer
show.legend = FALSE) + # don't need this
geom_smooth(method = "loess") + # add the loess line
facet_wrap(. ~ behpat, # now wrap the plot in a 2x2 layout
nrow = 2,
ncol = 2) +
ggtitle("Scatterplots of SBP vs. weight by behavior pattern")
```

Great! Now we can look at more complicated combinations of plots, like the scatter plot matrix.

The `pairs()`

plot shows multiple pairwise correlations between `sbp`

`age`

`weight`

and `height`

below.

```
pairs(WstClbGrpStdy[ , c("sbp", "age", "weight", "height")],
main = "A scatter plot matrix between sbp, age, weight, and height")
```

This chart is lacking in some details, so we will add additional formatting and get a smoothed line through the lower scatter plots.

```
# WstClbGrpStdy[ , c(14, 1, 20, 9)]
pairs(WstClbGrpStdy[ , c(14, 1, 20, 9)],
lower.panel = panel.smooth,
main = "A scatter plot matrix between sbp, age, weight, and height")
```

We can also create the scatter plot matrix with `ggplot2`

using the `GGally`

package (you might also need the fix found here: `devtools::install_github("ggobi/ggally#266")`

).

This function needs us to specify the columns we want to use in the plot, and `list()`

s of options for the `lower()`

and `diag()`

plots.

```
# devtools::install_github("ggobi/ggally#266")
# WstClbGrpStdy[ , c(14, 1, 20, 9)] # use a little column subsetting...
GGally::ggpairs(WstClbGrpStdy[ , c(14, 1, 20, 9)],
lower = list(
continuous = "smooth",
mapping = aes(alpha = 0.1)),
diag = list(
continuous = "barDiag",
mapping = aes(alpha = 0.1))) +
ggtitle("A scatter plot matrix between sbp, age, weight, and height")
```

Compare this to the results in `CorrTable`

:

```
knitr::kable(CorrTable)
```

rowname | sbp | age | weight | height |
---|---|---|---|---|

sbp | 1.0000 | 0.1657 | 0.2532 | 0.0184 |

age | 0.1657 | 1.0000 | -0.0344 | -0.0954 |

weight | 0.2532 | -0.0344 | 1.0000 | 0.5329 |

height | 0.0184 | -0.0954 | 0.5329 | 1.0000 |

## In conclusion

I hope anyone using VGSM or in the TICR program using R (or wanting to) will find this useful. I will be posting the other chapters after adding newer options/packages that have come out since the time I completed my coursework.