(This publication is still a work in progress)

“What is to be sought in designs for the display of information is the clear portrayal of complexity. Not the complication of the simple; rather … the revelation of the complex.”
- Edward R. Tufte

# Introduction

The ggstatsplot package is an opinionated collection of plots made with ggplot2, designed for exploratory data analysis and for producing publication-ready statistical graphics. All plots share one underlying principle: they are information-rich and include all necessary statistical details in the plot itself. Although the plots produced by ggstatsplot are still ggplot objects and can thus be modified further with ggplot2 commands, there is a limit to how many such modifications can be made. That is, the package is less flexible than ggplot2, but that’s a feature and not a bug. The original intent behind this package is to offload the struggles associated with constructing a plot so that the user can focus on interpreting the data displayed in it.

# Graphical perception

Graphical perception involves the visual decoding of information encoded in graphs. ggstatsplot incorporates the paradigm proposed in Cleveland (1985, Chapter 4) so that visual judgments about quantitative information become effortless and almost instantaneous. Based on experiments, Cleveland proposes that we perform ten elementary graphical-perception tasks to visually decode quantitative information in graphs. They are ordered below from most to least accurate, and tasks listed on the same line share a rank (Cleveland, 1985, p.254):

1. Position along a common scale
2. Position along identical, non-aligned scales
3. Length
4. Angle (Slope)
5. Area
6. Volume
7. Color hue - Color saturation - Density

The key principle of Cleveland’s paradigm for data display is therefore:

“We should encode data on a graph so that the visual decoding involves [graphical-perception] tasks as high in the ordering as possible.”

For example, decoding the data point values in ggbetweenstats requires position judgments along a common scale (Figure-1):

    # for reproducibility
    set.seed(123)

    # plot
    ggstatsplot::ggbetweenstats(
      data = dplyr::sample_frac(ggstatsplot::movies_long, size = 0.5),
      x = genre,
      y = rating,
      title = "Figure-1: IMDB rating by film genre",
      xlab = "Genre",
      ylab = "IMDB rating (average)",
      ggtheme = hrbrthemes::theme_ipsum_tw(),
      ggstatsplot.layer = FALSE,
      outlier.tagging = TRUE,
      outlier.label = title,
      messages = FALSE
    )

There are a few instances where ggstatsplot diverges from the recommendations of Cleveland’s paradigm:

• For categorical/nominal data, ggstatsplot uses pie charts (see Figure-2), which rely on angle judgments; these are less accurate than the position judgments required by, e.g., bar graphs. This shortcoming is mitigated to some degree by labeling every slice with its percentage, which makes angle judgments unnecessary and pre-empts concerns about inaccurate judgments of percentages.

    # for reproducibility
    set.seed(123)

    # plot
    ggstatsplot::ggpiestats(
      data = ggstatsplot::movies_long,
      main = genre,
      condition = mpaa,
      title = "Figure-2: Distribution of MPAA ratings by film genre",
      legend.title = "layout",
      caption = "MPAA: Motion Picture Association of America",
      package = "ggsci",
      palette = "default_jama",
      messages = FALSE
    )
    #> Warning: chr_along() is soft-deprecated as of rlang 0.2.0.
    #> This warning is displayed once per session.

• Cleveland’s paradigm also emphasizes that superposition of data is better than juxtaposition (Cleveland, 1985, p.201) because it allows a more incisive comparison of values from different parts of the dataset. This recommendation is violated in all grouped_ variants of the functions (see Figure-3). Note that in the juxtaposed subplots the Y-axes no longer share the same range, which makes visual comparison of the data difficult. In the superposed plot, by contrast, all data share the same range, and coloring the different parts makes it easier to discriminate between components of the data and to compare them. But the goal of the grouped_ variants is not only to show different aspects of the data but also to run statistical tests, and showing detailed results for all aspects of the data in a single superposed plot is difficult. This is therefore a compromise ggstatsplot is comfortable with, at least for producing plots for quick exploration of different aspects of the data.

    # for reproducibility
    set.seed(123)

    # plot
    ggstatsplot::combine_plots(
      # plot 1: superposition
      ggplot2::ggplot(
        data = dplyr::filter(
          ggstatsplot::movies_long,
          genre == "Comedy" | genre == "Drama"
        ),
        mapping = ggplot2::aes(x = length, y = rating, color = genre)
      ) +
        ggplot2::geom_jitter(size = 3, alpha = 0.5) +
        ggplot2::geom_smooth(method = "lm") +
        ggplot2::labs(title = "superposition (recommended in Cleveland's paradigm)") +
        ggstatsplot::theme_ggstatsplot(),
      # plot 2: juxtaposition
      ggstatsplot::grouped_ggscatterstats(
        data = dplyr::filter(
          ggstatsplot::movies_long,
          genre == "Comedy" | genre == "Drama"
        ),
        x = length,
        y = rating,
        grouping.var = genre,
        marginal = FALSE,
        title.prefix = "Genre",
        title.text = "juxtaposition (ggstatsplot implementation in grouped_ functions)",
        title.size = 12
      ),
      # combine for comparison
      title.text = "Two ways to compare different aspects of data",
      nrow = 2,
      labels = c("(a)", "(b)")
    )
    #> Warning: The output from grouped_ functions are not ggplot objects and therefore can't be further modified with ggplot2 functions.

The grouped_ plots follow the Shrink Principle (Tufte, 2001, p.166-7) for high-information graphics, which dictates that the data density and the size of the data matrix can be maximized to exploit maximum resolution of the available data-display technology. Given the large maximum resolution afforded by most computer monitors today, saving grouped_ plots with appropriate resolution ensures no loss in legibility with reduced graphics area.

# Graphical integrity (and clean design)

Graphical excellence consists of communicating complex ideas with clarity, in a way that lets the viewer grasp the greatest number of ideas in a short amount of time, all without quoting the data out of context. The package follows the principles of graphical integrity (as outlined in Tufte, 2001):

• The physical representation of numbers is proportional to the numerical quantities they represent (e.g., Figure-1 and Figure-2 show how means (in ggbetweenstats) or percentages (ggpiestats) are proportional to the vertical distance or the area, respectively).

• All important events in the data have clear, detailed, and thorough labeling (e.g., Figure-1 shows how ggbetweenstats labels means, sample size information, and outliers; the same holds for ggpiestats in Figure-2).

• None of the plots have design variation (e.g., abrupt changes in scale) over the surface of the same graphic, because this can give a false impression of variation in the data.

• The number of information-carrying dimensions never exceeds the number of dimensions in the data (e.g., area is not used to show one-dimensional data).

• All plots are designed to have no chartjunk (like moiré vibrations, fake perspective, dark grid lines, etc.) (Tufte, 2001, Chapter 5).

There are some instances where ggstatsplot graphs don’t follow principles of clean graphics, as formulated in the Tufte theory of data graphics (Tufte, 2001, Chapter 4). The theory has four key principles:

1. Above all else show the data.
2. Maximize the data-ink ratio.
3. Erase non-data-ink.
4. Erase redundant data-ink, within reason.

In particular, default plots in ggstatsplot can sometimes violate one of principles 2-4. According to these principles, every bit of ink should have a reason for its inclusion in the graphic and should convey some new information to the viewer. If it does not, it should be removed. One instance of this is the bilateral symmetry of data measures. For example, in Figure-1, both the box and violin plots are mirrored, which consumes twice the space in the graphic without adding any new information. This redundancy is tolerated for the sake of the beauty that such symmetrical shapes can bring to the graphic. Even Tufte admits that efficiency is but one consideration in the design of statistical graphics (Tufte, 2001, p. 137). Additionally, these principles were formulated in an era in which computer graphics had yet to revolutionize the ease with which graphics could be produced, so some of the concerns about minimizing data-ink for easier production of graphics are not as relevant as they were.
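For reference, principle 2 rests on the data-ink ratio, which Tufte (2001) defines as the share of a graphic's ink that presents data-information:

$$
\text{data-ink ratio} = \frac{\text{data-ink}}{\text{total ink used to print the graphic}}
$$

Equivalently, it is one minus the proportion of the graphic that could be erased without losing data-information. Under this definition, the mirrored half of a violin plot adds to the denominator without adding to the numerator, which is why it counts as redundant data-ink.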

# Statistical analysis

As an extension of ggplot2, ggstatsplot has the same expectations about the structure of the data. More specifically,

• The data should be an object of class data.frame (a tibble dataframe will also work).

• The data should be organized following the principles of tidy data, which specify how statistical structure of a data frame (variables and observations) should be mapped to physical structure (columns and rows). More specifically, tidy data means all variables have their own columns and each row corresponds to a unique observation (Wickham, 2014).

• All ggstatsplot functions remove NAs from variables of interest (similar to ggplot2; Wickham, 2016, p.207) in the data and display total sample size (n) in the subtitle to inform the user/reader about the number of observations included for both the statistical analysis and the visualization. But, when sample sizes differ across tests in the same function, ggstatsplot makes an effort to inform the user of this aspect (Figure-4).


    # creating a new dataset without any NAs in variables of interest
    msleep_no_na <-
      dplyr::filter(
        .data = ggplot2::msleep,
        !is.na(sleep_rem),
        !is.na(awake),
        !is.na(brainwt),
        !is.na(bodywt)
      )

    # variable names vector
    var_names <- c(
      "REM sleep",
      "time awake",
      "brain weight",
      "body weight"
    )

    # combining two plots
    ggstatsplot::combine_plots(
      # plot *without* any NAs
      ggstatsplot::ggcorrmat(
        data = msleep_no_na,
        corr.method = "kendall",
        sig.level = 0.001,
        cor.vars = c(sleep_rem, awake:bodywt),
        cor.vars.names = var_names,
        matrix.type = "upper",
        colors = c("#B2182B", "white", "#4D4D4D"),
        title = "Correlogram for mammals sleep dataset",
        subtitle = "sleep units: hours; weight units: kilograms",
        messages = FALSE
      ),
      # plot *with* NAs
      ggstatsplot::ggcorrmat(
        data = ggplot2::msleep,
        corr.method = "kendall",
        sig.level = 0.001,
        cor.vars = c(sleep_rem, awake:bodywt),
        cor.vars.names = var_names,
        matrix.type = "upper",
        colors = c("#B2182B", "white", "#4D4D4D"),
        title = "Correlogram for mammals sleep dataset",
        subtitle = "sleep units: hours; weight units: kilograms",
        messages = FALSE
      ),
      labels = c("(a)", "(b)"),
      nrow = 1
    )
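
To see why sample sizes can differ across tests within one function, consider correlations computed for each pair of variables using only the rows where both are observed. The following is a minimal base-R sketch of that bookkeeping. It uses the built-in airquality dataset rather than msleep, it assumes pairwise deletion purely for illustration, and it is not ggstatsplot's internal code.

```r
# Illustration only: counting observations under listwise vs pairwise deletion.
# `airquality` ships with base R; Ozone and Solar.R contain NAs.
vars <- as.matrix(airquality[, c("Ozone", "Solar.R", "Wind", "Temp")])

# listwise deletion: rows with no NA in any selected variable
n_complete <- sum(stats::complete.cases(vars))

# pairwise deletion: for each pair, count rows where both variables are observed
present <- !is.na(vars)
pairwise_n <- crossprod(present + 0L)

n_complete
pairwise_n

# Kendall correlations under each scheme
cor_listwise <- stats::cor(vars, use = "complete.obs", method = "kendall")
cor_pairwise <- stats::cor(vars, use = "pairwise.complete.obs", method = "kendall")
```

The diagonal of `pairwise_n` gives the number of non-missing values per variable, and the off-diagonal entries give the sample size behind each pairwise correlation, which is never smaller than `n_complete`.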

## Types of statistics supported

| Functions | Description | Parametric | Non-parametric | Robust | Bayes Factor |
| --- | --- | --- | --- | --- | --- |
| ggbetweenstats | Between group/condition comparisons | $\checkmark$ | $\checkmark$ | $\checkmark$ | $\checkmark$ |
| gghistostats | Distribution of a numeric variable | $\checkmark$ | $\checkmark$ | $\checkmark$ | $\checkmark$ |
| ggcorrmat | Correlation matrix | $\checkmark$ | $\checkmark$ | $\checkmark$ | $\times$ |
| ggscatterstats | Correlation between two variables | $\checkmark$ | $\checkmark$ | $\checkmark$ | $\checkmark$ |
| ggpiestats | Association between categorical variables | $\checkmark$ | $\times$ | $\times$ | $\times$ |
| ggcoefstats | Regression model coefficients | $\checkmark$ | $\times$ | $\checkmark$ | $\times$ |

## Types of statistical tests supported

| Functions | Type | Test | Effect size | 95% CI available? |
| --- | --- | --- | --- | --- |
| ggbetweenstats | Parametric | Independent samples t-test | Cohen’s d, Hedges’ g | $\checkmark$ |
| ggbetweenstats | Parametric | One-way ANOVA | $\eta^2_p, \omega^2_p$ | $\checkmark$ |
| ggbetweenstats | Robust | Yuen’s test for trimmed means | $\xi$ | $\checkmark$ |
| ggbetweenstats | Robust | Heteroscedastic one-way ANOVA for trimmed means | $\xi$ | $\checkmark$ |
| ggpiestats | Parametric | Pearson’s $\chi^2$ test | Cramér’s V | $\checkmark$ |
| ggpiestats | Parametric | McNemar’s test | odds ratio (OR) | $\checkmark$ |
| ggscatterstats/ggcorrmat | Parametric | Pearson’s r | r | $\checkmark$ |
| ggscatterstats/ggcorrmat | Non-parametric | Spearman’s $\rho$ | $\rho$ | $\checkmark$ |
| ggscatterstats/ggcorrmat | Robust | Percentage bend correlation | r | $\checkmark$ |
| gghistostats | Parametric | One-sample t-test | Cohen’s d | $\checkmark$ |
| gghistostats | Non-parametric | One-sample Wilcoxon signed rank test | Cohen’s d | $\times$ |
| gghistostats | Robust | One-sample percentile bootstrap | robust estimator | $\checkmark$ |
| ggcoefstats | Parametric | Regression models | $\beta$ | $\checkmark$ |
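
As a concrete reading of the first row of the table, the sketch below computes Cohen's d and Hedges' g for two independent groups in base R. It uses the built-in mtcars data and the textbook pooled-standard-deviation formulas with a common small-sample correction; these choices are assumptions made for illustration and may differ from what ggstatsplot computes internally.

```r
# Illustration only: Cohen's d and Hedges' g for two independent groups,
# using the built-in `mtcars` data (mpg by transmission type).
x <- mtcars$mpg[mtcars$am == 1]  # manual transmission
y <- mtcars$mpg[mtcars$am == 0]  # automatic transmission
n1 <- length(x)
n2 <- length(y)

# pooled standard deviation of the two groups
sd_pooled <- sqrt(
  ((n1 - 1) * stats::var(x) + (n2 - 1) * stats::var(y)) / (n1 + n2 - 2)
)

# Cohen's d: standardized mean difference
cohens_d <- (mean(x) - mean(y)) / sd_pooled

# Hedges' g: d shrunk by an approximate small-sample correction factor
correction <- 1 - 3 / (4 * (n1 + n2) - 9)
hedges_g <- cohens_d * correction

c(d = cohens_d, g = hedges_g)
```

Because the correction factor is below 1, g is always slightly smaller in magnitude than d, and the two converge as the sample size grows.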

## Statistical variation

One of the important functions of a plot is to show the variation in the data, which comes in two forms:

• Measurement noise: In ggstatsplot, the actual variation in measurements is shown by plotting a combination of (jittered) raw data points with a boxplot laid on top (Figure-1) or a histogram (Figure-5). None of the plots that show the empirical distribution of the data display the sample standard deviation, because it conveys little about the limits of the sample or the presence of outliers (Cleveland, 1985, p.220).

    # for reproducibility
    set.seed(123)

    # plot
    ggstatsplot::gghistostats(
      data = morley,
      x = Speed,
      test.value = 792,
      test.value.line = TRUE,
      bf.message = TRUE,
      xlab = "Speed of light (km/sec, with 299000 subtracted)",
      title = "Figure-5: Distribution of Speed of light",
      caption = "Note: Data collected across 5 experiments (20 measurements each)",
      messages = FALSE
    )

• Sample-to-sample statistic variation: Although this variation has traditionally been shown using the standard error of the statistic (e.g., the standard error of the mean, SEM), ggstatsplot plots use 95% confidence intervals instead (e.g., Figure-6). This is because the interval formed by standard-error bars corresponds to roughly a 68% confidence interval, which is not a particularly interesting interval (Cleveland, 1985, p.222-225).

    # for reproducibility
    set.seed(123)

    # plot
    ggstatsplot::ggcoefstats(
      x = lme4::lmer(
        total.fruits ~ nutrient + rack + (nutrient | popu / gen),
        data = lme4::Arabidopsis
      ),
      p.kr = FALSE
    )
    #> Computing p-values via Wald-statistics approximation (treating t as Wald z).
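
The relation between standard-error bars and confidence intervals can be checked numerically. The base-R sketch below uses the built-in morley data from Figure-5 and treats all 100 measurements as one independent sample, which is a simplifying assumption made only for illustration.

```r
# Illustration only: mean +/- 1 SEM versus a 95% confidence interval,
# using the built-in `morley` data.
speed <- morley$Speed
n <- length(speed)
m <- mean(speed)
sem <- stats::sd(speed) / sqrt(n)

# coverage of mean +/- 1 SEM under a normal approximation
coverage_1sem <- stats::pnorm(1) - stats::pnorm(-1)

# interval spanned by SEM error bars
sem_interval <- m + c(-1, 1) * sem

# t-based 95% confidence interval for the mean
ci_95 <- m + c(-1, 1) * stats::qt(0.975, df = n - 1) * sem

round(coverage_1sem, 3)
sem_interval
ci_95
```

The SEM bars cover only about two thirds of the sampling distribution, whereas the 95% interval is roughly twice as wide and matches the interval reported by `stats::t.test(speed)`.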

## Reporting results

The default setting in ggstatsplot is to produce plots with statistical details included. More often than not, the results are displayed as a subtitle in the plot. Great care has been taken in choosing which details are included in the statistical reporting and why.

1. APA guidelines (APA, 2009) are followed (for the most part) by default:
• Percentages are displayed with no decimal places (Figure-2).
• Correlations, t-tests, and chi-squared tests are reported with the degrees of freedom in parentheses and the significance level (Figure-2, Figure-3b, Figure-5).
• ANOVAs are reported with two degrees of freedom and the significance level (Figure-1).
• Regression results are presented with the unstandardized or standardized estimate (beta), whichever was specified by the user, along with the statistic (depending on the model, this can be a t or z statistic) and the corresponding significance level (Figure-6).
2. Default statistical tests:

3. Dealing with null results:

4. Avoiding the “p-value error”:
The p-value is the probability of obtaining results at least as extreme as those observed, assuming that the null hypothesis is true, and it can rarely be exactly 0. And yet over 97,000 manuscripts on Google Scholar report p = 0.000 (Lilienfeld et al., 2015), putatively because they rely on default computer output. All p-values displayed in ggstatsplot plots avoid this mistake: any p-value smaller than 0.001 is displayed as p < 0.001 (e.g., Figure-1). The package deems it unimportant how infinitesimally small the p-values are and instead emphasizes effect size magnitudes and their 95% CIs.
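
The reporting rule described above is easy to express in code. The helper below is a hypothetical base-R sketch (the name `format_p` and its `digits` argument are made up for illustration and are not part of ggstatsplot); it shows one way to avoid ever printing "p = 0.000".

```r
# Hypothetical helper (not part of ggstatsplot): format p-values so that
# very small values are reported as "p < 0.001" instead of "p = 0.000".
format_p <- function(p, digits = 3) {
  stopifnot(is.numeric(p), all(p >= 0 & p <= 1, na.rm = TRUE))
  # smallest value that can be shown with the requested number of decimals
  threshold <- 10^(-digits)
  ifelse(
    p < threshold,
    paste0("p < ", formatC(threshold, format = "f", digits = digits)),
    paste0("p = ", formatC(p, format = "f", digits = digits))
  )
}

format_p(c(0.2, 0.0004, 0))
```

Values at or above the threshold are printed with a fixed number of decimals, and everything below it collapses to a single inequality, which keeps the reader's attention on effect sizes rather than on the magnitude of tiny p-values.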

# Appendix

## Appendix A: Documentation

There are three main documents one can rely on to learn how to use ggstatsplot:

## Appendix B: Suggestions

If you find any bugs or have any suggestions or remarks, please file an issue on the GitHub repository for this package: https://github.com/IndrajeetPatil/ggstatsplot/issues

## Appendix C: Session information

Summarizing session information for reproducibility.

    options(width = 200)
    devtools::session_info()$platform
    #>  setting  value
    #>  version  R version 3.5.1 (2018-07-02)
    #>  os       Windows 10 x64
    #>  system   x86_64, mingw32
    #>  ui       RTerm
    #>  language (EN)
    #>  collate  English_United States.1252
    #>  ctype    English_United States.1252
    #>  tz       America/New_York
    #>  date     2018-10-21