(This publication is still a work in progress)
“What is to be sought in designs for the display of information is the clear portrayal of complexity. Not the complication of the simple; rather … the revelation of the complex.”
 Edward R. Tufte
The ggstatsplot package is an opinionated collection of plots made with ggplot2 and is designed for exploratory data analysis or for producing publication-ready statistical graphics. All plots share an underlying principle of displaying information-rich plots with all necessary statistical details included in the plots themselves. Although the plots produced by ggstatsplot are still ggplot objects and can thus be further modified using ggplot2 commands, there is a limit to how many such modifications can be made. That is, it is less flexible than ggplot2, but that’s a feature and not a bug. The original intent behind this package is to offload the struggles associated with constructing the plot and to focus more on the interpretation of the data displayed in the plot.
Graphical perception involves visual decoding of the encoded information in graphs. ggstatsplot incorporates the paradigm proposed in Cleveland (1985, Chapter 4) to make visual judgments about quantitative information effortless and almost instantaneous. Based on experiments, Cleveland proposes that there are ten elementary graphical-perception tasks that we perform to visually decode quantitative information in graphs (organized from most to least accurate; Cleveland, 1985, p. 254).
So the key principle of Cleveland’s paradigm for data display is:
“We should encode data on a graph so that the visual decoding involves [graphical-perception] tasks as high in the ordering as possible.”
For example, decoding the data point values in ggbetweenstats requires position judgments along a common scale (Figure 1):
# for reproducibility
set.seed(123)
# plot
ggstatsplot::ggbetweenstats(
  data = dplyr::sample_frac(ggstatsplot::movies_long, size = 0.5),
  x = genre,
  y = rating,
  title = "Figure 1: IMDB rating by film genre",
  xlab = "Genre",
  ylab = "IMDB rating (average)",
  ggtheme = hrbrthemes::theme_ipsum_tw(),
  ggstatsplot.layer = FALSE,
  outlier.tagging = TRUE,
  outlier.label = title,
  messages = FALSE
)
There are a few instances where ggstatsplot diverges from the recommendations made in Cleveland’s paradigm:
ggstatsplot uses pie charts (see Figure 2), which rely on angle judgments and are therefore less accurate than, e.g., bar graphs, which require position judgments. This shortcoming is assuaged to some degree by using plenty of labels that describe percentages for all slices. This makes angle judgment unnecessary and obviates any concerns about inaccurate judgments about percentages.
# for reproducibility
set.seed(123)
# plot
ggstatsplot::ggpiestats(
  data = ggstatsplot::movies_long,
  main = genre,
  condition = mpaa,
  title = "Figure 2: Distribution of MPAA ratings by film genre",
  legend.title = "layout",
  caption = "MPAA: Motion Picture Association of America",
  package = "ggsci",
  palette = "default_jama",
  messages = FALSE
)
#> Warning: `chr_along()` is soft-deprecated as of rlang 0.2.0.
#> This warning is displayed once per session.
A second divergence concerns the grouped_ variants of the functions (see Figure 3), which juxtapose subplots instead of superposing the data in a single plot. Note that the ranges of the Y-axes are no longer the same across juxtaposed subplots, and so visually comparing the data becomes difficult. On the other hand, in the superposed plot, all data have the same range, and coloring different parts makes the visual discrimination of different components of the data, and their comparison, easier. But the goal of the grouped_ variants is not only to show different aspects of the data but also to run statistical tests, and showing detailed results for all aspects of the data in a superposed plot is difficult. Therefore, this is a compromise ggstatsplot is comfortable with, at least to produce plots for quick exploration of different aspects of the data.
# for reproducibility
set.seed(123)
# plot
ggstatsplot::combine_plots(
  # plot 1: superposition
  ggplot2::ggplot(
    data = dplyr::filter(
      ggstatsplot::movies_long,
      genre == "Comedy" | genre == "Drama"
    ),
    mapping = ggplot2::aes(
      x = length,
      y = rating,
      color = genre
    )
  ) +
    ggplot2::geom_jitter(size = 3, alpha = 0.5) +
    ggplot2::geom_smooth(method = "lm") +
    ggplot2::labs(title = "superposition (recommended in Cleveland's paradigm)") +
    ggstatsplot::theme_ggstatsplot(),
  # plot 2: juxtaposition
  ggstatsplot::grouped_ggscatterstats(
    data = dplyr::filter(
      ggstatsplot::movies_long,
      genre == "Comedy" | genre == "Drama"
    ),
    x = length,
    y = rating,
    grouping.var = genre,
    marginal = FALSE,
    title.prefix = "Genre",
    title.text = "juxtaposition (`ggstatsplot` implementation in `grouped_` functions)",
    title.size = 12
  ),
  # combine for comparison
  title.text = "Two ways to compare different aspects of data",
  nrow = 2,
  labels = c("(a)", "(b)")
)
#> Warning: The output from `grouped_` functions are not `ggplot` objects and therefore can't be further modified with `ggplot2` functions.
#>
The grouped_ plots follow the Shrink Principle (Tufte, 2001, pp. 166-167) for high-information graphics, which dictates that the data density and the size of the data matrix can be maximized to exploit the maximum resolution of the available data-display technology. Given the high resolution afforded by most computer monitors today, saving grouped_ plots at an appropriate resolution ensures no loss in legibility when the graphics area is reduced.
Graphical excellence consists of communicating complex ideas with clarity and in a way that the viewer understands the greatest number of ideas in a short amount of time all the while not quoting the data out of context. The package follows the principles for graphical integrity (as outlined in Tufte, 2001):
The physical representation of numbers is proportional to the numerical quantities they represent (e.g., Figure 1 and Figure 2 show how means (in ggbetweenstats) or percentages (in ggpiestats) are proportional to the vertical distance or the area, respectively).
All important events in the data have clear, detailed, and thorough labeling (e.g., Figure 1 shows how ggbetweenstats labels means, sample size information, and outliers; the same can be appreciated for ggpiestats in Figure 2).
None of the plots have design variation (e.g., an abrupt change in scales) over the surface of the same graphic, because this can lead to a false impression about variation in the data.
The number of information-carrying dimensions never exceeds the number of dimensions in the data (e.g., area is not used to show one-dimensional data).
All plots are designed to have no chartjunk (like moiré vibrations, fake perspective, dark grid lines, etc.) (Tufte, 2001, Chapter 5).
There are some instances where ggstatsplot graphs don’t follow the principles of clean graphics, as formulated in the Tufte theory of data graphics (Tufte, 2001, Chapter 4). The theory has four key principles:
In particular, default plots in ggstatsplot can sometimes violate one of principles 2-4. According to these principles, every bit of ink should have a reason for its inclusion in the graphic and should convey some new information to the viewer. If not, such ink should be removed. One instance of this is the bilateral symmetry of data measures. For example, in Figure 1, we can see that both the box and violin plots are mirrored, which consumes twice the space in the graphic without adding any new information. But this redundancy is tolerated for the sake of the beauty that such symmetrical shapes can bring to the graphic. Even Tufte admits that efficiency is but one consideration in the design of statistical graphics (Tufte, 2001, p. 137). Additionally, these principles were formulated in an era in which computer graphics had yet to revolutionize the ease with which graphics could be produced, and thus some of the concerns about minimizing data-ink for easier production of graphics are not as relevant as they were.
As an extension of ggplot2, ggstatsplot has the same expectations about the structure of the data. More specifically,
The data should be an object of class data.frame (a tibble will also work).
The data should be organized following the principles of tidy data, which specify how statistical structure of a data frame (variables and observations) should be mapped to physical structure (columns and rows). More specifically, tidy data means all variables have their own columns and each row corresponds to a unique observation (Wickham, 2014).
All ggstatsplot functions remove NAs from the variables of interest (similar to ggplot2; Wickham, 2016, p. 207) and display the total sample size (n) in the subtitle to inform the user/reader about the number of observations included in both the statistical analysis and the visualization. But when sample sizes differ across tests in the same function, ggstatsplot makes an effort to inform the user of this (Figure 4).
# creating a new dataset without any NAs in variables of interest
msleep_no_na <-
  dplyr::filter(
    .data = ggplot2::msleep,
    !is.na(sleep_rem),
    !is.na(awake),
    !is.na(brainwt),
    !is.na(bodywt)
  )
# variable names vector
var_names <- c(
  "REM sleep",
  "time awake",
  "brain weight",
  "body weight"
)
# combining two plots
ggstatsplot::combine_plots(
  # plot *without* any NAs
  ggstatsplot::ggcorrmat(
    data = msleep_no_na,
    corr.method = "kendall",
    sig.level = 0.001,
    p.adjust.method = "holm",
    cor.vars = c(sleep_rem, awake:bodywt),
    cor.vars.names = var_names,
    matrix.type = "upper",
    colors = c("#B2182B", "white", "#4D4D4D"),
    title = "Correlogram for mammals sleep dataset",
    subtitle = "sleep units: hours; weight units: kilograms",
    messages = FALSE
  ),
  # plot *with* NAs
  ggstatsplot::ggcorrmat(
    data = ggplot2::msleep,
    corr.method = "kendall",
    sig.level = 0.001,
    p.adjust.method = "holm",
    cor.vars = c(sleep_rem, awake:bodywt),
    cor.vars.names = var_names,
    matrix.type = "upper",
    colors = c("#B2182B", "white", "#4D4D4D"),
    title = "Correlogram for mammals sleep dataset",
    subtitle = "sleep units: hours; weight units: kilograms",
    messages = FALSE
  ),
  labels = c("(a)", "(b)"),
  nrow = 1
)
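The data expectations described above (tidy layout, removal of NAs, and sample sizes that can differ between analyses) can be sketched in base R. This is only an illustration on a small made-up data frame; the object names (wide, long, n_obs, n_pairwise) are hypothetical and it does not reproduce how ggstatsplot handles missing values internally.

```r
# Hypothetical wide-format data: one row per subject, one column per condition
wide <- data.frame(
  subject = c("s1", "s2", "s3", "s4"),
  pre     = c(5.1, 4.8, NA, 6.0),
  post    = c(5.9, NA, 6.1, 6.4)
)

# Tidy (long) layout: each variable is a column, each row is one observation
long <- data.frame(
  subject   = rep(wide$subject, times = 2),
  condition = rep(c("pre", "post"), each = nrow(wide)),
  score     = c(wide$pre, wide$post)
)

# Remove NAs in the variable of interest and record the sample size (n)
long_complete <- long[!is.na(long$score), ]
n_obs <- nrow(long_complete)

# An analysis that needs both measurements per subject (e.g., a correlation)
# uses fewer observations, so its n differs from the one above
n_pairwise <- sum(!is.na(wide$pre) & !is.na(wide$post))

print(c(n_obs = n_obs, n_pairwise = n_pairwise))
```

Reporting both counts next to the results is what makes it visible that two analyses of the same dataset rest on different numbers of observations.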
| Functions | Description | Parametric | Non-parametric | Robust | Bayes Factor |
|---|---|---|---|---|---|
| ggbetweenstats | Between group/condition comparisons | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) |
| gghistostats | Distribution of a numeric variable | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) |
| ggcorrmat | Correlation matrix | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | \(\times\) |
| ggscatterstats | Correlation between two variables | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) |
| ggpiestats | Association between categorical variables | \(\checkmark\) | \(\times\) | \(\times\) | \(\times\) |
| ggcoefstats | Regression model coefficients | \(\checkmark\) | \(\times\) | \(\checkmark\) | \(\times\) |

| Functions | Type | Test | Effect size | 95% CI available? |
|---|---|---|---|---|
| ggbetweenstats | Parametric | Independent samples t-test | Cohen’s d, Hedges’ g | \(\checkmark\) |
| ggbetweenstats | Parametric | One-way ANOVA | \(\eta_p^2, \omega_p^2\) | \(\checkmark\) |
| ggbetweenstats | Robust | Yuen’s test for trimmed means | \(\xi\) | \(\checkmark\) |
| ggbetweenstats | Robust | Heteroscedastic one-way ANOVA for trimmed means | \(\xi\) | \(\checkmark\) |
| ggpiestats | Parametric | Pearson’s \(\chi^2\) test | Cramer’s V | \(\checkmark\) |
| ggpiestats | Parametric | McNemar’s test | odds ratio (OR) | \(\checkmark\) |
| ggscatterstats / ggcorrmat | Parametric | Pearson’s r | r | \(\checkmark\) |
| ggscatterstats / ggcorrmat | Non-parametric | Spearman’s \(\rho\) | \(\rho\) | \(\checkmark\) |
| ggscatterstats / ggcorrmat | Robust | Percentage bend correlation | r | \(\checkmark\) |
| gghistostats | Parametric | One-sample t-test | Cohen’s d | \(\checkmark\) |
| gghistostats | Non-parametric | One-sample Wilcoxon signed rank test | Cohen’s d | \(\times\) |
| gghistostats | Robust | One-sample percentile bootstrap | robust estimator | \(\checkmark\) |
| ggcoefstats | Parametric | Regression models | \(\beta\) | \(\checkmark\) |

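To make the first row of the table above concrete, the following base R sketch computes Cohen’s d and Hedges’ g for two independent samples. It uses the textbook pooled-standard-deviation formula and the common approximate small-sample correction; the function names cohens_d and hedges_g are hypothetical helpers written for this illustration, and the package itself may compute these quantities differently.

```r
# Cohen's d: standardized mean difference using the pooled standard deviation
cohens_d <- function(x, y) {
  nx <- length(x)
  ny <- length(y)
  pooled_sd <- sqrt(
    ((nx - 1) * stats::var(x) + (ny - 1) * stats::var(y)) / (nx + ny - 2)
  )
  (mean(x) - mean(y)) / pooled_sd
}

# Hedges' g: Cohen's d multiplied by an approximate small-sample correction
hedges_g <- function(x, y) {
  df <- length(x) + length(y) - 2
  cohens_d(x, y) * (1 - 3 / (4 * df - 1))
}

# Made-up example data
group_a <- c(2, 4, 6)
group_b <- c(1, 3, 5)

d <- cohens_d(group_a, group_b)
g <- hedges_g(group_a, group_b)
print(c(d = d, g = g))
```

With samples this small the correction is noticeable; as the sample sizes grow, g approaches d.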
One of the important functions of a plot is to show the variation in the data, which comes in two forms:
In ggstatsplot, the actual variation in measurements is shown by plotting a combination of (jittered) raw data points with a boxplot laid on top (Figure 1) or a histogram (Figure 5). None of the plots concerned with the empirical distribution of the data show the sample standard deviation, because it is poor at conveying information about the limits of the sample and the presence of outliers (Cleveland, 1985, p. 220).
# for reproducibility
set.seed(123)
# plot
ggstatsplot::gghistostats(
  data = morley,
  x = Speed,
  test.value = 792,
  test.value.line = TRUE,
  bf.message = TRUE,
  xlab = "Speed of light (km/sec, with 299000 subtracted)",
  title = "Figure 5: Distribution of Speed of light",
  caption = "Note: Data collected across 5 experiments (20 measurements each)",
  messages = FALSE
)
To show the sample-to-sample variation in a statistic, ggstatsplot plots use 95% confidence intervals (e.g., Figure 6) instead of standard-error bars. This is because the interval formed by such error bars corresponds to a 68% confidence interval, which is not a particularly interesting interval (Cleveland, 1985, pp. 222-225).
# for reproducibility
set.seed(123)
# plot
ggstatsplot::ggcoefstats(
  x = lme4::lmer(
    total.fruits ~ nutrient + rack + (nutrient | popu / gen),
    data = lme4::Arabidopsis
  ),
  p.kr = FALSE
)
#> Computing p-values via Wald-statistics approximation (treating t as Wald z).
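The contrast between error bars and 95% confidence intervals discussed above can be checked with a few lines of base R. This sketch assumes the error bars span one standard error of the mean on each side and uses the large-sample normal approximation for their coverage; the variable names are illustrative only, and the data are the morley speed-of-light measurements shipped with R.

```r
# Speed-of-light measurements from the built-in morley dataset
speed <- datasets::morley$Speed
n <- length(speed)
m <- mean(speed)
sem <- stats::sd(speed) / sqrt(n)

# Interval spanned by error bars of one standard error around the mean
sem_bar <- c(lower = m - sem, upper = m + sem)

# Coverage of a +/- 1 SE interval under the normal approximation (about 68%)
sem_coverage <- stats::pnorm(1) - stats::pnorm(-1)

# 95% confidence interval for the mean based on the t distribution
t_crit <- stats::qt(0.975, df = n - 1)
ci_95 <- c(lower = m - t_crit * sem, upper = m + t_crit * sem)

print(round(sem_coverage, 3))
print(rbind(sem_bar, ci_95))
```

The 95% interval is roughly twice as wide as the standard-error bar, which is why the two should not be read interchangeably.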
The default setting in ggstatsplot is to produce plots with statistical details included. More often than not, the results are displayed as a subtitle in the plot. Great care has been taken in deciding which details are included in the statistical reporting and why.
Default statistical tests:
Dealing with null results:
Avoiding the “p-value error”:
The p-value is the probability of observing data at least as extreme as those obtained if the null hypothesis were true, and it can rarely be exactly 0. And yet over 97,000 manuscripts on Google Scholar report the p-value as p = 0.000 (Lilienfeld et al., 2015), putatively due to relying on default computer outputs. All p-values displayed in ggstatsplot plots avoid this mistake: any p-value smaller than 0.001 is displayed as p < 0.001 (e.g., Figure 1). The package deems it unimportant how infinitesimally small the p-values are and instead puts emphasis on the effect size magnitudes and their 95% CIs.
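The reporting rule just described can be sketched as a small base R helper. The function name format_p_value and its arguments are hypothetical and written only for this illustration; this is not the formatting code used inside ggstatsplot.

```r
# Format p-values for display, never printing "p = 0.000"
format_p_value <- function(p, threshold = 0.001, digits = 3) {
  stopifnot(is.numeric(p), all(p >= 0 & p <= 1, na.rm = TRUE))
  ifelse(
    p < threshold,
    # values below the threshold are reported as an inequality
    paste0("p < ", format(threshold, scientific = FALSE)),
    # all other values are reported with a fixed number of decimals
    paste0("p = ", formatC(p, format = "f", digits = digits))
  )
}

p_labels <- format_p_value(c(0.00000012, 0.0312, 0.5))
print(p_labels)
```

Capping the display at the threshold keeps the label honest about precision while leaving room in the subtitle for the effect size and its confidence interval.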
There are three main documents one can rely on to learn how to use ggstatsplot:
Manual:
The CRAN reference manual provides detailed documentation about the arguments for each function, along with examples: https://cran.r-project.org/web/packages/ggstatsplot/ggstatsplot.pdf
README:
The GitHub README document provides a quick summary of all available functionality without going into too much detail: https://github.com/IndrajeetPatil/ggstatsplot/blob/master/README.md
Vignettes:
Vignettes contain probably the most detailed exposition. Every single function in ggstatsplot has an associated vignette which describes in depth how to use the function and modify the defaults to customize the plot to your liking. All these vignettes can be accessed from the package website: https://indrajeetpatil.github.io/ggstatsplot/articles/
If you find any bugs or have any suggestions/remarks, please file an issue on the GitHub repository for this package: https://github.com/IndrajeetPatil/ggstatsplot/issues
Summarizing session information for reproducibility.
options(width = 200)
devtools::session_info()$platform
#>  setting  value
#>  version  R version 3.5.1 (2018-07-02)
#>  os       Windows 10 x64
#>  system   x86_64, mingw32
#>  ui       RTerm
#>  language (EN)
#>  collate  English_United States.1252
#>  ctype    English_United States.1252
#>  tz       America/New_York
#>  date     2018-10-21