vignettes/web_only/ggscatterstats.Rmd
ggscatterstats.Rmd
The function ggscatterstats
is meant to provide a publicationready scatterplot with all statistical details included in the plot itself to show association between two continuous variables. This function is also helpful during the data exploration phase. We will see examples of how to use this function in this vignette with the ggplot2movies
dataset.
To begin with, here are some instances where you would want to use ggscatterstats

Note before: The following demo uses the pipe operator (%>%
), so in case you are not familiar with this operator, here is a good explanation: http://r4ds.had.co.nz/pipes.html
ggscatterstats
To illustrate how this function can be used, we will rely on the ggplot2movies
dataset. This dataset provides information about movies scraped from IMDB. Specifically, we will be using cleaned version of this dataset included in the ggstatsplot
package itself.
library(ggstatsplot)
# see the selected data (we have data from 1813 movies)
dplyr::glimpse(x = ggstatsplot::movies_wide)
#> Rows: 1,579
#> Columns: 13
#> $ title <chr> "'Til There Was You", "10 Things I Hate About You", "100 Mil~
#> $ year <int> 1997, 1999, 2002, 2004, 1999, 2001, 1972, 2003, 1999, 2000, ~
#> $ length <int> 113, 97, 98, 98, 102, 120, 180, 107, 101, 99, 129, 124, 93, ~
#> $ budget <dbl> 23.0, 16.0, 1.1, 37.0, 85.0, 42.0, 4.0, 76.0, 6.0, 26.0, 12.~
#> $ rating <dbl> 4.8, 6.7, 5.6, 6.4, 6.1, 6.1, 7.3, 5.1, 5.4, 2.5, 7.6, 8.0, ~
#> $ votes <int> 799, 19095, 181, 7859, 14344, 10866, 1754, 9556, 4514, 2023,~
#> $ mpaa <fct> PG13, PG13, R, PG13, R, R, PG, PG13, R, R, R, R, R, R, P~
#> $ Action <int> 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, ~
#> $ Animation <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
#> $ Comedy <int> 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, ~
#> $ Drama <int> 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, ~
#> $ Romance <int> 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, ~
#> $ NumGenre <int> 2, 2, 1, 3, 1, 1, 1, 1, 2, 1, 2, 1, 1, 1, 1, 2, 3, 2, 2, 1, ~
Now that we have a clean dataset, we can start asking some interesting questions. For example, let’s see if the average IMDB rating for a movie has any relationship to its budget. Additionally, let’s also see which movies had a high budget but low IMDB rating by labeling those data points.
To reduce the processing time, let’s only work with 30% of the dataset.
# for reproducibility
set.seed(123)
# to speed up the calculation, let's use only 10% of the data
movies_10 < dplyr::sample_frac(tbl = movies_long, size = 0.1)
# plot
ggstatsplot::ggscatterstats(
data = movies_10, # dataframe from which variables are taken
x = budget, # predictor/independent variable
y = rating, # dependent variable
xlab = "Budget (in millions of US dollars)", # label for the xaxis
ylab = "Rating on IMDB", # label for the yaxis
label.var = title, # variable to use for labeling data points
label.expression = rating < 5 & budget > 75, # expression for deciding which points to label
point.label.args = list(alpha = 0.7, size = 4, color = "grey50"),
marginal.type = "density", # type of plot for marginal distribution
xfill = "#CC79A7", # fill for marginals on the xaxis
yfill = "#009E73", # fill for marginals on the yaxis
title = "Relationship between movie budget and IMDB rating",
caption = "Source: www.imdb.com"
)
There is indeed a small, but significant, positive correlation between the amount of money studio invests in a movie and the ratings given by the audiences.
Important:
In contrast to all other functions in this package, the ggscatterstats
function returns object that is not further modifiable with ggplot2
. This can be avoided by not plotting the marginal distributions (marginal = FALSE
). Currently trying to find a workaround this problem.
grouped_ggscatterstats
What if we want to do the same analysis do the same analysis for movies with different MPAA (Motion Picture Association of America) film ratings (NC17, PG, PG13, R)?
ggstatsplot
provides a special helper function for such instances: grouped_ggstatsplot
. This is merely a wrapper function around ggstatsplot::combine_plots
. It applies ggstatsplot
across all levels of a specified grouping variable and then combines list of individual plots into a single plot. Note that the grouping variable can be anything: conditions in a given study, groups in a study sample, different studies, etc.
Let’s see how we can use this function to apply ggscatterstats
for all MPAA ratings. Also, let’s run a robust test this time.
# for reproducibility
set.seed(123)
# to speed up the calculation, let's use only 20% of the data
# also since there are only 7 movies with NC17 ratings, leave them out
movies_20 <
ggstatsplot::movies_wide %>%
dplyr::filter(mpaa != "NC17") %>%
dplyr::sample_frac(tbl = ., size = 0.2)
# plot
ggstatsplot::grouped_ggscatterstats(
# arguments relevant for ggstatsplot::ggscatterstats
data = movies_10,
x = budget,
y = rating,
type = "r",
grouping.var = mpaa,
label.var = title,
label.expression = rating < 5 & budget < 20,
marginal.type = "boxplot",
ggtheme = ggthemes::theme_tufte(),
ggstatsplot.layer = FALSE,
# arguments relevant for ggstatsplot::combine_plots
annotation.args = list(
title = "Relationship between movie budget and IMDB rating",
caption = "Source: www.imdb.com"
),
plotgrid.args = list(nrow = 3, ncol = 1)
)
As seen from the plot, this analysis has revealed something interesting: The relationship we found between budget and IMDB rating holds only for PG13 and Rrated movies.
ggscatterstats
+ purrr
Although this is a quick and dirty way to explore large amount of data with minimal effort, it does come with an important limitation: reduced flexibility. For example, if we wanted to add, let’s say, a separate type of marginal distribution plot for each MPAA rating or if we wanted to use different types of correlations across different levels of MPAA ratings (NC17 has only 6 movies, so a robust correlation would be a good idea), this is not possible. But this can be easily done using purrr
.
See the associated vignette here: https://indrajeetpatil.github.io/ggstatsplot/articles/web_only/purrr_examples.html
Following tests are carried out for each type of analyses. Additionally, the correlation coefficients (and their confidence intervals) are used as effect sizes
Type  Test  CI?  Function 

Parametric  Pearson’s correlation coefficient  ✅  correlation::correlation 
Nonparametric  Spearman’s rank correlation coefficient  ✅  correlation::correlation 
Robust  Winsorized Pearson correlation coefficient  ✅  correlation::correlation 
Bayes Factor  Pearson’s correlation coefficient  Yes  correlation::correlation 
If you wish to include statistical analysis results in a publication/report, the ideal reporting practice will be a hybrid of two approaches:
the ggstatsplot
approach, where the plot contains both the visual and numerical summaries about a statistical model, and
the standard narrative approach, which provides interpretive context for the reported statistics.
For example, let’s see the following example:
The ggstatsplot
reporting 
ggscatterstats(mtcars, qsec, drat)
The narrative context (assuming type = "parametric"
) can complement this plot either as a figure caption or in the main text
Pearson’s correlation test revealed that, across 32 cars, a measure of acceleration (1/4 mile time;
qsec
) was positively correlated with rear axle ratio (drat
), but this effect was not statistically significant. The effect size \((r = 0.09)\) was small, as per Cohen’s (1988) conventions. The Bayes Factor for the same analysis revealed that the data were 3.32 times more probable under the null hypothesis as compared to the alternative hypothesis. This can be considered moderate evidence (Jeffreys, 1961) in favor of the null hypothesis (of absence of any correlation between these two variables).
To see how the effect sizes displayed in these tests can be interpreted, see: https://indrajeetpatil.github.io/ggstatsplot/articles/web_only/effsize_interpretation.html
If you find any bugs or have any suggestions/remarks, please file an issue on GitHub: https://github.com/IndrajeetPatil/ggstatsplot/issues
For details, see https://indrajeetpatil.github.io/ggstatsplot/articles/web_only/session_info.html