Introduction to {ggstatsplot}: {ggplot2} Plots with Statistics

.title[
# Introduction to <code>{ggstatsplot}</code>: <code>{ggplot2}</code> Plots with Statistics
]
.author[
### Indrajeet Patil
]

---

# Plan

---

- Why `ggstatsplot`?

- Primary functions

- Customizability

- Benefits

- Misconceptions

- Limitations

---

# Why *ggstatsplot*?

---

# Raison d'être

---

.content-box-yellow[
In short, `ggstatsplot` returns 
 
📊 .blue[information-rich] plots with .blue[statistical details], which are 
 
📝 suitable for .blue[faster] (exploratory) data analysis and scholarly reports
]

---

# Simpler/faster data analysis workflow

---

In a typical *exploratory* data analysis workflow, .blue[data visualization] and .blue[statistical modeling] are two
different phases: visualization informs modeling, and modeling can suggest a
different visualization, and so on and so forth.

💡 The central idea of `ggstatsplot` is simple: combine these two phases into one!

---

# Information-rich graphic is worth a thousand words

---

.footnote[[(Matejka & Fitzmaurice, *Autodesk Research*, 2017)](https://www.autodeskresearch.com/publications/samestats)]
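
The same point can be sketched with Anscombe's quartet, which ships with base R as `anscombe`. It is used here as a stand-in for the data in the cited paper, not as a reproduction of it: four datasets whose summary statistics are nearly identical but whose plots look nothing alike.

``` r
# Anscombe's quartet: columns x1..x4 and y1..y4 in a single data frame
idx <- 1:4

# Summary statistics for each of the four x-y pairs
summaries <- data.frame(
  dataset = idx,
  mean_y  = sapply(idx, function(i) mean(anscombe[[paste0("y", i)]])),
  sd_y    = sapply(idx, function(i) sd(anscombe[[paste0("y", i)]])),
  r_xy    = sapply(idx, function(i) {
    cor(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]])
  })
)

# The statistics agree to about two decimal places...
print(round(summaries, 2))

# ...but the scatterplots tell four different stories
op <- par(mfrow = c(2, 2))
for (i in idx) {
  plot(
    anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]],
    xlab = paste0("x", i), ylab = paste0("y", i)
  )
}
par(op)
```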

---

# Ready-made plot = no customization

The .blue[grammar of graphics] is a powerful framework [(Wilkinson, 2011)](https://www.google.com/books/edition/_/iI1kcgAACAAJ?hl=en&sa=X&ved=2ahUKEwiGl8rJ2KztAhWyElkFHa8NAvkQre8FMBR6BAgMEAc) and can help you make *any* graphics fitting your specific data visualization needs! But...

---

# And a LOT more!
...but we will come back to that later 📌

Let's get started first!

---

# Installation

Install the stable version of `ggstatsplot` from 
[CRAN](https://cran.r-project.org/web/packages/ggstatsplot/index.html):

``` r
install.packages("ggstatsplot")
```
--

You can get the development version of the package from
[GitHub](https://github.com/IndrajeetPatil/ggstatsplot):

``` r
remotes::install_github("IndrajeetPatil/ggstatsplot")
```

Load the needed packages:

``` r
library(ggstatsplot)
library(ggplot2)
```

---

# Primary functions

---

# Hypothesis about group differences

---

# ggbetweenstats - For between group comparisons

---

``` r
ggbetweenstats(
  data = movies_long,
  x = mpaa,
  y = rating
)
```

- *t*-test if `2` groups
- ANOVA if `> 2` groups

✏️ .blue[Defaults] return

✅ raw data + distributions 
✅ descriptive statistics 
✅ inferential statistics 
✅ effect size + CIs 
✅ pairwise comparisons 
✅ Bayesian hypothesis-testing 
✅ Bayesian estimation

.right-plot[
<img src="ggstatsplot_presentation_files/figure-html/ggbetweenstats_1-1.png" width="100%" style="display: block; margin: auto;" />
]
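
The rule above (a *t*-test for two groups, ANOVA for more) can be sketched in base R. This is only an illustration of the decision, not the package's code; the helper `compare_groups()` is hypothetical, and the Welch variants are used because they are R's defaults for `t.test()` and `oneway.test()`.

``` r
# Pick a parametric test from the number of groups (a sketch, not package code)
compare_groups <- function(data, x, y) {
  groups <- factor(data[[x]])
  values <- data[[y]]
  if (nlevels(groups) == 2) {
    t.test(values ~ groups)        # Welch's t-test (R's default)
  } else {
    oneway.test(values ~ groups)   # Welch's one-way ANOVA (R's default)
  }
}

two_groups   <- compare_groups(mtcars, x = "am",  y = "wt")  # am has 2 levels
three_groups <- compare_groups(mtcars, x = "cyl", y = "wt")  # cyl has 3 levels

two_groups$method
three_groups$method
```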

---

# ggwithinstats - repeated measures equivalent

---

``` r
ggwithinstats(
  data = WRS2::WineTasting,
  x = Wine,
  y = Taste
)
```

Changing the `type` of test

✅ `"p"` &nbsp;&nbsp;→ **parametric** 
✅ `"np"` → **non-parametric** 
✅ `"r"` &nbsp;&nbsp;→ **robust** 
✅ `"bf"` → **Bayesian**

.right-plot[
<img src="ggstatsplot_presentation_files/figure-html/ggwithinstats_1-1.png" width="100%" style="display: block; margin: auto;" />
]

---

# gghistostats - Distribution of a numeric variable

---

``` r
gghistostats(
  data = movies_long,
  x = budget,
* test.value = 30
)
```

✅ counts + proportion for bins 
✅ descriptive statistics 
✅ inferential statistics 
✅ effect size + CIs 
✅ Bayesian hypothesis-testing 
✅ Bayesian estimation

Changing the `type` of test

✅ `"p"` &nbsp;&nbsp;→ **parametric** 
✅ `"np"` → **non-parametric** 
✅ `"r"` &nbsp;&nbsp;→ **robust** 
✅ `"bf"` → **Bayesian**

.right-plot[
<img src="ggstatsplot_presentation_files/figure-html/gghistostats_1-1.png" width="100%" style="display: block; margin: auto;" />
]
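
For a parametric analysis, `test.value` plays the role of the hypothesized mean in a one-sample test. A minimal base R sketch of that idea, using `mtcars` and a comparison value of 3 purely for illustration:

``` r
# One-sample t-test: does mean car weight differ from 3 (in 1000 lbs)?
res <- t.test(mtcars$wt, mu = 3)

res$statistic   # t statistic
res$parameter   # degrees of freedom (n - 1)
res$p.value     # two-sided p-value
res$conf.int    # 95% CI for the mean
```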

---

# ggdotplotstats - Labeled numeric variable

---

``` r
ggdotplotstats(
  data = movies_long,
  x = budget,
  y = genre,
* test.value = 30
)
```

✅ descriptive statistics 
✅ inferential statistics 
✅ effect size + CIs 
✅ Bayesian hypothesis-testing 
✅ Bayesian estimation

Changing the `type` of test

✅ `"p"` &nbsp;&nbsp;→ **parametric** 
✅ `"np"` → **non-parametric** 
✅ `"r"` &nbsp;&nbsp;→ **robust** 
✅ `"bf"` → **Bayesian**

.right-plot[
<img src="ggstatsplot_presentation_files/figure-html/ggdotplotstats_1-1.png" width="100%" style="display: block; margin: auto;" />
]

---

# Hypothesis about correlation

---

# ggscatterstats - Two numeric variables

---

``` r
ggscatterstats(
  data = movies_long,
  x = budget,
  y = rating
)
```

✅ joint distribution 
✅ marginal distributions 
✅ inferential statistics 
✅ effect size + CIs 
✅ Bayesian hypothesis-testing 
✅ Bayesian estimation

Changing the `type` of test

✅ `"p"` &nbsp;&nbsp;→ **parametric** 
✅ `"np"` → **non-parametric** 
✅ `"r"` &nbsp;&nbsp;→ **robust** 
✅ `"bf"` → **Bayesian**

.right-plot[
<img src="ggstatsplot_presentation_files/figure-html/ggscatterstats_1-1.png" width="100%" style="display: block; margin: auto;" />
]

---

# ggscatterstats - conditional point tagging

---

``` r
ggscatterstats(
  data = movies_long,
  x = budget,
  y = rating,
  type = "r",
* label.var = title,
* label.expression = budget > 150
* & rating > 7.5
)
```

.right-plot[
<img src="ggstatsplot_presentation_files/figure-html/ggscatterstats_2-1.png" width="100%" style="display: block; margin: auto;" />
]
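
To see which points a condition like this would tag, the same logical expression can be applied to the data directly. A small sketch with base R's `subset()`, assuming `ggstatsplot` is installed so that `movies_long` is available:

``` r
library(ggstatsplot) # provides the movies_long dataset used in these slides

# Rows that satisfy the same condition as `label.expression` above
tagged <- subset(
  movies_long,
  budget > 150 & rating > 7.5,
  select = c(title, budget, rating)
)

tagged
```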

---

# ggcorrmat - multiple numeric variables

---

``` r
ggcorrmat(dplyr::starwars)
```

✅ effect size + significance 
✅ careful handling of `NA`s

Changing the `type` of test

✅ `"p"` &nbsp;&nbsp;→ **parametric** 
✅ `"np"` → **non-parametric** 
✅ `"r"` &nbsp;&nbsp;→ **robust** 
✅ `"bf"` → **Bayesian**

Partial correlations are also supported! 
Just set `partial = TRUE`.

.right-plot[
<img src="ggstatsplot_presentation_files/figure-html/ggcorrmat_1-1.png" width="100%" style="display: block; margin: auto;" />
]
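
What a partial correlation measures can be sketched in base R. The example below computes the correlation between `mpg` and `wt` after controlling for `hp` in two equivalent ways. The choice of variables is an assumption made for illustration, and this is not necessarily how the package computes it internally.

``` r
vars <- mtcars[, c("mpg", "wt", "hp")]

# Approach 1: correlate what is left of mpg and wt after regressing out hp
r_resid <- cor(
  resid(lm(mpg ~ hp, data = vars)),
  resid(lm(wt ~ hp, data = vars))
)

# Approach 2: rescale the inverse of the correlation matrix
precision <- solve(cor(vars))
r_matrix <- -precision["mpg", "wt"] /
  sqrt(precision["mpg", "mpg"] * precision["wt", "wt"])

# Both approaches give the same partial correlation
c(residuals = r_resid, matrix = r_matrix)
```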

---

# Hypothesis about composition of categorical variables

---

# ggpiestats - association between categorical variables

---

``` r
ggpiestats(
  data = dplyr::filter(
    movies_long,
    genre %in% c("Drama", "Comedy")
  ),
  x = mpaa,
  y = genre
)
```

✅ descriptive statistics 
✅ inferential statistics 
✅ effect size + CIs 
✅ Goodness-of-fit tests 
✅ Bayesian hypothesis-testing 
✅ Bayesian estimation

.right-plot[
<img src="ggstatsplot_presentation_files/figure-html/ggpiestats_2-1.png" width="100%" style="display: block; margin: auto;" />
]
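
The two kinds of tests listed here, association between two categorical variables and goodness of fit for a single one, can be sketched in base R. The example uses the built-in `Titanic` table rather than `movies_long`, and the plain (uncorrected) Cramér's V is an assumption about which effect size to show; the package may report a different variant.

``` r
# Sex x Survived contingency table from the base R Titanic data
tab <- margin.table(Titanic, c(2, 4))

# Test of association between the two categorical variables
assoc <- chisq.test(tab, correct = FALSE)

# Cramér's V as an effect size for the association
n <- sum(tab)
cramers_v <- sqrt(unname(assoc$statistic) / (n * (min(dim(tab)) - 1)))

# Goodness-of-fit test: are the levels of Survived equally frequent?
gof <- chisq.test(margin.table(Titanic, 4))

assoc$p.value
cramers_v
gof$p.value
```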

---

# ggbarstats - association between categorical variables

---

``` r
ggbarstats(
  data = dplyr::filter(
    movies_long,
    genre %in% c("Drama", "Comedy")
  ),
  x = mpaa,
  y = genre,
* label = "both"
)
```

✅ descriptive statistics 
✅ inferential statistics 
✅ effect size + CIs 
✅ Goodness-of-fit tests 
✅ Bayesian hypothesis-testing 
✅ Bayesian estimation 

.right-plot[
<img src="ggstatsplot_presentation_files/figure-html/ggbarstats_1-1.png" width="100%" style="display: block; margin: auto;" />
]

---

# Hypothesis about regression coefficients

---

# ggcoefstats

---

``` r
# model
mod <- lm(
 formula = rating ~ mpaa,
 data = movies_long
)

# plot
ggcoefstats(mod)
```

✅ estimate + CIs 
✅ inferential statistics (`$t$`, `$z$`, `$F$`, `$\chi^2$`) 
✅ model fit indices (AIC + BIC)

Supports all regression models supported in the [`{easystats}`](https://easystats.github.io/insight/reference/is_model_supported.html) ecosystem.

.right-plot[
<img src="ggstatsplot_presentation_files/figure-html/ggcoefstats_1-1.png" width="100%" style="display: block; margin: auto;" />
]
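
The quantities shown in such a plot can also be pulled out of a fitted model by hand. A base R sketch, using an `mtcars` model chosen only for illustration:

``` r
mod <- lm(mpg ~ wt + am, data = mtcars)

# Estimates with 95% confidence intervals, one row per model term
coefs <- cbind(estimate = coef(mod), confint(mod, level = 0.95))
coefs

# Test statistics and p-values for each term
summary(mod)$coefficients

# Model fit indices
c(AIC = AIC(mod), BIC = BIC(mod))
```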

---

# *grouped_* variants of all functions
Running the same function for 
all levels of a single grouping variable

---

# *grouped_* functions

---

``` r
grouped_ggpiestats(
  data = mtcars,
  x = cyl,
* grouping.var = am
)
```

.font70[
Available `grouped_` variants
- `grouped_ggbetweenstats`
- `grouped_ggwithinstats`
- `grouped_gghistostats`
- `grouped_ggdotplotstats`
- `grouped_ggscatterstats`
- `grouped_ggcorrmat`
- `grouped_ggpiestats`
- `grouped_ggbarstats`
]

.right-plot[
<img src="ggstatsplot_presentation_files/figure-html/grouped_1-1.png" width="100%" style="display: block; margin: auto;" />
]
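
Conceptually, a `grouped_` call repeats the same plot for each level of the grouping variable. A rough split-and-apply sketch of that idea follows; it illustrates the concept and is not a description of the package's internals.

``` r
library(ggstatsplot)

# One data frame per level of the grouping variable
by_am <- split(mtcars, mtcars$am)

# The same plotting call, repeated for each level
plots <- lapply(names(by_am), function(level) {
  ggpiestats(
    data  = by_am[[level]],
    x     = cyl,
    title = paste("am:", level)
  )
})

length(plots)
```

The `grouped_` functions save you this bookkeeping and also arrange the resulting plots in a single figure.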

---

# Customizability of *ggstatsplot*
"What if I don't like the default plots?" 🤔

---

# Changing aesthetics (themes + palettes) 🎨

---

Aesthetic preferences are not an excuse to avoid `ggstatsplot`! 😻

``` r
ggbetweenstats(
  data = movies_long,
  x = mpaa,
  y = rating,
* ggtheme = ggthemes::theme_economist(),
* palette = "Darjeeling2",
* package = "wesanderson"
)
```

.right-plot[
<img src="ggstatsplot_presentation_files/figure-html/ggbetweenstats_4-1.png" width="100%" style="display: block; margin: auto;" />
]

---

# Further modification with *ggplot2* 🛠

---

You can modify `ggstatsplot` plots further using `ggplot2` functions. 🎉

``` r
ggbetweenstats(
  data = mtcars,
  x = am,
  y = wt,
  type = "bayes"
) +
* scale_y_continuous(sec.axis = dup_axis())
```

.right-plot[
<img src="ggstatsplot_presentation_files/figure-html/ggbetweenstats_5-1.png" width="100%" style="display: block; margin: auto;" />
]

---

# Too much information 🙈

---

`ggstatsplot` can be used to get .blue[only plots].

``` r
ggbetweenstats(
  data = iris,
  x = Species,
  y = Sepal.Length,
  # turn off centrality measure
* centrality.plotting = FALSE,
  # turn off statistical analysis
* results.subtitle = FALSE,
  # turn off Bayesian message
* bf.message = FALSE,
  # turn off pairwise comparisons
* pairwise.display = "none"
)
```

.right-plot[
<img src="ggstatsplot_presentation_files/figure-html/only_plot-1.png" width="100%" style="display: block; margin: auto;" />
]

---

# Expressions for custom plots 🀄️

---

`ggstatsplot` can be used to get .blue[only expressions].

``` r
results <- ggpiestats(
 data = Titanic_full,
 x = Survived,
 y = Sex,
* output = "subtitle"
)

*ggiraphExtra::ggSpine(
  data = Titanic_full,
  aes(x = Sex, fill = Survived),
  addlabel = TRUE,
  interactive = FALSE
) +
* labs(subtitle = results)
```

.right-plot[
<img src="ggstatsplot_presentation_files/figure-html/subtitle_1-1.png" width="100%" style="display: block; margin: auto;" />
]

---

# Dataframes

---

[`statsExpressions`](https://indrajeetpatil.github.io/statsExpressions/), the statistical processing backend for `ggstatsplot`, can provide .blue[dataframes].

``` r
library(statsExpressions)

# for example
one_sample_test(
  data = mtcars,
  x = wt,
  test.value = 3
)
```
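
Because the result is a dataframe, individual details can be extracted like any other column. A short sketch, assuming the output has `statistic` and `p.value` columns (as in recent versions of the package) and cross-checking against base R:

``` r
library(statsExpressions)

res <- one_sample_test(data = mtcars, x = wt, test.value = 3)

# Individual details are ordinary columns (names assumed, see above)
res$statistic
res$p.value

# Cross-check the p-value against base R's one-sample t-test
base_res <- t.test(mtcars$wt, mu = 3)
all.equal(res$p.value, base_res$p.value)
```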

---

# Why use *ggstatsplot*? 👍️

---

# Supports different statistical approaches

Functions | Description | Parametric | Non-parametric | Robust | Bayesian
------- | ------------------ | ---- | ----- | ----| ----- 
`ggbetweenstats` | Between group comparisons | ✅ | ✅ | ✅ | ✅
`ggwithinstats` | Within group comparisons | ✅ | ✅ | ✅ | ✅
`gghistostats`, `ggdotplotstats` | Distribution of a numeric variable | ✅ | ✅ | ✅ | ✅
`ggcorrmat` | Correlation matrix | ✅ | ✅ | ✅ | ✅
`ggscatterstats` | Correlation between two variables | ✅ | ✅ | ✅ | ✅
`ggpiestats`, `ggbarstats` | Association between categorical variables | ✅ | `NA` | `NA` | ✅
`ggpiestats`, `ggbarstats` | Equal proportions for categorical variable levels | ✅ | `NA` | `NA` | ✅
`ggcoefstats` | Regression modeling | ✅ | ✅ | ✅ | ✅
`ggcoefstats` | Random-effects meta-analysis | ✅ | `NA` | ✅ | ✅

---

# Toggling between statistical approaches 🔀

**.blue[Parametric]**

``` r
# anova
ggbetweenstats(
  data = mtcars,
  x = cyl,
  y = wt,
* type = "p"
)

# correlation analysis
ggscatterstats(
  data = mtcars,
  x = wt,
  y = mpg,
* type = "p"
)

# t-test
gghistostats(
  data = mtcars,
  x = wt,
  test.value = 2,
* type = "p"
)
```

**.orange[Non-parametric]**

``` r
# Kruskal-Wallis test (non-parametric counterpart of anova)
ggbetweenstats(
  data = mtcars,
  x = cyl,
  y = wt,
* type = "np"
)

# correlation analysis
ggscatterstats(
  data = mtcars,
  x = wt,
  y = mpg,
* type = "np"
)

# one-sample Wilcoxon test (non-parametric counterpart of t-test)
gghistostats(
  data = mtcars,
  x = wt,
  test.value = 2,
* type = "np"
)
```

---

# Alternative workflow to the following

📦 for inferential statistics (e.g. `stats`) 
📦 computing effect size + CIs (e.g. `effectsize`) 
📦 for descriptives (e.g. `skimr`) 
📦 pairwise comparisons (e.g. `multcomp`) 
📦 Bayesian hypothesis testing (e.g. `BayesFactor`) 
📦 Bayesian estimation (e.g. `bayestestR`) 
📦 ... 

🤔 accepts dataframe, vectors, matrix? 
🤔 long/wide format data? 
🤔 works with `NA`s? 
🤔 returns list, dataframe, arrays? 
🤔 works with tibbles? 
🤔 has all necessary details? 
🤔 ...

---

# Results *in context* of the underlying data 🕵️

**Standard approach**

Pearson's correlation test revealed that, across 142 participants, variable `x`
was negatively correlated with variable `y`: `$t(140)=-0.76, p=.446$`. The
effect size `$(r=-0.06, 95\% CI [-.23,.10])$` was small, as per Cohen’s (1988)
conventions. The Bayes Factor for the same analysis revealed that the data were
5.81 times more probable under the null hypothesis than under the
alternative hypothesis. This can be considered moderate evidence
(Jeffreys, 1961) in favor of the null hypothesis (absence of any correlation
between `x` and `y`).

**`ggstatsplot` approach**

![](images/after_ggstats.PNG)

---

# Best practices in statistical reporting 🏆

![](images/stats_reporting_format.png)

---

# Avoiding reporting errors

.content-box-green[
 "half of all published psychology papers that use NHST contained at least one
*p*-value that was inconsistent with its test statistic and degrees of freedom.
One in eight papers contained a grossly inconsistent *p*-value that may have
affected the statistical conclusion"

[(Nuijten et al., *Behavior Research Methods*, 2016)](https://link.springer.com/article/10.3758/s13428-015-0664-2)
]

Since the plot and the statistical analysis are yoked together, the chances of
making an error in reporting the results are minimized.

No need to worry about updating figures and statistical details **separately**. 🔗
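
The kind of inconsistency described in the quote can be checked by recomputing a *p*-value from the reported test statistic and degrees of freedom. A minimal sketch using the values from the earlier "Standard approach" text; the helper `recompute_p()` and the tolerance of 0.005 are illustrative assumptions, not an established procedure.

``` r
# Recompute a two-sided p-value from a reported t statistic and its df
recompute_p <- function(statistic, df) {
  2 * pt(-abs(statistic), df = df)
}

# Values as reported earlier: t(140) = -0.76, p = .446
p_recomputed <- recompute_p(-0.76, df = 140)

# Allow a small tolerance, because the reported t was itself rounded
consistent <- abs(p_recomputed - 0.446) < 0.005

p_recomputed
consistent
```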

---

# Making sense of null results

`$p > 0.05$`: The null hypothesis (`H0`) can't be rejected

But can it be **accepted**?! Null Hypothesis Significance Testing (NHST) is silent on this. 🤫

.content-box-green[
"In 72% of cases, nonsignificant results were misinterpreted, in that the
authors inferred that the effect was absent. A Bayesian reanalysis revealed
that fewer than 5% of the nonsignificant findings provided strong evidence
(i.e., `$BF_{01} > 10$`) in favor of the null hypothesis over the alternative
hypothesis."

[(Aczel et al., *AMPPS*, 2018)](https://journals.sagepub.com/doi/pdf/10.1177/2515245918773742)
]

Juxtaposing frequentist and Bayesian statistics for the same analysis helps to
properly interpret the null results.
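
A small sketch of this juxtaposition on simulated data, assuming the `BayesFactor` package is installed. The data, the seed, and the package's default prior are all choices made for illustration and may differ from what `ggstatsplot` uses.

``` r
library(BayesFactor)

set.seed(123)
x <- rnorm(50)
y <- rnorm(50)  # generated independently of x, so the true correlation is 0

# Frequentist: p-value for the Pearson correlation
p_value <- cor.test(x, y)$p.value

# Bayesian: Bayes factor for the alternative over the null (BF10)
bf10 <- extractBF(correlationBF(y = y, x = x))$bf

# Evidence for the null relative to the alternative
bf01 <- 1 / bf10

c(p = p_value, BF01 = bf01)
```

A non-significant *p*-value alone cannot distinguish "no effect" from "not enough data"; the size of `BF01` indicates how strongly the data favor the null.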

---

# A few other benefits

---

.content-box-green[
Minimal code needed (`data`, `x`, `y`): minimizes chances of error + tidy scripts. 💅
]

---

# No more excuses not to explore data! 😉

---

In summary, the `ggstatsplot` approach: 
 
(*a*) avoids errors in statistical reporting, 
 
(*b*) highlights the importance of the effect by providing effect size measures by default, 
 
(*c*) provides an easy way to evaluate *absence* of an effect using Bayesian framework, 
 
(*d*) requires the statistical analysis to be evaluated in the context of the underlying
data, 
 
and is (*e*) easy and (*f*) simple enough that somebody with little coding
experience can use it without making an error.

---

# Misconceptions and limitations

---

# Misconceptions: This package is...

---

❌ an alternative to learning `ggplot2` 
--
✅ (the more you know `ggplot2`, the better you can modify the
defaults to your liking)

❌ meant to be used in talks/presentations 
--
✅ (defaults are too complicated for effectively communicating
results in time-constrained presentation settings, e.g. conference talks)

❌ only relevant when used in publications 
--
✅ not necessarily; it can also be useful *only* during the exploratory phase

❌ the only game in town 
--
✅ (excellent open-source GUI software: [JASP](https://jasp-stats.org/) and [jamovi](https://www.jamovi.org/))

---

# Limitations of *ggstatsplot* 👎️

---

.content-box-red[
Limited number of **plots** and **statistical tests** available. 
This will **always** be the case. 🤷
]

.content-box-red[
Expects a non-trivial level of statistical proficiency (but
plots without statistics can still be useful).
]

.content-box-red[
**Faceting** does not work (since there are no corresponding
`geom_`s). For the same reason, plots are not `{gganimate}`-friendly.
]

---

# Overcoming these limitations 👥

---

Contributions (big or small) welcome!

![](images/needs_you.jpg)

- Cite if used in a publication 📝

- Proof-read the documentation 📖

- Raise issues about bugs/features 🐞

- Review code 🕵

- Add new functionality 👨‍💻

---

# Acknowledgments

Developer friends 🙌

[Daniel Lüdecke](https://github.com/strengejacke), [Dominique Makowski](https://github.com/DominiqueMakowski), [Mattan S. Ben-Shachar](https://github.com/mattansb), [Brenton M. Wiernik](https://github.com/bwiernik)

Support 💰

[Mina Cikara](http://www.intergroupneurosciencelaboratory.com/), [Fiery Cushman](http://cushmanlab.fas.harvard.edu/index.php), [Iyad Rahwan](https://rahwan.me/)

Community 🙏

Contributors to *ggstatsplot* &
*rstats* users and developers

---

# More documentation

📄 [Publication](https://joss.theoj.org/papers/10.21105/joss.03167)

🗃️ [Website](https://indrajeetpatil.github.io/ggstatsplot/)

🎥 [Yury Zablotski's YouTube playlist on *ggstatsplot*](https://www.youtube.com/playlist?list=PLPWcjtBkAf6kI13vCpRm08zRarRwIiZ9U)

---

# For more

If you are interested in good programming and software development practices, check out my other [slide decks](https://sites.google.com/site/indrajeetspatilmorality/presentations).

---

# Find me at...

[🐦 @patilindrajeets](http://twitter.com/patilindrajeets)

[💻 @IndrajeetPatil](http://github.com/IndrajeetPatil)

[🔗 https://sites.google.com/site/indrajeetspatilmorality/](https://sites.google.com/site/indrajeetspatilmorality/)

[📧 patilindrajeet.science@gmail.com](mailto:patilindrajeet.science@gmail.com)

---

# The End 👋

To access code for these slides, see:

<https://github.com/IndrajeetPatil/ggstatsplot_slides/>