The function ggstatsplot::ggcorrmat provides a quick way to produce a publication-ready correlation matrix plot (also known as a correlogram). The function can also be used for quick data exploration. In addition to the plot, it can return the matrix of correlation coefficients or the associated p-value matrix. Currently, the plot can display Pearson’s r, Spearman’s rho, Kendall’s tau, and a robust correlation coefficient (percentage bend correlation; see ?WRS2::pbcor). This function is a convenient wrapper around the ggcorrplot::ggcorrplot function with some additional functionality.

In this vignette, we will see examples of how to use this function with the gapminder and diamonds datasets.

To begin with, here are some instances where you would want to use ggcorrmat:

  • to easily visualize a correlation matrix using ggplot2
  • to quickly explore correlation between (all) numeric variables in the dataset

Note: The following demo uses the pipe operator (%>%). If you are not familiar with this operator, here is a good explanation: http://r4ds.had.co.nz/pipes.html

Correlation matrix plot with ggcorrmat

For the first example, we will use the gapminder dataset (available in the eponymous package on CRAN). It was collected by the Gapminder Foundation and provides values for life expectancy, Gross Domestic Product (GDP) per capita, and population, every five years, from 1952 to 2007, for each of 142 countries. Let’s have a look at the data:

library(gapminder)
library(dplyr)

dplyr::glimpse(x = gapminder)
#> Observations: 1,704
#> Variables: 6
#> $ country   <fct> Afghanistan, Afghanistan, Afghanistan, Afghanistan, Afgha...
#> $ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asi...
#> $ year      <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 199...
#> $ lifeExp   <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.854, 4...
#> $ pop       <int> 8425333, 9240934, 10267083, 11537966, 13079460, 14880372,...
#> $ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 786.113...

Let’s say we are interested in studying the correlation between the population of a country, its average life expectancy, and its GDP per capita, across countries and only for the year 2007.

The simplest way to get a correlation matrix is to stick to the defaults:

library(ggstatsplot)

# select data only from the year 2007
gapminder_2007 <- dplyr::filter(.data = gapminder::gapminder, year == 2007)

# producing the correlation matrix
ggstatsplot::ggcorrmat(
  data = gapminder_2007, # data from which variable is to be taken
  cor.vars = lifeExp:gdpPercap # specifying correlation matrix variables
)

This plot can be further modified with additional arguments:

ggstatsplot::ggcorrmat(
  data = gapminder_2007, # data from which variable is to be taken
  cor.vars = lifeExp:gdpPercap, # specifying correlation matrix variables
  cor.vars.names = c(
    "Life Expectancy",
    "population",
    "GDP (per capita)"
  ),
  corr.method = "spearman", # which correlation coefficient is to be computed
  lab.col = "red", # label color
  ggtheme = ggplot2::theme_light(), # selected ggplot2 theme
  ggstatsplot.layer = FALSE, # turn off default ggstatsplot theme overlay
  matrix.type = "lower", # correlation matrix structure
  colors = NULL, # turning off manual specification of colors
  palette = "category10_d3", # choosing a color palette
  package = "ggsci", # package to which color palette belongs
  title = "Gapminder correlation matrix", # custom title
  subtitle = "Source: Gapminder Foundation" # custom subtitle
)

As seen from this correlation matrix, although there is no relationship between population and life expectancy worldwide, at least in 2007, there is a strong positive relationship between life expectancy and GDP per capita, a well-established indicator of a country’s economic performance.

Given that there were only three variables, this doesn’t look that impressive. So let’s work with another example from the ggplot2 package: the diamonds dataset. This dataset contains the prices and other attributes of almost 54,000 diamonds.

Let’s have a look at the data:

library(ggplot2)

dplyr::glimpse(ggplot2::diamonds)
#> Observations: 53,940
#> Variables: 10
#> $ carat   <dbl> 0.23, 0.21, 0.23, 0.29, 0.31, 0.24, 0.24, 0.26, 0.22, 0.23,...
#> $ cut     <ord> Ideal, Premium, Good, Premium, Good, Very Good, Very Good, ...
#> $ color   <ord> E, E, E, I, J, J, I, H, E, H, J, J, F, J, E, E, I, J, J, J,...
#> $ clarity <ord> SI2, SI1, VS1, VS2, SI2, VVS2, VVS1, SI1, VS2, VS1, SI1, VS...
#> $ depth   <dbl> 61.5, 59.8, 56.9, 62.4, 63.3, 62.8, 62.3, 61.9, 65.1, 59.4,...
#> $ table   <dbl> 55, 61, 65, 58, 58, 57, 57, 55, 61, 61, 55, 56, 61, 54, 62,...
#> $ price   <int> 326, 326, 327, 334, 335, 336, 336, 337, 337, 338, 339, 340,...
#> $ x       <dbl> 3.95, 3.89, 4.05, 4.20, 4.34, 3.94, 3.95, 4.07, 3.87, 4.00,...
#> $ y       <dbl> 3.98, 3.84, 4.07, 4.23, 4.35, 3.96, 3.98, 4.11, 3.78, 4.05,...
#> $ z       <dbl> 2.43, 2.31, 2.31, 2.63, 2.75, 2.48, 2.47, 2.53, 2.49, 2.39,...

Let’s see the correlation matrix between different attributes of the diamond and the price.

# for reproducibility
set.seed(123)

# let's use just 5% of the data to speed it up
ggstatsplot::ggcorrmat(
  data = dplyr::sample_frac(tbl = ggplot2::diamonds, size = 0.05),
  cor.vars = c(carat, depth:z), # note how the variables are getting selected
  cor.vars.names = c(
    "carat",
    "total depth",
    "table",
    "price",
    "length (in mm)",
    "width (in mm)",
    "depth (in mm)"
  ),
  hc.order = TRUE # use hierarchical clustering
)
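
The hc.order = TRUE argument used above reorders the variables so that strongly related ones end up next to each other. The following is a minimal base R sketch of that idea, not the package’s internal code: it assumes a dissimilarity of 1 - r and complete-linkage clustering, and it uses the built-in mtcars data as a stand-in for diamonds.

```r
# Stand-in data: a few numeric columns from the built-in mtcars dataset
vars <- c("mpg", "disp", "hp", "drat", "wt", "qsec")
cor_mat <- cor(mtcars[, vars])

# Assumption: turn correlations into dissimilarities (1 - r) and cluster them
clust <- hclust(as.dist(1 - cor_mat), method = "complete")

# Reorder rows and columns of the matrix by the clustering order
ordered_mat <- cor_mat[clust$order, clust$order]
round(ordered_mat, 2)
```

The matrix itself is unchanged; only the order of its rows and columns differs, which is what makes blocks of related variables easier to spot in the plot.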

We can make a number of changes to this basic correlation matrix. For example, since we are interested in the relationship between price and the other attributes, let’s make price the first column. Since we are running 6 correlations that are of a priori interest to us, we can also adjust our significance threshold to 0.05/6 ≈ 0.008. Finally, let’s use a non-parametric correlation coefficient. Please note that cor.vars and cor.vars.names must always be entered in the same order; otherwise, wrong column labels will be displayed.

# for reproducibility
set.seed(123)

# let's use just 5% of the data to speed it up
ggstatsplot::ggcorrmat(
  data = dplyr::sample_frac(tbl = ggplot2::diamonds, size = 0.05),
  cor.vars = c(price, carat, depth:table, x:z), # note how the variables are getting selected
  cor.vars.names = c(
    "price",
    "carat",
    "total depth",
    "table",
    "length (in mm)",
    "width (in mm)",
    "depth (in mm)"
  ),
  corr.method = "spearman",
  sig.level = 0.008,
  matrix.type = "lower",
  title = "Relationship between diamond attributes and price",
  subtitle = "Dataset: Diamonds from ggplot2 package",
  colors = c("#0072B2", "#D55E00", "#CC79A7"),
  lab.col = "yellow",
  lab.size = 6,
  tl.srt = 90, # vertical labels for the x-axis (useful in case of long variable names)
  pch = 7,
  pch.col = "white",
  pch.cex = 14,
  caption = substitute(
    paste(italic("Note"), ": Point shape denotes correlation non-significant at p < 0.008; adjusted for 6 comparisons")
  )
) + # modification outside ggstatsplot using ggplot functions
  ggplot2::theme(
    axis.text.x = ggplot2::element_text(
      margin = ggplot2::margin(t = 0.15, r = 0.15, b = 0.15, l = 0.15, unit = "cm")
    )
  )

As seen here, and unsurprisingly, the strongest correlate of the diamond price is its carat value (a carat is a unit of mass equal to 200 mg). In other words, the heavier the diamond, the more expensive it is going to be.
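
The adjusted significance threshold used in the call above is a Bonferroni-style correction: the nominal level is divided by the number of planned comparisons. As a rough sketch of that arithmetic in base R, the example below relates one outcome to six other variables with Spearman’s rho. The mtcars data and the variable names outcome and predictors are stand-ins chosen for illustration, not part of the diamonds analysis.

```r
# Stand-in data: relate mpg to six other numeric columns of mtcars
outcome <- "mpg"
predictors <- c("disp", "hp", "drat", "wt", "qsec", "carb")

# Bonferroni-style threshold: divide the nominal alpha by the number of tests
alpha <- 0.05
threshold <- alpha / length(predictors)
round(threshold, 3)
#> [1] 0.008

# Spearman p-value for each outcome-predictor pair
# (exact = FALSE avoids warnings about ties)
p_values <- vapply(
  predictors,
  function(v) {
    cor.test(mtcars[[outcome]], mtcars[[v]],
             method = "spearman", exact = FALSE)$p.value
  },
  numeric(1)
)

# Which correlations stay significant at the adjusted threshold?
p_values < threshold
```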

Correlation statistics matrix with ggcorrmat

Another use of ggcorrmat is to obtain a matrix of correlation coefficients and their p-values for quick-and-dirty exploratory data analysis. For example, using the txhousing dataset from ggplot2, we can get a coefficient matrix and a p-value matrix.

# for reproducibility
set.seed(123)

# to get correlations
ggstatsplot::ggcorrmat(
  data = dplyr::sample_frac(tbl = ggplot2::txhousing, size = 0.15),
  cor.vars = sales:inventory,
  return = "correlations",
  corr.method = "robust",
  digits = 3
)
#> # A tibble: 5 x 6
#>   variable   sales volume median listings inventory
#>   <chr>      <dbl>  <dbl>  <dbl>    <dbl>     <dbl>
#> 1 sales      1      0.98   0.469    0.93     -0.385
#> 2 volume     0.98   1      0.55     0.898    -0.368
#> 3 median     0.469  0.55   1        0.403    -0.203
#> 4 listings   0.93   0.898  0.403    1        -0.163
#> 5 inventory -0.385 -0.368 -0.203   -0.163     1

# to get p-values
ggstatsplot::ggcorrmat(
  data = dplyr::sample_frac(tbl = ggplot2::txhousing, size = 0.15),
  cor.vars = c(volume, listings:inventory),
  return = "p-values",
  corr.method = "nonparametric",
  digits = 3
)
#> # A tibble: 3 x 4
#>   variable    volume listings inventory
#>   <chr>        <dbl>    <dbl>     <dbl>
#> 1 volume    0.         0       1.84e-43
#> 2 listings  0.         0       8.83e- 2
#> 3 inventory 1.84e-43   0.0883  0.
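
To make the two kinds of output above more concrete, here is a small base R sketch that builds a coefficient matrix and a matching p-value matrix from pairwise Spearman tests. It only illustrates the idea and is not how ggcorrmat is implemented; the mtcars columns are stand-ins, and setting the diagonal of the p-value matrix to 0 simply mirrors the convention visible in the output above.

```r
# Stand-in data: numeric columns from the built-in mtcars dataset
vars <- c("mpg", "hp", "wt", "qsec")
dat <- mtcars[, vars]

# Matrices to hold coefficients (diagonal 1) and p-values (diagonal 0)
k <- length(vars)
r_mat <- matrix(1, k, k, dimnames = list(vars, vars))
p_mat <- matrix(0, k, k, dimnames = list(vars, vars))

# Fill both triangles from pairwise Spearman tests
for (i in seq_len(k - 1)) {
  for (j in (i + 1):k) {
    test <- cor.test(dat[[i]], dat[[j]], method = "spearman", exact = FALSE)
    r_mat[i, j] <- r_mat[j, i] <- unname(test$estimate)
    p_mat[i, j] <- p_mat[j, i] <- test$p.value
  }
}

round(r_mat, 3)
signif(p_mat, 3)
```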

Note that if cor.vars is not specified, all numeric variables will be used. Moreover, you can also use abbreviations to specify what output you want in return.

# for reproducibility
set.seed(123)

# show four digits in a tibble
options(pillar.sigfig = 4)

# getting the correlation coefficient matrix
ggstatsplot::ggcorrmat(
  data = iris, # all numeric variables from data will be used
  corr.method = "np", # non-parametric
  return = "corr" # correlations
)
#> # A tibble: 4 x 5
#>   variable     Sepal.Length Sepal.Width Petal.Length Petal.Width
#>   <chr>               <dbl>       <dbl>        <dbl>       <dbl>
#> 1 Sepal.Length         1         -0.17          0.88       0.83 
#> 2 Sepal.Width         -0.17       1            -0.31      -0.290
#> 3 Petal.Length         0.88      -0.31          1          0.94 
#> 4 Petal.Width          0.83      -0.290         0.94       1

# getting the p-value matrix
ggstatsplot::ggcorrmat(
  data = ggplot2::msleep,
  cor.vars = sleep_total:bodywt,
  corr.method = "r", # robust
  return = "p", # p-values
  p.adjust.method = "holm"
)
#> # A tibble: 6 x 7
#>   variable    sleep_total sleep_rem sleep_cycle     awake   brainwt    bodywt
#>   <chr>             <dbl>     <dbl>       <dbl>     <dbl>     <dbl>     <dbl>
#> 1 sleep_total   0.        5.291e-12   9.138e- 3 0.        3.170e- 5 2.568e- 6
#> 2 sleep_rem     4.070e-13 0.          1.978e- 2 5.291e-12 9.698e- 3 3.762e- 3
#> 3 sleep_cycle   2.285e- 3 1.978e- 2   0.        9.138e- 3 1.637e- 9 1.696e- 5
#> 4 awake         0.        4.070e-13   2.285e- 3 0.        3.170e- 5 2.568e- 6
#> 5 brainwt       4.528e- 6 4.849e- 3   1.488e-10 4.528e- 6 0.        4.509e-17
#> 6 bodywt        2.568e- 7 7.524e- 4   2.120e- 6 2.568e- 7 3.221e-18 0.

# getting the confidence intervals for correlations
ggstatsplot::ggcorrmat(
  data = ggplot2::msleep,
  cor.vars = sleep_total:bodywt,
  corr.method = "p", # parametric
  return = "ci", # confidence intervals
  p.adjust.method = "holm"
)
#> Note: In the correlation matrix,
#> the upper triangle: p-values adjusted for multiple comparisons
#> the lower triangle: unadjusted p-values.
#> # A tibble: 15 x 7
#>    pair                      r    lower    upper          p lower.adj  upper.adj
#>    <chr>                 <dbl>    <dbl>    <dbl>      <dbl>     <dbl>      <dbl>
#>  1 sleep_total-sleep_~  0.7518  0.6167   0.8438  2.916e- 12  0.5402     8.740e-1
#>  2 sleep_total-sleep_~ -0.4737 -0.7058  -0.1498  6.169e-  3 -0.7738     7.215e-5
#>  3 sleep_total-awake   -1.0000 -1.0000  -1.0000  2.419e-226 -1.0000   -10.000e-1
#>  4 sleep_total-brainwt -0.3605 -0.5694  -0.1078  6.348e-  3 -0.6290    -1.505e-2
#>  5 sleep_total-bodywt  -0.3120 -0.4944  -0.1033  4.085e-  3 -0.5302    -5.506e-2
#>  6 sleep_rem-sleep_cy~ -0.3381 -0.6144   0.01198 5.839e-  2 -0.6806     1.257e-1
#>  7 sleep_rem-awake     -0.7518 -0.8438  -0.6167  2.911e- 12 -0.8748    -5.376e-1
#>  8 sleep_rem-brainwt   -0.2213 -0.4756   0.06701 1.306e-  1 -0.4756     6.701e-2
#>  9 sleep_rem-bodywt    -0.3277 -0.5353  -0.08265 9.947e-  3 -0.5838    -1.223e-2
#> 10 sleep_cycle-awake    0.4737  0.1498   0.7058  6.169e-  3 -0.006407   7.763e-1
#> 11 sleep_cycle-brainwt  0.8516  0.7088   0.9274  2.420e-  9  0.6080     9.487e-1
#> 12 sleep_cycle-bodywt   0.4178  0.08089  0.6690  1.734e-  2 -0.06265    7.410e-1
#> 13 awake-brainwt        0.3605  0.1078   0.5694  6.348e-  3  0.007931   6.333e-1
#> 14 awake-bodywt         0.3120  0.1032   0.4944  4.089e-  3  0.07202    5.178e-1
#> 15 brainwt-bodywt       0.9338  0.8892   0.9608  9.155e- 26  0.8583     9.697e-1

# getting the sample sizes for all pairs
ggstatsplot::ggcorrmat(
  data = ggplot2::msleep,
  cor.vars = sleep_total:bodywt,
  corr.method = "robust",
  return = "n" # note that n is different due to NAs
)
#> # A tibble: 6 x 7
#>   variable    sleep_total sleep_rem sleep_cycle awake brainwt bodywt
#>   <chr>             <dbl>     <dbl>       <dbl> <dbl>   <dbl>  <dbl>
#> 1 sleep_total          83        61          32    83      56     83
#> 2 sleep_rem            61        61          32    61      48     61
#> 3 sleep_cycle          32        32          32    32      30     32
#> 4 awake                83        61          32    83      56     83
#> 5 brainwt              56        48          30    56      56     56
#> 6 bodywt               83        61          32    83      56     83
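
The sample-size matrix above differs from pair to pair because each correlation can only use the rows where both variables are observed. A minimal base R sketch of that counting, assuming pairwise deletion of missing values and using the built-in airquality data as a stand-in for msleep:

```r
# Stand-in data with missing values: the built-in airquality dataset
dat <- airquality[, c("Ozone", "Solar.R", "Wind", "Temp")]

# 1 where a value is observed, 0 where it is missing
observed <- apply(dat, 2, function(col) as.numeric(!is.na(col)))

# The cross-product counts, for each pair of columns,
# the rows in which both values are observed
n_mat <- crossprod(observed)
n_mat

# Correlations computed on those pairwise-complete observations
round(cor(dat, use = "pairwise.complete.obs"), 3)
```

The diagonal of n_mat holds the number of non-missing values per variable, so an off-diagonal entry can never exceed either of the two corresponding diagonal entries.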

Grouped analysis with grouped_ggcorrmat

What if we want to do the same analysis separately for each quality of the diamond cut (Fair, Good, Very Good, Premium, Ideal)?

ggstatsplot provides a special helper function for such instances: grouped_ggcorrmat. This is merely a wrapper function around ggstatsplot::combine_plots. It applies ggcorrmat across all levels of a specified grouping variable and then combines the list of individual plots into a single plot. Note that the grouping variable can be anything: conditions in a given study, groups in a study sample, different studies, etc.

# for reproducibility
set.seed(123)

# let's use just 5% of the data to speed it up
ggstatsplot::grouped_ggcorrmat(
  # arguments relevant for ggstatsplot::ggcorrmat
  data = dplyr::sample_frac(tbl = ggplot2::diamonds, size = 0.05),
  corr.method = "r", # percentage bend correlation coefficient
  beta = 0.2, # bending constant
  p.adjust.method = "holm", # method to adjust p-values for multiple comparisons
  grouping.var = cut,
  title.prefix = "Quality of cut",
  cor.vars = c(carat, depth:z),
  cor.vars.names = c(
    "carat",
    "total depth",
    "table",
    "price",
    "length (in mm)",
    "width (in mm)",
    "depth (in mm)"
  ),
  lab.size = 3.5,
  # arguments relevant for ggstatsplot::combine_plots
  title.text = "Relationship between diamond attributes and price across cut",
  title.size = 16,
  title.color = "red",
  caption.text = "Dataset: Diamonds from ggplot2 package",
  caption.size = 14,
  caption.color = "blue",
  labels = c("(a)", "(b)", "(c)", "(d)", "(e)"),
  nrow = 3,
  ncol = 2
)
#> Warning: Individual plots in the combined `grouped_` plot
#> can't be further modified with `ggplot2` functions.

Note that this function also makes it easy to compute the same correlation statistics across different levels of a factor/grouping variable. For example, to get the correlation coefficients with their confidence intervals for each level of vore in the msleep dataset, we can do the following:

# for reproducibility
set.seed(123)

# let's obtain correlation coefficients with their CIs
ggstatsplot::grouped_ggcorrmat(
  data = ggplot2::msleep,
  grouping.var = vore,
  return = "ci"
)
#> # A tibble: 60 x 8
#>    vore  pair                   r   lower    upper         p lower.adj upper.adj
#>    <chr> <chr>              <dbl>   <dbl>    <dbl>     <dbl>     <dbl>     <dbl>
#>  1 carni sleep_total-sle~  0.9189  0.6864  0.9810  1.714e- 4    0.4403    0.9909
#>  2 carni sleep_total-sle~  0.3764 -0.7574  0.9449  5.323e- 1   -0.9328    0.9858
#>  3 carni sleep_total-awa~ -1.0000 -1.0000 -1.0000  2.140e-44   -1.0000   -1.0000
#>  4 carni sleep_total-bra~ -0.5244 -0.8815  0.2144  1.473e- 1   -0.9448    0.5483
#>  5 carni sleep_total-bod~ -0.4427 -0.7468  0.01441 5.768e- 2   -0.8365    0.2526
#>  6 carni sleep_rem-sleep~  0.1216 -0.8521  0.9066  8.456e- 1   -0.9606    0.9756
#>  7 carni sleep_rem-awake  -0.9189 -0.9810 -0.6865  1.713e- 4   -0.9909   -0.4403
#>  8 carni sleep_rem-brain~ -0.5006 -0.9331  0.5237  3.118e- 1   -0.9778    0.8159
#>  9 carni sleep_rem-bodywt -0.4786 -0.8516  0.2162  1.617e- 1   -0.9261    0.5286
#> 10 carni sleep_cycle-awa~ -0.3764 -0.9449  0.7574  5.323e- 1   -0.9858    0.9328
#> # ... with 50 more rows

As this example illustrates, there is minimal coding overhead to explore correlations in your dataset with the grouped_ggcorrmat function.
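
For readers curious about where columns such as r, lower, and upper can come from, the sketch below computes a Pearson correlation with a confidence interval for each group in base R. It assumes the common Fisher z-transformation interval (the one stats::cor.test reports for Pearson’s r); whether grouped_ggcorrmat computes its intervals in exactly this way is an assumption. The helper cor_ci is hypothetical, and iris grouped by Species is a stand-in for msleep grouped by vore.

```r
# Hypothetical helper: Pearson's r with a Fisher z-based confidence interval
cor_ci <- function(x, y, conf.level = 0.95) {
  ok <- complete.cases(x, y) # keep pairwise-complete observations
  x <- x[ok]
  y <- y[ok]
  n <- length(x)
  r <- cor(x, y)
  z <- atanh(r) # Fisher z-transformation
  half_width <- qnorm((1 + conf.level) / 2) / sqrt(n - 3)
  data.frame(
    r = r,
    lower = tanh(z - half_width),
    upper = tanh(z + half_width),
    n = n
  )
}

# Apply the helper to each level of the grouping variable
ci_by_group <- lapply(
  split(iris, iris$Species),
  function(d) cor_ci(d$Sepal.Length, d$Petal.Length)
)

# Stack the per-group results into one table
ci_table <- do.call(rbind, ci_by_group)
ci_table
```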

Grouped analysis with ggcorrmat + purrr

Although the grouped_ function is good for quickly exploring the data, it reduces the flexibility with which ggcorrmat can be used. This is because the same arguments are applied to the plots for all levels of the grouping variable, and there is no way to customize them for individual levels. This kind of customization can be done using the purrr package.

See the associated vignette here: https://indrajeetpatil.github.io/ggstatsplot/articles/web_only/purrr_examples.html
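
As a dependency-free illustration of per-level customization, the sketch below uses base R’s Map() to pair each data subset with its own setting; purrr offers analogous mapping functions, and the linked vignette shows the actual ggcorrmat workflow. The per-level methods list is hypothetical, and iris is a stand-in dataset.

```r
# Split stand-in data (iris) by the grouping variable
numeric_cols <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")
by_group <- split(iris[, numeric_cols], iris$Species)

# Hypothetical per-level settings: a different method for each level
methods <- list(
  setosa = "pearson",
  versicolor = "spearman",
  virginica = "kendall"
)

# Map() walks over the data subsets and their settings in parallel
results <- Map(
  function(d, m) round(cor(d, method = m), 2),
  by_group,
  methods[names(by_group)]
)

results$versicolor
```

The same pattern extends to any argument that should vary by level (titles, colors, variable subsets): store one value per level in a list and map over the lists together.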

Summary of tests

The following tests are carried out for each type of analysis:

Type             Test
Parametric       Pearson’s correlation coefficient
Non-parametric   Spearman’s rank correlation coefficient
Robust           Percentage bend correlation coefficient
Bayes Factor     Pearson’s correlation coefficient
Suggestions

If you find any bugs or have any suggestions/remarks, please file an issue on GitHub: https://github.com/IndrajeetPatil/ggstatsplot/issues