
Overview

The groupedstats package provides a collection of functions to run statistical operations on multiple variables across multiple grouping variables in a dataframe. This is a common situation, as illustrated by a few example cases-

  1. If you have combined multiple studies in a single dataframe and want to run a common operation (e.g., linear regression) on each study. In this case, the column corresponding to study will be the grouping variable.
  2. If you have multiple groups in your dataframe (e.g., clinical disorder groups and a control group) and you want to carry out the same operation for each group (e.g., a planned t-test to check for differences in reaction time in condition 1 versus condition 2 for both groups). In this case, group will be the grouping variable.
  3. If you have multiple conditions in a given study (e.g., six types of videos participants saw) and want to run a common operation between different measures of interest for each condition (e.g., correlation between subjective rating of emotional intensity and reaction time).
  4. Combination of all of the above.

This package is still a work in progress, and it currently supports only the most basic statistical operations (from the stats and lme4 packages). The next releases will expand on the existing functionality (e.g., ordinal regression).

Installation

To get the latest, stable CRAN release (0.0.3):

utils::install.packages(pkgs = "groupedstats") 

You can get the development version of the package from GitHub (0.0.3.9000). To see what new changes (and bug fixes) have been made to the package since the last release on CRAN, you can check the detailed log of changes here: https://indrajeetpatil.github.io/groupedstats/news/index.html

If you are in a hurry and want to reduce installation time, prefer a quick install. A sketch using devtools (quick = TRUE skips optional build steps)-
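
# install.packages("devtools")                                  # if devtools is not yet installed
devtools::install_github(repo = "IndrajeetPatil/groupedstats",  # package path on GitHub
                         quick = TRUE)                          # skips optional build steps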

If time is not a constraint, prefer a full install with dependencies and vignettes (again a sketch)-
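
devtools::install_github(repo = "IndrajeetPatil/groupedstats",  # package path on GitHub
                         dependencies = TRUE,                   # also installs packages groupedstats depends on
                         build_vignettes = TRUE)                # builds the package vignettes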

Help

There is a dedicated website for groupedstats, which is updated after every new commit: https://indrajeetpatil.github.io/groupedstats/.

In R, documentation for any function can be accessed with the standard help command. For example-
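
?groupedstats::grouped_summary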

Another handy tool to see the arguments of any of the functions is args. For example-
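
args(name = groupedstats::grouped_summary)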

In case you want to look at the function body for any of the functions, just type the name of the function without the parentheses:
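
groupedstats::grouped_summary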

If you are not familiar with what the namespace operator :: does or how to use the pipe operator %>%, both of which this package and its documentation rely on heavily, you may want to read up on them before proceeding.

Usage

grouped_summary

Getting summaries for multiple variables across multiple grouping variables. This function is a wrapper around skimr::skim_to_wide(). It is meant to be a handy summarizing tool if you have multiple grouping variables and multiple variables for which summary statistics are to be computed-

library(datasets)
options(tibble.width = Inf)            # show me all columns

groupedstats::grouped_summary(data = datasets::iris,
                              grouping.vars = Species,
                              measures = Sepal.Length:Petal.Width,
                              measures.type = "numeric")
#> # A tibble: 12 x 16
#>    Species    type    variable     missing complete     n  mean    sd   min
#>    <fct>      <chr>   <chr>          <dbl>    <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1 setosa     numeric Petal.Length       0       50    50  1.46  0.17   1  
#>  2 setosa     numeric Petal.Width        0       50    50  0.25  0.11   0.1
#>  3 setosa     numeric Sepal.Length       0       50    50  5.01  0.35   4.3
#>  4 setosa     numeric Sepal.Width        0       50    50  3.43  0.38   2.3
#>  5 versicolor numeric Petal.Length       0       50    50  4.26  0.47   3  
#>  6 versicolor numeric Petal.Width        0       50    50  1.33  0.2    1  
#>  7 versicolor numeric Sepal.Length       0       50    50  5.94  0.52   4.9
#>  8 versicolor numeric Sepal.Width        0       50    50  2.77  0.31   2  
#>  9 virginica  numeric Petal.Length       0       50    50  5.55  0.55   4.5
#> 10 virginica  numeric Petal.Width        0       50    50  2.03  0.27   1.4
#> 11 virginica  numeric Sepal.Length       0       50    50  6.59  0.64   4.9
#> 12 virginica  numeric Sepal.Width        0       50    50  2.97  0.32   2.2
#>      p25 median   p75   max std.error mean.low.conf mean.high.conf
#>    <dbl>  <dbl> <dbl> <dbl>     <dbl>         <dbl>          <dbl>
#>  1  1.4    1.5   1.58   1.9    0.0240         1.41           1.51 
#>  2  0.2    0.2   0.3    0.6    0.0156         0.219          0.281
#>  3  4.8    5     5.2    5.8    0.0495         4.91           5.11 
#>  4  3.2    3.4   3.68   4.4    0.0537         3.32           3.54 
#>  5  4      4.35  4.6    5.1    0.0665         4.13           4.39 
#>  6  1.2    1.3   1.5    1.8    0.0283         1.27           1.39 
#>  7  5.6    5.9   6.3    7      0.0735         5.79           6.09 
#>  8  2.52   2.8   3      3.4    0.0438         2.68           2.86 
#>  9  5.1    5.55  5.88   6.9    0.0778         5.39           5.71 
#> 10  1.8    2     2.3    2.5    0.0382         1.95           2.11 
#> 11  6.23   6.5   6.9    7.9    0.0905         6.41           6.77 
#> 12  2.8    3     3.18   3.8    0.0453         2.88           3.06

This function can be used to get a summary of either numeric or factor variables, but not both. This is by design. If no measures are specified, the function will compute a summary for all variables of the specified type (numeric or factor).

If you want a summary of variables of factor type-
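
A sketch using the diamonds dataset from ggplot2 (the dataset and measure choices are illustrative)-

groupedstats::grouped_summary(data = ggplot2::diamonds,
                              grouping.vars = cut,              # grouping variable
                              measures = c(color, clarity),     # factor variables to summarize
                              measures.type = "factor")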

Note that there is a column corresponding to top_counts, which is really useful if, let’s say, you want to plot these counts. But this column is of character type and in wide format. The solution is to use an additional argument provided for this function:
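
A sketch, assuming the argument is named topcount.long (check ?grouped_summary for the exact name)-

groupedstats::grouped_summary(data = ggplot2::diamonds,
                              grouping.vars = cut,
                              measures = c(color, clarity),
                              measures.type = "factor",
                              topcount.long = TRUE)             # assumed argument name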

This produces a long-format table with two new columns, factor.level and its corresponding count, which can then be immediately fed into other pipelines (e.g., preparing a plot of mean and sd values in ggplot2).

grouped_slr

This function can be used to run simple linear regression (slr) between different pairs of variables across multiple levels of grouping variable(s). For example, we can use the gapminder dataset to study two relationships of interest for each country across years:

  1. life expectancy and GDP (per capita)
  2. population and GDP (per capita)

Thus, in this case, we have two regression models and one grouping variable with 142 levels (countries). A minimal sketch (assuming dep.vars and indep.vars are paired element-wise, as in the grouped_ttest examples later in this README)-
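
groupedstats::grouped_slr(data = gapminder::gapminder,
                          dep.vars = c(lifeExp, pop),           # two dependent variables
                          indep.vars = c(gdpPercap, gdpPercap), # common independent variable
                          grouping.vars = country)              # one grouping variable (142 levels)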

Notice the order in which the dependent and independent variables are entered; there are two separate regression models being run here: lifeExp ~ gdpPercap and pop ~ gdpPercap. If this order is incorrect, the results will also be incorrect, so it is always a good idea to check the formula column to see if you have run the correct linear models. Also note that the estimates are already standardized, i.e., they are standardized regression coefficients (betas).

The prior example used just one grouping variable. This can be done with multiple grouping variables as well. For example, with the diamonds dataset from the ggplot2 library, let’s assess the relation between carat and price of a diamond for each type of clarity and cut (a sketch along the same lines)-
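
groupedstats::grouped_slr(data = ggplot2::diamonds,
                          dep.vars = price,                     # dependent variable
                          indep.vars = carat,                   # independent variable
                          grouping.vars = c(clarity, cut))      # two grouping variables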

A more general version of this function (grouped_lm) utilizes the formula interface of stats::lm and is described next.

grouped_lm

A more general version of simple linear regression is stats::lm, implemented in grouped_lm. A minimal sketch (the formula and data choices are illustrative)-
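
groupedstats::grouped_lm(data = gapminder::gapminder,
                         grouping.vars = continent,             # grouping variable
                         formula = lifeExp ~ gdpPercap,         # formula passed to stats::lm
                         output = "tidy")                       # tidy dataframe of results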

The same function can also be used to get model summaries instead of a tidy dataframe containing results (again a sketch)-
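
groupedstats::grouped_lm(data = gapminder::gapminder,
                         grouping.vars = continent,
                         formula = lifeExp ~ gdpPercap,
                         output = "glance")                     # broom::glance-style model summaries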

grouped_aov

A function related to stats::lm is stats::aov, which fits an analysis of variance model for each group. Contrast the output you get here with the previous output for the same model from the grouped_lm function. The estimate in this case will be an effect size (either partial eta-squared or partial omega-squared). A sketch with the same model as above-
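
groupedstats::grouped_aov(data = gapminder::gapminder,
                          grouping.vars = continent,            # grouping variable
                          formula = lifeExp ~ gdpPercap,        # formula passed to stats::aov
                          output = "tidy")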

The same function can also be used to compute Tukey’s test of Honest Significant Differences (HSD). For example, we can check for differences in life expectancy between different continents for all years in which the gapminder survey was conducted (a sketch, assuming output = "tukey" selects the HSD results):
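
groupedstats::grouped_aov(data = gapminder::gapminder,
                          grouping.vars = year,                 # one model per survey year
                          formula = lifeExp ~ continent,        # compare continents within each year
                          output = "tukey")                     # assumed option for Tukey's HSD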

Note that the p-value is adjusted for the number of tests conducted at each level of the grouping variable, and not across all tests conducted.

grouped_glm

The option to run a generalized linear model (stats::glm) across different levels of the grouping variable is also implemented similarly. A sketch using the Titanic_full dataset bundled with the package (the family choice is illustrative)-
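
groupedstats::grouped_glm(data = groupedstats::Titanic_full,
                          formula = Survived ~ Age,             # predicting survival from age group
                          family = stats::binomial(link = "logit"),
                          grouping.vars = Sex,                  # one model per sex
                          output = "tidy")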

Note that the statistic will either be a t value (e.g., for the gaussian family) or a z value (e.g., for the binomial family), based on the family of models used.

You can also get model summaries across all models with broom::glance methods (again a sketch)-
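
groupedstats::grouped_glm(data = groupedstats::Titanic_full,
                          formula = Survived ~ Age,
                          family = stats::binomial(link = "logit"),
                          grouping.vars = Sex,
                          output = "glance")                    # model summaries via broom::glance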

grouped_lmer

Linear mixed effects analyses (lme4::lmer) for all combinations of grouping variable levels can be carried out using grouped_lmer:

library(gapminder)

# getting tidy output of results
groupedstats::grouped_lmer(
  data = gapminder,
  formula = scale(lifeExp) ~ scale(gdpPercap) + (gdpPercap |
                                                   continent),
  grouping.vars = year,
  REML = FALSE,
  output = "tidy"
)
#> Computing p-values via Wald-statistics approximation (treating t as Wald z).
#> (the message above is repeated once for each of the 12 yearly models)
#> # A tibble: 24 x 10
#>     year effect term             estimate std.error t.value conf.low
#>    <int> <chr>  <chr>               <dbl>     <dbl>   <dbl>    <dbl>
#>  1  1952 fixed  (Intercept)         0.215     0.293   0.736   -0.358
#>  2  1952 fixed  scale(gdpPercap)    0.930     0.302   3.07     0.337
#>  3  1957 fixed  (Intercept)         0.254     0.342   0.743   -0.416
#>  4  1957 fixed  scale(gdpPercap)    0.815     0.282   2.89     0.262
#>  5  1962 fixed  (Intercept)         0.255     0.333   0.766   -0.397
#>  6  1962 fixed  scale(gdpPercap)    0.591     0.210   2.82     0.180
#>  7  1967 fixed  (Intercept)         0.249     0.361   0.689   -0.459
#>  8  1967 fixed  scale(gdpPercap)    0.387     0.120   3.24     0.153
#>  9  1972 fixed  (Intercept)         0.276     0.366   0.753   -0.442
#> 10  1972 fixed  scale(gdpPercap)    0.431     0.150   2.88     0.137
#>    conf.high p.value significance
#>        <dbl>   <dbl> <chr>       
#>  1     0.789 0.462   ns          
#>  2     1.52  0.00211 **          
#>  3     0.923 0.457   ns          
#>  4     1.37  0.00389 **          
#>  5     0.908 0.444   ns          
#>  6     1.00  0.00479 **          
#>  7     0.956 0.491   ns          
#>  8     0.622 0.00120 **          
#>  9     0.994 0.451   ns          
#> 10     0.724 0.00402 **          
#> # ... with 14 more rows


# getting model summaries
groupedstats::grouped_lmer(
  data = gapminder,
  formula = scale(lifeExp) ~ scale(gdpPercap) + (gdpPercap |
                                                   continent),
  grouping.vars = year,
  REML = FALSE,
  output = "glance"
)
#> # A tibble: 12 x 7
#>     year sigma logLik   AIC   BIC deviance df.residual
#>    <int> <dbl>  <dbl> <dbl> <dbl>    <dbl>       <int>
#>  1  1952 0.531  -124.  259.  277.     247.         136
#>  2  1957 0.540  -127.  266.  284.     254.         136
#>  3  1962 0.546  -128.  268.  285.     256.         136
#>  4  1967 0.552  -128.  269.  287.     257.         136
#>  5  1972 0.565  -132.  276.  294.     264.         136
#>  6  1977 0.582  -132.  277.  294.     265.         136
#>  7  1982 0.544  -121.  253.  271.     241.         136
#>  8  1987 0.498  -110.  231.  249.     219.         136
#>  9  1992 0.518  -117.  246.  264.     234.         136
#> 10  1997 0.498  -109.  231.  248.     219.         136
#> 11  2002 0.505  -113.  238.  255.     226.         136
#> 12  2007 0.525  -116.  244.  262.     232.         136

grouped_glmer

A more generalized version of lme4::lmer is implemented in lme4::glmer, which can also handle categorical/nominal data. For example, let’s say we want to see if the sex of a person was predictive of whether they survived the Titanic tragedy.

# having a look at the data
dplyr::glimpse(x = groupedstats::Titanic_full)
#> Observations: 2,201
#> Variables: 5
#> $ id       <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16...
#> $ Class    <fct> 3rd, 3rd, 3rd, 3rd, 3rd, 3rd, 3rd, 3rd, 3rd, 3rd, 3rd...
#> $ Sex      <fct> Male, Male, Male, Male, Male, Male, Male, Male, Male,...
#> $ Age      <fct> Child, Child, Child, Child, Child, Child, Child, Chil...
#> $ Survived <fct> No, No, No, No, No, No, No, No, No, No, No, No, No, N...

# running glmer model to get tidy output
groupedstats::grouped_glmer(
  formula = Survived ~ Age + (Age |
                                Class),
  data = groupedstats::Titanic_full,
  family = stats::binomial(link = "probit"),                  # choosing the appropriate GLM family
                          control = lme4::glmerControl(       # choosing appropriate control
                            optimizer = "Nelder_Mead",
                            boundary.tol = 1e-07,
                            calc.derivs = FALSE,
                            optCtrl = list(maxfun = 2e9)
                          ),
  grouping.vars = Sex,                                        # grouping variables (just one in this case)
  output = "tidy"
)
#> # A tibble: 4 x 10
#>   Sex    effect term        estimate std.error statistic conf.low conf.high
#>   <fct>  <chr>  <chr>          <dbl>     <dbl>     <dbl>    <dbl>     <dbl>
#> 1 Male   fixed  (Intercept)   -0.888     0.162     -5.47   -1.21     -0.570
#> 2 Male   fixed  AgeChild       4.90      4.03       1.22   -3.00     12.8  
#> 3 Female fixed  (Intercept)    1.01      0.379      2.66    0.264     1.75 
#> 4 Female fixed  AgeChild       1.47      0.595      2.48    0.307     2.64 
#>        p.value significance
#>          <dbl> <chr>       
#> 1 0.0000000453 ***         
#> 2 0.224        ns          
#> 3 0.00789      **          
#> 4 0.0133       *

# getting glmer model summaries (let's use the default family and control values)
groupedstats::grouped_glmer(
  formula = Survived ~ Age + (Age |
                                Class),
  grouping.vars = Sex,      
  data = groupedstats::Titanic_full,
  output = "glance"
)
#> # A tibble: 2 x 7
#>   Sex    sigma logLik   AIC   BIC deviance df.residual
#>   <fct>  <dbl>  <dbl> <dbl> <dbl>    <dbl>       <int>
#> 1 Male       1  -860. 1730. 1757.    1698.        1726
#> 2 Female     1  -208.  426.  447.     400.         465

Note that, here too, the statistic will either be a t value (e.g., for the gaussian family) or a z value (e.g., for the binomial family), based on the family of models used.

grouped_ttest

This function can help you carry out t-tests, paired or independent, on multiple variables across multiple groups. Demonstrating how to use this function first requires getting the iris dataset into long format. Let’s say we want to investigate whether the sepal part of the flower has larger measurements (length or width) than the petal part for each Iris species.


# converting the iris dataset to long format
iris_long <- datasets::iris %>%
  dplyr::mutate(.data = ., id = dplyr::row_number(x = Species)) %>%
  tidyr::gather(
    data = .,
    key = "condition",
    value = "value",
    Sepal.Length:Petal.Width,
    convert = TRUE,
    factor_key = TRUE
  ) %>%
  tidyr::separate(
    col = "condition",
    into = c("part", "measure"),
    sep = "\\.",
    convert = TRUE
  ) %>%
  tibble::as_data_frame(x = .)

# check the long format iris dataset
iris_long
#> # A tibble: 600 x 5
#>    Species    id part  measure value
#>    <fct>   <int> <chr> <chr>   <dbl>
#>  1 setosa      1 Sepal Length    5.1
#>  2 setosa      2 Sepal Length    4.9
#>  3 setosa      3 Sepal Length    4.7
#>  4 setosa      4 Sepal Length    4.6
#>  5 setosa      5 Sepal Length    5  
#>  6 setosa      6 Sepal Length    5.4
#>  7 setosa      7 Sepal Length    4.6
#>  8 setosa      8 Sepal Length    5  
#>  9 setosa      9 Sepal Length    4.4
#> 10 setosa     10 Sepal Length    4.9
#> # ... with 590 more rows

# checking if the Sepal part has different dimensions (value) than the Petal part
# for each Species and for each type of measurement (Length and Width)
options(tibble.width = Inf)            # show me all columns

groupedstats::grouped_ttest(
  data = iris_long,
  dep.vars = value,                    # dependent variable
  indep.vars = part,                   # independent variable
  grouping.vars = c(Species, measure), # for each Species and for each measurement
  paired = TRUE                        # paired t-test
)
#> # A tibble: 6 x 11
#>   Species    measure formula      method        t.test estimate conf.low
#>   <fct>      <chr>   <chr>        <chr>          <dbl>    <dbl>    <dbl>
#> 1 setosa     Length  value ~ part Paired t-test  -71.8   -3.54     -3.64
#> 2 versicolor Length  value ~ part Paired t-test  -34.0   -1.68     -1.78
#> 3 virginica  Length  value ~ part Paired t-test  -22.9   -1.04     -1.13
#> 4 setosa     Width   value ~ part Paired t-test  -61.0   -3.18     -3.29
#> 5 versicolor Width   value ~ part Paired t-test  -43.5   -1.44     -1.51
#> 6 virginica  Width   value ~ part Paired t-test  -23.1   -0.948    -1.03
#>   conf.high parameter  p.value significance
#>       <dbl>     <dbl>    <dbl> <chr>       
#> 1    -3.44         49 2.54e-51 ***         
#> 2    -1.58         49 9.67e-36 ***         
#> 3    -0.945        49 7.99e-28 ***         
#> 4    -3.08         49 7.21e-48 ***         
#> 5    -1.38         49 8.42e-41 ***         
#> 6    -0.866        49 5.34e-28 ***

grouped_wilcox

This function is just a non-parametric variant of grouped_ttest and follows the same call pattern. A sketch mirroring the paired example above:
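
groupedstats::grouped_wilcox(
  data = iris_long,
  dep.vars = value,                    # dependent variable
  indep.vars = part,                   # independent variable
  grouping.vars = c(Species, measure), # for each Species and for each measurement
  paired = TRUE                        # paired Wilcoxon signed-rank test
)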

While we are at it, let’s also check out examples for t-test and Wilcox test in case of between-subjects designs.

We will use the diamonds dataset from ggplot2 and will see if the price and depth of a diamond are different for two of our favorite colors (say E and J) for each type of clarity.


# subset the dataframe with two colors of interest to us
diamonds_short <-
  dplyr::filter(.data = ggplot2::diamonds, color == "E" |
                  color == "J")

options(tibble.width = Inf, tibble.print_max = Inf)             # show me all rows and columns

# t-test
groupedstats::grouped_ttest(
  data = diamonds_short,
  dep.vars = c(carat, price, depth),             # note that there are three dependent variables
  indep.vars = color,                            # and just one independent variable 
  grouping.vars = clarity,                       # one grouping variable
  paired = FALSE,
  var.equal = FALSE
)
#> # A tibble: 24 x 10
#>    clarity formula       method                   t.test   estimate
#>    <ord>   <chr>         <chr>                     <dbl>      <dbl>
#>  1 SI2     carat ~ color Welch Two Sample t-test -18.0      -0.503 
#>  2 SI1     carat ~ color Welch Two Sample t-test -22.1      -0.462 
#>  3 VS1     carat ~ color Welch Two Sample t-test -16.8      -0.444 
#>  4 VVS2    carat ~ color Welch Two Sample t-test -12.0      -0.553 
#>  5 VS2     carat ~ color Welch Two Sample t-test -24.6      -0.542 
#>  6 I1      carat ~ color Welch Two Sample t-test  -4.48     -0.644 
#>  7 VVS1    carat ~ color Welch Two Sample t-test  -6.53     -0.417 
#>  8 IF      carat ~ color Welch Two Sample t-test  -2.38     -0.198 
#>  9 SI2     price ~ color Welch Two Sample t-test -10.6   -2347.    
#> 10 SI1     price ~ color Welch Two Sample t-test -12.6   -2024.    
#> 11 VS1     price ~ color Welch Two Sample t-test  -9.26  -2028.    
#> 12 VVS2    price ~ color Welch Two Sample t-test  -6.83  -2643.    
#> 13 VS2     price ~ color Welch Two Sample t-test -14.3   -2560.    
#> 14 I1      price ~ color Welch Two Sample t-test  -2.76  -1766.    
#> 15 VVS1    price ~ color Welch Two Sample t-test  -3.35  -1814.    
#> 16 IF      price ~ color Welch Two Sample t-test   0.378   305.    
#> 17 SI2     depth ~ color Welch Two Sample t-test  -2.64     -0.231 
#> 18 SI1     depth ~ color Welch Two Sample t-test   0.333     0.0208
#> 19 VS1     depth ~ color Welch Two Sample t-test  -6.61     -0.417 
#> 20 VVS2    depth ~ color Welch Two Sample t-test  -2.20     -0.277 
#> 21 VS2     depth ~ color Welch Two Sample t-test  -2.28     -0.145 
#> 22 I1      depth ~ color Welch Two Sample t-test  -3.96     -2.10  
#> 23 VVS1    depth ~ color Welch Two Sample t-test  -2.53     -0.370 
#> 24 IF      depth ~ color Welch Two Sample t-test  -2.16     -0.386 
#>     conf.low  conf.high parameter   p.value significance
#>        <dbl>      <dbl>     <dbl>     <dbl> <chr>       
#>  1    -0.558    -0.448      634.  9.24e- 59 ***         
#>  2    -0.503    -0.420      948.  2.47e- 87 ***         
#>  3    -0.496    -0.392      663.  2.91e- 53 ***         
#>  4    -0.644    -0.461      140.  3.81e- 23 ***         
#>  5    -0.586    -0.499      856.  1.51e-101 ***         
#>  6    -0.932    -0.357       60.6 3.34e-  5 ***         
#>  7    -0.545    -0.290       76.1 6.68e-  9 ***         
#>  8    -0.364    -0.0318      60.1 2.03e-  2 *           
#>  9 -2782.    -1913.         676.  2.02e- 24 ***         
#> 10 -2339.    -1709.        1031.  5.75e- 34 ***         
#> 11 -2458.    -1598.         794.  1.86e- 19 ***         
#> 12 -3407.    -1878.         151.  1.94e- 10 ***         
#> 13 -2911.    -2209.         943.  3.07e- 42 ***         
#> 14 -3044.     -487.          62.0 7.58e-  3 **          
#> 15 -2893.     -735.          81.2 1.24e-  3 **          
#> 16 -1299.     1909.          81.0 7.07e-  1 ns          
#> 17    -0.402    -0.0588     772.  8.58e-  3 **          
#> 18    -0.102     0.143     1245.  7.39e-  1 ns          
#> 19    -0.541    -0.293     1040.  6.08e- 11 ***         
#> 20    -0.526    -0.0279     158.  2.95e-  2 *           
#> 21    -0.271    -0.0203    1084.  2.28e-  2 *           
#> 22    -3.16     -1.05        81.6 1.61e-  4 ***         
#> 23    -0.661    -0.0792      88.1 1.32e-  2 *           
#> 24    -0.739    -0.0321     111.  3.28e-  2 *

# wilcox test (aka Mann-Whitney U-test)
groupedstats::grouped_wilcox(
  data = diamonds_short,
  dep.vars = depth:price,                        # note that you can select variables in range with `:`
  indep.vars = color,                            # again, just one independent, multiple dependent variables case
  grouping.vars = clarity,                       # one grouping variable
  paired = FALSE
)
#> # A tibble: 16 x 9
#>    clarity formula       method                                           
#>    <ord>   <chr>         <chr>                                            
#>  1 SI2     depth ~ color Wilcoxon rank sum test with continuity correction
#>  2 SI1     depth ~ color Wilcoxon rank sum test with continuity correction
#>  3 VS1     depth ~ color Wilcoxon rank sum test with continuity correction
#>  4 VVS2    depth ~ color Wilcoxon rank sum test with continuity correction
#>  5 VS2     depth ~ color Wilcoxon rank sum test with continuity correction
#>  6 I1      depth ~ color Wilcoxon rank sum test with continuity correction
#>  7 VVS1    depth ~ color Wilcoxon rank sum test with continuity correction
#>  8 IF      depth ~ color Wilcoxon rank sum test with continuity correction
#>  9 SI2     price ~ color Wilcoxon rank sum test with continuity correction
#> 10 SI1     price ~ color Wilcoxon rank sum test with continuity correction
#> 11 VS1     price ~ color Wilcoxon rank sum test with continuity correction
#> 12 VVS2    price ~ color Wilcoxon rank sum test with continuity correction
#> 13 VS2     price ~ color Wilcoxon rank sum test with continuity correction
#> 14 I1      price ~ color Wilcoxon rank sum test with continuity correction
#> 15 VVS1    price ~ color Wilcoxon rank sum test with continuity correction
#> 16 IF      price ~ color Wilcoxon rank sum test with continuity correction
#>    statistic      estimate   conf.low     conf.high  p.value significance
#>        <dbl>         <dbl>      <dbl>         <dbl>    <dbl> <chr>       
#>  1   380600     -0.200        -0.300     -0.0000364 1.54e- 2 *           
#>  2   918892      0.0000393    -0.1000     0.100     6.77e- 1 ns          
#>  3   276743     -0.400        -0.500     -0.300     7.03e-12 ***         
#>  4    52618     -0.400        -0.600     -0.200     4.18e- 4 ***         
#>  5   816784.    -0.200        -0.300     -0.1000    8.86e- 5 ***         
#>  6     1570.    -2.00         -3.00      -1.000     1.21e- 4 ***         
#>  7    19455     -0.300        -0.500     -0.1000    5.07e- 3 **          
#>  8     3286     -0.300        -0.700     -0.0000251 4.79e- 2 *           
#>  9   269016  -1844.        -2182.     -1505.        8.82e-31 ***         
#> 10   608724. -1510.        -1789.     -1312.        8.20e-43 ***         
#> 11   277389  -1095.        -1414.      -745.        1.11e-11 ***         
#> 12    42795  -1764.        -2920.     -1205.        2.23e-10 ***         
#> 13   560930  -1888.        -2228.     -1539.        1.08e-54 ***         
#> 14     2017  -1150.        -1832.       -79.0       3.68e- 2 *           
#> 15    19086   -362.         -859.      -107.        2.57e- 3 **          
#> 16     5085    296.          125.       449.        4.94e- 3 **

We can focus on just two levels of clarity to illustrate another aspect of entering the arguments-

# subset the dataframe even further to just select two levels of clarity
diamonds_short2 <-
  dplyr::filter(.data = diamonds_short, clarity == "SI2" |
                  clarity == "SI1")

# wilcox test (aka Mann-Whitney U-test)
groupedstats::grouped_wilcox(
  data = diamonds_short2,
  dep.vars = c(carat, price),                    # two dependent variables
  indep.vars = c(color, clarity),                # two independent variables
  grouping.vars = cut,                           # one grouping variable
  paired = FALSE
)
#> # A tibble: 10 x 9
#>    cut       formula        
#>    <ord>     <chr>          
#>  1 Ideal     carat ~ color  
#>  2 Premium   carat ~ color  
#>  3 Good      carat ~ color  
#>  4 Very Good carat ~ color  
#>  5 Fair      carat ~ color  
#>  6 Ideal     price ~ clarity
#>  7 Premium   price ~ clarity
#>  8 Good      price ~ clarity
#>  9 Very Good price ~ clarity
#> 10 Fair      price ~ clarity
#>    method                                            statistic estimate
#>    <chr>                                                 <dbl>    <dbl>
#>  1 Wilcoxon rank sum test with continuity correction   103150.   -0.440
#>  2 Wilcoxon rank sum test with continuity correction    91748    -0.540
#>  3 Wilcoxon rank sum test with continuity correction    20752    -0.400
#>  4 Wilcoxon rank sum test with continuity correction    87590    -0.390
#>  5 Wilcoxon rank sum test with continuity correction     2356.   -0.230
#>  6 Wilcoxon rank sum test with continuity correction   344274.  774.   
#>  7 Wilcoxon rank sum test with continuity correction   329276.  876.   
#>  8 Wilcoxon rank sum test with continuity correction    64082.  516.   
#>  9 Wilcoxon rank sum test with continuity correction   272064   752.   
#> 10 Wilcoxon rank sum test with continuity correction     5113   170.   
#>    conf.low conf.high  p.value significance
#>       <dbl>     <dbl>    <dbl> <chr>       
#>  1   -0.500    -0.380 1.28e-51 ***         
#>  2   -0.600    -0.490 1.82e-59 ***         
#>  3   -0.500    -0.310 4.49e-18 ***         
#>  4   -0.460    -0.320 7.06e-37 ***         
#>  5   -0.370    -0.110 1.21e- 5 ***         
#>  6  536.     1075.    3.00e- 9 ***         
#>  7  538.     1278.    3.52e- 9 ***         
#>  8  141.      921.    3.05e- 3 **          
#>  9  486.     1098.    2.76e- 8 ***         
#> 10 -411.      802.    5.68e- 1 ns

In these examples, two things are worth noting that generalize to all functions in this package and stem from how tidy evaluation works:

  • If just one independent variable is provided for multiple dependent variables, it will be used as a common variable.
  • If you want to use a selection of variables, you need to use c() (as in the examples above).

Extending with purrr

groupedstats functions can be further extended with the purrr package. For example, let’s say we want to run the same type of linear model across the same grouping variable but with different formulas. A minimal sketch (not from the package documentation; it simply maps grouped_lm over a list of formulas)-
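
# two illustrative formulas (assumptions for this sketch)
formulas <- list(lifeExp ~ gdpPercap,
                 pop ~ gdpPercap)

# fitting each formula separately for every continent
purrr::map(
  .x = formulas,
  .f = ~ groupedstats::grouped_lm(
    data = gapminder::gapminder,
    grouping.vars = continent,
    formula = .x,
    output = "tidy"
  )
)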

Current code coverage

As the code stands right now, here is the code coverage for all primary functions involved: https://codecov.io/gh/IndrajeetPatil/groupedstats/tree/master/R

Contributing

I’m happy to receive bug reports, suggestions, questions, and (most of all) contributions to fix problems and add features. I personally prefer using the GitHub issues system over trying to reach out to me in other ways (personal e-mail, Twitter, etc.). Pull requests for contributions are encouraged.

Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.

Suggestions

If you find any bugs or have any suggestions/remarks, please file an issue on GitHub: https://github.com/IndrajeetPatil/groupedstats/issues