DRY Package Development in R

Indrajeet Patil

“Copy and paste is a design error.”   - David Parnas

Why So DRY

Why should you not repeat yourself?

Don’t Repeat Yourself (DRY) Principle

The DRY Principle states that:

Every piece of knowledge must have a single representation in the codebase.

That is, you should not express the same thing in multiple places in multiple ways.

It’s about knowledge and not just code

The DRY principle is about duplication of knowledge. Thus, it applies to all programming entities that encode knowledge:

  • You should not duplicate code.
  • You should not duplicate intent across code and comments.
  • You should not duplicate knowledge in data structures.

Benefits of DRY codebase

  • When code is duplicated, a change made in one place requires parallel changes in other places. When code is DRY, parallel modifications become unnecessary.

  • DRY code is easier to maintain because there is only a single representation of knowledge to update when the underlying knowledge changes.

  • As a side effect, routines developed to remove duplicated code can become part of general-purpose utilities.
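
A minimal sketch of the first two benefits. The function names and the tax rate below are hypothetical and not part of the original slides; the point is only that the same piece of knowledge (the rate) is either written out twice or represented once.

```r
# Not DRY: the same knowledge (a 19% rate) is encoded in two places,
# so changing the rate requires two parallel edits.
price_with_tax <- function(price) price * 1.19
tax_amount <- function(price) price * 0.19

# DRY: the rate has a single representation that both functions reuse.
TAX_RATE <- 0.19
price_with_tax_dry <- function(price) price * (1 + TAX_RATE)
tax_amount_dry <- function(price) price * TAX_RATE

price_with_tax_dry(100)
tax_amount_dry(100)
```

If the rate changes, only `TAX_RATE` needs to be updated in the DRY version.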


Further Reading

Plan

Apply the DRY principle to remove duplication in the following aspects of R package development:

  • Documentation
  • Vignette setup
  • Data
  • Unit testing
  • Exceptions
  • Dependency management

Documentation

How not to repeat yourself while writing documentation.

What do users read?

What users consult to find needed information may be context-dependent.


README: While exploring the package repository.


Vignettes: When first learning how to use a package.


Manual: When checking details about a specific function.


Thus, if crucial information appears in only one of these places, users are likely to miss it in the other contexts.

Go forth and multiply (without repetition)

Some documentation is important enough to be included in multiple places (e.g. in the function documentation and in a vignette).


How can you document something just once but include it in multiple locations?

Child documents

You can stitch an R Markdown document from smaller child documents.

                 

(parent Rmd) + (child Rmd) = (result Rmd)

Thus, the information to repeat can be stored once in child documents and reused multiple times across parents.
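
The stitching can be tried outside a package. The following is a minimal sketch, assuming the {knitr} package is installed; the temporary folder and the file name parent.Rmd are hypothetical and chosen only for illustration.

```r
# Assumption: {knitr} is installed. All files live in a temporary folder.
tmp <- file.path(tempdir(), "child-demo")
dir.create(tmp, showWarnings = FALSE)

# Child document: the information that should be written only once
writeLines(
  "This is some crucial information to be repeated across documentation.",
  file.path(tmp, "info1.Rmd")
)

# Parent document: an empty chunk whose `child` option points to the child.
# The chunk fence is built with strrep() to keep this example readable.
fence <- strrep("`", 3)
writeLines(
  c(
    "Parent content.",
    "",
    paste0(fence, "{r, child=\"info1.Rmd\"}"),
    fence
  ),
  file.path(tmp, "parent.Rmd")
)

# Knit from inside the folder so that the relative child path resolves
old_wd <- setwd(tmp)
out_file <- knitr::knit("parent.Rmd", output = "parent.md", quiet = TRUE)
result <- readLines(out_file)
setwd(old_wd)

# The knitted result contains both the parent and the child content
print(result)
```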

Storing child documents in package

Stratagem: You can store child documents in the man/ directory and reuse them.

Child documents

├── DESCRIPTION
├── man
│   └── rmd-children
│       └── info1.Rmd
│       └── ...

info1.Rmd example:

This is some crucial information to be repeated across documentation.

```{r}
1 + 1
```


Tips

  • You can include as many child documents as you want.
  • The child document is just like any .Rmd file and can include everything that any other .Rmd file can include.
  • You can choose a different name for the folder containing child documents (e.g. rmd-fragments).
  • Make sure to include the Roxygen: list(markdown = TRUE) field in the DESCRIPTION file.
  • The child documents will not pose a problem either for R CMD check or for the {pkgdown} website.

Using child documents in package: Part-1

Include contents of child documents in the documentation in multiple locations.

Vignette

├── DESCRIPTION
├── vignettes
│   └── vignette1.Rmd
│   └── ...
│   └── web_only
│       └── vignette2.Rmd
│       └── ...

README

├── DESCRIPTION
├── README.Rmd

In vignette1.Rmd

---
output: html_vignette
---

Vignette content.

```{r, child="../man/rmd-children/info1.Rmd"}
```

In README.Rmd

---
output: github_document
---

README content.

```{r, child="man/rmd-children/info1.Rmd"}
```

Using child documents in package: Part-2

Include contents of child documents in the documentation in multiple locations.

Manual

├── DESCRIPTION
├── R
│   └── foo1.R
│   └── foo2.R
├── man
│   └── foo1.Rd
│   └── foo2.Rd
│   └── ...

In foo1.R

#' @title Foo1
#' @section Information:
#' 
#' ```{r, child="man/rmd-children/info1.Rmd"}
#' ```
foo1 <- function() { ... }


Important

The underlying assumption here is that you are using {roxygen2} to generate package documentation.

What about non-child documents?

You can include contents from any file in .Rmd, not just a child document!


                                   

(parent Rmd) + (child Rmd) + (.R + R engine) + (.md + asis engine) = (result Rmd)

Storing other documentation files in package

Like child documents, other types of documents are also stored in the man/ folder.

Reusable content

├── DESCRIPTION
├── man
│   └── rmd-children
│       └── info1.Rmd
│       └── ...
│   └── md-fragments
│       └── fragment1.md
│       └── ...
│   └── r-chunks
│       └── chunk1.R
│       └── ...

fragment1.md example:

This `.md` file contains 
content to be included *as is*
across multiple locations
in the documentation.

chunk1.R example:

# some comment and code
1 + 1

# more comments and code
2 + 3


Folder names

You can name these folders however you wish, but it is advisable that the names provide information about file contents (e.g., r-examples, yaml-snippets, md-fragments, etc.).

Using non-child documents in package: Part-1

Include contents of various files in the documentation in multiple locations.

Vignette

├── DESCRIPTION
├── vignettes
│   └── vignette1.Rmd
│   └── ...
│   └── web_only
│       └── vignette2.Rmd
│       └── ...

README

├── DESCRIPTION
├── README.Rmd

In vignette1.Rmd

---
output: html_vignette
---

Vignette content.

```{asis, file="../man/md-fragments/fragment1.md"}
```

```{r, file="../man/r-chunks/chunk1.R"}
```

In README.Rmd

---
output: github_document
---

README content.

```{asis, file="man/md-fragments/fragment1.md"}
```

```{r, file="man/r-chunks/chunk1.R"}
```

Using non-child documents in package: Part-2

Include contents of various files in the documentation in multiple locations.

Manual

├── DESCRIPTION
├── R
│   └── foo1.R
│   └── ...
├── man
│   └── foo1.Rd
│   └── ...

In foo1.R

#' @title Foo1
#' @section Information:
#' 
#' ```{asis, file="man/md-fragments/fragment1.md"}
#' ```
#' 
#' @example man/r-chunks/chunk1.R
foo1 <- function() { ... }


Important

The underlying assumption here is that you are using {roxygen2} to generate package documentation.

Summary on how to repeat documentation

If you are overwhelmed by this section, note that you actually need to remember only the following rules:

  • Store reusable document files in the man/ folder.

  • When you wish to include their contents, provide paths to these files relative to the document you are linking from (in roxygen comments, relative to the package root, as in the foo1.R examples).

  • If it’s a child .Rmd document, use the child option to include its contents.

  • If it’s not an .Rmd document, use the file option to include its contents and use the appropriate {knitr} engine. To see available engines, run names(knitr::knit_engines$get()).
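
The last rule can be checked with a small experiment. This is a sketch only, assuming {knitr} is installed in a version that supports the file chunk option; the temporary folder and parent.Rmd are hypothetical stand-ins for a real vignette or README.

```r
# Assumption: {knitr} is installed. All files live in a temporary folder.
tmp <- file.path(tempdir(), "file-option-demo")
dir.create(tmp, showWarnings = FALSE)

# Reusable fragments: a Markdown file and an R script
writeLines("Content to be included *as is*.", file.path(tmp, "fragment1.md"))
writeLines(c("# some comment and code", "1 + 1"), file.path(tmp, "chunk1.R"))

# The asis engine must be available for the Markdown fragment
stopifnot("asis" %in% names(knitr::knit_engines$get()))

# Parent document: two empty chunks that read their content via `file`
fence <- strrep("`", 3)
writeLines(
  c(
    "Parent content.",
    "",
    paste0(fence, "{asis, file=\"fragment1.md\"}"),
    fence,
    "",
    paste0(fence, "{r, file=\"chunk1.R\"}"),
    fence
  ),
  file.path(tmp, "parent.Rmd")
)

# Knit from inside the folder so that the relative file paths resolve
old_wd <- setwd(tmp)
out_file <- knitr::knit("parent.Rmd", output = "parent.md", quiet = TRUE)
result <- readLines(out_file)
setwd(old_wd)

print(result)
```

The Markdown fragment is copied verbatim, while the R script is echoed and evaluated like any other R chunk.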

Self-study

Example packages that use reusable component documents to repeat documentation.

Vignette Setup

How not to repeat yourself while setting up vignettes.

Setup chunks in vignettes

Another duplication that occurs is in setup chunks for vignettes.

For example, some parts of the setup can be the same across vignettes.


├── DESCRIPTION
├── vignettes
│   └── vignette1.Rmd
│   └── vignette2.Rmd
│   └── ...

In vignette1.Rmd

---
title: "Vignette-1"
output: html_vignette
---

```{r}
knitr::opts_chunk$set(
  message = FALSE,
  collapse = TRUE,
  comment = "#>"
)
```

In vignette2.Rmd

---
title: "Vignette-2"
output: html_vignette
---

```{r}
knitr::opts_chunk$set(
  message = FALSE,
  collapse = TRUE,
  comment = "#>"
)

options(crayon.enabled = TRUE)
```


How can this repetition be avoided?

Sourcing setup chunks in vignettes

This repetition can be avoided by moving the common setup to a script and sourcing it from the vignettes. Storing this script in a dedicated folder (e.g. setup/, as in Option 2) is advisable if there are many reusable artifacts.

Option 1

├── DESCRIPTION
├── vignettes
│   └── setup.R

Option 2

├── DESCRIPTION
├── vignettes
│   └── setup
│       └── setup.R

setup.R contents

knitr::opts_chunk$set(
  message = FALSE,
  collapse = TRUE,
  comment = "#>"
)

Sourcing common setup

---
title: "Vignette-1"
output: html_vignette
---

```{r setup, include = FALSE}
source("setup/setup.R")
```

Sourcing common setup

---
title: "Vignette-2"
output: html_vignette
---

```{r setup, include = FALSE}
source("setup/setup.R")
options(crayon.enabled = TRUE)
```

No parallel modification

Now the common setup can be modified with a change in only one place!
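
A minimal sketch of what the shared setup chunk does, assuming {knitr} is installed. The temporary vignettes/ folder is a hypothetical stand-in for the package's own folder; changing the working directory here only mimics the fact that chunks are normally evaluated relative to the vignette's location.

```r
# Assumption: {knitr} is installed. Folder layout mirrors Option 2.
vignette_dir <- file.path(tempdir(), "vignettes")
dir.create(
  file.path(vignette_dir, "setup"),
  recursive = TRUE,
  showWarnings = FALSE
)

# The single, shared definition of the common chunk options
writeLines(
  c(
    "knitr::opts_chunk$set(",
    "  message = FALSE,",
    "  collapse = TRUE,",
    "  comment = \"#>\"",
    ")"
  ),
  file.path(vignette_dir, "setup", "setup.R")
)

# What the setup chunk of every vignette runs
old_wd <- setwd(vignette_dir)
source("setup/setup.R")
setwd(old_wd)

# The options set in setup.R are now active
knitr::opts_chunk$get("comment")
knitr::opts_chunk$get("collapse")
```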

Self-study

Packages in the wild that use this trick.

Data

How not to repeat yourself while creating and re-using example datasets.

Illustrative example datasets

If none of the existing datasets are useful to illustrate your functions, you can create new datasets.

Let’s say your example dataset is called exdat and the function is called foo(). Using it in examples, vignettes, README, etc. requires that it be defined multiple times.

In examples

#' @examples
#' exdat <- matrix(c(71, 50))
#' foo(exdat)

In vignettes

---
title: "My Vignette"
output: html_vignette
---

```{r}
exdat <- matrix(c(71, 50))
foo(exdat)
```

In README

---
output: github_document
---

```{r}
exdat <- matrix(c(71, 50))
foo(exdat)
```


How can this repetition be avoided?

Shipping data in a package

You can avoid this repetition by defining the data just once, saving and shipping it with the package.

The datasets are stored in data/, and documented in R/data.R.

Saving data

# In `exdat.R`
exdat <- matrix(c(71, 50))
save(exdat, file="data/exdat.rdata")

Directory structure

├── DESCRIPTION
├── data-raw
│   └── exdat.R
├── data
│   └── exdat.rdata
├── R
│   └── data.R

Don’t forget!

  • For future reference, save the script (in the data-raw/ folder) used to (re)create or update the dataset.
  • If you include datasets, set LazyData: true in the DESCRIPTION file.
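
The slides mention that datasets are documented in R/data.R without showing the file. A minimal sketch of what such a file could contain, assuming {roxygen2} is used; the title and description wording are hypothetical.

```r
# data-raw/exdat.R: the script that (re)creates the dataset
exdat <- matrix(c(71, 50))
# save(exdat, file = "data/exdat.rdata")  # run from the package root

# R/data.R: a roxygen block that documents the dataset.
# The quoted dataset name after the block tells roxygen2 what is documented.

#' Example dataset for illustrating `foo()`
#'
#' A small numeric matrix used in examples, vignettes, and the README.
#'
#' @format A numeric matrix with 2 rows and 1 column.
"exdat"
```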

Reusable dataset

exdat can now be used in examples, tests, vignettes, etc.; there is no need to define it every time it is used.

In examples

#' @examples
#' foo(exdat)

In vignettes

---
title: "My Vignette"
output: html_vignette
---

```{r}
foo(exdat)
```

In README

---
output: github_document
---

```{r}
foo(exdat)
```


No parallel modification

Note that if you now wish to update the dataset, you need to change its definition only in one place!

Self-study

Examples of R packages that define datasets and use them repeatedly.

Unit testing

How not to repeat yourself while writing unit tests.

Repeated test patterns

A unit test records the code to describe expected output.


(actual output) vs. (expected output)


Unit testing involves checking function output with a range of inputs, and this can involve recycling a test pattern.

Not DRY

But such recycling violates the DRY principle. How can you avoid this?

# Function to test
multiplier <- function(x, y) {
  x * y
}

# Tests
test_that(
  desc = "multiplier works as expected",
  code = {
    expect_identical(multiplier(-1, 3),  -3)
    expect_identical(multiplier(0,  3.4), 0)
    expect_identical(multiplier(NA, 4),   NA_real_)
    expect_identical(multiplier(-2, -2),  4)
    expect_identical(multiplier(3,  3),   9)
  }
)

Parametrized unit testing

To avoid such repetition, you can write parametrized unit tests using {patrick}.

Repeated test pattern

expect_identical() used repeatedly.

test_that(
  desc = "multiplier works as expected",
  code = {
    expect_identical(multiplier(-1, 3),  -3)
    expect_identical(multiplier(0,  3.4), 0)
    expect_identical(multiplier(NA, 4),   NA_real_)
    expect_identical(multiplier(-2, -2),  4)
    expect_identical(multiplier(3,  3),   9)
  }
)

Parametrized test pattern

expect_identical() used once.

patrick::with_parameters_test_that(
  desc_stub = "multiplier works as expected",
  code = expect_identical(multiplier(x, y), res),
  .cases = tibble::tribble(
    ~x,  ~y,  ~res,
    -1,  3,   -3,
    0,   3.4,  0,
    NA,  4,    NA_real_,
    -2,  -2,   4,
    3,   3,    9
  )
)

Combinatorial explosion

The parametrized version may not seem impressive for this simple example, but it becomes exceedingly useful when there is a combinatorial explosion of possibilities. Creating each such test manually is cumbersome and error-prone.
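
A sketch of how the cases can be generated instead of typed out, assuming {testthat} and {patrick} are installed. The input values, the expected_sign column, and the test description are hypothetical; the expected value is derived from the signs of the inputs rather than from the function under test.

```r
# Assumption: {testthat} and {patrick} are installed.
library(testthat)

# Function to test (as defined earlier in the slides)
multiplier <- function(x, y) {
  x * y
}

# Every combination of three x-values and three y-values (9 cases)
cases <- expand.grid(x = c(-2, 1, 3), y = c(-1, 0.5, 4))

# Expected sign of the product, derived from the signs of the inputs
cases$expected_sign <- sign(cases$x) * sign(cases$y)

# Explicit test names, stored in patrick's `.test_name` column
cases$.test_name <- paste0("x = ", cases$x, ", y = ", cases$y)

patrick::with_parameters_test_that(
  "multiplier returns a product with the right sign",
  {
    expect_identical(sign(multiplier(x, y)), expected_sign)
  },
  .cases = cases
)
```

Adding one more input value to either vector adds a whole row or column of cases without any new test code.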

Repeated usage of testing datasets

You have already seen how user-facing datasets (useful for illustrating function usage) can be defined and saved once and then used repeatedly.

Similarly, you can define and save developer-facing datasets (useful for testing purposes) and use them across multiple tests.

Saving datasets in either of these locations is fine.

Option 1

├── DESCRIPTION
├── tests
│   └── data
│       └── script.R
│       └── testdat1.rds
│       └── testdat2.rds
│       └── ...

Option 2

├── DESCRIPTION
├── tests
│   └── testthat
│       └── data
│           └── script.R
│           └── testdat1.rds
│           └── testdat2.rds
│           └── ...

Save the script!

Always save the script used to create datasets. This script:

  • acts as documentation for the datasets
  • makes it easy to modify the datasets in the future (if needed)

Using test datasets

Without stored datasets, you define the same datasets multiple times across test files.

In test-foo1.R:

testdat1 <- { ... }
foo1(testdat1)

In test-foo2.R:

testdat1 <- { ... }
foo2(testdat1)

        ...


With saved datasets, you define just once and load them from test files.

In test-foo1.R:

testdat1 <- readRDS("testdat1")
foo1(testdat1)

In test-foo2.R:

testdat1 <- readRDS("testdat1")
foo2(testdat1)

        ...


Note

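The exact path provided to readRDS() will depend on where the datasets are stored inside the tests/ folder. Note that readRDS() reads files written by saveRDS(); files written by save() are read with load() instead.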
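
A minimal sketch of the save-once, load-everywhere workflow using only base R. The temporary folder stands in for the data folder inside tests/, and the contents of testdat1 are hypothetical.

```r
# Stand-in for the data folder inside tests/ (temporary, for illustration)
data_dir <- file.path(tempdir(), "data")
dir.create(data_dir, showWarnings = FALSE)

# script.R: defines the dataset once and writes it to disk
testdat1 <- data.frame(id = 1:3, value = c(2.5, 0, -1))
saveRDS(testdat1, file.path(data_dir, "testdat1.rds"))

# In a test file: load the stored dataset instead of redefining it.
# Inside a package, the path could instead be built with
# testthat::test_path("data", "testdat1.rds"), depending on the layout.
loaded <- readRDS(file.path(data_dir, "testdat1.rds"))

# The loaded object is the same as the one that was saved
identical(loaded, testdat1)
```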

Self-study

Examples of R packages that save datasets required for unit testing.

Exceptions

How not to repeat yourself while signalling exceptions.

Sending signals

Exceptions/conditions (messages, warnings, and errors) provide a way for functions to signal to the user that something unexpected happened. Often, similar exceptions need to be signalled across functions.

E.g., for functions that don’t accept negative values:

Input validation

foo1 <- function(x) {
  if (x < 0) {
    stop("Argument `x` should be positive.")
  }
  
  ...
}
foo2 <- function(y) {
  if (y < 0) {
    stop("Argument `y` should be positive.")
  }
  
  ...
}

Unit testing

expect_error(
  foo1(-1), 
  "Argument `x` should be positive."
)
expect_error(
  foo2(-1), 
  "Argument `y` should be positive."
)

How can this repetition be avoided?

List of exception functions

We can avoid this repetition by extracting each exception message string into a function with an informative name, and then storing these functions in a list.

exceptions <- list(
  only_positives_allowed = function(arg_name) {
    paste0("Argument `", arg_name, "` should be positive.")
  }
  # ... you can store as many functions as you want
)

Why not include the entire validation?

You can move the entire if() block to only_positives_allowed() and create a new validation function.

But this is not done here to address the most general case where:

  • the exception message string can be used outside of an if() block
  • it can be used not only as a message, but also as a warning or an error
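
The second point can be illustrated with base R alone. In this sketch, the same message string is signalled as an error, a warning, and a message; the variable names err, wrn, and inf are hypothetical.

```r
exceptions <- list(
  only_positives_allowed = function(arg_name) {
    paste0("Argument `", arg_name, "` should be positive.")
  }
)

# One message string, built once
msg <- exceptions$only_positives_allowed("x")

# Signalled as an error
err <- tryCatch(stop(msg), error = function(e) conditionMessage(e))

# Signalled as a warning
wrn <- tryCatch(warning(msg), warning = function(w) conditionMessage(w))

# Signalled as a message (message() appends a newline to the text)
inf <- tryCatch(message(msg), message = function(m) conditionMessage(m))

c(error = err, warning = wrn, message = inf)
```

Because the list stores only the string-producing function, the caller stays free to choose the kind of condition.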

Reusable exceptions: Part-1

We can then use these functions to signal exceptions.

Input validation

foo1 <- function(x) {
  if (x < 0) {
    stop(exceptions$only_positives_allowed("x"))
  }
  
  ...
}
foo2 <- function(y) {
  if (y < 0) {
    stop(exceptions$only_positives_allowed("y"))
  }
  
  ...
}

Unit testing

expect_error(
  foo1(-1), 
  exceptions$only_positives_allowed("x")
)
expect_error(
  foo2(-1), 
  exceptions$only_positives_allowed("y")
)


No parallel modification

Note that if you now wish to change the condition string, this change needs to be made only in one place!

Reusable exceptions: Part-2

As noted before, you can also move the entire validation to a new function. E.g.

exceptions <- list(
  check_only_positive = function(arg) {
    arg_name <- deparse(substitute(arg))
    if (arg < 0) {
      stop(paste0("Argument `", arg_name, "` should be positive."))
    }
  }
  # ... you can store as many functions as you want
)

Input validation

foo1 <- function(x) {
  exceptions$check_only_positive(x)
  
  ...
}
foo2 <- function(y) {
  exceptions$check_only_positive(y)
  
  ...
}

Unit testing

x <- -1
expect_error(
  exceptions$check_only_positive(x), 
  "Argument `x` should be positive."
)

Since the validation has moved to a new function, you only need to test it once.
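
A runnable sketch of this pattern in base R, showing that deparse(substitute(arg)) picks up the argument name used by the calling function. The body of foo2() (a square root) is a hypothetical placeholder for the elided function body.

```r
exceptions <- list(
  check_only_positive = function(arg) {
    # Capture the expression the caller passed, e.g. `y`
    arg_name <- deparse(substitute(arg))
    if (arg < 0) {
      stop(paste0("Argument `", arg_name, "` should be positive."))
    }
    invisible(TRUE)
  }
)

# Hypothetical function body: validate first, then compute
foo2 <- function(y) {
  exceptions$check_only_positive(y)
  sqrt(y)
}

# Valid input passes through the check
foo2(4)

# Invalid input produces a message that names the caller's argument
msg <- tryCatch(foo2(-1), error = function(e) conditionMessage(e))
msg
```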

DRY once, DRY multiple times

Most often the exceptions will be useful only for the package in which they are defined. But, if the exceptions are generic enough, you can even export them. This will make them reusable not only in the current package, but also in other packages.

That is, DRYing up exceptions in one package does the same for many!

Why a list?

It is not necessary that you store exceptions in a list; you can create individual functions outside of a list and export them.

But storing them in a list has the following advantages:

  • Simpler NAMESPACE: There is a single export for all exceptions (e.g. exceptions), instead of dozens (e.g., only_positives_allowed(), only_negatives_allowed(), only_scalar_allowed(), etc.), which can overwhelm the rest of the package API.

  • Extensibility: You can easily extend an imported list of exceptions by adding more exceptions that are relevant only for the current package. E.g. exceptions$my_new_exception_function <- function() {...}
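
A minimal sketch of extending such a list. Here imported_exceptions is a hypothetical stand-in for a list exported by another package, and the wording of the new message is an assumption.

```r
# Stand-in for a list of exceptions exported by another package
imported_exceptions <- list(
  only_positives_allowed = function(arg_name) {
    paste0("Argument `", arg_name, "` should be positive.")
  }
)

# In the current package: start from the imported list ...
exceptions <- imported_exceptions

# ... and add an exception that is relevant only here
exceptions$only_scalar_allowed <- function(arg_name) {
  paste0("Argument `", arg_name, "` should be a scalar.")
}

# The extended list holds both; the imported list is unchanged
names(exceptions)
names(imported_exceptions)
```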

Self-study

Example of an R package that creates a list of exception functions and exports it:

{ospsuite.utils}

Example of an R package that imports this list and appends to it:

{ospsuite}

Dependency management

How not to repeat yourself while importing external package functions.

Imports

Instead of using :: to access an external package function (e.g. rlang::warn()), you can specify imports explicitly via the roxygen tag #' @importFrom.

But if you are importing some functions multiple times, you should avoid specifying the import multiple times, and instead collect all imports in a single file.

Import statements scattered across files:

# file-1
#' @importFrom rlang warn
...

# file-2
#' @importFrom rlang warn
...

#' @importFrom purrr pluck
...

# file-3
#' @importFrom rlang warn seq2
...

# file-4, file-5, etc.
...

In {pkgname}-package.R file:

## {pkgname} namespace: start
#'
#' @importFrom rlang warn seq2
#' @importFrom purrr pluck
#'
## {pkgname} namespace: end
NULL
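
To find out whether imports are scattered in an existing package, the source files can be scanned for repeated tags. A minimal base R sketch, assuming tags are written on a single line as in the examples above; the temporary R/ folder and its files are hypothetical.

```r
# Hypothetical package sources written to a temporary R/ folder
r_dir <- file.path(tempdir(), "R")
dir.create(r_dir, showWarnings = FALSE)
writeLines(
  c("#' @importFrom rlang warn", "f1 <- function() NULL"),
  file.path(r_dir, "file-1.R")
)
writeLines(
  c("#' @importFrom rlang warn", "#' @importFrom purrr pluck", "f2 <- function() NULL"),
  file.path(r_dir, "file-2.R")
)
writeLines(
  c("#' @importFrom rlang warn seq2", "f3 <- function() NULL"),
  file.path(r_dir, "file-3.R")
)

# Collect every @importFrom line from these files
files <- list.files(r_dir, pattern = "^file-[0-9]+\\.R$", full.names = TRUE)
tags <- unlist(lapply(files, function(f) {
  lines <- readLines(f)
  lines[grepl("^#' @importFrom ", lines)]
}))

# Split "pkg fun1 fun2" into one "pkg::fun" entry per imported function
specs <- strsplit(sub("^#' @importFrom ", "", tags), " +")
imports <- unlist(lapply(specs, function(parts) {
  paste0(parts[1], "::", parts[-1])
}))

# Imports that are declared more than once across files
counts <- table(imports)
repeated <- names(counts)[as.vector(counts) > 1]
repeated
```

Any function reported here is a candidate for being declared once in the {pkgname}-package.R file.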

Self-study

Examples of R packages that list the NAMESPACE imports in a single file this way.

Conclusion

You can use these techniques to avoid repetition while developing R packages, which should make the development workflow faster, more maintainable, and less error-prone.

Advanced

Although related to package development at a meta level, these issues are beyond the scope of the current presentation. I can only point to resources to help you get started.


Thank You

And Happy (DRY) Package Development! 😊
