Indrajeet Patil
“Copy and paste is a design error.” - David Parnas
Why should you not repeat yourself?
The DRY Principle states that:
Every piece of knowledge must have a single representation in the codebase.
That is, you should not express the same thing in multiple places in multiple ways.
It’s about knowledge and not just code
The DRY principle is about duplication of knowledge. Thus, it applies to all programming entities that encode knowledge:
When code is duplicated, a change in one place requires parallel changes in every other place. When code is DRY, parallel modifications become unnecessary.
DRY code is easy to maintain since there is only a single representation of knowledge that needs to be updated if the underlying knowledge changes.
As a side effect, routines developed to remove duplicated code can become part of general-purpose utilities.
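A minimal sketch of the difference, using a unit-conversion factor as the piece of knowledge. The names `cm_per_inch` and `inches_to_cm()` are assumptions made up for this illustration and do not come from the slides.

```r
# Not DRY: the conversion factor (the "knowledge") is written out twice,
# so changing it would require parallel edits.
height_cm <- 70 * 2.54
width_cm <- 30 * 2.54

# DRY: the factor has a single representation that every caller reuses.
cm_per_inch <- 2.54
inches_to_cm <- function(x) {
  x * cm_per_inch
}

height_cm_dry <- inches_to_cm(70)
width_cm_dry <- inches_to_cm(30)
```

A helper like `inches_to_cm()` is also the kind of routine that can later move into a general-purpose utility file.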
Further Reading
Thomas, D., & Hunt, A. (2019). The Pragmatic Programmer. Addison-Wesley Professional. (pp. 30-38)
Understand the distinction between DRY and DAMP (Descriptive And Meaningful Phrases)
Apply the DRY principle to remove duplication in the following aspects of R package development:
How not to repeat yourself while writing documentation.
What users consult to find needed information may be context-dependent.
README: While exploring the package repository.
Vignettes: When first learning how to use a package.
Manual: When checking details about a specific function.
Thus, including crucial information in only one place makes it likely that users will miss it in certain contexts.
Some documentation is important enough to be included in multiple places (e.g. in the function documentation and in a vignette).
How can you document something just once but include it in multiple locations?
You can stitch an R Markdown document from smaller child documents.
(parent Rmd) (child Rmd) (result Rmd)
Thus, the information to repeat can be stored once in child documents and reused multiple times across parents.
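The stitching can be tried out with a small script. This is only a sketch: it assumes {knitr} is installed, creates throwaway files in a temporary folder, and the file names `info1.Rmd` and `parent.Rmd` are placeholders. The chunk fences are built with `strrep()` purely to keep the example readable.

```r
library(knitr)

# Work in a throwaway directory so nothing in a real project is touched.
demo_dir <- tempfile("dry-child-")
dir.create(demo_dir)

# Three backticks, used to write knitr chunk delimiters into the files.
fence <- strrep("`", 3)

# Child document: the information that should be written only once.
writeLines(
  "This paragraph is stored once, in the child document.",
  file.path(demo_dir, "info1.Rmd")
)

# Parent document: an empty chunk whose `child` option points to the child,
# using a path relative to the parent.
writeLines(
  c(
    "Text that belongs to the parent document.",
    "",
    paste0(fence, "{r, child = 'info1.Rmd'}"),
    fence
  ),
  file.path(demo_dir, "parent.Rmd")
)

# Knit from the parent's directory so that relative paths resolve against it.
old_wd <- setwd(demo_dir)
result_file <- knit("parent.Rmd", quiet = TRUE)
result <- readLines(result_file)
setwd(old_wd)
```

After knitting, `result` holds the text of both the parent and the child, and the same child could be pulled into any number of other parents.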
Stratagem: You can store child documents in the manual directory and reuse them.
Child documents
├── DESCRIPTION
├── man
│ └── rmd-children
│ └── info1.Rmd
│ └── ...
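Tips
A child document is just like any other .Rmd file and can include everything that any other .Rmd file can include.
The folder holding child documents can be named differently (e.g., rmd-fragments).
Include the Roxygen: list(markdown = TRUE) field in the DESCRIPTION file.
Child documents stored this way should not cause problems for R CMD check or for the {pkgdown} website.
Include contents of child documents in the documentation in multiple locations.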
Vignette
├── DESCRIPTION
├── vignettes
│ └── vignette1.Rmd
│ └── ...
│ └── web_only
│ └── vignette2.Rmd
│ └── ...
README
├── DESCRIPTION
├── README.Rmd
Include contents of child documents in the documentation in multiple locations.
Manual
├── DESCRIPTION
├── R
│ └── foo1.R
│ └── foo2.R
├── man
│ └── foo1.Rd
│ └── foo2.Rd
│ └── ...
Important
The underlying assumption here is that you are using {roxygen2} to generate package documentation.
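For function documentation, one option that {roxygen2} itself provides is the `@includeRmd` tag, which takes a path relative to the package root. The sketch below shows it as one possible route, not necessarily the one used in these slides. The path reuses the `man/rmd-children/info1.Rmd` layout shown earlier, and the body of `foo1()` is a placeholder.

```r
# R/foo1.R

#' A function whose help page reuses a child document
#'
#' @includeRmd man/rmd-children/info1.Rmd
#'
#' @export
foo1 <- function() {
  # Placeholder body; only the roxygen block above matters here.
  invisible(NULL)
}
```

When the documentation is regenerated, the rendered contents of the child document become part of `man/foo1.Rd`.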
You can include contents from any file in .Rmd, not just a child document!
(parent Rmd) (child Rmd) (.R + R engine) (.md + asis engine) (result Rmd)
Like child documents, other types of documents are also stored in the man/ folder.
Reusable content
├── DESCRIPTION
├── man
│ └── rmd-children
│ └── info1.Rmd
│ └── ...
│ └── md-fragments
│ └── fragment1.md
│ └── ...
│ └── r-chunks
│ └── chunk1.R
│ └── ...
fragment1.md
example:
Folder names
You can name these folders however you wish, but it is advisable that the names provide information about file contents (e.g., r-examples, yaml-snippets, md-fragments, etc.).
Include contents of various files in the documentation in multiple locations.
Vignette
├── DESCRIPTION
├── vignettes
│ └── vignette1.Rmd
│ └── ...
│ └── web_only
│ └── vignette2.Rmd
│ └── ...
README
├── DESCRIPTION
├── README.Rmd
In vignette1.Rmd
Include contents of child documents in the documentation in multiple locations.
Manual
├── DESCRIPTION
├── R
│ └── foo1.R
│ └── ...
├── man
│ └── foo1.Rd
│ └── ...
Important
The underlying assumption here is that you are using {roxygen2} to generate package documentation.
Summary on how to repeat documentation
If you are overwhelmed by this section, note that you actually need to remember only the following rules:
Store reusable document files in the man/ folder.
When you wish to include their contents, provide paths to these files relative to the document you are linking from.
If it’s a child .Rmd document, use the child option to include its contents.
If it’s not an .Rmd document, use the file option to include its contents and use the appropriate {knitr} engine. To see available engines, run names(knitr::knit_engines$get()).
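The last rule can be sketched with a script that knits a parent document pulling in a Markdown fragment (with the `asis` engine) and an R script (with the `r` engine) through the `file` chunk option. It assumes {knitr} is installed; all files are created in a temporary folder, and their names and contents are placeholders.

```r
library(knitr)

demo_dir <- tempfile("dry-files-")
dir.create(demo_dir)

# Three backticks, used to write knitr chunk delimiters into the parent file.
fence <- strrep("`", 3)

# A Markdown fragment and an R script, each stored exactly once.
writeLines(
  "This note comes from a Markdown fragment.",
  file.path(demo_dir, "fragment1.md")
)
writeLines(c("x <- 2 + 3", "x"), file.path(demo_dir, "chunk1.R"))

# Parent document: the `file` option supplies the chunk contents,
# and the engine decides how those contents are treated.
writeLines(
  c(
    paste0(fence, "{asis, file = 'fragment1.md'}"),
    fence,
    "",
    paste0(fence, "{r, file = 'chunk1.R'}"),
    fence
  ),
  file.path(demo_dir, "parent.Rmd")
)

# Knit from the parent's directory so that relative paths resolve against it.
old_wd <- setwd(demo_dir)
result <- readLines(knit("parent.Rmd", quiet = TRUE))
setwd(old_wd)

# The engines available in the installed version of knitr.
available_engines <- names(knitr::knit_engines$get())
```

The Markdown fragment is inserted verbatim, while the R script is evaluated like any other R chunk.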
Example packages that use reusable component documents to repeat documentation.
How not to repeat yourself while setting up vignettes.
Another duplication that occurs is in setup chunks for vignettes.
For example, some parts of the setup can be the same across vignettes.
├── DESCRIPTION
├── vignettes
│ └── vignette1.Rmd
│ └── vignette2.Rmd
│ └── ...
How can this repetition be avoided?
This repetition can be avoided by moving the common setup to a script, and sourcing it from vignettes. Storing this script in a folder (/setup) is advisable if there are many reusable artifacts.
Option 1
├── DESCRIPTION
├── vignettes
│ └── setup.R
Option 2
├── DESCRIPTION
├── vignettes
│ └── setup
│ └── setup.R
Sourcing common setup
No parallel modification
Now common setup can be modified with a change in only one place!
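A minimal sketch of Option 2, run in a temporary folder so that it is self-contained. It assumes {knitr} is installed, and the chunk options placed in `setup.R` are only an example of what a shared setup might contain.

```r
# Hypothetical package layout, created in a temporary location.
vignettes_dir <- file.path(tempfile("dry-setup-"), "vignettes")
dir.create(file.path(vignettes_dir, "setup"), recursive = TRUE)

# vignettes/setup/setup.R: the common setup, written exactly once.
writeLines(
  'knitr::opts_chunk$set(collapse = TRUE, comment = "#>")',
  file.path(vignettes_dir, "setup", "setup.R")
)

# In the setup chunk of every vignette this would be `source("setup/setup.R")`;
# the full path is used here only because the demo runs outside a vignette.
source(file.path(vignettes_dir, "setup", "setup.R"))
```

Each vignette then carries a single `source()` call instead of its own copy of the setup code.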
Packages in the wild that use this trick.
How not to repeat yourself while creating and re-using example datasets.
If none of the existing datasets are useful to illustrate your functions, you can create new datasets.
Let’s say your example dataset is called exdat and the function is called foo(). Using it in examples, vignettes, README, etc. requires that it be defined multiple times.
In vignettes
How can this repetition be avoided?
You can avoid this repetition by defining the data just once, saving and shipping it with the package.
The datasets are stored in data/, and documented in R/data.R.
Don’t forget!
Save the script (in the data-raw/ folder) to (re)create or update the dataset.
Include LazyData: true in the DESCRIPTION file.
exdat can now be used in examples, tests, vignettes, etc.; there is no need to define it every time it is used.
No parallel modification
Note that if you now wish to update the dataset, you need to change its definition only in one place!
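The workflow can be sketched as follows. The contents of `exdat` are invented for illustration, and the save step uses base R in a temporary folder so that the example runs outside a package; inside a package, `usethis::use_data(exdat, overwrite = TRUE)` is a common way to write `data/exdat.rda`.

```r
# data-raw/exdat.R: the only place where the dataset is defined.
exdat <- data.frame(
  group = c("a", "b", "c"),
  value = c(1.5, 2.5, 3.5)
)

# Stand-in for the package's data/ folder.
data_dir <- file.path(tempfile("dry-data-"), "data")
dir.create(data_dir, recursive = TRUE)
save(exdat, file = file.path(data_dir, "exdat.rda"))

# R/data.R: a dataset is documented by naming it in a string.
#' Example dataset
#'
#' @format A data frame with 3 rows and 2 columns.
"exdat"

# Wherever the dataset is needed, it is loaded rather than redefined.
loaded <- new.env()
load(file.path(data_dir, "exdat.rda"), envir = loaded)
```

With `LazyData: true`, users of the installed package would simply refer to `exdat` without any explicit loading step.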
Examples of R packages that define datasets and use them repeatedly.
How not to repeat yourself while writing unit tests.
A unit test records the code to describe expected output.
(actual) (expected)
Unit testing involves checking function output with a range of inputs, and this can involve recycling a test pattern.
Not DRY
But such recycling violates the DRY principle. How can you avoid this?
library(testthat)

# Function to test
multiplier <- function(x, y) {
  x * y
}

# Tests
test_that(
  desc = "multiplier works as expected",
  code = {
    expect_identical(multiplier(-1, 3), -3)
    expect_identical(multiplier(0, 3.4), 0)
    expect_identical(multiplier(NA, 4), NA_real_)
    expect_identical(multiplier(-2, -2), 4)
    expect_identical(multiplier(3, 3), 9)
  }
)
To avoid such repetition, you can write parameterized unit tests using {patrick}.
Repeated test pattern
expect_identical() used repeatedly.
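A sketch of how the `multiplier()` tests might look when parameterized with `with_parameters_test_that()` from {patrick}. It assumes {testthat} and {patrick} are installed; the parameter names `x`, `y`, and `res` are arbitrary choices for this illustration.

```r
library(testthat)
library(patrick)

# Function to test
multiplier <- function(x, y) {
  x * y
}

# The expectation is written once; each position across the three
# parameter vectors defines one test case.
with_parameters_test_that(
  "multiplier works as expected: ",
  {
    expect_identical(multiplier(x, y), res)
  },
  x = c(-1, 0, NA, -2, 3),
  y = c(3, 3.4, 4, -2, 3),
  res = c(-3, 0, NA, 4, 9)
)
```

Adding a new case now means adding one value to each vector rather than another copy of the expectation.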
Combinatorial explosion
The parameterized version may not seem impressive for this simple example, but it becomes exceedingly useful when there is a combinatorial explosion of possibilities. Creating each such test manually is cumbersome and error-prone.
You have already seen how user-facing datasets — useful for illustrating function usage — can be defined and saved once and then used repeatedly.
Similarly, you can define and save developer-facing datasets - useful for testing purposes - and use them across multiple tests.
Saving datasets in either of these locations is fine.
├── DESCRIPTION
├── tests
│ └── data
│ └── script.R
│ └── testdat1.rdata
│ └── testdat2.rdata
│ └── ...
├── DESCRIPTION
├── tests
│ └── testthat
│ └── data
│ └── script.R
│ └── testdat1.rdata
│ └── testdat2.rdata
│ └── ...
Save the script!
Always save the script used to create datasets. This script:
Without stored datasets, you define the same datasets multiple times across test files.
...
With saved datasets, you define just once and load them from test files.
...
Note
The exact path provided to readRDS() will depend on where the datasets are stored inside the tests/ folder.
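A minimal sketch of the save-once, load-everywhere pattern with `saveRDS()` and `readRDS()`. The dataset contents and the `.rds` file name are assumptions for illustration, and a temporary folder stands in for `tests/data`.

```r
# tests/data/script.R: creates the dataset once and saves it.
testdat1 <- data.frame(
  x = c(-1, 0, 2),
  y = c(3, 3.4, 4)
)

# Stand-in for the tests/data folder of a package.
data_dir <- file.path(tempfile("dry-tests-"), "tests", "data")
dir.create(data_dir, recursive = TRUE)
saveRDS(testdat1, file.path(data_dir, "testdat1.rds"))

# In any test file: load the saved dataset instead of redefining it.
testdat1_loaded <- readRDS(file.path(data_dir, "testdat1.rds"))
```

Inside a real test file, the path is typically built relative to the test directory, for example with `testthat::test_path()`.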
Examples of R packages that save datasets required for unit testing.
How not to repeat yourself while signalling exceptions
Exceptions/conditions (messages, warnings, and errors) provide a way for functions to signal to the user that something unexpected happened. Often, similar exceptions need to be signalled across functions.
E.g., for functions that don’t accept negative values:
input validation
unit testing
How can this repetition be avoided?
We can avoid this repetition by extracting exception message strings into functions with informative names and then storing these functions in a list.
exceptions <- list(
  only_positives_allowed = function(arg_name) {
    paste0("Argument `", arg_name, "` should be positive.")
  }
  # ... you can store as many functions as you want
)
Why not include the entire validation?
You can move the entire if() block to only_positives_allowed() and create a new validation function.
But this is not done here to address the most general case where:
if() block
We can then use these functions to signal exceptions.
Input validation
Unit testing
No parallel modification
Note that if you now wish to change the condition string, this change needs to be made only in one place!
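A sketch of both uses, assuming {testthat} is installed. `safe_sqrt()` is a hypothetical function invented for this illustration; only the `exceptions` list follows the definition shown earlier.

```r
library(testthat)

# Message-generating functions, stored once.
exceptions <- list(
  only_positives_allowed = function(arg_name) {
    paste0("Argument `", arg_name, "` should be positive.")
  }
)

# Input validation: the message is requested from the list.
safe_sqrt <- function(x) {
  if (x < 0) {
    stop(exceptions$only_positives_allowed("x"))
  }
  sqrt(x)
}

# Unit testing: the expected message comes from the same function,
# so the string is never typed a second time.
test_that("safe_sqrt() rejects negative input", {
  expect_error(
    safe_sqrt(-1),
    exceptions$only_positives_allowed("x"),
    fixed = TRUE
  )
})
```

`fixed = TRUE` makes the comparison a literal string match, which avoids treating characters in the message as regular-expression syntax.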
As noted before, you can also move the entire validation to a new function. E.g.
exceptions <- list(
  check_only_positive = function(arg) {
    arg_name <- deparse(substitute(arg))
    if (arg < 0) {
      stop(paste0("Argument `", arg_name, "` should be positive."))
    }
  }
  # ... you can store as many functions as you want
)
Input validation
Unit testing
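A base-R sketch of how such a validation function might be called. `half_width()` is a hypothetical function used only for illustration; because of `deparse(substitute(arg))`, the message names the variable that was passed in.

```r
exceptions <- list(
  check_only_positive = function(arg) {
    # Recover the name of the variable supplied by the caller.
    arg_name <- deparse(substitute(arg))
    if (arg < 0) {
      stop(paste0("Argument `", arg_name, "` should be positive."))
    }
  }
)

# Input validation becomes a single call at the top of the function.
half_width <- function(width) {
  exceptions$check_only_positive(width)
  width / 2
}

# Capture the error message for a negative input.
msg <- tryCatch(
  half_width(-4),
  error = function(e) conditionMessage(e)
)
```

Here `msg` is the string "Argument `width` should be positive.", while `half_width(4)` returns 2 without signalling anything.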
Most often the exceptions will be useful only for the package in which they are defined. But, if the exceptions are generic enough, you can even export them. This will make them reusable not only in the current package, but also in other packages.
That is, DRYing up exceptions in one package does the same for many!
Why a list?
It is not necessary that you store exceptions in a list; you can create individual functions outside of a list and export them.
But storing them in a list has the following advantages:
Simpler NAMESPACE: There is a single export for all exceptions (e.g., exceptions), instead of dozens (e.g., only_positives_allowed(), only_negatives_allowed(), only_scalar_allowed(), etc.), which can overpower the rest of the package API.
Extensibility: You can easily extend a list of imported exceptions by adding more exceptions that are relevant only for the current package. E.g., exceptions$my_new_exception_function <- function() {...}
Example of an R package that creates a list of exception functions and exports it:
Example of an R package that imports this list and extends it:
How not to repeat yourself while importing external package functions.
Instead of using :: to access an external package function (e.g., rlang::warn()), you can specify imports explicitly via the roxygen directive #' @importFrom.
But if you are importing some functions multiple times, you should avoid specifying the import multiple times, and instead collect all imports in a single file.
Import statements scattered across files:
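As a sketch of the alternative, all `@importFrom` directives can be collected in one file. The file name `R/mypkg-package.R` and the particular functions imported are assumptions for illustration.

```r
# R/mypkg-package.R: the single place where imports are declared.

#' @keywords internal
"_PACKAGE"

# Every function imported from another package is listed here once,
# instead of above each function that happens to use it.
#' @importFrom rlang warn abort
#' @importFrom utils head tail
NULL
```

When {roxygen2} regenerates the NAMESPACE, the resulting import directives are the same as with scattered tags, but there is only one file to edit when an import changes.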
Examples of R packages that list the NAMESPACE imports in a single file this way.
You can use these techniques to avoid repetition while developing R packages, which should make the development workflow faster, more maintainable, and less error-prone.
Although related to package development at a meta level, these issues are beyond the scope of the current presentation. I can only point to resources to help you get started.
And Happy (DRY) Package Development! 😊