Reproducibility
Reproducibility is the ability to duplicate the results of a study using the same materials and procedures as were used by the original investigators (Bollen et al., 2015). In data work and coding, this translates to computational reproducibility: the ability to reproduce outputs using the same code and data inputs.
A set of files with the code and data needed to reproduce the results of a study is called a reproducibility package. Achieving computational reproducibility with a reproducibility package might seem straightforward, but it can be challenging. Beyond using the same code and data as the original authors, four conditions are usually enough to achieve computational reproducibility:
- Documentation: Including enough instructions for replicators unfamiliar with the files to run the code.
- Version of software: Using the same version of the software or programming language.
- Version of dependencies: Using the same version of external packages or user-written packages.
- Seed for random number generation: Using the same seed for random number generation so that random processes in data work are reproduced.
Documentation
Version of software
Using the same version of the software or programming language as the original authors of a study is not strictly necessary for reproducibility, as releases that are not too far apart often produce the same results. However, for reproducibility to last over longer periods, authors should at least record in their code documentation the version used to obtain their results.
Some software and programming languages allow code to be interpreted as it would be under a specific release. This is the case for Stata, where the command version tells Stata to interpret the code as a specific version of Stata would, which facilitates reproducibility.
Other tools, such as conda, allow users to easily export metadata files that record the version of the programming language used, for example R or Python. Replicators can then import this metadata to create a programming environment that uses the same version of the language as the original authors.
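As an illustration of recording the version used, the following is a minimal Python sketch rather than a prescribed method. It collects the interpreter version so it can be written into the documentation, and checks whether a replicator is running the version the authors recorded. The names EXPECTED_VERSION, describe_environment, and matches_expected are hypothetical and introduced only for this example.

```python
import platform
import sys

# Hypothetical: the major and minor Python version that the original
# authors recorded in their documentation.
EXPECTED_VERSION = (3, 11)


def describe_environment():
    """Return details about the running interpreter for the documentation."""
    return {
        "python_version": platform.python_version(),
        "implementation": platform.python_implementation(),
        "platform": platform.platform(),
    }


def matches_expected(expected=EXPECTED_VERSION, actual=None):
    """Return True when the major and minor versions match the recorded ones."""
    if actual is None:
        actual = sys.version_info[:2]
    return tuple(actual[:2]) == tuple(expected)


# Print the details that authors could copy into their documentation.
for key, value in describe_environment().items():
    print(f"{key}: {value}")

# Warn, rather than stop, because nearby releases often give the same results.
if not matches_expected():
    print("Warning: Python version differs from the one recorded by the authors.")
```

The check only warns instead of stopping, in line with the point above that releases that are not too far apart often produce the same results.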
Version of dependencies
External dependencies (for example, commands installed from the SSC repository for Stata or from CRAN for R) can become deprecated or unavailable, or can be updated or corrected in ways that are not compatible with their previous versions. This can cause code errors or reproducibility issues when replicators use versions that differ from those used by the original authors.
(explain for Stata)
(explain for R)
(explain for Python)
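For Python, one possible way to record the versions of external packages is sketched below. It is a minimal example under stated assumptions, not the only approach: it uses the standard-library module importlib.metadata to look up installed versions and formats them as pinned name==version lines. The function names installed_versions and as_requirements, and the package names in the usage line, are hypothetical choices for this illustration.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each package name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The package is not installed in the current environment.
            versions[name] = None
    return versions


def as_requirements(versions):
    """Format the installed packages as pinned 'name==version' lines."""
    return "\n".join(
        f"{name}=={version}"
        for name, version in sorted(versions.items())
        if version is not None
    )


# Hypothetical list of the external packages a project depends on.
print(as_requirements(installed_versions(["numpy", "pandas"])))
```

Saving the printed lines alongside the code gives replicators a record of which versions were used, so they can install the same ones before running the package.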
Seed for random number generation
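Using the same seed means that random processes in data work, such as drawing a random sample, are reproduced. The following minimal Python sketch illustrates the idea with the standard-library random module. The seed value and the function name draw_sample are hypothetical and chosen only for this example.

```python
import random

# Hypothetical seed; authors would record the value they actually used.
SEED = 20150501


def draw_sample(population, k, seed):
    """Draw k items without replacement using a generator fixed by the seed."""
    # A dedicated generator avoids depending on the global random state.
    rng = random.Random(seed)
    return rng.sample(population, k)


households = list(range(1, 101))

first_run = draw_sample(households, 5, SEED)
second_run = draw_sample(households, 5, SEED)

# Both runs select the same households because the seed is the same.
print(first_run == second_run)
```

Even with a fixed seed, the output of some random functions may change between versions of a language or package, which is one more reason to record the versions used.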
Other factors affecting reproducibility
Additional Resources
- Bollen et al. (2015), Social, Behavioral, and Economic Sciences Perspectives on Robust and Reliable Science