Revision as of 15:26, 20 April 2020
Reproducible research is the system of documenting and publishing results from a given research study. At the very least, reproducibility allows other researchers to analyze the same data to get the same results as the original study, which strengthens the conclusions of the original study. Reproducible research is based on the idea that the path to research findings is just as important as the findings themselves. There are four key elements of reproducible research: code publication, data publication, data documentation, and output reproducibility.
Read First
- DIME Analytics has created the DIME Research Reproducibility Standards.
- DIME Analytics also conducted a bootcamp on reproducible research, which covers the various aspects of reproducibility.
- Well-written master do-files are critical to transparent, reproducible research.
- GitHub repositories play a major role in making research reproducible.
- Specialized text editing and collaboration tools ensure that output is reproducible.
Overview
GitHub repositories allow researchers to track changes to code in different programming languages, create messages explaining the changes, make code publicly available, and let other researchers read and replicate it.
Replication is a process where different scientists run the same experiment independently in different samples and find similar conclusions.
In empirical research, replication allows researchers to validate the findings of a particular study. That standard is not always feasible in development research. More often than not, the phenomena we analyze cannot be artificially re-created. Even in the case of field experiments, different populations can respond differently to a treatment – and the costs involved are high.
Even in such cases, however, we should still require reproducibility: when different researchers run the same analysis on the same data, they should find the same results. That may seem obvious, but unfortunately, it is not as widely observed as we would like.
This page provides guidelines on four key elements of reproducibility in research: reproducible code publication, data publication, data documentation, and output reproducibility.
Reproducible Code Publication
Reproducible research requires that others have access to your code and analytical processes and can execute them identically. With careful coding, use of master do-files, and adherence to protocols, the same data and code will yield the same results for anyone who runs them. To ensure this is the case, your master do-file should set the Stata seed and version for replicable sampling and randomization; install all necessary commands and packages; specify settings; sort observations frequently; and use globals for the root folder, project folders, and units and assumptions. By nature, the master do-file runs the project do-files in a pre-specified order, which strengthens reproducibility. If you use different languages or software packages in the same project, consider using a shell script to ensure that other users run them in the correct order.
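The elements listed above can be sketched in a minimal master do-file. This is only an illustration: the file names, folder paths, and globals are hypothetical, and the packages installed will depend on your project.

```stata
* master.do - runs the full project pipeline in a fixed, reproducible order
version 15.1          // lock the Stata version so behavior does not drift
set seed 20200420     // fix the seed so sampling and randomization replicate
clear all

* Install the user-written commands the project depends on
ssc install estout, replace

* Globals for the root and project folders (only the root changes per machine)
global root     "C:/Projects/my_study"
global data     "$root/data"
global do       "$root/dofiles"
global results  "$root/results"

* Run project do-files in a pre-specified order
do "$do/1_cleaning.do"
do "$do/2_construction.do"
do "$do/3_analysis.do"
```

Because every path is built from the single `$root` global, another researcher only needs to edit one line before running the entire project.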
Code should not only be available and reproducible, but also understandable: if someone else runs your code and reproduces the results without understanding what the code did, then your research is not transparent. Comment code frequently to highlight what it is doing and why. For example, if the code drops observations or changes values, explain why through comments.
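For example, a commented cleaning step in Stata might look like the following sketch (the variable names and the stated reason are illustrative):

```stata
* Drop duplicate survey submissions, keeping one entry per household-date:
* the field team re-uploaded some interviews after a sync error, so exact
* duplicates are not genuine observations.
duplicates drop hhid survey_date, force
```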
Software for Reproducible Code Publication
Git is a free version-control system. Users store files in Git repositories, most commonly on GitHub. Within repositories, users can track changes to code in different programming languages and create messages explaining why changes were made. Sharing Git repositories makes code publicly available and allows other researchers to read and replicate it. To learn how to use GitHub, refer to GitHub Services’ introductory training and GitHub Guides’ tutorials. Jupyter Notebook is another code-sharing platform on which researchers can create and share code in different programming languages, including Python, R, Julia, and Scala.
Data Publication
Since code cannot be executed without the data it uses, reproducible research requires that researchers publish their data. Ideally, researchers provide enough data for others to reproduce every step of their code, from cleaning to analysis. However, this is not always feasible when the data contain personally identifiable or confidential information.
Data Documentation
Data documentation outlines all aspects of the data work that may affect or inform the analysis and results, and documenting these aspects is essential for reproducibility and transparency. Accordingly, during data collection, data cleaning, variable construction, and data analysis, compile documentation in code comments and, ideally, in one consolidated file or folder. The structure of this file or folder will vary from one project to another.
Note that when you submit code for code review or deposit data in a microdata catalog, reviewers will assess your data documentation. While they may provide feedback, remember that positive comments on your documentation do not guarantee it is free of problems, since reviewers cannot ask about issues they are unaware of. For more details, see Data Documentation.
Output Reproducibility
Dynamic documents allow researchers to write papers and reports that automatically import or display results. This reduces the manual work involved between analyzing data and publishing the output of that analysis, leaving less room for error and manipulation.
Different software allows for different degrees of automation. R Markdown, for example, allows users to write text and code simultaneously, running analyses in different programming languages and printing the results in the final document alongside the text. Stata 15's dyndoc command creates similar documents; the output is a file, usually a PDF or HTML file, that contains text, tables, and graphs. With this kind of document, whenever researchers update the data or change the analysis, they only need to run one file to generate a new final paper or report. No copy-pasting or manual changes are necessary, which improves reproducibility.
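As a sketch, the source file for a Stata dynamic document mixes Markdown text with dynamic tags that execute Stata code (the data set, variable, and file names here are hypothetical):

```stata
<<dd_version: 2>>
# Study results

<<dd_do: quietly>>
use "$data/analysis.dta", clear
summarize outcome
<</dd_do>>

The mean of the outcome is <<dd_display: %4.2f r(mean)>>.
```

Running `dyndoc report.txt, replace` in Stata 15 converts this file to HTML; re-running that one command after the data change regenerates the report with updated numbers.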
LaTeX is another tool widely used in the scientific community. It is a typesetting system that lets users reference code outputs such as tables and graphs so that they are easily updated in a text document. After you analyze the data in your preferred software, you can export the results to TeX format – R's stargazer package is commonly used for this, and Stata has several options, such as esttab and outreg2. The LaTeX document then imports these outputs. Whenever results are updated, simply recompile the LaTeX document with the press of a button to integrate the new graphs and tables. If you wish to use TeX collaboratively, Overleaf is a web-based platform that facilitates TeX collaboration, and Jupyter Notebook can create dynamic documents in HTML, LaTeX, and other formats.
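A minimal LaTeX document following this workflow might look like the sketch below; the file names under `results/` are illustrative, standing in for whatever your analysis scripts export.

```latex
\documentclass{article}
\usepackage{graphicx}
\usepackage{booktabs}
\begin{document}

\section{Results}

% Table exported in TeX format by esttab, outreg2, or stargazer;
% recompiling the document picks up any updated version of the file
\input{results/main_regression.tex}

% Figure exported by the analysis script
\includegraphics[width=\textwidth]{results/outcome_trends.pdf}

\end{document}
```

Because the tables and figures are imported by reference rather than pasted in, recompiling after each new analysis run is the only manual step.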
Additional Resources
- DIME Analytics’ Data Management and Cleaning
- DIME Analytics’ Coding for Reproducible Research
- DIME Analytics’ Intro to GitHub
- DIME Analytics’ guides 1 and 2 to Using Git and GitHub
- DIME Analytics’ Maintaining a GitHub Repository
- DIME Analytics’ Initializing and Synchronizing a Git Repo with GitHub Desktop
- DIME Analytics’ Using Git Flow to Manage Code Projects with GitKraken
- DIME Analytics’ Fundamentals of Scientific Computing with Stata and guides 1, 2, and 3 to Stata Coding for Reproducible Research
- Open Science Framework, a web-based project management platform that combines registration, data storage (through Dropbox, Box, Google Drive and other platforms), code version control (through GitHub) and document composition (through Overleaf).
- Data Colada’s 8 tips to make open research more findable and understandable
- The Abdul Latif Jameel Poverty Action Lab (J-PAL)’s resources on transparency and reproducibility
- Innovations for Poverty Action (IPA)’s Reproducible Research: Best Practices for Data and Code Management and Guidelines for data publication
- Randomized Control Trials in the Social Science Dataverse
- Center for Open Science’s Transparency and Openness Guidelines, summarized in a 1-Page Handout
- Berkeley Initiative for Transparency in the Social Sciences (BITSS)’ Manual of Best Practices in Transparent Social Science Research
- Coursera courses for R (Johns Hopkins’ Online Course on Reproducible Research) and Stata (Incorporating Stata into Reproducible Documents)
- Matthew Salganik's Open and Reproducible Research: Goals, Obstacles and Solutions