<onlyinclude>Reproducible research is research conducted and documented in a manner in which different researchers can run the same analysis with the same data and find the same results. The concept of reproducible research rests largely upon the idea that the path to research findings is an output just as important as the findings themselves. Accordingly, to conduct and confirm reproducible research, researchers should make available to the public not only their results, but also their data, code, and documentation. This page provides guidelines on four key elements of reproducible research: reproducible code publication, data publication, data documentation, and output reproducibility. It is important to keep these elements in mind throughout all stages of research.</onlyinclude>


==Read First==
*Well-written [[Master Do-files | master do-files]] are critical to transparent, reproducible research.
*[[Getting started with GitHub | GitHub]] repositories allow researchers to track changes to code in different programming languages, write messages explaining those changes, and make code publicly available so that other researchers can read and replicate it.
*[[Data Documentation | Data documentation]] outlines all aspects of the data work that may affect or inform the analysis and results.
*Tools like LaTeX and Overleaf help make output reproducible.


==Overview==
In most scientific fields, results are validated through '''replication''': different scientists run the same experiment independently on different samples and reach similar conclusions. That standard is not always feasible in development research. More often than not, the phenomena we analyze cannot be artificially re-created. Even in the case of field experiments, different populations can respond differently to a treatment – and the costs involved are high.


Even in such cases, however, we should still require '''reproducibility''': different researchers who run the same analysis with the same data should find the same results. That may seem obvious, but unfortunately it is not as widely observed as we would like.


This page provides guidelines on four key elements of reproducibility in research: reproducible code publication, data publication, data documentation, and output reproducibility.


== Reproducible Code Publication ==
Reproducible research requires that others have access to your code and can execute it identically. With careful coding, use of [[Master Do-files | master do-files]], and adherence to protocols, the same data and code will yield the same results for anyone who runs them. To ensure this is the case, your [[Master Do-files|master do-file]] should set the Stata seed and version for [[Randomization in Stata |replicable sampling and randomization]]; install all necessary commands and packages; specify settings; sort observations frequently; and use globals for the root folder, project folders, and units and assumptions. By nature, the [[Master Do-files|master do-file]] runs project do-files in a pre-specified order, which strengthens reproducibility. If you use different languages or software in the same project, consider using a shell script to ensure that other users run them in the correct order.
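For concreteness, here is a minimal sketch of what such a master do-file might look like. The seed value, folder paths, globals, and do-file names are purely illustrative, and the <code>estout</code> package stands in for whatever user-written commands your project actually needs.

<syntaxhighlight lang="stata">
* Master do-file (sketch) -- seed, paths, and do-file names are illustrative

  version 15           // fix the Stata version so commands behave identically
  set seed 510742      // fix the seed so sampling and randomization replicate
  set more off

* Install user-written commands the project relies on
  ssc install estout, replace

* Globals for the root folder, project folders, and assumptions
  global root      "C:/Users/username/MyProject"
  global dofiles   "${root}/dofiles"
  global data      "${root}/data"
  global results   "${root}/results"
  global usd_rate  0.85    // example: exchange-rate assumption used throughout

* Run project do-files in a pre-specified order
  do "${dofiles}/1-cleaning.do"
  do "${dofiles}/2-construction.do"
  do "${dofiles}/3-analysis.do"
</syntaxhighlight>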


Code should not only be available and reproducible, but also understandable: if someone else runs your code and reproduces the results without understanding what the code did, then your research is not [[Research Ethics#Research Transparency|transparent]]. Comment code frequently to highlight what it is doing and why. For example, if the code drops observations or changes values, explain why through comments.  
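As a brief illustration, comments like the following record both what the code does and why; the variable names and the rationale here are hypothetical:

<syntaxhighlight lang="stata">
* Drop pilot households: they answered an earlier version of the questionnaire,
* so their responses are not comparable to the main sample
  drop if pilot == 1

* Recode "don't know" (coded -99 by the survey software) to missing
* so it does not distort means and regressions
  replace income = . if income == -99
</syntaxhighlight>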


===Software for Reproducible Code Publication===
[[Getting started with GitHub | Git]] is a free version-control system. Users store files in Git repositories, most commonly on [https://github.com/ GitHub]. Within repositories, users can track changes to code in different programming languages and write messages explaining [[Data Documentation | why changes were made]]. Sharing Git repositories makes code publicly available and allows other researchers to read and replicate your code. To learn how to use GitHub, refer to GitHub Services’ [https://services.github.com/on-demand/intro-to-github/ introductory training] and [https://guides.github.com/ GitHub Guides’] tutorials. [http://jupyter.org/ Jupyter Notebook] is another code-sharing platform on which researchers can create and share code in different programming languages, including Python, R, Julia, and Scala.


== Data Publication ==
Since others need the data to execute your code, reproducible research also requires that researchers [[Publishing Data | publish their data]]. Ideally, researchers provide enough data for others to reproduce every step of the code, from [[Data Cleaning| cleaning]] to [[Data Analysis| analysis]]. However, this is not always feasible, as the data may contain [[De-identification#Personally Identifiable Information|personally identifiable]] or confidential information.
==Data Documentation==
For reproducibility and transparency, it is important to document all aspects of the data work that may affect or inform the analysis and results. Accordingly, during [[Primary Data Collection | data collection]], [[Data Cleaning | data cleaning]], variable construction, and [[Data Analysis | data analysis]], compile [[Data Documentation | data work documentation]] in code comments and, ideally, in one consolidated file or folder. The structure of this file or folder will vary from one project to another.  


Note that when you submit code for code review or deposit data in the [[Microdata Catalog | microdata catalog]], the reviewers will examine your data documentation. While they may provide feedback, remember that positive comments do not guarantee that the documentation is problem-free, since reviewers cannot ask about issues they are unaware of. For more details, see [[Data Documentation]].


== Output Reproducibility ==
Dynamic documents allow researchers to write papers and reports that automatically import or display results. This reduces the manual work between analyzing data and publishing the output of that analysis, leaving less room for error and manipulation.


Different software allows for different degrees of automatization. [https://rmarkdown.rstudio.com/ R Markdown], for example, allows users to write text and code in the same document, running analyses in different programming languages and printing the results in the final document alongside the text. Stata 15's [https://www.stata.com/manuals/pdyndoc.pdf dyndoc] command creates similar documents; the output is a file, usually HTML or PDF, that contains text, tables, and graphs. With this kind of document, whenever researchers update the data or change the analysis, they only need to run one file to generate a new final paper or report. No copy-pasting or manual changes are necessary, which improves reproducibility.
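As a minimal sketch, a dyndoc source file mixes Markdown text with tagged Stata code; the file name and the regression below are illustrative, using Stata's built-in auto dataset as a stand-in for project data:

<syntaxhighlight lang="stata">
<<dd_version: 1>>
# Fuel efficiency and price

The table below is re-estimated every time the document is compiled.

<<dd_do>>
sysuse auto, clear        // built-in example dataset
regress price mpg         // output is embedded in the final document
<</dd_do>>
</syntaxhighlight>

Saving this text as <code>report.txt</code> and running <code>dyndoc report.txt, replace</code> in Stata 15 produces an HTML file with the regression output embedded at the tagged location.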


LaTeX is another tool widely used in the scientific community. It is a typesetting system that lets users reference code outputs such as tables and graphs so they can be easily updated in a text document. After you analyze the data in your preferred software, you export the results into TeX format – [https://cran.r-project.org/web/packages/stargazer/vignettes/stargazer.pdf R's stargazer] is commonly used for this, and Stata has several options, such as [http://repec.org/bocode/e/estout/esttab.html <code>esttab</code>] and [http://repec.org/bocode/o/outreg2.html <code>outreg2</code>] – and then write a LaTeX document that loads these outputs. Whenever results are updated, simply recompile the LaTeX document to pull in the new tables and graphs. Should you wish to use TeX collaboratively, [https://www.overleaf.com/ Overleaf] is a web-based platform that facilitates TeX collaboration, and [http://jupyter.org/ Jupyter Notebook] can create dynamic documents in HTML, LaTeX, and other formats.
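As a sketch of this workflow in Stata (the file names and the regression are illustrative, and <code>esttab</code> requires the user-written <code>estout</code> package):

<syntaxhighlight lang="stata">
* Export a regression table to TeX -- file names are illustrative
  ssc install estout, replace       // provides eststo and esttab
  sysuse auto, clear                // built-in example dataset as a stand-in

  eststo clear
  eststo: regress price mpg weight
  esttab using "results/price_reg.tex", se label replace

* In the LaTeX document, load the exported table with:
*   \input{results/price_reg.tex}
* Recompiling the .tex file then picks up the updated table automatically.
</syntaxhighlight>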
 


== Additional Resources ==
*DIME Analytics’ [https://github.com/worldbank/DIME-Resources/blob/master/git-2-github.pdf Introduction to Git and GitHub] and guides to [https://github.com/worldbank/DIME-Resources/blob/master/git-3-flow.pdf Using Git and GitHub], [https://github.com/worldbank/DIME-Resources/blob/master/git-4-management.pdf Maintaining a GitHub Repository], and [https://github.com/worldbank/DIME-Resources/blob/master/onboarding-3-git.pdf Initializing and Synchronizing a Git Repo with GitHub Desktop]
*DIME Analytics’ [https://github.com/worldbank/DIME-Resources/blob/master/onboarding-7-computing.pdf Fundamentals of Scientific Computing with Stata] and guides [https://github.com/worldbank/DIME-Resources/blob/master/stata1-2-coding.pdf 1], [https://github.com/worldbank/DIME-Resources/blob/master/onboarding-2-coding.pdf 2], and [https://github.com/worldbank/DIME-Resources/blob/master/stata2-2-coding.pdf 3] to Stata Coding for Reproducible Research
*[https://osf.io/ Open Science Framework], a web-based project management platform that combines registration, data storage (through Dropbox, Box, Google Drive and other platforms), code version control (through GitHub) and document composition (through Overleaf).
*[http://datacolada.org/69 Data Colada’s] 8 tips to make open research more findable and understandable
*The Abul Latif Jameel Poverty Action Lab (JPAL)’s [https://www.povertyactionlab.org/research-resources/transparency-and-reproducibility resources] on transparency and reproducibility
*Innovations for Policy Action (IPA)’s [http://www.poverty-action.org/sites/default/files/publications/IPA%27s%20Best%20Practices%20for%20Data%20and%20Code%20Management_Nov2015.pdf Reproducible Research: Best Practices for Data and Code Management] and [http://www.poverty-action.org/sites/default/files/Guidelines-for-data-publication.pdf Guidelines for data publication]
*[https://dataverse.harvard.edu/dataverse/socialsciencercts Randomized Control Trials in the Social Science Dataverse]
*Center for Open Science’s [https://cos.io/our-services/top-guidelines/ Transparency and Openness Guidelines], summarized in a [https://osf.io/pvf56/?_ga=1.225140506.1057649246.1484691980 1-Page Handout]
*Berkeley Initiative for Transparency in the Social Sciences (BITSS)’ [http://www.bitss.org/education/manual-of-best-practices/ Manual of Best Practices in Transparent Social Science Research]
*Johns Hopkins’ [https://www.coursera.org/learn/reproducible-research Coursera course on reproducible research] (in R) and StataCorp’s [https://huapeng01016.github.io/reptalk/#/hua-pengstatacorphpeng Incorporating Stata into Reproducible Documents]
*Matthew Salganik's [http://www.princeton.edu/~mjs3/open_and_reproducible_opr_2017.pdf Open and Reproducible Research: Goals, Obstacles and Solutions]


==Back to Parent==
This article is part of the topic [[Reproducible Research]]


[[Category: Reproducible Research]]
