Difference between revisions of "Reproducible Research"

Tag: Manual revert
 
(103 intermediate revisions by 5 users not shown)
'''Reproducible research''' is the system of [[Data Documentation|documenting]] and [[Publishing Data|publishing]] the results of a research study, such as an '''impact evaluation'''. At the very least, '''reproducibility''' allows other researchers to [[Data Analysis|analyze]] the same data and get the same results as the original study, which strengthens its conclusions. It is important to push researchers towards publishing '''reproducible research''' because the path to research findings is just as important as the findings themselves.
==Read First==
* [https://www.worldbank.org/en/research/dime/data-and-analytics DIME Analytics] has created the [https://github.com/worldbank/dime-standards/tree/master/dime-research-standards/pillar-3-research-reproducibility DIME Research Reproducibility Standards].
* [https://www.worldbank.org/en/research/dime/data-and-analytics DIME Analytics] has created the [https://worldbank.github.io/wb-reproducible-research-repository/ Reproducible Research Repository resources].
* [https://osf.io/wzjtk/ DIME Analytics] has also conducted a [https://osf.io/csmxz/ bootcamp on reproducible research], which covers the various aspects of '''reproducibility'''.
* Well-written [[Master Do-files | master do-files]] are critical to transparent, '''reproducible research'''.
* [[Getting started with GitHub | GitHub repositories]] play a major role in making research reproducible.
* Specialized [[Software Tools#Text Editing Software|text editing]] and [[Collaboration Tools#Paper Writing|collaboration tools]] ensure that output is reproducible.
== Replication and Reproducibility ==
'''Replication''' is a process where different researchers conduct the same study independently in different samples and find similar conclusions. It adds more validity to the conclusions of an '''empirical''' study. However, in most field experiments, the [[Impact Evaluation Team|research team]] cannot create the same conditions for replication. Different populations can respond differently to the same '''treatment''', and replication is often too expensive.  
In such cases, the researchers should still try to achieve '''reproducibility'''. There are four key elements of '''reproducible research''': [[Data Documentation|data documentation]], [[Publishing Data#Preparing for Release|data publication]], [[Publishing Data#Preparing for Release|code publication]], and [[Reproducible Research#Output Publication|output publication]].


==Data Documentation==
[[Data Documentation | Data documentation]] deals with all aspects of an '''impact evaluation''': [[Sampling | sampling]], [[Primary Data Collection | data collection]], [[Data Cleaning | cleaning]], and [[Data Analysis | analysis]]. Proper documentation not only produces reproducible [[Publishing Data| data for publication]] in the future, but also ensures [[Data Quality Assurance Plan| high quality data]] in the present. For example, a [[Impact Evaluation Team#Field Coordinators (FCs) | field coordinator (FC)]] may notice that some [[Survey Pilot Participants|respondents]] do not understand a questionnaire because of reading difficulties. If the '''field coordinator (FC)''' does not document this issue, the [[Impact Evaluation Team#Research Assistant | research assistant]] will not flag these observations during [[Data Cleaning | data cleaning]]. And if the [[Impact Evaluation Team#Research Assistant | research assistant]] does not document why the observations were flagged, and what the flag means, it will affect the results of the [[Data Analysis | analysis]].
=== Guidelines ===
Accordingly, in the lead-up to and during [[Primary Data Collection | data collection]], the [[Impact Evaluation Team|research team]] should follow these guidelines for '''data documentation'''.
* '''Comments.''' Use comments in your code to document the reasons for a particular line or group of commands. In [[Stata Coding Practices|Stata]], for instance, use <code>*</code> to insert comments.  
* '''Folders.''' Create separate folders to store all documentation related to the project in separate files. For example, in [https://github.com/ GitHub], the research team can store notes about each folder and its contents under [https://guides.github.com/features/wikis/ README.md].
* '''Consult data collection teams.''' Throughout the process of [[Data Cleaning|data cleaning]], take extensive inputs from the people who are responsible for collecting data. This could be a field team, a government ministry responsible for [[Administrative_and_Monitoring_Data#Administrative Data|administrative data]], or a technology firm that handles [[Remote_Sensing|remote sensing]] data.
* '''Exploratory analysis.''' While '''cleaning''' the data set, look for issues such as '''outliers''', and [[Monitoring Data Quality#High Frequency Checks|data entry errors]] like missing or duplicate values. Record these observations for use during the process of [[Data Documentation#What to Document|variable construction]] and [[Data Analysis|analysis]].
* '''Feedback.''' When researchers submit code for review, or release data on a public platform (such as the [[Microdata Catalog]]), others may provide feedback, either positive or negative. It is important to document these comments as well, as this can improve the quality of the results of the '''impact evaluation'''.
* '''Corrections.''' Include records of any corrections made to the data, as well as to the code. For example, based on feedback, the research team may realize that they forgot to drop duplicated entries. Publish these corrections in the '''documentation folder''', along with the communications where these issues were reported.
* '''Confidential information.''' The research team must be careful not to include confidential information, or any information that is not securely stored.
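
The '''Corrections''' guideline above can be sketched in code. The example below is a minimal illustration in Python, not a DIME tool: the function name <code>drop_duplicates_with_log</code>, the <code>hhid</code> variable, and the sample rows are all hypothetical. It drops duplicated entries and, for each dropped row, records what was done and why, so the correction itself becomes part of the documentation.

```python
import json


def drop_duplicates_with_log(rows, id_column, reason):
    """Keep the first row per ID and record every dropped row in a log."""
    seen = set()
    kept, log = [], []
    for position, row in enumerate(rows):
        key = row[id_column]
        if key in seen:
            # Document the correction instead of silently dropping the row.
            log.append({
                "action": "dropped duplicate",
                "id": key,
                "row_position": position,
                "reason": reason,
            })
        else:
            seen.add(key)
            kept.append(row)
    return kept, log


# Made-up example data; "hhid" is a hypothetical household ID variable.
survey = [
    {"hhid": "HH001", "income": 120},
    {"hhid": "HH002", "income": 95},
    {"hhid": "HH001", "income": 120},  # entered twice by mistake
]

clean, corrections = drop_duplicates_with_log(
    survey, "hhid", reason="duplicate entry reported during code review"
)

# In a real project this text would be saved in the documentation folder;
# here it is only printed.
print(json.dumps(corrections, indent=2))
```

Keeping the log as structured records rather than free text is one possible design choice: it makes it easy to publish the list of corrections alongside the communications where the issues were reported.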
=== Documentation tools ===
There are various tools available for '''data documentation'''. [https://github.com/ GitHub] and [https://osf.io/ Open Science Framework (OSF)] are two such tools.  
* '''The Open Science Framework (OSF).''' It supports documentation by allowing users to store files and version histories, and collaborate using [https://osf.io/4znzp/wiki/home/ OSF Wiki pages].  
* '''GitHub.''' This is a useful tool for managing tasks and responsibilities across the research team. Like '''OSF''', '''Git''' also stores every version of every file. It supports documentation through [https://guides.github.com/features/wikis/#creating-your-wiki Wiki pages] and [https://guides.github.com/features/wikis/#creating-a-readme README.md].


== Data Publication ==
[[Publishing Data|Data publication]] is the public release of all data once the process of [[Primary Data Collection | data collection]] and [[Data Analysis | analysis]] is complete. '''Data publication''' must be accompanied by proper [[Data Documentation|data documentation]]. Ideally, the [[Impact Evaluation Team|research team]] should publish all data that is needed for others to reproduce every step of the original code, from [[Data Cleaning| cleaning]] to [[Data Analysis| analysis]]. However, this may not always be feasible, since data often contains [[Personally Identifiable Information (PII)|personally identifiable information (PII)]] and other confidential information.
=== Guidelines ===  
The '''research team''' must keep the following things in mind to ensure that the data is well-organized before publishing:
* '''Cleaning.''' Ensure that the data has been [[Data Cleaning | cleaned]] and is [[Data_Cleaning#Applying Labels | well-labelled]].
* '''Missing variables.''' Make sure the data contains all variables used during [[Data Analysis | data analysis]], and includes uniquely [[ID Variable Properties | identifying variables]].  
* '''De-identification.''' Careful [[De-identification | de-identification]] is important to maintain the privacy of respondents and to meet [[Research Ethics|research ethics standards]]. The '''research team''' must carefully de-identify any sensitive or '''personally-identifying information (PII)''' such as names, locations, or financial records before release.
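
As a rough illustration of the last two guidelines, the Python sketch below checks that an ID variable uniquely identifies observations and then removes direct identifiers before release. The column names (<code>hhid</code>, <code>name</code>, <code>phone</code>) and the data are invented for the example. Dropping listed columns is only a first step: it does not by itself guarantee that respondents cannot be re-identified.

```python
# Hypothetical column names; a real project would list its own PII fields.
PII_COLUMNS = {"name", "phone", "gps_latitude", "gps_longitude"}
ID_COLUMN = "hhid"


def check_unique_ids(rows, id_column=ID_COLUMN):
    """Raise an error if the ID variable is missing or not unique."""
    ids = [row.get(id_column) for row in rows]
    if any(value in (None, "") for value in ids):
        raise ValueError(f"{id_column} has missing values")
    if len(ids) != len(set(ids)):
        raise ValueError(f"{id_column} does not uniquely identify observations")


def deidentify(rows, pii_columns=PII_COLUMNS):
    """Return copies of the rows without the listed PII columns."""
    return [
        {key: value for key, value in row.items() if key not in pii_columns}
        for row in rows
    ]


# Made-up example data.
raw = [
    {"hhid": "HH001", "name": "A. Respondent", "phone": "555-0101", "income": 120},
    {"hhid": "HH002", "name": "B. Respondent", "phone": "555-0102", "income": 95},
]

check_unique_ids(raw)      # fails loudly if the ID variable is not unique
public = deidentify(raw)   # only non-identifying variables remain
print(public)
```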
[https://www.worldbank.org/en/research/dime/data-and-analytics DIME Analytics] has developed the following resources to help researchers store and organize data for public release.
* '''Iefieldkit.''' <code>[[iefieldkit]]</code> is a Stata package which allows the research team to follow '''best practices''' for [[Data Cleaning|data cleaning]].
* '''Ietoolkit.''' [https://worldbank.github.io/ietoolkit/ <code>ietoolkit</code>] is a Stata package which simplifies the process of [[Data Management|data management]] and [[Data Analysis|analysis]] in '''impact evaluations'''. It allows the research team to organize the raw data.
* '''Data management guidelines.''' The [https://osf.io/b7z6h/ data management guidelines] provide steps on how to organize data for [[Data Cleaning|cleaning]] and [[Data Analysis|analysis]].
* '''DataWork folder.''' The [[DataWork Folder|DataWork folder]] is a standardized folder template for organizing data in a project folder. The raw '''de-identified data''' can be stored in the [[DataWork_Survey_Round#DataSets_Folder|DataSets folder]] of the [[DataWork_Survey_Round|DataWork survey round folder]].
* '''Microdata catalog checklist.''' The [[Checklist: Microdata Catalog submission|microdata catalog checklist]] provides instructions on how to prepare data for release using the [[Microdata Catalog|Microdata catalog]] of the [https://www.worldbank.org/ World Bank]. The [https://microdata.worldbank.org/index.php/home Microdata Library] offers free access to '''microdata''' produced not only by the World Bank, but also other international organizations, statistical agencies, and government organizations.
* '''Data publication standards.''' The [https://github.com/worldbank/dime-standards/tree/master/dime-research-standards/pillar-5-data-publication DIME Data Publication Standards] provide detailed guidelines for preparing data for release.


Note that when you submit code for review or deposit data in the [[Microdata Catalog | microdata catalog]], the reviewers will review your data documentation. While they may provide feedback, remember that positive comments on your data documentation do not guarantee that there are no problems, since reviewers cannot ask about issues unknown to them. For more details, see [[Data Documentation]].
=== Data publication tools ===
There are several free software tools that allow the [[Impact Evaluation Team|research team]] to publicly release the data and the associated [[Data Documentation|documentation]], including [https://github.com/ GitHub], the [https://osf.io/ Open Science Framework], and [https://www.researchgate.net/Research ResearchGate].
Each of these platforms can handle organized directories and can provide a static '''uniform resource locator (URL)''' which makes it easy to collaborate with other users.
* '''ResearchGate.''' It allows users to assign a '''digital object identifier (DOI)''' to published work, which they can then share with external researchers for review or '''replication'''.
* '''The Open Science Framework (OSF).''' It is an online platform which allows members of a '''research team''' to store all project data, and even publish reports using [https://osf.io/preprints/ OSF preprints].
* '''DIME survey data.''' [https://www.worldbank.org/en/research/dime DIME] also publishes and releases [[DIME_Datasets_on_Microdata_Catalog| survey data]]  through the [[Microdata Catalog]]. However, access to the data may be restricted, and some variables are not allowed to be published.


== Code Publication==
'''Code publication''' is another key element of '''reproducible research'''. Sometimes academic journals ask for '''reproducible code''' (and data) along with the actual academic paper. Even if they don't, it is a good practice to share code and data with others. The [[Impact Evaluation Team|research team]] should ensure that external researchers have access to, and can execute, the same code and data that was used during the original '''impact evaluation'''. This can be made possible through proper [[Data Documentation|documentation]] and [[Data Management|management]] of data.
=== Guidelines ===
With careful coding, use of [[Master Do-files | master do-files]], and adherence to [[Stata Coding Practices|coding best practices]], the same data and code will yield the same results for any given person. Follow these guidelines when publishing the code:
* '''Master do-files.''' The [[Master Do-files|master do-file]] should set the Stata seed and version to allow replicable [[Sampling|sampling]] and [[Randomization in Stata|randomization]]. By nature, the '''master do-file'''  will run project do-files in a pre-specified order, which strengthens '''reproducibility'''. The '''master do-file''' can also be used to list assumptions of a study and list all data sets that are used in the study.
* '''Packages and settings.''' Install all necessary commands and packages in your '''master do-file''' itself. Specify all settings and sort observations frequently to minimize errors. [https://www.worldbank.org/en/research/dime/data-and-analytics DIME Analytics] has created two packages to help researchers in producing '''reproducible research''' - <code>[[Iefieldkit|iefieldkit]]</code> and <code>ietoolkit</code>.
* '''Globals.''' Create '''globals''' (or global macros) for the root folder and all project folders. '''Globals''' should only be specified in the '''master do-file''' and can be used for '''standardizing coefficients''' for the data set that will be used for [[Data Analysis|analysis]].
* '''Shell script.''' If you use different languages or software in the same project, consider using a '''shell script''', which ensures that other users run the different languages or software in the correct order. This means running a script from the command line that first executes a master script in one language and then executes master scripts in one or more other languages. The shell script thus becomes the "super master script" that executes the other master scripts in the correct order.
* '''Comments.''' Include '''comments''' (using <code>*</code>) in your code frequently to explain what a line of code (or a group of commands) is doing, and why. For example, if the code drops observations or changes values, explain why this was necessary using comments. This ensures that the code is also easy to understand, and that research is [[Research Ethics#Research Transparency|transparent]].
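
The guidelines above are written with Stata '''master do-files''' in mind, but the same logic applies in other languages. The sketch below is a hedged Python analogue, not a DIME template: the seed value, the function names, and the household IDs are all made up. It shows a single entry point that fixes the seed once, sorts observations before randomizing, and runs the project steps in a pre-specified order.

```python
import random

SEED = 20240117  # arbitrary illustrative value; set once and never changed


def assign_treatment(unit_ids, rng):
    """Randomly assign half of the units to treatment."""
    ordered = sorted(unit_ids)   # sort first so the input order cannot matter
    rng.shuffle(ordered)
    cutoff = len(ordered) // 2
    treated = set(ordered[:cutoff])
    return {uid: int(uid in treated) for uid in sorted(unit_ids)}


def summarize(assignment):
    """Count treated and control units."""
    n_treated = sum(assignment.values())
    return {"treated": n_treated, "control": len(assignment) - n_treated}


def main(unit_ids):
    """Run every step of the project in a fixed, pre-specified order."""
    rng = random.Random(SEED)                      # seeded, so the draw is repeatable
    assignment = assign_treatment(unit_ids, rng)   # step 1: randomization
    summary = summarize(assignment)                # step 2: analysis
    return assignment, summary


if __name__ == "__main__":
    ids = [f"HH{n:03d}" for n in range(1, 11)]  # hypothetical household IDs
    print(main(ids)[1])
```

Because the seed is fixed and the IDs are sorted before shuffling, running <code>main</code> again on the same IDs (even in a different order) reproduces the same assignment within a given Python version. This mirrors the reason the guidelines ask for the seed, the version, and the sort order to be set explicitly.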
=== Code publication tools ===
There are several free software tools that allow the [[Impact Evaluation Team|research team]] to publicly release the code, including [https://github.com/ GitHub] and [http://jupyter.org/ Jupyter Notebook]. Users can pick any of these depending on how familiar they are with these tools. There are several pre-publication code review facilities as well.
* '''GitHub.''' It is a free platform for hosting '''repositories''' managed with '''Git''', a '''version-control''' software. It is popular because users can store every version of every component of a project (like data and code) in repositories that everyone working on the project can access. [[Getting started with GitHub |With GitHub repositories]], users can track changes to code in different programming languages, and create [[Data Documentation | documentation]] explaining what changes were made and why. The '''research team''' can then simply share '''Git repositories''' with an external audience, which allows others to read and replicate the code as well as the results of an '''impact evaluation'''.
* '''Jupyter Notebook.''' This is another platform where researchers can create and share code in different programming languages, including [https://www.python.org/ Python], [https://www.r-project.org/ R], [https://julialang.org/ Julia], and [https://www.scala-lang.org/ Scala].
* [https://www.worldbank.org/en/research/dime/data-and-analytics DIME Analytics] has also created a [https://osf.io/m36kg/ sample peer code review form] that researchers can refer to before publishing their code.
To learn more about how to use these tools, users can refer to the following resources:
* [https://services.github.com/on-demand/intro-to-github/ GitHub introductory training]
* [https://guides.github.com/ GitHub guides]
* [https://jupyter.org/documentation Jupyter documentation]
* [https://blog.jupyter.org/ Jupyter blogs]


== Output Publication ==
The research output is not just a paper or report, but also includes the code, data, and documentation. '''Output publication''' is the final aspect of '''reproducible research''' after completing [[Data Documentation|documentation]] and [[Publishing Data|publication]] of data and code. The [[Impact Evaluation Team|research team]] can follow certain guidelines to ensure their research output is '''reproducible''' and transparent.
* '''Checklist.''' DIME Analytics has created a [https://osf.io/cdxnf/ pre-publication reproducibility checklist] for researchers.
* '''GitHub repos.''' [[Getting started with GitHub | GitHub repositories]] (or repos) allow researchers to track changes to the code, create messages explaining the changes, and make code publicly available for others to read and replicate.
* '''Dynamic documents.''' These are documents which allow researchers to write reports that can automatically display results after [[Data Analysis|analysis]]. This reduces the amount of manual work, and there is also less room for error and manipulation of results.
=== Publication tools ===
A wide range of tools is available for '''output publication'''. Each of them allows users to create '''dynamic documents''' and edit the reports using various programming languages like [https://www.r-project.org/ R], [https://www.stata.com/ Stata], and [https://www.python.org/ Python].
* '''R.''' This language has a feature called [https://rmarkdown.rstudio.com/ R Markdown], which allows users to perform [[Data Analysis|analysis]] using different programming languages, and print the results in the final document along with text to explain the results.
* '''Stata.''' New versions of Stata ([https://www.stata.com/stata15/ version 15] onwards) allow users to [https://www.stata.com/manuals/pdyndoc.pdf create dynamic documents]. The output is usually a PDF file, which contains text, tables and graphs. Whenever there are changes to raw data or in the analysis, the research team only needs to execute one '''do-file''' to create a new document. This improves '''reproducibility''' since users do not have to make changes manually every time.
* '''LaTeX.'''  [https://www.latex-project.org/ LaTeX] is a widely used publication tool. It is a '''typesetting system''' that allows users to reference lines of code and outputs such as tables and graphs, and easily update them in a text document. Users can export the results into '''.tex''' format after analyzing the data in their preferred software – using [https://cran.r-project.org/web/packages/stargazer/vignettes/stargazer.pdf stargazer] in '''R''', and packages like <code>[http://repec.org/bocode/e/estout/esttab.html esttab]</code> and <code>[http://repec.org/bocode/o/outreg2.html outreg2]</code> in '''Stata'''. Whenever there are new graphs and tables in the analysis, simply recompile the '''LaTeX''' document with the press of a button in order to include the new graphs and tables.
* '''Overleaf.''' [https://www.overleaf.com/ Overleaf] is a web-based platform that allows users to collaborate on '''LaTeX''', and receive feedback from other researchers.
* '''Jupyter Notebook.''' [http://jupyter.org/ Jupyter Notebook] can create '''dynamic documents''' in various formats like HTML and '''LaTeX'''.
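
The '''LaTeX''' workflow described above (export results to a '''.tex''' file, then recompile the document) can be sketched as follows. This is an illustrative Python function, not the behavior of <code>esttab</code>, <code>outreg2</code>, or stargazer: the function name, the file name <code>main_results.tex</code>, and the numbers are invented, and labels containing special LaTeX characters such as underscores would need escaping first.

```python
import tempfile
from pathlib import Path


def to_latex_table(results, caption):
    """Format {label: (estimate, std_error)} as a LaTeX table."""
    lines = [
        r"\begin{table}",
        r"\centering",
        rf"\caption{{{caption}}}",
        r"\begin{tabular}{lrr}",
        r"\hline",
        r"Variable & Estimate & Std. error \\",
        r"\hline",
    ]
    for label, (estimate, std_error) in results.items():
        lines.append(rf"{label} & {estimate:.3f} & {std_error:.3f} \\")
    lines += [r"\hline", r"\end{tabular}", r"\end{table}"]
    return "\n".join(lines) + "\n"


# Made-up numbers, used only to show the mechanics.
results = {"Treatment": (0.1234, 0.0456), "Baseline income": (0.5, 0.1)}
table = to_latex_table(results, "Main results")

# Write the table to a file that the paper would include with
# \input{main_results.tex}; a temporary folder stands in for the output folder.
out = Path(tempfile.mkdtemp()) / "main_results.tex"
out.write_text(table)
print(table)
```

When the analysis changes, rerunning the script overwrites the '''.tex''' file, and recompiling the paper picks up the new table without any copy-pasting.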


== Related Pages ==
 
[[Special:WhatLinksHere/Reproducible_Research|Click here for pages that link to this topic.]]


== Additional Resources ==
*DIME Analytics’ [https://github.com/worldbank/DIME-Resources/blob/master/stata1-3-cleaning.pdf Data Management and Cleaning]
* Berkeley Initiative for Transparency in the Social Sciences (BITSS),  [http://www.bitss.org/education/manual-of-best-practices/ Manual of Best Practices in Transparent Social Science Research]
*DIME Analytics’ [https://github.com/worldbank/DIME-Resources/blob/master/stata1-2-coding.pdf Coding for Reproducible Research]
* Berkeley Initiative for Transparency in the Social Sciences (BITSS), [https://www.bitss.org/wp-content/uploads/2015/12/Pre-Analysis-Plan-Template.pdf Pre-Analysis Plan template]
*DIME Analytics’ [https://github.com/worldbank/DIME-Resources/blob/master/git-1-intro.pdf Intro to GitHub]
* Center for Open Science, [https://cos.io/our-services/top-guidelines/ Transparency and Openness Guidelines]
*DIME Analytics’ guides [https://github.com/worldbank/DIME-Resources/blob/master/git-2-github.pdf 1] and [https://github.com/worldbank/DIME-Resources/blob/master/git-3-flow.pdf 2] to Using Git and GitHub
* Coursera, [https://www.coursera.org/learn/reproducible-research Course on Reproducible Research in R]
*DIME Analytics’ [https://github.com/worldbank/DIME-Resources/blob/master/git-4-management.pdf Maintaining a GitHub Repository]
* Dataverse (Harvard), [https://dataverse.harvard.edu/dataverse/socialsciencercts Randomized Control Trials in the Social Science Dataverse]
*DIME Analytics’ [https://github.com/worldbank/DIME-Resources/blob/master/onboarding-3-git.pdf Initializing and Synchronizing a Git Repo with GitHub Desktop]
* DIME Analytics (World Bank), [https://osf.io/4pg57 Data Management for Reproducible Research]
*DIME Analytics’ [https://github.com/worldbank/DIME-Resources/blob/master/onboarding-4-gitflow.pdf Using Git Flow to Manage Code Projects with GitKraken]
* DIME Analytics (World Bank), [https://osf.io/phz37 How To: DIME Tools and Protocols for Reproducible Research]
*DIME Analytics’ [https://github.com/worldbank/DIME-Resources/blob/master/onboarding-7-computing.pdf Fundamentals of Scientific Computing with Stata] and guides [https://github.com/worldbank/DIME-Resources/blob/master/stata1-2-coding.pdf 1], [https://github.com/worldbank/DIME-Resources/blob/master/onboarding-2-coding.pdf 2], and [https://github.com/worldbank/DIME-Resources/blob/master/stata2-2-coding.pdf 3] to Stata Coding for Reproducible Research
* DIME Analytics (World Bank), [https://osf.io/crhgm What We Have Learned From Reproducibility Checks]
*[https://osf.io/ Open Science Framework], a web-based project management platform that combines registration, data storage (through Dropbox, Box, Google Drive and other platforms), code version control (through GitHub) and document composition (through Overleaf).
* DIME Analytics (World Bank), [https://osf.io/7ft2h Stata Linter]
* DIME Analytics (World Bank), [https://worldbank.github.io/dime-data-handbook/coding.html DIME Analytics Coding Guide]
*The Abdul Latif Jameel Poverty Action Lab (J-PAL)’s [https://www.povertyactionlab.org/research-resources/transparency-and-reproducibility resources] on transparency and reproducibility
* DIME Analytics (World Bank), [https://osf.io/hzst9 Intro to Git and GitHub]
*Innovations for Poverty Action (IPA)’s [http://www.poverty-action.org/sites/default/files/publications/IPA%27s%20Best%20Practices%20for%20Data%20and%20Code%20Management_Nov2015.pdf Reproducible Research: Best Practices for Data and Code Management] and [http://www.poverty-action.org/sites/default/files/Guidelines-for-data-publication.pdf Guidelines for data publication]
* DIME Analytics (World Bank), [https://osf.io/9652n GitHub Workflow (pull request training)]
*[https://dataverse.harvard.edu/dataverse/socialsciencercts Randomized Control Trials in the Social Science Dataverse]
* DIME Analytics (World Bank), [https://osf.io/dtf4a/ Management using GitHub Repositories]
*Center for Open Science’s [https://cos.io/our-services/top-guidelines/ Transparency and Openness Guidelines], summarized in a [https://osf.io/pvf56/?_ga=1.225140506.1057649246.1484691980 1-Page Handout]
* DIME Analytics (World Bank), [https://osf.io/f3kad/ Initializing and Synchronizing a Git Repo with GitHub Desktop]
*Berkeley Initiative for Transparency in the Social Sciences (BITSS)’ [http://www.bitss.org/education/manual-of-best-practices/ Manual of Best Practices in Transparent Social Science Research]
* DIME Analytics (World Bank), [https://osf.io/szbwq/ Using Git Flow to Manage Code Projects with GitKraken]
*Coursera’s course for R, [https://www.coursera.org/learn/reproducible-research Johns Hopkins' Online Course on Reproducible Research], and Stata, [https://huapeng01016.github.io/reptalk/#/hua-pengstatacorphpeng Incorporating Stata into Reproducible Documents ]
* DIME Analytics (World Bank), [https://osf.io/36hys Basics of Programming in Stata]  
*Matthew Salganik's [http://www.princeton.edu/~mjs3/open_and_reproducible_opr_2017.pdf Open and Reproducible Research: Goals, Obstacles and Solutions]
* DIME Analytics (World Bank), [https://osf.io/nam2d Stata Markdown]  
 
* Markdown Guide [https://www.markdownguide.org/cheat-sheet/ Markdown Cheat Sheet]  
* Data Colada, [http://datacolada.org/69 Tips for making research findable and reproducible]
* Hua Peng (StataCorp), [https://huapeng01016.github.io/reptalk/#/hua-pengstatacorphpeng Incorporating Stata into Reproducible Documents]
* Innovations for Poverty Action (IPA), [http://www.poverty-action.org/sites/default/files/publications/IPA%27s%20Best%20Practices%20for%20Data%20and%20Code%20Management_Nov2015.pdf Reproducible Research: Best Practices for Data and Code Management]
* Innovations for Poverty Action (IPA), [http://www.poverty-action.org/sites/default/files/Guidelines-for-data-publication.pdf Guidelines for data publication]
* J-PAL, [https://www.povertyactionlab.org/research-resources/transparency-and-reproducibility Transparency and reproducibility]  
*Matthew Salganik (Princeton), [http://www.princeton.edu/~mjs3/open_and_reproducible_opr_2017.pdf Open and Reproducible Research: Goals, Obstacles and Solutions]
[[Category: Reproducible Research]]

Latest revision as of 18:19, 14 November 2024

Reproducible research is the system of documenting and publishing results of an impact evaluation. At the very least, reproducibility allows other researchers to analyze the same data to get the same results as the original study, which strengthens the conclusions of the original study. It is important to push researchers towards publishing reproducible research because the path to research findings is just as important as the findings themselves.

Read First

Replication and Reproducibility

Replication is a process where different researchers conduct the same study independently in different samples and reach similar conclusions. It adds more validity to the conclusions of an empirical study. However, in most field experiments, the research team cannot create the same conditions for replication. Different populations can respond differently to the same treatment, and replication is often too expensive. In such cases, the researchers should still try to achieve reproducibility. There are four key elements of reproducible research: data documentation, data publication, code publication, and output publication.

Data Documentation

Data documentation deals with all aspects of an impact evaluation: sampling, data collection, cleaning, and analysis. Proper documentation not only produces reproducible data for publication in the future, but also ensures high quality data in the present. For example, a field coordinator (FC) may notice that some respondents do not understand a questionnaire because of reading difficulties. If the FC does not document this issue, the research assistant will not flag these observations during data cleaning. And if the research assistant does not document why the observations were flagged, and what the flag means, it will affect the results of the analysis.

Guidelines

Accordingly, in the lead up to, and during data collection, the research team should follow these guidelines for data documentation.

  • Comments. Use comments in your code to document the reasons for a particular line or group of commands. In Stata, for instance, use * to insert comments.
  • Folders. Create separate folders to store all documentation related to the project in separate files. For example, in GitHub, the research team can store notes about each folder and its contents under README.md.
  • Consult data collection teams. Throughout the process of data cleaning, take extensive inputs from the people who are responsible for collecting data. This could be a field team, a government ministry responsible for administrative data, or a technology firm that handles remote sensing data.
  • Exploratory analysis. While cleaning the data set, look for issues such as outliers, and data entry errors like missing or duplicate values. Record these observations for use during the process of variable construction and analysis.
  • Feedback. When researchers submit code for review, or release data on a public platform (such as the Microdata Catalog), others may provide feedback, either positive or negative. It is important to document these comments as well, as this can improve the quality of the results of the impact evaluation.
  • Corrections. Include records of any corrections made to the data, as well as to the code. For example, based on feedback, the research team may realize that they forgot to drop duplicated entries. Publish these corrections in the documentation folder, along with the communications where these issues were reported.
  • Confidential information. The research team must be careful not to include confidential information, or any information that is not securely stored.
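The exploratory checks described in the guidelines above can be scripted so that every flag is recorded rather than remembered. The following is a minimal sketch in Python, not a DIME tool: the column names (hh_id, age, income), the sample values, and the outlier rule (more than 5 median absolute deviations from the median) are all assumptions chosen for illustration.

```python
import csv
import io
import statistics
from collections import Counter

# Hypothetical survey extract; column names and values are illustrative only.
RAW = """hh_id,age,income
101,34,1200
102,,950
103,41,1100
103,41,1100
104,29,98000
105,52,1050
"""


def explore(rows, id_col, numeric_col, k=5.0):
    """Return (id, issue) notes to store in the documentation folder."""
    notes = []

    # Duplicate identifiers.
    counts = Counter(row[id_col] for row in rows)
    for key, n in sorted(counts.items()):
        if n > 1:
            notes.append((key, f"duplicate id ({n} rows)"))

    # Missing values in any column.
    for row in rows:
        for col, val in row.items():
            if val.strip() == "":
                notes.append((row[id_col], f"missing {col}"))

    # Outliers: assumed rule, more than k median absolute deviations away.
    values = [float(r[numeric_col]) for r in rows if r[numeric_col].strip()]
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    for row in rows:
        raw = row[numeric_col].strip()
        if raw and mad > 0 and abs(float(raw) - med) > k * mad:
            notes.append((row[id_col], f"outlier {numeric_col}={raw}"))

    return notes


rows = list(csv.DictReader(io.StringIO(RAW)))
notes = explore(rows, "hh_id", "income")
for hh_id, issue in notes:
    print(hh_id, issue)
```

The notes only flag observations; deciding whether to correct, keep, or drop them remains a documented decision for the research team.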

Documentation tools

There are various tools available for data documentation. GitHub and Open Science Framework (OSF) are two such tools.

  • The Open Science Framework (OSF). It supports documentation by allowing users to store files and version histories, and collaborate using OSF Wiki pages.
  • GitHub. This is a useful tool for managing tasks and responsibilities across the research team. Like OSF, Git also stores every version of every file. It supports documentation through Wiki pages and README.md.

Data Publication

Data publication is the public release of all data once the process of data collection and analysis is complete. Data publication must be accompanied by proper data documentation. Ideally, the research team should publish all data that is needed for others to reproduce every step of the original code, from cleaning to analysis. However, this may not always be feasible, since data often contains personally identifiable information (PII) and other confidential information.
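As a rough illustration of preparing a public release, the sketch below drops direct identifiers from a data set before it is exported. It is written in Python under stated assumptions: the column names and the list of identifying columns are hypothetical, and removing direct identifiers is only a first step, since combinations of the remaining variables can still identify respondents.

```python
import csv
import io

# Hypothetical list of directly identifying columns for this example.
PII_COLUMNS = {"name", "phone", "gps_lat", "gps_lon"}

# Hypothetical cleaned data; values are made up for illustration.
rows = [
    {"hh_id": "101", "name": "A. Respondent", "phone": "555-0101",
     "gps_lat": "1.23", "gps_lon": "4.56", "treatment": "1", "income": "1200"},
    {"hh_id": "102", "name": "B. Respondent", "phone": "555-0102",
     "gps_lat": "1.24", "gps_lon": "4.57", "treatment": "0", "income": "950"},
]


def drop_pii(records, pii_columns):
    """Return copies of the records without the listed identifying columns."""
    return [{k: v for k, v in rec.items() if k not in pii_columns}
            for rec in records]


def to_csv(records):
    """Serialize the records to CSV text with a header row."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=list(records[0].keys()),
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return out.getvalue()


public_rows = drop_pii(rows, PII_COLUMNS)
public_csv = to_csv(public_rows)
print(public_csv)
```

Keeping the list of removed columns in the script itself also documents which variables were withheld from the published data.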

Guidelines

The research team must keep the following things in mind to ensure that the data is well-organized before publishing:

DIME Analytics has developed the following resources to help researchers store and organize data for public release.

Data publication tools

There are several free software tools that allow the research team to publicly release the data and the associated documentation, including GitHub, the Open Science Framework (OSF), and ResearchGate. Each of these platforms can handle organized directories and can provide a static uniform resource locator (URL) which makes it easy to collaborate with other users.

  • ResearchGate. It allows users to assign a digital object identifier (DOI) to published work, which they can then share with external researchers for review or replication.
  • The Open Science Framework (OSF). It is an online platform which allows members of a research team to store all project data, and even publish reports using OSF preprints.
  • DIME survey data. DIME also publishes and releases survey data through the Microdata Catalog. However, access to the data may be restricted, and some variables are not allowed to be published.

Code Publication

Code publication is another key element of reproducible research. Sometimes academic journals ask for reproducible code (and data) along with the actual academic paper. Even if they don't, it is a good practice to share code and data with others. The research team should ensure that external researchers have access to, and can execute, the same code and data that were used during the original impact evaluation. This can be made possible through proper documentation and management of data.

Guidelines

With careful coding, use of master do-files, and adherence to coding best practices, the same data and code will yield the same results for any given person. Follow these guidelines when publishing the code:

  • Master do-files. The master do-file should set the Stata seed and version to allow replicable sampling and randomization. By nature, the master do-file will run project do-files in a pre-specified order, which strengthens reproducibility. The master do-file can also be used to list assumptions of a study and list all data sets that are used in the study.
  • Packages and settings. Install all necessary commands and packages in your master do-file itself. Specify all settings and sort observations frequently to minimize errors. DIME Analytics has created two packages to help researchers in producing reproducible research - iefieldkit and ietoolkit.
  • Globals. Create globals (or global macros) for the root folder and all project folders. Globals should only be specified in the master do-file and can be used for standardizing coefficients for the data set that will be used for analysis.
  • Shell script. If you use different languages or software in the same project, consider using a shell script, which ensures that other users run the different languages or software in the correct order. This would mean that you have a script that you run in your command line that first executes a master script in one language and then a master script in another one or several other languages. This way the shell script becomes the "super master script" that executes several other master scripts in the correct order.
  • Comments. Include comments (using *) in your code frequently to explain what a line of code (or a group of commands) is doing, and why. For example, if the code drops observations or changes values, explain why this was necessary using comments. This ensures that the code is also easy to understand, and that research is transparent.
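The ideas behind the master do-file and the "super master script" (one fixed seed, one root path, and a fixed execution order) can be sketched in Python as follows. This is an illustration under assumptions, not a DIME template: the seed value, the step commands, and the helper names are hypothetical. In a real project each step would call a master script for one language (for example Rscript with the project's R master script, or the Stata batch-mode command for the local installation) instead of the placeholder commands used here.

```python
import random
import subprocess
import sys
from pathlib import Path

SEED = 20241114            # hypothetical seed, set once for the whole project
PROJECT_ROOT = Path(".")   # plays the role of the root-folder global

# Placeholder steps so the sketch runs anywhere; each list is one command.
STEPS = [
    [sys.executable, "-c", "print('step 1: cleaning')"],
    [sys.executable, "-c", "print('step 2: construction')"],
    [sys.executable, "-c", "print('step 3: analysis')"],
]


def assign_treatment(ids, seed):
    """Seeded randomization: the same ids and seed give the same assignment."""
    rng = random.Random(seed)
    ordered = sorted(ids)  # sort first so the input order cannot change results
    rng.shuffle(ordered)
    half = len(ordered) // 2
    return {unit: ("treatment" if pos < half else "control")
            for pos, unit in enumerate(ordered)}


def run_all(steps, cwd):
    """Run every step in order and stop at the first one that fails."""
    outputs = []
    for cmd in steps:
        result = subprocess.run(cmd, cwd=cwd, capture_output=True,
                                text=True, check=True)
        outputs.append(result.stdout.strip())
    return outputs


outputs = run_all(STEPS, PROJECT_ROOT)
print(outputs)
print(assign_treatment([3, 1, 2, 4], SEED))
```

Sorting the identifiers before shuffling mirrors the advice to sort observations: without it, the same seed could produce a different assignment whenever the input order changes.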

Code publication tools

There are several free software tools that allow the research team to publicly release the code, including GitHub and Jupyter Notebook. Users can pick any of these depending on how familiar they are with these tools. There are several pre-publication code review facilities as well.

  • GitHub. It is a free hosting platform built on the Git version-control system. It is popular because users can store every version of every component of a project (like data and code) in repositories which can be accessed by everyone working in a project. With GitHub repositories, users can track changes to code in different programming languages, and create documentation explaining what changes were made and why. The research team can then simply share Git repositories with an external audience which allows others to read and replicate the code as well as the results of an impact evaluation.
  • Jupyter Notebook. This is another platform where researchers can create and share code in different programming languages, including Python, R, Julia, and Scala.
  • DIME Analytics has also created a sample peer code review form that researchers can refer to before publishing their code.

To learn more about how to use these tools, users can refer to the following resources:

Output Publication

The research output is not just a paper or report, but also includes the code, data, and the documentation. Output publication is the final aspect of reproducible research after completing documentation and publication of data and code. The research team can follow certain guidelines to ensure their research output is reproducible and transparent.

  • Checklist. DIME Analytics has created a pre-publication reproducibility checklist for researchers.
  • GitHub repos. GitHub repositories (or repos) allow researchers to track changes to the code, create messages explaining the changes, and make code publicly available for others to read and replicate.
  • Dynamic documents. These are documents which allow researchers to write reports that can automatically display results after analysis. This reduces the amount of manual work, and there is also less room for error and manipulation of results.

Publication tools

There is a wide range of tools available for output publication. Each of them allows users to create dynamic documents and edit the reports using various programming languages like R, Stata, and Python.

  • R. This language has a feature called R Markdown, which allows users to perform analysis using different programming languages, and print the results in the final document along with text to explain the results.
  • Stata. New versions of Stata (version 15 onwards) allow users to create dynamic documents. The output is usually a PDF file, which contains text, tables and graphs. Whenever there are changes to raw data or in the analysis, the research team only needs to execute one do-file to create a new document. This improves reproducibility since users do not have to make changes manually every time.
  • LaTeX. LaTeX is a widely used publication tool. It is a typesetting system that allows users to reference lines of code and outputs such as tables and graphs, and easily update them in a text document. Users can export the results into .tex format after analyzing the data in their preferred software – using stargazer in R, and packages like esttab and outreg2 in Stata. Whenever there are new graphs and tables in the analysis, simply recompile the LaTeX document with the press of a button in order to include the new graphs and tables.
  • Overleaf. Overleaf is a web-based platform that allows users to collaborate on LaTeX, and receive feedback from other researchers.
  • Jupyter Notebook. Jupyter Notebook can create dynamic documents in various formats like HTML and LaTeX.
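The workflow of exporting results to .tex and recompiling can be illustrated with a short Python sketch. It builds a LaTeX table directly from computed results, so the numbers in the report change whenever the analysis is rerun. The group names, outcome values, and the function name latex_table are assumptions made for this example; esttab, outreg2, and stargazer play this role in Stata and R.

```python
import statistics

# Hypothetical analysis results; values are made up for illustration.
results = {
    "Control": [10.0, 12.0, 11.0],
    "Treatment": [13.0, 15.0, 14.0],
}


def latex_table(groups, caption):
    """Build a LaTeX table with the mean and sample size of each group."""
    lines = [
        r"\begin{table}[htbp]",
        r"\centering",
        rf"\caption{{{caption}}}",
        r"\begin{tabular}{lrr}",
        r"\hline",
        r"Group & Mean & N \\",
        r"\hline",
    ]
    for name, values in groups.items():
        mean = statistics.mean(values)
        lines.append(rf"{name} & {mean:.2f} & {len(values)} \\")
    lines += [r"\hline", r"\end{tabular}", r"\end{table}"]
    return "\n".join(lines)


table = latex_table(results, "Outcome by treatment arm")
print(table)
```

Saving the returned text to a .tex file and pulling it into the main document with \input means no number is ever typed by hand: rerunning the script and recompiling updates the table.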

Related Pages

Click here for pages that link to this topic.

Additional Resources