Reproducible Research

Reproducible research is the practice of documenting and publishing the results of a research study. At the very least, reproducibility allows other researchers to analyze the same data and obtain the same results as the original study, which strengthens its conclusions. Reproducible research is based on the idea that the path to research findings is just as important as the findings themselves.

Read First

Replication and Reproducibility

Replication is the process by which different researchers independently conduct the same study on different samples and reach similar conclusions. It adds validity to the conclusions of an empirical study. However, in most field experiments, the research team cannot recreate the same conditions for replication: different populations can respond differently to the same treatment, and replication is often too expensive. In such cases, researchers should still aim for reproducibility. There are four key elements of reproducible research: code publication, data publication, data documentation, and output reproducibility.

Data Documentation

Data documentation outlines all aspects of the data work that may affect or inform the analysis and results. Documenting these aspects is essential for reproducibility and transparency. Accordingly, during data collection, data cleaning, variable construction, and data analysis, compile data work documentation in code comments and, ideally, in one consolidated file or folder. The structure of this file or folder will vary from one project to another.

Note that when you submit code for code review or deposit data in the Microdata Catalog, the reviewers will also review your data documentation. While they may provide feedback, remember that a positive review does not guarantee the documentation is free of problems, since reviewers can only flag issues they are aware of. For more details, see Data Documentation.

Data Publication

Data publication is the public release of all data once the process of data collection and analysis is complete. Ideally, the research team should publish all data that is needed for others to reproduce every step of the original code, from cleaning to analysis. However, this may not always be feasible, since data often contains personally identifiable information (PII) and other confidential information.

Guidelines

The research team must keep the following things in mind to ensure that the data is well-organized before publishing (a minimal Stata sketch follows the list):

  • Clean and label. Ensure that the data has been cleaned and is well-labelled.
  • Include all variables. Make sure the data contains all variables used during data analysis, and includes uniquely identifying variables.
  • De-identify. Careful de-identification is important to maintain the privacy of respondents and to meet research ethics standards. The research team must carefully de-identify any sensitive or personally-identifying information (PII) such as names, locations, or financial records before release.
  • DataWork folder. The DataWork folder is a standardized folder template designed by DIME Analytics for organizing data in a project folder. The raw de-identified data can be stored in the DataSets folder of the DataWork survey round folder.
  • Documentation. Analysed datasets should be easily understandable to researchers trying to replicate results. Therefore, it is important that proper documentation, including variable dictionaries and survey instruments, accompanies the data release. See the Microdata Catalog Checklist for instructions on how to prepare data and documentation for primary data release.
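
A minimal Stata sketch of the cleaning, labelling, and de-identification steps above. The dataset and variable names (baseline_raw.dta, hh_id, resp_name, gps_lat, gps_lon, income) are hypothetical placeholders:

  * Prepare a dataset for publication -- illustrative sketch only
  use "baseline_raw.dta", clear

  * Clean and label: give every variable an informative label
  label variable hh_id  "Household ID (unique key)"
  label variable income "Monthly household income (local currency)"

  * Check that the ID variable uniquely identifies observations
  isid hh_id

  * De-identify: drop direct identifiers before release
  drop resp_name gps_lat gps_lon

  * Save the de-identified dataset for release
  save "baseline_deidentified.dta", replace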

Software

GitHub, the Open Science Framework, and ResearchGate, which can also assign a DOI to your work, are all platforms on which you can publish your data, documentation, and code. Any of these platforms is acceptable: they can handle structured directories and provide a stable, structured URL for your project.

DIME survey data is also published and released through the Microdata Catalog. However, access to the data may be restricted and some variables may be embargoed prior to publication.

Code Publication

Code publication is another key element of reproducible research. Sometimes academic journals ask for reproducible code (and data) along with the actual academic paper. Even if they don't, it is good academic citizenship to share code and data with others. The research team should ensure that external researchers have access to, and can execute, the same code and data that were used during the original impact evaluation.

Guidelines

With careful coding, use of master do-files, and adherence to coding best practices, the same data and code will yield the same results for any user. Follow these guidelines when publishing code (a minimal master do-file sketch follows the list):

  • Master do-files. The master do-file should set the Stata seed and version to allow replicable sampling and randomization. By nature, the master do-file runs project do-files in a pre-specified order, which strengthens reproducibility. It can also be used to list the assumptions of a study and all data sets that are used in the study.
  • Packages and settings. Install all necessary commands and packages in the master do-file itself. Specify all settings explicitly and sort observations frequently to minimize errors. DIME Analytics has created two packages to help researchers produce reproducible research: iefieldkit and ietoolkit.
  • Globals. Create globals (or global macros) for the root folder and all project folders. Globals should only be specified in the master do-file, and can also be used to store parameters, such as standardized coefficients, for the data set that will be used for analysis.
  • Shell script. If you use different languages or software in the same project, consider using a shell script, which ensures that other users run the different languages or software in the correct order.
  • Comments. Include comments (using *) in your code frequently to explain what a line of code (or a group of commands) is doing, and why. For example, if the code drops observations or changes values, explain why this was necessary in a comment. This keeps the code easy to understand and the research transparent.
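
For example, a master do-file along the lines below implements these points. The folder paths, package list, and do-file names are hypothetical placeholders:

  * master.do -- minimal sketch of a master do-file

  * Set the Stata version and seed so sampling and randomization replicate
  version 15
  set seed 510637

  * Install all user-written packages the project relies on
  ssc install ietoolkit, replace
  ssc install iefieldkit, replace

  * Declare all project paths as globals, here and only here
  global project "C:/Projects/impact-eval"
  global data    "$project/DataWork"
  global code    "$project/code"

  * Run the project do-files in a pre-specified order
  do "$code/01_cleaning.do"
  do "$code/02_construction.do"
  do "$code/03_analysis.do"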

Software

There are several free software tools that allow the research team to publicly release code, including GitHub and Jupyter Notebook. Users can pick whichever of these they are most familiar with.

  • GitHub
    GitHub is a free hosting service for the Git version-control system. It is popular because users can store every version of every component of a project (such as data and code) in repositories that everyone working on a project can access. With GitHub repositories, users can track changes to code in different programming languages and create documentation explaining what changes were made and why. The research team can then simply share its repositories with an external audience, allowing others to read and replicate the code as well as the results of an impact evaluation.
  • Jupyter Notebook
    This is another platform where researchers can create and share code in different programming languages, including Python, R, Julia, and Scala.

To learn more about how to use these tools, users can refer to the following resources:

  1. GitHub introductory training
  2. GitHub guides
  3. Jupyter documentation
  4. Jupyter blogs

Output Reproducibility

As noted above, GitHub repositories allow researchers to track changes to code in different programming languages, create messages explaining the changes, and make code publicly available so that other researchers can read and replicate it.

Dynamic documents allow researchers to write papers and reports that automatically import or display results. This reduces the amount of manual work involved between analyzing data and publishing the output of this analysis, so there's less room for error and manipulation.

Different software allows for different degrees of automation. R Markdown, for example, allows users to write text and code simultaneously, running analyses in different programming languages and printing results in the final document alongside the text. Stata 15's dyndoc command creates similar documents; the output is a file, usually a PDF, that contains text, tables, and graphs. With this kind of document, whenever researchers update data or change the analysis, they only need to run one file to generate a new final paper or report. No copy-pasting or manual changes are necessary, which improves reproducibility.
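
As an illustration, a Stata dynamic-document source file might look like the following. The file name (report.txt) and the use of Stata's demonstration auto dataset are illustrative; the <<dd_do>> and <<dd_display>> tags are processed by dyndoc:

  <<dd_do: quietly>>
  * Load example data and compute the statistic used in the text
  sysuse auto, clear
  summarize price
  <</dd_do>>

  The average vehicle price in the sample is <<dd_display: %9.0fc r(mean)>>.

Running dyndoc report.txt, replace regenerates the document, so updated results flow straight into the text.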

LaTeX is another widely used tool in the scientific community. It is a typesetting system that allows users to reference code outputs such as tables and graphs so they can be easily updated in a text document. After you analyze the data in your preferred software, you can export the results into TeX format: R's stargazer is commonly used for this, and Stata has several options, such as esttab and outreg2. The LaTeX document then imports these outputs. Whenever results are updated, simply recompile the LaTeX document to integrate the new graphs and tables. Overleaf is a web-based platform that facilitates collaborative TeX writing, and Jupyter Notebook can create dynamic documents in HTML, LaTeX, and other formats.
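
For instance, a regression table can be exported from Stata with esttab (from the estout package) and imported into the LaTeX document with \input. The file paths and variables here are illustrative:

  * Export a regression table to TeX (assumes estout is installed: ssc install estout)
  sysuse auto, clear
  regress price mpg weight
  esttab using "tables/price_reg.tex", replace label se booktabs

  * The LaTeX document then imports the table with:
  *   \input{tables/price_reg.tex}
  * Recompiling the document picks up the updated table automatically.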

Additional Resources