Reproducible Research
In most scientific fields, results are validated through replication: different scientists independently run the same experiment on different samples and reach similar conclusions. That standard is not always feasible in development research. More often than not, the phenomena we analyze cannot be artificially re-created, and even in the case of field experiments, different populations can respond differently to a treatment and the costs involved are high.
Even in such cases, however, we should still require reproducibility: different researchers, running the same analysis on the same data, should find the same results. That may seem obvious, but unfortunately it is not as widely observed as we would like. The bottom line of research reproducibility is that the path used to get to your results is as much a research output as the results themselves, making the research process fully transparent. This means that researchers should make available not only the final findings, but also the data, code, and documentation behind them.
Code replication
Replicating results is the most important part of reproducible research. The easiest way to guarantee that results can be replicated is to have code that runs all the data work and can be run by anyone who has access to it. Different researchers running the same code on the same data should get the same results. So to guarantee that research is transparent and reproducible, code and data should be shared.
It is possible for the same data and code to create different results if the right measures are not taken. In Stata, for example, setting the seed and the version is essential for replicable sampling and randomization, and sorting observations in a stable way is also frequently necessary. Having a master do-file greatly improves the replicability of results, since it makes it possible to standardize settings and run do-files in a pre-specified order. If different languages or software are used in the same project, a shell script can be used for the same purpose.
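As an illustration, a minimal master do-file might look like the sketch below (assuming Stata; the file names are placeholders, not part of any particular project):

  * master.do - minimal sketch of a master do-file
  version 15            // lock the Stata version so results do not change across releases
  clear all
  set more off
  set seed 20180212     // fix the random-number seed so sampling and randomization replicate

  * Run the project's do-files in a pre-specified order (placeholder file names)
  do "01_cleaning.do"
  do "02_sampling.do"
  do "03_analysis.do"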
Another important part of replicable code is that it should be understandable. If someone else runs your code and replicates all the results, but doesn't understand what was being done, then your research is still not transparent. Commenting code to make it clear where and why decisions were made is a crucial part of making your work transparent. For example, if observations are dropped or values are changed, the code should be commented to explain why that was done.
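For instance, such a commented correction in Stata might look like this hypothetical snippet (variable names, dates, and values are purely illustrative):

  * Drop duplicate submissions of the same household interview found during quality checks
  duplicates drop household_id survey_round, force

  * Enumerator 12 recorded income in local currency instead of USD on 05 Feb 2018;
  * convert those observations using that day's exchange rate
  replace income_usd = income_usd / 135 if enumerator == 12 & survey_date == td(05feb2018)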
Software for Code Replication
Git is a free version-control system. Files are stored in Git repositories, most commonly hosted on GitHub. Repositories track every change made to the code, and each change can be accompanied by a message explaining why it was made, which improves documentation. Sharing Git repositories is a way to make code publicly available and allow other researchers to read and replicate it.
To learn GitHub, an introductory training is available through GitHub Services, and multiple tutorials are available through GitHub Guides.
Data publication
Research results can only be replicated if the data used for analysis is shared. Though being able to reproduce all steps from data cleaning to data analysis is ideal to guarantee reproducibility and transparency, that is not always possible, as some of the data used may be personally identifiable or confidential. However, sharing the final data is necessary for reproducibility. Some journals require datasets to be submitted along with papers, and some researchers prefer to make data available upon request.
Dynamic documents
Dynamic documents allow researchers to write papers and reports that automatically import or display results. This reduces the amount of manual work involved between analyzing data and publishing the output of that analysis, so there is less room for error and manipulation.
Different software allows for different degrees of automation. Using R Markdown, for example, users can write text and code together, running analyses in different programming languages and printing the results in the final document along with the text. Stata 15 also allows users to create such documents using dyndoc. The output is a single file, typically a PDF or HTML document, that contains the text, tables, and graphs together. With this kind of document, whenever the data is updated or a change is made to the analysis, it is only necessary to run one file to generate a new version of the paper or report, with no copy-pasting or manual changes.
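As a sketch of how this works in Stata 15, a dynamic document is a plain text or Markdown file that mixes prose with dynamic tags; a minimal example (the dataset and file choices are illustrative) could be:

  <<dd_do: quietly>>
  sysuse auto, clear
  summarize price
  <</dd_do>>

  The average price in the auto dataset is <<dd_display: %9.2f r(mean)>> dollars.

Running dyndoc report.txt, replace in Stata (here assuming the file is saved as report.txt) converts it into an HTML document in which the tag is replaced by the computed average, so a single command refreshes the report after the data changes.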
LaTeX is another widely used tool in the scientific community. It is a typesetting system that allows users to reference code outputs such as tables and graphs so that they can be easily updated in a text document. With this workflow, data analysis and writing are two separate processes: first, the data is analyzed using whatever software you prefer and the results are exported in TeX format (R's stargazer is commonly used for that, and Stata has different options such as esttab and outreg2), then a LaTeX document is written that imports these outputs. The advantage of using LaTeX is that whenever results are updated, it is only necessary to recompile the LaTeX document for the new tables and graphs to be displayed.
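For example, the Stata side of this workflow might look like the sketch below, using the user-written eststo and esttab commands from the estout package (installed once with ssc install estout; the output file name is illustrative):

  sysuse auto, clear

  * Run the regressions and store the estimates
  eststo clear
  eststo: regress price mpg
  eststo: regress price mpg weight

  * Export the stored estimates as a LaTeX table
  esttab using "results.tex", tex se label replace

The LaTeX document can then include the table with \input{results.tex}, so recompiling it after rerunning the do-file picks up the updated results automatically.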
Other software for dynamic documents
- Jupyter Notebook is used to create and share code in different programming languages, including Python, R, Julia, and Scala. It can also create dynamic documents in HTML, LaTeX and other formats.
- Overleaf is a web-based platform for collaborating on TeX documents.
- Open Science Framework is a web-based project management platform that combines registration, data storage (through Dropbox, Box, Google Drive, and other platforms), code version control (through GitHub), and document composition (through Overleaf).
Additional Resources
From Data Colada:
From the Abdul Latif Jameel Poverty Action Lab (J-PAL):
From Innovations for Poverty Action (IPA):
- Reproducible Research: Best Practices for Data and Code Management
- Guidelines for data publication
- Randomized Control Trials in the Social Science Dataverse
Center for Open Science
- Transparency and Openness Guidelines, summarized in a 1-Page Handout
Berkeley Initiative for Transparency in the Social Sciences
Reproducible Research in R
Reproducible Research in Stata