Publishing Data
Data publication is the release of data and data documentation following data collection and analysis. Data publication is an increasingly common standard that bolsters research transparency and reproducibility. Preparation for data publication begins in the early stages of research: effective data management and analytics throughout the project will ensure that the research team can easily publish data when the time comes and that outside users can access and use the data to replicate the researcher's primary results. This page will discuss preparing and publishing data, code, documentation, and directories.
Read First
- DIME Data Publication Standards
- Before publishing data, remove all personally-identifying information (PII) such as names, locations or financial records.
- Accompany published data with proper documentation to ensure that users understand the data.
- Publish data within a comprehensive directory that includes all necessary data files, raw outputs, and code.
- GitHub, The Open Science Framework, and Gate are all platforms on which researchers can publish data, code, and directories.
Preparing for Release
Preparing Data
Released data should allow any user to replicate research findings. Therefore, released data should be clean and well-labelled, contain all variables used in data analysis, and include ID variables that identify each observation uniquely without revealing who the respondent is. Protect respondents' privacy by carefully de-identifying any sensitive or personally-identifying information (PII) such as names, locations, or financial records; publishing such information is unethical.
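As a minimal sketch of the de-identification step described above, the Python snippet below drops direct identifiers from a small in-memory dataset while keeping an anonymous ID. The column names (`respondent_id`, `name`, `village`, `bank_account`, `income`) and the sample rows are hypothetical and only illustrate the idea; a real project would list its own PII variables.

```python
import csv
import io

# Hypothetical PII column names; replace with the variables in your own data.
PII_COLUMNS = {"name", "village", "bank_account"}


def deidentify(rows, pii_columns=PII_COLUMNS, id_column="respondent_id"):
    """Return copies of the rows without PII columns, keeping the anonymous ID."""
    cleaned = []
    for row in rows:
        if id_column not in row:
            raise KeyError(f"missing ID column: {id_column}")
        cleaned.append({k: v for k, v in row.items() if k not in pii_columns})
    return cleaned


# Made-up example data standing in for a raw survey file.
raw_csv = """respondent_id,name,village,bank_account,income
1001,Asha,Village A,12345,250
1002,Ben,Village B,67890,310
"""

rows = list(csv.DictReader(io.StringIO(raw_csv)))
public_rows = deidentify(rows)
print(public_rows)
```

Dropping direct identifiers is usually only a first step. Variables that could identify a respondent in combination may need further treatment, and the ID itself should not encode identifying information.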
Preparing Data Documentation
Analysis datasets should be easily understandable to researchers trying to replicate results. Therefore, proper documentation, including variable dictionaries and survey instruments, should accompany the data release. See the Microdata Catalog Checklist for instructions on how to prepare data and documentation for primary data release.
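One way to start a variable dictionary is to generate a skeleton from the dataset itself and then fill in the labels by hand. The sketch below assumes a small CSV and a hypothetical `LABELS` mapping. It is not the format required by the Microdata Catalog, only an illustration of what a dictionary entry might record.

```python
import csv
import io

# Hypothetical labels; in practice these would come from the survey instrument.
LABELS = {
    "respondent_id": "Anonymous respondent ID",
    "income": "Monthly household income (local currency)",
}


def build_dictionary(rows, labels):
    """Return one entry per variable: name, label, and count of non-missing values."""
    if not rows:
        return []
    entries = []
    for name in rows[0].keys():
        non_missing = sum(1 for row in rows if row[name] not in ("", None))
        entries.append(
            {
                "variable": name,
                "label": labels.get(name, "UNLABELLED"),
                "non_missing": non_missing,
            }
        )
    return entries


# Made-up example data; the second respondent has a missing income value.
data_csv = """respondent_id,income
1001,250
1002,
"""

rows = list(csv.DictReader(io.StringIO(data_csv)))
dictionary = build_dictionary(rows, LABELS)
for entry in dictionary:
    print(entry)
```

Flagging unlabelled variables with a placeholder such as `UNLABELLED` makes gaps in the documentation easy to spot before release.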
Preparing Code and Directory
For full reproducibility, release a structured directory that allows a user to immediately run your code after changing the project directory. If you have followed the DIME Wiki’s protocols and effectively managed data throughout your research project via, among other things, an organized project folder and master do-file, you will already have well-written and reproducible code within a well-structured directory.
The folders should include all de-identified data necessary for the analysis, all code necessary for the analysis, and the raw outputs you use for the paper. Using iefolder from DIME’s ietoolkit can help standardize your directory. In either the /dofiles/ folder or the root directory, include a master script (.do or .R, for example). The master script should allow the reviewer to change one line of code to set their directory path. Then, the master script should run the entire project and re-create all the raw outputs exactly as supplied. Check that all code will run completely on a new computer: install any required user-written commands in the master script and make sure that settings like version, matsize, and varabbrev are set. All outputs should clearly correspond by name to an exhibit in the paper, and vice versa.
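The master-script pattern can be illustrated in any language. The sketch below is a Python analogue, not a Stata do-file, so it does not cover Stata settings such as version, matsize, or varabbrev. It assumes a hypothetical layout with `code/` and `outputs/` folders, runs every script from a single root path, and then checks that outputs and exhibits match by name in both directions. It builds a throwaway project in a temporary folder so that it can run anywhere.

```python
import runpy
import tempfile
from pathlib import Path


def run_project(root, scripts):
    """Run each script in order; paths are relative to the project root."""
    for rel_path in scripts:
        runpy.run_path(str(root / rel_path), run_name="__main__")


def compare_outputs(root, exhibits):
    """Check output file names against exhibit names in both directions."""
    produced = {p.stem for p in (root / "outputs").iterdir() if p.is_file()}
    expected = set(exhibits)
    return {
        "missing": sorted(expected - produced),     # exhibits with no output file
        "unexpected": sorted(produced - expected),  # outputs tied to no exhibit
    }


# Demo on a throwaway project (hypothetical file and exhibit names).
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)  # in a real package, the one line a reviewer edits
    (root / "code").mkdir()
    (root / "outputs").mkdir()
    # A stand-in analysis script that writes one raw output.
    (root / "code" / "01_analysis.py").write_text(
        "from pathlib import Path\n"
        "out = Path(__file__).resolve().parent.parent / 'outputs'\n"
        "(out / 'table1.csv').write_text('coef,se\\n0.5,0.1\\n')\n"
    )
    run_project(root, ["code/01_analysis.py"])
    report = compare_outputs(root, ["table1", "figure2"])

print(report)
```

In this demo the report lists `figure2` as missing, because no script re-creates it. A replication package is ready for release when both lists are empty.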
Publishing
A data publication platform must be able to handle structured directories and provide a stable, structured URL for your project.
DIME survey data is typically published and released through the Microdata Catalog.
GitHub, The Open Science Framework, and Gate are often used for replication packages, as these platforms allow for publication of data, documentation, and code.
Author’s Preprint
Consider releasing an author’s copy or preprint, but check with your publisher first: not all journals will accept material that has already been released, so you may need to wait until acceptance is confirmed. You can release a preprint on a number of preprint websites, many of which are topic-specific. You can also host the file on GitHub and link to it directly from your personal website or whatever medium you use to share the preprint. Do not use Dropbox or Google Drive for this purpose: many organizations block access to these tools, which would prevent their staff from accessing your material.
Additional Resources
- World Bank, Example of a published World Bank directory for replication.
- J-PAL, Guide to Publishing Research Data
- International Aid Transparency Initiative, How to License Your Data