Publishing Data

Revision as of 18:22, 28 August 2020 by Bbdaniels

Data publication is the release of data and data documentation following data collection and analysis. Data publication is an increasingly common standard that bolsters research transparency and reproducibility. Preparation for data publication begins in the early stages of research: effective data management and analytics throughout the project will ensure that the research team can easily publish data when the time comes and that outside users can access and use the data to replicate the researcher's primary results. This page will discuss preparing and publishing data, code, documentation, and directories.

Read First

Preparing for Release

Preparing Data

Released data should allow any user to replicate research findings. Therefore, released data should be clean and well-labelled, contain all variables used in data analysis, and include the ID variables that uniquely identify each observation. These ID variables must not themselves reveal who the respondents are: maintain respondents' privacy by carefully de-identifying any sensitive or personally identifying information (PII), such as names, locations, or financial records, none of which is ethical to publish.
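As a minimal sketch of the de-identification step, the Python snippet below drops a list of PII columns from a dataset and then verifies that none of them remain. The column names (name, village, phone), the helper functions, and the example rows are all hypothetical. Dropping direct identifiers is only a first step; real de-identification usually also needs a review of variables that could identify respondents indirectly.

```python
import csv
import io

# Hypothetical PII column names; the real list depends on the survey instrument.
PII_COLUMNS = {"name", "village", "phone"}


def deidentify(rows, pii_columns=PII_COLUMNS):
    """Return copies of the rows with every listed PII column dropped."""
    return [
        {key: value for key, value in row.items() if key not in pii_columns}
        for row in rows
    ]


def assert_no_pii(rows, pii_columns=PII_COLUMNS):
    """Raise ValueError if any listed PII column survives in any row."""
    for row in rows:
        leaked = pii_columns.intersection(row)
        if leaked:
            raise ValueError(f"PII columns still present: {sorted(leaked)}")


# Made-up example data standing in for a raw survey export.
raw_csv = """hhid,name,village,phone,treatment,score
101,Respondent A,Village X,555-0101,1,0.42
102,Respondent B,Village Y,555-0102,0,0.37
"""

raw_rows = list(csv.DictReader(io.StringIO(raw_csv)))
public_rows = deidentify(raw_rows)
assert_no_pii(public_rows)  # fails loudly if a PII column slipped through
print(public_rows[0])
```

Keeping the check separate from the drop means it can be rerun on the final release files, regardless of how they were produced.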

Preparing Data Documentation

Analysis datasets should be easily understandable to researchers trying to replicate results. Therefore, proper documentation, including variable dictionaries and survey instruments, should accompany the data release. See the Microdata Catalog Checklist for instructions on how to prepare data and documentation for primary data release.
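One way to produce a basic variable dictionary is sketched below in Python. It assumes a hypothetical LABELS mapping from variable names to descriptions and a small made-up dataset. The helper refuses to build the dictionary if any variable lacks a label, so undocumented variables are caught before release. This is an illustration only, not the format required by the Microdata Catalog.

```python
import csv
import io

# Hypothetical variable labels; in practice these come from the survey instrument.
LABELS = {
    "hhid": "Unique household ID",
    "treatment": "Assigned to treatment (1 = yes, 0 = no)",
    "score": "Standardized test score",
}


def build_dictionary(rows, labels):
    """Return one entry per variable: name, label, and non-missing count."""
    variables = list(rows[0].keys()) if rows else []
    unlabeled = [name for name in variables if name not in labels]
    if unlabeled:
        raise ValueError(f"Variables without a label: {unlabeled}")
    return [
        {
            "variable": name,
            "label": labels[name],
            "n_nonmissing": sum(1 for row in rows if row[name] != ""),
        }
        for name in variables
    ]


# Made-up de-identified data; the second row has a missing score.
data_csv = """hhid,treatment,score
101,1,0.42
102,0,
"""
rows = list(csv.DictReader(io.StringIO(data_csv)))
dictionary = build_dictionary(rows, LABELS)

# Write the dictionary as CSV text so it can ship alongside the data.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["variable", "label", "n_nonmissing"])
writer.writeheader()
writer.writerows(dictionary)
print(buffer.getvalue())
```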

Preparing Code and Directory

For full reproducibility, release a structured directory that allows a user to immediately run your code after changing the project directory. If you have followed the DIME Wiki’s protocols and effectively managed data throughout your research project via, among other things, an organized project folder and master do-file, you will already have well-written and reproducible code within a well-structured directory.

The folders should include all de-identified data necessary for the analysis, all code necessary for the analysis, and the raw outputs you use for the paper. Using iefolder from DIME’s ietoolkit can help standardize your directory. In either the /dofiles/ folder or the root directory, include a master script (.do or .r for example). The master script should allow the reviewer to set their directory path by changing a single line of code, and should then run the entire project and re-create all the raw outputs exactly as supplied. Check that all code runs completely on a new computer: install any required user-written commands in the master script, and make sure that settings like version, matsize, and varabbrev are set explicitly. Every output should clearly correspond by name to an exhibit in the paper, and vice versa.
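The master-script pattern described above can be sketched in Python as follows. This is an analogue of a master do-file, not DIME's own template: the folder names, the script make_table1.py, and the EXHIBITS mapping are hypothetical, and the demo builds a throwaway project in a temporary directory so that it can run anywhere. The script sets the root path in one place, runs each analysis script in order, and then reports any exhibit whose output file was not re-created.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_project(root, scripts):
    """Run each script in order from the project root; stop at the first failure."""
    root = Path(root)
    for relative_path in scripts:
        subprocess.run(
            [sys.executable, str(root / relative_path)], cwd=root, check=True
        )


def missing_outputs(root, exhibits):
    """Return the exhibit names whose output file does not exist under root."""
    root = Path(root)
    return [name for name, rel in exhibits.items() if not (root / rel).is_file()]


# Demo on a throwaway project; in a real package only ROOT would need editing.
with tempfile.TemporaryDirectory() as tmp:
    ROOT = Path(tmp)  # <- the one line a reviewer changes
    (ROOT / "code").mkdir()
    # Hypothetical analysis script that writes one output file.
    (ROOT / "code" / "make_table1.py").write_text(
        "from pathlib import Path\n"
        "Path('outputs').mkdir(exist_ok=True)\n"
        "Path('outputs/table1.csv').write_text('group,mean\\ncontrol,0.37\\n')\n"
    )
    SCRIPTS = ["code/make_table1.py"]
    # Each exhibit in the paper maps by name to exactly one output file.
    EXHIBITS = {"Table 1": "outputs/table1.csv", "Figure 1": "outputs/figure1.png"}

    run_project(ROOT, SCRIPTS)
    missing = missing_outputs(ROOT, EXHIBITS)

print(missing)  # exhibits that no script produced
```

In this demo no script produces the Figure 1 file, so the check flags it. An empty list is the signal that every exhibit named in the paper has a matching output.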


A data publication platform must be able to handle structured directories and provide a stable, structured URL for your project.

DIME survey data is typically published and released through the Microdata Catalog.

GitHub, the Open Science Framework, and ResearchGate are often used for replication packages, as these platforms allow for publication of data, documentation, and code.

Author’s Preprint

Consider releasing an author’s copy or preprint, but check with your publisher before doing so: not all journals will accept material that has already been released, so you may need to wait until acceptance is confirmed. You can release a preprint on a number of preprint websites, many of which are topic-specific. You can also host the file on GitHub and link to it directly from your personal website or whatever medium you use to share the preprint. Do not use Dropbox or Google Drive for this purpose: many organizations block access to these tools, which would prevent their staff from accessing your material.

Additional Resources