Difference between revisions of "Publishing Data"

 
(15 intermediate revisions by 6 users not shown)
Line 1: Line 1:
Making data available to other researchers in some form is a key need of research transparency and reproducibility. However, it is not generally possible or advisable to release raw data. [[Primary Data Collection | Primary data]] usually contains [[De-identification#Personally Identifiable Information | personally-identifying information (PII)]] such as names, locations, or financial records that are unethical to make public; [[Secondary Data Sources | secondary data]] is often owned by an entity other than the research team and therefore may face legal issues in public release. It is therefore important to structure both data management and analytics such that the data that is published replicates the researcher's primary results to the best degree possible and that the data that is released is appropriately accessible.
Data publication is the release of data and data documentation following [[Primary Data Collection | data collection]] and [[Data Analysis | analysis]]. Data publication is an increasingly common standard that bolsters research transparency and [[Reproducible Research | reproducibility]]. Preparation for data publication begins in the early stages of research: effective [[Data Management | data management]] and analytics throughout the project will ensure that the research team can easily publish data when the time comes and that outside users can access and use the data to [[Reproducible Research | replicate]] the researcher's primary results. This page will discuss preparing and publishing data, code, documentation, and directories.


== Guidelines==
==Read First==
=== Publishing Primary Data ===
*[https://github.com/worldbank/dime-standards/tree/master/dime-research-standards/pillar-5-data-publication DIME Data Publication Standards]
*Before publishing data, remove all [[De-identification#Personally Identifiable Information | personally-identifying information (PII)]] such as names, locations or financial records. 
*Accompany published data with proper [[Data Documentation | documentation]] to ensure that users understand the data.
* Clearly state who [[Data Ownership|owns]] the data that is being published.
*Publish data within a comprehensive directory that includes all necessary data files, raw outputs, and code.
*[[Getting started with GitHub | GitHub]], [https://osf.io/ The Open Science Framework], and [https://www.researchgate.net/ ResearchGate] are all platforms on which researchers can publish data, code, and directories.


==== Preparing data for release ====
==Preparing for Release ==  
The main issue with releasing primary data is maintaining the privacy of respondents. It is essential to carefully [[De-identification | de-identify]] any sensitive or personally-identifying information contained in the dataset. Datasets released should be easily understandable by users, so [[Data Documentation | documentation]], including variable dictionaries and survey instruments, should be released with the data.


==== DIME data releases ====
=== Preparing Data===
DIME survey data is released through the [[Microdata Catalog]]. However, access to the data may be restricted and some variables may be embargoed prior to publication.


=== Publishing Analysis Data ===
Released data should allow any user to [[Reproducible Research | replicate]] research findings. Therefore, released data should be [[Data Cleaning | clean]] and [[Data_Cleaning#Labels | well-labelled]], contain all variables used in [[Data Analysis | data analysis]], and include [[ID Variable Properties | identifying variables]]. Maintain the privacy of respondents by carefully [[De-identification | de-identifying]] any sensitive or [[De-identification#Personally Identifiable Information | personally-identifying information (PII)]] such as names, locations, or financial records, none of which is [[Research Ethics | ethical]] to publish.
Some journals require datasets used in [[Data Analysis | data analysis]] to be released when a paper is published. This is intended to make research more transparent and allow readers to [[Reproducible Research | reproduce findings]].


==== Preparing data for release ====
===Preparing Data Documentation===
The objective of the data release is to allow users to reproduce results in the paper. Therefore, the released dataset needs to contain all variables used in [[Data Analysis | data analysis]], as well as all [[ID Variable Properties | identifying variables]].


== Back to Parent ==
Analysis datasets should be easily understandable to researchers trying to replicate results. Therefore, proper [[Data Documentation | documentation]], including variable dictionaries and survey instruments, should accompany the data release so that users can easily understand the data. See the [[Checklist: Microdata Catalog submission|Microdata Catalog Checklist]] for instructions on how to prepare data and documentation for primary data release.
This article is part of the topic [[Publishing Data]]


== Additional Resources==
===Preparing Code and Directory===


For full reproducibility, release a structured directory that allows a user to immediately run your code after changing the project directory. If you have followed the DIME Wiki’s protocols and effectively [[Data Management | managed]] data throughout your research project via, among other things, an organized [[DataWork Folder | project folder]] and [[Master Do-files | master do-file]], you will already have well-written and reproducible [[Stata Coding Practices | code]] within a well-structured directory.


[[Category: Publishing Data]]
The folders should include all de-identified data necessary for the analysis, all code necessary for the analysis, and the raw outputs you use for the paper. Using <code>[[iefolder]]</code> from DIME’s <code>[[ietoolkit]]</code> can help standardize your directory. In either the /dofiles/ folder or the root directory, include a [[Master Do-files | master script]] (.do or .r, for example). The master script should allow the reviewer to set the directory path by changing a single line of code. The master script should then run the entire project and re-create all the raw outputs exactly as supplied. Check that all code will run completely on a new computer: install any required user-written commands in the master script and make sure that settings like <code>version</code>, <code>matsize</code>, and <code>varabbrev</code> are set explicitly. Every output should clearly correspond by name to an exhibit in the paper, and vice versa.
 
==Publishing==
A data publication platform must be able to handle structured directories and provide a stable, structured URL for your project.
 
[[DIME_Datasets_on_Microdata_Catalog| DIME survey data]] is typically published and released through the [[Microdata Catalog]].
 
[[Getting started with GitHub | GitHub]], [https://osf.io/ The Open Science Framework], and [https://www.researchgate.net/ ResearchGate] are often used for replication packages, as these platforms allow for publication of data, documentation, and code.
 
== Author’s Preprint ==
 
Consider releasing an author’s copy or preprint, but check with your publisher before doing so: not all journals will accept material that has already been released, so you may need to wait until acceptance is confirmed. Preprints can be posted on a number of preprint websites, many of which are topic-specific. You can also host the file on GitHub and link to it directly from your personal website or whatever medium you use to share the preprint. Do not use Dropbox or Google Drive for this purpose: many organizations block access to these tools, so their staff would be unable to access your material.
 
== Related Pages ==
[[Special:WhatLinksHere/Publishing_Data|Click here for pages that link to this topic.]]
 
== Additional Resources ==
*J-PAL, [https://www.povertyactionlab.org/sites/default/files/resources/J-PAL-guide-to-publishing-research-data.pdf Guide to Publishing Research Data]
* International Aid Transparency Initiative, [https://iatistandard.org/en/guidance/preparing-organisation/organisation-data-publication/how-to-license-your-data How to License Your Data]
* World Bank, [https://github.com/worldbank/Water-When-It-Counts Example of a published World Bank directory for replication ].
[[Category: Reproducible Research]]

Latest revision as of 17:07, 21 June 2021

Data publication is the release of data and data documentation following data collection and analysis. Data publication is an increasingly common standard that bolsters research transparency and reproducibility. Preparation for data publication begins in the early stages of research: effective data management and analytics throughout the project will ensure that the research team can easily publish data when the time comes and that outside users can access and use the data to replicate the researcher's primary results. This page will discuss preparing and publishing data, code, documentation, and directories.

Read First

Preparing for Release

Preparing Data

Released data should allow any user to replicate research findings. Therefore, released data should be clean and well-labelled, contain all variables used in data analysis, and include identifying variables. Maintain the privacy of respondents by carefully de-identifying any sensitive or personally-identifying information (PII) such as names, locations, or financial records, none of which is ethical to publish.
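
As a rough illustration of the de-identification step, the sketch below drops direct identifiers from a list of records before release. It is only a minimal example under stated assumptions: the function name `drop_direct_identifiers` and the column names (`hhid`, `name`, `gps_lat`, `income`) are hypothetical and not part of any DIME tool, and removing direct identifiers alone may not be sufficient to protect respondents in a real dataset.

```python
def drop_direct_identifiers(rows, pii_columns):
    """Return copies of the records with the listed PII columns removed.

    `rows` is a list of dicts (one per observation). The input records
    are not modified, so the identified version stays intact for
    internal use.
    """
    pii = set(pii_columns)
    return [
        {key: value for key, value in row.items() if key not in pii}
        for row in rows
    ]


# Hypothetical survey records: `hhid` is an anonymous ID variable that is
# kept, while `name` and `gps_lat` are treated as PII and removed.
survey = [
    {"hhid": 101, "name": "A. Respondent", "gps_lat": 1.23, "income": 500},
    {"hhid": 102, "name": "B. Respondent", "gps_lat": 4.56, "income": 750},
]
release = drop_direct_identifiers(survey, ["name", "gps_lat"])
```

The anonymous ID variable is kept so that released files can still be merged with one another, while the columns treated as PII never reach the published copy.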

Preparing Data Documentation

Analysis datasets should be easily understandable to researchers trying to replicate results. Therefore, proper documentation, including variable dictionaries and survey instruments, should accompany the data release so that users can easily understand the data. See the Microdata Catalog Checklist for instructions on how to prepare data and documentation for primary data release.

Preparing Code and Directory

For full reproducibility, release a structured directory that allows a user to immediately run your code after changing the project directory. If you have followed the DIME Wiki’s protocols and effectively managed data throughout your research project via, among other things, an organized project folder and master do-file, you will already have well-written and reproducible code within a well-structured directory.

The folders should include all de-identified data necessary for the analysis, all code necessary for the analysis, and the raw outputs you use for the paper. Using iefolder from DIME’s ietoolkit can help standardize your directory. In either the /dofiles/ folder or the root directory, include a master script (.do or .r, for example). The master script should allow the reviewer to set the directory path by changing a single line of code. The master script should then run the entire project and re-create all the raw outputs exactly as supplied. Check that all code will run completely on a new computer: install any required user-written commands in the master script and make sure that settings like version, matsize, and varabbrev are set explicitly. Every output should clearly correspond by name to an exhibit in the paper, and vice versa.
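
One way to check the last requirement, that outputs and exhibits match by name in both directions, is sketched below. This is an illustrative helper rather than an established DIME tool: the function names, the idea of comparing file stems to a hand-written list of exhibit names, and the example file names are all assumptions.

```python
from pathlib import Path, PurePath


def compare_names(output_files, exhibits):
    """Compare output file names with exhibit names.

    Files are matched to exhibits by stem (file name without extension).
    Returns (missing, unexpected): exhibits that have no output file, and
    output files that match no exhibit, both sorted alphabetically.
    """
    produced = {PurePath(name).stem for name in output_files}
    expected = set(exhibits)
    return sorted(expected - produced), sorted(produced - expected)


def check_output_folder(output_dir, exhibits):
    """Run compare_names on every file directly inside output_dir."""
    files = [p.name for p in Path(output_dir).iterdir() if p.is_file()]
    return compare_names(files, exhibits)


# Hypothetical example: the paper has two exhibits, but the outputs folder
# holds a stray file and lacks one of the figures.
missing, unexpected = compare_names(
    ["table1.tex", "scratch.png"], ["table1", "figure2"]
)
```

Here `missing` lists exhibits the master script did not produce and `unexpected` lists outputs that do not belong to any exhibit; both being empty suggests the package and the paper line up by name.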

Publishing

A data publication platform must be able to handle structured directories and provide a stable, structured URL for your project.

DIME survey data is typically published and released through the Microdata Catalog.

GitHub, The Open Science Framework, and ResearchGate are often used for replication packages, as these platforms allow for publication of data, documentation, and code.

Author’s Preprint

Consider releasing an author’s copy or preprint, but check with your publisher before doing so: not all journals will accept material that has already been released, so you may need to wait until acceptance is confirmed. Preprints can be posted on a number of preprint websites, many of which are topic-specific. You can also host the file on GitHub and link to it directly from your personal website or whatever medium you use to share the preprint. Do not use Dropbox or Google Drive for this purpose: many organizations block access to these tools, so their staff would be unable to access your material.

Related Pages

Click here for pages that link to this topic.

Additional Resources