Difference between revisions of "Data Documentation"

Line 1: Line 1:
<onlyinclude>
+
Data documentation is the process of recording any aspect of project design, [[Sampling & Power Calculations | sampling]], [[Primary Data Collection | data collection]], [[Data Cleaning | cleaning]] and [[Data Analysis | analysis]] that may affect results. It is not a one-time requirement or a retrospective task, but an active and ongoing process throughout the course of a research project. Documentation at every step of a project is critical to [[Reproducible Research | reproducible research]]. This page outlines why to document, what to document, and how to document it.
Documenting any aspects of the data work that may affect the analysis is a crucial part of dealing with data. Impact evaluation projects often take years to be completed and are executed by large teams. If the data work is not documented while it is ongoing, it is likely that many details will be lost and a considerable amount of time spent trying to understand what was previously done. For example, say it became clear during the field work that some respondents didn't understand a test that was applied because they had reading difficulties. If the [[Impact Evaluation Team#Field Coordinator | field coordinator]] didn't document this issue, the [[Impact Evaluation Team#Research Assistant | research assistant]] will not know to flag them during [[Data Cleaning | data cleaning]]. And if the [[Impact Evaluation Team#Research Assistant | research assistant]] doesn't document why the observations were flagged and what the flag means, they may not be correctly dealt with during [[Data Analysis | analysis]].
 
</onlyinclude>
 
There are different ways to document data work. One widespread practice is to send e-mails reporting issues to the team. Though this is easily done, it is time-consuming to find answers later on in the project, and it only works if someone on the team remembers that an e-mail was sent. For data cleaning, data analysis and variable construction, it is best practice to document the data work through comments in the code. However, even though this is very helpful for someone reading the code carefully, if these comments are not documented elsewhere, it may take a long time to go through all the do-files and find the answer to a specific question. It is usually advisable to keep all data work documentation in one file or folder, though how it is structured and when, how and by whom it is updated will vary from one project to the other. One advantage of submitting code for [[Code Review| code review]] and depositing data on the [[Microdata Catalog | microdata catalog]] is that in both cases the data work documentation will be reviewed. This does not guarantee that everything that should be documented in fact is, as reviewers cannot ask about issues unknown to them.
 
  
== Read first ==
+
==Read First==
 +
*Project documentation ensures [[Reproducible Research | reproducible research]] and facilitates transfer of knowledge between team members over time.
 +
* Project documentation is not a one-time requirement or retrospective task, but rather an active and ongoing process throughout the course of a research project.
 +
* Make sure to document details regarding sampling, field work dates, respondent tracking, issues in the field, data cleaning, and datasets.
 +
*Consolidate all data work documentation in one file or folder.
  
== Field Work Documentation ==
+
==Why Document==
=== Sampling ===
 
* Sample selection
 
* Replacement criteria
 
  
=== Field work dates ===
+
Project documentation ensures [[Reproducible Research | reproducible research]] and facilitates transfer of knowledge between team members, both now and in the future. Impact evaluations often span several years and are executed by large teams. If a research team does not promptly document data, fieldwork, and project details, important points will likely be lost or forgotten, and the team will spend considerable time trying to reconstruct previous work. Timely and detailed documentation not only serves the research team in the future, but also ensures high-quality data in the present. Imagine, for example, that a [[Impact Evaluation Team#Field Coordinator | field coordinator]] notices that some respondents don’t understand a test because they have reading difficulties. If the field coordinator doesn’t document this issue, the [[Impact Evaluation Team#Research Assistant | research assistant]] will not know to flag these observations during [[Data Cleaning | data cleaning]]. And if the [[Impact Evaluation Team#Research Assistant | research assistant]] doesn't document why the observations were flagged and what the flag means, the analyst may not deal with these observations correctly during [[Data Analysis | analysis]].
  
=== Tracking respondents ===
+
== What to Document ==
* Total number of respondents listed
+
The following list highlights key pieces of information to document. As all projects differ in design and circumstance, this list is by no means exhaustive, but rather a starting point for data documentation.
* Total number of respondents visited
 
* Refusal rates
 
* Total number of respondents in final sample
 
  
=== Issues on the field ===
+
*Sampling: sample selection, replacement criteria
Report any problems that occurred during the administration of the survey (strikes, inclement weather, inability to enter parts of the country)
+
*Field work dates
 +
*Tracking respondents: total number of respondents listed, total number of respondents visited, refusal rates, total number of respondents in final sample
 +
*Issues in the field: report any problems that occurred during the administration of the survey (strikes, inclement weather, inability to enter parts of the country)
 +
*Data Cleaning Documentation: outliers, inconsistencies, survey codes and missing values
 +
* Variables Construction Documentation: sampling, weights and expansion factors, outliers, inconsistencies, variables definition, references
 +
* Datasets Documentation: dataset creation, linking data sets
  
== Data Cleaning Documentation ==
+
==How to Document==
=== Outliers ===
 
=== Inconsistencies ===
 
=== Survey Codes and Missing values ===
 
  
== Variables Construction Documentation ==
+
===Methods===
=== Sampling ===
 
=== Weights and expansion factors===
 
=== Outliers ===
 
=== Inconsistencies ===
 
=== Variables definition ===
 
=== References ===  
 
  
== Datasets Documentation ==  
+
Research teams often document data and project work through emails and code comments. These methods are functional, but not sufficient for [[Reproducible Research | reproducible research]]. Reporting issues to the research team via email is convenient in the short run, but it makes finding answers later burdensome, time-consuming, and unreliable. Likewise, comments in the code that document data cleaning, data analysis and variable construction help anyone reading the code carefully, but they should also be recorded elsewhere so that answering a specific question does not require reading through every script. It is therefore advisable to consolidate all data work documentation in one file or folder. How this file or folder is structured and how, when, and by whom it is updated will vary from one project to the other.
=== Dataset creation ===  
+
 
=== Linking data sets ===  
+
===Platforms===
 +
 
 +
There are various software solutions for building data documentation over time. [https://osf.io/ The Open Science Framework] provides one such solution, with integrated file storage, version histories, and collaborative Wiki pages. [[Getting started with GitHub | GitHub]] provides a transparent task management platform in addition to version histories and Wiki pages, but is less effective for file storage. The exact shape of this process should be agreed on by team members prior to project launch.
 +
 
 +
===Review===
 +
 
 +
When research teams submit code for code review and deposit data on the [[Microdata Catalog | microdata catalog]], the data work documentation will be reviewed. However, bear in mind that a positive review does not guarantee that everything that should be documented is, in fact, documented: reviewers cannot ask about issues unknown to them.
 +
 
 +
==Back to Parent==
 +
This page is part of the topics [[Data Cleaning]] and [[Data Management]].
  
 
== Additional Resources ==
 
 
+
*Gentzkow and Shapiro’s [https://web.stanford.edu/~gentzkow/research/CodeAndData.pdf Code and Data]
  
 
[[Category: Data Cleaning]][[Category: Data Management]]
 

Revision as of 15:20, 17 May 2019

Data documentation is the process of recording any aspect of project design, sampling, data collection, cleaning and analysis that may affect results. It is not a one-time requirement or a retrospective task, but an active and ongoing process throughout the course of a research project. Documentation at every step of a project is critical to reproducible research. This page outlines why to document, what to document, and how to document it.

Read First

  • Project documentation ensures reproducible research and facilitates transfer of knowledge between team members over time.
  • Project documentation is not a one-time requirement or retrospective task, but rather an active and ongoing process throughout the course of a research project.
  • Make sure to document details regarding sampling, field work dates, respondent tracking, issues in the field, data cleaning, and datasets.
  • Consolidate all data work documentation in one file or folder.

Why Document

Project documentation ensures reproducible research and facilitates transfer of knowledge between team members, both now and in the future. Impact evaluations often span several years and are executed by large teams. If a research team does not promptly document data, fieldwork, and project details, important points will likely be lost or forgotten, and the team will spend considerable time trying to reconstruct previous work. Timely and detailed documentation not only serves the research team in the future, but also ensures high-quality data in the present. Imagine, for example, that a field coordinator notices that some respondents don’t understand a test because they have reading difficulties. If the field coordinator doesn’t document this issue, the research assistant will not know to flag these observations during data cleaning. And if the research assistant doesn't document why the observations were flagged and what the flag means, the analyst may not deal with these observations correctly during analysis.
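
The reading-difficulty example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function `flag_observations`, the flag name `reading_difficulty`, the respondent IDs, and the scores are all hypothetical and do not come from any project. The point it shows is that the flag and its meaning are recorded together, so the analyst does not have to guess what the flag stands for.

```python
# Hypothetical illustration: none of these names, IDs, or scores come from a real project.

def flag_observations(records, affected_ids, flag_name):
    """Return copies of the records with a 0/1 flag added for the affected IDs."""
    affected = set(affected_ids)
    return [
        {**record, flag_name: int(record["respondent_id"] in affected)}
        for record in records
    ]

# Issue noted by the field coordinator during field work (assumed content).
field_note = {
    "issue": "Respondent had reading difficulties and did not understand the test",
    "respondent_ids": [102, 105],
}

records = [
    {"respondent_id": 101, "test_score": 14},
    {"respondent_id": 102, "test_score": 3},
    {"respondent_id": 105, "test_score": 5},
]

# The research assistant flags the affected observations during data cleaning.
flagged = flag_observations(records, field_note["respondent_ids"], "reading_difficulty")

# The meaning of the flag is stored next to the data for the analyst.
flag_documentation = {
    "reading_difficulty": "1 = " + field_note["issue"] + "; 0 = no issue reported",
}
```

The original records are left unchanged, so the raw data and the flagged data can be compared later if questions arise.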

What to Document

The following list highlights key pieces of information to document. As all projects differ in design and circumstance, this list is by no means exhaustive, but rather a starting point for data documentation.

  • Sampling: sample selection, replacement criteria
  • Field work dates
  • Tracking respondents: total number of respondents listed, total number of respondents visited, refusal rates, total number of respondents in final sample
  • Issues in the field: report any problems that occurred during the administration of the survey (strikes, inclement weather, inability to enter parts of the country)
  • Data Cleaning Documentation: outliers, inconsistencies, survey codes and missing values
  • Variables Construction Documentation: sampling, weights and expansion factors, outliers, inconsistencies, variables definition, references
  • Datasets Documentation: dataset creation, linking data sets
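
The respondent tracking figures in the list above lend themselves to a small, checkable summary. The sketch below is a hypothetical example in Python: the function name `tracking_summary`, the counts, and in particular the definition of the refusal rate (refusals divided by respondents visited) are assumptions, and a project may define the rate differently.

```python
def tracking_summary(listed, visited, refused, final_sample):
    """Collect respondent tracking counts and derive the refusal rate."""
    counts = {
        "listed": listed,
        "visited": visited,
        "refused": refused,
        "final_sample": final_sample,
    }
    if any(value < 0 for value in counts.values()):
        raise ValueError("counts cannot be negative")
    if refused > visited:
        raise ValueError("refusals cannot exceed respondents visited")
    # Assumed definition: share of visited respondents who refused.
    refusal_rate = refused / visited if visited else None
    return {**counts, "refusal_rate": refusal_rate}

# Made-up counts for illustration only.
summary = tracking_summary(listed=500, visited=400, refused=20, final_sample=380)
```

Storing the raw counts together with the derived rate lets a reader verify the rate and see which definition was used.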

How to Document

Methods

Research teams often document data and project work through emails and code comments. These methods are functional, but not sufficient for reproducible research. Reporting issues to the research team via email is convenient in the short run, but it makes finding answers later burdensome, time-consuming, and unreliable. Likewise, comments in the code that document data cleaning, data analysis and variable construction help anyone reading the code carefully, but they should also be recorded elsewhere so that answering a specific question does not require reading through every script. It is therefore advisable to consolidate all data work documentation in one file or folder. How this file or folder is structured and how, when, and by whom it is updated will vary from one project to the other.
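
One possible way to keep documentation in a single file is an append-only log that every team member writes to. The Python sketch below assumes a JSON Lines file and the hypothetical helpers `log_entry` and `read_log`; the field names (`date`, `stage`, `author`, `note`) are illustrative choices, not a prescribed structure.

```python
import json
from datetime import date
from pathlib import Path

def log_entry(log_path, stage, author, note, entry_date=None):
    """Append one documentation entry to a JSON Lines log and return it."""
    entry = {
        "date": (entry_date or date.today()).isoformat(),
        "stage": stage,
        "author": author,
        "note": note,
    }
    path = Path(log_path)
    # Create the documentation folder on first use.
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(entry) + "\n")
    return entry

def read_log(log_path, stage=None):
    """Read all entries, optionally keeping only those for one stage."""
    path = Path(log_path)
    if not path.exists():
        return []
    with path.open(encoding="utf-8") as handle:
        entries = [json.loads(line) for line in handle if line.strip()]
    return [entry for entry in entries if stage is None or entry["stage"] == stage]
```

A call such as `log_entry("documentation/log.jsonl", "field work", "field coordinator", "Strike delayed interviews")` adds one line to the file, and `read_log("documentation/log.jsonl", stage="field work")` later returns every field work note without searching through emails or scripts. The path and the note are made up for illustration.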

Platforms

There are various software solutions for building data documentation over time. The Open Science Framework provides one such solution, with integrated file storage, version histories, and collaborative Wiki pages. GitHub provides a transparent task management platform in addition to version histories and Wiki pages, but is less effective for file storage. The exact shape of this process should be agreed on by team members prior to project launch.

Review

When research teams submit code for code review and deposit data on the microdata catalog, the data work documentation will be reviewed. However, bear in mind that a positive review does not guarantee that everything that should be documented is, in fact, documented: reviewers cannot ask about issues unknown to them.

Back to Parent

This page is part of the topics Data Cleaning and Data Management.

Additional Resources