Difference between revisions of "Data Documentation"

 
(7 intermediate revisions by 3 users not shown)
Documenting any aspects of the data work that may affect the analysis is a crucial part of dealing with data. Impact evaluation projects often take years to be completed and are executed by large teams. If the data work is not documented while it is ongoing, it is likely that many details will be lost and a considerable amount of time spent trying to understand what was previously done. For example, say it became clear during the field work that some respondents didn't understand a test that was applied because they had reading difficulties. If the [[Impact Evaluation Team#Field Coordinator | field coordinator]] didn't document this issue, the [[Impact Evaluation Team#Research Assistant | research assistant]] will not know to flag them during [[Data Cleaning | data cleaning]]. And if the [[Impact Evaluation Team#Research Assistant | research assistant]] doesn't document why the observations were flagged and what the flag means, they may not be correctly dealt with during [[Data Analysis | analysis]].
Data documentation is the process of recording any aspect of project design, [[Sampling & Power Calculations | sampling]], [[Primary Data Collection | data collection]], [[Data Cleaning | cleaning]] and [[Data Analysis | analysis]] that may affect results. Data documentation is not a one-time requirement or retrospective task, but rather an active and ongoing process throughout the course of a research project. At every step of a research project, documentation is critical to [[Reproducible Research | reproducible research]]. This page will outline why, what, and how to document.


There are different ways to document data work. One widespread practice is to send e-mails reporting issues to the team. Though this is easily done, it is time-consuming to find answers later in the project, and it relies on someone in the team remembering that an e-mail was sent. For data cleaning, data analysis and variable construction, it is best practice to document the data work through comments in the code. However, even though this is very helpful for someone reading the code carefully, if these comments are not documented elsewhere, it may also take a long time to go through all the do-files and find the answer to a specific question. It's usually advisable to have all data work documentation in one file or folder, though how it is structured and when, how and by whom it is updated will vary from one project to the other. One advantage of submitting code for [[Code Review| code review]] and depositing data on the [[Microdata Catalog | microdata catalog]] is that in both cases the data work documentation will be reviewed, though this does not guarantee that everything that should be documented is in fact documented, as reviewers cannot ask about issues unknown to them.
==Read First==
*Project documentation ensures [[Reproducible Research | reproducible research]] and facilitates transfer of knowledge between team members over time.
* Project documentation is not a one-time requirement or retrospective task, but rather an active and ongoing process throughout the course of a research project.
* Make sure to document details regarding sampling, field work dates, respondent tracking, issues in the field, data cleaning, and datasets.
*Consolidate all data work documentation in one file or folder.


== Read first ==
==Why Document==


== Field Work Documentation ==
Project documentation ensures [[Reproducible Research | reproducible research]] and facilitates transfer of knowledge between team members, both now and in the future. Impact evaluations often span years and are executed by large teams. If a research team does not document data, fieldwork, and project details promptly, important points will likely be lost or forgotten, and the team will spend considerable time trying to reconstruct previous work. Timely and detailed documentation not only serves the research team in the future, but also ensures high-quality data in the present. Imagine, for example, that a [[Impact Evaluation Team#Field Coordinator | field coordinator]] notices that some respondents don’t understand a test because they have reading difficulties. If the field coordinator doesn’t document this issue, the [[Impact Evaluation Team#Research Assistant | research assistant]] will not know to flag these observations during [[Data Cleaning | data cleaning]]. And if the [[Impact Evaluation Team#Research Assistant | research assistant]] doesn't document why the observations were flagged and what the flag means, the analyst may not correctly deal with these observations during [[Data Analysis | analysis]].
=== Sampling ===
* Sample selection
* Replacement criteria


=== Field work dates ===  
== What to Document ==
The following list highlights key pieces of information to document. As all projects differ in design and circumstance, this list is by no means exhaustive, but rather a starting point for data documentation.


=== Tracking respondents ===
*Sampling: sample selection, replacement criteria
* Total number of respondents listed
*Field work dates
* Total number of respondents visited
*Tracking respondents: total number of respondents listed, total number of respondents visited, refusal rates, total number of respondents in final sample
* Refusal rates
*Issues in the field: report any problems that occurred during the administration of the survey (e.g. strikes, inclement weather, inability to enter parts of the country)
* Total number of respondents in final sample
*Data Cleaning Documentation: outliers, inconsistencies, survey codes and missing values
* Variables Construction Documentation: sampling, weights and expansion factors, outliers, inconsistencies, variables definition, references
* Datasets Documentation: dataset creation, linking data sets


=== Issues on the field ===
==How to Document==
Report any problems that occurred during the administration of the survey (strikes, inclement weather, inability to enter parts of the country)


== Data Cleaning Documentation ==
===Methods===
=== Outliers ===
=== Inconsistencies ===
=== Survey Codes and Missing values ===


== Variables Construction Documentation ==
Research teams often document data and project work through emails and code. These methods are functional, but not sufficient for [[Reproducible Research | reproducible research]]. Reporting issues to the research team via email is convenient in the short run, but it makes finding answers later burdensome, time-consuming, and unreliable. Likewise, documenting data cleaning, data analysis and variable construction through comments in the code is helpful for anyone reading the code carefully, but these comments should also be recorded elsewhere so that answers can be found without searching every script. It is therefore advisable to consolidate all data work documentation in one file or folder. How this file or folder is structured and how, when, and by whom it is updated will vary from one project to the other.
=== Sampling ===
=== Weights and expansion factors===
=== Outliers ===
=== Inconsistencies ===
=== Variables definition ===
=== References ===


== Datasets Documentation ==
===Platforms===
=== Dataset creation ===
=== Linking data sets ===  


== Additional Resources ==
There are various software solutions for building data documentation over time. [https://osf.io/ The Open Science Framework] provides one such solution, with integrated file storage, version histories, and collaborative Wiki pages. [[Getting started with GitHub | GitHub]] provides a transparent task management platform in addition to version histories and Wiki pages, but is less effective for file storage. The exact shape of this process should be agreed on by team members prior to project launch.
 
===Review===
 
When research teams submit code for code review and deposit data on the [[Microdata Catalog | microdata catalog]], the data work documentation will be reviewed. However, bear in mind that a positive review does not guarantee that everything that should be documented is, in fact, documented: reviewers cannot ask about issues unknown to them.


== Related Pages ==
[[Special:WhatLinksHere/Data_Documentation|Click here for pages that link to this topic.]]


[[Category: Data Cleaning]][[Category: Data Management]]
== Additional Resources ==
* Crystal Lewis (MPSI), [https://cghlewis.github.io/mpsi-training1/#1 Data Management in Large-Scale Education Research]
* Gentzkow and Shapiro (Chicago Booth and NBER), [https://web.stanford.edu/~gentzkow/research/CodeAndData.pdf Code and Data for the Social Sciences: A Practitioner’s Guide]
* International Household Survey Network, [https://guide-for-data-archivists.readthedocs.io/en/latest/ Quick Reference Guide for Data Archivists]
[[Category: Reproducible Research]]
[[Category: Data Management]]

Latest revision as of 13:12, 31 March 2022

Data documentation is the process of recording any aspect of project design, sampling, data collection, cleaning and analysis that may affect results. Data documentation is not a one-time requirement or retrospective task, but rather an active and ongoing process throughout the course of a research project. At every step of a research project, documentation is critical to reproducible research. This page will outline why, what, and how to document.

Read First

  • Project documentation ensures reproducible research and facilitates transfer of knowledge between team members over time.
  • Project documentation is not a one-time requirement or retrospective task, but rather an active and ongoing process throughout the course of a research project.
  • Make sure to document details regarding sampling, field work dates, respondent tracking, issues in the field, data cleaning, and datasets.
  • Consolidate all data work documentation in one file or folder.

Why Document

Project documentation ensures reproducible research and facilitates transfer of knowledge between team members, both now and in the future. Impact evaluations often span years and are executed by large teams. If a research team does not document data, fieldwork, and project details promptly, important points will likely be lost or forgotten, and the team will spend considerable time trying to reconstruct previous work. Timely and detailed documentation not only serves the research team in the future, but also ensures high-quality data in the present. Imagine, for example, that a field coordinator notices that some respondents don’t understand a test because they have reading difficulties. If the field coordinator doesn’t document this issue, the research assistant will not know to flag these observations during data cleaning. And if the research assistant doesn't document why the observations were flagged and what the flag means, the analyst may not correctly deal with these observations during analysis.

What to Document

The following list highlights key pieces of information to document. As all projects differ in design and circumstance, this list is by no means exhaustive, but rather a starting point for data documentation.

  • Sampling: sample selection, replacement criteria
  • Field work dates
  • Tracking respondents: total number of respondents listed, total number of respondents visited, refusal rates, total number of respondents in final sample
  • Issues in the field: report any problems that occurred during the administration of the survey (e.g. strikes, inclement weather, inability to enter parts of the country)
  • Data Cleaning Documentation: outliers, inconsistencies, survey codes and missing values
  • Variables Construction Documentation: sampling, weights and expansion factors, outliers, inconsistencies, variables definition, references
  • Datasets Documentation: dataset creation, linking data sets
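
As an illustration of the respondent-tracking item in the list above, the four counts could be stored in one small record alongside a derived refusal rate. This is a minimal sketch, not a prescribed format: the class name, field names, and numbers are hypothetical, and defining the refusal rate as refusals divided by visited respondents is an assumption that each project should state explicitly in its own documentation.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class TrackingSummary:
    """Respondent tracking counts for one survey round (hypothetical layout)."""

    listed: int        # total number of respondents listed
    visited: int       # total number of respondents visited
    refused: int       # visited respondents who declined the interview
    final_sample: int  # total number of respondents in the final sample

    def refusal_rate(self) -> float:
        """Refusals as a share of visited respondents (assumed definition)."""
        if self.visited <= 0:
            raise ValueError("visited must be positive to compute a rate")
        return self.refused / self.visited

    def to_record(self) -> dict:
        """Return the counts plus the derived rate as a plain dictionary."""
        record = asdict(self)
        record["refusal_rate"] = round(self.refusal_rate(), 4)
        return record


# Hypothetical counts, for illustration only.
summary = TrackingSummary(listed=1200, visited=1000, refused=50, final_sample=940)
print(summary.to_record())
```

Keeping the raw counts next to the derived rate lets a later reader check how the rate was computed rather than having to trust a single reported number.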

How to Document

Methods

Research teams often document data and project work through emails and code. These methods are functional, but not sufficient for reproducible research. Reporting issues to the research team via email is convenient in the short run, but it makes finding answers later burdensome, time-consuming, and unreliable. Likewise, documenting data cleaning, data analysis and variable construction through comments in the code is helpful for anyone reading the code carefully, but these comments should also be recorded elsewhere so that answers can be found without searching every script. It is therefore advisable to consolidate all data work documentation in one file or folder. How this file or folder is structured and how, when, and by whom it is updated will vary from one project to the other.
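
One possible way to consolidate documentation in a single file is an append-only log that every team member writes to. The sketch below is only an example under stated assumptions: the JSON Lines layout, the function names `log_entry` and `read_log`, and the field names are all hypothetical choices, not a format required by any tool mentioned on this page.

```python
import json
from datetime import date
from pathlib import Path


def log_entry(log_path, stage, author, note, affected_ids=(), when=None):
    """Append one documentation entry to a JSON Lines file and return it."""
    entry = {
        "date": (when or date.today()).isoformat(),
        "stage": stage,                       # e.g. "field work", "data cleaning"
        "author": author,                     # who recorded the issue
        "note": note,                         # what happened and why it matters
        "affected_ids": list(affected_ids),   # respondent IDs, if any
    }
    path = Path(log_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    # Append mode keeps earlier entries intact, so the log grows over time.
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry, ensure_ascii=False) + "\n")
    return entry


def read_log(log_path, stage=None):
    """Return all logged entries, optionally filtered by project stage."""
    path = Path(log_path)
    if not path.exists():
        return []
    with path.open(encoding="utf-8") as fh:
        entries = [json.loads(line) for line in fh if line.strip()]
    if stage is not None:
        entries = [e for e in entries if e["stage"] == stage]
    return entries
```

For instance, a field coordinator might call `log_entry("documentation/log.jsonl", "field work", "field coordinator", "Respondents had reading difficulties with the test", ["R017"])`, and a research assistant could later retrieve it with `read_log("documentation/log.jsonl", stage="field work")`. The path and respondent ID here are made up for the example.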

Platforms

There are various software solutions for building data documentation over time. The Open Science Framework provides one such solution, with integrated file storage, version histories, and collaborative Wiki pages. GitHub provides a transparent task management platform in addition to version histories and Wiki pages, but is less effective for file storage. The exact shape of this process should be agreed on by team members prior to project launch.

Review

When research teams submit code for code review and deposit data on the microdata catalog, the data work documentation will be reviewed. However, bear in mind that a positive review does not guarantee that everything that should be documented is, in fact, documented: reviewers cannot ask about issues unknown to them.

Related Pages

Click here for pages that link to this topic.

Additional Resources