Data documentation is the process of recording any aspect of project design, sampling, data collection, cleaning, and analysis that may affect results. Data documentation is not a one-time requirement or retrospective task, but rather an active and ongoing process throughout the course of a research project. Documentation at every step of a research project is critical to reproducible research. This page outlines why, what, and how to document.
- Project documentation ensures reproducible research and facilitates transfer of knowledge between team members over time.
- Project documentation is not a one-time requirement or retrospective task, but rather an active and ongoing process throughout the course of a research project.
- Make sure to document details regarding sampling, field work dates, respondent tracking, issues in the field, data cleaning, and datasets.
- Consolidate all data work documentation in one file or folder.
Project documentation ensures reproducible research and facilitates transfer of knowledge between team members, both now and in the future. Impact evaluations often span several years and are executed by large teams. If a research team does not document data, fieldwork, and project details promptly, important points will likely be lost or forgotten, and the team will later spend considerable time trying to understand previous work. Timely and detailed documentation not only serves the research team in the future, but also ensures high-quality data in the present. Imagine, for example, that a field coordinator notices that some respondents don’t understand a test because they have reading difficulties. If the field coordinator doesn’t document this issue, the research assistant will not know to flag these observations during data cleaning. And if the research assistant doesn’t document why the observations were flagged and what the flag means, the analyst may not deal with these observations correctly during analysis.
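The reading-difficulty example above can be sketched in code. This is a minimal illustration, not a prescribed workflow: the field names (`respondent_id`, `test_score`, `reading_flag`), the flag codes, and the sample records are all hypothetical. The point is that the flag travels with a written definition, so the analyst does not have to guess what it means.

```python
# Hypothetical flag codes: the values and wording are illustrative assumptions.
FLAG_CODEBOOK = {
    0: "No issue reported",
    1: "Respondent had reading difficulties during the test "
       "(reported by the field coordinator)",
}


def flag_observations(records, flagged_ids, flag_value=1):
    """Return copies of the records with a 'reading_flag' field added.

    Records whose respondent_id appears in flagged_ids get flag_value;
    all other records get 0. The input records are left unchanged.
    """
    flagged = set(flagged_ids)
    return [
        {**rec, "reading_flag": flag_value if rec["respondent_id"] in flagged else 0}
        for rec in records
    ]


# Made-up records, used only to show the mechanics.
records = [
    {"respondent_id": "R001", "test_score": 12},
    {"respondent_id": "R002", "test_score": 3},
    {"respondent_id": "R003", "test_score": 9},
]

# IDs taken from the field coordinator's notes (hypothetical here).
cleaned = flag_observations(records, flagged_ids=["R002"])

for rec in cleaned:
    # Print each flag next to its documented meaning.
    print(rec["respondent_id"], rec["reading_flag"], FLAG_CODEBOOK[rec["reading_flag"]])
```

Keeping the codebook next to the code that sets the flag is one option; the same definitions should also appear in the project's consolidated documentation.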
What to Document
The following list highlights key pieces of information to document. As all projects differ in design and circumstance, this list is by no means exhaustive, but rather a starting point for data documentation.
- Sampling: sample selection, replacement criteria
- Field work dates
- Tracking respondents: total number of respondents listed, total number of respondents visited, refusal rates, total number of respondents in final sample
- Issues in the field: report any problems that occurred during the administration of the survey (e.g. strikes, inclement weather, inability to enter parts of the country)
- Data Cleaning Documentation: outliers, inconsistencies, survey codes and missing values
- Variable Construction Documentation: sampling, weights and expansion factors, outliers, inconsistencies, variable definitions, references
- Datasets Documentation: dataset creation, linking datasets
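As an illustration of the respondent-tracking item in the list above, the sketch below records the tracking counts in one place and derives a refusal rate from them. It assumes the refusal rate is defined as refusals divided by respondents visited; projects may define it differently, and the definition used should itself be documented. The function name and all numbers are hypothetical.

```python
def tracking_summary(listed, visited, refused, final_sample):
    """Collect respondent-tracking counts and compute the refusal rate.

    Assumption: refusal rate = refused / visited.
    """
    if visited <= 0:
        raise ValueError("visited must be a positive number")
    if not (0 <= refused <= visited <= listed):
        raise ValueError("expected 0 <= refused <= visited <= listed")
    return {
        "listed": listed,
        "visited": visited,
        "refused": refused,
        "final_sample": final_sample,
        "refusal_rate": refused / visited,
    }


# Made-up counts, used only to show the calculation.
summary = tracking_summary(listed=1200, visited=1000, refused=50, final_sample=930)
print(f"Refusal rate: {summary['refusal_rate']:.1%}")
```

The consistency checks are deliberately minimal; they only catch counts that cannot be right, such as more refusals than visits.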
How to Document
Research teams often document data and project work through emails and code comments. These methods are functional, but not sufficient for reproducible research. Reporting issues to the research team via email is convenient in the short run, but it makes finding answers later burdensome, time-consuming, and unreliable. Likewise, comments in code that document data cleaning, variable construction, and data analysis are helpful and important for anyone reading the code carefully, but the same information should also be recorded elsewhere so that it is easy to find and organize. It is therefore advisable to consolidate all data work documentation in one file or folder. How this file or folder is structured and how, when, and by whom it is updated will vary from one project to another.
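One possible way to keep documentation in a single file is an append-only log with one dated entry per line. The sketch below is only an illustration under assumptions: the JSON Lines layout, the entry fields (`date`, `topic`, `author`, `note`), and the function names are hypothetical choices, not a format this page prescribes.

```python
import json
from datetime import date
from pathlib import Path


def log_entry(log_path, topic, author, note, entry_date=None):
    """Append one documentation entry to the log file and return it.

    Each entry is written as a single JSON object on its own line.
    If entry_date is not given, today's date is used.
    """
    entry = {
        "date": (entry_date or date.today()).isoformat(),
        "topic": topic,
        "author": author,
        "note": note,
    }
    path = Path(log_path)
    # Create the documentation folder if it does not exist yet.
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(entry, ensure_ascii=False) + "\n")
    return entry


def read_log(log_path, topic=None):
    """Read all entries from the log, optionally keeping only one topic."""
    entries = []
    with Path(log_path).open(encoding="utf-8") as f:
        for line in f:
            if line.strip():  # skip blank lines
                entries.append(json.loads(line))
    if topic is not None:
        entries = [e for e in entries if e["topic"] == topic]
    return entries
```

A field coordinator could then call `log_entry` with a topic such as "field issues", and a research assistant could later call `read_log` with the same topic to retrieve every related note instead of searching through emails.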
There are various software solutions for building data documentation over time. The Open Science Framework provides one such solution, with integrated file storage, version histories, and collaborative Wiki pages. GitHub provides a transparent task management platform in addition to version histories and Wiki pages, but is less effective for file storage. The exact shape of this process should be agreed on by team members prior to project launch.
When research teams submit code for code review and deposit data in the microdata catalog, the data work documentation will be reviewed. However, bear in mind that a positive review does not guarantee that everything that should be documented is, in fact, documented: reviewers cannot ask about issues they do not know about.
- Crystal Lewis (MPSI), Data Management in Large-Scale Education Research
- Gentzkow and Shapiro (Chicago Booth and NBER), Code and Data for the Social Sciences: A Practitioner’s Guide
- International Household Survey Network, Quick Reference Guide for Data Archivists