Data Management

Impact evaluations typically run for many years, and several generations of team members contribute to the same data work. Clear methods for organizing the data folder, structuring the data sets in that folder, and identifying the observations in those data sets are therefore critical.

Read First

  • Never work with a data set in which the observations do not have standardized IDs. If you receive a data set without IDs, the first thing to do is create individual IDs.
  • Always create a master data set for each unit of observation relevant to the analysis.

Organization of Project folder

Changes made to the data folder affect the standards that field coordinators, research assistants and economists will work under for years to come. Setting up this folder well is one of the most important steps for the productivity of the project team and for reducing sources of error in the data work.

Most projects keep their files in a synced folder, for example on Box or Dropbox. The project folder might contain several folders, such as government communications, budget, concept note, etc. In addition to those folders, you should have one folder called the data folder. All files related to data work on the project should be stored in this folder. Setting up the folder for the first time is the most important stage in setting the data work standards for the project. The folder should include master do-files, which run all other do-files and also serve as a map for navigating the data work folder. Some folders have specific best practices that should be followed. A well-structured project should also have clear naming conventions.


Setting up your folder for the first time

Example of a data folder structure used during the course of an impact evaluation project.
  • Create a new sub-folder for each round of the survey, for example baseline, midline, follow-up and endline.
  • Each sub-folder should have a data folder that is further divided by type of data set (raw, intermediate, etc.).
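The layout described above can be sketched as a short script. This is only an illustration in Python, not a prescribed tool: the folder names (`data`, `dofiles`, `raw`, `intermediate`, `final`) and the list of rounds are assumptions chosen to mirror the text, and a project may prefer different names.

```python
from pathlib import Path

# Hypothetical names, chosen to mirror the structure described in the text.
ROUNDS = ["baseline", "midline", "followup", "endline"]
DATA_TYPES = ["raw", "intermediate", "final"]


def create_data_folder(project_root):
    """Create one sub-folder per survey round inside <project_root>/data.

    Each round gets a data folder split by type of data set, plus a
    dofiles folder. Existing folders are left untouched.
    """
    root = Path(project_root) / "data"
    created = []
    for survey_round in ROUNDS:
        for data_type in DATA_TYPES:
            folder = root / survey_round / "data" / data_type
            folder.mkdir(parents=True, exist_ok=True)
            created.append(folder)
        dofiles = root / survey_round / "dofiles"
        dofiles.mkdir(parents=True, exist_ok=True)
        created.append(dofiles)
    return created
```

Calling `create_data_folder` on the synced project folder is safe to repeat, because `exist_ok=True` means folders that already exist are not modified.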

Master Do-file

The master do-file is the main do-file, used to call all the other do-files. Running this file should run every file needed, from importing raw data to cleaning, constructing, analysing and outputting results. The file therefore also functions as a map to the data folder. The master do-file should also include all settings that should be the same across the project, for example basic and advanced memory limits, Stata version settings, etc. Globals should also be defined in the master do-file. Common globals to declare there include conversion rates for standardizing units, variable lists (varlists) commonly used across the project, and any assumptions that need to be defined.

Specific rules for different folders

  • Data folder
    • Raw folder - This folder should contain the data sets exactly as you received them, saved as soon as you get them. This includes data downloaded from the internet, data received from data collection, and data received from other projects. Absolutely no changes should be made to the data in this folder. Even simple changes, such as correcting obvious mistakes, renaming variables, or converting from csv to Stata or other formats, should not be made here. File names should also be left as they are; the only exception is when a file name must be changed for the file to be imported, in which case that change can be made in this folder.
    • Intermediate folder - Raw data sets to which simple changes, such as those mentioned above, have been made should be saved in the intermediate folder.
    • Final folder - This folder contains the clean and final constructed data sets.
  • Dofiles
    • Have a master do-file that runs all other do-files needed for the project. This is also your map to the data folder.
    • Organize all other master files in sub-folders.
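The rule that raw data is never edited, with simple changes saved as a new file in the intermediate folder, can be illustrated with a minimal sketch. It is written in Python for illustration only; the function names and the checksum guard are assumptions added here, not part of any project standard.

```python
import csv
import hashlib
from pathlib import Path


def file_checksum(path):
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def raw_to_intermediate(raw_path, intermediate_path, rename):
    """Write a copy of a raw csv file with renamed variables.

    The raw file is only read. `rename` maps old variable names to new
    ones; variables not listed keep their names.
    """
    raw_path = Path(raw_path)
    intermediate_path = Path(intermediate_path)
    before = file_checksum(raw_path)

    with raw_path.open(newline="", encoding="utf-8") as src:
        reader = csv.DictReader(src)
        rows = list(reader)
        old_names = reader.fieldnames or []

    new_names = [rename.get(name, name) for name in old_names]
    intermediate_path.parent.mkdir(parents=True, exist_ok=True)
    with intermediate_path.open("w", newline="", encoding="utf-8") as dst:
        writer = csv.writer(dst)
        writer.writerow(new_names)
        for row in rows:
            writer.writerow([row[name] for name in old_names])

    # Guard: the raw file must be byte-for-byte identical afterwards.
    if file_checksum(raw_path) != before:
        raise RuntimeError(f"raw file was modified: {raw_path}")
    return intermediate_path
```

The checksum comparison is a cheap way to confirm that a cleaning step did not write back to the raw folder by mistake.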

Naming conventions

  • Use the version control in Box/Dropbox instead of naming folders _v01, _v02, old, new, etc.
    • Output tables, graphs and documentation are an exception: it is good practice to date them rather than number them as versions, for clarity. For example, use "_2017June8" rather than "_v02".
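As a small illustration of the dating convention, the sketch below builds a dated file name in the "_2017June8" style. It is written in Python for illustration; the helper name `dated_name` and the example file name are hypothetical.

```python
from datetime import date
from pathlib import Path

# English month names are spelled out so the result does not depend on locale.
MONTHS = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]


def dated_name(filename, when):
    """Insert a suffix such as _2017June8 before the file extension."""
    path = Path(filename)
    stamp = f"_{when.year}{MONTHS[when.month - 1]}{when.day}"
    return f"{path.stem}{stamp}{path.suffix}"
```

For example, `dated_name("balance_table.tex", date(2017, 6, 8))` gives `balance_table_2017June8.tex`.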

Master data sets

For each unit of observation that is relevant to the analysis (respondent in the survey, level of randomization; see article for more details) we need a master data set. Common examples are a household master data set, a village master data set and a clinic master data set.

These master data sets should have information on all observations that we have any data on, not just those that we sampled or included in another way.


Creation of IDs

  • All data sets should have an ID variable that fully and uniquely identifies each observation.
    • Identifying observations with two variables is acceptable but should be avoided in most cases. One case where multiple ID variables are good practice is a panel data set, where one ID is, for example, the household ID and the other variable is time.
  • Once IDs have been created in the master data sets and copied to all other data sets, we can drop from those other data sets all identifying information that is not needed for the analysis.
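The two ID rules above can be sketched in a few lines: first confirm that the ID variable(s) fully and uniquely identify every observation, then drop identifying information from data sets other than the master. This is an illustrative Python sketch, loosely comparable to what Stata's `isid` command checks; the function names and the variable names in the usage note (`hh_id`, `round`, `name`, `phone`) are hypothetical.

```python
def check_unique_id(rows, id_vars):
    """Raise ValueError unless id_vars fully and uniquely identify every row.

    `rows` is a list of dicts, one per observation. Pass one variable for
    most data sets, or two (for example household ID and time) for a panel.
    """
    seen = set()
    for row in rows:
        key = tuple(row.get(var) for var in id_vars)
        if any(part is None or part == "" for part in key):
            raise ValueError(f"missing ID value in observation: {row}")
        if key in seen:
            raise ValueError(f"duplicate ID: {key}")
        seen.add(key)


def deidentify(rows, id_vars, identifying_vars):
    """Return copies of the rows without identifying variables.

    The ID variables are always kept, and the IDs are checked first so
    that the data set can still be linked back to the master data set.
    """
    check_unique_id(rows, id_vars)
    drop = set(identifying_vars) - set(id_vars)
    return [{k: v for k, v in row.items() if k not in drop} for row in rows]
```

With a household survey data set, one might call `deidentify(rows, ["hh_id"], ["name", "phone"])`; for a panel, `check_unique_id(rows, ["hh_id", "round"])` checks the pair of variables jointly.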
