Difference between revisions of "Data Management"

Jump to: navigation, search
Line 8: Line 8:


==Organization of Project folder==
==Organization of Project folder==
[[File:Datawork.png |thumb|300px|Example of a data folder structure used during the course of an impact evaluation project.]]
[[File:Datawork.png |thumb|Example of a data folder structure used during the course of an impact evaluation project.]]


A well-organized [[DataWork Folder|data-folder]] is essential to productive workflows for the whole [[Impact Evaluation Team]]. It can't be stressed enough that this is one of the most important steps for the productivity of this project team and for reducing the sources of error in the data work.
A well-organized [[DataWork Folder|data-folder]] is essential to productive workflows for the whole [[Impact Evaluation Team]]. It can't be stressed enough that this is one of the most important steps for the productivity of this project team and for reducing the sources of error in the data work.

Revision as of 13:56, 16 January 2018

Due to the long life span of a typical impact evaluation, multiple generations of team members often contribute to the same data work. Clear methods for organization of the data folder, the structure of the data sets in the folder, and identification of the observations in the data sets is critical.

Read First

  • The data folder structure suggested here can easily be set up with the command iefolder in the package ietoolkit
  • A dataset should always have one uniquely identifying variable. If you receive a data set without an ID, the first thing you need to do is to create individual IDs.
  • Always create master data sets for each unit of observations relevant to the analysis.
  • Never merge on variables that are not ID variables unless one of the data sets merged is the Master data.

Organization of Project folder

Example of a data folder structure used during the course of an impact evaluation project.

A well-organized data-folder is essential to productive workflows for the whole Impact Evaluation Team. It can't be stressed enough that this is one of the most important steps for the productivity of this project team and for reducing the sources of error in the data work.

DIME Folder Standard

At DIME we have a standardized folder template for how to organize our data work folders. Some projects have unique folder requirements but we still recommend all projects to use this template as a starting point. We have published a command called iefolder in the package ietoolkit available through SSC that sets up this folder structure for you. The paragraph below has a short summary of the data folder organization. Click the links for details.

Most projects have a shared folder, for example using Box or DropBox. The project folder typically has several subfolders, including: government communications, budget, impact evaluation design, presentations, etc. But there should also be one folder for all the data work, that we in our standard template call DataWork. All data-related work on this project should be stored in DataWork.

When setting up the DataWork folder structure for the first time, it must be done carefully and planned well so that it will not cause problem as the project evolve. We have based DIME's folder structure template on best practices developed at DIME that will help avoid those problems. A data folder should also include a master do-file which run all other do-files, and also serve as a map to navigate the data folder. The project should also have clear naming conventions. This might sound more difficult than it is, but this process is easy if you use iefolder.

Master data sets

With multiple rounds of data, you need to ensure there are no discrepancies on how observations are identified across survey rounds. Best practice is to have one datafile that overviews the observations, typically called a master data set. For each unit of observation relevant to the analysis (survey respondent, unit of randomization, etc) we need a master data set. Common master data sets are household master data set, village master data set, clinic master data set etc.

These master data sets should include time-invariant information, for example ID variables and dummy variables indicating treatment status, to easily merge across data sets. We want this data for all observations we encountered, even observations that we did not select during sampling. This is a small but important point, read master data set for more details.

ID Variables

All datasets must be uniquely and fully identified - see properties of an ID variable for details. In almost all cases the ID variable should be a single variable. One common exception is for panel data sets, where each observation is identified by the primary ID variable and a time variable (year one, year two etc.). As soon as a dataset has one numeric, unique, fully-identifying variable, separate out all personal identifier information (PII) from the data set. The PII should be encrypted, and saved separately in a secure location.

Additional Resources