Difference between revisions of "Data Management"

Revision as of 16:06, 8 February 2017

Due to the long life span of a typical impact evaluation, multiple generations of team members often contribute to the same data work. Clear methods for organizing the data folder, structuring the data sets in that folder, and identifying the observations in those data sets are critical.

Read First

  • A data set should always have one uniquely identifying variable. If you receive a data set without an ID, the first thing you need to do is create individual IDs.
  • Always create master data sets for each unit of observation relevant to the analysis.
  • Never merge on variables that are not ID variables unless one of the data sets being merged is the master data set.

Guidelines

Organization of Project folder

Example of a data folder structure used during the course of an impact evaluation project.

A well-organized data folder is essential to a productive workflow for the whole impact evaluation team. It can't be stressed enough that organizing it well is one of the most important steps for the productivity of the project team and for reducing sources of error in the data work.

Most projects have a shared folder, for example on Box or Dropbox. The project folder typically has several subfolders, such as government communications, budget, impact evaluation design, and presentations. There should always be one data folder, and all data-related work on the project should be stored in it. Setting up the data folder for the first time must be done carefully, following clear protocols. The data folder will include a master do-file, which runs all other do-files and also serves as a map for navigating the data folder. The project should also have clear naming conventions.

Master data sets

With multiple rounds of data, you need to ensure there are no discrepancies in how observations are identified across survey rounds. Best practice is to have one data file that gives an overview of the observations, typically called a master data set. We need a master data set for each unit of observation relevant to the analysis (survey respondent, unit of randomization, etc.). Common examples are a household master data set, a village master data set, and a clinic master data set.

These master data sets should include time-invariant information, for example ID variables and dummy variables indicating treatment status, so that this information can easily be merged into other data sets. We want this data for all observations we encountered, even observations that were not selected during sampling. This is a small but important point; see the master data set article for more details.
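
As an illustration of merging on the ID variable, here is a minimal sketch in Python (standard library only). It is not the project's actual workflow, which would normally live in Stata do-files. The names `merge_on_id`, `hh_id`, `treatment`, `sampled` and `income` are hypothetical, and data sets are represented as simple lists of dictionaries.

```python
def merge_on_id(master, survey, id_var="hh_id"):
    """Attach master-data-set fields to each survey row, matching on the ID only."""
    master_by_id = {row[id_var]: row for row in master}
    if len(master_by_id) != len(master):
        raise ValueError("master data set has duplicate IDs")
    merged = []
    for row in survey:
        if row[id_var] not in master_by_id:
            # Every observation in a survey round should already be in the master.
            raise KeyError(f"ID {row[id_var]!r} is not in the master data set")
        # Master fields first, then the survey fields for the same ID.
        merged.append({**master_by_id[row[id_var]], **row})
    return merged


# Hypothetical household master data set: one row per household ever
# encountered, including household 103, which was not sampled.
household_master = [
    {"hh_id": 101, "treatment": 1, "sampled": 1},
    {"hh_id": 102, "treatment": 0, "sampled": 1},
    {"hh_id": 103, "treatment": 0, "sampled": 0},
]

# Hypothetical baseline survey round covering the sampled households.
baseline = [
    {"hh_id": 101, "income": 250},
    {"hh_id": 102, "income": 310},
]

baseline_merged = merge_on_id(household_master, baseline)
```

Raising an error for an ID that is missing from the master data set is a design choice in this sketch: it surfaces identification discrepancies between rounds immediately instead of silently dropping rows.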

ID Variables

All observations in all data sets should always be identified by a variable that fulfills the properties of an ID variable. The main property of an ID variable is that it should be uniquely and fully identifying: no two observations share a value, and no observation is missing one. In almost all cases the ID variable should be a single variable. This matters most when identifying observations across data sets in the project folder. A common exception is panel data, where each observation is usually identified by the primary ID variable together with a time variable (year one, year two, etc.).
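
The "uniquely and fully identifying" property can be checked mechanically. The sketch below is a hedged Python illustration using only the standard library. The function name `check_id` and the variables `hh_id` and `year` are hypothetical, and "fully identifying" is interpreted here as "no missing ID values".

```python
from collections import Counter


def check_id(rows, id_vars):
    """Return a list of problems; an empty list means the ID variables
    uniquely and fully identify every row."""
    problems = []
    keys = []
    for i, row in enumerate(rows):
        key = tuple(row.get(v) for v in id_vars)
        if any(part is None or part == "" for part in key):
            problems.append(f"row {i}: missing ID value")
        else:
            keys.append(key)
    # Any key seen more than once breaks unique identification.
    for key, n in Counter(keys).items():
        if n > 1:
            problems.append(f"ID {key} appears {n} times")
    return problems


# Hypothetical panel data: each household appears once per year.
panel = [
    {"hh_id": 101, "year": 1},
    {"hh_id": 101, "year": 2},
    {"hh_id": 102, "year": 1},
    {"hh_id": 102, "year": 2},
]

single_id_problems = check_id(panel, ["hh_id"])         # duplicates reported
panel_id_problems = check_id(panel, ["hh_id", "year"])  # no problems
```

The example mirrors the panel exception described above: the household ID alone repeats across years, while the ID combined with the time variable identifies each row.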

As soon as we have created the ID variable in the master data set and merged it into another data set, we can de-identify that data set. That is, in every data set to which the ID has been copied, we can drop all identifying information that is not needed for the analysis, and the sooner we do so the better.
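
The de-identification step can be sketched as follows. This is a minimal Python illustration under stated assumptions, not a prescribed procedure: the function name `deidentify` and the variables `hh_id`, `name`, `phone` and `income` are hypothetical, and deciding which variables count as identifying is left to the project team.

```python
def deidentify(rows, identifying_vars):
    """Return copies of the rows with the identifying variables removed."""
    drop = set(identifying_vars)
    return [{k: v for k, v in row.items() if k not in drop} for row in rows]


# Hypothetical survey data that already carries the ID from the master data set.
survey = [
    {"hh_id": 101, "name": "A. Example", "phone": "555-0100", "income": 250},
    {"hh_id": 102, "name": "B. Example", "phone": "555-0101", "income": 310},
]

# Keep the ID and the analysis variables; drop names and phone numbers.
survey_deidentified = deidentify(survey, ["name", "phone"])
```

Because the ID stays in the data set, the dropped information can still be linked back through the master data set by anyone with access to it.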
