Difference between revisions of "Data Management"

Revision as of 17:37, 25 January 2017

Because a typical impact evaluation has a long life span, with multiple generations of team members contributing to the same data work, clear methods for organizing the data folder, structuring the data sets in that folder, and identifying the observations in those data sets are critical.


Read First

  • Never work with a data set where the observations do not have standardized IDs. If you get a data set without IDs, the first thing you need to do is to create individual IDs.
  • Always create master data sets for all units of observation relevant to the analysis.
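
As a minimal sketch of the first point, the Python snippet below adds a sequential, zero-padded ID to a data set that arrived without one. The function name `assign_ids`, the variable name `hhid`, the ID width and the example rows are all hypothetical. The sketch assumes that IDs are assigned once and then saved, not regenerated on each run.

```python
def assign_ids(rows, id_name="hhid", start=1, width=4):
    """Return copies of the rows with a sequential, zero-padded string ID added."""
    # Refuse to overwrite an existing ID: IDs should be created once and kept.
    if any(id_name in row for row in rows):
        raise ValueError(f"{id_name!r} already exists; do not overwrite IDs")
    return [
        {id_name: f"{i:0{width}d}", **row}
        for i, row in enumerate(rows, start=start)
    ]


# Hypothetical data set received without IDs.
raw = [
    {"name": "A. Example", "village": "North"},
    {"name": "B. Example", "village": "South"},
]
with_ids = assign_ids(raw)
```

Storing the ID as a fixed-width string is one possible convention; the important property is that every observation ends up with exactly one ID.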

Guidelines

  • Organize information on the topic into subsections. For each subsection, include a brief description or overview, with links to articles that provide details.

Organization of Project folder

Setting up your folder for the first time

  • When setting up a folder for the first time, you are setting the standards that field coordinators, research assistants and economists will work with for years to come. It cannot be stressed enough that this is one of the most important steps for the productivity of the project team and for reducing sources of error in the data work.

Master Dofile

  • Running this file should run every file needed, from importing raw data through cleaning, construction and analysis to outputting results. This file therefore also functions as a map to the data folder.
  • The master dofile should also include all settings that should be the same across the project.
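
The structure described above can be sketched as follows. A dofile is a Stata script; this sketch uses Python only to illustrate the idea of one entry point that holds the shared settings and runs every step in order. The step functions, the `SETTINGS` values and the `run_all` name are hypothetical placeholders, not part of any actual project.

```python
# Project-wide settings, defined once so that every step uses the same values.
SETTINGS = {"seed": 12345, "data_dir": "Data"}  # hypothetical values


# Each placeholder step stands in for one script in the project.
def import_raw(settings, log):
    log.append("import")


def clean(settings, log):
    log.append("clean")


def construct(settings, log):
    log.append("construct")


def analyze(settings, log):
    log.append("analyze")


def export_results(settings, log):
    log.append("output")


# The ordered list of steps doubles as a map of the data work.
PIPELINE = [import_raw, clean, construct, analyze, export_results]


def run_all(settings):
    """Run every step in order and return the names of the steps that ran."""
    log = []
    for step in PIPELINE:
        step(settings, log)
    return log


steps_run = run_all(SETTINGS)
```

Reading `PIPELINE` from top to bottom shows the path from raw data to results, which is the sense in which the master file serves as a map.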

Specific rules for different folders

  • Data folder
    • Raw folder: do not make any edits to files here.
    • Final folder
  • Dofiles
    • Have a master dofile that runs all other dofiles needed for this project. This is also your map to the data folder.
    • Organize all other master files in subfolders
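
A folder skeleton along these lines could be created with a short script. The sketch below is one possible layout, not a prescribed one: the exact folder names `Data/Raw`, `Data/Final` and `Dofiles` are assumptions based on the list above, and the function name `create_skeleton` is hypothetical. It builds the folders inside a temporary directory so that it can be run without touching a real project.

```python
import tempfile
from pathlib import Path

# Assumed folder names, taken from the list above.
FOLDERS = ["Data/Raw", "Data/Final", "Dofiles"]


def create_skeleton(root):
    """Create the project folders under root and return their paths."""
    root = Path(root)
    created = []
    for rel in FOLDERS:
        path = root / rel
        # parents=True also creates the Data folder; exist_ok=True makes reruns safe.
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created


# Demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    paths = create_skeleton(tmp)
    names = sorted(p.relative_to(tmp).as_posix() for p in paths)
    all_exist = all(p.is_dir() for p in paths)
```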

Naming conventions

  • Use the version control in Box/Dropbox instead of naming the folders _v01, _v02 etc.
    • Discuss exceptions.

Master data sets

For each unit of observation that is relevant to the analysis (for example, the respondent in a survey or the level of randomization; see article for more details), we need a master data set. Common examples are household, village and clinic master data sets.

These master data sets should have information on all observations that we have any data on, not just those that we sampled or included in another way.
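
One way to picture this is a master data set that keeps every known unit and records sampling status as a variable, so that units outside the sample are not simply dropped. The Python sketch below is illustrative only: the function `build_master`, the variable names `village_id` and `sampled`, and the example villages are hypothetical.

```python
def build_master(all_units, sampled_ids, id_name="village_id"):
    """Return one row per known unit, flagged by whether it was sampled."""
    sampled = set(sampled_ids)
    return [
        {**unit, "sampled": unit[id_name] in sampled}
        for unit in all_units
    ]


# Hypothetical list of every village we have any data on.
census = [
    {"village_id": "V01", "district": "East"},
    {"village_id": "V02", "district": "East"},
    {"village_id": "V03", "district": "West"},
]

# Only V02 was sampled, but all three villages stay in the master data set.
master = build_master(census, sampled_ids=["V02"])
```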


Creation of IDs

  • All data sets should have a fully and uniquely identifying ID variable.
    • Two variables are OK, but this should be avoided in most cases. One case where multiple ID variables are good practice is a panel data set, where one ID is, for example, the household ID and the other is time.
  • After we have created IDs in the master data sets, we can drop all identifying information that is not needed for the analysis from all other data sets to which we have copied the ID.
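
The points above can be sketched in Python as a uniqueness check followed by a de-identification step. This is a hedged illustration, not a definitive implementation: the helper names `duplicate_keys` and `drop_identifiers`, the variable names `hhid`, `year`, `name` and `income`, and the example panel are all hypothetical.

```python
from collections import Counter


def duplicate_keys(rows, id_vars):
    """Return the ID combinations that appear in more than one row."""
    counts = Counter(tuple(row[v] for v in id_vars) for row in rows)
    return sorted(key for key, n in counts.items() if n > 1)


def drop_identifiers(rows, identifying_vars):
    """Return copies of the rows without the listed identifying variables."""
    return [
        {k: v for k, v in row.items() if k not in identifying_vars}
        for row in rows
    ]


# Hypothetical household panel: one row per household and year.
panel = [
    {"hhid": "0001", "year": 2015, "name": "A. Example", "income": 100},
    {"hhid": "0001", "year": 2016, "name": "A. Example", "income": 120},
    {"hhid": "0002", "year": 2015, "name": "B. Example", "income": 90},
]

# In a panel, the household ID alone repeats across years ...
dups_single = duplicate_keys(panel, ["hhid"])
# ... while household ID and time together identify each row.
dups_panel = duplicate_keys(panel, ["hhid", "year"])

# With the ID in place, names are no longer needed in the analysis data set.
analysis = drop_identifiers(panel, {"name"})
```

An empty result from `duplicate_keys` means the chosen ID variables uniquely identify the observations; a non-empty result lists the combinations that need to be resolved before any merging.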

Back to Parent

This article is part of the topic Data management

Additional Resources

  • List here other articles related to this topic, with a brief description and link.