Data Management
<onlyinclude>
Due to the long life span of a typical impact evaluation, multiple generations of team members often contribute to the same data work. Clear methods for organizing the data folder, structuring the data sets within it, and identifying the observations in those data sets are therefore critical.
</onlyinclude>
== Read First ==
* An important step before starting with '''data management''' is creating a [[Data Map|data map]].
* The data folder structure suggested here can easily be set up with the command [[iefolder]] in the package [[Stata_Coding_Practices#ietoolkit|ietoolkit]].
* A dataset should always have one [[ID Variable Properties|uniquely identifying variable]]. If you receive a data set without an ID, the first thing you need to do is to create individual IDs.
* Always create [[Master Data Set|master data sets]] for each unit of observation relevant to the analysis.
* Never merge on variables that are not ID variables unless one of the data sets being merged is the master data set.
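As a minimal illustration of the last two points (a Python sketch with hypothetical variable names; actual DIME project code is typically written in Stata), you can check whether a candidate merge variable actually uniquely identifies observations before merging:

```python
# Illustrative sketch (hypothetical data): before merging two data sets,
# verify that the merge variable uniquely identifies observations in at
# least one of them -- otherwise the merge can silently duplicate rows.

def is_unique_id(rows, id_var):
    """Return True if id_var uniquely identifies every row in rows."""
    ids = [row[id_var] for row in rows]
    return len(ids) == len(set(ids))

households = [
    {"hh_id": 101, "village": "A"},
    {"hh_id": 102, "village": "B"},
]
visits = [
    {"hh_id": 101, "round": 1},
    {"hh_id": 101, "round": 2},
]

assert is_unique_id(households, "hh_id")  # safe to merge on hh_id
assert not is_unique_id(visits, "hh_id")  # hh_id alone is not an ID here
```

In the second data set, hh_id alone fails the check because it is panel-style data; there, the combination of hh_id and round identifies observations.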


==Organization of Project folder==
[[File:Example_file_and_folder_structure.png |thumb|400px|Example of a data folder structure used during the course of an impact evaluation project.]]
A well-organized [[DataWork Folder|data folder]] is essential to a productive workflow for the whole [[Impact Evaluation Team]]. It cannot be stressed enough that this is one of the most important steps for the team's productivity and for reducing sources of error in the data work. There is no universally best way to organize a project folder; what matters most is that every project plans carefully when setting up a new one. It is a good idea to start from a project folder template. Below is a detailed description of DIME's Folder Standard.


=== DIME Folder Standard ===


At DIME we have a [[DataWork Folder|standardized folder template]] for organizing our data work folders. Some projects have unique folder requirements, but we still recommend that all projects use this template as a starting point. We have published a command called [[iefolder]] in the package [[Stata Coding Practices#ietoolkit|ietoolkit]], available through SSC, that sets up this folder structure for you. The paragraphs below give a short summary of the data folder organization; click the links for details.


Most projects have a shared folder, for example on Box or Dropbox. The project folder typically has several subfolders: government communications, budget, impact evaluation design, presentations, etc. There should also be one folder for all the data work, which in our standard template is called '''DataWork'''. All data-related work on the project should be stored in '''DataWork'''.


The [[DataWork Folder|DataWork folder structure]] must be set up carefully and planned well the first time, so that it does not cause problems as the project evolves. DIME's folder structure template is based on best practices developed at DIME that help avoid those problems. A data folder should also include a [[Master Do-files|master do-file]], which runs all other do-files and also serves as a map for navigating the data folder. The project should also have clear [[Naming Conventions|naming conventions]]. This might sound harder than it is; the process is easy if you use [[iefolder]].
*Data folder
**Raw folder - This folder should contain the data sets exactly as you first received them, whether downloaded from the internet, received from data collection, or received from other projects. '''''Absolutely no changes''''' should be made to the data in this folder: not even simple ones such as correcting obvious mistakes, renaming variables, or converting files from csv to Stata or another format. The only exception is renaming a file when that is required in order to import it.
**Intermediate folder - Raw data sets to which simple changes like those above have been applied belong in the intermediate folder.
**Final folder - The final data folder contains the clean and final constructed data sets.
*Dofiles
**Have a master do-file that runs all other do-files needed for the project. This is also your map to the data folder.
**Organize all other do-files in sub-folders.
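The folder skeleton described above can be sketched as follows. This is an illustrative Python snippet with a simplified, hypothetical set of folder names; in practice the full DIME template is created for you by the Stata command [[iefolder]]:

```python
# Minimal sketch of a DataWork-style folder skeleton. The folder names
# below are a simplified, hypothetical subset of a real project template.
from pathlib import Path

def make_datawork_skeleton(root):
    subfolders = [
        "DataSets/Raw",           # untouched data, exactly as received
        "DataSets/Intermediate",  # raw data after minimal fixes
        "DataSets/Final",         # clean and final constructed data sets
        "Dofiles",                # master do-file and sub-folders
        "Output",                 # tables and graphs
    ]
    for sub in subfolders:
        Path(root, sub).mkdir(parents=True, exist_ok=True)

make_datawork_skeleton("DataWork")
```

Creating the skeleton in one function call makes it trivial to reproduce the same structure for every new survey round or project.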
===Naming conventions===
*Use the version control in Box/Dropbox instead of naming folders _v01, _v02, old, new, etc.
**Output tables, graphs, and documentation are an exception: it is good practice to date them rather than version them, for example "_2017June8" instead of "_v02".
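A date-stamped output name can be generated programmatically, so files are never stamped by hand. This is an illustrative Python sketch; the helper name is hypothetical:

```python
# Sketch: date-stamp output file names instead of version-numbering them,
# producing names like "balance_table_2017June8.tex".
from datetime import date

def dated_name(base, ext, d=None):
    """Append _YYYYMonthD to a base file name (defaults to today)."""
    d = d or date.today()
    return f"{base}_{d.year}{d.strftime('%B')}{d.day}{ext}"

print(dated_name("balance_table", ".tex", date(2017, 6, 8)))
# -> balance_table_2017June8.tex
```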


==Master data sets==
With multiple rounds of data, you need to ensure there are no discrepancies in how observations are identified across survey rounds. Best practice is to have one data file that provides an overview of the observations, typically called a [[Master Data Set|master data set]]. For each [[Unit of Observation|unit of observation]] relevant to the analysis (survey respondent, unit of randomization, etc.) we need a master data set. Common examples are household, village, and clinic master data sets.


<!-- link here from population frame work in sampling topic -->
These master data sets should include time-invariant information, for example ID variables and dummy variables indicating treatment status, so that it is easy to merge across data sets. We want this information for all observations we have encountered, even those that were not selected during sampling. This is a small but important point; read [[Master Data Set|master data set]] for more details.
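As a minimal sketch of this merge (Python for illustration, with hypothetical variable names), time-invariant information such as treatment status is pulled from the master data set into a survey-round data set via the ID variable:

```python
# Sketch (hypothetical data): merge time-invariant information, such as
# treatment status, from a household master data set into a survey-round
# data set, matching on the household ID.

master = {
    101: {"treatment": 1, "district": "North"},
    102: {"treatment": 0, "district": "South"},
    103: {"treatment": 1, "district": "North"},  # kept in the master even if never surveyed
}

baseline = [
    {"hh_id": 101, "income": 1200},
    {"hh_id": 102, "income": 900},
]

# Copy treatment status from the master into each baseline observation.
for row in baseline:
    row["treatment"] = master[row["hh_id"]]["treatment"]
```

Because the master covers every household ever encountered, the lookup cannot fail for a legitimately identified observation; a KeyError here would signal an ID discrepancy worth investigating.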


== ID Variables ==
All data sets must be uniquely and fully identified; see [[ID Variable Properties|properties of an ID variable]] for details. In almost all cases the ID should be a single variable. One common exception is panel data sets, where each observation is identified by the primary ID variable and a time variable (year one, year two, etc.). As soon as a data set has one numeric, unique, fully identifying variable, [[De-identification|separate out all personally identifiable information (PII)]] from the data set. The PII should be encrypted and saved separately in a secure location.
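The PII separation step can be sketched as follows (an illustrative Python snippet; the column names are hypothetical, and in a real project the PII file would additionally be encrypted):

```python
# Sketch (hypothetical columns): once each observation has a numeric ID,
# split personally identifiable information (PII) into a separate file so
# the analysis data set contains no direct identifiers. In practice the
# PII portion must also be encrypted and stored in a secure location.

PII_COLUMNS = {"name", "phone"}

def split_pii(rows, id_var="resp_id"):
    """Return (pii_rows, analysis_rows); both keep the ID for re-linking."""
    pii, analysis = [], []
    for row in rows:
        pii.append({k: v for k, v in row.items() if k == id_var or k in PII_COLUMNS})
        analysis.append({k: v for k, v in row.items() if k not in PII_COLUMNS})
    return pii, analysis

survey = [{"resp_id": 1, "name": "Ana", "phone": "555-0101", "age": 34}]
pii, analysis = split_pii(survey)
```

Keeping the ID variable in both files is what makes the split reversible for the few, access-controlled tasks that genuinely need identified data.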


== Git and GitHub ==
Git is a tool used extensively in the world of software development to manage code. It is especially good for collaboration, but it is also very helpful in single-person projects. In the early days of Git you had to manage your own code repository through the command line and set up your own servers to share code, but several cloud-based solutions with non-technical interfaces now exist, and [https://github.com GitHub] is the most commonly used one within the research community. Other commonly used Git implementations are [https://about.gitlab.com GitLab] and [https://bitbucket.org Bitbucket]. Since they all build on Git, they share most features, and if you learn one of them your skills are transferable to the others.


GitHub offers less technical ways to interact with Git's functionality, which is probably why it is the most popular Git implementation in the research community. One historical drawback was that you could not create private code repositories on a free GitHub account (though you could be invited to them); GitHub has since removed this restriction. GitLab, for example, has long allowed private repositories on free accounts. We have created resources, and provide links to additional external resources, for GitHub on our [[Getting started with GitHub]] page.


== Additional Resources ==
* [http://www.poverty-action.org/publication/ipas-best-practices-data-and-code-management Best practices on data and code management from Innovations for Poverty Action]
*DIME Analytics’ [https://github.com/worldbank/DIME-Resources/blob/master/stata1-3-cleaning.pdf Data Management and Cleaning]
*DIME Analytics’ [https://github.com/worldbank/DIME-Resources/blob/master/stata2-3-data.pdf Data Management for Reproducible Research]
[[Category: Data_Management ]]
[[Category: Data Analysis ]]

Revision as of 15:08, 13 April 2021