Difference between revisions of "DataWork Folder"

(160 intermediate revisions by 5 users not shown)
Line 1: Line 1:
Since the '''DataWork''' folder is used throughout the impact evaluation project, it is important to set it up correctly. A correctly set up folder makes the data work more efficient and reduces the sources of error.
The DataWork folder is a structured, standardized data folder that increases project efficiency and reduces the risk of error. The DataWork folder houses all files related to a project’s data, including data files; questionnaires; data collection documentation; code for sampling, treatment assignment, and [[Data Analysis | analysis]]; analysis output; and survey, monitoring,  [[Administrative and Monitoring Data | administrative]], and secondary data. DIME strongly recommends using the DataWork folder from the beginning of the project and throughout its duration.  


== Inside the DataWork folder ==
== Read First ==
[[File:Example_file_and_folder_structure.png |thumb|300px|Example of a data folder structure used during the course of an impact evaluation project.]]
* Use the <code>[[iefolder]]</code> Stata command to easily set up and update the DataWork folder.
Inside the '''DataWork''' folder there should only be folders, plus files that help users navigate those folders. Each folder should either correspond to a data source ([[DataWork_Folder_Setup#Survey_Round|survey rounds]], [[DataWork_Folder_Setup#Admin_Data|admin data]], [[DataWork_Folder_Setup#Monitor_Data|monitor data]]) or contain metadata sets such as the [[Master Data Set|master data set]].
*Many DIME resources are easier to take advantage of when the DataWork folder is used.
*Use the DataWork folder from the beginning of the project: reorganizing a project folder is time-consuming and cumbersome.
*Even if your project has special structural requirements, use the DataWork folder as a starting point.
== Creating the DataWork Folder ==
[[File:FolderBox.png |thumb|350px|Image 1. Example of where the DataWork folder is in relation to Box/Dropbox folders. (Click to enlarge.)]]


The most important file for navigating the sub-folders of the '''DataWork''' folder is the main master do-file. This do-file calls the master do-file of each data source and metadata folder, and re-runs all code needed to generate all datasets and outputs in the '''DataWork''' folder. Another file that helps with navigation could be a Word document or a PDF describing how to navigate the sub-folders. It is important that the number of files here is kept to an absolute minimum.
The DataWork folder is easily set up via the Stata command <code>[[iefolder]]</code>, which is part of the package <code>[[Stata_Coding_Practices#ietoolkit|ietoolkit]]</code>.


==Survey Round==
The DataWork folder should be housed within the project folder, which contains a variety of other sub-folders (e.g., project budget and government communications). The DataWork folder and the broader project folder should be shared across project teams via [https://www.box.com Box], [https://www.dropbox.com/ Dropbox], or a similar platform. Image 1 shows a DataWork folder housed within a project folder.
Each round of the survey should have its own sub-folder inside the data folder. For example, inside the main data folder you can have sub-folders such as baseline, follow-up 1, follow-up 2, midline, and endline. Each of these folders should contain the folders described below.


===DataSets folder ===
== Contents ==
The '''DataSets''' folder inside a Survey Round folder should be further divided into three sub-folders. These folders are called '''Raw''', '''Intermediate''' and '''Final'''. '''Raw''' and '''Final''' have strict rules on which datasets can be saved in those folders. All other datasets should be saved in '''Intermediate'''.
[[File:Datawork.png |thumb|300px|Image 2. Example of a DataWork folder. (Click to enlarge.)]]


;Raw Folder
The DataWork folder contains a [[Master Do-files|master do-file]], Survey Round folders, a Survey Encrypted Data folder, and a Master Data folder. It may also contain additional documentation (e.g., a readme file) to help users navigate its contents.
This folder should contain the datasets in exactly the state in which you received them. This includes data downloaded from the internet, data received from data collection, and data received from other projects. '''''Absolutely no changes''''' should be made to the data in this folder. Even simple changes, such as correcting obvious mistakes, renaming variables, converting from CSV to Stata or another format, or renaming files, should not be made here. The only exception is a file that must be renamed in order to be imported; in that case the file name may be changed in this folder.


If you know of mistakes in your raw datasets, write a do-file that corrects them and save the corrected data in the Intermediate folder. This is the only way corrections can be fully documented. If corrections to the datasets are not documented, then we will not fully understand the quality of our data, and therefore will not fully understand the quality of our research.
=== Master Do-File ===


;Intermediate Folder
The project [[Master Do-files|master do-file]] runs all other project do-files from cleaning to final analysis. It also sets up dynamic file paths so that multiple users can work from the same project folder shared via, for example, Dropbox or Box. This ensures that everyone gets the same results.  
This folder should contain all datasets that belong in neither the '''Raw''' folder (see above) nor the '''Final''' folder (see below). Raw datasets to which simple changes have been made, as described above, should be saved in the Intermediate folder. Since this is a work-in-progress folder, there are no specific rules for how it should be organized, although it still makes sense to keep it organized in sub-folders. Read the [[Naming Conventions|naming conventions]] before saving multiple versions of the same data set named _v1, _v2, etc.


;Final folder
If you are new to a project folder, always start by finding the master do-file: it serves as a map of all files in the DataWork folder.
This folder should contain the data sets that are cleaned and have the final variables constructed. All datasets in this folder should be clearly marked as identified or de-identified. There should only be one version of each data set in this folder. If there are many different data sets in this folder (for example, a student dataset, a school dataset, and a teacher dataset), then the folder should have sub-folders. This is the folder most likely to be visited several years after the project has ended by someone who has very little knowledge of the project. It should therefore be one of the most carefully organized folders.


===DoFiles Folder===
===Survey Rounds===
Each survey round folder will also contain a do-file folder. At the top level of this folder there should only be a master do-file and sub-folders. This master do-file is the map to all do-files and all datasets needed for this survey round.
[[File:FolderSurveyRound.png |thumb|300px|Image 3. Example of a Survey Round folder. (Click to enlarge.)]]


The sub-folders in this folder should be organized according to task, for example ''import'', ''cleaning'', and ''analysis''. All do-files related to each of these tasks should be saved in the corresponding folder. Again, use [[Naming Conventions|naming conventions and version control]] rather than keeping multiple versions of the same do-file.
Each [[DataWork Survey Round | Survey Round]] folder contains a master do-file specific to its data source, in addition to the following folders: [[DataWork Survey Round#DataSets Folder|DataSets]], [[DataWork Survey Round#Dofiles Folder|Dofiles]], [[DataWork Survey Round#Outputs Folder|Outputs]], [[DataWork Survey Round#Documentation Folder|Documentation]], and [[DataWork Survey Round#Questionnaire Folder|Questionnaire]]. While the folders listed here are created by <code>[[iefolder]]</code>, you can also create additional folders that are unique to your project.


===Output Folder===
====What Requires a Survey Round Folder?====
Each source of data (e.g., baseline, follow-up, midline, endline, [[Administrative and Monitoring Data | administrative]], or secondary data) should have its own Survey Round folder within the DataWork folder. If the data source is collected continuously (e.g., administrative data or secondary data, such as macro data over time), then it requires only one Survey Round folder. However, if the data source is collected in stages (e.g., baseline and endline data), then it requires one Survey Round folder for each stage.


The output folder should contain the raw and final table output folders.
If you have multiple [[Unit of Observation|units of observation]] in a given data source, each unit of observation for each data source requires its own Survey Round folder. For example, if the units of observation during baseline data collection are students, teachers and schools, then baseline data requires three Survey Round folders: ''baseline_students'', ''baseline_teachers'' and ''baseline_schools''. If you choose, these folders may be nested within a parent ''Baseline'' folder. <code>[[iefolder]]</code> gives you the option of doing this by creating a ''Baseline'' subfolder and then creating Survey Round folders within it.  


===Documentation===
Note that each Survey Round needs to have a unique name across the project when using <code>[[iefolder]]</code>.


This folder will contain the documentation for the analysis done including any duplicate reports, survey logs, etc.
=== Survey Encrypted Data ===


==AdminData==
Whenever you create either a new survey round or a new unit of observation with <code>[[iefolder]]</code>, the command creates a partner folder for it in the Survey Encrypted Data folder. Any identifying or sensitive data should be saved in the Survey Encrypted Data folder; the folder’s contents can easily be [[Encryption | encrypted]] using software like [https://www.veracrypt.fr/en/Home.html VeraCrypt]. We previously recommended Boxcryptor, but security issues were found in that software, so we now strongly recommend against using it. Note that while <code>[[iefolder]]</code> creates the Survey Encrypted Data folder, it does not encrypt it.


* What is?
Consider, for example, a master dataset. The version with identifying information should be stored in the Survey Encrypted Data folder. The version [[De-identification | without identifying information]] should be stored in the Master Data folder. The latter version is quicker to access and can often be shared outside the research team.
* Admin data comes in many different forms, so there is no exact rule on what to do. Follow the survey round instructions as much as possible.


==MonitorData==
=== Master Data ===
[[File:FolderMasterData.png |thumb|300px|Image 4. Example of a Master Data Set folder. (Click to enlarge.)]]


* What is?
The Master Data folder stores information about all the observations for which data is collected, including observations both in and out of the sample. As it is sometimes necessary to identify observations outside of the sample, the Master Data folder should also include, for example,
* Monitor data is often collected the same way as survey data, in which case it uses the same format. Otherwise it follows the rules for admin data.
*Census observations not sampled for the project, or
*Observations encountered in monitoring activities but not in the sample.
In the Master Data folder, we can also track any time-invariant information relevant to the project (e.g., assigned treatment status, treatment uptake, identifying information, and a dummy for being sampled). The datasets where we store this information are called [[Master Data Set |master datasets]].


==MasterData==
A project needs one master dataset for each [[Unit of Observation|unit of observation]] used in the project (e.g., households, students, teachers, or firms).
*See article on master data sets
 
==== Sampling and Treatment Assignment ====
 
The Master Data folder should also include all activities performed on the main listing of all observations, such as sampling and treatment assignment. These activities should never be performed directly on census data, baseline data, or any other single data source. While census data will be used for sampling, it is important to first match the census data to the master data and check that it is consistent with whatever data already exists there. After that quality control step, sample directly from the master dataset, knowing that the randomized sample will make sense in relation to other data sources.


== Back to Parent ==
== Back to Parent ==
Line 54: Line 60:


== Additional Resources ==
== Additional Resources ==
* list here other articles related to this topic, with a brief description and link
*DIME Analytics' guidelines on [https://github.com/worldbank/DIME-Resources/blob/master/welcome-iefolder.pdf iefolder]
 
*DIME Analytics’ [https://github.com/worldbank/DIME-Resources/blob/master/stata1-3-cleaning.pdf Data Management and Cleaning]
[[Category: Data Management ]]
*DIME Analytics’ [https://github.com/worldbank/DIME-Resources/blob/master/stata2-3-data.pdf Data Management for Reproducible Research]
[[Category: Data_Management ]]

Latest revision as of 14:31, 12 June 2019

The DataWork folder is a structured, standardized data folder that increases project efficiency and reduces the risk of error. The DataWork folder houses all files related to a project’s data, including data files; questionnaires; data collection documentation; code for sampling, treatment assignment, and analysis; analysis output; and survey, monitoring, administrative, and secondary data. DIME strongly recommends using the DataWork folder from the beginning of the project and throughout its duration.

Read First

  • Use the iefolder Stata command to easily set up and update the DataWork folder.
  • Many DIME resources are easier to take advantage of when the DataWork folder is used.
  • Use the DataWork folder from the beginning of the project: reorganizing a project folder is time-consuming and cumbersome.
  • Even if your project has special structural requirements, use the DataWork folder as a starting point.

Creating the DataWork Folder

Image 1. Example of where the DataWork folder is in relation to Box/Dropbox folders.

The DataWork folder is easily set up via the Stata command iefolder, which is part of the package ietoolkit.

The DataWork folder should be housed within the project folder, which contains a variety of other sub-folders (e.g., project budget and government communications). The DataWork folder and the broader project folder should be shared across project teams via Box, Dropbox, or a similar platform. Image 1 shows a DataWork folder housed within a project folder.

Contents

Image 2. Example of a DataWork folder.

The DataWork folder contains a master do-file, Survey Round folders, a Survey Encrypted Data folder, and a Master Data folder. It may also contain additional documentation (e.g., a readme file) to help users navigate its contents.

Master Do-File

The project master do-file runs all other project do-files from cleaning to final analysis. It also sets up dynamic file paths so that multiple users can work from the same project folder shared via, for example, Dropbox or Box. This ensures that everyone gets the same results.
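The idea of dynamic file paths can be illustrated outside Stata as well. The following is a minimal Python sketch, not the master do-file itself: each team member registers where the shared project folder lives on their machine, and every other path is built from that root. The usernames, the folder locations, and the function name datawork_path are all hypothetical.

```python
import getpass
from pathlib import Path

# Hypothetical usernames and project-folder locations; replace with your team's.
PROJECT_ROOTS = {
    "alice": Path("C:/Users/alice/Dropbox/ProjectName"),
    "bob": Path("/Users/bob/Box/ProjectName"),
}


def datawork_path(user=None, roots=PROJECT_ROOTS):
    """Return the DataWork folder for the given user (default: current user)."""
    user = user or getpass.getuser()
    if user not in roots:
        # Fail loudly so a new team member knows to register a root folder.
        raise KeyError(f"No project folder registered for user '{user}'")
    return roots[user] / "DataWork"
```

Because all other paths are derived from the value returned here, the same code runs unchanged on every team member's computer.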

If you are new to a project folder, always start by finding the master do-file: it serves as a map of all files in the DataWork folder.

Survey Rounds

Image 3. Example of a Survey Round folder.

Each Survey Round folder contains a master do-file specific to its data source, in addition to the following folders: DataSets, Dofiles, Outputs, Documentation, and Questionnaire. While the folders listed here are created by iefolder, you can also create additional folders that are unique to your project.

What Requires a Survey Round Folder?

Each source of data (e.g., baseline, follow-up, midline, endline, administrative, or secondary data) should have its own Survey Round folder within the DataWork folder. If the data source is collected continuously (e.g., administrative data or secondary data, such as macro data over time), then it requires only one Survey Round folder. However, if the data source is collected in stages (e.g., baseline and endline data), then it requires one Survey Round folder for each stage.

If you have multiple units of observation in a given data source, each unit of observation for each data source requires its own Survey Round folder. For example, if the units of observation during baseline data collection are students, teachers and schools, then baseline data requires three Survey Round folders: baseline_students, baseline_teachers and baseline_schools. If you choose, these folders may be nested within a parent Baseline folder. iefolder gives you the option of doing this by creating a Baseline subfolder and then creating Survey Round folders within it.

Note that each Survey Round needs to have a unique name across the project when using iefolder.
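To make the layout and the unique-name rule concrete, here is a minimal Python sketch. It is only an illustration of the structure described above and does not reproduce iefolder, which is a Stata command. The sub-folder names are taken from this article (the names iefolder writes to disk may differ), and the function name add_survey_round is hypothetical. The uniqueness check here is deliberately strict: it rejects any folder name already used anywhere under DataWork.

```python
from pathlib import Path

# Sub-folder names as listed in this article; the exact names that
# iefolder creates on disk may differ.
ROUND_SUBFOLDERS = ["DataSets", "Dofiles", "Outputs", "Documentation", "Questionnaire"]


def add_survey_round(datawork, round_name, parent=None):
    """Create a Survey Round folder, optionally nested in a parent folder.

    Raises ValueError if any folder under DataWork already uses round_name.
    """
    datawork = Path(datawork)
    datawork.mkdir(parents=True, exist_ok=True)
    # Collect every existing folder name (case-insensitive) under DataWork.
    existing = {p.name.lower() for p in datawork.rglob("*") if p.is_dir()}
    if round_name.lower() in existing:
        raise ValueError(f"Folder name '{round_name}' is already used in {datawork}")
    round_dir = datawork / parent / round_name if parent else datawork / round_name
    for sub in ROUND_SUBFOLDERS:
        (round_dir / sub).mkdir(parents=True, exist_ok=True)
    return round_dir
```

For the example above, calling add_survey_round(datawork, "baseline_students", parent="Baseline") and then repeating the call for teachers and schools would produce three Survey Round folders nested in one Baseline folder.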

Survey Encrypted Data

Whenever you create either a new survey round or a new unit of observation with iefolder, the command creates a partner folder for it in the Survey Encrypted Data folder. Any identifying or sensitive data should be saved in the Survey Encrypted Data folder; the folder’s contents can easily be encrypted using software like VeraCrypt. We previously recommended Boxcryptor, but security issues were found in that software, so we now strongly recommend against using it. Note that while iefolder creates the Survey Encrypted Data folder, it does not encrypt it.

Consider, for example, a master dataset. The version with identifying information should be stored in the Survey Encrypted Data folder. The version without identifying information should be stored in the Master Data folder. The latter version is quicker to access and can often be shared outside the research team.
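The split between the two versions can be sketched as follows. This is a simplified Python illustration with hypothetical column names; dropping direct identifiers is only one part of de-identification, and the function name deidentify is not part of any DIME tool.

```python
# Hypothetical names of identifying columns; adjust to your own data.
IDENTIFYING_COLUMNS = {"name", "phone", "gps_lat", "gps_lon"}


def deidentify(rows, identifying=IDENTIFYING_COLUMNS):
    """Return copies of the rows with the identifying columns dropped.

    The input rows are left untouched, so the identified version can
    still be saved to the encrypted folder.
    """
    return [
        {key: value for key, value in row.items() if key not in identifying}
        for row in rows
    ]
```

The original rows would be stored in the Survey Encrypted Data folder and the returned rows in the Master Data folder.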

Master Data

Image 4. Example of a Master Data Set folder.

The Master Data folder stores information about all the observations for which data is collected, including observations both in and out of the sample. As it is sometimes necessary to identify observations outside of the sample, the Master Data folder should also include, for example,

  • Census observations not sampled for the project, or
  • Observations encountered in monitoring activities but not in the sample.

In the Master Data folder, we can also track any time-invariant information relevant to the project (e.g., assigned treatment status, treatment uptake, identifying information, and a dummy for being sampled). The datasets where we store this information are called master datasets.

A project needs one master dataset for each unit of observation used in the project (e.g., households, students, teachers, or firms).

Sampling and Treatment Assignment

The Master Data folder should also include all activities performed on the main listing of all observations, such as sampling and treatment assignment. These activities should never be performed directly on census data, baseline data, or any other single data source. While census data will be used for sampling, it is important to first match the census data to the master data and check that it is consistent with whatever data already exists there. After that quality control step, sample directly from the master dataset, knowing that the randomized sample will make sense in relation to other data sources.
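The two steps described above (check the census against the master data, then sample from the master dataset) can be sketched in Python as follows. This is an illustration under stated assumptions, not DIME's sampling code: the master dataset is represented as a list of dictionaries with a hypothetical "id" column, and the function names are invented for this example.

```python
import random


def check_census_against_master(census_ids, master_ids):
    """Return the census IDs that are missing from the master dataset."""
    return sorted(set(census_ids) - set(master_ids))


def draw_sample(master, n, seed):
    """Flag n observations as sampled, directly in the master dataset.

    IDs are sorted before drawing so that the result depends only on
    the seed and not on the order in which rows happen to be stored.
    """
    rng = random.Random(seed)
    ids = sorted(row["id"] for row in master)
    chosen = set(rng.sample(ids, n))
    for row in master:
        row["sampled"] = int(row["id"] in chosen)
    return master
```

Fixing the seed and sorting before drawing makes the sample reproducible, and storing the "sampled" flag in the master dataset keeps the result available to every other data source in the project.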

Back to Parent

This article is part of the topic Data Management

Additional Resources