Difference between revisions of "DataWork Folder"

 
(112 intermediate revisions by 5 users not shown)
A well-organized data folder reduces the risk of many types of errors. At DIME we have a standardized folder structure. Some projects have special folder requirements and only use the folder setup as a starting point, but many resources created by DIME are easier to take advantage of if this template is followed. It takes a lot of work to reorganize a project folder, so we strongly recommend that projects follow our standard from the beginning. A poorly set up folder causes inefficiencies and increases the risk of errors for years.
The DataWork folder is a structured, standardized data folder that increases project efficiency and reduces the risk of error. The DataWork folder houses all files related to a project’s data, including data files; questionnaires; data collection documentation; code for sampling, treatment assignment, and [[Data Analysis | analysis]]; analysis output; and survey, monitoring,  [[Administrative and Monitoring Data | administrative]], and secondary data. DIME strongly recommends using the DataWork folder from the beginning of the project and throughout its duration.  
 
We have published a command called [[iefolder]] in our package [[Stata_Coding_Practices#ietoolkit|ietoolkit]], which is available on SSC. iefolder sets up the recommended folder structure described in this article for you.


== Read First ==
* Do not set up these folders manually. [[iefolder]] is a Stata command that easily sets up and updates this folder structure for you
* Use the <code>[[iefolder]]</code> Stata command to easily set up and update the DataWork folder.
*Many DIME resources are easier to take advantage of when the DataWork folder is used.
*Use the DataWork folder from the beginning of the project: reorganizing a project folder is time-consuming and cumbersome.
*Even if your project has special structural requirements, use the DataWork folder as a starting point.
== Creating the DataWork Folder ==
[[File:FolderBox.png |thumb|350px|Image 1. Example of where the DataWork folder is in relation to Box/DropBox folders. (Click to enlarge.)]]


== Where should the DataWork folder be created? ==
The DataWork folder is easily set up via the Stata command <code>[[iefolder]]</code>, which is part of the package <code>[[Stata_Coding_Practices#ietoolkit|ietoolkit]]</code>.
[[File:FolderBox.png |thumb|350px|Image 1. Example of where the DataWork folder is located in relation to Box/DropBox folders. (Click to enlarge.)]]


Most folders are shared across the project teams using a DropBox, Box or similar. In this folder there are usually a lot of folders for project budget, government communications etc. The '''DataWork''' folder is assumed to be one of them.  
The DataWork folder should be housed within the project folder, which contains a variety of other sub-folders (e.g. project budget and government communications). The DataWork folder and the broader project folder should be shared across project teams via [https://www.box.com Box], [https://www.dropbox.com/ Dropbox], or a similar platform. Image 1 shows a DataWork folder housed within a project folder.


See Image 1 to the right for an example of a Box/DropBox folder with three project folders. All three projects have a similar sub-folder structure, but the image shows the sub-folder structure of only one project. The '''DataWork''' folder is highlighted with a red circle.
== Contents ==
[[File:Datawork.png |thumb|300px|Image 2. Example of a DataWork folder. (Click to enlarge.)]]


Anything related to the data of a project has a designated location inside this folder. This includes data files, sampling and treatment assignment code, questionnaires, data collection documentation, analysis code, analysis output etc. It covers data collected by ourselves, both regular survey rounds and monitor data, but it should also include other sources of data such as admin data or secondary data.
The DataWork folder contains a [[Master Do-files|master do-file]], Survey Round folders, a Survey Encrypted Data folder, and a Master Data folder. It may also contain additional documentation (e.g. a readme file) to help users navigate its contents.


== Inside the DataWork folder ==
=== Master Do-File ===
[[File:FolderDataWork.png |thumb|300px|Image 2. Example of a DataWork folder. (Click to enlarge.)]]


Inside the '''DataWork''' folder there should only be folders and files that help navigating those folders. In our standardized folder there are only three types of folders; [[DataWork_Folder#Survey_Round|Survey Rounds]] (Baseline, Endline etc.), [[DataWork_Folder#Monitor_Data|Monitor Data]] and [[DataWork_Folder#Monitor_Data|Master Data Sets]]. What's described here is only a template structure and additional folders are often needed. But we recommend that you try to fit even additional folders into one of these folder types and create them using [[iefolder]].
The project [[Master Do-files|master do-file]] runs all other project do-files from cleaning to final analysis. It also sets up dynamic file paths so that multiple users can work from the same project folder shared via, for example, Dropbox or Box. This ensures that everyone gets the same results.  


In addition to the folders, in our standardized folder structure, there is also a Project_MasterDofile inside the '''DataWork''' folder. This file has three purposes. The first two are described in detail in the article for [[Master Dofiles]], but in short, it makes it possible to run all code related to one project at the same time, and it sets up all the folder paths required to run any dofile for this project. The third purpose is that this file is the main map to the '''DataWork''' folder. Since all code can be run from this file, and since all outputs are (indirectly) created by this file, this file is the starting point to find where any do-file, data set or output is located in the '''DataWork''' folder. Another example of a file that helps with navigating the folder could be a Word document or a PDF describing how to navigate the sub-folders. Such files are not included in our folder template, but may sometimes be a good addition. Even so, keep the number of files in this folder to an absolute minimum.
If you are new to a project folder, always start by finding the master do-file: it serves as a map of all files in the DataWork folder.


==Survey Round==
===Survey Rounds===
[[File:FolderSurveyRound.png |thumb|300px|Image 3. Example of a Survey Round folder. (Click to enlarge.)]]


Baselines, Follow Up Surveys, Midlines and Endlines are examples of a Survey Round. This is the data that, in impact evaluations, we test for changes over time and for whether that change differs significantly between treatment and control. In contrast, the information in the master data sets, like the ID assigned by us, whether you were sampled for baseline, and whether you were selected for treatment or control, are all examples of information that is time invariant and will remain the same over the course of the project. Monitor data might change over time; for example, in an impact evaluation running over many years, one observation might not take up the treatment the first year but might do so the next year.
Each [[DataWork Survey Round | Survey Round]] folder contains a master do-file specific to its data source, in addition to the following folders: [[DataWork Survey Round#DataSets Folder|DataSets]], [[DataWork Survey Round#Dofiles Folder|Dofiles]], [[DataWork Survey Round#Outputs Folder|Outputs]], [[DataWork Survey Round#Documentation Folder|Documentation]], and [[DataWork Survey Round#Questionnaire Folder|Questionnaire]]. While the folders listed here are the ones created by <code>[[iefolder]]</code>, you can also create more folders that are unique to your project.


Each survey round should have its own sub-folder inside the '''DataWork''' folder. For example, inside the main data folder, you can have sub-folders like baseline, follow up 1, follow up 2, midline, endline, etc. See image 3 for an example. When you create a survey round folder using the command [[iefolder]], all sub-folders and sub-sub-folders described below will be created for you and all your master dofiles will be updated or created accordingly.
====What Requires a Survey Round Folder?====
Each source of data (e.g. baseline, follow-up, midline, endline, [[Administrative and Monitoring Data | administrative]], secondary) should have its own Survey Round folder within the DataWork folder. If the data source is collected continuously (e.g. administrative data, or secondary data such as macro data over time), then it requires only one Survey Round folder. However, if the data source is collected in stages (e.g. baseline and endline data), then it requires one Survey Round folder for each stage.


Inside each Survey Round folder you will find a master do file for that survey round as well as the following folders [[DataWork Survey Round#DataSets Folder|DataSets]], [[DataWork Survey Round#Dofiles Folder|Dofiles]], [[DataWork Survey Round#Outputs Folder|Outputs]], [[DataWork Survey Round#Documentation Folder|Documentation]], and [[DataWork Survey Round#Questionnaire Folder|Questionnaire]].
If you have multiple [[Unit of Observation|units of observation]] in a given data source, each unit of observation for each data source requires its own Survey Round folder. For example, if the units of observation during baseline data collection are students, teachers and schools, then baseline data requires three Survey Round folders: ''baseline_students'', ''baseline_teachers'' and ''baseline_schools''. If you choose, these folders may be nested within a parent ''Baseline'' folder. <code>[[iefolder]]</code> gives you the option of doing this by creating a ''Baseline'' subfolder and then creating Survey Round folders within it.  


=== Multiple Units of Observation ===
Note that each Survey Round needs to have a unique name across the project when using <code>[[iefolder]]</code>.


If you have multiple [[Unit of Observation|units of observation]] in a survey round, for example farmers and villages, or students, teachers and schools, then you should create a survey round folder for each unit of observation.
=== Survey Encrypted Data ===


==AdminData==
Whenever you create either a new survey round or a new unit of observation with <code>[[iefolder]]</code>, the command creates a partner folder for it in the Survey Encrypted Data folder. Any identifying or sensitive data should be saved in the Survey Encrypted Data folder; the folder’s contents can easily be [[Encryption | encrypted]] using software like [https://www.veracrypt.fr/en/Home.html VeraCrypt]. We previously recommended Boxcryptor, but security issues were found in that software, so we now strongly recommend against using it. Note that while <code>[[iefolder]]</code> creates the Survey Encrypted Data folder, it does not encrypt it.


Admin data can be used either as the main data in the analysis or as secondary data that is combined with the main data in the analysis. A folder with admin data should be organized as much as possible like a survey round folder. The reason a small distinction is made is that admin data might not always be primary data collected by the project team. Admin data is often collected, cleaned and/or aggregated by someone other than the team. Therefore some of the recommendations made in the section above do not apply.
Consider, for example, a master dataset. The version with identifying information should be stored in the Survey Encrypted Data folder. The version [[De-identification | without identifying information]] should be stored in the Master Data folder. The latter version is quicker to access and can often be shared outside the research team.


If the admin data is collected as main data in multiple rounds, then it might make most sense to treat this data as a survey round, in the sense that one folder should be made for each round of data collection. But if the amount of work required for each round is small, then that is probably not the case.
=== Master Data ===
[[File:FolderMasterData.png |thumb|300px|Image 4. Example of a Master Data Set folder. (Click to enlarge.)]]


==MonitorData==
The Master Data folder stores information about all the observations for which data is collected, including observations both in and out of the sample. As it is sometimes necessary to identify observations outside of the sample, the Master Data folder should also include, for example,
*Census observations not sampled for the project, or
*Observations encountered in monitoring activities but not in the sample.
In the Master Data folder, we can also track any time-invariant information relevant to the project (e.g. assigned treatment status, treatment uptake, identifying information, dummy for being sampled). The datasets where we store this information are called [[Master Data Set |master datasets]].


Monitor data is data collected to understand the implementation of the assigned treatment in the field. Monitor data varies with each project, but can range from data on whether seeds were distributed in an agriculture project, to whether savings groups were formed and who joined them in a financial inclusion project, to whether school books were distributed in an education project. In comparison, the main data focuses on outcomes, such as a farmer's harvest size in the agriculture project, the amount of savings in the financial inclusion project and test scores in the education project. Without the information in the monitor data there is no way to know whether any change in the outcome can be attributed to the project. This is particularly interesting in the absence of change in outcome data.
A project needs one master dataset for each [[Unit of Observation|unit of observation]] used in the project (e.g. households, students, teachers, firms).


Monitor data is often collected the same way as surveys, but it could also be collected in the same way as admin data. Depending on which, follow the respective instructions from above. Most important is that monitor data is separate from the survey round data we are collecting. The main data tells us the result, the monitor data tells us about [https://en.wikipedia.org/wiki/Internal_validity internal validity].
==== Sampling and Treatment Assignment ====


==MasterData==
The Master Data folder should also include all activities performed on the main listing of all observations (i.e. sampling and treatment assignment). These activities should never be performed directly on census data, baseline data, or any other single data source. While census data will be used for sampling, first match the census data to the master data and check that it is consistent with whatever data already exists there. After that quality control step, sample directly from the master dataset, knowing that the randomized sample will be consistent with the other data sources.
[[File:FolderMasterData.png |thumb|300px|Image 6. Example of a Master Data Set folder. (Click to enlarge.)]]
Master data sets are data sets that make sure the observations are correctly identifiable across data sets. Master data sets also include important information such as sampling, treatment assignment etc. This is a very important topic and [[Master Data Set|master data sets have their own article]]. What matters for '''DataWork''' folder management is that the master data sets should have their own folder, as a master data set is a meta-dataset that stores data useful to all other datasets.


== Back to Parent ==
This article is part of the topic [[Data Management]]


[[Category: Data Management ]]
== Additional Resources ==
*DIME Analytics' guidelines on [https://github.com/worldbank/DIME-Resources/blob/master/welcome-iefolder.pdf iefolder]
*DIME Analytics’ [https://github.com/worldbank/DIME-Resources/blob/master/stata1-3-cleaning.pdf Data Management and Cleaning]
*DIME Analytics’ [https://github.com/worldbank/DIME-Resources/blob/master/stata2-3-data.pdf Data Management for Reproducible Research]
[[Category: Data_Management ]]

Latest revision as of 14:31, 12 June 2019

The DataWork folder is a structured, standardized data folder that increases project efficiency and reduces the risk of error. The DataWork folder houses all files related to a project’s data, including data files; questionnaires; data collection documentation; code for sampling, treatment assignment, and analysis; analysis output; and survey, monitoring, administrative, and secondary data. DIME strongly recommends using the DataWork folder from the beginning of the project and throughout its duration.

Read First

  • Use the iefolder Stata command to easily set up and update the DataWork folder.
  • Many DIME resources are easier to take advantage of when the DataWork folder is used.
  • Use the DataWork folder from the beginning of the project: reorganizing a project folder is time-consuming and cumbersome.
  • Even if your project has special structural requirements, use the DataWork folder as a starting point.

Creating the DataWork Folder

Image 1. Example of where the DataWork folder is in relation to Box/DropBox folders. (Click to enlarge.)

The DataWork folder is easily set up via the Stata command iefolder, which is part of the package ietoolkit.

The DataWork folder should be housed within the project folder, which contains a variety of other sub-folders (e.g. project budget and government communications). The DataWork folder and the broader project folder should be shared across project teams via Box, Dropbox, or a similar platform. Image 1 shows a DataWork folder housed within a project folder.
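
As a minimal sketch of this setup, the commands below install ietoolkit from SSC and create the DataWork folder inside a project folder. The project path is a hypothetical placeholder, and the iefolder syntax is written from memory of the command's help file, so confirm it with help iefolder before running.

```stata
* Install the ietoolkit package, which includes iefolder
ssc install ietoolkit, replace

* Hypothetical project folder path; replace with the path on your computer
global projectfolder "C:/Users/username/Dropbox/ProjectA"

* Create the DataWork folder and the project master do-file
* (syntax assumed from the iefolder help file)
iefolder new project, projectfolder("$projectfolder")
```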

Contents

Image 2. Example of a DataWork folder. (Click to enlarge.)

The DataWork folder contains a master do-file, Survey Round folders, a Survey Encrypted Data folder, and a Master Data folder. It may also contain additional documentation (e.g. a readme file) to help users navigate its contents.

Master Do-File

The project master do-file runs all other project do-files from cleaning to final analysis. It also sets up dynamic file paths so that multiple users can work from the same project folder shared via, for example, Dropbox or Box. This ensures that everyone gets the same results.

If you are new to a project folder, always start by finding the master do-file: it serves as a map of all files in the DataWork folder.
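
The dynamic file paths described above could look roughly like the sketch below. The user names, paths, global names and the do-file name are all hypothetical; the master do-file that iefolder generates may organize this differently.

```stata
* Each team member adds one block pointing to their own copy of the
* shared project folder (user names and paths are hypothetical)
if c(username) == "researcher1" {
    global projectfolder "C:/Users/researcher1/Dropbox/ProjectA"
}
if c(username) == "researcher2" {
    global projectfolder "/Users/researcher2/Box/ProjectA"
}

* Every other path is built relative to the project folder, so the
* rest of the code is identical for all users
global dataWorkFolder "$projectfolder/DataWork"
global baseline       "$dataWorkFolder/Baseline"

* Run a round-level master do-file (hypothetical file name)
do "$baseline/Baseline_MasterDofile.do"
```

Because only the first block differs between users, every do-file that refers to these globals runs unchanged on any team member's computer.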

Survey Rounds

Image 3. Example of a Survey Round folder. (Click to enlarge.)

Each Survey Round folder contains a master do-file specific to its data source, in addition to the following folders: DataSets, Dofiles, Outputs, Documentation, and Questionnaire. While the folders listed here are the ones created by iefolder, you can also create more folders that are unique to your project.

What Requires a Survey Round Folder?

Each source of data (e.g. baseline, follow-up, midline, endline, administrative, secondary) should have its own Survey Round folder within the DataWork folder. If the data source is collected continuously (e.g. administrative data, or secondary data such as macro data over time), then it requires only one Survey Round folder. However, if the data source is collected in stages (e.g. baseline and endline data), then it requires one Survey Round folder for each stage.

If you have multiple units of observation in a given data source, each unit of observation for each data source requires its own Survey Round folder. For example, if the units of observation during baseline data collection are students, teachers and schools, then baseline data requires three Survey Round folders: baseline_students, baseline_teachers and baseline_schools. If you choose, these folders may be nested within a parent Baseline folder. iefolder gives you the option of doing this by creating a Baseline subfolder and then creating Survey Round folders within it.

Note that each Survey Round needs to have a unique name across the project when using iefolder.
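
The sketch below shows how the folders in the example above might be created. It assumes the global projectfolder already holds the (hypothetical) project path and that iefolder accepts the round and subfolder item types and the subfolder() option as shown; check help iefolder for the exact syntax.

```stata
* A data source collected in one stage gets a single Survey Round folder
iefolder new round Endline, projectfolder("$projectfolder")

* Optional parent folder for the baseline rounds
iefolder new subfolder Baseline, projectfolder("$projectfolder")

* One Survey Round folder per unit of observation at baseline,
* each with a name that is unique across the project
iefolder new round baseline_students, projectfolder("$projectfolder") subfolder("Baseline")
iefolder new round baseline_teachers, projectfolder("$projectfolder") subfolder("Baseline")
iefolder new round baseline_schools,  projectfolder("$projectfolder") subfolder("Baseline")
```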

Survey Encrypted Data

Whenever you create either a new survey round or a new unit of observation with iefolder, the command creates a partner folder for it in the Survey Encrypted Data folder. Any identifying or sensitive data should be saved in the Survey Encrypted Data folder; the folder’s contents can easily be encrypted using software like VeraCrypt. We previously recommended Boxcryptor, but security issues were found in that software, so we now strongly recommend against using it. Note that while iefolder creates the Survey Encrypted Data folder, it does not encrypt it.

Consider, for example, a master dataset. The version with identifying information should be stored in the Survey Encrypted Data folder. The version without identifying information should be stored in the Master Data folder. The latter version is quicker to access and can often be shared outside the research team.
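
A minimal sketch of producing the two versions follows. The globals, file names and variable names are hypothetical, and the encrypted folder is assumed to be already decrypted or mounted when the code runs.

```stata
* Hypothetical globals pointing to the two folders
global encrypted  "$dataWorkFolder/EncryptedData"
global masterData "$dataWorkFolder/MasterData"

* Load the identified version from the encrypted folder
use "$encrypted/master_households_identified.dta", clear

* Drop the identifying variables (hypothetical variable names)
drop respondent_name phone_number gps_latitude gps_longitude

* Save the de-identified version in the Master Data folder
save "$masterData/master_households.dta", replace
```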

Master Data

Image 4. Example of a Master Data Set folder. (Click to enlarge.)

The Master Data folder stores information about all the observations for which data is collected, including observations both in and out of the sample. As it is sometimes necessary to identify observations outside of the sample, the Master Data folder should also include, for example,

  • Census observations not sampled for the project, or
  • Observations encountered in monitoring activities but not in the sample.

In the Master Data folder, we can also track any time-invariant information relevant to the project (e.g. assigned treatment status, treatment uptake, identifying information, dummy for being sampled). The datasets where we store this information are called master datasets.

A project needs one master dataset for each unit of observation used in the project (e.g. households, students, teachers, firms).

Sampling and Treatment Assignment

The Master Data folder should also include all activities performed on the main listing of all observations (i.e. sampling and treatment assignment). These activities should never be performed directly on census data, baseline data, or any other single data source. While census data will be used for sampling, first match the census data to the master data and check that it is consistent with whatever data already exists there. After that quality control step, sample directly from the master dataset, knowing that the randomized sample will be consistent with the other data sources.
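
The workflow above can be sketched as follows. The globals, file names, ID variable, seed and sample size are all hypothetical, and the check shown is only the simplest possible one; a real project would investigate every unmatched observation before sampling.

```stata
* Pin the Stata version so the random-number generator behaves the
* same way every time this code is run
version 13

* Match the census data to the existing master dataset
use "$masterData/master_households.dta", clear
merge 1:1 hh_id using "$census/census_households.dta"

* Quality control: review which observations matched and which did not
tabulate _merge
drop _merge

* Reproducible sampling needs a stable sort order and a fixed seed
isid hh_id, sort
set seed 215597
gen double random_draw = runiform()
sort random_draw

* Flag the first 200 observations in random order as sampled
gen byte sampled = (_n <= 200)

* Store the sampling result in the master dataset
sort hh_id
save "$masterData/master_households.dta", replace
```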

Back to Parent

This article is part of the topic Data Management

Additional Resources

  • DIME Analytics' guidelines on iefolder
  • DIME Analytics’ Data Management and Cleaning
  • DIME Analytics’ Data Management for Reproducible Research