Difference between revisions of "DataWork Folder"

Revision as of 20:49, 14 April 2019

The DataWork folder is a structured, standardized data folder that increases project efficiency and reduces the risk of error. The DataWork folder houses all files related to a project’s data, including data files; questionnaires; data collection documentation; code for sampling, treatment assignment, and analysis; analysis output; and survey, monitoring, administrative, and secondary data. DIME strongly recommends using the DataWork folder from the beginning of the project and throughout its duration.

Read First

  • Use the iefolder Stata command to easily set up and update the DataWork folder.
  • Many DIME resources are easier to take advantage of when the DataWork folder is used.
  • Use the DataWork folder from the beginning of the project: reorganizing a project folder is time-consuming and cumbersome.
  • Even if your project has special structural requirements, use the DataWork folder as a starting point.

Creating the DataWork Folder

Image 1. Example of where the DataWork folder sits in relation to Box/Dropbox folders.

The DataWork folder is easily set up via the Stata command iefolder, which is part of the package ietoolkit.

The DataWork folder should be housed within the project folder, which contains a variety of other sub-folders (e.g. project budget and government communications). The DataWork folder and the broader project folder should be shared across project teams via Box, Dropbox, or a similar platform. Image 1 shows a DataWork folder housed within a project folder.
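The setup described above can be sketched in Stata as follows. This is a minimal illustration, not an official template: the project folder path is a hypothetical placeholder, and the global name `projectfolder` is chosen here only for readability.

```stata
* Install ietoolkit from SSC (only needed once per machine)
ssc install ietoolkit, replace

* Hypothetical path to the shared project folder; replace with your own
global projectfolder "C:/Users/username/Dropbox/ProjectABC"

* Create the DataWork folder inside the project folder
iefolder new project, projectfolder("${projectfolder}")
```

Run this once at the start of the project. Later additions to the folder structure use other `iefolder` subcommands rather than re-creating the project.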

Contents

Image 2. Example of a DataWork folder.

The DataWork folder contains a master do-file, Survey Round folders, a Survey Encrypted Data folder, and a Master Data folder. It may also contain additional documentation (e.g. a readme file) to help users navigate its contents.

Master Do-File

The project master do-file runs all other project do-files from cleaning to final analysis. It also sets up dynamic file paths so that multiple users can work from the same project folder shared via, for example, Dropbox or Box. This ensures that everyone gets the same results.

If you are new to a project folder, always start by finding the master do-file: it serves as a map of all files in the DataWork folder.
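To make the idea of dynamic file paths concrete, here is a hand-written sketch of the relevant part of a master do-file. It is not the exact file that iefolder generates: the usernames, folder paths, global names, and do-file names are all hypothetical.

```stata
* Set the root folder for each user (usernames and paths are hypothetical)
if c(username) == "researcher_a" {
    global projectfolder "C:/Users/researcher_a/Dropbox/ProjectABC"
}
if c(username) == "researcher_b" {
    global projectfolder "/Users/researcher_b/Box/ProjectABC"
}

* All other paths are built relative to the root folder
global dataWorkFolder "${projectfolder}/DataWork"
global baseline       "${dataWorkFolder}/Baseline"

* Run the project do-files in order (file names are hypothetical)
do "${baseline}/Dofiles/cleaning.do"
do "${baseline}/Dofiles/analysis.do"
```

Because only the root folder differs between users, every other path resolves to the same files in the shared folder, whichever machine the code runs on.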

Survey Rounds

Image 3. Example of a Survey Round folder.

Each Survey Round folder contains a master do-file specific to its data source, in addition to the following folders: DataSets, Dofiles, Outputs, Documentation, and Questionnaire. While the folders listed here are standardized by iefolder, you can also create more folders that are unique to your project.

What Requires a Survey Round Folder?

Each source of data (e.g. baseline, follow-up, midline, endline, administrative, secondary) should have its own Survey Round sub-folder within the DataWork folder. If the data source is collected continuously (e.g. administrative data, or secondary data such as macro data over time), then it requires only one Survey Round folder. However, if the data source is collected in stages (e.g. baseline and endline data), then it requires one Survey Round folder for each stage.

If you have multiple units of observation in a given data source, each unit of observation for each data source requires its own Survey Round folder. For example, if the units of observation during baseline data collection are students, teachers and schools, then baseline data requires three Survey Round folders: baseline_students, baseline_teachers and baseline_schools. iefolder gives you the option of creating a subfolder called baseline and creating all baseline Survey Round folders within it.

Note that each Survey Round needs to have a unique name across the project when using iefolder.
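The students, teachers and schools example could be set up roughly as follows. This sketch assumes the DataWork folder has already been created with iefolder and uses a hypothetical project folder path. Check `help iefolder` for the options available in your installed version.

```stata
* Hypothetical path to the shared project folder; replace with your own
global projectfolder "C:/Users/username/Dropbox/ProjectABC"

* Create a subfolder named baseline to group the baseline rounds
iefolder new subfolder baseline, projectfolder("${projectfolder}")

* Create one Survey Round folder per unit of observation in that subfolder;
* each round name must be unique across the project
iefolder new round baseline_students, projectfolder("${projectfolder}") subfolder(baseline)
iefolder new round baseline_teachers, projectfolder("${projectfolder}") subfolder(baseline)
iefolder new round baseline_schools,  projectfolder("${projectfolder}") subfolder(baseline)
```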

Survey Encrypted Data

Whenever you create either a new survey round or new unit of observation with iefolder, the command creates a partner folder for each survey round or unit of observation in the Survey Encrypted Data folder. Any identifying or sensitive data should be saved in the Survey Encrypted Data folder; the folder’s contents can easily be encrypted using software like Boxcryptor. Note that while iefolder creates the Survey Encrypted Data folder, it does not encrypt it.

Consider, for example, a master dataset. The version with identifying information should be stored in the Survey Encrypted Data folder. The version without identifying information should be stored in the Master Data folder. The latter version is quicker to access and can often be shared outside the research team.

Master Data

Image 4. Example of a Master Data Set folder.

The Master Data folder stores information about all the observations for which data is collected, including observations both in and out of the sample. As it is sometimes necessary to identify observations outside of the sample, the Master Data folder should also include, for example,

  • Census observations not sampled for the project, or
  • Observations encountered in monitoring activities but not in the sample.

In the Master Data folder, we can also track any time-invariant information relevant to the project (e.g. assigned treatment status, treatment uptake, identifying information, a dummy variable for being sampled). The datasets where we store this information are called master datasets.

A project needs one master dataset for each unit of observation used in the project (e.g. households, students, teachers, firms).
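With iefolder, a unit of observation can be registered with a command along these lines. This is a sketch under the assumption that the DataWork folder already exists. The path and the unit name `students` are hypothetical.

```stata
* Hypothetical path to the shared project folder; replace with your own
global projectfolder "C:/Users/username/Dropbox/ProjectABC"

* Add the folders for a students master dataset, including the
* partner folder in the Survey Encrypted Data folder
iefolder new unitofobs students, projectfolder("${projectfolder}")
```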

Sampling and Treatment Assignment

The Master Data folder should also include all activities performed on the main listing of all observations, such as sampling and treatment assignment. These activities should never be performed directly on census data, baseline data, or any other individual data source. While census data will be used for sampling, it is important to first match the census data to the master data and check that it is consistent with whatever data exists there already. After that quality control step, sample directly from the master dataset, knowing that the randomized sample will make sense in relation to other data sources.
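The workflow above (match census data to the master dataset, check the match, then sample from the master dataset) might look like the following sketch. All file names, folder paths, variable names, the seed, and the sample size of 500 are hypothetical and chosen only for illustration.

```stata
* Hypothetical folder paths; replace with your own
global dataWorkFolder "C:/Users/username/Dropbox/ProjectABC/DataWork"
global masterData     "${dataWorkFolder}/MasterData"
global census         "${dataWorkFolder}/Census/DataSets"

* Load the master dataset (one row per household, identified by hh_id)
use "${masterData}/master_households.dta", clear

* Match census data to the master dataset and inspect the result
merge 1:1 hh_id using "${census}/census_households.dta"
tabulate _merge

* Stop if any census household is missing from the master dataset
assert _merge != 2
drop _merge

* Draw a reproducible random sample of 500 households
version 15
set seed 20190414
isid hh_id, sort
generate double random_draw = runiform()
sort random_draw hh_id
generate byte sampled = (_n <= 500)
```

Fixing the Stata version, setting the seed, and sorting on a unique ID before drawing random numbers are what make the sample reproducible. Sampling from the master dataset, rather than from the census file, keeps the sampled flag stored alongside the other time-invariant information.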

Back to Parent

This article is part of the topic Data Management

Additional Resources