DataWork Folder

A well-organized data folder reduces the risk of many types of errors. At DIME, we have a standardized folder structure. Some projects have special folder requirements and use the standard setup only as a starting point, but many resources created by DIME are easier to take advantage of if this template is followed. Reorganizing a project folder takes a lot of work, so we strongly recommend that projects follow our standard from the beginning. A poorly set up folder leads to inefficiency and increases the risk of errors over several years.

We have published a command called iefolder in our package ietoolkit, which is available on SSC. iefolder sets up the recommended folder structure described in this article for you.

Read First

  • Do not set up these folders manually. iefolder is a Stata command that easily sets up and updates this folder structure for you.

Where should the DataWork folder be created?

Image 1. Example of where the DataWork folder is located in relation to Box/Dropbox folders.

Most project folders are shared across the project team using Dropbox, Box, or a similar service. In this shared project folder, there are different folders for the project budget, government communications, etc. The DataWork folder is assumed to be one of them.

Image 1 shows an example of a Box/Dropbox folder with three project folders. All three projects have a similar sub-folder structure, but the image shows the sub-folder structure of only one of them. The DataWork folder is highlighted with a red circle.

Anything related to the data of a project has a designated location inside this folder. This includes data files, sampling and treatment assignment code, questionnaires, data collection documentation, analysis code, analysis output, etc. It covers data we collect ourselves, both regular survey rounds and monitor data, as well as other sources of data such as admin data or secondary data.

Inside the DataWork folder

Image 2. Example of a DataWork folder.

Inside the DataWork folder there should only be folders and one file that helps you navigate those folders. The folder structure below is a template that we recommend for all projects, although we understand that some projects have special requirements and will therefore use it only as a starting point. Everything described here is easily set up using the Stata command iefolder.

The DataWork folder should have a folder called Encrypted Data. This folder creates a branch in the DataWork folder that can easily be encrypted using encryption software such as Boxcryptor. Note that while iefolder creates a folder called Encrypted Data when you first set up the DataWork folder, iefolder does not encrypt it.

After the project folder is initially set up, there are two types of folders to be added: Survey Rounds and Master Data Sets. A survey round can be more than just a baseline, an endline, etc. It can also be data collected for monitoring purposes, admin data, or any other source of data used in the project. A project needs one master data set for each unit of observation used in the project. When a round or another unit of observation is added, folders are created both inside the encrypted branch of the DataWork folder and outside it.

In addition to the folders described above, there is also a Project_MasterDofile inside the DataWork folder. This file has three purposes. The first two are described in detail in the article on Master Dofiles, but are described briefly below:

  • The first purpose is to make it possible to run all code related to a project by running only one do-file. This is incredibly important for replicability.
  • The second purpose is to set up globals with folder paths. These enable dynamic file paths, which allow multiple users to run the same code, shorten the file paths, and make it possible to move files and folders with minimal updates to the code.
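
The idea behind the second purpose does not depend on Stata. The following Python sketch is only an illustration, not the master do-file itself: the user names, root paths and the Baseline round are hypothetical. Each user defines the root of the project folder once, and every other path is derived from that root.

```python
from pathlib import PurePosixPath

# Hypothetical per-user roots; only this mapping differs between team members.
PROJECT_ROOTS = {
    "alice": "/Users/alice/Dropbox/ProjectA",
    "bob": "/home/bob/Box/ProjectA",
}


def folder_paths(user):
    """Return the folder paths for one user, all derived from a single root."""
    root = PurePosixPath(PROJECT_ROOTS[user])
    data_work = root / "DataWork"
    return {
        "dataWorkFolder": data_work,
        "encryptedFolder": data_work / "Encrypted Data",
        "baseline": data_work / "Baseline",  # hypothetical survey round name
    }


paths = folder_paths("alice")
# Code refers to folders by name, so it runs unchanged on another user's computer.
baseline_data = paths["baseline"] / "DataSets"
print(baseline_data)
```

If the project folder is moved, only the root in the mapping needs to change; every path built from it follows automatically.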

The third purpose is that this file is the main map to the DataWork folder. Since all code can be run from this file, and since all outputs are (indirectly) created by this file, it is the starting point for finding where any do-file, data set or output is located in the DataWork folder. Other examples of files that help with navigating the folder are a Word document or a PDF describing how to navigate the sub-folders. Such files are not included in our folder template, but may sometimes be a good addition. However, those files need to be updated in parallel with the folder, which often does not happen even when that is the intention.

Survey Round


Image 3. Example of a Survey Round folder.

Baselines, follow-up surveys, midlines and endlines are all examples of a Survey Round. This is the data that, in impact evaluations, we test for changes over time and for whether those changes differ significantly between treatment and control. In contrast, the information in master data sets, such as the ID we assigned, whether an observation was sampled for baseline, and whether it was assigned to treatment or control, is time invariant and remains the same over the course of the project. Monitor data might change over time. For example, in an impact evaluation running over many years, an observation might not take up the treatment in the first year but might do so the next year.

Each survey round should have its own sub-folder inside the DataWork folder. For example, inside the DataWork folder you can have sub-folders like baseline, follow up 1, follow up 2, midline, endline, etc. See Image 3 for an example. When you create a survey round folder using the command iefolder, all sub-folders and sub-sub-folders described below are created for you, and all your master do-files are updated or created accordingly.

Inside each Survey Round folder you will find a master do-file for that survey round as well as the following folders: DataSets, Dofiles, Outputs, Documentation, and Questionnaire.
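
As a complement to iefolder, a team might want a read-only check that a survey round folder still contains the sub-folders listed above. The Python sketch below is a hypothetical helper, not part of ietoolkit. It creates folders only inside a temporary directory, purely to demonstrate the check; real project folders should still be created with iefolder.

```python
import tempfile
from pathlib import Path

# Sub-folder names as listed in this article.
EXPECTED_SUBFOLDERS = ["DataSets", "Dofiles", "Outputs", "Documentation", "Questionnaire"]


def missing_subfolders(round_folder):
    """Return the expected sub-folders that do not exist in round_folder."""
    round_folder = Path(round_folder)
    return [name for name in EXPECTED_SUBFOLDERS if not (round_folder / name).is_dir()]


# Demonstration in a temporary directory with a deliberately incomplete round folder.
with tempfile.TemporaryDirectory() as tmp:
    baseline = Path(tmp) / "DataWork" / "Baseline"
    for name in ["DataSets", "Dofiles", "Outputs"]:
        (baseline / name).mkdir(parents=True)
    print(missing_subfolders(baseline))  # prints ['Documentation', 'Questionnaire']
```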

See DataWork Survey Round for more details if the purpose of any of these folders is not perfectly clear to you.

Multiple Units of Observation

If you have multiple units of observation in a survey round, for example farmers and villages, or students, teachers and schools, then you should create a survey round folder for each unit of observation. In the second example, you would end up with one survey round folder called baseline_students, one called baseline_teachers and one called baseline_schools.

Sampling and Treatment Assignment

Sampling and Treatment Assignment folders have intentionally been left out of the survey round folders. Separate folders for those tasks have been created in the master data set folder, as we strongly recommend that sampling and treatment assignment are done directly from the master data sets.

MonitorData


Monitor data is data collected to understand the implementation of the assigned treatment in the field. Survey round data helps us understand any changes in outcome variables that the treatment and other factors have caused during our project, and monitor data helps us understand the treatment itself. For example, who actually received the treatment, and was the treatment carried out according to the research design? Monitor data helps us understand what is usually referred to as internal validity.

Since the purposes of survey round data and monitor data are slightly different, we recommend that you keep the two types of data in different folders. When survey round data and monitor data overlap, however, it is not always feasible to keep them separated. The method used to collect monitor data varies with each project. It can be survey data, observation data, or admin data provided by our partners. Monitoring data and survey round data should be kept separate unless the monitoring data was a part of the survey round interviews. For example, if an enumerator visited a training for the respondents in the treatment group at the time of a midline survey and recorded attendance, then that data should be kept separate from the survey round data. But if the respondents were asked during a survey round interview whether they attended the training, then it is not always worth the effort to separate the monitor data from the survey data.

Monitor data has the same sub-folder structure as a survey round folder. The only difference from a folder structure perspective is that monitor data folders are all organized as sub-folders of a folder in DataWork called MonitorData.

Admin Data


Admin data is data that has been collected for other purposes but has been shared with the research team. It can be used in the analysis either the way survey round data is used or the way monitoring data is used. For example, if the outcome variable the research team is trying to evaluate is also measured in some other way, then that admin data can be used to measure change in the outcome. One example is standardized test scores. We can also often use the internal data of our implementation partner as monitor data, which is an example of admin data serving as monitoring data. There are also many cases where the same admin data set is used for both purposes.

We recommend that the project team classify each admin data set as either survey round data or monitoring data and organize it in the DataWork folder accordingly.

MasterData


Image 6. Example of a Master Data Set folder.

Master data sets are the best tool for avoiding errors related to using multiple sources of data in one project. All impact evaluations use multiple sources of data. Multiple survey rounds are, in this sense, different sources of data. A census carried out before a data collection is a different source of data. Monitoring data and admin data combined with any other data are further examples. Master data sets should therefore be used in all projects, and iefolder creates the master data set folder by default when the DataWork folder is created.

A master data set is a listing of all observations we have ever encountered in relation to the project (not just the observations we sampled). In this listing, we keep identifying information and the unique ID that we have assigned. We also keep time-invariant information required at multiple stages of the research work. Examples are a dummy for being sampled in baseline, a categorical variable for treatment arm, a dummy for correctly receiving the assigned treatment arm, etc. See the Master Data Sets article for more details.

Master Data Sets is more than just a folder. It is a methodology for internal research validity in an environment with multiple data sets. See the Master Data Sets article for information on both how the master data set folder should be organized and how it should be used for research quality purposes. One important motto is that if you do any merge (string or numeric) where the variable you merge on is not a proper ID variable, then one of the data sets you are merging should always be a master data set.
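
To apply this motto, you first need to know whether a merge variable is a proper ID variable. The Python sketch below assumes that "proper" means the variable is never missing and never repeated within a data set. That reading, the function name and the household records are all illustrative assumptions, not definitions from the Master Data Sets article.

```python
from collections import Counter


def is_proper_id(rows, key):
    """True if every row has a non-missing value for key and no value repeats."""
    values = [row.get(key) for row in rows]
    if any(v is None or v == "" for v in values):
        return False
    return all(count == 1 for count in Counter(values).values())


# Hypothetical household records: hh_id is unique, village is not.
households = [
    {"hh_id": 101, "village": "North"},
    {"hh_id": 102, "village": "North"},
    {"hh_id": 103, "village": "South"},
]

print(is_proper_id(households, "hh_id"))    # prints True
print(is_proper_id(households, "village"))  # prints False
```

A merge on a variable that fails this check, such as a name string, is the kind of merge where one side should be a master data set.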

Sampling and Treatment Assignment

The master data sets folder should also include all activities that are best done directly on the main listing of all observations. The two best examples of such tasks in the context of impact evaluation are sampling and treatment assignment. These should never be done directly on census data, baseline data, etc. When we have census data, we use that data for sampling, but we always want to first match the census data to the master data and check that it makes sense in relation to whatever data we already have there. After that quality control step, we can sample directly from the master data set, knowing that the sample we randomize will make sense in relation to other data sources.
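
Sampling directly from the master listing can be sketched as follows. This Python example is only an illustration of the idea, not DIME's sampling code: the IDs, the sample size and the seed are hypothetical. The IDs are sorted and the seed is fixed so that the same draw can be reproduced, and the result is stored as a sampled dummy of the kind the master data set keeps.

```python
import random


def draw_sample(master_ids, sample_size, seed):
    """Draw a reproducible simple random sample of IDs from the master listing."""
    rng = random.Random(seed)      # fixed seed so the draw can be replicated
    ordered = sorted(master_ids)   # stable order so the input order does not matter
    return set(rng.sample(ordered, sample_size))


# Hypothetical master listing of 20 household IDs.
master_ids = [f"HH{n:03d}" for n in range(1, 21)]

sample = draw_sample(master_ids, 5, seed=20240101)

# Sampled dummy for every observation in the master listing, sampled or not.
sampled_flag = {hh: int(hh in sample) for hh in master_ids}
print(sum(sampled_flag.values()))  # prints 5
```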

Back to Parent

This article is part of the topic Data Management