DataWork Folder


Revision as of 16:07, 11 February 2017

Since the DataWork folder is set up to be used throughout the impact evaluation project, it is important to set it up correctly. A correctly organized folder makes the data work more efficient and reduces the sources of error in it.

Inside the DataWork folder

Example of a data folder structure used during the course of an impact evaluation project.

Inside the DataWork folder there should only be folders and files that help you navigate those folders. Each folder should correspond to a data source (survey rounds, admin data, monitoring data) or be a folder containing meta data sets such as the master data set.

The most important file for navigating the sub-folders of the DataWork folder is the main master do-file. This do-file calls the master do-file of each data source folder and meta data folder, and re-runs all code needed to generate all data sets and output in the DataWork folder. Another example of a file that helps with navigation is a Word document or PDF describing how to navigate the sub-folders. It is important that the number of files here is kept to an absolute minimum.
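One way to keep the top level of the DataWork folder minimal is to list any file that is not a known navigation aid. The sketch below is only an illustration: the function name and the file names in `ALLOWED_FILES` are hypothetical and should be replaced with whatever your project actually uses.

```python
from pathlib import Path

# Hypothetical names of the navigation files allowed at the top level.
ALLOWED_FILES = {"main_master.do", "README.pdf"}

def stray_top_level_files(datawork, allowed=ALLOWED_FILES):
    """Return the names of top-level files that are not navigation aids.

    Folders are ignored, since the top level is expected to hold one
    folder per data source or meta data set.
    """
    root = Path(datawork)
    return sorted(
        entry.name
        for entry in root.iterdir()
        if entry.is_file() and entry.name not in allowed
    )
```

Calling `stray_top_level_files("DataWork")` returns a sorted list of file names that should probably be moved into one of the sub-folders.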

Survey Round

Each round of the survey should have its own sub-folder inside the data folder. For example, inside the main data folder you can have sub-folders like baseline, follow up 1, follow up 2, midline, endline, etc. Each of these folders should have the folders described below.

DataSets folder

The DataSets folder inside a Survey Round folder should be further divided into three sub-folders. These folders are called Raw, Intermediate and Final. Raw and Final have strict rules on which datasets can be saved in those folders. All other datasets should be saved in Intermediate.

Raw Folder

This folder should contain the datasets in exactly the state in which you received them. This includes data downloaded from the internet, data received from data collection, and data received from other projects. Absolutely no changes should be made to the data in this folder. Even simple changes, such as correcting obvious mistakes, renaming variables, converting from csv to Stata or another format, or renaming files, should not be made here. The only exception is when a file name must be changed before the file can be imported; that change may be made in this folder.

If there are mistakes in your raw datasets that you know of, write a do-file that corrects each mistake and then save the corrected data in the Intermediate folder. This is the only way corrections can be fully documented. If corrections to the datasets are not documented, then we will not fully understand the quality of our data, and that means that we will not fully understand the quality of our research.
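The workflow described here (read from Raw, correct in code, save to Intermediate) would normally live in a Stata do-file. As a minimal sketch of the same idea in Python, the hypothetical function below reads a raw CSV, applies an explicit list of corrections, and writes the result to a separate path. It assumes the file has a header row and a column named `id` that identifies each observation; both are assumptions made for illustration.

```python
import csv
from pathlib import Path

def correct_raw_csv(raw_path, intermediate_path, corrections):
    """Write a corrected copy of a raw CSV without touching the raw file.

    `corrections` maps (id value, column name) to the corrected value,
    so every change is written down in code rather than made by hand.
    """
    raw_path, intermediate_path = Path(raw_path), Path(intermediate_path)

    # The raw file is opened for reading only.
    with raw_path.open(newline="") as src:
        reader = csv.DictReader(src)
        fieldnames = reader.fieldnames or []
        rows = list(reader)

    # Apply each documented correction to the in-memory copy.
    for row in rows:
        for column in fieldnames:
            key = (row["id"], column)
            if key in corrections:
                row[column] = corrections[key]

    # Save the corrected data outside the Raw folder.
    intermediate_path.parent.mkdir(parents=True, exist_ok=True)
    with intermediate_path.open("w", newline="") as dst:
        writer = csv.DictWriter(dst, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
```

Because the corrections are passed in as data, the script itself documents which values were changed and why they differ from the raw file.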

Intermediate Folder

This folder should contain all datasets that do not belong in either the Raw folder (see above) or the Final folder (see below). Raw datasets to which simple changes have been made, as mentioned above, should be put in the Intermediate folder. Since this is a work-in-progress folder, there are no specific rules for how it should be organized. It still makes sense to keep it organized in sub-folders. Read the naming convention before saving multiple versions of the same data set named _v1, _v2 etc.

Final folder

This folder should contain the data sets that are cleaned and have the final variables constructed. All datasets in this folder should be clearly marked as identified or de-identified. There should only be one version of each data set in this folder. If there are many different data sets in this folder, for example student dataset, school dataset, teacher dataset etc., then the folder should have sub-folders. This is the folder most likely to be visited several years after the project has ended by someone who has very little knowledge of the project. It should therefore be one of the folders organized in the greatest detail.

DoFiles Folder

Each survey round folder will also contain a do-file folder. At the top level of this folder there should only be a master do-file and sub-folders. This master do-file is the map to where you find all do-files and all datasets needed for this survey round.

The sub-folders in this folder should be organized according to task, for example import, cleaning, analysis. All do-files related to each of these tasks should be saved in those folders. Again, use naming conventions and version control rather than keeping multiple versions of the same do-file.

Output Folder

The output folder should have the raw and final table output folders inside it.

Documentation

This folder will contain the documentation for the analysis done including any duplicate reports, survey logs, etc.
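The layout of a survey round folder described above can be created in one step rather than by hand. The following is a minimal sketch, not an official tool: the function name is hypothetical, and the exact spellings of the sub-folder names (in particular `Output/Raw` and `Output/Final`) are assumptions based on the descriptions in this article.

```python
from pathlib import Path

# Sub-folders each survey round gets, following the sections above.
# The exact folder spellings are assumed for illustration.
ROUND_LAYOUT = [
    "DataSets/Raw",
    "DataSets/Intermediate",
    "DataSets/Final",
    "DoFiles",
    "Output/Raw",
    "Output/Final",
    "Documentation",
]

def create_round_folders(datawork, round_name):
    """Create the sub-folder skeleton for one survey round.

    Existing folders are left untouched, so the function is safe to
    re-run when a new round (for example "Endline") is added.
    """
    round_dir = Path(datawork) / round_name
    for sub in ROUND_LAYOUT:
        (round_dir / sub).mkdir(parents=True, exist_ok=True)
    return round_dir
```

For example, `create_round_folders("DataWork", "Baseline")` creates `DataWork/Baseline` with the sub-folders listed in `ROUND_LAYOUT`. Using the same function for every round keeps the rounds structured identically, which makes the master do-files easier to write.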

Back to Parent

This article is part of the topic Data Management
