Difference between revisions of "DataWork Folder"
Kbjarkefur (talk | contribs) |
Revision as of 20:22, 10 March 2017
A well-organized data folder reduces the risk of many types of errors. At DIME we have a standardized folder structure. Some projects have special folder requirements and only use the folder setup as a starting point, but many resources created by DIME are easier to take advantage of if this template is followed. It takes a lot of work to reorganize a project folder, so we strongly recommend that projects follow our standard from the beginning. A poorly set up folder causes inefficiency and increases the risk of errors for several years.
We have published a command called iefolder in our package ietoolkit, which is available on SSC. iefolder sets up the recommended folder structure described in this article for you.
Read First
Where should the DataWork folder be created?
Most folders are shared across the project team using Dropbox, Box or a similar service. The DataWork folder
Inside the DataWork folder
Inside the DataWork folder there should only be folders and files that help navigate those folders. Each folder should correspond to a data source (survey rounds, admin data, monitor data) or be a folder containing meta-data sets such as the master data set.
The most important file that helps navigate all sub-folders of the DataWork folder is the main master do-file. This do-file calls the master do-file of each data source and meta-data folder and re-runs all code needed to generate all data sets and output in the DataWork folder. Another example of a file that helps with navigation is a Word document or a PDF describing how to navigate the sub-folders. It is important that the number of files here is kept to an absolute minimum.
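The rule above (only folders plus a handful of navigation files at the top level) can be checked mechanically. The following is a minimal Python sketch, not part of the DIME tooling; the file names in the allow-list are hypothetical and would need to match whatever the project actually uses.

```python
import tempfile
from pathlib import Path

# Hypothetical allow-list: the only files we expect at the top level of DataWork.
ALLOWED_TOP_LEVEL_FILES = {"Project_MasterDofile.do", "README.pdf"}


def unexpected_top_level_files(datawork, allowed=ALLOWED_TOP_LEVEL_FILES):
    """Return the names of top-level files that are not on the allow-list.

    Sub-folders are ignored, since folders are always permitted here.
    """
    return sorted(
        entry.name
        for entry in Path(datawork).iterdir()
        if entry.is_file() and entry.name not in allowed
    )


if __name__ == "__main__":
    # Build a throwaway DataWork folder to demonstrate the check.
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp) / "DataWork"
        (root / "Baseline").mkdir(parents=True)
        (root / "MasterData").mkdir()
        (root / "Project_MasterDofile.do").write_text("* main master do-file\n")
        (root / "old_notes.txt").write_text("stray file\n")
        print(unexpected_top_level_files(root))  # ['old_notes.txt']
```

An empty list means the top level contains nothing beyond folders and the agreed navigation files.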
Survey Round
Each round of the survey should have its own sub-folder inside the data folder. For example, inside the main data folder you can have sub-folders such as baseline, follow up 1, follow up 2, midline and endline. Each of these folders should have the folders described below.
DataSets folder
The DataSets folder inside a Survey Round folder should be further divided into three sub-folders. These folders are called Raw, Intermediate and Final. Raw and Final have strict rules on which datasets can be saved in those folders. All other datasets should be saved in Intermediate.
- Raw Folder
This folder should contain the datasets in exactly the state in which you received them. This includes data downloaded from the internet, data received from data collection, and data received from other projects. Absolutely no changes should be made to the data in this folder. Even simple changes, such as correcting obvious mistakes, renaming variables, converting from csv to Stata or another format, or renaming files, should not be made here. The only exception is when a file name must be changed for the file to be imported; that change can be made in this folder.
If there are mistakes in your raw datasets that you know of, write a do-file that corrects each mistake and then save the corrected data in the Intermediate folder. This is the only way corrections can be fully documented. If corrections to the datasets are not documented, then we will not fully understand the quality of our data, and that means that we will not fully understand the quality of our research.
- Intermediate Folder
This folder should contain all datasets that do not belong in either the Raw folder (see above) or the Final folder (see below). Raw datasets to which simple changes have been made, as mentioned above, should be put in the Intermediate folder. Since this is a work-in-progress folder, there are no specific rules for how it should be organized. It still makes sense to keep it organized in sub-folders. Read the naming convention before saving multiple versions of the same data set named _v1, _v2 etc.
- Final folder
This folder should contain the data sets that are cleaned and have the final variables constructed. All datasets in this folder should be clearly marked as either identified or de-identified. There should only be one version of each data set in this folder. If there are many different data sets in this folder, for example a student dataset, a school dataset and a teacher dataset, then the folder should have sub-folders. This is the folder most likely to be visited several years after the project has ended by someone who has very little knowledge of the project. It should therefore be one of the folders organized in the greatest detail.
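The rule for the Raw folder described above (never edit the raw file; put every correction in code and save the result in Intermediate) can be sketched as follows. The article recommends a Stata do-file for this; the Python below is only an illustration of the same workflow, and the column names, ID values and file names are all hypothetical.

```python
import csv
import tempfile
from pathlib import Path

# Hypothetical documented corrections: (ID value, column, corrected value).
CORRECTIONS = [("hh_102", "age", "34")]


def apply_corrections(raw_file, intermediate_file, id_column, corrections):
    """Read the raw file, apply the listed corrections, save to Intermediate.

    The raw file is only opened for reading, so it is never modified.
    Returns the number of rows written.
    """
    with open(raw_file, newline="") as src:
        reader = csv.DictReader(src)
        fieldnames = reader.fieldnames
        rows = list(reader)

    for row in rows:
        for row_id, column, value in corrections:
            if row[id_column] == row_id:
                row[column] = value

    Path(intermediate_file).parent.mkdir(parents=True, exist_ok=True)
    with open(intermediate_file, "w", newline="") as dst:
        writer = csv.DictWriter(dst, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        raw = Path(tmp) / "DataSets" / "Raw" / "survey.csv"
        raw.parent.mkdir(parents=True)
        raw.write_text("hh_id,age\nhh_101,29\nhh_102,340\n")
        out = Path(tmp) / "DataSets" / "Intermediate" / "survey_corrected.csv"
        apply_corrections(raw, out, "hh_id", CORRECTIONS)
        print(out.read_text())
```

Because the corrections live in the script rather than in the data file, anyone can later see exactly what was changed and why.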
DoFiles Folder
Each survey round folder will also contain a DoFiles folder. In the top level of this folder there should only be a master do-file and sub-folders. This master do-file is the map to where you find all do-files and all datasets needed for this survey round.
The sub-folders in this folder should be organized according to task, for example import, cleaning and analysis. All do-files related to each of these tasks should be saved in the corresponding folder. Again, use naming conventions and version control rather than keeping multiple versions of the same do-file.
Output Folder
The output folder should also be organized in sub-folders. Which sub-folders are needed depends on what is outputted and the method used to do so. If, for example, both tables and graphs are outputted, then it probably makes sense to have separate folders for each of them. In these folders there should be sub-folders called raw and formatted if relevant. This is relevant, for example, when tables are exported in .csv format and the data is copied to a formatted .xls file, or when single graphs are outputted to disk and later combined into one graph file with multiple graphs.
This folder is one of the few examples where it could be good practice to save multiple versions of the same file. The reason is that it is common to compare different versions of the analysis, and it is convenient to be able to do so without using version control software to re-generate an old output. But do not use the convention of calling the files table_v1, table_v2 etc. Name them by date instead, for example table_Apr30, table_Jun6 etc. However, do not do this if the output files are large, since multiple versions of a file will then soon take up a lot of disk space.
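The date-based naming described above can be generated rather than typed by hand. This is a small Python sketch under the assumption that the pattern is stem, underscore, abbreviated English month, day of month; the function name and the extension handling are illustrative only.

```python
from datetime import date

# Fixed English abbreviations so the result does not depend on the system locale.
MONTHS = ("Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")


def dated_name(stem, extension, on=None):
    """Build a file name such as table_Apr30.csv for the given date.

    If no date is given, today's date is used.
    """
    on = on or date.today()
    return f"{stem}_{MONTHS[on.month - 1]}{on.day}{extension}"


if __name__ == "__main__":
    print(dated_name("table", ".csv", date(2017, 4, 30)))  # table_Apr30.csv
    print(dated_name("table", ".csv", date(2017, 6, 6)))   # table_Jun6.csv
```

Note that this pattern omits the year, as in the examples in the text; a project that runs over several years may want to add it.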
Documentation
This folder will contain the documentation for the analysis done including any duplicate reports, survey logs, etc.
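To make the survey round layout described in the sections above concrete, the sketch below creates the sub-folders for one round. It is a plain Python illustration only; in practice the iefolder command mentioned at the top of this article sets up the structure for you. The exact spelling of the folder names follows the headings in this article and is otherwise an assumption.

```python
import tempfile
from pathlib import Path

# Sub-folders of a survey round folder, following the headings in this article.
ROUND_SUBFOLDERS = [
    "DataSets/Raw",
    "DataSets/Intermediate",
    "DataSets/Final",
    "DoFiles",
    "Output",
    "Documentation",
]


def create_round_folder(datawork, round_name):
    """Create the folder for one survey round with its standard sub-folders.

    Existing folders are left untouched, so the function is safe to re-run.
    Returns the path of the round folder.
    """
    round_path = Path(datawork) / round_name
    for sub in ROUND_SUBFOLDERS:
        (round_path / sub).mkdir(parents=True, exist_ok=True)
    return round_path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        baseline = create_round_folder(Path(tmp) / "DataWork", "Baseline")
        for folder in sorted(p for p in baseline.rglob("*") if p.is_dir()):
            print(folder.relative_to(baseline).as_posix())
```

Calling the function once per round (baseline, midline, endline and so on) gives every round the same layout, which is the point of the standard.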
AdminData
Admin data can be used either as the main data in the analysis or as secondary data that is combined with the main data in the analysis. A folder with admin data should be organized as much as possible like a survey round folder. The reason a small distinction is made is that admin data might not always be primary data collected by the project team. Admin data is often collected, cleaned and/or aggregated by someone other than the team. Therefore some of the recommendations made in the section above do not apply.
If the admin data is collected as main data in multiple rounds, then it might make most sense to treat this data as a survey round, in the sense that one folder should be made for each round of data collection. But if the amount of work required for each round is small, then that is probably not the case.
MonitorData
Monitor data is data collected to understand the implementation of the assigned treatment in the field. Monitor data varies with each project. Examples include whether seeds were distributed in an agriculture project, whether savings groups were formed and who joined them in a financial inclusion project, or whether school books were distributed in an education project. In comparison, the main data focuses on outcomes, such as a farmer's harvest size in the agriculture project, the amount of savings in the financial inclusion project, and test scores in the education project. Without the information in the monitor data there is no way to know whether any change in the outcome can be attributed to the project. This is particularly interesting when the outcome data shows no change.
Monitor data is often collected the same way as surveys, but it could also be collected in the same way as admin data. Depending on which, follow the respective instructions from above. Most important is that monitor data is separate from the survey round data we are collecting. The main data tells us the result, the monitor data tells us about internal validity.
MasterData
Master data sets are data sets that make sure the observations are correctly identifiable across data sets. Master data sets also include important information such as sampling and treatment assignment. This is a very important topic, and master data sets have their own article. What matters for DataWork folder management is that the master data sets should have their own folder, since a master data set is a meta-dataset that stores data useful to all other datasets.
Back to Parent
This article is part of the topic Data Management