DataWork Survey Round


Revision as of 16:45, 4 June 2017

This article describes the organization of a Survey Round folder inside the DataWork folder, which is part of DIME's standardized template for organizing the data work in a project folder.


DataSets Folder

Image 4. Example of a DataSets folder.

The DataSets folder inside a Survey Round folder should be further divided into three sub-folders called Raw, Intermediate and Final; see image 4 for an example. Raw and Final have strict rules on which datasets can be saved in them. All datasets that do not fit the rules for either of those folders should be saved in Intermediate.
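The three-folder layout described above can be set up with a few lines of code. The following is a minimal sketch in Python, used here only for illustration since the data work itself is normally done in Stata dofiles. The function name `create_datasets_folder` and the round name "Baseline" are hypothetical and not part of the template.

```python
import tempfile
from pathlib import Path

# Sub-folder names taken from the text above.
DATASETS_SUBFOLDERS = ("Raw", "Intermediate", "Final")


def create_datasets_folder(round_folder):
    """Create DataSets/Raw, DataSets/Intermediate and DataSets/Final
    inside the given survey round folder and return the created paths."""
    created = []
    for name in DATASETS_SUBFOLDERS:
        sub_folder = Path(round_folder) / "DataSets" / name
        # exist_ok=True makes the function safe to re-run on an existing project.
        sub_folder.mkdir(parents=True, exist_ok=True)
        created.append(sub_folder)
    return created


# Demo in a temporary directory; "Baseline" is a hypothetical round name.
demo_round = Path(tempfile.mkdtemp()) / "Baseline"
demo_subfolders = create_datasets_folder(demo_round)
print([folder.name for folder in demo_subfolders])
```

Because existing folders are left untouched, the same call can be repeated for every new survey round without affecting data already saved.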

Raw Folder

This folder should contain the datasets in exactly the same state as you received them. This includes data downloaded from the internet, data received from data collection, and data received from other projects. Absolutely no changes should be made to the data in this folder. Even simple changes, such as correcting known errors, renaming variables, converting from csv to Stata or another format, or renaming files, should never be made to any file in this folder. The only exception is a file whose name must be changed before it can be imported; that file name change can be made in this folder.

If there are known errors in your raw datasets, then you should import the raw data set, write a dofile that corrects the errors and then save the corrected data in the intermediate folder. The data set in the raw folder with known errors should remain unaltered in the raw folder. This is the only way corrections can be fully documented. If corrections to the datasets are not documented, then we will not fully understand the quality of our data, and that means that we will not fully understand the quality of our research.
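The correction workflow above can be sketched as follows: read the raw file, apply the corrections in code, and write the result to the Intermediate folder without touching the raw file. This is a minimal illustration in Python rather than a Stata dofile. The column names `id` and `age`, the file names, and the helper names are all assumptions made for the example.

```python
import csv
import hashlib
import tempfile
from pathlib import Path


def file_hash(path):
    """Return a SHA-256 checksum, used here to show the raw file is unchanged."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def correct_raw_data(raw_file, intermediate_file, corrections):
    """Apply corrections to a copy of the raw data.

    `corrections` maps (id value, column name) -> corrected value, so every
    change is written down in code rather than made by hand.
    """
    with open(raw_file, newline="") as f:
        reader = csv.DictReader(f)
        fieldnames = reader.fieldnames
        rows = list(reader)
    for row in rows:
        for (row_id, column), value in corrections.items():
            if row["id"] == row_id:
                row[column] = value
    # The corrected data goes to the Intermediate folder, never back to Raw.
    with open(intermediate_file, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return rows


# Demo with hypothetical data: respondent 2 has a known typo in age.
datasets = Path(tempfile.mkdtemp()) / "DataSets"
(datasets / "Raw").mkdir(parents=True)
(datasets / "Intermediate").mkdir()
raw_file = datasets / "Raw" / "survey.csv"
raw_file.write_text("id,age\n1,34\n2,340\n")

raw_hash_before = file_hash(raw_file)
corrected_rows = correct_raw_data(
    raw_file,
    datasets / "Intermediate" / "survey_corrected.csv",
    {("2", "age"): "34"},
)
raw_hash_after = file_hash(raw_file)
print(raw_hash_before == raw_hash_after)
```

Comparing the checksum before and after is one simple way to confirm that the file in the Raw folder was left unaltered.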

Final folder

This folder should contain the datasets that are cleaned and have the final variables constructed. All datasets in this folder should be clearly marked as either identified or de-identified. There should only be one version of each dataset in this folder. If there are many different datasets in this folder, for example a student dataset, a school dataset, a teacher dataset etc., then the folder should have sub-folders. This is the folder most likely to be visited several years after the project has ended by someone who has very little knowledge of the project. It should therefore be one of the folders organized in the greatest detail.

Intermediate Folder

This folder should contain all datasets that do not belong in either the Raw folder or the Final folder. Raw datasets to which simple changes have been made, as mentioned above, should be put in the Intermediate folder. Since this is a work-in-progress folder, there are no specific rules for how it should be organized. It is still a very good idea to keep it organized in sub-folders, but if there is any folder in your DataSets folder where some mess is allowed, it is this one. Read the naming convention before you consider saving multiple versions of the same dataset named _v1, _v2 etc.

Dofiles Folder

Image 5. Example of a DataSets folder.

Each Survey Round folder will also contain a Dofiles folder. At the top level of this folder there should only be a master dofile and sub-folders. This master dofile is the map to all dofiles and all datasets needed for this survey round.

The sub-folders in this folder should be organized according to task, for example import, cleaning and analysis. All dofiles related to each of these tasks should be saved in the corresponding folder. Again, use naming conventions and version control rather than keeping multiple versions of the same dofile.

Task-level master dofiles have very little technical purpose, as the dofiles can technically be called from the round master dofile. However, the task-level master dofiles are critical for documenting the data work and make it significantly easier for someone not familiar with the code to follow the data work.

Encrypted Round Folder

Image 5. Example of a DataSets folder.

These folders are separated from the other round folders because they are likely to include identifying data. They are therefore created in the encrypted branch of the DataWork folder.

Raw Data

Any data downloaded from the server used to collect data should be saved in the Raw Identified Data folder.

This folder should contain the datasets in exactly the same state as you received them. This includes data downloaded from the internet, data received from data collection, and data received from other projects. Absolutely no changes should be made to the data in this folder. Even simple changes, such as correcting known errors, renaming variables, converting from csv to Stata or another format, or renaming files, should never be made to any file in this folder. The only exception is a file whose name must be changed before it can be imported; that file name change can be made in this folder.

If there are known errors in your raw datasets, then you should import the raw data set, write a dofile that corrects the errors and then save the corrected data in the intermediate folder. The data set in the raw folder with known errors should remain unaltered in the raw folder. This is the only way corrections can be fully documented. If corrections to the datasets are not documented, then we will not fully understand the quality of our data, and that means that we will not fully understand the quality of our research.

Import Dofiles

The dofiles used to import, edit and de-identify the raw data are likely to include identifying information. Therefore we need to keep those files in the encrypted folder as well.

Running these files should import the data, address any immediate issues such as missing IDs or duplicates, de-identify it and save the data set in the Intermediate folder in the DataSets folder in the branch of the DataWork folder that is not encrypted.
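The import step described above (import, handle duplicates and missing IDs, de-identify, save to the unencrypted Intermediate folder) can be sketched as follows. This is a Python illustration under stated assumptions, not the project's actual import dofile. The column names `id`, `name` and `phone` are hypothetical, and keeping the first row for a repeated ID is only a placeholder, since a real project would investigate each duplicate.

```python
import csv
import tempfile
from pathlib import Path

# Hypothetical list of columns that identify respondents.
IDENTIFYING_COLUMNS = {"name", "phone"}


def import_and_deidentify(raw_file, intermediate_file, id_column="id"):
    """Import raw data, flag ID problems, drop identifying columns and save
    the de-identified result. Returns (clean_rows, issues)."""
    clean_rows, issues, seen_ids = [], [], set()
    with open(raw_file, newline="") as f:
        reader = csv.DictReader(f)
        kept_columns = [c for c in reader.fieldnames if c not in IDENTIFYING_COLUMNS]
        # Data rows start on line 2 of the file, after the header.
        for line_number, row in enumerate(reader, start=2):
            row_id = row[id_column]
            if not row_id:
                issues.append(f"line {line_number}: missing ID")
                continue
            if row_id in seen_ids:
                issues.append(f"line {line_number}: duplicate ID {row_id}")
                continue
            seen_ids.add(row_id)
            clean_rows.append({c: row[c] for c in kept_columns})
    with open(intermediate_file, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=kept_columns)
        writer.writeheader()
        writer.writerows(clean_rows)
    return clean_rows, issues


# Demo: temporary folders stand in for the encrypted and unencrypted branches.
base = Path(tempfile.mkdtemp())
encrypted_raw = base / "Encrypted" / "Raw Identified Data"
intermediate = base / "DataSets" / "Intermediate"
encrypted_raw.mkdir(parents=True)
intermediate.mkdir(parents=True)
(encrypted_raw / "survey.csv").write_text(
    "id,name,phone,score\n1,Ana,555,7\n1,Ana,555,7\n2,Ben,556,9\n"
)
deidentified_rows, import_issues = import_and_deidentify(
    encrypted_raw / "survey.csv", intermediate / "survey_deidentified.csv"
)
print(deidentified_rows)
print(import_issues)
```

Returning the list of issues alongside the data keeps a record of what was dropped, which can then be saved with the project documentation.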

Outputs Folder

The Outputs folder should also be organized in sub-folders. Which sub-folders are needed depends on what is outputted and how. If, for example, both tables and graphs are outputted, then it probably makes sense to have a separate folder for each. Where relevant, these folders should contain sub-folders called raw and formatted. This is relevant, for example, when tables are exported in .csv format and the data is then copied to a formatted .xls file, or when single graphs are saved to disk and later combined into one graph file with multiple graphs.

This folder is one of the few cases where it can be good practice to save multiple versions of the same file. It is common to compare different versions of the analysis, and it is convenient to be able to do so without using version control software to re-generate an old output. But do not name the versions table_v1, table_v2 etc. Name them by date, for example table_Apr30, table_Jun6 etc. Do not do this, however, if the output files are at all large, as multiple versions of the file will soon take up a lot of disk space.
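The date-based naming above can be generated automatically so that names stay consistent. The following is a minimal Python sketch. The function name `dated_output_name` and the file extension parameter are assumptions added for illustration, and the month abbreviations are hard-coded so the result does not depend on the computer's locale.

```python
from datetime import date

# Fixed English abbreviations, so names are the same on every machine.
MONTH_ABBREVIATIONS = ("Jan", "Feb", "Mar", "Apr", "May", "Jun",
                       "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")


def dated_output_name(stem, extension, on_date):
    """Build a name such as table_Apr30.csv from a stem, extension and date."""
    month = MONTH_ABBREVIATIONS[on_date.month - 1]
    # The day is written without zero padding, matching table_Jun6.
    return f"{stem}_{month}{on_date.day}.{extension}"


print(dated_output_name("table", "csv", date(2017, 4, 30)))  # table_Apr30.csv
print(dated_output_name("table", "csv", date(2017, 6, 6)))   # table_Jun6.csv
```

Names in this form do not include the year, so outputs kept for longer than a year would need it added to stay unambiguous.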

Documentation Folder

This folder will contain the documentation for the analysis done including any duplicate reports, survey logs, etc.

Questionnaire Folder

Back to Parent

This article is part of the topic Data Management