Difference between revisions of "DataWork Survey Round"


Revision as of 22:06, 14 April 2019

This article describes the organization of a Survey Round folder inside the DataWork folder, which is part of DIME's standardized template for organizing the data work in a project folder.

Subfolders in each Survey Round

These are the folders that iefolder creates inside the survey round folder:

DataSets Folder

Image 1. Example of a DataSets folder.

The DataSets folder contains three sub-folders: De-identified, Intermediate and Final (Image 1).

Note that it is not recommended to create a folder for raw data in the DataSets folder: raw data almost always includes identifying information and the DataSets folder is not in the encrypted branch of the DataWork folder. Instead, save raw data in the DataWork Survey Encrypted Data folder. The Raw Data section describes best practices for raw data in more detail.

De-identified Folder

Once the raw data is de-identified, it can be saved in the De-identified folder. Working with the de-identified raw data rather than the encrypted raw data makes the workflow easier for the entire research team.

Intermediate Folder

The Intermediate folder is a work-in-progress folder containing all datasets that belong in neither the Raw nor the Final folder. Raw datasets to which simple changes have been made belong in the Intermediate folder. There are no specific rules for how this folder should be organized. However, it is a good idea to keep it organized in sub-folders and to name files according to efficient naming conventions.

Final Folder

The Final folder contains the clean datasets with final variables constructed. This folder should only contain one version of each dataset. If your project has datasets for different units of observation (e.g. a student dataset, a school dataset and a teacher dataset), then the Final folder should contain a sub-folder for each dataset. The Final folder is the folder most likely to be visited several years after the project has ended by someone with very little knowledge of the project. It should therefore be one of the most carefully organized folders in the project.

Dofiles Folder

Image 2. Example of a Dofiles folder.

Each Survey Round folder contains a Dofiles folder. The top level of this folder should only contain a master do-file and sub-folders. The master do-file maps out all do-files and datasets relevant to this survey round.

The sub-folders in the Dofiles folder should be organized according to task, for example import, cleaning and analysis. All do-files related to each of these tasks should be saved in the corresponding folder. Make sure to use naming conventions and version control rather than keeping multiple versions of the same do-file. Each sub-folder should have a task-level master do-file.

Task-level master do-files have very little technical purpose, as the individual do-files can be called directly from the round master do-file. However, the task-level master do-files are critical for documenting the data work and make it significantly easier for someone unfamiliar with the code to follow it.

Outputs Folder

The Outputs folder should also be organized in sub-folders. The necessary sub-folders depend on the outputs and the method used to produce them. If, for example, both tables and graphs are exported, then create Table and Graph sub-folders. Each sub-folder should contain Raw and Formatted folders. This is useful, for example, when tables are exported in .csv format and then used as input to a formatted file, or when single graphs are saved to disk and later combined into one file with multiple graphs. See best practices of exporting analysis for many more relevant recommendations.

The Outputs folder is one of the few places where it may be good practice to save multiple versions of the same file: it is common to compare different versions of the analysis, and it is convenient to do so without any version control software or regeneration of an old output. If you save multiple versions of the same file, do not use the naming convention of table_v1, table_v2, etc. Instead, name files by date (e.g. table_Apr30, table_Jun6). If the output files are large, do not save multiple versions of the same file. Doing so quickly takes up disk space and makes it difficult for people with slow internet connections to access the folder via syncing services like Dropbox.
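The date-based naming convention can be sketched as a small helper. This is only an illustration in Python, not part of the DIME template or of iefolder; the function name `dated_name` and its arguments are hypothetical. It follows the article's pattern of an abbreviated month plus the day, with no year.

```python
from datetime import date

# Fixed English abbreviations, so the result does not depend on the system locale.
MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]


def dated_name(stem: str, ext: str, on: date) -> str:
    """Build a file name such as table_Apr30.csv from a stem, an extension and a date."""
    # Month abbreviation followed by the day of the month without zero padding.
    return f"{stem}_{MONTHS[on.month - 1]}{on.day}.{ext}"


if __name__ == "__main__":
    # Name an output file after today's date.
    print(dated_name("table", "csv", date.today()))
```

Because the pattern omits the year, a project that runs over several years may want to add it to the name.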

Documentation Folder

The Documentation folder contains the documentation for the analysis, including any duplicate reports, survey logs, etc. While there is no strict format for this folder, it is good practice to save any documentation for this survey round here. This could include anything from formal documentation to email conversations. You never know what you and your project team will need to know about this round in the future.

Questionnaire Folder

All documentation related to the data collection should be gathered in this folder. The most important thing to document is the questionnaire, hence the name of the folder, but it is also important to save any other material that documents the survey, including material from the enumerator training, survey manuals used in the field, the contract with the survey firm, and anything else that gives the project team a better qualitative understanding of their quantitative data.

Encrypted Round Folder

Image 3. Example of an Encrypted Round folder.

The Encrypted Round folders are created by iefolder at the same time as a round folder is created, but they are created in the encrypted branch of DataWork instead of inside the round folder. The encrypted round folders are separated from the other round folders as they are likely to include identifying data.

Raw Data

Any data downloaded from the server used to collect data should be saved in the Raw Identified Data folder. This includes data downloaded from the internet, data received from data collection, and data received from other projects.

The data in this folder should be exactly as you received it, and absolutely no changes should be made to it. Even simple changes like correcting known errors, changing variable names, changing the file format, or changing file names should never be made to any files in this folder. The only exception to this rule is when a file name needs to be changed for the file to be imported.

If there are known errors in your raw datasets, import the raw dataset, write a do-file that corrects the errors and save the corrected data in the Intermediate folder. The dataset with known errors should remain unaltered in the Raw folder. If corrections to the datasets are not documented, then the research team will not fully understand the quality of its data or the quality of its research.
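The idea of documenting corrections in code while leaving the raw file untouched can be sketched as follows. The article's workflow uses do-files; this Python version only illustrates the logic. The function names, the `id` column and the shape of the `corrections` mapping are all hypothetical assumptions, and the file is assumed to be a CSV with a header row.

```python
import csv
import hashlib
from pathlib import Path


def file_sha256(path: Path) -> str:
    """Return the SHA-256 digest of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def apply_corrections(raw_path: Path, out_path: Path, corrections: dict, id_col: str = "id") -> None:
    """Read a raw CSV, apply documented corrections, and write the result elsewhere.

    corrections maps (id value, column name) to the corrected value (hypothetical format).
    """
    digest_before = file_sha256(raw_path)

    with raw_path.open(newline="") as f:
        rows = list(csv.DictReader(f))
    if not rows:
        raise ValueError(f"{raw_path} contains no data rows")

    # Every correction lives in code, so it is documented and repeatable.
    for row in rows:
        for (row_id, column), value in corrections.items():
            if row[id_col] == row_id:
                row[column] = value

    # The corrected copy goes to a different folder, never back into the raw folder.
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with out_path.open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)

    # Guard: the raw file must be byte-for-byte identical to what we started with.
    if file_sha256(raw_path) != digest_before:
        raise RuntimeError(f"{raw_path} was modified during processing")
```

The checksum guard is an extra safeguard chosen for this sketch; the article itself only requires that the raw file is never edited.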

Import Dofiles

The do-files used to import, edit and de-identify the raw data are likely to include identifying information. Therefore, those files need to be kept in the Encrypted folder as well.

Running these files should import the data, address any immediate issues such as missing IDs or duplicates, de-identify the data and save the resulting dataset in the Intermediate folder, which is nested in the DataSets folder in the unencrypted branch of the DataWork folder.
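The import steps described above can be sketched in outline. In practice these steps live in do-files; the Python below is only an illustration under stated assumptions: rows are dictionaries of strings, and the function names and the choice of which columns count as identifying are hypothetical. The sketch only flags missing and duplicated IDs; resolving them is project-specific.

```python
from collections import Counter


def check_ids(rows: list, id_col: str) -> tuple:
    """Return (positions of rows with a missing ID, sorted list of duplicated ID values)."""
    missing = [i for i, row in enumerate(rows) if not (row.get(id_col) or "").strip()]
    counts = Counter(row[id_col] for row in rows if (row.get(id_col) or "").strip())
    duplicated = sorted(value for value, n in counts.items() if n > 1)
    return missing, duplicated


def deidentify(rows: list, identifying_cols: list) -> list:
    """Return copies of the rows with the identifying columns removed."""
    drop = set(identifying_cols)
    return [{k: v for k, v in row.items() if k not in drop} for row in rows]


if __name__ == "__main__":
    # Made-up example rows, for illustration only.
    data = [
        {"hhid": "101", "name": "Respondent A", "age": "34"},
        {"hhid": "", "name": "Respondent B", "age": "41"},
        {"hhid": "101", "name": "Respondent C", "age": "29"},
    ]
    missing, duplicated = check_ids(data, "hhid")
    print("Rows with missing ID:", missing)
    print("Duplicated IDs:", duplicated)
    print(deidentify(data, ["name"]))
```

Only the output of `deidentify` would be saved outside the encrypted branch; the script itself stays in the encrypted folder because it may name identifying variables.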

Back to Parent

This article is part of the topic DataWork Folder.