Difference between revisions of "DataWork Survey Round"

 
(33 intermediate revisions by 3 users not shown)
This article describes the organization of a Survey Round folder inside the [[DataWork Folder|DataWork folder]] in DIME's standardized template for how to organize a the data work in project folder.
The DataWork Survey Round folder is a key component of the [[DataWork Folder|DataWork folder]], DIME's standardized folder template for organizing data work in a project folder. The DataWork Survey Round folder is useful throughout all stages of a project.  


== Subfolders in each Survey Round ==
==Read First==
*The Stata command <code>[[iefolder]]</code> creates the DataWork Survey Round folder as a component of the DataWork folder, which is typically housed in [https://www.box.com Box] or [https://www.dropbox.com Dropbox].
*The DataWork Survey Round folder organizes all data-related elements of the project: [[Questionnaire Design | questionnaires]], datasets, do-files, documentation of analysis, graphs, tables, survey logs, etc.
*The DataWork Survey Round folder is an efficient way to save files in a manner understandable both to the current research team and to people visiting the project folder years after the project has ended.


These are the folders that [[iefolder]] creates inside the survey round folder:
== Overview ==


*[[DataWork_Survey_Round#DataSets Folder|DataSets Folder]]
Each source of data (e.g. baseline, follow-up, midline, endline, [[Administrative and Monitoring Data | administrative]], secondary) should have its own Survey Round folder within the DataWork folder. If the data source is collected continuously (e.g. administrative or secondary data such as macro data over time), then it requires only one Survey Round folder. However, if the data source is collected in stages (e.g. baseline and endline data), then it requires one Survey Round folder for each stage.
*[[DataWork_Survey_Round#Dofiles Folder|Dofiles Folder]]
 
*[[DataWork_Survey_Round#Outputs Folder|Outputs Folder]]
If you have multiple [[Unit of Observation|units of observation]] in a given data source, each unit of observation for each data source requires its own Survey Round folder. For example, if the units of observation during baseline data collection are students, teachers and schools, then baseline data requires three Survey Round folders: ''baseline_students'', ''baseline_teachers'' and ''baseline_schools''. If you choose, these folders may be nested within a parent ''Baseline'' folder. <code>[[iefolder]]</code> gives you the option of doing this by creating a ''Baseline'' subfolder and then creating Survey Round folders within it.
*[[DataWork_Survey_Round#DataSets Folder|Documentation Folder]]
 
*[[DataWork_Survey_Round#DataSets Folder|Questionnaire Folder]]
Note that each Survey Round needs to have a unique name across the project when using <code>[[iefolder]]</code>.
 
Each Survey Round folder contains the following sub-folders: the [[DataWork_Survey_Round#DataSets Folder|DataSets folder]], the [[DataWork_Survey_Round#Dofiles Folder|Dofiles folder]], the [[DataWork_Survey_Round#Outputs Folder|Outputs folder]], the [[DataWork_Survey_Round#DataSets Folder|Documentation folder]] and the [[DataWork_Survey_Round#DataSets Folder|Questionnaire folder]]. The remainder of this page explains the contents and purpose of each sub-folder.


==DataSets Folder ==
==DataSets Folder ==
[[File:FolderDataSets.png |thumb|300px|Image 4. Example of a DataSets folder. (Click to enlarge.)]]
[[File:FolderDataSets.png |thumb|300px|Image 1. Example of a DataSets folder. (Click to enlarge.)]]
The '''DataSets''' folder inside a Survey Round folder should be further divided into two sub-folders. These folders are called '''Intermediate''' and '''Final''', see image 4 for an example. '''Final''' has strict rules on which datasets can be saved in that folder (see below) but any data set you are currently working on can be save in '''Intermediate'''. It is not recommended to create a '''Raw''' data folder here, see below for reason why.


=== Why no Raw Folder? ===
The DataSets folder contains three sub-folders in which datasets are stored: De-identified, Intermediate and Final (Image 1).  
It usually a bad practice to create a folder for raw data in the ''DataSet'' folder. The reason for that is that raw data almost always include identifying information and the ''DataSet'' folder is not in the encrypted branch of the ''DataWork'' folder. Instead, if you are using [[iefolder]] there is a folder created for raw data in side the encrypted branch. See the section on [[DataWork_Survey_Round#Raw_Data|encrypted raw data]] below. If you de-identify a raw data set, it is no longer in it's original state, and you should save it in the [[DataWork_Survey_Round#Intermediate_Folder|intermediate folder]].


If your project only uses public data, then feel free to create a raw folder here for your convenience, but make sure that there is no private information before you do so and read the best practices in the [[DataWork_Survey_Round#Raw_Data|encrypted raw data]] section.
It is not recommended to create a folder for raw datasets in the DataSets folder since raw data almost always includes identifying information and the DataSets folder is not [[Encryption | encrypted]]. Instead, save raw data in the DataWork [[DataWork_Folder#Survey_Encrypted_Data | Survey Encrypted Data]] folder. The [[DataWork_Survey_Round#Raw_Identified_Data|Raw Identified Data]] section describes best practices for raw data in more detail.


=== Final folder ===  
=== De-identified Folder ===  
This folder should contain the data sets that are cleaned and have the final variables constructed. All datasets in this folder should be clearly marked if they are identified or de-identified. There should only be one version of the data set in this folder. If there are many different data sets in this folder, for example student dataset, school dataset, teacher dataset etc., then the folder should have sub-folders. This is the folder most likely to visited several years after the project has ended by someone who has very little knowledge of the project. This folder should therefore be one of the folders organized to the most level of detail.
The De-identified folder houses [[De-identification|de-identified]] raw data. Working with the raw, de-identified data rather than the raw, encrypted data makes the workflow easier and smoother for the entire research team.


=== Intermediate Folder ===  
=== Intermediate Folder ===  
This folder should contain all datasets that are not supposed to be either in the '''Raw''' folder or in the '''Final''' folder. Raw datasets on which simple changes has been made as mentioned above should be put in the intermediate folder. Since this is a work-in-progress folder there are no specific rules how this folder should be organized. It is still a very good idea to keep it organized in sub-folders, but if there is any folder in your DataSets folder where some mess is alowed, then it is this folder. And read the [[Naming Conventions|naming convention]] before thinking about saving multiple versions of the same data set named _v1, _v2 etc.
The Intermediate folder is a work-in-progress folder containing all datasets that belong in neither the Raw nor the Final folder. Raw datasets on which simple changes have been made belong in the Intermediate folder. There are no specific rules for how this folder should be organized. However, it is a good idea to keep it organized in sub-folders and to name files according to [[Naming Conventions|naming conventions]].
 
=== Final Folder ===
The Final folder contains the clean datasets with final variables constructed. This folder should contain only one version of each dataset. If your project has several datasets for different [[Unit of Observation | units of observation]] (e.g. student dataset, school dataset, teacher dataset), then the Final folder should contain a sub-folder for each dataset.
 
The Final folder is the folder most likely to be visited several years after the project has ended by someone with very little knowledge of the project. It should therefore be one of the folders organized in the greatest detail.


==Dofiles Folder==
==Dofiles Folder==
[[File:FolderDofiles.png |thumb|300px|Image 5. Example of a DataSets folder. (Click to enlarge.)]]
[[File:FolderDofiles.png |thumb|300px|Image 2. Example of a Dofiles folder. (Click to enlarge.)]]
Each survey round folder will also contain a dofile folder. In the top-level of this folder there should only be a master do-file and sub-folders. This master do-file is the map to where you find all do-files and all datasets needed for this survey round.  
The Dofiles folder houses all do-files used in the survey round. The top level of the Dofiles folder should contain only a [[Master Do-files | master do-file]] and sub-folders. The master do-file maps out all do-files and datasets relevant to this survey round.  


The sub-folders in this folder should be organized according to task. For example, ''import'', ''cleaning'', ''analysis''. All do-files related to each of these tasks should be saved in those folders. Again, using [[Naming Conventions|naming conventions and version control]] rather than having multiple versions of the same dofile.
The sub-folders in the Dofiles folder should be organized according to task (e.g. ''import'', ''cleaning'', ''analysis''), with all do-files housed within their respective task folders. Each sub-folder should have a task-level master do-file. Task-level master do-files have little technical purpose, as the round master do-file can call each individual do-file directly. However, the task-level master do-files document the data work and make it significantly easier for someone unfamiliar with the code to understand the process.


task level dofiles have very little technical purpose as the dofiles can technically be called from the round master dofile, however, the task level master dofiles are critical for documentation for the data work and makes it significantly easier for someone not familiar with the code to follow the data work.
Make sure to use [[Naming Conventions|naming conventions and version control]] when creating and updating files and folders in the Dofiles folder.


==Outputs Folder==
== Outputs Folder ==


The output folder should also be organized in sub-folders. Which sub-folders needed depends on the what is outputted and the method to do so. If, for example, both tables and graphs are outputted then it probably makes sense to have separate folders for each of them. In these folders there should be sub-folders called raw and formatted if relevant. Examples of that being relevant is if tables are exported in .csv format and the data is copied to a formatted .xls file, or for when single graphs are outputted to disk and later combined to one graph file with multiple graphs.
The Outputs folder houses the survey round outputs. Its exact sub-folders depend on the outputs and the method used to create them. If, for example, tables and graphs are outputted, then create Table and Graph sub-folders. Each sub-folder should contain Raw and Formatted folders. For example, if single graphs are outputted to disk and later combined into one cumulative graph file, save the former as Raw and the latter as Formatted. See the best practices for [[Exporting_Analysis|exporting analysis]] for additional recommendations.


This folder is one of the few examples where it could be good practice to save multiple versions of the same file. The reason for this is that it is common to compare different versions of the analysis, and it is convenient to be able to do so without using any version control software and then re-generating an old output. But do not use the convention of calling it table_v1, table_v2 etc. Call them by date, for example table_Apr30, table_Jun6 etc. Although, do not do this if the outputted file size is at all significant. Then the multiple version of the file will soon take up a lot of space on the disk.
The Outputs folder is one of the few places where it may be good practice to save multiple versions of the same file, since it is common to compare different versions of the analysis. If you save multiple versions of the same file, do not use the naming convention of table_v1, table_v2, etc. Instead, name files by date (e.g. table_Apr30, table_Jun6). If the outputted files take up significant disk space, do not save multiple versions of the same file: doing so will make it difficult for people with a slow internet connection to access the folder via syncing services like [https://www.dropbox.com Dropbox].


==Documentation Folder==
==Documentation Folder==


This folder will contain the documentation for the analysis done including any duplicate reports, survey logs, etc.
The Documentation folder contains documentation for the analysis, including any duplicate reports, survey logs, etc. While there is no strict format for this folder, it is good practice to save any documentation for this survey round here. This could include anything from formal documentation to email conversations. You never know what you and your project team will need to know about this round in the future.


==Questionnaire Folder==
== Questionnaire Folder ==
 
The Questionnaire folder contains all documentation related to the [[Primary Data Collection | data collection]]. While the [[Questionnaire Design | questionnaire]] is the most important document to include in this folder, it is also important to save any other material that documents the survey. This includes material from the [[Enumerator Training | enumerator training]], survey manuals used in the field, the [[Survey Firm TOR | survey firm TOR]], and anything else that provides the project team with a better qualitative understanding of the data.


== Encrypted Round Folder ==
== Encrypted Round Folder ==
[[File:FolderEncryptedRound.png |thumb|300px|Image 5. Example of a DataSets folder. (Click to enlarge.)]]
[[File:FolderEncryptedRound.png |thumb|300px|Image 3. Example of an Encrypted Round folder. (Click to enlarge.)]]


The encrypted round folders are created by [[iefolder]] at the same time as a round folder is created, but they are created in the encrypted branch of ''DataWork'' instead of inside the ''Round Fodler''. The reason the encrypted round folders are separated from the other round folders as they are likely to include identifying data and the encrypted branch should be encrypted.
The Encrypted Round folders are created by [[iefolder]] at the same time as a round folder is created, but they are created in the [[DataWork Folder#Contents#Survey Encrypted Data | encrypted branch]] of DataWork instead of inside the round folder. The [[Encryption | encrypted]] round folders are separated from the other round folders as they likely include identifying data.


===Raw Data===
===Raw Identified Data===
Any data downloaded from the server used to collect data should be saved in the ''Raw Identified Data'' folder.
Any data downloaded from the server used to collect data should be saved in the Raw Identified Data folder. This includes data downloaded from the internet, data received from [[Primary Data Collection | data collection]], and data received from other projects.  


This folder should contain the datasets in exactly the same state as you got them. This includes data downloaded from the internet, data received from data collection, and data received from other projects. The data in this folder should be exactly as you got it and absolutely no changes should be made to it. Even simple changes like correcting known errors, changing variable names, changing format from csv to Stata or other formats, or file name changes should never be done to any files in this folder. The only exception to this rule is if the file name needs to be changed to be imported, then the file name changes can be done in this folder.
The data in this folder should be exactly as you got it: absolutely no changes should be made to it. Even simple changes like correcting known errors, changing variable names, changing format, or changing file names should never be done to any files in this folder. The only exception to this rule is if the file name needs to be changed in order to be imported.


If there are known errors in your raw datasets, then you should import the raw data set, write a dofile that corrects the errors and then save the corrected data in the intermediate folder. The data set in the raw folder with known errors should remain unaltered in the raw folder. This is the only way corrections can be fully documented. If corrections to the datasets are not documented, then we will not fully understand the quality of our data, and that means that we will not fully understand the quality of our research.
If there are known errors in your raw datasets, import the raw dataset, write a do-file that corrects the errors and save the corrected data in the Intermediate folder. The dataset in the Raw folder with known errors should remain unaltered. If corrections to the datasets are not properly [[Reproducible Research | documented]], then the research team will not fully understand the quality of either the data or the research.


===Import Dofiles===
===Dofiles Import===
The dofiles used to import, edit and de-identify the raw data is likely to include idenitfying information. Therefore we need to keep those files in the encryoted folder as well.
The do-files used to import, edit and [[De-identification | de-identify]] the raw data likely include identifying information. Therefore, they should be housed in the Encrypted folder as well.


Running these files should import the data, address any immediate issues such as missing IDs or duplicates, de-identify it and save the data set in the ''Intermediate'' folder in the ''DataSets'' folder in the branch of the ''DataWork'' folder that is not encrypted.
Running these do-files should import the data, address any immediate issues such as missing [[ID Variable Properties | IDs]] or [[Duplicates and Survey Logs | duplicates]], de-identify the data and save the dataset in the Intermediate folder, which is nested in the DataSets folder in the unencrypted branch of the DataWork folder.


== Back to Parent ==
== Back to Parent ==
This article is part of the topic [[Data Management]]
This article is part of the topic [[DataWork_Folder|DataWork Folder]].
 
==Additional Resources==
*DIME Analytics’ [https://github.com/worldbank/DIME-Resources/blob/master/stata2-3-data.pdf Data Management for Reproducible Research]
*DIME Analytics' guidelines on [https://github.com/worldbank/DIME-Resources/blob/master/welcome-iefolder.pdf iefolder]
[[Category: Data Management ]]
[[Category: Data Management ]]

Latest revision as of 14:28, 12 June 2019

The DataWork Survey Round folder is a key component of the DataWork folder, DIME's standardized folder template for organizing data work in a project folder. The DataWork Survey Round folder is useful throughout all stages of a project.

Read First

  • The Stata command iefolder creates the DataWork Survey Round folder as a component of the DataWork folder, which is typically housed in Box or Dropbox.
  • The DataWork Survey Round folder organizes all data-related elements of the project: questionnaires, datasets, do-files, documentation of analysis, graphs, tables, survey logs, etc.
  • The DataWork Survey Round folder is an efficient way to save files in a manner understandable both to the current research team and to people visiting the project folder years after the project has ended.

Overview

Each source of data (e.g. baseline, follow-up, midline, endline, administrative, secondary) should have its own Survey Round folder within the DataWork folder. If the data source is collected continuously (e.g. administrative or secondary data such as macro data over time), then it requires only one Survey Round folder. However, if the data source is collected in stages (e.g. baseline and endline data), then it requires one Survey Round folder for each stage.

If you have multiple units of observation in a given data source, each unit of observation for each data source requires its own Survey Round folder. For example, if the units of observation during baseline data collection are students, teachers and schools, then baseline data requires three Survey Round folders: baseline_students, baseline_teachers and baseline_schools. If you choose, these folders may be nested within a parent Baseline folder. iefolder gives you the option of doing this by creating a Baseline subfolder and then creating Survey Round folders within it.

Note that each Survey Round needs to have a unique name across the project when using iefolder.

Each Survey Round folder contains the following sub-folders: the DataSets folder, the Dofiles folder, the Outputs folder, the Documentation folder and the Questionnaire folder. The remainder of this page explains the contents and purpose of each sub-folder.
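The layout described above can be sketched in a few lines of Python. This is only an illustration of the folder structure, not a replacement for iefolder: the function name create_round_folder is hypothetical, and the exact on-disk folder names that iefolder produces are assumed here from the names used on this page.

```python
from pathlib import Path

# Folder names are taken from this page; the exact names that iefolder
# creates on disk may differ, so treat them as assumptions.
ROUND_SUBFOLDERS = ("DataSets", "Dofiles", "Outputs", "Documentation", "Questionnaire")
DATASETS_SUBFOLDERS = ("De-identified", "Intermediate", "Final")


def create_round_folder(datawork: Path, round_name: str) -> Path:
    """Create one Survey Round folder with the sub-folders listed on this page."""
    round_dir = datawork / round_name
    if round_dir.exists():
        # Round names must be unique; this sketch only checks the target
        # directory, not rounds nested under other parent folders.
        raise FileExistsError(f"Survey Round '{round_name}' already exists")
    for name in ROUND_SUBFOLDERS:
        (round_dir / name).mkdir(parents=True)
    for name in DATASETS_SUBFOLDERS:
        (round_dir / "DataSets" / name).mkdir()
    return round_dir
```

For the baseline example above, calling create_round_folder once each for baseline_students, baseline_teachers and baseline_schools would produce three separate round folders, and a repeated name would raise an error rather than silently reuse a folder.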

DataSets Folder

Image 1. Example of a DataSets folder. (Click to enlarge.)

The DataSets folder contains three sub-folders in which datasets are stored: De-identified, Intermediate and Final (Image 1).

It is not recommended to create a folder for raw datasets in the DataSets folder since raw data almost always includes identifying information and the DataSets folder is not encrypted. Instead, save raw data in the DataWork Survey Encrypted Data folder. The Raw Identified Data section describes best practices for raw data in more detail.

De-identified Folder

The De-identified folder houses de-identified raw data. Working with the raw, de-identified data rather than the raw, encrypted data makes the workflow easier and smoother for the entire research team.

Intermediate Folder

The Intermediate folder is a work-in-progress folder containing all datasets that belong in neither the Raw nor the Final folder. Raw datasets on which simple changes have been made belong in the Intermediate folder. There are no specific rules for how this folder should be organized. However, it is a good idea to keep it organized in sub-folders and to name files according to naming conventions.

Final Folder

The Final folder contains the clean datasets with final variables constructed. This folder should contain only one version of each dataset. If your project has several datasets for different units of observation (e.g. student dataset, school dataset, teacher dataset), then the Final folder should contain a sub-folder for each dataset.

The Final folder is the folder most likely to be visited several years after the project has ended by someone with very little knowledge of the project. It should therefore be one of the folders organized in the greatest detail.

Dofiles Folder

Image 2. Example of a Dofiles folder. (Click to enlarge.)

The Dofiles folder houses all do-files used in the survey round. The top level of the Dofiles folder should contain only a master do-file and sub-folders. The master do-file maps out all do-files and datasets relevant to this survey round.

The sub-folders in the Dofiles folder should be organized according to task (e.g. import, cleaning, analysis), with all do-files housed within their respective task folders. Each sub-folder should have a task-level master do-file. Task-level master do-files have little technical purpose, as the round master do-file can call each individual do-file directly. However, the task-level master do-files document the data work and make it significantly easier for someone unfamiliar with the code to understand the process.

Make sure to use naming conventions and version control when creating and updating files and folders in the Dofiles folder.

Outputs Folder

The Outputs folder houses the survey round outputs. Its exact sub-folders depend on the outputs and the method used to create them. If, for example, tables and graphs are outputted, then create Table and Graph sub-folders. Each sub-folder should contain Raw and Formatted folders. For example, if single graphs are outputted to disk and later combined into one cumulative graph file, save the former as Raw and the latter as Formatted. See the best practices for exporting analysis for additional recommendations.

The Outputs folder is one of the few places where it may be good practice to save multiple versions of the same file, since it is common to compare different versions of the analysis. If you save multiple versions of the same file, do not use the naming convention of table_v1, table_v2, etc. Instead, name files by date (e.g. table_Apr30, table_Jun6). If the outputted files take up significant disk space, do not save multiple versions of the same file: doing so will make it difficult for people with a slow internet connection to access the folder via syncing services like Dropbox.
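As a small illustration of the date-based naming described above, the following Python sketch builds names in the table_Apr30 style. The helper name dated_name is hypothetical, and the choice to hard-code English month abbreviations (so that the result does not depend on the system locale) is an assumption of this sketch rather than part of the DIME template.

```python
from datetime import date

# Fixed English abbreviations so the result does not depend on the system locale.
_MONTHS = ("Jan", "Feb", "Mar", "Apr", "May", "Jun",
           "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")


def dated_name(stem: str, extension: str, on: date) -> str:
    """Return a name such as table_Apr30.csv instead of table_v1.csv."""
    return f"{stem}_{_MONTHS[on.month - 1]}{on.day}.{extension}"
```

The names follow the examples on this page and therefore carry no year; a project that keeps outputs across several years may want to add one.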

Documentation Folder

The Documentation folder contains documentation for the analysis, including any duplicate reports, survey logs, etc. While there is no strict format for this folder, it is good practice to save any documentation for this survey round here. This could include anything from formal documentation to email conversations. You never know what you and your project team will need to know about this round in the future.

Questionnaire Folder

The Questionnaire folder contains all documentation related to the data collection. While the questionnaire is the most important document to include in this folder, it is also important to save any other material that documents the survey. This includes material from the enumerator training, survey manuals used in the field, the survey firm TOR, and anything else that provides the project team with a better qualitative understanding of the data.

Encrypted Round Folder

Image 3. Example of an Encrypted Round folder. (Click to enlarge.)

The Encrypted Round folders are created by iefolder at the same time as a round folder is created, but they are created in the encrypted branch of DataWork instead of inside the round folder. They are kept separate from the other round folders because they are likely to include identifying data.

Raw Identified Data

Any data downloaded from the server used to collect data should be saved in the Raw Identified Data folder. This includes data downloaded from the internet, data received from data collection, and data received from other projects.

The data in this folder should be exactly as you received it: absolutely no changes should be made to it. Even simple changes like correcting known errors, changing variable names, changing format, or changing file names should never be made to any files in this folder. The only exception to this rule is if the file name needs to be changed in order to be imported.

If there are known errors in your raw datasets, import the raw dataset, write a do-file that corrects the errors and save the corrected data in the Intermediate folder. The dataset in the Raw folder with known errors should remain unaltered. If corrections to the datasets are not properly documented, then the research team will not fully understand the quality of either the data or the research.
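The correction pattern described above (read the raw file, apply documented corrections in code, save the result elsewhere) can be sketched as follows. In a DIME project this step would normally be a Stata do-file; the Python below only illustrates the pattern. The function name apply_corrections, the CSV format, and the "id" column name are all assumptions made for the example.

```python
import csv
from pathlib import Path


def apply_corrections(raw_csv: Path, intermediate_csv: Path,
                      corrections: dict[tuple[str, str], str],
                      id_column: str = "id") -> int:
    """Write a corrected copy of raw_csv; the raw file is only ever read.

    corrections maps (id value, column name) -> corrected value, so every
    change is recorded in code. Returns the number of cells changed.
    """
    with raw_csv.open(newline="", encoding="utf-8") as src:
        reader = csv.DictReader(src)
        fieldnames = list(reader.fieldnames or [])
        rows = list(reader)

    changed = 0
    for row in rows:
        for column in fieldnames:
            key = (row[id_column], column)
            if key in corrections and row[column] != corrections[key]:
                row[column] = corrections[key]
                changed += 1

    # The corrected data goes to a different location (e.g. the Intermediate folder).
    intermediate_csv.parent.mkdir(parents=True, exist_ok=True)
    with intermediate_csv.open("w", newline="", encoding="utf-8") as dst:
        writer = csv.DictWriter(dst, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return changed
```

Because the corrections live in a script rather than in the raw file, anyone can re-run the script and see exactly which values were changed and why.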

Dofiles Import

The do-files used to import, edit and de-identify the raw data likely include identifying information. Therefore, they should be housed in the Encrypted folder as well.

Running these do-files should import the data, address any immediate issues such as missing IDs or duplicates, de-identify the data and save the dataset in the Intermediate folder, which is nested in the DataSets folder in the unencrypted branch of the DataWork folder.
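A minimal sketch of that import step, again in Python rather than in a Stata do-file, might look like the following. The function name import_and_deidentify and the CSV format are hypothetical, and which columns count as identifying is something each project must decide for itself. Note that this sketch only flags missing or duplicate IDs by raising an error; it does not resolve them.

```python
import csv
from collections import Counter
from pathlib import Path


def import_and_deidentify(raw_csv: Path, out_csv: Path, id_column: str,
                          identifying_columns: set[str]) -> list[str]:
    """Check IDs, drop identifying columns, and save outside the encrypted branch.

    Returns the list of columns kept in the de-identified file.
    """
    with raw_csv.open(newline="", encoding="utf-8") as src:
        reader = csv.DictReader(src)
        fieldnames = list(reader.fieldnames or [])
        rows = list(reader)

    # Immediate issues: missing or duplicated IDs stop the import.
    ids = [row[id_column] for row in rows]
    if any(not value for value in ids):
        raise ValueError("missing ID values")
    duplicates = sorted(value for value, count in Counter(ids).items() if count > 1)
    if duplicates:
        raise ValueError(f"duplicate IDs: {duplicates}")

    # De-identify by writing only the non-identifying columns.
    kept = [name for name in fieldnames if name not in identifying_columns]
    out_csv.parent.mkdir(parents=True, exist_ok=True)
    with out_csv.open("w", newline="", encoding="utf-8") as dst:
        writer = csv.DictWriter(dst, fieldnames=kept, extrasaction="ignore")
        writer.writeheader()
        writer.writerows(rows)
    return kept
```

The script itself names the identifying columns, which is one reason the text above recommends keeping such files in the encrypted folder.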

Back to Parent

This article is part of the topic DataWork Folder.

Additional Resources