Difference between revisions of "Master Dataset"
Latest revision as of 20:27, 3 August 2023
All research projects collect and use multiple datasets for a given unit of observation. Reference (master) datasets are the second component of using a data map to organize data work in a research team. They allow the research team to keep track of individual units for each level of observation. For example, master datasets are useful for keeping track of each household if the unit of observation is individual households, each company if the unit of observation is individual companies, and so on. While master datasets take some time and effort to set up, they significantly reduce sources of error and simplify the process of working with datasets from multiple sources, such as baseline data, endline data, and administrative and monitoring data.
Read First
- Master data sets are a crucial component of using a data map to organize data work.
- The research team must create one entry in the master data set for each relevant unit of observation.
- Save de-identified master data sets in the Reference (Master) Data folder and save master datasets with PII in the Encrypted Data folder.
Overview
A master dataset is a comprehensive listing of the fixed characteristics of the observations that might occur in any other project dataset. Therefore, it contains one entry for each possible observation of a given unit of observation that a research team could ever work with in the project context via sampling, surveying, or otherwise. For example, a household master dataset should include data on all households that the research team ever encountered: households in the analysis, households sampled for surveys, households listed in the census, or households that were included in monitoring data despite not being a part of the project. Accordingly, some observations in the master dataset will not have data for all the variables. As long as this is documented properly, it is normal and acceptable. In this case, make sure that all missing values in a master dataset are explained using extended missing values; Stata's regular missing values should never be allowed in a final master dataset.
Observations in each master dataset should contain the IDs of any relevant higher level units of observation. For example, a student master dataset should include the student ID and the student’s school ID, region ID, and so on. This ensures quick, effortless merging of information from different datasets when need be – and with a very low risk of errors.
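The two requirements above (explained missing values, and IDs for every higher-level unit) can be checked mechanically. The following is a minimal sketch in Python rather than a Stata do-file; the variable names, the sample rows, and the string codes `.a` and `.b` (which only mimic Stata's extended missing values) are assumptions made for illustration.

```python
# Hypothetical reason codes, mimicking Stata's extended missing values (.a, .b, ...).
MISSING_REASONS = {
    ".a": "not part of the baseline survey",
    ".b": "respondent declined to answer",
}

# Hypothetical student master dataset: one row per student, with the IDs of
# the higher-level units (school, region) stored on every row.
student_master = [
    {"student_id": 1001, "school_id": 10, "region_id": 1, "birth_year": 2009},
    {"student_id": 1002, "school_id": 10, "region_id": 1, "birth_year": ".a"},
    {"student_id": 1003, "school_id": 11, "region_id": 1, "birth_year": 2010},
]


def check_master(rows, id_key, parent_keys, reasons):
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    seen = set()
    for row in rows:
        uid = row[id_key]
        # The ID must uniquely identify each observation.
        if uid in seen:
            problems.append(f"duplicate {id_key}: {uid}")
        seen.add(uid)
        for key, value in row.items():
            if value is None:
                # A plain missing value carries no explanation.
                problems.append(f"{id_key} {uid}: unexplained missing value in {key}")
            elif key in parent_keys and value in reasons:
                # Higher-level IDs are needed for merging, so they may not be missing.
                problems.append(f"{id_key} {uid}: {key} must always be filled in")
    return problems


problems = check_master(
    student_master, "student_id", ["school_id", "region_id"], MISSING_REASONS
)
print(problems)  # an empty list means the master dataset passed every check
```

Running a check like this every time the master dataset is regenerated is one way to catch duplicate IDs or undocumented gaps before they propagate into merged datasets.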
Creating the Reference (Master) Dataset
When to Create
Create the master dataset as soon as you begin working with a new unit of observation. This typically happens at one of three moments:
- When creating the population frame for a survey: creating a master dataset at this point is useful because it often reveals issues in the data collected for the population frame. It is important to find these errors before beginning any fieldwork.
- When adding administrative data to a dataset: if, for example, a research team has surveyed students and then receives administrative data on school budgets to merge into the student dataset, they should create a master dataset for schools before proceeding.
- When receiving monitoring data: this data is often collected either through surveys or through administrative data, which the two points above already cover. However, if monitoring data is of a different unit of observation than the data already represented in the master datasets, it requires a new master dataset. For example, if a research team that conducts farmer surveys also conducts village-level monitoring activities that track whether villages received a treatment or not, then they should create a master dataset for villages with the monitoring data. The village master dataset should have village IDs and the farmer reference (master) data should include village IDs for each farmer. Then it will be easy to include the monitoring data whenever it is needed.
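The village and farmer example in the last point can be sketched as follows. This is an illustrative Python sketch, not the workflow of any particular project; the IDs, the names, and the `received_treatment` variable are hypothetical.

```python
# Hypothetical village master dataset, keyed by village ID, holding the
# village-level monitoring result.
village_master = {
    501: {"village_name": "Village A", "received_treatment": True},
    502: {"village_name": "Village B", "received_treatment": False},
}

# Hypothetical farmer master dataset: every farmer carries a village ID.
farmer_master = [
    {"farmer_id": 1, "village_id": 501},
    {"farmer_id": 2, "village_id": 501},
    {"farmer_id": 3, "village_id": 502},
]


def add_village_monitoring(farmers, villages):
    """Attach the village-level monitoring variable to each farmer row."""
    merged = []
    for farmer in farmers:
        village = villages.get(farmer["village_id"])
        if village is None:
            # A farmer pointing at an unknown village signals an error in
            # one of the master datasets, so stop instead of guessing.
            raise KeyError(f"village_id {farmer['village_id']} not in village master")
        merged.append({**farmer, "received_treatment": village["received_treatment"]})
    return merged


farmers_with_monitoring = add_village_monitoring(farmer_master, village_master)
for row in farmers_with_monitoring:
    print(row)
```

Because the village ID is stored in the farmer master dataset, the lookup is a simple many-to-one match and never has to rely on village names.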
What to Include
Create a master dataset for all units of observation that fall into any of these categories:
- The unit of observation of any data source, including:
  - The unit of observation at which each survey is conducted. If a research team surveys students, teachers, parents, and principals, each of these units of observation requires a master dataset.
  - The unit of observation of any administrative data used.
  - The unit of observation in any monitoring data used.
- The unit of observation used in any significant step of the analysis, including:
  - Sampling.
  - Treatment Assignment. If a research team running a student-level analysis never collects data on the school level, but randomly assigns the treatment at the school level, they should have a school master dataset. This dataset should have a variable indicating treatment assignment in such a way that this dataset can be merged to any other dataset.
  - Any other significant step of the analysis.
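The treatment assignment case above can be illustrated with a short Python sketch. The school and student IDs, the seed value, and the rule of treating half of the schools are assumptions chosen for the example, not a recommended design.

```python
import random

# Hypothetical school master dataset; no school-level data was collected,
# but schools are the unit at which treatment is assigned.
school_master = [{"school_id": sid} for sid in (10, 11, 12, 13)]

# Hypothetical student master dataset with the school ID on every row.
student_master = [
    {"student_id": 1001, "school_id": 10},
    {"student_id": 1002, "school_id": 11},
    {"student_id": 1003, "school_id": 12},
    {"student_id": 1004, "school_id": 13},
    {"student_id": 1005, "school_id": 10},
]


def assign_treatment(school_ids, seed):
    """Randomly assign half of the schools to treatment, reproducibly."""
    rng = random.Random(seed)
    # Sort first so the result depends only on the seed, not on input order.
    shuffled = sorted(school_ids)
    rng.shuffle(shuffled)
    treated = set(shuffled[: len(shuffled) // 2])
    return {sid: (sid in treated) for sid in sorted(school_ids)}


assignment = assign_treatment([s["school_id"] for s in school_master], seed=20230803)

# Store the assignment in the school master dataset ...
for school in school_master:
    school["treatment"] = assignment[school["school_id"]]

# ... so that it can be merged to any student-level dataset via school_id.
students_with_treatment = [
    {**student, "treatment": assignment[student["school_id"]]}
    for student in student_master
]
```

Keeping the assignment variable in the school master dataset means there is a single authoritative record of which schools were treated, and every student-level dataset obtains it through the same merge.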
For some units of observation in a dataset, it is not worth creating a master dataset. Consider, for example, that all students in a survey belong to a school, school district, region, and country. If this survey was conducted in a single region, then there is no need to create a master dataset for region or for country, as each would have only one observation. Doing so is not incorrect, but in most cases it is not worth the effort.
Adding to the Reference (Master) Dataset
When to Add
Each time you come across new instances of a unit of observation, add them to the master dataset. For example, imagine that a research team has completed baseline patient surveys for a public health project. Between baseline and endline, they monitor whether patients in the clinics received the treatment according to the research design. However, the monitoring team does not have access to the baseline sample and instead simply randomly selects patients at the clinic. When the research team first receives the monitoring data, they should use the master dataset to check whether any of the monitored patients were part of the baseline. If the master dataset was created correctly, then it already includes all the patients associated with the baseline.
They should then merge all new patient-level monitoring data to the patient master dataset and assign each new observation an unused ID. Be very careful when assigning IDs and always check for errors before proceeding. Make sure not to alter existing IDs. Consider, for example, a workflow that sorts observations alphabetically and then assigns IDs in that order: new observations added to the master dataset change the sort order, so rerunning that step would give existing observations different IDs. For more information on ID variables, see ID Variable Properties.
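One way to add new observations without touching existing IDs is to always continue numbering after the highest ID already in use. The sketch below assumes a hypothetical `clinic_card_no` variable that identifies the same patient in both datasets; in practice the matching step is project-specific.

```python
# Hypothetical patient master dataset created at baseline.
patient_master = [
    {"patient_id": 1, "clinic_card_no": "C-104", "in_baseline": True},
    {"patient_id": 2, "clinic_card_no": "C-221", "in_baseline": True},
]

# Hypothetical monitoring data: one known patient and two new ones.
monitoring = [
    {"clinic_card_no": "C-221"},
    {"clinic_card_no": "C-087"},
    {"clinic_card_no": "C-350"},
]


def add_new_units(master, incoming, match_key, id_key="patient_id"):
    """Return a copy of the master with unseen units appended under new IDs."""
    updated = [dict(row) for row in master]  # existing rows are never modified
    known = {row[match_key] for row in updated}
    # Continue after the highest ID in use, so no existing ID is reused.
    next_id = max((row[id_key] for row in updated), default=0) + 1
    for record in incoming:
        if record[match_key] in known:
            continue  # already in the master dataset, keep its existing ID
        updated.append({id_key: next_id, **record, "in_baseline": False})
        known.add(record[match_key])
        next_id += 1
    return updated


updated_master = add_new_units(patient_master, monitoring, "clinic_card_no")
for row in updated_master:
    print(row)
```

Since new IDs depend only on the highest existing ID and not on how the data is sorted, rerunning the step on an already updated master dataset adds nothing and changes nothing.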
Merging without Numeric IDs
If you have administrative or monitoring data that does not have a numeric ID, you may have to merge using string variables. This process is error-prone, so it should always be done carefully and using master datasets. Never merge on a string variable using any dataset other than the master dataset.
String variables are often not unique across all possible observations. For example, a name might be unique within a village but not across a district or a region. Therefore, merging on a single string variable is not always enough; a combination of string variables may be needed. The exact merging procedure for string variables differs between datasets; the more information one has on the data, the easier it is to avoid mistakes in this exercise.
This example do-file (https://github.com/worldbank/DIMEwiki/blob/master/Topics/Master_Data_Set/stringMergeWithMasterDataSets.do) shows important steps and useful advice for merging string variables. Note that in a real-world scenario, datasets usually require many more corrections than in this example do-file, especially if the data was collected in a context where the names are written in a script other than the Latin script. When altering strings, be careful not to alter them for more observations than you intend to.
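Independently of the linked do-file, the idea of merging on a combination of string variables can be sketched in Python. The household data, the variable names, and the normalization rule (trimming, collapsing whitespace, ignoring case) are assumptions for illustration; real data usually needs more corrections than this.

```python
# Hypothetical household master dataset. "Amina Diallo" appears in two
# villages, so the name alone does not identify a household.
household_master = [
    {"household_id": 1, "village": "Koro", "head_name": "Amina Diallo"},
    {"household_id": 2, "village": "Koro", "head_name": "Moussa Keita"},
    {"household_id": 3, "village": "Bla", "head_name": "Amina Diallo"},
]

# Hypothetical administrative records without numeric IDs.
admin_records = [
    {"village": "koro ", "head_name": "  Amina  DIALLO", "subsidy": 120},
    {"village": "Bla", "head_name": "amina diallo", "subsidy": 80},
    {"village": "Bla", "head_name": "Oumar Traore", "subsidy": 95},
]


def normalise(text):
    """Trim, collapse repeated whitespace, and ignore case."""
    return " ".join(text.split()).casefold()


def build_index(master, keys, id_key):
    """Map each normalised key combination to its ID; fail if not unique."""
    index = {}
    for row in master:
        combo = tuple(normalise(row[key]) for key in keys)
        if combo in index:
            raise ValueError(f"{combo} does not uniquely identify an observation")
        index[combo] = row[id_key]
    return index


def match_to_master(records, index, keys, id_key):
    """Split records into those matched to a master ID and those unmatched."""
    matched, unmatched = [], []
    for record in records:
        combo = tuple(normalise(record[key]) for key in keys)
        if combo in index:
            matched.append({**record, id_key: index[combo]})
        else:
            unmatched.append(record)  # review by hand; never force a match
    return matched, unmatched


keys = ["village", "head_name"]
index = build_index(household_master, keys, "household_id")
matched, unmatched = match_to_master(admin_records, index, keys, "household_id")
```

The uniqueness check in `build_index` is the important design choice: if the chosen combination of strings does not uniquely identify observations in the master dataset, the merge stops before any record can be attached to the wrong unit.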
Where to Store
Master Data Folder
The Master Data folder, which the iefolder command generates when creating the DataWork folder, contains the de-identified data. The Master Data folder should contain a sub-folder for each unit of observation. Each unit of observation sub-folder should be named in a way that allows anyone unfamiliar with the folder to still understand the unit of observation for each dataset (e.g. master_students, master_teachers and master_schools). Each unit of observation folder should contain the following sub-folders.
- DataSet: This folder stores the de-identified master dataset. This folder may also include sub-folders containing the de-identified raw datasets needed to create each master dataset. Never mix raw datasets with master datasets.
- Dofiles: This folder should contain the do-files that create each master dataset. Name each do-file clearly. If there are multiple do-files per master dataset, create do-file sub-folders for each unit of observation. Be very careful when you write these do-files, as the validity of the data for your project depends on them.
Encrypted Data Folder
The Encrypted Data folder, which the iefolder command generates when creating the DataWork folder, should store any identifying or sensitive master datasets, outputs, and do-files. Note that while iefolder creates the Survey Encrypted Data folder, it does not encrypt it. The folder’s contents can easily be encrypted using software like Boxcryptor.
The Encrypted Data folder should contain a sub-folder for each unit of observation. Each unit of observation sub-folder should be named in a way that allows anyone unfamiliar with the folder to still understand the unit of observation for each dataset (e.g. master_students, master_teachers and master_schools). Each unit of observation folder should contain the following sub-folders.
- DataSet: The master dataset with PII should be stored here.
- Sampling: This folder should contain a folder for each sampling exercise. Within that folder should be two sub-folders: one for do-files and one for the output of the sampling. Typically the do-file only needs the master dataset to run, though sometimes advanced sampling techniques require additional input.
- Treatment: This folder should contain a folder for each treatment assignment exercise. Within that folder should be two sub-folders: one for do-files and one for the output of the treatment assignment. Typically the do-file only needs the master dataset to run, though sometimes advanced assignment techniques require additional input.
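The folder layout described in this section can be sketched as below. This is not what iefolder itself does and does not reproduce its exact folder names; the top-level names `MasterData` and `EncryptedData` are placeholders, and only the sub-folder names (DataSet, Dofiles, Sampling, Treatment) and the `master_` naming pattern come from the text above. The sketch builds the layout in a temporary directory so it has no side effects.

```python
import tempfile
from pathlib import Path

# Sub-folders named in the text above for each unit of observation.
MASTER_SUBFOLDERS = ["DataSet", "Dofiles"]
ENCRYPTED_SUBFOLDERS = ["DataSet", "Sampling", "Treatment"]


def build_layout(root, units):
    """Create one sub-folder tree per unit of observation under each data folder."""
    root = Path(root)
    created = []
    # "MasterData" and "EncryptedData" are placeholder names for this sketch.
    tops = (("MasterData", MASTER_SUBFOLDERS), ("EncryptedData", ENCRYPTED_SUBFOLDERS))
    for unit in units:
        for top, subfolders in tops:
            for sub in subfolders:
                path = root / top / f"master_{unit}" / sub
                path.mkdir(parents=True, exist_ok=True)
                created.append(path.relative_to(root).as_posix())
    return sorted(created)


with tempfile.TemporaryDirectory() as tmp:
    layout = build_layout(tmp, ["students", "teachers", "schools"])

for folder in layout:
    print(folder)
```

Creating the folders does not encrypt anything; as noted above, the contents of the encrypted folder still have to be protected with separate encryption software.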
Additional Resources
- DIME Analytics (World Bank), Data Map