Master Dataset
Most research projects collect and use multiple datasets for a given unit of observation. Reference (master) datasets are the second component of using a data map to organize data work in a research team. They allow the research team to keep track of individual units for each level of observation. For example, reference (master) datasets are useful for keeping track of each household if the unit of observation is individual households, each company if the unit of observation is individual companies, and so on. While reference (master) datasets take some time and effort to set up, they significantly reduce sources of error and simplify the process of working with datasets from multiple sources, such as baseline data, endline data, and administrative and monitoring data.
Read First
- Reference (master) datasets are a crucial component of using a data map to organize data work.
- The research team must create one entry in the reference (master) dataset for each relevant unit of observation.
- Save de-identified reference (master) datasets in the Reference (Master) Data folder and save reference (master) datasets with PII in the Encrypted Data folder.
Overview
A reference (master) dataset is a comprehensive listing of the fixed characteristics of the observations that might occur in any other project dataset. Therefore, it contains one entry for each possible observation of a given unit of observation that a research team could ever work with in the project context via sampling, surveying, or otherwise. For example, a household reference (master) dataset should include data on all households that the research team ever encountered: households in the analysis, households sampled for surveys, households listed in the census, and households that were included in monitoring data despite not being a part of the project. Accordingly, some observations in the reference (master) dataset will not have data for all the variables. As long as it is documented properly, this is normal and acceptable. Make sure that all missing values in a reference (master) dataset are explained using extended missing values; Stata's regular missing values should never be allowed in a final reference (master) dataset.
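The rule about explained missing values can be checked mechanically. The following is a minimal sketch in Python rather than Stata, assuming that the strings ".a" and ".b" stand in for Stata's extended missing values; the variable names and the meanings of the codes are hypothetical.

```python
# Hypothetical codes standing in for Stata's extended missing values.
MISSING_REASONS = {
    ".a": "household listed in the census but never surveyed",
    ".b": "household appears only in monitoring data",
}

def is_unexplained(value):
    """True if a value is missing without a documented reason."""
    if value is None or value == "":
        return True
    if isinstance(value, str) and value.startswith("."):
        # A bare "." or an undocumented code counts as unexplained.
        return value not in MISSING_REASONS
    return False

def unexplained_missing(rows, variables):
    """Return (row index, variable) pairs that still need an explanation."""
    return [
        (i, var)
        for i, row in enumerate(rows)
        for var in variables
        if is_unexplained(row.get(var))
    ]

# Illustrative household records (made-up values).
households = [
    {"hh_id": 1001, "village_id": 10, "hh_size": 4},
    {"hh_id": 1002, "village_id": 10, "hh_size": ".a"},
    {"hh_id": 1003, "village_id": 11, "hh_size": None},
]

problems = unexplained_missing(households, ["village_id", "hh_size"])
print(problems)
```

A check like this could run at the end of the script that builds the dataset, so that the final version is only saved once the list of problems is empty.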
Observations in each reference (master) dataset should contain the IDs of any relevant higher level units of observation. For example, a student reference (master) dataset should include the student ID and the student’s school ID, region ID, and so on. This ensures quick, effortless merging of information from different datasets when need be – and with a very low risk of errors.
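Before relying on these higher-level IDs for merging, it helps to confirm that each ID is unique within its own dataset and that every higher-level ID actually exists in the corresponding reference (master) dataset. A minimal Python sketch, with hypothetical variable names and made-up records:

```python
# Illustrative school and student reference (master) records.
schools = [
    {"school_id": 1, "region_id": 7},
    {"school_id": 2, "region_id": 7},
]
students = [
    {"student_id": 101, "school_id": 1, "region_id": 7},
    {"student_id": 102, "school_id": 2, "region_id": 7},
    {"student_id": 103, "school_id": 3, "region_id": 7},
]

def duplicated_ids(rows, id_var):
    """Return the IDs that appear more than once."""
    seen, dupes = set(), set()
    for row in rows:
        key = row[id_var]
        if key in seen:
            dupes.add(key)
        seen.add(key)
    return sorted(dupes)

def orphan_ids(children, parents, parent_id_var):
    """Return higher-level IDs used by children but absent from the parent dataset."""
    known = {parent[parent_id_var] for parent in parents}
    return sorted({child[parent_id_var] for child in children} - known)

duplicate_students = duplicated_ids(students, "student_id")
missing_schools = orphan_ids(students, schools, "school_id")
print(duplicate_students, missing_schools)
```

In this example, school 3 is referenced by a student but has no entry in the school dataset, which is the kind of gap that would otherwise surface only when a later merge fails.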
Creating the Reference (Master) Dataset
When to Create
Create the reference (master) dataset as soon as you begin working with a new unit of observation. This typically happens at one of three moments:
- When creating the population frame for a survey: creating a reference (master) dataset at this point is useful as it often reveals any potential issues in the data collected for population frame work. It is important to find these errors before beginning any field work.
- When adding administrative data to a dataset: if, for example, a research team has surveyed students and then receives administrative data on school budgets to merge into the student dataset, they should create a reference (master) dataset for schools before proceeding.
- When receiving monitoring data: monitoring data is often collected either through surveys or through administrative data, which the two points above already cover. However, if monitoring data is of a different unit of observation than the data already represented in the reference (master) datasets, it requires a new reference (master) dataset. For example, if a research team that conducts farmer surveys also conducts village-level monitoring activities that track whether villages received a treatment or not, then they should create a reference (master) dataset for villages with the monitoring data. The village reference (master) dataset should have village IDs and the farmer reference (master) data should include village IDs for each farmer. Then it will be easy to include the monitoring data whenever it is needed.
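The village example in the last point can be sketched as follows. This is an illustration in Python under stated assumptions: the variable names are hypothetical, the records are made up, and treatment status is assumed to be a fixed village characteristic, so conflicting reports raise an error instead of being silently overwritten.

```python
# Illustrative village-level monitoring records, one row per visit.
monitoring = [
    {"village_id": 10, "visit": 1, "treated": 1},
    {"village_id": 10, "visit": 2, "treated": 1},
    {"village_id": 11, "visit": 1, "treated": 0},
]
# Farmer reference (master) records already carry the village ID.
farmers = [
    {"farmer_id": 5001, "village_id": 10},
    {"farmer_id": 5002, "village_id": 11},
]

def build_village_master(records):
    """Collapse monitoring rows to one entry per village."""
    master = {}
    for rec in records:
        vid = rec["village_id"]
        if vid in master and master[vid]["treated"] != rec["treated"]:
            raise ValueError(f"conflicting treatment status for village {vid}")
        master[vid] = {"village_id": vid, "treated": rec["treated"]}
    return master

village_master = build_village_master(monitoring)

# Because farmers carry village IDs, attaching village data is a simple lookup.
farmers_with_status = [
    {**farmer, "treated": village_master[farmer["village_id"]]["treated"]}
    for farmer in farmers
]
print(farmers_with_status)
```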
What to Include
Create a reference (master) dataset for all units of observation that fall into any of these categories:
- The unit of observation of any data source, including:
  - The unit of observation at which each survey is conducted. If a research team surveys students, teachers, parents, and principals, each of these units of observation requires a reference (master) dataset.
  - The unit of observation of any administrative data used.
  - The unit of observation in any monitoring data used.
- The unit of observation used in any significant step of the analysis, including:
  - Sampling.
  - Treatment Assignment. If a research team running a student-level analysis never collects data on the school level, but randomly assigns the treatment at the school level, they should have a school reference (master) dataset. This dataset should have a variable indicating treatment assignment in such a way that this dataset can be merged to any other dataset.
  - Any other significant step of the analysis.
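The treatment assignment case can be illustrated with a many-to-one merge from a school reference (master) dataset onto student records. This is a minimal Python sketch with hypothetical variable names and made-up data; storing the school dataset as a dictionary keyed by school ID is a design choice made here so that each school can appear only once.

```python
# School reference (master) data keyed by school ID (illustrative values).
school_master = {
    1: {"school_id": 1, "treatment": 1},
    2: {"school_id": 2, "treatment": 0},
}
students = [
    {"student_id": 101, "school_id": 1},
    {"student_id": 102, "school_id": 1},
    {"student_id": 103, "school_id": 2},
    {"student_id": 104, "school_id": 9},
]

def merge_many_to_one(rows, master, key):
    """Attach master variables to each row and report rows without a match."""
    merged, unmatched = [], []
    for row in rows:
        parent = master.get(row[key])
        if parent is None:
            unmatched.append(row)
        else:
            # Variables already present in the row take precedence.
            merged.append({**parent, **row})
    return merged, unmatched

merged_students, unmatched_students = merge_many_to_one(
    students, school_master, "school_id"
)
print(len(merged_students), unmatched_students)
```

Returning the unmatched rows separately, instead of dropping them, makes it easy to see that student 104 refers to a school that is not yet in the school dataset.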
For some units of observation in a dataset, it is not worth creating a reference (master) dataset. Consider, for example, that all students in a survey belong to a school, school district, region, and country. If this survey was conducted in a single region, then there is no need to create a reference (master) dataset for region and for country, as they would only have one observation each. It is not incorrect to do so, but in most cases it is not worth the effort.
Adding to the Reference (Master) Dataset
When to Add
Each time you come across new instances of a unit of observation, add them to the reference (master) dataset. For example, imagine that a research team has completed baseline patient surveys for a public health project. Between baseline and endline, they monitor whether patients in the clinics received the treatment according to the research design. However, the monitoring team does not have access to the baseline sample and instead simply randomly selects patients at the clinic. When the research team first receives the monitoring data, they should confirm if any of the monitored patients are associated with the baseline using the reference (master) dataset. If the reference (master) dataset is created correctly, then it already includes all the patients associated with the baseline.
They should then merge all new patient-level monitoring data to the patient reference (master) dataset. Then, they should assign each new observation an unused ID. Be very careful when assigning IDs and always check for any errors before proceeding. Make sure not to alter existing IDs. Consider, for example, that if you sort your observations alphabetically before assigning IDs, then the new observations added to your reference (master) dataset may alter that alphabetical sort and result in errors. For more information on ID variables, see ID Variable Properties.
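One way to avoid altering existing IDs is to append new observations and number them from the highest ID already in use, regardless of sort order. The sketch below is an illustration in Python, assuming numeric IDs; the function and variable names are hypothetical.

```python
def assign_new_ids(master, new_rows, id_var):
    """Append new rows to the master list, giving each the next unused ID."""
    next_id = max((row[id_var] for row in master), default=0) + 1
    updated = list(master)  # existing entries are copied over untouched
    for row in new_rows:
        updated.append({**row, id_var: next_id})
        next_id += 1
    # Check for errors before proceeding: every ID must be unique.
    ids = [row[id_var] for row in updated]
    if len(ids) != len(set(ids)):
        raise ValueError("duplicate IDs after assignment")
    return updated

# Illustrative patient records (made-up names).
patients = [
    {"patient_id": 1, "name": "Baker"},
    {"patient_id": 2, "name": "Diaz"},
]
new_patients = [{"name": "Adams"}, {"name": "Chen"}]

updated_patients = assign_new_ids(patients, new_patients, "patient_id")
print(updated_patients)
```

Here "Adams" receives ID 3 even though the name sorts before "Baker", so the IDs assigned at baseline stay exactly as they were.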
Merging without Numeric IDs
If you have administrative or monitoring data that does not have a numeric ID, you may have to merge using string variables. This process is error prone, so it should always be done carefully and using the reference (master) dataset. Never merge on a string variable using any dataset other than the reference (master) dataset.
String variables are often not unique across all possible observations. For example, a name might be unique within a village but not across a district or a region. Therefore, merging on a single string variable is often not enough, and a combination of string variables is needed instead. The exact merging procedure for string variables differs between datasets; the more information one has on the data, the easier it is to avoid mistakes in this exercise.
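The idea of merging on a combination of string variables can be sketched as follows. This is an illustration in Python, not the procedure from any particular project: the variable names and records are hypothetical, and the only cleaning applied is lowercasing and collapsing repeated spaces.

```python
def make_key(row, variables):
    """Build a composite key from several string variables."""
    return tuple(" ".join(str(row[v]).split()).lower() for v in variables)

def match_to_master(rows, master, variables, id_var):
    """Look up each row's ID in the master dataset by composite key."""
    index = {}
    for entry in master:
        key = make_key(entry, variables)
        if key in index:
            # The chosen combination must uniquely identify master entries.
            raise ValueError(f"key {key} is not unique in the master dataset")
        index[key] = entry[id_var]
    matched, unmatched = [], []
    for row in rows:
        key = make_key(row, variables)
        if key in index:
            matched.append({**row, id_var: index[key]})
        else:
            unmatched.append(row)  # left for manual review
    return matched, unmatched

# The same name occurs in two villages, so the name alone is not a valid key.
farmer_master = [
    {"farmer_id": 5001, "name": "Amina Yusuf", "village": "Kito"},
    {"farmer_id": 5002, "name": "Amina Yusuf", "village": "Maru"},
]
admin_data = [
    {"name": "amina  yusuf", "village": "Maru", "plot_ha": 1.5},
    {"name": "Amina Yusf", "village": "Kito", "plot_ha": 0.8},
]

matched, unmatched = match_to_master(
    admin_data, farmer_master, ["name", "village"], "farmer_id"
)
print(matched, unmatched)
```

The misspelled record is deliberately left unmatched so that it can be corrected by hand, which keeps any string alteration limited to the observations it was intended for.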
This example do-file shows important steps and useful advice for merging string variables. Note that in a real-world scenario, datasets usually require many more corrections than in this example do-file, especially if the data was collected in a context where names are written in a script other than the Latin script. When altering strings, be careful not to alter them for more observations than you intend to.
Where to Store
Master Data Folder
The Master Data folder, which the iefolder command generates when creating the DataWork folder, contains the de-identified data. The Master Data folder should contain a sub-folder for each unit of observation. Each unit of observation sub-folder should be named in a way that allows anyone unfamiliar with the folder to still understand the unit of observation for each dataset (e.g. master_students, master_teachers and master_schools). Each unit of observation folder should contain the following sub-folders.
- DataSet: This folder stores the de-identified reference (master) dataset. This folder may also include de-identified, raw folders containing the datasets needed to create each reference (master) dataset. Never mix raw datasets with reference (master) datasets.
- Dofiles: This folder should contain the do-files that create each reference (master) dataset. Name each do-file clearly. If there are multiple do-files per reference (master) dataset, create do-file sub-folders for each unit of observation. Be very careful when you write these do-files, as the validity of the data for your project depends on them.
Encrypted Data Folder
The Encrypted Data folder, which the iefolder command generates when creating the DataWork folder, should store any identifying or sensitive reference (master) datasets, outputs, and do-files. Note that while iefolder creates the Survey Encrypted Data folder, it does not encrypt it. The folder’s contents can easily be encrypted using software like Boxcryptor.
The Encrypted Data folder should contain a sub-folder for each unit of observation. Each unit of observation sub-folder should be named in a way that allows anyone unfamiliar with the folder to still understand the unit of observation for each dataset (e.g. master_students, master_teachers and master_schools). Each unit of observation folder should contain the following sub-folders.
- DataSet: The reference (master) dataset with PII should be stored here.
- Sampling: This folder should contain a folder for each sampling exercise. Within that folder should be two sub-folders: one for do-files and one for the output of the sampling. Typically the do-file only needs the reference (master) dataset to run, though sometimes advanced sampling techniques require additional input.
- Treatment: This folder should contain a folder for each treatment assignment exercise. Within that folder should be two sub-folders: one for do-files and one for the output of the treatment assignment. Typically the do-file only needs the reference (master) dataset to run, though sometimes advanced assignment techniques require additional input.
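The folder layout described in this section can be summarized in a short sketch. The Python code below builds the layout in a temporary directory purely for illustration. The top-level folder names are taken from the text above and may not match the exact names that iefolder creates, and the code is not a substitute for running iefolder.

```python
import tempfile
from pathlib import Path

# Example unit-of-observation folders, as named in the text.
UNITS = ["master_students", "master_teachers", "master_schools"]

# Sub-folders expected inside each unit folder, per top-level folder.
LAYOUT = {
    "Master Data": ["DataSet", "Dofiles"],
    "Encrypted Data": ["DataSet", "Sampling", "Treatment"],
}

def create_layout(root):
    """Create the folder tree under root and return the relative paths."""
    created = []
    for top, subfolders in LAYOUT.items():
        for unit in UNITS:
            for sub in subfolders:
                path = Path(root) / top / unit / sub
                path.mkdir(parents=True, exist_ok=True)
                created.append(path.relative_to(root).as_posix())
    return created

with tempfile.TemporaryDirectory() as tmp:
    folders = create_layout(tmp)

for folder in folders:
    print(folder)
```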
Additional Resources
- DIME Analytics (World Bank), Data Map