Data Linkage Table
The point of the data linkage table is to ensure that you can accurately and reproducibly link all datasets associated with your project. Data linkage errors are common. For example, you may have two datasets with the same units (companies, health workers, court cases etc.) but no way to easily merge or append them. You might have to do a fuzzy match on string variables or sets of descriptive characteristics, which is always a time-consuming and error-prone process that cannot scale with additional data.
In our experience, it is easy for all team members to remember the names of all ID variables, where datasets are backed up, etc. at any single point in time for the exact data they are working currently working on. However, when projects last multiple years, and team members rotate in and out, and new datasets are acquired, relying on individual memories is not a sustainable solution. Teams end up with datasets that cannot not be linked together with precision, and end up using other identifiers like names, but that is both laborious and prone to errors.
The data linkage table should not include every version of each data set. It should only list the original datasets and not any derivatives of them. For example, if you collect primary data, you should only include the raw data, and not the cleaned version of the data. Similarly, if you, for example, receive admin data or acquire data through web scraping, you should only include those datasets and not aggregations or reshaped versions of those datasets. Your code that creates all the derivatives of the datasets in the data linkage table should be well documented enough that all derivative data set can be traced back to one of these datasets.
Given below is a brief explanation of the purpose and contents of each column in the data linkage table:
- data_source. Where does this data come from? It could either be a data acquisition activity like a survey, or the name of a partner organization that is providing your research team with the data.
- data_set_name. What is this dataset called in the research team? Make sure that all datasets have unique and informative names, and make sure that everyone in the team uses these names so that there never is any confusion.
- frequency. How often is this data collected? This could be “once”, which would be the case, for example, for a baseline survey. If you run the same survey at two distinct points in time, for example, a baseline and an endline, then count these as separate datasets, each with frequency once. For all other types of data acquisition that is not a single discrete activity this column will indicate what the frequency is, which could be anything from hourly, daily to monthly.
- unit_of_obs. What is the unit of observation for the dataset? I.e. what does each row represents? All categories in this column should have a corresponding master data set.
- master_project_id. What is the name of the ID variable used in this project to identify each unit, i.e. each row, in this dataset. This id should be uniquely and fully identifying.
- alternative_id. List any other IDs that is used to identify this unit of observation. For example IDs used by partner organizations. While a research team should only use one ID per unit of observation per project (i.e. the master_project_id), it is still common that multiple IDs for the same unit of observation occurs in the same project due to, for example, partner organizations having their own ID. Typically we never want to use an alternative ID used by someone else, as then anyone else using that ID can re-identify our data. A project that uses someone else’s ID should consider their data as identified, even if all identifiers are removed during de-identification.
- one_to_many_id. Which (if any) other project ID variables does this project variable merge one-to-many to? For example, if this dataset had the unit of observations “school” and the project also has student datasets, then this column should include the student ID variable as each school merge to many students. Each ID listed here should have a master dataset.
- many_to_one_id. Which (if any) other project ID variables does this project variable merge many-to-one to? For example, if this dataset had the unit of observations “student” and the project also has school datasets, then this column should include the school ID variable as many students merge to the same school. All ID variables listed here must also exist in the master dataset for the unit of observation of the dataset. So in the students/school examples, the student master dataset must include the school ID variable. Each ID listed here should have a master dataset.
- file_location. This is the file path in the projects shared file system for where the dataset is stored. If this data is not stored in a shared file system and, for example, is pulled from the internet each time the code runs, then the URL should be listed here together with other information related how to access the data.
- raw_backup_location_1 and raw_backup_location_2. Where are these data sets backed up? Be as detailed as possible. The one day you may need this information you will be very thankful for any information that you will have available. List things as storage type (hard drive, cloud etc.), full file path and name. Remember that these files need to be encrypted if they include identifying information (which is almost always the case with raw data), so you should save decryption instructions here as well. (Although you should save the decryption key in a more secure location.)
- notes. A column with notes to a specific dataset. If the same type of note appears for many datasets, consider adding a new column for that type of information