Data Linkage Table

Revision as of 14:59, 8 September 2020 by Avnish95 (talk | contribs)

The purpose of the data linkage table is to ensure that you can accurately and reproducibly link all datasets associated with your project. Data linkage errors are common. For example, you may have two datasets with the same units (companies, health workers, court cases, etc.) but no way to easily merge or append them. You might then have to do a fuzzy match on string variables or sets of descriptive characteristics, which is a time-consuming and error-prone process that does not scale as more data are added.
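To illustrate the difference between linking on a shared ID and falling back on descriptive strings, here is a minimal sketch in Python. The two small datasets, the `worker_id` variable, and the `link_on_key` helper are all hypothetical and exist only for this example; they are not part of any specific project or library.

```python
# Hypothetical example data: two datasets describing the same health workers.
survey = [
    {"worker_id": "HW001", "name": "Maria Lopez", "score": 71},
    {"worker_id": "HW002", "name": "John Smith", "score": 64},
]
payroll = [
    {"worker_id": "HW001", "name": "LOPEZ, Maria", "salary": 1200},
    {"worker_id": "HW002", "name": "Jon Smith", "salary": 1100},
]


def link_on_key(left, right, key):
    """Exact link of `left` rows to `right` rows on a shared key.

    Raises ValueError if the key is duplicated in `right`, because a
    duplicated key cannot identify a unit unambiguously.
    Returns (linked_rows, unmatched_left_rows).
    """
    index = {}
    for row in right:
        if row[key] in index:
            raise ValueError(f"duplicate {key}: {row[key]}")
        index[row[key]] = row

    linked, unmatched = [], []
    for row in left:
        match = index.get(row[key])
        if match is None:
            unmatched.append(row)
        else:
            # Values from `left` take precedence where both rows share a field.
            linked.append({**match, **row})
    return linked, unmatched


# Linking on the shared ID matches every unit exactly.
by_id, missing_by_id = link_on_key(survey, payroll, "worker_id")

# Linking on names fails here because the spellings differ between sources.
by_name, missing_by_name = link_on_key(survey, payroll, "name")
```

In this made-up data, the ID-based link matches both workers, while the exact name-based link matches neither, which is the situation that pushes teams toward fuzzy matching.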

In our experience, it is easy for team members to remember the names of all ID variables, where datasets are backed up, etc. at any single point in time for the exact data they are currently working on. However, when projects last multiple years, team members rotate in and out, and new datasets are acquired, relying on individual memory is not a sustainable solution. Teams end up with datasets that cannot be linked together with precision and fall back on other identifiers such as names, which is both laborious and prone to errors.

The data linkage table should not include every version of each dataset. It should list only the original datasets and not any derivatives of them. For example, if you collect primary data, you should include only the raw data and not the cleaned version of the data. Similarly, if you receive admin data or acquire data through web scraping, you should include only those datasets and not aggregations or reshaped versions of them. The code that creates the derivatives of the datasets in the data linkage table should be documented well enough that every derivative dataset can be traced back to one of these datasets.
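One way to picture such a table is as a small list of records, one per original dataset. The sketch below is only an illustration: the field names (`dataset`, `unit`, `id_variable`, `backup_location`), the dataset names, the paths, and both helper functions are assumptions made for this example rather than a prescribed format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LinkageEntry:
    """One row of a hypothetical data linkage table (original datasets only)."""
    dataset: str
    unit: str
    id_variable: str
    backup_location: str


# Only original datasets are listed: raw survey data, admin data, scraped data.
LINKAGE_TABLE = [
    LinkageEntry("baseline_survey_raw", "health worker", "worker_id", "server/raw/baseline"),
    LinkageEntry("payroll_admin", "health worker", "worker_id", "server/raw/payroll"),
    LinkageEntry("facility_scrape", "facility", "facility_id", "server/raw/scrape"),
]

# Derivatives are NOT rows in the table. They are documented separately,
# here as a mapping from each derivative to the originals it is built from.
DERIVATIVES = {
    "baseline_survey_clean": ["baseline_survey_raw"],
    "worker_panel": ["baseline_survey_raw", "payroll_admin"],
    "district_totals": ["census_extract"],  # source missing from the table
}


def linkable_pairs(table):
    """Return (dataset_a, dataset_b, id_variable) for entries sharing an ID variable."""
    pairs = []
    for i, a in enumerate(table):
        for b in table[i + 1:]:
            if a.id_variable == b.id_variable:
                pairs.append((a.dataset, b.dataset, a.id_variable))
    return pairs


def untraceable(derivatives, table):
    """Return derivatives whose listed sources are not all in the linkage table."""
    listed = {entry.dataset for entry in table}
    problems = {}
    for name, sources in derivatives.items():
        missing = sorted(set(sources) - listed)
        if missing:
            problems[name] = missing
    return problems


pairs = linkable_pairs(LINKAGE_TABLE)
problems = untraceable(DERIVATIVES, LINKAGE_TABLE)
```

With this made-up content, `pairs` shows that the survey and payroll data can be linked on `worker_id`, and `problems` flags `district_totals` because it claims a source that the table does not list.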