Data Map
A data map is a template designed by [https://www.worldbank.org/en/research/dime/data-and-analytics DIME Analytics] for organizing the three main aspects of data work: [[Data Analysis|data analysis]], [[Data Cleaning|data cleaning]], and [[Data Management|data management]]. The data map template consists of three components: a [[Data Linkage Table|data linkage table]], a [[Master Dataset|master dataset]], and [[Data Flow Charts|data flow charts]]. [https://www.worldbank.org/en/research/dime/data-and-analytics DIME Analytics] recommends using data maps to organize the various components of your data work in order to improve the quality of both the data and the research.


== Read First ==
* The best time to create a data map is before [[Primary Data Collection|data collection]] begins.
* A data map template has three components: a [[Data Linkage Table|data linkage table]], one or more [[Master Dataset|master datasets]], and one or more [[Data Flow Charts|data flow charts]].
* The [[Impact Evaluation Team|research team]] should keep updating the data map as the project moves forward.
* The data map template is meant to act as a starting point for [[Data Management|data management]] within a '''research team'''.
* It is important to understand the underlying best practices for each component of a data map before discussing which components do not apply in a given situation.


== Overview ==
Most of the details required for preparing a data map are not complex. For example, it is easy for the [[Impact Evaluation Team#Field Coordinators (FCs)|field coordinator (FC)]] to remember what the '''respondent ID''' is when [[Primary Data Collection|data collection]] is still ongoing. However, it is harder to ensure that everyone in the [[Impact Evaluation Team|research team]] has the same level of understanding. Further, as time passes, the '''field coordinator (FC)''' themselves can forget what exactly a particular '''variable''' measures, or why it was included in the [[Master Dataset|dataset]]. '''Research teams''' often do not spend enough time planning and organizing data work because small details like the purpose of a '''variable''' might seem obvious. However, this lack of proper planning is a common source of error. Fortunately, the solution is to create a data map, which is quick and easy to implement.


[https://www.worldbank.org/en/research/dime/data-and-analytics DIME Analytics] has prepared a data map template, which has the following three components:
    
    
* A [[Data Linkage Table|data linkage table]]: The '''data linkage table''' lists all the [[Master Dataset|datasets]] in a particular project, and explains how they are linked to each other. For example, a '''data linkage table''' can describe how a '''dataset''' containing information about students can be merged with a '''dataset''' containing information about various schools. It also specifies which [[ID Variable Properties|ID variable]] can be used to perform the merging. Finally, the table should also include meta-information, that is, information about the '''datasets''', such as where the original versions of these '''datasets''' are backed up.


* One or more [[Master Dataset|master datasets]]: '''Master datasets''' allow the '''research team''' to keep track of [[Unit of Observation|units for each level of observation]]. For example, '''master datasets''' are useful for keeping track of each household if the '''unit of observation''' is individual households, each company if the '''unit of observation''' is individual companies, and so on. Most importantly, the master dataset should specify the [[ID_Variable_Properties#Property_1:_Uniquely_Identifying|uniquely]] and [[ID_Variable_Properties#Property_2:_Fully_Identifying|fully identifying]] '''ID variable''' for each '''unit of observation'''. Include '''variables''' related to the research design in the '''master dataset''', such as [[Randomized_Control_Trials#Randomized_Assignment|treatment assignment variables]] in the form of '''dummy variables'''. The '''master dataset''' is therefore the authoritative source of all information in a particular project.


* One or more [[Data Flow Charts|data flow charts]]: A '''data flow chart''' specifies which '''datasets''' are needed to create the [[Data Analysis|analysis dataset]], and how they may be combined by either appending or merging '''datasets'''. This means that there should be one '''data flow chart''' per analysis '''dataset'''. Make sure that every original '''dataset''' that is mentioned in a '''data flow chart''' is listed in the '''data linkage table'''. For example, in a particular flow chart, information about which '''variables''' to use as the basis for merging '''datasets''' should correspond to the information in the '''data linkage table'''.
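As a minimal sketch of how a '''data linkage table''' guides actual code, the student–school merge described above might look like the following in pandas. All dataset names, ID variables, and values here are hypothetical, invented purely for illustration:

```python
import pandas as pd

# Hypothetical student-level dataset. The data linkage table would record
# that "school_id" is the ID variable linking students to schools.
students = pd.DataFrame({
    "student_id": [101, 102, 103],
    "school_id": [1, 1, 2],
    "test_score": [78, 85, 91],
})

# Hypothetical school-level dataset, uniquely identified by "school_id".
schools = pd.DataFrame({
    "school_id": [1, 2],
    "district": ["North", "South"],
})

# Merge exactly as documented in the data linkage table. validate="m:1"
# makes pandas raise an error if "school_id" is not unique in `schools`,
# catching ID problems early instead of silently duplicating rows.
linked = students.merge(schools, on="school_id", how="left", validate="m:1")
print(linked.shape)  # each student row gains its school's district
```

The `validate` argument is the code-level counterpart of the linkage documented in the table: it turns an undocumented assumption about ID uniqueness into an explicit, automatically checked one.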


'''Note''' - Please keep the following points in mind regarding data maps:
* A good data map can save a lot of time. Sometimes '''research teams''' realize that they are spending more time on linking '''datasets''' instead of actually [[Data Analysis|analyzing data]]. In such cases, creating a data map can save a lot of time.


* It is never too late to create a data map. Even if the '''research team''' is in the middle of a project, or nearing the end of a project, it is still a good idea to pause and create a data map if it is becoming difficult to keep track of various aspects of the data.


* Modify the data map based on the context. As with all templates, the '''research team''' might need to add items to the data map template, or may find that some components do not apply in a particular context.


* There can be multiple '''master datasets''', but only one '''data linkage table'''. Many projects have multiple '''units of observation''', in which case there should be one '''master dataset''' for each '''unit of observation''' that is considered central to the project. However, there should only be one '''data linkage table''' per project.


== Data Linkage Table ==
The purpose of a [[Data Linkage Table|data linkage table]] is to allow the [[Impact Evaluation Team|research team]] to accurately and [[Reproducible Research|reproducibly]] link all [[Master Dataset|datasets]] associated with the project. Errors in linking '''datasets''' are fairly common in development research, particularly when there are several rounds of [[Primary Data Collection|data collection]] involved, or while using [[Secondary Data Sources|secondary data]]. For example, there might be two '''datasets''' with the same units, such as firms, health workers, or agricultural plots, but no straightforward way to merge or append them. In such cases, the '''research team''' might have to perform a fuzzy match on string '''variables''', which is often time-consuming and error-prone, and certainly cannot be scaled up when a large number of '''datasets''' are involved.
 
In our experience, it is easy for all team members to remember the names of all ID variables, where '''datasets''' are backed up, and so on, at any single point in time for the exact data they are currently working on. However, when projects last multiple years, team members rotate in and out, and new '''datasets''' are acquired, relying on individual memory is not a sustainable solution. Teams end up with '''datasets''' that cannot be linked together with precision, and fall back on other identifiers like names, which is both laborious and prone to errors.
 
The '''data linkage table''' should not include every version of each '''dataset'''. It should only list the original '''datasets''', not any derivatives of them. For example, if you collect primary data, you should only include the raw data, not the cleaned version of the data. Similarly, if you receive administrative data or acquire data through web scraping, you should only include those '''datasets''', not aggregated or reshaped versions of them. The code that creates all derivatives of the '''datasets''' in the '''data linkage table''' should be documented well enough that every derivative '''dataset''' can be traced back to one of them.
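Concretely, a '''data linkage table''' can be as simple as a small spreadsheet or CSV kept under version control alongside the project code. A minimal, hypothetical sketch, with all dataset names, ID variables, and backup paths invented for illustration:

```python
import pandas as pd

# A hypothetical data linkage table kept as a plain CSV / DataFrame.
# Only original datasets are listed; cleaned or aggregated derivatives are
# reproduced by well-documented code and therefore do not appear here.
linkage_table = pd.DataFrame({
    "dataset":         ["student_survey_raw", "school_admin_raw"],
    "unit_of_obs":     ["student", "school"],
    "id_variable":     ["student_id", "school_id"],
    "links_to":        ["school_admin_raw via school_id", ""],
    "backup_location": ["backups/raw/student_survey.csv",
                        "backups/raw/school_admin.csv"],
})

# Every original dataset should appear exactly once.
assert linkage_table["dataset"].is_unique
print(len(linkage_table))
```

The exact columns are a judgment call for each project; the essential ones are the dataset name, its unit of observation, its ID variable, how it links to other datasets, and where the untouched original is backed up.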


== Master Dataset ==
Most research projects collect and use multiple '''datasets''' for a given [[Unit of Observation|unit of observation]]. A [[Master Dataset|master dataset]] is a comprehensive listing of the fixed characteristics of the observations that might occur in any other project '''dataset'''. Therefore, it contains one entry for each possible observation of a given '''unit of observation''' that a [[Impact Evaluation Team|research team]] could ever work with in the project context via [[Sampling|sampling]], [[Survey Pilot|surveying]], or otherwise. While '''master datasets''' take some time and effort to set up, they dramatically mitigate sources of error and simplify working with '''datasets''' from multiple sources (e.g. baseline data, endline data, [[Administrative and Monitoring Data|administrative and monitoring data]]).
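A minimal sketch of how a '''master dataset''' acts as the authoritative source, assuming a hypothetical household-level project (all IDs, names, and values below are invented):

```python
import pandas as pd

# Hypothetical household-level master dataset: one row per household the
# project could ever work with, including the treatment assignment dummy.
master = pd.DataFrame({
    "household_id": [1, 2, 3, 4],      # uniquely and fully identifying ID
    "village": ["A", "A", "B", "B"],   # fixed characteristic
    "treatment": [1, 0, 1, 0],         # research-design dummy variable
})

# The ID variable must uniquely identify each unit of observation.
assert master["household_id"].is_unique

# A survey round is checked against the master: every household that shows
# up in the field data must already exist in the master dataset.
baseline = pd.DataFrame({"household_id": [1, 2, 4],
                         "income": [300, 450, 520]})
assert baseline["household_id"].isin(master["household_id"]).all()

# Treatment status is pulled from the authoritative source, not re-entered
# by hand in each survey round.
baseline = baseline.merge(master[["household_id", "treatment"]],
                          on="household_id", validate="m:1")
print(baseline.shape)
```

The two `assert` lines encode the master dataset's defining guarantees: the ID is unique, and no observation exists anywhere in the project that the master does not know about.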


== Data Flow Charts ==
Research projects go through the time-consuming and often costly effort of acquiring data because, in the end, the goal is to analyze that data. The purpose of a [[Data Flow Charts|data flow chart]] is to map out which '''datasets''' are needed to create the '''datasets''' the analysis will run on, and to communicate to the full team how to create them. After the '''datasets''' are created, the '''data flow charts''' also become a great visual way of documenting how the [[Data Analysis|analysis datasets]] were created.
[[Data Flow Charts|Data flow charts]] can be very simple, for example, when your [[Data Analysis|analysis dataset]] is created by appending the baseline data to the endline data. But even in a simple case like that, you often realize while writing the '''data flow chart''' that some [[Administrative and Monitoring Data|administrative data]], treatment statuses, [[Administrative and Monitoring Data|monitoring data]], and so on are also needed. Mapping all those needs and documenting them well in both the '''data flow charts''' and the '''data linkage table''' is the best way to guarantee that you do not find yourself in a situation where you cannot construct the '''analysis dataset''' because you do not have all the [[Master Dataset|datasets]] you need, or because the '''datasets''' you have lack the information needed to create it.
 
Data flow charts are closely related to the '''data linkage table''', and the two should be created in an iterative process. Each starting point in a '''data flow chart''' should be a '''dataset''' listed in the '''data linkage table''', but the full list of '''datasets''' needed in the '''data linkage table''' is often not clear until the '''data flow chart''' has been created. Another way to compare the two tools: the '''data linkage table''' simply lists all the '''datasets''' the project has, or knows it will have, while the '''data flow chart''' maps out which '''datasets''' are needed, or will be needed, to create the '''analysis datasets'''.
 
 
It is common for projects to require more than one '''analysis dataset''', for example when running regressions on multiple [[Unit of Observation|units of observation]]. In these cases, you need one '''data flow chart''' per '''analysis dataset'''.
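Under the same hypothetical household setup as above, the simple flow chart just described (append baseline and endline, then merge in treatment status) translates directly into code:

```python
import pandas as pd

# Hypothetical survey rounds for the same households.
baseline = pd.DataFrame({"household_id": [1, 2],
                         "round": ["baseline"] * 2,
                         "income": [300, 450]})
endline = pd.DataFrame({"household_id": [1, 2],
                        "round": ["endline"] * 2,
                        "income": [340, 430]})
# Treatment assignment lives in the (hypothetical) master dataset.
master = pd.DataFrame({"household_id": [1, 2], "treatment": [1, 0]})

# Step 1 of the flow chart: append the two rounds into a panel.
panel = pd.concat([baseline, endline], ignore_index=True)

# Step 2: merge treatment status on the ID variable that the data linkage
# table documents for the household level.
analysis = panel.merge(master, on="household_id", how="left", validate="m:1")
print(analysis.shape)  # 4 household-round observations, 4 columns
```

Each arrow in the flow chart becomes one append or merge step in the code, which is what makes the chart such a useful piece of documentation once the analysis dataset exists.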
 


== Related Pages ==
[[Special:WhatLinksHere/Data_Map|Click here to see pages that link to this topic]].


== Additional Resources ==
[[Category: Reproducible Research]]

Latest revision as of 14:14, 14 August 2023
