Difference between revisions of "Data Map"

Revision as of 20:58, 8 September 2020

A data map is a template designed by DIME for organizing the three main aspects of data work: data analysis, data cleaning, and data management. The data map template consists of three components: a data linkage table, one or more master datasets, and one or more data flow charts. DIME Analytics recommends using data maps to organize the various components of your data work, which improves the quality of both the data and the research.

Read First

  • The best time to start creating a data map is before data collection begins.
  • A data map template has three components: a data linkage table, one or more master datasets, and one or more data flow charts.
  • The research team should keep updating the data map as the project moves forward.
  • The data map template is meant to act as a starting point for data management within a research team.
  • It is important to understand the underlying best practices for each component of a data map before discussing which components do not apply in a given situation.

Overview

Most of the details required for preparing a data map are not complex. For example, it is easy for the field coordinator (FC) to remember what the respondent ID is while data collection is still ongoing. However, it is harder to ensure that everyone on the research team has the same level of understanding. Further, as time passes, even the FC can forget exactly what a particular variable measures, or why it was included in the dataset. Research teams often do not spend enough time planning and organizing data work because small details like the purpose of a variable might seem obvious. However, this tendency is exactly what makes missing or inadequate planning a common source of error. Fortunately, the solution, a data map, is quick and easy to implement.

DIME Analytics has prepared a data map template, which has the following three components:

  • A data linkage table: The data linkage table lists all the datasets in a particular project, and explains how they are linked to each other. For example, a data linkage table can describe how a dataset containing information about students can be merged with a dataset containing information about various schools. It can also specify which ID variable can be used to perform the merging. The data linkage table should also include meta-information, that is, information about the datasets, such as where the original versions of these datasets are backed up.
  • One or more master datasets: Master datasets allow the research team to keep track of units for each level of observation. For example, master datasets are useful for keeping track of each household if the unit of observation is individual households, each company if the unit of observation is individual companies, and so on. Most importantly, the master dataset specifies the uniquely and fully identifying ID variable for each unit of observation. The master dataset should also include variables related to the research design, such as treatment assignment variables in the form of dummy variables. The master dataset should therefore be the authoritative source of all information in a particular project.
  • One or more data flow charts: There should be one flow chart per analysis dataset in the project. Each data flow chart shows which datasets are needed to create the analysis dataset and how they are combined by appending or merging them. All original datasets in a data flow chart should be listed in the data linkage table, and the information in the data flow chart (for example, which variables to merge datasets on) should correspond to the information in the data linkage table.
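
As a rough illustration of the student and school example above, the following Python sketch checks that an ID variable is uniquely and fully identifying, then merges a student-level dataset onto a school-level master dataset. It is not part of the DIME Analytics template: the dataset contents, the variable names (school_id, student_id, treatment) and both helper functions are hypothetical, and "fully identifying" is read here as "no observation has a missing ID".

```python
# Hypothetical school-level master dataset and student-level dataset.
# All names and values are illustrative, not taken from the DIME template.
schools_master = [
    {"school_id": "S01", "treatment": 1},
    {"school_id": "S02", "treatment": 0},
]
students = [
    {"student_id": "P001", "school_id": "S01"},
    {"student_id": "P002", "school_id": "S01"},
    {"student_id": "P003", "school_id": "S02"},
]


def check_id(rows, id_var):
    """Return True if id_var is never missing and never repeated in rows."""
    ids = [row.get(id_var) for row in rows]
    fully_identifying = all(value not in (None, "") for value in ids)
    uniquely_identifying = len(set(ids)) == len(ids)
    return fully_identifying and uniquely_identifying


def merge_many_to_one(many, one, id_var):
    """Attach the matching row of `one` to every row of `many`.

    Raises KeyError if a row in `many` has no match in `one`.
    """
    lookup = {row[id_var]: row for row in one}
    return [{**row, **lookup[row[id_var]]} for row in many]


# Each dataset is checked on the ID variable of its own unit of observation.
assert check_id(schools_master, "school_id")
assert check_id(students, "student_id")

# Students are merged to schools on the ID variable named in the linkage table.
merged = merge_many_to_one(students, schools_master, "school_id")
```

Running the ID check before the merge means a duplicated or missing school_id is caught in the master dataset itself, rather than surfacing later as silently dropped or mismatched rows.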

Finally, keep the following points in mind regarding data maps:

  • A good data map can save a lot of time. If you are in the middle or towards the end of your project and you spend more time linking your datasets than doing other data work, you should step back and create a data map.
  • Modify the data map based on the context. As with all templates, you might need to add items to our Data Plan Template or you may find that some items do not apply.
  • There should only be one data linkage table per project.
  • Many projects have multiple units of observation, in which case there should be one master dataset for each unit of observation that is central to the research.
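
The consistency rules described above can also be checked mechanically. The sketch below is only one possible approach and is not part of the DIME Analytics template: it represents a data linkage table, the master datasets, and the data flow charts as plain Python structures with hypothetical names, and it assumes every unit of observation listed in the linkage table is central to the research.

```python
# Hypothetical data linkage table: one row per dataset in the project.
linkage_table = [
    {"dataset": "student_survey", "unit": "student", "id_var": "student_id"},
    {"dataset": "school_census", "unit": "school", "id_var": "school_id"},
]
# Hypothetical master datasets, keyed by unit of observation.
master_datasets = {"student": "master_students", "school": "master_schools"}
# Hypothetical flow charts: analysis dataset -> original datasets it is built from.
flow_charts = {"analysis_students": ["student_survey", "school_census"]}


def find_gaps(linkage_table, master_datasets, flow_charts):
    """Return (datasets missing from the linkage table, units lacking a master dataset)."""
    listed = {row["dataset"] for row in linkage_table}
    used = {name for inputs in flow_charts.values() for name in inputs}
    unlisted = sorted(used - listed)
    units_without_master = sorted(
        {row["unit"] for row in linkage_table} - set(master_datasets)
    )
    return unlisted, units_without_master


# Two empty lists mean the three components agree with each other.
gaps = find_gaps(linkage_table, master_datasets, flow_charts)
```

A check like this could be rerun whenever the data map is updated, so that a dataset added to a flow chart but not to the linkage table is noticed early.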

Data Linkage Table

Master Dataset

Data Flow Charts

Related Pages

Additional Resources