Difference between revisions of "Data Flow Charts"
Revision as of 01:29, 14 December 2020
A data flow chart is the third component of using a data map to organize data work within a research team. The final purpose of going through the complex process of collecting or acquiring data is to analyze it. Data flow charts allow the research team to visualize which datasets are needed in order to create the datasets that will finally be used for analysis. They are also a useful tool for communicating with the rest of the team and for documenting how the analysis datasets are created from various intermediate datasets.
Read First
- A data map is a template designed by DIME Analytics to organize three main aspects of data work: data analysis, data cleaning, and data management.
- The data map template consists of three components: a data linkage table, a master dataset, and data flow charts.
- Data flow charts specify which datasets are needed to create the analysis dataset, and how they may be combined by either appending or merging datasets.
- Every original dataset that is mentioned in a data flow chart should be listed in the data linkage table.
Overview
Data flow charts are closely related to the data linkage table. Together, the two form an interdependent loop. For instance, each starting point in the data flow chart should be a dataset that is listed in the data linkage table. However, until we have created the data flow chart, we cannot easily understand which datasets we need to include in the data linkage table. An easy way to differentiate the two concepts is as follows: while the data linkage table just lists all the datasets we have, the data flow chart maps out which datasets we will need in order to create the datasets used for data analysis.
It is important to keep the following points in mind regarding data flow charts:
- Make one data flow chart for every analysis dataset. It is common for projects to require more than one analysis dataset, for example when running regressions on multiple units of observation. In these cases, the research team should make one data flow chart for each analysis dataset.
- Document your needs properly. Data flow charts can be very simple, for example when the analysis dataset is created by appending the baseline data with the endline data. Even in such a case, the research team will need to include information about administrative data, treatment statuses, monitoring data, etc. Mapping this information and documenting it properly using data flow charts and the data linkage table is the best way to avoid a situation where the research team cannot construct the analysis data because they do not have all the datasets they need, or because the datasets they have lack some information that is required to create the analysis dataset.
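The relationship between the two documents can also be checked mechanically. The following is only a minimal sketch in Python, not part of the DIME Analytics template: it assumes the flow chart has been written down as a list of steps, and all dataset names (`baseline`, `treatment_assignment`, and so on) are hypothetical. It finds the starting points of the flow chart and reports any that are missing from the data linkage table.

```python
# Hypothetical dataset names, used for illustration only.
data_linkage_table = {
    "baseline", "midline", "endline", "treatment_assignment", "monitoring",
}

# Each step names an action, its input datasets, and the dataset it creates.
# The output of one step may be used as an input to a later step.
flow_chart = [
    {"action": "append", "inputs": ["baseline", "midline", "endline"], "output": "panel"},
    {"action": "merge", "inputs": ["panel", "treatment_assignment"], "output": "panel_treat"},
    {"action": "merge", "inputs": ["panel_treat", "monitoring"], "output": "analysis"},
]


def starting_points(steps):
    """Return inputs that are not created by any step in the chart."""
    outputs = {step["output"] for step in steps}
    return {name for step in steps for name in step["inputs"] if name not in outputs}


def missing_from_linkage_table(steps, linkage_table):
    """Return starting points that the data linkage table does not list."""
    return sorted(starting_points(steps) - set(linkage_table))


print(missing_from_linkage_table(flow_chart, data_linkage_table))
```

An empty result means every original dataset in the chart is documented; any name that is returned points to a gap in the data linkage table.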
Sample Data Flow Chart
Below is a data flow chart of a project with three rounds of data collection at the farmer level, where the treatment status was randomized at the community level and treatment take-up was monitored at the farmer level. This example is based on the data linkage table above, and you can see that each starting point in the flow chart below corresponds to an item in the data linkage table.
In the example we have used the shape of a cylinder to represent a dataset and a rectangle to represent an action like “merge” or “append”. You do not need to follow this practice, but cylinders commonly indicate data in data infographics. The example below has been created in the free software https://www.lucidchart.com/, but could just as well be created in Microsoft PowerPoint. You could also draw the chart with pen and paper or on a whiteboard and then scan or photograph the final version, but the benefit of creating the chart digitally is that it is easy to update if something changes over the course of the project. It is not a bad idea to create the first version on paper or on a whiteboard together with the rest of your project team, and then transfer it to an editable digital format.
For each dataset, we indicate in the data flow chart what the uniquely and fully identifying variable or variables are. Only ID variables that are listed as master project ID variables in the data linkage table should be used in the data flow chart. One common exception to that rule, however, is a variable indicating time in longitudinal data or other such “panel” structural indicators. Another best practice is to take note of the information from supporting datasets, such as treatment take-up in the monitoring data, that is the most relevant for the analysis data.
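What "uniquely and fully identifying" means can be illustrated with a small check. This is a hedged sketch in plain Python, assuming a dataset is held as a list of dictionaries; the variable names `farmer_id` and `round` are hypothetical. It also shows why a time variable is the common exception: in panel data the project ID alone repeats across rounds.

```python
def identifies(rows, id_vars):
    """Return True if id_vars are never missing and no combination repeats."""
    seen = set()
    for row in rows:
        key = tuple(row.get(var) for var in id_vars)
        # "Fully identifying": no ID value may be missing.
        # "Uniquely identifying": no combination may appear twice.
        if any(part is None for part in key) or key in seen:
            return False
        seen.add(key)
    return True


# Hypothetical panel data: farmer 1 is observed in two rounds.
panel = [
    {"farmer_id": 1, "round": 1},
    {"farmer_id": 1, "round": 2},
    {"farmer_id": 2, "round": 1},
]

print(identifies(panel, ["farmer_id"]))           # False
print(identifies(panel, ["farmer_id", "round"]))  # True
```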
When a rectangle indicates that two datasets are combined with a merge, the box indicates which ID will be used and whether the merge is one-to-one (1:1), one-to-many (1:m), or many-to-one (m:1). When a rectangle indicates that two datasets should be combined by appending them, it is useful to indicate how the ID variable in the resulting data has changed, if it has changed.
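The two kinds of rectangles can be sketched in code. The example below is an illustration in plain Python under stated assumptions, not the method used by any particular project: datasets are lists of dictionaries, the names `farmer_id`, `community_id`, `round`, and `treated` are hypothetical, and raising an error on an unmatched or duplicated key is one possible design choice among several.

```python
def append_rounds(rounds):
    """Stack survey rounds and record the round in each row.

    After appending, farmer_id alone no longer identifies a row;
    the identifying variables become (farmer_id, round).
    """
    return [{**row, "round": r} for r, rows in rounds.items() for row in rows]


def merge_m1(many, one, key):
    """Many-to-one (m:1) merge: `key` must be unique in the `one` dataset."""
    lookup = {}
    for row in one:
        if row[key] in lookup:
            raise ValueError(f"{key}={row[key]!r} is not unique, so this is not an m:1 merge")
        lookup[row[key]] = row
    merged = []
    for row in many:
        if row[key] not in lookup:
            raise KeyError(f"{key}={row[key]!r} has no match in the 'one' dataset")
        merged.append({**row, **lookup[row[key]]})
    return merged


# Hypothetical data: two farmers in one community, surveyed twice.
baseline = [{"farmer_id": 1, "community_id": 10}, {"farmer_id": 2, "community_id": 10}]
endline = [{"farmer_id": 1, "community_id": 10}, {"farmer_id": 2, "community_id": 10}]
treatment = [{"community_id": 10, "treated": 1}]

panel = append_rounds({1: baseline, 2: endline})
analysis = merge_m1(panel, treatment, "community_id")
print(len(analysis))  # 4
```

Checking the merge type in code mirrors what the rectangle documents in the chart: if the community-level data unexpectedly contained the same community twice, the merge would no longer be m:1 and the check would fail rather than silently duplicating rows.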