Difference between revisions of "Variable Construction"
Revision as of 16:50, 5 February 2021
Variable construction is part of data work, which also involves de-identification, data cleaning, and data analysis. Variable construction involves processing cleaned data to make the data points more suitable for analysis. This is the stage where survey questions are converted into measurable indicators by creating dummy variables, index variables, and interaction variables.
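As an illustration of the three kinds of constructed variables mentioned above, the following is a minimal Python sketch. The records, the field names (hh_id, treatment, owns_land, has_radio, has_bicycle, has_phone) and the unweighted asset count are all hypothetical assumptions made for this example, not definitions taken from any particular survey.

```python
# Hypothetical cleaned survey records; every field name here is an assumption.
households = [
    {"hh_id": 1, "treatment": 1, "owns_land": "yes",
     "has_radio": 1, "has_bicycle": 0, "has_phone": 1},
    {"hh_id": 2, "treatment": 1, "owns_land": "no",
     "has_radio": 0, "has_bicycle": 0, "has_phone": 1},
    {"hh_id": 3, "treatment": 0, "owns_land": "yes",
     "has_radio": 1, "has_bicycle": 1, "has_phone": 1},
]


def construct(row):
    """Return a copy of one record with three constructed variables added."""
    out = dict(row)
    # Dummy variable: 1 if the household reports owning land, 0 otherwise.
    out["land_dummy"] = 1 if row["owns_land"] == "yes" else 0
    # Index variable: an unweighted count of the assets the household owns.
    out["asset_index"] = (
        row["has_radio"] + row["has_bicycle"] + row["has_phone"]
    )
    # Interaction variable: treatment status multiplied by the land dummy.
    out["treat_x_land"] = row["treatment"] * out["land_dummy"]
    return out


# The analysis table keeps the cleaned fields and adds the constructed ones.
analysis = [construct(row) for row in households]
```

The cleaned records are left unchanged and the constructed variables are written to a separate table, so the construction step can be re-run from the clean data at any time.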
Read First
- Variable construction is a part of the data work process. The other stages are de-identification, data cleaning, and data analysis.
- Each stage in the data work process has well-defined inputs and outputs.
- For each stage, there should be a code folder and a corresponding dataset.
- The names of code files, datasets and outputs for each stage should be consistent.
- The code files, data and outputs of each of these stages should go through at least one round of code review.
Overview
Variable construction uses inputs in the form of one or more clean data tables and master datasets, and creates outputs like one or more analysis data tables, one codebook for each analysis data table, and construction documentation. Note that the research team must carefully document how each variable is constructed, to ensure that the analysis is reproducible.
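One way to keep construction documentation alongside the analysis data table is a small codebook file with one row per constructed variable. The sketch below is only an assumption about what such a codebook might contain; the column names (variable, label, source_variables, rule) and the two example entries are hypothetical.

```python
import csv
import io

# Hypothetical codebook columns; a real project may need more (units, round).
CODEBOOK_FIELDS = ["variable", "label", "source_variables", "rule"]

# Hypothetical entries describing how two constructed variables were built.
codebook = [
    {
        "variable": "land_dummy",
        "label": "Household owns land",
        "source_variables": "owns_land",
        "rule": "1 if owns_land is 'yes', else 0",
    },
    {
        "variable": "asset_index",
        "label": "Number of assets owned",
        "source_variables": "has_radio has_bicycle has_phone",
        "rule": "sum of the three ownership indicators",
    },
]


def write_codebook(entries, handle):
    """Write the codebook entries as CSV to an open text handle."""
    writer = csv.DictWriter(handle, fieldnames=CODEBOOK_FIELDS)
    writer.writeheader()
    writer.writerows(entries)


# Written to an in-memory buffer here; a project would write to a file instead.
buffer = io.StringIO()
write_codebook(codebook, buffer)
codebook_csv = buffer.getvalue()
```

Recording the rule next to each variable name gives reviewers a single place to check how an indicator was defined.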
Before beginning the process of variable construction, the research team must plan for the following aspects:
- Final indicators needed to answer a research question.
- Definitions and calculations for each indicator.
- Steps to perform the calculations.
- Coordination to ensure that each of these aspects is uniform across rounds of data collection.
Ideally, variable construction should be done right after data cleaning, as part of the pre-analysis plan (PAP). As the research team analyzes the data, it will need different constructed variables, subsets of the dataset, and other alterations to the data.
Workflow
Creating new variables
Addressing outliers
Standardizing units
Creating aggregate measures
Merging datasets
Common tasks
Dealing with outliers
While there are many rules of thumb for how to define an outlier, there is no silver bullet. Some consider an outlier to be any value that lies more than three standard deviations from the mean of that variable across all observations. This may be a starting point, but one should consider qualitatively whether it is the right approach for the data at hand. Approaches to outliers include, but are not limited to:
- Replacing the outlier values with a missing value.
- Winsorization, or replacing any value above a certain percentile, often the 99th, with the value at that percentile. This prevents very large values from biasing the mean. It also reflects a concern for how evenly impacts are distributed. For example, if all project benefits go to a single observation in the treatment group, the unadjusted mean would still be high, even though such a concentrated result is rarely a desired outcome in development. Winsorization caps that observation's contribution and thus penalizes an inequitable distribution of the benefits of a project.
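The approaches listed above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the function names are hypothetical, the percentile uses the nearest-rank definition (statistical packages often interpolate instead), only the upper tail is winsorized, and the income values are made up.

```python
import math
import statistics


def flag_outliers_sd(values, n_sd=3.0):
    """Flag values more than n_sd sample standard deviations from the mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [abs(v - mean) > n_sd * sd for v in values]


def set_outliers_missing(values, n_sd=3.0):
    """Replace flagged outliers with None, used here as the missing value."""
    flags = flag_outliers_sd(values, n_sd)
    return [None if flagged else v for v, flagged in zip(values, flags)]


def percentile_nearest_rank(values, pct):
    """Nearest-rank percentile: the smallest value with pct% at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def winsorize_upper(values, pct=99):
    """Cap every value above the pct-th percentile at that percentile."""
    cap = percentile_nearest_rank(values, pct)
    return [min(v, cap) for v in values]


# Made-up incomes: 99 ordinary values plus one extreme value.
incomes = list(range(1, 100)) + [10000]

flags = flag_outliers_sd(incomes)
with_missing = set_outliers_missing(incomes)
winsorized = winsorize_upper(incomes, pct=99)
```

In this made-up data the extreme value pulls the raw mean well above the typical income, and winsorizing brings the mean back toward the bulk of the distribution without dropping the observation. Whether to cap, drop, or keep such a value remains a judgment call for the research team.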