Variable Construction
Variable construction is part of data work, which also involves de-identification, data cleaning, and data analysis. Variable construction involves processing cleaned data to make the data points more suitable for analysis. This is the stage where survey questions are converted into measurable indicators by creating dummy variables, index variables, and interaction variables.
Read First
- Variable construction is a part of the data work process. The other stages are de-identification, data cleaning, and data analysis.
- Each stage in the data work process has well-defined inputs and outputs.
- For each stage, there should be a code folder and a corresponding dataset.
- The names of code files, datasets and outputs for each stage should be consistent.
- The code files, data and outputs of each of these stages should go through at least one round of code review.
Overview
Variable construction uses inputs in the form of one or more clean data tables and master datasets, and creates outputs like one or more analysis data tables, one codebook for each analysis data table, and construction documentation. Note that the research team must carefully document how each variable is constructed, to ensure that the analysis is reproducible.
Before beginning the process of variable construction, the research team must plan for the following aspects:
- Final indicators needed to answer a research question.
- Definitions and calculations for each indicator.
- Steps to perform the calculations.
- Coordinating to ensure each of these aspects is uniform across all rounds of data collection.
Ideally, variable construction should be done right after data cleaning, as part of the pre-analysis plan (PAP). As the research team goes about the task of analyzing data, they will need to use different constructed variables, subsets of a dataset, as well as other alterations to the data.
Workflow
Before we list the steps that are part of the variable construction workflow, it is important to keep in mind that construction should be a separate task from analysis for the following major reasons:
- Maintainability. This makes the process of constructing variables more easily replicable. That is, if a code file cleans and constructs variables from the raw data to create a final variable, then any edits to this file are easily replicated in all analysis code files that use the same final variable.
- Preventing errors. Errors are also less likely when the tasks of construction and analysis are kept separate, because different analysis code files are then less likely to use different versions of the same final variable.
Therefore, performing all variable construction and data transformation in a unified code file that is separate from the analysis code file ensures consistency across different outputs (including graphs and tables).
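The separation described above can be illustrated with a minimal sketch, assuming a hypothetical cleaned dataset with `farm_income` and `wage_income` fields. All function and field names here are illustrative, not part of any prescribed workflow. The constructed variable `total_income` is defined in exactly one place, and both analysis functions read it from the analysis table rather than recomputing it.

```python
def construct_analysis_table(clean_rows):
    """Single place where constructed variables are defined."""
    table = []
    for row in clean_rows:
        new_row = dict(row)  # copy so the cleaned data is left untouched
        # Hypothetical constructed variable: total income across two sources.
        new_row["total_income"] = row["farm_income"] + row["wage_income"]
        table.append(new_row)
    return table


def mean_income(analysis_table):
    """Analysis output 1: mean of the constructed variable."""
    return sum(r["total_income"] for r in analysis_table) / len(analysis_table)


def share_above(analysis_table, threshold):
    """Analysis output 2: share of observations above a threshold."""
    above = sum(r["total_income"] > threshold for r in analysis_table)
    return above / len(analysis_table)


# Hypothetical cleaned records, for illustration only.
clean_rows = [
    {"hh_id": 1, "farm_income": 100, "wage_income": 50},
    {"hh_id": 2, "farm_income": 0, "wage_income": 250},
]
analysis_table = construct_analysis_table(clean_rows)
```

If the definition of `total_income` changes, only `construct_analysis_table` needs to be edited, and every analysis output picks up the same revised variable.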
The workflow for construction involves the following steps, each of which is discussed in detail in the sections below:
- Creating new variables
- Addressing outliers
- Standardizing units
- Creating aggregate measures
- Merging datasets
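As an illustration of the first step, creating new variables, the following is a minimal sketch of how a dummy variable, an index variable, and an interaction variable might be built from cleaned survey records. The records and every field name (`owns_land`, `owns_radio`, `treatment`, and so on) are hypothetical, and the index shown is a simple unweighted count; a real project would follow the definitions agreed on by the research team.

```python
# Hypothetical cleaned survey records; all field names are illustrative.
clean_rows = [
    {"hh_id": 1, "treatment": 1, "owns_land": "yes",
     "owns_radio": 1, "owns_bike": 0, "owns_phone": 1},
    {"hh_id": 2, "treatment": 0, "owns_land": "no",
     "owns_radio": 0, "owns_bike": 0, "owns_phone": 1},
    {"hh_id": 3, "treatment": 1, "owns_land": "no",
     "owns_radio": 1, "owns_bike": 1, "owns_phone": 1},
]

ASSET_FIELDS = ["owns_radio", "owns_bike", "owns_phone"]


def construct(row):
    """Return a copy of one cleaned record with constructed variables added."""
    out = dict(row)
    # Dummy variable: 1 if the household reports owning land, else 0.
    out["land_dummy"] = 1 if row["owns_land"] == "yes" else 0
    # Index variable: unweighted count of assets owned.
    out["asset_index"] = sum(row[field] for field in ASSET_FIELDS)
    # Interaction variable: treatment status times land ownership.
    out["treat_x_land"] = row["treatment"] * out["land_dummy"]
    return out


analysis_rows = [construct(row) for row in clean_rows]
```

This sketch assumes the cleaned data has no missing values in the fields it uses; how missing responses should enter each indicator is a definitional choice that belongs in the construction documentation.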
Common tasks
Dealing with outliers
While there are many rules of thumb for how to define an outlier, there is no silver bullet. Some consider an outlier to be any value that is more than three standard deviations away from the mean of the same variable across all observations. This may be a starting point, but one needs to consider qualitatively whether it is the correct approach for the data at hand. Approaches to outliers include, but are not limited to:
- Replacing the outlier values with a missing value.
- Winsorization, or replacing any values bigger than a certain percentile, often the 99th, with the value at that percentile. This prevents very large values from biasing the mean. It also builds a concern for how evenly impact is distributed into the measure. For example, if all project benefits go to a single observation in the treatment group, the unadjusted mean would still be high, but that is rarely a desired outcome in development. Winsorization caps that observation's contribution to the mean, and thus penalizes inequitable distribution of the benefits of a project.
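The rule of thumb and the two approaches above can be sketched as follows. This is a minimal illustration using only the Python standard library, not a recommended implementation: the function names are hypothetical, the standard deviation is the sample standard deviation, and the percentile uses the nearest-rank definition, which is one of several conventions.

```python
import statistics


def flag_outliers(values, n_sd=3.0):
    """Flag values more than n_sd standard deviations from the mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation
    return [abs(v - mean) > n_sd * sd for v in values]


def set_outliers_missing(values, n_sd=3.0):
    """Replace flagged outliers with None, used here as the missing value."""
    flags = flag_outliers(values, n_sd)
    return [None if is_outlier else v for v, is_outlier in zip(values, flags)]


def winsorize_upper(values, pct=99):
    """Cap values above the pct-th percentile at the value of that percentile."""
    ordered = sorted(values)
    # Nearest-rank percentile: ceil(pct / 100 * n), computed via floor division.
    rank = max(1, int(-(-pct * len(ordered) // 100)))
    cap = ordered[rank - 1]
    return [min(v, cap) for v in values]


# Hypothetical data: nineteen ordinary values and one very large one.
incomes = list(range(1, 20)) + [1000]
flags = flag_outliers(incomes)
with_missing = set_outliers_missing(incomes)
winsorized = winsorize_upper(incomes, pct=95)
```

With only 20 observations, the 99th nearest-rank percentile is the maximum itself, so the example winsorizes at the 95th percentile instead. In small samples the three-standard-deviation rule can also fail to flag anything, because a single extreme value inflates the standard deviation it is compared against, which is one reason the rule should be treated as a starting point only.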