Variable Construction

Variable construction is part of data work, which also involves de-identification, data cleaning, and data analysis. It involves processing cleaned data to make the data points more suitable for analysis. At this stage, survey questions are converted into measurable indicators, for example by creating dummy variables, index variables, and interaction variables.
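
To make the three kinds of constructed variables concrete, here is a minimal sketch in Python. The records, the field names (`owns_land`, `treatment`, `has_radio`, and so on), and the `construct` helper are all hypothetical and chosen only for illustration. The index shown is a simple count of assets owned, which is one of several possible ways to build an index and is an assumption of this sketch.

```python
# Hypothetical cleaned survey records (illustrative values, not real data).
households = [
    {"hh_id": 1, "owns_land": "yes", "treatment": 1,
     "has_radio": 1, "has_bicycle": 0, "has_phone": 1},
    {"hh_id": 2, "owns_land": "no", "treatment": 1,
     "has_radio": 0, "has_bicycle": 0, "has_phone": 1},
    {"hh_id": 3, "owns_land": "yes", "treatment": 0,
     "has_radio": 1, "has_bicycle": 1, "has_phone": 1},
]

ASSET_FIELDS = ["has_radio", "has_bicycle", "has_phone"]


def construct(row):
    """Return a copy of the record with constructed variables added."""
    out = dict(row)
    # Dummy variable: recode a yes/no survey answer as 1/0.
    out["land_dummy"] = 1 if row["owns_land"] == "yes" else 0
    # Index variable: here, a simple count of assets owned (an assumption).
    out["asset_index"] = sum(row[field] for field in ASSET_FIELDS)
    # Interaction variable: product of treatment status and the dummy.
    out["treat_x_land"] = row["treatment"] * out["land_dummy"]
    return out


# The constructed dataset keeps the cleaned fields and adds the new ones.
constructed = [construct(row) for row in households]
```

Keeping the cleaned fields alongside the constructed ones makes it easier to check each new variable against the survey answers it was built from.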

Read First

  • Variable construction is a part of the data work process. The other stages are de-identification, data cleaning, and data analysis.
  • Each stage in the data work process has well-defined inputs and outputs.
  • For each stage, there should be a code folder and a corresponding dataset.
  • The names of code files, datasets, and outputs for each stage should be consistent.
  • The code files, data, and outputs of each of these stages should go through at least one round of code review.

Overview

Workflow

Common tasks

Dealing with outliers

While there are many rules of thumb for how to define an outlier, there is no silver bullet. One common rule treats a value as an outlier if it lies more than three standard deviations from the mean of that variable across all observations. This can be a starting point, but one needs to consider qualitatively whether it is the right approach for the variable in question. Approaches to outliers include, but are not limited to:

  1. Replacing the outlier values with a missing value.
  2. Winsorization: replacing any value above a certain percentile, often the 99th, with the value at that percentile. This prevents very large values from dominating the mean. It also keeps the measure sensitive to how equally impact is distributed. For example, if all project benefits go to a single observation in the treatment group, the unadjusted treatment-group mean would still be high, even though such a concentration of benefits is rarely a desired outcome in development. Winsorization thus penalizes an inequitable distribution of project benefits.
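
The three-standard-deviation rule and the two approaches above can be sketched in Python as follows. This is a minimal illustration, not a recommended procedure: the `incomes` values and the function names are hypothetical, the percentile uses a simple nearest-rank definition (statistical packages may interpolate differently), and the example winsorizes at the 95th percentile only because a 20-observation sample is too small for the 99th percentile to cap anything.

```python
from statistics import mean, stdev

# Hypothetical variable: 19 ordinary values and one very large one.
incomes = list(range(100, 119)) + [10000]


def flag_outliers(values, n_sd=3):
    """Flag values more than n_sd sample standard deviations from the mean."""
    center = mean(values)
    spread = stdev(values)
    return [abs(v - center) > n_sd * spread for v in values]


def replace_with_missing(values, flags):
    """Approach 1: replace flagged values with a missing value (None)."""
    return [None if flagged else v for v, flagged in zip(values, flags)]


def winsorize_upper(values, pct=99):
    """Approach 2: cap values above the pct-th percentile at that percentile.

    pct is an integer percentage; the percentile is the nearest-rank value,
    computed with integer arithmetic to avoid floating-point rounding.
    """
    ordered = sorted(values)
    rank = max(1, (pct * len(ordered) + 99) // 100)  # ceil(pct * n / 100)
    cap = ordered[rank - 1]
    return [min(v, cap) for v in values]


flags = flag_outliers(incomes)
with_missing = replace_with_missing(incomes, flags)
winsorized = winsorize_upper(incomes, pct=95)

# Compare the mean before and after winsorizing.
mean_raw = mean(incomes)
mean_winsorized = mean(winsorized)
```

Replacing with a missing value drops the observation from any calculation that skips missing values, while winsorization keeps the observation and only limits its influence. Which is appropriate depends on whether the extreme value is believed to be an error or a genuine but unusually large observation.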

Documentation