Variable Construction

Variable construction is part of data work, which also involves de-identification, data cleaning, and data analysis. Variable construction involves processing cleaned data to make the data points more suitable for analysis. This is the stage where survey questions are converted into measurable indicators by creating dummy variables, index variables, and interaction variables.

Read First

  • Variable construction is a part of the data work process. The other stages are de-identification, data cleaning, and data analysis.
  • Each stage in the data work process has well-defined inputs and outputs.
  • For each stage, there should be a code folder and a corresponding dataset.
  • The names of code files, datasets and outputs for each stage should be consistent.
  • The code files, data and outputs of each of these stages should go through at least one round of code review.

Overview

Variable construction uses inputs in the form of one or more clean data tables and master datasets, and creates outputs like one or more analysis data tables, one codebook for each analysis data table, and construction documentation. Note that the research team must carefully document how each variable is constructed, to ensure that the analysis is reproducible.

Before beginning the process of variable construction, the research team must plan for the following aspects:

  • Final indicators needed to answer a research question.
  • Definitions and calculations for each indicator.
  • Steps to perform the calculations.
  • Coordination to ensure that each of these aspects is uniform across rounds of data collection.

Ideally, variable construction should be done right after data cleaning, as part of the pre-analysis plan (PAP). As the research team goes about the task of analyzing data, they will need to use different constructed variables, subsets of a dataset, as well as other alterations to the data.

Workflow

Before we list the steps that are part of the variable construction workflow, it is important to keep in mind that construction should be a separate task from analysis for the following major reasons:

  • Maintainability. Keeping construction separate makes the process of constructing variables easier to replicate. That is, if a single code file constructs a final variable from the cleaned data, then any edit to that file automatically carries over to every analysis code file that uses the same final variable.
  • Preventing errors. The chances of errors are also lower when construction and analysis are kept separate, because it is less likely that different analysis code files use different versions of the same final variable.

Therefore, performing all variable construction and data transformation in a unified code file that is separate from the analysis code file ensures consistency across different outputs (including graphs and tables).

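The workflow for construction involves the following steps, each of which is discussed in detail in the sections below: creating new variables, addressing outliers, standardizing units, creating aggregate indicators, and merging datasets.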

Creating new variables

The first part of variable construction is creating new variables. Keep the following points in mind about creating variables:

  • Create new variables. Do not overwrite original information - in fact, the original information must be left completely unchanged so different members of a team can compare the newly created variables with the original information if needed.
  • Provide functional names to the constructed variables. The names should be intuitive, that is, anyone who is going through the dataset should be able to broadly understand the purpose of the constructed variable.
  • Order related variables close to each other. This makes it easier to use constructed variables during analysis.
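The three points above can be sketched in a few lines. This is a minimal illustration in plain Python rather than Stata or R; the records, the field names, and the 2-hectare cutoff are all hypothetical and not taken from any real survey.

```python
# Hypothetical cleaned survey records; field names are illustrative only.
households = [
    {"hh_id": 101, "owns_land": "Yes", "plot_area_ha": 1.5},
    {"hh_id": 102, "owns_land": "No", "plot_area_ha": 0.0},
    {"hh_id": 103, "owns_land": "Yes", "plot_area_ha": 4.0},
]

# Each constructed variable sits right after the original it is derived from.
COLUMN_ORDER = [
    "hh_id",
    "owns_land", "owns_land_dummy",
    "plot_area_ha", "large_plot_dummy",
]

LARGE_PLOT_CUTOFF_HA = 2.0  # assumed threshold, for illustration only


def construct_land_variables(record):
    """Return a new record with constructed variables added."""
    values = dict(record)  # work on a copy so the original record is untouched
    values["owns_land_dummy"] = 1 if record["owns_land"] == "Yes" else 0
    values["large_plot_dummy"] = (
        1 if record["plot_area_ha"] >= LARGE_PLOT_CUTOFF_HA else 0
    )
    # Rebuild the record in the documented column order.
    return {name: values[name] for name in COLUMN_ORDER}


analysis_data = [construct_land_variables(r) for r in households]
```

Because the function returns a copy, the original fields remain available alongside the constructed ones for comparison.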

Addressing outliers

In statistical terms, some define an outlier as a value that lies more than three standard deviations above or below the sample mean of that characteristic. However, in general, the research team should discuss the following for every dataset they deal with:

  • Definition. That is, defining the criteria to label a data point as an outlier for every dataset.
  • Resolution. That is, dealing with, or addressing outliers.

Two common approaches to addressing outliers include:

  • Replacement. This approach involves replacing the outliers with a missing value.
  • Winsorization. This approach involves replacing any value larger than a certain percentile, often the 99th, with the value at that percentile. This prevents a few very large values from inflating, and so biasing, the mean. It also guards against results that are driven by a handful of observations. For example, if all the measured benefits in an impact evaluation accrue to a single observation in the treatment group, that is not a desirable outcome even if the sample mean is high.

However, no matter what method the research team uses to address outliers, it is important to keep the following points in mind:

  • Document the chosen method. Clearly document the approach that was used to deal with outliers, as well as the reasons for choosing that particular approach.
  • Keep original variables. Dealing with outliers can affect the distribution of the variable, as well as the final results. Therefore, ensure that the original variable is not replaced.
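As a rough sketch of the two approaches, the following plain Python example applies replacement and winsorization to a made-up income variable while leaving the original list untouched. The data are invented, the helper names are hypothetical, and the nearest-rank percentile rule is one convention among several; statistical packages may compute percentiles differently. The 90th percentile is used only because the example has ten observations.

```python
import math


def percentile_nearest_rank(values, pct):
    """Percentile by the nearest-rank rule (one of several conventions)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def winsorize_upper(values, pct):
    """Return a new list with values above the cutoff set to the cutoff."""
    cutoff = percentile_nearest_rank(values, pct)
    return [min(v, cutoff) for v in values]


def replace_upper_with_missing(values, pct):
    """Return a new list with values above the cutoff set to None (missing)."""
    cutoff = percentile_nearest_rank(values, pct)
    return [v if v <= cutoff else None for v in values]


# Invented data: nine plausible values and one extreme value.
income = [120, 150, 130, 140, 160, 155, 145, 135, 125, 5000]

# New variables with names that record the method and the percentile used.
income_w90 = winsorize_upper(income, 90)             # winsorized
income_m90 = replace_upper_with_missing(income, 90)  # replaced with missing
```

Storing the results under new names keeps `income` available, so the effect of each method on the distribution can be compared and documented.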

Standardizing units

Standardizing units refers to making sure that the constructed variables are measured in consistent units. The approach depends on whether the variables are Yes/No questions, categorical variables, or numeric variables.

  • Yes/No questions: For such questions, the best way to standardize is to code them as "Yes" = 1 and "No" = 0. This makes it easy to treat them numerically, both as frequencies when taking means and as dummy variables in regressions.
  • Categorical variables with more than 2 categories: For such variables, first assign each category a numeric value like 1, 2, 3, etc. Then check that labels correspond to the same numeric values for all variables that use the same categories.
  • Numeric variables: Often such variables need to be compared with other numeric variables, or aggregated to form statistical indicators. In such cases, convert them to the same scale or unit of measurement. One reliable way of doing this in a replicable manner is to specify the conversion rates once, in the master do-file, using global macros.
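The idea of defining codes and conversion rates once and reusing them everywhere can be sketched as follows. This is plain Python, not a Stata do-file: the module-level dictionaries play the role that global macros in a master do-file would, and the records and field names are hypothetical.

```python
# Defined once, in one place, and reused by every construction step.
KG_PER_UNIT = {"kg": 1.0, "g": 0.001, "tonne": 1000.0}
YES_NO_CODES = {"Yes": 1, "No": 0}

# Hypothetical harvest records reported in mixed units.
harvests = [
    {"plot_id": 1, "harvest_qty": 250, "harvest_unit": "kg", "irrigated": "Yes"},
    {"plot_id": 2, "harvest_qty": 2, "harvest_unit": "tonne", "irrigated": "No"},
    {"plot_id": 3, "harvest_qty": 500000, "harvest_unit": "g", "irrigated": "Yes"},
]


def standardize(record):
    """Return a copy with harvest in kilograms and Yes/No coded as 1/0."""
    out = dict(record)
    # A unit or answer outside the lookup tables raises KeyError, so
    # unexpected values stop the run instead of passing through silently.
    out["harvest_kg"] = record["harvest_qty"] * KG_PER_UNIT[record["harvest_unit"]]
    out["irrigated_dummy"] = YES_NO_CODES[record["irrigated"]]
    return out


standardized = [standardize(h) for h in harvests]
```

Keeping the conversion rates in a single lookup means that a correction to one rate propagates to every variable that uses it.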

Creating aggregate indicators

The simplest case of variable construction is an aggregate indicator, for example, aggregate yield for a farmer who grows wheat, rice, and maize. While this process can seem straightforward, it is important to keep the following issues in mind:

  • Labels and scales of measurement: Double-check value labels and scales of measurement for each of the variables that are being used to construct new variables.
  • Distribution of original and aggregate variables: Compare the distribution of the original and constructed variables to ensure that creating new variables did not alter the distribution of the original variable.
  • Missing values: Always document how missing values are treated, and make sure that aggregating variables has not affected the observations which had missing values.
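A small example makes the missing-value point concrete. The sketch below, in plain Python, totals three crop harvests that are assumed to be already in kilograms. The rule it applies (the total is missing whenever any component is missing) is an assumption chosen for illustration; a team might reasonably choose a different rule, as long as it is documented. All names and data are hypothetical.

```python
CROP_VARS = ["wheat_kg", "rice_kg", "maize_kg"]

# Hypothetical farmer records; None marks a missing value.
farmers = [
    {"farmer_id": 1, "wheat_kg": 400.0, "rice_kg": 250.0, "maize_kg": 100.0},
    {"farmer_id": 2, "wheat_kg": 300.0, "rice_kg": None, "maize_kg": 50.0},
    {"farmer_id": 3, "wheat_kg": None, "rice_kg": None, "maize_kg": None},
]


def add_total_harvest(record):
    """Return a copy with the aggregate and a count of missing components."""
    components = [record[name] for name in CROP_VARS]
    n_missing = sum(1 for value in components if value is None)
    out = dict(record)
    out["harvest_n_missing"] = n_missing
    # Documented rule (an assumption): any missing component makes the
    # total missing, so missing values are never silently treated as zero.
    out["total_harvest_kg"] = None if n_missing > 0 else sum(components)
    return out


constructed = [add_total_harvest(f) for f in farmers]
```

The `harvest_n_missing` variable records why a total is missing, which makes the treatment of missing values easy to audit later.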

Merging datasets

One of the steps involved in constructing variables is merging datasets to combine data from multiple sources. In this case, keep the following points in mind:

  • Identifiers: Make sure that the datasets being merged share the same unique identifier.
  • Conflicts: When the same variable has different values in the two datasets, R creates two different variables, while Stata keeps the values from the master dataset by default. If you want to use the values from the other dataset in Stata, use the update and replace options with merge.
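To make the identifier and conflict points concrete, here is a minimal one-to-one merge written in plain Python. It is a sketch of the general idea only and does not reproduce the exact behavior of merge in Stata or R: it keeps only the rows of the master dataset, and the `prefer_using` flag is a hypothetical parameter that loosely mirrors the choice between keeping master values and taking non-missing values from the other dataset.

```python
def assert_unique_ids(rows, id_var):
    """Stop if the identifier is missing or duplicated."""
    ids = [row.get(id_var) for row in rows]
    if any(value is None for value in ids):
        raise ValueError(f"{id_var} has missing values")
    if len(set(ids)) != len(ids):
        raise ValueError(f"{id_var} does not uniquely identify rows")


def merge_one_to_one(master, using, id_var, prefer_using=False):
    """Add the using dataset's variables to each master row, matched on id_var."""
    assert_unique_ids(master, id_var)
    assert_unique_ids(using, id_var)
    using_by_id = {row[id_var]: row for row in using}
    using_vars = []
    for row in using:
        for name in row:
            if name != id_var and name not in using_vars:
                using_vars.append(name)

    merged = []
    for row in master:
        out = dict(row)
        match = using_by_id.get(row[id_var], {})
        for name in using_vars:
            value = match.get(name)
            if name not in out:
                out[name] = value  # new variable; None if unmatched
            elif prefer_using and value is not None:
                out[name] = value  # conflict resolved in favor of using data
        out["matched"] = row[id_var] in using_by_id
        merged.append(out)
    return merged


# Hypothetical data: hh_size conflicts for household 1.
master = [
    {"hh_id": 1, "village": "A", "hh_size": 4},
    {"hh_id": 2, "village": "B", "hh_size": 6},
]
using = [{"hh_id": 1, "hh_size": 5, "distance_km": 2.5}]

keep_master = merge_one_to_one(master, using, "hh_id")
take_using = merge_one_to_one(master, using, "hh_id", prefer_using=True)
```

The `matched` flag makes it easy to confirm how many observations merged before any conflicting values are used in analysis.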

Preventing Mistakes

Before we look at ways to prevent common mistakes, it is important to understand where things can go wrong in variable construction, such as:

  • Merging, reshaping, collapsing: These operations can create missing entries or change the number of observations. Make sure that you understand how each command that you use treats missing values.
  • Subsetting: This refers to creating a subset of a dataset. Drop observations explicitly, and document why you are dropping these observations. Also document how the dataset changed.
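Explicit, documented subsetting can be sketched as below. This is a plain Python illustration with invented data; `drop_where` and the log format are hypothetical helpers, not part of any library.

```python
# Hypothetical respondent records.
respondents = [
    {"resp_id": 1, "age": 34, "consented": 1},
    {"resp_id": 2, "age": 17, "consented": 1},
    {"resp_id": 3, "age": 52, "consented": 0},
    {"resp_id": 4, "age": 45, "consented": 1},
]


def drop_where(rows, condition, reason, log):
    """Drop rows matching the condition and record how the data changed."""
    kept = [row for row in rows if not condition(row)]
    log.append({
        "reason": reason,
        "n_before": len(rows),
        "n_dropped": len(rows) - len(kept),
        "n_after": len(kept),
    })
    return kept


drop_log = []
sample = drop_where(respondents, lambda r: r["consented"] == 0,
                    "Did not consent", drop_log)
sample = drop_where(sample, lambda r: r["age"] < 18,
                    "Under 18 years old", drop_log)
```

Each entry in `drop_log` states why observations were dropped and how the number of observations changed, which can be exported alongside the construction documentation.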

In order to address these challenges, DIME Analytics recommends the following steps:

  • Write pseudo-code. Describe the steps for creating the new variable in simple language on a piece of paper. Refine the sub-steps involved in the process. Think about possible errors at every step.
  • Think about expected results. Think about how each command you use will treat missing values. Ask yourself the following questions:
    • Will all observations merge?
    • Will the number of observations change?
    • Will the command create missing values?
  • Document observed results. Carefully explore the actual results from a command. Note down the results using comments. Add comments in case there are unexpected results.
  • Build checks into your code. Test the unit of observation and the ID variable for duplicates and missing values. Include error messages or break the code if results do not match what is expected. Use assert in Stata and stopifnot in R. Please see below for examples of using code to document observed results, and performing in-built checks.

Fig. 1: Code to document observed results
Fig. 2: Code to perform in-built checks
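As a rough illustration of both ideas, the following plain Python sketch checks the ID variable before and after a collapse and records the observed result in a comment. It is written in Python rather than Stata or R, so explicit exceptions stand in for assert and stopifnot; the function name, data, and expected counts are hypothetical.

```python
def check_ids(rows, id_var, expected_n):
    """Stop with an error message if the data are not as expected."""
    ids = [row.get(id_var) for row in rows]
    if any(value is None for value in ids):
        raise ValueError(f"{id_var} has missing values")
    if len(set(ids)) != len(ids):
        raise ValueError(f"{id_var} does not uniquely identify observations")
    if len(rows) != expected_n:
        raise ValueError(f"expected {expected_n} observations, found {len(rows)}")


# Hypothetical plot-level data: the unit of observation is the plot.
plots = [
    {"plot_id": 1, "hh_id": 10},
    {"plot_id": 2, "hh_id": 10},
    {"plot_id": 3, "hh_id": 11},
]
check_ids(plots, "plot_id", expected_n=3)

# Collapse to household level by counting plots per household.
hh_ids = sorted({row["hh_id"] for row in plots})
hh_level = [
    {"hh_id": hh, "n_plots": sum(1 for row in plots if row["hh_id"] == hh)}
    for hh in hh_ids
]

# Observed result: 3 plots collapse to 2 households, as expected.
# The check below breaks the code if a later data update changes this.
check_ids(hh_level, "hh_id", expected_n=2)
```

If a new round of data adds or duplicates households, the final check fails loudly, prompting the team to update both the code and the comment that documents the observed result.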

Related Pages

Additional Resources