Variable Construction
Revision as of 19:34, 5 February 2021
Variable construction is part of data work, which also involves de-identification, data cleaning, and data analysis. Variable construction involves processing cleaned data to make the data points more suitable for analysis. This is the stage where survey questions are converted into measurable indicators by creating dummy variables, index variables, and interaction variables.
Read First
- Variable construction is a part of the data work process. The other stages are de-identification, data cleaning, and data analysis.
- Each stage in the data work process has well-defined inputs and outputs.
- For each stage, there should be a code folder and a corresponding dataset.
- The names of code files, datasets and outputs for each stage should be consistent.
- The code files, data and outputs of each of these stages should go through at least one round of code review.
Overview
Variable construction uses inputs in the form of one or more clean data tables and master datasets, and creates outputs like one or more analysis data tables, one codebook for each analysis data table, and construction documentation. Note that the research team must carefully document how each variable is constructed, to ensure that the analysis is reproducible.
Before beginning the process of variable construction, the research team must plan for the following aspects:
- Final indicators needed to answer a research question.
- Definitions and calculations for each indicator.
- Steps to perform the calculations.
- Coordination to ensure that each of these aspects is uniform across all rounds of data collection.
Ideally, variable construction should be done right after data cleaning, as part of the pre-analysis plan (PAP). As the research team goes about the task of analyzing data, they will need to use different constructed variables, subsets of a dataset, as well as other alterations to the data.
Workflow
Before we list the steps that are part of the variable construction workflow, it is important to keep in mind that construction should be a separate task from analysis for the following major reasons:
- Maintainability. Keeping construction separate makes the process of constructing variables more easily replicable. That is, if a single code file constructs a final variable from the cleaned data, then any edit to this file automatically carries over to all analysis code files that use the same final variable.
- Preventing errors. The chances of errors are also lower if we keep the tasks of construction and analysis separate, because it is less likely that different analysis code files use different versions of the same final variable.
Therefore, performing all variable construction and data transformation in a unified code file that is separate from the analysis code file ensures consistency across different outputs (including graphs and tables).
The workflow for construction involves the following steps, each of which is discussed in detail in the sections below:
Creating new variables
The first part of variable construction is creating new variables. Keep the following points in mind about creating variables:
- Create new variables. Do not overwrite original information - in fact, the original information must be left completely unchanged so different members of a team can compare the newly created variables with the original information if needed.
- Provide functional names to the constructed variables. The names should be intuitive, that is, anyone who is going through the dataset should be able to broadly understand the purpose of the constructed variable.
- Order related variables close to each other. This makes it easier to use constructed variables during analysis.
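The points above can be illustrated with a minimal sketch. It uses Python with plain dictionaries purely for illustration; the variable names (`crop_income`, `wage_income`, `total_income`, `has_wage_income`) and the data are hypothetical and do not come from any real survey.

```python
# Hypothetical cleaned survey records; names and values are illustrative only.
clean_rows = [
    {"hh_id": 1, "crop_income": 1200.0, "wage_income": 300.0},
    {"hh_id": 2, "crop_income": 0.0, "wage_income": 950.0},
]


def construct_income_vars(row):
    """Return a copy of the row with constructed variables added."""
    out = dict(row)  # copy, so the cleaned record itself is never modified
    # Descriptive names, inserted right after the variables they are built from.
    out["total_income"] = row["crop_income"] + row["wage_income"]
    out["has_wage_income"] = 1 if row["wage_income"] > 0 else 0
    return out


analysis_rows = [construct_income_vars(r) for r in clean_rows]
```

Because each record is copied before anything is added, `clean_rows` stays unchanged and can still be compared against `analysis_rows`.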
Addressing outliers
In statistical terms, some define an outlier as a value of a particular characteristic that lies more than three standard deviations above or below the sample mean of that characteristic. However, in general, the research team should discuss the following for every dataset they deal with:
- Definition. That is, defining the criteria to label a data point as an outlier for every dataset.
- Resolution. That is, dealing with, or addressing outliers.
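As one possible way to make a definition explicit, the three-standard-deviation rule mentioned above can be sketched as follows. This is a Python illustration, not a recommended threshold; the function name `flag_outliers` and the parameter `n_sd` are assumptions made for the example. It only flags values and leaves the question of resolution open.

```python
import statistics


def flag_outliers(values, n_sd=3.0):
    """Return 1 for values more than n_sd sample standard deviations
    away from the sample mean, and 0 otherwise.

    Requires at least two values, because the sample standard deviation
    is undefined for a single observation.
    """
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation
    return [1 if abs(v - mean) > n_sd * sd else 0 for v in values]


# Illustrative data: twenty typical values and one extreme value.
incomes = [10.0] * 20 + [1000.0]
flags = flag_outliers(incomes)
```

Storing the result as a separate flag variable lets the team review the flagged observations before deciding how to resolve them.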
Two common approaches to addressing outliers include:
- Replacement. This approach involves replacing the outliers with a missing value.
- Winsorization. This approach involves replacing any value larger than a certain percentile, often the 99th, with the value of the data point at that percentile. This prevents a few very large values from inflating the mean, that is, from biasing it. It also guards against results that are driven by a handful of observations. For example, if all benefits of a project in an impact evaluation study go to a single observation in the treatment group, then it would not be a desired outcome, even if the mean of the sample is high.
However, no matter what method the research team uses to address outliers, it is important to keep the following points in mind:
- Document the chosen method. Clearly document the approach that was used to deal with outliers, as well as the reasons for choosing that particular approach.
- Keep original variables. Dealing with outliers can affect the distribution of the variable, as well as the final results. Therefore, ensure that the original variable is not replaced.
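A minimal Python sketch of upper-tail winsorization that keeps the original variable is shown below. It assumes a nearest-rank definition of the percentile, which is only one of several conventions; statistical packages may compute percentiles differently. The names `winsorize_upper` and `income_w99` are hypothetical.

```python
import math


def percentile_nearest_rank(values, pct):
    """Nearest-rank percentile: the smallest observed value such that
    at least pct percent of the observations are less than or equal to it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def winsorize_upper(values, pct=99):
    """Replace values above the pct-th percentile with that percentile."""
    cap = percentile_nearest_rank(values, pct)
    return [min(v, cap) for v in values]


# Illustrative data: incomes 1..99 plus one extreme value.
rows = [{"hh_id": i, "income": float(i)} for i in range(1, 101)]
rows[-1]["income"] = 10_000.0

capped = winsorize_upper([r["income"] for r in rows], pct=99)
for r, w in zip(rows, capped):
    r["income_w99"] = w  # new variable; "income" itself is left untouched
```

The suffix in `income_w99` records the chosen method in the variable name, which complements, but does not replace, the written documentation.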
Standardizing units
Standardizing units refers to making sure that the units in which the constructed variables are measured are consistent. The appropriate method depends on whether the variables are Yes/No questions, categorical variables, or numeric variables.
- Yes/No questions: For such questions, the best way to standardize is to code them as "Yes" = 1 and "No" = 0. This makes it easier to treat them numerically, both as frequencies when calculating means and as dummy variables in regressions.
- Categorical variables with more than 2 categories: For such variables, first, assign each category to a numeric value like 1, 2, 3, etc. Then check that labels correspond to the same numerical values for all variables that use the same categories.
- Numeric variables: Often such variables need to be compared with other numeric variables, or aggregated to form statistical indicators. In such cases, convert them to the same scale or unit of measurement. One reliable way of doing this in a replicable manner is to specify the conversion rates once, in the master do-file, using global macros.
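The idea of defining codes and conversion rates once and reusing them everywhere can be sketched as follows. The text above refers to global macros in a Stata master do-file; this Python version uses module-level constants as a rough analogue. The units, rates, and function names are assumptions made for illustration.

```python
# Codes and conversion rates defined in exactly one place (illustrative values).
YES_NO_CODES = {"Yes": 1, "No": 0}
KG_PER_UNIT = {"kg": 1.0, "g": 0.001, "tonne": 1000.0}


def code_yes_no(answer):
    """Map a Yes/No answer to 1/0; unknown answers raise a KeyError."""
    return YES_NO_CODES[answer]


def to_kg(quantity, unit):
    """Convert a quantity reported in the given unit to kilograms."""
    return quantity * KG_PER_UNIT[unit]


# Hypothetical harvest reports recorded in different units.
reports = [(2, "tonne"), (350, "kg"), (500, "g")]
harvest_kg = [to_kg(q, u) for q, u in reports]
```

Raising an error on an unexpected answer or unit, rather than silently passing it through, makes inconsistencies visible during construction instead of during analysis.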
Creating aggregate measures
Merging datasets
Preventing Mistakes
Documentation
Related Pages
Additional Resources
- DIME Analytics (World Bank), Data Construction