Variable Construction
Revision as of 17:01, 5 February 2021
Variable construction is part of data work, which also involves de-identification, data cleaning, and data analysis. Variable construction involves processing cleaned data to make the data points more suitable for analysis. This is the stage where survey questions are converted into measurable indicators by creating dummy variables, index variables, and interaction variables.
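As an illustration, the three kinds of constructed variables mentioned above could be built as follows. This is a minimal sketch using only the Python standard library. The field names (owns_land, has_radio, and so on) and the definitions of each indicator are assumptions made for the example, not definitions taken from this page.

```python
# Hypothetical cleaned survey records; every field name is illustrative.
clean_rows = [
    {"hh_id": 1, "owns_land": "yes", "treatment": 1,
     "has_radio": 1, "has_bicycle": 0, "has_phone": 1},
    {"hh_id": 2, "owns_land": "no", "treatment": 1,
     "has_radio": 0, "has_bicycle": 0, "has_phone": 1},
    {"hh_id": 3, "owns_land": "yes", "treatment": 0,
     "has_radio": 1, "has_bicycle": 1, "has_phone": 1},
]

ASSET_ITEMS = ["has_radio", "has_bicycle", "has_phone"]


def construct(row):
    """Return a copy of one cleaned record with constructed indicators added."""
    out = dict(row)
    # Dummy variable: 1 if the household reports owning land, else 0.
    out["land_owner"] = 1 if row["owns_land"] == "yes" else 0
    # Index variable (assumed definition): share of listed assets owned, 0 to 1.
    out["asset_index"] = sum(row[item] for item in ASSET_ITEMS) / len(ASSET_ITEMS)
    # Interaction variable: treatment status multiplied by land ownership.
    out["treat_x_land"] = row["treatment"] * out["land_owner"]
    return out


analysis_rows = [construct(row) for row in clean_rows]
```

The cleaned records are left untouched and the constructed indicators are added to a copy, so the clean data table and the analysis data table remain separate outputs.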
Read First
- Variable construction is a part of the data work process. The other stages are de-identification, data cleaning, and data analysis.
- Each stage in the data work process has well-defined inputs and outputs.
- For each stage, there should be a code folder and a corresponding dataset.
- The names of code files, datasets and outputs for each stage should be consistent.
- The code files, data and outputs of each of these stages should go through at least one round of code review.
Overview
Variable construction uses inputs in the form of one or more clean data tables and master datasets, and creates outputs like one or more analysis data tables, one codebook for each analysis data table, and construction documentation. Note that the research team must carefully document how each variable is constructed, to ensure that the analysis is reproducible.
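One way to keep the codebook in step with the analysis data table is to check that every variable in the table has a codebook entry. The sketch below assumes a codebook stored as a dictionary keyed by variable name. The variable names, the codebook fields, and the helper undocumented_variables are all hypothetical.

```python
# Hypothetical analysis data table; every name here is illustrative.
analysis_rows = [
    {"hh_id": 1, "land_owner": 1, "asset_index": 0.67},
    {"hh_id": 2, "land_owner": 0, "asset_index": 0.33},
]

# Hypothetical codebook: one entry per variable in the analysis data table.
codebook = {
    "hh_id": {
        "label": "Household identifier",
        "construction": "Carried over unchanged from the clean data table",
    },
    "land_owner": {
        "label": "Household owns land (0/1)",
        "construction": "1 if owns_land == 'yes', else 0",
    },
    "asset_index": {
        "label": "Share of listed assets owned (0 to 1)",
        "construction": "Mean of has_radio, has_bicycle, has_phone",
    },
}


def undocumented_variables(rows, codebook):
    """Return the sorted names of variables in the table with no codebook entry."""
    variables = set()
    for row in rows:
        variables.update(row)
    return sorted(variables - set(codebook))


missing = undocumented_variables(analysis_rows, codebook)
if missing:
    raise ValueError(f"Variables missing from the codebook: {missing}")
```

Running a check like this at the end of the construction code makes an undocumented variable fail loudly instead of reaching the analysis stage unnoticed.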
Before beginning the process of variable construction, the research team must plan for the following aspects:
- Final indicators needed to answer a research question.
- Definitions and calculations for each indicator.
- Steps to perform the calculations.
- Coordination to ensure that each of these aspects is uniform across rounds of data collection.
Ideally, variable construction should be done right after data cleaning, as part of the pre-analysis plan (PAP). As the research team goes about the task of analyzing data, they will need to use different constructed variables, subsets of a dataset, as well as other alterations to the data.
Workflow
Before we list the steps that are part of the variable construction workflow, it is important to keep in mind that construction should be a separate task from analysis for the following major reasons:
- Maintainability. Keeping construction separate makes the process of constructing variables easier to replicate. That is, if a single code file cleans and constructs variables from the raw data to create a final variable, then any edit to this file carries over to all analysis code files that use the same final variable.
- Preventing errors. The chances of errors are also lower if we keep the tasks of construction and analysis separate, because different analysis code files are then less likely to use different versions of the same final variable.
Therefore, performing all variable construction and data transformation in a unified code file that is separate from the analysis code file ensures consistency across different outputs (including graphs and tables).
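The separation described above can be sketched as one construction function whose output is shared by every analysis. In this minimal example, the indicator food_secure, its definition, and all function names are assumptions made for illustration.

```python
def construct_analysis_data(clean_rows):
    """The single place where final variables are defined."""
    analysis_rows = []
    for row in clean_rows:
        out = dict(row)
        # Assumed definition: food secure if no meals were skipped last week.
        out["food_secure"] = 1 if row["meals_skipped"] == 0 else 0
        analysis_rows.append(out)
    return analysis_rows


def share_food_secure(analysis_rows):
    """Analysis 1: overall share of food-secure households."""
    return sum(r["food_secure"] for r in analysis_rows) / len(analysis_rows)


def share_food_secure_by_treatment(analysis_rows):
    """Analysis 2: share of food-secure households in each treatment arm."""
    groups = {}
    for r in analysis_rows:
        groups.setdefault(r["treatment"], []).append(r["food_secure"])
    return {arm: sum(values) / len(values) for arm, values in groups.items()}


# Hypothetical cleaned data.
clean_rows = [
    {"hh_id": 1, "treatment": 1, "meals_skipped": 0},
    {"hh_id": 2, "treatment": 1, "meals_skipped": 2},
    {"hh_id": 3, "treatment": 0, "meals_skipped": 0},
    {"hh_id": 4, "treatment": 0, "meals_skipped": 0},
]

# Construction runs once; both analyses read the same constructed variable.
analysis_rows = construct_analysis_data(clean_rows)
overall = share_food_secure(analysis_rows)
by_arm = share_food_secure_by_treatment(analysis_rows)
```

If the definition of food_secure changes, only construct_analysis_data needs editing, and both analyses pick up the new definition on the next run.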
The workflow for construction involves the following steps, each of which is discussed in detail in the sections below: