Difference between revisions of "Variable Construction"
(48 intermediate revisions by the same user not shown) | |||
Line 29: | Line 29: | ||
* '''Create new variables.''' Do not overwrite original information - in fact, the original information must be left completely unchanged so different members of a team can compare the newly created variables with the original information if needed. | * '''Create new variables.''' Do not overwrite original information - in fact, the original information must be left completely unchanged so different members of a team can compare the newly created variables with the original information if needed. | ||
* '''Provide functional names to the constructed variables.''' The names should be intuitive, that is, anyone who is going through the dataset should be able to broadly understand the purpose of the constructed variable. | * '''Provide functional names to the constructed variables.''' The names should be intuitive, that is, anyone who is going through the dataset should be able to broadly understand the purpose of the constructed variable. | ||
* '''Order related variables close to each other.''' This makes it easier to use constructed variables during analysis. </ | * '''Order related variables close to each other.''' This makes it easier to use constructed variables during analysis. </p> | ||
===Addressing outliers=== | ===Addressing outliers=== | ||
In statistical terms, some define an '''outlier''' as the value of a particular characteristic that is three standard deviations more (or less) from the sample mean of that particular characteristic. However, in general, the research team should discuss the following for every dataset they deal with: | |||
* '''Definition.''' That is, defining the criteria to label a data point as an outlier for every dataset. | |||
* '''Resolution.''' That is, dealing with, or '''addressing outliers'''. | |||
Two common approaches to addressing outliers include: | |||
* '''Replacement.''' This approach involves replacing the outliers with a missing value. | |||
* '''Winsorization'''. This approach involves replacing any values bigger than a certain percentile, often the 99th, with the value of the data point at that percentile itself. This prevents very large values from overreporting the mean value, or what is called '''biasing the mean'''. It also ensures that the effects of a project are distributed in a fair manner. For example, if all benefits of an impact evaluation study go to a single observation in the '''treatment group''', then it would not be a desired outcome, even if the mean of the sample is high. | |||
However, no matter what method the research team uses to address outliers, it is important to keep the following points in mind: | |||
* '''Document the chosen method.''' Clearly document the approach that was used to deal with outliers, as well as the reasons for choosing that particular approach. | |||
* '''Keep original variables.''' Dealing with outliers can affect the distribution of the variable, as well as the final results. Therefore, ensure that the original variable is not replaced. </p> | |||
===Standardizing units=== | ===Standardizing units=== | ||
===Creating aggregate | '''Standardizing units''' refers to making sure there is consistency in the units that the constructed variables are measured in. The method for [[Standardization|standardizing]] can involve different methods depending on whether the variables are '''Yes/No questions''', '''categorical variables''', or '''numeric variables'''. | ||
* '''Yes/No questions:''' For such questions, the best way to standardize is to code it as '''"Yes" = 1''' and '''"No" = 0'''. This makes it easier to treat them both numerically as frequencies for means, and as '''dummy variables''' in regressions. | |||
* '''Categorical variables with more than 2 categories:''' For such variables, first, assign each category to a numeric value like 1, 2, 3, etc. Then check that labels correspond to the same numerical values for all variables that use the same categories. | |||
* '''Numeric variables.''' Often such variables need to be compared with other numeric variables, or aggregated to form statistical indicators. In such cases, convert them to the same scale or unit of measurement. One foolproof way of doing this in a '''replicable''' manner is to specify the [[Master_Do-files#Conversion_Rates|conversion rates]] in the [[Master Do-files|master do-file]] using [[Standardization#Globals|global macros]]. </p> | |||
===Creating aggregate indicators=== | |||
The most simple case of '''variable construction''' is '''aggregate indicators''', for example, aggregate yield for a farmer who grows wheat, rice, and maize. While this process can seem straightforward, it is important to keep the following issues in mind: | |||
* '''Labels and scales of measurement:''' Double-check value labels and scales of measurement for each of the variables that are being used to construct new variables. | |||
* '''Distribution of original and aggregate variables:''' Compare the distribution of the original and constructed variables to ensure that creating new variables did not alter the distribution of the original variable. | |||
* '''Missing values:''' Always document how missing values are treated, and make sure that aggregating variables has not affected the observations which had missing values. </p> | |||
===Merging datasets=== | ===Merging datasets=== | ||
One of the steps involved in '''constructing variables''' is '''merging datasets''' to combine data from multiple sources. In this case, keep the following points in mind: | |||
* '''Identifiers:''' Make sure that the merging datasets have the same [[ID Variable Properties#Property 1: Uniquely Identifying|unique identifier]]. | |||
* '''Conflicts:''' In case there is a conflict between the values from different datasets, '''R''' creates two different variables, while '''Stata''' keeps the values from the master dataset by default. If you want to use the values from the other dataset, use <code>update</code> and <code>replace</code> options with <code>merge</code>. | |||
== Preventing Mistakes == | |||
Before we look at ways to prevent common mistakes, it is important to understand where things can go wrong in '''variable construction''', such as: | |||
* '''Merging, reshaping, collapsing:''' Can create missing entries, or change the number of observations. Make sure that you understand how each command that you use treats missing values. | |||
* '''Subsetting:''' This refers to creating a subset of a dataset. Drop observations explicitly, and document why you are dropping these observations. Also document how the dataset changed. | |||
In order to address these challenges, [https://www.worldbank.org/en/research/dime/data-and-analytics DIME Analytics] recommends the following steps: | |||
* '''Write pseudo-code.''' Describe the steps for creating the new variable in simple language on a piece of paper. Refine the sub-steps involved in the process. Think about possible errors at every step. | |||
* '''Think about expected results.''' Think about how each command you use will treat missing values. Ask yourself the following questions: | |||
** Will all observations merge? | |||
** Will the number of observations change? | |||
** Will the command create missing values? | |||
* '''Document observed results.''' Carefully explore the actual results from a command. Note down the results using comments. Add comments in case there are unexpected results. | |||
* '''Build checks into your code.''' Test the [[Unit of Observation|unit of observation]] and the [[ID Variable Properties|ID variable]] for duplicates and missing values. Include error messages or break the code if results do not match what is expected. Use <code>assert</code> in '''Stata''' and <code>stopifnot</code> in '''R'''. Please see below for examples of using code to document observed results, and performing in-built checks. | |||
====Example 1: Documenting observed results.==== | |||
<syntaxhighlight lang="stata" line start="15"> * Merge | |||
use `time_pf', clear | |||
merge 1:1 hhid using`time_pm' | |||
Result # of obs. | |||
---------------------------------- | |||
not matched 511 | |||
from master 503 (_merge==1) | |||
from using 8 (_merge==2) | |||
matched 1,980 (_merge==3) | |||
---------------------------------- | |||
/*** - Household not matched from using - either because they have no female | |||
member of the HH (1493, 1558, 1720, 5437, 5976) , or the female HH member | |||
"Did Not Answer" for age (1151, 1286, 6561). | |||
- Household not matching from master because HHs have no primary male. | |||
***/ | |||
drop _merge | |||
</syntaxhighlight> | |||
====Example 2: Built-in checks in code.==== | |||
<syntaxhighlight lang="stata" line start="60"> | |||
/************************************************** | |||
PART 3: Append rounds | |||
**************************************************/ | |||
cap erase "${panel_doc}/Codebooks/Inter/Household/Reconcile_appended.xlsx" | |||
iecodebook append `baseline' `midline' `endline' /// | |||
using "${panel_doc}/Codebooks/Inter/Household/Reconcile_appended.xlsx", /// | |||
surveys(Baseline Midline Endline) /// | |||
clear | |||
merge 1:1 hhid using "${hh_dt}/hh_master_reconciled" | |||
* Check merge : 6646 is creating problems here | |||
qui count if _merge == 1 | |||
assert r(N) == 1 | |||
/*** Check that HHs that are in Master but are not in the panel: | |||
** Either they were never surveyed, or were surveyed in midline | |||
** Surveys and the data needs to be collapsed to hh level in midline | |||
***/ | |||
assert ! (_merge == 2 & d_surveyed_bl == 1 | d_surveyed_ml == 1 | d_surveyed_el == 1) & /// | |||
!inlist(hhid, 3248, 3249, 4739, 6250, 6463)) | |||
</syntaxhighlight> | |||
== Documentation == | |||
As an output of '''variable construction''', [[Data Documentation#What to Document|documentation]] is as important as the code and the data that creates the new variables. The idea of '''documentation''' is to ensure that someone who is not familiar with the project should also be able to understand the [[Data Analysis|analysis datasets]], and the steps taken to create them. The following are therefore important aspects of documentation: | |||
* '''Document derivations and calculations:''' Constructing new variables often involves changing simple data points to more complex indicators that are harder to measure. Document exactly how each variable is derived or calculated. | |||
* '''Record steps:''' Carefully record how specific variables were combined, recoded, or scaled. Refer to these records in the code to verify if results are as expected. | |||
* '''Establish clear protocols:''' Create protocols for defining, naming, and constructing variables. This will guarantee that indicators are consistent across projects. | |||
== | == Related Pages == | ||
[[Special:WhatLinksHere/Variable_Construction|Click here for pages that link to this topic.]] | |||
== | == Additional Resources == | ||
* DIME Analytics (World Bank), [https://osf.io/k4tr6/ Data Construction] | |||
[[Category: Reproducible Research]] | |||
[[Category:Data Management]] |
Latest revision as of 14:23, 13 April 2021
Variable construction is part of data work, which also involves de-identification, data cleaning, and data analysis. Variable construction involves processing cleaned data to make the data points more suitable for analysis. This is the stage where a survey questions are converted into measurable indicators by creating dummy variables, index variables, and interaction variables.
Read First
- Variable construction is a part of the data work process. The other stages are de-identification, data cleaning, and data analysis.
- Each stage in the data work process has well-defined inputs and outputs
- For each stage, there should be a code folder and a corresponding dataset.
- The names of code files, datasets and outputs for each stage should be consistent.
- The code files, data and outputs of each of these stages should go through at least one round of code review.
Overview
Variable construction uses inputs in the form of one or more clean data tables and master datasets, and creates outputs like one or more analysis data tables, one codebook for each analysis data table, and construction documentation. Note that the research team must carefully document how each variable is constructed, to ensure that the analysis is reproducible.
Before beginning the process of variable construction, the research team must plan for the following aspects:
- Final indicators needed to answer a research question.
- Definitions and calculations for each indicator.
- Steps to perform the calculations.
- Coordinating to ensure each of these aspects are uniform for each round of data collection.
Ideally, variable construction should be done right after data cleaning, as part of the pre-analysis plan (PAP). As the research team goes about the task of analyzing data, they will need to use different constructed variables, subsets of a dataset, as well as other alterations to the data.
Workflow
Before we list the steps that are part of the variable construction workflow, it is important to keep in mind that construction should be a separate task from analysis for the following major reasons:
- Maintainability. This makes the process of constructing variables more easily replicable. That is, if a code file cleans and constructs variables from the raw data to create a final variable, then any edits to this file are easily replicated in all analysis code files that use the same final variable.
- Preventing errors. The chances of errors are also lower if we keep the task of construction and analysis separate, because it is less likely that different analysis code files use different versions of the same final variable.
Therefore, performing all variable construction and data transformation in a unified code file that is separate from the analysis code file ensures consistency across different outputs (including graphs and tables).
The workflow for construction involves the following steps, each of which are discussed in detail in the sections below:
Creating new variables
The first part of variable construction is creating new variables. Keep the following points in mind about creating variables:
- Create new variables. Do not overwrite original information - in fact, the original information must be left completely unchanged so different members of a team can compare the newly created variables with the original information if needed.
- Provide functional names to the constructed variables. The names should be intuitive, that is, anyone who is going through the dataset should be able to broadly understand the purpose of the constructed variable.
- Order related variables close to each other. This makes it easier to use constructed variables during analysis.
Addressing outliers
In statistical terms, some define an outlier as the value of a particular characteristic that is three standard deviations more (or less) from the sample mean of that particular characteristic. However, in general, the research team should discuss the following for every dataset they deal with:
- Definition. That is, defining the criteria to label a data point as an outlier for every dataset.
- Resolution. That is, dealing with, or addressing outliers.
Two common approaches to addressing outliers include:
- Replacement. This approach involves replacing the outliers with a missing value.
- Winsorization. This approach involves replacing any values bigger than a certain percentile, often the 99th, with the value of the data point at that percentile itself. This prevents very large values from overreporting the mean value, or what is called biasing the mean. It also ensures that the effects of a project are distributed in a fair manner. For example, if all benefits of an impact evaluation study go to a single observation in the treatment group, then it would not be a desired outcome, even if the mean of the sample is high.
However, no matter what method the research team uses to address outliers, it is important to keep the following points in mind:
- Document the chosen method. Clearly document the approach that was used to deal with outliers, as well as the reasons for choosing that particular approach.
- Keep original variables. Dealing with outliers can affect the distribution of the variable, as well as the final results. Therefore, ensure that the original variable is not replaced.
Standardizing units
Standardizing units refers to making sure there is consistency in the units that the constructed variables are measured in. The method for standardizing can involve different methods depending on whether the variables are Yes/No questions, categorical variables, or numeric variables.
- Yes/No questions: For such questions, the best way to standardize is to code it as "Yes" = 1 and "No" = 0. This makes it easier to treat them both numerically as frequencies for means, and as dummy variables in regressions.
- Categorical variables with more than 2 categories: For such variables, first, assign each category to a numeric value like 1, 2, 3, etc. Then check that labels correspond to the same numerical values for all variables that use the same categories.
- Numeric variables. Often such variables need to be compared with other numeric variables, or aggregated to form statistical indicators. In such cases, convert them to the same scale or unit of measurement. One foolproof way of doing this in a replicable manner is to specify the conversion rates in the master do-file using global macros.
Creating aggregate indicators
The most simple case of variable construction is aggregate indicators, for example, aggregate yield for a farmer who grows wheat, rice, and maize. While this process can seem straightforward, it is important to keep the following issues in mind:
- Labels and scales of measurement: Double-check value labels and scales of measurement for each of the variables that are being used to construct new variables.
- Distribution of original and aggregate variables: Compare the distribution of the original and constructed variables to ensure that creating new variables did not alter the distribution of the original variable.
- Missing values: Always document how missing values are treated, and make sure that aggregating variables has not affected the observations which had missing values.
Merging datasets
One of the steps involved in constructing variables is merging datasets to combine data from multiple sources. In this case, keep the following points in mind:
- Identifiers: Make sure that the merging datasets have the same unique identifier.
- Conflicts: In case there is a conflict between the values from different datasets, R creates two different variables, while Stata keeps the values from the master dataset by default. If you want to use the values from the other dataset, use
update
andreplace
options withmerge
.
Preventing Mistakes
Before we look at ways to prevent common mistakes, it is important to understand where things can go wrong in variable construction, such as:
- Merging, reshaping, collapsing: Can create missing entries, or change the number of observations. Make sure that you understand how each command that you use treats missing values.
- Subsetting: This refers to creating a subset of a dataset. Drop observations explicitly, and document why you are dropping these observations. Also document how the dataset changed.
In order to address these challenges, DIME Analytics recommends the following steps:
- Write pseudo-code. Describe the steps for creating the new variable in simple language on a piece of paper. Refine the sub-steps involved in the process. Think about possible errors at every step.
- Think about expected results. Think about how each command you use will treat missing values. Ask yourself the following questions:
- Will all observations merge?
- Will the number of observations change?
- Will the command create missing values?
- Document observed results. Carefully explore the actual results from a command. Note down the results using comments. Add comments in case there are unexpected results.
- Build checks into your code. Test the unit of observation and the ID variable for duplicates and missing values. Include error messages or break the code if results do not match what is expected. Use
assert
in Stata andstopifnot
in R. Please see below for examples of using code to document observed results, and performing in-built checks.
Example 1: Documenting observed results.
* Merge
use `time_pf', clear
merge 1:1 hhid using`time_pm'
Result # of obs.
----------------------------------
not matched 511
from master 503 (_merge==1)
from using 8 (_merge==2)
matched 1,980 (_merge==3)
----------------------------------
/*** - Household not matched from using - either because they have no female
member of the HH (1493, 1558, 1720, 5437, 5976) , or the female HH member
"Did Not Answer" for age (1151, 1286, 6561).
- Household not matching from master because HHs have no primary male.
***/
drop _merge
Example 2: Built-in checks in code.
/**************************************************
PART 3: Append rounds
**************************************************/
cap erase "${panel_doc}/Codebooks/Inter/Household/Reconcile_appended.xlsx"
iecodebook append `baseline' `midline' `endline' ///
using "${panel_doc}/Codebooks/Inter/Household/Reconcile_appended.xlsx", ///
surveys(Baseline Midline Endline) ///
clear
merge 1:1 hhid using "${hh_dt}/hh_master_reconciled"
* Check merge : 6646 is creating problems here
qui count if _merge == 1
assert r(N) == 1
/*** Check that HHs that are in Master but are not in the panel:
** Either they were never surveyed, or were surveyed in midline
** Surveys and the data needs to be collapsed to hh level in midline
***/
assert ! (_merge == 2 & d_surveyed_bl == 1 | d_surveyed_ml == 1 | d_surveyed_el == 1) & ///
!inlist(hhid, 3248, 3249, 4739, 6250, 6463))
Documentation
As an output of variable construction, documentation is as important as the code and the data that creates the new variables. The idea of documentation is to ensure that someone who is not familiar with the project should also be able to understand the analysis datasets, and the steps taken to create them. The following are therefore important aspects of documentation:
- Document derivations and calculations: Constructing new variables often involves changing simple data points to more complex indicators that are harder to measure. Document exactly how each variable is derived or calculated.
- Record steps: Carefully record how specific variables were combined, recoded, or scaled. Refer to these records in the code to verify if results are as expected.
- Establish clear protocols: Create protocols for defining, naming, and constructing variables. This will guarantee that indicators are consistent across projects.
Related Pages
Click here for pages that link to this topic.
Additional Resources
- DIME Analytics (World Bank), Data Construction