Aggregation
Aggregation is not a conceptually difficult topic, but unless the best practices described in this article are followed, issues like missing values can easily cause the aggregates to be incorrect.
Read First
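- If you are using Stata, make sure to use the commands made specifically for aggregation, such as `egen rowtotal()`, rather than regular addition
- Make sure to check whether missing values were created in your aggregates. They could be OK, but be sure that you can explain all of them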
Categories
To help respondents recall information, we often split questions into categories. A common example is income, where the income question is often split into different categories of income even if we are only interested in total income.
To properly aggregate categories in survey data, make sure to clean the categories for survey codes and then use commands that properly handle missing data. See `egen rowtotal()` below if you are using Stata.
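The two steps above, cleaning survey codes and then aggregating, could look roughly like the sketch below. The variable names and the survey codes (-99 for "don't know" and -88 for "refused") are assumptions made for illustration; replace them with the codes used in your own questionnaire.

```
* Hypothetical example data: three income categories per respondent
clear
input id income1 income2 income3
1 100  50 -99
2 200   .  30
3 -88 -88 -88
end

* Recode the (assumed) survey codes to extended missing values
* so that they are not summed as if they were real amounts
mvdecode income1 income2 income3, mv(-99=.d \ -88=.r)

* Sum across categories; the missing option returns missing
* only when all categories are missing
egen total_income = rowtotal(income1 income2 income3), missing
list
```

Without the cleaning step, respondent 1 would get a total of 51 instead of 150, since the code -99 would be added as if it were an amount.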
Repeat groups
It is common that we ask the same question over a number of repeated instances, and most of the time we want to aggregate the amounts. The issues are similar to those described already, but missing values are even more common.
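For a repeat group stored in wide format, a wildcard can be used to sum over all instances. This is a minimal sketch that assumes hypothetical variables named `crop_sales_1`, `crop_sales_2` and so on; it also counts how many instances were actually reported, which helps when explaining the aggregate.

```
* Hypothetical example data: one variable per repeated instance
clear
input hhid crop_sales_1 crop_sales_2 crop_sales_3
1 10 20 30
2 15  .  .
3  .  .  .
end

* Sum over all instances of the repeat group
egen total_crop_sales = rowtotal(crop_sales_*), missing

* Number of instances with a non-missing value
egen n_crop_reports = rownonmiss(crop_sales_*)
list
```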
Stata commands
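Do not use regular addition with plus signs, like `var1 + var2`, as this is likely to lead to a lot of values being incorrectly reported as missing: the sum is missing whenever any one of the variables is missing. Instead, in Stata one should use the `egen` function `rowtotal()`, which treats missing values as zero. Note that `rowtotal()` therefore returns 0, not missing, for an observation where all variables are missing, unless the `missing` option is specified.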
```
* rowtotal() takes a varlist, so the variables are separated by spaces, not commas
egen total_income = rowtotal(income1 income2 income3)
```
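To see how the approaches differ, the sketch below compares regular addition, `rowtotal()` and `rowtotal()` with the `missing` option on a small made-up data set. The variable names and values are only illustrative.

```
* Hypothetical example data
clear
input income1 income2 income3
100 50 25
100  . 25
  .  .  .
end

* Regular addition: missing if any component is missing
gen total_plus = income1 + income2 + income3

* rowtotal(): missing values are treated as zero
egen total_row = rowtotal(income1 income2 income3)

* rowtotal() with missing: missing only if all components are missing
egen total_row_m = rowtotal(income1 income2 income3), missing
list
```

In the second row, regular addition gives a missing value while both `rowtotal()` versions give 125. In the third row, plain `rowtotal()` gives 0 while the `missing` option gives a missing value. Which of the two is correct depends on whether all-missing means "no income" or "no information" in your data.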
Risks
Several factors risk biasing aggregates. They are easily avoided if one knows what to look for.
Missing values
Missing values are common in survey data and lead to incorrect aggregates if they are not handled deliberately. In Stata, as in most statistical software, the expression `var1 + var2` evaluates to missing unless both variables have non-missing values. That is why we need to use a specialized command like `egen rowtotal()` when aggregating variables.
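One way to check the missing values in an aggregate is sketched below. It counts the components that are missing for each observation and lists the observations that need an explanation. All variable names are hypothetical.

```
* Hypothetical example data
clear
input id income1 income2 income3
1 100 50 25
2 100  . 25
3   .  .  .
end

egen total_income = rowtotal(income1 income2 income3), missing

* Number of missing components for each observation
egen n_missing_components = rowmiss(income1 income2 income3)

* Aggregates that are missing: make sure each one can be explained
count if missing(total_income)
list id income1 income2 income3 if missing(total_income)

* Aggregates that are not missing but built on incomplete information
list id total_income n_missing_components ///
    if n_missing_components > 0 & !missing(total_income)
```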
Standardization
Make sure that your variables are standardized to the same unit (if your variables have units). See standardization for more details.
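As an illustration, the sketch below converts all quantities to kilograms before aggregating. The variable names and the unit codes (1 for kilograms, 2 for grams) are assumptions for this example only.

```
* Hypothetical example data: each quantity has its own unit variable
* Assumed unit codes: 1 = kilograms, 2 = grams
clear
input id harvest1 unit1 harvest2 unit2
1 2   1 500 2
2 1.5 1   . .
end

* Convert each quantity to kilograms before aggregating
forvalues i = 1/2 {
    gen     harvest`i'_kg = harvest`i'        if unit`i' == 1
    replace harvest`i'_kg = harvest`i' / 1000 if unit`i' == 2
}

* Aggregate only the standardized variables
egen total_harvest_kg = rowtotal(harvest1_kg harvest2_kg), missing
list
```

Summing the raw variables would give household 1 a total of 502, which is neither kilograms nor grams.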
Double corrections
If adjustments, for example winsorization for outliers, have been applied to the disaggregated variables, then it is usually not a good idea to apply that same adjustment to the aggregate of those variables.
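A minimal sketch of this order of operations is shown below. It assumes that winsorization means capping values at the 99th percentile and uses simulated data with hypothetical variable names; your project may use a different rule or a dedicated command.

```
* Simulated example data with hypothetical variable names
clear
set seed 12345
set obs 200
gen double income1 = runiform() * 100
gen double income2 = runiform() * 100

* Winsorize each component at its 99th percentile (assumed rule)
foreach v of varlist income1 income2 {
    quietly summarize `v', detail
    * min() ignores missing values, so the if condition keeps missing as missing
    gen double `v'_w = min(`v', r(p99)) if !missing(`v')
}

* Aggregate the adjusted components and do not winsorize the total again
egen double total_income_w = rowtotal(income1_w income2_w), missing
```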
Back to Parent
This article is part of the topic Data Analysis