Difference between revisions of "Aggregation"


Revision as of 14:18, 26 October 2017

Aggregation is not conceptually difficult, but unless the best practices described in this article are followed, issues such as missing values can easily make the aggregates incorrect.

Read First

  • Use Stata commands designed for aggregation, such as `egen` with `rowtotal()`, rather than plain addition
  • Check whether missing values were created in your aggregates. Some may be legitimate, but make sure that you can explain all of them
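
The check in the second point could look like the following minimal sketch. It assumes an aggregate called `total_income` built from three components named `income1`, `income2` and `income3`; all of these names are illustrative, not taken from a real data set.

```stata
* Number of non-missing components per observation (hypothetical variable names)
egen income_n_valid = rownonmiss(income1 income2 income3)

* How many aggregates are missing?
count if missing(total_income)

* How many observations have no valid component at all?
count if income_n_valid == 0

* Inspect every observation where at least one component is missing
list income1 income2 income3 total_income if income_n_valid < 3
```

Each listed observation should have an explanation, for example a skip pattern in the questionnaire or a refusal to answer.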

Categories

To help respondents recall information, we often split a question into categories. A common example is income, where the question is often split into different types of income even if we are only interested in total income.

To properly aggregate categories in survey data, first clean the categories for survey codes, and then use commands that properly handle missing data. See `egen rowtotal()` below if you are using Stata.
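
As a rough sketch of these two steps, suppose the survey used -99 for "don't know" and -88 for "refused". These codes, and the variable names, are assumptions made for illustration; use the codes defined in your own questionnaire.

```stata
* Step 1: turn survey codes into extended missing values (assumed codes: -99, -88)
recode income_wage income_farm income_other (-99 = .d) (-88 = .r)

* Step 2: aggregate with a command that handles missing values
egen income_total = rowtotal(income_wage income_farm income_other)
```

If the codes were left in place, a value like -99 would be added to the total as if it were a real amount.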

Repeat groups

It is common to ask the same question over a number of repeated instances, and in most of those cases we want to aggregate the amounts. The issues are similar to those described above, but missing values are even more common.
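
For a repeat group stored in wide format, the aggregation might be sketched as below. The variable names `plot_harvest_1`, `plot_harvest_2`, and so on are hypothetical; the wildcard picks up however many instances exist in the data.

```stata
* Total across all instances of the repeat group (hypothetical variable names).
* With the missing option, the total is missing if every instance is missing.
egen harvest_total = rowtotal(plot_harvest_*), missing

* Number of instances that actually contributed a value
egen harvest_n_plots = rownonmiss(plot_harvest_*)
```

Keeping the count of contributing instances next to the total makes it easier to explain why some totals are low or missing.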

Stata commands

Do not use regular addition with plus signs, like `var1 + var2`, as this is likely to cause many values to be incorrectly reported as missing. Instead, in Stata one should use the `egen` function `rowtotal()`. Note that the variables are separated by spaces, not commas.

```
egen total_income = rowtotal(income1 income2 income3)
```

Risks

Several factors can bias aggregates. Each of them is easy to avoid once you know what to look for.

Missing values

Missing values are common in survey data and cause errors when aggregating. In most programming languages, the expression `var1 + var2` only returns a valid result if both variables have valid values; in Stata, the result is missing if either variable is missing. That is why we need to use a specialized command like `egen rowtotal()` when aggregating variables.
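
The difference can be seen in a small made-up data set. This is only an illustration; the three observations below are invented for the example.

```stata
clear
input income1 income2 income3
10 20 30
10  . 30
 .  .  .
end

* Plain addition: missing as soon as any component is missing
generate sum_plus = income1 + income2 + income3

* rowtotal(): missing components are treated as 0
egen sum_rowtotal = rowtotal(income1 income2 income3)

* With the missing option, the total is missing only if all components are missing
egen sum_rowtotal_m = rowtotal(income1 income2 income3), missing

list
```

In the second observation, plain addition gives a missing value while both `rowtotal()` versions give 40. In the third observation, `rowtotal()` gives 0 by default and a missing value with the `missing` option, so choose the behavior that matches what a fully missing observation means in your data.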

Standardization

Make sure that your variables are standardized to the same unit (if your variables have units). See standardization for more details.
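
A minimal sketch of standardizing before aggregating, assuming three hypothetical harvest quantities `harvest_qty_1` to `harvest_qty_3`, each with a unit variable `harvest_unit_1` to `harvest_unit_3` coded 1 for kilograms and 2 for 50 kg bags. The coding and the conversion factor are assumptions for illustration only.

```stata
forvalues i = 1/3 {
    * Convert each quantity to kilograms; other units are left missing for review
    generate harvest_kg_`i' = harvest_qty_`i'      if harvest_unit_`i' == 1
    replace  harvest_kg_`i' = harvest_qty_`i' * 50 if harvest_unit_`i' == 2
}

* Aggregate only the standardized variables
egen harvest_kg_total = rowtotal(harvest_kg_1 harvest_kg_2 harvest_kg_3)
```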

Double corrections

If adjustments, for example winsorization for outliers, have been applied to the disaggregated variables, then it is usually not a good idea to apply that same adjustment to the aggregate of those variables.
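
As an illustration, the sketch below winsorizes the components at the 99th percentile and then aggregates them without adjusting the total again. The cutoff and the variable names are assumptions, and this is only one possible way to winsorize in Stata.

```stata
foreach var of varlist income1 income2 income3 {
    * r(p99) is the 99th percentile stored by summarize, detail
    quietly summarize `var', detail
    * Cap values at the 99th percentile; missing values stay missing
    generate `var'_w = min(`var', r(p99)) if !missing(`var')
}

* Aggregate the adjusted components, and do not winsorize this total again
egen total_income_w = rowtotal(income1_w income2_w income3_w)
```

The `if !missing()` condition matters because Stata's `min()` ignores missing arguments and would otherwise replace a missing value with the cutoff.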

Back to Parent

This article is part of the topic Data Analysis
