Difference between revisions of "Aggregation"

(10 intermediate revisions by 3 users not shown)
Line 1: Line 1:
Aggregation is not a conceptually difficult topic, but unless the best practices described in this article are followed, issues like missing values can easily cause the aggregates to be incorrect.
<onlyinclude>
Aggregation is the compilation of many values to create one aggregate value. It takes place during data construction, which occurs between [[Data Cleaning | data cleaning]] and [[Data Analysis | data analysis]].  This page provides common cases of aggregation and outlines best practices.
</onlyinclude>
== Read First ==
* Make sure to use specialized commands for aggregation like Stata’s <code>egen rowtotal()</code> to avoid errors.
* After aggregating, check if any missing values were created. If they were, make sure you can explain why.


== Read First ==
==Common Cases of Aggregation ==
* Make sure to use the Stata commands designed specifically for aggregation
* Make sure to check whether missing values were created in your aggregates. They may be OK, but be sure that you can explain all of them


== Categories ==
=== Categories ===
To help respondents recall information, we often split questions into categories. A common example is income, where we often split the income question into different categories of income even if we are only interested in total income.
Questionnaires often split variables into categories in order to help respondents recall information more exhaustively and according to the researcher’s definitions. For example, even if a researcher is only interested in total income, he/she may [[Questionnaire Design | design]] the questionnaire to ask about categories of income (i.e. agricultural wages, non-agricultural wages, self-employment, crop production, livestock production, transfers, and other income). He/she will then aggregate these category values later.


To properly aggregate categories in survey data, make sure to clean the categories for [[Data_Cleaning#Survey_Codes_and_Missing_Values|survey codes]] and then use commands that properly handle missing data. See `egen rowtotal()` below if you are using Stata.
To properly aggregate categories in survey data, make sure to clean the categories for [[Data_Cleaning#Survey_Codes_and_Missing_Values|survey codes]] and to use commands that properly handle missing data like <code>egen rowtotal()</code>.


== Repeat groups ==
=== Repeat Groups ===
It is common to ask the same question over a number of repeated instances, and most of the time we want to aggregate the amounts. The issues are similar to those already described, but missing values are even more common.
Questionnaires often ask the same question over a number of [[SurveyCTO Repeat Group Using Previous Choices | repeated]] instances. For example, a questionnaire may repeat through a list of crops that a household cultivates and ask for the annual income earned on each. To calculate total income earned on crops, the researcher would then aggregate these group values.  


== Stata commands ==
Aggregating repeat groups introduces issues similar to those described above, but missing values are even more common. Accordingly, be sure to clean the categories for survey codes and use commands that properly handle missing data like <code>egen rowtotal()</code>.
Do not use regular addition with plus signs like `var1 + var2` as this is likely to lead to a lot of values being incorrectly reported as missing values. Instead, in Stata one should use the `egen` function `rowtotal()`.
```
egen total_income = rowtotal(income1 income2 income3)
```


== Risks ==
== Best Practices ==
There are factors that risk biasing aggregates. They are easily avoided if one knows what to do.


=== Missing values ===
=== Use Specialized Commands===
Missing values are common in survey data and cause errors when aggregating. In most programming languages the expression `var1 + var2` returns a valid value only if both variables have valid values. That is why we need to use a specialized command like `egen rowtotal()` when aggregating variables.
Do not manually aggregate variables. Code like <code>gen var_aggregate = var1 + var2 + var3</code> leads to many <code>var_aggregate</code> values being incorrectly reported as missing, since most programming languages, Stata included, set the sum to missing if any one of <code>var1</code>, <code>var2</code>, or <code>var3</code> is missing. Instead, use the <code>egen</code> function <code>rowtotal()</code> in Stata, which treats missing values as 0. For example: <code>egen total_income = rowtotal(income1 income2 income3)</code>. In R, use the <code>aggregate</code> or <code>rowSums</code> commands, setting <code>na.rm = TRUE</code> in <code>rowSums</code> so that missing values are ignored.


=== Standardization ===
=== Standardize Variables ===
Make sure that your variables are standardized to the same unit (if your variables have units). See [[standardization]] for more details.
If your variables have units, make sure that your variables are [[Standardization | standardized]] to the same unit before aggregating.


=== Double corrections ===
=== Avoid Double Corrections===
If adjustments, for example winsorization for outliers, have been applied to the disaggregated variables, then it is usually not a good idea to apply the same adjustment to the aggregate of those variables.
If you have applied adjustments, transformations, or normalizations to the disaggregated variables, it is not typically a good idea to apply the same adjustments to the aggregated value.  


== Back to Parent ==
== Back to Parent ==
Line 35: Line 33:


== Additional Resources ==
== Additional Resources ==
list here other articles related to this topic, with a brief description and link
 
*More [https://stats.idre.ucla.edu/stata/modules/missing-values/ details] on how Stata handles missing values
*Stata’s [https://www.stata.com/manuals13/degen.pdf manual] on <code>egen</code>
*FAO’s [http://www.fao.org/fileadmin/user_upload/riga/pdf/ai197e00.pdf guide] on constructing income aggregates
*UCLA’s [https://stats.idre.ucla.edu/stata/modules/collapsing-data-across-observations/ guide] on collapsing data across observations to create summary statistics


[[Category: Data Analysis ]]
[[Category: Data Analysis ]]

Latest revision as of 18:34, 29 April 2019

Aggregation is the compilation of many values to create one aggregate value. It takes place during data construction, which occurs between data cleaning and data analysis. This page provides common cases of aggregation and outlines best practices.

Read First

  • Make sure to use specialized commands for aggregation like Stata’s egen rowtotal() to avoid errors.
  • After aggregating, check if any missing values were created. If they were, make sure you can explain why.

Common Cases of Aggregation

Categories

Questionnaires often split variables into categories in order to help respondents recall information more exhaustively and according to the researcher’s definitions. For example, even if a researcher is only interested in total income, he/she may design the questionnaire to ask about categories of income (i.e. agricultural wages, non-agricultural wages, self-employment, crop production, livestock production, transfers, and other income). He/she will then aggregate these category values later.

To properly aggregate categories in survey data, make sure to clean the categories for survey codes and to use commands that properly handle missing data like egen rowtotal().
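The cleaning step matters because a survey code left in the data is summed as if it were a real amount. The following is a minimal sketch in Stata, assuming hypothetical variable names (<code>inc_wage</code>, <code>inc_crop</code>, <code>inc_transfer</code>) and assuming that -99 is the questionnaire's code for "don't know"; the actual codes depend on the survey.

```stata
* Hypothetical example data: three income categories, where -99 is
* assumed to be the survey code for "don't know"
clear
input inc_wage inc_crop inc_transfer
100  50  20
200 -99  10
-99 -99 -99
end

* Replace the survey code with a missing value before aggregating;
* otherwise -99 would be added to the total as if it were income
foreach var of varlist inc_wage inc_crop inc_transfer {
    replace `var' = . if `var' == -99
}

* rowtotal() takes a space-separated varlist and treats missing values as 0
egen total_income = rowtotal(inc_wage inc_crop inc_transfer)
list
```

With this example data the totals are 170, 210, and 0. Note that the last respondent gets a total of 0 even though no category was answered, which is worth checking before analysis.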

Repeat Groups

Questionnaires often ask the same question over a number of repeated instances. For example, a questionnaire may repeat through a list of crops that a household cultivates and ask for the annual income earned on each. To calculate total income earned on crops, the researcher would then aggregate these group values.

Aggregating repeat groups introduces issues similar to those described above, but missing values are even more common. Accordingly, be sure to clean the categories for survey codes and use commands that properly handle missing data like egen rowtotal().
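Repeat groups usually arrive as numbered variables, many of which are missing for households with fewer instances. The sketch below is one possible approach, assuming hypothetical variables named <code>crop_income_1</code> to <code>crop_income_3</code>. It uses the <code>missing</code> option of <code>rowtotal()</code>, which sets the total to missing only when every variable in the list is missing, and then counts the missing totals so that each one can be explained.

```stata
* Hypothetical example data: annual income from up to three crops
clear
input hhid crop_income_1 crop_income_2 crop_income_3
1 100 50 .
2 300  .  .
3   .  .  .
end

* The wildcard picks up all repeat-group instances; with the missing
* option, the total is missing only if all instances are missing
egen total_crop_income = rowtotal(crop_income_*), missing

* Check how many missing aggregates were created and inspect them
count if missing(total_crop_income)
list hhid total_crop_income if missing(total_crop_income)
```

Here only household 3, which reported no crop income at all, ends up with a missing total. Whether such cases should be missing or 0 is a judgment call that depends on the questionnaire.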

Best Practices

Use Specialized Commands

Do not manually aggregate variables. Code like gen var_aggregate = var1 + var2 + var3 leads to many var_aggregate values being incorrectly reported as missing, since most programming languages, Stata included, set the sum to missing if any one of var1, var2, or var3 is missing. Instead, use the egen function rowtotal() in Stata, which treats missing values as 0. For example: egen total_income = rowtotal(income1 income2 income3). In R, use the aggregate or rowSums commands, setting na.rm = TRUE in rowSums so that missing values are ignored.
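The difference between the two approaches can be seen on a small made-up dataset. This is an illustrative sketch only; the variable names and values are hypothetical. Note that rowtotal() expects a space-separated variable list.

```stata
* Hypothetical example data: the second observation has one missing value
clear
input income1 income2 income3
10 20 30
10  . 30
end

* Manual addition: the result is missing whenever any component is missing
gen total_plus = income1 + income2 + income3

* rowtotal(): missing components are treated as 0
egen total_income = rowtotal(income1 income2 income3)

list
```

For the second observation, total_plus is missing while total_income is 40.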

Standardize Variables

If your variables have units, make sure that your variables are standardized to the same unit before aggregating.
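As an illustration, suppose harvest quantities were recorded together with a unit code. The sketch below assumes hypothetical variables (harvest_1, unit_1, and so on) and a hypothetical coding where 1 means kilograms and 2 means 100-kg bags; real conversion factors must come from the survey documentation.

```stata
* Hypothetical example data: quantity and unit code for two crops
* (assumed coding: 1 = kg, 2 = 100-kg bag)
clear
input harvest_1 unit_1 harvest_2 unit_2
100 1 2 2
 50 1 . .
end

* Convert every quantity to kilograms before aggregating
forvalues i = 1/2 {
    gen     harvest_kg_`i' = harvest_`i'       if unit_`i' == 1
    replace harvest_kg_`i' = harvest_`i' * 100 if unit_`i' == 2
}

* Aggregate only the standardized variables
egen total_harvest_kg = rowtotal(harvest_kg_1 harvest_kg_2)
list
```

Summing the raw harvest_1 and harvest_2 values for the first observation would give 102, mixing kilograms and bags, whereas the standardized total is 300 kg.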

Avoid Double Corrections

If you have applied adjustments, transformations, or normalizations to the disaggregated variables, it is not typically a good idea to apply the same adjustments to the aggregated value.

Back to Parent

This article is part of the topic Data Analysis

Additional Resources

  • More details on how Stata handles missing values
  • Stata’s manual on egen
  • FAO’s guide on constructing income aggregates
  • UCLA’s guide on collapsing data across observations to create summary statistics