Aggregation

Aggregation is the compilation of many values to create one aggregate value. It takes place during data construction, which occurs between data cleaning and data analysis. This page provides common cases of aggregation and outlines best practices.

Read First

Make sure to use specialized commands for aggregation like Stata’s egen rowtotal() to avoid errors.
After aggregating, check if any missing values were created. If they were, make sure you can explain why.

Common Cases of Aggregation

Categories

Questionnaires often split variables into categories to help respondents recall information more completely and in line with the researcher’s definitions. For example, even if a researcher is only interested in total income, they may design the questionnaire to ask about categories of income (i.e. agricultural wages, non-agricultural wages, self-employment, crop production, livestock production, transfers, and other income). They will then aggregate these category values later.

To properly aggregate categories in survey data, first clean each category variable of survey codes (numeric placeholders for responses such as “don’t know” or “refused to answer”), then use commands that properly handle missing data, like egen rowtotal().
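The two steps can be sketched in Stata as follows. This is a minimal illustration, not a prescribed workflow: the variable names, the toy data, and the survey code -99 are all assumptions made for the example.

```stata
* Toy data: three income categories; -99 is a hypothetical "don't know" code
clear
input hhid inc_agwage inc_nonagwage inc_selfemp
1 100  50  20
2 200 -99  30
3 -99 -99 -99
end

* Recode the survey code to an extended missing value before summing
foreach var of varlist inc_agwage inc_nonagwage inc_selfemp {
    replace `var' = .d if `var' == -99
}

* rowtotal() treats missing components as 0; with the missing option the
* total is set to missing only when every component is missing
egen total_income = rowtotal(inc_agwage inc_nonagwage inc_selfemp), missing

* Check for missing totals and make sure each one can be explained
count if missing(total_income)
list hhid if missing(total_income)
```

Without the recoding step, household 2 would have had -99 subtracted from its total. With it, only household 3, which reported no usable values, ends up with a missing total.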

Repeat Groups

Questionnaires often ask the same question over a number of repeated instances. For example, a questionnaire may repeat through a list of crops that a household cultivates and ask for the annual income earned on each. To calculate total income earned on crops, the researcher would then aggregate these group values.

Aggregating repeat groups raises the same issues as aggregating categories, but missing values are even more common. Accordingly, be sure to clean each repeated variable of survey codes and use commands that properly handle missing data, like egen rowtotal().
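When repeat-group data is stored in long format (one row per household and crop), the same idea applies with egen's total() function. The sketch below assumes hypothetical variable names (hhid, crop_id, crop_income) and toy data in which survey codes have already been recoded to missing.

```stata
* Toy long-format data: one row per household-crop
clear
input hhid crop_id crop_income
1 1 300
1 2 150
2 1 .
2 2 400
3 1 .
end

* Sum within household; total() treats missing as 0, and the missing option
* keeps the result missing when a household has no nonmissing values at all
bysort hhid: egen total_crop_income = total(crop_income), missing

* Tag one row per household to inspect the household-level totals
egen tag_hh = tag(hhid)
list hhid total_crop_income if tag_hh
```

If the repeat group is instead stored in wide format (for example crop_income_1, crop_income_2, and so on), egen rowtotal() with a wildcard variable list plays the same role.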

Best Practices

Use Specialized Commands

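Do not manually aggregate variables. Code like gen var_aggregate = var1 + var2 + var3 leaves var_aggregate missing whenever any one of var1, var2, or var3 is missing, because in Stata, as in most statistical software, arithmetic on a missing value returns missing. Instead, use the rowtotal() function of Stata’s egen command, which treats missing values as 0. For example: egen total_income = rowtotal(income1 income2 income3). Note that the variable list is separated by spaces, not commas. By default, rowtotal() returns 0 even when every component is missing; add the missing option if such observations should stay missing. In R, use rowSums() with na.rm = TRUE to sum across columns, or aggregate() to collapse across observations.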
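The difference between the two approaches is easiest to see side by side. The following is a small sketch on made-up data; the variable names total_manual and n_reported are introduced only for illustration.

```stata
* Toy data with partially and fully missing rows
clear
input income1 income2 income3
10 20 30
10  . 30
 .  .  .
end

* Manual sum: any missing component makes the whole sum missing
gen total_manual = income1 + income2 + income3

* rowtotal(): missing components count as 0
egen total_income = rowtotal(income1 income2 income3)

* rownonmiss() counts how many components were actually reported
egen n_reported = rownonmiss(income1 income2 income3)

list
```

In the second row the manual sum is missing while rowtotal() returns 40. In the third row rowtotal() returns 0 even though nothing was reported, which is why a count of nonmissing components (or the missing option) is useful for telling a true zero apart from no data.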

Additional Resources

More details on how Stata handles missing values
Stata’s manual on egen
FAO’s guide on constructing income aggregates
UCLA’s guide on collapsing data across observations to create summary statistics
