Clustered Sampling and Treatment Assignment
Multi-stage (cluster) sampling is a common sampling design in which the unit of randomization differs from the unit of observation. In other words, the unit at which the treatment is assigned (e.g. community, school) is different from the unit at which surveys are administered (e.g. household, student). This page explains multi-stage (cluster) sampling and demonstrates how to implement it in Stata.
- The number of clusters in a research design is closely related to sampling and power calculations.
- When randomizing between clusters, make sure to cluster standard errors during data analysis.
- Multi-stage (cluster) sampling must typically be implemented manually. It relies on subsetting the data intelligently to the desired assignment levels.
- For more information on multi-stage (cluster) sample size calculations, see Additional Factor for Clustered Sampling on the Sample Size page.
Many studies collect data at a different level of observation than the randomization unit. Consider, for example, a researcher who wants to measure the household-level effects of a village-level water sanitation program, or a researcher who wants to measure the student-level effects of a school-level food program. This research design, in which units are assigned to treatments in clusters, is called clustering.
How Many Clusters?
To test a program's impact convincingly and to estimate treatment effects precisely, it is important to use a sufficient number of clusters. With a small number of clusters, the treatment and control groups are unlikely to be well balanced; as the number of clusters increases, the two groups become more similar and, accordingly, the treatment effect estimate becomes more precise. Typically, clustered sampling designs should include at least 40-50 clusters in each treatment and control group in order to obtain sufficient power and balance at baseline. The exact number of clusters depends on the intra-cluster correlation, the sampling and power calculations, and the budget, since adding clusters generally increases cost.
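The role of the intra-cluster correlation can be made concrete with the standard design effect formula for equal-sized clusters. The symbols below are introduced here for illustration and do not appear elsewhere on this page: $m$ is the number of observations per cluster, $\rho$ is the intra-cluster correlation, and $n$ is the total number of observations. The numbers in the worked example are hypothetical.

```latex
\[
\mathrm{DEFF} = 1 + (m - 1)\,\rho ,
\qquad
n_{\mathrm{eff}} = \frac{n}{\mathrm{DEFF}}
\]

% Hypothetical example: 50 clusters of m = 20 observations each, rho = 0.05
\[
\mathrm{DEFF} = 1 + (20 - 1)(0.05) = 1.95 ,
\qquad
n_{\mathrm{eff}} = \frac{1000}{1.95} \approx 513
\]
```

Under these assumptions, 1,000 clustered observations carry roughly the information of 513 independent ones, which is why adding clusters usually improves precision more than adding observations within existing clusters.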
In multi-stage (cluster) sampling, since the treatment is assigned to clusters, there are fewer randomized groups than the number of units in the data. Therefore, at the data analysis stage, standard errors for clustered designs must be clustered at the level at which the treatment was assigned.
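As a minimal sketch of what this looks like in Stata, the example below uses the built-in bpwide dataset and clusters the standard errors with the vce(cluster) option. The choice of bp_after as the outcome and the deterministic even/odd assignment are assumptions made only so that the example runs on its own; the assignment is a stand-in, not a valid randomization.

```stata
// Illustrative only: the outcome and the assignment rule are assumptions
sysuse bpwide.dta , clear

// Build a cluster identifier from sex and age group
egen cluster = group(sex agegrp) , label

// Stand-in cluster-level assignment (NOT a real randomization):
// even-numbered clusters are control, odd-numbered clusters are treatment
gen treatment = mod(cluster, 2)

// Cluster standard errors at the level at which treatment was assigned
regress bp_after treatment , vce(cluster cluster)
```

With only a handful of clusters, as in this toy dataset, cluster-robust standard errors are themselves unreliable, which is one more reason to plan for a sufficient number of clusters.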
Multi-stage (cluster) sampling must typically be implemented manually. It relies on subsetting the data intelligently to the desired assignment levels. A demonstration follows:
// Use [randtreat] in randomization program
cap prog drop my_randomization
prog def my_randomization

    // Syntax with open options for [ritest]
    syntax, [*]
    cap drop treatment
    cap drop cluster

    // Create cluster indicator
    egen cluster = group(sex agegrp) , label
    label var cluster "Cluster Group"

    // Keep only one from each cluster for randomization
    preserve
        egen ctag = tag(cluster)
        keep if ctag == 1
        drop ctag

        // Group 1/2 in control and treatment
        randtreat, ///
            generate(treatment) /// New variable name
            multiple(2) // Two arms

        // Apply assignment to entire cluster
        tempfile ctreat
        save `ctreat' , replace
    restore
    merge m:1 cluster using `ctreat' , nogen

    // Cleanup
    lab var treatment "Treatment Arm"
    lab def treatment ///
        0 "Control" ///
        1 "Treatment" ///
        , replace
    lab val treatment treatment
end

// Reproducible setup: data, isid, version, seed
sysuse bpwide.dta , clear
isid patient , sort
version 13.1
set seed 796683 // Timestamp: 2019-02-26 22:14:17 UTC

// Randomize
my_randomization
ta cluster treatment