Multi-stage (Cluster) Sampling

Revision as of 13:05, 10 June 2019

Multi-stage (cluster) sampling is a common sampling design in which the unit of randomization differs from the unit of observation. In other words, the unit at which the treatment is assigned (e.g. community, school) is different from the unit at which surveys are administered (e.g. household, student). This page explains multi-stage (cluster) sampling and demonstrates how to implement it in Stata.

Read First

  • The number of clusters in a research design is closely related to sampling and power calculations.
  • When randomizing between clusters, make sure to cluster standard errors during data analysis.
  • Multi-stage (cluster) sampling must typically be implemented manually. It relies on subsetting the data intelligently to the desired assignment levels.

Overview

Many studies collect data at a different level of observation than the randomization unit. Consider, for example, a researcher who wants to measure the household-level effects of a village-level water sanitation program, or a researcher who wants to measure the student-level effects of a school-level food program. This research design, in which units are assigned to treatments in clusters, is called clustering.

Considerations

How Many Clusters?

To test a program's impact convincingly and to estimate treatment effects precisely, it is important to use a sufficient number of clusters. With a small number of clusters, the treatment and control groups are unlikely to be comparable; as the number of clusters increases, the two groups become more similar and balanced and, accordingly, the treatment effect estimate becomes more precise. Typically, clustered sampling designs should include at least 40-50 clusters in each treatment and control group in order to obtain sufficient power and balance at baseline [1] (https://siteresources.worldbank.org/EXTHDOFFICE/Resources/5485726-1295455628620/Impact_Evaluation_in_Practice.pdf). The exact number of clusters depends on the intra-cluster correlation, sampling and power calculations, and the budget, since adding clusters generally raises costs.
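The role of the intra-cluster correlation can be illustrated with a minimal sketch on simulated data. The village and household counts, the variable names, and the variance parameters below are assumptions made purely for illustration, and the design effect formula 1 + (m - 1) * rho assumes equal cluster sizes of m.

```stata
// Hypothetical data: 40 villages with 10 households each (all values simulated)
clear
set seed 20190610
set obs 40
gen village = _n
gen village_shock = rnormal()              // component shared within a village
expand 10                                  // 10 households per village
gen outcome = village_shock + rnormal()    // household-level outcome

// Estimate the intra-cluster correlation (ICC) with a one-way ANOVA
loneway outcome village
scalar rho = r(rho)

// Design effect for equal cluster size m = 10
scalar deff = 1 + (10 - 1) * rho
display "ICC = " rho "   design effect = " deff
```

The higher the ICC, the less information each additional household in a cluster contributes, so power depends more on the number of clusters than on the number of households per cluster.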

Standard Errors

In multi-stage (cluster) sampling, since the treatment is assigned to clusters, there are fewer randomized groups than the number of units in the data. Therefore, at the data analysis stage, standard errors for clustered designs must be clustered at the level at which the treatment was assigned.
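As a rough sketch of what this looks like in practice, the example below simulates a village-level assignment and clusters the standard errors at the village level. The data, variable names, and effect size are hypothetical and chosen only for illustration.

```stata
// Hypothetical data: 40 villages with 10 households each (all values simulated)
clear
set seed 20190610
set obs 40
gen village = _n
gen double u = runiform()
sort u
gen byte treatment = (_n <= 20)            // 20 treatment and 20 control villages
gen village_shock = rnormal()              // component shared within a village
expand 10                                  // 10 households per village
gen outcome = 0.5 * treatment + village_shock + rnormal()

// Cluster standard errors at the level of treatment assignment (village)
regress outcome treatment, vce(cluster village)
```

The regression uses all 400 household observations, but the standard errors reflect the fact that only 40 villages were independently assigned.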

Implementation

Multi-stage (cluster) sampling must typically be implemented manually. It relies on subsetting the data intelligently to the desired assignment levels. A demonstration follows:

// Use [randtreat] in randomization program
// (randtreat is user-written; install once with: ssc install randtreat)
cap prog drop my_randomization
prog def my_randomization

	// Syntax with open options for [ritest]
	syntax, [*]
	cap drop treatment
	cap drop cluster

	// Create cluster indicator
	egen cluster = group(sex agegrp) , label
	label var cluster "Cluster Group"

	// Keep only one observation from each cluster for randomization
	preserve
		egen ctag = tag(cluster)
		keep if ctag == 1
		drop ctag

		// Assign half of the clusters to control and half to treatment
		randtreat, 				///
		  generate(treatment)	/// New variable name
		  multiple(2)			//  Two arms

		// Save only the cluster-level assignment
		keep cluster treatment
		tempfile ctreat
		save `ctreat' , replace
	restore

	// Apply assignment to entire cluster
	merge m:1 cluster using `ctreat' , nogen

	// Cleanup
	lab var treatment "Treatment Arm"
	lab def treatment ///
	  0 "Control"	///
	  1 "Treatment"	///
	  , replace
	lab val treatment treatment
end

// Reproducible setup: data, isid, version, seed
sysuse bpwide.dta , clear
	isid patient , sort
	version 13.1
	set seed 796683 // Timestamp: 2019-02-26 22:14:17 UTC
	
// Randomize
my_randomization
ta cluster treatment
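If randtreat is not installed, the same two-stage logic can be sketched with built-in commands only. This is an illustrative alternative rather than the method used above: the random-sort assignment and the variable name u are assumptions introduced here, and the final check confirms that treatment is constant within each cluster.

```stata
// Reproducible setup: data, isid, version, seed
sysuse bpwide.dta , clear
isid patient , sort
version 13.1
set seed 796683

// Create cluster indicator
egen cluster = group(sex agegrp) , label

preserve
	// Keep one row per cluster
	bysort cluster: keep if _n == 1
	keep cluster

	// Sort clusters in random order and assign the second half to treatment
	gen double u = runiform()
	sort u
	gen byte treatment = (_n > _N/2)

	// Save only the cluster-level assignment
	keep cluster treatment
	tempfile ctreat
	save `ctreat'
restore

// Apply assignment to entire cluster
merge m:1 cluster using `ctreat' , nogen assert(match)

// Check that treatment does not vary within a cluster
bysort cluster (treatment): assert treatment[1] == treatment[_N]
tab cluster treatment
```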

Back to Parent

This article is part of the topic Sampling & Power Calculations

Additional Resources

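  • Better Evaluation’s Multistage Clustering resource: http://betterevaluation.org/en/evaluation-options/multistage
  • DIME Analytics' presentations on randomization, 1 (https://github.com/worldbank/DIME-Resources/blob/master/stata1-5-randomization.pdf) and 2 (https://github.com/worldbank/DIME-Resources/blob/master/stata2-5-randomization.pdf), the latter of which covers multi-stage cluster sampling
  • This World Bank blog post discusses when you should cluster standard errors: https://blogs.worldbank.org/impactevaluations/when-should-you-cluster-standard-errors-new-wisdom-econometrics-oracle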