Sampling is the process of randomly selecting units from a population of interest to represent the characteristics of that population. Sampling in a statistically valid, representative manner is a crucial step in conducting high-quality randomized controlled trials. The sampling process consists of two parts, sample design and sample implementation, both of which should occur early in the evaluation design process to facilitate data collection planning. Sampling affects a research project’s budget, timeline, accuracy, and precision. This page provides guidelines for and approaches to sampling.
- Always sample from a master dataset. If no master dataset exists for the unit of observation at which you want to sample, begin by creating the master dataset.
- Sampling code requires extra care! Errors cannot be corrected after the intervention or survey has started. Always ask a second person to double-check your code before you use the sample it generated in the field. For DIME projects, always consult DIME Analytics before sending a sample to the field.
- While a simple random sample works well for small populations, impact evaluations more typically rely on multi-stage (cluster) sampling, often with stratification.
- For information on sample size and power calculations, see Sample Size and Power Calculations; for information on implementing power calculations and selecting samples, see Power Calculations in Stata.
How to Sample
Identify Population of Interest
Before drawing a sample, you must identify the population of interest. Clearly define the region and characteristics of the population: these details will indicate who the sample must represent.
Establish the Sampling Frame and Master Dataset
Once you’ve defined the population of interest, establish the sampling frame and master dataset. The master dataset is the most comprehensive listing of the fixed characteristics of the observations in the population of interest. Ideally, the master dataset should contain every observation from the population of interest. If you do not have a master dataset for the unit of observation from which you are sampling (e.g. households, villages, clinics, schools), you should always start by creating one. In the field, this is done through a listing at the lowest level of clustering possible.
Choose a Sampling Approach
The most basic sampling technique is a Simple Random Sample. This works well for studies of small populations, with a complete sampling frame for the population. More typically, impact evaluations rely on multi-stage (cluster) sampling, often with stratification.
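As a rough illustration of a simple random sample, the sketch below draws units without replacement from a complete sampling frame, giving every unit the same chance of selection. It is written in Python purely for illustration; the household IDs, the sample size, the seed value, and the function name `simple_random_sample` are all hypothetical and not taken from this page.

```python
import random


def simple_random_sample(frame, n, seed):
    """Draw n units without replacement, each with equal probability."""
    if n > len(frame):
        raise ValueError("sample size exceeds the size of the sampling frame")
    rng = random.Random(seed)   # local generator so the draw can be repeated
    ordered = sorted(frame)     # fix the order of the frame before drawing
    return rng.sample(ordered, n)


# Hypothetical sampling frame: 200 household IDs
frame = [f"HH{i:03d}" for i in range(1, 201)]
sample = simple_random_sample(frame, 20, seed=20240517)
print(sample)
```

Because the frame is sorted and the generator is seeded inside the function, rerunning the call with the same frame and seed should return the same 20 households.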
Multi-stage (cluster) sampling is a common sampling design in which the unit of randomization differs from the unit of observation. In other words, the unit at which the treatment is assigned (e.g. community, school) is different from the unit at which surveys are administered (e.g. household, student). For more information, see Multi-stage (Cluster) Sampling.
Stratification is a sampling design that divides the target population into subgroups before randomization, ensuring that subgroups of the population are represented in the final sample and treatment groups. In addition to ensuring representativeness, stratification allows researchers to disaggregate by subgroup during analysis. For more information, see Stratified Random Sample.
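One way to picture how stratification and multi-stage sampling combine is the minimal Python sketch below. Within each stratum it first draws clusters, then draws units inside each selected cluster. The regions, villages, household IDs, and the function name `stratified_cluster_sample` are hypothetical, and clusters are drawn with equal probability, which is an assumption of this sketch rather than a recommendation from this page.

```python
import random
from collections import defaultdict


def stratified_cluster_sample(households, clusters_per_stratum, units_per_cluster, seed):
    """households: list of (stratum, cluster_id, household_id) tuples."""
    rng = random.Random(seed)

    # Group household IDs by stratum, then by cluster, in a fixed order
    by_stratum = defaultdict(lambda: defaultdict(list))
    for stratum, cluster, hh in sorted(households):
        by_stratum[stratum][cluster].append(hh)

    selected = []
    for stratum in sorted(by_stratum):
        clusters = sorted(by_stratum[stratum])
        # Stage 1: draw clusters within the stratum
        chosen = rng.sample(clusters, min(clusters_per_stratum, len(clusters)))
        for cluster in sorted(chosen):
            units = by_stratum[stratum][cluster]
            # Stage 2: draw households within each selected cluster
            drawn = rng.sample(units, min(units_per_cluster, len(units)))
            selected.extend((stratum, cluster, hh) for hh in drawn)
    return selected


# Hypothetical master dataset: 2 regions x 6 villages x 10 households
households = [
    (region, f"{region}-V{v}", f"{region}-V{v}-H{h:02d}")
    for region in ("North", "South")
    for v in range(1, 7)
    for h in range(1, 11)
]
sample = stratified_cluster_sample(households, clusters_per_stratum=3,
                                   units_per_cluster=4, seed=1234)
print(len(sample))
```

Drawing a fixed number of clusters per stratum guarantees that every stratum appears in the final sample, which is the representativeness property described above.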
If you do not have a master dataset and cannot do a listing, an alternative is to conduct spot randomization. One example of spot randomization is the “random walk” method in which enumerators spin a bottle to determine a random direction in which they walk. Without knowing the total number of households, this method will always be biased towards the households at the center of the village. In addition, it’s hard to monitor whether enumerators adhere to spot protocols in the field. Further, there isn’t a systematic way of tracing when replacements were used and how they were established. Other examples of spot randomization include flipping a coin, computerized randomization, or cell-phone based randomization. For purposes of replicability and unbiasedness, using a master dataset for sampling is always preferable to spot randomization.
Implement in Code
For more detailed instructions on commands for sample size calculations, see Power Calculations in Stata and, as a complement, Power Calculations in Optimal Design. Always document sampling processes in a do-file. For detailed instructions on sampling commands, see Multi-stage (Cluster) Sampling and Stratified Random Sample.
Note that any code that performs randomization needs three elements to be reproducible: version, seed, and sort. Randomizing in Stata is always preferable since it is more easily reproducible, although randomizing in Excel is also an option.
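The role of the seed and the sort can be illustrated with a minimal sketch. This page recommends Stata, where the corresponding commands are `version`, `set seed`, and `sort`; the Python below is only an analogy, and the master dataset, the seed value, and the function name `draw_sample` are hypothetical. The sketch assigns each unit a random number and keeps the units with the smallest draws, after sorting on a unique ID so that the result does not depend on the order in which rows happen to be stored.

```python
import random
import sys


def draw_sample(master, n, seed):
    """master: list of dicts, each with a unique 'id' key."""
    ids = [row["id"] for row in master]
    if len(set(ids)) != len(ids):
        raise ValueError("IDs must be unique before sampling")

    rng = random.Random(seed)                           # seed: fixes the random sequence
    ordered = sorted(master, key=lambda r: r["id"])     # sort: fixes the row order

    # Give each unit a random draw, then keep the n units with the smallest draws
    keyed = [(rng.random(), row["id"]) for row in ordered]
    keyed.sort()
    return [unit_id for _, unit_id in keyed[:n]]


# Hypothetical master dataset, plus a copy of it with the rows in a different order
master = [{"id": i} for i in range(1, 51)]
shuffled = master[:]
random.Random(0).shuffle(shuffled)

first = draw_sample(master, 10, seed=8675309)
second = draw_sample(shuffled, 10, seed=8675309)
print(first == second)

# Recording the interpreter version plays the role that `version` plays in Stata
print(sys.version_info[:2])
```

Without the sort, the same seed applied to differently ordered rows would generally select different units, which is why all three elements are needed together.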
- Andrew Gelman (Columbia University), Sample size and power calculations
- CEGA (University of California-Berkeley), Sampling and Statistical Power
- DIME Analytics (World Bank), Sampling: Track 1 and Track 2
- J-PAL, Six Rules of Thumb for Determining Sample Size and Statistical Power
- Sylvain Chabé-Ferret, What is Sampling Noise?
- United Nations Department of Economic and Social Affairs (UNDESA), Designing Household Survey Samples: Practical Guidelines