Sampling

Sampling is the process of randomly selecting units from a population of interest to represent the characteristics of that population. Sampling in a statistically valid, representative manner is a crucial step in conducting high-quality randomized controlled trials. The sampling process consists of two parts, sample design and sample implementation, both of which should occur early in the evaluation design process to facilitate data collection planning. Sampling affects a research project’s budget, timeline, accuracy, and precision. This page provides guidelines for and approaches to sampling.

Read First

  • Always sample from a master dataset. If no master dataset exists for the unit of observation at which you want to sample, begin by creating the master dataset.
  • Sampling code requires extra care! Errors cannot be corrected after the intervention or survey has started. Always ask a second person to double-check your code before you use the sample it generated in the field. For DIME projects, always consult DIME Analytics before sending a sample to the field.
  • While simple random sampling works well for small populations, impact evaluations more typically rely on multi-stage (cluster) sampling, often with stratification.
  • For information on sample size and power calculations, see Sample Size and Power Calculations; for information on implementing power calculations and selecting samples, see Power Calculations in Stata.

How to Sample

Identify Population of Interest

Before drawing a sample, you must identify the population of interest. Clearly define the region and characteristics of the population: these details will indicate who the sample must represent.

Establish the Sampling Frame and Master Dataset

Once you’ve defined the population of interest, establish the sampling frame and master dataset. The master dataset is the most comprehensive listing of the fixed characteristics of the observations in the population of interest. Ideally, it should contain every observation from the population of interest. If you do not have a master dataset for the unit of observation at which you are sampling (e.g. households, villages, clinics, schools), you should always start by creating one. In the field, this is done through a listing exercise at the lowest level of clustering possible.
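Before sampling from a master dataset, it helps to confirm that every observation has exactly one non-missing ID. The Python sketch below is a minimal illustration of that check, not a DIME tool: the function name check_master_dataset, the household_id field, and the example rows are all hypothetical.

```python
from collections import Counter


def check_master_dataset(rows, id_key="household_id"):
    """Return a list of problems that make `rows` unsafe to sample from."""
    problems = []
    ids = [row.get(id_key) for row in rows]

    # Every observation needs an ID, otherwise it cannot be tracked after sampling.
    n_missing = sum(1 for uid in ids if uid is None or uid == "")
    if n_missing:
        problems.append(f"{n_missing} row(s) with a missing ID")

    # Each ID must appear exactly once; a duplicate gives a unit extra chances of selection.
    counts = Counter(uid for uid in ids if uid not in (None, ""))
    duplicates = sorted(uid for uid, n in counts.items() if n > 1)
    if duplicates:
        problems.append(f"duplicate IDs: {duplicates}")

    return problems


# Hypothetical master dataset with one row per household.
master = [
    {"household_id": "H001", "village": "A"},
    {"household_id": "H002", "village": "A"},
    {"household_id": "H003", "village": "B"},
]
```

An empty list from check_master_dataset(master) means the ID column passed both checks. It does not show that the listing covers the whole population of interest.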

Choose a Sampling Approach

The most basic sampling technique is a Simple Random Sample. This works well for studies of small populations, with a complete sampling frame for the population. More typically, impact evaluations rely on multi-stage (cluster) sampling, often with stratification.
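As a rough illustration of a simple random sample, the Python sketch below draws n units without replacement from a complete list of IDs. The function name, the household IDs, and the seed value are assumptions made for this example. It uses only Python's standard random module.

```python
import random


def simple_random_sample(unit_ids, n, seed):
    """Draw n units without replacement, each with equal selection probability."""
    # Sort first so the draw does not depend on the order the frame arrived in.
    ordered = sorted(unit_ids)
    rng = random.Random(seed)  # a dedicated generator with a fixed, recorded seed
    return sorted(rng.sample(ordered, n))


# Hypothetical sampling frame of 50 households.
household_ids = [f"H{i:03d}" for i in range(1, 51)]
sample = simple_random_sample(household_ids, n=10, seed=20240117)
```

Running the function again with the same frame and seed returns the same ten households, even if the frame is passed in a different order.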

Multi-stage (cluster) sampling is a common sampling design in which the unit of randomization differs from the unit of observation. In other words, the unit at which the treatment is assigned (e.g. community, school) is different from the unit at which surveys are administered (e.g. household, student).
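A two-stage version of this design can be sketched as follows: first select clusters, then select units within each selected cluster. This Python example assumes equal selection probabilities at both stages and a frame stored as a mapping from cluster ID to unit IDs. The function name, village and household IDs, and seed are hypothetical, and real designs may use different selection probabilities.

```python
import random


def two_stage_sample(frame, n_clusters, n_per_cluster, seed):
    """frame maps each cluster ID to the list of unit IDs inside that cluster."""
    rng = random.Random(seed)

    # Stage 1: select clusters (e.g. villages) from the sorted list of cluster IDs.
    chosen_clusters = sorted(rng.sample(sorted(frame), n_clusters))

    # Stage 2: within each selected cluster, select units (e.g. households).
    sample = {}
    for cluster in chosen_clusters:
        units = sorted(frame[cluster])
        k = min(n_per_cluster, len(units))  # small clusters contribute all their units
        sample[cluster] = sorted(rng.sample(units, k))
    return sample


# Hypothetical frame: 12 villages with 20 listed households each.
frame = {
    f"V{v:02d}": [f"V{v:02d}-H{h:02d}" for h in range(1, 21)]
    for v in range(1, 13)
}
two_stage = two_stage_sample(frame, n_clusters=4, n_per_cluster=5, seed=20240117)
```

The result holds four villages with five households each. Households in villages not selected at the first stage are never eligible at the second.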

Stratification is a sampling design that divides the target population into subgroups before randomization, ensuring that each subgroup of the population is represented in the final sample and treatment groups. In addition to ensuring representativeness, stratification allows researchers to disaggregate results by subgroup during analysis.
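One way to picture stratification is to split the frame by the stratifying variable and draw separately within each subgroup. The Python sketch below samples the same fraction from every stratum, which is only one possible allocation rule. The function name, the id and region fields, the example units, and the seed are assumptions for illustration.

```python
import random
from collections import defaultdict


def stratified_sample(units, stratum_key, fraction, seed):
    """Sample the same fraction of units from every stratum (at least one each)."""
    rng = random.Random(seed)

    # Group unit IDs by stratum, in a fixed order so the draw is reproducible.
    strata = defaultdict(list)
    for unit in sorted(units, key=lambda u: u["id"]):
        strata[unit[stratum_key]].append(unit["id"])

    sample = {}
    for stratum in sorted(strata):
        ids = strata[stratum]
        k = max(1, round(fraction * len(ids)))  # guarantees every stratum is represented
        sample[stratum] = sorted(rng.sample(ids, k))
    return sample


# Hypothetical frame: 6 households in the north region and 4 in the south.
units = (
    [{"id": f"N{i}", "region": "north"} for i in range(1, 7)]
    + [{"id": f"S{i}", "region": "south"} for i in range(1, 5)]
)
by_region = stratified_sample(units, stratum_key="region", fraction=0.5, seed=20240117)
```

Because the draw happens inside each stratum, both regions appear in the sample by construction, which a simple random sample of the same size would not guarantee.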

Spot Randomization

If you do not have a master dataset and cannot do a listing, an alternative is to conduct spot randomization. One example of spot randomization is the “random walk” method, in which enumerators spin a bottle to determine a random direction in which they walk. Without knowing the total number of households, this method will always be biased towards the households at the center of the village. In addition, it is hard to monitor whether enumerators adhere to spot protocols in the field. Further, there is no systematic way of tracing when replacements were used and how they were selected. Other examples of spot randomization include flipping a coin, computerized randomization, or cell-phone based randomization. For purposes of replicability and unbiasedness, using a master dataset for sampling is always preferable to spot randomization.

Implement in Code

For more detailed instructions on commands for sample size calculations, see Power Calculations in Stata and, as a complement, Power Calculations in Optimal Design. Always document sampling processes in a do-file. For detailed instructions on sampling commands, see Multi-stage (Cluster) Sampling and Stratified Random Sample.

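Note that any code that performs randomization needs version, seed, and sort to be reproducible: the version fixes the software's random-number behavior, the seed fixes the sequence of random numbers, and the sort fixes the order in which observations receive them. Randomizing in Stata is preferable because it is more easily reproducible, although randomizing in Excel is also an option.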
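The role of the sort can be shown with a small Python sketch. It is an analogue of the idea rather than the Stata workflow this page recommends: each unit receives a random number, and units are ranked by it. The function name, the stable_sort switch, the run_log dictionary, the IDs, and the seed are all hypothetical.

```python
import random
import sys


def assign_random_order(unit_ids, seed, stable_sort=True):
    """Give each unit a random number and return the IDs ranked by that number."""
    # Sorting on a unique ID fixes which unit receives which random number.
    ids = sorted(unit_ids) if stable_sort else list(unit_ids)
    rng = random.Random(seed)
    draws = {uid: rng.random() for uid in ids}
    return sorted(ids, key=lambda uid: draws[uid])


# Record the software version and seed alongside the sample, so the draw can be rerun.
run_log = {"python_version": sys.version.split()[0], "seed": 20240117}

ids = ["H001", "H002", "H003", "H004", "H005"]
shuffled_input = list(reversed(ids))  # same units, delivered in a different order

with_sort_a = assign_random_order(ids, run_log["seed"])
with_sort_b = assign_random_order(shuffled_input, run_log["seed"])
no_sort_a = assign_random_order(ids, run_log["seed"], stable_sort=False)
no_sort_b = assign_random_order(shuffled_input, run_log["seed"], stable_sort=False)
```

With the sort, the two runs agree even though the input order changed. Without it, the same seed produces a different ranking, so fixing the seed alone does not make the sample reproducible.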

Additional Resources