Difference between revisions of "Sampling"

(8 intermediate revisions by 2 users not shown)
Line 1: Line 1:
Sampling is the process of randomly selecting units from a population of interest to represent the characteristics of that population. Sampling in a statistically valid, representative manner is a crucial step in conducting high quality [[Randomized Control Trials | randomized control trials]]. The sampling process consists of two parts: sample design and sample implementation, both of which should occur early in the evaluation design process in order to facilitate [[Preparing for Field Data Collection | data collection planning]]. Sampling affects a research project’s [[Survey Budget|budget]], timeline, accuracy, and precision. This page provides guidelines for and approaches to sampling.
Sampling is the process of [[Randomization|randomly]] selecting units from a population of interest to represent the characteristics of that population. Sampling in a statistically valid, representative manner is a crucial step in conducting high quality [[Randomized Control Trials | randomized control trials]]. The sampling process consists of two parts: sample design and sample implementation, both of which should occur early in the evaluation design process in order to facilitate [[Preparing for Field Data Collection | data collection planning]]. Sampling affects a research project’s [[Survey Budget|budget]], timeline, accuracy, and precision. This page provides guidelines for and approaches to sampling.


==Read First==
==Read First==
* Always sample from a [[Master_Data_Set|master dataset]]. If no master dataset exists for the [[Unit_of_Observation|unit of observation]] at which you want to sample, begin by creating the master dataset.  
* Always sample from a [[Master_Data_Set|master dataset]]. If no '''master dataset''' exists for the [[Unit_of_Observation|unit of observation]] at which you want to sample, begin by creating the '''master dataset'''.  
* Sampling code requires extra care! Errors cannot be corrected after the intervention – or survey -- has started. Always ask a second person to double-check your code before you use the sampling it generated in the field. For DIME projects, always consult DIME Analytics before sending a sample to the field.  
* Sampling code requires extra care! Errors cannot be corrected after the intervention – or [[Survey Pilot|survey]] -- has started. Always ask a second person to double-check your code before you use the sampling it generated in the field. For DIME projects, always consult DIME Analytics before sending a sample to the field.  
*While simple random sample works well for small populations, impact evaluations more typically rely on [[Multi-stage (Cluster) Sampling | multi-stage (cluster) sampling]], often with [[Stratified Random Sample|stratification]].
*While simple random sampling works well for small populations, impact evaluations more typically rely on [[Multi-stage (Cluster) Sampling | multi-stage (cluster) sampling]], often with [[Stratified Random Sample|stratification]].
*For information on sample size and power calculations, see [[Sample Size and Power Calculations]]; for information on implementing power calculations and selecting samples, see [[Power Calculations in Stata]].  
*For information on sample size and power calculations, see [[Sample Size and Power Calculations]]; for information on implementing power calculations and selecting samples, see [[Power Calculations in Stata]].


==How to Sample==
==How to Sample==
Line 15: Line 15:


Once you’ve defined the population of interest, establish the sampling frame and   
Once you’ve defined the population of interest, establish the sampling frame and   
[[Master_Data_Set|master dataset]]. This is the most comprehensive listing of the fixed characteristics of the observations in the population of interest. Ideally the master dataset should contain every observation from the population of interest. If you do not have a master dataset for the [[Unit_of_Observation|unit of observation]] from which you are sampling (i.e. households, villages, clinics, schools), you should always start by creating one. In the field, this is done by a listing at the lowest level of clustering possible.  
[[Master_Data_Set|master dataset]]. This is the most comprehensive listing of the fixed characteristics of the observations in the population of interest. Ideally the '''master dataset''' should contain every observation from the population of interest. If you do not have a '''master dataset''' for the [[Unit_of_Observation|unit of observation]] from which you are sampling (i.e. households, villages, clinics, schools), you should always start by creating one. In the field, this is done by a [[Listing|listing]] at the lowest level of clustering possible.


===Choose a Sampling Approach===
===Choose a Sampling Approach===
Line 21: Line 21:
The most basic sampling technique is a Simple Random Sample. This works well for studies of small populations, with a complete sampling frame for the population. More typically, impact evaluations rely on [[Multi-stage (Cluster) Sampling | multi-stage (cluster) sampling]], often with [[Stratified Random Sample|stratification]].
The most basic sampling technique is a Simple Random Sample. This works well for studies of small populations, with a complete sampling frame for the population. More typically, impact evaluations rely on [[Multi-stage (Cluster) Sampling | multi-stage (cluster) sampling]], often with [[Stratified Random Sample|stratification]].


Multi-stage (cluster) sampling is a common sampling design in which the unit of [[Randomization in Stata | randomization]] differs from the [[Unit of Observation|unit of observation]]. In other words, the unit at which the treatment is assigned (i.e. community, school) is different than the unit at which surveys are administered (i.e. household, student). For more information, see [[Multi-stage (Cluster) Sampling]].
'''Multi-stage (cluster) sampling''' is a common sampling design in which the unit of [[Randomization|randomization]] differs from the [[Unit of Observation|unit of observation]]. In other words, the unit at which the treatment is assigned (i.e. community, school) is different than the unit at which [[Survey Pilot|surveys]] are administered (i.e. household, student).  
Stratification is a sampling design that divides the target population into subgroups before randomization, ensuring that sub-groups of the population are represented in the final sample and treatment groups. In addition to ensuring representativeness, stratification allows researchers to disaggregate by subgroup during analysis. For more information, see [[Stratified Random Sample]].
====Spot Randomization====
If you do not have a master dataset and cannot do a listing, an alternative is to conduct spot randomization. One example of spot randomization is the “random walk” method in which enumerators spin a bottle to determine a random direction in which they walk. Without knowing the total number of households, this method will always be biased towards the households at the center of the village. In addition, it’s hard to monitor whether enumerators adhere to spot protocols in the field. Further, there isn’t a systematic way of tracing when replacements were used and how they were established. Other examples of spot randomization include flipping a coin, computerized randomization, or cell-phone based randomization. For purposes of [[Reproducible Research | replicability]] and unbiasedness, using a master dataset for sampling is always preferable to spot randomization.  


===Implement in Code ===
'''Stratification''' is a sampling design that divides the target population into subgroups before '''randomization''', ensuring that sub-groups of the population are represented in the final sample and treatment groups. In addition to ensuring representativeness, '''stratification''' allows researchers to disaggregate by subgroup during analysis.


For more detailed instructions on commands for sample size calculations, see [[Power Calculations in Stata]] and, as a compliment, [[Power Calculations in Optimal Design]]. Always document sampling processes in a do file. For detailed instructions on sampling commands, see [[Multi-stage (Cluster) Sampling]] and [[Stratified Random Sample]].  
===Spot Randomization===
If you do not have a [[Master Dataset|master dataset]] and cannot do a [[Listing|listing]], an alternative is to conduct spot randomization. One example of spot randomization is the “random walk” method in which [[Enumerators|enumerators]] spin a bottle to determine a random direction in which they walk. Without knowing the total number of households, this method will always be biased towards the households at the center of the village. In addition, it’s hard to monitor whether '''enumerators''' adhere to spot protocols in the field. Further, there isn’t a systematic way of tracing when replacements were used and how they were established. Other examples of spot randomization include flipping a coin, computerized '''randomization''', or cell-phone based '''randomization'''. For purposes of [[Reproducible Research | replicability]] and unbiasedness, using a '''master dataset''' for sampling is always preferable to spot randomization.


Note that any code that performs randomization needs version, seed and sort to be [[Reproducible Research | reproducible]]. [[Randomization in Stata|Randomizing in Stata]] is always preferable since it is more easily reproducible, [[Randomization in Excel|randomizing in Excel]] is also an option.
==Implement in Code==


== Back to Parent ==
For more detailed instructions on commands for sample size calculations, see [[Power Calculations in Stata]] and, as a compliment, [[Power Calculations in Optimal Design]]. Always document sampling processes in a '''do file'''. For detailed instructions on sampling commands, see [[Multi-stage (Cluster) Sampling]] and [[Stratified Random Sample]].
This article is part of the topic [[Sampling & Power Calculations]]
 
Note that any code that performs '''randomization''' needs version, seed, and sort to be [[Reproducible Research | reproducible]]. [[Randomization in Stata|Randomizing in Stata]] is always preferable since it is more easily reproducible. [[Randomization in Excel|randomizing in Excel]] is also an option.
 
== Related Pages ==
[[Special:WhatLinksHere/Sampling|Click here for pages that link to this topic.]]


== Additional Resources ==
== Additional Resources ==
*JPAL and CEGA’s [http://cega.berkeley.edu/assets/cega_learning_materials/81/Methods_Manual_JPAL_110603.pdf Overview of Methodology for Randomized Evaluations] includes information on different sampling methods.
* Andrew Gelman (Columbia University), [http://www.stat.columbia.edu/~gelman/stuff_for_blog/chap20.pdf Sample size and power calculations]
*JPAL’s [https://www.povertyactionlab.org/sites/default/files/resources/L5_Sampling%20and%20Sample%20Size_0.pdf slides] explain sampling and sample size in detail.
* CEGA (University of California-Berkeley), [http://cega.berkeley.edu/assets/miscellaneous_files/Sampling_and_Statistical_Power.pdf Sampling and Statistical Power]
*DIME Analytics guidelines on survey sampling [https://github.com/worldbank/DIME-Resources/blob/master/survey-sampling-1.pdf 1] and [https://github.com/worldbank/DIME-Resources/blob/master/survey-sampling-2.pdf 2]
* DIME Analytics (World Bank), [https://osf.io/z4p8x/ Sampling: Track 1] and [https://osf.io/n6g8s/ Track 2]
*The United Nation’s [http://unstats.un.org/unsd/demographic/sources/surveys/Series_F98en.pdf Designing Household Survey Samples: Practical Guidelines]  
* JPAL, [https://www.povertyactionlab.org/sites/default/files/research-resources/2018.03.21-Rules-of-Thumb-for-Sample-Size-and-Power_0.pdf Six Rules of Thumb for Determining Sample Size and Statistical Power]
*Gelman’s [http://www.stat.columbia.edu/~gelman/stuff_for_blog/chap20.pdf Sample Size and Power Calculations]
* Sylvain Chabé-Ferret, [https://economistjourney.blogspot.com/2018/06/what-is-sampling-noise.html What is Sampling Noise?]
* United Nations Department of Economic and Social Affairs (UNDESA), [http://unstats.un.org/unsd/demographic/sources/surveys/Series_F98en.pdf Designing Household Survey Samples: Practical Guidelines]  
[[Category: Research Design]]
[[Category: Sampling & Power Calculations ]]
[[Category: Sampling & Power Calculations ]]

Latest revision as of 15:38, 7 August 2023

Sampling is the process of randomly selecting units from a population of interest to represent the characteristics of that population. Sampling in a statistically valid, representative manner is a crucial step in conducting high-quality randomized control trials. The sampling process consists of two parts, sample design and sample implementation, both of which should occur early in the evaluation design process to facilitate data collection planning. Sampling affects a research project’s budget, timeline, accuracy, and precision. This page provides guidelines for and approaches to sampling.

Read First

  • Always sample from a master dataset. If no master dataset exists for the unit of observation at which you want to sample, begin by creating the master dataset.
  • Sampling code requires extra care! Errors cannot be corrected after the intervention or survey has started. Always ask a second person to double-check your code before you use the sample it generated in the field. For DIME projects, always consult DIME Analytics before sending a sample to the field.
  • While simple random sampling works well for small populations, impact evaluations more typically rely on multi-stage (cluster) sampling, often with stratification.
  • For information on sample size and power calculations, see Sample Size and Power Calculations; for information on implementing power calculations and selecting samples, see Power Calculations in Stata.

How to Sample

Identify Population of Interest

Before drawing a sample, you must identify the population of interest. Clearly define the region and characteristics of the population: these details will indicate who the sample must represent.

Establish the Sampling Frame and Master Dataset

Once you’ve defined the population of interest, establish the sampling frame and master dataset. The master dataset is the most comprehensive listing of the fixed characteristics of the observations in the population of interest. Ideally, the master dataset should contain every observation from the population of interest. If you do not have a master dataset for the unit of observation from which you are sampling (e.g. households, villages, clinics, schools), you should always start by creating one. In the field, this is done by a listing at the lowest level of clustering possible.
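
As a minimal sketch of what a usable master dataset needs before any sample is drawn, the check below verifies that every listed unit has a non-missing, unique ID. It is written in Python only to keep the example self-contained; the field name hh_id, the ID format, and the function name are hypothetical and not part of this page's guidance.

```python
from collections import Counter


def check_master_dataset(rows, id_key="hh_id"):
    """Verify that every row has a non-missing, unique ID and return the sorted IDs.

    `rows` is a list of dicts, one per listed unit (hypothetical layout).
    """
    ids = [row.get(id_key) for row in rows]
    if any(unit_id is None or unit_id == "" for unit_id in ids):
        raise ValueError("master dataset contains rows with a missing ID")
    duplicates = sorted(unit_id for unit_id, count in Counter(ids).items() if count > 1)
    if duplicates:
        raise ValueError(f"master dataset contains duplicate IDs: {duplicates}")
    return sorted(ids)


# Illustrative household listing from two villages (made-up values).
listing = [
    {"hh_id": "V01-003", "village": "V01"},
    {"hh_id": "V02-001", "village": "V02"},
    {"hh_id": "V01-001", "village": "V01"},
]
frame_ids = check_master_dataset(listing)
```

Running a check like this before sampling means that a duplicated or missing ID is caught while it can still be fixed, rather than after the sample has gone to the field.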

Choose a Sampling Approach

The most basic sampling technique is a simple random sample, in which every unit in the sampling frame has the same probability of selection. This works well for studies of small populations with a complete sampling frame. More typically, impact evaluations rely on multi-stage (cluster) sampling, often with stratification.
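
A simple random sample can be sketched in a few lines. The example below is an illustration in Python rather than the Stata workflow this page recommends; the household IDs, the sample size, and the seed value are all made up for the example.

```python
import random


def simple_random_sample(unit_ids, n, seed):
    """Draw n units without replacement, each with equal probability."""
    ordered = sorted(unit_ids)   # fix the order so the draw does not depend on input order
    rng = random.Random(seed)    # private generator with a fixed seed
    return sorted(rng.sample(ordered, n))


# Hypothetical sampling frame of 200 households.
households = [f"HH{i:03d}" for i in range(1, 201)]
selected = simple_random_sample(households, n=20, seed=481516)
```

Because the frame is sorted and the seed is fixed, rerunning the function on the same frame returns the same 20 households, even if the input list arrives in a different order.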

Multi-stage (cluster) sampling is a common sampling design in which the unit of randomization differs from the unit of observation. In other words, the unit at which the treatment is assigned (e.g. community, school) is different from the unit at which surveys are administered (e.g. household, student).
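
One way to picture a two-stage design is to draw clusters first and then draw units within each selected cluster. The sketch below assumes equal-probability selection at both stages, which is a simplification; real designs often select clusters with probability proportional to size. The village and household IDs, the function name, and the seed are hypothetical.

```python
import random


def two_stage_sample(frame, n_clusters, n_per_cluster, seed):
    """Select clusters, then units within each selected cluster.

    `frame` maps a cluster ID to the list of unit IDs listed in that cluster.
    """
    rng = random.Random(seed)
    # Stage 1: draw clusters from the sorted list of cluster IDs.
    clusters = sorted(rng.sample(sorted(frame), n_clusters))
    sample = {}
    for cluster in clusters:
        # Stage 2: draw units within the cluster, capped at the cluster size.
        units = sorted(frame[cluster])
        k = min(n_per_cluster, len(units))
        sample[cluster] = sorted(rng.sample(units, k))
    return sample


# Hypothetical frame: 10 villages with 30 listed households each.
frame = {
    f"V{v:02d}": [f"V{v:02d}-HH{h:02d}" for h in range(1, 31)]
    for v in range(1, 11)
}
sample = two_stage_sample(frame, n_clusters=4, n_per_cluster=5, seed=20230807)
```

Here 4 of the 10 villages are selected, and 5 households are surveyed in each, so the household listing is only needed in the selected villages.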

Stratification is a sampling design that divides the target population into subgroups before randomization, ensuring that each subgroup is represented in the final sample and in the treatment groups. In addition to ensuring representativeness, stratification allows researchers to disaggregate by subgroup during analysis.
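
A stratified draw can be sketched as a separate random draw inside each subgroup. The example below assumes proportional allocation (the same sampling fraction in every stratum) and at least one unit per stratum; both are choices made for illustration, as are the rural/urban strata, the IDs, and the seed.

```python
import random
from collections import defaultdict


def stratified_sample(units, stratum_of, fraction, seed):
    """Draw the same fraction of units from every stratum.

    `stratum_of` maps each unit ID to its stratum label.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    rng = random.Random(seed)
    strata = defaultdict(list)
    for unit in sorted(units):              # sorted so the draw is order-independent
        strata[stratum_of[unit]].append(unit)
    sample = []
    for stratum in sorted(strata):          # fixed order of strata
        members = strata[stratum]
        k = max(1, round(fraction * len(members)))  # at least one unit per stratum
        sample.extend(rng.sample(members, k))
    return sorted(sample)


# Hypothetical frame: 10 rural and 4 urban households.
units = [f"HH{i:03d}" for i in range(1, 15)]
stratum_of = {u: ("rural" if i < 10 else "urban") for i, u in enumerate(units)}
chosen = stratified_sample(units, stratum_of, fraction=0.5, seed=314159)
```

With a sampling fraction of 0.5, this draws 5 rural and 2 urban households, so the smaller urban group cannot be left out of the sample by chance.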

Spot Randomization

If you do not have a master dataset and cannot do a listing, an alternative is to conduct spot randomization. One example of spot randomization is the “random walk” method, in which enumerators spin a bottle to determine a random direction in which they walk. Because the total number of households is unknown, this method will always be biased towards the households at the center of the village. In addition, it’s hard to monitor whether enumerators adhere to spot protocols in the field. Further, there isn’t a systematic way of tracing when replacements were used and how they were established. Other examples of spot randomization include flipping a coin, computerized randomization, or cell-phone-based randomization. For purposes of replicability and unbiasedness, using a master dataset for sampling is always preferable to spot randomization.

Implement in Code

For more detailed instructions on commands for sample size calculations, see Power Calculations in Stata and, as a complement, Power Calculations in Optimal Design. Always document sampling processes in a do file. For detailed instructions on sampling commands, see Multi-stage (Cluster) Sampling and Stratified Random Sample.

Note that any code that performs randomization needs a fixed version, seed, and sort order to be reproducible. Randomizing in Stata is always preferable since it is more easily reproducible. Randomizing in Excel is also an option.
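
The three requirements can be illustrated outside Stata as well. The sketch below is a Python analogue, not the Stata workflow this page recommends: it records the interpreter version, fixes the seed, sorts the frame on its unique ID before drawing, and stores a hash of the frame so that a later rerun can confirm it used the same input. The function name, log fields, IDs, and seed are assumptions made for the example.

```python
import hashlib
import platform
import random


def reproducible_draw(unit_ids, n, seed):
    """Draw n units and return the selection together with a reproducibility log."""
    ordered = sorted(unit_ids)  # sort: fixed order on the unique ID
    frame_hash = hashlib.sha256("\n".join(ordered).encode("utf-8")).hexdigest()
    selected = sorted(random.Random(seed).sample(ordered, n))  # seed: fixed
    log = {
        "python_version": platform.python_version(),  # version: recorded for reruns
        "seed": seed,
        "frame_size": len(ordered),
        "frame_sha256": frame_hash,
        "n_selected": len(selected),
    }
    return selected, log


# Hypothetical frame of 50 households.
ids = [f"HH{i:03d}" for i in range(1, 51)]
first_draw, first_log = reproducible_draw(ids, n=10, seed=987654)
second_draw, second_log = reproducible_draw(list(reversed(ids)), n=10, seed=987654)
```

The two draws are identical even though the second call receives the frame in reverse order, because the frame is sorted before the draw. Saving the log next to the sample makes it possible to check later that the same version, seed, and frame were used.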

Additional Resources

  • Andrew Gelman (Columbia University), Sample size and power calculations (http://www.stat.columbia.edu/~gelman/stuff_for_blog/chap20.pdf)
  • CEGA (University of California-Berkeley), Sampling and Statistical Power (http://cega.berkeley.edu/assets/miscellaneous_files/Sampling_and_Statistical_Power.pdf)
  • DIME Analytics (World Bank), Sampling: Track 1 (https://osf.io/z4p8x/) and Track 2 (https://osf.io/n6g8s/)
  • JPAL, Six Rules of Thumb for Determining Sample Size and Statistical Power (https://www.povertyactionlab.org/sites/default/files/research-resources/2018.03.21-Rules-of-Thumb-for-Sample-Size-and-Power_0.pdf)
  • Sylvain Chabé-Ferret, What is Sampling Noise? (https://economistjourney.blogspot.com/2018/06/what-is-sampling-noise.html)
  • United Nations Department of Economic and Social Affairs (UNDESA), Designing Household Survey Samples: Practical Guidelines (http://unstats.un.org/unsd/demographic/sources/surveys/Series_F98en.pdf)