Difference between revisions of "Randomization"

 
(4 intermediate revisions by the same user not shown)
Line 2: Line 2:


== Read First ==  
== Read First ==  
* Common alternatives to using Stata for randomization include: (i) Using the Excel <code>Rand</code> command; (ii) Randomizing directly within a chosen electronic survey platform such as SurveyCTO; or (iii) randomization through a public lottery.
* Common alternatives to using [[Randomization in Stata|Stata for randomization]] include:  
*Randomizing in Stata is preferred to [[Randomization in Excel | randomizing in Excel]] or [[Randomization in SurveyCTO | randomizing in survey software]] because it is transparent, reproducible, and gives the research more time to run [[Balance tests | balance tests]] and double check assignments.  
**Using the Excel <code>Rand</code> command;  
*Make sure to set the version, set the seed, sort the data, and use unique IDs when randomizing in Stata.   
**Randomizing directly within a chosen electronic [[Survey Pilot|survey]] platform such as SurveyCTO  
**randomization through a public lottery.
*'''Randomizing in Stata''' is preferred to [[Randomization in Excel | randomizing in Excel]] or [[Randomization in SurveyCTO | randomizing in SurveyCTO]] because it is transparent, [[Reproducible Research|reproducible]], and gives the research more time to run [[Balance tests | balance tests]] and double check assignments.  
*Make sure to set the version, set the seed, sort the data, and use unique IDs when '''randomizing in Stata'''.   
*For information how to draw a stratified random sample, see [[Stratified Random Sample]].
*For information how to draw a stratified random sample, see [[Stratified Random Sample]].


== Randomization in Stata ==
== Randomization in Stata ==


During surveys, we often need to randomize various aspects of the questionnaire. For example sometimes we need to randomize which household members to interview, and sometimes - which set of questions to ask. While most CAPI software have random number generators, it is not the preferred option. Using, for example, Stata to randomize and then preloading the generated data file into the survey software is in almost all cases the better option among the two. The main advantages of using Stata over CAPI software during randomization are as follows:
During [[Survey Pilot|surveys]], we often need to randomize various aspects of the [[Questionnaire Programming|questionnaire]]. For example, sometimes we need to randomize which household members to interview; other times, sometimes which set of questions to ask. While most [[Computer-Assisted Personal Interviews (CAPI)|CAPI]] software have random number generators, it is not the preferred option. Using, for example, [[Randomization in Stata|Stata to randomize]] and then preloading the generated data file into the '''survey''' software is in almost all cases the better option among the two. The main advantages of using '''Stata''' over '''CAPI''' software during randomization are as follows:


* [[Randomization in Stata|Randomization in Stata]] is transparent and reproducible which is important for publishing research.
* Transparent and [[Reproducible Research|reproducible]] which is important for publishing research.
* Randomization results in Stata can be dependent, so that we are guaranteed that no disproportional large share of the results falls into any group. Randomization is always independent in SurveyCTO which means that no groups could be assigned observations if the number of observation per groups is low.
* The randomization is dependent, so that we are guaranteed that no disproportionally large share of the results falls into any group. Randomization is always independent in SurveyCTO which means that no groups could be assigned observations if the number of observation per groups is low.
* Randomization in Stata provides the option of ensuring that the result of the randomization is balanced over other variables, i.e. stratas. This means that we can guarantee that, for example, not all female respondents end up in a certain group.  
* Provides the option of ensuring that the result of the randomization is balanced over other '''variables''', i.e. stratas. This means that we can guarantee that, for example, not all female respondents end up in a certain group.  
* Randomization in Stata is done before the survey takes place. This provides an opportunity to double check the result of a randomization and fix bugs and typos in the randomization code before it is used in the field, as it then would be too late to fix.
* Done before the '''survey''' takes place. This provides an opportunity to double check the result of a randomization and fix bugs and typos in the randomization code before it is used in the field, as it then would be too late to fix.


== Randomization in SurveyCTO ==  
== Randomization in SurveyCTO ==  
During surveys, you might often need to randomize various aspects of the questionnaire. While SurveyCTO has a random number generator, is is usually not recommended that you use it.
During [[Survey Pilot|surveys]], you might often need to randomize various aspects of the [[Questionnaire Programming|questionnaire]]. While SurveyCTO has a random number generator, is is usually not recommended that you use it. For more information, see [[Randomization in SurveyCTO]].


== Randomization for a Survey ==
== Randomization for a Survey ==


This is a basic example on how to do randomize in Stata and preload the results to SurveyCTO. See [[randomization in Stata]] more details for how to implement more advanced randomization, but the procedure for how to preload the result will still be the same as described below.
This is a basic example on how to do [[Randomization in Stata|randomize in Stata]] and preload the results in SurveyCTO. See the page linked above for more details for how to implement more advanced randomization, but the procedure for how to preload the result will still be the same as described below.


In the example below we have a survey in which we want a random 30% to answer to a long survey and the rest to take a shorter survey. We also have a gender variable and we want the ration of 30/70 to be as exact as possible within the both genders.
In the example below we have a [[Survey Pilot|survey]] in which we want a random 30% to answer to a long '''survey''' and the rest to take a shorter one. We also have a gender '''variable''' and we want the ratio of 30/70 to be as exact as possible within the both genders.


=== Step by step - known respondents ===
=== Step by step - known respondents ===


* Let's say we a data set with all the respondents and replacement respondents, and that we have the variables, ''unique_id'' and ''gender''. Make sure that all observations have [[ID_Variable_Properties|uniquely and fully identifying]] ID in ''unique_id''.   
* Let's say we have a '''dataset''' with all the respondents and replacement respondents, and that we have two '''variables''': ''unique_id'' and ''gender''. Make sure that all observations have [[ID_Variable_Properties|uniquely and fully identifying]] ID in ''unique_id''.   
** If you do not have ID for your respondents you should create it.  
** If you do not have ID for your respondents you should create it.  
** If you do not have a list of your respondents (for example, if you randomize the sample in the field by drawing lots) see the section of [[Randomization_in_SurveyCTO#Preload_Randomization_without_IDs| prelaoding randomization without IDs]].
** If you do not have a list of your respondents (for example, if you randomize the sample in the field by drawing lots) see the section of [[Randomization_in_SurveyCTO#Preload_Randomization_without_IDs| prelaoding randomization without IDs]].
* Set version, set seed and sort the data to guarantee a replicable sort. See [[Randomization_in_Stata#Steps_needed_for_replicability_when_randomizing_in_Stata| replicable randomization in Stata]] why all these three variables are needed.
* Set version, set seed, and sort the data to guarantee a replicable sort. See [[Randomization_in_Stata#Steps_needed_for_replicability_when_randomizing_in_Stata| replicable randomization in Stata]] for why all these three '''variables''' are needed.
* Generate a random number and create a dummy variable (''long_survey'' in this example) that indicates for each observation if that observation should answer the long or the short survey.
* Generate a random number and create a dummy '''variable''' (''long_survey'' in this example) that indicates for each observation if that observation should answer the long or the short '''survey'''.
** Note that it does not have to be a dummy, we could just as well have had randomized into three groups and saved the result into a categorical variable with the value 1, 2, 3.  
** Note that it does not have to be a dummy. We could just as well have had randomized into three groups and saved the result into a categorical '''variable''' with the value 1, 2, 3.  
*In the end we should have a data set with the ''unique_id'' and the categorical variables with the result of the randomization, in this example only ''long_survey''.
*In the end we should have a '''dataset''' with the ''unique_id'' and the categorical '''variables''' with the result of the randomization, in this example only ''long_survey''.
*In your SurveyCTO survey, have a question where the enumerator enters the ID for the respondent currently interviewed. For this to be possible the enumerators needs a list with both the name and the ID.  
*In your SurveyCTO '''survey''', have a question where the [[Enumerator Training|enumerator]] enters the ID for the respondent currently interviewed. For this to be possible, the '''enumerator''' needs a list with both the name and the ID.  
**It is very important that these lists are not publicly disclosed as that would allow anyone to identify observations even in data sets that have names and other identifying information removed. If the data collected is extra sensitive, consider using a different ID than the main ID for this purpose.
**It is very important that these lists are not publicly disclosed as that would allow anyone to identify observations even in '''datasets''' that have names and other identifying information [[De-identification|removed]]. If the [[Primary Data Collection|data collected]] is extra sensitive, consider using a different ID than the main ID for this purpose.
* Preload the data set you generated in Stata into your SurveyCTO survey using the ID entered by the enumerator.
* Preload the '''dataset''' you generated in '''Stata''' into your SurveyCTO '''survey''' using the ID entered by the '''enumerator'''.
* Restrict the relevant section of the survey in SurveyCTO using the value just preloaded.
* Restrict the relevant section of the '''survey''' in SurveyCTO using the value just preloaded.


=== Code example - known respondents ===
=== Code example - known respondents ===

Latest revision as of 15:54, 7 August 2023

Randomization is a critical step for ensuring exogeneity in experimental methods and randomized control trials (RCTs). Stata provides a replicable, reliable, and well-documented way to randomize treatment before beginning fieldwork.

Read First

Randomization in Stata

During surveys, we often need to randomize various aspects of the questionnaire. For example, sometimes we need to randomize which household members to interview; other times, which set of questions to ask. While most CAPI software has a random number generator, using it is not the preferred option. Randomizing in, for example, Stata and then preloading the generated data file into the survey software is in almost all cases the better of the two options. The main advantages of using Stata over CAPI software for randomization are as follows:

  • Transparent and reproducible, which is important for publishing research.
  • The randomization is dependent, so we are guaranteed that no disproportionately large share of the observations falls into any one group. Randomization is always independent in SurveyCTO, which means that a group could end up with no observations assigned to it if the number of observations per group is low.
  • Provides the option of ensuring that the result of the randomization is balanced over other variables, i.e. strata. This means that we can guarantee that, for example, not all female respondents end up in a certain group.
  • Done before the survey takes place. This provides an opportunity to double check the result of a randomization and fix bugs and typos in the randomization code before it is used in the field, as it then would be too late to fix.
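The difference between independent and dependent randomization described above can be sketched in Stata. This is a minimal illustration on made-up data, not the method of any particular project: the 10 respondents, the variable names indep_group and dep_group, and the 50/50 split are all assumptions chosen for the example.

```stata
* Toy data: 10 hypothetical respondents (illustration only)
clear
version 12.1
set obs 10
gen unique_id = _n
set seed 123456  // example seed, replace with your own
sort unique_id

* Independent assignment: each respondent gets a separate coin flip,
* so group sizes depend on the draws and one group could even be empty
gen indep_group = (runiform() < .5)

* Dependent assignment: rank a random number and split on the rank,
* so exactly half of the 10 respondents land in each group
gen rand = runiform()
egen rank = rank(rand)
gen dep_group = (rank <= _N/2)

tab indep_group
tab dep_group
```

With only 10 observations, the independent draw can easily give a lopsided split, while the rank-based assignment always gives 5 and 5.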

Randomization in SurveyCTO

During surveys, you might often need to randomize various aspects of the questionnaire. While SurveyCTO has a random number generator, it is usually not recommended that you use it. For more information, see Randomization in SurveyCTO.

Randomization for a Survey

This is a basic example of how to randomize in Stata and preload the results in SurveyCTO. See the page linked above for more details on how to implement more advanced randomization; the procedure for preloading the result will still be the same as described below.

In the example below we have a survey in which we want a random 30% of respondents to answer a long survey and the rest to take a shorter one. We also have a gender variable, and we want the 30/70 ratio to be as exact as possible within both genders.

Step by step - known respondents

  • Let's say we have a dataset with all the respondents and replacement respondents, and that we have two variables: unique_id and gender. Make sure that all observations have a uniquely and fully identifying ID in unique_id.
    • If you do not have IDs for your respondents, you should create them.
    • If you do not have a list of your respondents (for example, if you randomize the sample in the field by drawing lots), see the section on preloading randomization without IDs.
  • Set version, set seed, and sort the data to guarantee a replicable randomization. See replicable randomization in Stata for why all three of these steps are needed.
  • Generate a random number and create a dummy variable (long_survey in this example) that indicates for each observation whether that observation should answer the long or the short survey.
    • Note that it does not have to be a dummy. We could just as well have randomized into three groups and saved the result into a categorical variable with the values 1, 2, and 3.
  • In the end we should have a dataset with the unique_id and the categorical variables with the result of the randomization, in this example only long_survey.
  • In your SurveyCTO survey, have a question where the enumerator enters the ID for the respondent currently interviewed. For this to be possible, the enumerator needs a list with both the name and the ID.
    • It is very important that these lists are not publicly disclosed as that would allow anyone to identify observations even in datasets that have names and other identifying information removed. If the data collected is extra sensitive, consider using a different ID than the main ID for this purpose.
  • Preload the dataset you generated in Stata into your SurveyCTO survey using the ID entered by the enumerator.
  • Restrict the relevant section of the survey in SurveyCTO using the value just preloaded.
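The first and last Stata steps in the list above (checking the ID and saving the file to preload) could look roughly like the sketch below. The four data rows and the file name long_survey_preload.csv are made up for illustration. The sketch assumes the preload file is supplied as a CSV and that Stata 13 or later is available for export delimited. How the file is attached to the form in SurveyCTO is not shown here.

```stata
* Toy stand-in for the randomized data (hypothetical values)
clear
input unique_id gender long_survey
101 0 1
102 0 0
103 1 0
104 1 1
end

* Stop with an error if unique_id is missing or duplicated
isid unique_id

* Keep only the ID and the randomization result
keep unique_id long_survey

* Write the file to preload; the file name is an assumption
export delimited using "long_survey_preload.csv", replace
```

Running isid before exporting means a duplicated or missing ID is caught in Stata rather than in the field.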

Code example - known respondents

*Set version 
ieboilstart , version(12.1)
`r(version)'

*Open the data set after ieboilstart
use sample.dta

*Set seed
set seed 123456 //this is an example seed, replace this with another number

*Sort data set
sort unique_id

*Generate random number, rank that random number within each gender, and
* assign the long survey if the rank divided by the number of observations
* in that gender is less than or equal to 30%
gen rand = uniform()
bys gender : egen rank = rank(rand)
bys gender : gen long_survey = (rank/_N <= .3)
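
The note in the steps above about randomizing into three groups instead of a dummy could be sketched as a variant of this code. This is an illustration under assumptions: the toy data of 12 respondents (6 per gender) and the variable name survey_group are hypothetical, and the three groups are taken to be of equal size.

```stata
* Toy data: 12 hypothetical respondents, 6 per gender
clear
version 12.1
set obs 12
gen unique_id = _n
gen gender = mod(_n, 2)
set seed 123456  // example seed, replace with your own
sort unique_id

* Random number, ranked within each gender
gen rand = runiform()
bys gender : egen rank = rank(rand)

* Cut the within-gender rank into three equal parts: 1, 2 or 3
bys gender : gen survey_group = ceil(3 * rank / _N)

* Each gender should show 2 respondents in each of the 3 groups
tab gender survey_group
```

Tabulating the result against the stratification variable, as in the last line, is a quick way to double check the assignment before it is preloaded.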

Back to Parent

This article is part of the topic Randomized Control Trials.

Additional Resources