Difference between revisions of "Randomization in Stata"

Randomization is a critical step for ensuring [[Exogeneity Assumption | exogeneity]] in [[Experimental Methods | experimental methods]] and [[Randomized Control Trials | randomized control trials (RCTs)]]. Stata provides a [[Reproducible Research | replicable]], reliable, and [[Data Documentation | well-documented]] way to randomize treatment before beginning fieldwork. This page describes how and why to use Stata to randomize.  


==Read First==
* Common alternatives to using Stata for randomization include: (i) using the Excel <code>RAND()</code> function; (ii) randomizing directly within a chosen electronic survey platform such as SurveyCTO; or (iii) randomizing through a public lottery.
* Randomizing in Stata is preferred to [[Randomization in Excel | randomizing in Excel]] or [[Randomization in SurveyCTO | randomizing in survey software]] because it is transparent, reproducible, and gives the research team more time to run [[Balance tests | balance tests]] and double-check assignments.
* Make sure to set the version, set the seed, sort the data, and use unique IDs when randomizing in Stata.
* For information on how to draw a stratified random sample, see [[Stratified Random Sample]].
 
==Why Use Stata to Randomize?==  
Randomizing in Stata and subsequently preloading the generated data file into the [[Computer-Assisted Personal Interviews (CAPI) | survey software]] is generally preferred to [[Randomization in Excel | randomizing in Excel]] or [[Randomization in SurveyCTO | randomizing in survey software]]. The main advantages of randomizing in Stata are as follows:
* The process is transparent and reproducible.  
* The researcher has more control over the process and can check [[Balance tests | randomization balance]] and [[Stratified Random Sample | add stratification variables]] if needed.
* Unlike randomizing in the survey software, randomizing in Stata leaves time between randomization, implementation, and [[Primary Data Collection | data collection]], giving the research team the opportunity to double-check assignments and fix bugs before the software is used in the field.
 
==Implementation==
An example of a randomization do-file follows:
 
<nowiki>
    * Set the environment to make randomization replicable
    version 12.0                           // set the Stata version
    isid unique_id, sort                   // confirm the ID is unique and sort by it
    set seed 585506                        // set the random seed for replication (generated using https://bit.ly/stata-random)
   
    * Assign random numbers to the observations and rank them from the smallest to the largest
    gen random_number = runiform()         // generate a random number between 0 and 1
    egen ordering = rank(random_number)    // rank each observation from smallest to largest
   
    * Assign observations to control & treatment group based on their ranks
    gen group = .
    replace group = 1 if ordering <= _N/2  // assign treatment status to first half of sample
    replace group = 0 if ordering > _N/2   // assign control status to second half of sample
</nowiki>
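
After a do-file like the one above has run, a quick sanity check can confirm that every observation received an assignment and that the arms look similar at baseline. The sketch below is illustrative only: it rebuilds the assignment on simulated data, and <code>baseline_income</code> is a hypothetical covariate rather than a variable from any real dataset.

```stata
* Illustrative sketch on simulated data; all variable names are assumptions
clear
version 12.0
set obs 100
gen unique_id = _n                     // hypothetical unique ID
gen baseline_income = 1000 + 10 * _n   // hypothetical baseline covariate

* Same randomization steps as in the example above
isid unique_id, sort
set seed 585506
gen random_number = runiform()
egen ordering = rank(random_number)
gen group = (ordering <= _N/2)         // 1 = treatment, 0 = control

* Every observation should be assigned; with an even sample size the arms are equal
assert !missing(group)
count if group == 1
assert r(N) == _N/2

* Compare the baseline covariate across the two arms
ttest baseline_income, by(group)
```

In a real project the <code>ttest</code> line would typically be repeated, or replaced by a dedicated balance table, for each baseline variable of interest.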


==Guidelines for Replicable Randomization==
To randomize with replicability in Stata, follow these guidelines:
* Make sure your dataset includes a [[ID Variable Properties | unique ID]] (e.g., respondent ID or household number). If one doesn't exist yet, create one using the <code>generate</code> command. The ID must uniquely and fully identify every observation.
* While writing a do-file, pay close attention to the following things:
** Set version: this ensures that the randomization algorithm is the same on every run, since the algorithm sometimes changes between Stata versions. See <code>[[ieboilstart]]</code> for boilerplate code that standardizes the Stata version within do-files.
** [https://www.stata.com/manuals14/rsetseed.pdf Set seed]: this ensures that the same random number is generated for the first observation, for the second observation, and so on, every time the code is run.
** Sort the data by the unique ID: the data should be sorted such that observations are in the same order every time the code is run.
* Convert the random numbers into categorical variables for treatment or control status.
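
The guidelines above can be checked directly. The following sketch, run on simulated data, creates a hypothetical <code>unique_id</code> with <code>generate</code>, verifies it with <code>isid</code>, and then confirms that resetting the same seed on identically sorted data reproduces the same draws. The variable names and the sample size are assumptions made for illustration.

```stata
* Illustrative sketch on simulated data; names and sample size are assumptions
clear
version 12.0
set obs 100

* Create an ID if none exists, then confirm it uniquely identifies observations
gen unique_id = _n
isid unique_id, sort

* First set of draws
set seed 585506
gen draw_1 = runiform()

* Second set of draws: same sort order, same seed
isid unique_id, sort
set seed 585506
gen draw_2 = runiform()

* The two sets of draws should match observation by observation
assert draw_1 == draw_2
```

If a check like this fails on real data, an unstable sort order or a missing <code>set seed</code> line is a likely place to look first.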
 
After randomizing, export a CSV file that contains the ID variable used for randomization and the categorical variables created from the random numbers. You can preload this file into SurveyCTO so that, once an enumerator enters the respondent ID at the start of a survey, the assignment is loaded into the form and can be used in various sections of the survey.
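
A minimal sketch of the export step follows. It assumes the data in memory hold a hypothetical <code>unique_id</code> and an assignment variable <code>group</code>; the file name is also an assumption. Note that <code>export delimited</code> requires Stata 13 or later; in older versions, <code>outsheet</code> with the <code>comma</code> option plays the same role.

```stata
* Illustrative sketch; the placeholder assignment below is not a randomization
clear
set obs 6
gen unique_id = _n          // hypothetical unique ID
gen group = mod(_n, 2)      // placeholder standing in for the randomized assignment

* Write only the ID and the assignment to a CSV file (hypothetical file name)
export delimited unique_id group using "treatment_assignment.csv", replace
```

By default the first row of the resulting file holds the variable names, followed by one row per ID.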
 


== Back to Parent ==
This article is part of the topic [[Randomized Control Trials]].
 
==Additional Resources==
* DIME Analytics' presentations on randomization [https://github.com/worldbank/DIME-Resources/blob/master/stata1-5-randomization.pdf 1] and [https://github.com/worldbank/DIME-Resources/blob/master/stata2-5-randomization.pdf 2]
* Stata, [https://blog.stata.com/2016/03/10/how-to-generate-random-numbers-in-stata/ How to generate random numbers in Stata]
[[Category: Impact Evaluation Design]]
[[Category: Coding Practices]]
[[Category: Stata Coding Practices]]
[[Category: Research Design]]
[[Category: Reproducible Research]]

Latest revision as of 20:36, 20 July 2022
