Difference between revisions of "Randomization in Stata"

Line 1: Line 1:
This page describes how and why to use Stata for randomization of control and treatment assignment in an RCT. Common alternatives to using Stata for randomization include: (i) Using the Excel command; (ii) Randomizing directly within a chosen electronic survey platform such as SurveyCTO; or (iii) randomization through a public lottery.     
This page describes how and why to use Stata for randomization of control and treatment assignment in an RCT. Common alternatives to using Stata for randomization include: (i) Using the Excel <code> XX </code>command; (ii) Randomizing directly within a chosen electronic survey platform such as SurveyCTO; or (iii) randomization through a public lottery.     


== Why use Stata to randomize ==  
Line 11: Line 11:
Here are a few steps that should be followed to create a reproducible randomization using Stata:


* Make sure your dataset includes a unique ID [respondent ID, household number, etc.]. If one doesn't exist yet, you can create one using the XX command.  
* Make sure your dataset includes a unique ID [respondent ID, household number, etc.]. If one doesn't exist yet, you can create one using the <code> XX </code> command.  
* While writing a do-file, pay close attention to the following things:
** Set version. Setting Stata's version in a do file ensures that the randomization algorithm is the same, since the algorithm sometimes changes between Stata versions. </br> For example - <code> version 12.0 </code>
** Set version. Setting Stata's version in a do file ensures that the randomization algorithm is the same, since the algorithm sometimes changes between Stata versions.  
** Set seed. This makes sure that the same random number is generated for the first observation, for the second observation, and so on, for every time the code is run. </br> For example - <code> set seed 12345 </code>
** Set seed. This makes sure that the same random number is generated for the first observation, for the second observation, and so on, for every time the code is run.  
** Properly sorting the data. The data should be sorted such that observations are in the same order every time the code is run. The most optimal situation is sorting using an ID variable which uniquely and fully identifies each observation.
** Sort the data by the unique ID. The data should be sorted such that observations are in the same order every time the code is run. The most optimal situation is sorting using an ID variable which uniquely and fully identifies each observation.
*Convert the random numbers into categorical variables or dummy variables. This helps you check if the data is balanced.
*Convert the random numbers into categorical variables for treatment or control status.


The end goal is to have a CSV format file containing the ID variable used for randomization and the categorical variables created from the random numbers generated. This dataset will be preloaded into SurveyCTO, so that after an enumerator enters the respondent ID at the start of a survey, the result of the randomization will be loaded for the form and can be used for various sections of the survey.


An example of the do-file is as follows:
<code>
version 12.0                            // set the Stata version so the random-number algorithm is fixed
sort unique_id                          // sort by the unique ID so observations are always in the same order
set seed 12345                          // set the random seed for replication
gen random_number = runiform()          // generate a random number between 0 and 1
egen ordering = rank(random_number)     // rank each observation from smallest to largest random number
gen group = .
replace group = 1 if ordering <= _N/2   // assign treatment status to the first half of the sample
replace group = 0 if ordering > _N/2    // assign control status to the second half of the sample
</code>
== Randomization with Stratification in Stata ==  



Revision as of 19:59, 7 February 2017

This page describes how and why to use Stata to randomize control and treatment assignment in an RCT. Common alternatives to Stata include: (i) using the Excel XX command; (ii) randomizing directly within an electronic survey platform such as SurveyCTO; or (iii) randomizing through a public lottery.

Why use Stata to randomize

Using Stata to randomize and then preloading the generated data file into the survey software is generally preferred to using Excel or randomizing within the electronic platform. The main advantages of using Stata for randomization are as follows:

  • The process is transparent and reproducible.
  • The researcher has more control over the process and can check randomization balance and add stratification variables if needed.
  • Randomization in Stata is done before the survey takes place (as opposed to randomization through the survey platform), which provides an opportunity to double-check the result of the randomization and fix bugs before using the software in the field.

Steps needed for replicability when randomizing in Stata

Here are a few steps that should be followed to create a reproducible randomization using Stata:

  • Make sure your dataset includes a unique ID [respondent ID, household number, etc.]. If one doesn't exist yet, you can create one using the XX command.
  • While writing a do-file, pay close attention to the following things:
    • Set version. Setting Stata's version in a do-file ensures that the same randomization algorithm is used on every run, since the algorithm sometimes changes between Stata versions.
    • Set seed. This makes sure that the same random number is generated for the first observation, for the second observation, and so on, every time the code is run.
    • Sort the data by the unique ID. The data should be sorted such that observations are in the same order every time the code is run. Ideally, sort on an ID variable that uniquely and fully identifies each observation.
  • Convert the random numbers into categorical variables for treatment or control status.
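
The effect of these steps can be checked with a small sketch. It assumes a simulated dataset and a hypothetical ID variable named unique_id; neither comes from a real project. The random numbers are drawn twice, with the data shuffled in between, and the two draws are compared.

```stata
* A minimal sketch on simulated data; unique_id, draw1, draw2 and shuffle
* are hypothetical names used only for illustration.
version 12.0
clear
set obs 10
gen unique_id = 1000 + _n

* Stop with an error if unique_id does not uniquely identify observations
isid unique_id

* First draw: sort by the unique ID, set the seed, then draw
sort unique_id
set seed 12345
gen draw1 = runiform()

* Shuffle the data to mimic a dataset that arrives in a different order
gen shuffle = runiform()
sort shuffle

* Second draw: same sort, same seed
sort unique_id
set seed 12345
gen draw2 = runiform()

* The assertion passes silently when both draws are identical
assert draw1 == draw2
```

If the sort or the seed is skipped before the second draw, the assertion would be expected to fail, which is the situation the steps above are meant to prevent.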

The end goal is to have a CSV format file containing the ID variable used for randomization and the categorical variables created from the random numbers generated. This dataset will be preloaded into SurveyCTO, so that after an enumerator enters the respondent ID at the start of a survey, the result of the randomization will be loaded for the form and can be used for various sections of the survey.
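
A minimal sketch of producing such a file, assuming simulated data, a hypothetical ID variable unique_id, and a hypothetical file name randomization_preload.csv. It uses outsheet, which matches the Stata 12 version set in the do-file; Stata 13 and later also offer export delimited for the same purpose.

```stata
* Simulated data for illustration only; variable and file names are hypothetical.
version 12.0
clear
set obs 6
gen unique_id = 1000 + _n

* Reproducible assignment: sort, seed, draw, rank, split in half
sort unique_id
set seed 12345
gen random_number = runiform()
egen ordering = rank(random_number)
gen group = (ordering <= _N/2)

* Write only the ID and the assignment to a comma-separated file
outsheet unique_id group using "randomization_preload.csv", comma replace
```

Only the ID and the assignment variable are written, so the file contains nothing beyond what the survey form needs to look up.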

An example of the do-file is as follows:

version 12.0                            // set the Stata version so the random-number algorithm is fixed
sort unique_id                          // sort by the unique ID so observations are always in the same order
set seed 12345                          // set the random seed for replication
gen random_number = runiform()          // generate a random number between 0 and 1
egen ordering = rank(random_number)     // rank each observation from smallest to largest random number
gen group = .
replace group = 1 if ordering <= _N/2   // assign treatment status to the first half of the sample
replace group = 0 if ordering > _N/2    // assign control status to the second half of the sample
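
Once the assignment exists, group sizes and balance can be checked before anything goes to the field. The sketch below is only illustrative: the data are simulated, and baseline_income is a hypothetical baseline variable standing in for whatever covariates a real project would test.

```stata
* Simulated data for illustration only; baseline_income is a hypothetical covariate.
version 12.0
clear
set obs 100
gen unique_id = _n
set seed 6789
gen baseline_income = 100 + 20*rnormal()

* Reproducible assignment: sort, seed, draw, rank, split in half
sort unique_id
set seed 12345
gen random_number = runiform()
egen ordering = rank(random_number)
gen group = (ordering <= _N/2)

* Check the group sizes
tabulate group

* Compare the baseline mean across treatment and control
ttest baseline_income, by(group)
```

A large difference in baseline means would be a prompt to inspect the do-file and the data, not proof of an error, since some imbalance arises by chance.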


Randomization with Stratification in Stata

The steps to randomize in Stata with stratification are as follows:
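
One possible way to do this, offered as a sketch under assumptions and not as the definitive procedure, is to draw the random numbers exactly as in the unstratified case and then split each stratum in half separately. The stratum variable, its four levels, and the simulated data are all hypothetical.

```stata
* Simulated data for illustration only; stratum is a hypothetical
* stratification variable with four equally sized levels.
version 12.0
clear
set obs 40
gen unique_id = _n
gen stratum = mod(_n, 4) + 1

* Same replicability steps as before: sort, seed, draw
sort unique_id
set seed 12345
gen random_number = runiform()

* Within each stratum, order by the random number and treat the first half
bysort stratum (random_number): gen group = (_n <= _N/2)

* Restore the original order and inspect the assignment by stratum
sort unique_id
tabulate stratum group
```

Here each stratum has an even number of observations; with odd-sized strata this rule assigns the extra observation to control, which is a design choice a real project would need to make explicitly.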


Randomization with Multiple Treatment Arms


This article is part of the topic Randomized Control Trials