Difference between revisions of "Randomization in SurveyCTO"

 
(39 intermediate revisions by 5 users not shown)
Line 1: Line 1:
During surveys, you might often need to randomize various aspects of the questionnaire. While SurveyCTO has a random number generator, is is usually not recommended that you use it. This article will argue for doing the randomization in Stata, R or similar software, before the start of the survey, and preload the results of the randomization as dummies or categorical variables.
<onlyinclude>During [[Survey Pilot|surveys]], you might often need to [[Randomization|randomize]] various aspects of the [[Questionnaire Programming|questionnaire]]. While [[SurveyCTO Programming|SurveyCTO]] has a random number generator, it is usually not recommended that you use it. This article argues for doing the [[Randomization in Stata|randomization in Stata]], R, or similar software before the start of the '''survey''', and then preloading the results of the '''randomization''' as dummies or categorical '''variables'''.</onlyinclude>


== Why randomization is better to do before the Survye ==
== Read first ==
*Unless your [[Survey Pilot|survey]] falls into a very rare exception, do not randomize in SurveyCTO. Do it in [[Randomization in Stata|Stata]], R, or similar software and preload the results into '''SurveyCTO'''.


During surveys, we often need to randomize various aspects of the questionnaire. For example – sometimes we need to randomize which household members to interview, and sometimes - which set of questions to ask. While most CAPI software have random number generators, it is not the preferred option. Using, for example, Stata to randomize and then preloading the generated data file into the survey software is in almost all cases the better option among the two. The main advantages of using Stata over CAPI software during randomization are as follows:
== Why randomization is better to do before the Survey ==
During [[Survey Pilot|surveys]], we often need to [[Randomization|randomize]] various aspects of the [[Questionnaire Programming|questionnaire]]. For example – sometimes we need to '''randomize''' which household members to interview, and sometimes which set of questions to ask. While most [https://dimewiki.worldbank.org/Computer-Assisted_Personal_Interviews_(CAPI) CAPI software] have random number generators, it is not the preferred option. Using, for example, [https://dimewiki.worldbank.org/Randomization_in_Stata Stata to randomize] and then preloading the generated data file into the '''survey''' software is in almost all cases the better option among the two. The main advantages of using [https://dimewiki.worldbank.org/Stata_Coding_Practices Stata] over '''CAPI''' software during '''randomization''' are as follows:


* Randomization in Stata is transparent and reproducible which is important for publishing research.
* '''Randomization in Stata''' is transparent and [https://dimewiki.worldbank.org/Reproducible_Research reproducible], which is important for [[Publishing Data|publishing research]].
* Randomization results in Stata can be dependent, so that we are guaranteed that no disproportional large share of the results falls into any group. Randomization is always independent in SurveyCTO which means that no groups could be assigned observations if the number of observation per groups is low.
* '''Randomization''' results in '''Stata''' can be made dependent, which guarantees that no disproportionately large share of the observations falls into any one group. '''Randomization''' in SurveyCTO is always independent, which means that some groups could be assigned no observations if the number of observations per group is low.
* Randomization in Stata provides the option of ensuring that the result of the randomization is balanced over other variables, i.e. stratas. This means that we can guarantee that, for example, not all female respondents end up in a certain group.  
* '''Randomization in Stata''' provides the option of ensuring that the result of the '''randomization''' is balanced over other '''variables''', i.e. strata. This means that we can, for example, guarantee that not all female respondents end up in a certain group.
* Randomization in Stata is done before the survey takes place. This provides an opportunity to double check the result of a randomization and fix bugs and typos in the randomization code before it is used in the field, as it then would be too late to fix.
* '''Randomization in Stata''' is done before the '''survey''' takes place. This provides an opportunity to double check the result of a '''randomization''' and fix bugs and typos in the '''randomization code''' before it is used in the field, as it then would be too late to fix.


== Practical guide to how to randomize in Stata for a survey ==
== How to randomize in Stata for a survey in SurveyCTO ==


This is a basic examplefor how to do this. See [[randomization in Stata]] more details for how to implement more a
This is a basic example of how to [[Randomization in Stata|randomize in Stata]] and preload the results into SurveyCTO. See the linked page for details on how to implement more advanced randomization; the procedure for preloading the result stays the same as described below.


• Use with a dataset which has a unique ID [respondent ID, household number, etc.].
In the example below, we have a [[Survey Pilot|survey]] in which we want a random 30% to answer to a long '''survey''' and the rest to take a shorter '''survey'''. We also have a gender '''variable''' and we want the ratio of 30/70 to be as exact as possible within both genders.
• While writing a do-file, pay close attention to the following things:
o Set version. This ensures that the randomization algorithm is the same, as it sometimes changes between Stata versions.
o Set seed. This makes sure that the same random number is generated for the first observation, for the second observation, and so on, for every time the code is run.
o Properly sorting the data. The data should be sorted such that observations are in the same order every time the code is run. The most optimal situation is sorting using an ID variable which uniquely and fully identifies each observation.
• Convert the random numbers into categorical variables or dummy variables. This helps you check if the data is balanced.
• The end goal is to have a CSV format file containing the ID variable used for randomization and the categorical variables created from the random numbers generated. This dataset will be preloaded into SurveyCTO so that after an enumerator enters the respondent ID at the start of a survey questionnaire the result of the randomization will be loaded for the form and can be used for various sections of the survey.


A researcher randomizing through Stata first creates a do file to document their method of randomizing. This allows the researcher to replicate the results in the future. After that, they randomize in Stata and make sure that the randomization takes into account various demographic characteristics like age, sex, income, etc., so that the survey is well balanced. They then create the data file and check for bugs and irregularities in the randomization. Finally, they load the dataset into the survey software.
=== Step by step - known respondents ===


Randomizing using the CAPI software does not provide the researcher with the opportunity to replicate results in the future, test for bugs or ensure balance in gender and age distribution because the randomization takes place during the field survey itself.
* Let's say we have a [[Master Dataset|dataset]] with all the respondents and replacement respondents and that we have the following '''variables''': ''unique_id'' and ''gender''. Make sure that all observations have [[ID_Variable_Properties|uniquely and fully identifying ID]] in ''unique_id''. 
** If you do not have ID for your respondents you should create it.
** If you do not have a list of your respondents (for example, if you [[Randomization|randomize]] the [[Sampling|sample]] in the field by drawing lots) see the section below.
* Set version, set seed, and sort the data to make the randomization [[Reproducible Research|replicable]]. See '''Randomization in Stata''' for why all three steps are needed.
* Generate a random number and create a '''dummy variable''' (''long_survey'' in this example) that indicates for each observation if that observation should answer the long or short '''survey'''.
** Note that it does not have to be a dummy, we could just as well have had '''randomized''' into three groups and saved the result into a '''categorical variable''' with the values 1, 2, 3.
*In the end we should have a '''dataset''' with the ''unique_id'' and the '''categorical variables''' with the result of the '''randomization''', in this example only ''long_survey''.
*In your SurveyCTO '''survey''', have a question where the [[Enumerator Training|enumerator]] enters the ID for the respondent currently being interviewed. For this to be possible the '''enumerator''' needs a list with both the name and the ID.
**It is very important that these lists are not publicly disclosed as that would allow anyone to identify observations even in '''datasets''' that have names and other identifying information [[De-identification|removed]]. If the data collected is extra sensitive, consider using a different ID than the main ID for this purpose.
* Preload the data set you generated in '''Stata''' into your SurveyCTO '''survey''' using the ID entered by the '''enumerator'''.
* Restrict the relevant section of the '''survey''' in SurveyCTO using the value just preloaded.


The next section provides information on effectively using Stata for randomization.
=== Code example - known respondents ===
<pre>
*Set version
ieboilstart , version(12.1)
`r(version)'


*Open the data set after ieboilstart
use sample.dta


*Set seed
set seed 123456 //this is an example seed, replace this with another number


*Sort data set
sort unique_id


*Generate random number, rank that random number per gender, and assign
* long survey if the rank is within the lowest 30% of the total number of
* observations in that gender
gen rand = uniform()
bys gender : egen rank = rank(rand)
bys gender : gen long_survey = (rank/_N <= .3)
</pre>


If you need to randomize certain things in your questionnaire, randomization using SurveyCTO is not recommended. The best way to randomize things in a questionnaire is by [[Randomization in Stata | using Stata]]. The reason that Stata is preferable is because the randomization in Stata is transparent and easily reproducible which is not only necessary for impact evaluation projects but also for publishing research. Randomization in Stata is also done before a survey so there is time to check for errors/bugs and also to ensure that there is balance in the dataset. When Stata is not available, [[Randomization in Excel | Excel]] can also be used but [[Randomization in Excel |randomization done in Excel]] is not as reproducible as randomization in Stata.
== Preload randomization without IDs ==
It is still possible to preload [[Randomization|randomized]] categories even if there is no way of knowing the respondents in the [[Survey Pilot|survey]].  For example, you might only have a list of villages to '''survey''' and intend to do a lottery to decide which respondents to interview. Or you might have '''randomized''' school classes and you will interview all students in those classes but you do not know exactly how many students each class has. In both these cases, we can not assign a '''randomized''' category to each respondent as we do not know who they are yet. Instead we will '''randomize''' a category to the first respondent in the village/class, '''randomize''' a category for the second respondent, etc. And in the field, the [[Enumerator training|enumerators]] have lists for each village/class where they cross off each ID being used.
 
The following list explains the steps needed for this. We will keep using the example where we know which school classes we want to interview but not the size of those classes. Let's say that we want to collect basic demographic information on all students and want to do a short '''survey''' on 40% of the students and a long one on 10% of the students.
 
=== Step by step - unknown respondents ===
* '''Step 1''': Create a [[Master Dataset|dataset]] with one observation per class if you do not already have that. Make sure that all classes have unique IDs.
* Expand the '''dataset''' (code example below) to the maximum number of students you want to interview per class, or the maximum number of students there are in a class if you want to interview all of them. If you expect the largest class to have 50 students, expand to roughly twice that number so that there is no risk that your expectation was too low. We pick 99 rather than 100 so that no ID requires three digits.
* Create a unique ID for each student. It is easy to do in combination with the class ID.
* Set version, set seed, and sort the data to make the randomization [[Reproducible Research|replicable]]. See [[Randomization in Stata|randomization in Stata]] for why all three steps are needed.
* Since we are generating 99 IDs but most classes will be considerably smaller, there is a risk that all or a large share of the 10% of the students that will take the long '''survey''' are among the higher IDs 50-99 that we never expect to be used. To guarantee that this never happens, we can create strata for each group of 10 IDs (1-10, 11-20, 21-30, etc.) and '''randomize''' within each group. See the code example for how to do so.
** There are many different ways to do this. We can make sure that any number of categories shows up among the first 10 or 20 numbers. Just make sure that the students are selected in random order so that the order in which they are assigned IDs is random. The way they sit in a class room might not be random as there might be differences in motivation or socioeconomic status between the students in the front or in the back.
* The example code below generates a '''variable''' ''surveyType'' with a '''randomized''' category for the first respondent, the second, the third etc., up to 99. This '''variable''' is preloaded the same way as explained above where we have IDs for all respondents.
* The final step is to create a system where the [[Enumerator Training|enumerators]] use each ID in the correct order and only once. The best way to do that is to print out lists for each class where the '''enumerator''' crosses off the used IDs.
**If there are multiple '''enumerators''' per class, the lists can be split into IDs with odd and even numbers, and each '''enumerator''' crosses off IDs on their own list.
* Make sure that the '''randomization''' information is not on the lists available to the '''enumerators''', so that the '''enumerator''' has no chance to influence the assignment of IDs.
** For the same reason we need to '''randomize''' differently for each class, as '''enumerators''' would quickly learn the '''randomization''' order if it were the same for each class.
 
=== Code example - unknown respondents ===
<pre>
ieboilstart , version(12.1)
`r(version)'
 
*Open the data set with all classes you have randomized after ieboilstart
use class.dta
 
* Create one observation per each student ID. We use 99 as the highest number of students
expand 99
 
* Create unique ID
bys class_id : gen student_id = class_id * 100 + _n
 
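* Generate strata for each group of 10 IDs: 1-10, 11-20, 21-30 etc.
* (_n - 1) makes ID 10 fall in the first stratum; the last stratum, 91-99, has 9 IDs
bys class_id (student_id) : gen id10strata = floor((_n - 1)/10)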
 
*Set seed and sort
set seed 123456 //this is an example seed, replace this with another number
sort student_id
 
*Generate the result variable and the random number, rank that random number
* per class_id and ID strata, and assign survey type 2 if the rank is within the
* lowest 50% of the observations in that class and ID strata, and assign survey
* type 3 if the rank is larger than 90% of the total number of observations in that
* class and ID strata. The remaining 40% of the observations keep the survey type 1.
gen surveyType = 1
gen rand = uniform()
bys class_id id10strata : egen rank = rank(rand)
bys class_id id10strata : replace surveyType  = 2 if (rank/_N <= .5)
bys class_id id10strata : replace surveyType  = 3 if (rank/_N > .9)
</pre>
 
=== Other things to consider- unknown respondents ===
 
'''Test field system''': The most obvious step where this could go wrong is in making sure that '''enumerators''' use the correct ID and do not use any ID more than once. '''Enumerators''' may not follow the procedure properly, or the system they are using may not be designed well enough, but we can easily check this by looking at which IDs are used by each '''enumerator''' each day. The pilot should already give us a good understanding of whether our system works.
 
'''Stratification''': If you want to stratify over, for example, gender, you can have two lists, one for male respondents and one for female respondents. However, this increases the complexity in the field, and more than one layer of strata organized in this way immediately becomes very complicated. This is one of the few cases where '''randomization''' in SurveyCTO is preferred. However, make sure that there are enough observations in each stratum so that even the independent '''randomization''' will have a balanced result.


== Back to Parent ==
Line 38: Line 116:


== Additional Resources ==
*SurveyCTO [https://docs.surveycto.com/02-designing-forms/03-advanced-topics/01.randomizing.html Randomizing Survey Elements]
*SurveyCTO [https://docs.surveycto.com/02-designing-forms/04-sample-forms/03.randomizing.html  Randomizing Form Elements]
*SurveyCTO [https://support.surveycto.com/hc/en-us/articles/360026396733-Randomizing-form-section-order Randomizing Form Section Order]


[[Category: Impact Evaluation Design ]]

Latest revision as of 19:47, 8 August 2023

During surveys, you might often need to randomize various aspects of the questionnaire. While SurveyCTO has a random number generator, it is usually not recommended that you use it. This article argues for doing the randomization in Stata, R, or similar software before the start of the survey, and then preloading the results of the randomization as dummies or categorical variables.

Read first

  • Unless your survey falls into a very rare exception, do not randomize in SurveyCTO. Do it in Stata, R, or similar software and preload the results into SurveyCTO.

Why randomization is better to do before the Survey

During surveys, we often need to randomize various aspects of the questionnaire. For example, sometimes we need to randomize which household members to interview, and sometimes which set of questions to ask. While most CAPI software has a random number generator, using it is not the preferred option. Randomizing in, for example, Stata and then preloading the generated data file into the survey software is in almost all cases the better of the two options. The main advantages of using Stata over CAPI software for randomization are as follows:

  • Randomization in Stata is transparent and reproducible, which is important for publishing research.
  • Randomization results in Stata can be made dependent, which guarantees that no disproportionately large share of the observations falls into any one group. Randomization in SurveyCTO is always independent, which means that some groups could be assigned no observations if the number of observations per group is low.
  • Randomization in Stata provides the option of ensuring that the result of the randomization is balanced over other variables, i.e. strata. This means that we can, for example, guarantee that not all female respondents end up in a certain group.
  • Randomization in Stata is done before the survey takes place. This provides an opportunity to double check the result of a randomization and fix bugs and typos in the randomization code before it is used in the field, when it would be too late to fix them.
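
The difference between independent and dependent assignment can be illustrated with a minimal sketch. It assumes a toy dataset of 10 observations created on the spot; the variable names independent and dependent are hypothetical and used only for illustration.

```stata
* Toy data: 10 observations with one random number each
clear
set seed 123456 // example seed only
set obs 10
gen rand = runiform()

* Independent assignment: each observation is selected with probability 30%
* on its own, so the number selected can be anything from 0 to 10
gen independent = (rand < .3)

* Dependent assignment: rank the random numbers and select the 3 lowest-ranked
* of the 10 observations, so the share selected is fixed at 30%
egen rank = rank(rand)
gen dependent = (rank/_N <= .3)

* Compare the number of selected observations under the two approaches
tabulate independent
tabulate dependent
```

With small groups, the independent version can easily produce a very uneven split, which is the risk described in the list above.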

How to randomize in Stata for a survey in SurveyCTO

This is a basic example of how to randomize in Stata and preload the results into SurveyCTO. See the Randomization in Stata page for details on how to implement more advanced randomization; the procedure for preloading the result stays the same as described below.

In the example below, we have a survey in which we want a random 30% of respondents to answer a long survey and the rest to take a shorter survey. We also have a gender variable, and we want the 30/70 ratio to be as exact as possible within both genders.

Step by step - known respondents

  • Let's say we have a dataset with all the respondents and replacement respondents and that we have the following variables: unique_id and gender. Make sure that all observations have a uniquely and fully identifying ID in unique_id.
    • If you do not have IDs for your respondents, you should create them.
    • If you do not have a list of your respondents (for example, if you randomize the sample in the field by drawing lots), see the section below.
  • Set version, set seed, and sort the data to make the randomization replicable. See Randomization in Stata for why all three steps are needed.
  • Generate a random number and create a dummy variable (long_survey in this example) that indicates for each observation whether that observation should answer the long or the short survey.
    • Note that it does not have to be a dummy. We could just as well have randomized into three groups and saved the result in a categorical variable with the values 1, 2, 3.
  • In the end we should have a dataset with the unique_id and the categorical variables with the result of the randomization, in this example only long_survey.
  • In your SurveyCTO survey, have a question where the enumerator enters the ID for the respondent currently being interviewed. For this to be possible the enumerator needs a list with both the name and the ID.
    • It is very important that these lists are not publicly disclosed as that would allow anyone to identify observations even in datasets that have names and other identifying information removed. If the data collected is extra sensitive, consider using a different ID than the main ID for this purpose.
  • Preload the data set you generated in Stata into your SurveyCTO survey using the ID entered by the enumerator.
  • Restrict the relevant section of the survey in SurveyCTO using the value just preloaded.
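
Before randomizing, the ID requirement from the first step can be checked with a short sketch like the one below. It assumes the dataset sample.dta and the variables unique_id and gender from this example.

```stata
* Open the data set with all respondents
use sample.dta, clear

* isid stops with an error if unique_id has duplicates or missing values
isid unique_id

* The stratification variable should not be missing for any respondent
assert !missing(gender)
```

If either check fails, fix the data before running the randomization, since the sort on unique_id is only replicable when the ID is unique.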

Code example - known respondents

*Set version 
ieboilstart , version(12.1)
`r(version)'

*Open the data set after ieboilstart
use sample.dta

*Set seed
set seed 123456 //this is an example seed, replace this with another number

*Sort data set
sort unique_id

*Generate random number, rank that random number per gender, and assign
* long survey if the rank is within the lowest 30% of the total number of
* observations in that gender
gen rand = uniform()
bys gender : egen rank = rank(rand)
bys gender : gen long_survey = (rank/_N <= .3)
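
After running the code above, the result can be checked and saved for preloading. The following is a minimal sketch that continues from the example; the file name randomization.csv is an assumption, and export delimited requires Stata 13 or later (older versions can use outsheet instead).

```stata
* Share assigned to the long survey within each gender, should be close to 30%
tabulate gender long_survey, row

* Keep only the ID and the result of the randomization
keep unique_id long_survey

* Save as CSV so the file can be attached to the SurveyCTO form
export delimited using "randomization.csv", replace
```

In the form, a calculate field could then look up the value with SurveyCTO's pulldata() function, for example pulldata('randomization', 'long_survey', 'unique_id', ${unique_id}), where the field name unique_id for the ID entered by the enumerator is hypothetical. Check the SurveyCTO documentation on pre-loading data for the exact setup.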

Preload randomization without IDs

It is still possible to preload randomized categories even if there is no way of knowing the respondents in the survey in advance. For example, you might only have a list of villages to survey and intend to do a lottery to decide which respondents to interview. Or you might have randomized school classes and will interview all students in those classes, but you do not know exactly how many students each class has. In both these cases, we cannot assign a randomized category to each respondent as we do not know who they are yet. Instead we will randomize a category for the first respondent in the village/class, randomize a category for the second respondent, and so on. In the field, the enumerators have lists for each village/class where they cross off each ID being used.

The following list explains the steps needed for this. We will keep using the example where we know which school classes we want to interview but not the size of those classes. Let's say that we want to collect basic demographic information on all students and want to do a short survey on 40% of the students and a long one on 10% of the students.

Step by step - unknown respondents

  • Step 1: Create a dataset with one observation per class if you do not already have that. Make sure that all classes have unique IDs.
  • Expand the dataset (code example below) to the maximum number of students you want to interview per class, or the maximum number of students there are in a class if you want to interview all of them. If you expect the largest class to have 50 students, expand to roughly twice that number so that there is no risk that your expectation was too low. We pick 99 rather than 100 so that no ID requires three digits.
  • Create a unique ID for each student. It is easy to do in combination with the class ID.
  • Set version, set seed, and sort the data to make the randomization replicable. See randomization in Stata for why all three steps are needed.
  • Since we are generating 99 IDs but most classes will be considerably smaller, there is a risk that all or a large share of the 10% of the students that will take the long survey are among the higher IDs 50-99 that we never expect to be used. To guarantee that this never happens, we can create strata for each group of 10 IDs (1-10, 11-20, 21-30, etc.) and randomize within each group. See the code example for how to do so.
    • There are many different ways to do this. We can make sure that any number of categories shows up among the first 10 or 20 numbers. Just make sure that the students are selected in random order so that the order in which they are assigned IDs is random. The way they sit in a class room might not be random as there might be differences in motivation or socioeconomic status between the students in the front or in the back.
  • The example code below generates a variable surveyType with a randomized category for the first respondent, the second, the third etc., up to 99. This variable is preloaded the same way as explained above where we have IDs for all respondents.
  • The final step is to create a system where the enumerators use each ID in the correct order and only once. The best way to do that is to print out lists for each class where the enumerator crosses off the used IDs.
    • If there are multiple enumerators per class, the lists can be split into IDs with odd and even numbers, and each enumerator crosses off IDs on their own list.
  • Make sure that the randomization information is not on the lists available to the enumerators, so that the enumerator has no chance to influence the assignment of IDs.
    • For the same reason we need to randomize differently for each class, as enumerators would quickly learn the randomization order if it were the same for each class.

Code example - unknown respondents

ieboilstart , version(12.1)
`r(version)'

*Open the data set with all classes you have randomized after ieboilstart
use class.dta

* Create one observation per each student ID. We use 99 as the highest number of students
expand 99

* Create unique ID
bys class_id : gen student_id = class_id * 100 + _n

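* Generate strata for each group of 10 IDs: 1-10, 11-20, 21-30 etc.
* (_n - 1) makes ID 10 fall in the first stratum; the last stratum, 91-99, has 9 IDs
bys class_id (student_id) : gen id10strata = floor((_n - 1)/10)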

*Set seed and sort
set seed 123456 //this is an example seed, replace this with another number
sort student_id

*Generate the result variable and the random number, rank that random number
* per class_id and ID strata, and assign survey type 2 if the rank is within the
* lowest 50% of the observations in that class and ID strata, and assign survey
* type 3 if the rank is larger than 90% of the total number of observations in that
* class and ID strata. The remaining 40% of the observations keep the survey type 1.
gen surveyType = 1
gen rand = uniform()
bys class_id id10strata : egen rank = rank(rand)
bys class_id id10strata : replace surveyType  = 2 if (rank/_N <= .5)
bys class_id id10strata : replace surveyType  = 3 if (rank/_N > .9)
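
As with the known-respondents case, it is worth checking the result before preloading it. The sketch below continues from the code above and assumes ID strata of 9 or 10 IDs as in the example, in which case the rule rank/_N > .9 selects exactly one ID per class and stratum. The helper variable n_long and the file name randomization_students.csv are hypothetical.

```stata
* student_id should uniquely identify every row
isid student_id

* Count the type 3 assignments per class and ID stratum; expect exactly one each
bysort class_id id10strata : egen n_long = total(surveyType == 3)
assert n_long == 1
drop n_long

* Inspect the split of survey types among the first ten IDs of all classes
tabulate surveyType if mod(student_id, 100) <= 10

* Keep the ID and the result, and save as CSV for preloading
keep student_id surveyType
export delimited using "randomization_students.csv", replace
```

The assert stops the do-file if any stratum has zero or several type 3 assignments, which would mean that the longest survey could be concentrated among IDs that are never used.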

Other things to consider- unknown respondents

Test field system: The most obvious step where this could go wrong is in making sure that enumerators use the correct ID and do not use any ID more than once. Enumerators may not follow the procedure properly, or the system they are using may not be designed well enough, but we can easily check this by looking at which IDs are used by each enumerator each day. The pilot should already give us a good understanding of whether our system works.

Stratification: If you want to stratify over, for example, gender, you can have two lists, one for male respondents and one for female respondents. However, this increases the complexity in the field, and more than one layer of strata organized in this way immediately becomes very complicated. This is one of the few cases where randomization in SurveyCTO is preferred. However, make sure that there are enough observations in each stratum so that even the independent randomization will have a balanced result.

Back to Parent

This article is part of the topic Randomized Control Trials

Additional Resources