Difference between revisions of "Randomization in SurveyCTO"

Line 12: Line 12:
== Practical guide to how to randomize in Stata for a survey ==


This is a basic example for how to do this. See [[randomization in Stata]] more details for how to implement more a
This is a basic example of how to randomize in Stata and preload the results into SurveyCTO. See [[randomization in Stata]] for more details on how to implement more advanced randomization; the procedure for preloading the result is the same as described below.


• Use with a dataset which has a unique ID [respondent ID, household number, etc.].
In the example below we have a survey in which we want a random 30% of respondents to answer a long survey and the rest to take a shorter survey. We also have a gender variable, and we want the 30/70 ratio to be as exact as possible within each gender.
• While writing a do-file, pay close attention to the following things:
o Set version. This ensures that the randomization algorithm is the same, as it sometimes changes between Stata versions.
o Set seed. This makes sure that the same random number is generated for the first observation, for the second observation, and so on, for every time the code is run.
o Properly sorting the data. The data should be sorted such that observations are in the same order every time the code is run. The ideal is to sort on an ID variable that uniquely and fully identifies each observation.
• Convert the random numbers into categorical variables or dummy variables. This helps you check if the data is balanced.  
• The end goal is to have a CSV format file containing the ID variable used for randomization and the categorical variables created from the random numbers generated. This dataset will be preloaded into SurveyCTO so that after an enumerator enters the respondent ID at the start of a survey questionnaire the result of the randomization will be loaded for the form and can be used for various sections of the survey.


A researcher randomizing through Stata first creates a do-file to document their method of randomizing. This allows the researcher to replicate the results in the future. After that, they randomize in Stata and make sure that the randomization takes into account various demographic characteristics like age, sex, income, etc., so that the survey is well balanced. They then create the data file and check for bugs and irregularities in the randomization. Finally, they load the dataset into the survey software.
* Let's say we have a data set with all the respondents and replacement respondents, and that we have the variables ''unique_id'' and ''gender''. Make sure that all observations have a [[uniquely and fully identifying]] ID in ''unique_id''.
** If you do not have an ID for your respondents you should create one.
** If you do not have a list of your respondents (for example, if you randomize the sample in the field by drawing lots) see the section on [[Randomization_in_SurveyCTO#Preload_Randomization_without_IDs| preloading randomization without IDs]].
* Set version, set seed and sort the data so that the randomization is replicable. See [[Randomization_in_Stata#Steps_needed_for_replicability_when_randomizing_in_Stata| replicable randomization in Stata]] for why all three steps are needed.
* Generate a random number and create a dummy variable (''long_survey'' in this example) that indicates for each observation whether that observation should answer the long or the short survey.
** Note that it does not have to be a dummy. We could just as well have randomized into three groups and saved the result in a categorical variable with the values 1, 2 and 3.
* In the end we should have a data set with ''unique_id'' and the categorical variables holding the result of the randomization, in this example only ''long_survey''.
* In your SurveyCTO survey, have a question where the enumerator enters the ID of the respondent currently being interviewed. For this to be possible the enumerators need a list with both the name and the ID.
**It is very important that these lists are not publicly disclosed as that would allow anyone to identify observations even in data sets that have names and other identifying information removed. If the data collected is extra sensitive, consider using a different ID than the main ID for this purpose.
* Preload the data set you generated in Stata into your SurveyCTO survey using the ID entered by the enumerator.
* Restrict the relevant section of the survey in SurveyCTO using the value just preloaded.


Randomizing using the CAPI software does not provide the researcher with the opportunity to replicate results in the future, test for bugs or ensure balance in gender and age distribution because the randomization takes place during the field survey itself.
Example of the simple randomization used in
<pre>
*Set version
ieboilstart , version(12.1)
`r(version)'


The next section provides information on effectively using Stata for randomization.
*Set seed
set seed 123456 //this is an example seed, replace this with another number


*Sort data set
sort unique_id


*Generate random number, rank that random number per gender, and assign
* long survey if the rank divided by the total number of observations
* in that gender is less than or equal to 0.3
gen rand = uniform()
bys gender : egen rank = rank(rand)
bys gender : gen long_survey = (rank/_N <= .3)
</pre>


 
=== Preload randomization without IDs ===
 
It is still possible to preload randomized categories even if there is no way of knowing the respondents in the survey.
If you need to randomize certain things in your questionnaire, randomization using SurveyCTO is not recommended. The best way to randomize things in a questionnaire is by [[Randomization in Stata | using Stata]]. Stata is preferable because randomization in Stata is transparent and easily reproducible, which is necessary not only for impact evaluation projects but also for publishing research. Randomization in Stata is also done before a survey, so there is time to check for errors and bugs and to ensure that there is balance in the dataset. When Stata is not available, [[Randomization in Excel | Excel]] can also be used, but [[Randomization in Excel |randomization done in Excel]] is not as reproducible as randomization in Stata.


== Back to Parent ==

Revision as of 13:50, 14 December 2017

During surveys, you might often need to randomize various aspects of the questionnaire. While SurveyCTO has a random number generator, it is usually not recommended that you use it. This article argues for doing the randomization in Stata, R or similar software before the start of the survey, and preloading the results of the randomization as dummies or categorical variables.

Why it is better to randomize before the survey

During surveys, we often need to randomize various aspects of the questionnaire. For example, sometimes we need to randomize which household members to interview, and sometimes which set of questions to ask. While most CAPI software has a random number generator, it is not the preferred option. Using, for example, Stata to randomize and then preloading the generated data file into the survey software is in almost all cases the better of the two options. The main advantages of randomizing in Stata rather than in the CAPI software are as follows:

  • Randomization in Stata is transparent and reproducible which is important for publishing research.
  • Randomization results in Stata can be made dependent on each other, so that we are guaranteed that no disproportionately large share of the observations falls into any one group. Randomization in SurveyCTO is always independent across observations, which means that a group could end up with no observations at all if the number of observations per group is low.
  • Randomization in Stata provides the option of ensuring that the result of the randomization is balanced over other variables, i.e. strata. This means that we can guarantee that, for example, not all female respondents end up in a certain group.
  • Randomization in Stata is done before the survey takes place. This provides an opportunity to double check the result of a randomization and fix bugs and typos in the randomization code before it is used in the field, when it would be too late to fix them.
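
The difference between independent and dependent assignment can be illustrated with a small sketch. It assumes a made-up sample of 10 observations and hypothetical variable names (indep, ranked); it is not part of the survey do-file. With independent draws, the number of observations assigned to the 30% group varies with the seed and can even be zero, while the rank-based rule assigns exactly 3 of 10.

<pre>
*Illustration only: this clears any data in memory
clear
set obs 10
set seed 123456 //example seed

*runiform() is the current name of the uniform() function
gen rand = runiform()

*Independent assignment: each observation is drawn on its own,
* so the number of ones is not fixed
gen indep = (rand <= .3)

*Dependent assignment: rank the random numbers and take the lowest 30%,
* which gives exactly 3 of 10 observations
egen rank = rank(rand)
gen ranked = (rank/_N <= .3)

tab indep
tab ranked

*Probability that independent draws put nobody in the 30% group
display 0.7^10
</pre>

The last line shows that with 10 observations there is a chance of roughly 3% that independent randomization leaves the 30% group empty.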

Practical guide to how to randomize in Stata for a survey

This is a basic example of how to randomize in Stata and preload the results into SurveyCTO. See randomization in Stata for more details on how to implement more advanced randomization; the procedure for preloading the result is the same as described below.

In the example below we have a survey in which we want a random 30% of respondents to answer a long survey and the rest to take a shorter survey. We also have a gender variable, and we want the 30/70 ratio to be as exact as possible within each gender.

  • Let's say we have a data set with all the respondents and replacement respondents, and that we have the variables unique_id and gender. Make sure that all observations have a uniquely and fully identifying ID in unique_id.
    • If you do not have an ID for your respondents you should create one.
    • If you do not have a list of your respondents (for example, if you randomize the sample in the field by drawing lots) see the section on preloading randomization without IDs.
  • Set version, set seed and sort the data so that the randomization is replicable. See replicable randomization in Stata for why all three steps are needed.
  • Generate a random number and create a dummy variable (long_survey in this example) that indicates for each observation whether that observation should answer the long or the short survey.
    • Note that it does not have to be a dummy. We could just as well have randomized into three groups and saved the result in a categorical variable with the values 1, 2 and 3.
  • In the end we should have a data set with unique_id and the categorical variables holding the result of the randomization, in this example only long_survey.
  • In your SurveyCTO survey, have a question where the enumerator enters the ID of the respondent currently being interviewed. For this to be possible the enumerators need a list with both the name and the ID.
    • It is very important that these lists are not publicly disclosed as that would allow anyone to identify observations even in data sets that have names and other identifying information removed. If the data collected is extra sensitive, consider using a different ID than the main ID for this purpose.
  • Preload the data set you generated in Stata into your SurveyCTO survey using the ID entered by the enumerator.
  • Restrict the relevant section of the survey in SurveyCTO using the value just preloaded.

Example of the simple randomization used in

*Set version 
ieboilstart , version(12.1)
`r(version)'

*Set seed
set seed 123456 //this is an example seed, replace this with another number

*Sort data set
sort unique_id

*Generate random number, rank that random number per gender, and assign
* long survey if the rank divided by the total number of observations
* in that gender is less than or equal to 0.3
gen rand = uniform()
bys gender : egen rank = rank(rand)
bys gender : gen long_survey = (rank/_N <= .3)
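
Once long_survey has been generated, the result can be checked and saved as a CSV file for preloading. The lines below are a minimal sketch rather than part of the original example. The file name is a placeholder, and export delimited assumes Stata 13 or later.

<pre>
*Check that the ID uniquely and fully identifies each observation;
* isid stops with an error if there are duplicates or missing values
isid unique_id

*Check the split within each gender; the share with long_survey == 1
* should be as close to 30% as the group size allows
tab gender long_survey, row

*Keep only the ID and the randomization result, then save as CSV
* (the file name is an example, replace it with your own)
keep unique_id long_survey
export delimited using "randomization_preload.csv", replace
</pre>

Running the checks before exporting means that problems are caught while there is still time to fix the do-file, which is the main reason for randomizing before the survey.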

Preload randomization without IDs

It is still possible to preload randomized categories even if there is no way of knowing the respondents in the survey.

Back to Parent

This article is part of the topic Randomized Control Trials

Additional Resources