SurveyCTO Choice Lists


Read First

  • Our data work is conducted in multiple software packages, each with different restrictions. ODK, the platform that SurveyCTO is based on, allows many practices that turn out to be bad practices once all the contexts in which the data will be used are considered.

Only Numeric Values in the Name Column

The name column is unfortunately named, as you should never have text in this column, only numbers. Text values are more difficult to work with in relevance or constraint conditions. Also, one step of data cleaning is to replace all string variables (with some exceptions) with numeric variables. That task is greatly reduced if we assign a numeric code to each answer option already when coding the questionnaire. SurveyCTO provides a Stata do-file that creates labels and adds them to the numeric values.
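
As an illustration of the rule above, the following is a minimal Python sketch of a check that could be run on a choice list before a form is deployed. It assumes the choices sheet has already been read into a list of dictionaries with the keys list_name, name and label; the function name non_numeric_choices and the example rows are hypothetical and are not part of SurveyCTO.

```python
def non_numeric_choices(choices):
    """Return the choice rows whose 'name' entry is not a whole number."""
    flagged = []
    for row in choices:
        try:
            # int() accepts negative codes such as "-999" but rejects text.
            int(str(row["name"]).strip())
        except ValueError:
            flagged.append(row)
    return flagged


# Hypothetical rows from the choices sheet of a form definition.
choices = [
    {"list_name": "yesno", "name": "1", "label": "Yes"},
    {"list_name": "yesno", "name": "0", "label": "No"},
    {"list_name": "yesno", "name": "dk", "label": "Don't know"},
]

flagged = non_numeric_choices(choices)
for row in flagged:
    print(f"{row['list_name']}: name '{row['name']}' is not numeric")
```

Here only the row with the name dk is flagged; it would be recoded to a number before the form is used.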

Negative and Standardized Values for Non-Answers

For any answer that is a type of non-answer, for example "Don't know", "Question does not apply", "Decline/Refused to Answer" or "Other", we should use a code that is negative and that has the same meaning across a project.

We want the value to be negative for two reasons. The first reason is that these values stand out more when cleaning the data set and are therefore easier to address. This is the case both when tabulating a variable or looking at its distribution, and when looking at descriptive statistics, as negative values distort, for example, means to such a degree that the project team is reminded to look for these codes. To increase this effect it is better to pick -999 than -9 to represent, for example, "Don't know".
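
The effect on descriptive statistics can be sketched with a few made-up numbers. The Python example below assumes a hypothetical age variable with five valid answers and one "Don't know" that has not yet been cleaned; all values are invented for illustration.

```python
from statistics import mean

# Hypothetical valid answers to an age question.
ages = [34, 41, 29, 52, 38]

# The same variable with one uncleaned "Don't know", coded two ways.
coded_minus_9 = ages + [-9]
coded_minus_999 = ages + [-999]

print(f"valid answers only: {mean(ages):.1f}")
print(f"with -9 left in:    {mean(coded_minus_9):.1f}")
print(f"with -999 left in:  {mean(coded_minus_999):.1f}")
```

With -9 the mean is pulled down but still looks like a plausible age, so the problem is easy to miss. With -999 the mean becomes negative, which is impossible for an age and makes the uncleaned code hard to overlook.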

The second reason is that we might want to add more answer options in later rounds. If the number 9 represents "Don't know", there is a chance that we will need that value for a new category once we have more than eight categories. We could of course give the new category the code 10, but then we would have a non-answer code in the middle of the actual answers, which is not optimal. We should absolutely never shift the "Don't know" code from 9 to 10 and give the added category the code 9. This is the worst solution of all, because in a panel data set the same code that means an actual answer in the follow-up data means "Don't know" in the earlier data.
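
The panel problem described above can be made concrete with a small Python sketch. It assumes each round's choice list is available as a dictionary from code to label; the helper conflicting_codes and the crop labels are hypothetical.

```python
def conflicting_codes(round_a, round_b):
    """Return the codes used in both rounds with different labels."""
    shared = round_a.keys() & round_b.keys()
    return sorted(code for code in shared if round_a[code] != round_b[code])


# Eight hypothetical answer categories used in the baseline.
categories = {i: f"Crop {i}" for i in range(1, 9)}

# Bad design: "Don't know" is 9 at baseline and is shifted to 10 at
# follow-up so that the new category can take the code 9.
baseline_bad = {**categories, 9: "Don't know"}
followup_bad = {**categories, 9: "Crop 9", 10: "Don't know"}

# Recommended design: "Don't know" is negative and never has to move.
baseline_good = {**categories, -999: "Don't know"}
followup_good = {**categories, 9: "Crop 9", -999: "Don't know"}

print(conflicting_codes(baseline_bad, followup_bad))
print(conflicting_codes(baseline_good, followup_good))
```

In the bad design the code 9 is reported as conflicting, since it means "Don't know" in one round and an actual answer in the other. In the recommended design no code changes meaning between rounds.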

Using the same code for the same non-answer across a project reduces the risk of confusion and makes the data easier to clean.
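
One way to keep the codes consistent is to define them once per project and check every choice list against that definition. The Python sketch below assumes such a project-wide table; apart from -999 for "Don't know", which is the example used above, the specific codes are placeholders and each project should pick its own. The function name and the example rows are also hypothetical.

```python
# Hypothetical project-wide non-answer codes. Only -999 comes from the
# text above; the other values are placeholders.
NON_ANSWER_CODES = {
    -999: "Don't know",
    -888: "Decline/Refused to Answer",
    -777: "Question does not apply",
    -666: "Other",
}


def inconsistent_non_answers(choices):
    """Return rows whose negative code does not carry the standard label.

    Assumes every 'name' entry is already numeric.
    """
    problems = []
    for row in choices:
        code = int(row["name"])
        if code < 0 and NON_ANSWER_CODES.get(code) != row["label"]:
            problems.append(row)
    return problems


# Hypothetical choice rows; the last one reuses -888 for "Don't know".
choices = [
    {"list_name": "crop", "name": "1", "label": "Maize"},
    {"list_name": "crop", "name": "-999", "label": "Don't know"},
    {"list_name": "water", "name": "1", "label": "Well"},
    {"list_name": "water", "name": "-888", "label": "Don't know"},
]

problems = inconsistent_non_answers(choices)
for row in problems:
    print(f"{row['list_name']}: code {row['name']} is labelled '{row['label']}'")
```

Only the last row is reported, because it uses -888 with a label that the project table reserves for -999.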

Back to Parent

This article is part of the topic SurveyCTO Coding Practices