Difference between revisions of "Variable Names"

(6 intermediate revisions by the same user not shown)
Line 6: Line 6:
* Proper '''variable names''' make it easier for others to use and analyze the dataset at later stages.


== Question Numbers VS Descriptions ==
== Question Numbers v/s Descriptions ==
Defenders of the number method often mention the usefulness of question numbers for navigating questionnaires and referring to questions during the development of the questionnaire and training of enumerators; one can simply call out “Go to question 15” to a room full of enumerators in training and everybody knows more or less where to find the question in the printed version of the questionnaire. I have also heard more than once from a client that question numbers would make it easier to find variables in a dataset. I do not really buy the second argument (in my opinion a well-structured dataset with descriptive names is easier to navigate than a numbered one), but if there is one point I would concede to the proponents of the number method, it is the one about usefulness during training. Other than that, I believe that there are enough points in favour of a descriptive naming scheme to make it superior to numbering:
Most questionnaires use two broad methods for naming variables:
Descriptive names help to structure the questionnaire and dataset. The careful use of pre- and suffixes helps to indicate which variables belong together, either thematically (e.g. health_selfcare, health_medicine) or by type (e.g. awareness_likert, understanding_likert), which makes it easier to understand the questionnaire when looking at it in XLSFormat and the dataset when looking at it in tabular format or as a list of variables.
* '''Question numbers''' like '''1''', '''2''', '''3''' or '''A1''', '''A2a''', '''A2b'''. This method makes it easy to refer to questions during [[Enumerator Training|enumerator training]] and discussions on the survey questions
Questionnaires can change a lot during the development phase or over time if they are used over multiple survey rounds. This can become messy quickly with numbered questionnaires; adding or dropping a variable means that all subsequent variables also need to be renamed if one wants to keep a perfect sequence, which is especially annoying if you have an XLSForm with lots of dependencies in calculations and relevances, and it creates havoc in the scripts of data analysts who have already worked with the old names. To avoid this scenario, researchers sometimes resort to “extending” question numbers and end up with variable names like Q13, Q14a, Q14b, Q15, Q16, Q17a, Q17b, Q17c, which undermine what may be the only advantage a numbered questionnaire has over descriptive variable names: intuitive navigation for printed questionnaires.
* '''Descriptions''' like '''gender''', '''age''', '''employment''', and so on. This method allows the '''research team''' to indicate which variables belong together using prefixes, like '''health_selfcare''' and '''health_medicine''', and suffixes, like '''awareness_likert''' and '''understanding_likert'''. This method also makes it easier to understand the questionnaire when it is in the [[SurveyCTO_Programming#Excel_Method|Excel format]]. It also allows users to understand datasets in the form of tables.
The same descriptive variable names can be reused across different surveys, making it easier to reuse parts of mobile forms, recycle quality control backends and run similar types of analyses across multiple surveys.
 
- Descriptive variable names make form development, scripted data analysis and backend development more pleasant and less error-prone. Arguably, one does not have to be a statistics wizard to understand what the command tab gender if age < 25 means, whereas tab a26 if a28 < 25 requires one to look up the variables in a codebook of some sort.
Generally speaking, a questionnaire undergoes various rounds of discussions during the [[Questionnaire Design|development stage]], or when it is used for multiple rounds of [[Primary Data Collection|data collection]]. This can become messy in the case of '''numbered variable names''' because adding or dropping a variable means that all subsequent variables also need to be renamed if one wants to keep a perfect sequence. This problem gets harder to deal with in the case of an '''Excel form''' with various calculations and relevance conditions. It also means that users have to rewrite '''do-files''' that used the old names, which in turn affects [[Reproducible Research|reproducibility]].
 
This is, however, not the case with the '''description method'''. The same variable names can be reused across different surveys, which makes it easier to also reuse [[Monitoring Data Quality|quality control methods]] and use the same '''do-files''' for [[Data Analysis|analysis]] of multiple survey rounds. Moreover, it is also much easier to write do-files for '''descriptive variable names'''. For example, it is easier for another member of the research team to understand what the command <syntaxhighlight lang="Stata" inline>tabulate gender if age < 25</syntaxhighlight> means. However, if the command is <syntaxhighlight lang="Stata" inline>tabulate a26 if b12 < 25</syntaxhighlight>, then it will require them to refer to the variable dictionary in the original questionnaire.


== Naming Conventions ==
 
In this section, we look at '''naming conventions''' keeping in mind that most work that involves [[Primary Data Collection|survey data]] is performed using [[Stata_Coding_Practices|Stata]]. Note that while there is no established standard for '''variable names''', the following best practices ensure that variable names are meaningful, easy to understand, and consistent:
== General Tips ==
* '''Lowercase:''' Keep the variable names lowercase to avoid confusion. This also makes it simpler to refer to variables while writing code.
* '''Clear and simple:''' The name should describe the item it represents clearly, but should not be too long. For example, '''brand''' is a better name than '''brand_name''' because it explains what the variable represents without using extra letters.
* '''Avoid abbreviations:''' Unless it is a commonly used abbreviation (like '''hh''' for household), avoid using abbreviations since they are ambiguous and difficult to remember.
* '''Prefixes and suffixes:''' Use prefixes and suffixes to group variables by theme or by type. For example, '''employee1_satisfaction''', '''employee2_satisfaction''' and so on to group employee satisfaction in a firm. In Stata, you can then use an asterisk ('''*''') to refer to all variables in this group using the following command - <syntaxhighlight lang="Stata" inline>foreach var of varlist employee*_satisfaction {...</syntaxhighlight>.
* '''Underscores:''' Use underscores ('''_''') for variables that have more than one word. For example, '''father_name''', '''father_age''', and so on.
* '''Consistency:''' In the case of common variables, reuse variable names for different questionnaires, especially if the questionnaires are part of the same study. For example, it should not be '''DOB''' in the individual questionnaire, and '''birth_date''' in the household questionnaire.


== Related Pages ==
Line 23: Line 30:
* Jan Schenk, [https://medium.com/@janschenk/variable-names-in-survey-research-a18429d2d4d8 Variable Names in Survey Research]
* Petri Silen, [https://medium.com/better-programming/useful-tips-for-naming-your-variables-8139cc8d44b5 Useful Tips for Naming Your Variables]
[[Category: Reproducible Research]]

Latest revision as of 14:24, 13 April 2021

Variable names are one of the most important aspects of questionnaire design. Properly named variables improve the quality of data collection by giving members of the research team useful insights into the information captured by each variable. Using good naming conventions for variables allows others who are using the data to not only understand the purpose of the variable, but also its type (such as integer, date, string).

Read First

  • Questionnaire design is an important aspect of primary data collection.
  • Well designed questionnaires improve the quality of data collection, as well as the subsequent data analysis.
  • The research team should therefore spend a considerable amount of time thinking about the different ways in which the dataset can be used for analysis.
  • Proper variable names make it easier for others to use and analyze the dataset at later stages.

Question Numbers vs. Descriptions

Most questionnaires use two broad methods for naming variables:

  • Question numbers like 1, 2, 3 or A1, A2a, A2b. This method makes it easy to refer to questions during enumerator training and discussions on the survey questions.
  • Descriptions like gender, age, employment, and so on. This method allows the research team to indicate which variables belong together using prefixes, like health_selfcare and health_medicine, and suffixes, like awareness_likert and understanding_likert. This method also makes it easier to understand the questionnaire when it is in the Excel format. It also allows users to understand datasets in the form of tables.

Generally speaking, a questionnaire undergoes various rounds of discussions during the development stage, or when it is used for multiple rounds of data collection. This can become messy in the case of numbered variable names because adding or dropping a variable means that all subsequent variables also need to be renamed if one wants to keep a perfect sequence. This problem gets harder to deal with in the case of an Excel form with various calculations and relevance conditions. It also means that users have to rewrite do-files that used the old names, which in turn affects reproducibility.

This is, however, not the case with the description method. The same variable names can be reused across different surveys, which makes it easier to also reuse quality control methods and use the same do-files for analysis of multiple survey rounds. Moreover, it is also much easier to write do-files for descriptive variable names. For example, it is easier for another member of the research team to understand what the command tabulate gender if age < 25 means. However, if the command is tabulate a26 if b12 < 25, then it will require them to refer to the variable dictionary in the original questionnaire.
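
To illustrate the comparison above, here is a minimal Stata sketch. The data values, the variable labels, and the idea of keeping the question number in the label are assumptions made for this example and are not prescribed by this page; only the variable names a26, b12, gender and age come from the text.

```stata
* Hypothetical example data: two numbered variables as they might arrive from a survey
clear
input a26 b12
1 23
2 31
1 19
end

* The numbered version runs, but a reader needs the codebook to interpret it
tabulate a26 if b12 < 25

* Rename to descriptive names; the labels (assumed wording) keep the question numbers
rename a26 gender
rename b12 age
label variable gender "A26: Gender of respondent"
label variable age "B12: Age of respondent (years)"

* The same analysis is now readable without a codebook
tabulate gender if age < 25
```

Storing the question number in the variable label is one possible way to keep the link to the printed questionnaire while using descriptive names in code.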

Naming Conventions

In this section, we look at naming conventions keeping in mind that most work that involves survey data is performed using Stata. Note that while there is no established standard for variable names, the following best practices ensure that variable names are meaningful, easy to understand, and consistent:

  • Lowercase: Keep the variable names lowercase to avoid confusion. This also makes it simpler to refer to variables while writing code.
  • Clear and simple: The name should describe the item it represents clearly, but should not be too long. For example, brand is a better name than brand_name because it explains what the variable represents without using extra letters.
  • Avoid abbreviations: Unless it is a commonly used abbreviation (like hh for household), avoid using abbreviations since they are ambiguous and difficult to remember.
  • Prefixes and suffixes: Use prefixes and suffixes to group variables by theme or by type. For example, use employee1_satisfaction, employee2_satisfaction and so on to group employee satisfaction in a firm. In Stata, you can then use an asterisk (*) as a wildcard to refer to all variables in this group, for example in a loop such as foreach var of varlist employee*_satisfaction {...}.
  • Underscores: Use underscores (_) for variables that have more than one word. For example, father_name, father_age, and so on.
  • Consistency: In the case of common variables, reuse variable names for different questionnaires, especially if the questionnaires are part of the same study. For example, it should not be DOB in the individual questionnaire, and birth_date in the household questionnaire.
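
Some of the conventions above can be sketched in a short do-file. This is an illustrative example under stated assumptions: the dataset is generated at random, and the mixed-case starting names and the hh_size variable are hypothetical. It assumes a Stata version that supports `rename *, lower` (Stata 12 or later).

```stata
* Hypothetical example data with mixed-case variable names
clear
set obs 5
generate Employee1_Satisfaction = ceil(5 * runiform())
generate Employee2_Satisfaction = ceil(5 * runiform())
generate HH_Size = ceil(8 * runiform())

* Lowercase: convert all variable names to lowercase in one step
rename *, lower

* Prefixes and suffixes: the wildcard matches every employee satisfaction variable
foreach var of varlist employee*_satisfaction {
    summarize `var'
}
```

Because the names share a prefix and a suffix, the loop picks up any additional employee satisfaction variables without changes to the code.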

Related Pages

Click here for pages that link to this topic.

Additional Resources