Variable Names


Variable names are one of the most important aspects of questionnaire design. Properly named variables improve the quality of data collection by giving members of the research team useful insights into the information captured by each variable. Good naming conventions allow others who use the data to understand not only the purpose of a variable, but also its type (such as integer, date, or string).

Read First

  • Questionnaire design is an important aspect of primary data collection.
  • Well-designed questionnaires improve the quality of data collection, as well as the subsequent data analysis.
  • The research team should therefore spend a considerable amount of time thinking about the different ways in which the dataset can be used for analysis.
  • Proper variable names make it easier for others to use and analyze the dataset at later stages.

Question Numbers vs. Descriptions

Most questionnaires use two broad methods for naming variables:

  • Question numbers like 1, 2, 3 or A1, A2a, A2b. This method makes it easy to refer to questions during enumerator training and in discussions of the survey questions.
  • Descriptions like gender, age, employment, and so on. This method allows the research team to indicate which variables belong together using prefixes, like health_selfcare and health_medicine, and suffixes, like awareness_likert and understanding_likert. This method also makes it easier to understand the questionnaire when it is in the Excel format. It also allows users to understand datasets in the form of tables.
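As a rough illustration of how prefixes and suffixes group variables, the following Stata sketch builds a toy dataset using the example names from the list above. The values are made up, and the choice of describe and summarize is only one way to use the groups.

```stata
* Toy dataset: variable names come from the text, values are made up
clear
set obs 3
generate health_selfcare = _n
generate health_medicine = _n + 1
generate awareness_likert = mod(_n, 5) + 1
generate understanding_likert = mod(_n + 1, 5) + 1

* Prefix: every variable in the health module
describe health_*

* Suffix: every Likert-scale item, whichever module it belongs to
summarize *_likert
```

With numbered names such as A1 or A2a, the same selection would require listing each variable by hand.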

Generally speaking, a questionnaire goes through several rounds of revision during the development stage, or when it is used for multiple rounds of data collection. This can become messy with numbered variable names, because adding or dropping a variable means that all subsequent variables also need to be renamed if one wants to keep a perfect sequence. This problem gets harder to deal with in an Excel form with various calculations and relevance conditions, since each of these refers to variables by name. It also means that users have to rewrite do-files that use the old names, which in turn affects reproducibility.

This is, however, not the case with the description method. The same variable names can be reused across different surveys, which makes it easier to reuse quality control methods and to use the same do-files to analyze multiple survey rounds. Moreover, it is much easier to write and read do-files that use descriptive variable names. For example, another member of the research team can readily understand what the command tabulate gender if age < 25 means. If the command is instead tabulate a26 if b12 < 25, they will have to look up both variables in the variable dictionary of the original questionnaire.
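The comparison above can be tried on a small example. This is a minimal sketch: the toy data are invented, and it assumes that a26 records gender and b12 records age, as the commands in the text imply.

```stata
* Toy data: a26 and b12 are the numbered names from the example above
clear
input a26 b12
1 19
2 31
1 24
2 22
end

* Numbered names: the reader needs the questionnaire to decode this
tabulate a26 if b12 < 25

* Descriptive names: the same command now documents itself
rename (a26 b12) (gender age)
tabulate gender if age < 25
```

Both tabulate commands produce the same table; only the second can be understood without the questionnaire at hand.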

Naming Conventions

In this section, we look at naming conventions keeping in mind that most work that involves survey data is performed using Stata. Note that while there is no established standard for variable names, the following best practices ensure that variable names are meaningful, easy to understand, and consistent:

  • Lowercase: Keep the variable names lowercase to avoid confusion. This also makes it simpler to refer to variables while writing code.
  • Clear and simple: The name should describe the item it represents clearly, but should not be too long. For example, brand is a better name than brand_name because it explains what the variable represents without using extra letters.
  • Avoid abbreviations: Unless it is a commonly used abbreviation (like hh for household), avoid using abbreviations since they are ambiguous and difficult to remember.
  • Prefixes and suffixes: Use prefixes and suffixes to group variables by theme or by type. For example, use employee1_satisfaction, employee2_satisfaction, and so on to group the employee satisfaction variables for a firm. In Stata, you can then use an asterisk (*) as a wildcard to refer to all variables in this group, as in the command foreach var of varlist employee*_satisfaction { ... }.
  • Underscores: Use underscores (_) for variables that have more than one word. For example, father_name, father_age, and so on.
  • Consistency: In the case of common variables, reuse variable names across questionnaires, especially if the questionnaires are part of the same study. For example, the date of birth should not be called DOB in the individual questionnaire and birth_date in the household questionnaire.
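Several of the conventions above can be applied to an existing dataset with a few Stata commands. The sketch below is illustrative only: the toy data and the mixed-case starting names are hypothetical, and it assumes Stata 12 or later for the group form of rename.

```stata
* Toy data with deliberately inconsistent names (all hypothetical)
clear
set obs 4
generate Employee1_Satisfaction = mod(_n, 5) + 1
generate Employee2_Satisfaction = mod(_n + 2, 5) + 1
generate DOB = mdy(1, _n, 1990)

* Lowercase: convert every variable name in one step
rename *, lower

* Consistency: use the same name as the household questionnaire
rename dob birth_date
format birth_date %td

* Prefixes and suffixes: the wildcard matches the whole group
foreach var of varlist employee*_satisfaction {
    summarize `var'
}
```

Because the loop refers to the group by pattern, adding an employee3_satisfaction variable later would not require any change to the code.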

Related Pages

Click here for pages that link to this topic.

Additional Resources