
DIME Analytics has created iefieldkit as a package in Stata to support the process of primary data collection from start to finish. In most cases, third party survey firms or local partners collect data on behalf of the research team. Therefore, data quality assurance is a particularly important aspect of data collection. ietestform allows the research team to test Open Data Kit (ODK)-based electronic survey forms for common errors, as well as best practices for SurveyCTO-based forms before field data collection starts. For example, the SurveyCTO server has a built-in test feature that tests the ODK syntax of a form when it is uploaded by the research team. ietestform complements these built-in tests to ensure that the collected data is in a format that is easily readable in Stata, and is of high quality.

Read First

  • Stata coding practices.
  • iefieldkit.
  • To install ietestform, type ssc install ietestform in Stata.
  • To install all the commands in the iefieldkit package, type ssc install iefieldkit in Stata.
  • For instructions and available options, type help ietestform.


In Open Data Kit (ODK)-based electronic survey kits, including SurveyCTO, survey forms (or questionnaires) are typically built in Excel using a specialized structured syntax. Before the research team starts with field data collection, they can use ietestform to test ODK-based electronic survey forms for common errors, as well as best practices for SurveyCTO-based forms.

For example, the SurveyCTO server has a built-in feature that tests the ODK syntax of a form when it is uploaded by the research team. ietestform complements these built-in tests to ensure that the collected data is in a format that is easily readable in Stata, and is of high quality. Therefore, the ietestform command should be used after testing the survey form on a SurveyCTO server to make sure there are no syntax errors.


  ietestform , surveyform("filename.xlsx") 

The ietestform command produces a report in .csv format. The report flags coding errors, as well as practices that are not strictly wrong but may indicate bad practice and therefore need a manual review. The report can be opened in a number of software applications, and can also be used with collaboration tools like GitHub.

If you think that the command incorrectly flagged issues in your SurveyCTO form, please report the case here to help DIME Analytics improve the command. Refer to the following sections for a detailed explanation of the tests performed by ietestform. These tests flag errors that may interrupt field work. Note that ietestform should be used only after the form has passed the ODK syntax checks on the SurveyCTO server.

Required Columns

Required fields ensure that the enumerators (the people collecting data in the field) cannot proceed without entering a response into a particular field. This prevents submissions of incomplete forms, and helps ensure that enumerators complete forms in the right order. A field is required if it has the "Yes" value in the required column.

Note that only field types that show up when filling in the form are affected by that value. For example, fields like begin_group, end_repeat, and text_audit do not show up while filling in the form, so tests related to the required column ignore these fields.

ietestform runs two tests related to the required column: one for fields of the note type, and one for all other fields. Fields of the note type are those for which the enumerator does not have to enter any input. Instead, the enumerator only needs to read out a specific text note.

Non-note fields : required

ietestform tests to make sure that all fields that are not of note type have the value "Yes" in the required column, that is, they are required. The final report then lists all fields that are not of type note and are not required.

Even when some type of non-response by a respondent, such as “Declined to answer”, is acceptable, there should always be a valid method to record the reason for no response. The enumerator should not leave the input field empty in this case. The absence of a recorded answer should only mean that the enumerator did not ask the question during the survey. In cases where it is acceptable to skip a question, you should use an appropriate relevance condition.

Fields that record GPS coordinates, for instance, are among the fields that may intentionally have a "No" value under the required column. Such fields often have the type geopoint, geoshape, or geotrace. If you know that you will have no problem collecting GPS coordinates, then you should have a "Yes" value in the required column to ensure that you get valid data points.

However, if GPS coordinates are difficult to collect, then it might be a good idea to not have a "Yes" value under the required column. This will allow the enumerator to complete the other fields and submit the survey even if it is not possible to record GPS coordinates. In this case, ietestform will still report these fields, but as long as you are happy with your decision, you can still proceed with launching the survey.

Note fields : not required

While fields of the note type can have a "Yes" value in the required column, they cannot record an input. Therefore, if an enumerator comes across such a field during a live survey, they cannot move past this field. In this case, there is no way to continue with the interview, and the enumerator will not be able to submit the data already collected from previous questions. ietestform therefore reports a list of all fields that are of the note type and have a "Yes" value in the required column.

Note that there are cases in which required note fields may be useful. Since enumerators cannot move past these fields, you may use them with a relevance condition so that these fields show up if an earlier entry in the form is incorrect. This will force the enumerator to go back and correct the error before continuing with the interview.

For example, enumerators often enter respondent IDs twice to make sure there is no typo in the ID. You may name the two entry fields id1 and id2. Then you can follow these fields with a required note field which has the relevance expression " ${id1} != ${id2} ". In this case, the note type field will only appear if the two entries are not identical. You can use the note text to inform the enumerator that the two ID fields are not identical, and that the enumerator must go back and change the values in order to continue.
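The two tests on the required column can be sketched in Python as follows. This is an illustration only, not how ietestform is implemented. It assumes the survey sheet has been read into a list of dictionaries with type, name and required keys. The names check_required and HIDDEN_TYPES, and the example fields, are hypothetical.

```python
# Field types that never show up on screen; the required value is ignored for them.
# This set is illustrative and is not the command's full list.
HIDDEN_TYPES = {"begin_group", "end_group", "begin_repeat", "end_repeat", "text_audit"}


def check_required(rows):
    """Return (non-note fields that are not required, note fields that are required)."""
    not_required, required_notes = [], []
    for row in rows:
        field_type = row["type"].strip()
        if field_type in HIDDEN_TYPES:
            continue
        is_required = row.get("required", "").strip().lower() == "yes"
        if field_type == "note" and is_required:
            required_notes.append(row["name"])
        elif field_type != "note" and not is_required:
            not_required.append(row["name"])
    return not_required, required_notes


survey = [
    {"type": "begin_group", "name": "hh", "required": ""},
    {"type": "text", "name": "id1", "required": "Yes"},
    {"type": "text", "name": "id2", "required": "Yes"},
    {"type": "note", "name": "id_mismatch", "required": "Yes"},
    {"type": "geopoint", "name": "gps", "required": ""},
    {"type": "end_group", "name": "hh", "required": ""},
]

not_required, required_notes = check_required(survey)
print("Not required:", not_required)
print("Required notes:", required_notes)
```

Both lists call for a manual review rather than an automatic fix: the gps field and the id_mismatch note in this example may well be intentional, as described above.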

Matching begin_ and end_

ietestform checks that all begin_group fields are matched by an end_group, and that all begin_repeat fields are matched by an end_repeat. While the ODK syntax tester on the SurveyCTO server also tests for matching begin_ and end_ values, locating the mismatch can often be time consuming, especially when the survey form is very large. The ietestform command generates a report that provides additional information that makes it easier to solve this problem.

For example, ODK does not require that the end_group and end_repeat fields should have field names. This makes it difficult to identify where the error is in the underlying survey form. However, ietestform fills that gap because it requires that these fields should have unique names, and then it lists these names in the report, along with the row number (in the form) of the non-valid (unmatched) begin_ and end_ pairs.

For a begin_ and end_ pair to be considered valid or matched by ietestform, the following three criteria must be met:

  1. For each begin_ field, there must be an end_ field.
  2. The corresponding end_ field must be of the correct type. That is, a begin_group should not be closed by an end_repeat, and a begin_repeat should not be closed by an end_group.
  3. The names of the end_ fields must match the names of begin_ fields. The SurveyCTO server already tests to make sure that the begin_ names are unique, so each begin_ and end_ pair will also be unique if this condition is met.
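The three criteria above can be checked with a stack of open begin_ fields. The following Python sketch is an assumption about one way to do this, not the command's actual code. The function name check_begin_end and the example rows are hypothetical, and row numbers assume that the first row of the survey sheet holds the column headers.

```python
def check_begin_end(rows):
    """Return a list of (row_number, message) for unmatched begin_/end_ fields."""
    problems = []
    open_fields = []  # stack of (kind, name, row_number) for begin_ fields not yet closed
    for row_number, row in enumerate(rows, start=2):  # row 1 holds the headers
        field_type, name = row["type"], row.get("name", "")
        if field_type in ("begin_group", "begin_repeat"):
            open_fields.append((field_type[len("begin_"):], name, row_number))
        elif field_type in ("end_group", "end_repeat"):
            kind = field_type[len("end_"):]
            if not open_fields:
                # Criterion 1: every end_ needs a begin_
                problems.append((row_number, f"{field_type} '{name}' has no begin_ field"))
                continue
            open_kind, open_name, open_row = open_fields.pop()
            if kind != open_kind:
                # Criterion 2: the end_ type must match the begin_ type
                problems.append((row_number, f"{field_type} '{name}' closes begin_{open_kind} '{open_name}' (row {open_row})"))
            elif name != open_name:
                # Criterion 3: the names must match
                problems.append((row_number, f"{field_type} '{name}' does not match begin_{open_kind} '{open_name}' (row {open_row})"))
    for kind, name, row_number in open_fields:
        # Criterion 1 again: every begin_ needs an end_
        problems.append((row_number, f"begin_{kind} '{name}' is never closed"))
    return problems


survey = [
    {"type": "begin_group", "name": "hh"},
    {"type": "begin_repeat", "name": "members"},
    {"type": "integer", "name": "age"},
    {"type": "end_repeat", "name": "member"},  # name does not match "members"
    {"type": "end_group", "name": "hh"},
]

problems = check_begin_end(survey)
for row_number, message in problems:
    print(row_number, message)
```

Reporting the row number together with both names is what makes the mismatch quick to find in a large form.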

Naming and Labeling

ODK applies very few restrictions to field names and other inputs. Therefore, datasets created in ODK often contain variable names and labels that are not valid in Stata and will cause an error when the dataset is imported into Stata. For example, ODK only requires that all variable names must be unique, and does not allow the use of a few special characters. The ODK syntax test on the SurveyCTO server tests for only these restrictions. ietestform performs some additional tests which ensure that the datasets are valid, and optimized for being imported into Stata.

Stata-specific labels

ietestform returns a flag if your survey form is not programmed to display Stata-specific labels. In SurveyCTO, for instance, you can program your form to display questions in multiple languages. This is done by creating label columns named label:english, label:swahili, label:hindi and so on. You can then choose which language to use for labels when exporting the dataset to Stata from SurveyCTO.

You can use the same feature to create Stata-specific labels, by adding a label "language" called label:stata. You can obviously add and modify labels after importing the dataset to Stata as well. However, this is the simplest way to add Stata-specific labels. If this practice is not used, the data set may end up being incorrectly labeled.

Length of variable labels

In Stata, there is a restriction on the length of variable labels. Variable labels in Stata cannot be longer than 80 characters, and Stata truncates variable labels that are longer. ietestform checks for this by listing all fields with entries in Stata's label column that are longer than 80 characters.

Length of variable names

Similarly, Stata also restricts the length of variable names to 32 characters. If the name is longer than that, Stata will either truncate the name, or replace the name with generic names like var1, var2, etc. if the truncated name is no longer unique. While you can make these changes in Stata as well, it is much easier to solve these issues before starting with the data collection. ietestform therefore flags all fields with variable names longer than 32 characters.
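As a rough illustration of the two length tests, the Python sketch below flags labels longer than 80 characters and names longer than 32 characters. It is not the command's code. The function name check_lengths, the default label column label:stata, and the example rows are assumptions made for this example.

```python
MAX_LABEL_LENGTH = 80  # Stata truncates variable labels longer than this
MAX_NAME_LENGTH = 32   # Stata's limit on the length of variable names


def check_lengths(rows, label_column="label:stata"):
    """Return (fields with too long labels, fields with too long names)."""
    long_labels = [r["name"] for r in rows if len(r.get(label_column, "")) > MAX_LABEL_LENGTH]
    long_names = [r["name"] for r in rows if len(r["name"]) > MAX_NAME_LENGTH]
    return long_labels, long_names


survey = [
    {"type": "integer", "name": "age", "label:stata": "Age of respondent"},
    {"type": "text", "name": "crop", "label:stata": "x" * 81},
    {"type": "text", "name": "n" * 33, "label:stata": "Short label"},
]

long_labels, long_names = check_lengths(survey)
print("Labels over 80 characters:", long_labels)
print("Names over 32 characters:", len(long_names))
```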

Length of field names in repeat groups

This test has two parts. The first part lists fields in repeat groups that have names that will be too long in the wide format when imported to Stata. The second part lists fields in repeat groups where the risk of too long names is high, but not certain.

When using SurveyCTO's Stata import do-file, or when exporting the dataset in wide format, every variable in a repeat group has a suffix added to its name. For example, if a repeat group is repeated three times, then in the wide dataset any variable in that repeat group will generate three variables, with names suffixed by _1, _2 and _3 respectively. This suffix also counts towards the limit on the length of variable names in Stata discussed in the previous test. (Technically 32 characters are allowed, but some common commands add one character to variable names when processing them, so we recommend a maximum of 31 characters.) Thus, any variable in a repeat group should have a field name no longer than 29 characters. If the field is in a nested repeat group (a repeat group inside a repeat group), then it will be suffixed once for each repeat group. So the actual constraint used in this test is given by this formula: 31 - (2 * number of nested repeat groups for the field). This test lists all variables that have longer names than that constraint.

The first test assumes that there are no more than 9 iterations in each repeat group. If there are more than 9, the suffixes will be _10, _11 etc., which take up three characters each. So the second test lists all fields that have a field name longer than 31 - (3 * number of nested repeat groups for the field). Whether these names will actually become too long is uncertain, but if your names are long enough to be caught by this test, it is probably best practice to shorten them.
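The two formulas can be applied while walking through the survey sheet and counting how many repeat groups are currently open. The Python sketch below is one possible reading of the test, not the command's implementation. The function name repeat_name_lengths and the example rows are hypothetical, and a field caught by the first test is not listed again in the second.

```python
def repeat_name_lengths(rows, max_length=31):
    """Return (names certainly too long, names possibly too long) for fields in repeat groups."""
    too_long, maybe_too_long = [], []
    depth = 0  # number of repeat groups the current row is nested in
    for row in rows:
        field_type = row["type"]
        if field_type == "begin_repeat":
            depth += 1
        elif field_type == "end_repeat":
            depth -= 1
        elif depth > 0 and not field_type.startswith(("begin_", "end_")):
            length = len(row["name"])
            if length > max_length - 2 * depth:    # suffixes like _1 take 2 characters
                too_long.append(row["name"])
            elif length > max_length - 3 * depth:  # suffixes like _10 take 3 characters
                maybe_too_long.append(row["name"])
    return too_long, maybe_too_long


survey = [
    {"type": "begin_repeat", "name": "members"},
    {"type": "text", "name": "a" * 30},  # 30 > 31 - 2 = 29
    {"type": "text", "name": "b" * 29},  # 29 > 31 - 3 = 28
    {"type": "text", "name": "c" * 28},  # within both limits
    {"type": "end_repeat", "name": "members"},
]

too_long, maybe_too_long = repeat_name_lengths(survey)
print(len(too_long), len(maybe_too_long))
```

For a field nested two repeat groups deep, the same code applies limits of 27 and 25 characters.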

Repeat Group Name Conflict

This test checks for name conflicts that may result from the suffixes added to fields inside a repeat group. SurveyCTO's ODK syntax tester tests that all names are unique. The names myvar and myvar_1 are not duplicates in the ODK syntax test, but if myvar is in a repeat group, it will be suffixed with _1 for the first iteration of that variable. That will create a name conflict with the variable created from the field myvar_1.

This test lists all fields inside a repeat group whose suffixed names may conflict with the name of another field. For example, if there is a field named myvar, ietestform tests whether there is any variable of the form myvar_#, where # is one or several digits.

If the variable myvar is in a nested repeat group (a repeat group inside a repeat group), then it is testing for myvar_#, myvar_#_#, myvar_#_#_# etc. for each level of nested repeat group, where # is one or several digits.

Technical special case: If the fields myvar and myvar_1 are both in a non-nested repeat group then there will be no name conflict: the first iteration of both fields will generate the variables myvar_1 and myvar_1_1 since the variables from both fields are suffixed. These fields are still listed by this test, as it may be confusing that the variable myvar_1 comes from the field myvar and not from myvar_1.
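The pattern myvar_#, myvar_#_# and so on can be expressed as a regular expression with one optional suffix per level of nesting. The following Python sketch illustrates the idea under that assumption. It is not taken from the command, and the function name repeat_name_conflicts and the example fields are hypothetical.

```python
import re


def repeat_name_conflicts(rows):
    """Return (field in a repeat group, conflicting field names) pairs."""
    names = [r["name"] for r in rows if not r["type"].startswith(("begin_", "end_"))]
    conflicts = []
    depth = 0  # number of repeat groups the current row is nested in
    for row in rows:
        field_type = row["type"]
        if field_type == "begin_repeat":
            depth += 1
        elif field_type == "end_repeat":
            depth -= 1
        elif depth > 0 and not field_type.startswith(("begin_", "end_")):
            # name_#, name_#_#, ... with at most one numeric suffix per nesting level
            pattern = re.compile(re.escape(row["name"]) + r"(_\d+){1," + str(depth) + "}")
            clashes = [n for n in names if pattern.fullmatch(n)]
            if clashes:
                conflicts.append((row["name"], clashes))
    return conflicts


survey = [
    {"type": "text", "name": "myvar_1"},
    {"type": "begin_repeat", "name": "members"},
    {"type": "text", "name": "myvar"},
    {"type": "text", "name": "othervar"},
    {"type": "end_repeat", "name": "members"},
]

conflicts = repeat_name_conflicts(survey)
print(conflicts)
```

Because the comparison runs against every field name in the form, two fields inside the same repeat group are also listed, which matches the special case described above.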

Choice Sheet Stata Labels

Apart from whether the label:stata exists or not, there is no further test on the values of the Stata label column in the choice sheet.

Leading and Trailing Spaces

In computer science, there is a difference between the strings "ABC" and "ABC ". This difference does not show in Excel. When you upload your form to SurveyCTO's server, the form checker is programmed to handle these differences. However, when you import your form to Stata, as ietestform and several other commands do, these minor differences are preserved.

For example, suppose you have a list in the choice sheet called village, but the actual content of the cell is "village ". In Excel you will not see this extra space unless you really look for it. This means that some tools, probably most of them, will treat this as "village", but other tools might treat it as "village " which, when compared, are not the same.

It would be even worse if some list items in the village list have the list name value "village" and others have the value "village ". This is very difficult to spot in Excel, but some tools might treat these as different lists.

Leading (" ABC") or trailing ("ABC ") spaces are not difficult to deal with, and most tools, ietestform included, deal with them. However, there is no guarantee that all of them do. To reduce the risk of errors in whatever tools you use on your data in the future, leading and trailing spaces should be removed.
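A short Python sketch shows both why the difference matters and how such cells could be found. It is only an illustration: the function name cells_with_extra_spaces and the example rows are hypothetical, and row numbers assume that the first row of the sheet holds the column headers.

```python
def cells_with_extra_spaces(rows):
    """Return (row_number, column, value) for cells with leading or trailing spaces."""
    flagged = []
    for row_number, row in enumerate(rows, start=2):  # row 1 holds the headers
        for column, value in row.items():
            if isinstance(value, str) and value != value.strip():
                flagged.append((row_number, column, value))
    return flagged


choices = [
    {"list_name": "village", "value": "1", "label": "North"},
    {"list_name": "village ", "value": "2", "label": "South"},  # trailing space
]

# The two list names look identical in Excel but are different strings.
same_list = choices[0]["list_name"] == choices[1]["list_name"]
flagged = cells_with_extra_spaces(choices)
print(same_list)
print(flagged)
```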

Tests for Choice List Practices

These tests are related to the choice lists used in select_one and select_multiple fields. The ODK syntax is very lenient when it comes to choice lists, and it lets some undesirable practices pass. For example, unused lists and duplicate labels could mean that list elements were copied and pasted accidentally. The command reports these, as they are common sources of errors.

Unused Choice Lists

This test makes sure that all lists defined in the choices sheet are actually used in at least one select_one or select_multiple field in the survey sheet. It is not incorrect to have unused lists, but it is likely a sign that something in your choice lists has not been kept up to date, which might cause an error, unexpected behavior, or list items not being displayed during the survey.

For example, imagine you have 10 villages in a choice list called village but you incorrectly type vilage for one of them. Then, according to ODK syntax you have two lists, one called village with 9 items and one called vilage with 1 item. It is unlikely that there is a select_one or select_multiple field that uses the choice list vilage, so listing unused choice lists is a good way to spot a typo like this one.
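The village example can be reproduced with a small set comparison between the lists defined in the choices sheet and the lists referenced in the survey sheet. This Python sketch is an assumption about how such a test could work, not the command's code. The function name unused_choice_lists and the list_name key are chosen for this example.

```python
def unused_choice_lists(survey_rows, choice_rows):
    """Return the choice lists that no select_one or select_multiple field uses."""
    used = set()
    for row in survey_rows:
        parts = row["type"].split()
        if len(parts) >= 2 and parts[0] in ("select_one", "select_multiple"):
            used.add(parts[1])
    defined = {row["list_name"] for row in choice_rows}
    return sorted(defined - used)


survey = [
    {"type": "select_one village", "name": "hh_village"},
    {"type": "integer", "name": "hh_size"},
]

# Ten villages, one of which was typed into the list "vilage" by mistake
choices = [{"list_name": "village", "value": str(i)} for i in range(1, 10)]
choices.append({"list_name": "vilage", "value": "10"})

unused = unused_choice_lists(survey, choices)
print(unused)
```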

Value/Name Numeric

In Stata, categorical data is best and most efficiently stored as a number with a value label. The easiest way to ensure that is the case with data collected by SurveyCTO is to use the Stata data import file provided through SurveyCTO Sync. However, this file only works if the values in the value/name column in the choices sheet are numeric. It is not incorrect to use string variables here, but you will have to spend more time cleaning your dataset to follow Stata best practices. This test lists all list items that have a non-numeric value in the value/name column.
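A check of this kind might look like the Python sketch below. It assumes that "numeric" means a whole number, since Stata value labels attach to integer values, and the command itself may define it differently. The function name non_numeric_values, the value key, and the example list items are hypothetical.

```python
import re


def non_numeric_values(choice_rows, value_column="value"):
    """Return (list_name, value) for list items whose value is not a whole number."""
    flagged = []
    for row in choice_rows:
        value = str(row[value_column]).strip()
        if not re.fullmatch(r"-?\d+", value):
            flagged.append((row["list_name"], value))
    return flagged


choices = [
    {"list_name": "crop", "value": "1", "label": "Maize"},
    {"list_name": "crop", "value": "-99", "label": "Declined to answer"},
    {"list_name": "crop", "value": "other", "label": "Other"},
]

flagged_values = non_numeric_values(choices)
print(flagged_values)
```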

Duplicated List Code

This test makes sure that there are no duplicates in list names and codes in the choice sheet. This test lists all list items that have other list items with the same two values in the name and code columns.

Duplicated List Labels

This test makes sure that there is no label in the same list that is identical (i.e. one label that is listed twice for the same choice list but with different codes). This test lists all list items that have other list items with the same two values in the name and label columns.
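Both duplicate tests compare pairs of values within a list, so one helper can cover them. The Python sketch below is illustrative only and is not the command's code. The function name duplicated_items, the column keys, and the example list items are assumptions.

```python
from collections import Counter


def duplicated_items(choice_rows, column):
    """Return list items that share both list name and the given column with another item."""
    counts = Counter((row["list_name"], row[column]) for row in choice_rows)
    return [row for row in choice_rows if counts[(row["list_name"], row[column])] > 1]


choices = [
    {"list_name": "region", "value": "1", "label": "North"},
    {"list_name": "region", "value": "2", "label": "South"},
    {"list_name": "region", "value": "2", "label": "East"},   # code 2 is used twice
    {"list_name": "region", "value": "4", "label": "North"},  # label North is used twice
]

duplicate_codes = [row["label"] for row in duplicated_items(choices, "value")]
duplicate_labels = [row["value"] for row in duplicated_items(choices, "label")]
print(duplicate_codes)
print(duplicate_labels)
```

The same code or label appearing in two different lists is not flagged, because the list name is part of the comparison.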

Missing Labels or Value/Name in Choice Lists

The first part of this test makes sure that there is no list item that has a value in the label column but no value in the value/name column. The second part of this test makes sure the opposite does not happen. This is especially likely to occur when a form is programmed in multiple languages. This test lists all list items caught by either of these two tests.
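The sketch below shows one way to express both parts of this test in Python for a form with several label columns. It is an assumption, not the command's implementation: it treats every column whose name starts with label as a label column, and the function name incomplete_list_items and the example rows are hypothetical.

```python
def incomplete_list_items(choice_rows, value_column="value"):
    """Return (row_number, problem) for list items missing a value or a label."""
    flagged = []
    for row_number, row in enumerate(choice_rows, start=2):  # row 1 holds the headers
        labels = [str(v).strip() for k, v in row.items() if k.startswith("label")]
        has_value = str(row.get(value_column, "")).strip() != ""
        if any(labels) and not has_value:
            flagged.append((row_number, "label without value"))
        elif has_value and not all(labels):
            flagged.append((row_number, "value without label"))
    return flagged


choices = [
    {"list_name": "yesno", "value": "1", "label:english": "Yes", "label:stata": "Yes"},
    {"list_name": "yesno", "value": "0", "label:english": "No", "label:stata": ""},
    {"list_name": "yesno", "value": "", "label:english": "Refused", "label:stata": "Refused"},
]

incomplete = incomplete_list_items(choices)
print(incomplete)
```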

Back to Parent

This article is part of the topic iefieldkit

Additional Resources