Difference between revisions of "Ietestform"

Revision as of 23:07, 18 June 2019

ietestform is a Stata command used to test ODK-based SurveyCTO forms during questionnaire programming and before fieldwork. In particular, it tests for best practices in coding, naming and labeling, and choice lists. Researchers should implement the command after using SurveyCTO’s server test feature. This article describes how and when to use the command, and the reasoning for specific tests the command performs.

Read First

  • Install ietestform by typing ssc install ietestform in Stata.
  • For instructions and available options, type help ietestform in Stata.
  • This command is part of the package iefieldkit; to install all the commands in this package, including ietestform, type ssc install iefieldkit in Stata.


ietestform is a Stata command used to test ODK-based SurveyCTO forms during questionnaire programming and before fieldwork. The command tests for coding practices, naming and labeling practices, and choice lists. Among other things, it checks for potential typos that lead to unintended form logic, whether the data generated will be in a Stata-suitable format, and whether best practices are used in the form. The command then outputs a test report in .csv format. ietestform is intended to be used after SurveyCTO's server test feature rather than in place of it.

To use this command in Stata, simply specify your SurveyCTO form and the path to which you want to write the report:

ietestform , surveyform("/path/to/surveyform.xlsx") report("/path/to/report.csv")

Note that the command may flag a feature that is indeed the best option for your particular case. Interpret the output conscientiously: be sure to understand why each case was listed and decide whether to modify the form or not accordingly. If you are not sure why something was caught by the tests, read the explanations of each test below. If you think that the command incorrectly catches cases in your SurveyCTO form, please report the case here and DIME Analytics will happily work on improving the command.

Tests for Coding Practices

This section describes the ietestform tests on ODK programming language. These tests flag risks of error that may interrupt field work. Note that ietestform assumes that the ODK syntax is already tested and is correct; it is intended to be used after the form has passed the ODK syntax test on SurveyCTO's server.

Required Column

The required column ensures that the enumerator cannot proceed until a response has been filled in for that field. This is a great data quality feature, as it prevents incomplete forms from being submitted and helps ensure that enumerators fill in the forms in the right order.

While you can fill in a value in the required column for any field, only field types with a view (i.e. those that show up when filling in a form) are affected by that value. Examples of fields without a view are begin_group, end_repeat, text_audit, calculate, deviceid, caseid, etc. All fields without a view are ignored in the tests related to the required column.

ietestform runs two tests related to the required column.

All Non-Note Fields Required

This tests that all fields that are not of type note have the value "Yes" in the required column. It then outputs a list to the report for all fields that are not required and not of type note.

Even when "no answer" is a valid response from the respondent, never use the absence of a recorded answer to represent that; when applicable, use a valid method to record that the respondent’s answer was "no answer".

Some fields that are commonly and intentionally left not required are fields that use the GPS (i.e. geopoint, geoshape and geotrace). If you know that the devices used for data collection will have no problem collecting GPS coordinates, keep these fields required to ensure you will get valid data points. However, where GPS coordinates will be difficult to collect due to connection issues, it may be a good idea to not require these fields, so that the enumerator can still complete the other fields and submit the form even when it was not possible to record GPS coordinates.
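The logic of this test can be sketched in a few lines of Python. This is only an illustration, not ietestform's actual implementation: the survey sheet is assumed to be a list of dictionaries with type, name and required keys, and the set of field types without a view is an assumption built from the examples given above.

```python
# Field types without a view. The text above lists only examples, so this
# set is an assumption, not ietestform's actual list.
NO_VIEW_TYPES = {"begin_group", "end_group", "begin_repeat", "end_repeat",
                 "text_audit", "calculate", "deviceid", "caseid"}

def fields_not_required(rows):
    """Return names of non-note fields with a view that are not required."""
    flagged = []
    for row in rows:
        field_type = row["type"].split()[0]  # "select_one village" -> "select_one"
        if field_type in NO_VIEW_TYPES or field_type == "note":
            continue
        if row.get("required", "").strip().lower() != "yes":
            flagged.append(row["name"])
    return flagged

survey = [
    {"type": "text", "name": "resp_name", "required": "yes"},
    {"type": "geopoint", "name": "gps", "required": ""},
    {"type": "note", "name": "intro", "required": ""},
    {"type": "calculate", "name": "age_months", "required": ""},
]
print(fields_not_required(survey))  # ['gps']
```

The GPS field is listed even if it was left not required on purpose, which is why each listed field needs to be reviewed rather than changed automatically.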

No Note Fields Required

Fields of type note have a view and can therefore be required. However, there is no way to record data in a note field, so there is no way to pass a required note-field. While this feature can sometimes be put to great use (see below), it is generally problematic. ietestform writes a list to the report of all fields that are of type note and are required.

While required notes will always be listed, there are cases when they are really useful. Since they are not possible to pass, they can be used together with a relevance condition so that they show up if something earlier in the form is not correct and the enumerator should be forced to go back and correct before continuing the data collection.

For example, enumerators are often asked to enter respondent IDs twice to be extra careful that there is no typo in the ID. Let's say those two double entry fields are id1 and id2. They can then be followed by a required note field that has the relevance expression ${id1} != ${id2}, so that the note only shows if the two IDs are not identical. The note label can then inform the enumerator that the two ID fields are not identical and that the enumerator must go back and change the values in order to continue.

The same functionality could have been achieved using the constraint condition on the second ID field, where the ID is re-entered. However, the label in the note field can be made more informative than the constraint message. In addition, when the condition is more complex than testing that two fields are identical, the required-note method is easier: intermediate calculate fields can do the work, and their results can then be used in the relevance column of the required note field.
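As a rough illustration of what this test reports, the Python sketch below lists required note fields in a small form that uses the double-entry pattern described above. The row layout (a list of dictionaries) and the field names id_mismatch and welcome are hypothetical, and this is not ietestform's own code.

```python
def required_notes(rows):
    """Return names of fields of type note that are marked as required."""
    return [row["name"] for row in rows
            if row["type"] == "note"
            and row.get("required", "").strip().lower() == "yes"]

survey = [
    {"type": "text", "name": "id1", "required": "yes", "relevance": ""},
    {"type": "text", "name": "id2", "required": "yes", "relevance": ""},
    # Intentional required note: only shown when the two IDs differ
    {"type": "note", "name": "id_mismatch", "required": "yes",
     "relevance": "${id1} != ${id2}"},
    {"type": "note", "name": "welcome", "required": "", "relevance": ""},
]
print(required_notes(survey))  # ['id_mismatch']
```

Here id_mismatch is listed although it is used deliberately; a listed required note without a relevance condition would deserve a closer look.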

Numeric Ranges

not implemented yet

All numeric fields (integer or decimal) should have ranges for acceptable values in the constraint column. Make this range wider than what you expect it to be! The range in the constraint column should be used to prevent typos and illogical values (like a negative age), but not to force the data to be within your preexisting expectations. Your preexisting expectation is a good starting point for this range, but make it much wider, as we do not yet know what special cases your data collection will encounter in the field, and these outliers can be important for your research.
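Since this test is not implemented yet, the following Python sketch is purely hypothetical. It shows one way such a check could work, by listing numeric fields with an empty constraint column; the row layout and the example constraint are assumptions for illustration.

```python
NUMERIC_TYPES = {"integer", "decimal"}

def numeric_fields_without_constraint(rows):
    """Return names of integer/decimal fields with an empty constraint."""
    return [row["name"] for row in rows
            if row["type"] in NUMERIC_TYPES
            and not row.get("constraint", "").strip()]

survey = [
    # Wide range: blocks typos and negative ages, not unusual respondents
    {"type": "integer", "name": "age", "constraint": ". >= 0 and . <= 120"},
    {"type": "decimal", "name": "plot_area", "constraint": ""},
    {"type": "text", "name": "resp_name", "constraint": ""},
]
print(numeric_fields_without_constraint(survey))  # ['plot_area']
```

A check like this only detects that a constraint exists; whether the range is wide enough remains a judgment call.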

Matching begin_/end_

The main aspect of this test is done by the ODK syntax tester on SurveyCTO's server, but the error messages for this error are not always useful, especially when the form is very large. One of the main reasons for this might be that ODK does not require the end_group and end_repeat fields to have field names. So the first part of this test requires all end_group and end_repeat fields to have a value in the field name column, and the second part requires that name to be identical to the name of the corresponding begin_group or begin_repeat field.

The main part of this test checks that every begin_group is matched by an end_group and that every begin_repeat is matched by an end_repeat. To be considered matched, the following three criteria need to be fulfilled:

  1. For each begin_ there is an end_.
  2. The corresponding end_ is of the correct type, so that a begin_group is not closed by an end_repeat and a begin_repeat is not closed by an end_group.
  3. The end_ name matches the begin_ name. SurveyCTO's server makes sure that the begin_ names are unique, so each pair will be unique if this part of the test is passed.
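The three criteria above can be checked with a stack, as in the following Python sketch. It is an illustration under assumptions, not ietestform's implementation: rows are assumed to be dictionaries with type and name keys, and the wording of the messages is made up.

```python
def check_begin_end(rows):
    """Return problem descriptions; an empty list means all pairs match."""
    problems = []
    stack = []  # open (kind, name) pairs, innermost last
    for row in rows:
        field_type, name = row["type"], row.get("name", "")
        if field_type in ("begin_group", "begin_repeat"):
            stack.append((field_type[len("begin_"):], name))
        elif field_type in ("end_group", "end_repeat"):
            kind = field_type[len("end_"):]
            if not name:
                problems.append(f"{field_type} has no name")
            if not stack:  # criterion 1: an end_ without any open begin_
                problems.append(f"{field_type} '{name}' has no matching begin_")
                continue
            open_kind, open_name = stack.pop()
            if open_kind != kind:  # criterion 2: wrong type of end_
                problems.append(
                    f"begin_{open_kind} '{open_name}' closed by {field_type}")
            elif name and name != open_name:  # criterion 3: names differ
                problems.append(
                    f"begin_{open_kind} '{open_name}' closed by {field_type} '{name}'")
    for open_kind, open_name in stack:  # criterion 1: begin_ never closed
        problems.append(f"begin_{open_kind} '{open_name}' is never closed")
    return problems

good = [
    {"type": "begin_group", "name": "hh"},
    {"type": "begin_repeat", "name": "members"},
    {"type": "text", "name": "member_name"},
    {"type": "end_repeat", "name": "members"},
    {"type": "end_group", "name": "hh"},
]
bad = [
    {"type": "begin_group", "name": "hh"},
    {"type": "end_repeat", "name": "hh"},
]
print(check_begin_end(good))  # []
print(check_begin_end(bad))   # ["begin_group 'hh' closed by end_repeat"]
```

Because each message names the begin_ field involved, the problem is easier to locate in a large form than with a generic syntax error.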

Tests for Naming and Labeling Practices

ODK has very few restrictions on names apart from that all names must be unique and that there are a few characters that are not allowed. All of those restrictions are tested by the ODK syntax test on SurveyCTO's server. The additional tests done by this command are mainly due to the additional rules for names that Stata has that come into effect when importing your data to Stata.

Field Name Length

Stata limits variable names to 32 characters. If a field name is longer than that, Stata will truncate the name, or replace it with a generic name in the format var1, var2 etc. if the name is no longer unique after being truncated. All of these cases can be resolved in Stata, but it is much simpler to solve this before starting to collect the data. This test lists all fields with names longer than 32 characters.

Repeat Group Field Name Length

This test has two parts. The first part lists fields in repeat groups that have names that will be too long in the wide format when imported to Stata. The second part lists fields in repeat groups where the risk of too long names is high, but not certain.

When using SurveyCTO's Stata import do-file or when exporting the data set in wide format, all variables in a repeat group will have a suffix added to the variable name. For example, if a repeat group is repeated three times, then in the wide data set any variable in that repeat group will generate three variables, with the suffixes _1, _2 and _3 respectively. This suffix also counts towards the 32-character limit for variable names in Stata discussed in the previous test. So a variable in a repeat group may have a field name of at most 30 characters. If the field is in a nested repeat group (a repeat group inside a repeat group) then it will be suffixed once for each repeat group. So the actual constraint used in this test is given by this formula: 32 - (2 * number of nested repeat groups for the field). This test lists all variables that have names longer than that constraint.

The first test assumes that there are no more than 9 iterations in each repeat group. If there are more than 9, the suffixes will be _10, _11 etc., which take up three characters. So the second test lists all fields that have a field name longer than 32 - (3 * number of nested repeat groups for the field). Whether this will create an issue with long names is uncertain, but if your names are so long that they might be caught in this test, then it is probably a best practice to try to make the names shorter.
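The two formulas can be made concrete with a short Python sketch. The function names and the "too long" / "at risk" labels are illustrative assumptions, not terms used in the ietestform report.

```python
STATA_MAX_NAME = 32  # Stata's limit on variable name length

def max_name_length(repeat_depth, suffix_chars=2):
    """Longest safe field name given the number of enclosing repeat groups.

    suffix_chars is 2 for suffixes like _1 and 3 for suffixes like _10.
    """
    return STATA_MAX_NAME - suffix_chars * repeat_depth

def classify_name(name, repeat_depth):
    """Classify a field name as 'too long', 'at risk' or 'ok'."""
    if len(name) > max_name_length(repeat_depth, 2):
        return "too long"   # fails even with single-digit suffixes
    if len(name) > max_name_length(repeat_depth, 3):
        return "at risk"    # fails once any repeat has 10 or more iterations
    return "ok"

print(max_name_length(1))            # 30
print(max_name_length(2))            # 28
print(classify_name("a" * 31, 1))    # too long
print(classify_name("a" * 30, 1))    # at risk
print(classify_name("a" * 29, 1))    # ok
```

With a repeat depth of 0 both limits equal 32, so the same function also covers the plain field name length test.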

Repeat Group Name Conflict

This test checks for name conflicts that could result from the suffixes that are added to fields inside a repeat group. SurveyCTO's ODK syntax tester tests that all names are unique. The names myvar and myvar_1 are not duplicates in the ODK syntax test, but if myvar is in a repeat group, it will be suffixed with _1 for the first iteration of that variable, and that will create a name conflict with the variable created from field myvar_1.

This test lists all fields inside a repeat group for which there is another field where there is a risk for this type of name conflict. For example, if there is a field with name myvar, it tests if there is any variable in the format myvar_#, where # is one or several digits.

If the variable myvar is in a nested repeat group (a repeat group inside a repeat group), then it is testing for myvar_#, myvar_#_#, myvar_#_#_# etc. for each level of nested repeat group, where # is one or several digits.

Technical special case: If the fields myvar and myvar_1 are both in a non-nested repeat group then there will be no name conflict as the first iteration of both fields will generate the variables myvar_1 and myvar_1_1 as the variables from both fields are suffixed. These fields are still listed by this test as it will be confusing that the variable myvar_1 is from field myvar and not from the myvar_1 that has the same name, even though this is technically not a name conflict.
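A regular expression is one way to express this check. The Python sketch below is an interpretation of the description above, not ietestform's code: it assumes a mapping from field name to the number of repeat groups the field is nested in, and it allows between one and that many _# suffixes.

```python
import re

def repeat_name_conflicts(fields):
    """Return (field, conflicting field) pairs.

    fields maps each field name to the number of repeat groups around it.
    """
    conflicts = []
    for name, depth in fields.items():
        if depth == 0:
            continue  # only fields inside a repeat group get suffixes
        # One to `depth` suffixes, each "_" followed by one or more digits
        pattern = re.compile(re.escape(name) + r"(_\d+){1,%d}" % depth)
        for other in fields:
            if other != name and pattern.fullmatch(other):
                conflicts.append((name, other))
    return conflicts

fields = {
    "myvar": 1,       # inside one repeat group
    "myvar_1": 0,     # outside any repeat group
    "inner": 2,       # inside a nested repeat group
    "inner_1_2": 0,
    "inner_x": 0,     # not a conflict: suffix is not numeric
}
print(repeat_name_conflicts(fields))
# [('myvar', 'myvar_1'), ('inner', 'inner_1_2')]
```

Like the test described above, this sketch also lists the technical special case where both fields are in the same repeat group, since it only compares names.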

Stata Labels Columns

In SurveyCTO, you can program your form so that multiple languages can be displayed when filling in a form. This is done by creating label columns named label:english, label:swahili, label:hindi etc. When you export your data using SurveyCTO Sync you can choose which language you want to use for labels.

The same feature can be used to create Stata labels by adding a label language called label:stata. Labels can obviously be added and modified once the data set has been imported to Stata. However, our experience is that this is the simplest way to add them, and if this practice is not used, the data set is often not properly labeled.

If you do not use this practice, but still use SurveyCTO's Stata code for importing data sets to Stata, you will end up having the labels displayed in the questionnaire as labels for your Stata variable. While it is better than no labels, label:stata allows better variable labeling.

Survey Sheet Stata Labels

Stata restricts variable labels to 80 characters. If you apply a label longer than that, it will be truncated. For this reason, this test lists all fields with a label in the Stata label column that is longer than 80 characters.
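A minimal Python sketch of this check, assuming the survey sheet has been read into a list of dictionaries keyed by column name (the field names and labels are made up):

```python
STATA_MAX_LABEL = 80  # Stata variable label limit stated above

def long_stata_labels(rows, column="label:stata"):
    """Return names of fields whose Stata label exceeds the limit."""
    return [row["name"] for row in rows
            if len(row.get(column, "")) > STATA_MAX_LABEL]

survey = [
    {"name": "hh_size", "label:stata": "Household size"},
    {"name": "income",
     "label:stata": "Total household income from all sources in the past "
                    "twelve months, in local currency units"},
]
print(long_stata_labels(survey))  # ['income']
```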

Choice Sheet Stata Labels

Apart from whether the label:stata column exists or not, there is no further test on the values of the Stata label column in the choice sheet.

Leading and Trailing Spaces

To a computer, the strings "ABC" and "ABC " are different. This difference does not show in Excel, and when you upload your form to SurveyCTO's server the form checker is programmed to handle it. However, when you import your form to Stata, as ietestform and several other commands do, it makes a difference.

For example, suppose you have a list in the choice sheet called village, but the actual content of the cell is "village ". In Excel you will not see this extra space unless you really look for it. Some tools, probably most of them, will treat this as "village", but other tools might treat it as "village ", and the two do not compare as equal.

It is even worse if some list items in the village list have the list name value "village" and some have the value "village ". This is very difficult to spot in Excel, but some tools might treat these as different lists.

Leading (" ABC") or trailing ("ABC ") spaces are not difficult to deal with, and most tools, ietestform included, handle them. However, there is no guarantee that all tools do, so to reduce the risk of errors in whatever tools you use on your data in the future, leading and trailing spaces should be removed.
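The difference is easy to demonstrate. The Python sketch below flags every cell whose content changes when surrounding whitespace is stripped; the row layout and the example values are assumptions for illustration.

```python
def cells_with_outer_spaces(rows):
    """Return (row index, column, value) for cells with outer whitespace."""
    flagged = []
    for index, row in enumerate(rows):
        for column, value in row.items():
            # str.strip() also removes tabs and line breaks, not only spaces
            if isinstance(value, str) and value != value.strip():
                flagged.append((index, column, value))
    return flagged

choices = [
    {"list_name": "village", "value": "1", "label": "Kigali"},
    {"list_name": "village ", "value": "2", "label": " Huye"},
]
print("village" == "village ")  # False
print(cells_with_outer_spaces(choices))
# [(1, 'list_name', 'village '), (1, 'label', ' Huye')]
```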

Tests for Choice List Practices

These are all tests related to the choice lists used in select_one and select_multiple fields. The ODK syntax is very lenient when it comes to choice lists, and it lets some undesirable practices pass. For example, unused lists and duplicate labels could mean that list elements were copied and pasted accidentally. The command reports on these, as they are common sources of errors.

Unused Choice Lists

This test makes sure that all lists defined in the choices sheet are actually used in at least one select_one or select_multiple field in the survey sheet. It is not incorrect to have unused lists, but it is likely a sign that something in your choice lists is not kept up to date, which might cause an error, unexpected behavior, or list items not being displayed during the survey.

For example, suppose you have 10 villages in a choice list called village, but you incorrectly type vilage for one of them. Then, according to ODK syntax, you have two lists: one called village with 9 items and one called vilage with 1 item. It is unlikely that there is a select_one or select_multiple field that uses the choice list vilage, so listing unused choice lists is a good way to spot a typo like this one.
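The vilage example can be reproduced with a set difference, as in this Python sketch. It assumes both sheets are available as lists of dictionaries and that the list name column is called list_name; it is an illustration, not ietestform's implementation.

```python
def unused_choice_lists(survey_rows, choice_rows):
    """Return list names defined in the choices sheet but never used."""
    used = set()
    for row in survey_rows:
        parts = row["type"].split()  # e.g. ["select_one", "village"]
        if len(parts) >= 2 and parts[0] in ("select_one", "select_multiple"):
            used.add(parts[1])
    defined = {row["list_name"] for row in choice_rows}
    return sorted(defined - used)

survey = [
    {"type": "select_one village", "name": "hh_village"},
    {"type": "text", "name": "resp_name"},
]
choices = [
    {"list_name": "village", "value": "1", "label": "Kigali"},
    {"list_name": "village", "value": "2", "label": "Huye"},
    {"list_name": "vilage", "value": "3", "label": "Musanze"},  # typo
]
print(unused_choice_lists(survey, choices))  # ['vilage']
```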

Value/Name Numeric

In Stata, categorical data is best and most efficiently stored as a number with a value label. The easiest way to ensure that this is the case with data collected by SurveyCTO is to use the Stata data import file they provide through SurveyCTO Sync, but that only works if the values in the value/name column in the choices sheet are numeric. It is not incorrect to use string values, but you will have to spend more time cleaning your data set to follow Stata best practices. This test lists all list items that have a non-numeric value in the value/name column.
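The Python sketch below illustrates the idea. What counts as numeric is an assumption here: the pattern accepts integers only, since Stata value labels attach to integer values, and ietestform's own rule may differ.

```python
import re

def non_numeric_values(choice_rows):
    """Return (list name, value) for list items whose value is not an integer."""
    return [(row["list_name"], row["value"])
            for row in choice_rows
            if not re.fullmatch(r"-?\d+", row["value"].strip())]

choices = [
    {"list_name": "yesno", "value": "1", "label": "Yes"},
    {"list_name": "yesno", "value": "0", "label": "No"},
    {"list_name": "crop", "value": "maize", "label": "Maize"},
]
print(non_numeric_values(choices))  # [('crop', 'maize')]
```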

Duplicated List Code

This test makes sure that there are no duplicates in list names and codes in the choice sheet. This test lists all list items that have other list items with the same two values in the name and code columns.

Duplicated List Labels

This test makes sure that there is no label in the same list that is identical, i.e. one label that is listed twice for the same choice list but with different codes. This test lists all list items that have other list items with the same two values in the name and label columns.
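Both duplicate tests come down to counting pairs of values within a list. A minimal Python sketch, assuming the choices sheet is a list of dictionaries with list_name, value and label columns (not ietestform's actual code):

```python
from collections import Counter

def duplicated_pairs(choice_rows, column):
    """Return (list name, column value) pairs that occur more than once."""
    counts = Counter((row["list_name"], row[column]) for row in choice_rows)
    return sorted(pair for pair, n in counts.items() if n > 1)

choices = [
    {"list_name": "yesno", "value": "1", "label": "Yes"},
    {"list_name": "yesno", "value": "1", "label": "No"},        # duplicated code
    {"list_name": "village", "value": "1", "label": "Kigali"},
    {"list_name": "village", "value": "2", "label": "Kigali"},  # duplicated label
]
print(duplicated_pairs(choices, "value"))  # [('yesno', '1')]
print(duplicated_pairs(choices, "label"))  # [('village', 'Kigali')]
```

The code 1 appearing in both yesno and village is not reported, because duplicates only matter within the same list.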

Missing Labels or Value/Name in Choice Lists

The first part of this test makes sure that there is no list item that has a value in the label column but no value in the value/name column. The second part of this test makes sure the opposite does not happen. This is especially likely to occur when a form is programmed in multiple languages. This test lists all list items caught by either of these two tests.
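Both parts can be expressed as a single comparison, as in this Python sketch. The row layout is an assumption, and the label_column parameter is a hypothetical way to repeat the check for each label language.

```python
def incomplete_items(choice_rows, label_column="label"):
    """Return indexes of list items with a label but no value, or vice versa."""
    flagged = []
    for index, row in enumerate(choice_rows):
        has_value = bool(row.get("value", "").strip())
        has_label = bool(row.get(label_column, "").strip())
        if has_value != has_label:  # exactly one of the two is filled in
            flagged.append(index)
    return flagged

choices = [
    {"list_name": "yesno", "value": "1", "label": "Yes"},
    {"list_name": "yesno", "value": "0", "label": ""},         # missing label
    {"list_name": "yesno", "value": "", "label": "Refused"},   # missing value
]
print(incomplete_items(choices))  # [1, 2]
```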

Back to Parent

This article is part of the topic iefieldkit

Additional Resources