# ietestform

DIME Analytics has created iefieldkit as a package in Stata to support the process of primary data collection from start to finish. In most cases, third party survey firms or local partners collect data on behalf of the research team. Therefore, data quality assurance is a particularly important aspect of data collection. ietestform allows the research team to test Open Data Kit (ODK)-based electronic survey forms for common errors, as well as best practices for SurveyCTO-based forms before field data collection starts. For example, the SurveyCTO server has a built-in test feature that tests the ODK syntax of a form when it is uploaded by the research team. ietestform complements these built-in tests to ensure that the collected data is in a format that is easily readable in Stata, and is of high quality.

• Stata coding practices.
• iefieldkit.
• To install ietestform, type ssc install ietestform in Stata.
• To install all the commands in the iefieldkit package, type ssc install iefieldkit in Stata.
• For instructions and available options, type help ietestform.

## Objective

A lot of researchers today use digital tools for primary data collection like the open-source Open Data Kit (ODK), or ODK-based platforms like SurveyCTO. In Open Data Kit (ODK)-based electronic survey kits, including SurveyCTO, survey forms (or questionnaires) are typically built in Excel using a specialized structured syntax. Before the research team starts with field data collection, they can use ietestform to test Open Data Kit (ODK)-based electronic survey forms for common errors, as well as best practices for SurveyCTO-based forms.

For example, the SurveyCTO server has a built-in test feature that tests the ODK syntax of a form when it is uploaded by the research team. ietestform complements these built-in tests to ensure that the collected data is in a format that is easily readable in Stata, and is of high quality. Therefore, the ietestform command should be used after testing the survey form on a SurveyCTO server to make sure there are no syntax errors. The syntax for the command is:

    ietestform, surveyform("filename.xlsx") report("report.csv")


The ietestform command outputs a test report in CSV format, with flags indicating potentially improper practices. CSV displays well in many software applications and works well with version-tracking software like Git. Some of the report entries flag code errors, and others detect practices that are not strictly wrong, but that may indicate potential errors or bad practices (and are therefore intended for manual review). There will often be cases where the command flags a line as suspicious, but it is in fact the best way to construct the questionnaire. The goal of ietestform is not to produce a report with no flags, but to ensure that practices that may cause serious problems if used unintentionally or incorrectly are validated for functionality.

Note that ietestform may flag a feature that is indeed the best option for your particular case. Interpret the output conscientiously: be sure to understand why each case was flagged and decide whether to modify the form or not accordingly. If you are not sure why something was flagged, read the explanations of each test below. If you think that the command incorrectly flagged cases in your SurveyCTO form, please report the case here and DIME Analytics will happily work on improving the command.

## Tests for Coding Practices

This section describes the ietestform tests of ODK programming practices. These tests flag risks of error that may interrupt field work. Note that ietestform assumes that the ODK syntax is already tested and is correct; it is intended to be used after the form has passed the ODK syntax test on SurveyCTO's server.

### Required Column

The required column ensures that the enumerator cannot proceed before a response has been entered into the field. This prevents submissions of incomplete forms and helps ensure that enumerators complete forms in the right order.

While you can fill in a value in the required column for any field, only field types with a view (i.e. fields that show up when filling in a form) are affected by that value. Examples of fields without a view are begin_group, end_repeat, text_audit, calculate, deviceid, caseid, etc. All fields without a view are ignored in the tests related to the required column.

ietestform runs two tests related to the required column.

#### All Non-Note Fields Required

This tests that all fields that are not of type note have the value "Yes" in the required column. It then outputs a list to the report for all fields that are not required and not of type note. Note that even when "no answer" is a valid response from the respondent, never use the absence of a recorded answer to represent that; when applicable, use a valid method to record that the respondent’s answer was "no answer".

Sometimes, researchers choose to intentionally leave GPS fields as not required (i.e. geopoint, geoshape and geotrace). If you know that the devices used for data collection will have no problem collecting GPS coordinates, keep these fields required. However, if GPS coordinates will be difficult to collect due to, for example, connection issues, it may be a good idea to not require these fields so that the enumerator can still complete the other fields and submit the form even when they cannot record GPS coordinates.
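The logic of this test can be sketched in Python as follows. This is only an illustration, not the code ietestform runs: the survey sheet is assumed to be available as a list of dictionaries, one per row, and the `NO_VIEW_TYPES` set and the function name are hypothetical.

```python
# Illustrative, non-exhaustive set of field types without a view
# (based on the examples given above; an assumption, not ietestform's list).
NO_VIEW_TYPES = {
    "begin_group", "end_group", "begin_repeat", "end_repeat",
    "text_audit", "calculate", "deviceid", "caseid",
}

def not_required_fields(survey_rows):
    """Return names of fields that have a view, are not notes, and are not required."""
    flagged = []
    for row in survey_rows:
        field_type = row["type"].strip()
        if field_type in NO_VIEW_TYPES or field_type == "note":
            continue  # ignored by this test
        if row.get("required", "").strip().lower() != "yes":
            flagged.append(row["name"])
    return flagged

survey = [
    {"type": "text", "name": "resp_name", "required": "yes"},
    {"type": "integer", "name": "age", "required": ""},
    {"type": "note", "name": "intro", "required": ""},
    {"type": "calculate", "name": "age_sq", "required": ""},
    {"type": "geopoint", "name": "gps", "required": ""},
]
print(not_required_fields(survey))
```

Here `age` and `gps` are listed; whether `gps` should stay non-required is exactly the kind of decision left to manual review.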

#### No Note Fields Required

Fields of type note have a view and can therefore be required. However, there is no way to record data in a note field, so there is no way to pass a required note-field. While this feature can sometimes be put to great use (see below), it is generally problematic. ietestform writes a list to the report of all fields that are of type note and are required.

Note that there are cases in which required note fields may be really useful. Since enumerators cannot pass these fields, researchers may use them with a relevance condition so that they show up if an earlier entry in the form is incorrect. This forces the enumerator to go back and correct the error before continuing data collection.

For example, enumerators are often asked to enter respondent IDs twice to be extra careful that there is no typo in the ID. Let's say those two double entry fields are id1 and id2. These fields can be followed by a required note field that has the relevance expression ${id1} != ${id2}; then, the note will only appear if the two IDs are not identical. The note label can then inform the enumerator that the two ID fields are not identical and that the enumerator must go back and change the values in order to continue.

In this case, researchers could also use the constraint condition on the second ID field when the ID is re-entered. However, the message in the required note field approach could be more informative than the message in the constraint condition. Further, when the conditional test is more difficult than just testing that two fields are identical, the required note field method is an easier approach than using intermediate calculate and relevance fields.
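A minimal Python sketch of the required-note test is shown here, assuming the survey sheet is held as a list of row dictionaries (a hypothetical representation, not how ietestform stores the form). It returns each required note together with its relevance expression, so that a reviewer can tell an intentional "blocking" note from an accidental one.

```python
def required_note_fields(survey_rows):
    """Return (name, relevance) for every field of type note that is required."""
    return [
        (row["name"], row.get("relevance", ""))
        for row in survey_rows
        if row["type"].strip() == "note"
        and row.get("required", "").strip().lower() == "yes"
    ]

survey = [
    {"type": "text", "name": "id1", "required": "yes", "relevance": ""},
    {"type": "text", "name": "id2", "required": "yes", "relevance": ""},
    # Intentional blocking note: only shown when the two IDs differ.
    {"type": "note", "name": "id_mismatch", "required": "yes",
     "relevance": "${id1} != ${id2}"},
    {"type": "note", "name": "welcome", "required": "", "relevance": ""},
]
print(required_note_fields(survey))
```

A required note with an empty relevance expression would block every interview, so such rows deserve the closest look.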

### Numeric Ranges

Note: this test is not implemented yet.

All numeric fields, integer fields or decimal fields should have ranges for acceptable values in the constraint column. Make this range wider than what you expect it to be! The range in the constraint column should be used to prevent typos, to prevent illogical values (like negative age) but not to force the data to be within your preexisting expectations. Your preexisting expectation is a good starting point for this range, but make it much wider, as you do not yet know what special cases may exist; these outliers can be important for your research.

### Matching begin_/end_

While the ODK syntax tester on SurveyCTO's server tests for matching begin_ and end_ values, the error message for this error is not always useful, especially when the form is very large. The lack of clarity in these error messages may result from the fact that ODK does not require the end_group and end_repeat to have field names. So the first part of ietestform’s test for this characteristic checks that all end_group and end_repeat fields have a value in the field name column. The main part of this test checks that every begin_group is matched by an end_group and that every begin_repeat is matched by an end_repeat. To be considered matched, the following three criteria need to be fulfilled:

1. for each begin_ there is an end_
2. the corresponding end_ is of the correct type such that a begin_group is not closed by an end_repeat or a begin_repeat is not closed by an end_group.
3. the end_ names match the begin_ names. SurveyCTO's server makes sure that the begin_ names are unique, so each pair will be unique if this part of the test is passed.
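The three criteria above, plus the check that every end_ field has a name, can be sketched with a stack. This is an illustrative Python version under the assumption that the survey sheet is a list of row dictionaries with `type` and `name` keys; it is not ietestform's implementation, and the function name is hypothetical.

```python
def check_begin_end(survey_rows):
    """Return a list of messages describing begin_/end_ pairing problems."""
    problems = []
    stack = []  # currently open (kind, name) pairs, innermost last
    for position, row in enumerate(survey_rows, start=1):
        field_type = row["type"].strip()
        name = row.get("name", "").strip()
        if field_type in ("begin_group", "begin_repeat"):
            stack.append((field_type[len("begin_"):], name))
        elif field_type in ("end_group", "end_repeat"):
            kind = field_type[len("end_"):]
            if not name:
                problems.append(f"row {position}: {field_type} has no name")
            if not stack:  # criterion 1: an end_ without a begin_
                problems.append(f"row {position}: {field_type} has no matching begin_")
                continue
            open_kind, open_name = stack.pop()
            if open_kind != kind:  # criterion 2: wrong type
                problems.append(
                    f"row {position}: begin_{open_kind} '{open_name}' closed by {field_type}")
            elif name and name != open_name:  # criterion 3: names differ
                problems.append(
                    f"row {position}: {field_type} '{name}' closes begin_{open_kind} '{open_name}'")
    for open_kind, open_name in stack:  # criterion 1: a begin_ without an end_
        problems.append(f"begin_{open_kind} '{open_name}' is never closed")
    return problems

survey = [
    {"type": "begin_group", "name": "household"},
    {"type": "begin_repeat", "name": "members"},
    {"type": "text", "name": "member_name"},
    {"type": "end_group", "name": "members"},  # wrong type: should be end_repeat
    {"type": "end_group", "name": ""},         # unnamed end_group
]
for message in check_begin_end(survey):
    print(message)
```

Reporting the position and name of each mismatch is what makes this kind of check easier to act on than a generic syntax error in a very large form.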

## Tests for Naming and Labeling Practices

ODK has very few restrictions on names apart from that all names must be unique and that a few characters are not allowed. All of those restrictions are tested by the ODK syntax test on SurveyCTO's server. The additional tests done by ietestform are mainly due to the additional Stata naming rules that you will encounter when importing data to Stata.

### Field Name Length

Stata limits variable names to 32 characters. When data is imported, Stata truncates any field name longer than that, or replaces it with a generic name in the format var1, var2, etc. if the name is no longer unique after being truncated. While all of these cases can be resolved in Stata, it is much simpler to solve naming issues before starting to collect the data. While 32 characters are allowed, some common commands add one character to variable names when processing them, so we recommend a maximum of 31 characters. This test lists all fields with names longer than 31 characters.
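As a rough illustration of why long names matter, the following Python sketch lists names above the recommended 31 characters and shows two of them becoming identical once cut to 32 characters. The field names and the function name are made up for the example.

```python
MAX_NAME_LENGTH = 31  # recommended maximum from the text; Stata itself allows 32

def long_field_names(names, max_length=MAX_NAME_LENGTH):
    """Return the field names that are longer than max_length characters."""
    return [name for name in names if len(name) > max_length]

names = [
    "hh_head_age",
    "household_member_education_level_completed",
    "household_member_education_level_attending",
]
flagged = long_field_names(names)
print(flagged)

# Cutting both flagged names to 32 characters leaves a single distinct name,
# which is the situation where a generic replacement name would be needed.
truncated = {name[:32] for name in flagged}
print(truncated)
```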

### Repeat Group Field Name Length

This test has two parts. The first part lists fields in repeat groups that have names that will be too long in the wide format when imported to Stata. The second part lists fields in repeat groups where the risk of too long names is high, but not certain.

When using SurveyCTO's Stata import do-file or when exporting the data set in wide format, all variables in a repeat group will have a suffix added to the variable name. For example, if a repeat group is repeated three times, then in the wide dataset, any field in that repeat group will generate three variables, with the names suffixed with _1, _2 and _3 respectively. This suffix also counts towards the 31-character limit for variable names in Stata discussed in the previous test. (Technically 32 characters are allowed, but some common commands add one character to variable names when processing them, so we recommend a maximum of 31 characters.) Thus, any field in a repeat group should have a field name no longer than 29 characters. If the field is in a nested repeat group (a repeat group inside a repeat group), then it will be suffixed once for each repeat group. So the actual constraint used in this test is given by this formula: 31 - (2 * number of nested repeat groups for the field). This test lists all variables that have longer names than that constraint.

In the first test we assume that there are no more than 9 iterations in each repeat group; if there are more than 9, then the suffixes will be _10, _11 etc., which take up three characters. So the second test lists all fields that have a field name that is longer than 31 - (3 * number of nested repeat groups for the field). Whether this will create an issue with long names is uncertain, but if your names are so long that they might be caught in this test, then it is probably best practice to try to make the names shorter.
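The two formulas can be sketched in Python by tracking the repeat nesting depth while walking down the survey sheet. The row representation and function name are assumptions made for illustration; in this sketch the second list only contains fields not already in the first.

```python
def repeat_name_length_flags(survey_rows):
    """Return (too_long, at_risk) field names for fields inside repeat groups."""
    too_long, at_risk = [], []
    depth = 0  # number of repeat groups the current row is nested in
    for row in survey_rows:
        field_type = row["type"].strip()
        if field_type == "begin_repeat":
            depth += 1
        elif field_type == "end_repeat":
            depth -= 1
        elif depth > 0 and field_type not in ("begin_group", "end_group"):
            length = len(row["name"])
            if length > 31 - 2 * depth:    # suffixes _1 .. _9 already do not fit
                too_long.append(row["name"])
            elif length > 31 - 3 * depth:  # only a problem from _10 onwards
                at_risk.append(row["name"])
    return too_long, at_risk

survey = [
    {"type": "begin_repeat", "name": "plots"},
    {"type": "text", "name": "a" * 30},      # limit at depth 1 is 29
    {"type": "text", "name": "b" * 29},      # fits _1.._9 but not _10
    {"type": "text", "name": "c" * 28},      # fine
    {"type": "begin_repeat", "name": "crops"},
    {"type": "text", "name": "d" * 27},      # depth 2: limits are 27 and 25
    {"type": "end_repeat", "name": "crops"},
    {"type": "end_repeat", "name": "plots"},
    {"type": "text", "name": "e" * 31},      # outside any repeat group
]
too_long, at_risk = repeat_name_length_flags(survey)
print([len(name) for name in too_long], [len(name) for name in at_risk])
```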

### Repeat Group Name Conflict

This test checks for name conflicts that may result from the suffixes added to fields inside a repeat group. SurveyCTO's ODK syntax tester tests that all names are unique. The names myvar and myvar_1 are not duplicates in the ODK syntax test, but if myvar is in a repeat group, it will be suffixed with _1 for the first iteration of that variable; that will create a name conflict with the variable created from field myvar_1.

This test lists all fields inside a repeat group with which another field may conflict due to names. For example, if there is a field with name myvar, ietestform tests if there is any variable in the format myvar_#, where # is one or several digits.

If the variable myvar is in a nested repeat group (a repeat group inside a repeat group), then it is testing for myvar_#, myvar_#_#, myvar_#_#_# etc. for each level of nested repeat group, where # is one or several digits.

Technical special case: If the fields myvar and myvar_1 are both in a non-nested repeat group then there will be no name conflict: the first iteration of both fields will generate the variables myvar_1 and myvar_1_1 since the variables from both fields are suffixed. These fields are still listed by this test as it may be confusing that the variable myvar_1 is from the field myvar and not from myvar_1.
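The name patterns described in this section map naturally onto a regular expression. The Python sketch below is an illustration only: it assumes the nesting depth of each repeat field is already known, and the function and argument names are hypothetical.

```python
import re

def repeat_name_conflicts(repeat_fields, all_names):
    """Map each repeat field to other field names that look like its suffixed variables.

    repeat_fields: dict of field name -> repeat nesting depth (1 = not nested)
    all_names: every field name in the form
    """
    conflicts = {}
    for name, depth in repeat_fields.items():
        # name_#, name_#_#, ... with at most `depth` numeric suffixes
        pattern = re.compile(re.escape(name) + r"(_\d+){1,%d}" % depth)
        hits = sorted(
            other for other in all_names
            if other != name and pattern.fullmatch(other)
        )
        if hits:
            conflicts[name] = hits
    return conflicts

all_names = ["myvar", "myvar_1", "myvar_total", "crop", "crop_2_10", "crop_1_2_3", "age"]
repeat_fields = {"myvar": 1, "crop": 2}
print(repeat_name_conflicts(repeat_fields, all_names))
```

`myvar_total` is not reported because its suffix is not numeric, and `crop_1_2_3` is not reported because `crop` is nested only two levels deep.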

### Stata Labels Columns

In SurveyCTO, you can program your form so that multiple languages can be displayed when filling in a form. This is done by creating label columns named label:english, label:swahili, label:hindi etc. When you export your data using SurveyCTO Sync you can choose which language you want to use for labels.

The same feature can be used to create Stata labels by adding a label language called label:stata. Labels can obviously be added and modified once the data set has been imported to Stata. However, our experience is that this is the simplest way to add them; if this practice is not used, the data set is often not properly labeled.

If you do not use this practice, but still use SurveyCTO's Stata code for importing data sets to Stata, you will end up having the labels displayed in the questionnaire as labels for your Stata variable. While it is better than no labels, label:stata allows better variable labeling.

#### Survey Sheet Stata Labels

In Stata, there is a restriction that the variable label does not exceed 80 characters. If you apply a label longer than that, it will be truncated. For this reason, ietestform lists all fields with a label in the Stata label column that is longer than 80 characters.
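A minimal Python sketch of this check, assuming the survey sheet is a list of row dictionaries with a `label:stata` key (the function name is hypothetical):

```python
def long_stata_labels(survey_rows, max_length=80):
    """Return (field name, label length) for Stata labels longer than max_length."""
    flagged = []
    for row in survey_rows:
        label = row.get("label:stata", "")
        if len(label) > max_length:
            flagged.append((row["name"], len(label)))
    return flagged

survey = [
    {"name": "q1", "label:stata": "Age of respondent"},
    {"name": "q2", "label:stata": "long label " * 8},  # 88 characters
]
print(long_stata_labels(survey))
```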

#### Choice Sheet Stata Labels

Apart from whether the label:stata exists or not, there is no further test on the values of the Stata label column in the choice sheet.

In computer science, there is a difference between the string "ABC" and "ABC ". This difference does not show in Excel. When uploading your form to SurveyCTO's server, the form checker is programmed to handle these differences. However, when you import your form to Stata, as ietestform and several other commands do, these minor differences are distinguished.

For example, suppose you have a list in the choice sheet called village, but the actual content of the cell is "village ". In Excel you will not see this extra space unless you really look for it. This means that some tools, probably most of them, will treat this as "village", but other tools might treat it as "village " which, when compared, are not the same.

What would be even worse is if some list items in the village list have the list name value "village" and others have the value "village ". This is very difficult to spot in Excel but some tools might treat these as different lists.

Leading (" ABC") or trailing ("ABC ") spaces are not difficult to deal with and most tools, ietestform included, deal with them. However, there is no guarantee that all tools do. To reduce the risk of errors in whatever tools you use on your data in the future, leading and trailing spaces should be removed.
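Because these spaces are invisible in Excel, a programmatic scan is the practical way to find them. The sketch below is illustrative Python, assuming the sheet has been read into a list of row dictionaries with every cell as a string; the column names used are assumptions.

```python
def cells_with_outer_spaces(rows, columns):
    """Return (row index, column, raw value) for cells with leading or trailing whitespace."""
    flagged = []
    for index, row in enumerate(rows):
        for column in columns:
            value = row.get(column, "")
            if value != value.strip():  # strip() also removes tabs and line breaks
                flagged.append((index, column, value))
    return flagged

choices = [
    {"list_name": "village", "value": "1", "label": "North"},
    {"list_name": "village ", "value": "2", "label": "South"},  # trailing space
    {"list_name": "village", "value": "3", "label": " East"},   # leading space
]
print("village" == "village ")  # the two list names are different strings
print(cells_with_outer_spaces(choices, ["list_name", "value", "label"]))
```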

## Tests for Choice List Practices

These tests are related to the choice lists used in select_one and in select_multiple types of fields. The ODK syntax is very lenient when it comes to choice lists, and it lets some undesirable practices pass. For example, unused lists and duplicate labels could mean that the list elements were copied and pasted accidentally. The command reports these cases, as they are common sources of errors.

### Unused Choice Lists

This test makes sure that all lists defined in the choices sheet are actually used in at least one select_one or select_multiple field in the survey sheet. It is not incorrect to have unused lists, but it is likely a sign of something that is not kept up to date in your choice lists and might therefore cause an error, unexpected behavior, or list items not being displayed during the survey.

For example, imagine you have 10 villages in a choice list called village but you incorrectly type vilage for one of them. Then, according to ODK syntax you have two lists, one called village with 9 items and one called vilage with 1 item. It is unlikely that there is a select_one/select_multiple field that uses the choice list vilage, so listing unused choice lists is a good way to spot a typo like this one.
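The vilage example can be reproduced with a short Python sketch that compares the lists defined in the choices sheet with the lists referenced in the survey sheet. The row dictionaries and the function name are assumptions for illustration, not ietestform's internals.

```python
def unused_choice_lists(survey_rows, choice_rows):
    """Return choice list names that no select_one/select_multiple field refers to."""
    used = set()
    for row in survey_rows:
        parts = row["type"].split()  # e.g. "select_one village" -> ["select_one", "village"]
        if len(parts) >= 2 and parts[0] in ("select_one", "select_multiple"):
            used.add(parts[1])
    defined = {row["list_name"].strip() for row in choice_rows}
    return sorted(defined - used)

survey = [
    {"type": "select_one village", "name": "hh_village"},
    {"type": "integer", "name": "age"},
]
choices = [
    {"list_name": "village", "value": "1", "label": "North"},
    {"list_name": "village", "value": "2", "label": "South"},
    {"list_name": "vilage", "value": "3", "label": "East"},  # typo in the list name
]
print(unused_choice_lists(survey, choices))
```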

### Value/Name Numeric

In Stata, categorical data is best and most efficiently stored as a number with a value label. The easiest way to ensure that is the case with data collected by SurveyCTO is to use the Stata data import file provided through SurveyCTO Sync. However, this file only works if the values in the value/name column in the choices sheet are numeric. It is not incorrect to use string variables here, but you will have to spend more time cleaning your dataset to follow Stata best practices. This test lists all list items that have a non-numeric value in the value/name column.
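A possible way to express this check is sketched below in Python. It treats "numeric" as "whole number, optionally negative", which is an assumption made for this example (value labels in Stata attach to integer values); ietestform may define numeric differently. The `value` column name is also an assumption.

```python
import re

INTEGER = re.compile(r"-?\d+")  # assumption: codes are whole numbers such as 1, 0 or -99

def non_numeric_values(choice_rows):
    """Return (list name, value) for list items whose value is not a whole number."""
    return [
        (row["list_name"], row["value"])
        for row in choice_rows
        if not INTEGER.fullmatch(row["value"].strip())
    ]

choices = [
    {"list_name": "yesno", "value": "1"},
    {"list_name": "yesno", "value": "0"},
    {"list_name": "yesno", "value": "-99"},
    {"list_name": "crop", "value": "maize"},
    {"list_name": "crop", "value": "2b"},
]
print(non_numeric_values(choices))
```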

### Duplicated List Code

This test makes sure that there are no duplicates in list names and codes in the choice sheet. This test lists all list items that have other list items with the same two values in the name and code columns.

### Duplicated List Labels

This test makes sure that there is no label in the same list that is identical (i.e. one label that is listed twice for the same choice list but with different codes). This test lists all list items that have other list items with the same two values in the name and label columns.
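Both duplicate tests, on codes and on labels, follow the same pattern: group list items by list name plus one other column and report any group with more than one item. A hedged Python sketch, with hypothetical column names `list_name`, `value` and `label`:

```python
from collections import defaultdict

def duplicated_in_list(choice_rows, column):
    """Map (list name, column value) to row indexes wherever that pair occurs more than once."""
    groups = defaultdict(list)
    for index, row in enumerate(choice_rows):
        groups[(row["list_name"], row[column])].append(index)
    return {key: rows for key, rows in groups.items() if len(rows) > 1}

choices = [
    {"list_name": "yesno", "value": "1", "label": "Yes"},
    {"list_name": "yesno", "value": "1", "label": "No"},     # duplicated code
    {"list_name": "crop", "value": "1", "label": "Maize"},
    {"list_name": "crop", "value": "2", "label": "Maize"},   # duplicated label
]
print(duplicated_in_list(choices, "value"))
print(duplicated_in_list(choices, "label"))
```

The same code or label appearing in two different lists is not reported, since the list name is part of the grouping key.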

### Missing Labels or Value/Name in Choice Lists

The first part of this test makes sure that there is no list item that has a value in the label column but no value in the value/name column. The second part of this test makes sure the opposite does not happen. This is extra likely to occur when a form is programmed in multiple languages. This test lists all list items caught by either of these two tests.
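The two parts of this test can be sketched in Python as follows. The handling of several label columns (flagging an item when any one of them is empty) is an assumption made for this illustration, as are the column and function names.

```python
def incomplete_choice_items(choice_rows, label_columns):
    """Return (row index, problem) for list items missing a value or a label."""
    flagged = []
    for index, row in enumerate(choice_rows):
        has_value = bool(row.get("value", "").strip())
        missing_labels = [c for c in label_columns if not row.get(c, "").strip()]
        has_any_label = len(missing_labels) < len(label_columns)
        if has_value and missing_labels:
            flagged.append((index, "missing label in " + ", ".join(missing_labels)))
        elif not has_value and has_any_label:
            flagged.append((index, "missing value"))
    return flagged

choices = [
    {"list_name": "yesno", "value": "1", "label:english": "Yes", "label:stata": "yes"},
    {"list_name": "yesno", "value": "0", "label:english": "No", "label:stata": ""},
    {"list_name": "yesno", "value": "", "label:english": "Refused", "label:stata": "refused"},
]
print(incomplete_choice_items(choices, ["label:english", "label:stata"]))
```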