Difference between revisions of "Ietestform"


Revision as of 16:28, 29 January 2019

ietestform is a Stata command used to test ODK-based SurveyCTO forms before they are used in the field. SurveyCTO's server has a test feature that tests the ODK syntax of the form. This command is not meant as a substitute for that test, but as a complement to it. For example, ietestform tests for potential typos that lead to unintended form logic, and for whether the data generated will be in a Stata-suitable format. The command also points out commonly used best practices that are not already used in the form.

There are other frameworks for testing ODK/SurveyCTO forms similarly to ietestform. Two examples are IPA's ipacheckscto (https://github.com/PovertyAction/ipacheckscto) and PMA2020's xform-test (http://xform-test-docs.pma2020.org/).

This article describes how and when to use the command, and the reasoning behind the specific tests the command performs. For instructions on how to use the command in Stata and for a complete list of the available options, see the help file by typing help ietestform in Stata after installing it. This command is part of the package iefieldkit. To install all the commands in this package, including this one, type ssc install iefieldkit in Stata or follow the installation instructions at https://github.com/worldbank/iefieldkit.

Intended use case and work flow

This command is intended to be used regularly while developing a form: after the form has been tested on SurveyCTO's server to make sure that it has no syntax errors, but before it is used in the field. The command writes a report with the results of several tests (the tests are described below). The report is in .csv format, so it can be viewed in Excel, and since it is plain text it can also be tracked and timestamped in version control tools like GitHub.

If you are not sure why something was caught by this command and listed in the report, then read the explanations of each test below. If you think that the command incorrectly catches cases in your SurveyCTO form then please report that here and we will be very happy to work on improving the command.

This command has many different tests, but not all of them are meant to detect direct errors. Some tests are instead meant to highlight things that experienced ODK coders in SurveyCTO often look for to spot potential errors and/or bad practices. So not everything listed in the report is an error, but it could be an indication of one, or an indication of something that can cause errors. Use this command as a tool to investigate your form, and read this wiki article to understand how each test is meant to indicate something that can often be used to improve data quality.

The point is that not everything that ietestform catches is an error or will cause an error. There will be many cases where the command catches something that is the best way to do something, so the point is not to change everything just so that the command no longer finds anything and the report is empty. Just make sure that you understand why each case was listed, and make sure either that the reason it was listed does not apply to your case, or that what you do is the best way to do it despite that reason.

Instructions

These instructions are meant to help you understand the tests that this command runs on your SurveyCTO questionnaire form. For technical instructions on how to run the command in Stata, see the help file by typing help ietestform in Stata after you have installed the command by typing ssc install iefieldkit.

This command is very simple to use in Stata. At a minimum, you only need to specify your SurveyCTO form and where on your computer you want the command to write the report.

   ietestform , surveyform("/path/to/surveyform.xlsx") report("/path/to/report.csv")

See the help file for more options, although they are typically not needed for this command.

Explanation of tests used in this command

ietestform only outputs results from tests that identified an error or a potential bad practice. If a test does not find anything, it is not mentioned in the report at all.


Coding Practices

This section describes tests related to best practices on how to use, and how not to use, features in the ODK programming language, in order to reduce the risk of errors that interrupt the field work and to ensure data quality. Note that these tests do not test whether the ODK syntax is valid, since this command is intended to be used after the form has passed the ODK syntax test on SurveyCTO's server. In fact, this command assumes that the ODK syntax has already been tested and is correct.

Required Column

The required column makes sure that the enumerator cannot proceed before a response has been filled in for that field. This is a great data quality feature as it prevents incomplete forms from being submitted, and it helps make sure that enumerators fill in the forms in the right order. A field that is required cannot be passed until data has been recorded for it.

While you can fill in a value in the required column for any field, only field types with a view, i.e. fields that show up when filling in a form, are affected by that value. Examples of fields without a view are begin_group, end_repeat, text_audit, calculate, deviceid, caseid, etc. All fields without a view are ignored in the tests that relate to the required column.

There are two tests in ietestform related to the required column.

All Non-Note Fields Required

This test checks that all fields that are not of type note (see the other Required Column test below) have the value "Yes" in the required column. The test writes a list to the report of all fields that are not required and not of type note.

Even when "no answer" is a valid response from the respondent, we should never use the absence of a recorded answer to represent that. The absence of a recorded answer should only mean that no valid answer was recorded and nothing else. When applicable, there should be a valid method to record that the respondent's answer was "no answer".

Some fields that are commonly left not required intentionally are fields that require the GPS. Those fields are geopoint, geoshape and geotrace. If you know that the devices that you will be using for data collection will have no problem collecting GPS coordinates, then keep those fields required to ensure you will get valid data points. But if you are working in a context where GPS coordinates will be difficult to collect, then it could be a good idea to not require these fields, so that the enumerator can complete the other fields and submit the form even when it was not possible to record GPS coordinates.
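The logic of this test can be sketched in a few lines of Python. This is not how ietestform is implemented (the command is written in Stata); it is a minimal illustration that assumes the survey sheet has already been read into a list of dictionaries with the hypothetical keys type, name and required, and that uses an illustrative, non-exhaustive list of field types without a view.

```python
# Field types assumed to have no view; illustrative only, not exhaustive.
NO_VIEW_TYPES = {"begin_group", "end_group", "begin_repeat", "end_repeat",
                 "text_audit", "calculate", "deviceid", "caseid"}

def not_required_fields(survey_rows):
    """Return names of non-note fields with a view that are not required."""
    flagged = []
    for row in survey_rows:
        # "select_one village" -> "select_one"
        ftype = row["type"].strip().split()[0]
        if ftype in NO_VIEW_TYPES or ftype == "note":
            continue
        if row.get("required", "").strip().lower() != "yes":
            flagged.append(row["name"])
    return flagged

# Hypothetical survey sheet rows
survey = [
    {"type": "text", "name": "resp_name", "required": "yes"},
    {"type": "integer", "name": "age", "required": ""},
    {"type": "note", "name": "intro", "required": ""},
    {"type": "calculate", "name": "age_sq", "required": ""},
    {"type": "geopoint", "name": "gps", "required": ""},
]
print(not_required_fields(survey))  # ['age', 'gps']
```

In this example the gps field is listed even if it was left not required on purpose, which matches the point above: a listed field is something to review, not necessarily something to change.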

No Note Fields Required

Fields of type note have a view and can therefore be required, but there is no way to record data in them, and therefore no way to pass a required note field. If this happens in the field, the enumerator cannot get past the note, and therefore cannot complete and submit the form with the data already collected. While there is an exception where this feature can be put to great use (see below), this test writes a list to the report of all fields that are of type note and are required.

While required notes will always be listed, there are cases when they are really useful. Since they are not possible to pass, they can be used together with a relevance condition so that they show up only if something earlier in the form is not correct and the enumerator should be forced to go back and correct it before continuing the data collection.

For example, enumerators are often asked to enter respondent IDs twice to be extra careful that there is no typo in the ID. Let's say those two double entry fields are id1 and id2. They can then be followed by a required note field that has the relevance expression ${id1} != ${id2}, so that the note only shows if the two IDs are not identical. The note label can then inform the enumerator that the two ID fields are not identical and that the enumerator must go back and change the values in order to continue.

The same functionality could have been achieved using the constraint condition on the second ID field, where the ID is re-entered. However, the label in the note field can be made more informative than the constraint message. And when the conditional test is more complex than just testing that two fields are identical, this method is easier, since intermediate calculate fields can be used in the relevance column for the required note field.
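As a rough Python sketch of this test (not the command's Stata implementation), the function below lists every required note together with its relevance expression, so that intentional cases such as the ID double-entry check are easy to tell apart from mistakes. The dictionary keys and the field names other than id1 and id2 are hypothetical.

```python
def required_notes(survey_rows):
    """Return (name, relevance) for every note field marked as required."""
    return [
        (row["name"], row.get("relevance", "").strip())
        for row in survey_rows
        if row["type"].strip() == "note"
        and row.get("required", "").strip().lower() == "yes"
    ]

# Hypothetical survey sheet rows
survey = [
    {"type": "text", "name": "id1", "required": "yes"},
    {"type": "text", "name": "id2", "required": "yes"},
    {"type": "note", "name": "id_mismatch", "required": "yes",
     "relevance": "${id1} != ${id2}"},
    {"type": "note", "name": "welcome", "required": ""},
    {"type": "note", "name": "thanks", "required": "yes"},
]
for name, relevance in required_notes(survey):
    print(name, "| relevance:", relevance or "(none)")
```

A required note without any relevance expression, like thanks in this example, can never be passed and is most likely a mistake.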

Numeric Ranges

not implemented yet

All numeric fields, i.e. integer fields and decimal fields, should have ranges for acceptable values in the constraint column. Make this range wider than what you expect! The range in the constraint column should be used to prevent typos and to prevent illogical values (like negative age), but not to force the data to be within your preexisting expectations. Your preexisting expectation is a good starting point for this range, but make it much wider than that, as we do not yet know what special cases your data collection will encounter in the field, and these outliers are very important to understand for your research.
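Since this test is not implemented in the command, the following Python sketch is purely hypothetical. It only shows one simple way such a check could work: listing integer and decimal fields whose constraint column is empty. It does not verify that an existing constraint is actually a range, and the dictionary keys and the example constraint are assumptions.

```python
def numeric_fields_without_range(survey_rows):
    """Return names of integer and decimal fields with an empty constraint."""
    return [row["name"] for row in survey_rows
            if row["type"].strip() in ("integer", "decimal")
            and not row.get("constraint", "").strip()]

# Hypothetical survey sheet rows
survey = [
    {"type": "integer", "name": "age", "constraint": ". >= 0 and . <= 130"},
    {"type": "decimal", "name": "plot_size", "constraint": ""},
    {"type": "text", "name": "resp_name", "constraint": ""},
]
print(numeric_fields_without_range(survey))  # ['plot_size']
```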

Matching begin_/end_

The main aspect of this test is done by the ODK syntax tester on SurveyCTO's server, but the error messages for this error are not always useful, especially when the form is very large. One of the main reasons for this might be that ODK does not require the end_group and end_repeat fields to have field names. So the first part of this test requires all end_group and end_repeat fields to have a value in the field name column, and the second part requires that name to be identical to the name of the corresponding begin_group or begin_repeat field.

The main part of this test checks that every begin_group is matched by an end_group and that every begin_repeat is matched by an end_repeat. To be considered matched, the following three criteria need to be fulfilled:

  1. For each begin_ there is an end_.
  2. The corresponding end_ is of the correct type, so that a begin_group is not closed by an end_repeat and a begin_repeat is not closed by an end_group.
  3. The end_ name matches the begin_ name. SurveyCTO's server makes sure that the begin_ names are unique, so each pair will be unique if this part of the test is passed.
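The three criteria can be checked with a stack, as in the Python sketch below. This illustrates the idea only and is not the command's Stata implementation; the dictionary keys type and name and the wording of the messages are assumptions.

```python
def check_begin_end(survey_rows):
    """Return a list of problems found when pairing begin_/end_ rows."""
    problems = []
    stack = []  # currently open blocks as (kind, name), innermost last
    for row in survey_rows:
        ftype = row["type"].strip()
        name = row.get("name", "").strip()
        if ftype in ("begin_group", "begin_repeat"):
            stack.append((ftype[len("begin_"):], name))
        elif ftype in ("end_group", "end_repeat"):
            kind = ftype[len("end_"):]
            if not name:
                problems.append(f"{ftype} has no name")
            if not stack:  # criterion 1: an end_ without a begin_
                problems.append(f"{ftype} '{name}' has no matching begin_")
                continue
            open_kind, open_name = stack.pop()
            if open_kind != kind:  # criterion 2: wrong type
                problems.append(
                    f"begin_{open_kind} '{open_name}' closed by {ftype}")
            elif name and name != open_name:  # criterion 3: names differ
                problems.append(
                    f"begin_{kind} '{open_name}' closed by {ftype} '{name}'")
    for open_kind, open_name in stack:  # criterion 1: a begin_ without an end_
        problems.append(f"begin_{open_kind} '{open_name}' is never closed")
    return problems

# Hypothetical form with a typo in the end_repeat name
survey = [
    {"type": "begin_group", "name": "hh"},
    {"type": "begin_repeat", "name": "members"},
    {"type": "text", "name": "member_name"},
    {"type": "end_repeat", "name": "member"},
    {"type": "end_group", "name": "hh"},
]
print(check_begin_end(survey))
```

Because each end_ row carries a name, a message can point to the exact pair that does not match, which is what makes this easier to debug in a large form.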

Naming and Labeling Practices

ODK has very few restrictions on names, apart from that all names must be unique and that a few characters are not allowed. All of those restrictions are tested by the ODK syntax test on SurveyCTO's server. The additional tests done by this command are mainly due to the additional rules for names that Stata has, which come into effect when importing your data to Stata.

Field Name Length

Stata has a limit of 32 characters for variable names. When the data is imported, Stata will truncate a field name that is longer than that, or replace it with a generic name in the format var1, var2 etc. if the name is no longer unique after being truncated. All of these cases can be resolved in Stata, but it is much simpler to solve this before starting to collect the data. This test lists all fields with names longer than 32 characters.
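A minimal Python sketch of this check follows. It is an illustration of the rule rather than the command's Stata code, and the field names are made up.

```python
STATA_MAX_NAME = 32  # maximum length of a Stata variable name

def long_field_names(survey_rows, max_len=STATA_MAX_NAME):
    """Return (name, length) for field names longer than max_len."""
    return [(row["name"], len(row["name"].strip()))
            for row in survey_rows
            if len(row["name"].strip()) > max_len]

# Hypothetical survey sheet rows
survey = [
    {"type": "text", "name": "hh_head_occupation"},
    {"type": "text", "name": "household_head_primary_occupation"},
]
print(long_field_names(survey))
```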

Repeat Group Field Name Length

This test has two parts. The first part lists fields in repeat groups whose names will be too long in the wide format when importing to Stata. The second part lists fields in repeat groups where there is a high risk that this will happen, but it is not certain.

When using SurveyCTO's Stata import do-file or when exporting the data set in wide format, all variables in a repeat group will have a suffix added to the variable name. If a repeat group is repeated three times, then in the wide data set any variable in that repeat group will generate three variables, with the names suffixed by _1, _2 and _3 respectively. This suffix also counts towards the 32-character limit for variable names in Stata discussed in the previous test. So any variable in a repeat group may only have a field name of at most 30 characters. If the field is in a nested repeat group (a repeat group inside a repeat group) then it will be suffixed once for each repeat group. So the actual constraint used in this test is given by this formula: 32 - (2 * number of nested repeat groups for the field). This test lists all variables that have names longer than that constraint.

In the first part we assume that there are no more than 9 iterations in each repeat group. If there are more than 9, the suffixes will be _10, _11 etc., which take up three characters. So the second part lists all fields that have a field name longer than 32 - (3 * number of nested repeat groups for the field). It is not certain that this will create an issue with long names, but if your names are so long that they are caught by this part of the test, then it is probably good practice to try to make them shorter.
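The two formulas can be illustrated with the Python sketch below, which tracks how many repeat groups each field is nested in while walking through the survey sheet. It is a sketch under assumed dictionary keys, not the command's Stata implementation. For brevity, the second list only contains fields that are not already in the first one.

```python
def repeat_name_length_check(survey_rows, max_len=32):
    """Return (certain, possible) lists of too-long names in repeat groups."""
    certain, possible = [], []
    depth = 0  # number of repeat groups the current row is nested in
    for row in survey_rows:
        ftype = row["type"].strip()
        name = row.get("name", "").strip()
        if ftype == "begin_repeat":
            depth += 1
        elif ftype == "end_repeat":
            depth -= 1
        elif depth > 0 and not ftype.startswith(("begin_", "end_")):
            if len(name) > max_len - 2 * depth:    # suffixes like _1
                certain.append(name)
            elif len(name) > max_len - 3 * depth:  # suffixes like _10
                possible.append(name)
    return certain, possible

# Hypothetical rows with synthetic names of a given length
survey = [
    {"type": "begin_repeat", "name": "members"},
    {"type": "text", "name": "a" * 31},  # 31 > 32 - 2*1
    {"type": "text", "name": "b" * 30},  # 30 > 32 - 3*1
    {"type": "text", "name": "c" * 29},  # fine at one level
    {"type": "begin_repeat", "name": "jobs"},
    {"type": "text", "name": "d" * 27},  # 27 > 32 - 3*2
    {"type": "end_repeat", "name": "jobs"},
    {"type": "end_repeat", "name": "members"},
]
certain, possible = repeat_name_length_check(survey)
print([len(n) for n in certain], [len(n) for n in possible])  # [31] [30, 27]
```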

Repeat Group Name Conflict

This test checks for name conflicts that could result from the suffixes that are added to fields inside a repeat group. SurveyCTO's ODK syntax tester tests that all names are unique. The names myvar and myvar_1 are not duplicates in the ODK syntax test, but if myvar is in a repeat group it will be suffixed with _1 for the first iteration of that variable, and that will create a name conflict with the variable created from the field myvar_1.

This test lists all fields inside a repeat group for which there is another field that creates a risk of this type of name conflict. For example, if there is a field with the name myvar, it tests if there is any variable in the format myvar_# where # is one or several digits.

If the variable myvar is in a nested repeat group (a repeat group inside a repeat group) then it is testing for myvar_#, myvar_#_#, myvar_#_#_# etc. for each level of nested repeat group, where # is one or several digits.

Technical special case: If the fields myvar and myvar_1 are both in a non-nested repeat group then there will be no name conflict as the first iteration of both fields will generate the variables myvar_1 and myvar_1_1 as the variables from both fields are suffixed. These fields are still listed by this test as it will be confusing that the variable myvar_1 is from field myvar and not from the myvar_1 that has the same name, even though this is technically not a name conflict.
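The pattern matching described above can be sketched with a regular expression, as in the Python example below. It is an illustration only (the command itself is written in Stata) and the dictionary keys are assumptions. Like the test, it also reports the technical special case where both fields are in the same repeat group.

```python
import re

def repeat_name_conflicts(survey_rows):
    """Return (repeat_field, other_field) pairs whose names could clash."""
    all_names = [r["name"].strip() for r in survey_rows
                 if r.get("name", "").strip()]
    conflicts = []
    depth = 0  # number of repeat groups the current row is nested in
    for row in survey_rows:
        ftype = row["type"].strip()
        name = row.get("name", "").strip()
        if ftype == "begin_repeat":
            depth += 1
        elif ftype == "end_repeat":
            depth -= 1
        elif depth > 0 and not ftype.startswith(("begin_", "end_")):
            # name followed by 1..depth numeric suffixes: myvar_1, myvar_1_2
            pattern = re.compile(re.escape(name) + r"(_\d+){1,%d}" % depth)
            conflicts += [(name, other) for other in all_names
                          if pattern.fullmatch(other)]
    return conflicts

# Hypothetical survey sheet rows
survey = [
    {"type": "begin_repeat", "name": "members"},
    {"type": "text", "name": "myvar"},
    {"type": "end_repeat", "name": "members"},
    {"type": "text", "name": "myvar_1"},
    {"type": "text", "name": "myvar_x"},
]
print(repeat_name_conflicts(survey))  # [('myvar', 'myvar_1')]
```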

Stata Labels Columns

In a SurveyCTO form you can program your form so that multiple languages can be displayed when filling in the form. This is done by having multiple label columns named label:english, label:swahili, label:hindi etc. When you export your data using SurveyCTO Sync you can choose which language you want to use for labels.

The same feature can be used to create Stata labels by adding a label language called label:stata. Labels can obviously be added and modified once the data set has been imported to Stata. However, our experience is that this is the simplest way to add them, and if this practice is not used, the data set is often never properly labeled.

If you do not use this practice, but still use SurveyCTO's Stata code for importing data sets to Stata, you will end up having the labels displayed in the questionnaire as labels for your Stata variable. That is much better than nothing, but those labels will not be very good labels for labeling variables in Stata.

Survey Sheet Stata Labels

In Stata, a variable label cannot be longer than 80 characters. If you try to apply a label longer than that, it will be truncated. For that reason, this test lists all fields with a label in the Stata label column that is longer than 80 characters.
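In Python, the same rule could be sketched as follows. This is not the command's Stata code; it assumes the survey sheet is a list of dictionaries where the key label:stata holds the Stata label column, and the example labels are made up.

```python
STATA_MAX_LABEL = 80  # maximum length of a Stata variable label

def long_stata_labels(survey_rows, column="label:stata"):
    """Return (name, length) for fields whose Stata label is too long."""
    return [(row["name"], len(row[column]))
            for row in survey_rows
            if len(row.get(column, "")) > STATA_MAX_LABEL]

# Hypothetical survey sheet rows
survey = [
    {"name": "age", "label:stata": "Age of respondent in completed years"},
    {"name": "income", "label:stata": "x" * 81},
]
print(long_stata_labels(survey))  # [('income', 81)]
```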

Choice Sheet Stata Labels

There are no tests specific to the Stata label column in the choice sheet other than that it exists.

Leading and Trailing Spaces

In computer science there is a difference between the strings "ABC" and "ABC ". This difference does not show in Excel, and when uploading your form to SurveyCTO's server the form checker is programmed to handle this. However, when you import your form to Stata, as ietestform and several other commands do, it makes a difference.

For example, say that you have a list in the choice sheet called village, but the actual content of the cell is "village ". In Excel you will not see this extra space unless you really look for it. Some tools, probably most of them, will treat this as "village", but other tools might treat it as "village ", and the two are not the same when compared.

What would be even worse is if some list items in the village list have the list name value "village" and some have the value "village ". This is very difficult to spot in Excel, but some tools might treat these as different lists.

Leading (" ABC") or trailing ("ABC ") spaces are not difficult to deal with, and most tools, ietestform included, deal well with them. But there is no guarantee that all tools do, so to reduce the risk of errors in whatever tools you use on your data in the future, leading and trailing spaces should be removed.
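The Python sketch below shows both the problem and a simple way to find affected cells. It is only an illustration with assumed dictionary keys, not the command's Stata implementation.

```python
def cells_with_outer_spaces(rows):
    """Return (row_number, column, value) for cells with outer whitespace."""
    return [(i, col, val)
            for i, row in enumerate(rows, start=1)
            for col, val in row.items()
            if isinstance(val, str) and val != val.strip()]

# Hypothetical choice sheet rows; the second list name has a trailing space
choices = [
    {"list_name": "village", "value": "1", "label": "Village A"},
    {"list_name": "village ", "value": "2", "label": "Village B"},
]
print("village" == "village ")           # False
print(cells_with_outer_spaces(choices))  # [(2, 'list_name', 'village ')]
```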


Choice Lists

These are all tests related to the choice lists used in select_one and select_multiple types of fields. The ODK syntax is very lenient when it comes to choice lists, and this can mean that typos are not revealed. For example, unused lists and duplicated labels could mean that list elements were copied and pasted but then never updated as intended. While that will not get caught in the ODK syntax test, it will be listed by this command, as these are common sources of errors.

Unused Choice Lists

This test makes sure that all lists defined in the choices sheet are actually used in at least one select_one or select_multiple field in the survey sheet. It is not incorrect to have unused lists, but it is likely a sign that something is not kept up to date in your choice lists, which might cause an error, unexpected behavior, or list items not being displayed during the survey.

For example, say that you have 10 villages in a choice list called village, but you incorrectly type vilage for one of them. Then, according to ODK syntax, you have two lists: one called village with 9 items and one called vilage with 1 item. It is unlikely that there is a select_one/select_multiple field that uses the choice list vilage, so listing unused choice lists is a good way to spot a typo like this one.
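The vilage example can be reproduced with a short Python sketch. It is not the command's Stata code, and the dictionary keys type and list_name are assumptions about how the two sheets have been read in.

```python
def unused_choice_lists(survey_rows, choice_rows):
    """Return choice list names never used by a select_one/select_multiple."""
    used = set()
    for row in survey_rows:
        parts = row["type"].split()  # "select_one village" -> two parts
        if len(parts) >= 2 and parts[0] in ("select_one", "select_multiple"):
            used.add(parts[1])
    defined = {row["list_name"].strip() for row in choice_rows}
    return sorted(defined - used)

# Hypothetical sheets: one list item has the list name misspelled
survey = [{"type": "select_one village", "name": "resp_village"}]
choices = [
    {"list_name": "village", "value": "1", "label": "Village A"},
    {"list_name": "village", "value": "2", "label": "Village B"},
    {"list_name": "vilage", "value": "3", "label": "Village C"},
]
print(unused_choice_lists(survey, choices))  # ['vilage']
```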

Value/Name Numeric

In Stata, categorical data is best and most efficiently stored as a number with a value label. The easiest way to ensure that this is the case with data collected by SurveyCTO is to use the Stata data import file they provide through SurveyCTO Sync, but that only works if the values in the value/name column in the choices sheet are numeric. It is not incorrect to use string values, but you will have to spend more time cleaning your data set to follow Stata best practices. This test lists all list items that have a non-numeric value in the value/name column.
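A Python sketch of this check is shown below. It is not the command's Stata implementation, and it makes one assumption that the text above does not state: that "numeric" means a whole number, optionally negative (such as -99), since Stata value labels attach to integer values. The dictionary keys are also assumptions.

```python
import re

# Assumption: a numeric code is an optionally negative whole number.
INTEGER_CODE = re.compile(r"-?\d+")

def non_numeric_values(choice_rows):
    """Return (list_name, value) for list items with a non-numeric code."""
    return [(row["list_name"], row["value"])
            for row in choice_rows
            if not INTEGER_CODE.fullmatch(row["value"].strip())]

# Hypothetical choice sheet rows
choices = [
    {"list_name": "yesno", "value": "1", "label": "Yes"},
    {"list_name": "yesno", "value": "0", "label": "No"},
    {"list_name": "yesno", "value": "-99", "label": "Refused"},
    {"list_name": "village", "value": "vil_a", "label": "Village A"},
]
print(non_numeric_values(choices))  # [('village', 'vil_a')]
```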

Duplicated List Code

This test makes sure that there are no duplicated combinations of list name and code in the choice sheet. This test lists all list items for which another list item has the same two values in the list name and code columns.
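Counting each combination of list name and code is enough to find such duplicates, as the Python sketch below illustrates. It is not the command's Stata code and the dictionary keys are assumptions.

```python
from collections import Counter

def duplicated_codes(choice_rows):
    """Return (list_name, code) pairs that occur more than once."""
    counts = Counter((row["list_name"].strip(), row["value"].strip())
                     for row in choice_rows)
    return sorted(key for key, n in counts.items() if n > 1)

# Hypothetical choice sheet rows: code 2 is used twice in the village list
choices = [
    {"list_name": "village", "value": "1", "label": "Village A"},
    {"list_name": "village", "value": "2", "label": "Village B"},
    {"list_name": "village", "value": "2", "label": "Village C"},
    {"list_name": "yesno", "value": "1", "label": "Yes"},
]
print(duplicated_codes(choices))  # [('village', '2')]
```

The same code in two different lists, like 1 in village and yesno here, is not a duplicate.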

Duplicated List Labels

This test checks that no two labels in the same list are identical, i.e. that no label is listed twice for the same choice list but with different codes. This test lists all list items for which another list item has the same two values in the list name and label columns.

Missing Labels or Value/Name in Choice Lists

The first part of this test makes sure that there are no list items that have a value in the label column but no value in the value/name column. The second part of this test makes sure the opposite does not happen. This is extra likely to occur when a form is programmed in multiple languages. This test lists all list items caught by either of these two parts.
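Both parts can be sketched in Python as below, treating every column whose name starts with label as a label column so that forms in multiple languages are covered. This is an illustration with assumed dictionary keys and message wording, not the command's Stata implementation.

```python
def incomplete_choice_rows(choice_rows):
    """Return (row_number, problem) for items missing a value or a label."""
    problems = []
    for i, row in enumerate(choice_rows, start=1):
        value = row.get("value", "").strip()
        labels = {col: text.strip() for col, text in row.items()
                  if col.startswith("label")}
        if any(labels.values()) and not value:
            problems.append((i, "label without value"))
        for col, text in labels.items():
            if value and not text:
                problems.append((i, f"value without {col}"))
    return problems

# Hypothetical choice sheet rows with two label languages
choices = [
    {"list_name": "village", "value": "1",
     "label:english": "Village A", "label:swahili": "Kijiji A"},
    {"list_name": "village", "value": "",
     "label:english": "Village B", "label:swahili": "Kijiji B"},
    {"list_name": "village", "value": "3",
     "label:english": "Village C", "label:swahili": ""},
]
print(incomplete_choice_rows(choices))
```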

Back to Parent

This article is part of the topic iefieldkit