Ietestform


Revision as of 19:19, 23 January 2019

ietestform is a Stata command used to test ODK-based SurveyCTO forms before they are used in the field. SurveyCTO's server has a test feature that tests the ODK syntax of the form. This command is not meant as a substitute for that test but as a complement to it, as it tests for constraints specific to Stata and checks that the best practices used at DIME are followed.

There are other frameworks for testing ODK/SurveyCTO forms similarly to ietestform. Two examples are IPA's ipacheckscto and PMA2020's xform-test.

This article describes use cases, workflow and the reasoning used when developing the tests in this command. For instructions on how to use the command in Stata and for a complete list of the available options, see the help file by typing help ietestform in Stata. This command is part of the package iefieldkit. To install all the commands in this package, including this one, type ssc install iefieldkit in Stata.

Published Beta Version

This command is still only published as a beta version for testing. Anyone can test install the command by following the installation instructions here and report feedback to dimeanalytics@worldbank.org, but be aware that not all bugs are fixed and some features are not yet finalized.

Intended use cases

This command is intended to be used after the form has been tested on SurveyCTO's server, to make sure that there are no syntax errors in the form, but before the form is used in the field. This command writes a report with the results of several tests (the tests are described below). The report is in csv format, so it can be viewed in Excel, and it is raw text, so it can be tracked in version control tools like GitHub.

If you are not sure why something was caught by this command and listed in the report, then read the explanations of each test below. If you think that the command incorrectly catches cases in your SurveyCTO form then please report that here and we will be very happy to work on improving the command.

This command has many different tests, but only some of them flag direct errors. The tests that do not flag errors are meant to highlight things that experienced ODK coders in SurveyCTO usually look for to spot potential errors or bad practices. It is important to note that something is not necessarily incorrect just because it was caught by a test in this command. There are several special cases where a practice caught by this command could be the best way to solve something. We have listed examples of such acceptable exceptions in this article, and while we will keep adding such exceptions, we will never be able to have an exhaustive list of all those cases.

Instructions

These instructions are meant to help you understand the tests that this command runs on your SurveyCTO questionnaire form. For technical instructions on how to run the command in Stata see the help file by typing help ietestform in Stata.

This command is very simple to use in Stata. You only need to specify your SurveyCTO form and where on your computer you want the command to write the report.

   ietestform , surveyform("/path/to/surveyform.xlsx") report("/path/to/report.csv")

Explanation of tests used in this command

ietestform only outputs results from tests that identified an error or a potential bad practice. If a test does not find anything, the test is not at all mentioned in the report.


Coding Practices

This section describes tests related to best practices on how to use, and how not to use, features in the ODK programming language, in order to reduce the risk of errors that interrupt the field work and to ensure data quality. Note that these tests do not test whether the ODK syntax is valid, since this command is intended to be used after the form has passed the ODK syntax test on SurveyCTO's server.

Required Column

The required column makes sure that the enumerator cannot proceed before a response has been filled in for that field. This is a great feature as it prevents incomplete forms from being submitted, and it helps make sure that enumerators fill in the forms in the right order. A field that is required cannot be passed until data has been recorded for it.

There are two tests in ietestform related to the required column.

All Non-Note Fields Required

This test tests that all fields that are not of type note (see the other Required Column test below) have the value "Yes" in the required column. This test outputs a list to the report for all fields that are not required and not of type note.

Even for questions where it is sometimes expected that there is no answer to be recorded, it is much better practice to have an answer option that represents "No Answer" than to have the enumerators leave the field unanswered.

Acceptable Exceptions:

  • Fields that require the GPS function on the device. GPS locations are sometimes difficult to collect for some devices in some contexts. This should be tested before and during the pilot and if it seems as if collecting GPS locations will be difficult, then it is ok to make these field types not required.

No Note Fields Required

For fields of type note there is no way to record data, and there is therefore no way to pass a required note-field. If this happens in the field there is no way to pass this field, and therefore no way to complete and submit the data. See the exception below for a use case when it is great practice to use this feature for data quality assurance. This test writes a list to the report of all fields that are of type note and are required.

Acceptable Exceptions:

  • This impassable note field has one very useful use case where this is the intended behavior: when you want to force the enumerator to go back and change something that is not correct. For example, enumerators are often asked to enter respondent IDs twice to be extra careful that there is no typo in the ID. Let's say those two double-entry fields are id1 and id2. Then there can be a required note-field that has the relevance expression ${id1} != ${id2}. If the two ID fields are not identical, then the enumerator must go back and change the values. The same functionality could have been achieved using the constraint column when the ID is re-entered, but the label in the note field can be made more informative than the constraint message. When the conditional test is more complex than testing that two fields are identical, this method is also easier, since intermediate calculate fields can be used in the relevance column for the required note-field.
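
The two Required Column tests can be illustrated with a minimal Python sketch. This is not the Stata code that ietestform runs. The sample rows and the column names type, name and required are assumptions made for illustration, and the sketch applies the rules literally to every row, whereas the actual command may exempt some field types.

```python
# Hypothetical rows from the survey sheet of a form (illustrative only).
survey = [
    {"type": "text",    "name": "id1",      "required": "yes"},
    {"type": "text",    "name": "id2",      "required": "yes"},
    {"type": "note",    "name": "id_error", "required": "yes"},
    {"type": "integer", "name": "age",      "required": ""},
    {"type": "note",    "name": "intro",    "required": ""},
]

def is_required(row):
    """True when the required column holds 'yes' in any letter case."""
    return row["required"].strip().lower() == "yes"

def non_note_not_required(rows):
    """Fields that are not notes and are not required."""
    return [r["name"] for r in rows if r["type"] != "note" and not is_required(r)]

def note_required(rows):
    """Note fields that are required and therefore impassable."""
    return [r["name"] for r in rows if r["type"] == "note" and is_required(r)]

print(non_note_not_required(survey))  # fields to review in the first test
print(note_required(survey))          # fields to review in the second test
```

In this sample, id_error would be listed by the second test even though it is an instance of the acceptable exception described above, which is why each listed item needs a manual review.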

Numeric Ranges

not implemented yet in beta version

All numeric fields (integer or decimal) should have ranges for acceptable values in the constraint column. Make this range wider than you expect! The range in the constraint column should be used to prevent typos and to prevent illogical values (like negative age), but not to force the data to be within your preexisting expectations. Your preexisting expectation is a good starting point for this range, but make it much wider than that, as we do not yet know what special cases your data collection will encounter in the field, and these outliers are very important to understand for your research.

Matching begin_/end_

The main aspect of this test is done by the ODK syntax tester on SurveyCTO's server, but the error message for this error is not always useful, especially when the form is very large. One of the main reasons for this might be that ODK does not require the end_group and end_repeat to have field names. So the first part of this test is that all end_group and end_repeat fields are required to have a value in the field name column, and for the second part of this test the name has to be identical to the name of the corresponding begin_group or begin_repeat field.

The main part of this test is to test that all begin_group are matched by an end_group and that all begin_repeat are matched by an end_repeat. To be considered matched, the following three criteria need to be fulfilled:

  1. for each begin_ there is an end_
  2. the corresponding end_ is of the correct type, so that a begin_group is not closed by an end_repeat
  3. the end_ name matches the begin_ name. SurveyCTO's server makes sure that the begin_ names are unique, so each pair will be unique if this part of the test is passed
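
The three criteria above can be checked with a stack, as in the following minimal Python sketch. It is an illustration of the logic only, not the command's Stata implementation. The function name check_begin_end and the row layout (dictionaries with type and name keys) are assumptions.

```python
def check_begin_end(rows):
    """Return a list of problem descriptions for begin_/end_ pairs."""
    problems = []
    stack = []  # currently open groups, innermost last: (kind, name)
    for row in rows:
        ftype, name = row["type"], row["name"]
        if ftype in ("begin_group", "begin_repeat"):
            stack.append((ftype[len("begin_"):], name))
        elif ftype in ("end_group", "end_repeat"):
            kind = ftype[len("end_"):]
            if not stack:
                problems.append(f"{ftype} '{name}' has no matching begin_")
                continue
            open_kind, open_name = stack.pop()
            if kind != open_kind:
                # Criterion 2: the end_ must be of the same type as the begin_.
                problems.append(f"begin_{open_kind} '{open_name}' closed by {ftype}")
            if not name:
                problems.append(f"{ftype} closing '{open_name}' has no name")
            elif name != open_name:
                # Criterion 3: the end_ name must match the begin_ name.
                problems.append(
                    f"{ftype} '{name}' does not match begin_{open_kind} '{open_name}'"
                )
    # Criterion 1: anything still open was never closed.
    for open_kind, open_name in stack:
        problems.append(f"begin_{open_kind} '{open_name}' is never closed")
    return problems

good = [
    {"type": "begin_group",  "name": "hh"},
    {"type": "begin_repeat", "name": "members"},
    {"type": "integer",      "name": "age"},
    {"type": "end_repeat",   "name": "members"},
    {"type": "end_group",    "name": "hh"},
]
bad = [
    {"type": "begin_group",  "name": "hh"},
    {"type": "begin_repeat", "name": "members"},
    {"type": "end_group",    "name": "members"},
    {"type": "end_repeat",   "name": ""},
]
print(check_begin_end(good))
for problem in check_begin_end(bad):
    print(problem)
```

Requiring names on the end_ rows is what makes the messages specific enough to locate the problem in a large form.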

Naming and Labeling Practices

ODK has very few restrictions on names, apart from that all names must be unique and that there are a few characters that are not allowed. All of those restrictions are tested by the ODK syntax test on SurveyCTO's server. The tests in this section are mainly due to the additional rules for names that Stata has and that come into effect when importing your data to Stata.

Field Name Length

Stata has a limit of 32 characters for variable names. Stata will truncate the name if it is longer than that, and replace the name with a generic name in the format var1, var2 etc. if the name is no longer unique after being truncated. All of these cases can be resolved in Stata, but it will be much simpler to solve this before starting to collect the data. This test lists all fields with names longer than 32 characters.

Repeat Group Field Name Length

This test has two parts. The first part lists fields that will have too long names in the wide format when importing to Stata, and the second part lists fields where the risk is high that this will happen but it is not certain.

When using SurveyCTO's Stata import do-file or when exporting the data set in wide format, all variables in a repeat group will have a suffix added to the variable name. If a repeat group is repeated three times, then in the wide data set any variable in that repeat group will generate three variables, with the names suffixed by _1, _2 and _3 respectively. This suffix is required to fit within the 32-character limit for variable names in Stata discussed in the previous test. So the field name of any variable in a repeat group may be at most 30 characters long. If the field is in a nested repeat group (a repeat group inside a repeat group) then it will be suffixed once for each repeat group. So the actual constraint used in this test is given by this formula: 32 - (2 * number of nested repeat groups for the field). This test lists all variables that have names longer than that constraint.

In the first test we assume that there are no more than 9 iterations in each repeat group. If there are more than 9, the suffix will be _10 or longer, which takes up at least three characters. So the second test lists all fields that have a field name longer than 32 - (3 * number of nested repeat groups for the field). It is not certain that such a name will cause an issue, but if your names are long enough to be caught by this test, it is probably best practice to try to make them shorter.
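
The two formulas above can be worked through in a short Python sketch. The function names and the labels "too long", "at risk" and "ok" are illustrative assumptions and do not reproduce the wording of the ietestform report.

```python
STATA_MAX_NAME = 32  # Stata's limit on variable name length

def max_name_length(depth, suffix_chars=2):
    """Longest safe field name at a given repeat nesting depth.

    suffix_chars=2 covers suffixes _1 to _9; suffix_chars=3 covers _10 to _99.
    """
    return STATA_MAX_NAME - suffix_chars * depth

def classify(name, depth):
    """Apply the first test (2-character suffix), then the second (3-character)."""
    if len(name) > max_name_length(depth, 2):
        return "too long"
    if len(name) > max_name_length(depth, 3):
        return "at risk"
    return "ok"

print(max_name_length(1))      # 32 - 2 * 1
print(max_name_length(2, 3))   # 32 - 3 * 2
print(classify("x" * 31, 1))
print(classify("x" * 30, 1))
print(classify("x" * 29, 1))
```

A field outside any repeat group has depth 0, so both limits reduce to the plain 32-character limit from the previous test.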

Repeat Group Name Conflict

This test checks for name conflicts that could result from the suffixes that are added to fields inside a repeat group. SurveyCTO's ODK syntax tester tests that all names are unique. The names myvar and myvar_1 are not duplicates in the ODK syntax test, but if myvar is in a repeat group it will be suffixed with _1 for the first iteration of that variable, and that will create a name conflict with the variable created from field myvar_1.

This test lists all fields inside a repeat group for which there is another field where there is a risk for this type of name conflict. For example, if there is a field with name myvar it tests if there is any variable in the format myvar_# where # is one or several digits.

If the variable myvar is in a nested repeat group (a repeat group inside a repeat group) then it is testing for myvar_#, myvar_#_#, myvar_#_#_# etc. for each level of nested repeat group, where # is one or several digits.

Technical special case: If the fields myvar and myvar_1 are both in a non-nested repeat group then there will be no name conflict, as the variables from both fields are suffixed and the first iteration generates the variables myvar_1 and myvar_1_1. These fields are still listed by this test, as it will be confusing that the variable myvar_1 comes from the field myvar and not from the field myvar_1 that has the same name, even though this is technically not a name conflict.
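
The pattern described above, myvar_#, myvar_#_# and so on up to the nesting depth, can be expressed as a regular expression. The following Python sketch is an illustration under that reading of the text, not the command's own implementation. The function name suffix_conflicts and its arguments are assumptions.

```python
import re

def suffix_conflicts(name, depth, all_names):
    """Other field names that look like `name` plus 1..depth numeric suffixes.

    depth is the number of nested repeat groups the field `name` is in.
    """
    if depth < 1:
        return []  # fields outside repeat groups get no suffix
    # For depth 2 the pattern is e.g. myvar(_\d+){1,2}
    pattern = re.compile(rf"{re.escape(name)}(_\d+){{1,{depth}}}")
    return [
        other for other in all_names
        if other != name and pattern.fullmatch(other)
    ]

names = ["myvar", "myvar_1", "myvar_1_2", "myvar_x", "myvarz_1"]
print(suffix_conflicts("myvar", 1, names))
print(suffix_conflicts("myvar", 2, names))
```

Like the test itself, this sketch lists the pair from the technical special case too, since it compares names only and does not check whether the other field is also inside a repeat group.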

Stata Labels Columns

In a SurveyCTO form you can program your form so that multiple languages can be displayed when filling in the form. This is done by having multiple label columns named label:english, label:swahili, label:hindi etc. When you export your data using SurveyCTO Sync you can choose which language you want to use for labels.

The same feature can be used to create Stata labels by adding a label language called label:stata. Labels can obviously be added and modified once the data set has been imported to Stata. However, our experience is that this is the simplest way to add them, and if this practice is not used, the data set is often never properly labeled.

If you do not use this practice, but still use SurveyCTO's Stata code for importing data sets to Stata, you will end up having the labels displayed in the questionnaire as labels for your Stata variable. That is much better than nothing, but those labels will not be very good labels for labeling variables in Stata.

Survey Sheet Stata Labels

In Stata there is a restriction that the variable label is not longer than 80 characters. If you are trying to apply a label longer than that it will be truncated. For that reason, this test lists all fields with a label in the Stata label column that is longer than 80 characters.
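
A minimal Python sketch of this check follows. The column name label:stata comes from the section above, while the function name and the sample rows are illustrative assumptions.

```python
STATA_MAX_LABEL = 80  # Stata's limit on variable label length

def long_stata_labels(rows, column="label:stata"):
    """Return (field name, label length) for labels Stata would truncate."""
    return [
        (r["name"], len(r[column]))
        for r in rows
        if len(r.get(column, "")) > STATA_MAX_LABEL
    ]

rows = [
    {"name": "a", "label:stata": "x" * 81},  # one character too long
    {"name": "b", "label:stata": "x" * 80},  # exactly at the limit
    {"name": "c"},                           # no Stata label at all
]
print(long_stata_labels(rows))
```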

Choice Sheet Stata Labels

There are no tests specific to the Stata label column in the choice sheet other than that it exists.

Leading and Trailing Spaces

In computer science there is a difference between the string "ABC" and "ABC ". This difference does not show in Excel, and when uploading your form to SurveyCTO's server the form checker is programmed to handle this. However, when you import your form to Stata, as ietestform and several other commands do, it makes a difference.

For example, say you have a list in the choice sheet called village but the actual content of the cell is "village ". In Excel you will not see this extra space unless you really look for it. Some tools, probably most of them, will treat this as "village", but other tools might treat it as "village ", and when compared the two are not the same.

What would be even worse is if some list items in the village list have the list name value "village" and some have the value "village ". This is very difficult to spot in Excel, but some tools might treat these as different lists.

Leading (" ABC") or trailing ("ABC ") spaces are not difficult to deal with, and most tools, ietestform included, deal well with them. But there is no guarantee that all tools do, so to reduce the risk of errors in whatever tools you use on your data in the future, leading and trailing spaces should be removed.
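
The difference between the two strings, and one way to find affected cells, can be shown in a short Python sketch. The function name, the row layout and the column names are illustrative assumptions. Note that Python's strip() removes tabs and line breaks as well as spaces.

```python
# The two strings are different values even though they look alike in Excel.
print("ABC" == "ABC ")

def cells_with_outer_spaces(rows):
    """Return (sheet row, column, value) for cells with leading/trailing whitespace."""
    hits = []
    # start=2 because row 1 of the sheet is assumed to hold the column headers
    for row_number, row in enumerate(rows, start=2):
        for column, value in row.items():
            if isinstance(value, str) and value != value.strip():
                hits.append((row_number, column, value))
    return hits

choices = [
    {"list_name": "village",  "value": "1", "label": "Kigali"},
    {"list_name": "village ", "value": "2", "label": " Huye"},
]
for hit in cells_with_outer_spaces(choices):
    print(hit)
```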


Choice Lists

These are all tests related to the choice lists used in select_one and in select_multiple types of fields.

Unused Choice Lists

This test makes sure that all lists defined in the choices sheet are actually used in at least one select_one or select_multiple field in the survey sheet. It is not incorrect to have unused lists, but it is likely a sign of something that is not kept up to date in your choice lists and might therefore cause an error, unexpected behavior, or list items not being displayed during the survey.
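
This check amounts to a set difference between the lists that are defined and the lists that are referenced. The Python sketch below is an illustration only. The function name and the column names type and list_name are assumptions.

```python
def unused_choice_lists(survey_rows, choice_rows):
    """Lists defined in the choices sheet but never referenced in the survey sheet."""
    used = set()
    for row in survey_rows:
        parts = row["type"].split()
        # e.g. "select_one village" or "select_multiple crops or_other"
        if len(parts) >= 2 and parts[0] in ("select_one", "select_multiple"):
            used.add(parts[1])
    defined = {r["list_name"] for r in choice_rows if r["list_name"]}
    return sorted(defined - used)

survey = [
    {"type": "select_one village"},
    {"type": "select_multiple crops or_other"},
    {"type": "integer"},
]
choices = [
    {"list_name": "village"},
    {"list_name": "crops"},
    {"list_name": "district"},
]
print(unused_choice_lists(survey, choices))
```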

Value/Name Numeric

In Stata, categorical data is best and most efficiently stored as a number with a value label. The easiest way to ensure that this is the case with data collected by SurveyCTO is to use the Stata data import file they provide through SurveyCTO Sync, but that only works if the values in the value/name column in the choices sheet are numeric. It is not incorrect to use string values, but you will have to spend more time cleaning your data set to follow Stata best practices. This test lists all list items that have a non-numeric value in the value/name column.
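
A minimal Python sketch of such a check is shown below. What counts as numeric here (an optional minus sign, digits and an optional decimal part) is an assumption made for illustration, and the command may apply a different rule. The function name and the column name value are also assumptions.

```python
import re

# Assumed definition of "numeric": optional minus, digits, optional decimals.
NUMERIC = re.compile(r"-?\d+(\.\d+)?")

def non_numeric_values(choice_rows, column="value"):
    """Values in the value/name column that do not look like numbers."""
    return [
        r[column] for r in choice_rows
        if not NUMERIC.fullmatch(r[column].strip())
    ]

choices = [
    {"list_name": "yesno", "value": "1"},
    {"list_name": "yesno", "value": "yes"},
    {"list_name": "yesno", "value": "-99"},
    {"list_name": "yesno", "value": "2.5"},
    {"list_name": "yesno", "value": "1a"},
]
print(non_numeric_values(choices))
```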

Duplicated List Code

This test makes sure that there are no duplicates in list names and codes in the choice sheet. This test lists all list items that have other list items with the same two values in the name and code columns.

Duplicated List Labels

This test makes sure that no two labels in the same list are identical, i.e. that no label is listed twice for the same choice list but with different codes. This test lists all list items that have other list items with the same two values in the name and label columns.
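
Both duplicate tests group list items by a pair of columns and report groups with more than one member. The Python sketch below illustrates that idea. The function name and the column names list_name, value and label are assumptions.

```python
from collections import defaultdict

def duplicated_items(choice_rows, column):
    """Groups of row indexes that share the same list name and `column` value."""
    groups = defaultdict(list)
    for index, row in enumerate(choice_rows):
        groups[(row["list_name"], row[column])].append(index)
    return [indexes for indexes in groups.values() if len(indexes) > 1]

choices = [
    {"list_name": "yesno", "value": "1", "label": "Yes"},
    {"list_name": "yesno", "value": "0", "label": "No"},
    {"list_name": "yesno", "value": "1", "label": "Maybe"},  # code 1 used twice
    {"list_name": "yesno", "value": "2", "label": "No"},     # label "No" used twice
]
print(duplicated_items(choices, "value"))  # duplicated codes
print(duplicated_items(choices, "label"))  # duplicated labels
```

Because the list name is part of the grouping key, the same code or label may still be reused across different lists without being reported.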

Missing Labels or Value/Name in Choice Lists

The first part of this test makes sure that there are no list items that have a value in the label column but no value in the value/name column. The second part of this test makes sure the opposite does not happen. This is extra likely to occur when a form is programmed in multiple languages. This test lists all list items caught by either of these two tests.
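
The two parts of this test can be sketched in Python as follows. The function name, the column names and the option to pass several label columns (one per language) are illustrative assumptions.

```python
def incomplete_items(choice_rows, label_columns=("label",)):
    """Return (row index, label column, problem) for half-filled list items."""
    problems = []
    for index, row in enumerate(choice_rows):
        has_value = bool(row.get("value", "").strip())
        for column in label_columns:
            has_label = bool(row.get(column, "").strip())
            if has_label and not has_value:
                problems.append((index, column, "label without value"))
            elif has_value and not has_label:
                problems.append((index, column, "value without label"))
    return problems

choices = [
    {"list_name": "yesno", "value": "1", "label": "Yes"},
    {"list_name": "yesno", "value": "",  "label": "No"},
    {"list_name": "yesno", "value": "3", "label": ""},
]
for problem in incomplete_items(choices):
    print(problem)
```

For a multilingual form, the same function could be called with label_columns=("label:english", "label:swahili") to check each language column separately.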

Back to Parent

This article is part of the topic iefieldkit