Difference between revisions of "Ietestform"

Revision as of 21:06, 6 December 2018

ietestform is a Stata command used to test ODK-based SurveyCTO forms before they are used in the field. SurveyCTO's server has a test feature that tests the ODK syntax of the form. This command is not meant as a substitute for that test, but as a complement: it tests for constraints specific to Stata and checks that the best practices used at DIME are followed.

This article describes use cases, workflow and the reasoning used when developing the command. For instructions on how to use the command in Stata and for a complete list of the available options, see the help file by typing help ietestform in Stata. This command is part of the package iefieldkit. To install all the commands in this package, including this one, type ssc install iefieldkit in Stata.

Intended use cases

This command is intended to be used after the form has been tested on SurveyCTO's server to make sure that there are no syntax errors in the form, but before the form is used in the field. The command writes a report with the results of several tests (described below). The report is in CSV format, so it can be viewed in Excel, and it is plain text, so it can be tracked in version control tools like GitHub.

This command runs many different tests, but only some of them flag outright errors. The other tests highlight things that experienced ODK coders in SurveyCTO usually look for when trying to spot errors. Sometimes the flagged cases are not errors but actually the best way to code something. Even so, make sure that you understand every case listed in the report and why it is potentially bad practice.

If you are not sure why something was listed, then read the explanations of each test below. If you think that the command incorrectly catches cases in your SurveyCTO form then please report that here.

Instructions

These instructions are meant to help you understand how to use the command. For technical instructions on how to implement the command in Stata see the help files by typing help ietestform in Stata.

This command is very simple to use in Stata. You only need to specify your SurveyCTO form and where on your computer you want the command to write the report.

ietestform, form("$projectfolder/questionnaire.xls") report("$outputfolder/form_report")

Explanation of tests used in this command

ietestform only outputs results from tests that identified an error or a potential bad practice. If a test does not find anything, the test is not mentioned in the report at all.


Coding Practices

This section describes tests related to best practices for how to use, and how not to use, ODK programming to ensure a good workflow in the form. Note that these tests do not check whether the ODK syntax is valid, since this command is intended to be used after the form has passed SurveyCTO's ODK syntax tester. Instead, they check that the form uses ODK programming in a way that follows best practices, avoids errors in the field and helps ensure high data quality.

Required Column

The required column makes sure that the enumerator cannot proceed before a response has been filled in for that field. This is a great feature, as it prevents incomplete forms from being submitted and helps make sure that enumerators fill in the form in the right order.

There are two tests in ietestform related to this.

All Non-Note Fields Required

This test checks that all fields that are not of type note (see the other Required Column test below) have the value "Yes" in the required column. It writes a list to the report of all fields that are not required and not of type note.

Even for questions where it is sometimes expected that there is no answer to record, it is much better practice to have an answer option that represents "No Answer" than to leave the field unanswered.

Acceptable Exceptions:

  • Fields that require the GPS function on the device. GPS locations are sometimes difficult to collect for some devices in some contexts. This should be tested before and during the pilot and if it seems as if it will be difficult, then it is ok to make these field types not required.

No Note Fields Required

For fields of type note there is no way to record data, and there is therefore no way to pass a required note field. If an enumerator reaches such a field in the field, the form cannot be completed and the data cannot be submitted. See the exception below for a use case where this behavior is an important data quality feature. This test writes a list to the report of all required note fields.

Acceptable Exceptions:

  • This impassable note field has one very useful intended use case, which is forcing the enumerator to go back and change something that is not correct. For example, enumerators are often asked to enter respondent IDs twice to be extra careful that there is no typo in the ID. If those two ID fields are id1 and id2, then there can be a required note field that has the relevance expression ${id1} != ${id2}. The same functionality could be achieved using the constraint column when the ID is re-entered, but the label in the note can be made more informative than the constraint message. In addition, when the conditional test is more complex than checking that two fields are identical, this method is easier because intermediate calculate fields can be used.
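The two required-column tests above can be sketched in a few lines of Python. This is only an illustration of the logic, not how ietestform implements it. The sample rows, the column keys (type, name, required) and the set of structural row types that are skipped are assumptions made for the example.

```python
# Hypothetical rows from the survey sheet of a form.
survey = [
    {"type": "begin_group", "name": "hh",        "required": ""},
    {"type": "text",        "name": "resp_name", "required": "yes"},
    {"type": "integer",     "name": "age",       "required": ""},
    {"type": "note",        "name": "intro",     "required": ""},
    {"type": "note",        "name": "id_check",  "required": "yes"},
    {"type": "geopoint",    "name": "gps",       "required": ""},
    {"type": "end_group",   "name": "hh",        "required": ""},
]

# Assumption: rows that only structure the form are not checked.
STRUCTURAL_TYPES = {"begin_group", "end_group", "begin_repeat", "end_repeat"}


def is_required(row):
    """True if the required column holds "yes" (case-insensitive)."""
    return row.get("required", "").strip().lower() == "yes"


def non_required_fields(rows):
    """Names of non-note fields that are not marked as required."""
    return [
        row["name"]
        for row in rows
        if row["type"] not in STRUCTURAL_TYPES
        and row["type"] != "note"
        and not is_required(row)
    ]


def required_note_fields(rows):
    """Names of note fields that are marked as required."""
    return [row["name"] for row in rows if row["type"] == "note" and is_required(row)]


print(non_required_fields(survey))
print(required_note_fields(survey))
```

Both lists still need human review: gps falls under the GPS exception, and id_check may be an intentional impassable note.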

Numeric Ranges

All numeric fields, that is integer fields and decimal fields, should have a range of acceptable values in the constraint column. Make this range wider than what you expect! The range in the constraint column should be used to prevent typos and illogical values (like negative age), but not to force the data to be within your preexisting expectations. Your preexisting expectation is a good starting point for this range, but make it much wider than that, as we do not yet know what special cases your data collection will encounter in the field, and these outliers are very important to understand for your research.
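As a rough illustration of this check, the Python sketch below lists numeric fields with an empty constraint column. It is not ietestform's own code. The rows, field names and the example constraint expression are hypothetical, and the sketch only checks that some constraint exists, not that it is a sensible range.

```python
NUMERIC_TYPES = {"integer", "decimal"}

# Hypothetical survey sheet rows. The age range is deliberately wide.
survey = [
    {"type": "integer", "name": "age",       "constraint": ". >= 0 and . <= 130"},
    {"type": "decimal", "name": "plot_size", "constraint": ""},
    {"type": "text",    "name": "village"},
]


def numeric_fields_without_constraint(rows):
    """Names of integer/decimal fields whose constraint column is empty."""
    return [
        row["name"]
        for row in rows
        if row["type"] in NUMERIC_TYPES
        and not row.get("constraint", "").strip()
    ]


print(numeric_fields_without_constraint(survey))
```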

Matching begin_/end_

The main aspect of this test is also covered by the test function on SurveyCTO's server, but the error messages for this kind of error are not always useful, especially when the form is very large. One of the main reasons for this might be that ODK does not require end_group and end_repeat to have field names. So the first part of this test requires that all end_group and end_repeat fields have a value in the field name column, and the last part requires that this name is identical to the name of the corresponding begin_group or begin_repeat field.

The main part of this test checks that every begin_group is matched by an end_group and that every begin_repeat is matched by an end_repeat. Matched means both that for each begin_ there is an end_, and that the end_ is of the correct type, so that a begin_group is not closed by an end_repeat. Finally, it tests that the end_ names match the begin_ names. SurveyCTO's server makes sure that the begin_ names are unique.
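The matching described above is a classic stack problem, and a minimal Python sketch of it could look like this. It is an illustration under assumptions, not the command's actual implementation: the function name, the row layout and the wording of the messages are all made up for the example.

```python
def check_begin_end(rows):
    """Return a list of problems found when pairing begin_/end_ rows."""
    problems = []
    stack = []  # currently open blocks as (kind, name), innermost last
    for row in rows:
        row_type = row["type"].replace(" ", "_")  # also accept "begin group"
        name = row.get("name", "").strip()
        if row_type in ("begin_group", "begin_repeat"):
            stack.append((row_type[len("begin_"):], name))
        elif row_type in ("end_group", "end_repeat"):
            kind = row_type[len("end_"):]
            if not name:
                problems.append(f"{row_type} has no name")
            if not stack:
                problems.append(f"{row_type} '{name}' has no matching begin_")
                continue
            open_kind, open_name = stack.pop()
            if open_kind != kind:
                problems.append(
                    f"begin_{open_kind} '{open_name}' closed by {row_type}"
                )
            elif name and name != open_name:
                problems.append(
                    f"begin_{open_kind} '{open_name}' closed by {row_type} '{name}'"
                )
    # Anything still on the stack was opened but never closed.
    for open_kind, open_name in stack:
        problems.append(f"begin_{open_kind} '{open_name}' is never closed")
    return problems


form = [
    {"type": "begin_group",  "name": "hh"},
    {"type": "begin_repeat", "name": "members"},
    {"type": "integer",      "name": "age"},
    {"type": "end_repeat",   "name": "members"},
    {"type": "end_group",    "name": "household"},  # name does not match "hh"
]
print(check_begin_end(form))
```

Because every end_ row carries a name, a message can point at the exact pair that is mismatched, which is what makes the report easier to act on in a large form.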


Naming and Labeling Practices

ODK has very few restrictions on names, apart from requiring that all names are unique and disallowing a few characters. All of that is tested by SurveyCTO's server. The tests in this section are mainly motivated by the additional rules for names that Stata has, which first come into effect when you import your data to Stata.

Field Name Length

Stata has a limit of 32 characters for variable names, which applies to the field name. Stata will truncate the name if it is longer than that, and replace it with a generic name of the form var1, var2 etc. if the name is no longer unique after being truncated. All of these cases can be resolved in Stata, but it is much simpler to solve this before starting to collect the data.

Repeat Group Field Name Length

This test has two parts. The first part lists fields that will have issues with too long names, and the second part lists fields where the risk is high but it is not certain that the name will be too long.

When using SurveyCTO's Stata import do-file, or when exporting the data set in wide format, all variables in a repeat group will have a suffix added to the variable name. If a repeat group is repeated three times, then in the wide data set any variable in that repeat group will generate three variables, with the names suffixed by _1, _2 and _3 respectively. This suffix must fit within the 32-character limit for variable names in Stata discussed in the previous test. So any variable in a repeat group may only have a field name that is 30 characters long. If the field is in a nested repeat group (a repeat group inside a repeat group), then it will be suffixed once for each repeat group. So the actual constraint used in this test is given by this formula: 32 - (2 * number of nested repeat groups for the field). This test lists all variables that have names longer than that constraint.

In the first test we assume that there are no more than 9 iterations of each repeat group. If there are more than 9, the suffix will be _10 or higher, which takes up three characters. So the second test lists all fields that have a field name longer than 32 - (3 * number of nested repeat groups for the field). It is not certain that these names will create an issue, but if your names are so long that they are caught by this test, then it is probably best practice to try to make them shorter.
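The two formulas above can be written out as a small Python sketch. The function names and the labels "too long", "at risk" and "ok" are made up for this illustration and are not taken from the command or its report.

```python
STATA_MAX_NAME_LENGTH = 32


def max_name_length(repeat_depth, suffix_chars=2):
    """Longest safe field name for a field nested in repeat_depth repeat groups.

    suffix_chars=2 covers suffixes _1 to _9; suffix_chars=3 covers _10 to _99.
    """
    return STATA_MAX_NAME_LENGTH - suffix_chars * repeat_depth


def classify_name(name, repeat_depth):
    """Classify a field name against the two parts of the test."""
    if len(name) > max_name_length(repeat_depth, suffix_chars=2):
        return "too long"   # first part: breaks even with 9 or fewer iterations
    if len(name) > max_name_length(repeat_depth, suffix_chars=3):
        return "at risk"    # second part: breaks once a group has 10+ iterations
    return "ok"


print(max_name_length(1))                  # limit in a non-nested repeat group
print(max_name_length(2, suffix_chars=3))  # stricter limit, two nested groups
print(classify_name("a" * 30, repeat_depth=1))
```

A field that is not in any repeat group has repeat_depth 0, so both limits reduce to the plain 32-character limit of the previous test.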

Repeat Group Name Conflict

This test checks for name conflicts that could result from the suffixes that are added to fields inside a repeat group. SurveyCTO's ODK syntax tester tests that all names are unique. The names myvar and myvar_1 are not duplicates in the ODK syntax test, but if myvar is in a repeat group it will be suffixed with _1 for the first iteration of that variable, and that will create a name conflict with the variable myvar_1.

This test lists all variables inside a repeat group for which there is another variable whose name consists of the same name followed by an underscore and a number. For example, if there is a field with the name myvar, it tests whether there is any variable of the form myvar_#, where # is one or several digits.

If the variable myvar is in a nested repeat group (a repeat group inside a repeat group) then it is testing for myvar_#, myvar_#_#, myvar_#_#_# etc. for each level of nested repeat group, where # is one or several digits.

Technical special case. This test does not check whether the field with the potential name conflict is nested in the number of repeat groups required for the conflict to actually happen. For example, if myvar is in a non-nested repeat group, the variables it creates will be named myvar_#, and there will only be a name conflict if there is a field myvar_1 that is not in any repeat group. If myvar_1 is in a repeat group too, it will create variables in the wide format named myvar_1_#. This test does not take that into account and lists all such names, because even when they do not create a name conflict, names this similar can still lead to confusion and errors in the code.
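The name pattern described above maps naturally onto a regular expression. The following Python sketch is an assumption-laden illustration of that pattern, not the command's own code, and the function name is hypothetical. Like the test itself, it ignores whether the other field is inside a repeat group.

```python
import re


def conflicting_names(field, repeat_depth, all_names):
    """Names of the form field_#, field_#_#, ... up to repeat_depth suffixes.

    field is a field inside repeat_depth nested repeat groups and
    # stands for one or several digits.
    """
    if repeat_depth < 1:
        return []  # fields outside repeat groups get no suffix
    pattern = re.compile(rf"{re.escape(field)}(_\d+){{1,{repeat_depth}}}")
    return [name for name in all_names if pattern.fullmatch(name)]


names = ["myvar", "myvar_1", "myvar_12", "myvar_1_2", "myvar_a", "myvarx_1"]
print(conflicting_names("myvar", 1, names))
print(conflicting_names("myvar", 2, names))
```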

Stata Labels Columns

Explanation of using language as label column

Survey Sheet
Choice Sheet

Choice Lists

Value/Name Numeric

In Stata, categorical data is best and most efficiently stored as a number with a value label. The easiest way to ensure that this is the case for data collected with SurveyCTO is to use the Stata data import file provided through SurveyCTO Sync, but that only works if the values in the value/name column in the choices sheet are numeric. It is not incorrect to use string values, but you will have to spend more time cleaning your data set to follow Stata best practices.
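A check along these lines could be sketched in Python as follows. This is an illustration rather than the command's implementation. The rows and the column keys (list_name, value, label) are hypothetical, and the sketch treats only whole numbers as numeric, on the grounds that Stata value labels attach to integer values.

```python
import re

# Hypothetical rows from the choices sheet of a form.
choices = [
    {"list_name": "yesno", "value": "1",     "label": "Yes"},
    {"list_name": "yesno", "value": "0",     "label": "No"},
    {"list_name": "yesno", "value": "-99",   "label": "Don't know"},
    {"list_name": "crop",  "value": "maize", "label": "Maize"},
]


def non_numeric_choice_values(rows):
    """(list_name, value) pairs whose value is not a whole number."""
    return [
        (row["list_name"], row["value"])
        for row in rows
        if not re.fullmatch(r"-?\d+", str(row["value"]).strip())
    ]


print(non_numeric_choice_values(choices))
```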

Duplicated List Codes

Duplicated List Labels

Unused Lists

Missing Labels

Labels With No Value/Name

This article is part of the topic ietoolkit