Difference between revisions of "Ietestform"
Revision as of 22:18, 30 April 2020
DIME Analytics has created iefieldkit as a package in Stata to support the process of primary data collection from start to finish. Since data is often collected for researchers by third-party survey firms or local partners, data quality assurance is a particularly important aspect of data collection. ietestform allows the research team to test Open Data Kit (ODK)-based electronic survey forms for common errors, as well as for best practices for SurveyCTO-based forms, before field data collection starts. For example, the SurveyCTO server has a built-in test feature that tests the ODK syntax of a form when it is uploaded by the research team. ietestform complements these built-in tests to ensure that the collected data is in a format that is easily readable in Stata, and is of high quality.
Read First
- Install ietestform by typing ssc install ietestform in Stata.
- For instructions and available options, type help ietestform in Stata.
- This command is a part of the package iefieldkit; to install all the commands in this package, including ietestform, type ssc install iefieldkit in Stata.
Overview
This article describes how and when to use the command and how to interpret its output. There are relatively few Stata-based tools for managing the data collection process, mainly due to its complexity and to the diversity of practices adopted. The DIME Analytics team has worked to standardize some of the tools and processes used for data collection to save time during the more tedious elements of the process, improve documentation, and reduce error. Most importantly, since the quality of data collection is as important to credible research as the quality of analysis, these tools are intended to allow researchers to focus on ensuring that the data they collect is high-quality. Many contemporary data collection efforts use digital survey technologies, such as the open-source Open Data Kit (ODK) or proprietary extensions of ODK, like SurveyCTO.
ietestform is a Stata command used to test ODK-based SurveyCTO forms during questionnaire programming and before fieldwork. It tests for best practices in coding, naming and labeling, and choice lists. The command then outputs a test report in .csv format. ietestform is intended to be used after SurveyCTO's server test feature rather than in place of it.
To use this command in Stata, simply specify your SurveyCTO form and the path to which you want to write the report:
ietestform using "/path/to/surveyform.xlsx", reportsave("/path/to/report.csv")
Note that ietestform may flag a feature that is indeed the best option for your particular case. Interpret the output conscientiously: be sure to understand why each case was flagged, and decide accordingly whether or not to modify the form. If you are not sure why something was flagged, read the explanations of each test below. If you think that the command incorrectly flagged cases in your SurveyCTO form, please report the case to DIME Analytics, who will happily work on improving the command.
Tests for Coding Practices
This section describes the ietestform tests of ODK programming practices. These tests flag risks of error that may interrupt field work. Note that ietestform assumes that the ODK syntax is already tested and is correct; it is intended to be used after the form has passed the ODK syntax test on SurveyCTO's server.
Required Column
The required column ensures that the enumerator cannot proceed before a response has been entered into the field. This prevents submissions of incomplete forms and helps ensure that enumerators complete forms in the right order.
While you can fill in a value in the required column for any field, only field types with a view (i.e. fields that show up when filling in a form) are affected by that value. Examples of fields without a view are begin_group, end_repeat, text_audit, calculate, deviceid, caseid, etc. All fields without a view are ignored in the tests related to the required column.
ietestform runs two tests related to the required column.
All Non-Note Fields Required
This tests that all fields that are not of type note have the value "Yes" in the required column. It then outputs a list to the report for all fields that are not required and not of type note. Note that even when "no answer" is a valid response from the respondent, never use the absence of a recorded answer to represent that; when applicable, use a valid method to record that the respondent’s answer was "no answer".
Sometimes, researchers intentionally choose to leave GPS fields (i.e. geopoint, geoshape and geotrace) as not required. If you know that the devices used for data collection will have no problem collecting GPS coordinates, keep these fields required. However, if GPS coordinates will be difficult to collect due to, for example, connection issues, it may be a good idea to not require these fields so that the enumerator can still complete the other fields and submit the form even when they cannot record GPS coordinates.
No Note Fields Required
Fields of type note have a view and can therefore be required. However, there is no way to record data in a note field, so there is no way to pass a required note field. While this feature can sometimes be put to great use (see below), it is generally problematic. ietestform writes a list to the report of all fields that are of type note and are required.
Note that there are cases in which required note fields may be really useful. Since enumerators cannot pass these fields, researchers may use them with a relevance condition so that they show up only if an earlier entry in the form is incorrect. This forces the enumerator to go back and correct the error before continuing data collection.
For example, enumerators are often asked to enter respondent IDs twice to be extra careful that there is no typo in the ID. Let's say those two double-entry fields are id1 and id2. These fields can be followed by a required note field that has the relevance expression ${id1} != ${id2}; then, the note will only appear if the two IDs are not identical. The note label can then inform the enumerator that the two ID fields are not identical and that the enumerator must go back and change the values in order to continue.
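The two required-column tests can be sketched in a few lines of Python. This is only an illustration of the logic described above, not ietestform's actual Stata implementation. It assumes the survey sheet has already been read into a list of dictionaries with the keys type, name and required, and the set of view-less field types is a hypothetical, non-exhaustive list.

```python
# Hypothetical, non-exhaustive set of field types without a view.
NO_VIEW_TYPES = {"begin_group", "end_group", "begin_repeat", "end_repeat",
                 "text_audit", "calculate", "deviceid", "caseid"}

def check_required(rows):
    """Return (non-note fields that are not required, note fields that are required)."""
    not_required, required_notes = [], []
    for row in rows:
        ftype = row["type"].strip()
        if ftype in NO_VIEW_TYPES:
            continue  # fields without a view are ignored by both tests
        is_required = row.get("required", "").strip().lower() == "yes"
        if ftype == "note" and is_required:
            required_notes.append(row["name"])
        elif ftype != "note" and not is_required:
            not_required.append(row["name"])
    return not_required, required_notes

# Assumed example rows, as they might look after reading the survey sheet.
survey = [
    {"type": "text", "name": "id1", "required": "yes"},
    {"type": "integer", "name": "age", "required": ""},
    {"type": "note", "name": "id_warning", "required": "yes"},
    {"type": "calculate", "name": "age_sq", "required": ""},
]
print(check_required(survey))  # (['age'], ['id_warning'])
```

A flagged required note such as id_warning is not necessarily a mistake; it may be an intentional use of the pattern described in this section.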
In this case, researchers could also use the constraint condition on the second ID field when the ID is re-entered. However, the message in the required note field approach could be more informative than the message in the constraint condition. Further, when the conditional test is more difficult than just testing that two fields are identical, the required note field method is an easier approach than using intermediate calculate and relevance fields.
Numeric Ranges
not implemented yet
All numeric fields, integer fields or decimal fields should have ranges for acceptable values in the constraint column. Make this range wider than what you expect it to be! The range in the constraint column should be used to prevent typos, to prevent illogical values (like negative age) but not to force the data to be within your preexisting expectations. Your preexisting expectation is a good starting point for this range, but make it much wider, as you do not yet know what special cases may exist; these outliers can be important for your research.
Matching begin_/end_
While the ODK syntax tester on SurveyCTO's server tests for matching begin_ and end_ values, the error message for this error is not always useful, especially when the form is very large. The lack of clarity in these error messages may result from the fact that ODK does not require the end_group and end_repeat to have field names. So the first part of ietestform's test for this characteristic checks that all end_group and end_repeat fields have a value in the field name column. The main part of this test checks that every begin_group is matched by an end_group and that every begin_repeat is matched by an end_repeat. To be considered matched, the following three criteria need to be fulfilled:
- for each begin_ there is an end_
- the corresponding end_ is of the correct type such that a begin_group is not closed by an end_repeat or a begin_repeat is not closed by an end_group.
- the end_ names match the begin_ names. SurveyCTO's server makes sure that the begin_ names are unique, so each pair will be unique if this part of the test is passed.
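The three criteria above amount to a bracket-matching problem, which can be sketched with a stack. The Python below is an illustrative sketch only, not ietestform's Stata code; it assumes the survey sheet is available as a list of dictionaries with the keys type and name, and the wording of the messages is made up.

```python
def check_begin_end(rows):
    """Return a list of human-readable problems with begin_/end_ pairing."""
    problems, stack = [], []
    for row in rows:
        ftype = row["type"].strip()
        name = row.get("name", "").strip()
        if ftype in ("begin_group", "begin_repeat"):
            # Remember the kind ("group" or "repeat") and name of the open block.
            stack.append((ftype[len("begin_"):], name))
        elif ftype in ("end_group", "end_repeat"):
            kind = ftype[len("end_"):]
            if not name:
                problems.append(f"{ftype} has no field name")
            if not stack:
                problems.append(f"{ftype} '{name}' has no matching begin_")
                continue
            open_kind, open_name = stack.pop()
            if open_kind != kind:
                problems.append(f"begin_{open_kind} '{open_name}' closed by {ftype}")
            elif name and name != open_name:
                problems.append(
                    f"begin_{open_kind} '{open_name}' closed by {ftype} '{name}'")
    for open_kind, open_name in stack:
        problems.append(f"begin_{open_kind} '{open_name}' is never closed")
    return problems

# Assumed example: the repeat group is closed with the wrong end_ type.
survey = [
    {"type": "begin_group", "name": "hh"},
    {"type": "begin_repeat", "name": "members"},
    {"type": "text", "name": "member_name"},
    {"type": "end_group", "name": "members"},
    {"type": "end_group", "name": "hh"},
]
print(check_begin_end(survey))
```

Because every end_ row carries a name, a message can point to the exact pair that failed, which is the motivation for the first part of the test.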
Tests for Naming and Labeling Practices
ODK has very few restrictions on names apart from that all names must be unique and that a few characters are not allowed. All of those restrictions are tested by the ODK syntax test on SurveyCTO's server. The additional tests done by ietestform are mainly due to the additional Stata naming rules that you will encounter when importing data to Stata.
Field Name Length
Stata limits variable names to 32 characters. If a field name is longer than that, Stata will truncate it, or replace it with a generic name in the format var1, var2, etc. if the name is no longer unique after being truncated. While all of these cases can be resolved in Stata, it is much simpler to solve naming issues before starting to collect the data. While 32 characters are allowed, some common commands add one character to variable names when processing them, so we recommend a maximum of 31 characters. This test lists all fields with names longer than 31 characters.
Repeat Group Field Name Length
This test has two parts. The first part lists fields in repeat groups that have names that will be too long in the wide format when imported to Stata. The second part lists fields in repeat groups where the risk of too long names is high, but not certain.
When using SurveyCTO's Stata import do-file or when exporting the data set in wide format, all variables in a repeat group will have a suffix added to the variable name. For example, if a repeat group is repeated three times, then in the wide dataset, any variable in that repeat group will generate three variables, with the suffixes _1, _2 and _3 respectively. This suffix also counts towards the 31-character limit for variable names in Stata discussed in the previous test. (Technically 32 characters are allowed, but some common commands add one character to variable names when processing them, so we recommend a maximum of 31 characters.) Thus, any variable in a repeat group should have a field name no longer than 29 characters. If the field is in a nested repeat group (a repeat group inside a repeat group), then it will be suffixed once for each repeat group. So the actual constraint used in this test is given by this formula: 31 - (2 * number of nested repeat groups for the field). This test lists all variables that have longer names than that constraint.
The first test assumes that there are no more than 9 iterations in each repeat group; if there are more than 9, the suffixes will be _10, _11 etc., which take up three characters. So the second test lists all fields that have a field name that is longer than 31 - (3 * number of nested repeat groups for the field). Whether this will create an issue with long names is uncertain, but if your names are so long that they might be caught in this test, then it is probably best practice to try to make the names shorter.
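The two length limits can be illustrated with a short Python sketch that tracks how many repeat groups a field is nested in. It is a sketch of the formulas above under the assumption that the survey sheet is a list of dictionaries with the keys type and name; it is not ietestform's own code. Outside any repeat group the depth is 0, so both limits reduce to the plain 31-character rule.

```python
def check_name_lengths(rows, max_len=31):
    """Return (names certainly too long, names too long if a repeat exceeds 9 iterations)."""
    too_long, at_risk = [], []
    depth = 0  # number of repeat groups the current field is nested in
    for row in rows:
        ftype, name = row["type"].strip(), row["name"].strip()
        if ftype == "begin_repeat":
            depth += 1
        elif ftype == "end_repeat":
            depth -= 1
        elif ftype not in ("begin_group", "end_group"):
            if len(name) > max_len - 2 * depth:    # suffixes like _1 ... _9
                too_long.append(name)
            elif len(name) > max_len - 3 * depth:  # suffixes like _10, _11, ...
                at_risk.append(name)
    return too_long, at_risk

# Assumed example rows; the names are placeholders of a given length.
survey = [
    {"type": "text", "name": "a" * 31},          # depth 0: limit is 31, passes
    {"type": "begin_repeat", "name": "plots"},
    {"type": "integer", "name": "b" * 30},       # depth 1: limit is 29, too long
    {"type": "integer", "name": "c" * 29},       # passes 29, but exceeds 28
    {"type": "end_repeat", "name": "plots"},
]
too_long, at_risk = check_name_lengths(survey)
print(len(too_long), len(at_risk))  # 1 1
```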
Repeat Group Name Conflict
This test checks for name conflicts that may result from the suffixes added to fields inside a repeat group. SurveyCTO's ODK syntax tester tests that all names are unique. The names myvar and myvar_1 are not duplicates in the ODK syntax test, but if myvar is in a repeat group, it will be suffixed with _1 for the first iteration of that variable; that will create a name conflict with the variable created from field myvar_1.
This test lists all fields inside a repeat group with which another field may conflict due to names. For example, if there is a field with the name myvar, ietestform tests if there is any variable of the format myvar_#, where # is one or several digits.
If the variable myvar is in a nested repeat group (a repeat group inside a repeat group), then it is testing for myvar_#, myvar_#_#, myvar_#_#_# etc. for each level of nested repeat group, where # is one or several digits.
Technical special case: If the fields myvar and myvar_1 are both in a non-nested repeat group then there will be no name conflict: the first iteration of both fields will generate the variables myvar_1 and myvar_1_1 since the variables from both fields are suffixed. These fields are still listed by this test as it may be confusing that the variable myvar_1 is from the field myvar and not from myvar_1.
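The pattern myvar_#, myvar_#_#, etc. maps naturally onto a regular expression. The following Python sketch illustrates the idea under the assumption that the survey sheet is a list of dictionaries with the keys type and name; it is not the command's actual implementation.

```python
import re

def find_suffix_conflicts(rows):
    """Return (field, other_field) pairs where other_field looks like a suffixed field."""
    depth, depth_by_name = 0, {}
    for row in rows:
        ftype, name = row["type"].strip(), row["name"].strip()
        if ftype == "begin_repeat":
            depth += 1
        elif ftype == "end_repeat":
            depth -= 1
        elif ftype not in ("begin_group", "end_group"):
            depth_by_name[name] = depth  # repeat nesting level of each field
    conflicts = []
    for name, d in depth_by_name.items():
        if d == 0:
            continue  # fields outside repeat groups are never suffixed
        # One "_<digits>" suffix per level of nesting, up to d levels.
        pattern = re.compile(re.escape(name) + r"(?:_\d+){1," + str(d) + "}")
        for other in depth_by_name:
            if other != name and pattern.fullmatch(other):
                conflicts.append((name, other))
    return conflicts

# Assumed example: myvar is repeated, myvar_1 is a separate field.
survey = [
    {"type": "begin_repeat", "name": "plots"},
    {"type": "integer", "name": "myvar"},
    {"type": "end_repeat", "name": "plots"},
    {"type": "integer", "name": "myvar_1"},
]
print(find_suffix_conflicts(survey))  # [('myvar', 'myvar_1')]
```

Like the test described above, this sketch also lists the pair when both fields are inside the same repeat group, even though no actual name clash occurs in that case.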
Stata Labels Columns
In SurveyCTO, you can program your form so that multiple languages can be displayed when filling in a form. This is done by creating label columns named label:english, label:swahili, label:hindi etc. When you export your data using SurveyCTO Sync you can choose which language you want to use for labels.
The same feature can be used to create Stata labels by adding a label language called label:stata. Labels can obviously be added and modified once the data set has been imported to Stata. However, our experience is that this is the simplest way to add them; if this practice is not used, the data set is often not properly labeled.
If you do not use this practice, but still use SurveyCTO's Stata code for importing data sets to Stata, you will end up having the labels displayed in the questionnaire as labels for your Stata variable. While it is better than no labels, label:stata allows better variable labeling.
Survey Sheet Stata Labels
In Stata, there is a restriction that the variable label does not exceed 80 characters. If you apply a label longer than that, it will be truncated. For this reason, ietestform lists all fields with a label in the Stata label column that is longer than 80 characters.
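As a minimal illustration of this check, the Python sketch below flags Stata labels longer than 80 characters. It assumes the survey sheet is a list of dictionaries and that the Stata label column is stored under the key label:stata; it is a sketch, not ietestform's code.

```python
STATA_LABEL_MAX = 80  # Stata truncates variable labels beyond this length

def long_stata_labels(rows, column="label:stata"):
    """Return the names of fields whose Stata label exceeds 80 characters."""
    return [row["name"] for row in rows
            if len(row.get(column, "")) > STATA_LABEL_MAX]

# Assumed example rows with placeholder labels.
survey = [
    {"name": "age", "label:stata": "Age of respondent in years"},
    {"name": "income", "label:stata": "x" * 81},
]
print(long_stata_labels(survey))  # ['income']
```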
Choice Sheet Stata Labels
Apart from whether the label:stata exists or not, there is no further test on the values of the Stata label column in the choice sheet.
Leading and Trailing Spaces
In computer science, there is a difference between the string "ABC" and "ABC ". This difference does not show in Excel. When uploading your form to SurveyCTO's server, the form checker is programmed to handle these differences. However, when you import your form to Stata, as ietestform and several other commands do, these minor differences are distinguished.
For example, suppose you have a list in the choice sheet called village, but the actual content of the cell is "village ". In Excel you will not see this extra space unless you really look for it. This means that some tools, probably most of them, will treat this as "village", but other tools might treat it as "village ", and the two values are not the same when compared.
What would be even worse is if some list items in the village list have the list name value "village" and some have the value "village ". This is very difficult to spot in Excel but some tools might treat these as different.
Leading (" ABC") or trailing ("ABC ") spaces are not difficult to deal with and most tools, ietestform included, deal with them. However, there is no guarantee that all tools do. To reduce the risk of errors in whatever tools you use on your data in the future, leading and trailing spaces should be removed.
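A check for leading and trailing spaces only needs to compare each cell with its stripped version. The Python sketch below illustrates this for a sheet that is assumed to have been read into a list of dictionaries; the column names are assumptions, and the code is not taken from ietestform.

```python
def find_padded_cells(rows):
    """Return (spreadsheet_row, column, value) for cells with leading/trailing spaces."""
    flagged = []
    # Start at 2 because row 1 of the spreadsheet holds the column headers.
    for row_number, row in enumerate(rows, start=2):
        for column, value in row.items():
            if isinstance(value, str) and value != value.strip():
                flagged.append((row_number, column, value))
    return flagged

# Assumed example: the second item has a trailing space in its list name.
choices = [
    {"list_name": "village", "value": "1", "label": "Village A"},
    {"list_name": "village ", "value": "2", "label": "Village B"},
]
print("village" == "village ")     # False: the strings differ
print(find_padded_cells(choices))  # [(3, 'list_name', 'village ')]
```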
Tests for Choice List Practices
These tests are related to the choice lists used in select_one and select_multiple fields. The ODK syntax is very lenient when it comes to choice lists, and it lets some undesirable practices pass. For example, unused lists and duplicate labels could mean that list elements were copied and pasted accidentally. The command reports these cases, as they are common sources of errors.
Unused Choice Lists
This test makes sure that all lists defined in the choices sheet are actually used in at least one select_one or select_multiple field in the survey sheet. It is not incorrect to have unused lists, but it is likely a sign that something in your choice lists is not kept up to date, which might cause an error, unexpected behavior, or list items not being displayed during the survey.
For example, imagine you have 10 villages in a choice list called village but you incorrectly type vilage for one of them. Then, according to ODK syntax you have two lists, one called village with 9 items and one called vilage with 1 item. It is unlikely that there is a select_one or select_multiple field that uses the choice list vilage, so listing unused choice lists is a good way to spot a typo like this one.
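The vilage example can be reproduced with a small Python sketch that compares the lists defined in the choices sheet with the lists referenced in the survey sheet. The sheets are assumed to be lists of dictionaries, and the key list_name is an assumed column name; this is an illustration rather than the command's own logic.

```python
def unused_choice_lists(survey_rows, choice_rows):
    """Return choice lists that no select_one/select_multiple field refers to."""
    used = set()
    for row in survey_rows:
        parts = row["type"].split()
        # A select field's type looks like "select_one <list name>".
        if len(parts) >= 2 and parts[0] in ("select_one", "select_multiple"):
            used.add(parts[1])
    defined = {row["list_name"].strip() for row in choice_rows
               if row["list_name"].strip()}
    return sorted(defined - used)

# Assumed example: one village item is misspelled as "vilage".
survey = [{"type": "select_one village", "name": "hh_village"}]
choices = [
    {"list_name": "village", "value": "1", "label": "Village A"},
    {"list_name": "village", "value": "2", "label": "Village B"},
    {"list_name": "vilage", "value": "3", "label": "Village C"},
]
print(unused_choice_lists(survey, choices))  # ['vilage']
```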
Value/Name Numeric
In Stata, categorical data is best and most efficiently stored as a number with a value label. The easiest way to ensure that is the case with data collected by SurveyCTO is to use the Stata data import file provided through SurveyCTO Sync. However, this file only works if the values in the value/name column in the choices sheet are numeric. It is not incorrect to use string variables here, but you will have to spend more time cleaning your dataset to follow Stata best practices. This test lists all list items that have a non-numeric value in the value/name column.
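A simple way to picture this test is a pattern match on each entry in the value/name column. The Python sketch below assumes the choices sheet is a list of dictionaries with the keys list_name and value, and it assumes that "numeric" means a plain integer or decimal; ietestform's own definition may differ.

```python
import re

# Assumption: "numeric" means an optional minus sign, digits, optional decimals.
NUMERIC = re.compile(r"-?\d+(\.\d+)?")

def non_numeric_values(choice_rows):
    """Return (list_name, value) for list items whose value is not numeric."""
    return [(row["list_name"], row["value"]) for row in choice_rows
            if not NUMERIC.fullmatch(str(row["value"]).strip())]

# Assumed example: the gender list uses a string code.
choices = [
    {"list_name": "village", "value": "1", "label": "Village A"},
    {"list_name": "village", "value": "2", "label": "Village B"},
    {"list_name": "gender", "value": "male", "label": "Male"},
]
print(non_numeric_values(choices))  # [('gender', 'male')]
```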
Duplicated List Code
This test makes sure that no code is used more than once within the same choice list. It lists all list items for which another list item has the same values in both the list name and code columns.
Duplicated List Labels
This test makes sure that there is no label in the same list that is identical (i.e. one label that is listed twice for the same choice list but with different codes). This test lists all list items that have other list items with the same two values in the name and label columns.
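Both duplicate tests follow the same pattern: count pairs of (list name, code) or (list name, label) and report every item whose pair occurs more than once. The Python sketch below illustrates this, assuming the choices sheet is a list of dictionaries with the keys list_name, value and label; it is not ietestform's implementation.

```python
from collections import Counter

def duplicated_items(choice_rows, column):
    """Return list items that share their (list_name, column) pair with another item."""
    counts = Counter((row["list_name"], row[column]) for row in choice_rows)
    return [row for row in choice_rows
            if counts[(row["list_name"], row[column])] > 1]

# Assumed example: the label "Yes" appears twice in the same list.
choices = [
    {"list_name": "yesno", "value": "1", "label": "Yes"},
    {"list_name": "yesno", "value": "0", "label": "No"},
    {"list_name": "yesno", "value": "2", "label": "Yes"},
]
print(duplicated_items(choices, "value"))                        # []
print([r["value"] for r in duplicated_items(choices, "label")])  # ['1', '2']
```

The same label or code may legitimately appear in different lists, which is why the list name is part of the key.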
Missing Labels or Value/Name in Choice Lists
The first part of this test makes sure that there is no list item that has a value in the label column but no value in the value/name column. The second part of this test makes sure the opposite does not happen. These problems are especially likely to occur when a form is programmed in multiple languages. This test lists all list items caught by either of these two tests.
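The two parts of this test can be expressed as a single comparison: an item is flagged when exactly one of the two cells is filled in. The Python sketch below shows this under the assumption that the choices sheet is a list of dictionaries with the keys list_name, value and label; it is illustrative only.

```python
def missing_value_or_label(choice_rows):
    """Return list items that have a label but no value, or a value but no label."""
    flagged = []
    for row in choice_rows:
        has_value = bool(str(row.get("value", "")).strip())
        has_label = bool(str(row.get("label", "")).strip())
        if has_value != has_label:  # exactly one of the two is missing
            flagged.append(row)
    return flagged

# Assumed example: one item lacks a label, another lacks a value.
choices = [
    {"list_name": "yesno", "value": "1", "label": "Yes"},
    {"list_name": "yesno", "value": "0", "label": ""},
    {"list_name": "yesno", "value": "", "label": "Don't know"},
]
print(len(missing_value_or_label(choices)))  # 2
```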
Back to Parent
This article is part of the topic iefieldkit
Additional Resources
- See other frameworks for testing ODK/SurveyCTO forms similarly to ietestform: IPA's ipacheckscto and PMA2020's xform-test.
- See the iefieldkit installation instructions here.