Difference between revisions of "Ietestform"
[https://www.worldbank.org/en/research/dime/data-and-analytics DIME Analytics] has created <code>[[iefieldkit]]</code> as a package in [[Stata Coding Practices|Stata]] to support the process of [[Primary Data Collection|primary data collection]] from start to finish. In most cases, third-party [[Survey Firm|survey firms]] or local partners collect data on behalf of the [[Impact Evaluation Team|research team]]. Therefore, [[Data Quality Assurance Plan|data quality assurance]] is a particularly important aspect of '''data collection'''. <code>ietestform</code> allows the '''research team''' to test [https://opendatakit.org/ Open Data Kit (ODK)-based] electronic [[SurveyCTO Form Settings|survey forms]] for common errors, as well as for [[SurveyCTO Coding Practices | best practices]] for '''SurveyCTO-based forms''', before [[Preparing for Field Data Collection|field data collection]] starts. For example, the [[SurveyCTO Server Management|SurveyCTO server]] has a built-in test feature that tests the '''ODK''' syntax of a form when it is uploaded by the '''research team'''. <code>ietestform</code> complements these built-in tests to ensure that the collected data is in a format that is easily readable in '''Stata''', and warns users about practices that we have learnt are prone to data quality errors.
==Read First==
* Please refer to [[Stata Coding Practices|Stata coding practices]] for coding best practices in '''Stata'''.
* <code>ietestform</code> is part of the package <code>[[iefieldkit]]</code>, which has been developed by [https://www.worldbank.org/en/research/dime/data-and-analytics DIME Analytics].
* To install <code>ietestform</code>, as well as other commands in the <code>iefieldkit</code> package, type <syntaxhighlight lang="Stata" inline>ssc install iefieldkit</syntaxhighlight> in '''Stata'''.
* For instructions and available options, type <syntaxhighlight lang="Stata" inline>help ietestform</syntaxhighlight>.
== Overview ==
In Open Data Kit (ODK)-based electronic [[Survey Pilot|survey]] kits, including [https://www.surveycto.com/ SurveyCTO], [[SurveyCTO Form Settings|survey forms]] (or [[Questionnaire Programming|questionnaires]]) are typically [[SurveyCTO Programming#Programming in Excel|built in Excel]] using a specialized structured syntax. Before the [[Impact Evaluation Team|research team]] starts with [[Preparing for Field Data Collection|field data collection]], they can use <code>ietestform</code> to test ODK-based electronic '''survey forms''' for common errors, as well as for [[SurveyCTO Coding Practices | best practices]] for '''SurveyCTO-based forms'''.

For example, the [[SurveyCTO Server Management|SurveyCTO server]] has a built-in feature that tests the ODK syntax of a form when it is uploaded by the '''research team'''. <code>ietestform</code> complements these built-in tests to ensure that the collected data is in a format that is easily readable in [[Stata Coding Practices|Stata]], and warns users about practices that we have learnt are prone to data quality errors. Therefore, the <code>ietestform</code> command should be used after testing the '''survey form''' on a '''SurveyCTO server''' to make sure there are no syntax errors.
== Syntax ==
The basic syntax for <code>ietestform</code> is as follows:
<syntaxhighlight lang="Stata">ietestform ///
    , surveyform("filename.xlsx") ///
    report("report.csv")</syntaxhighlight>

The <code>ietestform</code> command generates a report in '''.csv''' format. The report flags errors in coding, as well as practices that are not strictly wrong but may indicate bad practice and therefore need a manual review. The report generated by <code>ietestform</code> can be displayed in a number of software applications, and can also be used with collaboration tools like [https://github.com/ GitHub].

If you think that the command incorrectly flagged issues in your [[SurveyCTO Form Settings|SurveyCTO form]], please report the case [https://github.com/worldbank/iefieldkit/issues here] to help [https://www.worldbank.org/en/research/dime/data-and-analytics DIME Analytics] improve the command. Refer to the following sections for a detailed explanation of the tests performed by <code>ietestform</code>. These tests are meant to flag errors that may interrupt [[Preparing for Field Data Collection|field work]]. Note that <code>ietestform</code> should be used only after the '''form''' has passed the ODK syntax checks on the [[SurveyCTO Server Management|SurveyCTO server]].
== Required Column ==
Required fields ensure that the [[Enumerator Training|enumerators]] cannot proceed without entering a response to a particular field (each question is a field). This prevents submissions of incomplete forms, and helps ensure that '''enumerators''' complete forms in the right order. A field is required if it has the "Yes" value in the ''required'' column.

It is common that respondents do not have an answer, or do not want to share an answer, to a question, but a missing value should never be used to represent such non-answers. Instead, the [[Questionnaire Programming|questionnaire]] should allow non-answers, for example, "I do not know" or "Decline to answer", as valid answers. Therefore, almost all fields should be required in an ODK [[Survey Pilot|survey]] while still being able to handle non-answers.

Note that only field types that are displayed when filling in the form are affected by this value. For example, fields like ''begin_group'', ''end_repeat'', and ''text_audit'' are not displayed while filling in the form, so tests related to the ''required'' column ignore these fields.

<code>ietestform</code> runs two tests related to the ''required'' column, depending on whether a field is of the note type or not. Fields of the note type are those for which the '''enumerator''' does not have to enter any input. Instead, the '''enumerator''' only needs to read out a specific text note.
=== Non-note fields: ''required'' ===
<code>ietestform</code> tests to make sure that all fields that are not of the note type have the value "Yes" in the ''required'' column, that is, they are required. The final report then lists all fields that are not of the note type and are not required.

Even when some type of non-response by a respondent, such as "Declined to answer", is acceptable, there should always be a valid method to record the reason for no response. The [[Enumerator Training|enumerator]] should not leave the input field empty in this case. The absence of a recorded answer should only mean that the '''enumerator''' did not ask the question during the [[Survey Pilot|survey]]. In cases where it is acceptable to skip a question, you should use an appropriate relevance condition.

Fields that record GPS coordinates, for instance, are among the fields that may intentionally have a "No" value under the ''required'' column. Such fields often have their type as '''geopoint''', '''geoshape''', or '''geotrace'''. If you know that you will have no problem collecting GPS coordinates, then you should have a "Yes" value in the ''required'' column to ensure that you get valid data points.

However, if GPS coordinates are difficult to collect, then it might be a good idea to not have a "Yes" value under the ''required'' column. This will allow the '''enumerator''' to complete the other fields and submit the '''survey''' even if it is not possible to record GPS coordinates. In this case, <code>ietestform</code> will still report these fields, but you can proceed with the '''survey''' if leaving them optional was a deliberate decision.
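The logic of this test can be illustrated with a minimal sketch. It does not reproduce the internal code of <code>ietestform</code>; it only applies the rule described above to a toy stand-in for the ''survey'' sheet. The field names, and the list of field types treated as never displayed, are assumptions made for illustration.

<syntaxhighlight lang="Stata">
* Toy stand-in for the survey sheet (hypothetical field names)
clear
input str15 type str12 name str3 required
"text"        "resp_name" "yes"
"integer"     "resp_age"  ""
"geopoint"    "hh_gps"    ""
"note"        "intro"     ""
"begin_group" "consent"   ""
end

* Field types that are never displayed are ignored by the test
generate byte shown = !inlist(type, "begin_group", "end_group", "begin_repeat", "end_repeat", "text_audit")

* Flag displayed, non-note fields that are not required
generate byte flag = shown & type != "note" & lower(required) != "yes"
list type name if flag
</syntaxhighlight>

Here the ''integer'' and ''geopoint'' fields are listed. The GPS field is the kind of case that may be left optional on purpose after review.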
=== Note fields: not ''required'' ===
While fields of the note type can have a "Yes" value in the ''required'' column, they cannot record an input. Therefore, if an [[Enumerator Training|enumerator]] comes across such a field during a live [[Survey Pilot|survey]], they cannot move past this field. In this case, there is no way to continue with the interview, and the '''enumerator''' will not be able to submit the data already [[Primary Data Collection|collected]] from previous questions. <code>ietestform</code> therefore reports a list of all fields that are of the note type and have a "Yes" value in the ''required'' column.

Remember that there are cases in which required note fields may be useful. Since '''enumerators''' cannot move past these fields, you may use them with a relevance condition so that these fields show up if an earlier entry in the form is incorrect. This will force the '''enumerator''' to go back and correct the error before continuing with the interview.

For example, '''enumerators''' often enter respondent IDs twice to make sure there is no typo in the ID. You may name the two entry fields ''id1'' and ''id2''. Then you can follow these fields with a ''required'' note field which has the relevance expression <code>${id1} != ${id2}</code>. In this case, the note type field will only appear if the two entries are not identical. You can use the note text to inform the '''enumerator''' that the two ID fields are not identical, and that the '''enumerator''' must go back and change the values in order to continue.
== Matching begin_ and end_ ==
The <code>ietestform</code> command checks that all ''begin_group'' fields are matched by an ''end_group'', and that all ''begin_repeat'' fields are matched by an ''end_repeat''. While the ODK syntax tester on the [[SurveyCTO Server Management|SurveyCTO server]] also tests for matching ''begin_'' and ''end_'' values, the <code>ietestform</code> command provides additional information that makes it faster and easier to solve this problem, especially when the [[SurveyCTO Form Settings|survey form]] (or [[Questionnaire Design|questionnaire]]) is very large.

For example, [https://opendatakit.org/ ODK] does not require the ''end_group'' and ''end_repeat'' fields to have field names (''begin_group'' and ''begin_repeat'' are required to have names). This makes it difficult to identify where the error is in the underlying '''survey''' form. <code>ietestform</code> fills that gap because it also requires that ''end_group'' and ''end_repeat'' fields have names, and that these names match the corresponding ''begin_group'' and ''begin_repeat'' field. <code>ietestform</code> lists these missing names in the report, along with the row number (in the Excel form) of other non-valid ''begin_'' and ''end_'' pairs.

For a ''begin_'' and ''end_'' pair to be considered valid by <code>ietestform</code>, the following three criteria must be met:
# For each ''begin_'' field, there must be an ''end_'' field.
# The corresponding ''end_'' field must be of the correct type. That is, a ''begin_group'' should not be closed by an ''end_repeat'', and a ''begin_repeat'' should not be closed by an ''end_group''.
# The names of the ''end_'' fields must match the names of ''begin_'' fields. The [[SurveyCTO Server Management|SurveyCTO server]] already tests to make sure that the ''begin_'' names are unique, so each ''begin_'' and ''end_'' pair will also be unique if this condition is met.
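The three criteria above can be checked with a simple stack of open ''begin_'' fields. The following is a rough sketch of that idea on a toy form, not the implementation used by <code>ietestform</code>; the field names are hypothetical, and the two ''end_'' rows are deliberately given the wrong type.

<syntaxhighlight lang="Stata">
* Toy survey sheet in which the two end_ rows have swapped types
clear
input str15 type str10 name
"begin_group"  "hh"
"begin_repeat" "member"
"text"         "mname"
"end_group"    "member"
"end_repeat"   "hh"
end

* The stack holds the open begin_ fields, most recent first
local stack ""
local problems 0
forvalues i = 1/`=_N' {
    local t = type[`i']
    local n = name[`i']
    if inlist("`t'", "begin_group", "begin_repeat") {
        local stack "`t':`n' `stack'"
    }
    else if inlist("`t'", "end_group", "end_repeat") {
        * An end_ row must close the most recent begin_ row of the same kind and name
        local top : word 1 of `stack'
        local want = subinstr("`t'", "end_", "begin_", 1) + ":`n'"
        if "`top'" != "`want'" {
            display "row `i': `t' `n' does not close `top'"
            local ++problems
        }
        local stack : list stack - top
    }
}
* Anything left on the stack was never closed
if "`stack'" != "" {
    display "never closed: `stack'"
}
</syntaxhighlight>

In this toy form both ''end_'' rows are reported, because each has the right name but the wrong type.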
== Naming and Labeling ==
[https://opendatakit.org/ ODK] applies very few restrictions to field names and other inputs. Therefore, [[Master Dataset|datasets]] created in ODK often contain '''variable''' names and labels that are not valid in [[Stata Coding Practices|Stata]] and will cause an error when the '''dataset''' is imported into '''Stata'''. For example, ODK only requires that all '''variable''' names be unique, and does not allow the use of a few special characters. The ODK syntax test on the [[SurveyCTO Server Management|SurveyCTO server]] tests for only these restrictions. <code>ietestform</code> performs some additional tests which ensure that the '''datasets''' are valid, and optimized for being imported into '''Stata'''.
=== Stata-specific labels ===
<code>ietestform</code> returns a flag if your [[SurveyCTO Form Settings|survey form]] is not [[Questionnaire Programming|programmed]] to display [[Stata Coding Practices|Stata]]-specific labels.

In SurveyCTO, for instance, you can [[SurveyCTO Programming|program]] your '''form''' to display questions in multiple languages. This is done by creating label columns named ''label:english'', ''label:swahili'', ''label:hindi'', and so on. You can then choose which language to use for labels when exporting the [[Master Dataset|dataset]] to '''Stata''' from SurveyCTO.

You can use the same feature to create '''Stata'''-specific labels, by adding a label language called ''label:stata''. You can also add and modify labels after importing the '''dataset''' to '''Stata'''. However, this is the simplest way to add '''Stata'''-specific labels. If this practice is not used, the '''dataset''' may end up being incorrectly labeled, or require labor-intensive re-labeling after importing to '''Stata'''. <code>ietestform</code> applies the same test on the ''choices'' sheet as well, to ensure that all labels in the ''choices'' sheet are optimized for importing into '''Stata'''.
=== Length of variable labels ===
In [[Stata Coding Practices|Stata]], there is a restriction on the length of '''variable''' labels. '''Variable''' labels in '''Stata''' cannot be longer than 80 characters, and '''Stata''' truncates '''variable''' labels that are longer. <code>ietestform</code> checks for this by listing all fields with entries in the '''Stata''' ''label'' column that are longer than 80 characters.

=== Length of variable names ===
Similarly, [[Stata Coding Practices|Stata]] also restricts the length of '''variable''' names to 32 characters. If the name is longer than that, '''Stata''' will either truncate the name, or replace the name with generic names like ''var1'', ''var2'', etc. if the truncated name is no longer unique. While you can make these changes in '''Stata''' as well, it is much easier to solve these issues before starting with the [[Primary Data Collection|data collection]]. <code>ietestform</code> therefore flags all fields with '''variable''' names longer than 32 characters.
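As a minimal illustration of the two length limits (80 characters for labels, 32 for names), the sketch below measures both on a toy sheet. The variable names ''name'' and ''label_stata'' and the two example rows are hypothetical, and this is not the code that <code>ietestform</code> itself runs.

<syntaxhighlight lang="Stata">
* Toy sheet with one name and one label that exceed the Stata limits
clear
input str60 name str120 label_stata
"hh_size" "Number of household members"
"total_value_of_all_livestock_owned_by_household" "Total value of all livestock owned by the household in the last 12 months (local currency)"
end

* strlen() counts bytes; ustrlen() is an alternative for labels with non-ASCII characters
generate int name_len  = strlen(name)
generate int label_len = strlen(label_stata)

list name name_len if name_len > 32
list name label_len if label_len > 80
</syntaxhighlight>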
=== Length of field names in repeat groups ===
With respect to field names in [[Repeat Groups and Rosters in SurveyCTO|repeat groups]], <code>ietestform</code> lists two kinds of fields in the report. Firstly, it lists fields in '''repeat groups''' that have names that will be too long in the wide format after [[Exporting Analysis|exporting]] to [[Stata Coding Practices|Stata]]. Secondly, it lists fields in '''repeat groups''' for which the risk of having names that are too long is high, but not certain.

It is important to remember that when you use the SurveyCTO-generated '''Stata''' '''do-file''', or export a [[Master Dataset|dataset]] in wide format, a suffix is automatically added to the '''variable''' names that are created inside '''repeat groups'''. For example, if a group of questions is repeated three times, the wide version of the resulting '''dataset''' will contain three '''variables''' for each question in the '''repeat group'''. Each of these three '''variables''' will have the same name, followed by 1, 2 and 3; that is, ''varname_1'', ''varname_2'', and ''varname_3''. Therefore, '''variables''' created inside a single '''repeat group''' should not have a name that is longer than 30 characters, so that the final length is not longer than 32 characters.

Similarly, if the field is in a nested '''repeat group''' (a '''repeat group''' inside another one), a suffix will be added once for each group. In this case, the actual restriction on the length that will be used by <code>ietestform</code> is given by this formula:
* 32 − (2 × depth of nested repeats)

In this case, <code>ietestform</code> will list all '''variables''' that have names longer than the number given by this formula.

However, these restrictions assume that each '''repeat group''' is repeated no more than 9 times. If there were more than 9 repetitions, the suffixes would be 10, 11, etc., which take up three characters including the underscore. For example, for the 10th repetition of a '''repeat group''', the '''variable''' name would be suffixed as ''varname_10''. In this case, <code>ietestform</code> lists all fields with names that are longer than
* 32 − (3 × depth of nested repeats).

This is an example of the second test, since it is uncertain whether this will create an issue with names that are too long. However, if you think that field names are so long that they might be reported by this test, you may consider reducing the length of the field names.
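To make the two formulas concrete, the short loop below evaluates them for nesting depths 1 to 3. It is only a worked calculation; the local macro names are chosen for illustration.

<syntaxhighlight lang="Stata">
* Maximum field-name lengths implied by the two formulas
forvalues depth = 1/3 {
    * One-digit suffixes (_1 to _9) use 2 characters per level
    local certain  = 32 - 2 * `depth'
    * Two-digit suffixes (_10 and above) use 3 characters per level
    local cautious = 32 - 3 * `depth'
    display "depth `depth': longer than `certain' is too long; longer than `cautious' is at risk"
}
</syntaxhighlight>

For a single '''repeat group''' this gives 30 and 29 characters, for two nested levels 28 and 26, and for three nested levels 26 and 23.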
===Repeat group naming conflicts ===
<code>ietestform</code> also flags name conflicts that could result from repeat suffixes (like ''_1'', ''_2'') that are added to field names inside a [[Repeat Groups and Rosters in SurveyCTO|repeat group]]. The ODK syntax test in SurveyCTO checks whether field names are unique. For example, the names ''myvar'' and ''myvar_1'' are both unique according to the ODK syntax test. But if ''myvar'' appears as a '''variable''' in a '''repeat group''', it will appear with a repeat suffix as ''myvar_1'' for the first repetition of the '''repeat group'''. This will then create a name conflict with the '''variable''' named ''myvar_1'' which lies outside the '''repeat group'''.

In such cases, <code>ietestform</code> flags all '''variables''' inside a '''repeat group''' that could possibly create such a naming conflict. For example, if there is a '''variable''' with the name ''myvar'', the command checks if there are any other '''variable''' names with the format ''myvar_#'', where ''#'' is one or more digits. Similarly, if the '''variable''' ''myvar'' is in a nested '''repeat group''' (a '''repeat group''' inside another one), then <code>ietestform</code> checks for ''myvar_#'', ''myvar_#_#'' and so on.

'''Note:''' If the '''variables''' ''myvar'' and ''myvar_1'' are both in non-nested '''repeat groups''', there will be no naming conflicts. In this case, the repeat suffixes will generate ''myvar_1'' and ''myvar_1_1''. However, <code>ietestform</code> will still list these fields, as it may not be clear to someone going through the [[Master Dataset|dataset]] that ''myvar_1'' is from the field ''myvar'', and not from ''myvar_1''.
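The pattern described here (a field name followed by one or more numeric suffixes) can be expressed as a regular expression. The sketch below applies it to a toy list of fields; the indicator ''in_repeat'' and the field names are assumptions for illustration, and the actual checks in <code>ietestform</code> may differ in detail.

<syntaxhighlight lang="Stata">
* Toy list of fields; in_repeat marks fields inside a repeat group
clear
input str12 name byte in_repeat
"myvar"   1
"myvar_1" 0
"myvar_x" 0
"other"   0
end

* For each repeat field, flag names of the form field_#, field_#_#, ...
generate byte conflict = 0
levelsof name if in_repeat, local(repeat_fields)
foreach f of local repeat_fields {
    replace conflict = 1 if regexm(name, "^`f'(_[0-9]+)+$")
}
list name if conflict
</syntaxhighlight>

Only ''myvar_1'' is flagged. ''myvar_x'' is not, because its suffix is not numeric.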
=== Leading and trailing spaces ===
<code>ietestform</code> also reports any fields that have leading (" ABC") or trailing ("ABC ") spaces, as these can cause unexpected problems. For example, consider a list in the ''choices'' sheet that is meant to be called "village", but what is actually written is "village ". In Excel you will not see this extra space unless you look closely. While some tools will treat this as "village", others might treat it as "village ", which is not the same. <code>ietestform</code> will flag these fields so you can prevent such errors.
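A simple way to see this kind of problem in '''Stata''' is to compare a value with its trimmed version. The example below is a sketch on made-up data, with ''list_name'' as a hypothetical column name.

<syntaxhighlight lang="Stata">
* Two entries that look identical in Excel but differ by a trailing space
clear
set obs 2
generate str12 list_name = "village"
replace list_name = "village " in 2

* A value with leading or trailing spaces differs from its trimmed version
generate byte has_space = list_name != trim(list_name)
list if has_space
</syntaxhighlight>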
== Choice Lists ==
<code>ietestform</code> tests also deal with [[SurveyCTO Choice Lists|choice lists]], that is, lists that are created for ''select_one'' and ''select_multiple'' types of fields in the ''choices'' sheet in Excel. The ''choices'' sheet lists all response labels in a separate Excel sheet, along with corresponding integer values. The ODK syntax is very lenient when it comes to '''choice lists''', which are then translated into value labels in [[Stata Coding Practices|Stata]]. This can lead to errors such as typographical errors, missing values, and [[Duplicates and Survey Logs|duplicate values]], which affect the [[Master Dataset|datasets]] imported into '''Stata'''. <code>ietestform</code> flags issues like these that can arise due to coding errors in ODK-based platforms. For example, unused '''choice lists''' and '''duplicate''' labels could mean that the person [[SurveyCTO Coding Practices|coding]] the [[Survey Pilot|survey]] copied and pasted the elements of a list incompletely or incorrectly.
=== Numeric value and name ===
[[Stata Coding Practices|Stata]] usually stores categorical data by assigning integer (numeric) values to string (alphabetical) labels. For example, this means assigning a value of "2" to "Yes", "1" to "No", and "0" to "Declined to answer".

Although SurveyCTO allows string values for questions that have categorical responses, we recommend using integer values instead. This is because string values take up more memory, especially when importing large [[Master Dataset|datasets]], and many '''Stata''' functions that deal with categorical '''variables''' cannot handle string values. <code>ietestform</code> therefore reports all list items that have a non-numeric value in the ''value'' or ''name'' column.
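One way to spot non-numeric codes in '''Stata''' is to convert them with <code>real()</code>, which returns a missing value for text that is not a number. The following is an illustrative sketch on a toy ''choices'' sheet with hypothetical entries, not the command's own code.

<syntaxhighlight lang="Stata">
* Toy choices sheet in which one code is text rather than a number
clear
input str10 list_name str5 value str20 label
"yesno" "1"  "Yes"
"yesno" "0"  "No"
"yesno" "dk" "Do not know"
end

* real() returns missing when the string cannot be read as a number
generate byte non_numeric = missing(real(value))
list list_name value label if non_numeric
</syntaxhighlight>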
===Unused choice lists===
<code>ietestform</code> checks that all [[SurveyCTO Choice Lists|choice lists]] defined in the ''choices'' sheet are actually used in at least one ''select_one'' or ''select_multiple'' field in the [[Survey Pilot|survey]] sheet. While it is not incorrect to have some lists that are unused, it could still be a sign of '''choice lists''' that are not in sync with an updated version of the [[SurveyCTO Form Settings|survey form]]. In such cases, unused '''choice lists''' can cause errors, or contain items that will not be displayed during the '''survey'''.

For example, imagine you have 10 villages in a '''choice list''' called ''village'', but you incorrectly type ''vilage'' for one of them. Then, according to ODK syntax, you will have two lists: one called ''village'' with 9 items, and one called ''vilage'' with 1 item. In this case, it is likely that there are no ''select_one'' or ''select_multiple'' fields that use the '''choice list''' called ''vilage'', so <code>ietestform</code> is a good way to spot a typographical error like this.
=== Duplicate value and label ===
<code>ietestform</code> makes sure that there are no [[Duplicates and Survey Logs|duplicates]] in the names given to individual items in a [[SurveyCTO Choice Lists|choice list]] and the codes (under the ''value'' column) assigned to each item in the ''choices'' sheet. This test will list all items of the '''choice list''' that have the same two values under the ''name'' and ''value'' columns.

<code>ietestform</code> also makes sure that there is only one label in a given '''choice list''' for a given code. This test lists all list items that have the same two values in the ''name'' and ''label'' columns. For example, suppose that for the '''choice list''' called "village", "Village A" and "Village B" both have the same code, that is, "1", under the ''value'' column. Then <code>ietestform</code> will list both "Village A" and "Village B" along with the name of the '''choice list''', that is, "village".
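The "Village A" and "Village B" case can be reproduced with Stata's built-in <code>duplicates tag</code> command. The column names below are hypothetical, and this is only a sketch of the idea rather than the check used inside <code>ietestform</code>.

<syntaxhighlight lang="Stata">
* Toy choice list in which two labels share the code 1
clear
input str10 list_name byte value str12 label
"village" 1 "Village A"
"village" 1 "Village B"
"village" 2 "Village C"
end

* Count how many other items share the same list and code
duplicates tag list_name value, generate(dup_value)
list list_name value label if dup_value > 0
</syntaxhighlight>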
===Missing label, value, or name ===
In the first part of this test, <code>ietestform</code> lists all items in a [[SurveyCTO Choice Lists|choice list]] that have an entry under the ''label'' column, but have nothing under the ''value'' or ''name'' column. In the second part of the test, it also lists cases where the exact opposite occurs. This can sometimes happen when the [[SurveyCTO Form Settings|survey form]] is [[Questionnaire Programming|programmed]] in multiple languages, or when the [[SurveyCTO Coding Practices|coding is incomplete]].
== Outdated Syntax ==
SurveyCTO periodically updates the syntax of its [https://docs.surveycto.com/02-designing-forms/01-core-concepts/09.expressions.html expressions], and newer versions tend to offer more advanced features than previous versions of the syntax. It is recommended to use the latest syntax to ensure full functionality of the expression and avoid potential issues.

<code>ietestform</code> tests to make sure the latest syntax is being used in the survey form. This includes the following cases:
# when the outdated syntax of ''position()'' is being used instead of ''index()''
# when the outdated syntax of ''jr:choice-name()'' is being used instead of ''choice-label()''
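A rough way to search a form for these two outdated functions is a plain substring search, as sketched below. The column name ''calculation'' and the simplified expressions are made up for illustration; real expressions would contain full field references.

<syntaxhighlight lang="Stata">
* Toy rows with simplified expressions (not complete SurveyCTO syntax)
clear
input str12 name str30 calculation
"idx_old" "position(..)"
"idx_new" "index()"
"lab_old" "jr:choice-name(crop)"
end

* strpos() returns 0 when the substring is not found
generate byte outdated = strpos(calculation, "position(") > 0 | strpos(calculation, "jr:choice-name(") > 0
list name calculation if outdated
</syntaxhighlight>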
== Encryption ==
[https://dimewiki.worldbank.org/wiki/Encryption Encryption] of [[SurveyCTO Form Settings|survey forms]] is an integral part of reducing the risk of exposing confidential or [[Personally Identifying Information (PII)|personally identifiable data]]. You can learn how to [https://dimewiki.worldbank.org/wiki/Encryption#Encryption_with_SurveyCTO_Data encrypt your form] on SurveyCTO [https://github.com/worldbank/dime-standards/blob/master/dime-research-standards/pillar-4-data-security/data-security-resources/surveycto-encryption-guidelines.md here].

== Related Pages ==
Latest revision as of 22:20, 15 August 2023
DIME Analytics has created iefieldkit
as a package in Stata to support the process of primary data collection from start to finish. In most cases, third party survey firms or local partners collect data on behalf of the research team. Therefore, data quality assurance is a particularly important aspect of data collection. ietestform
allows the research team to test Open Data Kit (ODK)-based electronic survey forms for common errors, as well as best practices for SurveyCTO-based Form forms before field data collection starts. For example, the SurveyCTO server has a built-in test feature that tests the ODK syntax of a form when it is uploaded by the research team. ietestform
complements these built-in tests to ensure that the collected data is in a format that is easily readable in Stata, and warns users who use practices we have learnt are prone to data quality errors.
Read First
- Please refer to Stata coding practices for coding best practices in Stata.
ietestform
is part of the packageiefieldkit
, which has been developed by DIME Analytics.- To install
ietestform
, as well as other commands in theiefieldkit
package, typessc install iefieldkit
in Stata. - For instructions and available options, type
help ietestform
.
Overview
In Open Data Kit (ODK)-based electronic survey kits, including SurveyCTO, survey forms (or questionnaires) are typically built in Excel using a specialized structured syntax. Before the research team starts with field data collection, they can use ietestform
to test ODK-based electronic survey forms for common errors, as well as best practices for SurveyCTO-based forms.
For example, the SurveyCTO server has a built-in feature that tests the ODK syntax of a form when it is uploaded by the research team. ietestform
complements these built-in tests to ensure that the collected data is in a format that is easily readable in Stata, and warns users who use practices we have learnt are prone to data quality errors. Therefore, the ietestform
command should be used after testing the survey form on a SurveyCTO server to make sure there are no syntax errors.
Syntax
The basic syntax for ietestform
is as follows:
ietestform
, surveyform("filename.xlsx")
report("report.csv")
The ietestform command generates a report in .csv format. The report flags coding errors, as well as practices that are not strictly wrong but may indicate bad practice and therefore need a manual review. The report generated by ietestform can be opened in a number of software applications, and can also be used with collaboration tools like GitHub.
If you think that the command incorrectly flagged issues in your SurveyCTO form, please report the case to DIME Analytics to help improve the command. Refer to the following sections for a detailed explanation of the tests performed by ietestform. These tests are meant to flag errors that may interrupt field work. Note that ietestform should be used only after the form has passed the ODK syntax checks on the SurveyCTO server.
Required Column
Required fields ensure that the enumerators cannot proceed without entering a response to a particular field (each question is a field). This prevents submissions of incomplete forms, and helps ensure that enumerators complete forms in the right order. A field is required if it has the "Yes" value in the required column.
It is common that respondents do not have an answer or do not want to share an answer to a question, but a missing value should never be used to represent such non-answers. Instead, the questionnaire should allow non-answers, for example, "I do not know" or "Decline to answer" as valid answers. Therefore, almost all fields should be required in an ODK survey while still being able to handle non-answers.
Note that only field types that show up when filling the form are affected by that value. For example, fields like begin_group, end_repeat, and text_audit do not show up while filling the form, so the tests related to the required column ignore these fields.
ietestform runs two tests related to the required column, depending on whether a field is of the note type or not. Fields of the note type are those for which the enumerator does not have to enter any input. Instead, the enumerator only needs to read out a specific text note.
Non-note fields: required
ietestform tests that all fields that are not of the note type have the value "Yes" in the required column, that is, that they are required. The final report then lists all fields that are not of the note type and are not required.
Even when some type of non-response by a respondent, such as “Declined to answer”, is acceptable, there should always be a valid method to record the reason for no response. The enumerator should not leave the input field empty in this case. The absence of a recorded answer should only mean that the enumerator did not ask the question during the survey. In cases where it is acceptable to skip a question, you should use an appropriate relevance condition.
Fields that record GPS coordinates, for instance, are among the fields that may intentionally have a "No" value under the required column. Such fields often have the type geopoint, geoshape, or geotrace. If you know that you will have no problem collecting GPS coordinates, then you should have a "Yes" value in the required column to ensure that you get valid data points.
However, if GPS coordinates are difficult to collect, then it might be a good idea to not have a "Yes" value under the required column. This will allow the enumerator to complete the other fields and submit the survey even if it is not possible to record GPS coordinates. In this case, ietestform will still report these fields, but you can proceed with the survey if this was a deliberate decision that you are happy with.
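The logic of this test can be sketched in a few lines. The snippet below is an illustrative Python sketch, not the Stata code behind ietestform. It assumes that each row of the survey sheet has been read into a dictionary with the keys type, name, and required, and the set of hidden field types is a hypothetical short list based on the examples above.

```python
# Field types that never appear on screen while filling the form.
# Assumption: a short illustrative list; the real command may skip more types.
HIDDEN_TYPES = {"begin_group", "end_group", "begin_repeat", "end_repeat", "text_audit"}


def non_note_not_required(survey_rows):
    """Return names of visible, non-note fields whose required column is not "yes"."""
    flagged = []
    for row in survey_rows:
        field_type = row.get("type", "").strip()
        if field_type in HIDDEN_TYPES or field_type == "note":
            continue
        if row.get("required", "").strip().lower() != "yes":
            flagged.append(row["name"])
    return flagged


survey = [
    {"type": "begin_group", "name": "hh", "required": ""},
    {"type": "integer", "name": "age", "required": "yes"},
    {"type": "geopoint", "name": "gps", "required": ""},
    {"type": "note", "name": "intro", "required": ""},
    {"type": "end_group", "name": "hh", "required": ""},
]
print(non_note_not_required(survey))  # ['gps']
```

Here the geopoint field is listed even though leaving it optional may be a deliberate choice, which mirrors how the report is meant to prompt a manual review rather than force a change.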
Note fields: not required
While fields of the note type can have a "Yes" value in the required column, they cannot record an input. Therefore, if an enumerator comes across such a field during a live survey, they cannot move past this field. In this case, there is no way to continue with the interview, and the enumerator will not be able to submit the data already collected from previous questions. ietestform therefore reports a list of all fields that are of the note type and have a "Yes" value in the required column.
Remember that there are cases in which required note fields may be useful. Since enumerators cannot move past these fields, you may use them with a relevance condition so that these fields show up only if an earlier entry in the form is incorrect. This will force the enumerator to go back and correct the error before continuing with the interview.
For example, enumerators often enter respondent IDs twice to make sure there is no typo in the ID. You may name the two entry fields id1 and id2. Then you can follow these fields with a required note field which has the relevance expression ${id1} != ${id2}. In this case, the note type field will only appear if the two entries are not identical. You can use the note text to inform the enumerator that the two ID fields are not identical, and that the enumerator must go back and change the values in order to continue.
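The double-entry pattern above can be written out as form rows to see what the test would report. This is a hedged Python sketch with rows held as dictionaries; the field names id_mismatch and welcome are hypothetical, and the status text is only a reading aid, since the text above says that ietestform lists every required note regardless of its relevance condition.

```python
def required_notes(survey_rows):
    """Return (name, relevance) for every note field marked as required."""
    flagged = []
    for row in survey_rows:
        is_note = row.get("type", "").strip() == "note"
        is_required = row.get("required", "").strip().lower() == "yes"
        if is_note and is_required:
            flagged.append((row["name"], row.get("relevance", "").strip()))
    return flagged


survey = [
    {"type": "text", "name": "id1", "required": "yes", "relevance": ""},
    {"type": "text", "name": "id2", "required": "yes", "relevance": ""},
    # Hypothetical note that only appears when the two ID entries differ.
    {"type": "note", "name": "id_mismatch", "required": "yes",
     "relevance": "${id1} != ${id2}"},
    # Hypothetical note with no relevance condition: it would block the form.
    {"type": "note", "name": "welcome", "required": "yes", "relevance": ""},
]

for name, relevance in required_notes(survey):
    status = "conditional, possibly intentional" if relevance else "always blocks the form"
    print(name, "->", status)
```

Reviewing the relevance column next to each flagged note is a quick way to separate deliberate checks from mistakes.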
Matching begin_ and end_
The ietestform command checks that all begin_group fields are matched by an end_group, and that all begin_repeat fields are matched by an end_repeat. While the ODK syntax tester on the SurveyCTO server also tests for matching begin_ and end_ values, the ietestform command provides additional information that makes it faster and easier to solve this problem, especially when the survey form (or questionnaire) is very large.
For example, ODK does not require end_group and end_repeat fields to have field names (begin_group and begin_repeat are required to have names). This makes it difficult to identify where the error is in the underlying survey form. ietestform fills that gap by also requiring that end_group and end_repeat fields have names, and that these names match the corresponding begin_group and begin_repeat fields. ietestform lists these missing names in the report, along with the row number (in the Excel form) of other non-valid begin_ and end_ pairs.
For a begin_ and end_ pair to be considered valid by ietestform, the following three criteria must be met:
- For each begin_ field, there must be an end_ field.
- The corresponding end_ field must be of the correct type. That is, a begin_group should not be closed by an end_repeat, and a begin_repeat should not be closed by an end_group.
- The names of the end_ fields must match the names of the begin_ fields. The SurveyCTO server already tests to make sure that the begin_ names are unique, so each begin_ and end_ pair will also be unique if this condition is met.
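The three criteria above amount to a bracket-matching check, which is commonly done with a stack. The following is a minimal Python sketch under that assumption; it is not how ietestform is implemented in Stata. Rows are dictionaries with type and name keys, and row numbers assume that row 1 of the Excel sheet holds the column headers.

```python
def check_begin_end(survey_rows):
    """Return (row_number, message) tuples for invalid begin_/end_ pairs."""
    problems = []
    open_stack = []  # (kind, name, row_number) of begin_ fields not yet closed
    for row_number, row in enumerate(survey_rows, start=2):
        field_type = row.get("type", "").strip()
        name = row.get("name", "").strip()
        if field_type in ("begin_group", "begin_repeat"):
            open_stack.append((field_type[len("begin_"):], name, row_number))
        elif field_type in ("end_group", "end_repeat"):
            kind = field_type[len("end_"):]
            if not open_stack:
                # Criterion 1: an end_ without any open begin_.
                problems.append((row_number, f"{field_type} has no matching begin_"))
                continue
            open_kind, open_name, open_row = open_stack.pop()
            if kind != open_kind:
                # Criterion 2: wrong type of end_ field.
                problems.append(
                    (row_number, f"{field_type} closes begin_{open_kind} from row {open_row}"))
            if not name:
                # Criterion 3: end_ field without a name.
                problems.append(
                    (row_number, f"{field_type} has no name (expected '{open_name}')"))
            elif name != open_name:
                problems.append(
                    (row_number, f"{field_type} name '{name}' does not match '{open_name}'"))
    for open_kind, open_name, open_row in open_stack:
        # Criterion 1: a begin_ that is never closed.
        problems.append((open_row, f"begin_{open_kind} '{open_name}' is never closed"))
    return problems


survey = [
    {"type": "begin_group", "name": "hh"},        # row 2
    {"type": "text", "name": "head_name"},        # row 3
    {"type": "begin_repeat", "name": "members"},  # row 4
    {"type": "integer", "name": "age"},           # row 5
    {"type": "end_repeat", "name": ""},           # row 6: missing name
    {"type": "end_repeat", "name": "hh"},         # row 7: wrong type
]
for row_number, message in check_begin_end(survey):
    print(row_number, message)
```

Reporting the Excel row number with each message is what makes such a check easier to act on in a long form.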
Naming and Labeling
ODK applies very few restrictions to field names and other inputs. Therefore, datasets created in ODK often contain variable names and labels that are not valid in Stata and will cause an error when the dataset is imported into Stata. For example, ODK only requires that all variable names be unique, and disallows a few special characters. The ODK syntax test on the SurveyCTO server tests for only these restrictions. ietestform performs some additional tests which ensure that the datasets are valid and optimized for being imported into Stata.
Stata-specific labels
ietestform returns a flag if your survey form is not programmed to display Stata-specific labels.
In SurveyCTO, for instance, you can program your form to display questions in multiple languages. This is done by creating label columns named label:english, label:swahili, label:hindi, and so on. You can then choose which language to use for labels when exporting the dataset to Stata from SurveyCTO.
You can use the same feature to create Stata-specific labels, by adding a label language called label:stata. You can also add and modify labels after importing the dataset to Stata. However, this is the simplest way to add Stata-specific labels. If this practice is not used, the dataset may end up being incorrectly labeled, or require labor-intensive re-labeling after importing to Stata. ietestform applies the same test to the choices sheet as well, to ensure that all labels in the choices sheet are optimized for importing into Stata.
Length of variable labels
In Stata, there is a restriction on the length of variable labels. Variable labels in Stata cannot be longer than 80 characters, and Stata truncates variable labels that are longer. ietestform checks for this by listing all fields with entries in the Stata label column (label:stata) that are longer than 80 characters.
Length of variable names
Similarly, Stata also restricts the length of variable names to 32 characters. If the name is longer than that, Stata will either truncate the name, or replace the name with generic names like var1, var2, etc. if the truncated name is no longer unique. While you can make these changes in Stata as well, it is much easier to solve these issues before starting with the data collection. ietestform therefore flags all fields with variable names longer than 32 characters.
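Both length limits are simple to check once the survey sheet is loaded. The snippet below is a small Python sketch of the two checks, not the command's own implementation; it assumes rows are dictionaries keyed by column name, and that the Stata labels live in a column called label:stata as described above.

```python
MAX_LABEL_LENGTH = 80  # Stata variable label limit stated in the text
MAX_NAME_LENGTH = 32   # Stata variable name limit stated in the text


def long_labels_and_names(survey_rows, label_column="label:stata"):
    """Return (fields with too-long labels, fields with too-long names)."""
    long_labels = [row["name"] for row in survey_rows
                   if len(row.get(label_column, "")) > MAX_LABEL_LENGTH]
    long_names = [row["name"] for row in survey_rows
                  if len(row["name"]) > MAX_NAME_LENGTH]
    return long_labels, long_names


survey = [
    {"name": "age", "label:stata": "Age of respondent in completed years"},
    {"name": "income", "label:stata": "x" * 81},  # 81 characters: too long
    {"name": "n" * 33, "label:stata": "Short"},   # 33 characters: too long
]
long_labels, long_names = long_labels_and_names(survey)
print(long_labels)      # ['income']
print(len(long_names))  # 1
```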
Length of field names in repeat groups
With respect to field names in repeat groups, ietestform lists two kinds of fields in the report. Firstly, it lists fields in repeat groups that have names that will be too long in the wide format after exporting to Stata. Secondly, it lists fields in repeat groups for which the risk of having names that are too long is high, but not certain.
It is important to remember that when you use the SurveyCTO-generated Stata do-file, or export a dataset in wide format, a suffix is automatically added to the names of variables that are created inside repeat groups. For example, if a group of questions is repeated three times, the wide version of the resulting dataset will contain three variables for each question in the repeat group. Each of these three variables will have the same name, followed by 1, 2 and 3; that is, varname_1, varname_2, and varname_3. Therefore, variables created inside a single repeat group should not have a name that is longer than 30 characters, so that the final length is not longer than 32 characters.
Similarly, if the field is in a nested repeat group (a repeat group inside another one), a suffix will be added once for each group. In this case, the actual restriction on the length that will be used by ietestform is given by this formula:
- 32 − (2 × depth of nested repeats)
In this case, ietestform will list all variables that have names longer than the number given by this formula.
However, these restrictions assume that no repeat group is repeated more than 9 times. If a group were repeated more than 9 times, the suffixes would be _10, _11, etc., which take up three characters. For example, for the 10th repetition of a repeat group, the variable name would be suffixed as varname_10. In this case, ietestform lists all fields with names that are longer than
- 32 − (3 × depth of nested repeats).
This is an example of the second kind of test, since it is uncertain whether this will create an issue with names that are too long. However, if you think that field names are so long that they might be reported by this test, you may consider reducing the length of the field names.
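The two thresholds above can be made concrete with a short worked example. The following Python sketch is an illustration, not the command's actual code; it tracks how many repeat groups are open at each row and compares each field name against 32 − 2 × depth (certain problem) and 32 − 3 × depth (possible problem). Rows are assumed to be dictionaries with type and name keys.

```python
STATA_MAX_NAME = 32


def repeat_name_lengths(survey_rows):
    """Return (certain, possible) lists of field names inside repeat groups."""
    certain, possible = [], []
    depth = 0  # number of repeat groups currently open
    for row in survey_rows:
        field_type = row["type"]
        if field_type == "begin_repeat":
            depth += 1
        elif field_type == "end_repeat":
            depth -= 1
        elif depth > 0 and not field_type.startswith(("begin_", "end_")):
            length = len(row["name"])
            if length > STATA_MAX_NAME - 2 * depth:
                certain.append(row["name"])   # too long even with one-digit suffixes
            elif length > STATA_MAX_NAME - 3 * depth:
                possible.append(row["name"])  # too long only with two-digit suffixes
    return certain, possible


survey = [
    {"type": "begin_repeat", "name": "members"},
    {"type": "integer", "name": "a" * 31},  # depth 1: 31 > 30, certain
    {"type": "integer", "name": "b" * 30},  # depth 1: 30 > 29, possible
    {"type": "integer", "name": "c" * 29},  # depth 1: fine
    {"type": "begin_repeat", "name": "jobs"},
    {"type": "text", "name": "d" * 27},     # depth 2: 27 <= 28 but 27 > 26, possible
    {"type": "end_repeat", "name": "jobs"},
    {"type": "end_repeat", "name": "members"},
]
certain, possible = repeat_name_lengths(survey)
print(len(certain), len(possible))  # 1 2
```

At depth 1 the limits are 30 and 29 characters, and at depth 2 they are 28 and 26, which is why the 27-character name in the nested group only lands in the second list.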
Repeat group naming conflicts
ietestform also flags name conflicts that could result from repeat suffixes (like _1, _2) that are added to field names inside a repeat group. The ODK syntax test in SurveyCTO checks whether field names are unique. For example, the names myvar and myvar_1 are both unique according to the ODK syntax test. But if myvar appears as a variable in a repeat group, it will appear with a repeat suffix as myvar_1 for the first repetition of the repeat group. This will then create a name conflict with the variable named myvar_1 which lies outside the repeat group.
In such cases, ietestform flags all variables inside a repeat group that could possibly create such a naming conflict. For example, if there is a variable with the name myvar, the command checks if there are any other variable names with the format myvar_#, where # is one or more digits. Similarly, if the variable myvar is in a nested repeat group (a repeat group inside another one), then ietestform checks for myvar_#, myvar_#_# and so on.
Note: If the variables myvar and myvar_1 are both in non-nested repeat groups, there will be no naming conflicts. In this case, the repeat suffixes will generate myvar_1 and myvar_1_1. However, ietestform will still list these fields, as it may not be clear to someone going through the dataset that myvar_1 is from the field myvar, and not from myvar_1.
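The myvar_# pattern described above maps naturally onto a regular expression. The sketch below is one possible Python rendering, offered as an assumption about how such a check could look rather than as the command's implementation. The repeat_fields argument is a hypothetical mapping from each field name inside a repeat group to its nesting depth.

```python
import re


def repeat_suffix_conflicts(all_names, repeat_fields):
    """Return (repeat_field, conflicting_name) pairs.

    repeat_fields maps a field name to its repeat nesting depth (1 or more).
    """
    conflicts = []
    for name, depth in repeat_fields.items():
        # Matches name_#, and for nested repeats name_#_#, up to `depth` suffixes.
        pattern = re.compile(rf"{re.escape(name)}(_\d+){{1,{depth}}}")
        for other in all_names:
            if other != name and pattern.fullmatch(other):
                conflicts.append((name, other))
    return conflicts


names = ["myvar", "myvar_1", "myvar_1_2", "other"]
print(repeat_suffix_conflicts(names, {"myvar": 1}))
# [('myvar', 'myvar_1')]
print(repeat_suffix_conflicts(names, {"myvar": 2}))
# [('myvar', 'myvar_1'), ('myvar', 'myvar_1_2')]
```

Like the test described in the note above, this sketch flags the pair even when both fields sit inside repeat groups, since the resulting names can still be confusing.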
Leading and trailing spaces
ietestform also reports any fields that have leading (" ABC") or trailing ("ABC ") spaces, as these can cause unexpected problems. For example, consider a list in the choices sheet that is meant to be called "village", but what is actually written is "village ". In Excel you will not see this extra space unless you look closely. While some tools will treat this as "village", others might treat it as "village ", which is not the same. ietestform will flag these fields so you can prevent such errors.
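A check of this kind only needs to compare each cell with its stripped version. The following Python sketch illustrates the idea on a small choices sheet; the row layout (dictionaries keyed by column name, with the header in Excel row 1) is an assumption made for the example.

```python
def padded_cells(rows):
    """Return (row_number, column, value) for cells with leading or trailing spaces."""
    flagged = []
    for row_number, row in enumerate(rows, start=2):
        for column, value in row.items():
            if isinstance(value, str) and value != value.strip():
                flagged.append((row_number, column, value))
    return flagged


choices = [
    {"list_name": "village", "value": "1", "label": "Village A"},
    {"list_name": "village ", "value": "2", "label": "Village B"},  # trailing space
]
print(padded_cells(choices))  # [(3, 'list_name', 'village ')]
```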
Choice Lists
ietestform tests also deal with choice lists, that is, lists that are created for select_one and select_multiple types of fields in the choices sheet in Excel. The choices sheet lists all response labels in a separate Excel sheet, along with corresponding integer values. The ODK syntax is very lenient when it comes to choice lists, which are then translated into value labels in Stata. This can lead to a lot of errors such as typographical errors, missing values, and duplicate values, which affect the datasets imported into Stata. ietestform flags issues like these that can arise due to coding errors in ODK-based platforms. For example, unused choice lists and duplicate labels could mean that the person coding the survey copied and pasted the elements of a list incompletely or incorrectly.
Numeric value and name
Stata usually stores categorical data by assigning integer (numeric) values to string (alphabetical) labels. For example, this means assigning a value of "2" to "Yes", "1" to "No", and "0" to "Declined to answer".
Although SurveyCTO allows string values for questions that have categorical responses, we recommend using integer values instead. This is because string values take up more memory, especially when importing large datasets, and many Stata functions that deal with categorical variables cannot handle strings. ietestform therefore reports all list items that have a non-numeric value in the value or name column.
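As an illustration, the check can be expressed as a pattern match on the value column. This is a hedged Python sketch, not the command's own logic; it assumes the choices sheet uses the column names list_name and value, treats only optionally negative whole numbers as numeric, and skips empty cells, which the missing-value test covers separately.

```python
import re

INTEGER = re.compile(r"-?\d+")


def non_numeric_values(choice_rows, value_column="value"):
    """Return (list_name, value) for choice items whose code is not an integer."""
    flagged = []
    for row in choice_rows:
        code = str(row.get(value_column, "")).strip()
        if code and not INTEGER.fullmatch(code):
            flagged.append((row["list_name"], code))
    return flagged


choices = [
    {"list_name": "yesno", "value": "1", "label": "Yes"},
    {"list_name": "yesno", "value": "0", "label": "No"},
    {"list_name": "yesno", "value": "-99", "label": "Declined to answer"},
    {"list_name": "yesno", "value": "dk", "label": "I do not know"},
]
print(non_numeric_values(choices))  # [('yesno', 'dk')]
```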
Unused choice lists
ietestform checks that all choice lists defined in the choices sheet are actually used in at least one select_one or select_multiple field in the survey sheet. While it is not incorrect to have some lists that are unused, it could still be a sign of choice lists that are not in sync with an updated version of the survey form. In such cases, unused choice lists can cause errors, or contain items that will not be displayed during the survey.
For example, imagine you have 10 villages in a choice list called village, but you incorrectly type vilage for one of them. Then, according to ODK syntax, you will have two lists: one called village with 9 items, and one called vilage with 1 item. In this case, it is likely that there are no select_one or select_multiple fields that use the choice list called vilage, so ietestform is a good way to spot a typographical error like this.
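The village and vilage example can be reproduced with a set difference between the lists defined in the choices sheet and the lists referenced in the survey sheet. The Python sketch below is a simplified illustration with assumed column names (type on the survey sheet, list_name on the choices sheet), not the command's actual code.

```python
def unused_choice_lists(survey_rows, choice_rows):
    """Return the choice lists that no select_one or select_multiple field uses."""
    used = set()
    for row in survey_rows:
        parts = row["type"].split()
        if len(parts) >= 2 and parts[0] in ("select_one", "select_multiple"):
            used.add(parts[1])
    defined = {row["list_name"] for row in choice_rows}
    return sorted(defined - used)


survey = [
    {"type": "select_one village", "name": "resp_village"},
    {"type": "integer", "name": "age"},
]
choices = [
    {"list_name": "village", "value": "1", "label": "Village A"},
    {"list_name": "village", "value": "2", "label": "Village B"},
    {"list_name": "vilage", "value": "3", "label": "Village C"},  # typo in list name
]
print(unused_choice_lists(survey, choices))  # ['vilage']
```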
Duplicate value and label
ietestform makes sure that there are no duplicates in the names given to individual items in a choice list and the codes (under the value column) assigned to each item in the choices sheet. This test will list all items of the choice list that have the same two values under the name and value columns.
ietestform also makes sure that there is only one label in a given choice list for a given code. This test lists all list items that have the same two values in the name and label columns. For example, suppose that for the choice list called "village", "Village A" and "Village B" both have the same code, that is, "1", under the code column. Then ietestform will list both "Village A" and "Village B" along with the name of the choice list, that is, "village".
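The village example above can be sketched by grouping choice items by list and code, and keeping the groups with more than one label. This Python snippet is only an illustration of that idea under assumed column names (list_name, value, label); it does not reproduce the command's exact rules.

```python
from collections import defaultdict


def duplicate_codes(choice_rows):
    """Return {(list_name, value): [labels]} for codes used more than once in a list."""
    labels_by_code = defaultdict(list)
    for row in choice_rows:
        labels_by_code[(row["list_name"], row["value"])].append(row["label"])
    return {key: labels for key, labels in labels_by_code.items() if len(labels) > 1}


choices = [
    {"list_name": "village", "value": "1", "label": "Village A"},
    {"list_name": "village", "value": "1", "label": "Village B"},  # same code as above
    {"list_name": "village", "value": "2", "label": "Village C"},
    {"list_name": "yesno", "value": "1", "label": "Yes"},          # other list: no clash
]
print(duplicate_codes(choices))
# {('village', '1'): ['Village A', 'Village B']}
```

Grouping on the pair of list name and code means the same code can safely appear in different lists.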
Missing label, value, or name
In the first part of this test, ietestform lists all items in a choice list that have an entry under the label column, but have nothing under the value or name column. In the second part of the test, it also lists cases where the exact opposite occurs. This can sometimes happen when the survey form is programmed in multiple languages, or when the coding is incomplete.
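Both parts of this test come down to comparing which cells in a row are filled. The following is a minimal Python sketch with assumed column names (value and label), given as an illustration rather than as the command's implementation; row numbers assume the header is in Excel row 1.

```python
def missing_entries(choice_rows):
    """Return (rows with a label but no value, rows with a value but no label)."""
    label_without_value, value_without_label = [], []
    for row_number, row in enumerate(choice_rows, start=2):
        has_label = bool(row.get("label", "").strip())
        has_value = bool(row.get("value", "").strip())
        if has_label and not has_value:
            label_without_value.append(row_number)
        elif has_value and not has_label:
            value_without_label.append(row_number)
    return label_without_value, value_without_label


choices = [
    {"list_name": "yesno", "value": "1", "label": "Yes"},
    {"list_name": "yesno", "value": "", "label": "No"},   # row 3: label only
    {"list_name": "yesno", "value": "-99", "label": ""},  # row 4: value only
]
print(missing_entries(choices))  # ([3], [4])
```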
Outdated Syntax
SurveyCTO periodically updates its expression syntax, and newer versions tend to offer more advanced features than earlier ones. It is recommended to use the latest syntax to ensure full functionality of the expression and avoid potential issues.
ietestform tests that the latest syntax is being used in the survey form. It flags the following cases:
- The outdated position() is used instead of index().
- The outdated jr:choice-name() is used instead of choice-label().
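A scan for the two outdated functions listed above could look like the following Python sketch. It is an illustration only: the columns searched (calculation, relevance, constraint) are an assumption, and the example field names are hypothetical.

```python
import re

# Outdated function pattern -> recommended replacement, as listed above.
OUTDATED = {
    r"\bposition\(": "index()",
    r"jr:choice-name\(": "choice-label()",
}


def outdated_syntax(survey_rows, columns=("calculation", "relevance", "constraint")):
    """Return (field name, column, suggested replacement) for outdated functions."""
    flagged = []
    for row in survey_rows:
        for column in columns:
            expression = row.get(column, "")
            for pattern, replacement in OUTDATED.items():
                if re.search(pattern, expression):
                    flagged.append((row["name"], column, replacement))
    return flagged


survey = [
    {"name": "member_no", "calculation": "position(..)"},
    {"name": "member_idx", "calculation": "index()"},
    {"name": "crop_label", "calculation": "jr:choice-name(${crop}, '${crop}')"},
]
for name, column, replacement in outdated_syntax(survey):
    print(f"{name}: consider {replacement} in the {column} column")
```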
Encryption
Encryption of survey forms is an integral part of reducing the risk of exposing confidential or personally identifiable data. You can learn how to encrypt your form on SurveyCTO here.
Related Pages
This page is part of the topic iefieldkit.
Additional Resources
Other frameworks for testing ODK-based or SurveyCTO forms include:
- IPA, ipacheckscto
- PMA2020, xform-test