Monitoring Data Quality
Revision as of 17:00, 10 June 2020
Ensuring high data quality during primary data collection involves anticipating everything that can go wrong and preparing a comprehensive data quality assurance plan to handle these issues. Monitoring data quality from the field is an important part of this broader plan, and involves communication and reporting, field monitoring, minimizing attrition, and real-time data quality checks. Each of these steps allows the research team to identify and correct issues by using feedback from multiple rounds of piloting, re-training enumerators accordingly, and reviewing and re-drafting protocols for efficient field management.
Read First
- Data quality monitoring is part of a broader data quality assurance plan, and the research team should discuss this as part of field management.
- Data quality checks should be run daily, as the enumerator will still remember the interview should any questions arise. Further, she/he can likely return to the respondent if necessary. These checks serve as additional enumerator support mechanisms that allow team members and enumerators to notice data discrepancies as they arise and resolve them immediately.
- The best time to design and code the back-checks and high frequency checks is in parallel with questionnaire design and programming. Data quality checks may omit important tests or be irrelevant if not written alongside the questionnaire.
- Create a dashboard with enumerator data check results to facilitate a continual feedback process for enumerators and survey teams: this maintains transparency, accountability, and can help boost motivation.
Communication and Monitoring
Data quality monitoring entails conducting back-checks in the field and regularly running high-frequency checks on the data (i.e. response quality checks, programming checks, and enumerator checks). This process, part of a broader data quality assurance plan, helps the research team identify and correct any enumerator issues or remaining quirks in questionnaire design and programming. With the information collected during data quality monitoring, research teams can ensure that they are obtaining the highest possible quality of data. This page outlines key elements of data quality monitoring: back checks, high frequency checks, project checks, and treatment monitoring.
Back Checks
Back checks, also known as survey audits, are a quality control method used to verify the quality and legitimacy of key data collected during a survey. Throughout the course of fieldwork, a back-check team returns to a randomly selected subset of households for which data has been collected and re-interviews these respondents with a short subset of survey questions, known as a back-check survey. The back-check responses are then compared with the original survey data. For more information on how to collect and analyze data via back-checks, see Back Checks.
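As a minimal sketch of the comparison step, assuming a simple in-memory representation with hypothetical respondent IDs and field names (`head_age`, `n_members`), a back-check discrepancy report might look like:

```python
# Compare back-check responses with the original survey to list
# discrepancies per respondent and question (hypothetical field names).

def backcheck_discrepancies(original, backcheck, questions):
    """original and backcheck map respondent ID to a dict of responses;
    questions are the items included in the back-check survey."""
    flags = []
    for rid, bc in backcheck.items():
        orig = original.get(rid)
        if orig is None:
            flags.append((rid, "missing_in_original", None))
            continue
        for q in questions:
            if orig.get(q) != bc.get(q):
                flags.append((rid, q, (orig.get(q), bc.get(q))))
    return flags

survey = {"hh01": {"head_age": 42, "n_members": 5},
          "hh02": {"head_age": 30, "n_members": 3}}
bcheck = {"hh01": {"head_age": 42, "n_members": 6}}
print(backcheck_discrepancies(survey, bcheck, ["head_age", "n_members"]))
# [('hh01', 'n_members', (5, 6))]
```

In practice teams track the discrepancy rate per question and per enumerator, and investigate questions with unusually high mismatch rates.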
High Frequency Checks
Research teams should run high frequency checks (HFCs) from the office daily. Prepare the HFC code in Stata or R once the questionnaire is finalized but before it goes to the field. Also prepare instructions for the HFCs in case someone else needs to run them while you are in the field and/or without internet connectivity. During data collection, download the data and run the HFC daily to report flags; this should be a one-click process. Within the HFC, include four main types of checks: response quality checks, programming checks, enumerator checks, and duplicate/survey log checks.
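A minimal sketch of such a one-click runner, in Python for illustration (the page suggests Stata or R in practice), with hypothetical check functions and column names:

```python
import csv
import io

# Each check takes the list of survey rows and returns flag messages;
# the runner applies every check to the day's download in one step.

def load_rows(csv_text):
    """Parse raw CSV text (e.g. a daily export) into a list of dicts."""
    return list(csv.DictReader(io.StringIO(csv_text)))

def run_hfc(rows, checks):
    flags = []
    for check in checks:
        flags.extend(check(rows))
    return flags

def check_missing_id(rows):
    return [f"row {i}: missing id" for i, r in enumerate(rows) if not r.get("id")]

def check_duration_range(rows, low=10, high=120):
    # Assumed plausibility bounds for interview duration, in minutes.
    return [f"row {i}: implausible duration {r['duration_min']}"
            for i, r in enumerate(rows)
            if not low <= float(r["duration_min"]) <= high]

raw = "id,duration_min\nhh01,45\n,30\nhh03,400\n"
flags = run_hfc(load_rows(raw), [check_missing_id, check_duration_range])
print(flags)
# ['row 1: missing id', 'row 2: implausible duration 400']
```

Keeping each check as a separate function makes it easy to add new flags during fieldwork without touching the runner itself.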
Response Quality Checks
Response quality checks monitor the consistency of responses across the survey instrument and the range within which responses fall.
- Consistency of responses across the survey instrument: most consistency tests can and should be built into the questionnaire programming via logic and constraints. However, some checks may be overly complex to program in the survey instrument, particularly when comparing responses across rosters or dealing with multiple units. For example, imagine we ask about plot and harvest size and allow the respondent to answer in the unit of his/her choice. In order to test if the harvest in terms of kilos per hectare is plausible, we need to convert harvest and plot size to kilos and hectares, which may be challenging to program within the questionnaire itself. As a rule of thumb, program as many checks as possible into the survey instrument. Then include the rest in the HFC do file or script.
- Reasonable ranges of responses: while range checks should always be programmed into the survey instrument, typically questionnaires employ 'soft' constraints (i.e. warning enumerators that the response is unusual but can continue). Thus, HFC data checks should include checks for extreme values and outliers and confirm whether they make sense in context. Data checks should also check the range for constructed indicators; multiplication or division can create or expose outliers even when the numerator and denominator are reasonable. For example, say a household reported a plot size of 0.05 hectares (the low end of an acceptable range) and produced 1000kg of maize (within an acceptable range): the yield for the plot would be 20,000kg/ha. This is an extreme outlier.
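The yield example above can be sketched in Python (Stata or R would be typical in practice); the 10,000 kg/ha ceiling used here is an assumed plausibility threshold for illustration, not a standard value:

```python
# Flag constructed indicators (here, maize yield in kg/ha) that fall
# outside a plausible range even when the numerator and denominator
# each pass their own range checks.

MAX_MAIZE_YIELD_KG_HA = 10000  # assumed plausibility ceiling

def yield_kg_per_ha(harvest_kg, plot_ha):
    return harvest_kg / plot_ha

def flag_yield(harvest_kg, plot_ha):
    y = yield_kg_per_ha(harvest_kg, plot_ha)
    return (y, y > MAX_MAIZE_YIELD_KG_HA)

# Plot size 0.05 ha and harvest 1000 kg each look acceptable alone,
# but the constructed yield is an extreme outlier.
print(flag_yield(1000, 0.05))  # (20000.0, True)
```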
Programming Checks
Programming checks help the research team to understand if they have designed and programmed the questionnaire properly. Most programming errors should be caught when testing the questionnaire, but it is impossible to test all possible outcomes before data collection. Including programming checks in the HFC is especially important when the team has made last-minute edits to the survey instrument.
Enumerator Checks
Enumerator checks help the research team determine if any individual enumerator's data is significantly different from other enumerators' data in the datasets or different from the mean of a given question. These checks should:
- Check the percentage of “don’t know” and refusal responses by enumerator.
- Check the distribution of responses for key questions by enumerator.
- Check the number of surveys per day by enumerator.
- Check the average interview duration by enumerator.
- Check the duration of consent by enumerator.
- Check the duration of other modules by enumerator (anthropometrics, games, etc.).
These statistics can be output into an enumerator dashboard. Keeping track of survey team metrics and frequently discussing them with enumerators and team leaders maintains accountability and transparency, and can boost motivation. See SurveyCTO’s documentation on tracking dashboards for more.
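The per-enumerator statistics above can be computed with a short script and fed into a dashboard; this sketch uses hypothetical field names (`enumerator`, `duration_min`, `dont_know_or_refusal`):

```python
from collections import defaultdict
from statistics import mean

# Group survey rows by enumerator and compute simple dashboard metrics:
# surveys completed, average duration, and don't-know/refusal rate.

def enumerator_stats(rows):
    by_enum = defaultdict(list)
    for r in rows:
        by_enum[r["enumerator"]].append(r)
    stats = {}
    for enum, rs in by_enum.items():
        stats[enum] = {
            "n_surveys": len(rs),
            "avg_duration_min": mean(r["duration_min"] for r in rs),
            "dk_refusal_rate": sum(r["dont_know_or_refusal"] for r in rs) / len(rs),
        }
    return stats

rows = [{"enumerator": "A", "duration_min": 40, "dont_know_or_refusal": 0},
        {"enumerator": "A", "duration_min": 20, "dont_know_or_refusal": 1},
        {"enumerator": "B", "duration_min": 35, "dont_know_or_refusal": 0}]
stats = enumerator_stats(rows)
print(stats)
```

An enumerator whose average duration or refusal rate differs sharply from the rest of the team is a candidate for re-training or a back-check visit.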
Duplicates and Survey Log Checks
Duplicate and survey log checks confirm that all data collected in the field has reached the server intact. They should:
- Test that all data from the field is on the server: match survey data logs from the field with survey data logs on the server to confirm that all the data from the field has been transferred to the server.
- Test for target number: since surveys are submitted in daily waves, keep track of the numbers of surveys submitted and the target number of surveys needed for an area to be completed.
- Test for duplicates: since SurveyCTO/ODK data can contain duplicate submissions, check for duplicates using ieduplicates.
Verifying these details as soon as possible is critical: if you run daily checks, the enumerator is most likely still close by, so it is easy for him/her to re-interview the respondent and collect missing data if the HFC shows this is necessary.
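ieduplicates is a Stata command (from DIME Analytics' ietoolkit); the same idea, counting submissions per unique survey ID and tracking progress against the area target, can be sketched generically with hypothetical IDs:

```python
from collections import Counter

# Report IDs submitted more than once, and compare the number of
# unique submissions against the target for the area.

def find_duplicates(ids):
    return [i for i, n in Counter(ids).items() if n > 1]

def progress(ids, target):
    submitted = len(set(ids))
    return {"submitted": submitted, "target": target,
            "remaining": max(target - submitted, 0)}

ids = ["hh01", "hh02", "hh02", "hh03"]
print(find_duplicates(ids))       # ['hh02']
print(progress(ids, target=10))   # {'submitted': 3, 'target': 10, 'remaining': 7}
```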
Project Checks
Make sure to look at the broader status and progress of the project itself. These statistics can help the research team see bigger-picture trends, including:
- Overall survey progress relative to planned sample
- Summaries of key research variables
- Two-way summaries of survey variables by demographic/geographic characteristics
- Attrition rates by type and treatment status
- Comparisons of variables with known distributions
- Geographical distribution of observations via maps/GIS – are all observations where they’re meant to be?
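As one illustration, attrition rates by treatment status (one of the statistics listed above) can be computed from a tracking list; the field names here (`treatment`, `completed`) are hypothetical:

```python
from collections import defaultdict

# For each treatment arm, compute the share of the planned sample
# that has not (yet) been successfully interviewed.

def attrition_by_arm(sample):
    counts = defaultdict(lambda: {"planned": 0, "completed": 0})
    for unit in sample:
        c = counts[unit["treatment"]]
        c["planned"] += 1
        c["completed"] += unit["completed"]
    return {arm: 1 - c["completed"] / c["planned"] for arm, c in counts.items()}

sample = [{"treatment": "T", "completed": 1},
          {"treatment": "T", "completed": 0},
          {"treatment": "C", "completed": 1},
          {"treatment": "C", "completed": 1}]
print(attrition_by_arm(sample))  # {'T': 0.5, 'C': 0.0}
```

A large gap between arms is worth investigating early, since differential attrition threatens the validity of the comparison.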
Data Quality for Remote Surveys
In the case of remote surveys, monitoring data quality becomes even more important. Poor quality data in remote surveys can at best reduce the effectiveness of a policy intervention, and at worst require repeating the entire process of data collection. Therefore, the research team must prepare clear guidelines for the following:
- Type of data checks. Conduct regular back checks and high frequency checks.
- Frequency of data checks. Specify how often the supervisor should conduct data checks.
- Feedback method. Specify method for communicating feedback to the enumerators after a data check. Decide on this before any data collection starts.
Back to Parent
This article is part of the topic Field Management
Additional Resources
- DIME Analytics’ Data Quality Assurance
- Innovation for Poverty Action's template for high frequency checks
- DIME's Planning for, Preparing & Monitoring Household Surveys
- DIME Analytics’ Real Time Data Quality Checks
- SurveyCTO, Monitoring and Visualization case studies
- IPA-JPAL-SurveyCTO, Collecting High Quality Data
- SurveyCTO, Data quality with SurveyCTO