Difference between revisions of "Duplicates and Survey Logs"


Revision as of 16:34, 2 February 2017

Read First

  • The data should be downloaded and checked for duplicates daily. Duplicates are much easier to resolve while the field team still remembers the interview and is close enough to go back and reinterview the respondent. Other data quality checks depend on ID variables that uniquely identify each observation.

Types of Duplicates in SurveyCTO

Before analyzing the outcomes of quality checks, and sometimes even before running real-time quality checks, we need to check the data for duplicates. Duplicates are common in ODK/SurveyCTO. They need to be removed before other data quality checks begin, because those checks depend on a unique ID. There are three main types of duplicates in SurveyCTO:

  • Double submissions of the same observation with the same data. This happens when the first upload from the tablet to the server is interrupted by a bad internet connection. Few variables differ between the duplicates, and the differences are in the submission data.
  • Double submissions of the same observation but with modified data (rare in SurveyCTO). This happens when an answer is modified after the original survey was submitted and the survey is then resubmitted. This is bad practice; it is more transparent to correct errors in the do-file instead. Some variables differ between the duplicates, and some of the differences are in the observation data.
  • Incorrectly assigned ID, i.e. two respondents with the same ID. This is caused by a typo made in the field when the respondent ID is entered. Many variables differ between the duplicates, and many of the differences are in the observation data.
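As a rough first pass at telling these types apart, one can compare how many observations share an ID with how many also share the same survey answers. The sketch below uses Stata's built-in duplicates command on toy data; the variable names (hhid, key, age) and all values are hypothetical, and a real survey would compare many more answer variables.

```stata
* Toy data: respondent ID, submission key, and one survey answer
* (all names and values are hypothetical, for illustration only)
clear
input int hhid str10 key int age
101 "uuid:a1" 34
102 "uuid:b2" 51
102 "uuid:b3" 51
103 "uuid:c4" 27
103 "uuid:c5" 62
end

* Count, for each observation, how many other observations share its ID
duplicates tag hhid, generate(dup_id)

* Count how many other observations share both the ID and the answer
duplicates tag hhid age, generate(dup_all)

* dup_id > 0 & dup_all > 0 : same ID, same answers (likely a double submission)
* dup_id > 0 & dup_all == 0: same ID, different answers (modified data or a wrongly assigned ID)
list hhid key age dup_id dup_all if dup_id > 0, sepby(hhid)
```

This only narrows down the candidates. Deciding which submission to keep, or which respondent should receive a corrected ID, still requires input from the field team.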

Removing duplicates from a dataset

To remove duplicates, you can use DIME's Stata command ieduplicates, which is part of the ietoolkit Stata package.

ssc install ietoolkit
ieduplicates ID_varname

This identifies the duplicates in the ID variable and exports them to an Excel file, which is also used to correct the duplicates in Stata. Field supervisors without knowledge of Stata can make the corrections in the Excel file, and the duplicates will be corrected the next time you run the code. The call above is simplified: ieduplicates requires additional options, so see help ieduplicates for the full syntax.
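Independently of ieduplicates, Stata's built-in commands can confirm whether an ID variable is unique, for example before running other quality checks. This is a minimal sketch on toy data; the variable name hhid and the message text are assumptions.

```stata
* Toy data with one duplicated ID (values are hypothetical)
clear
input int hhid
101
102
102
end

* Summarize how many IDs appear once, twice, and so on
duplicates report hhid

* isid exits with an error if hhid does not uniquely identify observations;
* capture stores the return code in _rc instead of stopping the do-file
capture isid hhid
if _rc != 0 {
    display as error "hhid is not unique: resolve duplicates before other checks"
}
```

Placing a check like this at the top of the quality-check do-file makes any remaining duplicates visible before later steps rely on the ID.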

Comparing Server Data to Field Logs

Comparing server data to field logs makes sure that all the data collected during the survey has made it to your server. Survey logs also give teams a quick overview of progress in the field, help detect enumerators who are not performing their tasks properly, and in some cases allow checking the balance of the survey (e.g. by gender or race).

You can do this by creating a log of all the interviews conducted in the field during the day and matching it against the survey data on the server. This should be done only after the duplicates have been removed. Any interview that appears in the log but not in the server data has not yet been uploaded.
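The matching step can be sketched with Stata's merge command. The example assumes the field log and the server data share a respondent ID; the variable names (hhid, enumerator, age) and the toy values are hypothetical. Run it as a do-file so that the temporary file name stays defined.

```stata
* Hypothetical field log: one row per interview the team reports completing
clear
input int hhid str8 enumerator
101 "Amina"
102 "Amina"
103 "Joseph"
104 "Joseph"
end
tempfile fieldlog
save `fieldlog'

* Hypothetical server data, already free of duplicate IDs
clear
input int hhid int age
101 34
102 51
103 27
end

* Match the two sources on the respondent ID
merge 1:1 hhid using `fieldlog'

* _merge == 2: in the field log but not on the server
* _merge == 1: on the server but missing from the field log
list hhid enumerator if _merge == 2
list hhid if _merge == 1
```

merge 1:1 stops with an error if hhid is not unique in either file, which is one reason to do this comparison only after duplicates have been removed.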

Back to Parent

This article is part of the topic Monitoring Data Quality

Additional Resources