Checking for duplicates ensures that the answers of a survey respondent are not recorded twice. Matching field survey logs with server logs ensures that the data collected in the field has been fully transferred to the server. Both checks are important to ensure data quality.
Read First
- The data should be downloaded and checked for duplicates daily. It is much easier to solve the problem when the field team still remembers the interview and is still close by, so they can go back and re-interview the respondent. Other data quality checks depend on uniquely identifying ID variables.
- Verifying field survey logs against the server data makes sure that all data collected in the field has been transferred over to the server. This should also be done daily, which makes it easier to go back and re-interview respondents whose surveys were lost due to technical problems.
Types of Duplicates in SurveyCTO
Before analyzing the outcomes of quality checks, and sometimes even before running real-time quality checks, you need to check for duplicates in the data. Duplicates are common in ODK/SurveyCTO, and they need to be removed before starting other data quality checks, since those checks depend on a uniquely identifying ID (see the sketch after this list). The three main types of duplicates in SurveyCTO are:
- Double submissions of the same observation with the same data. This happens when the first upload from the tablet to the server is interrupted, for example by a bad internet connection. Only a few variables differ between the duplicates, and the differences are in the submission data rather than the observation data.
- Double submissions of the same observation, but with modified data. This is rare in SurveyCTO. It happens when answers are modified after the original survey has been submitted and the survey is then resubmitted. This is bad practice; it is more transparent to correct errors in a do-file instead. Some variables differ between the duplicates, and some of the differences are in the observation data.
- Incorrectly assigned IDs, i.e. two respondents with the same ID. This is due to a typo in the field when the respondent ID is entered. Many variables differ between the duplicates, and many of the differences are in the observation data.
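As a quick first pass, before making any corrections, you can count and inspect duplicated IDs with Stata's built-in duplicates commands. This is a minimal sketch; the dataset name survey_data.dta and the variable name respondent_id are placeholders for your own data:

* Load the data downloaded from the server (placeholder file name)
use "survey_data.dta", clear

* Count how many observations share the same respondent ID
duplicates report respondent_id

* Tag every duplicated observation so it can be inspected in detail
duplicates tag respondent_id, generate(dup)
sort respondent_id
list respondent_id if dup > 0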
Removing Duplicates from a Dataset
To remove duplicates, you can use DIME's Stata command ieduplicates, which is part of the ietoolkit Stata package.
ssc install ietoolkit
ieduplicates ID_varname
This identifies the duplicates in the ID variable and exports them to an Excel file, which is also used to correct the duplicates in Stata. Field supervisors without knowledge of Stata can make the corrections in the Excel file, and the duplicates will be corrected the next time you run the code.
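In recent versions of ietoolkit, ieduplicates also expects a file path for the Excel report and a uniquevars() option naming a variable that is always unique (for SurveyCTO data this is typically the key variable). The sketch below shows how the call might look in a daily data-quality do-file; the file names and the respondent_id variable are illustrative assumptions, so check the ieduplicates help file for the exact syntax of your version:

* Install or update the ietoolkit package (only needed once)
ssc install ietoolkit, replace

* Load the raw data downloaded from the server (placeholder file name)
use "survey_data.dta", clear

* Identify duplicated IDs and export them to an Excel report that field
* supervisors can fill in; corrections entered in that file are applied
* automatically the next time the command is run
ieduplicates respondent_id using "duplicates_report.xlsx", uniquevars(key)

* After all duplicates are resolved, confirm that the ID variable
* uniquely identifies every observation
isid respondent_id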
Comparing Server Data to Field Logs
Comparing server data to field logs makes sure that all the data collected during the survey has made it to your server. Survey logs also give teams a quick overview of progress in the field, help detect enumerators who are not performing their tasks properly, and, in some cases, allow checking the balance of the survey sample (e.g. gender, race, etc.).
You can do this by creating a log of all the interviews completed in the field each day and matching it against the survey data on the server. This should be done only after duplicates have been removed, and it confirms that all interviews conducted during the day have been uploaded to the server.
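One simple way to do this match, assuming the field log is kept as a spreadsheet with one row per completed interview and a respondent_id column (the file names and variable names here are illustrative), is to merge the log with the de-duplicated server data and list any interviews that never arrived:

* Load the field log kept by supervisors (placeholder file name)
import excel using "field_log.xlsx", firstrow clear
tempfile fieldlog
save `fieldlog'

* Load the de-duplicated data downloaded from the server
use "survey_data.dta", clear

* Match server submissions to the field log on the respondent ID
merge 1:1 respondent_id using `fieldlog'

* Interviews recorded in the field log but missing from the server
* (_merge == 2) need to be followed up with the field team
list respondent_id if _merge == 2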
Related Pages
Click here for pages that link to this topic.