Difference between revisions of "Iecompdup"
Kbjarkefur (talk | contribs) (Redirected page to Ieduplicates) |
<code>iecompdup</code> is a Stata command that helps identify why [[Duplicates and Survey Logs | duplicate]] values exist in an [[ID Variable Properties | ID variable]] by comparing the duplicated observations variable by variable. While the decision of how to correct a duplicate is always qualitative, <code>iecompdup</code> provides the information needed to make that decision. Use the command whenever <code>[[ieduplicates]]</code> identifies duplicates, in order to ensure high-quality data before [[Data Cleaning | cleaning]] and [[Data Analysis | data analysis]]. This page describes how to implement the command and interpret its output.
==Read First==
*<code>[[ieduplicates]]</code> identifies duplicates in ID variables; <code>iecompdup</code> helps diagnose each duplicate so that it can be resolved.
*For detailed instructions on how to implement the command and its options, type <code>help iecompdup</code> in Stata.
*This command is part of the package <code>[[Stata Coding Practices#ietoolkit | ietoolkit]]</code>. To install all commands in this package, including <code>iecompdup</code>, type <code>ssc install ietoolkit</code> in Stata.
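The installation step can be made safe to re-run in a do-file. The following is a minimal sketch that installs the package only when <code>iecompdup</code> is not already found; it assumes an internet connection is available for the SSC download.

```stata
* Check whether iecompdup is already installed; capture suppresses the error
capture which iecompdup

* A non-zero return code means the command was not found, so install the package
if _rc != 0 {
    ssc install ietoolkit
}

* Open the help file for the full syntax and options
help iecompdup
```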
==Implementation==
# Run <code>[[ieduplicates]]</code> on the raw data. If there are no duplicates, you are done and can skip the rest of this list.
# If there are duplicates, run <code>iecompdup</code> on each duplicate identified.
# Enter the corrections identified with <code>iecompdup</code> in the report output by <code>ieduplicates</code>.
# After entering the corrections, save the report in the same location under the same name.
# Run <code>ieduplicates</code> again. The corrections you entered are now applied, and only the duplicates that remain unresolved are removed this time.
Repeat these steps with each new round of data; DIME Analytics recommends doing so each day that the research team receives new data. When doing so, make sure not to overwrite the original raw data with the dataset from which <code>ieduplicates</code> has removed duplicates, as this would result in lost data. Instead, save the dataset with removed duplicates under a [[Naming Conventions | different name]].
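The steps above can be sketched as a short do-file. This is an illustration under stated assumptions, not the official workflow: the file names <code>raw_survey.dta</code>, <code>duplicates_report.xlsx</code> and <code>survey_deduplicated.dta</code> are hypothetical, ''KEY'' is assumed to uniquely identify submissions, and the <code>ieduplicates</code> call follows the syntax of recent <code>ietoolkit</code> releases, so check <code>help ieduplicates</code> for your installed version.

```stata
* Hypothetical file names; HHID is the ID variable and KEY the submission key
use "raw_survey.dta", clear

* Steps 2-3: inspect each duplicated ID listed in the report, one at a time
* (typically run interactively), then enter the correction in the report
iecompdup HHID, id(123456)

* Steps 1 and 5: create or update the report, apply the corrections entered
* so far, and drop the duplicates that are still unresolved
* (syntax assumed from recent ietoolkit versions)
ieduplicates HHID using "duplicates_report.xlsx", uniquevars(KEY) force

* The data in memory should now be uniquely identified by HHID
isid HHID

* Save under a different name so that the raw data is never overwritten
save "survey_deduplicated.dta", replace
```

Because the do-file always starts from the raw data and saves to a separate file, it can be re-run each day without losing any submissions.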
== Specifications ==
<code>iecompdup</code> requires a single ID variable and the duplicated ID value. See the example below for reference:
<pre>iecompdup HHID, id(123456)</pre>
===idvar===
<code>iecompdup</code> only allows a single ID variable. In the above example, this is ''HHID''. It should be the same ID variable that was used in <code>ieduplicates</code>. If you currently have two or more variables that together identify the observations in the dataset, DIME Analytics suggests creating a single ID variable. This variable can be either string or numeric.
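As an illustration of creating a single ID variable, the sketch below combines two identifying variables using built-in Stata commands. The variable names <code>village</code> and <code>hh_num</code> and the toy data are hypothetical; which construction suits a project depends on its ID protocol.

```stata
* Toy data: households are identified by village and household number together
clear
input village hh_num
1 1
1 2
2 1
2 2
end

* Option 1: a sequential numeric ID, one value per village-household combination
egen hhid_num = group(village hh_num)

* Option 2: a string ID that concatenates the components with a separator
egen hhid_str = concat(village hh_num), punct("-")

* Confirm that each new variable uniquely identifies the observations
isid hhid_num
isid hhid_str
```

Note that <code>group()</code> numbers the combinations present in the data in memory, so its values can shift when new observations arrive; a concatenated ID does not have that problem.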
===id===
<code>iecompdup</code> requires the ID value of the duplicate pair or group. In the above example, this is ''123456''. If you have several pairs or groups of duplicates, you will have to run the command once for each pair or group. Note that the command can only compare two duplicates at a time: it picks the first two observations in the sort order. If you want to change which two duplicates are compared, simply change the sort order. When there are more than two duplicates for a given ID, the command issues a warning. To suppress this warning, use the option <code>more2ok</code>.
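The effect of the sort order can be sketched with a toy group of three duplicates. The data and the variable <code>submitted</code> (a stand-in for a submission timestamp) are hypothetical, and the sketch assumes <code>ietoolkit</code> is installed.

```stata
* Toy data: three submissions share the same HHID
clear
input long HHID str8 KEY double submitted income
123456 "uuid:a" 1 500
123456 "uuid:b" 2 500
123456 "uuid:c" 3 650
end

* Compare the two earliest submissions: sort so that they come first
sort HHID submitted
iecompdup HHID, id(123456) more2ok

* Compare the two latest submissions instead: reverse the order within HHID
gsort HHID -submitted
iecompdup HHID, id(123456) more2ok
```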
==Output==
The command outputs the variable names for which the duplicate pair has identical values and the variable names for which the duplicate pair has different values. The sections below outline three cases of duplicates and explain how <code>iecompdup</code> can help identify which case a duplicate pair belongs to. No output from <code>iecompdup</code> can guarantee any of the cases below, but the output will typically be qualitatively conclusive for one of the three.
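The two lists can also be used programmatically. The sketch below runs the command on a hypothetical toy pair and then inspects the stored results; the names <code>r(diffvars)</code> and <code>r(matchvars)</code> are an assumption based on the command's help file, so use <code>return list</code> to see what your installed version actually stores.

```stata
* Toy data: one duplicate pair that differs only in submission metadata
clear
input long HHID str8 KEY double submitted income
123456 "uuid:a" 1 500
123456 "uuid:b" 2 500
end

* Compare the pair
iecompdup HHID, id(123456)

* Show everything the command stored in r()
return list

* Assumed result names; these display as empty if the names differ
display "Variables that differ: `r(diffvars)'"
display "Variables that match:  `r(matchvars)'"
```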
===Case 1: Same Observation, Same Data===
This case often occurs with [[Computer-Assisted Personal Interviews (CAPI) | CAPI]] surveys as a consequence of a poor internet connection. If a submission is interrupted, the server still saves the incomplete data; when it receives a second submission, it saves both, since it cannot know whether the two submissions and any changes made between them were intentional. In <code>iecompdup</code>’s output, this case appears as very few differing variables, and those that differ are mostly submission metadata such as submission time or submission ID (called ''KEY'' in SurveyCTO). If no media files (e.g. audio, images, monitoring) were used and only the metadata differs, it does not matter which observation you keep. However, it is good practice to keep the one submitted most recently.
In most cases, submission interruptions occur because media files did not upload correctly. The files themselves do not appear as variables in Stata (only their file names do), so only submission metadata variables differ. The file name variable is submitted even when the file is not. When both duplicates have a file name and the same file contents, it does not matter which duplicate you keep, although it is good practice to keep the one submitted most recently. If only one has the file name, keep that observation.
This case may also occur if a duplicate is created on the server. This is very uncommon, but when it happens even some of the submission metadata is the same. In this case, either observation can be dropped.
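The rule of thumb for Case 1 (prefer the observation with a media file name, then the most recent submission) can be sketched with built-in Stata commands. The toy data and the variable names <code>submitted_str</code> and <code>photo</code> are hypothetical; the flag only suggests which submission to mark as the one to keep in the <code>ieduplicates</code> report.

```stata
* Toy data: the first submission was interrupted before the photo was recorded
clear
input long HHID str8 KEY str19 submitted_str str12 photo income
123456 "uuid:a" "2019-06-01 10:02:11" ""           500
123456 "uuid:b" "2019-06-01 10:05:47" "photo1.jpg" 500
end

* Convert the timestamp string to a Stata datetime so that it sorts correctly
gen double submitted = clock(submitted_str, "YMDhms")
format submitted %tc

* Inspect the metadata of the duplicate pair side by side
list HHID KEY submitted photo if HHID == 123456

* Within each ID, sort by presence of a file name, then by submission time;
* the last observation is the suggested one to keep
gen byte has_photo = !missing(photo)
bysort HHID (has_photo submitted): gen byte keep_candidate = (_n == _N)
list KEY keep_candidate if HHID == 123456
```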
===Case 2: Same Observation, Modified Data===
This case is rare but possible in most data collection software. It occurs when an observation is modified after the first submission and then re-submitted. Sometimes it is necessary to modify already-submitted data, though in these cases it is best practice to do so in a do-file to ensure [[Data Documentation | proper documentation]]. In <code>iecompdup</code>’s output, this case shows up as the submission metadata differing and some observation data differing. Look into these cases closely and follow up with the enumerators and supervisors responsible for the submission. There is no clear rule on which observation to keep: you have to make that decision yourself. Remember that this case is rare, since most survey software has systems to prevent it.
===Case 3: Incorrectly Assigned ID===
This case occurs when the same ID is used for two different respondents. This may happen due to typos or to [[Survey Protocols | protocols]] not being followed. In <code>iecompdup</code>’s output, this case shows up as the submission metadata differing as well as a lot of the observation data differing. Follow up with the enumerators and supervisors responsible for the submission and assign a new [[ID Variable Properties | ID]] to one of the observations based on your findings.
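To make "a lot of observation data differing" concrete, the sketch below counts how many variables differ between two observations using only built-in Stata. It is a hand-rolled illustration of the kind of comparison involved, not the internals of <code>iecompdup</code>, and the toy data is hypothetical.

```stata
* Toy data: two different respondents were given the same ID
clear
input long HHID str8 KEY age income hhsize
123456 "uuid:a" 34 500 4
123456 "uuid:b" 61 120 7
end

* Compare observation 1 with observation 2, variable by variable
local ndiff = 0
local nvars = 0
local diffvars
foreach v of varlist _all {
    local ++nvars
    if `v'[1] != `v'[2] {
        local ++ndiff
        local diffvars `diffvars' `v'
    }
}

* A high share of differing variables points toward Case 3
display "`ndiff' of `nvars' variables differ: `diffvars'"
```

Here everything except ''HHID'' differs, which is the pattern described above; a pair where only metadata such as ''KEY'' differs would instead point toward Case 1.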
== Back to Parent ==
This article is part of the topic [[Stata_Coding_Practices#ietoolkit|ietoolkit]]
==Additional Resources==
*DIME Analytics’ [https://github.com/worldbank/DIME-Resources/blob/master/stata1-4-quality.pdf Real Time Data Quality Checks]
[[Category: Stata ]]
Revision as of 22:44, 4 June 2019