Iecompdup
iecompdup is the third command in iefieldkit, the Stata package created by DIME Analytics. The iecompdup command helps the research team identify why duplicate values for ID variables exist, so that they can be resolved. ID variables are variables that uniquely identify every observation in a dataset, for example, household_id.
Read First
- Stata coding practices.
- iefieldkit.
- While ieduplicates identifies duplicates in ID variables, iecompdup provides more information to resolve these issues.
- To install iecompdup, type ssc install iecompdup in Stata.
- To install all the commands in the iefieldkit package, type ssc install iefieldkit in Stata.
- For instructions and available options, type help iecompdup.
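As a minimal sketch, the package-level installation and help lookup described above can be run from a do-file or the Stata command line. It assumes an internet connection so that Stata can reach the SSC archive.

```stata
* Install every command in the iefieldkit package from SSC
ssc install iefieldkit, replace

* Confirm that Stata can locate the command after installation
which iecompdup

* Open the help file listing the syntax and available options
help iecompdup
```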
Overview
Once ieduplicates creates the duplicates correction template, iecompdup compares the duplicate entries variable by variable to help you understand why the duplicates exist. While the decision of how to correct a duplicate is always qualitative, iecompdup provides the information needed to make that decision and to ensure high-quality data before cleaning and data analysis. It also allows the research team to select the output format that suits their decision process.
Follow these steps when using the ieduplicates and iecompdup commands on incoming primary data:
- Run ieduplicates on the raw data. If there are no duplicates, you are done. If there are duplicates, the command will output an Excel file containing a duplicates correction template, and a link to this file. It will also stop the code from moving forward, and show a message listing the duplicate values in the ID variables. You can prevent the command from stopping your code by using the force option. This will remove all observations with duplicate ID values and allow the code to continue.
- Open the duplicates correction template. This template lists each duplicate entry of the ID variable, along with information about each observation. It also contains 5 blank columns: correct, drop, newid, initials, and notes. Use these columns to make corrections, and include comments to document the corrections.
- Use iecompdup for more information. Sometimes the template is not enough to solve a particular issue. In such cases, run the iecompdup command on the same dataset.
- Overwrite the previous file. After entering all the corrections in the template, save the Excel file in the same location with the same name.
- Run ieduplicates again. This will apply the corrections you made in the previous steps. If you now use the force option, it will remove only those duplicates that you did not resolve.
- Do not overwrite the original raw data. Save the resulting dataset under a different name.
- Repeat these steps with each new round of data.
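The steps above can be sketched as a short do-file. This is an illustration under stated assumptions, not a definitive implementation: the file names, household_id, and key are hypothetical, and the ieduplicates call uses the using syntax of recent iefieldkit versions, so check help ieduplicates for the version you have installed.

```stata
* Hypothetical file and variable names throughout
use "raw_survey.dta", clear

* Optional: while duplicates remain unresolved, compare one pair in detail.
* Run this before ieduplicates, because ieduplicates stops the do-file
* when it finds unresolved duplicates.
* iecompdup household_id, id("A1234")

* Flag duplicates and apply any corrections already saved in the template.
* key is assumed to uniquely identify each submission.
ieduplicates household_id using "duplicates_report.xlsx", uniquevars(key)

* Save under a new name so the raw data is never overwritten
save "survey_deduplicated.dta", replace
```

Because the same ieduplicates line both creates the template and applies the saved corrections, the do-file does not need to change between rounds of data.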
Syntax
Sometimes, when many variables differ between observations with duplicate IDs, ieduplicates cannot store all the information. In such cases, or when there are more than two duplicates, you can use iecompdup to explore the differences.
iecompdup id_varname [if], id(id_value) [more2ok didifference keepdifference keepother(varlist)]
- id_varname: The name of the unique ID variable, which is also used with ieduplicates.
- id_value: The value that the ID variable takes in the duplicate observations you want to compare.
For example, if the household with the ID value A1234 appears twice, then id_varname is household_id, and id_value is A1234. If you have more than one pair of duplicates in your dataset, you will need to run this command once for each pair to compare the differences. If there are more than two observations with a particular ID value, the command will return an error. This is because iecompdup can only compare two duplicates at a time. In this case, consider the following options:
- if: Using if allows you to select the pair of observations you want to compare.
- more2ok: Using more2ok allows iecompdup to pick the first two observations by default, according to the sort order. It will then display a warning message so that the user is aware that the sort order of the observations will affect the result.
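A minimal sketch of both options, assuming household_id A1234 appears three times. The variables key and submissiondate and the value uuid:example-3 are hypothetical names used only for illustration.

```stata
* Option 1: use if to exclude one submission and compare the remaining two
iecompdup household_id if key != "uuid:example-3", id("A1234")

* Option 2: fix the sort order explicitly, then let the command compare
* the first two observations that have this ID value
sort household_id submissiondate
iecompdup household_id, id("A1234") more2ok
```

Sorting explicitly before using more2ok makes the choice of pair reproducible, since the result depends on the sort order.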
By default, iecompdup reports the number of variables for which the duplicate pair has identical values and the number for which the pair has different values. Two lists with the names of these variables are returned as macros. Specifying the option didifference also makes the command print the list of variables with different values. The option keepdifference keeps a dataset containing only the variables that differ across the duplicate pair (effectively dropping those that are not of interest). The option keepother(varlist) may be used to retain additional variables that are useful for analyzing the duplicate pair.
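As a hedged illustration of the default output and the didifference option, the sketch below prints the differing variables and then inspects the returned macros. The macro name r(diffvars) is an assumption based on the help file; return list shows the names that your installed version actually returns.

```stata
* Compare the two observations with household_id "A1234" and print
* the names of the variables whose values differ
iecompdup household_id, id("A1234") didifference

* Show everything the command stored in r()
return list

* Assumption: the list of differing variables is stored in r(diffvars)
local diffvars "`r(diffvars)'"
display "Variables that differ: `diffvars'"
```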
Output
The command outputs the names of the variables for which the duplicate pair has identical values and the names of the variables for which the duplicate pair has different values. The section below outlines three cases of duplicates and explains how iecompdup can help identify which case a duplicate pair belongs to. No output from iecompdup can guarantee any of the cases below, but the output will typically be qualitatively conclusive for one of the three cases.
Case 1: Same Observation, Same Data
This case often occurs with CAPI surveys as a consequence of a poor internet connection. If a submission is interrupted, the server still saves the incomplete data; when the server receives a second submission, it saves both, since it cannot know whether the two submissions and the changes made between them were intentional. In iecompdup’s output, this case appears as very few differing variables, and the variables that differ are mostly submission metadata such as submission time or submission ID (called KEY in SurveyCTO). If no media files (such as audio, images, or monitoring files) were used and only the metadata differs, it does not matter which observation you keep. However, it is good practice to keep the one submitted most recently.
In most cases, submission interruptions occur because media files did not upload correctly. Those files themselves do not appear as variables in Stata (only their file names do), and thus only submission metadata variables differ. The file name variable is submitted even when the file is not. When both duplicates have the file name and the same file contents, it does not matter which duplicate you keep. However, it is good practice to keep the one submitted most recently. If only one has the file name, keep that observation.
This case may also occur if a duplicate is created on the server. This is very uncommon, but when it happens, even some submission data would be the same. In this case, either observation can be dropped.
Case 2: Same Observation, Modified Data
This case is rare but possible in most data collection software. It occurs if an observation is modified after the first submission and then re-submitted. Sometimes it is necessary to modify already-submitted data, though in these cases it is best practice to do so in a do-file to ensure proper documentation. In iecompdup’s output, this case shows up as differences in the submission metadata and in some of the observation data. Look into these cases closely and follow up with the enumerators and supervisors responsible for the submission. There is no clear rule on which observation to keep: you have to make that decision yourself. Remember that this case is rare, since most survey software has systems to prevent it.
Case 3: Incorrectly Assigned ID
This case occurs when the same ID is used for two different respondents. This may happen because of typos or because protocols were not followed. In iecompdup’s output, this case shows up as differences in the submission data as well as in a large share of the observation data. Follow up with the enumerators and supervisors responsible for the submission, and assign a new ID to one of the observations based on your findings.
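To judge which of the three cases a duplicate pair most resembles, it can help to view the differing values side by side. The sketch below is one possible approach, not a prescribed method. It assumes that keepdifference leaves the duplicate pair in memory with only the differing variables, and submissiondate and enumerator_id are hypothetical variable names.

```stata
* Protect the full dataset, because keepdifference changes the data in memory
preserve

* Keep only the variables that differ, plus context variables for follow-up
iecompdup household_id, id("A1234") keepdifference keepother(submissiondate enumerator_id)

* Few differences, mostly metadata: likely Case 1
* Metadata plus some survey answers differ: likely Case 2
* Most survey answers differ: likely Case 3
describe
list

* Return to the full dataset
restore
```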
Back to Parent
This article is part of the topic ietoolkit
Additional Resources
- DIME Analytics’ Real Time Data Quality Checks