iecompdup is a command in the Stata package
iefieldkit created by DIME Analytics. The
iecompdup command helps the research team identify why duplicate values exist in ID variables, so that they can be resolved. ID variables are variables that uniquely identify every observation in a dataset, for example, household_id.
- Please refer to Stata coding practices for coding best practices in Stata.
- iecompdup is part of the package iefieldkit, which has been developed by DIME Analytics.
- While ieduplicates identifies duplicates in ID variables, iecompdup provides more information to resolve these issues.
- To install iecompdup, as well as other commands in the iefieldkit package, type ssc install iefieldkit in Stata.
- For instructions and available options, type help iecompdup.
While ieduplicates creates the duplicates correction template, iecompdup compares the duplicate entries variable by variable to understand why the duplicates exist. Although the decision of how to correct a duplicate is always qualitative, iecompdup provides the information necessary to make that decision and to ensure high-quality data before cleaning and analysis. It also lets the research team select the output format that suits their decision process.
- Run ieduplicates on the raw data. If there are no duplicates, you are done. If there are duplicates, the command will output an Excel file containing a duplicates correction template, and a link to this file.
- Use iecompdup for more information. The duplicates correction template includes some information comparing the duplicates, but if that information is not enough, use this command to get more.
- Go back to your duplicates correction template and apply the corrections you identified using this command. (See ieduplicates for more details on how to apply the corrections.)
Sometimes, when many variables differ between observations with duplicate IDs, ieduplicates cannot display all the information in the duplicates correction template. In such cases, or when there are more than two duplicates, you can use iecompdup to explore the differences.
iecompdup id_varname [if], id(id_value) [more2ok didifference keepdifference keepother(varlist)]
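As a minimal sketch of this syntax, the example below builds a small made-up dataset in which the ID value A1234 appears twice, and then compares the two observations. The variable names (household_id, age, village, key) and all values are hypothetical, and the example assumes iefieldkit is already installed.

```stata
* Hypothetical toy data: household A1234 was submitted twice
clear
input str5 household_id byte age str10 village str8 key
"A1234" 34 "Kigali" "uuid:001"
"A1234" 34 "Kigali" "uuid:002"
"B5678" 51 "Huye"   "uuid:003"
end

* Compare the two observations that share the ID value A1234
iecompdup household_id, id(A1234)
```

Here household_id plays the role of id_varname and A1234 the role of id_value.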
The following points provide a detailed explanation of the syntax and usage of iecompdup:
- Basic inputs:
iecompdup uses id_varname and id_value as its basic inputs:
- id_varname: The name of the unique ID variable, which is also used with ieduplicates.
- id_value: This is the value that the ID variable takes in the duplicate observations you want to compare. For example, if the household with the ID value A1234 appears twice, then id_varname is household_id, and id_value is A1234.
- More than one pair of duplicates: If you have more than one pair of duplicates in your dataset, you will need to run this command once for each pair to compare the differences.
- More than two observations with same id_value: If there are more than two observations with a particular ID value, the command will return an error. This is because iecompdup can only compare two duplicates at a time. In this case, use one of the following options:
if allows you to select the pair of observations you want to compare.
more2ok allows iecompdup to pick the first two observations by default, as per the sort order. It will then display a warning message so that the user is aware that the sort order of observations will affect the result.
- Default output: By default, iecompdup displays two lists of variables in the form of returned macros: first, variables for which the duplicate pair has identical values, and second, variables for which the duplicate pair has different values. iecompdup also provides the following options with respect to these lists:
didifference: This option will also make the command print the list of variables with different values.
keepdifference: This option will only keep the variables which have different values across the duplicate pair. This option effectively drops variables which are not of interest.
keepother: This option can be used if you want to retain additional variables that you think are useful for analyzing the duplicate pair.
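The points above can be sketched on a small made-up dataset in which the ID value A1234 appears three times. All variable names and values are hypothetical, including the numeric variable attempt used to select a pair. The returned macro names r(matchvars) and r(diffvars) are assumptions here; confirm them with help iecompdup. The example assumes iefieldkit is installed.

```stata
* Hypothetical toy data: household A1234 appears three times
clear
input str5 household_id byte age str10 village str8 key byte attempt
"A1234" 34 "Kigali" "uuid:001" 1
"A1234" 34 "Kigali" "uuid:002" 2
"A1234" 36 "Kigali" "uuid:004" 3
"B5678" 51 "Huye"   "uuid:003" 1
end

* Three observations share the ID, so use [if] to pick the pair to compare
iecompdup household_id if attempt != 3, id(A1234)

* Alternatively, more2ok compares the first two observations in sort order
sort household_id key
iecompdup household_id, id(A1234) more2ok

* The two lists are returned in r() (macro names assumed, see lead-in)
display "Identical: `r(matchvars)'"
display "Different: `r(diffvars)'"

* Print the differing variables and keep only them, plus village
iecompdup household_id if attempt != 1, id(A1234) ///
    didifference keepdifference keepother(village)
```

Because keepdifference changes the data in memory, it may be safer to run that last call on a copy of the data, or to reload the raw data afterwards.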
The output from iecompdup allows you to explore the differences between observations to determine the best way to correct the duplicate values. Broadly, there are three cases that can explain why duplicate values in ID variables arise when working with SurveyCTO. The cases are given below, along with information on how iecompdup can help you identify which of them applies to a particular pair of duplicates. Some details can change if you use different software, but the general idea should remain the same. Although iecompdup cannot confirm which case applies, its output will help you identify the most likely source of the problem.
Case 1: Same observation, same data values
Case 1 errors can occur when the same observation is submitted twice, with the same data values. This often happens during CAPI or CAFE surveys because of poor internet connection. If submission of data to the server is interrupted before you can finish completing all fields, the incomplete data may still be saved. This is because SurveyCTO servers never delete any data. When you re-submit the data the second time, the server saves that too. However, it cannot identify which submission was intentional, and which one was accidental.
For a Case 1 error, the output of
iecompdup will display two observations with very few differences. These differences will mostly be in the form of submission time or submission ID (which SurveyCTO lists as the "KEY" variable). Information of this form is called metadata. Sometimes the only difference between the two observations is in terms of the metadata, and the data does not include any media files (audio, images, monitoring). In such cases it does not matter which observation you keep. However, it is a good practice to keep the most recent submission.
In most cases, however, submission gets interrupted because the data contained media files which did not upload correctly. Depending on the data collection software, those files do not always appear as variables when the dataset is imported into Stata. Even in such cases, only the metadata variables will appear to be different, so you must carefully check the media files, which lie outside the imported dataset, for the duplicate observations.
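One way to check for a Case 1 pattern is to remove the known metadata variables from the list of differing variables and see whether anything is left. The sketch below assumes the differing variables are returned in r(diffvars) (confirm with help iecompdup); the toy data and the metadata variable names key and submissiondate are hypothetical and should be adapted to your survey.

```stata
* Hypothetical toy data: the same household submitted twice
clear
input str5 household_id byte age str8 key str10 submissiondate
"A1234" 34 "uuid:001" "2024-03-01"
"A1234" 34 "uuid:002" "2024-03-02"
end

iecompdup household_id, id(A1234)

* Differing variables, minus the ones known to be metadata
local diffvars `r(diffvars)'
local metavars key submissiondate
local nonmeta : list diffvars - metavars

if "`nonmeta'" == "" {
    display "Only metadata differs: consistent with Case 1"
}
else {
    display "Survey variables differ: `nonmeta'"
}
```

An empty remainder only suggests Case 1; media files stored outside the dataset still need to be checked by hand.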
Case 2: Same observation, different data values
Case 2 errors are possible but rare in most data collection software, because most software does not allow more than one complete observation with the same ID. However, Case 2 errors may still occur if someone modifies an observation after the first submission, and then re-submits it. If you think it is necessary to modify data that has already been submitted, it is better to make these modifications in a do-file as part of data cleaning. This will also allow the research team to document the manual changes that are made, for example, during revisions in survey software.
For a Case 2 error, the output of iecompdup will display observations with different submission metadata, as well as a few different observation values (like age or name). In such cases, you will need to follow up with the enumerators and supervisors who submitted the data. There is no clear rule on which observation to keep, so the research team will have to decide this on a case-by-case basis.
Case 3: Incorrectly assigned ID
Case 3 errors can occur because of typographical errors, for example if the ID was typed incorrectly during data collection, or if the field team did not follow proper protocols during data collection.
For a Case 3 error, the output of
iecompdup will display observations with different submission metadata, as well as many different survey responses. In this case too, you will need to follow up with enumerators and supervisors who were responsible for this submission. You will need to assign a new ID to one of the observations based on what you learn after following up with the field team.