Difference between revisions of "Iecompdup"

Revision as of 00:40, 8 May 2020

iecompdup is the third command in iefieldkit, the Stata package created by DIME Analytics. The iecompdup command helps the research team identify why duplicate values exist in ID variables, so that they can be resolved. ID variables are variables that uniquely identify every observation in a dataset, for example, household_id.

Read First

  • Stata coding practices.
  • iefieldkit.
  • While ieduplicates identifies duplicates in ID variables, iecompdup provides more information to resolve these issues.
  • To install iecompdup, type ssc install iecompdup in Stata.
  • To install all the commands in the iefieldkit package, type ssc install iefieldkit in Stata.
  • For instructions and available options, type help iecompdup.

Overview

Once ieduplicates creates the duplicates correction template, iecompdup compares the duplicate entries variable-by-variable to understand why the duplicates exist. While the decision of how to correct a duplicate is always qualitative, iecompdup provides the information necessary to make that decision and to ensure high-quality data before cleaning and data analysis. It also allows the research team to select the output format based on their decision process.

Follow these steps when using the ieduplicates and iecompdup commands on incoming primary data:

  1. Run ieduplicates on the raw data. If there are no duplicates, you are done. If there are duplicates, the command will output an Excel file containing a duplicates correction template, and a link to this file. It will also stop the code from moving forward, and show a message listing the duplicate values in the ID variables. You can prevent the command from stopping your code by using the force option. This will remove all observations with duplicate ID values and allow the code to continue.
  2. Open the duplicates correction template. This template will list duplicate entries of the ID variable, along with information about each observation and 5 blank columns - correct, drop, newid, initials, and notes. Use these columns to make corrections, and include comments to document the corrections.
  3. If the information in the duplicates correction template is not enough to solve a case, use iecompdup for the listed ID values to obtain more information.
  4. After entering all the corrections to the template, save the Excel file in the same location with the same name. Overwrite the previous file.
  5. Run ieduplicates on the raw data again. This will apply the corrections you made in the previous steps. Now if you use the force option, it will only remove those duplicates that you did not resolve.
  6. Save the resulting dataset under a different name. Do not overwrite the original raw data.
  7. Repeat these steps with each new round of data.

Syntax

Implementation

  1. Run ieduplicates on the raw data. If there are no duplicates, then you are done and can skip the rest of this list.
  2. If there are duplicates, use iecompdup on any duplicates identified.
  3. Enter the corrections identified with iecompdup into the duplicates report output by ieduplicates.
  4. After entering the corrections, save the report in the same location with the same name.
  5. Run ieduplicates again. The corrections you have entered are now applied, and only duplicates that are still unresolved are removed this time.

Repeat these steps with each new round of data: DIME Analytics recommends repeating these steps each day that a research team has new data. In doing so, make sure to not overwrite the original raw data with the dataset from which ieduplicates has removed duplicates, as this would result in lost data. Instead, save the dataset with removed duplicates under a different name.
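The daily cycle described above could be sketched roughly as follows. This is only an illustration: the file paths, HHID, and KEY are hypothetical, and the `using` and `uniquevars()` arguments reflect the syntax of recent iefieldkit releases as an assumption, so confirm them with help ieduplicates for the version you have installed.

```stata
* Hypothetical paths and variable names; adjust to your project
use "raw/survey_raw.dta", clear

* Assumed syntax for recent iefieldkit versions (check: help ieduplicates).
* force drops any duplicates that are still unresolved so the code can continue.
ieduplicates HHID using "documentation/duplicates_report.xlsx", uniquevars(KEY) force

* Save under a different name so the raw data is never overwritten
save "intermediate/survey_nodups.dta", replace
```

Because the raw file is only ever read, rerunning this block after updating the report reapplies all corrections from scratch.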

Specifications

iecompdup requires a single ID variable and the duplicate ID value. See the below example for reference:

iecompdup HHID [if] , id(123456)

idvar

iecompdup only allows a single ID variable. In the above example, this is HHID. The ID variable used here is the same ID variable used in ieduplicates. If you currently have two or more variables that identify the observation in the dataset, DIME Analytics suggests creating a single ID variable. This variable could be either string or numeric.

id

iecompdup requires the ID value for the duplicate pair or group. In the above example, this is 123456. Note that the command can only be run on two duplicates at a time. When there are more than two duplicates for a given ID, the command issues a warning. If you have several pairs or groups of duplicates, you will have to run this command once for each pair or group. To do that, use an if expression to select the observations to be compared.
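As a minimal sketch of the syntax above, the calls below first compare a duplicate pair and then use an if expression to pick two observations out of a larger group. HHID and 123456 come from the example above; KEY and its values are hypothetical placeholders for a variable that differs across submissions. The sketch assumes iefieldkit is installed and that a dataset containing these variables is in memory.

```stata
* Compare the two observations that share the ID value 123456
iecompdup HHID, id(123456)

* If three or more observations share the ID, restrict the comparison
* to two of them. KEY and the uuid values below are hypothetical.
iecompdup HHID if inlist(KEY, "uuid:aaa", "uuid:bbb"), id(123456)
```

For a larger group, repeating the second call with a different pair of KEY values each time lets you compare the observations two by two.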

Output

The command outputs the variable names for which the duplicate pair has identical values and the variable names for which the duplicate pair has different values. The section below outlines three cases of duplicates and explains how iecompdup can help to identify to which case the duplicate pair pertains. No output from iecompdup can guarantee any of the cases below, but typically the output will be qualitatively conclusive for one of the three cases.
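To make the idea of a variable-by-variable comparison concrete, here is a rough sketch in plain Stata of the kind of check described above. It is not iecompdup's actual implementation, and the toy dataset and its variable names (KEY, age, submitted) are invented for illustration.

```stata
* Toy data: two submissions share HHID 123456 (all values are made up)
clear
input str6 HHID str8 KEY age str10 submitted
"123456" "uuid:a" 34 "2020-05-01"
"123456" "uuid:b" 34 "2020-05-02"
"654321" "uuid:c" 51 "2020-05-01"
end

* Keep only the duplicate pair and confirm there are exactly two observations
keep if HHID == "123456"
assert _N == 2

* Sort each variable into "same" or "diff" by comparing the two rows
local same
local diff
foreach v of varlist _all {
    if `v'[1] == `v'[2] {
        local same `same' `v'
    }
    else {
        local diff `diff' `v'
    }
}

display "Identical values: `same'"
display "Different values: `diff'"
```

In this toy pair, HHID and age match while KEY and submitted differ, which is the pattern described under Case 1 below: only submission metadata differs.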

Case 1: Same Observation, Same Data

This case often occurs with CAPI surveys as a consequence of poor internet connection. If a submission is interrupted, the server still saves the incomplete data; when the server receives a second submission, it saves both submissions since it does not know whether the two submissions and the changes made between them were intentional. In iecompdup’s output, this case would appear as very few different variables; the variables that differ would mostly be submission metadata such as submission time or submission ID (called KEY in SurveyCTO). If no media files (i.e. audio, images, monitoring) were used and only the metadata differs, it does not matter which observation you keep. However, it is good practice to keep the one submitted most recently.

In most cases, submission interruptions occur because media files did not upload correctly. Those files themselves do not come up as variables in Stata (only their file names do), and thus only submission metadata variables differ. The file name variable is submitted even when the file is not. When both duplicates have a file name and the same file contents, it does not matter which duplicate you keep. However, it is good practice to keep the one submitted most recently. If only one has the file name, keep that observation.

This case may also occur if a duplicate is created on the server. This is very uncommon, but when it happens, even some of the submission metadata would be the same. In this case, either observation can be dropped.

Case 2: Same Observation, Modified Data

This case is rare but possible in most data collection software. It occurs if an observation is modified after the first submission and then re-submitted. Sometimes it is necessary to modify already-submitted data, though in these cases, it is best practice to do so in a do-file to ensure proper documentation. In iecompdup’s output, this case would show up as the submission metadata differing and some observation data differing. Look into these cases closely and follow up with the enumerators and supervisors responsible for this submission. There is no clear rule on which observation to keep: you have to make that decision yourself. Remember that this case is rare since most survey software has systems to prevent this.

Case 3: Incorrectly Assigned ID

This case occurs when the same ID is used for two different respondents. This may happen due to typos or to protocols not being followed. In iecompdup’s output, this case would show up as submission metadata differing as well as a lot of observation data differing. Follow up with the enumerators and supervisors responsible for this submission and assign a new ID to one of the observations based on your findings.

Back to Parent

This article is part of the topic ietoolkit

Additional Resources