Difference between revisions of "Ieduplicates"

Jump to: navigation, search
Line 8: Line 8:
 
* For instructions and available options, type '''<code>help ieduplicates</code>'''.
 
* For instructions and available options, type '''<code>help ieduplicates</code>'''.
  
==Overview==
+
== Overview ==
<code>ieduplicates</code> is a Stata command that identifies duplicates in ID variables and exports them to an Excel file that research teams can use to correct the duplicates. The command should be run directly after importing raw data from, for example, a server used in survey data collection. <code>ieduplicate</code> outputs a report of all duplicates and removes the duplicates from the dataset until they are resolved. It does so to ensure that other quality checks requiring unique IDs do not use erroneous data. For example, if household_id=123456 was selected for back checks, but the dataset has two observations with household_id=123456, then it is best to resolve that duplicate before running the backcheck test on either observation.  
+
The '''<code>ieduplicates</code>''' and <code>[[Iecompdup|iecompdup]]</code> commands were designed as part of a workflow
 +
to process duplicate observations in primary data in a reproducible and transparent
 +
manner. In such workflow, these commands are used to identify and resolve duplicated
 +
occurrences of an ID value in raw survey data, ensuring that observation is uniquely and
 +
fully identified. The commands combine four key tasks involved in solving duplicated
 +
ID values: (i) identifying duplicated entries; (ii) comparing observations with the same
 +
ID value; (iii) tracking and documenting any changes made to the identifying variable;
 +
(iv) applying the necessary corrections to the data.
 +
On the first run of ieduplicates, a duplicate correction template is created listing
 +
all observations containing duplicated values of an ID variable that is intended to be
 +
unique. Observations are required to have a “key” variable that is unique in the raw
 +
data by construction so that they can be identified in processing.7 After creating this
 +
correction template, ieduplicates will, by default, display a message pointing out that
 +
the intended ID variable does not uniquely and fully identify the data and stop your
 +
code, so you know to fill the correction template.
 +
 
 +
 
 +
 
 +
<code>ieduplicates</code> is a Stata command that identifies duplicates in ID variables and exports them to an Excel file that research teams can use to correct the duplicates. The command should be run directly after importing raw data from, for example, a server used in survey data collection. <code>ieduplicate</code> outputs a report of all duplicates and removes the duplicates from the dataset until they are resolved. It does so to ensure that other quality checks requiring unique IDs do not use erroneous data. For example, if household_id=123456 was selected for back checks, but the dataset has two observations with household_id=123456, then it is best to resolve that duplicate before running the backcheck test on either observation.
  
 
==Implementation==
 
==Implementation==

Revision as of 17:21, 6 May 2020

ieduplicates is the second command in the Stata package created by DIME Analytics, iefieldkit. ieduplicates identifies duplicates in ID variables that uniquely identify every observation in a dataset. It then exports them to an Excel file that the research team can use to resolve these duplicates. The research team should run ieduplicates with each new batch of incoming data to ensure high quality data before cleaning and analysis.

Read First

  • Stata coding practices.
  • iefieldkit.
  • ieduplicates identifies duplicates in ID variables, and then iecompdup resolves these issues.
  • To install ieduplicates, type ssc install ieduplicates in Stata.
  • To install all the commands in the iefieldkit package, type ssc install iefieldkit in Stata.
  • For instructions and available options, type help ieduplicates.

Overview

The ieduplicates and iecompdup commands were designed as part of a workflow to process duplicate observations in primary data in a reproducible and transparent manner. In such workflow, these commands are used to identify and resolve duplicated occurrences of an ID value in raw survey data, ensuring that observation is uniquely and fully identified. The commands combine four key tasks involved in solving duplicated ID values: (i) identifying duplicated entries; (ii) comparing observations with the same ID value; (iii) tracking and documenting any changes made to the identifying variable; (iv) applying the necessary corrections to the data. On the first run of ieduplicates, a duplicate correction template is created listing all observations containing duplicated values of an ID variable that is intended to be unique. Observations are required to have a “key” variable that is unique in the raw data by construction so that they can be identified in processing.7 After creating this correction template, ieduplicates will, by default, display a message pointing out that the intended ID variable does not uniquely and fully identify the data and stop your code, so you know to fill the correction template.


ieduplicates is a Stata command that identifies duplicates in ID variables and exports them to an Excel file that research teams can use to correct the duplicates. The command should be run directly after importing raw data from, for example, a server used in survey data collection. ieduplicate outputs a report of all duplicates and removes the duplicates from the dataset until they are resolved. It does so to ensure that other quality checks requiring unique IDs do not use erroneous data. For example, if household_id=123456 was selected for back checks, but the dataset has two observations with household_id=123456, then it is best to resolve that duplicate before running the backcheck test on either observation.

Implementation

  1. Run ieduplicates on the raw data. If there are no duplicates, then you are done and can skip the rest of this list.
  2. If there are duplicates, use iecompdup on any duplicates identified.
  3. Enter the corrections identified with iecompdup to the duplicates in the report outputted by ieduplicates.
  4. After entering the corrections, save the report in the same location with the same name.
  5. Run ieduplicates again. The corrections you have entered is now applied and only duplicates that are still not resolved are removed this time.

Repeat these steps with each new round of data: DIME Analytics recommends repeating these steps each day that a research team has new data. In doing so, make sure to not overwrite the original raw data with the dataset from which ieduplicates has removed duplicates, as this would result in lost data. Instead, save the dataset with removed duplicates under a different name.

Specifications

ieduplicates requires that you specify the ID variable, a file path to the file where the report will be saved, and a unique variable. See the below example for reference:

ieduplicates HHID using ''C:\myIE\Documentation\DupReport.xlsx'',  uniquevars(KEY)

idvar

ieduplicates only allows a single ID variable. In the above example, this is HHID. If you currently have two or more variables that identify the observation in the dataset, DIME Analytics suggests creating a single ID variable. This variable could be either string or numeric.

using

ieduplicates stores the report in the file specified after using. In the above example, this is "C:\myIE\Documentation\DupReport.xlsx". The report is outputted as an Excel sheet so that even team members who do not know Stata can read and correct it. The command also creates a folder called Daily in the same folder as the Excel file. In the Daily folder, ieduplicates saves a back-up report each day in case someone accidentally deletes the main report or any of its contents. To restore a report, simply copy it out of the Daily folder and remove the date from the name. If two different reports are generated the same day, with different outputs, the second report will include a timestamp in the name.

uniquevars

ieduplicates uses the unique variable specified within uniquevars() to apply corrections and assign the correct variable to the correct observation. In the above example, this is KEY. While the unique identifier can consist of multiple variables, most data collection tools assign a unique ID to each observation on their server. In SurveyCTO survey data, for example, this variable is called KEY.

Using the Report

The outputted report provides an excellent format in which research teams can resolve duplicate problems. The report has a correct, drop and newID column. If you want to keep one duplicate and drop another one because they are double recordings of the same observation, then write yes in the correct column for the observation you want to keep, and yes in the drop column for the one you want to drop. If you want to keep one duplicate and assign a new ID to another duplicate, then write yes in the correct column for the observation you want to keep, and a new ID value in the newID column for the observation to which you want to assign a new ID. You can also combine these two methods if you have many duplicates with the same ID.

Always indicate which observation to keep. After entering your corrections, save the file and run ieduplicates again.

Back to Parent

This article is part of the topic ietoolkit

Additional Resources