Difference between revisions of "Ieduplicates"

Jump to: navigation, search
Line 1: Line 1:
<onlyinclude>
+
<code>ieduplicates</code> is a Stata command that identifies [[Duplicates and Survey Logs | duplicates]] in [[ID Variable Properties | ID variables]] and exports them to an Excel file that research teams can use to correct the duplicates. Research teams should run <code>ieduplicate</code> with each new batch of incoming data to ensure high quality data before [[Data Cleaning | cleaning]] and [[Data Analysis | analysis]]. This page describes how to use the command, how to implement it with <code>[[iecompdup]]</code>, and how to effectively use the outputted report.
'''ieduplicates''' and the sister command '''iecompdup''' are used to identify and resolve duplicates in raw survey data.
 
</onlyinclude>
 
This article is meant to describe use cases, work flow and the reasoning used when developing the commands. For instructions on how to use the command specifically in Stata and for a complete list of the options available, see the help files by typing <code>help ieduplicates</code> or <code>help iecompdup</code> in Stata. These commands are part of the package [[Stata_Coding_Practices#ietoolkit|ietoolkit]], to install all the commands in this package including these commands, type <code>ssc install ietoolkit</code> in Stata.
 
  
== Intended use cases ==
+
==Read First==
'''ieduplicates''' is meant to be used directly after importing raw data from, for example, a server used in survey data collection. The command does two high level things. It outputs a report of all the duplicates (the report can be used for correcting the duplicates) and it removes the duplicates from the data set until they are resolved.  
+
*While <code>ieduplicates</code> identifies duplicates in ID variables, <code>[[iecompdup]]</code> resolves duplicate issues.
 +
*For detailed instructions on how to implement the command and its options in Stata, type <code>help ieduplicates</code> in Stata.
 +
*This command is part of the package <code>[[Stata Coding Practices#ietoolkit | ietoolkit]]</code>. To install all commands in this package, including <code>ieduplicates</code>, type <code>ssc install ietoolkit</code> in Stata.
  
The reason the duplicates are removed is to make sure that many other quality checks require unique IDs. For example, if a household with ID 123456 was selected for back checks but you incorrectly have two observations that were given the ID 123456, then it is better to solve that duplicate first (you can use the report for this) before trying to run the back check test on either of the observations. It is important that you make sure to not overwrite the original raw data with the data set where ieduplicates has removed the duplicates as you would lose that data. To avoid this, save the dataset with removed duplicate with a different name.
+
==Overview==
 +
<code>ieduplicates</code> is a Stata command that identifies duplicates in ID variables and exports them to an Excel file that research teams can use to correct the duplicates. The command should be run directly after importing raw data from, for example, a server used in survey data collection. <code>ieduplicate</code> outputs a report of all duplicates and removes the duplicates from the dataset until they are resolved. It does so to ensure that other quality checks requiring unique IDs do not use erroneous data. For example, if household_id=123456 was selected for back checks, but the dataset has two observations with household_id=123456, then it is best to resolve that duplicate before running the backcheck test on either observation.  
  
'''iecompdup''' helps you to identify the reason for the duplicates. The decision on how to correct a duplicate is always a qualitative decision, but iecompdup compares the duplicated quantitatively and in almost all cases gives you the information that you need in order to make the qualitative decision. See below for instructions on how to interpret the output of iecompdup.
+
==Implementation==
  
=== Intended work flow ===
+
# Run <code>ieduplicates</code> on the raw data. If there are no duplicates, then you are done and can skip the rest of this list.
 +
# If there are duplicates, use <code>[[iecompdup]]</code> on any duplicates identified.
 +
# Enter the corrections identified with <code>iecompdup</code> to the duplicates in the report outputted by <code>ieduplicates</code>.
 +
# After entering the corrections, save the report in the same location with the same name.
 +
# Run <code>ieduplicates</code> again. The corrections you have entered is now applied and only duplicates that are still not resolved are removed this time.
  
# Run '''ieduplicates''' on the raw data
+
Repeat these steps with each new round of data: DIME Analytics recommends repeating these steps each day that a research team has new data. In doing so, make sure to not overwrite the original raw data with the dataset from which <code>ieduplicates</code> has removed duplicates, as this would result in lost data. Instead, save the dataset with removed duplicates under a different name.
#* If there are no duplicates, then you are done and can skip the rest of this list.  
 
# If there are duplicates, use '''iecompdup''' on any duplicates identified.
 
# Enter the corrections identified with '''iecompdup''' to the duplicates in the report outputted by '''ieduplicates'''
 
# After entering the corrections, save the report in the same location with the same name,
 
# Run '''ieduplicates''' again. The corrections you have entered is now applied and only duplicates that are still not resolved are removed this time.
 
  
Repeat these steps every time you get new data. Our recommendation is that this is done every day that you have new data.
+
== Specifications ==
  
== Instructions ==
+
<code>ieduplicates</code> requires that you specify the ID variable, a folder to which to output the report, and a unique variable. See the below example for reference:
These instructions are meant to help you understand how to use the command. For technical instructions on how to implement the command in Stata see the help files by typing <code>help ieduplicates</code> or <code>help iecompdup</code> in Stata.
 
 
 
=== ieduplicates ===
 
ieduplicates requires that you list the ID variable (''idvar'') for which to look for duplicates ("HHID" in the example below), a folder for the report ("C:\myImpactEvaluation\duplicateReports" in the example below), and a variable that is always unique ("KEY" in the example below). Each of these are explained below.
 
  
 
<pre>ieduplicates HHID, folder(C:\myImpactEvaluation\duplicateReports) uniquevars(KEY)</pre>
 
<pre>ieduplicates HHID, folder(C:\myImpactEvaluation\duplicateReports) uniquevars(KEY)</pre>
  
'''''idvar'''''. ieduplicates only allow a single ID variable. See the discussion in the section below. We recommend that you create a single ID variable if you currently have two or more variables that identifies the observations in your data set. This variable can be both string or numeric.
+
===idvar===
 
+
<code>ieduplicates</code> only allows a single ID variable. In the above example, this is ''HHID''. If you currently have two or more variables that identify the observation in the dataset, DIME Analytics suggests creating a single ID variable. This variable could be either string or numeric.  
'''folder()'''. The folder is used to store the report. The report is outputted as an Excel sheet so that even team member that does not know Stata can read it and fill in corrections. The command also creates a folder called ''Daily'' where it each day saves a back up report in case someone would accidentally delete the main report or any of its content. To restore a back up version, simply copy it out of the ''Daily'' folder and remove the date from the name. If two different reports are generated the same day, then the first one will be overwritten by the first.
 
 
 
In the report you can make corrections to the duplicates. If you do this, then this report becomes an excellent documentation on how you resolve your duplicates. You have three types of resolutions. ''correct'', ''drop'' and ''newID''. If you want to keep one duplicate and drop another one as they are double recordings of the same observation, then write ''yes'' in the ''correct'' column for the observation you want to keep, and ''yes'' in the ''drop'' column for the one you want to drop. If you want to keep one duplicate and assign a new ID of another duplicate then you write ''yes'' in the ''correct'' column for the observation you want to keep, and a new ID value in the ''newID'' column for the observation that you want to assign a new ID. You can also combine these two methods if you have many duplicates with the same ID. Note that you must always indicate which observation to keep. After you have entered your corrections, save the file and run ieduplicates again.
 
 
 
'''uniquevars()'''. The unique variable is needed to apply the corrections. The command needs this variable to apply the correct variable to the correct observation, as the ID variable is by definition not unique and can therefore not be used. The unique identifier can consist of multiple variables but most data collection tools assign a unique ID to each observation on their server that is usually downloaded with the raw data set. If you use [[Computer-Assisted_Personal_Interviews_(CAPI)#CAPI_Software|SurveyCTO]] then this variable is called ''KEY''.
 
 
 
=== iecompdup ===
 
 
 
ieduplicates only identifies duplicates but give you no help in how to resolve them. That is what iecompdup does. iecompdup requires a single ID variable ("HHID" in the example below) and the ID value that is duplicated ("123456" in the example below).
 
 
 
<pre>iecompdup HHID, id(123456)</pre>
 
 
 
'''''idvar'''''. Note that the ID variable used here is the same ID variable that was used in the example for ieduplicates. This is how these commands are intended to work together.
 
 
 
'''id()'''. The duplicated value 123456 is for one pair or group of duplicates. If you have several pairs or groups of duplicates you will have to run this command once for each pair or group. iecompdup can only be run on two duplicates at the time and it picks the two first observations in the sort order. You need to change the sort order if you want to change which two duplicates are compared. iecompdup gives you a warning if this is the case, you suppress this warning by using the option ''more2ok''.
 
 
 
The output for iecompdup is information on the variables where the duplicate pair has identical values and where the duplicate pair has different values. The section below outlines three types of duplicates that we have identified as reasons for duplicates when working with [[Computer-Assisted_Personal_Interviews_(CAPI)#CAPI_Software|SurveyCTO]], and how iecompdup can be used to identify which of these cases applies to the duplicate pair. The general picture should be the same even if you are using a different software, but some details might be different. No output from iecompdup can guarantee any of the cases below, but most of the times the output will still be qualitatively conclusive for one of the three cases.
 
 
 
'''Case 1. -  Double submissions of same observation and the same data.'''  This is often a consequence of poor internet connection which is common where we collect our data. If a submission of data is interrupted before it is completed, then the server still saves that incomplete data as the server never deletes any data. When the server receives a second submission it saves both submissions, as it impossible for the server to know if two submissions  or changes made between them were intentional. In iecompdup this case would result in very few variables being different and the variables that differ are mostly submission meta data such as submission time or submission ID (called KEY in SurvyeCTO). If no media files (audio, images, monitoring) were used and only meta data that differs, it does not matter which observation that is kept, but it is good practice to keep the one submitted most recently.
 
 
 
When a submission is interrupted it is usually media (audio, images, monitor) files that were not uploaded correctly. Those files does not come up as variables in Stata, only the name of the file, so only submission meta data variables differ. The file name variable and sometimes that name value is submitted even when the file is not. When both duplicates has file name it does not matter which duplicate you keep (it is good practice to keep the one submitted most recently) but if only one has the file name you should obviously keep that observation.
 
  
A sub-case of this case is if a duplicate is created on the server. This is very uncommon but in these cases even some submission data would be the same. In this case there is no difference which observation is dropped.
+
===folder===
 +
<code>ieduplicates</code> stores the report in the folder specified within <code>folder()</code>. In the above example, this is ''C:\myImpactEvaluation\duplicateReports''. The report is outputted as an Excel sheet so that even team members who do not know Stata can read and correct it. The command also creates a folder called ''Daily,'' in which it saves a back-up report each day in case someone accidentally deletes the main report or any of its contents. To restore a report, simply copy it out of the ''Daily'' folder and remove the date from the name. If two different reports are generated the same day, then the first one will be overwritten by the second.
  
'''Case 2. - Double submissions of same observation but with modified data.''' This is possible but rare in most data collection software as it is bad practice and therefore made really difficult to do by the survey software developer (if at all possible). This happens if an observation if modified after the first submission and then submitted once again. Sometimes there is a need for modifying data already submitted but then it is much better practice to do so in a do-file that is included when the data set is cleaned. This way the manual modifications are properly documented. In iecompdup this would show up as the submission meta data differs and some observation data also differs. Look into these cases closely and follow up with the enumerators and supervisors responsible for this submission. There is no clear rule to which observation to keep, you have to make that decision yourself. But remember that this case is rare as most survey software have systems to prevent this.
+
===uniquevars===
 +
<code>ieduplicates</code> uses the unique variable specified within <code>uniquevars()</code> to apply corrections and assign the correct variable to the correct observation. In the above example, this is ''KEY''. While the unique identifier can consist of multiple variables, most data collection tools assign a unique ID to each observation on their server. In [[Computer-Assisted_Personal_Interviews_(CAPI)#CAPI_Software|SurveyCTO]] survey data, for example, this variable is called ''KEY''.
  
'''Case 3 - Incorrectly assigned ID'''. It happens that the same ID is used for two different respondents. This can be due to typos or to protocols not being followed. In iecompdup this would show up as submission data differ as well as a lot of observation data also differs. Follow up with enumerators and supervisors responsible for this submission and assign a new ID to one of the observations based on what you learn when investigating this case.
+
==Using the Report==
  
== Reasoning used during development ==
+
The outputted report provides an excellent format in which research teams can resolve duplicate problems. The report has a ''correct'', ''drop'' and ''newID'' column. If you want to keep one duplicate and drop another one because they are double recordings of the same observation, then write ''yes'' in the ''correct'' column for the observation you want to keep, and ''yes'' in the ''drop'' column for the one you want to drop. If you want to keep one duplicate and assign a new ID to another duplicate, then write ''yes'' in the ''correct'' column for the observation you want to keep, and a new ID value in the ''newID'' column for the observation to which you want to assign a new ID. You can also combine these two methods if you have many duplicates with the same ID.
  
'''Single ID variable.''' At DIME we have an ongoing discussion on if the best practice is to never allow multiple ID variables. One exception where we agree that multiple ID variables are needed is longitudinal data (panel data and time series) where a time variable is needed as well to identify each observation. In this case we are discussing if the ID in addition to the time variable or variables needed, should be allowed to consist of more than one variable. Stata's commands for defining longitudinal data sets <code>xtset</code> and <code>tsset</code> only allows one variable in addition to one time variable to uniquely define all observations in a data set. So it seems as if Stata data sets are intended to only have one ID variable, but it is still the case that a lot of Stata users use multiple ID variables. Some of Stata's commands that can be used to investigate if variables are [[ID_Variable_Properties|uniquely and fully identifying]] allows multiple variables, for example <code>isid</code> but that command is also used to investigate data sets without proper ID variables.
+
Always indicate which observation to keep. After entering your corrections, save the file and run <code>ieduplicates</code> again.
  
 
== Back to Parent ==
 
== Back to Parent ==

Revision as of 21:23, 4 June 2019

ieduplicates is a Stata command that identifies duplicates in ID variables and exports them to an Excel file that research teams can use to correct the duplicates. Research teams should run ieduplicate with each new batch of incoming data to ensure high quality data before cleaning and analysis. This page describes how to use the command, how to implement it with iecompdup, and how to effectively use the outputted report.

Read First

  • While ieduplicates identifies duplicates in ID variables, iecompdup resolves duplicate issues.
  • For detailed instructions on how to implement the command and its options in Stata, type help ieduplicates in Stata.
  • This command is part of the package ietoolkit. To install all commands in this package, including ieduplicates, type ssc install ietoolkit in Stata.

Overview

ieduplicates is a Stata command that identifies duplicates in ID variables and exports them to an Excel file that research teams can use to correct the duplicates. The command should be run directly after importing raw data from, for example, a server used in survey data collection. ieduplicate outputs a report of all duplicates and removes the duplicates from the dataset until they are resolved. It does so to ensure that other quality checks requiring unique IDs do not use erroneous data. For example, if household_id=123456 was selected for back checks, but the dataset has two observations with household_id=123456, then it is best to resolve that duplicate before running the backcheck test on either observation.

Implementation

  1. Run ieduplicates on the raw data. If there are no duplicates, then you are done and can skip the rest of this list.
  2. If there are duplicates, use iecompdup on any duplicates identified.
  3. Enter the corrections identified with iecompdup to the duplicates in the report outputted by ieduplicates.
  4. After entering the corrections, save the report in the same location with the same name.
  5. Run ieduplicates again. The corrections you have entered is now applied and only duplicates that are still not resolved are removed this time.

Repeat these steps with each new round of data: DIME Analytics recommends repeating these steps each day that a research team has new data. In doing so, make sure to not overwrite the original raw data with the dataset from which ieduplicates has removed duplicates, as this would result in lost data. Instead, save the dataset with removed duplicates under a different name.

Specifications

ieduplicates requires that you specify the ID variable, a folder to which to output the report, and a unique variable. See the below example for reference:

ieduplicates HHID, folder(C:\myImpactEvaluation\duplicateReports) uniquevars(KEY)

idvar

ieduplicates only allows a single ID variable. In the above example, this is HHID. If you currently have two or more variables that identify the observation in the dataset, DIME Analytics suggests creating a single ID variable. This variable could be either string or numeric.

folder

ieduplicates stores the report in the folder specified within folder(). In the above example, this is C:\myImpactEvaluation\duplicateReports. The report is outputted as an Excel sheet so that even team members who do not know Stata can read and correct it. The command also creates a folder called Daily, in which it saves a back-up report each day in case someone accidentally deletes the main report or any of its contents. To restore a report, simply copy it out of the Daily folder and remove the date from the name. If two different reports are generated the same day, then the first one will be overwritten by the second.

uniquevars

ieduplicates uses the unique variable specified within uniquevars() to apply corrections and assign the correct variable to the correct observation. In the above example, this is KEY. While the unique identifier can consist of multiple variables, most data collection tools assign a unique ID to each observation on their server. In SurveyCTO survey data, for example, this variable is called KEY.

Using the Report

The outputted report provides an excellent format in which research teams can resolve duplicate problems. The report has a correct, drop and newID column. If you want to keep one duplicate and drop another one because they are double recordings of the same observation, then write yes in the correct column for the observation you want to keep, and yes in the drop column for the one you want to drop. If you want to keep one duplicate and assign a new ID to another duplicate, then write yes in the correct column for the observation you want to keep, and a new ID value in the newID column for the observation to which you want to assign a new ID. You can also combine these two methods if you have many duplicates with the same ID.

Always indicate which observation to keep. After entering your corrections, save the file and run ieduplicates again.

Back to Parent

This article is part of the topic ietoolkit

Additional Resources