Difference between revisions of "Ieduplicates"

Jump to: navigation, search
 
(97 intermediate revisions by 4 users not shown)
Line 1: Line 1:
<onlyinclude>
+
<code>ieduplicates</code> is the second command in the Stata package created by [https://www.worldbank.org/en/research/dime/data-and-analytics DIME Analytics], <code>[[iefieldkit]]</code>. <code>ieduplicates</code> identifies [[Duplicates and Survey Logs | duplicate values]] in [[ID Variable Properties|ID variables]]. '''ID variables''' are variables that uniquely identify every [[Unit of Observation|observation]] in a dataset, for example, '''household_id'''. It then exports them to an Excel file that the [[Impact Evaluation Team|research team]] can use to resolve these duplicates. The '''research team''' should run <code>ieduplicates</code> with each new batch of incoming data to ensure [[Monitoring Data Quality|high quality data]] before [[Data Cleaning | cleaning]] and [[Data Analysis | analysis]].  
'''ieduplicates''' and the sister command '''iecompdup''' are used to identify and resolve duplicates in raw survey data.
+
==Read First==
</onlyinclude>
+
* Please refer to [[Stata Coding Practices|Stata coding practices]] for coding best practices in Stata.
This article is meant to describe use cases, work flow and the reasoning used when developing the commands. For instructions on how to use the command specifically in Stata and for a complete list of the options available, see the help files by typing <code>help ieduplicates</code> or <code>help iecompdup</code> in Stata. These commands are part of the package [[Stata_Coding_Practices#ietoolkit|ietoolkit]], to install all the commands in this package including these commands, type <code>ssc install ietoolkit</code> in Stata.
+
* <code>ieduplicates</code> is part of the package <code>[[iefieldkit]]</code>, which has been developed by [https://www.worldbank.org/en/research/dime/data-and-analytics DIME Analytics].
 +
* <code>ieduplicates</code> identifies duplicates in [[ID Variable Properties|ID variables]], and then <code>[[iecompdup]]</code> resolves these issues.  
 +
* To install <code>ieduplicates</code>, as well as other commands in the <code>[[iefieldkit]]</code> package, type <syntaxhighlight lang="Stata" inline>ssc install iefieldkit</syntaxhighlight> in Stata, as this command is a part of the <code>[[iefieldkit]]</code> package.
 +
* For instructions and available options, type <syntaxhighlight lang="Stata" inline>help ieduplicates</syntaxhighlight>.
  
== Intended use cases ==
+
== Overview ==
'''ieduplicates''' is meant to be used directly after importing raw data from, for example, a server used in survey data collection. The command does two high level things. It outputs a report of all the duplicates (the report can be used for correcting the duplicates) and it removes the duplicates from the data set until they are resolved.  
+
The <code>ieduplicates</code> and <code>[[iecompdup]]</code> commands are meant to help [[Impact Evaluation Team|research teams]] deal with [[Duplicates and Survey Logs|duplicate observations]] in [[Primary Data Collection|primary data]]. These commands are designed to identify and resolve duplicate instances of an [[ID Variable Properties|ID variable]] in raw survey data, and ensure that each observation is [[ID Variable Properties#Property 1: Uniquely Identifying|uniquely]] and
 +
[[ID Variable Properties#Property 2: Fully Identifying|fully identified]]. The commands combine four key tasks to resolve duplicate values:
 +
# '''Identifying duplicate entries.'''
 +
# '''Comparing observations with the same ID value.'''
 +
# '''Tracking and documenting changes to the ID variable.'''
 +
# '''Applying the necessary corrections to the data.'''
 +
In any dataset, certain variables should be unique on purpose, to allow the '''research team''' to identify them during further [[Data Analysis|analysis]]. For example, suppose you select '''household_id''' as the unique ID variable. Now suppose you pick the observation with '''household_id'''= '''123456''' for [[Back Checks|back checks]], but the dataset has two observations with '''household_id''' = '''123456'''. In this case, it is important to resolve these duplicate observations before performing the '''back check'''.  
  
The reason the duplicates are removed is to make sure that many other quality checks require unique IDs. For example, if a household with ID 123456 was selected for back checks but you incorrectly have two observations that were given the ID 123456, then it is better to solve that duplicate first (you can use the report for this) before trying to run the back check test on either of the observations. It is important that you make sure to not overwrite the original raw data with the data set where ieduplicates has removed the duplicates as you would lose that data. To avoid this, save the dataset with removed duplicate with a different name.
+
When you run <code>ieduplicates</code> for the first time, it will create a [[Ieduplicates#Duplicates Correction Template|duplicates correction template]]. This template will list all observations that contain duplicate values of an '''ID variable''' that should be unique. In the example above, after creating this template, <code>ieduplicates</code> will, by default, display a message pointing out that '''household_id''' does not '''uniquely''' and '''fully identify''' the data. It will also stop your code, and require you to fill the correction template before you can move on.
  
'''iecompdup''' helps you to identify the reason for the duplicates. The decision on how to correct a duplicate is always a qualitative decision, but iecompdup compares the duplicated quantitatively and in almost all cases gives you the information that you need in order to make the qualitative decision. See below for instructions on how to interpret the output of iecompdup.
+
== Syntax ==
 +
The basic syntax for <code>ieduplicates</code> is as follows:
 +
<syntaxhighlight lang="Stata">ieduplicates id_varname using "filename.xlsx"
 +
    , uniquevars(varlist)</syntaxhighlight>
 +
As inputs, <code>ieduplicates</code> requires the following :
 +
* '''id_varname</code>''': This is the name of the single, unique [[ID Variable Properties|ID variable]]. This variable must be such that it would be an unacceptable duplicate in the dataset, and so must never be repeated. If there are two or more variables that identify the observation in the dataset, you should create a single '''ID variable''' that is unique for the dataset. This variable could be either a string or a number. For example, '''household_id'''.
 +
* '''''"filename.xlsx"''''':  This provides the name of the Excel file in which <code>ieduplicates</code> will display the '''duplicates correction template'''. The file name in this case is specified with the help of <code>using</code>, and must include an [[Stata Coding Practices#File Paths|absolute file path]]. For example, '''"C:/myIE/Documentation/DupReport.xlsx"''' is the '''absolute file path''' for the file called '''"DupReport.xlsx"'''. Since the output is an Excel sheet, even those [[Impact Evaluation Team|members of the research team]] who do not know Stata can read the report, and make corrections.
 +
* <code>uniquevars()</code>: Finally, <code>ieduplicates</code> uses one or multiple variables specified within '''<code>()</code>''' to uniquely identify each observation in the dataset. However, most [[Computer-Assisted_Personal_Interviews_(CAPI)#CAPI_Software|data collection tools]] only use one variable for this purpose. For example, [[SurveyCTO Coding Practices|SurveyCTO]] creates this variable automatically, and names it '''"KEY"'''.
 +
For example, if there are no observations which have duplicate values of '''household_id''', for instance, <code>ieduplicates</code> will display a message saying the data set is '''uniquely''' and '''fully identified''' on the basis of '''household_id'''. In such a case, there will be no output, and this command will leave the data unchanged.  
  
=== Intended work flow ===
+
However, if there are observations which have duplicate values of '''household_id''', <code>ieduplicates</code> will save the output to an Excel sheet called '''"DupReport.xlsx"'''. This file will contain information on these observations in the form of the '''duplicates correction template''', and <code>ieduplicates</code> will also stop your code with a message listing the repeated values under '''household_id'''. See '''Figure 1''' below for a sample correction template.
  
# Run '''ieduplicates''' on the raw data
+
== Implementation ==
#* If there are no duplicates, then you are done and can skip the rest of this list.  
+
The following is the process for using <code>ieduplicates</code>:
# If there are duplicates, use '''iecompdup''' on any duplicates identified.  
+
# '''Run''' <code>ieduplicates</code> ''' on the raw data.''' If there are no duplicate observations, then you are done. Skip the rest of the steps. If there are duplicates, the command will output an Excel file which contains the '''duplicates correction template'''. It will display a message with a link to this file, and stop the code from moving forward. It will also show a message listing the duplicate ID values. <br>
# Enter the corrections identified with '''iecompdup''' to the duplicates in the report outputted by '''ieduplicates'''
+
# '''Open the duplicates correction template.''' This template will list each duplicate entry of the ID variable, and information about each observation. It also contains 5 blank columns - '''correct''', '''drop''', '''newid''', '''initials''', and '''notes'''. Use these columns to make corrections, and include comments to [[Data Documentation|document]] the corrections. <br>
# After entering the corrections, save the report in the same location with the same name,
+
# '''Use''' <code>[[iecompdup]]</code> '''for more information.''' Sometimes the template is not enough to solve a particular issue. In such cases, run the <code>[[iecompdup]]</code> command on the same dataset. <br>
# Run '''ieduplicates''' again. The corrections you have entered is now applied and only duplicates that are still not resolved are removed this time.
+
# '''Overwrite the previous file.''' After entering all the corrections to the template, save the Excel file in the same location with the same name. <br>
 +
# '''Run''' <code>ieduplicates</code> '''again.''' This will apply the corrections you made in the previous steps. Now if you use the '''force''' option, it will only remove those duplicates that you did not resolve. <br>
 +
# '''Do not overwrite the orginal raw data.''' Save the resulting dataset under a different [[Naming Conventions|name]].<br>
 +
# '''Repeat these steps with each new round of data.'''
 +
==== Sample template ====
 +
A sample '''duplicates correction template''' created by <code>ieduplicates</code> is displayed below.
 +
[[File:Ieduplicates.png|900px|thumb|center|'''Figure 1: Duplicates Correction Template''']]
  
Repeat these steps every time you get new data. Our recommendation is that this is done every day that you have new data.
+
== Duplicates Correction Template ==
 +
The <code>ieduplicates</code> exports the '''duplicates correction template''' to the Excel file, based on the following syntax:
 +
<syntaxhighlight lang="Stata">ieduplicates id_varname using "filename.xlsx"
 +
    , uniquevars(varlist)
 +
    [force keepvars(varlist) tostringok droprest nodaily
 +
    duplistid(string) datelisted(string) datefixed(string) correct(string)
 +
    drop(string) newid(string) initials(''string) notes(string) listofdiffs(string)]</syntaxhighlight>
  
== Instructions ==
+
* <code>force</code>: Removes all observations that contain duplicate values of '''id_varname''' from the data. As a result, it keeps only '''uniquely''' and '''fully identified''' observations. The <code>force</code> option is required so that you know that <code>ieduplicates</code> is making changes to your dataset. Do not overwrite the original raw data with the one that this command generates, otherwise you will lose the original data. Save this new dataset with a different name.  
These instructions are meant to help you understand how to use the command. For technical instructions on how to implement the command in Stata see the help files by typing <code>help ieduplicates</code> or <code>help iecompdup</code> in Stata.
 
  
=== ieduplicates ===
+
* <code>duplistid</code>: Uniquely identifies each of the duplicate observations in the '''duplicates correction template'''. It does this by assigning '''1''' to the first instance of a duplicate value, '''2''' to the second instance, and so on.
ieduplicates requires that you list the ID variable (''idvar'') for which to look for duplicates ("HHID" in the example below), a folder for the report ("C:\myImpactEvaluation\duplicateReports" in the example below), and a variable that is always unique ("KEY" in the example below). Each of these are explained below.
 
  
<pre>ieduplicates HHID, folder(C:\myImpactEvaluation\duplicateReports) uniquevars(KEY)</pre>
+
* <code>datelisted</code>: Indicates the date on which the observation was first included in the template.
  
'''''idvar'''''. ieduplicates only allow a single ID variable. See the discussion in the section below. We recommend that you create a single ID variable if you currently have two or more variables that identifies the observations in your data set. This variable can be both string or numeric.
+
* <code>listofdiffs</code>: Lists the variables in the data set that are different for the observations that have duplicate values of the ID variable. You can also rename these columns by specifying the new column name under their respective options.
  
'''folder()'''. The folder is used to store the report. The report is outputted as an Excel sheet so that even team member that does not know Stata can read it and fill in corrections. The command also creates a folder called ''Daily'' where it each day saves a back up report in case someone would accidentally delete the main report or any of its content. To restore a back up version, simply copy it out of the ''Daily'' folder and remove the date from the name. If two different reports are generated the same day, then the first one will be overwritten by the first.
+
* <code>correct</code>, <code>drop</code>, <code>newID</code>, <code>initials</code>, and <code>notes</code>: Creates 5 blank columns which you can fill to make corrections where needed, and complete the template.
  
In the report you can make corrections to the duplicates. If you do this, then this report becomes an excellent documentation on how you resolve your duplicates. You have three types of resolutions. ''correct'', ''drop'' and ''newID''. If you want to keep one duplicate and drop another one as they are double recordings of the same observation, then write ''yes'' in the ''correct'' column for the observation you want to keep, and ''yes'' in the ''drop'' column for the one you want to drop. If you want to keep one duplicate and assign a new ID of another duplicate then you write ''yes'' in the ''correct'' column for the observation you want to keep, and a new ID value in the ''newID'' column for the observation that you want to assign a new ID. You can also combine these two methods if you have many duplicates with the same ID. Note that you must always indicate which observation to keep. After you have entered your corrections, save the file and run ieduplicates again.
+
This completed template then acts like a permanent [[Data Documentation|documentation]] of how the [[Impact Evaluation Team|research team]] resolved duplicate ID variables in the raw data. There are three options for
 +
resolving duplicate observations. They appear in the form of '''''correct''''', '''''drop''''', and '''''newID''''' columns in the template. Consider the following examples to understand how to fill these columns:
 +
* '''''drop''''': If you want to keep one of the duplicate observations and drop another, then write '''“correct”''' in the '''''correct''''' column for the observation you want to keep, and '''“drop”''' in the '''''drop''''' column for the observation you want to drop. Make sure you mention the correct value under the  '''''key''''' column for the observations you want to keep, and the ones that you want to drop from the dataset. This is important because '''SurveyCTO''' creates a unique '''key''' for every observation, even if two or more observations have duplicate variable IDs by mistake.  
  
'''uniquevars()'''. The unique variable is needed to apply the corrections. The command needs this variable to apply the correct variable to the correct observation, as the ID variable is by definition not unique and can therefore not be used. The unique identifier can consist of multiple variables but most data collection tools assign a unique ID to each observation on their server that is usually downloaded with the raw data set. If you use [[Computer-Assisted_Personal_Interviews_(CAPI)#CAPI_Software|SurveyCTO]] then this variable is called ''KEY''.
+
* '''''newID''''': If you want to keep one of the duplicates and assign a new unique ID to another one, write '''“correct”''' in the '''''correct''''' column for the observation you want to keep, and the new corrected ID value in the '''''newID''''' column for the observation to which you want to assign the new unique ID.  
  
=== iecompdup ===
+
* '''Combine both methods''': You can also combine these two methods if you have more than 2 duplicate observations. Note that you must always indicate which observation you want to keep for each group of duplicate observations.
  
ieduplicates only identifies duplicates but give you no help in how to resolve them. That is what iecompdup does. iecompdup requires a single ID variable ("HHID" in the example below) and the ID value that is duplicated ("123456" in the example below).
+
After entering your corrections, save the file and run <code>ieduplicates</code> again to apply the corrections to the dataset.  
 +
Since <code>ieduplicates</code> should be used frequently as new data comes in from the field, the command also manages a subfolder called '''/Daily/''' in the same folder which contains the main Excel file. <code>ieduplicates</code> uses this subfolder to save a backup version (along with the date) for every time the template is updated. If you do not wish to use this feature, use the '''nodaily''' option which prevents the creation of backups.
  
<pre>iecompdup HHID, id(123456)</pre>
+
'''Note:''' If two different templates are generated on the same day, it saves the second with an additional time stamp on the name. This is especially useful in case the main corrections template, or any of its contents get deleted. You can restore a backup version by simply copying it out of the '''/Daily/''' folder and remove the date from the name.
  
'''''idvar'''''. Note that the ID variable used here is the same ID variable that was used in the example for ieduplicates. This is how these commands are intended to work together.
+
== Related Pages ==
 
+
[[Special:WhatLinksHere/Ieduplicates|Click here for pages that link to this topic.]]<br>
'''id()'''. The duplicated value 123456 is for one pair or group of duplicates. If you have several pairs or groups of duplicates you will have to run this command once for each pair or group. iecompdup can only be run on two duplicates at the time and it picks the two first observations in the sort order. You need to change the sort order if you want to change which two duplicates are compared. iecompdup gives you a warning if this is the case, you suppress this warning by using the option ''more2ok''.
+
This page is part of the topic <code>[[iefieldkit]]</code>. Also see <code>[[iecompdup]]</code>.
 
 
The output for iecompdup is information on the variables where the duplicate pair has identical values and where the duplicate pair has different values. The section below outlines three types of duplicates that we have identified as reasons for duplicates when working with [[Computer-Assisted_Personal_Interviews_(CAPI)#CAPI_Software|SurveyCTO]], and how iecompdup can be used to identify which of these cases applies to the duplicate pair. The general picture should be the same even if you are using a different software, but some details might be different. No output from iecompdup can guarantee any of the cases below, but most of the times the output will still be qualitatively conclusive for one of the three cases.
 
 
 
'''Case 1. -  Double submissions of same observation and the same data.'''  This is often a consequence of poor internet connection which is common where we collect our data. If a submission of data is interrupted before it is completed, then the server still saves that incomplete data as the server never deletes any data. When the server receives a second submission it saves both submissions, as it impossible for the server to know if two submissions  or changes made between them were intentional. In iecompdup this case would result in very few variables being different and the variables that differ are mostly submission meta data such as submission time or submission ID (called KEY in SurvyeCTO). If no media files (audio, images, monitoring) were used and only meta data that differs, it does not matter which observation that is kept, but it is good practice to keep the one submitted most recently.
 
 
 
When a submission is interrupted it is usually media (audio, images, monitor) files that were not uploaded correctly. Those files does not come up as variables in Stata, only the name of the file, so only submission meta data variables differ. The file name variable and sometimes that name value is submitted even when the file is not. When both duplicates has file name it does not matter which duplicate you keep (it is good practice to keep the one submitted most recently) but if only one has the file name you should obviously keep that observation.
 
 
 
A sub-case of this case is if a duplicate is created on the server. This is very uncommon but in these cases even some submission data would be the same. In this case there is no difference which observation is dropped.
 
 
 
'''Case 2. - Double submissions of same observation but with modified data.''' This is possible but rare in most data collection software as it is bad practice and therefore made really difficult to do by the survey software developer (if at all possible). This happens if an observation if modified after the first submission and then submitted once again. Sometimes there is a need for modifying data already submitted but then it is much better practice to do so in a do-file that is included when the data set is cleaned. This way the manual modifications are properly documented. In iecompdup this would show up as the submission meta data differs and some observation data also differs. Look into these cases closely and follow up with the enumerators and supervisors responsible for this submission. There is no clear rule to which observation to keep, you have to make that decision yourself. But remember that this case is rare as most survey software have systems to prevent this.
 
 
 
'''Case 3 - Incorrectly assigned ID'''. It happens that the same ID is used for two different respondents. This can be due to typos or to protocols not being followed. In iecompdup this would show up as submission data differ as well as a lot of observation data also differs. Follow up with enumerators and supervisors responsible for this submission and assign a new ID to one of the observations based on what you learn when investigating this case.
 
 
 
== Reasoning used during development ==
 
 
 
'''Single ID variable.''' At DIME we have an ongoing discussion on if the best practice is to never allow multiple ID variables. One exception where we agree that multiple ID variables are needed is longitudinal data (panel data and time series) where a time variable is needed as well to identify each observation. In this case we are discussing if the ID in addition to the time variable or variables needed, should be allowed to consist of more than one variable. Stata's commands for defining longitudinal data sets <code>xtset</code> and <code>tsset</code> only allows one variable in addition to one time variable to uniquely define all observations in a data set. So it seems as if Stata data sets are intended to only have one ID variable, but it is still the case that a lot of Stata users use multiple ID variables. Some of Stata's commands that can be used to investigate if variables are [[ID_Variable_Properties|uniquely and fully identifying]] allows multiple variables, for example <code>isid</code> but that command is also used to investigate data sets without proper ID variables.
 
 
 
== Back to Parent ==
 
This article is part of the topic [[Stata_Coding_Practices#ietoolkit|ietoolkit]]
 
  
 +
== Additional Resources ==
 +
* DIME Analytics (World Bank), [https://osf.io/uc2en/ Real Time Data Quality Checks]
 +
* DIME Analytics (World Bank), [https://github.com/worldbank/iefieldkit The <code>iefieldkit</code> GitHub page]
 
[[Category: Stata ]]
 
[[Category: Stata ]]

Latest revision as of 22:36, 18 November 2020

ieduplicates is the second command in the Stata package created by DIME Analytics, iefieldkit. ieduplicates identifies duplicate values in ID variables. ID variables are variables that uniquely identify every observation in a dataset, for example, household_id. It then exports them to an Excel file that the research team can use to resolve these duplicates. The research team should run ieduplicates with each new batch of incoming data to ensure high quality data before cleaning and analysis.

Read First

  • Please refer to Stata coding practices for coding best practices in Stata.
  • ieduplicates is part of the package iefieldkit, which has been developed by DIME Analytics.
  • ieduplicates identifies duplicates in ID variables, and then iecompdup resolves these issues.
  • To install ieduplicates, as well as other commands in the iefieldkit package, type ssc install iefieldkit in Stata, as this command is a part of the iefieldkit package.
  • For instructions and available options, type help ieduplicates.

Overview

The ieduplicates and iecompdup commands are meant to help research teams deal with duplicate observations in primary data. These commands are designed to identify and resolve duplicate instances of an ID variable in raw survey data, and ensure that each observation is uniquely and fully identified. The commands combine four key tasks to resolve duplicate values:

  1. Identifying duplicate entries.
  2. Comparing observations with the same ID value.
  3. Tracking and documenting changes to the ID variable.
  4. Applying the necessary corrections to the data.

In any dataset, certain variables should be unique on purpose, to allow the research team to identify them during further analysis. For example, suppose you select household_id as the unique ID variable. Now suppose you pick the observation with household_id= 123456 for back checks, but the dataset has two observations with household_id = 123456. In this case, it is important to resolve these duplicate observations before performing the back check.

When you run ieduplicates for the first time, it will create a duplicates correction template. This template will list all observations that contain duplicate values of an ID variable that should be unique. In the example above, after creating this template, ieduplicates will, by default, display a message pointing out that household_id does not uniquely and fully identify the data. It will also stop your code, and require you to fill the correction template before you can move on.

Syntax

The basic syntax for ieduplicates is as follows:

ieduplicates id_varname using "filename.xlsx"
     , uniquevars(varlist)

As inputs, ieduplicates requires the following :

  • id_varname: This is the name of the single, unique ID variable. This variable must be such that it would be an unacceptable duplicate in the dataset, and so must never be repeated. If there are two or more variables that identify the observation in the dataset, you should create a single ID variable that is unique for the dataset. This variable could be either a string or a number. For example, household_id.
  • "filename.xlsx": This provides the name of the Excel file in which ieduplicates will display the duplicates correction template. The file name in this case is specified with the help of using, and must include an absolute file path. For example, "C:/myIE/Documentation/DupReport.xlsx" is the absolute file path for the file called "DupReport.xlsx". Since the output is an Excel sheet, even those members of the research team who do not know Stata can read the report, and make corrections.
  • uniquevars(): Finally, ieduplicates uses one or multiple variables specified within () to uniquely identify each observation in the dataset. However, most data collection tools only use one variable for this purpose. For example, SurveyCTO creates this variable automatically, and names it "KEY".

For example, if there are no observations which have duplicate values of household_id, for instance, ieduplicates will display a message saying the data set is uniquely and fully identified on the basis of household_id. In such a case, there will be no output, and this command will leave the data unchanged.

However, if there are observations which have duplicate values of household_id, ieduplicates will save the output to an Excel sheet called "DupReport.xlsx". This file will contain information on these observations in the form of the duplicates correction template, and ieduplicates will also stop your code with a message listing the repeated values under household_id. See Figure 1 below for a sample correction template.

Implementation

The following is the process for using ieduplicates:

  1. Run ieduplicates on the raw data. If there are no duplicate observations, then you are done. Skip the rest of the steps. If there are duplicates, the command will output an Excel file which contains the duplicates correction template. It will display a message with a link to this file, and stop the code from moving forward. It will also show a message listing the duplicate ID values.
  2. Open the duplicates correction template. This template will list each duplicate entry of the ID variable, and information about each observation. It also contains 5 blank columns - correct, drop, newid, initials, and notes. Use these columns to make corrections, and include comments to document the corrections.
  3. Use iecompdup for more information. Sometimes the template is not enough to solve a particular issue. In such cases, run the iecompdup command on the same dataset.
  4. Overwrite the previous file. After entering all the corrections to the template, save the Excel file in the same location with the same name.
  5. Run ieduplicates again. This will apply the corrections you made in the previous steps. Now if you use the force option, it will only remove those duplicates that you did not resolve.
  6. Do not overwrite the orginal raw data. Save the resulting dataset under a different name.
  7. Repeat these steps with each new round of data.

Sample template

A sample duplicates correction template created by ieduplicates is displayed below.

Figure 1: Duplicates Correction Template

Duplicates Correction Template

The ieduplicates exports the duplicates correction template to the Excel file, based on the following syntax:

ieduplicates id_varname using "filename.xlsx" 
     , uniquevars(varlist)
     [force keepvars(varlist) tostringok droprest nodaily 
     duplistid(string) datelisted(string) datefixed(string) correct(string)
     drop(string) newid(string) initials(''string) notes(string) listofdiffs(string)]
  • force: Removes all observations that contain duplicate values of id_varname from the data. As a result, it keeps only uniquely and fully identified observations. The force option is required so that you know that ieduplicates is making changes to your dataset. Do not overwrite the original raw data with the one that this command generates, otherwise you will lose the original data. Save this new dataset with a different name.
  • duplistid: Uniquely identifies each of the duplicate observations in the duplicates correction template. It does this by assigning 1 to the first instance of a duplicate value, 2 to the second instance, and so on.
  • datelisted: Indicates the date on which the observation was first included in the template.
  • listofdiffs: Lists the variables in the data set that are different for the observations that have duplicate values of the ID variable. You can also rename these columns by specifying the new column name under their respective options.
  • correct, drop, newID, initials, and notes: Creates 5 blank columns which you can fill to make corrections where needed, and complete the template.

This completed template then acts like a permanent documentation of how the research team resolved duplicate ID variables in the raw data. There are three options for resolving duplicate observations. They appear in the form of correct, drop, and newID columns in the template. Consider the following examples to understand how to fill these columns:

  • drop: If you want to keep one of the duplicate observations and drop another, then write “correct” in the correct column for the observation you want to keep, and “drop” in the drop column for the observation you want to drop. Make sure you mention the correct value under the key column for the observations you want to keep, and the ones that you want to drop from the dataset. This is important because SurveyCTO creates a unique key for every observation, even if two or more observations have duplicate variable IDs by mistake.
  • newID: If you want to keep one of the duplicates and assign a new unique ID to another one, write “correct” in the correct column for the observation you want to keep, and the new corrected ID value in the newID column for the observation to which you want to assign the new unique ID.
  • Combine both methods: You can also combine these two methods if you have more than 2 duplicate observations. Note that you must always indicate which observation you want to keep for each group of duplicate observations.

After entering your corrections, save the file and run ieduplicates again to apply the corrections to the dataset. Since ieduplicates should be used frequently as new data comes in from the field, the command also manages a subfolder called /Daily/ in the same folder which contains the main Excel file. ieduplicates uses this subfolder to save a backup version (along with the date) for every time the template is updated. If you do not wish to use this feature, use the nodaily option which prevents the creation of backups.

Note: If two different templates are generated on the same day, it saves the second with an additional time stamp on the name. This is especially useful in case the main corrections template, or any of its contents get deleted. You can restore a backup version by simply copying it out of the /Daily/ folder and remove the date from the name.

Related Pages

Click here for pages that link to this topic.
This page is part of the topic iefieldkit. Also see iecompdup.

Additional Resources