Ieduplicates

ieduplicates is the second command in iefieldkit, a Stata package created by DIME Analytics. ieduplicates identifies duplicate values in the ID variable that should uniquely identify every observation in a dataset. It then exports them to an Excel file that the research team can use to resolve these duplicates. The research team should run ieduplicates on each new batch of incoming data to ensure high-quality data before cleaning and analysis.

Read First

  • Stata coding practices.
  • iefieldkit.
  • ieduplicates identifies duplicates in ID variables, and then iecompdup resolves these issues.
  • To install ieduplicates, type ssc install ieduplicates in Stata.
  • To install all the commands in the iefieldkit package, type ssc install iefieldkit in Stata.
  • For instructions and available options, type help ieduplicates.

Overview

The ieduplicates and iecompdup commands are meant to help research teams deal with duplicate observations in primary data. These commands are designed to identify and resolve duplicate instances of an ID variable in raw survey data, and ensure that each observation is uniquely and fully identified. The commands combine four key tasks to resolve duplicate values:

  1. Identifying duplicate entries.
  2. Comparing observations with the same ID value.
  3. Tracking and documenting changes to the ID variable.
  4. Applying the necessary corrections to the data.

In any dataset, certain key variables should be unique by construction so that researchers can identify individual observations during later stages of data work. For example, suppose you select household_id as the unique ID variable. Now suppose you pick the observation with household_id = 123456 for back checks, but the dataset has two observations with household_id = 123456. In this case, it is important to resolve these duplicate observations before performing the back check.

When you run ieduplicates for the first time, it will create a duplicates correction template. This template lists all observations that contain duplicate values of an ID variable that should be unique. In the example above, after creating this template, ieduplicates will by default display a message pointing out that household_id does not uniquely and fully identify the data. It will also stop your code and require you to fill in the correction template before you can move on.

Syntax

The basic syntax for ieduplicates is as follows:

ieduplicates id_varname using "filename.xlsx"
    , uniquevars(varlist) 

As inputs, ieduplicates requires the following (a worked example follows this list):

  • id_varname: This is the name of the single ID variable that should uniquely identify every observation, so any repeated value of it is an unacceptable duplicate. If two or more variables together identify the observations in the dataset, you should first create a single ID variable that is unique across the dataset. This variable can be either a string or a number. For example, household_id.
  • "filename.xlsx": This provides the name of the Excel file in which ieduplicates will save the duplicates correction template. The file name is specified with using and must include an absolute file path. For example, "C:/myIE/Documentation/DupReport.xlsx" is the absolute file path for the file called "DupReport.xlsx". Since the output is an Excel sheet, even members of the research team who do not know Stata can read the report and make corrections.
  • uniquevars( ): Finally, ieduplicates uses one or multiple variables specified within ( ) to uniquely identify each observation in the dataset. In practice, most data collection tools use only one variable for this purpose. For example, SurveyCTO creates such a variable automatically and names it "KEY".
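For example, a call using the illustrative names from this article (household_id as the ID variable, KEY as the variable created by SurveyCTO, and the file path above) could look like this:

  * Flag duplicates of household_id and export the correction template to Excel
  ieduplicates household_id using "C:/myIE/Documentation/DupReport.xlsx", ///
      uniquevars(KEY)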

If there are no observations with duplicate values of household_id, ieduplicates will display a message saying the dataset is uniquely and fully identified on the basis of household_id. In that case, there is no output file, and the command leaves the data unchanged.
However, if there are observations with duplicate values of household_id, the command will save its output to an Excel file called "DupReport.xlsx". This file will contain information on these observations in the form of the duplicates correction template, and ieduplicates will also stop your code with a message listing the repeated values of household_id.

Duplicates Correction Template

The ieduplicates command exports the duplicates correction template to the Excel file based on the following syntax:

ieduplicates id_varname using "filename.xlsx" 
    , uniquevars(varlist)
    [force keepvars(varlist) tostringok droprest nodaily 
    duplistid(string) datelisted(string) datefixed(string) correct(string)
    drop(string) newid(string) initials(string) notes(string) listofdiffs(string)]

If the force option is specified, ieduplicates will remove all observations containing duplicated values of id_varname from the data and return only uniquely and fully identified observations. The force option is required here so that you are aware that ieduplicates is making changes to your dataset, and so that you do not overwrite the original raw data with the dataset that is returned, as you would lose the original data.
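As a sketch of that workflow, using the same illustrative names as above, the call and the subsequent save could look like this:

  * Remove any unresolved duplicates of household_id from the data in memory
  ieduplicates household_id using "C:/myIE/Documentation/DupReport.xlsx", ///
      uniquevars(KEY) force

  * Save the de-duplicated data under a new, illustrative name
  * so that the raw file is never overwritten
  save "household_survey_nodup.dta", replace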

The template contains the following columns:

  • id_varname, indicating the value of the ID variable in the observation;
  • duplistid, the unique identifier of the observation in the duplicates correction template;
  • datelisted, indicating the date the observation was first included in the template;
  • correct, drop, newID, initials, and notes, blank columns to be filled in by the user to correct the data;
  • one or multiple columns containing the values of the variables specified in uniquevars() for the observations in the template; and
  • listofdiffs, which lists the variables in the dataset that differ across the duplicate observations.

The names of these columns can be changed by specifying the desired column title within their respective options.
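For example, the correct and drop columns could be given different titles (the replacement titles below are purely illustrative):

  * Export the template with custom titles for the correct and drop columns
  ieduplicates household_id using "C:/myIE/Documentation/DupReport.xlsx", ///
      uniquevars(KEY) correct(keep_this) drop(remove_this)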

Inside the template, you can indicate corrections to resolve the duplicated observations. In this way, the completed template becomes permanent documentation of how duplicated IDs were resolved from the raw data. There are three options for resolution, offered as columns in the template: correct, drop, and newID.

If you want to keep one of the duplicates and drop another because they are double recordings of the same observation, write "correct" in the correct column for the observation you want to keep, and "drop" in the drop column for the one you want to drop. If you want to keep one of the duplicates and assign a new ID to another, write "correct" in the correct column for the observation you want to keep, and the new corrected ID value in the newID column for the observation that should receive a new ID. You can combine these two methods if you have many duplicates with the same ID. Note that you must always indicate which observation to keep for each duplicate set.

After you have entered your corrections, save the file and run ieduplicates again to apply them; ieduplicates will automatically recognize that a partially completed template is already there.
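As a purely illustrative example (the ID value, initials, and notes are hypothetical, and only some of the template columns are shown), two rows resolved as double recordings of the same observation might be filled in like this:

  household_id   correct   drop   newID   initials   notes
  123456         correct                  AB         duplicate submission, keep this one
  123456                   drop           AB         duplicate submission, drop this one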


Since the command is expected to be run frequently as data comes in, ieduplicates also creates a folder called Daily in the same folder as the Excel file. In the Daily folder, ieduplicates saves a dated back-up copy of the report each time it is re-run, in case someone accidentally deletes the main report or any of its contents. If two different reports are generated on the same day, the second one will include a time stamp in its name. To restore a back-up, simply copy it out of the Daily folder and remove the date from its name. The nodaily option suppresses the creation of these back-ups.

Using the Report

The output report provides a convenient format in which research teams can resolve duplicate problems. The report has a correct, a drop, and a newID column. If you want to keep one duplicate and drop another because they are double recordings of the same observation, then write yes in the correct column for the observation you want to keep, and yes in the drop column for the one you want to drop. If you want to keep one duplicate and assign a new ID to another duplicate, then write yes in the correct column for the observation you want to keep, and a new ID value in the newID column for the observation to which you want to assign a new ID. You can also combine these two methods if you have many duplicates with the same ID.

Always indicate which observation to keep. After entering your corrections, save the file and run ieduplicates again. A typical workflow looks like this:


  1. Run ieduplicates on the raw data. If there are no duplicates, then you are done and can skip the rest of this list.
  2. If there are duplicates, use iecompdup on any duplicates identified.
  3. Enter the corrections identified with iecompdup to the duplicates in the report outputted by ieduplicates.
  4. After entering the corrections, save the report in the same location with the same name.
  5. Run ieduplicates again. The corrections you have entered are now applied, and only duplicates that are still unresolved are removed this time (see the sketch after this list).
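As a sketch of this workflow in a do-file (the dataset names, file path, and ID value are purely illustrative; see help iecompdup for that command's exact options):

  * Load the latest batch of raw data
  use "household_survey_raw.dta", clear

  * Step 1: check for duplicates; this creates or updates the correction template
  ieduplicates household_id using "C:/myIE/Documentation/DupReport.xlsx", ///
      uniquevars(KEY)

  * Step 2: compare the observations that share one duplicated ID value
  iecompdup household_id, id(123456)

  * Steps 3-5: after filling in and saving DupReport.xlsx, re-run to apply the
  * corrections; force lets ieduplicates drop duplicates that remain unresolved
  ieduplicates household_id using "C:/myIE/Documentation/DupReport.xlsx", ///
      uniquevars(KEY) force

  * Save the corrected data under a new name so the raw data is never overwritten
  save "household_survey_nodup.dta", replace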

Repeat these steps with each new round of data: DIME Analytics recommends repeating them each day that a research team receives new data. In doing so, make sure not to overwrite the original raw data with the dataset from which ieduplicates has removed duplicates, as this would result in lost data. Instead, save the dataset with removed duplicates under a different name.

Back to Parent

This article is part of the topic iefieldkit

Additional Resources