ieduplicates is the second command in the Stata package created by DIME Analytics,
ieduplicates identifies duplicates in ID variables that uniquely identify every observation in a dataset. It then exports them to an Excel file that the research team can use to resolve these duplicates. The research team should run
ieduplicates with each new batch of incoming data to ensure high quality data before cleaning and analysis.
- Stata coding practices.
ieduplicatesidentifies duplicates in ID variables, and then
iecompdupresolves these issues.
- To install
ssc install ieduplicatesin Stata.
- To install all the commands in the
ssc install iefieldkitin Stata.
- For instructions and available options, type
iecompdup commands are meant to help research teams deal with duplicate observations in primary data. These commands are designed to identify and resolve duplicate instances of an ID variable in raw survey data, and ensure that each observation is uniquely and
fully identified. The commands combine four key tasks to resolve duplicate values:
- Identifying duplicate entries.
- Comparing observations with the same ID value.
- Tracking and documenting changes to the ID variable.
- Applying the necessary corrections to the data.
In any data, certain key variables of an observation should be unique by construction, to allow researchers to identify them during further analysis. For example, suppose you select household_id as the unique ID variable. Now suppose you pick the observation with household_id= 123456 for back checks, but the dataset has two observations with household_id = 123456. In this case, it is important to resolve these duplicate observations before performing the back check.
When you run
ieduplicates for the first time, it will create a duplicate correction template. This template will list all observations that contain duplicate values of an ID variable that should be unique. In the example above, after creating this template,
ieduplicates will, by default, display a message pointing out that household_id does not uniquely and fully identify the data. It will also stop your code, and require you to fill the correction template before you can move on.
The sample syntax for
ieduplicates is as follows:
ieduplicates id_varname using "filename.xlsx" , uniquevars(varlist) [force keepvars(varlist) tostringok droprest nodaily duplistid(string) datelisted(string) datefixed(string) correct(string) drop(string) newid(string) initials(string) notes(string) listofdiffs(string)]
ieduplicates requires the following :
- id_varname: This is the name of the single, unique ID variable. This variable must be such that it would be an unacceptable duplicate in the dataset, and so must never be repeated. For example, household_id.
- "filename.xlsx": This allows you to specify the name of the Excel file in which
ieduplicateswill display the duplicates correction template. The file name in this case is specified with the help of using, and must include an absolute file path.
- uniquevars: Finally, this provides one or multiple variables that are guaranteed to uniquely identify each observation in the dataset. A software like SurveyCTO will create these automatically.
ieduplicates requires that you specify the ID variable, a file path to the file where the report will be saved, and a unique variable. See the below example for reference:
ieduplicates HHID using ''C:\myIE\Documentation\DupReport.xlsx'', uniquevars(KEY)
ieduplicates only allows a single ID variable. In the above example, this is HHID. If you currently have two or more variables that identify the observation in the dataset, DIME Analytics suggests creating a single ID variable. This variable could be either string or numeric.
ieduplicates stores the report in the file specified after
using. In the above example, this is "C:\myIE\Documentation\DupReport.xlsx". The report is outputted as an Excel sheet so that even team members who do not know Stata can read and correct it. The command also creates a folder called Daily in the same folder as the Excel file. In the Daily folder,
ieduplicates saves a back-up report each day in case someone accidentally deletes the main report or any of its contents. To restore a report, simply copy it out of the Daily folder and remove the date from the name. If two different reports are generated the same day, with different outputs, the second report will include a timestamp in the name.
ieduplicates uses the unique variable specified within
uniquevars() to apply corrections and assign the correct variable to the correct observation. In the above example, this is KEY. While the unique identifier can consist of multiple variables, most data collection tools assign a unique ID to each observation on their server. In SurveyCTO survey data, for example, this variable is called KEY.
ieduplicateson the raw data. If there are no duplicates, then you are done and can skip the rest of this list.
- If there are duplicates, use
iecompdupon any duplicates identified.
- Enter the corrections identified with
iecompdupto the duplicates in the report outputted by
- After entering the corrections, save the report in the same location with the same name.
ieduplicatesagain. The corrections you have entered is now applied and only duplicates that are still not resolved are removed this time.
Repeat these steps with each new round of data: DIME Analytics recommends repeating these steps each day that a research team has new data. In doing so, make sure to not overwrite the original raw data with the dataset from which
ieduplicates has removed duplicates, as this would result in lost data. Instead, save the dataset with removed duplicates under a different name.
Using the Report
The outputted report provides an excellent format in which research teams can resolve duplicate problems. The report has a correct, drop and newID column. If you want to keep one duplicate and drop another one because they are double recordings of the same observation, then write yes in the correct column for the observation you want to keep, and yes in the drop column for the one you want to drop. If you want to keep one duplicate and assign a new ID to another duplicate, then write yes in the correct column for the observation you want to keep, and a new ID value in the newID column for the observation to which you want to assign a new ID. You can also combine these two methods if you have many duplicates with the same ID.
Always indicate which observation to keep. After entering your corrections, save the file and run
Back to Parent
This article is part of the topic ietoolkit
- DIME Analytics’ Real Time Data Quality Checks