Difference between revisions of "Ieduplicates"
Revision as of 20:12, 7 May 2020
ieduplicates is the second command in iefieldkit, the Stata package created by DIME Analytics. ieduplicates identifies duplicate values in ID variables that are meant to uniquely identify every observation in a dataset. It then exports them to an Excel file that the research team can use to resolve these duplicates. The research team should run ieduplicates with each new batch of incoming data to ensure high quality data before cleaning and analysis.
Read First
- Stata coding practices.
- iefieldkit.
- ieduplicates identifies duplicates in ID variables, and then iecompdup resolves these issues.
- To install ieduplicates, type ssc install ieduplicates in Stata.
- To install all the commands in the iefieldkit package, type ssc install iefieldkit in Stata.
- For instructions and available options, type help ieduplicates.
Overview
The ieduplicates and iecompdup commands are meant to help research teams deal with duplicate observations in primary data. These commands are designed to identify and resolve duplicate instances of an ID variable in raw survey data, and ensure that each observation is uniquely and fully identified. The commands combine four key tasks to resolve duplicate values:
- Identifying duplicate entries.
- Comparing observations with the same ID value.
- Tracking and documenting changes to the ID variable.
- Applying the necessary corrections to the data.
In any dataset, certain key variables should be unique by construction, so that researchers can identify each observation during further analysis. For example, suppose you select household_id as the unique ID variable. Now suppose you pick the observation with household_id = 123456 for back checks, but the dataset has two observations with household_id = 123456. In this case, it is important to resolve these duplicate observations before performing the back check.
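The check described above can be sketched with Stata's built-in duplicates and isid commands. This is only an illustration: the three-row dataset, the key values, and the dup_count variable name are made up for the example.

```stata
* Hypothetical toy data: household_id 123456 appears twice.
clear
input long household_id str10 key
123456 "uuid:a1"
123456 "uuid:b2"
234567 "uuid:c3"
end

* Tabulate how many copies of each household_id exist.
duplicates report household_id

* dup_count is 0 for unique IDs and positive for repeated ones.
duplicates tag household_id, generate(dup_count)
list household_id key if dup_count > 0

* isid exits with an error if household_id is missing or repeated.
capture isid household_id
if _rc != 0 {
    display "household_id does not uniquely identify the observations"
}
```

These built-in commands only detect the problem; ieduplicates adds the correction template and the documentation of how each duplicate was resolved.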
When you run ieduplicates for the first time, it will create a duplicate correction template. This template will list all observations that contain duplicate values of an ID variable that should be unique. In the example above, after creating this template, ieduplicates will, by default, display a message pointing out that household_id does not uniquely and fully identify the data. It will also stop your code, and require you to fill the correction template before you can move on.
Syntax
The basic syntax for ieduplicates is as follows:
ieduplicates id_varname using "filename.xlsx" , uniquevars(varlist)
As inputs, ieduplicates requires the following:
- id_varname: This is the name of the single, unique ID variable. A repeated value of this variable is never acceptable in the dataset. If there are two or more variables that together identify an observation in the dataset, you should create a single ID variable that is unique for the dataset. This variable can be either a string or a number. For example, household_id.
- "filename.xlsx": This provides the name of the Excel file in which ieduplicates will display the duplicates correction template. The file name is specified after the keyword using, and must include an absolute file path. For example, "C:/myIE/Documentation/DupReport.xlsx" is the absolute file path for the file called "DupReport.xlsx". Since the output is an Excel sheet, even those members of the research team who do not know Stata can read the report, and make corrections.
- uniquevars( ): Finally, ieduplicates uses one or multiple variables specified within ( ) to uniquely identify each observation in the dataset. However, most data collection tools only use one variable for this purpose. For example, SurveyCTO creates this variable automatically, and names it "KEY".
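When two or more variables jointly identify an observation, one way to build the single ID variable that id_varname requires is sketched below. The variables village_id and hh_number, and the assumption that hh_number is always below 1000, are hypothetical and only serve the example.

```stata
* Hypothetical toy data: households are identified by village and number.
clear
input int village_id int hh_number
1 1
1 2
2 1
end

* Combine both parts into one numeric ID (assumes hh_number < 1000).
generate long household_id = village_id * 1000 + hh_number

* Confirm that the new variable uniquely identifies every observation.
isid household_id
```

Building the ID arithmetically keeps it stable across rounds of data; egen with the group() function is an alternative, but the codes it assigns depend on which observations are in memory.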
For example, if no observations have duplicate values of household_id, ieduplicates will display a message saying the data set is uniquely and fully identified on the basis of household_id. In such a case, there will be no output, and this command will leave the data unchanged.
However, if there are observations which have duplicate values of household_id, ieduplicates will save the output to an Excel sheet called "DupReport.xlsx". This file will contain information on these observations in the form of the duplicates correction template, and ieduplicates will also stop your code with a message listing the repeated values under household_id.
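Putting the pieces of the example together, a call might look like the sketch below. It follows the syntax shown on this page and assumes that the survey data are already in memory and that SurveyCTO's KEY variable was imported under the lowercase name key; check help ieduplicates for the syntax of your installed version.

```stata
* Assumes household_id and key both exist in the data in memory.
* Without the force option, the command stops the do-file if it
* finds duplicates, after writing them to the Excel report.
ieduplicates household_id using "C:/myIE/Documentation/DupReport.xlsx", uniquevars(key)
```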
Duplicates Correction Template
The ieduplicates command exports the duplicates correction template to the Excel file, based on the following syntax:
ieduplicates id_varname using "filename.xlsx" , uniquevars(varlist) [force keepvars(varlist) tostringok droprest nodaily duplistid(string) datelisted(string) datefixed(string) correct(string) drop(string) newid(string) initials(string) notes(string) listofdiffs(string)]
- force: Removes all observations that contain duplicate values of id_varname from the data. As a result, it keeps only uniquely and fully identified observations. The force option is required so that you know that ieduplicates is making changes to your dataset. Do not overwrite the original raw data with the one that this command generates, otherwise you will lose the original data. Save this new dataset with a different name.
- duplistid: Uniquely identifies each of the duplicate observations in the duplicates correction template. It does this by assigning 1 to the first instance of a duplicate value, 2 to the second instance, and so on.
- datelisted: Indicates the date on which the observation was first included in the template.
- listofdiffs: Lists the variables in the data set that are different for the observations that have duplicate values of the ID variable. You can also rename these columns by specifying the new column name under their respective options.
- correct, drop, newid, initials, and notes: You can fill these blank columns to make corrections where needed, and complete the template.
This completed template then acts as permanent documentation of how the research team resolved duplicate ID variables in the raw data. There are three options for resolving duplicate observations. They appear in the form of correct, drop, and newID columns in the template. Consider the following examples to understand how to fill these columns:
- If you want to keep one of the duplicate observations and drop another, then write “correct” in the correct column for the observation you want to keep, and “drop” in the drop column for the observation you want to drop. Make sure you mention the correct value under the key column for the observations you want to keep, and the ones that you want to drop from the dataset. This is important because SurveyCTO creates a unique key for every observation, even if two or more observations have duplicate variable IDs by mistake.
- If you want to keep one of the duplicates and assign a new unique ID to another one, write “correct” in the correct column for the observation you want to keep, and the new corrected ID value in the newID column for the observation to which you want to assign the new unique ID.
- You can also combine these two methods if you have more than 2 duplicate observations. Note that you must always indicate which observation you want to keep for each group of duplicate observations.
After entering your corrections, save the file and run ieduplicates again to apply the corrections to the dataset.
Since ieduplicates should be used frequently as new data comes in from the field, the command also manages a subfolder called /Daily/ in the same folder that contains the main Excel file. ieduplicates uses this subfolder to save a dated backup version every time the template is updated. If two different templates are generated on the same day, it saves the second with an additional time stamp in the name. This is especially useful in case the main corrections template, or any of its contents, gets deleted. You can restore a backup version by copying it out of the /Daily/ folder and removing the date from the name. If you do not wish to use this feature, use the nodaily option, which prevents the creation of backups.
Using the Report
The exported report gives research teams a single place in which to resolve duplicates. The report has correct, drop, and newID columns. If you want to keep one duplicate and drop another one because they are double recordings of the same observation, then write yes in the correct column for the observation you want to keep, and yes in the drop column for the one you want to drop. If you want to keep one duplicate and assign a new ID to another duplicate, then write yes in the correct column for the observation you want to keep, and a new ID value in the newID column for the observation to which you want to assign a new ID. You can also combine these two methods if you have many duplicates with the same ID.
Always indicate which observation to keep. After entering your corrections, save the file and run ieduplicates again.
- Run ieduplicates on the raw data. If there are no duplicates, then you are done and can skip the rest of this list.
- If there are duplicates, use iecompdup on any duplicates identified.
- Enter the corrections identified with iecompdup for the duplicates in the report exported by ieduplicates.
- After entering the corrections, save the report in the same location with the same name.
- Run ieduplicates again. The corrections you have entered are now applied, and only duplicates that are still not resolved are removed this time.
Repeat these steps with each new round of data: DIME Analytics recommends repeating these steps each day that a research team has new data. In doing so, make sure to not overwrite the original raw data with the dataset from which ieduplicates has removed duplicates, as this would result in lost data. Instead, save the dataset with removed duplicates under a different name.
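The routine above might be scripted roughly as follows. This is a sketch under several assumptions: the data file paths are hypothetical, the key variable name is assumed, and the iecompdup call uses an assumed syntax (ID variable followed by an id() option) that you should confirm with help iecompdup.

```stata
* Hypothetical path to the raw data; adjust to your project folders.
use "C:/myIE/Data/raw_survey.dta", clear

* Compare the two observations that share an ID before filling in the
* report. Run this only for IDs that are actually duplicated.
* (Assumed syntax; confirm with help iecompdup.)
iecompdup household_id, id(123456)

* Apply the corrections saved in the report. With force, duplicates
* that are still unresolved are removed from the data in memory.
ieduplicates household_id using "C:/myIE/Documentation/DupReport.xlsx", uniquevars(key) force

* Save under a new name so that the raw data file stays untouched.
save "C:/myIE/Data/survey_nodups.dta", replace
```

Keeping the use and save paths different is what protects the raw data: the file with duplicates removed can always be regenerated from the raw file and the report.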
Back to Parent
This article is part of the topic ietoolkit
Additional Resources
- DIME Analytics’ Real Time Data Quality Checks