Difference between revisions of "Ieduplicates"

Revision as of 13:02, 21 December 2017

ieduplicates and the sister command iecompdup are used to identify and resolve duplicates in raw survey data.

This article is meant to describe use cases, work flow and the reasoning used when developing the commands. For instructions on how to use the commands in Stata and for a complete list of the available options, see the help files by typing help ieduplicates or help iecompdup in Stata.

Intended use cases

ieduplicates is meant to be used directly after importing raw data from, for example, a server used in survey data collection. The command does two high-level things. It outputs a report of all the duplicates (the report can be used for correcting the duplicates) and it removes the duplicates from the data set until they are resolved.

The duplicates are removed because many other quality checks require unique IDs. For example, if a household with ID 123456 was selected for back checks but you incorrectly have two observations that were given the ID 123456, then it is better to resolve that duplicate first (you can use the report for this) before trying to run the back check test on either of the observations. Make sure not to overwrite the original raw data with the data set from which ieduplicates has removed the duplicates, as you would lose the removed observations. To avoid this, save the data set with the duplicates removed under a different name.
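The two things described above can be illustrated with a small sketch. It is written in plain Python rather than Stata, and it is not the command's implementation. It assumes that every observation sharing a duplicated ID is set aside, and the names split_duplicates, hhid, key and respondent are made up for the example.

```python
from collections import Counter


def split_duplicates(rows, id_key):
    """Split records into those with a unique ID and those sharing an ID.

    Assumption for this sketch: all observations that share an ID are
    set aside in the report, so neither copy stays in the clean data.
    """
    counts = Counter(row[id_key] for row in rows)
    unique_rows = [row for row in rows if counts[row[id_key]] == 1]
    duplicate_report = [row for row in rows if counts[row[id_key]] > 1]
    return unique_rows, duplicate_report


# Hypothetical raw data: two observations were both given the ID 123456.
raw = [
    {"hhid": 123456, "key": "uuid-a", "respondent": "A"},
    {"hhid": 123456, "key": "uuid-b", "respondent": "B"},
    {"hhid": 234567, "key": "uuid-c", "respondent": "C"},
]

# The raw list is left untouched; the results are new lists, which mirrors
# the advice to save the cleaned data under a different name.
clean, report = split_duplicates(raw, "hhid")
```

In this sketch, clean can be passed on to checks that require unique IDs, while report holds the observations that still need a decision.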

iecompdup helps you to identify the reason for the duplicates. The decision on how to correct a duplicate is always a qualitative decision, but iecompdup compares the duplicates quantitatively and in almost all cases gives you the information that you need in order to make the qualitative decision. See below for instructions on how to interpret the output of iecompdup.

Intended Work Flow

Run ieduplicates on the raw data
- If there are no duplicates, then you are done and can skip the rest of this list.
If there are duplicates, use iecompdup on any duplicates identified.
Enter the corrections identified with iecompdup for the duplicates in the report output by ieduplicates.
After entering the corrections, save the report in the same location with the same name.
Run ieduplicates again. The corrections you have entered are now applied, and only the duplicates that are still unresolved are removed this time.

Repeat these steps every time you get new data. Our recommendation is that this is done every day that you have new data.
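The repeated run in this work flow can be sketched as follows, again in plain Python rather than Stata. The shape of the corrections (a mapping from a unique key to either a drop action or a new ID) is an assumption made for this illustration and may not match the layout of the actual report. The names apply_corrections, hhid and key are hypothetical.

```python
from collections import Counter


def apply_corrections(rows, corrections, id_key="hhid", unique_key="key"):
    """Apply corrections, then set aside IDs that are still duplicated.

    corrections maps a row's unique key to ("drop",) or ("newid", value).
    This format is hypothetical and only serves the illustration.
    """
    corrected = []
    for row in rows:
        action = corrections.get(row[unique_key])
        if action is None:
            corrected.append(dict(row))  # no correction entered yet
        elif action[0] == "newid":
            corrected.append({**row, id_key: action[1]})  # assign a new ID
        elif action[0] == "drop":
            continue  # observation judged to be a true duplicate
        else:
            raise ValueError(f"unknown correction: {action!r}")
    counts = Counter(row[id_key] for row in corrected)
    clean = [row for row in corrected if counts[row[id_key]] == 1]
    unresolved = [row for row in corrected if counts[row[id_key]] > 1]
    return clean, unresolved


raw = [
    {"hhid": 123456, "key": "uuid-a"},
    {"hhid": 123456, "key": "uuid-b"},
    {"hhid": 234567, "key": "uuid-c"},
    {"hhid": 345678, "key": "uuid-d"},
    {"hhid": 345678, "key": "uuid-e"},
]

# First run: no corrections yet, so both duplicated IDs are set aside.
first_clean, first_unresolved = apply_corrections(raw, {})

# Second run: one duplicate has been resolved by assigning a new ID.
second_clean, second_unresolved = apply_corrections(
    raw, {"uuid-b": ("newid", 123457)}
)
```

Each run starts from the raw data, so corrections entered earlier are applied again and only the unresolved pair is still held back.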

@@ Line 4: / Line 4: @@
 == Intended use cases ==
-'''ieduplicates''' is meant to be used directly after importing raw data from for example a survey data collection. It does two things. It outputs a report of all the duplicated (the report can be used for correcting the duplicates) and it removes the duplicates from the data set until they are resolved. The reason the duplicates are resolved is to make sure that other quality checks that use the ID require unique IDs. For example, if household with ID 123456 was selected for back checks but you incorrectly have two observations that were given the ID 123456, then it is better to solve that duplicate first (you can use the report for this) before trying to run the back check test on either of the observations. Make sure not to overwrite the original raw data with the data set where ieduplicates has removed the duplicates.
+'''ieduplicates''' is meant to be used directly after importing raw data from, for example, a server used in survey data collection. The command does two high level things. It outputs a report of all the duplicated (the report can be used for correcting the duplicates) and it removes the duplicates from the data set until they are resolved.
-'''iecompdup''' helps you to identify the reason for the duplicates. The decision on how to correct a duplicate is always a qualitative decision, but iecompdup compares the duplicated quantitatively and in almost all cases gives you the information that you need in order to make the qualitative decision.
+The reason the duplicates are removed is to make sure that many other quality checks require unique IDs. For example, if a household with ID 123456 was selected for back checks but you incorrectly have two observations that were given the ID 123456, then it is better to solve that duplicate first (you can use the report for this) before trying to run the back check test on either of the observations. It is important that you make sure to not overwrite the original raw data with the data set where ieduplicates has removed the duplicates as you would lose that data. To avoid this, save the dataset with removed duplicate with a different name.
+'''iecompdup''' helps you to identify the reason for the duplicates. The decision on how to correct a duplicate is always a qualitative decision, but iecompdup compares the duplicated quantitatively and in almost all cases gives you the information that you need in order to make the qualitative decision. See below for instructions on how to interpret the output of iecompdup.
 === Intended Work Flow ===
