Ieduplicates
ieduplicates and its sister command iecompdup are used to identify and resolve duplicates in raw survey data.
This article is meant to describe use cases, the workflow, and the reasoning used when developing these commands. For instructions on how to use the commands in Stata, and for a complete list of the available options, see the help files by typing help ieduplicates or help iecompdup in Stata.
Intended use cases
ieduplicates is meant to be used directly after importing raw data, for example from a survey data collection. It does two things: it outputs a report of all the duplicates (the report can be used for correcting them), and it removes the duplicates from the data set until they are resolved. The reason the duplicates are removed is that other quality checks that use the ID variable require unique IDs. For example, if the household with ID 123456 was selected for back checks, but two observations were incorrectly given the ID 123456, then it is better to resolve that duplicate first (you can use the report for this) before running the back check test on either of the observations. Make sure not to overwrite the original raw data with the data set from which ieduplicates has removed the duplicates.
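A first run could look like the minimal sketch below. The file paths, the ID variable name (hhid), and the uniquely identifying variable (key) are placeholders for your own project; see help ieduplicates for the exact syntax and options.

```stata
* Load the raw data - this file should never be overwritten
use "raw/survey_raw.dta", clear

* Flag all duplicates in hhid, write a report to the Excel file,
* and remove the duplicated observations until they are resolved.
* uniquevars() takes a variable that uniquely identifies observations
* even among duplicates (for example SurveyCTO's key variable).
ieduplicates hhid using "dup_reports/hhid_duplicates.xlsx", uniquevars(key)

* Save the de-duplicated data to a NEW file, keeping the raw data intact
save "intermediate/survey_nodup.dta", replace
```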
iecompdup helps you identify the reason for the duplicates. The decision on how to correct a duplicate is always a qualitative decision, but iecompdup compares the duplicated observations quantitatively and, in almost all cases, gives you the information you need to make that qualitative decision.
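For example, the duplicate from the paragraph above could be investigated like this. The ID variable name (hhid) is a placeholder; the command lists which variables are identical across the duplicated observations and which differ, which usually reveals whether the duplicate is, say, a double submission or an incorrectly assigned ID.

```stata
* Compare the two observations that share the ID value 123456
iecompdup hhid, id(123456)
```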
Intended work flow
- Run ieduplicates on the raw data
- If there are no duplicates, then you are done and can skip the rest of this list.
- If there are duplicates, use iecompdup on any duplicates identified.
- Enter the corrections identified with iecompdup into the report outputted by ieduplicates
- After entering the corrections, save the report in the same location with the same name.
- Run ieduplicates again. The corrections you have entered are now applied, and only duplicates that are still unresolved are removed this time.
Repeat these steps every time you get new data. Our recommendation is to do this every day that you have new data.
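The steps above can be collected in a short do-file that is re-run each day new data arrives. As before, all file paths and variable names are placeholders, not part of the commands' required syntax.

```stata
* Daily duplicate-check do-file (sketch)
use "raw/survey_raw.dta", clear

* On each re-run, corrections already entered in the Excel report are
* applied automatically; only still-unresolved duplicates are removed
ieduplicates hhid using "dup_reports/hhid_duplicates.xlsx", uniquevars(key)

* The investigation step is manual: for each duplicated ID listed in
* the report, run for example
*   iecompdup hhid, id(123456)
* then enter the correction in the report and save it under the same
* name in the same location before the next run.

save "intermediate/survey_nodup.dta", replace
```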