Ieduplicates

Revision as of 11:43, 21 December 2017 by Kbjarkefur

ieduplicates and the sister command iecompdup are used to identify and resolve duplicates in raw survey data.

This article is meant to describe the use cases, the work flow, and the reasoning used when developing the commands. For instructions on how to use the commands in Stata and for a complete list of the available options, see the help files by typing help ieduplicates or help iecompdup in Stata.

Intended use cases

ieduplicates is meant to be used directly after importing raw data, for example from a survey data collection. It does two things: it outputs a report of all the duplicates (the report can be used for correcting them), and it removes the duplicates from the data set until they are resolved. The duplicates are removed until they are resolved because other quality checks that use the ID require unique IDs. For example, if the household with ID 123456 was selected for back checks but you incorrectly have two observations that were given the ID 123456, then it is better to resolve that duplicate first (you can use the report for this) before trying to run the back check test on either of the observations. Make sure not to overwrite the original raw data with the data set from which ieduplicates has removed the duplicates.
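The two outputs described above (a report of duplicates and a data set without them) can be illustrated with a minimal Python sketch. This is not how ieduplicates is implemented and it is not Stata syntax. The function name split_duplicates, the variable names hhid and key, and the example records are all assumptions made for illustration.

```python
from collections import defaultdict


def split_duplicates(records, id_key):
    """Split records into those with a unique ID and those sharing an ID.

    Returns (clean, report). The input list is not modified, which mirrors
    the advice never to overwrite the original raw data.
    """
    groups = defaultdict(list)
    for rec in records:
        groups[rec[id_key]].append(rec)

    # Observations whose ID occurs exactly once stay in the working data set.
    clean = [r for r in records if len(groups[r[id_key]]) == 1]
    # Every observation that shares its ID with another goes into the report.
    report = [r for r in records if len(groups[r[id_key]]) > 1]
    return clean, report


# Hypothetical raw data: "key" stands in for a variable that is unique
# for every submission, even when the ID was entered incorrectly.
raw = [
    {"hhid": 123456, "key": "uuid-a", "age": 34},
    {"hhid": 123456, "key": "uuid-b", "age": 43},
    {"hhid": 234567, "key": "uuid-c", "age": 51},
]

clean, report = split_duplicates(raw, "hhid")
```

In this sketch both observations with ID 123456 are moved to the report, so a later check that looks up household 123456 in the clean data finds nothing rather than picking one of the two at random.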

iecompdup helps you identify the reason for the duplicates. The decision on how to correct a duplicate is always a qualitative decision, but iecompdup compares the duplicates quantitatively and in almost all cases gives you the information that you need in order to make that decision.
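As a rough illustration of what a quantitative comparison of two duplicates could look like, the Python sketch below lists the variables on which two observations agree and disagree. It is only a sketch of the idea, not the implementation of iecompdup; the function name compare_pair and the example variables are hypothetical.

```python
def compare_pair(rec_a, rec_b, ignore=()):
    """Return (same, differ): names of shared variables with equal and unequal values."""
    shared = [k for k in rec_a if k in rec_b and k not in ignore]
    same = [k for k in shared if rec_a[k] == rec_b[k]]
    differ = [k for k in shared if rec_a[k] != rec_b[k]]
    return same, differ


# Two hypothetical observations that were given the same household ID.
obs_a = {"hhid": 123456, "key": "uuid-a", "age": 34, "village": "North"}
obs_b = {"hhid": 123456, "key": "uuid-b", "age": 43, "village": "North"}

# The submission key always differs, so it is left out of the comparison.
same, differ = compare_pair(obs_a, obs_b, ignore=("key",))
```

How to read the result remains a judgment call. If nearly all variables match, that may suggest the same interview was submitted twice; if many differ, it may suggest that two different respondents were given the same ID.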

Work Flow

The intended work flow is that you run ieduplicates on the raw data, then use iecompdup on the duplicates identified. Then you enter the resolutions to the duplicates in the report outputted by ieduplicates, save the report in the same location with the same name, and run ieduplicates again. The resolutions you have entered are now applied, and only the duplicates that were not resolved are removed this time.
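The second run in this work flow can be sketched in Python as follows. The action names ("keep", "drop", "newid") and the shape of the resolutions mapping are invented for this illustration and are not the columns of the actual report; see the help file for the real format. The function name apply_resolutions and the example data are likewise hypothetical.

```python
from collections import Counter


def apply_resolutions(records, resolutions, id_key="hhid", key="key"):
    """Apply resolutions, then separate out duplicates that are still unresolved.

    resolutions maps a unique submission key to (action, value), where action
    is "keep", "drop" or "newid". Observations without an entry are unresolved.
    """
    corrected = []
    for rec in records:
        action, value = resolutions.get(rec[key], ("unresolved", None))
        if action == "drop":
            continue
        rec = dict(rec)  # copy, so the raw data is never modified
        if action == "newid":
            rec[id_key] = value
        corrected.append(rec)

    # After corrections, only IDs that still occur more than once are removed.
    counts = Counter(r[id_key] for r in corrected)
    clean = [r for r in corrected if counts[r[id_key]] == 1]
    still_duplicated = [r for r in corrected if counts[r[id_key]] > 1]
    return clean, still_duplicated


raw = [
    {"hhid": 123456, "key": "uuid-a"},
    {"hhid": 123456, "key": "uuid-b"},
    {"hhid": 234567, "key": "uuid-c"},
    {"hhid": 345678, "key": "uuid-d"},
    {"hhid": 345678, "key": "uuid-e"},
]

# Only the first duplicate pair has been resolved so far.
resolutions = {
    "uuid-a": ("keep", None),
    "uuid-b": ("newid", 123457),
}

clean, still_duplicated = apply_resolutions(raw, resolutions)
```

Here the pair with ID 123456 returns to the clean data after one observation is given a new ID, while the pair with ID 345678 stays out until a resolution is entered for it.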

Directions

Reasoning used during development