Difference between revisions of "Ieduplicates"

Revision as of 12:39, 21 December 2017

ieduplicates and the sister command iecompdup are used to identify and resolve duplicates in raw survey data.

This article is meant to describe use cases, work flow and the reasoning used when developing the commands. For instructions on how to use the commands in Stata and for a complete list of the available options, see the help files by typing help ieduplicates or help iecompdup in Stata.

Intended use cases

ieduplicates is meant to be used directly after importing raw data, for example from a survey data collection. It does two things: it outputs a report of all the duplicates (the report can be used for correcting them), and it removes the duplicates from the data set until they are resolved. The duplicates are removed because other quality checks that use the ID require unique IDs. For example, if the household with ID 123456 was selected for back checks but you incorrectly have two observations that were given the ID 123456, then it is better to resolve that duplicate first (you can use the report for this) before trying to run the back check on either of the observations. Make sure not to overwrite the original raw data with the data set from which ieduplicates has removed the duplicates.
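The two effects described above can be illustrated with a minimal sketch. It is written in Python purely to show the logic and is not the Stata implementation; the function name `split_duplicates`, the `id` field and the sample records are hypothetical. It assumes that every observation sharing a duplicated ID is set aside, not only the extra copies.

```python
from collections import Counter

def split_duplicates(rows, id_key="id"):
    """Return (clean_rows, duplicate_report) for a list of dict records.

    Hypothetical helper: every row whose ID occurs more than once goes
    into the report and is left out of the clean data.
    """
    counts = Counter(row[id_key] for row in rows)
    clean = [row for row in rows if counts[row[id_key]] == 1]
    report = [row for row in rows if counts[row[id_key]] > 1]
    return clean, report

# Illustrative raw data: two observations were given the ID 123456.
raw = [
    {"id": 123456, "respondent": "A"},
    {"id": 123456, "respondent": "B"},
    {"id": 222222, "respondent": "C"},
]

clean, report = split_duplicates(raw)
# `raw` is left untouched, mirroring the advice not to overwrite the raw data.
```

Here `clean` holds only the observation with a unique ID, while `report` lists both observations that share ID 123456 so that they can be reviewed.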

iecompdup helps you to identify the reason for the duplicates. The decision on how to correct a duplicate is always a qualitative decision, but iecompdup compares the duplicates quantitatively and in almost all cases gives you the information that you need in order to make the qualitative decision.

Intended Work Flow

  1. Run ieduplicates on the raw data
    1. If there are no duplicates, then you are done.
  2. If there are duplicates, use iecompdup on any duplicates identified.
  3. Enter the corrections identified with iecompdup in the duplicates report output by ieduplicates.
  4. After entering the corrections, save the report in the same location with the same name.
  5. Run ieduplicates again. The corrections you have entered are now applied, and only duplicates that are still not resolved are removed this time.

Repeat these steps every time you get new data. We recommend doing this every day that you have new data.
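The iterative nature of this work flow can be sketched as follows. This is an illustration in Python under stated assumptions, not the behaviour of the Stata commands: the names `run_pass` and `apply_corrections`, the `key` field that uniquely identifies each submission, and the correction actions `"drop"` and `"newid"` are all hypothetical simplifications of what a saved report might contain.

```python
from collections import Counter

def apply_corrections(rows, corrections):
    """Apply reviewer decisions; `corrections` maps key -> (action, value)."""
    out = []
    for row in rows:
        action, value = corrections.get(row["key"], ("keep", None))
        if action == "drop":
            continue  # reviewer decided this observation should be deleted
        # "newid" assigns a corrected ID; anything else keeps the row as is
        out.append({**row, "id": value} if action == "newid" else dict(row))
    return out

def run_pass(raw, corrections):
    """One run: apply corrections, then set aside still-duplicated IDs."""
    corrected = apply_corrections(raw, corrections)
    counts = Counter(row["id"] for row in corrected)
    clean = [row for row in corrected if counts[row["id"]] == 1]
    unresolved = [row for row in corrected if counts[row["id"]] > 1]
    return clean, unresolved

raw = [
    {"key": "k1", "id": 1},
    {"key": "k2", "id": 1},
    {"key": "k3", "id": 2},
    {"key": "k4", "id": 2},
    {"key": "k5", "id": 3},
]

# First run: no corrections yet, so both duplicated IDs are set aside.
clean, unresolved = run_pass(raw, {})

# After review, k2 gets a new ID; the k3/k4 pair is still undecided.
corrections = {"k2": ("newid", 4)}
clean2, unresolved2 = run_pass(raw, corrections)
```

After the second run only the k3/k4 pair remains unresolved; the corrected observations are back in the clean data, and the raw data is never modified.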

Directions

Reasoning used during development