Difference between revisions of "Ieduplicates"
Kbjarkefur (talk | contribs) |
Revision as of 13:02, 21 December 2017
ieduplicates and the sister command iecompdup are used to identify and resolve duplicates in raw survey data.
This article is meant to describe the use cases, the work flow and the reasoning used when developing the commands. For instructions on how to use the commands in Stata and for a complete list of the available options, see the help files by typing help ieduplicates or help iecompdup in Stata.
Intended use cases
ieduplicates is meant to be used directly after importing raw data from, for example, a server used in survey data collection. The command does two high-level things. It outputs a report of all the duplicates (the report can be used for correcting the duplicates), and it removes the duplicates from the data set until they are resolved.
The duplicates are removed because many other quality checks require unique IDs. For example, if a household with ID 123456 was selected for back checks but you incorrectly have two observations that were given the ID 123456, then it is better to resolve that duplicate first (you can use the report for this) before trying to run the back check test on either of the observations. Make sure not to overwrite the original raw data with the data set from which ieduplicates has removed the duplicates, as you would lose the removed observations. To avoid this, save the data set without the duplicates under a different name.
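The two high-level steps described above can be illustrated outside Stata. The following is a minimal Python sketch of the idea only, not the ieduplicates implementation: the function name split_duplicates, the variable names and the example records are all hypothetical.

```python
from collections import Counter


def split_duplicates(rows, id_key):
    """Split rows into those with a unique ID and those sharing an ID.

    Hypothetical helper for illustration; it returns two new lists and
    leaves the input list (the raw data) untouched.
    """
    counts = Counter(row[id_key] for row in rows)
    unique_rows = [row for row in rows if counts[row[id_key]] == 1]
    duplicate_report = [row for row in rows if counts[row[id_key]] > 1]
    return unique_rows, duplicate_report


# Made-up raw data: two observations were given the ID 123456.
raw = [
    {"hhid": 123456, "enumerator": "A", "hh_size": 4},
    {"hhid": 123456, "enumerator": "B", "hh_size": 6},
    {"hhid": 222222, "enumerator": "A", "hh_size": 3},
]

# The report lists every observation involved in a duplicate; the clean
# data keeps only observations whose ID is unique.
clean, report = split_duplicates(raw, "hhid")
```

Because the function returns new lists, the raw data stays intact, which mirrors the advice above to save the data set without duplicates under a different name.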
iecompdup helps you identify the reason for the duplicates. The decision on how to correct a duplicate is always a qualitative decision, but iecompdup compares the duplicates quantitatively and in almost all cases gives you the information that you need in order to make that qualitative decision. See below for instructions on how to interpret the output of iecompdup.
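As a rough illustration of what a quantitative comparison of two duplicates can look like, the Python sketch below lists which variables are identical and which differ between two observations that share an ID. It is an assumption-laden sketch, not the iecompdup algorithm or its output format; compare_pair and the example records are hypothetical.

```python
def compare_pair(first, second, ignore=()):
    """Return (same, differ): names of variables with equal / unequal values.

    Only variables present in both observations are compared. Variables
    listed in `ignore` (for example the shared ID) are skipped.
    """
    shared = [key for key in first if key in second and key not in ignore]
    same = [key for key in shared if first[key] == second[key]]
    differ = [key for key in shared if first[key] != second[key]]
    return same, differ


# Two made-up observations that were both given the ID 123456.
obs_1 = {"hhid": 123456, "enumerator": "A", "village": "X", "hh_size": 4}
obs_2 = {"hhid": 123456, "enumerator": "B", "village": "X", "hh_size": 6}

same, differ = compare_pair(obs_1, obs_2, ignore=("hhid",))
```

If almost all variables are identical, the same submission was probably recorded twice; if most differ, two different households were probably given the same ID. The final judgment remains qualitative.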
Intended Work Flow
- Run ieduplicates on the raw data
- If there are no duplicates, then you are done and can skip the rest of this list.
- If there are duplicates, use iecompdup on any duplicates identified.
- Enter the corrections identified with iecompdup for the duplicates in the report outputted by ieduplicates
- After entering the corrections, save the report in the same location with the same name.
- Run ieduplicates again. The corrections you have entered are now applied, and only duplicates that are still not resolved are removed this time.
Repeat these steps every time you get new data. Our recommendation is that this is done every day that you have new data.
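The work flow above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not how ieduplicates stores or applies corrections: the corrections dictionary, its "drop" and "new_id" fields, the key variable that identifies each submission, and the function apply_corrections are all hypothetical.

```python
from collections import Counter


def apply_corrections(rows, corrections, id_key, unique_key):
    """Apply corrections, then split rows into resolved and unresolved.

    `corrections` maps a submission key to a dict that may contain
    "drop": True (discard the observation) or "new_id": value (assign a
    different ID). Both field names are assumptions for this sketch.
    """
    corrected = []
    for row in rows:
        fix = corrections.get(row[unique_key], {})
        if fix.get("drop"):
            continue
        new_row = dict(row)  # copy, so the raw data is never modified
        if "new_id" in fix:
            new_row[id_key] = fix["new_id"]
        corrected.append(new_row)

    # Observations whose ID is still shared remain unresolved.
    counts = Counter(row[id_key] for row in corrected)
    resolved = [row for row in corrected if counts[row[id_key]] == 1]
    unresolved = [row for row in corrected if counts[row[id_key]] > 1]
    return resolved, unresolved


# Made-up raw data with two duplicated IDs (1 and 2).
raw = [
    {"key": "a", "hhid": 1},
    {"key": "b", "hhid": 1},
    {"key": "c", "hhid": 2},
    {"key": "d", "hhid": 2},
    {"key": "e", "hhid": 3},
]

# Only the duplicate in ID 1 has been investigated so far.
corrections = {"b": {"new_id": 4}}

resolved, unresolved = apply_corrections(raw, corrections, "hhid", "key")
```

In this example the duplicate in ID 1 is resolved by the correction, while the two observations with ID 2 are still held back until a correction is entered for them and the step is run again.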