Difference between revisions of "Iecompdup"

(56 intermediate revisions by 3 users not shown)
Line 1: Line 1:
'''<code>iecompdup</code>''' is the third command in the Stata package created by [https://www.worldbank.org/en/research/dime/data-and-analytics DIME Analytics], '''<code>[[iefieldkit]]</code>'''. The '''<code>iecompdup</code>''' command helps the [[Impact Evaluation Team|research team]] identify the reason for why [[Duplicates and Survey Logs|duplicate values]] for [[ID Variable Properties | ID variables]] exist, so they can be resolved. '''ID variables''' are variables that uniquely identify every [[Unit of Observation|observation]] in a dataset, for example, '''household_id'''.
<code>iecompdup</code> is a command in the [[Stata Coding Practices|Stata]] package <code>[[iefieldkit]]</code> created by [https://www.worldbank.org/en/research/dime/data-and-analytics DIME Analytics]. The <code>iecompdup</code> command helps the [[Impact Evaluation Team|research team]] identify why [[Duplicates and Survey Logs|duplicated values]] in [[ID Variable Properties | ID variables]] exist, so they can be resolved. '''ID variables''' are '''variables''' that uniquely identify every observation in a [[Master Dataset|dataset]], for example, '''household_id'''.
 
== Read First ==
== Read First ==
* [[Stata Coding Practices|Stata coding practices]].
* Please refer to [[Stata Coding Practices|Stata coding practices]] for best practices.
* '''<code>[[iefieldkit]]</code>.'''
* <code>iecompdup</code> is part of the package <code>[[iefieldkit]]</code>, which has been developed by [https://www.worldbank.org/en/research/dime/data-and-analytics DIME Analytics].
* While '''<code>[[ieduplicates]]</code>''' identifies duplicates in [[ID Variable Properties|ID variables]], '''<code>iecompdup</code>''' provides more information to resolve these issues.  
* While <code>[[ieduplicates]]</code> identifies [[Duplicates and Survey Logs|duplicates]] in [[ID Variable Properties|ID variables]], <code>iecompdup</code> provides more information to resolve these issues.  
* To install '''<code>iecompdup</code>''', type '''<code>ssc install iecompdup</code>''' in Stata.
* To install <code>iecompdup</code>, as well as other commands in the <code>iefieldkit</code> package, type <syntaxhighlight lang="Stata" inline>ssc install iefieldkit</syntaxhighlight> in '''Stata'''.
* To install all the commands in the '''<code>[[iefieldkit]]</code>''' package, type '''<code>ssc install iefieldkit</code>''' in Stata.
* For instructions and available options, type <syntaxhighlight lang="Stata" inline>help iecompdup</syntaxhighlight>.
* For instructions and available options, type '''<code>help iecompdup</code>'''.


== Overview ==  
== Overview ==  
Once '''<code>[[ieduplicates]]</code>''' creates the [[ieduplicates#Duplicates Correction Template|duplicate correction template]], '''<code>iecompdup</code>''' compares the duplicate entries variable-by-variable to understand why the duplicates exist. While the decision of how to correct a duplicate is always a qualitative decision, '''<code>iecompdup</code>''' provides the information necessary to make that decision, and ensure [[Monitoring Data Quality|high quality data]] before [[Data Cleaning | cleaning]] and [[Data Analysis | data analysis]]. It allows the [[Impact Evaluation Team|research team]] to also select the output format based on their decision process.
Once <code>[[ieduplicates]]</code> creates the [[ieduplicates#Duplicates Correction Template|duplicate correction template]], <code>iecompdup</code> compares the [[Duplicates and Survey Logs|duplicate]] entries '''variable-by-variable''' to understand why the '''duplicates''' exist. While the decision of how to correct a '''duplicate''' is always a qualitative decision, <code>iecompdup</code> provides the information necessary to make that decision, and ensures [[Monitoring Data Quality|high quality data]] before [[Data Cleaning | cleaning]] and [[Data Analysis | data analysis]]. It also allows the [[Impact Evaluation Team|research team]] to select the output format based on their decision process.


Follow these steps when using the '''<code>ieduplicates</code>''' and '''<code>iecompdup</code>''' commands on incoming [[Primary Data Collection|primary data]]:
These steps outline the intended workflow for using <code>ieduplicates</code> and <code>iecompdup</code> in combination on incoming [[Primary Data Collection|primary data]]:
# '''Run <code>ieduplicates</code> on the raw data.''' If there are no duplicates, you are done. If there are duplicates, the command will output an Excel file containing a '''duplicates correction template''', and a link to this file. It will also stop the code from moving forward, and show a message listing the duplicate values in the [[ID Variable Properties|ID variables]]. You can prevent the command from stopping your code by using the '''force''' option. This will remove all observations with duplicate ID values and allow the code to continue.<br>
 
# '''Open the duplicates correction template.''' This template will list each duplicate entry of the ID variable, and information about each observation. It also contains 5 blank columns - '''correct''', '''drop''', '''newid''', '''initials''', and '''notes'''. Use these columns to make corrections, and include comments to [[Data Documentation|document]] the corrections. <br>
# Run <code>ieduplicates</code> on the raw data. If there are no '''duplicates''', you are done. If there are, the command will output an Excel file containing a [[Ieduplicates#Duplicates Correction Template|duplicates correction template]], and a link to this file.<br>
# '''Use <code>[[iecompdup]]</code> for more information.''' Sometimes the template is not enough to solve a particular issue. In such cases, run the '''<code>[[iecompdup]]</code>''' command on the same dataset. <br>
# Use <code>iecompdup</code> for more information. The '''duplicates correction template''' includes some information comparing the '''duplicates''', but if that information is not enough, then this command should be used to get more information.
# '''Overwrite the previous file.''' After entering all the corrections to the template, save the Excel file in the same location with the same name. <br>
# Go back to your '''duplicates correction template''' and apply the corrections you identified using this command. (See <code>[[ieduplicates]]</code> for more details on how to apply the corrections.)
# '''Run <code>ieduplicates</code> again.''' This will apply the corrections you made in the previous steps. Now if you use the '''force''' option, it will only remove those duplicates that you did not resolve. <br>
# '''Do not overwrite the original raw data.''' Save the resulting dataset under a different [[Naming Conventions|name]].<br>
# '''Repeat these steps with each new round of data.'''


== Syntax ==
== Syntax ==
Furthermore, this comparison can only be done when there are exactly two duplicates. When there are more differences than can be stored by ieduplicates, or more than two duplicates, you can use iecompdup to explore differences. iecompdup requires as inputs the name of the intended unique ID variable (the same one as in ieduplicates) and the value that variable takes in the duplicate observations you wish to compare.
Sometimes when there are a lot of '''variables''' that are different for observations with [[Duplicates and Survey Logs|duplicate IDs]], <code>[[ieduplicates]]</code> cannot display all the information in the '''duplicates correction template'''. In such cases, or when there are more than two '''duplicates''', you can use <code>iecompdup</code> to explore the differences.
 iecompdup ''id_varname'' [if]
   , id(''id_value'')
     [more2ok
     didifference
     keepdifference
     keepother(''varlist'')]


==Implementation==
The following points provide a detailed explanation of the syntax and usage of <code>iecompdup</code>.


# Run <code>[[ieduplicates]]</code> on the raw data. If there are no duplicates, then you are done and can skip the rest of this list.
* '''Basic inputs:''' <code>iecompdup</code> uses ''id_varname'' and ''id_value'' as its basic inputs:
# If there are duplicates, use <code>iecompdup</code> on any duplicates identified.  
** '''id_varname:''' The name of the unique [[ID Variable Properties|ID variable]], which is also used with <code>ieduplicates</code>.  
# Enter the corrections identified with <code>iecompdup</code> to the duplicates in the report outputted by <code>ieduplicates</code>.
** '''id_value:''' This is the value that the '''ID variable''' takes in the '''duplicate''' observations you want to compare. For example, if the household with the ID value ''A1234'' appears twice, then ''id_varname'' is ''household_id'' and ''id_value'' is ''A1234''.  
# After entering the corrections, save the report in the same location with the same name.
# Run <code>ieduplicates</code> again. The corrections you have entered are now applied, and only duplicates that are still unresolved are removed this time.


Repeat these steps with each new round of data: DIME Analytics recommends repeating these steps each day that a research team has new data. In doing so, make sure to not overwrite the original raw data with the dataset from which <code>ieduplicates</code> has removed duplicates, as this would result in lost data. Instead, save the dataset with removed duplicates under a [[Naming Conventions | different name]].
* '''More than one pair of duplicates:''' If you have more than one pair of '''duplicates''' in your [[Master Dataset|dataset]], you will need to run this command multiple times for each such pair to compare the differences.


== Specifications ==
* '''More than two observations with same id_value:''' If there are more than two observations with a particular ID value, the command will return an error. This is because <code>iecompdup</code> can only compare two '''duplicates''' at a time. In this case, use one of the following options:
** <code>if</code>:  Using <code>if</code> allows you to select the pair of observations you want to compare.
** <code>more2ok</code>: Using <code>more2ok</code> allows <code>iecompdup</code> to pick the first two observations by default, as per the sort order. It will then display a warning message so that the user is aware that the sorting order of observations will affect the result.


<code>iecompdup</code> requires a single ID variable and the duplicate ID value. See the below example for reference:
* '''Default output:''' By default, <code>iecompdup</code> displays two lists of '''variables''' in the form of returned macros - one, '''variables''' for which the '''duplicate''' pair has identical values and two, '''variables''' for which the '''duplicate''' pair has different values. <code>iecompdup</code> also provides the following options with respect to these lists:
** <code>didifference</code>: This option will also make the command print the list of '''variables''' with different values.
** <code>keepdifference</code>: This option will only keep the '''variables''' which have different values across the '''duplicate''' pair. This option effectively drops '''variables''' which are not of interest.
** <code>keepother</code>: This option can be used if you want to retain additional '''variables''' that you think are useful for analyzing the '''duplicate''' pair.


<pre>iecompdup HHID [if] , id(123456)</pre>
==Output==
The output from <code>iecompdup</code> allows you to explore the differences between observations to determine the best way to correct the [[Duplicates and Survey Logs|duplicate values]]. Broadly speaking, there are three cases that explain why '''duplicate''' values in [[ID Variable Properties|ID Variables]] can arise when working with SurveyCTO. Given below are the cases, and information on how <code>iecompdup</code> can help you identify which of these applies to a particular pair of '''duplicates'''. Some details can change if you use different [[Computer-Assisted Personal Interviews (CAPI)#Software|software]], but the general idea should remain the same. And while <code>iecompdup</code> cannot guarantee any of the cases below, the output will allow you to identify one of these cases as the source of the problem.


===idvar===
=== Case 1:  Same observation, same data values ===
<code>iecompdup</code> only allows a single ID variable. In the above example, this is ''HHID''. The ID variable used here is the same ID variable used in <code>ieduplicates</code>. If you currently have two or more variables that identify the observation in the dataset, DIME Analytics suggests creating a single ID variable. This variable could be either string or numeric.
Case 1 errors can occur when the same observation is submitted twice, with the same data values. This often happens during [[Computer-Assisted Personal Interviews (CAPI) | CAPI]] or [[Computer-Assisted Field Entry (CAFE)|CAFE]] [[Survey Pilot|surveys]] because of poor internet connection. If submission of data to the [[SurveyCTO Server Management|server]] is interrupted before you can finish completing all fields, the incomplete data may still be saved. This is because '''SurveyCTO servers''' never delete any data. When you re-submit the data the second time, the '''server''' saves that too. However, it cannot identify which submission was intentional, and which one was accidental.  


===id===
For a case 1 error, the output of <code>iecompdup</code> will display two observations with very few differences. These differences will mostly be in the form of submission time or submission ID (which SurveyCTO lists as the '''"KEY"''' '''variable'''). Information of this form is called '''metadata'''. Sometimes the only difference between the two observations is in terms of the '''metadata''', and the data does not include any media files (audio, images, [[Administrative and Monitoring Data#Monitoring Data|monitoring]]). In such cases it does not matter which observation you keep. However, it is a good practice to keep the most recent submission.
<code>iecompdup</code> requires the ID value for the duplicate pair or group. In the above example, this is ''123456''. Note that the command can only be run on two duplicates at a time. When there are more than two duplicates for a given ID, the command issues a warning. If you have several pairs or groups of duplicates, you will have to run this command once for each pair or group.
To do that, use an <code>if</code> expression to select the observations to be compared.
 
==Output==


The command outputs the variable names for which the duplicate pair has identical values and the variable names for which the duplicate pair has different values. The section below outlines three cases of duplicates and explains how <code>iecompdup</code> can help to identify which case the duplicate pair belongs to. No output from <code>iecompdup</code> can guarantee any of the cases below, but typically the output will be qualitatively conclusive for one of the three cases.
In most cases, however, submission gets interrupted because the data contained media files which did not upload correctly. Those files do not always appear as '''variables''' when the [[Master Dataset|dataset]] is imported in [[Stata Coding Practices|Stata]], depending on the data collection software. Even in such cases, only the '''metadata variables''' will appear to be different, so you must carefully check the media files which lie outside the imported '''dataset''' for '''duplicate''' observations.


===Case 1: Same Observation, Same Data===
=== Case 2: Same observation, different data values ===
This case often occurs with [[Computer-Assisted Personal Interviews (CAPI) | CAPI]] surveys as a consequence of poor internet connection. If a submission is interrupted, then the server still saves that incomplete data; when the server receives a second submission, it saves both submissions since it does not know if the two submissions and the changes made between them were intentional. In <code>iecompdup</code>’s output, this case would appear as very few different variables; the variables that differ would mostly be submission meta data such as submission time or submission ID (called ''KEY'' in SurveyCTO). If no media files (i.e. audio, images, monitoring) were used and only the meta data differs, it does not matter which observation you keep. However, it is good practice to keep the one submitted most recently.
Case 2 errors are possible but rare in most data collection software, because most software does not allow more than one complete observation with the same ID. However, case 2 errors may still occur if someone modifies an observation after the first submission, and then re-submits it. If you think it is necessary to modify data that has already been submitted, it is better to make these modifications in a '''do-file''' as part of [[Data Cleaning|data cleaning]]. This will also allow the [[Impact Evaluation Team|research team]] to [[Data Documentation|document]] the manual changes that are made, for example, during revisions in [[Survey Pilot|survey]] software.


In most cases, submission interruptions occur because media files did not upload correctly. Those files themselves do not come up as variables in Stata (only their file names do), and thus only submission meta data variables differ. The file name variable is submitted even when the file is not. When both duplicates have a file name and the same file contents, it does not matter which duplicate you keep. However, it is good practice to keep the one submitted most recently. If only one has the file name, keep that observation.
For a case 2 error, the output of <code>iecompdup</code> will display observations with different submission '''metadata''', as well as a few different observation values (like ''age'' or ''name''). In such cases, you will need to follow up with the [[Enumerator Training|enumerators]] and [[Survey Pilot Participants|supervisors]] who submitted the data. Also, there is no clear rule on which observation to keep, and the '''research team''' will have to decide this on a case-by-case basis.


The case may also occur if a duplicate is created on the server. This is very uncommon but in these cases, even some submission data would be the same. In this case, either observation can be dropped.
===Case 3: Incorrectly assigned ID===
Case 3 errors can occur because of typographical errors, for example if the ID was typed incorrectly during [[Primary Data Collection|data collection]], or if the field team did not follow proper [[Survey Protocols|protocols]] during '''data collection'''.  


===Case 2: Same Observation, Modified Data===
For a case 3 error, the output of <code>iecompdup</code> will display observations with different submission '''metadata''', as well as many different [[Survey Pilot|survey]] responses. In this case too, you will need to follow up with [[Enumerator Training|enumerators]] and [[Survey Pilot Participants|supervisors]] who were responsible for this submission. You will need to assign a new ID to one of the observations based on what you learn after following up with the field team.
This case is rare but possible in most data collection software. This occurs if an observation is modified after the first submission and then re-submitted. Sometimes it is necessary to modify already-submitted data, though in these cases, it is best practice to do so in a do-file to ensure [[Data Documentation | proper documentation]]. In <code>iecompdup</code>’s output, this case would show up as the submission meta data differing and some observation data differing. Look into these cases closely and follow up with the enumerators and supervisors responsible for this submission. There is no clear rule on which observation to keep: you have to make that decision yourself. Remember that this case is rare since most survey software has systems to prevent this.


===Case 3: Incorrectly Assigned ID===
== Related Pages ==
The case occurs when the same ID is used for two different respondents. This may happen due to typos or to unfollowed [[Survey Protocols | protocols]]. In <code>iecompdup</code>’s output, this case would show up as submission data differing as well as a lot of observation data differing. Follow up with enumerators and supervisors responsible for this submission and assign a new [[ID Variable Properties | ID]] to one of the observations based on your findings.  
[[Special:WhatLinksHere/Iecompdup|Click here for pages that link to this topic.]]<br>
This page is part of the topic <code>[[iefieldkit]]</code>. Also see <code>[[ieduplicates]]</code>.


== Back to Parent ==
== Additional Resources ==
This article is part of the topic [[Stata_Coding_Practices#ietoolkit|ietoolkit]]
* DIME Analytics (World Bank), [https://osf.io/uc2en/ Real Time Data Quality Checks]
==Additional Resources==
* DIME Analytics (World Bank), [https://github.com/worldbank/iefieldkit The <code>iefieldkit</code> GitHub page]
*DIME Analytics’ [https://github.com/worldbank/DIME-Resources/blob/master/stata1-4-quality.pdf Real Time Data Quality Checks]
[[Category: Stata ]]
[[Category: Stata ]]

Latest revision as of 19:57, 15 August 2023

iecompdup is a command in the Stata package iefieldkit created by DIME Analytics. The iecompdup command helps the research team identify why duplicated values in ID variables exist, so they can be resolved. ID variables are variables that uniquely identify every observation in a dataset, for example, household_id.

Read First

  • Please refer to Stata coding practices for best practices.
  • iecompdup is part of the package iefieldkit, which has been developed by DIME Analytics.
  • While ieduplicates identifies duplicates in ID variables, iecompdup provides more information to resolve these issues.
  • To install iecompdup, as well as other commands in the iefieldkit package, type ssc install iefieldkit in Stata.
  • For instructions and available options, type help iecompdup.

Overview

Once ieduplicates creates the duplicate correction template, iecompdup compares the duplicate entries variable-by-variable to understand why the duplicates exist. While the decision of how to correct a duplicate is always a qualitative decision, iecompdup provides the information necessary to make that decision, and ensures high quality data before cleaning and data analysis. It also allows the research team to select the output format based on their decision process.

These steps outline the intended workflow for using ieduplicates and iecompdup in combination on incoming primary data:

  1. Run ieduplicates on the raw data. If there are no duplicates, you are done. If there are, the command will output an Excel file containing a duplicates correction template, and a link to this file.
  2. Use iecompdup for more information. The duplicates correction template includes some information comparing the duplicates, but if that information is not enough, then this command should be used to get more information.
  3. Go back to your duplicates correction template and apply the corrections you identified using this command. (See ieduplicates for more details on how to apply the corrections.)
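
Step 2 often has to be repeated for several duplicated ID values. The following is a minimal sketch of one way to do that, not the documented workflow: it uses Stata's built-in duplicates tag and levelsof commands to find the duplicated values, and the dataset and the variable names hhid, age, key and dup_count are hypothetical.

```stata
* Hypothetical example data: hhid should be unique, but two values repeat
clear
input long hhid age str10 key
123456 34 "uuid:001"
123456 35 "uuid:002"
234567 50 "uuid:003"
345678 41 "uuid:004"
345678 41 "uuid:005"
end

* Flag observations whose hhid occurs more than once (built-in Stata command)
duplicates tag hhid, generate(dup_count)

* Collect each duplicated ID value once
levelsof hhid if dup_count > 0, local(dup_ids)

* Run iecompdup once per duplicated ID value and print the differing variables
foreach id of local dup_ids {
    display as text "Comparing duplicates with hhid = `id'"
    iecompdup hhid, id(`id') didifference
}
```

Each pass through the loop compares one pair, so the printed differences can be recorded next to the matching rows of the duplicates correction template.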

Syntax

Sometimes when there are a lot of variables that are different for observations with duplicate IDs, ieduplicates cannot display all the information in the duplicates correction template. In such cases, or when there are more than two duplicates, you can use iecompdup to explore the differences.

iecompdup id_varname [if]
 , id(id_value)
   [more2ok
   didifference
   keepdifference
   keepother(varlist)]

The following points provide a detailed explanation of the syntax and usage of iecompdup.

  • Basic inputs: iecompdup uses id_varname and id_value as its basic inputs:
    • id_varname: The name of the unique ID variable, which is also used with ieduplicates.
    • id_value: This is the value that the ID variable takes in the duplicate observations you want to compare. For example, if the household with the ID value A1234 appears twice, then id_varname is household_id and id_value is A1234.
  • More than one pair of duplicates: If you have more than one pair of duplicates in your dataset, you will need to run this command multiple times for each such pair to compare the differences.
  • More than two observations with same id_value: If there are more than two observations with a particular ID value, the command will return an error. This is because iecompdup can only compare two duplicates at a time. In this case, use one of the following options:
    • if: Using if allows you to select the pair of observations you want to compare.
    • more2ok: Using more2ok allows iecompdup to pick the first two observations by default, as per the sort order. It will then display a warning message so that the user is aware that the sorting order of observations will affect the result.
  • Default output: By default, iecompdup displays two lists of variables in the form of returned macros - one, variables for which the duplicate pair has identical values and two, variables for which the duplicate pair has different values. iecompdup also provides the following options with respect to these lists:
    • didifference: This option will also make the command print the list of variables with different values.
    • keepdifference: This option will only keep the variables which have different values across the duplicate pair. This option effectively drops variables which are not of interest.
    • keepother: This option can be used if you want to retain additional variables that you think are useful for analyzing the duplicate pair.
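
The options above can be illustrated with a short sketch. The data and the variable names hhid, age, key and enumerator are hypothetical, and the numeric ID value is chosen only for illustration; see help iecompdup for the authoritative syntax.

```stata
* Hypothetical example data: hhid 123456 appears twice
clear
input long hhid age str10 key str8 enumerator
123456 34 "uuid:001" "E01"
123456 35 "uuid:002" "E01"
234567 50 "uuid:003" "E02"
end

* Compare the pair and also print the list of variables that differ
iecompdup hhid, id(123456) didifference

* The two variable lists are returned as macros; show what was stored
return list

* keepdifference changes the data in memory, so work on a temporary copy
preserve
    iecompdup hhid, id(123456) keepdifference keepother(enumerator)
    list
restore
```

If an ID value occurs more than twice, an if condition placed after the variable name (for example one based on the submission key) selects the pair to compare, while more2ok compares the first two observations in the current sort order.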

Output

The output from iecompdup allows you to explore the differences between observations to determine the best way to correct the duplicate values. Broadly speaking, there are three cases that explain why duplicate values in ID Variables can arise when working with SurveyCTO. Given below are the cases, and information on how iecompdup can help you identify which of these applies to a particular pair of duplicates. Some details can change if you use different software, but the general idea should remain the same. And while iecompdup cannot guarantee any of the cases below, the output will allow you to identify one of these cases as the source of the problem.

Case 1: Same observation, same data values

Case 1 errors can occur when the same observation is submitted twice, with the same data values. This often happens during CAPI or CAFE surveys because of poor internet connection. If submission of data to the server is interrupted before you can finish completing all fields, the incomplete data may still be saved. This is because SurveyCTO servers never delete any data. When you re-submit the data the second time, the server saves that too. However, it cannot identify which submission was intentional, and which one was accidental.

For a case 1 error, the output of iecompdup will display two observations with very few differences. These differences will mostly be in the form of submission time or submission ID (which SurveyCTO lists as the "KEY" variable). Information of this form is called metadata. Sometimes the only difference between the two observations is in terms of the metadata, and the data does not include any media files (audio, images, monitoring). In such cases it does not matter which observation you keep. However, it is a good practice to keep the most recent submission.

In most cases, however, submission gets interrupted because the data contained media files which did not upload correctly. Those files do not always appear as variables when the dataset is imported in Stata, depending on the data collection software. Even in such cases, only the metadata variables will appear to be different, so you must carefully check the media files which lie outside the imported dataset for duplicate observations.

Case 2: Same observation, different data values

Case 2 errors are possible but rare in most data collection software, because most software does not allow more than one complete observation with the same ID. However, case 2 errors may still occur if someone modifies an observation after the first submission, and then re-submits it. If you think it is necessary to modify data that has already been submitted, it is better to make these modifications in a do-file as part of data cleaning. This will also allow the research team to document the manual changes that are made, for example, during revisions in survey software.

For a case 2 error, the output of iecompdup will display observations with different submission metadata, as well as a few different observation values (like age or name). In such cases, you will need to follow up with the enumerators and supervisors who submitted the data. Also, there is no clear rule on which observation to keep, and the research team will have to decide this on a case-by-case basis.

Case 3: Incorrectly assigned ID

Case 3 errors can occur because of typographical errors, for example if the ID was typed incorrectly during data collection, or if the field team did not follow proper protocols during data collection.

For a case 3 error, the output of iecompdup will display observations with different submission metadata, as well as many different survey responses. In this case too, you will need to follow up with enumerators and supervisors who were responsible for this submission. You will need to assign a new ID to one of the observations based on what you learn after following up with the field team.

Related Pages

Click here for pages that link to this topic.
This page is part of the topic iefieldkit. Also see ieduplicates.

Additional Resources