Difference between revisions of "Iematch"


Revision as of 22:33, 3 June 2019

iematch is a Stata command that matches base observations to target observations on a single continuous variable. Matching allows researchers to find non-treated units with characteristics similar to those of treated units, laying the groundwork for causal inference. Matching with iematch takes place before analysis. This page describes the use, options and validity of iematch.

Read First

  • iematch is a Stata command that matches base observations to target observations on a single continuous variable.
  • This command is part of the package ietoolkit. To install all commands in this package, including iematch, type ssc install ietoolkit in Stata.
  • For detailed instructions on how to implement the command in Stata, type help iematch in Stata.
  • While iematch matches observations, it does not test for validity.

Overview

iematch uses baseline data to match base observations to target observations, laying the groundwork for causal inference. Researchers may use this method if, for example, a control group was not selected at the time of random treatment assignment. In certain cases, iematch may also be used for propensity score matching (PSM), the most commonly used matching method. Several user-written commands have been developed specifically for propensity score matching and perform all required steps. In most cases, these commands are the better choice for PSM analysis. Sometimes, though, a PSM analysis requires a special step that no off-the-shelf PSM command offers. In these cases, the user must set up each step of the PSM analysis manually and may choose to use iematch to perform the matching step.

Note that while iematch performs the computational task of matching two sets of observations based on differences in the matching variable, the command does not include a test that confirms the validity of a match from an economic theory standpoint. It is up to the user to run the tests on the iematch results that are most appropriate for the case at hand. To DIME Analytics’ knowledge, there is no consensus on a general test for matching validity. However, if you have suggestions for tests to integrate into the command, please contact DIME Analytics.

How it Works

iematch does not identify globally optimal matching results, but rather uses greedy matching. In optimal matching, the sum of all absolute differences between matched pairs is minimized using optimization. In greedy matching, the sum of differences is disregarded: the process instead begins by matching the best pair, then the second best pair and so on until all valid pairs are found. An optimal match might split up a very good match and a decent match to create two moderately good matches. Optimal matching is much more complex and requires more computational power. Often, the results of optimal matching are only marginally better and do not seem to affect overall balance (see Gu and Rosenbaum 1993).
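
The difference between the two approaches can be seen in a small worked example. The numbers below are invented purely for illustration and do not come from iematch output; they assume a one-to-one match of two base observations against three target observations.

```latex
\begin{aligned}
&\text{Base values: } b_1 = 0.40,\; b_2 = 0.50
 \qquad \text{Target values: } t_1 = 0.49,\; t_2 = 0.70,\; t_3 = 0.95 \\[4pt]
&\text{Greedy: } |b_2 - t_1| = 0.01 \text{ is matched first, then } |b_1 - t_2| = 0.30,
 \quad \text{sum} = 0.01 + 0.30 = 0.31 \\[4pt]
&\text{Optimal: } |b_1 - t_1| + |b_2 - t_2| = 0.09 + 0.20 = 0.29
\end{aligned}
```

Greedy matching keeps the very good pair (difference 0.01) at the cost of one poor pair, while optimal matching gives up that pair to obtain two moderately good pairs and a slightly smaller total.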

Basic Implementation

A basic implementation of the command follows:

iematch , grpdummy(tmt) matchvar(p_hat)

In this example, each observation with tmt=1 is matched to the observation with tmt=0 that is nearest in terms of p_hat.
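
When iematch serves as the matching step of a manually assembled PSM analysis, the matching variable has to be created first. The following is a minimal sketch of that workflow, not a recommended specification: the covariates x1 and x2 are hypothetical, and the choice of a logit model is an assumption made only for illustration.

```stata
* Hypothetical covariates x1 and x2; tmt is the 0/1 group dummy
logit tmt x1 x2

* Store the predicted probability of tmt == 1 as the matching variable
predict p_hat, pr

* Match each tmt == 1 observation to the nearest tmt == 0 observation on p_hat
iematch , grpdummy(tmt) matchvar(p_hat)
```

If p_hat contains duplicate values, the last line stops with an error unless a seed is set and seedok is specified, as described under Ensuring Replicability.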

One-to-One vs. Many-to-One

iematch performs either a one-to-one match or a many-to-one match between base and target observations. The required option grpdummy() indicates the base and target observations: a grpdummy value of 1 indicates a base observation and a grpdummy value of 0 indicates a target observation. A missing grpdummy value excludes the observation from the matching.

A one-to-one match produces matched pairs of exactly one target observation and exactly one base observation. In a one-to-one match, the data must include more target observations than base observations. If there are more base observations, simply switch which group has value 1 and which has value 0 in the group dummy.

A many-to-one match produces matched groups with exactly one target observation and one or more base observations. In a many-to-one match, the data must include more base observations than target observations. If there are more target observations, simply switch which group has value 1 and which has value 0 in the group dummy.
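
Switching the roles of the two groups can be sketched as follows. The variable name tmt_rev is hypothetical, and the example assumes that tmt only takes the values 0, 1 or missing.

```stata
* Count observations in each group (1 = base, 0 = target)
tabulate tmt

* Reverse the roles of the two groups; observations with a missing
* group dummy stay missing and remain excluded from the matching
generate byte tmt_rev = 1 - tmt if !missing(tmt)

* Run the match with the reversed dummy
iematch , grpdummy(tmt_rev) matchvar(p_hat)
```

Creating a new variable rather than overwriting tmt keeps the original group coding available for the later analysis.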

To restrict the number of base observations allowed to match with a single target observation, use the option maxmatch().
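
A capped many-to-one match might look like the sketch below. The cap of 3 is an arbitrary illustrative value, and the m1 option used to request a many-to-one match is an assumption that is not described on this page; confirm the exact syntax with help iematch before relying on it.

```stata
* Many-to-one match in which each target observation (tmt == 0)
* is matched with at most 3 base observations (tmt == 1)
* NOTE: m1 is assumed to be the many-to-one switch; check help iematch
iematch , grpdummy(tmt) matchvar(p_hat) m1 maxmatch(3)
```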

Maximum Difference in a Match

To improve the validity of a matched result, consider allowing only matches in which the difference between the matched observations is no more than the value specified in maxdiff(). You could instead drop matches that exceed this value manually after running iematch, but using maxdiff() often helps the algorithm finish faster on very large data sets.
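
As a brief sketch, the call below restricts matches to pairs whose p_hat values are close to each other. The threshold of 0.05 is a hypothetical value chosen only for illustration; an appropriate value depends on the scale of the matching variable and on the study.

```stata
* Only accept matches whose p_hat values differ by at most 0.05
* (illustrative threshold, not a recommendation)
iematch , grpdummy(tmt) matchvar(p_hat) maxdiff(0.05)
```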

Ensuring Replicability

If all values in the variable used for the matching are unique, then the results will always be the same regardless of the sort order of the data set, as long as the values do not change. Thus, for data sets with entirely unique matching values, the results of iematch will always be replicable. However, if there are duplicate values in the matching variable, the user must take the following two steps to ensure that the results are replicable:

  1. Set a seed (line 1)
  2. Use the option seedok with iematch (line 2)

set seed 12345
iematch , grpdummy(tmt) matchvar(p_hat) seedok

Setting a seed ensures that iematch will generate the same matching result each time -- even if some observations have duplicate values. Specifying seedok suppresses the error message thrown when there are duplicates in matchvar.
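
Whether the seed is needed can be checked before matching. The sketch below uses Stata's built-in duplicates command; the helper variable dup_p is hypothetical. The check is deliberately conservative: it also counts duplicates among observations with a missing group dummy, in which case setting the seed is unnecessary but harmless.

```stata
* Flag observations that share their p_hat value with another observation
duplicates tag p_hat, generate(dup_p)
count if dup_p > 0 & !missing(p_hat)

if r(N) > 0 {
    * Duplicates exist: set a seed so the matching result is replicable
    set seed 12345
    iematch , grpdummy(tmt) matchvar(p_hat) seedok
}
else {
    * All matching values are unique: no seed is needed
    iematch , grpdummy(tmt) matchvar(p_hat)
}

drop dup_p
```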

Back to Parent

This article is part of the topic ietoolkit

References

  • Gu S, Rosenbaum PR. Comparison of multivariate matching methods: structure, distances, and algorithms. J Comput Graph Stat 1993;2:405–20.