Difference between revisions of "Iematch"
Latest revision as of 15:55, 10 June 2019
iematch is a Stata command that matches base observations to target observations on a single continuous variable. Matching allows researchers to find non-treated units with characteristics similar to those of treated units, laying the groundwork for causal inference. Matching with iematch takes place before analysis. This page describes the use, options and validity of iematch.
Read First
- This command is part of the package ietoolkit. To install all commands in this package, including iematch, type ssc install ietoolkit in Stata.
- For detailed instructions on how to implement the command in Stata, type help iematch in Stata.
- While iematch matches observations, it does not test for validity.
Overview
iematch uses baseline data to match base to target observations. Researchers may use this command if, for example, a control group was not selected at the time of random treatment assignment. In certain cases, iematch may also be used for propensity score matching (PSM), the most commonly used matching method. Several user-written commands have been developed specifically for propensity score matching and perform all the required steps. In most cases, these commands are the best choice for PSM analysis. Sometimes, though, a PSM analysis requires a special step that no off-the-shelf PSM command offers. In these cases, users must set up each step of the PSM analysis themselves and may choose to use iematch to perform the matching step.
Note that while iematch performs the computational task of matching two sets of observations based on differences in the matching variable, the command does not incorporate a test to confirm the validity of a match from an economic theory standpoint. It is up to the user to run whichever tests on the iematch results they find most appropriate. To DIME Analytics’ knowledge, there is no consensus on a general test for matching validity. However, if you have suggestions for tests to integrate into the command, please contact DIME Analytics (dimeanalytics@worldbank.org).
Implementation
iematch does not identify globally optimal matching results, but rather uses greedy matching. In optimal matching, the sum of all absolute differences between matched pairs is minimized using optimization. In greedy matching, the sum of differences is disregarded: the process instead begins by matching the best pair, then the second-best pair, and so on until all valid pairs are found. An optimal match might split up a very good match and a decent match to create two moderately good matches. Optimal matching is much more complex and requires more computational power. Often, the results of optimal matching are only marginally better and do not seem to affect overall balance (see Gu and Rosenbaum 1993).
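A small worked example may make the distinction concrete. The values are hypothetical and not taken from any data set: two base observations with matching values 0 and 3, and two target observations with matching values 2 and 5. Greedy matching takes the closest pair first (3 with 2) and is then left with the pair (0 with 5), while an optimal match would pair 0 with 2 and 3 with 5.

```stata
* Hypothetical matching values: base = {0, 3}, target = {2, 5}
* Greedy: closest pair first (3,2), then the only remaining pair (0,5)
display "Greedy total difference:  " abs(3 - 2) + abs(0 - 5)
* Optimal: pairs (0,2) and (3,5) minimize the summed difference
display "Optimal total difference: " abs(0 - 2) + abs(3 - 5)
```

The greedy total is 6 and the optimal total is 4: greedy keeps the single best pair (difference 1) at the cost of one poor pair (difference 5).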
Basic Implementation
A basic implementation of the command follows:
iematch, grpdummy(tmt) matchvar(p_hat)
In this example, each observation with tmt=1 is matched to the nearest observation with tmt=0, where distance is measured by the difference in p_hat.
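When iematch serves as the matching step of a propensity score analysis, the matching variable has to be created first. The following is only a sketch: the covariates x1 and x2 are hypothetical placeholders, and the logit specification is an assumption made for illustration, not a recommendation.

```stata
* Estimate the propensity score (x1 and x2 are hypothetical covariates)
logit tmt x1 x2
* Store the predicted probability of tmt == 1 in p_hat
predict p_hat, pr
* Match on the estimated propensity score
iematch, grpdummy(tmt) matchvar(p_hat)
```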
One-to-One vs. Many-to-One
iematch performs either a one-to-one match or a many-to-one match between base and target observations. The required option grpdummy() indicates the base and target observations: a grpdummy() value of 1 indicates a base observation and a grpdummy() value of 0 indicates a target observation. A missing grpdummy() value excludes the observation from the matching.
A one-to-one match produces matched pairs of exactly one target observation and exactly one base observation. In a one-to-one match, the data must include more target observations than base observations. If there are more base observations, simply switch which group has value 1 and which has value 0 in the group dummy.
A many-to-one match produces matched groups with exactly one target observation and one or more base observations. In a many-to-one match, the data must include more base observations than target observations. If there are more target observations, simply switch which group has value 1 and which has value 0 in the group dummy.
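Switching the roles of the two groups can be done with a reversed copy of the group dummy. In this sketch, tmt_rev is a hypothetical variable name; missing values in tmt remain missing in tmt_rev, so those observations are still excluded from the matching.

```stata
* Count observations in each group before choosing base and target
tabulate tmt, missing
* Reverse the dummy: 1 becomes 0 and 0 becomes 1 (missing stays missing)
generate byte tmt_rev = 1 - tmt
* Match using the reversed group dummy
iematch, grpdummy(tmt_rev) matchvar(p_hat)
```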
To restrict the number of base observations allowed to match with a single target observation, use the option maxmatch().
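As a sketch, the call below caps each target observation at three matched base observations. It assumes that many-to-one matching is requested with the m1 option; confirm this in help iematch for your installed version. The cap of 3 is an arbitrary illustrative value.

```stata
* Many-to-one match (m1 assumed to be the option that requests it),
* with at most 3 base observations per target observation
iematch, grpdummy(tmt) matchvar(p_hat) m1 maxmatch(3)
```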
Maximum Difference in a Match
To improve the validity of a matched result, consider allowing only matches where the difference between the matched observations is no more than a value specified in maxdiff(). You could of course drop matches that exceed this value manually after running iematch, but using maxdiff() often helps the algorithm finish faster on very large data sets.
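For instance, with a propensity score as the matching variable, a cap might look as follows. The threshold 0.05 is an arbitrary illustrative value, and the output variable names _matchDiff and _matchResult are assumed to be the command's defaults; check help iematch for the names used in your version.

```stata
* Only allow matches whose difference in p_hat is at most 0.05
iematch, grpdummy(tmt) matchvar(p_hat) maxdiff(.05)
* Inspect the realized differences and the outcome for each observation
* (_matchDiff and _matchResult are assumed default output names)
summarize _matchDiff
tabulate _matchResult
```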
Ensuring Replicability
If all values in the variable used for the matching are unique, then the results will always be the same regardless of the sort order of the data set, as long as the values do not change. Thus, for data sets with entirely unique matching values, the results of iematch will always be replicable. However, if there are duplicate values in the matching variable, the user must take the following two steps to ensure that the results are replicable:
- Set a seed (line 1)
- Use the option seedok with iematch (line 2)
set seed 12345
iematch, grpdummy(tmt) matchvar(p_hat) seedok
Setting a seed ensures that iematch will generate the same matching result each time, even if some observations have duplicate values. Specifying seedok suppresses the error message thrown when there are duplicates in matchvar.
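Whether these steps are needed can be checked before matching. A minimal sketch, in which dup_p_hat is a hypothetical variable name:

```stata
* Report how many observations share a p_hat value with another observation
duplicates report p_hat
* Flag the duplicated values so they can be inspected
duplicates tag p_hat, generate(dup_p_hat)
list tmt p_hat if dup_p_hat > 0
```

If the report shows no surplus observations, the matching values are unique and the result does not depend on a seed.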
Back to Parent
This article is part of the topic ietoolkit
Additional Resources
- Read more about ietoolkit on GitHub: https://github.com/worldbank/ietoolkit