Difference between revisions of "Iegraph" (revision by Kbjarkefur)
<code>iegraph</code> graphs the results of regression models commonly used in impact evaluations. This article describes use cases, work flow, and the reasoning behind the design of the command. For instructions on how to use the command in [[Stata Coding Practices|Stata]] and for a complete list of the available options, see the help file by typing <code>help iegraph</code> in '''Stata'''. This command is part of the package [[Stata_Coding_Practices#ietoolkit|ietoolkit]]. To install all the commands in this package, type <code>ssc install ietoolkit</code> in '''Stata'''.
== Intended use cases ==
<code>iegraph</code> generates a graph from regression estimates. It is designed for two specific models common in '''impact evaluations''', but it may also work for other regression models.
=== OLS with Treatment Dummies Model ===
The first regression model, let's call it ''dummy OLS'' for short, is a specification where each '''treatment arm''' is represented by a dummy. See the equation below.
<math>y = \alpha + \beta_1 tmt_1 + \beta_2 tmt_2 + \cdots + \beta_n tmt_n + \beta X + \varepsilon </math>
The ''dummy OLS'' has one ''tmt'' '''variable''' for each '''treatment arm'''. The omitted category is intended to be the control group. There must be at least one '''treatment dummy''', and the number is limited only by how many dummies can be displayed on the graph without cluttering it. The specification may include control '''variables''', fixed effects etc., which are represented by the vector X in the equation.
=== Difference-in-Differences Model ===
The second regression model is a [[Differences-in-differences|difference-in-differences model]], let's call it ''diff-in-diff'' for short, where '''treatment''' is the dummy D and time is the dummy T.
<math>y = \alpha + \beta_1 D + \beta_2 T + \beta_3 (D*T) + \beta X + \varepsilon</math>
Both the '''treatment dummy''' and the time dummy are included in the regression, as well as the interaction term between them (''D'', ''T'' and ''D*T'' in the equation). The specification may include control '''variables''', fixed effects etc., which are represented by the vector X in the equation.
=== Intended Work Flow ===
Simply run the regression using the ''regress'' command in [[Stata Coding Practices|Stata]], and run <code>iegraph</code> immediately afterwards.
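As a minimal sketch of this work flow, the two supported models could be run as shown below. All variable names are hypothetical: ''harvest'' is the outcome, ''tmt1'' and ''tmt2'' are treatment dummies, ''treat'' and ''post'' are the treatment and time dummies, and ''hhsize'' is a control. The order in which the diff-in-diff dummies are listed is also an assumption, so check <code>help iegraph</code> for the exact syntax.

```stata
* Install the package once (ietoolkit is distributed on SSC)
ssc install ietoolkit

* Dummy OLS: one dummy per treatment arm, control group omitted
* (harvest, tmt1, tmt2 and hhsize are hypothetical variable names)
regress harvest tmt1 tmt2 hhsize
iegraph tmt1 tmt2

* Diff-in-diff: treatment dummy, time dummy and their interaction
* (treat, post and treat_post are hypothetical variable names)
generate treat_post = treat * post
regress harvest treat post treat_post hhsize
iegraph treat post treat_post
```

In both cases only the dummies listed after <code>iegraph</code> appear in the graph; ''hhsize'' is in the regression but is not listed, so it is treated as a control.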
== Instructions ==
These instructions are meant to help you understand how to use the command. For technical instructions on how to implement the command in [[Stata Coding Practices|Stata]], see the help file by typing <code>help iegraph</code> in '''Stata'''.
=== Values In The Graph ===
One important note is that the values shown in the graph correspond exactly to the coefficients for the '''treatment dummy''' (or dummies) in the ''dummy OLS'', and for the '''treatment''' and time dummies in the ''diff-in-diff'', only if no control '''variables''', fixed effects etc. were used. To make the graph easier to interpret for a non-technical audience, while keeping it correct and equally informative for a technical audience, the omitted category (the control group in the ''dummy OLS'' and the control group in time = 0 in the ''diff-in-diff'') is shown as the average value of ''y'' in that group and not as the <math>\alpha</math> coefficient. This average is also the starting point for the other values.
If there were no control '''variables''', fixed effects etc., the average of ''y'' for the omitted category would equal the <math>\alpha</math> coefficient, but that is true only in this very specific case. If we used the <math>\alpha</math> coefficient together with control '''variables''' and fixed effects, we would risk ending up with values in the graph that do not make sense, for example negative harvest values or a negative number of pre-natal visits. A technical audience would know that the impact of the '''treatment''' can still be read from such a graph, but a non-technical audience would be confused. It is likely rare that the <math>\alpha</math> coefficient shifts so much that harvest or visits become negative, but it can shift far enough from the group's actual average that non-technical readers find the absolute values not credible and then distrust the rest of the [[Data Analysis|analysis]]. That is why the omitted category is represented by its average value of ''y''.
The other categories are represented by the average value of ''y'' in the control group plus the coefficient of the corresponding dummy '''variable''' in the regression. This way the impact is clearly shown (the difference between this value and the omitted category), and since the starting point is the average of ''y'' for the omitted category, the absolute value in the graph is close to the average value of that category.
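The calculation described above can be reproduced by hand for the ''dummy OLS'' case. The sketch below is an illustration and not the code <code>iegraph</code> itself runs. The variable names (''harvest'', ''tmt1'', ''tmt2'', ''hhsize'') are hypothetical, and restricting the control mean to the estimation sample with <code>e(sample)</code> is an assumption.

```stata
* Regression with two treatment dummies and one control variable
* (all variable names are hypothetical)
regress harvest tmt1 tmt2 hhsize

* Average outcome in the omitted category (the control group).
* Using e(sample) to match the regression sample is an assumption.
summarize harvest if tmt1 == 0 & tmt2 == 0 & e(sample)
scalar ctrl_mean = r(mean)

* Each treatment arm: control mean plus the coefficient of its dummy
scalar arm1_value = ctrl_mean + _b[tmt1]
scalar arm2_value = ctrl_mean + _b[tmt2]

display "Control group:   " ctrl_mean
display "Treatment arm 1: " arm1_value
display "Treatment arm 2: " arm2_value
```

The difference between each treatment value and the control value is exactly the regression coefficient, while the control value itself stays at an average a reader can recognize from the data.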
=== List of dummies ===
When using <code>iegraph</code>, you always have to list the '''treatment dummy variables''' (and the time and interaction dummies if you ran a ''diff-in-diff'') as the variable list. The command looks like this: <code>iegraph T1 T2 T3</code>, where T1, T2 and T3 are '''treatment dummies'''. This is the only way that <code>iegraph</code> knows which coefficients are the '''treatment dummies''' and which coefficients are control '''variables''', fixed effects etc. Only the '''treatment dummies''' (and the time and interaction dummies in ''diff-in-diff'') will be displayed in the graph.
<code>iegraph</code> tests that the dummies fit one of the two models this command was developed to work with. The command checks whether one of the two sets of criteria in the table below holds for the dummies. Otherwise, an error is returned (see below the table for the option that disables this test).
{| class="wikitable"
|style="text-align:center; width: 50%" | Dummy OLS
|style="text-align:center; width: 50%" | Diff-in-Diff
|-
|
* Some observations have the value 0 for all '''treatment dummies''' - control observations
* No observation has the value 1 in more than one '''treatment dummy''' - no observation can be in two '''treatment arms'''
* For all '''treatment dummies''', there are at least some observations that have the value 1 - at least some observations in each '''treatment arm'''
|
* Some observations have the value 0 for all dummies - omitted control observations in time = 0
* Some observations must have the value 1 for only the '''treatment dummy''' - '''treatment''' observations in time = 0
* Some observations must have the value 1 for only the time dummy - control observations in time = 1
* Some observations must have the value 1 in all three of the time, '''treatment''' and interaction dummies - '''treatment''' observations in time = 1
* No observation has the value 1 in exactly two dummies or in four or more dummies.
|}
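The ''dummy OLS'' criteria in the table can also be checked by hand before running the command. The sketch below illustrates the three conditions and is not the test code <code>iegraph</code> itself uses. The dummy names ''tmt1'', ''tmt2'' and ''tmt3'' and the helper variable ''n_tmt'' are hypothetical.

```stata
* Number of treatment dummies equal to 1 for each observation
* (tmt1, tmt2, tmt3 and n_tmt are hypothetical names)
egen n_tmt = rowtotal(tmt1 tmt2 tmt3)

* Some observations have the value 0 in all dummies (control observations)
count if n_tmt == 0
assert r(N) > 0

* No observation is in more than one treatment arm
assert n_tmt <= 1

* Every treatment arm has at least some observations
foreach arm of varlist tmt1 tmt2 tmt3 {
    count if `arm' == 1
    assert r(N) > 0
}
```

Each <code>assert</code> stops with an error if its condition fails, which points to the criterion the data does not meet.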
If you want to use this command for something slightly different, you can disable these tests by using the option ''ignoredummytest''. If you have a model other than ''dummy OLS'' or ''diff-in-diff'' that you think this command is a good fit for, please let us know and we will see if we can add it as a supported model. Contact information is available on our [https://github.com/worldbank/ietoolkit GitHub page].
=== Formatting options ===
Many of the formatting options available to [[Stata Coding Practices|Stata's]] ''two-way scatter'' graph can be applied to <code>iegraph</code> by simply adding those options to the <code>iegraph</code> command. Options that should be applied directly to each bar need to be specified in the ''baroption()'' option.
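As an illustration, standard graph options such as <code>title()</code> and <code>ytitle()</code> could be passed along as shown below. This sketch assumes that these two options are among those <code>iegraph</code> accepts, which <code>help iegraph</code> can confirm. The variable names are hypothetical.

```stata
* Hypothetical variables: harvest (outcome), tmt1 tmt2 (treatment dummies),
* hhsize (control variable)
regress harvest tmt1 tmt2 hhsize

* Assumes title() and ytitle() are passed through to the underlying graph
iegraph tmt1 tmt2, title("Harvest value by treatment arm") ytitle("Average harvest value")
```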
Passing options from one command, like '''Stata's''' ''two-way scatter'', through to a user-written command is not always straightforward and can have unintended consequences. For the advanced user, there is an option that helps with debugging. This option is ''norestore'', which tells <code>iegraph</code> not to restore the original '''dataset''' but to leave in memory the one that <code>iegraph</code> prepared to produce the graph (be aware that you will lose any unsaved data when you do this).
Once you have the same '''dataset''' that <code>iegraph</code> uses, you can get the line of code that <code>iegraph</code> uses to generate the graph from the returned macro <code>r(cmd)</code>. If you find any potential improvements or any bugs, please let us know. Contact information is available on our [https://github.com/worldbank/ietoolkit GitHub page].
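A debugging session along these lines might look like the sketch below. The file name and variable names are hypothetical. The returned results are listed immediately after <code>iegraph</code> because any later command that returns results of its own would overwrite them.

```stata
* Save the data first: norestore replaces the data in memory
* (the file name is hypothetical)
save "my_data_backup.dta", replace

* Hypothetical variables: harvest (outcome), tmt1 tmt2 (treatment dummies)
regress harvest tmt1 tmt2 hhsize
iegraph tmt1 tmt2, norestore

* List the returned results right away, including the command in r(cmd)
return list

* The data in memory is now the dataset iegraph prepared for the graph
describe
```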
== Back to Parent ==
Latest revision as of 17:25, 9 August 2023
This article is part of the topic ietoolkit