<onlyinclude>
After [[Data Analysis | analyzing data]] and before disseminating results, research teams must export analyses. This article discusses formats for exporting, explains how to make [[Reproducible Research | replicable]] [[Checklist: Submit Table | tables]] and [[Checklist: Reviewing Graphs | graphs]] via [[Stata Coding Practices | Stata]] and LaTeX, outlines the four levels of replicability, and comments on version control.
</onlyinclude>


== Read First ==
*For full replicability, ensure that all results (e.g. tables, graphs) are generated by code and exported in final form to the final report. For good replicability, minimal changes should be made.
*Never manually copy and paste from a Stata or R window to a file saved to disk. This is sloppy, poor practice, and never considered replicable.
*Stata’s <code>estout</code> command package has a wide and useful range of capabilities for producing replicable summary statistics and regression tables.
*When trying multiple analytical approaches, make sure to use version control via code and/or results.


== Formatting ==
Formatting requirements depend on the audience. For example, best practices for communicating results to project beneficiaries or government counterparts are different than those for communicating results to the academic research community.


=== Policy Output ===
Fact sheets can effectively disseminate [[Randomized_Control_Trials | RCT]] results to government counterparts and local communities. While regression [[Checklist: Submit Table | tables]] formatted according to journal standards will obviously not work well in this context, [[Data visualization | data visualizations]] could help to portray findings and takeaways.  


=== Academic Output ===
For academic output, make sure to follow established guidelines, such as those in American Economic Review’s [https://www.aeaweb.org/journals/aer/submissions/accepted-articles/styleguide style guide] and in ShareLaTeX's collection of [https://www.sharelatex.com/templates/journals style guides]. Note that the best tool for excellent-looking and easily-reproducible tables is [[Software_Tools#LaTeX | LaTeX]], or its web-based collaboration tool, [[Collaboration_Tools#Overleaf | Overleaf]]. LaTeX has many features that allow you to produce tables that look exactly like the tables published in top journals. For more details on how to use LaTeX, see DIME’s [https://github.com/worldbank/DIME-LaTeX-Templates LaTeX Training], which has multiple stages that target the absolute beginner as well as the experienced user.


== Exporting Tables ==


To create replicable outputs, produce and complete all tables by code in Stata or R. For example, extra statistics such as test statistics and means should be added by code before exporting the table. If multiple tables are to be combined into one table, this should be done by code as well. <code>estout</code>, which is a package of commands that also includes <code>esttab</code>, <code>eststo</code>, <code>estadd</code> and <code>estpost</code>, provides the functionality to accomplish this. While the <code>estout</code> commands may be a bit difficult to get started with, they have a wide and useful range of capabilities. <code>estadd</code>, for example, allows users to add additional statistics to a table (e.g. the predicted Y, the mean or N of a sub-sample, or any other number that you have calculated).


The easiest way to get started with <code>estout</code> might be to copy the code used to generate a table that you like. Recycling code is actually the most common way to use the <code>estout</code> commands, even for users familiar with the package.
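
As a minimal sketch of this workflow (using Stata's built-in <code>auto</code> dataset as a stand-in for project data; the <code>results</code> folder, variable choices and the statistic name <code>ymean</code> are placeholders), the code below stores two regressions with <code>eststo</code>, adds the mean of the dependent variable with <code>estadd</code>, and exports a .tex table with <code>esttab</code>:

<pre>
* Minimal sketch: the built-in auto data stands in for project data; paths are placeholders
sysuse auto, clear
eststo clear

* Model 1: run the regression, add the mean of the dependent variable, then store
regress price mpg
summarize price if e(sample)
estadd scalar ymean = r(mean)
eststo model1

* Model 2: the same steps with an additional control
regress price mpg weight
summarize price if e(sample)
estadd scalar ymean = r(mean)
eststo model2

* Export both columns, including the added statistic, to a LaTeX table
esttab model1 model2 using "results/regression_table.tex", replace booktabs se ///
    star(* 0.10 ** 0.05 *** 0.01)                                             ///
    stats(ymean N r2, labels("Mean of dep. var." "Observations" "R-squared"))
</pre>

Note that <code>estadd</code> is run before <code>eststo</code>, since it modifies the estimation results currently in memory and <code>eststo</code> stores a copy of whatever is active at that moment.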


== Levels of Replicability ==
We all know that all our work and outputs should be [[Reproducible Research | replicable]], but exactly how replicable does something need to be? For example, if a report has a table outputted by code but formatted manually, is the report replicable? This section outlines the four levels of replicability: ''full replicability'', ''good replicability'', ''basic replicability'' and ''no replicability''. ''Full replicability'' is the ideal and ''no replicability'' is never acceptable.  


=== Full Replicability ===
In ''full replicability'', all results are generated by code and exported in final form to the final report. No formatting or any other type of editing occurs between running the code and the tables appearing in the report.


While new [[Software Tools | software tools]] are emerging, [https://www.latex-project.org/ LaTeX] very effectively creates fully replicable outputs. DIME Analytics highly recommends LaTeX (or its web-based collaboration tool, Overleaf), as it is more comprehensive and better supported by online resources than any of the newer competing tools. To import tables and graphs into a report written in LaTeX, they must be exported in LaTeX or a LaTeX-readable format. For graphs, this means a format such as .png or .pdf; in Stata, use <code>graph export</code> with the corresponding file extension. For tables, this requires a .tex format; in Stata, if exporting tables through the <code>estout</code> family, simply specify a file name ending in .tex. For more options, see [[Exporting_Analysis#Additional_Resources | Additional Resources]].
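
For illustration, here is a minimal, hedged sketch of the LaTeX side (file names such as <code>results/regression_table.tex</code> and <code>graphs/figure1.png</code> are placeholders, and the exact wrapper needed depends on how the table was exported):

<pre>
\documentclass{article}
\usepackage{graphicx}   % needed for \includegraphics
\usepackage{booktabs}   % matches the booktabs option used when exporting with esttab

\begin{document}

\begin{table}[htbp]
  \centering
  \caption{Main results}
  \input{results/regression_table.tex}  % table exported from Stata
\end{table}

\begin{figure}[htbp]
  \centering
  \includegraphics[width=0.8\textwidth]{graphs/figure1.png}  % graph exported from Stata
  \caption{Combined figure}
\end{figure}

\end{document}
</pre>

Because the report only points to the exported files, re-running the Stata code and recompiling the document updates every number in the report without any manual steps.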


=== Good Replicability ===
In ''good replicability'', all results are produced by code and saved to files on disk, and no copying and pasting of results is needed between files. However, formatting and other minor changes may be needed, and/or the final tables may need to be copied and pasted into the final report. All DIME projects should aim for at least ''good replicability''. While ''full replicability'' is objectively better, its model may not work for all teams due to, for example, external collaborators who are not able or willing to work in the tools required for ''full replicability''.


To create graphs of ''good replicability'', use the <code>saving()</code> option included in Stata's graph commands, or <code>graph export</code> to write the graph to disk in another format. However, if several graphs are supposed to be combined into one figure, it is not ''good replicability'' to combine them manually or to simply place them next to each other in the report. For ''good replicability,'' these graphs should be combined in the code. In Stata, <code>graph combine</code> can accomplish this.
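
As a minimal sketch (again using the built-in <code>auto</code> data; the graph names and the <code>graphs</code> folder are placeholders), the panels are created as named graphs and combined in code rather than by hand:

<pre>
* Minimal sketch: the built-in auto data stands in for project data; paths are placeholders
sysuse auto, clear

* Create each panel as a named graph in memory
histogram price, name(panel_price, replace) title("Vehicle price")
histogram mpg,   name(panel_mpg, replace)   title("Mileage")

* Combine the panels in code, save the .gph to disk, and export a .png for the report
graph combine panel_price panel_mpg, rows(1) saving("graphs/figure1.gph", replace)
graph export "graphs/figure1.png", replace
</pre>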
 


=== Basic Replicability ===
In ''basic replicability'', the code generates and outputs all graphs and tables to the project folder, though additional changes are made to the outputs. For example, the research team may copy or paste between files to create the tables, or apply some very basic math to the outputted files. This is not best practice, but it is the minimum acceptable level of replicability.  
 
To create basically replicable graphs, simply use the <code>saving()</code> option included in Stata's graph commands (or <code>graph export</code> for formats other than .gph). This even satisfies ''good replicability'' for many graphs. For tables, there is no single built-in option to reach ''basic replicability''. Common commands to output results include <code>outreg</code> and <code>estout</code>, outlined above. To test that all tables and graphs are exported with ''basic replicability'', move all table and graph files to a separate folder and run the code again. Make sure that all table and graph files are re-created in the project folder and that it is possible to perform the minor manual steps required to generate the final tables and graphs from these files.
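
As a minimal, hypothetical sketch of this level (built-in <code>auto</code> data; the <code>Output</code> folder is a placeholder), every output is written to the project folder by the code itself, so moving the files away and re-running the do-file recreates them:

<pre>
* Minimal sketch: every output is written to disk by code; nothing is copied from the results window
sysuse auto, clear

* Graph saved to the project folder with the saving() option
histogram price, saving("Output/price_hist.gph", replace)

* Regression table written to the project folder with esttab
regress price mpg
esttab using "Output/price_on_mpg.rtf", replace se r2
</pre>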


=== No Replicability ===


Any outputs (e.g. tables, graphs) that require manual copying and pasting from a Stata or R window to a file saved to disk are never considered replicable. This level of replicability is never acceptable for published outputs, regardless of the size or importance of the report. While ''no replicability'' may be acceptable during, for example, exploratory analysis, as soon as output is produced for someone else - even within the team - it should be done with a higher level of replicability. Since [[Data Analysis | analysis]] should eventually be shown to someone, we strongly recommend that you aim for a higher level of replicability from the start, since it will save you time later.


== Version control ==


Very often, research teams try multiple approaches before the Principal Investigator decides which to use for the final [[Data Analysis | analysis]]. This should be minimized, as it could otherwise be regarded as p-hacking, but to some degree it is a necessary process, since what the team learns about the data affects what analysis is required. During this phase, ensure version control: a way to go back to previous results in order to compare the different versions. Version control can occur via code, results or both. Version control via code and [[Getting started with GitHub | GitHub]] is the only way to achieve version control in the full sense of the term. Version control via results is not version control in the full sense of the term, but it satisfies the basic need discussed here; if using version control via results, make sure to date the files. Version control of both simply combines the two approaches.
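
One simple, hedged way to date result files in Stata (the file names are placeholders, and the sketch assumes stored estimates such as <code>model1</code> and <code>model2</code> from the example above) is to build today's date into the exported file name, so earlier versions are never overwritten:

<pre>
* Minimal sketch: stamp exported results with today's date so previous versions are kept
local today : display %tdCCYY-NN-DD date(c(current_date), "DMY")
esttab model1 model2 using "results/regression_table_`today'.tex", replace booktabs
</pre>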


== Back to Parent ==
This article is part of the topic [[Data Analysis | Data Analysis]].
== Additional Resources ==


*DIME Analytics' [https://github.com/worldbank/DIME-Resources/blob/master/stata2-6-descriptives.pdf Descriptive Statistics: Creating Tables].
*DIME Analytics’ [https://github.com/worldbank/DIME-LaTeX-Templates LaTeX Training]


[[Category: Data Analysis ]]
