== Read First ==
Data analysis typically has two stages:
# Exploratory Analysis
# Final Analysis

In exploratory analysis, the emphasis is on producing easily understood summaries of the trends in the data, so that the reports, publications, presentations, and summaries that need to be produced can begin to be outlined. Once those stories come together, the code is rewritten in a "final" form appropriate for public release alongside the results.

== Preparing the Dataset for Analysis ==
Once data is collected, it must be recombined into a final format for analysis, including the construction of derived variables not present in the initial collection. See [[Data Cleaning]].
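For example, a derived variable might be constructed during dataset preparation as in the minimal sketch below; the file and variable names here are hypothetical.

<pre>
* Constructing a derived variable not present in the raw data
use "data/raw_survey.dta", clear
egen asset_count = rowtotal(owns_radio owns_tv owns_bicycle)
label variable asset_count "Number of assets owned (derived)"
save "data/analysis_data.dta", replace
</pre>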
== Organizing Analysis Files ==
Analysis programs that are exploratory in nature should be held in an "exploratory" folder and separated according to topic. Particularly when folder syncing over [[Dropbox]] or [[GitHub]] is being used, separating these files by function (rather than combining them into a single "analysis" file) allows multiple researchers to work simultaneously and modularly.

When the final analysis workflow is agreed upon for a given publication or other output, a final analysis file should be collated for that output only, in the "final" analysis folder. This allows selective reuse of code from the exploratory analyses in preparation for the final release of the code, if required, and lets any collaborator, referee, or replicator access only the code used to prepare the final outputs and reproduce them exactly. One possible folder layout is sketched below.
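A hypothetical layout of this structure (the individual file names are illustrative only):

<pre>
project/
├── exploratory/          # one do-file per topic, edited in parallel
│   ├── balance_checks.do
│   └── attrition.do
├── final/                # collated do-file(s) for each final output
│   └── final_analysis.do
└── outputs/              # only the files the final do-file produces
    ├── figure_1.png
    └── table_1.xlsx
</pre>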


== Outputting the Results of the Analysis ==
Since the final analysis do-files are intended to be fully replicable, and the code itself is considered a vital, shareable output, all tables and figures should be created in such a way that the files are ordered, named, placed, and formatted appropriately. Running the analysis do-file should result in ''only'' the necessary files in the "outputs" folder, with names like "figure_1.png", "table_1.xlsx", and so on. A minimal sketch of such a do-file follows.
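The sketch below illustrates this pattern; the paths and variable names are hypothetical, not a prescribed template.

<pre>
* final_analysis.do -- a minimal sketch
use "data/analysis_data.dta", clear

* Figure 1: export directly to the outputs folder, so no manual steps remain
histogram income
graph export "outputs/figure_1.png", replace width(2000)

* Table 1: write summary statistics straight to an ordered, named file
preserve
collapse (mean) income age hhsize
export excel using "outputs/table_1.xlsx", replace firstrow(variables)
restore
</pre>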


For some applications (such as creating internal presentations or simple Word reports), file types like PNG and XLSX are sufficiently functional. For larger projects with multiple collaborators, particularly when syncing over a [[GitHub]] service, plaintext file types such as EPS, CSV, and TEX are the preferred formats. Tables and figures should at minimum be produced by the final analysis file such that no further mathematical calculations are required; they should furthermore be organized and formatted as closely to the published versions as possible. For figures, this is typically easy to achieve with an appropriate <code>graph export</code> command in Stata or its equivalent. For tables, [[LaTeX]] is a particularly powerful tool, and the Stata command [[estout]] can output the results of multiple estimations to a single file. DIME provides several guides on both processes.
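As one illustration of a plaintext workflow, the sketch below uses Stata's built-in auto dataset and hypothetical regressions to export a figure as EPS and write two estimations into a single LaTeX table via the user-written <code>estout</code> package.

<pre>
* Requires the user-written estout package: ssc install estout
sysuse auto, clear                      // built-in example dataset

* Figure as a plaintext-friendly EPS file
scatter price mpg
graph export "outputs/figure_1.eps", replace

* Two estimations written into one LaTeX table file
eststo clear
eststo: regress price mpg
eststo: regress price mpg weight
esttab using "outputs/table_1.tex", replace booktabs label se
</pre>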


== Resources for Specific Analytical Tasks ==


=== Heterogeneous Effects Analysis ===

=== Randomization Inference ===

=== Principal Components Analysis ===
[[Principal Components Analysis]] (PCA) is an analytical tool that seeks to explain the maximum amount of variance with the fewest principal components.
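As a minimal illustration in Stata, using the built-in auto dataset:

<pre>
sysuse auto, clear

* Extract the first two principal components from four related variables
pca price mpg weight length, components(2)

* Inspect how much variance each component explains
screeplot

* Save the component scores as new variables for later analysis
predict pc1 pc2, score
</pre>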


=== Cost Effectiveness Analysis ===


See [[Cost-effectiveness Analysis]].
== Additional Resources ==
* DIME's [https://github.com/worldbank/DIME-LaTeX-Templates LaTeX templates and training materials], which assume no prior knowledge of LaTeX and explain the workflow from software such as Stata and R to final reports.