Data Analysis

Read First

Data analysis typically has two stages:

  1. Exploratory Analysis
  2. Final Analysis


In exploratory analysis, the emphasis is on producing easily understood summaries of the trends in the data, so that the reports, publications, presentations, and summaries the project needs to produce can begin to be outlined. Once those stories come together, the code is rewritten in a "final" form appropriate for public release alongside the results.

Preparing the Dataset for Analysis

Once data is collected, it must be recombined into a final format for analysis, including the construction of derived variables not present in the initial collection. See Data Cleaning.
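
As a minimal sketch, this preparation step might look like the following in Stata; the file names, the merge structure, and the derived variable are hypothetical placeholders for the project's own:

  use "data/raw_survey.dta", clear
  merge 1:1 hhid using "data/raw_admin.dta", nogenerate

  * Construct a derived variable not present in the initial collection
  generate income_per_capita = total_income / hh_size

  save "data/analysis_data.dta", replace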

Organizing Analysis Files

Analysis programs that are exploratory in nature should be kept in an "exploratory" folder and separated by topic. Particularly when folders are synced over Dropbox or GitHub, separating these files by function (rather than combining them into a single "analysis" file) allows multiple researchers to work simultaneously and modularly.

When the final analysis workflow for a given publication or other output is agreed upon, a final analysis file for that output alone should be collated in the "final" analysis folder, as sketched below. This allows selective reuse of code from the exploratory analyses in preparation for the final release of the code if required, so that any collaborator, referee, or replicator can access only the code used to prepare the final outputs and reproduce them exactly.
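
One simple way to collate the final analysis is a short master do-file that runs the final scripts in order; the folder layout and file names below are hypothetical:

  * Master do-file for one publication's outputs (illustrative paths)
  global outputs "outputs"

  do "final/01_prepare_data.do"   // construct the analysis dataset
  do "final/02_figures.do"        // export figure_1.png, figure_2.png, ...
  do "final/03_tables.do"         // export table_1.xlsx, table_2.xlsx, ...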

Outputting the Result of the Analysis

Since the final analysis do-files are intended to be fully replicable, and the code itself is considered a vital, shareable output, all tables and figures should be created in such a way that the files are ordered, named, placed, and formatted appropriately. Running the analysis do-file should leave only the necessary files in the "outputs" folder, with names like "figure_1.png", "table_1.xlsx", and so on.
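
For example, a final do-file might export outputs directly under those names, as in the sketch below; the dataset, variables, and the $outputs global are hypothetical:

  use "data/analysis_data.dta", clear

  * Figure 1: outcome by treatment arm
  graph bar outcome, over(treatment)
  graph export "$outputs/figure_1.png", replace

  * Table 1: group means, exported directly to Excel
  preserve
  collapse (mean) outcome covariate, by(treatment)
  export excel using "$outputs/table_1.xlsx", firstrow(variables) replace
  restore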

For some applications (such as creating internal presentations or simple Word reports), file types like PNG and XLSX are sufficiently functional. For larger projects with multiple collaborators, particularly when syncing over a GitHub service, plaintext file types such as EPS, CSV, and TEX are the preferred formats. Tables and figures should at a minimum be produced by this file such that no further mathematical calculations are required; they should furthermore be organized and formatted as closely to the published versions as possible. For figures, this is typically easy to do using an appropriate graph export command in Stata or the equivalent; LaTeX is a particularly powerful tool for doing the same with tables. DIME provides several guides on both processes.
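
As an illustration of the plaintext workflow, the same outputs could be written as EPS and TEX instead; the esttab command below comes from the user-written estout package (ssc install estout), and the file and variable names remain hypothetical:

  use "data/analysis_data.dta", clear

  * Figure 1 as a plaintext-friendly EPS file
  graph bar outcome, over(treatment)
  graph export "$outputs/figure_1.eps", replace

  * Table 1 as a publication-formatted LaTeX table
  regress outcome treatment covariate
  esttab using "$outputs/table_1.tex", replace booktabs label se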

Resources for Specific Analytical Tasks

Heterogeneous Effects Analysis

Randomization Inference

Principal Components Analysis

Principal Components Analysis (PCA) is an analytical tool that seeks to explain the maximum amount of variance in a set of variables with the smallest number of principal components.
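
A minimal sketch of PCA in Stata is shown below, using the built-in pca command on a set of hypothetical asset indicators and keeping the first component as a single index:

  use "data/analysis_data.dta", clear

  * Run PCA and inspect the variance explained by each component
  pca asset_radio asset_tv asset_bicycle asset_phone

  * Keep the first principal component as a single index variable
  predict pc_index, score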

Cost-Effectiveness Analysis

Additional Resources