Data Analysis

Data analysis refers to the full process of exploring and describing trends and results from data. Data analysis typically has two stages:

  1. Exploratory Analysis
  2. Final Analysis

In exploratory analysis, the emphasis is on producing easily understood summaries of the trends in the data, so that the reports, publications, presentations, and summaries that need to be produced can begin to be outlined. Once those stories begin to come together, the code is rewritten in a "final" form that is appropriate for public release alongside the results.

Preparing the Dataset for Analysis

Once data is collected, it must be recombined into a final format for analysis, including the construction of derived variables not present in the initial collection. See Data Cleaning.

Organizing Analysis Files

Analysis programs that are exploratory in nature should be held in an "exploratory" folder and separated according to topic. Particularly when folders are synced over Dropbox or GitHub, separating these files by function (rather than combining them into a single "analysis" file) allows multiple researchers to work simultaneously and modularly.

When the final analysis workflow is agreed upon for a given publication or other output, a final analysis file should be collated for that output only in the "final" analysis folder. This allows selective reuse of the code from the exploratory analyses, in preparation for the final release of the code if required. It also means that any collaborator, referee, or replicator can access only the code used to prepare the final outputs and reproduce them exactly.

Outputting Analytical Results

Since the final analysis do-files are intended to be fully replicable, and the code itself is considered a vital, shareable output, all tables and figures should be created in such a way that the files are ordered, named, placed, and formatted appropriately. Running the analysis do-file should leave only the necessary files in the "outputs" folder, with names like "figure_1.png", "table_1.xlsx", and so on.

For some applications (such as creating internal presentations or simple Word reports), file types like PNG and XLSX are sufficiently functional. For larger projects with multiple collaborators, particularly when syncing over GitHub, plaintext file types such as EPS, CSV, and TEX are the preferred formats. At minimum, the analysis file should produce tables and figures such that no further mathematical calculations are required; they should furthermore be organized and formatted as nearly to the published versions as possible. For figures, this is typically easy to do with an appropriate graph export command in Stata or the equivalent. LaTeX is a particularly powerful tool for doing this with tables. DIME provides several guides on both processes. See exporting analysis results for more details and more resources.
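As a rough illustration of the plaintext-output idea, the sketch below writes one already-formatted table to both a CSV file and a LaTeX tabular fragment in an outputs folder. It is written in Python as one possible "equivalent" to a Stata export step, not as DIME's recommended tooling; the function name export_table and its arguments are hypothetical, and the sketch assumes cell values contain no characters that need LaTeX escaping.

```python
import csv
from pathlib import Path


def export_table(rows, header, out_dir, stem):
    """Write one table as <stem>.csv and <stem>.tex inside out_dir.

    rows   : list of rows, each a list of already-formatted cell values
    header : list of column titles
    Hypothetical helper; cells are assumed to be LaTeX-safe strings.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # Plaintext CSV copy, easy to diff under version control.
    csv_path = out_dir / f"{stem}.csv"
    with csv_path.open("w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)

    # LaTeX fragment: first column left-aligned, the rest right-aligned.
    column_spec = "l" + "r" * (len(header) - 1)
    tex_lines = [r"\begin{tabular}{" + column_spec + "}", r"\hline"]
    tex_lines.append(" & ".join(header) + r" \\")
    tex_lines.append(r"\hline")
    for row in rows:
        tex_lines.append(" & ".join(str(cell) for cell in row) + r" \\")
    tex_lines += [r"\hline", r"\end{tabular}"]

    tex_path = out_dir / f"{stem}.tex"
    tex_path.write_text("\n".join(tex_lines) + "\n")
    return csv_path, tex_path
```

A call such as export_table(rows, header, "outputs", "table_1") would then leave "table_1.csv" and "table_1.tex" in the outputs folder, matching the naming pattern described above. Because all numbers are formatted before export, no further calculation is needed downstream.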

Resources for Specific Analytical Tasks

Spatial/GIS Analysis

Spatial analysis involves using geospatial information from your data to explore relationships mediated by proximity or connectivity. This can be descriptive (such as map illustrations) or informative (such as distance to and quality of the nearest road).
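To make the "distance to the nearest road" example concrete, here is a minimal Python sketch that computes great-circle distance with the haversine formula and takes the minimum over a set of road coordinates. It assumes roads are approximated by a list of (latitude, longitude) points in decimal degrees, which is a simplification; a full GIS workflow would measure distance to road line segments. The function names are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, a common approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def nearest_road_km(household, road_points):
    """Distance from one (lat, lon) household to the closest road point.

    Assumes road_points is a non-empty list of (lat, lon) tuples that
    approximate the road network.
    """
    lat, lon = household
    return min(haversine_km(lat, lon, r_lat, r_lon)
               for r_lat, r_lon in road_points)
```

The resulting distance can be stored as a derived variable for each observation and used like any other covariate in the analysis.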

Randomization Inference

Randomization inference techniques replace the "normal" p-values from regression analyses with values based on the treatment assignment methodology. They are generally recommended for reporting in experiments where the estimates are of a randomly assigned treatment whose assignment was controlled by the implementer and researcher.
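The idea can be sketched in a few lines of Python: re-draw the treatment assignment many times, recompute the estimate under each placebo assignment, and report the share of draws at least as extreme as the observed estimate. This sketch assumes simple complete randomization (a fixed number of treated units, with no strata or clusters) and uses a difference in means as the test statistic; in practice the re-draws should mirror the actual assignment procedure. The function names, the number of draws, and the seed are illustrative choices.

```python
import random


def diff_in_means(outcomes, treated):
    """Mean outcome of treated units minus mean outcome of control units."""
    t = [y for y, d in zip(outcomes, treated) if d]
    c = [y for y, d in zip(outcomes, treated) if not d]
    return sum(t) / len(t) - sum(c) / len(c)


def ri_p_value(outcomes, treated, n_draws=2000, seed=12345):
    """Two-sided randomization-inference p-value for a difference in means.

    Assumes complete randomization: each re-draw shuffles the observed
    assignment vector, keeping the number of treated units fixed.
    """
    rng = random.Random(seed)  # fixed seed so the result is replicable
    observed = diff_in_means(outcomes, treated)
    assignment = list(treated)
    extreme = 0
    for _ in range(n_draws):
        rng.shuffle(assignment)
        placebo = diff_in_means(outcomes, assignment)
        # Small tolerance guards against floating-point ties.
        if abs(placebo) >= abs(observed) - 1e-12:
            extreme += 1
    return observed, extreme / n_draws
```

The returned p-value is the fraction of placebo assignments that produce an estimate at least as large in absolute value as the one actually observed, so it depends only on the assignment mechanism rather than on distributional assumptions about the error term.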

Heterogeneous Effects Analysis

Cost Effectiveness Analysis

Cost-effectiveness analysis is the economic analysis of the costs and benefits of the project studied in an impact evaluation.

Regression Discontinuity Analysis

Here is a practical guide for analyzing regression discontinuity studies.

Data Visualization

Data visualization is a critical step in effectively communicating your research results.

Additional Resources

  • The Stata cheat sheet on data analysis is a useful reminder of relevant Stata code. The cheat sheet on [1] is a good resource for more advanced analytical tasks in Stata.