Data analysis refers to the process of exploring and describing trends and results from data. This page outlines guidelines for implementing analysis, organizing analysis files, and outputting analytical results.


<onlyinclude>
==Read First==
* Any data used for analysis must be [[Data Cleaning | clean]].
* Place exploratory analysis files in a separate, well-organized "exploratory" folder.
* When the final analysis workflow is agreed upon for a given publication or other output, a final analysis file should be collated for that output only in the "final" analysis folder.
* When outputting analytical results, all tables and figures should be created via replicable do-files in such a way that the result files are ordered, named, placed, and formatted appropriately.


Data analysis typically occurs in two stages: exploratory analysis and final analysis. Exploratory analysis seeks to summarize trends in the data so that the research team can begin to outline the reports, publications, presentations, and summaries. Later, in final analysis, the methods are finalized and the code is re-written in manner appropriate for public release with the results.
</onlyinclude>


== Implementing Analysis ==


The Stata cheat sheet on [http://geocenter.github.io/StataTraining/pdf/StataCheatSheet_analysis_201615_June-REV.pdf data analysis] is a useful reminder of relevant Stata code. The cheat sheet on [http://geocenter.github.io/StataTraining/pdf/StataCheatSheet_programming15_2016_June_TE-REV.pdf Stata programming] is a good resource for more advanced analytical tasks in Stata. Below follows a list of resources on specific analytical methods. This list is by no means exhaustive.
 
*[[Spatial Analysis |Spatial/GIS Analysis]] uses geospatial data to explore relationships mediated by proximity or connectedness. This can be descriptive (e.g. map illustrations) or informative (e.g. distance to and quality of the nearest road).
*[[Randomization Inference | Randomization inference]] techniques replace the "normal" p-values from regression analyses with values based on the treatment assignment methodology. They are generally recommended for reporting in experiments where treatment was randomly assigned under the control of the implementer and researcher.
*[[Cost-effectiveness Analysis | Cost-effectiveness analysis]] compares the cost and effectiveness per unit of a given program to determine whether the value of an intervention justifies its cost.
*[[Regression Discontinuity|Regression discontinuity]] analysis is a [[Quasi-Experimental Methods | quasi-experimental]] impact evaluation design that estimates the causal effects of interventions using a threshold (cutoff point) above and below which treatment is assigned. Cattaneo, Idrobo, and Titiunik offer [http://www-personal.umich.edu/~cattaneo/books/Cattaneo-Idrobo-Titiunik_2018_CUP-Vol2.pdf a practical guide] to analyzing regression discontinuity studies.
*[[Propensity Score Matching | Propensity score matching]] is another quasi-experimental impact evaluation technique that estimates the effects of a treatment by matching control group participants to treatment group participants based on propensity scores.
*Heterogeneous Effects Analysis uses an [https://web.stanford.edu/~jgrimmer/het.pdf ensemble of methods] to understand how effects vary across sub-populations.
*[[Data visualization]] is a critical step in effectively communicating your research results.
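The logic behind randomization inference can be illustrated with a short permutation test. The sketch below is in Python purely for illustration (in Stata, a user-written command such as <code>ritest</code> plays a similar role); the function name and data are hypothetical. It re-randomizes the treatment labels many times and reports how often the permuted difference in means is at least as extreme as the observed one.

```python
import random

def permutation_pvalue(outcomes, treated, n_perm=1000, seed=0):
    """Randomization-inference p-value for a difference in means.

    Instead of relying on "normal" regression p-values, re-randomize
    the treatment labels and count how often the permuted difference
    in means is at least as extreme as the observed one (two-sided).
    """
    rng = random.Random(seed)

    def diff_in_means(labels):
        treat = [y for y, d in zip(outcomes, labels) if d]
        ctrl = [y for y, d in zip(outcomes, labels) if not d]
        return sum(treat) / len(treat) - sum(ctrl) / len(ctrl)

    observed = diff_in_means(treated)
    labels = list(treated)
    count = 0
    for _ in range(n_perm):
        rng.shuffle(labels)  # preserves the number of treated units
        if abs(diff_in_means(labels)) >= abs(observed):
            count += 1
    return count / n_perm
```

In a real project the permutation step would reproduce the actual treatment assignment methodology (e.g. re-randomizing within strata), not a simple shuffle.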


== Organizing Analysis Files ==


Analysis programs that are exploratory in nature should be held in an "exploratory" folder and separated according to topic. Particularly when folder syncing over [https://www.dropbox.com Dropbox] or [https://www.github.com Github] is being used, separating these files by function (rather than combining them into a single "analysis" file) allows multiple researchers to work simultaneously and modularly.
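As a concrete sketch of this layout, the snippet below scaffolds such a folder structure in Python (the language is incidental; this could equally be done by hand or from a master do-file). The topic names are invented for the example and are not a required convention.

```python
from pathlib import Path

def scaffold_analysis(project_root):
    """Create the exploratory/final analysis split described above.

    Folder and file names are illustrative only: exploratory do-files
    are separated by topic, while the final folder stays empty until
    a per-output analysis file is collated.
    """
    layout = {
        "analysis/exploratory": ["descriptives.do", "balance_checks.do", "attrition.do"],
        "analysis/final": [],   # collated, per-output scripts are added here later
        "outputs": [],          # final tables and figures land here
    }
    root = Path(project_root)
    for folder, files in layout.items():
        (root / folder).mkdir(parents=True, exist_ok=True)
        for name in files:
            (root / folder / name).touch()
    return root
```

Keeping one file per topic, rather than one monolithic "analysis" file, is what lets collaborators sync changes without constant merge conflicts.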


When the final analysis workflow is agreed upon for a given publication or other output, a final analysis file should be collated for that output only in the "final" analysis folder. This allows selective reuse of the code from the exploratory analyses, in preparation for the final release of the code if required. This allows any collaborator, referee, or replicator to access only the code used to prepare the final outputs and [[Reproducible Research | reproduce]] them exactly.


== Outputting Analytical Results ==


Since the final analysis do-files are intended to be fully replicable, and the code itself is considered a vital, shareable output, all tables and figures should be created in such a way that the files are ordered, [[Naming Conventions | named]], placed, and formatted appropriately. Running the analysis do-file should result in ''only'' the necessary files in the "outputs" folder, with names like "figure_1.png", "table_1.xlsx", and so on.
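The "only necessary files" rule can be enforced mechanically at the end of a run. The hypothetical helpers below (Python, for illustration) build the expected ordered file names and then delete anything else from the outputs folder, so a fresh run leaves exactly the intended set.

```python
from pathlib import Path

def expected_outputs(n_figures, n_tables):
    """Ordered output names in the figure_1.png / table_1.xlsx style."""
    return ([f"figure_{i}.png" for i in range(1, n_figures + 1)]
            + [f"table_{i}.xlsx" for i in range(1, n_tables + 1)])

def prune_outputs(outputs_dir, keep):
    """Delete any file in the outputs folder that the run did not produce,
    and return the sorted names that remain."""
    out = Path(outputs_dir)
    out.mkdir(parents=True, exist_ok=True)
    keep = set(keep)
    for f in out.iterdir():
        if f.is_file() and f.name not in keep:
            f.unlink()
    return sorted(p.name for p in out.iterdir())
```

A replicator can then compare the folder listing against the expected list and immediately see whether the run reproduced every output and nothing more.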
 
For some applications (e.g. creating internal presentations or simple Word reports), file types like PNG and XLSX are sufficiently functional. For larger projects with multiple collaborators, particularly when syncing over a [https://www.github.com GitHub] service, plaintext file types such as EPS, CSV, and TEX are the preferred formats. At a minimum, tables and figures should be produced by the analysis do-file such that no further mathematical calculations are required. They should furthermore be organized and formatted as closely to the published versions as possible. For figures, this is typically easy to achieve with an appropriate <code>graph export</code> command in Stata or the equivalent. [https://www.latex-project.org LaTeX] is a particularly powerful tool for doing this with tables. DIME provides several guides on both processes. See [[Exporting Analysis]] for more details and more resources.
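Once the estimates are fully computed, writing them into a LaTeX <code>tabular</code> is just string formatting. The minimal sketch below uses Python for illustration (in Stata, exporting commands such as <code>esttab</code> or <code>outreg2</code> fill the same role); the function, headers, and numbers are hypothetical.

```python
def to_latex_table(rows, header):
    """Render fully computed results as a LaTeX tabular.

    Each row is (label, number, number, ...). Because the numbers
    arrive already computed, the output file needs no further math:
    it can be \input directly into the paper.
    """
    cols = "l" + "r" * (len(header) - 1)
    lines = [f"\\begin{{tabular}}{{{cols}}}",
             "\\hline",
             " & ".join(header) + r" \\",
             "\\hline"]
    for row in rows:
        cells = [row[0]] + [f"{v:.3f}" for v in row[1:]]
        lines.append(" & ".join(cells) + r" \\")
    lines += ["\\hline", "\\end{tabular}"]
    return "\n".join(lines)
```

Writing the table as plaintext TEX rather than XLSX also makes diffs meaningful when the file is version-controlled.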


== Additional Resources ==
* The Stata cheat sheet on [http://geocenter.github.io/StataTraining/pdf/StataCheatSheet_analysis_201615_June-REV.pdf data analysis] is a useful reminder of relevant Stata code. The cheat sheet on [http://geocenter.github.io/StataTraining/pdf/StataCheatSheet_programming15_2016_June_TE-REV.pdf Stata programming] is a good resource for more advanced analytical tasks in Stata.




[[Category: Data Analysis ]]
