Data analysis is the process of exploring and describing trends and results from data. Data analysis typically occurs in two stages: exploratory analysis and final analysis. This page provides guidance on how to organize analysis files and output results in an orderly and [[Reproducible Research | reproducible]] manner.


==Read First==
* Always [[Data Cleaning | clean]] data before conducting data analysis.
* Place exploratory analysis files in a separate, well-organized "exploratory" folder; place the final analysis file in the "final"  analysis folder.
* Create tables and figures via [[Reproducible Research|replicable]] '''do-files''' in such a way that the result files are ordered, named, placed, and formatted appropriately.


==Exploratory vs. Final Analysis==


Exploratory analysis focuses on producing easily understood summaries of the trends in the data so that researchers can begin to outline reports, [[Publishing Data|publications]], presentations, and summaries. Final analysis is the fine-tuned culmination of exploratory analysis and requires rewritten code that is appropriate for public release alongside the results.


== Organizing Analysis Files ==
Place exploratory analysis programs in an "exploratory" folder, separated according to topic. When syncing folders over [https://www.dropbox.com Dropbox] or [[Getting Started with GitHub|GitHub]], separating these files by function (rather than combining them into a single "analysis" file) allows multiple researchers to work simultaneously and modularly.


When the final analysis workflow is agreed upon for a given publication or other output, collate a final analysis file for that output alone in the "final analysis" folder. This allows selective reuse of the code from the exploratory analyses in preparation for the final release of the code, if required, and lets any collaborator, referee, or replicator access only the code used to prepare the final outputs and [[Reproducible Research | reproduce]] them exactly.
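A master '''do-file''' can tie these folders together by running each analysis script in order. The sketch below is illustrative only: the folder layout and file names (<code>main.do</code>, <code>exploratory/</code>, <code>final/</code>) are hypothetical rather than prescribed by this article.

<pre>
* main.do - illustrative master do-file (folder and file names are hypothetical)
global project "C:/projects/my-project"    // hypothetical project root

* Exploratory analysis: one do-file per topic, kept in its own folder
do "$project/exploratory/descriptive_statistics.do"
do "$project/exploratory/balance_checks.do"

* Final analysis: only the code that produces the released tables and figures
do "$project/final/table_1.do"
do "$project/final/figure_1.do"
</pre>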


== Implementing Analysis ==


=== Preparing the Data Set for Analysis ===
[[Data Cleaning | Data cleaning]] replaces values that would otherwise bias a variable; the first step of data analysis is then to edit the cleaned variables so that they fit the statistical models being used. Two common tasks, sketched in the example below, are:
*[[Standardization]]: convert all values of each variable into the same unit. If units are mixed within a variable, a value recorded as 1,000 grams will be interpreted as one thousand times larger than the same quantity recorded as 1 kg.
*[[Aggregation]]: variables are often collected disaggregated over categories (e.g. income collected as separate income categories) or disaggregated over instances (e.g. harvest value over multiple crops). Disaggregated data collection improves the quality of the data collected, but the analysis is usually interested in the aggregated value.
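The following is a minimal '''Stata''' sketch of both tasks. All variable names (<code>weight_raw</code>, <code>unit</code>, <code>inc_wage</code>, and so on) are hypothetical and only illustrate the pattern.

<pre>
* Standardization: express all weights in kilograms (variable names are illustrative)
gen     weight_kg = weight_raw / 1000 if unit == "grams"
replace weight_kg = weight_raw        if unit == "kg"
label variable weight_kg "Weight (kg, standardized)"

* Aggregation: total household income across disaggregated income sources
egen income_total = rowtotal(inc_wage inc_business inc_remittance)
label variable income_total "Total household income (all sources)"
</pre>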


=== Specific Analytical Methods ===
Below follows a sampling of specific analytical methods:
*[[Spatial Analysis |Spatial/GIS Analysis]] uses geospatial data to explore relationships mediated by proximity or connectedness. This can be descriptive (e.g. map illustrations) or informative (e.g. distance to and quality of the nearest road).
*[[Randomization Inference | Randomization inference]] techniques replace the "normal" p-values from regression analyses with values based on the '''treatment assignment''' methodology. They are generally recommended for experiments in which the estimates of interest concern a '''treatment''' that was [[Randomization|randomly]] assigned under the control of the implementer and researcher (a minimal sketch follows this list).
*[[Cost-effectiveness Analysis | Cost-effectiveness analysis]] compares the cost and effectiveness per unit of a given program to determine whether the value of an intervention justifies its cost.
*[[Regression Discontinuity|Regression discontinuity]] analysis is a [[Quasi-Experimental Methods | quasi-experimental]] '''impact evaluation''' design which estimates the causal effect of an intervention by exploiting a threshold (cutoff point) above which the '''treatment''' is assigned and below which it is not.
*[[Propensity Score Matching | Propensity score matching]] is another '''quasi-experimental impact evaluation''' technique that estimates the effect of a '''treatment''' by matching control group participants to '''treatment group''' participants based on their '''propensity scores'''.
*Heterogeneous Effects Analysis uses an [https://web.stanford.edu/~jgrimmer/het.pdf ensemble of methods] to understand how effects vary across sub-populations.
*[[Principal Component Analysis (PCA)]] is an analytical tool that seeks to explain the maximum amount of variance in the data with the fewest principal components.
*[[Data visualization]] is a critical step in effectively communicating your research results.
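As an illustration of the randomization inference bullet above, the sketch below uses the user-written <code>ritest</code> package (installed with <code>ssc install ritest</code>). The data set and variable names (<code>outcome</code>, <code>treatment</code>) are hypothetical.

<pre>
* Randomization inference p-value for a randomly assigned binary treatment
use "analysis_data.dta", clear

* Permute the treatment assignment and recompute the coefficient each time
ritest treatment _b[treatment], reps(1000) seed(12345): ///
    regress outcome treatment, vce(robust)
</pre>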


The [[Stata Coding Practices | Stata]] cheat sheet on [http://geocenter.github.io/StataTraining/pdf/StataCheatSheet_analysis_201615_June-REV.pdf data analysis] gives guidelines on relevant '''Stata''' code for analysis. The cheat sheet on [http://geocenter.github.io/StataTraining/pdf/StataCheatSheet_programming15_2016_June_TE-REV.pdf Stata programming] is a good resource for more advanced analytical tasks in '''Stata'''.


== Outputting Analytical Results ==
Just like the rest of your code, the output of results must be replicable. There are different degrees of replicability: at a minimum, every number that appears in a results table must be produced by the code.


Even better, all parts of the same table should be outputted to a single file. Tables often consist of results from multiple estimations, and it is preferable to export these to one file; see the Stata command [[estout]] and the sketch below.
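A minimal sketch using the user-written <code>estout</code>/<code>esttab</code> commands (installed with <code>ssc install estout</code>); the variable names and output path are illustrative only.

<pre>
* Store several estimations, then export them to a single table file
eststo clear
eststo: regress outcome treatment
eststo: regress outcome treatment i.strata, vce(cluster village)
esttab using "outputs/table_1.tex", se star(* 0.10 ** 0.05 *** 0.01) label replace
</pre>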
Since the final analysis '''do-files''' are intended to be fully [[Reproducible Research|replicable]] and the code itself is considered a vital, shareable output, all tables and figures should be created in such a way that the files are ordered, [[Naming Conventions | named]], placed, and formatted appropriately. Running the analysis '''do-file''' should result in only necessary files in the "outputs" folder, with names like "figure_1.png", "table_1.xlsx", and so on.


Optimally, all tables are outputted in a way that requires no manual formatting. A very common tool for this is LaTeX. DIME has prepared material for getting started with LaTeX that assumes no prior knowledge of LaTeX and explains the workflow from software such as Stata and R to final reports: see the [https://github.com/worldbank/DIME-LaTeX-Templates DIME LaTeX Templates].
For some applications (e.g. creating internal presentations or simple Word reports), file types like PNG and XLSX are sufficiently functional. For larger projects with multiple collaborators, particularly when syncing over a [[Getting Started with GitHub|GitHub]] service, plaintext file types such as EPS, CSV, and TEX are the preferred formats. Tables and figures should, at a minimum, be produced such that no further mathematical calculations are required. They should furthermore be organized and formatted as closely as possible to the [[Publishing Data|published]] versions. It is typically easy to do this for figures by using an appropriate <code>graph export</code> command in [[Stata Coding Practices|Stata]] or the equivalent, as in the sketch below. [https://www.latex-project.org LaTeX] is a particularly powerful tool for doing this with tables. DIME provides several guides on both processes. See [[Exporting Analysis |exporting analysis results]] for more details and more resources.
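For example, a figure can be written directly to the "outputs" folder with a deterministic name. The graph command, variable name, and folder below are purely illustrative.

<pre>
* Create a figure and export it under a fixed, ordered file name
histogram income_total, title("Distribution of total household income")
graph export "outputs/figure_1.png", width(2000) replace
graph export "outputs/figure_1.eps", replace
</pre>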


== Additional Resources ==
 
*Angrist and Pischke's [https://www.researchgate.net/publication/51992844_Mostly_Harmless_Econometrics_An_Empiricist's_Companion Mostly Harmless Econometrics: An Empiricist's Companion]
 
 


[[Category: Data Analysis ]]
