Stata Coding Practices

Researchers use Stata in all stages of an impact evaluation (or study), such as sampling, randomizing, monitoring data quality, cleaning, and analysis. Good Stata coding practices (including packages and commands) are a critical component of high quality reproducible research. These practices also allow the impact evaluation team (or research team) to save time and energy, and focus on other aspects of study design.

Read First

  • DIME Analytics and institutions like Innovations for Poverty Action (IPA) offer a wide range of resources, including tutorials, sample code, and easy-to-install packages and commands.
  • iefieldkit is a Stata package that standardizes best practices for high quality, reproducible primary data collection.
  • ietoolkit is a Stata package that standardizes best practices in data management and data analysis.
  • As with other popular user-written Stata packages like coefplot, use ssc install to download these packages.
  • Other common Stata best practices, for instance, with respect to naming file paths, also contribute to successful impact evaluations.

iefieldkit

DIME has developed iefieldkit as a package to help with the process of primary data collection. The package currently supports three major components of primary data collection: survey design, survey completion, and data cleaning and harmonization. iefieldkit performs the following three tasks:

  • Before data collection, ietestform complements the ODK syntax test on the SurveyCTO server. It runs tests that inform researchers how to use ODK programming language features to ensure high data quality. This command is especially useful if the data that will be imported into Stata has restrictions beyond ODK syntax.
  • During data collection, ieduplicates and iecompdup (both previously released as part of the package ietoolkit but now moved to this package) provide a workflow for detecting and resolving duplicate entries in the dataset. These commands ensure that the final survey dataset is a correct record of the survey sample, which the researcher can then merge into the master sampling database.
  • After data collection, iecodebook provides a workflow for rapidly cleaning, harmonizing, and documenting datasets. iecodebook takes its input from an Excel sheet, which gives a better-structured and easier-to-follow overview (especially for non-technical users) than the same operations written directly in a do-file.

To install the package, type ssc install iefieldkit in your Stata command window. Note that some features of the package may require metadata specific to SurveyCTO, but feel free to try these commands on any use case. For more details, see the iefieldkit GitHub page.
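
As an illustration, the sketch below shows how these commands might appear in a project do-file. The variable names (hhid, key), the ID value, and the file names are hypothetical placeholders, and the exact options should be checked against the iefieldkit documentation.

    * During data collection: flag duplicate IDs in the raw data and
    * document how each one is resolved in an Excel report
    use "rawdata.dta", clear
    ieduplicates hhid using "duplicates_report.xlsx", uniquevars(key)

    * Compare two submissions that share the same ID to see where they differ
    iecompdup hhid, id(1103)

    * After data collection: generate a cleaning template, fill it in
    * in Excel, then apply it to the dataset
    iecodebook template using "cleaning_codebook.xlsx"
    iecodebook apply using "cleaning_codebook.xlsx"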

ietoolkit

ietoolkit is a Stata package developed by DIME for data management and analysis in impact evaluations. The list of commands given below will be extended continuously, and suggestions for new commands are always appreciated.

Commands for data management currently include:

  • iefolder, which sets up project folders and creates master do-files that link to all sub-folders;
  • iegitaddmd, which adds a placeholder file to empty folders so that folder structures with empty folders can be shared on GitHub; and
  • ieboilstart, which standardizes the boilerplate code at the top of all do-files.
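
As an illustration, a project might use these commands roughly as sketched below; the version number and project path are hypothetical placeholders.

    * At the top of every do-file: standardize version, memory, and other
    * settings (the second line applies the version returned by ieboilstart)
    ieboilstart, version(13.1)
    `r(version)'

    * Once per project: create the standardized folder structure and
    * master do-files (the path below is a placeholder)
    iefolder new project, projectfolder("C:/Users/username/Documents/MyProject")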

Commands for data analysis currently include:

  • iematch, an algorithm for matching observations in one group to "the most similar" observations in another group;
  • iebaltab, which runs balance test regressions and outputs the result in well formatted balance tables;
  • iedropone, which drops observations and confirms that the correct number of observations was dropped;
  • ieboilsave, which performs checks before saving a dataset;
  • ieddtab, which runs difference-in-differences regressions and outputs the results in well-formatted tables; and
  • iegraph, which produces graphs of estimation results from common impact evaluation regression models.
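
For example, a balance table comparing treatment and control groups might be produced as sketched below; the covariate names, the treatment indicator, and the output file are hypothetical placeholders, and option names should be verified against the ietoolkit help files.

    * Balance table across treatment arms, exported to Excel
    iebaltab age hhsize income, grpvar(treatment) save("balance_table.xlsx")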

To install ietoolkit, type ssc install ietoolkit in your Stata command window. For more details, see the ietoolkit GitHub page.

Common Stata Practices

File Paths

DIME Analytics' recommendation is that all file paths should be absolute and dynamic, should always be enclosed in double quotes, and should always use forward slashes (/) for folder hierarchies, since Mac and Linux computers cannot read file paths with backslashes. File paths should also always include the file extension (.dta, .do, .csv, etc.), since omitting the extension causes ambiguity if another file with the same name is created (even if there is a default).

  • Absolute file paths mean that all file paths start at the root folder of the computer, for example, C:/ on a PC or /Users/ on a Mac. This makes sure that you always get the correct file in the correct folder. We never use cd. We have seen many cases when using cd where a file was overwritten in another project folder that cd was pointing to at the time. Relative file paths are common in many other programming languages, but there they are relative to the location of the file running the code, so there is no risk that a file is saved in a completely different folder. Stata does not provide this functionality.
  • Dynamic file paths use globals that are set in a central master do-file to build your file paths. In practice this serves the same purpose as setting cd, since new users only have to change these file path globals in one location. But dynamic absolute file paths are a better practice: if the global names are set uniquely, there is no risk that files are saved in the wrong project folder, and you can create multiple folder globals instead of the single location that cd provides.

Examples

  • Dynamic (and absolute) file path - RECOMMENDED
   global myDocs    "C:/Users/username/Documents"
   global myProject "${myDocs}/MyProject"
   use "${myProject}/MyDataset.dta"
  • Relative file path (using cd) - NOT RECOMMENDED
   cd "C:/Users/username/Documents/MyProject"
   use MyDataset.dta
  • Absolute but not dynamic - NOT RECOMMENDED
    use "C:/Users/username/Documents/MyProject/MyDataset.dta"

Additional Resources

Programs and Commands

General Coding Resources