Difference between revisions of "Stata Coding Practices"
Revision as of 20:27, 13 April 2020
Researchers use Stata in all stages of an impact evaluation (or study), such as sampling, randomizing, monitoring data quality, cleaning, and analysis. Good Stata coding practices (including packages and commands) are a critical component of high quality reproducible research. These practices also allow the impact evaluation team (or research team) to save time and energy, and focus on other aspects of study design.
Read First
- DIME Analytics and institutions like Innovations for Poverty Action (IPA) offer a wide range of resources - tutorials, sample codes, and easy-to-install packages and commands.
- iefieldkit is a Stata package that standardizes best practices for high quality, reproducible primary data collection.
- ietoolkit is a Stata package that standardizes best practices in data management and data analysis.
- As with standard Stata packages like coefplot, use ssc install to download these packages.
- Other common Stata best practices, for instance, with respect to naming file paths, also contribute to successful impact evaluations.
iefieldkit
DIME has developed iefieldkit as a package to simplify the process of primary data collection. The package currently supports three major components of this workflow (process) - survey design, survey completion, and data cleaning and data harmonization. iefieldkit uses four commands to simplify each of these tasks:
- Before data collection. The ietestform command tests the survey form to make sure it follows best practices in naming, coding, and labeling. For instance, it helps ensure that fields are set up so that an enumerator cannot move to the next field until they enter a response, so that incomplete forms cannot be submitted.
- During data collection. The ieduplicates and iecompdup commands allow the research team to detect (identify) and resolve (deal with) duplicate entries in the data set. These commands were previously part of the ietoolkit package, but are now part of the iefieldkit package.
- After data collection. The iecodebook command provides a method for rapidly cleaning, harmonizing, and documenting data sets.
To install the iefieldkit package, type ssc install iefieldkit in your Stata command window. Note that some features of the package might require metadata specific to SurveyCTO, but feel free to try these commands on any use case.
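As a minimal sketch of the installation step, assuming an internet connection and write access to your ado folder, the lines below install the package and then confirm that the four commands named above can be found. The replace option is an addition here; it updates a copy that is already installed.

```stata
* Install (or update) iefieldkit from SSC
ssc install iefieldkit, replace

* Confirm that each command in the package is on the ado-path;
* which stops with an error if a command cannot be found
foreach cmd in ietestform ieduplicates iecompdup iecodebook {
    which `cmd'
}

* Open the help file of a command to see its full syntax
help ieduplicates
```

The syntax of individual commands can change between package versions, so check the help files of the version you have installed before using them in a project.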
ietoolkit
DIME has developed the ietoolkit package for Stata, to simplify the process of data management and analysis in impact evaluations. The list of commands given below will be extended continuously, and suggestions for new commands are always appreciated.
Commands for data management currently include:
- iefolder, which sets up project folders and creates master do-files that link to all sub-folders;
- iegitaddmd, which adds a placeholder file to empty folders so that folder structures with empty folders can be shared on GitHub; and
- ieboilstart, which standardizes the boilerplate code at the top of all do-files.
Commands for data analysis currently include:
- iematch, an algorithm for matching observations in one group to "the most similar" observations in another group;
- iebaltab, which runs balance test regressions and outputs the result in well formatted balance tables;
- iedropone, which drops observations and checks that the correct number was dropped;
- ieboilsave, which performs checks before saving a data set;
- ieddtab, which runs difference in differences regressions and outputs the result in well formatted tables; and
- iegraph, which produces graphs of estimation results in common impact evaluation regression models.
To install ietoolkit, type ssc install ietoolkit in your Stata command window. For more details, see the ietoolkit GitHub page.
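The sketch below shows how two of these commands might be used once the package is installed. It assumes an internet connection for the install step, uses the auto data set that ships with Stata as stand-in data, and picks version 13.1 only as an example. Option names follow the ietoolkit help files as we understand them and may differ in your installed version.

```stata
* Install (or update) ietoolkit from SSC
ssc install ietoolkit, replace

* Standardize settings at the top of a do-file; the version number is an example.
* ieboilstart returns the version command in r(version), which must then be run
ieboilstart, versionnumber(13.1)
`r(version)'

* Load an example data set shipped with Stata
sysuse auto, clear

* Drop the one observation that matches the condition; by default iedropone
* stops with an error unless exactly one observation matches
iedropone if make == "Volvo 260"
```

Using iedropone instead of drop means that the do-file stops if the data changes so that the condition matches a different number of observations than expected.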
Common Stata Practices
File Paths
DIME Analytics' recommendation is that all file paths should be absolute and dynamic, should always be enclosed in double quotes, and should always use forward slashes (/) for folder hierarchies, since Mac and Linux computers cannot read file paths with backslashes. File paths should also always include the file extension (.dta, .do, .csv, etc.), since omitting the extension causes ambiguity if another file with the same name is created (even if there is a default).
- Absolute file paths mean that all file paths must start at the root folder of the computer, for example, C:/ on a PC or /Users/ on a Mac. This makes sure that you always get the correct file in the correct folder. We never use cd. We have seen many cases where, when using cd, a file was overwritten in another project folder that cd was pointing to at the time. Relative file paths are common in many other programming languages, but there they are relative to the location of the file running the code, so there is no risk that a file is saved in a completely different folder. Stata does not provide this functionality.
- Dynamic file paths use globals that are set in a central master do-file to dynamically build your file paths. In practice this serves the same function as setting cd, since each new user only has to change these file path globals in one location. But dynamic absolute file paths are a better practice: if the global names are unique, there is no risk that files are saved in the wrong project folder, and you can create multiple folder globals instead of the single location that cd allows.
Examples
- Dynamic (and absolute) file path - RECOMMENDED
global myDocs "C:/Users/username/Documents"
global myProject "${myDocs}/MyProject"
use "${myProject}/MyDataset.dta"
- Relative (and absolute) file path - NOT RECOMMENDED
cd "C:/Users/username/Documents/MyProject"
use MyDataset.dta
- Absolute but not dynamic - NOT RECOMMENDED
use "C:/Users/username/Documents/MyProject/MyDataset.dta"
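The recommended pattern can be extended to several folder globals that are all set in one master do-file. The following is only a sketch: the global names (projectRoot, dataRaw, dataClean, outputs), the folder layout, and the cleaned file name are hypothetical, and it assumes that the project folders and MyDataset.dta already exist on the computer.

```stata
* Master do-file sketch: all folder globals are defined here, in one place.
* Global names and folder names are hypothetical examples.

* Each user adds one branch with the absolute path to their copy of the project
if "`c(username)'" == "username" {
    global projectRoot "C:/Users/username/Documents/MyProject"
}

* Stop early with a clear message if no root folder was set for this user
if "${projectRoot}" == "" {
    display as error "No project root defined for user `c(username)'"
    exit 601
}

* Several folder globals are built from the single root global
global dataRaw   "${projectRoot}/data/raw"
global dataClean "${projectRoot}/data/clean"
global outputs   "${projectRoot}/outputs"

* Every path is absolute, quoted, uses forward slashes, and has an extension
use  "${dataRaw}/MyDataset.dta", clear
save "${dataClean}/MyDataset_clean.dta", replace
```

A new team member then only needs to add one branch with their own user name and root folder; every other do-file in the project keeps working unchanged.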
Additional Resources
Programs and Commands
- You can find a broad variety of Stata commands in this World Bank repository, How to Write Programs in Stata, which contains ado-files for commands useful for data management, statistical analysis, and the production of graphics. In many cases, these ado-files reduce the production of routine items from a tedious programming task to a single command line (e.g., data import and cleaning, production of summary statistics tables, and categorical bar charts with confidence intervals).
- You can experiment with and build upon DIME Analytics’ Intro to how to write programs (also called commands or functions) in Stata and Share functions (sub-programs) between commands in the same package. Download the files and read the instructions.
- This DIME Analytics Stata IE Visual Library repository hosts Stata Graph examples on GitHub; feel free to submit your own example codes there.
- Innovations for Poverty Action's Stata modules for data collection and analysis and GitHub page host programs for impact evaluations.
- Innovations for Poverty Action's odkmeta command writes a do-file to import ODK data to Stata, using the metadata from the survey and choices worksheets of the XLSForm.
- Read more on iefolder in DIME Analytics’ presentations here and here.
- Read more on ietoolkit in DIME Analytics’ Real Time Data Quality Checks.
- Check out The World Bank's Stata GitHub.
General Coding Resources
- Read DIME Analytics' guide to Stata coding and cleaning.
- Refer to these Stata cheat sheets on GitHub.
- Gentzkow and Shapiro's Code and Data for the Social Sciences is a handbook for best practices.
- Poverty Action Lab's Programming with Stata, Princeton's Getting Started in Data Analysis Using Stata, and Stanford's Basics of Stata provide resources for beginning and intermediate Stata users.
For more details, see the iefieldkit GitHub page.