Difference between revisions of "Stata Coding Practices"
Revision as of 22:47, 13 April 2020
Researchers use Stata in all stages of an impact evaluation (or study), such as sampling, randomizing, monitoring data quality, cleaning, and analysis. Good Stata coding practices (including packages and commands) are a critical component of high-quality, reproducible research. These practices also allow the impact evaluation team (or research team) to save time and energy, and focus on other aspects of study design.
Read First
- DIME Analytics and institutions like Innovations for Poverty Action (IPA) offer a wide range of resources: tutorials, sample code, and easy-to-install packages and commands.
- iefieldkit is a Stata package that standardizes best practices (guidelines) for high-quality, reproducible primary data collection.
- ietoolkit is a Stata package that standardizes best practices in data management and data analysis.
- As with other user-written Stata packages like coefplot, use ssc install to download these packages.
- Other common Stata best practices, for instance, with respect to naming file paths, also contribute to successful impact evaluations.
iefieldkit
DIME has developed the iefieldkit package for Stata to simplify the process of primary data collection. The package currently supports three major components of this workflow (process): survey design, survey completion, and data cleaning and harmonization. iefieldkit uses four commands to simplify these tasks:
- Before data collection. The ietestform command tests the survey form before it is deployed to make sure it follows best practices in naming, coding, and labeling. For instance, it checks that fields are set up so that an enumerator cannot move to the next field until they enter a response, thus ensuring that incomplete forms cannot be submitted.
- During data collection. The ieduplicates and iecompdup commands allow the research team to detect (identify) and resolve (deal with) duplicate entries in the data set. These commands were previously a part of the ietoolkit package, but are now part of the iefieldkit package.
- After data collection. The iecodebook command provides a method for rapidly cleaning, harmonizing, and documenting data sets.
To install the iefieldkit package, type ssc install iefieldkit in your Stata command window. Note that some features of this package might require metadata (information) that is specific to SurveyCTO, but users can still test them in other cases.
ietoolkit
DIME has developed the ietoolkit package for Stata to simplify the process of data management and analysis in impact evaluations. The commands that are currently part of this package are listed below.
- Data management.
  - iefolder sets up a standardized (common) structure for all folders that are shared as part of a project, that is, the project folder. It creates master do-files that link to all sub-folders (folders within another folder), so that the project folder is automatically updated every time more data or files are shared from the field teams. This command helps create reproducible research.
  - iegitaddmd allows members of the research team to share a template (outline) folder for a new project on GitHub even if it is empty. This command creates a placeholder that can be updated later when a file is added to that folder. For example, templates often include an output folder where the results of data analysis will be stored. This folder remains empty until the data set is cleaned to prepare it for analysis. Using this command, two people, say A and B, can still share this folder with each other on GitHub.
  - ieboilstart standardizes the version, capacity (in terms of the number of observations it can store in memory), and other Stata settings for all users in a project. This command should be run (typed) at the top of all do-files that are shared between members of the research team. Such code is called boilerplate code, since it standardizes the code at the beginning of all do-files.
An example of code that uses these commands is given below:
ieboilstart, version(14.0)                      // Standardizes the version for everyone
`r(version)'                                    // Applies the version setting returned by ieboilstart
global folder "C:/Users/username/DropBox/ProjectABC"
iefolder new project, projectfolder("$folder")  // Sets up the main folder structure
iegitaddmd, folder("$folder")                   // Makes sure users can share the main folder on GitHub even if it is empty
- Data analysis.
  - iematch matches observations in one group to the observations in another group that are closest in terms of a particular characteristic. For example, consider a study which is designed to evaluate the impact of randomly providing cash transfers to half the workers in a firm. The research team can use iematch to match and compare wages of women in the treatment group (which received the cash transfers) with observations in a control group (which did not receive the cash transfers).
  - iebaltab runs balance tests, and produces balance tables which show the difference in means for one or more treatment groups. It can be used to check if there are statistically significant differences between the treatment and control groups. If there are significant differences in the means, iebaltab even displays an error message that suggests that results from such data can be wrongly interpreted.
  - iedropone drops only a specific number of observations, and makes sure that no additional observations are dropped.
  - ieboilsave performs checks to ensure that best practices are followed before saving a data set.
  - ieddtab runs difference-in-differences regressions and displays the results in well-formatted tables.
  - iegraph produces graphs of results from regression models that researchers commonly use during impact evaluations.
To install the ietoolkit package, type ssc install ietoolkit in your Stata command window.
Other Common Practices
File Paths
DIME Analytics suggests the following guidelines for specifying file paths in Stata:
- Double quotes ("). Always enclose file paths in double quotes ("). For example, "$maindir".
- Forward slashes (/). Always use forward slashes (/) to specify folder hierarchies, that is, the exact location of a folder inside another folder, and so on. For example, "C:/Users/username/Documents". This is important because Mac and Linux computers cannot read file paths with backslashes (\).
- File extension. Always include the file extension in the file path, such as .dta, .do, or .csv. This helps to avoid ambiguity (or doubt) if another file with the same name exists.
- Absolute. File paths must be absolute, that is, all file paths must begin from the root folder of the computer, for example, C:/ on a PC or /Users/ on a Mac. This makes sure that users are always specifying the correct file in the correct folder. Users should never use cd, since a user can accidentally overwrite a file in the project folder which the cd initially referred to. While file paths relative to the location of the script are common in many other programming languages, Stata does not provide this functionality; relative paths in Stata are resolved only against the current working directory set by cd.
- Dynamic. File paths must also be dynamic. Dynamic file paths use globals (global macros) that are located in the master (central) do-file, and allow users to expand file paths dynamically (whenever needed). In practice, using global macros to specify folders is the same as using cd, and users only need to change the file path in the global macro in the master do-file. But in this method, users can create multiple folder globals (global macros) instead of just one, which is the case with cd.
Absolute and dynamic file paths are therefore the better practice, since there is no risk of files getting saved in the incorrect project folder, as long as the global macro has a unique name.
Examples
- Dynamic and absolute file path
global myDocs "C:/Users/username/Documents"
global myProject "${myDocs}/MyProject"
use "${myProject}/MyDataset.dta"
- Relative (and absolute) file path
cd "C:/Users/username/Documents/MyProject"
use MyDataset.dta
- Absolute but not dynamic
use "C:/Users/username/Documents/MyProject/MyDataset.dta"
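The guidelines above can be combined in a master do-file along the following lines. This is only a sketch under stated assumptions: the user names (userA, userB), the folder names (Data, Outputs), the global names, and the output file name are hypothetical and chosen for illustration. It uses the built-in c(username) value so that each team member only needs to add their own root folder once.

```stata
* Hypothetical master do-file: set the root folder once per user
if c(username) == "userA" {
    global projectfolder "C:/Users/userA/Documents/MyProject"
}
if c(username) == "userB" {
    global projectfolder "/Users/userB/Documents/MyProject"
}

* Several folder globals built from the single root global (names are assumptions)
global data    "${projectfolder}/Data"
global outputs "${projectfolder}/Outputs"

* Absolute, dynamic paths with double quotes, forward slashes, and file extensions
use  "${data}/MyDataset.dta", clear
save "${outputs}/MyDataset_clean.dta", replace
```

Because every path expands from a global defined in one place, moving the project to another computer only requires editing the root folder line for that user, and there is no working directory that a later cd could silently change.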
Additional Resources
Programs and Commands
- You can find a broad variety of Stata commands in this World Bank repository, How to Write Programs in Stata, which contains ado-files for commands useful for data management, statistical analysis, and the production of graphics. In many cases, these ado-files reduce the production of routine items from a tedious programming task to a single command line (e.g., data import and cleaning, production of summary statistics tables, and categorical bar charts with confidence intervals).
- You can experiment with and build upon DIME Analytics’ Intro to how to write programs (also called commands or functions) in Stata and Share functions (sub-programs) between command in the same package. Download the files and read the instructions.
- This DIME Analytics Stata IE Visual Library repository hosts Stata Graph examples on GitHub; feel free to submit your own example codes there.
- Innovations for Poverty Action's Stata modules for data collection and analysis and GitHub page host programs for impact evaluations
- Innovations for Poverty Action's odkmeta command writes a do-file to import ODK data to Stata, using the metadata from the survey and choices worksheets of the XLSForm.
- Read more on iefolder in DIME Analytics’ presentations here and here.
- Read more on ietoolkit in DIME Analytics’ Real Time Data Quality Checks.
- Check out The World Bank's Stata GitHub.
General Coding Resources
- Read DIME Analytics' guide to Stata coding and cleaning.
- Refer to these Stata cheat sheets on GitHub.
- Gentzkow and Shapiro's Code and Data for the Social Sciences is a handbook for best practices.
- Poverty Action Lab's Programming with Stata, Princeton's Getting Started in Data Analysis Using Stata, and Stanford's Basics of Stata provide resources for beginning and intermediate Stata users.
For more details, see the iefieldkit GitHub page.