Difference between revisions of "Stata Coding Practices"

 
Revision as of 21:19, 13 April 2020

Researchers use Stata in all stages of an impact evaluation (or study), such as sampling, randomizing, monitoring data quality, cleaning, and analysis. Good Stata coding practices (including packages and commands) are a critical component of high quality reproducible research. These practices also allow the impact evaluation team (or research team) to save time and energy, and focus on other aspects of study design.

Read First

  • DIME Analytics and institutions like Innovations for Poverty Action (IPA) offer a wide range of resources - tutorials, sample codes, and easy-to-install packages and commands.
  • iefieldkit is a Stata package that standardizes best practices for high quality, reproducible primary data collection.
  • ietoolkit is a Stata package that standardizes best practices in data management and data analysis.
  • As with other user-written Stata packages like coefplot, use ssc install to download these packages.
  • Other common Stata best practices, for instance, with respect to naming file paths, also contribute to successful impact evaluations.

iefieldkit

DIME has developed the iefieldkit package for Stata to simplify the process of primary data collection. The package currently supports three major components of this workflow (process): survey design, survey completion, and data cleaning and harmonization. iefieldkit uses four commands to simplify these tasks:

  • Before data collection. The ietestform command tests the survey form, before it is used in the field, to make sure it follows best practices in naming, coding, and labeling. For instance, it flags fields that are not marked as required, since a required field does not let an enumerator move to the next field until they enter a response. This helps ensure that incomplete forms cannot be submitted.
  • During data collection. The ieduplicates and iecompdup commands allow the research team to detect (identify) and resolve (deal with) duplicate entries in the data set. These commands were previously a part of the ietoolkit package, but are now part of the iefieldkit package.
  • After data collection. The iecodebook command provides a method for rapidly cleaning, harmonizing, and documenting data sets.

To install the iefieldkit package, type ssc install iefieldkit in your Stata command window. Note that some features of the package might require metadata specific to SurveyCTO, but feel free to try these commands on any use case.
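
The idea of detecting and resolving duplicate entries can be sketched with Stata's built-in duplicates and isid commands. This is only an illustration of the concept, not the ieduplicates or iecompdup workflow itself, and the variable names and values (hhid, age) are hypothetical.

```stata
* Toy data set: household 102 was submitted twice (hypothetical values)
clear
input int hhid byte age
101 34
102 51
102 51
103 27
end

* Flag every observation whose ID appears more than once
duplicates tag hhid, generate(dup)
list if dup > 0

* Keep one copy of observations that are identical on all variables
duplicates drop

* Confirm that the ID now uniquely identifies each observation
isid hhid
```

In real survey data, duplicates usually differ on some variables, so each case needs a documented decision on which entry to keep. That is the part the iefieldkit commands are meant to support.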

ietoolkit

DIME has developed the ietoolkit package for Stata to simplify the process of data management and analysis in impact evaluations. Given below is the list of commands that are currently part of this package.

  • Data management.
    1. iefolder sets up a standardized (common) structure for the project folder, that is, for all folders that are shared as part of a project. It creates master do-files that link to all sub-folders (folders within another folder), so that the project folder is automatically updated every time more data or files are shared from the field teams. This command helps create reproducible research.
    2. iegitaddmd allows members of the research team to share a template (outline) folder for a new project on GitHub even if it is empty. It does this by creating a placeholder file, which can be updated later when a file is added to that folder. For example, templates often include an output folder where the results of data analysis will be stored. This folder remains empty until the data set is cleaned to prepare it for analysis. However, using this command, two people, say A and B, can still share this folder with each other on GitHub.
    3. ieboilstart standardizes the version, capacity (in terms of the number of observations it can store in memory), and other Stata settings for all users in a project. This command should be run at the top of all do-files that are shared between members of the research team. Such code is called boilerplate code, since it is the same at the beginning of all do-files.

An example of code that uses these commands is given below:

ieboilstart, version(14.0) //Standardizes the settings for everyone
`r(version)' //Applies the version that ieboilstart returns
global folder "C:/Users/username/DropBox/ProjectABC"
iefolder new project, projectfolder("$folder") //Sets up the main structure
iegitaddmd, folder("$folder") //Makes sure users can share the main folder on GitHub even if it is empty
  • Data analysis.
    1. iematch is an algorithm for matching observations in one group to "the most similar" observations in another group.
    2. iebaltab runs balance test regressions and outputs the result in well-formatted balance tables.
    3. iedropone drops observations and verifies that the expected number of observations was dropped.
    4. ieboilsave performs checks before saving a data set.
    5. ieddtab runs difference-in-difference regressions and outputs the result in well-formatted tables.
    6. iegraph produces graphs of estimation results in common impact evaluation regression models.
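
The kind of safeguard that iedropone and ieboilsave provide can be illustrated with built-in Stata commands. The sketch below is not the implementation of either command. It only shows the underlying idea of checking how many observations a condition matches before dropping them, and checking the ID variable before saving. The data and variable names are hypothetical.

```stata
* Toy data set (hypothetical values)
clear
input int hhid byte age
101 34
102 51
103 27
end

* Confirm that the condition matches exactly one observation before dropping
count if hhid == 102
assert r(N) == 1
drop if hhid == 102

* Before saving, confirm that the ID is unique and has no missing values
isid hhid
```

If the condition matched zero or several observations, assert would stop the do-file instead of silently dropping the wrong number of rows.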

To install the ietoolkit package, type ssc install ietoolkit in your Stata command window.
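
In a shared do-file, it can be convenient to install the packages only when they are missing. A minimal sketch follows. It assumes an internet connection and uses one command from each package (ietestform and iefolder) as a check for whether that package is installed.

```stata
* Install each package from SSC only if one of its commands is not found
capture which ietestform      // a command from iefieldkit
if _rc != 0 {
    ssc install iefieldkit
}

capture which iefolder        // a command from ietoolkit
if _rc != 0 {
    ssc install ietoolkit
}
```

The capture prefix stops a missing command from halting the do-file, and _rc is non-zero when which cannot find the command.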

Common Stata Practices

File Paths

DIME Analytics recommends that all file paths be absolute and dynamic, always be enclosed in double quotes, and always use forward slashes (/) for folder hierarchies, since Mac and Linux computers cannot read file paths with backslashes. File paths should also always include the file extension (.dta, .do, .csv, etc.), since omitting the extension causes ambiguity if another file with the same name is created (even if there is a default).

  • Absolute file paths start at the root folder of the computer, for example, C:/ on a PC or /Users/ on a Mac. This makes sure that you always get the correct file in the correct folder. We never use cd. We have seen many cases where, because of cd, a file was overwritten in another project folder that cd happened to be pointing to at the time. Relative file paths are common in many other programming languages, but there they are relative to the location of the file running the code, so there is no risk that a file is saved in a completely different folder. Stata does not provide this functionality.
  • Dynamic file paths use globals that are set in a central master do-file to dynamically build your file paths. In practice this serves the same function as setting cd, as all new users should only have to change these file path globals in one location. But dynamic absolute file paths are a better practice: if the global names are unique, there is no risk that files are saved in the incorrect project folder, and you can create multiple folder globals instead of just one location as with cd.

Examples

  • Dynamic (and absolute) file path - RECOMMENDED
   global myDocs    "C:/Users/username/Documents"
   global myProject "${myDocs}/MyProject"
   use "${myProject}/MyDataset.dta"
  • Relative file path (folder set with cd) - NOT RECOMMENDED
   cd "C:/Users/username/Documents/MyProject"
   use MyDataset.dta
  • Absolute but not dynamic - NOT RECOMMENDED
   use "C:/Users/username/Documents/MyProject/MyDataset.dta"
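
The point about creating multiple folder globals can be sketched as follows. This is a minimal illustration rather than a prescribed template. The sub-folder names (Data, Output) and the global names myProject_data and myProject_out are hypothetical, and the root path is a placeholder.

```stata
* Root folder: the only line a new user needs to change (placeholder path)
global myDocs         "C:/Users/username/Documents"

* Project folders, all built from the root global
global myProject      "${myDocs}/MyProject"
global myProject_data "${myProject}/Data"
global myProject_out  "${myProject}/Output"

* Build a full file path: quoted, forward slashes, extension included
local dataset "${myProject_data}/MyDataset.dta"
display "`dataset'"
```

Prefixing each global with the project name keeps the names unique, so a global left over from another project is less likely to send a file to the wrong folder.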

Additional Resources

Programs and Commands

General Coding Resources

For more details, see the iefieldkit GitHub page.