Stata Coding Practices
Latest revision as of 07:36, 31 August 2023
Researchers use Stata in all stages of an impact evaluation (or study), such as sampling, randomizing, monitoring data quality, cleaning, and analysis. Good Stata coding practices, packages, and commands are a critical component of high quality reproducible research. These practices also allow the impact evaluation team (or research team) to save time and energy, and focus on other aspects of study design.
Read First
- DIME Analytics and institutions like Innovations for Poverty Action (IPA) offer a wide range of resources: tutorials, sample code, and easy-to-install packages and commands.
- iefieldkit is a Stata package that standardizes best practices (guidelines) for high-quality, reproducible primary data collection.
- ietoolkit is a Stata package that standardizes best practices in data management and data analysis.
- As with other Stata packages like coefplot, use ssc install to download these packages.
- Other common Stata best practices, for instance, with respect to naming file paths, also contribute to successful impact evaluations.
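As a minimal sketch of the installation step mentioned above, the following lines install the three packages from SSC. The replace option is added here as an assumption about what most users want (it updates a package that is already installed); it is not part of the original instructions.

```stata
* Install the DIME Analytics packages and coefplot from SSC.
* "replace" updates the package if an older copy is already installed.
ssc install iefieldkit, replace
ssc install ietoolkit, replace
ssc install coefplot, replace

* Check where the installed files are and which versions they report.
* The output depends on your machine, so none is shown here.
which ietoolkit
which iefieldkit
```

Running these lines requires an internet connection, and the installation only needs to be done once per computer.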
iefieldkit
DIME has developed the iefieldkit package for Stata to simplify the process of primary data collection. The package currently supports three major components of this workflow (process): survey design, survey completion, and data cleaning and data harmonization. iefieldkit uses four commands to simplify each of these tasks:
- Before data collection. The ietestform command tests the survey form before fieldwork to make sure it follows best practices in naming, coding, and labeling. For instance, it can flag fields that are not set as required. A required field does not let an enumerator move to the next field until they enter a response, which helps ensure that incomplete forms cannot be submitted.
- During data collection. The ieduplicates and iecompdup commands allow the research team to detect (identify) and resolve (deal with) duplicate entries in the data set. These commands were previously a part of the ietoolkit package, but are now part of the iefieldkit package.
- After data collection. The iecodebook command provides a method for rapidly cleaning, harmonizing, and documenting data sets.
To install the iefieldkit package, type ssc install iefieldkit in your Stata command window. Note that some features of this package might require metadata (information) that is specific to SurveyCTO, but users can still test them in other cases.
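The three stages above can be sketched as follows. This is an illustration under stated assumptions, not a definitive workflow: the folder, the file names, and the variables hhid and key are hypothetical, and the option names follow recent versions of iefieldkit, so check help ietestform, help ieduplicates, and help iecodebook for the version you have installed.

```stata
* Hypothetical project folder, used only for illustration.
global survey "C:/Users/username/Documents/MyProject/Survey"

* Before data collection: test the questionnaire form and save a report.
ietestform using "${survey}/questionnaire.xlsx", ///
    reportsave("${survey}/form_report.csv")

* During data collection: list duplicates of the (hypothetical) ID
* variable hhid in an Excel report. The variable key is assumed to
* uniquely identify each submission.
use "${survey}/raw_data.dta", clear
ieduplicates hhid using "${survey}/duplicates_report.xlsx", uniquevars(key)

* After data collection: create a codebook template to fill in by hand.
iecodebook template using "${survey}/cleaning_codebook.xlsx"
```

Once the codebook template is filled in, iecodebook apply using the same file applies the documented changes to the data set.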
ietoolkit
DIME has developed the ietoolkit package for Stata to simplify the process of data management and analysis in impact evaluations. The commands that are currently part of this package are listed below.
- Data management.
  - iefolder sets up a standardized (common) structure for all folders that are shared as part of a project, that is, the project folder. It creates master do-files that link to all sub-folders (folders within another folder), so that the project folder is automatically updated every time more data or files are shared from the field teams. This command helps create reproducible research.
  - iegitaddmd allows members of the research team to share a template (outline) folder for a new project on GitHub even if it is empty. This command creates a placeholder that can be updated later when a file is added to that folder. For example, templates often include an output folder where the results of data analysis will be stored. This folder remains empty until the data set is cleaned to prepare it for analysis. Using this command, two people, say A and B, can still share this folder with each other on GitHub.
  - ieboilstart standardizes the version, capacity (in terms of the number of observations it can store in memory), and other Stata settings for all users in a project. This command should be run (typed) at the top of all do-files that are shared between members of the research team. Such code is called boilerplate code, since it standardizes the code at the beginning of all do-files.
An example of code that uses these commands is given below:
ieboilstart, version(14.0) // Standardizes the version and settings for everyone
// ieboilstart returns the version command in r(version); the next line applies it
`r(version)'
global folder "C:/Users/username/DropBox/ProjectABC"
iefolder new project, projectfolder("$folder") // Sets up the main folder structure
iegitaddmd, folder("$folder") // Adds placeholders so that empty folders can be shared on GitHub
- Data analysis.
  - iematch matches observations in one group to the observations in another group that are closest in terms of a particular characteristic. For example, consider a study designed to evaluate the impact of randomly providing cash transfers to half the workers in a firm. The research team can use iematch to match and compare wages of women in the treatment group (which received the cash transfers) with observations in a control group (which did not receive the cash transfers).
  - iebaltab runs balance tests and produces balance tables that show the difference in means for one or more treatment groups. It can be used to check whether there are statistically significant differences between the treatment and control groups. If there are significant differences in the means, iebaltab even displays an error message that suggests that results from such data can be wrongly interpreted.
  - iedropone drops only a specific number of observations, and makes sure that no additional observations are dropped.
  - ieboilsave performs checks to ensure that best practices are followed before saving a data set.
  - ieddtab runs difference-in-differences regressions and displays the results in well-formatted tables.
  - iegraph produces graphs of results from regression models that researchers commonly use during impact evaluations.
To install the ietoolkit package, type ssc install ietoolkit in your Stata command window.
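To illustrate the safeguard that iedropone adds over a plain drop, here is a small sketch using the auto data set that ships with Stata. The choice of data set and observation is an assumption made only for illustration.

```stata
* Requires ietoolkit: ssc install ietoolkit
sysuse auto, clear

* Drop exactly one observation. By default, iedropone stops with an
* error if the condition matches zero or more than one observation,
* so a typo in the condition cannot silently drop the wrong rows.
iedropone if make == "AMC Concord"

* Confirm how many observations remain.
count
```

The numobs() option changes the number of observations that the condition is expected to match.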
File Paths
DIME Analytics suggests the following guidelines for specifying file paths in Stata:
- Double quotes ("). Always enclose file paths in double quotes ("). For example, "${maindir}".
- Forward slashes (/). Always use forward slashes (/) to specify folder hierarchies, that is, the exact location of a folder inside another folder, and so on. For example, "C:/Users/username/Documents". This is important because Mac and Linux computers cannot read file paths with backslashes (\).
- File extension. Always include the file extension in the file path, such as .dta, .do, or .csv. This helps to avoid ambiguity (or doubt) if another file with the same name exists.
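The three guidelines above can be sketched together as follows. The global maindir is set to the current working directory only so that the example runs on any computer; in a real project it would point to the project folder. The file name auto_copy.dta is hypothetical.

```stata
* Use the current working directory as a stand-in for the project folder.
global maindir "`c(pwd)'"

sysuse auto, clear

* Double quotes, forward slashes, and an explicit file extension.
save "${maindir}/auto_copy.dta", replace
use "${maindir}/auto_copy.dta", clear

* Forward slashes also avoid a Stata-specific problem: a backslash placed
* directly before a macro, as in "C:\Users\${name}", can stop the macro
* from being expanded.
```

Quoting the path also keeps the command working when a folder name contains spaces.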
Dynamic and absolute file paths.
Relative file paths exist in Stata but are implemented differently than in many other programming languages. Use caution when translating practices that build on relative file paths from other languages into Stata.
Therefore, it is common to use dynamic and absolute file paths in Stata. A file path is absolute when it begins from the root folder of the computer, for example, C:/ on a PC or /Users/ on a Mac. This guarantees that each file path can correspond to only a single location in the file system, no matter what the working directory is set to.
In contrast, a relative file path points to a different location each time the working directory changes. In a collaborative context, your file paths might start to point to other locations on your computer if someone on your team introduces code that uses cd to change the directory. These errors cannot occur when a team uses absolute paths.
However, in absolute paths, the first part of the file path is almost always unique to each user. To make absolute paths work for every team member, you need to create a file path that is both dynamic and absolute. An absolute file path is dynamic if it sets the first part of the path dynamically with code. This means that users set globals (global macros) located in the main do-files to specify the root part of file paths. The root part is the part of the file path that differs between users.
There are other ways to solve the same problem, but dynamic absolute file paths are considered a very generalizable method with few and simple steps to learn.
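One common way to set the root global dynamically is to branch on the user name that Stata reports in c(username). The sketch below assumes two hypothetical users, userA and userB, with hypothetical folder locations; each team member adds their own block to the main do-file.

```stata
* Set the root part of the path depending on who is running the code.
* The user names and folders below are hypothetical.
if "`c(username)'" == "userA" {
    global root "C:/Users/userA/Documents"
}
else if "`c(username)'" == "userB" {
    global root "/Users/userB/Dropbox"
}
else {
    * Stop early instead of running with an undefined root folder.
    display as error "Add your root folder to the main do-file."
    exit 198
}

* Every other path is built from the root global, so it stays absolute.
global myProject "${root}/MyProject"
```

All other do-files can then refer to "${myProject}/..." without knowing which computer they run on.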
Examples
- Dynamic and absolute file path.
global root "C:/Users/username/Documents"
global myProject "${root}/MyProject"
use "${myProject}/MyDataset.dta"
- Non-absolute, non-dynamic file path.
cd "C:/Users/username/Documents/MyProject"
use MyDataset.dta
- Absolute, but non-dynamic file path.
cd "C:/Users/username/Documents/MyProject"
use "C:/Users/username/Documents/MyProject/MyDataset.dta"
Exporting Tables
Tables play a crucial role in representing the results of a study in an easy-to-understand format. However, it is common to copy and paste results from Stata and format them in word-processing software, which undermines the reproducibility of research. DIME Analytics has therefore created the following resources for exporting tables in Stata:
- Checklist for submitting tables in development research
- Nice and fast tables in Stata for LaTex and Excel
- GitHub - Stata tables is a repository with do-files and output tables. Use these to practice exporting tables using the esttab command.
- Blog post on Stata tables
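As a minimal sketch of exporting a table with code instead of copying and pasting, the lines below store two regressions and write them to a file with esttab. The data set, the models, and the output file name are assumptions chosen for illustration; esttab is part of the estout package.

```stata
* Requires the estout package: ssc install estout
sysuse auto, clear

* Store two regression results.
eststo clear
eststo: regress price mpg
eststo: regress price mpg weight

* Hypothetical output folder; in a project this would be a project global.
global outputs "`c(pwd)'"

* Export both models side by side with standard errors and variable labels.
esttab using "${outputs}/price_models.csv", replace se label
```

Changing the file extension to .tex produces a LaTeX table from the same stored results, so the table can be regenerated whenever the analysis changes.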
Related Pages
Click here for pages that link to this topic.
Additional Resources
- DIME Analytics (World Bank), Basics of Programming in Stata
- DIME Analytics (World Bank), Statistical Programming 101
- DIME Analytics (World Bank), Mostly Harmless Replication
- DIME Analytics (World Bank), Sharing sub-functions between different commands. Download the .ado files and follow the instructions.
- DIME Analytics (World Bank), Stata visual library
- DIME Analytics (World Bank), Data Management
- DIME Analytics (World Bank), ietoolkit and iefieldkit- introduction
- DIME Analytics (World Bank), ietoolkit- follow up slides
- DIME Analytics (World Bank), Data Quality Assurance.
- DIME Analytics (World Bank), Data Cleaning and Documentation in Stata (Intro).
- DIME Analytics (World Bank), Data Cleaning in Stata.
- DIME (World Bank), Checklist on submitting results.
- David McKenzie (World Bank), An updated overview of multiple hypothesis testing commands in Stata
- Gentzkow and Shapiro (Stanford) Code and Data for the Social Sciences
- The GeoCenter, Stata cheat sheets.
- Innovations for Poverty Action, Stata modules for data collection and analysis
- Innovations for Poverty Action, GitHub repository on impact evaluations
- Innovations for Poverty Action, Odkmeta command. This command writes a do-file to import ODK (Open Data Kit) data to Stata, using metadata from the survey and choices worksheets of the XLSForm.
- J-PAL, Programming with Stata
- Princeton, Data analysis in Stata for beginners
- Stanford, Basics of Stata
- World Bank, Stata repository.