Difference between revisions of "Stata Coding Practices"

Line 1: Line 1:
This page lists many resources, some developed at DIME and some by other people or organizations.
Stata is used in all stages of an impact evaluation: [[Sampling & Power Calculations | sampling]], [[Randomization in Stata | randomizing]], [[Monitoring Data Quality | monitoring]], [[Data Cleaning | cleaning]], and [[Data Analysis | analyzing]]. Good Stata coding practices, packages, and commands are not only a critical component of high quality, [[Reproducible Research | reproducible research]], but they are also key in saving the research team time, energy, and sanity. This page outlines a number of packages and commands developed by DIME and externally for use in impact evaluations. For additional resources on Stata coding, see [[Stata Coding Practices#Additional Resources | Additional Resources]].  


== ietoolkit ==
==Read First==
<onlyinclude>
*point1
At DIME we have developed a package of Stata commands designed for impact evaluations that may also be useful in other contexts. The package is called '''ietoolkit''' and can be installed from the SSC server. To install the package, type <code>ssc install ietoolkit</code> in your Stata command window.
*point2
</onlyinclude>
Please visit our GitHub page for details: https://github.com/worldbank/ietoolkit


'''ietoolkit''' provides a set of commands that address different aspects of data management and data analysis in relation to Impact Evaluations. These include the following:
==Packages for Impact Evaluations==
# [[iebaltab]] is a tool for running balance test regressions and outputting the results in well-formatted balance tables
# [[ieddtab]] is a tool for running difference-in-differences regressions and outputting the results in well-formatted tables
# [[ieboilstart]] standardizes the boilerplate code at the top of all do-files
# [[ieduplicates]] and [[iecompdup]] are useful tools to identify and correct for duplicates, particularly in primary survey data
# [[iefolder]] sets up project folders and creates master do-files that link to all sub-folders
# [[iegitaddmd]] adds a placeholder file to empty folders so that folder structures with empty folders can be shared on GitHub
# [[iegraph]] produces graphs of estimation results in common impact evaluation regression models
# [[iematch]] is an algorithm for matching observations in one group to "the most similar" observations in another group
# [[iedropone]] drops observations and checks that the correct number was dropped
# [[ieboilsave]] performs checks before saving a data set


== iefieldkit ==
=== iefieldkit ===
<onlyinclude>
<code>[[iefieldkit]]</code> is a Stata package developed by DIME for primary data collection. The package currently supports three major components of that workflow: survey design; survey completion; and data cleaning and survey harmonization. <code>iefieldkit</code> performs the following three tasks:
At DIME we have also developed a package of Stata commands designed for primary data collection. The package is called '''iefieldkit''' and can be installed from the SSC server (as of Feb 2019). To install the package, type <code>ssc install iefieldkit</code> in your Stata command window.
*Before data collection, <code>[[ietestform]]</code> complements the ODK syntax test on the [[SurveyCTO Coding Practices | SurveyCTO]] server. It runs tests to inform researchers how to use ODK programming language features to ensure high data quality. This command is especially useful if the data that will be imported to Stata has other restrictions in addition to ODK syntax.
</onlyinclude>
*During data collection, <code>[[ieduplicates]]</code> and <code>iecompdup</code> (both previously released as a part of the package ietoolkit but now moved to this package) provide a workflow for detecting and resolving duplicate entries in the dataset. This ensures that the final survey dataset is a correct record of the survey sample that the researcher can then merge into the master sampling database.
Please visit our GitHub page for details: https://github.com/worldbank/iefieldkit
*After data collection, <code>[[iecodebook]]</code> provides a workflow for rapidly cleaning, harmonizing, and documenting datasets. <code>iecodebook</code> uses input specified in an Excel sheet, which gives a better-structured overview that is easier to follow (especially for non-technical users) than the same operations written directly in a do-file.
To install the package, type <code>ssc install iefieldkit</code> in your Stata command window. Note that some features of the package might require metadata specific to SurveyCTO, but you are free to try these commands on any use case. For more details, see the [https://github.com/worldbank/iefieldkit/ <code>iefieldkit</code> GitHub page].


'''iefieldkit''' provides a set of commands that address different aspects of primary data collection. These include the following:
=== ietoolkit ===
# [[iecodebook]] is a tool for applying bulk changes to data sets and combining data sets from slightly different data collections.
# [[ietestform]] is a tool for testing [[SurveyCTO_Coding_Practices|SurveyCTO]] forms for typos, adherence to best practices, etc.


== Stata Command Repository ==
<code>ietoolkit</code> is a Stata package developed by DIME for data management and analysis in impact evaluations. The list of commands will be extended continuously, and suggestions for new commands are always appreciated.


This repository contains a large number of [https://github.com/worldbank/stata Stata ado files]. These commands cannot be installed through SSC; follow the link for installation instructions. The repository holds a broad variety of Stata commands ([https://gist.github.com/kbjarkefur/1f880b78029eaf78416d12dfd2076985 adofiles]) that are useful in data management, statistical analysis, and the production of graphics. In many cases, these adofiles reduce the production of routine items from a tedious programming task to a single command line, such as data import and cleaning; production of summary statistics tables; and categorical bar charts with confidence intervals.
<code>ietoolkit</code>’s commands for data management currently include <code>[[iefolder]]</code>, which sets up project folders and creates master do-files that link to all sub-folders; <code>[[iegitaddmd]]</code>, which adds a placeholder file to empty folders so that folder structures with empty folders can be shared on GitHub; and <code>[[ieboilstart]]</code>, which standardizes the boilerplate code at the top of all do-files.


== DIME's Stata IE Visual Library ==
Its commands for data analysis currently include <code>[[iematch]]</code>, an algorithm for matching observations in one group to "the most similar" observations in another group; <code>[[iebaltab]]</code>, which runs balance test regressions and outputs the results in well-formatted balance tables; <code>[[iedropone]]</code>, which drops observations and checks that the correct number was dropped; <code>[[ieboilsave]]</code>, which performs checks before saving a data set; <code>[[ieddtab]]</code>, which runs [[Difference-in-Differences | difference-in-differences]] regressions and outputs the results in well-formatted tables; and <code>[[iegraph]]</code>, which produces graphs of estimation results in common impact evaluation regression models.


We have developed a repository where we collect [https://github.com/worldbank/Stata-IE-Visual-Library Stata Graph examples] on GitHub. Feel free to submit your own code examples there.
To install <code>ietoolkit</code>, type <code>ssc install ietoolkit</code> in your Stata command window. For more details, see the [https://worldbank.github.io/ietoolkit/ <code>ietoolkit</code> GitHub page].


== Snippets of Code with Best Practices with Explanations ==
== Command Repository ==
 
You can find a broad variety of Stata commands in this [https://gist.github.com/kbjarkefur/1f880b78029eaf78416d12dfd2076985 repository], which contains ado files for commands useful for data management, statistical analysis, and the production of graphics. In many cases, these adofiles reduce the production of routine items from a tedious programming task to a single command line (e.g., data import and cleaning; production of summary statistics [[Checklist: Submit Table | tables]]; and categorical bar charts with confidence intervals).
 
* DIME Analytics’ [https://gist.github.com/kbjarkefur/1f880b78029eaf78416d12dfd2076985 Intro to how to write programs (also called commands or functions) in Stata] and [https://gist.github.com/kbjarkefur/16b63c1fc89ab52c3d4cae9c74288452 Share functions (sub-programs) between command in the same package] are easy to experiment with and can be built on to fit many different contexts. Download the files and read the instructions.
*This DIME Analytics [https://worldbank.github.io/Stata-IE-Visual-Library/ repository] hosts Stata Graph examples on GitHub. Feel free to submit your own code examples there.
*World Bank Stata GitHub page: https://worldbank.github.io/stata/


The following code examples are written so that they are easy to experiment with and can be built on to fit many different contexts. Download the files and read the instructions.


* [https://gist.github.com/kbjarkefur/1f880b78029eaf78416d12dfd2076985 Intro to how to write programs (also called commands or functions) in Stata]
* [https://gist.github.com/kbjarkefur/16b63c1fc89ab52c3d4cae9c74288452 Share functions (sub-programs) between command in the same package]


== Additional Resources ==
PROGRAMS
*[http://www.poverty-action.org/researchers/research-resources/stata-programs Stata modules for data collection and analysis] developed by Innovations for Poverty Action
*[https://github.com/PovertyAction/odkmeta odkmeta command] developed by Innovations for Poverty Action
*More on <code>iefolder</code> in DIME Analytics’ [https://github.com/worldbank/DIME-Resources/blob/master/welcome-iefolder.pdf presentation]
GENERAL CODING
https://github.com/worldbank/DIME-Resources/blob/master/stata1-2-coding.pdf
https://github.com/worldbank/DIME-Resources/blob/master/stata1-3-cleaning.pdf
*[http://geocenter.github.io/StataTraining/portfolio/01_resource/  Stata cheat sheets] on GitHub
https://www.povertyactionlab.org/sites/default/files/resources/IAPStataWorkshopSlides.pdf
https://www.princeton.edu/~otorres/StataTutorial.pdf
https://web.stanford.edu/~leinav/teaching/econ257/STATA.pdf


* [http://www.poverty-action.org/researchers/research-resources/stata-programs Stata modules for data collection and analysis] developed by Innovations for Poverty Action
* [https://github.com/PovertyAction/odkmeta odkmeta command]
*[http://geocenter.github.io/StataTraining/portfolio/01_resource/  Stata cheat sheets] on GitHub


[[Category: Stata ]]

Revision as of 17:59, 14 May 2019

Stata is used in all stages of an impact evaluation: sampling, randomizing, monitoring, cleaning, and analyzing. Good Stata coding practices, packages, and commands are not only a critical component of high quality, reproducible research, but they are also key in saving the research team time, energy, and sanity. This page outlines a number of packages and commands developed by DIME and externally for use in impact evaluations. For additional resources on Stata coding, see Additional Resources.

Read First

  • point1
  • point2

Packages for Impact Evaluations

iefieldkit

iefieldkit is a Stata package developed by DIME for primary data collection. The package currently supports three major components of that workflow: survey design; survey completion; and data cleaning and survey harmonization. iefieldkit performs the following three tasks:

  • Before data collection, ietestform complements the ODK syntax test on the SurveyCTO server. It runs tests to inform researchers how to use ODK programming language features to ensure high data quality. This command is especially useful if the data that will be imported to Stata has other restrictions in addition to ODK syntax.
  • During data collection, ieduplicates and iecompdup (both previously released as a part of the package ietoolkit but now moved to this package) provide a workflow for detecting and resolving duplicate entries in the dataset. This ensures that the final survey dataset is a correct record of the survey sample that the researcher can then merge into the master sampling database.
  • After data collection, iecodebook provides a workflow for rapidly cleaning, harmonizing, and documenting datasets. iecodebook uses input specified in an Excel sheet, which gives a better-structured overview that is easier to follow (especially for non-technical users) than the same operations written directly in a do-file.

To install the package, type ssc install iefieldkit in your Stata command window. Note that some features of the package might require metadata specific to SurveyCTO, but you are free to try these commands on any use case. For more details, see the iefieldkit GitHub page.
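
As a minimal sketch of the installation step, the lines below install the package and then check that Stata can locate each command named in this section. The replace option and the use of which to confirm the installation are illustrative additions, not part of the instructions above.

```stata
* Install iefieldkit from SSC; replace updates a copy that is already installed
ssc install iefieldkit, replace

* Confirm that Stata can find each command in the package
which ietestform
which ieduplicates
which iecompdup
which iecodebook
```

If any which call reports that a command was not found, the installed version may predate that command, and rerunning the install line should update it.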

ietoolkit

ietoolkit is a Stata package developed by DIME for data management and analysis in impact evaluations. The list of commands will be extended continuously, and suggestions for new commands are always appreciated.

ietoolkit’s commands for data management currently include iefolder, which sets up project folders and creates master do-files that link to all sub-folders; iegitaddmd, which adds a placeholder file to empty folders so that folder structures with empty folders can be shared on GitHub; and ieboilstart, which standardizes the boilerplate code at the top of all do-files.

Its commands for data analysis currently include iematch, an algorithm for matching observations in one group to "the most similar" observations in another group; iebaltab, which runs balance test regressions and outputs the results in well-formatted balance tables; iedropone, which drops observations and checks that the correct number was dropped; ieboilsave, which performs checks before saving a data set; ieddtab, which runs difference-in-differences regressions and outputs the results in well-formatted tables; and iegraph, which produces graphs of estimation results in common impact evaluation regression models.

To install ietoolkit, type ssc install ietoolkit in your Stata command window. For more details, see the ietoolkit GitHub page.
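
The following sketch shows how the top of a do-file might look once ietoolkit is installed, using ieboilstart to standardize settings. The version number 13.1 is only an illustrative choice, and option names can differ between ietoolkit releases, so check help ieboilstart for the version you have installed.

```stata
* Install ietoolkit from SSC; replace updates a copy that is already installed
ssc install ietoolkit, replace

* Standardize Stata settings at the top of a do-file.
* 13.1 is an example value, not a recommendation from this page.
ieboilstart, versionnumber(13.1)

* ieboilstart returns the version command in r(version); run it to apply the setting
`r(version)'
```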

Command Repository

You can find a broad variety of Stata commands in this repository, which contains ado files for commands useful for data management, statistical analysis, and the production of graphics. In many cases, these adofiles reduce the production of routine items from a tedious programming task to a single command line (e.g., data import and cleaning; production of summary statistics tables; and categorical bar charts with confidence intervals).


Additional Resources

PROGRAMS


GENERAL CODING

  • https://github.com/worldbank/DIME-Resources/blob/master/stata1-2-coding.pdf
  • https://github.com/worldbank/DIME-Resources/blob/master/stata1-3-cleaning.pdf
  • https://www.povertyactionlab.org/sites/default/files/resources/IAPStataWorkshopSlides.pdf
  • https://www.princeton.edu/~otorres/StataTutorial.pdf
  • https://web.stanford.edu/~leinav/teaching/econ257/STATA.pdf