Difference between revisions of "Stata Linter"
Revision as of 14:26, 1 March 2023
Researchers use Stata in all stages of an impact evaluation (or study), such as sampling, randomizing, monitoring data quality, cleaning, and analysis. Good coding practices, packages, and commands are therefore a critical component of high-quality reproducible research. To improve the quality of code, DIME Analytics has developed Stata linter, a Stata package that identifies and corrects bad coding practices in Stata do-files. It does so using the `lint` command.
Read first
- The Stata linter follows the practices outlined in the DIME Analytics Stata Style Guide, and the DIME Analytics Coding Guide.
- This package incorporates Stata coding best practices to improve the clarity of code in Stata.
- These practices allow the impact evaluation team (or research team) to save time and energy, and focus on other aspects of study design.
Overview
The Stata linter detects two types of coding practices in Stata do-files that prevent correct functionality and legibility: one, possible programming errors, and two, style practices that can be improved, with an emphasis on the latter. It includes two main features:
- Detection: Identifies coding practices that should be changed to improve code clarity. With the `verbose` option, it lists each line where a correction is needed.
- Correction: Automatically applies corrections to some of the identified bad coding practices and saves a new do-file with the results.
Prerequisites
The Stata linter command runs in Stata, but uses Python code in the background. You will need Stata 16 (or higher) and Python to run it, and you will have to ensure that your installations of Stata and Python are integrated and that the Pandas package is installed.
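The prerequisites above can be checked from within Stata before installing the package. The following is a minimal sketch that uses only built-in Stata commands; it assumes that Stata has already been linked to a Python installation.

```stata
* Stop early if the Stata version is older than 16
assert c(stata_version) >= 16

* Show which Python installation Stata is currently linked to
python query

* Check that the Python installation used by Stata can find pandas
python which pandas
```

If `python query` does not report a Python installation, `python search` lists the installations that Stata can find on the machine.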
Installation
You can install the latest version of the Stata linter by running the following command in Stata:
ssc install stata_linter
Basic use
This section summarizes the two basic features of the Stata linter: detection and correction.
Detection feature
The basic syntax of the command is the following (you can find the do-file used for this example here). Note that the name of the package is `stata_linter` but the name of the command is `lint`.
lint "bad.do"
Stata will display a list of bad practices and specify whether and how often each was found. Linting the example do-file gives the following output:
-------------------------------------------------------------------------------------
Bad practice Occurrences
-------------------------------------------------------------------------------------
Hard tabs used instead of soft tabs: Yes
One-letter local name in for-loop: 3
Non-standard indentation in { } code block: 7
No indentation on line following ///: 1
Use of . where missing() is appropriate: 5
Missing whitespaces around operators: 0
Implicit logic in if-condition: 1
Delimiter changed: 1
Working directory changed: 0
Lines too long: 4
Global macro reference without { }: 0
Potential omission of missing values in expression: 1
Backslash detected in potential file path: 0
Tilde (~) used instead of bang (!) in expression: 5
-------------------------------------------------------------------------------------
The detection feature can be customized using a variety of options, such as:
1. Show exactly which lines have bad coding practices:
lint "test/bad.do", verbose
2. Remove the summary of bad practices:
lint "test/bad.do", nosummary
3. Specify the number of whitespaces used for indentation (default: 4):
lint "test/bad.do", indent(2)
4. Specify the maximum number of characters in a line (default: 80):
lint "test/bad.do", linemax(100)
5. Specify the number of whitespaces used instead of hard tabs (default: 4):
lint "test/bad.do", tab_space(3)
6. Export the results of the line-by-line analysis to an Excel file:
lint "test/bad.do", excel("test_dir/detect_output.xlsx")
7. Finally, you can also use this command to test all the do-files that are in a folder:
lint "test"
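The options listed above follow standard Stata option syntax, so several of them can presumably be given in one call. The sketch below assumes that they can be combined in this way; the file paths are the example paths used on this page and should be replaced with your own.

```stata
* Example do-file path from this page; replace with the file you want to check
local dofile "test/bad.do"

* List the affected lines, skip the summary table,
* and allow lines of up to 100 characters
lint "`dofile'", verbose nosummary linemax(100)

* Check against a 2-space style and export the line-by-line results to Excel
lint "`dofile'", indent(2) tab_space(2) excel("test_dir/detect_output.xlsx")
```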
The issues flagged by the detection feature are as follows:
- Use whitespaces instead of hard tabs: Use whitespaces (usually 2 or 4) instead of hard tabs.
- Avoid abstract index names: In for-loop statements, index names should describe what the code is looping over.
- Use proper indentations: After declaring for-loop statements or if-else statements, add indentation with whitespaces (usually 2 or 4) in the lines inside the loop.
- Use indentations after declaring newline symbols (`///`): After a new line statement (`///`), add indentation (usually 2 or 4 whitespaces).
- Use the `!missing()` function for conditions with missing values: For clarity, use `!missing(var)` instead of `var < .` or `var != .`.
- Add whitespaces around math symbols (+, =, <, >): For better readability, add whitespaces around math symbols. For example, write `gen a = b + c if d == e` instead of `gen a=b+c if d==e`.
- Specify the condition in an `if` statement: Always explicitly specify the condition in the `if` statement. For example, declare `if var == 1` instead of just using `if var`.
- Do not use `#delimit`: Use `///` instead for line breaks.
- Do not use `cd` to change the current folder: Use absolute and dynamic file paths.
- Use line breaks in long lines: For lines that are too long, use `///` to divide them into multiple lines. It is recommended to restrict the number of characters in a line to 80 or fewer.
- Use curly brackets for global macros: Always use `${ }` for global macros. For example, use `${global_name}` instead of `$global_name`.
- Include missing values in condition expressions: Condition expressions like `var != 0` or `var > 0` evaluate to true for missing values. Make sure to explicitly take missing values into account by using `missing(var)` in expressions.
- Check that backslashes are not used in file paths: If backslashes (`\`) are used in file paths, replace them with forward slashes (`/`). Note that the linter might not perfectly distinguish which uses of a backslash are file paths. In general, this flag comes up every time a backslash is used in the same line as a local, a global, or the `cd` command.
- Check that tildes (`~`) are not used for negations: If you are using tildes (`~`) for negations, replace them with bangs (`!`).
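Several of the practices listed above can be illustrated in a short do-file. This is only a sketch of the preferred style, not output produced by the linter; it uses the `auto` dataset shipped with Stata, and the variable names and the project folder are hypothetical choices made for illustration.

```stata
* Load an example dataset shipped with Stata
sysuse auto, clear

* Descriptive loop index, indented block, whitespace around operators,
* and missing() rather than a comparison with a period
foreach outcome of varlist price mpg {
    generate `outcome'_positive = (`outcome' > 0) if !missing(`outcome')
}

* Missing values are treated as larger than any number,
* so the missing case is handled explicitly
generate high_repair = (rep78 > 3) if !missing(rep78)

* Explicit condition in an if statement
count if missing(rep78)
if r(N) > 0 {
    display "rep78 has " r(N) " missing values"
}

* Long command split with three forward slashes and an indented continuation
regress price mpg weight length ///
    if !missing(rep78)

* Global macro referenced with curly brackets and forward slashes
* (the folder below is hypothetical)
global project "C:/Users/username/project"
display "${project}/data/survey.dta"
```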
Correction feature
The linter can automate corrections to the identified bad practices. When you choose to do this, you are asked to specify the name of the do-file where the corrections will be saved (this ensures your original do-file is not overwritten).
lint "test/bad.do" using "test/bad_corrected.do"
You are then asked whether you want each specific bad practice detected to be corrected. For example, the command above displays the following requests for confirmation:
------------------------------------------------------------
Correcting do-file
------------------------------------------------------------
Avoid using [delimit], use three forward slashes (///) instead.
Do you want to correct this? To confirm type Y and hit enter, to abort type N and hit enter. Type BREAK and hit enter to stop the code. See option automatic to not be prompted before creating files.
Other options for correction include:
1. Automatic (Stata corrects the file automatically, without asking the user for confirmation):
lint "test/bad.do" using "test/bad_corrected.do", automatic
2. Replace the output file if it already exists:
lint "test/bad.do" using "test/bad_corrected.do", automatic replace
As of version 1.1 of the Stata linter, the correction feature does the following:
- Replaces the use of `#delimit` with three forward slashes (`///`) in each line affected by `#delimit`.
- Replaces hard tabs with soft spaces (4 by default). The number of spaces can be set with the `tab_space()` option.
- Indents lines inside curly brackets with 4 spaces by default. The number of spaces can be set with the `indent()` option.
- Breaks long lines into multiple lines. Long lines are considered to have more than 80 characters by default, but this setting can be changed with the option `linemax()`. Note that lines can only be split at whitespaces that are not inside parentheses, curly brackets, or double quotes. If a line does not have any whitespaces, the linter will not be able to break it.
- Adds whitespace before opening curly brackets, except for globals.
- Removes redundant blank lines after closing curly brackets.
- Removes duplicated blank lines.
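Because the spacing and line-length settings described above are ordinary options, a correction run with custom settings might look like the following. This is a sketch that assumes these options can be combined with `automatic` and `replace` in a single call; the file paths are the example paths used on this page.

```stata
* Write a corrected copy that uses 2-space indentation, 2 spaces per hard tab,
* and a maximum line length of 100 characters
lint "test/bad.do" using "test/bad_corrected.do", ///
    automatic replace indent(2) tab_space(2) linemax(100)
```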
Recommended use
We recommend the following workflow:
- Use the detection feature to get an idea of how many bad coding practices the do-file has.
- Decide whether to use the correction feature. If only a few bad practices are flagged, they can be corrected manually with the help of the `verbose` option.
- If there are many bad practices, use the correction feature and verify that the outputs of the do-file have not changed.
- Re-apply the detection feature and correct any outstanding issues manually.
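The workflow above can be sketched as a short do-file. It only uses the commands and options described on this page; the file names are the example paths used earlier, and combining `verbose` with `nosummary` is an assumption.

```stata
* Example file names from this page; replace with your own do-files
local original  "test/bad.do"
local corrected "test/bad_corrected.do"

* Step 1: get an overview of how many bad practices the do-file has
lint "`original'"

* Step 2: list the affected lines to decide between manual and automatic fixes
lint "`original'", verbose nosummary

* Step 3: if there are many issues, write a corrected copy of the do-file
lint "`original'" using "`corrected'", automatic replace

* Step 4: run the detection again on the corrected copy
* and fix any remaining issues by hand
lint "`corrected'", verbose
```

Between steps 3 and 4, run both the original and the corrected do-file and compare what they produce, so that any change in results is caught before the corrected file replaces the original.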