Stata Linter

Researchers use Stata in all stages of an impact evaluation (or study), such as sampling, randomizing, monitoring data quality, cleaning, and analysis. Good coding practices, packages, and commands are therefore a critical component of high-quality, reproducible research. To improve code quality, DIME Analytics has developed the Stata linter, a Stata package that identifies and corrects bad coding practices in Stata do-files. It does so through the lint command.

Read first

Overview

The Stata linter detects two types of coding practices in Stata do-files that hinder correct functionality and legibility: possible programming errors, and style practices that can be improved, with an emphasis on the latter. It has two main features:

  1. Detection: Identifies coding practices that should be changed to improve code clarity. It can display a list of by-line items where corrections are needed using the verbose option.
  2. Correction: Automatically applies corrections to some of the identified bad coding practices and saves a new do-file with the results.

Prerequisites

The lint command runs in Stata but uses Python code in the background. You will need Stata 16 (or higher) and Python to run it. You must also ensure that your Stata and Python installations are integrated and that the pandas package is installed.
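The prerequisites above can be checked from within Stata before installing the package. The following is a minimal sketch using Stata's built-in python commands (available from Stata 16); the exact output depends on your machine, and the commented-out path is a hypothetical placeholder.

```stata
// Show the running Stata version (the linter needs 16 or higher)
display c(stata_version)

// Show the Python installation Stata is currently linked to, if any
python query

// List the Python installations Stata can find on this machine
python search

// Check that pandas is visible to the linked Python; this errors if it is not
python which pandas

// If no Python is linked, one can be set explicitly (hypothetical path):
// python set exec "/usr/bin/python3", permanently
```

If the pandas check fails, the package has to be installed into the Python installation that Stata is linked to, for example with pip from a terminal.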

Installation

You can install the latest version of the Stata linter by running the following command in Stata:

ssc install stata_linter
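As a small sketch of how the installation might be verified, the commands below reinstall the package and confirm that Stata can locate the lint command. They rely only on standard Stata commands (ssc, which, help) and assume the package installs a lint ado-file, as the command name suggests.

```stata
// Install the package, or replace an existing copy with the version on SSC
ssc install stata_linter, replace

// Confirm that Stata can find the lint command on the ado-path
which lint

// Open the help file for the command
help lint
```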

Basic use

This section summarizes the two basic features of the Stata linter: detection and correction.

Detection feature

The basic syntax of the command is the following (you can find the do-file used for this example here). Note that the name of the package is stata_linter but the name of the command is lint.

lint "bad.do"

Stata will display a list of bad practices and specify whether and how often each was found. Linting the example do-file gives the following output:

-------------------------------------------------------------------------------------
Bad practice                                                          Occurrences          
-------------------------------------------------------------------------------------
Hard tabs used instead of soft tabs:                                  Yes       
One-letter local name in for-loop:                                    3
Non-standard indentation in { } code block:                           7
No indentation on line following ///:                                 1
Use of . where missing() is appropriate:                              5
Missing whitespaces around operators:                                 0
Implicit logic in if-condition:                                       1
Delimiter changed:                                                    1
Working directory changed:                                            0
Lines too long:                                                       4
Global macro reference without { }:                                   0
Potential omission of missing values in expression:                   1
Backslash detected in potential file path:                            0
Tilde (~) used instead of bang (!) in expression:                     5
-------------------------------------------------------------------------------------

The detection feature can be customized using a variety of options, such as:

1. Show exactly which lines have bad coding practices:
lint "test/bad.do", verbose

2. Suppress the summary of bad practices:
lint "test/bad.do", nosummary

3. Specify the number of whitespaces used for indentation (default: 4):
lint "test/bad.do", indent(2)

4. Specify the maximum number of characters in a line (default: 80):
lint "test/bad.do", linemax(100)

5. Specify the number of whitespaces used instead of hard tabs (default: 4):
lint "test/bad.do", tab_space(3)

6. Export the results of the line-by-line analysis to an Excel file:
lint "test/bad.do", excel("test_dir/detect_output.xlsx")

7. Check all the do-files in a folder:
lint "test"
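The options listed above are shown one at a time. The sketch below assumes that they can be combined in a single call, as is usual for Stata command options; the file paths are the ones used in the examples above.

```stata
// List the flagged lines without the summary table, checking against
// 2-space indentation and a 100-character line limit
lint "test/bad.do", verbose nosummary indent(2) linemax(100)

// Run the same check and also export the line-by-line results to Excel
lint "test/bad.do", verbose indent(2) linemax(100) ///
    excel("test_dir/detect_output.xlsx")
```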

The issues flagged by the detection feature are as follows:

  1. Use whitespaces instead of hard tabs: Use whitespaces (usually 2 or 4) instead of hard tabs.
  2. Avoid abstract index names: In for-loop statements, index names should describe what the code is looping over.
  3. Use proper indentations: After declaring for-loop statements or if-else statements, add indentation with whitespaces (usually 2 or 4) in the lines inside the loop.
  4. Use indentations after declaring newline symbols (///): After a new line statement (///), add indentation (usually 2 or 4 whitespaces).
  5. Use the !missing() function for conditions with missing values: For clarity, use !missing(var) instead of var < . or var != .
  6. Add whitespaces around math symbols (+, =, <, >): For better readability, add whitespaces around math symbols. For example, do gen a = b + c if d == e instead of gen a=b+c if d==e.
  7. Specify the condition in an if statement: Always explicitly specify the condition in the if statement. For example, declare if var == 1 instead of just using if var.
  8. Do not use #delimit: Use /// for line breaks instead.
  9. Do not use cd to change the current folder: Use absolute and dynamic file paths instead.
  10. Use line breaks in long lines: For lines that are too long, use /// to divide them into multiple lines. It is recommended to restrict the number of characters in a line to 80 or less.
  11. Use curly brackets for global macros: Always use ${ } for global macros. For example, use ${global_name} instead of $global_name.
  12. Include missing values in condition expressions: Condition expressions like var != 0 or var > 0 are evaluated to true for missing values. Make sure to explicitly take missing values into account by using missing(var) in expressions.
  13. Do not use backslashes in file paths: Replace backslashes (\) in file paths with forward slashes (/). Note that the linter may not distinguish perfectly which uses of a backslash are file paths. In general, this flag is raised whenever a backslash appears on the same line as a local, a global, or the cd command.
  14. Do not use tildes (~) for negation: Replace tildes (~) used for negation with bangs (!).
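To make the list above more concrete, here is an illustrative do-file fragment written in the style the checks favor. It is a sketch, not output from the linter: the variable and macro names (twice_price, high_repair, outcome_vars) are hypothetical, and it uses the auto dataset that ships with Stata so that it can be run as is.

```stata
// Load an example dataset that ships with Stata
sysuse auto, clear

// Issues 2, 3 and 6: descriptive loop index, indented loop body,
// and whitespace around operators
foreach outcome in price mpg {
    generate twice_`outcome' = `outcome' * 2
}

// Issues 5 and 12: rep78 has missing values, which Stata treats as
// larger than any number, so they are excluded explicitly
generate high_repair = (rep78 > 3) if !missing(rep78)

// Issue 7: explicit condition rather than a bare variable name
count if foreign == 1

// Issues 4 and 10: a long command split across lines, with the
// continuation line indented
regress price mpg weight length ///
    if !missing(rep78)

// Issue 11: global macro referenced with curly brackets
global outcome_vars "price mpg"
summarize ${outcome_vars}

// Issue 14: bang rather than tilde for negation
count if !missing(rep78) & foreign != 1
```

Without the if !missing(rep78) qualifier, the expression rep78 > 3 would evaluate to true for the observations where rep78 is missing, which is the situation issue 12 warns about.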

Correction feature

The linter can automate corrections to the identified bad practices. When you choose to do this, you are asked to specify the name of the do-file where the corrections will be saved (this ensures your original do-file is not overwritten).

lint "test/bad.do" using "test/bad_corrected.do"

You are then asked whether you want each specific bad practice detected to be corrected. For example, the command above displays the following requests for confirmation:

------------------------------------------------------------
Correcting do-file
------------------------------------------------------------
 
    Avoid using [delimit], use three forward slashes (///) instead.
    Do you want to correct this? To confirm type Y and hit enter, to abort type N and hit enter. Type BREAK and hit enter to stop the code. See option automatic to not be prompted before creating files.

Other options for correction include:

1. Automatic (the linter corrects the file without asking the user for confirmation):
lint "test/bad.do" using "test/bad_corrected.do", automatic
2. Replace the output file if it already exists:
lint "test/bad.do" using "test/bad_corrected.do", automatic replace

As of version 1.1 of the Stata linter, the correction feature does the following:

  1. Replaces the use of #delimit with three forward slashes (///) in each line affected by #delimit.
  2. Replaces hard tabs with spaces (4 by default). The number of spaces can be set with the tab_space() option.
  3. Indents lines inside curly brackets with 4 spaces by default. The number of spaces can be set with the indent() option.
  4. Breaks long lines into multiple lines. Lines are considered long if they have more than 80 characters by default, but this setting can be changed with the linemax() option. Note that lines can only be split at whitespaces that are not inside parentheses, curly brackets, or double quotes. If a long line has no such whitespace, the linter will not be able to break it.
  5. Adds whitespace before opening curly brackets, except for globals.
  6. Removes redundant blank lines after closing curly brackets.
  7. Removes duplicated blank lines.
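The first correction in the list can be illustrated with a hand-written before-and-after pair. This is a sketch of the kind of rewrite described, using the auto dataset that ships with Stata; the linter's actual output for a given file may be formatted differently.

```stata
// Load an example dataset that ships with Stata
sysuse auto, clear

// Form that is flagged: the command ends at the semicolon
#delimit ;
regress price mpg weight
    if foreign == 1 ;
#delimit cr

// Equivalent form with a line continuation and an indented second line
regress price mpg weight ///
    if foreign == 1
```

Both commands fit the same regression, so results produced by the do-file should not change after this kind of correction.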

Recommended Workflow

We recommend the following workflow for using the Stata linter command:

  1. Use the detection feature to get an idea of how many bad coding practices the do-file has.
  2. Decide whether to use the correction feature. If only a few bad practices are flagged, they can be corrected manually with the help of the verbose option.
  3. If there are many bad practices, use the correction feature and verify that the outputs of the do-file have not changed.
  4. Re-apply the detection feature and correct any outstanding issues manually.
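The workflow above can be sketched as a short sequence of commands. The sketch reuses the file paths from the earlier examples and only the options described on this page; checking that the do-file's outputs are unchanged (step 3) still has to be done by running both versions and comparing their results.

```stata
// Step 1: get a summary of the bad practices in the do-file
lint "test/bad.do"

// Step 2: list the flagged lines to judge whether manual fixes are enough
lint "test/bad.do", verbose

// Step 3: if there are many issues, write a corrected copy
lint "test/bad.do" using "test/bad_corrected.do", automatic replace

// Step 4: re-run detection on the corrected copy and fix what remains by hand
lint "test/bad_corrected.do", verbose
```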

Additional Resources