Difference between revisions of "Stata Linter"

Latest revision as of 14:31, 1 March 2023

Researchers use Stata in all stages of an impact evaluation (or study), such as sampling, randomizing, monitoring data quality, cleaning, and analysis. Good coding practices, packages, and commands are therefore a critical component of high-quality, reproducible research. To improve the quality of code, DIME Analytics has developed the Stata linter, a Stata package that identifies and corrects bad coding practices in Stata do-files. It does so using the lint command.

Read first

The Stata linter follows the practices outlined in the DIME Analytics Stata Style Guide, and the DIME Analytics Coding Guide.
This package incorporates Stata coding best practices to improve the clarity of code in Stata.
These practices allow the impact evaluation team (or research team) to save time and energy, and focus on other aspects of study design.

Overview

The Stata linter detects two types of problems in Stata do-files that can impair correct functionality and legibility: possible programming errors, and style practices that can be improved. Its emphasis is on the latter. It includes two main features:

Detection: Identifies coding practices that should be changed to improve code clarity. With the verbose option, it displays a line-by-line list of the items that need correction.
Correction: Automatically applies corrections to some of the identified bad coding practices and saves a new do-file with the results.

Prerequisites

The Stata linter command runs in Stata, but uses Python code in the background. You will need Stata 16 (or higher) and Python to run it, and you will have to ensure that your installations of Stata and Python are integrated and that the Pandas package is installed.
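
The prerequisites above can be checked from within Stata before installing the package. The lines below are a minimal sketch that relies only on Stata's built-in python commands (available from Stata 16); the error message text is illustrative.

```stata
* Confirm the Stata version meets the stated requirement (16 or higher)
display c(stata_version)

* Show which Python installation Stata is currently linked to
python query

* Check that pandas can be found; capture stores a non-zero code in _rc
* if the check fails
capture python which pandas
if _rc != 0 {
    display as error "pandas was not found by the Python linked to Stata"
}
```

If no Python installation is linked, python search lists the installations that Stata can find on the machine.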

Installation

You can install the latest version of the Stata linter by running the following command in Stata:

ssc install stata_linter

Basic use

This section summarizes the two basic functionalities of the Stata linter: detection and correction.

Detection feature

The basic syntax of the command is the following (you can find the do-file used for this example here). Note that the name of the package is stata_linter but the name of the command is lint.

lint "bad.do"

Stata will display a list of bad practices and specify whether and how often each was found. Linting the example do-file gives the following output:

-------------------------------------------------------------------------------------
Bad practice                                                          Occurrences          
-------------------------------------------------------------------------------------
Hard tabs used instead of soft tabs:                                  Yes       
One-letter local name in for-loop:                                    3
Non-standard indentation in { } code block:                           7
No indentation on line following ///:                                 1
Use of . where missing() is appropriate:                              5
Missing whitespaces around operators:                                 0
Implicit logic in if-condition:                                       1
Delimiter changed:                                                    1
Working directory changed:                                            0
Lines too long:                                                       4
Global macro reference without { }:                                   0
Potential omission of missing values in expression:                   1
Backslash detected in potential file path:                            0
Tilde (~) used instead of bang (!) in expression:                     5
-------------------------------------------------------------------------------------

The detection feature can be customized using a variety of options, such as:

1. Show exactly which lines have bad coding practices:
lint "test/bad.do", verbose

2. Suppress the summary of bad practices:
lint "test/bad.do", nosummary

3. Specify the number of whitespaces used for indentation (default: 4):
lint "test/bad.do", indent(2)

4. Specify the maximum number of characters in a line (default: 80):
lint "test/bad.do", linemax(100)

5. Specify the number of whitespaces used instead of hard tabs (default: 4):
lint "test/bad.do", tab_space(3)

6. Export the results of the line-by-line analysis to an Excel file:
lint "test/bad.do", excel("test_dir/detect_output.xlsx")

7. Lint all the do-files in a folder:
lint "test"

The issues flagged by the detection feature are as follows:

Use whitespaces instead of hard tabs: Use whitespaces (usually 2 or 4) instead of hard tabs.
Avoid abstract index names: In for-loop statements, index names should describe what the code is looping over.
Use proper indentations: After declaring for-loop statements or if-else statements, add indentation with whitespaces (usually 2 or 4) in the lines inside the loop.
Use indentations after declaring newline symbols (///): After a new line statement (///), add indentation (usually 2 or 4 whitespaces).
Use the !missing() function for conditions with missing values: For clarity, use !missing(var) instead of var < . or var != .
Add whitespaces around math symbols (+, =, <, >): For better readability, add whitespaces around math symbols. For example, do gen a = b + c if d == e instead of gen a=b+c if d==e.
Specify the condition in an if statement: Always explicitly specify the condition in the if statement. For example, declare if var == 1 instead of just using if var.
Do not use #delimit: Use /// instead for line breaks.
Do not use cd to change current folder: Use absolute and dynamic file paths.
Use line breaks in long lines: For lines that are too long, use /// to divide them into multiple lines. It is recommended to restrict the number of characters in a line to 80 or less.
Use curly brackets for global macros: Always use ${ } for global macros. For example, use ${global_name} instead of $global_name.
Include missing values in condition expressions: Condition expressions like var != 0 or var > 0 are evaluated to true for missing values. Make sure to explicitly take missing values into account by using missing(var) in expressions.
Check that backslashes are not used in file paths: If you are using backslashes (\) in file paths, replace them with forward slashes (/). Note that the linter might not distinguish perfectly which uses of a backslash are file paths. In general, this flag will come up every time a backslash is used in the same line as a local, global, or the cd command.
Check that tildes (~) are not used for negations: If you are using tildes (~) for negations, replace them with bangs (!).
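
Several of the practices listed above can be seen together in a short do-file. The following is a minimal sketch, not output from the linter: it uses the auto dataset shipped with Stata, and the variable names, the global project_dir, and the file path are hypothetical and chosen only for illustration.

```stata
* Load a dataset shipped with Stata so the example is self-contained
sysuse auto, clear

* Descriptive loop index, indented loop body, whitespace around operators
foreach size_var of varlist weight length {
    generate log_`size_var' = log(`size_var')
}

* rep78 is missing for some cars. Missing values are treated as larger than
* any number, so (rep78 > 3) alone would also be true for those cars.
generate byte high_repair = (rep78 > 3) if !missing(rep78)

* Explicit condition instead of "if foreign", and ! instead of ~
count if foreign == 1 & !missing(rep78)

* Long command split with ///, with the continuation line indented
regress price mpg weight length ///
    if foreign == 1 & !missing(rep78)

* Global macro referenced with ${ } and a forward-slash file path
* (the folder below is hypothetical and only displayed, never opened)
global project_dir "C:/Users/username/project"
display "${project_dir}/data/survey.dta"
```

Written this way, the snippet should avoid the flags for one-letter loop indices, missing indentation, implicit if-conditions, unhandled missing values, tildes, and global references without curly brackets.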

Correction feature

The linter can automate corrections to the identified bad practices. When you choose to do this, you are asked to specify the name of the do-file where the corrections will be saved (this ensures your original do-file is not overwritten).

lint "test/bad.do" using "test/bad_corrected.do"

You are then asked whether you want each specific bad practice detected to be corrected. For example, the command above displays the following requests for confirmation:

------------------------------------------------------------
Correcting do-file
------------------------------------------------------------
 
    Avoid using [delimit], use three forward slashes (///) instead.
    Do you want to correct this? To confirm type Y and hit enter, to abort type N and hit enter. Type BREAK and hit enter to stop the code. See option automatic to not be prompted before creating files.

Other options for correction include:

1. Automatic (Stata corrects the file automatically, without asking the user for confirmation):
lint "test/bad.do" using "test/bad_corrected.do", automatic
2. Replace the output file if it already exists:
lint "test/bad.do" using "test/bad_corrected.do", automatic replace

As of version 1.1 of the Stata linter, the correction feature makes the following changes:

Replaces the use of #delimit with three forward slashes (///) in each line affected by #delimit.
Replaces hard tabs with soft spaces (4 by default). The number of spaces can be set with the tab_space() option.
Indents lines inside curly brackets with 4 spaces by default. The number of spaces can be set with the indent() option.
Breaks long lines into multiple lines. By default, a line is considered long if it has more than 80 characters; this setting can be changed with the linemax() option. Note that lines can only be split at whitespaces that are not inside parentheses, curly brackets, or double quotes. If a line does not contain any such whitespace, the linter will not be able to break it.
Adds whitespace before opening curly brackets, except for globals.
Removes redundant blank lines after closing curly brackets.
Removes duplicated blank lines.
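
Since the list above refers to the tab_space(), indent(), and linemax() options, a correction call that sets all three might look like the sketch below. It assumes these options can be combined in a single call, which the text above does not state explicitly, and it reuses the example file names from this page.

```stata
* Correct test/bad.do without prompts and overwrite any earlier output file.
* The values 2, 2, and 100 are illustrative choices, not recommendations.
lint "test/bad.do" using "test/bad_corrected.do", ///
    automatic replace tab_space(2) indent(2) linemax(100)
```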

Recommended Workflow

We recommend the following workflow for using the Stata linter command:

Use the detection feature to get an idea of how many bad coding practices the do-file has.
Decide whether or not to use the correction feature. If only a few bad practices are flagged, they can be corrected manually with the help of the verbose option.
If there are many bad practices, use the correction feature and verify that the outputs of the do-file have not changed.
Re-apply the detection feature and correct any outstanding issues manually.
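
The workflow above could be sketched as follows. This is one possible way to check that outputs are unchanged, not a method prescribed by the package: it assumes that each do-file leaves its final dataset in memory and can safely be run twice, and it compares the two results with Stata's built-in datasignature command.

```stata
* Step 1: detect bad practices and list the affected lines
lint "test/bad.do", verbose

* Steps 2 and 3: write a corrected copy without being prompted
lint "test/bad.do" using "test/bad_corrected.do", automatic replace

* Check that both versions produce the same data (assumes each do-file
* leaves its final dataset in memory)
do "test/bad.do"
datasignature
local signature_original "`r(datasignature)'"

do "test/bad_corrected.do"
datasignature
assert "`r(datasignature)'" == "`signature_original'"

* Step 4: lint the corrected file and fix remaining flags by hand
lint "test/bad_corrected.do", verbose
```

If the do-files save their results to disk instead, the saved files would need to be compared, for example with the cf command.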

Additional Resources

David McKenzie (World Bank), An updated overview of multiple hypothesis testing commands in Stata
DIME Analytics (World Bank), Writing programs in Stata. This GitHub repository contains .ado files that reduce tedious programming tasks like statistical analysis and the production of graphs to a single line of command. You can use this repository to experiment with various commands.
DIME Analytics (World Bank), Stata visual library
The GeoCenter, Stata cheat sheets.
Innovations for Poverty Action, Stata modules for data collection and analysis
Innovations for Poverty Action, GitHub repository on impact evaluations
Innovations for Poverty Action, Odkmeta command. This command writes a do-file to import ODK (Open Data Kit) data to Stata, using metadata from the survey and choices worksheets of the XLSForm.
World Bank, Stata repository.

