Reprun

reprun - This command automates a reproducibility check for a single Stata do-file, or for a set of do-files called by a main do-file. The command should be used interactively. reprun executes the do-file once and records the state of Stata after the execution of each line. It then runs the entire do-file a second time and flags all potential reproducibility errors by comparing the Stata state after each line to the state recorded in the first run. Debugging and reporting options are available.

Syntax

reprun "do-file.do" [using "/directory/"] , [verbose] [compact] [suppress(rng|srng|dsum|loop)] [debug] [noclear]

By default, reprun will execute the complete do-file specified in “do-file.do” once (Run 1), and record the “seed RNG state”, “sort order RNG”, and “data checksum” after the execution of every line, as well as the exact data in certain cases. reprun will then execute the do-file a second time (Run 2), and find all changes and mismatches in these states throughout Run 2. A table of mismatches will be reported in the Results window, as well as in a SMCL file in a new directory called /reprun/ in the same location as the do-file. If the using argument is supplied, the /reprun/ directory containing the SMCL file will be stored in that location instead.
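As a minimal sketch of the using argument described above, the call below writes the /reprun/ directory to a separate location rather than next to the do-file. Both "myfile.do" and "/path/to/output" are hypothetical placeholders, not names taken from the command's documentation.

```stata
* Hypothetical paths: replace with your own do-file and output location.
* The /reprun/ directory and its SMCL report are written under /path/to/output.
reprun "myfile.do" using "/path/to/output"
```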

Options

  • verbose: Report all lines where Run 1 and Run 2 mismatch or change for any value.
  • compact: Report only lines where Run 1 and Run 2 mismatch and change for either the seed or sort RNG.
  • suppress(types): Suppress reporting of state changes that do not result in mismatches for seed RNG state (rng), sort order RNG (srng), and/or data checksum (dsum), for any reporting setting. Specifying loop cleans up the display of loops.
  • debug: Save all records of Stata states in Run 1 and Run 2 for inspection in the /reprun/ folder.
  • noclear: Do not reset the Stata state before beginning reproducibility Run 1.

Description

The reprun command is intended to be used to check the reproducibility of a do-file or set of do-files (called by a main do-file) that are ready to be transferred to other users or published. The command will ensure that the outputs produced by the do-file or set of do-files are stable across runs, such that they do not produce reproducibility errors caused by incorrectly managed randomness in Stata. To do so, reprun will check three key sources of reproducibility failure at each point in execution of the do-file(s): the state of the random number generator, the sort order of the data, and the contents of the data itself (see detailed description below).

After completing Run 2, reprun will report all lines where there are mismatches between Run 1 and Run 2 in any of these values. Lines where changes lead to mismatches will be highlighted. Problems should be approached top-to-bottom, as solving earlier issues will often resolve later ones. Within a line, it is also effective to address issues from left to right in the table: RNG states are responsible for most errors, followed by unstable sorts, while data mismatches are typically symptoms of these reproducibility failures rather than causes in and of themselves.

Mismatches are defined as follows:

Seed RNG State: A mismatch occurs whenever the RNG state differs from Run 1 to Run 2, except any time the RNG state is exactly equivalent to set seed 12345 in Run 1 (the initialization default). By default, reprun invokes clear and set seed 12345 to match the default Stata state before beginning Run 1. The noclear option prevents this behavior; this is not recommended unless you have a rare issue that you need to check at the very beginning of the file. Most projects should quickly set the randomization seed appropriately for replicability.
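A minimal sketch of setting the seed early in a do-file, so that the RNG state matches between runs, might look like the following. The version number and seed value are assumptions chosen for illustration; the auto dataset ships with Stata.

```stata
* Assumption: pin whichever Stata version the project actually uses.
version 17
* Hypothetical seed value: choose one once and keep it fixed.
set seed 837251

sysuse auto, clear
* Sort uniquely first, so each random draw attaches to the same row in every run.
isid make, sort
generate double draw = runiform()
```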

Sort Order RNG: Since the sort RNG state should always differ between Run 1 and Run 2, a mismatch is defined as any line where the sort RNG state is advanced and checksum fails to match when compared with the Run 1 data (as a CSV) at the same line. This mismatch occurs when the sort order RNG is used in a command that results in the data taking a different order between the two runs. Users should never manually set the sortseed (See help seed and help sortseed) to override these mismatches; instead, they should implement a unique sort on the data using a command like isid (See help isid).
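One way to implement a unique sort with isid is sketched below, using the auto dataset that ships with Stata. The choice of make as the identifying variable is specific to this example.

```stata
sysuse auto, clear

* Not unique: many cars share the same value of foreign, so ties are
* ordered arbitrarily and may differ between runs.
sort foreign

* Unique: isid confirms that make identifies every row (and errors if it
* does not), and the sort option leaves the data sorted by make.
isid make, sort
```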

Data Checksum: A mismatch occurs whenever the checksum of the data in Run 2 fails to match the checksum of the Run 1 data (saved as a CSV) at the same line. Users should understand that lines where only the data checksum fails to match are unlikely to be where problems originate in the code; these mismatches are generally consequences of earlier reproducibility failures in randomization or sorting. Users should also note that results from datasignature are only unique up to the sort order of each column independently; hence, we do not use this command.
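The idea of checksumming the data as a CSV can be illustrated by hand. This is only a sketch of the general technique and is not necessarily how reprun implements it internally; the file name snapshot_demo.csv is hypothetical.

```stata
sysuse auto, clear

* Export the current data to a CSV file (hypothetical file name).
export delimited using "snapshot_demo.csv", replace

* Compute the checksum of that file; the result is stored in r(checksum).
checksum "snapshot_demo.csv"
display r(checksum)

* Remove the demonstration file.
erase "snapshot_demo.csv"
```

Because the CSV is written row by row, the same observations in a different order would generally yield a different checksum, which is why an unstable sort usually shows up as a data mismatch as well.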

Options

By default, reprun returns a list of mismatches in Stata state between Run 1 and Run 2. This means that any time the state of the random number generator, the sort order of the data, or the contents of the data itself do not match Run 1 during Run 2, a flag will be generated for the corresponding line of code. The user may modify this reporting in several ways using options.

Line flagging options

The verbose option can be used to produce even more detail than the default. If the verbose option is specified, then any line in which the state changes during Run 1 or Run 2, or in which the state mismatches between the runs, will be flagged and reported. This is intended to allow the user to do a deep-dive into the function and structure of the do-file’s execution.

The compact option, by contrast, produces less detailed reporting, but is often a good first step to begin locating issues in the code. If the compact option is specified, then only those lines where the seed or sort order RNG both changes during Run 1 or Run 2 and mismatches between the runs will be flagged and reported. Data checksum mismatches alone will be ignored, as will RNG mismatches not accompanied by a change in the state. This is intended to reduce the reporting of “cascading” differences, which arise when some state value changes inconsistently at a single point and remains inconsistent for the remainder of the run (making every subsequent data change a mismatch, for example).

The suppress() option is used to hide the reporting of changes that do not lead to mismatches (especially when the verbose option is specified) for one or more of the types. In particular, since the sort order RNG frequently changes and should not be forced to match between runs, it will very often have changes that do not produce errors. Specifying suppress(srng) will therefore remove a great deal of unhelpful output from the reporting table. To do this for all states, write suppress(rng srng dsum). Specifying suppress(loop) will clean up the display of loops so that the titles are only shown on the first line; if combined with compact, however, the titles may not display at all.
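The line flagging options above might be combined as in the following sketch, moving from the least to the most detailed report. The do-file name "myfile.do" is a hypothetical placeholder.

```stata
* First pass: only RNG changes that also mismatch between runs.
reprun "myfile.do", compact

* Full detail, but hide sort order RNG changes that do not cause mismatches.
reprun "myfile.do", verbose suppress(srng)

* Full detail, hiding harmless changes for all three state types.
reprun "myfile.do", verbose suppress(rng srng dsum)
```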

Reporting and debugging options

The debug option allows the user to save all of the underlying materials used by reprun in the /reprun/ folder where the reporting SMCL file will be written. This will include copies of all do-files for each run for manual inspection and text files of the states of Stata after each line. These materials are automatically cleaned up after execution if debug is not specified.

Other options

By default, reprun invokes clear and set seed 12345 to match the default Stata state before beginning Run 1. noclear prevents this behavior. It is not recommended unless you have a rare issue that you need to check at the very beginning of the file, because most projects should very quickly set these states appropriately for reproducibility.

Note on reproducibility of certain commands

by and bysort: Users will often use by and bysort or equivalent commands to produce “group-level” statistics. The syntax used is usually something like bysort groupvarname : egen newvarname = function(varlist). However, we note that such an approach necessarily introduces an instability in the sort order within each group. reprun will flag these instances as indeterminate sorts, since they can introduce issues later whenever the code is order-dependent, and will do so immediately for functions like rank() or approaches like bysort groupvarname : gen newvarname = _n. To avoid this, and to write truly reproducible code, users should use the less common but fully reproducible unique sorting syntax of bysort groupvarname (uniqueidvar) ... to ensure a unique sort with by-able commands. For commands with by() options, users should check whether this syntax is available, or remember to re-sort uniquely before any further processes are done. If bysort or the equivalent is called in intermediate or user-written commands that cannot be made to return the data sorted uniquely, those lines will continue to be flagged by reprun. There is not a technical solution to this, to the best of our knowledge; therefore, the flag will remain as a reminder that the user should implement a unique sort after the indicated lines.
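The difference between the two bysort forms can be sketched with the auto dataset that ships with Stata, where foreign plays the role of groupvarname and make the role of uniqueidvar. The new variable names are hypothetical.

```stata
sysuse auto, clear

* Unstable: the order of cars within each value of foreign is arbitrary,
* so the sequence number given to a particular car can differ between runs.
bysort foreign: generate seq_unstable = _n

* Stable: make is unique, so the within-group order is fully determined
* and each car receives the same sequence number in every run.
bysort foreign (make): generate seq_stable = _n
```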

merge m:m and set sortseed: These commands will be flagged interactively by reprun with warnings following the results table, regardless of whether any instability is obviously introduced according to the Stata RNG states. This is because merge m:m and set sortseed, while they often appear to work reproducibly, generally have the function of creating false stability that masks underlying issues in the code. In the case of merge m:m, the data that is produced is always sort-dependent in both datasets, and almost always meaningless as a result. In the case of set sortseed, the command often works to hide an instability in the underlying code that is sort-dependent. Users should instead remove all instances of these commands, and fix whatever issues in the process are causing their results to depend on the (indeterminate) sort order of the data.
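What replaces merge m:m depends on the project, but a common pattern is to confirm which dataset holds the unique key and then use a one-to-many or many-to-one merge. The sketch below uses hypothetical toy data (the variables hhid, hhsize, memberid, and age are invented for illustration).

```stata
* Hypothetical household-level data: one row per hhid.
clear
input hhid hhsize
1 2
2 1
end
* Confirm that the key is unique before merging.
isid hhid
tempfile households
save "`households'"

* Hypothetical member-level data: several rows per hhid.
clear
input hhid memberid age
1 1 34
1 2 31
2 1 58
end
* Unique, reproducible sort of the member data.
isid hhid memberid, sort

* Many-to-one merge on a verified key; assert(match) stops with an error
* if any row fails to match.
merge m:1 hhid using "`households'", assert(match) nogenerate
```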

Examples

Example 1: This is the most basic usage of reprun. Specified in any of the following ways, either in the Stata command window or as part of a new do-file, reprun will execute the complete do-file “myfile.do” once (Run 1), and record the “seed RNG state”, “sort order RNG”, and “data checksum” after the execution of every line, as well as the exact data in certain cases. reprun will then execute “myfile.do” a second time (Run 2), and find all changes and mismatches in these states throughout Run 2. A table of mismatches will be reported in the Results window, as well as in a SMCL file in a new directory called /reprun/ in the same location as “myfile.do”.

reprun "myfile.do"

or

reprun "path/to/folder/myfile.do"

or

local myfolder "/path/to/folder"
reprun "`myfolder'/myfile.do"