Master Do-files

*DIME Analytics' [https://github.com/worldbank/DIME-Resources/blob/master/stata2-3-data.pdf Data Management for Reproducible Research]
*DIME Analytics’ [https://github.com/worldbank/DIME-Resources/blob/master/stata1-3-cleaning.pdf Data Management and Cleaning]
[[Category: Data Management ]]

Revision as of 19:53, 14 May 2019

The master do-file is the main do-file that calls upon and runs all the other do-files of a project. It plays a critical role throughout all stages of the research project and functions as a map to the data folder. This page outlines the components of a well-structured and replicable master do-file.

Read First

  • The command iefolder sets up the master do-file.
  • Anyone with the master do-file should be able to run do-files for all stages of research (cleaning, construction, analysis, exporting, etc.).
  • After each user changes the path global to point to the location of their own project folder, any two people with the master do-file should be able to run it and get identical results.

Overview

A master do-file serves three main purposes:

  1. It compactly and reproducibly runs all do-files needed for data work. More specifically, in the DataWork folder structure, the master do-file houses all survey round master do-files, which contain all round-specific task-level do-files.
  2. It establishes an identical workspace between users by specifying settings, installing programs, and setting globals. Globals, or referenceable pieces of information defined in the do-file and stored in memory until the user exits Stata, help to ensure consistency, accuracy and conciseness in code.
  3. It maps all files within the data folder and serves as the starting point to find any do-file, dataset or output.

Components of a Master Do-file

Intro Header

At the very top of the master do-file, the intro header should clearly explain the purpose of the do-file. It should provide any other important information, including but not limited to an outline of the do-file; the data files required to correctly run the do-file; the data files created by the do-file; or the variable that uniquely identifies the unit of observation in the datasets. The intro header should be understandable to someone unfamiliar with the project.
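
As a minimal sketch of such a header, the block comment below shows one possible layout. The project name, part titles, file descriptions, and the ID variable hh_id are all hypothetical placeholders, not part of any iefolder template:

```stata
/*******************************************************************************
*   PROJECT ABC: PROJECT MASTER DO-FILE   (hypothetical example)
*
*   PURPOSE:      Run all do-files for cleaning, construction, and analysis
*
*   OUTLINE:      PART 1: Install packages and standardize settings
*                 PART 2: Set root folder and project folder globals
*                 PART 3: Run the round master do-files
*
*   REQUIRES:     Raw baseline and endline survey datasets
*   CREATES:      Cleaned datasets, analysis datasets, tables, and figures
*
*   ID VARIABLE:  hh_id (assumed name; uniquely identifies each household)
*
*   NOTES:        Set the user number in PART 2 before running
*******************************************************************************/
```

Whatever layout you choose, keeping the same fields in the same order across all master do-files in a project makes them easier to scan.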

Installation of ietoolkit and User Written Commands

Master do-files created by iefolder must include a line to install the package ietoolkit. After this line, you can install any other user-written commands the project needs. Add the replace option to each installation command so that the latest version of each command is installed. Overall, this section will look something like this:

       *Install all packages that this project requires:
       ssc install ietoolkit, replace
       ssc install outreg2  , replace
       ssc install estout   , replace
       ssc install ivreg2   , replace

You may comment out this section once the commands are installed. However, for replicability, it is important that the master do-file always includes this section, whether commented out or not.

Settings

Stata allows the user to customize a wide range of settings: Stata version, memory settings, code interpretation settings, etc. If two users with different settings run the same code, the code could crash or yield different results. ieboilstart sets the settings to the values recommended by Stata, thus harmonizing settings across users. Note that in most cases it does not matter what values are used so long as all users use the same value. You can use ieboilstart like this:

       *Standardize settings across users
       ieboilstart, version(12.1)      //Set the version number to the oldest version used by anyone in the project team
       `r(version)'                    //This line is needed to actually set the version from the command above

Stata does not recommend any particular version, so you must specify this setting manually when using ieboilstart. We recommend using the oldest Stata version that anyone on your team will ever use. Once your project includes a randomization that is meant to be replicable, do not change the version setting; if you do, the randomization will no longer be replicable. Read the Stata help file for ieboilstart for a more detailed description of the command.
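
To illustrate why the version setting matters for randomization, here is a rough sketch of a replicable random assignment. It assumes ietoolkit is already installed, uses the auto dataset that ships with Stata as stand-in data, and the seed value and variable names are arbitrary choices for illustration:

```stata
* Fix the version first: random-number results can differ across versions
ieboilstart, version(12.1)
`r(version)'

* Load example data shipped with Stata (stand-in for project data)
sysuse auto, clear

* Sort on a variable that uniquely identifies observations so that the
* order of the data is the same every time the code runs
isid make, sort

* Set the seed (arbitrary example value) and draw the random numbers
set seed 650328
generate rand      = runiform()
generate treatment = (rand < 0.5)   // 1 = treatment, 0 = control
```

Running this twice with the same version, sort order, and seed should give the same assignment; changing any one of the three can change it.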

Root Folder Globals

Collaborators on a project likely have slightly different file paths to shared project folders. The root folder globals indicate where each user stores the project folder on his/her computer. This allows multiple users to run the same do-files by making only a minor modification in the master do-file. In the code below, the global user is set to 1, meaning that Stata will use Ann's folder location. If John would like to run the code, he would change the user number to 2. If all file references in all do-files use these globals, John can now run all code. If a third user wants to run the same code, that user would add the same information and identify as user number 3.

   *User Number:
   * Ann          1 
   * John         2
   * Add more users here as needed

   *Set this value to the user currently using this file
   global user  1

   * Root folder globals
   * ---------------------
   if $user == 1 {
       global projectfolder "C:/Users/AnnDoe/Dropbox/Project ABC"
   }
   if $user == 2 {
       global projectfolder  "C:/Users/JohnSmith/Dropbox/Project ABC"
   }

You can modify this code so that Stata automatically detects which user is running the code, thereby eliminating the need for any manual change. To do this, use Stata's built-in system value c(username), which returns the username of the account that the current user is logged in to on the operating system (e.g. Windows). Then, in the above code, change if $user == 1 to if c(username) == "username" for each user, replacing "username" with that user's actual username. Note that you must still add new users manually.
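
A minimal sketch of the automatic version follows. The usernames "AnnDoe" and "JohnSmith" are assumed for illustration, and the final check is an optional addition rather than something iefolder generates:

```stata
* Root folder globals, selected from the operating-system username
* NOTE: "AnnDoe" and "JohnSmith" are assumed usernames for illustration
if c(username) == "AnnDoe" {
    global projectfolder "C:/Users/AnnDoe/Dropbox/Project ABC"
}
if c(username) == "JohnSmith" {
    global projectfolder "C:/Users/JohnSmith/Dropbox/Project ABC"
}

* Optional: stop with a message if no root folder has been defined
* (this check does not catch a global left over from an earlier run
*  in the same Stata session)
if "$projectfolder" == "" {
    display as error "No root folder defined for user `c(username)'"
    exit 198
}
```

To find the value to use for a new user, that user can type display c(username) in the Stata command window.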

Project Folder Globals

As the number of folders grows, it becomes more and more convenient to have globals that point to project sub-folders. iefolder automatically creates these globals for any folders it generates, placing globals to the main folders in the project master do-file and placing globals to round folders in the round master do-files.

   * Project folder globals
   * ---------------------
   global dataWorkFolder         "$projectfolder/DataWork"
   global baseline               "$dataWorkFolder/Baseline"
   global endline                "$dataWorkFolder/Endline"

Units and Assumptions

Storing units, conversion rates, and numeric assumptions as globals in the master do-file ensures consistency, accuracy and code conciseness. If you are using iefolder, a separate file exists so that the exact same global definitions can be accessed from both project and round master do-files. In an iefolder master do-file, the global set-up file is referenced like this:

    do "$dataWorkFolder/global_setup.do" 

Below follow some of the most common and useful pieces of information to store as globals:

Conversion Rates

Globals can be used to standardize conversion rates (e.g. length, weight, volume, exchange rates). For example, if you need to convert amounts between currencies in your code, you can store the conversion rate in a global and reference it each time you convert an amount.
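
As a small illustration, the sketch below defines two conversion globals and applies them to toy data. The global names, the exchange rate, and the variables are hypothetical; only the kilograms-per-pound factor is a fixed definition:

```stata
* Conversion rates
global usd_per_lcu  0.0125        // ASSUMED exchange rate, for illustration only
global kg_per_lb    0.45359237    // kilograms per pound (exact by definition)

* Toy data standing in for a project dataset
clear
set obs 3
generate income_lcu = 1000 * _n   // income in local currency units
generate weight_lb  = 10 * _n     // weight in pounds

* Convert using the globals instead of hard-coded numbers
generate income_usd = income_lcu * $usd_per_lcu
generate weight_kg  = weight_lb  * $kg_per_lb
```

If the exchange rate changes, only the global definition needs to be updated, and every conversion in the project picks up the new value.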

Control Variables

If a project repeatedly uses a set of control variables, you can store them in a global for brevity, consistency, and convenience during analysis.
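
For instance, the following sketch stores a set of controls in a global and reuses it in two regressions. It uses the auto dataset that ships with Stata; the global name controls and the choice of variables are purely illustrative:

```stata
* Load example data shipped with Stata
sysuse auto, clear

* Define the set of control variables once (illustrative choice)
global controls "weight length foreign"

* Reuse the same controls in every specification
regress price mpg          $controls
regress price mpg headroom $controls
```

Adding or dropping a control then requires a single edit to the global rather than a change to every regression command.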

Sub Master Do-file(s)

At this point, all settings and globals are set so that the code runs identically for all users with little effort. The only thing left in a master do-file is to run the actual code. A project master do-file runs the round master do-files (e.g. baseline, endline); a round master do-file runs round-specific, high-level task master do-files (e.g. import, cleaning, analysis); and each round-specific, high-level task master do-file runs the do-files that complete the parts of that task.

A project master do-file may employ the following code. The if (0) condition allows you to decide which round master do-files to run, as running them all every time may be tedious, time-consuming, or unnecessary.

   if (0) { //Change the 0 to 1 to run the baseline master dofile
       do "$baseline/baseline_MasterDofile.do" 
   }
   if (0) { //Change the 0 to 1 to run the endline master dofile
       do "$endline/endline_MasterDofile.do" 
   }

In iefolder, a round master do-file would look like this:

   local importDo       0
   local cleaningDo     0

   if (`importDo' == 1) { //Change the local above to run or not to run this file
       do "$baseline_doImp/baseline_import_MasterDofile.do" 
   }
   if (`cleaningDo' == 1) { //Change the local above to run or not to run this file
       do "$baseline_do/baseline_cleaning_MasterDofile.do" 
   }

Back to Parent

This article is part of the topic Data Management

Additional Resources