Difference between revisions of "Master Do-files"

Latest revision as of 13:48, 14 August 2023

The master do-file is the main do-file that calls upon and runs all the other do-files of a project. It plays a critical role throughout all stages of the research project and functions as a map to the data folder. This page outlines the components of a well-structured and replicable master do-file.

Read First

  • The command iefolder sets up the master do-file.
  • Anyone with the master do-file should be able to run do-files for all stages of research (cleaning, construction, analysis, exporting, etc.).
  • After each user changes the path global to the location where they store the project folder, any two people with the master do-file should be able to run it and get identical results.

Overview

A master do-file serves three main purposes:

  1. It compactly and reproducibly runs all do-files needed for data work. More specifically, in the DataWork folder structure, the master do-file runs all survey round master do-files, which in turn run all round-specific task-level do-files.
  2. It establishes an identical workspace between users by specifying settings, installing programs, and setting globals. Globals are named pieces of information that are defined in the do-file and stay in memory until the user exits Stata; they help ensure consistency, accuracy, and conciseness in code.
  3. It maps all files within the data folder and serves as the starting point to find any do-file, dataset or output.

See an example of a master do-file at https://github.com/worldbank/dime-data-handbook/blob/master/code/stata-master-dofile.do.

Components of a Master Do-file

Intro Header

At the very top of the master do-file, the intro header should clearly explain its purpose. It should provide any other important information, including but not limited to an outline of the do-file; the data files required to correctly run it; the data files created by the do-file; or the variable that uniquely identifies the unit of observation in the datasets. The intro header should be understandable to someone unfamiliar with the project.
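As an illustration, an intro header could be laid out as the comment block below. This is only a sketch: the project name, file names, and ID variable are hypothetical placeholders, not part of any iefolder template.

```stata
/*******************************************************************************
*  PROJECT:    Project ABC (hypothetical name)
*  PURPOSE:    Run all do-files, from importing raw data to exporting results
*
*  OUTLINE:    PART 1  Install packages and standardize settings
*              PART 2  Set root folder and project folder globals
*              PART 3  Run the round master do-files
*
*  REQUIRES:   Raw survey data for each round (hypothetical: baseline_raw.dta)
*  CREATES:    Cleaned datasets and analysis outputs for each round
*  ID VAR:     hhid (hypothetical variable that uniquely identifies households)
*******************************************************************************/
```

Because the whole header is a block comment, Stata skips it when the do-file runs; it exists only to orient a reader who is new to the project.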

Installation of ietoolkit and User Written Commands

Master do-files created by iefolder must include a line to install the package ietoolkit. After this line, you can install any other user-written commands needed for the project. Add the replace option to each installation command; this ensures that the latest version of each command, with updated functionality, is installed. Overall, this section will look something like this:

       *Install all packages that this project requires:
       ssc install ietoolkit, replace
       ssc install outreg2  , replace
       ssc install estout   , replace
       ssc install ivreg2   , replace

You may comment out this section once the commands are installed. However, for replicability, it is important that the master do-file always includes this section, whether commented out or not.
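If the list of packages grows, the same installations can be written as a loop. The sketch below is one possible alternative rather than the layout that iefolder generates; the local name packages is an arbitrary choice, and the commands need an internet connection to reach SSC.

```stata
* List every package the project requires in one place
local packages ietoolkit outreg2 estout ivreg2

* Install (or update) each package from SSC
foreach pkg of local packages {
    ssc install `pkg', replace
}
```

Adding a new dependency then only requires adding its name to the list.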

Settings

Stata allows the user to customize a wide range of settings: Stata version, memory settings, code interpretation settings, etc. If two users with different settings run the same code, the code could crash or yield different results. ieboilstart sets the settings to the values recommended by Stata, thus harmonizing settings across users. Note that in most cases it does not matter what values are used so long as all users use the same value. You can use ieboilstart like this:

       *Standardize settings across users
       ieboilstart, version(12.1)      //Set the version number to the oldest version used by anyone in the project team
       `r(version)'                    //This line is needed to actually set the version from the command above

Since Stata does not recommend any particular version, you must specify this setting manually when using ieboilstart. We recommend using the oldest Stata version that anyone on your team will ever use. Once you have done a randomization that is meant to be replicable in your project, you should not change the version setting. If you do, your randomization will no longer be replicable. Read the Stata help file for ieboilstart for a more detailed description of the command.
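To show why the version setting matters for randomization, here is a minimal sketch of a replicable random assignment. It assumes ietoolkit is installed, uses the auto dataset that ships with Stata as stand-in data, and uses a made-up seed; none of these choices come from a real project.

```stata
* Fix the version first so the random-number generator behaves the same for everyone
ieboilstart, version(12.1)
`r(version)'

sysuse auto, clear          // example dataset shipped with Stata
isid make, sort             // confirm make uniquely identifies rows, and sort on it
set seed 650284             // hypothetical seed; pick your own and never change it

* Assign the half of the observations with the lowest random draws to treatment
generate double random = runiform()
sort random
generate byte treatment = (_n <= _N/2)
```

The version, the sort order, and the seed together determine the draws, so changing any one of them after the assignment has been used would produce a different treatment group.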

Root Folder Globals

Collaborators on a project likely have slightly different file paths to shared project folders. The root folder globals indicate where each user stores the project folder on their computer. This allows multiple users to run the same do-files by making only a minor modification in the master do-file. In the code below, the global user is set to 1, meaning that Stata will use Ann's folder location. If John would like to run the code, he would change the user number to 2. If all file references in all do-files use these globals, John can now run all code. If a third user wants to run the same code, that user would add the same information and identify as user number 3.

   *User Number:
   * Ann          1 
   * John         2
   * Add more users here as needed

   *Set this value to the user currently using this file
   global user  1

   * Root folder globals
   * ---------------------
   if $user == 1 {
       global projectfolder "C:/Users/AnnDoe/Dropbox/Project ABC"
   }
   if $user == 2 {
       global projectfolder  "C:/Users/JohnSmith/Dropbox/Project ABC"
   }

You can modify this code so that Stata automatically detects which user is running the code, thereby eliminating the need for any manual change. To do this, use Stata's built-in system value c(username), which returns the username under which the user is logged in to their operating system (e.g. Windows). Then, in the above code, change if $user == 1 to if c(username) == "username" for each user, replacing "username" with that user's actual username. Note that you must still add new users manually.
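A minimal sketch of the automatic version follows. It assumes that the operating-system usernames are AnnDoe and JohnSmith, which is only a guess based on the example paths above; the final check for unlisted users is an optional addition, not something iefolder generates.

```stata
* Clear any value left over from earlier in the session
global projectfolder ""

* Root folder globals, selected by operating-system username
if c(username) == "AnnDoe" {
    global projectfolder "C:/Users/AnnDoe/Dropbox/Project ABC"
}
if c(username) == "JohnSmith" {
    global projectfolder "C:/Users/JohnSmith/Dropbox/Project ABC"
}

* Stop with a clear message if the current user is not listed above
if "$projectfolder" == "" {
    display as error "No root folder defined for user `c(username)'. Add one above."
    exit 198
}
```

Stopping early gives a new team member a readable error instead of a long chain of file-not-found messages further down.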

Project Folder Globals

As the number of folders grows, it becomes more and more convenient to have globals that point to project sub-folders. iefolder automatically creates these globals for any folders it generates, placing globals to the main folders in the project master do-file and placing globals to round folders in the round master do-files.

   * Project folder globals
   * ---------------------
   global dataWorkFolder         "$projectfolder/DataWork"
   global baseline               "$dataWorkFolder/Baseline"
   global endline                "$dataWorkFolder/Endline"

Units and Assumptions

Storing units, conversion rates, and numeric assumptions as globals in the master do-file ensures consistency, accuracy, and code conciseness. If you are using iefolder, a separate file exists so that the exact same global definitions can be accessed from both the project and round master do-files. In an iefolder master do-file, the global set-up file is referenced like this:

    do "$dataWorkFolder/global_setup.do" 

Below follow some of the most common and useful pieces of information to store as globals:

Conversion Rates

Globals can also be used to standardize conversion rates (e.g. length, weight, volume, exchange rates). For example, if you need to convert amounts between currencies in your code, you can store the conversion rate in a global and reference it each time you convert an amount.
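The sketch below illustrates the idea with two conversion globals and a tiny made-up dataset. The global names, the variable names, and the exchange rate are all hypothetical; only the acre-to-hectare factor is a real constant.

```stata
* Conversion factors, defined once
global hectares_per_acre 0.404686   // 1 acre is about 0.404686 hectares
global usd_per_lcu       0.01       // HYPOTHETICAL exchange rate, for illustration only

* Tiny made-up dataset
clear
input plot_id area_acres harvest_value_lcu
1 2.50 12000
2 0.75  4500
end

* Every conversion references the globals instead of hard-coded numbers
generate area_ha           = area_acres * $hectares_per_acre
generate harvest_value_usd = harvest_value_lcu * $usd_per_lcu
list
```

If the exchange rate is later revised, only the global definition changes and every do-file that references it picks up the new value.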

Control Variables

If a project repeatedly uses a set of control variables, you can store them in a global for brevity, consistency, and convenience during analysis.
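For instance, a set of controls could be stored and reused as in the following sketch, which uses the auto dataset that ships with Stata. The global name controls and the choice of variables are purely illustrative.

```stata
* Define the control variables once
global controls "mpg weight foreign"

sysuse auto, clear

* Reuse the same controls across specifications
regress price headroom $controls
regress price trunk    $controls
```

Changing the set of controls for every regression then means editing a single line.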

Sub Master Do-file(s)

At this point, all settings and globals are set so that the code runs identically for all users with little effort. The only thing left in a master do-file is to run the actual code. A project master do-file runs the round master do-files (e.g. baseline, endline); a round master do-file runs round-specific, high-level task master do-files (e.g. import, cleaning, analysis); and each round-specific, high-level task master do-file runs the do-files that complete the parts of that high-level task.

A project master do-file may employ the following code. Wrapping each call in if (0) allows you to decide which round master do-files to run, as running them all every time may be tedious, time-consuming, or unnecessary.

   if (0) { //Change the 0 to 1 to run the baseline master dofile
       do "$baseline/baseline_MasterDofile.do" 
   }
   if (0) { //Change the 0 to 1 to run the endline master dofile
       do "$endline/endline_MasterDofile.do" 
   }

A round master do-file created by iefolder would look like this:

   local importDo       0
   local cleaningDo     0

   if (`importDo' == 1) { //Change the local above to run or not to run this file
       do "$baseline_doImp/baseline_import_MasterDofile.do" 
   }
   if (`cleaningDo' == 1) { //Change the local above to run or not to run this file
       do "$baseline_do/baseline_cleaning_MasterDofile.do" 
   }

Back to Parent

This article is part of the topic Data Management

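Additional Resources

  • DIME Analytics' guidelines on iefolder: https://github.com/worldbank/DIME-Resources/blob/master/welcome-iefolder.pdf
  • DIME Analytics' Data Management and Cleaning: https://github.com/worldbank/DIME-Resources/blob/master/stata1-3-cleaning.pdf
  • DIME Analytics' Data Management for Reproducible Research: https://github.com/worldbank/DIME-Resources/blob/master/stata2-3-data.pdf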