File path
Files are pieces of information stored in a computer's hard drives. To be able to retrieve files after creating them, users need to specify exactly in which part of which hard drive the information was stored. This is done through file paths, which are nothing more than a way to organize files in a machine that is also understandable for humans.
The main issue that file paths present to research reproducibility is that they are specific to each machine. So when a researcher writes code to load a file, the path used to retrieve that file on their computer will be different from the one that another researcher needs to use to load the same file on another computer. This often results in code that is not transferrable across machines or users. This article discusses a few options to get around this issue and ensure basic computational reproducibility.
Read First
- File paths are a way to refer to files stored inside a file system.
- Users can choose to refer to files by their absolute or relative file paths.
- A common reproducibility issue is caused by file paths that are written into code in a non-transferrable manner.
- Employing good coding practices for file paths helps make code transferrable across machines and users in any software or programming language.
Overview
File systems store files using a hierarchical structure, where multiple files can be grouped into one directory (commonly also known as a folder), and directories can be grouped into other directories. The starting point of the directory hierarchy is called the root. On Windows, each drive has its own root, usually represented by a capital letter followed by a colon (e.g. C:/ or G:/); on Mac OS and Linux, the root is represented by a single forward slash (/). In a textual representation accepted by Mac OS, Linux and Windows, file paths are written as directory names separated by the forward slash ("/"), where the directory that follows a forward slash is a subdirectory of the one that precedes it.
Absolute file paths point to a directory or file by explicitly mentioning all its parent directories, starting from the root. For example, on Windows, the Documents folder is typically stored on a hard drive called C:, inside a user-specific directory, and can usually be referenced by the file path C:/Users/username/Documents.
When a program is launched on a computer, it is usually associated with a working directory. That means it will look for files inside that directory. Relative file paths refer to files relative to the current working directory of a program. For example, if the working directory is the Documents folder, a project folder called My Project stored inside the Documents folder would be referred to simply as My Project, while a folder called DataWork stored inside the My Project folder would be referred to as My Project/DataWork.
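As a minimal sketch of this distinction, the base R snippet below shows how a relative path relates to the working directory of the current session. The folder names come from the example above; the folders are not assumed to exist, since the snippet only builds the path strings.

```r
# Working directory of the current R session
wd <- getwd()

# Relative path, using the folder names from the example above
relative_path <- file.path("My Project", "DataWork")

# The absolute path the session would look up is the working directory
# followed by the relative path
absolute_path <- file.path(wd, relative_path)

print(relative_path)
print(absolute_path)
```

The first value printed is the same on every machine, while the second depends on where the session was started.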
Coding transferrable file paths
The instructions below demonstrate how to set up projects with transferrable file paths in different software. They take as an example a user who is writing code for a project's data work. The relevant files are stored in a directory called DataWork that contains the following subdirectories and files:
DataWork
|__Data
   |__raw.csv
   |__clean.dta
   |__final.dta
|__Code
   |__cleaning.do
   |__analysis.do
|__Output
   |__summary-stats.tex
   |__balance-table.tex
   |__coefplot.png
|__Documentation
Absolute file paths
Absolute file paths in R
1. Identify your computer's user name
# On a Windows computer
Sys.getenv("USERNAME")
# On a Mac computer
Sys.getenv("USER")
2. Copy the returned string and use it in an if statement in your main script, as in the if conditions of the code chunk below.
if (Sys.getenv("USERNAME") == "user1") {  # On a Windows computer
  code <- "C:/Users/user1/Documents/GitHub/repository-name/Code"
  data <- "C:/Users/user1/Box/project-folder/Data"
  docs <- "G:/Shared drives/Team Drive/project-folder/Documentation"
}
if (Sys.getenv("USER") == "username2") {  # On a Mac computer
  code <- "/Users/username2/GitHub/repository-name/Code"
  data <- "/Users/username2/Library/CloudStorage/Box-Box/project-folder/Data"
  docs <- "/Users/username2/Library/CloudStorage/GoogleDrive-username2@gmail.com/Team Drive/project-folder/Documentation"
}
3. Find the file paths to the code, data, and documentation folders on your computer and replace the strings assigned to the objects code, data, and docs in the code chunk above.
4. Run the main script to create the objects defined by the code in R's memory.
5. Use the objects defined in your main script to refer to directories inside the function here(), as in the example below.
# Load data set (read_dta() is from the haven package, here() from the here package)
clean_data <- read_dta(here(data, "clean.dta"))
The base R function file.path() has a similar effect, but does not work in all the cases where here() does. [1]
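One limitation of separate if statements is that nothing happens when the user name is not recognized, so the path objects are silently left undefined. The sketch below is one possible way to guard against that using only base R. The helper name get_project_paths is hypothetical and not part of the workflow described above, and the user names and folders are the same placeholders used in the steps.

```r
# Hypothetical helper (an illustration, not part of the original workflow):
# returns the project folders for a known user name and stops with an error
# for any other user
get_project_paths <- function(user) {
  if (user == "user1") {
    # Placeholder paths for a Windows computer
    paths <- list(
      code = "C:/Users/user1/Documents/GitHub/repository-name/Code",
      data = "C:/Users/user1/Box/project-folder/Data"
    )
  } else if (user == "username2") {
    # Placeholder paths for a Mac computer
    paths <- list(
      code = "/Users/username2/GitHub/repository-name/Code",
      data = "/Users/username2/Library/CloudStorage/Box-Box/project-folder/Data"
    )
  } else {
    stop("No file paths defined for user '", user, "'. Add them to the main script.")
  }
  paths
}

# Build the path to a data file with the base R function file.path()
paths <- get_project_paths("user1")
clean_data_path <- file.path(paths$data, "clean.dta")
print(clean_data_path)
```

In a main script, the argument would typically be Sys.getenv("USERNAME") on Windows or Sys.getenv("USER") on a Mac rather than a hard-coded string.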
Absolute file paths in Stata
1. Identify your computer's user name
di c(username)
2. Copy the returned string and use it in an if statement in your main do-file, as in the if conditions of the code chunk below.
if c(username) == "user1" { // On a Windows computer
    global code "C:/Users/user1/Documents/GitHub/repository-name/Code"
    global data "C:/Users/user1/Box/project-folder/Data"
    global docs "G:/Shared drives/Team Drive/project-folder/Documentation"
}
else if c(username) == "username2" { // On a Mac computer
    global code "/Users/username2/GitHub/repository-name/Code"
    global data "/Users/username2/Library/CloudStorage/Box-Box/project-folder/Data"
    global docs "/Users/username2/Library/CloudStorage/GoogleDrive-username2@gmail.com/Team Drive/project-folder/Documentation"
}
3. Find the file paths to the code, data, and documentation folders on your computer and replace the strings that define the globals code, data, and docs in the code chunk above.
4. Run the main do-file to load the global macros defined by the code in Stata's memory.
5. Use the global macros defined in your main script to refer to files, as in the examples below.
* Load data set
use "${data}/clean.dta", clear
* Run do-file
do "${code}/analysis.do"
Note that for this workflow to work, global macros need to be used instead of local macros. This is because the file path macros are set only in the main do-file, but also need to be available to the other do-files that it runs.
Protecting file paths (advanced)
A concern with data security may be raised when absolute file paths are included in easily accessible code. Users who wish to avoid creating this additional vulnerability may prefer to set their file paths using R or Stata profiles instead of explicitly writing them in the code. These profiles contain code that is run every time the program is launched. Therefore, users can use them to create macros or objects that will always be available in any session.
To use this option in Stata, users will need to create a file called profile.do and store it in a directory that is listed in the path over which Stata searches for ado-files (run adopath in Stata to see the list of such directories). Adding the following lines to this do-file will make the global macros ${DROPBOX} and ${GITHUB} available to all Stata sessions on that computer:
global DROPBOX "C:/Users/user1/Dropbox"
global GITHUB "C:/Users/user1/Documents/GitHub"
To use this option in R, users will need to edit the file .Rprofile. This file can be found in three different places on your computer: the directory where R is installed, your computer's home directory, and the current R working directory [2]. Adding the following lines to this R script will make the file paths for the Dropbox and GitHub folders available through Sys.getenv("DROPBOX") and Sys.getenv("GITHUB"), respectively:
Sys.setenv(DROPBOX = "C:/Users/user1/Dropbox")
Sys.setenv(GITHUB = "C:/Users/user1/Documents/GitHub")
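Once these environment variables exist, project code can build file paths from them without containing any machine-specific string. The sketch below shows one way this might look in base R. The helper name get_env_path and the folder project-folder are hypothetical, and the Sys.setenv() call is included only so the snippet runs on its own; in practice that line would live in .Rprofile.

```r
# Hypothetical helper: read a folder location from an environment variable
# and stop with an error if the variable has not been set
get_env_path <- function(name) {
  value <- Sys.getenv(name, unset = NA)
  if (is.na(value) || value == "") {
    stop("Environment variable ", name, " is not set. Define it in .Rprofile.")
  }
  value
}

# For illustration only: in practice this line belongs in .Rprofile
Sys.setenv(DROPBOX = "C:/Users/user1/Dropbox")

# Build the path to a project file ("project-folder" is a placeholder)
data_path <- file.path(get_env_path("DROPBOX"), "project-folder", "Data", "clean.dta")
print(data_path)
```

Stopping when the variable is missing makes it obvious that a profile has not been set up on a new machine, instead of failing later with a confusing file-not-found error.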
Relative file paths
Relative file paths in R
1. Set the working directory to the desired project root folder by launching an R Project or opening any file inside that project and loading the package here. Once you do that, you will see a message saying here() starts at file/path. This means that the project working directory has been identified and set.
2. Refer to files by their relative file paths using the function here() and entering each subfolder as a separate argument:
raw_data <- read_csv( here( "Data", "raw.csv" ) )
Relative file paths in Stata
1. Set the working directory to the desired project root folder by opening a Stata project. Once you do this, you will see a message on Stata's console starting with projmanager and ending with the file path to the Stata project. The working directory will be set to the same directory where the Stata project is saved.
2. Refer to files by their path relative to the working directory
use "Data/clean.dta", clear
Pros and cons of absolute and relative file paths
Since absolute file paths indicate the complete file path to files starting from the root, they can easily be used in projects where not all files are stored in a common root directory. To do this in a reproducible manner, however, the different project folders need to be shared with all team members and synced to their computers to enable local access in the machines they are working on. In addition, for every new computer used to run the project, the file paths to the different folders must be explicitly set in the main script, as explained in the previous section.
To use relative file paths in a project, all the relevant project files need to be inside the same root directory, so that file paths can be spelled out using it as the starting point. If you are using multiple cloud storage applications for the same project, this can be done through the creation of directory junctions.
Changing working directories
A commonly used alternative to the suggested workflows is to use relative file paths in combination with changing the working directory. However, this practice is error-prone and not recommended. A change to the working directory persists for the rest of the program session: once the working directory is changed to a new location, all relative file paths in code that is subsequently run in that session are resolved from that location. Users often don't realize that a script has changed their working directory and continue to work as if it had not, which can break the code or save files to unintended locations.
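The base R sketch below illustrates the problem rather than a recommended workflow. It uses two stand-in folders created in a temporary directory, which is an assumption made only so that the snippet can run on any machine, and it restores the original working directory at the end.

```r
# Two stand-in folders in a temporary directory (assumption: used only so
# the example runs on any machine)
project_root <- file.path(tempdir(), "DataWork")
other_folder <- file.path(tempdir(), "SomewhereElse")
dir.create(project_root, showWarnings = FALSE)
dir.create(other_folder, showWarnings = FALSE)

original_wd <- getwd()

# With the project root as working directory, the relative path
# Output/summary-stats.tex points inside the project
setwd(project_root)
before <- file.path(getwd(), "Output", "summary-stats.tex")

# A script changes the working directory and never changes it back...
setwd(other_folder)

# ...so the same relative path now points to a different location
after <- file.path(getwd(), "Output", "summary-stats.tex")

print(before)
print(after)
identical(before, after)  # FALSE

# Restore the working directory that was active before the example
setwd(original_wd)
```

Any file written to Output/summary-stats.tex after the second setwd() call would end up outside the project folder, which is the kind of unintended location described above.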
Additional Resources
- Hadley Wickham (RStudio). R for Data Science - Workflow: projects
- Jenny Bryan (RStudio). Workflow vs script
- Julian Reif (University of Illinois). Stata Coding Guide - Setting up the environment