Difference between revisions of "File path"

Revision as of 21:19, 7 November 2022

Files are pieces of information stored in a computer's hard drives. To be able to retrieve files after creating them, users need to specify exactly in which part of which hard drive the information was stored. This is done through file paths, which are nothing more than a way to organize files in a machine that is also understandable for humans.

The main issue that file paths present to research reproducibility is that they are specific to each machine. When a researcher writes code to load a file, the path used to retrieve that file on their computer will differ from the one that another researcher needs to load the same file on another computer. As a result, code is often not transferrable across machines or users. This article discusses a few options to get around this issue and ensure basic computational reproducibility.

Read First

  • File paths are a way to refer to files stored inside a file system.
  • Users can choose to refer to files by their absolute or relative file paths.
  • A common reproducibility issue is caused by file paths that are written into code in a non-transferrable manner.
  • Employing good coding practices makes it possible to write transferrable file paths, and therefore more reproducible code, in any software or programming language.

Overview

File systems store files using a hierarchical structure, where multiple files can be grouped into one directory (also commonly known as a folder), and directories can be grouped into other directories. The starting point of the directory hierarchy is called the root. On Windows, each drive has its own root, represented by a capital letter followed by a colon (e.g. C:/ or G:/); on Mac OS and Linux, there is a single root, represented by a forward slash (/). File paths are written as directory names separated by a slash, where the directory that follows a slash is a subdirectory of the one that precedes it. Mac OS and Linux use the forward slash ("/"); Windows uses the backslash ("\") by default but generally also accepts forward slashes, and the examples in this article use forward slashes throughout.

Absolute file paths point to a directory or file by explicitly listing all of its parent directories, starting from the root. For example, on Windows, the Documents folder is typically stored on a hard drive called C:, inside a user-specific directory, and can usually be referenced by the file path C:/Users/username/Documents.

When a program is launched on a computer, it is usually associated with a working directory. That means it will look for files inside that directory. Relative file paths refer to files relative to the program's current working directory. For example, if the working directory is the Documents folder, a project folder called My Project stored inside the Documents folder would be referred to simply as My Project, while a folder called DataWork stored inside the My Project folder would be referred to as My Project/DataWork.
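As a minimal sketch of this distinction, the base R snippet below builds the relative path from the example and then spells out the absolute path it corresponds to. The temporary directory standing in for the Documents folder is an assumption made only so that the snippet can run on any machine.

```r
# Stand-in for the Documents folder (hypothetical location inside a temp directory)
documents <- file.path(tempdir(), "Documents")
dir.create(
  file.path(documents, "My Project", "DataWork"),
  recursive = TRUE,
  showWarnings = FALSE
)

# Relative path: only meaningful with respect to a starting directory
relative_path <- file.path("My Project", "DataWork")

# Absolute path: the same location spelled out starting from the root
absolute_path <- normalizePath(
  file.path(documents, relative_path),
  winslash = "/"
)

print(relative_path)  # "My Project/DataWork"
print(absolute_path)  # starts at the root; the exact value depends on the machine
```

The relative path is identical on every computer, while the absolute path changes with the location of the temporary directory, which is the behavior the rest of this article works around.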

Coding transferrable file paths

The instructions below demonstrate how to set up projects with transferrable file paths in different software. They take as an example a user who is writing code for a project's data work. The relevant files are stored in a directory called DataWork that contains the following subdirectories and files:

  DataWork
  |__Data
     |__raw.csv
     |__clean.dta
     |__final.dta
  |__Code
     |__cleaning.do
     |__analysis.do
  |__Output
     |__summary-stats.tex
     |__balance-table.tex
     |__coefplot.png
  |__Documentation

Absolute file paths in R

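1. Identify your computer's user name or computer name. The environment variables below are set on Windows; on Mac and Linux, the user name is stored in USER instead.

 Sys.getenv("USERNAME")
 Sys.getenv("COMPUTERNAME")

2. Copy the returned string and use it in an if statement in your main script, as in the first and sixth lines of code (not counting comments) in the code chunk below.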

 # On a Windows computer
 if (Sys.getenv("USERNAME") == "user1") {
   code <- "C:/Users/user1/Documents/GitHub/repository-name/Code"
   data <- "C:/Users/user1/Box/project-folder/Data"
   docs <- "G:/Shared drives/Team Drive/project-folder/Documentation"
 }
 # On a Mac computer
 if (Sys.getenv("USER") == "username2") {
   code <- "/Users/username2/GitHub/repository-name/Code"
   data <- "/Users/username2/Library/CloudStorage/Box-Box/project-folder/Data"
   docs <- "/Users/username2/Library/CloudStorage/GoogleDrive-username2@gmail.com/Team Drive/project-folder/Documentation"
 }

3. Find the file paths to the code, data, and documentation folders on your computer and replace the strings that define the objects' contents, as in lines 2-4 and 7-9 of the code chunk above (not counting comments).

4. Run the main script to create the objects defined by the code in R's memory.

5. Use the objects defined in your main script to refer to directories inside the function here, as in the examples below.

 # Load the packages that provide read_dta() and here()
 library(haven)
 library(here)
 # Load data set
 clean_data <-
   read_dta(
     here(
       data,
       "clean.dta"
     )
   )

The base R function file.path has a similar effect, but does not work in all the cases where here does. [1]
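One gap in the if-statement pattern above is that nothing happens when the script runs under a user name that has not been added yet: the objects are never created, and the error only shows up later. The sketch below is one possible way to fail early, using only base R and file.path. The project_paths list and the get_paths() helper are hypothetical names introduced for illustration, not part of the workflow described above.

```r
# Hypothetical lookup table: one entry per team member, keyed by user name
project_paths <- list(
  user1 = list(
    code = "C:/Users/user1/Documents/GitHub/repository-name/Code",
    data = "C:/Users/user1/Box/project-folder/Data"
  ),
  username2 = list(
    code = "/Users/username2/GitHub/repository-name/Code",
    data = "/Users/username2/Library/CloudStorage/Box-Box/project-folder/Data"
  )
)

# Return the folder paths for a user, or stop with a clear message
get_paths <- function(user, paths = project_paths) {
  if (!user %in% names(paths)) {
    stop("No file paths defined for user '", user, "'. Add them to the main script.")
  }
  paths[[user]]
}

# In a real main script, the argument could be Sys.info()[["user"]],
# which does not depend on the USERNAME or USER environment variables
paths <- get_paths("user1")

# file.path() joins folder and file names with "/"
clean_file <- file.path(paths$data, "clean.dta")
print(clean_file)  # "C:/Users/user1/Box/project-folder/Data/clean.dta"
```

Keeping all user-specific paths in one list also means that adding a new team member only requires a new list entry rather than a new if block.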

Absolute file paths in Stata

1. Identify your computer's user name or host name

 di c(username)
 di c(hostname)

2. Copy the returned string and use it in an if statement in your main do-file, as in the first line of code in the code chunk below.

 * On a Windows computer
 if c(username) == "user1" {
   global code "C:/Users/user1/Documents/GitHub/repository-name/Code"
   global data "C:/Users/user1/Box/project-folder/Data"
   global docs "G:/Shared drives/Team Drive/project-folder/Documentation"
 }
 * On a Mac computer
 else if c(username) == "username2" {
   global code "/Users/username2/GitHub/repository-name/Code"
   global data "/Users/username2/Library/CloudStorage/Box-Box/project-folder/Data"
   global docs "/Users/username2/Library/CloudStorage/GoogleDrive-username2@gmail.com/Team Drive/project-folder/Documentation"
 }

3. Find the file paths to the code, data, and documentation folders on your computer and replace the strings defined by the globals, as in lines 2-4 of the code chunk above (not counting comments).

4. Run the main do-file to load the global macros defined by the code in Stata's memory.

5. Use the global macros defined in your main do-file to refer to files, as in the examples below.

 * Load data set
 use "${data}/clean.dta", clear
 * Run do-file
 do  "${code}/analysis.do"

Note that for this workflow to work, global macros need to be used instead of local macros. This is because the file path macros are set only in the main do-file, but also need to be used by other do-files, and local macros are not visible outside the do-file that defines them.

Relative file paths in R

1. Set the working directory to the desired project root folder by launching an R Project or opening any file inside that project, and then load the here package. Once you do that, you will see a message saying here() starts at file/path. This means that the project root directory has been identified and set.

2. Refer to files by their relative file paths using the function here and entering each subfolder as a separate argument:

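 # Load the packages that provide read_csv() and here()
 library(readr)
 library(here)
 # Load raw data using a path relative to the project root
 raw_data <-
   read_csv(
     here(
       "Data",
       "raw.csv"
     )
   )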
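The here package locates the project root by looking for marker files in the working directory and its parent folders. The sketch below imitates that idea in base R with a single marker; it is a simplified illustration, not the package's implementation. The helper find_project_root() is a hypothetical name, and the example assumes that an empty file named .here has been placed in the project root folder.

```r
# Hypothetical helper: walk up from `start` until a folder containing `marker` is found
find_project_root <- function(start = getwd(), marker = ".here") {
  current <- normalizePath(start, winslash = "/")
  repeat {
    if (file.exists(file.path(current, marker))) return(current)
    parent <- dirname(current)
    # Reaching the top of the file system means no marker was found
    if (parent == current) stop("No project root found above ", start)
    current <- parent
  }
}

# Demo in a temporary folder standing in for the DataWork project
root <- file.path(tempdir(), "DataWork")
dir.create(file.path(root, "Code"), recursive = TRUE, showWarnings = FALSE)
file.create(file.path(root, ".here"))

# Starting the search from a subfolder still resolves to the project root
found <- find_project_root(file.path(root, "Code"))
raw_file <- file.path(found, "Data", "raw.csv")
print(raw_file)
```

Because the root is found by searching rather than by typing a machine-specific path, the same script can run unchanged wherever the project folder is copied.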

Relative file paths in Stata

1. Set the working directory to the desired project root folder, either by opening the project's main do-file to launch Stata or by opening a Stata project.

2. Refer to files by their path relative to the working directory:

 use "Data/clean.dta", clear

Pros and cons of absolute and relative file paths

Since absolute file paths indicate the complete path to a file starting from the root, they can easily be used in projects where not all files are stored in a common root directory. To do this in a reproducible manner, however, the different project folders need to be shared with all team members and synced to their computers, so that the files can be accessed locally on the machines they are working on. In addition, for every new computer used to run the project, the file paths to the different folders must be explicitly set in the main script, as explained in the previous section.

To use relative file paths in a project, all the relevant project files need to be inside the same root directory, so that file paths can be spelled out using it as the starting point. If you are using multiple cloud storage applications for the same project, this can be done through the creation of directory junctions. In Stata, the Project Manager also allows users to connect multiple drives to one project.

Changing working directories

A commonly used alternative to the suggested workflows is to use relative file paths in combination with changing the working directory. However, this practice is error-prone and not recommended. A change to the working directory persists for the rest of the program session. This means that once the working directory is changed to a new location, that location becomes the starting point for every relative file path in all code that is subsequently run in that session. Users often don't realize that a script has changed their working directory and continue to use programs as if that did not happen, which can break the code or save files to unintended locations.
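The following base R sketch illustrates the problem under simple assumptions: a temporary folder stands in for the DataWork project, and the second setwd() call plays the role of a script that changes the working directory without changing it back.

```r
# Temporary folders standing in for a project (hypothetical layout)
project <- file.path(tempdir(), "DataWork")
dir.create(file.path(project, "Output"), recursive = TRUE, showWarnings = FALSE)

original_wd <- getwd()
setwd(project)

# The relative path "Output" points inside the project folder, as intended
before <- file.path(getwd(), "Output")

# A script changes the working directory and does not change it back
setwd("Output")

# The same relative path now points to DataWork/Output/Output
after <- file.path(getwd(), "Output")

print(before)
print(after)

# Restore the original working directory so the demo leaves no side effects
setwd(original_wd)
```

The relative path is written identically in both cases, but it refers to two different locations, which is how files end up being saved where nobody expects them.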

Additional Resources