File path

Latest revision as of 22:08, 7 November 2022

Files are pieces of information stored in a computer's hard drives. To be able to retrieve files after creating them, users need to specify exactly in which part of which hard drive the information was stored. This is done through file paths, which describe where files are located on a machine in a way that humans can also read.

The main issue that file paths present to research reproducibility is that they are specific to each machine. So when a researcher writes code to load a file, the path used to retrieve that file on their computer will be different from the one that another researcher needs to use to load the same file on another computer. This often results in code not being transferrable across machines or users. This article discusses a few options to get around this issue and ensure basic computational reproducibility.

Read First

  • File paths are a way to refer to files stored inside a file system.
  • Users can choose to refer to files by their absolute or relative file paths.
  • A common reproducibility issue is caused by file paths that are written into code in a non-transferrable manner.
  • Employing good coding practices for file paths makes code easier to reproduce in any software or programming language.

Overview

File systems store files using a hierarchical structure, where multiple files can be grouped into one directory (also commonly known as a folder), and directories can be grouped into other directories. The starting point of a directory hierarchy is called the root. On Windows, each drive has its own root, usually represented by a capital letter followed by a colon (e.g. C:/ or G:/). On Mac OS and Linux, there is a single root, represented by a forward slash (/). File paths are written as directory names separated by the forward slash ("/"), where the directory that follows a forward slash is a subdirectory of the one that precedes it. Windows natively uses the backslash ("\") as a separator, but R and Stata accept forward slashes on all three operating systems, so forward slashes are used throughout this article.

Absolute file paths point to a directory or file by explicitly listing all of its parent directories, starting from the root. For example, on Windows, the Documents folder is typically stored on a hard drive called C:, inside a user-specific directory, and can usually be referenced by the file path C:/Users/username/Documents.

When a program is launched on a computer, it is usually associated with a working directory. That means it will look for files inside that directory. Relative file paths refer to files relative to the current working directory of a program. For example, if the working directory is the Documents folder, a project folder called My Project stored inside the Documents folder would be referred to simply as My Project, while a folder called DataWork stored inside the My Project folder would be referred to as My Project/DataWork.
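
The relationship between the two kinds of paths can be sketched in base R. This is only an illustration: the helper is_absolute is a hypothetical function written for this example, not part of R, and it only recognizes the two root styles described above.

```r
# Hypothetical helper: TRUE if a path starts at a root ("/" or a drive letter)
is_absolute <- function(path) {
  grepl("^(/|[A-Za-z]:[/\\\\])", path)
}

# The working directory of the current R session
working_dir <- getwd()

# A relative path, as in the example from the text
relative_path <- file.path("My Project", "DataWork")

# The program resolves the relative path by starting from the working directory
absolute_path <- file.path(working_dir, relative_path)

print(relative_path)
print(is_absolute(relative_path))
print(is_absolute("C:/Users/username/Documents"))
```

The same relative path therefore points to different locations depending on the working directory, while an absolute path always points to the same location on a given machine.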

Coding transferrable file paths

The instructions below demonstrate how to set up projects with transferrable file paths in different software. They take as an example a user who is writing code for a project's data work. The relevant files are stored in a directory called DataWork that contains the following subdirectories and files:

  DataWork
  |__Data
     |__raw.csv
     |__clean.dta
     |__final.dta
  |__Code
     |__cleaning.do
     |__analysis.do
  |__Output
     |__summary-stats.tex
     |__balance-table.tex
     |__coefplot.png
  |__Documentation

Absolute file paths

Absolute file paths in R

1. Identify your computer's user name

 # On a Windows computer
 Sys.getenv("USERNAME")
 # On a Mac computer
 Sys.getenv("USER")

2. Copy the returned string and use it in an if statement in your main script, as in the two if conditions of the code chunk below.

 # On a Windows computer
 if (Sys.getenv("USERNAME") == "user1") {
   code <- "C:/Users/user1/Documents/GitHub/repository-name/Code"
   data <- "C:/Users/user1/Box/project-folder/Data"
   docs <- "G:/Shared drives/Team Drive/project-folder/Documentation"
 }
 # On a Mac computer
 if (Sys.getenv("USER") == "username2") {
   code <- "/Users/username2/GitHub/repository-name/Code"
   data <- "/Users/username2/Library/CloudStorage/Box-Box/project-folder/Data"
   docs <- "/Users/username2/Library/CloudStorage/GoogleDrive-username2@gmail.com/Team Drive/project-folder/Documentation"
 }

3. Find the file paths to the code, data, and documentation folders on your computer and replace the strings that define the objects' contents, as in the assignments to code, data, and docs in the code chunk above.

4. Run the main script to create the objects defined by the code in R's memory.

5. Use the objects defined in your main script to refer to directories inside the function here, as in the example below.

 # Load data set
 library(haven) # provides read_dta()
 library(here)  # provides here()
 clean_data <- 
   read_dta(
     here(
       data,
       "clean.dta"
     )
   )

The base R function file.path has a similar effect, but does not work in all the cases where here does. [1]
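
One possible variation on the steps above, sketched here in base R, is to stop with an explicit message when the script is run by a user whose folders have not been registered yet. The function get_project_paths is a hypothetical helper written for this example, and the user names and folders are the placeholder values from the code chunks above.

```r
# Hypothetical helper: return the folders registered for a given user name.
# The user names and folders are the placeholder values used in this article.
get_project_paths <- function(user) {
  known_users <- list(
    user1 = list(
      code = "C:/Users/user1/Documents/GitHub/repository-name/Code",
      data = "C:/Users/user1/Box/project-folder/Data"
    ),
    username2 = list(
      code = "/Users/username2/GitHub/repository-name/Code",
      data = "/Users/username2/Library/CloudStorage/Box-Box/project-folder/Data"
    )
  )
  # Fail early, with a clear message, for users that were never registered
  if (!user %in% names(known_users)) {
    stop("No file paths are registered for user '", user, "'.")
  }
  known_users[[user]]
}

# USERNAME is set on Windows; USER is used as a fallback on a Mac
current_user <- Sys.getenv("USERNAME", unset = Sys.getenv("USER"))

# Example call with one of the placeholder user names
paths <- get_project_paths("user1")
print(paths$data)
```

In a main script, the call would be get_project_paths(current_user), so that a new team member sees the error message instead of an unexplained "object not found" error later in the code.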

Absolute file paths in Stata

1. Identify your computer's user name

 di c(username)

2. Copy the returned string and use it in an if statement in your main do-file, as in the if and else if conditions of the code chunk below.

 * On a Windows computer
 if c(username) == "user1" {
   global code "C:/Users/user1/Documents/GitHub/repository-name/Code"
   global data "C:/Users/user1/Box/project-folder/Data"
   global docs "G:/Shared drives/Team Drive/project-folder/Documentation"
 }
 * On a Mac computer
 else if c(username) == "username2" {
   global code "/Users/username2/GitHub/repository-name/Code"
   global data "/Users/username2/Library/CloudStorage/Box-Box/project-folder/Data"
   global docs "/Users/username2/Library/CloudStorage/GoogleDrive-username2@gmail.com/Team Drive/project-folder/Documentation"
 }

3. Find the file paths to the code, data, and documentation folders on your computer and replace the strings defined by the globals, as in the three global commands inside each block of the code chunk above.

4. Run the main do-file to load the global macros defined by the code in Stata's memory.

5. Use the global macros defined in your main do-file to refer to files, as in the examples below.

 * Load data set
 use "${data}/clean.dta", clear
 * Run do-file
 do  "${code}/analysis.do"

Note that for this workflow to work, global macros need to be used instead of local macros. This is because the file path macros are set only in the main do-file, but also need to be used by other do-files, and local macros are only available inside the do-file where they are defined.

Protecting file paths (advanced)

Including absolute file paths in easily accessible code may raise data security concerns, since these paths reveal user names and folder structures. Users who wish to avoid creating this additional vulnerability may prefer to set their file paths using R or Stata profiles instead of explicitly writing them in the code. These profiles contain code that is run every time the program is launched. Therefore, users can use them to create macros or objects that will always be available in any session.

To use this option in Stata, users will need to create a file called profile.do and store it in a directory that is listed in the path over which Stata searches for ado-files (run adopath in Stata to see the list of such directories). Adding the following lines to this do-file will make the global macros ${DROPBOX} and ${GITHUB} available to all Stata sessions on that computer:

   global DROPBOX "C:/Users/user1/Dropbox"
   global GITHUB  "C:/Users/user1/Documents/GitHub"

To use this option in R, users will need to edit the file .Rprofile. This file can be found in three different places on your computer: the directory where R is installed, your computer's home directory, and the current R working directory [2]. Adding the following lines to this R script will make the file paths for the Dropbox and GitHub folders available through Sys.getenv("DROPBOX") and Sys.getenv("GITHUB"), respectively:

   Sys.setenv(DROPBOX = "C:/Users/user1/Dropbox")
   Sys.setenv(GITHUB = "C:/Users/user1/Documents/GitHub")
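
Once the environment variables are defined, a main script can build its file paths from them without mentioning any user-specific folder. The sketch below is one way to do this in base R. The function get_root is a hypothetical helper written for this example, and the call to Sys.setenv is included only so that the example runs on its own; in practice that line would live in .Rprofile.

```r
# Hypothetical helper: read a folder location from an environment variable
# and stop with a clear message if the variable has not been defined
get_root <- function(variable) {
  root <- Sys.getenv(variable, unset = NA)
  if (is.na(root) || root == "") {
    stop("Environment variable ", variable, " is not set. Define it in .Rprofile.")
  }
  root
}

# For illustration only: in practice this line is placed in .Rprofile
Sys.setenv(GITHUB = "C:/Users/user1/Documents/GitHub")

# Build the path to the code folder from the user-specific root
code <- file.path(get_root("GITHUB"), "repository-name", "Code")
print(code)
```

With this approach, the shared script only contains the part of the path that is common to all users (repository-name/Code), and each user's profile supplies the rest.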

Relative file paths

Relative file paths in R

1. Set the working directory to the desired project root folder by launching an R Project, or by opening any file inside that project and loading the package here. Once you do that, you will see a message saying here() starts at file/path. This means that the project working directory has been identified and set.

2. Refer to files by their relative file paths using the function here and entering each subfolder as a separate argument:

 library(readr) # provides read_csv()
 library(here)  # provides here()
 raw_data <-
   read_csv(
     here(
       "Data",
       "raw.csv"
     )
   )

Relative file paths in Stata

1. Set the working directory to the desired project root folder by opening a Stata project. Once you do this, you will see a message on Stata's console starting with projmanager and ending with the file path to the Stata project. The working directory will be set to the same directory where the Stata project is saved.

2. Refer to files by their path relative to the working directory:

 use "Data/clean.dta", clear

Pros and cons of absolute and relative file paths

Since absolute file paths indicate the complete file path to files starting from the root, they can easily be used in projects where not all files are stored in a common root directory. To do this in a reproducible manner, however, the different project folders need to be shared with all team members and synced to their computers to enable local access on the machines they are working on. In addition, for every new computer used to run the project, the file paths to the different folders must be explicitly set in the main script, as explained in the previous section.

To use relative file paths in a project, all the relevant project files need to be inside the same root directory, so file paths can be spelled out using it as the starting point. If you are using multiple cloud storage applications for the same project, this can be done through the creation of directory junctions, which make a folder stored elsewhere appear inside the project's root directory.

Changing working directories

A commonly used alternative to the suggested workflows is to use relative file paths in combination with changing the working directory. However, this error-prone practice is not recommended. Changes to the working directory persist for the rest of a program session. This means that once the working directory is changed to a new location, that location becomes the starting point for all relative file paths in code that is subsequently run in that session. Users often don't realize that a script has changed their working directory and continue to use the program as if that had not happened, which can break the code and save files to unintended locations.
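
If a change of working directory cannot be avoided, for example inside legacy code, its effect can at least be contained. The sketch below shows one way to do this in base R. The function with_working_dir is a hypothetical helper written for this example. It relies on the fact that setwd returns the previous working directory, so the original location can be restored when the helper finishes.

```r
# Hypothetical helper: run a function with a temporary working directory
with_working_dir <- function(dir, fn) {
  old_wd <- setwd(dir)     # setwd() returns the previous working directory
  on.exit(setwd(old_wd))   # restore it when the helper exits, even on error
  fn()
}

start_wd  <- getwd()
inside_wd <- with_working_dir(tempdir(), getwd)
end_wd    <- getwd()

# The working directory after the call is the same as before it
print(identical(start_wd, end_wd))
```

Without the call to on.exit, the session would remain in the temporary directory after the helper returns, which is exactly the situation described above.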

Additional Resources

  • Hadley Wickham (RStudio). R for Data Science - Workflow: projects (https://r4ds.had.co.nz/workflow-projects.html)
  • Jenny Bryan (RStudio). Workflow vs script (https://www.tidyverse.org/blog/2017/12/workflow-vs-script/)
  • Julian Reif (University of Illinois). Stata Coding Guide - Setting up the environment (https://julianreif.com/guide/#setting-up-the-environment)