Difference between revisions of "File path"
Revision as of 21:22, 7 November 2022
Files are pieces of information stored in a computer's hard drives. To be able to retrieve files after creating them, users need to specify exactly in which part of which hard drive the information was stored. This is done through file paths, which are nothing more than a way to organize files in a machine that is also understandable for humans.
The main issue that file paths present to research reproducibility is that they are specific to each machine. When a researcher writes code to load a file, the path used to retrieve that file on their computer will differ from the path that another researcher needs to use to load the same file on another computer. As a result, code is often not transferrable across machines or users. This article discusses a few options to get around this issue and ensure basic computational reproducibility.
Read First
- File paths are a way to refer to files stored inside a file system.
- Users can choose to refer to files by their absolute or relative file paths.
- A common reproducibility issue is caused by file paths that are written into code in a non-transferrable manner.
- Employing good coding practices for file paths helps make code reproducible across machines in any software or programming language.
Overview
File systems store files using a hierarchical structure, where multiple files can be grouped into one directory (also commonly known as a folder), and directories can be grouped into other directories. The starting point of the directory hierarchy is called a root. On Windows, each drive has its own root, usually represented by a capital letter followed by a colon (e.g. C:/ or G:/); on Mac OS and Linux, there is a single root represented by a forward slash (/). In a textual representation accepted by Mac OS, Linux and Windows, file paths are represented by directory names separated by the forward slash ("/"), where the directory that succeeds the forward slash is a subdirectory of the one that precedes it.
Absolute file paths point to a directory or file by explicitly listing all its parent directories, starting from the root. For example, on Windows, the Documents folder is typically stored on a hard drive called C:, inside a user-specific directory, and can usually be referenced by the file path C:/Users/username/Documents.
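As a rough illustration of how an absolute path encodes its parent directories, the base R sketch below splits the example path from the paragraph above into its components and moves one level up. The path is only the example from the text and does not need to exist on your machine.

```r
# Example absolute path from the text (illustrative; need not exist on disk)
documents <- "C:/Users/username/Documents"

# Each element is one level of the hierarchy, starting at the drive root
components <- strsplit(documents, "/", fixed = TRUE)[[1]]

# dirname() drops the last level; basename() keeps only the last level
parent <- dirname(documents)
folder <- basename(documents)
```

Here components holds the four levels "C:", "Users", "username" and "Documents", which is what makes the path unambiguous on the machine where it was written, and also what ties it to that machine.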
When a program is launched on a computer, it is usually associated with a working directory. That means it will look for files inside that directory. Relative file paths refer to files relative to the current working directory for a program. For example, if the working directory is the Documents folder, a project folder called My Project stored inside the Documents folder would be referred to simply as My Project, while a folder called DataWork stored inside the My Project folder would be referred to as My Project/DataWork.
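The relationship between a relative path and the working directory can be sketched in base R as follows. The helper resolve_path is a hypothetical name introduced only for this illustration; it simply joins a working directory and a relative path, which is conceptually what a program does when it receives a relative path.

```r
# Hypothetical helper: join a working directory and a relative path.
# By default it uses the session's current working directory.
resolve_path <- function(relative_path, working_dir = getwd()) {
  file.path(working_dir, relative_path)
}

# Example from the text: the working directory is the Documents folder
documents <- "C:/Users/username/Documents"
resolved <- resolve_path("My Project/DataWork", working_dir = documents)
# resolved is "C:/Users/username/Documents/My Project/DataWork"
```

The same relative path resolves to a different absolute path on each machine, which is why relative paths can be shared across users as long as the working directory is set consistently.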
Coding transferrable file paths
The instructions below demonstrate how to set up projects with transferrable file paths in different software. It takes as an example a user that is writing code for a project's data work. The relevant files are stored in a directory called DataWork that contains the following subdirectories and files:
    DataWork
    |__Data
       |__raw.csv
       |__clean.dta
       |__final.dta
    |__Code
       |__cleaning.do
       |__analysis.do
    |__Output
       |__summary-stats.tex
       |__balance-table.tex
       |__coefplot.png
    |__Documentation
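For readers who want to try the examples in this article without touching a real project, the base R sketch below creates an empty copy of this folder structure. It assumes a throwaway location inside R's temporary directory; the object names project_root and subdirs are illustrative.

```r
# Build the DataWork skeleton in a temporary location (an assumption made
# for this sketch, so that no real project folder is modified)
project_root <- file.path(tempdir(), "DataWork")
subdirs <- c("Data", "Code", "Output", "Documentation")

for (subdir in subdirs) {
  # recursive = TRUE also creates DataWork itself on the first pass
  dir.create(file.path(project_root, subdir),
             recursive = TRUE, showWarnings = FALSE)
}

# List the subdirectories that now exist under DataWork
created <- list.files(project_root)
```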
Absolute file paths in R
1. Identify your computer's user name
    # On a Windows computer
    Sys.getenv("USERNAME")
    # On a Mac computer
    Sys.getenv("USER")
2. Copy the returned string and use it in an if statement in your main script, as in the two if conditions in the code chunk below.
    # On a Windows computer
    if (Sys.getenv("USERNAME") == "user1") {
      code <- "C:/Users/user1/Documents/GitHub/repository-name/Code"
      data <- "C:/Users/user1/Box/project-folder/Data"
      docs <- "G:/Shared drives/Team Drive/project-folder/Documentation"
    }
    # On a Mac computer
    if (Sys.getenv("USER") == "username2") {
      code <- "/Users/username2/GitHub/repository-name/Code"
      data <- "/Users/username2/Library/CloudStorage/Box-Box/project-folder/Data"
      docs <- "/Users/username2/Library/CloudStorage/GoogleDrive-username2@gmail.com/Team Drive/project-folder/Documentation"
    }
3. Find the file paths to the code, data, and documentation folders on your computer and replace the strings assigned to the code, data, and docs objects in the code chunk above.
4. Run the main script to create the objects defined by the code in R's memory.
5. Use the objects defined in your main script to refer to directories inside the function here, as in the example below.
    library(haven)  # provides read_dta
    library(here)   # provides here
    # Load data set
    clean_data <- read_dta(here(data, "clean.dta"))
The base R function file.path has a similar effect, but does not work in all the cases where here does. [1]
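One possible variation on the steps above, offered only as a sketch, keeps the user-specific paths in a single lookup list and stops with a clear message when the script is run by a user who has not set their paths yet. The names project_paths and get_paths are hypothetical, and Sys.info()[["user"]] is used here instead of the two Sys.getenv calls on the assumption that it reports the user name on both Windows and Mac. Base R's file.path is used to build the final path so the sketch runs without extra packages.

```r
# Hypothetical lookup list: one entry per team member's user name
project_paths <- list(
  user1 = list(
    code = "C:/Users/user1/Documents/GitHub/repository-name/Code",
    data = "C:/Users/user1/Box/project-folder/Data"
  ),
  username2 = list(
    code = "/Users/username2/GitHub/repository-name/Code",
    data = "/Users/username2/Library/CloudStorage/Box-Box/project-folder/Data"
  )
)

# Return the paths for a user, or fail loudly if none are defined
get_paths <- function(user = Sys.info()[["user"]], paths = project_paths) {
  if (!user %in% names(paths)) {
    stop("No file paths defined for user '", user,
         "'. Add an entry to project_paths.")
  }
  paths[[user]]
}

# Usage with an explicit user name, so the example runs on any machine
paths <- get_paths("user1")
clean_file <- file.path(paths$data, "clean.dta")
```

Failing early when a user is missing avoids the situation where the path objects are silently left undefined and the error only surfaces later, when a file is loaded.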
Absolute file paths in Stata
1. Identify your computer's user name or host name
    di c(username)
    di c(hostname)
2. Copy the returned string and use it in an if statement in your main do-file, as in the if condition at the top of the code chunk below.
    * On a Windows computer
    if c(username) == "user1" {
        global code "C:/Users/user1/Documents/GitHub/repository-name/Code"
        global data "C:/Users/user1/Box/project-folder/Data"
        global docs "G:/Shared drives/Team Drive/project-folder/Documentation"
    }
    * On a Mac computer
    else if c(username) == "username2" {
        global code "/Users/username2/GitHub/repository-name/Code"
        global data "/Users/username2/Library/CloudStorage/Box-Box/project-folder/Data"
        global docs "/Users/username2/Library/CloudStorage/GoogleDrive-username2@gmail.com/Team Drive/project-folder/Documentation"
    }
3. Find the file paths to the code, data, and documentation folders on your computer and replace the strings defined by the code, data, and docs globals in the code chunk above.
4. Run the main do-file to load the global macros defined by the code in Stata's memory.
5. Use the global macros defined in your main script to refer to files, as in the examples below.
    * Load data set
    use "${data}/clean.dta", clear
    * Run do-file
    do "${code}/analysis.do"
Note that for this workflow to work, global macros need to be used instead of local macros. This is because the file path macros are set only in the main do-file, but also need to be used by other do-files, and local macros are not visible outside the do-file that defines them.
Relative file paths in R
1. Set the working directory to the desired project root folder by launching an R Project or opening any file inside that project and loading the package here. Once you do that, you will see a message saying here() starts at file/path. This means that the project working directory has been identified and set.
2. Refer to files by their relative file paths using the function here and entering each subfolder as a separate argument:
raw_data <- read_csv( here( "Data", "raw.csv" ) )
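To make the idea concrete without requiring any package, the sketch below imitates this pattern in base R. The function from_root is a hypothetical stand-in written for this example and is not the here function itself: it only prepends a fixed project root to the subfolders it receives. The project root is placed in R's temporary directory and a small raw.csv file is created there, both as assumptions for the sake of a runnable example.

```r
# Assumed project root for this sketch (a temporary DataWork folder)
project_root <- file.path(tempdir(), "DataWork")
dir.create(file.path(project_root, "Data"),
           recursive = TRUE, showWarnings = FALSE)

# Create a small example file so there is something to load
write.csv(data.frame(id = 1:3),
          file.path(project_root, "Data", "raw.csv"),
          row.names = FALSE)

# Hypothetical stand-in for here(): each subfolder is a separate argument
from_root <- function(...) file.path(project_root, ...)

# The code that loads the file mentions only the path relative to the root
raw_data <- read.csv(from_root("Data", "raw.csv"))
n_rows <- nrow(raw_data)  # 3
```

Only the definition of project_root depends on the machine; every other line can be shared unchanged between team members.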
Relative file paths in Stata
1. Set the working directory to the desired project root folder, either by launching Stata from the project's main do-file or by opening a Stata project.
2. Refer to files by their path relative to the working directory:
use "Data/clean.dta", clear
Pros and cons of absolute and relative file paths
Since absolute file paths indicate the complete file path to files starting from the root, they can easily be used in projects where not all files are stored in a common root directory. To do this in a reproducible manner, however, the different project folders need to be shared with all team members and synced to their computers to enable local access in the machines they are working on. In addition, for every new computer used to run the project, the file paths to the different folders must be explicitly set in the main script, as explained in the previous section.
To use relative file paths in a project, all the relevant project files need to be inside the same root directory, so file paths can be spelled out using them as starting point. If you are using multiple cloud storage applications for the same project, this can be done through the creation of directory junctions. In Stata, the Project Manager also allows users to connect multiple drives to one project.
Changing working directories
A commonly used alternative to the suggested workflows is to use relative file paths in combination with changing the working directory. However, this practice is error-prone and not recommended. A change to the working directory persists for the rest of the program session. This means that once the working directory is changed to a new location, that location becomes the starting point for all relative file paths in code that is subsequently run in that session. Users often don't realize that a script has changed their working directory and continue to use programs as if that did not happen, which can break the code and save files to unintended locations.
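If a change of working directory cannot be avoided, for example inside a helper that calls a tool expecting a specific location, one way to limit the damage in R is to restore the previous directory automatically. The sketch below is an assumption-laden illustration rather than a recommended workflow; run_in_dir is a hypothetical helper name, and it relies on the base R facts that setwd returns the previous working directory and that on.exit runs when the function finishes, even after an error.

```r
# Hypothetical helper: run a function in another directory, then restore
# the original working directory no matter how the function exits
run_in_dir <- function(dir, task) {
  old_wd <- setwd(dir)               # setwd() returns the previous directory
  on.exit(setwd(old_wd), add = TRUE) # restore it when this function exits
  task()
}

before <- getwd()
inside <- run_in_dir(tempdir(), function() getwd())
after <- getwd()

# The change was confined to the helper: the session is back where it started
unchanged <- identical(before, after)  # TRUE
```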
Additional Resources
- Hadley Wickham (RStudio). R for Data Science - Workflow: projects
- Jenny Bryan (RStudio). Workflow vs script