Projects in RStudio
Author: Håkon Kaspersen
Agenda
Go through what an R Project is, and how to start and use one
What is the folder structure?
How to handle data inside an R project
Briefly go through which packages are recommended and how they are used
Versions
This description of Projects in RStudio is based on the following versions:
R version 4.0.5
RStudio version 1.4.1717
What is an R project?
Have you ever wondered where your analysis “lives”, and how to handle overlapping analyses and data? If so, an R project is the thing for you! I want you to consider the following:
When you use R with RStudio, how do you open it?
Do you use only one instance of RStudio for everything?
Do you save your scripts, do they “live” somewhere?
Where do you store your data files and analysis files?
Do you use setwd() when starting RStudio?
R Projects are a great way to handle the issues that may arise from the workflow described above. In particular, the use of setwd() makes it virtually impossible for anyone other than the original author of the script to make the file paths work. As Jenny Bryan1 put it:
If the first line of your R script is setwd("C:\Users\jenny\path\that\only\I\have") I will come into your office and SET YOUR COMPUTER ON FIRE 🔥.
If the first line of your R script is rm(list = ls()) I will come into your office and SET YOUR COMPUTER ON FIRE 🔥.
An R project is simply a folder on your computer that holds all the files relevant to that particular piece of work. Any script run inside an R project assumes that the working directory is the project folder, so files can be referenced with relative paths. This makes it possible to move the whole project folder anywhere and still run it. Another advantage is that you have full control of package versions and data, assuming you use the relevant packages to handle this. Let’s assume you are analyzing data for a manuscript, where reproducibility is very important. By using an RStudio project, you can gather all necessary data in a data subfolder, and all necessary scripts in a scripts folder, using relative paths. When the paper is published, you can package up the whole project and share it with anyone who wants to reproduce your analysis! In short, RStudio projects should be reproducible, portable, and self-contained4.
Before you start
Some considerations are necessary for optimal and safe use of RStudio projects. In general, you should not be scared of cleaning out the RStudio global environment when working with Projects! I cannot stress this enough. Your analysis should be reproducible to the point where you always start with nothing saved in memory. To force yourself into this way of thinking, I highly recommend changing the following in the Global Options in RStudio (“Tools” -> “Global Options”):
Un-tick “Restore .RData into workspace at startup”
Set “Save workspace to .RData on exit” to “Never”
By doing this, you force yourself to start with a clean slate every time you open the project again. Onward to a more reproducible way of thinking!
Note: This does not affect the saved files and scripts that live inside your project folder, only the data that RStudio has read into memory, i.e. what you see listed under the “Environment” pane in RStudio.
Double-Note: This changes the global setting, which will affect RStudio across all instances on your computer. If you wish to retain this function elsewhere, you may change these settings in “Project Options” instead after creating the project as described below.
How to start an RStudio Project
To start a new RStudio project, open RStudio and click on “File” -> “New Project”. A pop-up box will appear asking whether you want to create a new directory for the project, use a pre-existing directory, or check out a project from version control. Generally I would recommend always starting a new project on a clean slate, so pick the “New Directory” option. Next, the menu will ask what kind of project you would like to create. Here, the topmost option “New Project” will suffice in most cases, unless you are creating a package or a Shiny application. Next, you have to give the project a name, and give RStudio the path to where the project will be stored. You may also choose to tick “Create git repository” and “Use renv with this project”, but more on that later.
Note: Saving your projects in a dedicated R-Projects folder in a safe location on your computer is recommended! However, some issues may arise if you store it at a location with special symbols or spaces in the path, such as OneDrive - Veterinærinstituttet/R/R_Projects. Try to avoid saving the projects at such a location!
After clicking “Create Project”, the project will be generated and you will be met by an empty project. When a new project is created, the following happens 3:
Creates a project file (with an .Rproj extension) within the project directory. This file contains various project options and can also be used as a shortcut for opening the project directly from the filesystem.
Creates a hidden directory (named .Rproj.user) where project-specific temporary files (e.g. auto-saved source documents, window-state, etc.) are stored. This directory is also automatically added to .Rbuildignore, .gitignore, etc. if required.
Loads the project into RStudio and displays its name in the Projects toolbar (which is located on the far right side of the main toolbar)
After creating a new project, you can create scripts and run analyses!
How to structure an RStudio project
How you structure your project is important for reproducibility and portability. Below is an example of a structure that is regularly used 5.
Data: This is the directory where the necessary data files are stored. How you structure this directory is up to you. Data is usually stored as .csv, .xlsx, or other formats. VERY IMPORTANT ABOUT THE DATA DIRECTORY: The data stored here should be regarded as source data, which means that under no circumstances should you overwrite or edit these files! This ensures the reproducibility and safety of the project as a whole.
Scripts: This is the directory where the generated scripts are stored, usually with the extension .R or .Rmd. How you structure this is up to you. In most cases I have single scripts that are meant to do one thing (e.g. generate_figures.R, calculate_occurrence.R). Sensible file names are recommended, so that other people may understand what the script does! Also, remember to always use relative paths (i.e. do not reference paths outside the project folder), as this ensures portability. For more information, see the description of the “here” package below.
Functions: This is the directory where self-made functions are stored and called from. This directory may not be necessary for all projects, especially if you only use functions that are defined in packages.
Output: The output folder for generated data. This is where the results from the scripts are stored, be they figures, tables, or other file types.
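The structure above can also be created from the R console. The following is a minimal sketch, not a required layout: the project name my_analysis and the lowercase folder names are assumptions made for illustration, and a temporary directory stands in for a real project folder.

```r
# Hypothetical project root; a temporary directory stands in for a real project
project_root <- file.path(tempdir(), "my_analysis")

# Sub-directories matching the structure described above (names are assumptions)
subdirs <- c("data", "scripts", "functions", "output")

# Create each sub-directory; recursive = TRUE also creates the project root
for (d in subdirs) {
  dir.create(file.path(project_root, d), recursive = TRUE, showWarnings = FALSE)
}

# List the top-level folders that now exist inside the project
created <- sort(list.dirs(project_root, full.names = FALSE, recursive = FALSE))
print(created)
```

In a real project you would run the dir.create() calls once from the project root and then leave the folder for source data untouched.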
Recommended packages
By far, the most important packages for a reproducible and portable project are renv and here. For detailed information about these packages, see their documentation.
renv
The renv package brings dependency management to RStudio projects. This package ensures that all package versions are bundled inside the project itself, making sure that the same versions are used even when updating a package in your main R library. It generates a project library of packages inside the project, and new packages that are installed are only installed there. The workflow with this package is as follows:
Call renv::init() to initialize a new project-local environment with a private R library,
Work in the project as normal, installing and removing new R packages as they are needed in the project,
Call renv::snapshot() to save the state of the project library to the lockfile (called renv.lock),
Continue working on your project, installing and updating R packages as needed.
Call renv::snapshot() again to save the state of your project library if your attempts to update R packages were successful, or call renv::restore() to revert to the previous state as encoded in the lockfile if your attempts to update packages introduced some new problems.
To install:
install.packages("renv")
# to initialize
renv::init()
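The later steps of the workflow can be sketched as below. This is only an illustration of the order of the calls: it assumes renv is installed and that the code is run interactively from the project root, and the guard is there because these calls modify the project library and the lockfile.

```r
# Check whether renv is available without attaching it
renv_installed <- requireNamespace("renv", quietly = TRUE)

# Guarded so that nothing is changed when the script is run non-interactively
if (renv_installed && interactive()) {
  # Compare the project library against what is recorded in renv.lock
  renv::status()

  # Record the current package versions in renv.lock
  renv::snapshot()

  # If a package update introduced problems, roll back to the lockfile instead:
  # renv::restore()
}
```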
here
The here package enables easy file referencing in project-oriented workflows. It is a package that contains functions that automatically build the path to the file you want to interact with. The here package makes it easy to generate paths relative to the project root, which makes the project portable. When you want to read a data file, the following is usually written:
read_delim("path/to/datafile.txt", delim = "\t")
with the here package you would write:
read_delim(here("path","to","datafile.txt"), delim = "\t")
The here() function finds your project’s files relative to the project’s root location. Also, since there is no need to supply slashes in the path, no issues arise from the difference between the Windows standard of backslashes (\) and the forward slashes (/) that R prefers.
To install:
install.packages("here")
# To initialize at current location
library(here)
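As a small illustration of how here() builds paths, the sketch below constructs an input and an output path from the project root. The file names samples.csv and summary.csv are hypothetical, and the code assumes the here package is installed; if the script is run outside a project, here() falls back to the current working directory.

```r
# Only run the example if the here package is installed
if (requireNamespace("here", quietly = TRUE)) {
  # Hypothetical file names, used for illustration only
  input_path  <- here::here("data", "samples.csv")
  output_path <- here::here("output", "summary.csv")

  # Both paths are built from the project root, without typing any slashes
  print(input_path)
  print(output_path)
}
```

Reading from one folder and writing to another in this way keeps the source data separate from the generated results.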
Packages to avoid
So far, there is only one package that should be avoided when working with Projects in RStudio: the pacman package. In particular, the use of pacman::p_load is to be avoided. This is mainly because the function both installs packages with install.packages() and loads them with library() in one go. If you use it, it can interfere with the renv management of package versions, because any package that p_load has to install is fetched in its newest version rather than the version recorded in the lockfile, breaking the control of package versions.
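One possible alternative is to check for packages explicitly and stop with a message instead of installing anything automatically. The sketch below is one way to do this in base R, not a prescribed method; the package list is hypothetical and uses packages that ship with R so the example runs anywhere.

```r
# Hypothetical list of packages that a script depends on
required_packages <- c("stats", "utils")

# Check which packages are available, without installing anything
is_installed <- vapply(required_packages, requireNamespace,
                       logical(1), quietly = TRUE)
missing_packages <- required_packages[!is_installed]

# Stop early so that installation remains a deliberate, separate step
if (length(missing_packages) > 0) {
  stop("Missing packages: ", paste(missing_packages, collapse = ", "),
       ". Install them deliberately, then run renv::snapshot().")
}

# Attach the packages explicitly
library(stats)
library(utils)
```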
Version control
Version control (e.g. git) is a way to ensure that your files are safe, and to track changes throughout the project’s development. RStudio has built-in git functionality, but this may not be optimal for all users. I like to use git outside of RStudio (e.g. git bash), or use the built-in terminal that was introduced in newer versions of RStudio. However you choose to proceed, using git is good practice, and it makes sure that your files are safe from deletion or accidental changes. More information about git version control and RStudio can be found here. The use of git also allows for upload to e.g. GitHub, where the code can be published.
Take-Home messages
Create an RStudio project for each data analysis project
Keep all data files and scripts used in the analysis inside the project, under version control
Never overwrite or change the source data files. Save your output in a separate subfolder!
Only ever use relative paths, not absolute paths
Use the here and renv packages to help you out!
Avoid using pacman