Setting up your work environment

Development tools, version control, packages

Roland Krause

Rworkshop

Tuesday, 6 February 2024

Introduction

Motivation

  • We want to work in an environment with minimal distractions.

  • We want to write as little code as possible and use ideas that others have implemented.

  • How can we share code when it’s constantly changing?

Learning objectives

Integrated development environment

  • Get to know RStudio, the most important IDE for R projects
  • Recommended and hidden features
  • Use a versioning system
  • Good package management

Project environment

  • Set up a new scientific analysis project
  • Set up code versioning with git
  • Very basic use of git
  • Set up and basic use of renv for environment management

RStudio

What is RStudio?

RStudio is an Integrated Development Environment (IDE). It makes working with R much easier.

Install R first

RStudio and R are independent.

Features

  • Projects to ease file organisation
  • Console to run R, with syntax highlighting
  • Full support for Rmarkdown docs & chunks
  • Viewer for data / plots / website
  • Package management (including building, tests and development)
  • Autocompletion using TAB
  • Cheatsheets
  • Git integration for versioning
  • Inline outputs
  • Keyboard shortcuts
  • Integrated Terminal
  • Jobs for running long computations in a separate session

The four panels layout

Four panels

Scripting

  • Your main window for code development
  • Should be an Rmarkdown doc
  • Tabs are great
  • Additional code windows in columns to the left

Console

  • Can be hidden when using inline outputs
  • Rmarkdown output logs
  • Optional: embed a nice Terminal tab
  • Optional: Jobs tab

Environment

  • Environment: displays loaded objects and their str()
  • History is useless IMO
  • nice git integration
  • Build for development
  • database connections interface
  • Tutorial integration of learnr

Files / Plots / Help

  • Packages management tab
  • Plots tab, unnecessary when using inline outputs
  • Help tab
  • Viewer for animations

Better reproducibility

Environments should not be saved

rm(list = ls()) is not recommended

  • does not make a fresh R session
    • library() calls remain
    • working directory is not reset!
    • modified functions remain, e.g. the evil `==` <- `!=`
  • Knitting Rmarkdown files solves it
    • Always run in a fresh session
  • Workflow managers like targets let you be fast and fully reproducible
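
A minimal sketch of the points above, not taken from the original slides. The package stats4 (which ships with R) and the digits option merely stand in for any attached package or changed setting.

```r
# Change the session state in three different ways.
my_result <- 42                  # an object in the current environment
library(stats4)                  # attach a package (stats4 ships with R)
options(digits = 3)              # change a global option

# The "clean-up" that is often placed at the top of scripts.
rm(list = ls())

# Only the object is gone ...
object_removed <- !exists("my_result", inherits = FALSE)

# ... the attached package and the modified option survive.
package_still_attached <- "package:stats4" %in% search()
option_still_changed <- getOption("digits") == 3

print(c(object_removed, package_still_attached, option_still_changed))

# Undo the changes by hand so the example leaves no trace.
options(digits = 7)
detach("package:stats4")
```

Restarting R (or knitting the document) is the only step that resets all three at once.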

Options to activate / deactivate

Warning

Please save all environments except for the R session

Other settings of note

Base pipe with Ctrl/⌘-Shift-M
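
For readers who have not met the base pipe yet, a small illustration (not part of the original slides, and assuming R 4.1 or later, where |> was introduced). The variable names are arbitrary.

```r
# Requires R >= 4.1, the first release with the native pipe |>.
values <- c(1, 4, 9)

# Nested call, read from the inside out.
nested <- sum(sqrt(values))

# Same computation with the base pipe, read from left to right.
piped <- values |> sqrt() |> sum()

print(identical(nested, piped))
```

With the setting enabled, the shortcut inserts |> instead of the magrittr pipe %>%.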

Additional code diagnostics

Consider at your own risk

Creates many warnings if all options are selected, particularly with tidyverse code that uses non-standard evaluation.

Projects

RStudio projects

Set your working directory

  • All work you conduct in RStudio should happen in the context of a Project.
  • A project consists of a directory/folder containing a file with an .Rproj extension.
  • The directory that contains this file is the reference for all paths in your project.
  • A good project name goes a long way.
    • Please don’t call it PhD.
    • Make it unique!
  • There are multiple ways to create projects
    • RStudio User Interface
    • Place an .Rproj file in an already existing folder
    • Copy (clone) an existing setup under a different name

RStudio projects

Creating a project

  • Use the Project button in the top right corner
  • Select New project
  • Name the project properly
  • A .Rproj extension file is generated.

Project initialization from scratch

Good practice

Use git and renv when starting projects from scratch.

Working directory set up with RStudio projects

Avoid using setwd() and absolute file paths

What’s wrong with setwd()? Jenny Bryan’s great blog post on project based workflows

  • setwd() sets the working directory to a specific folder through an absolute file path
setwd("C:/Users/veronica.codoni/Projects/Survival_Analysis")
  • What if this folder moved to /Downloads, or onto another machine?
setwd("C:/Users/veronica.codoni/Projects/Survival_Analysis")
Error in setwd("C:/Users/veronica.codoni/Projects/Survival_Analysis"): cannot change working directory

This approach is neither self-contained nor portable!

RStudio projects as alternative

Start a new research project/data analysis by creating a new RStudio Project

  • it helps to keep all the files together
  • it automatically sets the working directory to the project directory
  • it helps resolve file path issues by using relative paths

Creating a Project in RStudio ensures that the project directory is stand-alone and portable.
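
The difference can be sketched with a throw-away project in a temporary folder. Everything here is illustrative: the folder names, patients.csv and the helper read_patients() are hypothetical, and the setwd() inside the helper only stands in for RStudio opening the project.

```r
# Build a small pretend project with one data file.
project <- tempfile("survival-analysis-")
dir.create(file.path(project, "data"), recursive = TRUE)
write.csv(
  data.frame(id = 1:3, time = c(5, 8, 12)),
  file.path(project, "data", "patients.csv"),
  row.names = FALSE
)

# The analysis code only ever uses a path relative to the project root.
read_patients <- function(project_root) {
  old <- setwd(project_root)  # stand-in for opening the .Rproj in RStudio
  on.exit(setwd(old))         # put the working directory back afterwards
  read.csv(file.path("data", "patients.csv"))
}

before <- read_patients(project)

# Move the whole project somewhere else: the relative path keeps working.
moved <- tempfile("moved-project-")
file.rename(project, moved)
after <- read_patients(moved)

print(identical(before, after))
```

An absolute path to the original location would fail after the move, exactly like the setwd() error shown earlier.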

Your turn: Create an RStudio project

Use the Project button in the top right corner.

  • Always name the projects properly

  • Keep names concise, with no spaces

  • Once the project has been created, a .Rproj extension file is generated. This allows for automatic working directory set-up.

Exercise

  • Create a new project called dummy-project
    • in a new folder
    • enable git
    • enable renv
  • Create sub-folders data and R.
  • Standard practice is to include an R/ folder for code, similar to src and bin directories.

Tip

We will not use this directory in this workshop any more, hence the name.

What your project directory should not look like

├── analysis.R        
├── analusis_2.R
├── analusis4.R
├── parse_data.sh
├── tools  #<< environment
│   ├── FACS.exe
│   ├── plink1.09
│   ├── plink1.7
│   ├── plink2.0
│   └── R2.9
├── mydata #<< see next lecture
│   ├── facs.xlsx
│   └── single-cells.txt
├── old_scripts/
│   ├── analysis_final.R
│   ├── analysis_final_2.R
│   ├── analysis_tmp.R
│   └── analysis_final_nature.R
├── Analysis-international_frontiers_biomedical-Machine_learning.R # IF 3.011!
├── Manuscript_2015-oct.docx
├── Manuscript_15-oct2.docx
├── Manuscript_15-oct2-PI.docx
└── Manuscript_Sept-2015.docx

Directory setup

A standard setup

├── .git/         #<< code versioning
├── .gitignore  
├── Repro.Rproj
├── R
│   ├── functions.R
│   └── utils.R
├── data/
├── renv/         #<< environment
├── renv.lock     #<< environment
├── README.md     #<< Project information
└── report.Rmd #<< RMarkdown (supersedes analysis.R) 
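
The hand-made part of this layout can be scaffolded from R. The sketch below is only an illustration in a temporary folder. It leaves out .git/, renv/, renv.lock and the .Rproj file because git, renv and RStudio create those themselves.

```r
# Temporary location so the example never touches real work.
project <- tempfile("Repro-")

# Folders for code and data, as in the tree above.
for (folder in c("R", "data")) {
  dir.create(file.path(project, folder), recursive = TRUE)
}

# Empty placeholder files for functions, the report and the README.
placeholders <- file.path(
  project,
  c("R/functions.R", "R/utils.R", "report.Rmd", "README.md", ".gitignore")
)
created <- file.create(placeholders)

# Show the files that now exist, including hidden ones such as .gitignore.
# Empty folders (data/) are not listed by list.files().
print(list.files(project, recursive = TRUE, all.files = TRUE))
```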

Keeping things tidy

  • How to keep older versions of scripts?
  • How not to lose good, working code when changing things?
  • How to share code with others for collaboration?

Version control

Code versioning systems support you in all of these use cases.

Code versioning

Git in five minutes

What it does for you

  • Version your code at small scale
    • No more physical copies of source code(!)
    • Go back to the versions that worked
    • Go back to the version you used a year ago
  • Helps to document your advancements and solutions
  • Share and publish your code while continuing to work on it
├── analysis.R        
├── analusis_2.R
├── analusis4.R
├── analusis_final.R
├── analysis.R 
├── analusis-final2.R
└── alanusis_nature.R

Capacities in code versioning

  1. Register (add) files in a project called a repository
  2. Track changes (commit)
  3. Exchange code with others through a shared repository
  4. The most common platforms for sharing repositories are
    • GitHub, a global platform owned and supported by Microsoft
    • GitLab, local platforms maintained and supported by institutes, e.g. gitlab.lcsb.uni.lu
    • Both platforms offer many other tools
  • Here, we cover Git as far as it is easily accessible from RStudio and other IDEs

Project initialization from a repository

From an existing Git repository

From a web repository by cloning (SSH)

Working with the repository

Add a file to the staging area

Using RStudio git panel:

The equivalent command line is git add coude.R

Commit (to) changes

After a change, the file is tagged with M (modified).

Click in the Staged column to move the modification to the staging area.

  • Changes you do not commit are not tracked.
  • In git you can move between commits.
  • All these changes are local unless you declare a remote.

The command line equivalent is:

git commit -m "Added Chip-seq input analysis.R"

Git history

Synchronizing with a repository

Optionally, pull with rebase (rewrites your local commits into a linear history to avoid merge commits) via the little black arrow menu next to Pull.

  • Compare all files already in the remote repository
  • Fail if changes are not committed
  • If changes are committed.
  • If additional changes have happened, ensure that code is working

Command line equivalents

  • Pull: git pull
  • Pull with rebase: git pull --rebase
  • Push the local state to the remote repository

Command line equivalent of Push:

git push

What not to put into Git

Data types to be aware of

  • Sensitive data (should not be on GitHub)
  • Large data
    • Git servers are not set up to handle large files (>1 MB)
  • Binary data
    • git makes line by line comparisons
  • Derived data
    • Any data you can reproduce by computation
    • You might want to archive some intermediate files by other means

Possible solutions

  • Install scripts
    • Copy or link your data to the data directory
  • git-lfs Large File Storage
    • Puts the files on a server and tracks them only via checksums
    • Repository only tracks a small text file instead

Make your life easy with .gitignore

  • Derived files (e.g. *.html)
  • Data products (.RData)
.RData
*.Rproj
*.pdf
*.tex
.DS_Store

Our turn

Standard .gitignore

The course practical contains a standard .gitignore file. Remember to take a look!

Let’s take a look at the .gitignore file in the practical.

Working with Git

Helpful tips

1. Commit and pull frequently

  • At least once a day, when your code is working
  • When adding a feature or analysis step (e.g. chunk-wise)
  • Regularly merge from other branches if you have a multi-branch setup

2. Best learned from an experienced person

  1. It’s a habit that needs developing
  2. You are the first to benefit from it
  3. You need to version files. Automate it. Don’t reinvent the wheel

Environment management

What’s your environment?

Packages

CRAN is reliable: packages are checked during the submission process

install.packages("tidyverse")

Of note: Microsoft shut down MRAN on July 1, 2023

Bioconductor is dedicated to biology research, e.g. {limma}

Requires the dedicated package BiocManager for installation.

Typical install

# install.packages("BiocManager")
BiocManager::install("limma")

Easy install from GitHub thanks to the remotes package.

# install.packages("remotes")
remotes::install_github("tidyverse/readr")

Installing random packages from GitHub can be a security risk

Package installation

CRAN install from RStudio (autocompletion)

CRAN install from RStudio’s console

Managing your programming environment

Documentation

  • Source and version of package, incl. dependencies
  • R/Bioconductor or Python versions
  • (Ideally your work should not depend on the operating system)
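
What such documentation contains can be sketched by hand in base R. The three packages below ship with R and are only placeholders for the packages a real project uses. The object names are arbitrary.

```r
# Version of R itself.
r_version <- R.version.string

# Versions of a few packages; replace with the ones your project needs.
pkgs <- c("stats", "utils", "tools")
versions <- vapply(
  pkgs,
  function(pkg) as.character(packageVersion(pkg)),
  character(1)
)

# One row per package, ready to be written to a file or a report.
environment_record <- data.frame(
  package = pkgs,
  version = versions,
  row.names = NULL
)

print(r_version)
print(environment_record)
```

sessionInfo() prints similar information for everything that is currently loaded.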

Package documentation quickly becomes complex and tedious to control manually

  • Projects require different versions of packages
  • Collaborators have different versions installed on their machines!

What does not work

  • Agreements on specific versions in the group/institute/collaboration
  • Hope that the operating systems/programming language will be supported until things are finished
  • Keeping things as they are and not upgrading until the paper/revision is out
  • … not upgrading until the PhD is finished
  • … not upgrading until the PostDoc is finished

Packages for environment definition

  • conda for scientific programming

  • Python virtual environments

    • Defines Python version and packages
  • renv for R environments

    • Defines packages from CRAN, Bioconductor, GitHub repositories
    • Defines specific versions (incl. R if you want)
    • Every project has its own defined packages
    • Simplifies (!) installation

renv features

Basic functions

  • hydrate() searches source code files for library calls
  • install(pkgs) installs package pkgs including its dependencies
  • snapshot() registers changes, hashes and origin
  • restore() reverts the project library to the state recorded in the lockfile
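
How these functions fit together can be sketched as two small wrappers. This is an illustration under assumptions, not the workshop's procedure: renv must be installed, the project must already use renv, and the names record_environment() and reproduce_environment() are hypothetical.

```r
# Nothing below runs until one of the functions is called inside a project,
# because each call changes the project library or the lockfile.

# Author side: collect the packages the code needs and record the state.
record_environment <- function(packages = character()) {
  renv::hydrate()                 # pick up packages the code already uses
  if (length(packages) > 0) {
    renv::install(packages)       # add further packages with dependencies
  }
  renv::snapshot(prompt = FALSE)  # write versions, hashes, origin to renv.lock
}

# Collaborator side: rebuild the library that renv.lock describes.
reproduce_environment <- function() {
  renv::restore(prompt = FALSE)
}
```

Committing renv.lock to Git is what lets a collaborator call the second function after cloning.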

Snapshot

```{r}
> renv::snapshot()
The following package(s) will be updated in the lockfile:

# CRAN ===============================
- RcppParallel   [5.0.2 -> 5.0.3]
- cli            [2.3.0 -> 2.3.1]
- pkgload        [1.1.0 -> 1.2.0]
- tint           [0.1.3 -> *]

# GitHub =============================
- targets        [ropensci/targets@main: 598d7a23 -> bdc1b29c]

Do you want to proceed? [y/N]: 
```

renv.lock file after a snapshot

  "R": {
    "Version": "4.0.3",
    "Repositories": [
      {
        "Name": "CRAN",
        "URL": "https://cloud.r-project.org"
      }
    ]
  },
  "Bioconductor": {
    "Version": "3.12"
  },
  "Packages": {
    "AnnotationDbi": {
      "Package": "AnnotationDbi",
      "Version": "1.52.0",
      "Source": "Bioconductor",
      "Hash": "ca5106b296b3aa6af713ce197be547c1"
    },
    "BH": {
      "Package": "BH",
      "Version": "1.75.0-0",
      "Source": "Repository",
      "Repository": "CRAN",
      "Hash": "e4c04affc2cac20c8fec18385cd14691"
    },
    "targets": {
      "Package": "targets",
      "Version": "0.1.0.9000",
      "Source": "GitHub",
      "RemoteType": "github",
      "RemoteUsername": "ropensci",
      "RemoteRepo": "targets",
      "RemoteRef": "main",
      "RemoteSha": "598d7a23661d4c760209c7991bf10584eadcf7c8",
      "RemoteHost": "api.github.com",
      "Hash": "ee66061fd5c757ec600071965d457818"
    },
    [...]

GitHub classroom configuration setup

1. Check the github-classroom

Follow this link and authorize it

2. Accept the assignment

You should see this invite:

3. Go to the assignment

  • Reload the browser page

  • The assignment is created in your personal GitHub repository.

  • Click on the link with the blue background

Clone the repository as an RStudio project

4. Check your own repo

It should look like this:

5. Copy the Code SSH URL

Make sure to use the Git Clone with SSH!

6. Insert the URL for a new Git project

In Repository URL

Practicals project setup

7. Install renv and yaml by running install.packages(c("renv", "yaml"))

8. Activate renv by running renv::activate().

9. Run renv::hydrate() to install the packages necessary. You should see:

Discovering package dependencies ... Done!
Copying packages into the cache ... Done!

This should be fast since you already have most packages in your renv cache.

10. Create your first package snapshot with renv::snapshot(). The output should be something like this:

The following package(s) will be updated in the lockfile:

# CRAN ===============================
- R6           [* -> 2.5.1]
- base64enc    [* -> 0.1-3]
- bslib        [* -> 0.3.1]
[...]
- tinytex      [* -> 0.38]
- xfun         [* -> 0.30]
- yaml         [* -> 2.3.5]

Do you want to proceed? [y/N]: 

Say y to Do you want to proceed? [y/N]:. The renv.lock is created.

The first tutorial (datasauRus) is guided and demonstrates tidyverse capabilities that we will explore in the workshop.

Before we stop

You learned about:

  • RStudio
    • Basic functionality
    • Recommended settings
    • Project setup
  • Code versioning with Git
    • Simplify your analysis code
    • GitHub classroom
  • Package management with renv

Acknowledgments

  • Hadley Wickham (RStudio)
  • Kevin Ushey (renv)
  • Romain François
  • Jenny Bryan (File organization)

Contributors

  • Roland Krause
  • Aurélien Ginolhac
  • Veronica Codoni

Thank you for your attention!