Drives organization HPC (SAGA and NIRD)

We have access to several drive locations on SAGA and NIRD. This document explains which areas we have access to and how to use them. SAGA is used for compute (calculations) and NIRD for long term storage.

A note on backup

As of fall 2021, there is a 6 months snapshot (a kind of backup) function on SAGA. This means that the disk areas on SAGA are backed up to NIRD. Please note that any changes after 6 months cannot be recovered.

NIRD and SAGA

It is important to remember that NIRD and SAGA are two different computers. However, in every day use, we only access SAGA as such. The drives that are on the NIRD server are available on SAGA. However, since they are not actually on SAGA, we should use rsync and not copy/move when shifting data to and from NIRD.

Drive locations on SAGA and NIRD: overview

These are different locations that we work with in regarding SAGA and NIRD. Users normally get access to those two areas. However, we sometime forget, so please complain (Contact: Karin Lagesen, Thomas Haverkamp).

In the following $USER is your username. See also SIGMA2 Documentation: Storage areas on HPC clusters

SAGA		NIRD
Compute		Storage
saga.sigma2.no		login.nird.sigma2.no
/cluster/project/nn9305K	Main NVI Bioinfo project area	/project/NS9305K

Overview:

SAGA	Usage	Backed-up	Access	Comments
main organization
/cluster/home/$USER	Your login area: where you end up after login	Yes	All	Do not store any NVI data here!
/cluster/projects/nn9305k	Main Bioinformatic location and working area	yes	All
/nird/projects/NS9305K	Mount point for NIRD NNVI project (= NIRD)	NA	All	Long term storage
/cluster/work/users/$USER	Analysis area	no	All	Data is removed after carence time if spaces is limited. = `$USERWORK`

Subfolders organization
/cluster/projects/nn9305k/active/$USER_OR_project	Your active working area, shared active projects	yes	All	Where all your work happens
/cluster/projects/nn9305k/db_flatfiles	various references files	yes	All	eg. ariba MLST, AMR, and virulence db; chewbbaca schemes; adapter files for trimmomatic.
/cluster/projects/nn9305k/genome_references	Reference genome data and databases created by reference genomes	yes	all	(eg. annotations db)
/cluster/projects/nn9305k/samplefiles	Template files and procedures for new users	yes	all	(eg. slurm scripts, bashrc and bash_profile)
Other NVI Projects

Shared ressources on SAGA
/cluster/shared	Location for shared/common databases	no
/cluster/shared/vetinst	Location for NVI shared active raw data*	no	all
/cluster/shared/biobases	Location for shared classification databases (eg. kraken)	no	all	Contact : Thomas Haverkamp
/cluster/shared/databases	Location for shared databases (eg. blast)	no	all	Contact : Thomas Haverkamp
Detail : datasets on /cluster/shared/vetinst	`datasets` - copies of currently active = worked on datasets. Divided into `wgs`, `transcriptomics` and `metagenomics`.	no	all	Ensure that raw data is stored on NIRD. Can be deleted if space is required !


Special explanations

Shared folders usually contain subfolders and REAME file describing what the databases contains. Read those.
/cluster/shared/vetinst - Main raw data directory. All data used for calculations should be stored there. Unpack and softlinked the data in to your active directories to work on them. Compressed archives should be removed. This folder is not backed up, therefore you must ensure that all datasets are first deposited on NIRD for safe long-term storage and then copied from NIRD to this folder.
/cluster/work/users/$USERNAME - A folder that should be used for analyses, ie. if analyses create many temporary files. It is used as such in eg. BIFROST and ALPPACA pipelines. There is no backup of this folder, files are automatically deleted after 21-42 days. Remember copying your results to your dedicated results folder if you use this area.

We also have access to other SAGA projects, whihc are special projects, eg. the MetaMicro project (nn10070k). Those projects have specific documentations.

for the MetaMicro project (Responsible Håkon Kaspersen and Thomas Haverkamp) please see its specific documentation for usage.

NIRD	Usage	Backed-up	Access	Commments
projects/NS9305K	Long term storage	yes	all	Main Bioinformatic storage area. NVI research projects row data and storage of analysis
projects/NS9305K/datasets	Long term storage datasets	yes	all	Archives of row datasets. Sub-directories `wgs`, `transcriptomics` and `metagenomics`.
/projects/NS9305K/home/$USER	Long term storage	yes	all	Directory where you can archive eg. results of analyses.
/projects/NS9305K/home/<nb_project>	Long term storage	yes	all	Long term storage for results

Usage and activities

Long time storage of raw data

All long term storage of raw data should happen on NIRD, where there is backup. In the NIRD folder, there is a datasets folder, same as in SAGA. Sequencing done at NVI will put the data there. Make sure you use rsync to get the data from the NIRD drives to the drives on SAGA.

If your data is not sequenced at NVI, you need to ensure yourself that your raw data is on NIRD.

Long time storing of results

All long time storage should happen on the NIRD drives. There are two ways of doing long time storage there:

Shared project (in /projects/NS9305K) if you are working with someone else on this project, create a top level directory on the NIRD drive, and keep the results there.
Personal project (in /projects/NS9305K/home/$USER). If you are not collaborating with someone on the project, put it in your NIRD home folder.

Data analysis

All active analyses should happen in the active folder on SAGA. This allows us to monitor the data usage on the server. For analyses, we suggest the following way of organizing things.

Create a folder inside your active folder, with this format: year_projectname
Inside this folder, create a README file that explains to you and anybody else what this project is about
Then, create the following folders:
- rawdata - softlinklink in your dataset here
- scripts - any scripts, including slurm scripts go here
You might have several subtasks that you will do. Create a separate directory for each of them, using the format year_month_purpose (for instance 2020_09_phylogeny). Remembering when things were done is usually easier than many other things. You should still have a README file for each subdirectory.
We also suggest you keep a log file in text format with your analysis. In this file we recommend that you not only write down what you did, but why you did it. The why is usually the most difficult to remember.