Drive organization on the HPC systems (SAGA and NIRD)
We have access to several drive locations on SAGA and NIRD. This document explains which areas we have access to and how to use them. SAGA is used for compute (calculations) and NIRD for long term storage.
A note on backup
As of fall 2021, SAGA has a 6-month snapshot function (a kind of backup). This means that the disk areas on SAGA are backed up to NIRD. Please note that changes made more than 6 months ago cannot be recovered.
NIRD and SAGA
It is important to remember that NIRD and SAGA are two different computers. In everyday use, however, we only log in to SAGA, where the drives on the NIRD server are also available. Because those drives are not physically on SAGA, we should use rsync rather than copy/move (cp/mv) when transferring data to and from NIRD.
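As a minimal sketch of such a transfer, the function below wraps rsync for copying a folder from the NIRD mount to SAGA. The helper name `fetch_from_nird` and the paths in the example call are hypothetical placeholders; `-a`, `-v` and `--progress` are standard rsync options, but adjust them to your needs.

```shell
# Sketch: copy a directory with rsync instead of cp/mv.
# fetch_from_nird is a hypothetical helper name, not an existing command.
fetch_from_nird() {
    src="$1"    # e.g. a dataset folder under /nird/projects/NS9305K
    dest="$2"   # e.g. a folder on one of the SAGA drives
    # -a preserves permissions and timestamps, -v lists transferred files,
    # --progress shows per-file progress. The trailing slash on the source
    # copies the directory's contents rather than the directory itself.
    rsync -av --progress "${src%/}/" "${dest%/}/"
}

# Example call (both paths are placeholders):
# fetch_from_nird /nird/projects/NS9305K/datasets/<dataset> /cluster/shared/vetinst/<dataset>
```

Because rsync only transfers files that are missing or changed at the destination, the same call can be repeated safely if a transfer is interrupted.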
Drive locations on SAGA and NIRD: overview
These are the different locations that we work with on SAGA and NIRD. Users normally get access to both areas. However, we sometimes forget to grant access, so please let us know if you lack it (contact: Karin Lagesen, Thomas Haverkamp).
In the following, $USER is your username. See also the SIGMA2 documentation: Storage areas on HPC clusters.
| | SAGA | NIRD |
|---|---|---|
| Purpose | Compute | Storage |
| Login address | saga.sigma2.no | login.nird.sigma2.no |
| Main NVI Bioinfo project area | /cluster/projects/nn9305k | /projects/NS9305K |

Overview:
| SAGA | Usage | Backed-up | Access | Comments |
|---|---|---|---|---|
| Main organization | | | | |
| /cluster/home/$USER | Your login area: where you end up after login | yes | all | Do not store any NVI data here! |
| /cluster/projects/nn9305k | Main Bioinformatic location and working area | yes | all | |
| /nird/projects/NS9305K | Mount point for NIRD NVI project (= NIRD) | NA | all | Long term storage |
| /cluster/work/users/$USER | Analysis area | no | all | Data is removed after a grace period if space is limited. |
| Subfolders organization | | | | |
| /cluster/projects/nn9305k/active/$USER_OR_project | Your active working area, shared active projects | yes | all | Where all your work happens |
| /cluster/projects/nn9305k/db_flatfiles | Various reference files | yes | all | e.g. ariba MLST, AMR, and virulence db; chewbbaca schemes; adapter files for trimmomatic |
| /cluster/projects/nn9305k/genome_references | Reference genome data and databases created from reference genomes | yes | all | e.g. annotations db |
| /cluster/projects/nn9305k/samplefiles | Template files and procedures for new users | yes | all | e.g. slurm scripts, bashrc and bash_profile |
| Other NVI Projects | | | | |
| Shared resources on SAGA | | | | |
| /cluster/shared | Location for shared/common databases | no | | |
| /cluster/shared/vetinst | Location for NVI shared active raw data* | no | all | |
| /cluster/shared/biobases | Location for shared classification databases (e.g. kraken) | no | all | Contact: Thomas Haverkamp |
| /cluster/shared/databases | Location for shared databases (e.g. blast) | no | all | Contact: Thomas Haverkamp |
| Detail: datasets on /cluster/shared/vetinst | | no | all | Ensure that raw data is stored on NIRD. Can be deleted if space is required! |

Special explanations
Shared folders usually contain subfolders and a README file describing what the databases contain. Read those.
- /cluster/shared/vetinst: Main raw data directory. All data used for calculations should be stored there. Unpack the data and soft-link it into your active directories to work on it. Compressed archives should then be removed. This folder is not backed up, so you must ensure that all datasets are first deposited on NIRD for safe long-term storage and then copied from NIRD to this folder.
- /cluster/work/users/$USER: A folder that should be used for analyses, for instance when analyses create many temporary files. It is used as such in e.g. the BIFROST and ALPPACA pipelines. There is no backup of this folder, and files are automatically deleted after 21-42 days. Remember to copy your results to your dedicated results folder if you use this area.
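The soft-linking step described for /cluster/shared/vetinst could look like the following sketch. The helper name `link_dataset` and the paths in the example call are hypothetical; `ln -s` is the standard command for creating symbolic links.

```shell
# Sketch: soft-link an unpacked dataset into a project's rawdata folder.
# link_dataset is a hypothetical helper name.
link_dataset() {
    dataset="$1"   # unpacked dataset folder, e.g. under /cluster/shared/vetinst
    rawdata="$2"   # rawdata folder inside your active project
    mkdir -p "$rawdata"
    # Resolve the dataset to an absolute path so the link does not
    # depend on the directory you happen to run this from.
    ln -s "$(cd "$dataset" && pwd)" "$rawdata/$(basename "$dataset")"
}

# Example call (both paths are placeholders):
# link_dataset /cluster/shared/vetinst/<dataset> /cluster/projects/nn9305k/active/$USER/<project>/rawdata
```

The link only points at the data, so no extra disk space is used in the active area and the raw files stay in one place.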
We also have access to other SAGA projects, which are special projects, e.g. the MetaMicro project (nn10070k). Those projects have their own documentation.
For the MetaMicro project (responsible: Håkon Kaspersen and Thomas Haverkamp), please see its specific documentation for usage.
| NIRD | Usage | Backed-up | Access | Comments |
|---|---|---|---|---|
| /projects/NS9305K | Long term storage | yes | all | Main Bioinformatic storage area: raw data from NVI research projects and storage of analyses |
| /projects/NS9305K/datasets | Long term storage of datasets | yes | all | Archives of raw datasets, in sub-directories |
| /projects/NS9305K/home/$USER | Long term storage | yes | all | Directory where you can archive e.g. results of analyses |
| /projects/NS9305K/home/<nb_project> | Long term storage | yes | all | Long term storage for results |

Usage and activities
Long-term storage of raw data
All long-term storage of raw data should happen on NIRD, where there is backup. In the NIRD folder, there is a datasets folder, same as on SAGA. Data sequenced at NVI is placed there. Make sure you use rsync to get the data from the NIRD drives to the drives on SAGA.
If your data was not sequenced at NVI, you are responsible for ensuring that your raw data is on NIRD.
Long-term storage of results
All long-term storage should happen on the NIRD drives. There are two ways of doing long-term storage there:
- Shared project (in /projects/NS9305K): if you are working with someone else on the project, create a top-level directory on the NIRD drive and keep the results there.
- Personal project (in /projects/NS9305K/home/$USER): if you are not collaborating with anyone on the project, put it in your NIRD home folder.
Data analysis
All active analyses should happen in the active folder on SAGA. This allows us to monitor the data usage on the server. For analyses, we suggest the following way of organizing things.
- Create a folder inside your active folder, with this format: year_projectname
- Inside this folder, create a README file that explains to you and anybody else what this project is about
- Then, create the following folders:
  - rawdata: softlink your dataset in here
  - scripts: any scripts, including slurm scripts, go here
You might have several subtasks that you will do. Create a separate directory for each of them, using the format year_month_purpose (for instance 2020_09_phylogeny). It is usually easier to remember when something was done than most other details about it. You should still have a README file for each subdirectory.
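The suggested layout can be set up with a few commands. The sketch below is one possible way to do it; the helper names `new_project` and `new_subtask` are hypothetical, and the current date is used for the year and month prefixes.

```shell
# Sketch: create the suggested project layout (year_projectname).
# new_project is a hypothetical helper name.
new_project() {
    active_dir="$1"   # e.g. /cluster/projects/nn9305k/active/$USER
    name="$2"         # project name without the year prefix
    project="$active_dir/$(date +%Y)_$name"
    mkdir -p "$project/rawdata" "$project/scripts"
    # Create an empty README only if none exists yet.
    [ -e "$project/README" ] || : > "$project/README"
    echo "$project"
}

# Sketch: add a dated subtask directory (year_month_purpose) with its own README.
# new_subtask is a hypothetical helper name.
new_subtask() {
    project="$1"
    purpose="$2"      # e.g. phylogeny
    subtask="$project/$(date +%Y_%m)_$purpose"
    mkdir -p "$subtask"
    [ -e "$subtask/README" ] || : > "$subtask/README"
    echo "$subtask"
}

# Example calls (the project name is a placeholder):
# project=$(new_project /cluster/projects/nn9305k/active/$USER myproject)
# new_subtask "$project" phylogeny
```

Both functions print the path they created, so the result can be captured in a variable and reused. The README files are created empty and still need to be filled in by hand.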
We also suggest you keep a log file in text format with your analysis. In this file we recommend that you not only write down what you did, but why you did it. The why is usually the most difficult to remember.
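One lightweight way to keep such a log is a small shell function that appends a dated entry with both the what and the why. This is only a sketch: the helper name `log_step` and the default file name analysis_log.txt are hypothetical.

```shell
# Sketch: append a dated entry recording what was done and why.
# log_step and the default file name analysis_log.txt are hypothetical.
log_step() {
    what="$1"
    why="$2"
    logfile="${3:-analysis_log.txt}"
    printf '%s\nWhat: %s\nWhy: %s\n\n' "$(date +%Y-%m-%d)" "$what" "$why" >> "$logfile"
}

# Example call (the entry text is illustrative):
# log_step "Reran the phylogeny with a new outgroup" "The previous outgroup was too distant"
```

Entries are appended, never overwritten, so the file builds up a chronological record of the analysis.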