Install and configure Nextflow pipelines on SAGA

Example with BovisAnalyzer, which is a Nextflow pipeline built as an nf-core pipeline (but not directly downloadable from nf-core). See also: the Nextflow documentation, the documentation on running nf-core pipelines, and the nf-core tools documentation (for the tools installed in the nf-core conda environment).

1. Check requirements: which Nextflow version is needed

[ ] Download the appropriate version if we do not already have it, and put it here:

cd /cluster/projects/nn9305k/bin
wget <nextflow_link>
module load Java/17.0.2
# Nextflow's own options take a single dash
nextflow -version

[ ] Rename the executable with the appropriate version number, as done for the other versions
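
The rename step can be sketched as a small helper. This is a minimal sketch, not the procedure actually used on SAGA: the function name `install_versioned` is hypothetical, and it assumes the downloaded launcher is a single file that gets renamed to `nextflow_<version>`, following the naming of the existing executables (e.g. `nextflow_22.10.6`).

```shell
# Hypothetical helper: rename a downloaded Nextflow launcher to
# nextflow_<version> inside a bin directory and make it executable.
# Usage: install_versioned <downloaded_file> <version> <bin_dir>
install_versioned() {
    src="$1"
    version="$2"
    bin_dir="$3"
    target="$bin_dir/nextflow_$version"

    # Refuse to overwrite a version that is already installed.
    if [ -e "$target" ]; then
        echo "already installed: $target" >&2
        return 1
    fi

    # Move, make executable, and print the final path on success.
    mv "$src" "$target" && chmod +x "$target" && echo "$target"
}
```

A call would look like `install_versioned ./nextflow <version> /cluster/projects/nn9305k/bin`, with `<version>` taken from the output of the version check.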

Note: If it is an nf-core pipeline, there is an nf-core conda environment that contains Java and the nf-core tools, which allow downloading the appropriate Nextflow version (and the pipeline).

cd /cluster/projects/nn9305k/bin
conda activate nf-core 

2. Download all other requirements (e.g. databases, dependencies)

[ ] Download the databases we need if they do not already exist

Check here:

/cluster/shared/biobases
/cluster/shared/vetinst
/cluster/projects/nn9305k/db_flatfiles

If the database is not there, choose the most appropriate of these locations.
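
The check can be sketched as a loop over the three locations. This is only an illustration: the helper name `find_db` is hypothetical, and it assumes the database sits directly under one of the directories with a known file or directory name (nested layouts would need `find` instead).

```shell
# Hypothetical helper: print every location that already holds a
# database file or directory with the given name.
# Usage: find_db <name> <dir> [<dir> ...]
find_db() {
    name="$1"
    shift
    status=1                      # 1 = not found anywhere (yet)
    for dir in "$@"; do
        if [ -e "$dir/$name" ]; then
            echo "$dir/$name"
            status=0
        fi
    done
    return "$status"
}

# Example call with the three locations listed above:
# find_db <database_name> /cluster/shared/biobases /cluster/shared/vetinst /cluster/projects/nn9305k/db_flatfiles
```

The function returns a non-zero status when nothing is found, so it can be used directly in an `if` before deciding to download.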

Note: nf-core pipelines do not necessarily need to be installed; they can be run directly from GitHub. But then you will always use the latest version, without being told so, unless you explicitly request a specific version of the pipeline. Take this into account for the reproducibility of your workflow.

We decided to avoid this risk and therefore install the pipeline locally.
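
For cases where a pipeline is nevertheless run directly from GitHub, the version can be pinned with the `-r` (revision) option of `nextflow run`. The sketch below only builds and prints such a command; the helper name `pinned_run_cmd` is hypothetical and the pipeline and revision are placeholders.

```shell
# Hypothetical helper: print (not run) a nextflow command that pins the
# pipeline revision with -r, so the version used is always explicit.
# Usage: pinned_run_cmd <pipeline> <revision>
pinned_run_cmd() {
    pipeline="$1"
    revision="$2"
    if [ -z "$pipeline" ] || [ -z "$revision" ]; then
        echo "usage: pinned_run_cmd <pipeline> <revision>" >&2
        return 1
    fi
    echo "nextflow run $pipeline -r $revision"
}

# Example: pinned_run_cmd nf-core/<pipeline> <release_tag>
```

Refusing to build a command without a revision is a deliberate choice here: it makes the unpinned "latest version" case impossible to produce by accident.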

3. Installing the pipeline

[ ] Where to install?

# For External pipelines
/cluster/projects/nn9305k/src
# For NVI developed pipelines
/cluster/projects/nn9305k/vi_src

[ ] If it is an nf-core pipeline, we can install it with the nf-core tools in the conda environment (see the nf-core tools documentation for instructions)

cd <install_directory>
conda activate nf-core
nf-core list
nf-core download <pipeline>

[ ] If it is NOT an nf-core pipeline, we clone the pipeline repository:

cd <install_directory>
git clone <http_address>

Configuration of the pipeline to work on SAGA

NB: This describes the state-of-the-art way of organising configuration files; depending on the maturity of the pipeline, you might also need to explore all of its configuration files.

Explore how the pipeline has been built

[ ] Go to the new pipeline directory
[ ] Look at the nextflow.config file. It helps to understand the parameters, which versions are required and how the pipeline runs. Look at the defined max_memory, max_cpus and max_time.
- [ ] The max_* parameters are usually adjusted so that the pipeline installation test can run on GitHub (they are usually set low to avoid being billed by GitHub). They might also give an indication of the running requirements.
- [ ] The env settings are important when using containers, including containers that run R. They contain e.g. the definition of profile paths that allow interaction with other profiles.
- [ ] The manifest is important: it allows the pipeline to be run directly from GitHub (automatically fetching the specified version).
[ ] Look at the base configuration: dir/conf/base.config. This is the default configuration that is run (aka local); anything specified in an additional configuration provided with -c will overwrite this base configuration.
[ ] Look at dir/conf/modules.config. This is where the output of the processes is configured. It also allows the same module to be used for different things (e.g. by providing different arguments).
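
To find the resource caps quickly, the config can be searched instead of read top to bottom. A minimal sketch, assuming the caps are written as `max_memory = ...`, `max_cpus = ...` and `max_time = ...` assignments, one per line; the helper name `show_max_params` is hypothetical.

```shell
# Hypothetical helper: list the max_* resource caps defined in a config file.
# Usage: show_max_params <path/to/nextflow.config>
show_max_params() {
    # Match lines such as "    max_memory = '128.GB'"
    grep -E '^[[:space:]]*max_(memory|cpus|time)[[:space:]]*=' "$1"
}

# Example: show_max_params <pipeline_dir>/nextflow.config
```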

Adapt the base configuration file to run with SAGA

[ ] Copy the base configuration to the folder /cluster/projects/nn9305k/nextflow/configs
[ ] Rename this base config (see other examples in the directory), so that we do not have to reconfigure everything after each update of the pipeline.

cp /cluster/projects/nn9305k/src/bovisanalyzer/conf/base.config  /cluster/projects/nn9305k/nextflow/configs/saga_bovisanalyser.config
cd /cluster/projects/nn9305k/nextflow/configs

[ ] Scan the new base configuration for the check_max function. There are currently some issues with it on SAGA (that we do not know how to fix yet), so we usually do not use it. You can e.g. ask Håkon whether this has been fixed. NB: check_max for time, memory and cpus is meant to keep the resources requested by the SLURM processes within the defined maxima (defensive programming towards resource usage).
[ ] Scan for other defined parameters to understand how the pipeline is built:
- e.g. BovisAnalyzer has a test run profile, with test parameters, that is already installed with the pipeline, and a test profile for development

Note: there are 2 ways to supply profiles in nf-core type pipelines.

Either some parameters/arguments are defined for the process run (how to run the same module in different ways): see `modules.config`.

Look e.g. at the BCFTOOLS filter module: it specifies what the module does when the pipeline uses conda and when it uses containers.

[ ] Note the database paths and the version of the pipeline
[ ] Modify the copied base.config to run on SAGA and define SLURM as executor
- [ ] check the requirements for cpus, memory and time (note that in nf-core type pipelines, those labels are standardized)
- [ ] remove the check_max function statements
- [ ] add slurm as executor
- [ ] add the clusterOptions: '--job-name=Saga_nxf --account=nn9305k'
- [ ] specify the queueSize: 24. The queue size defines how many jobs are in the queue at the same time (when one job leaves the queue, a new one is submitted, so there are always e.g. 24 jobs requested)

# Edit the base config (nano, vim)
nano saga_bovisanalyser.config
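
The edits in the checklist can also be scripted. The sketch below is an assumption-laden illustration, not the configuration actually used on SAGA: the helper name `adapt_for_saga` is hypothetical, it assumes check_max calls have the form `check_max( <value>, '<type>' )` on one line, and it appends the SLURM settings as new `process` and `executor` blocks at the end of the file (placing them inside the existing blocks by hand works as well).

```shell
# Hypothetical helper: adapt a copied base.config for SAGA.
#  1) unwrap check_max( <value>, '<type>' ) to just <value>
#  2) append the SLURM executor settings from the checklist
# Usage: adapt_for_saga <config_file>
adapt_for_saga() {
    config="$1"

    # Keep a backup, then strip the check_max wrappers.
    cp "$config" "$config.bak"
    sed "s/check_max( *\(.*[^ ]\) *, *'[a-z]*' *)/\1/" "$config.bak" > "$config"

    # Append executor, cluster options and queue size.
    cat >> "$config" <<'EOF'

process {
    executor       = 'slurm'
    clusterOptions = '--job-name=Saga_nxf --account=nn9305k'
}

executor {
    queueSize = 24
}
EOF
}

# Example: adapt_for_saga saga_bovisanalyser.config
```

The backup file (`<config_file>.bak`) makes it easy to diff the result against the original and confirm that only the check_max wrappers and the appended blocks changed.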

Note: if we want to run a pipeline locally on SAGA (without SLURM as executor), we only need the default base.config. But then we have to load Java and run the pipeline inside an srun session.

Note: should you have high memory requirements, look at saga_ampliseq.config: it has examples of how to ask for more than 150 GB of memory and how to specify that specific processes should run on the bigmem nodes (to do that we need to overwrite the normal cluster options).
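
As an illustration of overwriting the cluster options for selected processes, the sketch below appends a label-specific override to a config. It is not copied from saga_ampliseq.config: the helper name `append_bigmem_override` is hypothetical, the label is whatever the pipeline uses for its memory-heavy processes, and `--partition=bigmem` is assumed to be the way to request the bigmem nodes.

```shell
# Hypothetical helper: append a per-label override so that processes
# with the given label are submitted with bigmem cluster options.
# Usage: append_bigmem_override <config_file> <process_label>
append_bigmem_override() {
    config="$1"
    label="$2"
    cat >> "$config" <<EOF

process {
    withLabel: '$label' {
        // Assumption: bigmem nodes are requested via the bigmem partition
        clusterOptions = '--job-name=Saga_nxf --account=nn9305k --partition=bigmem'
    }
}
EOF
}

# Example: append_bigmem_override saga_bovisanalyser.config <label>
```

Because the override repeats the job name and account, it fully replaces the normal cluster options for the matching processes rather than adding to them.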

Test run the pipeline

[ ] We use Singularity containers on SAGA. Check that the default path for Singularity images is defined in your ~/.bashrc
- [ ] If not, add this line to your .bashrc: export SINGULARITY_CACHEDIR="$USERWORK/images". NB: there are additional Nextflow environment variables you can set in .bashrc, e.g. I added export NXF_WORK="$USERWORK/NF_WORK" and export NXF_HOME="$USERWORK/NF_HOME" so that everything goes to $USERWORK. Remember to source .bashrc
[ ] Look at the test config (it provides an example of how memory is set; this can be configured for/used on GitHub Actions. It is possible that some modules won't run in the test, e.g. because they would require too much memory to run on GitHub Actions without being billed)
- [ ] First do a short local test run, just to see whether the pipeline can run (this could give some cpu/memory problems, e.g. 1 cpu used instead of the 2 cpus required; this can be adapted in the test profile requirements)
- [ ] Then do a test run using the config adapted to run on SAGA with SLURM as executor
  - [ ] check the queue
  - [ ] check that the pipeline is running
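
The .bashrc step can be made repeatable with a helper that only adds a line when it is missing. A minimal sketch; the function name `add_line_once` is hypothetical, and the export lines are the ones given in the checklist.

```shell
# Hypothetical helper: append a line to a file only if the exact same
# line is not already present.
# Usage: add_line_once <file> <line>
add_line_once() {
    file="$1"
    line="$2"
    touch "$file"
    # -x: whole-line match, -F: fixed string (no regex interpretation)
    grep -qxF -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}

# Example calls (single quotes keep $USERWORK unexpanded in the file):
# add_line_once ~/.bashrc 'export SINGULARITY_CACHEDIR="$USERWORK/images"'
# add_line_once ~/.bashrc 'export NXF_WORK="$USERWORK/NF_WORK"'
# add_line_once ~/.bashrc 'export NXF_HOME="$USERWORK/NF_HOME"'
# . ~/.bashrc
```

For the "check the queue" item, `squeue -u "$USER"` lists your own jobs once the SLURM test run has been started.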

# A test run command should look something like this
# <path/required_nf_version> run <pipeline/main.nf>  -profile test,singularity -c path/config/saga/my_base.config --out_dir <mytestdir>

# Example with BovisAnalyzer
NF_EXECUTABLE="/cluster/projects/nn9305k/bin/nextflow_22.10.6"
PIPELINE_MAIN="/cluster/projects/nn9305k/src/bovisanalyzer/main.nf"
BASE_CONFIG="/cluster/projects/nn9305k/nextflow/configs/saga_bovisanalyser.config"
OUTDIR="$USERWORK/test"

# Test Run - Local 
srun --account=nn9305k --mem-per-cpu=4G --qos=devel --time=0:30:00 --pty bash -i 
module load Java/17.0.2
$NF_EXECUTABLE run $PIPELINE_MAIN -profile test,singularity --out_dir $OUTDIR

# Test Run SLURM executor
module load Java/17.0.2
## I had the release as it seems otherwise this fails
$NF_EXECUTABLE run $PIPELINE_MAIN -profile test,singularity -c $BASE_CONFIG --out_dir $OUTDIR

Note: Nextflow downloads the images locally before submitting; only the actual jobs are sent to SLURM. There is therefore no need to pre-download the images because of the lack of internet access on the compute nodes.

NB: srun sessions still have internet access, but jobs run through SLURM do not.