How to do a Blast job on Saga

The example files for this tutorial are stored in our project directory, the PATH is:

/cluster/projects/nn9305k/

in the subfolders:

/samplefiles/fasta_files/

from your home area we go to this directory.

cd /cluster/projects/nn9305k/samplefiles/fasta_files/

Now we create a directory to do our analysis, on our working area: /cluster/work/users/YOUR_USERNAME A shortcut for that working area is a variable called: $USERWORK. We can use that instead of the PATH to that directory.

let’s create a temporary directory on our work directory $USERWORK

mkdir $USERWORK/blast_tutorial

Now we copy the example file to the new directory.

rsync -rauWP 16S_rRNA_sequences_20.fasta $USERWORK/blast_tutorial

Let us follow where that file went.

cd $USERWORK

This folder is located at:

pwd

The location should be this: /cluster/work/users/YOUR_USERNAME.

let us go to the blast_tutorial, and check if the protein sequences are there.

cd blast_tutorial

ls -la

To do Blast of these 20 rRNA sequences we need two items

A blast database suited for microbial 16S rRNAs. For instance a 16S rRNA database
The Blast software

Now the software is loaded. and we now need a database to blast our sequence against it. The current ncbi-NR database is more than 35 Gb, so that takes sometime to download. But that is not needed on Saga.

On saga the NCBI-NR database is present. The location of the database is:

/cluster/shared/databases/blast/latest/

Go to that folder to check the databases that are present.

ls

This is not helpful when there are so many files. So let’s use a bash trick to sort get only the essential things.

ls |cut -f1 -d "." | uniq 

To understand the command above you can run part of the command before a vertical slash.

Now we can see a bunch of databases. And yes there we find a 16S rRNA microbial database, which we can use to blast our 16S sequences.

the location of that database is then:

/cluster/shared/databases/blast/latest/16SMicrobial

let’s go back to our tutorial folder

cd $USERWORK/blast_tutorial

Before we can run our blast job we need to ask for computing time. We do that by first starting a screen. This to make sure our job is running, even when our connection is broken to Saga

screen

Here you can find more on screen: How to use screen

Once the screen is up and running, we can ask for computing time and resources for an interactive job. For testing small commands use the development queue

srun --account=nn9305k --mem=40G --cpus-per-task=10 --time=1:0:0 --pty bash -i

or when using the develop queue

srun --account=nn9305k --mem=40G --cpus-per-task=10 --time=1:0:0 --qos=devel --pty bash -i

Once a job is allocated we can load the blast software. The blast software can be loaded using the module system on saga.

Check the modules present

module avail

This shows all available software. To show only the available blast software type this:

module avail blast

Before we load the module, we want to remove any module that could interfere with the software we want to use:

module purge

And then we load the module of our choice:

module load BLAST+/2.8.1-intel-2018b

Now we can test if the nucleotide blast is working:

blastn -h

or

blastn -help

now we make the command to run our 16S rRNA sequences against this 16SMicrobial database.

blastn -db /cluster/shared/databases/blast/latest/16SMicrobial -query 16S_rRNA_sequences_20.fasta -outfmt 0 -evalue 0.001 -out blast_16S_results.txt -num_threads 2

This should take about 30 seconds.

When you want to create a tabular format output style, use: -outfmt 6.

blastn -db /cluster/shared/databases/blast/latest/16SMicrobial -query 16S_rRNA_sequences_20.fasta -outfmt 6 -evalue 0.001 -out blast_16S_results.txt -num_threads 2

When you want to create a custom tabular output file:

blastn -db /cluster/shared/databases/blast/latest/16SMicrobial -query 16S_rRNA_sequences_20.fasta -outfmt "6 qseqid saccver pident length mismatch gapopen qstart qend sstart send
   evalue bitscore" -evalue 0.001 -out blast_16S_results.tab -num_threads 2

When you only want to show the 5 best hits

blastn -db /cluster/shared/databases/blast/latest/16SMicrobial -query 16S_rRNA_sequences_20.fasta -outfmt 6 -evalue 0.001 -out blast_16S_results.txt -num_threads 2 -max_target_seqs 5

A final tip

When using a different database, say the NCBI-NR or NT databases, than you need much more memory to run a blast job. Ask for a minimum of 10 cpus for your job. Since the loading of the databases costs a lot of time, it is better to use a slurm script to blast many sequences against such large databases.