# How to do a BLAST job on Saga

The example files for this tutorial are stored in our project directory. The path is:

```
/cluster/projects/nn9305k/
```

in the subfolder:

```
/samplefiles/fasta_files/
```

From your home area, go to this directory:

```
cd /cluster/projects/nn9305k/samplefiles/fasta_files/
```

Now we create a directory for our analysis in our working area: `/cluster/work/users/YOUR_USERNAME`. A shortcut for that working area is the variable `$USERWORK`, which we can use instead of the full path.

Let's create a temporary directory in our work directory `$USERWORK`:

```
mkdir $USERWORK/blast_tutorial
```

Now we copy the example file to the new directory:

```
rsync -rauWP 16S_rRNA_sequences_20.fasta $USERWORK/blast_tutorial
```

Let us follow where that file went:

```
cd $USERWORK
```

Check where this folder is located:

```
pwd
```

The location should be `/cluster/work/users/YOUR_USERNAME`.

Let us go to the blast_tutorial directory and check that the rRNA sequences are there:

```
cd blast_tutorial
ls -la
```

To run BLAST on these 20 rRNA sequences we need two items:

1. A BLAST database suited for microbial 16S rRNAs, for instance a 16S rRNA database
2. The BLAST software

We load the software further down, once we have an interactive job. First we need a database to blast our sequences against. The current NCBI-NR database is more than 35 GB, so it takes some time to download. That is not needed on Saga, because the NCBI-NR database is already present there. The location of the databases is:

```
/cluster/shared/databases/blast/latest/
```

Go to that folder to check which databases are present:

```
cd /cluster/shared/databases/blast/latest/
ls
```

This is not helpful when there are so many files, so let's use a bash trick to get only the essential things:

```
ls | cut -f1 -d "." | uniq
```

To understand the command above, you can run just the part of the command before each vertical bar (pipe).

Now we can see a bunch of databases. And yes, there we find a 16S rRNA microbial database, which we can use to blast our 16S sequences.
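
The `ls | cut | uniq` pipeline above can be tried step by step on a scratch directory, which may make it easier to see what each stage does. This is only a sketch: the file names (`dbA`, `dbB`) and the `demo_dir` variable are made up for illustration and are not the real database files on Saga.

```shell
# Scratch directory with made-up file names (hypothetical, for illustration only)
demo_dir=$(mktemp -d)
touch "$demo_dir"/dbA.nhr "$demo_dir"/dbA.nin "$demo_dir"/dbA.nsq \
      "$demo_dir"/dbB.00.nhr "$demo_dir"/dbB.00.nin "$demo_dir"/dbB.01.nhr

# Step 1: the full listing, with every index file shown
ls "$demo_dir"

# Step 2: keep only the text before the first "." of each file name
ls "$demo_dir" | cut -f1 -d "."

# Step 3: collapse adjacent duplicate names into one line per database
db_names=$(ls "$demo_dir" | cut -f1 -d "." | uniq)
echo "$db_names"

# Clean up the scratch directory
rm -r "$demo_dir"
```

Because `ls` returns the names sorted, identical prefixes end up next to each other, which is what `uniq` needs in order to merge them.
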
The location of that database is then:

```
/cluster/shared/databases/blast/latest/16SMicrobial
```

Let's go back to our tutorial folder:

```
cd $USERWORK/blast_tutorial
```

Before we can run our BLAST job we need to ask for computing time. We do that by first starting a screen session. This makes sure our job keeps running, even when our connection to Saga is broken.

```
screen
```

Here you can find more on screen: [How to use screen](https://linuxize.com/post/how-to-use-linux-screen/)

Once the screen is up and running, we can ask for computing time and resources for an interactive job:

```
srun --account=nn9305k --mem=40G --cpus-per-task=10 --time=1:0:0 --pty bash -i
```

For testing small commands, use the [development queue](https://documentation.sigma2.no/jobs/job_scripts/saga_job_scripts.html) by adding `--qos=devel`:

```
srun --account=nn9305k --mem=40G --cpus-per-task=10 --time=1:0:0 --qos=devel --pty bash -i
```

Once a job is allocated we can load the BLAST software, using the [module system on Saga](https://documentation.sigma2.no/software/installed_software.html). Check the modules present:

```
module avail
```

This shows all available software. To show only the available BLAST software, type this:

```
module avail blast
```

Before we load the module, we want to remove any module that could interfere with the software we want to use:

```
module purge
```

And then we load the module of our choice:

```
module load BLAST+/2.8.1-intel-2018b
```

Now we can test if the nucleotide BLAST is working:

```
blastn -h      # short usage summary
blastn -help   # full description of all options
```

Now we make the command to run our 16S rRNA sequences against the 16SMicrobial database:

```
blastn -db /cluster/shared/databases/blast/latest/16SMicrobial -query 16S_rRNA_sequences_20.fasta -outfmt 0 -evalue 0.001 -out blast_16S_results.txt -num_threads 2
```

This should take about 30 seconds.

When you want to create tabular output, use `-outfmt 6`:

```
blastn -db /cluster/shared/databases/blast/latest/16SMicrobial -query 16S_rRNA_sequences_20.fasta -outfmt 6 -evalue 0.001 -out blast_16S_results.txt -num_threads 2
```

When you want to create a custom tabular output file:

```
blastn -db /cluster/shared/databases/blast/latest/16SMicrobial -query 16S_rRNA_sequences_20.fasta -outfmt "6 qseqid saccver pident length mismatch gapopen qstart qend sstart send evalue bitscore" -evalue 0.001 -out blast_16S_results.tab -num_threads 2
```

When you only want to show the 5 best hits:

```
blastn -db /cluster/shared/databases/blast/latest/16SMicrobial -query 16S_rRNA_sequences_20.fasta -outfmt 6 -evalue 0.001 -out blast_16S_results.txt -num_threads 2 -max_target_seqs 5
```

### A final tip

When using a different database, say the NCBI-NR or NT databases, then you need much more memory to run a BLAST job. Ask for a minimum of 10 CPUs for your job. Since loading these databases takes a lot of time, it is better to use a Slurm script to blast many sequences against such large databases.
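
As a rough illustration of the Slurm script mentioned in the tip, the snippet below writes a batch script to a file that could then be submitted with `sbatch`. It is a sketch under assumptions, not a tested Saga recipe: the file name `blast_16S.slurm`, the job name, and the `DB` variable are hypothetical, and the resource values simply mirror the interactive `srun` example. The database path is the 16SMicrobial one from this tutorial, so replace it with the database you actually need and check the Sigma2 documentation for the memory options Saga expects in batch jobs.

```shell
# Write a hypothetical batch script; names and resource values are illustrative assumptions
cat > blast_16S.slurm <<'EOF'
#!/bin/bash
#SBATCH --account=nn9305k
#SBATCH --job-name=blast_16S
#SBATCH --time=1:0:0
#SBATCH --mem=40G
#SBATCH --cpus-per-task=10

# Stop at the first error
set -o errexit

# Start from a clean environment, then load BLAST
module purge
module load BLAST+/2.8.1-intel-2018b

# Run from the tutorial directory
cd "$USERWORK/blast_tutorial"

# Replace DB with the (larger) database you want to search
DB=/cluster/shared/databases/blast/latest/16SMicrobial

# Use as many threads as CPUs were requested above
blastn -db "$DB" \
       -query 16S_rRNA_sequences_20.fasta \
       -outfmt 6 -evalue 0.001 \
       -out blast_16S_results.tab \
       -num_threads "$SLURM_CPUS_PER_TASK"
EOF

# Submit it from a login node (not run here):
# sbatch blast_16S.slurm
```

With a batch script there is no need for `screen`, because the job runs independently of your login session.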