Nanopore sequence data tutorial

This is a tutorial on quality control of Nanopore sequence data. The data we use at our institute is mainly MinION sequence data generated for the assembly of bacterial genomes. This tutorial uses Guppy version 5 for basecalling, demultiplexing and quality score filtering. The tutorial is run on a high-performance computing (HPC) system that uses SLURM for job allocation. If you have your own server, just skip the srun step, but make sure to have Guppy, NanoPlot and NanoFilt installed.

This tutorial is a work in progress and will be updated in the coming period, with assembly steps as well.

To follow this tutorial you need to have:

  • Guppy version 5, installed so that it can use the GPU nodes in your environment. I would not advise using CPUs only, since that takes forever.

    • I have it installed as a conda environment. (If you want to know how to make that, please contact me: thomas.haverkamp at vetinst.no)

  • A recent dataset from a Nanopore sequencer, in files with the fast5 extension.

  • Information on the flowcell and sequencing kits used.

The Nanopore sequence data for this tutorial can be found in the following SAGA folder, or you can use your own data:

 /cluster/projects/nn9305k/tutorial/20211025_nanopore_analysis

This consists of 2 fast5 files, but you can find the complete dataset (12 datasets) in this archive on SAGA:

 /cluster/projects/nn9305k/datasets/wgs/nanopore_test_data/np_test_data.tar.gz

The tutorial dataset consists of 8000 sequences.

Setting up the analysis

For this tutorial we are going to create a directory in our user work area, $USERWORK, which will contain a directory with the fast5 files.

cd $USERWORK

mkdir nanopore_tutorial

cd nanopore_tutorial

rsync -rauWP /cluster/projects/nn9305k/tutorial/20211025_nanopore_analysis/*.fast5 ./fast5/

Now we have a temporary tutorial folder that contains the fast5 files.
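A quick check that the copy worked can save time before asking for a GPU node. The sketch below counts the fast5 files in a folder; the function name count_fast5 is my own (hypothetical), and it assumes the files sit directly inside the folder, as in this tutorial.

```shell
# Hypothetical helper: count the .fast5 files directly inside a directory.
# Defaults to the "fast5" folder created above.
count_fast5() {
    local dir="${1:-fast5}"
    # tr removes the padding that some versions of wc add around the number.
    find "$dir" -maxdepth 1 -type f -name '*.fast5' | wc -l | tr -d ' '
}

# Only report when the folder exists, so the snippet is safe to paste anywhere.
if [ -d fast5 ]; then
    echo "fast5 files found: $(count_fast5 fast5)"
fi
```

For the tutorial dataset this should report 2 files.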

Now we need the flowcell and kit information, which Guppy uses to select the correct basecalling model for the data we have here.

Type                 ID
Flowcell ID          FLO-MIN106
Sequencing kit ID    SQK-RBK004
Barcoding kit ID     RBK004

Running Guppy basecalling

Let’s try and see if we can run Guppy without blocking each other’s access to the GPU nodes.

First let’s start a screen session. We call this screen NANOPORE, so it is easy to find again if our connection drops and we have to log on to the system again. A name also helps when you open new screens all the time.

screen -S NANOPORE

Now we start an interactive job on a SAGA node that has GPUs.

srun --account=nn9305k --mem=64G --partition=accel --gres=gpu:1 --cpus-per-task=8 --time=2:0:0 --pty bash -i

Then we load the software by activating the conda environment:

conda activate guppy_gpu_v5

To see which flowcells and kits are supported by this version of Guppy we can do:

guppy_basecaller --print_workflows
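The list printed by --print_workflows is long, so it helps to keep only the lines for our own flowcell and kit. A minimal sketch, assuming each supported combination is printed on one line that contains both names; match_workflow is a hypothetical helper name, not part of Guppy.

```shell
# Hypothetical helper: read lines from stdin and keep only those that
# mention both the flowcell and the kit (plain string matching).
match_workflow() {
    local flowcell="$1" kit="$2"
    grep -F -- "$flowcell" | grep -F -- "$kit"
}

# Usage (requires Guppy on the PATH, so it is left as a comment here):
#   guppy_basecaller --print_workflows | match_workflow FLO-MIN106 SQK-RBK004
```

If nothing is printed, the combination is not supported by the installed Guppy version, or the names are spelled differently from what you typed.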

Then we run Guppy on the fast5 files, which takes about 2 minutes.

guppy_basecaller --flowcell FLO-MIN106 --kit SQK-RBK004 \
    -x "cuda:all" \
    --gpu_runners_per_device 16 \
    --num_callers 16 \
    --records_per_fastq 0 \
    --compress_fastq \
    -i fast5 \
    -s fastq

The same command as a single line:

guppy_basecaller --flowcell FLO-MIN106 --kit SQK-RBK004 -x "cuda:all" --gpu_runners_per_device 16 --num_callers 16 --records_per_fastq 0 --compress_fastq -i fast5 -s fastq

Note that the minimum quality score cut-off in this run is 9.
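To get a feeling for what such a cut-off means: a Phred quality score Q corresponds to an error probability of 10^(-Q/10). The small sketch below does that conversion with awk; phred_to_error is a hypothetical helper name, and it says nothing about how Guppy averages the scores over a read.

```shell
# Hypothetical helper: convert a Phred quality score Q into the
# corresponding error probability P = 10^(-Q/10).
phred_to_error() {
    awk -v q="$1" 'BEGIN { printf "%.3f\n", 10 ^ (-q / 10) }'
}

phred_to_error 9     # prints 0.126, about one error in eight bases
phred_to_error 10    # prints 0.100, one error in ten bases
```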

This produces the following folders and files:

folders:
fail
pass

files:
sequencing_summary.txt  
sequencing_telemetry.js
guppy_basecaller_log-2021-10-25_10-34-10.log
guppy_basecaller_log-2021-10-25_10-31-13.log
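Before making any plots you can already get the number of passed and failed reads from sequencing_summary.txt. This is a sketch that assumes the file is tab-separated and has a column called passes_filtering with the values TRUE and FALSE; check the header of your own file first. The function name summarise_pass_fail is hypothetical.

```shell
# Hypothetical helper: count passed and failed reads in a Guppy summary file.
# Assumes a tab-separated file with a "passes_filtering" column (TRUE/FALSE).
summarise_pass_fail() {
    local summary="${1:-sequencing_summary.txt}"
    awk -F '\t' '
        NR == 1 {
            # Find the column by name, so the column order does not matter.
            for (i = 1; i <= NF; i++) if ($i == "passes_filtering") col = i
            if (!col) {
                print "passes_filtering column not found" > "/dev/stderr"
                exit 1
            }
            next
        }
        { count[$col]++ }
        END {
            if (col) printf "pass\t%d\nfail\t%d\n", count["TRUE"], count["FALSE"]
        }
    ' "$summary"
}

# Usage from the tutorial folder:
#   summarise_pass_fail fastq/sequencing_summary.txt
```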

We can also run it with demultiplexing and quality filtering.

Some options from Guppy:

--disable_qscore_filtering : Disable filtering of reads into PASS/FAIL folders based on min qscore.
--min_qscore : Minimum acceptable qscore for a read to be filtered into the PASS folder.
--barcode_kits : Space-separated list of barcoding kit(s) or expansion kit(s) to detect against. Must be in double quotes.
--trim_barcodes : Trim the barcodes from the output sequences in the FastQ files.
--num_barcode_threads : Number of worker threads to use for barcoding.
--disable_pings : Stop sending pings (telemetry) to the Nanopore server.

So let us add the demultiplexing and reduce the minimum quality score:

guppy_basecaller --flowcell FLO-MIN106 --kit SQK-RBK004 \
    -x "cuda:all" \
    --gpu_runners_per_device 16 \
    --num_callers 16 \
    --records_per_fastq 0 \
    --compress_fastq \
    --min_qscore 8 \
    --barcode_kits "SQK-RBK004" \
    --trim_barcodes \
    -i fast5 \
    -s fastq_demultiplexed
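After demultiplexing it is useful to see how many reads ended up in each barcode. The sketch below assumes that Guppy wrote one subfolder per barcode (plus one for unclassified reads) inside the pass folder, and that every fastq record takes exactly four lines. reads_per_barcode is a hypothetical helper name.

```shell
# Hypothetical helper: print "<subfolder> <TAB> <number of reads>" for every
# subfolder of the given directory.
# Assumes gzipped fastq files with four lines per read.
reads_per_barcode() {
    local pass_dir="$1" sub n
    for sub in "$pass_dir"/*/; do
        [ -d "$sub" ] || continue    # skip when the glob matched nothing
        n=$(find "$sub" -maxdepth 1 -type f -name '*.fastq.gz' \
                -exec gzip -cd {} + | awk 'END { print NR / 4 }')
        printf '%s\t%s\n' "$(basename "$sub")" "$n"
    done
}

# Usage from the tutorial folder:
#   reads_per_barcode fastq_demultiplexed/pass
```

A large unclassified count compared to the barcodes can be a hint that the wrong barcoding kit was given.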

Now we have produced enough datasets with the GPU node, so we can stop the srun job on SAGA.

conda deactivate
exit

Getting some statistics

Now we start the nanoplot conda environment to generate a report of our nanopore data. NanoPlot can be found at this website: https://github.com/wdecoster/NanoPlot

conda activate nanoplot

cd fastq

Let’s run NanoPlot on the file sequencing_summary.txt

NanoPlot --summary sequencing_summary.txt --loglength -o summary-plots-log-transformed

That generates a folder called: “summary-plots-log-transformed”

Let’s take a look at the files. If you are on a cluster, you should download the folder to your own personal computer.

You can also use NanoPlot on the fastq files that were produced. Let’s do that now for the failed and passed reads.

NanoPlot -t 2 --fastq fail/*.fastq.gz --loglength -o failed_summary_plots

NanoPlot -t 2 --fastq pass/*.fastq.gz --loglength -o passed_summary_plots

We can clearly see that the first Guppy analysis separated the low-quality reads (fail) from the rest (pass).

Filtering reads

Let us do some more filtering on the passed reads with a tool called NanoFilt (https://github.com/wdecoster/nanofilt). NanoFilt reads uncompressed fastq data, so we unzip, filter and compress again in a single command. We are still in the fastq folder, so the passed reads are in pass/.

zcat pass/*.fastq.gz | NanoFilt -q 10 -l 1000 --headcrop 50 | gzip > High_quality_reads.fastq.gz

I added the headcrop because nanopore reads can be of low quality in the first part of the read. Removing it could improve your assemblies later on.
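A quick way to see what the filtering did is to look at the number of reads and the read lengths of a fastq file, before and after. This is only a sketch with a hypothetical helper name (fastq_length_stats); it assumes gzipped fastq files with four lines per read and does not replace the NanoPlot report.

```shell
# Hypothetical helper: print read count and min/max/mean read length for
# one or more gzipped fastq files (four lines per read assumed).
fastq_length_stats() {
    gzip -cd -- "$@" | awk '
        NR % 4 == 2 {                  # the sequence line of each record
            len = length($0)
            total += len
            n++
            if (n == 1 || len < min) min = len
            if (len > max) max = len
        }
        END {
            if (n == 0) { print "reads=0"; exit }
            printf "reads=%d min=%d max=%d mean=%.1f\n", n, min, max, total / n
        }
    '
}

# Usage from the fastq folder:
#   fastq_length_stats pass/*.fastq.gz
#   fastq_length_stats High_quality_reads.fastq.gz
```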

Checking the quality of the filtered reads with NanoPlot:

conda activate nanoplot
NanoPlot -t 2 --fastq High_quality_reads.fastq.gz --loglength -o ../High_quality_reads.summary.txt

The end for now

This tutorial will be updated with more options on how to do nanopore sequence quality control.