Genome assembly
main point: size of k, and paired end data. Insertion sizes.
In this tutorial we aim to teach you how to do assemblies on saga. Our basis for this assembly tutorial is that the user has paired short read data, and that assembly is happening with a de Bruijn graph assembler like velvet or SPAdes. The things we want you to know after this tutorial is the following:
the effects of k-mer size on assembly
the effect of having paired data
what N50 means
The presumtion for this exercise is that you are in posession of a paired end short read dataset. In the following course material these files will be referred to as seq_R1.fq.gz, and seq_R2.fq.gz.
You will also find a cheat list for commands at the bottom of this page.
Experiment with k-mer size
First, log in to saga, and find a suitable directory to work on. We will first
experiment a bit with the program velvet, which was one of the very first
short read assembly programs that became available. As described elsewhere in
this documentation, please activate screen and ask for compute resources.
The first thing we need to do is to activate this program.
conda activate velvet
Building the Velvet Index File
Velvet requires an index file to be built before the assembly takes place. We must choose a k- mer value for building the index. Longer k- mers result in a more stringent assembly, at the expense of coverage. There is no definitive value of k for any given project. However, there are several absolute rules:
k must be less than the read length
it should be an odd number
First we are going to run Velvet in single-end mode, ignoring the pairing information. Later on we will incorporate this information.
Find a value of k (between 21 and 113) to start with. Next, run velveth as
shown below to create the index.
velveth options:
ASM_NAME: select your own name for the folder where the results should govalue_of_k: use k-mers of this size-short: short reads (as opposed to long, Sanger-like reads)-separate: read1 and read2 are in separate files-fastq: read type is fastq
Build the index as follows:
velveth ASM_NAME VALUE_OF_K \
-short -separate -fastq \
seq_R1.fq.gz \
seq_R2.fq.gz
**NOTES**
* Change `ASM_NAME` to a directory name of your choosing
* Change `VALUE_OF_K` to the value you have picked
* The command is split over several lines by adding a space, and a `\`
(backslash) to each line. This trick makes long commands more readable. If you
want, you can write the whole command on one line instead.
After `velveth` is finished, use `ls` to look in the new folder that has the
name you chose. You should see the following files:
Log Roadmaps Sequences
The '`Log`' file has a useful reminder of what commands you typed to get this
assembly result, for reproducing results later on. '`Sequences`' contains the
sequences we put in, and '`Roadmaps`' contains the index you just created.
Now we will run the assembly with default parameters:
velvetg ASM_NAME
Velvet will end with a text like this:
`Final graph has ... nodes and n50 of ..., max ..., total ..., using .../... reads`
The number of nodes represents the number of nodes in the graph. Nodes in the
graph get converted to contigs, however, nodes with a length below that of the
k-mer size do not become contigs. Velvet reports its N50 (as well as
everything else) in 'kmer' space. The conversion to 'basespace' is as simple
as adding k-1 to the reported length.
Look again at the folder `ASM_NAME`, you should see the following extra files:
`contigs.fa`
`Graph`
`LastGraph`
`PreGraph`
`stats.txt`
The important files are:
`contigs.fa` - the assembly itself
`Graph` - a textual representation of the contig graph
`stats.txt` - a file containing statistics on each contig
When it comes to downloading and running notebooks:
1. On github, click on the filename so that the notebook is displayed.
2. Click on `raw` in the upper corner
3. Copy the URL in the address field
4. Go to a terminal on your local computer, and type in `wget URL`
5. To run the notebook: `jupyter notebook filename.ipynb`
### Homework
There are six genomes in the directory
`/work/projects/nn9305k/bioinf_course/compgenomics/rawdata/homework`
You will work with a partner. Each team will pick three k-mer sizes and use
velvet to assemble the six genomes. Each person in the team is responsible
for three genomes each.
[Record your results here](https://docs.google.com/spreadsheets/d/124Eb6IQ44coSKMH0kRLU18AJ5FZ7-ijwsxqf1NsC9Ys/edit?usp=sharing)
## 2018-02-17
### Today's practical
Today's practical consists of different sections:
* first, we will assemble the test genome with SPAdes, using two different
options
* second, we will run the program QUAST to evaluate our assemblies
* third, we will spend some time understanding the QUAST output
We are again working on abel, in the `2018_compgenome` directory we created
last time. We want to be working inside of a `qlogin`, since some of the
things we will do might take a bit of time.
Note: to use SPAdes and QUAST, we have to use `module load`. There
are some extra files we need for QUAST, these are:
* A genome sequence:
`/work/projects/nn9305k/genome_references/genomes/ecoli/GCF_000005845.2_ASM584v2_genomic.fna `
* A genome annotation file, in `GFF` format:
`/work/projects/nn9305k/genome_references/genomes/ecoli/GCF_000005845.2_ASM584v2_genomic.gff`
We will examine the QUAST results using a browser. To do that, we need to
transfer the QUAST result directory to our local computer. We can do that using
this command line on our local vm:
`scp -r username@abel.uio.no:/full/path/to/directory .`
1. In your webbrowser, go to the course pages listed above. Find the SPAdes module.
* Create a new directory _next_ to the velvet directory, called spades, and
go into it
* Figure out what the SPAdes command line looks like.
* Create two assemblies from the inclass data, one using the `--careful`
option, and one that does not use this option. Name their output
directories `spades_wo` and `spades_careful`
* Google for the SPAdes manual and try to figure out two things:
* Which k-mers are used
* What does --careful do?
* Go into the results directory, and and look at the results
* Run the assemblathon script on the two assemblies, and do a quick comparison between the
* velvet assembly
* the SPAdes assembly
* the SPAdes assembly with `--careful`
2. In your web browser, go to the course pages listed above. Find the module
where we are comparing to a reference.
* Have a look at the gff file that contains the genome annotation using
`less`. What does it contain? What's the format like?
* Using the webpage from the course, figure out what the command line for
QUAST should look like. In your comparison, include the velvet assembly,
the SPAdes assembly, and the SPAdes careful assembly. IMPORTANT! You
should include the `--scaffold` option.
* Run the QUAST comparison.
3. While QUAST is running, have a look at the
[QUAST paper](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3624806/) and
the [QUAST manual](http://quast.bioinf.spbau.ru/manual.html).
Figure out the following things:
* What basic statistics does QUAST produce?
* What is a missassembly according to QUAST?
* What is the Duplication ratio, number of mismatches per 100k, and number of genes?
* What are the Nx, NAx, NGx and so on values?
4. Transfer the resulting QUAST output directory to your local computer, and
open the `report.html` file using firefox.
* Have a look at the basic statistics
* Which assembly has the highest N50?
* Which assembly has the most found genes?
* Which genome has the fewest misassemblies?
* Open the Icarus browser
* Select the first 300 000 bases of the reference genome
by typing in the numbers `1` and `300000` in the boxes above
* Click on the first contig in the velvet assembly. Have a
look at the box on the right. It seems like this contig is
split up, but it kind of isn't. What's going on here?
Use the arrows that is shown in the section below to understand
what is going on.
* The velvet assembly is first, and it has a red contig in the
unbroken assembly. Figure out what the difference between
the two assemblies here, by examining the boxes on the right.
* The two SPAdes assemblies look very similar. Can you find some
regions where there is a difference between the non-careful and
the careful assembly?
### Homework
Remember to use `qlogin` when doing the exercises.
1. Redo the velvet assemblies using the command in the cheat list. We did get
to the place where I should have introduced the `shortPaired` option last
week, which is why you didn't use it for your assemblies. This option helps
a lot, which is why we should use it.
2. Do SPAdes assemblies for all the genomes that we gave you. Remember to use
`--careful`.
3. Use QUAST and compare each of your SPAdes assemblies to the velvet assembly
of the same genome. Use the same reference genome and annotation as
in the exercise. Which assembly is `best` and why?
## 2018-02-26
Today, we will summarize what we've done so far, and see what we have.
We will also start in on the annotation part. Two people will, for
next time, figure out what PROKKA and ROARY does, when it comes to
genome annotation and genome comparisons.
### Questions from the last two weeks
What are your questions? You will sit together in small groups and
discuss for a few minutes.
### Assign tasks for figuring out stuff
We need volunteers for creating some student presentations
for the material we will be working through next.
### Basic statistics for SPAdes and velvet
Please fill in this [Google docs](https://docs.google.com/spreadsheets/d/1yWWPxKhfSc3JbwSSaJoxsqx5WGlS4bUVsmqqnMgevas/edit?usp=sharing)
spreadsheet, to get the basic statistics for the assemblies for the six
genomes. For this, use the assemblathon script.
Next: compare the results from running the assemblathon scripts for
both SPAdes and velvet for each genome. Can you see some systematic
differences? The easiest way to do this is to have the results from
each script
### Comparisons between SPAdes and velvet
We will form three groups. Each group will take two of the assigned
genomes.
First, compare the output from the assemblathon script for
the SPAdes and velvet for each genome. Can you see some systematic
differences? The easiest way to do this is to have the results
for each genome present in a separate terminal and compare line
by line.
Next, we will look at the QUAST comparison result for the assemblies.
Some questions to ask/answer:
* Which of the assemblies have the higher N50?
* Which assembly has the highest number of contigs?
* Which assembly is the longest?
* Which one of them have the most unaligned seqence/contigs?
* Which of the assemblies have the most misassemblies?
* Which of the assemblies have the most genes found?
Group 1: 14042624 and DTU2014
Group 2: F159 and F168
Group 3: F27 and S19
### Have a look at contig lengths using the jupyter notebook
We will use a jupyter notebook to compare the lengths of the contigs of
our assemblies. To do this, we need a notebook, and two assemblies, one
assembled with velvet and one with spades.
[Download this jupyter notebook](https://github.com/karinlag/Lytir/blob/master/lengths.ipynb).
Remember, to do this, you need to first click "Raw" in the upper right corner, then
copy the URL. Next, you go to a terminal and type in `wget copied_url`. Using this
notebook, we can compare the lengths of tje scaffolds in the two assemblies.
Next, we need to download the assemblies we want to compare. Use the `scp` command
mentioned above, or mentioned in the [HPC module](working_with_hpc.md).
Open the notebook by typing in `jupyter notebook lengths.ipynb`. Inside the notebook,
you need to replace the file names in the notebook with your own file names.
### Homework
1. Run `fastqc` on your genomes. Figure out what the lengths of your reads are.
Based on that, google and find the SPAdes manual and figure out what options
to spades would be appropriate regarding k-mer sizes.
NOTE: once you've figured out your command line, send a summary of what you
found to the biopinf-comp email list.
2. Re-run SPAdes. Use the options that you have found before. Also, use the
coverage cutoff option and set it to auto.
3. You will compare these SPAdes assemblies to velvet using QUAST. We have put some
ready made velvet-assemblies for you in this directory:
`/work/projects/nn9305k/bioinf_course/compgenomics/assemblies/velvet`.
Use some of the questions stated above as a guide to figure out what to look at.
4. Compare the contig lengths using the jupyter notebook as described above.
## Command line cheat list
Note: anything in CAPITAL LETTERS should be replaced with something, usually
a file name, an output directory name, a fasta file name or something similar.
You can figure out what the various options do by googling for the manual,
by using the INFBIO course pages linked to above, or (quite often) typing
in the name of the program, without any options, and pressing enter.
Note, in the command below, long commands will be broken up with a `\`. If you
write in the commands in one long line, you do not include this slash.
### qlogin
XX should be replaced with how many hours you want, CPUS with the number
of cpus you want to use. Remember, if you ask for more than one cpu, you
should actually use those CPUs when running commands. I.e. for SPAdes
for instance, you should specify the `-t` option. You are also likely to want
to use `screen` before requesting a `qlogin` session.
qlogin –account=nn9305k –time=XX:00:00 –ntasks=CPUS –mem-per-cpu=4G
### Velvet
velveth ASM_NAME VALUE_OF_K \ -shortPaired -fastq -separate \ PATH/TO/READ_1.FASTQ \ PATH/TO/READ_2.FASTQ \
velvetg ASM_NAME -exp_cov auto -cov_cutoff auto
### SPADdes
spades.py -t CPUS –careful –pe1-1 PATH/TO/READ_1.FASTQ -pe1-2 PATH/TO/READ_2.FASTQ \ -o OUTPUTDIRECTORY > LOGFILENAME.LOG 2>&1
#### QUAST
Note, you can compare as many assemblies as you want, from 1 to hundreds.
You need to add them sequentially to the command line, and remember to
name them prperly inside the `-l` option.
quast.py -t CPUS -o OUTPUTDIRECTORY \ -R /work/projects/nn9305k/genome_references/genomes/ecoli/GCF_000005845.2_ASM584v2_genomic.fna \ -G /work/projects/nn9305k/genome_references/genomes/ecoli/GCF_000005845.2_ASM584v2_genomic.gff \ –scaffolds PATH/TO/ASSEMBLY_1.FSA PATH/TO/ASSEMBLY_2.FSA PATH/TO/ASSEMBLY_3.FSA \ -l “ASSEMBLY_1, ASSEMBLY_3, ASSEMBLY_3” > QUAST.LOG 2>&1