# Genome assembly - main point: size of k, and paired end data. Insertion sizes. In this tutorial we aim to teach you how to do assemblies on saga. Our basis for this assembly tutorial is that the user has paired short read data, and that assembly is happening with a [de Bruijn graph assembler like velvet or SPAdes](https://en.wikipedia.org/wiki/Velvet_assembler). The things we want you to know after this tutorial is the following: - the effects of k-mer size on assembly - the effect of having paired data - what N50 means The presumtion for this exercise is that you are in posession of a paired end short read dataset. In the following course material these files will be referred to as seq_R1.fq.gz, and seq_R2.fq.gz. You will also find a cheat list for commands at the bottom of this page. ## Experiment with k-mer size First, log in to saga, and find a suitable directory to work on. We will first experiment a bit with the program `velvet`, which was one of the very first short read assembly programs that became available. As described elsewhere in this documentation, please activate `screen` and ask for compute resources. The first thing we need to do is to activate this program. `conda activate velvet` ### Building the Velvet Index File Velvet requires an index file to be built before the assembly takes place. We must choose a *k-* mer value for building the index. Longer *k-* mers result in a more stringent assembly, at the expense of coverage. There is no definitive value of *k* for any given project. However, there are several absolute rules: * *k* must be less than the read length * it should be an odd number First we are going to run Velvet in single-end mode, *ignoring the pairing information*. Later on we will incorporate this information. Find a value of *k* (between 21 and 113) to start with. Next, run `velveth` as shown below to create the index. `velveth` options: * `ASM_NAME`: select your own name for the folder where the results should go * `value_of_k`: use k-mers of this size * `-short`: short reads (as opposed to long, Sanger-like reads) * `-separate`: read1 and read2 are in separate files * `-fastq`: read type is fastq Build the index as follows: ``` velveth ASM_NAME VALUE_OF_K \ -short -separate -fastq \ seq_R1.fq.gz \ seq_R2.fq.gz **NOTES** * Change `ASM_NAME` to a directory name of your choosing * Change `VALUE_OF_K` to the value you have picked * The command is split over several lines by adding a space, and a `\` (backslash) to each line. This trick makes long commands more readable. If you want, you can write the whole command on one line instead. After `velveth` is finished, use `ls` to look in the new folder that has the name you chose. You should see the following files: ``` Log Roadmaps Sequences ``` The '`Log`' file has a useful reminder of what commands you typed to get this assembly result, for reproducing results later on. '`Sequences`' contains the sequences we put in, and '`Roadmaps`' contains the index you just created. Now we will run the assembly with default parameters: ``` velvetg ASM_NAME ``` Velvet will end with a text like this: `Final graph has ... nodes and n50 of ..., max ..., total ..., using .../... reads` The number of nodes represents the number of nodes in the graph. Nodes in the graph get converted to contigs, however, nodes with a length below that of the k-mer size do not become contigs. Velvet reports its N50 (as well as everything else) in 'kmer' space. The conversion to 'basespace' is as simple as adding k-1 to the reported length. Look again at the folder `ASM_NAME`, you should see the following extra files: `contigs.fa` `Graph` `LastGraph` `PreGraph` `stats.txt` The important files are: `contigs.fa` - the assembly itself `Graph` - a textual representation of the contig graph `stats.txt` - a file containing statistics on each contig When it comes to downloading and running notebooks: 1. On github, click on the filename so that the notebook is displayed. 2. Click on `raw` in the upper corner 3. Copy the URL in the address field 4. Go to a terminal on your local computer, and type in `wget URL` 5. To run the notebook: `jupyter notebook filename.ipynb` ### Homework There are six genomes in the directory `/work/projects/nn9305k/bioinf_course/compgenomics/rawdata/homework` You will work with a partner. Each team will pick three k-mer sizes and use velvet to assemble the six genomes. Each person in the team is responsible for three genomes each. [Record your results here](https://docs.google.com/spreadsheets/d/124Eb6IQ44coSKMH0kRLU18AJ5FZ7-ijwsxqf1NsC9Ys/edit?usp=sharing) ## 2018-02-17 ### Today's practical Today's practical consists of different sections: * first, we will assemble the test genome with SPAdes, using two different options * second, we will run the program QUAST to evaluate our assemblies * third, we will spend some time understanding the QUAST output We are again working on abel, in the `2018_compgenome` directory we created last time. We want to be working inside of a `qlogin`, since some of the things we will do might take a bit of time. Note: to use SPAdes and QUAST, we have to use `module load`. There are some extra files we need for QUAST, these are: * A genome sequence: `/work/projects/nn9305k/genome_references/genomes/ecoli/GCF_000005845.2_ASM584v2_genomic.fna ` * A genome annotation file, in `GFF` format: `/work/projects/nn9305k/genome_references/genomes/ecoli/GCF_000005845.2_ASM584v2_genomic.gff` We will examine the QUAST results using a browser. To do that, we need to transfer the QUAST result directory to our local computer. We can do that using this command line on our local vm: `scp -r username@abel.uio.no:/full/path/to/directory .` 1. In your webbrowser, go to the course pages listed above. Find the SPAdes module. * Create a new directory _next_ to the velvet directory, called spades, and go into it * Figure out what the SPAdes command line looks like. * Create two assemblies from the inclass data, one using the `--careful` option, and one that does not use this option. Name their output directories `spades_wo` and `spades_careful` * Google for the SPAdes manual and try to figure out two things: * Which k-mers are used * What does --careful do? * Go into the results directory, and and look at the results * Run the assemblathon script on the two assemblies, and do a quick comparison between the * velvet assembly * the SPAdes assembly * the SPAdes assembly with `--careful` 2. In your web browser, go to the course pages listed above. Find the module where we are comparing to a reference. * Have a look at the gff file that contains the genome annotation using `less`. What does it contain? What's the format like? * Using the webpage from the course, figure out what the command line for QUAST should look like. In your comparison, include the velvet assembly, the SPAdes assembly, and the SPAdes careful assembly. IMPORTANT! You should include the `--scaffold` option. * Run the QUAST comparison. 3. While QUAST is running, have a look at the [QUAST paper](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3624806/) and the [QUAST manual](http://quast.bioinf.spbau.ru/manual.html). Figure out the following things: * What basic statistics does QUAST produce? * What is a missassembly according to QUAST? * What is the Duplication ratio, number of mismatches per 100k, and number of genes? * What are the Nx, NAx, NGx and so on values? 4. Transfer the resulting QUAST output directory to your local computer, and open the `report.html` file using firefox. * Have a look at the basic statistics * Which assembly has the highest N50? * Which assembly has the most found genes? * Which genome has the fewest misassemblies? * Open the Icarus browser * Select the first 300 000 bases of the reference genome by typing in the numbers `1` and `300000` in the boxes above * Click on the first contig in the velvet assembly. Have a look at the box on the right. It seems like this contig is split up, but it kind of isn't. What's going on here? Use the arrows that is shown in the section below to understand what is going on. * The velvet assembly is first, and it has a red contig in the unbroken assembly. Figure out what the difference between the two assemblies here, by examining the boxes on the right. * The two SPAdes assemblies look very similar. Can you find some regions where there is a difference between the non-careful and the careful assembly? ### Homework Remember to use `qlogin` when doing the exercises. 1. Redo the velvet assemblies using the command in the cheat list. We did get to the place where I should have introduced the `shortPaired` option last week, which is why you didn't use it for your assemblies. This option helps a lot, which is why we should use it. 2. Do SPAdes assemblies for all the genomes that we gave you. Remember to use `--careful`. 3. Use QUAST and compare each of your SPAdes assemblies to the velvet assembly of the same genome. Use the same reference genome and annotation as in the exercise. Which assembly is `best` and why? ## 2018-02-26 Today, we will summarize what we've done so far, and see what we have. We will also start in on the annotation part. Two people will, for next time, figure out what PROKKA and ROARY does, when it comes to genome annotation and genome comparisons. ### Questions from the last two weeks What are your questions? You will sit together in small groups and discuss for a few minutes. ### Assign tasks for figuring out stuff We need volunteers for creating some student presentations for the material we will be working through next. ### Basic statistics for SPAdes and velvet Please fill in this [Google docs](https://docs.google.com/spreadsheets/d/1yWWPxKhfSc3JbwSSaJoxsqx5WGlS4bUVsmqqnMgevas/edit?usp=sharing) spreadsheet, to get the basic statistics for the assemblies for the six genomes. For this, use the assemblathon script. Next: compare the results from running the assemblathon scripts for both SPAdes and velvet for each genome. Can you see some systematic differences? The easiest way to do this is to have the results from each script ### Comparisons between SPAdes and velvet We will form three groups. Each group will take two of the assigned genomes. First, compare the output from the assemblathon script for the SPAdes and velvet for each genome. Can you see some systematic differences? The easiest way to do this is to have the results for each genome present in a separate terminal and compare line by line. Next, we will look at the QUAST comparison result for the assemblies. Some questions to ask/answer: * Which of the assemblies have the higher N50? * Which assembly has the highest number of contigs? * Which assembly is the longest? * Which one of them have the most unaligned seqence/contigs? * Which of the assemblies have the most misassemblies? * Which of the assemblies have the most genes found? Group 1: 14042624 and DTU2014 Group 2: F159 and F168 Group 3: F27 and S19 ### Have a look at contig lengths using the jupyter notebook We will use a jupyter notebook to compare the lengths of the contigs of our assemblies. To do this, we need a notebook, and two assemblies, one assembled with velvet and one with spades. [Download this jupyter notebook](https://github.com/karinlag/Lytir/blob/master/lengths.ipynb). Remember, to do this, you need to first click "Raw" in the upper right corner, then copy the URL. Next, you go to a terminal and type in `wget copied_url`. Using this notebook, we can compare the lengths of tje scaffolds in the two assemblies. Next, we need to download the assemblies we want to compare. Use the `scp` command mentioned above, or mentioned in the [HPC module](working_with_hpc.md). Open the notebook by typing in `jupyter notebook lengths.ipynb`. Inside the notebook, you need to replace the file names in the notebook with your own file names. ### Homework 1. Run `fastqc` on your genomes. Figure out what the lengths of your reads are. Based on that, google and find the SPAdes manual and figure out what options to spades would be appropriate regarding k-mer sizes. NOTE: once you've figured out your command line, send a summary of what you found to the biopinf-comp email list. 2. Re-run SPAdes. Use the options that you have found before. Also, use the coverage cutoff option and set it to auto. 3. You will compare these SPAdes assemblies to velvet using QUAST. We have put some ready made velvet-assemblies for you in this directory: `/work/projects/nn9305k/bioinf_course/compgenomics/assemblies/velvet`. Use some of the questions stated above as a guide to figure out what to look at. 4. Compare the contig lengths using the jupyter notebook as described above. ## Command line cheat list Note: anything in CAPITAL LETTERS should be replaced with something, usually a file name, an output directory name, a fasta file name or something similar. You can figure out what the various options do by googling for the manual, by using the INFBIO course pages linked to above, or (quite often) typing in the name of the program, without any options, and pressing enter. Note, in the command below, long commands will be broken up with a `\`. If you write in the commands in one long line, you do not include this slash. ### qlogin XX should be replaced with how many hours you want, CPUS with the number of cpus you want to use. Remember, if you ask for more than one cpu, you should actually use those CPUs when running commands. I.e. for SPAdes for instance, you should specify the `-t` option. You are also likely to want to use `screen` before requesting a `qlogin` session. ``` qlogin --account=nn9305k --time=XX:00:00 --ntasks=CPUS --mem-per-cpu=4G ``` ### Velvet ``` velveth ASM_NAME VALUE_OF_K \ -shortPaired -fastq -separate \ PATH/TO/READ_1.FASTQ \ PATH/TO/READ_2.FASTQ \ velvetg ASM_NAME -exp_cov auto -cov_cutoff auto ``` ### SPADdes ``` spades.py -t CPUS --careful --pe1-1 PATH/TO/READ_1.FASTQ -pe1-2 PATH/TO/READ_2.FASTQ \ -o OUTPUTDIRECTORY > LOGFILENAME.LOG 2>&1 ``` #### QUAST Note, you can compare as many assemblies as you want, from 1 to hundreds. You need to add them sequentially to the command line, and remember to name them prperly inside the `-l` option. ``` quast.py -t CPUS -o OUTPUTDIRECTORY \ -R /work/projects/nn9305k/genome_references/genomes/ecoli/GCF_000005845.2_ASM584v2_genomic.fna \ -G /work/projects/nn9305k/genome_references/genomes/ecoli/GCF_000005845.2_ASM584v2_genomic.gff \ --scaffolds PATH/TO/ASSEMBLY_1.FSA PATH/TO/ASSEMBLY_2.FSA PATH/TO/ASSEMBLY_3.FSA \ -l "ASSEMBLY_1, ASSEMBLY_3, ASSEMBLY_3" > QUAST.LOG 2>&1 ```