




De novo Genome Assembly for Illumina Data

Protocol

Written and maintained by Simon Gladman - VLSCI

Protocol Overview / Introduction

In this protocol we discuss and outline the process of de novo assembly for small to medium sized genomes.

What is de novo genome assembly?

Genome assembly refers to the process of taking a large number of short DNA sequences and putting them back together to create a representation of the original chromosomes from which the DNA originated [1]. De novo genome assemblies assume no prior knowledge of the source DNA sequence length, layout or composition. In a genome sequencing project, the DNA of the target organism is broken up into millions of small pieces and read on a sequencing machine. These “reads” vary from 20 to 1000 nucleotide base pairs (bp) in length depending on the sequencing method used. Typically for Illumina type short read sequencing, reads of length 36 - 150 bp are produced. These reads can be either “single ended” as described above or “paired end.” A good summary of other types of DNA sequencing can be found here.

Paired end reads are produced when the fragment size used in the sequencing process is much longer than the read length (typically 250 - 500 bp) and the ends of the fragment are read in towards the middle. This produces two “paired” reads: one from the left-hand end of a fragment and one from the right-hand end, with a known separation distance between them. (The known separation distance is actually a distribution with a mean and standard deviation, as not all original fragments are of the same length.) This extra information contained in the paired end reads can be useful for helping to tie pieces of sequence together during the assembly process.

The goal of a sequence assembler is to produce long contiguous pieces of sequence (contigs) from these reads. The contigs are sometimes then ordered and oriented in relation to one another to form scaffolds. The known separation distances of paired end reads are useful information for this purpose.

The mechanisms used by assembly software are varied but the most common type for short reads is assembly by de Bruijn graph. See this document for an explanation of the de Bruijn graph genome assembler “Velvet.”

Genome assembly is a very difficult computational problem, made more difficult because many genomes contain large numbers of identical sequences, known as repeats. These repeats can be thousands of nucleotides long, and some occur in thousands of different locations, especially in the large genomes of plants and animals. [1]

Why do we want to assemble an organism’s DNA?

Determining the DNA sequence of an organism is useful in fundamental research into why and how they live, as well as in applied subjects. Because of the importance of DNA to living things, knowledge of a DNA sequence may be useful in practically any biological research. For example, in medicine it can be used to identify, diagnose and potentially develop treatments for genetic diseases. Similarly, research into pathogens may lead to treatments for contagious diseases [2].

The protocol in a nutshell:

Figure 1: Flowchart of de novo assembly protocol.

Raw read sequence file formats.

Raw read sequences can be stored in a variety of formats. The reads can be stored as text in a Fasta file or, with their qualities, as a FastQ file. They can also be stored as alignments to references in other formats such as SAM or its binary compressed implementation, BAM. All of these formats (with the exception of the binary BAM format) compress well and are often stored in compressed form (.gz for gzip-compressed files).

The most common read file format is FastQ as this is what is produced by the Illumina sequencing pipeline. This will be the focus of our discussion henceforth.
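If you want a quick look at the raw data before running any tools, the commands below are a minimal sketch (reads_R1.fastq.gz is a placeholder for one of your own read files). Each FastQ record occupies four lines: a header line beginning with @, the sequence, a separator line beginning with +, and a quality string of the same length as the sequence.

    # Show the first two FastQ records (4 lines each) of a gzip-compressed read file
    zcat reads_R1.fastq.gz | head -n 8

    # Count the number of reads (each record is exactly 4 lines)
    echo $(( $(zcat reads_R1.fastq.gz | wc -l) / 4 ))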

Bioinformatics tools for this protocol.

There are a number of tools available for each step in the genome assembly protocol. These tools all have strengths and weaknesses and have their own application space. Suggestions rather than prescriptions for tools will be made for each of the steps. Other tools could be substituted in each case depending on user preference, experience or problem type.

Genomics Virtual Laboratory resources for this protocol.

Depending on your requirements and skill base there are two options for running this protocol using GVL computing resources.

You can also use your own computing resources.


Section 1: Read Quality Control

Purpose:

The purpose of this section of the protocol is to show you how to understand your raw data, make informed decisions on how to handle it and maximise your chances of getting a good quality assembly. Knowledge of the read types, the number of reads, their GC content, possible contamination and other issues is important. This information will give you an idea of any quality issues with the data and guide your choice of data trimming/cleanup methods. Cleaning up the raw data before assembly can lead to much better assemblies, as contamination and low-quality, error-prone reads will have been removed. It will also help you set appropriate input parameters for the assembly software. It is a good idea to perform these steps on all of your read files as they could have very different qualities.

Steps involved and suggested tools:

Examine the quality of your raw read files.

For FastQ files (the most common), the suggested tool is FastQC. Details can be found here. FastQC can be run from within Galaxy or from the command line. (The standalone version also has a graphical interface.)

FastQC on any GVL Galaxy is located in: NGS: QC and Manipulation → FastQC: Comprehensive QC

Command line: fastqc
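As a rough sketch of a command-line run (the file and directory names are placeholders, not fixed requirements):

    # Write the FastQC reports for both files of a paired end library to a chosen directory
    mkdir -p fastqc_out
    fastqc -o fastqc_out reads_R1.fastq.gz reads_R2.fastq.gz

FastQC writes an HTML report per input file, which you can open in a web browser.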

Some of the important outputs of FastQC for our purposes are:

Quality trimming/cleanup of read files.

Now that you have some knowledge about the raw data, it is important to use this information to clean up and trim the reads to improve their overall quality before assembly. There are a number of tools available in Galaxy and on the command line that can perform this step (to varying degrees), but you’ll need one that can handle read pairing if you’ve got paired end reads. If one end of a pair is removed, the orphaned read needs to be put into a separate “orphaned reads” file. This maintains the paired ordering of the reads in the paired read files so the assembly software can use them correctly. The suggested tool for this is a pair-aware read trimmer called Trimmomatic. Details on Trimmomatic can be found here.

Trimmomatic on GVL systems: NGS: QC and Manipulation → Trimmomatic

Command line: details and examples here.

Trimmomatic can perform many read trimming functions sequentially.
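For illustration only (the file names, adapter file and trimming steps below are assumptions, not prescribed settings), a paired end run that chains several functions might look like:

    # Adapter clipping, sliding-window quality trimming and a minimum length filter, applied in order
    java -jar trimmomatic.jar PE -phred33 \
      reads_R1.fastq.gz reads_R2.fastq.gz \
      out_R1_paired.fastq.gz out_R1_orphaned.fastq.gz \
      out_R2_paired.fastq.gz out_R2_orphaned.fastq.gz \
      ILLUMINACLIP:adapters.fa:2:30:10 SLIDINGWINDOW:4:20 MINLEN:36

The two “orphaned” output files hold reads whose mate was discarded, as described above.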

Suggested Trimmomatic functions to use:

Things to look for in the output:

If you trimmed a pair of read files in paired end mode, Trimmomatic should produce two paired output files (one for the left-hand ends and one for the right-hand ends) and one or two single “orphaned reads” files. It produces only one output read file if you used it in single ended mode. Each read library (two paired files or one single ended file) should be trimmed separately, with parameters dependent on its own FastQC report. The output files are the ones you should use for assembly.

Possible alternate tools:

Read quality trimming: nesoni clip, part of the nesoni suite of bioinformatics tools. Available at http://www.bioinformatics.net.au/software.shtml


Section 2: Assembly

Purpose:

The purpose of this section of the protocol is to outline the process of assembling the quality trimmed reads into draft contigs. Most assembly software has a number of input parameters which need to be set prior to running. These parameters can and do have a large effect on the outcome of any assembly. By tweaking the input parameters you can produce assemblies with fewer gaps, fewer (or no) mis-assemblies and fewer errors. Therefore, knowledge of the parameters and their effects is essential to getting good assemblies. In most cases an optimum set of parameters for your data can be found using an iterative method.

You shouldn’t just run it once and say, “I’ve assembled!”

Steps involved and suggested tools:

Assembly of the reads.

The suggested assembly software for this protocol is the Velvet Optimiser, which wraps the Velvet assembler. Velvet is a short read assembler specifically written for Illumina style reads. It uses the de Bruijn graph approach (see here for details).

Velvet, and therefore the Velvet Optimiser, is capable of taking multiple read files in different formats and types (single ended, paired end, mate pair) simultaneously.

The quality of the contigs that Velvet outputs depends heavily on its parameter settings, and significantly better assemblies can be obtained by choosing them appropriately. The three most critical parameters to optimise are the hash size (k), the expected coverage (e) and the coverage cutoff (c).
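For reference, in a manual Velvet run these parameters map onto the hash length argument to velveth and the -exp_cov and -cov_cutoff options to velvetg; the k value and file names below are illustrative only:

    # Build the k-mer index at k = 31 from one paired end library
    velveth assembly_k31 31 -shortPaired -fastq -separate reads_R1.fastq reads_R2.fastq

    # Build the graph and contigs, letting Velvet estimate expected coverage and coverage cutoff
    velvetg assembly_k31 -exp_cov auto -cov_cutoff auto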

Velvet Optimiser is a Velvet wrapper that optimises the values for the input parameters in a fast, easy to use and automatic manner for all datasets. It can be run from within GVL Galaxy servers or by command line.

In Galaxy: NGS-Assembly → Velvet Optimiser

Command line: details and examples here.

The critical inputs for the Velvet Optimiser are the read files and the k-mer size search range. The read files need to be supplied in a specific order: single ended reads first, then paired end libraries in order of increasing insert size. The k-mer size search range needs a start and an end value; each must be an odd integer with start < end. If you set the start hash size higher than the length of any of the reads in a read file, those reads will be left out of the assembly. For example, reads of length 36 with a starting hash value of 39 will give no assembly. The output from FastQC can be very useful for choosing the start and end of the k-mer size search range: the per base sequence quality graph shows the position at which read quality starts to drop off, and going just a bit higher than this can be a good end value for the search range.
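As an illustrative sketch (the hash range and file names are placeholders, not recommendations), a command-line Velvet Optimiser run on a single quality-trimmed paired end library might look like:

    # -s and -e give the start and end of the k-mer size search range (odd integers, start < end)
    # -f is a velveth-style description of the read files
    VelvetOptimiser.pl -s 55 -e 75 \
      -f '-shortPaired -fastq.gz -separate out_R1_paired.fastq.gz out_R2_paired.fastq.gz'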

Examine the draft contigs and assess the assembly quality.

The Velvet Optimiser log file contains information about all of the assemblies run during the optimisation process. At the end of this file is a lot of information regarding the final assembly. This includes some metrics for the draft contigs (n50, maximum length, number of contigs etc.) as well as estimates of the insert lengths for each paired end data set. It also contains information on where to find the final contigs.fa file.

The assembly parameters used in the final assembly can also be found as part of the last entry in the log file.

The contig_stats.txt file associated with the assembly shows details regarding the coverage depth of each contig (in k-mer coverage terms NOT read coverage) and this can be useful information for finding repeated contigs etc.
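Note that k-mer coverage is lower than read coverage: for reads of length L assembled with hash length k, the k-mer coverage Ck is approximately C × (L − k + 1) / L, where C is the read coverage. For example, 100 bp reads at 50× read coverage assembled with k = 31 give roughly 50 × 70 / 100 = 35× k-mer coverage.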

More detailed metrics on the contigs can be obtained using a fasta statistics tool such as fasta-stats on Galaxy (Fasta Manipulation → Fasta Statistics).
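For a quick command-line check of the contigs file (contigs.fa is the standard Velvet output name), simple commands like the following can also be useful:

    # Count the number of contigs
    grep -c '>' contigs.fa

    # Inspect the first few contig headers; Velvet encodes node length and k-mer coverage in each header
    grep '>' contigs.fa | head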

Possible alternative software:

Assembly: There are a large number of short read assemblers available, each with its own strengths and weaknesses. Some of the available assemblers include:

See here for a comprehensive list of - and links to - short read assembly programs.


Section 3: What next?

Purpose:

To help you determine whether your draft set of contigs is suitable for the rest of your analysis, and what to do with them next.

Some things to remember about the contigs you have just produced:

What happens with your contigs next is determined by what you need them for:

Possible tools for improving your assemblies:

Most of these tools are open source and freely available (or at least academically available) but some are commercial software. This is by no means an exhaustive list. No warranties/suitability for purpose supplied or implied!

Mis-assembly checking and assembly metric tools:

Genome finishing tools:

Semi-automated gap fillers:

Genome visualisers and editors

Automated and semi automated annotation tools