Next Generation Sequencing Overview
Sequencing technologies
- Sequencing by hybridisation and ligation (Complete Genomics, Life Technologies, Polonator)
- Emulsion PCR amplification
- Detection: Cyclic ligation of nucleotide dimers. Multiple rounds required.
- Sequencing by synthesis (Helicos, Illumina, Intelligent Bio-Sys)
- Bridge PCR amplification on slides
- Detection: cyclic addition of reversible polymerisation terminators
- Real-time sequencing by synthesis (Pacific BioSciences, VisiGen)
- Single molecule, no amplification
- Cyclic removal of nucleotide and measure change in conductance of micropore
- Pyrosequencing (Roche/454™).
- Emulsion PCR amplification
- Detection: cyclic addition of single nucleotide species and detection of number of pyrophosphates released during polymerisation (through ATP to luciferase reaction)
Applications
Genomic
- Whole genome resequencing for variation detection
- Whole genome sequencing for de novo assembly
- Metagenomics (sequencing of multiple genomes together in a 'metasample')
- Targetted resequencing for variation detection
- Exome sequencing/targetted exome capture
- Arbitrary PCR enrichment of target
Epigenomic
- CHiP-Seq through sequencing of DNA enriched by immunoprecipitation
- DNA methylation analysis
- Sequencing of DNA enriched through immunoprecipitation of 5-methylcytosine ()
- Methylation-sensitive restriction enzyme fragment enrichment
Transcriptomic (RNA-seq)
- Whole transcriptome expression profiling (comparative gene expression)
- Whole transcriptome sequencing for de novo assembly (identify genes)
- Strand-specific mRNA-seq
- Paired-end RNA-seq
- Whole transcriptome variation detection
- Differential splicing detection
- sRNA analysis
Types of outcome
Variation Detection
- Copy number variation (CNV)
- Single Nucleotide Polymorphisms (SNPs)
- Insertions and deletions (Indels)
- Rearrangements
De Novo assembly
Concatemers (length?)
Comparative gene expression
Matrix of gene vs expression level
Splicing detection
List of alternative splice isoforms
Resequencing alignment strategies
- Align to transcriptome (simplest)
- Align to genome and exon-exon junction sequences (discover new transcripts, splice variants)
- De novo (most complex)
Categories of analysis
NGS informatics may be categorised into:
1. Primary analysis:
Conversion of imagesets (output from NGS machines) into strings of bases - 'base calling'.
Inputs:
- Image files from NGS machine
Outputs:
- Short reads
Software:
- Firecrest and Bustard (for Illumina)
2. Secondary analysis:
Assembly of short strings of bases (output of primary analysis) into contigs, and/or -Li Jason 11/18/09 2:43 PM alignment onto a reference genome and extraction of the first level of genomic meaning (SNP, rearrangements)
Inputs:
- Short reads (20-30million/run?)
Outputs:
- Aligned sequences against a reference genome
- SNP analysis
- Copy number variation data
- Quantified expression data
Software:
- E.g. Eland, BWA, MAQ for alignment. Velvet, ABySS for de-novo assembly.
References:
http://www.illumina.com/Documents/products/technotes/technote_denovo_assembly.pdf
3. Tertiary analysis
Extraction of biological information from the aligned sequences. Annotation of sequence data.
Inputs:
- E.g. Aligned sequences, SNP analysis
Outputs:
- Biological and statistical interpretations and visualisations
At each stage there are multiple options for processing - vendor software, open source software, each with varying levels of accuracy and computational expense. There are also large amounts of data produced at each stage. Additionally, primary and secondary analysis are dependent on the NGS platform being supported - Illumina Solexa pipeline is different to an Applied Biosystems SOLiD pipeline.
Anecdotally, 80% of processing in a NGS project is manual and occurs mainly in tertiary analysis.
Sample preparation
Fragment
Each read is independent - an arbitrary fragment of DNA with no meta-information about the relationship of that read to any other read
Paired-end sequencing
Each fragment is sequenced twice, once from each end, in succession. Probably the reads will not overlap, as the average length of a fragment is normally >2x that of a read.
The two reads are thus 'related' and have extra information about the structure of the sample. The information is that the two reads from the single fragment are 'close' - depending on the length of fragments that were prepared for the run.
http://www.illumina.com/technology/paired_end_sequencing_assay.ilmn
Mate-pair
Similar to paired-end, but ends of sequenced fragment can be much further apart - up to 10kb (normally 2-5kb)
http://www.illumina.com/technology/mate_pair_sequencing_assay.ilmn
Barcoded fragments
Multiple samples are prepared independently. Each sample is 'labelled' with a unique sequence adaptor (barcode). Samples are then mixed and sequenced together. Individual reads can be categorised to a particular sample by their barcodes.
Enrichment for target sequences
- PCR
- Hybridisation to oligo arrays
- PolyA selection (mRNA-seq)
- Size selection
- Precipitation
Potential services in NGS
Courtesy Colorado State Uni
Sequence Matching Analysis
-
Align SOLiD colorspace reads with reference genomes
-
Generate consensus sequences
Mate-Pair Analysis
-
Analyze paired reads from SOLiD mate-pair runs
-
Improve accuracy of read matches to detect structural variations > between sample and reference genomes
SNP / Indel Analysis
-
Identify single nucleotide polymorphisms
-
Identify insertion and deletion mutations in DNA sequences
Copy Number Variation Analysis
- Detect copy number variations in DNA sequences
Whole Transcriptome Analysis
-
Map SOLiD reads from a transcript sample to reference genomes
-
Assign tag counts to features of the reference genome
-
Identify novel transcripts
Inversion Analysis
- Identify paracentric and pericentric chromosomal inversions
de Novo Assembly
-
Create de novo assemblies from SOLiD colorspace reads
-
Characterize genomic sequences for which no closely related reference genome exists
-
Assemble SOLiD reads from microbial genomes into nucleotide sequence contigs of several thousand bases, and scaffolds of tens of thousands of bases.
-
Assemble SOLiD reads from microbial genomes into scaffolds of tens of thousands of bases
smRNA Analysis
-
Whole genome analysis of SOLiD RNA library reads
-
Includes filtering, matching against miRBase sequences, and matching against reference genomes
NextGENe Analysis
-
NextGENe is a comprehensive NGS bioinformatics package
-
NextGENe includes de novo assembly, target assembly, SNP/Indel discovery, digital gene expression analysis, whole transcriptome analysis, ChiPSeq analysis, miRNA discovery and quantification, and a condensation assembly tool for reducing instrument error
System Alignment Browser (SAB)
-
SAB is a graphical genome annotation viewer
-
Used for viewing basespace and colorspace reads aligned to reference genomes
GFF Conversion
-
Convert SOLiD system mapping files into GFF-format files
-
GFF is a standardized file format for describing DNA, RNA, and proteins
SRF Conversion
-
Convert SOLiD system reads into SRF format
-
SRF is a standardized file format for describing DNA sequence data
Appendix
Paired end sequencing - empirical observations
paired end or mate pair refers to how the library is made, and then how it is sequenced. Both are methodologies that, in addition to the sequence information, give you information about the physical distance between the two reads in your genome. For example, you shear up some genomic DNA, and cut a region out at ~500bp. Then you prepare your library, and sequence 35bp from each end of each molecule. Now you have three pieces of information:
- the tag 1 sequence
- the tag 2 sequence*
- that they were 500bp ± (some) apart in your genome
When we do 2x 50 bp paired-end runs on a GAIIx using the current gel purification step we get read distances of between that vary by about 100 bp in a nice tight bell shaped curve starting between 160-200 bp. So the first thing to bear in mind is that L is not fixed within or between runs. Either way this group accounts for >99.99% of paired-end reads in an assembly. Because of the way fragments are generated for sequencing 1 and 2 can align either F-B of B-F.
If you want to be more realistic there are always a tiny proportion of reads <0.1% that align with much longer read distances, some of which is due to bioinformatics but some of which is real and simply reflects biology. Likewise a tiny proportion of reads at all read distances will be F-F or B-B. Also there appear to often be a tiny proportion of reads that come out overlapping where the read distance is the same as a read length ie 1+L+2 is, in this case, \~50-100. I have no idea of the prevelance of such reads but you can often find them if you look. Lastly if its not going to be part of the assembler, end trimming and quality trimming can often mean that 1 and 2 are different lengths and that a substantial number of reads from a paired end run end up with no partner at all.
Mate-pair sequencing
Illumina refers to "paired end" as the original library preparation method they use, where you sequence each end of the same molecule. Because of the way the cluster generation technology works, it is limited to an inter-pair distance of ~300bp ( 200-600bp).
Illumina refers to mate pairs as sequences derived from their newer library prep method which is designed to provide paired sequences separated by a greater distance (between about 2 and 10kb). This method still actually only sequences the ends of ~400bp molecules, but this template is derived from both ends of a 2-10kb fragment that has had the middle section cut out and the 'internal' ends ligated in the middle. Basically, you take your 2-10kb random fragments, biotinylate the end, circularise them, shear the circles to ~400bp, capture biotinylated molecules, and then sequence those (they go into what is essentially a standard 'paired end' sample prep procedure).