Yesterday I attended the Next Generation Sequencing Congress at the Edwardian Radisson Hotel at Heathrow. The meeting was quite small (300 people perhaps) and quite vendor heavy and bioinformatics light. An interesting mix. The day was split into two streams, which I switched between frequently.
What follows are my notes from the meeting. This was not an attempt to liveblog the event; the notes were written up today. They reflect my personal biases as to which bits of the talks I was paying the most attention to, may be riddled with inaccuracies and misquotes, and are not to be taken as verbatim reports of the talks. If anyone feels they have been misquoted or misrepresented by anything below, please let me know and I will amend it as soon as I can.
From a personal perspective a few things were highlighted.
Firstly, I do not see how the 454 FLX system and/or Ion Torrent can possibly consider themselves the de facto choice for clinical resequencing. The error profiles of these machines just do not lend themselves to a discipline that needs accuracy above all, but which has been sold machines on the basis that ‘long reads’ were the best way to replicate what had previously been done by Sanger sequencing.
Secondly Galaxy is gaining a lot of ground as part of the analytical toolbox. Like others at the conference I’m not sure this is the way forward. I do wonder how much analysis is blindly pushed through Galaxy on default settings by naive researchers without a thought to what is being done, because data does come out of the end.
Thirdly, sequencing 100s of exomes doesn’t always lead you to causal genes.
Using Next Generation Sequencing to Identify Recurrent Mutational Events in Human Cancers
Steven Jones, Professor, Associate Director and Head, Bioinformatics, BC Cancer Agency
Sadly I arrived right at the end of this talk, but caught enough to find out that their SNP calling pipeline is Samtools and SNVMix. SNVMix is an SNV caller for cancer samples to address specific statistical issues not addressed by standard SNV calling tools.
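The reason a dedicated caller helps: a standard genotyper assumes allele fractions near 0, 0.5 or 1 at a site, which tumour impurity and copy-number change violate. The toy binomial mixture below is my own illustration of that basic idea, not the actual SNVMix model — the expected reference fractions in `mu` and the flat priors are placeholders (in a real tumour-aware caller they would be learned from the data):

```python
from math import comb

def genotype_likelihoods(ref_count, alt_count, mu=(0.99, 0.5, 0.01), priors=(1/3, 1/3, 1/3)):
    """Toy mixture-model genotyper: posterior over genotypes AA, AB, BB.
    mu[g] is the expected reference-allele fraction under genotype g;
    these values are illustrative, not SNVMix's learned parameters."""
    n = ref_count + alt_count
    likes = [comb(n, ref_count) * m**ref_count * (1 - m)**(n - ref_count) for m in mu]
    weighted = [l * p for l, p in zip(likes, priors)]
    total = sum(weighted)
    return [w / total for w in weighted]

# A 30x site with 10 alternate-allele reads looks heterozygous:
post = genotype_likelihoods(20, 10)
```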
SureSelectXT: Focus your Sequencing on DNA that matters
Darren Marjenberg, Agilent Technologies
30x coverage was still quoted as being the minimum requirement. Claimed that SureSelect can detect indels of 38bp. Talked about the v4 and v5 exome kits, which are complete redesigns and are said to address some of the issues seen with the 50Mb kit. Also said that costs are down 50% with the new kits, and introduced their focused kinome kit. They are developing an FFPE protocol, but there is a working one already published (http://genomebiology.com/1755-8794/4/68). They have also stratified the custom targets kit sizes into 1Kb-199Kb, 200Kb-499Kb, 500Kb-1.49Mb, 1.5Mb-2.99Mb, 3Mb-6.9Mb, and beyond – so smaller target sizes are now catered for. They also quoted this paper (http://www.pnas.org/content/early/2010/06/23/1007983107) for cancer panel resequencing with custom kits, citing excellent allelic balance (60/40 quoted as the maximum deviation), and even said that the relatively simplistic approach in this paper for identifying CNVs was successful in this panel. There were also slides on RNA target enrichment developed with Joshua Levin from the Broad Institute (http://genomebiology.com/2009/10/10/R115).
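That 60/40 figure is easy to turn into a QC check on heterozygous calls. A minimal sketch — the function names and the het-site framing are mine, not Agilent's:

```python
def allelic_balance(ref_count, alt_count):
    """Fraction of reads supporting the reference allele at a site."""
    return ref_count / (ref_count + alt_count)

def balance_ok(ref_count, alt_count, max_skew=0.60):
    """Flag heterozygous calls whose allelic balance deviates beyond
    the quoted 60/40 maximum (max_skew=0.60)."""
    b = allelic_balance(ref_count, alt_count)
    return (1 - max_skew) <= b <= max_skew
```

So a het site seen as 55 reference / 45 alternate reads passes, while a 70/30 site would be flagged for follow-up.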
Targeted Amplicon Resequencing on Illumina NGS
James Hadfield, Head of Genomics Core Facility, CRUK
CRUK are running HiSeq and MiSeq but not Ion Torrent. Had good words to say about the Nextera library prep kits (http://www.epibio.com/nextera/nextera.asp). A cautionary note was added about making sure that your genes and regions of interest are covered by the capture kits you’re using. Their cancer resequencing panel is 627 targets and includes targets of somatic and germline mutation, as well as targets with no current clinical intervention options. Trialled long-range PCR with TP53, then fragmenting prior to library prep. This then moved to testing Raindance to GAIIx for 4.5K exons. It seems they’re currently using Fluidigm (http://www.fluidigm.com/) to HiSeq2000. Fluidigm takes in 48 cDNA samples and 48 sets of assays (primer pairs) to create a 2304-well assay plate. The suggestion was it might be possible to plex up to 1500 samples per lane. There was also good correlation between Fluidigm and Sanger follow-up. They’ve also trialled a Nextera long-range PCR approach with the MiSeq, where a 12-sample turnaround can be done in a week. He was also very positive about 23andMe-style visualisation/reporting of genetic data in a clinical context.
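The plate and plexing arithmetic is worth sketching out. Note the reads-per-lane figure below is my own assumed number for illustration, not one from the talk:

```python
# Rough plexing arithmetic for a Fluidigm-style access-array experiment.
samples_per_plate = 48
assays_per_plate = 48
wells = samples_per_plate * assays_per_plate        # 2304 chambers, as quoted

# If ~150M reads per lane (an assumption, not from the talk) were spread
# over 1500 plexed samples and 48 amplicons each:
reads_per_lane = 150_000_000
samples_per_lane = 1500
reads_per_amplicon = reads_per_lane // (samples_per_lane * assays_per_plate)
```

Even at that plexing level each amplicon would still see on the order of 2000 reads, which is why such aggressive multiplexing is plausible for small panels.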
Single Molecule, Real-Time Sequencing on the PacBio RS platform: Technology and Applications
Deepak Singh, Sr. Director Sales, Pacific Biosciences Europe
Having never seen a PacBio presentation before, this was quite interesting. They sequence at around 1bp/sec and with the C1 chemistry achieve an average read length of 1.5kb, with the 95th percentile around 3.5kb. The UK installation that currently exists has reported read lengths up to 16kb. Machines have a built-in blade centre for data processing. The machine reportedly does not suffer GC bias issues. The procedure for sample prep is essentially DNA fragmentation, end-repair and ligation of the circularising adapters. The circularised set-up means that complementary strands are sequenced in the same run. The SMRTCell loading system has a 30-minute minimum run time and is loaded serially. SMRTCell max mappable reads = 45Mb. The loading hopper cannot be filled completely and left to its own devices, as reagents do not last for two days prior to loading. Larger inserts are sequenced once, smaller ones multiple times as they pass through the polymerase; this sounds good for scaffolding de-novo assemblies, and error correction from multiple-pass short reads can be applied to the longer reads. It is easier to detect gene fusions and deletions with long reads. Not capable of WGS yet, so targeted applications are best. Improvements are going to come from brighter dyes, so less laser power is required and polymerase degradation will decrease. Also, only 33% of ZMWs are filled with a single polymerase, 33% have 2 or more and 33% have none, so the machine technically operates at only a third of its potential capacity. C2 chemistry will offer 2.5-3kb average read lengths and 95th percentile read lengths of 6-8kb.
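The "small inserts get multiple passes, large inserts get one" trade-off follows directly from the fixed polymerase read length against a circularised template. A back-of-envelope sketch (the adapter contribution is ignored here for simplicity — my simplification, not PacBio's arithmetic):

```python
def approx_passes(insert_len, polymerase_read_len):
    """Approximate complete passes per strand through a circularised
    template: each lap around the circle reads the forward strand then
    the reverse strand, i.e. roughly 2 * insert_len of sequencing."""
    return polymerase_read_len // (2 * insert_len)
```

So with a 3kb polymerase read, a 500bp insert gets ~3 passes per strand (good for consensus error correction), while a 10kb insert is covered at most once — hence single-pass long reads for scaffolding, corrected by multi-pass short ones.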
What’s New? Putting Variants from Whole Genome or Whole Exome resequencing in biological context
Frank Schacherer, COO, BIOBASE
BIOBASE argue that HGMD is the best tool for identifying novelty in variant analysis. All BIOBASE offerings are human-curated. Highlighted its utility in cancer analysis due to the number of variants uncovered. Highlighted a typical cancer analysis pipeline of taking a variant list, dropping this to coding variants, then uncommon variants, then non-germline variants, and characterising the remaining somatic variants with SIFT, PolyPhen and MutationTaster, applying GO annotations and doing pathway analysis. Neatly uploaded HGMD into Galaxy to analyse the Watson genome. HGMD data is in wide use, from 1000Genomes to Cartagenia, Avadis NGS, CLC Bio and Alamut. The human curation shows its worth for SNPs that are initially reported as disease-causing but later found to be at high prevalence in the population (e.g. in 1000Genomes data). These are flagged by BIOBASE and eventually removed as not being clinically relevant. There was a suggestion that HGMD is going to be essential for personal genome assessment.
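That filtering cascade — coding, then uncommon, then non-germline — is simple enough to sketch. The field names below are illustrative, not BIOBASE's actual schema:

```python
def filter_somatic(variants, population_af_cutoff=0.01):
    """Sketch of the cascade described above: keep coding variants,
    drop common ones, drop germline ones; what remains is the somatic
    set to push into SIFT/PolyPhen/pathway analysis."""
    kept = [v for v in variants if v["coding"]]
    kept = [v for v in kept if v["population_af"] < population_af_cutoff]
    kept = [v for v in kept if not v["germline"]]
    return kept

candidates = [
    {"id": "v1", "coding": True,  "population_af": 0.0001, "germline": False},
    {"id": "v2", "coding": False, "population_af": 0.0001, "germline": False},  # non-coding: dropped
    {"id": "v3", "coding": True,  "population_af": 0.30,   "germline": False},  # common: dropped
    {"id": "v4", "coding": True,  "population_af": 0.0001, "germline": True},   # germline: dropped
]
somatic = filter_somatic(candidates)
```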
NGS: A deep look into the transcriptome
John Castle, Computational Medicine, TRON, Gutenberg University of Mainz
Their HiSeq is installed on a vibration-free table, because the emergency medical helicopter that lands on the roof of their institute played havoc with their runs. Highlighted the utility of RNA-Seq in gene expression analysis, as you can get zero counts back from an experiment, whereas microarrays always report noise/some signal. Interestingly, they made use of unaligned reads to assay viral load in samples (SARS in this case) and also to look for virulence mutations in the viral as opposed to human reads. Specific amplification protocols were developed to avoid amplifying reads from globin or rRNA – to get more bang for your sequencing buck. Highlighted that for clinical use samples really need to be received, sequenced and analysed in DAYS for successful clinical intervention. They use a Galaxy-based LIMS system. Most interestingly, they even run duplicate experiments for their exome resequencing studies; duplication, even for exome sequencing, should be done ‘as a matter of course’.
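The unaligned-reads trick reduces to simple read accounting. A crude sketch of the idea — this proxy and its inputs are my own illustration, not TRON's actual pipeline:

```python
def viral_load_estimate(total_reads, human_aligned, viral_aligned):
    """Crude viral-load proxy from an RNA-Seq run: of the reads that
    failed to align to the human reference, what fraction maps to a
    viral reference? Illustrative only; a real pipeline would also
    handle contamination, rRNA carry-over, etc."""
    unaligned = total_reads - human_aligned
    return viral_aligned / unaligned if unaligned else 0.0
```

For example, a run of 1M reads with 950k aligning to human and 5k of the remainder aligning to the viral genome gives a proxy of 0.1.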
Next Generation Sequencing Case Studies in Drug Discovery and Development
Jessica Vamathevan, Principal Scientist, Computational Biology GSK
Also use BIOBASE products. Use NGS for examination of viral titres in samples. Incorporate profiles of polymorphisms in viral load in gathering information about responses to drugs during clinical trials. Used NGS to examine viral population diversity during drug studies, especially to get a handle on drug resistance development. End up with 4000 reads per time-point. Use phylogenetic tree analyses to trace provenance of infection and viral mutation in HIV studies by patient clustering. Even possible to tell which subpopulation of virus may have been passed from one person to another even if viral population very diverse in transmitting individual. Layer on depth of sequencing information into phylogenetic trees using ‘pplacers’ and ‘guppy’.
Translational Genome Sequencing and Bioinformatics: The Medical Genome Project
Joaquin Dopazo, Director of Bioinformatics and Genomics, CIPF
The initial challenge was to sequence exomes from well-characterised, phenotyped patients and compare them to phenotyped control individuals (300 samples). He considers 1000Genomes data not sufficient for a control group, as they are not adequately phenotyped, and collecting a local pool of controls means that population-specific information becomes readily available in the course of the study. The pipeline uses a GPU-optimised BFAST for alignment – reducing run times to 5 hours per sample, so on an 8-CPU machine 200 million reads (or 20-30 exomes) can be processed a week. Highlighted the problem that exome sequencing throws up ‘too many’ variants: their filtering strategies did not seem to highlight single-gene causative mutations, and comparisons of familial groups failed to identify causal genes in the diseases of interest. In fact they have no causal genes from 200 patient exomes. Consequently they have developed pathway-based approaches to try and match up diseases and potentially causative genes to provide a story across the spectrum of the families involved in a given disease.
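The quoted throughput is easy to sanity-check against the quoted per-sample time:

```python
# Back-of-envelope check of the quoted alignment throughput.
hours_per_sample = 5                     # GPU-optimised BFAST, as quoted
hours_per_week = 7 * 24                  # 168
serial_samples_per_week = hours_per_week // hours_per_sample  # 33
```

Run flat out, a single machine gets through 33 exomes a week, so the quoted 20-30 is consistent once queueing, QC and I/O overheads are allowed for.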
BRCA1/2 Sequencing on the Roche GS-FLX System – an evaluation of the first year
Genevieve Michils, Laboratory for Molecular Diagnostics, University of Leuven
Sequencing to 40x for diagnostics, using AVA for variant calling. Have a robust multiplexing system to get a minimum of 25x coverage. After processing 500 samples in 22 runs, 150 mutations were detected, of which 15 were ‘in or near homopolymer regions’. The cost works out at €530/patient. QC involves rejecting reads that cover a region in only one direction, if necessary backing up missing areas with Sanger sequencing. Homopolymer issues are worse as the BRCA genes have plenty of them. Homopolymer error bias is not the same in the forward and reverse directions. Trying to use SEQNEXT (http://www.jsi-medisys.de/products.html), which converts reads to Sangeresque ‘peaks’, but nevertheless the homopolymers lead to false positive variant detection. “In a diagnostic context this is not efficient”. Trying to go back to the raw data to develop a statistical model to identify ‘abnormal’ profiles in homopolymer read regions, but still following up everything with Sanger sequencing afterwards anyway.
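Flagging calls ‘in or near homopolymer regions’ for Sanger follow-up can be done with a simple positional check against the reference. A minimal sketch — the run-length and window thresholds are my illustrative choices, not values from the talk:

```python
import re

def near_homopolymer(ref_seq, pos, min_run=4, window=3):
    """True if the 0-based variant position lies in or within `window`
    bases of a homopolymer run of at least `min_run` identical bases.
    Thresholds are illustrative, not from the Leuven pipeline."""
    lo = max(0, pos - window)
    hi = min(len(ref_seq), pos + window + 1)
    region = ref_seq[lo:hi]
    return re.search(r"(.)\1{" + str(min_run - 1) + r"}", region) is not None
```

A variant called next to an AAAA run would be flagged for confirmation; one in non-repetitive sequence would not.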
Towards Complete Quality-Assured Next-Generation Genetic Tests
Prof Harry Cuppens, Centre for Human Genetics, KULeuven
Primary work is on CFTR mutations, and thinks all couples should be screened for carrier status. Clinicians are not interested in non-actionable rare mutations. Highlighted a number of issues that need solving:
- Robust equimolar multiplex amplifications
- Economical pooling of samples
- Quality assured protocols
- Automated protocols
- Accurate homopolymer calling
Not a fan of the DTC genetic testing companies’ protocols for sample handling, and believes that with so many steps involved the chances of error are too high. The solution for this is to barcode samples at the earliest possible point in the sequencing process.
NGS Bioinformatics Support and Research Challenges
Mick Watson, Director of ARK-Genomics, The Roslin Institute
Currently has 7 bioinformaticians and 6 lab staff. HiSeq, GAIIx and array work, mainly in agriculturally important animal genomics. We are in the “Age of Bioscience”, so is this the most exciting time to be a biologist? Highlighted the long history of bioinformatics from Fischer to Dayhoff, Ledley, Bennet and Kendrew in the 50s and 60s. Was critical of hypothesis-free large sequencing projects. Highlighted that bioinformatics often fails to follow through on turning information into knowledge for research scientists, and that this needs to be a priority, not an afterthought. Discussed the makeup of the bioinformatics community – coders, statisticians, data miners, database developers. An interesting point about Galaxy is that he believes it is “moving the problems into a point and click interface”. If you don’t understand parameterisation and use of the command line, then you won’t understand it in Galaxy either. The greatest challenge of the future will be analysing individual genome plasticity. Bioinformaticians have always worried about the size of the data – from AB1 trace files, to array image data, to MAGE-ML, and now sequence data – but history shows we have coped before, and someone else will solve the problems. Also noted that there is a dearth of EXPERIENCED NGS bioinformaticians, so look to recruit people with some experience and train them up.
Using Galaxy to provide a NGS Analysis Platform
Hans-Rudolf Hotz, Bioinformatics Support, Friedrich Miescher Institute for Biomedical Research
The core offers its services for free. The expectation from biologists is the magic red button for analysis which when pressed turns raw data into Nature papers. Is Galaxy the solution? Galaxy captures provenance of data, modules can be constructed into workflows, analytical solutions from bioinformaticians and statisticians can be supplied directly to the end user, removing analytical load from the core (at the expense of system administrative load for loading Galaxy with relevant software).