Book review: “The $1000 Genome” by Kevin Davies

So in my quest to do a bit of reading around the industry I now find myself in, I’ve lined up a few books.  I actually started “The $1000 Genome” in the weeks prior to my interview at OGT, and the fact that I only finished it a couple of weeks ago (5 months later) should not be taken as a reflection on the quality of the book.

I think one thing people will be asking is whether a book written in 2010 is still relevant at the tail end of 2011 in such a fast-moving industry, and I think it's a testament to Kevin Davies' writing that it is.

I haven't read either of Kevin's previous two books, "Cracking the Genome" (no explanation required as to what that might be about [but note that the UK title of this book is "The Sequence"]) and "Breakthrough", on the race to find the breast cancer gene, but I probably will purchase them when the existing book backlog is cleared.

One thing this book has in spades is an excellent history of how we got to where we are today, both in terms of the personalities and the companies that have driven the NGS revolution.  Some of the names, such as 23andMe and the small sequencing startups that were swallowed by the industry biogiants, will be familiar, but the book charts them from setup to acquisition and the movement of key staff between them.  For me the history of NGS and emergent personal genomics is probably worth the cover price of the book alone.  There is also no skimping on the next 'next-generation' contenders.

Also well documented is the rivalry between the main DTC companies, 23andMe, deCODE and Navigenics, and it’s interesting to see how they stratify in terms of panels offered, risk calculation and how focused they are on ‘actionable’ information.  It’s also worth delving into the longer term research-led strategies of these companies, and the regulatory hurdles they are already embroiled in.

It's actually quite poignant how quickly we've moved from sequencing a reference genome, to sequencing an individual person's genome, to having dozens, and then hundreds of full genomes.  This was brought home in a telecon I had this week with a research institute that figured they had sequenced 160 genomes in 2010/2011.  As with all science, what was once a Nature paper quickly becomes routine when NGS hardware is ramping up capacity as much as it is.  This is also strongly highlighted in the book.

The final section deals with the likely arrival of genome-led P4 medicine and the sequencing X-Prize, and wraps up with just how close we are to the $1000 genome.  The book is actually quite light on price as a driver, preferring to point out what could be done when it's cheap enough to do so.  With excursions into the author's own genomic landscape and thorough referencing throughout, it's a book I can happily recommend to anyone in the field, or with a passing interest in it.

Notes from the Next-Generation Sequencing Congress

Yesterday I attended the Next Generation Sequencing Congress at the Radisson Edwardian Hotel at Heathrow. The meeting was quite small (300 people perhaps) and quite vendor-heavy and bioinformatics-light. An interesting mix. The day was split into two streams, which I switched between frequently.

What is presented below are my notes from the meeting. This was not an attempt to liveblog the event; the notes have been written up today. They reflect my personal biases as to which bits of the talks I was paying the most attention to, may be riddled with inaccuracies and misquotes, and are not to be taken as verbatim reports of the talks. If anyone feels they may have been misquoted or misrepresented by anything below, please let me know and I will amend this as soon as I can.

From a personal perspective a few things were highlighted.

Firstly, I do not see how the 454 FLX system and/or Ion Torrent can possibly consider themselves the de facto choice for clinical resequencing. The error profiles of these machines just do not lend themselves to a discipline that needs accuracy most of all, but which has been sold machines on the basis that 'long reads' were the best way to replicate what had previously been done by Sanger sequencing.

Secondly, Galaxy is gaining a lot of ground as part of the analytical toolbox. Like others at the conference, I'm not sure this is the way forward. I do wonder how much analysis is blindly pushed through Galaxy on default settings by naive researchers, without a thought to what is being done, simply because data does come out of the end.

Thirdly, sequencing hundreds of exomes doesn't always lead you to causal genes.

Notes below:

Using Next Generation Sequencing to Identify Recurrent Mutational Events in Human Cancers

Steven Jones, Professor, Associate Director and Head, Bioinformatics, BC Cancer Agency

Sadly I arrived right at the end of this talk, but caught enough to find out that their SNP calling pipeline is Samtools and SNVMix. SNVMix is an SNV caller designed for cancer samples, addressing statistical issues that standard SNV calling tools do not.

SureSelectXT: Focus your Sequencing on DNA that matters

Darren Marjenberg, Agilent Technologies

30x coverage was still quoted as being the minimum requirement.  Claimed that SureSelect can detect indels of 38bp.  Talked about the v4 and v5 exome kits, which are complete redesigns and are said to address some of the issues seen with the 50Mb kit. Also said that costs are down 50% with the new kits, and introduced their focused kinome kit.  They are developing an FFPE protocol, but there is a working one already published (http://genomebiology.com/1755-8794/4/68).  They have also stratified the custom target kit sizes into 1Kb-199Kb, 200Kb-499Kb, 500Kb-1.49Mb, 1.5Mb-2.99Mb, 3Mb-6.9Mb, and beyond – so smaller target sizes are now catered for.  They also quoted this paper (http://www.pnas.org/content/early/2010/06/23/1007983107) for cancer panel resequencing with custom kits, citing excellent allelic balance (60/40 quoted as being the maximum deviation), and even said that the relatively simplistic approach in this paper for identifying CNVs was successful in this panel.  There were also slides on RNA target enrichment developed with Joshua Levin from the Broad Institute (http://genomebiology.com/2009/10/10/R115).
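
The 60/40 allelic balance figure is easy to sanity-check against your own variant calls. A minimal sketch of that check (my own illustration with made-up counts, not Agilent's method):

    # Flag heterozygous calls whose allelic balance falls outside a 60/40
    # split, i.e. minor-allele fraction below 0.4. Counts are illustrative.

    def allelic_balance(ref_count, alt_count):
        """Return the fraction of reads supporting the minor allele."""
        total = ref_count + alt_count
        return min(ref_count, alt_count) / total if total else 0.0

    calls = [
        ("chr17:7578406", 52, 48),   # well-balanced het
        ("chr13:32913055", 70, 30),  # 70/30 - outside the quoted envelope
    ]

    for locus, ref, alt in calls:
        bal = allelic_balance(ref, alt)
        status = "OK" if bal >= 0.4 else "skewed"
        print(f"{locus}: minor allele fraction {bal:.2f} ({status})")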

Targeted Amplicon Resequencing on Illumina NGS

James Hadfield, Head of Genomics Core Facility, CRUK

CRUK are running HiSeq and MiSeq but not Ion Torrent.  Had good words to say about the Nextera library prep kits (http://www.epibio.com/nextera/nextera.asp).  A cautionary note was added about making sure that your genes and regions of interest are covered by the capture kits you're using.  Their cancer resequencing panel is 627 targets and includes targets of somatic and germline mutation, as well as targets with no current clinical intervention options.  Trialled long-range PCR with TP53, then fragmenting prior to library prep.  This then moved to testing RainDance on the GAIIx for 4.5K exons.  It seems they're currently using Fluidigm (http://www.fluidigm.com/) with the HiSeq 2000.  Fluidigm takes in 48 cDNA samples and 48 sets of assays (primer pairs) to create a 2304-well assay plate. The suggestion was it might be possible to plex up to 1500 samples per lane.  There was also good correlation between Fluidigm and Sanger follow-up.  They've also trialled a Nextera long-range PCR approach with the MiSeq where a 12-sample turnaround can be done in a week.  He was also very positive about 23andMe-style visualisation/reporting of genetic data in a clinical context.
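
The plexing arithmetic here is worth spelling out; a quick sketch using the figures from the talk (the reads-per-lane number is my own assumption, not something that was quoted):

    # 48 samples crossed with 48 assays (primer pairs) gives the 2304
    # reactions of the Fluidigm plate, as quoted in the talk.
    samples_per_chip = 48
    assays_per_chip = 48
    print("Reactions per chip:", samples_per_chip * assays_per_chip)  # 2304

    # If up to 1500 barcoded samples share a HiSeq lane, and assuming
    # ~150 million reads per lane (an assumed figure for illustration),
    # each sample still gets a healthy read count for a small amplicon panel.
    reads_per_lane = 150_000_000
    samples_per_lane = 1500
    print("Reads per sample:", reads_per_lane // samples_per_lane)   # 100000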

Single Molecule, Real-Time Sequencing on the PacBio RS platform: Technology and Applications

Deepak Singh, Sr. Director Sales, Pacific Biosciences Europe

Having never seen a PacBio presentation before, this was quite interesting. They sequence at around 1bp/sec and with the C1 chemistry achieve an average read length of 1.5kb, with the 95th percentile around 3.5kb.  The UK installation that currently exists has reported read lengths up to 16kb.  Machines have a built-in blade centre for data processing.  The machine reportedly does not suffer from GC bias issues.  The procedure for sample prep is essentially DNA fragmentation, end-repair, and ligation of the circularising adapters.  The circularising set-up means that complementary strands are sequenced in the same run.  The SMRT Cell loading system has a 30-minute minimum run time and is loaded serially.  SMRT Cell max mappable reads = 45Mb. The loading hopper cannot be filled completely and left to its own devices, as reagents do not last for two days prior to loading.  Larger inserts are sequenced once, smaller ones multiple times as they pass through the polymerase; this sounds good for scaffolding de novo assemblies, and error correction from multiple-pass short reads can be applied to the longer reads.  It is easier to detect gene fusions and deletions with long reads.  Not capable of WGS yet, so targeted applications are best.  Improvements are going to come from brighter dyes, so less laser power is required and polymerase degradation will decrease.  Also, only 33% of ZMWs are filled with a single polymerase, 33% have 2 or more and 33% have none, so technically the machines operate at only a third of potential capacity.  C2 chemistry will offer 2.5-3kb average read lengths and 95th percentile read lengths of 6-8kb.
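
The multiple-pass behaviour falls straight out of the circular template: the polymerase keeps going around the loop, so the number of passes is roughly the polymerase read length divided by the circumference of the circularised insert. A rough sketch using the read lengths quoted above (the adapter size is an assumption for illustration):

    # Estimate how many times a circularised insert is traversed.
    # One full pass reads both strands plus the two hairpin adapters.
    ADAPTER_BP = 50  # hairpin adapter length, assumed for illustration

    def estimated_passes(insert_bp, polymerase_read_bp):
        circumference = 2 * insert_bp + 2 * ADAPTER_BP
        return polymerase_read_bp / circumference

    for insert in (500, 1500, 5000):
        c1_passes = estimated_passes(insert, 1500)      # C1 average read length
        long_passes = estimated_passes(insert, 16000)   # longest read reported
        print(f"{insert}bp insert: ~{c1_passes:.1f} passes (C1 avg), "
              f"~{long_passes:.1f} passes (16kb read)")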

What’s New? Putting Variants from Whole Genome or Whole Exome resequencing in biological context

Frank Schacherer, COO, BIOBASE

BIOBASE argue that HGMD is the best tool for identifying novelty in variant analysis.  All BIOBASE offerings are human curated.  Highlighted utility in cancer analysis due to the number of variants uncovered.  Highlighted a typical cancer analysis pipeline of taking a variant list, dropping this to coding variants, then uncommon variants, then non-germline variants, and characterising the remaining somatic variants with SIFT, PolyPhen and MutationTaster, applying GO annotations and doing pathway analysis.  Neatly uploaded HGMD into Galaxy to analyse the Watson genome.  HGMD data is in wide use, from 1000 Genomes to Cartagenia, Avadis NGS, CLC Bio and Alamut.  The human annotation shows its worth with SNPs that are initially reported as disease causing but later found to be at high prevalence in the population (e.g. in 1000 Genomes data).  These are flagged by BIOBASE and eventually removed as not being clinically relevant.  There was a suggestion that HGMD is going to be essential for personal genome assessment.
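
That filtering cascade (all variants, then coding, then uncommon, then non-germline, then functional annotation) is simple to express as code; a minimal sketch with illustrative field names and thresholds, not BIOBASE's actual implementation:

    # Cascade described in the talk: coding -> uncommon -> somatic, then
    # hand the survivors to SIFT/PolyPhen/MutationTaster, GO and pathways.
    # Field names and the 1% frequency cut-off are illustrative assumptions.
    variants = [
        {"id": "var1", "coding": True,  "pop_freq": 0.30,  "in_matched_normal": False},
        {"id": "var2", "coding": True,  "pop_freq": 0.001, "in_matched_normal": False},
        {"id": "var3", "coding": False, "pop_freq": 0.0,   "in_matched_normal": False},
        {"id": "var4", "coding": True,  "pop_freq": 0.0,   "in_matched_normal": True},
    ]

    coding   = [v for v in variants if v["coding"]]
    uncommon = [v for v in coding if v["pop_freq"] < 0.01]          # drop common SNPs
    somatic  = [v for v in uncommon if not v["in_matched_normal"]]  # drop germline

    print("Candidate somatic variants:", [v["id"] for v in somatic])  # ['var2']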

NGS: A deep look into the transcriptome

John Castle, Computational Medicine, TRON, Gutenberg University of Mainz

Their HiSeq is installed on a vibration-free table, because the emergency medical helicopter that lands on the roof of their institute played havoc with their runs.  Highlighted the utility of RNA-Seq in gene expression analysis, as you can get zero counts back from an experiment, whereas microarrays always report noise/some signal.  Interestingly, they made use of unaligned reads to assay viral load in samples (SARS in this case) and also to look for virulence mutations in the viral as opposed to human reads.  Specific amplification protocols were developed to avoid amplification of reads from globin or rRNA – to get more bang for your sequencing buck.  Highlighted that for clinical use samples really need to be received, sequenced and analysed in DAYS for successful clinical intervention.  They use a Galaxy-based LIMS system.  Most interestingly, they run duplicate experiments for their exome resequencing studies; duplication, even for exome sequencing, should be done 'as a matter of course'.
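
The unaligned-read trick is, at its simplest, just counting what's left over after aligning to the human reference; a minimal sketch with pysam (the BAM filename is a placeholder, and in practice the unmapped reads would then be aligned to the viral genome of interest):

    # Count reads that failed to align to the human reference as a crude
    # first step towards assaying viral content. "sample.bam" is a
    # placeholder path; the unmapped reads would then be re-aligned
    # against the viral reference (SARS in the example from the talk).
    import pysam

    mapped = unmapped = 0
    with pysam.AlignmentFile("sample.bam", "rb") as bam:
        for read in bam:
            if read.is_secondary or read.is_supplementary:
                continue  # count each sequenced read only once
            if read.is_unmapped:
                unmapped += 1
            else:
                mapped += 1

    total = mapped + unmapped
    print(f"Unmapped: {unmapped:,} of {total:,} reads ({unmapped / total:.2%})")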

Next Generation Sequencing Case Studies in Drug Discovery and Development

Jessica Vamathevan, Principal Scientist, Computational Biology, GSK

Also use BIOBASE products.  Use NGS for examination of viral titres in samples.  Incorporate profiles of polymorphisms in viral load when gathering information about responses to drugs during clinical trials.  Used NGS to examine viral population diversity during drug studies, especially to get a handle on the development of drug resistance.  End up with 4000 reads per time-point.  Use phylogenetic tree analyses to trace the provenance of infection and viral mutation in HIV studies by patient clustering. It is even possible to tell which subpopulation of virus may have been passed from one person to another, even if the viral population is very diverse in the transmitting individual. They layer depth-of-sequencing information onto phylogenetic trees using 'pplacer' and 'guppy'.

Translational Genome Sequencing and Bioinformatics: The Medical Genome Project

Joaquin Dopazo, Director of Bioinformatics and Genomics, CIPF

The initial challenge was to sequence exomes from well characterised, phenotyped patients and compare them to phenotypically characterised control individuals (300 samples).  He considers 1000 Genomes data not sufficient for a control group, as the individuals are not adequately phenotyped, and collecting a local pool of controls means that population-specific information becomes readily available in the course of the study.  The pipeline uses a GPU-optimised BFAST for alignment – reducing run times to 5 hours per sample, so on an 8-CPU machine 200 million reads (or 20-30 exomes) can be processed a week.  Highlighted the problem that exome sequencing throws up 'too many' variants; their filtering strategies did not seem to highlight single-gene causative mutations, and comparisons of familial groups failed to identify causal genes in the diseases of interest.  In fact they have no causal genes from 200 patient exomes.  Consequently they have developed pathway-based approaches to try and match up diseases and potentially causative genes, to provide a story across the spectrum of the families involved in a given disease.
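
The throughput figures are just arithmetic, and worth a quick check (my calculation, simply re-using the numbers quoted in the talk):

    # ~5 hours per exome for the GPU-accelerated BFAST alignment step.
    hours_per_sample = 5
    hours_per_week = 7 * 24

    samples_per_week = hours_per_week // hours_per_sample
    print("Exomes per week:", samples_per_week)   # 33, in line with the 20-30 quoted

    # ~200 million reads per week spread over those exomes
    reads_per_week = 200_000_000
    print("Reads per exome: ~", reads_per_week // samples_per_week)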

BRCA1/2 Sequencing on the Roche GS-FLX System – an evaluation of the first year

Genevieve Michils, Laboratory for Molecular Diagnostics, University of Leuven

Sequencing to 40x for diagnostics, using AVA for variant calling.  They have a robust multiplexing system to get a minimum of 25x coverage.  After processing 500 samples in 22 runs, 150 mutations were detected, of which 15 were 'in or near homopolymer regions'.  The cost breakdown is €530/patient.  QC involves rejecting reads that cover a region in only one direction, if necessary backing up missing areas with Sanger sequencing.  Homopolymer issues are worse here, as the BRCA genes have plenty of them.  Homopolymer error bias is not the same in the forward and reverse directions.  They are trying to use SEQNEXT (http://www.jsi-medisys.de/products.html), which converts reads to Sanger-esque 'peaks', but nevertheless the homopolymers lead to false positive variant detection.  "In a diagnostic context this is not efficient". They are trying to go back to the raw data to develop a statistical model to identify 'abnormal' profiles in homopolymer read regions, but still follow up everything with Sanger sequencing afterwards anyway.
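
A quick tally of the figures quoted puts the homopolymer problem in context (just re-arranging the numbers from the talk):

    # Figures as quoted: 500 samples over 22 runs, 150 mutations detected,
    # 15 of them in or near homopolymer regions, at EUR 530 per patient.
    samples, runs = 500, 22
    mutations, near_homopolymer = 150, 15

    print(f"Samples per run: ~{samples / runs:.0f}")             # ~23
    print(f"Homopolymer-associated mutations: "
          f"{near_homopolymer / mutations:.0%}")                 # 10%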

Towards Complete Quality-Assured Next-Generation Genetic Tests

Prof Harry Cuppens, Centre for Human Genetics, KULeuven

His primary work is on CFTR mutations, and he thinks all couples should be screened for carrier status.  Clinicians are not interested in non-actionable rare mutations.  Highlighted a number of issues that need solving:

  • Robust equimolar multiplex amplifications
  • Economical pooling of samples
  • Quality assured protocols
  • Automated protocols
  • Accurate homopolymer calling

Not a fan of the DTC genetic testing companies' protocols for sample handling, and believes that with so many steps involved the chances of error are too high. The solution for this is to barcode samples at the earliest possible point in the sequencing process.

NGS Bioinformatics Support and Research Challenges

Mick Watson, Director of ARK-Genomics, The Roslin Institute

Currently has 7 bioinformaticians and 6 lab staff.  HiSeq, GAIIx and array work, mainly in agriculturally important animal genomics.  We are in the "Age of Bioscience", so is this the most exciting time to be a biologist?  Highlighted the long history of bioinformatics, from Fischer to Dayhoff, Ledley, Bennet and Kendrew in the 50s and 60s.  Was critical of hypothesis-free large sequencing projects.  Highlighted that bioinformatics often fails to follow through on turning information into knowledge for research scientists, and this needs to be a priority, not an afterthought.  Discussed the makeup of the bioinformatics community – coders, statisticians, data miners, database developers.  An interesting point about Galaxy is that he believes this is "moving the problems into a point and click interface".  If you don't understand parameterisation and use of the command line, then you won't understand it in Galaxy either.  The greatest challenge of the future will be analysing individual genome plasticity.  Bioinformaticians have always worried about the size of the data – from AB1 trace files, to array image data, to MAGE-ML, and now sequence data – but history shows we have coped before, and someone else will solve the problems.  Also noted that there is a dearth of EXPERIENCED NGS bioinformaticians, so look to recruit people with some experience and train them up.

Using Galaxy to provide an NGS Analysis Platform

Hans-Rudolf Hotz, Bioinformatics Support, Friedrich Miescher Institute for Biomedical Research

The core offers its services for free.  The expectation from biologists is a magic red button for analysis which, when pressed, turns raw data into Nature papers.  Is Galaxy the solution?  Galaxy captures the provenance of data, modules can be constructed into workflows, and analytical solutions from bioinformaticians and statisticians can be supplied directly to the end user, removing analytical load from the core (at the expense of the system administration load of keeping Galaxy loaded with the relevant software).

23andMe kit ordered

So today I took a look at my savings account for ‘frivolous’ things, wondered what I could do with the money and split it between a couple of nights in a hotel in Manchester for a gig next week, and a 23andMe kit. This is something I’ve been wanting to do for a long time, previously thwarted by a lack of cash.

Watching Genomes Unzipped unfold and seeing a number of my Twitter/FriendFeed/BioStar chums get their spit analysed has been interesting, but since my day-to-day job now involves hunting down variations in clinical samples from exome sequencing projects, the subject of what might be lurking in my own genome has started to hold a morbid fascination.

I will say straight up that if I had the money and there was an established DTC exome offering, I'd just get my exome sequenced and analyse it myself. However, there is admittedly a little professional interest in how 23andMe present the data, as well as in finding out more about what might be in store for me as I approach (if I have not already entered ;)) middle age.

I haven't decided what I might do with the data (in terms of public release) when I receive it; I'm going to have a bit of a think and a read before the saliva kit arrives. Interestingly, I'm feeling quite apprehensive about the results even at this stage; we will see how that develops too.