

Caveats: I have not taken notes in every talk of every session, a lack of notes for a particular speaker does not constitute disinterest on my part, I simply took notes for the talks that were directly related to my current work. If I have misquoted, misrepresented or misunderstood anything, and you are the speaker concerned, or a member of the team involved in the work, please leave a comment on the post, and I will rectify the situation accordingly.

7.1    Christine Eng, Baylor College of Medicine: “Clinical Exome Sequencing for the Diagnosis of Mendelian Disorders”

Christine spoke about the pipeline for clinical WES at Baylor. Samples are sequenced to 140x so that 85% of the exome is covered at >40x. A SNP array is run alongside each sample, and concordance between the array and the sequence data must exceed 99%.
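The array concordance check can be sketched in a few lines. This is my own minimal illustration, not the Baylor implementation: the dictionaries keyed by SNP ID and the function names are hypothetical, and only the 99% threshold comes from the talk.

```python
from typing import Dict

CONCORDANCE_MIN = 0.99  # threshold quoted in the talk


def normalise(genotype: str) -> str:
    """Sort alleles so that 'GA' and 'AG' compare as equal."""
    return "".join(sorted(genotype.upper()))


def concordance(array_calls: Dict[str, str], seq_calls: Dict[str, str]) -> float:
    """Fraction of SNPs called on both platforms that have the same genotype."""
    shared = [snp for snp in array_calls if snp in seq_calls]
    if not shared:
        raise ValueError("no SNPs in common between array and sequence calls")
    matches = sum(
        1 for snp in shared
        if normalise(array_calls[snp]) == normalise(seq_calls[snp])
    )
    return matches / len(shared)


def passes_qc(array_calls: Dict[str, str], seq_calls: Dict[str, str]) -> bool:
    """True when concordance exceeds the assumed 99% cut-off."""
    return concordance(array_calls, seq_calls) > CONCORDANCE_MIN
```

Only sites called on both platforms are compared here; how no-calls are treated in the real pipeline was not covered in my notes.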

BWA is the primary mapper, but variants are called with ATLAS and annotated with Cassandra (Annovar is a dependency of Cassandra).

Critical resource: https://www.hgsc.bcm.edu/software/cassandra

Critical resource: http://sourceforge.net/projects/atlas2/

Critical paper: http://genome.cshlp.org/content/20/2/273.short “A SNP discovery method to assess variant allele probability from next-generation resequencing data”

Variants are filtered against HGMD and restricted to those with <5% MAF. 4000 clinical exomes have been run internally, so there is a further requirement for variants to have <2% MAF in this dataset.
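The two frequency cut-offs can be sketched as a simple filter. This is a minimal illustration rather than the Baylor pipeline: the `Variant` record and its field names are hypothetical, and only the 5% and 2% thresholds come from the talk.

```python
from dataclasses import dataclass
from typing import Iterable, List

PUBLIC_MAF_MAX = 0.05    # <5% MAF in public data, as quoted in the talk
INTERNAL_MAF_MAX = 0.02  # <2% MAF in the internal clinical exome set


@dataclass
class Variant:
    name: str            # hypothetical identifier for the variant
    public_maf: float    # minor allele frequency in public datasets
    internal_maf: float  # minor allele frequency in the internal exomes


def passes_frequency_filters(v: Variant) -> bool:
    """Keep a variant only if it is rare in both frequency sources."""
    return v.public_maf < PUBLIC_MAF_MAX and v.internal_maf < INTERNAL_MAF_MAX


def filter_rare(variants: Iterable[Variant]) -> List[Variant]:
    """Return the variants that pass both MAF cut-offs."""
    return [v for v in variants if passes_frequency_filters(v)]
```

The internal-cohort filter is the useful part in practice, since it removes variants that look rare in public data but are common in the lab's own patient population or are recurrent artefacts.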

The gene list for the system is updated weekly, and VOUS in genes related to the disorder are reported to all patients – this is much more extensive reporting than that of groups who feel VOUS muddy the waters.

An expanded report can also be requested, which additionally reports deleterious mutations in genes for which there is no disease/phenotype linkage. The hit rate for molecular diagnosis via clinical exome is 25%, so 75% of cases are not clinically solved. These patients are then asked if they would like to opt in to a research programme so that the data can be shared and aggregated for greater diagnostic power.

11/504 cases had two distinct disorders presenting at the same time. 280 cases were autosomal dominant and 86% of the dominant cases are de novo mutations. 187 cases were autosomal recessive and this was 57% compound heterozygous, 3% UPD and 37% had homozygosity due to shared ancestry.

Many initially unsolved cases can be successfully resolved on revisiting the data 6-12 months later, such is the pace of new data deposition.

They use guidelines from CPIC (from PharmGKB) and data on drug/gene interactions and there is linking to a prescription database, so the pipeline is ‘end to end’.

Critical resource: http://www.pharmgkb.org/page/cpic



5.1    Mark Lawler, QUB, Belfast: “Personalised Cancer Medicine; Are we there yet?”

Another talk from Mark who was an excellent chair for some conference sessions as well. One of the biggest problems with personalized medicine is that some data is already silo’d, or at very best fragmented.

In the UK getting science into clinical practice within the NHS is really predicated on the evidence that it reduces costs, is transformational in terms of treatment and adds value to the current system. So the bar is set quite high.

This was contrasted with the INCa Tumour Molecular Profiling Programme which is running in France with colorectal and lung cancers. This is drawing on 28 labs around Europe. INCa appears to be run under the auspices of the Institut National du Cancer.

Critical resource: http://www.e-cancer.fr/en

Mark felt that empowering patient advocacy was going to be an important drive in NHS uptake of new technologies and tests. But equally important was increasing personalized medicine literacy amongst GPs, policymakers and the insurance industry.

5.2    Nazneen Rahman, ICR, London “Implementing large-scale, high-throughput cancer predisposition genomic testing in the clinic”

Nazneen is interested in testing germline mutations, unlike much of the rest of the cancer programme, which focused on somatic mutation detection. Consequently she is working with blood draws and not biopsy material.

There are >100 predisposition genes implicated in 40+ cancers and there is variable contribution depending on the mutation and the cancer type. 15% of ovarian cancers result from germline variants, and this falls to 2-3% of all cancers. For this kind of screening a negative result is just as important as a positive one.

On the NHS testing for about half these predisposition genes is already available but even basic BRAF testing is not rolled out completely so tests have ‘restricted access’.

What is really needed is more samples. Increased sample throughput drives ‘mainstreaming of cancer genetics’. And three phases need to be tested – data generation, data analysis and data interpretation.

Critical resource: http://mcgprogramme.com/

They are using a targeted panel (CAPPA – which I believe is a TruSight Cancer Panel) where every base must be covered to at least 50x, which means mean target coverage of samples approaches 1000x even for germline detection. There’s a requirement for a <8 week turnaround time (TAT), and both positive and negative calls must be made. It was acknowledged that there will be a switch to WEX/WES ‘in time’ when it is cheap.

The lab runs rapid runs on a HiSeq 2500 at a density of 48 samples per run. This gives a capacity of 500+ samples per week (so I assume there’s more than one 2500 available!). 50ng of starting DNA is required and there is a very low failure rate. 2.5k samples have been run to date. 384 of these were for BRCA1/2. 3 samples have failed and 15 required ‘Sanger filling’.
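The arithmetic behind my aside about needing more than one instrument is easy to check. This is just a back-of-envelope sketch: the two constants are the figures quoted in the talk, and the function name is my own.

```python
import math

SAMPLES_PER_RUN = 48    # density quoted for a rapid run
SAMPLES_PER_WEEK = 500  # "500+ samples per week"


def runs_needed(samples: int, samples_per_run: int) -> int:
    """Whole number of runs required to sequence the given number of samples."""
    if samples_per_run <= 0:
        raise ValueError("samples_per_run must be positive")
    return math.ceil(samples / samples_per_run)
```

`runs_needed(SAMPLES_PER_WEEK, SAMPLES_PER_RUN)` gives 11, so 500 samples a week means at least 11 rapid runs in that week.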

In terms of analysis, Stampy is used as the aligner and Platypus for variant calling due to its superior handling of indels. A modified version of ExomeDepth is used for CNV calling, and internal development produced coverage evaluation tools and HGVS parsers. All pathogenic mutations are still validated with Sanger or another validation method.

Data interpretation is the bottleneck now. It’s intensive work for pathogenic variants, and VOUS are an issue – they cannot be analysed in a context independent fashion and are ‘guilty until proven innocent’ in the clinician’s mind.

They have also performed exome sequencing of 1k samples, and observed an average of 117 variants per individual of clinical significance to cancer and 16% of the population has a rare BRCA variant.

Nazneen prefers to assume in advance that VOUS are not implicated: we should stick to reporting what is known until such time as a previous VOUS is declared to be pathogenic in some form. But we should be able to autoclassify 95% of the obvious variants, reducing some of the interpretation burden. Any interpretation pipeline needs to be dynamic and iteratively improved, with decision trees built into the software. Control variant data is therefore important: ethnic variation is a common trigger for VOUS, where the variant is not in the reference sequence but is a population level variant for an ethnic group.

Incorporating gene level information is desirable but rarely used. For instance information about how variable a gene is would be useful in assessing whether something was likely to be pathogenic – against a background which may be highly changeable vs. one that changes little.

Although variants are generally stratified into 5 levels of significance, they really need to be collapsed down into a binary state of ‘do something’ or ‘do nothing’. A number of programs help in the classification, including SIFT, PolyPhen, MAPP, Align-GVGD, NN-Splice and MutationTaster. The report also has Google Scholar link outs (considered to be easier to query sanely than PubMed).

To speed analysis all the tools are used to precompute scores for every base substitution possible in the panel design.
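Enumerating every possible substitution in a target region is the easy part of that precomputation. The sketch below is my own illustration under simple assumptions (1-based coordinates, uppercase ACGT reference, hypothetical function name); feeding each substitution to the prediction tools is not shown.

```python
from typing import Iterator, Tuple

BASES = "ACGT"


def all_substitutions(
    chrom: str, start: int, ref_seq: str
) -> Iterator[Tuple[str, int, str, str]]:
    """Yield (chrom, pos, ref, alt) for every possible single base substitution.

    `start` is the 1-based coordinate of the first base of `ref_seq`.
    Reference positions that are not A, C, G or T (e.g. N) are skipped.
    """
    for offset, ref in enumerate(ref_seq.upper()):
        if ref not in BASES:
            continue
        for alt in BASES:
            if alt != ref:
                yield (chrom, start + offset, ref, alt)
```

Each reference base produces three substitutions, so a panel of N target bases needs 3N precomputed scores per tool, which is very tractable for a targeted design.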

5.3    Timothy Caulfield, University of Alberta, Canada: “Marketing the Myth of Personalised Prevention in the Age of Genomics”

No notes here, but an honorable mention for Tim, who gave what was easily the most entertaining talk of the conference, focusing on the misappropriation of genomic health by the snake oil industries of genomic matched dating, genomic influenced exercise regimes and variant led diets. He also asked the dangerous question: if you 1) eat healthily 2) don’t smoke 3) drink in moderation 4) exercise, is there really any value in personalized medicine except for a few edge cases? Health advice hasn’t changed much in decades. And people still live unhealthily. You won’t change this by offering them a genetic test and asking them to modify their behavior. If you ever have a chance to see Tim speak, it’s worth attending. He asked for a show of hands from those who had done 23andMe. Quite shockingly for a genetics conference, only 3 people had their hand in the air: myself, Tim and one of the other speakers.



3.1    Peter Robinson, Humboldt University, Berlin: “Effective diagnosis of genetic disease by computational phenotype analysis of the disease associated genome”

Peter focused on the use of bioinformatics in medicine, specifically around the use of ontologies to describe phenotypes and look for similarities between diseases. It is important to capture the signs, symptoms and behavioural abnormalities of a patient in PRECISE language to be useful.

The concept here is ‘deep phenotyping’ – there’s almost no such thing as too much information about clinical presentation, but it must be recorded consistently to provide a basis for computational comparison and analysis.

HPO (The Human Phenotype Ontology) was introduced, saying that in many ways it is indebted to OMIM (Online Mendelian Inheritance in Man).

He felt strongly that the standard exome with 17k genes was ‘useless’ in a diagnostic context, when there are 2800 genes associated with 5000 disorders, covering a huge spectrum of presenting disease. Consequently he does not recommend screening the exome as a first line test, but encourages the use of reduced clinical exomes. In particular, this allows higher coverage for the same per-sample cost, and he suggested that the aim should be to have 98% of the target regions covered to >20x.
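The 98% at >20x target is straightforward to check once you have per-base depths for the target regions. This is a minimal sketch: the list of depths would come from your own coverage tool, and the function names are hypothetical.

```python
from typing import Sequence


def fraction_above(depths: Sequence[int], min_depth: int) -> float:
    """Fraction of target positions with depth strictly greater than min_depth."""
    if not depths:
        raise ValueError("no target positions supplied")
    covered = sum(1 for d in depths if d > min_depth)
    return covered / len(depths)


def meets_target(
    depths: Sequence[int], min_depth: int = 20, min_fraction: float = 0.98
) -> bool:
    """True when at least min_fraction of positions exceed min_depth."""
    return fraction_above(depths, min_depth) >= min_fraction
```

Note the strict comparison, matching ">20x" as I wrote it down; whether the intended cut-off was inclusive is something I did not capture.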

Clearly identified pathogenic mutations are the easiest thing to call from this kind of dataset, but OMIM remains the first port of call for finding out the association of a mutation with a condition. And OMIM is not going to be of much help in finding information on a predicted deleterious mutation in a random chromosomal ORF.

Specifically, they take VCF files and annotate them with HPO terms as well as the standard suite of MutationTaster, PolyPhen and SIFT.

A standard filtering pipeline should get you down to 50 to 100 genes of interest and then you can do a phenotype comparison of the HPO terms you have collected from the clinical presentation and the HPO terms annotated in the VCF. This can give you a ranked list of variants.

This was tested by running 10k simulations of such a process, with variants from HGMD spiked into an asymptomatic individual’s VCF file. The gene ranking score depends on a variant score for deleteriousness and a phenotype score for the match to the clinical phenotype. In the simulation, the right gene was at the top of the list 80% of the time.
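To make the ranking idea concrete, here is a toy sketch. It is an assumption on my part that the two scores are simply averaged; the talk only said that the gene score depends on both, and the `Candidate` record and the function names are hypothetical.

```python
from typing import Iterable, List, NamedTuple


class Candidate(NamedTuple):
    gene: str
    variant_score: float    # predicted deleteriousness, assumed to lie in [0, 1]
    phenotype_score: float  # HPO match to the clinical presentation, in [0, 1]


def combined_score(c: Candidate) -> float:
    """Assumed combination: the plain mean of the two component scores."""
    return (c.variant_score + c.phenotype_score) / 2.0


def rank_genes(candidates: Iterable[Candidate]) -> List[Candidate]:
    """Return candidates ordered from most to least likely causal gene."""
    return sorted(candidates, key=combined_score, reverse=True)
```

The point of combining the scores is that a highly deleterious variant in a gene with no phenotypic match drops down the list, while a moderately scored variant in a well matched gene rises.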

This approach is embodied in PhenIX: http://compbio.charite.de/PhenIX/

This has led to the development of a clinical bioinformatics workflow where the clinician supplies the HPO terms and runs the algorithm. Information is borrowed from OMIM and Orphanet in the process.

Prioritisation of variants is not a smoking gun for pathogenicity however. This needs to be backed up by Sanger sequencing validation, and co-segregation analysis within a family (if available). Effective diagnosis of disease will not lose the human component.

Exomiser (http://www.sanger.ac.uk/resources/databases/exomiser/query/), from Damian Smedley’s group at the Sanger Institute, was also introduced. It uses information from the mouse and zebrafish to increase its utility, as there is a huge amount of phenotype data from developmental biology studies of gene knockouts in other organisms.

3.2    Dan Bradley, Trinity College, Dublin: “Ancient population genomics: do it all, or not at all”

Dan gave a great talk on the sequencing of ancient DNA to look at population data. Ancient DNA is highly fragmented, and you’re generally working with 50-70 base fragments (generally worse than FFPE samples).

DNA from ancient samples actually undergoes a target enrichment step, largely to remove environmental sequence contamination, although it was noted that repetitive DNA can be problematic in terms of ruining a capture experiment.

From the ancient samples that were covered at 22x (I don’t expect that’s genome coverage, but target capture coverage) the samples were down-sampled to 1x data, and then 1kG data used to impute the likely genotypes. This actually recapitulated 99% of calls from the original 22x data, showing that this approach can be used to reconstruct ancestral population genomics information from very limited datasets, using very modern data.
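The down-sampling step is simple to illustrate. This is a sketch of random read subsampling only, assuming reads are kept independently with probability target/current; the imputation against 1kG data is not shown, and the function names are mine.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def downsample_fraction(current_cov: float, target_cov: float) -> float:
    """Fraction of reads to keep to go from current to target coverage."""
    if current_cov <= 0 or target_cov <= 0:
        raise ValueError("coverages must be positive")
    if target_cov > current_cov:
        raise ValueError("cannot down-sample to a higher coverage")
    return target_cov / current_cov


def downsample(
    reads: Sequence[T], current_cov: float, target_cov: float, seed: int = 0
) -> List[T]:
    """Keep each read independently with probability target/current."""
    keep = downsample_fraction(current_cov, target_cov)
    rng = random.Random(seed)  # fixed seed so the subsample is reproducible
    return [r for r in reads if rng.random() < keep]
```

Going from 22x to 1x keeps roughly one read in 22, which gives a sense of how little data the imputation had to work with.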

HGV2014 Meeting Report, Session 2 “THE TRACTABLE CANCER GENOME”


2.1    Lillian Su, University of Toronto: “Prioritising Therapeutic Targets in the Context of Intratumour Heterogeneity”

The central question is how we can move towards molecular profiling of a patient. Heterogeneity of cancers includes not just inter-patient difference but also intra-patient differences, either within a tumour itself, or a primary tumour and its secondary metastases.

Lillian was reporting on the IMPACT study, which has no fresh biopsy material available so works exclusively from FFPE samples. Their initial work has used a 40 gene TruSeq Custom Amplicon hotspot panel, but they are in the process of developing their own ‘550 gene’ panel, which will have its report integrated with the EHR system.

The 550 gene panel has 52 hereditary hotspots, 51 full length genes and the rest presumably hotspot locations. There are also 45 SNPs for QA/sample tracking.

Lillian went on to outline the difference between trial types and the effects of inter-individual differences. Patients can be stratified into ‘umbrella’ trials, which are histology led, or ‘basket’ trials, which are led by genetic mutations (as well as N-of-1 studies where you have unmatched comparisons of drugs).

But none of this addresses intra-patient heterogeneity, which is not really considered in clinical trial design. Not all genes have good concordance in terms of the mutation spectra between primary and metastatic sites (PIK3CA was given as an example). What is really required is a knowledge base of tumour heterogeneity before a truly effective trial design can be constructed. And how do you link alterations to clinical actions?

Critical paper: http://www.nature.com/nm/journal/v20/n6/abs/nm.3559.html “Whole-exome sequencing and clinical interpretation of formalin-fixed, paraffin-embedded tumor samples to guide precision cancer medicine”

Lillian outlined the filtering strategy for variants from FFPE and matched bloods. This was a MAF of <1% in 1kG data, a VAF of >5% and a DP>500x for the tumour, and DP>50x in matched bloods. Data was cross-referenced with COSMIC, TCGA, LSDBs and existing clinical trials, and missense mutations were characterized with PolyPhen, SIFT, LRT (likelihood ratio test) and MutationTaster.
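Those four thresholds translate directly into a filter. This is a minimal sketch and not the IMPACT pipeline: the `SomaticCall` record and its field names are hypothetical, and only the cut-off values come from the talk.

```python
from dataclasses import dataclass
from typing import Iterable, List

MAF_1KG_MAX = 0.01      # MAF <1% in 1kG data
TUMOUR_VAF_MIN = 0.05   # variant allele fraction >5% in the tumour
TUMOUR_DEPTH_MIN = 500  # DP >500x in the tumour
BLOOD_DEPTH_MIN = 50    # DP >50x in the matched blood


@dataclass
class SomaticCall:
    name: str           # hypothetical label for the variant
    maf_1kg: float      # population minor allele frequency
    tumour_vaf: float   # fraction of tumour reads supporting the variant
    tumour_depth: int   # read depth at the site in the FFPE tumour
    blood_depth: int    # read depth at the site in the matched blood


def passes_filters(c: SomaticCall) -> bool:
    """Apply the four thresholds; all comparisons are strict, as quoted."""
    return (
        c.maf_1kg < MAF_1KG_MAX
        and c.tumour_vaf > TUMOUR_VAF_MIN
        and c.tumour_depth > TUMOUR_DEPTH_MIN
        and c.blood_depth > BLOOD_DEPTH_MIN
    )


def filter_calls(calls: Iterable[SomaticCall]) -> List[SomaticCall]:
    """Return only the calls that satisfy every threshold."""
    return [c for c in calls if passes_filters(c)]
```

The depth requirement on the blood sample matters as much as the tumour one: without adequate germline coverage you cannot confidently say a variant is absent from the normal.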

They are able to pick out events like KRAS G12 mutations that are enriched on treatment, and this is a driver mutation, so the treatment enriches the driver over time.

Critical paper: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3436069/ “The molecular evolution of acquired resistance to targeted EGFR blockade in colorectal cancers”

Lillian sees WES/WGS, rather than panels, as an important long term investment, along with the use of RNA-Seq in investigating heterogeneity. Ideally you want a machine learning system overlaid over the NGS datasets. Deep sequencing of tumours early might give you some idea of whether the tumour heterogeneity is pre-existing or is a result of tumoural selection over time. It was acknowledged that this was hard to do for every patient, but it would answer long standing questions about whether resistant subclones are present and stable at the start of tumourigenesis.

2.2    Charles Lee, JAX labs: “Mouse PDX and Avatars: The Jackson Laboratory Experience”

PDX stands for “Patient Derived Xenografts”. This was an amazing talk, and as such I have few notes. The basic premise here is to take a tumour from a patient, segment it and implant the segments into immunodeficient mice where the tumours can grow. There was a lot of detail on the mouse strains involved, but the applications for this seem to be huge. Tumours can be treated in situ with a number of compounds and this information used to stratify patient treatment. The material can be used for CNV work, grown up for biobanking, expression profiling etc.

Fitting in with the previous talk, this model can also be used for investigating tumour heterogeneity as you can transplant different sections of the same tumour and then follow e.g. size in response to drug dosage in a number of animals all harbouring parts of the same original tumour.

Importantly this is not just limited to solid tumour work as AML human cell lines can also be established in the mice in a matter of weeks.

2.3    Frederica Di Nicolantonio, University of Torino, Italy: “Druggable Kinases in colorectal cancer”

The quote that stayed with me from the beginning of the talk was “Precision cancer medicine stands on exceptions”. The success stories of genomic guided medicine in cancer such as EGFR and ALK mutations are actually present in very small subsets of tumours. The ALK mutation is important in NSCLC tumours, but this is only 4% of tumours and only 2% respond. Colorectal cancer (CRC) is characterized by EGFR mutations and disruption of the RAS/RAF pathway.

However the situation is that you can’t just use mutation data to predict the response to a chemotherapeutic agent. BRAF mutations give different responses to drugs in melanomas vs. CRC because the melanomas have no expression of EGFR, owing to the differences in their embryonic origin.

Consequently in cell-line studies the important question to ask is whether the gene expression profiles of the cell line are appropriate to the tumour. This may determine the response to treatment, which may or may not be the same depending on how the cell line has developed during its time in culture. Are cell lines actually a good model at all?

Frederica made a point that RNA-Seq might not be the best for determining outlier gene expression and immunohistochemistry was their preferred route to determine whether the cell line and tumour were still in sync in terms of gene expression/drug response.

2.4    Nick Orr, Institute of Cancer Research, London “Large-scale fine-mapping and functional characterisation identifies novel breast cancer susceptibility loci at 9q31.2”

Nick started off talking about the various classes of risk alleles that exist for breast cancer. At the top of the list there are the high penetrance risk alleles in BRCA1 and BRCA2. In the middle there are moderate risk alleles at relatively low frequency in ATM and PALB2. Then there is a whole suite of common variants that are low risk, but population wide (FGFR2 mutations cited as an example).

With breast cancer the family history is still the most important predictive factor, but even so 50% of clearly familial breast cancer cases are genetically unexplained.

He went on to talk about the COGS study (website at http://nature.com/icogs), which involved a large GWAS of 10k cases and 12k controls. This was then followed up in a replication study of 45k cases and 45k controls.

Nick has been involved in the fine mapping follow up of the COGS data, but one of the important data points was an 11q13 association with TERT and FGFR2.

Critical paper: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3483423/ “Breast cancer risk-associated SNPs modulate the affinity of chromatin for FOXA1 and alter gene expression”

Data was presented on the fine mapping work, showing associated SNPs mapping to DNaseI hypersensitivity sites in MCF7 (a metastatic breast cancer cell line) as well as to transcription factor binding sites. This work relied on information from RegulomeDB: http://regulomedb.org/.

One of the most impressive feats of this talk was Nick reeling off 7 digit rsIDs repeatedly during his slides without stumbling over the numbers.

Work has also been performed to generate eQTLs. The GWAS loci are largely cis acting regulators of transcription factors.

HGV2014 Meeting Report, Session 1 “INTERPRETING THE HUMAN VARIOME”


1.1    Pui-Yan Kwok, UCSF: “Structural Variations in the Human Genome”

The talk focused on structural variant detection. The challenges were outlined as being:

  • Short reads
  • Repeats
  • CNVs
  • Haplotyping for compound heterozygote identification
  • Difficulty of analysis of SVs

Currently the approach is to map short reads to an imperfect assembly. It is imperfect because it is haploid, composite and incomplete with regards to gaps, N’s and repeat sizes.

Critical paper: http://www.nature.com/nature/journal/v464/n7289/full/nature08516.html

There are 1000 structural variations per genome, accruing to 24Mb/person, and 11,000 common ones in the population covering 4% of the genome (i.e. more than your exome).

ArrayCGH dup/del arrays don’t tell you about the location of your duplications and deletions. Sequencing only identifies the boundaries.

Presented a model of single molecule analysis on the BioNanoGenomics Irys platform. Briefly this uses a restriction enzyme to introduce single stranded nicks in the DNA, which are then fluorescently labelled. These are then passed down a channel and resolved optically to create a set of sequence motif maps – that is very much akin to an optical restriction endonuclease map. This process requires high molecular weight DNA, so presumably therefore not suitable for FFPE/archival samples.

The motifs are ‘aligned’ to each other via a clustering procedure.

Critical paper: http://www.nature.com/nbt/journal/v30/n8/full/nbt.2324.html

There are some technical considerations – the labelling efficiency is not 100% (a mismatch problem on alignment), and some nicks are too close together for optical resolution. The nicking process can make some sites fragile, causing breakup of the DNA into smaller fragments. The ‘assembly’ is still an algorithmic approach and by no means a perfect solution.

However this approach shows a great synergy with NGS for combinatorial data analysis.

They took the classic CEPH trio (NA12878/891/892) and made de novo assembled genome maps for the three individuals, generating ~259Gbases of data per sample. 99% of the data maps back to the GRCh38 assembly (I assume this is done via generating a profile of GRCh38 using an in silico nickase approach). The N50 of the assemblies is 5Mbases, and 96% of GRCh38 is covered by the assembled genomes.
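My guess about the in silico nickase step can be sketched as a motif scan over the reference. This is only an illustration of the idea: the default motif GCTCTTC is an assumption on my part (it is the recognition site of Nt.BspQI, a nickase commonly used for this kind of mapping), and the function names are hypothetical.

```python
from typing import List

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of an uppercase ACGT sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_sites(seq: str, motif: str = "GCTCTTC") -> List[int]:
    """0-based start positions of the motif on either strand, sorted."""
    seq = seq.upper()
    sites = set()
    for pattern in (motif, reverse_complement(motif)):
        start = seq.find(pattern)
        while start != -1:
            sites.add(start)
            start = seq.find(pattern, start + 1)
    return sorted(sites)


def label_intervals(sites: List[int]) -> List[int]:
    """Distances between consecutive sites: the pattern a genome map records."""
    return [b - a for a, b in zip(sites, sites[1:])]
```

The list of inter-site distances is the reference ‘fingerprint’ that measured molecule maps would be compared against; real alignment also has to tolerate missing and spurious labels, which this sketch ignores.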

This obviously enables things like gap sizing in the current reference genome. They were able to validate 120/156 known deletions, and identified 135 new ones. For insertions they validated 43/59 and found 242 new ones. A number of other mismatches were identified – 6 were switched insertion/deletion events, 9 were low coverage and for 31 there was no evidence.

The strength of the system is the ability to do tandem duplications, inversions and even complex rearrangements followed by tandem duplications. It also supports haplotyping, but critically you can tell where a CNV has arrived in the genome. This would enable applications like baiting the sequences in CNV regions and mapping the flanks. This allows you to produce diploid genome maps.

Critical paper: http://www.cell.com/ajhg/abstract/S0002-9297%2812%2900373-4

This platform therefore allows assessment of things like DUF1220-Domain copy number repeats, implicated in autism spectrum disorders and schizophrenia (repeat number increases in ASD, and decreases in schizophrenia).

1.2    Stephen Sherry, NCBI, Maryland: “Accessing human genetic variation in the rising era of individual genome sequence”

Stephen spoke about new NCBI services including simplified dbGaP data requests and the option to look for alleles of interest in other databases via Beacon services.

dbGaP is a genotype/phenotype database for researchers that presents its data consistent with the terms of the original patient consent. “GRU” items are “general research use” – these are broadly consented and genotyped or sequenced datasets that are available to all. This consists of CNV, SNP, exome (3.8k cases) and imputed data. PHS000688 is the top level ID for GRU items.

The Beacon system should be the jumping point for studies looking for causative mutations in disease to find out what other studies the alleles have been observed in rather than relying on 1KG/EVS data. This is part of the GA4GH project and really exists so a researcher can ask a resource if it has a particular variant.

At some point of genome sequencing we will probably have observed a SNP event in one in every two bases, i.e. there will be a database of 1.5 billion variant events. And critically we lack the kind of infrastructure to support this level of data presentation. And the presentation is the wrong way around. We concern ourselves with project/study level data organization but this should be “variant” led – i.e. you want to identify which holdings have your SNP of interest. This is not currently possible, but the Beacon system would allow this kind of interaction between researchers.

There are a number of Beacons online, which are sharing public holdings such as 1KG. The NCBI, GA4GH, Broad, EBI are involved. There is even a meta-Beacon that allows you to query multiple Beacons.
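The yes/no nature of a Beacon, and what a meta-Beacon adds, can be shown with a toy model. This is an in-memory sketch only and does not use the real GA4GH Beacon API: `ToyBeacon`, `query_beacons` and the example holdings are all hypothetical.

```python
from typing import Dict, Iterable, Set, Tuple

VariantKey = Tuple[str, int, str]  # (chromosome, position, alternate allele)


class ToyBeacon:
    """A data holder that answers only 'have you seen this allele?'."""

    def __init__(self, name: str, variants: Iterable[VariantKey]) -> None:
        self.name = name
        self._variants: Set[VariantKey] = set(variants)

    def has_variant(self, chrom: str, pos: int, alt: str) -> bool:
        """Return presence or absence, with no sample-level context."""
        return (chrom, pos, alt) in self._variants


def query_beacons(
    beacons: Iterable[ToyBeacon], chrom: str, pos: int, alt: str
) -> Dict[str, bool]:
    """Meta-Beacon style query: ask every holder and collect the answers."""
    return {b.name: b.has_variant(chrom, pos, alt) for b in beacons}
```

A ‘True’ from one holder tells you who to contact, and nothing else, which is exactly the variant-led entry point Stephen was describing.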

This introduces a new workflow – really it allows you to open a dialogue between yourself and the data holder. The existence of a variant is still devoid of context, but you can contact the data holder and then enter a controlled access agreement for the metadata, or information down to the read level.

Machine mining of Beacon resources is prohibited. However the SRA toolkit allows access to dbGaP with security tokens, which allows automated querying of SRA related material with local caching.

1.3    Daniel Geraghty, FHCRC, Seattle “Complete re-sequencing of extended genomic regions using fosmid target capture and single molecule, real time (SMRT) long-read sequencing technology”

This talk introduced a fosmid enrichment strategy followed by SMRT sequencing for characterizing complex genomic regions.

The premise was set up by suggesting that GWAS leaves rare variants undetected. Fosmid based recloning of HLA has been demonstrated.

Critical paper: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1199539/

The first step is building a fosmid library, which is then plated out. Molecular inversion probes are used to identify fosmids from the region of interest. Single clones are then extracted and sequenced extensively. This obviously means you need a fosmid library for each individual you’re looking at, and it is not a hybridization extraction method like using BACs as baits for large regions.

Sequencing is done on PacBio both for speed (faster than a MiSeq) and read length. At this point the data can be assembled by Velvet, or even by the venerable Phrap/Consed approaches. About 40-100 PacBio reads are required to assemble a fosmid clone.

Quiver can be used to find a consensus sequence, and once a fosmid has been assembled, it can be coassembled with other fosmids that have been similarly reconstructed to get regions of 800kb.

The question was raised whether it might be possible to bypass the fosmid step with other recombineering approaches to work directly with gDNA and MIPs.

1.4    Peter Byers, UWASH, Seattle: “Determinants of splice site mutation outcomes and comparison of characterisation in cultured cells with predictive programs”

Peter talked about the prediction of splice mutation effects with particular reference to the collagen genes. 20% of collagen mutations are splice site mutations (these genes have lots of exons). These are pathogenic in a spread of osteogenesis imperfecta (OI) disorders. It is complex because we have to consider not only the effects on splice donor and splice acceptor sites but also the effects on lariat sequences within introns.

Consequently there are a number of downstream effects – the production of cryptic splice sites, intron retention, exon skipping (which tends to lead to more severe phenotypes). But this is made more complex again by the fact a single variant can have multiple outcomes and there’s no clear explanation for this.

This complexity means that it is hard to produce a computational prediction program that takes into account all the uncertainties of the system, especially at locations 3, 4 or 5 bases outside the splice site.

SplicePort and Asseda were tested, and Asseda came out on top in the tests, with a mere 29% of events wrongly predicted when compared with experimental evidence. So what is happening to make these predictions incorrect?

Peter explained that the order of intron removal is specific to the gene but shared between individuals. There was no global model for what that order might be; however, it must be encoded in some way by intronic sequence. The speed of intron removal and the effects on the mature mRNA are incredibly important to the pathogenesis of the disease. It was clearly shown that the splicing events under study were determined by the speed of intron removal as the RNA matured.

If you want to predict the splicing effect of a mutation, you therefore need some information about the order of intron processing in the gene you’re looking at to have a completely holistic view of the system. How do you generate this information systematically? It’s a very labour intensive piece of work, and Peter was looking for suggestions on how best to mine RNA-Seq data to get to the bottom of this line of enquiry. Is it possible even to do homology based predictions of splicing speed and therefore splicing order?

OGT NGS Team seeking a placement student

Edit: We found ourselves a student, and we’ve welcomed Agatha Treveil to the team!

Every year we take on a placement/sandwich student from a UK University and have them work within the NGS Team in the Computational Biology group at Oxford Gene Technology.

This year is no different, and we’re looking for someone to start somewhere in the July/August 2014 window to work with us for a year.

Our previous students have come from a number of institutions, and one has not only won a Cogent Life Sciences skills award for “Placement of the Year”, but has also joined the NGS Team after graduating. Traditionally we’ve recruited from Natural Sciences, Biomedical Sciences and Biochemistry undergraduate courses – no previous bioinformatics or programming skills are required, just a willingness to deal with lots of DNA and RNA-Seq data in a commercial environment. The post involves exome sequencing, targeted resequencing and RNA-Seq analysis (whole transcriptome, small RNA) and would be coming in at an exciting point as we start to broaden our NGS services.

This is very much a ‘hands on’ position, and we specialise in taking in biologists and having them leave as bioinformaticians.

Here’s the full text of the advert:

Computational Biology (salary approx £14K p.a)
We are looking for a candidate who has (or would like to develop) programming skills and an interest in biology; is keen to learn new techniques, gain experience in a biotechnology company, and who has a particular interest in the interface of numerical methods and life sciences.

The placement will potentially involve projects in all of our scientific areas, but will begin with a focus on Next Generation Sequencing analysis where the student will learn pipeline development and data analysis. The student will gain experience working with clinical and academic scientists, and an understanding of the data processing aspects of the most exciting experimental technologies currently being applied in biological and clinical research and practice.

To apply for this position, please send your CV and covering letter (clearly stating you wish to apply for the industrial placement and which position you are applying for) to hr@ogt.com or by post to:
HR, Oxford Gene Technology, Begbroke Science Park, Begbroke Hill, Woodstock Road, Begbroke, Oxfordshire, OX5 1PF

Closing date for entries 30 April 2014
For further information about OGT please visit http://www.ogt.com/

Two posts available at Oxford Gene Technology

So we have two positions currently open at OGT – one in the Computational Biology group and one as a Product Manager for Next Generation Sequencing.  The Product Manager post reports to the Director of Strategic Marketing at OGT and is aimed at an experienced product manager, preferably with experience in the NGS space.  You can read about that position by following the link.

The position within the Computational Biology group is not a full-time NGS position (and therefore you may be pleased to know, would not report directly to me!).  The previous post holder was responsible for a great deal of the array analysis that is required to develop microarray products (a core part of OGT’s business) but was called upon to support NGS work when required.  Candidates with a good statistical grounding are being sought, and experience with microarray data analysis is advantageous.  The full post description can be found by following the link.

CVs for both posts should be directed to hr@ogt.com

Short-read alignment on the Raspberry Pi

This week I invested a little bit of spare cash in the Raspberry Pi.  Now that there’s no waiting time for these, I bought mine from Farnell’s Element 14 site, complete with a case, a copy of Raspbian on SD card and a USB power supply.  Total costs, about 50 quid.

First impressions are that it is a great little piece of hardware. I’ve always considered playing with an Arduino, but the Pi fits nicely into my existing skill set.  It did get connected to the TV briefly just to watch a tiny machine driving a 37″ flatscreen TV via HDMI.  I’m sure it’s just great, if your sofa isn’t quite as far away from the TV as mine is. So with sshd enabled on the Pi it is currently sat on the mantelpiece, blinking lights flashing, running headless.

The first thing it occurred to me to do was some benchmarking.  What I was interested in is the capacity of the machine to do real world work.  I’m an NGS bioinformatician so the obvious thing to do was to throw some data at it through some short-read aligners.

I’m used to human exome data, or RNA-Seq data that generally encompasses quite a few HiSeq lanes, and used to processing them in large enough amounts that I need a few servers to do it.  I did wonder however whether the Pi might have enough grunt for smaller tasks, such as small gene panels, or bacterial genomes.  Primarily this is because I’ve got a new project at work which uses in solution hybridisation and sequencing to identify pathogens in clinical samples, and it occurred to me that the computing requirements probably aren’t the same as what I’m used to.

The first thing I did was to take some data from wgsim generated from an E.coli genome to test out paired-end alignment on 100bp reads.

Initially I thought I would try to get Bowtie2 working, on the grounds that I wasn’t really intending to do anything other than read mapping and I am under the impression it’s still faster than BWA.  BWA does tend to be my go-to aligner for mammalian data.  However I quickly ran into the fact that there is no armhf build of bowtie2 in the Raspbian repository.  With the code downloaded I was struggling to get it to compile from source, and I was in the middle of setting up a cross-compiling environment so I could do the compilation on my much more powerful EeePC 1000HE(!) when it occurred to me that someone might have been foolish enough to try this before.  And they had.  The fact is that bowtie2 requires a CPU with an SSE instruction set – i.e. an x86 chip such as Intel’s.  So whilst it might work on the Atom CPU in the EeePC it’s a complete non starter on the ARM chip in the Pi.

Bowtie1 however is in the Raspbian repository.  And I generated 1×10^6 reads as a test dataset after seeing that it was aligning the 1000 read dataset from bowtie with some speed.  This took 55 minutes.

I then picked out a real-world E.coli dataset from the CLC Bio website.  Generated on the GAIIx, these are 36bp PE reads, around 2.6×10^6 of them.

BWA 0.6.2 is also available from the Raspbian repos (which is more up to date than the version in the Xubuntu distro I notice, probably because Raspbian is tracking the current ‘testing’ release, Wheezy).

So I did a full paired-end alignment of this real-world data with both aligners, making sure both wrote their output as SAM.  I quickly ran out of space on my 4GB SD card, so all data was written out to an 8GB attached USB thumb drive.
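The exact command lines were not recorded in this post, so the following is only a sketch of how the two default-option runs might be scripted from Python. It assumes the standard Bowtie1 workflow (`bowtie-build`, then `bowtie -S` for SAM output) and the BWA 0.6-era workflow (`bwa index`, `bwa aln` for each read file, then `bwa sampe`). All file and directory names are hypothetical placeholders.

```python
import subprocess
from pathlib import Path


def alignment_plan(ref, reads1, reads2, outdir):
    """Build the list of steps as (argv, stdout_path) pairs.

    stdout_path is None when the tool writes its own output files,
    otherwise it is the file that standard output is redirected to.
    All paths are hypothetical; nothing here is taken from the post.
    """
    out = Path(outdir)
    bowtie_index = str(out / "bowtie_index")
    sai1 = str(out / "reads_1.sai")
    sai2 = str(out / "reads_2.sai")
    return [
        # Bowtie1: build the index, then align pairs and write SAM (-S).
        (["bowtie-build", ref, bowtie_index], None),
        (["bowtie", "-S", bowtie_index, "-1", reads1, "-2", reads2,
          str(out / "bowtie.sam")], None),
        # BWA: index, align each end separately, then pair them up.
        (["bwa", "index", ref], None),
        (["bwa", "aln", ref, reads1], sai1),
        (["bwa", "aln", ref, reads2], sai2),
        (["bwa", "sampe", ref, sai1, sai2, reads1, reads2],
         str(out / "bwa.sam")),
    ]


def run_plan(plan):
    """Run each step in order, stopping at the first failure."""
    for argv, stdout_path in plan:
        if stdout_path is None:
            subprocess.run(argv, check=True)
        else:
            with open(stdout_path, "wb") as handle:
                subprocess.run(argv, check=True, stdout=handle)


plan = alignment_plan("ecoli.fa", "reads_1.fq", "reads_2.fq", "/mnt/usb")
```

Calling `run_plan(plan)` would execute the steps. Pointing `outdir` at the USB drive rather than the SD card mirrors the space problem described above.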

Bowtie1 took just over an hour to align this data (note reads and genome for alignment are from completely different E.coli strains)

Time loading reference: 00:00:00
Time loading forward index: 00:00:00
Time loading mirror index: 00:00:00
Seeded quality full-index search: 01:01:31
# reads processed: 2622382
# reads with at least one reported alignment: 1632341 (62.25%)
# reads that failed to align: 990041 (37.75%)
Reported 1632341 paired-end alignments to 1 output stream(s)
Time searching: 01:01:32
Overall time: 01:01:32
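As a quick sanity check on the figures in that summary, a few lines of Python can recompute the alignment rate and a rough throughput. This is just an illustrative sketch: the function names are made up for this example, and "reads" is used in the same sense as in Bowtie's own report.

```python
import re

# Relevant lines copied from the Bowtie1 summary above.
BOWTIE_SUMMARY = """\
# reads processed: 2622382
# reads with at least one reported alignment: 1632341 (62.25%)
# reads that failed to align: 990041 (37.75%)
Overall time: 01:01:32
"""


def hms_to_seconds(text):
    """Convert an HH:MM:SS string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def summarise(log):
    """Pull the counts and wall time out of a Bowtie1 summary."""
    processed = int(re.search(r"# reads processed: (\d+)", log).group(1))
    aligned = int(re.search(
        r"# reads with at least one reported alignment: (\d+)", log).group(1))
    failed = int(re.search(
        r"# reads that failed to align: (\d+)", log).group(1))
    seconds = hms_to_seconds(
        re.search(r"Overall time: (\d+:\d+:\d+)", log).group(1))
    return {
        "processed": processed,
        "aligned": aligned,
        "failed": failed,
        "seconds": seconds,
        "aligned_fraction": aligned / processed,
        "reads_per_second": processed / seconds,
    }


summary = summarise(BOWTIE_SUMMARY)
print(f"{summary['aligned_fraction']:.2%} aligned, "
      f"{summary['reads_per_second']:.0f} reads/s")
```

On these numbers the run works out at roughly 700 reads per second on the Pi.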

I was a little surprised that actually BWA managed to do this a little faster (please note aligners were run with default options).  I only captured the start and end of this process for BWA.

Align start: Sat Jan 26 22:36:06 GMT 2013
Align end: Sat Jan 26 23:29:31 GMT 2013

Which brings the total alignment time for BWA to 53 minutes and 25 seconds.
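The subtraction can be checked, and the two aligners put on the same footing, with a short Python sketch. The timestamp format string is an assumption based on how the two lines above are laid out, and it relies on English day and month names.

```python
from datetime import datetime

# Layout of the two timestamps above; "GMT" is matched literally
# because both stamps are in the same time zone.
STAMP_FORMAT = "%a %b %d %H:%M:%S GMT %Y"


def elapsed(start, end):
    """Return the time between two timestamps as a timedelta."""
    return (datetime.strptime(end, STAMP_FORMAT)
            - datetime.strptime(start, STAMP_FORMAT))


bwa_elapsed = elapsed("Sat Jan 26 22:36:06 GMT 2013",
                      "Sat Jan 26 23:29:31 GMT 2013")
bwa_seconds = int(bwa_elapsed.total_seconds())

# Bowtie1 reported an overall time of 01:01:32 for the same data.
bowtie_seconds = 1 * 3600 + 1 * 60 + 32
reads_processed = 2622382  # from the Bowtie1 summary

print("BWA elapsed:", bwa_elapsed)
print("BWA reads/s:", round(reads_processed / bwa_seconds))
print("Bowtie1 reads/s:", round(reads_processed / bowtie_seconds))
print("Bowtie1 took", bowtie_seconds - bwa_seconds, "seconds longer")
```

Bear in mind that the BWA figure is wall-clock time between two `date`-style stamps, while the Bowtie1 figure is the aligner's own report, so the comparison is only approximate.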

Anyway it was just a little play to see how things stacked up.  I think it’s fantastic that a little machine like the Pi has enough power to do anything like this.  It’s probably more of a comment on the fact that the people behind the aligners have managed to write such efficient code that this can be done without exceeding the 512MB of RAM.  Bowtie’s memory usage was apparently lower than BWA’s during the test runs, though.

I always thought that the ‘missing aspect’ of DIYbio was getting people involved with bioinformatics, instead the community seemed desperate to follow overly ambitious plans to get involved in synthetic biology.  And it seemed to me that DIYbio should sit in the same amateur space that amateur astronomy does (i.e. within the limitations of equipment that you can buy without having to equip a laboratory).  And for a low cost entry into Linux, with enough grunt to play with NGS tools and publicly available data, it’s hard to fault the very compact Raspberry Pi. Now I just need to see exactly where the performance limits are!

Thoughts on a year in industry

A year ago I left the safe environs of academia and decided to move to industry.  I said farewell to my final salary pension, my Mac-centric mode of life, my newly-purchased house and the place that had been my home for the last 7 years to return to Oxford and enter a world dictated by the cold logic of business.

Why did I move?

I think quite a few people were wondering this at the time.  Aside from the fact that 2011 had started as a most abysmal year (personal issues, not professional) there were a number of factors leading to my departure, but on the face of it the move may have seemed rash.  The Bioinformatics Support Unit at Newcastle University was running very well, publications were flowing, costs were being recovered to the satisfaction of the Faculty.  I had a great set of friends, colleagues and co-workers.

One of the things about bioinformatics support work is that it is, by its very nature, diverse.  This is great for not getting bored day to day, but not so great if you want to specialise in a field.  My work was split mostly between arrays and NGS (and mainly arrays) alongside the financial management of the Unit, student supervision (several PhD students and Masters students) and a dozen or so odd little projects that come your way in that kind of job.

My heart however has always been with genomics.  My favourite part of my PhD was always the sequencing.  In the hot-lab, up to my elbows in acrylamide and isotopes, all for the joy of pulling the autoradiograph film from the developer and spending the next couple of hours typing it into DNAStar before applying whatever gene/exon/splice-site prediction software I had committed to that day.  The future, to me, looks like it’s going to be heavily flavoured with NGS.

I had decided a long time ago never to enter industry, the by-product of a difficult year at Glaxo as a sandwich student.  I hated the feeling of being the smallest cog in a giant, impenetrable, deeply impersonal, multinational pharma.  From the people who I saw there, struggling with their own academia to industry transitions, to daily pickets from animal rights groups, to people who on handing in their notice, were marched from their offices to be dispatched from the premises without even a chance to pick up their personal belongings.  It didn’t seem like it was such a great place to be.

In late 2010 I started to get approaches from recruiters, all the positions were with NGS firms, or NGS related firms.  Some still around, some now counting down the days to their demise.  After a couple of months of weighing up whether I wanted to commit to the jump, the perfect job advert crossed my desk.  For the first time, I phoned a recruiter.  And that job was with OGT.

A year later, I thought it might be nice to summarise what I thought of the change.

What does the role entail?

No longer ‘bioinformatician at large’, I now have responsibility for developing our NGS analysis pipelines and returning the data from them to academic and commercial customers. We have built an extensive exome analysis pipeline which analyses not just exome samples, but also does comprehensive trio analysis and analyses cancer samples. A lot of data passes through this pipeline, and I couldn’t have done it without my fantastic sandwich student David Blaney, who I hope has had a much better year out in industry than I did.  We’ve built an RNA-Seq pipeline too, shortly to be launched as a service.

I’m involved in a number of grant programmes internally, from solid tumour cancer diagnostics for stratified medicine, to pathogen screening and host/pathogen interactions – all from an NGS perspective.  We have a Genomics Biomarkers team as well, and they obviously have an increasing need for NGS approaches.

So what is the same?

Well I’m still doing bioinformatics. Arguably I’m doing more bioinformatics than in my previous role. I still get to interact with customers, although this took a while to be direct, rather than mediated via the sales team. I think you have to earn a certain amount of trust when entering a new role, but having done nothing but talk to customers for 7 years, I didn’t initially appreciate that there might be good procedural reasons for having an intermediate layer of communications with customers.

This is still one of the most satisfying parts of the role, delivering results and analyses back to researchers or commercial customers is great. Especially when you’re getting great feedback back about the quality of the data, and the findings from it.  Even better when they come up at a conference, shake your hand and tell you about the papers that have been submitted.

This is one thing about doing a lot of exome sequencing work for rare diseases – you get a lot of diagnostic power, and consequently a lot of hits. My name still goes on papers, we have just had a paper accepted that comes out in the AJHG in August and favourable noises from a pre-submission enquiry with a very high-impact journal for another.  Both exhibiting (we believe) absolutely novel classes of discovery from exome data.

What is different?

I talk to people from a much wider background at work. No longer talking to just biologists and computer scientists and fellow bioinformaticians, I now get to talk to enthusiastic people in the sales and marketing departments. I’m now much more intimately connected to the lab again, thanks to both the R&D and services work.

It helps that OGT has a touch over 60 employees, it’s small enough to feel genially personable. I actually get to talk to the VPs and CEO. Regularly.

I get to travel more. This was something of a self-imposed rule at Newcastle – when you’re managing your own finances, trips to conferences don’t do much for the balance sheet. They simply don’t generate any revenue. Now the reasons have a much more financial focus, if I go away, I go away with one of the sales team. We do roadshows, conferences. I am now one of those people who stands on the company booth and talks to you, rather than the person who goes to a conference to listen to talks. However you are there to generate leads, not listen to talks. The cost of going must be balanced against the gains from the leads.

This is another aspect that has been very different. I have had an increasing interest in the business side of the life sciences for some time, but lacked any practical experience. This is now changing: I now understand how a business operates, what the margins need to be on a sale, the balance between selling products and selling services.

Because of the size of OGT I get exposure to this, I doubt it would happen in a larger company. I get involved in product development, I help to write product profiles, I’ve developed, and continue to develop, marketing materials for the website. These are all new skills for me, and I love to learn.

Another thing I’ve noticed is the makeup of the company is very different to academia. I work in a phenomenally talented group of computational biologists, who are skilled in software design, software development and all facets of bioinformatics analysis.  But not everyone has a PhD. Not everyone has a biology background.  And these are things I took for granted in academia. If anything I have become more and more convinced that a PhD is of little consequence, especially for people who, like me, have switched discipline after getting it.  My colleagues are people who have worked in the more quantitative fields of accounting or investment banking, but retooled for bioinformatics, and have done so with aplomb.

Social networking changes

I think most people who interact with me online will have noticed that I don’t blog, tweet or participate in BioStar as much anymore.  I spend a lot of time on SeqAnswers, and my RSS reader is now top heavy with NGS related blogs, but participation is down.

There are just commercial pressures which mean I can’t always blog about what I’m doing, and believe me there are some things at work I do under CDA/NDA that I would really love to talk about. It’s not just that I can’t blog about them, I can’t even talk to you about them over a pint of beer.  This is something I have had to accept about the commercial environment.  The IT policy at work is incredibly strict, to maintain the ISO information security standards that we have.  I’ve learned to adapt to this, and the Windows-centric environment.

The biggest issue though? Inability to get to papers.  Oh how I took for granted the access to papers I had at Newcastle.  I just want to say a big thank you to everyone who has sent me a paper on request in the last year, you have been invaluable to me, and it is deeply appreciated.

Was it worth it?

Absolutely.  Life at OGT is hectic, pressured but deeply rewarding.  I have the focus that I wanted, but with the diversity of a new set of challenges.  I think I’ve been very lucky to settle into a company that is the perfect size and makeup to transition gently from academia into the commercial world.  It might not be for everyone, but I will say I wish I had done it sooner.  I harboured doubts about industry, but they were predicated on my experiences with a giant company.  Sitting now in a position that is in a long-established SME that is on a sound financial footing (as opposed to giant multinational, or precarious start-up), I wonder what I was concerned about.