Author Archives: Daniel

Blog changes, update your RSS..

So I have decided to separate out my personal and work blogs in advance of my start in a couple of weeks at The Genome Analysis Centre in Norwich.

All the old work related content will remain here, but everything that was tagged work has been shifted to and all future work related content will appear there.  If you just want to get straight to the RSS feed, just hit the link: Now that the commercial requirements for silence on blogging about work have been removed I hope that it will be something I can resume with a little more regularity.

This will allow me to post on non-genomics/bioinformatics topics here. In other words, probably not much will happen here and I get stuck into NaNoWriMo again. But you have been warned.

15th International Conference on Human Genome Variation – Meeting report

Last week I was lucky enough to attend the HGV2014 meeting at the Culloden Hotel in Belfast. It was my first trip to Northern Ireland and my first attendance at an HGV meeting. The meeting is small and intimate, but had a great wide-ranging programme, and I would heartily recommend attending if you get the chance and have an interest in clinical or human genomics.

Have a look at the full programme here:

Here’s a link to my write-ups for each session  (where I had notes that I could reconstruct!):

  1. Interpreting the human variome
  2. The tractable cancer genome
  3. Phenomes, genomes and archaeomes
  4. Answering the global genomics challenge
  5. Improving our health: Time to get personal
  6. Understanding the evolving genome
  7. Next-gen ‘omics and the actionable genome



Caveats: I have not taken notes in every talk of every session, a lack of notes for a particular speaker does not constitute disinterest on my part, I simply took notes for the talks that were directly related to my current work. If I have misquoted, misrepresented or misunderstood anything, and you are the speaker concerned, or a member of the team involved in the work, please leave a comment on the post, and I will rectify the situation accordingly.

7.1    Christine Eng, Baylor College of Medicine: “Clinical Exome Sequencing for the Diagnosis of Mendelian Disorders”

Christine spoke about the pipeline for clinical WES at Baylor. Samples are sequenced to 140x so that at least 85% of the exome is covered at >40x. A SNP array is run in conjunction with each sample, and concordance between the sequencing calls and the array must exceed 99%.
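The two acceptance thresholds above can be sketched as a small per-sample check. This is only an illustration of the arithmetic, not the Baylor pipeline: the function names, the input shapes (a list of per-base depths and two dictionaries of genotype calls keyed by site) and the site labels are all assumptions of mine.

```python
def fraction_covered(depths, min_depth):
    """Fraction of target bases with depth strictly above min_depth."""
    if not depths:
        return 0.0
    return sum(1 for d in depths if d > min_depth) / len(depths)


def concordance(seq_calls, array_calls):
    """Fraction of shared sites where sequencing and array genotypes agree."""
    shared = [site for site in seq_calls if site in array_calls]
    if not shared:
        return 0.0
    matches = sum(1 for site in shared if seq_calls[site] == array_calls[site])
    return matches / len(shared)


def passes_qc(depths, seq_calls, array_calls,
              min_depth=40, min_fraction=0.85, min_concordance=0.99):
    """Apply the two thresholds quoted in the talk to a single sample."""
    return (fraction_covered(depths, min_depth) >= min_fraction
            and concordance(seq_calls, array_calls) > min_concordance)


# Hypothetical sample: 90% of bases well covered, array calls all agree.
depths = [150] * 90 + [10] * 10
calls = {"site1": "AG", "site2": "GG"}
print(passes_qc(depths, calls, calls))
```

In practice the depths would come from a coverage tool and the calls from the VCF and array output, but the pass/fail logic is no more than this.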

BWA is the primary mapper, but variants are called with ATLAS and annotated with Cassandra (Annovar is a dependency of Cassandra)

Critical resource:

Critical resource:

Critical paper: “A SNP discovery method to assess variant allele probability from next-generation resequencing data”

Variants are filtered against HGMD and restricted to those with a MAF below 5%. 4000 clinical exomes have been run internally, so there is a further requirement for variants to have a MAF below 2% in this dataset.
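As a rough sketch of the two-tier frequency filter described above (population MAF below 5%, internal-cohort MAF below 2%), assuming each variant is a plain dictionary with hypothetical `population_maf` and `internal_maf` fields:

```python
def keep_variant(variant, max_population_maf=0.05, max_internal_maf=0.02):
    """Keep a variant only if it is rare in both reference datasets."""
    return (variant["population_maf"] < max_population_maf
            and variant["internal_maf"] < max_internal_maf)


# Hypothetical variants; the identifiers and frequencies are made up.
variants = [
    {"id": "var_a", "population_maf": 0.001, "internal_maf": 0.0005},
    {"id": "var_b", "population_maf": 0.200, "internal_maf": 0.1500},
    {"id": "var_c", "population_maf": 0.010, "internal_maf": 0.0400},
]

rare = [v["id"] for v in variants if keep_variant(v)]
print(rare)
```

The third variant shows why the internal dataset matters: it looks rare in the population data but is common among the lab's own exomes, which is more suggestive of a systematic artefact or a local polymorphism than of a disease variant.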

The gene list for the system is updated weekly, and VOUS in genes related to the disorder are reported to all patients – this is much more extensive reporting than that offered by groups who feel VOUS muddy the waters.

An expanded report can be requested in addition, which also reports deleterious mutations in genes for which there is no disease/phenotype linkage. The hit rate for molecular diagnosis via clinical exome is 25%, so 75% of cases are not clinically solved. These patients are then asked if they would like to opt in to a research programme so that the data can be shared and aggregated for greater diagnostic power.

11/504 cases had two distinct disorders presenting at the same time. 280 cases were autosomal dominant and 86% of the dominant cases are de novo mutations. 187 cases were autosomal recessive and this was 57% compound heterozygous, 3% UPD and 37% had homozygosity due to shared ancestry.

Many initially unsolved cases can be successfully resolved on revisiting the data 6-12 months later, such is the pace of new data deposition.

They use guidelines from CPIC (from PharmGKB) and data on drug/gene interactions and there is linking to a prescription database, so the pipeline is ‘end to end’.

Critical resource:



6.1    Yves Moreau, University of Leuven, Belgium: “Variant Prioritisation by genomic data fusion”


An essential part of the prioritization process is the integration of phenotype.

Critical paper: “Exome sequencing and disease-network analysis of a single family implicate a mutation in KIF1A in hereditary spastic paraparesis”

Yves introduced “Endeavour”, which takes a gene list, matches it to the disease of interest and ranks the genes, but this requires phenotypic information to be ‘rich’. Two main questions need to be addressed: 1) What genes are related to a phenotype? and 2) Which variants in a gene are pathogenic? Candidate gene prioritization is not a new thing, and has a long history in microarray analysis. Whilst it’s easy to interrogate things like pathway information, GO terms and literature, it is much harder to find relevant expression profile information or functional annotation, and existing machine learning tools do not really support these data types.

Critical paper: “Gene prioritization through genomic data fusion.”

Critical resource:

Endeavour can be trained, ranks genes according to various criteria and then merges the ranks using order statistics.
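To make the rank-merging step a little more concrete, here is a minimal sketch. My notes do not record the exact formula used, so this assumes the standard joint cumulative distribution of uniform order statistics: each gene's rank in each data source is turned into a rank ratio (rank divided by list length), and a Q value is computed that is small when the gene ranks consistently near the top. The function names and toy gene lists are hypothetical, and this is not Endeavour's own code.

```python
from math import factorial


def order_statistic_q(rank_ratios):
    """Joint CDF of N uniform order statistics evaluated at the sorted ratios.

    Uses the recursion V_k = sum_{i=1..k} (-1)^(i-1) * V_(k-i) / i! * r_(N-k+1)^i
    with V_0 = 1, and returns Q = N! * V_N.
    """
    r = sorted(rank_ratios)
    n = len(r)
    v = [1.0]  # V_0
    for k in range(1, n + 1):
        r_k = r[n - k]  # r_(N-k+1) in 1-based notation
        total = 0.0
        for i in range(1, k + 1):
            total += (-1) ** (i - 1) * v[k - i] / factorial(i) * r_k ** i
        v.append(total)
    return factorial(n) * v[n]


def merge_rankings(rankings):
    """Merge several complete rankings of the same genes into one ordering."""
    genes = sorted(rankings[0])
    scores = {}
    for gene in genes:
        ratios = [(ranking.index(gene) + 1) / len(ranking) for ranking in rankings]
        scores[gene] = order_statistic_q(ratios)
    # Smaller Q means the gene is ranked consistently near the top.
    return sorted(genes, key=lambda g: scores[g])


# Three hypothetical data sources ranking the same three genes.
rankings = [["A", "B", "C"], ["A", "C", "B"], ["B", "A", "C"]]
merged = merge_rankings(rankings)
print(merged)
```

With a single data source Q is simply the rank ratio itself, which is a handy sanity check on the recursion.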

Next, eXtasy was introduced: another variant prioritization tool for non-synonymous variants given a specific phenotype.

Critical resource:

Critical paper: “eXtasy: variant prioritization by genomic data fusion”

eXtasy allows variants to be ranked by effects on structural change in the protein, association in a case/control or GWAS study, and evolutionary conservation.

The problem though is one of multiscale data integration – we might know that a megabase region is interesting through one technique, a gene is interesting by another technique, and then we need to find the variant of interest from a list of variants in that gene.

They have performed HGMD to HPO mappings (1142 HPO terms cover HGMD mutations). It was noted that Polyphen and SIFT are useless for distinguishing between disease causing and rare, benign variants.

eXtasy produces rankings for a VCF file by taking the trained classifier data and using a random forest approach to rank. One of the underlying assumptions of this approach is that any rare variant found in the 1kG dataset is benign as they are meant to be nominally asymptomatic individuals.

These approaches are integrated into NGS-Logistics, a federated analysis of variants over multiple sites, which has some similarities to the Beacon approaches discussed previously. NGS-Logistics is a project looking for test and partner sites.

Critical paper:

Critical resource:

However, it’s clear that a database of benign variants is required just as much as a perfect database of pathogenic mutations: local population controls for ethnicity matching, but also high MAF variants and rare variants in asymptomatic datasets.

6.2    Aoife McLysaght, Trinity College Dublin: “Dosage Sensitive Genes in Evolution and Disease”


Aoife started by saying that most CNVs in the human genome are benign. The quality that makes a CNV pathogenic is that of gene dosage. Haploinsufficiency (where half the product != half the activity) affects about 3% of genes in a systematic study in yeast. This is going to affect certain classes of genes, for instance those where concentration dependent effects are very important (morphogens in developmental biology for example).

This can occur through mechanisms like a propensity towards low affinity promiscuous aggregation of protein product. Consequently the relative balance of genes can be the problem where it affects the stoichiometry of the system.

This is against the background of clear genome duplication over the course of vertebrate evolution. This would suggest that dosage sensitive genes should be retained after subsequent genome chromosomal rearrangement and loss. About 20-30% of the genes can be traced back to these duplication events and they are enriched for developmental genes and members of protein complexes. These are called “ohnologs”

What is interesting is that 60% of these are never associated with CNV events or deletions and duplications in healthy people and they are highly enriched for disease genes.

Critical paper: “Ohnologs are overrepresented in pathogenic copy number mutations”

6.3    Suganthi Balasubramanian, Yale: “Making sense of nonsense: consequence of premature termination”

Under discussion in this talk was the characterization of Loss of Function (LoF) mutations. There’s a lot of people who prefer not to use this term and would rather break them down into various classes, which can include:

  • Truncating nonsense SNVs
  • Splice disrupting mutations
  • Frameshift indels
  • Large structural variations

The average person carries around a hundred LoF mutations of which around 1/5th are in a homozygous state.

It was commented that people trying to divine information from e.g. 1kG datasets had to contend with lots of sequencing artefacts or annotation artefacts when assessing this.

Critical paper: “A Systematic Survey of Loss-of-Function Variants in Human Protein-Coding Genes”

Critical resource:

In particular, the effects of introducing a stop codon into a transcript are hard to predict. Some of the time the stop will be masked by splicing events or controlled by nonsense-mediated decay, which means it may not be pathogenic at all.

Also, stop codons in the last exon of a gene may not be of great interest as they are unlikely to have large effects on protein conformation.

The ALOFT pipeline was developed to annotate loss of function mutations. This uses a number of resources to make predictions including information about NMD, protein domains, gene networks (shortest path to known disease genes) as well as evolutionary conservation scores (GERP), dn/ds information from mouse and macaque and a random forest approach to classification. A list of benign variants is used in the training set including things like homozygous stop mutations in the 1kG dataset which are assumed to be non-pathogenic. Dominant effects are likely to occur in haploinsufficient genes with an HGMD entry.



5.1    Mark Lawler, QUB, Belfast: “Personalised Cancer Medicine; Are we there yet?”

Another talk from Mark who was an excellent chair for some conference sessions as well. One of the biggest problems with personalized medicine is that some data is already silo’d, or at very best fragmented.

In the UK getting science into clinical practice within the NHS is really predicated on the evidence that it reduces costs, is transformational in terms of treatment and adds value to the current system. So the bar is set quite high.

This was contrasted with the INCa Tumour Molecular Profiling Programme which is running in France with colorectal and lung cancers. This is drawing on 28 labs around Europe. INCa appears to be run under the auspices of the Institut National du Cancer.

Critical resource:

Mark felt that empowering patient advocacy was going to be an important drive in NHS uptake of new technologies and tests. But equally important was increasing personalized medicine literacy amongst GPs, policymakers and the insurance industry.

5.2    Nazneen Rahman, ICR, London: “Implementing large-scale, high-throughput cancer predisposition genomic testing in the clinic”

Nazneen is obviously interested in testing germline mutations, unlike much of the rest of the cancer programme, which was focused on somatic mutation detection. Consequently her group works with blood draws and not biopsy material.

There are >100 predisposition genes implicated in 40+ cancers and there is variable contribution depending on the mutation and the cancer type. 15% of ovarian cancers result from germline variants, and this falls to 2-3% of all cancers. For this kind of screening a negative result is just as important as a positive one.

On the NHS testing for about half these predisposition genes is already available but even basic BRAF testing is not rolled out completely so tests have ‘restricted access’.

What is really needed is more samples. Increased sample throughput drives ‘mainstreaming of cancer genetics’. And three phases need to be tested – data generation, data analysis and data interpretation.

Critical resource:

They are using a targeted panel (CAPPA – which I believe is a TruSight Cancer Panel) where every base must be covered to at least 50x, which means mean target coverage of samples approaches 1000x even for germline detection. There’s a requirement for a <8 week TAT, and both positive and negative calls must be made. It was acknowledged that there will be a switch to WEX/WES ‘in time’ when it is cheap.

The lab runs rapid runs on a HiSeq 2500 at a density of 48 samples per run. This gives a capacity of 500+ samples per week (so I assume there’s more than one 2500 available!). 50ng of starting DNA is required and there is a very low failure rate. 2.5k samples have been run to date. 384 of these were for BRCA1/2. 3 samples have failed and 15 required ‘Sanger filling’.

In terms of analysis, Stampy is used as the aligner and Platypus for variant calling due to its superior handling of indels. A modified version of ExomeDepth is used for CNV calling, and internal development produced coverage evaluation and HGVS parsers. All pathogenic mutations are still validated with Sanger or another validation method.

Data interpretation is the bottleneck now: it’s intensive work for pathogenic variants, and VOUS are an issue – they cannot be analysed in a context independent fashion and are ‘guilty until proven innocent’ in the clinician’s mind.

They have also performed exome sequencing of 1k samples, and observed an average of 117 variants per individual of clinical significance to cancer and 16% of the population has a rare BRCA variant.

Nazneen prefers to assume in advance that VOUS are not implicated: we should stick to reporting what is known until such time as a previous VOUS is declared to be pathogenic in some form. But we should be able to autoclassify 95% of the obvious variants, reducing some of the interpretation burden. Any interpretation pipeline needs to be dynamic and iteratively improved, with decision trees built into the software. As such, control variant data is important: ethnic variation is a common trigger for VOUS, where the variant is not in the reference sequence but is a population level variant for an ethnic group.

Incorporating gene level information is desirable but rarely used. For instance information about how variable a gene is would be useful in assessing whether something was likely to be pathogenic – against a background which may be highly changeable vs. one that changes little.

Although variants are generally stratified into 5 levels of significance, they really need to be collapsed down into a binary state of ‘do something’ or ‘do nothing’. A number of programs help in the classification, including SIFT, PolyPhen, MAPP, Align-GVGD, NN-Splice and MutationTaster. The report also has Google Scholar link outs (considered to be easier to query sanely than PubMed).

To speed analysis all the tools are used to precompute scores for every base substitution possible in the panel design.
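The precomputation idea above is easy to picture: enumerate every possible single-base substitution across the panel once, score each one, and turn later lookups into dictionary reads. The sketch below assumes a hypothetical reference string and a placeholder `scorer` callable standing in for whichever prediction tool is being cached; it says nothing about how the ICR lab actually stores its scores.

```python
BASES = "ACGT"


def all_substitutions(reference, start=1):
    """Yield (position, ref_base, alt_base) for every possible substitution."""
    for offset, ref_base in enumerate(reference.upper()):
        if ref_base not in BASES:
            continue  # skip ambiguous bases such as N
        for alt_base in BASES:
            if alt_base != ref_base:
                yield (start + offset, ref_base, alt_base)


def precompute_scores(reference, scorer, start=1):
    """Build a lookup table of scores keyed by (position, ref, alt)."""
    return {key: scorer(*key) for key in all_substitutions(reference, start)}


def placeholder_scorer(position, ref_base, alt_base):
    """Stand-in for a real predictor; returns a constant score."""
    return 0.0


# Hypothetical three-base target starting at coordinate 100.
table = precompute_scores("ACG", placeholder_scorer, start=100)
print(len(table))
```

Each base contributes three alternatives, so the table grows linearly with the panel size, which is what makes this practical for a targeted panel in a way it would not be for a whole genome.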

5.3    Timothy Caulfield, University of Alberta, Canada: “Marketing the Myth of Personalised Prevention in the Age of Genomics”

No notes here, but an honorable mention for Tim, who gave what was easily the most entertaining talk of the conference, focusing on the misappropriation of genomics health by the snake oil industries of genomic matched dating, genomic influenced exercise regimes and variant led diets. He also asked the dangerous question: if you 1) eat healthily 2) don’t smoke 3) drink in moderation 4) exercise, is there really any value in personalized medicine except for a few edge cases? Health advice hasn’t changed much in decades. And people still live unhealthily. You won’t change this by offering them a genetic test and asking them to modify their behavior. If you ever have a chance to see Tim speak, it’s worth attending. He asked for a show of hands from those who had done 23andMe. Quite shockingly for a genetics conference, only 3 people had their hands in the air: myself, Tim and one of the other speakers.



4.1    Mark Lawler, QUB, Belfast: “All the world’s a stage: The Global Alliance For Genomics and Health”

Mark is part of the clinical working group for GA4GH. GA4GH’s core mission rests on the view that genomics data must be shared before the most benefit can come from it. GA4GH has no remit to store or generate sequencing data.

Mark introduced an initiative from the Centers for Mendelian Genomics, which are NIH funded to provide sequencing for free. The sequencing is provided for diagnosed Mendelian disorders for which the genetic cause has yet to be determined for all cases. Each order is evaluated on a case by case basis and is reviewed by committee before going forwards.

Submission has to be from a healthcare professional looking after a patient and the minimum dataset returned is BAM and annotated VCF files. The data is actually owned by the submitters, and help with analysis is available if required, i.e. there are staged levels of support that can be requested from the CMG.

Critical resources: and

4.2    Ada Hamosh, JHU, Baltimore: “International Efforts to Identify New Mendelian Disease Genes: the Centers for Mendelian Genomics, PhenoDB and GeneMatcher”

Following on from Mark was Ada, from one of the aforementioned CMG groups. One of the critical early points from Ada was that phenotyping should not be limited to the disease symptoms, but a much more holistic description encompassing everything about that person. This allows you to a) disambiguate between cases b) test for things you may find and c) understand what you’re looking for. So again this is a call to a deep phenotyping approach as outlined by a number of other speakers.

They use OMIM for a clinical synopsis along with HPO and LDDB (I think this is the London Dysmorphology Database). HPO seems to be gaining traction widely.

The cases that come to the CMG can be known disorders where the gene is unknown, or where the disorder is known, and the existing associated disease genes have been tested and found negative (no known underlying genetic cause).   Other information collected is birth *decade* (date is considered identifiable), age at presentation/evaluation, photographic assessments and feature selection.

The feature selection has 21 high level categories that cascade as you click on them, and it then selects OMIM disorders as you enter phenotype information.

They also filter against dbSNP in 3 steps – 127, 129 and 135. There is also a built in cohort analysis tool.

Critical resources: and is a demonstrator site for the CMG and presents a ‘modular tool for collection, storage and analysis of phenotype and genotype data’. This is downloadable(?) and includes things like the consent modules. is the actual clinical holding for labs.

There was some talk about the profusion of phenotypic descriptions standards – OrphaNet, HPO, SNOMED-CT, LDDB, POSSUM.   There was an ICHPT (International Consortium of Human Phenotype Terminologies) meeting where 2300 core terms were agreed across all the standards consortiums. PhenoDB already has these mappings in place for the HPO terms.

Critical resource:

PhenomeCentral is a repository for secure data sharing targeted to clinicians and scientists working in the rare disorder community – effectively this is a matchmaking service where disorders are so rare that a central clearing house makes perfect sense to try to accumulate other patients with the same symptoms across international borders to further diagnosis and research.

Critical resource:

This is a similar service that makes matching at the gene level, rather than the phenotypic level possible.

4.3    Anthony Brookes, University of Leicester “A Multi-Faceted Approach to Releasing the Value in Genome Related Data”

One of the options that is less explored in terms of data sharing is taking the question to the data, rather than pulling data from disparate sources and then running your analysis. DataSHIELD is an effort to resolve this, without handing out your data.

Critical paper: “DataSHIELD: resolving a conflict in contemporary bioscience—performing a pooled analysis of individual-level data without sharing the data”

One of the points here is that rather than distributing entire datasets, summary data can be aggregated and shared from harmonized individual-level databases. This permits researchers to focus on knowledge and visualization, removing people from the silo mentality. It is just as valuable to share understanding as underlying data.
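The idea of pooling summaries rather than records can be shown with a toy example. Here each site releases only a count, a sum and a sum of squares, and the coordinator recovers the pooled mean and standard deviation without ever seeing an individual value. This is only an illustration of the principle under my own assumptions; it is not DataSHIELD's interface, and the function names and measurements are hypothetical.

```python
from math import sqrt


def site_summary(values):
    """Computed locally at each site; only these aggregates leave the site."""
    return {
        "n": len(values),
        "sum": sum(values),
        "sum_sq": sum(v * v for v in values),
    }


def pooled_mean_and_sd(summaries):
    """Combine per-site aggregates into a pooled mean and sample SD."""
    n = sum(s["n"] for s in summaries)
    total = sum(s["sum"] for s in summaries)
    total_sq = sum(s["sum_sq"] for s in summaries)
    mean = total / n
    variance = (total_sq - n * mean ** 2) / (n - 1)
    return mean, sqrt(max(variance, 0.0))


# Hypothetical measurements held at two separate sites.
site_a = site_summary([1.0, 2.0, 3.0])
site_b = site_summary([4.0, 5.0])

mean, sd = pooled_mean_and_sd([site_a, site_b])
print(mean, sd)
```

The result is identical to what you would get from the combined raw data, which is the point: for many analyses nothing is lost by keeping the individual-level records at home.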

However data sources are still disparate. Cataloguing is essential as you need to know where the data is, what metadata may be associated with a holding, descriptions etc. The Beacon system is a good example.

GWAS Central was highlighted. This provides a centralized compilation of summary level findings from genetic association studies, both large and small. Data sets are actively gathered from public domain projects, and direct data submission from the community is encouraged.

Critical resource:

Also mentioned were OmicsConnect and Dalliance to provide resources to visualize/interrogate datasets.

Critical paper: “Dalliance: interactive genome viewing on the web”

These resources can interact with DAS data sources and eDAS was also introduced as a ‘gatekeeper’ for DAS information (I can find no information on this project).

Anthony then moved onto Café Variome.

Critical resource:

Café Variome is designed to sit alongside existing local databases to bring data discovery tools to that data. From the website: “We offer a complete data discovery platform based upon enabling the ‘open discovery’ of data (rather than data ‘sharing’) for example, between networks of diagnostic laboratories or disease consortia that know/trust each other and share an interest in certain causative genes or diseases.”

Café Variome can also broker data to both the research and clinical communities. Nodes can connect to each other, and there is also a central repository which can issue DOIs built on work by DataCite. Logins support ORCID identifiers for submitters of data, another step in the direction of correct, unambiguous attribution and standardisation.



3.1    Peter Robinson, Humboldt University, Berlin: “Effective diagnosis of genetic disease by computational phenotype analysis of the disease associated genome”

Peter focused on the use of bioinformatics in medicine, specifically around the use of ontologies to describe phenotypes and look for similarities between diseases. It is important to capture the signs, symptoms and behavioural abnormalities of a patient in PRECISE language to be useful.

The concept here is ‘deep phenotyping’ – there’s almost no such thing as too much information about clinical presentation, but it must be recorded consistently to provide a basis for computational comparison and analysis.

HPO (The Human Phenotype Ontology) was introduced, saying that in many ways it is indebted to OMIM (Online Mendelian Inheritance in Man).

He felt strongly that the standard exome with 17k genes was ‘useless’ in a diagnostic context, when there are 2800 genes associated with 5000 disorders, covering a huge spectrum of presenting disease. Consequently he does not recommend screening the exome as a first line test, but encourages the use of reduced clinical exomes. In particular, this allows higher coverage for the same per-sample cost, and he suggested that the aim should be to have 98% of the target regions covered to >20x.

Clearly identified pathogenic mutations are the easiest thing to call from this kind of dataset, but OMIM remains the first port of call for finding out the association of a mutation to a condition. And OMIM is not going to be of much help finding information on a predicted deleterious mutation in a random chromosomal ORF.

Specifically they take VCF files and annotate them with HPO terms as well as the standard suite of Mutation Taster, Polyphen and SIFT

A standard filtering pipeline should get you down to 50 to 100 genes of interest and then you can do a phenotype comparison of the HPO terms you have collected from the clinical presentation and the HPO terms annotated in the VCF. This can give you a ranked list of variants.

This was tested by running 10k simulations of such a process, with variants from HGMD spiked into an asymptomatic individual’s VCF file. The gene ranking score depends on a variant score for deleteriousness and a phenotype score for the match to the clinical phenotype. In the simulation, 80% of the time the right gene was at the top of the list.
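A toy version of that two-part score might look like the following. I did not note how the two components are actually combined or how phenotype similarity is measured, so this sketch assumes a simple average and uses Jaccard overlap between HPO term sets as a stand-in for a proper ontology-aware similarity. The gene names and scores are hypothetical, and the HPO identifiers are used purely as opaque labels.

```python
def phenotype_score(patient_terms, gene_terms):
    """Jaccard overlap between two sets of HPO term IDs (a crude stand-in)."""
    patient, gene = set(patient_terms), set(gene_terms)
    if not patient or not gene:
        return 0.0
    return len(patient & gene) / len(patient | gene)


def rank_genes(patient_terms, candidates):
    """Rank candidate genes by the mean of variant and phenotype scores."""
    scored = []
    for gene, info in candidates.items():
        pheno = phenotype_score(patient_terms, info["hpo_terms"])
        combined = (info["variant_score"] + pheno) / 2  # assumed combination
        scored.append((combined, gene))
    # Highest combined score first; gene name breaks ties deterministically.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [gene for _, gene in scored]


# Hypothetical patient and candidate genes.
patient = {"HP:0001250", "HP:0001263"}
candidates = {
    "GENE_A": {"variant_score": 0.9, "hpo_terms": {"HP:0000252"}},
    "GENE_B": {"variant_score": 0.8, "hpo_terms": {"HP:0001250", "HP:0001263"}},
}

ranking = rank_genes(patient, candidates)
print(ranking)
```

GENE_A carries the more damaging-looking variant, but GENE_B wins because its known phenotype matches the patient, which is exactly the behaviour the phenotype term is meant to add.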

This approach is embodied in PhenIX:

This has led to the development of a clinical bioinformatics workflow where the clinician supplies the HPO terms and runs the algorithm. Information is borrowed from OMIM and Orphanet in the process.

Prioritisation of variants is not a smoking gun for pathogenicity however. This needs to be backed up by Sanger sequencing validation, and co-segregation analysis within a family (if available). Effective diagnosis of disease will not lose the human component.

Exomiser was also introduced, from Damian Smedley’s group at the Sanger Institute. It uses information from the mouse and zebrafish to increase its utility, as there is a huge amount of phenotype data from developmental biology studies of gene knockouts in other organisms.

3.2    Dan Bradley, Trinity College, Dublin: “Ancient population genomics: do it all, or not at all”

Dan gave a great talk on the sequencing of ancient DNA to look at population data. Ancient DNA is highly fragmented, and you’re generally working with 50-70 base fragments (generally worse than FFPE samples).

DNA from ancient samples actually undergoes a target enrichment step, largely to remove environmental sequence contamination, although it was noted that repetitive DNA can be problematic in terms of ruining a capture experiment.

From the ancient samples that were covered at 22x (I don’t expect that’s genome coverage, but target capture coverage) the samples were down-sampled to 1x data, and then 1kG data used to impute the likely genotypes. This actually recapitulated 99% of calls from the original 22x data, showing that this approach can be used to reconstruct ancestral population genomics information from very limited datasets, using very modern data.

HGV2014 Meeting Report, Session 2 “THE TRACTABLE CANCER GENOME”


2.1    Lillian Su, University of Toronto: “Prioritising Therapeutic Targets in the Context of Intratumour Heterogeneity”

The central question is how we can move towards molecular profiling of a patient. Heterogeneity of cancers includes not just inter-patient difference but also intra-patient differences, either within a tumour itself, or a primary tumour and its secondary metastases.

Lillian was reporting on the IMPACT study, which has no fresh biopsy material available and so works exclusively from FFPE samples. Their initial work has been using a 40 gene TruSeq Custom Amplicon hotspot panel, but they are in the process of developing their own ‘550 gene’ panel which will have the report integrated with the EHR system.

The 550 gene panel has 52 hereditary hotspots, 51 full length genes and the rest presumably hotspot locations. There are also 45 SNPs for QA/sample tracking.

Lillian went on to outline the difference between trial types and the effects of inter-individual differences. Patients can be stratified into ‘umbrella’ trials, which are histology led, or ‘basket’ trials, which are led by genetic mutations (as well as N-of-1 studies where you have unmatched comparisons of drugs).

But none of this addresses intra-patient heterogeneity, which is not really considered in clinical trial design. Not all genes have good concordance in terms of the mutation spectra between primary and metastatic sites (PIK3CA was given as an example). What is really required is a knowledge base of tumour heterogeneity before a truly effective trial design can be constructed. And how do you link alterations to clinical actions?

Critical paper: “Whole-exome sequencing and clinical interpretation of formalin-fixed, paraffin-embedded tumor samples to guide precision cancer medicine”

Lillian outlined the filtering strategy for variants from FFPE and matched bloods. This was a MAF of <1% in 1kG data, a VAF of >5% and a DP>500x for the tumour, and DP>50x in matched bloods. Data was cross-referenced with COSMIC, TCGA, LSDBs and existing clinical trials, and missense mutations characterized with Polyphen, SIFT, LRT (likelihood ratio test) and MutationTaster.
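The four thresholds above translate directly into a filter. The sketch below reports which criteria a variant fails rather than just a yes/no, which I find more useful when reviewing borderline calls. The field names (`maf_1kg`, `vaf`, `tumour_dp`, `blood_dp`) are my own invention and the example values are made up; only the cut-offs come from the talk.

```python
def failed_filters(variant, max_maf=0.01, min_vaf=0.05,
                   min_tumour_dp=500, min_blood_dp=50):
    """Return the names of the filters a variant fails (empty list = pass)."""
    failures = []
    if variant["maf_1kg"] >= max_maf:
        failures.append("maf")        # too common in 1kG data
    if variant["vaf"] <= min_vaf:
        failures.append("vaf")        # too few supporting reads in the tumour
    if variant["tumour_dp"] <= min_tumour_dp:
        failures.append("tumour_dp")  # tumour not sequenced deeply enough
    if variant["blood_dp"] <= min_blood_dp:
        failures.append("blood_dp")   # matched blood not deep enough
    return failures


# Hypothetical calls from one tumour/blood pair.
good_call = {"maf_1kg": 0.0, "vaf": 0.22, "tumour_dp": 1200, "blood_dp": 80}
weak_call = {"maf_1kg": 0.0, "vaf": 0.03, "tumour_dp": 300, "blood_dp": 80}

print(failed_filters(good_call))
print(failed_filters(weak_call))
```

Keeping the reasons makes it easy to tally how many calls are lost at each step, which matters with FFPE material where depth and allele fraction are often the limiting factors.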

They are able to pick out events like KRAS G12 mutations that are enriched on treatment, and this is a driver mutation, so the treatment enriches the driver over time.

Critical paper: “The molecular evolution of acquired resistance to targeted EGFR blockade in colorectal cancers”

Lillian sees WES/WGS, rather than panels, as an important long term investment, along with the use of RNA-Seq in investigating heterogeneity. Ideally you want a machine learning system overlaid on the NGS datasets. Deep sequencing of tumours early might give you some idea of whether the tumour heterogeneity is pre-existing or a result of tumoural selection over time. It was acknowledged that this was hard to do for every patient, but it would answer long standing questions about whether resistant subclones are present and stable at the start of tumourigenesis.

2.2    Charles Lee, JAX labs: “Mouse PDX and Avatars: The Jackson Laboratory Experience”

PDX stands for “Patient Derived Xenografts”. This was an amazing talk, and as such I have few notes. The basic premise here is to take a tumour from a patient, segment it and implant the segments into immunodeficient mice where the tumours can grow. There was a lot of detail on the mouse strains involved, but the applications for this seem to be huge. Tumours can be treated in situ with a number of compounds and this information used to stratify patient treatment. The material can be used for CNV work, grown up for biobanking, expression profiling etc.

Fitting in with the previous talk, this model can also be used for investigating tumour heterogeneity as you can transplant different sections of the same tumour and then follow e.g. size in response to drug dosage in a number of animals all harbouring parts of the same original tumour.

Importantly this is not just limited to solid tumour work as AML human cell lines can also be established in the mice in a matter of weeks.

2.3    Frederica Di Nicolantonio, University of Torino, Italy: “Druggable Kinases in colorectal cancer”

The quote that stayed with me from the beginning of the talk was “Precision cancer medicine stands on exceptions”. The success stories of genomic guided medicine in cancer such as EGFR and ALK mutations are actually present in very small subsets of tumours. The ALK mutation is important in NSCLC tumours, but this is only 4% of tumours and only 2% respond. Colorectal cancer (CRC) is characterized by EGFR mutations and disruption of the RAS/RAF pathway.

However the situation is that you can’t just use mutation data to predict the response to a chemotherapeutic agent. BRAF mutations give different responses to drugs in melanomas vs. CRC because the melanomas have no expression of EGFR, owing to the differences in their embryonic origin.

Consequently in cell-line studies the important question to ask is whether the gene expression profiles of the cell line are appropriate to the tumour. This may determine the response to treatment, which may or may not be the same depending on how the cell line has developed during its time in culture. Are cell lines actually a good model at all?

Frederica made a point that RNA-Seq might not be the best for determining outlier gene expression and immunohistochemistry was their preferred route to determine whether the cell line and tumour were still in sync in terms of gene expression/drug response.

2.4    Nick Orr, Institute of Cancer Research, London “Large-scale fine-mapping and functional characterisation identifies novel breast cancer susceptibility loci at 9q31.2”

Nick started off talking about the various classes of risk alleles that exist for breast cancer. At the top of the list there are the high penetrance risk alleles in BRCA1 and BRCA2. In the middle there are moderate risk alleles at relatively low frequency in ATM and PALB2. Then there is a whole suite of common variants that are low risk, but population wide (FGFR2 mutations cited as an example).

With breast cancer the family history is still the most important predictive factor, but even so 50% of clearly familial breast cancer cases are genetically unexplained.

He went on to talk about the COGS study (which has its own website), which involved a large GWAS of 10k cases and 12k controls. This was then followed up in a replication study of 45k cases and 45k controls.

Nick has been involved in the fine mapping follow up of the COGS data, but one of the important data points was an 11q13 association with TERT and FGFR2.

Critical paper: “Breast cancer risk-associated SNPs modulate the affinity of chromatin for FOXA1 and alter gene expression”

Data was presented on the fine mapping work that shows associated SNPs mapping to DNaseI hypersensitivity sites in MCF7 (a metastatic breast cancer cell line) as well as to transcription factor binding sites. This work relied on information from RegulomeDB.

One of the most impressive feats of this talk was Nick reeling off 7 digit rsIDs repeatedly during his slides without stumbling over the numbers.

Work has also been performed to generate eQTLs. The GWAS loci are largely cis acting regulators of transcription factors.

HGV2014 Meeting Report, Session 1 “INTERPRETING THE HUMAN VARIOME”

1.1    Pui-Yan Kwok, UCSF: “Structural Variations in the Human Genome”

The talk focused on structural variant detection. The challenges were outlined as being:

  • Short reads
  • Repeats
  • CNVs
  • Haplotyping for compound heterozygote identification
  • Difficulty of analysis of SVs

Currently the approach is to map short reads to an imperfect assembly. It is imperfect because it is haploid, composite and incomplete with regard to gaps, Ns and repeat sizes.

Critical paper:

There are 1000 structural variations per genome, amounting to 24Mb/person, and 11,000 common ones in the population covering 4% of the genome (i.e. more than your exome).

ArrayCGH dup/del arrays don’t tell you about the location of your duplications and deletions. Sequencing only identifies the boundaries.

He presented a model of single molecule analysis on the BioNano Genomics Irys platform. Briefly, this uses a restriction enzyme to introduce single-stranded nicks in the DNA, which are then fluorescently labelled. These are then passed down a channel and resolved optically to create a set of sequence motif maps, very much akin to an optical restriction endonuclease map. This process requires high molecular weight DNA, so it is presumably not suitable for FFPE/archival samples.
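
To make the idea of a motif map concrete, here is a minimal in silico sketch. It simply records where a recognition motif occurs on either strand of a sequence and the spacing between neighbouring sites, which is the kind of pattern the optical maps capture. The motif used (GCTCTTC) and the toy sequence are assumptions for illustration; my notes do not record which nicking enzyme was used, and real maps work at kilobase rather than base-pair scale.

```python
def reverse_complement(seq: str) -> str:
    """Reverse complement of an upper-case A/C/G/T sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def nick_site_map(seq: str, motif: str) -> list[int]:
    """Sorted 0-based start positions of the motif on either strand."""
    seq = seq.upper()
    sites = set()
    for pattern in {motif, reverse_complement(motif)}:
        start = seq.find(pattern)
        while start != -1:
            sites.add(start)
            start = seq.find(pattern, start + 1)
    return sorted(sites)


def label_spacings(sites: list[int]) -> list[int]:
    """Distances between consecutive labelled sites."""
    return [b - a for a, b in zip(sites, sites[1:])]


MOTIF = "GCTCTTC"  # assumed recognition motif, for illustration only
# Toy sequence: two forward-strand sites and one reverse-strand site.
toy = "AA" + MOTIF + "AAAAA" + reverse_complement(MOTIF) + "TTTT" + MOTIF + "AA"

sites = nick_site_map(toy, MOTIF)
print(sites, label_spacings(sites))
```

Running the same function over a reference assembly would give the expected label pattern against which the observed single-molecule maps could be compared.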

The motifs are ‘aligned’ to each other via a clustering procedure.

Critical paper:

There are some technical considerations: the labelling efficiency is not 100% (which causes a mismatch problem on alignment), and some nick sites are too close together for optical resolution. The nicking process can make some sites fragile, causing breakup of the DNA into smaller fragments. The ‘assembly’ is still an algorithmic approach and by no means a perfect solution.

However this approach shows a great synergy with NGS for combinatorial data analysis.

They took the classic CEPH trio (NA12878/891/892) and made de novo assembled genome maps for the three individuals, generating ~259Gbases of data per sample. 99% of the data maps back to the GRCh38 assembly (I assume this is done via generating a profile of GRCh38 using an in silico nickase approach). The N50 of the assemblies is 5Mbases, and 96% of GRCh38 is covered by the assembled genomes.
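
For anyone unfamiliar with the N50 statistic quoted above, it is the length at which the running total of map (or contig) lengths, sorted longest first, reaches half of the summed length. A minimal sketch follows; the lengths in the example are invented, not data from the talk.

```python
def n50(lengths):
    """Length L such that pieces of length >= L cover at least half the total."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("no lengths supplied")


# Invented map lengths in Mb: total 20, so the halfway point is 10.
example_lengths = [8, 5, 4, 2, 1]
print(n50(example_lengths))
```

An N50 of 5Mbases therefore means that half of the assembled length sits in maps of 5Mbases or longer.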

This obviously enables things like gap sizing in the current reference genome. They were able to validate 120/156 known deletions, and identified 135 new ones. For insertions they validated 43/59 and found 242 new ones. A number of other mismatches were identified: 6 were switched insertion/deletion events, 9 were low coverage and for 31 there was no evidence.

The strength of the system is the ability to do tandem duplications, inversions and even complex rearrangements followed by tandem duplications. It also supports haplotyping, but critically you can tell where a CNV has arrived in the genome. This would enable applications like baiting the sequences in CNV regions and mapping the flanks. This allows you to produce diploid genome maps.

Critical paper:

This platform therefore allows assessment of things like DUF1220 domain copy number repeats, implicated in autism spectrum disorders and schizophrenia (repeat number increases in ASD, and decreases in schizophrenia).

1.2    Stephen Sherry, NCBI, Maryland: “Accessing human genetic variation in the rising era of individual genome sequence”

Stephen spoke about new NCBI services including simplified dbGaP data requests and the option to look for alleles of interest in other databases via Beacon services.

dbGaP is a genotype/phenotype database for researchers that presents its data consistent with the terms of the original patient consent. “GRU” items are “general research use”: these are broadly consented and genotyped or sequenced datasets that are available to all. This consists of CNV, SNP, exome (3.8k cases) and imputed data. PHS000688 is the top level ID for GRU items.

The Beacon system should be the jumping point for studies looking for causative mutations in disease to find out what other studies the alleles have been observed in rather than relying on 1KG/EVS data. This is part of the GA4GH project and really exists so a researcher can ask a resource if it has a particular variant.

At some point in genome sequencing we will probably have observed a SNP event in one in every two bases, i.e. there will be a database of 1.5 billion variant events. Critically, we lack the kind of infrastructure to support this level of data presentation, and the presentation is the wrong way around. We concern ourselves with project/study level data organization but this should be “variant” led, i.e. you want to identify which holdings have your SNP of interest. This is not currently possible, but the Beacon system would allow this kind of interaction between researchers.

There are a number of Beacons online, which are sharing public holdings such as 1KG. The NCBI, GA4GH, Broad, EBI are involved. There is even a meta-Beacon that allows you to query multiple Beacons.
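
The “variant led” idea is easy to illustrate with a toy model. The sketch below is not the GA4GH Beacon API: `ToyBeacon`, `meta_query` and the holdings are all hypothetical, and the only thing it is meant to show is that a Beacon answers a yes/no question about a single allele, and that a meta-Beacon fans the same question out to several of them.

```python
class ToyBeacon:
    """In-memory stand-in for a Beacon: it only answers 'have you seen this allele?'."""

    def __init__(self, name, variants):
        self.name = name
        self._variants = set(variants)  # (chromosome, position, alt allele) tuples

    def has_variant(self, chrom, pos, alt) -> bool:
        # A yes/no answer only; no sample, phenotype or study context is revealed.
        return (chrom, pos, alt) in self._variants


def meta_query(beacons, chrom, pos, alt):
    """Ask every beacon the same question and collect the answers by name."""
    return {b.name: b.has_variant(chrom, pos, alt) for b in beacons}


# Invented holdings for two hypothetical resources.
beacons = [
    ToyBeacon("resource_a", [("1", 1000, "T"), ("7", 5500, "G")]),
    ToyBeacon("resource_b", [("7", 5500, "G")]),
]
answers = meta_query(beacons, "1", 1000, "T")
print(answers)
```

A “yes” from a resource is then the cue to contact the data holder, as described next.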

This introduces a new workflow. Really it allows you to open a dialogue between yourself and the data holder. The existence of a variant is still devoid of context, but you can contact the data holder and then enter a controlled access agreement for the metadata, or information down to the read level.

Machine mining of Beacon resources is prohibited. However the SRA toolkit allows access to dbGaP with security tokens, which allows automatic query of SRA related material with local caching.

1.3    Daniel Geraghty, FHCRC, Seattle “Complete re-sequencing of extended genomic regions using fosmid target capture and single molecule, real time (SMRT) long-read sequencing technology”

This talk introduced a fosmid enrichment strategy followed by SMRT sequencing for characterizing complex genomic regions.

The premise was set up by suggesting that GWAS leaves rare variants undetected. Fosmid based recloning of HLA has been demonstrated.

Critical paper:

The first step is building a fosmid library, which is then plated out. Molecular inversion probes are used to identify fosmids from the region of interest. Single clones are then extracted and sequenced extensively. This obviously means you need a fosmid library for each individual you’re looking at, and it is not a hybridization extraction method like using BACs as baits for large regions.

Sequencing is done on PacBio both for speed (faster than a MiSeq) and read length. At this point the data can be assembled by Velvet, or even by the venerable Phrap/Consed approaches. About 40-100 PacBio reads are required to assemble a fosmid clone.

Quiver can be used to find a consensus sequence, and once a fosmid has been assembled, it can be coassembled with other fosmids that have been similarly reconstructed to get regions of 800kb.

The question was raised whether it might be possible to bypass the fosmid step with other recombineering approaches to work directly with gDNA and MIPs.

1.4    Peter Byers, UWASH, Seattle: “Determinants of splice site mutation outcomes and comparison of characterisation in cultured cells with predictive programs”

Peter talked about the prediction of splice mutation effects with particular reference to the collagen genes. 20% of collagen mutations are splice site mutations (these genes have lots of exons). These are pathogenic in a spectrum of osteogenesis imperfecta (OI) disorders. It is complex because we not only have to consider the effects on splice donor and splice acceptor sites but also the effects on lariat sequences within introns.

Consequently there are a number of downstream effects – the production of cryptic splice sites, intron retention, exon skipping (which tends to lead to more severe phenotypes). But this is made more complex again by the fact a single variant can have multiple outcomes and there’s no clear explanation for this.

This complexity means that it is hard to produce a computational prediction program that takes into account all the uncertainties of the system, especially at locations 3, 4 or 5 bases outside the splice site.

SplicePort and ASSEDA were tested, and ASSEDA came out on top in the tests, with a mere 29% of events wrongly predicted when compared with experimental evidence. So what is happening to make these predictions incorrect?

Peter explained that the order of intron removal is specific to each gene but shared between individuals. There is no global model for what that order might be, but it must be encoded in some way by intronic sequence. The speed of intron removal and the effects on the mature mRNA are incredibly important to the pathogenesis of the disease. It was clearly shown that the splicing events under study were dictated by the speed of intron removal as the RNA matured.

If you want to predict the splicing effect of a mutation, you therefore need some information about the order of intron processing in the gene you’re looking at to have a completely holistic view of the system. How do you generate this information systematically? It’s a very labour intensive piece of work, and Peter was looking for suggestions on how best to mine RNA-Seq data to get to the bottom of this line of enquiry. Is it possible even to do homology based predictions of splicing speed and therefore splicing order?

Computational biology post open at OGT

Edit: The position is now filled, and we’ve welcomed Luke Goodsell to the group!

Continuing the theme of job adverts as blog posts, we’re currently seeking someone to join the growing Computational Biology group at OGT.  Microarrays are a big part of OGT’s portfolio, and this post is all about the arrays.  This is a new post, and will report to our design expert Duarte Molha:

A Computational Biologist is required to join Oxford Gene Technology’s (OGT) Computational Biology group. The successful applicant will deliver microarray designs for internal R&D projects, Genefficiency Services and CytoSure products. The individual will also be involved in advancing OGT’s expertise in probe and array designs.

We are seeking a highly motivated and innovative individual with a numerate background combined with a solid understanding of molecular biology. Programming skills (Perl is essential; Java or Python are desirable) and scripting skills (awk, sed, bash) applied in a heterogeneous computing environment (Windows, Linux) are essential, as are good attention to detail and customer focus. Experience with MySQL, PHP and version control (git) is desirable, as are experience in handling large datasets, and experience in a commercially led environment.

You must be a team player with strong communication skills. Well-developed time-management skills and good organisation are important to the role.

An undergraduate degree in Computational Biology, Molecular Biology or a related discipline is required; a Master’s degree is preferred.

We offer a competitive salary together with an excellent benefits package.

If you are interested in applying for this position, please complete the online application form, attaching your CV and ensuring to state your salary expectations within the covering letter section. Alternatively, email the required information. We will only accept applications from candidates who are legally permitted to work in the UK.