
15th International Conference on Human Genome Variation – Meeting report

Last week I was lucky enough to attend the HGV2014 meeting at the Culloden Hotel in Belfast. It was my first trip to Northern Ireland and my first attendance at an HGV meeting. The meeting is small and intimate, but it had a great, wide-ranging programme, and I would heartily recommend attending if you get the chance and have an interest in clinical or human genomics.

Have a look at the full programme here: http://hgvmeeting.org/

Here are links to my write-ups for each session (where I had notes that I could reconstruct!):

  1. Interpreting the human variome
  2. The tractable cancer genome
  3. Phenomes, genomes and archaeomes
  4. Answering the global genomics challenge
  5. Improving our health: Time to get personal
  6. Understanding the evolving genome
  7. Next-gen ‘omics and the actionable genome



Caveats: I have not taken notes in every talk of every session, a lack of notes for a particular speaker does not constitute disinterest on my part, I simply took notes for the talks that were directly related to my current work. If I have misquoted, misrepresented or misunderstood anything, and you are the speaker concerned, or a member of the team involved in the work, please leave a comment on the post, and I will rectify the situation accordingly.

6.1    Yves Moreau, University of Leuven, Belgium: “Variant Prioritisation by genomic data fusion”


An essential part of the prioritization process is the integration of phenotype.

Critical paper: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3083082/ “Exome sequencing and disease-network analysis of a single family implicate a mutation in KIF1A in hereditary spastic paraparesis”

Yves introduced “Endeavour”, which takes a gene list, matches it to the disease of interest and ranks the genes, but this requires the phenotypic information to be ‘rich’. Two main questions need to be addressed: 1) which genes are related to a phenotype? and 2) which variants in a gene are pathogenic? Candidate gene prioritization is not a new thing and has a long history in microarray analysis. While it is easy to interrogate things like pathway information, GO terms and the literature, it is much harder to find relevant expression profile information or functional annotation, and existing machine learning tools do not really support these data types.

Critical paper: http://www.ncbi.nlm.nih.gov/pubmed/16680138 “Gene prioritization through genomic data fusion.”

Critical resource: http://homes.esat.kuleuven.be/~bioiuser/endeavour/tool/endeavourweb.php

Endeavour can be trained, rank candidates according to various criteria, and then merge those ranks using order statistics to produce an overall ordering.
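The rank-merging idea can be sketched as follows. This is a minimal illustration rather than Endeavour's actual algorithm (which uses an order-statistics Q-statistic): each data source contributes a ranking, ranks are normalised to rank ratios, and candidates are ordered by their mean ratio. The gene lists below are invented for illustration.

```python
# Sketch of rank fusion: combine per-data-source rankings of candidate
# genes into one overall ordering. Endeavour's real method derives a
# Q-statistic from order statistics; here we approximate the idea with
# mean normalised rank ratios. All rankings below are invented.

def fuse_rankings(rankings):
    """rankings: list of lists, each an ordered list of gene names
    (best first). Returns genes sorted by mean normalised rank ratio."""
    genes = set(g for r in rankings for g in r)
    scores = {}
    for gene in genes:
        ratios = []
        for r in rankings:
            # rank ratio in (0, 1]; genes missing from a ranking get
            # the worst possible ratio for that source
            ratios.append((r.index(gene) + 1) / len(r) if gene in r else 1.0)
        scores[gene] = sum(ratios) / len(ratios)
    return sorted(genes, key=lambda g: scores[g])

by_pathway = ["KIF1A", "SPAST", "ATL1"]
by_expression = ["SPAST", "KIF1A", "ATL1"]
by_literature = ["KIF1A", "ATL1", "SPAST"]

fused = fuse_rankings([by_pathway, by_expression, by_literature])
print(fused[0])  # KIF1A ranks best overall
```

A gene that is merely good in every source can thereby outrank one that is top in a single source but poor elsewhere, which is the point of the fusion.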

Next, eXtasy was introduced: another variant prioritization tool, this time for non-synonymous variants given a specific phenotype.

Critical resource: http://homes.esat.kuleuven.be/~bioiuser/eXtasy/

Critical paper: http://www.nature.com/nmeth/journal/v10/n11/abs/nmeth.2656.html “eXtasy: variant prioritization by genomic data fusion”

eXtasy allows variants to be ranked by their effects on structural change in the protein, association in a case/control or GWAS study, and evolutionary conservation.

The problem though is one of multiscale data integration – we might know that a megabase region is interesting through one technique, a gene is interesting by another technique, and then we need to find the variant of interest from a list of variants in that gene.

They have performed HGMD to HPO mappings (1142 HPO terms cover HGMD mutations). It was noted that PolyPhen and SIFT are useless for distinguishing disease-causing variants from rare, benign ones.

eXtasy produces rankings for a VCF file by taking the trained classifier data and using a random forest approach to rank. One of the underlying assumptions of this approach is that any rare variant found in the 1kG dataset is benign, as these are meant to be nominally asymptomatic individuals.
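As a toy sketch of that idea, the following trains a random forest on labelled variants, treating rare 1kG variants as benign negatives, then ranks unseen variants by predicted pathogenicity probability. The feature set, variant names and all values are invented; eXtasy's real feature set (conservation, haploinsufficiency scores and so on) and training data are far richer.

```python
# Toy illustration of eXtasy-style ranking: train a random forest on
# labelled variants (known disease mutations vs. rare 1kG variants
# assumed benign), then rank candidates by predicted probability of
# pathogenicity. Features and values are invented for illustration.
from sklearn.ensemble import RandomForestClassifier

# columns: [conservation score, structural-impact score, GWAS score]
train_X = [
    [0.9, 0.8, 0.7],  # known disease mutations (label 1)
    [0.8, 0.9, 0.6],
    [0.1, 0.2, 0.1],  # rare 1kG variants, assumed benign (label 0)
    [0.2, 0.1, 0.2],
]
train_y = [1, 1, 0, 0]

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(train_X, train_y)

candidates = {"varA": [0.85, 0.80, 0.65], "varB": [0.15, 0.10, 0.20]}
ranked = sorted(candidates,
                key=lambda v: clf.predict_proba([candidates[v]])[0][1],
                reverse=True)
print(ranked)  # varA should outrank varB
```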

These approaches are integrated into NGS-Logistics, a federated analysis of variants across multiple sites, which has some similarities to the Beacon approaches discussed previously. The NGS-Logistics project is looking for test and partner sites.

Critical paper: http://genomemedicine.com/content/6/9/71/abstract

Critical resource: https://ngsl.esat.kuleuven.be

However, it is clear that what is required, as much as a perfect database of pathogenic mutations, is a database of benign ones: local population controls for ethnicity matching, high-MAF variants, and rare variants from asymptomatic datasets.

6.2    Aoife McLysaght, Trinity College Dublin: “Dosage Sensitive Genes in Evolution and Disease”


Aoife started by saying that most CNVs in the human genome are benign. The quality that makes a CNV pathogenic is that of gene dosage. Haploinsufficiency (where half the product != half the activity) affects about 3% of genes in a systematic study in yeast. This is going to affect certain classes of genes, for instance those where concentration-dependent effects are very important (morphogens in developmental biology, for example).

This can occur through mechanisms like a propensity towards low affinity promiscuous aggregation of protein product. Consequently the relative balance of genes can be the problem where it affects the stoichiometry of the system.

This is against the background of clear genome duplications over the course of vertebrate evolution, which would suggest that dosage-sensitive genes should be retained after subsequent chromosomal rearrangement and gene loss. About 20-30% of genes can be traced back to these duplication events; they are enriched for developmental genes and members of protein complexes, and are called “ohnologs”.

What is interesting is that 60% of these are never associated with CNV events, deletions or duplications in healthy people, and they are highly enriched for disease genes.

Critical paper: http://www.pnas.org/content/111/1/361.full “Ohnologs are overrepresented in pathogenic copy number mutations”

6.3    Suganthi Balasubramanian, Yale: “Making sense of nonsense: consequence of premature termination”

Under discussion in this talk was the characterization of loss-of-function (LoF) mutations. Many people prefer not to use this term as a catch-all and would rather break these variants down into various classes, which can include:

  • Truncating nonsense SNVs
  • Splice disrupting mutations
  • Frameshift indels
  • Large structural variations

The average person carries around a hundred LoF mutations of which around 1/5th are in a homozygous state.

It was commented that people trying to divine information from e.g. 1kG datasets had to contend with lots of sequencing or annotation artefacts when assessing this.

Critical paper: http://www.sciencemag.org/content/335/6070/823 “A Systematic Survey of Loss-of-Function Variants in Human Protein-Coding Genes”

Critical resource: http://macarthurlab.org/lof/

In particular, the effects of introducing a stop codon into a transcript are hard to predict. Some of the time the effect will be masked by splicing events or controlled by nonsense-mediated decay (NMD), which means the variant may not be pathogenic at all.

Also, stop codons in the last exon of a gene may not be of great interest, as they are unlikely to have large effects on protein conformation.
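A common heuristic behind both observations is the "50 nucleotide rule": a premature stop more than ~50nt upstream of the final exon-exon junction is predicted to trigger NMD, while one in the last exon (or just upstream of the final junction) is predicted to escape it. A minimal sketch of that rule, with invented transcript coordinates:

```python
# Heuristic sketch of the "50 nucleotide rule" for NMD escape: a
# premature stop codon escapes nonsense-mediated decay if it falls in
# the last exon, or within ~50 nt upstream of the final exon-exon
# junction. Coordinates are cDNA positions and are invented.

def escapes_nmd(stop_pos, exon_ends, window=50):
    """stop_pos: cDNA position of the premature stop codon.
    exon_ends: cumulative cDNA end positions of each exon, in order."""
    if len(exon_ends) < 2:
        return True  # single-exon transcript: no junction, no NMD
    last_junction = exon_ends[-2]  # end of the penultimate exon
    return stop_pos >= last_junction - window

exon_ends = [200, 450, 900]          # three exons; final junction at 450
print(escapes_nmd(880, exon_ends))   # True: stop is in the last exon
print(escapes_nmd(100, exon_ends))   # False: likely an NMD target
```

Real predictors (including pipelines like ALOFT, mentioned below) layer further evidence on top of this rule, since splicing context can rescue or worsen the outcome.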

The ALOFT pipeline was developed to annotate loss-of-function mutations. It uses a number of resources to make predictions, including information about NMD, protein domains, gene networks (shortest path to known disease genes), evolutionary conservation scores (GERP) and dN/dS information from mouse and macaque, with a random forest approach to classification. A list of benign variants is used in the training set, including homozygous stop mutations in the 1kG dataset, which are assumed to be non-pathogenic. Dominant effects are likely to occur in haploinsufficient genes with an HGMD entry.

HGV2014 Meeting Report, Session 2 “THE TRACTABLE CANCER GENOME”


2.1    Lillian Su, University of Toronto: “Prioritising Therapeutic Targets in the Context of Intratumour Heterogeneity”

The central question is how we can move towards molecular profiling of a patient. Heterogeneity of cancers includes not just inter-patient differences but also intra-patient differences, either within a tumour itself, or between a primary tumour and its secondary metastases.

Lillian was reporting on the IMPACT study, which has no fresh biopsy material available and so works exclusively from FFPE samples. Their initial work used a 40-gene TruSeq Custom Amplicon hotspot panel, but they are in the process of developing their own ‘550 gene’ panel, which will have its report integrated with the EHR system.

The 550-gene panel has 52 hereditary hotspots and 51 full-length genes, with the rest presumably hotspot locations. There are also 45 SNPs for QA/sample tracking.

Lillian went on to outline the difference between trial types and the effects of inter-individual differences. Patients can be stratified into ‘umbrella’ trials, which are histology-led, or ‘basket’ trials, which are led by genetic mutations (as well as N-of-1 studies where you have unmatched comparisons of drugs).

But none of this addresses intra-patient heterogeneity; it is not really considered in clinical trial design. Not all genes show good concordance in mutation spectra between primary and metastatic sites (PIK3CA was given as an example). What is really required is a knowledge base of tumour heterogeneity before a truly effective trial design can be constructed. And how do you link alterations to clinical actions?

Critical paper: http://www.nature.com/nm/journal/v20/n6/abs/nm.3559.html “Whole-exome sequencing and clinical interpretation of formalin-fixed, paraffin-embedded tumor samples to guide precision cancer medicine”

Lillian outlined the filtering strategy for variants from FFPE samples and matched bloods: a MAF of <1% in 1kG data, a VAF of >5%, and a depth of >500x in the tumour and >50x in matched bloods. Data was cross-referenced with COSMIC, TCGA, LSDBs and existing clinical trials, and missense mutations were characterized with PolyPhen, SIFT, LRT (likelihood ratio test) and MutationTaster.
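Those thresholds translate directly into a simple filter. A minimal sketch, with invented field names for the per-variant annotations:

```python
# Sketch of the reported filtering criteria: keep somatic candidates
# with population MAF < 1% (1000 Genomes), variant allele fraction
# > 5%, depth > 500x in tumour and > 50x in matched blood.
# The dictionary keys are invented for illustration.

def passes_filter(v):
    return (v["kg_maf"] < 0.01 and
            v["tumour_vaf"] > 0.05 and
            v["tumour_depth"] > 500 and
            v["normal_depth"] > 50)

variant = {"kg_maf": 0.0005, "tumour_vaf": 0.12,
           "tumour_depth": 812, "normal_depth": 96}
print(passes_filter(variant))  # True
```

Variants passing this filter would then go on to the COSMIC/TCGA cross-referencing and missense characterization steps described above.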

They are able to pick out events like KRAS G12 mutations that are enriched on treatment; since this is a driver mutation, the treatment enriches for the driver over time.

Critical paper: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3436069/ “The molecular evolution of acquired resistance to targeted EGFR blockade in colorectal cancers”

Lillian sees WES/WGS, rather than panels, as the important long-term investment, along with the use of RNA-Seq in investigating heterogeneity. Ideally you want a machine learning system overlaid on the NGS datasets. Deep sequencing of tumours early might give some idea of whether tumour heterogeneity is pre-existing or a result of tumoural selection over time. It was acknowledged that this would be hard to do for every patient, but it would answer long-standing questions about whether resistant subclones are present and stable at the start of tumourigenesis.

2.2    Charles Lee, JAX labs: “Mouse PDX and Avatars: The Jackson Laboratory Experience”

PDX stands for “Patient Derived Xenografts”. This was an amazing talk, and as such I have few notes. The basic premise is to take a tumour from a patient, segment it, and implant the segments into immunodeficient mice, where the tumours can grow. There was a lot of detail on the mouse strains involved, but the applications for this seem to be huge. Tumours can be treated in situ with a number of compounds and this information used to stratify patient treatment. The material can also be used for CNV work, grown up for biobanking, expression profiling etc.

Fitting in with the previous talk, this model can also be used for investigating tumour heterogeneity as you can transplant different sections of the same tumour and then follow e.g. size in response to drug dosage in a number of animals all harbouring parts of the same original tumour.

Importantly this is not just limited to solid tumour work as AML human cell lines can also be established in the mice in a matter of weeks.

2.3    Federica Di Nicolantonio, University of Torino, Italy: “Druggable Kinases in colorectal cancer”

The quote that stayed with me from the beginning of the talk was “Precision cancer medicine stands on exceptions”. The success stories of genomic guided medicine in cancer such as EGFR and ALK mutations are actually present in very small subsets of tumours. The ALK mutation is important in NSCLC tumours, but this is only 4% of tumours and only 2% respond. Colorectal cancer (CRC) is characterized by EGFR mutations and disruption of the RAS/RAF pathway.

However, you can’t just use mutation data to predict the response to a chemotherapeutic agent. BRAF mutations give different responses to drugs in melanomas vs. CRC because the melanomas have no expression of EGFR, owing to the differences in their embryonic origin.

Consequently, in cell-line studies the important question to ask is: are the gene expression profiles of the cell line appropriate to the tumour? This may determine the response to treatment, which may or may not be the same depending on how the cell line has developed during its time in culture. Are cell lines actually a good model at all?

Federica made the point that RNA-Seq might not be the best way to determine outlier gene expression; immunohistochemistry was their preferred route to determine whether the cell line and tumour were still in sync in terms of gene expression/drug response.

2.4    Nick Orr, Institute of Cancer Research, London: “Large-scale fine-mapping and functional characterisation identifies novel breast cancer susceptibility loci at 9q31.2”

Nick started off talking about the various classes of risk alleles that exist for breast cancer. At the top of the list there are the high penetrance risk alleles in BRCA1 and BRCA2. In the middle there are moderate risk alleles at relatively low frequency in ATM and PALB2. Then there is a whole suite of common variants that are low risk, but population wide (FGFR2 mutations cited as an example).

With breast cancer the family history is still the most important predictive factor, but even so 50% of clearly familial breast cancer cases are genetically unexplained.

He went on to talk about the COGS study (which has a website at http://nature.com/icogs), a large GWAS of 10k cases and 12k controls, followed up in a replication study of 45k cases and 45k controls.

Nick has been involved in the fine mapping follow up of the COGS data, but one of the important data points was an 11q13 association with TERT and FGFR2.

Critical paper: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3483423/ “Breast cancer risk-associated SNPs modulate the affinity of chromatin for FOXA1 and alter gene expression”

Data was presented on the fine-mapping work showing associated SNPs mapping to DNaseI hypersensitivity sites in MCF7 (a metastatic breast cancer cell line) as well as to transcription factor binding sites. This work relied on information from RegulomeDB: http://regulomedb.org/.

One of the most impressive feats of this talk was Nick reeling off seven-digit rsIDs repeatedly during his slides without stumbling over the numbers.

Work has also been performed to generate eQTLs. The GWAS loci are largely cis-acting regulators of transcription factors.

HGV2014 Meeting Report, Session 1 “INTERPRETING THE HUMAN VARIOME”


1.1    Pui-Yan Kwok, UCSF: “Structural Variations in the Human Genome”

The talk focused on structural variant (SV) detection, and the challenges were outlined as being:

  • Short reads
  • Repeats
  • CNVs
  • Haplotyping for compound heterozygote identification
  • Difficulty of analysis of SVs

Currently the approach is to map short reads to an imperfect assembly: imperfect because it is haploid, composite, and incomplete with regard to gaps, Ns and repeat sizes.

Critical paper: http://www.nature.com/nature/journal/v464/n7289/full/nature08516.html

There are ~1,000 structural variations per genome, amounting to 24Mb per person, and 11,000 common ones in the population, covering 4% of the genome (i.e. more than your exome).

ArrayCGH dup/del arrays don’t tell you about the location of your duplications and deletions, and sequencing only identifies the boundaries.

He presented a model of single-molecule analysis on the BioNano Genomics Irys platform. Briefly, this uses a nicking restriction enzyme to introduce single-stranded nicks in the DNA, which are then fluorescently labelled. The molecules are passed down a channel and resolved optically to create a set of sequence motif maps, very much akin to an optical restriction endonuclease map. This process requires high-molecular-weight DNA, so is presumably not suitable for FFPE/archival samples.

The motifs are ‘aligned’ to each other via a clustering procedure.

Critical paper: http://www.nature.com/nbt/journal/v30/n8/full/nbt.2324.html

There are some technical considerations: the labelling efficiency is not 100% (a mismatch problem on alignment), and some nick intervals are too short for optical resolution. The nicking process can make some sites fragile, causing breakup of the DNA into smaller fragments. The ‘assembly’ is still an algorithmic approach and by no means a perfect solution.

However this approach shows a great synergy with NGS for combinatorial data analysis.

They took the classic CEPH trio (NA12878/891/892) and made de novo assembled genome maps for the three individuals, generating ~259Gbases of data per sample. 99% of the data maps back to the GRCh38 assembly (I assume this is done via generating a profile of GRCh38 using an in silico nickase approach). The N50 of the assemblies is 5Mbases, and 96% of GRCh38 is covered by the assembled genomes.
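An in silico nickase profile of a reference is conceptually simple: scan both strands for the nicking enzyme's recognition motif and record the positions. A minimal sketch, assuming the GCTCTTC motif of Nt.BspQI (a nickase commonly used on this platform); the input sequence is invented:

```python
# Sketch of an in silico nick-site map: find occurrences of a nicking
# endonuclease recognition motif on both strands of a sequence,
# mimicking how a reference motif map could be generated for comparison
# with Irys molecule data. GCTCTTC is the Nt.BspQI recognition motif;
# the example sequence is invented.

def nick_map(seq, motif="GCTCTTC"):
    comp = str.maketrans("ACGT", "TGCA")
    rc = motif.translate(comp)[::-1]  # motif on the opposite strand
    positions = []
    for m in (motif, rc):
        start = seq.find(m)
        while start != -1:
            positions.append(start)
            start = seq.find(m, start + 1)
    return sorted(positions)

seq = "AAAGCTCTTCAAAAAAAAGAAGAGCTTT"
print(nick_map(seq))  # [3, 18]: one site per strand
```

The real pipeline works with inter-nick distances rather than absolute positions, and has to tolerate missed labels and fragile sites, as noted above.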

This obviously enables things like gap sizing in the current reference genome. They were able to validate 120/156 known deletions and identified 135 new ones. For insertions they validated 43/59 and found 242 new ones. A number of other mismatches were identified: 6 were switched insertion/deletion events, 9 had low coverage, and 31 had no supporting evidence.

The strength of the system is the ability to detect tandem duplications, inversions, and even complex rearrangements followed by tandem duplications. It also supports haplotyping; critically, you can tell where a CNV has arrived in the genome. This would enable applications like baiting the sequences in CNV regions and mapping the flanks, and it allows you to produce diploid genome maps.

Critical paper: http://www.cell.com/ajhg/abstract/S0002-9297%2812%2900373-4

This platform therefore allows assessment of things like DUF1220-Domain copy number repeats, implicated in autism spectrum disorders and schizophrenia (repeat number increases in ASD, and decreases in schizophrenia).

1.2    Stephen Sherry, NCBI, Maryland: “Accessing human genetic variation in the rising era of individual genome sequence”

Stephen spoke about new NCBI services, including simplified dbGaP data requests and the option to look for alleles of interest in other databases via Beacon services.

dbGaP is a genotype/phenotype database for researchers that presents its data consistent with the terms of the original patient consent. “GRU” items are “general research use”: broadly consented, genotyped or sequenced datasets that are available to all. This consists of CNV, SNP, exome (3.8k cases) and imputed data. phs000688 is the top-level ID for GRU items.

The Beacon system should be the jumping-off point for studies looking for causative mutations in disease, to find out which other studies the alleles have been observed in, rather than relying on 1KG/EVS data. This is part of the GA4GH project and really exists so that a researcher can ask a resource whether it holds a particular variant.

At some point in genome sequencing we will probably have observed a SNP event at one in every two bases, i.e. there will be a database of ~1.5 billion variant events, and we critically lack the infrastructure to support this level of data presentation. The presentation is also the wrong way around: we concern ourselves with project/study-level data organization, but this should be ‘variant’ led, i.e. you want to identify which holdings contain your SNP of interest. This is not currently possible, but the Beacon system would allow this kind of interaction between researchers.
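The Beacon contract itself is tiny: given a (chromosome, position, ref, alt) tuple, a resource answers only whether that allele has been observed, revealing no sample-level data. A minimal in-memory sketch; real Beacons expose this as a GA4GH web API, and the holdings and coordinates below are invented:

```python
# Minimal sketch of the Beacon idea: a data holder answers only
# "have you observed this allele?" with a yes/no, so existence can be
# shared without exposing genotypes or phenotypes. The variant holdings
# and coordinates here are invented for illustration.

class Beacon:
    def __init__(self, name, variants):
        self.name = name
        self.variants = set(variants)  # (chrom, pos, ref, alt) tuples

    def query(self, chrom, pos, ref, alt):
        return (chrom, pos, ref, alt) in self.variants

beacon = Beacon("demo-holdings", {("13", 32315474, "G", "A")})
print(beacon.query("13", 32315474, "G", "A"))  # True
print(beacon.query("13", 32315474, "G", "T"))  # False
```

A meta-Beacon, as mentioned below, is then just a loop over many such endpoints, aggregating the yes/no answers.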

There are a number of Beacons online sharing public holdings such as 1KG; the NCBI, GA4GH, Broad and EBI are involved. There is even a meta-Beacon that allows you to query multiple Beacons.

This introduces a new workflow: really it allows you to open a dialogue between yourself and the data holder. The existence of a variant is still devoid of context, but you can contact the data holder and then enter a controlled-access agreement for the metadata, or for information down to the read level.

Machine mining of Beacon resources is prohibited. However, the SRA toolkit allows access to dbGaP with security tokens, which allows automatic querying of SRA-related material with local caching.

1.3    Daniel Geraghty, FHCRC, Seattle: “Complete re-sequencing of extended genomic regions using fosmid target capture and single molecule, real time (SMRT) long-read sequencing technology”

This talk introduced a fosmid enrichment strategy followed by SMRT sequencing for characterizing complex genomic regions.

The premise was set up by suggesting that GWAS leaves rare variants undetected. Fosmid based recloning of HLA has been demonstrated.

Critical paper: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1199539/

The steps involved are building a fosmid library, which is then plated out. Molecular inversion probes are used to identify fosmids from the region of interest. Single clones are then extracted and sequenced extensively. This obviously means you need a fosmid library for each individual you’re looking at; it is not a hybridization extraction method like using BACs as baits for large regions.

Sequencing is done on PacBio, both for speed (faster than a MiSeq) and for read length. At this point the data can be assembled by Velvet, or even by the venerable Phrap/Consed approaches. About 40-100 PacBio reads are required to assemble a fosmid clone.

Quiver can be used to derive a consensus sequence, and once a fosmid has been assembled it can be co-assembled with other, similarly reconstructed fosmids to reach regions of 800kb.

The question was raised whether it might be possible to bypass the fosmid step with other recombineering approaches, to work directly with gDNA and MIPs.

1.4    Peter Byers, UWASH, Seattle: “Determinants of splice site mutation outcomes and comparison of characterisation in cultured cells with predictive programs”

Peter talked about the prediction of splice mutation effects, with particular reference to the collagen genes. 20% of collagen mutations are splice site mutations (these genes have lots of exons). This is pathogenic in a spread of osteogenesis imperfecta (OI) disorders. It is complex because we not only have to consider the effects on splice donor and splice acceptor sites, but also the effects on lariat sequences within introns.

Consequently there are a number of possible downstream effects: the production of cryptic splice sites, intron retention, and exon skipping (which tends to lead to more severe phenotypes). But this is made more complex again by the fact that a single variant can have multiple outcomes, and there is no clear explanation for this.

This complexity means that it is hard to produce a computational prediction program that takes into account all the uncertainties of the system, especially at locations 3, 4 or 5 bases outside the splice site.

SplicePort and Asseda were tested, and Asseda came out on top, with 29% of events wrongly predicted when compared with experimental evidence. So what is happening to make these predictions incorrect?

Peter explained that the order of intron removal in a gene is gene-specific but shared between individuals; there is no global model for what that order might be, although it must be encoded in some way by the intronic sequence. The speed of intron removal and the effects on the mature mRNA are incredibly important to the pathogenesis of the disease. It was clearly shown that the splicing outcomes under study were predicated on the speed of intron removal as the RNA matured.

If you want to predict the splicing effect of a mutation, you therefore need some information about the order of intron processing in the gene you’re looking at to have a completely holistic view of the system. How do you generate this information systematically? It’s a very labour-intensive piece of work, and Peter was looking for suggestions on how best to mine RNA-Seq data to get to the bottom of this line of enquiry. Is it even possible to do homology-based predictions of splicing speed and therefore splicing order?