

Caveats: I have not taken notes in every talk of every session, a lack of notes for a particular speaker does not constitute disinterest on my part, I simply took notes for the talks that were directly related to my current work. If I have misquoted, misrepresented or misunderstood anything, and you are the speaker concerned, or a member of the team involved in the work, please leave a comment on the post, and I will rectify the situation accordingly.

6.1    Yves Moreau, University of Leuven, Belgium: “Variant Prioritisation by genomic data fusion”


An essential part of the prioritization process is the integration of phenotype.

Critical paper: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3083082/ “Exome sequencing and disease-network analysis of a single family implicate a mutation in KIF1A in hereditary spastic paraparesis”

Yves introduced “Endeavour”, which takes a gene list, matches it to the disease of interest and ranks the genes; this requires the phenotypic information to be ‘rich’. Two main questions need to be addressed: 1) which genes are related to a phenotype? and 2) which variants in a gene are pathogenic? Candidate gene prioritization is not a new thing, and has a long history in microarray analysis. Whilst it is easy to interrogate things like pathway information, GO terms and the literature, it is much harder to find relevant expression profile information or functional annotation, and existing machine learning tools do not really support these data types.

Critical paper: http://www.ncbi.nlm.nih.gov/pubmed/16680138 “Gene prioritization through genomic data fusion.”

Critical resource: http://homes.esat.kuleuven.be/~bioiuser/endeavour/tool/endeavourweb.php

Endeavour can be trained, rank genes according to various criteria, and then merge those ranks using order statistics.
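As an illustration of the rank-merging idea: Endeavour itself combines per-source rankings with order statistics, but this sketch substitutes a simpler geometric mean of rank ratios, with invented gene names and ranks.

```python
# Simplified sketch of rank fusion across evidence sources.
# Each gene's per-source ranks are converted to rank ratios and
# combined with a geometric mean (a stand-in for Endeavour's
# order-statistics merge). Gene names and ranks are invented.
from math import prod

def merge_ranks(rankings):
    """rankings: list of dicts mapping gene -> rank (1 = best)."""
    genes = set().union(*rankings)
    n = len(genes)
    scores = {}
    for gene in genes:
        # Rank ratio in each source; unranked genes get the worst ratio.
        ratios = [r.get(gene, n) / n for r in rankings]
        scores[gene] = prod(ratios) ** (1 / len(ratios))
    # Lower combined score = stronger overall candidate.
    return sorted(genes, key=lambda g: scores[g])

pathway = {"KIF1A": 1, "BRCA2": 2, "TP53": 3}
expression = {"KIF1A": 2, "TP53": 1, "BRCA2": 3}
literature = {"KIF1A": 1, "BRCA2": 3, "TP53": 2}
print(merge_ranks([pathway, expression, literature]))  # KIF1A first
```

A gene that ranks well across several independent data sources floats to the top even if no single source ranks it first.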

Next, eXtasy was introduced: another variant prioritization tool, this one for non-synonymous variants given a specific phenotype.

Critical resource: http://homes.esat.kuleuven.be/~bioiuser/eXtasy/

Critical paper: http://www.nature.com/nmeth/journal/v10/n11/abs/nmeth.2656.html “eXtasy: variant prioritization by genomic data fusion”

eXtasy allows variants to be ranked by their effect on protein structure, association in case/control or GWAS studies, and evolutionary conservation.

The problem though is one of multiscale data integration – we might know that a megabase region is interesting through one technique, a gene is interesting by another technique, and then we need to find the variant of interest from a list of variants in that gene.

They have performed HGMD-to-HPO mappings (1,142 HPO terms cover the HGMD mutations). It was noted that PolyPhen and SIFT are useless for distinguishing between disease-causing and rare, benign variants.

eXtasy produces rankings for a VCF file by taking the trained classifier data and using a random forest approach to rank the variants. One of the underlying assumptions of this approach is that any rare variant found in the 1kG dataset is benign, as these are meant to be nominally asymptomatic individuals.
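The training logic described can be sketched roughly as follows. This is not eXtasy's actual code: known disease mutations are labelled pathogenic, rare 1kG variants are assumed benign, and the feature values are random stand-ins for real annotations (conservation, structural impact, phenotype association).

```python
# Hedged sketch of random-forest variant ranking with
# assumed-benign 1kG training labels. Features are random
# placeholders, not real annotation values.
import random
from sklearn.ensemble import RandomForestClassifier

random.seed(0)

def fake_features():
    # Stand-ins for e.g. conservation, structural-impact scores.
    return [random.random(), random.random(), random.random()]

# Label 1 = pathogenic (e.g. HGMD), 0 = assumed benign (rare 1kG).
X = [fake_features() for _ in range(200)]
y = [1] * 100 + [0] * 100

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Rank candidate variants by predicted probability of pathogenicity.
candidates = [fake_features() for _ in range(10)]
probs = clf.predict_proba(candidates)[:, 1]
ranked = sorted(range(10), key=lambda i: -probs[i])
```

With real features, the classifier's class-1 probability becomes the per-variant score used for ranking.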

These approaches are integrated into NGS-Logistics, a federated analysis of variants across multiple sites, which has some similarities to the Beacon approaches discussed previously. NGS-Logistics is a project looking for test and partner sites.

Critical paper: http://genomemedicine.com/content/6/9/71/abstract

Critical resource: https://ngsl.esat.kuleuven.be

However, it is clear that what is required, as much as a perfect database of pathogenic mutations, is a database of benign ones: local population controls for ethnicity matching, but also high-MAF variants and rare variants in asymptomatic datasets.

6.2    Aoife McLysaght, Trinity College Dublin: “Dosage Sensitive Genes in Evolution and Disease”


Aoife started by saying that most CNVs in the human genome are benign. The quality that makes a CNV pathogenic is that of gene dosage. Haploinsufficiency (where half the product != half the activity) affects about 3% of genes in a systematic study in yeast. This is going to affect certain classes of genes, for instance those where concentration-dependent effects are very important (morphogens in developmental biology, for example).

This can occur through mechanisms like a propensity towards low-affinity, promiscuous aggregation of the protein product. Consequently the relative balance of genes can be the problem, where it affects the stoichiometry of the system.

This is set against the background of clear genome duplications over the course of vertebrate evolution, which would suggest that dosage-sensitive genes should be retained after subsequent chromosomal rearrangement and gene loss. About 20-30% of genes can be traced back to these duplication events, and they are enriched for developmental genes and members of protein complexes. These are called “ohnologs”.

What is interesting is that 60% of these are never associated with CNV events (deletions and duplications) in healthy people, and they are highly enriched for disease genes.

Critical paper: http://www.pnas.org/content/111/1/361.full “Ohnologs are overrepresented in pathogenic copy number mutations”

6.3    Suganthi Balasubramanian, Yale: “Making sense of nonsense: consequence of premature termination”

Under discussion in this talk was the characterization of loss-of-function (LoF) mutations. There are many people who prefer not to use this term, and would rather break these mutations down into various classes, which can include:

  • Truncating nonsense SNVs
  • Splice disrupting mutations
  • Frameshift indels
  • Large structural variations

The average person carries around a hundred LoF mutations of which around 1/5th are in a homozygous state.

It was commented that people trying to divine information from e.g. the 1kG datasets had to contend with lots of sequencing artefacts or annotation artefacts when assessing this.

Critical paper: http://www.sciencemag.org/content/335/6070/823 “A Systematic Survey of Loss-of-Function Variants in Human Protein-Coding Genes”

Critical resource: http://macarthurlab.org/lof/

In particular, the consequences of introducing stop codons into a transcript are hard to predict. Some of the time the effect will be masked by splicing events or controlled by nonsense-mediated decay (NMD), which means the variants may not be pathogenic at all.

Also, stop codons in the last exon of a gene may not be of great interest, as they are unlikely to have large effects on protein conformation.
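A common heuristic for the NMD behaviour described above is the 50-nucleotide rule: a premature stop more than ~50 nt upstream of the last exon-exon junction is predicted to trigger NMD, while stops in the last exon (or just before the final junction) are predicted to escape it. A minimal sketch, with invented transcript coordinates:

```python
# Minimal sketch of the 50-nucleotide NMD rule. Positions are
# relative to the spliced transcript; exon lengths are invented.

def predicted_nmd(stop_pos, exon_lengths, rule_nt=50):
    """stop_pos: position of the premature stop in the transcript.
    exon_lengths: lengths of the transcript's exons, 5' to 3'."""
    if len(exon_lengths) < 2:
        return False  # single-exon transcripts generally escape NMD
    last_junction = sum(exon_lengths[:-1])
    # Triggers NMD only if the stop lies more than rule_nt
    # upstream of the last exon-exon junction.
    return stop_pos < last_junction - rule_nt

exons = [200, 150, 300]               # last junction at position 350
print(predicted_nmd(100, exons))      # early stop: predicted NMD target
print(predicted_nmd(400, exons))      # in last exon: predicted to escape
```

A stop that escapes NMD produces a truncated protein rather than degrading the transcript, which changes how the variant should be interpreted.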

The ALOFT pipeline was developed to annotate loss-of-function mutations. This uses a number of resources to make predictions, including information about NMD, protein domains, gene networks (shortest path to known disease genes), evolutionary conservation scores (GERP) and dN/dS information from mouse and macaque, with a random forest approach to classification. A list of benign variants is used in the training set, including things like homozygous stop mutations in the 1kG dataset, which are assumed to be non-pathogenic. Dominant effects are likely to occur in haploinsufficient genes with an HGMD entry.
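As a rough illustration of how such a pipeline might assemble its per-variant inputs for a classifier, the feature names and values below are invented, not ALOFT's actual schema:

```python
# Hypothetical per-variant feature vector in the spirit of the
# inputs listed above; names and values are illustrative only.

def lof_feature_vector(variant):
    return [
        float(variant["predicted_nmd"]),      # NMD prediction
        variant["domain_fraction_lost"],      # protein domains truncated
        variant["gerp"],                      # conservation (GERP)
        variant["dnds_mouse"],                # dN/dS vs mouse
        variant["network_dist_to_disease"],   # shortest path to disease gene
    ]

v = {"predicted_nmd": True, "domain_fraction_lost": 0.6,
     "gerp": 4.2, "dnds_mouse": 0.1, "network_dist_to_disease": 2}
print(lof_feature_vector(v))
```

Each annotation source contributes one or more numeric features, and the random forest is trained over such vectors.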



3.1    Peter Robinson, Humboldt University, Berlin: “Effective diagnosis of genetic disease by computational phenotype analysis of the disease associated genome”

Peter focused on the use of bioinformatics in medicine, specifically around the use of ontologies to describe phenotypes and look for similarities between diseases. It is important to capture the signs, symptoms and behavioural abnormalities of a patient in PRECISE language to be useful.

The concept here is ‘deep phenotyping’ – there is almost no such thing as too much information about clinical presentation, but it must be recorded consistently to provide a basis for computational comparison and analysis.

HPO (The Human Phenotype Ontology) was introduced, saying that in many ways it is indebted to OMIM (Online Mendelian Inheritance in Man).

He felt strongly that the standard exome, with its 17k genes, is ‘useless’ in a diagnostic context, when there are 2800 genes associated with 5000 disorders, covering a huge spectrum of presenting disease. Consequently he does not recommend screening the exome as a first-line test, but encourages the use of reduced clinical exomes. This allows, especially, higher coverage for the same per-sample cost; he suggested the aim should be to have 98% of the target regions covered to >20x.
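The suggested QC target can be checked with a few lines over per-base depths. Real pipelines would take these from e.g. samtools or mosdepth output; the depths here are invented:

```python
# Sketch of the coverage QC check: what fraction of targeted
# bases reach >=20x? Depths are toy values for illustration.

def fraction_covered(depths, threshold=20):
    """depths: per-base read depth across the target regions."""
    if not depths:
        return 0.0
    return sum(d >= threshold for d in depths) / len(depths)

depths = [35, 22, 19, 40, 21, 8, 25, 30, 20, 27]
frac = fraction_covered(depths)
print(f"{frac:.0%} of target bases at >=20x")  # 80% here; the aim is 98%
```

Running this over the real per-base depth profile of the clinical exome targets gives the single number to compare against the 98% goal.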

Clearly identified pathogenic mutations are the easiest thing to call from this kind of dataset, and OMIM remains the first port of call for finding out the association of a mutation with a condition. But OMIM is not going to be of much help in finding information on a predicted deleterious mutation in a random chromosomal ORF.

Specifically, they take VCF files and annotate them with HPO terms as well as the standard suite of Mutation Taster, PolyPhen and SIFT.

A standard filtering pipeline should get you down to 50 to 100 genes of interest; you can then do a phenotype comparison between the HPO terms you have collected from the clinical presentation and the HPO terms annotated in the VCF. This can give you a ranked list of variants.

This was tested by running 10k simulations of such a process, with variants from HGMD spiked into an asymptomatic individual’s VCF file. The gene ranking score depends on a variant score for deleteriousness and a phenotype score for the match to the clinical phenotype. In the simulations, the right gene was at the top of the list 80% of the time.
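A minimal sketch of this two-part scoring, assuming a simple Jaccard overlap of HPO term sets for the phenotype score (the real tools use more sophisticated semantic similarity over the ontology) and invented genes, terms and scores:

```python
# Toy gene ranking: average of a variant (deleteriousness) score
# and a phenotype score. Genes, HPO terms and scores are invented.

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def rank_genes(candidates, patient_hpo):
    """candidates: gene -> (variant_score in [0,1], HPO term set)."""
    scored = {
        gene: (var + jaccard(terms, patient_hpo)) / 2
        for gene, (var, terms) in candidates.items()
    }
    return sorted(scored, key=scored.get, reverse=True)

patient = {"HP:0001257", "HP:0002064"}  # illustrative HPO terms
candidates = {
    "KIF1A": (0.9, {"HP:0001257", "HP:0002064"}),   # matches phenotype
    "GENE_B": (0.95, {"HP:0000510"}),               # deleterious, no match
    "GENE_C": (0.4, {"HP:0001257"}),
}
print(rank_genes(candidates, patient))  # KIF1A ranks first
```

Note how GENE_B, despite the highest variant score, drops below KIF1A once the phenotype mismatch is factored in: that combination is what pushes the causal gene to the top of the list.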

This approach is embodied in PhenIX: http://compbio.charite.de/PhenIX/

This has led to the development of a clinical bioinformatics workflow where the clinician supplies the HPO terms and runs the algorithm. Information is borrowed from OMIM and Orphanet in the process.

Prioritisation of variants is not a smoking gun for pathogenicity, however. It needs to be backed up by Sanger sequencing validation and co-segregation analysis within a family (if available). Effective diagnosis of disease will not lose the human component.

Exomiser was also introduced (http://www.sanger.ac.uk/resources/databases/exomiser/query/) from Damian Smedley’s group at the Sanger Institute; it uses information from the mouse and zebrafish to increase its utility, as there is a huge amount of phenotype data from developmental biology studies of gene knockouts in other organisms.

3.2    Dan Bradley, Trinity College, Dublin: “Ancient population genomics: do it all, or not at all”

Dan gave a great talk on the sequencing of ancient DNA to look at population data. Ancient DNA is highly fragmented, and you’re generally working with 50-70 base fragments (generally worse than FFPE samples).

DNA from ancient samples actually undergoes a target enrichment step, largely to remove environmental sequence contamination, although it was noted that repetitive DNA can be problematic in terms of ruining a capture experiment.

From the ancient samples that were covered at 22x (I don’t expect that’s genome coverage, but rather target capture coverage), the data were down-sampled to 1x, and 1kG data were then used to impute the likely genotypes. This recapitulated 99% of the calls from the original 22x data, showing that this approach can be used to reconstruct ancestral population genomics information from very limited datasets, using very modern data.
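The concordance check described can be sketched as a straightforward site-by-site comparison of imputed genotypes against the full-coverage calls (toy 0/1/2 allele counts, invented here):

```python
# Genotype concordance between imputed 1x calls and 22x "truth"
# calls, over the sites called in both. Values are toy data.

def concordance(truth, imputed):
    """Fraction of sites where the imputed genotype matches truth."""
    assert len(truth) == len(imputed)
    matches = sum(t == i for t, i in zip(truth, imputed))
    return matches / len(truth)

truth_22x  = [0, 1, 2, 1, 0, 0, 2, 1, 1, 0]
imputed_1x = [0, 1, 2, 1, 0, 1, 2, 1, 1, 0]
print(f"concordance: {concordance(truth_22x, imputed_1x):.0%}")  # 90% here
```

A result of ~99% over the real call set is what justifies working from 1x down-sampled data plus a modern reference panel.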

23andMe kit ordered

So today I took a look at my savings account for ‘frivolous’ things, wondered what I could do with the money and split it between a couple of nights in a hotel in Manchester for a gig next week, and a 23andMe kit. This is something I’ve been wanting to do for a long time, previously thwarted by a lack of cash.

Watching Genomes Unzipped unfold and seeing a number of my Twitter/FriendFeed/BioStar chums get their spit analysed has been interesting, but since my day to day job now involves hunting down variations in clinical samples from exome sequencing projects, the subject of what might be lurking in my own genome has started to exhibit a morbid fascination.

I will say straight up, that if I had the money and there was an established DTC exome offering, I’d just get my exome sequenced and analyse it myself. However there is admittedly a little professional interest in how 23andMe present the data, as well as finding out more about what might be in store for me as I approach (if I have not already entered ;)) middle age.

I haven’t decided what I might do with the data (in terms of public release) when I receive it; I’m going to have a bit of a think and a read before the saliva kit arrives. Interestingly, I’m feeling quite apprehensive about the results even at this stage; we will see how that develops too.