HGV2014 Meeting Report, Session 2 “THE TRACTABLE CANCER GENOME”

Caveats: I have not taken notes in every talk of every session; a lack of notes for a particular speaker does not constitute disinterest on my part, as I simply took notes for the talks that were directly related to my current work. If I have misquoted, misrepresented or misunderstood anything, and you are the speaker concerned or a member of the team involved in the work, please leave a comment on the post and I will rectify the situation accordingly.

2.1    Lillian Su, University of Toronto: “Prioritising Therapeutic Targets in the Context of Intratumour Heterogeneity”

The central question is how we can move towards molecular profiling of a patient. Heterogeneity of cancers includes not just inter-patient differences but also intra-patient differences, either within a tumour itself or between a primary tumour and its secondary metastases.

Lillian was reporting on the IMPACT study, which has no fresh biopsy material available and so works exclusively from FFPE samples. Their initial work has used a 40-gene TruSeq Custom Amplicon hotspot panel, but they are in the process of developing their own ‘550 gene’ panel, which will have the report integrated with the EHR system.

The 550 gene panel has 52 hereditary hotspots, 51 full-length genes and, presumably, hotspot locations for the rest. There are also 45 SNPs for QA/sample tracking.

Lillian went on to outline the difference between trial types and the effects of inter-individual differences. Patients can be stratified into ‘umbrella’ trials, which are histology-led, or ‘basket’ trials, which are led by genetic mutations (as well as N-of-1 studies, where you have unmatched comparisons of drugs).

But none of this addresses intra-patient heterogeneity; it is not really considered in clinical trial design. Not all genes have good concordance in mutation spectra between primary and metastatic sites (PIK3CA was given as an example). What is really required is a knowledge base of tumour heterogeneity before a truly effective trial design can be constructed. And how do you link alterations to clinical actions?

Critical paper: http://www.nature.com/nm/journal/v20/n6/abs/nm.3559.html “Whole-exome sequencing and clinical interpretation of formalin-fixed, paraffin-embedded tumor samples to guide precision cancer medicine”

Lillian outlined the filtering strategy for variants from FFPE and matched bloods. This was a MAF of <1% in 1kG data, a VAF of >5% and DP >500x for the tumour, and DP >50x in matched bloods. Data were cross-referenced with COSMIC, TCGA, LSDBs and existing clinical trials, and missense mutations were characterized with PolyPhen, SIFT, LRT (likelihood ratio test) and MutationTaster.
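
As a minimal sketch (not the IMPACT pipeline itself), the thresholds described translate into something like the following Python filter; the dictionary field names are hypothetical:

# Minimal sketch of the variant filter described above (thresholds from the talk);
# the dict field names are hypothetical and not from any specific pipeline.

def passes_filter(variant):
    """Keep somatic candidates: rare in 1kG, supported in tumour, covered in normal."""
    return (
        variant["kg_maf"] < 0.01            # population MAF < 1% in 1000 Genomes
        and variant["tumour_vaf"] > 0.05    # variant allele fraction > 5% in tumour
        and variant["tumour_depth"] > 500   # > 500x depth in the FFPE tumour sample
        and variant["normal_depth"] > 50    # > 50x depth in the matched blood
    )

candidates = [
    {"id": "KRAS_G12D", "kg_maf": 0.0, "tumour_vaf": 0.18, "tumour_depth": 812, "normal_depth": 95},
    {"id": "common_SNP", "kg_maf": 0.23, "tumour_vaf": 0.48, "tumour_depth": 900, "normal_depth": 110},
]
kept = [v["id"] for v in candidates if passes_filter(v)]
print(kept)  # ['KRAS_G12D']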

They are able to pick out events like KRAS G12 mutations that become enriched on treatment; as this is a driver mutation, the treatment effectively selects for the driver clone over time.
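
To illustrate what “enriched on treatment” looks like in the data, here is a toy Python example tracking the VAF of a driver mutation across serial samples; the read counts are invented:

# Illustrative only: tracking the variant allele fraction (VAF) of a driver
# mutation across serial samples. The counts below are made up.
serial_samples = [
    ("baseline",         {"alt_reads": 40,  "total_reads": 800}),
    ("on_treatment_3mo", {"alt_reads": 150, "total_reads": 750}),
    ("on_treatment_6mo", {"alt_reads": 320, "total_reads": 820}),
]

for label, counts in serial_samples:
    vaf = counts["alt_reads"] / counts["total_reads"]
    print(f"{label}: VAF = {vaf:.2f}")
# A rising VAF for e.g. KRAS G12 across timepoints is what "enriched on
# treatment" looks like in the sequencing data.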

Critical paper: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3436069/ “The molecular evolution of acquired resistance to targeted EGFR blockade in colorectal cancers”

Lillian sees WES/WGS, rather than panels, as an important long-term investment, along with the use of RNA-Seq in investigating heterogeneity. Ideally you would want a machine learning system overlaid on the NGS datasets. Deep sequencing of tumours early on might give some idea of whether tumour heterogeneity is pre-existing or a result of tumoural selection over time. It was acknowledged that this would be hard to do for every patient, but it would answer long-standing questions about whether resistant subclones are present and stable at the start of tumourigenesis.

2.2    Charles Lee, JAX labs: “Mouse PDX and Avatars: The Jackson Laboratory Experience”

PDX stands for “Patient Derived Xenografts”. This was an amazing talk, and as such I have few notes. The basic premise is to take a tumour from a patient, segment it and implant the segments into immunodeficient mice, where the tumours can grow. There was a lot of detail on the mouse strains involved, but the applications for this seem to be huge. Tumours can be treated in situ with a number of compounds and this information used to stratify patient treatment. The material can be used for CNV work, grown up for biobanking, expression profiling etc.

Fitting in with the previous talk, this model can also be used for investigating tumour heterogeneity as you can transplant different sections of the same tumour and then follow e.g. size in response to drug dosage in a number of animals all harbouring parts of the same original tumour.

Importantly this is not just limited to solid tumour work as AML human cell lines can also be established in the mice in a matter of weeks.

2.3    Frederica Di Nicolantonio, University of Torino, Italy: “Druggable Kinases in colorectal cancer”

The quote that stayed with me from the beginning of the talk was “Precision cancer medicine stands on exceptions”. The success stories of genomically guided medicine in cancer, such as EGFR and ALK mutations, are actually present in very small subsets of tumours. The ALK mutation is important in NSCLC tumours, but this is only 4% of tumours and only 2% respond. Colorectal cancer (CRC) is characterized by EGFR mutations and disruption of the RAS/RAF pathway.

However the situation is that you can’t just use mutation data to predict the response to a chemotherapeutic agent. BRAF mutations give different responses to drugs in melanomas vs. CRC because the melanomas have no expression of EGFR, owing to the differences in their embryonic origin.

Consequently, in cell-line studies the important question to ask is whether the gene expression profiles of the cell line are appropriate to the tumour. This may determine the response to treatment, which may or may not be the same depending on how the cell line has developed during its time in culture. Are cell lines actually a good model at all?

Frederica made the point that RNA-Seq might not be the best way of determining outlier gene expression, and that immunohistochemistry was their preferred route to determine whether the cell line and tumour were still in sync in terms of gene expression/drug response.

2.4    Nick Orr, Institute of Cancer Research, London: “Large-scale fine-mapping and functional characterisation identifies novel breast cancer susceptibility loci at 9q31.2”

Nick started off talking about the various classes of risk alleles that exist for breast cancer. At the top of the list there are the high penetrance risk alleles in BRCA1 and BRCA2. In the middle there are moderate risk alleles at relatively low frequency in ATM and PALB2. Then there is a whole suite of common variants that are low risk, but population wide (FGFR2 mutations cited as an example).

With breast cancer the family history is still the most important predictive factor, but even so 50% of clearly familial breast cancer cases are genetically unexplained.

He went on to talk about the COGS study (website at http://nature.com/icogs), which involved a large GWAS of 10k cases and 12k controls, followed up in a replication study of 45k cases and 45k controls.

Nick has been involved in the fine mapping follow up of the COGS data, but one of the important data points was an 11q13 association with TERT and FGFR2.

Critical paper: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3483423/ “Breast cancer risk-associated SNPs modulate the affinity of chromatin for FOXA1 and alter gene expression”

Data were presented on the fine-mapping work, showing associated SNPs mapping to DNaseI hypersensitivity sites in MCF7 (a metastatic breast cancer cell line) as well as to transcription factor binding sites. This work relied on information from RegulomeDB: http://regulomedb.org/.
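
As a hedged illustration of the overlap analysis (not the actual workflow used in the talk), intersecting SNP positions with DNaseI hypersensitivity peaks boils down to a simple interval check; the coordinates below are invented:

# Toy sketch: intersecting fine-mapped SNP positions with DNaseI
# hypersensitivity peaks (e.g. from MCF7). Coordinates are invented; a real
# analysis would use BED files and a tool such as bedtools or RegulomeDB.

dhs_peaks = [("chr9", 110_800_000, 110_801_500), ("chr9", 110_950_000, 110_952_000)]
snps = {"rs1234567": ("chr9", 110_800_750), "rs7654321": ("chr9", 111_200_000)}

def in_peak(chrom, pos, peaks):
    return any(c == chrom and start <= pos <= end for c, start, end in peaks)

for rsid, (chrom, pos) in snps.items():
    print(rsid, "overlaps DHS" if in_peak(chrom, pos, dhs_peaks) else "no overlap")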

One of the most impressive feats of this talk was Nick reeling off seven-digit rsIDs repeatedly during his slides without stumbling over the numbers.

Work has also been performed to generate eQTLs. The GWAS loci are largely cis-acting regulators of transcription factors.
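
For readers unfamiliar with eQTL analysis, the core idea is to regress gene expression on allele dosage at a SNP. This is a minimal simulated sketch, not the analysis presented in the talk:

# Minimal eQTL sketch: regress gene expression on allele dosage (0/1/2) and
# report the slope. Data are simulated; real analyses use tools such as
# Matrix eQTL, with covariates and multiple-testing correction.
import numpy as np

rng = np.random.default_rng(0)
dosage = rng.integers(0, 3, size=200)              # genotypes at a cis SNP
expression = 1.5 * dosage + rng.normal(0, 1, 200)  # simulated cis effect

slope, intercept = np.polyfit(dosage, expression, 1)
print(f"per-allele effect on expression: {slope:.2f}")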

HGV2014 Meeting Report, Session 1 “INTERPRETING THE HUMAN VARIOME”

Caveats: I have not taken notes in every talk of every session; a lack of notes for a particular speaker does not constitute disinterest on my part, as I simply took notes for the talks that were directly related to my current work. If I have misquoted, misrepresented or misunderstood anything, and you are the speaker concerned or a member of the team involved in the work, please leave a comment on the post and I will rectify the situation accordingly.

1.1    Pui-Yan Kwok, UCSF: “Structural Variations in the Human Genome”

The talk focused on structural variant detection; the challenges were outlined as being:

  • Short reads
  • Repeats
  • CNVs
  • Haplotyping for compound heterozygote identification
  • Difficulty of analysis of SVs

Currently the approach is to map short reads to an imperfect assembly: imperfect because it is haploid, composite and incomplete with regard to gaps, Ns and repeat sizes.

Critical paper: http://www.nature.com/nature/journal/v464/n7289/full/nature08516.html

There are 1000 structural variations per genome, amounting to 24Mb per person, and 11,000 common ones in the population covering 4% of the genome (i.e. more than your exome).

ArrayCGH dup/del arrays don’t tell you about the location of your duplications and deletions. Sequencing only identifies the boundaries.

He presented a model of single molecule analysis on the BioNanoGenomics Irys platform. Briefly, this uses a nicking enzyme to introduce single-stranded nicks in the DNA, which are then fluorescently labelled. The molecules are then passed down a channel and resolved optically to create a set of sequence motif maps, very much akin to an optical restriction endonuclease map. This process requires high molecular weight DNA, so is presumably not suitable for FFPE/archival samples.

The motifs are ‘aligned’ to each other via a clustering procedure.

Critical paper: http://www.nature.com/nbt/journal/v30/n8/full/nbt.2324.html

There are some technical considerations: the labelling efficiency is not 100% (a mismatch problem on alignment), and some labelled sites are too close together to resolve optically. The nicking process can make some sites fragile, causing breakup of the DNA into smaller fragments. The ‘assembly’ is still an algorithmic approach and by no means a perfect solution.

However this approach shows a great synergy with NGS for combinatorial data analysis.

They took the classic CEPH trio (NA12878/891/892) and made de novo assembled genome maps for the three individuals, generating ~259Gbases of data per sample. 99% of the data maps back to the GRCh38 assembly (I assume this is done via generating a profile of GRCh38 using an in silico nickase approach). The N50 of the assemblies is 5Mbases, and 96% of GRCh38 is covered by the assembled genomes.
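
A sketch of what an in silico nickase profile might look like, assuming a simple motif-scanning approach (Nt.BspQI’s GCTCTTC recognition motif is commonly used on Irys; only the forward strand is scanned here and the sequence is a toy):

# Sketch of the "in silico nickase" idea: scan a reference sequence for the
# nicking-enzyme recognition motif and record inter-site distances, which is
# essentially the fingerprint an optical map is compared against.
import re

def motif_map(sequence, motif="GCTCTTC"):
    positions = [m.start() for m in re.finditer(motif, sequence)]
    distances = [b - a for a, b in zip(positions, positions[1:])]
    return positions, distances

toy_ref = "ATGCTCTTC" + "A" * 5000 + "GCTCTTC" + "C" * 12000 + "GCTCTTC"
positions, distances = motif_map(toy_ref)
print(positions)   # motif start coordinates
print(distances)   # inter-label distances used for map alignment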

This obviously enables things like gap sizing in the current reference genome. They were able to validate 120/156 known deletions and identified 135 new ones. For insertions they validated 43/59 and found 242 new ones. A number of other mismatches were identified: 6 were switched insertion/deletion events, 9 were low coverage, and for 31 there was no evidence.

The strength of the system is the ability to resolve tandem duplications, inversions and even complex rearrangements followed by tandem duplications. It also supports haplotyping and, critically, you can tell where a CNV has landed in the genome. This would enable applications like baiting the sequences in CNV regions and mapping the flanks, and it allows you to produce diploid genome maps.

Critical paper: http://www.cell.com/ajhg/abstract/S0002-9297%2812%2900373-4

This platform therefore allows assessment of things like DUF1220-Domain copy number repeats, implicated in autism spectrum disorders and schizophrenia (repeat number increases in ASD, and decreases in schizophrenia).

1.2    Stephen Sherry, NCBI, Maryland: “Accessing human genetic variation in the rising era of individual genome sequence”

Stephen spoke about new NCBI services, including simplified dbGaP data requests and the option to look for alleles of interest in other databases via Beacon services.

dbGaP is a genotype/phenotype database for researchers that presents its data consistent with the terms of the original patient consent. “GRU” items are “general research use”: these are broadly consented, genotyped or sequenced datasets that are available to all. This consists of CNV, SNP, exome (3.8k cases) and imputed data. PHS000688 is the top-level ID for GRU items.

The Beacon system should be the jumping-off point for studies looking for causative mutations in disease, to find out which other studies the alleles have been observed in, rather than relying on 1KG/EVS data. This is part of the GA4GH project and really exists so a researcher can ask a resource whether it has a particular variant.

At some point in genome sequencing we will probably have observed a SNP event at one in every two bases, i.e. there will be a database of 1.5 billion variant events. Critically, we lack the kind of infrastructure to support this level of data presentation, and the presentation is the wrong way around. We concern ourselves with project/study-level data organization, but this should be “variant”-led, i.e. you want to identify which holdings have your SNP of interest. This is not currently possible, but the Beacon system would allow this kind of interaction between researchers.

There are a number of Beacons online, which are sharing public holdings such as 1KG. The NCBI, GA4GH, Broad, EBI are involved. There is even a meta-Beacon that allows you to query multiple Beacons.
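
As a hypothetical sketch of the interaction, a Beacon-style query amounts to asking an endpoint whether it has seen an allele. The URL and parameter names below are illustrative only; consult the GA4GH Beacon documentation for the real API:

# Hypothetical sketch of a Beacon-style query: ask a resource whether it has
# observed a given allele. Endpoint and parameter names are illustrative.
import json
import urllib.parse
import urllib.request

def beacon_has_variant(base_url, chrom, pos, ref, alt, assembly="GRCh37"):
    params = urllib.parse.urlencode({
        "referenceName": chrom,
        "start": pos,
        "referenceBases": ref,
        "alternateBases": alt,
        "assemblyId": assembly,
    })
    with urllib.request.urlopen(f"{base_url}/query?{params}") as resp:
        return json.load(resp).get("exists", False)

# e.g. beacon_has_variant("https://example.org/beacon", "17", 41244000, "G", "A")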

This introduces a new workflow: really it allows you to open a dialogue between yourself and the data holder. The existence of a variant is still devoid of context, but you can contact the data holder and then enter a controlled-access agreement for the metadata, or for information down to the read level.

Machine mining of Beacon resources is prohibited. However, the SRA toolkit allows access to dbGaP with security tokens, which allows automated querying of SRA-related material with local caching.
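
A hedged sketch of what scripted, cached retrieval of dbGaP-authorised runs might look like, assuming the SRA toolkit is installed and the project’s .ngc repository key has already been imported with vdb-config; the accessions are placeholders and exact flags vary between toolkit versions, so check the local documentation:

# Assumes the SRA toolkit is on PATH and dbGaP credentials are configured.
import subprocess

accessions = ["SRR0000001", "SRR0000002"]  # placeholder run accessions

for acc in accessions:
    # prefetch caches the downloaded run locally for later dumping/analysis
    subprocess.run(["prefetch", acc], check=True)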

1.3    Daniel Geraghty, FHCRC, Seattle “Complete re-sequencing of extended genomic regions using fosmid target capture and single molecule, real time (SMRT) long-read sequencing technology”

This talk introduced a fosmid enrichment strategy followed by SMRT sequencing for characterizing complex genomic regions.

The premise was set up by suggesting that GWAS leaves rare variants undetected. Fosmid based recloning of HLA has been demonstrated.

Critical paper: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1199539/

The steps involved are building a fosmid library, which is then plated out. Molecular inversion probes are used to identify fosmids from the region of interest. Single clones are then extracted and sequenced extensively. This obviously means you need a fosmid library for each individual you are looking at; it is not a hybridization extraction method like using BACs as baits for large regions.

Sequencing is done on PacBio, both for speed (faster than a MiSeq) and for read length. At this point the data can be assembled with Velvet, or even by the venerable Phrap/Consed approaches. About 40-100 PacBio reads are required to assemble a fosmid clone.

Quiver can be used to produce a consensus sequence, and once a fosmid has been assembled, it can be coassembled with other fosmids that have been similarly reconstructed to get regions of 800kb.
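
As a naive illustration of the coassembly idea only (the pipeline described in the talk relied on Velvet/Phrap/Quiver), merging fosmid consensus sequences on suffix/prefix overlaps can be sketched like this:

# Toy greedy merge of two consensus sequences on their longest overlap.
def merge_if_overlap(a, b, min_overlap=20):
    """Return a+b merged on the longest suffix(a)/prefix(b) overlap, or None."""
    for k in range(min(len(a), len(b)), min_overlap - 1, -1):
        if a[-k:] == b[:k]:
            return a + b[k:]
    return None

f1 = "ACGT" * 10 + "GATTACA" * 5
f2 = "GATTACA" * 5 + "TTTT" * 10
merged = merge_if_overlap(f1, f2)
print(len(f1), len(f2), len(merged) if merged else "no overlap")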

The question was raised whether it might be possible to bypass the fosmid step with other recombineering approaches to work directly with gDNA and MIPS.

1.4    Peter Byers, UWASH, Seattle: “Determinants of splice site mutation outcomes and comparison of characterisation in cultured cells with predictive programs”

Peter talked about the prediction of splice mutation effects, with particular reference to the collagen genes. 20% of collagen mutations are splice site mutations (these genes have lots of exons). This is pathogenic in a spread of osteogenesis imperfecta (OI) disorders. It is complex because we not only have to consider the effects on splice donor and splice acceptor sites but also the effects on lariat sequences within introns.

Consequently there are a number of downstream effects: the production of cryptic splice sites, intron retention and exon skipping (which tends to lead to more severe phenotypes). But this is made more complex again by the fact that a single variant can have multiple outcomes, and there is no clear explanation for this.

This complexity means that it is hard to produce a computational prediction program that takes into account all the uncertainties of the system, especially at locations 3, 4 or 5 bases outside the splice site.
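
To make concrete the kind of score such prediction programs compute, here is a minimal position weight matrix sketch for a nine-base donor site; the frequencies are invented for illustration and are not those used by SplicePort or Asseda:

# Score a 9-mer 5' (donor) splice site against a toy PWM, for the reference
# and mutant alleles. Frequencies below are made up.
import math

donor_pwm = [  # positions -3..+6 around the exon/intron boundary
    {"A": 0.35, "C": 0.35, "G": 0.20, "T": 0.10},  # -3
    {"A": 0.60, "C": 0.10, "G": 0.10, "T": 0.20},  # -2
    {"A": 0.10, "C": 0.05, "G": 0.80, "T": 0.05},  # -1
    {"A": 0.01, "C": 0.01, "G": 0.97, "T": 0.01},  # +1 (G of the GT dinucleotide)
    {"A": 0.01, "C": 0.01, "G": 0.01, "T": 0.97},  # +2 (T of the GT dinucleotide)
    {"A": 0.60, "C": 0.10, "G": 0.15, "T": 0.15},  # +3
    {"A": 0.70, "C": 0.08, "G": 0.12, "T": 0.10},  # +4
    {"A": 0.10, "C": 0.10, "G": 0.70, "T": 0.10},  # +5
    {"A": 0.20, "C": 0.20, "G": 0.20, "T": 0.40},  # +6
]

def donor_score(site):
    """Log-odds style score of a 9-base donor site against the PWM."""
    return sum(math.log2(donor_pwm[i][base] / 0.25) for i, base in enumerate(site))

ref_site = "CAGGTAAGT"   # canonical-looking donor
mut_site = "CAGGCAAGT"   # +2 T>C, disrupting the GT dinucleotide
print(f"ref {donor_score(ref_site):.1f} vs mut {donor_score(mut_site):.1f}")

The point of the talk, of course, is that a score like this says nothing about intron removal order or speed, which is where the predictions break down.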

SplicePort and Asseda were tested, and Asseda came out on top in the tests, with a mere 29% of events wrongly predicted when compared with experimental evidence. So what is happening to make these predictions incorrect?

Peter explained that the order of intron removal in a gene is specific to that gene but shared between individuals; there is no global model for what that order might be, although it must be encoded in some way by the intronic sequence. The speed of intron removal and the effects on the mature mRNA are incredibly important to the pathogenesis of the disease. It was clearly shown that the splicing events under study were dictated by the speed of intron removal as the RNA matured.

If you want to predict the splicing effect of a mutation, you therefore need some information about the order of intron processing in the gene you are looking at to have a completely holistic view of the system. How do you generate this information systematically? It is a very labour-intensive piece of work, and Peter was looking for suggestions on how best to mine RNA-Seq data to get to the bottom of this line of enquiry. Is it even possible to do homology-based predictions of splicing speed and therefore splicing order?
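
One very rough way to start mining RNA-Seq for this, offered as a sketch rather than a recommendation: estimate a per-intron retention ratio from read counts and rank introns by it, on the assumption that slower-spliced introns retain more reads. The counts below are invented; a real analysis would work from aligned reads, counting intronic versus exon-junction-spanning reads per intron:

intron_counts = {
    # intron_id: (reads fully inside the intron, reads spanning the flanking exon junction)
    "COL1A1_intron_8":  (120, 900),
    "COL1A1_intron_22": (15, 1100),
    "COL1A1_intron_45": (60, 850),
}

retention = {
    name: intronic / (intronic + spliced)
    for name, (intronic, spliced) in intron_counts.items()
}

# Rank from most- to least-retained as a crude proxy for removal order
for name, ratio in sorted(retention.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: retention ratio {ratio:.2f}")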