Category Archives: Science

OGT NGS Team seeking a placement student

Edit: We found ourselves a student, and we’ve welcomed Agatha Treveil to the team!

Every year we take on a placement/sandwich student from a UK University and have them work within the NGS Team in the Computational Biology group at Oxford Gene Technology.

This year is no different, and we’re looking for someone to start somewhere in the July/August 2014 window to work with us for a year.

Our previous students have come from a number of institutions, and one has not only won a Cogent Life Sciences skills award for “Placement of the Year”, but has also joined the NGS Team after graduating. Traditionally we’ve recruited from Natural Sciences, Biomedical Sciences and Biochemistry undergraduate courses – no previous bioinformatics or programming skills are required, just a willingness to deal with lots of DNA and RNA-Seq data in a commercial environment. The post involves exome sequencing, targeted resequencing and RNA-Seq analysis (whole transcriptome, small RNA) and would be coming in at an exciting point as we start to broaden our NGS services.

This is very much a ‘hands on’ position, and we specialise in taking in biologists and having them leave as bioinformaticians.

Here’s the full text of the advert:

Computational Biology (salary approx. £14K p.a.)
We are looking for a candidate who has (or would like to develop) programming skills and an interest in biology; is keen to learn new techniques, gain experience in a biotechnology company, and who has a particular interest in the interface of numerical methods and life sciences.

The placement will potentially involve projects in all of our scientific areas, but will begin with a focus on Next Generation Sequencing analysis where the student will learn pipeline development and data analysis. The student will gain experience working with clinical and academic scientists, and an understanding of the data processing aspects of the most exciting experimental technologies currently being applied in biological and clinical research and practice.

To apply for this position, please send your CV and covering letter (clearly stating that you wish to apply for the industrial placement and which position you are applying for) by email, or by post to:
HR, Oxford Gene Technology, Begbroke Science Park, Begbroke Hill, Woodstock Road, Begbroke, Oxfordshire, OX5 1PF

Closing date for entries 30 April 2014
For further information about OGT, please visit the OGT website.

Short-read alignment on the Raspberry Pi

This week I invested a little bit of spare cash in a Raspberry Pi.  Now that there’s no waiting time for these, I bought mine from Farnell’s Element 14 site, complete with a case, a copy of Raspbian on SD card and a USB power supply.  Total cost: about 50 quid.

First impressions are that it is a great little piece of hardware. I’ve always considered playing with an Arduino, but the Pi fits nicely into my existing skill set.  It did get connected to the TV briefly, just to watch a tiny machine driving a 37″ flatscreen TV via HDMI.  I’m sure that’s just great, if your sofa isn’t quite as far away from the TV as mine is. So, with sshd enabled, the Pi is currently sat on the mantelpiece, lights blinking, running headless.

The first thing that occurred to me was to do some benchmarking.  What I was interested in was the capacity of the machine to do real-world work.  I’m an NGS bioinformatician, so the obvious thing to do was to throw some data at it through some short-read aligners.

I’m used to human exome data, or RNA-Seq data that generally encompasses quite a few HiSeq lanes, and used to processing them in large enough amounts that I need a few servers to do it.  I did wonder, however, whether the Pi might have enough grunt for smaller tasks, such as small gene panels or bacterial genomes.  Primarily this is because I’ve got a new project at work which uses in-solution hybridisation and sequencing to identify pathogens in clinical samples, and it occurred to me that the computing requirements probably aren’t the same as what I’m used to.

The first thing I did was to generate some data from an E. coli genome with wgsim, to test out paired-end alignment on 100bp reads.

Initially I thought I would try to get Bowtie2 working, on the grounds that I wasn’t really intending to do anything other than read mapping, and I am under the impression it’s still faster than BWA (BWA does tend to be my go-to aligner for mammalian data).  However, I quickly ran into the fact that there is no armhf build of bowtie2 in the Raspbian repository.  Code downloaded, I struggled to get it to compile from source, and in the middle of setting up a cross-compiling environment so I could do the compilation on my much more powerful EeePC 1000HE(!), it occurred to me that someone might have been foolish enough to try this before.  And they had.  The fact is that bowtie2 requires a CPU with the SSE instruction set – i.e. Intel.  So whilst it might work on the Atom CPU in the EeePC, it’s a complete non-starter on the ARM chip in the Pi.

Bowtie1, however, is in the Raspbian repository.  After seeing it work through the 1000-read dataset that ships with bowtie with some speed, I generated 1×10^6 reads as a test dataset.  This took 55 minutes to align.
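As a back-of-envelope sanity check on the size of that simulated dataset (a sketch: the ~4.6 Mb genome length is my assumption, as is reading the 1×10^6 figure as a count of individual reads rather than pairs):

```python
# Rough fold-coverage estimate for the simulated E. coli dataset.
GENOME_LENGTH = 4_600_000  # bp -- assumed, roughly an E. coli K-12 genome
READ_LENGTH = 100          # bp, as used for the wgsim reads
N_READS = 1_000_000        # 1x10^6 reads

coverage = N_READS * READ_LENGTH / GENOME_LENGTH
print(f"~{coverage:.0f}x coverage")  # ~22x
```

So the Pi was chewing through a genuinely useful depth of bacterial coverage, not a toy dataset.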

I then picked out a real-world E.coli dataset from the CLC Bio website.  Generated on the GAIIx, these are 36bp PE reads, around 2.6×10^6 of them.

BWA 0.6.2 is also available from the Raspbian repos (which, I notice, is more up to date than the version in the Xubuntu distro – probably because Raspbian tracks the current Debian ‘testing’ release, Wheezy).

So I did a full paired end alignment of this real world data, making sure both output to SAM.  I quickly ran out of space on my 4GB SD card, so all data was written out to an 8GB attached USB thumb drive.

Bowtie1 took just over an hour to align this data (note that the reads and the reference genome are from completely different E. coli strains):

Time loading reference: 00:00:00
Time loading forward index: 00:00:00
Time loading mirror index: 00:00:00
Seeded quality full-index search: 01:01:31
# reads processed: 2622382
# reads with at least one reported alignment: 1632341 (62.25%)
# reads that failed to align: 990041 (37.75%)
Reported 1632341 paired-end alignments to 1 output stream(s)
Time searching: 01:01:32
Overall time: 01:01:32

I was a little surprised that BWA actually managed to do this a little faster (note that both aligners were run with default options).  I only captured the start and end times for BWA.

Align start: Sat Jan 26 22:36:06 GMT 2013
Align end: Sat Jan 26 23:29:31 GMT 2013

Which brings the total alignment time for BWA to 53 minutes and 25 seconds.
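The quoted figure checks out against the timestamps above:

```python
from datetime import datetime

# Recompute the BWA wall-clock time from the captured start/end stamps.
fmt = "%a %b %d %H:%M:%S %Y"
start = datetime.strptime("Sat Jan 26 22:36:06 2013", fmt)
end = datetime.strptime("Sat Jan 26 23:29:31 2013", fmt)
elapsed = end - start
print(elapsed)  # 0:53:25
```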

Anyway, it was just a little play to see how things stacked up.  I think it’s fantastic that a little machine like the Pi has enough power to do anything like this.  It’s probably more of a comment on the fact that the people behind the aligners have managed to write such efficient code that this can be done without exceeding the 512MB of RAM.  Bowtie’s memory usage did appear lower than BWA’s while the tests were running, though.

I always thought that the ‘missing aspect’ of DIYbio was getting people involved with bioinformatics; instead, the community seemed desperate to follow overly ambitious plans to get involved in synthetic biology.  It seemed to me that DIYbio should sit in the same amateur space that amateur astronomy does (i.e. within the limitations of equipment you can buy without having to equip a laboratory).  And for a low-cost entry into Linux, with enough grunt to play with NGS tools and publicly available data, it’s hard to fault the very compact Raspberry Pi. Now I just need to see exactly where the performance limits are!

YouTube Exome sequencing channel

More of a test than anything else, I’m accumulating ngs/hts/deep sequencing videos on YouTube. If you’re interested in exome sequencing these sessions from GenomeTV aka the NHGRI are worth a watch.


To see the rest of the videos in the playlist, click to see them in situ on YouTube.

Position open at the Newcastle University Bioinformatics Support Unit

I couldn’t help but notice in my GenomeWeb email deluge this morning that a position has opened up at the Bioinformatics Support Unit in Newcastle.  I didn’t really make an announcement in my blog, but this is partly because I no longer work there! I now work for Oxford Gene Technology which I may find some time to write about soon.

Simon Cockell (blog, twitter), who worked with me throughout my time at Newcastle was appointed manager of the BSU when I left, and consequently his post is now available.

The BSU engages in a lot of diverse bioinformatics work and it’s an ideal post for someone who values supporting researchers in their quest to leverage high throughput techniques for biological studies.  Job posting is below.


Newcastle University

Job Location: Newcastle upon Tyne

Salary: £27,428 – £35,788

Job Description:

The Bioinformatics Support Unit is a successful cross-Faculty service providing high quality scientific support for a range of bioinformatics projects.

You will have experience of a range of Bioinformatics techniques, to work in the Unit on the development and delivery of scientific projects and liaison with relevant academics.


You should have at least a first degree in a relevant science related subject and preferably a PhD. You will have previous experience in bioinformatics support and an understanding of UK research funding procedures.

Contact Information:

For an informal discussion on this opportunity, please contact either Dr Simon Cockell (Senior Experimental Scientific Officer), 0191 222 7253, or Professor Neil Wipat (academic lead), 0191 222 8213.


This week we had some welcome news (by we I mean Simon Cockell, Phil Lord and others). A proposal we had submitted to JISC has been funded. This is the first research funding I have received after significant input in the bid process, as opposed to being included as a co-I for specific bioinformatics expertise. As such it’s a bit of a departure for me, but something I’m very much looking forward to spending time on over the next year.

The elevator pitch goes something like this:

“The project extends existing blogging tools for use as a lightweight, semantically linked publication environment. This enables researchers to create a hub in the linked-data environment, that we call knowledge blogs or k-blogs. K-blogs are convenient and straightforward for authors to use, integrating into researchers’ existing work practices and tools. They provide readers with distributed feedback and commenting mechanisms. We will support three communities (microarray, public health and workflow), providing immediate benefit, in addition to the long-term benefit of the platform as a whole. Additionally, this will enable a user-centric development approach, while showcasing the platform as the basis for next-generation research publishing.”

If this sounds like the kind of thing you’re interested in, Phil has made the full grant application that we submitted available online. We would of course be interested in any comments or feedback. The proposal includes some technical details of what we hope to achieve, but I think that Ontogenesis has already gone some way to proving the worth of the system. It’s going to be great to provide additional tooling to support the process, and to cement some of the inherent social contract with a proper workflow for publishing and review.

The project starts almost immediately; stay tuned for updates.

GeneSpring GX 11, or how to make the simple – undocumentable…

I divide my time about equally between GeneSpring GX 11 from Agilent and BioConductor (free, from awesome people) for microarray analysis.  The latter for all the neat tools that GeneSpring doesn’t have, the former because sometimes it is nice to lead a researcher visually through their data, without having to type into a green on black terminal window.

GeneSpring GX 11 is the third iteration since Agilent bought up Silicon Genetics, then decided to throw the unwieldy, quirky, but very functional GeneSpring product in the trash and start again with something built on Strand Life Sciences’ AVADIS platform.  We’ve been through versions 9 and 10, and now we’re on 11.  There have been plenty of bugs along the way, the most serious (to me) being the one where GeneSpring 10 managed to miscall the quality flags on Illumina data in 50% of cases.  Not good, but at least fixed.

Many people have been griping on mailing lists about functionality missing in the new GeneSpring that existed in the old version.  I always think it’s a matter of familiarity with the software really.  I hadn’t really come across anything I couldn’t do in GeneSpring 11 that I could do in GeneSpring 7.  That was until yesterday.

I sat down with a customer yesterday to look at some microbial Nimblegen data.  GeneSpring doesn’t really deal with Nimblegen data very well, you are left with the choice of not analysing it in GeneSpring, or accepting there’s going to be a bit of fudging and some extra annotation steps in order to make the data useable as a ‘Custom technology’.  The customer, quite reasonably, asked if we could get the biological genome information (effectively gene annotations that are independent of the chip technology you’re using) loaded into GeneSpring.  And thus started a morning of fun and games.

GeneSpring 11 has a very handy import feature for biological genomes under Annotations > Create Biological Genome.  That is, providing you want to choose one of their predefined organisms to download the information from NCBI.  There is *NO* route in the software to add another organism to this list, or to do anything other than use one of their checkbox-limited organisms.  This is not a bug, apparently, because in a separate part of the software (dealing with Pathways for an organism) you can pull this information directly from NCBI using the Taxon ID of the organism you’re interested in.  So why can’t you use it to download a biological genome?  Who knows…

One of the things I really liked about the old GeneSpring was the fact that it came with a manual a foot thick.  It told you how to do every single operation in the UI; it didn’t tell you anything about the order in which to apply them, but you could generally rely on it for an answer.  There was no such answer to this issue in the GeneSpring manual…

It transpires that if you really want to do this, the following, slightly insane process needs to take place:

1) Take this snippet of XMLishness:

<hexff version="1.0">
 <key>Homo sapiens</key><string>9606</string>
 <key>Mus musculus</key><string>10090</string>
 <key>Rattus norvegicus</key><string>10116</string>
 <key>Anopheles gambiae</key><string>7165</string>
 <key>Arabidopsis thaliana</key><string>3702</string>
 <key>Bacillus subtilis</key><string>1423</string>
 <key>Bos taurus</key><string>9913</string>
 <key>Caenorhabditis elegans</key><string>6239</string>
 <key>Canis lupus familiaris</key><string>9615</string>
 <key>Citrus sinensis</key><string>2711</string>
 <key>Danio rerio</key><string>7955</string>
 <key>Drosophila melanogaster</key><string>7227</string>
 <key>Equus caballus</key><string>9796</string>
 <key>Escherichia coli</key><string>562</string>
 <key>Felis catus</key><string>9685</string>
 <key>Gallus gallus</key><string>9031</string>
 <key>Glycine max</key><string>3847</string>
 <key>Gossypium hirsutum</key><string>3635</string>
 <key>Hordeum vulgare</key><string>4513</string>
 <key>Macaca mulatta</key><string>9544</string>
 <key>Magnaporthe grisea</key><string>148305</string>
 <key>Medicago sativa</key><string>3879</string>
 <key>Medicago truncatula</key><string>3880</string>
 <key>Nicotiana tabacum</key><string>4097</string>
 <key>Oryctolagus cuniculus</key><string>9986</string>
 <key>Oryza sativa</key><string>4530</string>
 <key>Ovis aries</key><string>9940</string>
 <key>Pan troglodytes</key><string>9598</string>
 <key>Plasmodium falciparum</key><string>5833</string>
 <key>Pongo abelii</key><string>9601</string>
 <key>Poplar mosaic virus</key><string>12166</string>
 <key>Populus sp.</key><string>3697</string>
 <key>Pseudomonas aeruginosa</key><string>287</string>
 <key>Saccharomyces cerevisiae</key><string>4932</string>
 <key>Saccharum officinarum</key><string>4547</string>
 <key>Salmo salar</key><string>8030</string>
 <key>Schizosaccharomyces pombe</key><string>4896</string>
 <key>Staphylococcus aureus</key><string>1280</string>
 <key>Sus scrofa</key><string>9823</string>
 <key>Takifugu rubripes</key><string>31033</string>
 <key>Lycopersicon esculentum</key><string>4081</string>
 <key>Triticum aestivum</key><string>4565</string>
 <key>Vitis vinifera</key><string>29760</string>
 <key>Xenopus laevis</key><string>8355</string>
 <key>Xenopus tropicalis</key><string>8364</string>
 <key>Zea mays</key><string>4577</string>
</hexff>

2) Add an entry

<key>Your organism name</key><string>NCBI Taxon ID</string>

after the Zea mays line

3) In your GeneSpring directory under this tree:

GeneSpring GX11\bin\packages\marray\project\2.1

Create a folder called ‘plugins’ and save the edited XML above as a file called TaxID.plg

4) Restart GeneSpring and proceed to update your newly added Biological genome, which now appears in the list!
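Step 2 can be scripted rather than done by hand. A minimal sketch, assuming the file layout reconstructed above (one <key>/<string> pair per line, new entries going after Zea mays); the organism name and taxon ID below are placeholders, not real GeneSpring entries:

```python
def add_organism(plg_text: str, name: str, taxon_id: int) -> str:
    """Insert a <key>name</key><string>taxon_id</string> entry
    into the TaxID.plg text, directly after the Zea mays line."""
    entry = f" <key>{name}</key><string>{taxon_id}</string>"
    lines = plg_text.splitlines()
    for i, line in enumerate(lines):
        if "Zea mays" in line:
            lines.insert(i + 1, entry)
            break
    return "\n".join(lines)
```

Run it over the file, save the result as TaxID.plg in the plugins folder, and restart as in step 4.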

Actually, I have to say, I’m not sure I ever want to see that in the manual of a piece of software as expensive as GeneSpring…  And besides, this still doesn’t work for me as advertised, because GeneSpring, whilst aware of what an HTTP proxy might conceivably be, has no concept of what an FTP proxy might be – which is problematic when you need to connect to NCBI over FTP.  Brilliant!

Sense about Science

I don’t know what vexes me more at the moment: the BCA (who I will not deign to link to) suing Simon Singh for libel, or the numpties in my beloved country who voted in two BNP MEPs.

I am however a supporter of the “Sense about Science” campaign.  Any organisation dedicated to fighting crackpots, frauds and pseudoscientists is going to find favour with me.

I do urge everyone to go and sign the ‘Keep Libel Laws out of Science’ petition here. Have a read, have a think. Is this the way we want things to continue?


The IET BioSysBio Conference 2009

So this year, after eschewing the notion of going to ISMB we had been looking for a good, UK based conference with enough of an informatics slant to keep us interested, but with a bit more ‘real science’ than the continual stream of talks on algorithmic refinement that ISMB has become.

BioSysBio had been recommended, and not least because we had a Newcastle student member on the organising committee.

BioSysBio is a relatively new conference run by the Institution of Engineering and Technology.  I admit I had been largely ignorant of the IET, probably because I’m not an engineer of any sort; it’s a professional body for the engineering community.  BioSysBio, being a conference largely focused on systems and synthetic biology, actually falls quite well under the engineering remit – not just for the engineering approaches that are applied to synthetic biology, but also because sometimes, just sometimes, this kind of work needs giant, room-filling robots.

The meeting this year was held in Cambridge in the UK, one of our most venerable university towns, giving the event a feeling of academic history despite it being about possibly the most cutting-edge of biological sciences.  Over three days a group of probably 250 people gathered in the Music Department to discuss, listen, type, debate and network.  A word to other conference organisers: holding the conference in a Music Department is a great idea when you want a venue designed for performance – the acoustics were great for talks.  The lack of power sockets, however, betrayed the fact that concert-goers don’t normally need to plug in a laptop and work whilst they listen.  A reminder to us all that it’s all very well packing our power adaptors when we go to a conference, but a four-way block/gang plug would make you a popular person indeed in situations like this.

I don’t want to run over all of the sessions, you can find the programme for the three days here.  I do however just want to rave about how good a conference this is, and share a few highlights.

At the moment BioSysBio is a great size, and is run as a single-track conference with unconference-style breakouts given their own slot.  A two-hour slot on the second afternoon was given over to a number of simultaneously run workshops (I plumped for the ONDEX tutorial, because I work with some people who are on the ONDEX project – but still had no idea how to use the software).

Interestingly there was a really good balance between the systems biology and the synthetic biology sides of the track.  I was more impressed by the lack of laptops being sported during the talks, it meant that actual, real biologists were present mixing it up with the informatics and mathematics geeks.  This also showed through in the poster sessions.  It’s a refreshing change for a bioinformatician to see the science that has resulted from informatics based approaches, rather than having the implementation of the approaches being the primary concern.

My talk highlights came from a number of people and areas.  I found Julian Huppert’s talk on “Four-stranded DNA: How G-Quadruplexes control transcription and translation” fascinating.  Here was a whole area of transcriptional control I had never even come across, mediated by DNA structures I had never heard of.  That’s one of those nice eye-opening moments at a conference, when the complexity of the systems you study is laid bare in front of you by something new and exciting.  On the second day we had a talk entitled “Effect of pauses on transcriptional noise” from Andre Ribeiro, following on from a talk on RNAP by Marko Djordjevic the previous day.  The idea of patterns in transcriptional noise, caused by kinetic and spatial restraints, really caught my imagination.  I really enjoyed Catherine Lloyd’s talk on CellML – we’re interested in both SBML and CellML at Newcastle, and it’s always nice to see speakers from both projects on the same billing.  The panel discussion on security and synthetic biology (touching on DIYbio as well) was extremely enlightening, and not just because of the fantastic talk by Drew Endy, who I had the pleasure of seeing speak twice at the meeting.  On the final day there was much to choose from, from Christina Smolke talking about building circuits with bits of RNA to Piers Millet from the UN talking about why we should care about securing synthetic biology.

This brings me nicely onto online coverage of the conference.  If you want to read blog posts about any of the talks, there’s just one place to go, and that is Allyson Lister’s blog.  Ally, who we shall now refer to as the ‘Robot Blogger’, managed to have coverage of each talk online within minutes of it finishing, whilst apparently already typing up notes on the talk that followed.  When Allyson spoke about her work on SAINT, Simon Cockell of FuzzierLogic picked up the blogging slack.

The conference organisers were fully Web 2.0 enabled.  We had a pre-existing room on FriendFeed that got heavily trailed on the first day, as well as a small contingent of Twitter-enabled users contributing comments and commentary during the talks, all aggregated under the hashtag #biosysbio.  I think everyone agreed that whilst this is never the best way to present information from a conference (a threaded FriendFeed discussion with some agreement between participants is probably best), those of us who were twittering away were at least being thanked for it by others who could not attend.  That certainly kept me interested in how my wifi connection was doing throughout the day.  We also had the strange effect of some of our colleagues and former colleagues attending “The Influence and Impact of Web 2.0 on e-Research Infrastructure, Applications and Users” at NeSC, who were all using the hashtag #research3, and we even had some microblogging cross-talk going with delegates there.  That was a new one for me, certainly.

Despite not everyone on my Twitter subscription lists being a scientist, let alone ones interested in synthetic or systems biology, and despite posting hundreds of very geeky tweets, not a single person unsubscribed from my posting frenzy.  Either no-one is paying attention or people were happy to filter the noise for a couple of days, which I thought was great.

To sum up, it’s a great conference.  It will be great to go to it again next year in Oxford too.  I can’t remember the last time I attended every conference talk, even the 8.30am ones.  I can’t remember the last time I went to a conference and wasn’t tempted to have a post-lunch snooze in the lecture theatre afterwards.  I can’t remember coming away from a conference so fired up and so keen to get back to work and start thinking harder about this year’s iGEM competition.  Actually, the last conference I felt anything like this much fervour and excitement about was the excellent (but never repeated) O’Reilly Bioinformatics Technology Conference in Tucson in 2002.

Next year, if you’re in the field, or even vaguely interested in the field, beat a path to this conference in Oxford.  Beg, borrow or steal the cash.   I will see you there.

Darwin Day – Direct links to the “In Our Time” Darwin podcasts

I’m terrible at keeping up with my podcasts, and only yesterday got around to part one of the four-part “In Our Time” Darwin podcasts from Radio 4.  Consequently, when I went to get parts two, three and four, iTunes had no knowledge of them any more. I tried to find links from the IOT RSS feeds – but all the links to the audio had been removed.

This morning, poking around on the ‘Listen Again’ part of the site, I found the embedded Flash players that would allow me to listen to the podcasts in my browser.

I wanted them to listen to on the way to work however!

After a bit of digging around through the HTML and the XML playlists behind it, I managed to find downloadable, transferable mp3 content.  Linked here for your listening pleasure:
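For anyone repeating the exercise, the digging boils down to scraping mp3 links out of the playlist documents. A sketch; the playlist structure here is hypothetical and the example URL is made up, but a plain regex over the raw text is enough when all you want is the audio links:

```python
import re

def find_mp3_urls(text: str) -> list[str]:
    """Return every http(s) URL ending in .mp3 found in the text."""
    return re.findall(r'https?://\S+?\.mp3', text)
```

Point it at the fetched playlist XML (or even the raw page HTML) and you get a list of URLs ready for downloading.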

Programme 1

Melvyn tells the story of Darwin’s early life in Shropshire and discusses the significance of the three years he spent at Cambridge, where his interests shifted from religion to natural science.

Programme 2

Darwin’s expedition aboard the Beagle in December 1831 and how his work during the voyage influenced and provided evidence for his theories.

Programme 3

How Darwin was eventually persuaded to publish On the Origin of Species in November 1859 and the book’s impact on fellow scientists and the general public.

Programme 4

Melvyn visits Darwin’s home at Down House in Kent. Despite ill health and the demands of his family, Darwin continued researching and publishing until his death in April 1882.

Happy timeshifting Darwin Day!

2009 – a real year of celebration in science

I doubt there’s a biologist alive who hasn’t realised that 2009 marks a significant bicentennial – the birth of Charles Darwin, a man whose legacy is one of the most profound of any scientist who has ever lived.   Conveniently, it is also the 150th anniversary of the book he is most famous for, “On the Origin of Species”.

A good hub for information on the celebrations would be the Natural History Museum’s Darwin200 site.

Leafing through Chris Miller’s blog today, I was glad to find out that I’m not the only scientist who has never actually read this book.   I downloaded the text from Project Gutenberg some years ago and stuck it on my iPod (in the days before I carried a smartphone) with the intention of reading it.   It was too long for the iPod reader, so I never got around to it.

In the absence of any formalised New Years Resolutions I promise to go out and find a nice hardbound copy to grace my bookshelf.  And read it too.

However, it’s not the only scientific anniversary being celebrated.  For those people who are more interested in staring at the sky than staring at living organisms 2009 is also the International Year of Astronomy.

Again, there’s a fantastic jumping-off point from the IAU and UNESCO at astronomy2009.  But why is this being celebrated?  In this case it is the 400th anniversary of the first use of an astronomical telescope by Galileo Galilei, whose legacy is at least as awe-inspiring as Darwin’s.

With an amateur telescope setup, with a CCD camera/webcam capable of producing pictures rivalling those of a 200″ telescope in the 1950s, I always feel it’s a shame that more people don’t stare in awe at the sky.  I particularly liked this story on the Physics World site, about how a group of people are going to build a replica of Galileo’s telescope and image through it, to show what the man himself might have been capable of resolving.

I find it interesting that these are both thinkers who proposed theories that went against the prevailing religious orthodoxy – and in Darwin’s case, one that some people of a more closed-minded religious persuasion still fail to accept.   Maybe all Darwin needs is another 250 years?

Whilst I’m making ad hoc resolutions – I will also use this year to interact more with my local astronomical society, a group of people I am in frequent email contact with, but whose meetings I have yet to pitch up to and join.

I’m proud to be a scientist, and I’m proud of the wonderful achievements science has made, so it will be nice to use these two excellent celebrations to push my own knowledge forward a little more.  And not just focused around the computerised science I spend my time on.