Category Archives: Technology

Short-read alignment on the Raspberry Pi

This week I invested a little bit of spare cash in the Raspberry Pi.  Now that there’s no waiting time for these, I bought mine from Farnell’s Element 14 site, complete with a case, a copy of Raspbian on SD card and a USB power supply.  Total costs, about 50 quid.

First impressions are that it is a great little piece of hardware. I’ve always considered playing with an Arduino, but the Pi fits nicely into my existing skill set.  It did get connected to the TV briefly just to watch a tiny machine driving a 37″ flatscreen TV via HDMI.  I’m sure it’s just great, if your sofa isn’t quite as far away from the TV as mine is. So with sshd enabled on the Pi it is currently sat on the mantlepiece, blinking lights flashing, running headless.

The first thing it occurred to me to do was some benchmarking.  What I was interested in is the capacity of the machine to do real-world work.  I’m an NGS bioinformatician so the obvious thing to do was to throw some data at it through some short-read aligners.

I’m used to human exome data, or RNA-Seq data that generally encompasses quite a few HiSeq lanes, and used to processing them in large enough amounts that I need a few servers to do it.  I did wonder however whether the Pi might have enough grunt for smaller tasks, such as small gene panels, or bacterial genomes.  Primarily this is because I’ve got a new project at work which uses in solution hybridisation and sequencing to identify pathogens in clinical samples, and it occurred to me that the computing requirements probably aren’t the same as what I’m used to.

The first thing I did was to take some data from wgsim generated from an E.coli genome to test out paired-end alignment on 100bp reads.

Initially I thought I would try to get Bowtie2 working, on the grounds that I wasn’t really intending to do anything other than read mapping and I am under the impression it’s still faster than BWA.  BWA does tend to be my go-to aligner for mammalian data.  However I quickly ran into the fact that there is no armhf build of bowtie2 in the Raspbian repository.  With the code downloaded I was struggling to get it to compile from source, and in the middle of setting up a cross-compiling environment so I could do the compilation on my much more powerful EeePC 1000HE(!) it occurred to me that someone might have been foolish enough to try this before.  And they had.  The fact is that bowtie2 requires a CPU with the SSE instruction set – i.e. an x86 chip from Intel or AMD.  So whilst it might work on the Atom CPU in the EeePC it’s a complete non-starter on the ARM chip in the Pi.

Bowtie1 however is in the Raspbian repository.  After seeing that it aligned the 1000-read example dataset that ships with bowtie at some speed, I generated 1×10^6 reads as a test dataset.  Aligning these took 55 minutes.
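A test like this could be set up along the following lines.  This is a sketch under assumptions, not a record of the exact commands used: the file names (ecoli.fa, sim_1.fq, sim_2.fq), the index prefix and the function name are all placeholders, the 1×10^6 figure is assumed to mean read pairs, and everything else is left at the tools' defaults.  It is wrapped in a function so nothing runs until it is called.

```shell
# Sketch only: simulate paired-end reads with wgsim and align them with bowtie (v1).
# ecoli.fa, the sim_* file names and the "ecoli" index prefix are placeholder names.
simulate_and_align() {
    ref=${1:-ecoli.fa}

    # 1,000,000 read pairs, 100bp at each end (assumed reading of "1x10^6 reads").
    # wgsim prints the mutations it introduced to stdout, which we discard here.
    wgsim -N 1000000 -1 100 -2 100 "$ref" sim_1.fq sim_2.fq > /dev/null

    # Build the bowtie index, then align the pairs and write SAM output (-S).
    bowtie-build "$ref" ecoli
    date    # crude start timestamp
    bowtie -S ecoli -1 sim_1.fq -2 sim_2.fq sim_bowtie.sam
    date    # crude end timestamp
}
```

Calling simulate_and_align with the path to a reference FASTA would then run the whole thing; on a Pi it is worth pointing the output at a drive with plenty of free space.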

I then picked out a real-world E.coli dataset from the CLC Bio website.  Generated on the GAIIx, these are 36bp PE reads, around 2.6×10^6 of them.

BWA 0.6.2 is also available from the Raspbian repos (which is more up to date than the version in the Xubuntu distro I notice, probably because Raspbian is tracking the current ‘testing’ release, Wheezy).

So I did a full paired end alignment of this real world data, making sure both output to SAM.  I quickly ran out of space on my 4GB SD card, so all data was written out to an 8GB attached USB thumb drive.

Bowtie1 took just over an hour to align this data (note reads and genome for alignment are from completely different E.coli strains)

Time loading reference: 00:00:00
Time loading forward index: 00:00:00
Time loading mirror index: 00:00:00
Seeded quality full-index search: 01:01:31
# reads processed: 2622382
# reads with at least one reported alignment: 1632341 (62.25%)
# reads that failed to align: 990041 (37.75%)
Reported 1632341 paired-end alignments to 1 output stream(s)
Time searching: 01:01:32
Overall time: 01:01:32
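
The percentages in that summary can be recomputed from the raw counts as a sanity check.  A minimal sketch follows; percent is a made-up helper name, and only the numbers come from the output above.

```shell
# Recompute bowtie's reported percentages from the raw counts above.
percent() {
    # $1 = part, $2 = total; prints the percentage to two decimal places
    awk -v part="$1" -v total="$2" 'BEGIN { printf "%.2f\n", 100 * part / total }'
}

percent 1632341 2622382   # aligned: prints 62.25
percent 990041 2622382    # failed to align: prints 37.75
```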

I was a little surprised that actually BWA managed to do this a little faster (please note aligners were run with default options).  I only captured the start and end of this process for BWA.

Align start: Sat Jan 26 22:36:06 GMT 2013
Align end: Sat Jan 26 23:29:31 GMT 2013

Which brings the total alignment time for BWA to 53 minutes and 25 seconds.
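
The arithmetic is easy to check, and the same helpers give a rough throughput figure for each aligner.  This is only a sketch: the function names are invented for illustration, it assumes both timestamps fall on the same day, and it takes bowtie's "reads processed" count at face value as the number of reads for both runs.

```shell
# Convert an HH:MM:SS clock time to seconds since midnight.
to_seconds() {
    echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

# Print the gap between two same-day clock times as "<minutes>m <seconds>s".
elapsed() {
    secs=$(( $(to_seconds "$2") - $(to_seconds "$1") ))
    echo "$(( secs / 60 ))m $(( secs % 60 ))s"
}

# Whole reads processed per second, given a read count and an HH:MM:SS duration.
reads_per_second() {
    echo $(( $1 / $(to_seconds "$2") ))
}

elapsed 22:36:06 23:29:31            # BWA wall time: prints 53m 25s
reads_per_second 2622382 00:53:25    # BWA: prints 818
reads_per_second 2622382 01:01:32    # bowtie: prints 710
```

So on these figures BWA got through roughly 818 reads a second against roughly 710 for Bowtie1, bearing in mind that the BWA time is wall-clock time between two date calls rather than the aligner's own report.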

Anyway it was just a little play to see how things stacked up.  I think it’s fantastic that a little machine like the Pi has enough power to do anything like this.  It’s probably more of a comment on the fact that the people behind the aligners have managed to write such efficient code that this can be done without exceeding the 512MB of RAM.  Bowtie’s memory usage did appear to be lower than BWA’s during these test runs, though.

I always thought that the ‘missing aspect’ of DIYbio was getting people involved with bioinformatics; instead the community seemed desperate to follow overly ambitious plans to get involved in synthetic biology.  And it seemed to me that DIYbio should sit in the same amateur space that amateur astronomy does (i.e. within the limitations of equipment that you can buy without having to equip a laboratory).  And for a low cost entry into Linux, with enough grunt to play with NGS tools and publicly available data, it’s hard to fault the very compact Raspberry Pi. Now I just need to see exactly where the performance limits are!

Quick fix for when mailman fails to archive your mails on Ubuntu

For the first time in many years I find myself the owner and maintainer of a set of Mailman mailing lists on behalf of Knowledge Blog this time.  Mailman is a great piece of free software, but even the packaged install under Ubuntu (10.04.1 LTS in this case) is not without a couple of gotchas.  In particular this installation was not archiving any of the email to our public mailing list (not so great for transparency!).

Despite seemingly having the correct setup on the list, on going to the archives page users were greeted with the uninformative, and definitely incorrect, error:

“No messages have been posted to this list yet, so the archives are currently empty.”

My inbox has evidence to the contrary.  A look at /var/lib/mailman/logs/error

shows this:

Nov 29 13:30:00 2010 (2426) Archive file access failure:
/var/lib/mailman/archives/private/knowledgeblog-discuss.mbox/knowledgeblog-discuss.mbox [Errno 13] Permission denied: '/var/lib/mailman/archives/private/knowledgeblog-discuss.mbox/knowledgeblog-discuss.mbox'
Nov 29 13:30:00 2010 (2426) Uncaught runner exception: [Errno 13] Permission denied: '/var/lib/mailman/archives/private/knowledgeblog-discuss.mbox/knowledgeblog-discuss.mbox'

Indeed the “private” directory in this case was owned by user ‘root’. By default on Ubuntu mailman runs as user ‘list’, so a quick:

sudo chown -R list /var/lib/mailman/archives/private/

worked a treat. In order to then generate the archive with the missed messages, all you need is a quick:

sudo /var/lib/mailman/bin/unshunt
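
To confirm that nothing under the archive tree is still owned by the wrong user, a check along these lines may help.  It is a sketch: not_owned_by is a made-up helper name, and the only assumption carried over from above is that mailman runs as user ‘list’.

```shell
# List anything under a directory that is NOT owned by the given user.
# No output means ownership is consistent throughout the tree.
not_owned_by() {
    # $1 = user name or numeric uid, $2 = directory to check
    find "$2" ! -user "$1"
}
```

Run from a root shell as not_owned_by list /var/lib/mailman/archives/private, or simply run the underlying command directly: sudo find /var/lib/mailman/archives/private ! -user list.  If it prints nothing, the chown has done its job.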

Fighting FileMaker Server 11 Advanced: Instant Web Publishing on OS X

One of the groups I support at work is rather wedded to FileMaker Server Advanced for hosting their databases.  I understand this, they use the databases to hold strain information for their yeast experiments, ordering information for the lab and some image data.  No-one needs to learn SQL to use it, and FileMaker Server Advanced allows them to serve their databases up across the campus to both FileMaker itself and also expose them on the web.

However, if you’ve ever had the misfortune of attempting to install FileMaker Server Advanced you will realise that it needs a lot of encouragement to actually *work*.  This particular reinstall was caused by the timely death of an ageing G5 Xserve.

First of all, a couple of hints that I’ve not seen anywhere else on the FileMaker forums to address common installation problems.  For the record I installed it on a desktop machine running OS X 10.6.4

  1. Often the admin console fails to start after installation.  This is dealt with in the install manual with an unhelpful suggestion to reboot.  A little legacy trouble was the issue here: old proxy settings were interfering with the ability of the Java Web Start admin console to communicate with the back end.
  2. If you’re still having issues, switching from the 64bit to 32bit Java (Applications>Utilities>Java Preferences) is recommended

The problem I’ve hit time and time again with FileMaker Server Advanced is getting the IWP (Instant Web Publishing) working.  When FileMaker installs, it requires you to be running Apache, and to have a certain set of firewall ports open.  However it will invariably fail to correctly test the Apache installation and accept it.  Whilst it gives you the option to come back later and do it, this will also never work.  If you poke around online there are many many solutions offered up to this problem, mainly to do with the httpd.conf file.  However the problem is a lot more straightforward, and must affect pretty much every single person who attempts to install this.

When FileMaker Server Advanced installs it creates a new system user called fmserver.  This is the identity under which all the database services run.  When it comes to setting up IWP, the ‘test’ it performs on the Apache installation is to attempt to install a couple of files to /Library/WebServer/Documents (this was exposed by perusing the Apache logs).  This will never work as the fmserver user does not have sufficient privileges to write *anything* to this directory as it is drwxrwxr-x and owned by root, with group admin.

The solution then is simple and obvious: you need to add the fmserver user to the admin group. I haven’t seen this solution anywhere else online and hope that somebody might find it useful.  After this procedure I uninstalled and re-installed FileMaker Server Advanced, and everything works as advertised now.  The short incantation is (as an admin user):

$ dscl localhost
cd /Local/Default/Groups
append admin GroupMembership fmserver
exit
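
To confirm the change took, the group can be read back with dscl . -read /Groups/admin GroupMembership, which should print a single line starting "GroupMembership:" followed by space-separated short names.  The helper below is a hypothetical convenience for checking that line, not part of dscl, and the membership line in the example is made up.

```shell
# Check whether a user appears in a "GroupMembership: a b c" line, as printed by
#   dscl . -read /Groups/admin GroupMembership
# in_group is a hypothetical helper name, not a dscl feature.
in_group() {
    # $1 = user short name, $2 = the GroupMembership line
    case " ${2#GroupMembership:} " in
        *" $1 "*) return 0 ;;
        *)        return 1 ;;
    esac
}

# Example with a made-up membership line:
in_group fmserver "GroupMembership: root localadmin fmserver" && echo "fmserver is in admin"
```

On the Mac itself the real line can be passed straight in: in_group fmserver "$(dscl . -read /Groups/admin GroupMembership)".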

SuperMondays – the oxymoron of face to face geek social networking

So this evening I went to my first SuperMondays event.  What is SuperMondays you ask?  Well it’s a social networking event for geeks in the North East.

One of the things I’ve always been vaguely jealous of is the number of these kinds of events that seem to exist in the USA – there’s a meetup for everything whether you’re interested in tech, science, hacking, or publishing.  People get together, talks are given, people interact over food or a coffee (or a beer if you’re lucky).

I used to go to 2600 and alt.ph.uk meetings back in my impressionable younger days, so outside of scientific conferences this is the first opportunity I’ve taken to sit in a room with a bunch of like minded people outside of my day to day work to chew the fat on tech for an awfully long time.  This month’s theme (for the meetings are most definitely monthly) was databases.  Now I can’t get terribly excited about databases per se – SQL is fugly, I prefer MySQL over PostgreSQL for ease of use rather than functionality and these days if I could do it in SQLite I probably would, but nevertheless there was a really nice series of three talks in this themed session.

Ross Cooney (SuperMondays organiser extraordinaire and @rosscooney on Twitter) gave a speedy history of the database world, and a quick reminder of the things I have already forgotten about databases after not doing a lot of db development recently (like what ACID stands for – no it’s not an HTML compliance test, or a drug (you crazy Berkeley hippies)) and introduced the other two speakers for the evening.

David Lavery followed next (@dlavery62) with a review of both SimpleDB from Amazon Web Services and Google BigTable, two cloud offerings for the post-RDBMS database world.  I particularly enjoyed the SimpleDB part of the talk; anything delivered via a RESTful interface (don’t bother trying to convince me it’s not really RESTful, I could not care less) looks like a good thing to me after trying to deal with the SOAP webservices world last year.

The final talk was of a far more academic slant with David Livingstone of Northumbria University who presented RAQUEL which is an open source implementation of some of the ideas in The Third Manifesto, which appears at first glance to be an ‘RDBMS done right’ according to modern relational theory (and not affected by legacy cruft from current popular SQL implementations).  Part middleware, part programming language, part educational tool I would like to have heard a little more about the implementation here.  We were treated to a lot of syntactical details (which had me in mind of a cross of SQL, Perl and R and therefore maybe not something you would want to necessarily spend all day doing), but they’ve only just released this to the world and are looking for people to engage and interact with their foray into OSS development.  It certainly generated the most questions from the gathered geeks!

After these a roadmap for the future SuperMondays was presented.  Although this was my first SuperMonday event, it was in fact their 12th.  It may have started in a (very nice!) restaurant in Newcastle a year ago around a table, but there were maybe 80 people in the theatre tonight which suggests it is going from strength to strength.  Newly incorporated as a Community Interest Company (saving buckets of paperwork over being a charitable organisation) the future for SuperMondays looks very bright indeed.  Very much looking forward to the next one!

Yeah, there’s no oxymoron of a face to face geek event, but if you only saw the tagline in your RSS reader maybe you read a little further because of it ;)  I should also say cheers to the Newcastle ARCSOC students who I had a couple of drinks with afterwards too (depriving myself of further SuperMondays sandwiches in the process), it was nice to see you all again!

You can also find SuperMondays on Twitter (@supermondays) and on Facebook too!

Things you never wish to see shown on your RAID LED display

84 DRIVE FAILURE BOX #1, BAY1
102 VOLUME #0 STATE INTERIM RECOVERY
102 VOLUME #1 STATE INTERIM RECOVERY
102 VOLUME #2 STATE INTERIM RECOVERY
102 VOLUME #3 STATE INTERIM RECOVERY
102 VOLUME #4 STATE INTERIM RECOVERY
102 VOLUME #5 STATE INTERIM RECOVERY
84 DRIVE FAILURE BOX #1, BAY2
84 DRIVE FAILURE BOX #1, BAY3
101 VOLUME #0 STATE FAILED
101 VOLUME #1 STATE FAILED
101 VOLUME #2 STATE FAILED
101 VOLUME #3 STATE FAILED
101 VOLUME #4 STATE FAILED
101 VOLUME #5 STATE FAILED
84 DRIVE FAILURE BOX #1, BAY4
84 DRIVE FAILURE BOX #1, BAY5
84 DRIVE FAILURE BOX #1, BAY6
84 DRIVE FAILURE BOX #1, BAY7
84 DRIVE FAILURE BOX #1, BAY8
84 DRIVE FAILURE BOX #1, BAY9
84 DRIVE FAILURE BOX #1, BAY10
84 DRIVE FAILURE BOX #1, BAY11
84 DRIVE FAILURE BOX #1, BAY12

Pretty ugly eh?  It’s the kind of error that brings me out in a cold sweat every time I get emails from our users.  Generally complaints that the databases are running slowly, or that files are disappearing from directories, that home directories are empty, reports that the filesystems have become read-only.

Of course when I go to look at the machine the display apparently tells me that an entire box of drives (we have 2 boxes with 12 drives in) has suddenly failed.  The RAID volumes can’t sustain such a loss of drives, hence we see INTERIM RECOVERY followed by STATE FAILED as more drives drop out of the array.

The weird thing, of course, is that there’s nothing wrong with the drives at all, they’re sat there blinking little green lights at me telling me they are just fine.

The unit is an HP StorageWorks Modular Smart Array 1000, and I have to doubt the Smart moniker in this case, as it is the single most unreliable piece of hardware we own, apart from perhaps the HP blades it is attached to.  Apple RAID units, Transtec RAID units, all the RAID5’d servers seem to pretty much be able to hold themselves together, but not this one.

Every time this happens we get an engineer called out, they plug a serial console into the unit, reset the error states on the drives and volumes, reboot the RAID and everything comes up smelling of roses.  However trying to get them to send an engineer out is an exercise in frustration.  It would of course be possible to effect this fix ourselves, given a laptop with a serial port, and one of HP’s magical and deeply proprietary 259992-001 console serial cables.  Do we have one with our kit?  No.  How much do they cost? About £120.  How much did we spend on the kit in the first place?  Well over £100,000.

I will never, ever buy or recommend the purchase of another bit of HP kit as long as I am in the position to do so.  Grr.

The IET BioSysBio Conference 2009

So this year, after eschewing the notion of going to ISMB we had been looking for a good, UK based conference with enough of an informatics slant to keep us interested, but with a bit more ‘real science’ than the continual stream of talks on algorithmic refinement that ISMB has become.

BioSysBio had been recommended, and not least because we had a Newcastle student member on the organising committee.

BioSysBio is a relatively new conference run by the Institution of Engineering and Technology.  I admit I had been largely ignorant of the IET, probably because I’m not an engineer of any sort, and it’s a professional body for the engineering community.  BioSysBio, being a conference largely focused on systems and synthetic biology, actually falls quite well under the engineering remit – not just for the engineering approaches that are applied to synthetic biology, but also because sometimes, just sometimes this kind of work needs giant, room-filling robots.

The meeting this year was held in Cambridge in the UK, one of our most venerable university towns, giving the event a feeling of academic history despite being about possibly the most cutting edge of biological sciences.  Over three days a group of probably 250 people gathered in the Music Department to discuss, listen, type, debate and network.  A word to other conference organisers: holding the conference in the Music Department is a great idea when you have a conference venue designed for performance – the acoustics were great for talks.  The lack of power supplies however reflected the number of concert-goers that normally need to plug in a laptop and work whilst they listen.  A reminder to us all that it’s all very well packing our power adaptors when we go to a conference, but a four-way block/gang plug would make you a popular person indeed in situations like this.

I don’t want to run over all of the sessions, you can find the programme for the three days here.  I do however just want to rave about how good a conference this is, and share a few highlights.

At the moment BioSysBio is a great size, and is run as a single track conference with unconference style breakouts given their own slot.  A two hour slot on the second afternoon was given over to a number of simultaneously run workshops (I plumped for the ONDEX tutorial, because I work with some people who are on the ONDEX project – but still had no idea how to use the software).

Interestingly there was a really good balance between the systems biology and the synthetic biology sides of the track.  I was more impressed by the lack of laptops being sported during the talks, it meant that actual, real biologists were present mixing it up with the informatics and mathematics geeks.  This also showed through in the poster sessions.  It’s a refreshing change for a bioinformatician to see the science that has resulted from informatics based approaches, rather than having the implementation of the approaches being the primary concern.

My talk highlights came from a number of people and areas.  I found Julian Huppert’s talk on “Four-stranded DNA: How G-Quadruplexes control transcription and translation” fascinating.  Here was a whole area of transcriptional control I had never even come across, mediated by DNA structures I had never heard of.  That’s one of those nice eye-opening moments at a conference, when the complexity of the systems you study is just laid bare in front of you by something new and exciting.  On the second day we had a talk entitled “Effect of pauses on transcriptional noise” from Andre Ribeiro, following on from a talk on RNAP the previous day by Marko Djordjevic.  The idea of patterns in transcriptional noise, caused by kinetic and spatial restraints, really caught my imagination.  I really enjoyed Catherine Lloyd’s talk on CellML – we’re interested in both SBML and CellML at Newcastle, and it’s always nice to see speakers from both projects on the same billing.  The panel discussion on security and synthetic biology (and touching on DIYbio as well) was extremely enlightening and not just because of the fantastic talk by Drew Endy who I had the pleasure of seeing speak twice at the meeting.  On the final day there was much to choose from, from Christina Smolke talking about building circuits with bits of RNA to Piers Millet from the UN talking about why we should care about securing synthetic biology.

This brings me nicely onto online coverage of the conference.  If you want to read blog posts about any of the talks, there’s just one place to go and that is to Allyson Lister’s blog at lurena.vox.com.  Ally, who we shall now refer to as the ‘Robot Blogger’, managed to have coverage of each talk online within minutes of them finishing, whilst apparently being able to type up notes from the talk immediately following this.  When Allyson spoke about her work on SAINT, Simon Cockell of FuzzierLogic picked up the blogging slack.

The conference organisers were fully Web 2.0 enabled.  We had a pre-existing room on FriendFeed that got heavily trailed on the first day, as well as a small contingent of Twitter enabled users contributing comments and commentary during the talks.  These were all aggregated with the hashtag #biosybio.  I think everyone agreed that this is never the best way to present information from a conference (doing this on FriendFeed, with an agreement between participants in a threaded discussion, is probably the best), but I know that those of us who were twittering away were at least being thanked for it by others who could not attend.  That certainly kept me interested in how my wifi connection was doing throughout the day.  We also had the strange effect of some of our colleagues and former colleagues attending “The Influence and Impact of Web 2.0 on e-Research Infrastructure, Applications and Users” at NESC who were all using the hashtag #research3 and we even had some microblogging cross talk going with delegates there.  That was a new one for me certainly.

Despite not everyone on my Twitter subscription lists being a scientist, let alone ones interested in synthetic or systems biology, and despite posting hundreds of very geeky tweets, not a single person unsubscribed from my posting frenzy.  Either no-one is paying attention or people were happy to filter the noise for a couple of days, which I thought was great.

To sum up, it’s a great conference.  It will be great to go to it again next year in Oxford too.  I can’t remember the last time I attended every conference talk, even the 8.30am ones.  I can’t remember the last time I went to a conference and wasn’t tempted to have a post lunch snooze in the lecture theatre afterwards.  I can’t remember coming away from a conference so fired up and so keen to get back to work and start thinking harder about this year’s iGEM competition.  Actually the last conference I felt anything like this much fervour and excitement about was the excellent (but never repeated) O’Reilly Bioinformatics Technology Conference in Tucson in 2002.

Next year, if you’re in the field, or even vaguely interested in the field, beat a path to this conference in Oxford.  Beg, borrow or steal the cash.   I will see you there.

A new world of pain HP + Vista

So we took delivery this morning of a shiny new HP 2710p (not for us, but for Matt). Matt couldn’t contain his excitement (or let us deliver it to him at lunch) so came up to the office.  At this point we plugged it in and turned it on.  Time on the clock was 11am.

Now, I don’t know if it was a problem with Vista, the tablet, or what.  HP’s ‘first boot’ setup makes Dell’s look streamlined.  The machine, after deciding which OS to install (64bit or 32bit), created a rescue partition, populated the rescue partition, asked us for some passwords, installed some software.  It rebooted maybe 7 times in all.

It finally booted into a workable Vista desktop at 1.15pm.  2 hours and 15 minutes for a new machine to boot into a useable configuration is appalling.  After 20 minutes we were already joking that we could have installed Ubuntu and configured it AND downloaded enough development tools to keep Matt happy in that time.  We had no idea that it would finally boot in front of us over 2 hours later during lunch.

I feel for any consumer that has to go through this process.  It’s not impressive.  By 3pm Matt was already prepping to wipe it in favour of an XP/Ubuntu dual boot.  Oh the complete install package?  Over 20GB disk was gone by the time it had finished.

Things I never knew you could do with a Canon point and shoot camera

Did you know you could shoot RAW with your Canon point and shoot? Extend the available time you can record video for? Script up events and actions? Play games?

Lifehacker to the rescue.

This was promptly bookmarked under ‘Things to do when I’m bored’ – my default bookmarks folder for neat things I want to try :)

Linux on a PowerBook, the iron fist in a velvet glove?

Well at least that’s the quote I’ve seen kicking around. In a fit of ‘new things technical’ over the weekend, I erased 10.5 from my G4 PowerBook (I’m sorry Apple, but if you think Leopard’s performance on the G4 is acceptable, there are at least 2 owners here who would heartily disagree) and decided to stick the Hardy Heron Ubuntu beta on it instead, having been superficially impressed with it in a quick show and tell with Frank on Friday afternoon.

So this quickly turned into a bit of a frustration. Having erased Kubuntu from my now Windows XP SP3(!) Dell laptop (oh so much happiness after Vista), I still feel I need a Linux platform to play on and I have issues with VMware Server (version 1 doesn’t seem to be able to bridge my WLAN connection and version 2 beta is .. dire quite frankly to the point of non-functionality).

The actual installation on to the PowerBook was fine, although booting from the LiveCD it was obvious that my wireless card was not supported ‘out of the box’. Ploughing ahead anyway, I wiped the system and booted back into a very refined Gnome interface (I never thought I would end up back in Gnome, but KDE is turning into a train wreck of a window manager). It certainly looked pretty, but the differences between well supported x86 architecture and Apple’s former PowerPC favourites quickly became clear.

There is *no* out of the box support for the PowerBook wireless. In fact getting it to work was quite the exercise in frustration. The procedure is mildly different from Gutsy (and in fact better) but nevertheless not that easy to find. I managed to get it working for, hmm, about 1 boot. Since then it has refused to work. It can’t even see half the wireless networks that my Dell can – and has great difficulty authenticating with my Linksys router (mind you my PDA also struggles with this). However this leaves me at the mercy of a wired connection, and I’m sorry – I have laptops precisely because they’re portable and it annoys Harriet less to have me in the room with her nerding rather than in the spare room nerding.

Then the other problems start to become apparent. The keyboard backlighting functions are broken, not just the fancy dimming functions – but the caps lock key. The single mouse button is a hindrance beyond belief (which it just isn’t in OS X despite a large number of context sensitive right click operations available). The screen dimming buttons make X extremely unstable. It’s impossible to install OpenOffice.Org currently from the repositories. The machine itself is running hot – and believe me this machine got hot under OS X but it’s managed to iron a nice flat depression into my sofa cushion with Ubuntu on it. Oh and there’s absolutely no support for the ATI Radeon Mobile chipset in it – so no compiz-fusion for me, or at least no way I could see to make it all hang together.

I appreciate that some of these issues might be related to the fact it’s a beta release, but it seems that PPC owners (especially laptop owners) are getting short changed. I’m sure this would work great on a desktop with an NVIDIA card in, but having this installed without being able to use any of the joyful, full functionality of the laptop is just plain wrong to me.

It’s a shame, because as a distribution I still think Ubuntu is by far and away the best desktop Linux for those of us for whom compiling systems is just not fun anymore (yes I did used to do it.. 13 years ago!). It’s been rock solid as a server platform for us on x86 and x86-64 (mind you I can happily say that about RHEL4 as well, I just don’t like it as much).

10.4 goes back on the PowerBook tomorrow as a) I no longer have 10.5 media and b) it’s faster. Strange how downgrading from the latest OS releases has suddenly become a necessity for me! A shame I lack a dedicated Linux platform to play with though. Maybe I should just get another laptop…..