Squeezing big data into a small organisation - Open Source Software in Life Science Research



and thanks to the software and hardware infrastructure the
bioinformatician can have much more time to pursue their own research,
such as developing new methods. Because the approach works as well for one
user as it does for one thousand, a sensible infrastructure is scalable
to the coming demands of our data-flooded field.

It is not all easy when relying on free and open source software. As
much of it is developed for other people's purposes, there can be significant
shortcomings if your immediate purpose is slightly different from the
creators'. The first concern when sourcing software is 'does it do what I
want?', and all too often the best answer after surveying all the options is
'nearly'. As plant scientists and microbiologists, we find that our model
organisms often do not fit some of the assumptions made by analysis software.
For example, SNP-finding software that assumes a diploid population cannot
work well on reads generated from an allotetraploid plant, and the formats
in which we receive data from genome databases are somewhat different
from those the software expects. One feature we would love to have but
never do is BioMart-style [30] automatic grabbing of data over the web;
such software never supports our favourite databases. This reflects the
fact that the main source of investment in bioinformatics is from those
working in larger communities than ours but, in general, lack of exactly
the right feature is an issue everyone will come across at some point.
Typically we find ourselves looking for a piece of software that can handle
our main task and end up bridging the gaps with bits of scripts and
middleware of our own; one of the major advantages of Galaxy is that it
makes this easy.
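
As an illustration of the kind of small bridging script meant here, the following is a minimal sketch in Python that converts a tab-separated export from a genome database into GFF3. Everything specific in it is an assumption made for the example, not something described in the chapter: the input column names (name, seqid, start, end, strand, type), the 0-based half-open input coordinates, the function name table_to_gff3 and the default source label.

```python
import csv
import io


def table_to_gff3(handle, source="local_db"):
    """Yield GFF3 lines from a tab-separated export.

    Assumed (hypothetical) input columns: name, seqid, start, end, strand, type,
    with 0-based, half-open coordinates. GFF3 uses 1-based, inclusive
    coordinates, so only the start value needs shifting.
    Attribute values are written as-is; no GFF3 escaping is attempted.
    """
    yield "##gff-version 3"
    reader = csv.DictReader(handle, delimiter="\t")
    for row in reader:
        start = int(row["start"]) + 1  # 0-based start -> 1-based start
        end = int(row["end"])          # half-open end equals inclusive end
        # Anything other than + or - is written as "." (unstranded)
        strand = row["strand"] if row["strand"] in {"+", "-"} else "."
        yield "\t".join([
            row["seqid"], source, row["type"],
            str(start), str(end),
            ".",        # score: not available in the assumed input
            strand,
            ".",        # phase: not available in the assumed input
            "ID=" + row["name"],
        ])


# Small in-memory example standing in for a downloaded file
example = (
    "name\tseqid\tstart\tend\tstrand\ttype\n"
    "geneA\tchr1\t99\t200\t+\tgene\n"
)
gff_lines = list(table_to_gff3(io.StringIO(example)))
print("\n".join(gff_lines))
```

A script of this size does one conversion and nothing else, which is what makes it practical to write quickly and, if useful, to wrap later as a tool in a workflow system such as Galaxy.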

It is surprising to us that there is a lack of useable database software with
simple pre-existing schemas for genomics data. There are of course the
database schemas provided by the large bioinformatics institutes like
EMBL or the SeqFeature/GFF databases in the Open Bioinformatics
Foundation [31] projects, but these are either large and difficult to work
with, because they are tied into considerable other software projects like
the GBrowse [16] or ENSEMBL [32] browsers, or just complicated. Often,
the schema seems obfuscated, making it difficult to work with on a
day-to-day basis. Others, like CHADO [16], have been a nightmare to just
start to understand and we have given up before we begin. In this case we
felt we really needed to go back to the start and create our own solution,
the Gee Fu tool [21] we described earlier. We cannot always take this
approach; when we are stuck, we are stuck. It is not in the scope of our
expertise to re-code or extend open source software. Our team has
experience in Java and most scripting languages, but the time required to
become familiar with the internals of a package is prohibitive, with busy
