Squeezing big data into a small organisation - Open Source Software in Life Science Research

11.5 Standardising to your requirements

We found it necessary to 'get tough' with the way that users format and
describe their data. Sequence data have pretty good and straightforward
file formats: FASTA and FASTQ are clear, even if details such as quality
encodings are sometimes a bit obscure. Most other data types are much more
variable and harder to work with, and capturing the important meta-data is
always difficult. Bad descriptions and formats make data hard to work with
internally, sometimes to the point that we do not know what a file has in
it and therefore cannot use it. Every bioinformatician has had to wrangle
somebody else's poorly formatted file because it messed up their perfectly
workable pipeline. Eventually, much of the data we collect will have been
analysed enough to warrant publication and we will need to submit it to
public repositories. Typically, the worker who generated the data will by
then have moved on to new projects, and getting the right meta-data will be
difficult. To prevent these sorts of frustrations, and eventual mistakes in
submitted meta-data, we require workers to stick to certain standards when
creating their data, or we refuse to work with it.
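A rule like this is easier to apply consistently when files are checked automatically at the point of submission. The following is a minimal sketch only, not the tool we built: it assumes GFF3 input and two illustrative house rules of the kind this section describes, namely that a `##feature-ontology` directive carries a version-like number and that any dbxref value has the form DBTAG:ID. The function name `check_gff` and both regular expressions are hypothetical choices made for this example.

```python
import re

# Hypothetical house rules, for illustration only.
ONTOLOGY_DIRECTIVE = "##feature-ontology"
VERSION_TOKEN = re.compile(r"\d+(?:\.\d+)+")          # e.g. "1.12" or "2.5.3"
DBXREF_VALUE = re.compile(r"^[A-Za-z0-9_.\-]+:\S+$")  # DBTAG:ID


def check_gff(lines):
    """Return a list of problems found; an empty list means the file passes."""
    problems = []
    ontology_ok = False

    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\n")

        # Anything after a ##FASTA directive is sequence, not feature records.
        if line.startswith("##FASTA"):
            break

        # Rule 1: the ontology directive must contain a version-like number.
        if line.startswith(ONTOLOGY_DIRECTIVE):
            value = line[len(ONTOLOGY_DIRECTIVE):].strip()
            if VERSION_TOKEN.search(value):
                ontology_ok = True
            continue

        # Skip blank lines, comments and all other directives.
        if not line or line.startswith("#"):
            continue

        columns = line.split("\t")
        if len(columns) != 9:
            problems.append(
                f"line {number}: expected 9 tab-separated columns, "
                f"found {len(columns)}"
            )
            continue

        # Rule 2: every dbxref value, if present, must look like DBTAG:ID.
        for attribute in columns[8].split(";"):
            key, _, value = attribute.partition("=")
            if key.strip().lower() != "dbxref":
                continue
            for ref in value.split(","):
                ref = ref.strip()
                if not DBXREF_VALUE.match(ref):
                    problems.append(
                        f"line {number}: malformed dbxref value {ref!r}"
                    )

    if not ontology_ok:
        problems.append("no ##feature-ontology directive with a version number")
    return problems
```

Calling `check_gff(open(path))` on a submitted file and refusing it whenever the returned list is non-empty gives the submitter a concrete list of things to fix, instead of a silent pipeline failure later on.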

We produce specifications for different file and data types too, meaning
certain pieces of information must be included. For example, we require
all our GFF files to include a version number for the feature ontology
used, and each record to follow a certain format for the dbxref attribute
if it is included. In doing this we found we had to provide a reliable and
simple way for workers to describe the data when they make it.

At the outset we hoped that existing software would be fine. Laboratory
Information Management Systems (LIMS) seem to be the right sort of tool,
but we have found these systems either far too cumbersome or just lacking
the right features for our needs. Inevitably, in developing easy procedures
we have had to create our own software. This is probably something that all
bioinformatics groups will have to do at some point: even though generic
LIMS and data management systems exist, the precise domains you need to
capture may be harder to model in these systems than in a small custom tool
of your own.

We turned to Agile development environments, in particular the
Ruby-on-Rails [9] Model-View-Controller web application framework, for
producing tools around popular Structured Query Language (SQL) databases
(commonly called 'web apps' or 'apps'). Rails makes application development
quick and easy by providing intelligent scaffolding and many built-in data
interfaces. Most usefully, data-mart style REST [10] interfaces and XML [11]
and JSON [12] responses are built in by default, and others are very easy
indeed to implement. Thanks
