Biomedical Engineering Reference
In-Depth Information
11.5 Standardising to your requirements
We found it necessary to 'get tough' with the way that users format and
describe their data. Sequence data have good, straightforward file
formats: FASTA and FASTQ are clear, even if some details such as quality
encodings are sometimes obscure. Most other data types are much more
variable and harder to work with, and capturing the important meta-data
is always difficult. Bad descriptions and formats make data hard to work
with internally, to the point that we may not know what a file contains
and therefore cannot use it. Every bioinformatician has had the
frustration of wrangling somebody else's poorly formatted file because
it broke their perfectly workable pipeline. Eventually, much of the data
we collect will have been analysed enough to warrant publication and we
will need to submit it to public repositories. Typically, the worker who
generated the data will by then have moved on to new projects, and
getting the right meta-data will be difficult. To prevent these
frustrations and eventual mistakes in submitted meta-data, we require
workers to stick to certain standards when creating their data or we
refuse to work with it.
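Standards like these are easiest to enforce when a program can check
them at the point of submission. As a minimal sketch, assuming GFF3
input, the Python function below checks two rules of the kind described
next in this section for GFF files: that the file declares a feature
ontology, and that any dbxref attribute follows a DBTAG:ID pattern. The
`##feature-ontology` directive and the `Dbxref` key are taken from the
public GFF3 specification and are assumptions here, not necessarily the
in-house rules; the function name `check_gff` and the regular expression
are hypothetical.

```python
import re

# Assumed pattern for a Dbxref value: "DBTAG:ID", e.g. "GeneID:12345".
DBXREF_PATTERN = re.compile(r"^[A-Za-z0-9_.-]+:\S+$")


def check_gff(lines):
    """Return a list of human-readable problems found in GFF3 text lines."""
    problems = []
    has_ontology = False
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\n")
        if line.startswith("##feature-ontology"):
            # The directive only counts if it carries a value after the name.
            has_ontology = has_ontology or len(line.split(None, 1)) == 2
            continue
        if not line or line.startswith("#"):
            continue  # skip blank lines, comments and other directives
        columns = line.split("\t")
        if len(columns) != 9:
            problems.append(
                f"line {number}: expected 9 columns, found {len(columns)}"
            )
            continue
        # Column 9 holds semicolon-separated key=value attributes.
        for attribute in columns[8].split(";"):
            key, _, value = attribute.partition("=")
            if key.strip().lower() != "dbxref":
                continue
            # A Dbxref attribute may list several comma-separated references.
            for ref in value.split(","):
                if not DBXREF_PATTERN.match(ref.strip()):
                    problems.append(
                        f"line {number}: malformed Dbxref '{ref.strip()}'"
                    )
    if not has_ontology:
        problems.append("missing ##feature-ontology directive")
    return problems
```

Calling `check_gff(open(path))` on a submitted file returns an empty
list when both rules are satisfied, so a submission tool could refuse
any file for which the list is non-empty and show the messages to the
user.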
We also produce specifications for different file and data types, so
that certain pieces of information must be included. For example, we
require all our GFF files to include a version number for the feature
ontology used, and each record to follow a set format for the dbxref
attribute if one is included. In doing this we found we had to give
users a reliable and simple way of describing their data at the time
they create them. At the outset we hoped that existing software would be
adequate. Laboratory Information Management Systems (LIMS) seem to be
the right sort of tool, but we have found these systems either far too
cumbersome or lacking the right features for our needs. Inevitably, in
developing easy procedures we have had to create our own software. This
is probably something that all bioinformatics groups will have to do at
some point: even though generic LIMS and data management systems exist,
the precise domains you need to capture may be harder to model in these
systems than in a small custom tool of your own. We turned to Agile
development environments, in particular Ruby-on-Rails [9], a
Model-View-Controller web application framework for producing tools
(commonly called 'web apps' or 'apps') around popular Structured Query
Language (SQL) databases. Rails makes application development quick and
easy by providing intelligent scaffolding and many built-in data
interfaces. Most usefully, data-mart style REST [10] interfaces and
XML [11] and JSON [12] responses are built in by default, and others are
very easy to implement. Thanks