Squeezing big data into a small organisation - Open Source Software in Life Science Research

for files that will be required by multiple users, such as BLAST [6]
databases or GFF files [7], that is easy to navigate and helps to keep
itself populated properly. Inspired by the six degrees of separation
principle, we have broad top-level folder names that describe the
general data type, for example 'features' or 'sequences', then a subdirectory
of broad organism taxon, for example 'plants' or 'oomycetes', then a
subdirectory of species before getting data-specific with, for example,
'gff' or 'multi-fasta'. This structure can bring most files within four or
five clicks of the root data directory (Figure 11.2). We specify the skeleton
structure of the data directory, but afterwards any user is able to add
(and delete) what they like in any of these directories. The structure of
the directory is kept sane by requiring users to follow simple instructions
for adding a record to a text file in the root of the data directory. We
have small scripts that watch and alert us to any discrepancies between
the records in the file and the actual contents of the directory. This simple
system is much less error-prone in practice than it may seem. In truth,
the system is mostly about letting the users know in a clear fashion
where and how to deposit their data for easy sharing and retrieval;
they are usually pretty good about doing this as long as it is clear. The
threat of data deletion helps persuade users to stick to the scheme too
(although in reality we never delete a file; we move it to a recycling bin
and wait to see if anyone asks for it). In our experience, the frequency
with which files are added or updated is sufficiently low that we
are not constantly running around chasing files; we easily manage a
couple of hundred shared files in this way. One important caveat is that
this sort of directory structure really must be kept for shared files, and
not used for an individual user's working files; assemblies from which many users
may predict genes or calculate coverage of RNAseq reads would work
well, but a spreadsheet of results derived from these for an individual
result may be best left in the user's allotted working space. Also, we do
not keep primary data from sequencers and mass specs in these folders;
these are kept separate in a read-only folder on live storage for a useful
amount of time (which will vary according to project), before being
moved to archive. Our data retention policy for our Illumina sequence
data is straightforward: we keep what we need to work with and
what we need to submit to sequence repositories on publication. In
practice we keep the FASTQ [8] sequence files and minimal technical
metadata on the run. Reads are kept on live storage until the allotted
space is full, and then we operate a one-in, one-out policy, moving the older
files to archive. Usually this means we can keep raw reads for around
eight to twelve months.
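
The one-in, one-out policy for raw reads could be sketched along the following lines. This is only an illustration under stated assumptions, not the script used in practice: the function name make_room, the use of file modification time as the measure of age, and the idea of a fixed byte quota for the live read folder are all hypothetical choices made for the example.

```python
from __future__ import annotations

import shutil
from pathlib import Path


def make_room(live_dir: Path, archive_dir: Path,
              quota_bytes: int, incoming_bytes: int) -> list[str]:
    """Move the oldest files from live_dir to archive_dir until a new
    file of incoming_bytes would fit within quota_bytes.

    Age is judged by modification time (an assumption for this sketch).
    Returns the names of the archived files, oldest first.
    """
    # Oldest files first, so they are the first candidates for archiving.
    files = sorted(
        (p for p in live_dir.iterdir() if p.is_file()),
        key=lambda p: p.stat().st_mtime,
    )
    used = sum(p.stat().st_size for p in files)
    archive_dir.mkdir(parents=True, exist_ok=True)

    moved = []
    for path in files:
        if used + incoming_bytes <= quota_bytes:
            break  # the incoming file now fits
        size = path.stat().st_size
        shutil.move(str(path), str(archive_dir / path.name))
        used -= size
        moved.append(path.name)
    return moved
```

Calling make_room before each new run is copied onto live storage would keep the folder within its allotted space, with nothing deleted outright: older reads simply migrate to the archive location.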
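
The check of the shared data directory against its text-file record, described earlier in this section, might look something like the sketch below. The details are assumptions made for illustration: the chapter does not name the record file or give its layout, so the file name MANIFEST.txt, the one-relative-path-per-line format (with an optional tab-separated description and '#' comments), and the function names are all hypothetical.

```python
from __future__ import annotations

from pathlib import Path

MANIFEST_NAME = "MANIFEST.txt"  # hypothetical name; the real record file is not named


def read_manifest(root: Path) -> set[str]:
    """Return the relative paths recorded in the manifest.

    Assumed format: one path per line, optionally followed by a tab and
    a free-text description; blank lines and '#' comments are ignored.
    """
    recorded = set()
    for line in (root / MANIFEST_NAME).read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        recorded.add(line.split("\t")[0].strip())
    return recorded


def files_on_disk(root: Path) -> set[str]:
    """Return every regular file under root (relative, POSIX-style),
    excluding the manifest itself."""
    manifest = root / MANIFEST_NAME
    return {
        p.relative_to(root).as_posix()
        for p in root.rglob("*")
        if p.is_file() and p != manifest
    }


def audit(root: Path) -> tuple[set[str], set[str]]:
    """Compare the manifest with the directory contents.

    Returns (files on disk with no record, records with no file on disk).
    """
    recorded = read_manifest(root)
    present = files_on_disk(root)
    return present - recorded, recorded - present
```

Run periodically, a script of this kind would report both unrecorded files (candidates for a reminder to the user, or for the recycling bin) and stale records, which is all that is needed to flag discrepancies for a person to follow up.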
