Biomedical Engineering Reference
for files that will be required by multiple users, such as BLAST [6]
databases or GFF files [7], that is easy to navigate and helps to keep
itself populated properly. Inspired by the six degrees of separation
principle, we have broad top-level folder names that describe the
general data type, for example 'features' or 'sequences', then a subdirectory
of broad organism taxon, for example 'plants' or 'oomycetes', then a
subdirectory of species before getting data-specific with, for example
'gff' or 'multi-fasta'. This structure can bring most files within four or
five clicks of the root data directory (Figure 11.2). We specify the skeleton
structure of the data directory but afterwards any user is able to add
(and delete) what they like in any of these directories. The structure of
the directory is kept sane by requiring users to follow simple instructions
for adding a record to a text file in the root of the data directory. We
have small scripts that watch and alert us to any discrepancies between
the records in the file and the actual contents of the directory. This simple
system is much less error-prone in practice than it may seem. In truth,
the system is mostly about letting the users know in a clear fashion
where and how to deposit their data for easy sharing and retrieval;
they are usually pretty good about doing this as long as it is clear. The
threat of data deletion helps persuade users to stick to the scheme too
(although in reality we never delete a file; we move it to a recycling bin
and wait to see if anyone asks for it). In our experience, the frequency
with which files are added or updated is sufficiently low that we are not
constantly running around chasing files; we easily manage a
couple of hundred shared files in this way. One important caveat is that
this sort of directory structure really must be reserved for shared files, and
not for an individual user's working files; assemblies from which many users
may predict genes or calculate coverage of RNAseq reads would work
well, but a spreadsheet of results derived from these for an individual
result may be best left in the user's allotted working space. Also, we do
not keep primary data from sequencers and mass spectrometers in these folders;
these are kept separate in a read-only folder on live storage for a useful
amount of time (which will vary according to project), before being
moved to archive. Our data retention policy for our Illumina sequence
data is straightforward: we keep what we need to work with and
what we need to submit to sequence repositories on publication. In
practice we keep the FASTQ [8] sequence files and minimal technical
metadata on the run. Reads are kept on live storage until the allotted
space is full, and then we operate a one-in, one-out policy, moving the older
files to archive. Usually this means we can keep raw reads for around
eight to twelve months.
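
The check that compares the record file against the contents of the shared data directory can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the scripts referred to above: the record file name MANIFEST.txt, its format (one relative path per line, with '#' marking a comment) and the function names are all hypothetical, since the text does not specify them.

```python
from __future__ import annotations

from pathlib import Path

# Hypothetical name: the text only says "a text file in the root of the data directory".
MANIFEST_NAME = "MANIFEST.txt"


def read_records(root: Path) -> set[str]:
    """Return the relative paths listed in the record file.

    Assumed format: one relative path per line; blank lines and
    lines starting with '#' are ignored.
    """
    records = set()
    for line in (root / MANIFEST_NAME).read_text().splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            records.add(Path(line).as_posix())
    return records


def list_files(root: Path) -> set[str]:
    """Return every file under root as a relative POSIX-style path,
    leaving out the record file itself."""
    manifest = root / MANIFEST_NAME
    return {
        p.relative_to(root).as_posix()
        for p in root.rglob("*")
        if p.is_file() and p != manifest
    }


def find_discrepancies(root: Path) -> tuple[list[str], list[str]]:
    """Compare the record file with the directory contents.

    Returns (unrecorded, missing):
      unrecorded - files on disk that have no record
      missing    - records that point at no file on disk
    """
    recorded = read_records(root)
    present = list_files(root)
    unrecorded = sorted(present - recorded)
    missing = sorted(recorded - present)
    return unrecorded, missing
```

Calling find_discrepancies(Path("/path/to/shared/data")) returns two sorted lists that could be mailed to whoever looks after the directory. Using set differences keeps the comparison symmetric: it reports both files that were deposited without a record and records whose files have since been moved or deleted.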