under test with a limited number of users, but reached a point where it
started to cause unexpected behaviours and errors in Galaxy (behaviours
one would not expect from a merely slow database); switching to PostgreSQL
fixed all of the odd behaviour. Be warned: there is no migration script
from SQLite to PostgreSQL in Galaxy, and upgrading in this way is not
supported at all by the Galaxy team. We spent an arduous week testing
and re-testing a custom migration script to move our precious two
man-months' worth of work to the new database management system.
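For Galaxy installations of this generation, the database backend is chosen by a SQLAlchemy connection string in universe_wsgi.ini. A minimal sketch of pointing a fresh installation at PostgreSQL from the start, rather than the default SQLite (the database name, user, and password below are illustrative):

```ini
# universe_wsgi.ini -- database settings (credentials are examples only)
# The default, suitable only for evaluation:
# database_connection = sqlite:///./database/universe.sqlite

# PostgreSQL, which copes far better with concurrent users:
database_connection = postgres://galaxy_user:secret@localhost:5432/galaxy_db

# Connection-pool tuning options (values are illustrative)
database_engine_option_pool_size = 5
database_engine_option_max_overflow = 10
```

Starting on PostgreSQL avoids the unsupported SQLite-to-PostgreSQL migration described above entirely.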
According to its development team, Galaxy is always at version 1. This
reflects a commitment to backward compatibility that results in daily
incremental updates to the codebase on their server, rather than significant
new releases. Often these updates amount to as little as a few lines;
sometimes many megabytes of code will change. Most importantly, the
database schema can change. Galaxy provides database migration scripts
between consecutive updates, but does not provide scripts for arbitrary
jumps, say from schema 27 to schema 77. The practical implication is that
it is wise to update often. The community at large seems to update on
average once every 12 weeks, which provides a good balance between
workload and ease of upgrading. Leaving upgrades too long can make
Galaxy painful to upgrade, as many merges, schema changes, and update
scripts must be run and tested sequentially to ensure a smooth upgrade path.
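At the time, Galaxy was distributed via Mercurial, so the update-plus-migration cycle described above looked roughly as follows. This is a sketch only: the install path and backup step are illustrative, and a test instance should always be upgraded first.

```shell
# Stop Galaxy, then back up the database before touching anything
# (PostgreSQL shown; database name is illustrative)
pg_dump galaxy_db > galaxy_db_backup_$(date +%F).sql

# Pull and apply the latest incremental changes from the Galaxy repository
cd /opt/galaxy-dist          # illustrative install path
hg pull
hg update

# Run the bundled migration scripts, which walk the database schema
# forward one version at a time to match the new code
sh manage_db.sh upgrade
```

Because the migration scripts only step between consecutive schema versions, a long-neglected instance must replay many such steps in order, which is exactly why frequent small updates are less painful than rare large ones.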
Running Galaxy, essentially a job-server-based system, on a compute
cluster requires a touch of planning, but is made easier by the fact that
most cluster systems are supported by reliable libraries. In our cluster,
Galaxy itself runs on a head node, which is visible both to the outside
world and to the machines that accept jobs from the queue. We used the
free DRMAA libraries from FedStage [27], compiled against LSF (a Sun
Grid Engine version exists too), and merely had to configure job runners
to ensure that Galaxy jobs went into the cluster rather than being executed
on the head-node machine.
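In the universe_wsgi.ini of that generation, routing jobs through DRMAA rather than running them locally came down to the job-runner settings, roughly as below. The library path and per-tool mapping are illustrative; the DRMAA library location depends on how the FedStage libraries were built for the local scheduler.

```ini
# universe_wsgi.ini -- job runner settings (paths illustrative)
# Galaxy finds the scheduler via the DRMAA shared library, e.g.
# export DRMAA_LIBRARY_PATH=/usr/lib/libdrmaa.so  (set before starting Galaxy)

start_job_runners = drmaa

# Send jobs to the cluster queue by default, not the head node
default_cluster_job_runner = drmaa:///

[galaxy:tool_runners]
# Keep tools that must talk to the user's browser on the head node
upload1 = local:///
```

The per-tool override is what allows the exception discussed next: upload and download must stay on the machine the client can actually reach.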
When dealing with big files, as is inevitable in NGS analyses, it is best
to ensure that uploads and downloads do not run on the cluster, as the
client (the web browser in this case) must be able to connect to the
machine doing the job. Galaxy generates large output files, which
end-users take away as their results. Galaxy's built-in facility for letting
users download data occupies a lot of processing time within the main
Python process, which can cause Galaxy to slow down and fail when
sending data to the web browser. The solution is to use the
apache_xsendfile module, which provides a mechanism for serving large
static files from web applications. When a user requests a large file for
download, Galaxy authenticates the request and hands the work of
sending the file to the user over to Apache, without
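As we understand it, this arrangement has two halves: the apache_xsendfile option on the Galaxy side and the mod_xsendfile module on the Apache side. A hedged sketch of the Apache fragment (the data directory path is illustrative):

```apache
# Apache virtual host fragment -- requires mod_xsendfile to be installed.
# Galaxy answers the request, then emits an X-Sendfile header naming the
# file; Apache streams the file itself, freeing the Python process.
XSendFile on

# Directory Apache is permitted to serve on Galaxy's behalf
# (illustrative path to Galaxy's dataset files)
XSendFilePath /opt/galaxy-dist/database/files
```

This pairs with setting apache_xsendfile = True in universe_wsgi.ini so that Galaxy emits the header instead of streaming the file through its own process.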