Squeezing big data into a small organisation - Open Source Software in Life Science Research

needing the Galaxy processes to be involved again. The apache_xsendfile

module does not currently have an upload feature. Because we use a

closed system, we do not provide for upload of large files into Galaxy, as

this often results in data duplication. Instead, users who need large files

put into the system must ask the bioinformatician to place the files

in a data library. The Galaxy team recommend the use of nginx

[28] as a server for the transfer of large files up to Galaxy. An FTP server

solution is also provided.

During analyses, Galaxy creates large data files that can only be disposed

of when the user decides that they are finished with them. Thus in a production

environment Galaxy can use a lot of disk space. Our instance runs

comfortably in 1.5 TB of allotted disk space provided that the cleanup

scripts are run nightly. The timing of cleanup runs will depend on use but

sooner is better than later as running out of disk space causes Galaxy to

stop dead and lose all running jobs. When running scripts weekly we

found that 3 TB of disk space was not enough to prevent a weekly halt.
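
Because a full disk halts Galaxy and loses running jobs, it can help to check free space between cleanup runs. The following is a minimal sketch of such a check, not part of Galaxy itself: the function names and the 20% threshold are assumptions chosen for illustration, and the path passed in would be wherever your instance stores its datasets.

```python
import shutil


def below_threshold(free_bytes, total_bytes, min_free_fraction):
    """Return True when the free fraction of a filesystem is under the threshold."""
    if total_bytes <= 0:
        # Treat an unreadable or empty filesystem as needing attention.
        return True
    return (free_bytes / total_bytes) < min_free_fraction


def needs_cleanup(path, min_free_fraction=0.2):
    """Check the filesystem that holds `path`.

    The default of 0.2 (20% free) is an illustrative assumption; a suitable
    value depends on how quickly your users generate data.
    """
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return below_threshold(usage.free, usage.total, min_free_fraction)
```

A scheduled job could call `needs_cleanup` on the dataset directory and, when it returns True, alert the administrator or trigger the cleanup scripts earlier than the nightly run.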

11.9 Helping the user to understand the details

With all these new powerful tools at their disposal, it would be remiss of

us not to teach the biologists how to understand the settings and how to

interpret the output, and, most importantly, what the technical

caveats of each data type are. It is quite possible to train biologists to do their

own analyses and they can quickly get the hang of command-line

computer operation and simple scripting tasks. Surprisingly though, a

common faltering point is that biologists often come to see the methods

as a 'black box' that produces results but do not see how to criticise

them. It is often counterproductive at early stages to drag a discussion

down to highly technical aspects; instead, introducing simple control

experiments can work well to convince biologists to take a more

experimental approach and encourage them to perform their own

controls and optimisations. A great example comes from our

next-generation sequence-based SNP finding pipelines. By adding known

changes to the reference genomes that we use and running our pipelines

again, we can demonstrate to the biologist how these methods can

generate errors. This insight can be quite freeing and convinces the

biologist to take the result they are getting and challenge it, employing

controls wherever possible.
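
The control described above can be sketched in a few lines. This is an illustrative outline under stated assumptions, not our actual pipeline: `spike_snps` and `score_calls` are hypothetical helper names, the reference is treated as a plain string of bases, and the SNP calls are assumed to have already been reduced to a mapping from 0-based position to the called alternative base.

```python
import random

BASES = "ACGT"


def spike_snps(reference, n_snps, seed=0):
    """Introduce n_snps known single-base changes into a reference sequence.

    Returns (modified_reference, truth), where truth maps each 0-based
    position to (spiked_base, original_base). Reads from the real sample
    still carry original_base, so a correct SNP call at that position
    should report original_base as the alternative allele.
    """
    rng = random.Random(seed)  # fixed seed so the control is reproducible
    candidates = [i for i, base in enumerate(reference) if base in BASES]
    if n_snps > len(candidates):
        raise ValueError("more SNPs requested than unambiguous positions")
    sequence = list(reference)
    truth = {}
    for pos in rng.sample(candidates, n_snps):
        original = sequence[pos]
        spiked = rng.choice([b for b in BASES if b != original])
        sequence[pos] = spiked
        truth[pos] = (spiked, original)
    return "".join(sequence), truth


def score_calls(truth, calls):
    """Compare SNP calls (position -> called base) with the spiked truth set."""
    recovered = sum(
        1 for pos, (_, original) in truth.items() if calls.get(pos) == original
    )
    return {
        "recovered": recovered,
        "missed": len(truth) - recovered,
        # Calls outside the spiked set may be genuine variants in the
        # sample, so they are not necessarily errors.
        "unexpected": sum(1 for pos in calls if pos not in truth),
    }
```

Running the pipeline against the modified reference and passing its calls to `score_calls` gives the biologist a concrete count of how many known changes were recovered and how many were missed.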
