Following Data - Life Out of Sequence

number of computers to which a user can connect via ssh, and bioinformaticians are routinely logged into many at once, using a window-based operating system to flick back and forth between the various connections. For instance, a bioinformatician's task might involve simultaneously writing a program on his or her own machine, looking up a database on a public server, and copying data to or from disk space on a third machine. A large part of the virtuosity of such work is in being able to move oneself and one's data rapidly between places. Indeed, bioinformaticians, like software engineers, constantly seek to reduce the number of keystrokes necessary to move around. They can do this by setting up aliases, short commands that act as abbreviations of longer ones. Or they can use their knowledge of programming languages such as Perl and regular expressions to find a shortcut for all but the most intricate of maneuvers. In programs and on the command line, it is common to see bioinformaticians using abstruse strings (for instance: “{^(?:[^f]|f(?!oo))*$}”) in order to save themselves extra typing. Having a working grasp of such intricacies, combined with a knowledge of where important files and programs are located on the network, makes a skillful bioinformatician.
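
To make the quoted string less opaque, here is a minimal Python sketch of what the pattern `^(?:[^f]|f(?!oo))*$` does. It assumes the surrounding braces in the quoted text are Perl-style pattern delimiters rather than part of the expression, and the function name `lacks_foo` is hypothetical, introduced only for illustration. Under those assumptions, the pattern accepts exactly those strings that do not contain "foo".

```python
import re

# Pattern quoted in the text, without the surrounding braces (assumed to be
# Perl-style delimiters). Each character must be either a non-"f" or an "f"
# that is not followed by "oo", so "foo" cannot appear anywhere in a match.
NO_FOO = re.compile(r"^(?:[^f]|f(?!oo))*$")


def lacks_foo(text: str) -> bool:
    """Return True when the pattern matches, i.e. text contains no "foo"."""
    return NO_FOO.match(text) is not None


if __name__ == "__main__":
    # Print each sample alongside whether the pattern accepts it.
    for sample in ["bar", "fo", "food", "a foo b", "ffoo"]:
        print(sample, lacks_foo(sample))
```

In Python itself, a plain substring test such as `"foo" not in text` gives the same answer; the regular-expression form is mainly useful in tools that accept only a pattern, which is where the saving in keystrokes comes from.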

Much of the work of bioinformatics can be understood as the movement and transformation of data in virtual space. At EBI, I closely followed the progress of the “release cycle,” a process that occurs every couple of months through which the EBI's main database (known as Ensembl) is revised and updated. A detailed description of the release cycle will illustrate the importance of space management in bioinformatic work.

Much of the work of the release coordinator is making sure that the right data end up in the right place in the right form. Ensembl does not produce its own data; instead, its role is to collect data from a wide variety of sources and make them widely available in a common, coherent, and consistent format. Ensembl is also an automatic annotation system: it is software that takes raw genomic sequence and identifies the locations of particular structures (e.g., genes) and their functions. Making a release involves collecting data from many individuals and places and running them through a software pipeline. For such a large set of databases, it is not possible to simply update them one by one. Ensembl requires a sophisticated “staging” system whereby the new release is prepared, processed, tested, and checked for consistency before it is released “live” onto the World Wide Web. Thus the release cycle becomes a carefully choreographed set of movements through a virtual space in
