purpose software projects such as Linux, GCC, Apache, and MySQL are the direct result of this new freedom to cooperate. Development is “crowdsourced”: there are often one or more core teams of developers with financial backing from commercial users of the software, and a large community of end users also contributes improvements that advance the projects much faster than would be possible with a single traditional developer team at work. For example, about 75% of new Linux kernel code is generated by teams inside normally competitive companies like IBM and Intel, and individual contributors account for at least 18% of ongoing efforts [1].
A grand tradition of academic thriftiness has led to the widespread adoption of these no-cost tools in the research community, and the “crowdsourcing” ethic has rubbed off on scientists' thinking about their own work. In recent years there has been a flowering of open-source bioinformatics software and a move toward more open sharing of data. Indeed, many granting agencies now require an explicit plan for data sharing, although there is little agreement about what constitutes a reasonable plan [2].
14.4 OPEN DATA STANDARDS: ONTOLOGIES AND INTERCHANGE FORMATS
Sharing data requires mutual understanding of the content and format of the data, but achieving this understanding can be nontrivial. This is especially so when dealing with unprocessed, or “raw,” data, which is typically written in some mysterious binary format closely held by each instrument manufacturer. The use of such closed formats is technically defensible: they are often the most efficient way to store data rapidly as it streams off an instrument, and the manufacturer can alter them as needed without worrying about disrupting other software systems that read the data, since none exist. Of course, the fact that an ever-shifting and undocumented data format also binds the user to the data processing software sold by the instrument maker has long been seen as a happy side effect by the instrument makers, though not by instrument users. Increasingly, users are demanding, and helping to define, open standards that allow the data they collect to be read and written by software agents other than those provided by the equipment manufacturer, and in many cases the manufacturers now support these efforts lest a lack of openness become a competitive disadvantage. Developing open standards for describing processed data and results presents an even greater challenge, as the very idea of “processing” and “results” is a rapidly moving target in the research world, and there is often little agreement on the vocabulary used to describe the domains themselves.
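To make the contrast with closed binary formats concrete, the sketch below shows how data written in an open, documented, XML-based interchange format can be read by any software agent with a general-purpose parser, here using only the Python standard library. The element and attribute names (run, spectrum, retention_time, and so on) are illustrative assumptions for this sketch, not those of any particular published standard.

import xml.etree.ElementTree as ET

# A hypothetical open interchange record; in a closed binary format this
# content would be unreadable without the vendor's own software.
record = """<run instrument="AnyVendor Model X">
  <spectrum id="1" retention_time="12.7">
    <peak mz="445.12" intensity="8231.0"/>
    <peak mz="446.13" intensity="1987.0"/>
  </spectrum>
</run>"""

root = ET.fromstring(record)
for spectrum in root.findall("spectrum"):
    rt = float(spectrum.get("retention_time"))
    peaks = [(float(p.get("mz")), float(p.get("intensity")))
             for p in spectrum.findall("peak")]
    print(f"spectrum {spectrum.get('id')}: rt = {rt} min, {len(peaks)} peaks")

Because the structure is self-describing and publicly documented, the same few lines would work regardless of which vendor's instrument produced the file; that independence from the manufacturer's own software is the practical payoff of an open standard.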
The first step in creating a data standard is to disambiguate the terminology used in the area of endeavor. This is most properly done by developing a structured, rigorous, and thorough description of the knowledge domain, or “ontology,” while avoiding duplication of or conflicts with ontologies in related areas. This is a nontrivial and open-ended task requiring cooperation within