geographic locations, are usually maintained locally, and are connected via high-
throughput networking. The elements of a Grid are generally heterogeneous, adding
compatibility issues to those of data transfer, integration and maintenance. A Grid
architecture requires appropriate protocols, services, application programming inter-
faces, and software development kits (Foster et al., 2001). Computational Grids, with their attendant networks of people and instruments, are ideal for global-scale data mining and analysis (Craddock et al., 2008).
Grid computing facilitates the use of workflows: analysis pipelines in which the
output of one analysis feeds into the input of the next. Researchers have been car-
rying out this procedure manually since the emergence of the affordable computer,
but current workflows can be completely automated. Automated workflows are built
upon Web services.
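As an illustration of the idea, the sketch below chains two hypothetical Web services so that the output of the first becomes the input of the second. The service URLs and JSON field names are invented for the example; a real workflow engine would add scheduling, retries and provenance tracking on top of this pattern.

import json
import urllib.request

def call_service(url, payload):
    """POST a JSON payload to a Web service and return its decoded JSON reply."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=60) as reply:
        return json.load(reply)

def run_workflow(sequence):
    # Step 1: a hypothetical gene-prediction service.
    predicted = call_service("https://example.org/predict-genes",
                             {"sequence": sequence})
    # Step 2: a hypothetical annotation service, fed directly with step 1's output.
    return call_service("https://example.org/annotate",
                        {"genes": predicted["genes"]})

if __name__ == "__main__":
    print(run_workflow("ATGAAATTTGGGCCC"))

Dedicated workflow systems automate exactly this kind of chaining, adding the error handling, scheduling and provenance recording that a script like this omits.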
Web services are formally defined interfaces that allow computational resources
to be exposed in a standard, computationally comprehensible manner. Programs can
be “exposed” as Web services by adding “wrapper” code, adhering to these stan-
dards, to the core program code (a minimal sketch of such a wrapper appears after this paragraph). Web services may be hosted anywhere on the planet,
and combined seamlessly into workflows, at least in theory. In practice, the use of workflows involves practical difficulties such as availability (the Web services upon which a workflow depends may go down without warning), reliability (the builder of a workflow cedes control of its components to their programmers, and the resulting code may not perform as intended) and documentation (many Web services perform exactly as designed, but their programmers are often more focused on the code than on documenting it, making the services hard to use). Despite these drawbacks, well-designed workflows can
perform tasks that would be prohibitive in terms of time and cost if carried out man-
ually. Workflows also facilitate the automated re-analysis of data, as new datasets
become available. Some applications, such as Microbase (Flanagan et al., 2012), retain the results of previous analyses, so that only data not previously analysed need be processed when a dataset grows.
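The following sketch makes the "wrapper" idea concrete and, at the same time, illustrates this kind of incremental behaviour. It exposes a hypothetical command-line program (here called analyse_sequence) as a small HTTP service and caches each result against a hash of its input, so that data which have already been analysed are not processed again. The program name, port and cache layout are invented for illustration and are not taken from Microbase or any other system.

import hashlib
import json
import subprocess
from http.server import BaseHTTPRequestHandler, HTTPServer
from pathlib import Path

CACHE = Path("result_cache")
CACHE.mkdir(exist_ok=True)

def analyse(data):
    """Run the wrapped program, reusing a cached result when the input is unchanged."""
    key = hashlib.sha256(data).hexdigest()
    cached = CACHE / key
    if cached.exists():          # this input has already been analysed
        return cached.read_bytes()
    result = subprocess.run(["analyse_sequence"], input=data,
                            capture_output=True, check=True).stdout
    cached.write_bytes(result)
    return result

class WrapperService(BaseHTTPRequestHandler):
    """The thin 'wrapper' layer that exposes the program as a Web service."""
    def do_POST(self):
        body = self.rfile.read(int(self.headers["Content-Length"]))
        output = analyse(body)
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(json.dumps({"result": output.decode()}).encode("utf-8"))

if __name__ == "__main__":
    HTTPServer(("", 8080), WrapperService).serve_forever()

A production service would also publish a formal interface description and validate its inputs; the point here is simply that a modest amount of wrapper code, plus a result cache, is enough to make an existing program callable, and re-callable without wasted work, from automated workflows.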
Several programs exist to facilitate the construction of fully automated work-
flows; examples are Taverna (Oinn et al., 2004) and Microbase (Flanagan et al., 2012). Workflows built using these tools can be stored and shared in repositories such as MyExperiment 16 (Goble et al., 2010) (Figure 2.17).
Workflows have been applied to several large-scale problems, such as under-
standing the reaction of E. coli to oxygen (Maleki-Dizaji et al., 2009); identification of microbial habitats (Kolluru et al., 2011); and analysis of structural differences in metabolic pathways (Arrigo et al., 2007).
A relatively recent development, which makes it possible to perform computational analysis on an unprecedented scale, is Cloud computing. Cloud com-
puting includes “both the applications delivered as services over the Internet and the
hardware and systems software in the data centres that provide those services”
16 http://www.myexperiment.org/.