Distributed Workflows in Bioinformatics - Parallel Computing for Bioinformatics and Computational Biology


constructs and this information could help the scheduler to make better scheduling

decisions.

To preserve all information, the scheduler should take as input the whole program.

However, this could introduce its own (unnecessary) complications: the scheduler

would have to not only support all the syntactic nuances of the programming language

but also be able to deal with apparent cyclic dependencies (e.g., the dependencies

present in the while loop). Furthermore, since schedulers must use specific knowledge

of the architecture of the distributed computer, different implementations, one for each

scheduler type, are necessary, and as such, each scheduler would have to be able to

parse, analyze, and interpret programs.

Hence, in the interests of code maintainability, the GEL scheduler and interpreter
are separated from one another. Instead, an intermediate description of programs is
introduced: DAGs, which essentially describe job instances and their (acyclic)
dependencies on other job instances. These DAGs give more information than atomic
jobs by themselves, yet at the same time relieve the scheduler of the burden of the
syntactic analyses required for whole programs.

Thus, the monolithic scheduler, which runs programs directly on a distributed
computer, is factored into (1) a (DAG) generator, which translates programs into
the intermediate DAG form, and (2) a (DAG) executor, which runs DAGs on some
target distributed computer. For example, different executor implementations can
(1) run jobs on the same computer by calling an exec OS call (useful for development
and testing purposes), (2) interface with the local scheduler on a cluster or use
the Globus API, or (3) interface with a Grid metascheduler such as Nimrod [5] or
APST [14].
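The first kind of executor, one that runs every job on the local machine, can be sketched in a few lines. This is a minimal illustration and not GEL's actual implementation: the function name `run_dag_locally`, the dictionary-based DAG representation, and the job names are all assumptions made for the example, and it relies on the standard `graphlib` module (Python 3.9 or later).

```python
import subprocess
import sys
from graphlib import TopologicalSorter


def run_dag_locally(commands, deps):
    """Run each job as a local process once all of its predecessors have finished.

    commands: dict mapping a (hypothetical) job-instance name to its argv list.
    deps: dict mapping a job-instance name to the set of names it depends on.
    Returns the job names in the order in which they were run.
    """
    order = []
    # static_order() yields predecessors first and raises graphlib.CycleError
    # if the input is not acyclic.
    for job in TopologicalSorter(deps).static_order():
        subprocess.run(commands[job], check=True)  # stop on the first failing job
        order.append(job)
    return order


if __name__ == "__main__":
    py = [sys.executable, "-c"]
    commands = {
        "align": py + ["print('align')"],
        "filter": py + ["print('filter')"],
        "merge": py + ["print('merge')"],
    }
    deps = {"align": set(), "filter": {"align"}, "merge": {"align", "filter"}}
    print(run_dag_locally(commands, deps))
```

The sketch runs jobs one at a time for simplicity; an executor aimed at a cluster or Grid would presumably submit all jobs whose dependencies are satisfied in parallel.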

23.5.3 Interpreter Anatomy

As mentioned in the previous section, the GEL interpreters are each factored into two
components: the (DAG) builder and the (DAG) executor. The builder encapsulates the
language-specific elements of the interpreter, such as the lexical analyzer, parser, and
syntax checker. The DAG builder also enables the use of a DAG-based intermediate
language between the builder and the executor by translating cyclic dependencies
between jobs into DAGs, thereby creating acyclic dependencies.
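One way to picture how a cyclic dependency becomes acyclic is loop unrolling: each pass through the loop body produces fresh job instances, and every instance depends only on instances created before it. The sketch below assumes a strictly sequential loop body and an iteration count known in advance; the function `unroll_loop` and the `job#iteration` naming scheme are hypothetical. For a genuine while loop the count is only known at run time, so a real builder would presumably add instances one iteration at a time.

```python
def unroll_loop(body, iterations):
    """Expand a loop over `body` (an ordered list of job names) into job instances.

    Returns a dict mapping each instance name, e.g. "blast#2", to the set of
    instance names it depends on. Every edge points from a later instance to an
    earlier one, so the result is acyclic.
    """
    dag = {}
    previous = None
    for i in range(1, iterations + 1):
        for job in body:
            instance = f"{job}#{i}"
            # The first instance has no predecessor; every other instance
            # waits for the one created immediately before it.
            dag[instance] = {previous} if previous else set()
            previous = instance
    return dag


if __name__ == "__main__":
    for instance, predecessors in unroll_loop(["blast", "check"], 2).items():
        print(instance, "<-", sorted(predecessors))
```

Although the job `blast` depends on `check` and vice versa in the program text, the instances `blast#1`, `check#1`, `blast#2`, and `check#2` form a simple chain.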

The executor thus incorporates the job submission aspects of the interpreter.
Additionally, it depends only on the much simpler DAG-based language and does not
have to deal with cyclic dependencies. Hence, to interface with different middleware
formalisms (e.g., SGE), it is only necessary to implement a version of the interpreter
that can submit jobs as DAGs to the respective scheduling mechanism (SGE in this
case). Extensions of, or changes to, the programming language, in contrast, only
require extensions to the builder, as the DAG-based intermediate language remains
unchanged.
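To illustrate how little an SGE-specific executor needs to know, the sketch below turns a DAG into a list of `qsub` command lines, using SGE's `-N` option to name each job and `-hold_jid` to hold it until its predecessors have finished. It only builds the commands and does not submit anything. The function name, the DAG representation, and the script names are assumptions for the example, not part of GEL; it uses the standard `graphlib` module (Python 3.9 or later).

```python
from graphlib import TopologicalSorter


def sge_submit_commands(scripts, deps):
    """Return one qsub argv list per job instance, in a valid submission order.

    scripts: dict mapping a (hypothetical) job-instance name to its script.
    deps: dict mapping a job-instance name to the set of names it depends on.
    """
    commands = []
    for job in TopologicalSorter(deps).static_order():
        argv = ["qsub", "-N", job]
        if deps.get(job):
            # -hold_jid keeps the job queued until the listed jobs have finished.
            argv += ["-hold_jid", ",".join(sorted(deps[job]))]
        argv.append(scripts[job])
        commands.append(argv)
    return commands


if __name__ == "__main__":
    scripts = {"align": "align.sh", "merge": "merge.sh"}
    deps = {"align": set(), "merge": {"align"}}
    for argv in sge_submit_commands(scripts, deps):
        print(" ".join(argv))
```

Because the function consumes only job names and dependency sets, nothing in it depends on the syntax of the source language, which is the separation the text describes.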

23.5.3.1 DAG Builder

DAGs are a commonly used data structure that is understood by most scheduling
mechanisms. The GEL interpreter takes advantage of this ubiquity of DAGs. This is
achieved by the builder by translating the
