according to their inherent difficulty, with particular attention to their behavior as the size of the input grows (scalability). Therefore, the only way to improve scalability is to develop a better algorithm for the problem at hand. This is often not possible, either because we are already using an optimal algorithm or because the problem at hand cannot be solved in an efficient way (i.e., its computational complexity is believed to be more than polynomial in the size of the input in the worst case). Such problems are known as NP-hard and are common in the theory of Bayesian networks, both for structure learning (Chickering 1996) and inference (Cooper 1990).
On the other hand, the software architecture can make a significant difference
in the overall performance of the program. For instance, the choice of data structures can significantly improve or degrade the computational complexity of an algorithm. Tailoring the implementation to the specific hardware
it will run on can also result in noticeable speedups.
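The impact of data structures on computational complexity can be made concrete with a small sketch. The excerpt names no programming language, so the example below is in Python, and all names in it are illustrative rather than taken from the text: the same membership query runs in linear time against a list but in expected constant time against a hash-based set.

```python
# Hypothetical illustration: the same task with two data structures.
def count_common_list(items, queries):
    # list membership is a linear scan: O(len(items)) per query
    return sum(1 for q in queries if q in items)

def count_common_set(items, queries):
    # set membership is a hash lookup: O(1) expected per query
    item_set = set(items)
    return sum(1 for q in queries if q in item_set)

items = list(range(10_000))
queries = list(range(5_000, 15_000))
# both return 5000, but the set version does far less work per query
assert count_common_list(items, queries) == count_common_set(items, queries) == 5_000
```

The two functions compute the same result; only the cost per query changes, from O(n) to expected O(1), which is exactly the kind of improvement the paragraph above refers to.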
Parallel computing, defined as the execution of several calculations simultaneously, is an application of this last idea. It was originally introduced to overcome hardware limitations in terms of computational power; large problems were divided into smaller ones, which were then solved concurrently on multiprocessor supercomputers. Special-purpose hardware architectures were then developed to take advantage of such software, thus maximizing the impact of the raw performance of the hardware for properly implemented programs. A classification of these hardware architectures was proposed by Flynn (1972), according to the nature of the data and the operations they support:
Single-Instruction, Single-Data (SISD): a single processing unit performing a single operation on a single data stream
Multiple-Instruction, Single-Data (MISD): multiple processing units performing
different operations (independently and asynchronously) on the same data
Single-Instruction, Multiple-Data (SIMD): multiple processing units performing
the same operation on multiple data
Multiple-Instruction, Multiple-Data (MIMD): multiple processing units performing different operations on multiple data
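The difference between the SIMD and MIMD classes can be pictured as a programming-model analogy, sketched below in Python (the excerpt names no language). This is not real hardware parallelism, and the variable and function names are illustrative assumptions; the point is only how operations pair with data in each model.

```python
# Programming-model analogy for two of Flynn's classes (illustrative only).
data = [1.0, 2.0, 3.0, 4.0]

# SIMD-style: one operation applied uniformly across multiple data
simd_result = [x * 2.0 for x in data]          # [2.0, 4.0, 6.0, 8.0]

# MIMD-style: different operations applied to different data
operations = [lambda x: x + 1.0, lambda x: x * x,
              lambda x: -x, lambda x: x / 2.0]
mimd_result = [op(x) for op, x in zip(operations, data)]  # [2.0, 4.0, -3.0, 2.0]
```

In the SIMD case a single instruction stream is broadcast over the data; in the MIMD case each processing unit may follow its own instruction stream on its own data.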
Nearly all general-purpose modern computers are based on the MIMD model; notable exceptions are those that offload part of their workload to graphics processing units (GPUs), as the latter are based on a SIMD model. All modern processors have more than one core, and each core supports multiple concurrent execution threads. Furthermore, in the last few years the speed of processors has peaked because some of the physical constraints mentioned above are preventing further frequency scaling. At the same time, the number of cores present in each processor is still increasing, and the same holds for the number of threads supported by each core. As a result, modern computers are remarkably similar to the MIMD parallel architectures from the 1970s and 1980s described by Flynn, albeit with vastly different capabilities.
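A minimal sketch of dispatching independent units of work across such a multicore machine, using Python's standard library (the excerpt names no language, and `score_candidate` is a hypothetical stand-in for any independent computation, such as scoring one candidate network structure):

```python
from concurrent.futures import ThreadPoolExecutor

def score_candidate(seed):
    # stand-in for one independent unit of work (hypothetical example)
    total = 0
    for i in range(1, 1000):
        total += (seed * i) % 7
    return total

seeds = list(range(8))

# dispatch the eight tasks to a pool of worker threads
with ThreadPoolExecutor(max_workers=4) as pool:
    parallel = list(pool.map(score_candidate, seeds))

sequential = [score_candidate(s) for s in seeds]
assert parallel == sequential  # same results, work dispatched concurrently
```

Note that under CPython's global interpreter lock, pure-Python threads interleave rather than run truly in parallel; for CPU-bound work a process pool or native extensions are typically used instead, but the dispatching pattern is the same.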
This evolution has sparked a renewed interest in parallel computing. Recent research efforts have focused on two major areas: the parallelization of existing algorithms and the development of new algorithms and software libraries explicitly designed