> system.time(gs(hailfinder))
   user  system elapsed
  4.000   0.004   4.004
There are three reasons for this disparity. First, the parallel implementation cannot
take advantage of the symmetry of the Markov blankets and the neighborhoods to
reduce the number of tests. All Markov blankets are learned simultaneously, so the
same conditional independence test will likely be performed multiple times. The
same holds for the neighborhoods. As a result, gs performs more than twice as
many tests overall:
> ntests(gs(hailfinder, optimized = TRUE))
[1] 2670
> ntests(gs(hailfinder, optimized = FALSE))
[1] 6467
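To see why learning all the Markov blankets simultaneously duplicates work, note that the test of X against Y is needed both when learning the blanket of X and when learning the blanket of Y; the sequential implementation reuses the first result, while the parallel one cannot. A minimal Python sketch of this effect (the driver and the stand-in test are hypothetical, not part of bnlearn):

```python
def learn_blankets(variables, ci_test, use_cache):
    """Toy driver: 'test' every ordered pair of variables.

    With use_cache=True the symmetric test (Y, X) reuses the (X, Y)
    result, as the optimized sequential implementation does; with
    use_cache=False every blanket repeats it, as the parallel
    implementation must.
    """
    cache = {}
    performed = 0
    for x in variables:
        for y in variables:
            if x == y:
                continue
            key = frozenset((x, y))
            if use_cache and key in cache:
                continue  # symmetry: result already known, skip the test
            cache[key] = ci_test(x, y)
            performed += 1
    return performed

# A stand-in test that always reports independence.
dummy_test = lambda x, y: True
n_opt = learn_blankets("ABCDE", dummy_test, use_cache=True)
n_raw = learn_blankets("ABCDE", dummy_test, use_cache=False)
print(n_opt, n_raw)  # 10 unique tests vs 20: a factor of two
```

The real ratio also depends on the conditioning sets each blanket requires, which is why the observed counts (2,670 vs 6,467) differ by somewhat more than a factor of two.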
Second, tests are almost never split in an optimal way among the slave processes.
This can be seen quite clearly from the examples illustrated in this section: with 4
slaves, the number of tests assigned to each of them ranges from 1,116 (17.25 % of
the total) to 1,905 (29.45 % of the total). This variability introduces additional
overhead in the algorithm, because slaves that have fewer tests to perform must wait
for the other slaves each time the status of the cluster is synchronized.
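Because the cluster can only advance at the pace of its busiest slave at each synchronization point, this imbalance caps the achievable speedup. A back-of-the-envelope check in Python, assuming for simplicity that every test has the same cost:

```python
total_tests = 6467   # tests performed by the unoptimized (parallel) run
slaves = 4
busiest = 1905       # largest per-slave share reported above

ideal = total_tests / slaves    # perfectly balanced share per slave
efficiency = ideal / busiest    # fraction of the ideal speedup retained
print(round(ideal, 2), round(efficiency, 2))  # 1616.75 0.85
```

Under this (idealized) uniform-cost assumption, load imbalance alone limits the 4-slave speedup to about 85 % of the ideal, before accounting for the duplicated tests and communication costs.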
Third, passing data back and forth between the master and the slaves also takes
some time. The efficiency of such an operation depends on the operating system
and the hardware the cluster is running on, so it must be evaluated on a case-by-case
basis.
5.3.2 Score-Based Structure Learning Algorithms
Score-based learning algorithms benefit from several decades of research efforts
aimed at taking advantage of parallel computing in optimization heuristics.
Most score-based algorithms are inherently sequential in nature. Consider for
example hill-climbing. At each iteration, the state of the previous iteration is used
as the starting point for the search of a new, better network structure. This is also
true for tabu search and genetic algorithms and makes the parallel implementation
of these algorithms a challenging problem.
A possible solution is to provide a parallel implementation of the computations
performed within a single iteration and to let the master process execute the
iterations sequentially, synchronizing the status of the slaves each time. This
reduces a sequential problem to a fine-grained parallel one; it is known as the
move acceleration model if each slave computes part of the score of each candidate
network, or as the parallel moves model if each slave manages some of the
candidate networks. However, the resulting performance gain is likely to be
outweighed by the overhead of the communication between the master and the slave
processes.
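To make the parallel moves model concrete, here is a minimal sketch in Python rather than R: the master executes hill-climbing iterations sequentially, but within each iteration the candidate networks are scored by a pool of workers, which are synchronized before the best move is applied. The score function and search space are toy stand-ins for illustration, not bnlearn's implementation, and acyclicity checks are omitted.

```python
from concurrent.futures import ThreadPoolExecutor

def score(dag):
    """Toy network score, a hypothetical stand-in for BIC/BDe.

    Rewards the arcs ('A','B') and ('B','C'), so hill-climbing
    should recover exactly that structure.
    """
    target = {("A", "B"), ("B", "C")}
    return len(dag & target) - len(dag - target)

def candidate_moves(dag, nodes):
    """All single-arc additions/deletions (cycles ignored in this toy)."""
    for x in nodes:
        for y in nodes:
            if x != y:
                arc = (x, y)
                yield dag - {arc} if arc in dag else dag | {arc}

def hill_climb(nodes, workers=4):
    dag, current = frozenset(), score(frozenset())
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while True:
            candidates = list(candidate_moves(dag, nodes))
            # Parallel moves model: the workers score the candidate
            # networks; the master synchronizes and picks the best.
            scores = list(pool.map(score, candidates))
            best = max(range(len(candidates)), key=scores.__getitem__)
            if scores[best] <= current:
                return dag  # no move improves the score: local optimum
            dag, current = candidates[best], scores[best]

print(sorted(hill_climb(["A", "B", "C"])))
```

In the move acceleration model, by contrast, the inner `pool.map` would split the computation of a single candidate's score (e.g., one local score term per worker) instead of splitting the candidates themselves; the synchronization pattern, and therefore the communication overhead, is the same.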