In this section, we present a parallel version of the σ-classifier that significantly reduces the time to complete a classification analysis by efficiently distributing the computational work to many processors.
11.3.1 Design of σ-Classifiers and Feature Selection
When designing classifiers, we often do not know which features are required. There-
fore, the selection of good features is important in addition to the design of specific
classifiers. A classifier design method should provide a reasonable estimation of error
for each classifier relative to other classifiers, to help find the desired features. If
the number of samples available for analysis is very limited, then error estimation
for the classifiers becomes difficult. To alleviate these problems, the
σ-classifier is
designed from a probability distribution resulting from spreading the mass of the
sample points via a circular distribution to make classification more difficult, while
maintaining sample geometry. The algorithm is parameterized by the variance of the
circular distribution. By considering increasing variances, the algorithm finds feature
sets whose classification accuracy remains strong relative to greater spreading of the
sample. The error then gives a measure of the strength of the feature set as a function
of the variance. The σ-classifier designs classifiers and estimates errors analytically
to minimize the computational load. This property is crucial because of the immense
size of the feature space that will be searched.
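The spreading idea can be illustrated with a linear decision boundary: if each sample point's mass is spread by a circular (isotropic) Gaussian, the mass falling on the wrong side of the boundary has a closed form, so the error can be computed analytically rather than by sampling. The function below is only a sketch; the names, the restriction to two dimensions and a linear classifier, and the ±1 label convention are illustrative assumptions, not the chapter's actual algorithm.

```python
import math

def sigma_error(points, labels, w, b, sigma):
    """Analytic error estimate for a linear classifier w.x + b = 0
    when each sample point's mass is spread by a circular Gaussian
    of standard deviation sigma (illustrative sketch).

    points: list of 2-D sample points; labels: +1 or -1 per point.
    """
    norm = math.hypot(*w)
    total = 0.0
    for x, y in zip(points, labels):
        # Signed distance from the boundary, oriented so that a
        # positive value means the point lies on its correct side.
        d = y * (w[0] * x[0] + w[1] * x[1] + b) / norm
        # Mass of the circular Gaussian centered at the point that
        # falls on the wrong side of the boundary: Phi(-d / sigma).
        total += 0.5 * math.erfc(d / (sigma * math.sqrt(2.0)))
    return total / len(points)
```

As the variance grows, more of each point's mass spills across the boundary, so the estimated error rises, which matches the section's use of increasing variances to gauge the strength of a feature set.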
An exhaustive search of combinatorial space results in the best feature sets for a σ-classifier. This approach has been successfully applied to a few sets of microarray data of reasonable size containing a few thousand genes. Even though the σ-classifier algorithm is designed for this type of search, as the number of features increases,
the computational load increases significantly, often becoming computationally prohibitive. If n is the total number of features and k is the number of features in a classifier, then there are M = C(n, k) = n!/(k!(n − k)!) classifiers to design and M σ-errors to estimate. Even
with reasonably sized data, n being larger than a few thousand, M may be so large
that it is not feasible to perform an analysis on a single CPU. Therefore, parallel
processing becomes inevitable.
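To make the growth of the search space concrete, M can be evaluated directly with Python's math.comb; the choice of k = 3 and the sample values of n below are illustrative:

```python
from math import comb

# Number of k-feature classifiers to design for n total features,
# with k = 3 features per classifier.
for n in (100, 1000, 2345, 10000):
    print(n, comb(n, 3))
```

Already at n = 2345 the count is over two billion, which is why the per-classifier index soon overflows standard integer types, as discussed below.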
11.3.2 Parallel Implementation of the σ-Classifier
When designing a parallel implementation, distributing the computational work evenly among the processors minimizes the time some processors sit idle while others are still working. The most efficient way to distribute the work
would be to take the total number of classifiers and divide them equally among the
processors. However, the number of classifiers quickly exceeds the largest signed 32-bit integer: with three features per classifier, only 2345 features would be possible. Using unsigned 32-bit integers does not solve the problem, as only 2954 features would be possible. Although using 64-bit integers is an option, we preferred a simple, sub-optimal method for distributing the work, one that does not depend on such system constraints and proved surprisingly efficient for our study.
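One simple scheme in this spirit (an illustrative assumption, not necessarily the exact method used in the study) interleaves the values of the first feature index across processors; each processor enumerates only its own triples and never needs the total count M as an integer:

```python
from itertools import combinations

def my_classifiers(n, rank, nprocs):
    """Yield the 3-feature index triples handled by processor `rank`
    out of `nprocs` processors, interleaving on the first index so
    the load is roughly balanced without ever computing the total
    number of classifiers M."""
    # First index i can range over 0 .. n-3; each processor takes
    # every nprocs-th value starting at its own rank.
    for i in range(rank, n - 2, nprocs):
        for j, l in combinations(range(i + 1, n), 2):
            yield (i, j, l)
```

Because the amount of work per first index shrinks as the index grows, interleaving the indices (rather than splitting their range into contiguous blocks) keeps the per-processor load roughly balanced.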