Biomedical Engineering Reference
Nonrigid Stage
While the primary effort is focused on improving intensity feature matching
performance, simple measures can also improve the performance of reading images
from disk, grayscale conversion, and intensity feature extraction.
Given the large size of microscope images, some in excess of 10 GB, reading
from disk and decoding can require a considerable amount of time. A parallel
file system may be employed to reduce this time, although this requires distribut-
ing large amounts of data over a network and can complicate later steps since
the data will be distributed among several nodes rather than a single head node.
A portion of the time spent reading and decoding can be hidden, however, by
overlapping it with grayscale conversion: the head node reads and decodes
incrementally and uses asynchronous communication to hand the grayscale
conversion of each incremental read to the worker nodes.
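The overlap described above can be sketched in a few lines of Python. This is
only an illustration under stated assumptions: a thread pool stands in for the
worker nodes and their asynchronous communication, the in-memory list stands in
for a file being read and decoded strip by strip, and the grayscale weights
(Rec. 601) and all function names are hypothetical choices, not taken from the
text.

```python
from concurrent.futures import ThreadPoolExecutor

def read_strips(image, strip_rows):
    """Stand-in for incremental read/decode: yield successive strips of rows."""
    for start in range(0, len(image), strip_rows):
        yield image[start:start + strip_rows]

def to_gray(strip):
    """Convert a strip of (r, g, b) pixels to grayscale (assumed Rec. 601 weights)."""
    return [[0.299 * r + 0.587 * g + 0.114 * b for (r, g, b) in row]
            for row in strip]

def load_grayscale(image, strip_rows=2, workers=4):
    """Read strips on the calling thread and hand conversion to a worker pool.

    submit() returns immediately, so the next strip is read while earlier
    strips are still being converted.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        pending = [pool.submit(to_gray, strip)
                   for strip in read_strips(image, strip_rows)]
        gray = []
        for future in pending:  # futures were submitted in row order
            gray.extend(future.result())
    return gray

rgb = [[(x, y, 0) for x in range(4)] for y in range(5)]  # toy 5 x 4 image
gray = load_grayscale(rgb)
```

In a cluster setting the same pattern would presumably use nonblocking sends
from the head node rather than a thread pool, but the principle is the same:
the reader never waits for a conversion to finish before fetching the next
strip.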
With the grayscale base and float images in memory, the next step is to deter-
mine which template regions will serve as candidates for intensity feature match-
ing. The process is simple: the head node divides the base image among the worker
nodes, which compute the variances of the W1 × W1 template-sized tiles of their
portions and return the results.
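As a rough sketch of the variance step, the following Python computes the
variance of every complete W1 × W1 tile of a grayscale image. Each worker would
run something like tile_variances on its own portion. The function names are
hypothetical, and the thresholding rule in select_candidates is an assumption
made for illustration; the text does not say how candidates are chosen from
the variances.

```python
from statistics import pvariance

def tile_variances(gray, w1):
    """Return {(row, col): variance} for each complete w1 x w1 tile.

    Tiles are keyed by the coordinates of their top-left pixel.
    """
    rows, cols = len(gray), len(gray[0])
    variances = {}
    for r in range(0, rows - w1 + 1, w1):
        for c in range(0, cols - w1 + 1, w1):
            pixels = [gray[r + i][c + j] for i in range(w1) for j in range(w1)]
            variances[(r, c)] = pvariance(pixels)
    return variances

def select_candidates(variances, threshold):
    """Assumed rule: keep tiles whose variance exceeds a threshold."""
    return sorted(key for key, var in variances.items() if var > threshold)

# Toy 4 x 4 image: only the top-right 2 x 2 tile has any texture.
gray = [[5, 5, 0, 9],
        [5, 5, 9, 0],
        [1, 1, 2, 2],
        [1, 1, 2, 2]]
variances = tile_variances(gray, 2)
candidates = select_candidates(variances, threshold=1.0)
```

Flat tiles have zero variance and offer nothing distinctive to correlate
against, which is why variance is a natural first filter for template regions.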
With a set of candidate intensity feature regions identified, what remains is
to rotate them, extract their templates, and perform the correlations between the
templates and their corresponding search areas. The candidate features are evenly
divided among the worker nodes, which rotate them, extract their templates, and
perform the correlations between each template and its search area, returning
the maximum correlation magnitudes and their coordinates. The base image is
stored in column-major
format, so to keep communication to a minimum the candidate feature regions are
buffered in order and the remainder of the image is discarded. Asynchronous com-
munication is used to keep the head node busy while send operations post. The
search windows, taken from the float image, are handled in a similar manner.
However, since the search windows for distinct features can overlap significantly,
they are not buffered individually; rather, their union is buffered as a whole.
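To illustrate why buffering the union pays off, the sketch below merges
overlapping search windows into column-major runs so that each pixel is
buffered once. The window representation (row0, col0, height, width) and the
function name are assumptions made for this example, not details from the text.

```python
def union_runs(windows):
    """Merge overlapping windows into column-major runs.

    Each window is (row0, col0, height, width). Returns a list of
    (col, row_start, row_stop) half-open runs, sorted by column and then
    row, covering every pixel that lies in at least one window exactly once.
    """
    by_col = {}
    for row0, col0, height, width in windows:
        for col in range(col0, col0 + width):
            by_col.setdefault(col, []).append((row0, row0 + height))
    runs = []
    for col in sorted(by_col):
        merged = []
        for start, stop in sorted(by_col[col]):
            if merged and start <= merged[-1][1]:
                # Overlaps or touches the previous run: extend it.
                merged[-1][1] = max(merged[-1][1], stop)
            else:
                merged.append([start, stop])
        runs.extend((col, start, stop) for start, stop in merged)
    return runs

# Two 4 x 4 search windows that share a 2 x 2 corner.
windows = [(0, 0, 4, 4), (2, 2, 4, 4)]
runs = union_runs(windows)
separate = sum(h * w for _, _, h, w in windows)       # pixels if buffered one by one
shared = sum(stop - start for _, start, stop in runs)  # pixels in the union
```

Here the two windows hold 32 pixels when buffered separately but only 28 as a
union. With real search windows that overlap heavily, the saving in
communication would be correspondingly larger.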
The division of work in a single-node implementation of the nonrigid stage
follows a strategy similar to that of the multiple-node implementation, except
that no effort is made to overlap reading/decoding with computation. In the case
where GPU acceleration is used, intensity feature extraction proceeds
sequentially, and as candidate
features are identified they are passed to the GPU. This process is described in
further detail in Section 8.4.3.
The discrete Fourier transforms necessary for calculating correlations on the
CPU are performed using the FFT library FFTW [46]. The 2D-DFT dimensions are
critical for performance; ideally the size of the padded transform,
W1 + W2 - 1, is a power of two or a product of small primes. For cases where
this size rule cannot be obeyed, FFTW provides a simple mechanism called a plan,
which specifies an optimized order of execution for the transform. The plan is
precomputed and subsequently reused, so its cost is paid only once. For example,
with a template size W1 = 350 and a search window size W2 = 700, FFTW takes
around 0.7 second to compute the two 1,049 × 1,049 forward transforms without
planning, whereas with a plan the computation takes only 0.32 second, with a
6-second one-time penalty.
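The arithmetic behind this example can be checked with a short Python sketch.
It computes the padded size, factors it to see whether it fits the size rule,
and estimates how many executions are needed before the planning cost is
recovered. The timings are the figures quoted above; the function names are
hypothetical, and one "execution" is taken to mean computing the pair of
forward transforms.

```python
import math

def padded_size(w1, w2):
    """Size of the padded transform for a w1 template and a w2 search window."""
    return w1 + w2 - 1

def prime_factors(n):
    """Return the prime factorization of n as a sorted list (trial division)."""
    factors = []
    p = 2
    while p * p <= n:
        while n % p == 0:
            factors.append(p)
            n //= p
        p += 1
    if n > 1:
        factors.append(n)
    return factors

def break_even(t_unplanned, t_planned, plan_cost):
    """Executions after which the one-time planning cost has paid for itself."""
    return math.ceil(plan_cost / (t_unplanned - t_planned))

n = padded_size(350, 700)                  # 1049
factors = prime_factors(n)                 # 1049 has no smaller factors
runs_needed = break_even(0.7, 0.32, 6.0)   # 6 / 0.38, rounded up
```

The padded size 1,049 is itself a prime, and not a small one, so it cannot
follow the size rule. At a saving of 0.38 second per execution, the 6-second
planning cost is recovered after 16 executions, which is negligible when many
candidate features are correlated.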