function over all data points can be split into partial sums and then the results
can be aggregated:

\[
\sum_{i=1}^{m} g_\theta(x_i, y_i) \;=\; \sum_{i=1}^{1000} g_\theta(x_i, y_i) \;+\; \sum_{i=1001}^{2000} g_\theta(x_i, y_i) \;+\; \dots \;+\; \sum_{i=m-999}^{m} g_\theta(x_i, y_i).
\]
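The decomposition above can be sketched in a few lines of Python. This is a minimal illustration, not the BIGS implementation: the per-point function g, the parameter theta, and the partition size are placeholders, and threads stand in for remote computing nodes.

```python
from concurrent.futures import ThreadPoolExecutor

# Illustrative per-point term; any g_theta(x_i, y_i) would do.
def g(theta, x, y):
    return (theta * x - y) ** 2

def partial_sum(theta, chunk):
    # One computing node would evaluate this over its data partition.
    return sum(g(theta, x, y) for x, y in chunk)

def distributed_sum(theta, data, partition_size=1000):
    # Split the data into fixed-size partitions ...
    chunks = [data[i:i + partition_size]
              for i in range(0, len(data), partition_size)]
    # ... evaluate each partial sum (threads stand in for remote nodes) ...
    with ThreadPoolExecutor() as pool:
        partials = pool.map(lambda c: partial_sum(theta, c), chunks)
    # ... and aggregate the partial results on a designated node.
    return sum(partials)
```

The aggregation step is associative, which is what allows the partitions to be summed in any order and on any node.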
Each partial sum can then be executed on a different computing node, and a
designated node performs the final aggregation of the partial sums. However, this
poses an additional problem: each partition of the data has to be made available
to the computing node that will compute its partial sum. Furthermore,
each iteration over the data requires the global sum to be recomputed, and at each
iteration a computing node may be assigned a different partition than
in previous iterations. This can generate huge amounts of network traffic within the
computing cluster, especially if, as is typically the case, there is a central
storage system shared by all computing nodes.
This is the context in which caching becomes valuable. However, as shown in
this work, caching alone only partially solves the problem; we also need
caching strategies that encourage computing nodes to reuse cached data as much
as possible across all iterations.
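One such strategy can be as simple as having each worker prefer, among the pending partitions, one it has already cached. The sketch below is illustrative only; the function name and data structures are hypothetical and not part of the BIGS API.

```python
def choose_partition(pending, cached):
    """Pick a partition for this worker to process.

    pending: ordered collection of partition ids still to be summed
    cached:  set of partition ids already in this worker's cache

    Prefers a partition already cached locally (avoiding a fetch from
    central storage); otherwise falls back to any pending partition.
    """
    for pid in pending:
        if pid in cached:
            return pid
    return next(iter(pending), None)
```

Under this policy, repeated iterations over the same data tend to assign each partition to the same worker, so network traffic to the shared storage drops after the first pass.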
2.1 BIGS
The Big Image Data Analysis Toolkit (BIGS) was developed by our research
group to enable distributed image processing workflows over heterogeneous com-
puting infrastructures, including computer clusters and cloud resources, but also
desktop computers in our lab and servers that become available sporadically in
an unplanned manner. BIGS promotes opportunistic, data-locality-aware computing through
1. a data partition iterative programming model supporting the parallelization
scheme described in the previous section,
2. users assembling image processing jobs by pipelining machine learning algo-
rithms over streams of data,
3. BIGS workers, software agents deployed over the actual distributed com-
puting resources and in charge of resolving the computing load,
4. a NoSQL storage model with a reference NoSQL central database,
5. removing the need of a central control node so that workers contain the logic
to coordinate their work through the reference NoSQL database,
6. a simple and opportunistic deployment model for workers, requiring only con-
nectivity to the reference NoSQL database,
7. redundant data replication throughout workers,
8. two-level data caching in workers, in memory and on disk,
9. a set of strategies for worker data access, so that users can enforce data-
locality-aware computing or only-in-memory computing,
10. a set of APIs through which BIGS can be extended with new algorithms,
storage back ends, and data import modules.

More information can be found at http://www.3igs.org. Prototype releases of
BIGS are described in [10,9].
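Points 4 and 5 above imply that workers coordinate by reading and writing shared state rather than by obeying a central control node. A minimal sketch of that idea follows; a locked dictionary stands in for the reference NoSQL database, and all class and method names are hypothetical, not the BIGS API.

```python
import threading

class TaskBoard:
    """Shared job state through which workers coordinate themselves.

    In BIGS this role is played by the reference NoSQL database; here a
    dict guarded by a lock emulates its atomic updates.
    """

    def __init__(self, tasks):
        self._lock = threading.Lock()
        self._status = {t: "pending" for t in tasks}

    def claim(self, worker_id):
        # Atomically mark one pending task as owned by this worker,
        # so no central scheduler is needed to hand out work.
        with self._lock:
            for task, state in self._status.items():
                if state == "pending":
                    self._status[task] = worker_id
                    return task
        return None
```

Because each claim is an atomic update on the shared store, any number of workers can join or leave at any time, which is what makes the opportunistic deployment model of point 6 possible.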