Database Reference
In-Depth Information
1.
combine2
(
m
i,j
,
v
j
) =
m
i,j
×
v
j
.
2.
combineAlli()
i
(
x
1
,…,
x
n
) = min{
x
j
|
j
= 1..
n
}.
3.
assign
(
v
i
,
v
new
) = min(
v
i
,
v
new
).
By repeating this process, component ids of nodes in a component are set to
the minimum node id of the component. We iteratively do the multiplication until
component ids converge. The upper bound of the number of iterations in Hcc is
d
,
where
d
is the diameter of the graph. We notice that because of the small-world
phenomenon, see Section 8.2, the diameter of real graphs is small, and therefore, in
practice, Hcc completes after a small number of iterations. For a recent work with
better practical performance, see [45].
8.4 HADOOP IMPLEMENTATION
Given the main goal of the PEGASUS project is to provide an efficient system to
the user, we discuss different Hadoop implementation approaches, starting out with
a naive implementation and progressing to faster methods for
GIM-V
. The proposed
versions are evaluated in Section 8.5.
8.4.1
GIM-V
base: n
aive
m
ultiPliCation
GIM-V
BASE is a two-stage algorithm whose pseudocode is in Algorithms 8.1 and
8.2. The inputs are an edge file and a vector file. Each line of the edge file has
the form (
id
src
,
id
dst
,
mval
), which corresponds to a nonzero entry in the adjacency
matrix. Similarly, each line of the vector file has the form (
id
,
vval
), which cor-
responds to an element in vector
v
.
Stage1
performs the
combine2
operation
by combining columns of matrix (
id
dst
of
M
) with rows of the vector (
id
of
V
). The
output of
Stage1
are (key, value) pairs where the key is the source vertex id of the
matrix (
id
src
of
M
), and the value is the partially combined result (
combine2
(
mval
,
vval
)). This output of
Stage1
becomes the input of
Stage2
.
Stage2
combines
all partial results from
Stage1
and updates the vector. The
combineAlli()
i
() and
assign
() operations are done in line 15 of
Stage2
, where the “self” and “oth-
ers” tags in lines 15 and 21 of
Stage1
are needed by
Stage2
to distinguish cases
appropriately. We note that in Algorithm 8.4 and 8.5, Output(
k
,
v
) means to output
data with the key
k
and the value.
8.4.2
GIM-V
bl: bl:
loCk
m
ultiPliCation
GIM-V
BL is a fast algorithm for
GIM-V
, which is based on block multiplication.
The main idea is to group elements of the input matrix into blocks/submatrices of
size
b
by
b
. Also, we group elements of input vectors into blocks of length
b
. In
practice, grouping means we place all elements of a group into one line of input file.
Each block contains only nonzero elements of the matrix/vector. The format of a
matrix block with
k
nonzero elements
Search WWH ::
Custom Search