PEGASUS - Large Scale and Big Data: Processing and Management

Database Reference

In-Depth Information

1. combine2 ( m i,j , v j ) = m i,j × v j .

2. combineAlli() i ( x 1 ,…, x n ) = min{ x j | j = 1.. n }.

3. assign ( v i , v new ) = min( v i , v new ).

By repeating this process, component ids of nodes in a component are set to

the minimum node id of the component. We iteratively do the multiplication until

component ids converge. The upper bound of the number of iterations in Hcc is d ,

where d is the diameter of the graph. We notice that because of the small-world

phenomenon, see Section 8.2, the diameter of real graphs is small, and therefore, in

practice, Hcc completes after a small number of iterations. For a recent work with

better practical performance, see [45].

8.4 HADOOP IMPLEMENTATION

Given the main goal of the PEGASUS project is to provide an efficient system to

the user, we discuss different Hadoop implementation approaches, starting out with

a naive implementation and progressing to faster methods for GIM-V . The proposed

versions are evaluated in Section 8.5.

8.4.1 GIM-V base: n aive m ultiPliCation

GIM-V BASE is a two-stage algorithm whose pseudocode is in Algorithms 8.1 and

8.2. The inputs are an edge file and a vector file. Each line of the edge file has

the form ( id src , id dst , mval ), which corresponds to a nonzero entry in the adjacency

matrix. Similarly, each line of the vector file has the form ( id , vval ), which cor-

responds to an element in vector v . Stage1 performs the combine2 operation

by combining columns of matrix ( id dst of M ) with rows of the vector ( id of V ). The

output of Stage1 are (key, value) pairs where the key is the source vertex id of the

matrix ( id src of M ), and the value is the partially combined result ( combine2 ( mval ,

vval )). This output of Stage1 becomes the input of Stage2 . Stage2 combines

all partial results from Stage1 and updates the vector. The combineAlli() i () and

assign () operations are done in line 15 of Stage2 , where the “self” and “oth-

ers” tags in lines 15 and 21 of Stage1 are needed by Stage2 to distinguish cases

appropriately. We note that in Algorithm 8.4 and 8.5, Output( k , v ) means to output

data with the key k and the value.

8.4.2 GIM-V bl: bl: loCk m ultiPliCation

GIM-V BL is a fast algorithm for GIM-V , which is based on block multiplication.

The main idea is to group elements of the input matrix into blocks/submatrices of

size b by b . Also, we group elements of input vectors into blocks of length b . In

practice, grouping means we place all elements of a group into one line of input file.

Each block contains only nonzero elements of the matrix/vector. The format of a

matrix block with k nonzero elements

Search WWH ::

Custom Search

Home