PEGASUS - Large Scale and Big Data: Processing and Management

Database Reference

In-Depth Information

Therefore, we instead propose the following heuristic for finding the center node: we

choose the center vertex by sampling from the high-degree vertices. This heuristic

is based on the fact that vertices with large degree have small radii [28]. After find-

ing a center node, we need to renumber the edge file to swap the current minimum

vertex id with the center vertex id. The MapReduce algorithm for this renumbering

is shown in Algorithm 8.6. Since the renumbering requires only filtering, it can be

done with a map-only job.

8.4.6 a nalysis

Finally, we analyze the time and space complexity of GIM-V . It is not hard to observe

that one iteration of GIM-V takes O

nm

M

+

nm

M

+

log

time, where M stands for the

number of machines. Assuming uniformity, mappers and reducers of Stage1 and

Stage2 receive O

nm

M

+

records per machine. The running time is dominated by

the sorting time for nm

M

+

records. GIM-V requires O( n + m ) space.

8.5 SCALABILITY

We perform experiments to answer the following questions:

•

How does GIM-V scale up?

•

Which of the proposed optimizations (block multiplication, clustered edges,

and diagonal block iteration, vertex renumbering) gives the highest perfor-

mance gains?

The graphs we use in our experiments are shown in Table 8.2. We run PEGASUS

in M45 Hadoop cluster by Yahoo! and our own cluster composed of 9 machines. M45

is one of the top 50 supercomputers in the world with the total 1.5-PB storage and

3.5-TB memory. For the performance and scalability experiments, we used synthetic

Kronecker graphs [31] since we can generate them with any size, and they are one of

the most realistic graphs among synthetic graphs.

8.5.1 r esults

We first show how the performance of our method changes as we add more machines.

Figure 8.5 shows the running time and performance of GIM-V for PageRank with

Kronecker graph of 282 million edges and size 32 blocks if necessary.

In Figure 8.5a, for all of the methods the running time decreases as we add more

machines. Note that clustered edges ( GIM-V CL) did not help performance unless it

is combined with block encoding. When it is combined, however, it showed the best

performance ( GIM-V BL- CL).

In Figure 8.5b, we see that the relative performance of each method compared

with GIM-V BASE method decreases as number of machines increases. With three

Large Scale and Big Data: Processing and Management

Search WWH ::

Custom Search

Home