Link Analysis - Mining of Massive Datasets

Database Reference

In-Depth Information

will already be in main memory. Thus, most terms will require that a page be brought into

main memory to add it to the proper component. That situation, called thrashing , takes or-

ders of magnitude too much time to be feasible.

An alternative strategy is based on partitioning the matrix into k 2 blocks, while the vec-

tors are still partitioned into k stripes. A picture, showing the division for k = 4, is in Fig.

5.12 . Note that we have not shown the multiplication of the matrix by β or the addition of

(1− β ) e / n , because these steps are straightforward, regardless of the strategy we use.

Figure 5.12 Partitioning a matrix into square blocks

In this method, we use k 2 Map tasks. Each task gets one square of the matrix M , say M ij ,

and one stripe of the vector v , which must be v j . Notice that each stripe of the vector is sent

to k different Map tasks; v j is sent to the task handling M ij for each of the k possible values

of i . Thus, v is transmitted over the network k times. However, each piece of the matrix is

sent only once. Since the size of the matrix, properly encoded as described in Section 5.2.1 ,

can be expected to be several times the size of the vector, the transmission cost is not too

much greater than the minimum possible. And because we are doing considerable combin-

ing at the Map tasks, we save as data is passed from the Map tasks to the Reduce tasks.

The advantage of this approach is that we can keep both the j th stripe of v and the i th

stripe of v ′ in main memory as we process M ij . Note that all terms generated from M ij and

v j contribute to and no other stripe of v ′.

5.2.4

Representing Blocks of the Transition Matrix

Since we are representing transition matrices in the special way described in Section 5.2.1 ,

we need to consider how the blocks of Fig. 5.12 are represented. Unfortunately, the space

required for a column of blocks (a “stripe” as we called it earlier) is greater than the space

needed for the stripe as a whole, but not too much greater.

For each block, we need data about all those columns that have at least one nonzero entry

within the block. If k , the number of stripes in each dimension, is large, then most columns

will have nothing in most blocks of its stripe. For a given block, we not only have to list

those rows that have a nonzero entry for that column, but we must repeat the out-degree for

the node represented by the column. Consequently, it is possible that the out-degree will be

repeated as many times as the out-degree itself. That observation bounds from above the

Search WWH ::

Custom Search

Home