Database Reference
In-Depth Information
will already be in main memory. Thus, most terms will require that a page be brought into
main memory to add it to the proper component. That situation, called thrashing , takes or-
ders of magnitude too much time to be feasible.
An alternative strategy is based on partitioning the matrix into k 2 blocks, while the vec-
tors are still partitioned into k stripes. A picture, showing the division for k = 4, is in Fig.
5.12 . Note that we have not shown the multiplication of the matrix by β or the addition of
(1− β ) e / n , because these steps are straightforward, regardless of the strategy we use.
Figure 5.12 Partitioning a matrix into square blocks
In this method, we use k 2 Map tasks. Each task gets one square of the matrix M , say M ij ,
and one stripe of the vector v , which must be v j . Notice that each stripe of the vector is sent
to k different Map tasks; v j is sent to the task handling M ij for each of the k possible values
of i . Thus, v is transmitted over the network k times. However, each piece of the matrix is
sent only once. Since the size of the matrix, properly encoded as described in Section 5.2.1 ,
can be expected to be several times the size of the vector, the transmission cost is not too
much greater than the minimum possible. And because we are doing considerable combin-
ing at the Map tasks, we save as data is passed from the Map tasks to the Reduce tasks.
The advantage of this approach is that we can keep both the j th stripe of v and the i th
stripe of v ′ in main memory as we process M ij . Note that all terms generated from M ij and
v j contribute to and no other stripe of v ′.
5.2.4
Representing Blocks of the Transition Matrix
Since we are representing transition matrices in the special way described in Section 5.2.1 ,
we need to consider how the blocks of Fig. 5.12 are represented. Unfortunately, the space
required for a column of blocks (a “stripe” as we called it earlier) is greater than the space
needed for the stripe as a whole, but not too much greater.
For each block, we need data about all those columns that have at least one nonzero entry
within the block. If k , the number of stripes in each dimension, is large, then most columns
will have nothing in most blocks of its stripe. For a given block, we not only have to list
those rows that have a nonzero entry for that column, but we must repeat the out-degree for
the node represented by the column. Consequently, it is possible that the out-degree will be
repeated as many times as the out-degree itself. That observation bounds from above the
Search WWH ::




Custom Search