Link Analysis - Mining of Massive Datasets

Databases Reference

In-Depth Information

An alternative strategy is based on partitioning the matrix into k 2 blocks,

while the vectors are still partitioned into k stripes. A picture, showing the

division for k = 4, is in Fig. 5.12. Note that we have not shown the multiplica-

tion of the matrix by β or the addition of (1−β)e/n, because these steps are

straightforward, regardless of the strategy we use.

v

'

M

M 12

M

1

11

13

14

v

'

M

2

21

22

23

24

=

v

'

M

3

31

32

33

34

v

'

M

4

41

42

43

44

Figure 5.12: Partitioning a matrix into square blocks

In this method, we use k 2 Map tasks. Each task gets one square of the

matrix M , say M ij , and one stripe of the vector v, which must be v j . Notice

that each stripe of the vector is sent to k different Map tasks; v j is sent to the

task handling M ij for each of the k possible values of i. Thus, v is transmitted

over the network k times. However, each piece of the matrix is sent only once.

Since the size of the matrix, properly encoded as described in Section 5.2.1, can

be expected to be several times the size of the vector, the transmission cost is

not too much greater than the minimum possible. And because we are doing

considerable combining at the Map tasks, we save as data is passed from the

Map tasks to the Reduce tasks.

The advantage of this approach is that we can keep both the jth stripe of

v and the ith stripe of v ′ in main memory as we process M ij . Note that all

terms generated from M ij and v j contribute to v ′ i

and no other stripe of v ′ .

5.2.4

Representing Blocks of the Transition Matrix

Since we are representing transition matrices in the special way described in

Section 5.2.1, we need to consider how the blocks of Fig. 5.12 are represented.

Unfortunately, the space required for a column of blocks (a “stripe” as we called

it earlier) is greater than the space needed for the stripe as a whole, but not

too much greater.

For each block, we need data about all those columns that have at least one

nonzero entry within the block. If k, the number of stripes in each dimension,

is large, then most columns will have nothing in most blocks of its stripe. For

a given block, we not only have to list those rows that have a nonzero entry for

that column, but we must repeat the out-degree for the node represented by

the column. Consequently, it is possible that the out-degree will be repeated as

many times as the out-degree itself. That observation bounds from above the

Mining of Massive Datasets

Search WWH ::

Custom Search

Home