An important challenge in designing parallel FPM algorithms is thus finding ways to split both the input and
intermediary data across all processes in such a way that no process has more data
than it can fit in memory.
A second important challenge in designing a successful parallel algorithm is to
decompose the problem into a set of tasks, where each task represents a unit of work,
such that tasks are independent and can be executed concurrently, in parallel. Given these
independent tasks, one must devise a work partitioning, or static load balancing,
strategy to assign work to each process. A good work partitioning attempts to assign
equal amounts of work to all processes, such that all processes can finish their computation
at the same time. For example, given an n × n matrix, an n × 1 vector, and p processes,
a good work partitioning for the dense matrix-vector multiplication problem
would be to assign each process n/p elements of the output vector. This assignment
achieves the desired goal of equal loads for all processes. Unlike this problem, FPM
is composed of inherently irregular tasks. FPM tasks depend on the type and size
of objects in the database, as well as the chosen minimum support threshold σ. An
important challenge is then to correctly gauge the amount of time individual tasks
are likely to take in order to properly divide tasks among processes.
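To make the contrast with FPM's irregular tasks concrete, the regular matrix-vector example above can be partitioned statically as sketched below. The sketch is illustrative only and is not taken from this chapter: it assumes a dense row-major matrix, uses C++ threads in place of generic processes, and picks arbitrary values for n and p.

#include <algorithm>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Compute rows [row_begin, row_end) of y = A * x for a dense n x n matrix A
// stored in row-major order.
void matvec_block(const std::vector<double>& A, const std::vector<double>& x,
                  std::vector<double>& y, std::size_t n,
                  std::size_t row_begin, std::size_t row_end) {
    for (std::size_t i = row_begin; i < row_end; ++i) {
        double sum = 0.0;
        for (std::size_t j = 0; j < n; ++j)
            sum += A[i * n + j] * x[j];
        y[i] = sum;
    }
}

// Static work partitioning: each of the p threads owns a contiguous block of
// roughly n/p output elements, so all threads receive equal amounts of work.
void parallel_matvec(const std::vector<double>& A, const std::vector<double>& x,
                     std::vector<double>& y, std::size_t n, std::size_t p) {
    std::vector<std::thread> threads;
    std::size_t chunk = (n + p - 1) / p;
    for (std::size_t t = 0; t < p; ++t) {
        std::size_t begin = t * chunk;
        std::size_t end = std::min(n, begin + chunk);
        if (begin >= end) break;
        threads.emplace_back(matvec_block, std::cref(A), std::cref(x),
                             std::ref(y), n, begin, end);
    }
    for (auto& th : threads) th.join();
}

int main() {
    std::size_t n = 8, p = 4;
    std::vector<double> A(n * n, 1.0), x(n, 2.0), y(n, 0.0);
    parallel_matvec(A, x, y, n, p);   // each y[i] ends up equal to 16.0
}

Because every row costs the same amount of work, giving each thread a contiguous block of n/p rows yields the equal loads described above; no comparable cost formula exists for FPM tasks.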
A parallel application is only as fast as its slowest process. When the amount
of work assigned to a process cannot be correctly estimated, work partitioning can
lead to a load imbalance. Dynamic load balancing attempts to minimize the time
that processes are idle by actively distributing work among processes. Given their
irregular tasks, FPM algorithms are prime targets for dynamic load balancing. The
challenge of achieving good load balance becomes that of identifying points in the
algorithm execution when work can be re-balanced with little or no penalty.
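One common way to realize such re-balancing, sketched below under the assumption that the irregular units of work can be enumerated as a list of tasks, is a shared queue from which idle threads atomically claim the next task. The Task type, its simulated cost, and the thread count are invented for the illustration.

#include <atomic>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

struct Task { long cost; };                   // stand-in for one FPM unit of work

// Placeholder workload: a real task would mine one portion of the pattern space.
void run_task(const Task& t) {
    volatile long sink = 0;
    for (long i = 0; i < t.cost; ++i) sink = sink + i;
}

// Each worker repeatedly claims the next unprocessed task until none remain,
// so threads that draw cheap tasks simply claim more of them.
void worker(const std::vector<Task>& tasks, std::atomic<std::size_t>& next) {
    for (std::size_t i = next.fetch_add(1); i < tasks.size(); i = next.fetch_add(1))
        run_task(tasks[i]);
}

int main() {
    std::vector<Task> tasks;
    for (long c = 1; c <= 64; ++c) tasks.push_back({c * 10000});  // irregular costs
    std::atomic<std::size_t> next{0};
    std::vector<std::thread> threads;
    for (unsigned t = 0; t < 4; ++t)
        threads.emplace_back(worker, std::cref(tasks), std::ref(next));
    for (auto& th : threads) th.join();
}

A thread asks for more work only when it becomes idle, so uneven task costs are absorbed automatically, at the price of some contention on the shared counter.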
3.2 Shared Memory Systems
When designing parallel algorithms, one must be cognizant of the memory model
under which the algorithm is intended to operate. The choice of memory model determines how data will
be stored and accessed, which in turn plays a direct role in the design and performance
of a parallel algorithm. Understanding the characteristics of each model and their
associated challenges is key in developing scalable algorithms.
Shared memory systems are parallel machines in which processes share a
single memory address space. Programming for shared memory systems has be-
come steadily more popular in recent years due to the now ubiquitous multi-core
workstations. A major advantage of working with shared memory is the ease of
communication between processes. Any cooperation that needs to take place can be
accomplished by simply changing values in memory. Global data structures may be
constructed and accessed with ease by all processes.
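As a small illustration of this style of cooperation (a sketch, not code from this chapter), two C++ threads below update a single global table of item counts directly, guarded by a mutex, rather than exchanging any messages; the item identifiers and the two transaction lists are invented for the example.

#include <functional>
#include <mutex>
#include <thread>
#include <unordered_map>
#include <vector>

std::unordered_map<int, long> global_counts;  // one table shared by all threads
std::mutex counts_mutex;                      // guards concurrent updates

// Each thread scans its share of the transactions and updates the shared table.
void count_items(const std::vector<int>& items) {
    for (int item : items) {
        std::lock_guard<std::mutex> lock(counts_mutex);
        ++global_counts[item];                // cooperation = writing shared memory
    }
}

int main() {
    std::vector<int> part_a{1, 2, 2, 3}, part_b{2, 3, 3, 4};
    std::thread t1(count_items, std::cref(part_a));
    std::thread t2(count_items, std::cref(part_b));
    t1.join();
    t2.join();
    // global_counts now holds the combined item frequencies from both threads.
}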
Designing algorithms for shared memory systems comes with its own set of chal-
lenges. A consequence of shared memory programming is the need to acknowledge
hardware details such as cache size and concurrent memory accesses. If two pro-
cesses attempt to write to the same memory address at the same time, one may
overwrite the other's update, producing an incorrect result unless the accesses are
properly synchronized.
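The hazard and its usual remedy can be seen in the short, illustrative C++ sketch below: two threads increment the same shared counter, once without any synchronization and once through std::atomic.

#include <atomic>
#include <cstdio>
#include <thread>

long plain_counter = 0;                 // shared, unsynchronized
std::atomic<long> safe_counter{0};      // shared, updated atomically

void add_million() {
    for (int i = 0; i < 1000000; ++i) {
        ++plain_counter;                // data race: concurrent writes can be lost
        ++safe_counter;                 // atomic read-modify-write: never lost
    }
}

int main() {
    std::thread t1(add_million), t2(add_million);
    t1.join();
    t2.join();
    // safe_counter is exactly 2000000; plain_counter is often smaller.
    std::printf("plain=%ld atomic=%ld\n", plain_counter, safe_counter.load());
}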