Analyzing Data Streams in Scientific Applications - Scientific Data Management

a very efficient scheduler and memory manager that provide much better

throughput than existing stream processing systems.

11.3 Parallelizing High-Volume Scientific Stream Queries

WaveScope provides a complete functional programming language for specifying

high-volume stream processing computations. The nodes involved in these

computations can communicate using stream communication primitives where

the user explicitly specifies data interchange between WaveScope nodes. The

purpose of the systems described in this section is to provide primitives to

specify massively parallel and distributed computations in a functional query

language. The two systems GSDM (Grid Stream Data Manager) [19] and SCSQ

(Super Computer Stream Query processor) [21] provide two different ways of

parallelizing queries:

GSDM provides a library of constructors of high-level data flow distribution

templates to specify parallel execution schemes for functions used

in declarative stream queries. GSDM has been applied to signal analysis

in space physics applications.

SCSQ provides declarative parallelization in queries by providing stream

processes (SPs) as first-class objects in the query language. SCSQ has

been applied to space physics and traffic applications.

Both GSDM and SCSQ are based on a functional data model [9] where declarative

queries over streams are expressed in terms of functions.

The motivating application is LOFAR [33], a radio telescope under construction

that uses an array of 25,000 omni-directional antenna receivers

whose signals are digitized into data streams of very high rate. The LOFAR

antenna array will be the largest sensor network in the world. The receivers

produce raw data streams that arrive at the central processing facilities at a

rate that is too high for the data to be saved on disk. For these data-intensive

computations, LOFAR utilizes an IBM BlueGene supercomputer combined

with conventional Linux clusters.

High-performance stream processing for this kind of application requires the

ability to specify parallel continuous queries (CQs) running on nodes in a

heterogeneous hardware environment. To maximize throughput of streams and

computations it is important to parallelize CQs into continuous subqueries,

each executing as a separate process on some CPU. Often the parallelization

method depends on properties of the computation executed by the query,

making it impossible to automatically parallelize the execution. The query

processing system must therefore provide primitives for customized

parallelization of continuous computations.
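
The idea of splitting a continuous computation into subqueries with a user-chosen parallelization scheme can be illustrated with a small Python sketch. It is not GSDM or SCSQ code: the names `partition_round_robin`, `run_subquery`, `parallel_template`, and `mean_power` are hypothetical, a finite batch of windows stands in for an unbounded stream, and threads stand in for the separate processes a real system would place on different CPUs.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List, Sequence, Tuple

Window = Sequence[float]
Tagged = Tuple[int, Window]


def partition_round_robin(windows: Iterable[Window], n: int) -> List[List[Tagged]]:
    """Split windows into n substreams, tagging each window with its sequence number."""
    parts: List[List[Tagged]] = [[] for _ in range(n)]
    for seq, window in enumerate(windows):
        parts[seq % n].append((seq, window))
    return parts


def run_subquery(fn: Callable[[Window], float], part: List[Tagged]) -> List[Tuple[int, float]]:
    """One subquery: apply fn to every window of a single substream."""
    return [(seq, fn(window)) for seq, window in part]


def parallel_template(
    fn: Callable[[Window], float],
    windows: Iterable[Window],
    n_workers: int,
    partition: Callable[[Iterable[Window], int], List[List[Tagged]]] = partition_round_robin,
) -> List[float]:
    """Partition the input, run one subquery per partition, merge in stream order."""
    parts = partition(windows, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # Threads are a stand-in here; a real system would use separate processes.
        results = list(pool.map(lambda part: run_subquery(fn, part), parts))
    merged = [item for part in results for item in part]
    merged.sort(key=lambda item: item[0])  # restore the original stream order
    return [value for _, value in merged]


def mean_power(window: Window) -> float:
    """Hypothetical per-window signal function: mean of squared samples."""
    return sum(x * x for x in window) / len(window)


windows = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [0.0, 2.0]]
print(parallel_template(mean_power, windows, n_workers=2))  # [1.0, 4.0, 9.0, 2.0]
```

The `partition` argument is the customization point: round-robin is only safe when windows can be processed independently, which is exactly the kind of property of the computation that the system cannot infer automatically.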
