An Overview of Large-Scale Stream Processing Engines - Large Scale and Big Data: Processing and Management


tasks [19]. The ubiquity of mobile devices and location services, the pervasiveness of
sensors, and real-time network monitoring have created a crucial need for scalable,
parallel architectures that can process vast amounts of streamed data.

In general, stream processing systems support a large class of applications in
which data are generated from multiple sources and pushed asynchronously to the
servers responsible for processing them. Stream processing applications are therefore
usually deployed as continuous jobs that run from the time of their submission
until their cancellation. Many applications in domains such as telecommunications,
network security, and large-scale sensor networks require online processing
of continuous data flows. They produce very high loads that require aggregating the
processing capacity of many nodes. Rather than processing stored data as traditional
database systems do, stream processing engines process tuples on the fly, because
the volume of input discourages persistent storage and results must be delivered
promptly. Queries in streaming applications are generally continuous
and stateful. Once a query is registered, it starts processing events and stops only
when the system terminates or the query is deregistered. Queries
typically maintain state such as window aggregates or local variables. Query
state is kept on the same node that executes the query.
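
As a rough illustration of what "continuous and stateful" means, the following minimal Python sketch keeps a per-key running sum over fixed-size (tumbling) time windows. The function name `windowed_sum`, the `(timestamp, key, value)` event layout, and the assumption that events arrive in non-decreasing timestamp order are all hypothetical choices made for this example; they are not taken from any particular engine.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Optional, Tuple

Event = Tuple[int, str, float]    # (timestamp, key, value): hypothetical layout
Result = Tuple[int, str, float]   # (window_start, key, sum)


def windowed_sum(events: Iterable[Event], window_size: int) -> Iterator[Result]:
    """Continuously consume events and emit one sum per key per window.

    Assumes timestamps are non-decreasing (an assumption of this sketch).
    """
    state: Dict[str, float] = defaultdict(float)  # query state, local to this node
    current_window: Optional[int] = None

    for ts, key, value in events:
        window = ts - ts % window_size  # start of the window containing ts
        if current_window is not None and window != current_window:
            # The previous window is complete: emit its aggregates, reset state.
            for k in sorted(state):
                yield (current_window, k, state[k])
            state.clear()
        current_window = window
        state[key] += value

    # Reached only if the input ends, i.e. the query is deregistered.
    if current_window is not None:
        for k in sorted(state):
            yield (current_window, k, state[k])


if __name__ == "__main__":
    feed = [(0, "a", 1.0), (3, "a", 2.0), (4, "b", 5.0), (10, "a", 7.0)]
    for row in windowed_sum(feed, window_size=10):
        print(row)
```

Because the function is a generator, it produces results incrementally as windows close rather than after all input has been read, which mirrors how a registered query keeps running until it is cancelled.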

In the last decade, there have been substantial advances in the field of data
stream processing. From centralized stream processing systems, the state of the art
has advanced to stream processing engines that can distribute different
queries among a cluster of nodes [10,11,18]. This chapter provides an overview of
the main systems that have been proposed for scalable processing
of streaming data.

12.2 AURORA

Aurora [2,7] is a centralized stream processor that is fundamentally
a data-flow system and uses the popular boxes-and-arrows paradigm. In Aurora, a
stream is modeled as an append-only sequence of tuples with a uniform type (schema).
In addition to the application-specific data fields A1, ..., An, each tuple in a stream has
a timestamp (ts) that specifies its time of origin within the Aurora network. The
Aurora data model supports out-of-order data arrival. Tuples flow through a loop-free,
directed graph of processing operators (i.e., boxes). Ultimately, output streams
are presented to applications, which must be constructed to handle the asynchronously
arriving tuples in an output stream. Each operator accepts input streams,
transforms them in some way, and produces one or more output streams. By default,
queries are continuous in that they can potentially run forever over push-based
inputs. Figure 12.1 illustrates an overview of the Aurora system.
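
The tuple model and the boxes-and-arrows idea can be sketched in a few lines of Python. This is only an illustrative toy under stated assumptions, not Aurora's actual interface: the names `StreamTuple`, `filter_box`, `map_box`, and `connect` are hypothetical, and the "network" here is a simple linear chain of boxes rather than a general loop-free graph.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterator

Fields = Dict[str, float]


@dataclass(frozen=True)
class StreamTuple:
    """A tuple with a timestamp (ts) plus application-specific fields."""
    ts: int
    fields: Fields


# A box consumes one stream of tuples and produces another (hypothetical type).
Box = Callable[[Iterator[StreamTuple]], Iterator[StreamTuple]]


def filter_box(predicate: Callable[[StreamTuple], bool]) -> Box:
    """Box that forwards only the tuples satisfying the predicate."""
    def run(stream: Iterator[StreamTuple]) -> Iterator[StreamTuple]:
        return (t for t in stream if predicate(t))
    return run


def map_box(fn: Callable[[Fields], Fields]) -> Box:
    """Box that rewrites the data fields of each tuple and keeps its timestamp."""
    def run(stream: Iterator[StreamTuple]) -> Iterator[StreamTuple]:
        return (StreamTuple(t.ts, fn(t.fields)) for t in stream)
    return run


def connect(*boxes: Box) -> Box:
    """Wire boxes into a linear chain; each arrow feeds the next box."""
    def run(stream: Iterator[StreamTuple]) -> Iterator[StreamTuple]:
        for box in boxes:
            stream = box(stream)
        return stream
    return run


if __name__ == "__main__":
    network = connect(
        filter_box(lambda t: t.fields["temp"] > 30.0),
        map_box(lambda f: {"temp_f": f["temp"] * 9 / 5 + 32}),
    )
    source = iter([StreamTuple(1, {"temp": 25.0}), StreamTuple(2, {"temp": 35.0})])
    for out in network(source):
        print(out)
```

The boxes are lazy generators, so tuples are pushed through the chain one at a time as they arrive, and the chain can in principle run indefinitely over an unbounded input.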

The Aurora Stream Query Algebra (SQuAl) supports seven operators that are
used to construct Aurora networks (queries). The operators are analogous to those
of the relational algebra but differ fundamentally in how they
address the special requirements of stream processing. They can be divided into two
main categories: (1) order-agnostic operators (Filter, Map, and Union) and
(2) order-sensitive operators (BSort, Aggregate, Join, and Resample). The behavior of these
operators is described as follows:
