The basic idea is to define a point in time p so that tuples of bucket b with
timestamps earlier than p are processed by the old owner and tuples with later
timestamps are processed by the new one. This is straightforward for stateless
operators. However, it is more challenging when reconfiguring stateful operators:
due to the sliding-window semantics used in stateful operators, a tuple might
contribute to several windows, so some tuples need to be processed by both the
old and the new owner. SC reconfigures a subcluster by triggering one or more
reconfiguration actions, each of which changes the ownership of a bucket from the
old owner to the new one within the same subcluster. Reconfiguration actions only
affect the instances of the subcluster being reconfigured and the upstream LBs.
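
To make the handover concrete, the following minimal sketch shows how an LB
could route the tuples of a bucket under reconfiguration, assuming a sliding
window of size W: tuples with timestamps in [p - W, p) still contribute to
windows that close after p, so they are sent to both owners. The types Tuple
and Owner and the field windowSize are illustrative assumptions, not SC's
actual interfaces.

    import java.util.ArrayList;
    import java.util.List;

    interface Owner { void process(Tuple t); }        // illustrative stand-in
    record Tuple(long timestamp, Object payload) {}   // illustrative stand-in

    // Minimal sketch of an LB routing rule during a reconfiguration action.
    final class ReconfigRouter {
        private final long p;          // reconfiguration point in time
        private final long windowSize; // sliding-window size W of the operator
        private final Owner oldOwner;
        private final Owner newOwner;

        ReconfigRouter(long p, long windowSize, Owner oldOwner, Owner newOwner) {
            this.p = p;
            this.windowSize = windowSize;
            this.oldOwner = oldOwner;
            this.newOwner = newOwner;
        }

        // Returns the owners that must process a tuple of the moved bucket.
        List<Owner> route(Tuple t) {
            List<Owner> targets = new ArrayList<>();
            if (t.timestamp() < p) {
                targets.add(oldOwner);               // old owner closes its windows
                if (t.timestamp() >= p - windowSize) {
                    targets.add(newOwner);           // tuple also falls in windows
                }                                    // the new owner closes after p
            } else {
                targets.add(newOwner);               // from p on, only the new owner
            }
            return targets;
        }
    }

Note that a stateless operator corresponds to windowSize = 0, in which case the
rule collapses to a pure timestamp comparison against p.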
Elasticity rules are specified as thresholds that set the conditions that trigger
provisioning, decommissioning, or load balancing. Provisioning and decommissioning
are triggered when the average CPU utilization rises above the upper-utilization-
threshold (UUT) or falls below the lower-utilization-threshold (LUT).
Reconfiguration actions aim at an average CPU utilization as close as possible to
the target-utilization-threshold (TUT). Load balancing is triggered when the
standard deviation of the CPU utilization exceeds the upper-imbalance-threshold
(UIT). A minimum-improvement-threshold (MIT) specifies the minimum performance
improvement required to endorse a new configuration; that is, the new configuration
is applied only if the imbalance reduction is above the MIT. The goal is to keep
the average CPU utilization between the lower and upper utilization thresholds and
the standard deviation below the upper imbalance threshold in each subcluster.

SC features a load-aware
provisioning strategy. When provisioning instances, a naive strategy would be to
provision one instance at a time (individual provisioning). However, individual
provisioning might lead to cascaded provisioning, that is, the continuous
allocation of new instances. This can happen under a steadily increasing load,
when the additional computing power provided by a single new instance does not
bring the average CPU utilization back below the UUT. To overcome this problem,
SC's load-aware provisioning takes the current subcluster size and load into
account to decide how many new instances to provision in order to reach the TUT,
as sketched below.
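
The sketch below encodes both the threshold checks and a load-aware sizing rule.
The concrete threshold values and the sizing formula (scale the current aggregate
load so that per-instance utilization lands on the TUT) are assumptions consistent
with the description above, not SC's published implementation.

    // Minimal sketch of SC-style elasticity rules; threshold values and the
    // sizing formula are illustrative assumptions.
    final class ElasticityRules {
        double uut = 0.85; // upper-utilization-threshold
        double lut = 0.30; // lower-utilization-threshold
        double tut = 0.60; // target-utilization-threshold
        double uit = 0.15; // upper-imbalance-threshold (stdev of CPU utilization)
        double mit = 0.05; // minimum-improvement-threshold

        // How many instances to add (> 0) or decommission (< 0); 0 = no change.
        int resizeDecision(int subclusterSize, double avgCpu) {
            if (avgCpu >= lut && avgCpu <= uut) return 0;
            // Load-aware sizing: the aggregate load is subclusterSize * avgCpu;
            // choose the size whose per-instance utilization is closest to TUT.
            int targetSize = (int) Math.ceil(subclusterSize * avgCpu / tut);
            return Math.max(1, targetSize) - subclusterSize;
        }

        // Load balancing fires when imbalance exceeds UIT, and a candidate
        // configuration is endorsed only if it cuts imbalance by more than MIT.
        boolean endorseRebalance(double stdevCpu, double candidateStdevCpu) {
            return stdevCpu > uit && (stdevCpu - candidateStdevCpu) > mit;
        }
    }

Under a steadily increasing load, ceil(size * avgCpu / TUT) can jump by several
instances in a single step, which is precisely what avoids the cascaded
provisioning described above.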
12.7 STORMY
The Stormy system [18] has been presented as a distributed stream processing
service for continuous data processing that relies on techniques from existing
cloud storage systems, adapted to execute streaming workloads efficiently. It uses
a synthesis of well-known techniques from cloud storage systems [9,15,17] to
achieve its scalability and availability goals. It uses distributed hash tables
(DHTs) [5] to distribute queries across all nodes and to route events from query
to query according to the query graph. To achieve high availability, Stormy uses a
replication mechanism in which queries are replicated on several nodes and events
are executed concurrently on all replicas. As Stormy's architecture is
decentralized by design, there is no single point of failure. However, Stormy
assumes that a query can be executed completely on one node; thus, there is an
upper bound on the rate of incoming events per stream. The API of Stormy consists
of only four functions.
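
The DHT-based distribution can be pictured as consistent hashing: each query
hashes to a position on a ring, the node owning that position executes it, the
next R - 1 nodes on the ring hold its replicas, and an incoming event is delivered
to all replicas for concurrent execution. The sketch below illustrates this
placement under those assumptions; the class and method names are invented for
illustration, and Stormy's actual four-function API is not shown here.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.TreeMap;

    // Minimal consistent-hashing sketch of DHT-style query placement and
    // replication; not Stormy's actual interfaces.
    final class QueryRing {
        private final TreeMap<Integer, String> ring = new TreeMap<>(); // pos -> node
        private final int replicationFactor; // R

        QueryRing(int replicationFactor) { this.replicationFactor = replicationFactor; }

        void addNode(String node) { ring.put(position(node), node); }

        // The node owning the query's ring position plus its successors hold
        // the replicas; events for the query are routed to all of them.
        List<String> replicasFor(String queryId) {
            List<String> owners = new ArrayList<>();
            if (ring.isEmpty()) return owners;
            Integer key = ring.ceilingKey(position(queryId));
            if (key == null) key = ring.firstKey();     // wrap around the ring
            while (owners.size() < Math.min(replicationFactor, ring.size())) {
                owners.add(ring.get(key));
                key = ring.higherKey(key);
                if (key == null) key = ring.firstKey(); // wrap around the ring
            }
            return owners;
        }

        private static int position(String s) {
            return s.hashCode() & 0x7fffffff; // stand-in for the DHT's hash function
        }
    }

Because routing is fully determined by the ring, any node can forward an event
without central coordination, which is what leaves the architecture without a
single point of failure.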