Scientific Discovery Within Data Streams
Andrew J. Cowell, Sue Havre, Richard May, and Antonio Sanfilippo
Pacific Northwest National Laboratory, USA
{andrew.cowell, sue.havre, richard.may, antonio.sanfilippo}@pnl.gov
1 Introduction
The term 'data-stream' is an increasingly overloaded expression. It often means
different things to different people, depending on domain, usage or operation.
Harold (2003) draws the following analogy:
“A [stream] analogy might be a queue of people waiting to get on a ride
at an amusement park. As people are processed at the front (i.e. get on
the roller coaster) more are added at the back of the line. If it's a slow
day the roller coaster may catch up with the end of the line and have to
wait for people to board. Other days there may always be people in line
until the park closes...There's always a definite number of people in line
though this number may change from moment to moment as people enter
at the back of the line and exit from the front of the line. Although all
the people are discrete, you'll sometimes have a family that must be put
together in the same car. Thus although the individuals are discrete, they
aren't necessarily unrelated.”
For our purposes we define a data-stream as a series of data (e.g. credit card
transactions arriving at a clearing office, cellular phone traffic or environmental
data from satellites) arriving in real time, with an initiation and a continuous
ingest of data, but with no expectations on the amount, length, or end of the data
flow. The data stream does not have a database or repository as an intrinsic part
of its definition; it is a 'one-look' opportunity from the perspective of data stream
analytics. We call each data element in the stream a token, and the complexity
of these tokens ranges from simple (e.g. characters in a sentence: “T H I S I S
A S T R E A M. . . ”) to extremely complex (e.g. a detailed transaction record).
The volume of data-streams is usually massive, and while each individual token
may be rather uninformative, taken as a whole they describe the nature of the
changing phenomena over time.
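
The 'one-look' property can be illustrated with a minimal Python sketch. It is not taken from the chapter: the names running_summary and transaction_amounts are hypothetical, and the token source is a stand-in that simply cycles over a few made-up amounts. The point is only that each token is inspected once, folded into a small summary, and then discarded, so no repository of past tokens is needed.

```python
import itertools
from typing import Iterable, Iterator, Tuple


def running_summary(tokens: Iterable[float]) -> Iterator[Tuple[int, float]]:
    """Yield (count, mean) after each token, looking at each token only once."""
    count = 0
    mean = 0.0
    for value in tokens:
        count += 1
        # Incremental mean update: no earlier token is stored or revisited.
        mean += (value - mean) / count
        yield count, mean


def transaction_amounts() -> Iterator[float]:
    """Hypothetical unbounded token source (assumption for illustration only)."""
    return itertools.cycle([10.0, 20.0, 30.0])


# The stream has no defined end, so the consumer decides when to stop looking.
first_six = list(itertools.islice(running_summary(transaction_amounts()), 6))
for count, mean in first_six:
    print(f"tokens seen: {count}, running mean: {mean:.2f}")
```

Because the summary is updated token by token, memory use stays constant however long the stream runs; the trade-off is that any question not anticipated by the summary cannot be answered later, since the tokens themselves are gone.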
The properties of data streams differ from conventional stored relations in
many ways. They have no width or flow boundaries, meaning that there is no
control over the total amount of data flowing, or differences in flow volume
arriving at any particular moment. They are also time-varying and unpredictable;
flow can start or stop at any point, and the number of tokens delivered to a
receiver per unit time varies. In addition, we have no control over the order
in which data items arrive; some data-streams provide tokens in order, while