Real-time Machine Learning with Spark Streaming - Machine Learning with Spark


Stream processing

Before covering online learning with Spark, we will first explore the basics of stream processing and introduce the Spark Streaming library.

In addition to the core Spark API and functionality, the Spark project contains another major library called Spark Streaming (a major project library in the same way that MLlib is), which focuses on processing data streams in real time.

A data stream is a continuous sequence of records. Common examples include activity stream data from a web or mobile application, time-stamped log data, transactional data, and event streams from sensor or device networks.

The batch processing approach typically involves saving the data stream to an intermediate storage system (for example, HDFS or a database) and running a batch process on the saved data. In order to generate up-to-date results, the batch process must be run periodically (for example, daily, hourly, or even every few minutes) on the latest data available.

By contrast, the stream-based approach applies processing to the data stream as it is generated. This allows near real-time processing (on the order of a second or less, rather than the minutes, hours, days, or even weeks typical of batch processing).
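The difference between the two approaches can be illustrated with a small sketch. This is plain Python rather than the Spark Streaming API, and the function names and the sample values are hypothetical, chosen only for illustration. It assumes the quantity of interest is a simple mean: the batch version recomputes it over all stored records each time it runs, while the stream version updates it as each record arrives.

```python
from typing import Iterable, Iterator, List


def batch_mean(stored_records: List[float]) -> float:
    """Batch approach: recompute the result over all records saved so far."""
    return sum(stored_records) / len(stored_records)


def streaming_means(records: Iterable[float]) -> Iterator[float]:
    """Stream approach: update the result incrementally as each record arrives."""
    count = 0
    total = 0.0
    for value in records:
        count += 1
        total += value
        # An up-to-date mean is available immediately after every record.
        yield total / count


# Hypothetical event values (for example, response times in milliseconds).
events = [120.0, 80.0, 100.0, 140.0]

# Batch: the result exists only once the periodic job has run over stored data.
batch_result = batch_mean(events)

# Stream: one result per incoming record, with no need to store the records.
stream_results = list(streaming_means(events))

print(batch_result)
print(stream_results)
```

Once every record has been seen, the final streaming value equals the batch result. What differs is when the result becomes available and whether the full history has to be stored and reprocessed.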
