to handle new logfiles as they show up in a directory instead, as shown in Examples 10-29 and 10-30.
Example 10-29. Streaming text files written to a directory in Scala

val logData = ssc.textFileStream(logDirectory)
Example 10-30. Streaming text files written to a directory in Java

JavaDStream<String> logData = jssc.textFileStream(logsDirectory);
We can use the provided ./bin/fakelogs_directory.sh script to fake the logs, or if we have real log data we could replace the rotator with an mv command to rotate the logfiles into the directory we are monitoring.
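The rotation step can be sketched as below; the paths and filename pattern are illustrative assumptions, not fixed by the book. The key point is that mv within a single filesystem is atomic, so the monitored directory never exposes a half-written file to the file stream:

```shell
#!/usr/bin/env bash
# Sketch: rotate a finished logfile into the directory Spark Streaming
# monitors with textFileStream(). Paths are illustrative assumptions.
LOG_DIR=$(mktemp -d)      # stands in for the application's log directory
MONITORED=$(mktemp -d)    # the directory passed to textFileStream()

echo "GET /index.html 200" > "$LOG_DIR/current.log"

# mv is atomic on the same filesystem, so the file appears in the
# monitored directory all at once, never partially written.
mv "$LOG_DIR/current.log" "$MONITORED/app-$(date +%s).log"

ls "$MONITORED"
```

Note that new files must appear in the monitored directory by an atomic move or rename; appending to a file already in the directory will not be picked up.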
In addition to text data, we can also read any Hadoop input format. As with "Hadoop Input and Output Formats" on page 84, we simply need to provide Spark Streaming with the Key, Value, and InputFormat classes. If, for example, we had a previous streaming job process the logs and save the bytes transferred at each time as a SequenceFile, we could read the data as shown in Example 10-31.
Example 10-31. Streaming SequenceFiles written to a directory in Scala

ssc.fileStream[LongWritable, IntWritable,
  SequenceFileInputFormat[LongWritable, IntWritable]](inputDirectory).map {
  case (x, y) => (x.get(), y.get())
}
Akka actor stream

for streaming. To construct an actor stream we create an Akka actor and implement the org.apache.spark.streaming.receiver.ActorHelper interface. To copy the input from our actor into Spark Streaming, we need to call the store() function in our actor when we receive new data. Akka actor streams are less common so we won't go into detail, but you can look at the streaming documentation and the ActorWordCount example in Spark to see them in use.
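A minimal sketch of such an actor is shown below, assuming a Spark 1.x build with Akka on the classpath; the actor class name and the String message type are illustrative assumptions:

```scala
import akka.actor.Actor
import org.apache.spark.streaming.receiver.ActorHelper

// A minimal receiver actor: each incoming String message is copied
// into Spark Streaming by calling the store() method from ActorHelper.
class LineReceiverActor extends Actor with ActorHelper {
  def receive = {
    case line: String => store(line)
  }
}
```

Such an actor would then be registered with the streaming context using something like `ssc.actorStream[String](Props[LineReceiverActor], "LineReceiver")`, after which the result behaves like any other DStream.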
Additional Sources
In addition to the core sources, additional receivers for well-known data ingestion
systems are packaged as separate components of Spark Streaming. These receivers are
still part of Spark, but require extra packages to be included in your build file. Some
current receivers include Twitter, Apache Kafka, Amazon Kinesis, Apache Flume,