We can configure our log processing example to handle new logfiles as they show up in a directory instead, as shown in Examples 10-29 and 10-30.
Example 10-29. Streaming text files written to a directory in Scala
val logData = ssc.textFileStream(logDirectory)
Example 10-30. Streaming text files written to a directory in Java
JavaDStream<String> logData = jssc.textFileStream(logsDirectory);
We can use the provided ./bin/fakelogs_directory.sh script to fake the logs, or, if we have real log data, we could replace the rotator with an mv command to rotate the logfiles into the directory we are monitoring.
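One caveat worth noting: file-based streams only pick up files that appear in the monitored directory after the stream starts, and each file must be moved or renamed into the directory atomically; Spark Streaming does not process data appended to an existing file.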
In addition to text data, we can also read any Hadoop input format. As with "Hadoop Input and Output Formats" on page 84, we simply need to provide Spark Streaming with the Key, Value, and InputFormat classes. If, for example, we had a previous streaming job process the logs and save the bytes transferred at each time as a SequenceFile, we could read the data as shown in Example 10-31.
Example 10-31. Streaming SequenceFiles written to a directory in Scala
import org.apache.hadoop.io.{IntWritable, LongWritable}
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat

ssc.fileStream[LongWritable, IntWritable,
  SequenceFileInputFormat[LongWritable, IntWritable]](inputDirectory).map {
  case (x, y) => (x.get(), y.get())
}
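The map converts each Writable pair into plain Scala values, leaving us with a DStream of (Long, Int) tuples that downstream transformations can work with directly.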
Akka actor stream
The second core receiver is actorStream , which allows using Akka actors as a source
for streaming. To construct an actor stream we create an Akka actor and implement
the org.apache.spark.streaming.receiver.ActorHelper interface. To copy the
input from our actor into Spark Streaming, we need to call the store() function in
our actor when we receive new data. Akka actor streams are less common, so we won't go into detail, but you can look at the streaming documentation and the ActorWordCount example in Spark to see them in use.
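As a minimal sketch, an actor receiver might look like the following; the actor class name, message type, and wiring are assumptions for illustration, not code from this chapter.

import akka.actor.{Actor, Props}
import org.apache.spark.streaming.receiver.ActorHelper

// A hypothetical actor that forwards every String message it receives
// into Spark Streaming by calling store().
class LineReceiverActor extends Actor with ActorHelper {
  def receive = {
    case line: String => store(line)
  }
}

// Wiring the actor into a DStream; the name "LineReceiver" is illustrative.
val lines = ssc.actorStream[String](Props[LineReceiverActor], "LineReceiver")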
Additional Sources
In addition to the core sources, additional receivers for well-known data ingestion
systems are packaged as separate components of Spark Streaming. These receivers are
still part of Spark, but require extra packages to be included in your build file. Some
current receivers include Twitter, Apache Kafka, Amazon Kinesis, Apache Flume, and ZeroMQ.
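For instance, to consume from Kafka we would add the spark-streaming-kafka artifact to our build and create a receiver-based stream. The following is a sketch rather than code from this section; the ZooKeeper host, consumer group, and topic name are hypothetical placeholders.

import org.apache.spark.streaming.kafka._

// Assumes the spark-streaming-kafka package is on the classpath.
// Map of topic name -> number of receiver threads; "logs" is illustrative.
val topics = Map("logs" -> 1)
val kafkaStream = KafkaUtils.createStream(ssc, "zkhost:2181", "log-group", topics)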