to handle new logfiles as they show up in a directory instead, as shown in Examples 10-29 and 10-30.
Example 10-29. Streaming text files written to a directory in Scala

val logData = ssc.textFileStream(logDirectory)
Example 10-30. Streaming text files written to a directory in Java

JavaDStream<String> logData = jssc.textFileStream(logsDirectory);
We can use the provided ./bin/fakelogs_directory.sh script to fake the logs, or if we have real log data we could replace the rotator with an mv command to rotate the logfiles into the directory we are monitoring.
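The rotation step can be sketched as below; the paths and filename pattern are illustrative assumptions, not fixed by the book. The key point is that mv within a single filesystem is atomic, so the monitored directory never exposes a half-written file to the file stream:

```shell
#!/usr/bin/env bash
# Sketch: rotate a finished logfile into the directory Spark Streaming
# monitors with textFileStream(). Paths are illustrative assumptions.
LOG_DIR=$(mktemp -d)      # stands in for the application's log directory
MONITORED=$(mktemp -d)    # the directory passed to textFileStream()

echo "GET /index.html 200" > "$LOG_DIR/current.log"

# mv is atomic on the same filesystem, so the file appears in the
# monitored directory all at once, never partially written.
mv "$LOG_DIR/current.log" "$MONITORED/app-$(date +%s).log"

ls "$MONITORED"
```

Note that new files must appear in the monitored directory by an atomic move or rename; appending to a file already in the directory will not be picked up.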
In addition to text data, we can also read any Hadoop input format. As with "Hadoop Input and Output Formats" on page 84, we simply need to provide Spark Streaming with the Key, Value, and InputFormat classes. If, for example, we had a previous streaming job process the logs and save the bytes transferred at each time as a SequenceFile, we could read the data as shown in Example 10-31.
Example 10-31. Streaming SequenceFiles written to a directory in Scala

ssc.fileStream[LongWritable, IntWritable,
  SequenceFileInputFormat[LongWritable, IntWritable]](inputDirectory).map {
  case (x, y) => (x.get(), y.get())
}
Akka actor stream

for streaming. To construct an actor stream we create an Akka actor and implement the org.apache.spark.streaming.receiver.ActorHelper interface. To copy the input from our actor into Spark Streaming, we need to call the store() function in our actor when we receive new data. Akka actor streams are less common so we won't go into detail, but you can look at the streaming documentation and the ActorWordCount example in Spark to see them in use.
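A minimal sketch of such an actor is shown below, assuming a Spark 1.x build with Akka on the classpath; the actor class name and the String message type are illustrative assumptions:

```scala
import akka.actor.Actor
import org.apache.spark.streaming.receiver.ActorHelper

// A minimal receiver actor: each incoming String message is copied
// into Spark Streaming by calling the store() method from ActorHelper.
class LineReceiverActor extends Actor with ActorHelper {
  def receive = {
    case line: String => store(line)
  }
}
```

Such an actor would then be registered with the streaming context using something like `ssc.actorStream[String](Props[LineReceiverActor], "LineReceiver")`, after which the result behaves like any other DStream.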
Additional Sources
In addition to the core sources, additional receivers for well-known data ingestion
systems are packaged as separate components of Spark Streaming. These receivers are
still part of Spark, but require extra packages to be included in your build file. Some
current receivers include Twitter, Apache Kafka, Amazon Kinesis, Apache Flume,