with a couple of
^^
>>Stanley Cups thrown in.
...
As we can see, each message begins with header fields giving the sender, subject, and other metadata, followed by the raw content of the message.
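A message's header block is conventionally separated from its body by the first blank line. If you later need to process headers and body separately, a minimal sketch of such a split might look like the following (splitMessage is a hypothetical helper, not part of the book's code, and assumes Unix-style line endings):

// Split a raw message into (headers, body) at the first blank line.
def splitMessage(raw: String): (String, String) = {
  val idx = raw.indexOf("\n\n")
  if (idx >= 0) (raw.take(idx), raw.drop(idx + 2))
  else (raw, "") // no blank line found; treat the whole message as headers
}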
Exploring the 20 Newsgroups data
Now, we will start up our Spark Scala console, ensuring that we make enough memory available:
>$SPARK_HOME/bin/spark-shell --driver-memory 4g
Looking at the directory structure, you might recognize that once again, we have data contained in individual text files (one text file per message). Therefore, we will again use Spark's wholeTextFiles method to read the content of each file into a record in our RDD.
In the code that follows, PATH refers to the directory into which you extracted the 20news-bydate archive:
val path = "/PATH/20news-bydate-train/*"
// wholeTextFiles yields one (file path, file content) record per file
val rdd = sc.wholeTextFiles(path)
// keep only the message content, dropping the file path
val text = rdd.map { case (file, text) => text }
println(text.count)
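As a quick sanity check (a hypothetical snippet, not from the original code), you can inspect the first record to confirm that wholeTextFiles produces (file path, file content) pairs:

// fetch the first (path, content) pair and peek at its content
val (firstFile, firstText) = rdd.first()
println(firstFile)
println(firstText.take(100))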
The first time you run this command, it might take quite a bit of time, as Spark needs to scan the directory structure. You will also see quite a lot of console output, as Spark logs all the file paths that are being processed. During the processing, you will see the following line displayed, indicating the total number of files that Spark has detected:
...
14/10/12 14:27:54 INFO FileInputFormat: Total input paths to process : 11314
...
After the command has finished running, you will see the total record count, which should match the Total input paths to process value in the preceding screen output:

11314
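Scanning and reading every file is relatively expensive, so if you intend to run further operations over this dataset, it can be worth caching the RDD in memory so that subsequent actions reuse the loaded data rather than re-reading the files. This is standard Spark usage rather than a step from the original text:

// cache the message text; the first action after this materializes
// the cache, and later actions read from memory
text.cache()
println(text.count)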