with a couple of
^^
>>Stanley Cups thrown in.
...
As we can see, each message begins with header fields giving the sender, subject, and other metadata, followed by the raw content of the message.
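A message's header block is conventionally separated from its body by the first blank line. If you later need to process headers and body separately, a minimal sketch of such a split might look like the following (splitMessage is a hypothetical helper, not part of the book's code, and assumes Unix-style line endings):

// Split a raw message into (headers, body) at the first blank line.
def splitMessage(raw: String): (String, String) = {
  val idx = raw.indexOf("\n\n")
  if (idx >= 0) (raw.take(idx), raw.drop(idx + 2))
  else (raw, "") // no blank line found; treat the whole message as headers
}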
Exploring the 20 Newsgroups data
Now, we will start up our Spark Scala console, ensuring that we make enough memory available:
>$SPARK_HOME/bin/spark-shell --driver-memory 4g
Looking at the directory structure, you might recognize that once again, we have data contained in individual text files (one text file per message). Therefore, we will again use Spark's wholeTextFiles method to read the content of each file into a record in our RDD.
In the code that follows, PATH refers to the directory into which you extracted the 20news-bydate archive:
val path = "/PATH/20news-bydate-train/*"
// wholeTextFiles yields one (file path, file content) record per file
val rdd = sc.wholeTextFiles(path)
// keep only the message content, dropping the file path
val text = rdd.map { case (file, text) => text }
println(text.count)
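As a quick sanity check (a hypothetical snippet, not from the original code), you can inspect the first record to confirm that wholeTextFiles produces (file path, file content) pairs:

// fetch the first (path, content) pair and peek at its content
val (firstFile, firstText) = rdd.first()
println(firstFile)
println(firstText.take(100))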
The first time you run this command, it might take quite a bit of time, as Spark needs to scan the directory structure. You will also see quite a lot of console output, as Spark logs all the file paths that are being processed. During the processing, you will see the following line displayed, indicating the total number of files that Spark has detected:
...
14/10/12 14:27:54 INFO FileInputFormat: Total input paths to process : 11314
...
After the command has finished running, you will see the total record count, which should match the Total input paths to process value in the preceding screen output:

11314
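Scanning and reading every file is relatively expensive, so if you intend to run further operations over this dataset, it can be worth caching the RDD in memory so that subsequent actions reuse the loaded data rather than re-reading the files. This is standard Spark usage rather than a step from the original text:

// cache the message text; the first action after this materializes
// the cache, and later actions read from memory
text.cache()
println(text.count)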