tion. We will start by applying a simple whitespace tokenization, together with converting each token to lowercase for each document:
val text = rdd.map { case (file, text) => text }
val whiteSpaceSplit = text.flatMap(t => t.split(" ").map(_.toLowerCase))
println(whiteSpaceSplit.distinct.count)
Tip
In the preceding code, we used the flatMap function instead of map because, for now, we want to inspect all the tokens together for exploratory analysis. Later in this chapter, we will apply our tokenization scheme on a per-document basis, so we will use the map function.
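The distinction the tip draws can be sketched with plain Scala collections (no Spark needed); the two-document corpus here is a made-up illustration, not data from the chapter:

```scala
// map keeps one element per document (an array of tokens),
// while flatMap flattens all token arrays into a single sequence.
val docs = Seq("a b", "c")

val mapped = docs.map(_.split(" "))     // Seq[Array[String]] with 2 elements
val flat   = docs.flatMap(_.split(" ")) // Seq[String] with 3 tokens

println(mapped.length) // 2 -- one entry per document
println(flat.length)   // 3 -- one entry per token
```

This is why flatMap suits exploratory analysis over the whole token stream, while map preserves the per-document grouping we will need later.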
After running this code snippet, you will see the total number of unique tokens after applying our tokenization:
402978
As you can see, for even a relatively small set of text, the number of raw tokens (and,
therefore, the dimensionality of our feature vectors) can be very high.
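The same tokenize-lowercase-count pipeline can be sketched on a toy in-memory corpus using plain Scala collections in place of an RDD; the two documents below are hypothetical, chosen only to show how lowercasing merges case variants of a token:

```scala
// A minimal sketch of whitespace tokenization plus lowercasing,
// followed by a distinct-token count (the vocabulary size).
val docs = Seq("Atheist Resources", "Atheism Keywords resources")

val tokens = docs.flatMap(t => t.split(" ").map(_.toLowerCase))
val vocabSize = tokens.distinct.size

// "Resources" and "resources" collapse to one token after lowercasing,
// so the 5 raw tokens yield a vocabulary of 4.
println(vocabSize)
```

On a real corpus, each distinct token becomes one dimension of the feature vectors, which is why even modest text collections produce very high-dimensional representations.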
Let's take a look at a randomly selected document:
println(whiteSpaceSplit.sample(true, 0.3, 42).take(100).mkString(","))
Tip
Note that the third parameter to the sample function is the random seed. We set it to 42 so that each call to sample returns the same results and your results match those in this chapter.
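The effect of fixing the seed can be sketched with scala.util.Random, which Spark's sample relies on the same principle as: identical seeds produce identical pseudorandom draws.

```scala
// Two generators seeded with the same value produce the same sequence,
// which is what makes a seeded sample reproducible across runs.
val a = new scala.util.Random(42).nextInt(100)
val b = new scala.util.Random(42).nextInt(100)

println(a == b) // true: same seed, same draw
```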
This will display the following result:
atheist,resources
summary:,addresses,,to,atheism
keywords:,music,,thu,,11:57:19,11:57:19,gmt
distribution:,cambridge.,290