We will start by applying a simple whitespace tokenization, together with converting each token to lowercase, for each document:
val text = rdd.map { case (file, text) => text }
val whiteSpaceSplit = text.flatMap(t => t.split(" ").map(_.toLowerCase))
println(whiteSpaceSplit.distinct.count)
Tip
In the preceding code, we used the flatMap function instead of map because, for now, we want to inspect all the tokens together for exploratory analysis. Later in this chapter, we will apply our tokenization scheme on a per-document basis, so we will use the map function.
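For comparison, a minimal sketch of what the per-document variant could look like is shown here (an illustration only, not the exact code we will use later; tokensPerDoc is a hypothetical name, and rdd is the same (file, text) pair RDD as before):
// map keeps one record per document, so each document retains its own token sequence
val tokensPerDoc = rdd.map { case (file, text) =>
  (file, text.split(" ").map(_.toLowerCase).toSeq)
}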
After running the whitespace tokenization snippet, you will see the total number of unique tokens:
402978
As you can see, for even a relatively small set of text, the number of raw tokens (and,
therefore, the dimensionality of our feature vectors) can be very high.
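If you want to explore where this large vocabulary comes from, one simple exploratory step is to count how often each raw token occurs and inspect the most frequent ones. The following is an illustrative sketch (tokenCounts is a hypothetical name used only here):
// count raw token occurrences and print the ten most frequent tokens
val tokenCounts = whiteSpaceSplit.map(t => (t, 1)).reduceByKey(_ + _)
println(tokenCounts.sortBy { case (_, count) => -count }.take(10).mkString("\n"))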
Let's take a look at a random sample of the tokens:
println(whiteSpaceSplit.sample(true, 0.3, 42).take(100).mkString(","))
Tip
Note that the third parameter to the sample function is the random seed. We set it to 42 so that each call to sample returns the same results, ensuring that your results match those shown in this chapter.
This will display the following result:
atheist,resources
summary:,addresses,,to,atheism
keywords:,music,,thu,,11:57:19,11:57:19,gmt
distribution:,cambridge.,290