tion. We will start by applying a simple whitespace tokenization, together with converting each token to lowercase for each document:
val text = rdd.map { case (file, text) => text }
val whiteSpaceSplit = text.flatMap(t => t.split(" ").map(_.toLowerCase))
println(whiteSpaceSplit.distinct.count)
Tip
In the preceding code, we used the flatMap function instead of map because, for now, we want to inspect all the tokens together for exploratory analysis. Later in this chapter, we will apply our tokenization scheme on a per-document basis, so we will use the map function.
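The distinction the tip draws can be sketched with plain Scala collections (no Spark needed); the two-document corpus here is a made-up illustration, not data from the chapter:

```scala
// map keeps one element per document (an array of tokens),
// while flatMap flattens all token arrays into a single sequence.
val docs = Seq("a b", "c")

val mapped = docs.map(_.split(" "))     // Seq[Array[String]] with 2 elements
val flat   = docs.flatMap(_.split(" ")) // Seq[String] with 3 tokens

println(mapped.length) // 2 -- one entry per document
println(flat.length)   // 3 -- one entry per token
```

This is why flatMap suits exploratory analysis over the whole token stream, while map preserves the per-document grouping we will need later.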
After running this code snippet, you will see the total number of unique tokens after applying our tokenization:
402978
As you can see, for even a relatively small set of text, the number of raw tokens (and,
therefore, the dimensionality of our feature vectors) can be very high.
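The same tokenize-lowercase-count pipeline can be sketched on a toy in-memory corpus using plain Scala collections in place of an RDD; the two documents below are hypothetical, chosen only to show how lowercasing merges case variants of a token:

```scala
// A minimal sketch of whitespace tokenization plus lowercasing,
// followed by a distinct-token count (the vocabulary size).
val docs = Seq("Atheist Resources", "Atheism Keywords resources")

val tokens = docs.flatMap(t => t.split(" ").map(_.toLowerCase))
val vocabSize = tokens.distinct.size

// "Resources" and "resources" collapse to one token after lowercasing,
// so the 5 raw tokens yield a vocabulary of 4.
println(vocabSize)
```

On a real corpus, each distinct token becomes one dimension of the feature vectors, which is why even modest text collections produce very high-dimensional representations.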
Let's take a look at a randomly selected document:
println(whiteSpaceSplit.sample(true, 0.3, 42).take(100).mkString(","))
Tip
Note that the third parameter to the sample function is the random seed. We set it to 42 so that each call to sample returns the same results and your results match those in this chapter.
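The effect of fixing the seed can be sketched with scala.util.Random, which Spark's sample relies on the same principle as: identical seeds produce identical pseudorandom draws.

```scala
// Two generators seeded with the same value produce the same sequence,
// which is what makes a seeded sample reproducible across runs.
val a = new scala.util.Random(42).nextInt(100)
val b = new scala.util.Random(42).nextInt(100)

println(a == b) // true: same seed, same draw
```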
This will display the following result:
atheist,resources
summary:,addresses,,to,atheism
keywords:,music,,thu,,11:57:19,11:57:19,gmt
distribution:,cambridge.,290