Tip
Note that we imported MLlib's SparseVector using the alias SV. This is because
later we will use Breeze's linalg module, which brings its own SparseVector into
scope. Using the alias avoids a namespace collision.
The transform function of HashingTF maps each input document (that is, a
sequence of tokens) to an MLlib Vector. We will also call cache to pin the data in
memory to speed up subsequent operations.
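To make the mapping from tokens to a fixed-size vector more concrete, here is a minimal sketch of the hashing trick in plain Scala. It is not MLlib's implementation: the object HashingTfSketch and its methods indexOf and termFrequencies are hypothetical names used only for illustration, and the choice of Scala's built-in hash code is an assumption. The idea is simply that each term is hashed to a bucket index below the vector dimension, and the number of hits per bucket is the term frequency.

```scala
// A minimal sketch of the hashing trick, assuming Scala's built-in hash code.
// HashingTfSketch is a hypothetical helper, not part of MLlib.
object HashingTfSketch {

  // Map a term to a bucket index in [0, numFeatures).
  // The raw remainder can be negative, so shift it into range.
  def indexOf(term: String, numFeatures: Int): Int = {
    val raw = term.## % numFeatures
    if (raw < 0) raw + numFeatures else raw
  }

  // Count how often each bucket is hit by the document's tokens.
  // Only non-zero buckets are stored, which mirrors a sparse vector.
  def termFrequencies(doc: Seq[String], numFeatures: Int): Map[Int, Double] =
    doc
      .groupBy(term => indexOf(term, numFeatures))
      .map { case (index, terms) => index -> terms.size.toDouble }
}
```

Calling HashingTfSketch.termFrequencies(Seq("a", "b", "a"), 1 << 18) yields two non-zero buckets, one of which has a count of 2.0. Distinct terms can collide in the same bucket, which is the price paid for not having to build a vocabulary.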
Let's inspect the first element of our transformed dataset:
Tip
Note that HashingTF.transform returns an RDD[Vector], so we will cast the
result returned to an instance of an MLlib SparseVector.
The transform method can also work on an individual document by taking an
Iterable argument (for example, a document as a Seq[String]). This returns a single
vector.
val v = tf.first.asInstanceOf[SV]
println(v.size)
println(v.values.size)
println(v.values.take(10).toSeq)
println(v.indices.take(10).toSeq)
You will see the following output displayed:
262144
706
WrappedArray(1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 2.0, 1.0, 1.0)
WrappedArray(313, 713, 871, 1202, 1203, 1209, 1795, 1862, 3115, 3166)
We can see that the dimension of each sparse vector of term frequencies is 262,144 (or
2^18, as we specified). However, the number of non-zero entries in the vector is only 706.
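To put those two numbers side by side, the following short sketch computes the fraction of entries that are non-zero. The object SparsityCheck and its density method are hypothetical names introduced here; the values 262,144 and 706 are the ones reported in the output above.

```scala
// A small sketch for gauging how sparse the vector is.
// SparsityCheck is a hypothetical helper, not part of MLlib.
object SparsityCheck {

  // Fraction of entries that are non-zero.
  def density(nonZeros: Int, size: Int): Double =
    nonZeros.toDouble / size

  def main(args: Array[String]): Unit = {
    val size = 1 << 18   // 2^18 = 262144, the vector dimension
    val nonZeros = 706   // non-zero entries in the first document's vector
    val percent = density(nonZeros, size) * 100
    println(f"non-zero entries: $percent%.3f%% of the vector")
  }
}
```

The result is well under one percent, which is why a sparse representation that stores only the indices and values of the non-zero entries is a sensible choice here.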