Tip
Note that we imported MLlib's SparseVector using an alias of SV. This is because later, we will use Breeze's linalg module, which also provides a class named SparseVector. Using the alias avoids a namespace collision.
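As a rough illustration of why the alias helps, the sketch below uses two stand-in objects, MllibLike and BreezeLike, each defining its own SparseVector. These names and their fields are hypothetical and exist only for this example; they are not the real MLlib or Breeze classes. Renaming one class on import lets both be used in the same scope.

```scala
// Hypothetical stand-ins for two libraries that both define SparseVector.
object MllibLike {
  case class SparseVector(size: Int, indices: Array[Int], values: Array[Double])
}

object BreezeLike {
  case class SparseVector(length: Int, active: Map[Int, Double])
}

object AliasDemo {
  import MllibLike.{SparseVector => SV} // renamed on import, referred to as SV below
  import BreezeLike.SparseVector        // keeps its original name

  // Both classes are usable side by side without ambiguity.
  val a: SV = SV(4, Array(1), Array(2.0))
  val b: SparseVector = SparseVector(4, Map(1 -> 2.0))
}
```

Without the rename, importing both classes under the same name would make any unqualified reference to SparseVector ambiguous.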
The transform function of HashingTF maps each input document (that is, a sequence of tokens) to an MLlib Vector. We will also call cache to pin the data in memory to speed up subsequent operations.
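To make the mapping from tokens to a fixed-size vector more concrete, here is a minimal sketch of the hashing trick in plain Scala. It is not MLlib's implementation: the helper names indexOf and termFrequencies are chosen for this illustration, and the use of String.hashCode is an assumption, since the hash function MLlib applies may differ between versions.

```scala
object HashingSketch {
  // Map a term to a bucket in [0, numFeatures).
  // The adjustment keeps the index non-negative when hashCode is negative.
  def indexOf(term: String, numFeatures: Int): Int = {
    val raw = term.hashCode % numFeatures
    if (raw < 0) raw + numFeatures else raw
  }

  // Count how many tokens of one document fall into each bucket.
  // Returns (sorted bucket indices, matching counts): a sparse representation
  // in which every bucket not listed implicitly holds zero.
  def termFrequencies(doc: Seq[String], numFeatures: Int): (Array[Int], Array[Double]) = {
    val counts = doc
      .groupBy(term => indexOf(term, numFeatures))
      .map { case (index, terms) => (index, terms.size.toDouble) }
    val sorted = counts.toSeq.sortBy(_._1)
    (sorted.map(_._1).toArray, sorted.map(_._2).toArray)
  }
}
```

For example, HashingSketch.termFrequencies(Seq("spark", "mllib", "spark"), 1 << 18) yields at most two non-zero buckets whose counts sum to 3.0. Two different terms can land in the same bucket, which is the usual trade-off of feature hashing.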
Let's inspect the first element of our transformed dataset:
Tip
Note that HashingTF.transform returns an RDD[Vector], so we will cast the result returned to an instance of an MLlib SparseVector.
The transform method can also work on an individual document by taking an Iterable argument (for example, a document as a Seq[String]). This returns a single vector.
val v = tf.first.asInstanceOf[SV]
println(v.size)
println(v.values.size)
println(v.values.take(10).toSeq)
println(v.indices.take(10).toSeq)
You will see the following output displayed:
262144
706
WrappedArray(1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 2.0, 1.0, 1.0)
WrappedArray(313, 713, 871, 1202, 1203, 1209, 1795, 1862, 3115, 3166)
We can see that the dimension of each sparse vector of term frequencies is 262,144 (or 2^18, as we specified). However, the number of non-zero entries in the vector is only 706.
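The figures above can be checked with a few lines of arithmetic. The sketch below hard-codes the two numbers from the output (the object and value names are chosen for this illustration) and computes the fraction of entries that are non-zero, which comes to about 0.27%.

```scala
object SparsityCheck {
  val dimension: Int = 1 << 18 // 2^18 = 262144, the vector size printed above
  val nonZero: Int = 706       // number of stored values printed above

  // Fraction of the vector's entries that are actually populated.
  val density: Double = nonZero.toDouble / dimension

  def main(args: Array[String]): Unit = {
    println(dimension)
    println(f"${density * 100}%.3f%% of entries are non-zero")
  }
}
```

This is why a sparse representation is used: only the 706 index and value pairs are stored, instead of all 262,144 entries.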