Advanced Text Processing with Spark - Machine Learning with Spark

Tip

Note that we imported MLlib's SparseVector using an alias of SV. This is because later, we will use Breeze's linalg module, which also provides a class named SparseVector. Using the alias avoids a namespace collision between the two.

The transform function of HashingTF maps each input document (that is, a sequence of tokens) to an MLlib Vector. We will also call cache to pin the data in memory to speed up subsequent operations.
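To make the mapping from tokens to vector entries more concrete, here is a minimal plain-Scala sketch of the hashing trick that needs no Spark dependency. It is an illustration only, not MLlib's implementation: the object HashingSketch and its helpers nonNegativeMod and termFrequencies are hypothetical names, and the use of Scala's built-in ## hash is an assumption, so the indices it produces need not match those computed by HashingTF.

```scala
// Illustrative sketch of feature hashing; names here are hypothetical, not MLlib API.
object HashingSketch {
  // Dimension of the feature space: 2^18 = 262144
  val numFeatures: Int = 1 << 18

  // Map a (possibly negative) hash code to a bucket index in [0, mod)
  def nonNegativeMod(x: Int, mod: Int): Int = {
    val raw = x % mod
    if (raw < 0) raw + mod else raw
  }

  // Count how many tokens fall into each hashed bucket.
  // Returns (index, count) pairs sorted by index; only non-zero entries are kept,
  // which mirrors the indices/values layout of a sparse vector.
  def termFrequencies(tokens: Seq[String]): Seq[(Int, Double)] =
    tokens
      .groupBy(t => nonNegativeMod(t.##, numFeatures))
      .map { case (idx, ts) => (idx, ts.size.toDouble) }
      .toSeq
      .sortBy(_._1)
}
```

Calling HashingSketch.termFrequencies(Seq("spark", "hashing", "spark")) would return one entry per distinct bucket, with the counts summing to the number of tokens. Two different terms can land in the same bucket, which is the usual trade-off of feature hashing.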

Let's inspect the first element of our transformed dataset:

Tip

Note that HashingTF.transform returns an RDD[Vector], so we will cast the result returned to an instance of an MLlib SparseVector.

The transform method can also work on an individual document by taking an Iterable argument (for example, a document as a Seq[String]). This returns a single vector.

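// Take the first document's vector and cast it to MLlib's SparseVector (aliased as SV)
val v = tf.first.asInstanceOf[SV]
// Dimension of the feature space
println(v.size)
// Number of entries actually stored in the sparse vector
println(v.values.size)
// First ten stored term counts, followed by their hashed feature indices
println(v.values.take(10).toSeq)
println(v.indices.take(10).toSeq)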

You will see the following output displayed:

262144
706
WrappedArray(1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 2.0, 1.0, 1.0)
WrappedArray(313, 713, 871, 1202, 1203, 1209, 1795, 1862, 3115, 3166)

We can see that the dimension of each sparse vector of term frequencies is 262,144 (or 2^18, as we specified). However, the number of non-zero entries in the vector is only 706.
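As a quick check of these figures, the following plain-Scala sketch recomputes the dimension and the fraction of entries that are non-zero. The object name SparsityCheck is hypothetical, and the value 706 is simply copied from the output above.

```scala
// Illustrative arithmetic only; SparsityCheck is a hypothetical helper, not part of MLlib.
object SparsityCheck {
  // 2^18 expressed as a bit shift
  val dimension: Int = 1 << 18

  // Number of non-zero entries reported for the first document
  val nonZeros: Int = 706

  // Fraction of the vector's entries that are non-zero
  val density: Double = nonZeros.toDouble / dimension

  def main(args: Array[String]): Unit = {
    println(dimension)
    println(density)
  }
}
```

The density works out to well under one percent, which is why a sparse representation that stores only indices and values is far more compact here than a dense array of 262,144 doubles.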
