In a real pipeline, you will likely need to preprocess and stem words in a document before passing them to TF. For example, you might convert all words to lowercase, drop punctuation characters, and drop suffixes like ing. For best results you can call a single-
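The cleanup steps listed above (lowercasing, dropping punctuation, and dropping a suffix such as ing) can be sketched in plain Python. This is only an illustration: the preprocess function, its suffixes parameter, and the rule that at least three characters must remain after stripping are assumptions made for this sketch, and a proper stemmer handles far more cases.

```python
import string

def preprocess(text, suffixes=("ing",)):
    """Lowercase, strip ASCII punctuation, and drop simple suffixes (hypothetical helper)."""
    # Lowercase the text and delete every ASCII punctuation character.
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    words = []
    for word in cleaned.split():
        for suffix in suffixes:
            # Strip the suffix only if at least three characters remain,
            # so that short words such as "sing" are left alone.
            if word.endswith(suffix) and len(word) - len(suffix) >= 3:
                word = word[: -len(suffix)]
                break
        words.append(word)
    return words

print(preprocess("Reading, parsing and Testing the data!"))
# ['read', 'pars', 'and', 'test', 'the', 'data']
```

In a Spark job, a function like this could stand in for the plain text.split() call when mapping over the documents.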
Once you have built term frequency vectors, you can use IDF to compute the inverse document frequencies, and multiply them by the term frequencies to compute the TF-IDF. You first call fit() on an IDF object to obtain an IDFModel representing the inverse document frequencies in the corpus, then call transform() on the model to transform TF vectors into TF-IDF vectors. Example 11-8 shows how you would compute TF-IDF starting with Example 11-7.
Example 11-8. Using TF-IDF in Python
from pyspark.mllib.feature import HashingTF, IDF

# Read a set of text files as TF vectors (sc is an existing SparkContext).
# wholeTextFiles() yields (name, text) pairs; only the text is split into words.
rdd = sc.wholeTextFiles("data").map(lambda name_text: name_text[1].split())
tf = HashingTF()
tfVectors = tf.transform(rdd).cache()

# Compute the IDF, then the TF-IDF vectors
idf = IDF()
idfModel = idf.fit(tfVectors)
tfIdfVectors = idfModel.transform(tfVectors)
Note that we called cache() on the tfVectors RDD because it is used twice (once to train the IDF model, and once to multiply the TF vectors by the IDF).
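To make the arithmetic behind fit() and transform() concrete, here is a minimal plain-Python sketch that does not use Spark. It keeps terms as dictionary keys instead of hashing them into vector indices, and it assumes the smoothed formula log((m + 1) / (df + 1)), where m is the number of documents and df is the number of documents containing the term. The tf_idf function and the sample documents are made up for illustration.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs is a list of token lists; returns one {term: weight} dict per document."""
    num_docs = len(docs)
    # "fit" step: count how many documents contain each term.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    # Smoothed inverse document frequency (assumed formula, see the lead-in).
    idf = {t: math.log((num_docs + 1) / (df + 1)) for t, df in doc_freq.items()}
    # "transform" step: multiply each raw term count by the term's IDF.
    result = []
    for doc in docs:
        tf = Counter(doc)
        result.append({t: count * idf[t] for t, count in tf.items()})
    return result

docs = [["spark", "is", "fast"],
        ["spark", "is", "spark"],
        ["hadoop", "is", "slow"]]
weights = tf_idf(docs)
print(weights[1])
```

A term such as "is", which appears in every document, receives a weight of zero, while terms concentrated in a few documents are weighted more heavily.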
Scaling
Most machine learning algorithms consider the magnitude of each element in the feature vector, and thus work best when the features are scaled so they weigh equally (e.g., all features have a mean of 0 and standard deviation of 1). Once you have built feature vectors, you can use the StandardScaler class in MLlib to do this scaling, both for the mean and the standard deviation. You create a StandardScaler, call fit() on a dataset to obtain a StandardScalerModel (i.e., compute the mean and variance of each column), and then call transform() on the model to scale a dataset. Example 11-9 demonstrates.
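As a rough illustration of what the fit and transform steps compute, the following plain-Python sketch standardizes each column without using Spark. The helper names fit_scaler and transform_rows are hypothetical. The use of the sample standard deviation (dividing by n - 1) and the mapping of constant columns to 0.0 are assumptions chosen for this sketch, and the input needs at least two rows.

```python
import math

def fit_scaler(rows):
    """Return the per-column means and sample standard deviations."""
    n = len(rows)
    cols = list(zip(*rows))  # transpose rows into columns
    means = [sum(col) / n for col in cols]
    stds = [math.sqrt(sum((x - m) ** 2 for x in col) / (n - 1))
            for col, m in zip(cols, means)]
    return means, stds

def transform_rows(rows, means, stds):
    """Subtract the mean and divide by the standard deviation, column by column."""
    # Constant columns (std == 0) are mapped to 0.0 to avoid dividing by zero.
    return [[(x - m) / s if s > 0 else 0.0
             for x, m, s in zip(row, means, stds)]
            for row in rows]

rows = [[-2.0, 5.0, 1.0], [2.0, 0.0, 1.0]]
means, stds = fit_scaler(rows)
scaled = transform_rows(rows, means, stds)
print(means)   # [0.0, 2.5, 1.0]
print(scaled)
```

After scaling, the first two columns have mean 0 and unit standard deviation, and the constant third column becomes all zeros.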
Example 11-9. Scaling vectors in Python
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.feature import StandardScaler

vectors = [Vectors.dense([-2.0, 5.0, 1.0]), Vectors.dense([2.0, 0.0, 1.0])]