dataset = sc.parallelize(vectors)
scaler = StandardScaler(withMean=True, withStd=True)
model = scaler.fit(dataset)
result = model.transform(dataset)
# Result: {[-0.7071, 0.7071, 0.0], [0.7071, -0.7071, 0.0]}
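For reference, with withMean and withStd both enabled, StandardScaler shifts each column to zero mean and scales it to unit standard deviation. A pure-Python sketch of that per-column computation (the input rows are made up for illustration, not the vectors variable above, and the unbiased sample standard deviation is assumed):

```python
# Per-column standardization: (x - column mean) / column sample std.
# The input rows below are hypothetical, for illustration only.
def standardize(vectors):
    cols = list(zip(*vectors))  # transpose rows into columns
    n = len(vectors)
    means = [sum(c) / n for c in cols]
    stds = [(sum((x - m) ** 2 for x in c) / (n - 1)) ** 0.5
            for c, m in zip(cols, means)]
    return [[(x - m) / s if s > 0 else 0.0
             for x, m, s in zip(row, means, stds)]
            for row in vectors]

rows = [[1.0, 2.0], [3.0, 4.0]]
print(standardize(rows))  # each column becomes [-0.7071..., 0.7071...]
```

With only two rows, every column standardizes to ±1/√2 ≈ ±0.7071, which is why the values in the result above take that form.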
Normalization
In some situations, normalizing vectors to length 1 is also useful to prepare input data. The Normalizer class allows this. Simply use Normalizer().transform(rdd).
By default, Normalizer uses the L2 norm (i.e., Euclidean length), but you can also pass a power p to Normalizer to use the Lp norm.
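To make the Lp normalization concrete, here is a small pure-Python sketch of the computation (the sample vector is made up for illustration; this shows the math only, not the MLlib implementation):

```python
# L_p normalization: divide each element by the vector's L_p norm,
# so the normalized vector has L_p length 1.
# The sample vector below is hypothetical, chosen for round numbers.
def normalize(vector, p=2.0):
    norm = sum(abs(x) ** p for x in vector) ** (1.0 / p)
    return [x / norm for x in vector]

v = [3.0, 4.0]
print(normalize(v))          # L2 norm is 5.0, so this yields [0.6, 0.8]
print(normalize(v, p=1.0))   # L1 norm is 7.0
```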
Word2Vec
Word2Vec3 is a featurization algorithm for text based on neural networks, whose output can be used to feed data into many downstream algorithms. Spark includes an implementation of it in the mllib.feature.Word2Vec class.
To train Word2Vec, you need to pass it a corpus of documents, represented as Iterables of Strings (one per word). Much like in "TF-IDF" on page 221, it is recommended to normalize your words (e.g., mapping them to lowercase and removing punctuation and numbers). Once you have trained the model (with Word2Vec.fit(rdd)), you will receive a Word2VecModel that can be used to transform() each word into a vector. Note that the size of the models in Word2Vec will be equal to the number of words in your vocabulary times the size of a vector (by default, 100). You may wish to filter out words that are not in a standard dictionary to limit the size. In general, a good size for the vocabulary is 100,000 words.
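The model-size rule of thumb above is easy to check with quick arithmetic; the sketch below just plugs in the numbers from the text (the 4-bytes-per-value assumption is ours, for a rough single-precision estimate):

```python
# Word2Vec model size ≈ vocabulary size × vector size (values stored).
vocab_size = 100_000   # the suggested vocabulary cap from the text
vector_size = 100      # MLlib's default vector size
num_values = vocab_size * vector_size
print(num_values)                  # 10,000,000 stored values
print(num_values * 4 / 2 ** 20)    # roughly 38 MB, assuming 4 bytes each
```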
Statistics
Basic statistics are an important part of data analysis, both in ad hoc exploration and understanding data for machine learning. MLlib offers several widely used statistical functions that work directly on RDDs, through methods in the mllib.stat.Statistics class. Some commonly used ones include:
Statistics.colStats(rdd)
Computes a statistical summary of an RDD of vectors, which stores the min, max, mean, and variance for each column in the set of vectors. This can be used to obtain a wide variety of statistics in one pass.
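As a sketch of what such a column summary computes, here is a plain-Python version over a list of vectors (the data is hypothetical; colStats runs this distributed over an RDD, and the unbiased sample variance is assumed here):

```python
# Hypothetical per-column summary mirroring min, max, mean, and variance.
# Variance uses the n-1 denominator (unbiased sample variance) — an assumption.
def col_stats(vectors):
    cols = list(zip(*vectors))  # transpose rows into columns
    n = len(vectors)
    summary = []
    for col in cols:
        mean = sum(col) / n
        var = sum((x - mean) ** 2 for x in col) / (n - 1)
        summary.append({"min": min(col), "max": max(col),
                        "mean": mean, "variance": var})
    return summary

data = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]  # made-up vectors
print(col_stats(data)[0])  # column 0: min 1.0, max 3.0, mean 2.0, variance 1.0
```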
3. Introduced in Mikolov et al., "Efficient Estimation of Word Representations in Vector Space," 2013.