from pyspark.mllib.feature import StandardScaler

dataset = sc.parallelize(vectors)  # vectors: the mllib Vectors built earlier
scaler = StandardScaler(withMean=True, withStd=True)
model = scaler.fit(dataset)        # computes per-column mean and variance
result = model.transform(dataset)  # subtracts the mean and divides by the standard deviation
# Result: {[-0.7071, 0.7071, 0.0], [0.7071, -0.7071, 0.0]}
Normalization
In some situations, normalizing vectors to length 1 is also useful to prepare input
data. The Normalizer class allows this: simply use Normalizer().transform(rdd).
By default, Normalizer uses the L2 norm (i.e., Euclidean length), but you can also
pass a power p to Normalizer to use the Lp norm.
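For example, here is a minimal sketch of both variants (it assumes an existing SparkContext sc; the single two-element vector is just illustrative):

from pyspark.mllib.feature import Normalizer
from pyspark.mllib.linalg import Vectors

rdd = sc.parallelize([Vectors.dense([3.0, 4.0])])

# Default: L2 (Euclidean) normalization -> [0.6, 0.8]
l2 = Normalizer().transform(rdd)

# Pass a power p for the Lp norm; L1 here -> [3/7, 4/7]
l1 = Normalizer(p=1.0).transform(rdd)

print(l2.collect(), l1.collect())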
Word2Vec
Word2Vec³ is a featurization algorithm for text based on neural networks that can be
used to feed data into many downstream algorithms. Spark includes an implementation
of it in the mllib.feature.Word2Vec class.
To train Word2Vec, you need to pass it a corpus of documents, represented as
Iterables of Strings (one per word). Much like in “TF-IDF” on page 221, it is
recommended to normalize your words (e.g., mapping them to lowercase and removing
punctuation and numbers). Once you have trained the model (with
Word2Vec.fit(rdd)), you will receive a Word2VecModel that can be used to
transform() each word into a vector. Note that the size of the models in Word2Vec will be
equal to the number of words in your vocabulary times the size of a vector (by
default, 100). You may wish to filter out words that are not in a standard dictionary
to limit the size. In general, a good size for the vocabulary is 100,000 words.
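As a minimal sketch of the call sequence (assuming an existing SparkContext sc; the toy corpus is far too small for meaningful vectors and setMinCount(1) is used only so its rare words stay in the vocabulary):

from pyspark.mllib.feature import Word2Vec

# One document per element; each document is a list of normalized words
corpus = sc.parallelize([
    "spark makes big data processing simple".split(" "),
    "mllib brings machine learning to spark".split(" "),
])

# setMinCount(1) keeps rare words; by default words seen fewer than 5 times are dropped
model = Word2Vec().setVectorSize(100).setMinCount(1).fit(corpus)

vec = model.transform("spark")            # 100-dimensional vector for one word
similar = model.findSynonyms("spark", 2)  # (word, cosine similarity) pairs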
Statistics
Basic statistics are an important part of data analysis, both in ad hoc exploration and
in understanding data for machine learning. MLlib offers several widely used statistics
functions that work directly on RDDs, through methods in the mllib.stat.Statistics
class. Some commonly used ones include:
Statistics.colStats( rdd )
Computes a statistical summary of an RDD of vectors, which stores the min,
max, mean, and variance for each column in the set of vectors. This can be used
to obtain a wide variety of statistics in one pass.
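For instance, a minimal sketch of colStats (assuming an existing SparkContext sc and a small RDD of dense vectors):

from pyspark.mllib.linalg import Vectors
from pyspark.mllib.stat import Statistics

rdd = sc.parallelize([
    Vectors.dense([1.0, 10.0, 100.0]),
    Vectors.dense([2.0, 20.0, 200.0]),
    Vectors.dense([3.0, 30.0, 300.0]),
])

summary = Statistics.colStats(rdd)   # one pass over the data
print(summary.mean())       # per-column means: [2.0, 20.0, 200.0]
print(summary.variance())   # per-column variances: [1.0, 100.0, 10000.0]
print(summary.min(), summary.max())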
3 Introduced in Mikolov et al., “Efficient Estimation of Word Representations in Vector Space,” 2013.
 