dataset = sc.parallelize(vectors)
scaler = StandardScaler(withMean=True, withStd=True)
model = scaler.fit(dataset)
result = model.transform(dataset)
# Result: {[-0.7071, 0.7071, 0.0], [0.7071, -0.7071, 0.0]}
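For reference, with withMean and withStd both enabled, StandardScaler shifts each column to zero mean and scales it to unit standard deviation. A pure-Python sketch of that per-column computation (the input rows are made up for illustration, not the vectors variable above, and the unbiased sample standard deviation is assumed):

```python
# Per-column standardization: (x - column mean) / column sample std.
# The input rows below are hypothetical, for illustration only.
def standardize(vectors):
    cols = list(zip(*vectors))  # transpose rows into columns
    n = len(vectors)
    means = [sum(c) / n for c in cols]
    stds = [(sum((x - m) ** 2 for x in c) / (n - 1)) ** 0.5
            for c, m in zip(cols, means)]
    return [[(x - m) / s if s > 0 else 0.0
             for x, m, s in zip(row, means, stds)]
            for row in vectors]

rows = [[1.0, 2.0], [3.0, 4.0]]
print(standardize(rows))  # each column becomes [-0.7071..., 0.7071...]
```

With only two rows, every column standardizes to ±1/√2 ≈ ±0.7071, which is why the values in the result above take that form.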
Normalization
In some situations, normalizing vectors to length 1 is also useful to prepare input data. The Normalizer class allows this. Simply use Normalizer().transform(rdd).
By default, Normalizer uses the L2 norm (i.e., Euclidean length), but you can also pass a power p to Normalizer to use the Lp norm.
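To make the Lp normalization concrete, here is a small pure-Python sketch of the computation (the sample vector is made up for illustration; this shows the math only, not the MLlib implementation):

```python
# L_p normalization: divide each element by the vector's L_p norm,
# so the normalized vector has L_p length 1.
# The sample vector below is hypothetical, chosen for round numbers.
def normalize(vector, p=2.0):
    norm = sum(abs(x) ** p for x in vector) ** (1.0 / p)
    return [x / norm for x in vector]

v = [3.0, 4.0]
print(normalize(v))          # L2 norm is 5.0, so this yields [0.6, 0.8]
print(normalize(v, p=1.0))   # L1 norm is 7.0
```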
Word2Vec
Word2Vec3 is a featurization algorithm for text based on neural networks, whose output can be used to feed data into many downstream algorithms. Spark includes an implementation of it in the mllib.feature.Word2Vec class.
To train Word2Vec, you need to pass it a corpus of documents, represented as Iterables of Strings (one per word). Much like in "TF-IDF" on page 221, it is recommended to normalize your words (e.g., mapping them to lowercase and removing punctuation and numbers). Once you have trained the model (with Word2Vec.fit(rdd)), you will receive a Word2VecModel that can be used to transform() each word into a vector. Note that the size of the models in Word2Vec will be equal to the number of words in your vocabulary times the size of a vector (by default, 100). You may wish to filter out words that are not in a standard dictionary to limit the size. In general, a good size for the vocabulary is 100,000 words.
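The model-size rule of thumb above is easy to check with quick arithmetic; the sketch below just plugs in the numbers from the text (the 4-bytes-per-value assumption is ours, for a rough single-precision estimate):

```python
# Word2Vec model size ≈ vocabulary size × vector size (values stored).
vocab_size = 100_000   # the suggested vocabulary cap from the text
vector_size = 100      # MLlib's default vector size
num_values = vocab_size * vector_size
print(num_values)                  # 10,000,000 stored values
print(num_values * 4 / 2 ** 20)    # roughly 38 MB, assuming 4 bytes each
```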
Statistics
Basic statistics are an important part of data analysis, both in ad hoc exploration and understanding data for machine learning. MLlib offers several widely used statistical functions that work directly on RDDs, through methods in the mllib.stat.Statistics class. Some commonly used ones include:
Statistics.colStats(rdd)
Computes a statistical summary of an RDD of vectors, which stores the min, max, mean, and variance for each column in the set of vectors. This can be used to obtain a wide variety of statistics in one pass.
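As a sketch of what such a column summary computes, here is a plain-Python version over a list of vectors (the data is hypothetical; colStats runs this distributed over an RDD, and the unbiased sample variance is assumed here):

```python
# Hypothetical per-column summary mirroring min, max, mean, and variance.
# Variance uses the n-1 denominator (unbiased sample variance) — an assumption.
def col_stats(vectors):
    cols = list(zip(*vectors))  # transpose rows into columns
    n = len(vectors)
    summary = []
    for col in cols:
        mean = sum(col) / n
        var = sum((x - mean) ** 2 for x in col) / (n - 1)
        summary.append({"min": min(col), "max": max(col),
                        "mean": mean, "variance": var})
    return summary

data = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]  # made-up vectors
print(col_stats(data)[0])  # column 0: min 1.0, max 3.0, mean 2.0, variance 1.0
```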
3. Introduced in Mikolov et al., "Efficient Estimation of Word Representations in Vector Space," 2013.