Preparing Features
Feature preparation is the most important step in large-scale learning. Both adding
more informative features (e.g., joining with other datasets to bring in more
information) and converting the available features to suitable vector representations
(e.g., scaling the vectors) can yield major improvements in results.
A full discussion of feature preparation is beyond the scope of this book, but we
encourage you to refer to other texts on machine learning for more information.
However, with MLlib in particular, some common tips to follow are:
• Scale your input features. Run features through StandardScaler as described in
“Scaling” on page 222 to weigh features equally (see the sketch after this list).
• Featurize text correctly. Use an external library like NLTK to stem words, and
use IDF across a representative corpus for TF-IDF.
• Label classes correctly. MLlib requires class labels to be 0 to C-1, where C is the
total number of classes.
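As a minimal sketch of the scaling tip, assuming an RDD of LabeledPoint records named data that you have built elsewhere (the scaleFeatures helper is purely illustrative), standardization with the RDD-based MLlib API in Scala might look like this:

import org.apache.spark.mllib.feature.StandardScaler
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// Standardize each feature; use withMean = false if your vectors are sparse,
// since mean-centering would make them dense.
def scaleFeatures(data: RDD[LabeledPoint]): RDD[LabeledPoint] = {
  val scaler = new StandardScaler(withMean = true, withStd = true)
    .fit(data.map(_.features))   // computes per-feature mean and variance
  // Labels are left as-is; they should already be 0.0, 1.0, ..., C-1
  data.map(p => LabeledPoint(p.label, scaler.transform(p.features)))
}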
Configuring Algorithms
Most algorithms in MLlib perform better (in terms of prediction accuracy) with
regularization when that option is available. Also, most of the SGD-based algorithms
require around 100 iterations to get good results. MLlib attempts to provide useful
default values, but you should try increasing the number of iterations past the default
to see whether it improves accuracy. For example, with ALS, the default rank of 10 is
fairly low, so you should try increasing this value. Make sure to evaluate these
parameter changes on test data held out during training.
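For instance, here is a rough Scala sketch of raising ALS's rank and iteration count and then measuring error on a held-out split. The trainAndEvaluate helper, the 80/20 split, and the rank, iteration, and lambda values are illustrative assumptions, not recommendations:

import org.apache.spark.mllib.recommendation.{ALS, Rating}
import org.apache.spark.rdd.RDD

def trainAndEvaluate(ratings: RDD[Rating]): Double = {
  // Hold out 20% of the ratings for evaluation.
  val Array(training, test) = ratings.randomSplit(Array(0.8, 0.2), seed = 42)

  // Try a larger rank and more iterations than the defaults.
  val model = ALS.train(training, /* rank = */ 50, /* iterations = */ 20, /* lambda = */ 0.01)

  // Predict ratings for the held-out (user, product) pairs.
  val predictions = model
    .predict(test.map(r => (r.user, r.product)))
    .map(p => ((p.user, p.product), p.rating))
  val actuals = test.map(r => ((r.user, r.product), r.rating))

  // Root-mean-squared error on the test set, not the training set.
  math.sqrt(actuals.join(predictions).values
    .map { case (a, p) => (a - p) * (a - p) }
    .mean())
}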
Caching RDDs to Reuse
Most algorithms in MLlib are iterative, going over the data multiple times. Thus, it is
important to cache() your input datasets before passing them to MLlib. Even if your
data does not fit in memory, try persist(StorageLevel.DISK_ONLY).
In Python, MLlib automatically caches RDDs on the Java side when you pass them
from Python, so there is no need to cache your Python RDDs unless you reuse them
within your program. In Scala and Java, however, it is up to you to cache them.
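A minimal Scala sketch of this pattern, using LogisticRegressionWithSGD purely as an example of an iterative algorithm (the train wrapper itself is hypothetical):

import org.apache.spark.mllib.classification.{LogisticRegressionModel, LogisticRegressionWithSGD}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// Persist the training set before handing it to an iterative MLlib algorithm,
// so each iteration does not recompute it from scratch.
def train(points: RDD[LabeledPoint]): LogisticRegressionModel = {
  points.persist(StorageLevel.MEMORY_ONLY)   // or DISK_ONLY if it won't fit in memory
  val model = LogisticRegressionWithSGD.train(points, /* numIterations = */ 100)
  points.unpersist()
  model
}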
Recognizing Sparsity
When your feature vectors contain mostly zeros, storing them in sparse format can
result in huge time and space savings for big datasets. In terms of space, MLlib's
sparse representation is smaller than its dense one if at most two-thirds of the entries
are nonzero. In terms of processing cost, sparse vectors are generally cheaper to
compute on if at most 10% of the entries are nonzero. (This threshold is lower because
per-element operations on sparse vectors carry extra overhead for index bookkeeping.)
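As an illustration (the 100-dimensional vector and its nonzero positions are made up for the example), the same vector can be built densely or sparsely in Scala:

import org.apache.spark.mllib.linalg.Vectors

// Dense: stores all 100 values, including the 97 zeros.
val dense = Vectors.dense(Array.tabulate(100)(i =>
  if (i == 0 || i == 10 || i == 99) 1.0 else 0.0))

// Sparse: stores only the size plus the indices and values of the nonzero entries.
val sparse = Vectors.sparse(100, Array(0, 10, 99), Array(1.0, 1.0, 1.0))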