Designing a Machine Learning System - Machine Learning with Spark

types, this representation will take the form of a vector or matrix structure that contains

numerical data. Common challenges during data transformation and feature extraction

include:

• Taking categorical data (such as country for geolocation or category for a movie)

and encoding it in a numerical representation.

• Extracting useful features from text data.

• Dealing with image or audio data.

• Converting numerical data into categorical data to reduce the number of

values a variable can take on; for example, converting a variable for age

into buckets (such as 25-35, 45-55, and so on).

• Transforming numerical features; for example, applying a log transformation to a

numerical variable can help deal with variables that take on a very large range of

values.

• Normalizing and standardizing numerical features to ensure that all the different

input variables for a model have a consistent scale. Many machine learning models

require standardized input to work properly.

• Feature engineering, the process of combining or transforming the existing

variables to create new features. For example, we can create a new variable that is the

average of some other data, such as the average number of times a user watches a

movie.
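
To make the first and fourth items above concrete, here is a minimal sketch in plain Python (not Spark) of encoding a categorical field as a one-hot vector and of bucketing an age value. The sample records, the helper names (one_hot_index, one_hot, age_bucket), and the bucket boundaries are all assumptions made for illustration; they do not come from the text.

```python
from bisect import bisect_right

# Hypothetical sample records; field names and values are illustrative only.
records = [
    {"country": "US", "age": 23},
    {"country": "FR", "age": 41},
    {"country": "US", "age": 67},
]


def one_hot_index(values):
    """Assign each distinct category a fixed position (sorted for determinism)."""
    return {v: i for i, v in enumerate(sorted(set(values)))}


def one_hot(value, index):
    """Return a 0/1 vector with a single 1.0 at the category's position."""
    vec = [0.0] * len(index)
    vec[index[value]] = 1.0
    return vec


# Assumed bucket boundaries: [0, 25), [25, 45), [45, 65), [65, ...)
AGE_EDGES = [25, 45, 65]


def age_bucket(age):
    """Return the index of the bucket that contains `age`."""
    return bisect_right(AGE_EDGES, age)


country_index = one_hot_index(r["country"] for r in records)
country_vectors = [one_hot(r["country"], country_index) for r in records]
age_buckets = [age_bucket(r["age"]) for r in records]
```

The bucket index is itself a categorical value, so it could be passed through the same one-hot step before being handed to a model.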

We will cover all of these techniques through the examples in this topic.
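
As a rough illustration of the numerical items in the list above (log transformation, standardization, and a simple derived feature), the following sketch uses only the Python standard library rather than Spark. The data values, the user names, and the standardize helper are hypothetical, and the population standard deviation is an assumed choice rather than one the text prescribes.

```python
import math
from statistics import mean, pstdev

# Hypothetical raw values spanning several orders of magnitude.
views = [1, 10, 100, 1000]

# log1p(x) = log(1 + x) compresses the range and is defined at x = 0.
log_views = [math.log1p(v) for v in views]


def standardize(xs):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(xs), pstdev(xs)
    if sigma == 0:
        # A constant feature carries no scale information; map it to zeros.
        return [0.0 for _ in xs]
    return [(x - mu) / sigma for x in xs]


scaled = standardize(log_views)

# Derived feature: average number of times each user watched a movie,
# computed from hypothetical per-movie watch counts.
watch_counts = {"alice": [3, 1, 2], "bob": [5]}
avg_watches = {user: mean(counts) for user, counts in watch_counts.items()}
```

Applying the log transformation before standardizing keeps the largest raw value from dominating the mean and standard deviation.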

These data-cleansing, exploration, aggregation, and transformation steps can be carried

out using Spark's core API functions and the SparkSQL engine, as well as

external Scala, Java, or Python libraries. We can also take advantage of Spark's Hadoop

compatibility to read data from and write data to the storage systems

mentioned earlier.
