types, this representation will take the form of a vector or matrix structure that contains
numerical data. Common challenges during data transformation and feature extraction
include:
• Taking categorical data (such as country for geolocation or category for a movie)
and encoding it in a numerical representation.
• Extracting useful features from text data.
• Dealing with image or audio data.
• Converting numerical data into categorical data to reduce the number of values a
variable can take on. An example of this is converting a variable for age into
buckets (such as 25-35, 45-55, and so on).
• Transforming numerical features; for example, applying a log transformation to a
numerical variable can help deal with variables that take on a very large range of
values.
• Normalizing and standardizing numerical features to ensure that all the input
variables for a model have a consistent scale. Many machine learning models
require standardized input to work properly.
• Feature engineering, that is, combining or transforming the existing variables
to create new features. For example, we can create a new variable that is an
average computed from other data, such as the average number of times a user
watches a movie.
We will cover all of these techniques through the examples in this topic.
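As a rough illustration of several items in the list above, the following is a minimal sketch in plain Python rather than Spark. The helper names (one_hot, bucket_age, log_transform, standardize), the bucket edges, and the sample values are all hypothetical and chosen only for this example.

```python
import math
from statistics import mean, pstdev

def one_hot(value, categories):
    """Encode a categorical value as a 1-of-k binary vector."""
    return [1.0 if value == c else 0.0 for c in categories]

def bucket_age(age, edges=(25, 35, 45, 55)):
    """Map a numerical age to a bucket index (hypothetical edges).

    Index 0 means below the first edge; index len(edges) means at or
    above the last edge.
    """
    return sum(1 for e in edges if age >= e)

def log_transform(x):
    """Compress a wide-ranging non-negative value using log(1 + x)."""
    return math.log1p(x)

def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant feature carries no scale information.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

# Illustrative inputs (assumed, not taken from any dataset in the text).
countries = ["DE", "FR", "US"]
country_vec = one_hot("FR", countries)
age_bucket = bucket_age(30)
scaled = standardize([10.0, 20.0, 30.0])

# A simple engineered feature: average number of views per movie for one user.
views_per_movie = [1, 3, 2]
avg_views = mean(views_per_movie)
```

In Spark, the same functions could be applied to each record with a map over the dataset; keeping them as small pure functions makes that step straightforward.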
These data-cleansing, exploration, aggregation, and transformation steps can be carried
out using Spark's core API functions, the SparkSQL engine, or external Scala, Java, or
Python libraries. We can take advantage of Spark's Hadoop compatibility to read data
from and write data to the various storage systems mentioned earlier.