Data cleansing and transformation
The majority of machine learning models operate on features, which are typically numerical
representations of the input variables that will be used for the model.
While we might want to spend the majority of our time exploring machine learning models,
data collected via various systems and sources in the preceding ingestion step is, in most
cases, in a raw form. For example, we might log user events such as details of when a user
views the information page for a movie, when they watch a movie, or when they provide
some other feedback. We might also collect external information such as the location of the
user (as provided through their IP address, for example). These event logs will typically
contain some combination of textual and numeric information about the event (and also,
perhaps, other forms of data such as images or audio).
To use this raw data in our models, we almost always need to perform preprocessing,
which might include:
Filtering data: We might want to create a model from only a subset of the
raw data, such as only the most recent few months of activity data or only events
that match certain criteria.
Dealing with missing, incomplete, or corrupted data: Many real-world datasets
are incomplete in some way. This might include data that is missing (for example,
due to a missing user input) or data that is incorrect or flawed (for example, due to
an error in data ingestion or storage, technical issues or bugs, or software or
hardware failure). We might need to filter out bad data or, alternatively, decide on a
method to fill in missing data points (such as using the average value from the
dataset).
Dealing with potential anomalies, errors, and outliers: Erroneous or outlier data
might skew the results of model training, so we might wish to filter these cases out
or use techniques that are robust to outliers.
Joining together disparate data sources: For example, we might need to match
up the event data for each user with different internal data sources, such as user
profiles, as well as external data, such as geolocation, weather, and economic data.
Aggregating data: Certain models might require input data that is aggregated in
some way, such as computing the sum of a number of different event types per
user.
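The preprocessing steps listed above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the method of any particular system: the record layout (user_id, ts, watch_minutes), the profiles lookup, the cutoff date, the plausibility cap, and the preprocess function itself are all hypothetical names and values chosen for the example.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean

# Hypothetical raw event log; field names and values are assumptions.
events = [
    {"user_id": 1, "ts": "2024-01-10", "watch_minutes": 50.0},      # too old
    {"user_id": 1, "ts": "2024-04-02", "watch_minutes": 90.0},
    {"user_id": 1, "ts": "2024-04-09", "watch_minutes": None},      # missing
    {"user_id": 2, "ts": "2024-04-15", "watch_minutes": 110.0},
    {"user_id": 2, "ts": "2024-04-20", "watch_minutes": 100000.0},  # implausible
    {"user_id": 3, "ts": "2024-05-01", "watch_minutes": 100.0},     # no profile
]

# Hypothetical second data source to join against.
profiles = {1: {"country": "NZ"}, 2: {"country": "US"}}


def preprocess(events, profiles, cutoff, max_minutes):
    # 1. Filtering: keep only events on or after the cutoff date.
    recent = [
        e for e in events
        if datetime.strptime(e["ts"], "%Y-%m-%d") >= cutoff
    ]

    # 2. Outliers and errors: drop values outside a plausible range.
    #    Missing values pass through here and are handled in step 3.
    plausible = [
        e for e in recent
        if e["watch_minutes"] is None or 0 <= e["watch_minutes"] <= max_minutes
    ]

    # 3. Missing data: fill gaps with the mean of the observed values.
    observed = [
        e["watch_minutes"] for e in plausible
        if e["watch_minutes"] is not None
    ]
    fill = mean(observed) if observed else 0.0
    cleaned = [
        {**e, "watch_minutes": fill if e["watch_minutes"] is None
         else e["watch_minutes"]}
        for e in plausible
    ]

    # 4. Joining: attach profile fields; users without a profile get None.
    joined = [
        {**e, **profiles.get(e["user_id"], {"country": None})}
        for e in cleaned
    ]

    # 5. Aggregating: event count and total watch time per user.
    totals = defaultdict(
        lambda: {"events": 0, "watch_minutes": 0.0, "country": None}
    )
    for e in joined:
        agg = totals[e["user_id"]]
        agg["events"] += 1
        agg["watch_minutes"] += e["watch_minutes"]
        agg["country"] = e["country"]
    return dict(totals)


features = preprocess(events, profiles, datetime(2024, 3, 1), 600.0)
for user_id, row in sorted(features.items()):
    print(user_id, row)
```

The order of the steps is itself a design choice: here the implausible value is removed before the mean is computed, so that it does not distort the value used to fill in missing points.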
Once we have performed initial preprocessing on our data, we often need to transform the
data into a representation that is suitable for machine learning models. For many model