Processing and transforming your data
Now that we have done some initial exploratory analysis of our dataset and we know a
little more about the characteristics of our users and movies, what do we do next?
In order to make the raw data usable in a machine learning algorithm, we first need to clean
it up and possibly transform it in various ways before extracting useful features from the
transformed data. The transformation and feature extraction steps are closely linked, and in
some cases, a transformation is itself a form of feature extraction.
We have already seen an example of the need to clean data in the movie dataset. Generally,
real-world datasets contain bad data, missing data points, and outliers. Ideally, we would
correct bad data; however, this is often not possible, as many datasets derive from some
form of collection process that cannot be repeated (this is the case, for example, in web
activity data and sensor data). Missing values and outliers are also common and can be
dealt with in a manner similar to bad data. Overall, the broad options are as follows:
Filter out or remove records with bad or missing values: This is sometimes
unavoidable; however, it means losing the good part of a record that is only partly
bad or missing.
Fill in bad or missing data: We can try to assign a value to bad or missing data
based on the rest of the data we have available. Approaches include assigning
a zero value, assigning the global mean or median, interpolating from nearby or
similar data points (usually in a time-series dataset), and so on. Deciding on the
correct approach is often tricky and depends on the data, the situation, and one's
own experience.
Apply robust techniques to outliers: The main issue with outliers is that they
might be correct values, even though they are extreme. They might also be errors.
It is often very difficult to know which case you are dealing with. Outliers can
be removed or filled in like other bad data; fortunately, there are also statistical
techniques (such as robust regression) designed to handle outliers and extreme values.
Apply transformations to potential outliers: Another approach for outliers or
extreme values is to apply transformations, such as a logarithmic or Gaussian kernel
transformation, to features that have potential outliers or display large ranges of
potential values. These types of transformations dampen the impact of large changes
in the scale of a variable and can turn a nonlinear relationship into one that is
closer to linear.
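The following is a minimal sketch of three of the options above (filtering, filling in,
and transforming), using only the Python standard library. The values in raw are made up
for illustration, with None standing in for a missing data point and 5000.0 for an
extreme value; they are not taken from the movie dataset.

```python
import math
import statistics

# Hypothetical raw feature values (an assumption for illustration only):
# None marks a missing data point, and 5000.0 is an extreme value.
raw = [12.0, 15.0, None, 14.0, 5000.0, 13.0, None, 16.0]

# Option 1: filter out records with missing values.
filtered = [x for x in raw if x is not None]

# Option 2: fill in missing values with a global statistic.
# The median is far less affected by the extreme value than the mean.
mean = statistics.mean(filtered)
median = statistics.median(filtered)
filled = [median if x is None else x for x in raw]

# Option 4: dampen the range of the feature with a logarithmic transformation.
# log1p(x) computes log(1 + x), so it is defined at x = 0;
# this sketch assumes all values are non-negative.
logged = [math.log1p(x) for x in filled]

print("records kept after filtering:", len(filtered))
print("mean:", mean, "median:", median)
print("range before log:", max(filled) - min(filled))
print("range after log:", max(logged) - min(logged))
```

Note how a single extreme value pulls the mean far away from the bulk of the data while
the median stays close to it, which is one reason the median is often preferred when
filling in values for a feature that might contain outliers.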