Obtaining, Processing, and Preparing Data with Spark - Machine Learning with Spark

Processing and transforming your data

Now that we have done some initial exploratory analysis of our dataset and we know a

little more about the characteristics of our users and movies, what do we do next?

In order to make the raw data usable in a machine learning algorithm, we first need to clean

it up and possibly transform it in various ways before extracting useful features from the

transformed data. The transformation and feature extraction steps are closely linked, and in

some cases, certain transformations are themselves a form of feature extraction.

We have already seen an example of the need to clean data in the movie dataset. Generally,

real-world datasets contain bad data, missing data points, and outliers. Ideally, we would

correct bad data; however, this is often not possible, as many datasets derive from some

form of collection process that cannot be repeated (this is the case, for example, in web

activity data and sensor data). Missing values and outliers are also common and can be

dealt with in a manner similar to bad data. Overall, the broad options are as follows:

• Filter out or remove records with bad or missing values: This is sometimes

unavoidable; however, it means losing the good part of a bad or missing record.

• Fill in bad or missing data: We can try to assign a value to bad or missing data

based on the rest of the data we have available. Approaches include assigning

a zero value, assigning the global mean or median, interpolating from nearby or

similar data points (usually in a time-series dataset), and so on. Deciding on the

correct approach is often tricky and depends on the data, the situation, and one's

own experience.

• Apply robust techniques to outliers: The main issue with outliers is that they

might be correct values, even though they are extreme. They might also be errors.

It is often very difficult to know which case you are dealing with. Outliers can also

be removed or filled in, although, fortunately, there are statistical techniques (such

as robust regression) designed to handle outliers and extreme values.

• Apply transformations to potential outliers: Another approach for outliers or

extreme values is to apply transformations, such as a logarithmic or Gaussian kernel

transformation, to features that have potential outliers or that display large ranges

of potential values. These types of transformations dampen the impact of large

changes in the scale of a variable and can, in some cases, turn a nonlinear

relationship into one that is closer to linear.
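
The options above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the values in raw are hypothetical (they are not taken from the movie dataset), None stands in for a bad or missing entry, and the median is used as a simple robust statistic in place of a full technique such as robust regression. On a real dataset, the same steps would typically be expressed as transformations over the distributed data rather than over a local list.

```python
import math
import statistics

# Hypothetical values for illustration only: None marks a bad or missing
# entry, and 5000 is an extreme value that may or may not be an error.
raw = [12, 15, None, 14, 5000, 13, None, 16]

# Option 1: filter out records with bad or missing values.
filtered = [x for x in raw if x is not None]

# Option 2: fill in missing values, here with the median of the known values.
median = statistics.median(filtered)
filled = [median if x is None else x for x in raw]

# Robustness: the mean is pulled far toward the extreme value, while the
# median stays close to the bulk of the data.
mean = statistics.mean(filtered)

# Option 4: a logarithmic transformation dampens the large range of values.
# log1p(x) computes log(1 + x), so it is also defined for x = 0.
log_scaled = [math.log1p(x) for x in filled]

print("filtered:", filtered)
print("median:", median, "mean:", mean)
print("filled:", filled)
print("log-scaled range:", min(log_scaled), "to", max(log_scaled))
```

Which of these options is appropriate still depends on the data: filling with the median keeps all eight records but invents two values, while filtering keeps only observed values at the cost of dropping two records.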
