Obtaining, Processing, and Preparing Data with Spark - Machine Learning with Spark

Categorical features

Categorical features cannot be used as input in their raw form, as they are not numbers; instead, they are members of a set of possible values that the variable can take. In the example mentioned earlier, user occupation is a categorical variable that can take the value of student, programmer, and so on.

Such categorical variables are also known as nominal variables, where there is no concept of order between the values of the variable. By contrast, when there is a concept of order between the values (such as the ratings mentioned earlier, where a rating of 5 is conceptually higher or better than a rating of 1), we refer to them as ordinal variables.

To transform categorical variables into a numerical representation, we can use a common approach known as 1-of-k encoding (also called one-hot encoding). Some encoding of this kind is required to represent nominal variables in a way that makes sense for machine learning tasks. Ordinal variables might be used in their raw form but are often encoded in the same way as nominal variables.

Assume that there are k possible values that the variable can take. If we assign each possible value an index from the set of 1 to k, then we can represent a given state of the variable using a binary vector of length k; here, all entries are zero, except the entry at the index that corresponds to the given state of the variable. This entry is set to one.
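
The encoding described above can be sketched in plain Python, without Spark. This is a minimal illustration rather than the book's own code: the helper name one_of_k and the sample occupation values are hypothetical, and indices run from 0 to k-1 instead of 1 to k, as is conventional in Python.

```python
def one_of_k(value, categories):
    """Return a 1-of-k (one-hot) list for `value` given the known categories."""
    # Sort the distinct categories so each one gets a deterministic index.
    index = {c: i for i, c in enumerate(sorted(set(categories)))}
    # Start with a vector of k zeros, one entry per possible value.
    vector = [0] * len(index)
    # Set only the entry for the observed value to one.
    vector[index[value]] = 1
    return vector

# Hypothetical sample values, not taken from the dataset.
sample_occupations = ["student", "programmer", "doctor"]
encoded = one_of_k("programmer", sample_occupations)
print(encoded)
```

Each encoded vector has exactly one non-zero entry, so its length equals the number of distinct categories and its entries sum to one.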

For example, we can collect all the possible states of the occupation variable:

all_occupations = user_fields.map(lambda fields: fields[3]).distinct().collect()
all_occupations.sort()

We can then assign index values to each possible occupation in turn (note that we start from zero, since Python, Scala, and Java arrays all use zero-based indices):

idx = 0
all_occupations_dict = {}
for o in all_occupations:
    all_occupations_dict[o] = idx
    idx += 1
# try a few examples to see what "1-of-k" encoding is assigned
