Categorical features
Categorical features cannot be used as input in their raw form, as they are not numbers;
instead, they are members of a set of possible values that the variable can take. In the
example mentioned earlier, user occupation is a categorical variable that can take the value of
student, programmer, and so on.
Such categorical variables are also known as nominal variables, where there is no concept
of order between the values of the variable. By contrast, when there is a concept of order
between the values (such as the ratings mentioned earlier, where a rating of 5 is conceptually
higher or better than a rating of 1), we refer to ordinal variables.
To transform categorical variables into a numerical representation, we can use a common
approach known as 1-of-k encoding (also called one-hot encoding). An encoding of this kind is
required to represent nominal variables in a way that makes sense for machine learning tasks.
Ordinal variables might be used in their raw form but are often encoded in the same way as
nominal variables.
Assume that there are k possible values that the variable can take. If we assign each
possible value an index from 1 to k, then we can represent a given state of the variable
using a binary vector of length k. All entries of this vector are zero, except the entry at the
index that corresponds to the given state of the variable, which is set to one.
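The scheme just described can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: it uses zero-based indices rather than 1 to k, and the helper name one_of_k and the three sample occupations are hypothetical, not taken from the dataset.

```python
def one_of_k(value, categories):
    """Return a binary list of length k with a single 1 at the index of `value`."""
    # Sort the categories so that the index assignment is deterministic.
    ordered = sorted(categories)
    index = {category: i for i, category in enumerate(ordered)}
    vector = [0] * len(ordered)
    # A value that is not among the categories raises a KeyError here.
    vector[index[value]] = 1
    return vector


# Hypothetical set of k = 3 possible occupations (illustrative only).
sample_categories = {"student", "programmer", "doctor"}
encoded = one_of_k("programmer", sample_categories)
print(encoded)
```

Each encoded vector contains exactly one non-zero entry, so no artificial ordering is implied between the categories.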
For example, we can collect all the possible states of the occupation variable:
all_occupations = user_fields.map(lambda fields: fields[3]).distinct().collect()
all_occupations.sort()
We can then assign index values to each possible occupation in turn (note that we start
from zero, since Python, Scala, and Java arrays all use zero-based indices):
idx = 0
all_occupations_dict = {}
for o in all_occupations:
    all_occupations_dict[o] = idx
    idx += 1
# try a few examples to see what "1-of-k" encoding is assigned
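The index assignment above depends on the user_fields RDD, which is not defined in this section. As a rough, Spark-free sketch, the same mapping can be tried on a small stand-in list. The sample values and the names sample_occupations, distinct_occupations, and occupation_index are hypothetical and exist only for illustration.

```python
# Hypothetical stand-in for the occupation column (illustrative values only).
sample_occupations = ["student", "programmer", "doctor", "programmer"]

# Deduplicate and sort, mirroring distinct().collect() followed by sort().
distinct_occupations = sorted(set(sample_occupations))

# enumerate() yields zero-based indices, replacing the manual idx counter.
occupation_index = {o: i for i, o in enumerate(distinct_occupations)}

# Look up the index assigned to a few occupations.
for occupation in ("doctor", "programmer"):
    print("Index of %s: %d" % (occupation, occupation_index[occupation]))
```

Sorting before assigning indices keeps the mapping reproducible across runs, since the order returned by a distinct-style operation is not guaranteed.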