Feature hashing
Feature hashing is a technique for dealing with high-dimensional data and is often used with text and categorical datasets where the features can take on many unique values (often many millions of values). In the previous chapters, we often used the 1-of-K encoding approach for categorical features, including text. While this approach is simple and effective, it can break down in the face of extremely high-dimensional data.
Building and using 1-of-K feature encoding requires us to keep a mapping of each possible feature value to an index in a vector. Furthermore, the process of creating the mapping itself requires at least one additional pass through the dataset and can be tricky to do in parallel scenarios. Up until now, we have often used a simple approach of collecting the distinct feature values and zipping this collection with a set of indices to create a map of feature value to index. This mapping is then broadcast (either explicitly in our code or implicitly by Spark) to each worker.
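As an illustration, the following is a minimal sketch of this mapping-based approach, assuming sc is an existing SparkContext and categories is an RDD of raw categorical string values (both names are illustrative, not from the text):

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// A minimal sketch of mapping-based 1-of-K index assignment.
// `sc` and `categories` are assumed to exist; the names are illustrative.
def buildMappedIndices(sc: SparkContext, categories: RDD[String]): RDD[Int] = {
  // One extra pass over the data to collect the distinct values,
  // then zip with indices to assign each value a vector position.
  val mapping: Map[String, Int] = categories.distinct.collect.zipWithIndex.toMap
  // Broadcast the mapping so that each worker holds a read-only copy.
  val bMapping = sc.broadcast(mapping)
  // Each value is encoded as the index of its 1.0 entry in a 1-of-K vector.
  categories.map(value => bMapping.value(value))
}

The collect and broadcast steps in this sketch are exactly where the memory and network pressure described next comes from.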
However, when dealing with feature dimensions in the tens of millions or more, which are common when working with text, this approach can be slow and can require significant memory and network resources, both on the Spark master (to collect the unique values) and on the workers (to receive the broadcast mapping; each worker keeps it in memory in order to apply the feature encoding to its local piece of the input data).
Feature hashing works by assigning the vector index for a feature based on the value obtained by hashing this feature to a number (usually, an integer value) using a hash function. For example, let's say the hash value of a categorical feature for the geolocation of the United States is 342. We will use the hashed value as the vector index, and the value at this index will be 1.0 to indicate the presence of the United States feature. The hash function used must be consistent (that is, for a given input, it returns the same output each time).
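For instance, Scala's built-in MurmurHash3 is one possible choice of consistent hash function (the feature string below is just an example, and the resulting value will differ from the illustrative 342 above):

import scala.util.hashing.MurmurHash3

// A consistent hash: the same input always yields the same output.
val h1 = MurmurHash3.stringHash("geolocation=United States")
val h2 = MurmurHash3.stringHash("geolocation=United States")
assert(h1 == h2)  // deterministic across calls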
This encoding works the same way as mapping-based encoding, except that we choose a size for our feature vector upfront. As the most common hash functions return values in the entire range of integers, we will use a modulo operation to restrict the index values to the size of our vector, which is typically much smaller (a few tens of thousands to a few million, depending on our requirements).
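Putting these pieces together, here is a sketch of a hashing encoder that produces a sparse MLlib vector of a fixed size; the dimension of 2^18, the helper name, and the use of MurmurHash3 are illustrative choices, not a standard API:

import scala.util.hashing.MurmurHash3
import org.apache.spark.mllib.linalg.{Vector, Vectors}

// Encode a set of categorical feature strings into a fixed-size sparse
// vector using feature hashing. The dimension and names are illustrative.
def hashEncode(features: Seq[String], dim: Int = 1 << 18): Vector = {
  val indices = features.map { f =>
    val hash = MurmurHash3.stringHash(f)
    // Scala's % can be negative for negative hashes, so normalize
    // the result into the range [0, dim).
    ((hash % dim) + dim) % dim
  }.distinct.sorted // colliding features simply map to the same index
  // Set 1.0 at each hashed index to mark the feature's presence.
  Vectors.sparse(dim, indices.toArray, Array.fill(indices.length)(1.0))
}

Note that two distinct feature values can collide on the same index; with a sufficiently large vector size, such collisions are rare enough that they typically have little impact in practice.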
Feature hashing has the advantage that we do not need to build a mapping and keep it in memory. It is also easy to implement, very fast, and can be done online and in real time, thus not requiring a pass through our dataset first. Finally, because we selected a feature vector size upfront that is much smaller than the raw feature dimensionality, the memory usage of our feature vectors is bounded in both training and production.