Feature hashing
Feature hashing is a technique for dealing with high-dimensional data and is often used with text and categorical datasets where the features can take on many unique values (often many millions of values). In the previous chapters, we often used the 1-of-K encoding approach for categorical features, including text. While this approach is simple and effective, it can break down in the face of extremely high-dimensional data.
Building and using 1-of-K feature encoding requires us to keep a mapping of each possible feature value to an index in a vector. Furthermore, the process of creating the mapping itself requires at least one additional pass through the dataset and can be tricky to do in parallel scenarios. Up until now, we have often used a simple approach of collecting the distinct feature values and zipping this collection with a set of indices to create a map of feature value to index. This mapping is then broadcast (either explicitly in our code or implicitly by Spark) to each worker.
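As an illustration, the following is a minimal sketch of this mapping-based approach, assuming sc is an existing SparkContext and categories is an RDD of raw categorical string values (both names are illustrative, not from the text):

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// A minimal sketch of mapping-based 1-of-K index assignment.
// `sc` and `categories` are assumed to exist; the names are illustrative.
def buildMappedIndices(sc: SparkContext, categories: RDD[String]): RDD[Int] = {
  // One extra pass over the data to collect the distinct values,
  // then zip with indices to assign each value a vector position.
  val mapping: Map[String, Int] = categories.distinct.collect.zipWithIndex.toMap
  // Broadcast the mapping so that each worker holds a read-only copy.
  val bMapping = sc.broadcast(mapping)
  // Each value is encoded as the index of its 1.0 entry in a 1-of-K vector.
  categories.map(value => bMapping.value(value))
}

The collect and broadcast steps in this sketch are exactly where the memory and network pressure described next comes from.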
However, when dealing with feature dimensions in the tens of millions or more, which are common when working with text, this approach can be slow and can require significant memory and network resources, both on the Spark master (to collect the unique values) and on the workers (to receive the broadcast mapping; each worker keeps it in memory in order to apply the feature encoding to its local piece of the input data).
Feature hashing works by assigning the vector index for a feature based on the value obtained by hashing this feature to a number (usually, an integer value) using a hash function. For example, let's say the hash value of a categorical feature for the geolocation of the United States is 342. We will use the hashed value as the vector index, and the value at this index will be 1.0 to indicate the presence of the United States feature. The hash function used must be consistent (that is, for a given input, it returns the same output each time).
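For instance, Scala's built-in MurmurHash3 is one possible choice of consistent hash function (the feature string below is just an example, and the resulting value will differ from the illustrative 342 above):

import scala.util.hashing.MurmurHash3

// A consistent hash: the same input always yields the same output.
val h1 = MurmurHash3.stringHash("geolocation=United States")
val h2 = MurmurHash3.stringHash("geolocation=United States")
assert(h1 == h2)  // deterministic across calls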
This encoding works the same way as mapping-based encoding, except that we choose a size for our feature vector upfront. As the most common hash functions return values in the entire range of integers, we will use a modulo operation to restrict the index values to the size of our vector, which is typically much smaller (a few tens of thousands to a few million, depending on our requirements).
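Putting these pieces together, here is a sketch of a hashing encoder that produces a sparse MLlib vector of a fixed size; the dimension of 2^18, the helper name, and the use of MurmurHash3 are illustrative choices, not a standard API:

import scala.util.hashing.MurmurHash3
import org.apache.spark.mllib.linalg.{Vector, Vectors}

// Encode a set of categorical feature strings into a fixed-size sparse
// vector using feature hashing. The dimension and names are illustrative.
def hashEncode(features: Seq[String], dim: Int = 1 << 18): Vector = {
  val indices = features.map { f =>
    val hash = MurmurHash3.stringHash(f)
    // Scala's % can be negative for negative hashes, so normalize
    // the result into the range [0, dim).
    ((hash % dim) + dim) % dim
  }.distinct.sorted // colliding features simply map to the same index
  // Set 1.0 at each hashed index to mark the feature's presence.
  Vectors.sparse(dim, indices.toArray, Array.fill(indices.length)(1.0))
}

Note that two distinct feature values can collide on the same index; with a sufficiently large vector size, such collisions are rare enough that they typically have little impact in practice.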
Feature hashing has the advantage that we do not need to build a mapping and keep it in memory. It is also easy to implement, very fast, and can be done online and in real time, thus not requiring a pass through our dataset first. Finally, because we selected a feature vector size upfront that is much smaller than the raw feature dimensionality, the memory usage of our feature vectors is bounded in both training and production.