Text features
In some ways, text features are a form of categorical and derived features. Let's take the example of the description for a movie (which we do not have in our dataset). Here, the raw text cannot be used directly, even as a categorical feature, since there is a virtually unlimited number of possible combinations of words that could occur if each piece of text were treated as a possible value. Our model would almost never see two occurrences of the same feature and would not be able to learn effectively. Therefore, we would like to turn raw text into a form that is more amenable to machine learning.
There are numerous ways of dealing with text, and the field of natural language processing
is dedicated to processing, representing, and modeling textual content. A full treatment is
beyond the scope of this topic, but we will introduce a simple and standard approach for
text-feature extraction; this approach is known as the bag-of-words representation.
The bag-of-words approach treats a piece of text content as a set of the words, and possibly numbers, in the text (these are often referred to as terms). The process is as follows (a short end-to-end sketch follows the list):
Tokenization: First, some form of tokenization is applied to the text to split it into a set of tokens (generally words, numbers, and so on). An example of this is simple whitespace tokenization, which splits the text on each space and might remove punctuation and other characters that are not alphabetical or numerical.
Stop word removal: Next, it is usual to remove very common words such as "the", "and", and "but" (these are known as stop words).
Stemming: The next step can include stemming, which refers to taking a term and reducing it to its base form or stem. A common example is plural terms becoming singular (for example, dogs becomes dog and so on). There are many approaches to stemming, and text-processing libraries often contain various stemming algorithms.
Vectorization: The final step is turning the processed terms into a vector representation. The simplest form is, perhaps, a binary vector representation, where we assign a value of one if a term exists in the text and zero if it does not. This is essentially identical to the categorical 1-of-k encoding we encountered earlier. Like 1-of-k encoding, this requires a dictionary of terms mapping a given term to an index number. As you might gather, there are potentially millions of individual possible terms (even after stop word removal and stemming). Hence, it becomes critical to use a sparse vector representation where only the fact that a term is present is stored, to save memory and disk space as well as compute time.
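To make these steps concrete, here is a minimal sketch of the pipeline in Python. The stop-word set, the deliberately naive suffix-stripping stemmer, and the example documents are illustrative assumptions rather than anything prescribed by the text; a real pipeline would typically use an established NLP library for tokenization and stemming.

```python
import re

# Illustrative stop-word list; real pipelines use much larger lists.
STOP_WORDS = {"the", "and", "but", "a", "an", "of", "to", "in", "is"}

def tokenize(text):
    # Simple tokenization: lowercase the text and keep only alphanumeric
    # runs, which also drops punctuation.
    return re.findall(r"[a-z0-9]+", text.lower())

def stem(term):
    # Naive stemming for illustration only: strip a trailing 's'
    # (e.g. "dogs" -> "dog"). Real stemmers are far more sophisticated.
    return term[:-1] if term.endswith("s") and len(term) > 3 else term

def process(text):
    # Tokenization -> stop word removal -> stemming.
    return [stem(t) for t in tokenize(text) if t not in STOP_WORDS]

def build_dictionary(documents):
    # Map each distinct processed term to an index number (1-of-k style).
    index = {}
    for doc in documents:
        for term in process(doc):
            index.setdefault(term, len(index))
    return index

def to_sparse_binary_vector(text, index):
    # Binary encoding stored sparsely: keep only the indices of terms
    # that actually occur in the text.
    return sorted({index[t] for t in process(text) if t in index})

docs = ["The dogs chased the ball", "A dog sleeps in the sun"]
dictionary = build_dictionary(docs)
print(dictionary)
# {'dog': 0, 'chased': 1, 'ball': 2, 'sleep': 3, 'sun': 4}
print(to_sparse_binary_vector("The dog and the ball", dictionary))
# [0, 2]
```

Note that only the indices of the terms that are present are stored, which is what makes the representation sparse; a dense vector would need one entry per dictionary term, almost all of them zero.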