Text features
In some ways, text features are a form of categorical and derived features. Let's take the example of the description for a movie (which we do not have in our dataset). Here, the raw text cannot be used directly, even as a categorical feature, since there is a virtually unlimited number of possible combinations of words that could occur if each piece of text were treated as a possible value. Our model would almost never see two occurrences of the same feature and would not be able to learn effectively. Therefore, we would like to turn raw text into a form that is more amenable to machine learning.
There are numerous ways of dealing with text, and the field of natural language processing
is dedicated to processing, representing, and modeling textual content. A full treatment is
beyond the scope of this topic, but we will introduce a simple and standard approach for
text-feature extraction; this approach is known as the bag-of-words representation.
The bag-of-words approach treats a piece of text content as a set of the words, and possibly numbers, in the text (these are often referred to as terms). The process is as follows (a short end-to-end sketch follows the list):
Tokenization: First, some form of tokenization is applied to the text to split it into a set of tokens (generally words, numbers, and so on). An example of this is simple whitespace tokenization, which splits the text on each space and might remove punctuation and other characters that are not alphabetical or numerical.
Stop word removal: Next, it is usual to remove very common words such as "the", "and", and "but" (these are known as stop words).
Stemming: The next step can include stemming, which refers to taking a term and reducing it to its base form or stem. A common example is plural terms becoming singular (for example, dogs becomes dog and so on). There are many approaches to stemming, and text-processing libraries often contain various stemming algorithms.
Vectorization: The final step is turning the processed terms into a vector representation. The simplest form is, perhaps, a binary vector representation, where we assign a value of one if a term exists in the text and zero if it does not. This is essentially identical to the categorical 1-of-k encoding we encountered earlier. Like 1-of-k encoding, this requires a dictionary of terms mapping a given term to an index number. As you might gather, there are potentially millions of individual possible terms (even after stop word removal and stemming). Hence, it becomes critical to use a sparse vector representation where only the fact that a term is present is stored, to save memory and disk space as well as compute time.
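To make these steps concrete, here is a minimal sketch of the pipeline in Python. The stop-word set, the deliberately naive suffix-stripping stemmer, and the example documents are illustrative assumptions rather than anything prescribed by the text; a real pipeline would typically use an established NLP library for tokenization and stemming.

```python
import re

# Illustrative stop-word list; real pipelines use much larger lists.
STOP_WORDS = {"the", "and", "but", "a", "an", "of", "to", "in", "is"}

def tokenize(text):
    # Simple tokenization: lowercase the text and keep only alphanumeric
    # runs, which also drops punctuation.
    return re.findall(r"[a-z0-9]+", text.lower())

def stem(term):
    # Naive stemming for illustration only: strip a trailing 's'
    # (e.g. "dogs" -> "dog"). Real stemmers are far more sophisticated.
    return term[:-1] if term.endswith("s") and len(term) > 3 else term

def process(text):
    # Tokenization -> stop word removal -> stemming.
    return [stem(t) for t in tokenize(text) if t not in STOP_WORDS]

def build_dictionary(documents):
    # Map each distinct processed term to an index number (1-of-k style).
    index = {}
    for doc in documents:
        for term in process(doc):
            index.setdefault(term, len(index))
    return index

def to_sparse_binary_vector(text, index):
    # Binary encoding stored sparsely: keep only the indices of terms
    # that actually occur in the text.
    return sorted({index[t] for t in process(text) if t in index})

docs = ["The dogs chased the ball", "A dog sleeps in the sun"]
dictionary = build_dictionary(docs)
print(dictionary)
# {'dog': 0, 'chased': 1, 'ball': 2, 'sleep': 3, 'sun': 4}
print(to_sparse_binary_vector("The dog and the ball", dictionary))
# [0, 2]
```

Note that only the indices of the terms that are present are stored, which is what makes the representation sparse; a dense vector would need one entry per dictionary term, almost all of them zero.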