9.4 Representing Text
After the previous step, the team now has some raw text to start with. In this data
representation step, raw text is first transformed with text normalization techniques
such as tokenization and case folding. Then it is represented in a more structured
way for analysis.
Tokenization is the task of separating (also called tokenizing) words from the
body of text. Tokenization converts raw text into a collection of tokens, where
each token is generally a word.
A common approach is tokenizing on spaces. For example, with the tweet shown
previously:
I once had a gf back in the day. Then the bPhone came out lol
tokenization based on spaces would output the following list of tokens:
{I, once, had, a, gf, back, in, the, day.,
Then, the, bPhone, came, out, lol}
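As a minimal sketch of space-based tokenization, the tweet can be split with Python's built-in str.split(). The variable names are illustrative only and do not come from the original text.

```python
# Tokenize on whitespace only; punctuation stays attached to its word.
tweet = "I once had a gf back in the day. Then the bPhone came out lol"

# str.split() with no arguments splits on any run of whitespace.
tokens = tweet.split()

print(tokens)
```

The ninth token in the result is "day." with the period still attached, which is the behavior discussed next.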
Note that the token “day.” contains a period. This is the result of using only
spaces as the separator. Therefore, the tokens “day.” and “day” would be considered
different terms in the downstream analysis unless an additional lookup table is
provided. One way to fix the problem without the use of a lookup table is to remove
the period if it appears at the end of a sentence. Another way is to tokenize the
text based on punctuation marks and spaces. In this case, the previous tweet would
become:
{I, once, had, a, gf, back, in, the, day, .,
Then, the, bPhone, came, out, lol}
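One possible way to tokenize on both punctuation marks and spaces is a regular expression that emits runs of word characters and individual punctuation marks as separate tokens. This is a sketch under that assumption; the function name tokenize_punct_space is hypothetical.

```python
import re

# Either a run of word characters, or one character that is neither
# a word character nor whitespace (that is, a punctuation mark).
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")

def tokenize_punct_space(text):
    """Return words and punctuation marks as separate tokens."""
    return TOKEN_PATTERN.findall(text)

tweet = "I once had a gf back in the day. Then the bPhone came out lol"
tokens = tokenize_punct_space(tweet)
print(tokens)
```

With this pattern, "day" and "." come out as two tokens, so "day" matches other occurrences of the same word without a lookup table.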
However, tokenizing based on punctuation marks might not be well suited to certain
scenarios. For example, if the text contains contractions such as “we'll”,
tokenizing based on punctuation will split them into the separate words “we” and
“ll”. For words such as “can't”, the output would be “can” and “t”. It would be
preferable either not to tokenize them or to tokenize “we'll” into “we” and “'ll”,
and “can't” into “can” and “'t”. The “'t” token is more recognizable as negative
than the “t” token. If the team is dealing with certain tasks such as information
extraction or sentiment analysis, tokenizing solely based on punctuation marks and
spaces may obscure or even distort meanings in the text.
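The contraction-preserving split described above can be sketched by adding one alternative to a punctuation-and-space pattern so that the apostrophe stays with the suffix. The suffix list, the handling of straight apostrophes only, and the function name tokenize_keep_contractions are assumptions made for illustration; they are not taken from the original text.

```python
import re

# Alternatives are tried in order at each position:
#   1. an apostrophe followed by a common English contraction suffix,
#   2. a run of word characters,
#   3. any single punctuation mark.
# The suffix list is an illustrative assumption, not an exhaustive one.
CONTRACTION_AWARE = re.compile(r"'(?:ll|t|s|re|ve|d|m)\b|\w+|[^\w\s]")

def tokenize_keep_contractions(text):
    """Tokenize on punctuation and spaces, keeping suffixes such as 'll and 't."""
    return CONTRACTION_AWARE.findall(text)

print(tokenize_keep_contractions("we'll"))
print(tokenize_keep_contractions("can't"))
print(tokenize_keep_contractions("I can't say we'll go."))
```

Here “can't” yields the tokens “can” and “'t”, so the negation marker remains visible to downstream steps such as sentiment analysis.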