Advanced Analytical Theory and Methods: Text Analysis - Data Science and Big Data Analytics

9.4 Representing Text

After the previous step, the team now has some raw text to start with. In this data

representation step, raw text is first transformed with text normalization techniques

such as tokenization and case folding. Then it is represented in a more structured

way for analysis.

Tokenization is the task of separating (also called tokenizing) a body of text into

words. Tokenization converts raw text into a collection of tokens,

where each token is generally a word.

A common approach is tokenizing on spaces. For example, with the tweet shown

previously:

I once had a gf back in the day. Then the bPhone came out lol

tokenization based on spaces would output the following list of tokens:

{I, once, had, a, gf, back, in, the, day.,

Then, the, bPhone, came, out, lol}
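
The space-based tokenization above can be sketched in a few lines of Python. This is a minimal illustration rather than a prescribed implementation; the variable names are chosen here for the example, and str.split() with no arguments is used so that any run of whitespace acts as a separator.

```python
# The example tweet from the text.
tweet = "I once had a gf back in the day. Then the bPhone came out lol"

# With no arguments, str.split() splits on runs of whitespace only,
# so punctuation stays attached to the neighboring word.
tokens = tweet.split()

print(tokens)
print("day." in tokens, "day" in tokens)
```

The second print statement makes it easy to check whether the period stayed attached to a token.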

Note that the token “day.” contains a period. This is the result of using only spaces as

the separator. Therefore, the tokens “day.” and “day” would be considered different

terms in the downstream analysis unless an additional lookup table is provided. One

way to fix the problem without the use of a lookup table is to remove the period

if it appears at the end of a sentence. Another way is to tokenize the text based on

punctuation marks and spaces. In this case, the previous tweet would become:

{I, once, had, a, gf, back, in, the, day, .,

Then, the, bPhone, came, out, lol}
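
Both fixes can be sketched in Python under some simplifying assumptions. The first variant strips a trailing period from every token, which is a crude stand-in for "remove the period at the end of a sentence" because it does not check where sentences actually end (it would also alter an abbreviation that ends in a period). The second variant uses a regular expression, chosen here for illustration, that treats each punctuation mark as its own token.

```python
import re

tweet = "I once had a gf back in the day. Then the bPhone came out lol"

# Variant 1 (crude assumption): drop trailing periods from each
# space-separated token, without detecting sentence boundaries.
stripped = [token.rstrip(".") for token in tweet.split()]

# Variant 2: tokenize on punctuation marks and spaces.
# \w+ matches a run of word characters; [^\w\s] matches one character
# that is neither a word character nor whitespace (a punctuation mark).
tokens = re.findall(r"\w+|[^\w\s]", tweet)

print(stripped)
print(tokens)
```

In the second variant the period survives as a separate token, so later steps can decide whether to keep or discard it.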

However, tokenizing based on punctuation marks might not be well suited to certain

scenarios. For example, if the text contains contractions such as we'll, tokenizing

based on punctuation will split them into the separate words we and ll. For words

such as can't, the output would be can and t. It would be preferable either

not to tokenize them or to tokenize we'll into we and 'll, and can't into can

and 't. The 't token is more recognizable as a negation than the t token. If the team

is dealing with certain tasks such as information extraction or sentiment analysis,

tokenizing solely based on punctuation marks and spaces may obscure or even

distort meanings in the text.
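
The contraction problem described above can be illustrated with a short Python sketch. The function names naive_tokenize and tokenize are hypothetical, and the regular expressions are only one possible way to keep the apostrophe with the trailing part of a contraction. The sketch assumes straight apostrophes (') and makes no attempt to handle single-quoted phrases or curly apostrophes.

```python
import re

def naive_tokenize(text):
    # Keep only runs of word characters; punctuation, including the
    # apostrophe, acts as a separator and is discarded.
    return re.findall(r"\w+", text)

def tokenize(text):
    # Alternatives are tried left to right at each position:
    #   '\w+     an apostrophe followed by word characters ('ll, 't)
    #   \w+      a plain run of word characters
    #   [^\w\s]  any single remaining punctuation mark
    return re.findall(r"'\w+|\w+|[^\w\s]", text)

for word in ["we'll", "can't"]:
    print(word, naive_tokenize(word), tokenize(word))
```

Keeping the apostrophe attached gives downstream steps, such as sentiment analysis, a token that can still be mapped back to its full form.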
