context, inferences, and discourse. Each word is considered a term or token (which
is often the smallest unit for the analysis). In many cases, bag-of-words
additionally assumes every term in the document is independent. The document
then becomes a vector with one dimension for every distinct term in the space, and
the terms are unordered. The permutation D* of a document D contains the same
words exactly the same number of times but in a different order. Therefore, using
the bag-of-words representation, document D and its permutation D* would share
the same representation.
Bag-of-words takes quite a naïve approach, as order plays an important role in
the semantics of text. With bag-of-words, many texts with different meanings
are combined into one form. For example, the texts “a dog bites a man” and “a
man bites a dog” have very different meanings, but they would share the same
representation with bag-of-words.
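As a concrete sketch (plain Python, tokenizing simply on whitespace), the two sentences above map to identical bag-of-words representations:

```python
from collections import Counter

def bag_of_words(text):
    """Represent a document as an unordered multiset of terms."""
    # Lowercase and split on whitespace; each word is one term (token).
    return Counter(text.lower().split())

d = bag_of_words("a dog bites a man")
d_star = bag_of_words("a man bites a dog")  # a permutation D* of D

# D and its permutation D* share the same representation.
print(d == d_star)  # True
```

The `Counter` captures exactly what bag-of-words keeps: which terms occur and how often, with all ordering discarded.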
Although the bag-of-words technique oversimplifies the problem, it is still
considered a good approach to start with, and it is widely used for text analysis.
A paper by Salton and Buckley [11] attests to the effectiveness of using single words as identifiers rather than multiple-term identifiers, which retain the order of the words:
In reviewing the extensive literature accumulated during the past
25 years in the area of retrieval system evaluation, the
overwhelming evidence is that the judicious use of single-term
identifiers is preferable to the incorporation of more complex
entities extracted from the texts themselves or obtained from
available vocabulary schedules.
Although the work by Salton and Buckley was published in 1988, little substantial evidence has emerged since then to discredit the claim. Bag-of-words uses single-term identifiers, which are usually sufficient for text analysis in place of multiple-term identifiers.
Using single words as identifiers with the bag-of-words representation, the term
frequency (TF) of each word can be calculated. Term frequency represents the
weight of each term in a document, and it is proportional to the number of
occurrences of the term in that document. Figure 9.2 shows the 50 most frequent
words and the numbers of occurrences from Shakespeare's Hamlet. The word
frequency distribution roughly follows Zipf's Law [12, 13]—that is, the n-th most common word occurs approximately 1/n as often as the most frequent term. In
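The term-frequency calculation described above can be sketched in plain Python (here TF is taken as the raw count divided by the document length, one common normalization; `term_frequencies` is a name chosen for illustration):

```python
from collections import Counter

def term_frequencies(document):
    """Compute the term frequency (TF) of each word in a document."""
    tokens = document.lower().split()
    counts = Counter(tokens)
    total = len(tokens)
    # TF is proportional to the number of occurrences of the term
    # in the document.
    return {term: count / total for term, count in counts.items()}

tf = term_frequencies("a dog bites a man")
print(tf["a"])    # 0.4 (2 occurrences out of 5 tokens)
print(tf["dog"])  # 0.2

# Ranking terms by count (Counter.most_common) is also how one would
# check the Zipf-like shape of a corpus's frequency distribution.
```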