context, inferences, and discourse. Each word is considered a term or token (which
is often the smallest unit for the analysis). In many cases, bag-of-words
additionally assumes every term in the document is independent. The document
then becomes a vector with one dimension for every distinct term in the space, and
the terms are unordered. The permutation D* of a document D contains the same
words exactly the same number of times but in a different order. Therefore, using
the bag-of-words representation, document D and its permutation D* would share
the same representation.
Bag-of-words takes quite a naïve approach, as order plays an important role in
the semantics of text. With bag-of-words, many texts with different meanings
are combined into one form. For example, the texts “a dog bites a man” and “a
man bites a dog” have very different meanings, but they would share the same
representation with bag-of-words.
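As a concrete sketch (plain Python, tokenizing simply on whitespace), the two sentences above map to identical bag-of-words representations:

```python
from collections import Counter

def bag_of_words(text):
    """Represent a document as an unordered multiset of terms."""
    # Lowercase and split on whitespace; each word is one term (token).
    return Counter(text.lower().split())

d = bag_of_words("a dog bites a man")
d_star = bag_of_words("a man bites a dog")  # a permutation D* of D

# D and its permutation D* share the same representation.
print(d == d_star)  # True
```

The `Counter` captures exactly what bag-of-words keeps: which terms occur and how often, with all ordering discarded.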
Although the bag-of-words technique oversimplifies the problem, it is still
considered a good approach to start with, and it is widely used for text analysis.
A paper by Salton and Buckley [11] attests to the effectiveness of using single words as identifiers rather than multiple-term identifiers, which retain the order of the words:
In reviewing the extensive literature accumulated during the past
25 years in the area of retrieval system evaluation, the
overwhelming evidence is that the judicious use of single-term
identifiers is preferable to the incorporation of more complex
entities extracted from the texts themselves or obtained from
available vocabulary schedules.
Although the work by Salton and Buckley was published in 1988, little substantial evidence has emerged since then to discredit the claim. Bag-of-words uses single-term identifiers, which are usually sufficient for text analysis in place of multiple-term identifiers.
Using single words as identifiers with the bag-of-words representation, the term
frequency (TF) of each word can be calculated. Term frequency represents the
weight of each term in a document, and it is proportional to the number of
occurrences of the term in that document. Figure 9.2 shows the 50 most frequent
words and the numbers of occurrences from Shakespeare's Hamlet. The word
frequency distribution roughly follows Zipf's Law [12, 13]—that is, the n-th most common word occurs approximately 1/n as often as the most frequent term. In
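The term-frequency calculation described above can be sketched in plain Python (here TF is taken as the raw count divided by the document length, one common normalization; `term_frequencies` is a name chosen for illustration):

```python
from collections import Counter

def term_frequencies(document):
    """Compute the term frequency (TF) of each word in a document."""
    tokens = document.lower().split()
    counts = Counter(tokens)
    total = len(tokens)
    # TF is proportional to the number of occurrences of the term
    # in the document.
    return {term: count / total for term, count in counts.items()}

tf = term_frequencies("a dog bites a man")
print(tf["a"])    # 0.4 (2 occurrences out of 5 tokens)
print(tf["dog"])  # 0.2

# Ranking terms by count (Counter.most_common) is also how one would
# check the Zipf-like shape of a corpus's frequency distribution.
```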