original document given its bag of words; it means that the mapping is not
one to one.
We consider a word as a sequence of letters from a defined alphabet. In this chapter we use word and term as synonyms. We consider a corpus as a set of documents, and a dictionary as the set of words that appear in the corpus.
We can view a document as a bag of terms. This bag can be seen as a vector, where each component is associated with one term from the dictionary:

    φ : d ↦ φ(d) = (tf(t_1, d), tf(t_2, d), ..., tf(t_N, d)) ∈ R^N,

where tf(t_i, d) is the frequency of the term t_i in d. If the dictionary contains N terms, a document is mapped into an N-dimensional space. In general, N is quite large, around a hundred thousand words, and it produces a sparse VSM representation of the document, where few tf(t_i, d) are non-zero.
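The mapping φ can be sketched in a few lines of Python. This is a toy illustration, not code from the chapter: the dictionary and document below are invented, and a real dictionary would have on the order of a hundred thousand entries, which is why sparse storage is used in practice.

```python
from collections import Counter

def tf_vector(doc, dictionary):
    """Map a document (a string) to its term-frequency vector
    phi(d) = (tf(t_1, d), ..., tf(t_N, d)) over a fixed dictionary."""
    counts = Counter(doc.lower().split())
    return [counts[t] for t in dictionary]

# Toy dictionary (N = 4) and document, for illustration only.
dictionary = ["kernel", "matrix", "document", "term"]
doc = "a document is a bag of term counts and a term can repeat"
print(tf_vector(doc, dictionary))  # [0, 0, 1, 2] -- most entries are zero
```

Note that words outside the dictionary are simply dropped, and most dictionary terms do not occur in any single document, which is the sparsity mentioned above.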
A corpus of documents can be represented as a document-term matrix whose rows are indexed by the documents and whose columns are indexed by the terms. Each entry in position (i, j) is the term frequency of the term t_j in document d_i:

        | tf(t_1, d_1)  ···  tf(t_N, d_1) |
    D = |      ⋮          ⋱        ⋮      |
        | tf(t_1, d_ℓ)  ···  tf(t_N, d_ℓ) |

where ℓ is the number of documents in the corpus.
From matrix D , we can construct:
the term-document matrix: D^T
the term-term matrix: D^T D
the document-document matrix: D D^T
It is important to note that the document-term matrix is the dataset S ,
while the document-document matrix is our kernel matrix.
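The construction above can be sketched with NumPy on a toy corpus. The three short documents are invented for illustration; the point is only the shapes and the two products, D^T D (term-term) and D D^T (document-document, i.e. the kernel matrix).

```python
import numpy as np

# Toy corpus: l = 3 documents, invented for this sketch.
corpus = ["the cat sat", "the dog sat", "the cat saw the dog"]
dictionary = sorted({w for doc in corpus for w in doc.split()})  # N = 5 terms

# Document-term matrix D: entry (i, j) = tf(t_j, d_i).
D = np.array([[doc.split().count(t) for t in dictionary] for doc in corpus])

term_term = D.T @ D  # N x N: co-occurrence counts between term pairs
doc_doc = D @ D.T    # l x l: inner products between documents -- the kernel matrix
```

Since l is often much smaller than N, the l x l document-document matrix is the cheaper object to work with, which is the efficiency argument made below.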
Quite often the corpus size is smaller than the dictionary size, so the document representation can be more efficient. Here, the dual description corresponds to the document representation view of the problem, and the primal to the term representation. In the dual, a document is represented as the counts of terms that appear in it. In the primal, a term is represented as the counts of the documents in which it appears.
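In terms of the document-term matrix D, the two views are just rows versus columns. A minimal sketch, with an invented 3-document, 4-term matrix of counts:

```python
import numpy as np

# Invented document-term matrix: 3 documents (rows) x 4 terms (columns).
D = np.array([[2, 0, 1, 0],
              [0, 1, 1, 0],
              [1, 0, 0, 3]])

doc_view = D[0, :]   # dual view: document 1 as counts of each term
term_view = D[:, 2]  # primal view: term 3 as counts in each document
```

Here document 1 contains term 1 twice and term 3 once, while term 3 appears once each in documents 1 and 2 and never in document 3.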
The VSM representation has some drawbacks. The most important is that the bag of words is not able to map documents that contain semantically equivalent words into the same feature vectors. A classical example is synonymous words, which carry the same information but are assigned distinct components. Another effect is the complete loss of context information around a word. To mitigate this effect, it is possible to apply different techniques. The first consists in applying a different weight w_i to each coordinate. This is quite common in text mining, where uninformative words, called stop words, are removed from the document. Another important consideration is the influence