original document given its bag of words; it means that the mapping is not
one to one.
We consider a word as a sequence of letters from a defined alphabet. In this chapter we use word and term as synonyms. We consider a corpus as a set of documents, and a dictionary as the set of words that appear in the corpus.
We can view a document as a bag of terms. This bag can be seen as a vector, where each component is associated with one term from the dictionary:

\phi : d \longmapsto \phi(d) = (tf(t_1, d), tf(t_2, d), \ldots, tf(t_N, d)) \in \mathbb{R}^N
where tf(t_i, d) is the frequency of the term t_i in d. If the dictionary contains N terms, a document is mapped into an N-dimensional space. In general, N is quite large, around a hundred thousand words, and it produces a sparse VSM representation of the document, where few tf(t_i, d) are non-zero.
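As a small illustration, the mapping phi can be sketched as follows; the whitespace tokenization and the toy dictionary are hypothetical simplifications, not part of the text:

```python
from collections import Counter

def tf_vector(document, dictionary):
    """Map a document (a string) to its term-frequency vector over a
    fixed dictionary, i.e. phi(d) = (tf(t_1,d), ..., tf(t_N,d))."""
    counts = Counter(document.lower().split())
    return [counts[t] for t in dictionary]

# Toy dictionary of N = 4 terms (hypothetical example)
dictionary = ["kernel", "matrix", "document", "term"]
print(tf_vector("the kernel matrix of a document kernel", dictionary))
# -> [2, 1, 1, 0]
```

Note that most components are zero even in this tiny example; with a realistic dictionary of ~100,000 terms the vector is extremely sparse.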
A corpus of documents can be represented as a document-term matrix whose rows are indexed by the documents and whose columns are indexed by the terms. Each entry in position (i, j) is the term frequency of the term t_j in document i.
D = \begin{pmatrix} tf(t_1, d_1) & \cdots & tf(t_N, d_1) \\ \vdots & \ddots & \vdots \\ tf(t_1, d_\ell) & \cdots & tf(t_N, d_\ell) \end{pmatrix}
From matrix D, we can construct:
• the term-document matrix: D^T
• the term-term matrix: D^T D
• the document-document matrix: D D^T

It is important to note that the document-term matrix is the dataset S, while the document-document matrix is our kernel matrix.
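A minimal NumPy sketch of the three derived matrices; the 3-document, 4-term corpus below is a hypothetical toy example:

```python
import numpy as np

# Toy document-term matrix D: 3 documents (rows) x 4 terms (columns)
D = np.array([[2, 1, 0, 0],
              [0, 1, 1, 0],
              [1, 0, 0, 2]])

term_document = D.T          # N x ell: terms indexed by rows
term_term = D.T @ D          # N x N
document_document = D @ D.T  # ell x ell: the kernel matrix

print(document_document)
# -> [[5 1 2]
#     [1 2 0]
#     [2 0 5]]
```

Each entry (i, j) of the document-document matrix is the inner product of the bag-of-words vectors of documents i and j, which is exactly the kernel value for that pair.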
Quite often the corpus size is smaller than the dictionary size, so the document representation can be more efficient. Here, the dual description corresponds to the document representation view of the problem, and the primal to the term representation. In the dual, a document is represented as the counts of terms that appear in it. In the primal, a term is represented as the counts of the documents in which it appears.
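In matrix terms, the dual view reads a row of D and the primal view reads a column; a brief sketch on a hypothetical toy matrix:

```python
import numpy as np

# Toy document-term matrix: rows are documents, columns are terms
D = np.array([[2, 1, 0],
              [0, 1, 1]])

doc_0_as_term_counts = D[0, :]  # dual: document 0 as counts of each term
term_1_as_doc_counts = D[:, 1]  # primal: term 1 as its counts per document

print(doc_0_as_term_counts)  # -> [2 1 0]
print(term_1_as_doc_counts)  # -> [1 1]
```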
The VSM representation has some drawbacks. The most important is that bag of words is not able to map documents that contain semantically equivalent words into the same feature vectors. A classical example is synonymous words, which carry the same information but are assigned distinct components. Another effect is the complete loss of context information around a word. To mitigate this effect, it is possible to apply different techniques. The first consists in applying a different weight w_i to each coordinate. This is quite common in text mining, where uninformative words, called stop words, are removed from the document. Another important consideration is the influence