theory concepts directly to semantic issues. The conception of the
multiword sequence is both consistent with and appropriately differentiated
from independently developed concepts in Saussurean linguistics. The
weak correlations between units correspond to the Saussurean view of
speech as marked by freedom of combination. In contrast to conceptions
of the extended syntagma and the examples given of extended syntagmas,
semantic cohesiveness is not necessarily implied by an understanding of
the multiword sequence in terms derived from information theory, which
is limited to the level of expression. (It could apply to automatically
generated sequences for which semantic cohesiveness was not attempted.)
The conception of the multiword sequence has analytic value for
understanding retrieval from full text and instrumental value for
constructing specific queries when semantic cohesiveness is imposed
directly by a person. In this understanding, a phrase that distinctively
characterizes the desired topic is highly likely to occur only within
relevant documents, and this can be exploited in searching.
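As a minimal sketch of how a distinctive phrase might be exploited in searching, the following Python example matches a query phrase as a contiguous word sequence. It assumes a small in-memory collection and a simplistic tokenizer. The function names and sample documents are hypothetical and do not come from the text.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens (an assumed, simplistic tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def phrase_search(documents, phrase):
    """Return indices of documents that contain the phrase as a contiguous word sequence."""
    target = tokenize(phrase)
    n = len(target)
    hits = []
    for idx, doc in enumerate(documents):
        tokens = tokenize(doc)
        # Slide a window of n tokens over the document and compare it to the phrase.
        if any(tokens[i:i + n] == target for i in range(len(tokens) - n + 1)):
            hits.append(idx)
    return hits

# Hypothetical documents, invented for illustration only.
documents = [
    "Information theory treats written language as a message.",
    "The theory of information retrieval from full text is discussed here.",
    "A message may carry information about language.",
]

print(phrase_search(documents, "information theory"))
print(phrase_search(documents, "information"))
```

In this toy collection the single word "information" matches every document, while the two-word phrase "information theory" matches only the first. This is the kind of selectivity the paragraph above describes.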
Differentiating the multiword sequence from the word, by treating the
sequence as a weakly correlated concatenation of units that are themselves
internally cohesive, can yield a crucial theoretical insight into the
frequency of recurrence of identical multiword sequences. If the units
(words) of the linear multiword sequence correlate only weakly with one
another, then identical multiword sequences can be recognized, on
theoretical grounds, as highly improbable unless one is copied from the
other. This conception of the multiword sequence also has considerable
analytic power for understanding retrieval from full text: it can yield a
theoretical explanation of the infrequency of recurring multiword
sequences encountered in experience, even in very large corpora. The
relative infrequency of identical phrases or extended multiword sequences
can then be understood as consistent with the weak correlations between
the concatenated units.
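The argument can be illustrated with a short Python sketch. The first function treats words as fully independent, which is a simplifying assumption stronger than the weak correlation described here. Under it, the probability of one specific sequence is the product of its word probabilities and shrinks rapidly with length. The second function measures how many distinct n-word sequences recur in a token list. The function names, the per-word probability of 0.001, and the toy text are all hypothetical.

```python
from collections import Counter
from math import prod

def sequence_probability(word_probs):
    """Probability of one specific sequence if words were fully independent
    (a simplifying assumption, stronger than weak correlation)."""
    return prod(word_probs)

def repeated_ngram_fraction(tokens, n):
    """Fraction of distinct n-word sequences in tokens that occur more than once."""
    ngrams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    if not ngrams:
        return 0.0
    repeated = sum(1 for count in ngrams.values() if count > 1)
    return repeated / len(ngrams)

# Illustrative figure only: each word is assumed to have probability 0.001.
for n in (1, 2, 5):
    print(n, sequence_probability([0.001] * n))

# Toy text invented for illustration; a real corpus would be far larger.
tokens = "the cat sat on the mat and the dog sat on the rug".split()
for n in (1, 2, 3, 4):
    print(n, repeated_ngram_fraction(tokens, n))
```

On the toy text the recurring fraction falls from 0.375 for single words to 0.2 for two-word sequences, 0.1 for three-word sequences, and 0.0 for four-word sequences.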
Summary
By considering written language as the message of information theory, we
derived a precise, computationally implementable understanding of the
word and the multiword sequence, one that is realized in computational
practice. We also established implications for the frequency of
occurrence of identical words and multiword sequences.