Digital Signal Processing Reference
Fig. 6.9 Flowchart for open domain on-line knowledge source-based linguistic analysis.
of another word). These relations are partially also found in ConceptNet, e.g., the
complement of meronymy is PartOf.
6.3.4.4 Methodology
Based on the on-line knowledge sources as described, this section now introduces an
open domain approach towards linguistic analysis. Figure 6.9 visualises the principle
of the algorithm and the incorporation of the on-line knowledge sources at two steps.
The flow is as follows: First, the input sequence is preprocessed. Then, two parallel
steps extract the words that convey information on a task of interest and the words
that name this task's targets. This information is next combined into expressions. The
expressions are then filtered to discard irrelevant ones. Finally, a score value
is obtained from the remaining expressions that can be used as a linguistic feature for
classification or regression.
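The five stages of this flow can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the algorithm of Fig. 6.9 itself: the two small dictionaries stand in for the on-line knowledge sources, the distance-based filter and the mean score are placeholder choices, and all function names are hypothetical.

```python
from typing import Dict, List, Set, Tuple

# Toy stand-ins (assumptions): in the text these come from on-line knowledge sources.
INFO_WORDS: Dict[str, float] = {"great": 1.0, "boring": -1.0, "carefully": 0.5}
TARGET_WORDS: Set[str] = {"plot", "acting", "soundtrack"}

Expression = Tuple[str, str]  # (target word, information word)


def preprocess(text: str) -> List[str]:
    """Lower-case the input and split it into punctuation-free tokens."""
    tokens = (tok.strip(".,!?;:") for tok in text.lower().split())
    return [tok for tok in tokens if tok]


def extract_info_words(tokens: List[str]) -> List[str]:
    """Parallel step 1: words that carry information on the task of interest."""
    return [t for t in tokens if t in INFO_WORDS]


def extract_targets(tokens: List[str]) -> List[str]:
    """Parallel step 2: words that name the task's targets."""
    return [t for t in tokens if t in TARGET_WORDS]


def combine(info: List[str], targets: List[str]) -> List[Expression]:
    """Pair every target with every information word."""
    return [(t, w) for t in targets for w in info]


def filter_expressions(expressions: List[Expression], tokens: List[str],
                       max_distance: int = 3) -> List[Expression]:
    """Placeholder filter (assumption): drop pairs whose words lie far apart."""
    return [(t, w) for t, w in expressions
            if abs(tokens.index(t) - tokens.index(w)) <= max_distance]


def score(expressions: List[Expression]) -> float:
    """Placeholder score (assumption): mean value of the information words."""
    if not expressions:
        return 0.0
    return sum(INFO_WORDS[w] for _, w in expressions) / len(expressions)


def linguistic_feature(text: str) -> float:
    """Run the stages in order and return a single score value."""
    tokens = preprocess(text)
    info = extract_info_words(tokens)
    targets = extract_targets(tokens)
    expressions = filter_expressions(combine(info, targets), tokens)
    return score(expressions)


print(linguistic_feature("The plot was great."))  # 1.0
```

The returned value plays the role of the linguistic feature that would be passed on to a classifier or regressor.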
First, the text is split into sequences S of words or similar entities. The sequences S
are then analysed by a syntactic parser for POS tagging. The POS classes include
adjective (JJ; see footnote 5), adverb (RB), determiner (DT), verb (VB), and noun (NN),
and are attached to the words with “/” in the examples that follow. If comprehensive
knowledge of the syntax is not required, a chunker suffices for dividing longer
sequences into chunks. The chunks correspond to phrases, such as a noun phrase (NP),
verb phrase (VP), or prepositional phrase (PP). An additional benefit is the flat
structure produced by a chunker, which is better suited for the processing steps that
follow. As a unit of representation, ternary expressions (T-expressions) are extracted
on a per-sentence basis. T-expressions were introduced for automatic question answering
in [ 87 ] and adapted to product review classification [ 88 ]. Here, a T-expression is
formatted as: <target, verb, source>. The 'target' thereby refers to a feature term of
the subject of the sequence, e.g., a movie in the case of movie review valence
estimation. The verb is picked from the same phrase as the target. Should the verb not
provide information of interest for the target, another suitable information source,
mostly an adverb, is selected instead. By this logic, the T-expression of the sequence
“a/DT carefully/RB designed/VB plot/NN” would be <plot, designed, carefully>. If no
verbs exist in
5 openNLP notation is followed for POS classes.
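As an illustration of the T-expression format <target, verb, source>, the following Python sketch reads a sequence in the “word/TAG” notation used above and fills the three slots. It is a deliberately simplified assumption, not the extraction procedure of the text: the whole sequence is treated as one phrase, the first noun, verb, and adverb (falling back to an adjective) are taken, no check is made whether the verb is informative, and sequences with a missing slot simply yield None. All function names are hypothetical.

```python
from typing import List, Optional, Tuple

TaggedWord = Tuple[str, str]  # (word, POS tag)


def parse_tagged(sequence: str) -> List[TaggedWord]:
    """Split 'word/TAG' tokens into (word, tag) pairs; skip untagged tokens."""
    pairs = []
    for token in sequence.split():
        word, sep, tag = token.rpartition("/")
        if sep and word and tag:
            pairs.append((word, tag))
    return pairs


def first_with_tag(pairs: List[TaggedWord], prefix: str) -> Optional[str]:
    """Return the first word whose tag starts with the given prefix."""
    for word, tag in pairs:
        if tag.startswith(prefix):
            return word
    return None


def extract_t_expression(sequence: str) -> Optional[Tuple[str, str, str]]:
    """Fill <target, verb, source> from one POS-tagged sequence.

    Simplification (assumption): target = first noun, verb = first verb,
    source = first adverb, else first adjective. Returns None when a slot
    cannot be filled; this case is not modelled here.
    """
    pairs = parse_tagged(sequence)
    target = first_with_tag(pairs, "NN")
    verb = first_with_tag(pairs, "VB")
    source = first_with_tag(pairs, "RB") or first_with_tag(pairs, "JJ")
    if target is None or verb is None or source is None:
        return None
    return (target, verb, source)


print(extract_t_expression("a/DT carefully/RB designed/VB plot/NN"))
# ('plot', 'designed', 'carefully')
```

Matching on tag prefixes means that finer-grained tags such as NNS or VBD would also be accepted, which is a convenience of this sketch rather than something the text prescribes.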
 