meaningful labels from the web snippets returned by a standard search engine, as described in section 2. Since the Lingo algorithm derives these labels through latent semantic analysis, we interpret them as semantic labels. We then match these labels against the ordered list of chunk-pair-distance elements computed in the topic extraction step described in section 2; all chunk-pair-distance elements that do not match any of the semantic labels are deleted.
The idea is that this filter identifies a semantic relatedness between the labels and the syntactically determined chunks. Since we consider the labels as semantic topics or classes, we assume that the remaining (non-filtered) pairs correspond to relevant, topic-related (via the user query) relationships between semantically related descriptive terms.
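To make the filter concrete, the following is a minimal sketch, assuming that chunk-pair-distance elements are (chunk, chunk, distance) triples and that labels are plain strings; the case-insensitive containment test is our assumption, as the concrete matching criterion is not fixed here.

```python
# Sketch of the label filter described above. The matching criterion
# (case-insensitive containment) is an assumption for illustration.

def filter_by_labels(chunk_pair_distances, semantic_labels):
    """Keep only the chunk-pair-distance elements in which at least
    one chunk matches one of the semantic labels."""
    labels = [label.lower() for label in semantic_labels]

    def matches(chunk):
        chunk = chunk.lower()
        return any(label in chunk or chunk in label for label in labels)

    return [(c1, c2, d) for (c1, c2, d) in chunk_pair_distances
            if matches(c1) or matches(c2)]

# Example: only pairs touching one of the Lingo labels survive.
pairs = [("neural networks", "deep learning", 2),
         ("registration fee", "conference venue", 5)]
print(filter_by_labels(pairs, ["Deep Learning", "Machine Learning"]))
# -> [('neural networks', 'deep learning', 2)]
```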
Of course, it remains to evaluate the quality and usefulness of the extracted topics and the topic graph. In the next sections we discuss two directions: a) a quantitative evaluation against the output of different algorithms for recognizing named entities and other rigid designators, and b) a qualitative evaluation by means of an analysis of user experience.
5 Evaluation of the Extracted Topics
Our topic extraction process is completely unsupervised and web-based, so an evaluation against standard gold corpora is not possible: such corpora simply do not yet exist (or at least, we are not aware of any). For this reason, we decided to compare the outcome of our topic extraction process with the outcomes of a number of different recognizers for named entities (NEs).
Note that the extracted topics very often correspond to rigid designators or generalized named entities, i.e., instances of proper names (persons, locations, etc.) as well as instances of more fine-grained subcategories, such as museum, river, airport, product, or event (cf. [11]). Seen this way, our topic extraction process (abbreviated as TEP) can also be considered a query-driven, context-oriented named entity extraction process, with the notable restriction that the recognized entities are unclassified. If this perspective makes sense, it seems plausible to measure the degree of overlap between the topics extracted by TEP and the entity sets recognized by other named entity components in order to learn about the coverage and quality of TEP.
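The following sketch shows one way such an overlap could be computed; exact string matching after lowercasing is our assumption, and the actual comparison may normalize the terms differently.

```python
# Sketch of an overlap measure between TEP topics and an NE
# recognizer's output. Lowercased exact matching is an assumption.

def overlap(tep_topics, ne_entities):
    """Return (coverage, jaccard): the fraction of TEP topics also
    found by the NE recognizer, and the Jaccard coefficient."""
    topics = {t.lower() for t in tep_topics}
    entities = {e.lower() for e in ne_entities}
    common = topics & entities
    union = topics | entities
    coverage = len(common) / len(topics) if topics else 0.0
    jaccard = len(common) / len(union) if union else 0.0
    return coverage, jaccard

print(overlap(["Berlin", "Brandenburg Gate"], ["berlin", "Potsdam"]))
# -> (0.5, 0.3333333333333333)
```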
For the evaluation of TEP we compared it to the results of four different NE recognizers:
1. SProUT [4]: The SProUT system is a shallow linguistic processor that comes with a rule-based approach to named entity recognition.
2. AlchemyAPI (http://www.AlchemyAPI.com): The AlchemyAPI system uses statistical NLP and machine learning algorithms to perform the NE recognition task.
3. Stanford NER [3]: The Stanford NER system uses a character-based Maximum Entropy Markov model trained on annotated corpora to extract NEs.
4. OpenNLP (http://incubator.apache.org/opennlp/): A collection of natural language processing tools that use the Maxent package to resolve ambiguity, in particular for NE recognition.
We tested all systems with the three snippet corpora described in section 4.
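A small harness along the following lines could drive such a comparison; it assumes each system is wrapped as a function from text to a set of entity strings (the wrappers are hypothetical, since the four systems expose different APIs) and reuses the overlap() measure sketched above.

```python
# Hypothetical evaluation harness: compare TEP against several NE
# recognizers on a set of snippet corpora. Each recognizer is assumed
# to be wrapped as text -> set of entity strings.

def evaluate(snippet_corpora, tep, recognizers):
    """Return, per corpus, the overlap of TEP topics with each
    recognizer's entity set (see overlap() above)."""
    results = {}
    for corpus_name, snippets in snippet_corpora.items():
        text = " ".join(snippets)
        topics = tep(text)
        results[corpus_name] = {name: overlap(topics, recognize(text))
                                for name, recognize in recognizers.items()}
    return results
```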