meaningful labels from the web snippets returned by a standard search engine, as described in section 2. Since the Lingo algorithm derives these labels through latent semantic analysis, we interpret them as semantic labels. We then match these labels against the ordered list of chunk-pair-distance elements computed in the topic extraction step described in section 2; all chunk-pair-distance elements that do not match any of the semantic labels are deleted.
The idea is that this filter identifies a semantic relatedness between the labels and the syntactically determined chunks. Since we consider the labels as semantic topics or classes, we assume that the remaining (non-filtered) pairs correspond to relevant, topic-related (via the user query) relationships between semantically related descriptive terms.
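To make the filter concrete, the following is a minimal sketch, assuming that chunk-pair-distance elements are (chunk, chunk, distance) triples and that labels are plain strings; the case-insensitive containment test is our assumption, as the concrete matching criterion is not fixed here.

```python
# Sketch of the label filter described above. The matching criterion
# (case-insensitive containment) is an assumption for illustration.

def filter_by_labels(chunk_pair_distances, semantic_labels):
    """Keep only the chunk-pair-distance elements in which at least
    one chunk matches one of the semantic labels."""
    labels = [label.lower() for label in semantic_labels]

    def matches(chunk):
        chunk = chunk.lower()
        return any(label in chunk or chunk in label for label in labels)

    return [(c1, c2, d) for (c1, c2, d) in chunk_pair_distances
            if matches(c1) or matches(c2)]

# Example: only pairs touching one of the Lingo labels survive.
pairs = [("neural networks", "deep learning", 2),
         ("registration fee", "conference venue", 5)]
print(filter_by_labels(pairs, ["Deep Learning", "Machine Learning"]))
# -> [('neural networks', 'deep learning', 2)]
```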
Of course, it remains to evaluate the quality and usefulness of the extracted topics and the topic graph. In the next sections we discuss two directions: a) a quantitative evaluation against the output of different algorithms for recognizing named entities and other rigid designators, and b) a qualitative evaluation by means of an analysis of user experience.
5 Evaluation of the Extracted Topics
Our topic extraction process is completely unsupervised and web-based, so an evaluation against standard gold corpora is not possible: such corpora simply do not yet exist (or at least, we are not aware of any). For this reason, we decided to compare the outcome of our topic extraction process with the outcomes of a number of different recognizers for named entities (NEs).
Note that the extracted topics very often correspond to rigid designators or generalized named entities, i.e., instances of proper names (persons, locations, etc.) as well as instances of more fine-grained subcategories, such as museum, river, airport, product, or event (cf. [11]). Seen this way, our topic extraction process (abbreviated as TEP) can also be considered a query-driven, context-oriented named entity extraction process, with the notable restriction that the recognized entities are unclassified. If this perspective makes sense, it seems plausible to measure the degree of overlap between the topics extracted by TEP and the entity sets recognized by other named entity components in order to learn about the coverage and quality of TEP.
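The following sketch shows one way such an overlap could be computed; exact string matching after lowercasing is our assumption, and the actual comparison may normalize the terms differently.

```python
# Sketch of an overlap measure between TEP topics and an NE
# recognizer's output. Lowercased exact matching is an assumption.

def overlap(tep_topics, ne_entities):
    """Return (coverage, jaccard): the fraction of TEP topics also
    found by the NE recognizer, and the Jaccard coefficient."""
    topics = {t.lower() for t in tep_topics}
    entities = {e.lower() for e in ne_entities}
    common = topics & entities
    union = topics | entities
    coverage = len(common) / len(topics) if topics else 0.0
    jaccard = len(common) / len(union) if union else 0.0
    return coverage, jaccard

print(overlap(["Berlin", "Brandenburg Gate"], ["berlin", "Potsdam"]))
# -> (0.5, 0.3333333333333333)
```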
For the evaluation of TEP we compared it to the results of four different NE recognizers:
1. SProUT [4]: The SProUT system is a shallow linguistic processor that comes with a rule-based approach to named entity recognition.
2. AlchemyAPI (http://www.AlchemyAPI.com): The AlchemyAPI system uses statistical NLP and machine learning algorithms to perform the NE recognition task.
3. Stanford NER [3]: The Stanford NER system uses a character-based Maximum Entropy Markov model trained on annotated corpora to extract NEs.
4. OpenNLP (http://incubator.apache.org/opennlp/): A collection of natural language processing tools that use the Maxent package to resolve ambiguity, in particular for NE recognition.
We tested all systems with the three snippet corpora described in section 4.
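A small harness along the following lines could drive such a comparison; it assumes each system is wrapped as a function from text to a set of entity strings (the wrappers are hypothetical, since the four systems expose different APIs) and reuses the overlap() measure sketched above.

```python
# Hypothetical evaluation harness: compare TEP against several NE
# recognizers on a set of snippet corpora. Each recognizer is assumed
# to be wrapped as text -> set of entity strings.

def evaluate(snippet_corpora, tep, recognizers):
    """Return, per corpus, the overlap of TEP topics with each
    recognizer's entity set (see overlap() above)."""
    results = {}
    for corpus_name, snippets in snippet_corpora.items():
        text = " ".join(snippets)
        topics = tep(text)
        results[corpus_name] = {name: overlap(topics, recognize(text))
                                for name, recognize in recognizers.items()}
    return results
```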