on the enhancement of annotation quality, the aforementioned approaches tend to assign the same annotation to all images in the same cluster; as a result, they lack customized annotations for individual images.
2.2 Keyword Extraction
In the field of information retrieval, keyword extraction plays a key role in summarization, text clustering/classification, and so on. It aims at extracting keywords that represent the theme of a text. One of the most prominent problems in processing Chinese texts is the identification of valid words in a sentence, since there are no delimiters to separate words from characters. Identifying words and phrases is therefore difficult because of segmentation ambiguities and the frequent occurrence of newly formed words.
In general, Chinese texts can be parsed using dictionary lookup, statistical, or hybrid approaches [15]. The dictionary lookup approach identifies the keywords of a string by mapping its substrings against a well-established corpus. For example, a Chinese sentence meaning "Ying-wen Tsai went to Taipei prison to visit Shui-bian Chen and talk for an hour" will be parsed into its component words by CKIP (Chinese Knowledge and Information Processing), a well-known dictionary-based segmentation system in Taiwan. This method is very efficient, but it fails to identify newly formed or out-of-vocabulary words, and it is also criticized for the triviality of the words it extracts.
Since there are no delimiters to separate words in a Chinese string, except for quotation marks on special occasions, word segmentation is a challenge without the aid of a dictionary. The statistical technique extracts elements by applying n-gram (bi-gram, tri-gram, etc.) computation to the input string. This method relies on the frequency of each segmented token and a threshold to determine the token's validity. Passing the above string through n-gram segmentation produces every overlapping bi-gram, tri-gram, and so on as a candidate token. This method has the benefit of being corpus-free and capable of extracting newly formed or out-of-vocabulary words, at the expense of heavy computation and follow-up filtering efforts.
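As an illustration, here is a minimal sketch of the statistical technique: it enumerates every overlapping bi-gram and tri-gram in a small corpus of strings, counts their frequencies, and keeps the tokens whose count reaches a threshold. The threshold value and the sample strings are assumptions for demonstration.

```python
from collections import Counter

# A minimal sketch of statistical n-gram extraction: enumerate every
# overlapping bi-gram and tri-gram, count frequencies, and keep tokens
# whose count meets a threshold. The threshold here is illustrative.
def ngram_candidates(texts, n_values=(2, 3), threshold=2):
    counts = Counter()
    for text in texts:
        for n in n_values:
            for i in range(len(text) - n + 1):
                counts[text[i:i + n]] += 1
    # Only tokens frequent enough are treated as valid word candidates.
    return {tok: c for tok, c in counts.items() if c >= threshold}

# Tokens recurring across strings survive; one-off fragments are filtered.
print(ngram_candidates(["蔡英文到台北監獄", "蔡英文談話一小時"]))
# -> {'蔡英': 2, '英文': 2, '蔡英文': 2}
```

The candidate set grows quadratically with the n values considered, which is the "huge computation and follow-up filtering" cost noted above.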
Recently, a number of studies have proposed substring [9], significance estimation [16], and relational normalization [17, 18] methods to identify words based on statistical calculations. The hybrid method conducts dictionary mapping for the major task of word extraction and handles the leftovers through n-gram computation, which significantly reduces the number of terms under processing and takes care of both the quality of term segmentation and the identification of unknown words. It has gained popularity and has been adopted by many researchers [19, 20].
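A minimal sketch of the hybrid idea follows, assuming the fmm_segment and ngram_candidates helpers from the sketches above: the dictionary resolves most of the text, and n-gram statistics run only over the leftover runs of characters the dictionary could not match.

```python
# A minimal sketch of the hybrid approach, assuming the fmm_segment and
# ngram_candidates helpers defined in the earlier sketches.
def hybrid_segment(texts, dictionary, max_len=4, threshold=2):
    known, leftovers = [], []
    for text in texts:
        run = ""  # consecutive unmatched characters form one leftover run
        for tok in fmm_segment(text, dictionary, max_len):
            if tok in dictionary:
                if run:
                    leftovers.append(run)
                    run = ""
                known.append(tok)
            else:
                run += tok
        if run:
            leftovers.append(run)
    # n-gram statistics run only over the (much smaller) unresolved runs,
    # which is where the computational savings come from.
    unknown = ngram_candidates(leftovers, threshold=threshold)
    return known, unknown
```

Because the expensive n-gram counting sees only the unresolved runs rather than the whole corpus, the term volume under statistical processing shrinks sharply, which matches the motivation stated above.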
Since the most important task of annotation is to identify the most informative parts of a text relative to the rest, a good text segmentation should help in this identification. In IR theory, the representation of documents is based on the vector space model [21]: a document is a vector of weighted words belonging to a vocabulary V: d = {w_1, …, w_|V|}. Each w_v is such that 0 ≤ w_v ≤ 1 and represents how much the term t_v contributes to the semantics of the document.
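As one common instantiation of such weights (an assumption here, not necessarily the scheme used in [21]), the sketch below computes TF-IDF weights for each document and rescales them so that every w_v lies in [0, 1].

```python
import math
from collections import Counter

# A minimal sketch of the vector space model: each document becomes a
# |V|-dimensional vector of TF-IDF weights rescaled into [0, 1].
# TF-IDF with max-normalization is one common weighting choice, assumed here.
def vsm_vectors(docs):
    vocab = sorted({t for d in docs for t in d})
    df = Counter(t for d in docs for t in set(d))  # document frequency
    vectors = []
    for d in docs:
        tf = Counter(d)
        # weight = term frequency * inverse document frequency (>= 0)
        w = [tf[t] / len(d) * math.log(len(docs) / df[t]) for t in vocab]
        top = max(w) or 1.0  # rescale so every weight lies in [0, 1]
        vectors.append([x / top for x in w])
    return vocab, vectors

docs = [["image", "annotation", "cluster"], ["keyword", "annotation"]]
vocab, vecs = vsm_vectors(docs)
```

Terms that occur in every document receive weight 0 under this scheme, while terms concentrated in few documents receive weights near 1, reflecting how strongly each term characterizes a document.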