Document representation and query processing are the foundation for developing the
vector space model, the Boolean retrieval model, and the probabilistic retrieval
model, which in turn constitute the foundation of search engines. Since the early
1990s, search engines have evolved into mature commercial systems, which generally
consist of fast distributed crawling, efficient inverted indexing, inlink-based
page ranking, and search log analysis [ 10 ].
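As an illustration of how an inverted index supports Boolean retrieval, the following is a minimal sketch, not the design of any particular search engine. The function names `build_inverted_index` and `boolean_and`, the whitespace tokenizer, and the three toy documents are all assumptions made for this example.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercase token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Assumption: plain whitespace tokenization, no stemming or stop words.
        for token in text.lower().split():
            index[token].add(doc_id)
    return index


def boolean_and(index, query):
    """Return the ids of documents that contain every term in the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    # index.get avoids creating empty entries for unseen terms.
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)


# Hypothetical toy collection.
docs = {
    1: "search engines crawl the web",
    2: "an inverted index maps terms to documents",
    3: "search engines use an inverted index",
}
index = build_inverted_index(docs)
print(sorted(boolean_and(index, "inverted index")))  # [2, 3]
print(sorted(boolean_and(index, "search index")))    # [3]
```

A Boolean AND query reduces to intersecting the posting sets of its terms, which is why the index is keyed by term rather than by document.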
NLP enables computers to analyze, interpret, and even generate text. Common NLP
methods include lexical acquisition, word sense disambiguation, part-of-speech
tagging, and probabilistic context-free grammars [ 11 ]. Some NLP-based
technologies have been applied to text mining, including information extraction,
topic models, text summarization, classification, clustering, question answering, and
opinion mining. Information extraction automatically extracts specific structured
information from texts. Named entity recognition (NER), a subtask of information
extraction, aims to recognize atomic entities in texts that belong to predefined
categories (e.g., persons, places, and organizations); it has recently been applied
successfully to news analysis [ 12 ] and medical applications [ 13 ]. Topic models
are built on the idea that “documents are constituted by topics and topics are
probability distributions over the vocabulary.” A topic model is a generative model
of documents: it specifies a probabilistic procedure by which documents are
generated.
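The generative view of topic models can be sketched in a few lines. This is only an illustration under stated assumptions: the two topics, their word probabilities, the document's topic mixture, and the function name `generate_document` are all hypothetical, and no specific topic model from the literature is being reproduced.

```python
import random

# Hypothetical toy model: each topic is a probability distribution over words.
TOPICS = {
    "sports": {"game": 0.5, "team": 0.4, "market": 0.1},
    "finance": {"market": 0.6, "stock": 0.3, "team": 0.1},
}


def generate_document(topic_mixture, length, rng):
    """Generate words by sampling a topic first, then a word from that topic."""
    topic_names = list(topic_mixture)
    topic_weights = [topic_mixture[name] for name in topic_names]
    words = []
    for _ in range(length):
        # Step 1: pick a topic according to the document's topic mixture.
        topic = rng.choices(topic_names, weights=topic_weights, k=1)[0]
        # Step 2: pick a word according to that topic's word distribution.
        vocab = list(TOPICS[topic])
        word_weights = [TOPICS[topic][word] for word in vocab]
        words.append(rng.choices(vocab, weights=word_weights, k=1)[0])
    return words


rng = random.Random(0)  # fixed seed so repeated runs give the same words
doc = generate_document({"sports": 0.8, "finance": 0.2}, length=10, rng=rng)
print(" ".join(doc))
```

Fitting a topic model runs this procedure in reverse: given only the observed words, it estimates the topic mixtures and word distributions that most plausibly generated them.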
Presently, various probabilistic topic models have been used to analyze document
contents and lexical meanings [ 14 ]. Text summarization generates a condensed
summary or excerpt from one or several input text files. It may be classified into
extractive summarization and abstractive summarization [ 15 ]. Extractive
summarization selects important sentences and paragraphs from the source documents
and concatenates them into a shorter form. Abstractive summarization interprets
the source texts and, using linguistic methods, represents them with a few words
and phrases.
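Selecting important sentences can be sketched with a simple word-frequency heuristic. This is one possible scoring scheme chosen for illustration, not the method of the cited work; the function name `extractive_summary`, the regular-expression sentence splitter, and the sample text are assumptions, and no stop-word filtering is applied.

```python
import re
from collections import Counter


def extractive_summary(text, num_sentences=1):
    """Return the highest-scoring sentences, kept in their original order."""
    # Assumption: sentences end with '.', '!' or '?' followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence):
        tokens = re.findall(r"[a-z']+", sentence.lower())
        # Average word frequency, so long sentences are not favored by length.
        return sum(freq[t] for t in tokens) / len(tokens) if tokens else 0.0

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:num_sentences])  # restore document order
    return " ".join(sentences[i] for i in chosen)


# Hypothetical sample text.
text = ("Search engines index documents. "
        "Search engines rank documents by relevance. "
        "The weather was pleasant.")
print(extractive_summary(text, num_sentences=1))
# Search engines index documents.
```

The output consists only of sentences copied from the input, which is what separates this approach from abstractive summarization, where new wording is generated.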
Text classification identifies the topic of a document by assigning the document
to predefined topics. Text classification based on new graph representations
and graph mining has recently attracted considerable interest [ 16 ]. Text
clustering groups similar documents without predefined topics, which distinguishes
it from text classification, where the topics are fixed in advance. In text clustering,
documents may appear in multiple subtopics. Generally, some clustering algorithms
in data mining can be utilized to compute the similarities of documents. However,
it has also been shown that structural relationship information can be exploited to
improve clustering performance on Wikipedia [ 17 ]. A question answering
system is designed to find the best answer to a given question. It involves
techniques for question analysis, source retrieval, answer extraction, and
answer presentation [ 18 ]. Question answering systems may be applied in
many fields, including education, websites, healthcare, and national defense. Opinion
mining, similar to sentiment analysis, refers to computational techniques for
identifying and extracting subjective information from news assessment, comment,