Big Data Applications - Big Data: Related Technologies, Challenges and Future Prospects

Document representation and query processing are the foundation for developing the
vector space model, the Boolean retrieval model, and the probabilistic retrieval model,
which in turn constitute the foundation of search engines. Since the early 1990s, search
engines have evolved into mature commercial systems, which generally consist of
fast distributed crawling, efficient inverted indexing, inlink-based webpage ranking,
and search log analysis [10].
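
To make the inverted index and the Boolean retrieval model concrete, here is a minimal
sketch in Python. The whitespace tokenizer, the toy documents, and the function names
are assumptions made for illustration; a production search engine would use far richer
text analysis and compressed, sorted posting lists.

```python
from collections import defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def build_inverted_index(docs):
    """Map each term to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def boolean_and(index, query):
    """Return the sorted IDs of documents that contain every query term."""
    terms = tokenize(query)
    if not terms:
        return []
    # index.get avoids creating empty entries for unknown terms.
    postings = [index.get(term, set()) for term in terms]
    return sorted(set.intersection(*postings))


# Hypothetical toy collection, used only for illustration.
docs = {
    1: "big data search engines",
    2: "search log analysis",
    3: "big data text mining",
}
index = build_inverted_index(docs)
print(boolean_and(index, "big data"))  # [1, 3]
print(boolean_and(index, "search"))    # [1, 2]
```

The index is built once over the collection, so a conjunctive query only has to
intersect the posting sets of its terms rather than scan every document.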

NLP can enable computers to analyze, interpret, and even generate text. Some
common NLP methods are lexical acquisition, word sense disambiguation,
part-of-speech tagging, and probabilistic context-free grammars [11]. Some NLP-based
technologies have been applied to text mining, including information extraction,
topic models, text summarization, classification, clustering, question answering, and
opinion mining. Information extraction automatically extracts specific structured
information from texts. Named entity recognition (NER), a subtask of information
extraction, aims to recognize atomic entities in texts that belong to predefined
categories (e.g., persons, places, and organizations); it has recently been applied
successfully to news analysis [12] and medical applications [13]. Topic models are
built on the idea that “documents are constituted by topics and topics are the
probability distribution of vocabulary.” They are generative models of documents:
they specify a probabilistic procedure by which documents are generated.
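
The generative view of topic models can be sketched as follows: for each word position,
first draw a topic from the document's topic distribution, then draw a word from that
topic's distribution over the vocabulary. The topics, words, and probabilities below are
invented for illustration and are not taken from the cited work; fitting such
distributions to real documents is a separate inference problem that this sketch does
not cover.

```python
import random


def generate_document(doc_topic_probs, topic_word_probs, length, rng):
    """Generate `length` words: pick a topic per word, then a word from that topic."""
    topics = list(doc_topic_probs)
    topic_weights = [doc_topic_probs[t] for t in topics]
    words = []
    for _ in range(length):
        # Step 1: sample a topic according to the document's topic mixture.
        topic = rng.choices(topics, weights=topic_weights)[0]
        # Step 2: sample a word according to that topic's word distribution.
        vocab = list(topic_word_probs[topic])
        word_weights = [topic_word_probs[topic][w] for w in vocab]
        words.append(rng.choices(vocab, weights=word_weights)[0])
    return words


# Hypothetical toy model: two topics, each a distribution over a few words.
topic_word_probs = {
    "retrieval": {"index": 0.5, "query": 0.3, "crawl": 0.2},
    "health": {"patient": 0.6, "drug": 0.4},
}
doc_topic_probs = {"retrieval": 0.7, "health": 0.3}

rng = random.Random(0)  # seeded so repeated runs give the same document
doc = generate_document(doc_topic_probs, topic_word_probs, 8, rng)
print(doc)
```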

Presently, various probabilistic topic models have been used to analyze document
contents and lexical meanings [14]. Text summarization generates a condensed
summary or extract from one or several input text files. It may be classified
into extractive summarization and abstractive summarization [15].
Extractive summarization selects important sentences and paragraphs from the source
documents and condenses them into a shorter form. Abstractive summarization
interprets the source texts and, using linguistic methods, represents them with a
few words and phrases.
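
As an illustration of sentence selection, the sketch below scores each sentence by the
average corpus frequency of its words and keeps the top-scoring ones in their original
order. This frequency heuristic is an assumption chosen for brevity, not the method of
the cited work; practical systems typically also remove stop words and use stronger
salience signals.

```python
import re
from collections import Counter


def summarize(text, num_sentences=1):
    """Return the highest-scoring sentences, kept in their original order."""
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence):
        tokens = re.findall(r"[a-z']+", sentence.lower())
        # Average frequency, so long sentences are not favored by length alone.
        return sum(freq[t] for t in tokens) / len(tokens) if tokens else 0.0

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:num_sentences])
    return " ".join(sentences[i] for i in chosen)


# Hypothetical input text, used only for illustration.
text = ("Search engines index documents. "
        "Search engines rank documents by links. "
        "Cats sleep.")
print(summarize(text, 1))  # Search engines index documents.
```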

Text classification identifies the probable topic of a document by assigning it
to one of a set of predefined topics. Text classification based on new graph
representations and graph mining has recently attracted considerable interest [16]. Text
clustering groups similar documents; unlike text classification, which assigns
documents to predefined topics, it forms the groups from the documents themselves.
In text clustering, documents may appear in multiple subtopics. Generally, clustering
algorithms from data mining can be used to compute the similarities of documents.
However, it has also been shown that structural relationship information may be
exploited to improve clustering performance on Wikipedia [17]. A question answering
system is designed to search for the optimal answer to a given question. It involves
techniques for question analysis, source retrieval, answer extraction, and
answer presentation [18]. Question answering systems may be applied in
many fields, including education, websites, healthcare, and national defense. Opinion
mining, similar to sentiment analysis, refers to computing technologies for
identifying and extracting subjective information from news assessments, comments,
