typically done through the discovery of patterns and trends by means such as statistical
pattern learning, topic modeling, and statistical language modeling. Text mining usu-
ally requires structuring the input text (e.g., parsing, along with the addition of some
derived linguistic features and the removal of others, and subsequent insertion into a
database). This is followed by deriving patterns within the structured data, and evalua-
tion and interpretation of the output. “High quality” in text mining usually refers to a
combination of relevance, novelty, and interestingness.
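The structure-then-mine sequence described above can be sketched in a few lines of Python. This is only a minimal illustration, not a full text mining system: the stop-word list, the sample documents, and the function names (structure, term_frequencies, frequent_terms) are all assumptions made for the example, and "pattern" here means nothing more than a term that appears in several documents.

```python
import re
from collections import Counter

# Hypothetical stop-word list, chosen only for this illustration.
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "in", "to"}

def structure(text):
    """Structure raw text: lowercase, tokenize on letters, drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def term_frequencies(docs):
    """Return one Counter of term counts per document."""
    return [Counter(structure(d)) for d in docs]

def frequent_terms(docs, min_docs):
    """Derive a very simple pattern: terms found in at least min_docs documents."""
    doc_freq = Counter()
    for counts in term_frequencies(docs):
        doc_freq.update(counts.keys())  # count each term once per document
    return sorted(t for t, n in doc_freq.items() if n >= min_docs)

# Hypothetical sample documents.
docs = [
    "The mining of text data is a form of data mining.",
    "Web mining applies data mining to the Web.",
    "Sentiment analysis is a text mining task.",
]
print(frequent_terms(docs, min_docs=2))  # ['data', 'mining', 'text']
```

Evaluation and interpretation of the output, the last step named above, is left to the reader here; in practice it would weigh the relevance, novelty, and interestingness of whatever patterns are found.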
Typical text mining tasks include text categorization, text clustering, concept/entity
extraction, production of granular taxonomies, sentiment analysis, document summa-
rization, and entity-relation modeling (i.e., learning relations between named entities).
Other examples include multilingual data mining, multidimensional text analysis, con-
textual text mining, and trust and evolution analysis in text data, as well as text mining
applications in security, biomedical literature analysis, online media analysis, and ana-
lytical customer relationship management. Various kinds of text mining and analysis
software and tools are available in academic institutions, open-source forums, and
industry. Text mining often also uses WordNet, Semantic Web, Wikipedia, and other
information sources to enhance the understanding and mining of text data.
Mining Web Data
The World Wide Web serves as a huge, widely distributed, global information center for
news, advertisements, consumer information, financial management, education, gov-
ernment, and e-commerce. It contains a rich and dynamic collection of information
about web page contents with hypertext structures and multimedia, hyperlink informa-
tion, and access and usage information, providing fertile sources for data mining. Web
mining is the application of data mining techniques to discover patterns, structures, and
knowledge from the Web. According to analysis targets, web mining can be organized
into three main areas: web content mining, web structure mining, and web usage mining.
Web content mining analyzes web content such as text, multimedia data, and struc-
tured data (within web pages or linked across web pages). This is done to understand the
content of web pages, provide scalable and informative keyword-based page indexing,
entity/concept resolution, web page relevance and ranking, web page content sum-
maries, and other valuable information related to web search and analysis. Web pages
can reside either on the surface web or on the deep Web. The surface web is that portion
of the Web that is indexed by typical search engines. The deep Web (or hidden Web)
refers to web content that is not part of the surface web. Its contents are provided by
underlying database engines.
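As a rough illustration of the keyword-based page indexing mentioned above, the following Python sketch builds a small inverted index that maps each keyword to the pages containing it. The page URLs, page texts, and function names (build_index, search) are hypothetical, and a real search engine would add ranking, stemming, and much more; this shows only the basic idea under those assumptions.

```python
import re
from collections import defaultdict

def build_index(pages):
    """Map each keyword to the set of page URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in set(re.findall(r"[a-z0-9]+", text.lower())):
            index[word].add(url)
    return index

def search(index, query):
    """Return the sorted URLs that contain every keyword in the query."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return []
    hits = set.intersection(*(index.get(w, set()) for w in words))
    return sorted(hits)

# Hypothetical surface-web pages, used only for illustration.
pages = {
    "http://example.org/a": "Data mining on the Web",
    "http://example.org/b": "Web structure and hyperlinks",
    "http://example.org/c": "Mining usage logs",
}
index = build_index(pages)
print(search(index, "web mining"))  # ['http://example.org/a']
```

Only pages that a crawler can reach would end up in such an index, which is one way to picture the difference between the surface web and the deep Web.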
Web content mining has been studied extensively by researchers, search engines, and
other web service companies. Web content mining can build links across multiple web
pages for individuals; therefore, it has the potential to inappropriately disclose personal
information. Studies on privacy-preserving data mining address this concern through
the development of techniques to protect personal privacy on the Web.
Web structure mining is the process of using graph and network mining theory
and methods to analyze the nodes and connection structures on the Web. It extracts
patterns from hyperlinks, where a hyperlink is a structural component that connects a
 