Preview of Java Data Mining 2.0 - Java Data Mining: Strategy, Standard, and Practice


As noted, text data is far more prevalent than structured data (e.g., data stored in relational tables or XML). Typical use cases for text mining involve large repositories of text documents, such as reports, technical papers, magazine articles, or e-mails, that need to be automatically classified or grouped into appropriate hierarchical clusters. However, the ability to combine unstructured text data with structured data, as found in relational tables, can supplement the knowledge that data mining can extract. For example, customer service representatives in call centers type in notes from their interactions with customers. Physicians and nurses often provide textual notes on their patients and their progress. This valuable information is often excluded from the mining process due to a lack of convenient means to use it.

Although text mining is a field unto itself, with techniques ranging from statistical analysis of terms in documents to natural language processing, there are aspects that can be included in an application programming interface (API) to provide benefit to JDM users. A first-level objective for text mining in JDM is to provide a simple, minimal interface for mining text data. Specifically, JDM allows users to define an attribute as text or a text reference. A text attribute contains the text value in place, whereas a text reference attribute contains a uniform resource identifier (URI) indicating where the actual text resides.
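The distinction between a text attribute and a text reference attribute can be illustrated with a minimal Java sketch. The type names below (TextSource, InPlaceText, TextReference) are hypothetical and are not taken from the JDM interfaces; the sketch also assumes that the URI points to a local UTF-8 file, whereas a real vendor implementation might support other URI schemes.

```java
import java.io.IOException;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class TextAttributeSketch {

    /** Hypothetical common view over both kinds of text-bearing attribute. */
    public interface TextSource {
        String resolveText() throws IOException;
    }

    /** Text attribute: the text value is stored in place. */
    public static final class InPlaceText implements TextSource {
        private final String value;

        public InPlaceText(String value) {
            this.value = value;
        }

        @Override
        public String resolveText() {
            return value;
        }
    }

    /** Text reference attribute: stores only a URI indicating where the text resides. */
    public static final class TextReference implements TextSource {
        private final URI location;

        public TextReference(URI location) {
            this.location = location;
        }

        @Override
        public String resolveText() throws IOException {
            // Assumption: only file: URIs are handled in this sketch.
            return Files.readString(Path.of(location), StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        // Write a sample call-center note to a temporary file to act as the referenced text.
        Path note = Files.createTempFile("call-note", ".txt");
        Files.writeString(note, "Customer reported a billing error.");

        TextSource inPlace = new InPlaceText("Patient is recovering well.");
        TextSource byReference = new TextReference(note.toUri());

        System.out.println(inPlace.resolveText());
        System.out.println(byReference.resolveText());

        Files.delete(note);
    }
}
```

In this sketch, code that consumes the text does not need to know whether the value was stored in place or fetched through the URI, which is one plausible reason for treating the two attribute kinds uniformly.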

At this stage, JDM leaves the details of preprocessing text attributes to the vendor. The vendor can make some assumptions about the nature of the text and, for example, perform term extraction and feed the results into a suitable algorithm. Alternatively, the vendor can extend the interfaces to enable the specification of supplementary settings such as stop word lists, thesauri, concept hierarchies, index specifications, and so on. Vendors may also choose to expose text mining-specific transformations.
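As a rough illustration of the kind of preprocessing a vendor might perform, the following Java sketch extracts terms from a piece of text, filters them against a stop word list, and counts the remaining terms. The class and method names (TermExtractionSketch, termCounts) are hypothetical, and the tokenization rule (lowercase, split on anything that is not a letter or digit) is an assumption chosen for simplicity; it is not part of JDM.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

public class TermExtractionSketch {

    /**
     * Counts the terms in the text, ignoring case and skipping stop words.
     * Terms are returned in alphabetical order.
     */
    public static Map<String, Integer> termCounts(String text, Set<String> stopWords) {
        Map<String, Integer> counts = new TreeMap<>();
        // Split on runs of characters that are neither letters nor decimal digits.
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (token.isEmpty() || stopWords.contains(token)) {
                continue;
            }
            counts.merge(token, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical stop word list; a vendor setting could supply this instead.
        Set<String> stopWords = Set.of("the", "a", "is", "and", "to");
        String note = "The customer is upset and the customer wants a refund.";
        System.out.println(termCounts(note, stopWords));
        // prints {customer=2, refund=1, upset=1, wants=1}
    }
}
```

The resulting term counts could then be fed into a mining algorithm alongside structured attributes. A fuller treatment would likely add stemming, a thesaurus, or a concept hierarchy, which are the kinds of supplementary settings mentioned above.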

18.8 Summary

This chapter introduced a few of the new features being considered for Java Data Mining 2.0. Because the expert group is still evolving this standard before its final release, concepts and details of the specific features, and even the features themselves, may change.
