As noted, text data is far more prevalent than structured data (e.g.,
data stored in relational tables or XML). Typical use cases for text
mining involve large repositories of text documents such as reports,
technical papers, magazine articles, or e-mails that need to be
automatically classified or grouped into appropriate hierarchical
clusters. However, the ability to combine unstructured text data
with structured data as found in relational tables can supplement
the knowledge that data mining can extract. For example, customer
service representatives in call centers type in notes from their
interactions with customers. Physicians and nurses often provide
textual notes on their patients and their progress. This valuable
information is often excluded from the mining process due to a lack
of convenient means to use it.
Although text mining is a field unto itself, with techniques ranging
from statistical analysis of terms in documents to natural language
processing, there are aspects that can be included in an application
programming interface (API) to provide benefit to JDM users. A first-
level objective for text mining in JDM is to provide a simple, minimal
interface for mining text data. Specifically, JDM allows users to
define an attribute as text or a text reference. A text attribute contains
the text value in place, whereas a text reference attribute contains a
uniform resource identifier (URI) indicating where the actual text
resides.
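
The distinction between the two attribute kinds can be illustrated with
a minimal sketch. The class, enum, and method names below are
hypothetical and are not part of the JDM API; the sketch also assumes
that a text reference uses a file: URI and that the text is UTF-8
encoded, neither of which the standard prescribes.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical illustration only: none of these names come from the JDM API.
final class TextAttributeSketch {

    // The two ways an attribute can carry text, as described above.
    enum Kind { TEXT, TEXT_REFERENCE }

    // Returns the text to be mined for a single attribute value.
    static String resolve(Kind kind, String value) {
        if (kind == Kind.TEXT) {
            return value; // the text is stored in place
        }
        // A text reference holds a URI; this sketch follows only file: URIs.
        URI uri = URI.create(value);
        if (!"file".equalsIgnoreCase(uri.getScheme())) {
            throw new IllegalArgumentException(
                "Unsupported URI scheme: " + uri.getScheme());
        }
        try {
            // Assumption: the referenced document is UTF-8 encoded.
            return new String(Files.readAllBytes(Paths.get(uri)),
                              StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A caller would pass the note itself for a text attribute, as in
resolve(Kind.TEXT, "call dropped twice"), and the location of the
document for a text reference attribute. A real implementation would
likely support other URI schemes and defer loading until the text is
actually needed.
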
At this stage, JDM leaves the details of preprocessing text
attributes to the vendor. The vendor can make some assumptions
about the nature of the text and, for example, perform term
extraction and feed the results into a suitable algorithm.
Alternatively, the vendor can extend the interfaces to enable the
specification of supplementary settings such as stop word lists,
thesauri, concept hierarchies, index specifications, and so on.
Vendors may also choose to
expose text mining-specific transformations.
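
As a rough idea of what vendor-side preprocessing might involve, the
following sketch performs a very simple form of term extraction with a
stop word list. It is an assumption made for illustration, not the
behavior of any particular vendor or of JDM itself: the class and method
names are hypothetical, tokens are split on any character that is not a
letter or digit, and no stemming, thesaurus, or concept hierarchy is
applied.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical illustration only: not part of the JDM API or any vendor product.
final class TermExtractionSketch {

    // Counts how often each term occurs in the text, skipping stop words.
    static Map<String, Integer> termCounts(String text, Set<String> stopWords) {
        Map<String, Integer> counts = new TreeMap<>(); // sorted by term
        // Lowercase the text, then split on runs of non-letter, non-digit characters.
        String[] tokens = text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+");
        for (String token : tokens) {
            if (token.isEmpty() || stopWords.contains(token)) {
                continue; // ignore empty tokens and stop words
            }
            counts.merge(token, 1, Integer::sum);
        }
        return counts;
    }
}
```

For the note "The customer called about the bill; the bill was wrong."
with the stop words "the", "about", and "was", the sketch yields the
counts bill=2, called=1, customer=1, wrong=1. Term counts of this kind
are one possible input to a downstream mining algorithm.
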
This chapter introduced a few of the new features being considered
for Java Data Mining 2.0. Because the expert group is still evolving
this standard before its final release, concepts and details of the
specific features, and even the features themselves, may change.