Overview of Data Mining - Java Data Mining: Strategy, Standard, and Practice


material). The liquid sludge is diverted into holding tanks and referred to as the

pregnant solution—a liquid sludge containing 70% of the gold.

Like rock and ore, raw data needs to be prepared. The mechanisms for refining it to enable knowledge extraction involve data analysis technology, data cleansing, transformations, and attribute synthesis. These are big terms for problems such as graphing data values, correcting typos, dealing with missing values, categorizing data values (e.g., age) into buckets instead of continuous values, and creating new attributes based on other attributes (e.g., customer lifetime value).
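
Three of the preparation steps named above (bucketing a continuous value, filling in missing values, and synthesizing a new attribute) can be sketched in a few lines of Java. This is a minimal illustration under stated assumptions, not a prescribed method: the class and method names, the bucket boundaries, the mean-fill strategy, and the lifetime-value formula are all hypothetical choices made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only: all names, boundaries, and formulas are assumptions.
class DataPrepSketch {

    // Binning: map a continuous age onto a coarse bucket.
    static String ageBucket(int age) {
        if (age < 18) return "under-18";
        if (age < 35) return "18-34";
        if (age < 55) return "35-54";
        return "55-plus";
    }

    // Missing values: replace each null with the mean of the observed values.
    static List<Double> imputeMean(List<Double> values) {
        double sum = 0.0;
        int count = 0;
        for (Double v : values) {
            if (v != null) {
                sum += v;
                count++;
            }
        }
        double mean = (count == 0) ? 0.0 : sum / count;
        List<Double> filled = new ArrayList<>();
        for (Double v : values) {
            filled.add(v != null ? v : mean);
        }
        return filled;
    }

    // Attribute synthesis: derive a new attribute from existing ones.
    // The formula is a deliberately simple stand-in for customer lifetime value.
    static double lifetimeValue(double avgOrderValue, double ordersPerYear, double yearsRetained) {
        return avgOrderValue * ordersPerYear * yearsRetained;
    }

    public static void main(String[] args) {
        System.out.println(ageBucket(42));                                  // 35-54
        System.out.println(imputeMean(Arrays.asList(40000.0, null, 60000.0))); // [40000.0, 50000.0, 60000.0]
        System.out.println(lifetimeValue(50.0, 4.0, 3.0));                  // 600.0
    }
}
```

In practice the bucket boundaries and the fill strategy depend on the data and on the algorithm that will consume it; the sketch only shows the shape of each step.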

[The liquid sludge] is drawn from holding tanks through a clarifier, a device that removes all the remaining rock or clay from a pregnant solution. In the next step, the material is taken to a de-aerator tank that removes bubbles of air and further clarifies the solution.

The dataset as presented to the data mining algorithm can be viewed as the “pregnant solution.” As a data mining algorithm executes, it makes finer and more precise distinctions about the data to extract knowledge. This can be in the form of, for example, rules that define customer profiles, common co-occurrences of product sales enabling cross-sell, or a representative case that describes a set of patients susceptible to a type of cancer.
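
To make the "co-occurrences of product sales" form of knowledge concrete, the following sketch counts how often each pair of products appears in the same basket. It is only a toy illustration: the class name, the `pairCounts` method, and the sample baskets are assumptions invented for the example, and a real association-rules algorithm does considerably more (support and confidence thresholds, larger itemsets).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Illustrative sketch only: names and sample data are hypothetical.
class CoOccurrenceSketch {

    // Count how often each unordered pair of products shares a basket.
    static Map<String, Integer> pairCounts(List<List<String>> baskets) {
        Map<String, Integer> counts = new TreeMap<>();
        for (List<String> basket : baskets) {
            // Sort and de-duplicate so every pair gets one canonical key per basket.
            List<String> items = new ArrayList<>(new TreeSet<>(basket));
            for (int i = 0; i < items.size(); i++) {
                for (int j = i + 1; j < items.size(); j++) {
                    counts.merge(items.get(i) + "+" + items.get(j), 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<List<String>> baskets = Arrays.asList(
                Arrays.asList("bread", "milk"),
                Arrays.asList("bread", "milk", "eggs"),
                Arrays.asList("milk", "eggs"));
        System.out.println(pairCounts(baskets)); // {bread+eggs=1, bread+milk=2, eggs+milk=2}
    }
}
```

Pairs with high counts are the candidates a cross-sell campaign would look at first.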

Zinc is added in dust form to the de-aerated solution, which is then drawn under pressure through a filter press, causing the gold and zinc to precipitate onto canvas (heavy cloth) filter leaves. This zinc-gold precipitate (condensed into a solid) is then cleaned from the filters while extreme heat burns off the zinc.

The purified “precipitate” of data mining is the emerging model, which contains the extracted knowledge. It needs to be tested and possibly refined, by changing parameters or by further preparing the data, to produce a sufficient knowledge yield.
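
The test-and-refine loop described above can be sketched as follows. The "model" here is deliberately trivial (a single score threshold), and the class, method names, and sample data are assumptions made for illustration; the point is only the pattern of evaluating a model on test data and keeping the parameter setting that yields the best result.

```java
// Illustrative sketch only: the threshold "model" and all names are hypothetical.
class RefineSketch {

    // Test step: fraction of cases where "score >= threshold" matches the known label.
    static double accuracy(double[] scores, boolean[] labels, double threshold) {
        int correct = 0;
        for (int i = 0; i < scores.length; i++) {
            boolean predicted = scores[i] >= threshold;
            if (predicted == labels[i]) {
                correct++;
            }
        }
        return (double) correct / scores.length;
    }

    // Refine step: try each candidate parameter value and keep the most accurate one.
    static double bestThreshold(double[] scores, boolean[] labels, double[] candidates) {
        double best = candidates[0];
        double bestAccuracy = accuracy(scores, labels, best);
        for (double candidate : candidates) {
            double acc = accuracy(scores, labels, candidate);
            if (acc > bestAccuracy) {
                bestAccuracy = acc;
                best = candidate;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        double[] scores = {0.1, 0.4, 0.6, 0.9};
        boolean[] labels = {false, false, true, true};
        double[] candidates = {0.3, 0.5, 0.7};
        System.out.println(bestThreshold(scores, labels, candidates)); // 0.5
    }
}
```

A real project would evaluate on data held out from model building, so that the chosen parameters are not simply fitted to the cases used to build the model.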

Water passing through the filters is chemically tested for gold residue before being discharged into tailings ponds. Gold-bearing water may be passed through the filtering process several times to remove all of the gold and separate it from impure substances.

Mining algorithms will often make several passes over the data to continually tune or refine the model. Algorithms such as neural networks, decision trees, and K-means clustering make multiple passes over the data until any further improvement is determined to be insignificant or some other stopping criterion is met.
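
As an illustration of multiple passes with a stopping criterion, here is a minimal one-dimensional K-means sketch. It assumes a caller-supplied tolerance and pass limit (both hypothetical parameters for this example) and stops when no centroid moves by more than the tolerance or when the pass limit is reached. It is a teaching sketch, not a production clustering implementation.

```java
import java.util.Arrays;

// Illustrative sketch only: names, parameters, and sample data are hypothetical.
class KMeansSketch {

    // Refines the centroids in place and returns the number of passes made over the data.
    static int fit(double[] data, double[] centroids, double tolerance, int maxPasses) {
        int passes = 0;
        while (passes < maxPasses) {
            passes++;
            double[] sum = new double[centroids.length];
            int[] count = new int[centroids.length];

            // One full pass: assign every value to its nearest centroid.
            for (double x : data) {
                int best = 0;
                for (int k = 1; k < centroids.length; k++) {
                    if (Math.abs(x - centroids[k]) < Math.abs(x - centroids[best])) {
                        best = k;
                    }
                }
                sum[best] += x;
                count[best]++;
            }

            // Move each centroid to the mean of its members; track the largest shift.
            double maxShift = 0.0;
            for (int k = 0; k < centroids.length; k++) {
                if (count[k] == 0) {
                    continue; // leave an empty cluster's centroid where it is
                }
                double updated = sum[k] / count[k];
                maxShift = Math.max(maxShift, Math.abs(updated - centroids[k]));
                centroids[k] = updated;
            }

            // Stopping criterion: further improvement is insignificant.
            if (maxShift < tolerance) {
                break;
            }
        }
        return passes;
    }

    public static void main(String[] args) {
        double[] data = {1, 2, 3, 10, 11, 12};
        double[] centroids = {0, 5};
        int passes = fit(data, centroids, 1e-6, 100);
        System.out.println(passes + " passes, centroids " + Arrays.toString(centroids));
        // 3 passes, centroids [2.0, 11.0]
    }
}
```

The pass limit plays the role of the "other stopping criterion": it bounds the work even if the centroids never settle within the tolerance.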
