Various reporting tools can issue queries about data, both in

operational data stores and the data warehouse. OLAP cubes may be

refreshed with the latest data to facilitate slicing and dicing the latest

results. Data mining, as described earlier, plays a key role not only in

understanding the interactions between historical data and

outcomes, but also in characterizing those interactions in a way that can

predict future outcomes and feed those outcomes back into other

analysis and decision making.

Although the corporate data available for mining may be considerable,

an assessment of that data may show that you need to

supplement it. Banks, for example, know every transaction their

customers have ever made, their account balances, and details from

customer loan applications. One could say that a bank has plenty of

data to mine. However, the “availability of data” may not be

sufficient to build a useful data mining model, or a model that satisfies a

particular need. If a bank is trying to understand its customers based

on demographics, such as personal interests, as part of a marketing

campaign, those demographics are not typically part of the bank's

operational data stores or data warehouse. Before mining the data, a

bank may have to acquire demographic information, either through

direct solicitation from customers or by purchasing customer

information from third-party providers.

1.2.4 What Is a Data Mining Model?

We have used the term model several times already and defined it as a

compact representation of the patterns found in historical data. To

illustrate the concept of a data mining model more concretely,

consider a simple linear regression problem, that is, predicting a

continuous numerical value from one or more inputs. Basically, we have a

set of points on a graph and we want to fit a straight line to them.

This functionality was provided in the Texas Instruments TI-55 scientific

calculator of the 1970s and had been around long before that.

Essentially, the algorithm iterates over the data to collect statistics

and then determines the coordinates of the line that best fits the set of

two-dimensional points; this is illustrated in Figure 1-4.
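
The following is a minimal sketch of how such a fit might be computed, assuming the "best fit" is the ordinary least-squares line (the text does not name the criterion). The class name LineFitSketch, the method fit, and the sample age and income values are hypothetical and chosen only for illustration.

```java
public final class LineFitSketch {

    /**
     * Returns {m, b} for the line y = m*x + b that minimizes the sum of
     * squared vertical distances to the points (ordinary least squares).
     */
    public static double[] fit(double[] x, double[] y) {
        if (x.length != y.length || x.length < 2) {
            throw new IllegalArgumentException("need at least two (x, y) pairs of equal length");
        }
        int n = x.length;
        double sumX = 0, sumY = 0, sumXY = 0, sumXX = 0;
        // One pass over the data to collect the summary statistics.
        for (int i = 0; i < n; i++) {
            sumX += x[i];
            sumY += y[i];
            sumXY += x[i] * y[i];
            sumXX += x[i] * x[i];
        }
        double denom = n * sumXX - sumX * sumX;
        if (denom == 0) {
            throw new IllegalArgumentException("all x values are identical; slope is undefined");
        }
        // Closed-form least-squares slope and intercept.
        double m = (n * sumXY - sumX * sumY) / denom;
        double b = (sumY - m * sumX) / n;
        return new double[] { m, b };
    }

    public static void main(String[] args) {
        // Hypothetical sample data (age, income in thousands).
        double[] age = { 20, 25, 30, 35 };
        double[] income = { 45, 55, 65, 75 };
        double[] model = fit(age, income);
        System.out.println("m = " + model[0] + ", b = " + model[1]);
    }
}
```

Note that the data is needed only to accumulate four running sums; once the slope and intercept are computed, the original points can be discarded.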

The model that represents this line is simply expressed as two

values from the equation y = mx + b, where m is the slope and b is

the y-intercept. A model that consists of m and b is sufficient to make

predictions for y given a value of x. For example, if m = 2 and b = 5,

then if x (or age) = 25, we predict the value of y (or income) = 55

(thousand).
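
To make the idea of a model as "just two values" concrete, here is a small sketch that stores the slope and intercept and applies them to a new input. The class name LinearModelSketch and its predict method are hypothetical names used for illustration; they are not part of any data mining API discussed in the text.

```java
public final class LinearModelSketch {

    private final double m; // slope
    private final double b; // y-intercept

    public LinearModelSketch(double m, double b) {
        this.m = m;
        this.b = b;
    }

    /** Applies the model: y = m*x + b. */
    public double predict(double x) {
        return m * x + b;
    }

    public static void main(String[] args) {
        // Values from the example in the text: slope 2, intercept 5.
        LinearModelSketch model = new LinearModelSketch(2.0, 5.0);
        double income = model.predict(25.0); // age = 25
        System.out.println(income); // prints 55.0
    }
}
```

The historical data plays no part in the prediction itself; only the two stored values are consulted.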
