Various reporting tools can issue queries about data, both in
operational data stores and the data warehouse. OLAP cubes may be
refreshed with the latest data to facilitate slicing and dicing the latest
results. Data mining, as described earlier, not only plays a key role in
understanding the interactions between historical data and out-
comes, but also in characterizing those interactions in a way that can
predict future outcomes and feed those outcomes back into other
analysis and decision making.
Although corporate data available for mining may be considerable,
when assessing the data available for mining you may need to
supplement it. Banks, for example, know every transaction their
customers have ever made, their account balances, and details from
customer loan applications. One could say that a bank has plenty of
data to mine. However, the “availability of data” may not be suffi-
cient to build a useful data mining model, or a model that satisfies a
particular need. If a bank is trying to understand its customers based
on demographics, such as personal interests as part of a marketing
campaign, those demographics are not typically part of the bank's
operational data stores or data warehouse. Before mining the data, a
bank may have to acquire demographic information, either through
direct solicitation from customers or by purchasing customer
information from third-party providers.
What Is a Data Mining Model?
We have used the term model several times already and defined it as a
compact representation of the patterns found in historical data. To
illustrate the concept of a data mining model more concretely, con-
sider a simple linear regression problem—that is, predicting a continu-
ous numerical value from one or more inputs. Basically, we have a
set of points on a graph and we want to fit a straight line to them.
This functionality was provided in the Texas Instruments TI-55 scien-
tific calculator of the 1970s and had been around long before that.
Essentially, the algorithm iterates over the data to collect statistics
and then determines the coordinates of the line that best fits the set of
two-dimensional points; this is illustrated in Figure 1-4.
The model that represents this line is simply expressed as two
values from the equation y
b, where m is the slope, and b is
the y -intercept. A model that consists of m and b is sufficient to make
predictions for y given a value of x . For example, if m
2 and b
then if x (or age)
25, we predict the value of y (or income)