Fitting a regression line to a set of data points.
In two dimensions, 4 involving the attributes age and income, and
with a small number of data points, the problem seems fairly simple.
We could probably even “eye” a solution by drawing a line to fit the
data and estimating the values for m and b . However, consider data
that is not in two dimensions, but a hundred, a thousand, or several
thousand dimensions. The attributes may include numerical values
or consist of categories that are either numbers or strings. Some of
the categorical data may have an ordering (e.g., high, medium, low ) or
be unordered (e.g., married, unmarried, widowed ). Further, consider
cases where there are tens of thousands or millions of cases. It is
intractable for a human to make sense of this data, but it is relatively
easy for the right algorithm executing on a sufficiently powerful
computer. Here lies the essence of data mining.
Now, data mining algorithms are typically much more complex
than that of linear regression; however, the concept is the same: there
is a compact representation of the “knowledge” present in the data
that can be used for prediction or inspection.
Every field has its jargon—the vernacular of the “in crowd.” Here's a
quick overview of some of the data mining jargon.
At one level, data mining experts talk about things like models
and techniques, or mining functions called classification, regression,
clustering, attribute importance, and association. Classification mod-
els predict some outcome, such as which offer a customer will
respond to. Regression models predict a continuous number, such as
what is the predicted value of a home or a person's income.
This use of the term “dimension” here should not be confused with the same
term as used in OLAP, which has a different intent. Here, the term “dimension”
is synonymous with “attribute” or “column.”