the known outcomes to compute statistics that enable subsequent
predictions. Supervised models use data consisting of predictors and
targets. The predictors are attributes (columns) used to predict the
outcome, that is, the target attribute (also a column).
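
To make the predictor/target distinction concrete, here is a minimal sketch in plain Python (not the JDM API). The data, the attribute names (income, owns_home, buys), and the helper functions are hypothetical and chosen only for illustration. The "model" simply counts which target value was seen most often for each value of one predictor, which is one very simple way of computing statistics from known outcomes.

```python
from collections import Counter, defaultdict

# Hypothetical training data: every case carries predictor attributes
# ("income", "owns_home") and a known target attribute ("buys").
training_cases = [
    {"income": "high", "owns_home": "yes", "buys": "yes"},
    {"income": "high", "owns_home": "no", "buys": "yes"},
    {"income": "high", "owns_home": "no", "buys": "no"},
    {"income": "low", "owns_home": "yes", "buys": "no"},
    {"income": "low", "owns_home": "no", "buys": "no"},
    {"income": "low", "owns_home": "yes", "buys": "yes"},
]


def build_model(cases, predictor_name, target_name):
    """Count the target values observed for each value of one predictor."""
    counts = defaultdict(Counter)
    for case in cases:
        counts[case[predictor_name]][case[target_name]] += 1
    return dict(counts)


def predict(model, predictor_value):
    """Return the most frequent target for this predictor value, or None if unseen."""
    if predictor_value not in model:
        return None
    return model[predictor_value].most_common(1)[0][0]


# Build the statistics from known outcomes, then apply them to new values.
model = build_model(training_cases, "income", "buys")
print(predict(model, "high"))
print(predict(model, "low"))
```

Real supervised algorithms combine many predictors and handle unseen values more gracefully; this sketch uses a single predictor only to keep the roles of predictor and target visible.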
Unsupervised learning does not require, and does not accept,
knowledge of any correct answer. It examines all the data and
applies an algorithm that looks for structure in the data itself,
without reference to a target. Clustering is an unsupervised
technique that determines the clusters that naturally exist in the data.
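
As an illustration of clustering without any target attribute, the following is a minimal sketch of one-dimensional k-means in plain Python. K-means is used here only as a familiar example of a clustering algorithm; the text does not prescribe it. The function names, the sample values, and the way initial centers are chosen are all assumptions made for this sketch, and the input is assumed to be non-empty.

```python
def assign(values, centers):
    """Place each value in the cluster of its nearest center."""
    clusters = [[] for _ in centers]
    for v in values:
        nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
        clusters[nearest].append(v)
    return clusters


def kmeans_1d(values, k, max_iterations=20):
    """Group numeric values into k clusters; no target values are supplied."""
    ordered = sorted(values)
    n = len(ordered)
    # Assumption: start from values taken at the middle of k equal slices
    # of the sorted data, so that the result is deterministic.
    centers = [ordered[(2 * i + 1) * n // (2 * k)] for i in range(k)]
    for _ in range(max_iterations):
        clusters = assign(values, centers)
        # Move each center to the mean of its cluster (keep it if the cluster is empty).
        new_centers = [
            sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return centers, assign(values, centers)


# Hypothetical data with two visibly separate groups.
values = [10.0, 1.0, 11.0, 2.0, 1.5, 10.5]
centers, clusters = kmeans_1d(values, k=2)
print(centers)
print(clusters)
```

Note that the caller supplies only the data and the number of clusters; nothing in the input says which group a value belongs to.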
In data mining, several terms have evolved to mean the same
thing. For example, when referring to a column of data, the typical
relational database term, we will see the terms attribute, field, and
variable. Similarly, when referring to the rows of data, we will see the
terms case, record, example, and instance. They typically can be used
interchangeably. 5 In JDM, we have adopted the terms attribute and
The Mining Metaphor
Data mining is the process of extracting knowledge from data. That
knowledge can be used to understand the nature of a business or
scientific problem, or applied to new data to make predictions or
classifications. Just as mining in the physical world involves a
process of going from raw earth to refined material (e.g., gold, steel,
and platinum) to end products (e.g., jewelry, electronics), data
mining involves a process of going from large volumes of raw data to
extracted knowledge to knowledge applied in practice. This section
takes this metaphor to its limit by contrasting a description of the
gold mining process [Wells 2006] with data mining.
Gold mining involves the science, technology, and business of the discovery of
gold, in addition to its removal and sale in the marketplace. Gold may be
found in many places, most commonly rock but even sea water, in very small
quantities. More often it is found in greater quantities in veins associated
with igneous rocks, rocks created by heat such as quartzite.
“Data Mining” is somewhat of a misnomer since we are not trying
to discover “data,” but the knowledge that is present in data. In any
5 There are some distinctions to be made. For example, a case may comprise
multiple records when the data is stored in transactional format. Here a case
corresponds to a transaction consisting, perhaps, of multiple items purchased
at a grocery store checkout.
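
The difference between a record and a case in transactional format can be sketched as follows, in plain Python. The transaction identifiers, item names, and the group_into_cases helper are hypothetical; the point is only that several records sharing one transaction identifier collapse into a single case.

```python
from collections import defaultdict

# Hypothetical records in transactional format: one row per purchased item,
# given as (transaction_id, item).
records = [
    (1001, "milk"),
    (1001, "bread"),
    (1002, "eggs"),
    (1001, "butter"),
    (1002, "milk"),
]


def group_into_cases(rows):
    """Collect all items that share a transaction id into one case."""
    cases = defaultdict(list)
    for transaction_id, item in rows:
        cases[transaction_id].append(item)
    return dict(cases)


cases = group_into_cases(records)
print(len(records), "records form", len(cases), "cases")
```

Here five records yield two cases, one per checkout transaction.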