require paying more attention to the data being collected, using the
right tools or tools that have more automated intelligence for the
mining process, or hiring more skilled or experienced individuals.
Before hard rock mining operations have even begun, companies explore areas
where gold may be found and scientifically analyse the rock. The actual gold
originates deep within the earth in places called pockets. These pockets are filled with
gold, heavy ore, and quartz. If enough gold is discovered in the ore, the technological
process of hard rock mining begins.
As we discuss later in Chapter 3, the data mining process
begins with a clear understanding of the business objectives and
data mining goals. As in gold mining, we then need to survey the
corporate landscape for available data. Sometimes the needed data
is readily available in repositories such as data warehouses
and data marts. Other times, data resides in various databases
that support operational systems. In less sophisticated organiza-
tions, data may reside in Excel spreadsheets or flat files. Once
sources have been identified, we need to analyze the data for
quality (e.g., missing values and inconsistent values) and prepare
it through data cleansing and other transformations. We can then
assess, for example through correlations between combinations of
attributes, whether the data is likely to contain any useful patterns
or knowledge; then the process of data mining begins.
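To make the quality assessment concrete, the following is a minimal sketch
of two such checks: counting missing values per attribute and computing
the correlation between each pair of attributes. The sample records, the
attribute names, and the helper functions are hypothetical and chosen only
for illustration; a real project would typically rely on a data mining
tool or library for this step.

```python
from itertools import combinations
from math import sqrt
from statistics import mean

# Hypothetical sample records; None marks a missing value.
records = [
    {"age": 34, "income": 52000, "balance": 1200},
    {"age": 45, "income": 64000, "balance": None},
    {"age": 29, "income": 48000, "balance": 900},
    {"age": 52, "income": None, "balance": 3100},
    {"age": 41, "income": 61000, "balance": 2500},
]


def missing_counts(rows):
    """Count the missing (None) values for each attribute."""
    counts = {}
    for row in rows:
        for name, value in row.items():
            counts[name] = counts.get(name, 0) + (value is None)
    return counts


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences.

    Assumes neither sequence is constant (otherwise the denominator is zero).
    """
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def pairwise_correlations(rows):
    """Correlate every pair of attributes, skipping rows with missing values."""
    names = list(rows[0])
    result = {}
    for a, b in combinations(names, 2):
        pairs = [(r[a], r[b]) for r in rows
                 if r[a] is not None and r[b] is not None]
        xs, ys = zip(*pairs)
        result[(a, b)] = pearson(xs, ys)
    return result


print(missing_counts(records))
for pair, r in pairwise_correlations(records).items():
    print(pair, round(r, 3))
```

Attributes with many missing values are candidates for cleansing, and
strongly correlated pairs hint that the data may contain patterns worth
mining. Neither check is conclusive on its own.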
First, miners dig a tunnel into the solid rock. During the 1930s, miners
working for the companies dug these tunnels by hand, a very labour-intensive
undertaking. Miners often risked their health, digging with picks and shovels during
long shifts in these dark, damp tunnels, building the shafts and carting out the ore.
Data miners have it a little easier. However, in the early days of
data mining, statisticians applied various combinations of univari-
ate (single attribute) and multivariate (multiple attributes) statistics.
They also hand-coded algorithms, such as linear regression, to fit a
line to a set of data points. Visualization was often crude, sometimes
relying only on numerical output. Due to hardware and software
limitations, the number of attributes and cases mined was
often relatively small, perhaps tens of attributes. Producing useful
models could take weeks or months using complex analysis. Get-
ting the results of mining into the hands of business people, or into
operational systems, often required teams of people to process the
results, produce high-level reports, and include models in opera-
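As an illustration of the kind of algorithm that was once coded by hand,
here is a minimal sketch of simple linear regression, fitting a line to a
set of data points by ordinary least squares. The function name and the
sample data are assumptions made for this example and do not come from
any particular tool.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares.

    Assumes xs and ys have the same length and xs is not constant.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sum of squared deviations of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data points that lie exactly on the line y = 2x + 1.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]

slope, intercept = fit_line(xs, ys)
print(f"y = {slope:.2f} * x + {intercept:.2f}")
```

Even this small routine shows why hand coding scaled poorly: each new
algorithm, attribute type, or data source meant more code to write and
verify.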
Today, there are commercial tools with standard and state-of-
the-art algorithms that can automate much of the data mining