Overview of Data Mining - Java Data Mining: Strategy, Standard, and Practice


require paying more attention to the data being collected, using the right tools or tools that have more automated intelligence for the mining process, or hiring more skilled or experienced individuals.

Before hard rock mining operations have even begun, companies explore areas where gold may be found and scientifically analyse the rock. The actual gold originates deep within the earth in places called pockets. These pockets are filled with gold, heavy ore, and quartz. If enough gold is discovered in the ore, the technological process of hard rock mining begins.

As we discuss later in Chapter 3, the data mining process begins with a clear understanding of the business objectives and data mining goals. As in gold mining, we then need to survey the corporate landscape for available data. Sometimes the needed data is readily available in repositories such as data warehouses and data marts. Other times, data resides in various databases that support operational systems. In less sophisticated organizations, data may reside in Excel spreadsheets or flat files. Once sources have been identified, we need to analyze the data for quality (e.g., missing values and consistency of values) and prepare it through data cleansing and other transformations. Correlations between combinations of attributes can then be assessed to judge whether the data is likely to contain any useful patterns or knowledge; then the process of data mining begins.
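
The quality and correlation checks described above can be sketched in a few lines of Java. This is only an illustration under stated assumptions, not the book's method or any part of the JDM API: the class name DataSurvey, the choice to encode missing values as NaN, and the sample columns are all hypothetical.

```java
public class DataSurvey {

    /** Fraction of entries in a column that are missing (assumed encoded as NaN). */
    public static double missingFraction(double[] column) {
        if (column.length == 0) {
            return 0.0;
        }
        int missing = 0;
        for (double v : column) {
            if (Double.isNaN(v)) {
                missing++;
            }
        }
        return (double) missing / column.length;
    }

    /**
     * Pearson correlation between two attributes, using only the rows
     * where both values are present. Returns NaN if fewer than two
     * complete rows exist or if either attribute is constant.
     */
    public static double pearson(double[] x, double[] y) {
        if (x.length != y.length) {
            throw new IllegalArgumentException("columns differ in length");
        }
        int n = 0;
        double sumX = 0.0, sumY = 0.0;
        for (int i = 0; i < x.length; i++) {
            if (Double.isNaN(x[i]) || Double.isNaN(y[i])) {
                continue; // skip rows with a missing value
            }
            n++;
            sumX += x[i];
            sumY += y[i];
        }
        if (n < 2) {
            return Double.NaN;
        }
        double meanX = sumX / n, meanY = sumY / n;
        double sxy = 0.0, sxx = 0.0, syy = 0.0;
        for (int i = 0; i < x.length; i++) {
            if (Double.isNaN(x[i]) || Double.isNaN(y[i])) {
                continue;
            }
            double dx = x[i] - meanX, dy = y[i] - meanY;
            sxy += dx * dy;
            sxx += dx * dx;
            syy += dy * dy;
        }
        if (sxx == 0.0 || syy == 0.0) {
            return Double.NaN; // a constant attribute has no defined correlation
        }
        return sxy / Math.sqrt(sxx * syy);
    }

    public static void main(String[] args) {
        // Hypothetical sample columns; one age value is missing.
        double[] age = {25, 32, Double.NaN, 47, 51};
        double[] income = {30, 41, 38, 60, 66};
        System.out.printf("missing(age) = %.2f%n", missingFraction(age));
        System.out.printf("corr(age, income) = %.3f%n", pearson(age, income));
    }
}
```

A high missing fraction flags an attribute for cleansing or exclusion, while a correlation near zero with every other attribute suggests, but does not prove, that it carries little linear signal.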

First, miners dig a tunnel into the solid rock. During the 1930s, miners working for the companies dug these tunnels by hand, a very labour-intensive undertaking. Miners often risked their health, digging with picks and shovels during long shifts in these dark, damp tunnels, building the shafts and carting out the ore.

Data miners have it a little easier. However, in the early days of data mining, statisticians applied various combinations of univariate (single attribute) and multivariate (multiple attributes) statistics. They also hand-coded algorithms, such as linear regression, to fit a line to a set of data points. Visualization was often crude, sometimes relying on only numerical outputs. Due to hardware and software limitations, the number of attributes and cases mined was often relatively small, perhaps tens of attributes. Producing useful models could take weeks or months using complex analysis. Getting the results of mining into the hands of business people, or into operational systems, often required teams of people to process the results, produce high-level reports, and include models in operational systems.
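
To give a sense of what hand-coding linear regression involves, here is a minimal sketch of an ordinary least-squares fit of a line to a set of points. It is an illustration only: the class name LineFit, the method fit, and the sample points are assumptions for this example, not code from the book or from any mining tool.

```java
public class LineFit {

    /**
     * Fits y = slope * x + intercept by ordinary least squares.
     * Returns a two-element array: {slope, intercept}.
     */
    public static double[] fit(double[] x, double[] y) {
        if (x.length != y.length || x.length < 2) {
            throw new IllegalArgumentException("need two or more paired points");
        }
        int n = x.length;
        double sumX = 0.0, sumY = 0.0;
        for (int i = 0; i < n; i++) {
            sumX += x[i];
            sumY += y[i];
        }
        double meanX = sumX / n, meanY = sumY / n;

        // slope = sum((x - meanX) * (y - meanY)) / sum((x - meanX)^2)
        double sxy = 0.0, sxx = 0.0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - meanX;
            sxy += dx * (y[i] - meanY);
            sxx += dx * dx;
        }
        if (sxx == 0.0) {
            throw new IllegalArgumentException("all x values are identical");
        }
        double slope = sxy / sxx;
        double intercept = meanY - slope * meanX;
        return new double[] {slope, intercept};
    }

    public static void main(String[] args) {
        // Hypothetical points lying exactly on y = 2x + 1.
        double[] x = {1, 2, 3, 4};
        double[] y = {3, 5, 7, 9};
        double[] line = fit(x, y);
        System.out.printf("slope = %.3f, intercept = %.3f%n", line[0], line[1]);
    }
}
```

Even this single-attribute case needs input validation and a guard against degenerate data, which hints at why multivariate models built by hand could take weeks.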

Today, there are commercial tools with standard and state-of-the-art algorithms that can automate much of the data mining
