What Does Data Mining Require from IT?
This section explores some of the information technology
components involved in data mining:
• Computing hardware: Data mining often requires CPU-intensive
computation to build models from, and apply them to, large
volumes of data. The amount and type of computation are often
governed by the type of algorithm. For example, neural
networks require many floating-point operations, whereas
naïve Bayes algorithms rely more on co-occurrence counts.
This computing hardware can range from off-the-shelf PCs
to state-of-the-art high-performance multiprocessor servers.
The servers on which DME implementations run are called
modeling servers.
• Data storage hardware: Data mining obviously needs data.
Because the data is generally already present in the
organization, we focus here on the additional storage that
data mining requires and on how data mining affects the
amount of data stored.
• Database software: Database software can be viewed as
supporting not only data storage and access requirements,
but also data mining capabilities.
Independent-server mining, where data mining occurs in a
separate process, often relies on databases for data storage.
Increasingly, data mining programmers are leveraging the
analytical capabilities in databases to avoid data
movement. Modern relational database management
systems (RDBMSs) provide data mining integrated with
the database kernel, which we refer to as in-database
mining. Some vendors may provide Java Data Mining
(JDM) implementations on top of these in-database mining
capabilities. Independent-server mining systems can also
expose or use these in-database mining capabilities
through their established user interfaces.
• Data access: With independent-server data mining, the
modeling server must access data contained in databases,
which can add traffic to the network between the modeling
server and the databases. This load does not arise with
in-database DMEs, where the database software runs on
the same computing hardware as the DME. This may be
less true in the case of hardware clusters where the large
data exchange needed to build models will require careful