the creative process of data mining involves time that dwarfs the
individual transformations, analyses, or model building. For mining
large datasets, there is a tradeoff between the data miner seeing
results quickly versus the cost of hardware. This is clearly a business
and resource issue.
In cases where a data mining tool automates much of the data
mining process through trying various algorithms or settings
automatically, more computing hardware can certainly improve
throughput of mining results. Without the use of automatic modeling, it is
not uncommon for the analysis and building of new models to take
several weeks; buying a machine twice as big will not reduce the time
it takes to design, build, and test models, because this time mainly
involves human intervention. Automatic modeling can make use of
the additional hardware such that the time to produce these models
is generally in the range of hours.
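The idea of automatic modeling described above can be sketched as a search over settings that is spread across workers. This is only an illustration under stated assumptions: `build_and_score`, its toy scoring rule, and the setting names `depth` and `min_cases` are hypothetical stand-ins, not part of any particular data mining engine. Threads are used to keep the sketch simple; a real CPU-bound model build would more likely rely on separate processes or on the engine's own parallelism.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def build_and_score(settings):
    """Hypothetical stand-in for building and testing one model.

    Returns a made-up score so the sketch runs without a real mining engine.
    """
    depth, min_cases = settings
    # Toy scoring rule (an assumption): best at depth 4 and min_cases 20.
    return -abs(depth - 4) - abs(min_cases - 20) / 10.0


def auto_model(depths, min_cases_options, workers):
    """Try every combination of settings and return (best settings, score)."""
    candidates = list(product(depths, min_cases_options))
    # More workers let more candidate models be built at the same time.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        scores = list(pool.map(build_and_score, candidates))
    best = max(range(len(candidates)), key=lambda i: scores[i])
    return candidates[best], scores[best]


best_settings, best_score = auto_model([2, 4, 8], [10, 20, 50], workers=4)
print(best_settings, best_score)
```

Because each candidate is independent, the number of workers is the knob that additional hardware turns; the human effort of designing the search is unchanged.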
Another factor to consider regarding hardware to mine larger
datasets is the scalability of the particular algorithm (usually defined
in terms of the number of attributes and number of cases used in the
build data, but may also include attribute cardinality). Commenting
on the scalability of specific algorithms is beyond the scope of this
topic; however, users of data mining software can request scalability
figures from DME vendors. For example, if an algorithm scales
as order n^2, where n is the number of cases, simply doubling the
number of CPUs or using a CPU with double the speed will not
provide the same performance when the dataset contains twice as
many cases. This is especially true when considering the number of
attributes because the performance of most classical algorithms is
adversely affected by a larger number of attributes.
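The quadratic example above can be checked with a little arithmetic. The sketch below assumes build time is proportional to n raised to a fixed exponent and, optimistically, that extra hardware gives a perfect speedup; the function name and parameters are illustrative only.

```python
def relative_build_time(case_factor, speed_factor, exponent=2):
    """Build time relative to the baseline for an order n**exponent algorithm.

    case_factor:  how many times larger the dataset is (2 = twice the cases)
    speed_factor: how many times more compute is available (2 = double),
                  optimistically assuming perfect speedup
    """
    return case_factor ** exponent / speed_factor


# Twice the cases on twice the hardware: the build still takes twice as long.
print(relative_build_time(2, 2))              # 2.0
# Four times the hardware is needed to hold build time constant.
print(relative_build_time(2, 4))              # 1.0
# A linear algorithm, by contrast, breaks even with double the hardware.
print(relative_build_time(2, 2, exponent=1))  # 1.0
```

Under these assumptions, hardware has to grow with the square of the data growth just to keep the build time level.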
Impacts on Data Storage Hardware
Data storage and its associated costs are a normal part of virtually all
businesses, and data volumes are growing. It is common to see Fortune
500 companies with terabytes of data, some adding terabytes
per month. When considering data mining, data storage costs also
include the storage of data mining models, intermediate datasets
(especially if they are materialized as physical tables rather than
views), settings, test results, and apply results.
When considering the storage of data mining models, each algorithm
typically has different model storage requirements. For example,
a decision tree typically has a very compact representation consisting of