We distinguish among these DME architectures because the first
and second cases do not require extra data storage. The third case,
which stages data, does require additional disk space because the
data is replicated for the purposes of mining.¹ So the question is,
how much?
The correct answer is, “It depends….” Consider an example
involving customer relationship management (CRM). Today, it is not
uncommon to find build datasets with between 50 and 5,000
attributes, the median value being around 200; and the number of
cases is often between 10,000 and 1,000,000, the median value being
around 200,000. These are only the build datasets; the volumes
for apply datasets are much larger. As cited earlier, a large
organization in the telecommunications or banking industry could
have a customer database covering 100 million customers. Corporate
architectures that stage data therefore need large disk capacity to
store not only the data to be scored, but also the apply results.
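To get a feel for the gap between build and apply volumes, the figures above can be turned into a rough estimate. This is only a sketch: the class and method names are hypothetical, and the 20 bytes per attribute value and the 200 attributes in the apply dataset are assumptions made for illustration, not figures given for the CRM example.

```java
// Rough staging-size estimate for single-record-per-case datasets.
// Assumption: every attribute value occupies a fixed number of bytes.
final class StagingEstimate {

    /** Bytes needed to stage `cases` rows of `attributes` values each. */
    static long stagedBytes(long cases, int attributes, int bytesPerValue) {
        // multiplyExact throws ArithmeticException instead of silently overflowing
        long values = Math.multiplyExact(cases, (long) attributes);
        return Math.multiplyExact(values, (long) bytesPerValue);
    }

    public static void main(String[] args) {
        int bytesPerValue = 20; // assumed average width of one value

        // Median build dataset from the text: 200,000 cases, 200 attributes.
        long build = stagedBytes(200_000L, 200, bytesPerValue);

        // Apply dataset: 100 million customers, assumed to carry the same 200 attributes.
        long apply = stagedBytes(100_000_000L, 200, bytesPerValue);

        System.out.println("build bytes: " + build);
        System.out.println("apply bytes: " + apply);
        System.out.println("apply/build ratio: " + (apply / build));
    }
}
```

Under these assumptions the median build dataset needs about 0.8 gigabytes, while the apply dataset needs about 400 gigabytes, before counting the apply results themselves.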
The required space also depends on any specific DME requirements
for dataset format. For example, a specific association rule
algorithm could require the data in a multirecord case
representation. In this situation, a single-record representation
of a million records with 100 attributes of 20 bytes each requires
2 gigabytes of space, but the corresponding multirecord case
representation requires 7 gigabytes, assuming the case ID is
20 bytes and the attribute name is 30 bytes: each of the 100 million
rows then holds a case ID, an attribute name, and a value, or
70 bytes in all.
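The arithmetic behind these two figures can be checked with a short sketch. The class and method names are hypothetical; the byte widths are the ones stated in the example, and, as in the example, the case ID is not counted in the single-record layout.

```java
// Compares storage for single-record and multirecord case representations.
final class CaseLayoutEstimate {

    /** One row per case: `attributes` values of `valueBytes` each. */
    static long singleRecordBytes(long cases, int attributes, int valueBytes) {
        return cases * attributes * valueBytes;
    }

    /** One row per (case, attribute) pair: case ID + attribute name + value. */
    static long multiRecordBytes(long cases, int attributes,
                                 int caseIdBytes, int nameBytes, int valueBytes) {
        long rows = cases * attributes;              // 100 million rows in the example
        long bytesPerRow = caseIdBytes + nameBytes + valueBytes;
        return rows * bytesPerRow;
    }

    public static void main(String[] args) {
        long cases = 1_000_000L;
        int attributes = 100;

        long single = singleRecordBytes(cases, attributes, 20);
        long multi = multiRecordBytes(cases, attributes, 20, 30, 20);

        System.out.println("single-record bytes: " + single); // 2,000,000,000
        System.out.println("multirecord bytes:   " + multi);  // 7,000,000,000
    }
}
```

The multirecord layout is 3.5 times larger here because the case ID and attribute name are repeated on every row.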
Modeling in the large also requires saving the models, the build
settings, and all the objects declared as persistent by the DME
implementation. In most cases, a model should be (much) smaller than
its corresponding build dataset, because a model should extract
knowledge from and find relationships in data. This is especially
true of the model details (i.e., the minimum information needed
to apply a model), such as the tree nodes of a decision tree or the
coefficients of a linear regression. Other models, such as a support
vector machine (SVM), can take relatively more storage
because they maintain the support vectors; similarly, an association
rules model can contain hundreds of thousands of rules. In addition,
some implementations keep not only the minimum model
¹ Note that when considering a tool's performance, the time required to export
data from the database and import results back to the database must be
included in the overall model build and apply time.