may have to address the data staging issue, or introduce alternative
hardware architectures (e.g., clusters or data segregation) to ensure
reasonable non-interference with daily operations.
While the build task uses the data access layer embedded within
the DME itself, there are other ways to perform the apply task and
perhaps the test task; we will look at these tasks separately. Besides the
architectural constraints, there is also the administration environment
Data Access for Model Building
As stated earlier, there are three types of DME architecture: in-database
DME and two different layouts of independent-server DME. The
in-database architecture does not require any data transfer since the
algorithms exist where the data reside. The independent-server
architectures require data transfer, and there are two possibilities
in this case: either (1) the DME implementation requires a copy of
the data in a temporary or proprietary format, which duplicates
the data and consumes additional disk space, or (2) the
DME does not require temporary or proprietary storage and
accesses the data directly from the repository. The second option
generates more data traffic but does not require additional disk
space, and it reduces data latency issues. However, it requires either
a large amount of RAM to hold the data or efficient mining techniques
that retrieve and process the data in manageable chunks.
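As an illustration of the second option, the following minimal sketch reads data directly from a repository in fixed-size chunks, so that only one chunk is held in memory at a time. It is not taken from any particular DME: an in-memory SQLite database stands in for the repository, and the names `chunked_mean`, `customers`, and `age` are hypothetical.

```python
import sqlite3


def chunked_mean(conn, table, column, chunk_size=1000):
    """Return the mean of one column, fetching at most chunk_size rows at a time.

    Hypothetical helper: table and column names are assumed to be trusted.
    """
    cur = conn.execute(f'SELECT "{column}" FROM "{table}"')
    total, count = 0.0, 0
    while True:
        rows = cur.fetchmany(chunk_size)  # one manageable chunk
        if not rows:
            break
        for (value,) in rows:
            if value is not None:  # skip missing values
                total += value
                count += 1
    return total / count if count else None


# Stand-in repository: an in-memory database with one numeric column.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (age REAL)")
conn.executemany(
    "INSERT INTO customers VALUES (?)",
    [(float(a),) for a in range(20, 70)],
)

mean_age = chunked_mean(conn, "customers", "age", chunk_size=8)
print(mean_age)
```

The same pattern applies to any statistic that can be accumulated incrementally; algorithms that need several passes over the data would repeat the chunked read, which is where the additional data traffic comes from.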
In most cases, the build dataset is smaller than the datasets
used as input for apply. This can be attributed to one of two reasons:
either (1) the data is known only for the population involved in an
experiment, which is generally kept small for cost reasons, or (2) robust
models can safely be built on a sample of the entire population,
resulting in shorter build times for similar model quality and
robustness. As a result, in many practical situations, data access for
the build phase is less demanding than for the apply phase.
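The text does not prescribe a sampling method for the second reason. As one possible sketch, reservoir sampling draws a fixed-size uniform sample in a single pass without loading the whole population into memory. The function name `reservoir_sample`, the population, and the sample size below are assumptions made for illustration.

```python
import random


def reservoir_sample(rows, k, seed=0):
    """Draw a uniform random sample of k rows from an iterable in one pass."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    sample = []
    for i, row in enumerate(rows):
        if i < k:
            sample.append(row)  # fill the reservoir first
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = row  # replace with probability k / (i + 1)
    return sample


# Hypothetical population of 100,000 record identifiers.
population = range(100_000)
build_sample = reservoir_sample(population, k=1_000)
print(len(build_sample))
```

Whether a sample of this size yields a model of similar quality depends on the data and the algorithm; the sketch only shows how the build dataset can be kept small.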
However, the architecture of the DME and the volume of data are not
the only points to take into account. As noted earlier, an IT
management policy that restricts the use of data in the operational
data environment may affect data access and force the staging of data
to isolate the production environment from the modeling environment.
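A staging step of this kind can be sketched as a chunked copy from the operational store into a separate staging store, after which model building reads only the staged copy. This is a simplified illustration under assumptions: two in-memory SQLite databases stand in for the two environments, and `stage_table` and `orders` are hypothetical names.

```python
import sqlite3


def stage_table(operational, staging, table, chunk_size=500):
    """Copy one table from the operational store to the staging store.

    Hypothetical helper: column types are not preserved, and the table
    name is assumed to be trusted. Returns the number of rows copied.
    """
    cur = operational.execute(f'SELECT * FROM "{table}"')
    columns = [d[0] for d in cur.description]
    col_list = ", ".join(f'"{c}"' for c in columns)
    placeholders = ", ".join("?" for _ in columns)
    staging.execute(f'CREATE TABLE "{table}" ({col_list})')
    copied = 0
    while True:
        rows = cur.fetchmany(chunk_size)  # limit memory use per round trip
        if not rows:
            break
        staging.executemany(
            f'INSERT INTO "{table}" VALUES ({placeholders})', rows
        )
        copied += len(rows)
    staging.commit()
    return copied


# Stand-ins for the production and modeling environments.
operational = sqlite3.connect(":memory:")
operational.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
operational.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(i, i * 1.5) for i in range(1200)],
)
staging = sqlite3.connect(":memory:")

staged = stage_table(operational, staging, "orders")
print(staged)
```

The cost of this isolation is the duplication of data and the delay between the operational data and its staged copy.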