Fig. 8.4 Data correlations and distributions
training, are unavoidable. Hence, we relax the accuracy requirement to allow for
classification error. The second step is to use the training samples and the
classification threshold to train and build a classifier. Figure 8.5 shows an example
of a tail region for two inputs. The rest of the input space is called the body region,
corresponding to the body of the output distribution. Among the 1,000 samples, the
red points are the tail points and the blue points are the body points.
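The second step can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure used to produce Figure 8.5: the function simulate, the Gaussian inputs, the 99% tail threshold, the relaxed 97% classification threshold, and the nearest-centroid rule are all hypothetical stand-ins, since the text does not specify the simulator, the thresholds, or the type of classifier.

```python
import random

def simulate(x):
    # Hypothetical stand-in for the expensive simulation (assumption):
    # the output grows with both inputs.
    return x[0] + x[1]

def percentile_threshold(values, q):
    # Value found at quantile q (0 < q < 1) of the sorted values.
    ordered = sorted(values)
    return ordered[int(q * len(ordered))]

def centroid(points):
    # Component-wise mean of a list of equal-length tuples.
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

def train_classifier(samples, labels):
    # Nearest-centroid rule (a simple stand-in classifier, assumption):
    # a point is predicted "tail" if it lies closer to the tail centroid.
    tail_c = centroid([s for s, t in zip(samples, labels) if t])
    body_c = centroid([s for s, t in zip(samples, labels) if not t])

    def dist2(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))

    return lambda x: dist2(x, tail_c) < dist2(x, body_c)

random.seed(0)
train = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(1000)]
outputs = [simulate(x) for x in train]

tail_threshold = percentile_threshold(outputs, 0.99)   # assumed tail definition
class_threshold = percentile_threshold(outputs, 0.97)  # relaxed threshold (assumed)

# Label with the relaxed threshold so that points near the tail boundary
# are also treated as tail, which leaves room for classification error.
labels = [y >= class_threshold for y in outputs]
is_tail = train_classifier(train, labels)
```

Labeling with the relaxed threshold means that every true tail point is also labeled as tail for training, at the price of letting some body points through.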
In the third step, 10,000 samples of synthetic data were generated. The synthetic
sample points are classified using the classifier that was built in the previous step,
and the non-tail points are blocked. Figure 8.6 shows the classified synthetic data.
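The blocking step itself amounts to a filter over the synthetic samples. The sketch below assumes a classifier is already available as a predicate; here is_tail is a hypothetical fixed linear boundary standing in for the trained classifier, and the Gaussian synthetic inputs are likewise an assumption made only for illustration.

```python
import random

def blockade(samples, is_tail):
    # Keep only the points the classifier marks as tail; the rest are blocked.
    return [x for x in samples if is_tail(x)]

def is_tail(x):
    # Hypothetical stand-in for the trained classifier (assumption):
    # a fixed linear boundary in the two-input space.
    return x[0] + x[1] > 3.0

random.seed(1)
synthetic = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(10_000)]

candidates = blockade(synthetic, is_tail)
blocked = len(synthetic) - len(candidates)
print(f"kept {len(candidates)} of {len(synthetic)} points, blocked {blocked}")
```

Only the points in candidates would be passed on to the detailed analysis.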
As can be seen, the classifier clearly marks the tail data, and the data corresponding
to tail samples can be easily extracted. One could then build a model for the tail data
from the extracted data and use it for further analysis. The reduction in time spent
is large because we are studying only the rare events, which by definition constitute
a very small percentage of the total synthetic sample size. As explained, the basic
idea is to use a classifier to distinguish the tail and body regions, and to block out
the body points. For any point in the input space generated from the synthetic data,
the classifier can predict its membership in either the body or the tail class. Only
the tail points are then carefully analyzed, assuming that the purpose is to evaluate
the rare events.
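The time saving can be made explicit with a rough cost model. The symbols here are assumptions introduced only for this estimate: N is the number of synthetic samples, f is the fraction of them that the classifier lets through, c_a is the cost of one detailed analysis, and c_c is the cost of one classification.

```latex
T_{\text{full}} = N\,c_a, \qquad
T_{\text{blocked}} = N\,c_c + f\,N\,c_a, \qquad
\frac{T_{\text{full}}}{T_{\text{blocked}}} = \frac{1}{f + c_c / c_a}
```

When classification is much cheaper than detailed analysis, so that c_c / c_a is negligible, the speedup approaches 1/f. For a purely illustrative value of f = 0.03, this would be roughly a 33-fold reduction.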