artificially constrained. After all, it's harder to find outliers when you don't
store data profiling the attributes of those outliers. That said, having a larger
set of data and attributes is only useful if you have the compute capacity that's required to churn through it all and find the signals that are buried within the noise. It's also critical to load and process data quickly
enough to trap fast-moving events.
Fraud cases traditionally involve the use of samples and models to identify customers who exhibit a certain kind of profile. Although it works, the problem with this approach (and this is a trend that you're going to see in a lot of these use cases) is that you're profiling a segment rather than working at the level of the individual transaction or person. Making a forecast based on a segment is good, but making a decision that's based on the actual particulars of an individual, correlated with their transactions, is obviously better. To do this, you need to work with a larger set of data than is possible with traditional approaches.
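To make the contrast concrete, here's a minimal sketch of what individual-level scoring can look like: a transaction is judged against that customer's own history rather than a segment profile. This isn't the IBM platform's method or any particular product's API; the field values, the z-score statistic, and the flagging threshold are all illustrative assumptions.

    # A minimal sketch (illustrative only): score a new transaction against
    # this customer's own spending history instead of a segment average.
    from statistics import mean, stdev

    def transaction_outlier_score(history_amounts, new_amount):
        """Z-score of the new amount relative to the customer's history."""
        if len(history_amounts) < 2:
            return 0.0                      # not enough history to judge
        mu = mean(history_amounts)
        sigma = stdev(history_amounts) or 1.0
        return abs(new_amount - mu) / sigma

    # Example: a customer whose purchases cluster around $40 suddenly spends $900.
    history = [38.20, 41.75, 36.10, 44.00, 39.95]
    score = transaction_outlier_score(history, 900.00)
    if score > 3.0:                          # assumed threshold for review
        print(f"flag for review (z = {score:.1f})")

A segment model might never flag this purchase if $900 is unremarkable for the customer's demographic; scoring against the individual's own behavior catches it immediately, which is exactly why you want all of the available history on hand.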
We estimate that less than 50 percent (and usually much less than that) of the
available information that could be useful for fraud modeling is actually being
used. You might think that the solution would be to load that remaining data into your traditional analytic warehouse. The reasons why this isn't practical come up in most Big Data usage patterns, namely: the
data won't fit; it'll contain data types that the warehouse can't effectively use;
it'll most likely require disruptive schema changes; and it could very well
slow your existing workloads to a crawl.
If stuffing the rest of the data into existing warehouses isn't going to work,
then what will? We think that the core engines of the IBM Big Data platform
(BigInsights, Streams, and the analytics-based IBM PureData Systems) give
you the flexibility and agility to take your fraud models to the next level. BigIn-
sights addresses the concerns we outlined in the previous paragraph, because
it will scale to just about any volume and handle any data type required. Be-
cause it doesn't impose a schema on write, you'll have maximum flexibility
in how you organize your data, and your work won't impact existing work-
loads and other systems. Finally, you can start small with BigInsights and grow in a highly cost-effective manner (trust us when we say that your CIO will like this part).
Now that you have BigInsights to provide an elastic and cost-effective
repository for all of the available data, how do you go about finding those
outliers?