Solving Problems in Industry - Java Data Mining: Strategy, Standard, and Practice

a nonresponder. It is from this trial, or sample, of the population that

we can also test model accuracy and obtain a lift chart.

Response modeling can be combined with a value prediction,

such as dollar amount of order, donation size, etc., to derive an

expected return on the campaign. A regression model can be built to

predict, for example, the amount each customer spends or each

alumnus donates. Multiplying this value by the probability that a

given customer will respond to the campaign produces an expected

value for that customer. Customers can be sorted not only by

likelihood to respond, but also by expected value, to identify the

customers likely to spend or donate the most.
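
The expected-value calculation described above can be sketched in a few lines of Java. This is a minimal illustration, not a JDM API example: the class and field names (ExpectedValueRanking, ScoredCustomer, responseProbability, predictedAmount) are hypothetical, and the two scores are assumed to come from a classification model and a regression model built beforehand.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ExpectedValueRanking {

    /** Hypothetical holder for the two model scores of one customer. */
    static final class ScoredCustomer {
        final String id;
        final double responseProbability; // assumed output of a response (classification) model
        final double predictedAmount;     // assumed output of a regression model

        ScoredCustomer(String id, double responseProbability, double predictedAmount) {
            this.id = id;
            this.responseProbability = responseProbability;
            this.predictedAmount = predictedAmount;
        }

        /** Expected value = probability of response times predicted amount. */
        double expectedValue() {
            return responseProbability * predictedAmount;
        }
    }

    /** Returns a new list sorted by expected value, highest first. */
    static List<ScoredCustomer> rankByExpectedValue(List<ScoredCustomer> customers) {
        List<ScoredCustomer> ranked = new ArrayList<>(customers);
        ranked.sort((a, b) -> Double.compare(b.expectedValue(), a.expectedValue()));
        return ranked;
    }

    public static void main(String[] args) {
        // Made-up scores, for illustration only.
        List<ScoredCustomer> customers = Arrays.asList(
                new ScoredCustomer("A", 0.10, 200.0),
                new ScoredCustomer("B", 0.40, 30.0),
                new ScoredCustomer("C", 0.05, 1000.0));
        for (ScoredCustomer c : rankByExpectedValue(customers)) {
            System.out.println(c.id + " " + c.expectedValue());
        }
    }
}
```

In this made-up data, customer B is the most likely to respond, yet customer C ranks first by expected value because the predicted amount is much larger.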

Another refinement of response modeling is to determine which

channel is best to approach these customers, for example, mail, e-mail,

or phone. Once again, based on historical data, we can learn the

pattern of customers who respond best to mail, e-mail, or phone.

2.1.4 Fraud Detection

Anywhere money is involved, the potential for fraud exists; all

industries are vulnerable to individuals who abuse established

procedures for personal gain, often illegally. Healthcare, financial

services, and taxation are just a few areas where fraud is found.

One approach to fraud detection involves clustering. The objective

is first to group the data into clusters. We can then review each of the

clusters to see if there is a concentration of known fraud in any one

cluster, indicating that fraud is more likely to occur within a given

cluster than another. In addition, we can look for cases that don't

match any of the known clusters particularly well, or at all. These

outliers become prime candidates for investigation.
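
The two checks described above, fraud concentration per cluster and poorly matching cases, can be sketched as follows. This is only an illustration under stated assumptions: the cluster centroids are assumed to have been produced already by some clustering algorithm, Euclidean distance stands in for whatever similarity measure that algorithm uses, and the names (ClusterFraudReview, fraudRatePerCluster, outliers, maxDistance) are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class ClusterFraudReview {

    /** Euclidean distance between two points of equal dimension. */
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    /** Index of the centroid closest to the given case. */
    static int nearestCluster(double[] point, double[][] centroids) {
        int best = 0;
        for (int c = 1; c < centroids.length; c++) {
            if (distance(point, centroids[c]) < distance(point, centroids[best])) {
                best = c;
            }
        }
        return best;
    }

    /** Fraction of cases in each cluster that are known fraud (0.0 for an empty cluster). */
    static double[] fraudRatePerCluster(double[][] cases, boolean[] knownFraud, double[][] centroids) {
        int[] total = new int[centroids.length];
        int[] fraud = new int[centroids.length];
        for (int i = 0; i < cases.length; i++) {
            int c = nearestCluster(cases[i], centroids);
            total[c]++;
            if (knownFraud[i]) {
                fraud[c]++;
            }
        }
        double[] rate = new double[centroids.length];
        for (int c = 0; c < rate.length; c++) {
            rate[c] = total[c] == 0 ? 0.0 : (double) fraud[c] / total[c];
        }
        return rate;
    }

    /** Indices of cases farther than maxDistance (an assumed cutoff) from every centroid. */
    static List<Integer> outliers(double[][] cases, double[][] centroids, double maxDistance) {
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < cases.length; i++) {
            int c = nearestCluster(cases[i], centroids);
            if (distance(cases[i], centroids[c]) > maxDistance) {
                result.add(i);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up two-attribute cases and centroids, for illustration only.
        double[][] centroids = {{0, 0}, {10, 10}};
        double[][] cases = {{0, 1}, {1, 0}, {9, 10}, {10, 11}, {10, 9}, {5, 30}};
        boolean[] knownFraud = {false, false, true, true, false, false};
        double[] rates = fraudRatePerCluster(cases, knownFraud, centroids);
        for (int c = 0; c < rates.length; c++) {
            System.out.println("cluster " + c + " fraud rate: " + rates[c]);
        }
        System.out.println("outlier case indices: " + outliers(cases, centroids, 5.0));
    }
}
```

A cluster with a noticeably higher fraud rate than the others deserves closer review, and the flagged indices are the cases that sit far from every centroid. The distance cutoff is a tuning choice that would depend on the data.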

A second approach to fraud detection involves classification.

We first identify examples of fraud manually in historical data.

With classification, the goal is to learn to distinguish between

fraudulent and nonfraudulent behavior. Consider a dataset

consisting of various predictor attributes, such as “age,” “income,”

and “wire transfer within last 10 days,” and a target attribute

indicating if the case was fraudulent or not. A classification algorithm such as a

decision tree or support vector machine can then predict the

likelihood of fraud on new data. Cases with a high probability of fraud

are then good candidates for investigation. However, we can also

predict the likelihood of fraud on the original data. This allows for

a comparison between actual target values and the predicted values.
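
One common way to make that comparison is to count how often the predictions agree and disagree with the known outcomes. The sketch below is an assumption-laden illustration rather than a prescribed method: the predicted probabilities are taken as given from some classification model, the 0.5 cutoff is arbitrary, and the names (FraudConfusionMatrix, confusionCounts) are hypothetical.

```java
public class FraudConfusionMatrix {

    /**
     * Compares actual target values with thresholded predictions.
     * Returns counts as {truePositives, falsePositives, trueNegatives, falseNegatives}.
     */
    static int[] confusionCounts(boolean[] actualFraud, double[] predictedProbability, double threshold) {
        int tp = 0, fp = 0, tn = 0, fn = 0;
        for (int i = 0; i < actualFraud.length; i++) {
            boolean predictedFraud = predictedProbability[i] >= threshold;
            if (predictedFraud && actualFraud[i]) {
                tp++;
            } else if (predictedFraud) {
                fp++;
            } else if (actualFraud[i]) {
                fn++;
            } else {
                tn++;
            }
        }
        return new int[] {tp, fp, tn, fn};
    }

    public static void main(String[] args) {
        // Made-up targets and model scores, for illustration only.
        boolean[] actual = {true, false, false, true, false};
        double[] probability = {0.9, 0.2, 0.7, 0.4, 0.1};
        int[] counts = confusionCounts(actual, probability, 0.5);
        System.out.println("true positives:  " + counts[0]);
        System.out.println("false positives: " + counts[1]);
        System.out.println("true negatives:  " + counts[2]);
        System.out.println("false negatives: " + counts[3]);
    }
}
```

Cases predicted as fraud but not labeled as fraud may be model errors, or they may be fraud that was missed when the historical data was labeled, so they can be worth a second look.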
