because, as we discussed, they send it on to an ops team to assess it
independently. So in actuality they have a pretty complicated set of
labels: a charge can be initially rejected but later judged to be OK,
initially accepted but on further review suspected to have been bad,
confirmed to have been bad, confirmed to have been OK, and the list
goes on.
Technically we would call this a semi-supervised learning problem,
straddling the worlds of supervised and unsupervised learning. But
it's useful to note that the "label churn" settles down after a few months,
once the vast majority of chargebacks have been received, so they
could treat the problem as strictly supervised learning if they go far
enough back in time. So while they can't trust the labels on recent data,
for the purpose of this discussion, Ian will describe the easier case of
solving the supervised part.
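To make that idea concrete, here is a minimal sketch of restricting the training set to transactions whose labels have had time to stabilize. The dataset, column names, and 90-day maturation window are assumptions for illustration, not details from Ian's talk:

```python
import pandas as pd

# Assumed label-maturation window: after ~90 days, the vast majority of
# chargebacks have arrived and the labels have mostly stopped churning.
LABEL_MATURATION = pd.Timedelta(days=90)

def mature_training_set(transactions: pd.DataFrame,
                        as_of: pd.Timestamp) -> pd.DataFrame:
    """Keep only transactions whose labels can be trusted.

    `transactions` is assumed to have a `charge_time` column and a
    `label` column holding the current fraud/OK assessment.
    """
    cutoff = as_of - LABEL_MATURATION
    # Recent rows are dropped: their labels may still flip as chargebacks
    # and ops-team reviews come in, so we treat them as unlabeled.
    return transactions[transactions["charge_time"] <= cutoff]
```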
Now that we've set the stage for the problem, Ian moved on to
describing the supervised learning recipe as typically taught in school:
• Get data.
• Derive features.
• Train model.
• Estimate performance.
• Publish model!
But transferring this recipe to the real-world setting is not so simple.
In fact, it's not even clear that the order is correct. Ian advocates
thinking about the objective first and foremost, which means bringing
performance estimation to the top of the list.
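As a rough sketch of what that reordering looks like in code, the evaluation step is pinned down before any model is trained. The metric choice, data split, and classifier here are placeholders, not Square's actual pipeline:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# 1. Decide how performance will be estimated *before* modeling.
#    Here we (arbitrarily) pick precision and recall on a held-out set.
def evaluate(model, X_test, y_test):
    preds = model.predict(X_test)
    return {"precision": precision_score(y_test, preds),
            "recall": recall_score(y_test, preds)}

# 2. Get data and derive features (placeholder: X is a feature matrix,
#    y holds the fraud/OK labels from the mature portion of the data).
def run_recipe(X, y):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=0)

    # 3. Train a model (any classifier could stand in here).
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

    # 4. Estimate performance with the metric chosen up front,
    #    and only then decide whether to publish the model.
    return model, evaluate(model, X_test, y_test)
```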
The Trouble with Performance Estimation
So let's do that: focus on performance estimation. Right away Ian
identifies three areas where we can run into trouble.
Defining the error metric
How do we measure whether our learning problem is being modeled
well? Let's remind ourselves of the various possibilities using the truth
table in Table 9-1.
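As a concrete reminder of what that truth table contains, here is a small sketch that tallies the four cells of a binary classifier's confusion matrix and a few error metrics commonly derived from them; the labels and predictions are made-up placeholders:

```python
from sklearn.metrics import confusion_matrix

# Placeholder labels: 1 = fraudulent charge, 0 = OK charge.
y_true = [1, 0, 0, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 0, 0]

# The truth table: true negatives, false positives, false negatives,
# true positives.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)   # of charges flagged as fraud, how many were fraud
recall = tp / (tp + fn)      # of actual fraud, how much did we catch
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print(f"precision={precision:.2f} recall={recall:.2f} accuracy={accuracy:.2f}")
```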