click  url_1  url_2  url_3  url_4  url_5
  1      0      0      0      1      0
  1      0      1      1      0      1
  0      1      0      0      1      0
  1      0      0      0      0      0
  1      1      0      1      0      1
Call this matrix “train,” and then the command line in R would be:
fit <- glm(click ~ url_1 + url_2 + url_3 + url_4 + url_5,
           data = train, family = binomial(link = "logit"))
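The book's example uses R's glm. As a rough aside, the same fit can be sketched in Python without any modeling library, using batch gradient ascent on the Bernoulli log-likelihood over the toy matrix above (learning rate and epoch count are arbitrary choices for this illustration):

```python
import math

# Toy "train" matrix from the text: each row is (click, [url_1..url_5]).
train = [
    (1, [0, 0, 0, 1, 0]),
    (1, [0, 1, 1, 0, 1]),
    (0, [1, 0, 0, 1, 0]),
    (1, [0, 0, 0, 0, 0]),
    (1, [1, 0, 1, 0, 1]),
]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(data, lr=0.5, epochs=2000):
    """Fit logistic regression by batch gradient ascent on the log-likelihood."""
    n_features = len(data[0][1])
    w = [0.0] * n_features  # coefficients for url_1..url_5
    b = 0.0                 # intercept
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for y, x in data:
            p = sigmoid(b + sum(wi * xi for wi, xi in zip(w, x)))
            err = y - p  # gradient of the Bernoulli log-likelihood w.r.t. the logit
            grad_b += err
            for j, xj in enumerate(x):
                grad_w[j] += err * xj
        b += lr * grad_b
        w = [wj + lr * gj for wj, gj in zip(w, grad_w)]
    return w, b

w, b = fit_logistic(train)
for y, x in train:
    p = sigmoid(b + sum(wi * xi for wi, xi in zip(w, x)))
    print(y, round(p, 3))
```

The fitted model assigns each row an estimated click probability; on this tiny separable example the predictions move toward the observed labels.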
Evaluation
Let's go back to the big picture from earlier in the chapter where we told you that you have many choices you need to make when confronted with a classification problem. One of the choices is how you're going to evaluate your model. We discussed this already in Chapter 3 with respect to linear regression and k-NN, as well as in the previous chapter with respect to Naive Bayes. We generally use different evaluation metrics for different kinds of models, and in different contexts. Even logistic regression can be applied in multiple contexts, and depending on the context, you may want to evaluate it in different ways.
First, consider the context of using logistic regression as a ranking
model—meaning you are trying to determine the order in which you
show ads or items to a user based on the probability they would click.
You could use logistic regression to estimate probabilities, and then
rank-order the ads or items in decreasing order of likelihood to click
based on your model. If you wanted to know how good your model
was at discovering relative rank (notice that in this case, you couldn't care less
about the absolute scores), you'd look to one of:
Area under the receiver operating characteristic curve (AUC)
In signal detection theory, a receiver operating characteristic
curve, or ROC curve, is defined as a plot of the true positive rate
against the false positive rate for a binary classification problem
as you change a threshold. In particular, if you took your training
set and ranked the items according to their probabilities, and varied
the threshold (from ∞ to −∞) that determined whether to
classify the item as 1 or 0 , and kept plotting the true positive rate
versus the false positive rate, you'd get the ROC curve. The area
under that curve, referred to as the AUC, is a way to measure the
success of a classifier or to compare two classifiers. Here's a nice
paper on it by Tom Fawcett, "An Introduction to ROC Analysis."
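The threshold sweep just described can be coded up directly. Here's a minimal Python sketch (the labels and scores below are made-up demonstration values, not data from the text): it records one (FPR, TPR) point per threshold and integrates the curve with the trapezoid rule.

```python
def roc_auc(labels, scores):
    """Compute ROC points and AUC by sweeping a decision threshold.

    labels: 0/1 true classes; scores: model-estimated probabilities.
    """
    # Sweep from +inf (nothing classified 1) down through each observed score.
    thresholds = [float("inf")] + sorted(set(scores), reverse=True)
    pos = sum(labels)
    neg = len(labels) - pos
    points = []
    for t in thresholds:
        tp = sum(1 for y, s in zip(labels, scores) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= t and y == 0)
        points.append((fp / neg, tp / pos))  # (false positive rate, true positive rate)
    # Trapezoidal area under the (FPR, TPR) curve.
    auc = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        auc += (x1 - x0) * (y0 + y1) / 2
    return points, auc

labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
points, auc = roc_auc(labels, scores)
print(points, auc)
```

Note that the AUC depends only on how the scores order the items, not on their absolute values, which is exactly why it suits the ranking context above.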