click  url_1  url_2  url_3  url_4  url_5
  1      0      0      0      1      0
  1      0      1      1      0      1
  0      1      0      0      1      0
  1      0      0      0      0      0
  1      1      0      1      0      1
Call this matrix “train,” and then the command line in R would be:
fit <- glm(click ~ url_1 + url_2 + url_3 + url_4 + url_5,
           data = train, family = binomial(link = "logit"))
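The book's example uses R's glm. As a rough aside, the same fit can be sketched in Python without any modeling library, using batch gradient ascent on the Bernoulli log-likelihood over the toy matrix above (learning rate and epoch count are arbitrary choices for this illustration):

```python
import math

# Toy "train" matrix from the text: each row is (click, [url_1..url_5]).
train = [
    (1, [0, 0, 0, 1, 0]),
    (1, [0, 1, 1, 0, 1]),
    (0, [1, 0, 0, 1, 0]),
    (1, [0, 0, 0, 0, 0]),
    (1, [1, 0, 1, 0, 1]),
]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(data, lr=0.5, epochs=2000):
    """Fit logistic regression by batch gradient ascent on the log-likelihood."""
    n_features = len(data[0][1])
    w = [0.0] * n_features  # coefficients for url_1..url_5
    b = 0.0                 # intercept
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for y, x in data:
            p = sigmoid(b + sum(wi * xi for wi, xi in zip(w, x)))
            err = y - p  # gradient of the Bernoulli log-likelihood w.r.t. the logit
            grad_b += err
            for j, xj in enumerate(x):
                grad_w[j] += err * xj
        b += lr * grad_b
        w = [wj + lr * gj for wj, gj in zip(w, grad_w)]
    return w, b

w, b = fit_logistic(train)
for y, x in train:
    p = sigmoid(b + sum(wi * xi for wi, xi in zip(w, x)))
    print(y, round(p, 3))
```

The fitted model assigns each row an estimated click probability; on this tiny separable example the predictions move toward the observed labels.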
Evaluation
Let's go back to the big picture from earlier in the chapter where we told you that you have many choices you need to make when confronted with a classification problem. One of the choices is how you're going to evaluate your model. We discussed this already in Chapter 3 with respect to linear regression and k-NN, as well as in the previous chapter with respect to Naive Bayes. We generally use different evaluation metrics for different kinds of models, and in different contexts. Even logistic regression can be applied in multiple contexts, and depending on the context, you may want to evaluate it in different ways.
First, consider the context of using logistic regression as a ranking
model—meaning you are trying to determine the order in which you
show ads or items to a user based on the probability they would click.
You could use logistic regression to estimate probabilities, and then
rank-order the ads or items in decreasing order of likelihood to click
based on your model. If you wanted to know how good your model
was at discovering relative rank (notice that in this case, you couldn't care less
about the absolute scores), you'd look to one of:
Area under the receiver operating characteristic curve (AUC)
In signal detection theory, a receiver operating characteristic
curve, or ROC curve, is defined as a plot of the true positive rate
against the false positive rate for a binary classification problem
as you change a threshold. In particular, if you took your training
set and ranked the items according to their probabilities, and varied
the threshold (from ∞ to −∞) that determined whether to
classify the item as 1 or 0 , and kept plotting the true positive rate
versus the false positive rate, you'd get the ROC curve. The area
under that curve, referred to as the AUC, is a way to measure the
success of a classifier or to compare two classifiers. Here's a nice
paper on it by Tom Fawcett, "An Introduction to ROC Analysis."
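The threshold sweep just described can be coded up directly. Here's a minimal Python sketch (the labels and scores below are made-up demonstration values, not data from the text): it records one (FPR, TPR) point per threshold and integrates the curve with the trapezoid rule.

```python
def roc_auc(labels, scores):
    """Compute ROC points and AUC by sweeping a decision threshold.

    labels: 0/1 true classes; scores: model-estimated probabilities.
    """
    # Sweep from +inf (nothing classified 1) down through each observed score.
    thresholds = [float("inf")] + sorted(set(scores), reverse=True)
    pos = sum(labels)
    neg = len(labels) - pos
    points = []
    for t in thresholds:
        tp = sum(1 for y, s in zip(labels, scores) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= t and y == 0)
        points.append((fp / neg, tp / pos))  # (false positive rate, true positive rate)
    # Trapezoidal area under the (FPR, TPR) curve.
    auc = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        auc += (x1 - x0) * (y0 + y1) / 2
    return points, auc

labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
points, auc = roc_auc(labels, scores)
print(points, auc)
```

Note that the AUC depends only on how the scores order the items, not on their absolute values, which is exactly why it suits the ranking context above.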