Will anybody buy? Logistic regression - Improving the User Experience through Practical Data Analytics

because these R-square values are not the same as the r² we are accustomed to from Chapters 9 and 10. In fact, they are usually referred to as “pseudo-R-square measures.”

The second section of the output in Figure 11.6 is called a Classification Table, which is often the most important and useful part of the output. It indicates how well the model predicts the correct category for each case. Put another way, it tells you how well the X (or multiple X's) predicts whether the Y is a 1 or a 0.

Specifically, the rows in the Classification Table tell us the actual number of 1's and 0's in the data (which, of course, we know already), while the columns tell us what the regression process predicts for each case.

In our example, there are eight actual observed 0's (the sum of the “7” and the “1” in the top row), and seven of them are predicted as 0's, so 87.5% (seven of eight) of the 0's are, indeed, predicted as 0's. There are four actual observed 1's (the sum of the “2” and “2” in the second row), but only two of them are predicted as 1's, so 50% are predicted correctly. Overall, as you see in the bottom row of the table, we correctly predict 75% (9 of 12) of the data points in this case.
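
The percentages in the table follow directly from the four cell counts. As a minimal sketch (the counts come from the text; the dictionary layout and the function name `percent_correct` are illustrative assumptions, not output from the software), the arithmetic can be reproduced like this:

```python
# Classification table from the example: outer key = observed category,
# inner key = predicted category. Counts are taken from the text.
table = {
    0: {0: 7, 1: 1},  # eight observed 0's: seven predicted 0, one predicted 1
    1: {0: 2, 1: 2},  # four observed 1's: two predicted 0, two predicted 1
}


def percent_correct(table):
    """Return (percent correct per observed category, overall percent correct)."""
    per_category = {}
    total_correct = 0
    total_cases = 0
    for observed, row in table.items():
        row_total = sum(row.values())
        correct = row[observed]  # cases where prediction matches observation
        per_category[observed] = 100 * correct / row_total
        total_correct += correct
        total_cases += row_total
    return per_category, 100 * total_correct / total_cases


per_category, overall = percent_correct(table)
print(per_category)  # {0: 87.5, 1: 50.0}
print(overall)       # 75.0
```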

SIDEBAR: THE CUTOFF POINT

By the way, unless we change a setting (and we do not suggest you do that), if the predicted probability of a 1 is at least 0.5, the software predicts/classifies the result as a “1,” while if the predicted probability is less than 0.5, the software classifies the result as a “0.” In fact, this 0.5 “cutoff point” is noted right below the classification table.
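
The cutoff rule is simple enough to write out. The sketch below is only an illustration of the rule as described: the function name `classify` and the example probabilities are hypothetical and are not taken from Figure 11.6.

```python
def classify(probabilities, cutoff=0.5):
    """Predict 1 when the predicted probability of a 1 is at least the cutoff, else 0."""
    return [1 if p >= cutoff else 0 for p in probabilities]


# Hypothetical predicted probabilities, chosen only to show the rule.
example_probabilities = [0.12, 0.49, 0.50, 0.83]
print(classify(example_probabilities))  # [0, 0, 1, 1]
```

Note that a probability of exactly 0.5 is classified as a 1, matching the "at least 0.5" wording.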

Should predicting 75% of the cases correctly be considered “good”? Of course, as a practical matter, it depends on the real-world situation. However, in the abstract, statistically speaking, we can reason this way: If you were guessing whether each of the 12 participants successfully completed the task, given no information at all about these people (just a code number!!), how many would you guess correctly? 75%? We doubt it!! So, are you going to take your chances or use logistic regression? Of course, it's a rhetorical question.

SIDEBAR: JUST MAX 'EM OUT AND BE A PRO: CMAX AND CPRO CRITERIA

Well, you can get 8 of the 12 (67%) correct by predicting all 12 people to be unsuccessful (i.e., 0's). This is called the “Cmax” criterion, and you cannot guarantee a higher percentage of correct predictions by using any other strategy. The strategy is to predict everyone to be in the category with the highest frequency!! If the data had consisted of 25 people, for example, and 15 had been 1's, with 10 being 0's, then the highest-frequency category would be 1's, and you could guarantee 60% correctly predicted (100 × (15/25)). The binary logistic regression process resulted in 75% predicted correctly. It's always nice if the regression results predict a higher percent correct than the Cmax criterion!
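
For readers who want to check the Cmax figures, here is a minimal sketch. The function name `cmax_percent` is an illustrative assumption; the two data sets are the 12-person example and the 25-person illustration from the text.

```python
from collections import Counter


def cmax_percent(outcomes):
    """Return (most frequent category, percent correct if everyone is predicted to be in it)."""
    counts = Counter(outcomes)
    category, frequency = counts.most_common(1)[0]
    return category, 100 * frequency / len(outcomes)


# The chapter's example: eight 0's and four 1's.
category, percent = cmax_percent([0] * 8 + [1] * 4)
print(category, round(percent, 1))  # 0 66.7

# The 25-person illustration: fifteen 1's and ten 0's.
print(cmax_percent([1] * 15 + [0] * 10))  # (1, 60.0)
```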

However, there are some cautions that need to be mentioned. First, with a sample size of only 12, it turns out that the 75% is not statistically significantly higher than 67%, so the fact that 75% exceeds 67% is not that impressive, primarily because of the small sample size. The same results based on a sample size of 120 (instead of 12) would be statistically significant at the traditional 0.05 significance level.
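
One way to see the effect of sample size is a one-sided exact binomial test that treats the 67% Cmax rate as a fixed baseline. The text does not say which test the authors used, so this choice, along with the function name `upper_tail_p`, is an assumption made only for illustration.

```python
from math import comb


def upper_tail_p(n, correct, baseline):
    """P(at least `correct` of n cases are right) if each is right with probability `baseline`."""
    return sum(
        comb(n, k) * baseline**k * (1 - baseline) ** (n - k)
        for k in range(correct, n + 1)
    )


baseline = 8 / 12  # the Cmax rate from the example (assumed fixed for this test)

p_small = upper_tail_p(12, 9, baseline)    # 75% correct with n = 12
p_large = upper_tail_p(120, 90, baseline)  # 75% correct with n = 120

print(round(p_small, 3), p_small < 0.05)  # 0.393 False
print(p_large < 0.05)                     # True
```

Under these assumptions, 9 of 12 correct is entirely unremarkable against a 67% baseline, while 90 of 120 correct falls below the 0.05 level.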
