${lrPrCats * 100.0}%2.4f%%\nArea under ROC: ${lrRocCats * 100.0}%2.4f%%")
You should see output similar to the following:
LogisticRegressionModel
Accuracy: 66.5720%
Area under PR: 75.7964%
Area under ROC: 66.5483%
By applying a feature standardization transformation to our data, we improved both the accuracy and AUC measures from 50 percent to 62 percent, and we then achieved a further boost to 66 percent by adding the category feature to our model (remember to apply the standardization to the new feature set).
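The two transformations above can be sketched in plain Scala without a Spark cluster. This is a minimal illustration of what 1-of-k encoding of the category feature and MLlib's StandardScaler each do to a feature vector; the category names and values here are illustrative, not taken from the StumbleUpon dataset:

```scala
// Sketch of the two feature transformations used above (illustrative values).
object FeatureTransforms {
  // 1-of-k ("one-hot") encoding: assign each distinct category an index and
  // emit a binary vector with a single 1.0 at that index.
  def oneOfK(categories: Seq[String]): Map[String, Array[Double]] = {
    val index = categories.distinct.zipWithIndex.toMap
    index.map { case (cat, i) =>
      val vec = Array.fill(index.size)(0.0)
      vec(i) = 1.0
      cat -> vec
    }
  }

  // Z-score standardization: subtract each column's mean and divide by its
  // standard deviation, which is what StandardScaler(withMean, withStd) does.
  def standardize(rows: Array[Array[Double]]): Array[Array[Double]] = {
    val n = rows.length.toDouble
    val dims = rows.head.length
    val means = (0 until dims).map(j => rows.map(_(j)).sum / n)
    val stds = (0 until dims).map { j =>
      math.sqrt(rows.map(r => math.pow(r(j) - means(j), 2)).sum / n)
    }
    rows.map(_.zipWithIndex.map { case (v, j) =>
      if (stds(j) == 0.0) 0.0 else (v - means(j)) / stds(j)
    })
  }
}
```

In MLlib itself, the equivalent steps are `StandardScaler(withMean = true, withStd = true).fit(vectors)` for standardization and manually building the binary category vector before concatenating it with the numeric features.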
Note
The best model performance in the competition was an AUC of 0.88906 (see http://www.kaggle.com/c/stumbleupon/leaderboard/private).
One approach to achieving performance almost as high is outlined at http://www.kaggle.com/c/stumbleupon/forums/t/5680/beating-the-benchmark-leaderboard-auc-0-878.
Notice that there are still features we have not yet used; most notably, the text features in the boilerplate variable. The leading competition submissions predominantly use the boilerplate features, and features based on the raw textual content, to achieve their performance. As we saw earlier, while adding the category feature improved performance, most of the other variables are not very useful as predictors, whereas the textual content turned out to be highly predictive.
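A first step toward using that textual content is a bag-of-words representation: tokenize each document and count term occurrences. The sketch below is a hedged, plain-Scala illustration of that idea (the regex-based tokenizer and the example sentence are assumptions, not the competition winners' pipeline); in MLlib the analogous tool is the `HashingTF` feature transformer:

```scala
// Minimal bag-of-words sketch: lowercase, split on non-word characters,
// and count how often each term appears in a document.
object TextFeatures {
  def tokenize(text: String): Seq[String] =
    text.toLowerCase.split("""\W+""").filter(_.nonEmpty).toSeq

  // Map a document to a sparse term -> count representation.
  def termFrequencies(doc: String): Map[String, Int] =
    tokenize(doc).groupBy(identity).map { case (t, ts) => t -> ts.size }
}
```

Representations like this (often weighted by TF-IDF) are what make the raw text so predictive relative to the handful of numeric features used so far.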
Going through some of the best performing approaches for these competitions can give
you a good idea as to how feature extraction and engineering play a critical role in model
performance.