${lrPrCats * 100.0}%2.4f%%\nArea under ROC: ${lrRocCats * 100.0}%2.4f%%")
You should see output similar to the following:
LogisticRegressionModel
Accuracy: 66.5720%
Area under PR: 75.7964%
Area under ROC: 66.5483%
By applying a feature standardization transformation to our data, we improved both the accuracy and AUC measures from 50 percent to 62 percent, and we then achieved a further boost to 66 percent by adding the category feature to our model (remember to apply the standardization to the new feature set).
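The two transformations above can be sketched in plain Scala without a Spark cluster. This is a minimal illustration of what 1-of-k encoding of the category feature and MLlib's StandardScaler each do to a feature vector; the category names and values here are illustrative, not taken from the StumbleUpon dataset:

```scala
// Sketch of the two feature transformations used above (illustrative values).
object FeatureTransforms {
  // 1-of-k ("one-hot") encoding: assign each distinct category an index and
  // emit a binary vector with a single 1.0 at that index.
  def oneOfK(categories: Seq[String]): Map[String, Array[Double]] = {
    val index = categories.distinct.zipWithIndex.toMap
    index.map { case (cat, i) =>
      val vec = Array.fill(index.size)(0.0)
      vec(i) = 1.0
      cat -> vec
    }
  }

  // Z-score standardization: subtract each column's mean and divide by its
  // standard deviation, which is what StandardScaler(withMean, withStd) does.
  def standardize(rows: Array[Array[Double]]): Array[Array[Double]] = {
    val n = rows.length.toDouble
    val dims = rows.head.length
    val means = (0 until dims).map(j => rows.map(_(j)).sum / n)
    val stds = (0 until dims).map { j =>
      math.sqrt(rows.map(r => math.pow(r(j) - means(j), 2)).sum / n)
    }
    rows.map(_.zipWithIndex.map { case (v, j) =>
      if (stds(j) == 0.0) 0.0 else (v - means(j)) / stds(j)
    })
  }
}
```

In MLlib itself, the equivalent steps are `StandardScaler(withMean = true, withStd = true).fit(vectors)` for standardization and manually building the binary category vector before concatenating it with the numeric features.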
Note
The best model performance in the competition was an AUC of 0.88906 (see http://www.kaggle.com/c/stumbleupon/leaderboard/private).
One approach to achieving performance almost as high is outlined at http://www.kaggle.com/c/stumbleupon/forums/t/5680/beating-the-benchmark-leaderboard-auc-0-878.
Notice that there are still features we have not yet used; most notably, the text features in the boilerplate variable. The leading competition submissions predominantly use the boilerplate features, and features based on the raw textual content, to achieve their performance. As we saw earlier, while adding the category feature improved performance, most of the other variables are not very useful as predictors, whereas the textual content turned out to be highly predictive.
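A first step toward using that textual content is a bag-of-words representation: tokenize each document and count term occurrences. The sketch below is a hedged, plain-Scala illustration of that idea (the regex-based tokenizer and the example sentence are assumptions, not the competition winners' pipeline); in MLlib the analogous tool is the `HashingTF` feature transformer:

```scala
// Minimal bag-of-words sketch: lowercase, split on non-word characters,
// and count how often each term appears in a document.
object TextFeatures {
  def tokenize(text: String): Seq[String] =
    text.toLowerCase.split("""\W+""").filter(_.nonEmpty).toSeq

  // Map a document to a sparse term -> count representation.
  def termFrequencies(doc: String): Map[String, Int] =
    tokenize(doc).groupBy(identity).map { case (t, ts) => t -> ts.size }
}
```

Representations like this (often weighted by TF-IDF) are what make the raw text so predictive relative to the handful of numeric features used so far.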
Going through some of the best performing approaches for these competitions can give
you a good idea as to how feature extraction and engineering play a critical role in model
performance.