Wine_features.csv_GradientBoostingRegressor.results.json
Wine_features.csv_LinearRegression.log
Wine_features.csv_LinearRegression.predictions
Wine_features.csv_LinearRegression.results
Wine_features.csv_LinearRegression.results.json
Wine_features.csv_RandomForestRegressor.log
Wine_features.csv_RandomForestRegressor.predictions
Wine_features.csv_RandomForestRegressor.results
Wine_features.csv_RandomForestRegressor.results.json
Wine_summary.tsv
SKLL generates four files for each learner: one log, two with results, and one with
predictions. Moreover, SKLL generates a summary file, which contains a lot of
information about each individual fold (too much to show here). We can extract the
relevant metrics using the following SQL query:
$ < Wine_summary.tsv csvsql --query "SELECT learner_name, pearson FROM stdin "\
> "WHERE fold = 'average' ORDER BY pearson DESC" | csvlook
|----------------------------+----------------|
| learner_name               | pearson        |
|----------------------------+----------------|
| RandomForestRegressor      | 0.741860521533 |
| GradientBoostingRegressor  | 0.661957860603 |
| LinearRegression           | 0.524144785555 |
|----------------------------+----------------|
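If csvsql is not at hand, the same selection can be approximated with awk and sort. The following is a minimal sketch, assuming that the summary file is tab-separated and has a header row containing the columns learner_name, fold, and pearson; the function name average_pearson is made up for this illustration:

```shell
# Hypothetical helper: print learner_name and pearson for the rows where
# fold equals "average", sorted by pearson in descending order.
# Assumes a tab-separated file whose header names these three columns.
average_pearson() {
  awk -F'\t' '
    NR == 1 {
      # Remember the position of every column by its header name.
      for (i = 1; i <= NF; i++) col[$i] = i
      next
    }
    $(col["fold"]) == "average" {
      print $(col["learner_name"]) "\t" $(col["pearson"])
    }
  ' "$1" | sort -t "$(printf '\t')" -k2,2nr
}
```

It would be called as average_pearson Wine_summary.tsv. Because the columns are looked up by name, their order in the file does not matter.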
The relevant column here is pearson, which contains the Pearson correlation
coefficient. This is a value between -1 and 1 that indicates the strength of the
linear correlation between the true quality scores and the predicted scores (it is
not a rank correlation). Let's paste all the predictions back to the data set:
$ parallel "csvjoin -c id train/features.csv <(< output/Wine_features.csv_{}"\
> ".predictions | tr '\t' ',') | csvcut -c id,quality,prediction > {}" ::: \
> RandomForestRegressor GradientBoostingRegressor LinearRegression
$ csvstack *Regres* -n learner --filenames > predictions.csv
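As a sanity check, a Pearson correlation can be recomputed per learner from predictions.csv. The sketch below assumes that the file has a header with the columns learner, quality, and prediction and that no field contains a quoted comma; the function name pearson_by_learner is made up for this illustration. It pools all rows of a learner, so the result will probably differ somewhat from the fold average reported in the summary file:

```shell
# Hypothetical helper: Pearson correlation between quality and prediction,
# computed separately for each learner over all of its rows.
# Assumes a plain comma-separated file with a header row.
pearson_by_learner() {
  awk -F',' '
    NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }
    {
      k = $(col["learner"]); x = $(col["quality"]); y = $(col["prediction"])
      # Running sums needed for the correlation formula.
      n[k]++; sx[k] += x; sy[k] += y
      sxx[k] += x * x; syy[k] += y * y; sxy[k] += x * y
    }
    END {
      for (k in n) {
        cov = n[k] * sxy[k] - sx[k] * sy[k]
        vx  = n[k] * sxx[k] - sx[k] * sx[k]
        vy  = n[k] * syy[k] - sy[k] * sy[k]
        # Skip learners for which either column is constant.
        if (vx > 0 && vy > 0) printf "%s,%.4f\n", k, cov / sqrt(vx * vy)
      }
    }
  ' "$1" | sort
}
```

It would be called as pearson_by_learner predictions.csv and prints one line per learner, sorted by learner name.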
And create a plot using Rio (see Figure 9-8):
$ < predictions.csv Rio -ge 'g+geom_point(aes(quality, round(prediction), '\
> 'color=learner), position="jitter", alpha=0.1) + facet_wrap(~ learner) + '\
> 'theme(aspect.ratio=1) + xlim(3,9) + ylim(3,9) + guides(colour=FALSE) + '\
> 'geom_smooth(aes(quality, prediction), method="lm", color="black") + '\
> 'ylab("prediction")' | display
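Where no display is available, a rough text counterpart of the scatter plot is a count of how often each true quality score coincides with each rounded prediction. The following sketch makes the same assumptions about predictions.csv as the plot command (columns learner, quality, and prediction), plus the assumption that no field contains a quoted comma; the function name confusion_counts is made up for this illustration. It rounds half up, which may differ from R's round() for values ending in exactly .5:

```shell
# Hypothetical helper: for each learner, count the rows per combination of
# true quality and rounded prediction.
# Output columns: learner,quality,rounded prediction,count.
confusion_counts() {
  awk -F',' '
    NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }
    {
      # Round half up; adequate for the positive scores used here.
      p = int($(col["prediction"]) + 0.5)
      count[$(col["learner"]) "," $(col["quality"]) "," p]++
    }
    END { for (k in count) print k "," count[k] }
  ' "$1" | sort
}
```

It would be called as confusion_counts predictions.csv. Rows where the second and third column agree correspond to points on the diagonal of the plot.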