Modeling Data - Data Science at the Command Line


Wine_features.csv_GradientBoostingRegressor.results.json
Wine_features.csv_LinearRegression.log
Wine_features.csv_LinearRegression.predictions
Wine_features.csv_LinearRegression.results
Wine_features.csv_LinearRegression.results.json
Wine_features.csv_RandomForestRegressor.log
Wine_features.csv_RandomForestRegressor.predictions
Wine_features.csv_RandomForestRegressor.results
Wine_features.csv_RandomForestRegressor.results.json
Wine_summary.tsv

SKLL generates four files for each learner: one log, two with results, and one with predictions. Moreover, SKLL generates a summary file, which contains a lot of information about each individual fold (too much to show here). We can extract the relevant metrics using the following SQL query:

$ < Wine_summary.tsv csvsql --query "SELECT learner_name, pearson FROM stdin " \
> "WHERE fold = 'average' ORDER BY pearson DESC" | csvlook

|----------------------------+----------------|
| learner_name               | pearson        |
|----------------------------+----------------|
| RandomForestRegressor      | 0.741860521533 |
| GradientBoostingRegressor  | 0.661957860603 |
| LinearRegression           | 0.524144785555 |
|----------------------------+----------------|
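When csvkit is not at hand, the same selection can be approximated with awk and sort. The following is only a sketch: the function name average_scores is hypothetical, the toy rows are made up for illustration, and it assumes a tab-separated file with the columns learner_name, fold, and pearson that the query above refers to.

```shell
# Hypothetical helper: print learner_name and pearson for the "average"
# rows of a tab-separated summary file, best score first.
average_scores() {
  awk -F'\t' '
    # Map header names to column positions so the column order does not matter.
    NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }
    $(col["fold"]) == "average" {
      print $(col["learner_name"]) "\t" $(col["pearson"])
    }
  ' "$1" | sort -t "$(printf '\t')" -k2,2nr
}

# Toy stand-in for Wine_summary.tsv (made-up values; the real file has
# many more columns and one row per fold).
summary=$(mktemp)
printf 'learner_name\tfold\tpearson\n'  > "$summary"
printf 'LearnerA\t1\t0.50\n'           >> "$summary"
printf 'LearnerA\taverage\t0.55\n'     >> "$summary"
printf 'LearnerB\t1\t0.70\n'           >> "$summary"
printf 'LearnerB\taverage\t0.75\n'     >> "$summary"

average_scores "$summary"
rm -f "$summary"
```

Looking columns up by header name rather than by position keeps the sketch usable even if the summary file orders its columns differently.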

The relevant column here is pearson, which holds the Pearson correlation coefficient. This is a value between -1 and 1 that measures the linear correlation between the true quality scores and the predicted scores. Let's paste all the predictions back to the data set:

$ parallel "csvjoin -c id train/features.csv <(< output/Wine_features.csv_{}" \
> ".predictions | tr '\t' ',') | csvcut -c id,quality,prediction > {}" ::: \
> RandomForestRegressor GradientBoostingRegressor LinearRegression
$ csvstack *Regres* -n learner --filenames > predictions.csv
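As a sanity check on a file shaped like predictions.csv, Pearson's correlation can be computed per learner with awk. This is a minimal sketch under stated assumptions: the function name pearson_by_learner is hypothetical, the toy rows are made up, the file is assumed to have the columns learner, quality, and prediction, and fields are assumed to contain no quoted commas. Because it pools all rows per learner, the result may not match the fold-averaged value in the SKLL summary exactly.

```shell
# Hypothetical helper: Pearson's r between quality and prediction,
# computed separately for each value of the learner column.
pearson_by_learner() {
  awk -F, '
    NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }
    {
      k = $(col["learner"]); x = $(col["quality"]); y = $(col["prediction"])
      # Accumulate the running sums needed for the correlation formula.
      n[k]++; sx[k] += x; sy[k] += y
      sxx[k] += x * x; syy[k] += y * y; sxy[k] += x * y
    }
    END {
      for (k in n) {
        num = n[k] * sxy[k] - sx[k] * sy[k]
        den = sqrt(n[k] * sxx[k] - sx[k] * sx[k]) * sqrt(n[k] * syy[k] - sy[k] * sy[k])
        # Skip learners whose values are constant (r is undefined there).
        if (den > 0) printf "%s,%.4f\n", k, num / den
      }
    }
  ' "$1"
}

# Toy data with made-up values: A predicts perfectly in line with quality,
# B predicts in exactly the opposite order.
toy=$(mktemp)
printf 'learner,id,quality,prediction\n'  > "$toy"
printf 'A,1,1,2\nA,2,2,4\nA,3,3,6\n'     >> "$toy"
printf 'B,1,1,3\nB,2,2,2\nB,3,3,1\n'     >> "$toy"

pearson_by_learner "$toy" | sort
rm -f "$toy"
```

The output order of awk's for-in loop is unspecified, which is why the usage line pipes the result through sort.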

We can then create a plot using Rio (see Figure 9-8):

$ < predictions.csv Rio -ge 'g+geom_point(aes(quality, round(prediction), ' \
> 'color=learner), position="jitter", alpha=0.1) + facet_wrap(~ learner) + ' \
> 'theme(aspect.ratio=1) + xlim(3,9) + ylim(3,9) + guides(colour=FALSE) + ' \
> 'geom_smooth(aes(quality, prediction), method="lm", color="black") + ' \
> 'ylab("prediction")' | display
