Database Reference
In-Depth Information
Figure 9-8. Comparing the output of three regression algorithms
Classiication with BigML
In this fourth and last modeling section, we're going to classify wines as either red or
white. For this we'll be using a solution called BigML, which provides a prediction
API. This means that the actual modeling and predicting takes place in the cloud,
which is useful if you need a bit more power than your own computer can offer.
Although prediction APIs are relatively young, they are becoming more prevalent,
which is why we've included one in this chapter. Other prediction APIs include the
Google Prediction API and PredictionIO . One advantage of BigML is that it offers a
convenient command-line tool called bigmler (BigML, 2014) that interfaces with the
API. We can use this command-line tool like any other presented in this topic, except
in this case behind the scenes, our data set is sent to BigML's servers, which perform
the classification and send back the results.
Creating Balanced Train and Test Data Sets
First, we create a balanced data set to ensure that both classes are represented equally.
For this, we use csvstack (Groskopf, 2014), shuf (Eggert, 2012), head , and csvcut :
$ csvstack -n type -g red,white wine-red-clean.csv \
> < ( < wine-white-clean.csv body shuf | head -n 1600 ) |
> csvcut -c fixed_acidity,volatile_acidity,citric_acid, \
> residual_sugar,chlorides,free_sulfur_dioxide,total_sulfur_dioxide, \
> density,ph,sulphates,alcohol,type > wine-balanced.csv
This long command breaks down as follows:
csvstack is used to combine multiple data sets. It creates a new column type ,
which has the value “red” for all rows coming from the first file wine-red-clean.csv
and “white” for all rows coming from the second file.
 
Search WWH ::




Custom Search