Modeling Data - Data Science at the Command Line

Figure 9-8. Comparing the output of three regression algorithms

Classification with BigML

In this fourth and last modeling section, we're going to classify wines as either red or white. For this we'll be using a solution called BigML, which provides a prediction API. This means that the actual modeling and predicting take place in the cloud, which is useful if you need a bit more power than your own computer can offer. Although prediction APIs are relatively young, they are becoming more prevalent, which is why we've included one in this chapter. Other prediction APIs include the Google Prediction API and PredictionIO. One advantage of BigML is that it offers a convenient command-line tool called bigmler (BigML, 2014) that interfaces with the API. We can use this command-line tool like any other presented in this book, except that in this case, behind the scenes, our data set is sent to BigML's servers, which perform the classification and send back the results.

Creating Balanced Train and Test Data Sets

First, we create a balanced data set to ensure that both classes are represented equally. For this, we use csvstack (Groskopf, 2014), shuf (Eggert, 2012), head, and csvcut:

$ csvstack -n type -g red,white wine-red-clean.csv \
> <(< wine-white-clean.csv body shuf | head -n 1600) |
> csvcut -c fixed_acidity,volatile_acidity,citric_acid,\
> residual_sugar,chlorides,free_sulfur_dioxide,total_sulfur_dioxide,\
> density,ph,sulphates,alcohol,type > wine-balanced.csv

This long command breaks down as follows:

csvstack is used to combine multiple data sets. It creates a new column, type, which has the value “red” for all rows coming from the first file, wine-red-clean.csv, and “white” for all rows coming from the second file.
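To make the stacking and balancing idea concrete, here is a minimal sketch on two tiny, made-up CSV files. It deliberately uses only awk, shuf, head, and tail rather than csvstack, so it approximates the effect of the command above instead of reproducing it. The file names, the values, and the tmp directory are assumptions made for illustration; they are not the wine data sets.

```shell
# Toy illustration of "stack two files, label each row, balance the classes".
# All file names and values below are hypothetical.
tmp=$(mktemp -d)

# Two small CSV files that share the same header; "white" has more rows.
printf 'alcohol,ph\n9.4,3.51\n9.8,3.20\n' > "$tmp/red.csv"
printf 'alcohol,ph\n8.8,3.00\n9.5,3.30\n10.1,3.26\n9.9,3.19\n' > "$tmp/white.csv"

# Number of data rows in the smaller class (header excluded).
n=$(( $(wc -l < "$tmp/red.csv") - 1 ))

{
  # Emit the header once, with the new label column appended.
  echo "$(head -n 1 "$tmp/red.csv"),type"
  # Keep every row of the smaller class and label it "red".
  tail -n +2 "$tmp/red.csv" | awk '{ print $0 ",red" }'
  # Shuffle the larger class, keep only n rows, and label them "white".
  tail -n +2 "$tmp/white.csv" | shuf | head -n "$n" | awk '{ print $0 ",white" }'
} > "$tmp/balanced.csv"

# Count the rows per class to confirm that the result is balanced.
tail -n +2 "$tmp/balanced.csv" |
  awk -F, '{ count[$NF]++ } END { for (c in count) print c, count[c] }' | sort
```

The final pipeline should report two rows for each class. Which white rows are kept varies between runs because shuf samples them at random.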
