Cross-validation is a technique that splits up the whole data set into a certain number of subsets. These subsets
are called folds. (Usually, five or ten folds are used.)
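SKLL takes care of the fold assignment itself, but as a rough illustration of the idea, the following sketch splits the cleaned data set round-robin into five fold files (the folds directory and file names are made up for this example, and SKLL's own assignment also shuffles the rows):
# assign each data row (skipping the header) to one of five fold files
$ mkdir -p folds
$ < wine-white-clean.csv tail -n +2 | awk '{ print > ("folds/fold" (NR % 5) ".csv") }'
$ wc -l folds/fold*.csv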
We need to add an identifier to each row so that we can easily identify the data points
later (the predictions are not in the same order as the original data set):
$ mkdir train
$ < wine-white-clean.csv nl -s, -w1 -v0 | sed '1s/0,/id,/' > train/features.csv
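A quick sanity check (purely illustrative) confirms that the header now starts with the new id column and shows how many rows we have:
$ < train/features.csv head -n 1 | cut -d, -f1   # should print: id
$ < train/features.csv wc -l                     # number of data rows plus one header line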
Running the Experiment
Create a configuration file called predict-quality.cfg:
[General]
experiment_name = Wine
task = cross_validate
[Input]
train_location = train
featuresets = [["features.csv"]]
learners = ["LinearRegression","GradientBoostingRegressor","RandomForestRegressor"]
label_col = quality
[Tuning]
grid_search = false
feature_scaling = both
objective = r2
[Output]
log = output
results = output
predictions = output
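With grid_search set to false, each learner runs with its default hyperparameters. If you are willing to wait longer, you can change the [Tuning] section as sketched below so that SKLL tunes the hyperparameters against the same objective:
[Tuning]
grid_search = true
feature_scaling = both
objective = r2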
We run the experiment using the run_experiment command-line tool (Educational
Testing Service, 2014):
$ run_experiment -l predict-quality.cfg
The -l option specifies that the experiment should be run in local mode. SKLL also offers the possibility of running
experiments on a cluster. The time it takes to run the experiment depends on the
complexity of the chosen algorithms.
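If you're curious how long the experiment takes on your machine, you can prefix the command with time:
$ time run_experiment -l predict-quality.cfg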
Parsing the Results
Once all algorithms are done, the results can be found in the output directory:
$ cd output
$ ls -1
Wine_features.csv_GradientBoostingRegressor.log
Wine_features.csv_GradientBoostingRegressor.predictions
Wine_features.csv_GradientBoostingRegressor.results
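To get a first impression of what each learner produced, you can inspect these files directly (the exact layout of the .predictions and .results files depends on your SKLL version):
$ head -n 3 Wine_features.csv_GradientBoostingRegressor.predictions
$ less Wine_features.csv_GradientBoostingRegressor.results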