Table 2. Percentage of wins against UCB1-TUNED of generic and learned policies. Best scores are shown in bold.

Generic policies   T=10     T=100    T=1000
UCB1               48.1 %   78.1 %   83.1 %
UCB2               12.7 %    6.8 %    6.8 %
UCB-V              38.3 %   57.2 %   49.6 %
KL-UCB             50.5 %   65.0 %   67.0 %
εn-GREEDY          37.5 %   14.1 %   10.7 %

Learned policies   T=10     T=100    T=1000
POWER-1            54.6 %   82.3 %   91.3 %
POWER-2            54.2 %   84.6 %   90.3 %
FORMULA-1          61.7 %   76.8 %   88.1 %
FORMULA-2          61.0 %   80.0 %   73.1 %
Generic Policies. As already pointed out in [4], UCB1-TUNED is particularly well suited to bandit problems with Bernoulli distributions. It also proves effective on bandit problems with Gaussian distributions, so that it nearly always outperforms the other untuned policies. By tuning UCB1, we outperform the UCB1-TUNED policy (e.g. 4.91 instead of 5.43 on Bernoulli problems with T = 1000). The same occasionally happens with UCB-V. However, even though we used a careful tuning procedure, UCB2 and εn-GREEDY never outperform UCB1-TUNED.
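To make the tuning idea concrete, the sketch below shows a UCB1-style index with an explicit exploration constant; "tuning" UCB1 can then be read as optimizing this constant on training problems. This is a minimal Python illustration, and the single-constant parameterization is our assumption, not necessarily the exact tuning scheme used in these experiments.

```python
import math

def ucb1_index(mean_reward, n_plays, t, c=2.0):
    """UCB1-style index for one arm at time step t.

    With c = 2 this is the standard (untuned) UCB1 bonus sqrt(2 ln t / n);
    tuning UCB1 can be understood as optimizing c on training problems
    (this parameterization is an illustrative assumption).
    """
    if n_plays == 0:
        return float("inf")  # force every arm to be pulled at least once
    return mean_reward + math.sqrt(c * math.log(t) / n_plays)
```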
Learned Policies. We observe that when the training horizon equals the testing horizon T, the learned policies (POWER-1, POWER-2, FORMULA-1 and FORMULA-2) systematically outperform all generic policies. The best overall results are obtained with the POWER-2 policies. Note that, owing to their numerical nature and their large number of parameters, these policies are extremely hard to interpret and understand. The results for the symbolic policies show that there exist very simple policies that perform nearly as well as these black-box policies. This clearly illustrates the benefits of our two hypothesis spaces: numerical policies make it possible to reach very high performance, while symbolic policies provide interpretable strategies whose behavior can be analyzed more easily. This interpretability/performance tradeoff is common in machine learning and was identified several decades ago in the field of supervised learning. It is worth mentioning that, among the 99,020 formula equivalence classes, a surprisingly large number of strategies outperforming the generic policies were found: for T = 100 (resp. T = 1000), we obtain about 50 (resp. 80) different symbolic policies that outperform the generic policies.
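As an illustration of the two hypothesis spaces, the sketch below contrasts a "numerical" index built from a weighted feature vector with a short "symbolic" formula. Both forms are hypothetical examples of the two families; they are not the actual POWER or FORMULA policies learned in this work.

```python
import math

def numerical_index(mean_reward, n_plays, t, theta):
    """Illustrative numerical index: a weighted combination of simple arm
    statistics, in the spirit of the POWER family (the feature set and the
    parameter vector theta are hypothetical)."""
    features = (1.0, mean_reward, 1.0 / (n_plays + 1.0), math.log(t + 1.0))
    return sum(w * f for w, f in zip(theta, features))

def symbolic_index(mean_reward, n_plays, t):
    """Illustrative symbolic index: a short closed-form formula, in the spirit
    of the FORMULA family (this particular formula is only an example)."""
    return mean_reward + 1.0 / (n_plays + 1.0)
```

The numerical form can express finer trade-offs at the cost of interpretability, while the symbolic form can be read off directly, which mirrors the interpretability/performance tradeoff discussed above.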
Robustness w.r.t. the Horizon T. As expected, the learned policies give their best performance when the training and testing horizons are equal. Policies learned with a large training horizon also prove to work well on smaller horizons. However, when the testing horizon is larger than the training horizon, the quality of the policy may degrade quickly (e.g. when evaluating POWER-1 trained with T = 10 on a horizon T = 1000).
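A minimal sketch of how such a win-rate comparison across horizons could be set up is given below: a policy trained for one horizon is simply evaluated at a different testing horizon against the baseline. The Bernoulli problems with uniformly drawn arm means and the regret-based win criterion are assumptions made for this illustration, not a description of the exact experimental protocol.

```python
import random

def play(index_fn, means, horizon):
    """Run one Bernoulli bandit problem with an index-based policy and
    return the (pseudo-)regret accumulated over the given horizon."""
    k = len(means)
    counts, sums = [0] * k, [0.0] * k
    best, regret = max(means), 0.0
    for t in range(1, horizon + 1):
        unplayed = [i for i in range(k) if counts[i] == 0]
        if unplayed:                      # play every arm once first
            arm = unplayed[0]
        else:
            arm = max(range(k),
                      key=lambda i: index_fn(sums[i] / counts[i], counts[i], t))
        reward = 1.0 if random.random() < means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        regret += best - means[arm]
    return regret

def win_rate(candidate, baseline, horizon, n_problems=1000, n_arms=10):
    """Fraction of randomly drawn problems on which `candidate` achieves lower
    regret than `baseline` at the given testing horizon (assumed protocol)."""
    wins = 0
    for _ in range(n_problems):
        means = [random.random() for _ in range(n_arms)]
        if play(candidate, means, horizon) < play(baseline, means, horizon):
            wins += 1
    return wins / n_problems
```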
Robustness w.r.t. the Kind of Distribution. Although truncated Gaussian distributions differ significantly from Bernoulli distributions, the learned policies generalize well to this new setting most of the time and still outperform all the other generic policies.
A Word on the Learned Symbolic Policies. It is worth noticing that the best index-based policies (FORMULA-1) found for the two largest horizons (T = 100 and T = 1000) work in a similar way to the UCB-type policies reported earlier in the literature. Indeed, they also associate to an arm k an index which is the sum of the empirical mean reward r_k and of a positive