Probabilistic Connection between Cross-Validation and Vapnik Bounds - Agents and Artificial Intelligence


but rather such a sample size so that complexity selection via SRM gives similar results to complexity selection via cross-validation,

- we do not explicitly introduce the notion of error stability for the learning algorithm, but this kind of stability is implicitly derived by means of the Chernoff-Hoeffding-like inequalities we write,

- we do not focus on leave-one-out cross-validation; we consider a more general n-fold non-stratified cross-validation (which is also more convenient for our purposes); the leave-one-out case can be read off from our results as a special case.
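To make the n-fold non-stratified scheme concrete, the following minimal sketch splits I sample indices into n folds without looking at the labels. The function name `n_fold_splits`, its parameters, and the shuffling seed are illustrative assumptions, not part of the original text. Setting n = I yields the leave-one-out case.

```python
import random


def n_fold_splits(I, n, seed=0):
    """Yield (train_indices, test_indices) for n-fold non-stratified CV.

    I is the sample set size, n the number of folds (hypothetical helper).
    """
    if not 2 <= n <= I:
        raise ValueError("need 2 <= n <= I")
    indices = list(range(I))
    # Non-stratified: indices are shuffled without regard to the labels y_i.
    random.Random(seed).shuffle(indices)
    # Fold sizes differ by at most one when n does not divide I.
    base, extra = divmod(I, n)
    start = 0
    for k in range(n):
        size = base + (1 if k < extra else 0)
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


# Five folds over ten samples: each test fold holds two indices.
folds = list(n_fold_splits(I=10, n=5))

# Leave-one-out as the special case n = I.
loo = list(n_fold_splits(I=10, n=10))
```

Each index appears in exactly one test fold, so every sample is used for testing once and for training n - 1 times.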

1.1 Notation Related to Statistical Learning Theory

We keep the notation similar to Vapnik's [2,1].

- We denote the finite set of samples as:

$(x_1, y_1), (x_2, y_2), \ldots, (x_I, y_I),$

or more shortly, by encapsulating pairs, as $\{z_1, z_2, \ldots, z_I\}$, where $x_i \in \mathbb{R}^d$ are input points, $y_i$ are the output values corresponding to them, and $I$ is the set size. The $y_i$ differ depending on the learning task: for classification (pattern recognition) $y_i \in \{1, 2, \ldots, K\}$, a finite discrete set; for regression estimation $y_i \in \mathbb{R}$.

- We denote the set of approximating functions (models), in the sense of both classification and regression estimation, as:

$\{f(x, \omega)\}_{\omega \in \Omega},$

where $\Omega$ is the domain of parameters of this set of functions, so a fixed $\omega$ can be regarded as an index of a specific function in the set.

- The risk functional $R : \{f(x, \omega)\}_{\omega \in \Omega} \to \mathbb{R} \cup \{\infty\}$ is defined as

$$R(\omega) = \int_{x \in X} \int_{y \in Y} L\big(f(x, \omega), y\big) \, \underbrace{p(x, y)}_{p(x)\,p(y|x)} \, dy \, dx, \qquad (1)$$

where $p(x)$ is the distribution density of input points and $p(y|x)$ is the conditional density of system/phenomenon outputs $y$ given a fixed $x$. $p(x, y) = p(x)\,p(y|x)$ is the joint distribution density for pairs $(x, y)$. In practice, $p(x, y)$ is unknown but fixed, and hence we assume the sample $\{z_1, z_2, \ldots, z_I\}$ to be i.i.d. (independent, identically distributed). $L$ is the so-called loss function, which measures the discrepancy between the output $y$ and the model $f$. For classification, $L$ is an indicator function:

$$L\big(f(x, \omega), y\big) = \begin{cases} 0, & \text{for } y = f(x, \omega); \\ 1, & \text{for } y \neq f(x, \omega), \end{cases} \qquad (2)$$
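As an illustration of the risk functional (1) with the indicator loss (2), the sketch below approximates $R(\omega)$ by Monte Carlo sampling. Everything specific here is an assumption made for the example only: the threshold model family `f`, the two-class label set {1, 2}, and the synthetic joint density implemented in `sample_pair`. In practice $p(x, y)$ is unknown, so the risk cannot be computed this way on real data.

```python
import random


def f(x, omega):
    """Hypothetical model family: threshold classifier indexed by omega."""
    return 2 if x >= omega else 1


def zero_one_loss(prediction, y):
    """Indicator loss (2): 0 when the prediction matches y, 1 otherwise."""
    return 0 if prediction == y else 1


def sample_pair(rng, noise=0.1):
    """Draw (x, y) from an assumed p(x, y) = p(x) p(y|x).

    x ~ Uniform(0, 1); the clean label is 2 if x >= 0.5 else 1,
    and it is flipped with probability `noise`.
    """
    x = rng.random()
    y = 2 if x >= 0.5 else 1
    if rng.random() < noise:
        y = 3 - y  # swap label 1 <-> 2
    return x, y


def monte_carlo_risk(omega, num_samples=100_000, seed=0):
    """Approximate R(omega) by averaging the loss over i.i.d. draws."""
    rng = random.Random(seed)
    total = 0
    for _ in range(num_samples):
        x, y = sample_pair(rng)
        total += zero_one_loss(f(x, omega), y)
    return total / num_samples


risk_at_050 = monte_carlo_risk(0.5)
risk_at_080 = monte_carlo_risk(0.8)
```

Under the assumed density, the exact risk is 0.1 for $\omega = 0.5$ (only the flipped labels are misclassified) and 0.9 · 0.3 + 0.1 · 0.7 = 0.34 for $\omega = 0.8$; the Monte Carlo estimates should land close to these values. With the indicator loss, the risk is simply the probability of misclassification.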
