the actual outputs are Y, is L(Y, f(X, R)). Given particular values y and x, we have
the empirical loss L(y, f(x, R)), or L̂(R) for short.⁴
Now, a natural impulse at this point is to twist the knobs to make the loss
small: i.e., to select the R that minimizes L̂(R); let's write this as follows:
R̂ = argmin_R L̂(R). This procedure is sometimes called empirical risk minimization,
or ERM. (Of course, doing that minimization can itself be a tricky nonlinear
problem, but I will not cover optimization methods here.) The problem with
ERM is that the R̂ we get from this data will almost surely not be the same as
the one we'd get from the next set of data. What we really care about, if we think
it through, is not the error on any particular set of data, but the error we can
expect on new data, E[L(R)]. The former, L̂(R), is called the training or in-sample
or empirical error; the latter, E[L(R)], the generalization or out-of-sample or
true error. The difference between in-sample and out-of-sample errors is due to
sampling noise, the fact that our data are not perfectly representative of the
system we're studying. There will be quirks in our data which are just due to
chance, but if we minimize L̂(R) blindly, if we try to reproduce every feature of
the data, we will be making a machine that reproduces the random quirks, which do
not generalize, along with the predictive features. Think of the empirical error
L̂(R) as the generalization error, E[L(R)], plus a sampling fluctuation, F. If we
look at machines with low empirical errors, we will pick out ones with low true
errors, which is good, but we will also pick out ones with large negative
sampling fluctuations, which is not good. Even if the sampling noise F is very
small, R̂ can be very different from R_min. We have what optimization theory
calls an ill-posed problem (22).
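The ERM recipe — pick the R that minimizes L̂(R) on the data at hand — can be sketched in a few lines of code. This is an illustrative sketch, not from the text: it assumes a squared-error loss, a simple linear machine f(x, R) = R₀ + R₁x, and a brute-force grid search standing in for a real optimization method; all names and parameter values are hypothetical.

```python
import numpy as np

def empirical_loss(R, x, y):
    """Empirical loss L-hat(R): mean squared error of the
    linear machine f(x, R) = R[0] + R[1]*x on the data (x, y)."""
    predictions = R[0] + R[1] * x
    return np.mean((y - predictions) ** 2)

def erm_fit(x, y, candidates):
    """Empirical risk minimization: return the candidate R with the
    smallest empirical loss (argmin over a finite set of machines)."""
    losses = [empirical_loss(R, x, y) for R in candidates]
    return candidates[int(np.argmin(losses))]

# Illustrative data: a noisy line y = 1 + 2x + noise (assumed system).
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=50)
y = 1 + 2 * x + rng.normal(0, 0.1, size=50)

# Grid of candidate machines R = (intercept, slope).
grid = [(a, b) for a in np.linspace(0, 2, 21) for b in np.linspace(0, 4, 41)]
R_hat = erm_fit(x, y, grid)
```

For this particular loss and machine the minimizer could of course be found in closed form (ordinary least squares); the grid search is only there to make the argmin over machines explicit.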
Having a higher-than-optimal generalization error because we paid too
much attention to our data is called over-fitting . Just as we are often better off if
we tactfully ignore our friends' and neighbors' little faults, we want to ignore the
unrepresentative blemishes of our sample. Much of the theory of data mining is
about avoiding over-fitting. Three of the commonest forms of tact it has
developed are, in order of sophistication, cross-validation, regularization
(or penalization), and capacity control.
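Over-fitting can be made concrete with a small simulation comparing empirical error to error on fresh data. Everything here is assumed for illustration: the "true system" y = sin(2πx) plus noise, the sample sizes, and polynomial fits of increasing degree as the family of machines, with a large fresh sample standing in as a proxy for the expectation E[L(R)].

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_data(n):
    """Noisy samples from an assumed true system y = sin(2*pi*x) + noise."""
    x = rng.uniform(0, 1, size=n)
    return x, np.sin(2 * np.pi * x) + rng.normal(0, 0.3, size=n)

x_train, y_train = sample_data(15)    # the data we fit to
x_new, y_new = sample_data(2000)      # fresh data: proxy for E[L(R)]

train_err, new_err = {}, {}
for degree in (1, 3, 12):
    # ERM within this model class: least-squares polynomial fit.
    R = np.polyfit(x_train, y_train, degree)
    train_err[degree] = np.mean((y_train - np.polyval(R, x_train)) ** 2)
    new_err[degree] = np.mean((y_new - np.polyval(R, x_new)) ** 2)
```

With only 15 training points, the degree-12 polynomial nearly interpolates the sample, so its empirical error is tiny, while its error on fresh data is far larger: the machine has reproduced the sampling quirks along with the predictive features.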
2.1.1. Validation
We would never over-fit if we knew how well our machine's predictions
would generalize to new data. Since our data are never perfectly representative,
we always have to estimate the generalization performance. The empirical error
provides one estimate, but it's biased towards saying that the machine will do
well (since we built it to do well on that data). If we had a second, independent
set of data, we could evaluate our machine's predictions on it, and that would
give us an unbiased estimate of its generalization. One way to do this is to take
our original data and divide it, at random, into two parts, the training set and the