2.4 CAVEATS
It seems reasonable that semi-supervised learning can use additional unlabeled data, which by itself does not carry information on the mapping X → Y, to learn a better predictor f. As mentioned earlier, the key lies in the semi-supervised model assumptions about the link between the marginal distribution P(x) and the conditional distribution P(y | x). There are several different
semi-supervised learning methods, and each makes slightly different assumptions about this link.
These methods include self-training, probabilistic generative models, co-training, graph-based mod-
els, semi-supervised support vector machines, and so on. In the next several chapters, we will go
through these models and discuss their assumptions. In Section 8.2, we will also give some theoretical
justification. Empirically, these semi-supervised learning models do produce better classifiers than
supervised learning on some data sets.
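To make the first of these methods concrete, here is a minimal sketch of self-training: a base classifier is fit on the labeled data, its most confident predictions on the unlabeled data are converted into pseudo-labels, and the process repeats. The logistic-regression base learner, the 0.95 confidence threshold, and the iteration cap are illustrative assumptions, not choices made in the text.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def self_train(X_l, y_l, X_u, threshold=0.95, max_iter=10):
    """Self-training sketch: grow the labeled set with confident pseudo-labels."""
    clf = LogisticRegression()
    for _ in range(max_iter):
        clf.fit(X_l, y_l)
        if len(X_u) == 0:
            break
        proba = clf.predict_proba(X_u)
        conf = proba.max(axis=1)            # confidence of the predicted label
        mask = conf >= threshold            # trust only confident predictions
        if not mask.any():
            break                           # nothing confident left to add
        pseudo = clf.classes_[proba[mask].argmax(axis=1)]
        X_l = np.vstack([X_l, X_u[mask]])   # absorb pseudo-labeled points
        y_l = np.concatenate([y_l, pseudo])
        X_u = X_u[~mask]
    return clf
```

Note the sketch's implicit link assumption: that the classifier's high-confidence predictions on unlabeled data are correct. When that assumption fails, the pseudo-labels reinforce early mistakes, which is exactly the caveat discussed next.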
However, it is worth pointing out that blindly selecting a semi-supervised learning method
for a specific task will not necessarily improve performance over supervised learning. In fact, unla-
beled data can lead to worse performance with the wrong link assumptions. The following example
demonstrates this sensitivity to model assumptions by comparing supervised learning performance
with several semi-supervised learning approaches on a simple classification problem. Don't worry if
these approaches appear mysterious; we will explain how they work in detail in the rest of the book.
For now, the main point is that semi-supervised learning performance depends on the correctness
of the assumptions made by the model in question.
Example 2.3. Consider a classification task where there are two classes, each with a Gaussian
distribution. The two Gaussian distributions heavily overlap (top panel of Figure 2.2). The true
decision boundary lies in the middle of the two distributions, shown as a dotted line. Since we know
the true distributions, we can compute test sample error rates based on the probability mass of each
Gaussian that falls on the incorrect side of the decision boundary. Due to the overlapping class
distributions, the optimal error rate (i.e., the Bayes error) is 21.2%.
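The error-rate computation just described can be carried out in a few lines. The specific parameters below (unit-variance Gaussians with means at ±0.8 and equal class priors) are assumptions: the text does not state them, but they reproduce the stated 21.2% Bayes error and a true boundary at 0.

```python
from scipy.stats import norm

mu = 0.8          # assumed class means at -mu and +mu (not stated in the text)
boundary = 0.0    # true decision boundary: midpoint of the two means

# Probability mass of each Gaussian falling on the wrong side of the boundary;
# with equal priors, the Bayes error is the average of the two masses.
err_pos = norm.cdf(boundary, loc=+mu, scale=1.0)        # class +1 mass below boundary
err_neg = 1.0 - norm.cdf(boundary, loc=-mu, scale=1.0)  # class -1 mass above boundary
print(f"Bayes error: {0.5 * (err_pos + err_neg):.1%}")  # -> 21.2%
```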
For supervised learning, the learned decision boundary is in the middle of the two labeled
instances, and the unlabeled instances are ignored. See, for example, the thick solid line in the second
panel of Figure 2.2. We note that it lies away from the true decision boundary because the two labeled
instances are randomly sampled. If we were to draw two other labeled instances, the learned decision
boundary would change, but most likely would still be off (see other panels of Figure 2.2). On average,
the expected learned decision boundary will coincide with the true boundary, but for any given draw
of labeled data it will be off quite a bit. We say that the learned boundary has high variance. To
evaluate supervised learning, and the semi-supervised learning methods introduced below, we drew
1000 training samples, each with one labeled and 99 unlabeled instances per class. Whereas the optimal decision boundary achieves the Bayes error of 21.2%, the decision boundaries found by supervised learning have an average test sample error rate of 31.6%. The average learned boundary lies at 0.02 (compared to the optimal boundary at 0), but has a standard deviation of 0.72.
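This high-variance behavior is easy to reproduce in simulation. Under the same assumed parameters as above (means ±0.8, unit variance), the midpoint of one labeled instance per class has standard deviation √2/2 ≈ 0.71, close to the reported 0.72; the exact error figures depend on parameters and test-sampling details the text does not give, so the numbers below will only roughly match the reported ones.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
mu, trials = 0.8, 1000          # assumed means +/- mu; 1000 draws as in the text

# Supervised boundary: midpoint of the single labeled instance from each class.
x_neg = rng.normal(-mu, 1.0, size=trials)
x_pos = rng.normal(+mu, 1.0, size=trials)
boundaries = (x_neg + x_pos) / 2.0

# True test error of a threshold b: mass of each Gaussian on the wrong side.
def error_rate(b):
    return 0.5 * (norm.cdf(b, +mu, 1.0) + (1.0 - norm.cdf(b, -mu, 1.0)))

print(f"mean boundary: {boundaries.mean():+.2f}")  # near 0: unbiased on average
print(f"std of boundary: {boundaries.std():.2f}")  # ~0.71: high variance
print(f"avg error: {np.mean([error_rate(b) for b in boundaries]):.1%}")
```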
Now, without presenting the details, we show the learned decision boundaries of three semi-supervised learning models on the training data. These models will be presented in detail in later chapters.