Semi-supervised learning is attractive because it can potentially utilize both labeled and un-
labeled data to achieve better performance than supervised learning. From a different perspective,
semi-supervised learning may achieve the same level of performance as supervised learning, but with
fewer labeled instances. This reduces the annotation effort, which leads to reduced cost. We will
present several computational models in Chapters 3, 4, 5, and 6.
Semi-supervised learning also provides a computational model of how humans learn from
labeled and unlabeled data. Consider the task of concept learning in children, which is similar to
classification: an instance x is an object (e.g., an animal), and the label y is the corresponding concept
(e.g., dog). Young children receive labeled data from teachers (e.g., Daddy points to a brown animal
and says “dog!”). But more often they observe various animals by themselves without receiving
explicit labels. It seems self-evident that children are able to combine labeled and unlabeled data
to facilitate concept learning. The study of semi-supervised learning is therefore an opportunity to
bridge machine learning and human learning. We will discuss some recent studies in Chapter 7.
2.2 HOW IS SEMI-SUPERVISED LEARNING POSSIBLE?
At first glance, it might seem paradoxical that one can learn anything about a predictor f : X → Y from unlabeled data. After all, f is about the mapping from instance x to label y, yet unlabeled data does not provide any examples of such a mapping. The answer lies in the assumptions one makes about the link between the distribution of unlabeled data P(x) and the target label.
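One standard way to make this link concrete (a general formalization, not tied to any particular chapter) is the mixture view: the marginal can be written as P(x) = Σ_y P(y) p(x|y). Unlabeled data constrain the shape of this mixture; under assumptions such as Gaussian class-conditional densities p(x|y), they reveal where the components lie, and a few labeled instances are then enough to attach a label y to each component.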
Figure 2.1 shows a simple example of semi-supervised learning. Let each instance be represented by a one-dimensional feature x ∈ R. There are two classes: positive and negative. Consider the following two scenarios:
1. In supervised learning, we are given only two labeled training instances (x1, y1) = (−1, −) and (x2, y2) = (1, +), shown as the red and blue symbols in the figure, respectively. The best estimate of the decision boundary is obviously x = 0: all instances with x < 0 should be classified as y = −, while those with x ≥ 0 as y = +.
2. In addition, we are also given a large number of unlabeled instances, shown as green dots in
the figure. The correct class labels for these unlabeled examples are unknown. However, we
observe that they form two groups. Under the assumption that instances in each class form a
coherent group (e.g., p(x|y) is a Gaussian distribution, such that the instances from each class center around a central mean), this unlabeled data gives us more information. Specifically, it seems that the two labeled instances are not the most prototypical examples for the classes. Our semi-supervised estimate of the decision boundary should be between the two groups instead, at x ≈ 0.4.
If our assumption is true, then using both labeled and unlabeled data gives us a more reliable
estimate of the decision boundary. Intuitively, the distribution of unlabeled data helps to identify
regions with the same label, and the few labeled instances then provide the actual labels. In this book, we will introduce a few other commonly used semi-supervised learning assumptions.
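To make the intuition above concrete, here is a minimal sketch of the one-dimensional example, assuming Gaussian classes with a shared variance. The unlabeled data, group means, spreads, and sample sizes are synthetic and chosen only for illustration (they are not taken from Figure 2.1), and the use of scikit-learn's GaussianMixture is simply one convenient way to fit the two-component mixture.

```python
# A minimal sketch of the 1-D example above, assuming Gaussian classes with a
# shared variance. The unlabeled data below are synthetic and illustrative.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Two labeled instances, as in scenario 1: (x1, y1) = (-1, -), (x2, y2) = (1, +).
x_labeled = np.array([-1.0, 1.0])
y_labeled = np.array([-1, +1])

# A large number of unlabeled instances forming two coherent groups
# (hypothetical means and spreads).
x_unlabeled = np.concatenate([
    rng.normal(-1.5, 0.4, size=200),   # one group
    rng.normal(2.3, 0.4, size=200),    # the other group
])

# Supervised estimate: with only the two labeled points, the natural boundary
# is their midpoint, x = 0.
boundary_supervised = x_labeled.mean()

# Semi-supervised estimate: fit p(x) as a two-component Gaussian mixture to all
# instances (labels ignored), so the unlabeled data determine the two groups.
X = np.concatenate([x_labeled, x_unlabeled]).reshape(-1, 1)
gmm = GaussianMixture(n_components=2, covariance_type="tied", random_state=0).fit(X)
means = gmm.means_.ravel()

# The labeled points then tell us which component is the positive class.
pos = np.argmin(np.abs(means - x_labeled[y_labeled == +1]))
print(f"positive class centered near x = {means[pos]:.2f}")

# With (roughly) equal priors and a shared variance, the decision boundary is
# approximately the midpoint of the two component means, between the groups.
boundary_semisupervised = means.mean()
print(f"supervised boundary:      x = {boundary_supervised:.2f}")
print(f"semi-supervised boundary: x = {boundary_semisupervised:.2f}")
```

With these made-up group locations, the supervised midpoint sits at x = 0 while the mixture boundary lands near x ≈ 0.4, mirroring the shift described in the two scenarios above: the unlabeled data locate the groups, and the labeled points only decide which group is which class.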