Semi-supervised learning is attractive because it can potentially utilize both labeled and un-
labeled data to achieve better performance than supervised learning. From a different perspective,
semi-supervised learning may achieve the same level of performance as supervised learning, but with
fewer labeled instances. This reduces the annotation effort, which leads to reduced cost. We will
present several computational models in Chapters 3, 4, 5, and 6.
Semi-supervised learning also provides a computational model of how humans learn from
labeled and unlabeled data. Consider the task of concept learning in children, which is similar to
classification: an instance x is an object (e.g., an animal), and the label y is the corresponding concept
(e.g., dog). Young children receive labeled data from teachers (e.g., Daddy points to a brown animal
and says “dog!”). But more often they observe various animals by themselves without receiving
explicit labels. It seems self-evident that children are able to combine labeled and unlabeled data
to facilitate concept learning. The study of semi-supervised learning is therefore an opportunity to
bridge machine learning and human learning. We will discuss some recent studies in Chapter 7.
2.2 HOW IS SEMI-SUPERVISED LEARNING POSSIBLE?
At first glance, it might seem paradoxical that one can learn anything about a predictor f : X → Y from unlabeled data. After all, f is about the mapping from instance x to label y, yet unlabeled data does not provide any examples of such a mapping. The answer lies in the assumptions one makes about the link between the distribution of unlabeled data P(x) and the target label.
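One standard way to make this link concrete (a general formalization, not tied to any particular chapter) is the mixture view: the marginal can be written as P(x) = Σ_y P(y) p(x|y). Unlabeled data constrain the shape of this mixture; under assumptions such as Gaussian class-conditional densities p(x|y), they reveal where the components lie, and a few labeled instances are then enough to attach a label y to each component.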
Figure 2.1 shows a simple example of semi-supervised learning. Let each instance be represented by a one-dimensional feature x ∈ R. There are two classes: positive and negative. Consider the following two scenarios:
1. In supervised learning, we are given only two labeled training instances (x1, y1) = (−1, −) and (x2, y2) = (1, +), shown as the red and blue symbols in the figure, respectively. The best estimate of the decision boundary is obviously x = 0: all instances with x < 0 should be classified as y = −, while those with x ≥ 0 as y = +.
2. In addition, we are also given a large number of unlabeled instances, shown as green dots in
the figure. The correct class labels for these unlabeled examples are unknown. However, we
observe that they form two groups. Under the assumption that instances in each class form a
coherent group (e.g., p(x|y) is a Gaussian distribution, such that the instances from each class center around a central mean), this unlabeled data gives us more information. Specifically, it seems that the two labeled instances are not the most prototypical examples for the classes. Our semi-supervised estimate of the decision boundary should be between the two groups instead, at x ≈ 0.4.
If our assumption is true, then using both labeled and unlabeled data gives us a more reliable
estimate of the decision boundary. Intuitively, the distribution of unlabeled data helps to identify
regions with the same label, and the few labeled instances then provide the actual labels. In this book, we will introduce a few other commonly used semi-supervised learning assumptions.
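To make the intuition above concrete, here is a minimal sketch of the one-dimensional example, assuming Gaussian classes with a shared variance. The unlabeled data, group means, spreads, and sample sizes are synthetic and chosen only for illustration (they are not taken from Figure 2.1), and the use of scikit-learn's GaussianMixture is simply one convenient way to fit the two-component mixture.

```python
# A minimal sketch of the 1-D example above, assuming Gaussian classes with a
# shared variance. The unlabeled data below are synthetic and illustrative.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Two labeled instances, as in scenario 1: (x1, y1) = (-1, -), (x2, y2) = (1, +).
x_labeled = np.array([-1.0, 1.0])
y_labeled = np.array([-1, +1])

# A large number of unlabeled instances forming two coherent groups
# (hypothetical means and spreads).
x_unlabeled = np.concatenate([
    rng.normal(-1.5, 0.4, size=200),   # one group
    rng.normal(2.3, 0.4, size=200),    # the other group
])

# Supervised estimate: with only the two labeled points, the natural boundary
# is their midpoint, x = 0.
boundary_supervised = x_labeled.mean()

# Semi-supervised estimate: fit p(x) as a two-component Gaussian mixture to all
# instances (labels ignored), so the unlabeled data determine the two groups.
X = np.concatenate([x_labeled, x_unlabeled]).reshape(-1, 1)
gmm = GaussianMixture(n_components=2, covariance_type="tied", random_state=0).fit(X)
means = gmm.means_.ravel()

# The labeled points then tell us which component is the positive class.
pos = np.argmin(np.abs(means - x_labeled[y_labeled == +1]))
print(f"positive class centered near x = {means[pos]:.2f}")

# With (roughly) equal priors and a shared variance, the decision boundary is
# approximately the midpoint of the two component means, between the groups.
boundary_semisupervised = means.mean()
print(f"supervised boundary:      x = {boundary_supervised:.2f}")
print(f"semi-supervised boundary: x = {boundary_semisupervised:.2f}")
```

With these made-up group locations, the supervised midpoint sits at x = 0 while the mixture boundary lands near x ≈ 0.4, mirroring the shift described in the two scenarios above: the unlabeled data locate the groups, and the labeled points only decide which group is which class.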