CHAPTER 2
Overview of Semi-Supervised Learning
2.1 LEARNING FROM BOTH LABELED AND UNLABELED DATA
As the name suggests, semi-supervised learning is somewhere between unsupervised and supervised
learning. In fact, most semi-supervised learning strategies are based on extending either unsupervised
or supervised learning to include additional information typical of the other learning paradigm.
Specifically, semi-supervised learning encompasses several different settings, including:
• Semi-supervised classification. Also known as classification with labeled and unlabeled data (or
partially labeled data), this is an extension to the supervised classification problem. The training
data consists of both l labeled instances $\{(x_i, y_i)\}_{i=1}^{l}$ and u unlabeled instances
$\{x_j\}_{j=l+1}^{l+u}$. One typically assumes that there is much more unlabeled data than labeled
data, i.e., $u \gg l$. The goal of semi-supervised classification is to train a classifier f from
both the labeled and unlabeled data, such that it is better than the supervised classifier trained
on the labeled data alone.
• Constrained clustering. This is an extension to unsupervised clustering. The training data
consists of unlabeled instances $\{x_i\}_{i=1}^{n}$, as well as some “supervised information” about
the clusters. For example, such information can be so-called must-link constraints, that two
instances $x_i$, $x_j$ must be in the same cluster; and cannot-link constraints, that $x_i$, $x_j$
cannot be in the same cluster. One can also constrain the size of the clusters. The goal of
constrained clustering is to obtain better clustering than the clustering from unlabeled data alone.
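The two settings above can be made concrete with a small sketch. The toy data, the variable names, and the helper `violated_constraints` are all hypothetical and chosen only for illustration; the code shows how the inputs of each setting might be represented, not how any particular algorithm solves them.

```python
from typing import List, Sequence, Tuple

# Semi-supervised classification: l labeled pairs (x_i, y_i) and
# u unlabeled instances x_j. Toy values; in practice u is much larger than l.
labeled: List[Tuple[List[float], int]] = [([0.0, 0.1], 0), ([1.0, 0.9], 1)]
unlabeled: List[List[float]] = [
    [0.1, 0.0], [0.2, 0.2], [0.9, 1.0], [0.8, 0.7], [1.1, 1.0],
]
l, u = len(labeled), len(unlabeled)

# Constrained clustering: n unlabeled instances plus pairwise constraints,
# written here as pairs of instance indices (i, j).
Pair = Tuple[int, int]


def violated_constraints(
    assignment: Sequence[int],
    must_link: Sequence[Pair],
    cannot_link: Sequence[Pair],
) -> List[Tuple[str, int, int]]:
    """Return every constraint that the given cluster assignment breaks.

    assignment[i] is the cluster index assigned to instance x_i.
    """
    violations = []
    for i, j in must_link:
        # Must-link: x_i and x_j have to share a cluster.
        if assignment[i] != assignment[j]:
            violations.append(("must-link", i, j))
    for i, j in cannot_link:
        # Cannot-link: x_i and x_j have to be in different clusters.
        if assignment[i] == assignment[j]:
            violations.append(("cannot-link", i, j))
    return violations


must_link = [(0, 1)]
cannot_link = [(0, 2)]
assignment_ok = [0, 0, 1, 1, 1]   # satisfies both constraints
assignment_bad = [0, 1, 0, 1, 1]  # breaks both constraints
```

A constrained clustering method would search for an assignment for which `violated_constraints` returns an empty list (or, in softer formulations, as few violations as possible) while still grouping similar instances together.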
There are other semi-supervised learning settings, including regression with labeled and
unlabeled data, dimensionality reduction with labeled instances whose reduced feature representation
is given, and so on. This topic will focus on semi-supervised classification.
The study of semi-supervised learning is motivated by two factors: its practical value in building
better computer algorithms, and its theoretical value in understanding learning in machines and
humans.
Semi-supervised learning has tremendous practical value. In many tasks, there is a paucity of
labeled data. The labels y may be difficult to obtain because they require human annotators, special
devices, or expensive and slow experiments. For example,
• In speech recognition, an instance x is a speech utterance, and the label y is the corresponding
transcript. For example, here are some detailed phonetic transcripts of words as they are spoken: