Overview of Semi-Supervised Learning - Introduction to Semi-Supervised Learning


film ⇒ f ih_n uh_gl_n m

be all ⇒ bcl b iy iy_tr ao_tr ao l_dl

Accurate transcription by human expert annotators can be extremely time-consuming: it took as long as 400 hours to transcribe 1 hour of speech at the phonetic level for the Switchboard telephone conversational speech data [71] (recordings of randomly paired participants discussing various topics such as social, economic, political, and environmental issues).

In natural language parsing, an instance x is a sentence, and the label y is the corresponding parse tree. An example parse tree for the Chinese sentence “The National Track and Field Championship has finished.” is shown below.

The training data, consisting of (sentence, parse tree) pairs, is known as a treebank. Treebanks are time-consuming to construct and require the expertise of linguists: for a mere 4000 sentences in the Penn Chinese Treebank, experts took two years to manually create the corresponding parse trees.

In spam filtering, an instance x is an email, and the label y is the user's judgment (spam or ham). In this situation, the bottleneck is an average user's patience to label a large number of emails.

In video surveillance, an instance x is a video frame, and the label y is the identity of the object in the video. Manually labeling the objects in a large number of surveillance video frames is tedious and time-consuming.

In protein 3D structure prediction, an instance x is a DNA sequence, and the label y is the 3D protein folding structure. It can take months of expensive laboratory work by expert crystallographers to identify the 3D structure of a single protein.

While labeled data (x, y) is difficult to obtain in these domains, unlabeled data x is available in large quantity and easy to collect: speech utterances can be recorded from radio broadcasts; text sentences can be crawled from the World Wide Web; emails are sitting on the mail server; surveillance cameras run 24 hours a day; and DNA sequences of proteins are readily available from gene databases. However, traditional supervised learning methods cannot use unlabeled data in training classifiers.
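
To make the last point concrete, the following is a minimal sketch, not taken from the book, of a purely supervised learner. It assumes a hypothetical one-dimensional spam-filtering feature and uses a nearest-centroid classifier chosen only for brevity; the names `fit_centroids`, `predict`, and the data values are illustrative assumptions. The training function accepts only (x, y) pairs, so the pool of unlabeled instances has no way to influence the classifier.

```python
from collections import defaultdict


def fit_centroids(labeled):
    """Train a nearest-centroid classifier from (x, y) pairs only."""
    sums = defaultdict(lambda: [0.0, 0])  # label -> [sum of x, count]
    for x, y in labeled:
        sums[y][0] += x
        sums[y][1] += 1
    return {y: total / count for y, (total, count) in sums.items()}


def predict(centroids, x):
    """Return the label whose centroid is closest to x."""
    return min(centroids, key=lambda y: abs(x - centroids[y]))


# Hypothetical data: a handful of labeled emails and a larger unlabeled pool.
labeled = [(1.0, "ham"), (2.0, "ham"), (8.0, "spam"), (9.0, "spam")]
unlabeled = [0.5, 1.5, 2.5, 7.5, 8.5, 9.5, 10.0]

# Only the labeled pairs enter training; `unlabeled` is never used.
centroids = fit_centroids(labeled)
```

Calling `predict(centroids, 3.0)` classifies a new instance using the two class centroids alone; adding or removing items in `unlabeled` leaves the result unchanged.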
