Biomedical Engineering Reference
Feedback loops, together with a mechanism capable of responding to feedback, enable two types of
machine learning: supervised and unsupervised. In supervised learning, the system is trained with a
set of examples, called the training set, in which each input is associated with a specific target output.
For example, a specific amino acid sequence on the input can be associated with the name of a
protein on the output. The performance of a supervised learning system can be evaluated by
presenting it with a testing set whose correct outputs are known and that is similar to the training set.
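The train-then-test cycle can be sketched in Python as follows. This is only an illustration: the sequences, the protein names, and the nearest-neighbor rule based on shared two-letter fragments are all hypothetical and are not taken from the text.

```python
def kmers(seq, k=2):
    """Return the set of length-k fragments found in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def similarity(a, b):
    """Jaccard similarity between the fragment sets of two sequences."""
    ka, kb = kmers(a), kmers(b)
    union = ka | kb
    return len(ka & kb) / len(union) if union else 0.0


def predict(training_set, seq):
    """Return the output associated with the most similar training input."""
    _, label = max(training_set, key=lambda ex: similarity(ex[0], seq))
    return label


def accuracy(training_set, testing_set):
    """Fraction of testing inputs whose predicted output matches the known output."""
    correct = sum(predict(training_set, seq) == label for seq, label in testing_set)
    return correct / len(testing_set)


# Hypothetical training set: (amino acid sequence, protein name).
training_set = [
    ("MKTAYIAK", "protein_A"),
    ("MKTAYLAK", "protein_A"),
    ("GSHWWQPL", "protein_B"),
    ("GSHWFQPL", "protein_B"),
]

# Hypothetical testing set: similar to the training set, with known outputs.
testing_set = [
    ("MKTAYIAR", "protein_A"),
    ("GSHWWQPV", "protein_B"),
]

print(accuracy(training_set, testing_set))
```

Because the correct output is known for every testing input, the accuracy figure is a direct measure of how well the trained system performs on data it has not seen.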
In unsupervised learning, there is no specific output associated with a given input, and the system
must invent new categories and ways to classify the input data. In machine learning systems based
on unsupervised learning, it isn't known a priori whether the input data contains a biologically
significant pattern, where it is, or even what it looks like.
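As a rough illustration of a system inventing its own categories, the sketch below groups unlabeled numbers with a one-dimensional k-means procedure. The choice of k-means, the function name k_means_1d, and the measurement values are assumptions made for this example; the text does not prescribe a particular method.

```python
def k_means_1d(values, k=2, iterations=20):
    """Group numbers into k clusters without using any labels."""
    ordered = sorted(values)
    n = len(ordered)
    # Deterministic start: take initial centers from evenly spaced positions.
    centers = [ordered[(2 * i + 1) * n // (2 * k)] for i in range(k)]
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda c: abs(v - centers[c]))
            clusters[nearest].append(v)
        # Move each center to the mean of its cluster (keep it if the cluster is empty).
        centers = [
            sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)
        ]
    return clusters


# Hypothetical unlabeled measurements; no output is attached to any input.
measurements = [1.0, 1.2, 0.9, 5.1, 4.8, 5.3]
clusters = k_means_1d(measurements, k=2)
print(clusters)
```

The procedure returns groups, but nothing in it says whether those groups are biologically significant; that judgment still has to come from outside the algorithm.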
One of the key issues in supervised learning is that the training set must be sufficiently large relative
to the number of categories or different outputs provided by the machine learning system. When
the system recognizes too many categories or patterns relative to the amount of training data, it is
said to be overfitted to the training data. That is, overfitting is the process of assigning undue
importance to random variations in the data.
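The effect can be seen in a small Python sketch that compares an extreme overfitted model (one category per training value) with a single-threshold model. Both models and all data values are hypothetical and chosen only to make the contrast visible.

```python
# Hypothetical measurements with known categories.
train = [(1.1, "low"), (0.8, "low"), (1.4, "low"),
         (4.9, "high"), (5.2, "high"), (4.6, "high")]
test = [(1.0, "low"), (1.3, "low"), (5.0, "high"), (4.7, "high")]


def fit_memorizer(examples):
    """Overfitted model: one rule per training value, so noise is treated as signal."""
    table = dict(examples)
    return lambda x: table.get(x, "unknown")


def fit_threshold(examples):
    """Simpler model: a single cut-off halfway between the two category means."""
    lows = [x for x, y in examples if y == "low"]
    highs = [x for x, y in examples if y == "high"]
    cut = (sum(lows) / len(lows) + sum(highs) / len(highs)) / 2
    return lambda x: "low" if x < cut else "high"


def accuracy(model, examples):
    """Fraction of examples for which the model returns the known category."""
    return sum(model(x) == y for x, y in examples) / len(examples)


memorizer = fit_memorizer(train)
threshold = fit_threshold(train)

print("memorizer:", accuracy(memorizer, train), accuracy(memorizer, test))
print("threshold:", accuracy(threshold, train), accuracy(threshold, test))
```

Both models score perfectly on the training set, but only the threshold model carries over to the testing set, because the memorizer has learned the exact random variation in each training value rather than the underlying pattern.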
Whether supervised or unsupervised, the machine learning process requires bias. It isn't enough to
simply open a database up to a machine learning algorithm and sit back while it automatically
discovers all of the interrelationships in the data. Bias is created in a machine learning system by
placing constraints on the data that can be examined, by using different underlying models, and by
altering the machine learning system goals. Bias can increase the efficiency of the machine learning
process and provide more meaningful results. For example, the process can probably ignore a
correlation between the time of day a sample was evaluated and gene expression in a microarray. In
practice, the bias can be a single heuristic, such as preferring the single, simplest rule that explains
the data to a more complex solution. This "simplest solution" bias is often used with machine learning
approaches to mining nucleotide sequence data.
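A "simplest solution" bias can be sketched as a filter over candidate rules: keep the rules that explain every example, then prefer the one with the fewest conditions. The sequences, labels, and candidate rules below are hypothetical, and counting conditions is just one assumed way to measure simplicity.

```python
# Hypothetical labelled nucleotide sequences: True means "matches the pattern of interest".
examples = [
    ("ATGCCC", True),
    ("ATGAAA", True),
    ("CCCATG", False),
    ("GGGTTT", False),
]

# Each candidate rule: (description, number of conditions, predicate).
candidate_rules = [
    ("contains ATG", 1, lambda s: "ATG" in s),
    ("starts with ATG and ends with C or A", 2,
     lambda s: s.startswith("ATG") and s[-1] in "CA"),
    ("starts with ATG", 1, lambda s: s.startswith("ATG")),
]


def explains(rule, data):
    """True if the rule reproduces the label of every example."""
    _, _, predicate = rule
    return all(predicate(seq) == label for seq, label in data)


def simplest_rule(rules, data):
    """Among rules consistent with the data, return the one with the fewest conditions."""
    consistent = [r for r in rules if explains(r, data)]
    return min(consistent, key=lambda r: r[1]) if consistent else None


chosen = simplest_rule(candidate_rules, examples)
print(chosen[0])
```

Two of the three rules are consistent with the examples; the bias toward simplicity is what selects between them, since the data alone cannot.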
Inductive Logic Programming
Inductive logic programming uses a set of rules or heuristics to categorize data. A common heuristic
is to use the change in entropy to iteratively choose the attribute on which to split the data into
subsets. That is, an entropy-based classification system based on an induction algorithm works by
incrementally dividing the data into the largest possible spaces until all data has been assigned to a
collection.
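One way to make the entropy heuristic concrete is the sketch below, which scores each attribute by how much it reduces Shannon entropy of a target category and picks the best one. The text does not name a specific algorithm, so the use of Shannon entropy with a weighted average over subsets (as in ID3-style induction) is an assumption, and the records and attribute names are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy, in bits, of a list of category labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def split(records, attribute):
    """Subset the records according to their value for one attribute."""
    groups = {}
    for record in records:
        groups.setdefault(record[attribute], []).append(record)
    return groups


def entropy_change(records, attribute, target):
    """Entropy of the target before the split minus the weighted entropy after it."""
    before = entropy([r[target] for r in records])
    after = sum(
        len(group) / len(records) * entropy([r[target] for r in group])
        for group in split(records, attribute).values()
    )
    return before - after


def best_attribute(records, attributes, target):
    """Choose the attribute whose split gives the largest change in entropy."""
    return max(attributes, key=lambda a: entropy_change(records, a, target))


# Hypothetical records: two attributes and a target category.
records = [
    {"motif": "yes", "gc": "high", "class": "coding"},
    {"motif": "yes", "gc": "low", "class": "coding"},
    {"motif": "no", "gc": "high", "class": "noncoding"},
    {"motif": "no", "gc": "low", "class": "noncoding"},
]

print(best_attribute(records, ["motif", "gc"], "class"))
```

Applying best_attribute again to each resulting subset, with the chosen attribute removed, gives the iterative subdivision described above.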
Consider the scenario depicted in Figure 7-7, in which the data to be classified consists of 30 shapes:
20 circles and 10 squares, of which 16 are white and 14 are black. With two dimensions to
compare—shape and color—an entropy-based inductive classifier bifurcates the space first according
to color because it provides the maximum change in entropy, resulting in one group of 14 black
circles and squares and one group of 16 white circles and squares. After dividing the space by color,
it's further subdivided by shape, as shown in the figure.
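The numbers in this example can be checked with a short calculation. The sketch below assumes that "change in entropy" here refers to the Shannon entropy of the two groups an attribute produces, so that the more evenly balanced split scores higher. Only the group sizes stated in the text are used; the joint counts (for example, how many black circles there are) are not given and are not assumed.

```python
from math import log2


def partition_entropy(group_sizes):
    """Shannon entropy, in bits, of a partition described by its group sizes."""
    total = sum(group_sizes)
    return -sum((n / total) * log2(n / total) for n in group_sizes if n)


shape_split = partition_entropy([20, 10])  # 20 circles, 10 squares
color_split = partition_entropy([16, 14])  # 16 white, 14 black

print(f"shape: {shape_split:.3f} bits")
print(f"color: {color_split:.3f} bits")
```

Under this reading, the color split (about 0.997 bits) scores higher than the shape split (about 0.918 bits), which is consistent with the classifier dividing by color first.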