Data Mining - Bioinformatics Computing

The feedback loops and a mechanism capable of responding to feedback enable two types of machine

learning: supervised and unsupervised. In supervised learning, the system is trained with a set of

examples, called the training set. The goals are specific outputs that are associated with each input.

For example, a specific amino acid sequence on the input can be associated with the name of a

protein on the output. The performance of a supervised learning system can be evaluated by

presenting the system with a known testing set that is similar to the training set.
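
As a minimal sketch of this train-then-test cycle, the Python below stores labeled sequences as a training set and scores a nearest-neighbor classifier on a separate testing set. The sequences, the labels protein_A and protein_B, and all function names are invented for illustration; comparing amino acid composition is only one of many possible learners and is not taken from the text.

```python
from collections import Counter

def composition(seq):
    """Fraction of each amino acid letter in the sequence."""
    counts = Counter(seq)
    return {aa: n / len(seq) for aa, n in counts.items()}

def distance(a, b):
    """Manhattan distance between two composition dictionaries."""
    return sum(abs(a.get(k, 0.0) - b.get(k, 0.0)) for k in set(a) | set(b))

def train(training_set):
    """'Training' here just stores each input's features with its known output."""
    return [(composition(seq), label) for seq, label in training_set]

def predict(model, seq):
    """Return the label of the stored example closest to seq."""
    features = composition(seq)
    return min(model, key=lambda item: distance(item[0], features))[1]

def accuracy(model, testing_set):
    """Fraction of testing examples whose predicted label matches the known one."""
    correct = sum(predict(model, seq) == label for seq, label in testing_set)
    return correct / len(testing_set)

# Hypothetical toy data: (sequence, protein name) pairs.
training_set = [
    ("GGGAGGGSGG", "protein_A"),
    ("GAGGGGSGGG", "protein_A"),
    ("LLKLLELLKK", "protein_B"),
    ("KLLELLKLLL", "protein_B"),
]
testing_set = [
    ("GGSGGGAGGG", "protein_A"),
    ("LLELLKKLLL", "protein_B"),
]

model = train(training_set)
test_accuracy = accuracy(model, testing_set)
print(test_accuracy)
```

The testing set is kept out of train() so that the score reflects inputs the system has not already seen.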

In unsupervised learning, there is no specific output associated with a given input, and the system

must invent new categories and ways to classify the input data. In machine learning systems based

on unsupervised learning, it isn't known a priori whether the input data contains a biologically

significant pattern, where it is, or even what it looks like.
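
To illustrate a system inventing its own categories, the sketch below groups unlabeled points by proximity alone. The data points (imagined as two measurements per sample), the distance threshold, and the single-linkage grouping rule are all assumptions chosen for illustration; the text does not prescribe a particular clustering method.

```python
def distance(a, b):
    """Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def cluster(points, threshold):
    """Single-linkage grouping: a point joins every cluster it is close to,
    and clusters that share a close point are merged into one."""
    clusters = []
    for p in points:
        near, far = [], []
        for c in clusters:
            is_near = any(distance(p, q) < threshold for q in c)
            (near if is_near else far).append(c)
        merged = [p] + [q for c in near for q in c]
        clusters = far + [merged]
    return clusters

# Hypothetical unlabeled samples; no output category is supplied.
points = [(1.0, 1.2), (0.9, 1.0), (5.0, 5.1), (5.2, 4.9), (1.1, 0.8)]
groups = cluster(points, threshold=1.0)
print(len(groups), sorted(len(g) for g in groups))
```

Nothing in the input says how many groups exist or what they mean; whether the groups found are biologically significant still has to be judged separately.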

One of the key issues in supervised learning is that the training set must be sufficiently large relative

to the number of categories or different outputs provided by the machine learning system. When

there are too many categories or recognized patterns that are consistent with the input data, the

model is said to overfit the training data. That is, overfitting is the process of assigning undue importance

to random variations in the data.
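
Overfitting can be demonstrated with a deliberately extreme, hypothetical learner: one that creates a separate category for every training example. In the sketch below the labels are coin flips, so there is no real pattern to find; the integer inputs, set sizes, and fallback rule are assumptions made for illustration.

```python
import random
from collections import Counter

random.seed(0)  # fixed seed so repeated runs behave the same

# 400 distinct hypothetical inputs; each integer stands in for a feature vector.
inputs = random.sample(range(2**20), 400)
# Labels are pure noise, independent of the inputs.
examples = [(x, random.randint(0, 1)) for x in inputs]
training_set, testing_set = examples[:200], examples[200:]

# The "model" memorizes every training example exactly.
memory = dict(training_set)
# For inputs it has never seen, it falls back to the most common training label.
fallback = Counter(label for _, label in training_set).most_common(1)[0][0]

def predict(x):
    return memory.get(x, fallback)

def accuracy(dataset):
    return sum(predict(x) == y for x, y in dataset) / len(dataset)

train_accuracy = accuracy(training_set)
test_accuracy = accuracy(testing_set)
print(train_accuracy, test_accuracy)
```

Training accuracy is 1.0 by construction, while accuracy on the testing set should stay near chance: the memorized "patterns" were only random variation.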

Whether supervised or unsupervised, the machine learning process requires bias. It isn't enough to

simply open a database up to a machine learning algorithm and sit back while it automatically

discovers all of the interrelationships in the data. Bias is created in a machine learning system by

placing constraints on the data that can be examined, by using different underlying models, and by

altering the machine learning system goals. Bias can increase the efficiency of the machine learning

process and provide more meaningful results. For example, the process can probably ignore a

correlation between the time of day a sample was evaluated and gene expression in a microarray. In

practice, the bias can be a single heuristic, such as preferring the single, simplest rule that explains

the data to a more complex solution. This "simplest solution" bias is often used with machine learning

approaches to mining nucleotide sequence data.
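
Both kinds of bias mentioned above, constraining which data is examined and preferring the simplest rule, can be sketched in a few lines. The attribute names (gene_x, gene_y, time_of_day), the samples, and the rule representation are hypothetical; this is one possible way to encode the heuristic, not a method given in the text.

```python
from itertools import combinations, product

def consistent(rule, examples):
    """A rule maps attribute -> required value and predicts True only when
    every condition holds. It is consistent if it matches every known label."""
    return all(
        all(sample[a] == v for a, v in rule.items()) == label
        for sample, label in examples
    )

def simplest_rule(examples, ignored=()):
    """Return the rule with the fewest conditions that explains all examples."""
    attributes = sorted(a for a in examples[0][0] if a not in ignored)
    values = {a: sorted({s[a] for s, _ in examples}) for a in attributes}
    # Try one-condition rules first, then two-condition rules, and so on.
    for size in range(1, len(attributes) + 1):
        for attrs in combinations(attributes, size):
            for vals in product(*(values[a] for a in attrs)):
                rule = dict(zip(attrs, vals))
                if consistent(rule, examples):
                    return rule
    return None

# Hypothetical microarray samples; the label marks samples of interest.
examples = [
    ({"gene_x": "high", "gene_y": "low", "time_of_day": "am"}, True),
    ({"gene_x": "high", "gene_y": "high", "time_of_day": "pm"}, True),
    ({"gene_x": "low", "gene_y": "high", "time_of_day": "am"}, False),
    ({"gene_x": "low", "gene_y": "low", "time_of_day": "pm"}, False),
]

rule = simplest_rule(examples, ignored=("time_of_day",))
print(rule)
```

Excluding time_of_day shrinks the search space, and stopping at the first rule that fits means longer rules are never considered once a shorter one explains the data.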

Inductive Logic Programming

Inductive logic programming uses a set of rules or heuristics to categorize data. A common heuristic

is to use the change in entropy to iteratively choose the attribute on which the data is split into

subsets. That is, an entropy-based classification system based on an induction

algorithm works by incrementally dividing the data into the largest possible spaces until all data has

been assigned to a collection.
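
One common realization of this entropy heuristic is decision-tree induction in the style of ID3, sketched below. It assumes each record carries a known class label (here a hypothetical "family" field) and measures the change in entropy as information gain; the records, attribute names, and tree representation are illustrative and may differ from the system the text has in mind.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Entropy of the target before the split minus the weighted entropy after."""
    before = entropy([r[target] for r in rows])
    after = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        after += len(subset) / len(rows) * entropy(subset)
    return before - after

def build_tree(rows, attributes, target):
    """Recursively split on the attribute with the largest information gain."""
    labels = [r[target] for r in rows]
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]  # leaf: majority label
    best = max(attributes, key=lambda a: information_gain(rows, a, target))
    rest = [a for a in attributes if a != best]
    return {
        (best, value): build_tree([r for r in rows if r[best] == value], rest, target)
        for value in sorted({r[best] for r in rows})
    }

# Hypothetical records: two descriptive attributes and a class label.
rows = [
    {"motif": "yes", "length": "short", "family": "A"},
    {"motif": "yes", "length": "long", "family": "A"},
    {"motif": "no", "length": "short", "family": "B"},
    {"motif": "no", "length": "long", "family": "B"},
]

tree = build_tree(rows, ["motif", "length"], "family")
print(tree)
```

In this toy data, splitting on motif separates the two families completely, while splitting on length leaves each subset as mixed as before, so motif is chosen first and no further split is needed.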

Consider the scenario depicted in Figure 7-7, in which the data to be classified consists of 20 circles and

10 squares; of these 30 shapes, 16 are white and 14 are black. With two dimensions to

compare—shape and color—an entropy-based inductive classifier bifurcates the space first according

to color because it provides the maximum change in entropy, resulting in one group of 14 black

circles and squares and one group of 16 white circles and squares. After dividing the space by color,

it's further subdivided by shape, as shown in the figure.
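
The choice of color over shape can be checked numerically. The text does not spell out how the change in entropy is computed, so the sketch below assumes one reading: the undivided data starts as a single group with zero entropy, and the change for each attribute is the Shannon entropy of the partition it produces. Only the counts come from the example; the function name is illustrative.

```python
from math import log2

def split_entropy(counts):
    """Shannon entropy, in bits, of a partition with the given group sizes."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c)

by_color = split_entropy([16, 14])  # 16 white, 14 black
by_shape = split_entropy([20, 10])  # 20 circles, 10 squares
first_split = "color" if by_color > by_shape else "shape"
print(round(by_color, 3), round(by_shape, 3), first_split)
# prints: 0.997 0.918 color
```

Under this assumption the nearly even 16/14 color split carries more entropy than the 20/10 shape split, which agrees with the order of subdivision described for the figure.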
