Databases Reference
In-Depth Information
The dynamic programming approach is commonly used for sequence alignments.
Among many available analysis packages, BLAST (Basic Local Alignment Search Tool)
is one of the most popular tools in biosequence analysis.
Hidden Markov Model for Biological Sequence Analysis
Given a biological sequence, biologists would like to analyze what that sequence repre-
sents. To represent the structure or statistical regularities of sequence classes, biologists
construct various probabilistic models such as
Markov chains
and
hidden Markov
models
. In both models, the probability of a state depends only on that of the previous
state; therefore, they are particularly useful for the analysis of biological sequence data.
The most common methods for constructing hidden Markov models are the forward
algorithm, the Viterbi algorithm, and the Baum-Welch algorithm. Given a sequence of
symbols,
x
, the
forward algorithm
finds the probability of obtaining
x
in the model; the
Viterbi algorithm
finds the most probable path (corresponding to
x
) through the model,
whereas the
Baum-Welch algorithm
learns or adjusts the model parameters so as to best
explain a set of training sequences.
13.1.2
Mining Graphs and Networks
Graphs represents a more general class of structures than sets, sequences, lattices, and
trees. There is a broad range of graph applications on the Web and in social networks,
information networks, biological networks, bioinformatics, chemical informatics, com-
puter vision, and multimedia and text retrieval. Hence, graph and network mining
have become increasingly important and heavily researched. We overview the follow-
ing major themes: (1) graph pattern mining; (2) statistical modeling of networks;
(3) data cleaning, integration, and validation by network analysis; (4) clustering and
classification of graphs and homogeneous networks; (5) clustering, ranking, and classifi-
cation of heterogeneous networks; (6) role discovery and link prediction in information
networks; (7) similarity search and OLAP in information networks; and (8) evolution
of information networks.
Graph Pattern Mining
Graph pattern mining is the mining of
frequent subgraphs
(also called
(sub)graph pat-
terns
) in one or a set of graphs. Methods for mining graph patterns can be categorized
into Apriori-based and pattern growth-based approaches. Alternatively, we can mine
the set of
closed graphs
where a graph
g
is
closed
if there exists no proper supergraph
g
0
that carries the same support count as
g
. Moreover, there are many
variant graph
patterns
, including approximate frequent graphs, coherent graphs, and dense graphs.
User-specified constraints can be pushed deep into the graph pattern mining process to
improve mining efficiency.
Graph pattern mining has many interesting applications. For example, it can be
used to generate compact and effective
graph index structures
based on the concept of