We call p(x, y) = p(y)p(x | y) the joint distribution of instances and labels.

As an example of generative models, the multivariate Gaussian distribution is a common choice for continuous feature vectors x. The class conditional distributions have the probability density function

p(x \mid y) = N(x; \mu_y, \Sigma_y) = \frac{1}{(2\pi)^{D/2} |\Sigma_y|^{1/2}} \exp\left( -\frac{1}{2} (x - \mu_y)^\top \Sigma_y^{-1} (x - \mu_y) \right),    (3.3)
where μ_y and Σ_y are the mean vector and covariance matrix, respectively. An example task is image
classification, where x may be the vector of pixel intensities of an image. Images in each class are
modeled by a Gaussian distribution. The overall generative model is called a Gaussian Mixture
Model (GMM).
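To make this concrete, here is a minimal sketch of classifying with Gaussian class conditionals as in (3.3). It is an illustration rather than code from the text; the class priors, means, and covariances are made-up values standing in for already-estimated parameters.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical pre-estimated parameters for C = 2 classes in D = 3 dimensions:
# class priors p(y), mean vectors mu_y, and covariance matrices Sigma_y.
priors = np.array([0.6, 0.4])
means = [np.zeros(3), np.ones(3)]
covs = [np.eye(3), 2.0 * np.eye(3)]

def classify(x):
    """Predict argmax_y p(y) p(x | y) with Gaussian class conditionals (Eq. 3.3)."""
    # Work in log space for numerical stability: log p(y) + log N(x; mu_y, Sigma_y).
    scores = [np.log(priors[y]) + multivariate_normal.logpdf(x, means[y], covs[y])
              for y in range(len(priors))]
    return int(np.argmax(scores))

print(classify(np.array([0.9, 1.1, 1.0])))  # close to mu_1, so class 1 is predicted
```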
As another example of generative models, the multinomial distribution
p(x = (x_{\cdot 1}, \ldots, x_{\cdot D}) \mid \mu_y) = \frac{\left( \sum_{i=1}^{D} x_{\cdot i} \right)!}{x_{\cdot 1}! \cdots x_{\cdot D}!} \prod_{d=1}^{D} \mu_{yd}^{x_{\cdot d}},    (3.4)
where μ_y is a probability vector, is a common choice for modeling count vectors x. For instance,
in text categorization x is the vector of word counts in a document (the so-called bag-of-words
representation). Documents in each category are modeled by a multinomial distribution. The overall
generative model is called a Multinomial Mixture Model.
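As a companion sketch for the multinomial case (again an illustration with made-up parameters, not code from the text), the log of (3.4) is a multinomial coefficient term that does not depend on the category, plus a dot product between the count vector and log μ_y:

```python
import numpy as np
from scipy.special import gammaln

def multinomial_log_likelihood(x, mu_y):
    """log p(x | mu_y) for a count vector x under Eq. (3.4)."""
    n = x.sum()
    # Log multinomial coefficient (sum_d x_d)! / (x_1! ... x_D!), via log-gamma.
    log_coef = gammaln(n + 1) - gammaln(x + 1).sum()
    return log_coef + np.dot(x, np.log(mu_y))

# Hypothetical two-category model over a vocabulary of D = 4 words.
mu = np.array([[0.5, 0.2, 0.2, 0.1],   # category 0 word probabilities
               [0.1, 0.2, 0.2, 0.5]])  # category 1 word probabilities
priors = np.array([0.5, 0.5])
x = np.array([3, 0, 1, 6])             # bag-of-words counts for one document

scores = np.log(priors) + np.array([multinomial_log_likelihood(x, m) for m in mu])
print(int(np.argmax(scores)))          # category 1 matches these counts better
```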
As yet another example of generative models, Hidden Markov Models (HMM) are commonly
used to model sequences of instances. Each instance in the sequence is generated from a hidden state,
where the state conditional distribution can be a Gaussian or a multinomial, for example. In addition,
HMMs specify the transition probability between states to form the sequence. Learning HMMs
involves estimating the conditional distributions' parameters and transition probabilities. Doing so
makes it possible to infer the hidden states responsible for generating the instances in the sequences.
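The paragraph above does not spell out the inference step; one standard way to recover the most likely hidden state sequence is the Viterbi algorithm. The sketch below is an illustration under assumed discrete (multinomial) emissions, with made-up initial, transition, and emission probabilities; it is not code from the text.

```python
import numpy as np

def viterbi(obs, log_pi, log_A, log_B):
    """Most likely hidden state sequence for observation indices `obs`.

    log_pi[s]   : log initial probability of state s
    log_A[s, t] : log transition probability s -> t
    log_B[s, o] : log emission probability of symbol o from state s
    """
    S = len(log_pi)
    T = len(obs)
    delta = np.full((T, S), -np.inf)    # best log score ending in each state
    back = np.zeros((T, S), dtype=int)  # backpointers to the best predecessor
    delta[0] = log_pi + log_B[:, obs[0]]
    for t in range(1, T):
        for s in range(S):
            cand = delta[t - 1] + log_A[:, s]
            back[t, s] = np.argmax(cand)
            delta[t, s] = cand[back[t, s]] + log_B[s, obs[t]]
    # Trace back the best path from the most likely final state.
    path = [int(np.argmax(delta[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Hypothetical 2-state, 2-symbol HMM.
pi = np.array([0.7, 0.3])
A = np.array([[0.9, 0.1], [0.2, 0.8]])
B = np.array([[0.8, 0.2], [0.3, 0.7]])
print(viterbi([0, 0, 1, 1], np.log(pi), np.log(A), np.log(B)))
```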
Now we know how to do classification once we have p(x | y) and p(y), but the problem remains to learn these distributions from training data. The class conditional p(x | y) is often determined by some model parameters, for example, the mean μ and covariance matrix Σ of a Gaussian distribution. For p(y), if there are C classes we need to estimate C − 1 parameters: p(y = 1), ..., p(y = C − 1). The probability p(y = C) is constrained to be 1 − \sum_{c=1}^{C-1} p(y = c) since p(y) is normalized. We will use θ to denote the set of all parameters in p(x | y) and p(y). If we want to be explicit, we use the notation p(x | y, θ) and p(y | θ). Training amounts to finding a good θ. But how do we define goodness?
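A one-line check of the normalization constraint, with made-up numbers for C = 3 classes:

```python
import numpy as np

# Only C - 1 class-prior parameters are free; the last is fixed by normalization:
# p(y = C) = 1 - sum_{c=1}^{C-1} p(y = c).
free = np.array([0.5, 0.3])      # p(y = 1), p(y = 2) for C = 3 (made-up values)
p_last = 1.0 - free.sum()        # p(y = 3) = 0.2
print(np.append(free, p_last))   # [0.5 0.3 0.2], a valid p(y)
```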
One common criterion is the maximum likelihood estimate (MLE). Given training data D, the MLE is

\hat{\theta} = \operatorname{argmax}_{\theta} \; p(D \mid \theta) = \operatorname{argmax}_{\theta} \; \log p(D \mid \theta).    (3.5)
That is, the MLE is the parameter under which the data likelihood p(D | θ) is the largest. We often work with the log likelihood log p(D | θ) instead of the straight likelihood p(D | θ). They yield the same maxima since log(·) is monotonic, and the log likelihood will be easier to handle.
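For fully labeled data and the Gaussian model above, the maximizer of (3.5) has a well-known closed form: class priors are class frequencies, and each class mean and covariance is the sample mean and (biased) sample covariance of that class. The sketch below illustrates this; it is not code from the text, and the toy data and names are made up.

```python
import numpy as np
from scipy.stats import multivariate_normal

def mle_fit(X, y, C):
    """Closed-form MLE of theta = (priors, means, covariances) from labeled data."""
    priors = np.array([(y == c).mean() for c in range(C)])
    means = [X[y == c].mean(axis=0) for c in range(C)]
    covs = [np.cov(X[y == c], rowvar=False, bias=True) for c in range(C)]  # MLE uses 1/N
    return priors, means, covs

def log_likelihood(X, y, priors, means, covs):
    """log p(D | theta) = sum_i [ log p(y_i) + log p(x_i | y_i) ]."""
    return sum(np.log(priors[c]) + multivariate_normal.logpdf(x, means[c], covs[c])
               for x, c in zip(X, y))

# Toy labeled data: two 2-D Gaussian blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(3.0, 1.0, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
theta = mle_fit(X, y, C=2)
print(log_likelihood(X, y, *theta))  # the value that (3.5) maximizes over theta
```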
 