An IDF weight for term i from the dictionary is defined as w_i = log(N / DF_i),
where DF_i is the number of documents in the corpus that contain
word i. A document's TFIDF vector is a vector with elements
w_i = TF_i log(N / DF_i).
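The weighting above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: it assumes that N is the number of documents in the corpus, that TF_i is the raw count of term i in a document, and that documents are already tokenized. The function name tfidf_vectors and the toy corpus are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (vocabulary, TFIDF vectors) for a list of tokenized documents."""
    n_docs = len(docs)  # N (assumed: number of documents in the corpus)
    # DF_i: number of documents that contain term i at least once.
    df = Counter(term for doc in docs for term in set(doc))
    vocab = sorted(df)
    idf = {term: math.log(n_docs / df[term]) for term in vocab}
    vectors = []
    for doc in docs:
        tf = Counter(doc)  # TF_i (assumed: raw count of term i in this document)
        vectors.append([tf[term] * idf[term] for term in vocab])
    return vocab, vectors

# Toy corpus of three tokenized documents (made up for illustration).
docs = [["cat", "sat", "mat"], ["cat", "cat", "dog"], ["dog", "barked"]]
vocab, vectors = tfidf_vectors(docs)
```

Note that a term occurring in every document gets weight log(1) = 0 under this definition, so it contributes nothing to any document's vector.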
2.11 Appendix C: Kernel Canonical Correlation Analysis
Canonical Correlation Analysis (CCA) is a method of correlating two
multidimensional variables. It makes use of two different views of the same
semantic object (e.g., the same text document written in two different
languages, or a news event described by two different news agencies) to
extract a representation of the semantics.
The input to CCA is a paired dataset S = {(u_i, v_i); u_i ∈ U, v_i ∈ V}, where
U and V are two different views on the data; each pair contains two views of
the same document. The goal of CCA is to find two linear mappings into a common
semantic space W from the spaces U and V. All documents from U and V can be
mapped into W to obtain a view-independent or, in our case,
language-independent representation.
The criterion used to choose the mapping is the correlation between the
projections of the two views across the training data for each dimension in W .
This criterion leads to a generalized eigenvalue problem whose eigenvectors
give the desired mappings.
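As a rough illustration of the linear case, the sketch below computes the mappings with NumPy. It does not call a generalized eigensolver directly; instead it whitens each view and takes an SVD of the whitened cross-covariance, which yields the same projection directions and correlations. The function names (inv_sqrt, linear_cca), the small ridge term eps, and the synthetic data are assumptions made for this example only.

```python
import numpy as np

def inv_sqrt(C):
    """Inverse square root of a symmetric positive-definite matrix."""
    vals, vecs = np.linalg.eigh(C)
    return vecs @ np.diag(vals ** -0.5) @ vecs.T

def linear_cca(U, V, k, eps=1e-8):
    """Rows of U and V are paired samples. Returns (A, B, correlations)."""
    U = U - U.mean(axis=0)
    V = V - V.mean(axis=0)
    n = U.shape[0]
    # Covariances of each view (eps is an assumed ridge for numerical stability).
    Cuu = U.T @ U / n + eps * np.eye(U.shape[1])
    Cvv = V.T @ V / n + eps * np.eye(V.shape[1])
    Cuv = U.T @ V / n
    Wu, Wv = inv_sqrt(Cuu), inv_sqrt(Cvv)
    # Singular values of the whitened cross-covariance are the
    # canonical correlations; singular vectors give the directions.
    P, s, Qt = np.linalg.svd(Wu @ Cuv @ Wv)
    A = Wu @ P[:, :k]      # mapping from U into W
    B = Wv @ Qt.T[:, :k]   # mapping from V into W
    return A, B, s[:k]

# Synthetic paired views that share one latent factor (made-up data).
rng = np.random.default_rng(0)
z = rng.normal(size=(500, 1))
U = z @ np.array([[1.0, 0.5, -1.0]]) + 0.1 * rng.normal(size=(500, 3))
V = z @ np.array([[2.0, -1.0]]) + 0.1 * rng.normal(size=(500, 2))
A, B, rho = linear_cca(U, V, k=1)
```

Here U @ A and V @ B are the projections of the two views into the common space, and rho holds the correlation achieved along each dimension of that space.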
CCA can be kernelized so that it can be applied to feature vectors that are
available only implicitly through a kernel function. There is a danger that
spurious correlations could be found in high-dimensional spaces, so the method
has to be regularized by constraining the norms of the projection weight
vectors. The kernelized version is called Kernel Canonical Correlation
Analysis (KCCA).
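One common way to write regularized KCCA is as the eigenproblem (K_u + kI)^(-1) K_v (K_v + kI)^(-1) K_u alpha = lambda^2 alpha, where K_u and K_v are the centered kernel matrices of the two views and k penalizes the norms of the weight vectors. The sketch below follows that formulation as an assumption; the source does not specify which variant it uses. The names kcca, center_kernel, and kappa, the unnormalized scaling of alpha and beta, and the linear kernels on synthetic data are all illustrative choices.

```python
import numpy as np

def center_kernel(K):
    """Center a kernel matrix in feature space."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H

def kcca(Ku, Kv, k, kappa=0.1):
    """Regularized KCCA on two kernel matrices over the same paired samples.

    Returns dual coefficients (alpha, beta) and the top-k values lambda.
    kappa is the assumed regularization strength.
    """
    n = Ku.shape[0]
    Ku, Kv = center_kernel(Ku), center_kernel(Kv)
    I = np.eye(n)
    # M = (Ku + kappa I)^-1 Kv (Kv + kappa I)^-1 Ku
    M = np.linalg.solve(Ku + kappa * I, Kv) @ np.linalg.solve(Kv + kappa * I, Ku)
    vals, vecs = np.linalg.eig(M)
    order = np.argsort(-vals.real)[:k]
    lam = np.sqrt(np.clip(vals.real[order], 0.0, None))
    alpha = vecs.real[:, order]
    # beta follows from alpha via the stationarity condition.
    beta = np.linalg.solve(Kv + kappa * I, Ku @ alpha) / lam
    return alpha, beta, lam

# Synthetic paired views sharing one latent factor (made-up data).
rng = np.random.default_rng(0)
z = rng.normal(size=(100, 1))
U = z @ np.array([[1.0, 0.5, -1.0]]) + 0.1 * rng.normal(size=(100, 3))
V = z @ np.array([[2.0, -1.0]]) + 0.1 * rng.normal(size=(100, 2))
Ku, Kv = U @ U.T, V @ V.T  # linear kernels, for illustration only
alpha, beta, lam = kcca(Ku, Kv, k=1)
proj_u = center_kernel(Ku) @ alpha  # training projections of view U
proj_v = center_kernel(Kv) @ beta   # training projections of view V
```

With an expressive kernel and kappa close to zero, the training projections can be made almost perfectly correlated regardless of the data, which is the spurious-correlation risk that regularization is meant to control.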
Example. Let the space V be the vector-space model for English and U the
vector-space model for French text documents. A paired dataset is then a set
of pairs of English documents together with their French translations. The
output of KCCA on this dataset is a semantic space in which each dimension
captures a similar meaning in English and French. By mapping English or French
documents into this space, a language-independent representation is obtained.
In this way, standard machine learning algorithms can be used on multilingual
datasets.
 