Detection of Bias in Media Outlets with Statistical Learning Methods - Text Mining: Classification, Clustering, and Applications


An IDF weight for term $i$ from the dictionary is defined as $w_i = \log(N / DF_i)$, where $DF_i$ is the number of documents in the corpus that contain word $i$. A document's TFIDF vector is a vector with elements:

$w_i = TF_i \log(N / DF_i)$.
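The weighting above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation. It assumes that $N$ is the total number of documents, that $TF_i$ is the raw count of term $i$ in the document, and that the logarithm is natural. The function name `tfidf_vectors` and the toy documents are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (vocabulary, TFIDF vectors) for a list of tokenized documents.

    Assumptions (not stated in the text): N is the number of documents,
    TF_i is a raw term count, and log is the natural logarithm.
    """
    n_docs = len(docs)
    # DF_i: number of documents that contain term i at least once.
    df = Counter(term for doc in docs for term in set(doc))
    vocab = sorted(df)
    idf = {term: math.log(n_docs / df[term]) for term in vocab}
    vectors = []
    for doc in docs:
        tf = Counter(doc)  # TF_i: count of term i in this document
        vectors.append([tf[term] * idf[term] for term in vocab])
    return vocab, vectors

# Hypothetical toy corpus, already tokenized.
docs = [["bias", "in", "news"], ["news", "news", "agency"], ["kernel", "method"]]
vocab, vectors = tfidf_vectors(docs)
```

A term that occurs in every document gets weight $\log(1) = 0$, so it contributes nothing to the vector, which is the intended effect of the IDF factor.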

2.11 Appendix C: Kernel Canonical Correlation Analysis

Canonical Correlation Analysis (CCA) is a method of correlating two multidimensional variables. It makes use of two different views of the same semantic object (e.g., the same text document written in two different languages, or a news event described by two different news agencies) to extract a representation of the semantics.

The input to CCA is a paired dataset $S = \{(u_i, v_i);\ u_i \in U,\ v_i \in V\}$, where $U$ and $V$ are two different views on the data; each pair contains two views of the same document. The goal of CCA is to find two linear mappings into a common semantic space $W$ from the spaces $U$ and $V$. All documents from $U$ and $V$ can be mapped into $W$ to obtain a view-independent or, in our case, language-independent representation.

The criterion used to choose the mapping is the correlation between the projections of the two views across the training data for each dimension in $W$. This criterion leads to a generalized eigenvalue problem whose eigenvectors give the desired mappings.
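To make the eigenvalue formulation concrete, the following is a minimal NumPy sketch of linear CCA. It sets up one standard form of the generalized eigenvalue problem, $A w = \rho B w$, where $A$ holds the cross-covariance blocks and $B$ holds the within-view covariance blocks, and reduces it to an ordinary symmetric problem through a Cholesky factorization. The function name `linear_cca`, the small ridge term `reg` (added only for numerical stability), and the synthetic data are assumptions made for illustration; they do not come from the text.

```python
import numpy as np

def linear_cca(X, Y, n_components, reg=1e-8):
    """Linear CCA sketch. X is (n, p), Y is (n, q); rows are paired samples."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n, p = X.shape
    q = Y.shape[1]
    Cxx = X.T @ X / n + reg * np.eye(p)   # within-view covariance of U
    Cyy = Y.T @ Y / n + reg * np.eye(q)   # within-view covariance of V
    Cxy = X.T @ Y / n                     # cross-covariance between views

    # Generalized eigenproblem A w = rho B w with w = [a; b].
    A = np.zeros((p + q, p + q))
    A[:p, p:] = Cxy
    A[p:, :p] = Cxy.T
    B = np.zeros((p + q, p + q))
    B[:p, :p] = Cxx
    B[p:, p:] = Cyy

    # Reduce to a standard symmetric problem using B = L L^T.
    L = np.linalg.cholesky(B)
    L_inv = np.linalg.inv(L)
    M = L_inv @ A @ L_inv.T
    rho, Z = np.linalg.eigh(M)                   # eigenvalues in ascending order
    order = np.argsort(rho)[::-1][:n_components]  # keep the largest correlations
    W = L_inv.T @ Z[:, order]
    return rho[order], W[:p], W[p:]

# Synthetic paired views that share one latent signal z (illustrative data only).
rng = np.random.default_rng(0)
z = rng.normal(size=(500, 1))
X = np.hstack([z + 0.1 * rng.normal(size=(500, 1)), rng.normal(size=(500, 2))])
Y = np.hstack([rng.normal(size=(500, 2)), -z + 0.1 * rng.normal(size=(500, 1))])

rho, A_u, B_v = linear_cca(X, Y, n_components=1)
proj_u = (X - X.mean(axis=0)) @ A_u[:, 0]   # view U mapped into the common space
proj_v = (Y - Y.mean(axis=0)) @ B_v[:, 0]   # view V mapped into the common space
```

In this sketch the leading eigenvalue `rho[0]` equals the correlation between `proj_u` and `proj_v` on the training pairs, which is the criterion described above; further dimensions of the common space correspond to the next-largest eigenvalues.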

CCA can be kernelized so that it can be applied to feature vectors that are only implicitly available through a kernel function. There is a danger that spurious correlations could be found in high-dimensional spaces, and so the method has to be regularized by constraining the norms of the projection weight vectors. The kernelized version is called Kernel Canonical Correlation Analysis (KCCA).
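A regularized kernel version can be sketched as follows. This is one common variant, chosen here as an assumption rather than taken from the text: the constraint matrices are $(K_x + \kappa I)^2$ and $(K_y + \kappa I)^2$, in which case the generalized eigenvalue problem reduces to a singular value decomposition of $(K_x + \kappa I)^{-1} K_x K_y (K_y + \kappa I)^{-1}$. The Gaussian kernel, the parameter values, the function names, and the synthetic data are all illustrative.

```python
import numpy as np

def rbf_kernel(A, B, gamma):
    """Gaussian kernel matrix between the rows of A and the rows of B."""
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def center_kernel(K):
    """Center a training kernel matrix in feature space."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H

def kcca(Kx, Ky, n_components, kappa):
    """Regularized KCCA sketch on centered kernel matrices Kx, Ky (both n x n).

    kappa > 0 is the regularization strength (an assumed parameterization).
    Returns regularized correlations and dual weights alpha, beta.
    """
    n = Kx.shape[0]
    Rx_inv = np.linalg.inv(Kx + kappa * np.eye(n))
    Ry_inv = np.linalg.inv(Ky + kappa * np.eye(n))
    M = Rx_inv @ Kx @ Ky @ Ry_inv
    U, s, Vt = np.linalg.svd(M)
    alpha = Rx_inv @ U[:, :n_components]
    beta = Ry_inv @ Vt[:n_components].T
    return s[:n_components], alpha, beta

# Synthetic paired views with a nonlinear relation (illustrative data only).
rng = np.random.default_rng(0)
t = rng.uniform(-1.0, 1.0, size=(200, 1))
X = t + 0.05 * rng.normal(size=(200, 1))
Y = t**2 + 0.05 * rng.normal(size=(200, 1))

Kx = center_kernel(rbf_kernel(X, X, gamma=1.0))
Ky = center_kernel(rbf_kernel(Y, Y, gamma=1.0))
s, alpha, beta = kcca(Kx, Ky, n_components=1, kappa=0.1)

proj_x = Kx @ alpha[:, 0]   # training points of the first view in the common space
proj_y = Ky @ beta[:, 0]    # training points of the second view in the common space
corr = np.corrcoef(proj_x, proj_y)[0, 1]
```

With `kappa` close to zero and full-rank kernel matrices, every pair of views can be made to correlate perfectly on the training data, which is the spurious-correlation problem mentioned above; increasing `kappa` shrinks the solution toward directions with large kernel eigenvalues.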

Example. Let the space $V$ be the vector-space model for English and $U$ the vector-space model for French text documents. A paired dataset is then a set of English documents paired with their French translations. The output of KCCA on this dataset is a semantic space in which each dimension captures a similar meaning in English and French. By mapping English or French documents into this space, a language-independent representation is obtained. In this way, standard machine learning algorithms can be used on multilingual datasets.
