Standardization of Data Processing and Statistical Analysis in Comparative Plant Proteomics Experiment - Plant Proteomics: Methods and Protocols

extracted from the sample one by one, the first component being the one that explains most of the systematic variation in the data. Plotting the principal components quickly visualizes the structure of the data, helping to find sample clusters and identify outliers. If some groups are distinguished, it is worth going deeper into the analysis and studying which variables are most correlated with each principal component, in order to define the biological processes hidden in the data. The correlation of each variable with each component is given in the so-called loading matrix. These analyses can be performed in R using prcomp{stats}, princomp{stats}, and biplot{stats}. We recommend starting the data processing with a PCA to quickly identify outliers and, later in the analysis, once the outliers have been removed, performing a complete PCA that also defines the biologically interesting variables.
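
The scores and loadings described above can be illustrated with a minimal, dependency-free Python sketch. It is not a substitute for prcomp{stats}: the abundance values are invented, the function names are hypothetical, and the first component is obtained by power iteration on the covariance matrix rather than by the decomposition that R uses. Loadings are computed here as the correlation of each variable with the component scores, following the definition given in the text.

```python
import math

def center_columns(rows):
    """Subtract each column mean so every variable has mean zero."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[v - m for v, m in zip(row, means)] for row in rows]

def first_component(rows, n_iter=100):
    """Return (direction, scores) of the first principal component.

    Assumption: power iteration on the sample covariance matrix,
    started from a fixed vector so the result is deterministic.
    """
    x = center_columns(rows)
    n, p = len(x), len(x[0])
    cov = [[sum(x[i][a] * x[i][b] for i in range(n)) / (n - 1)
            for b in range(p)] for a in range(p)]
    w = [1.0] * p
    for _ in range(n_iter):
        w = [sum(cov[a][b] * w[b] for b in range(p)) for a in range(p)]
        norm = math.sqrt(sum(v * v for v in w))
        w = [v / norm for v in w]
    scores = [sum(xi * wi for xi, wi in zip(row, w)) for row in x]
    return w, scores

def pearson(u, v):
    """Pearson correlation coefficient between two equal-length lists."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv)

# Invented data: rows are samples, columns are three protein abundances.
# The last sample is deliberately far from the others.
abundance = [
    [10.0, 5.0, 3.0],
    [11.0, 5.5, 3.1],
    [9.0, 4.5, 2.9],
    [10.0, 5.0, 3.0],
    [20.0, 10.0, 3.0],
]

direction, scores = first_component(abundance)
# Loading of each variable = its correlation with the component scores.
loadings = [pearson(list(col), scores) for col in zip(*abundance)]
# Sample with the most extreme score on the first component.
outlier_index = max(range(len(scores)), key=lambda i: abs(scores[i]))

print("scores:", scores)
print("loadings:", loadings)
print("most extreme sample:", outlier_index)
```

In this toy matrix the deliberately distant last sample has the most extreme score, and the two abundances that vary together dominate the loadings. A real dataset would also require scaling decisions and inspection of further components.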

3.8.2 Independent Component Analysis

In contrast to PCA, Independent Component Analysis (ICA) decomposes an input dataset into components so that each component is as statistically independent from the others as possible. ICA can be used to extract mixed signals from the datasets while reducing the effects of noise or artifacts. ICA has been reported to be more powerful than PCA, and faster and more robust than ANOVA, when dealing with proteomics data [14, 15]. In R, the {fastICA} package is recommended.

3.8.3 Partial Least Squares, Discriminant Analysis

This is a multivariate projection-based method that, unlike PCA or ICA, maximizes the covariance between two datasets by seeking linear combinations of the variables from both sets (these linear combinations are called latent variables). In a classical partial least squares discriminant analysis (PLS-DA), the response variable is categorical, indicating the different classes (treatments) of the samples. PLS-DA is used to solve a wide range of classification/discrimination problems in a supervised way, determining which variables show a higher covariance with the different treatments. The {mixOmics} package contains a set of tools for performing PCA, PLS, and other multivariate tests focused on -omics data [16].
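
To make the idea of a covariance-maximizing latent variable concrete, the following Python sketch computes only the first latent variable of a two-class PLS-DA. It assumes a dummy-coded (0/1) response and uses the standard result that, for a single response, the unit-length weight vector maximizing the covariance between the projected data and the response is proportional to the product of the centered data matrix (transposed) and the centered response. The data, class labels, and function name are invented for illustration. A real analysis would rely on a tested implementation such as the package named in the text.

```python
import math

def plsda_first_component(rows, labels):
    """Return (weights, scores) of the first PLS latent variable.

    Assumption: exactly two classes, coded 0/1 after sorting the labels.
    """
    classes = sorted(set(labels))
    if len(classes) != 2:
        raise ValueError("this sketch handles exactly two classes")
    y = [1.0 if lab == classes[1] else 0.0 for lab in labels]
    y_mean = sum(y) / len(y)
    yc = [v - y_mean for v in y]

    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    xc = [[v - m for v, m in zip(row, means)] for row in rows]

    # Weight of each variable is proportional to its covariance with y.
    w = [sum(xc[i][j] * yc[i] for i in range(n)) for j in range(len(means))]
    norm = math.sqrt(sum(v * v for v in w))
    w = [v / norm for v in w]

    # Latent variable scores: projection of each sample onto the weights.
    scores = [sum(a * b for a, b in zip(row, w)) for row in xc]
    return w, scores

# Invented data: three control and three stressed samples, three proteins.
# Only the first protein differs between treatments.
abundance = [
    [5.0, 2.0, 7.0],
    [5.2, 2.1, 6.9],
    [4.8, 1.9, 7.1],
    [9.0, 2.0, 7.0],
    [9.2, 2.2, 7.1],
    [8.8, 1.8, 6.9],
]
labels = ["control"] * 3 + ["stress"] * 3

weights, lv_scores = plsda_first_component(abundance, labels)
print("weights:", weights)
print("scores:", lv_scores)
```

The protein whose abundance differs between treatments receives the largest absolute weight, and the scores separate the two classes along the latent variable.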

3.8.4 Clustering and Heat Maps

Clustering of expression data is usually done to identify proteins with similar behavior, implying that they are correlated. This exploratory technique has clearly proven valuable and is complementary to multivariate statistics. Representing the different pathways and visualizing the integrated data across time series or treatments can improve data interpretation and is sometimes also helpful for selecting candidate variables. Pearson's correlation coefficient combined with Ward's aggregation method has been reported to be the best clustering strategy for proteomics data, with Euclidean distance and UPGMA being another valid strategy [17]. The R package {gplots} can be used for plotting these graphs.
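
The combination of Pearson's correlation and Ward's aggregation can be sketched in plain Python as follows. This is an illustrative assumption rather than the procedure of any particular package: the dissimilarity is taken as 1 - r, Ward's criterion is applied through the Lance-Williams update on squared dissimilarities, and the protein names, profiles, and function names are invented. Heat map plotting is not covered.

```python
import math

def pearson(u, v):
    """Pearson correlation coefficient between two equal-length lists."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv)

def ward_clusters(profiles, k):
    """Merge profiles into k clusters using 1 - r and Ward's criterion."""
    names = list(profiles)
    members = {i: [name] for i, name in enumerate(names)}
    # Squared dissimilarities between current clusters, keyed by (low, high).
    d2 = {}
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            d = 1.0 - pearson(profiles[names[i]], profiles[names[j]])
            d2[(i, j)] = d * d
    next_id = len(names)
    while len(members) > max(k, 1):
        a, b = min(d2, key=d2.get)  # closest pair of clusters
        na, nb = len(members[a]), len(members[b])
        dab = d2.pop((a, b))
        updated = {}
        for c in members:
            if c in (a, b):
                continue
            nc = len(members[c])
            dac = d2.pop((min(a, c), max(a, c)))
            dbc = d2.pop((min(b, c), max(b, c)))
            # Lance-Williams update for Ward's method.
            updated[c] = ((na + nc) * dac + (nb + nc) * dbc - nc * dab) / (na + nb + nc)
        members[next_id] = members.pop(a) + members.pop(b)
        for c, value in updated.items():
            d2[(c, next_id)] = value  # new ids are always the largest
        next_id += 1
    return sorted(sorted(m) for m in members.values())

# Invented abundance profiles across four time points.
profiles = {
    "P1": [1.0, 2.0, 3.0, 4.0],
    "P2": [2.0, 4.1, 6.0, 8.2],
    "P3": [4.0, 3.0, 2.1, 1.0],
    "P4": [8.0, 6.2, 4.0, 2.0],
}

clusters = ward_clusters(profiles, 2)
print(clusters)
```

Because the dissimilarity is based on correlation, the two increasing profiles are grouped together despite their different magnitudes, and likewise the two decreasing ones.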
