Biomedical Engineering Reference
In-Depth Information
9.6 Integrative Analysis and Application Examples
Integrative data analysis can mean two things: data integration
(centralization) or tool integration. Here we focus on the former. Data
integration can be done physically or logically, as discussed in Chapter 8.
Technically, integrative data analysis can be carried out in two ways. One way
is to first extract useful features from the different data platforms, as
described earlier, and then to integrate those features for analysis. The
other way is to integrate the raw data from all of the data platforms and then
analyze this large volume of heterogeneous data.
The first approach is certainly more practical because, whether for data
analysis or data mining, different data types still require different
analytical techniques. For example, microarray gene expression data contain
tens of thousands of dimensions (probes), and not all of them are independent
of each other in expression, even at the gene level. The number of samples is
typically far smaller than the number of probes on the array, which makes it
very difficult to apply techniques such as artificial neural networks (ANN),
because of both computational complexity and the risk of overfitting.
Molecular expression data are better subjected to differential analysis first,
to greatly reduce their dimension, before they are integrated with other types
of data for analysis. Conceptually speaking, this essentially reduces the
second approach to the first.
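The feature-first route described above can be sketched as follows. This is
only an illustration under stated assumptions: Welch's t statistic stands in
for "differential analysis" (the text does not name a specific test), the
function names welch_t, select_probes, and integrate are hypothetical, and the
expression and clinical values are invented toy data.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic for two independent groups of expression values."""
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    if se == 0:
        return 0.0  # no spread in either group; treat the probe as uninformative
    return (mean(a) - mean(b)) / se


def select_probes(expr, case_ids, control_ids, top_k):
    """Keep the top_k probes with the largest absolute t statistic.

    expr maps probe name -> {sample id -> expression value}.
    """
    scores = {}
    for probe, values in expr.items():
        case = [values[s] for s in case_ids]
        control = [values[s] for s in control_ids]
        scores[probe] = abs(welch_t(case, control))
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


def integrate(selected, expr, clinical, sample_ids):
    """Build one feature row per sample: selected probes plus clinical fields."""
    rows = {}
    for s in sample_ids:
        row = {p: expr[p][s] for p in selected}
        row.update(clinical[s])
        rows[s] = row
    return rows


# Hypothetical toy data: three probes, three cases (c*), three controls (n*).
expr = {
    "probe_A": {"c1": 9.1, "c2": 8.8, "c3": 9.4, "n1": 5.0, "n2": 5.3, "n3": 4.9},
    "probe_B": {"c1": 6.0, "c2": 6.2, "c3": 5.9, "n1": 6.1, "n2": 5.8, "n3": 6.3},
    "probe_C": {"c1": 3.0, "c2": 3.4, "c3": 2.9, "n1": 7.2, "n2": 6.9, "n3": 7.5},
}
sample_ids = ["c1", "c2", "c3", "n1", "n2", "n3"]
clinical = {s: {"age": age} for s, age in zip(sample_ids, [61, 58, 66, 59, 63, 60])}
cases, controls = sample_ids[:3], sample_ids[3:]

# Step 1: reduce the dimension of the expression platform on its own.
selected = select_probes(expr, cases, controls, top_k=2)
# Step 2: integrate the reduced features with another data type.
features = integrate(selected, expr, clinical, sample_ids)
```

In this toy example probe_B barely differs between the groups, so it is
dropped before integration. A real analysis would also need multiple-testing
correction and a principled cutoff in place of a fixed top_k.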
Combining protein expression data with gene expression data is always
desirable. Biomarkers identified on one data platform and confirmed on the
other are potentially reliable. As gene products, proteins are what carry out
physiological functions in biological systems, but DNA and RNA also play
critical roles in regulating gene expression. Recent rapid developments in RNA
interference (RNAi) research have brought our understanding of such regulatory
mechanisms to a new level [75]. However, the mismatch in maturity between gene
and protein expression analysis technologies makes it challenging to compare
data from the two domains directly. On the one hand, we can readily measure
the expression of tens of thousands of genes; on the other hand, we can only
semireliably detect hundreds or thousands of proteins at a time in a human
specimen that contains hundreds of thousands or even a million proteins
(including posttranslational modification variants). As a result, we still
need a way to conduct comprehensive integrative analysis of protein and gene
expression data. Partially integrative analysis, however, can be done. For
example, concordance analysis between the expression of genes and of the
proteins that are currently detectable has shown a consistency level of 60% to
70%. It is also possible that proteins predicted to be upregulated or
downregulated on the basis of gene expression analysis can be pinpointed in MS
or antibody array analyses.
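One simple way to express such a concordance analysis is as the fraction of
gene/protein pairs, measured on both platforms, whose expression changes in
the same direction. The sketch below assumes this definition; the text does
not specify how the 60% to 70% figure was computed. The function names and the
log2 fold-change values are hypothetical.

```python
def direction(log_fold_change, threshold=0.0):
    """Return +1 for upregulation, -1 for downregulation, 0 for no change."""
    if log_fold_change > threshold:
        return 1
    if log_fold_change < -threshold:
        return -1
    return 0


def concordance(gene_lfc, protein_lfc, threshold=0.0):
    """Fraction of shared gene/protein pairs that change in the same direction.

    Both arguments map a gene identifier to a log2 fold change. Only
    identifiers present on both platforms are compared, which mirrors the
    restriction to currently detectable proteins.
    """
    shared = sorted(set(gene_lfc) & set(protein_lfc))
    if not shared:
        raise ValueError("no gene/protein pairs measured on both platforms")
    agree = sum(
        direction(gene_lfc[g], threshold) == direction(protein_lfc[g], threshold)
        for g in shared
    )
    return agree / len(shared), shared


# Hypothetical log2 fold changes (disease versus control), not real data.
gene_lfc = {"G1": 1.8, "G2": -0.9, "G3": 0.6, "G4": -1.2, "G5": 2.1, "G6": 0.4}
protein_lfc = {"G1": 1.1, "G2": -0.5, "G3": -0.3, "G4": -0.8, "G5": 0.7}

rate, shared = concordance(gene_lfc, protein_lfc)
print(f"{len(shared)} shared pairs, concordance = {rate:.0%}")
```

G6 has no protein measurement and is excluded, which illustrates why the
analysis is only partially integrative. Raising threshold above zero treats
small changes on either platform as "no change" before directions are
compared.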
With defined outcome variables (e.g., disease diagnosis), once risk factors
and biomarkers have been identified from clinical, genomic, and proteomic
platforms, ANN, clustering, and logistic regression are powerful tools for
developing disease prediction models. ANN and logistic regression typically
require a limited number of inputs, and the former can further identify
dependent variables and remove them, thus developing models based only on
independent variables. Logistic regression can provide odds ratio analysis for
input variables with regard to the
 