6. Data Integration Methodology
Here, we outline the analysis strategy for combining information from
heterogeneous genomic studies:
(1) beginning with raw (or preprocessed) primary data;
(2) data cleaning/quality assessment;
(3) gene matching;
(4) single-gene generalized linear model (GLM) modeling for outcome
    of interest;
(5) combination of z-statistics across studies; and
(6) p-value multiplicity adjustment of the final combined statistic.
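Steps (3), (5), and (6) can be sketched in a few lines of Python. This is
only an illustration under stated assumptions: the text does not say how the
z-statistics are weighted or which multiplicity adjustment is used, so the
sketch assumes an unweighted Stouffer-type sum and a Benjamini-Hochberg
adjustment. The function names and the gene identifiers are hypothetical.

```python
import math

def combine_z(z_by_study):
    """Combine per-gene z-statistics across studies.

    z_by_study: list of dicts mapping gene identifier -> z-statistic.
    Gene matching is assumed to be a simple intersection of identifiers;
    the combination is an unweighted Stouffer sum (an assumption).
    """
    common = set(z_by_study[0])
    for study in z_by_study[1:]:
        common &= set(study)
    k = len(z_by_study)
    return {g: sum(s[g] for s in z_by_study) / math.sqrt(k)
            for g in sorted(common)}

def two_sided_p(z):
    """Two-sided p-value of z under a standard normal null."""
    return math.erfc(abs(z) / math.sqrt(2.0))

def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical z-statistics from two studies (illustrative values only).
study_a = {"GENE1": 2.0, "GENE2": -0.5, "GENE3": 1.0}
study_b = {"GENE1": 2.0, "GENE2": 0.5, "GENE4": 3.0}

combined = combine_z([study_a, study_b])   # only GENE1 and GENE2 match
genes = list(combined)
pvals = [two_sided_p(combined[g]) for g in genes]
adjusted = dict(zip(genes, bh_adjust(pvals)))
```

Genes measured in only one study (GENE3 and GENE4 here) drop out at the
matching step, which is why access to the full set of per-gene statistics,
rather than a list of top genes, matters.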
6.1. Data Acquisition
To combine test statistics, we require access to all primary data, not
just the “top genes” or p-values. If image analysis files are available, we
can use them to preprocess the data ourselves. More typically, the available
genomic data are already preprocessed (e.g. image analysis and
normalization for microarrays). Where possible, it is also desirable to
obtain the relevant clinical data so that covariates may be included in the
data model. We are then able to fit models within each data set, which
will yield the statistics that are to be combined across studies. In our
case, this acquisition step is part of SwissBrod preprocessing.
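As a rough illustration of what fitting a model within one data set might
yield, the sketch below computes a per-gene Wald statistic (slope divided by
its standard error) from a Gaussian GLM with identity link, i.e. ordinary
least squares of the outcome on a single gene's expression. This is a
simplification and an assumption: the text does not fix the GLM family, no
clinical covariates are included, and the statistic is only approximately
standard normal for large samples. All names and data are hypothetical.

```python
import math

def wald_statistic(expression, outcome):
    """Slope / standard error for outcome ~ intercept + expression."""
    n = len(expression)
    if n != len(outcome) or n < 3:
        raise ValueError("need matched vectors with at least 3 samples")
    mean_x = sum(expression) / n
    mean_y = sum(outcome) / n
    sxx = sum((x - mean_x) ** 2 for x in expression)
    if sxx == 0:
        raise ValueError("expression is constant across samples")
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(expression, outcome))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual sum of squares, with n - 2 degrees of freedom.
    rss = sum((y - intercept - slope * x) ** 2
              for x, y in zip(expression, outcome))
    se = math.sqrt(rss / (n - 2) / sxx)
    if se == 0:
        raise ValueError("perfect fit: standard error is zero")
    return slope / se

def study_statistics(expr_by_gene, outcome):
    """One statistic per gene for a single study."""
    return {gene: wald_statistic(values, outcome)
            for gene, values in expr_by_gene.items()}

# Hypothetical study with four samples and two genes.
outcome = [1.0, 3.0, 2.0, 4.0]
expr = {"GENE1": [1.0, 2.0, 3.0, 4.0],
        "GENE2": [4.0, 3.0, 2.0, 1.0]}
stats = study_statistics(expr, outcome)
```

Running this once per study gives one dictionary of per-gene statistics per
data set, which is the input needed for the cross-study combination step.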
6.2. Data Cleaning
With the primary data in hand, we carry out data cleaning and make
quality-based decisions on which studies and samples to include. Where
possible, quality of the hybridizations should be assessed so that
low-quality chips are removed from further analysis. However, most public
data are provided as normalized expression measures, precluding rigorous
assessment of hybridization quality. Other aspects of quality assessment
include relevance of individual specific study questions, study