5.7 Data integration and network analysis
High-throughput biological datasets tend to be large, noisy and incomplete. Different
types of data are stored in different databases, and may have very different file
formats. Different databases may also use different identifiers for the same protein,
while gene and protein names and aliases may overlap. All of these problems mean
that accessing all of the available data about an organism, protein or process of
interest is a time-consuming, tedious and error-prone business if performed manually.
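The identifier problem described above can be illustrated with a minimal sketch. It assumes a hand-built alias table; the identifiers (`ID_0001`, `ID_0002`), the names (`geneA`, `prtX`) and the helper functions `build_alias_index` and `resolve` are all hypothetical and are not taken from any real database or library. Names that map to more than one canonical identifier are reported as unresolved rather than guessed.

```python
from collections import defaultdict


def build_alias_index(records):
    """Map every known name (case-insensitive) to the canonical IDs that use it."""
    index = defaultdict(set)
    for canonical_id, names in records.items():
        index[canonical_id.lower()].add(canonical_id)
        for name in names:
            index[name.lower()].add(canonical_id)
    return index


def resolve(name, index):
    """Return the canonical ID for a name, or None if it is unknown or ambiguous."""
    matches = index.get(name.strip().lower(), set())
    return next(iter(matches)) if len(matches) == 1 else None


# Hypothetical records: canonical ID -> names and aliases seen in other sources.
RECORDS = {
    "ID_0001": ["geneA", "prtX"],
    "ID_0002": ["geneB", "prtX"],  # the alias "prtX" overlaps with ID_0001
}
INDEX = build_alias_index(RECORDS)
```

Here `resolve("GeneA", INDEX)` yields `"ID_0001"`, whereas the overlapping alias `"prtX"` yields `None`, flagging it for manual inspection instead of silently merging two proteins.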
A widely used approach to the analysis of large amounts of biological data is data
integration: bringing together data from a range of sources to produce a network, in
which the nodes usually represent genes or gene products, and the edges between
nodes indicate some type of interaction (Yandell and Majoro, 2002; Lee et al.,
2004a; Hallinan and Wipat, 2006). The meaning of an edge depends upon the type
of data being analysed; if the underlying dataset contains, for example, physical
protein-protein binding data, such as that derived from yeast two-hybrid analysis,
an edge between two nodes denotes that those proteins physically bind to each other,
at least under experimental conditions. If the data, in contrast, are TF binding sites,
edges represent transcriptional control.
Networks have the advantage of being easily interpretable. The network
representation is familiar to most people, and the concept of nodes representing individuals and
edges representing interactions is intuitively obvious. From a more technical point of
view, networks can bring together large amounts of disparate data and present them in
context, in a format which is easily browsable, or which can be analysed computationally.
For example, the SinI/R operon in Bacillus subtilis lies at the heart of the organism's cell
fate decision, and as such has been widely studied. Data about the relationships between
these genes are scattered amongst multiple databases and hundreds of papers. Using
data integration, these data can be combined to produce a single picture of the current
state of knowledge about this important system (Figure 2.16).
Networks can be constructed using data from a single type of experiment (e.g.
genetic relationships inferred from synthetic lethal mutation data) or may be
composed of data derived from many sources. In the former case an edge has a clear
meaning: the two proteins which it joins are synthetic lethals. In the latter an edge
represents any functional relationship between the two proteins, possibly based upon
several different types of interaction.
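The difference between single-source and multi-source networks can be sketched as follows. This is an illustrative sketch only: the function `integrate`, the node names `P1` to `P4` and the two evidence labels are hypothetical, and all edges are treated as undirected, which would not suit directed relationships such as transcriptional control. Each edge keeps the set of evidence types that support it, so an edge backed by a single experiment type retains its specific meaning, while an edge backed by several types represents a more general functional relationship.

```python
from collections import defaultdict


def integrate(datasets):
    """Merge per-experiment edge lists into one undirected network.

    datasets maps an evidence type to an iterable of (node_a, node_b) pairs.
    The result maps frozenset({a, b}) to the set of evidence types for that edge.
    """
    network = defaultdict(set)
    for evidence, pairs in datasets.items():
        for a, b in pairs:
            if a != b:  # self-interactions are skipped in this sketch
                network[frozenset((a, b))].add(evidence)
    return dict(network)


# Hypothetical input: two experiment types over made-up protein names.
DATASETS = {
    "yeast_two_hybrid": [("P1", "P2"), ("P2", "P3")],
    "synthetic_lethal": [("P2", "P1"), ("P3", "P4")],
}
NETWORK = integrate(DATASETS)
```

In this example the edge between `P1` and `P2` carries both evidence types, while the edge between `P3` and `P4` is supported by synthetic lethality alone.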
Because of the noisiness of high-throughput data, it is often desirable to have an
indication of the reliability of the data; that is, the probability that an edge present in a
computationally integrated network actually exists in vivo. This probability is
usually estimated by comparing the set of interactions present in the integrated network
with those in a high-confidence, usually manually curated Gold Standard network.
Commonly used Gold Standard datasets include KEGG [10], the MIPS database [11] and
the Gene Ontology (GO) [12] (Lee et al., 2004a).
[10] http://www.genome.jp/kegg/
[11] http://www.helmholtz-muenchen.de/en/ibis
[12] http://www.geneontology.org/
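One simple way to make the comparison against a Gold Standard concrete is sketched below. It is not the scoring scheme of Lee et al. (2004a) or of any particular tool; it is a plain fraction, and the function name `edge_confidence` and the node names are hypothetical. The sketch assumes that an edge can only be judged when both of its nodes appear somewhere in the Gold Standard, since otherwise its absence there is uninformative.

```python
def edge_confidence(edges, gold_edges):
    """Estimate the fraction of testable edges that the Gold Standard confirms.

    An edge is testable only if both of its nodes occur in the Gold Standard.
    Returns None when no edge is testable.
    """
    gold = {frozenset(e) for e in gold_edges}
    gold_nodes = set().union(*gold) if gold else set()
    testable = {frozenset(e) for e in edges if set(e) <= gold_nodes}
    if not testable:
        return None
    return len(testable & gold) / len(testable)


# Hypothetical curated edges and a hypothetical high-throughput dataset.
GOLD = [("A", "B"), ("B", "C"), ("C", "D")]
DATASET = [("B", "A"), ("A", "C"), ("A", "D"), ("A", "Z")]
CONFIDENCE = edge_confidence(DATASET, GOLD)
```

Of the four dataset edges, the one involving `Z` is excluded because `Z` is absent from the Gold Standard; one of the remaining three is confirmed, so the estimate is one third. Such a value could then be attached to every edge contributed by that dataset.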