Relational databases in bioinformatics

1. Introduction

The rapidly growing amount of data in bioinformatics makes the use of a database management system (DBMS) indispensable. Additionally, modern DBMS provide advantages like concurrent multiuser access, security features, and easy web integration.

The relational data model, introduced by Codd in 1970, is the most successful model for DBMS, and is used in numerous commercial products.

2. Relational data management

The relational data model is based on the single concept of relations for data representation. Relations are perceived by the user as tables whose rows are called tuples representing the objects. All entries in a column of a table are atomic values from the same domain, and a single column is called an attribute. The name of a relation together with its attribute names and domains form the schema of the relation. A minimal subset of the attributes from a relation whose value combination uniquely distinguishes each possible tuple of the relation is called a key. Relationships between object sets represented by relations are also stored in tables built from the keys of the relations participating in the relationship. All operations defined on relations take relations as input and generate relations as output. Queries are usually expressed in the Standard Query Language (SQL), which allows data definition as well as data manipulation statements but no procedural elements like loops. Consequently, SQL statements have to be embedded in application programs written in a different programming language.

A sample relational data modeling is illustrated in Figure 1. The example models complexes of proteins. A complex has properties such as name and function and can consist of several proteins. A protein also has properties such as name and function and can participate in several complexes. Thus, two tables “Protein” and “Complex” are generated with particular attributes. Each table specifies attributes as primary keys, in this case artificial keys such as “pID” (protein ID) and “cID” (complex ID) for unique identification of its tuples. A further table “Contains” consisting of the primary keys of Complex and Protein links both tables, modeling the relationship between a complex and its contained proteins. A protein may be part of several complexes. Let us note that for some tuples in the tables the value might be unknown, for example, for complex “abc” with cID = 075, the function is not known. In this case, the value is null.

Sample relational model

Figure 1 Sample relational model

A sample query “Find the names of all complexes that contain protein ‘xyz’” could be formulated in SQL as follows: select from Protein, Complex, Contains where = “xyz” and Protein.pID = Contains.pID and Contains.cID = Complex.cID

3. Advantages

The use of relational DBMSs has several advantages. Owing to its widespread applicability, several commercial and even open-source relational DBMSs are available. With SQL, there exists a standardized and easy-to-learn query language. Several standard interfaces available for SQL allow the integration of databases into web applications or other information systems. Along with the widespread use of relational DBMSs, many powerful tools for such systems were developed, providing support for application development or database administration.

4. Limitations

Although relational databases are frequently used for bioinformatics applications, they cause several major drawbacks, which are described in the following.

4.1. Complex schemas

Bioinformatics research deals with complex objects. These complex objects such as proteins, genes, metabolic pathways, and so on, can often not be modeled adequately with the constructs of relational data models. The resulting data schemas are often rather complex and not intuitive anymore, and thus they are hard to understand and administrate. The information about a single complex biological object is spread over several relations, each describing a single aspect of the object. In addition, data models for bioinformatics databases tend to evolve frequently, increasing the problems of administration.

4.2. Managing biospecific objects

Bioinformatics objects are not only complex to model. Biological entities such as genes and proteins also provide bulky data types that are difficult to model and manage with traditional relational methods. Typically, bulky data types include sequences, for example, nucleotide or amino acid sequences, images, for example, MNR images, set-valued attributes such as molecules consisting of several atoms, and graph- or tree-structured data, for example, pathways or phylogenetic trees. In addition, bioinformatics databases also have to cope with missing attribute values that cannot be adequately and efficiently supported by traditional relational data management concepts.

4.3. Querying

Using SQL is the traditional, most powerful, and most comfortable way to query and extract information from relational databases. To formulate an SQL query, the user must have an overview of the database schema and must know which relations and attributes are relevant for the intended query. Obviously, complex and unintuitive relational schemas are hard to query. Furthermore, in order to manipulate an entire object distributed over several relations, these relations have to be joined during query processing, which is often very time consuming. As a consequence, many bioinformatics databases prevent their users from querying the databases directly using SQL, but instead provide the so-called fixed-form query interfaces. Such fixed-form query interfaces provide a view on the database and allow using a predetermined set of relations and attributes. The queries are allowed only against these interfaces.

5. Discussion

Despite several limitations, relational database management is rather widespread in bioinformatics. Nevertheless, several interesting approaches have been developed recently to cope with the problems and limitations of relational data modeling mentioned above. Some of them are bioinformatics-specific solutions, that is, the approaches are directly motivated by bioinformatics data management. One of these solutions is ACEDB, a DBMS providing common DBMS features such as concurrency and security features but relying on a data model that is similar to the semistructured data model and thus is more flexible and powerful in the modeling aspect. A further bioinformatics-specific solution is OPM, a commercial suite of tools providing a more powerful, extended object-oriented data model including an appropriate query language and a mapping of such OPM models to standard relational DBMSs. In addition, a more general approach is the concept of object-relational DBMSs, developed for all kinds of applications dealing with complex and spatial objects. An object-relational DBMS provides most features of a traditional relational DBMS but supports object-oriented modeling constructs such as nonatomic attribute types. These systems provide the possibility to specify objects with user-defined attribute types and user-defined functions to determine the behavior of such objects. Via a standard interface, these objects are usually accessible by standard SQL. The advantage of these concepts is that the schemas are less complex since the model is more powerful and expressive than the traditional relational model. Furthermore, the object-relational approach also provides all features of relational DBMSs such as SQL querying, concurrency, query optimization, and so on. The newest versions of DBMSs such as ORACLE, DB2, or MS SQL Server are in fact object-relational DBMSs. However, several bioinformatics databases do not use their features yet.

Next post:

Previous post: