Web Tools for Molecular Biological Data Analysis

INTRODUCTION

Bioinformatics means solving problems arising from biology using methods from computer science. The National Center for Biotechnology Information (www.ncbi.nih.gov) defines bioinformatics as:
“…the field of science in which biology, computer science, and information technology merge into a single discipline…There are three important sub-disciplines within bioinformatics: the development of new algorithms and statistics with which to assess relationships among members of large data sets; the analysis and interpretation of various types of data including nucleotide and amino acid sequences, protein domains, and protein structures; and the development and implementation of tools that enable efficient access and management of different types of information.”
There are many sub-areas in bioinformatics: data comparison, data analysis, DNA assembly, protein structure prediction, data visualization, protein alignment, phylogenetic analysis, drug design, and others.
The biological data (sequences and structures) are naturally very large. In addition, the number of records in the biological databases is increasing every year because of intensive research in the field of molecular biology. Analysis of this overwhelming amount of data requires intelligent bioinformatics tools in order to manage these data efficiently.
During the past two decades, the world has witnessed a technological evolution that has provided an unprecedented new medium of communications to mankind. By means of the World Wide Web, information in all forms has been disseminated throughout the world. Since the beginning, research in bioinformatics primarily used the Internet due to the fast information dissemination it allows, at essentially no cost.
This article aims to discuss some bioinformatics Web tools, but given the accelerated growth of the Web and the instability of the URLs (Uniform Resource Locators), an Internet search engine should be used to identify the current URL.

BACKGROUND

Living organisms possess entities called genes that are the basic inherited units of biological function and structure. An organism inherits its genes from its parents, and relays its own genes to its offspring.
Molecular biologists, in the latter half of the 20th century, determined that the gene is made of DNA (deoxyribonucleic acid)—that is, DNA is the hereditary material of all species. More than 50 years ago, Crick and Watson (1953) discovered the double helix structure of DNA and concluded that this specific form is fundamental to DNA’s function.
Each strand of the DNA double helix is a polymer (a compound made up of small simple molecules) consisting of four elements called nucleotides: A, T, C, and G (for adenine, thymine, cytosine, and guanine). The two strands of DNA are complementary: when a T resides on one strand, an A occupies the corresponding position on the other strand; when there is a G on one strand, a C occupies the corresponding position on the other. The sequence of nucleotides encodes the “instructions” for forming all other cellular components and provides a template for the production of an identical second strand in a process called replication.
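The pairing rule above (A with T, G with C) can be sketched in a few lines. This is a minimal illustration; the function names are chosen here for the example and do not come from any particular tool.

```python
# Watson-Crick pairing rules described above: A<->T and G<->C.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complement(strand: str) -> str:
    """Return the base-by-base complement of a DNA strand."""
    return "".join(COMPLEMENT[base] for base in strand.upper())


def reverse_complement(strand: str) -> str:
    """Return the complementary strand, reversed so it reads in its own direction."""
    return complement(strand)[::-1]


print(complement("ATGC"))          # TACG
print(reverse_complement("ATGC"))  # GCAT
```

Applying complement twice returns the original strand, which reflects why either strand can serve as a template for the other during replication.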
From a computer scientist’s point of view, DNA is an information storage and transmission system. Like the binary alphabet {0,1} used in computers, the alphabet of DNA {A, T, C, G} can encode messages of arbitrary complexity when arranged into long sequences.
The decoding of the genetic information is carried out through intermediary RNA (ribonucleic acid) molecules that are transcribed from specific regions of the DNA. RNA molecules use the base uracil (U) instead of thymine. RNA is then translated into a protein—a chain assembled from the 20 different simple amino acids. Each consecutive triplet of DNA elements specifies one amino acid in a protein chain. Once synthesized, the protein chain folds—according to the laws of chemistry/physics—into a specific shape, based on the properties and order of its amino acids. The structures of a protein can be viewed hierarchically (Lehninger, Nelson & Cox, 2000): primary (linear amino acid sequence), secondary (local sequence elements with well-determined regular shape like α-helices and β-strands), tertiary (formed by packing secondary structures into one or several compact globular units), and quaternary (combination of tertiary structures).
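The two decoding steps described above, transcription (T replaced by U) and translation (one amino acid per triplet), can be sketched as follows. The sketch assumes the standard genetic code and a coding strand that is read from its first base; the function names are illustrative.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code in one-letter amino acid symbols, listed in TCAG order
# for each codon position; "*" marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(dna: str) -> str:
    """Write the coding strand in the RNA alphabet (U instead of T)."""
    return dna.upper().replace("T", "U")


def translate(dna: str) -> str:
    """Translate consecutive DNA triplets until a stop codon or the end."""
    dna = dna.upper()
    protein = []
    for start in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[start:start + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


print(transcribe("ATGGCCTAA"))  # AUGGCCUAA
print(translate("ATGGCCTAA"))   # MA
```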

BIOINFORMATICS WEB TOOLS

Sequence Analysis

There is a known relationship between sequence and structure of proteins, since proteins with similar sequences tend to have similar three-dimensional structures and functions. Sequence alignment methods are useful when it is necessary to predict the structure (or function) of a new protein whose sequence has just been determined. Therefore, alignment provides a powerful tool to compare two (or more) sequences and could reflect a common evolutionary origin.
ClustalW is a general purpose multiple sequence alignment program for DNA or proteins (Higgins et al., 1994). ClustalW currently supports seven multiple sequence formats that are detailed (including examples) in the ClustalW Services Help Menu: NBRF/PIR, EMBL/SwissProt, FASTA, GDE, ALN/ClustalW, GCG/MSF, and RSF. It produces biologically meaningful multiple sequence alignments of divergent sequences and calculates the best match for the selected sequences, considering individual weights, amino acid substitution matrices—like PAM (Altschul, Gish, Miller, Myers & Lipman, 1991) or Blosum (Henikoff & Henikoff, 1992)—and gap penalties (Apostolico & Giancarlo, 1998). After alignment, the identities, similarities, and differences can be seen. ClustalW is freely available on the Internet, either as a Web-based tool or for downloading.
Another tool, T-Coffee (Notredame, Higgins & Heringa, 2000), is more accurate than ClustalW for sequences with less than 30% identity, but much slower. The T-Coffee input must have from 3 to 30 sequences (or 10,000 characters) in the FASTA format. The submission form is simple, but it does not allow user-selected options.
Cinema—Colour INteractive Editor for Multiple Alignments—is a program for sequence alignment that allows visualization and manipulation of both protein and DNA sequences (Parry-Smith, Payne, Michie & Attwood, 1997). It is a complete package in Java, locally installed, that runs on most platforms. This tool allows upload of an alignment from a local computer to the Cinema server. The input file must be in PIR format and may then be imported into Cinema via the Load Alignment File option.
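To make the role of substitution scores and gap penalties concrete, here is a minimal sketch of pairwise global alignment by dynamic programming (the Needleman-Wunsch method). It is not ClustalW's algorithm, which aligns many sequences progressively; the simple match/mismatch scores stand in for a substitution matrix such as PAM or Blosum, and the default values are arbitrary assumptions.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Globally align a and b; return (aligned_a, aligned_b, score)."""
    n, m = len(a), len(b)

    def sub(i, j):
        # Stand-in for a substitution matrix lookup.
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] = best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), score[n][m]


print(needleman_wunsch("ACGT", "AGT"))  # ('ACGT', 'A-GT', 1)
```

Raising the gap penalty makes the method prefer mismatches over insertions and deletions, which is the trade-off a user controls when setting gap penalties in an alignment tool.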
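Since several of the tools in this article take FASTA input, a minimal FASTA reader is sketched below, together with a pre-submission check against the limits quoted above. Reading "10,000 characters" as a cap on the total number of residues is an assumption made for this example, and the function names are hypothetical.

```python
def parse_fasta(text):
    """Parse FASTA text into a dict mapping record name to sequence."""
    chunks = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()
            # Use the first word of the header as the name; fall back to a counter.
            name = header[0] if header else f"seq{len(chunks) + 1}"
            chunks[name] = []
        elif name is not None:
            chunks[name].append(line)
    return {key: "".join(parts) for key, parts in chunks.items()}


def within_limits(records, min_seqs=3, max_seqs=30, max_chars=10_000):
    """Check sequence count and total length (limits are assumptions)."""
    total = sum(len(seq) for seq in records.values())
    return min_seqs <= len(records) <= max_seqs and total <= max_chars


example = ">s1 first\nACGT\nACGT\n>s2\nAC-T\n>s3\nACGA\n"
records = parse_fasta(example)
print(records["s1"], within_limits(records))  # ACGTACGT True
```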

Structural Analysis

The Dali—Distance mAtrix aLIgnment—server (Holm & Sander, 1994) is a network service for comparing three-dimensional (3D) protein structures. Once the coordinates of a query protein structure are submitted, Dali compares them against those in the Protein Data Bank (PDB). The input file must be in the PDB format (Berman et al., 2000) and can be submitted by e-mail or interactively from the Web. The input options are disabled. The results are mailed back to the user. In favorable cases, comparing 3D structures may reveal biologically interesting similarities that are not detectable in primary sequences. The Dali database is built from an exhaustive all-against-all 3D structure comparison of the protein structures currently in the PDB. The classification and alignments are continuously updated using the Dali search engine.
The Macromolecular Structure Database tool (MSD) (Golovin et al., 2004) allows one to search the active site database based on ligand or active site information. The PDB contains a significant number of protein structures that have ligands bound which are often more highly conserved across a functional family than the overall structure and fold of the macromolecule. The target of the search can be based on an uploaded file. It is possible to limit the scope of a search using restrictions based on author, keywords, experiment, resolution, and release date. Results of the search are presented in a list of PDB ID codes that can be analyzed further or viewed within a structure viewer like Rasmol (Sayle & Milner-White, 1995)—a program intended for the visualization of proteins, nucleic acids, and small molecules.
Swiss-Model (Schwede, Kopp, Guex & Peitsch, 2003) is a server for automated comparative modeling of 3D protein structures, and provides several levels of user interaction using a Web-based interface: in the first approach mode, only an amino acid sequence is submitted to build a 3D model. It is also accessible from the program DeepView—an integrated sequence-to-structure workbench. All models are mailed back with a detailed modeling report. Template selection, alignment, and model building are automatically done by the server. The Swiss-Model alignment interface allows the submission of multiple sequence alignments in the following formats: FASTA, MSF, ClustalW, PFAM, and SELEX. The alignment must contain at least the target sequence and one template from the ExPDB template library, because the modeling process is based on this user-defined template.
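The name Dali points to the underlying representation: a structure can be described by the matrix of distances between its residues, which does not change when the molecule is moved in space. The sketch below builds such a matrix, assuming one (x, y, z) point per residue; the function name and the toy coordinates are illustrative, and this is not the Dali comparison algorithm itself.

```python
import math


def distance_matrix(coords):
    """Return the matrix of pairwise Euclidean distances between points."""
    return [[math.dist(p, q) for q in coords] for p in coords]


# Toy coordinates, one point per residue (made up for the example).
structure = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (0.0, 0.0, 1.0)]
# The same structure translated by (10, -2, 5).
shifted = [(x + 10.0, y - 2.0, z + 5.0) for x, y, z in structure]

matrix = distance_matrix(structure)
print(matrix[0][1])  # 5.0
```

Because the matrix for the shifted copy equals the original up to rounding, two structures can be compared through their distance matrices without first superimposing them.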

Homology and Similarity Tools

The extent of similarity between two sequences can be based on percent sequence identity and/or conservation. In BLAST—Basic Local Alignment Search Tool (Altschul et al., 1990)—similarity refers to a positive matrix score. A homology class is a class whose members have been inferred to have evolved from a common ancestor.
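Percent sequence identity can be computed directly from two aligned sequences. The sketch below counts identical residues over the columns where neither sequence has a gap; other denominators (alignment length, shorter sequence length) are also in use, so this choice is an assumption of the example.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of identical residues over columns without a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    columns = [
        (x, y) for x, y in zip(aligned_a, aligned_b) if x != gap and y != gap
    ]
    if not columns:
        return 0.0
    matches = sum(1 for x, y in columns if x == y)
    return 100.0 * matches / len(columns)


print(percent_identity("ACGT", "AGGT"))  # 75.0
print(percent_identity("ACGT", "A-GT"))  # 100.0
```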
The BLAST input file can be a sequence in FASTA format, lines of sequence data, or NCBI sequence identifiers, as explained at the BLAST Search Format form. BLAST emphasizes regions of local alignment to detect relationships among sequences which share only isolated regions of similarity. Results (many formats available) can be seen by using the browser or e-mailed to the user. Since the BLAST algorithm can detect both local and global alignments, regions of similarity in unrelated proteins can be detected, such that the discovery of similarities may provide important clues to the function of uncharacterized proteins.
FASTA (Pearson, 2000) provides sequence similarity and homology searching between a protein sequence and another protein sequence or a protein database, or between a DNA sequence and another DNA sequence or a DNA library. It can be very specific when identifying long regions of low similarity, especially for highly diverged sequences. Release 3.x of the FASTA package provides a modular set of sequence comparison programs that can run on conventional single processor computers or in parallel on multiprocessor computers.
BLAST and FASTA are used for database searching because they are fast. They use slightly different approaches to discover similar sequences, but both make refinements during the search process to increase the search speed.
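One general way such programs gain speed is to look first for short exact word matches and examine only the database sequences that share words with the query. The sketch below illustrates that idea with a simple word index; it is a simplified illustration under that assumption, not the actual BLAST or FASTA implementation, and the names and word length are chosen for the example.

```python
from collections import defaultdict


def build_word_index(database, k=3):
    """Map every length-k word to the (sequence name, position) pairs where it occurs."""
    index = defaultdict(list)
    for name, seq in database.items():
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]].append((name, pos))
    return index


def count_word_hits(query, index, k=3):
    """Count, per database sequence, the exact word matches with the query."""
    hits = defaultdict(int)
    for pos in range(len(query) - k + 1):
        for name, _ in index.get(query[pos:pos + k], []):
            hits[name] += 1
    return dict(hits)


database = {"s1": "ACGTACGT", "s2": "TTTTTTTT"}
index = build_word_index(database)
print(count_word_hits("CGTA", index))  # {'s1': 3}
```

Only sequences with hits (here s1) would be passed on to a slower, more careful alignment step.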

Protein Function Analysis

InterProScan (Zdobnov & Apweiler, 2001) is a tool that combines different protein signature recognition methods into a single resource. The number of signature databases and their associated scanning tools, as well as the further refinement procedures, increase the complexity of the problem. InterProScan performs considerable data lookups from databases and program outputs. This Web tool allows the input of protein sequences, either in single or multiple files. The accepted input formats for protein sequences are free text, FASTA, and UniProt; for nucleotide sequences, the GenBank format is also accepted.
GeneQuiz (Scharf et al., 1994) is an integrated system for large-scale biological sequence analysis that goes from a protein sequence to a biochemical function, using a variety of search and analysis methods and up-to-date protein and DNA databases. The input file must be in the FASTA format; the maximum number of sequences that can be uploaded per day is 12, and the maximum number of amino acids is 18,000. GeneQuiz automatically runs a series of sequence analysis tools, including BLAST and FASTA. The results are displayed as structured text. The server is freely provided to the biological research community.

STING Millennium Suite—SMS (Neshich et al., 2003)—is a Web-based suite of programs and databases providing visualization and analysis of molecular sequence and structure for the PDB data. SMS operates with a huge collection of data (PDB, HSSP, Prosite). STING Millennium is both a didactic and a research tool. The interface is user friendly, and there are many options available for macromolecular studies.
ExPASy—Expert Protein Analysis System (Gasteiger et al., 2003)—provides access to a variety of databases and analytical tools dedicated to proteins and proteomics, including Swiss-Prot and TrEMBL, Swiss-2Dpage, PROSITE, ENZYME, and the Swiss-Model repository. There is also UniProt (Universal Protein Resource) (Apweiler et al., 2004), a catalog of protein information, a central repository of protein sequence and function created by joining Swiss-Prot, TrEMBL, and PIR analysis tools. Other tools are also available at ExPASy: pattern and profile searches; topology prediction; primary, secondary, and tertiary structure analysis; sequence alignment; and others.
PSORT (Gardy et al., 2003) is a set of computational methods that make predictions for the protein sites in a cell, examining a given protein sequence for amino acid composition, similarity to proteins of known localization, presence of a signal peptide, transmembrane alpha-helices, and motifs corresponding to specific localizations. A version of PSORT-B for Linux platforms has been recently released. PSORT is recommended for bacterial/plant sequences, but PSORT-B currently accepts only protein sequences from Gram-negative bacteria. The input file format is FASTA, and the output format can be selected by the user from among formats described in the PSORT documentation.

RNA Analysis

The RNAsoft suite of programs (Andronescu, Aguirre-Hernandez, Condon & Hoos, 2003) provides tools for predicting the secondary structure of a pair of DNA or RNA molecules, testing that combinatorial tag sets of DNA and RNA molecules have no unwanted secondary structure, and designing RNA strands that fold to a given input secondary structure. The tools are based on standard thermodynamic models of RNA secondary structure formation and can be used for prediction as well as design of molecular structures. RNAsoft online access is freely available; however, some restrictions have been imposed on the size of the input data, in order not to overload the server.
The Vienna RNA package consists of a portable ISO C code library and several programs for the prediction and comparison of RNA secondary structures. This tool has a Web interface to the RNAfold program (Hofacker, 2003) and can predict secondary structures of single-stranded RNA or DNA sequences. The input file is a string of letters, consisting of A, U, G, and C. The result page shows the optimal structure, the energy of the optimal structure, and the ensemble free energy, if requested by the user.
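To give a feel for what secondary structure prediction computes, the sketch below finds the maximum number of nested base pairs in an RNA string by dynamic programming (the Nussinov approach). It deliberately swaps the thermodynamic energy model used by tools such as RNAfold for simple pair counting, allows only A-U and G-C pairs, and assumes a minimum hairpin loop of three unpaired bases; all of these are simplifying assumptions.

```python
# Watson-Crick pairs only (an assumption of this sketch).
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}


def max_base_pairs(rna, min_loop=3):
    """Return the maximum number of nested base pairs in rna."""
    rna = rna.upper()
    n = len(rna)
    if n == 0:
        return 0
    # best[i][j] = maximum number of pairs within rna[i..j] (inclusive).
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            score = best[i + 1][j]  # case 1: base i stays unpaired
            # case 2: base i pairs with some base k, leaving a loop of at
            # least min_loop bases between them
            for k in range(i + min_loop + 1, j + 1):
                if (rna[i], rna[k]) in PAIRS:
                    inside = best[i + 1][k - 1]
                    outside = best[k + 1][j] if k + 1 <= j else 0
                    score = max(score, 1 + inside + outside)
            best[i][j] = score
    return best[0][n - 1]


print(max_base_pairs("GGGAAACCC"))  # 3
```

In the example, the three G bases pair with the three C bases around an AAA hairpin loop.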

Others

A set of tools for clustering, analysis, and visualization of gene expression and other genomic data can be found in Expression Profiler—EP (Vilo, Kapushesky, Kemmeren, Sarkans & Brazma, 2003). In addition, EP allows searching gene ontology categories, generating sequence logos, extracting regulatory sequences, and studying protein interactions. It also links analysis results to external tools and databases.
ESTAnnotator (Hotz-Wagenblatt et al., 2003) is a tool for the high-throughput annotation of expressed sequence tags (ESTs) by automatically running a collection of bioinformatics methods. There are four main steps: first, a repeated quality check is performed and low-quality sequences are masked; second, successive steps of database searching and EST clustering are performed; third, already known transcripts that are present within mRNA and genomic DNA reference databases are identified; and finally, tools for the clustering of anonymous ESTs and for further database searches at the protein level are applied. The outputs are presented in a descriptive summary. ESTAnnotator has already been successfully applied to the systematic identification and characterization of novel human genes involved in cartilage/bone formation, growth, differentiation, and homeostasis.

FUTURE TRENDS AND CONCLUSION

The WWW has a vast collection of bioinformatics tools that offer immense possibilities for sharing, researching, and disseminating biological information. As presented, the bioinformatics research community has used Web-based application platforms intensively as the main suite for biological data analysis. Unfortunately, some application platforms use tools such as ClustalW, BLAST, or FASTA to analyze and search data from different databanks. This requires extra programs or software components to convert the output data of these programs from one format to another. This not only complicates the software development process, but also sometimes distracts from the main research intention. For example, programs such as ClustalW output their results in a particular format, and this format cannot be easily parsed. A small change of the ClustalW source, such as an extra or missing field in the output, could break a program that depends on that output. The same problem happens with biological databases such as Swiss-Prot, PDB, and others.
In order to organize this area, Nucleic Acids Research, one of the most important journals of this field, has devoted its first issue each year, over the last several years, to documenting the availability and features of the specialized databases.
There is also an XML (eXtensible Markup Language) framework (Shui, Wong, Graham, Lee & Church, 2003) proposal to integrate different biological databanks into a unified XML framework. The proposed framework has been implemented with an emphasis on reusing the existing bioinformatics data and tools.
As the Web continues to expand, newer tools will arise, and new challenges will be proposed to researchers. In fact, a safe prediction can be made: Web-based tools are invaluable for the daily work of those in this exciting area of bioinformatics, and their availability and use will continue to grow.
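The parsing problem can be illustrated with a small reader for ClustalW-style alignment (ALN) text. The sketch relies only on the general layout of that format (a header line starting with CLUSTAL, then blocks of name and sequence columns, with consensus lines that start with whitespace). Each sequence line may or may not end with a residue count, which is exactly the kind of extra field that breaks a stricter parser. The sample text is hand-made for the example.

```python
def parse_clustal(text):
    """Parse ClustalW-style alignment text into {name: aligned sequence}."""
    lines = text.splitlines()
    if not lines or not lines[0].startswith("CLUSTAL"):
        raise ValueError("not a CLUSTAL alignment")
    chunks = {}
    for line in lines[1:]:
        # Skip blank lines and consensus lines (which start with whitespace).
        if not line.strip() or line[0].isspace():
            continue
        fields = line.split()
        if len(fields) < 2:
            continue
        # Use only the first two fields so a trailing residue count is ignored.
        name, piece = fields[0], fields[1]
        chunks.setdefault(name, []).append(piece)
    return {name: "".join(parts) for name, parts in chunks.items()}


sample = "\n".join([
    "CLUSTAL W (1.83) multiple sequence alignment",
    "",
    "seq1      ACGT-ACGT 8",
    "seq2      ACGTTACGT 9",
    "          **** ****",
    "",
    "seq1      GG",
    "seq2      GG",
    "          **",
])
alignment = parse_clustal(sample)
print(alignment["seq1"])  # ACGT-ACGTGG
```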
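As a rough illustration of why a common XML representation helps, the sketch below wraps a sequence record in XML and reads it back with a standard parser, so no tool-specific text parsing is needed. The element and attribute names (record, id, source, sequence) and the sample values are hypothetical and are not taken from the cited framework.

```python
import xml.etree.ElementTree as ET


def record_to_xml(identifier, source, sequence):
    """Serialize one sequence record as an XML string (hypothetical schema)."""
    record = ET.Element("record", attrib={"id": identifier, "source": source})
    ET.SubElement(record, "sequence").text = sequence
    return ET.tostring(record, encoding="unicode")


def xml_to_record(xml_text):
    """Read a record back into an (identifier, source, sequence) tuple."""
    record = ET.fromstring(xml_text)
    return record.get("id"), record.get("source"), record.findtext("sequence")


xml_text = record_to_xml("example1", "example-db", "MKTAYIAK")
print(xml_to_record(xml_text))  # ('example1', 'example-db', 'MKTAYIAK')
```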

KEY TERMS

Alignment: Explicit mapping of characters of a sequence to characters of one or more other sequence(s).
Alpha-Helix: A helical conformation of a polypeptide chain, one of the most common secondary structures in proteins.
Base Pair: Two nucleotides in nucleic acid chains are paired by hydrogen bonding of their bases; for example A with T or U, and G with C.
DNA (deoxyribonucleic acid): A specific sequence of deoxyribonucleotide units covalently joined through phosphodiester bonds.
Domain: Combines several secondary structure elements and motifs; has a specific function.
Genome: The genetic information of an organism.
Homology: Relationship by evolutionary descent from a common ancestral precursor.
Motif: Combines a few secondary structure elements with a specific geometric arrangement.
Protein: A macromolecule composed of one or more polypeptide chains, each with a characteristic sequence of amino acids linked by peptide bonds.
RNA (ribonucleic acid): A specific sequence of ribonucleotide units linked by successive phosphodiester bonds.
Similarity: Maximum degree of match between two aligned sequences as indicated by some (arbitrarily chosen) scoring function, for example, percent identity.