Data Mining - Bioinformatics Computing

Tools

For most applications, data mining needn't involve writing neural networks or genetic algorithms in a
traditional programming language. Instead, it can make use of a variety of general-purpose and
bioinformatics-specific tools, as well as several high-level languages (see Table 7-4).

The most common languages used to perform data mining in bioinformatics are Perl, Python, and
SQL. Perl and Python are scripting languages that are useful for implementing custom character- and
string-based data mining for textual and sequence data. As true programming languages, they are
flexible and powerful. The greatest limitation of Perl and Python is that they are interpreted scripting
languages. That is, unlike programs written in C++ or other compiled languages, the scripts are not
compiled ahead of time to machine code, but instead execute at runtime in an interpreter. As a result,
data mining with Python or Perl is typically slower than a well-written C++ program that implements
the same algorithms. The time penalty can be smaller for Python than for Perl, however, when most of
the work is done by extension modules written in C or C++. Using either Perl or Python, a script
defining a data-mining routine can be modified and executed within a few seconds without taking the
time to compile source code. This advantage often outweighs the runtime speed penalty of using an
interpreted language. In addition, Python and Perl are open-source, free programs.
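As an illustration of the string-based mining described above, the following is a minimal Python sketch that scans a sequence for a motif with a regular expression. The DNA fragment, the motif pattern, and the helper name find_motif are hypothetical and chosen only for illustration; a real analysis would read sequences from a file or database.

```python
import re


def find_motif(sequence, pattern):
    """Return (start, matched_text) for each non-overlapping match of pattern."""
    return [(m.start(), m.group()) for m in re.finditer(pattern, sequence)]


# Hypothetical DNA fragment and TATA-box-like motif, for illustration only.
dna = "ATGCGTATAAAGGCTATATAGC"
tata_like = r"TATA[AT]A"

hits = find_motif(dna, tata_like)
for start, text in hits:
    print(f"match {text} at position {start}")
```

Because the script is interpreted, the pattern can be changed and the script rerun immediately, which is the edit-and-run cycle the text refers to.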

Table 7-4. Examples of Data Mining Tools.

Tool                      Examples
------------------------  ----------------------------------------------------------------------
Languages                 Perl, Python, SQL, XML
General-Purpose           Angoss, Clustran, Cross-Graph, Cross-z, Daisy, Data Distilleries,
                          Database Marksman, DataMind, GVA, IBM Intelligent Miner, Insightful
                          Miner, Integral Solutions, KXEN, Magnify, MatLab, NeoVista Solutions,
                          Oracle Darwin, Quadstone, SAS, Spotfire, SPSS Clementine, StatPac,
                          Syllogic, ThinkAnalytics, Thinking Machines, Weka
Bioinformatics-Specific   MEME, PIMA, Pratt, PrattWWW, SPEXS

SQL is also an interpreted language. However, SQL lacks the flexibility of Perl or Python, in that it's
useful only for querying a relational database. This specificity results in high performance, even as an
interpreted language. In addition, SQL isn't a stand-alone application, but is normally part of a
vendor-specific DBMS. The advantage of using SQL is that the language is largely portable from one
relational database system to the next, allowing a researcher to query different database systems
without having to learn a new query language. The core SQL commands are essentially the same,
regardless of whether the database is supplied by Oracle, Microsoft, or IBM, although each vendor
adds its own extensions. Although SQL statements can be submitted interactively, they are frequently
embedded in another language, such as Perl, so that the other language can perform operations on the
returned data, such as writing the data to a new database, plotting the data, or translating it to a new
format.
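The idea of embedding SQL in a host language can be sketched as follows. The text names Perl as the host; this sketch uses Python and its standard sqlite3 module instead, with an in-memory database so that it runs without a DBMS installation. The table name, column names, sample rows, and the to_fasta helper are all hypothetical.

```python
import sqlite3

# In-memory database standing in for a vendor DBMS (an assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sequences (id TEXT PRIMARY KEY, organism TEXT, seq TEXT)")
conn.executemany(
    "INSERT INTO sequences VALUES (?, ?, ?)",
    [
        ("s1", "E. coli", "ATGGCGTAA"),
        ("s2", "E. coli", "ATGCC"),
        ("s3", "H. sapiens", "ATGAAATTTGGGTAG"),
    ],
)

# The SQL statement is embedded in the host language; the value is passed
# as a bound parameter rather than pasted into the query string.
rows = conn.execute(
    "SELECT id, seq FROM sequences WHERE organism = ? ORDER BY id",
    ("E. coli",),
).fetchall()
conn.close()


def to_fasta(records):
    """Translate (id, sequence) rows into FASTA-formatted text."""
    return "".join(f">{rec_id}\n{seq}\n" for rec_id, seq in records)


fasta = to_fasta(rows)
print(fasta, end="")
```

Here SQL does the selection, and the host language translates the returned rows into a new format, which is the division of labor the paragraph describes.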

XML is a data format that's the current darling of online database development because of its
extensibility and use of tags that can provide contextual clues helpful in data mining. A database or
data warehouse built around XML can more readily support data mining than one that only supports
standard relational tables and SQL database queries. A major disadvantage of XML is the lack of
constraints on how it can be extended. Unless external standards are used, databases written by
different programmers using XML may bear little resemblance to each other.
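To show how tags can supply context for mining, here is a small Python sketch that uses the standard xml.etree.ElementTree module to group gene names by organism. The element names (records, record, organism, gene) and the sample data are hypothetical; as noted above, another programmer's XML could use an entirely different vocabulary, and this code would then need to change.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML fragment; the tag names are assumptions, not a standard schema.
xml_text = """
<records>
  <record id="r1">
    <organism>E. coli</organism>
    <gene>lacZ</gene>
  </record>
  <record id="r2">
    <organism>H. sapiens</organism>
    <gene>TP53</gene>
  </record>
</records>
"""

root = ET.fromstring(xml_text)

# The tags say which piece of text is an organism and which is a gene,
# so no positional parsing of the raw text is needed.
genes_by_organism = {}
for record in root.findall("record"):
    organism = record.findtext("organism")
    gene = record.findtext("gene")
    genes_by_organism.setdefault(organism, []).append(gene)

print(genes_by_organism)
```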

In addition to programming languages, there are hundreds of general-purpose stand-alone and
Web-based data-mining applications. Of the commercial data-mining applications, many of the more
