Tools
For most applications, data mining needn't involve writing neural networks or genetic algorithms in a
traditional programming language. Instead, it can make use of a variety of general-purpose and
bioinformatics-specific tools, as well as several high-level languages (see Table 7-4).
The most common languages used to perform data mining in bioinformatics are Perl, Python, and
SQL. Perl and Python are scripting languages that are useful for implementing custom character- and
string-based data mining for textual and sequence data. As true programming languages, they are
flexible and powerful. The greatest limitation of Perl and Python is that they are interpreted scripting
languages. That is, unlike C++ or other high-performance languages, the scripts are not compiled to
native machine code ahead of time, but instead execute at runtime in an interpreter. As a result, data
mining with Python or Perl is generally slower than a well-written C++ program that uses the same
algorithms. The time penalty associated with Python can be considerably less than that associated
with Perl, however, when most of the work is done by Python modules that are themselves
implemented in C or C++. Using either Perl or Python, a script defining a data-mining routine can be
modified and executed within a few seconds without taking the time to compile source code. This
advantage often outweighs the runtime speed penalty of using an interpreted language. In addition,
Python and Perl are open-source, free programs.
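As an illustration of the string-based mining of sequence data described above, the following is a minimal Python sketch rather than a method taken from the text. The sequence, the motif, and the function name find_motif are all hypothetical and chosen only for demonstration; the sketch assumes that a regular expression is an adequate description of the pattern being sought.

```python
import re

# Hypothetical DNA sequence and motif, invented for illustration only.
SEQUENCE = "ATGCGTATAAAGGCTATAAATCGTTATAAACG"
MOTIF = re.compile(r"TATAAA")


def find_motif(sequence, pattern):
    """Return the 0-based start position of every non-overlapping match."""
    return [match.start() for match in pattern.finditer(sequence)]


positions = find_motif(SEQUENCE, MOTIF)
print(positions)
```

Because the script runs directly in the interpreter, the motif can be edited and the search repeated immediately, which is the quick modify-and-execute cycle the paragraph above describes.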
Table 7-4. Examples of Data Mining Tools.

Tool                     Examples
Languages                Perl, Python, SQL, XML
General-Purpose          Angoss, Clustran, Cross-Graph, Cross-z, Daisy, Data Distilleries,
                         Database Marksman, DataMind, GVA, IBM Intelligent Miner,
                         Insightful Miner, Integral Solutions, KXEN, Magnify, MatLab,
                         NeoVista Solutions, Oracle Darwin, Quadstone, SAS, Spotfire,
                         SPSS Clementine, StatPac, Syllogic, ThinkAnalytics,
                         Thinking Machines, Weka
Bioinformatics-Specific  MEME, PIMA, Pratt, PrattWWW, SPEXS

SQL is also an interpreted language. However, SQL lacks the flexibility of Perl or Python, in that it's
useful only for working with a relational database. This specificity results in high performance, even
as an interpreted language. In addition, SQL isn't a stand-alone application, but is normally part of a
vendor-specific DBMS. The advantage of using SQL is that the core language is largely portable from
one relational database system to the next, allowing a researcher to query different database
systems without having to learn a new query language. Standard SQL commands are essentially the
same whether the database is supplied by Oracle, Microsoft, or IBM, although each vendor adds its
own extensions. Although SQL statements can be submitted manually at an interactive prompt, they
are frequently embedded in another language, such as Perl, so that the other language can perform
operations on the returned data, such as writing the data to a new database, plotting the data, or
translating it to a new format.
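The idea of embedding SQL in a host language can be sketched as follows. The text names Perl as the host; this sketch uses Python and its standard sqlite3 module instead, purely as an assumption for illustration. The table name sequences, its columns, the sample rows, and the function rows_to_fasta are hypothetical. The host language submits the query and then translates the returned rows into FASTA-style text.

```python
import sqlite3


def rows_to_fasta(conn, min_length):
    """Query the database with SQL, then reformat the rows in Python."""
    rows = conn.execute(
        "SELECT name, residues FROM sequences "
        "WHERE LENGTH(residues) >= ? ORDER BY name",
        (min_length,),
    ).fetchall()
    # Translate each returned row into a FASTA-style record.
    return "".join(f">{name}\n{residues}\n" for name, residues in rows)


# In-memory database populated with made-up rows for demonstration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sequences (name TEXT, residues TEXT)")
conn.executemany(
    "INSERT INTO sequences (name, residues) VALUES (?, ?)",
    [("seqA", "ATGCGT"), ("seqB", "ATG"), ("seqC", "GGCATTA")],
)

fasta_text = rows_to_fasta(conn, 5)
print(fasta_text)
```

The SELECT statement itself uses only widely supported SQL, so in principle the same query text could be sent to another relational database by swapping the connection object, although driver details and parameter placeholder syntax differ between systems.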
XML is a data format that's the current darling of online database development because of its
extensibility and use of tags that can provide contextual clues helpful in data mining. A database or
data warehouse built around XML can more readily support data mining than one that only supports
standard relational tables and SQL database queries. A major disadvantage of XML is the lack of
constraints on how it can be extended. Unless external standards are used, databases written by
different programmers using XML may bear little resemblance to each other.
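To show how tags can supply context for data mining, here is a minimal Python sketch using the standard xml.etree.ElementTree module. The element names (records, gene, symbol, organism), the gene symbols, and the function symbols_for_organism are hypothetical and follow no published schema, which also illustrates the point above: another programmer could label the same data with entirely different tags.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML records; the tag names follow no external standard.
XML_TEXT = """
<records>
  <gene id="g1">
    <symbol>GENE_A</symbol>
    <organism>Homo sapiens</organism>
  </gene>
  <gene id="g2">
    <symbol>GENE_B</symbol>
    <organism>Mus musculus</organism>
  </gene>
  <gene id="g3">
    <symbol>GENE_C</symbol>
    <organism>Homo sapiens</organism>
  </gene>
</records>
"""


def symbols_for_organism(xml_text, organism):
    """Use the tags as context: collect symbols whose organism tag matches."""
    root = ET.fromstring(xml_text.strip())
    return [
        gene.findtext("symbol")
        for gene in root.iter("gene")
        if gene.findtext("organism") == organism
    ]


human_symbols = symbols_for_organism(XML_TEXT, "Homo sapiens")
print(human_symbols)
```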
In addition to programming languages, there are hundreds of general-purpose stand-alone and Web-
based data-mining applications. Of the commercial data-mining applications, many of the more
 
 