Data Mining Trends and Research Frontiers - Data Mining: Concepts and Techniques


telescopes, multispectral high-resolution remote satellite sensors, global positioning systems, and new generations of biological data collection and analysis technologies. Large data sets are also being generated due to fast numeric simulations in various fields such as climate and ecosystem modeling, chemical engineering, fluid dynamics, and structural mechanics. Here we look at some of the challenges brought about by emerging scientific applications of data mining.

Data warehouses and data preprocessing: Data preprocessing and data warehouses are critical for information exchange and data mining. Creating a warehouse often requires finding means for resolving inconsistent or incompatible data collected in multiple environments and at different time periods. This requires reconciling semantics, referencing systems, geometry, measurements, accuracy, and precision. Methods are needed for integrating data from heterogeneous sources and for identifying events.

For instance, consider climate and ecosystem data, which are spatial and temporal and require cross-referencing geospatial data. A major problem in analyzing such data is that there are too many events in the spatial domain but too few in the temporal domain. For example, El Niño events occur only every four to seven years, and previous data on them might not have been collected as systematically as they are today. Methods are also needed for the efficient computation of sophisticated spatial aggregates and the handling of spatial-related data streams.
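To make the idea of a spatial aggregate over a data stream concrete, here is a minimal sketch that averages a measurement over latitude/longitude grid cells in a single pass. The function name `grid_mean`, the one-degree cell size, and the sample readings are illustrative assumptions, not part of the original text; real systems would typically rely on spatial indexes and more sophisticated aggregates.

```python
from collections import defaultdict
from math import floor


def grid_mean(readings, cell_deg=1.0):
    """Average a measurement over latitude/longitude grid cells.

    `readings` is any iterable of (lat, lon, value) tuples, so it can be
    a list or a stream consumed one record at a time. Only a running sum
    and a count are kept per cell.
    """
    sums = defaultdict(lambda: [0.0, 0])  # cell -> [running total, count]
    for lat, lon, value in readings:
        # Hypothetical cell key: the integer index of the cell along each axis.
        cell = (floor(lat / cell_deg), floor(lon / cell_deg))
        acc = sums[cell]
        acc[0] += value
        acc[1] += 1
    return {cell: total / count for cell, (total, count) in sums.items()}


# Made-up sample readings: (latitude, longitude, temperature).
readings = [
    (10.2, 20.1, 25.0),
    (10.7, 20.9, 27.0),
    (11.3, 20.4, 22.0),
]
means = grid_mean(readings)
print(means)
```

Because each cell stores only a sum and a count, memory use grows with the number of occupied cells rather than with the number of readings, which is the property that matters for stream data.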

Mining complex data types: Scientific data sets are heterogeneous in nature. They typically involve semi-structured and unstructured data, such as multimedia data and georeferenced stream data, as well as data with sophisticated, deeply hidden semantics (e.g., genomic and proteomic data). Robust and dedicated analysis methods are needed for handling spatiotemporal data, biological data, related concept hierarchies, and complex semantic relationships. For example, in bioinformatics, a research problem is to identify regulatory influences on genes. Gene regulation refers to how genes in a cell are switched on (or off) to determine the cell's functions. Different biological processes involve different sets of genes acting together in precisely regulated patterns. Thus, to understand a biological process we need to identify the participating genes and their regulators. This requires the development of sophisticated data mining methods to analyze large biological data sets for clues about regulatory influences on specific genes, by finding DNA segments (“regulatory sequences”) mediating such influence.
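As a deliberately simplified illustration of searching for candidate regulatory sequences, the sketch below scans upstream DNA strings for exact occurrences of a short motif. The gene names, sequences, and the function `motif_hits` are hypothetical, and exact string matching is an assumption made for brevity; practical methods generally use statistical motif models rather than exact matches.

```python
def motif_hits(upstream, motif):
    """Return {gene: [start positions]} for genes whose upstream
    sequence contains the motif at least once (exact match only)."""
    width = len(motif)
    hits = {}
    for gene, seq in upstream.items():
        # Slide a window of the motif's width along the sequence.
        positions = [
            i for i in range(len(seq) - width + 1)
            if seq[i:i + width] == motif
        ]
        if positions:
            hits[gene] = positions
    return hits


# Hypothetical upstream regions for three made-up genes.
upstream = {
    "geneA": "ACGTATAAAGGC",
    "geneB": "GGGCCCGGGCCC",
    "geneC": "TATAAATATAAA",
}
hits = motif_hits(upstream, "TATAAA")
print(hits)
```

Genes that share a motif in their upstream regions are only candidates for common regulation; the match itself is a clue, not evidence of a regulatory influence.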

Graph-based and network-based mining: It is often difficult or impossible to model several physical phenomena and processes due to limitations of existing modeling approaches. Alternatively, labeled graphs and networks may be used to capture many of the spatial, topological, geometric, biological, and other relational characteristics present in scientific data sets. In graph or network modeling, each object to be mined is represented by a vertex in a graph, and edges between vertices represent relationships between objects. For example, graphs can be used to model chemical structures, biological pathways, and data generated by numeric
