telescopes, multispectral high-resolution remote satellite sensors, global positioning sys-
tems, and new generations of biological data collection and analysis technologies. Large
data sets are also being generated due to fast numeric simulations in various fields such
as climate and ecosystem modeling, chemical engineering, fluid dynamics, and struc-
tural mechanics. Here we look at some of the challenges brought about by emerging
scientific applications of data mining.
Data warehouses and data preprocessing: Data preprocessing and data warehouses
are critical for information exchange and data mining. Creating a warehouse often
requires finding means for resolving inconsistent or incompatible data collected
in multiple environments and at different time periods. This requires reconcil-
ing semantics, referencing systems, geometry, measurements, accuracy, and preci-
sion. Methods are needed for integrating data from heterogeneous sources and for
identifying events.
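To make the reconciliation step concrete, here is a minimal sketch of normalizing temperature readings from two sources into one common record. The two source layouts, the field names (`timestamp`, `temp_c`, `epoch`, `temp_f`, `site_id`), and the `Reading` type are all hypothetical, chosen only to illustrate differing referencing systems and measurement units.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Reading:
    """Common target record: UTC time, degrees Celsius (hypothetical schema)."""
    station: str
    time_utc: datetime
    temp_c: float


def from_source_a(row: dict) -> Reading:
    # Assumed source A: ISO 8601 timestamp with a UTC offset, Celsius values.
    t = datetime.fromisoformat(row["timestamp"]).astimezone(timezone.utc)
    return Reading(row["station"], t, float(row["temp_c"]))


def from_source_b(row: dict) -> Reading:
    # Assumed source B: Unix epoch seconds, Fahrenheit values.
    t = datetime.fromtimestamp(row["epoch"], tz=timezone.utc)
    temp_c = (float(row["temp_f"]) - 32.0) * 5.0 / 9.0
    return Reading(row["site_id"], t, round(temp_c, 2))


a = from_source_a({"station": "S1",
                   "timestamp": "2020-01-01T12:00:00+02:00",
                   "temp_c": "21.5"})
b = from_source_b({"site_id": "S1", "epoch": 1577872800, "temp_f": 212})
```

Once both sources map to the same record type, readings taken at the same instant can be compared or merged directly. A real warehouse would also need to reconcile accuracy, precision, and geometry, which this sketch leaves out.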
For instance, consider climate and ecosystem data, which are spatial and tempo-
ral and require cross-referencing geospatial data. A major problem in analyzing such
data is that there are too many events in the spatial domain but too few in the
temporal domain. For example, El Niño events occur only every four to seven years, and
previous data on them might not have been collected as systematically as they are
today. Methods are also needed for the efficient computation of sophisticated spatial
aggregates and the handling of spatial-related data streams.
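As a rough illustration of a spatial aggregate computed over a data stream, the sketch below keeps a running mean per latitude/longitude grid cell, updating in constant time per observation. The `GridMean` class, its method names, and the fixed-size square cells are assumptions made for this example. Real systems would typically use a proper spatial index and a geodetically sound cell scheme.

```python
import math
from collections import defaultdict


class GridMean:
    """Running mean of a measured value per grid cell (illustrative only)."""

    def __init__(self, cell_deg: float = 1.0):
        self.cell_deg = cell_deg          # cell width in degrees (assumed)
        self._count = defaultdict(int)    # observations seen per cell
        self._sum = defaultdict(float)    # sum of values per cell

    def _cell(self, lat: float, lon: float) -> tuple:
        # floor() keeps negative coordinates in the correct cell
        return (math.floor(lat / self.cell_deg),
                math.floor(lon / self.cell_deg))

    def add(self, lat: float, lon: float, value: float) -> None:
        # Process one streamed observation without storing it
        cell = self._cell(lat, lon)
        self._count[cell] += 1
        self._sum[cell] += value

    def mean(self, lat: float, lon: float):
        # Mean of the cell containing (lat, lon), or None if it is empty
        cell = self._cell(lat, lon)
        n = self._count.get(cell, 0)
        return self._sum[cell] / n if n else None


grid = GridMean(cell_deg=1.0)
grid.add(10.2, 20.7, 4.0)
grid.add(10.9, 20.1, 6.0)
grid.add(-0.5, 20.1, 1.0)
```

Because only a count and a sum are kept per cell, memory grows with the number of occupied cells rather than with the length of the stream.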
Mining complex data types: Scientific data sets are heterogeneous in nature. They
typically involve semi-structured and unstructured data, such as multimedia data
and georeferenced stream data, as well as data with sophisticated, deeply hidden
semantics (e.g., genomic and proteomic data). Robust and dedicated analysis meth-
ods are needed for handling spatiotemporal data, biological data, related concept
hierarchies, and complex semantic relationships. For example, in bioinformatics,
a research problem is to identify regulatory influences on genes. Gene regulation
refers to how genes in a cell are switched on (or off) to determine the cell's func-
tions. Different biological processes involve different sets of genes acting together
in precisely regulated patterns. Thus, to understand a biological process we need to
identify the participating genes and their regulators. This requires the development
of sophisticated data mining methods to analyze large biological data sets for clues
about regulatory influences on specific genes, by finding DNA segments (“regulatory
sequences”) mediating such influence.
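One very simple way to look for candidate regulatory sequences is to find short DNA substrings (k-mers) shared by the upstream regions of genes believed to be co-regulated. The sketch below is a toy version of that idea, not an established motif-finding method. The function name `shared_kmers`, the `min_fraction` parameter, and the example sequences are made up for illustration, and practical methods also model background frequencies and allow inexact matches.

```python
from collections import Counter


def shared_kmers(sequences, k, min_fraction=1.0):
    """Return k-mers occurring in at least min_fraction of the sequences."""
    support = Counter()
    for seq in sequences:
        seq = seq.upper()
        # Use a set so each sequence contributes at most once per k-mer
        kmers = {seq[i:i + k] for i in range(len(seq) - k + 1)}
        support.update(kmers)
    threshold = min_fraction * len(sequences)
    return sorted(km for km, n in support.items() if n >= threshold)


# Made-up upstream regions of three hypothetical co-regulated genes
upstream = ["ACGTTATAAG", "GGTATAAC", "TTTATAAGG"]
in_all = shared_kmers(upstream, k=5)
in_most = shared_kmers(upstream, k=5, min_fraction=0.6)
```

Here `in_all` holds the 5-mers present in every sequence, and lowering `min_fraction` admits k-mers found in only some of them.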
Graph-based and network-based mining: It is often difficult or impossible to
model several physical phenomena and processes due to limitations of existing
modeling approaches. Alternatively, labeled graphs and networks may be used to
capture many of the spatial, topological, geometric, biological, and other relational
characteristics present in scientific data sets. In graph or network modeling, each
object to be mined is represented by a vertex in a graph, and edges between ver-
tices represent relationships between objects. For example, graphs can be used to
model chemical structures, biological pathways, and data generated by numeric
 