Interactive Comprehensible Data Mining - Ambient Intelligence for Scientific Discovery

Information Technology Reference

In-Depth Information

2 Approach

2.1

Visualization

The visualization engine used in the Haiku system provides an abstract 3-d

perspective of multi-dimensional data based on the Hyper system[7-9] for force

based visualization. The visualization consists of nodes and links (similar to a

ball-and-stick model, only dynamic), whose properties are given by the param-

eters of the data. Data elements affect parameters such as node size, mass, link

strength and elasticity, and so on. Multiple elements can affect one parameter,

or a subset of parameters can be chosen.

Many forms of data can be visualized in Haiku. Typical data for data mining

consists of a number of individual “items” (representing, for example, customers)

each with the same number of numerical and/or nominal attributes. This is

similar to standard dimension reduction methods used for solely numerical data

such as Projection Pursuit [5] and Multi Dimensional Scaling [6], but applicable

to data with a mix of nominal and numeric fields. What is required for Haiku

visualization is that a similarity can be calculated between any two items. The

similarity metric should match an intuitive view of the similarity of two items.

In most cases, a simple and standard distance measure performs well.

To create the visualization, nodes are initially scattered randomly into the

3d space, with their associated links. Movement in this space is determined by

a set of rules similar to the laws of physics. Links want to assume a particular

length, determined by their elasticity and the nodes they are connected to - and

the parameters of these are affected by the data. They pull inwards until they

reach that length, or push outwards if they are compressed, just as a spring does

in the real world. The more similar an item, the stronger the link connecting

them. Nodes repel each other, based on their mass. This whole approach can be

seen as a force directed graph visualization. This initial state is then allowed to

evolve, and the links and nodes shue themselves around until they reach a low

energy, steady state. Since the strongest forces are between the items that are

the most similar, they tend to cluster in the same area of space - items unrelated

to each other are not connected and hence occupy different areas of space. The

repulsive force between nodes is used to spread them out.

The physics of the space are adjustable, but are chosen so that a steady

state solution can be reached that is static - this is unlike the real world, in

which a steady state exists that involves motion, such as we see in planetary

orbits. Having a static steady state is easier to explore, since the visual system

is especially sensitive to motion and hence orbiting systems tend to become the

focus of attention.

The system effectively reduces the data dimensionality to 3D. However, un-

like traditional dimension reduction methods, there is no predefined mapping

between the higher and lower dimensional spaces.

Computationally, the process scales exponentially with the number of links,

which is usually proportional to the number of data points. For small data sets

(up to 1000 nodes) the process can be allowed to run in real time. For larger data

Ambient Intelligence for Scientific Discovery

Search WWH ::

Custom Search

Home