ZoomTree: Unrestricted Zoom Paths in Multiscale Visual Analysis of Relational Databases (Computer Vision, Imaging and Computer Graphics) Part 2

Zoom Trees for Detail Visualization

Zooming is a frequently used operation in visualizing multi-dimensional relational databases and data cubes. In this section, we propose to use zoom trees on separate layers to facilitate the presentation of zooming results along with the zooming history.

Layered Zoom Trees. Given an overview of a selected subcube in our table-based visualization component, visual analysts typically need to dig deeper into the subcube to gain more insights or discover correlations and anomalies. Since the table-based overview can only accommodate up to four dimensions/measures, the remaining dimensions are aggregated together. To discover more details, zooming needs to disaggregate such dimensions or expand an existing dimension to expose more detailed levels. A zooming process in our system is purely event driven, and it always begins with a chart in the table-based overview. The events embedded into the charts in the table serve as “hyperlinks”. For example, a user can initiate a zooming process by clicking any bar in a bar chart or by selecting a region of interest in a plot chart in the table (Fig. 4). Any event triggered by such user interactions pops up a new active layer. The chart clicked by the user becomes the root of a new zoom tree initiated on this layer, and the disaggregated information corresponding to the chosen bar or region is presented in a new chart, which becomes a child of the root. The user can continue to zoom into any existing node in this tree, and a new child of that node is spawned to hold the zooming results.


To reduce screen space clutter, at any time, only one path from the root to a leaf in the tree is visualized, and all other nodes in the tree are hidden. A path in a zoom tree is presented in a predefined layout within the layer, where the nodes are arranged from left to right horizontally and from top to bottom vertically. Each node in the tree is a chart, and represents a disaggregation of a dimension or an intermediate level of a hierarchically clustered dataset. A user can dynamically change the type of chart shown within a node. The user can also minimize (deactivate) and reactivate a layer. There can be only one active layer at any time.
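To make the zoom tree structure concrete, the following is a minimal C++ sketch of a tree node; names such as ZoomNode and activeChild are ours and only illustrate how a node can keep its full zooming history while exposing a single root-to-leaf path for display:

#include <memory>
#include <string>
#include <vector>

// Hypothetical zoom tree node: each node holds one chart and any number of
// children (the full zooming history), but only one child lies on the
// currently visualized root-to-leaf path.
struct ZoomNode {
    std::string chartSpec;                            // chart shown in this node
    std::vector<std::unique_ptr<ZoomNode>> children;  // all zooming results spawned from this node
    int activeChild = -1;                             // index of the visible child; -1 at a leaf

    // Collect the single path that is currently visualized on the layer.
    void visiblePath(std::vector<const ZoomNode*>& path) const {
        path.push_back(this);
        if (activeChild >= 0)
            children[activeChild]->visiblePath(path);
    }
};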

There are three operations supported for zoom trees.

1.    Add nodes. Double-clicking a bar or a pie slice, or selecting a region in a plot chart, pops up a list of dimensions. Once the user has chosen one of the dimensions, a new chart is generated as a new child node.

2.    Delete nodes. Nodes can be deleted by directly clicking the “Delete” button on each chart. If a node is deleted, all its descendants are pruned at the same time.

3.    Show/Hide nodes. Since our system only shows one path from the root to a leaf in the tree, the user can choose a desired branch by clicking the radio button representing the root of the corresponding subtree. All sibling nodes of the chosen branch and their descendants become hidden.

Compared with the decomposition trees in [11] and semantic zooming in the Pad++ system [10], our zoom trees have two unique characteristics. First, a zoom tree has a generic tree structure recording the entire history of a zooming process performed on a chart in the overview. Unlike previous work, a node in a zoom tree can have an arbitrary number of children, but at any time only one child is visualized in order to utilize screen space efficiently. Second, pivoting is supported during a zooming process. It provides additional dynamic views of the data, and hidden patterns can therefore be discovered more easily. There are two types of zooming according to the type of data being operated on: one for data with aggregated dimensions, and the other for data clusters, which are computed from either raw or aggregated data points to reduce screen space clutter.

Zooming Aggregated Data. This type of zooming applies to bar charts and other types of charts essentially equivalent to bar charts, such as pie charts and line charts. During a zooming step, the user chooses a bar and disaggregates it along a dimension that is different from the dimension mapped to one of the axes of the chart (Fig. 4(a)&(c)-(e)). Note that different bars in the same chart can be disaggregated along different dimensions. Such a zooming step essentially performs local drill-down over a subset of aggregated data. The flexibility of such zooming steps facilitates detailed data exploration.

Zooming Data Clusters in Plot Charts. There can be a huge number of data points in a plot chart while the screen area allocated for the chart is often quite limited. Overly crowded points in a plot chart can prevent users from identifying the underlying correlations and patterns. To reduce this type of screen space clutter, we perform clustering on the data points using screen space distance, and only visualize the cluster centers in the plot chart. Every cluster center is visualized as a small circle whose radius indicates the number of data points in the underlying cluster. The clustering algorithm is executed on the CPU and takes the screen locations of the raw data points and the number of desired clusters as input (see Section 6.4). Zooming such data clusters can be initiated by drawing a rectangular region of interest (Fig. 4(b)&(f)). Cluster centers falling into the region are automatically selected. A new chart is created as a child of the current node in the zoom tree, displaying a zoomed view of the region. This zoomed view is generated on the fly by calling the clustering algorithm on the server again over those raw data points falling into the selected region. Because the selected region is zoomed to cover the area of an entire chart, the number of resulting cluster centers becomes larger than that in the selected region of the original chart. Such a zooming step can be performed recursively until the number of raw data points within the region falls below a threshold. Note that zooming clustered data does not involve any aggregated dimensions.

Pivoting During Zooming. It is often desirable to gain more insight during data analysis by generating additional views of a node in a zoom tree. Users can achieve this goal with the help of pivoting. Unlike the pivoting discussed in Section 4.2, where the axis configuration of the entire table is changed, pivoting here is applied only locally to the chart in a particular tree node and can be performed on any node in the zoom tree. For this purpose, users can directly click the pull-down list along the dimension axis of the chart and choose the desired dimension for the new view. We restrict the target dimension for pivoting to the remaining dimensions that have not been used so far.

Query Formation

In this section, we briefly discuss how to transform user interactions into queries and how these queries are expressed according to a predefined formalism.

Query Formalism

Since we adopt the H-tree [31] as the implementation of our partial cube, typical cube query languages such as MDX cannot be used to describe a query. Therefore, we developed a simple H-tree based partial cube query formalism. Generally, there are two kinds of queries for data cubes: (1) point queries and (2) subcube queries. A point query only includes a few instantiated dimensions but no inquired dimensions. On the other hand, a subcube query is required to include at least one inquired dimension. We use “?” to represent an inquired dimension, “*” to represent a “not care” dimension, and a string of values demarcated by slashes (“/”) to represent an instantiated dimension. Assume the partial cube is constructed from a relational database with M dimensions and K measures. There exists a predefined order of the dimensions, D1, D2, …, DM, typically specified by OLAP experts. In such a context, the two kinds of queries can be expressed in a formalism illustrated by the following two examples:

(1)  ( *, *, d31/d33, *, …, * ; m1 = 1, m2 = 0, …, mK = 1 )
(2)  ( *, ?, d31/d33, ?, *, …, * ; m1 = 1, m2 = 0, …, mK = 1 )

where mi represents the label of a measure; it is set to 1 if the measure is inquired and to 0 otherwise; d31 and d33 are two specified values for the instantiated third dimension. There are two parts in each query: the first part is reserved for the dimensions, demarcated by commas (“,”), and the second part is for the labels of the measures, also demarcated by commas. Note that more than one value can be specified for each instantiated dimension. (1) describes a point query, which returns one aggregated value for each inquired measure. (2) describes a subcube query with the second and fourth dimensions as inquired dimensions.
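Purely as an illustration of this formalism, a query could be encoded with one slot per dimension and one flag per measure; the names below (PartialCubeQuery, dimSlots, measureInquired) are ours and not part of the paper:

#include <string>
#include <vector>

// One slot per dimension, following the predefined dimension order:
// "*" (not care), "?" (inquired), or instantiated values such as "d31/d33".
struct PartialCubeQuery {
    std::vector<std::string> dimSlots;   // size M
    std::vector<bool> measureInquired;   // size K; true if the measure is inquired

    bool isPointQuery() const {
        for (const std::string& s : dimSlots)
            if (s == "?") return false;  // any inquired dimension makes it a subcube query
        return true;
    }
};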

Query Generation

Queries similar to (1) and (2) are generated by tracing user interactions and filling the slots corresponding to the dimensions relevant to those interactions. Note that there are only three possible values for each slot: “*”, “?”, or a string of instantiated values.

Slice/Dice Selection. As discussed in Section 4.1, slice and dice only specify instantiated dimensions. Thus, the values of the instantiated dimensions are directly filled into the corresponding slots of the query. For example, if the user selects “2007” and “2008” as the values for the dimension “Year”, the “Year” slot is filled with “2007/2008” in all subsequent queries.
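Continuing the hypothetical PartialCubeQuery sketch above, a slice/dice selection simply fills the corresponding slot; here we assume, for illustration only, a cube with four dimensions and two measures where “Year” happens to be the first dimension:

// The user selected "2007" and "2008" for the dimension "Year".
PartialCubeQuery q;
q.dimSlots = {"2007/2008", "*", "*", "*"};  // "Year" slot filled by the slice/dice selection
q.measureInquired = {true, false};          // measure flags are set later by other interactions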

Query Generation for Table-based Overview. As stated in Section 4.2, four of the six types of commonly used axis configurations generate tables of charts, and the other two generate a single large bar chart or plot chart. In the first type of configuration mentioned in Section 4.2, there is only one dimension specified; therefore, only one subcube query is generated, taking the dimension assigned to the outer vertical axis as the inquired dimension and all the measures as the inquired measures. The second type of configuration is a special case of the first one since it only inquires one measure. A 2D table can be generated by assigning two dimensions to the two outer axes. Once specified, the whole table is divided into a 2D grid of panes, each of which maps to a specific pair of values of the dimensions assigned to the outer axes. A subcube query is generated for each pane. The actual query type depends on whether there is a dimension assigned to the inner axes. For instance, in the fourth type of configuration in Section 4.2, one subcube query is generated for each pane, taking the inner horizontal dimension as the inquired dimension. In the fifth type of configuration, one subcube query is generated for each pane, taking the two inner measures as inquired measures and all uninstantiated dimensions as inquired dimensions.

Query Generation for Zooming and Pivoting. Zooming aggregated data needs to unfold new dimensions. Every aggregated datum is decomposed into multiple ones, each of which corresponds to a distinct value of the dimension chosen for disaggregation. Therefore, one subcube query is generated for each such operation, taking the chosen dimension as the inquired dimension. Similarly, a pivoting operation is also transformed into one subcube query. However, zooming clustered data is different in that no additional dimensions are required. When the user selects a region of interest to zoom in, the system automatically computes the bounding box of the region. This bounding box is appended to the query corresponding to the pane. The query is processed as usual except that the query results are filtered using the bounding box and the filtered results are re-clustered.

Subcube Query Translation. In our system, a subcube query is first translated into multiple point queries before being further processed. The idea is to replace the inquired dimensions in the query with all possible combinations of their values. More precisely, if there are n inquired dimensions in the query with cardinalities C1, …, Cn respectively, it is translated into C1 × C2 × ⋯ × Cn point queries, each of which maps to a unique combination of values of these inquired dimensions. To minimize data transmission overhead, the translation is performed by the CPU component of the server.
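A minimal sketch of this translation, reusing the hypothetical PartialCubeQuery structure from above (cartesianExpand and domain are names we introduce; the per-dimension value lists could come, for instance, from the H-tree header tables described later):

#include <string>
#include <vector>

// Expand a subcube query into point queries by substituting every combination of
// values of its inquired dimensions, producing C1 x C2 x ... x Cn point queries.
std::vector<PartialCubeQuery> cartesianExpand(
        const PartialCubeQuery& q,
        const std::vector<std::vector<std::string>>& domain) {  // value list per dimension
    std::vector<PartialCubeQuery> result{q};
    for (size_t d = 0; d < q.dimSlots.size(); ++d) {
        if (q.dimSlots[d] != "?") continue;                      // expand inquired dimensions only
        std::vector<PartialCubeQuery> next;
        for (const PartialCubeQuery& partial : result)
            for (const std::string& value : domain[d]) {         // one copy per dimension value
                PartialCubeQuery p = partial;
                p.dimSlots[d] = value;
                next.push_back(p);
            }
        result = std::move(next);
    }
    return result;
}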

Server-Side Algorithms

In this section, we present algorithms developed for the server.

Graphics Processors

Recent NVidia GPUs have a SIMD architecture and host a number of multiprocessors. We adopt NVidia CUDA [37] as our programming environment. In a GPU kernel, a computational task is divided among thread blocks, which are dynamically scheduled onto the multiprocessors. Each thread block can have up to 512 parallel threads. Threads within the same block are always scheduled to the same multiprocessor and can communicate through a fast shared memory associated with that multiprocessor. Multiprocessors keep a large number of active threads in flight to hide the high memory access latency: while some threads are waiting for data, others can execute instructions. This further implies that each thread block needs a reasonably large number of threads; for example, the suggested minimum number of threads per block is 32.
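As a minimal illustration of this execution model (the kernel and its dimensions are arbitrary examples, not part of our system), a kernel is launched over a grid of thread blocks, and threads within a block cooperate through shared memory:

// Each block loads a tile of the input into its fast shared memory and processes it.
__global__ void scaleKernel(const float* in, float* out, int n) {
    __shared__ float tile[256];                      // shared memory of this block
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (i < n) tile[threadIdx.x] = in[i];
    __syncthreads();                                 // all threads of the block synchronize
    if (i < n) out[i] = 2.0f * tile[threadIdx.x];
}

// Host side: 256 threads per block, enough blocks to cover all n elements.
// scaleKernel<<<(n + 255) / 256, 256>>>(dIn, dOut, n);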

H-tree Structures

We adopt an H-tree to represent the partially materialized data cube on the server. H-tree is a hyper-linked tree structure originally presented in [31], and was later deployed in [30] as the primary data structure for stream cubes. In the following, we briefly review the original definition of an H-tree.

1.    An H-tree HT is a compact representation of all tuples in a relational database. Each level in HT corresponds to a distinct dimension of the database. The order of the levels in HT follows a predefined order of the dimensions.

2.    A tuple from the relational database is represented as a path from the root to a leaf. If two tuples share the same values in the first L dimensions in the predefined order, their corresponding paths in HT also share the first L segments. The two different values in the (L+1)-th dimension are stored in two child nodes of the node holding the shared value of their L-th dimension.

3.    There is a header table for each level of the tree. It holds all distinct values of the corresponding dimension. All nodes sharing the same value are linked together by introducing an additional side link in each node. The header table also holds a pointer to the first node in each linked list.

4.    All measures within a tuple are stored at the leaf node corresponding to that tuple. Intermediate nodes of HT hold aggregated measures resulting from data cube computation. An intermediate node saves the aggregated measures over the subset of tuples represented by the subtree rooted at the node. Thus, an H-tree is equivalent to a partially materialized data cube.

However, there are two major differences between our GPU-based H-tree structure (Fig. 5) and the original version. First, since CUDA does not support pointers, linked lists are replaced with arrays and pointers are replaced with array indices. Second, the array allocated for a side-link list is further divided into contiguous segments, each of which contains the indices of nodes sharing the same attribute value. We revised the structure of the side links to achieve better load balance and query performance.
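The following is a rough sketch of such a pointer-free layout; the field names are ours, and the actual structure in Fig. 5 may differ in detail:

#include <vector>

// Array-based H-tree for the GPU: nodes live in flat arrays and are linked by
// array indices instead of pointers; these arrays are copied to GPU memory.
struct GpuHTree {
    // Per-node arrays (one entry per H-tree node).
    std::vector<int>   valueId;      // id of the attribute value stored at the node
    std::vector<int>   parent;       // index of the parent node
    std::vector<float> aggMeasure;   // aggregated measure kept at the node

    // Per-level side-link storage: node indices of one level are stored contiguously
    // and segmented so that nodes sharing the same attribute value are adjacent.
    std::vector<int> sideLinks;      // concatenated node indices for a level
    std::vector<int> segmentStart;   // start offset of each value's segment in sideLinks
};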

Recently, GPUs have attracted increasing attention beyond the graphics community. Taking advantage of the GPU H-tree structure, we develop a parallel online cubing algorithm to facilitate fast query processing. To the best of our knowledge, this is the first attempt to develop parallel cubing algorithms on GPUs.

Fig. 5. GPU H-tree structure

Online Cubing

In this section, we only present the GPU-based parallel algorithm for point queries because a subcube query can easily be translated into multiple point queries. To achieve optimal performance, we propose an approach that exposes two levels of parallelism. Unlike a sequential algorithm, which processes queries one by one, our algorithm can process thousands of queries simultaneously in parallel. To further exploit the massive parallelism of modern GPUs, we also process each individual query in parallel. We achieve this by first assigning one thread block to each query and then making each thread in the block responsible for an evenly divided portion of the leaves or intermediate nodes of the H-tree. Since each query is processed by one thread block, we present the per-block query processing algorithm as follows.

In a real scenario, we initiate thousands of thread blocks, and each block takes care of one query. Note that in the first step we assume the entire header table for hd and the query itself can be completely loaded into the shared memory associated with the block responsible for the query. Care must be taken to ensure this does not exceed the limit, which is 16KB per multiprocessor on the G80. If the cardinality of hd is relatively large, step 2 can be parallelized as well. In step 3, we evenly divide the rNum elements in the side-link list into chunks, and the size of each chunk is rNum/S, where S is the number of threads in a block. We allocate a temporary array for each inquired measure. Each element in this array represents a partially aggregated value computed from a particular chunk by the corresponding thread. Since there can be more than one specified value for the last instantiated dimension, we loop over all these values and accumulate the partially aggregated values in the temporary arrays. Finally, we apply the parallel reduction primitive [38] to each temporary array to compute the final aggregated value for each inquired measure.
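The following is a rough CUDA sketch of this per-block point-query algorithm, simplified to a single inquired measure, a single specified value for the last instantiated dimension, and sum aggregation; the array names follow the hypothetical GpuHTree layout sketched above rather than the system's exact structure:

// One thread block per query. Each thread scans a strided portion of the side-link
// segment of the queried value, checks the instantiated ancestors, and accumulates a
// partial sum; a shared-memory reduction then produces the final aggregate.
__global__ void pointQueryKernel(
        const int*   sideLinks,    // node indices linked under the queried value
        const int*   segStart,     // per-query start offset into sideLinks
        const int*   segCount,     // per-query number of linked nodes (rNum)
        const int*   parent,       // parent index of every H-tree node
        const int*   valueId,      // attribute value id stored at every node
        const float* aggMeasure,   // aggregated measure stored at every node
        const int*   queryValues,  // per-query wanted value ids for the levels above hd
                                   // (index 0 = topmost level), or -1 for "*"
        int          numLevels,    // number of levels above the queried dimension
        float*       results) {    // one aggregated value per query
    extern __shared__ float partial[];            // one partial sum per thread
    int q = blockIdx.x;                           // this block's query
    float sum = 0.0f;
    for (int k = threadIdx.x; k < segCount[q]; k += blockDim.x) {
        int node = sideLinks[segStart[q] + k];
        bool match = true;
        int cur = parent[node];
        for (int lvl = numLevels - 1; lvl >= 0; --lvl) {   // verify instantiated ancestors
            int want = queryValues[q * numLevels + lvl];
            if (want >= 0 && valueId[cur] != want) { match = false; break; }
            cur = parent[cur];
        }
        if (match) sum += aggMeasure[node];
    }
    partial[threadIdx.x] = sum;
    __syncthreads();
    // Tree reduction over the partial sums (blockDim.x assumed to be a power of two).
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            partial[threadIdx.x] += partial[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0) results[q] = partial[0];
}

// Launch: one block per query, S threads per block, S floats of dynamic shared memory,
// e.g. pointQueryKernel<<<numQueries, 256, 256 * sizeof(float)>>>(...);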

The average time complexity of online cubing is O(NM/(CP)) per point query, where P is the number of processors allocated to process the query, N is the number of tuples in the H-tree, M is the number of dimensions, and C is the cardinality of the last instantiated dimension in the query. The memory cost of online cubing is O(S) per point query, where S is the number of threads responsible for the query.

Online Clustering for Plot Charts

Implementing the zooming mechanism described in Section 4.3 for plot charts requires performing clustering in real time on the server. Classical clustering methods such as k-means could be used for this purpose. However, the main drawback of the k-means algorithm in this scenario is that it requires multiple iterations to cluster the data into a desired number of clusters, which makes it hard to achieve real-time response for large datasets even if we use its parallel version [8]. Here we present a simple grid-based algorithm to cluster hundreds of thousands of points into a desired number of clusters. In doing so, we not only reduce the overhead of transferring a large amount of data but also reduce screen space clutter. To deliver optimal performance, our clustering algorithm has been implemented on the CPU of the server and is summarized in the following steps.

1.    Compute the bounding box of all input points.

2.    Divide the bounding box into a 2D grid of Nbin × Nbin small boxes of equal size. Each small box serves as a bucket.

3.    Accumulate each point into an appropriate bucket according to its screen space coordinates.

Fig. 6. GPU speedup and average time vs. number of dimensions and the cardinality of each dimension for online cubing

4.    For every bucket in the grid, set the cluster center of the bucket to the average location of the points falling into the bucket.

This algorithm has a linear time and space complexity. A reasonable value for Nbin is 10. Users can tune this parameter to achieve a visually pleasing presentation.
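A minimal single-threaded C++ sketch of these four steps (Point and gridCluster are names we introduce; the actual server implementation may differ):

#include <algorithm>
#include <utility>
#include <vector>

struct Point { float x, y; };

// Grid-based clustering: bucket points into an nBin x nBin grid over their bounding
// box and return one cluster center (and point count) per non-empty bucket.
std::vector<std::pair<Point, int>> gridCluster(const std::vector<Point>& pts, int nBin) {
    std::vector<std::pair<Point, int>> clusters;
    if (pts.empty()) return clusters;

    // 1. Compute the bounding box of all input points.
    float minX = pts[0].x, maxX = pts[0].x, minY = pts[0].y, maxY = pts[0].y;
    for (const Point& p : pts) {
        minX = std::min(minX, p.x); maxX = std::max(maxX, p.x);
        minY = std::min(minY, p.y); maxY = std::max(maxY, p.y);
    }
    float cellW = (maxX - minX) / nBin, cellH = (maxY - minY) / nBin;

    // 2.-3. Accumulate every point into its bucket using screen space coordinates.
    std::vector<Point> sum(nBin * nBin, Point{0.f, 0.f});
    std::vector<int> count(nBin * nBin, 0);
    for (const Point& p : pts) {
        int ix = std::min(nBin - 1, cellW > 0 ? int((p.x - minX) / cellW) : 0);
        int iy = std::min(nBin - 1, cellH > 0 ? int((p.y - minY) / cellH) : 0);
        int b = iy * nBin + ix;
        sum[b].x += p.x; sum[b].y += p.y; ++count[b];
    }

    // 4. The cluster center is the average location of the points in each non-empty bucket.
    for (int b = 0; b < nBin * nBin; ++b)
        if (count[b] > 0)
            clusters.push_back({Point{sum[b].x / count[b], sum[b].y / count[b]}, count[b]});
    return clusters;
}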

Performance

The described algorithms have been implemented and tested on an Intel Core 2 Duo quad-core 2.4GHz processor with an NVidia GeForce 8800 GTX GPU. To cluster 1 million randomly generated data points into 10×10 clusters, our grid-based clustering algorithm takes only 22.96ms on a single core. The average performance of the online cubing algorithm is presented in Fig. 6(a)&(b), where randomly generated point queries are processed using an H-tree with 400k and 800k tuples, respectively. Our GPU-based algorithm can typically achieve a speedup much larger than 10, and can process 10,000 to 50,000 point queries per second. The results also show that this algorithm has a greater advantage when the number of dimensions and the cardinality of each dimension are relatively small. This is mainly because more dimensions and larger cardinalities give rise to larger H-trees, which require more memory accesses; GPU memory access latency is about 400-600 cycles, which is longer than CPU DRAM access latency.

Usability Evaluation

To evaluate the usability of our system, we explored several real datasets, including American historical climate change data from the last century, American census data from 2000, and the Coffee Chain data (shown in the video). Since Polaris can be regarded as the state of the art for database visualization, a user study was conducted comparing visualizations of these datasets using both our zoom tree system and Tableau (Polaris). There were 8 participants in total: 2 female and 6 male, with ages ranging from 19 to 28. They came from four different research labs: database (2), data mining (2), graphics (2), and HCI (2).

Table 1. User Satisfaction Ratings (0: Worst, 5: Best)

Question                  Zoom Tree   Polaris
Subcube Selection         3.7         3.9
Pivoting                  4.6         3.5
Aesthetic Appeal          3.4         3.7
Clutter Reduction         3.9         3.3
System Response Time      4.3         4.1
Historical Vis Support    3.8         3.7

Methods and Procedure

Before testing, about one hour of training and discussion was conducted to familiarize all participants with the meaning of the datasets, the concepts of data cubes, and the interfaces of the two systems. Participants were asked to perform two tasks with both systems and rate (1-5) their satisfaction by filling out a questionnaire. Note that both tasks involved drilling down, rolling up, and pivoting operations. An example step from one task is: “Select the subcube Year=2007, Location=New York, Product=Green Tea, then explore to find abnormal relationships between the remaining dimensions and the measure ‘Profit’, and record them.” An example question is: “Rate your satisfaction with the pivoting support along a zoom path.”

Results and Observations

We measured the time each participant spent on the tasks. The average and variance of the completion times for tasks one and two with the zoom tree system were (average: 36s, variance: 15s) and (average: 95s, variance: 24s), respectively, while the corresponding results with Polaris were (average: 45s, variance: 12s) and (average: 87s, variance: 26s). User satisfaction ratings for the two systems are reported in Table 1. From the qualitative feedback, both positive and negative, we found our system to be competitive with Polaris. Participants found the layered zoom tree intuitive, easy to invoke and manipulate, and less cluttered for high-dimensional data. An interesting observation is that most participants agreed that a table is good for overview visualization, but that details are better visualized gradually on a separate layer to achieve clarity, provided focus and context are handled well. In our experience, it is very hard to visualize datasets with more than 15 dimensions in a fixed table using dimension embedding as in Polaris; the higher the dimensionality, the more cluttered the visualization. This is one of the main drawbacks of Polaris that the layered zoom tree avoids. The results also show that the zoom tree system gives quicker response times for the same dataset, mainly because of the GPU parallelism exploited by our H-tree online cubing algorithm. Moreover, the zoom tree system stores only a partial cube and therefore requires considerably less storage than Polaris. Flexibly changing the view is crucial for dynamic exploration; since pivoting is not supported along a zoom path in Polaris, the layered zoom tree is the clear winner in this regard.

However, the zoom tree system also has some disadvantages. For example, participants felt that although schema-based subcube selection is powerful, they prefer directly dragging and dropping dimensions onto the table shelves as in Polaris. We also received valuable suggestions for further improvement. For example, one participant suggested annotating the history button with a meaningful thumbnail that reveals the structure of the underlying subtree.

Conclusions

We have presented a visualization system with a client-server architecture for multiscale visualization of relational databases. Our system supports all types of data cube operations using a combination of a schema list, tables and zoom trees. To support fast query processing on the server, we have also developed efficient algorithms for online data cubing and data clustering. The user study shows that our proposed layered zoom tree and the overall system framework are effective for visualizing databases.

Limitation. Our current system does not support spatial dimensions such as maps. A spatial dimension is likely to partition the screen space into irregularly shaped regions instead of regularly shaped panes. In the future, we would be interested in investigating methods for placing charts inside such regions, as well as zoom interfaces for spatial dimensions.
