Managing Hierarchies and Taxonomies in Relational Databases

INTRODUCTION

The need to maintain classification and retrieval mechanisms that rely on concept hierarchies is as old as language itself. Familiar examples include the Dewey decimal classification system used in libraries and the system for classifying life forms developed in the 1700s by Carolus Linnaeus. A more recent example is Yahoo’s subject taxonomy.
Information technology has led to an explosive growth in digital documents, records, multi-media files, and Web sites. To facilitate end-user access to these resources, topic hierarchies are frequently maintained to allow intuitive navigation and searching for resources related to specific categories. Maintaining such document, Web site, and knowledge taxonomies within relational databases is not trivial since the management of any hierarchical data in relational databases poses significant challenges (Millet, 2001). Taxonomies pose an additional challenge due to the typical need to classify a single document, concept, or Web site under multiple topics and due to the typical reliance on intelligent keys (Millet, 2003).
While, according to some views, non-relational database technologies and dynamic classification schemes may offer better ways for achieving our objectives, the use of relational database technology and concept taxonomies remains a contemporary practice and challenge.

BACKGROUND

Consider a document database where each document is classified into a hierarchy of topics shown in Figure 1.Since each document may belong to more than one parent topic, we cannot record the data for this hierarchy by specifying in each topic record the topic above it. Figure 2 shows a data model for this situation. Note that the classify table allows us to assign a single document to multiple topics. If instead of a topic hierarchy, we need to assign each subtopic to more than one parent topic, we would insert a topic assignment table to represent such taxonomies.
To demonstrate the difficulty of hierarchical data
retrieval against the normalized data model in Figure 2, consider the following requests:
• Show a list of all documents (at all levels) under Topic 1.
• Show how many documents (at all levels) are classified under each topic at level 1 of the hierarchy.
Using SQL, we can easily join each topic to all the documents associated with it via the classify records. However, we cannot easily identify the documents indirectly associated with a topic via its subtopics (at all levels). This difficulty in locating parent or child nodes at any given level is at the heart of the problem.


SQL-BASED SOLUTIONS

A request to show how many documents belong to each main topic, including all subtopics below it, can be handled using the SQL:1999 (ANSI/ISO/IEC 9075-2-1999) query shown in Listing 1. This query starts by creating a table expression (TOPICPATHS) populated with all main topic records as parents of themselves and appends (UNION) records for all paths of length one from these nodes to the topics directly below them. The RECURSIVE option continues the process to build all indirect paths from each topic to all its descendants.
The query then joins the end points (Topic Below) of all paths in the TOPIC_PATHS result set to the documents assigned to these topics. By limiting the start points of these paths to main topics (topics at level 1) and grouping the end result by those topics, we get the requested information. Avoiding double-counting of documents that were assigned to multiple topics is achieved by using DistinctCount(Classify.DocID) rather than Count(Classify.DocID).

Figure 1. A topic hierarchy

A topic hierarchy

Figure 2. A normalized data model with a topic hierarchy

A normalized data model with a topic hierarchy

Listing 1. Recursive hierarchy retrieval using SQL:1999

Recursive hierarchy retrieval using SQL:1999
Relying on such complex SQL is probably beyond the reach of many IT professionals. This can be addressed by implementing the complex portion of these SQL statements as database views. However, someone has to write the SQL for these views and the intensive nature of the required processing may lead to slow performance in reporting applications with large hierarchies and frequent queries.
Celko (2000) reports on a technique leading to significant improvements in query speeds by storing the hierarchy data not as parent-child relationships but as “nested sets” using a somewhat complex numbering scheme. However, alternative approaches can achieve very significant query performance gains while maintaining intuitive data storage and SQL syntax.

THE PATH TABLE APPROACH

The path table approach uses a “navigation bridge table” (Kimball et al., 1998) with records enumerating all paths starting from each node to all nodes in the branch above it, including itself. This approach provides flexibility in the sense that each subtopic node can belong to multiple direct parent topics and there is no limit on the number of levels in the taxonomy.
As demonstrated by Table 1, topic 1.1.1 would require four records in the path table reflecting the paths up to itself, topic 1.1, topic 1, and topic 0 (the top node of the hierarchy). These are just 4 of the 37 records required to capture all paths for the sample hierarchy in Figure 1.
To demonstrate how the path table can simplify data retrieval, consider the same challenge of showing how many documents belong to each main topic, including all subtopics below it. By joining the tables as shown in Figure 3, we can easily select all documents that belong to topics below each main topic. Since the path table includes a zero-length path between each topic and itself, documents that belong directly to a main topic would be included in the result set. Again, the classify table allows us to associate the same document with multiple topics.
Other requests for information can use the same approach or variations such as connecting to the topic table via the TopicID column in the path table or adding path length and terminal node information to the path table (Kimbal et al., 1998).
One limitation of the path table approach is that the number of records in the path table can grow quite large for deep hierarchies. The following section describes another approach that avoids that problem.

THE DENORMALIZED TOPIC TABLE APPROACH

The denormalized topic table approach maintains information about all higher-level topics within each topic record. The classify table still allows each document to be assigned to multiple topics, but this approach restricts the topic taxonomy to a strict hierarchy whereby each sub-topic can have only one direct parent topic.

Table 1. A path table for the sample hierarchy

Topic ID Topic Above
1.1.1 1.1.1
1.1.1 1.1
1.1.1 1
1.1.1 0

As demonstrated by Figure 4, if we assume that our topic hierarchy will not exceed six levels, we need six more columns to maintain this information. Each node will be indicated as its own parent at its own level. For example, topic 1.1.1 in Figure 1 would have 0 (the topicid for the top node in the hierarchy) as its Topic_Level_1, 1 as its Topic_Level_2, 1.1 as its Topic_Level_2, 1.1.1 (itself) as its Topic_Level_4, and Null values for Topic_Level_5 and Topic_Level_6.
To demonstrate how the denormalized topic table can simplify data retrieval, consider again the challenge of showing how many documents belong to each main topic, including all subtopics below it. By joining the tables as shown in Figure 4, we can easily select and group all topics according to Topic_Level_2. Documents classified directly under that topic would be included in the result set because each topic is its own parent at its own level. Listing 2 shows that using this approach, a simple SQL statement can generate the requested information.
Note that in order to return the topic name we resorted to adding an aliased (Topic2) copy of the topic table to the SQL statement.
The denormalized topic table is less flexible since it limits the topic taxonomy to a strict hierarchy. However, the path table approach requires more complex joins due to the addition of one more physical table. Another disadvantage of the path table approach is that for deep hierarchies, the size of the path table can grow quite large, degrading query performance. A key issue with both approaches is the question of maintaining the redundant data required by these methods.

MAINTENANCE OF DENORMALIZED HIERARCHY DATA

This section suggests methods that can simplify and automate the maintenance of redundant hierarchy data. The proposed approach provides a concrete example of an “incremental evaluation system” for handling recursive SQL queries as advanced by previous literature (Dong et al., 1999; Libkin & Wong, 1997).
Because the denormalized topic table allows for simpler maintenance logic, the remainder of this article uses that approach as the assumed data structure.
A key observation is that the redundant information maintained for each topic is a simple extension of the redundant information already maintained for its parent topic. Consider a situation where Topic 1.2.2.2 in Figure 5 is moved from under Topic 1.2.2 to under Topic 3.2. The Topic_Level_N columns from the new parent node can simply be copied to the updated topic record (Topic 1.2.2.2). We then simply add the topic as its own parent at its own level and update the level column to the level of the parent node plus one.

Figure 3. A path table connects each classification with all its parents.

A path table connects each classification with all its parents.

Figure 4. Using a denormalized topic table

Using a denormalized topic table

Listing 2. Retrieval via a denormalized topic table

Retrieval via a denormalized topic table
Topic taxonomies (such as the Dewey decimal classification system used in libraries) frequently use “intelligent” primary keys reflecting the position of each topic within the taxonomy. In our example, we would neeed to change the primary key from 1.2.2.2 to 3.2.1. Using referential integrity, the database management system can automatically cascade such primary key updates to the foreign keys in documents as well as in Parent_Level_N columns.
The remaining challenge is to handle situations where a whole branch (a topic with subtopics) is moved in the hierarchy. Consider, for example, moving topic 1.2.2 in Figure 5 from under Topic 1.2 to under Topic 1.1. The procedural logic would update the redundant data for Topic 1.2.2, but we now need to update the information for all its decendants. The most elegant solution is to recursively extend the procedural logic by applying it to all descendant nodes, as if their TopicAbove column was updated as well.
Using this update logic, users would update the structure of the hierarchy by specifying only the TopicAbove information for each node. The maintenance of the redundant data can be done either under the control of the application or through database triggers.

Figure 5. Moving Topic 10 from Parent 6 to Parent 4

Moving Topic 10 from Parent 6 to Parent 4

FRONT-END LOGIC OR DATABASE TRIGGERS

Each time a user changes the Topic_Above column, the application can call a function with the Topic_ID and its new Topic_Above as arguments. After completing the logic for that topic, the function would use embedded SQL to identify the descendants of the topic and call itself recursively against all these descendants.
One limitation of using such front-end logic to maintain the redundant hierarchy data is that it would require multiple communications between the client and the server. A much more important limitation is that we are dependent on uniform and full compliance by all client applications. The integrity of the hierarchy data can be compromised if some screens neglect to call or implement the update function appropriately.
As described in detail by Millet (2003), the update logic can be implemented via recursive Update and Insert triggers declared on the Topic_Above column of the TOPIC table. Moving the procedural logic from front-end functions to back-end triggers removes the threat of multiple points of failure and achieves better performance.

FUTURE TRENDS

The current trend in addressing the challenges of maintaining hierarchical data in relational databases is to provide more complex but more powerful data retrieval syntax. The SQL standard has added recursive queries. Similarly, proprietary SQL extensions such as the START WITH and CONNECT BY clauses provided by the Oracle DBMS are also useful mechanisms for retrieving hierarchy data from normalized data structures.
Clearly, the trend so far has been in the direction of maintaining the simplicity of the data structure at the expense of more complex queries. In the long term, I hope DBMS vendors realize the value of also supporting the opposite approach whereby queries are simplified at the expense of complicating the data structure. Declarative statements are all that is needed today to ask a DBMS to apply referential integrity logic to the data. In a similar way, we can envision a future where declarative statements identify to the DBMS the link between parent and child nodes and the Parent_Level_N columns that should be maintained by the DBMS.

CONCLUSION

Topic hierarchies are used to classify and search for documents, Web sites, and knowledge areas. Beyond the data retrieval and data maintenance challenges posed by any hierarchy domain, the need to support classifying the same document under multiple topic nodes leads to different table designs and requires care in avoiding biased aggregate results due to double counting. Since topic taxonomies frequently utilize intelligent rather than surrogate primary keys, updates to the location of topic nodes may require updates to the primary keys as well.
Fast and simple data retrieval against topic hierarchies can be achieved by maintaining redundant information. The approach of denormalizing the topic table may be preferred over maintaining a separate path table because the path table can grow too large and requires more complex data maintenance logic.
The limitation of denormalized and redundant hierarchy information is that updates to the hierarchy require special processing logic in order to avoid update anomalies. To efficiently refresh the hierarchy data we can exploit the redundant information already maintained for the specified parent topic as topic records are inserted or updated. The process can then be extended recursively for lower level nodes.
If the necessary trigger options are available for the DBMS in use, it is recommended that the processing logic for maintaining the redundant hierarchy information be implemented as triggers. This removes the burden of hierarchy maintenance from client applications. It also ensures that client applications cannot circumvent the hierarchy maintenance logic.

KEY TERMS

Database Trigger: A procedure that gets invoked by the database management system whenever a specific column gets updated or whenever a row gets deleted or inserted.
Denormalized Data Tables: A database design that violates principles of good (normalized) data models. Such database designs may lead to various problems such as data redundancy, reduced flexibility, and update anomalies.
Intelligent Key: An intelligent key contains data that has a meaning beyond the unique identification of a database record. For example, a vehicle identification number (VIN) contains information about the manufacturer, model, and other attributes of the vehicle.
Recursive: A recursive procedure is one that has the ability to call itself.
Relational Database: A method of storing data in tables and columns and operating on that data through a set-oriented query language (SQL) and specialized software the takes care of the procedural and physical aspects of data storage and manipulations. Invented by E.F. Codd at IBM in 1970.
SQL (Structured Query Language): A widely accepted standard for retrieving information from and updating relational databases.
Taxonomy: A classification system used for analysis or information retrieval.
UNION Query: By inserting the key word UNION between multiple SQL SELECT statements, the resulting rows from these statements are appended into a single result set (provided the number and data types of the columns match).

Next post:

Previous post: