The classifier must be as generic as possible so that it can also be applied to
similar problems. It has to accommodate extensions such as new
characteristics of the items, or different items described by a brand new set
of characteristics. During the development of the classifier the reference will
be the car accident and theft insurance problem, but we will keep the
system open to extensions.
In the artificial intelligence and data-mining literature it is possible to find
many works dealing with classification and the construction of classifiers. The
essential concepts involved in building and using a classifier are:
Item : the element that is, or has to be, assigned to a category. It is
described by a set of features.
Feature : a name together with a value. It is used to describe an item.
Category : a tag that is applied to an item on the basis of its features.
Classifier : the tool that automatically assigns an item to a category.
Training set : a set of items that have already been assigned to a category.
It is used to capture the assignment criteria.
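These concepts can be sketched directly in code. The following Python fragment is only illustrative: the class names, the feature names, and the example categories are our own assumptions, not part of the insurance data or of any established library.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Feature:
    """A feature is a name together with a (nominal) value."""
    name: str
    value: str

@dataclass
class Item:
    """An item is described by a set of features."""
    features: tuple  # a tuple of Feature instances

# A training set pairs items with the categories already assigned to them;
# it is what captures the assignment criteria.
training_set = [
    (Item((Feature("vehicle", "car"), Feature("claim", "theft"))), "high-risk"),
    (Item((Feature("vehicle", "car"), Feature("claim", "accident"))), "low-risk"),
]
```

A classifier, in these terms, is any procedure that maps an `Item` to a category, with the training set used to derive the mapping.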
We will assume that all features are of nominal type: that is, each feature
can take values only in a finite set whose elements are defined completely
by enumerating them. This assumption simplifies all the algorithms, both for
identifying the classification rules and for classifying items.
There are several kinds of classifier models. The decision tree is one of the
simplest (Cherkassky and Mulier 1998). It can easily represent hierarchies
of concepts and has the great advantage of being easy to understand. In a
decision tree both nodes and arcs are labelled. Each non-leaf node is associ-
ated with a split feature, and the arcs to its child nodes are labelled with
the possible values of that feature. Each leaf is labelled with the name of
a category.
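As a concrete sketch, such a tree can be encoded with nested dictionaries; the feature names, values, and categories below are invented for illustration and are not taken from the insurance problem.

```python
# Leaves carry only a category; non-leaf nodes carry a split feature
# and one child per possible nominal value of that feature.
tree = {
    "feature": "vehicle",                       # split feature of the root
    "children": {                               # one arc per nominal value
        "car": {
            "feature": "claim",
            "children": {
                "theft":    {"category": "high-risk"},
                "accident": {"category": "low-risk"},
            },
        },
        "motorcycle": {"category": "high-risk"},
    },
}
```

Because all features are nominal, the arcs out of a node form a finite, enumerable set, which is what makes this dictionary encoding possible.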
The algorithm that assigns a category to a new item is described below
(Algorithm 4.1). The idea behind the algorithm is to find, starting from the
root, a path that matches the features of the item. At each node the arc
corresponding to the item's value for the node's split feature is followed.
Input: an item, a decision tree
Output: a category for the item
1 start by setting the root node as the current node
2 repeat while the current node is not a leaf:
(a) the node label gives the name of the feature to be considered
(b) read the value of that feature in the item
(c) follow the arc corresponding to that value
(d) the node reached becomes the current node
3 the label of the leaf node is the category of the item
Algorithm 4.1 Categorization
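The steps of Algorithm 4.1 can be sketched in Python as follows. The dictionary-based tree encoding, the feature names, and the categories are illustrative assumptions, not part of the algorithm itself.

```python
def categorize(item, node):
    """Walk the tree from the root, following at each non-leaf node
    the arc labelled with the item's value for the split feature."""
    while "category" not in node:        # step 2: loop until a leaf is reached
        feature = node["feature"]        # (a) the feature to be considered
        value = item[feature]            # (b) the item's value for that feature
        node = node["children"][value]   # (c)+(d) follow the arc, descend
    return node["category"]              # step 3: the leaf label is the category

# Illustrative tree and item (all names are invented):
tree = {
    "feature": "vehicle",
    "children": {
        "car": {
            "feature": "claim",
            "children": {
                "theft":    {"category": "high-risk"},
                "accident": {"category": "low-risk"},
            },
        },
        "motorcycle": {"category": "high-risk"},
    },
}

result = categorize({"vehicle": "car", "claim": "accident"}, tree)  # "low-risk"
```

Note that the nominal-feature assumption is essential here: the lookup `node["children"][value]` is well defined only because every possible value of the split feature labels exactly one outgoing arc.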