Note that if we have a sequence containing only one symbol, its information content is zero. Actually, in Equation 4.1 the frequency f_i is exactly 1 and the number of symbols, N, is 1. Substituting these values into Equation 4.1 we obtain 0:

I = -\sum_{i=1}^{1} f_i \log_2(f_i) = -1 \cdot \log_2(1) = 0

Equation 4.3 Information content for the limit case
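As a quick numerical check, a minimal sketch (assuming Equation 4.1 is the standard frequency-weighted measure, I = -∑ f_i log2(f_i)) confirms that a single-symbol sequence carries zero information:

```python
from collections import Counter
from math import log2

def information_content(sequence):
    """-sum(f_i * log2(f_i)) over the distinct symbols of the sequence,
    where f_i is the relative frequency of the i-th symbol."""
    counts = Counter(sequence)
    n = len(sequence)
    return -sum((c / n) * log2(c / n) for c in counts.values())

# A single-symbol sequence: f_1 = 1, so the sum is -1 * log2(1) = 0
assert information_content("aaaa") == 0
# Two equally frequent symbols yield exactly one bit
assert information_content("aabb") == 1.0
```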
Given a test set of items T, the selection of s as splitting feature generates a group of subsets of T: T_{s,1}, ..., T_{s,M}, where M is the number of possible values of feature s. We define the information gain of feature s for the set T as:

I_{s,T} = I_T - \sum_{i=1}^{M} \frac{|T_{s,i}|}{|T|} \, I_{T_{s,i}}

Equation 4.4 Information gain
That is, the information gain of the split feature s is the difference between the information of the initial set of items (I_T) and the weighted sum of the information of the subsets of items induced by the split feature.
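The gain computation just described can be sketched directly; the dictionary-based items and the accessor functions below are illustrative assumptions, not part of the source:

```python
from collections import Counter
from math import log2

def information(items, category):
    """I_T: information of the category labels of a set of items."""
    counts = Counter(category(item) for item in items)
    n = len(items)
    return -sum((c / n) * log2(c / n) for c in counts.values())

def information_gain(items, feature, category):
    """I_{s,T} = I_T minus the weighted sum of the information
    of the subsets induced by the values of the split feature."""
    subsets = {}
    for item in items:
        subsets.setdefault(feature(item), []).append(item)
    weighted = sum(len(sub) / len(items) * information(sub, category)
                   for sub in subsets.values())
    return information(items, category) - weighted

# Hypothetical items: the feature "f" perfectly predicts the category "c",
# so the gain equals the full information of the initial set (1 bit here).
items = [{"f": "a", "c": 0}, {"f": "a", "c": 0},
         {"f": "b", "c": 1}, {"f": "b", "c": 1}]
gain = information_gain(items, lambda i: i["f"], lambda i: i["c"])
assert gain == 1.0
```

A feature whose subsets are as mixed as the original set yields a gain of zero, which is why a greedy tree builder prefers the feature with the highest gain at each split.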
We are now able to summarize the main features emerging from the analysis:

Classification. This is the main goal of the system: the system must be able to assign a category to an item based on some criteria.
Classifier training. To fulfil the previous goal, the system must be able to capture a set of criteria from an existing set of items.

Problem representation. The tool is problem-independent; this means that the user should be allowed to represent the specific problem in terms of items, features and categories.

Criteria representation. The outcome of the training must be represented in a human-readable format, which can be checked by experts.
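As a sketch of what such a problem-independent representation could look like (all names here are hypothetical and not taken from the tool being described):

```python
from dataclasses import dataclass, field

@dataclass
class Problem:
    """A specific problem expressed in terms of features, categories
    and items, independently of the classification machinery."""
    features: list          # feature names, e.g. ["outlook", "wind"]
    categories: list        # admissible categories, e.g. ["yes", "no"]
    items: list = field(default_factory=list)  # (feature values, known category)

    def add_item(self, values: dict, category):
        # Every declared feature needs a value; the category must be admissible.
        assert set(values) == set(self.features)
        assert category in self.categories
        self.items.append((values, category))

# Hypothetical usage: a toy weather problem
p = Problem(features=["outlook", "wind"], categories=["yes", "no"])
p.add_item({"outlook": "sunny", "wind": "weak"}, "yes")
assert len(p.items) == 1
```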
The following functionalities need to be tested carefully:
The most important is the correct construction of the classifier from a set
of items. The correctness of the classifier can be tested by checking whether
it assigns the expected category to items whose category is known.
It is also important to check that the internal representation of the classi-
fier is implemented correctly and that it can be represented in a readable