Note that if we have a sequence containing only one symbol, its information content is zero. Indeed, in Equation 4.1 the frequency $f_i$ is exactly 1 and the number of symbols, $N$, is 1. Substituting these values into Equation 4.1 we obtain 0:
$$I = \sum_{i=1}^{1} 1 \cdot \log_2(1) = 1 \cdot \log_2(1) = 1 \cdot 0 = 0$$
Equation 4.3 Information content for the limit case
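The limit case can also be checked numerically. The following is a minimal sketch, assuming that Equation 4.1 has the usual Shannon form $I = \sum_{i=1}^{N} f_i \cdot \log_2(1/f_i)$, where $f_i$ is the relative frequency of the $i$-th distinct symbol. The class name `InformationContent` and its method are hypothetical and introduced only for illustration.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class InformationContent {

    /** Base-2 logarithm, built on the natural logarithm. */
    public static double log2(double x) {
        return Math.log(x) / Math.log(2.0);
    }

    /**
     * Information content of a sequence of symbols, under the assumed
     * form I = sum over distinct symbols of f_i * log2(1 / f_i).
     */
    public static double informationContent(List<String> symbols) {
        if (symbols.isEmpty()) {
            return 0.0;
        }
        // Count the occurrences of each distinct symbol.
        Map<String, Integer> counts = new HashMap<>();
        for (String s : symbols) {
            counts.merge(s, 1, Integer::sum);
        }
        double info = 0.0;
        for (int count : counts.values()) {
            double f = (double) count / symbols.size(); // relative frequency f_i
            info += f * log2(1.0 / f);
        }
        return info;
    }

    public static void main(String[] args) {
        // One distinct symbol: f = 1, so the only term is 1 * log2(1).
        System.out.println(informationContent(List.of("a", "a", "a", "a")));
        // Two equally frequent symbols, shown for comparison.
        System.out.println(informationContent(List.of("a", "b", "a", "b")));
    }
}
```

With a single distinct symbol the only term in the sum is $1 \cdot \log_2(1)$, so the method returns zero, in agreement with Equation 4.3.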
Given a test set of items $T$, the selection of $s$ as splitting feature generates a group of subsets of $T$: $T_{s,1}, \ldots, T_{s,M}$, where $M$ is the number of possible values of feature $s$. We define the information content of feature $s$ for the set $T$ as:
$$I_{s,T} = I_T - \sum_{i=1}^{M} \frac{|T_{s,i}|}{|T|} \cdot I_{T_{s,i}}$$
Equation 4.4 Information gain
That is, the information of the split feature $s$ is the difference between the information of the initial set of items ($I_T$) and the weighted sum of the information of the sets of items induced by the split feature.
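Equation 4.4 can be sketched in a few lines of Java. This is only an illustration under stated assumptions: the information content of a set is taken to be the Shannon form $\sum f \cdot \log_2(1/f)$ over the categories of its items, and the `Item` class, the `InformationGain` class, and the sample data are hypothetical, not the classes of the case study.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class InformationGain {

    /** Hypothetical item: a map of feature values plus a known category. */
    public static final class Item {
        public final Map<String, String> features;
        public final String category;

        public Item(Map<String, String> features, String category) {
            this.features = features;
            this.category = category;
        }
    }

    /** Information content of the categories found in a set of items (assumed Shannon form). */
    public static double informationContent(List<Item> items) {
        if (items.isEmpty()) {
            return 0.0;
        }
        Map<String, Integer> counts = new HashMap<>();
        for (Item item : items) {
            counts.merge(item.category, 1, Integer::sum);
        }
        double info = 0.0;
        for (int count : counts.values()) {
            double f = (double) count / items.size();
            info += f * (Math.log(1.0 / f) / Math.log(2.0));
        }
        return info;
    }

    /** I_{s,T}: information of T minus the weighted information of the subsets T_{s,i}. */
    public static double informationGain(List<Item> items, String feature) {
        // Partition T into one subset per observed value of the split feature.
        // Values that never occur give empty subsets with weight 0, so they can be skipped.
        Map<String, List<Item>> subsets = new HashMap<>();
        for (Item item : items) {
            subsets.computeIfAbsent(item.features.get(feature), k -> new ArrayList<>()).add(item);
        }
        double weightedSum = 0.0;
        for (List<Item> subset : subsets.values()) {
            double weight = (double) subset.size() / items.size(); // |T_{s,i}| / |T|
            weightedSum += weight * informationContent(subset);
        }
        return informationContent(items) - weightedSum;
    }

    /** Made-up sample data: "outlook" determines the category, "windy" does not. */
    public static List<Item> sampleItems() {
        return List.of(
            new Item(Map.of("outlook", "sunny", "windy", "true"), "yes"),
            new Item(Map.of("outlook", "sunny", "windy", "false"), "yes"),
            new Item(Map.of("outlook", "rainy", "windy", "true"), "no"),
            new Item(Map.of("outlook", "rainy", "windy", "false"), "no"));
    }

    public static void main(String[] args) {
        List<Item> items = sampleItems();
        System.out.println("outlook: " + informationGain(items, "outlook"));
        System.out.println("windy:   " + informationGain(items, "windy"));
    }
}
```

In the sample data, splitting on `outlook` produces subsets that each contain a single category, so the weighted sum is zero and the gain equals the information of the whole set. Splitting on `windy` leaves both subsets as mixed as the original set, so the gain is zero.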
4.2.2 Main features
We are now able to summarize all the main features emerging from the
problem analysis.
■ Classification. This is the main goal of the system: the system must be able to assign a category to an item based on some criteria.
■ Classifier training. To fulfil the previous goal, the system must be able to capture a set of criteria from an existing set of items.
■ Problem representation. The tool is problem-independent; this means that the user should be allowed to represent the specific problem in terms of items, features and categories.
■ Criteria representation. The outcome of the training must be represented in a human-readable format, which can be checked by experts.
4.2.3 Test
The following functionalities need to be tested carefully:
■ The most important is the correct construction of the classifier from a set of items. The correctness of the classifier can be tested by checking whether it assigns the expected category to items whose category is known.
■ It is also important to check that the internal representation of the classifier is implemented correctly and that it can be represented in a readable way.
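The first check above can be sketched as a small helper that counts how many items of known category are classified differently from what is expected. This is a minimal sketch: the classifier is modelled as a plain `Function` from feature values to a category, and `ClassifierCheck`, `toyClassifier`, and the sample data are hypothetical stand-ins, not the classifier developed in this case study.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

public class ClassifierCheck {

    /**
     * Counts the labelled items for which the classifier returns a
     * category different from the expected one.
     */
    public static int countMismatches(Function<Map<String, String>, String> classifier,
                                      Map<Map<String, String>, String> labelled) {
        int mismatches = 0;
        for (Map.Entry<Map<String, String>, String> entry : labelled.entrySet()) {
            String actual = classifier.apply(entry.getKey());
            if (!entry.getValue().equals(actual)) {
                mismatches++;
            }
        }
        return mismatches;
    }

    /** Hand-written stand-in for a trained classifier (an assumption for this sketch). */
    public static String toyClassifier(Map<String, String> item) {
        return "sunny".equals(item.get("outlook")) ? "yes" : "no";
    }

    /** Made-up items whose category is known, keyed by their feature values. */
    public static Map<Map<String, String>, String> sampleLabelled() {
        Map<Map<String, String>, String> labelled = new LinkedHashMap<>();
        labelled.put(Map.of("outlook", "sunny", "windy", "true"), "yes");
        labelled.put(Map.of("outlook", "sunny", "windy", "false"), "yes");
        labelled.put(Map.of("outlook", "rainy", "windy", "true"), "no");
        labelled.put(Map.of("outlook", "rainy", "windy", "false"), "no");
        return labelled;
    }

    public static void main(String[] args) {
        int mismatches = countMismatches(ClassifierCheck::toyClassifier, sampleLabelled());
        System.out.println("mismatches: " + mismatches);
    }
}
```

A test would then require the mismatch count to be zero for a classifier trained on the same items. In a real test suite the same comparison would more naturally be written as assertions in a unit-testing framework.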