Chapter 3
A Generic Algorithm for Top-Down Induction of Decision Trees
3.1 Training Set
Before presenting the pseudocode for a top-down decision-tree induction
algorithm, we present several definitions and notations that will be used.
In a typical supervised learning scenario, a training set is given and the
goal is to form a description that can be used to predict previously unseen
examples.
In many machine learning algorithms, the training set size and the
classification accuracy are positively correlated. In particular, the predictive
performance of a decision tree is expected to be poor if the training set
used to train the model is too small. Moreover, the size of a classification
tree is strongly correlated with the training set size. This phenomenon can be
explained by the fact that in most decision tree algorithms (such as CART
and C4.5) the number of leaves is bounded by the number of instances, since
a node cannot be split without a sufficient number of instances, as the
sketch below illustrates.
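The bound is easy to observe empirically. The following is a minimal sketch,
assuming scikit-learn's CART-style DecisionTreeClassifier and a synthetic
dataset (both are illustrative assumptions, not part of this chapter): it
grows an unpruned tree on increasingly large training sets and counts the
resulting leaves.

    from sklearn.datasets import make_classification
    from sklearn.tree import DecisionTreeClassifier

    # Synthetic data; the dataset and library choice are assumptions
    # made purely for illustration.
    X, y = make_classification(n_samples=2000, n_features=10, random_state=0)

    for m in (50, 200, 1000, 2000):
        clf = DecisionTreeClassifier(random_state=0).fit(X[:m], y[:m])
        # Every leaf holds at least one training instance, so the tree
        # can never have more leaves than the m instances it grew from.
        print(f"{m:5d} instances -> {clf.get_n_leaves():4d} leaves")

The leaf count typically grows with m and never exceeds it, matching the
bound stated above.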
The training set can be described in a variety of ways. Most frequently,
it is described as a bag instance of a certain bag schema. A bag instance
is a collection of tuples (also known as records, rows or instances) that
may contain duplicates. Each tuple is described by a vector of attribute
values. The bag schema provides the description of the attributes and their
domains. In this book, a bag schema is denoted as $B(A \cup y)$, where $A$
denotes the set of input attributes containing $n$ attributes:
$A = \{a_1, \ldots, a_i, \ldots, a_n\}$, and $y$ represents the class
variable or the target attribute.
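To make the notation concrete, here is a minimal Python sketch of a bag
schema $B(A \cup y)$ and a bag instance over it; the dataclass and the
attribute names are hypothetical illustrations, not definitions from this
chapter.

    from dataclasses import dataclass

    @dataclass
    class BagSchema:
        input_attributes: list[str]  # A = {a_1, ..., a_n}
        target_attribute: str        # y, the class variable

    # Hypothetical attribute names, used only to make the notation concrete.
    schema = BagSchema(input_attributes=["outlook", "temperature", "windy"],
                       target_attribute="play")

    # A bag instance over this schema: a collection of tuples in which
    # duplicates are permitted, hence a list rather than a set.
    training_set = [
        ("sunny", 29.4, False, "no"),
        ("sunny", 29.4, False, "no"),   # duplicate tuple, allowed in a bag
        ("overcast", 21.1, True, "yes"),
    ]

Representing the bag as a plain list (rather than a set) is what preserves
the "may contain duplicates" property of a bag instance.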
Attributes (sometimes called fields, variables or features) are typically
of one of two types: nominal (values are members of an unordered set) or
numeric (values are real numbers). For a nominal attribute $a_i$, it is
useful to denote its domain values by
$dom(a_i) = \{v_{i,1}, v_{i,2}, \ldots, v_{i,|dom(a_i)|}\}$, where
$|dom(a_i)|$ stands for the finite cardinality of the domain.
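As a small illustration of the two attribute types and the $dom(\cdot)$
notation, the sketch below computes the observed domain of a nominal
attribute and its cardinality; the dom helper is hypothetical, not a
function defined in this chapter.

    # Hypothetical helper: the observed domain of an attribute column.
    def dom(column):
        """Return dom(a_i) = {v_i,1, ..., v_i,|dom(a_i)|} as an unordered set."""
        return set(column)

    outlook = ["sunny", "overcast", "rain", "sunny"]  # nominal attribute
    temperature = [29.4, 21.1, 18.3, 22.8]            # numeric attribute

    print(dom(outlook))       # unordered, finite set of values
    print(len(dom(outlook)))  # |dom(outlook)| = 3, the domain's cardinality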