6.4 Binning and Reduction of Cardinality
Binning is the process of converting a continuous variable into a set of ranges. Each
range can then be treated as a category, with or without an order imposed on the categories.
Whether to impose an order depends on the analysis to be performed on the data.
For example, we can bin the variable representing the annual income of a customer
into ranges of 5,000 dollars (0-5,000; 5,001-10,000; 10,001-15,000; etc.). In a
business problem, such a binning may reveal that customers in the first range are
less likely to obtain a loan than customers in the last range, a comparison made by
grouping customers within intervals that bound a numerical variable. A comparison
of this kind treats the bins simply as groups, which shows that keeping the strict
order of the bins is not always necessary.
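The income example can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed implementation: the function name `income_bin`, the `width` parameter and the sample incomes are all assumptions made here, and incomes are assumed to be whole, non-negative dollar amounts. The function returns both an integer index (usable when the order of the bins matters) and a text label (usable when the bins are treated as unordered categories).

```python
import math

def income_bin(income, width=5000):
    """Return (index, label) of the fixed-width range containing `income`.

    Ranges follow the example in the text: 0-5,000; 5,001-10,000; ...
    Hypothetical helper; assumes whole, non-negative dollar amounts.
    """
    if income < 0:
        raise ValueError("income must be non-negative")
    # Index 0 covers 0..width; index i >= 1 covers i*width+1 .. (i+1)*width.
    index = max(math.ceil(income / width) - 1, 0)
    low = 0 if index == 0 else index * width + 1
    high = (index + 1) * width
    return index, f"{low:,}-{high:,}"

# Illustrative incomes, chosen to hit the range boundaries.
incomes = [0, 4200, 5000, 5001, 12750, 15000]
binned = [income_bin(x) for x in incomes]
```

Dropping the index and keeping only the label corresponds to treating the ranges as nominal categories; keeping the index preserves the order for analyses that need it.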
Cardinality reduction of nominal and ordinal variables is the process of combining
two or more categories into one new category. It is well known that nominal variables
with a high number of categories are very problematic to handle. If we transform
these high-cardinality variables into indicator variables, that is, binary variables
that indicate whether or not a category is set for each example, we will produce a
large number of new variables, almost all equal to zero. On the other hand, if we do
not perform this conversion and use the variables as they are with an algorithm that
can tolerate them, such as decision trees, we run into the problem of over-fitting the
model. It is therefore reasonable to consider reducing the number of categories in
such variables.
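A small Python sketch can make the trade-off concrete. It first builds indicator variables for a nominal variable, then reduces its cardinality before building them again. Merging infrequent categories into a single "OTHER" category is only one simple heuristic assumed here for illustration; the text does not prescribe a particular grouping rule. The helper names `one_hot` and `group_rare`, the `min_count` threshold and the sample data are likewise hypothetical.

```python
from collections import Counter

def one_hot(values):
    """Return the sorted categories and one 0/1 indicator row per value."""
    categories = sorted(set(values))
    rows = [[int(v == c) for c in categories] for v in values]
    return categories, rows

def group_rare(values, min_count=2, other="OTHER"):
    """Merge every category seen fewer than `min_count` times into `other`.

    Assumed heuristic for illustration; other grouping rules are possible.
    """
    counts = Counter(values)
    return [v if counts[v] >= min_count else other for v in values]

# Illustrative nominal variable with five distinct categories.
cities = ["Madrid", "Paris", "Madrid", "Oslo", "Lima", "Paris", "Cairo"]

cats_before, rows_before = one_hot(cities)   # one column per original category
reduced = group_rare(cities)                 # rare cities become "OTHER"
cats_after, rows_after = one_hot(reduced)    # fewer indicator columns
```

Each indicator row contains exactly one 1, so the more categories there are, the larger the share of zeros; reducing the number of categories first shrinks the number of columns accordingly.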
Both processes are common transformations used to achieve two objectives:
- Reduce the complexity of the independent and, possibly, the dependent variables.
- Improve the predictive power of the variable, by carefully binning or grouping
  the categories in such a way that we model the dependencies regarding the target
  variable in both estimation and classification problems.
Binning and cardinality reduction are very similar procedures, differing only in
the type of variable that we want to process. In fact, both processes are grouped
indistinctly under the term discretization, which is the most popular term in the
literature. It is also very common to distinguish between binning and
discretization depending on the ease of the process performed: binning is usually
associated with a quick and easy discretization of a variable. In [11], the authors
distinguish among three types of discretization: binning, histogram analysis-based and
advanced discretization. The first corresponds to a splitting technique based on the
specification of the number of bins. The second is related to unsupervised
discretization. For the third, only a brief overview of the remaining methods is given.
Regardless of the above, and under the discretization nomenclature, we will discuss
all related issues and techniques in Chap. 9.
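As a rough illustration of binning driven only by the number of bins, the following Python sketch shows two common variants, equal-width and equal-frequency splitting. Choosing these two variants is an assumption made here, since the text does not name a specific scheme; the function names and sample data are hypothetical as well.

```python
def equal_width_bins(values, k):
    """Assign each value to one of k intervals of equal length."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    if width == 0:
        # All values are identical, so a single bin is enough.
        return [0 for _ in values]
    # The maximum would fall just past the last interval, so clamp it.
    return [min(int((v - lo) / width), k - 1) for v in values]

def equal_frequency_bins(values, k):
    """Assign each value to one of k bins of roughly equal size.

    Bins are assigned by rank, so tied values may end up in different bins.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    bins = [0] * n
    for rank, i in enumerate(order):
        bins[i] = rank * k // n
    return bins

# Illustrative data with one extreme value.
data = [1, 2, 3, 4, 5, 6, 7, 8, 100]
by_width = equal_width_bins(data, 3)
by_freq = equal_frequency_bins(data, 3)
```

With the extreme value 100 in the data, equal-width splitting places almost every value in the first bin, whereas equal-frequency splitting keeps the bins balanced. This sensitivity to the data distribution is one reason more elaborate discretization methods exist.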
 