Data Reduction - Data Preprocessing in Data Mining

6.4 Binning and Reduction of Cardinality

Binning is the process of converting a continuous variable into a set of ranges. Each range can then be treated as a category. Imposing an order on the categories is optional and depends on the further analysis to be made on the data. For example, we can bin the variable representing the annual income of a customer into ranges of 5,000 dollars (0-5,000; 5,001-10,000; 10,001-15,000; etc.). With such a binning, the analysis of a business problem may reveal that customers in the first range are less likely to get a loan than customers in the last range, simply by grouping them within intervals that bound a numerical variable. It also shows that keeping the strict order of bins is not always necessary.
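The income example above can be sketched in a few lines of Python. This is only an illustration, assuming whole-dollar, non-negative incomes; the function name `income_bin` and the sample values are hypothetical and not part of the original text.

```python
import math

def income_bin(income, width=5000):
    """Return the label of the fixed-width range that contains `income`."""
    if income < 0:
        raise ValueError("income must be non-negative")
    # Ranges are closed on the right: 0-5,000; 5,001-10,000; ...
    index = max(math.ceil(income / width) - 1, 0)
    low = 0 if index == 0 else index * width + 1
    high = (index + 1) * width
    return f"{low:,}-{high:,}"

# Hypothetical annual incomes in dollars.
incomes = [0, 4200, 5000, 5001, 12750]
labels = [income_bin(x) for x in incomes]
print(labels)
```

The resulting labels can be used as plain categories, or mapped back to the bin index whenever the analysis needs the order of the ranges.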

Cardinality reduction of nominal and ordinal variables is the process of combining two or more categories into one new category. It is well known that nominal variables with a high number of categories are very problematic to handle. If we transform these high-cardinality variables into indicator variables, that is, binary variables that indicate whether or not a category is set for each example, we will produce a large number of new variables, almost all equal to zero. On the other hand, if we do not perform this conversion and use them just as they are with an algorithm that can tolerate them, such as decision trees, we run the risk of over-fitting the model. It is therefore reasonable to consider reducing the number of categories in such variables.
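As a rough illustration of why fewer categories means fewer indicator variables, the sketch below merges infrequent categories into a single new one. The frequency threshold is an assumption chosen for simplicity; the text does not prescribe a specific merging criterion, and the names `merge_rare_categories`, `min_count` and `other_label`, as well as the sample data, are hypothetical.

```python
from collections import Counter

def merge_rare_categories(values, min_count=2, other_label="other"):
    """Replace every category seen fewer than `min_count` times by `other_label`."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other_label for v in values]

# Hypothetical nominal variable with five categories.
cities = ["Madrid", "Madrid", "Granada", "Granada", "Granada", "Jaen", "Cadiz", "Lugo"]
reduced = merge_rare_categories(cities)

# One indicator variable would be needed per distinct category.
n_before = len(set(cities))
n_after = len(set(reduced))
print(n_before, n_after)
```

Here the three categories that occur only once collapse into one, so the indicator encoding needs three binary variables instead of five.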

Both processes are common transformations used to achieve two objectives:

• Reduce the complexity of independent and possibly dependent variables.

• Improve the predictive power of the variable, by carefully binning or grouping the categories in such a way that we model the dependencies regarding the target variable in both estimation and classification problems.
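One possible way to group categories with the target variable in mind is sketched below: categories are merged according to whether their rate of positive targets lies above or below the overall rate. This particular criterion is an assumption made for illustration, not a method described in the text, and the function name `group_by_target_rate` and the data are hypothetical.

```python
from collections import defaultdict

def group_by_target_rate(categories, targets):
    """Map each category to 'high' or 'low' by comparing its positive rate to the overall rate."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for c, t in zip(categories, targets):
        totals[c] += 1
        positives[c] += t
    overall = sum(targets) / len(targets)
    return {c: ("high" if positives[c] / totals[c] >= overall else "low") for c in totals}

# Hypothetical data: occupation of each customer and whether a loan was granted (1) or not (0).
jobs = ["clerk", "clerk", "nurse", "nurse", "pilot", "pilot", "chef", "chef"]
loan = [0, 0, 1, 1, 1, 1, 0, 1]
groups = group_by_target_rate(jobs, loan)
print(groups)
```

The four original categories are reduced to two groups that differ in their relationship with the target, which is the kind of dependency the second objective refers to.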

Binning and cardinality reduction are very similar procedures, differing only in the type of variable that we want to process. In fact, both processes are indistinctly grouped under the term discretization, which is the most popular name in the literature. It is also very common to distinguish between binning and discretization depending on the ease of the process performed: binning is usually associated with a quick and easy discretization of a variable. In [11], the authors distinguish among three types of discretization: binning, histogram analysis-based and advanced discretization. The first corresponds to a splitting technique based on the specification of the number of bins. The second family is related to unsupervised discretization and, finally, the third amounts to a brief inspection of the remaining methods.
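A minimal sketch of splitting by a specified number of bins is shown below, assuming equal-width intervals between the minimum and maximum of the data. Equal width is only one common choice and is an assumption here; the function name `equal_width_bins` and the sample values are hypothetical.

```python
def equal_width_bins(values, n_bins):
    """Assign each value the index (0 .. n_bins - 1) of its equal-width interval."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so a single bin is enough.
        return [0] * len(values)
    width = (hi - lo) / n_bins
    # The maximum value would fall just past the last interval, so clamp it.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

# Hypothetical numerical variable split into three bins.
values = [1, 2, 3, 10, 11, 20]
bins = equal_width_bins(values, 3)
print(bins)
```

Only the number of bins has to be chosen by the user; the interval boundaries follow from the range of the data, which is why this kind of binning is regarded as quick and easy.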

Regardless of the above, and under the discretization nomenclature, we will discuss all related issues and techniques in Chap. 9 of this book.
