Bayesian Approach to the Pattern Recognition Problem in Nonstationary Environment (Pattern Recognition and Machine Learning)

Abstract

The classical learning problem of the pattern recognition in a finite-dimensional linear space of real-valued features is studied under the conditions of a non-stationary universe. The training criterion of non-stationary pattern recognition is formulated as a generalization of the classical Support Vector Machine. The respective numerical algorithm has the computation complexity proportional to the length of the training time series.

The majority of modern methods for pattern recognition works under the assumption that properties of the environment and consequently required decision rule doesn’t change in a process of data collection as well as during the exam. However, recently there have appeared many applications, in which samples of training set are entering the system over a long period of time, when the properties of the analyzed phenomenon may undergo considerable changes. The example of such problem is task of filtering irrelevant or publicity hyperlinks as a result of retrieval request. The behavior of internet advertising distributors constantly turns and it causes changes of publicity hyperlinks features that should result in the adequate correction of search mechanism. Often in this kind of tasks both the nature of changes in the environment and the fact of changes itself are hidden from direct observation and this makes learning even more difficult. In literature this problem is called concept drift [1].


At present there are three approaches to the construction of algorithms taking into account non-stationary character of decision rule: algorithms based on selecting instances, algorithms based on weighted instances and algorithms based on classifier selection and mergers. The goal of algorithms using selection of instances is the choice of prototypes which are relevant for decision rule at the present moment. As a rule, it is realized with the help of running window technology when decision rule at the present moment is made only on the basis of the instances obtained from previous time points. The examples of such algorithms can be FLORA family of algorithms [1] and TMF [2].

Algorithms based on weighting of instances [3] obtained from different time points use the ability of some learning algorithms such as Support Vector Machines (SVMs) method to assign weights to different instances. As a rule, weights are assigned to the instances according to their "age" (e.g. period of time from their obtaining).

By ensemble-based approach [5] for pattern recognition in non-stationary environment, required decision rule is made as voting or weighted voting of classifiers obtained for different conditions. For example, in the paper [4] it is proposed to cluster training samples and to construct its own classifier for each cluster.

In general, we can mark that all existing algorithms are more or less heuristic and a certain set of heuristics is determined by a specificity of the current task. In this paper we propose stochastic concept of the non-stationary environment based on Bayesian approach to the pattern recognition problem. The main instantaneous property of the non-stationary environment is understood as time-dependent separating hyperplane that in the best way describes the differences between feature vectors of samples of two classes. The proposed concept brings about the learning algorithm that is a generalization of the classical SVM for the case when the parameters of the separating hyperplane change with time.

Bayesian Definition of the Pattern Recognition Problem in Non-stationary Environment

Let every instance of the environmenttmp1411-60_thumbis presented by a point in the linear feature spacetmp1411-61_thumb, and its hidden membership in one of two classes is determined by an index value of the classtmp1411-62_thumb The paper [6] proposed a stochastic model of the universe. The main model assumption is a priori parametric distribution of the instances tmp1411-66_thumb

And this distribution is defined by the objectively existing hyperplanetmp1411-67_thumb tmp1411-68_thumbwith an unknown directing vector of featurestmp1411-69_thumb having a priori distributiontmp1411-70_thumbUsing the maximum a posteriori probability principle for the directing vector parameters estimation (a, b) leads us to the widely-known support vector criterion

tmp1411-75_thumb

that was proposed by V. N. Vapnik from a strictly deterministic point of view. Of course, the concept of time is completely absent here.

The principal distinction of the

concept of non-stationary environment given in this paper consists in taking into consideration time factor t. We suppose that the main property of non-stationary environment is expressed by time-dependent separating hyperplane that characterizes primary difference of the feature vectors of instances of two classes. This separating hyperplane, in turn, is completely defined by its own direction vector and location parameter, which should be considered as time functionstmp1411-76_thumbIn the terms of the improper probability distributions of two classes in the feature space such idea is formulated in the form of hypothesis that parameters of this pair of distributions are time varying:

tmp1411-78_thumb

where

tmp1411-79_thumb

We will suppose that at the zero time moment a priori distribution of the separating hyperplane is improper and has the appearance:tmp1411-80_thumb tmp1411-81_thumb. In turn we will regard directing vector aj as a stochastic stationary process:tmp1411-82_thumb

Dynamic Support Vector Machine Criterion

Now every instancetmp1411-86_thumbis considered only together with the indication of time point when this instance was presentedtmp1411-87_thumb. As a result the training set represents as the set of triplettmp1411-88_thumb a subset of instances, entered in time point t.

We obtain the sought-for sequencetmp1411-89_thumbas maximum point of joint a priori distribution of separating hyperplane’s parameters and training sample. The maximum a posteriori probability principle lets to the criterion

tmp1411-94_thumb

The criterion (4) realizes the conception of the rather smooth sequence of optimal separating hyperplanes as opposed to the conception of the single optimal hyperplane in (2).

As the classic learning problem, the dynamical problem (4) is a quadratic programming problem but contains T(n +1) + N variables in contrast to (n + 1)+N variables in (2). It is known that a computational complexity of the general quadratic programming problem is proportional to the cube of the number of variables, i.e. a dynamic problem ex facte is more complex than the classical problem.

But the objective function in a dynamic problem (4) is pair-wise separable, i.e. representing a sum of private functions every of which depends on the variables connected with one or two time points in their increasing order. The algorithm of the pair-wise separable criterion optimization suggested in this paper consists in an approximate implementation of the dynamic programming procedure and permits to solve the problem mentioned above within the number of iterations proportional to the length of the training sequence.

Quickly Optimization Procedure for a Dynamic SVM Criterion

The algorithm for the optimization of the obtained pair-wise separable criterion proposed in this paper is based on the using a general principle of the dynamic programming. Let us to introduce the following notationtmp1411-95_thumb

tmp1411-97_thumb

and rewrite the criterion (4) in a more convenient form

tmp1411-98_thumb

where the areas of acceptable values for variables are determined by the conditions

tmp1411-99_thumb

The central idea of the dynamical programming method uses the concept of a sequence of Bellman functionstmp1411-100_thumbcon

The central idea of the dynamical programming method uses the concept of a sequence of Bellman functionstmp1411-101_thumbconnected with partial criteriatmp1411-102_thumb

having the same structure as the full objective function but defined on the set of variablestmp1411-103_thumbIn order to obtain the filtering estimations of the separating hyperplane’s parameters we will use the fundamental property of Bellman function

tmp1411-104_thumb

which is called the direct recurrence relation. The procedure begins with the first Bellman functiontmp1411-105_thumbThen Bellman functions are recurrently evaluated for the following observation. And the minimum of Bellman function on every step determines the filtering value of the parameters of the optimal separating hyperplane

tmp1411-106_thumb

The optimization procedure is based on the hypothesis that there exists an appropriate compact form for Bellman functions representation, allowed to store this functions in the memory. But in the case then inequality constraints are imposed upon the sought-for variables and the Bellman functions are piecewise quadratic and consequently the dynamic programming procedure cannot be applied immediately. In order to save the computation advantages of the dynamic programming procedure we use here the following trick.

We heuristically replace the non quadratic functionstmp1411-113_thumb

tmp1411-114_thumbby the some appropriate quadratic approximationtmp1411-115_thumbThen the following approximations of Bellman functions will be quadratic too and a numerical implementation of the dynamic programming procedure will be possible. Thus, the quadratic approximation of the Bellman function comes to the selection of appropriate values of the parameterstmp1411-116_thumbof the quadratic functiontmp1411-117_thumb which would ensure invariance conservation of the main features of, generally speaking, non-quadratic function and consequently the initial Bellman function. Such features are the minimum point positiontmp1411-118_thumb, minimum point valuetmp1411-119_thumb, and also the matrix of the second derivatives at the minimum pointtmp1411-120_thumb

Case Study: Spam-Filtering Problem

The object of the experimental study in this paper is the problem of filtering spam-addresses in a result of retrieval request. The behavior of distributors of network advertising constantly improves and as a result a classifier of advertising links used by a search engine should adapt to the behavior of spammers. Consequently we come to problem of construction the time-dependent decision rule.

As the training data we took an anonymous set of hyperlinks URL Reputation Data Set from the repository UCI, first described in the paper [7]. This set consists of the addresses of Web resources for 121 days grouped according to the observation days and contains both advertising and relevant hyperlinks. Every instance in the data base is characterized by 3.2 million features which may be divided into two groups: lexical (hostname, primary domain, TLD and etc.) and host-based (WHOIS info, IP prefix, geographic and etc.). The values of features are normalized from 0 to 1, and the features themselves are anonymous. For carrying out the experiments from this data base in different ways there have been randomly selected 10 instances for each of 11 days taken from the 1st to the 110th day with the step of 10 days. The testing set was comprised by accidentally selected 4000 instances of the 120th day. We deleted the features with the values that are equal on all instances of the training set out of feature descriptions of the instances participating in the experiment, the number of the rest was 326. The experiment was repeated for different sets of training instances. Two algorithms were compared in the course of the experiment: normal support vector machine with the hyperplane, constructed on all instances of the training set and the algorithm of successive refinement of the decision rule, suggested in the paper. In the proposed algorithm of dynamic pattern recognition the parameters of the required decision rule d and d’ were chosen by cross-validation procedure, on each iteration of which non-stationary decision rule was constructed without taking into account the instances entered over one of the days. The general per cent of erroneously classified links e was calculated for each algorithm: the percent of malicious URLs erroneously classified as benign ones, and the per cent of benign URLs, erroneously classified as malicious ones e+. As the results given in the table show, the incremental learning algorithm of the pattern recognition allows to improve significantly the quality of recognition that indirectly confirms non-stationary character of the data used for the experiments.

Algorithm

tmp1411-129 tmp1411-130 tmp1411-131

SVM

32.32

5.94

16.0

Incremental Algorithm

14.96

4.49

8.49

Conclusions

In the paper the adaptation of the general definition of the pattern recognition learning problem with two classes of instances in the finite-dimensional space of real features is made under the conditions of non-stationary environment. The non-stationary property of environment is considered as time-changing decision rule. Under this assumption the learning criterion appeared a dynamic modification of SVM criterion. Also we have suggested optimization algorithm for Dynamic SVM having the linear computational complexity relative to the length of the training time series. Our algorithm was applied to the problem of filtering irrelevant or publicity hyperlinks as a result of retrieval request and its application achieved a good enough result as compared with classical SVM.

Next post:

Previous post: