Part of Advances in Neural Information Processing Systems 16 (NIPS 2003)
Koby Crammer, Jaz Kandola, Yoram Singer
Online algorithms for classification often require vast amounts of memory and computation time when employed in conjunction with kernel functions. In this paper we describe and analyze a simple approach for an on-the-fly reduction of the number of past examples used for prediction. Experiments performed with real datasets show that using the proposed algorithmic approach with a single epoch is competitive with the support vector machine (SVM) although the latter, being a batch algorithm, accesses each training example multiple times.
1 Introduction and Motivation
Kernel-based methods are widely used for data modeling and prediction because of their conceptual simplicity and outstanding performance on many real-world tasks. The support vector machine (SVM) is a well known algorithm for finding kernel-based linear classifiers with maximal margin [7]. The kernel trick provides an effective method for dealing with very high dimensional feature spaces as well as for modeling complex input phenomena via embedding into inner product spaces. However, despite the generalization error being upper bounded by a function of the margin of a linear classifier, it is notoriously difficult to implement such classifiers efficiently. Empirically this often translates into very long training times. A number of alternative algorithms exist for finding a maximal margin hyperplane, many of which have been inspired by Rosenblatt's Perceptron algorithm [6], an online learning algorithm for linear classifiers. The work on SVMs has inspired a number of modifications and enhancements to the original Perceptron algorithm. These incorporate the notion of margin into the learning and prediction processes whilst exhibiting good empirical performance. Examples of such algorithms include the Relaxed Online Maximum Margin Algorithm (ROMMA) [4], the Approximate Maximal Margin Classification Algorithm (ALMA) [2], and the Margin Infused Relaxed Algorithm (MIRA) [1], all of which can be used in conjunction with kernel functions.
A notable limitation of kernel based methods is their computational complexity, since the amount of computer memory they require to store the so-called support patterns grows linearly with the number of prediction errors. A number of attempts have been made to speed up the training and testing of SVMs by enforcing a sparsity condition. In this paper we devise an online algorithm that is not only sparse but also generalizes well. To achieve this goal our algorithm employs an insertion and deletion process. Informally, it can be thought of as revising the weight vector after each example on which a prediction mistake
has been made. Once such an event occurs the algorithm adds the new erroneous example (the insertion phase), and then immediately searches for past examples that appear to be redundant given the recent addition (the deletion phase). As we describe later, making this adjustment to the algorithm allows us to modify the standard online proof techniques so as to provide a bound on the total number of examples the algorithm keeps.
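To make this round structure concrete before the formal treatment in Sec. 2, the following is a minimal Python sketch of a single insert-then-delete round. The callables predict_margin, insert_example, and distill_cache stand in for the concrete procedures given later (Figures 1 and 2); they are illustrative placeholders of ours, not a prescribed interface.

def online_round(x_t, y_t, cache, predict_margin, insert_example, distill_cache, beta=0.0):
    s = predict_margin(cache, x_t)          # current classifier's score on the new instance
    y_hat = 1 if s >= 0 else -1             # predicted label
    if y_t * s <= beta:                     # mistake, or margin not above the tolerance beta
        insert_example(cache, x_t, y_t)     # insertion phase: add the new example
        distill_cache(cache)                # deletion phase: drop redundant old examples
    return y_hat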
This paper is organized as follows. In Sec. 2 we formalize the problem setting and provide a brief outline of our method for obtaining a sparse set of support patterns in an online setting. In Sec. 3 we present both theoretical and algorithmic details of our approach and provide a bound on the number of support patterns that constitute the cache. Sec. 4 provides experimental results on three real-world datasets that illustrate the performance and merits of our sparse online algorithm. We end the paper with conclusions and ideas for future work.
2 Problem Setting and Algorithms
This work focuses on online additive algorithms for classification tasks. In such problems we are typically given a stream of instance-label pairs $(x_1, y_1), \ldots, (x_t, y_t), \ldots$. We assume that each instance is a vector $x_t \in \mathbb{R}^n$ and each label belongs to a finite set $\mathcal{Y}$. In this and the next section we assume that $\mathcal{Y} = \{-1, +1\}$ but relax this assumption in Sec. 4 where we describe experiments with datasets consisting of more than two labels. When dealing with the task of predicting new labels, thresholded linear classifiers of the form $h(x) = \mathrm{sign}(w \cdot x)$ are commonly employed. The vector $w$ is typically represented as a weighted linear combination of the examples, namely $w = \sum_t \alpha_t y_t x_t$ where $\alpha_t \geq 0$. The instances for which $\alpha_t > 0$ are referred to as support patterns. Under this representation the output of the classifier depends solely on inner-products of the form $x \cdot x_t$, so kernel functions can easily be employed by simply replacing the standard scalar product with a function $K(\cdot, \cdot)$ which satisfies Mercer's conditions [7]. The resulting classification rule takes the form $h(x) = \mathrm{sign}(w \cdot x) = \mathrm{sign}(\sum_t \alpha_t y_t K(x, x_t))$. The majority of additive online algorithms for classification, for example the well known Perceptron [6], share a common algorithmic structure. These online algorithms typically work in rounds. On the $t$th round, an online algorithm receives an instance $x_t$, computes the inner-products $s_t = \sum_{i \in C_{t-1}} \alpha_i y_i K(x_i, x_t)$, and outputs a prediction $\hat{y}_t = \mathrm{sign}(s_t)$. It then receives the correct label $y_t$ and updates the classifier whenever the margin $y_t s_t$ does not exceed a tolerance $\beta$. The aggressive Perceptron variant that we build on is summarized in Figure 1.

Input: Tolerance $\beta$.
Initialize: Set $\alpha_t = 0$ for all $t$; $w_0 = 0$; $C_0 = \emptyset$.
Loop: For $t = 1, 2, \ldots, T$
  - Get a new instance $x_t \in \mathbb{R}^n$.
  - Predict $\hat{y}_t = \mathrm{sign}(x_t \cdot w_{t-1})$.
  - Get a new label $y_t$.
  - If $y_t (x_t \cdot w_{t-1}) \leq \beta$ update: insert $C_t = C_{t-1} \cup \{t\}$, set $\alpha_t = 1$, compute $w_t = w_{t-1} + \alpha_t y_t x_t$, and call DistillCache$(C_t, w_t, (\alpha_1, \ldots, \alpha_t))$.
Output: $H(x) = \mathrm{sign}(w_T \cdot x)$.

Figure 1: The aggressive Perceptron algorithm with a variable-size cache.

In this paper we shift the focus to the problem of devising online algorithms which are budget-conscious, in that they attempt to keep the number of support patterns small. The approach is attractive for at least two reasons. Firstly, both the training time and classification time can be reduced significantly if we store only a fraction of the potential support patterns. Secondly, a classifier with a small number of support patterns is intuitively "simpler", and hence is more likely to exhibit good generalization properties than a complex classifier with a large number of support patterns. (See for instance [7] for formal results connecting the number of support patterns to the generalization error.)

Input: $C$; $w$; $(\alpha_1, \ldots, \alpha_t)$.
Loop:
  - Choose $i \in C$ such that $\beta \leq y_i \big( (w - \alpha_i y_i x_i) \cdot x_i \big)$.
  - If no such $i$ exists then return.
  - Remove the example $i$: $w = w - \alpha_i y_i x_i$, $\alpha_i = 0$, $C = C \setminus \{i\}$.
Return: $C$; $w$; $(\alpha_1, \ldots, \alpha_t)$.

Figure 2: DistillCache.

In Sec. 3 we present a formal analysis and the algorithmic details of our approach. Let us now provide a general overview of how to restrict the number of support patterns in an online setting. Denote by $C_t$ the indices of patterns which constitute the classification vector $w_t$. That is, $i \in C_t$ if and only if $\alpha_i > 0$ on round $t$ when $x_t$ is received. The online classification algorithms discussed above keep enlarging $C_t$: once an example is added to $C_t$ it will never be deleted. However, as the online algorithm receives more examples, the performance of the classifier improves, and some of the past examples may have become redundant and hence can be removed.
Put another way, old examples may have been inserted into the cache simply due to the lack of support patterns in early rounds. As more examples are observed, the old examples may be replaced with new examples whose location is closer to the decision boundary induced by the online classifier. We thus add a new stage to the online algorithm in which we discard a few old examples from the cache $C_t$. We suggest a modification of the online algorithm structure as follows. Whenever $y_t \left( \sum_{i \in C_{t-1}} \alpha_i y_i K(x_i, x_t) \right) \leq \beta$, the algorithm first inserts the new example into the cache and then invokes DistillCache (Figure 2) to discard examples that have become redundant. The resulting algorithm employs a variable-size cache since we do not explicitly limit the number of support patterns, though we do attempt to discard as many patterns as possible from the cache. A similar modification to the one described for the aggressive Perceptron can be made to all of the online classification algorithms outlined above. In particular, we use a modification of the MIRA [1] algorithm in our experiments.
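For concreteness, the following is a minimal Python sketch of the aggressive Perceptron with a variable-size cache (Figure 1) together with DistillCache (Figure 2), written in kernelized form so that $w \cdot x$ is evaluated as $\sum_{i \in C} \alpha_i y_i K(x_i, x)$. The class name, the default linear kernel, and the toy data stream at the end are illustrative choices of ours rather than part of the algorithms' specification, and the sketch recomputes margins naively with no claim about the efficiency of a real implementation.

import numpy as np

class CachedPerceptron:
    """Sketch of the aggressive Perceptron with a variable-size cache (Figures 1 and 2)."""

    def __init__(self, beta=0.0, kernel=None):
        self.beta = beta                                             # margin tolerance
        self.kernel = kernel or (lambda a, b: float(np.dot(a, b)))   # linear kernel by default
        self.cache = []                                              # support patterns: (x_i, y_i, alpha_i)

    def margin(self, x, skip=None):
        """Evaluate w . x = sum_{i in C} alpha_i y_i K(x_i, x), optionally skipping one cached index."""
        return sum(alpha * y * self.kernel(xi, x)
                   for j, (xi, y, alpha) in enumerate(self.cache)
                   if j != skip)

    def predict(self, x):
        return 1 if self.margin(x) >= 0 else -1

    def distill_cache(self):
        """Deletion phase (Figure 2): repeatedly remove a cached example whose own margin
        stays at least beta even after its contribution is removed from w."""
        removed = True
        while removed:
            removed = False
            for j, (xj, yj, _alpha) in enumerate(self.cache):
                if yj * self.margin(xj, skip=j) >= self.beta:
                    del self.cache[j]          # remove example j from the cache
                    removed = True
                    break

    def observe(self, x_t, y_t):
        """One online round (Figure 1): predict, then insert and distill on a margin violation."""
        y_hat = self.predict(x_t)
        if y_t * self.margin(x_t) <= self.beta:
            self.cache.append((x_t, y_t, 1.0))  # insertion phase: alpha_t = 1
            self.distill_cache()                # deletion phase
        return y_hat

A short usage example on a synthetic stream (two Gaussian clouds, purely illustrative):

rng = np.random.default_rng(0)
clf = CachedPerceptron(beta=0.1)
mistakes = 0
for _ in range(500):
    y = int(rng.choice([-1, 1]))
    x = rng.normal(loc=1.5 * y, scale=1.0, size=2)
    if clf.observe(x, y) != y:
        mistakes += 1
print("online mistakes:", mistakes, "| cache size:", len(clf.cache))

The same insert-then-distill structure can, in principle, wrap other additive online updates such as MIRA by changing how the coefficient of the newly inserted example is chosen in the insertion phase.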