{"title": "Vicinal Risk Minimization", "book": "Advances in Neural Information Processing Systems", "page_first": 416, "page_last": 422, "abstract": null, "full_text": "Vicinal Risk Minimization \n\nOlivier Chapelle, Jason Weston* , Leon Bottou and Vladimir Vapnik \n\nAT&T Research Labs, 100 Schultz drive, Red Bank, NJ, USA \n\n* Barnhill BioInformatics.com, Savannah, GA, USA. \n\n{chapelle, weston,leonb, vlad}@research.att.com \n\nAbstract \n\nThe Vicinal Risk Minimization principle establishes a bridge between \ngenerative models and methods derived from the Structural Risk Mini(cid:173)\nmization Principle such as Support Vector Machines or Statistical Reg(cid:173)\nularization. We explain how VRM provides a framework which inte(cid:173)\ngrates a number of existing algorithms, such as Parzen windows, Support \nVector Machines, Ridge Regression, Constrained Logistic Classifiers and \nTangent-Prop. We then show how the approach implies new algorithm(cid:173)\ns for solving problems usually associated with generative models. New \nalgorithms are described for dealing with pattern recognition problems \nwith very different pattern distributions and dealing with unlabeled data. \nPreliminary empirical results are presented. \n\n1 Introduction \n\nStructural Risk Minimisation (SRM) in a learning system can be achieved using constraints \non the parameter vectors, using regularization terms in the cost function, or using Support \nVector Machines (SVM). All these principles have lead to well established learning algo(cid:173)\nrithms. \n\nIt is often said, however, that some problems are best addressed by generative models. The \nfirst problem is of missing data. We may for instance have a few labeled patterns and a \nlarge number of unlabeled patterns. Intuition suggests that these unlabeled patterns carry \nuseful information. The second problem is of discriminating classes with very different \npattern distributions. 
This situation arises naturally in anomaly detection systems. It also occurs often in recognition systems that reject invalid patterns by defining a garbage class for grouping all ambiguous or unrecognizable cases. Although there are successful non-generative approaches (Schuurmans and Southey, 2000) (Drucker, Wu and Vapnik, 1999), the generative framework is undeniably appealing. Recent results (Jaakkola, Meila and Jebara, 2000) even define generative models that contain SVM as special cases. \n\nThis paper discusses the Vicinal Risk Minimization (VRM) principle, summarily introduced in (Vapnik, 1999). This principle was independently hinted at by Tong and Koller (Tong and Koller, 2000) with a useful generative interpretation. In particular, they proved that SVM are a limiting case of their Restricted Bayesian Classifiers. We extend Tong's and Koller's result by showing that VRM subsumes several well known techniques such as Ridge Regression (Hoerl and Kennard, 1970), Constrained Logistic Classifiers, or Tangent Prop (Simard et al., 1992). We then go on to show how VRM naturally leads to simple algorithms that can deal with problems for which one would have formerly considered purely generative models. We provide algorithms and preliminary empirical results for dealing with unlabeled data and for recognizing classes with very different pattern distributions. \n\n2 Vicinal Risk Minimization \n\nThe learning problem can be formulated as the search for the function f ∈ F that minimizes the expectation of a given loss ℓ(f(x), y): \n\nR(f) = ∫ ℓ(f(x), y) dP(x, y)    (1) \n\nIn the classification framework, y takes values ±1 and ℓ(f(x), y) is a step function such as 1 − sign(yf(x)), whereas in the regression framework, y is a real number and commonly ℓ(f(x), y) is the squared error (f(x) − y)². \nThe expectation (1) cannot be computed since the distribution P(x, y) is unknown. 
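Although not part of the original text, the risk functional (1) and its vicinal approximation can be illustrated with a short numerical sketch. Everything below is illustrative: the squared loss, the toy linear model, and the Gaussian vicinity distributions N(x_i, σ²I) (the vicinity choice used later in the paper) are assumptions, and the vicinal risk is estimated by Monte Carlo sampling rather than computed in closed form.

```python
import numpy as np

def loss(fx, y):
    # squared loss, as in the regression setting of the paper
    return (fx - y) ** 2

def empirical_risk(f, X, y):
    # ERM: average loss at the training points themselves
    return np.mean(loss(f(X), y))

def vicinal_risk(f, X, y, sigma=0.3, n_samples=200, rng=None):
    # VRM: average loss over Gaussian vicinities N(x_i, sigma^2 I),
    # estimated here by Monte Carlo sampling around each training point
    rng = np.random.default_rng(0) if rng is None else rng
    total = 0.0
    for xi, yi in zip(X, y):
        xs = xi + sigma * rng.standard_normal((n_samples, xi.shape[0]))
        total += np.mean(loss(f(xs), yi))
    return total / len(X)

# hypothetical toy data and linear model f(x) = w.x that fits it exactly
w = np.array([1.0, -1.0])
f = lambda X: X @ w
X = np.array([[1.0, 0.0], [0.0, 1.0]])
y = np.array([1.0, -1.0])
```

As σ → 0 the vicinities collapse onto the training points and the vicinal risk reduces to the ordinary empirical risk; a larger σ smooths the loss over a neighbourhood of each example.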
However, given a training set {(x_i, y_i)}, i = 1, ..., n, [...] σ > 0 generates a decision boundary which depends on all the examples. The contribution of each example decreases exponentially as its distance to the decision boundary increases. This is only slightly different from a soft margin SVM, whose boundary relies on support vectors that can be more distant than those selected by a hard margin SVM. The difference here is just in the cost functions (sigmoid compared to linear loss). \ne) SVM and Constrained Logistic Classifiers - The two previous paragraphs show that the same particular case of VRM is (a) equivalent to a Logistic Classifier with a constraint on the weights, and (b) tends towards the SVM classifier when σ → 0 and when the examples are separable. As a consequence, we can state that the Logistic Classifier decision boundary tends towards the SVM decision boundary when we relax the constraint on the weights. \n\nIn practice we can find the SVM solution with a Logistic Classifier by simply using an iterative weight update algorithm such as gradient descent, choosing small initial weights, and letting the norm of the weights grow slowly while the iterative algorithm is running. Although this algorithm is not exact, it is fast and efficient. This is in fact similar to what is usually done with back-propagation neural networks (LeCun et al., 1998). The same algorithm can be used for the VRM. In that context, early stopping is similar to choosing the optimal σ using cross-validation. \n\n4 New Algorithms and Results \n\n4.1 Adaptive Kernel Widths \n\nIt is known in density estimation theory that the quality of the density estimate can be improved using variable kernel widths (Breiman, Meisel and Purcell, 1977). 
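A per-example width rule of this kind can be sketched as follows. This is a hypothetical helper, not the authors' code: it sets σ_i to a global factor (which the paper chooses by cross-validation) times the average distance from x_i to its k nearest training examples, with k = 5 as in the breast cancer experiment reported below.

```python
import numpy as np

def adaptive_widths(X, k=5, global_factor=1.0):
    # Per-example kernel width sigma_i: global_factor times the average
    # distance from x_i to its k nearest other training examples.
    # global_factor would in practice be chosen by cross-validation.
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(D, np.inf)         # exclude each point itself
    knn = np.sort(D, axis=1)[:, :k]     # k smallest distances per row
    return global_factor * knn.mean(axis=1)
```

Points in sparse regions receive larger widths, giving a smoother local density estimate, while points in dense regions keep sharp kernels.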
In regions of the space where there is little data, it is safer to have a smooth estimate of the density, whereas in regions of the space where there is more data one wants to be as accurate as possible via sharper kernel estimates. The VRM principle can take advantage of these improved density estimates for other problem domains. We consider here the following density estimate: \n\ndP_est(x, y) = (1/n) Σ_i δ_{y_i}(y) N_{σ_i}(x − x_i) dx \n\nwhere the specific kernel width σ_i for each training example x_i is computed from the training set. \n\na) Wisconsin Breast Cancer - We made a first test of the method on the Wisconsin breast cancer dataset¹, which contains 589 examples in 30 dimensions. We compared VRM using the set of linear classifiers with various underlying density estimates. The minimization was achieved using gradient descent on the vicinal risk. All hyperparameters were determined using cross-validation. The following table reports results averaged over 100 runs. \n\n1 http://horn.first.gmd.de/~raetsch/data/breast-cancer \n\nTraining Set | HardSVM | SoftSVM (Best C) | VRM (Best fixed σ) | VRM (Adaptive σ_i) \n10 | 11.3% | 11.1% | 10.8% | 9.6% \n20 | 8.3% | 7.5% | 6.9% | 6.6% \n40 | 6.3% | 5.5% | 5.2% | 4.8% \n80 | 5.4% | 4.0% | 3.9% | 3.7% \n\nThe adaptive kernel widths σ_i were computed by multiplying a global factor by the average distance to the five closest training examples. The best global factor is determined by cross-validation. These results suggest that VRM with adaptive kernel widths can outperform state of the art classifiers on small training sets. \n\nb) MNIST \"1\" versus other digits - A second test was performed using the MNIST handwritten digits². We considered the sub-problem of recognizing the ones versus all other digits. The testing set contains 10000 digits (5000 ones and 5000 non-ones). 
Three training set sizes were considered, with 250, 500 or 1000 ones and an equal number of non-ones. Computations were performed using the algorithm suggested in section (3.e). We simply trained a single linear unit with a sigmoid transfer function using stochastic gradient updates. This is appropriate for implementing an approximate VRM with a single kernel width. Adaptive kernel widths are implemented by simply changing the slope of the sigmoid for each example. For each example x_i, the kernel width σ_i is computed from the training set using the 5/1000th quantile of the distances of all other examples to example x_i. The sigmoid slopes are then computed by renormalizing the σ_i in order to make their mean equal to 1. Early stopping was achieved with cross-validation. \n\nTraining Set | HardSVM | VRM (Fixed slope) | VRM (Adaptive slope) \n250+250 | 3.34% | 2.79% | 2.54% \n500+500 | 3.11% | 2.47% | 2.27% \n1000+1000 | 2.94% | 2.08% | 1.96% \n\nThe statistical significance of these results can be asserted with very high probability by comparing the lists of errors made by each system (Bottou and Vapnik, 1992). Again these results suggest that VRM with adaptive kernel widths can be very useful with small training sets. \n\n4.2 Unlabeled Data \n\nIn some applications unlabeled data is abundant whereas labeled data is not. The use of unlabeled data falls into the framework of VRM by simply writing the same vicinal loss for unlabeled points. Given m unlabeled points x*_1, ..., x*_m, one obtains the following formulation: \n\nR_vic(f) = (1/n) Σ_{i=1}^n ∫ ℓ(f(x), y_i) dP_{x_i}(x) + (1/m) Σ_{i=1}^m ∫ ℓ(f(x), f(x*_i)) dP_{x*_i}(x) \n\nTo give an example of the usefulness of our approach, consider the following setting. Two normal distributions on the real line, N(−1.6, 1) and N(1.6, 1), model the patterns of two classes with equal probability; 20 labeled points and 100 unlabeled points are drawn. 
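This toy setting is easy to reproduce. The sketch below is only illustrative: it assumes a squared loss, a tanh unit f(x) = tanh(x − b), Gaussian vicinities of width 0.5 for both labeled and unlabeled points, and a grid search over the threshold b instead of the paper's gradient-based minimization.

```python
import numpy as np

rng = np.random.default_rng(0)
n_lab, n_unl = 20, 100

# Two classes on the real line: N(-1.6, 1) with label -1, N(+1.6, 1) with label +1
y_lab = rng.choice([-1, 1], size=n_lab)
x_lab = 1.6 * y_lab + rng.standard_normal(n_lab)
x_unl = 1.6 * rng.choice([-1, 1], size=n_unl) + rng.standard_normal(n_unl)

def vicinal_ssl_risk(b, sigma_l=0.5, sigma_u=0.5, m=100):
    # Combined vicinal risk: squared loss against y_i on labeled vicinities,
    # plus a self-consistency term l(f(x), f(x*_i)) on unlabeled vicinities.
    f = lambda x: np.tanh(x - b)
    labeled = sum(
        np.mean((f(xi + sigma_l * rng.standard_normal(m)) - yi) ** 2)
        for xi, yi in zip(x_lab, y_lab)) / n_lab
    unlabeled = sum(
        np.mean((f(xi + sigma_u * rng.standard_normal(m)) - f(xi)) ** 2)
        for xi in x_unl) / n_unl
    return labeled + unlabeled

# grid search over the decision threshold b
bs = np.linspace(-2.0, 2.0, 81)
risks = [vicinal_ssl_risk(b) for b in bs]
b_best = float(bs[int(np.argmin(risks))])
```

The unlabeled term penalizes decision changes inside each unlabeled vicinity, which pushes the boundary into the low-density region between the two classes, near the Bayes-optimal threshold b = 0.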
The following table compares the true generalization error of VRM with Gaussian kernels and linear functions. Results are averaged over 100 runs. Two different kernel widths σ_L and σ_U were used for the kernels associated with labeled and unlabeled examples. The best kernel widths were obtained by cross-validation. We also studied the case σ_L → 0 in order to provide a result equivalent to a plain SVM. \n\n2 http://www.research.att.com/~yann/ocr/index.html \n\n | Labeled | Labeled+Unlabeled (Best σ_U) \nσ_L → 0 | 6.5% | 5.6% \nBest σ_L | 5.0% | 4.3% \n\nNote that when both σ_L and σ_U tend to zero, this algorithm reverts to a transduction algorithm due to Vapnik, which was previously solved by the more difficult optimization procedure of integer programming (Bennet and Demiriz, 1999). \n\n5 Conclusion \n\nIn conclusion, the Vicinal Risk Minimization (VRM) principle provides a useful bridge between generative models and SRM methods such as SVM or Statistical Regularization. Several well known algorithms are in fact special cases of VRM. The VRM principle also suggests new algorithms. In this paper we proposed algorithms for dealing with unlabeled data and recognizing classes with very different pattern distributions, obtaining promising initial results. We hope that this approach can lead to further understanding of existing methods and also suggest new ones. \n\nReferences \n\nBennet, K. and Demiriz, A. (1999). Semi-supervised support vector machines. In Advances in Neural Information Processing Systems 11, pages 368-374. MIT Press. \n\nBottou, L. and Vapnik, V. N. (1992). Local learning algorithms, appendix on confidence intervals. Neural Computation, 4(6):888-900. \n\nBreiman, L., Meisel, W., and Purcell, E. (1977). Variable kernel estimates of multivariate densities. Technometrics, 19:135-144. \n\nDrucker, H., Wu, D., and Vapnik, V. (1999). 
Support vector machines for spam categorization. Neural Networks, 10:1048-1054. \n\nHoerl, A. and Kennard, R. W. (1970). Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(1):55-67. \n\nJaakkola, T., Meila, M., and Jebara, T. (2000). Maximum entropy discrimination. In Advances in Neural Information Processing Systems 12. MIT Press. \n\nLeCun, Y., Bottou, L., Orr, G., and Muller, K. (1998). Efficient backprop. In Orr, G. and Muller, K., editors, Neural Networks: Tricks of the Trade. Springer. \n\nLeen, T. K. (1995). Invariance and regularization in learning. In Advances in Neural Information Processing Systems 7. MIT Press. \n\nScholkopf, B., Simard, P., Smola, A., and Vapnik, V. (1998). Prior knowledge in support vector kernels. In Advances in Neural Information Processing Systems 10. MIT Press. \n\nSchuurmans, D. and Southey, F. (2000). An adaptive regularization criterion for supervised learning. In Proceedings of the Seventeenth International Conference on Machine Learning (ICML-2000). \n\nSimard, P., Victorri, B., Le Cun, Y., and Denker, J. (1992). Tangent prop: a formalism for specifying selected invariances in adaptive networks. In Advances in Neural Information Processing Systems 4, Denver, CO. Morgan Kaufmann. \n\nTong, S. and Koller, D. (2000). Restricted Bayes optimal classifiers. In Proceedings of the 17th National Conference on Artificial Intelligence (AAAI). \n\nVapnik, V. (1999). The Nature of Statistical Learning Theory (Second Edition). Springer Verlag, New York. \n", "award": [], "sourceid": 1876, "authors": [{"given_name": "Olivier", "family_name": "Chapelle", "institution": null}, {"given_name": "Jason", "family_name": "Weston", "institution": null}, {"given_name": "L\u00e9on", "family_name": "Bottou", "institution": null}, {"given_name": "Vladimir", "family_name": "Vapnik", "institution": null}]}