{"title": "A Parallel Mixture of SVMs for Very Large Scale Problems", "book": "Advances in Neural Information Processing Systems", "page_first": 633, "page_last": 640, "abstract": null, "full_text": "A Parallel Mixture of SVMs for Very Large Scale \n\nProblems \n\nRonan Collobert* \n\nUniversite de Montreal, DIRG \nCP 6128, Succ. Centre-Ville \nMontreal, Quebec, Canada \n\ncollober\u00a9iro.umontreal.ca \n\nSamy Bengio \n\nIDIAP \n\nCP 592, rue du Simp Ion 4 \n1920 Martigny, Switzerland \n\nbengio\u00a9idiap.ch \n\nYoshua Bengio \n\nUniversite de Montreal, DIRG \nCP 6128, Succ. Centre-Ville \nMontreal, Quebec, Canada \nbengioy\u00a9iro.umontreal.ca \n\nAbstract \n\nSupport Vector Machines (SVMs) are currently the state-of-the-art models for \nmany classification problems but they suffer from the complexity of their train(cid:173)\ning algorithm which is at least quadratic with respect to the number of examples. \nHence, it is hopeless to try to solve real-life problems having more than a few \nhundreds of thousands examples with SVMs. The present paper proposes a \nnew mixture of SVMs that can be easily implemented in parallel and where \neach SVM is trained on a small subset of the whole dataset. Experiments on a \nlarge benchmark dataset (Forest) as well as a difficult speech database, yielded \nsignificant time improvement (time complexity appears empirically to locally \ngrow linearly with the number of examples) . In addition, and that is a surprise, \na significant improvement in generalization was observed on Forest. \n\n1 \n\nIntroduction \n\nRecently a lot of work has been done around Support Vector Machines [9], mainly due to \ntheir impressive generalization performances on classification problems when compared to other \nalgorithms such as artificial neural networks [3, 6]. 
However, SVMs require solving a quadratic optimization problem which needs resources that are at least quadratic in the number of training examples, and it is thus hopeless to try solving problems having millions of examples using classical SVMs.\n\nIn order to overcome this drawback, we propose in this paper to use a mixture of several SVMs, each of them trained only on a part of the dataset. The idea of an SVM mixture is not new, although previous attempts such as Kwok's paper on Support Vector Mixtures [5] did not train the SVMs on parts of the dataset but on the whole dataset, and hence could not overcome the time complexity problem for large datasets. We propose here a very simple method to train such a mixture, and we will show that in practice this method is much faster than training a single SVM, and leads to results that are at least as good as those of a single SVM. We conjecture that the training time complexity of the proposed approach with respect to the number of examples is sub-quadratic for large datasets. Moreover, this mixture can be easily parallelized, which could again significantly improve the training time.\n\n*Part of this work was done while Ronan Collobert was at IDIAP, CP 592, rue du Simplon 4, 1920 Martigny, Switzerland.\n\nThe organization of the paper goes as follows: in the next section, we briefly introduce the SVM model for classification. In section 3 we present our mixture of SVMs, followed in section 4 by some comparisons to related models. In section 5 we show some experimental results, first on a toy dataset, then on two large real-life datasets. A short conclusion then follows.\n\n2 Introduction to Support Vector Machines\n\nSupport Vector Machines (SVMs) [9] have been applied to many classification problems, generally yielding good performance compared to other algorithms. 
The decision function is of the form\n\nf(x) = sign( Σ_{i=1}^N α_i y_i K(x, x_i) + b )    (1)\n\nwhere x ∈ ℝ^d is the d-dimensional input vector of a test example, y ∈ {-1, 1} is a class label, x_i is the input vector for the ith training example, y_i is its associated class label, N is the number of training examples, K(x, x_i) is a positive definite kernel function, and α = {α_1, ..., α_N} and b are the parameters of the model. Training an SVM consists in finding α that minimizes the objective function\n\nQ(α) = - Σ_{i=1}^N α_i + (1/2) Σ_{i=1}^N Σ_{j=1}^N α_i α_j y_i y_j K(x_i, x_j)    (2)\n\nsubject to the constraints\n\nΣ_{i=1}^N α_i y_i = 0    (3)\n\nand\n\n0 ≤ α_i ≤ C  ∀i.    (4)\n\nThe kernel K(x, x_i) can have different forms, such as the Radial Basis Function (RBF):\n\nK(x_i, x_j) = exp( -||x_i - x_j||² / σ² )    (5)\n\nwith parameter σ.\n\nTherefore, to train an SVM, we need to solve a quadratic optimization problem, where the number of parameters is N. This makes the use of SVMs for large datasets difficult: computing K(x_i, x_j) for every training pair would require O(N²) computation, and solving may take up to O(N³). Note however that current state-of-the-art algorithms appear to have a training time complexity scaling much closer to O(N²) than O(N³) [2].\n\n3 A New Conditional Mixture of SVMs\n\nIn this section we introduce a new type of mixture of SVMs. The output of the mixture for an input vector x is computed as follows:\n\nf(x) = h( Σ_{m=1}^M w_m(x) s_m(x) )    (6)\n\nwhere M is the number of experts in the mixture, s_m(x) is the output of the mth expert given input x, w_m(x) is the weight for the mth expert given by a \"gater\" module also taking x as input, and h is a transfer function, which could be for example the hyperbolic tangent for classification tasks. Here each expert is an SVM, and we took a neural network for the gater in our experiments. 
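\n\nAs an illustration, the mixture output of equation (6) can be sketched in a few lines of NumPy. This is a minimal sketch, not the authors' code: the callables `experts` and `gater` and all names here are hypothetical stand-ins for trained SVM experts and a trained gater network.\n\n```python
import numpy as np

def mixture_output(x, experts, gater, h=np.tanh):
    # f(x) = h( sum_m w_m(x) * s_m(x) ), as in equation (6)
    s = np.array([s_m(x) for s_m in experts])  # expert outputs s_m(x)
    w = np.asarray(gater(x))                   # gater weights w_m(x)
    return h(w @ s)
```
\nFor example, with two constant dummy experts returning +1 and -1 and a gater returning fixed weights (0.75, 0.25), the mixture output is tanh(0.5).\n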
In the proposed model, the gater is trained to minimize the cost function\n\nC = Σ_{i=1}^N [f(x_i) - y_i]²    (7)\n\nTo train this model, we propose a very simple algorithm:\n\n1. Divide the training set into M random subsets of size near N/M.\n2. Train each expert separately over one of these subsets.\n3. Keeping the experts fixed, train the gater to minimize (7) on the whole training set.\n4. Reconstruct M subsets: for each example (x_i, y_i),\n\u2022 sort the experts in descending order according to the values w_m(x_i),\n\u2022 assign the example to the first expert in the list which has fewer than (N/M + c) examples*, in order to ensure a balance between the experts.\n5. If a termination criterion is not fulfilled (such as a given number of iterations or a validation error going up), go to step 2.\n\nNote that step 2 of this algorithm can easily be implemented in parallel, as each expert can be trained separately on a different computer. Note also that step 3 can be an approximate minimization (as usually done when training neural networks).\n\n4 Other Mixtures of SVMs\n\nThe idea of mixture models is quite old and has given rise to very popular algorithms, such as the well-known Mixture of Experts [4], where the cost function is similar to equation (7) but where the gater and the experts are trained, using gradient descent or EM, on the whole dataset (and not on subsets), and their parameters are trained simultaneously. Hence such an algorithm is quite demanding in terms of resources when the dataset is large, if training time scales like O(N^p) with p > 1.\n\nIn the more recent Support Vector Mixture model [5], the author shows how to replace the experts (typically neural networks) by SVMs and gives a learning algorithm for this model. Once again the resulting mixture is trained jointly on the whole dataset, and hence does not solve the quadratic barrier when the dataset is large. 
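\n\nTo make step 4 of the algorithm in section 3 concrete, the balanced reassignment can be sketched as follows. This is illustrative Python under assumed data layouts (gater weights given as one length-M list per example), not the authors' implementation.\n\n```python
def reassign(examples, gater_weights, M, c=1):
    # Rebuild M subsets: each example goes to its highest-weighted expert
    # (experts sorted by descending w_m(x_i)) that still holds fewer than
    # N/M + c examples, which keeps the subsets balanced.
    N = len(examples)
    cap = N / M + c
    subsets = [[] for _ in range(M)]
    for ex, weights in zip(examples, gater_weights):
        order = sorted(range(M), key=lambda m: -weights[m])
        for m in order:
            if len(subsets[m]) < cap:
                subsets[m].append(ex)
                break
    return subsets
```
\nFor instance, if the gater prefers expert 0 for all of 4 examples and M = 2, c = 1, the capacity N/M + c = 3 forces the fourth example over to expert 1.\n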
\n\nIn another divide-and-conquer approach [7], the authors propose to first divide the training set using an unsupervised algorithm to cluster the data (typically a mixture of Gaussians), then train an expert (such as an SVM) on each subset of the data corresponding to a cluster, and finally recombine the outputs of the experts. Here, the algorithm does indeed train the experts separately on small datasets, like the present algorithm, but there is no notion of a loop reassigning the examples to experts according to the prediction made by the gater of how well each expert performs on each example. Our experiments suggest that this element is essential to the success of the algorithm.\n\nFinally, the Bayesian Committee Machine [8] is a technique to partition the data into several subsets, train SVMs on the individual subsets, and then use a specific combination scheme based on the covariance of the test data to combine the predictions. This method scales linearly in the number of training data, but is in fact a transductive method: it cannot operate on a single test example. Like in the previous case, this algorithm assigns the examples randomly to the experts (although the Bayesian framework would in principle allow finding better assignments).\n\n*where c is a small positive constant. In the experiments, c = 1.\n\nRegarding our proposed mixture of SVMs, if the number of experts grows with the number of examples, and the number of outer loop iterations is a constant, then the total training time of the experts scales linearly with the number of examples. 
Indeed, given N the total number of examples, choose the number of experts M such that the ratio N/M is a constant r. Then, if k is the number of outer loop iterations, and if the training time for an SVM with r examples is O(r^β) (empirically β is slightly above 2), the total training time of the experts is O(k r^β M) = O(k r^(β-1) N), where k, r and β are constants, which gives a total training time of O(N). In particular, for β = 2 this gives O(krN). The actual total training time should however also include k times the training time of the gater, which may potentially grow more rapidly than O(N). However, this did not appear to be the case in our experiments, thus yielding apparent linear training time. Future work will focus on methods to reduce the gater training time and guarantee linear training time per outer loop iteration.\n\n5 Experiments\n\nIn this section, we present three sets of experiments comparing the new mixture of SVMs to other machine learning algorithms. Note that all the SVMs in these experiments have been trained using SVMTorch [2].\n\n5.1 A Toy Problem\n\nIn the first series of experiments, we tested the mixture on an artificial toy problem for which we generated 10,000 training examples and 10,000 test examples. The problem had two non-linearly separable classes and two input dimensions. In Figure 1 we show the decision surfaces obtained first by a linear SVM, then by a Gaussian SVM, and finally by the proposed mixture of SVMs. Moreover, in the latter, the gater was a simple linear function and there were two linear SVMs in the mixture\u2020. This artificial problem thus shows clearly that the algorithm seems to work, and is able to combine, even linearly, very simple models in order to produce a non-linear decision surface.\n\n5.2 A Large-Scale Realistic Problem: Forest\n\nFor a more realistic problem, we did a series of experiments on part of the UCI Forest dataset\u2021. 
\nWe modified the 7-class classification problem into a binary classification problem where the goal was to separate class 2 from the other 6 classes. Each example was described by 54 input features, each normalized by dividing by the maximum found on the training set. The dataset had more than 500,000 examples and this allowed us to prepare a series of experiments as follows:\n\n\u2022 We kept a separate test set of 50,000 examples to compare the best mixture of SVMs to other learning algorithms.\n\u2022 We used a validation set of 10,000 examples to select the best mixture of SVMs, varying the number of experts and the number of hidden units in the gater.\n\u2022 We trained our models on different training sets, using from 100,000 to 400,000 examples.\n\u2022 The mixtures had from 10 to 50 expert SVMs with Gaussian kernel, and the gater was an MLP with between 25 and 500 hidden units.\n\n\u2020Note that the transfer function h() was still a tanh().\n\u2021The Forest dataset is available on the UCI website at the following address: ftp://ftp.ics.uci.edu/pub/machine-learning-databases/covtype/covtype.info.\n\n(a) Linear SVM  (b) Gaussian SVM  (c) Mixture of two linear SVMs\n\nFigure 1: Comparison of the decision surfaces obtained by (a) a linear SVM, (b) a Gaussian SVM, and (c) a linear mixture of two linear SVMs, on a two-dimensional classification toy problem.\n\nNote that since the number of examples was quite large, we selected the internal training parameters, such as the σ of the Gaussian kernel of the SVMs or the learning rate of the gater, using a held-out portion of the training set. 
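\n\nFor instance, the per-feature normalization described above (dividing each feature by its maximum found on the training set) could be sketched as follows. This is a hypothetical illustration, not the authors' preprocessing code; the guard against all-zero features is our addition.\n\n```python
import numpy as np

def max_normalize(train_X, test_X):
    # Per-feature maxima are computed on the training set only and then
    # reused for the test set, so no test information leaks in.
    maxima = train_X.max(axis=0)
    maxima = np.where(maxima == 0.0, 1.0, maxima)  # leave all-zero features as-is
    return train_X / maxima, test_X / maxima
```
\n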
We compared our models to\n\n\u2022 a single MLP, where the number of hidden units was selected by cross-validation between 25 and 250 units,\n\u2022 a single SVM, where the parameter of the kernel was also selected by cross-validation,\n\u2022 a mixture of SVMs where the gater was replaced by a constant vector, assigning the same weight value to every expert.\n\nTable 1 gives the results of a first series of experiments with a fixed training set of 100,000 examples. To select among the variants of the gated SVM mixture we considered performance over the validation set as well as training time. All the SVMs used σ = 1.7. The selected model had 50 experts and a gater with 150 hidden units. A model with 500 hidden units would have given a performance of 8.1% over the test set but would have taken 621 minutes on one machine (and 388 minutes on 50 machines).\n\nModel                 Train Error (%)  Test Error (%)  Time (min, 1 cpu)  Time (min, 50 cpu)\none MLP               17.56            18.15           12                 -\none SVM               16.03            16.76           3231               -\nuniform SVM mixture   19.69            20.31           85                 2\ngated SVM mixture     5.91             9.28            237                73\n\nTable 1: Comparison of performance between an MLP (100 hidden units), a single SVM, a uniform SVM mixture where the gater always outputs the same value for each expert, and finally a mixture of SVMs as proposed in this paper.\n\nAs can be seen, the gated SVM mixture outperformed all the other models in terms of training and test error. Note that the training error of the single SVM is high because its hyper-parameters were selected to minimize error on the validation set (other values could yield much lower training error but larger test error). The mixture was also much faster, even on one machine, than the single SVM, and since the mixture can easily be parallelized (each expert can be trained separately), we also report the time it took to train on 50 machines. 
\nIn a first attempt to understand these results, one can at least say that the power of the model does not lie only in the MLP gater, since a single MLP was pretty bad; it is not only because we used SVMs either, since a single SVM was not as good as the gated mixture; and it is not only because we divided the problem into many sub-problems, since the uniform mixture also performed badly. It seems to be a combination of all these elements.\n\nWe also did a series of experiments in order to see the influence of the number of hidden units of the gater as well as the number of experts in the mixture. Figure 2 shows the validation error of different mixtures of SVMs, where the number of hidden units varied from 25 to 500 and the number of experts varied from 10 to 50. There is a clear performance improvement when the number of hidden units is increased, while the improvement with additional experts exists but is not as strong. Note however that the training time also increases rapidly with the number of hidden units, while it slightly decreases with the number of experts if one uses one computer per expert.\n\nValidation error as a function of the number of hidden units of the gater and the number of experts\n\nFigure 2: Comparison of the validation error of different mixtures of SVMs with various numbers of hidden units and experts.\n\nIn order to find how the algorithm scaled with respect to the number of examples, we then compared the same mixture of experts (50 experts, 150 hidden units in the gater) on different training set sizes. Figure 3 shows the training time of the mixture of SVMs trained on training sets of sizes from 100,000 to 400,000. 
It seems that, at least in this range and for this particular dataset, the mixture of SVMs scales linearly with respect to the number of examples, and not quadratically as a classical SVM. It is interesting to see, for instance, that the mixture of SVMs was able to solve a problem of 400,000 examples in less than 7 hours (on 50 computers), while it would have taken more than one month to solve the same problem with a single SVM.\n\nFinally, Figure 4 shows the evolution of the training and validation errors of a mixture of 50 SVMs gated by an MLP with 150 hidden units, during 5 iterations of the algorithm. This should convince the reader that the loop of the algorithm is essential in order to obtain good performance. It is also clear that the empirical convergence of the outer loop is extremely rapid.\n\nFigure 3: Comparison of the training time of the same mixture of SVMs (50 experts, 150 hidden units in the gater) trained on different training set sizes, from 100,000 to 400,000.\n\nFigure 4: Comparison of the training and validation errors of the mixture of SVMs as a function of the number of training iterations.\n\n5.3 Verification on Another Large-Scale Problem\n\nIn order to verify that the results obtained on Forest were replicable on other large-scale problems, we tested the SVM mixture on a speech task. We used the Numbers95 dataset [1] and turned it into a binary classification problem where the task was to separate silence frames from non-silence frames. 
The total number of frames was around 540,000. The training set contained 100,000 randomly chosen frames out of the first 400,000 frames. The disjoint validation set contained 10,000 randomly chosen frames, also out of the first 400,000 frames. Finally, the test set contained 50,000 randomly chosen frames out of the last 140,000 frames. Note that the validation set was used here to select the number of experts in the mixture, the number of hidden units in the gater, and σ. Each frame was parameterized using standard methods used in speech recognition (j-rasta coefficients, with first and second temporal derivatives) and was thus described by 45 coefficients, but we used in fact an input window of three frames, yielding 135 input features per example.\n\nTable 2 shows a comparison between a single SVM and a mixture of SVMs on this dataset. The number of experts in the mixture was set to 50, the number of hidden units of the gater was set to 50, and the σ of the SVMs was set to 3.0. As can be seen, the mixture of SVMs was again many times faster than the single SVM (even on 1 cpu only) but yielded similar generalization performance.\n\nModel              Train Error (%)  Test Error (%)  Time (min, 1 cpu)  Time (min, 50 cpu)\none SVM            0.98             7.57            6787               -\ngated SVM mixture  4.41             7.32            851                65\n\nTable 2: Comparison of performance between a single SVM and a mixture of SVMs on the speech dataset.\n\n6 Conclusion\n\nIn this paper we have presented a new algorithm to train a mixture of SVMs that gave very good results compared to classical SVMs, either in terms of training time or generalization performance, on two large-scale difficult databases. Moreover, the algorithm appears to scale linearly with the number of examples, at least between 100,000 and 400,000 examples. 
\n\n\f.1 uebe lebUILb dle eXLleIuelY e UCUUli:t!!,l1l!!, dllU bu!!,!!,ebL LUi:tL Lue plupUbeu lueLuuu CUUIU dllUW \ntraining SVM-like models for very large multi-million data sets in a reasonable time. If training \nof the neural network gater with stochastic gradient takes time that grows much less than \nquadratically, as we conjecture it to be the case for very large data sets (to reach a \"good enough\" \nsolution), then the whole method is clearly sub-quadratic in training time with respect to the \nnumber of training examples. Future work will address several questions: how to guarantee \nlinear training time for the gater as well as for the experts? can better results be obtained by \ntuning the hyper-parameters of each expert separately? Does the approach work well for other \ntypes of experts? \n\nAcknowledgments \n\nRC would like to thank the Swiss NSF for financial support (project FN2100-061234.00). YB \nwould like to thank the NSERC funding agency and NCM2 network for support. \n\nReferences \n\n[1] RA. Cole, M. Noel, T. Lander, and T. Durham. New telephone speech corpora at CSLU. \nProceedings of the European Conference on Speech Communication and Technology, EU(cid:173)\nROSPEECH, 1:821- 824, 1995. \n\n[2] R Collobert and S. Bengio. SVMTorch: Support vector machines for large-scale regression \n\nproblems. Journal of Machine Learning Research, 1:143- 160, 200l. \n\n[3] C. Cortes and V. Vapnik. Support vector networks. Machine Learning, 20:273- 297, 1995. \n[4] Robert A. Jacobs, Michael I. Jordan, Steven J. Nowlan, and Geoffrey E. Hinton. Adaptive \n\nmixtures of local experts. Neural Computation, 3(1):79- 87, 1991. \n\n[5] J. T. Kwok. Support vector mixture for classification and regression problems. In Proceedings \nof the International Conference on Pattern Recognition (ICPR), pages 255-258, Brisbane, \nQueensland, Australia, 1998. \n\n[6] E. Osuna, R Freund, and F. Girosi. Training support vector machines: an application to \nface detection. 
In IEEE Conference on Computer Vision and Pattern Recognition, pages 130-136, San Juan, Puerto Rico, 1997.\n\n[7] A. Rida, A. Labbi, and C. Pellegrini. Local experts combination through density decomposition. In International Workshop on AI and Statistics (Uncertainty'99). Morgan Kaufmann, 1999.\n\n[8] V. Tresp. A Bayesian committee machine. Neural Computation, 12:2719-2741, 2000.\n\n[9] V. N. Vapnik. The nature of statistical learning theory. Springer, second edition, 1995.\n", "award": [], "sourceid": 1949, "authors": [{"given_name": "Ronan", "family_name": "Collobert", "institution": null}, {"given_name": "Samy", "family_name": "Bengio", "institution": null}, {"given_name": "Yoshua", "family_name": "Bengio", "institution": null}]}