{"title": "Hoeffding Races: Accelerating Model Selection Search for Classification and Function Approximation", "book": "Advances in Neural Information Processing Systems", "page_first": 59, "page_last": 66, "abstract": null, "full_text": "Hoeffding Races: Accelerating Model Selection Search for Classification and Function Approximation \n\nOded Maron \nArtificial Intelligence Laboratory \nMassachusetts Institute of Technology \nCambridge, MA 02139 \n\nAndrew W. Moore \nRobotics Institute \nSchool of Computer Science \nCarnegie Mellon University \nPittsburgh, PA 15213 \n\nAbstract \n\nSelecting a good model of a set of input points by cross validation is a computationally intensive process, especially if the number of possible models or the number of training points is high. Techniques such as gradient descent are helpful in searching through the space of models, but problems such as local minima, and more importantly, the lack of a distance metric between various models, reduce the applicability of these search methods. Hoeffding Races is a technique for finding a good model for the data by quickly discarding bad models and concentrating the computational effort on differentiating between the better ones. This paper focuses on the special case of leave-one-out cross validation applied to memory-based learning algorithms, but we also argue that it is applicable to any class of model selection problems. \n\n1 Introduction \n\nModel selection addresses \"high level\" decisions about how best to tune learning algorithm architectures for particular tasks. Such decisions include which function approximator to use, how to trade smoothness for goodness of fit, and which features are relevant. 
The problem of automatically selecting a good model has been variously described as fitting a curve, learning a function, or trying to predict future instances of the problem. \n\n[Figure 1 plot: cross-validation error (roughly 0.12 to 0.22) vs. k nearest neighbors used (1 to 9).] \n\nFigure 1: A space of models consisting of local-weighted-regression models with different numbers of nearest neighbors used. The global minimum is at one-nearest-neighbor, but a gradient descent algorithm would get stuck in local minima unless it happened to start in a model where k < 4. \n\nOne can think of this as a search through the space of possible models with some criterion of \"goodness\" such as prediction accuracy, complexity of the model, or smoothness. In this paper, this criterion will be prediction accuracy. Let us examine two common ways of measuring accuracy: using a test set and leave-one-out cross validation (Wahba and Wold, 1975). \n\n\u2022 The test set method arbitrarily divides the data into a training set and a test set. The learner is trained on the training set, and is then queried with just the input vectors of the test set. The error for a particular point is the difference between the learner's prediction and the actual output vector. \n\n\u2022 Leave-one-out cross validation trains the learner N times (where N is the number of points), each time omitting a different point. We attempt to predict each omitted point. The error for a particular point is the difference between the learner's prediction and the actual output vector. \n\nThe total error of either method is computed by averaging all the error instances. \n\nThe obvious method of searching through a space of models, the brute force approach, finds the accuracy of every model and picks the best one. 
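As a concrete sketch of brute-force selection with leave-one-out cross validation (a minimal illustration, not the paper's code: the `predict` interface and the simple k-nearest-neighbor averagers are assumptions of the example):

```python
def loocv_error(predict, xs, ys):
    """Leave-one-out error: omit each point in turn and try to predict it."""
    total = 0.0
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        total += abs(predict(train_x, train_y, xs[i]) - ys[i])
    return total / len(xs)

def brute_force_select(models, xs, ys):
    """Score every model on every point and keep the best one."""
    return min(models, key=lambda m: loocv_error(m, xs, ys))

def knn(k):
    """A k-nearest-neighbor averager over 1-d inputs (illustrative model)."""
    def predict(train_x, train_y, q):
        nearest = sorted(range(len(train_x)),
                         key=lambda j: abs(train_x[j] - q))[:k]
        return sum(train_y[j] for j in nearest) / k
    return predict

xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]   # linear data: small k should win
best = brute_force_select([knn(1), knn(3), knn(5)], xs, ys)
```

Every model is scored at every point, so the cost is on the order of the number of models times the number of points; this is exactly the expense the racing algorithm of Section 2 attacks.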
The time to find the accuracy (error rate) of a particular model is proportional to the size of the test set |TEST|, or the size of the training set in the case of cross validation. Suppose that the model space is discretized into a finite number of models |MODELS|; then the amount of work required is O(|MODELS| x |TEST|), which is expensive. \n\nA popular way of dealing with this problem is gradient descent. This method can be applied to find the parameters (or weights) of a model. However, it cannot be used to find the structure (or architecture) of the model. There are two reasons for this. First, we have empirically noted many occasions on which the search space is peppered with local minima (Figure 1). Second, at the highest level we are selecting from a set of entirely distinct models, with no numeric parameters over which to hill-climb. For example, is a neural net with 100 hidden units closer to a neural net with 50 hidden units or to a memory-based model which uses 3 nearest neighbors? There is no viable answer to this question since we cannot impose a sensible metric on this model space. \n\nThe algorithm we describe in this paper, Hoeffding Races, combines the robustness of brute force and the computational feasibility of hill climbing. We instantiated the algorithm by specifying the set of models to be memory-based algorithms (Stanfill and Waltz, 1986) (Atkeson and Reinkensmeyer, 1989) (Moore, 1992) and the method of finding the error to be leave-one-out cross validation. We will discuss how to extend the algorithm to any set of models and to the test set method in the full paper. We chose memory-based algorithms since they go hand in hand with cross validation. Training is very cheap - simply keep all the points in memory, and all the algorithms of the various models can use the same memory. 
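A minimal sketch of that shared-memory arrangement (the class, the Gaussian smoothing kernel, and all names here are illustrative assumptions, not the paper's implementation): training just stores the points once, every model reads from the same store, and a leave-one-out prediction only needs to skip one stored index.

```python
import math

class SharedMemory:
    """Training a memory-based model is just storing the points once."""
    def __init__(self, xs, ys):
        self.xs, self.ys = list(xs), list(ys)   # one copy serves all models

def kernel_model(bandwidth):
    """A locally constant model: kernel-weighted average of stored outputs."""
    def predict(mem, q, omit=None):
        num = den = 0.0
        for i, (x, y) in enumerate(zip(mem.xs, mem.ys)):
            if i == omit:            # skip ("cover up") one point for leave-one-out
                continue
            w = math.exp(-((x - q) / bandwidth) ** 2)
            num += w * y
            den += w
        return num / den
    return predict

# Many models, one memory: only the bandwidth differs between these two.
mem = SharedMemory([0.0, 1.0, 2.0, 3.0], [0.0, 1.0, 2.0, 3.0])
wide, narrow = kernel_model(5.0), kernel_model(0.3)
loo_pred = narrow(mem, mem.xs[1], omit=1)  # leave-one-out prediction at point 1
```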
Finding the leave-one-out cross validation error at a point is as cheap as making a prediction: simply \"cover up\" that point in memory, then predict its value using the current model. For a discussion of how to generate various memory-based models, see (Moore et al., 1992). \n\n2 Hoeffding Races \n\nThe algorithm was inspired by ideas from (Haussler, 1992) and (Kaelbling, 1990), and a similar idea appears in (Greiner and Jurisica, 1992). It derives its name from Hoeffding's formula (Hoeffding, 1963), which concerns our confidence in the sample mean of n independently drawn points x_1, ..., x_n. The probability of the estimated mean E_est = (1/n) \u03a3 x_i being more than \u03b5 away from the true mean E_true is bounded by \n\nPr{|E_est - E_true| > \u03b5} < 2 exp(-2n\u03b5\u00b2/B\u00b2) \n\nwhere B bounds the possible range of the x_i. Requiring this probability to be at most \u03b4 gives Pr{|E_est - E_true| > \u03b5} < \u03b4. Combining the two equations and solving for \u03b5 gives us a bound on how close the estimated mean is to the true mean after n points with confidence 1 - \u03b4: \n\n\u03b5 = sqrt(B\u00b2 log(2/\u03b4) / (2n)) \n\nThe algorithm starts with a collection of learning boxes. We call each model a learning box since we are treating the models as if they were black boxes. We are not looking at how complex or time-consuming each prediction is, just at the input and output of the box. Associated with each learning box are two pieces of information: a current estimate of its error rate and the number of points it has been tested upon so far. The algorithm also starts with a test set of size N. For leave-one-out cross validation, the test set is simply the training set. \n\n[Figure 2 plot: estimated error with upper and lower confidence bounds for learning boxes #0 through #6.] \n\nFigure 2: An example where the best upper bound of learning box #2 eliminates learning boxes #1 and #5. The size of \u03b5 varies since each learning box has its own upper bound on its error range, B. 
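The confidence interval from Hoeffding's formula and the elimination test illustrated in Figure 2 can be sketched as follows (a hypothetical example: the error estimates and the common error range B = 1 are invented for illustration):

```python
import math

def hoeffding_eps(B, n, delta):
    """Half-width of the confidence interval after n test points."""
    return math.sqrt(B * B * math.log(2.0 / delta) / (2.0 * n))

def surviving(est_errors, Bs, n, delta):
    """Keep boxes whose lower bound does not exceed the best upper bound."""
    uppers = [e + hoeffding_eps(B, n, delta) for e, B in zip(est_errors, Bs)]
    lowers = [e - hoeffding_eps(B, n, delta) for e, B in zip(est_errors, Bs)]
    best_upper = min(uppers)
    return [i for i, lo in enumerate(lowers) if lo <= best_upper]

# Seven learning boxes after n = 200 points, in the spirit of Figure 2:
est = [0.30, 0.55, 0.20, 0.35, 0.33, 0.60, 0.38]
keep = surviving(est, [1.0] * 7, n=200, delta=0.01)
```

With these invented numbers, boxes #1 and #5 are eliminated by box #2's upper bound, and the interval half-width shrinks as 1/sqrt(n), so the surviving set keeps narrowing as more test points are seen.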
\n\nAt each step of the algorithm, we randomly select a point from the test set. We compute the error at that point for all learning boxes, and update each learning box's estimate of its own total error rate. In addition, we use Hoeffding's bound to calculate how close the current estimate is to the true error for each learning box. We then eliminate those learning boxes whose best possible error (their lower bound) is still greater than the worst error of the best learning box (its upper bound); see Figure 2. The intervals get smaller as more points are tested, thereby \"racing\" the good learning boxes and eliminating the bad ones. \n\nWe repeat the algorithm until we are left with just one learning box, or until we run out of points. The algorithm can also be stopped once \u03b5 has reached a certain threshold. The algorithm returns a set of learning boxes whose error rates are insignificantly (to within \u03b5) different after N test points. \n\n3 Proof of Correctness \n\nThe careful reader will have noticed that the confidence \u03b4 given in the previous section is incorrect. In order to prove that the algorithm indeed returns a set of learning boxes which includes the best one, we'll need a more rigorous approach. We denote by \u0394 the probability that the algorithm eliminates what would have been the best learning box. The difference between \u0394 and \u03b4, which was glossed over in the previous section, is that 1 - \u0394 is the confidence in the success of the entire algorithm, while 1 - \u03b4 is the confidence in Hoeffding's bound for one learning box during one iteration of the algorithm. \n\nWe would like to make a formal connection between \u0394 and \u03b4. In order to do that, let us make the requirement of a correct algorithm more stringent. We'll say that the algorithm is correct if every learning box is within \u03b5 of its true error at every iteration of the algorithm. 
This requirement encompasses the weaker requirement that we don't eliminate the best learning box. An algorithm is correct with confidence \u0394 if Pr{all learning boxes are within \u03b5 on all iterations} \u2265 1 - \u0394. \n\nWe'll now derive the relationship between \u03b4 and \u0394 by using the disjunctive probability inequality, which states that Pr{A \u2228 B} \u2264 Pr{A} + Pr{B}. Let's assume that we have n iterations (we have n points in our test set), and that we have m learning boxes (LB1, ..., LBm). By Hoeffding's inequality, we know that \n\nPr{a particular LB is within \u03b5 on a particular iteration} \u2265 1 - \u03b4 \n\nFlipping that around we get: \n\nPr{a particular LB is wrong on a particular iteration} < \u03b4 \n\nUsing the disjunctive inequality we can say \n\nPr{a particular LB is wrong on iteration 1 \u2228 a particular LB is wrong on iteration 2 \u2228 ... \u2228 a particular LB is wrong on iteration n} \u2264 \u03b4\u00b7n \n\nLet's rewrite this as: \n\nPr{a particular LB is wrong on any iteration} \u2264 \u03b4\u00b7n \n\nNow we do the same thing for all learning boxes: \n\nPr{LB1 is wrong on any iteration \u2228 LB2 is wrong on any iteration \u2228 ... \u2228 LBm is wrong on any iteration} \u2264 \u03b4\u00b7n\u00b7m \n\nor in other words: \n\nPr{some LB is wrong on some iteration} \u2264 \u03b4\u00b7n\u00b7m \n\nWe flip this to get: \n\nPr{all LBs are within \u03b5 on all iterations} \u2265 1 - \u03b4\u00b7n\u00b7m \n\nwhich is exactly what we meant by a correct algorithm with some confidence. Therefore, \u03b4 = \u0394/(n\u00b7m). When we plug this into our expression for \u03b5 from the previous section, we find that we have only increased it by a constant factor. In other words, by pumping up \u03b5, we have managed to ensure the correctness of this algorithm with confidence \u0394. The new \u03b5 is expressed as: \n\n\u03b5 = sqrt(B\u00b2 (log(2nm) - log(\u0394)) / (2n)) \n\nClearly this is an extremely pessimistic bound and tighter proofs are possible (Omohundro, 1993). \n\n4 Results \n\nWe ran Hoeffding Races on a wide variety of learning and prediction problems. Table 1 describes the problems, and Table 2 summarizes the results and compares them to brute force search. \n\nTable 1: Test problems \n\nROBOT: 10 input attributes, 5 outputs. Given an initial and a final description of a robot arm, learn the control needed in order to make the robot perform devil-sticking (Schaal and Atkeson, 1993). \nPROTEIN: 3 inputs, output is a classification into one of three classes. This is the famous protein secondary structure database, with some preprocessing (Zhang et al., 1992). \nENERGY: Given solar radiation sensing, predict the cooling load for a building. This is taken from the Building Energy Predictor Shootout. \nPOWER: Market data for electricity generation pricing period class for the new United Kingdom Power Market. \nPOOL: The visually perceived mapping from pool table configurations to shot outcome for two-ball collisions (Moore, 1992). \nDISCONT: An artificially constructed set of points with many discontinuities. Local models should outperform global ones. \n\nFor Table 2, all of the experiments were run using \u0394 = .01. The initial set of possible models was constructed from various memory-based algorithms: combinations of different numbers of nearest neighbors, different smoothing kernels, and locally constant vs. locally weighted regression. We compare the algorithms by the number of queries made, where a query is one learning box finding its error at one point. The brute force method makes |TEST| x |LEARNING BOXES| queries. Hoeffding Races eliminates bad learning boxes quickly, so it should make fewer queries. \n\n5 Discussion \n\nHoeffding Races never does worse than brute force. It is least effective when all models perform equally well. 
For example, in the POOL problem, where there were 75 learning boxes left at the end of the race, the number of queries is only slightly smaller for Hoeffding Races than for brute force. In the ROBOT problem, where there were only 6 learning boxes left, a significant reduction in the number of queries can be seen. Therefore, Hoeffding Races is most effective when there exists a subset of clear winners within the initial set of models. We can then search over a very broad set of models without much concern about the computational expense of a large initial set. Figure 3 demonstrates this. In all the cases we have tested, the learning box chosen by brute force is also contained in the set returned by Hoeffding Races. Therefore, there is no loss of performance accuracy. \n\nTable 2: Results of Brute Force vs. Hoeffding Races. \n\nProblem | points | initial # learning boxes | queries with Brute Force | queries with Hoeffding Races | learning boxes left \nROBOT | 972 | 95 | 92340 | 15637 | 6 \nPROTEIN | 4965 | 95 | 471675 | 349405 | 60 \nENERGY | 2444 | 189 | 461916 | 121400 | 40 \nPOWER | 210 | 95 | 19950 | 13119 | 48 \nPOOL | 259 | 95 | 24605 | 22095 | 75 \nDISCONT | 500 | 95 | 47500 | 25144 | 29 \n\n[Figure 3 plot: number of queries vs. size of the initial set of learning boxes.] \n\nFigure 3: The x-axis is the size of a set of initial learning boxes (chosen randomly) and the y-axis is the number of queries to find a good model for the ROBOT problem. The bottom line shows performance by the Hoeffding Races algorithm, and the top line by brute force. \n\nThe results described here show the performance improvement with relatively small problems. Preliminary results indicate that performance improvements will increase as the problems scale up. 
In other words, as the number of test points and the number of learning boxes increase, the ratio of the number of queries made by brute force to the number of queries made by Hoeffding Races becomes larger. However, the cost of each query then becomes the main computational expense. \n\nAcknowledgements \n\nThanks go to Chris Atkeson, Marina Meila, Greg Galperin, Holly Yanco, and Stephen Omohundro for helpful and stimulating discussions. \n\nReferences \n\n[Atkeson and Reinkensmeyer, 1989] C. G. Atkeson and D. J. Reinkensmeyer. Using associative content-addressable memories to control robots. In W. T. Miller, R. S. Sutton, and P. J. Werbos, editors, Neural Networks for Control. MIT Press, 1989. \n\n[Greiner and Jurisica, 1992] R. Greiner and I. Jurisica. A statistical approach to solving the EBL utility problem. In Proceedings of the Tenth National Conference on Artificial Intelligence (AAAI-92). MIT Press, 1992. \n\n[Haussler, 1992] D. Haussler. Decision theoretic generalizations of the PAC model for neural net and other learning applications. Information and Computation, 100:78-150, 1992. \n\n[Hoeffding, 1963] Wassily Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58:13-30, 1963. \n\n[Kaelbling, 1990] L. P. Kaelbling. Learning in Embedded Systems. PhD Thesis; Technical Report No. TR-90-04, Stanford University, Department of Computer Science, June 1990. \n\n[Moore et al., 1992] A. W. Moore, D. J. Hill, and M. P. Johnson. An empirical investigation of brute force to choose features, smoothers and function approximators. In S. Hanson, S. Judd, and T. Petsche, editors, Computational Learning Theory and Natural Learning Systems, Volume 9. MIT Press, 1992. \n\n[Moore, 1992] A. W. Moore. Fast, robust adaptive control by learning only forward models. In J. E. Moody, S. J. Hanson, and R. P. 
Lippmann, editors, Advances in Neural Information Processing Systems 4. Morgan Kaufmann, April 1992. \n\n[Omohundro, 1993] Stephen Omohundro. Private communication, 1993. \n\n[Pollard, 1984] David Pollard. Convergence of Stochastic Processes. Springer-Verlag, 1984. \n\n[Schaal and Atkeson, 1993] S. Schaal and C. G. Atkeson. Open loop stable control strategies for robot juggling. In Proceedings of the IEEE Conference on Robotics and Automation, May 1993. \n\n[Stanfill and Waltz, 1986] C. Stanfill and D. Waltz. Towards memory-based reasoning. Communications of the ACM, 29(12):1213-1228, December 1986. \n\n[Wahba and Wold, 1975] G. Wahba and S. Wold. A completely automatic French curve: Fitting spline functions by cross-validation. Communications in Statistics, 4(1), 1975. \n\n[Zhang et al., 1992] X. Zhang, J. P. Mesirov, and D. L. Waltz. Hybrid system for protein secondary structure prediction. Journal of Molecular Biology, 225:1049-1063, 1992. \n", "award": [], "sourceid": 841, "authors": [{"given_name": "Oded", "family_name": "Maron", "institution": null}, {"given_name": "Andrew", "family_name": "Moore", "institution": null}]}