{"title": "Optimization on a Budget: A Reinforcement Learning Approach", "book": "Advances in Neural Information Processing Systems", "page_first": 1385, "page_last": 1392, "abstract": "Many popular optimization algorithms, like the Levenberg-Marquardt algorithm (LMA), use heuristic-based \u201ccontrollers\u201d that modulate the behavior of the optimizer during the optimization process. For example, in the LMA a damping parameter is dynamically modified based on a set of rules that were developed using various heuristic arguments. Reinforcement learning (RL) is a machine learning approach to learn optimal controllers from examples and thus is an obvious candidate to improve the heuristic-based controllers implicit in the most popular and heavily used optimization algorithms. Improving the performance of off-the-shelf optimizers is particularly important for time-constrained optimization problems. For example, the LMA has become popular for many real-time computer vision problems, including object tracking from video, where only a small amount of time can be allocated to the optimizer on each incoming video frame. Here we show that a popular modern reinforcement learning technique using a very simple state space can dramatically improve the performance of general-purpose optimizers, like the LMA. Surprisingly, the controllers learned for a particular domain also work well in very different optimization domains. For example, we used RL methods to train a new controller for the damping parameter of the LMA. This controller was trained on a collection of classic, relatively small, non-linear regression problems. The modified LMA performed better than the standard LMA on these problems. It also dramatically outperformed the standard LMA on a difficult large-scale computer vision problem for which it had not been trained. 
Thus the controller appeared to have extracted control rules that were not just domain specific but generalized across a wide range of optimization domains.", "full_text": "Optimization on a Budget: A Reinforcement Learning Approach\n\nPaul Ruvolo\nDepartment of Computer Science\nUniversity of California San Diego\nLa Jolla, CA 92093\npruvolo@cs.ucsd.edu\n\nIan Fasel\nDepartment of Computer Sciences\nUniversity of Texas at Austin\nianfasel@cs.utexas.edu\n\nJavier Movellan\nMachine Perception Laboratory\nUniversity of California San Diego\nmovellan@mplab.ucsd.edu\n\nAbstract\n\nMany popular optimization algorithms, like the Levenberg-Marquardt algorithm (LMA), use heuristic-based \u201ccontrollers\u201d that modulate the behavior of the optimizer during the optimization process. For example, in the LMA a damping parameter \u03bb is dynamically modified based on a set of rules that were developed using heuristic arguments. Reinforcement learning (RL) is a machine learning approach to learn optimal controllers from examples and thus is an obvious candidate to improve the heuristic-based controllers implicit in the most popular and heavily used optimization algorithms.\nImproving the performance of off-the-shelf optimizers is particularly important for time-constrained optimization problems. For example, the LMA algorithm has become popular for many real-time computer vision problems, including object tracking from video, where only a small amount of time can be allocated to the optimizer on each incoming video frame.\nHere we show that a popular modern reinforcement learning technique using a very simple state space can dramatically improve the performance of general-purpose optimizers, like the LMA. Surprisingly, the controllers learned for a particular domain also work well in very different optimization domains. 
For example, we used RL methods to train a new controller for the damping parameter of the LMA. This controller was trained on a collection of classic, relatively small, non-linear regression problems. The modified LMA performed better than the standard LMA on these problems. This controller also dramatically outperformed the standard LMA on a difficult computer vision problem for which it had not been trained. Thus the controller appeared to have extracted control rules that were not just domain specific but generalized across a range of optimization domains.\n\n1 Introduction\n\nMost popular optimization algorithms, like the Levenberg-Marquardt algorithm (LMA), use simple \u201ccontrollers\u201d that modulate the behavior of the optimization algorithm based on the state of the optimization process. For example, in the LMA a damping factor \u03bb modifies the descent step to behave more like gradient descent or more like the Gauss-Newton optimization algorithm [1, 2].\n\nThe LMA uses the following heuristic for controlling \u03bb: if an iteration of the LMA with the current damping factor \u03bbt reduces the error, then the new parameters produced by the LMA iteration are accepted and the damping factor is divided by a constant term \u03b7 > 0, i.e., \u03bbt+1 = \u03bbt/\u03b7. Otherwise, if the error is not reduced, the new parameters are not accepted, the damping factor is multiplied by \u03b7, and the LMA iteration is repeated with the new damping parameter. While various heuristic arguments have been used to justify this particular way of controlling the damping factor, it is not clear whether this \u201ccontroller\u201d is optimal in any way or whether it can be significantly improved.\nImproving the performance of off-the-shelf optimizers is particularly important for time-constrained optimization problems. 
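The damping-factor heuristic just described is small enough to state in code. The following Python sketch is illustrative only; the function name and the default \u03b7 = 10 are our choices, not part of any particular LMA package:

```python
def update_damping(loss_new, loss_old, lam, eta=10.0):
    """Standard LMA damping heuristic: if the last step did not reduce
    the error, reject it and multiply the damping factor by eta;
    otherwise accept the step and divide the damping factor by eta.
    The constant eta = 10 is illustrative."""
    if loss_new >= loss_old:
        return lam * eta, False  # reject step; behave more like gradient descent
    return lam / eta, True       # accept step; behave more like Gauss-Newton
```

A rejected step is retried from the previous iterate with the larger damping factor, exactly as in the rule above.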
For example, the LMA algorithm has become popular for many real-time computer vision problems, including object tracking from video, where only a small amount of time can be allocated to the optimizer on each incoming video frame. Time-constrained optimization is in fact becoming an increasingly important problem in applications such as operations research, robotics, and machine perception. In these problems the focus is on achieving the best possible solution in a fixed amount of time. Given the special properties of time-constrained optimization problems, it is likely that the heuristic-based controllers used in off-the-shelf optimizers may not be particularly efficient. Additionally, standard techniques for non-linear optimization like the LMA do not address issues such as when to stop a fruitless local search or when to revisit a previously visited part of the parameter space.\nReinforcement learning (RL) is a machine learning approach to learn optimal controllers from examples and thus is an obvious candidate to improve the heuristic-based controllers used in the most popular and heavily used optimization algorithms. An advantage of RL methods over other approaches to optimal control is that they do not require prior knowledge of the underlying system dynamics, and the system designer is free to choose reward metrics that best match the desiderata for controller performance. For example, in the case of optimization under time constraints a suitable reward could be to achieve the minimum loss within a fixed amount of time.\n\n2 Related Work\n\nThe idea of using RL in optimization problems is not new [3, 4, 5, 6, 7]. However, previous approaches have focused on using RL methods to develop problem-specific optimizers for NP-complete problems. Here our focus is on using RL methods to modify the controllers implicit in the most popular and heavily used optimization algorithms. 
In particular our goal is to make these algorithms more efficient for optimization under a time budget. As we will soon show, a simple RL approach can result in dramatic improvements in the performance of these popular optimization packages.\nThere has also been some work on empirical evaluations of the LMA versus other nonlinear optimization methods in the computer vision community. In [8], the LMA and Powell\u2019s dog-leg method are compared on the problem of bundle adjustment. The approach outlined in this document could in principle learn to combine these two methods to perform efficient optimization.\n\n3 The Levenberg-Marquardt Algorithm\n\nConsider the problem of optimizing a loss function f : Rn \u2192 R over the space Rn. There are many approaches to this problem, including zeroth-order methods (such as the Metropolis-Hastings algorithm), first-order approaches such as gradient descent and the Gauss-Newton method, and second-order approaches such as the Newton-Raphson algorithm.\nEach of these algorithms has advantages and disadvantages. For example, on each iteration of gradient descent, parameters are changed in the opposite direction of the gradient of the loss function, i.e.,\n\nxk+1 = xk \u2212 \u03b7 \u2207x f(xk)    (1)\n\nGradient descent has convergence guarantees provided the value of \u03b7 is reduced over the course of the optimization; it is in general robust, but quite slow.\nThe Gauss-Newton method is a technique for minimizing sums of squares of non-linear functions. Let g be a function from Rn \u2192 Rm with a corresponding loss function L(x) = g(x)\u22a4g(x). 
The algorithm works by first linearizing the function g using its first-order Taylor expansion. The sum-of-squares loss function, L, then becomes a quadratic function that can be analytically minimized. Let H = J(xk)\u22a4J(xk) and d = J(xk)\u22a4g(xk), where J is the Jacobian of g with respect to x. Each iteration of the Gauss-Newton method is of the following form:\n\nxk+1 = xk \u2212 H\u22121d    (2)\n\nThe Gauss-Newton method has a much faster convergence rate than gradient descent; however, it is not as robust, and it can perform very poorly when the linear approximation to g is not accurate.\nLevenberg-Marquardt [1] is a popular optimization algorithm that attempts to blend gradient descent and Gauss-Newton in order to obtain both the fast convergence rate of Gauss-Newton and the convergence guarantees of gradient descent. The algorithm has the following update rule:\n\nxk+1 = xk \u2212 (H + \u03bb diag(H))\u22121d    (3)\n\nThis update rule is also known as damped Gauss-Newton because the \u03bb parameter serves to dampen the Gauss-Newton step by blending it with the gradient descent step. Marquardt proposed a heuristic-based control law to dynamically modify \u03bb during the optimization process (see Figure 1). This control has become part of most LMA packages.\n\nif f(xk)\u22a4f(xk) > f(xk\u22121)\u22a4f(xk\u22121) then\n    xk \u2190 xk\u22121\n    \u03bb \u2190 \u03b7 \u00d7 \u03bb\nelse\n    \u03bb \u2190 (1/\u03b7) \u00d7 \u03bb\nend if\n\nFigure 1: A heuristic algorithm for updating \u03bb during Levenberg-Marquardt non-linear least-squares optimization.\n\nThe LMA has recently become a very popular approach to solve real-time problems in computer vision [9, 10, 11], such as object tracking and feature tracking in video. 
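A single damped step of Eq. (3) is easy to sketch in Python. Here `g` returns the residual vector and `J` its Jacobian; both are supplied by the caller, and the names are our own, not a specific package's API:

```python
import numpy as np

def lm_step(x, g, J, lam):
    """One Levenberg-Marquardt step: x' = x - (H + lam*diag(H))^-1 d,
    where H = J(x)^T J(x) and d = J(x)^T g(x), as in Eqs. (2)-(3)."""
    Jx = J(x)
    H = Jx.T @ Jx                      # Gauss-Newton approximation to the Hessian
    d = Jx.T @ g(x)
    A = H + lam * np.diag(np.diag(H))  # damping blends in a scaled gradient step
    return x - np.linalg.solve(A, d)
```

With lam = 0 this reduces to the pure Gauss-Newton update of Eq. (2); as lam grows the step shrinks toward a scaled gradient descent step.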
Due to the special\nnature of this problem it is unclear whether the heuristic-based controller embedded in the algorithm\nis optimal or could be signi\ufb01cantly improved upon.\nIn the remainder of this document we explore whether reinforcement learning methods can help\nimprove the performance of LMA by developing an empirically learned controller of the damping\nfactor rather than the commonly used heuristic controller.\n\n4 Learning Control Policies for Optimization Algorithms\n\nAn optimizer is an algorithm that uses some statistics about the current progress of the optimization\nin order to produce a next iterate to evaluate. It is natural to frame optimization in the language\nof control theory by thinking of the statistics of the optimization progress used by the controller\nto choose the next iterate as the control state and the next iterate to visit as the control action. In\nthis work we choose to restrict our state space to a few statistics that capture both the current time\nconstraints and the recent progress of the optimization procedure. The action space is restricted by\nmaking the observation that current methods for non-linear optimization provide good suggestions\nfor the next point to visit. In this way our action space encodes which one of a \ufb01xed set of opti-\nmization subroutines (see Section 3) to use for the next iteration, along with actions that control\nvarious heuristic parameters for each optimization subroutine (for instance schedules for updating \u03b7\nin gradient descent and heuristics for modifying the value of \u03bb in the LMA).\nIn order to de\ufb01ne the optimality of a controller we de\ufb01ne a reward function that indicates the de-\nsirability of the solution found during optimization. In the context of optimization with semi-rigid\ntime constraints an appropriate reward function balances reduction in loss of the objective function\nwith the number of steps needed to achieve that reduction. 
In the case of optimization with a fixed budget, a more natural choice might be the overall reduction in the loss function within the allotted budget of function evaluations. For specific applications, in a similar spirit to the work of Boyan [6], the reward function could be modified to include features of intermediate solutions that are likely to indicate the desirability of the current point.\n\nInitialize a policy \u03c00 that explores randomly\nS \u2190 {}\nfor i = 1 to n do\n    Generate a random optimization problem U\n    Optimize U for T time steps using policy \u03c00 and generate samples V \u2208 (s, a, r, s')T\n    S \u2190 S \u222a V\nend for\nt \u2190 0\nrepeat\n    Construct the approximate action-value function Q\u03c0t using the samples S\n    Set \u03c0t+1 to be the one-step policy improvement of \u03c0t using Q\u03c0t\n    t \u2190 t + 1\nuntil Q\u03c0t \u2248 Q\u03c0t\u22121\nreturn \u03c0t\n\nFigure 2: Our algorithm for learning controllers for optimization on a budget. The construction of the approximate action-value function and the policy improvement step are performed using the techniques outlined in [12].\n\nGiven a state space, action space, and reward function for a given optimization problem, reinforcement learning methods provide an appropriate set of techniques for learning an optimal optimization controller. While there are many reinforcement learning algorithms that are appropriate for our problem formulation, in this work we employ Least-Squares Policy Iteration (LSPI) [12]. 
Least-Squares Policy Iteration is particularly attractive since it handles continuous state spaces, is efficient in terms of the number of interactions with the system needed to learn a good controller, does not need an underlying model of the process dynamics, and learns models that are amenable to interpretation.\nLSPI is an iterative procedure that repeatedly applies the following two steps until convergence: approximating the action-value function as a linear combination of a fixed set of basis functions, and then improving the current policy greedily over the approximate value function. The bases are functions of the state and action and can be non-linear. The method can reuse the same set of samples to evaluate multiple policies, a crucial difference between LSPI and earlier methods like LSTD. The output of the LSPI procedure is a weight vector that defines the action-value function of the optimal policy as a linear combination of the basis vectors.\nOur method for learning an optimization controller consists of two phases. In the first phase, samples are collected through interactions between a random optimization controller and an optimization problem in a series of fixed-length optimization episodes. These samples are tuples of the form (s, a, r, s'), where s' denotes the state arrived at when action a was executed starting from state s and reward r was received. The second phase of our algorithm applies LSPI to learn an action-value function and, implicitly, an optimal policy (which is given by the greedy maximization of the action-value function over actions for a given state). 
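The least-squares fit at the heart of each LSPI iteration (often called LSTDQ) is compact enough to sketch. This is a sketch of the method of [12] under our own naming, with the basis `phi` and the policy being evaluated supplied by the caller:

```python
import numpy as np

def lstdq(samples, phi, policy, k, gamma=0.9):
    """Fit the weights w of a linear action-value function
    Q(s, a) = w . phi(s, a) from (s, a, r, s') samples."""
    A = np.zeros((k, k))
    b = np.zeros(k)
    for s, a, r, s_next in samples:
        f = phi(s, a)
        f_next = phi(s_next, policy(s_next))  # next action chosen by the policy
        A += np.outer(f, f - gamma * f_next)
        b += r * f
    return np.linalg.solve(A, b)
```

Policy improvement is then implicit: the improved policy picks, in each state, the action maximizing w . phi(s, a), and the procedure repeats with the same stored samples.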
A sketch of our algorithm is given in Figure 2.\n\n5 Experiments\n\nWe demonstrate the ability of our method both to achieve superior performance to off-the-shelf non-linear optimization techniques and to provide insight into the specific policies and action-value functions learned.\n\n5.1 Optimizing Nonlinear Least-Squares Functions with a Fixed Budget\n\nBoth the classical non-linear problems and the facial expression recognition task were formulated in terms of optimization given a fixed budget of function evaluations. This criterion suggests a natural reward function, where L is the loss function we are trying to minimize, B is the budget of function evaluations, I is the indicator function, x0 is the initial point visited in the optimization, and xopt is the point with the lowest loss visited in the current optimization episode:\n\nrk = I(k < B) \u00d7 I(L(xk) < L(xopt)) \u00d7 (L(xopt) \u2212 L(xk)) \u00d7 1/L(x0)    (4)\n\nThis reward function encourages controllers that achieve large reductions in loss within the fixed budget of function evaluations.\nEach optimization problem takes the form of minimizing a sum of squares of non-linear functions and thus is well-suited to Levenberg-Marquardt-style optimization. The action space we consider in our experiments consists of adjustments to the damping factor (maintain, decrease by a multiplicative factor, or increase by a multiplicative factor) used in the LMA, the decision of whether or not to throw away the last descent step, along with two actions that are not available to the LMA. These additional actions are moving to a new random point in the domain of the objective function and returning to the best point found so far and performing one descent step using the LMA (with the current damping factor). 
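The reward of Eq. (4) costs only a comparison and a subtraction per step; a minimal sketch (argument names are illustrative):

```python
def budget_reward(k, B, loss_k, loss_best, loss_0):
    """Eq. (4): reward the improvement over the best loss seen so far,
    normalized by the initial loss, while the budget B has not run out."""
    if k < B and loss_k < loss_best:
        return (loss_best - loss_k) / loss_0
    return 0.0
```

Summed over an episode, these rewards telescope to the total normalized reduction in loss achieved within the budget.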
The number of actions available at each step is 8 (6 for the various combinations of adjustments to \u03bb and returning to the previous iterate, along with the 2 additional actions just described).\nThe state space used to make the action decision includes a fixed-length window of history that encodes whether a particular step in the past increased or decreased the residual error from the previous iterate. This window is set to size 2 for most of our experiments; however, we evaluated the relative improvement of using a window size of 1 versus 2 (see Figure 4). Also included in the state space are the number of function evaluations left in our budget and a problem-specific state feature described in Section 5.3.\nThe state and action space are mapped through a collection of fixed basis functions, which the LSPI algorithm combines linearly to approximate the optimal action-value function. For most applications of LSPI these functions consist of radial-basis functions distributed throughout the continuous state and action space. The basis we use in our problem treats each action independently and thus constructs a tuple of basis functions for each action. To encode the number of evaluations left in the optimization episode, we use a collection of radial-basis functions centered at different values of budget remaining (specifically, basis functions spaced at 4-step intervals with a bandwidth of .3). The history window of whether the loss went up or down during recent iterations of the algorithm is represented as a d-dimensional binary vector, where d is the length of the history window considered. For the facial expression recognition task, the tuple includes an additional basis described in Section 5.3.\n\n5.2 Classical Nonlinear Least-Squares Problems\n\nIn order to validate our approach we apply it to a dataset of classical non-linear optimization problems [13]. 
This dataset includes famous optimization problems that cover a wide variety of non-linear behavior. Examples include the Kowalik and Osborne function and the Scaled Meyer function. When restricted to a budget of 5 function evaluations, our method is able to learn a policy that results in a 6% gain in performance (measured in total reduction in loss from the starting point) when compared to the LMA.\n\n5.3 Learning to Classify Facial Expressions\n\nThe box-filter features that proved successful for face detection in [14] have also shown promise for recognizing facial expressions when combined using boosting methods. The response of a box-filter to an image patch is obtained by weighting the sum of the pixel brightnesses in various boxes by a coefficient defined by the particular box-filter kernel. In our work we frame the problem of feature selection as an optimization procedure over a continuous parameter space. The parameter space defines an infinite set of box-filters that includes many of those proposed in [14] as special cases (see Figure 3). Each feature can be described as a vector in [0, 1]6, where the 6 dimensions of the vector are depicted in Figure 3.\n\nFigure 3: A parameterized feature space. The position of the cross-hairs in the middle of the box filter can freely float. This added generality allows for the features proposed in [14] to be generated as special cases. A complete description of a feature is composed of the 6 parameters depicted above: horizontal offset, vertical crossbar, vertical offset, filter height, horizontal crossbar, and filter width. The weighting coefficients for the four boxes (depicted in a checkerboard pattern) are determined by linear regression between the filter outputs of each box and the labels of the training set.\n\nWe learn a detector for the presence or absence of a smile using the pixel intensities of an image patch containing a face. We accomplish this by employing the sequential regression procedure L2-boost [15]. L2-boost creates a strong classifier by iteratively fitting the residuals of the current model over a collection of weak learners (in this case our parameterized features). The L2-boost procedure selects the box-filter at each iteration that most reduces the difference between the current predictions of the model and the correct image labels. Once a sufficiently good feature is found it is added to the current ensemble. L2-boost learns a linear model for predicting the label of the image patch, since each weak learner (box-filter) is a linear filter on the pixel values and L2-boost combines weak learners in a linear fashion. The basis space for LSPI is augmented for this task by including a basis that specifies the number of features already selected by the L2-boost procedure.\nWe test our algorithm on the task of smile detection using a subset of 1,000 images from the GENKI dataset (a collection of 60,000 faces from the web). Along with information about the location of faces and facial features, human labelers have labeled each image as containing or not containing a smile. In this experiment our goal is to predict the human smile labels using the L2-boost procedure outlined above.\nDuring each trial 3 box filters are selected using the L2-boosting procedure. A total of 20 function evaluations are allowed per round of feature selection. We use the default version of the LMA as a mode of comparison. 
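The boosting inner loop described above can be sketched in a few lines. The `candidates` argument stands in for the search over parameterized box-filters; here it is simply a list of fit functions, an assumption of this sketch:

```python
import numpy as np

def l2_boost(candidates, X, y, rounds):
    """L2-boosting sketch [15]: each round, fit every candidate weak
    learner to the current residuals and add the best one to the ensemble."""
    residual = y.astype(float).copy()
    ensemble = []
    for _ in range(rounds):
        hypotheses = [fit(X, residual) for fit in candidates]
        best = min(hypotheses, key=lambda h: np.sum((residual - h(X)) ** 2))
        ensemble.append(best)
        residual = residual - best(X)  # later rounds fit what is left over
    return lambda Xq: sum(h(Xq) for h in ensemble)
```

In the experiments the candidate pool is not enumerated; instead each round runs a budgeted continuous optimization over the box-filter parameters to find one good weak learner.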
After collecting samples from 100 episodes of optimization on the GENKI dataset, LSPI is able to learn a policy that achieves a 2.66-fold greater reduction in total loss than the LMA on a test set of faces from the GENKI dataset (see Figure 4). Since the LMA does not have the ability to move to a new random part of the state space, a fairer comparison is to our method without access to this action. In this experiment our method is still able to achieve a 20% greater reduction in total loss than the LMA.\nFigure 4 shows that the policies learned using our method not only achieve a greater reduction in loss on the training set, but that this reduction in loss translates to a significant gain in performance for classification on a validation set of test images. Our method achieves between .036 and .083 better classification performance (as measured by area under the ROC curve) depending on the optimization budget. Note that given the relatively high baseline performance of the LMA on the smile detection task, an improvement of .083 in terms of area under the ROC translates to almost halving the error rate. Also of significance is that the information encoded in the state space makes a difference in the performance of the algorithm. A policy learned with a history window of error changes over the last two time steps achieves a 16% greater reduction in total loss than a policy learned with a history window of size 1.\nAlso of interest is the nature of the policies learned for smile detection on a fixed budget. The policies learned exhibit the following general trend: during the early stages of selecting a specific feature, the learned policies either sample a new point in the feature space (if the error has increased from the last iteration) or do a Levenberg-Marquardt step on the best point visited so far (if the error has gone down at the last iteration). 
This initial strategy makes sense: if the current point does not look promising (the error has increased) it is wise to try a different part of the state space; however, if the error is decreasing it is best to continue to apply local optimization methods. Later in the optimization, the policy always performs a Levenberg-Marquardt step on the current best point no matter what the change in error was. This strategy makes sense because once a few different parts of the state space have been investigated, the utility of sampling a new part of the state space is reduced.\n\nController Type                               | Average Reduction in Loss Relative to the LMA\nLearned (history window = 1)                  | 2.3\nLearned (history window = 2)                  | 2.66\nLearned (no random restarts)                  | 1.2\nLearned on Classical (no random restarts)     | 1.19\nDefault LMA                                   | 1.0\n\nFigure 4: Top: The performance on detecting smile versus not smile is substantially better when using an optimization controller learned with our algorithm than using the default LMA. In each run 3 features are selected by the L2-boost procedure. The number of feature evaluations per feature (the budget) varies along the x-axis. Bottom: The table describes the relative improvement in total loss reduction for policies learned using our method.\n\nSeveral trends can be seen by examining the basis weights learned by LSPI. The first is that the learned policy favors discarding the last iterate over keeping it (similar to the LMA). The second is that the policy favors increasing the damping parameter when the error has increased on the last iteration and decreasing it when the error has decreased (also similar to the LMA).\n\n5.4 Cross Generalization\n\nA property of choosing a general state space for our method is that the policies learned on one class of optimization problem are applicable to other classes of optimization. 
The optimization controllers learned on the classical least-squares minimization task achieve a 19% improvement over the standard LMA on the smile detection task. Applying the controllers learned on the smile detection task to the classical least-squares problems yields a more modest 5% improvement. These results support the claim that our method is extracting useful structure for optimizing under a fixed budget and not simply learning a controller that is amenable to a particular problem domain.\n\n6 Conclusion\n\nWe have presented a novel approach to the problem of learning optimization procedures for optimization on a fixed budget. We have shown that our approach achieves better performance than ubiquitous methods for non-linear least-squares optimization on the task of optimizing within a fixed budget of function evaluations, for both classical non-linear functions and a difficult computer vision task. We have also provided an analysis of the patterns learned by our method and how they make sense in the context of optimization under a fixed budget. Additionally, we have presented extensions to the features used in [14] that are significant in their own right.\nIn the future we will more fully explore the framework that we have outlined in this document. The specific instantiation of the framework in the current work (state, action, and bases), while quite effective, can likely be improved. For instance, by incorporating domain-specific features into the state space, richer policies might be learned. We also want to apply this technique to other problems in machine perception. 
An upcoming project will test the viability of our technique for finding feature point locations on a face that simultaneously exhibit high likelihood in terms of appearance and high likelihood in terms of the relative arrangement of facial features. The real-time constraints of this problem make it a particularly appropriate target for the methods presented in this document.\n\nReferences\n[1] K. Levenberg, \u201cA method for the solution of certain problems in least squares,\u201d Quarterly of Applied Mathematics, 1944.\n[2] D. Marquardt, \u201cAn algorithm for least-squares estimation of nonlinear parameters,\u201d SIAM Journal of Applied Mathematics, 1963.\n[3] V. V. Miagkikh and W. F. P. III, \u201cGlobal search in combinatorial optimization using reinforcement learning algorithms,\u201d in Proceedings of the Congress on Evolutionary Computation, vol. 1. IEEE Press, 1999, pp. 189\u2013196.\n[4] Y. Zhang, \u201cSolving large-scale linear programs by interior-point methods under the MATLAB environment,\u201d Optimization Methods and Software, vol. 10, pp. 1\u201331, 1998.\n[5] L. M. Gambardella and M. Dorigo, \u201cAnt-Q: A reinforcement learning approach to the traveling salesman problem,\u201d in International Conference on Machine Learning, 1995, pp. 252\u2013260.\n[6] J. A. Boyan and A. W. Moore, \u201cLearning evaluation functions for global optimization and Boolean satisfiability,\u201d in AAAI/IAAI, 1998, pp. 3\u201310.\n[7] R. Moll, T. J. Perkins, and A. G. Barto, \u201cMachine learning for subproblem selection,\u201d in ICML \u201900: Proceedings of the Seventeenth International Conference on Machine Learning. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 2000, pp. 615\u2013622.\n[8] M. I. Lourakis and A. A. Argyros, \u201cIs Levenberg-Marquardt the most efficient optimization algorithm for implementing bundle adjustment?\u201d Proceedings of ICCV, 2005.\n[9] D. Cristinacce and T. F. 
Cootes, \u201cFeature detection and tracking with constrained local models,\u201d BMVC, pp. 929\u2013938, 2006.\n[10] M. Pollefeys, L. V. Gool, M. Vergauwen, F. Verbiest, K. Cornelis, J. Tops, and R. Koch, \u201cVisual modeling with a hand-held camera,\u201d IJCV, vol. 59, no. 3, pp. 207\u2013232, 2004.\n[11] P. Beardsley, P. Torr, and A. Zisserman, \u201c3D model acquisition from extended image sequences,\u201d Proceedings of ECCV, pp. 683\u2013695, 1996.\n[12] M. Lagoudakis and R. Parr, \u201cLeast-squares policy iteration,\u201d Journal of Machine Learning Research, 2003.\n[13] H. B. Nielsen, \u201cUCTP problems for unconstrained optimization,\u201d Technical Report, Technical University of Denmark, 2000.\n[14] P. Viola and M. Jones, \u201cRobust real-time object detection,\u201d International Journal of Computer Vision, 2002.\n[15] P. B\u00fchlmann and B. Yu, \u201cBoosting with the L2 loss: Regression and classification,\u201d Journal of the American Statistical Association, 2003.\n", "award": [], "sourceid": 447, "authors": [{"given_name": "Paul", "family_name": "Ruvolo", "institution": null}, {"given_name": "Ian", "family_name": "Fasel", "institution": null}, {"given_name": "Javier", "family_name": "Movellan", "institution": null}]}