{"title": "Constrained Differential Optimization", "book": "Neural Information Processing Systems", "page_first": 612, "page_last": 621, "abstract": null, "full_text": "612 \n\nConstrained Differential Optimization \n\nJohn C. Platt \nAlan H. Barr \n\nCalifornia Institute of Technology, Pasadena, CA 91125 \n\nAbstract \n\nMany optimization models of neural networks need constraints to restrict the space of outputs to \na subspace which satisfies external criteria. Optimizations using energy methods yield \"forces\" which \nact upon the state of the neural network. The penalty method, in which quadratic energy constraints \nare added to an existing optimization energy, has become popular recently, but is not guaranteed \nto satisfy the constraint conditions when there are other forces on the neural model or when there \nare multiple constraints. In this paper, we present the basic differential multiplier method (BDMM), \nwhich satisfies constraints exactly; we create forces which gradually apply the constraints over time, \nusing \"neurons\" that estimate Lagrange multipliers. \n\nThe basic differential multiplier method is a differential version of the method of multipliers \nfrom Numerical Analysis. We prove that the differential equations locally converge to a constrained \nminimum. \n\nExamples of applications of the differential method of multipliers include enforcing permutation \ncodewords in the analog decoding problem and enforcing valid tours in the traveling salesman problem. \n\n1. Introduction \n\nOptimization is ubiquitous in the field of neural networks. Many learning algorithms, such as \nback-propagation,18 optimize by minimizing the difference between expected solutions and observed \nsolutions. 
Other neural algorithms use differential equations which minimize an energy to solve a specified computational problem, such as associative memory [9], differential solution of the traveling salesman problem [5,10], analog decoding [15], and linear programming [19]. Furthermore, Lyapunov methods show that various models of neural behavior find minima of particular functions [4,9]. \n\nSolutions to a constrained optimization problem are restricted to a subset of the solutions of the corresponding unconstrained optimization problem. For example, a mutual inhibition circuit [6] requires one neuron to be \"on\" and the rest to be \"off\". Another example is the traveling salesman problem [5], where a salesman tries to minimize his travel distance, subject to the constraint that he must visit every city exactly once. A third example is the curve fitting problem, where elastic splines are as smooth as possible, while still going through data points [3]. Finally, when digital decisions are being made on analog data, the answer is constrained to be bits, either 0 or 1 [14]. \n\nA constrained optimization problem can be stated as \n\nminimize f(x), \nsubject to g(x) = 0, \n\n(1) \n\nwhere x is the state of the neural network, a position vector in a high-dimensional space; f(x) is a scalar energy, which can be imagined as the height of a landscape as a function of position x; g(x) = 0 is a scalar equation describing a subspace of the state space. During constrained optimization, the state should be attracted to the subspace g(x) = 0, then slide along the subspace until it reaches the locally smallest value of f(x) on g(x) = 0. \n\nIn section 2 of the paper, we describe classical methods of constrained optimization, such as the penalty method and Lagrange multipliers. \n\nSection 3 introduces the basic differential multiplier method (BDMM) for constrained optimization, which calculates a good local minimum. 
If the constrained optimization problem is convex, then the local minimum is the global minimum; in general, finding the global minimum of non-convex problems is fairly difficult. \n\nIn section 4, we show a Lyapunov function for the BDMM by drawing on an analogy from physics. \n\n\u00a9 American Institute of Physics 1988 \n\nIn section 5, augmented Lagrangians, an idea from optimization theory, enhance the convergence properties of the BDMM. \n\nIn section 6, we apply the differential algorithm to two neural problems, and discuss the insensitivity of the BDMM to choice of parameters. Parameter sensitivity is a persistent problem in neural networks. \n\n2. Classical Methods of Constrained Optimization \n\nThis section discusses two methods of constrained optimization, the penalty method and Lagrange multipliers. The penalty method has been previously used in differential optimization. The basic differential multiplier method developed in this paper applies Lagrange multipliers to differential optimization. \n\n2.1. The Penalty Method \n\nThe penalty method is analogous to adding a rubber band which attracts the neural state to the subspace g(x) = 0. The penalty method adds a quadratic energy term which penalizes violations of constraints [8]. Thus, the constrained minimization problem (1) is converted to the following unconstrained minimization problem: \n\nminimize E_penalty(x) = f(x) + c (g(x))^2. \n\n(2) \n\nFigure 1. The penalty method makes a trough in state space \n\nThe penalty method can be extended to fulfill multiple constraints by using more than one rubber band. Namely, the constrained optimization problem \n\nminimize f(x), \nsubject to g_\u03b1(x) = 0, \u03b1 = 1, 2, ..., n; \n\n(3) \n\nis converted into the unconstrained optimization problem \n\nminimize E_penalty(x) = f(x) + \u03a3_\u03b1 c_\u03b1 (g_\u03b1(x))^2. \n\n(4) \n\nThe penalty method has several convenient features. First, it is easy to use. 
Second, it is globally convergent to the correct answer as c_\u03b1 \u2192 \u221e [8]. Third, it allows compromises between constraints. For example, in the case of a spline curve fitting input data, there can be a compromise between fitting the data and making a smooth spline. \n\nHowever, the penalty method has a number of disadvantages. First, for finite constraint strengths c_\u03b1, it doesn't fulfill the constraints exactly. Using multiple rubber band constraints is like building a machine out of rubber bands: the machine would not hold together perfectly. Second, as more constraints are added, the constraint strengths get harder to set, especially when the size of the network (the dimensionality of x) gets large. \n\nIn addition, there is a dilemma in the setting of the constraint strengths. If the strengths are small, then the system finds a deep local minimum, but does not fulfill all the constraints. If the strengths are large, then the system quickly fulfills the constraints, but gets stuck in a poor local minimum. \n\n2.2. Lagrange Multipliers \n\nLagrange multiplier methods also convert constrained optimization problems into unconstrained extremization problems. Namely, a solution to equation (1) is also a critical point of the energy \n\nE_Lagrange(x, \u03bb) = f(x) + \u03bb g(x). \n\n(5) \n\n\u03bb is called the Lagrange multiplier for the constraint g(x) = 0 [8]. \n\nA direct consequence of equation (5) is that the gradient of f is collinear to the gradient of g at the constrained extrema (see Figure 2). The constant of proportionality between \u2207f and \u2207g is \u2212\u03bb: \n\n\u2207E_Lagrange = 0 = \u2207f + \u03bb \u2207g. \n\n(6) \n\nWe use the collinearity of \u2207f and \u2207g in the design of the BDMM. \n\nFigure 2. At the constrained minimum, \u2207f = \u2212\u03bb \u2207g \n\nA simple example shows that Lagrange multipliers provide the extra degrees of freedom necessary to solve constrained optimization problems. Consider the problem of finding a point (x, y) on the line x + y = 1 that is closest to the origin. 
Using Lagrange multipliers, \n\nE_Lagrange = x^2 + y^2 + \u03bb(x + y \u2212 1). \n\n(7) \n\nNow, take the derivative with respect to all variables, x, y, and \u03bb: \n\n\u2202E_Lagrange/\u2202x = 2x + \u03bb = 0, \n\u2202E_Lagrange/\u2202y = 2y + \u03bb = 0, \n\u2202E_Lagrange/\u2202\u03bb = x + y \u2212 1 = 0. \n\n(8) \n\nWith the extra variable \u03bb, there are now three equations in three unknowns. In addition, the last equation is precisely the constraint equation. \n\n3. The Basic Differential Multiplier Method for Constrained Optimization \n\nThis section presents a new \"neural\" algorithm for constrained optimization, consisting of differential equations which estimate Lagrange multipliers. The neural algorithm is a variation of the method of multipliers, first presented by Hestenes [7] and Powell [16]. \n\n3.1. Gradient Descent does not work with Lagrange Multipliers \n\nThe simplest differential optimization algorithm is gradient descent, where the state variables of the network slide downhill, opposite the gradient. Applying gradient descent to the energy in equation (5) yields \n\ndx_i/dt = \u2212\u2202E_Lagrange/\u2202x_i = \u2212\u2202f/\u2202x_i \u2212 \u03bb \u2202g/\u2202x_i, \nd\u03bb/dt = \u2212\u2202E_Lagrange/\u2202\u03bb = \u2212g(x). \n\n(9) \n\nNote that there is an auxiliary differential equation for \u03bb, which is an additional \"neuron\" necessary to apply the constraint g(x) = 0. Also, recall that when the system is at a constrained extremum, \u2207f = \u2212\u03bb\u2207g; hence, dx_i/dt = 0. \n\nEnergies involving Lagrange multipliers, however, have critical points which tend to be saddle points. Consider the energy in equation (5). If x is frozen, the energy can be decreased by sending \u03bb to +\u221e or \u2212\u221e. \n\nGradient descent does not work with Lagrange multipliers, because a critical point of the energy in equation (5) need not be an attractor for (9). A stationary point must be a local minimum in order for gradient descent to converge. \n\n3.2. 
The New Algorithm: the Basic Differential Multiplier Method \n\nWe present an alternative to differential gradient descent that estimates the Lagrange multipliers, so that the constrained minima are attractors of the differential equations, instead of \"repulsors.\" The differential equations that solve (1) are \n\ndx_i/dt = \u2212\u2202f/\u2202x_i \u2212 \u03bb \u2202g/\u2202x_i, \nd\u03bb/dt = +g(x). \n\n(10) \n\nEquation (10) is similar to equation (9). As in equation (9), constrained extrema of the energy (5) are stationary points of equation (10). Notice, however, the sign inversion in the equation for d\u03bb/dt, as compared to equation (9). Equation (10) is performing gradient ascent on \u03bb. The sign flip makes the BDMM stable, as shown in section 4. \n\nEquation (10) corresponds to a neural network with anti-symmetric connections between the \u03bb neuron and all of the x neurons. \n\n3.3. Extensions to the Algorithm \n\nOne extension to equation (10) is an algorithm for constrained minimization with multiple constraints. Adding an extra neuron for every equality constraint and summing all of the constraint forces creates the energy \n\nE_multiple(x, \u03bb) = f(x) + \u03a3_\u03b1 \u03bb_\u03b1 g_\u03b1(x). \n\nThere exists a c* > 0 such that if c > c*, the damping matrix in equation (28) is positive definite at constrained minima. Using continuity, the damping matrix is positive definite in a region R surrounding each constrained minimum. If the system starts in the region R and remains bounded and in R, then the convergence theorem at the end of section 4 is applicable, and the MDMM will converge to a constrained minimum. \n\nThe minimum necessary penalty strength c for the MDMM is usually much less than the strength needed by the penalty method alone [2]. \n\n6. Examples \n\nThis section contains two examples which illustrate the use of the BDMM and the MDMM. First, the BDMM is used to find a good solution to the planar traveling salesman problem. 
Second, the MDMM is used to enforce mutual inhibition and digital results in the task of analog decoding. \n\n6.1. Planar Traveling Salesman \n\nThe traveling salesman problem (TSP) is: given a set of cities lying in the plane, find the shortest closed path that goes through every city exactly once. Finding the shortest path is NP-complete. Finding a nearly optimal path, however, is much easier than finding a globally optimal path. There exist many heuristic algorithms for approximately solving the traveling salesman problem [5,10,11,13]. The solution presented in this section is moderately effective and illustrates the insensitivity of the BDMM to changes in parameters. \n\nFollowing Durbin and Willshaw [5], we use an elastic snake to solve the TSP. A snake is a discretized curve which lies on the plane. The elements of the snake are points on the plane, (x_i, y_i). A snake is a locally connected neural network, whose neural outputs are positions on the plane. \n\nThe snake minimizes its length \n\n\u03a3_i (x_{i+1} \u2212 x_i)^2 + (y_{i+1} \u2212 y_i)^2, \n\n(29) \n\nsubject to the constraint that the snake must lie on the cities: \n\nk(x* \u2212 x_c) = 0, k(y* \u2212 y_c) = 0, \n\n(30) \n\nwhere (x*, y*) are city coordinates, (x_c, y_c) is the closest snake point to the city, and k is the constraint strength. \n\nThe minimization in equation (29) is quadratic and the constraints in equation (30) are piecewise linear, corresponding to a C^0 continuous potential energy in equation (21). Thus, the damping is positive definite, and the system converges to a state where the constraints are fulfilled. \n\nIn practice, the snake starts out as a circle. Groups of cities grab onto the snake, deforming it. As the snake gets close to groups of cities, it grabs onto a specific ordering of cities that locally minimizes its length (see Figure 4). \n\nThe system of differential equations that solve equations (29) and (30) are piecewise linear. 
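These differential equations are instances of the BDMM flow in equation (10). As a minimal, self-contained sketch of that flow (on a toy quadratic problem rather than the snake itself; the step size, iteration count, and forward-Euler integration are illustrative assumptions, not the paper's implicit scheme), consider minimizing f(x, y) = x^2 + y^2 subject to g(x, y) = x + y \u2212 1 = 0:

```python
# BDMM flow, equation (10): gradient descent on the state, gradient ASCENT
# on the multiplier, for minimize x^2 + y^2 subject to x + y - 1 = 0.
# dt and steps are illustrative assumptions, not values from the paper.

def bdmm(dt=0.01, steps=5000):
    x, y, lam = 0.0, 0.0, 0.0              # neural state and multiplier neuron
    for _ in range(steps):
        g = x + y - 1.0                     # constraint value g(x)
        dx = -2.0 * x - lam                 # -df/dx - lambda * dg/dx
        dy = -2.0 * y - lam                 # -df/dy - lambda * dg/dy
        dlam = +g                           # sign flip relative to equation (9)
        x, y, lam = x + dt * dx, y + dt * dy, lam + dt * dlam
    return x, y, lam

x, y, lam = bdmm()
print(x, y, lam)  # spirals in to the constrained minimum (0.5, 0.5), lambda -> -1
```

The trajectory oscillates while converging, which is the damped-oscillation behavior the paper analyzes with its Lyapunov argument; plain gradient descent on equation (5) would instead diverge in \u03bb.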
The differential equations for x_i and y_i are solved with the implicit Euler method, using tridiagonal LU decomposition to solve the linear system [17]. The points of the snake are sorted into bins that divide the plane, so that the computation of finding the nearest point is simplified. \n\nFigure 4. The snake eventually attaches to the cities \n\nThe constrained minimization in equations (29) and (30) is a reasonable method for approximately solving the TSP. For 120 cities distributed in the unit square, 600 snake points, a numerical step size of 100 time units, and a constraint strength of 5 \u00d7 10^{-3}, the tour lengths are 6% \u00b1 2% longer than those yielded by simulated annealing [11]. Empirically, for 30 to 240 cities, the time needed to compute the final city ordering scales as N^{1.6}, as compared to the Kernighan-Lin method [13], which scales roughly as N^{2.2}. \n\nThe same constraint strength is usable for both a 30-city problem and a 240-city problem. Although changing the constraint strength affects the performance, the snake attaches to the cities for any non-zero constraint strength. Parameter adjustment does not seem to be an issue as the number of cities increases, unlike the penalty method. \n\n6.2. Analog Decoding \n\nAnalog decoding uses analog signals from a noisy channel to reconstruct codewords. Analog decoding has been performed neurally [15], with a code space of permutation matrices, out of the possible space of binary matrices. \n\nTo perform the decoding of permutation matrices, the nearest permutation matrix to the signal matrix must be found. In other words, find the nearest matrix to the signal matrix, subject to the constraint that the matrix has on/off binary elements, and has exactly one \"on\" per row and one \"on\" per column. If the signal matrix is I_ij and the result is V_ij, then minimize \n\n\u2212\u03a3_{i,j} I_ij V_ij, \n\n(31) \n\nsubject to the constraints \n\nV_ij (1 \u2212 V_ij) = 0; \u03a3_i V_ij \u2212 1 = 0; \u03a3_j V_ij \u2212 1 = 0. \n\n(32) \n\nIn this example, the first constraint in equation (32) forces crisp digital decisions. The second and third constraints are mutual inhibition along the rows and columns of the matrix. \n\nThe optimization in equation (31) is not quadratic, it is linear. In addition, the first constraint in equation (32) is non-linear. Using the BDMM results in undamped oscillations. In order to converge onto a constrained minimum, the MDMM must be used. For both a 5 \u00d7 5 and a 20 \u00d7 20 system, c = 0.2 is adequate for damping the oscillations. The choice of c seems to be reasonably insensitive to the size of the system, and a wide range of c, from 0.02 to 2.0, damps the oscillations. \n\nFigure 5. The decoder finds the nearest permutation matrix \n\nIn a test of the MDMM, a signal matrix which is a permutation matrix plus some noise, with a signal-to-noise ratio of 4, is supplied to the network. In Figure 5, the system has turned on the correct neurons but also many incorrect neurons. The constraints start to be applied, and eventually the system reaches a permutation matrix. The differential equations do not need to be reset. If a new signal matrix is applied to the network, the neural state will move towards the new solution. \n\n7. Conclusions \n\nIn the field of neural networks, there are differential optimization algorithms which find local solutions to non-convex problems. The basic differential multiplier method is a modification of a standard constrained optimization algorithm, which improves the capability of neural networks to perform constrained optimization. \n\nThe BDMM and the MDMM offer many advantages over the penalty method. First, the differential equations (10) are much less stiff than those of the penalty method. Very large quadratic terms are not needed by the MDMM in order to strongly enforce the constraints. The energy terrain for the penalty method looks like steep canyons, with gentle floors; finding minima of these types of energy surfaces is numerically difficult. In addition, the steepness of the penalty terms is usually sensitive to the dimensionality of the space. The differential multiplier methods are promising techniques for alleviating stiffness. \n\nThe differential multiplier methods separate the speed of fulfilling the constraints from the accuracy of fulfilling the constraints. 
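This separation can be seen on a toy problem. The sketch below (my own illustration, not from the paper; the specific f, g, penalty strength c, step size, and iteration count are assumptions) runs plain gradient descent on the penalty energy of equation (2) for minimize x^2 + y^2 subject to x + y \u2212 1 = 0, and shows the residual constraint violation that any finite c leaves behind:

```python
# Penalty method, equation (2): minimize x^2 + y^2 + c*(x + y - 1)^2 by
# plain gradient descent. For this problem the converged state satisfies
# x + y - 1 = -1/(1 + 2c), so the constraint is met only approximately.
# c, dt, and steps are illustrative assumptions.

def penalty_descent(c=10.0, dt=0.01, steps=20000):
    x = y = 0.0
    for _ in range(steps):
        g = x + y - 1.0                     # constraint violation
        dx = -(2.0 * x + 2.0 * c * g)       # -dE_penalty/dx
        dy = -(2.0 * y + 2.0 * c * g)       # -dE_penalty/dy
        x, y = x + dt * dx, y + dt * dy
    return x, y

x, y = penalty_descent()
print(x + y - 1.0)  # residual violation, -1/21 for c = 10; the BDMM drives this to 0
```

Raising c shrinks the residual but stiffens the descent equations, which is exactly the canyon-shaped energy terrain described above; the multiplier methods avoid this trade-off.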
In the penalty method, as the strength of a constraint goes to \u221e, the constraint is fulfilled, but the energy has many undesirable local minima. The differential multiplier methods allow one to choose how quickly to fulfill the constraints. \n\nThe BDMM fulfills constraints exactly and is compatible with the penalty method. Addition of penalty terms in the MDMM does not change the stationary points of the algorithm, and sometimes helps to damp oscillations and improve convergence. \n\nSince the BDMM and the MDMM are in the form of first-order differential equations, they can be directly implemented in hardware. Performing constrained optimization at the raw speed of analog VLSI seems like a promising technique for solving difficult perception problems [14]. \n\nThere exist Lyapunov functions for the BDMM and the MDMM. The BDMM converges globally for quadratic programming. The MDMM is provably convergent in a local region around the constrained minima. Other optimization algorithms, such as Newton's method [17], have similar local convergence properties. The global convergence properties of the BDMM and the MDMM are currently under investigation. \n\nIn summary, the differential method of multipliers is a useful way of enforcing constraints on neural networks, for enforcing the syntax of solutions, encouraging desirable properties of solutions, and making crisp decisions. \n\nAcknowledgments \n\nThis paper was supported by an AT&T Bell Laboratories fellowship (JCP). \n\nReferences \n\n1. K. J. Arrow, L. Hurwicz, H. Uzawa, Studies in Linear and Nonlinear Programming, (Stanford University Press, Stanford, CA, 1958). \n2. D. P. Bertsekas, Automatica, 12, 133-145, (1976). \n3. C. de Boor, A Practical Guide to Splines, (Springer-Verlag, NY, 1978). \n4. M. A. Cohen, S. Grossberg, IEEE Trans. Systems, Man, and Cybernetics, 13, 815-826, (1983). \n5. R. Durbin, D. Willshaw, Nature, 326, 689-691, (1987). \n6. J. C. 
Eccles, The Physiology of Nerve Cells, (Johns Hopkins Press, Baltimore, 1957). \n7. M. R. Hestenes, J. Opt. Theory Appl., 4, 303-320, (1969). \n8. M. R. Hestenes, Optimization Theory, (Wiley & Sons, NY, 1975). \n9. J. J. Hopfield, PNAS, 81, 3088, (1984). \n10. J. J. Hopfield, D. W. Tank, Biological Cybernetics, 52, 141, (1985). \n11. S. Kirkpatrick, C. D. Gelatt, M. P. Vecchi, Science, 220, 671-680, (1983). \n12. J. LaSalle, The Stability of Dynamical Systems, (SIAM, Philadelphia, 1976). \n13. S. Lin, B. W. Kernighan, Oper. Res., 21, 498-516, (1973). \n14. C. A. Mead, Analog VLSI and Neural Systems, (Addison-Wesley, Reading, MA, TBA). \n15. J. C. Platt, J. J. Hopfield, in AIP Conf. Proc. 151: Neural Networks for Computing (J. Denker, ed.), 364-369, (American Institute of Physics, NY, 1986). \n16. M. J. D. Powell, in Optimization, (R. Fletcher, ed.), 283-298, (Academic Press, NY, 1969). \n17. W. H. Press, B. P. Flannery, S. A. Teukolsky, W. T. Vetterling, Numerical Recipes, (Cambridge University Press, Cambridge, 1986). \n18. D. Rumelhart, G. Hinton, R. Williams, in Parallel Distributed Processing, (D. Rumelhart, ed.), 1, 318-362, (MIT Press, Cambridge, MA, 1986). \n19. D. W. Tank, J. J. Hopfield, IEEE Trans. Cir. & Sys., CAS-33, no. 5, 533-541, (1986). \n", "award": [], "sourceid": 4, "authors": [{"given_name": "John", "family_name": "Platt", "institution": null}, {"given_name": "Alan", "family_name": "Barr", "institution": null}]}