{"title": "Interior Point Implementations of Alternating Minimization Training", "book": "Advances in Neural Information Processing Systems", "page_first": 569, "page_last": 576, "abstract": null, "full_text": "Interior Point Implementations of \nAlternating Minimization Training \n\nMichael Lemmon \n\nDept . of Electrical Engineering \n\nUniversity of Notre Dame \n\nNotre Dame, IN 46556 \n\nlemmon@maddog.ee.nd.edu \n\nPeter T. Szymanski \n\nDept. of Electrical Engineering \n\nUniversity of Notre Dame \n\nNotre Dame, IN 46556 \n\npszymans@maddog.ee.nd.edu \n\nAbstract \n\nThis paper presents an alternating minimization (AM) algorithm \nused in the training of radial basis function and linear regressor \nnetworks. The algorithm is a modification of a small-step interior \npoint method used in solving primal linear programs. The algo(cid:173)\nrithm has a convergence rate of O( fo,L) iterations where n is a \nmeasure of the network size and L is a measure of the resulting \nsolution's accuracy. Two results are presented that specify how \naggressively the two steps of the AM may be pursued to ensure \nconvergence of each step of the alternating minimization. \n\n1 \n\nIntroduction \n\nIn recent years, considerable research has investigated the use of alternating min(cid:173)\nimization (AM) techniques in the supervised training of radial basis function \nnetworks. AM techniques were first introduced in soft-competitive learning al(cid:173)\ngorithms[l]. This training procedure was later shown to be closely related to \nExpectation-Maximization algorithms used by the statistical estimation commu(cid:173)\nnity[2]. Alternating minimizations search for optimal network weights by breaking \nthe search into two distinct minimization problems. A given network performance \nfunctional is extremalized first with respect to one set of network weights and then \nwith respect to the remaining weights. 
These learning procedures have found applications in the training of local expert systems [3] and in Boltzmann machine training [4]. More recently, convergence rates have been derived by viewing the AM method as a variable metric algorithm [5]. \n\nThis paper examines AM as a perturbed linear programming (LP) problem. Recent advances in the application of barrier function methods to LP problems have resulted in the development of \"path following\" or \"interior point\" (IP) algorithms [6]. These algorithms are characterized by a fast convergence rate that scales in a sublinear manner with problem size. This paper shows how a small-step IP algorithm for solving a primal LP problem can be modified into an AM training procedure. The principal results of the paper are bounds on how aggressively the legs of the alternating minimization can be pursued so that the AM algorithm maintains the sublinear convergence rate characteristic of its LP counterpart and so that both legs converge to an optimal solution. \n\n2 Problem Statement \n\nConsider a function approximation problem where a stochastic approximator learns a mapping f : R^N -> R. The approximator computes a predicted output, y ∈ R, given the input z ∈ R^N. The prediction is computed using a finite set of M regressors. The mth regressor is characterized by the pair (θ_m, w_m) ∈ R^N × R^N (m = 1, ..., M). The output of the mth regressor, y_m ∈ R, in response to an input z ∈ R^N is given by the linear function \n\ny_m = θ_m^T z.   (1) \n\nThe mth regressor (m = 1, ..., M) has an associated radial basis function (RBF) with parameter vector w_m ∈ R^N. The mth RBF weights the contribution of the mth output in computing y and is defined as a normal probability density function \n\nQ(m|z) = k_m exp(-σ^{-2} ||w_m - z||^2)   (2) \n\nwhere k_m is a normalizing constant. The set of all weights or gating probabilities is denoted by Q. The parameterization of the mth regressor is Θ_m = (θ_m^T, w_m^T)^T ∈ R^{2N} (m = 1, ..., M), and the parameterization of the set of M linear regressors is \n\nΘ = (Θ_1^T, ..., Θ_M^T)^T.   (3) \n\nThe preceding stochastic approximator can be viewed as a neural network. The network consists of M + 1 neurons. M of the neurons are agent neurons, while the other neuron is referred to as a gating neuron. The mth agent neuron is parameterized by θ_m, the first element of the pair Θ_m = (θ_m^T, w_m^T)^T (m = 1, ..., M). The agent neurons receive as input the vector z ∈ R^N. The output of the mth agent neuron in response to an input z is y_m = θ_m^T z (m = 1, ..., M). The gating neuron is parameterized by the conditional gating probabilities, Q. The gating probabilities are defined by the set of vectors W = {w_1, ..., w_M}. The gating neuron receives the agent neurons' outputs and the vector z as its input. The gating neuron computes the network's output, ŷ, as a hard (4) or soft (5) choice \n\nŷ = y_m̂;  m̂ = arg max_{m=1,...,M} Q(m|z)   (4) \n\nŷ = (Σ_{m=1}^M Q(m|z) y_m) / (Σ_{m=1}^M Q(m|z))   (5) \n\nThe network will be said to be \"optimal\" with respect to a training set T = {(z_i, y_i) : y_i = f(z_i), i = 1, ..., R} if a mean square error criterion is minimized. Define the square output error of the mth agent neuron to be e_m(z_i) = (y_i - θ_m^T z_i)^2 and the square weighting or classifier error of the mth RBF to be c_m(z_i) = ||w_m - z_i||^2. 
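As an illustrative sketch (not the authors' code), the network output in (1)-(5) can be computed as follows. The names `forward`, `Theta`, `W`, and `sigma` are ours, and we assume a common normalizing constant k_m, which cancels in the soft choice (5) and does not affect the arg max in (4):

```python
import numpy as np

def forward(z, Theta, W, sigma=1.0, hard=False):
    """Gated mixture-of-linear-regressors output, eqs. (1)-(5).

    z:     input vector, shape (N,)
    Theta: regressor parameters theta_m stacked row-wise, shape (M, N)
    W:     RBF parameter vectors w_m stacked row-wise, shape (M, N)
    """
    y_m = Theta @ z                                     # eq (1): y_m = theta_m^T z
    # eq (2) up to the common constant k_m: Q(m|z) ~ exp(-||w_m - z||^2 / sigma^2)
    q = np.exp(-np.sum((W - z) ** 2, axis=1) / sigma ** 2)
    if hard:                                            # eq (4): winner-take-all choice
        return y_m[np.argmax(q)]
    return float(q @ y_m / q.sum())                     # eq (5): normalized soft blend
```

With an RBF center far from z, the corresponding gating weight is negligible and the soft output (5) approaches the nearest regressor's output.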
Let the combined square approximation error of the mth neuron be d_m(z_i) = κ_e e_m(z_i) + κ_c c_m(z_i), and let the average square approximation error of the network be \n\nd(Q, Θ, T) = Σ_{m=1}^M Σ_{i=1}^R p(z_i) Q(m|z_i) d_m(z_i).   (6) \n\nMinimizing (6) corresponds to minimizing both the output error in the M linear regressors and the classifier error in assigning inputs to the M regressors. Since T is a discrete set and the Q are gating probabilities, the minimization of d(Q, Θ, T) is constrained so that Q is a valid conditional probability mass function over the training set, T. \n\nNetwork training can therefore be viewed as a constrained optimization problem. In particular, this optimization problem can be expressed in a form very similar to conventional LP problems. The following notational conventions are adopted to highlight this connection. Let x ∈ R^{MR} be the gating neuron's weight vector, where \n\nx = (Q(1|z_1), ..., Q(1|z_R), Q(2|z_1), ..., Q(M|z_R))^T.   (7) \n\nLet Θ_m = (θ_m^T, w_m^T)^T ∈ R^{2N} denote the parameter vectors associated with the mth regressor and define the cost vector conditioned on Θ = (Θ_1^T, ..., Θ_M^T)^T as \n\nc(Θ) = (p(z_1)d_1(z_1), ..., p(z_R)d_1(z_R), p(z_1)d_2(z_1), ..., p(z_R)d_M(z_R))^T.   (8) \n\nWith this notation, the network training problem can be stated as follows: \n\nminimize c^T(Θ)x with respect to x, Θ subject to Ax = b, x ≥ 0   (9) \n\nwhere b = (1, ..., 1)^T ∈ R^R, A = [I_{R×R} ... I_{R×R}] ∈ R^{R×MR}, and x ≥ 0 means x_i ≥ 0 for i = 1, ..., MR. \n\nOne approach for solving this problem is to break the optimization into two steps. The first step involves minimizing the above cost functional with respect to x assuming a fixed Θ. This is the Q-update of the algorithm. The second leg of the algorithm minimizes the functional with respect to Θ assuming fixed x. This leg is called the Θ-update. Because the proposed optimization alternates between two different subsets of weights, this training procedure is often referred to as alternating minimization. 
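To make the correspondence with (7)-(9) concrete, the following minimal sketch (our own; the name `lp_data` and the array arguments are hypothetical) assembles the LP data c, A, and b for M regressors and R training samples:

```python
import numpy as np

def lp_data(d, p):
    """Assemble the LP data of eq. (9).

    d: per-sample combined errors d_m(z_i), shape (M, R)
    p: sample probabilities p(z_i), shape (R,)
    Returns c of shape (MR,), A of shape (R, MR), b of shape (R,).
    """
    M, R = d.shape
    c = (p * d).reshape(M * R)        # eq (8): entries p(z_i) d_m(z_i), grouped by m
    A = np.hstack([np.eye(R)] * M)    # A = [I_R ... I_R]: Ax = b sums Q(m|z_i) over m
    b = np.ones(R)
    return c, A, b
```

The constraint Ax = b with this block structure enforces that the gating weights for each training sample z_i sum to one, i.e., that Q is a conditional probability mass function.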
Note that the Q-update is an LP problem while the Θ-update is a quadratic programming problem. Consequently, the AM training procedure can be viewed as a perturbed LP problem. \n\n3 Proposed Training Algorithm \n\nThe preceding section noted that network training can be viewed as a perturbed LP problem. This observation is significant, for there exist very efficient LP solvers based on barrier function methods used in non-linear optimization. Recent advances in path following or interior point (IP) methods have developed LP solvers which exhibit convergence rates which scale in a sublinear way with problem size [6]. This section introduces a modification of a small-step primal IP algorithm that can be used for neural network training. The proposed modification is later shown to preserve the computational efficiency enjoyed by its LP counterpart. \n\nTo see how such a modification might arise, we first need to examine path following LP solvers. Consider the following LP problem: \n\nminimize c^T x with respect to x ∈ R^n subject to Ax = b, x ≥ 0.   (10) \n\nThis problem can be solved by solving a sequence of augmented optimization problems arising from the primal parameterization of the LP problem: \n\nminimize α(k) c^T x^(k) - Σ_i log x_i^(k) with respect to x^(k) ∈ R^n subject to Ax^(k) = b, x^(k) ≥ 0   (11) \n\nwhere α(k) ≥ 0 (k = 1, ..., K) is a finite length, monotone increasing sequence of real numbers. x*(α(k)) denotes the solution of the kth optimization problem in the sequence and is referred to as a central point. The locus of all points x*(α(k)) where α(k) ≥ 0 is called the central path. The augmented problem takes the original LP cost function and adds a logarithmic barrier which keeps the central point away from the boundaries of the feasible set. 
As α increases, the effect of the barrier is decreased, thereby allowing the kth central point to approach the LP problem's solution in a controlled manner. \n\nPath following (IP) methods solve the LP problem by approximately solving the sequence of augmented problems shown in (11). The parameter sequence, α(0), α(1), ..., α(K), is chosen to be a monotone increasing sequence so that the central points, x*(α(k)), of each augmented optimization approach the LP solution in a monotone manner. It has been shown, for specific choices of the α sequence, that the sequence of approximate central points will converge to an ε-neighborhood of the LP solution after a finite number of iterations. For primal IP algorithms, the required condition is that successive approximations of the central points lie within the region of quadratic convergence for a scaling steepest descent (SSD) algorithm [6]. In particular, it has been shown that if the kth approximating solution is sufficiently close to the kth central point and if α(k+1) = α(k)(1 + ν/√n), where ν ≤ 0.1 controls the distance between successive central points, then \"closeness\" to the (k+1)st central point is guaranteed and the resulting algorithm will converge in O(√n L) iterations, where L = n + p + 1 specifies the size of the LP problem and p is the total number of bits used to represent the data in A, b, and c. If the algorithm takes small steps, then it is guaranteed to converge efficiently. \n\nThe preceding discussion reveals that a key component of a path following method's computational efficiency lies in controlling the iteration so that successive central points lie within the SSD algorithm's region of quadratic convergence. If we are to successfully extend such methods to (9), then this \"closeness\" of successive solutions must be preserved by the Θ-update of the algorithm. 
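A rough sketch of the two ingredients just described, under our own naming (`ssd_step`, `alpha_schedule`) and assuming steps small enough that x stays strictly positive: one scaling-steepest-descent step for the barrier problem (11), returning the step size δ that reappears in the theoretical results, and the schedule α(k+1) = α(k)(1 + ν/√n):

```python
import numpy as np

def ssd_step(x, alpha, c, A):
    """One scaling steepest descent step on the barrier objective of eq. (11)."""
    X = np.diag(x)
    AX = A @ X
    # projection onto the null space of AX, so feasibility Ax = b is preserved
    P = np.eye(len(x)) - AX.T @ np.linalg.solve(AX @ AX.T, AX)
    g = alpha * (X @ c) - 1.0            # scaled gradient: X(alpha*c - x^{-1})
    delta = np.linalg.norm(P @ g)        # step size delta(x, alpha, Theta)
    return x - X @ (P @ g), delta

def alpha_schedule(alpha0, n, K, nu=0.1):
    """Small-step barrier schedule alpha(k+1) = alpha(k)(1 + nu/sqrt(n)), nu <= 0.1."""
    return [alpha0 * (1 + nu / np.sqrt(n)) ** k for k in range(K)]
```

Because the step direction is projected onto the null space of AX, the equality constraints Ax = b are preserved exactly; positivity of x is only maintained for sufficiently small steps, which is the small-step regime assumed above.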
Due to the quadratic nature of the Θ-update, this minimization can be done exactly using a single Newton-Raphson iteration. Let Θ* denote the Θ-update's minimizer. If we update Θ to Θ*, it is quite possible that the cost vector, c(Θ), will be rotated in such a way that the current solution, x^(k), no longer lies in the region of quadratic convergence. Therefore, if we are to preserve the IP method's computational efficiency, it will be necessary to be less \"aggressive\" in the Θ-update. In particular, this paper proposes the following convex combination as the Θ-update \n\nΘ_m^(k+1) = (1 - γ(k)) Θ_m^(k) + γ(k) Θ_m^(k+1),*   (12) \n\nwhere Θ_m^(k) is the mth parameter vector at time k and 0 < γ(k) < 1 controls the size of the update. This will ensure convergence of the Q-update. \n\nConvergence of the AM algorithm also requires convergence of the Θ-update. For the Θ-update to converge, γ(k) in (12) must go to unity as k increases. Convergence of γ(k) to unity requires that the sequence ||Θ_m^(k+1),* - Θ_m^(k)|| be monotone decreasing. As the Θ-update minimizer, Θ^(k+1),*, depends upon the current weights, Q^(k)(m|z), large changes to Q can prevent the sequence from being monotone decreasing. Thus, it is necessary to also be less \"aggressive\" in the Q-update. An appropriate bound on ν is the proposed solution to guarantee convergence of the Θ-update. \n\nAlgorithm 1 (Proposed Training Algorithm) \n\nInitialize: k = 0. Choose x_i^(0), Θ_m^(0), and α(0) for i = 1, ..., MR and m = 1, ..., M. \nrepeat \n  α(k+1) = α(k)(1 + ν/√n), where ν ≤ 0.1 \n  Q-update: \n    x_0 = x^(k) \n    for i = 0, ..., P-1 \n      x_{i+1} = ScalingSteepestDescent(x_i, α(k+1), Θ^(k)) \n    x^(k+1) = x_P \n  Θ-update: for m = 1, ..., M \n    Θ_m^(k+1) = (1 - γ(k)) Θ_m^(k) + γ(k) Θ_m^(k+1),* \n  k = k + 1 \nuntil (Δ(α(k)) < ε) \n\n4 Theoretical Results \n\nThis section provides bounds on the parameter γ(k) controlling the AM algorithm's Θ-update, so that successive x^(k) vector solutions lie within the SSD algorithm's region of quadratic convergence, and on ν controlling the Q-update, so that successive central points are not too far apart, thus allowing convergence of the Θ-update. \n\nTheorem 1 Let Θ_m^(k) and Θ_m^(k),* be the current and minimizing parameter vectors at time k. Let c^(k) = c(Θ^(k)) and c^(k),* = c(Θ^(k),*). Let δ(x, α, Θ) = ||P_{AX} X (α c(Θ) - x^{-1})|| be the step size of the SSD update, where P_B = I - B^T (B B^T)^{-1} B and X = diag(x_1, ..., x_n). Assume that δ(x^(k+1), α(k+1), Θ^(k)) = δ_1 < 0.5 and let Θ_m^(k+1) = (1 - γ(k)) Θ_m^(k) + γ(k) Θ_m^(k+1),*. If γ(k) is chosen so that \n\nγ(k) ≤ (δ_2 - δ_1) / (n (1 + ν/√n) ||c^(k+1),* - c^(k)||) \n\nwith δ_1 < δ_2 < 0.5, then δ(x^(k+1), α(k+1), Θ^(k+1)) ≤ δ_2 < 0.5, so that x^(k+1) remains within the SSD algorithm's region of quadratic convergence. \n\nTheorem 2 If ν is chosen so that \n\nν ≤ min{0.1, -1 + √(1 + γ_min ε_Θ / (2K))}   (15) \n\nwhere K = μ_max (1 + 2 μ_max/(1 - r)), γ_min = (δ_2 - δ_1)/(n (1 + 0.1/√n) ||c^(1),* - c^(0)||), and ε_Θ is the largest ||Θ_m^(k+1),* - Θ_m^(k)|| such that γ(k) = 1, then the Θ-update will converge with ||Θ_m^(k+1),* - Θ_m^(k)|| -> 0 and γ(k) -> 1 as k increases. \n\nThe preceding results guarantee the convergence of the component minimizations separately. Convergence of the total algorithm relies on the simultaneous convergence of both steps. This is currently being addressed using contraction mapping concepts and stability results from nonlinear stability analysis [8]. \n\nThe convergence rate of the algorithm is established using the LP problem's duality gap. The duality gap is the difference between the current solutions for the primal and dual formulations of the LP problem. Path following algorithms allow the duality gap to be expressed as \n\nΔ(α(k)) = (n + 0.5 √n) / α(k)   (16) \n\nand thus provide a convenient stopping criterion for the algorithm. Note that α(k) = α(0)/β^k, where β = (1 + ν/√n)^{-1}. This implies that Δ(k) = β^k Δ(0) ≤ β^k 2^L. If k is chosen so that β^k 2^L ≤ 2^{-L}, then Δ(k) ≤ 2^{-L}, which requires k ≥ 2L/log(1/β). Inserting our choice of β, one finds that k ≥ (2 √n L/ν) + 2L. The preceding argument establishes the proposed convergence rate of O(√n L) iterations. In other words, the procedure's training time scales in a sublinear manner with network size. \n\n5 Simulation Example \n\nSimulations were run on a time series prediction task to test the proposed algorithm. The training set is T = {(z_i, y_i) : y_i = y(iT), z_i = (y_{i-1}, y_{i-2}, ..., y_{i-N})^T ∈ R^N} for i = 0, 1, ..., 100, N = 4, and T = 0.04, where the time series is defined as \n\ny(t) = sin(πt) - sin(2πt) + sin(3πt) - sin(πt/2).   (17) \n\nThe results describe the convergence of the algorithm. These experiments consisted of 100 randomly chosen samples with N = 4 and a number of agent neurons ranging from M = 4 to 20. This corresponds to an LP problem dimension of n = 404 to 2020. The stopping criterion for the tests was to run until the solution was within ε = 10^{-3} of a local minimum. The number of iterations and floating point operations (FLOPS) for the AM algorithm to converge are shown in Figures 1(a) and 1(b), with AM results denoted by \"o\" and the theoretical rates by a solid line. The algorithm exhibits approximately O(√n L) iterations to converge, as predicted. The computational cost, however, is O(n^2 L) FLOPS, which is better than the predicted O(n^{3.5} L). The difference is due to the use of sparse matrix techniques which reduce the number of computations. The resulting AM algorithm then has the complexity of a matrix multiplication instead of a matrix inversion. 
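For reference, the training set of this section can be generated as in the following sketch (`make_training_set` is our name; padding the series with N extra samples of history so that the first delay embeddings are defined is our implementation choice):

```python
import numpy as np

def make_training_set(R=100, N=4, T=0.04):
    """Pairs (z_i, y_i) from the time series of eq. (17), i = 0, ..., R,
    with z_i = (y_{i-1}, ..., y_{i-N}) a delay embedding of the series."""
    t = np.arange(-N, R + 1) * T               # N extra samples of history
    y = (np.sin(np.pi * t) - np.sin(2 * np.pi * t)
         + np.sin(3 * np.pi * t) - np.sin(np.pi * t / 2))
    # row i holds (y_{i-1}, ..., y_{i-N}); y[i + N] is the target y_i
    Z = np.array([y[i:i + N][::-1] for i in range(R + 1)])
    return Z, y[N:]
```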
The use of the algorithm resulted in networks having mean square errors on the order of 10^{-3}. \n\nFigure 1: Convergence rates as a function of n. (a) Number of iterations to converge versus LP problem size n. (b) Number of floating point operations versus LP problem size n. \n\n6 Discussion \n\nThis paper has presented an AM algorithm which can be proven to converge in O(√n L) iterations. The work has established a means by which IP methods can be applied to NN training in a way which preserves the computational efficiency of IP solvers. The AM algorithm can be used to solve off-line problems such as codebook generation and parameter identification in colony control applications. The method is currently being used to solve hybrid control problems of the type in [9]. Areas of future research concern the study of large-step IP methods and extensions of AM training to other EM algorithms. \n\nReferences \n\n[1] S. Nowlan, \"Maximum likelihood competitive learning,\" in Advances in Neural Information Processing Systems 2, pp. 574-582, San Mateo, California: Morgan Kaufmann Publishers, Inc., 1990. \n\n[2] M. Jordan and R. Jacobs, \"Hierarchical mixtures of experts and the EM algorithm,\" Tech. Rep. 9301, MIT Computational Cognitive Science, Apr. 1993. \n\n[3] R. Jacobs, M. Jordan, S. Nowlan, and G. Hinton, \"Adaptive mixtures of local experts,\" Neural Computation, vol. 3, pp. 79-87, 1991. \n\n[4] W. Byrne, \"Alternating minimization and Boltzmann machine learning,\" IEEE Transactions on Neural Networks, vol. 3, pp. 612-620, July 1992. \n\n[5] M. Jordan and L. Xu, \"Convergence results for the EM approach to mixtures of experts architectures,\" Tech. Rep. 9303, MIT Computational Cognitive Science, Sept. 1993. \n\n[6] C. Gonzaga, \"Path-following methods for linear programming,\" SIAM Review, vol. 34, pp. 167-224, June 1992. \n\n[7] P. Szymanski and M. Lemmon, \"A modified interior point method for supervisory controller design,\" in Proceedings of the 33rd IEEE Conference on Decision and Control, pp. 1381-1386, Dec. 1994. \n\n[8] M. Vidyasagar, Nonlinear Systems Analysis. Englewood Cliffs, New Jersey: Prentice-Hall, Inc., 1993. \n\n[9] M. Lemmon, J. Stiver, and P. Antsaklis, \"Event identification and intelligent hybrid control,\" in Hybrid Systems (R. L. Grossman, A. Nerode, A. P. Ravn, and H. Rischel, eds.), vol. 736 of Lecture Notes in Computer Science, pp. 265-296, Springer-Verlag, 1993. \n", "award": [], "sourceid": 980, "authors": [{"given_name": "Michael", "family_name": "Lemmon", "institution": null}, {"given_name": "Peter", "family_name": "Szymanski", "institution": null}]}