{"title": "Efficient Parallel Learning Algorithms for Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 40, "page_last": 48, "abstract": null, "full_text": "EFFICIENT PARALLEL LEARNING ALGORITHMS FOR NEURAL NETWORKS \n\nAlan H. Kramer and A. Sangiovanni-Vincentelli \nDepartment of EECS \nU.C. Berkeley \nBerkeley, CA 94720 \n\nABSTRACT \n\nParallelizable optimization techniques are applied to the problem of learning in feedforward neural networks. In addition to having superior convergence properties, optimization techniques such as the Polak-Ribiere method are also significantly more efficient than the Back-propagation algorithm. These results are based on experiments performed on small boolean learning problems and the noisy real-valued learning problem of hand-written character recognition. \n\n1 INTRODUCTION \n\nThe problem of learning in feedforward neural networks has received a great deal of attention recently because of the ability of these networks to represent seemingly complex mappings in an efficient parallel architecture. This learning problem can be characterized as an optimization problem, but it is unique in several respects. Function evaluation is very expensive; however, because the underlying network is parallel in nature, this evaluation is easily parallelizable. In this paper we describe the network learning problem in a numerical framework and investigate parallel algorithms for its solution. Specifically, we compare the performance of several parallelizable optimization techniques to the standard Back-propagation algorithm. Experimental results show the clear superiority of the numerical techniques. \n\n2 NEURAL NETWORKS \n\nA neural network is characterized by its architecture, its node functions, and its interconnection weights. 
In a learning problem, the first two of these are fixed, so the weight values are the only free parameters in the system. When we talk about \"weight space\" we refer to the parameter space defined by the weights in a network; a \"weight vector\" w is thus a point in weight space which defines the value of each weight in the network. We will usually index the components of a weight vector as w_ij, meaning the weight value on the connection from unit i to unit j. Thus N(w, r), a network function with n output units, is an n-dimensional vector-valued function defined for any weight vector w and any input vector r: \n\nN(w, r) = [o_1(w, r), o_2(w, r), ..., o_n(w, r)]^T \n\nwhere o_i is the ith output unit of the network. Any node j in the network has input i_j(w, r) = sum_{i in fanin_j} o_i(w, r) w_ij and output o_j(w, r) = f_j(i_j(w, r)), where f_j(.) is the node function. The evaluation of N(.) is inherently parallel and the time to evaluate N(.) on a single input vector is O(#layers). If pipelining is used, multiple input vectors can be evaluated in constant time. \n\n3 LEARNING \n\nThe \"learning\" problem for a neural network refers to the problem of finding a network function which approximates some desired \"target\" function T(.), defined over the same set of input vectors as the network function. The problem is simplified by asking that the network function match the target function on only a finite set of input vectors, the \"training set\" R. This is usually done with an error measure. The most common measure is sum-squared error, which we use to define the \"instance error\" between N(w, r) and T(r) at weight vector w and input vector r: \n\ne_{N,T}(w, r) = sum_{i in outputs} (1/2)(T_i(r) - o_i(w, r))^2 = (1/2)||T(r) - N(w, r)||^2. \n\nWe can now define the \"error function\" between N(.) and T(.) over R as a function of w: \n\nE_{N,T,R}(w) = sum_{r in R} e_{N,T}(w, r). 
The learning problem is thus reduced to finding a w for which E_{N,T,R}(w) is minimized. If this minimum value is zero then the network function matches the target function exactly on all input vectors in the training set. Henceforth, for notational simplicity, we will write e(.) and E(.) rather than e_{N,T}(.) and E_{N,T,R}(.). \n\n4 OPTIMIZATION TECHNIQUES \n\nAs we have framed it here, the learning problem is a classic problem in optimization. More specifically, network learning is a problem of function approximation, where the approximating function is a finite parameter-based system. The goal is to find a set of parameter values which minimizes a cost function, which in this case is a measure of the error between the target function and the approximating function. \n\nAmong the optimization algorithms that can be used to solve this type of problem, gradient-based algorithms have proven to be effective in a variety of applications {Avriel, 1976}. These algorithms are iterative in nature, with w_k denoting the weight vector at the kth iteration. Each iteration is characterized by a search direction d_k and a step a_k. The weight vector is updated by taking a step in the search direction as below: \n\nfor (k = 0; evaluate(w_k) != CONVERGED; ++k) { \n    d_k = determine_search_direction(); \n    a_k = determine_step(); \n    w_{k+1} = w_k + a_k * d_k; \n} \n\nIf d_k is a direction of descent, such as the negative of the gradient, a sufficiently small step will reduce the value of E(.). Optimization algorithms vary in the way they determine a and d, but otherwise they are structured as above. \n\n5 CONVERGENCE CRITERION \n\nThe choice of convergence criterion is important. An algorithm must terminate when E(.) has been sufficiently minimized. This may be done with a threshold on the value of E(.), but this alone is not sufficient. 
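The generic descent iteration of section 4 can be sketched in runnable form. The sketch below is our illustration, not the authors' Connection Machine implementation: it substitutes a toy two-dimensional quadratic for E(.), a fixed step for a real linesearch, and a simple gradient-norm convergence test.

```python
# Sketch of the generic descent loop on a toy quadratic
# E(w) = (w0 - 3)^2 + (w1 + 1)^2 (illustrative assumptions:
# fixed step a_k, gradient-norm convergence test).

def E(w):
    return (w[0] - 3.0) ** 2 + (w[1] + 1.0) ** 2

def grad_E(w):
    return [2.0 * (w[0] - 3.0), 2.0 * (w[1] + 1.0)]

def descend(w, step=0.1, eps=1e-6, max_iter=10000):
    for k in range(max_iter):
        g = grad_E(w)
        # evaluate(w_k): converged when ||g(w_k)|| < eps
        if sum(gi * gi for gi in g) ** 0.5 < eps:
            break
        d = [-gi for gi in g]                         # d_k = determine_search_direction()
        w = [wi + step * di for wi, di in zip(w, d)]  # w_{k+1} = w_k + a_k d_k
    return w

w_star = descend([0.0, 0.0])
```

With a direction of descent and a sufficiently small step, each iteration reduces E(.), exactly as the text argues.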
In the case where the error surface contains \"bad\" local minima, it is possible that the error threshold will be unattainable, and in this case the algorithm will never terminate. Some researchers have proposed the use of an iteration limit to guarantee termination despite an unattainable error threshold {Fahlman, 1989}. Unfortunately, for practical problems where this limit is not known a priori, this approach is inapplicable. \n\nA necessary condition for w* to be a minimum, either local or global, is that the gradient g(w*) = grad E(w*) = 0. Hence, the most usual convergence criterion for optimization algorithms is ||g(w_k)|| <= epsilon, where epsilon is a sufficiently small gradient threshold. The downside of using this as a convergence test is that, for successful trials, learning times will be longer than they would be in the case of an error threshold. Error tolerances are usually specified in terms of an acceptable bit error, and a threshold on the maximum bit error (MBE) is a more appropriate representation of this criterion than is a simple error threshold. For this reason we have chosen a convergence criterion consisting of a gradient threshold (epsilon) and an MBE threshold (tau), terminating when ||g(w_k)|| < epsilon or MBE(w_k) < tau, where MBE(.) is defined as: \n\nMBE(w_k) = max_{r in R} ( max_{i in outputs} ( (1/2)(T_i(r) - o_i(w_k, r))^2 ) ). \n\n6 STEEPEST DESCENT \n\nSteepest Descent is the most classical gradient-based optimization algorithm. In this algorithm the search direction d_k is always the negative of the gradient - the direction of steepest descent. For network learning problems the computation of g(w), the gradient of E(w), is straightforward: \n\ng(w) = grad E(w) = [d/dw sum_{r in R} e(w, r)]^T = sum_{r in R} grad e(w, r), \n\nwhere \n\ngrad e(w, r) = [de(w, r)/dw_12, ..., de(w, r)/dw_mn]^T \n\nand \n\nde(w, r)/dw_ij = o_i(w, r) delta_j(w, r), \n\nwhere for output units \n\ndelta_j(w, r) = f_j'(i_j(w, r))(o_j(w, r) - T_j(r)), \n\nwhile for all other units \n\ndelta_j(w, r) = f_j'(i_j(w, r)) sum_{k in fanout_j} delta_k(w, r) w_jk. \n\nThe evaluation of g is thus almost dual to the evaluation of N; while the latter feeds forward through the net, the former feeds back. Both computations are inherently parallelizable and of the same complexity. \n\nThe method of Steepest Descent determines the step a_k by inexact linesearch, meaning that it approximately minimizes E(w_k + a_k d_k). There are many ways to perform this computation, but they are all iterative in nature and thus involve the evaluation of E(w_k + a_k d_k) for several values of a_k. As each evaluation requires a pass through the entire training set, this is expensive. Curve-fitting techniques are employed to reduce the number of iterations needed to terminate a linesearch. Again, there are many ways to curve fit. We have employed the method of false position and used the Wolfe Test to terminate a linesearch {Luenberger, 1986}. In practice we find that the typical linesearch in a network learning problem terminates in 2 or 3 iterations. \n\n7 PARTIAL CONJUGATE GRADIENT METHODS \n\nBecause linesearch guarantees that E(w_{k+1}) < E(w_k), the Steepest Descent algorithm can be proven to converge for a large class of problems {Luenberger, 1986}. Unfortunately, its convergence rate is only linear and it suffers from the problem of \"cross-stitching\" {Luenberger, 1986}, so it may require a large number of iterations. One way to guarantee a faster convergence rate is to make use of higher order derivatives. Others have investigated the performance of algorithms of this class on network learning tasks, with mixed results {Becker, 1989}. 
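Returning to the first-order gradient computation of section 6, the feed-forward/feed-back duality can be made concrete. The sketch below is our illustration, not the authors' code: a tiny two-layer logistic network with weights stored as nested lists and no bias terms, computing the instance gradient exactly as the delta equations above prescribe.

```python
import math

# Sketch of the section-6 delta equations for a tiny two-layer
# logistic network (illustrative layout: w_hid[j][i] and w_out[k][j]
# are plain nested lists, biases omitted).

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(w_hid, w_out, r):
    # o_j = f_j(i_j), i_j = sum_i o_i w_ij -- feeds forward
    hid = [logistic(sum(w * x for w, x in zip(ws, r))) for ws in w_hid]
    out = [logistic(sum(w * h for w, h in zip(ws, hid))) for ws in w_out]
    return hid, out

def instance_error(w_hid, w_out, r, t):
    # e(w, r) = 1/2 ||T(r) - N(w, r)||^2
    _, out = forward(w_hid, w_out, r)
    return 0.5 * sum((ti - oi) ** 2 for ti, oi in zip(t, out))

def gradient(w_hid, w_out, r, t):
    hid, out = forward(w_hid, w_out, r)
    # output units: delta_j = f'(i_j)(o_j - T_j); logistic f' = o(1 - o)
    d_out = [o * (1.0 - o) * (o - ti) for o, ti in zip(out, t)]
    # other units: delta_j = f'(i_j) sum_{k in fanout_j} delta_k w_jk -- feeds back
    d_hid = [h * (1.0 - h) * sum(d_out[k] * w_out[k][j]
                                 for k in range(len(d_out)))
             for j, h in enumerate(hid)]
    # de/dw_ij = o_i delta_j
    g_hid = [[d * x for x in r] for d in d_hid]
    g_out = [[d * h for h in hid] for d in d_out]
    return g_hid, g_out
```

Summing these instance gradients over the training set R gives g(w), and the backward sweep has the same cost and parallel structure as the forward one.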
We are not interested in such techniques because they are less parallelizable than the methods we have pursued and because they are more expensive, both computationally and in terms of storage requirements. Because we are implementing our algorithms on the Connection Machine, where memory is extremely limited, this last concern is of special importance. We thus confine our investigation to algorithms that require explicit evaluation only of g, the first derivative. \n\nConjugate gradient techniques take advantage of second order information to avoid the problem of cross-stitching without requiring the estimation and storage of the Hessian (matrix of second-order partials). The search direction is a combination of the current gradient and the previous search direction: \n\nd_{k+1} = -g_{k+1} + beta_k d_k. \n\nThere are various rules for determining beta_k; we have had the most success with the Polak-Ribiere rule, where beta_k is determined from g_{k+1} and g_k according to \n\nbeta_k = ((g_{k+1} - g_k)^T g_{k+1}) / (g_k^T g_k). \n\nAs in the Steepest Descent algorithm, a_k is determined by linesearch. With a simple reinitialization procedure, partial conjugate gradient techniques are as robust as the method of Steepest Descent {Powell, 1977}; in practice we find that the Polak-Ribiere method requires far fewer iterations than Steepest Descent. \n\n8 BACKPROPAGATION \n\nThe Batch Back-propagation algorithm {Rumelhart, 1986} can be described in terms of our optimization framework. Without momentum, the algorithm is very similar to the method of Steepest Descent in that d_k = -g_k. Rather than being determined by a linesearch, a, the \"learning rate\", is a fixed user-supplied constant. With momentum, the algorithm is similar to a partial conjugate gradient method, as d_{k+1} = -g_{k+1} + beta_k d_k, though again beta, the \"momentum term\", is fixed. 
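The Polak-Ribiere iteration of section 7, by contrast, chooses both a_k and beta_k adaptively. The following sketch is our illustration on a small quadratic E(w) = 1/2 w^T A w - b^T w, where the exact linesearch has a closed form and stands in for the false-position linesearch described in the text:

```python
# Sketch of the Polak-Ribiere iteration on a small quadratic
# (illustrative: dense 2x2 matrix, closed-form exact linesearch).

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def matvec(A, v):
    return [dot(row, v) for row in A]

def polak_ribiere_beta(g_new, g_old):
    # beta_k = (g_{k+1} - g_k)^T g_{k+1} / (g_k^T g_k)
    return dot([gn - go for gn, go in zip(g_new, g_old)], g_new) / dot(g_old, g_old)

def minimize_pr(A, b, w, eps=1e-10, max_iter=100):
    g = [gi - bi for gi, bi in zip(matvec(A, w), b)]   # grad E = A w - b
    d = [-gi for gi in g]
    for _ in range(max_iter):
        if dot(g, g) ** 0.5 < eps:
            break
        a = -dot(g, d) / dot(d, matvec(A, d))          # exact linesearch step
        w = [wi + a * di for wi, di in zip(w, d)]      # w_{k+1} = w_k + a_k d_k
        g_new = [gi - bi for gi, bi in zip(matvec(A, w), b)]
        beta = polak_ribiere_beta(g_new, g)
        d = [-gn + beta * di for gn, di in zip(g_new, d)]  # d_{k+1} = -g_{k+1} + beta_k d_k
        g = g_new
    return w

A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
w_min = minimize_pr(A, b, [0.0, 0.0])
```

On an n-dimensional quadratic this iteration terminates in at most n linesearches, which is the behavior that lets conjugate gradient avoid the cross-stitching of Steepest Descent.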
On-line Back-propagation is a variation which makes a change to the weight vector following the presentation of each input vector: d_k = -grad e(w_k, r_k). \n\nThough very simple, this algorithm is numerically unsound for several reasons. Because beta is fixed, d_k may not be a descent direction, and in this case any step a will increase E(.). Even if d_k is a direction of descent (as is the case for Batch Back-propagation without momentum), a may be large enough to move from one wall of a \"valley\" to the opposite wall, again resulting in an increase in E(.). Because the algorithm cannot guarantee that E(.) is reduced by successive iterations, it cannot be proven to converge. In practice, finding a value for a which results in fast progress and stable behavior is a black art, at best. \n\n9 WEIGHT DECAY \n\nOne of the problems of performing gradient descent on the \"error surface\" is that minima may be at infinity. (In fact, for boolean learning problems all minima are at infinity.) Thus an algorithm may have to travel a great distance through weight space before it converges. Many researchers have found that weight decay is useful for reducing learning times {Hinton, 1986}. This technique can be viewed as adding a term corresponding to the length of the weight vector to the cost function; this modifies the cost surface in a way that bounds all the minima. Rather than minimizing on the error surface, minimization is performed on the surface with cost function \n\nC(w) = E(w) + (gamma/2)||w||^2, \n\nwhere gamma, the relative weight cost, is a problem-specific parameter. The gradient for this cost function is g(w) = grad C(w) = grad E(w) + gamma w, and for any step a_k, the effect of gamma is to \"decay\" the weight vector by a factor of (1 - a_k gamma): \n\nw_{k+1} = (1 - a_k gamma) w_k - a_k grad E(w_k). \n\n10 PARALLEL IMPLEMENTATION ISSUES \n\nWe have emphasized the parallelism inherent in the evaluation of E(.) and g(.). To be efficient, any learning algorithm must exploit this parallelism. 
Without momen(cid:173)\ntum, the Back-propagation algorithm is the simplest gradient descent technique, as \nit requires the storage of only a single vector, gk. Momentum requires the storage of \nonly one additional vector, dk-l. The Steepest Descent algorithm also requires the \nstorage of only a single vector more than Back-propagation without momentum: \n\n\fEfficient Parallel Learning Algorithms \n\n45 \n\ndk, which is needed for linesearch. In addition to dk, the Polak-Ribiere method \nrequires the storage of two additional vectors: dk-l and gk-l. The additional stor(cid:173)\nage requirements of the optimization techniques are thus minimal. The additional \ncomputational requirements are essentially those needed for linesearch - a single dot \nproduct and a single broadcast per iteration. These operations are parallelizable \n(log time on the Connection Machine) so the additional computation required by \nthese algorithms is also minimal, especially since computation time is dominated \nby the evaluation of EO and gO. Both the Steepest Descent and Polak-Ribiere \nalgorithms are easily parallelizable. We have implemented these algorithms, as well \nas Back-propagation, on a Connection Machine {Hillis, 1986}. \n\n11 EXPERIMENTAL RESULTS - BOOLEAN LEARNING \nWe have compared the performance of the Polak-Ribiere (P-R), Steepest Descent \n(S-D), and Batch Back-propagation (B-B) algorithms on small boolean learning \nproblems. In all cases we have found the Polak-Ribiere algorithm to be significantly \nmore efficient than the others. All the problems we looked at were based on three(cid:173)\nlayer networks (1 hidden layer) using the logistic function for all node functions. \nInitial weight vectors were generated by randomly choosing each component from \n(+r, -r). '1 is the relative weight cost, and f and r define the convergence test. \nLearning times are measured in terms of epochs (sweeps through the training set). 
The encoder problem is easily scaled and has no bad local minima (assuming sufficient hidden units: log(#inputs)). All Back-propagation trials used a = 1 and beta = 0; these values were found to work about as well as any others. Table 1 summarizes the results. Standard deviations for all data were insignificant (< 25%). \n\nTABLE 1. Encoder Results \n\nEncoder   num     Parameter Values          Average Epochs to Convergence \nProblem   trials  r    gamma  tau   eps       P-R      S-D       B-B \n10-5-10   100     1.0  1e-4   1e-1  1e-8     63.71   109.06    196.93 \n10-5-10   100     1.0  1e-4   2e-2  1e-8     71.27   142.31    299.55 \n10-5-10   100     1.0  1e-4   7e-4  1e-8    104.70   431.43   3286.20 \n10-5-10   100     1.0  1e-4   0.0   1e-4    279.52  1490.00  13117.00 \n10-5-10   100     1.0  1e-4   0.0   1e-6    353.30  2265.00  24910.00 \n10-5-10   100     1.0  1e-4   0.0   1e-8    417.90  2863.00  35260.00 \n4-2-4     100     1.0  1e-4   0.1   1e-8     36.92    56.90    179.95 \n8-3-8     100     1.0  1e-4   0.1   1e-8     67.63   194.80    594.76 \n16-4-16   100     1.0  1e-4   0.1   1e-8    121.30   572.80    990.33 \n32-5-32   25      1.0  1e-4   0.1   1e-8    208.60  1379.40   1826.15 \n64-6-64   25      1.0  1e-4   0.1   1e-8    405.60  4187.30    >10000 \n\nThe parity problem is interesting because it is also easily scaled and its weight space is known to contain bad local minima. To report learning times for problems with bad local minima, we use expected epochs to solution, EES. This measure makes sense especially if one considers an algorithm with a restart procedure: if the algorithm terminates in a bad local minimum it can restart from a new random weight vector. EES can be estimated from a set of independent learning trials as the ratio of total epochs to successful trials. The results of the parity experiments are summarized in table 2. 
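The EES estimate can be sketched as follows (our helper, not the authors' code; a trial is represented as an (epochs, succeeded) pair):

```python
# Sketch of the EES estimate above: total epochs over all
# independent trials divided by the number of successful trials
# (names and trial representation are illustrative).

def expected_epochs_to_solution(trials):
    total_epochs = sum(epochs for epochs, _ in trials)
    successes = sum(1 for _, succeeded in trials if succeeded)
    return total_epochs / successes

# e.g. one success after 40 epochs plus one failed 60-epoch attempt:
# the restart-based algorithm spends 100 epochs per solution found
ees = expected_epochs_to_solution([(40, True), (60, False)])
```

Failed trials thus inflate EES in proportion to the epochs they consume before a restart, which is why EES separates the algorithms most sharply on problems with bad local minima.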
Again, the optimization techniques were more efficient than Back-propagation. This fact is most evident in the case of bad trials. All trials used r = 1, gamma = 1e-4, tau = 0.1 and epsilon = 1e-8. Back-propagation used a = 1 and beta = 0. \n\nTABLE 2. Parity Results \n\nParity  alg  trials  %succ  avg succ (s.d.)  avg unsucc (s.d.)      EES \n2-2-1   P-R  100     72%       73 (43)          232 (54)           163 \n2-2-1   S-D  100     80%       95 (115)        3077 (339)          864 \n2-2-1   B-B  100     78%      684 (1460)      47915 (5505)       14197 \n4-4-1   P-R  100     61%      352 (122)         453 (117)          641 \n4-4-1   S-D  100     99%     2052 (1753)      18512 (-)           2324 \n4-4-1   B-B  100     71%     8704 (8339)      95345 (11930)      48430 \n8-8-1   P-R  16      50%     1716 (748)         953 (355)         2669 \n8-8-1   S-D  6       -          -             >10000            >10000 \n8-8-1   B-B  2       -          -            >100000           >100000 \n\n12 LETTER RECOGNITION \n\nOne criticism of batch-based gradient descent techniques is that for large real-world, real-valued learning problems, they will be less efficient than On-line Back-propagation. The task of characterizing hand-drawn examples of the 26 capital letters was chosen as a good problem to test this, partly because others have used this problem to demonstrate that On-line Back-propagation is more efficient than Batch Back-propagation {Le Cun, 1986}. The experimental setup was as follows: \n\nCharacters were hand-entered in a 80 x 120 pixel window with a 5-pixel-wide brush (mouse controlled). Because the objective was to have many noisy examples of the same input pattern, not to learn scale and orientation invariance, all characters were roughly centered and roughly the full size of the window. Following character entry, the input window was symbolically gridded to define 100 8 x 12 pixel regions. Each of these regions was an input and the percentage of \"on\" pixels in the region was its value. There were thus 100 inputs, each of which could have any of 96 (8 x 12) distinct values. 
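The input encoding just described can be sketched as follows. This is our illustration, not the authors' code, and it assumes a particular layout: the window is 120 rows of 80 binary pixels, gridded row-major into a 10 x 10 array of 8-wide by 12-tall regions.

```python
# Sketch of the letter-recognition input encoding above: grid an
# 80 x 120 binary window into 100 regions of 8 x 12 pixels; each
# input is the fraction of "on" pixels in its region (row-major
# traversal order is an assumption of this sketch).

def grid_features(bitmap):
    # bitmap: 120 rows, each a list of 80 ints in {0, 1}
    feats = []
    for gy in range(10):             # 10 grid rows, 12 pixels tall each
        for gx in range(10):         # 10 grid cols, 8 pixels wide each
            on = sum(bitmap[y][x]
                     for y in range(12 * gy, 12 * gy + 12)
                     for x in range(8 * gx, 8 * gx + 8))
            feats.append(on / 96.0)  # 96 pixels per region
    return feats
```

Each of the 100 features is a fraction of 96 pixels, so the inputs are coarse gray levels rather than raw bits, which is what makes the examples noisy but low-dimensional.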
26 outputs were used to represent a one-hot encoding of the 26 letters, and a network with a single hidden layer containing 10 units was chosen. The network thus had a 100-10-26 architecture; all nodes used the logistic function. \n\nA training set consisting of 64 distinct sets of the 26 upper case letters was created by hand in the manner described. 25 \"A\" vectors are shown in figure 1. This large training set was recursively split in half to define a series of successively larger training sets, R_0 to R_6, where R_0 is the smallest training set, consisting of 1 of each letter, and R_i contains R_{i-1} plus 2^{i-1} new letter sets. A testing set consisting of 10 more sets of hand-entered characters was also created to measure network performance. For each R_i, we compared naive learning to incremental learning, where naive learning means initializing w_0^{(i)} randomly and incremental learning means setting w_0^{(i)} to w_*^{(i-1)} (the solution weight vector to the learning problem based on R_{i-1}). The incremental epoch count for the problem based on R_i was normalized to the number of epochs needed starting from w_*^{(i-1)} plus 1/2 the number of epochs taken by the problem based on R_{i-1} (since |R_{i-1}| = (1/2)|R_i|). This normalized count thus reflects the total number of relative epochs needed to get from a naive network to a solution incrementally. \n\nBoth Polak-Ribiere and On-line Back-propagation were tried on all problems. Table 3 contains only results for the Polak-Ribiere method because no combination of weight decay and learning rate was found for which Back-propagation could find a solution after 1000 times the number of iterations taken by Polak-Ribiere, although values of gamma from 0.0 to 0.001 and values of a from 1.0 to 0.001 were tried. All problems had r = 1, gamma = 0.01, epsilon = 1e-8 and tau = 0.1. Only a single trial was done for each problem. 
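The normalization rule above (incremental epochs for R_i plus half the normalized count for R_{i-1}) can be sketched as a simple recurrence. The function name is ours:

```python
# Sketch of the incremental-epoch normalization above:
# norm_0 = inc_0, and norm_i = inc_i + 0.5 * norm_{i-1},
# since |R_{i-1}| = 1/2 |R_i| (illustrative helper).

def normalized_epoch_counts(incremental_epochs):
    norms = []
    for i, e in enumerate(incremental_epochs):
        norms.append(e if i == 0 else e + 0.5 * norms[-1])
    return norms
```

Applied to the incremental (INC) epoch counts of Table 3, this recurrence appears to reproduce the table's normalized (NORM) column up to rounding.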
Performance on the test set is shown in the last column. \n\nFIGURE 1. 25 \"A\"s \n\nTABLE 3. Letter Recognition \n\nprob   Learning Time (epochs)   Test \nset     INC   NORM   NAIV        % \nR0       95     95     95      53.5 \nR1       83    130     85      69.2 \nR2       63    128    271      80.4 \nR3       14     78    388      83.4 \nR4      191    230   1129      92.3 \nR5      153    268   1323      98.1 \nR6       46    180    657      99.6 \n\nThe incremental learning paradigm was very effective at reducing learning times. Even non-incrementally, the Polak-Ribiere method was more efficient than On-line Back-propagation on this problem. The network with only 10 hidden units was sufficient, indicating that these letters can be encoded by a compact set of features. \n\n13 CONCLUSIONS \n\nDescribing the computational task of learning in feedforward neural networks as an optimization problem allows exploitation of the wealth of mathematical programming algorithms that have been developed over the years. We have found that the Polak-Ribiere algorithm offers superior convergence properties and significant speedup over the Back-propagation algorithm. In addition, this algorithm is well-suited to parallel implementation on massively parallel computers such as the Connection Machine. Finally, incremental learning is a way to increase the efficiency of optimization techniques when applied to large real-world learning problems such as that of handwritten character recognition. \n\nAcknowledgments \n\nThe authors would like to thank Greg Sorkin for helpful discussions. This work was supported by the Joint Services Educational Program grant #482427-25304. 
References \n\n{Avriel, 1976} Mordecai Avriel. Nonlinear Programming: Analysis and Methods. Prentice-Hall, Inc., Englewood Cliffs, New Jersey, 1976. \n\n{Becker, 1989} Sue Becker and Yan Le Cun. Improving the Convergence of Back-Propagation Learning with Second Order Methods. In Proceedings of the 1988 Connectionist Models Summer School, pages 29-37, Morgan Kaufmann, San Mateo, Calif., 1989. \n\n{Fahlman, 1989} Scott E. Fahlman. Faster Learning Variations on Back-Propagation: An Empirical Study. In Proceedings of the 1988 Connectionist Models Summer School, pages 38-51, Morgan Kaufmann, San Mateo, Calif., 1989. \n\n{Hillis, 1986} William D. Hillis. The Connection Machine. MIT Press, Cambridge, Mass., 1986. \n\n{Hinton, 1986} G. E. Hinton. Learning Distributed Representations of Concepts. In Proceedings of the Cognitive Science Society, pages 1-12, Erlbaum, 1986. \n\n{Kramer, 1989} Alan H. Kramer. Optimization Techniques for Neural Networks. Technical Memo #UCB-ERL-M89-1, U.C. Berkeley Electronics Research Laboratory, Berkeley, Calif., Jan. 1989. \n\n{Le Cun, 1986} Yan Le Cun. HLM: A Multilayer Learning Network. In Proceedings of the 1986 Connectionist Models Summer School, pages 169-177, Carnegie-Mellon University, Pittsburgh, Penn., 1986. \n\n{Luenberger, 1986} David G. Luenberger. Linear and Nonlinear Programming. Addison-Wesley Co., Reading, Mass., 1986. \n\n{Powell, 1977} M. J. D. Powell. Restart Procedures for the Conjugate Gradient Method. Mathematical Programming 12 (1977) 241-254. \n\n{Rumelhart, 1986} David E. Rumelhart, Geoffrey E. Hinton, and R. J. Williams. Learning Internal Representations by Error Propagation. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1: Foundations, pages 318-362, MIT Press, Cambridge, Mass., 1986. \n", "award": [], "sourceid": 134, "authors": [{"given_name": "Alan", "family_name": "Kramer", "institution": null}, {"given_name": "Alberto", "family_name": "Sangiovanni-Vincentelli", "institution": null}]}