{"title": "A Multiscale Attentional Framework for Relaxation Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 633, "page_last": 639, "abstract": "", "full_text": "A Multiscale Attentional Framework for \n\nRelaxation Neural Networks \n\nDimitris I. Tsioutsias \n\nEric Mjolsness \n\nDept. of Electrical Engineering \n\nDept. of Computer Science & Engineering \n\nYale University \n\nNew Haven, CT 06520-8285 \ntsioutsias~cs.yale.edu \n\nUniversity of California, San Diego \n\nLa Jolla, CA 92093-0114 \n\nemj~cs.ucsd.edu \n\nAbstract \n\nWe investigate the optimization of neural networks governed by \ngeneral objective functions. Practical formulations of such objec(cid:173)\ntives are notoriously difficult to solve; a common problem is the \npoor local extrema that result by any of the applied methods. In \nthis paper, a novel framework is introduced for the solution oflarge(cid:173)\nscale optimization problems. It assumes little about the objective \nfunction and can be applied to general nonlinear, non-convex func(cid:173)\ntions; objectives in thousand of variables are thus efficiently min(cid:173)\nimized by a combination of techniques - deterministic annealing, \nmultiscale optimization, attention mechanisms and trust region op(cid:173)\ntimization methods. \n\n1 \n\nINTRODUCTION \n\nMany practical problems in computer vision, pattern recognition , robotics and other \nareas can be described in terms of constrained optimization . In the past decade, \nresearchers have proposed means of solving such problems with the use of neural \nnetworks [Hopfield & Tank, 1985; Koch et ai., 1986], which are thus derived as \nrelaxation dynamics for the objective functions codifying the optimization task. 
\n\nOne disturbing aspect of the approach soon became obvious, namely the appar(cid:173)\nent inability of the methods to scale up to practical problems, the principal reason \nbeing the rapid increase in the number of local minima present in the objectives as \nthe dimension of the problem increases. Moreover most objectives, E( v), are highly \nnonlinear, non-convex functions of v , and simple techniques (e.g. steepest descent) \n\n\f634 \n\nD. I. TSIOUTSIAS, E. MJOLSNESS \n\nwill , in general, locate the first minimum from the starting point. \n\nIn this work, we propose a framework for solving large-scale instances of such opti(cid:173)\nmization problems. We discuss several techniques which assist in avoiding spurious \nminima and whose combined result is an objective function solution that is compu(cid:173)\ntationallyefficient, while at the same time being globally convergent. In section 2.1 \nwe discuss the use of deterministic annealing as a means of avoiding getting trapped \ninto local minima. Section 2.2 describes multiscale representations of the original \nobjective in reduced spatial domains. In section 2.3 we present a scheme for reduc(cid:173)\ning the computational requirements of the optimization method used, by means of \na focus of attention mechanism. Then, in section 2.4 we introduce a trust region \nmethod for the relaxation phase of the framework, which uses second order informa(cid:173)\ntion (i.e. curvature) of the objective function. In section 3 we present experimental \nresults on the application of our framework to a 2-D region segmentation objective \nwith discontinuities. Finally, section 4 summarizes our presentation. 
\n\n2 THEORETWALFRAMEWORK \n\nOur optimization framework takes the form of a list of nested loops indicating the \norder of conceptual (and computational) phases that occur: from the outer to the \ninner loop we make use of deterministic annealing, a multiscale representation , an \nattentional mechanism and a trust region optimization method. \n\n2.1 ANNEALING NETS \n\nThe usefulness of statistical mechanics for designing optimization procedures has \nrecently been established; prime examples are simulated annealing and its various \nmean field theory approximations [Hopfield & Tank, 1985; Durbin & Willshaw, \n1987]. The success of such methods is primarily due to entropic terms included in \nthe objective (i .e. syntactic terms), but the price to pay is their highly nonlinear \nform. Interestingly, those terms can effectively be convexified by the use of a \"tem(cid:173)\nperature\" parameter, T , allowing for a reduction in the number of minima and the \nability to track the solution through \"temperature\". \n\n2.2 MULTISCALE REPRESENTATION \n\nTo solve large-scale problems in thousands of variables, we need to speed up the \nconvergence of the method while still retaining valid state-space trajectories. To \naccomplish this we introduce smaller, approximate versions of the problem at coarser \nspatial scales [Mjolsness et al. , 1991] ; the nonlinearity of the original objective is \nmaintained at all scales, as opposed to other approaches where the objectives and \ntheir derivatives are either approximated by the use of finite difference methods , \nor solved for by multigrid techniques where a quadratic objective is still assumed. \nConsequently, the multiscale representation exploits the effective smoothness in the \nobjectives: by alternating relaxation phases between coarser and finer scales, we \nuse the former to identify extrema and the latter to localise them. 
\n\n2.3 FOCUS OF ATTENTION \n\nTo further reduce the computational requirements of larg~scale optimization (and \nindirectly control its temporal behavior), we use a focus of attention (FoA) mecha(cid:173)\nnism [Mjolsness & Miranker, 1993], reminiscent of the spotlight hypothesis argued \n\n\fA Multiscale Attentional Framework for Relaxation Neural Networks \n\n635 \n\nto exist in early vision systems [Koch & Ullman, 1985; Olshausen et al., 1993]. The \neffect of a FoA is to support efficient, responsive analysis: it allows resources to be \nfocused on selected areas of a computation and can rapidly redirect them as the \ntask requirements evolve. \n\nSpecifically, the FoA becomes a characteristic function, 7l'(X) , determining which \nof the N neurons are active and which are clamped during relaxation, by use of a \ndiscrete-valued vector, X, and by the rule: 7l'i(X) = 1 if neuron Vi is in the FoA, and \nzero otherwise. Moreover, a limited number, n, of neurons Vi are active at any given \ninstant: I:i 7l'i(X) = n, with n\u00ab Nand n chosen as an optimal FoA size. To tie the \nattentional mechanism to the multiscale representation, we introduce a partition \nof the neurons Vi into blocks indexed by a (corresponding to coarse-scale block(cid:173)\nneurons), via a sparse rectangular matrix Bia E {O, I} such that I:a Bia = 1, Vi, \nwith i = 1, ... ,N, a = 1,oo.,K and K\u00abN. Then 7l'i(X) = I:aBiaXa, and we use \neach component of X for switching a different block of the partition; thus, a neuron \nVi is in the FoA iff its coarse scale block a is in the FoA, as indicated by Xa. As \na result, our FoA need not necessarily have a single region of activity: it may well \nhave a distributed activity pattern as determined by the partitions Bia. 
1 \n\nClocked objective function notation [Mjolsness & Miranker, 1993] makes the task \nmore apparent: during the active-x phase the FoA is computed for the next active(cid:173)\nv phase, determining the subset of neurons Vi on which optimization is to be carried \nout. We introduce the quantity E ;dv] == g~ ~ (Ti is a time axis for Vi) [Mjolsness \n& Miranker, 1993] as an estimate of the predicted dE arising from each Vi if it joins \nthe FoA. For HopfieldjGrossberg dynamics this measure becomes: \n\nE ;d v ] = _g~(gi1(Vi)) (~~) 2 == -gHU i)(E,i)2 \n\n(1) \n\nwi th E,i ~f 'V'i E, and gi the transfer function for neuron Vi (e.g. a sigmoid func(cid:173)\ntion). Eq. (1) is used here analogously to saliency measures introduced into neu(cid:173)\nrophysiological work [Koch & Ullman, 1985]; we propose it as a global measure \nof conspicuousness. As a result, attention becomes a k-winner-take-all (kWTA) \nnetwork: \n\na \n\na \n\nwhere I refers to the scale for which the FoA is being determined (I = 1, ... , L), EEl \nconforms with the clocked objective notation, and the last summand corresponds \nto the subspace on which optimization is to be performed, as determined by the \ncurrent FoA.2 Periodically, an analogous FoA through spatial scales is run, allowing \nre-direction of system resources to the scale which seems to be having the largest \ncombined benefit and cost effect on the optimization [Tsioutsias & Mjolsness, 1995]. \nThe combined effect of multiscale optimization and FoA is depicted schematically in \nFig. 1: reduced-dimension functionals are created and a FoA beam \"shines\" through \nscales picking the neurons to work on. \n\n1 Preferably, Bia will be chosen to minimize the number of inter-block connections. \n2 Before computing a new FoA we update the neighbors of all neurons that were included \n\nin the last focus; this has a similar effect to an implicit spreading of activation. \n\n\f636 \n\nD. I. TSIOUTSIAS, E. 
MJOLSNESS \n\nLayer 3 \n\nLayer 1 \n\nFigure 1: Multiscale Attentional Neural Nets: FoA on a layer (e.g. L=l) competes \nwith another FoA (e.g. L=2) to determine both preferable scale and subspace. \n\n2.4 OPTIMIZATION PHASE \n\nTo overcome the problems generally associated with the steepest descent method, \nother techniques have been devised . Newton 's method , although successful in small \nto medium-sized problems, does not scale well in large non-convex instances and is \ncomputationally intensive. Quasi-Newton methods are efficient to compute, have \nquadratic termination but are not globally convergent for general nonlinear, non(cid:173)\nconvex functions. A method that guarantees global convergence is the trust region \nmethod [Conn et al., 1993] . The idea is summarized as follows : Newton's method \nsuffers from non-positive definite Hessians; in such a case, the underlying function \nm(k)(6) obtained from the 2nd order Taylor expansion of E(Vk + 6) does not have \na minimum and the method is not defined, or equivalently, the region around the \ncurrent point Vk in which the Taylor series is adequate does not include a minimizing \npoint of m(k)(6). To resolve this, we can define a neighborhood Ok of Vk such that \nm(k)(6) agrees with E(Vk + 6) in some sense; then, we pick Vk+l = Vk + 6 k , where \n6 k minimizes m(k)(6) , V(Vk + 6) E Ok . Thus, we seek a solution to the resulting \nsubproblem: \n\n(3) \n\nwhere 1I \u00b7lIp is any kind of norm (for instance, the L2 norm leads to the Levenberg(cid:173)\naccuracy ratio Tk = (~E(k)/~m(k) = (E(k ) - E(Vk + 6k\u00bb/(m(k)(O) - m(k)(6 k\u00bb; \nMarquardt methods) , and ~k is the radius of Ok, adaptively modified based on an \n\n~E(k) is the \"actual reduction\" in E(k) when step 6 k is taken, and ~m(k) the \n\"predicted reduction\" . 
The closer this accuracy ratio is to unity, the better the agreement between the local quadratic model of E^(k) and the objective itself, and Δ_k is modified adaptively to reflect this [Conn et al., 1993]. \n\nWe need to make some brief points here (a complete discussion will be given elsewhere [Tsioutsias & Mjolsness, 1995]): \n\n• At each spatial scale of our multiscale representation, we optimize the corresponding objective by applying a trust region method. To obtain sufficient relaxation progress as we move through scales we have to maintain meaningful region sizes, Δ_k; to that end we use a criterion based on the curvature of the functionals along a searching direction. \n\n• The dominant relaxation computation within the algorithm is the solution of eq. (3). We have chosen to solve this subproblem with a preconditioned conjugate gradient method (PCG) that uses a truncated Newton step to speed up the computation; steps are accepted when a sufficiently good approximation to the quasi-Newton step is found.3 In our case, the norm in eq. (3) becomes the elliptical norm ||δ||_C = δ^t C δ, where a diagonal preconditioner to the Hessian is used as the scaling matrix C. \n\n• If the neuronal connectivity pattern of the original objective is sparse (as happens for most practical combinatorial optimization problems), the pattern of the resulting Hessian can readily be represented by sparse static data structures,4 as we have done within our framework. Moreover, the partition matrices, B_ia, introduce a moderate fill-in in the coarser objectives and the sparsity of the corresponding Hessians is again taken into account. 
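The elliptical norm in the second bullet can be sketched as below; a minimal illustration assuming a strictly positive diagonal scaling (the function name and the example values are ours).

```python
def elliptical_norm(d, c_diag):
    # ||d||_C = d^t C d with a diagonal scaling matrix C, as used to shape
    # the trust-region constraint; c_diag holds the (positive) diagonal of C,
    # e.g. a diagonal preconditioner built from the Hessian diagonal.
    return sum(ci * di * di for ci, di in zip(c_diag, d))

# usage: a step [1, 2] measured under the scaling diag(2.0, 0.5)
val = elliptical_norm([1.0, 2.0], [2.0, 0.5])   # 2*1 + 0.5*4 = 4.0
```

A diagonal C keeps the constraint check O(n) per step, which matters when the same test is applied inside every PCG iteration.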
\n\n3 EXPERIMENTS \n\nWe have applied our proposed optimization framework to a spatially structured \nobjective from low-level vision, namely smooth 2-D region segmentation with the \ninclusion of discontinuity detection processes: \n\nij \n\nij \n\nij \n\nij \n\nij \n\nwhere d is the set of image intensities, j is the real-valued smooth surface to be fit to \nthe data, lV and lh are the discrete-valued line processes indicating a non-zero value \nin the intensity gradient, and \u00a2(x) = -(2go)-1[lnx+ln(1-x)] is a barrier function \nrestricting each variable into (0,1) by infinite barriers at the borders. Eq. (4) is \na mixed-nonlinear objective involving both continuous and binary variables; our \nframework optimizes vectors j, lh and lV simultaneously at any given scale as con(cid:173)\ntinuous variables, instead of earlier two-step, alternate continuous/discrete-phase \napproaches [Terzopoulos, 1986]. \n\nWe have tested our method on gradually increasing objectives, from a \"small\" size \nof N=12,288 variables for a 64x64 image, up to a large size of N=786 ,432 variables \nfor a 512x512 image; the results seem to coincide with our theoretical expectations: \na significant reduction in computational cost was observed and consistent conver(cid:173)\ngence towards the optimum of the objective was found for various numbers of coarse \nscales and FoA sizes. The dimension of the objective at any scale I was chosen via \na power law: N(L-l+1)! L, where L is the total number of scales and N the size of \n\n3 The algorithm can also handle directions of negative curvature. \n4 This property becomes important in a neural net implementation. \n\n\f638 \n\nthe original objective. \n\nD. I. TSIOUTSIAS, E. MJOLSNESS \n\nThe effect of our multiscale optimization with and without a FoA is shown in Fig. 
2 \nfor the 128x128 and the 512x512 nets, where E( v*) is the best final configuration \nwith a one-level no-FoA net , and cumulative cost is an accumulated measure in the \nnumber of connection updates at each scale; a consistent scale-up in computational \nefficiency can be noted when L > 1, while the cost measure also reflects the relative \ntotal wall-clock times needed for convergence. Fig. 3 shows part of a comparative \nstudy we made for saliency measures alternative to eq. (1) (e.g. g~IE,il), in order \nto investigate the validity of eq. (1) as a predictor of l:!..E: \nthe more prominent \n\"linearity\" in the left scatterplot seems to justify our choice of saliency. \n\n104 . - -___ M-'S-'-/_A_T_N_e_t_s ,..,,(_12_8_t2-,)_: _L_=--,1 ,_2'-,3 ___ ---, \n\n10' \n\n10' \n\n10' \n\n10-' \n\n10\" \n\n10-110 \n\n2 \n\nNl \n\nMS/ AT Nets (512t2) : L=1,2,3,4 \n\n#1 \n\n10' \n\n10' \n\n10' \n\n10' \n\n10' \n\n~ 10 l \n\n~ '\" \nI 10' \n>' \ng10 - 1 \n\n10-' \n\n10\" \n\n10-4 \n\n10-' \n\n2000 \n\n10-' 0 \n\n60000 \n\nFigure 2: Multiscale Optimization (curves labeled by number of scales used): #(cid:173)\nnumbered curves correspond to nets without a FoA , simply-numbered ones to nets \nwith a FoA used at all scales. The lowest costs result from the combined use of \nmultiscale optimization and FoA. \n\n4 CONCLUSION \n\nWe have presented a framework for the optimization of large-scale objective func(cid:173)\ntions using neural networks that incorporate a multiscale attentional mechanism. \nOur method allows for a continuous adaptation of the system resources to the com(cid:173)\nputational requirements of the relaxation problem through the combined use of \nseveral techniques. The framework was applied to a 2-D image segmentation ob(cid:173)\njective with discontinuities; formulations of this problem with tens to hundreds of \nthousands of variables were then successfully solved. 
\n\nAcknow ledgements \n\nThis work was supported partly by AFOSR-F49620-92-J-0465 and the Yale Center \nof Theoretical and Applied Neuroscience. \n\n\fA Multiscale Attentional Framework for Relaxation Neural Networks \n\n639 \n\n10' (128t2) : Focus on 1st level - proposed saliency \n\n10' (128t2) : Focus on 1st level - absolute gradient \n\n10' \n\n~ 10\u00b0 \no o \n:0 \n~ 10-' \n\n.!! .. \n8-\n,.,10-' \no c \n. !! \nOJ \n11l 10-3 \n\n\" .. \n\n~ \n!10- 4 \n\n.. \n\n8 \n\n.. \n\n8 \n\n00 \n00 \n\",00 \n\n0 \n\no \no \n\no \n\n0 \n\no \no \n\no \no \n\n10' \n\n,. \n:0 .. o \n\no o \n\n: 10' \n\" 0. \n~ c \n\n~10-1 \n\n.!! .. \n~ \n~ \n\n0 \n0 \n\n3 \n\n0 \n\n0 \n\n0 \n0 \n0 \n\n8 o 8 \n~0o; \n\n.. \n\n/ .0 0 \n\"1:,00 0 \n\n0\" \n\n. . 0 \n\n10-' \n\n10-~0~-.-'-'-'u.tl~Oo.,-- J....Ll.J\"!'1*=0-.-'-'-~1 O:=.-.l.....L..Ll.';t'!loO r-'-~.tO!:.-r-u~1~0-:r'-'~100 \n\n\" I \n\n10-~0b.--'\"-U.~I~\"ol_:r'-'.w.m~I\"O~I _r-u.li;~lo.L_:r'-'-,-\"~uI\"O~I_ \u2022 .-'-'-l.l..lLU~lulo,,,\"_ \u2022 .l.....L..Lu.;I\"~ol-cr'-'-~1 00 \n\n(Average Della-E per block) \n\n(Average Della-E per block) \n\nFigure 3: Saliency Comparison: (left), saliency as in eq. (1); (right), the absolute \ngradient was used instead. \n\nReferences \n\nA. Conn, N. Gould, A. Sartanaer, & Ph . Toint. (1993) Global Convergence of a \nClass of Trust Region Algorithms for Optimization Using Inexact Projections on \nConvex Constraints. SIAM J. of Optimization, 3(1) :164-221. \n\nR. Durbin & D. Willshaw. (1987) An Analogue Approach to the TSP Problem \nUsing an Elastic Net Method. Nature , 326:689-691. \n\nJ. Hopfield & D. W. Tank. (1985) Neural Computation of Decisions in Optimization \nProblems. Bioi. Cybernei., 52:141-152. \n\nC. Koch , J. Marroquin & A. Yuille. (1986) Analog 'Neuronal' Networks in Early \nVision. Proc. of the National Academy of Sciences USA, 83:4263-4267. \n\nC . Koch, & S. Ullman. (1985) Shifts in Selective Visual Attention : Towards the \nUnderlying Neural Circuitry. 
Human Neurobiology, 4:219-227. \n\nE. Mjolsness, C. Garrett & W. Miranker. (1991) Multiscale Optimization in Neural Nets. IEEE Trans. on Neural Networks, 2(2):263-274. \n\nE. Mjolsness & W. Miranker. (1993) Greedy Lagrangians for Neural Networks: Three Levels of Optimization in Relaxation Dynamics. YALEU/DCS/TR-945. (URL file://cs.ucsd.edu/pub/emj/papers/yale-TR-945.ps.Z) \n\nB. Olshausen, C. Anderson & D. Van Essen. (1993) A Neurobiological Model of Visual Attention and Invariant Pattern Recognition Based on Dynamic Routing of Information. The Journal of Neuroscience, 13(11):4700-4719. \n\nD. Terzopoulos. (1986) Regularization of Inverse Visual Problems Involving Discontinuities. IEEE Trans. PAMI, 8:419-429. \n\nD. I. Tsioutsias & E. Mjolsness. (1995) Global Optimization in Neural Nets: A Novel Relaxation Framework. To appear as a UCSD-CSE-TR, Dec. 1995. \n\n\f", "award": [], "sourceid": 1022, "authors": [{"given_name": "Dimitris", "family_name": "Tsioutsias", "institution": null}, {"given_name": "Eric", "family_name": "Mjolsness", "institution": null}]}