{"title": "Fixing Implicit Derivatives: Trust-Region Based Learning of Continuous Energy Functions", "book": "Advances in Neural Information Processing Systems", "page_first": 1476, "page_last": 1486, "abstract": "We present a new technique for the learning of continuous energy functions that\nwe refer to as Wibergian Learning. One common approach to inverse problems\nis to cast them as an energy minimisation problem, where the minimum cost\nsolution found is used as an estimator of hidden parameters. Our new approach\nformally characterises the dependency between weights that control the shape of\nthe energy function, and the location of minima, by describing minima as fixed\npoints of optimisation methods. This allows for the use of gradient-based end-to-\nend training to integrate deep-learning and the classical inverse problem methods.\nWe show how our approach can be applied to obtain state-of-the-art results in the\ndiverse applications of tracker fusion and multiview 3D reconstruction.", "full_text": "Fixing Implicit Derivatives: Trust-Region Based\n\nLearning of Continuous Energy Functions\n\nMatteo Toso\n\nCVSSP,\n\nUniversity of Surrey\n\nNeill D. F. Campbell\nUniversity of Bath\n\nChris Russell\n\nCVSSP, University of Surrey\nand The Alan Turing Institute\n\nAbstract\n\nWe present a new technique for the learning of continuous energy functions that\nwe refer to as Wibergian Learning. One common approach to inverse problems\nis to cast them as an energy minimisation problem, where the minimum cost\nsolution found is used as an estimator of hidden parameters. Our new approach\nformally characterises the dependency between weights that control the shape of\nthe energy function, and the location of minima, by describing minima as \ufb01xed\npoints of optimisation methods. 
This allows gradient-based end-to-end training to integrate deep learning and classical inverse problem methods. We show how our approach can be applied to obtain state-of-the-art results in the diverse applications of tracker fusion and multiview 3D reconstruction.

1 Introduction

Although deep networks are now ubiquitous in machine learning and computer vision, prior to this there was a strong and established literature on inverse problems that made use of geometric and probabilistic models to reason about the world. These models remain much easier for humans to interpret than deep learning approaches, as they perform a small number of complex operations, while deep learning involves many simple operations. This is further aided by the geometric nature of many of the models, which makes them easier to visualise.
Continuous inverse problems - like SfM, tomography, and EEG signal analysis - often have a strong theoretical basis, being built upon explicit Bayesian models of physical phenomena. These methods have strong and robust performance, thanks to decades of refinement, experience, and hand-crafted priors. However, they have fallen out of favour relative to deep-learning methods that allow for end-to-end fitting of parameters, tailoring the model to directly maximise performance over a specified loss.

Coupling parameters and solutions: Arguably the primary factor in their decline is that inverse problems are much harder to train, owing to the numerical optimisation that lies at the core of their use. Internally, the solution to an inverse problem is generated by fitting a model to a single instance of data as part of a search for a Maximum a Posteriori estimate.
This step creates an extra layer of indirection between the solution found and the parameters that define the probabilistic model. Without a direct coupling between the solution and the parameters, it is not possible to perform end-to-end learning, and these classical methods have been relegated to ad hoc post-processing of deep learning results. By formally characterising the dependency between these parameters and the resulting MAP estimates, we show how end-to-end training can integrate deep learning and classical inverse problem methods, giving rise to high-performing and interpretable models.

End-to-end learning with inverse problems: This work shows how to integrate continuous energy minimisation in a general learning framework, combining models developed over decades with deep learning methods, and jointly training them end-to-end. We provide a straightforward derivation for

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Function Forward:
    return y* = arg min_y E(y ; w);
end
Function Backwards:
    return -(H_{y*}(w) + λI)^{-1} d∇_{y*}(w)/dw;
end

Figure 1: Predicting how the minimum moves with the parameters. Given a smooth function E(y ; w) of y, parameterised by w (purple curve), we can locally approximate it by a quadratic function (cyan curve) about a minimum y*. The quadratic approximation has an analytic minimum; by predicting how it varies with small changes in the parameters, we predict how the location of the minimum of E changes (this step is equivalent to performing an iteration of a trust-region method). This gives rise to the forward/backwards components.
λ is fixed to 0.1.

the derivatives of minima of smooth energy functions with respect to a set of general parameters that control the shape of the energy function.
To this end, we formulate the location of a local minimum as a function of the parameters controlling the energy function. This is reminiscent of Wiberg optimisation, in which the variables are partitioned into two sets, with one set replaced by an analytic formula for its minimum. As such, we refer to our novel technique as Wibergian Learning.

Our contribution: We propose the first stable method that allows us to perform end-to-end discriminative learning for the location of minima of arbitrary continuous energy functions. We do this by characterising the minimum y* as a local function of the controlling weights w of the energy function. From this we derive dy*/dw, which allows us to train the energy function discriminatively with the standard stochastic gradient descent methods used in deep learning.
We test our approach on two standard computer vision problems that have previously been approached as energy optimisation tasks. The first is tracker fusion, formulated as the minimisation of a robust non-convex loss, and the second is human pose estimation, approached as a problem of non-rigid structure from motion using a mixture of pre-trained bases.
Samuel et al. [1] took a similar approach to ours to train continuous Markov Random Field models. However, for non-quadratic functions their method suffers from exploding gradients, and does not converge if the determinant of the Hessian tends to zero about local minima. To avoid this instability, we make two contributions. We first offer a novel derivation with the same result as [1], but using fixed points of the Newton-step algorithm rather than implicit functions.
This then allows us to replace the Newton-step algorithm with a more stable trust-region algorithm, which preserves the same fixed points where the Newton step converges, but is guaranteed to converge on a wider range of optimisation problems. Moreover, this modification bounds the size of the gradient update, explicitly preventing exploding gradients from being caused by a single component.

2 Related Work

The Wiberg algorithm for matrix factorisation [2, 3] was a principal inspiration for our work. This efficient optimisation technique converges to good solutions even with random initialisation. The idea underlying [2] is that, by conditioning on a subset of parameters, we can make some nonlinear least squares problems linear with respect to the remaining parameters. These parameters then admit a closed-form solution. By replacing the linear parameters with their analytical solution, we arrive at an optimisation problem with fewer parameters and more informative gradients and second derivatives. For an extensive evaluation of Wiberg-like approaches see [4], or [5] for an extension of the approach to arbitrary functions and nested variants.

Structured learning [6, 7, 8] (for an exhaustive review see [9]) concerns itself with learning the optimal energy functions to maximise a predefined criterion over the training set. While we deal with variables y with continuous states, structured learning considers a y that takes one of a (typically exponentially large) set of discrete states, and learns a function that ideally separates the cost of a particular pre-given labelling from all other possible labellings. As such, unlike our method, structured learning requires first discretising the problem space.
Meta-learning also combines two optimisation processes: a learner that improves on a given task, and a meta-learner that optimises the learner.
Popular applications include (i) "learning optimisers", using meta-learning to determine: the learner's update rules [10]; the priors and regularisers of a non-linear least squares optimiser [11]; the learning rate of a neural network [12]; or how to reduce overfitting [13]. (ii) "few-shot learning", e.g. [14, 15], and (iii) "hyper-parameter optimisation", which has been characterised as meta-learning for a single task [16]. Many works in (iii) have focused on efficient characterisation of the inner learner. [17] suggested reconstructing the learner's optimisation trajectory rather than storing it; [18] focused on a quadratic training criterion and showed that Cholesky decomposition can be used to compute the gradient with respect to hyper-parameters. Finally, [16] characterised the inner optimisation procedure as a dynamical system. In contrast, our approach and [1] directly describe the behaviour of a function's local minimum.
Finally, others have developed methods to integrate optimisation processes and neural networks by treating the optimisation process as custom nodes within the network: [19] formulated quadratic program solvers as a single, differentiable component, while [20] generalised back-propagation to matrix operations, producing custom components for structured matrix computation, including second-order pooling. [21] propagated gradients through an existing least squares optimisation, but did not attempt to modify the optimisation itself.

3 Formulation

Let E(y ; w, x) be an energy, corresponding to the negative logarithm of an (unnormalised) probability function, defined over variables y and parameterised by weights w and pre-existing data x.
It is traditional to perform optimisation to find the Maximum a Posteriori (MAP) solution, defined as

    y* = arg min_y E(y ; w, x).    (1)

The framework is based on three assumptions: (i) the true distribution that generated the data is actually contained in the family of models described by the parameterised energy; (ii) the optimal parameters w corresponding to that generating distribution can be determined; and (iii) the MAP solution coincides with the desired output, for example the one that minimises an expected loss. In practice, none of these assumptions is guaranteed to hold [22].

Decoupling assumptions: To make progress, we decouple these assumptions, as is common in discriminative approaches. We treat E(y ; w, x) as an arbitrary cost function, instead of the log-probability of a distribution, and choose w so that y* is optimal with respect to some pre-existing loss function ℓ(·) defined over the empirical distribution of training data. That is, we seek w such that

    arg min_w Σ_{x∈X} ℓ(y*(w); x) : y*(w) = arg min_y E(y ; w, x).    (2)

What makes this challenging is the decoupling of the losses on the two sides of the equation; the ideal value of y* that minimises the loss ℓ(·) will not be the minimiser of the energy E(·).

Local reparameterisation: In light of this, we propose a novel local reparameterisation of y* as a function of w, i.e. y*(w), and show that this allows us to compute dy*/dw, enabling the efficient learning of w using standard methods for stochastic gradient descent, as part of an end-to-end learning framework.
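To make the coupling in (2) concrete, here is a minimal 1-D sketch (a toy energy and toy data of our own, not one of the paper's models): the inner arg min is solved numerically, and the outer loss is evaluated at the resulting minimiser.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Toy instance of the bilevel structure in (2); the energy and data are
# hypothetical, chosen only to make the inner/outer coupling concrete.
def energy(y, w, x):
    # E(y; w, x): robust data term whose kernel width is the weight w
    return -np.exp(-(x - y) ** 2 / w ** 2).sum()

def y_star(w, x):
    # inner problem: seed from the lowest-energy sample, then refine locally
    y0 = x[np.argmin([energy(xi, w, x) for xi in x])]
    return minimize_scalar(lambda y: energy(y, w, x),
                           bounds=(y0 - w, y0 + w), method="bounded").x

def outer_loss(w, x, y_true):
    # outer problem: the training loss is evaluated at the minimiser y*(w)
    return abs(y_star(w, x) - y_true)

x = np.array([-1.0, 0.2, 0.5, 9.0])   # three clustered samples, one outlier
print(outer_loss(2.0, x, 0.0))        # loss induced by the choice w = 2.0
```

Changing w reshapes the energy and therefore moves y*(w); the sections below derive how that movement can be differentiated.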
The key insight of our approach is that, if E(·) is sufficiently smooth and well behaved, the change in the solution y*(w) → y*(w′) caused by a small perturbation w → w′ is well approximated by a single step in y of either Newton's method, or a more robust alternative, on the new function E(y ; w′, x), starting from the current solution y*.
We begin by presenting our result in the general case, before discussing its application in end-to-end learning and detailing efficient solutions for two important special cases in the supp. materials.

The General Case  Given a local minimum y* of the following equation

    y*(w) = arg min_y E(y ; w),    (3)

we are interested in characterising how the location of the local minimum varies with changes in w. We drop the dependency on x for clarity of notation. We present an informal derivation assuming that the procedure works, and leave the formal derivation to the supplementary material. We assume that the local neighbourhood about y*(w) is strongly convex, and that Newton's method, given y*(w) as an input, will converge in a single iteration.

Newton update w.r.t. y: Considering the second-order Taylor expansion of the energy around y, we have

    E(y′ ; w) ≈ E(y ; w) + (y′ − y)ᵀ ∇E(y ; w) + ½ (y′ − y)ᵀ [H_E(y ; w)] (y′ − y),    (4)

where ∇E(y ; w) denotes the Jacobian and H_E(y ; w) the Hessian of E(·) with respect to y, evaluated at y. If we are sufficiently close to a minimum of E(·), the expansion models the function well, and the minima of the two coincide.
This leads to Newton's update rule

    arg min_{y′} E(y′ ; w) ≈ y − [H_E(y ; w)]⁻¹ ∇E(y ; w).    (5)

If we evaluate this at the minimum y = y*(w), we have

    y*(w) = y*(w) − [H_E(y*(w) ; w)]⁻¹ ∇E(y*(w) ; w),    (6)

with ∇E(y*(w) ; w) = 0 at optimality.

Updating parameters w: We then ask, "What would a single iteration of Newton's method do, if the parameters w are updated?" The answer is that, for sufficiently small updates of w, y*(w) remains in the strongly convex region about the new minimum, and one iteration of Newton's method moves y*(w) directly towards the new minimum y*(w′). Writing w′ = w + Δ, then as Δ → 0 only a single iteration of Newton's method is needed to get arbitrarily close to the new minimum. That is,

    y*(w + Δ) ≈ y*(w) − [H_E(y*(w) ; w + Δ)]⁻¹ ∇E(y*(w) ; w + Δ).    (7)

Writing H_{y*}(w) as shorthand for the Hessian of E(y ; w) w.r.t. y at the fixed location y*, i.e. H_E(y* ; w), and ∇_{y*}(w) as shorthand for the Jacobian of E(y ; w) with respect to y at y*, i.e. ∇E(y* ; w), we rearrange and normalise the above equation to get

    (y*(w + Δ) − y*(w)) / Δ ≈ −(H⁻¹_{y*}(w + Δ) ∇_{y*}(w + Δ) − H⁻¹_{y*}(w) ∇_{y*}(w)) / Δ = −H⁻¹_{y*}(w + Δ) ∇_{y*}(w + Δ) / Δ,    (8)

with the final expression following from the fact that ∇E(y*(w); w) = 0.
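This limiting relation can be sanity-checked numerically. The sketch below uses a hypothetical 1-D energy of our own choosing, E(y; w) = cosh(y − w) + y²/2 (smooth and strongly convex in y), and compares the finite-difference movement of the minimum with the prediction −H⁻¹_{y*}(w) ∂∇_{y*}(w)/∂w.

```python
import numpy as np
from scipy.optimize import brentq

# Toy 1-D energy E(y; w) = cosh(y - w) + y**2 / 2 (not a model from the paper)
def grad_y(y, w):    return np.sinh(y - w) + y      # gradient of E w.r.t. y
def hess_y(y, w):    return np.cosh(y - w) + 1.0    # Hessian H_{y*}(w) > 0
def dgrad_dw(y, w):  return -np.cosh(y - w)         # mixed derivative d(grad)/dw

def y_star(w):
    # the minimiser solves grad_y(y; w) = 0; grad_y is monotone in y,
    # so the root is unique and bracketed by a wide interval
    return brentq(lambda y: grad_y(y, w), -20.0, 20.0)

w, dw = 0.7, 1e-5
finite_diff = (y_star(w + dw) - y_star(w - dw)) / (2 * dw)  # movement of the minimum
ys = y_star(w)
analytic = -dgrad_dw(ys, w) / hess_y(ys, w)                 # -H^{-1} d(grad)/dw
print(finite_diff, analytic)  # the two agree to high precision
```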
Proof that this expression converges as Δ → 0 is given in the supplementary materials.
In the limit Δ → 0, and again using ∇_{y*}(w) = 0 by definition, this allows us to derive

    dy*/dw = −d(H⁻¹_{y*}(w) ∇_{y*}(w))/dw = −(∂H⁻¹_{y*}(w)/∂w) ∇_{y*}(w) − H⁻¹_{y*}(w) ∂∇_{y*}(w)/∂w = −H⁻¹_{y*}(w) ∂∇_{y*}(w)/∂w.    (9)

Here we use the partial derivative to emphasise that the value of y* comes from a previous iteration of Newton's method, and that it does not vary with w.

Trust-Region Based Robustness

The derivation up to this point gives the same update step as Samuel et al. [1]. However, it is immediately apparent that if the function descends sharply into a flat region about a local minimum, and is better approximated by a higher-order function than by a quadratic one, the Hessian may tend to zero around the minimum, and H⁻¹_{y*}(w) is ill-defined. Even if H_{y*}(w) is non-zero, it may become arbitrarily small, leading to exploding gradients. To avoid such issues, we turn to the trust-region method of [23], which replaces H_{y*}(w) in the update step of Newton's method with

[Figure 2 panels, left to right: the sum of RBFs learnt with [1]; the sum of RBFs learnt with ours; the evolution of the parameter σ² for the two methods.]
Figure 2: Learning a RanSaC-like function. A simple illustration of learning the kernel width σ² on 1-D RBF functions, so that the minimal-cost solution corresponds to the mean of a dense set of 10 inliers (blue samples in the first two graphs) and discards 100 outliers (red samples).
The green cross indicates the function minima corresponding to the estimate found by each method.

H_{y*}(w) + λI for some λ ≥ 0, where I is the identity matrix. As shown by [23], this is equivalent to minimising the same quadratic problem at each iteration of Newton's method, subject to additional constraints on the maximal size of the step, and it inherits the same quadratic convergence in the local neighbourhood of a strongly convex minimum, with stronger convergence properties elsewhere. Use of the trust region instead of Newton's method gives rise to the damped gradient prediction

    dy*/dw |_(d) = −(H_{y*}(w) + λI)⁻¹ ∂∇_{y*}(w)/∂w.    (10)

Using the argument of [23], this value can be interpreted as a damped variant of the original gradient in the following sense: while dy*/dw is the solution to the quadratic programme arg min_y ||H_{y*}(w) y − ∂∇_{y*}(w)/∂w||²₂, dy*/dw |_(d) is the solution to the same programme subject to the requirement that ||y||²₂ ≤ k for some k.
Compared to the undamped formulation of (9), trust-region methods converge to a true minimum for a strictly larger class of functions, making the new approach directly applicable to a wider range of problems. Moreover, we can bound the magnitude of the damped gradient directly, in all circumstances. As H_{y*}(w) is positive semi-definite and λI positive definite, we have ||dy*/dw |_(d)|| ≤ λ⁻¹ ||∂∇_{y*}(w)/∂w||, and exploding gradients can no longer be created by a single layer.
This is not just a convenience: in section 4, we demonstrate a problem that fails to converge using [1], but gives state-of-the-art results using our formulation. In practice no tuning or online adaptation of the value of λ is needed, and we simply set it to 0.1.
If the energy is quadratic, trust-region analysis is unneeded; however, [1] may still fail to converge for ill-posed quadratic problems. Analysis showing closed-form updates, guaranteed to converge, for well- and ill-posed quadratic energies is in the supplementary materials.

Using the Derivative in Learning  The derivatives we have specified are quite general and, importantly, they make no assumptions about the energy minimisation technique used to obtain the optimum. In practice, approximate second-order approaches such as L-BFGS [24, 25] converge to a neighbourhood about the minimum, but the solution found does not satisfy the fixed-point equation (6). In this case, a single step of (trust-region) Newton's method is required for the numeric gradients and the analytic solution found to coincide. This instability primarily affects the numeric derivatives of the analytic solution described in equation (9). The right-hand term, set to 0 a priori, describes the drift in Newton's method due to not starting at an exact local minimum.
Given knowledge of how to compute the gradients, energy minimisation can be treated as a component of any end-to-end training network that makes use of stochastic subgradient descent, and integrated directly. We exploit this in both our examples. In tracking, we have simple stub functions that weight our confidence in particular bounding boxes based on how their scale has changed from the first frame. In 3D reconstruction, we take a weighted average of multiple reconstructions and jointly learn these weights and the ideal energy function.

Figure 3: Learning a better energy function for tracker fusion. Left: Input image.
Centre Left: Overlay of 72 tracker boxes (blue), ground-truth detection (red) and our prediction (yellow). Centre Right: The initial energy used to predict the top left corner of the box locations, before training. Right: The energy used to predict the same corner of the box locations after training (red continues to indicate ground truth). Our learning mechanism naturally adapts the form of the energy so that it contains fewer poor local minima that the optimisation could stick in, and directs the local minima to a better location.

RanSaC as an Illustrative Example  We demonstrate our approach on a simple 1-dimensional example. We consider the problem of estimating the mean of a set of 10 inliers sampled from a normal distribution N(U[−40, 40], 4²) in the presence of 100 outliers drawn from a broad uniform distribution U[−40, 40]. This can be formulated as an MLESaC [26] type optimisation, where the mean is estimated by minimising a one-dimensional sum of RBF functions centred on the samples, i.e. µ̂ = arg min_t E(t, x; σ) = Σ_i −exp(−(x_i − t)²/σ²). We compare our approach and that of [1] to find the optimal value of σ, minimising the squared error between the estimated mean and its true value. The large amount of volatility in this problem means that adaptive gradient methods [27, 28] stop early and do not converge to optimal solutions. Instead we use stochastic gradient descent with a strong momentum value of 0.999 to damp the oscillations. For any choice of step size and momentum, with probability 1 we will eventually draw a set of points that have sufficiently small curvature about the minimum, causing an arbitrarily large step for the undamped update of [1], while our update remains bounded.
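This experiment can be sketched compactly as follows. The inlier mean of 5.0 is one hypothetical draw from U[−40, 40], the derivatives are taken by finite differences for brevity, and λ = 0.1 as above; the sketch computes the damped sensitivity (10) of the estimated mean to σ.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Sketch of the 1-D RanSaC-style experiment; the inlier mean (5.0) is a
# hypothetical draw from U[-40, 40], and lam follows the paper's λ = 0.1.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(5.0, 4.0, size=10),        # 10 inliers
                    rng.uniform(-40.0, 40.0, size=100)])  # 100 outliers

def E(t, sigma):
    # sum-of-RBF energy centred on the samples
    return -np.exp(-(x - t) ** 2 / sigma ** 2).sum()

def t_star(sigma):
    # lowest-energy sample as initialisation, then local refinement
    t0 = x[np.argmin([E(xi, sigma) for xi in x])]
    return minimize_scalar(lambda t: E(t, sigma),
                           bounds=(t0 - sigma, t0 + sigma), method="bounded").x

sigma, lam, h = 4.0, 0.1, 1e-3
t = t_star(sigma)
# curvature and mixed derivative by central finite differences
H = (E(t + h, sigma) - 2 * E(t, sigma) + E(t - h, sigma)) / h ** 2
mixed = (E(t + h, sigma + h) - E(t - h, sigma + h)
         - E(t + h, sigma - h) + E(t - h, sigma - h)) / (4 * h * h)
dt_dsigma = -mixed / (H + lam)   # damped update of (10); bounded even if H ~ 0
print(t, dt_dsigma)
```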
This behaviour can be seen in figure 2.

4 Tracker Fusion

We demonstrate how our approach can be used to train a model to fuse existing candidate trackers. This demonstration leverages the comprehensive evaluation work performed by the Visual Object Tracking challenge team [29]. Alongside annotated ground-truth results, the VOT team have archived the tracking results of 72 entries to their competition, and we take these results as initial candidates for tracking. The VOT challenge has been set up to evaluate several different types of tracking, both long- and short-term, and with and without reinitialisation of the tracker after failure. To demonstrate the merits of our approach, we focus on short-term tracking performed without reinitialisation of the tracker. Our justification is that (i) short-term tracking is a more established benchmark with more trackers available, and (ii) the unsupervised case, where no reinitialisation is performed, makes it easier to work with existing trackers. We take the output of all trackers run once through the sequence without intervention, and learn to fuse the tracker results on any given frame. Our task is to predict the four corners of a bounding box, given a set of candidate locations from the other trackers.

Table 1: The VOT2018 challenge, showing the best 5 of the 72 entries on different test sets. We evaluate on two data partitions: one where individual frames are randomly assigned to training and test, and another where entire sequences are assigned to each set.
* = did not converge.

Frames Assigned at Random              Sequences Assigned at Random
Tracker             IoU                Tracker             IoU
DLSTpp              0.530              SA_Siam_R           0.4643
FSAN                0.490              MBSiam              0.4624
SiamRPN [30]        0.484              SiamRPN [30]        0.4621
LSART [31]          0.472              FSAN                0.4618
R_MCPF              0.465              LADCF [32]          0.4601
Mean Fusion         0.238              Mean Fusion         0.2455
Median Fusion       0.428              Median Fusion       0.4458
Samuel et al. [1]   N/A *              Samuel et al. [1]   N/A *
Our Fusion          0.565              Our Fusion          0.4960

To aid visualisation of the learning, we split the energy function describing the corners into two independent components, one used to predict the top left corner of the box (see Figure 3) and the other to predict the bottom right corner. Each component is modelled as a sum of 72 Radial Basis Functions (RBFs), one for each prediction given by an existing tracker. The standard deviation of each RBF, along with temporal importance weights, is learnt for each tracker, allowing us to discount trackers that are more prone to drift later in the sequence, along with scale factors that correct for trackers that consistently underestimate the size of the bounding box. These scale factors account for much of the movement of minima in figure 3. The full formulation is given in the supp. materials.
We take the centre of each RBF as a candidate solution, and use the lowest-energy candidate as an initialisation for continuous optimisation of the objective using L-BFGS [24, 25]. Importantly, this optimisation is not guaranteed to find the global optimum, and our learning framework does not require it to. We simply require that, given a similar characterisation of the loss, the solution found is a local minimum that remains stable in expectation¹, and this will drive the solution we find towards something that minimises our training loss.
We take the training loss as the ℓ1 difference between the found minima and the ground-truth solution.
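For a single box corner, the fusion energy and its optimisation can be sketched as below. The candidate corners, widths, and weights are synthetic stand-ins for the 72 tracker outputs and the learnt quantities, and the temporal and scale terms described above are omitted.

```python
import numpy as np
from scipy.optimize import minimize

# Sketch of the fusion energy for one box corner; all values are synthetic
# stand-ins, not the learnt parameters of the paper's model.
rng = np.random.default_rng(1)
corners = rng.normal([100.0, 50.0], 5.0, size=(72, 2))  # candidate corners
sigma   = np.full(72, 10.0)                             # per-tracker RBF widths
weight  = np.ones(72)                                   # per-tracker importances

def E(p):
    # sum of 72 RBFs, one centred on each tracker's candidate corner
    d2 = ((corners - p) ** 2).sum(axis=1)
    return -(weight * np.exp(-d2 / sigma ** 2)).sum()

# lowest-energy candidate as initialisation, then L-BFGS refinement
p0 = corners[np.argmin([E(c) for c in corners])]
p_star = minimize(E, p0, method="L-BFGS-B").x
print(p_star)
```

During learning, the gradient of the training loss at p_star would be pulled back onto sigma and weight through the damped update (10).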
In practice, the method of [1] frequently failed to converge on this problem. Although this could sometimes be avoided by a mixture of early stopping and careful tuning of step sizes, this hurt the quality of the solutions found and still intermittently failed.

Results  Our fused tracker shows state-of-the-art results on the highly competitive VOT2018 challenge. For evaluation purposes we divide the VOT2018 dataset roughly into two-thirds training and one-third test by number of frames, so that no sequence occurs in both partitions, training on the first partition and reporting loss on the second; we also report on "missing at random" frames, where frames from the same sequence occur in both training and test. Mean and median fusion are baseline "wisdom of the crowd" methods, showing the effectiveness of predicting using either the mean or median prediction over the 72 methods for the bounding box corners. In Table 1 we report the intersection-over-union measure both for our approach and for the top five methods from the VOT2018 challenge on the same partitions. For the newer and uncited methods, please see the appendix of [29] for further details. Results differ from [29] due to the training and test partitions.

5 Human Pose Estimation

We further demonstrate our approach on 3D human pose estimation from 2D detections. We consider both the monocular model by Tome et al. [21] and its multi-camera extension [33], and use our learning method to improve on their hand-tuned energy functions.
We emphasise that we do not alter the optimisation used, but instead alter the shape of the energy function so that its minimum is closer to the ground truth.

¹In theory, this allows for the use of random restarts and non-deterministic optimisation in the energy minimisation step.

Figure 4: Reconstructed poses from our three multi-camera approaches (red) vs ground truth (blue).

The Model  The method of [21] was based on classical basis-shape approaches to Non-Rigid Structure from Motion [34], and found the reconstructed 3D pose P by solving the least squares problem:

    a*, R* = arg min_{a,R} ‖X − s Π K R [µ + a · e]‖²₂ + ‖σ · a‖²₂,    (11)
    P_i(R*, a*) = R* [µ_i + a* · e_i],    (12)

where R is an in-plane rotation, a the vector of basis coefficients, X an input 2D pose, µ a rest shape, e a tensor of basis vectors, Π the orthographic projection matrix, K a known external camera calibration matrix, and s the estimated per-frame scale.
The authors also extended this approach to a mixture of models [µ_i, e_i, σ_i], selecting the model with the minimal reconstruction error as the true reconstruction. The approach was further extended to a multi-camera setting that also replaced the search over all rotations with a weighted average [33].

Learning Parameters  The optimisation process solves a well-posed quadratic function, allowing direct computation of the gradient of a. Each reconstruction can be written as a direct smooth function of a_i, making the full process differentiable and allowing end-to-end training.
We replaced the Huber loss used in [33] with its smooth variant to make its transition from ℓ2 to ℓ1 optimisable and, following the argument in [33], replaced the choice of best-fitting model with a weighted average.
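Once the rotation R is fixed, (11) is linear in the coefficients a, so the inner solve reduces to ridge regression. The sketch below uses illustrative shapes and a random design matrix; in the real model the columns of A would be built from s, Π, K, R and the basis e, and b from the residual against the rest shape µ.

```python
import numpy as np

# Ridge-regression closed form for the inner problem of (11), with a fixed
# rotation; shapes and data here are illustrative, not the paper's model.
rng = np.random.default_rng(2)
J, K = 17 * 2, 10                     # 2-D measurements of 17 joints, K basis shapes
A = rng.normal(size=(J, K))           # stand-in for the projected basis vectors
b = rng.normal(size=J)                # stand-in for the residual against the rest shape
sigma = np.full(K, 0.5)               # per-coefficient prior weights

# a* = argmin_a ||A a - b||^2 + ||sigma * a||^2, via the normal equations
a_star = np.linalg.solve(A.T @ A + np.diag(sigma ** 2), A.T @ b)
print(a_star.shape)
```

The Hessian of this quadratic, AᵀA + diag(σ²), is positive definite by construction, which is why the quadratic case needs no trust-region damping.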
Replacing the model choice with a weighted average was done to speed up training, and would not be required: choosing the best model and updating it induces a valid subgradient.
Where possible, parameters were initialised to the previously used values. New terms that multiply or are added to an existing function were respectively set to 1 or 0. The model parameters were then trained and tested using the stacked hourglass detections of Martinez et al. [35]. This change to more reliable detectors, but with a different distribution of errors, leads to the large increase in reconstruction error between [21, 33] and our baselines.

Results  We evaluate our approach on the Human3.6M dataset [40], over the standard training and testing sets, for both monocular and multiview reconstruction. Results can be seen in Table 2. In monocular reconstruction, we show a 12mm reduction in error vs. [21], whose pose model was generated with probabilistic PCA and whose remaining weights were hand-tuned. However, performance remains worse than methods such as [35].
We have greater success in the multiview case. Training the human pose model and camera weights reduces the error by 34mm with respect to the baseline, and by jointly optimising these parameters alongside those that were originally found by grid-search (see Table 3) we obtain state-of-the-art results. This corresponds to a 6mm reduction in error over the hand-tuned approach [33]. The closest competitor [38] made use of a personalised model of 2D joint appearance, while we learn a better model for fitting 3D pose; as such, the two approaches are strongly complementary.

Computational efficiency: Our approach, being a gradient-based online method, is much more scalable to large datasets and to optimisation over large numbers of parameters than grid search. Given

Table 2: Average per-joint 3D reconstruction error on Human3.6M, expressed in mm.
Disc.\n73.1 88.4\n65.0 73.5 76.8 86.4 86.3 110.7 68.9 74.8 110.2 173.9\n51.8 56.2 58.1 59.0 69.5 78.4 55.2 58.1 74.0 94.6\n52.4 62.9\n100.8 104.1 103.1 108.4 109.2 122.3 98.1 130.7 136.6 192.1 102.0 108.1 113.6 98.4 104.9 114.2\n57.5 68.5 65.9 71.3 82.0 80.2 57.5 60.7 102.8 136.7\n64.4 76.1\n\n85.0 85.8\n62.3 59.1\n\n86.3 71.4\n65.1 49.5\n\n75.8 70.6\n\n74.4 58.2\n\nSit\n\nEat Greet Phone Photo Pose Purch.\n\nMonocular\nTome et al. [21]\nMartinez et al. [35]\nOurs - Baseline\nOurs\nMulticamera\nTrumble et al. [36]\n92.7 85.9 72.3 93.2 86.2 101.2 75.1 78.0 83.5 94.8\nZhou et al. [37]\n54.8 60.7 58.2 71.4 62.0 65.5 53.8 55.6 75.2 111.6\nPavlakos et al. [38] (a)\n41.2 49.2 42.8 43.4 55.6 46.9 40.3 63.7 97.6 119.9\nPavlakos et al. [38] (b)\n46.0 68.1 73.9\nN\u00fa\u00f1ez et al. [39]\n40.2 47.0 42.1 66.5 53.0 58.0 36.4 41.8 68.2 113.8\nTome et al. [33] - L2\n51.3 54.9 47.9 55.8 56.8 71.3 45.8 49.2 74.7 102.0\n43.3 49.6 42.0 48.8 51.1 64.3 40.3 43.3 66.0 95.2\nTome et al. [33] - Huber\nOurs - Baseline\n85.7 90.8 79.8 87.3 107.1 94.8 78.5 87.6 102.0 100.1\nOurs - L2, learn only \u00b5, e, \u03c3, Wc 50.9 55.0 49.3 53.7 70.3 53.9 48.5 50.0 63.8 71.9\n38.5 44.3 39.2 42.1 61.9 44.4 36.0 38.6 56.7 65.6\nOurs - L2\n38.2 42.2 39.5 39.1 57.2 45.2 34.1 39.1 57.8 68.0\nOurs - Huber\n\n85.8 82.0 114.6 94.9\n51.4 63.2\n64.2 66.1\n52.1 42.7\n51.9 41.8\n\n51.6 66.7\n56.2 62.2\n50.2 52.2\n95.2 85.1\n60.8 51.5\n50.6 41.0\n48.7 39.1\n\n47.2 35.6\n56.1 48.7\n51.1 43.9\n92.3 85.8\n57.6 57.7\n47.7 45.3\n46.7 40.5\n\n79.7 87.3\n55.3 64.9\n39.4 56.9\n47.8\n35.5 54.2\n54.0 59.4\n45.3 52.8\n87.4 91.8\n57.5 57.8\n46.6 47.7\n41.1 46.1\n\nTable 3: Parameters used in multi-view reconstruction. 
[21] and [33] optimised the parameters marked ‡ with grid-search.

Component      Body Models  Reprojection  Covariance  Huber ‡  Fusion ‡  Scaling ‡  Cameras
# Parameters   3978         2523          92          36       9         6          4

k parameters, N possible values for each and a training set evaluation time of t hours, grid search requires t · N^k hours. Generating the Huber-loss reconstruction over the entire multiview training set of Human3.6M takes 5 hours on a typical CPU. With N = 10 and the parameters of Table 3, the search would take 5 · 10^6648 hours, while our stochastic online approach took approximately 120 core hours to converge (i.e. faster than a grid search over even 2 parameters).

Limitations: Our approach inherits many of the advantages and disadvantages of gradient descent methods in neural nets. In particular, just as vanishing gradients and stuck neurons are a concern, it is possible for particular components of the energy to have too narrow a range to influence the location of minima; if this is the case, they will remain fixed. As such, it is important to use sensible initialisations and regularisers that ensure components have a broad initial range by default. These modifications typically decrease the curvature at local minima, making the use of our modified update step more important. In general, our modified update step matters most at the start of the learning process, while the final energy functions that our algorithm converges to tend to be better behaved.

6 Conclusion

We have presented a novel approach that allows the classical energy minimisation methods of inverse problems to benefit from the end-to-end training that has been a fundamental part of the success of deep-learning. 
By placing these energy minimisation techniques within the framework of end-to-end learning, we have opened the door to the integration of the two approaches, allowing us to learn things that cannot be easily expressed in either paradigm. For example, using the proposed technique, it is possible to use Siamese networks [41] to output confidence weights on their matches and, by integrating them with standard SfM techniques, to train directly to maximise the 3D accuracy of reconstructions. Equally, the routing of Capsule Networks [42] could be formulated as an energy minimisation problem (similar to the pictorial structure/spring model [43]) and jointly trained.
Our work provides a bridge that connects variational methods and inverse problems with deep learning; as such, it is particularly well suited to modelling complex interactions that are difficult to describe with standard convolutional networks.
Code is available at: https://github.com/MatteoT90/WibergianLearning

Acknowledgements: We acknowledge funding from the RCUK Centre for the Analysis of Motion, Entertainment Research and Applications (CAMERA, EP/M023281/1) and the Royal Society.

References

[1] Kegan G. G. Samuel and Marshall F. Tappen. Learning optimized MAP estimates in continuously-valued MRF models. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 477–484, 2009.

[2] T. Wiberg. Computation of principal components when data is missing. In Second Symp. Computational Statistics, pages 229–236, 1976.

[3] Takayuki Okatani and Koichiro Deguchi. On the Wiberg algorithm for matrix factorization in the presence of missing components. International Journal of Computer Vision, 72(3):329–337, May 2007.

[4] J. H. Hong and A. W. Fitzgibbon. Secrets of matrix factorization: Approximations, numerics, manifold optimization and random restarts. 
In IEEE International Conference on Computer Vision (ICCV), 2015.\n\n[5] Dennis Strelow. General and nested Wiberg minimization. In IEEE Conference on Computer Vision and\n\nPattern Recognition (CVPR), pages 1584\u20131591, 2012.\n\n[6] Ben Taskar, Carlos Guestrin, and Daphne Koller. Max-margin Markov networks. In Advances in Neural\n\nInformation Processing Systems, pages 25\u201332, 2004.\n\n[7] Ben Taskar, Vassil Chatalbashev, Daphne Koller, and Carlos Guestrin. Learning structured prediction\nmodels: A large margin approach. In Proceedings of the 22nd international conference on Machine\nlearning, pages 896\u2013903, 2005.\n\n[8] G\u00f6khan BakIr, Thomas Hofmann, Bernhard Sch\u00f6lkopf, Alexander J. Smola, Ben Taskar, and S.V.N.\n\nVishwanathan. Predicting structured data. MIT Press, 2007.\n\n[9] Sebastian Nowozin and Christoph H Lampert. Structured learning and prediction in computer vision.\n\nFoundations and Trends in Computer Graphics and Vision, 6(3-4):185\u2013365, 2011.\n\n[10] Marcin Andrychowicz, Misha Denil, Sergio Gomez Colmenarejo, Matthew W. Hoffman, David Pfau, Tom\nSchaul, and Nando de Freitas. Learning to learn by gradient descent by gradient descent. Advances in\nNeural Information Processing Systems 29, 2016.\n\n[11] Ronald Clark, Michael Bloesch, Jan Czarnowski, Stefan Leutenegger, and Andrew J. Davison. Learning\nto solve nonlinear least squares for monocular stereo. In The European Conference on Computer Vision\n(ECCV), pages 284\u2013299, 2018.\n\n[12] Christian Daniel, Jonathan Taylor, and Sebastian Nowozin. Learning step size controllers for robust neural\nnetwork training. In Proceedings of the Thirtieth AAAI Conference on Arti\ufb01cial Intelligence, AAAI\u201916,\npages 1519\u20131525. AAAI Press, 2016.\n\n[13] Simon Jenni and Paolo Favaro. Deep bilevel learning. In The European Conference on Computer Vision\n\n(ECCV), 2018.\n\n[14] Luca Bertinetto, Jo\u00e3o F. Henriques, Philip H. S. Torr, and Andrea Vedaldi. 
Meta-learning with differentiable closed-form solvers. In International Conference on Learning Representations, 2019.

[15] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning (ICML 2017), 2017.

[16] Luca Franceschi, Paolo Frasconi, Saverio Salzo, Riccardo Grazzi, and Massimiliano Pontil. Bilevel programming for hyperparameter optimization and meta-learning. In Proceedings of the 35th International Conference on Machine Learning (ICML 2018), 2018.

[17] Dougal Maclaurin, David Duvenaud, and Ryan Adams. Gradient-based hyperparameter optimization through reversible learning. In Proceedings of the 32nd International Conference on Machine Learning, volume 37, pages 2113–2122, 2015.

[18] Yoshua Bengio. Gradient-based optimization of hyperparameters. Neural Computation, 12:1889–1900, 2000.

[19] Brandon Amos and J. Zico Kolter. OptNet: Differentiable optimization as a layer in neural networks. In Proceedings of the 34th International Conference on Machine Learning (ICML 2017), 2017.

[20] Catalin Ionescu, Orestis Vantzos, and Cristian Sminchisescu. Matrix backpropagation for deep networks with structured layers. IEEE International Conference on Computer Vision (ICCV), pages 2965–2973, 2015.

[21] Denis Tome, Christopher Russell, and Lourdes Agapito. Lifting from the deep: Convolutional 3D pose estimation from a single image. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2500–2509, 2017.

[22] Vladimir Vapnik. Estimation of Dependencies Based on Empirical Data. Springer, 2nd edition, 2006.

[23] Danny C. Sorensen. Newton's method with a model trust region modification. SIAM Journal on Numerical Analysis, 19(2):409–426, 1982.

[24] Richard H. Byrd, Jorge Nocedal, and Robert B. Schnabel. Representations of quasi-Newton matrices and their use in limited memory methods. 
Mathematical Programming, 63(1):129–156, Jan 1994.

[25] Jorge Nocedal. Updating quasi-Newton matrices with limited storage. Mathematics of Computation, 35(151):773–782, 1980.

[26] P. H. S. Torr and A. Zisserman. MLESAC: A new robust estimator with application to estimating image geometry. Computer Vision and Image Understanding, 78:2000, 2000.

[27] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[28] Matthew D. Zeiler. ADADELTA: An adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.

[29] Matej Kristan, Ales Leonardis, Jiri Matas, Michael Felsberg, Roman Pflugfelder, Luka Cehovin Zajc, Tomas Vojir, Goutam Bhat, Alan Lukezic, Abdelrahman Eldesokey, Gustavo Fernandez, et al. The sixth Visual Object Tracking VOT2018 challenge results, 2018.

[30] Zheng Zhu, Qiang Wang, Bo Li, Wei Wu, Junjie Yan, and Weiming Hu. Distractor-aware Siamese networks for visual object tracking. In European Conference on Computer Vision, pages 103–119. Springer, 2018.

[31] Chong Sun, Huchuan Lu, and Ming-Hsuan Yang. Learning spatial-aware regressions for visual tracking. In CVPR, 2018.

[32] Tianyang Xu, Zhenhua Feng, Xiao-Jun Wu, and Josef Kittler. Learning adaptive discriminative correlation filters via temporal consistency preserving spatial feature selection for robust visual object tracking. IEEE Transactions on Image Processing, June 2019.

[33] Denis Tome, Matteo Toso, Lourdes Agapito, and Chris Russell. Rethinking pose in 3D: Multi-stage refinement and recovery for markerless motion capture. In 2018 International Conference on 3D Vision (3DV), pages 474–483. IEEE, 2018.

[34] Christoph Bregler, Aaron Hertzmann, and Henning Biermann. Recovering non-rigid 3D shape from image streams. In Computer Vision and Pattern Recognition, volume 2, pages 690–696. 
IEEE, 2000.

[35] Julieta Martinez, Rayat Hossain, Javier Romero, and James J. Little. A simple yet effective baseline for 3D human pose estimation. In ICCV, 2017.

[36] Matt Trumble, Andrew Gilbert, Charles Malleson, Adrian Hilton, and John Collomosse. Total Capture: 3D human pose estimation fusing video and inertial sensors. In 2017 British Machine Vision Conference (BMVC), 2017.

[37] Xingyi Zhou, Qixing Huang, Xiao Sun, Xiangyang Xue, and Yichen Wei. Towards 3D human pose estimation in the wild: A weakly-supervised approach. In The IEEE International Conference on Computer Vision (ICCV), 2017.

[38] Georgios Pavlakos, Xiaowei Zhou, Konstantinos G. Derpanis, and Kostas Daniilidis. Harvesting multiple views for marker-less 3D human pose annotations. In Computer Vision and Pattern Recognition (CVPR), 2017.

[39] Juan Carlos Núñez, Raúl Cabido, Jose Vélez, Antonio S. Montemayor, and Juan Pantrigo. Multiview 3D human pose estimation using improved least-squares and LSTM networks. Neurocomputing, 323, 2018.

[40] Catalin Ionescu, Dragos Papava, Vlad Olaru, and Cristian Sminchisescu. Human3.6M: Large scale datasets and predictive methods for 3D human sensing in natural environments. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(7):1325–1339, 2014.

[41] Sergey Zagoruyko and Nikos Komodakis. Learning to compare image patches via convolutional neural networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.

[42] Sara Sabour, Nicholas Frosst, and Geoffrey E. Hinton. Dynamic routing between capsules. In Advances in Neural Information Processing Systems, pages 3856–3866, 2017.

[43] Pedro F. Felzenszwalb and Daniel P. Huttenlocher. Pictorial structures for object recognition. 
International Journal of Computer Vision, 61(1):55–79, 2005.