{"title": "Proximal Deep Structured Models", "book": "Advances in Neural Information Processing Systems", "page_first": 865, "page_last": 873, "abstract": "Many problems in real-world applications involve predicting continuous-valued random variables that are statistically related.  In this paper, we propose a powerful deep structured model that is able to learn complex non-linear functions which encode the   dependencies between continuous output variables.  We show that  inference in our  model using proximal methods can be efficiently solved as a feed-foward pass of a special  type of  deep recurrent neural network. We demonstrate the  effectiveness of our approach in the tasks of image denoising, depth refinement and optical flow estimation.", "full_text": "Proximal Deep Structured Models\n\nShenlong Wang\n\nUniversity of Toronto\n\nslwang@cs.toronto.edu\n\nSanja Fidler\n\nUniversity of Toronto\n\nfidler@cs.toronto.edu\n\nRaquel Urtasun\n\nUniversity of Toronto\n\nurtasun@cs.toronto.edu\n\nAbstract\n\nMany problems in real-world applications involve predicting continuous-valued\nrandom variables that are statistically related. In this paper, we propose a powerful\ndeep structured model that is able to learn complex non-linear functions which\nencode the dependencies between continuous output variables. We show that\ninference in our model using proximal methods can be ef\ufb01ciently solved as a feed-\nfoward pass of a special type of deep recurrent neural network. We demonstrate\nthe effectiveness of our approach in the tasks of image denoising, depth re\ufb01nement\nand optical \ufb02ow estimation.\n\n1\n\nIntroduction\n\nMany problems in real-world applications involve predicting a collection of random variables that\nare statistically related. Over the past two decades, graphical models have been widely exploited\nto encode these interactions in domains such as computer vision, natural language processing and\ncomputational biology. 
However, these models are shallow and only a log-linear combination of\nhand-crafted features is learned [34]. This limits the ability to learn complex patterns, which is\nparticularly important nowadays as large amounts of data are available, facilitating learning.\nIn contrast, deep learning approaches learn complex data abstractions by composing simple non-\nlinear transformations. In recent years, they have produced state-of-the-art results in many applica-\ntions such as speech recognition [17], object recognition [21], stereo estimation [38], and machine\ntranslation [33]. In some tasks, they have been shown to outperform humans, e.g., fine-grained\ncategorization [7] and object classification [15].\nDeep neural networks are typically trained using simple loss functions. Cross-entropy or hinge\nloss are used when dealing with discrete outputs, and squared loss when the outputs are continuous.\nMulti-task approaches are popular, where the hope is that dependencies of the output will be captured\nby sharing intermediate layers among tasks [9].\nDeep structured models attempt to learn complex features by taking into account the dependencies\nbetween the output variables. A variety of methods have been developed in the context of predicting\ndiscrete outputs [7, 3, 31, 39]. Several techniques unroll inference and show how the forward\nand backward passes of these deep structured models can be expressed as a set of standard layers\n[1, 14, 31, 39]. This allows for fast end-to-end training on GPUs.\nHowever, little to no attention has been given to deep structured models with continuous-valued\noutput variables. One of the main reasons is that inference (even in the shallow model) is much less\nwell studied, and very few solutions exist. 
One exception is Markov random fields (MRFs) with\nGaussian potentials, where exact inference is possible (via message passing) if the precision matrix is\npositive semi-definite and satisfies the spectral radius condition [36]. A family of popular approaches\nconverts the continuous inference problem into a discrete task using particle methods [18, 32]. Specific\nsolvers have also been designed for certain types of potentials, e.g., polynomials [35] and piecewise\nconvex functions [37].\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\nProximal methods are a popular solution to perform inference in continuous MRFs when the potentials\nare non-smooth and non-differentiable functions of the outputs [26]. In this paper, we show that\nproximal methods are a special type of recurrent neural network. This allows us to efficiently train a\nwide family of deep structured models with continuous output variables end-to-end on the GPU. We\nshow that learning can simply be done via back-propagation for any differentiable loss function. We\ndemonstrate the effectiveness of our algorithm in the tasks of image denoising, depth refinement and\noptical flow, and show superior results over competing algorithms on these tasks.\n\n2 Proximal Deep Structured Networks\nIn this section, we first introduce continuous-valued deep structured models and briefly review\nproximal methods. We then propose proximal deep structured models and discuss how to do efficient\ninference and learning in these models. Finally, we discuss the relationship with previous work.\n\n2.1 Continuous-valued Deep Structured Models\nGiven an input x \u2208 X, let y = (y1, ..., yN) be the set of random variables that we are interested\nin predicting. The output space is the product space of all the elements, y \u2208 Y = \u220f_{i=1}^N Yi, and\nthe domain of each individual variable yi is a closed subset of the real line, i.e., Yi \u2282 R. Let\nE(x, y; w) : X \u00d7 Y \u00d7 R^K \u2192 R be an energy function which encodes the problem that we are\ninterested in solving. Without loss of generality we assume that the energy decomposes into a sum of\nfunctions, each depending on a subset of variables:\n\nE(x, y; w) = \u2211_i fi(yi, x; wu) + \u2211_\u03b1 f\u03b1(y\u03b1, x; w\u03b1)    (1)\n\nwhere fi(yi, x; wu) : Yi \u00d7 X \u2192 R is a function that depends on a single variable (i.e., a unary term)\nand f\u03b1(y\u03b1) : Y\u03b1 \u00d7 X \u2192 R depends on a subset of variables y\u03b1 = (yi)_{i\u2208\u03b1} defined on a domain\nY\u03b1 \u2282 Y. Note that, unlike standard MRF models, the functions fi and f\u03b1 are non-linear functions of\nthe parameters.\nThe energy function is parameterized in terms of a set of weights w, and learning aims at finding the\nvalue of these weights which minimizes a loss function. Given an input x, inference aims at finding\nthe best configuration by minimizing the energy function:\n\ny* = arg min_{y\u2208Y} \u2211_i fi(yi, x; wu) + \u2211_\u03b1 f\u03b1(y\u03b1, x; w\u03b1)    (2)\n\nFinding the best scoring configuration y* is equivalent to maximizing the posterior distribution\np(y|x; w) = (1/Z(x; w)) exp(\u2212E(x, y; w)), with Z(x; w) the partition function.\nStandard multi-variate deep networks (e.g., FlowNet [11]) have potential functions which depend on\na single output variable. In this simple case, inference corresponds to a forward pass that predicts the\nvalue of each variable independently. 
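To make the decomposition in Eq. (1) concrete, here is a tiny NumPy sketch that evaluates such an energy on a 1-D chain; the quadratic unary and l1 pairwise potentials below are illustrative choices, not the paper's learned potentials:

```python
import numpy as np

def energy(y, unary, pairwise):
    """E(y) = sum_i f_i(y_i) + sum_a f_a(y_a) on a chain, cf. Eq. (1)."""
    e = sum(unary(i, y[i]) for i in range(len(y)))                  # unary terms f_i
    e += sum(pairwise(y[i], y[i + 1]) for i in range(len(y) - 1))   # pairwise f_alpha
    return e

x = np.array([0.0, 1.0, 4.0])            # observed input
unary = lambda i, yi: (yi - x[i]) ** 2   # f_i: quadratic data term
pairwise = lambda a, b: abs(a - b)       # f_alpha: l1 smoothness term

# At y = x the data terms vanish and the smoothness cost is |0-1| + |1-4| = 4.
print(energy(x, unary, pairwise))  # -> 4.0
```

MAP inference in Eq. (2) amounts to minimizing exactly this kind of sum over y.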
This can be interpreted as inference in a graphical model with\nonly unary potentials fi.\nIn the general case, performing inference in MRFs with continuous variables involves solving\na very challenging numerical optimization problem. Depending on the structure and properties\nof the potential functions, various methods have been proposed. For instance, particle methods\nperform approximate inference by performing message passing on a series of discrete MRFs [18, 32].\nExact inference is possible for a certain type of MRF, i.e., Gaussian MRFs with a positive semi-\ndefinite precision matrix. Efficient dedicated algorithms exist for a restricted family of functions,\ne.g., polynomials [35]. If certain conditions are satisfied, inference is often tackled by a group of\nalgorithms called proximal methods [26]. In this section, we will focus on this family of inference\nalgorithms and show that they are a particular type of recurrent net. We will use this fact to efficiently\ntrain deep structured models with continuous outputs.\n\n2.2 A Review on Proximal Methods\nNext, we briefly discuss proximal methods, and refer the reader to [26] for a thorough review.\nProximal algorithms are very generally applicable, but they are particularly successful at solving\nnon-smooth, non-differentiable, or constrained problems. Their base operation is evaluating the\nproximal operator of a function, which involves solving a small convex optimization problem that\noften admits a closed-form solution. In particular, the proximal operator prox_f(x0) : R \u2192 R of a\nfunction f is defined as\n\nprox_f(x0) = arg min_y (y \u2212 x0)^2 + f(y)\n\nIf f is convex, the fixed points of the proximal operator of f are precisely the minimizers of f. In\nother words, prox_f(x*) = x* iff x* minimizes f. This fixed-point property motivates the simplest\nproximal method, called the proximal point algorithm, which iterates x^(n+1) = prox_f(x^(n)). 
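As a minimal sketch of this iteration (function names are ours), the snippet below runs the proximal point algorithm on f(y) = lam*|y|, whose proximal operator has a closed form; note that with the (y − x0)^2 definition above (no 1/2 factor), the l1 prox soft-thresholds by lam/2:

```python
import numpy as np

def prox_l1(x0, lam):
    # Closed-form prox of f(y) = lam*|y| under the (y - x0)^2 definition:
    # argmin_y (y - x0)^2 + lam*|y|  =>  soft-threshold x0 by lam/2.
    return np.sign(x0) * np.maximum(np.abs(x0) - lam / 2.0, 0.0)

def proximal_point(prox, x0, n_iter=50):
    # The proximal point algorithm: x^(n+1) = prox_f(x^(n)).
    x = x0
    for _ in range(n_iter):
        x = prox(x)
    return x

# For f = lam*|y| the unique minimizer is 0; iterating the prox converges to it,
# and 0 is indeed a fixed point of prox_l1.
x_star = proximal_point(lambda v: prox_l1(v, 0.5), np.array([3.0, -2.0]))
print(x_star)
```

Each iteration shrinks the magnitude of every coordinate by lam/2 until it hits the minimizer.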
All the\nproximal algorithms used here are based on this fixed-point property. Note that even if the function\nf(\u00b7) is not differentiable (e.g., the \u21131-norm) there might exist a closed-form or easy-to-compute proximal\noperator.\nWhile the original proximal operator was designed for the purpose of obtaining the global optimum\nin convex optimization, recent work has shown that proximal methods work well for non-convex\noptimization as long as the proximal operator exists [20, 30, 5].\nFor multi-variate optimization problems the proximal operator might not be trivial to obtain (e.g.,\nwhen having high-order potentials). In this case, a widely used solution is to decompose the high-\norder terms into small problems that can be solved through proximal operators. Examples of this\nfamily of algorithms are half-quadratic splitting [13], the alternating direction method of multipliers\n[12] and primal-dual methods [2]. In this work, we focus on the non-convex multi-variate case.\n\n2.3 Proximal Deep Structured Models\nIn order to apply proximal algorithms to tackle the inference problem defined in Eq. (2), we require\nthe energy functions fi and f\u03b1 to satisfy the following conditions:\n\n1. There exist functions hi and gi such that fi(yi, x; w) = gi(yi, hi(x, w)), where gi is a\ndistance function^1;\n2. There exists a closed-form proximal operator for gi(yi, hi(x; w)) wrt yi;\n3. There exist functions h\u03b1 and g\u03b1 such that f\u03b1(y\u03b1, x; w) can be re-written as\nf\u03b1(y\u03b1, x; w) = h\u03b1(x; w) g\u03b1(w\u03b1^T y\u03b1);\n4. There exists a proximal operator for either the dual or the primal form of g\u03b1(\u00b7).\n\nA fairly general family of deep structured models satisfies these conditions. Our experimental\nevaluation will demonstrate the applicability in a wide variety of tasks including depth refinement,\nimage denoising as well as optical flow. 
If our potential functions satisfy the conditions above, we\ncan rewrite our objective function as follows:\n\nE(x, y; w) = \u2211_i gi(yi, hi(x; w)) + \u2211_\u03b1 h\u03b1(x; w) g\u03b1(w\u03b1^T y\u03b1)    (3)\n\nIn this paper, we make the important observation that each iteration of most existing proximal solvers\ncontains five sub-steps: (i) compute the locally linear part; (ii) compute the proximal operator prox_gi;\n(iii) deconvolve; (iv) compute the proximal operator prox_g\u03b1; (v) update the result through a gradient\ndescent step. Due to space restrictions, we show primal-dual solvers in this section, and refer the\nreader to the supplementary material for ADMM, half-quadratic splitting and the proximal gradient\nmethod.\nThe general idea of primal-dual solvers is to introduce auxiliary variables z to decompose the high-\norder terms. We can then minimize over z and y alternately by computing their proximal operators.\nIn particular, we can transform the primal problem in Eq. (3) into the following saddle-point problem:\n\nmin_{y\u2208Y} max_{z\u2208Z} \u2211_i gi(yi, hi(x, wu)) \u2212 \u2211_\u03b1 h\u03b1(x, w) g\u03b1*(z\u03b1) + \u2211_\u03b1 h\u03b1(x, w) \u27e8w\u03b1^T y\u03b1, z\u03b1\u27e9    (4)\n\nwhere g\u03b1*(\u00b7) is the convex conjugate of g\u03b1(\u00b7), g\u03b1*(z*) = sup{\u27e8z*, z\u27e9 \u2212 g\u03b1(z) | z \u2208 Z}, and the convex\nconjugate of g\u03b1* is g\u03b1 itself if g\u03b1(\u00b7) is convex.\n\n^1A function g : Y \u00d7 Y \u2192 [0, \u221e) is called a distance function iff it satisfies the conditions of non-negativity,\nidentity of indiscernibles, symmetry and the triangle inequality.\n\nFigure 1: The whole architecture (top) and one iteration block (bottom) of our proximal deep\nstructured model.\n\nThe primal-dual method solves the problem in Eq. 
(4) by iterating the following steps: (i) fix y\nand minimize the energy wrt z; (ii) fix z and minimize the energy wrt y; (iii) conduct a Nesterov\nextrapolation gradient step. These iterative computation steps are:\n\nz\u03b1^(t+1) = prox_{g\u03b1*}( z\u03b1^(t) + \u03c3\u03c1 h\u03b1(x; w) w\u03b1^T \u0233\u03b1^(t) )\nyi^(t+1) = prox_{gi, hi(x,w)}( yi^(t) \u2212 \u03c3\u03c4 h\u03b1(x; w) w_{\u00b7,i}^T z^(t+1) )\n\u0233i^(t+1) = yi^(t+1) + \u03c3ex ( yi^(t+1) \u2212 yi^(t) )    (5)\n\nwhere y^(t) is the solution at the t-th iteration, z^(t) is an auxiliary variable and h(x, wu) is the deep\nunary network. Note that different functions gi and g\u03b1 in (3) have different proximal operators.\nIt is not difficult to see that the inference process in Eq. (5) can be written as a feed-forward pass in a\nrecurrent neural network by stacking multiple computation blocks. In particular, the first step is a\nconvolution layer and the third step can be considered as a deconvolution layer sharing weights with\nthe first step. The proximal operators are non-linear activation layers and the gradient descent step\nis a weighted sum. We also rewrite the scalar multiplication as a 1 \u00d7 1 convolution. We refer the\nreader to Fig. 1 for an illustration. The lower figure depicts one iteration of inference while the whole\ninference process as a recurrent net is shown in the top figure.\nNote that the whole inference process has two stages: first we compute the unaries h(x; wu) with a\nforward pass. Then we perform MAP inference through our recurrent network.\nThe first non-linearity for the primal-dual method is the proximal operator of the dual function\nof f\u03b1. This changes for other types of proximal methods. 
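As a toy instance of the structure of Eq. (5), the sketch below unrolls Chambolle–Pock-style primal-dual updates for a 1-D version of a denoising energy (quadratic unary, l1 on finite differences): D plays the role of the convolution, its transpose the deconvolution, the clip is the dual prox, and the weighted sum is the primal prox. All names and constants are illustrative, not the paper's learned weights:

```python
import numpy as np

def primal_dual_tv(x, lam=1.0, sigma=0.25, tau=0.25, theta=1.0, n_iter=200):
    """Each loop pass is one 'iteration block': conv (difference operator D),
    dual prox (projection onto the lam-l_inf ball), deconv (D^T),
    primal prox (weighted sum with the data x), then extrapolation."""
    D = lambda y: np.diff(y)                                        # "convolution"
    Dt = lambda z: np.concatenate([[-z[0]], -np.diff(z), [z[-1]]])  # its transpose
    y = x.copy(); y_bar = x.copy(); z = np.zeros(len(x) - 1)
    for _ in range(n_iter):
        z = np.clip(z + sigma * D(y_bar), -lam, lam)       # prox of dual of lam*|.|_1
        y_new = (y - tau * Dt(z) + tau * x) / (1 + tau)    # prox of 0.5*(y - x)^2
        y_bar = y_new + theta * (y_new - y)                # extrapolation step
        y = y_new
    return y

noisy = np.array([0.9, 1.1, 1.0, 5.0, 5.2, 4.9])   # two flat regions plus noise
print(primal_dual_tv(noisy, lam=0.2).round(2))
```

The output smooths each flat region while preserving the jump between them, which is exactly the behaviour the structured term is meant to encode.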
In the case of the alternating direction\nmethod of multipliers (ADMM), the nonlinearity corresponds to the proximal operator of f\u03b1; for\nhalf-quadratic splitting it is the proximal operator of f\u03b1's primal form, while the second non-linearity is\na least-squares solver; if fi or f\u03b1 reduces to a quadratic function of y, the algorithm is simplified,\nas the proximal operator of a quadratic function is a linear function [5]. We refer the reader to the\nsupplementary material for more details on other proximal methods.\n\n2.4 Learning\nGiven training pairs composed of inputs {xn}_{n=1}^N and their corresponding outputs {yn^gt}_{n=1}^N, learning\naims at finding the parameters which minimize a regularized loss function:\n\nw* = arg min_w \u2211_n \u2113(yn*, yn^gt) + \u03b3 r(w)\n\nwhere \u2113(\u00b7) is the loss, r(\u00b7) is a regularizer over the weights (we use the \u21132-norm in practice), yn* is the\nminimizer of Eq. (3) for the n-th example and \u03b3 is a scalar. Given that both prox_fi and\nprox_g\u03b1 (or prox_g\u03b1*) are sub-differentiable wrt w and y, back-propagation can be used to compute\nthe gradient efficiently. We refer the reader to Fig. 2 for an illustration of our learning algorithm.\nParameters such as the gradient steps \u03c3\u03c1, \u03c3\u03c4, \u03c3ex in Eq. (5) are considered hyper-parameters in\nproximal methods and are typically set manually. In contrast, we can learn them, as they are 1 \u00d7 1\nconvolution weights.\n\nAlgorithm: Learning Continuous-Valued Deep Structured Models\nRepeat until stopping criteria:\n\n1. Forward pass to compute hi(x, w) and h\u03b1(x, w)\n2. Compute y* via a forward pass through Eq. (5)\n3. Compute the gradient via a backward pass\n4. 
Parameter update\n\nFigure 2: Algorithm for learning proximal deep structured models.\n\nNon-shared weights: The weights and gradient steps for high-order potentials are shared among\nall the iteration blocks in the inference network, which guarantees the feed-forward pass to explicitly\nminimize the energy function in Eq. (2). In practice we found that by removing the weight-sharing\nand \ufb01xed gradient step constraints, we can give extra \ufb02exibility to our model, boosting the \ufb01nal\nperformance. This observation is consistent with the \ufb01ndings of shrinkage \ufb01eld [30] and inference\nmachines [27].\n\nMulti-loss:\nIntermediate layer outputs y(t) should gradually converge towards the \ufb01nal output.\nMotivated by this fact, we include a loss over the intermediate computations to accelerate convergence.\n\n2.5 Discussion and Related Work\nOur approach can be considered as a continuous-valued extension of deep structured models [3,\n31, 39]. Unlike previous methods where the output lies in a discrete domain and inference is\nconducted through a specially designed message passing layer, the output of the proposed method is\nin continuous domain and inference is done by stacking convolution and non-linear activation layers.\nWithout deep unary potentials, our model is reduced to a generalized version of \ufb01eld-of-experts [28].\nThe idea of stacking shrinkage functions and convolutions as well as learning iteration-speci\ufb01c\nweights was exploited in the learning iterative shrinkage algorithm (LISTA) [14]. LISTA can be\nconsidered as a special case of our proposed model with sparse coding as the energy function and\nproximal gradient as the inference algorithm. Our approach is also closely related to the recent\nstructured prediction energy networks (SPEN) [1], where our unary network is analogous to the\nfeature net in SPEN and the whole energy model is analogous to the energy net. 
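The learning scheme of Sec. 2.4 (back-propagation through the unrolled inference, non-shared per-iteration weights, and a multi-loss over intermediate outputs) can be illustrated on a toy problem. In the sketch below, plain gradient steps on a quadratic unary stand in for the proximal blocks, each iteration has its own learnable step size, and the backward pass through the recurrence is written out by hand; the problem and all names are ours:

```python
import numpy as np

def unroll(x, y0, alphas):
    # Inference: T unrolled steps on E(y) = 0.5*(y - x)^2 with a learnable,
    # non-shared step size per iteration block.
    ys = [y0]
    for a in alphas:
        ys.append(ys[-1] - a * (ys[-1] - x))
    return ys

def train(xs, ygts, T=3, lr=0.05, epochs=300):
    alphas = np.full(T, 0.1)                       # per-iteration weights
    for _ in range(epochs):
        grad = np.zeros(T)
        for x, ygt in zip(xs, ygts):
            ys = unroll(x, 0.0, alphas)
            # Multi-loss: squared error on ALL intermediate outputs.
            # Backprop by hand through y_{t+1} = (1 - a_t)*y_t + a_t*x.
            dy = 0.0
            for t in range(T, 0, -1):
                dy += 2.0 * (ys[t] - ygt)            # loss gradient at stage t
                grad[t - 1] += dy * (x - ys[t - 1])  # d y_t / d a_{t-1}
                dy *= (1.0 - alphas[t - 1])          # chain through the block
        alphas -= lr * grad / len(xs)
    return alphas

# Toy task: the target IS the unary minimum x, so training should push the
# truncated inference to reach x quickly (step sizes grow toward 1).
xs = [1.0, -2.0, 0.5]
alphas = train(xs, xs)
print(alphas)
```

After training, three unrolled steps already land essentially on the minimizer, mirroring how learned, iteration-specific steps can outperform hand-set ones.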
Both SPEN and our\nproposed method can be considered as special cases of optimization-based learning [8]. However,\nSPEN utilizes plain gradient descent for inference, while our network is motivated by proximal algorithms.\nPrevious methods have tried to learn multi-variate regression networks for optical flow [11] and\nstereo [24], but none of these approaches model the interactions between output variables. Thus,\nthey can be considered a special case of our model, where only unary functions fi are present.\n\n3 Experiments\n\nWe demonstrate the effectiveness of our approach in three different applications: image denoising,\ndepth refinement and optical flow estimation. We employ mxnet [4] with CUDNNv4 acceleration to\nimplement the networks, which we train end-to-end. Our experiments are conducted on a Xeon 3.2\nGHz machine with a Titan X GPU.\n\n3.1 Image Denoising\n\nWe first evaluate our method on the task of image denoising (i.e., shallow unary) using the BSDS\nimage dataset [23]. We corrupt each image with Gaussian noise with standard deviation \u03c3 = 25. We\nuse the energy function typically employed for image denoising:\n\ny* = arg min_{y\u2208Y} \u2211_i \u2016yi \u2212 xi\u2016_2^2 + \u03bb \u2211_\u03b1 \u2016w_{ho,\u03b1}^T y\u03b1\u2016_1    (6)\n\nAccording to the primal-dual algorithm, the activation function for the first nonlinearity is the proximal\noperator of the dual function of the \u21131-norm, prox*_\u03c1(z) = min(|z|, 1)\u00b7sign(z), which is the projection\nonto an \u2113\u221e-norm ball. In practice we encode this function as prox*_\u03c1(z) = max(min(z, 1), \u22121). The\nsecond nonlinearity is the proximal operator of the primal function of the \u21132-norm, which is a\nweighted sum: prox_\u21132(y, \u03bb) = (x + \u03bby)/(1 + \u03bb).\n\nBM3D [6]  EPLL [40]  LSSC [22]  CSF [30]  RTF [29]  Ours  Ours GPU\nPSNR: 28.56  28.68  28.70  28.72  28.75  28.79  28.79\nTime (second): 2.57  108.72  516.48  5.10  69.25  0.23  0.011\nTable 1: Natural image denoising on the BSDS dataset [23] with noise standard deviation \u03c3 = 25.\n\nFilter size: 3 \u00d7 3  5 \u00d7 5  7 \u00d7 7\n16 filters: 28.43  28.57  28.68\n32 filters: 28.48  28.64  28.76\n64 filters: 28.49  28.68  28.79\nTable 2: Performance (PSNR) of the proposed model with different hyper-parameters.\n\nFigure 3: Qualitative results for image denoising. Left to right: noisy input, ground-truth, our result.\n\nFor training, we select 244 images, following the configuration of [30]. We randomly crop\n128 \u00d7 128 clean patches from the training images and obtain the noisy input by adding random\nnoise. We use mean square error as the loss function and set a weight decay strength of 0.0004 for all\nsettings. Note that for all the convolution and deconvolution layers, the bias is set to zero. MSRA\ninitialization [16] is used for the convolution parameters and the initial gradient step for each iteration\nis set to 0.02. We use Adam [19] with a learning rate of t = 0.02 and hyper-parameters \u03b21 = 0.9\nand \u03b22 = 0.999 as in Kingma et al. [19]. The learning rate is divided by 2 every 50 epochs, and we\nuse a mini-batch size of 32.\nWe compare against a number of recent state-of-the-art techniques [6, 40, 22, 30, 29].^2 The Peak\nSignal-to-Noise Ratio (PSNR) is used as the performance measure. As shown in Tab. 1, our proposed\nmethod outperforms all methods in terms of both accuracy and speed. The second best performing\nmethod is RTF [29], while being two orders of magnitude slower than our approach. 
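Both nonlinearities used in this denoising network are simple elementwise operations; a NumPy sketch (function names are ours):

```python
import numpy as np

def prox_dual_l1(z):
    # Proximal operator of the dual of the l1-norm: projection onto the
    # l_inf ball, encoded as in the text as max(min(z, 1), -1).
    return np.maximum(np.minimum(z, 1.0), -1.0)

def prox_l2(y, x, lam):
    # Proximal operator of the squared data term: a weighted average of the
    # current estimate y and the noisy input x, (x + lam*y) / (1 + lam).
    return (x + lam * y) / (1.0 + lam)

z = np.array([-3.0, -0.5, 0.2, 2.5])
print(prox_dual_l1(z))                                     # clips into [-1, 1]
print(prox_l2(np.array([4.0]), np.array([2.0]), lam=1.0))  # midpoint of 2 and 4
```

Because both are cheap elementwise maps, the whole inference network runs as a stack of (de)convolutions and these activations.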
Our GPU\nimplementation achieves real-time performance at more than 90 frames/second. Note that a GPU\nversion of CSF is reported to run at 0.92 s on a 512 \u00d7 512 image on a GTX 480 [30]. However, since that\nGPU implementation is not available online, we cannot make a proper comparison.\nTab. 2 shows performance with different hyper-parameters (filter size, number of filters per layer).\nAs we can see, larger receptive fields and more convolution filters slightly boost the performance.\nFig. 3 depicts qualitative results of our model for the denoising task.\n\n3.2 Depth Refinement\n\nDue to specularities and intensity changes of structured-light imaging, the sensor's output depth is\noften noisy. Thus, refining the depth to generate a cleaner, more accurate depth image is an important\ntask. We conduct the depth refinement experiment on the 7 Scenes dataset [25]. We follow the\nconfiguration of [10], where the ground-truth depth was computed using KinectFusion [25]. The\nnoise [10] has a Poisson-like distribution and is depth-dependent, which is very different from the\nimage denoising experiment, which contained Gaussian noise.\nWe use the same architecture as for the task of natural image denoising. The multi-stage mean square\nerror is used as the loss function and the weight decay strength is set to 0.0004. Adam (\u03b21 = 0.9 and\n\u03b22 = 0.999) is used as the optimizer with a learning rate of 0.01. Data augmentation is used to avoid\noverfitting, including random cropping, flipping and rotation. We used a mini-batch size of 16.\nWe train our model on 1000 frames of the Chess scene and test on the other scenes. PSNR is used to\nevaluate the performance. As shown in Tab. 3, our approach outperforms all competing algorithms.\nThis shows that our deep structured network is able to handle non-additive, non-Gaussian noise.\nQualitative results are shown in Fig. 4. Compared to the competing approaches, our method is able to\nrecover better depth estimates, particularly along depth discontinuities.\n\n^2We chose the model with the best performance for each competing algorithm. For the CSF method, we use\nCSF5 with 7 \u00d7 7 filters; for RTF we use RTF5; for our method, we pick the 7 \u00d7 7 \u00d7 64 high-order structured network.\n\nFigure 4: Qualitative results for depth refinement. Left to right: input, ground-truth, Wiener filter,\nbilateral filter, BM3D, Filter Forest, ours.\n\nWiener  Bilateral  LMS  BM3D [6]  FilterForest [10]  Ours\nPSNR: 30.95  32.29  24.37  35.46  35.63  36.31\nTable 3: Performance of depth refinement on the dataset of [10].\n\nFigure 5: Optical flow. Left to right: first and second input, ground-truth, Flownet [11], ours.\n\n3.3 Optical Flow\n\nWe evaluate the task of optical flow estimation on the Flying Chairs dataset [11]. The size of the training\nimages is 512 \u00d7 384. We formulate the energy as follows:\n\ny* = arg min_{y\u2208Y} \u2211_i \u2016yi \u2212 fi(xl, xr; wu)\u2016_1 + \u03bb \u2211_\u03b1 \u2016w_{ho,\u03b1}^T y\u03b1\u2016_1    (7)\n\nwhere fi(xl, xr; wu) is a Flownet model [11], a fully-convolutional encoder-decoder network\nthat predicts 2D optical flow per pixel. It has 11 encoding layers and 11 deconv layers with skip\nconnections. xl and xr are the two input images respectively, and y is the desired optical\nflow output. Note that we use the \u21131-norm for both the data and the regularization terms. 
The first\nnonlinearity activation function is the proximal operator of the \u21131-norm's dual function, prox*_\u03c1(z) =\nmin(|z|, 1) \u00b7 sign(z), and the second non-linear activation function is the proximal operator of the\n\u21131-norm's primal form, prox_{\u03c4,x}(y, \u03bb) = y \u2212 min(|y \u2212 x|, \u03bb) \u00b7 sign(y \u2212 x), which is a soft shrinkage\nfunction [26].\n\nFlownet  Flownet + TV-l1  Our proposed\nEnd-point-error: 4.98  4.96  4.91\nTable 4: Performance of optical flow on the Flying Chairs dataset [11].\n\nWe build a deep structured model with 5 iteration blocks. Each iteration block has 32 convolution\nfilters of size 7 \u00d7 7 for both the convolution and deconvolution layers, which results in 10 convo-\nlution/deconv layers and 10 non-linearities. The multi-stage mean square error is used as the loss\nfunction and the weight decay strength is set to 0.0004.\nTraining is conducted on the training subset of the Flying Chairs dataset. Our unary model is\ninitialized with pre-trained Flownet parameters. The high-order term is initialized with MSRA\nrandom initialization [16]. The hyper-parameter \u03bb in this experiment is pre-set to 10. We use\nrandom flipping, cropping and color-tuning for data augmentation, and employ the Adam optimizer\nwith the same configuration as before (\u03b21 = 0.9 and \u03b22 = 0.999) with a learning rate t = 0.005. The\nlearning rate is divided by 2 every 10 epochs and the mini-batch size is set to 12.\nWe evaluate all approaches on the test set of the Flying Chairs dataset. End-point error is used as the\nmeasure of performance. The unary-only model (i.e., plain Flownet) is used as the baseline, and we also\ncompare against a plain TV-l1 model with four pre-set gradient operators as post-processing. As\nshown in Tab. 4, our method outperforms all the baselines. From Fig. 
5 we can see that our method is\nless noisy than Flownet's output and better preserves the boundaries. Note that our current model is\nisotropic; in order to further boost the performance, incorporating anisotropic filtering such as bilateral\nfiltering is an interesting future direction.\n\n4 Conclusion\n\nWe have proposed a deep structured model that learns non-linear functions encoding complex\ndependencies between continuous output variables. We have shown that inference in our model\nusing proximal methods can be efficiently solved as a feed-forward pass on a special type of deep\nrecurrent neural network. We demonstrated our approach in the tasks of image denoising, depth\nrefinement and optical flow. In the future we plan to investigate other proximal methods and a wider\nvariety of applications.\n\nReferences\n[1] D. Belanger and A. McCallum. Structured prediction energy networks. In ICML, 2016.\n\n[2] A. Chambolle and T. Pock. A first-order primal-dual algorithm for convex problems with applications to\n\nimaging. JMIV, 2011.\n\n[3] L. Chen, A. Schwing, A. Yuille, and R. Urtasun. Learning deep structured models. In ICML, 2015.\n\n[4] T. Chen, M. Li, Y. Li, M. Lin, N. Wang, M. Wang, T. Xiao, B. Xu, C. Zhang, and Z. Zhang. Mxnet: A\n\nflexible and efficient machine learning library for heterogeneous distributed systems. arXiv, 2015.\n\n[5] Y. Chen, W. Yu, and T. Pock. On learning optimized reaction diffusion processes for effective image\n\nrestoration. In CVPR, 2015.\n\n[6] K. Dabov, A. Foi, V. Katkovnik, and K. Egiazarian. Image denoising by sparse 3-d transform-domain\n\ncollaborative filtering. TIP, 2007.\n\n[7] J. Deng, N. Ding, Y. Jia, A. Frome, K. Murphy, S. Bengio, Y. Li, H. Neven, and H. Adam. Large-scale\n\nobject classification using label relation graphs. In ECCV, 2014.\n\n[8] J. Domke. Generic methods for optimization-based modeling. In AISTATS, 2012.\n\n[9] D. Eigen and R. Fergus. 
Predicting depth, surface normals and semantic labels with a common multi-scale\n\nconvolutional architecture. In ICCV, 2015.\n\n[10] S. Fanello, C. Keskin, P. Kohli, S. Izadi, J. Shotton, A. Criminisi, U. Pattacini, and T. Paek. Filter forests\n\nfor learning data-dependent convolutional kernels. In CVPR, 2014.\n\n[11] P. Fischer, A. Dosovitskiy, E. Ilg, P. H\u00e4usser, C. Haz\u0131rba\u00b8s, V. Golkov, P. van der Smagt, D. Cremers, and\n\nT. Brox. Flownet: Learning optical \ufb02ow with convolutional networks. In CVPR, 2015.\n\n8\n\n\f[12] D. Gabay and B. Mercier. A dual algorithm for the solution of nonlinear variational problems via \ufb01nite\n\nelement approximation. Computers & Mathematics with Applications, 1976.\n\n[13] D. Geman and C. Yang. Nonlinear image recovery with half-quadratic regularization. TIP, 1995.\n\n[14] Karol Gregor and Yann LeCun. Learning fast approximations of sparse coding. In ICML, 2010.\n\n[15] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. arXiv, 2015.\n\n[16] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into recti\ufb01ers: Surpassing human-level performance on\n\nimagenet classi\ufb01cation. In ICCV, 2015.\n\n[17] G. Hinton, L. Deng, D. Yu, G. Dahl, A. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. Sainath,\n\net al. Deep neural networks for acoustic modeling in speech recognition. SPM, IEEE, 2012.\n\n[18] A. Ihler and D. McAllester. Particle belief propagation. In AISTATS, 2009.\n\n[19] D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv, 2014.\n\n[20] D. Krishnan and R. Fergus. Fast image deconvolution using hyper-laplacian priors. In NIPS, 2009.\n\n[21] A. Krizhevsky, I. Sutskever, and G. Hinton.\n\nnetworks. In NIPS, 2012.\n\nImagenet classi\ufb01cation with deep convolutional neural\n\n[22] J. Mairal, F. Bach, J. Ponce, G. Sapiro, and A. Zisserman. Non-local sparse models for image restoration.\n\nIn ICCV, 2009.\n\n[23] D. Martin, C. Fowlkes, D. 
Tal, and J. Malik. A database of human segmented natural images and its\n\napplication to evaluating segmentation algorithms and measuring ecological statistics. In ICCV, 2001.\n\n[24] N. Mayer, E. Ilg, P. H\u00e4usser, P. Fischer, D. Cremers, A. Dosovitskiy, and T. Brox. A large dataset to train\n\nconvolutional networks for disparity, optical \ufb02ow, and scene \ufb02ow estimation. arXiv, 2015.\n\n[25] R. Newcombe, S. Izadi, O. Hilliges, D. Molyneaux, D. Kim, A. Davison, P. Kohi, J. Shotton, S. Hodges,\n\nand A. Fitzgibbon. Kinectfusion: Real-time dense surface mapping and tracking. In ISMAR, 2011.\n\n[26] N. Parikh and S. Boyd. Proximal algorithms. Foundations and Trends in optimization, 2014.\n\n[27] S. Ross, D. Munoz, M. Hebert, and J. Bagnell. Learning message-passing inference machines for structured\n\nprediction. In CVPR, 2011.\n\n[28] S. Roth and M. Black. Fields of experts: A framework for learning image priors. In CVPR, 2005.\n\n[29] U. Schmidt, J. Jancsary, S. Nowozin, S. Roth, and C. Rother. Cascades of regression tree \ufb01elds for image\n\nrestoration. PAMI, 2013.\n\n[30] U. Schmidt and S. Roth. Shrinkage \ufb01elds for effective image restoration. In CVPR, 2014.\n\n[31] A. Schwing and R. Urtasun. Fully connected deep structured networks. arXiv, 2015.\n\n[32] E. Sudderth, A. Ihler, M. Isard, W. Freeman, and A. Willsky. Nonparametric belief propagation. Communi-\n\ncations of the ACM, 2010.\n\n[33] I. Sutskever, O. Vinyals, and Q. Le. Sequence to sequence learning with neural networks. In NIPS, 2014.\n\n[34] I. Tsochantaridis, T. Hofmann, T. Joachims, and Y. Altun. Support vector machine learning for interdepen-\n\ndent and structured output spaces. In ICML, 2004.\n\n[35] S. Wang, A. Schwing, and R. Urtasun. Ef\ufb01cient inference of continuous markov random \ufb01elds with\n\npolynomial potentials. In NIPS, 2014.\n\n[36] Y. Weiss and W. Freeman. Correctness of belief propagation in gaussian graphical models of arbitrary\n\ntopology. 
Neural computation, 2001.\n\n[37] C. Zach and P. Kohli. A convex discrete-continuous approach for markov random \ufb01elds. In ECCV. 2012.\n\n[38] J. Zbontar and Y. LeCun. Computing the stereo matching cost with a convolutional neural network. In\n\nCVPR, 2015.\n\n[39] S. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su, D. Du, C. Huang, and P. Torr. Conditional\n\nrandom \ufb01elds as recurrent neural networks. In ICCV, 2015.\n\n[40] D. Zoran and Y. Weiss. From learning models of natural image patches to whole image restoration. In\n\nICCV, 2011.\n", "award": [], "sourceid": 528, "authors": [{"given_name": "Shenlong", "family_name": "Wang", "institution": "University of Toronto"}, {"given_name": "Sanja", "family_name": "Fidler", "institution": "University of Toronto"}, {"given_name": "Raquel", "family_name": "Urtasun", "institution": "University of Toronto"}]}