{"title": "Learning Neural Networks with Adaptive Regularization", "book": "Advances in Neural Information Processing Systems", "page_first": 11393, "page_last": 11404, "abstract": "Feed-forward neural networks can be understood as a combination of an intermediate representation and a linear hypothesis. While most previous works aim to diversify the representations, we explore the complementary direction by performing an adaptive and data-dependent regularization motivated by the empirical Bayes method. Specifically, we propose to construct a matrix-variate normal prior (on weights) whose covariance matrix has a Kronecker product structure. This structure is designed to capture the correlations in neurons through backpropagation. Under the assumption of this Kronecker factorization, the prior encourages neurons to borrow statistical strength from one another. Hence, it leads to an adaptive and data-dependent regularization when training networks on small datasets. To optimize the model, we present an efficient block coordinate descent algorithm with analytical solutions. Empirically, we demonstrate that the proposed method helps networks converge to local optima with smaller stable ranks and spectral norms. These properties suggest better generalizations and we present empirical results to support this expectation. We also verify the effectiveness of the approach on multiclass classification and multitask regression problems with various network structures. Our code is publicly available at:~\\url{https://github.com/yaohungt/Adaptive-Regularization-Neural-Network}.", "full_text": "Learning Neural Networks with\n\nAdaptive Regularization\n\nHan Zhao\u2217\u2020, Yao-Hung Hubert Tsai\u2217\u2020, Ruslan Salakhutdinov\u2020, Geoffrey J. 
Gordon†‡\n\n†Carnegie Mellon University, ‡Microsoft Research Montreal\n{han.zhao,yaohungt,rsalakhu}@cs.cmu.edu\ngeoff.gordon@microsoft.com\n\nAbstract\n\nFeed-forward neural networks can be understood as a combination of an intermediate representation and a linear hypothesis. While most previous works aim to diversify the representations, we explore the complementary direction by performing an adaptive and data-dependent regularization motivated by the empirical Bayes method. Specifically, we propose to construct a matrix-variate normal prior (on weights) whose covariance matrix has a Kronecker product structure. This structure is designed to capture the correlations in neurons through backpropagation. Under the assumption of this Kronecker factorization, the prior encourages neurons to borrow statistical strength from one another. Hence, it leads to an adaptive and data-dependent regularization when training networks on small datasets. To optimize the model, we present an efficient block coordinate descent algorithm with analytical solutions. Empirically, we demonstrate that the proposed method helps networks converge to local optima with smaller stable ranks and spectral norms. These properties suggest better generalization, and we present empirical results to support this expectation. We also verify the effectiveness of the approach on multiclass classification and multitask regression problems with various network structures. Our code is publicly available at: https://github.com/yaohungt/Adaptive-Regularization-Neural-Network.\n\n1\n\nIntroduction\n\nAlthough deep neural networks have been widely applied in various domains [19, 25, 27], their parameters are usually learned via the principle of maximum likelihood, so their success crucially hinges on the availability of large-scale datasets. 
When training rich models on small datasets, explicit regularization techniques are crucial to alleviate overfitting. Previous works have explored various regularization [39] and data augmentation [19, 38] techniques to learn diversified representations. In this paper, we look into an alternative direction by proposing an adaptive and data-dependent regularization method that encourages neurons of the same layer to share statistical strength by exploiting correlations between data and gradients. The goal of our method is to prevent overfitting when training (large) networks on small datasets. Our key insight stems from the famous argument by Efron [8] in the literature of the empirical Bayes method: it is beneficial to learn from the experience of others. Empirical Bayes methods provide us a guiding principle to learn model parameters even if we do not have complete information about the prior distribution. From an algorithmic perspective, we argue that the connection weights of neurons in the same layer (row/column vectors of the weight matrix) will be correlated with each other through backpropagation learning. Hence, by learning the correlations of the weight matrix, a neuron can “borrow statistical strength” from other neurons in the same layer, which essentially increases the effective sample size during learning.\n\n∗Equal contribution.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nAs an illustrative example, consider a simple setting where the input x ∈ R^d is fully connected to a hidden layer h ∈ R^p, which is further fully connected to the single output ˆy ∈ R. 
Let σ(·) be the nonlinear activation function, e.g., ReLU [33], W ∈ R^{p×d} be the connection matrix between the input layer and the hidden layer, and a ∈ R^p be the vector connecting the output and the hidden layer. Without loss of generality, ignoring the bias term in each layer, we have ˆy = a^T h, h = σ(W x). Consider the usual ℓ2 loss function ℓ(ˆy, y) = |ˆy − y|^2 / 2 and take the derivative of ℓ(ˆy, y) w.r.t. W. We obtain the update formula in backpropagation as W ← W − α(ˆy − y)(a ∘ h′) x^T, where h′ is the component-wise derivative of h w.r.t. its input argument, and α > 0 is the learning rate. Notice that (a ∘ h′) x^T is a rank-1 matrix, and each component of h′ is either 0 or 1. Hence, the update for each row vector of W is linearly proportional to x. A similar observation also holds for each column vector of W, which implies that the row/column vectors of W are correlated with each other through learning. Although in this example we only discuss a one-hidden-layer network, it is straightforward to verify that the gradient update formula for general feed-forward networks admits the same rank-one structure. The above observation leads us to the following question:\n\nCan we define a prior distribution over W that captures the correlations through the learning process for better generalization?\n\nOur Contributions To answer the above question, we develop an adaptive regularization method for neural nets inspired by the empirical Bayes method. Motivated by the example above, we propose a matrix-variate normal prior whose covariance matrix admits a Kronecker product structure to capture the correlations between different neurons. 
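The rank-one structure of the update discussed above is easy to check numerically. The following is a small sketch (ours, not from the paper's released code), assuming a one-hidden-layer ReLU network exactly as in the example:

```python
import numpy as np

# Verify that the backprop update for W in a one-hidden-layer ReLU net
# is the rank-1 matrix (y_hat - y) (a o h') x^T, as derived in the text.
rng = np.random.default_rng(0)
d, p = 6, 4                            # input / hidden dimensions
x = rng.normal(size=d)
W = rng.normal(size=(p, d))
a = rng.normal(size=p)
y = 1.0

z = W @ x                              # pre-activation
h = np.maximum(z, 0.0)                 # ReLU
y_hat = a @ h
h_prime = (z > 0).astype(float)        # component-wise derivative: 0 or 1

# Gradient of the loss l(y_hat, y) = |y_hat - y|^2 / 2 w.r.t. W.
grad_W = (y_hat - y) * np.outer(a * h_prime, x)

# Every row of grad_W is a scalar multiple of x, so its rank is at most 1.
assert np.linalg.matrix_rank(grad_W) <= 1
```

Each row of `grad_W` is proportional to `x`, which is precisely the kind of correlation across rows/columns that the matrix-variate prior is designed to capture.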
Using tools from convex analysis, we present an efficient block coordinate descent algorithm with closed-form solutions to optimize the model. Empirically, we show the proposed method helps the network converge to local optima with smaller stable ranks and spectral norms, and we verify the effectiveness of the approach on both multiclass classification and multitask regression problems with various network structures.\n\n2 Preliminary\n\nNotation and Setup We use lowercase letters to represent scalars and lowercase bold letters to denote vectors. Capital letters, e.g., X, are reserved for matrices. Calligraphic letters, such as D, are used to denote sets. We write Tr(A) for the trace of a matrix A, det(A) for the determinant of A, and vec(A) for A's vectorization by column. [n] is used to represent the set {1, . . . , n} for any integer n. Other notations will be introduced whenever needed. Suppose we have access to a training set D of n pairs of data instances (x_i, y_i), i ∈ [n]. We consider the supervised learning setting where x_i ∈ X ⊆ R^d and y_i ∈ Y. Let p(y | x, w) be the conditional distribution of y given x with parameter w. The parametric form of the conditional distribution is assumed to be known. In this paper, we assume the model parameter w is sampled from a prior distribution p(w | θ) with hyperparameter θ. On the other hand, given D, the posterior distribution of w is denoted by p(w | D, θ).\n\nThe Empirical Bayes Method To compute the predictive distribution, we need access to the value of the hyperparameter θ. However, complete information about the hyperparameter θ is usually not available in practice. To this end, the empirical Bayes method [1, 9, 10, 12, 36] proposes to estimate θ from the data directly using the marginal distribution:\n\nˆθ = arg max_θ p(D | θ) = arg max_θ ∫ p(D | w) · p(w | θ) dw.  (1)\n\nUnder specific choices of the likelihood function p(x, y | w) and the prior distribution p(w | θ), e.g., conjugate pairs, we can solve the above integral in closed form. In certain cases we can even obtain an analytic solution of ˆθ, which can then be plugged into the prior distribution. At a high level, by learning the hyperparameter θ in the prior distribution directly from data, the empirical Bayes method provides us a principled and data-dependent way to obtain an estimator of w. In fact, when both the prior and the likelihood functions are normal, it has been formally shown that the empirical Bayes estimators, e.g., the James-Stein estimator [23] and the Efron-Morris estimator [11], dominate the classic maximum likelihood estimator (MLE) in terms of quadratic loss for every choice of the model parameter w. At a colloquial level, the success of the empirical Bayes method can be attributed to the effect of “borrowing statistical strength” [8], which also makes it a powerful tool in multitask learning [28, 43] and meta-learning [15].\n\nFigure 1: Illustration of Bayes/empirical Bayes, and our proposed adaptive regularization.\n\n3 Learning with Adaptive Regularization\n\nIn this section we first propose an adaptive regularization (AdaReg) method, which is inspired by the empirical Bayes method, for learning neural networks. We then combine our observation in Sec. 1 to develop an efficient adaptive learning algorithm with a matrix-variate normal prior. 
Through our derivation, we provide several connections to and interpretations of other learning paradigms.\n\n3.1 The Proposed Adaptive Regularization\n\nWhen the likelihood function p(D | w) is implemented as a neural network, the marginalization in (1) over the model parameter w cannot be computed exactly. Nevertheless, instead of performing expensive Monte-Carlo simulation, we propose to estimate both the model parameter w and the hyperparameter θ in the prior simultaneously from the joint distribution p(D, w | θ) = p(D | w) · p(w | θ). Specifically, given an estimate ˆw of the model parameter, by maximizing the joint distribution w.r.t. θ, we can obtain ˆθ as an approximation of the maximum marginal likelihood estimator. As a result, we can use ˆθ to further refine the estimate ˆw by maximizing the posterior distribution as follows:\n\nˆw ← arg max_w p(w | D) = arg max_w p(D | w) · p(w | ˆθ).  (2)\n\nThe maximizer of (2) can in turn be used in an updated joint distribution. Formally, we can define the following optimization problem that characterizes our Adaptive Regularization (AdaReg) framework:\n\nmax_w max_θ log p(D | w) + log p(w | θ).  (3)\n\nIt is worth connecting the optimization problem (3) to classic maximum a posteriori (MAP) inference and discussing their differences. If we drop the inner optimization over the hyperparameter θ in the prior distribution, then for any fixed value ˆθ, (3) reduces to MAP with the prior defined by the specific choice of ˆθ, and the maximizer ˆw corresponds to the mode of the posterior distribution given by ˆθ. From this perspective, the optimization problem in (3) actually defines a series of MAP inference problems, and the sequence {ˆw_j(ˆθ_j)}_j defines a solution path towards the final model parameter. 
On the algorithmic side, the optimization problem (3) also suggests a natural block coordinate descent algorithm where we alternately optimize over w and θ until convergence of the objective function. An illustration of the framework is shown in Fig. 1.\n\n3.2 Neural Network with Matrix-Normal Prior\n\nInspired by the observation from Sec. 1, we propose to define a matrix-variate normal distribution [16] over the connection weight matrix W: W ∼ MN(0_{p×d}, Σ_r, Σ_c), where Σ_r ∈ S^p_{++} and Σ_c ∈ S^d_{++} are the row and column covariance matrices, respectively.² Equivalently, one can understand the matrix-variate normal distribution over W as a multivariate normal distribution with a Kronecker product covariance structure over vec(W): vec(W) ∼ N(0_{pd}, Σ_c ⊗ Σ_r). It is then easy to check that the marginal prior distributions over the row and column vectors of W are given by:\n\nW_{i:} ∼ N(0_d, [Σ_r]_{ii} · Σ_c),  W_{:j} ∼ N(0_p, [Σ_c]_{jj} · Σ_r).\n\n²The probability density function is given by p(W | Σ_r, Σ_c) = exp(−Tr(Σ_r^{−1} W Σ_c^{−1} W^T)/2) / ((2π)^{pd/2} det(Σ_r)^{d/2} det(Σ_c)^{p/2}).\n\nWe point out that the Kronecker product structure of the covariance matrix exactly captures our prior about the connection matrix W: the fan-in/fan-out of neurons in the same layer (row/column vectors of W) are correlated with the same correlation matrix in the prior, and they only differ in scale. For illustration purposes, let us consider the simple feed-forward network discussed in Sec. 1. Consider a reparametrization of the model by defining Ω_r := Σ_r^{−1} and Ω_c := Σ_c^{−1} to be the corresponding precision matrices, and plug the prior distribution into our AdaReg framework (see (3)). After routine algebraic simplifications, we reach the following concrete optimization problem:\n\nmin_{W,a} min_{Ω_r,Ω_c} (1/2n) ∑_{i∈[n]} (ˆy(x_i; W, a) − y_i)^2 + λ ||Ω_r^{1/2} W Ω_c^{1/2}||_F^2 − λ (d log det(Ω_r) + p log det(Ω_c))\nsubject to u I_p ⪯ Ω_r ⪯ v I_p,  u I_d ⪯ Ω_c ⪯ v I_d,  (4)\n\nwhere λ is a constant that only depends on p and d, 0 < u ≤ v and uv = 1. Note that the constraint is necessary to guarantee that the feasible set is compact, so that the optimization problem is well formulated and a minimum is attainable.³ It is not hard to show that in general the optimization problem (4) is not jointly convex in {a, W, Ω_r, Ω_c}, and this holds even if the activation function is linear. However, as we will show later, for any fixed a, W, the reparametrization makes the partial optimization over Ω_r and Ω_c bi-convex. More importantly, we can derive an efficient algorithm that finds the optimal Ω_r (Ω_c) for any fixed a, W, Ω_c (Ω_r) in O(max{d³, p³}) time with closed-form solutions. This allows us to apply our algorithm to networks of large sizes, where a typical hidden layer can contain thousands of nodes. Note that this is in contrast to solving a general semi-definite programming (SDP) problem using a black-box algorithm, e.g., the interior-point method [32], which is computationally intensive and hard to scale to networks of even moderate sizes. Before we delve into the details of solving (4), it is instructive to discuss some of its connections and differences to other learning paradigms.\n\nMaximum-A-Posteriori Estimation. 
Essentially, for the model parameter W, (4) defines a sequence of MAP problems where each MAP is indexed by the pair of precision matrices (Ω_r^{(t)}, Ω_c^{(t)}) at iteration t. Equivalently, at each stage of the optimization, we can interpret (4) as placing a matrix-variate normal prior on W where the precision matrix in the prior is given by Ω_c^{(t)} ⊗ Ω_r^{(t)}. From this perspective, if we fix Ω_r^{(t)} = I_p and Ω_c^{(t)} = I_d, ∀t, then (4) naturally reduces to learning with ℓ2 regularization [26]. More generally, for non-diagonal precision matrices, the regularization term for W becomes:\n\n||Ω_r^{1/2} W Ω_c^{1/2}||_F^2 = ||vec(Ω_r^{1/2} W Ω_c^{1/2})||_2^2 = ||(Ω_c^{1/2} ⊗ Ω_r^{1/2}) vec(W)||_2^2,\n\nand this is exactly the Tikhonov regularization [13] imposed on W, where the Tikhonov matrix Γ is given by Γ := Ω_c^{1/2} ⊗ Ω_r^{1/2}. But instead of manually designing the regularization matrix Γ to improve the conditioning of the estimation problem, we propose to also learn both precision matrices (and hence Γ) from data. From an algorithmic perspective, Γ^T Γ = Ω_c ⊗ Ω_r serves as a preconditioning matrix w.r.t. the model parameter W that reshapes the gradient according to the geometry of the data [7, 17, 18].\n\nVolume Minimization. Let us consider the log det(·) function over the positive definite cone. It is well known that the log-determinant function is concave [3]. Hence for any pair of matrices A_1, A_2 ∈ S^m_{++}, the following inequality holds:\n\nlog det(A_1) ≤ log det(A_2) + ⟨∇ log det(A_2), A_1 − A_2⟩ = log det(A_2) + Tr(A_2^{−1} A_1) − m.  (5)\n\nApplying the above inequality twice, fixing A_1 = W Ω_c W^T / 2d, A_2 = Σ_r and A_1 = W^T Ω_r W / 2p, A_2 = Σ_c respectively, leads to the following inequalities:\n\nd log det(W Ω_c W^T / 2d) ≤ −d log det(Ω_r) + (1/2) Tr(Ω_r W Ω_c W^T) − dp,\np log det(W^T Ω_r W / 2p) ≤ −p log det(Ω_c) + (1/2) Tr(Ω_r W Ω_c W^T) − dp.\n\nRealize that Tr(Ω_r W Ω_c W^T) = ||Ω_r^{1/2} W Ω_c^{1/2}||_F^2. Summing the above two inequalities leads to:\n\nd log det(W Ω_c W^T) + p log det(W^T Ω_r W) ≤ ||Ω_r^{1/2} W Ω_c^{1/2}||_F^2 − (d log det(Ω_r) + p log det(Ω_c)) + c,  (6)\n\nwhere c is a constant that only depends on d and p. Recall that |det(A^T A)| computes the squared volume of the parallelepiped spanned by the column vectors of A. Hence (6) gives us a natural interpretation of the objective function in (4): the regularizer essentially upper bounds the log-volume of the two parallelepipeds spanned by the row and column vectors of W. But instead of measuring the volume using the standard Euclidean inner product, it also takes into account the local curvatures defined by Σ_r and Σ_c, respectively. For vectors of fixed lengths, the volume of the parallelepiped spanned by them becomes smaller when they are more linearly correlated, either positively or negatively.\n\n³The constraint uv = 1 is only for ease of presentation in the following part and can be readily removed.
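These identities are straightforward to sanity-check numerically. The following sketch (ours, on randomly generated positive definite matrices) verifies the trace identity and the first of the two log-det inequalities above:

```python
import numpy as np

rng = np.random.default_rng(1)
p, d = 5, 7

def random_spd(n):
    # Random symmetric positive definite matrix.
    A = rng.normal(size=(n, n))
    return A @ A.T + n * np.eye(n)

def spd_sqrt(A):
    # Symmetric square root of an SPD matrix via eigendecomposition.
    lam, Q = np.linalg.eigh(A)
    return Q @ np.diag(np.sqrt(lam)) @ Q.T

W = rng.normal(size=(p, d))
Om_r, Om_c = random_spd(p), random_spd(d)

# Tr(Om_r W Om_c W^T) equals ||Om_r^{1/2} W Om_c^{1/2}||_F^2.
lhs = np.trace(Om_r @ W @ Om_c @ W.T)
rhs = np.linalg.norm(spd_sqrt(Om_r) @ W @ spd_sqrt(Om_c), 'fro') ** 2
assert np.isclose(lhs, rhs)

# First inequality:
# d log det(W Om_c W^T / 2d) <= -d log det(Om_r) + Tr(Om_r W Om_c W^T)/2 - d*p.
logdet = lambda A: np.linalg.slogdet(A)[1]
left = d * logdet(W @ Om_c @ W.T / (2 * d))
right = -d * logdet(Om_r) + 0.5 * lhs - d * p
assert left <= right + 1e-9
```

With p < d and a generic W, the p-by-p matrix W Ω_c W^T is positive definite almost surely, so the log-determinants are well defined.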
At a colloquial level, this means that the regularizer in (4) forces the fan-in/fan-out of neurons at the same layer to be either positively or negatively correlated with each other, and this corresponds exactly to the effect of sharing statistical strength.\n\n3.3 The Algorithm\n\nIn this section we describe a block coordinate descent algorithm to optimize the objective function in (4) and detail how to efficiently solve the matrix optimization subproblems in closed form using tools from convex analysis. Due to the space limit, we defer proofs and detailed derivations to the appendix. Given a pair of constants 0 < u ≤ v, we define the following thresholding function T_{[u,v]}(x):\n\nT_{[u,v]}(x) := max{u, min{v, x}}.  (7)\n\nWe summarize our block coordinate descent algorithm to solve (4) in Alg. 1. In each iteration, Alg. 1 takes a first-order algorithm A, e.g., stochastic gradient descent, to optimize the parameters of the neural network by backpropagation. It then proceeds to compute the optimal solutions for Ω_r and Ω_c using INVTHRESHOLD as a sub-procedure. Alg. 1 terminates when a stationary point is found. We now proceed to show that the procedure INVTHRESHOLD finds the optimal solution given all the other variables fixed. Due to the symmetry between Ω_r and Ω_c in (4), we will only prove this for Ω_r; similar arguments apply to Ω_c as well. Fix both W and Ω_c and ignore all the terms that do not depend on Ω_r; the sub-problem of optimizing Ω_r becomes:\n\nmin_{Ω_r} Tr(Ω_r W Ω_c W^T) − d log det(Ω_r),  subject to u I_p ⪯ Ω_r ⪯ v I_p.  (8)\n\nIt is not hard to show that the optimization problem (8) is convex. Define the constraint set C := {A ∈ S^p_{++} | u I_p ⪯ A ⪯ v I_p} and the indicator function I_C(A) = 0 iff A ∈ C else ∞. Given the convexity of (8), we can use the indicator function to first transform (8) into the following unconstrained problem:\n\nmin_{Ω_r} Tr(Ω_r W Ω_c W^T) − d log det(Ω_r) + I_C(Ω_r).  (9)\n\nThen we can use the first-order optimality condition to characterize the optimal solution:\n\n0 ∈ ∂((1/d) Tr(Ω_r W Ω_c W^T) − log det(Ω_r) + I_C(Ω_r)) = W Ω_c W^T / d − Ω_r^{−1} + N_C(Ω_r),\n\nwhere N_C(A) := {B ∈ S^p | Tr(B^T (Z − A)) ≤ 0, ∀Z ∈ C} is the normal cone w.r.t. C at A. The following key lemma characterizes the structure of the normal cone:\n\nLemma 1. Let Ω_r ∈ C; then N_C(Ω_r) = −N_C(Ω_r^{−1}).\n\nCombining Lemma 1 with the optimality condition, we equivalently have W Ω_c W^T / d − Ω_r^{−1} ∈ N_C(Ω_r^{−1}). Geometrically, this means that the optimal Ω_r^{−1} is the Euclidean projection of W Ω_c W^T / d onto C. Hence in order to solve (9), it suffices to solve the following Euclidean projection problem efficiently, where ˜Ω_r ∈ S^p is a given real symmetric matrix:\n\nmin_{Ω_r} ||Ω_r − ˜Ω_r||_F^2,  subject to u I_p ⪯ Ω_r ⪯ v I_p.  (10)\n\nPerhaps a little surprisingly, we can find the optimal solution to the above Euclidean projection problem efficiently in closed form:\n\nTheorem 1. Let ˜Ω_r ∈ S^p have the eigendecomposition ˜Ω_r = Q Λ Q^T, and let Proj_C(·) be the Euclidean projection operator onto C; then Proj_C(˜Ω_r) = Q T_{[u,v]}(Λ) Q^T.\n\nCorollary 1. Let W Ω_c W^T be eigendecomposed as Q diag(r) Q^T; then the optimal solution to (9) is given by Q T_{[u,v]}(d/r) Q^T.\n\nSimilar arguments can be made to derive the solution for Ω_c in (4). 
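A minimal NumPy sketch of this closed-form update (our code, not the paper's; we use a symmetric eigendecomposition, which coincides with the SVD for the symmetric PSD matrices that arise here):

```python
import numpy as np

def inv_threshold(delta, m, u, v):
    """Closed-form precision update from Corollary 1: eigendecompose
    delta = Q diag(r) Q^T and return Q T_{[u,v]}(m / r) Q^T."""
    r, Q = np.linalg.eigh(delta)
    r = np.maximum(r, 1e-12)          # guard tiny/zero eigenvalues of a PSD input
    return Q @ np.diag(np.clip(m / r, u, v)) @ Q.T

# One outer round of the two precision updates for a fixed weight matrix W
# (the backpropagation step over the network weights is omitted here).
rng = np.random.default_rng(2)
p, d = 4, 6
u, v = 0.5, 2.0                       # satisfies u * v = 1
W = rng.normal(size=(p, d))

Om_r = inv_threshold(W @ np.eye(d) @ W.T, d, u, v)
Om_c = inv_threshold(W.T @ Om_r @ W, p, u, v)

# Both updates land in the feasible set u I <= Om <= v I.
for Om in (Om_r, Om_c):
    eig = np.linalg.eigvalsh(Om)
    assert eig.min() >= u - 1e-9 and eig.max() <= v + 1e-9
```

By construction the output is symmetric with eigenvalues clipped into [u, v], so each update is feasible for the constraint in (4) regardless of the conditioning of the input matrix.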
The final algorithm is very simple: it only involves one SVD per update, so its time complexity is O(max{d³, p³}). Note that the total number of parameters in the network is at least Ω(dp); hence the algorithm is efficient in the sense that it scales sub-quadratically in the number of parameters of the network.\n\nAlgorithm 1 Block Coordinate Descent for Adaptive Regularization\nInput: Initial value φ^{(0)} := {a^{(0)}, W^{(0)}}, Ω_r^{(0)} ∈ S^p_{++} and Ω_c^{(0)} ∈ S^d_{++}, first-order optimization algorithm A.\n1: for t = 1, . . . , ∞ until convergence do\n2:   Fix Ω_r^{(t−1)}, Ω_c^{(t−1)}, optimize φ^{(t)} by backpropagation and algorithm A\n3:   Ω_r^{(t)} ← INVTHRESHOLD(W^{(t)} Ω_c^{(t−1)} W^{(t)T}, d, u, v)\n4:   Ω_c^{(t)} ← INVTHRESHOLD(W^{(t)T} Ω_r^{(t)} W^{(t)}, p, u, v)\n5: end for\n6: procedure INVTHRESHOLD(Δ, m, u, v)\n7:   Compute SVD: Q diag(r) Q^T = SVD(Δ)\n8:   Hard thresholding: r′ ← T_{[u,v]}(m/r)\n9:   return Q diag(r′) Q^T\n10: end procedure\n\n4 Experiments\n\nIn this section we demonstrate the effectiveness of AdaReg in learning practical deep neural networks on real-world datasets. We report generalization, optimization, as well as stability results.\n\n4.1 Experimental Setup\n\nMulticlass Classification (MNIST & CIFAR10): In this experiment, we show that AdaReg provides an effective regularization on the network parameters. To this end, we use a convolutional neural network as our baseline model. To show the effect of regularization, we gradually increase the training set size: in MNIST we vary it from 60 to 60,000 (11 different experiments) and in CIFAR10 from 5,000 to 50,000 (10 different experiments). For each training set size, we repeat the experiments 10 times and report the mean along with its standard deviation. 
Moreover, since both the optimization and generalization of neural networks are sensitive to the size of minibatches [14, 24], we study two minibatch settings of 256 and 2048, respectively. In our method, we place a matrix-variate normal prior over the weight matrix of the last softmax layer, and we use Alg. 1 to optimize both the model weights and the two covariance matrices.\n\nMultitask Regression (SARCOS): SARCOS relates to an inverse dynamics problem for a seven degree-of-freedom (DOF) SARCOS anthropomorphic robot arm [41]. The goal of this task is to map from a 21-dimensional input space (7 joint positions, 7 joint velocities, 7 joint accelerations) to the corresponding 7 joint torques. Hence there are 7 tasks, and the inputs are shared among all the tasks. The training set and test set contain 44,484 and 4,449 examples, respectively. Again, we apply AdaReg on the last-layer weight matrix, where each row corresponds to a separate task vector.\n\nWe compare AdaReg with classic regularization methods in the literature, including weight decay, dropout [39], batch normalization (BN) [22] and the DeCov method [6]. We also note that we fix all the hyperparameters, such as the learning rate, to be the same for all the methods. We report evaluation metrics on the test set as a measure of generalization. To understand how the proposed adaptive regularization helps in optimization, we visualize the trajectory of the loss function during training. Lastly, we also present the inferred correlation of the weight matrix for a qualitative study.\n\n4.2 Results and Analysis\n\nMulticlass Classification (MNIST & CIFAR10): Results on multiclass classification for different training sizes are shown in Fig. 2. For both MNIST and CIFAR10, we find that AdaReg, Weight Decay, and Dropout are the effective regularization methods, while Batch Normalization and DeCov vary across settings. Batch Normalization suffers from large batch size in CIFAR10 (comparing Fig. 
2 (c) and (d)) and is not sensitive to batch size in MNIST (comparing Fig. 2 (a) and (b)). The performance deterioration of Batch Normalization with large batch sizes is also observed by [21]. DeCov, on the other hand, improves generalization in MNIST with batch size 256 (see Fig. 2 (a)), while it demonstrates only comparable or even worse performance in other settings. To conclude, as the training set size grows, AdaReg consistently achieves better generalization compared to other regularization methods. We also note that AdaReg is not sensitive to the size of minibatches, while most of the other methods suffer from large minibatches. In the appendix, we show that combining AdaReg with other regularization methods can usually lead to even better results.\n\nTable 1: Explained variance of different methods on 7 regression tasks from the SARCOS dataset.\n\nMethod        1st     2nd     3rd     4th     5th     6th     7th\nMTL           0.4418  0.3472  0.5222  0.5036  0.6024  0.4727  0.5298\nMTL-Dropout   0.4413  0.3271  0.5202  0.5063  0.6036  0.4711  0.5345\nMTL-BN        0.4768  0.3770  0.5396  0.5216  0.6117  0.4936  0.5479\nMTL-DeCoV     0.4027  0.3137  0.4703  0.4515  0.5229  0.4224  0.4716\nMTL-AdaReg    0.4769  0.3969  0.5485  0.5308  0.6202  0.5085  0.5561\n\nFigure 2: Generalization performance on MNIST and CIFAR10. AdaReg improves generalization under both minibatch settings.\n\n(a) T/B: 600/256  (b) T/B: 6000/256  (c) T/B: 600/2048  (d) T/B: 6000/2048\n\nFigure 3: Optimization trajectory of AdaReg on MNIST with training size/batch size on training and test sets. AdaReg helps to converge to better local optima. Note the log-scale on the y-axis.\n\nMultitask Regression (SARCOS): In this experiment we are interested in investigating whether AdaReg can lead to better generalization for multiple related regression problems. 
To do so, we report the explained variance, i.e., one minus the ratio between the mean squared error and the variance, as a normalized metric for the different methods in Table 1. The larger the explained variance, the better the predictive performance. In this case we observe a consistent improvement of AdaReg over the other competitors on all 7 regression tasks. We would like to emphasize that all the experiments share exactly the same experimental protocol, including network structure, optimization algorithm, training iterations, etc., so that the performance differences can only be explained by the different regularizations. For better visualization, we also plot the results in the appendix.\n\nOptimization: It has recently been empirically shown that BN helps optimization not by reducing internal covariate shift, but instead by smoothing the landscape of the loss function [37]. To understand how AdaReg improves generalization, in Fig. 3 we plot the values of the cross-entropy loss function on both the training and test sets during optimization using Alg. 1. The experiment is performed on MNIST with batch size 256/2048. In this experiment, we fix the number of outer loops to 2/5, and each block optimization over the network weights contains 50 epochs. Because of the stochastic optimization over model weights, we can see several unstable peaks in the function value around iteration 50 when training with AdaReg; these correspond to the transition phase between two consecutive outer loops with different row/column covariance matrices. In all the cases AdaReg converges to better local optima of the loss landscape, which lead to better generalization on the test set as well, since they have smaller loss values on the test set when compared with training without AdaReg.\n\nStable rank and spectral norm: Given a matrix W, the stable rank of W, denoted srank(W), is defined as srank(W) := ||W||_F^2 / ||W||_2^2. As its name suggests, the stable rank is more stable than the rank because it is largely unaffected by tiny singular values. It has recently been shown [34, Theorem 1] that the generalization error of neural networks crucially depends on both the stable ranks and the spectral norms of the connection matrices in the network. Specifically, it can be shown that the generalization error is upper bounded by O(√(∏_{j=1}^L ||W_j||_2^2 · ∑_{j=1}^L srank(W_j) / n)), where L is the number of layers in the network. Essentially, this upper bound suggests that a smaller spectral norm (smoother function mapping) and a smaller stable rank (skewed spectrum) lead to better generalization.\n\n(a) MNIST: S. rank  (b) MNIST: S. norm  (c) CIFAR10: S. rank  (d) CIFAR10: S. norm\n\nFigure 4: Comparisons of stable ranks (S. rank) and spectral norms (S. norm) from different methods on MNIST and CIFAR10. The x-axis corresponds to the training size.\n\n(a) CNN, Acc: 89.34  (b) AdaReg, Acc: 92.50  (c) CNN, Acc: 98.99  (d) AdaReg, Acc: 99.19\n\nFigure 5: Correlation matrix of the weight matrix in the softmax layer. The left two correspond to the dataset with training size 600 and the right two with size 60,000. Acc means the test set accuracy.\n\nTo understand why AdaReg improves generalization, in Fig. 
4, we plot both the stable rank and the spectral norm of the weight matrix in the last layer of the CNNs used in our MNIST and CIFAR10 experiments. We compare three methods: a CNN without any regularization, a CNN trained with weight decay, and a CNN trained with AdaReg. For each setting we repeat the experiment five times, and we plot the mean along with its standard deviation. From Fig. 4a and Fig. 4c it is clear that AdaReg leads to a significant reduction in the stable rank when compared with weight decay, and this effect is consistent across all experiments with different training sizes. Similarly, in Fig. 4b and Fig. 4d we plot the spectral norm of the weight matrix. Again, both weight decay and AdaReg help reduce the spectral norm in all settings, but AdaReg plays a more significant role than the usual weight decay. Combining these experiments with the generalization upper bound introduced above, we can see that training with AdaReg leads to an estimator of W with lower stable rank and smaller spectral norm, which explains why it achieves better generalization performance.
Furthermore, this observation holds on the SARCOS dataset as well. For the SARCOS dataset, the weight matrix being regularized is of dimension 100 × 7. Again, we compare the results using three methods: MTL, MTL-WeightDecay and MTL-AdaReg.
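Both statistics can be computed directly from a weight matrix via its singular values. The following is a minimal numpy sketch (not the authors' released code; the function names are ours), using the definition srank(W) = ||W||_F^2 / ||W||_2^2:

```python
import numpy as np

def spectral_norm(W):
    # ||W||_2: the largest singular value of W.
    return np.linalg.norm(W, ord=2)

def stable_rank(W):
    # srank(W) = ||W||_F^2 / ||W||_2^2.
    # Unlike the rank, this is largely unaffected by tiny singular values.
    fro_sq = np.linalg.norm(W, ord='fro') ** 2
    spec_sq = spectral_norm(W) ** 2
    return fro_sq / spec_sq

# Example on a random matrix with the SARCOS softmax-layer shape (100 x 7).
rng = np.random.default_rng(0)
W = rng.standard_normal((100, 7))
print(stable_rank(W), spectral_norm(W))
```

Note that srank(W) is always between 1 (rank-one W) and the rank of W, with equality to the rank when all nonzero singular values are equal.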
As can be observed from Table 2, compared with the weight decay regularization, AdaReg substantially reduces both the stable rank and the spectral norm of the learned weight matrix, which also helps to explain why MTL-AdaReg generalizes better than MTL and MTL-WeightDecay.

Table 2: Stable rank and spectral norm on SARCOS.

                 MTL    MTL-WeightDecay    MTL-AdaReg
Stable Rank      4.48   4.83               2.88
Spectral Norm    0.96   0.92               0.70

Correlation Matrix: To verify that AdaReg imposes the effect of "sharing statistical strength" during training, we visualize the weight matrix of the softmax layer by computing the corresponding correlation matrix, as shown in Fig. 5, where darker color means stronger correlation. We conduct two experiments, with training sizes 600 and 60,000 respectively. As we can observe, training with AdaReg leads to a weight matrix with stronger correlations, and this effect is more evident when the training set is large. This is consistent with our analysis of sharing statistical strength. As a sanity check, from Fig. 5 we can also see that similar digits, e.g., 1 and 7, share a positive correlation, while dissimilar ones, e.g., 1 and 8, share a negative correlation.

5 Related Work

The Empirical Bayes Method vs Bayesian Neural Networks  Despite the name, the empirical Bayes method is in fact a frequentist approach for obtaining estimators with favorable properties.
On the other hand, truly Bayesian inference would instead place a posterior distribution over the model weights to characterize the uncertainty during training [2, 20, 30]. However, due to the complexity of nonlinear neural networks, an analytic posterior is not available, hence strong independence assumptions over the model weights have to be made in order to achieve a computationally tractable variational solution. Typically, both the prior and the variational posterior are assumed to fully factorize over the model weights. As an exception, Louizos and Welling [29] and Sun et al. [40] seek to learn Bayesian neural nets where they approximate the intractable posterior distribution using a matrix-variate Gaussian distribution. The prior for the weights is still assumed to be known and fixed. As a comparison, we use a matrix-variate Gaussian as the prior distribution, and we learn the hyperparameters in the prior from data. Hence our method does not belong to Bayesian neural nets: we instead use the empirical Bayes principle to derive an adaptive regularization method in order to achieve better generalization, as done in [4, 35].

Regularization Techniques in Deep Learning  Different kinds of regularization approaches have been studied and designed for neural networks, e.g., weight decay [26], early stopping [5], Dropout [39] and the more recent DeCov [6] method. BN was proposed to reduce the internal covariate shift during training, but it has recently been shown empirically to instead smooth the landscape of the loss function [37]. As a comparison, we propose AdaReg as an adaptive regularization method, with the aim of reducing overfitting by allowing neurons to share statistical strength. From
From\nthe optimization perspective, learning the row and column covariance matrices help to converge to\nbetter local optimum that also generalizes better.\n\nKronecker Factorization in Optimization The Kronecker factorization assumption has also been\napplied in the literature of neural networks to approximate the Fisher information matrix in second-\norder optimization methods [31, 42]. The main idea here is to approximate the curvature of the loss\nfunction\u2019s landscape, in order to achieve better convergence speed compared with \ufb01rst-order method\nwhile maintaining the tractability of such computation. Different from these work, here in our method\nwe assume a Kronecker factorization structure on the covariance matrix of the prior distribution, not\nthe Fisher information matrix of the log-likelihood function. Furthermore, we also derive closed-form\nsolutions to optimize these factors without any kind of approximations.\n\n6 Conclusion\n\nInspired by empirical Bayes method, in this paper we propose an adaptive regularization (AdaReg)\nwith matrix-variate normal prior for model parameters in deep neural networks. The prior encourages\nneurons to borrow statistical strength from other neurons during the learning process, and it provides\nan effective regularization when training networks on small datasets. To optimize the model, we\ndesign an ef\ufb01cient block coordinate descent algorithm to learn both model weights and the covariance\nstructures. Empirically, on three datasets we demonstrate that AdaReg improves generalization by\n\ufb01nding better local optima with smaller spectral norms and stable ranks. We believe our work takes\nan important step towards exploring the combination of ideas from the empirical Bayes literature\nand rich prediction models like deep neural networks. 
One interesting direction for future work is\nto extend the current approach to online setting where we only have access to one training instance\nat a time, and to analyze the property of such method in terms of regret analysis with adaptive\noptimization methods.\n\nAcknowledgments\n\nHZ and GG would like to acknowledge support from the DARPA XAI project, contract\n#FA87501720152 and an Nvidia GPU grant. YT and RS were supported in part by DARPA\ngrant FA875018C0150, DARPA SAGAMORE HR00111990016, Of\ufb01ce of Naval Research grant\nN000141812861, AFRL CogDeCON, and Apple. YT and RS would also like to acknowledge\nNVIDIA\u2019s GPU support. Last, we thank Denny Wu for suggestions on exploring and analyzing our\nalgorithm in terms of stable rank.\n\n9\n\n\fReferences\n[1] Jos\u00e9 M Bernardo and Adrian FM Smith. Bayesian theory, 2001.\n\n[2] Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra. Weight uncertainty\n\nin neural networks. arXiv preprint arXiv:1505.05424, 2015.\n\n[3] Stephen Boyd and Lieven Vandenberghe. Convex optimization. Cambridge university press,\n\n2004.\n\n[4] Philip J Brown, James V Zidek, et al. Adaptive multivariate ridge regression. The Annals of\n\nStatistics, 8(1):64\u201374, 1980.\n\n[5] Rich Caruana, Steve Lawrence, and C Lee Giles. Over\ufb01tting in neural nets: Backpropagation,\nconjugate gradient, and early stopping. In Advances in neural information processing systems,\npages 402\u2013408, 2001.\n\n[6] Michael Cogswell, Faruk Ahmed, Ross Girshick, Larry Zitnick, and Dhruv Batra. Reducing\nover\ufb01tting in deep networks by decorrelating representations. arXiv preprint arXiv:1511.06068,\n2015.\n\n[7] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning\nand stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121\u20132159, 2011.\n\n[8] Bradley Efron. Large-scale inference: empirical Bayes methods for estimation, testing, and\n\nprediction, volume 1. 
Cambridge University Press, 2012.

[9] Bradley Efron and Trevor Hastie. Computer age statistical inference, volume 5. Cambridge University Press, 2016.

[10] Bradley Efron and Carl Morris. Stein's estimation rule and its competitors—an empirical Bayes approach. Journal of the American Statistical Association, 68(341):117–130, 1973.

[11] Bradley Efron and Carl Morris. Stein's paradox in statistics. Scientific American, 236(5):119–127, 1977.

[12] Andrew Gelman, John B Carlin, Hal S Stern, David B Dunson, Aki Vehtari, and Donald B Rubin. Bayesian data analysis. CRC Press, 2013.

[13] Gene H Golub, Michael Heath, and Grace Wahba. Generalized cross-validation as a method for choosing a good ridge parameter. Technometrics, 21(2):215–223, 1979.

[14] Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch SGD: training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.

[15] Erin Grant, Chelsea Finn, Sergey Levine, Trevor Darrell, and Thomas Griffiths. Recasting gradient-based meta-learning as hierarchical Bayes. arXiv preprint arXiv:1801.08930, 2018.

[16] Arjun K Gupta and Daya K Nagar. Matrix variate distributions. Chapman and Hall/CRC, 2018.

[17] Vineet Gupta, Tomer Koren, and Yoram Singer. A unified approach to adaptive regularization in online and stochastic optimization. arXiv preprint arXiv:1706.06569, 2017.

[18] Elad Hazan, Amit Agarwal, and Satyen Kale. Logarithmic regret algorithms for online convex optimization. Machine Learning, 69(2-3):169–192, 2007.

[19] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

[20] José Miguel Hernández-Lobato and Ryan Adams.
Probabilistic backpropagation for scalable learning of Bayesian neural networks. In International Conference on Machine Learning, pages 1861–1869, 2015.

[21] Elad Hoffer, Itay Hubara, and Daniel Soudry. Train longer, generalize better: closing the generalization gap in large batch training of neural networks. In Advances in Neural Information Processing Systems, pages 1731–1741, 2017.

[22] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.

[23] William James and Charles Stein. Estimation with quadratic loss. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pages 361–379, 1961.

[24] Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. arXiv preprint arXiv:1609.04836, 2016.

[25] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. 2009.

[26] Anders Krogh and John A Hertz. A simple weight decay can improve generalization. In Advances in Neural Information Processing Systems, pages 950–957, 1992.

[27] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436, 2015.

[28] Mingsheng Long, Zhangjie Cao, Jianmin Wang, and S Yu Philip. Learning multiple tasks with multilinear relationship networks. In Advances in Neural Information Processing Systems, pages 1594–1603, 2017.

[29] Christos Louizos and Max Welling. Structured and efficient variational deep learning with matrix Gaussian posteriors. In International Conference on Machine Learning, pages 1708–1716, 2016.

[30] David JC MacKay. A practical Bayesian framework for backpropagation networks.
Neural Computation, 4(3):448–472, 1992.

[31] James Martens and Roger Grosse. Optimizing neural networks with Kronecker-factored approximate curvature. In International Conference on Machine Learning, pages 2408–2417, 2015.

[32] Sanjay Mehrotra. On the implementation of a primal-dual interior point method. SIAM Journal on Optimization, 2(4):575–601, 1992.

[33] Vinod Nair and Geoffrey E Hinton. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 807–814, 2010.

[34] Behnam Neyshabur, Srinadh Bhojanapalli, David McAllester, and Nathan Srebro. A PAC-Bayesian approach to spectrally-normalized margin bounds for neural networks. arXiv preprint arXiv:1707.09564, 2017.

[35] Samuel D Oman. A different empirical Bayes interpretation of ridge and Stein estimators. Journal of the Royal Statistical Society: Series B (Methodological), 46(3):544–557, 1984.

[36] Herbert Robbins. An empirical Bayes approach to statistics. Technical report, Columbia University, New York City, United States, 1956.

[37] Shibani Santurkar, Dimitris Tsipras, Andrew Ilyas, and Aleksander Madry. How does batch normalization help optimization? (No, it is not about internal covariate shift). arXiv preprint arXiv:1805.11604, 2018.

[38] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

[39] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.

[40] Shengyang Sun, Changyou Chen, and Lawrence Carin. Learning structured weight uncertainty in Bayesian neural networks.
In Artificial Intelligence and Statistics, pages 1283–1292, 2017.

[41] Sethu Vijayakumar and Stefan Schaal. Locally weighted projection regression: Incremental real time learning in high dimensional space. In Proceedings of the Seventeenth International Conference on Machine Learning, pages 1079–1086. Morgan Kaufmann Publishers Inc., 2000.

[42] Guodong Zhang, Shengyang Sun, David Duvenaud, and Roger Grosse. Noisy natural gradient as variational inference. arXiv preprint arXiv:1712.02390, 2017.

[43] Han Zhao, Otilia Stretcu, Alex Smola, and Geoff Gordon. Efficient multitask feature and relationship learning. In Proceedings of the Thirty-Fifth Conference on Uncertainty in Artificial Intelligence. AUAI Press, 2019.