{"title": "Meta-Curvature", "book": "Advances in Neural Information Processing Systems", "page_first": 3314, "page_last": 3324, "abstract": "We propose meta-curvature (MC), a framework to learn curvature information for better generalization and fast model adaptation. MC expands on the model-agnostic meta-learner (MAML) by learning to transform the gradients in the inner optimization such that the transformed gradients achieve better generalization performance to a new task. For training large scale neural networks, we decompose the curvature matrix into smaller matrices in a novel scheme where we capture the dependencies of the model's parameters with a series of tensor products. We demonstrate the effects of our proposed method on several few-shot learning tasks and datasets. Without any task specific techniques and architectures, the proposed method achieves substantial improvement upon previous MAML variants and outperforms the recent state-of-the-art methods. Furthermore, we observe faster convergence rates of the meta-training process. Finally, we present an analysis that explains better generalization performance with the meta-trained curvature.", "full_text": "Meta-Curvature\n\nEunbyung Park\n\nDepartment of Computer Science\n\nJunier B. Oliva\n\nDepartment of Computer Science\n\nUniversity of North Carolina at Chapel Hill\n\nUniversity of North Carolina at Chapel Hill\n\neunbyung@cs.unc.edu\n\njoliva@cs.unc.edu\n\nAbstract\n\nWe propose meta-curvature (MC), a framework to learn curvature information\nfor better generalization and fast model adaptation. MC expands on the model-\nagnostic meta-learner (MAML) by learning to transform the gradients in the inner\noptimization such that the transformed gradients achieve better generalization\nperformance to a new task. 
For training large scale neural networks, we decompose the curvature matrix into smaller matrices in a novel scheme where we capture the dependencies of the model's parameters with a series of tensor products. We demonstrate the effects of our proposed method on several few-shot learning tasks and datasets. Without any task specific techniques and architectures, the proposed method achieves substantial improvement upon previous MAML variants and outperforms the recent state-of-the-art methods. Furthermore, we observe faster convergence rates of the meta-training process. Finally, we present an analysis that explains better generalization performance with the meta-trained curvature.

1 Introduction

Despite huge progress in artificial intelligence, the ability to quickly learn from few examples is still far short of that of a human. We are capable of utilizing prior knowledge from past experiences to efficiently learn new concepts or skills. With the goal of building machines with this capability, learning-to-learn, or meta-learning, has begun to emerge with promising results.

One notable example is model-agnostic meta-learning (MAML) [9, 30], which has shown its effectiveness on various few-shot learning tasks. It formalizes learning-to-learn as a meta-objective function and optimizes it with respect to a model's initial parameters. Through the meta-training procedure, the resulting model's initial parameters become a very good prior representation, and the model can quickly adapt to new tasks or skills through one or more gradient steps with a few examples. Although this end-to-end approach, using standard gradient descent as the inner optimization algorithm, was theoretically shown to approximate any learning algorithm [10], recent studies indicate that the choice of the inner-loop optimization algorithm affects performance [22, 4, 13].

Given the sensitivity to the inner-loop optimization algorithm, second order optimization methods (or preconditioning the gradients) are worth considering. They have been extensively studied and have shown practical benefits in terms of faster convergence rates [31], an important aspect of few-shot learning. In addition, the problems of computational and spatial complexity for training deep networks can be effectively handled thanks to recent approximation techniques [24, 38]. Nevertheless, there are issues with using second order methods in their current form as inner loop optimizers in the meta-learning framework. First, they do not usually consider generalization performance. They compute local curvatures with training losses and move along the curvatures as far as possible. This can be very harmful, especially in the few-shot learning setup, because the model can overfit easily and quickly.

The code is available at https://github.com/silverbottlep/meta_curvature

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

In this work, we propose to learn a curvature for better generalization and faster model adaptation in the meta-learning framework, which we call meta-curvature. The key intuition behind MAML is that there are some representations that are broadly applicable to all tasks. In the same spirit, we hypothesize that there are some curvatures that are broadly applicable to many tasks. Curvatures are determined by the model's parameters, network architectures, loss functions, and training data. Assuming new tasks are drawn from a distribution similar to the meta-training distribution, there may exist common curvatures that can be obtained through the meta-training procedure.
The resulting meta-curvatures, coupled with the simultaneously meta-trained model's initial parameters, will transform the gradients such that the updated model has better performance on new tasks with fewer gradient steps. In order to efficiently capture the dependencies between all gradient coordinates for large networks, we design a multilinear mapping consisting of a series of tensor products to transform the gradients. It also considers layer specific structures, e.g. convolutional layers, to effectively reflect our inductive bias. In addition, meta-curvature can be easily implemented (simply transform the gradients right before passing them to the optimizer) and can be plugged into existing meta-learning frameworks like MAML without additional, burdensome higher-order gradients.

We demonstrate the effectiveness of our proposed method on the few-shot learning tasks studied by [44, 34, 9]. We evaluated our methods on few-shot regression and few-shot classification tasks over the Omniglot [19], miniImagenet [44], and tieredImagenet [35] datasets. Experimental results show significant improvements over other MAML variants on all few-shot learning tasks. In addition, MC's simple gradient transformation outperformed other more complicated state-of-the-art methods that include additional bells and whistles.

2 Background

2.1 Tensor Algebra

We review basics of tensor algebra that will be used to formalize the proposed method. We refer the reader to [17] for a more comprehensive review. Throughout the paper, tensors are defined as multi-dimensional arrays and denoted by calligraphic letters, e.g. an Nth-order tensor X ∈ R^{I1×I2×···×IN}. Matrices are second-order tensors and denoted by boldface uppercase letters, e.g. X ∈ R^{I1×I2}.

Fibers: Fibers are a higher-order generalization of matrix rows and columns. A matrix column is a mode-1 fiber and a matrix row is a mode-2 fiber.
The mode-1 fibers of a third order tensor X are denoted as X_{:,j,k}, where a colon is used to denote all elements of a mode.

Tensor unfolding: Also known as flattening (reshaping) or matricization, this is the operation of arranging the elements of a higher-order tensor into a matrix. The mode-n unfolding of an Nth-order tensor X ∈ R^{I1×I2×···×IN} arranges the mode-n fibers to be the columns of the matrix, denoted by X_[n] ∈ R^{In×IM}, where IM = ∏_{k≠n} Ik. The elements X_{i1,i2,...,iN} of the tensor are mapped to (X_[n])_{in,j}, where j = 1 + Σ_{k=1,k≠n}^{N} (ik − 1)Jk, with Jk = ∏_{m=1,m≠n}^{k−1} Im.

n-mode product: This defines the product between tensors and matrices. The n-mode product of a tensor X ∈ R^{I1×I2×···×IN} with a matrix M ∈ R^{J×In} is denoted by X ×n M and computed as

(X ×n M)_{i1,...,in−1,j,in+1,...,iN} = Σ_{in=1}^{In} X_{i1,i2,...,iN} M_{j,in}.  (1)

More concisely, it can be written as (X ×n M)_[n] = M X_[n], with X ×n M ∈ R^{I1×···×In−1×J×In+1×···×IN}. Despite the cumbersome notation, it is simply a mode-n unfolding (reshaping) followed by matrix multiplication.

2.2 Model-Agnostic Meta-Learning (MAML)

MAML aims to find a transferable initialization (a prior representation) of any model such that the model can adapt quickly from the initialization and produce good generalization performance on new tasks. The meta-objective is defined as validation performance after one or a few gradient updates from the model's initial parameters. By using gradient descent algorithms to optimize the meta-objective, its training algorithm usually takes the form of nested gradient updates: inner updates for model adaptation to a task and outer updates for the model's initialization parameters.
Formally,

min_θ E_{τi}[ L^{τi}_val( θ − α∇L^{τi}_tr(θ) ) ],  (2)

where θ − α∇L^{τi}_tr(θ) is the inner update, L^{τi}_val(·) denotes the loss function on the validation set of a task τi, and L^{τi}_tr(·) the loss on the training set (Ltr(·) for brevity). The inner update is defined as standard gradient descent with a fixed learning rate α. For conciseness, we assume a single adaptation step, but it can be easily extended to more steps. For more details, we refer to [9]. Several variations of the inner update rule have been suggested. Meta-SGD [22] suggested coordinate-wise learning rates, θ − α ◦ ∇Ltr, where α is a learnable parameter and ◦ is the element-wise product. Recently, [4] proposed a learnable learning rate per layer for more flexible model adaptation. To alleviate computational complexity, [30] suggested an algorithm that does not require higher order gradients.

2.3 Second order optimization

The biggest motivation for second order methods is that first-order optimization such as standard gradient descent performs poorly if the Hessian of the loss function is ill-conditioned, e.g. a long narrow valley loss surface. There is a plethora of work that tries to accelerate gradient descent by considering local curvatures. Most notably, the update rule of Newton's method can be written as θ − αH^{-1}∇Ltr, with Hessian matrix H and a step size α [31]. At every step, it minimizes a local quadratic approximation of the loss function, and the local curvature is encoded in the Hessian matrix. Another promising approach, especially in the neural network literature, is natural gradient descent [2]. It finds the steepest descent direction in distribution space rather than parameter space by measuring KL-divergence as a distance metric.
Similar to Newton's method, it preconditions the gradient with the Fisher information matrix, and a common update rule is θ − αF^{-1}∇Ltr. In order to mitigate computational and spatial issues for large scale problems, several approximation techniques have been proposed, such as online update methods [31, 38], Kronecker-factored approximations [24], and diagonal approximations of second order matrices [43, 16, 8].

3 Meta-Curvature

We propose to learn a curvature along with the model's initial parameters simultaneously via the meta-learning process. The goal is that the meta-learned curvature works collaboratively with the meta-learned model's initial parameters to produce good generalization performance on new tasks with fewer gradient steps. In this work, we focus on learning a meta-curvature and its efficient forms to scale to large networks. We follow the meta-training algorithms suggested in [9], and the proposed method can be easily plugged in.

3.1 Motivation

We begin with the hypothesis that there are curvatures broadly applicable to many tasks. In training a neural network with a loss function, local curvatures are determined by the model's parameters, the network architecture, the loss function, and the training data. Since new tasks are sampled from the same or similar distributions and all other factors are fixed, it is intuitive that there may exist some curvatures, found via meta-training, that can be effectively applied to the new tasks. Throughout the meta-training, we can observe how the gradients affect the validation performance and use those experiences to learn how to transform or correct the gradients for a new task.

We take a learning approach because existing curvature estimates, e.g. the Hessian and the Fisher information matrix, do not consider generalization performance. The local curvatures are approximated with only the current training data and loss functions.
Therefore, these methods may end up converging fast to a poor local minimum. This is especially true when we have few training examples.

3.2 Method

First, we present a simple and efficient form of the meta-curvature computation through the lens of tensor algebra. Then, we present a matrix-vector product view to provide an intuitive idea of the connection to second order matrices. Lastly, we discuss the relationships to other methods.

3.2.1 Tensor product view

We consider neural networks as our models. With a slight abuse of notation, let W^l ∈ R^{C^l_out×C^l_in×d^l} be the model's parameters at each layer l, and G^l ∈ R^{C^l_out×C^l_in×d^l} the gradients of the loss function. To avoid cluttered notation, we will omit the superscript l. We choose superscripts and dimensions with 2D convolutional layers in mind, but the method can be easily extended to higher dimensional convolutional layers or other layers that consist of higher dimensional parameters. Cout, Cin, and d are the number of output channels, the number of input channels, and the filter size respectively. d is height × width in convolutional layers and 1 in fully connected layers. We also define meta-curvature matrices Mo ∈ R^{Cout×Cout}, Mi ∈ R^{Cin×Cin}, and Mf ∈ R^{d×d}. Now the meta-curvature function takes a multidimensional tensor as input and has all meta-curvature matrices as learnable parameters:

MC(G) = G ×3 Mf ×2 Mi ×1 Mo.  (3)

Figure 1: An example of the meta-curvature computation with G ∈ R^{2×3×d}. Top: tensor algebra view. Bottom: matrix-vector product view.

Figure 1 (top) shows an example of the computation with an input tensor G ∈ R^{2×3×d}. First, it performs linear transformations of all 3-mode fibers of G. In other words, Mf captures the parameter dependencies between the elements within a 3-mode fiber, e.g.
all gradient elements in a channel of a convolutional filter. Secondly, the 2-mode product models the dependencies between the 3-mode fibers computed in the previous stage. All 3-mode fibers are updated by linear combinations of the other 3-mode fibers belonging to the same output channel (linear combinations of the 3-mode fibers in a convolutional filter). Finally, the 1-mode product is performed in order to model the dependencies between the gradients of all convolutional filters. Similarly, the gradients of all convolutional filters are updated by linear combinations of the gradients of the other convolutional filters.

A useful property of n-mode products is the fact that the order of the multiplications is irrelevant for distinct modes in a series of multiplications. For example, G ×3 Mf ×2 Mi ×1 Mo = G ×1 Mo ×2 Mi ×3 Mf. Thus, the proposed method indeed examines the dependencies of all the elements in the gradient together.

3.2.2 Matrix-vector product view

We can also view the proposed meta-curvature computation as a matrix-vector product analogous to that of other second order methods. Note that this is for the purpose of intuitive illustration; we cannot compute or maintain these large matrices for large deep networks. We can expand the meta-curvature matrices as follows:

M̂o = Mo ⊗ I_{Cin} ⊗ I_d,  M̂i = I_{Cout} ⊗ Mi ⊗ I_d,  M̂f = I_{Cout} ⊗ I_{Cin} ⊗ Mf,  (4)

where ⊗ is the Kronecker product, I_k is the k dimensional identity matrix, and the three expanded matrices all have the same size, M̂o, M̂i, M̂f ∈ R^{CoutCind×CoutCind}. Now we can transform the gradients with the meta-curvature, where Mmc = M̂o M̂i M̂f. The expanded matrices satisfy the commutative property, e.g. M̂o M̂i M̂f = M̂f M̂i M̂o, as shown in the previous section.
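This series of mode products, and its equivalence to a single Kronecker-structured matrix-vector product, can be checked numerically. The sketch below assumes NumPy; function and variable names are ours, and vec(·) is taken as row-major flattening so that Mmc matches `np.kron`'s ordering:

```python
import numpy as np

def mode_product(T, M, mode):
    """n-mode product: unfold along `mode`, left-multiply by M, fold back."""
    T = np.moveaxis(T, mode, 0)                  # bring the mode to the front
    shape = T.shape
    out = M @ T.reshape(shape[0], -1)            # multiply the mode-n unfolding
    return np.moveaxis(out.reshape((M.shape[0],) + shape[1:]), 0, mode)

def meta_curvature(G, Mo, Mi, Mf):
    """MC(G) = G x3 Mf x2 Mi x1 Mo."""
    return mode_product(mode_product(mode_product(G, Mf, 2), Mi, 1), Mo, 0)

rng = np.random.default_rng(0)
Cout, Cin, d = 2, 3, 4
G  = rng.standard_normal((Cout, Cin, d))
Mo = rng.standard_normal((Cout, Cout))
Mi = rng.standard_normal((Cin, Cin))
Mf = rng.standard_normal((d, d))

# Order of distinct modes is irrelevant: x3, x2, x1 equals x1, x2, x3.
other_order = mode_product(mode_product(mode_product(G, Mo, 0), Mi, 1), Mf, 2)
assert np.allclose(meta_curvature(G, Mo, Mi, Mf), other_order)

# Matrix-vector view: vec(MC(G)) = (Mo kron Mi kron Mf) vec(G),
# with vec(.) as row-major flattening (NumPy reshape(-1)).
Mmc = np.kron(np.kron(Mo, Mi), Mf)
assert np.allclose(meta_curvature(G, Mo, Mi, Mf).reshape(-1), Mmc @ G.reshape(-1))
```

The second assertion is exactly the block-structure argument of the matrix-vector product view: the giant Kronecker matrix is never needed in practice, since the three small mode products compute the same result.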
Thus, Mmc models the dependencies of the model parameters all together: we can transform the gradients with the meta-curvature as

vec(MC(G)) = Mmc vec(G).  (5)

Note that we can also write Mmc = Mo ⊗ Mi ⊗ Mf, but this form is non-commutative, Mo ⊗ Mi ⊗ Mf ≠ Mf ⊗ Mi ⊗ Mo.

Figure 1 (bottom) shows a computational illustration. M̂f vec(G), which is an equivalent computation to G ×3 Mf, can be interpreted as a giant matrix-vector multiplication with a block diagonal matrix, where each block shares the same meta-curvature matrix Mf. It resembles the block diagonal approximation strategies of some second-order methods for training deep networks, but since we are interested in learning the meta-curvature matrices, no approximation is involved. The matrix-vector products with M̂o and M̂i are used to capture inter-parameter dependencies and are computationally equivalent to the 1-mode and 2-mode products of Eq. 3.

3.2.3 Relationship to other methods

Tucker decomposition [17] decomposes a tensor into a low rank core with projection factors and aims to closely reconstruct the original tensor. We maintain full rank gradient tensors, however, and our main goal is to transform the gradients for better generalization. [18] proposed to learn the projection factors of a Tucker decomposition for fully connected layers in deep networks. Again, their goal was to find low rank approximations of fully connected layers to save computational and spatial cost.

Kronecker-factored Approximate Curvature (K-FAC) [24, 14] approximates the Fisher matrix by a Kronecker product, e.g. F ≈ A ⊗ G, where A is computed from the activations of input units and G is computed from the gradients of output units.
Its main goal is to approximate the Fisher matrix such that matrix-vector products between its inverse and the gradient can be computed efficiently. However, we found that maintaining A ∈ R^{Cin·d×Cin·d} was quite expensive both computationally and spatially, even for smaller networks. In addition, when we applied this factorization scheme to meta-curvature, it tended to easily overfit to the meta-training set. On the contrary, we maintain two separate matrices, Mi ∈ R^{Cin×Cin} and Mf ∈ R^{d×d}, which allows us to avoid overfitting and heavy computation. More importantly, we learn the meta-curvature matrices to improve generalization instead of directly computing them from the activations and the gradients of the training loss. Also, we do not require expensive matrix inversions.

3.2.4 Meta-training

We follow a typical meta-training algorithm and initialize all meta-curvature matrices as identity matrices so that the gradients are unchanged at the beginning. We used the ADAM [16] optimizer for the outer loop optimization and update the model's initial parameters and meta-curvatures simultaneously. We provide the details of the algorithm in the appendices.

4 Analysis

In this section, we explore how a meta-trained matrix Mmc, or M for brevity, can operate for better generalization. Let us take the gradient of the meta-objective w.r.t. M for a task τi. With the inner update rule θ_τi(M) = θ − αM∇_θ L^{τi}_tr(θ), and by applying the chain rule,

∇_M L^{τi}_val(θ_τi(M)) = −α ∇_{θ_τi} L^{τi}_val(θ_τi) ∇_θ L^{τi}_tr(θ)^⊤,  (6)

where θ_τi is the parameter for the task τi after the inner update. It is the outer product between the gradients of the validation loss and the training loss. Note that there is a significant connection to the Fisher information matrix.
For a task τi, if we define the loss function as a negative log likelihood, e.g. a supervised classification task L^{τi}(θ) = E_{(x,y)∼p(τi)}[−log p_θ(y|x)], then the empirical Fisher can be defined as F = E_{(x,y)∼p(τi)}[∇_θ log p_θ(y|x) ∇_θ log p_θ(y|x)^⊤]. There are three clear distinctions. First, the training and validation sets are treated separately in the meta-gradient ∇_M L^{τi}_val, while the empirical Fisher is computed with only the training set (the validation set is not available during training). Secondly, in the meta-gradient, the gradient of the validation set is evaluated at the new parameters θ_τi obtained after the inner update. Finally, the Fisher is positive semi-definite by construction, but this is not the case for the meta-gradient. Positive semi-definiteness is an attractive property since it guarantees that the transformed gradient is always a descent direction. However, we mainly care about generalization performance in this work. Hence, we do not force this property here, but leave it for future work.

Now let us consider what the meta-gradient can do for good generalization performance. Given a fixed point θ and a meta-training set T = {τi}, standard gradient descent from an initialization M gives the following update:

M_T = M − β Σ_{i=1}^{|T|} ∇_M L^{τi}_val(θ_τi(M)) = M + αβ Σ_{i=1}^{|T|} ∇_θ L^{τi}_val(θ_τi(M)) ∇_θ L^{τi}_tr(θ)^⊤,  (7)

where α and β are fixed inner/outer learning rates respectively. Here, we assume standard gradient descent for simplicity.
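The outer-product form of Eq. 6 can be verified against finite differences on a toy pair of quadratic training/validation losses (a sketch; the losses, shapes, and names below are ours, not from the paper):

```python
# Check Eq. 6: with inner update theta' = theta - alpha * M @ g_tr,
# the meta-gradient w.r.t. M is  dL_val/dM = -alpha * outer(g_val(theta'), g_tr).
import numpy as np

rng = np.random.default_rng(1)
n, alpha = 3, 0.1
A, b = rng.standard_normal((5, n)), rng.standard_normal(5)   # "training" loss data
C, d = rng.standard_normal((5, n)), rng.standard_normal(5)   # "validation" loss data
theta = rng.standard_normal(n)
M = np.eye(n)                                                # identity init, as in Sec. 3.2.4

L_val = lambda t: 0.5 * np.sum((C @ t - d) ** 2)
g_tr = A.T @ (A @ theta - b)                     # grad of training loss at theta
theta_new = theta - alpha * M @ g_tr             # inner update
g_val = C.T @ (C @ theta_new - d)                # grad of validation loss at theta'

analytic = -alpha * np.outer(g_val, g_tr)        # Eq. 6

# Finite-difference gradient w.r.t. each entry of M.
eps, numeric = 1e-6, np.zeros((n, n))
for i in range(n):
    for j in range(n):
        Mp = M.copy(); Mp[i, j] += eps
        numeric[i, j] = (L_val(theta - alpha * Mp @ g_tr) - L_val(theta_new)) / eps

assert np.allclose(analytic, numeric, atol=1e-4)
```

The check makes the separation explicit: g_tr is evaluated at θ on training data, while g_val is evaluated at the adapted θ' on held-out data, which is precisely the first two distinctions from the empirical Fisher noted above.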
But the argument extends to other advanced gradient algorithms, such as momentum and ADAM.

We apply M_T to the gradients of a new task, giving the transformed gradients

M_T ∇_θ L^{τnew}_tr(θ) = ( M + αβ Σ_{i=1}^{|T|} ∇_θ L^{τi}_val(θ_τi) ∇_θ L^{τi}_tr(θ)^⊤ ) ∇_θ L^{τnew}_tr(θ)  (8)

= M ∇_θ L^{τnew}_tr(θ) + β Σ_{i=1}^{|T|} ( ∇_θ L^{τi}_tr(θ)^⊤ ∇_θ L^{τnew}_tr(θ) ) α ∇_θ L^{τi}_val(θ_τi)  (9)

= M ∇_θ L^{τnew}_tr(θ) + β Σ_{i=1}^{|T|} ( ∇_θ L^{τi}_tr(θ)^⊤ ∇_θ L^{τnew}_tr(θ) ) [A. gradient similarity] · ( α ∇_θ L^{τi}_val(θ) + O(α²) ) [B. Taylor expansion].  (10)

Given M = I, the second term on the R.H.S. of Eq. 10 can represent the final gradient direction for the new task. For Eq. 10, we used the Taylor expansion of the vector-valued function, ∇_θ L^{τi}_val(θ_τi) ≈ ∇_θ L^{τi}_val(θ) + ∇²_θ L^{τi}_val(θ)(θ − αM∇_θ L^{τi}_tr(θ) − θ).

The term A of Eq. 10 is the inner product between the gradients of the meta-training losses and the new task's training loss. We can simply interpret this as how similar the gradient directions of the two tasks are. This has been explicitly used in continual learning or multi-task learning setups to consider task similarity [7, 23, 36]. When we have a loss function in the form of finite sums, this term can also be interpreted as a kernel similarity between the respective sets of gradients (see Eq. 4 of [28]). With the first term in B of Eq.
10, we compute a linear combination of the gradients of the validation losses from the meta-training set. Its weighting factors are computed based on the similarities between the tasks from the meta-training set and the new task, as explained above. Therefore, we essentially perform soft nearest neighbor voting to find the direction among the validation gradients from the meta-training set. Given the new task, its own gradient may lead the model to overfit (or underfit). However, the proposed method will extract knowledge from past experiences and find the gradients that gave good validation performance during the meta-training process.

5 Related Work

Meta-learning: Model-agnostic meta-learning (MAML) highlighted the importance of the model's initial parameters for better generalization [10], and there have been many extensions to improve the framework, e.g. for continuous adaptation [1], better credit assignment [37], and robustness [15]. In this work, we improve the inner update optimizers by learning a curvature for better generalization and fast model adaptation. Meta-SGD [22] suggests learning coordinate-wise learning rates. We can interpret it as a diagonal approximation to meta-curvature, in a similar vein to recent adaptive learning rate methods, such as [43, 16, 8], which perform diagonal approximations of second-order matrices. Recently, [4] suggested learning layer-wise learning rates through meta-training. However, both methods do not consider the dependencies between the parameters, which we found crucial for a more robust meta-training process and faster convergence.
[21] also attempted to transform the gradients. They used a simple binary mask applied to the gradient update to determine which parameters are to be updated, while we introduce dense learnable tensors to model second-order dependencies with a series of tensor products.

Few-shot classification: As a good test bed to evaluate few-shot learning, huge progress has been made on the few-shot classification task. Triggered by [44], many recent studies have focused on discovering effective inductive biases for the classification task. For example, network architectures that perform nearest neighbor search [44, 41] were suggested. Some improved the performance by modeling the interactions or correlations between training examples [26, 11, 42, 32, 29]. In order to overcome the nature of few-shot learning, generative models have been suggested to augment the training data [40, 45] or generate model parameters for the specified task [39, 33]. The state-of-the-art results are achieved by additionally training a 64-way classification task for pretraining [33, 39, 32] with larger ResNet models [33, 39, 29, 26].

Table 2: Few-shot classification results on the Omniglot dataset. † denotes a 3-model ensemble.

Method | 5-way 1-shot | 5-way 5-shot | 20-way 1-shot | 20-way 5-shot
SNAIL [27] | 99.07 ± 0.16 | 99.78 ± 0.09 | 97.64 ± 0.30 | 99.36 ± 0.18
GNN [12] | 99.2 | 99.7 | 97.4 | 99.0
MAML | 98.7 ± 0.4 | 99.9 ± 0.1 | 95.8 ± 0.3 | 98.9 ± 0.2
Meta-SGD | 99.53 ± 0.26 | 99.93 ± 0.09 | 95.93 ± 0.38 | 98.97 ± 0.19
MAML++† [4] | 99.47 | 99.93 | 97.65 ± 0.05 | 99.33 ± 0.03
MC1 | 99.47 ± 0.27 | 99.57 ± 0.12 | 97.60 ± 0.29 | 99.23 ± 0.08
MC2 | 99.77 ± 0.17 | 99.79 ± 0.10 | 97.86 ± 0.26 | 99.24 ± 0.07
MC2† | 99.97 ± 0.06 | 99.89 ± 0.06 | 99.12 ± 0.16 | 99.65 ± 0.05
In this work, our focus is to improve a model-agnostic few-shot learner that is broadly applicable to other tasks, e.g. the reinforcement learning setup.

Learning optimizers: Our proposed method may fall within the learned optimizer category [34, 3, 46, 25]. These methods also take the gradient as input and transform it via a neural network to achieve better convergence behavior. However, their main focus is to capture the training dynamics of individual gradient coordinates [34, 3] or to obtain a generic optimizer that is broadly applicable to different datasets and architectures [46, 25, 3]. On the other hand, we meta-learn a curvature coupled with the model's initialization parameters. We focus on a fast adaptation scenario requiring a small number of gradient steps. Therefore, our method does not consider a history of the gradients, which enables us to avoid a complex recurrent architecture. Finally, our approach is well connected to existing second order methods, while learned optimizers are not easily interpretable since the gradient passes through nonlinear and multilayer recurrent neural networks.

6 Experiments

We evaluate the proposed method on a synthetic few-shot regression task and on few-shot image classification tasks with the Omniglot and miniImagenet datasets. We test two versions of meta-curvature. In the first one, named MC1, we fix Mo = I in Eq. 4. In the second one, named MC2, we learn all three meta-curvature matrices. We also report results on few-shot reinforcement learning in the appendices.

6.1 Few-shot regression

To begin with, we perform a simple regression problem following [9, 22]. During the meta-training process, sinusoidal functions are sampled, where the amplitude and phase are varied within [0.1, 5.0] and [0, π] respectively. The network architecture and all hyperparameters are the same as in [9], and we only introduce the suggested meta-curvature.
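The task distribution for this experiment can be sketched as follows (function names are ours; the amplitude and phase ranges come from the text, while the input range [−5.0, 5.0] is an assumption following the usual MAML sinusoid setup):

```python
import numpy as np

def sample_task(rng):
    """Draw one sinusoid regression task: y = A * sin(x + phi)."""
    amplitude = rng.uniform(0.1, 5.0)   # range from the text
    phase = rng.uniform(0.0, np.pi)     # range from the text
    return lambda x: amplitude * np.sin(x + phase)

def sample_shots(task, k, rng):
    """Draw k (x, y) examples from a task; x-range is our assumption."""
    x = rng.uniform(-5.0, 5.0, size=k)
    return x, task(x)

rng = np.random.default_rng(0)
f = sample_task(rng)
x_tr, y_tr = sample_shots(f, 5, rng)     # 5-shot adaptation set
x_val, y_val = sample_shots(f, 5, rng)   # held-out points from the same task
```

During meta-training, the inner loop adapts on (x_tr, y_tr) and the meta-objective is the MSE on (x_val, y_val), exactly the train/validation split used in Eq. 2.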
We report the mean squared error with a 95% confidence interval after one gradient step in Table 1. The details are provided in the appendices.

Table 1: Few-shot regression results.

Method | 5-shot | 10-shot
MAML | 0.686 ± 0.070 | 0.435 ± 0.039
Meta-SGD | 0.482 ± 0.061 | 0.258 ± 0.026
LayerLR | 0.528 ± 0.068 | 0.269 ± 0.027
MC1 | 0.426 ± 0.054 | 0.239 ± 0.025
MC2 | 0.405 ± 0.048 | 0.201 ± 0.020

6.2 Few-shot classification on Omniglot

The Omniglot dataset consists of handwritten characters from 50 different languages, 1623 different characters in total. It has been widely used to evaluate few-shot classification performance. We follow the experimental protocol of [9], and all hyperparameters and the network architecture are the same as in [9]. Further experimental details are provided in the appendices. Except in the 5-way 5-shot setting, our simple 4 layer CNN with meta-curvatures outperforms all MAML variants and also achieves state-of-the-art results without additional specialized architectures, such as an attention module (SNAIL [27]) or a relational module (GNN [12]). We provide the training curves in Figure 2; our methods converge much faster and achieve higher accuracy.

Figure 2: Few-shot classification accuracy over training iterations.

Table 3: Few-shot classification results on the miniImagenet test set (5-way classification) with baseline 4 layer CNNs. * is from the original papers. † denotes a 3-model ensemble.

Method | 1-shot, 1 step | 1-shot, 5 step | 5-shot, 1 step | 5-shot, 5 step
*MAML | · | 48.7 ± 1.84 | · | 63.1 ± 0.92
*Meta-SGD | 50.47 ± 1.87 | · | 64.03 ± 0.94 | ·
*MAML++† | 51.05 ± 0.31 | 52.15 ± 0.26 | · | 68.32 ± 0.44
MAML | 46.28 ± 0.89 | 48.85 ± 0.88 | 59.26 ± 0.72 | 63.92 ± 0.74
Meta-SGD | 49.87 ± 0.87 | 48.99 ± 0.86 | 66.35 ± 0.72 | 63.84 ± 0.71
LayerLR | 50.04 ± 0.87 | 50.55 ± 0.87 | 65.06 ± 0.71 | 66.64 ± 0.69
MC1 | 53.37 ± 0.88 | 53.74 ± 0.84 | 68.47 ± 0.69 | 68.01 ± 0.73
MC2 | 54.23 ± 0.88 | 54.08 ± 0.93 | 67.94 ± 0.71 | 67.99 ± 0.73
MC2† | 54.90 ± 0.90 | 55.73 ± 0.94 | 69.46 ± 0.70 | 70.33 ± 0.72

6.3 Few-shot classification on miniImagenet and tieredImagenet

Datasets: The miniImagenet dataset was proposed by [44, 34] and consists of 100 classes out of the 1000 classes in the original dataset (64 training classes, 12 validation classes, 24 test classes). The tieredImagenet dataset [35] is a larger subset, composed of 608 classes, and reduces the semantic similarity between train/val/test splits by considering high-level categories.

Baseline CNNs: We used a 4 layer convolutional neural network with batch normalization, followed by a fully connected layer for the final classification. In order to increase the capacity of the network, we increased the filter size up to 128. We found that the model with the larger filters seriously overfit (also reported in [9]). To avoid overfitting, we applied the data augmentation techniques suggested in [5, 6]. For a fair comparison to [4], we also report the results of a model ensemble. Throughout the meta-training, we saved the model regularly and picked the 3 models with the best accuracy on the meta-validation dataset.
We re-implemented all three baselines and performed the experiments with the same settings. We provide further details in the appendices.

Figure 2 and Table 3 show the results of the baseline CNN experiments on miniImagenet. MC1 and MC2 outperformed all other baselines in every experimental setting. Not only does MC reach higher accuracy at convergence, it also converges much faster during meta-training. Our methods share the benefits of second-order methods even though we do not approximate any Hessian or Fisher matrices. Unlike other MAML variants, which required an extensive hyperparameter search, our methods are very robust to hyperparameter settings. MC2 usually outperforms MC1 because the more fine-grained meta-curvature effectively increases the model's capacity.

WRN-28-10 features and MLP: To the best of our knowledge, [39, 33] are the current state-of-the-art methods that use a pretrained WRN-28-10 [47] network (trained with a 64-way classification task on the entire meta-training set) as a feature extractor. We evaluated our methods in this setting by adding a one-hidden-layer MLP followed by a softmax classifier, and our method again improved upon the MAML variants by a large margin. Despite our best attempts, we could not find good hyperparameters to train the original MAML in this setting. Although our main goal is to push how far a simple gradient transformation in the inner-loop optimization can improve the general and broadly applicable MAML framework, our methods outperformed recent methods that use various task-specific techniques, e.g., task-dependent weight-generating methods [39, 33] and relational networks [39]. Our methods also outperformed the very latest state-of-the-art results [20], which used extensive data augmentation, regularization, and 15-shot meta-training schemes with a different backbone network.

Table 4: Results on miniImagenet and tieredImagenet. ‡ indicates that both the meta-train and meta-validation sets are used during meta-training. † indicates that 15-shot meta-training was used for both 1-shot and 5-shot testing. MetaOptNet [20] used a ResNet-12 backbone trained in an end-to-end manner, while we used the fixed features provided by [39] (center: features from the central crop; multiview: features averaged over the four corner crops, the central crop, and their horizontally mirrored versions).

Method                | miniImagenet 1-shot | miniImagenet 5-shot | tieredImagenet 1-shot | tieredImagenet 5-shot
[33]‡                 | 59.60 ± 0.41 | 73.74 ± 0.19 | ·            | ·
LEO (center)‡ [39]    | 61.76 ± 0.08 | 77.59 ± 0.12 | 66.33 ± 0.05 | 81.44 ± 0.09
LEO (multiview)‡ [39] | 63.97 ± 0.20 | 79.49 ± 0.70 | ·            | ·
MetaOptNet-SVM‡† [20] | 64.09 ± 0.62 | 80.00 ± 0.45 | 65.81 ± 0.74 | 81.75 ± 0.53
Meta-SGD (center)     | 56.58 ± 0.21 | 68.84 ± 0.19 | 59.75 ± 0.25 | 69.04 ± 0.22
MC2 (center)          | 61.22 ± 0.10 | 75.92 ± 0.17 | 66.20 ± 0.10 | 82.21 ± 0.08
MC2 (center)‡         | 61.85 ± 0.10 | 77.02 ± 0.11 | 67.21 ± 0.10 | 82.61 ± 0.08
MC2 (multiview)‡      | 64.40 ± 0.10 | 80.21 ± 0.10 | ·            | ·

7 Conclusion

We propose to meta-learn the curvature for faster adaptation and better generalization. The proposed method significantly improves performance over previous MAML variants and outperforms recent state-of-the-art methods. It also leads to faster convergence during meta-training.
We also present an analysis of the generalization performance and draw connections to existing second-order methods, which may provide useful insights for further research.

References

[1] Maruan Al-Shedivat, Trapit Bansal, Yuri Burda, Ilya Sutskever, Igor Mordatch, and Pieter Abbeel. Continuous Adaptation via Meta-Learning in Nonstationary and Competitive Environments. In International Conference on Learning Representations (ICLR), 2018.

[2] Shun-Ichi Amari. Natural gradient works efficiently in learning. Neural Computation, 10(2):251–276, 1998.

[3] Marcin Andrychowicz, Misha Denil, Sergio Gómez Colmenarejo, Matthew W. Hoffman, David Pfau, Tom Schaul, Brendan Shillingford, and Nando de Freitas. Learning to learn by gradient descent by gradient descent. In Neural Information Processing Systems (NeurIPS), 2016.

[4] Antreas Antoniou, Harrison Edwards, and Amos Storkey. How to train your MAML. In International Conference on Learning Representations (ICLR), 2019.

[5] Ekin D. Cubuk, Barret Zoph, Dandelion Mane, Vijay Vasudevan, and Quoc V. Le. AutoAugment: Learning Augmentation Policies from Data. arXiv:1805.09501, 2018.

[6] Terrance DeVries and Graham W. Taylor. Improved regularization of convolutional neural networks with cutout. arXiv:1708.04552, 2017.

[7] Yunshu Du, Wojciech M. Czarnecki, Siddhant M. Jayakumar, Razvan Pascanu, and Balaji Lakshminarayanan. Adapting Auxiliary Losses Using Gradient Similarity. arXiv:1812.02224, 2018.

[8] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121–2159, 2011.

[9] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks. In International Conference on Machine Learning (ICML), 2017.

[10] Chelsea Finn and Sergey Levine.
Meta-Learning and Universality: Deep Representations and Gradient Descent Can Approximate Any Learning Algorithm. In International Conference on Learning Representations (ICLR), 2018.

[11] Victor Garcia and Joan Bruna. Few-Shot Learning with Graph Neural Networks. In International Conference on Learning Representations (ICLR), 2018.

[12] Victor Garcia and Joan Bruna. Few-Shot Learning with Graph Neural Networks. In International Conference on Learning Representations (ICLR), 2018.

[13] Erin Grant, Chelsea Finn, Sergey Levine, Trevor Darrell, and Thomas Griffiths. Recasting Gradient-Based Meta-Learning as Hierarchical Bayes. In International Conference on Learning Representations (ICLR), 2018.

[14] Roger Grosse and James Martens. A Kronecker-factored approximate Fisher matrix for convolution layers. In International Conference on Machine Learning (ICML), 2016.

[15] Taesup Kim, Jaesik Yoon, Ousmane Dia, Sungwoong Kim, Yoshua Bengio, and Sungjin Ahn. Bayesian Model-Agnostic Meta-Learning. In Neural Information Processing Systems (NeurIPS), 2018.

[16] Diederik P. Kingma and Jimmy Lei Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), 2015.

[17] Tamara G. Kolda and Brett W. Bader. Tensor Decompositions and Applications. SIAM Review, 51(3):455–500, 2009.

[18] Jean Kossaifi, Zachary Lipton, Aran Khanna, Tommaso Furlanello, and Anima Anandkumar. Tensor Regression Networks. arXiv:1707.08308, 2018.

[19] Brenden M. Lake, Ruslan Salakhutdinov, and Joshua B. Tenenbaum. Human-level concept learning through probabilistic program induction. Science, 350(6266):1332–1338, 2015.

[20] Kwonjoon Lee, Subhransu Maji, Avinash Ravichandran, and Stefano Soatto. Meta-Learning with Differentiable Convex Optimization. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.

[21] Yoonho Lee and Seungjin Choi.
Gradient-based meta-learning with learned layerwise metric and subspace. In International Conference on Machine Learning (ICML), 2018.

[22] Zhenguo Li, Fengwei Zhou, Fei Chen, and Hang Li. Meta-SGD: Learning to Learn Quickly for Few-Shot Learning. arXiv:1707.09835, 2017.

[23] David Lopez-Paz and Marc'Aurelio Ranzato. Gradient Episodic Memory for Continual Learning. In Neural Information Processing Systems (NeurIPS), 2017.

[24] James Martens and Roger Grosse. Optimizing Neural Networks with Kronecker-factored Approximate Curvature. In International Conference on Machine Learning (ICML), 2015.

[25] Luke Metz, Niru Maheswaranathan, Jeremy Nixon, C. Daniel Freeman, and Jascha Sohl-Dickstein. Learned Optimizers That Outperform SGD On Wall-Clock And Test Loss. arXiv:1810.10180, 2018.

[26] Nikhil Mishra, Mostafa Rohaninejad, Xi Chen, and Pieter Abbeel. A Simple Neural Attentive Meta-Learner. In International Conference on Learning Representations (ICLR), 2018.

[27] Nikhil Mishra, Mostafa Rohaninejad, Xi Chen, and Pieter Abbeel. A Simple Neural Attentive Meta-Learner. In International Conference on Learning Representations (ICLR), 2018.

[28] Krikamol Muandet, Kenji Fukumizu, Francesco Dinuzzo, and Bernhard Schölkopf. Learning from Distributions via Support Measure Machines. In Neural Information Processing Systems (NeurIPS), 2012.

[29] Tsendsuren Munkhdalai, Xingdi Yuan, Soroush Mehri, and Adam Trischler. Rapid Adaptation with Conditionally Shifted Neurons. In International Conference on Machine Learning (ICML), 2018.

[30] Alex Nichol, Joshua Achiam, and John Schulman. On First-Order Meta-Learning Algorithms. arXiv:1803.02999, 2018.

[31] Jorge Nocedal and Stephen Wright. Numerical Optimization. Springer Science & Business Media, 2006.

[32] Boris N. Oreshkin, Pau Rodriguez, and Alexandre Lacoste. TADAM: Task dependent adaptive metric for improved few-shot learning.
In Neural Information Processing Systems (NeurIPS), 2018.

[33] Siyuan Qiao, Chenxi Liu, Wei Shen, and Alan Yuille. Few-Shot Image Recognition by Predicting Parameters from Activations. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.

[34] Sachin Ravi and Hugo Larochelle. Optimization As a Model For Few-shot Learning. In International Conference on Learning Representations (ICLR), 2017.

[35] Mengye Ren, Eleni Triantafillou, Sachin Ravi, Jake Snell, Kevin Swersky, Joshua B. Tenenbaum, Hugo Larochelle, and Richard S. Zemel. Meta-Learning for Semi-Supervised Few-Shot Classification. In International Conference on Learning Representations (ICLR), 2018.

[36] Matthew Riemer, Ignacio Cases, Robert Ajemian, Miao Liu, Irina Rish, Yuhai Tu, and Gerald Tesauro. Learning to Learn without Forgetting By Maximizing Transfer and Minimizing Interference. In International Conference on Learning Representations (ICLR), 2019.

[37] Jonas Rothfuss, Dennis Lee, Ignasi Clavera, Tamim Asfour, and Pieter Abbeel. ProMP: Proximal Meta-Policy Search. In International Conference on Learning Representations (ICLR), 2019.

[38] Nicolas Le Roux, Pierre-Antoine Manzagol, and Yoshua Bengio. Topmoumoute online natural gradient algorithm. In Neural Information Processing Systems (NeurIPS), 2008.

[39] Andrei A. Rusu, Dushyant Rao, Jakub Sygnowski, Oriol Vinyals, Razvan Pascanu, Simon Osindero, and Raia Hadsell. Meta-Learning with Latent Embedding Optimization. In International Conference on Learning Representations (ICLR), 2019.

[40] Eli Schwartz, Leonid Karlinsky, Joseph Shtok, Sivan Harary, Mattias Marder, Rogerio Feris, Abhishek Kumar, Raja Giryes, and Alex M. Bronstein. Delta-encoder: an effective sample synthesis method for few-shot object recognition. In Neural Information Processing Systems (NeurIPS), 2018.

[41] Jake Snell, Kevin Swersky, and Richard S. Zemel.
Prototypical Networks for Few-shot Learning. In Neural Information Processing Systems (NeurIPS), 2017.

[42] Flood Sung, Yongxin Yang, Li Zhang, Tao Xiang, Philip H.S. Torr, and Timothy M. Hospedales. Learning to Compare: Relation Network for Few-Shot Learning. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.

[43] Tijmen Tieleman and Geoffrey Hinton. Lecture 6.5—RmsProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012.

[44] Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Koray Kavukcuoglu, and Daan Wierstra. Matching Networks for One Shot Learning. In Neural Information Processing Systems (NeurIPS), 2016.

[45] Yu-Xiong Wang, Ross Girshick, Martial Hebert, and Bharath Hariharan. Low-Shot Learning from Imaginary Data. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.

[46] Olga Wichrowska, Niru Maheswaranathan, Matthew W. Hoffman, Sergio Gomez Colmenarejo, Misha Denil, Nando de Freitas, and Jascha Sohl-Dickstein. Learned Optimizers that Scale and Generalize. In International Conference on Machine Learning (ICML), 2017.

[47] Sergey Zagoruyko and Nikos Komodakis. Wide Residual Networks. In The British Machine Vision Conference (BMVC), 2016.