{"title": "Neural Similarity Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 5025, "page_last": 5036, "abstract": "Inner product-based convolution has been the founding stone of convolutional neural networks (CNNs), enabling end-to-end learning of visual representation. By generalizing inner product with a bilinear matrix, we propose the neural similarity which serves as a learnable parametric similarity measure for CNNs. Neural similarity naturally generalizes the convolution and enhances flexibility. Further, we consider the neural similarity learning (NSL) in order to learn the neural similarity adaptively from training data. Specifically, we propose two different ways of learning the neural similarity: static NSL and dynamic NSL. Interestingly, dynamic neural similarity makes the CNN become a dynamic inference network. By regularizing the bilinear matrix, NSL can be viewed as learning the shape of kernel and the similarity measure simultaneously. We further justify the effectiveness of NSL with a theoretical viewpoint. Most importantly, NSL shows promising performance in visual recognition and few-shot learning, validating the superiority of NSL over the inner product-based convolution counterparts.", "full_text": "Neural Similarity Learning\n\nWeiyang Liu1,* Zhen Liu2,*\n\nJames M. Rehg1 Le Song1,3\n\n*Equal Contribution\n1Georgia Institute of Technology 2Mila, Universit\u00e9 de Montr\u00e9al\nwyliu@gatech.edu, zhen.liu.2@umontreal.ca, rehg@gatech.edu, lsong@cc.gatech.edu\n\n3Ant Financial\n\nAbstract\n\nInner product-based convolution has been the founding stone of convolutional\nneural networks (CNNs), enabling end-to-end learning of visual representation. By\ngeneralizing inner product with a bilinear matrix, we propose the neural similarity\nwhich serves as a learnable parametric similarity measure for CNNs. Neural simi-\nlarity naturally generalizes the convolution and enhances \ufb02exibility. 
Further, we\nconsider the neural similarity learning (NSL) in order to learn the neural similarity\nadaptively from training data. Speci\ufb01cally, we propose two different ways of\nlearning the neural similarity: static NSL and dynamic NSL. Interestingly, dynamic\nneural similarity makes the CNN become a dynamic inference network. By regu-\nlarizing the bilinear matrix, NSL can be viewed as learning the shape of kernel and\nthe similarity measure simultaneously. We further justify the effectiveness of NSL\nwith a theoretical viewpoint. Most importantly, NSL shows promising performance\nin visual recognition and few-shot learning, validating the superiority of NSL over\nthe inner product-based convolution counterparts.\n\nIntroduction\n\n1\nRecent years have witnessed the unprecedented success of convolutional neural networks (CNNs) in\nsupervised learning tasks such as image recognition [20], object detection [47], semantic segmenta-\ntion [40], etc. As the core of CNN, a standard convolution operator typically contains two components:\na learnable template (i.e., kernel) and a similarity measure (i.e., inner product). One active stream\nof works [13, 63, 25, 61, 8, 53, 24, 59, 26] aims to improve the \ufb02exibility of the convolution kernel\nand increases its receptive \ufb01eld in a data-driven way. Another stream of works [39, 36] focuses on\n\ufb01nding a better similarity measure to replace the inner product. However, there still lacks a uni\ufb01ed\nformulation that can take both the shape of kernel and the similarity measure into consideration.\nTo bridge this gap, we propose the neural similarity\nlearning (NSL) for CNNs. NSL \ufb01rst de\ufb01nes the\nneural similarity by generalizing the inner product\nwith a parametric bilinear matrix and then learns the\nneural similarity jointly with the convolution kernels.\nA graphical comparison between inner product and\nneural similarity is given in Figure 1. 
With certain\nregularities on the neural similarity, NSL can be\nviewed as learning the shape of the kernel and the\nsimilarity measure simultaneously. Based on the\nneural similarity, we propose the neural similarity\nnetwork (NSN) by stacking convolution layers with\nneural similarity. We consider two distinct ways to\nlearn the neural similarity in CNN. First, we learn\na static neural similarity which is essentially a (regularized) bilinear similarity. By having more\nparameters, the static neural similarity becomes a natural generalization of the standard inner product.\nSecond and more interestingly, we also consider to learn the neural similarity in a dynamic fashion.\n\nFigure 1: Bipartite graph comparison of inner prod-\nuct, static neural similarity and dynamic neural sim-\nilarity. A line represents a multiplication operation\nand a circle denotes an element in a vector. Green\ncolor denotes kernel and yellow denotes input.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n Input 1 Input 2Dynamic Neural SimilarityStatic Neural SimilarityInner ProductKernelInputPlaceholder\fSpeci\ufb01cally, we use an additional neural network module to learn the neural similarity adaptively\nfrom the input images. This module is jointly optimized with the CNN via back-propagation. Using\nthe dynamic neural similarity, the CNN becomes a dynamic neural network, because the equivalent\nweights of the neuron are input-dependent. In a high-level sense, CNNs with dynamic neural similarity\nshare the same spirits with HyperNetworks [18] and dynamic \ufb01lter networks [28].\nA key motivation behind NSL lies in the fact that inner product-based similarity is unlikely to be\noptimal for every task. Learning the similarity measure adaptively from data can be bene\ufb01cial in\ndifferent tasks. 
A hidden layer with dynamic neural similarity can be viewed as a quadratic function\nof the input, while a standard hidden layer is a linear function of the input. Therefore, dynamic neural\nsimilarity introduces more \ufb02exibility from the function approximation perspective.\nNSL aims to construct a \ufb02exible CNN with strong generalization ability, and we can control the\n\ufb02exibility by imposing different regularizations on the bilinear similarity matrix. In this paper, we\nmostly consider the block-diagonal matrix with shared blocks as the bilinear similarity matrix in\norder to reduce the number of parameters. In different applications, we will usually impose domain-\nspeci\ufb01c regularizations. By properly regularizing the bilinear similarity matrix, NSL is able to make\nbetter use of the parameters than standard convolutional learning and \ufb01nd a good trade-off between\ngeneralization ability and representation \ufb02exibility.\nNSL is closely connected to a surprising theoretical result in [16] that optimizing an underdetermined\nquadratic objective over a matrix W with gradient descent on a factorization of this matrix leads\nto an implicit regularization for the solution (minimum nuclear norm). A more recent theoretical\nresult in [5] further shows that gradient descent for deep matrix factorization tends to give low-rank\nsolutions. Since NSL can be viewed as a form of factorization over the convolution kernel, we argue\nthat such factorization also yields some implicit regularization in gradient-based optimization, which\nmay lead to more generalizable inductive bias. We will give more theoretical insights in the paper.\nWhile showing strong generalization ability in generic visual recognition, NSL is also very effective\nfor few-shot learning due to its better \ufb02exibility. Compared to initialization based methods [14, 46],\nNSL can naturally make full use of the pretrained model for few-shot learning. 
Specifically, we propose three different learning strategies to perform few-shot recognition. Besides applying both static and dynamic NSL to few-shot recognition, we further propose to meta-learn the neural similarity. Specifically, we adopt model-agnostic meta-learning [14] to learn the bilinear similarity matrix. Using this strategy, NSL can benefit from the generalization ability of both the pretrained model and the meta information [14]. Our results show that NSL can effectively improve few-shot recognition by a considerable margin.\nOur main contributions can be summarized as follows:\n• We propose the neural similarity, which generalizes the inner product via bilinear similarity. Furthermore, we derive the neural similarity network by stacking convolution layers with neural similarity. Although this paper mostly discusses CNNs, we note that NSL can easily be applied to fully connected networks and recurrent networks.\n• We propose both static and dynamic learning strategies for the neural similarity. In order to overcome the convergence difficulty of dynamic neural similarity, we propose hyperspherical learning [39] with identity residuals to stabilize the training.\n• We apply neural similarity learning to generic visual recognition and few-shot recognition. For few-shot learning, we propose novel usages of NSL and significantly improve the current few-shot learning performance.\n\n2 Related Works\nFlexible convolution. Dilated (atrous) convolution [61, 8] has been proposed in order to construct convolution kernels with large receptive fields for semantic segmentation. [13, 25] improve the convolution kernel for high-level vision tasks by making the kernel shape learnable and deformable. [39, 36] provide a decoupled view of the similarity measure and propose some alternative (learnable) similarity measures. 
Such decoupled similarity is shown to be useful for improving network generalization and adversarial robustness.\nDynamic neural networks. Dynamic neural networks have input-dependent neurons, which makes the network adapt to different inputs. HyperNetworks [18] uses a recurrent network to dynamically generate weights for another recurrent network, such that the weights can vary across many timesteps. Dynamic filter networks [28] generate filters that are dynamically conditioned on the input. These dynamic neural networks usually perform poorly in image recognition tasks and cannot make use of any pretrained models. In contrast, the dynamic NSN performs consistently better than its CNN counterpart, and is able to take advantage of pretrained models for few-shot learning. [11] investigates input-dependent networks by dynamically selecting filters, while NSN uses a totally different approach to achieve dynamic inference.\nMeta-learning. A classic approach [7, 50] for meta-learning is to train a meta-learner that learns to update the parameters of the learner's model. This approach has been adopted to learn deep networks [1, 32, 43, 51]. Recently, a series of works [46, 14] address the meta-learning problem by learning a good network initialization. Specifically for few-shot learning, there are initialization-based methods [43, 46, 14, 10], hallucination-based methods [57, 19, 2] and metric learning-based methods [55, 52, 54]. Besides having a very different formulation from previous works, NSL also combines the advantages of the initialization-based methods with the generalization ability of the pretrained model.\n3 Neural Similarity Learning\n3.1 Generalizing Convolution with Bilinear Similarity\nWe denote a convolution kernel of size C × H × V (C for the number of channels, H for the height and V for the width) as W̃. 
We flatten the kernel in each channel separately and then concatenate them into a vector: W = {W̃^F_{1,:,:}, W̃^F_{2,:,:}, ..., W̃^F_{C,:,:}} ∈ R^{CHV}, where W̃^F_{i,:,:} is the flattened kernel weight of the i-th channel. Similarly, we denote an input patch of the same size C × H × V as X̃, and its flattened version as X. A standard convolution operator uses the inner product W^T X to compute the output feature map in a sliding-window fashion. Instead of using the inner product to compute the similarity, we generalize the convolution with a bilinear similarity matrix:\n\nf_M(W, X) = W^T M X    (1)\n\nwhere M ∈ R^{CHV×CHV} denotes the bilinear similarity matrix and parameterizes the similarity measure. In fact, if we require M to be a symmetric positive semi-definite matrix, it shares some similarities with distance metric learning [60]. Although we do not necessarily need to constrain the matrix M, in practice we still impose some structural constraints on M in order to stabilize the training and save parameters. To avoid introducing too many parameters in the generalized convolution operator, we make the bilinear similarity matrix M block-diagonal with shared blocks (there are C blocks in total):\n\nf_M(W, X) = W^T diag(M_s, ..., M_s) X    (2)\n\nwhere M = diag(M_s, ..., M_s) and M_s is of size HV × HV. Interestingly, the hyperspherical convolution [39] becomes a special case of this bilinear formulation when M is a diagonal matrix with the normalizing factor 1/(||W|| ||X||) on the diagonal. 
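As a concrete illustration of Eq. (1) and the shared-block structure of Eq. (2), the following sketch (hypothetical sizes, NumPy only) checks that the block-diagonal form can be computed channel-wise without ever building the full CHV × CHV matrix, and that an identity M recovers the plain inner product:

```python
import numpy as np

# Illustrative shapes: C channels, H x V kernel.
C, H, V = 3, 3, 3
d = H * V
rng = np.random.default_rng(0)

W = rng.standard_normal(C * d)        # flattened kernel
X = rng.standard_normal(C * d)        # flattened input patch
M_s = rng.standard_normal((d, d))     # shared block, as in Eq. (2)

# Full block-diagonal M = diag(M_s, ..., M_s), C blocks in total.
M = np.kron(np.eye(C), M_s)
f_full = W @ M @ X                    # Eq. (1): W^T M X

# Equivalent channel-wise computation (cheaper, no CHV x CHV matrix).
f_blocks = sum(W[c*d:(c+1)*d] @ M_s @ X[c*d:(c+1)*d] for c in range(C))
assert np.isclose(f_full, f_blocks)

# With M_s = I the neural similarity reduces to the inner product.
assert np.isclose(W @ np.kron(np.eye(C), np.eye(d)) @ X, W @ X)
```

The channel-wise form is what the shared-block constraint buys: HV × HV parameters instead of (CHV)^2.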
Since additional parameters are introduced to\ncontrol the similarity measure, we are able to learn a similarity measure directly from data (i.e., static\nneural similarity) or learn a neural predictor that can estimate such a similarity matrix from the input\nfeature map (i.e., dynamic neural similarity). In the paper, we mainly consider two structures for Ms.\nDiagonal/Unconstrained neural similarity. If we require Ms to be a diagonal matrix, then we end\nup with the diagonal neural similarity (DNS). DNS is very parameter-ef\ufb01cient and can be viewed\nas a weighted inner product or an element-wise attention. Besides that, DNS is essentially putting\nan additional spatial mask over the feature map, so it is semantically meaningful. If no constraint is\nimposed on Ms, then we have the unconstrained neural similarity (UNS) which is very \ufb02exible but\nrequires much more parameters.\n\n3.2 Learning Static Neural Similarity\nWe \ufb01rst introduce a static learning strategy for the neural similarity. Speci\ufb01cally, we learn the matrix\nMs jointly with the convolution kernel via back-propagation. An intuitive overview for static neural\nsimilarity is given in Figure 2(a). When Ms has been jointly learned after training, it will stay \ufb01xed\nin the inference stage. More interestingly, as we can see from Equation (1) that the neural similarity is\nincorporated into the convolution operator via a linear multiplication, we can compute an equivalent\nweights for the kernel in advance if the neural similarity is static. Therefore, we can view the new\n\n3\n\n\fkernel as M(cid:62)W . As a result, when it comes to deployment in practice, the number of parameters\nused in static NSN is the same as the CNN baseline and the inference speed is also the same.\nLearning static neural similarity can be viewed as\na factorized learning of neurons. 
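The deployment-time folding just described (viewing the new kernel as M^T W) can be checked in a few lines; the diagonal choice below also illustrates DNS as a weighted inner product (sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 9
W = rng.standard_normal(d)
M = np.diag(rng.standard_normal(d))   # DNS: diagonal neural similarity

X = rng.standard_normal(d)
out_train = W @ M @ X                 # training-time bilinear form
W_eq = M.T @ W                        # fold M into the kernel once
out_deploy = W_eq @ X                 # deployment: plain inner product
assert np.isclose(out_train, out_deploy)
```

Because `W_eq` has the same shape as `W`, the deployed static NSN costs exactly as much as the baseline convolution.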
It also shares a lot of similarities with matrix factorization, in the sense that the equivalent neuron weights Ŵ = M^T W are factorized into the two matrices M^T and W. Although the original weights and the factorized weights are mathematically equivalent, they have different behaviors and properties during gradient-based optimization [16]. Recent theories [16, 5, 33] suggest that an implicit regularization may encourage gradient-based matrix factorization to give minimum nuclear norm or low-rank solutions. Besides that, we also have structural constraints to explicitly regularize the matrix M. Furthermore, we can also view this static neural similarity convolution as a one-hidden-layer linear network. It has been shown that such over-parameterization can be beneficial to generalization [29, 3, 44, 4].\n\nFigure 2: Intuitive comparison between static neural similarity and dynamic neural similarity.\n\n3.3 Learning Dynamic Neural Similarity\n3.3.1 Formulation\nBesides the static neural similarity, we further propose to learn the neural similarity dynamically. The intuition behind this is that the similarity measure should be adaptive to the input in order to achieve optimal flexibility. From a cognitive science perspective, it is also plausible to enable the network with dynamic inference [56, 31]. The difference between static and dynamic neural similarity is shown in Figure 2. Specifically, the dynamic neural similarity is generated using an additional neural network M_θ(·) with parameters θ, namely M_s = M_θ(X). 
As a result, learning a dynamic neural similarity jointly with the network parameters amounts to solving the following optimization problem (without loss of generality, we use a single neuron as an example):\n\n{W, θ} = arg min_{W, θ} Σ_i L(y_i, W^T M_θ(X_i) X_i)    (3)\n\nwhere y_i is the ground truth value for X_i, and L is some loss function. Both W and θ can be learned end-to-end using back-propagation. Note that, although X_i denotes the entire sample here, X_i becomes a local patch of the input feature map in CNNs. For simplicity, we consider a one-neuron fully connected layer instead of a convolution layer. Due to the dynamic neural similarity, the equivalent weights M_θ(X)^T W become a function of the input X and therefore construct a dynamic neural network. In fact, dynamic networks that generate the neuron weights entirely with an additional neural network have poor generalization ability for recognition tasks [18]. In contrast, our dynamic NSN achieves a delicate balance between generalization and flexibility by using neuron weights that are “semi-generated” (i.e., part of the weights are statically and directly learned from supervision, while the neural similarity matrix is generated dynamically from the input). Interestingly, we notice that hyperspherical convolution [39] can be viewed as a special case of dynamic neural similarity: its equivalent similarity matrix M_θ(X) = diag(1/(||W|| ||X||), ..., 1/(||W|| ||X||)) also depends on the input feature map but does not have any parameter θ.\nHyperspherical learning with identity residuals. In our experiments, we find that naively using a neural network to predict the neural similarity is very unstable during training, leading to difficulty in convergence (it requires a lot of tricks to converge). 
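To make the dynamic formulation of Eq. (3) concrete, here is a toy sketch (NumPy, illustrative sizes; the one-layer predictor below is a stand-in, not the SphereNet module used in the paper). The point is that the equivalent weights M_θ(X)^T W change with every input, which is what makes the network dynamic:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 9
W = rng.standard_normal(d)
theta = rng.standard_normal((d * d, d)) * 0.1  # toy predictor parameters

def M_theta(X):
    # Toy one-layer predictor mapping an input patch to a d x d matrix.
    return np.tanh(theta @ X).reshape(d, d)

X1, X2 = rng.standard_normal(d), rng.standard_normal(d)
out1 = W @ M_theta(X1) @ X1
out2 = W @ M_theta(X2) @ X2

# Equivalent weights M_theta(X)^T W differ per input -> dynamic network.
assert not np.allclose(M_theta(X1).T @ W, M_theta(X2).T @ W)
```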
To address the training stability problem, we propose hyperspherical networks (SphereNet) [39] with identity residuals to serve as the neural similarity predictor. The convergence stability of hyperspherical learning over standard neural networks is discussed in [39, 37, 38, 36, 35, 34]. In order to further stabilize the training, we learn the residual of an identity similarity matrix instead of directly learning the entire similarity matrix. Formally, the neural similarity predictor is written as M_θ(X) = SphereNet(X; θ) + I, where I is an identity matrix and SphereNet(X; θ) denotes the hyperspherical network with parameters θ and input X. To save parameters, we can use hyperspherical convolutional networks instead of hyperspherical fully-connected networks. One advantage of SphereNet is that each element of its output is bounded between −1 and 1 ([0, 1] if using ReLU), making the similarity matrix bounded and well behaved. In contrast, the output of a standard neural network is unbounded, which easily makes some values of the similarity matrix dominantly large. Most importantly, SphereNet with identity residuals empirically yields not only more stable convergence but also stronger generalization.\n\nFigure 3: Comparison between disjoint and shared parameterization for dynamic neural similarity predictor.\n\n3.3.2 Disjoint and Shared Parameterization in Neural Similarity Predictor\nWe mainly consider disjoint and shared parameterizations for the dynamic neural similarity predictor.\nDisjoint parameterization. Disjoint parameterization treats every dynamic neural similarity independently. For each convolution kernel (i.e., neuron), we use a disjoint neural network to predict the neural similarity matrix M_s. 
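The identity-residual predictor M_θ(X) = SphereNet(X; θ) + I described above can be sketched as follows; a tanh branch stands in for the bounded SphereNet output (an assumption made purely for illustration). At zero initialization the residual vanishes and the layer reduces to the plain inner product, which is one intuition for why training starts off stable:

```python
import numpy as np

rng = np.random.default_rng(3)
d = 9
theta = np.zeros((d * d, d))  # zero-initialized residual predictor

def predictor(X, theta):
    # Bounded residual branch (stand-in for SphereNet, entries in (-1, 1))
    # plus an identity matrix, as in M_theta(X) = SphereNet(X; theta) + I.
    residual = np.tanh(theta @ X).reshape(d, d)
    return residual + np.eye(d)

W, X = rng.standard_normal(d), rng.standard_normal(d)

# At zero initialization: M = I, i.e. the standard convolution.
assert np.allclose(predictor(X, theta), np.eye(d))
assert np.isclose(W @ predictor(X, theta) @ X, W @ X)
```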
A brief overview is given in Figure 3(a).\nShared parameterization. Assuming that\nthere exists an intrinsic structure to predict\nthe neural similarity from the input, we con-\nsider a shared neural network that produces\nthe neural similarity matrix for different con-\nvolution kernels (usually convolution kernels of the same size). To address the dimension mismatch\nproblem of the input feature map, we adopt an adaptation network (e.g., convolution networks or\nfully-connected networks) to \ufb01rst transform the inputs to the same dimension. Note that, these\nadaptation networks are not shared across different kernels in general, but we can share those adapta-\ntion networks for the input feature map of the same size. An intuitive comparison between disjoint\nand shared parameterization is given in Figure 3 (Conv1 and Conv2 denote different convolution\nkernels). By sharing the neural similarity prediction networks across different kernels, the number of\nparameters used in total can be signi\ufb01cantly reduced. Most importantly, this shared neural similarity\nnetwork may be able to learn some meta-knowledge about the neural similarity.\n3.4 Regularization for Neural Similarity\nOne of the largest advantages about the neural similarity formulation is that one can impose suitable\nregularizations on the neural similarity matrix M in different tasks. It gives us a way to incorporate\nour prior knowledge and problem understandings into the neural networks. The regularization on M\ncontrols the \ufb02exibility of the neural similarity. If we impose no constraints on M, then it will have\nway too many parameters. Although it may be \ufb02exible enough, the generalization is not necessarily\ngood. Instead we usually need to impose some constraints (e.g., the block-diagonal with shared\nblocks, diagonal, etc.) in order to save parameters and improve generalization.\nStructural regularization. 
As a typical example, requiring M to be a block-diagonal matrix with shared blocks is a strong structural regularization. Dilated convolution can be viewed as both structural and sparsity regularization on M_s. In fact, more advanced structural regularizations could be considered. For instance, requiring M to be a symmetric or symmetric positive semi-definite matrix is also feasible (by using a Cholesky factorization M = LL^T, where L is a learnable lower triangular matrix) and can largely limit the learnable class of similarity measures. Most importantly, structural regularizations may bring more geometric and semantic interpretability.\nSparsity regularization. Soft sparsity regularization on the matrix M_s can be enforced via an l1-norm penalty. One can also impose a hard sparsity constraint to limit the number of non-zero values in M_s, similar to [42]. It is also appealing to enforce a sparsity-one pattern on M_s, because it can construct efficient neural networks based on the shift operation in [59].\n3.5 Joint Learning of Kernel Shape and Similarity\nFormulation. NSL is also a unified framework for jointly learning the kernel shape and the similarity measure. If we further factorize M_s into the product of a diagonal Boolean matrix D and a similarity matrix R, then the neural similarity can be parameterized as\n\nf_M(W, X) = W^T diag(DR, ..., DR) X = W^T · diag(D, ..., D) · diag(R, ..., R) · X    (4)\n\nwhere the first block-diagonal factor diag(D, ..., D) controls the kernel shape and the second factor diag(R, ..., R) encodes the similarity measure. Here D = diag(d_1, ..., d_HV), in which each d_i ∈ {0, 1} is a Boolean value. D actually controls the shape of the kernel because it spatially masks out some elements in the kernel. 
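The role of D in the D·R factorization can be verified directly: because D is diagonal, masking via W^T D R X is exactly the same as zeroing the corresponding kernel entries, i.e., (DW)^T R X (an illustrative sketch with hypothetical sizes and mask pattern):

```python
import numpy as np

rng = np.random.default_rng(4)
d = 9                                  # a flattened 3x3 kernel
W, X = rng.standard_normal(d), rng.standard_normal(d)
R = rng.standard_normal((d, d))        # similarity matrix

# Boolean diagonal D: keep a cross-shaped subset of the 3x3 positions.
keep = np.array([0, 1, 0, 1, 1, 1, 0, 1, 0], dtype=float)
D = np.diag(keep)

out = W @ D @ R @ X                    # M_s = D R as in the factorization

# Masking with D is equivalent to spatially masking the kernel itself,
# i.e. D determines the kernel shape.
assert np.isclose(out, (D @ W) @ R @ X)

# With all-ones D the unmasked neural similarity is recovered.
assert np.isclose(W @ np.eye(d) @ R @ X, W @ R @ X)
```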
Speci\ufb01cally,\nbecause the diagonal of D is binary, some elements of Ms will become zero and therefore the kernel\nshape is controlled by D. On the other hand, R still serves as the neural similarity matrix, similar\n\n5\n\nConv2Input(a) Disjoint ParameterizationConv1Adaptation2Adaptation1InputSharedOutputOutputConv2InputConv1InputOutputOutputDisjoint 1Disjoint 2(b) Shared Parameterization\fto the previous Ms. D can also be viewed as masking out some elements of each column in R.\nInterestingly, if we do not require the diagonal of D to be Boolean, then it will become a continuous\nspatial mask for the kernel shape.\nOptimization. First of all, we only consider D to be static in both static and dynamic NSN. The\noptimization of D is non-trivial, because it is a Boolean matrix which is discretized and can not be\noptimized directly using gradients. Therefore, we use a heuristic approach to optimize D. Speci\ufb01cally,\nwe preserve a real-valued matrix Dr which is used to construct the Boolean matrix D. We de\ufb01ne\nD =I(Dr, \u03b1) where I(v, \u03b1) is an element-wise function that outputs 1 if v > \u03b1 and 0 otherwise. \u03b1\nis a \ufb01xed threshold. We will update Dr with the following equation:\n\n{Dr}t+1 = {Dr}t \u2212 \u03b7\n\n\u2202L\n\u2202D\n\n(5)\n\nwhere Dr is only computed in order to update D. In both forward and backward passes, only D\nis used for computation, but Dr is used to generate D. Essentially, the gradient w.r.t D serves as a\nnoisy gradient for Dr. Similar optimization strategy has also been employed in [22, 12, 42]. R is\nupdated end-to-end using back-propagation. It is also easy to dynamically produce D with a neural\nnetwork, but we do not consider this case for simplicity.\n4 Neural Similarity Networks\nAfter introducing the neural similarity learning of a single convolution kernel, we discuss how to\nconstruct a neural similarity network using this building block. 
In order to save parameters, we let all\nthe convolution kernels of the same layer share the same neural similarity matrix, which means that\nwe require the same convolution layer has the same similarity measure. We will empirically validate\nthis design choice in Section 7.1. Stacking convolution layers with static (dynamic) neural similarity\ngives us static (dynamic) NSN. Note that, static NSN has the same number of parameters as standard\nCNN in deployment but yields better generalization ability. Compared to [28], dynamic NSN has\nbetter regularity on the convolution kernel and is also able to utilize the pretrained CNN models.\nTraining from pretrained models. In order to make use of the pretrained models, we can simply use\nthe pretrained model as our backbone network (with all the weights loaded). Then we add the static or\ndynamic neural similarity modules to the convolutional kernels and train the neural similarity modules\nwith backbone weights \ufb01xed until convergence. Optionally, we can \ufb01netune the entire network after\nthe training of the neural similarity module. In contrast, the other dynamic networks [18, 28] are\nnot able to take advantage of the pretrained models. Note that, it is not necessary for both static and\ndynamic NSN to be trained from pretrained models. They can also be trained from scratch (weights\nof both backbone and neural similarity module are optimized from random initialization) and still\nyield better result than the CNN baselines.\nTraining and inference. Similar to CNNs, both static and dynamic NSN can be trained end-to-end\nusing mini-batch stochastic gradient descent. Apart from that the factorized form with D and R\nneeds to be optimized using a heuristic approach, the training is basically the same as the standard\nCNN. In the inference stage, we can compute all the equivalent weights for static NSN in advance to\nspeed up inference in practice. 
For dynamic NSN, the inference is also similar to the standard CNN, with slightly more computation from the neural similarity module.\n5 Theoretical Insights\n5.1 Implicit Regularization Induced by NSL\nAs mentioned before, NSL can be viewed as a form of matrix multiplication where the weight matrix W is factorized as M^T W' (W' is the new weight matrix and M is the similarity matrix). Such a factorized form not only provides more modeling and regularization flexibility, but also introduces an implicit regularization (in gradient descent). The implicit regularization in matrix factorization is studied in [16]. We first compare the behavior of gradient descent on W and {W', M} to observe the difference. We consider a simple example of a one-layer neural network with least square loss (i.e., linear regression): min_W L(W) := (1/2) Σ_i ||y_i − W^T X_i||_2^2, where W ∈ R^{n×m} is the weight matrix for neurons, y_i ∈ R^m is the target and X_i ∈ R^n is the i-th sample. The behavior of gradient descent with an infinitesimally small learning rate can be captured by the differential equation Ẇ_t + ∇L(W_t) = 0 with an initial condition W_0, where Ẇ_t := dW_t/dt. For NSL, the objective becomes min_{W', M} L(W', M) := (1/2) Σ_i ||y_i − W'^T M X_i||_2^2, so the corresponding differential equations
Therefore, the gradient \ufb02ows of the standard update\n\n(De\ufb01ne ri\n(cid:62)\n\n(cid:88)\n\n(cid:88)\n\n(6)\n\n(cid:62)\nt Xi)\n\n(cid:62)\n\n(cid:62)\n\n\u02d9Wt =\ni\n(cid:62)\n\u02d9Wt = M\nt\n\nXi(yi \u2212 W\n(cid:62)\n\u02d9W\nt W\n\n(cid:48)\nt + \u02d9M\n\n=\ni\n(cid:62)\nt Mt\n\nXi(ri\nt)\nXi(ri\nt)\n\n(cid:48)\nt = M\n\nt = yi \u2212 W\nXi(ri\nW\nt)\n\n(cid:62)\n\n(cid:62)\nt Xi)\n(cid:48)(cid:62)\n(cid:48)\nt W\nt\n\nfrom which we observe that the gradient dynamics of the NSL update is very different from the\ngradient dynamics of the standard update. Therefore, NSL may introduce a regularization effect\nthat is different from the standard update, and we argue that such implicit regularization induced by\nNSL is bene\ufb01cial to the generalization power. [16] conjectures that optimizing matrix factorization\nwith gradient descent implicitly regularizes the solution towards minimum nuclear norm. [5] extends\nthe analysis of implicit regularization to deep matrix factorization (i.e., multi-layer linear neural\nnetworks) and shows that multi-layer matrix factorization enhances an implicit tendency towards\nlow-rank solution. [15, 27] show that gradient descent converges to the maximum margin solution in\nlinear neural networks for binary classi\ufb01cation of separable data. More interestingly, [5] argues that\nimplicit regularization in matrix factorization may not be captured using simple mathematical norms.\n5.2 Connection to Dynamical Systems\nClassic dynamic neural unit (DNU) [17] receives not only external inputs but also state feedback\nsignals from themselves and other neurons. A general mathematical model of an isolated DNU is\ngiven by a differential equation \u02d9x(t) =\u2212\u03b1x(t) + f (w, x(t), u), y(t) = g(x(t)) where x is DNU\u2019s\nneural state, wi is the weight vector, u is the external input, f (\u00b7) is the nonlinear activation and g(\u00b7)\nis DNU\u2019s output. 
As a dynamical system, the output of a DNU depends on both the external input and the output time stamp. The neural state trajectory also depends on the equilibrium convergence properties of the DNU. Different from the DNU, dynamic NSN has no state feedback or self-recurrence. Instead, it realizes dynamic output with a neural similarity generator that changes the equivalent weight matrix adaptively based on the input. Nevertheless, it would be interesting to combine self-recurrence with NSL, since it can save parameters and strengthen the approximation power.

Recent work [9, 41, 49, 58] shows that many existing deep neural networks can be considered as different numerical schemes approximating an ordinary differential equation (ODE). NSN with certain similarity designs is also equivalent to approximating ODEs. For example, $f_M = W^\top(\tilde{W} + M)X = X_m + W^\top M X$, where $W^\top\tilde{W} = \mathrm{Diag}(0,\cdots,0,1,0,\cdots,0)$ (the 1 lies at the center location), can be written as $x_{n+1} = x_n + \Delta t\cdot g_n(x_n)$ (i.e., ResNet), where $x_n$ is the input feature map at depth $n$ and $g_n(\cdot)$ is the transformation at depth $n$. This is one step of the forward-Euler discretization of the ODE $\dot{x} = g(x, t)$. Different neural similarity designs correspond to different iterative methods for ODEs.

6 Discussions

Connection and comparison to existing works. Static NSN is a direct generalization of the standard CNN and can be viewed as factorized learning (with optional regularizations) of convolution kernels. Dynamic NSN can be viewed as a non-trivial generalization of hyperspherical convolution [39], in the sense that hyperspherical convolution is also input-dependent and can be viewed as the special case of $M$ being $\frac{1}{\|W\|\|X\|}I$. Compared to dynamic filter networks [28], dynamic NSN achieves a better trade-off between flexibility and generalization.
Dynamic filter networks are very flexible, since the weights are completely generated by another network, but this yields unsatisfactory image recognition accuracy. In contrast, dynamic NSN imposes strong regularization on the weights and is less flexible than dynamic filter networks, but it has much stronger generalization ability while still being dynamic. When $M$ has no constraints, our dynamic NSN becomes essentially equivalent to the dynamic filter network. [11] proposes to dynamically select filters to perform inference, while NSL dynamically estimates a similarity measure.

Dynamic NSN is a high-order function of the input. Dynamic NSN outputs $W^\top M_\theta(X)X$. Assume $M_\theta(X)$ is a one-layer neural network, i.e., $M_\theta(X) = W'X^\top$. Then the one-layer dynamic NSN is written as $W^\top W'X^\top X$, which is a quadratic function of $X$. In general, $M_\theta(X)$ is much more nonlinear, so a one-layer dynamic NSN is naturally a high-order function of the input $X$. Therefore, dynamic NSN has stronger approximation ability and flexibility than the standard convolution.

Self-attention as a global dynamic neural similarity. Since self-attention [62] is also a high-order function of the input, it can also be viewed as a form of dynamic neural similarity. We define a novel global neural similarity that reduces to self-attention in Appendix B.

Table 1: Predictor network (error % on CIFAR-10).
  Method             Error (%)
  Baseline CNN       7.78
  Dynamic NSN (CNN)  7.04
  Dynamic NSN (SN)   6.85

7 Applications

7.1 Generic Visual Recognition

Experimental settings. For fair comparison, the backbone network architecture is the same in each experiment. We mostly use a VGG-like plain CNN architecture. Detailed structures for the baselines and NSN are provided in Appendix A. For CIFAR-10 and CIFAR-100, we follow the same augmentation settings as [21].
For the ImageNet-2012 dataset, we mostly follow the settings in [30]. Batch normalization, ReLU, mini-batch size 128, and SGD with momentum 0.9 are used by default in all methods. For CIFAR-10 and CIFAR-100, we start momentum SGD with learning rate 0.1; the learning rate is divided by 10 at 34K and 54K iterations, and training stops at 64K iterations. For ImageNet, the learning rate starts at 0.1 and is divided by 10 at 200K, 375K, and 550K iterations (training finishes at 600K).

Different neural similarity predictors. We consider two types of architectures for the neural similarity predictor of dynamic NSN: CNN and SphereNet [39]. We experiment on CIFAR-10, and DNS ($M_s$ is diagonal) is used in NSN. Table 1 shows that SphereNet works better than a standard CNN as the neural similarity predictor. This is because SphereNet has better convergence properties and can stabilize NSN training. In fact, dynamic NSN cannot converge if a CNN is trivially used as the predictor; we have to apply normalization (or a sigmoid activation) to the predictor's final output to make it converge. In contrast, SphereNet makes dynamic NSN converge easily and perform better. Therefore, we use SphereNet as the neural similarity predictor for dynamic NSN by default.

Joint learning of kernel shape and similarity. We now evaluate how jointly learning the kernel shape and the similarity can improve NSN, using CIFAR-10. For both static and dynamic NSN, we use DNS ($M_s$ is a diagonal matrix). For dynamic NSN, we use SphereNet [39] as the neural similarity predictor. Table 2 shows that jointly learning $D$ and $R$ performs better than simply learning $M_s$. However, for simplicity, we still learn a single $M_s$ in the other experiments.

Shared vs. disjoint dynamic NSN. We evaluate the shared and disjoint parameterizations for the neural similarity predictor, using CIFAR-10. For both static and dynamic NSN, we use DNS.
Table 3 shows that the shared similarity predictor performs slightly worse than the disjoint one, but the shared one saves nearly half of the parameters used by the disjoint one.

CIFAR-10/100. We comprehensively evaluate both static and dynamic NSN on CIFAR-10 and CIFAR-100. All dynamic NSN variants use SphereNet as the neural similarity predictor. Both DNS and UNS are evaluated for comparison. Because dynamic NSN uses slightly more parameters than the baseline CNN, we construct a new baseline, CNN++, by making the baseline CNN deeper and wider such that its number of parameters is slightly larger than that of all NSN variants. The results in Table 4 verify the superiority of both static and dynamic NSN. Our dynamic NSN outperforms both baseline CNN and baseline CNN++ by a considerable margin. Moreover, dynamic NSN generally performs better than static NSN, showing that dynamic inference can be beneficial for image recognition. DNS and UNS perform similarly on CIFAR-10 and CIFAR-100, indicating that DNS is already flexible enough for the image recognition task.

ImageNet-2012. In order to be parameter-efficient, we evaluate dynamic NSN with DNS on the ImageNet-2012 dataset. The backbone is a VGG-like 10-layer plain CNN, so the absolute performance is not state-of-the-art; the purpose here is an apples-to-apples fair comparison. Using the same backbone network, dynamic NSN is significantly and consistently better than both baseline CNN and CNN++. Note that baseline CNN++ is a deeper and wider version of baseline CNN.
The results in Table 5 show that dynamic NSN yields strong generalization ability with the same number of parameters, and most importantly, the experiments demonstrate that the dynamic inference mechanism works well on a challenging large-scale image recognition task.

Table 2: Joint learning (error % on CIFAR-10).
  Method           Error (%)
  Baseline CNN     7.78
  Static NSN       7.15
  Static NSN (J)   6.92
  Dynamic NSN      6.85
  Dynamic NSN (J)  6.64

Table 3: Predictor parameterization (error % on CIFAR-10).
  Method                  Error (%)
  Baseline CNN            7.78
  Dynamic NSN (Shared)    7.20
  Dynamic NSN (Disjoint)  6.85

Table 4: Error (%) on CIFAR-10 & CIFAR-100.
  Method              CIFAR-10  CIFAR-100
  Baseline CNN        7.78      28.95
  Baseline CNN++      7.29      28.70
  Static NSN w/ DNS   7.15      28.35
  Static NSN w/ UNS   7.38      28.11
  Dynamic NSN w/ DNS  6.85      27.81
  Dynamic NSN w/ UNS  6.5       28.02

Table 5: Validation error (%) on ImageNet-2012.
  Method              # params  Top-1  Top-5
  Baseline CNN        8.90M     42.72  19.11
  Baseline CNN++      9.71M     42.11  18.98
  Dynamic NSN w/ DNS  9.61M     40.61  18.04

7.2 Few-Shot Learning

Static NSN. It is natural to apply static NSN to few-shot learning. Similar to the finetuning baseline, we first train a backbone network on the base class data. At test time, we first finetune both the static neural similarity matrix and the classifier on the novel class data and then use the finetuned classifier to make predictions. Note that, in order to use a pretrained backbone, we initialize the neural similarity matrix with the identity matrix. Due to the strong regularity we impose on the meta-similarity matrix, static NSN is able to preserve rich information from the pretrained model while quickly adapting to the novel class data.

Dynamic NSN. Dynamic NSN is well suited to few-shot learning due to its dynamic nature. Its filters are conditioned on the input.
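To make this input conditioning concrete, here is a minimal numpy sketch of one dynamic neural-similarity unit with a diagonal $M$ (DNS). The linear-plus-sigmoid predictor is a toy stand-in for the SphereNet-based predictor used in the paper, and all names and shapes are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

def dynamic_ns_layer(X, W, predictor_w):
    """One dynamic neural-similarity unit with diagonal M (DNS):
    f(X) = W^T M_theta(X) X, where the diagonal of M_theta(X) is
    generated from the input patch itself."""
    # Toy predictor: a single linear map + sigmoid. (The paper uses a
    # SphereNet-based predictor network; this stand-in is illustrative.)
    diag = 1.0 / (1.0 + np.exp(-(predictor_w @ X)))
    return W.T @ np.diag(diag) @ X   # input-dependent similarity, then kernels

n = 9                        # e.g., a flattened 3x3 single-channel patch
X = rng.normal(size=n)       # input patch
W = rng.normal(size=(n, 4))  # kernels for 4 output channels
predictor_w = rng.normal(size=(n, n))
out = dynamic_ns_layer(X, W, predictor_w)  # shape (4,)
```

With a zero predictor the sigmoid outputs a constant 0.5 on the diagonal, and the unit reduces to a scaled inner-product convolution, which matches the view of standard convolution as the special case of a fixed (identity-like) similarity.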
Because dynamic NSN learns meta-information about the similarity measure, its intermediate layers do not need to be finetuned at test time. From a high-level perspective, dynamic NSN shares some similarities with MAML [14], in the sense that dynamic NSN learns to transform its filters with a projection matrix, while MAML transforms its filters using gradient updates during inference. We directly train the dynamic NSN on the base class data. In the testing stage, we first retrain the classifiers using the novel class data, and then classify the query image using the dynamic NSN and the retrained classifier.

Meta-learned static NSN. Inspired by MAML [14], we propose to meta-learn the neural similarity. We pretrain the network on the base classes with identity similarity and then meta-learn the neural similarity and classifiers in the style of MAML. The meta-learned static NSN dynamically transforms its filters via projection using the gradients, similar to MAML. The meta-optimization is given by

$$\min_M \sum_{\tau_i \sim p(\tau)} \mathcal{L}_{\tau_i}(f_{M'}) \quad \text{s.t.} \quad M' = M - \eta \nabla_M \mathcal{L}_{\tau_i}(f_M),  \qquad (7)$$

which aims to learn a good initialization for the static neural similarity matrix. During testing, the procedure exactly follows MAML [14], except that the meta-learned static NSN only updates the neural similarity matrix with gradients. Pretrained models have recently been shown to perform well with certain normalization [10]. Meta-learned static NSN is able to take full advantage of the pretrained model, and can be viewed as an interpolation between the pretrained model and MAML [14].
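The meta-optimization in Eq. (7) can be sketched in a few lines. The least-squares task family, the analytic gradients, and the first-order (MAML-style) outer gradient below are simplifying assumptions for illustration, not the paper's training procedure:

```python
import numpy as np

def meta_step(M, tasks, inner_lr=0.1, outer_lr=0.01):
    """One step of the MAML-style meta-optimization of Eq. (7) over the
    similarity matrix M, for a toy task family with least-squares loss
    L_tau(f_M) = 0.5 * ||y - W^T M x||^2 (W fixed per task). Gradients
    are written out analytically; a real implementation would use
    autodiff, and the second-order term is omitted (first-order MAML)."""
    meta_grad = np.zeros_like(M)
    for W, x, y in tasks:
        # inner adaptation: M' = M - eta * grad_M L_tau(f_M)
        r = y - W.T @ M @ x
        g = -W @ np.outer(r, x)          # grad of 0.5*||y - W^T M x||^2 w.r.t. M
        M_adapted = M - inner_lr * g
        # outer gradient of L_tau(f_M'), evaluated at the adapted M'
        r_adapted = y - W.T @ M_adapted @ x
        meta_grad += -W @ np.outer(r_adapted, x)
    return M - outer_lr * meta_grad
```

When $M$ already fits every task (all residuals zero), both the inner and outer gradients vanish and `meta_step` leaves $M$ unchanged, which is the fixed point Eq. (7) seeks.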
In fact, the dynamic neural similarity can also be meta-learned in a similar fashion, which we leave for future investigation.

Experiments on Mini-ImageNet. The experimental protocol is the same as in [46, 14]. Following [46], we use 4 convolution layers with 32 3×3 filters per layer. Batch normalization [23], ReLU non-linearity and 2×2 pooling are used. For all the NSN variants, we use the best setup and hyperparameters. The results in Table 6 show that all three of our proposed few-shot learning strategies work reasonably well. Dynamic NSN outperforms the other competitive methods by a considerable margin. Static NSN works better than most existing methods. Meta-learned static NSN also shows clear advantages over its direct competitor, MAML. Moreover, we compare with the recent state-of-the-art method LEO [48], which uses features from a ResNet-28. Our dynamic NSN with a CNN-9 backbone achieves 77.44% accuracy, which is comparable to LEO while using far fewer network parameters. This experiment further validates the strong generalization ability of all NSN variants.

8 Concluding Remarks

We have proposed a general yet powerful framework that generalizes traditional convolution with the neural similarity. Our framework can capture the similarity structure in the data of interest, and regularizing the similarity to accommodate the nature of the input data may yield better performance. Our experiments on image recognition and few-shot learning show the potential of our framework to be flexible, generalizable and interpretable.
This framework can further be applied to more applications, e.g., semantic segmentation, and may inspire different threads of research.

Table 6: Few-shot classification (5-shot accuracy, %) on the Mini-ImageNet test set.
  Method                          Backbone   5-shot Accuracy
  Finetuning Baseline [46]        CNN-4      49.79 ± 0.79
  Nearest Neighbor Baseline [46]  CNN-4      51.04 ± 0.65
  MatchingNet [46]                CNN-4      55.31 ± 0.73
  ProtoNet [52]                   CNN-4      68.20 ± 0.66
  MAML [14]                       CNN-4      63.15 ± 0.91
  RelationNet [54]                CNN-4      65.32 ± 0.70
  Static NSN (ours)               CNN-4      65.74 ± 0.68
  Meta-learned static NSN (ours)  CNN-4      66.21 ± 0.69
  Dynamic NSN (ours)              CNN-4      71.26 ± 0.65
  Discriminative k-shot [6]       ResNet-34  73.90 ± 0.30
  Tadam [45]                      ResNet-12  76.7 ± 0.3
  LEO [48]                        ResNet-28  77.59 ± 0.12
  Dynamic NSN (ours)              CNN-9      77.44 ± 0.63

Acknowledgements

Weiyang Liu was supported in part by a Baidu Fellowship and an Nvidia GPU Grant. Le Song was supported in part by NSF grants CDS&E-1900017 D3SC, CCF-1836936 FMitF, IIS-1841351, CAREER IIS-1350983, and the DARPA Program on Learning with Less Labels.

References

[1] Marcin Andrychowicz, Misha Denil, Sergio Gomez, Matthew W Hoffman, David Pfau, Tom Schaul, Brendan Shillingford, and Nando De Freitas. Learning to learn by gradient descent by gradient descent. In NIPS, 2016.

[2] Antreas Antoniou, Amos Storkey, and Harrison Edwards. Data augmentation generative adversarial networks. arXiv preprint arXiv:1711.04340, 2017.

[3] Sanjeev Arora, Nadav Cohen, Noah Golowich, and Wei Hu. A convergence analysis of gradient descent for deep linear neural networks. arXiv preprint arXiv:1810.02281, 2018.

[4] Sanjeev Arora, Nadav Cohen, and Elad Hazan. On the optimization of deep networks: Implicit acceleration by overparameterization. arXiv preprint arXiv:1802.06509, 2018.

[5] Sanjeev Arora, Nadav Cohen, Wei Hu, and Yuping Luo. Implicit regularization in deep matrix factorization. arXiv preprint arXiv:1905.13655, 2019.

[6] Matthias Bauer, Mateo Rojas-Carulla, Jakub Bartłomiej Świątkowski, Bernhard Schölkopf, and Richard E Turner.
Discriminative k-shot learning using probabilistic models. arXiv preprint arXiv:1706.00326, 2017.

[7] Yoshua Bengio, Samy Bengio, and Jocelyn Cloutier. Learning a synaptic learning rule. Université de Montréal, Département d'informatique et de recherche . . . , 1990.

[8] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. TPAMI, 2018.

[9] Tian Qi Chen, Yulia Rubanova, Jesse Bettencourt, and David K Duvenaud. Neural ordinary differential equations. In NIPS, 2018.

[10] Wei-Yu Chen, Yen-Cheng Liu, Zsolt Kira, Yu-Chiang Frank Wang, and Jia-Bin Huang. A closer look at few-shot classification. In ICLR, 2019.

[11] Zhourong Chen, Yang Li, Samy Bengio, and Si Si. Gaternet: Dynamic filter selection in convolutional neural network via a dedicated global gating network. arXiv preprint arXiv:1811.11205, 2018.

[12] Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David. Binaryconnect: Training deep neural networks with binary weights during propagations. In NIPS, 2015.

[13] Jifeng Dai, Haozhi Qi, Yuwen Xiong, Yi Li, Guodong Zhang, Han Hu, and Yichen Wei. Deformable convolutional networks. In ICCV, 2017.

[14] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In ICML, 2017.

[15] Suriya Gunasekar, Jason D Lee, Daniel Soudry, and Nati Srebro. Implicit bias of gradient descent on linear convolutional networks. In NIPS, 2018.

[16] Suriya Gunasekar, Blake E Woodworth, Srinadh Bhojanapalli, Behnam Neyshabur, and Nati Srebro. Implicit regularization in matrix factorization. In NIPS, 2017.

[17] Madan Gupta, Liang Jin, and Noriyasu Homma. Static and dynamic neural networks: from fundamentals to advanced theory.
John Wiley & Sons, 2004.

[18] David Ha, Andrew Dai, and Quoc V Le. Hypernetworks. arXiv preprint arXiv:1609.09106, 2016.

[19] Bharath Hariharan and Ross Girshick. Low-shot visual recognition by shrinking and hallucinating features. In ICCV, 2017.

[20] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.

[21] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In ECCV, 2016.

[22] Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Binarized neural networks. In NIPS, 2016.

[23] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.

[24] Max Jaderberg, Karen Simonyan, Andrew Zisserman, et al. Spatial transformer networks. In NIPS, 2015.

[25] Yunho Jeon and Junmo Kim. Active convolution: Learning the shape of convolution for image classification. In CVPR, 2017.

[26] Yunho Jeon and Junmo Kim. Constructing fast network through deconstruction of convolution. In NIPS, 2018.

[27] Ziwei Ji and Matus Telgarsky. Gradient descent aligns the layers of deep linear networks. arXiv preprint arXiv:1810.02032, 2018.

[28] Xu Jia, Bert De Brabandere, Tinne Tuytelaars, and Luc V Gool. Dynamic filter networks. In NIPS, 2016.

[29] Kenji Kawaguchi. Deep learning without poor local minima. In NIPS, 2016.

[30] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012.

[31] Samuel J Leven and Daniel S Levine. Multiattribute decision making in context: A dynamic neural network methodology. Cognitive Science, 20(2):271–299, 1996.

[32] Ke Li and Jitendra Malik. Learning to optimize. arXiv preprint arXiv:1606.01885, 2016.
[33] Yuanzhi Li, Tengyu Ma, and Hongyang Zhang. Algorithmic regularization in over-parameterized matrix sensing and neural networks with quadratic activations. In COLT, 2018.

[34] Rongmei Lin, Weiyang Liu, Zhen Liu, Chen Feng, Zhiding Yu, James M. Rehg, Li Xiong, and Le Song. Compressive hyperspherical energy minimization. arXiv preprint arXiv:1906.04892, 2019.

[35] Weiyang Liu, Rongmei Lin, Zhen Liu, Lixin Liu, Zhiding Yu, Bo Dai, and Le Song. Learning towards minimum hyperspherical energy. In NIPS, 2018.

[36] Weiyang Liu, Zhen Liu, Zhiding Yu, Bo Dai, Rongmei Lin, Yisen Wang, James M Rehg, and Le Song. Decoupled networks. In CVPR, 2018.

[37] Weiyang Liu, Yandong Wen, Zhiding Yu, Ming Li, Bhiksha Raj, and Le Song. Sphereface: Deep hypersphere embedding for face recognition. In CVPR, 2017.

[38] Weiyang Liu, Yandong Wen, Zhiding Yu, and Meng Yang. Large-margin softmax loss for convolutional neural networks. In ICML, 2016.

[39] Weiyang Liu, Yan-Ming Zhang, Xingguo Li, Zhiding Yu, Bo Dai, Tuo Zhao, and Le Song. Deep hyperspherical learning. In NIPS, 2017.

[40] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.

[41] Yiping Lu, Aoxiao Zhong, Quanzheng Li, and Bin Dong. Beyond finite layer neural networks: Bridging deep architectures and numerical differential equations. In ICML, 2018.

[42] Arun Mallya, Dillon Davis, and Svetlana Lazebnik. Piggyback: Adapting a single network to multiple tasks by learning to mask weights. arXiv preprint arXiv:1801.06519, 2018.

[43] Tsendsuren Munkhdalai and Hong Yu. Meta networks. arXiv preprint arXiv:1703.00837, 2017.

[44] Behnam Neyshabur, Zhiyuan Li, Srinadh Bhojanapalli, Yann LeCun, and Nathan Srebro. Towards understanding the role of over-parametrization in generalization of neural networks. arXiv preprint arXiv:1805.12076, 2018.
[45] Boris Oreshkin, Pau Rodríguez López, and Alexandre Lacoste. Tadam: Task dependent adaptive metric for improved few-shot learning. In NIPS, 2018.

[46] Sachin Ravi and Hugo Larochelle. Optimization as a model for few-shot learning. In ICLR, 2017.

[47] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In NIPS, 2015.

[48] Andrei A Rusu, Dushyant Rao, Jakub Sygnowski, Oriol Vinyals, Razvan Pascanu, Simon Osindero, and Raia Hadsell. Meta-learning with latent embedding optimization. arXiv preprint arXiv:1807.05960, 2018.

[49] Lars Ruthotto and Eldad Haber. Deep neural networks motivated by partial differential equations. arXiv preprint arXiv:1804.04272, 2018.

[50] Jürgen Schmidhuber. Learning to control fast-weight memories: An alternative to dynamic recurrent networks. Neural Computation, 4(1):131–139, 1992.

[51] Albert Shaw, Wei Wei, Weiyang Liu, Le Song, and Bo Dai. Meta architecture search. In NeurIPS, 2019.

[52] Jake Snell, Kevin Swersky, and Richard Zemel. Prototypical networks for few-shot learning. In NIPS, 2017.

[53] Yu-Chuan Su and Kristen Grauman. Learning spherical convolution for fast features from 360 imagery. In NIPS, 2017.

[54] Flood Sung, Yongxin Yang, Li Zhang, Tao Xiang, Philip HS Torr, and Timothy M Hospedales. Learning to compare: Relation network for few-shot learning. In CVPR, 2018.

[55] Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Daan Wierstra, et al. Matching networks for one shot learning. In NIPS, 2016.

[56] Yingxu Wang. The cognitive processes of formal inferences. International Journal of Cognitive Informatics and Natural Intelligence (IJCINI), 1(4):75–86, 2007.
[57] Yu-Xiong Wang, Ross Girshick, Martial Hebert, and Bharath Hariharan. Low-shot learning from imaginary data. arXiv preprint arXiv:1801.05401, 2018.

[58] E Weinan. A proposal on machine learning via dynamical systems. Communications in Mathematics and Statistics, 5(1):1–11, 2017.

[59] Bichen Wu, Alvin Wan, Xiangyu Yue, Peter Jin, Sicheng Zhao, Noah Golmant, Amir Gholaminejad, Joseph Gonzalez, and Kurt Keutzer. Shift: A zero flop, zero parameter alternative to spatial convolutions. In CVPR, 2018.

[60] Eric P Xing, Michael I Jordan, Stuart J Russell, and Andrew Y Ng. Distance metric learning with application to clustering with side-information. In NIPS, 2003.

[61] Fisher Yu and Vladlen Koltun. Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122, 2015.

[62] Han Zhang, Ian Goodfellow, Dimitris Metaxas, and Augustus Odena. Self-attention generative adversarial networks. In ICML, 2019.

[63] Xizhou Zhu, Han Hu, Stephen Lin, and Jifeng Dai. Deformable convnets v2: More deformable, better results. arXiv preprint arXiv:1811.11168, 2018.