{"title": "Deep Neural Nets with Interpolating Function as Output Activation", "book": "Advances in Neural Information Processing Systems", "page_first": 743, "page_last": 753, "abstract": "We replace the output layer of deep neural nets, typically the softmax function, by a novel interpolating function. And we propose end-to-end training and testing algorithms for this new architecture. Compared to classical neural nets with softmax function as output activation, the surrogate with interpolating function as output activation combines advantages of both deep and manifold learning. The new framework demonstrates the following major advantages: First, it is better applicable to the case with insufficient training data. Second, it significantly improves the generalization accuracy on a wide variety of networks. The algorithm is implemented in PyTorch, and the code is available at https://github.com/\nBaoWangMath/DNN-DataDependentActivation.", "full_text": "Deep Neural Nets with Interpolating Function as\n\nOutput Activation\n\nBao Wang\n\nXiyang Luo\n\nDepartment of Mathematics\n\nUniversity of California, Los Angeles\n\nwangbaonj@gmail.com\n\nDepartment of Mathematics\n\nUniversity of California, Los Angeles\n\nxylmath@gmail.com\n\nZhen Li\n\nWei Zhu\n\nDepartment of Mathematics\n\nDepartment of Mathematics\n\nHKUST, Hong Kong\nlishen03@gmail.com\nZuoqiang Shi\n\nDuke University\n\nzhu@math.duke.edu\nStanley J. Osher\n\nDepartment of Mathematics\n\nTsinghua University\n\nzqshi@mail.tsinghua.edu.cn\n\nDepartment of Mathematics\n\nUniversity of California, Los Angeles\n\nsjo@math.ucla.edu\n\nAbstract\n\nWe replace the output layer of deep neural nets, typically the softmax function, by\na novel interpolating function. And we propose end-to-end training and testing\nalgorithms for this new architecture. 
Compared to classical neural nets with the softmax function as output activation, the surrogate with an interpolating function as output activation combines advantages of both deep and manifold learning. The new framework demonstrates the following major advantages: First, it is better applicable to the case with insufficient training data. Second, it significantly improves the generalization accuracy on a wide variety of networks. The algorithm is implemented in PyTorch, and the code is available at https://github.com/BaoWangMath/DNN-DataDependentActivation.

1 Introduction

Generalizability is crucial to deep learning, and many efforts have been made to improve the training and generalization accuracy of deep neural nets (DNNs) [3, 14]. Advances in network architectures such as VGG networks [28], deep residual networks (ResNets) [12, 13], and more recently DenseNets [16] and many others [6], together with powerful hardware, make the training of very deep networks with good generalization capabilities possible. Effective regularization techniques such as dropout and maxout [15, 30, 10], as well as data augmentation methods [19, 28, 32], have also explicitly improved generalization for DNNs.

A key component of neural nets is the activation function. Improvements in the design of activation functions, such as the rectified linear unit (ReLU) [8], have led to huge improvements in performance on computer vision tasks [23, 19]. More recently, activation functions adaptively trained on the data, such as the adaptive piecewise linear unit (APLU) [1] and the parametric rectified linear unit (PReLU) [11], have led to further performance improvements for DNNs. For the output activation, the support vector machine (SVM) has also been successfully applied in place of softmax [29]. 
Though training DNNs\nwith softmax or SVM as output activation is effective in many tasks, it is possible that alternative\nactivations that consider manifold structure of data by interpolating the output based on both training\nand testing data can boost performance of the network. In particular, ResNets can be reformulated as\nsolving control problems of a class of transport equations in the continuum limit [21, 5]. Transport\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\ftheory suggests that by using an interpolating function that interpolates terminal values from initial\nvalues can dramatically simplify the control problem compared to an ad-hoc choice. This further\nsuggests that a \ufb01xed and data-agnostic activation for the output layer may be suboptimal.\nTo this end, based on the ideas from manifold learning, we propose a novel output layer named\nweighted nonlocal Laplacian (WNLL) layer for DNNs. The resulted DNNs achieve better gen-\neralization and are more robust for problems with a small number of training examples. On CI-\nFAR10/CIFAR100, we achieve on average a 30%/20% reduction in terms of test error on a wide\nvariety of networks. These include VGGs, ResNets, and pre-activated ResNets. The performance\nboost is even more pronounced when the model is trained on a random subset of CIFAR with a low\nnumber of training examples. We also present an ef\ufb01cient algorithm to train the WNLL layer via an\nauxiliary network. Theoretical motivation for the WNLL layer is also given from the viewpoint of\nboth game theory and terminal value problems for transport equations.\nThis paper is structured as follows: In Section 2, we introduce the motivation and practice of using the\nWNLL interpolating function in DNNs. In Section 2.2, we explain in detail the algorithms for training\nand testing DNNs with WNLL as output layer. 
Section 3 provides insight of using an interpolating\nfunction as output layer from the angle of terminal value problems of transport equations and game\ntheory. Section 4 demonstrates the effectiveness of our method on a variety of numerical examples.\n\n2 Network Architecture\n\nIn coarse grained representation, training and testing DNNs with softmax layer as output are illustrated\nin Fig. 1 (a) and (b), respectively. In kth iteration of training, given a mini-batch training data (X, Y),\nwe perform:\nForward propagation: Transform X into deep features by DNN block (ensemble of conv layers,\nnonlinearities and others), and then activated by softmax function to obtain the predicted labels \u02dcY:\n\n\u02dcY = Softmax(DNN(X, \u0398k\u22121), Wk\u22121).\n\nThen compute loss (e.g., cross entropy) between Y and \u02dcY: L = Loss(Y, \u02dcY).\nBackpropagation: Update weights (\u0398k\u22121, Wk\u22121) by gradient descent (learning rate \u03b3):\n\nWk = Wk\u22121 \u2212 \u03b3\n\n\u2202L\n\u2202 \u02dcY\n\n\u00b7 \u2202 \u02dcY\n\u2202W\n\n, \u0398k = \u0398k\u22121 \u2212 \u03b3\n\n\u2202L\n\u2202 \u02dcY\n\n\u00b7 \u2202 \u02dcY\n\u2202 \u02dcX\n\n\u00b7 \u2202 \u02dcX\n\u2202\u0398\n\n.\n\nFigure 1: Training (a) and testing (b) procedures of DNNs with softmax as output activation layer.\n\n(a)\n\n(b)\n\nOnce the model is optimized, for testing data X, the predicted labels are:\n\n\u02dcY = Softmax(DNN(X, \u0398), W),\n\nfor notational simplicity, we still denote the test set and optimized weights as X, \u0398, and W,\nrespectively. In essence the softmax layer acts as a linear model on the space of deep features\n\u02dcX, which does not take into consideration the underlying manifold structure of \u02dcX. The WNLL\ninterpolating function, which will be introduced in the following subsection, is an approach to\nalleviate this de\ufb01ciency. 
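The softmax forward/backward step above (forward pass to deep features, softmax activation, cross-entropy loss, gradient descent on the output weights W) can be sketched in a framework-agnostic way. The stand-in deep features below replace the DNN block and are purely illustrative; the function names are ours:

```python
import numpy as np

def softmax(z):
    # Row-wise softmax with the usual max-shift for numerical stability.
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(y_onehot, y_prob):
    # Mean cross-entropy loss between one-hot labels and predicted probabilities.
    return -np.mean(np.sum(y_onehot * np.log(y_prob + 1e-12), axis=1))

def sgd_step_on_W(X_feat, Y, W, lr=0.1):
    """One gradient-descent step on the output-layer weights W only.

    X_feat : (n, d) deep features (the output of the DNN block),
    Y      : (n, c) one-hot labels, W : (d, c) output-layer weights.
    """
    probs = softmax(X_feat @ W)
    # Gradient of softmax cross-entropy w.r.t. W: X^T (probs - Y) / n.
    grad_W = X_feat.T @ (probs - Y) / len(Y)
    return W - lr * grad_W

rng = np.random.default_rng(0)
X_feat = rng.normal(size=(64, 8))            # stand-in for DNN(X, Theta)
Y = np.eye(2)[rng.integers(0, 2, size=64)]   # random one-hot labels
W = np.zeros((8, 2))
loss0 = cross_entropy(Y, softmax(X_feat @ W))
for _ in range(50):
    W = sgd_step_on_W(X_feat, Y, W)
loss1 = cross_entropy(Y, softmax(X_feat @ W))
```

In the full training loop the same chain rule extends through the DNN block to update Θ as well; here only the convex output-layer step is shown.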
Moreover, WNLL interpolation is based on the harmonic extension, which avoids the curse of dimensionality in high dimensional interpolation.

2.1 Manifold Interpolation: A Harmonic Extension Approach

Let X = {x_1, x_2, ..., x_n} be a set of points in a high dimensional manifold M ⊂ R^d and Xte = {xte_1, xte_2, ..., xte_m} be a subset of X. Suppose we have a (possibly vector valued) label function g(x) defined on Xte, and we want to interpolate a function u that is defined on the entire manifold and can be used to label the entire dataset X. Interpolation using basis functions in a high dimensional space suffers from the curse of dimensionality. Instead, harmonic extension is a natural and elegant approach to find such an interpolating function, defined by minimizing the following Dirichlet energy functional:

    E(u) = (1/2) ∑_{x,y∈X} w(x, y) (u(x) − u(y))^2,    (1)

with the boundary condition u(x) = g(x) for x ∈ Xte, where w(x, y) is a weight function, typically chosen to be Gaussian: w(x, y) = exp(−||x − y||^2 / σ^2), with σ a scaling parameter. The Euler-Lagrange equation for Eq. (1) is:

    ∑_{y∈X} (w(x, y) + w(y, x)) (u(x) − u(y)) = 0,  x ∈ X\Xte,
    u(x) = g(x),                                    x ∈ Xte.    (2)

By solving the linear system Eq. (2), we get the interpolated labels u(x) for the unlabeled data x ∈ X\Xte. This interpolation becomes invalid when the labeled data is tiny, i.e., |Xte| ≪ |X\Xte|. 
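A minimal sketch of this harmonic extension on a point cloud, assuming a dense Gaussian weight matrix and solving the linear system of Eq. (2) directly (function names are ours, not from the released code):

```python
import numpy as np

def gaussian_weights(X, sigma=1.0):
    # w(x, y) = exp(-||x - y||^2 / sigma^2), computed densely for clarity.
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / sigma**2)

def harmonic_interpolate(X, labeled_idx, g, sigma=1.0):
    """Solve Eq. (2): the graph harmonic extension of the labels g.

    X : (n, d) points, labeled_idx : indices of the labeled subset,
    g : (m, c) labels on the labeled subset. Returns u : (n, c).
    """
    n = len(X)
    W = gaussian_weights(X, sigma)
    S = W + W.T                          # the symmetrized weights w(x,y) + w(y,x)
    labeled_idx = np.asarray(labeled_idx)
    mask = np.ones(n, dtype=bool)
    mask[labeled_idx] = False
    un = np.where(mask)[0]
    deg = S.sum(axis=1)
    # For x unlabeled: deg(x) u(x) - sum_y S[x, y] u(y) = 0.
    A = np.diag(deg[un]) - S[np.ix_(un, un)]
    b = S[np.ix_(un, labeled_idx)] @ g
    u = np.zeros((n, g.shape[1]))
    u[labeled_idx] = g
    u[un] = np.linalg.solve(A, b)
    return u
```

On two well-separated clusters with one labeled point each, the unlabeled points inherit the label of their own cluster, which is the behavior the energy (1) encodes.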
There are two solutions to resolve this issue: one is to replace the 2-Laplacian in Eq. (1) by a p-Laplacian [4]; the other is to increase the weights of the labeled data in the Euler-Lagrange equation [27], which gives the following weighted nonlocal Laplacian (WNLL) interpolating function:

    ∑_{y∈X} (w(x, y) + w(y, x)) (u(x) − u(y)) + (|X|/|Xte| − 1) ∑_{y∈Xte} w(y, x) (u(x) − u(y)) = 0,  x ∈ X\Xte,
    u(x) = g(x),  x ∈ Xte.    (3)

For notational simplicity, we name the solution u(x) to Eq. (3) as WNLL(X, Xte, Yte). For classification tasks, g(x) is the one-hot label for the example x. To ensure accuracy of WNLL, the labeled data should cover all classes of data in X. We give a necessary condition in Theorem 1.

Theorem 1. Suppose we have a data pool formed by N classes of data uniformly, with the number of instances of each class sufficiently large. If we want all classes of data to be sampled at least once, on average at least N(1 + 1/2 + 1/3 + ··· + 1/N) data need to be sampled from the data pool. In this case, the number of data sampled, in expectation for each class, is 1 + 1/2 + 1/3 + ··· + 1/N.

2.2 WNLL Activated DNNs and Algorithms

In both training and testing of the WNLL activated DNNs, we need to reserve a small portion of data/label pairs, denoted (Xte, Yte), to interpolate the labels Y for new data. We name (Xte, Yte) the preserved template. Directly replacing softmax by WNLL (Fig. 2(a)) causes difficulties in backpropagation, namely, the true gradient ∂L/∂Θ is difficult to compute since WNLL defines a very complex implicit function. Instead, to train WNLL activated DNNs, we propose a proxy via an auxiliary neural net (Fig. 2(b)). 
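Before turning to the training algorithm, note that Eq. (3) differs from the plain harmonic extension only by the boost term with factor |X|/|Xte| − 1 on edges into the labeled set. A dense NumPy sketch of the WNLL solve (our own naming, not the released implementation):

```python
import numpy as np

def wnll_interpolate(X, labeled_idx, g, sigma=1.0):
    """Solve the WNLL system of Eq. (3) on a point cloud.

    X : (n, d) points, labeled_idx : indices of the labeled template,
    g : (m, c) one-hot labels on the template. Returns u : (n, c).
    """
    n = len(X)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    W = np.exp(-d2 / sigma**2)           # Gaussian weights w(x, y)
    S = W + W.T                          # w(x, y) + w(y, x)
    L = np.asarray(labeled_idx)
    mask = np.ones(n, dtype=bool)
    mask[L] = False
    un = np.where(mask)[0]
    mu = n / len(L) - 1.0                # the boost factor |X|/|Xte| - 1
    deg = S.sum(axis=1)
    c = W[L, :].sum(axis=0)              # c(x) = sum over labeled y of w(y, x)
    # For x unlabeled:
    # (deg(x) + mu c(x)) u(x) - sum_y S[x,y] u(y) - mu sum_{y in L} w(y,x) g(y) = 0.
    A = np.diag(deg[un] + mu * c[un]) - S[np.ix_(un, un)]
    b = (S[np.ix_(un, L)] + mu * W.T[np.ix_(un, L)]) @ g
    u = np.zeros((n, g.shape[1]))
    u[L] = g
    u[un] = np.linalg.solve(A, b)
    return u
```

With mu = 0 this reduces exactly to the harmonic extension of Eq. (2); the positive mu pulls unlabeled points more strongly toward their labeled neighbors, which is what keeps the interpolation valid when |Xte| is small.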
On top of the original DNNs, we add a buffer block (a fully connected layer\nfollowed by a ReLU), and followed by two parallel layers, WNLL and the linear (fully connected)\nlayers. The auxiliary DNNs can be trained by alternating between the following two steps (training\nDNNs with linear and WNLL activations, respectively):\nTrain DNNs with linear activation: Run N1 steps of the following forward and back propagation,\nwhere in kth iteration, we have:\nForward propagation: The training data X is transformed, respectively, by DNN, Buffer and Linear\nblocks to the predicted labels \u02dcY:\n\n\u02dcY = Linear(Bu\ufb00er(DNN(X, \u0398k\u22121), Wk\u22121\n\nB ), Wk\u22121\nL ).\n\n3\n\n\fThen compute loss between the ground truth labels Y and predicted ones \u02dcY, denoted as LLinear (e.g.,\ncross entropy loss, and the same as following LWNLL).\nBackpropagation: Update weights (\u0398k\u22121, Wk\u22121\n\nB , Wk\u22121\n\nWk\n\nL = Wk\u22121\n\nL \u2212 \u03b3\n\n\u2202 \u02dcY\n\n\u2202LLinear\n\n\u00b7 \u2202 \u02dcY\n\u2202WL\n\u0398k = \u0398k\u22121 \u2212 \u03b3\n\n, Wk\n\u2202LLinear\n\n\u2202 \u02dcY\n\nL ) by gradient descent:\n\u00b7 \u2202 \u02dcY\n\u2202 \u02c6X\n\nB \u2212 \u03b3\n\n\u2202LLinear\n\nB = Wk\u22121\n\n\u00b7 \u2202 \u02c6X\n\u2202WB\n\n,\n\n\u2202 \u02dcY\n\u00b7 \u2202 \u02dcX\n\u2202\u0398\n\n.\n\n\u00b7 \u2202 \u02dcY\n\u2202 \u02c6X\n\n\u00b7 \u2202 \u02c6X\n\u2202 \u02dcX\n\nTrain DNNs with WNLL activation: Run N2 steps of the following forward and back propagation,\nwhere in kth iteration, we have:\nForward propagation: The training data X, template Xte and Yte are transformed, respectively, by\nDNN, Buffer, and WNLL blocks to get predicted labels \u02c6Y:\n\n\u02c6Y = WNLL(Bu\ufb00er(DNN(X, \u0398k\u22121), Wk\u22121\n\nB ), \u02c6Xte, Yte).\n\nThen compute loss, LWNLL, between the ground truth labels Y and predicted ones \u02c6Y.\nBackpropagation: Update weights Wk\u22121\nin training DNNs with linear activation, by gradient 
descent.\n\nB only, Wk\u22121\n\nand \u0398k\u22121 will be tuned in the next iteration\n\nL\n\nWk\n\nB = Wk\u22121\n\nB \u2212 \u03b3\n\n\u2202LWNLL\n\n\u2202 \u02c6Y\n\n\u00b7 \u2202 \u02c6Y\n\u2202 \u02c6X\n\n\u00b7 \u2202 \u02c6X\n\u2202WB\n\n\u2248 Wk\u22121\n\nB \u2212 \u03b3\n\n\u2202LLinear\n\n\u2202 \u02dcY\n\n\u00b7 \u2202 \u02dcY\n\u2202 \u02c6X\n\n\u00b7 \u2202 \u02c6X\n\u2202WB\n\n.\n\n(4)\n\n\u2202 \u02dcY\n\n\u00b7 \u2202 \u02dcY\n\u2202 \u02c6X\n\nHere we use the computational graph of the left branch (linear layer) to retrieval the approximated\n\u2248\ngradients for WNLL. For a given loss value of LWNLL, we adopt the approximation \u2202LWNLL\n\u2202LLinear\nwhere the right hand side is also evaluated at this value. The main heuristic behind\nthis approximation is the following: WNLL de\ufb01nes a harmonic function implicitly, and a linear\nfunction is the simplest nontrivial explicit harmonic function. Empirically, we observe this simple\napproximation works well in training the network. The reason why we freeze the network in the\nDNN block is mainly due to stability concerns.\n\n\u00b7 \u2202 \u02c6Y\n\u2202 \u02c6X\n\n\u2202 \u02c6Y\n\n(a)\n\n(b)\n\n(c)\n\nFigure 2: Training and testing procedure of the deep neural nets with WNLL as the last activation\nlayer.(a): Direct replacement of the softmax by WNLL, (b): An alternating training procedure. (c):\nTesting.\n\nThe above alternating scheme is an algorithm of a greedy fashion. During training, WNLL activation\nplays two roles: on one hand, the alternating between linear and WNLL activations bene\ufb01ts each other\nwhich enables the neural nets to learn features that is appropriate for both linear classi\ufb01cation and\nWNLL based manifold interpolation. On the other hand, in the case where we lack suf\ufb01cient training\ndata, the training of DNNs usually gets stuck at some bad local minima which cannot generalize well\non new data. 
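Returning for a moment to the gradient approximation of Eq. (4): its ingredient is the linear branch's chain rule through the buffer block (fully connected layer plus ReLU), whose product with a loss gradient is reused as a stand-in when the loss value comes from WNLL. The toy below (our construction, with an MSE loss standing in only to keep the derivative short) computes that chain and checks it against finite differences:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def linear_branch_loss(W_B, W_L, X_tilde, Y):
    """Forward pass of the auxiliary net's linear branch:
    buffer (fully connected + ReLU), then the linear classifier, then a loss.
    (MSE is used here only to keep the toy derivative short.)"""
    X_hat = relu(X_tilde @ W_B)    # buffer output, the deep feature fed to WNLL too
    Y_tilde = X_hat @ W_L          # linear-branch prediction
    return 0.5 * np.mean((Y_tilde - Y) ** 2), X_hat, Y_tilde

def proxy_grad_WB(W_B, W_L, X_tilde, Y):
    """dL_Linear/dW_B via the chain rule. Eq. (4) reuses this same chain
    (through the linear head, the ReLU, and the fully connected layer)
    as a stand-in gradient when the loss value comes from the WNLL branch."""
    loss, X_hat, Y_tilde = linear_branch_loss(W_B, W_L, X_tilde, Y)
    dL_dY = (Y_tilde - Y) / Y.size         # dL/dY_tilde for the MSE loss
    dL_dXhat = dL_dY @ W_L.T               # back through the linear classifier
    dL_dpre = dL_dXhat * (X_hat > 0)       # back through the ReLU
    return X_tilde.T @ dL_dpre             # back through the fully connected layer

rng = np.random.default_rng(1)
X_tilde = rng.normal(size=(16, 5))
Y = rng.normal(size=(16, 3))
W_B = rng.normal(size=(5, 4))
W_L = rng.normal(size=(4, 3))
G = proxy_grad_WB(W_B, W_L, X_tilde, Y)
```

In the actual algorithm this gradient direction, scaled by the WNLL loss value, updates W_B only, exactly as in Eq. (4).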
We use WNLL interpolation which provides a perturbation to the trained sub-optimal\nweights and can help to arrive at a local minima with better generalizability. At test time, we remove\nthe linear classi\ufb01er from the neural nets and use the DNN block together with WNLL to predict new\ndata (Fig. 2 (c)). The reason for using WNLL instead of a linear layer is because WNLL is superior\n\n4\n\n(X,Y),(Xte,Yte)(\u02dcX,\u02dcXte)=DNN(X,Xte,\u0398)\u02dcY=Linear(\u02c6X,WL)\u02c6Y=WNLL(\u02c6X,\u02c6Xte,Yte)Loss(\u02dcY,Y)(\u02c6X,\u02c6Xte)=Bu\ufb00er(\u02dcX,\u02dcXte,WB)Loss(\u02c6Y,Y)\fto the linear classi\ufb01er and this superiority is preserved when applied to deep features (which will be\nshown in Section. 4). Moreover, WNLL utilizes both the learned DNNs and the preserved template\nat test time which seems to be more stable to perturbations on the input data.\nWe summarize the training and testing procedures for the WNLL activated DNNs in Algorithms 1\nand 2, respectively. In each round of the alternating procedure i.e., each outer loop in Algorithm. 1,\nthe entire training set (X, Y) is \ufb01rst used to train the DNNs with linear activation. We randomly\nseparate a template, e.g., half of the entire data, from the training set which will be used to perform\nWNLL interpolation in training WNLL activated DNNs. In practice, for both training and testing, we\nuse minibatches for both the template and the interpolated points when the entire dataset is too large.\nThe \ufb01nal predicted labels are obtained by a majority voted across interpolation results from all the\ntemplate minibatches.\nRemark 1. In Algorithm. 1, the WNLL interpolation is also performed in mini-batch manner (as\nshown in the inner iteration). 
Based on our experiments, this does not reduce the interpolation\naccuracy signi\ufb01cantly.\n\nAlgorithm 1 DNNs with WNLL as Output Activation: Training Procedure.\n\nInput: Training set: (data, label) pairs (X, Y).\nOutput: An optimized DNNs with WNLL as output activation, denoted as DNNWNLL.\nfor iter = 1,. . . , N (where N is the number of alternating steps.) do\n\n//Train the left branch: DNNs with linear activation.\nTrain DNN + Linear blocks, and denote the learned model as DNNLinear.\n//Train the right branch: DNNs with WNLL activation.\nSplit (X, Y) into training data and template, i.e., (X, Y)\nfor i = 1, 2,\u00b7\u00b7\u00b7 , M do\n\nPartition the training data into M mini-batches, i.e., (Xtr, Ytr) =(cid:83)M\n(cid:83) Xte by DNNLinear, i.e., \u02dcXtr(cid:83) \u02dcXte = DNNLinear(Xtr\nApply WNLL (Eq.(3)) on { \u02dcXtr(cid:83) \u02dcXte, Yte} to interpolate label \u02dcYtr.\n\n= (Xtr, Ytr)(cid:83)(Xte, Yte).\n(cid:83) Xte).\n\nTransform Xtr\ni\n\ni=1(Xtr\n\ni ).\ni , Ytr\n\n.\n\ni\n\nBackpropagate the error between Ytr and \u02dcYtr via Eq.(4) to update WB only.\n\nAlgorithm 2 DNNs with WNLL as Output Activation: Testing Procedure.\n\nInput: Testing data X, template (Xte, Yte). Optimized model DNNWNLL.\nOutput: Predicted label \u02dcY for X.\n\nApply the DNN block of DNNWNLL to X(cid:83) Xte to get the representation \u02dcX(cid:83) \u02dcXte.\nApply WNLL (Eq.(3)) on { \u02dcX(cid:83) \u02dcXte, Yte} to interpolate label \u02dcY.\n\n3 Theoretical Explanation\n\nIn training WNLL activated DNNs, the two output activation functions in the auxiliary networks\nare, in a sense, each competing to minimize its own objective where, in equilibrium, the neural nets\ncan learn better features for both linear and interpolation-based activations. This in \ufb02avor is similar\nto generative adversarial nets (GAN) [9]. 
Another interpretation of our model is the following: As\nnoted in [21], in the continuum limit, ResNet can be modeled as the following control problem for a\ntransport equation:\n\n(cid:40) \u2202u(x,t)\n\u2202t + v(x, t) \u00b7 \u2207u(x, t) = 0 x \u2208 X, t \u2265 0\n\n(5)\n\nu(x, 1) = f (x)\n\nx \u2208 X.\n\nHere u(\u00b7, 0) is the input of the continuum version of ResNet, which maps the training data to the\ncorresponding label. f (\u00b7) is the terminal value which analogous to the output activation function in\nResNet which maps deep features to the predicted label. Training ResNet is equivalent to tuning\nv(\u00b7, t), i.e., continuous version of the weights, s.t. the predicted label f (\u00b7) matches that of the training\ndata. If f (\u00b7) is a harmonic extension of u(\u00b7, 0), the corresponding weights v(x, t) would be close to\nzero. This results in a simpler model and may generalize better from a model selection point of view.\n\n5\n\n\f4 Numerical Results\n\nTo validate the classi\ufb01cation accuracy, ef\ufb01ciency and robustness of the proposed framework, we test\nthe new architecture and algorithm on CIFAR10, CIFAR100 [18], MNIST[20] and SVHN datasets\n[24]. In all experiments, we apply standard data augmentation that is widely used for the CIFAR\ndatasets [12, 16, 31]. For MNIST and SVHN, we use the raw data without any augmentation. We\nimplement our algorithm on the PyTorch platform [26]. All computations are carried out on a machine\nwith a single Nvidia Titan Xp graphics card.\nBefore diving into the performance of DNNs with different output activation functions, we \ufb01rst\ncompare the performance of WNLL with softmax on the raw input images for various datasets. The\ntraining sets are used to train the softmax models and interpolate labels for testing set in softmax\nand WNLL, respectively. Table 1 lists the classi\ufb01cation accuracies of WNLL and softmax on three\ndatasets. 
For WNLL interpolation, in order to speed up the computation, we use only 15 nearest neighbors to ensure sparsity of the weight matrix, and the 8th neighbor's distance is used to normalize the weight matrix. The nearest neighbors are found via an approximate nearest neighbor (ANN) algorithm [22]. WNLL outperforms softmax significantly in all three tasks. These results show the potential of using WNLL instead of softmax as the output activation function in DNNs.

Table 1: Accuracies of softmax and WNLL in classifying some classical datasets.

Dataset   CIFAR10   MNIST     SVHN
softmax   39.91%    92.65%    24.66%
WNLL      40.73%    97.74%    56.17%

For the deep learning experiments below, we take two passes of alternating steps, i.e., N = 2 in Algorithm 1. For the linear activation stage (Stage 1), we train the network for n = 400 epochs. For the WNLL stage, we train for n = 5 epochs. In the first pass, the initial learning rate is 0.05, halved after every 50 epochs when training the linearly activated DNNs, and 0.0005 when training the WNLL activation. The same Nesterov momentum and weight decay as used in [12, 17] are used for the CIFAR and SVHN experiments, respectively. In the second pass, the learning rate is set to one fifth of that at the corresponding epochs of the first pass. The batch sizes are 128 and 2000 when training the softmax/linear and WNLL activated DNNs, respectively. For a fair comparison, we train the vanilla DNNs with softmax output activation for 810 epochs with the same optimizers used for the WNLL activated ones. All final test errors reported for the WNLL method are obtained using the WNLL activation for prediction on the test set. In the rest of this section, we show that the proposed framework mitigates the issue of insufficient training data and boosts the generalization accuracy of DNNs via numerical results on CIFAR10/CIFAR100. 
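The sparse weight construction described above (15 nearest neighbors, per-point σ(x) taken from the 8th neighbor's distance) can be sketched with a brute-force neighbor search standing in for the ANN library; the function name is ours:

```python
import numpy as np

def knn_gaussian_weights(X, k=15, r=8):
    """Sparse Gaussian weight matrix: keep only the k nearest neighbors of
    each point, and normalize by the distance to the r-th neighbor, i.e.
    w(x, y) = exp(-||x - y||^2 / sigma(x)^2) with sigma(x) the r-th NN distance.
    Brute-force O(n^2) search; in practice an ANN library is used instead [22].
    """
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    order = np.argsort(d2, axis=1)       # order[i, 0] is the point i itself
    n = len(X)
    W = np.zeros((n, n))
    for i in range(n):
        nbrs = order[i, 1:k + 1]         # the k nearest neighbors (excluding self)
        sigma2 = d2[i, order[i, r]]      # squared distance to the r-th neighbor
        W[i, nbrs] = np.exp(-d2[i, nbrs] / sigma2)
    return W

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 3))            # stand-in deep features
W = knn_gaussian_weights(X)
```

Each row then has exactly k nonzero weights in (0, 1], which keeps the WNLL linear system sparse and cheap to solve per mini-batch.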
The numerical results on SVHN are provided in the appendix.\n\n4.1 Resolving the Challenge of Insuf\ufb01cient Training Data\n\nWhen we do not have suf\ufb01cient training data, the generalization accuracy typically degrades as\nthe network goes deeper, as illustrated in Fig.3. The WNLL activated DNNs, with its superior\nregularization of the parameters and perturbation on bad local minima, are able to overcome this\ndegradation. The left and right panels plot the cases when the \ufb01rst 1000 and 10000 data in the training\nset of CIFAR10 are used to train the vanilla and WNLL DNNs. As shown in Fig. 3, by using WNLL\nactivation, the generalization error rates decay consistently as the network goes deeper, in contrast\nto the degradation for vanilla DNNs. The generalization accuracy between the vanilla and WNLL\nDNNs can differ up to 10 percent within our testing regime.\nFigure.4 plots the evolution of generalization accuracy during training. We compute the test accuracy\nper epoch. Panels (a) and (b) plot the test accuracies for ResNet50 with softmax and WNLL activations\n(1-400 and 406-805 epochs corresponds to linear activation), respectively, with only the \ufb01rst 1000\nexamples as training data from CIFAR10. Charts (c) and (d) are the corresponding plots with 10000\ntraining instances, using a pre-activated ResNet50. After around 300 epochs, the accuracies of the\nvanilla DNNs plateau and cannot improve any more. In comparison, the test accuracy for WNLL\njumps at the beginning of Stage 2 in \ufb01rst pass; during Stage 1 of the second pass, even though initially\nthere is an accuracy reduction, the accuracy continues to climb and eventually surpasses that of the\nWNLL activation in Stage 2 of \ufb01rst pass. The jumps in accuracy at epoch 400 and 800 are due to\n\n6\n\n\f(a)\n\n(b)\n\nFigure 3: Resolving the degradation problem of vanilla DNNs by WNLL activation. 
Panels (a) and\n(b) plot the generation errors when 1000 and 10000 training data are used to train the vanilla and the\nWNLL activated DNNs, respectively. In each plot, we test three different networks: PreActResNet18,\nPreActResNet34, and PreActResNet50. All tests are done on the CIFAR10 dataset.\n\nswitching from linear activation to WNLL for predictions on the test set. The initial decay when\nalternating back to softmax is caused partially by the \ufb01nal layer WL not being tuned with respect to\nthe deep features \u02dcX, and partially due to predictions on the test set being made by softmax instead\nof WNLL. Nevertheless, the perturbation via the WNLL activation quickly results in the accuracy\nincreasing beyond the linear stage in the previous pass.\n\n(a)\n\n(c)\n\n(b)\n\n(d)\n\nFigure 4: The evolution of the generation accuracies over the training procedure. Charts (a) and (b)\nare the accuracy plots for ResNet50 with 1000 training data, where (a) and (b) are plots for the epoch\nv.s. accuracy of the vanilla and the WNLL activated DNNs. Panels (c) and (d) correspond to the case\nof 10000 training data for PreActResNet50. All tests are done on the CIFAR10 dataset.\n\n4.2\n\nImproving Generalization Accuracy\n\nWe next show the superiority of WNLL activated DNNs in terms of generalization accuracies when\ncompared to their surrogates with softmax or SVM output activations. Besides ResNets, we also\n\n7\n\n\ftest the WNLL surrogate on the VGG networks. In table 2, we list the generalization errors for 15\ndifferent DNNs from VGG, ResNet, Pre-activated ResNet families on the entire, \ufb01rst 10000 and\n\ufb01rst 1000 instances of the CIFAR10 training set. We observe that WNLL in general improves more\nfor ResNets and pre-activated ResNets, with less but still signi\ufb01cant improvements for the VGGs.\nExcept for VGGs, we can achieve relatively 20% to 30% testing error rate reduction across all neural\nnets. 
All results presented here and in the rest of this paper are the median of 5 independent trials.\nWe also compare with SVM as an alternative output activation, and observe that the results are still\ninferior to WNLL. Note that the bigger batch-size is to ensure the interpolation quality of WNLL.\nA reasonable concern is that the performance increase comes from the variance reduction due to\nincreasing the batch size. However, experiments done with a batch size of 2000 for vanilla networks\nactually deteriorates the test accuracy.\n\nTable 2: Generalization error rates over the test set of vanilla DNNs, SVM and WNLL activated\nones trained over the entire, the \ufb01rst 10000, and the \ufb01rst 1000 instances of training set of CIFAR10.\n(Median of 5 independent trials)\n\nNetwork\n\nVGG11\nVGG13\nVGG16\nVGG19\nResNet20\nResNet32\nResNet44\nResNet56\nResNet110\nResNet18\nResNet34\nResNet50\n\nPreActResNet18\nPreActResNet34\nPreActResNet50\n\nWhole\n\n10000\n\n1000\n\nVanilla\n9.23%\n6.66%\n6.72%\n6.95%\n\n9.06% (8.75%[12])\n7.99% (7.51%[12])\n7.31% (7.17%[12])\n7.24% (6.97%[12])\n6.41% (6.43%[12])\n\n6.16%\n5.93%\n6.24%\n6.21%\n6.08%\n6.05%\n\nWNLL\nSVM Vanilla WNLL Vanilla WNLL\n7.35% 9.28% 10.37% 8.88% 26.75% 24.10%\n7.64% 24.85% 22.56%\n5.58% 7.47% 9.12%\n7.54% 25.41% 22.23%\n5.69% 7.29% 9.01%\n5.92% 7.99% 9.62%\n8.09% 25.70% 22.87%\n7.09% 9.60% 12.83% 9.96% 34.90% 29.91%\n5.95% 8.73% 11.18% 8.15% 33.41% 28.78%\n5.70% 8.67% 10.66% 7.96% 34.58% 27.94%\n9.83% 7.61% 37.83% 28.18%\n5.61% 8.58%\n8.91% 7.13% 42.94% 28.29%\n4.98% 8.06%\n4.65% 6.00% 8.26%\n6.29% 27.02% 22.48%\n6.11% 26.47% 20.27%\n4.26% 6.32% 8.31%\n6.49% 29.69% 20.19%\n4.17% 6.63% 9.64%\n6.61% 27.36% 21.88%\n4.74% 6.38% 8.20%\n6.34% 23.56% 19.02%\n4.40% 5.88% 8.52%\n4.27% 5.91% 9.18%\n6.05% 25.05% 18.61%\n\nTables 2 and 3 list the error rates of 15 different vanilla networks and WNLL activated networks\non CIFAR10 and CIFAR100 datasets. 
On CIFAR10, WNLL activated DNNs outperforms the\nvanilla ones with around 1.5% to 2.0% absolute, or 20% to 30% relative error rate reduction. The\nimprovements on CIFAR100 are more signi\ufb01cant. We independently ran the vanilla DNNs on both\ndatasets, and our results are consistent with the original reports and other researchers\u2019 reproductions\n[12, 13, 16]. We provide experimental results of DNNs\u2019 performance on SVHN data in the appendix.\nInterestingly, the improvement are more signi\ufb01cant on harder tasks, suggesting potential for our\nmethods to succeed on other tasks/datasets. For example, reducing the sizes of DNNs is an important\ndirection to make the DNNs applicable for generalize purposes, e.g., auto-drive, mobile intelligence,\netc. So far the most successful attempt is DNNs weights quantization[2]. Our approach is a new\ndirection for reducing the size of the model: to achieve the same level of accuracy, compared to the\nvanilla networks, our model\u2019s size can be much smaller.\n\n5 Concluding Remarks\n\nWe are motivated by ideas from manifold interpolation and the connection between ResNets and\ncontrol problems of transport equations. We propose to replace the classical output activation function,\ni.e., softmax, by a harmonic extension type of interpolating function. This simple surgery enables the\ndeep neural nets (DNNs) to make suf\ufb01cient use of the manifold information of data. An end-to-end\ngreedy style, multi-stage training algorithm is proposed to train this novel output layer. On one\nhand, our new framework resolves the degradation problem caused by insuf\ufb01cient data; on the other\nhand, it boosts the generalization accuracy signi\ufb01cantly compared to the baseline. This improvement\nis consistent across networks of different types and different number of layers. The increase in\n\n8\n\n\fTable 3: Error rates of the vanilla DNNs v.s. the WNLL activated DNNs over the whole CIFAR100\ndataset. 
(Median of 5 independent trials)

Network          Vanilla DNNs   WNLL DNNs
VGG11            32.68%         28.80%
VGG13            29.03%         25.21%
VGG16            28.59%         25.72%
VGG19            28.55%         25.07%
ResNet20         35.79%         31.53%
ResNet32         32.01%         28.04%
ResNet44         31.07%         26.32%
ResNet56         30.03%         25.36%
ResNet110        28.86%         23.74%
ResNet18         27.57%         22.89%
ResNet34         25.55%         20.78%
ResNet50         25.09%         20.45%
PreActResNet18   28.62%         23.45%
PreActResNet34   26.84%         21.97%
PreActResNet50   25.95%         21.51%

generalization accuracy could also be used to train smaller models with the same accuracy, which has great potential for mobile device applications.

5.1 Limitation and Future Work

There are several limitations of our framework which we wish to address in future work. Currently, the manifold interpolation step is still a computational bottleneck in both speed and memory. During the interpolation, in order to make the interpolation valid, the batch size needs to be quasilinear with respect to the number of classes. This poses memory challenges for the ImageNet dataset [7]. Another important issue is the approximation of the gradient of the WNLL activation function. A linear function is one option, but it is far from optimal. We believe a better harmonic function approximation can further lift the model's performance.

Due to the robustness and generalization capabilities shown by our experiments, we conjecture that by using the interpolating function as output activation, neural nets can become more stable to perturbations and adversarial attacks [25]. The reason for this stability conjecture is that our framework combines both the learned decision boundary and nearest neighbor information for classification.

Acknowledgments

This material is based on research sponsored by the Air Force Research Laboratory and DARPA under agreement number FA8750-18-2-0066, and by the U.S. 
Department of Energy, Office of Science, and by the National Science Foundation under Grant Numbers DOE-SC0013838 and DMS-1554564 (STROBE); and by NSF DMS-1737770 and the Simons Foundation. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon.

References

[1] F. Agostinelli, M. Hoffman, P. Sadowski, and P. Baldi. Learning activation functions to improve deep neural networks. arXiv preprint arXiv:1412.6830, 2014.

[2] M. Courbariaux, Y. Bengio, and J. David. BinaryConnect: Training deep neural networks with binary weights. NIPS, 2015.

[3] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle. Greedy layer-wise training of deep networks. NIPS, 2007.

[4] J. Calder. The game theoretic p-Laplacian and semi-supervised learning with few labels. arXiv:1711.10144, 2018.

[5] B. Chang, L. Meng, E. Haber, F. Tung, and D. Begert. Multi-level residual networks from dynamical systems view. arXiv preprint arXiv:1710.10348, 2017.

[6] Y. Chen, J. Li, H. Xiao, X. Jin, S. Yan, and J. Feng. Dual path networks. NIPS, 2017.

[7] J. Deng, W. Dong, R. Socher, J. Li, K. Li, and F. Li. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.

[8] X. Glorot, A. Bordes, and Y. Bengio. Deep sparse rectifier neural networks. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 315-323, 2011.

[9] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. Advances in Neural Information Processing Systems, pages 2672-2680, 2014.

[10] I. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, and Y. Bengio. Maxout networks. arXiv preprint arXiv:1302.4389, 2013.

[11] K. He, X. Zhang, S. Ren, and J. Sun.
Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision, pages 1026-1034, 2015.

[12] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. CVPR, pages 770-778, 2016.

[13] K. He, X. Zhang, S. Ren, and J. Sun. Identity mappings in deep residual networks. ECCV, 2016.

[14] G. Hinton, S. Osindero, and Y. Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527-1554, 2006.

[15] G. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012.

[16] G. Huang, Z. Liu, K. Weinberger, and L. van der Maaten. Densely connected convolutional networks. CVPR, 2017.

[17] G. Huang, Y. Sun, Z. Liu, D. Sedra, and K. Weinberger. Deep networks with stochastic depth. ECCV, 2016.

[18] A. Krizhevsky. Learning multiple layers of features from tiny images. 2009.

[19] A. Krizhevsky, I. Sutskever, and G. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097-1105, 2012.

[20] Y. LeCun. The MNIST database of handwritten digits. 1998.

[21] Z. Li and Z. Shi. Deep residual learning and PDEs on manifold. arXiv preprint arXiv:1708.05115, 2017.

[22] M. Muja and D. Lowe. Scalable nearest neighbor algorithms for high dimensional data. Pattern Analysis and Machine Intelligence (PAMI), 36, 2014.

[23] V. Nair and G. Hinton. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 807-814, 2010.

[24] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Ng. Reading digits in natural images with unsupervised feature learning.
NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.

[25] N. Papernot, P. McDaniel, S. Jha, M. Fredrikson, Z. Celik, and A. Swami. The limitations of deep learning in adversarial settings. arXiv:1511.07528, 2015.

[26] A. Paszke et al. Automatic differentiation in PyTorch. 2017.

[27] Z. Shi, S. Osher, and W. Zhu. Weighted nonlocal Laplacian on interpolation from sparse data. Journal of Scientific Computing, 73:1164-1177, 2017.

[28] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556, 2014.

[29] Y. Tang. Deep learning using linear support vector machines. arXiv:1306.0239, 2013.

[30] L. Wan, M. Zeiler, S. Zhang, Y. LeCun, and R. Fergus. Regularization of neural networks using dropconnect. In International Conference on Machine Learning, pages 1058-1066, 2013.

[31] S. Zagoruyko and N. Komodakis. Wide residual networks. BMVC, 2016.

[32] W. Zhu, Q. Qiu, J. Huang, R. Calderbank, G. Sapiro, and I. Daubechies. LDMNet: Low dimensional manifold regularized neural networks. UCLA CAM Report: 17-66, 2017.