{"title": "Can We Gain More from Orthogonality Regularizations in Training Deep Networks?", "book": "Advances in Neural Information Processing Systems", "page_first": 4261, "page_last": 4271, "abstract": "This paper seeks to answer the question: as the (near-) orthogonality of weights is found to be a favorable property for training deep convolutional neural networks, how can we enforce it in more effective and easy-to-use ways? We develop novel orthogonality regularizations on training deep CNNs, utilizing various advanced analytical tools such as mutual coherence and restricted isometry property. These plug-and-play regularizations can be conveniently incorporated into training almost any CNN without extra hassle. We then benchmark their effects on state-of-the-art models: ResNet, WideResNet, and ResNeXt, on several most popular computer vision datasets: CIFAR-10, CIFAR-100, SVHN and ImageNet. We observe consistent performance gains after applying those proposed regularizations, in terms of both the final accuracies achieved, and faster and more stable convergences. We have made our codes and pre-trained models publicly available: https://github.com/nbansal90/Can-we-Gain-More-from-Orthogonality.", "full_text": "Can We Gain More from Orthogonality\nRegularizations in Training Deep CNNs?\n\nNitin Bansal\n\nXiaohan Chen\n\nZhangyang Wang\n\nDepartment of Computer Science and Engineering\n\nTexas A&M University,\n\nCollege Station, TX 77843, USA\n\n{bansa01, chernxh, atlaswang}@tamu.edu\n\nAbstract\n\nThis paper seeks to answer the question: as the (near-) orthogonality of weights is\nfound to be a favorable property for training deep convolutional neural networks,\nhow can we enforce it in more effective and easy-to-use ways? We develop novel\northogonality regularizations on training deep CNNs, utilizing various advanced\nanalytical tools such as mutual coherence and restricted isometry property. 
These plug-and-play regularizations can be conveniently incorporated into training almost any CNN without extra hassle. We then benchmark their effects on state-of-the-art models: ResNet, WideResNet, and ResNeXt, on several of the most popular computer vision datasets: CIFAR-10, CIFAR-100, SVHN and ImageNet. We observe consistent performance gains after applying the proposed regularizations, in terms of both the final accuracies achieved and faster, more stable convergence. We have made our code and pre-trained models publicly available.¹

1 Introduction

Despite the tremendous success of deep convolutional neural networks (CNNs) [1], their training remains notoriously difficult both theoretically and practically, especially for state-of-the-art ultra-deep CNNs. The potential reasons for such difficulty are manifold, ranging from vanishing/exploding gradients [2], to feature statistic shifts [3], to the proliferation of saddle points [4], and so on. Various solutions have been proposed to alleviate these issues, examples of which include parameter initialization [5], residual connections [6], normalization of internal activations [3], and second-order optimization algorithms [4].
This paper focuses on one type of structural regularization: orthogonality, imposed on linear transformations between hidden layers of CNNs. Orthogonality implies energy preservation, which is extensively explored for filter banks in signal processing and guarantees that the energy of activations will not be amplified [7]. Therefore, it can stabilize the distribution of activations over layers within CNNs [8, 9] and make optimization more efficient. [5] advocates orthogonal initialization of weight matrices, and theoretically analyzes its effects on learning efficiency using deep linear networks.
Practical results on image classification using orthogonal initialization are also presented in [10]. More recently, a few works [11–15] have looked at (various forms of) enforcing orthogonality regularizations or constraints throughout training, as part of specialized models for applications such as classification [14] or person re-identification [16]. They observed encouraging improvements. However, a dedicated and thorough examination of the effects of orthogonality for training state-of-the-art general CNNs has been absent so far.

¹ https://github.com/nbansal90/Can-we-Gain-More-from-Orthogonality

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

Even more importantly, how to evaluate and enforce orthogonality for non-square weight matrices does not have a single optimal answer. As we will explain later, existing works employ the most obvious, but not necessarily the most appropriate, option. We will introduce a series of more sophisticated regularizers that lead to larger performance gains.
This paper investigates and pushes forward various ways to enforce orthogonality regularizations in training deep CNNs. Specifically, we introduce three novel regularization forms for orthogonality, ranging from a double-sided variant of the standard Frobenius norm-based regularizer, to regularizers utilizing the Mutual Coherence (MC) and Restricted Isometry Property (RIP) tools [17–19]. These orthogonality regularizations have a plug-and-play nature, i.e., they can be incorporated into training almost any CNN without hassle. We extensively evaluate the proposed orthogonality regularizations on three state-of-the-art CNNs: ResNet [6], ResNeXt [20], and WideResNet [21]. In all experiments, we observe consistent and remarkable accuracy boosts (e.g., a 2.31% top-1 accuracy gain on CIFAR-100 for WideResNet), as well as faster and more stable convergence, without any other change made to the original models.
This implies that many deep CNNs have not yet been pushed to their full potential, and that orthogonality regularizations can help. Our experiments further reveal that larger performance gains can be attained by designing stronger forms of orthogonality regularizations. We find the RIP-based regularizer, which has better analytical grounds for characterizing near-orthogonal systems [22], to consistently outperform existing Frobenius norm-based regularizers and others.

2 Related Work

To remedy unstable gradients and the covariate shift problem, [2, 23] advocated near-constant variances of each layer's output at initialization. [3] presented a major breakthrough in stabilizing training, by ensuring that each layer's outputs follow identical distributions, thereby reducing the internal covariate shift. [24] further decoupled the norm of the weight vector from its phase (direction) while introducing independence between minibatch examples, resulting in a better optimization problem. Orthogonal weights have been widely explored in Recurrent Neural Networks (RNNs) [25–30] to help avoid gradient vanishing/explosion. [25] proposed a soft constraint technique to combat vanishing gradients, by forcing the Jacobian matrices to preserve energy as measured by the Frobenius norm. The more recent study [29] investigated the effect of soft versus hard orthogonality constraints on the performance of RNNs, the soft version specifying an allowable range for the maximum singular value of the transition matrix and thus allowing small intervals around one.
In CNNs, orthogonal weights are also recognized to stabilize the layer-wise distribution of activations [8] and make optimization more efficient. [5, 10] presented the idea of orthogonal weight initialization in CNNs, which is driven by the norm-preserving property of orthogonal matrices: a similar outcome to what batch normalization tries to achieve. [5] analyzed the non-linear dynamics of CNN training.
Under simplified assumptions, they concluded that random orthogonal initialization of weights gives rise to the same convergence rate as unsupervised pre-training, and is superior to random Gaussian initialization. However, a good initial condition such as orthogonality is not necessarily sustained throughout training. In fact, weight orthogonality and isometry break down easily once training starts, if not properly regularized [5]. Several recent works [12, 13, 15] considered Stiefel manifold-based hard constraints on weights. [12] proposed a Stiefel layer to guarantee fully connected layers to be orthogonal by using Riemannian gradients, without considering similar handling for convolutional layers; the performance they reported on VGG networks [31] was less than promising. [13] extended Riemannian optimization to convolutional layers and required filters within the same channel to be orthogonal. To overcome the challenge that CNN weights are usually rectangular rather than square matrices, [15] generalized the Stiefel manifold property and formulated an Optimization over Multiple Dependent Stiefel Manifolds (OMDSM) problem. Different from [13], it ensured filters across channels to be orthogonal. A related work [11] adopted a Singular Value Bounding (SVB) method, explicitly thresholding the singular values of weight matrices within a pre-specified narrow band around the value of one.
The above methods [11–13, 15] all fall into the category of enforcing “hard orthogonality constraints” in optimization ([11] could be viewed as a relaxed constraint), and have to repeat singular value decomposition (SVD) during training. The cost of SVD on high-dimensional matrices is expensive even on GPUs, which is one reason why we choose not to go in the “hard constraint” direction in this paper.
Moreover, since CNN weight matrices cannot exactly lie on a Stiefel manifold when they are either very “thin” or “fat” (e.g., W^T W = I can never hold for an overcomplete “fat” W due to the rank deficiency of its Gram matrix), special treatments are needed to maintain the hard constraint. For example, [15] proposed group-based orthogonalization to first divide an over-complete weight matrix into “thin” column-wise groups, and then apply Stiefel manifold constraints group-wise. The strategy was also motivated by reducing the computational burden of computing large-scale SVDs.
Lately, [32, 33] interpreted CNNs as Template Matching Machines, and proposed a penalty term to force the templates to be orthogonal to each other, leading to significantly improved classification performance and reduced overfitting with no change to the deep architecture.
A recent work [14] explored orthogonal regularization, enforcing the Gram matrix of each weight matrix to be close to identity under the Frobenius norm. It constrains orthogonality among filters in one layer, leading to smaller correlations among learned features and implicitly reducing filter redundancy. Such a soft orthonormal regularizer is differentiable and requires no SVD, and is thus computationally cheaper than its “hard constraint” siblings. However, we will see later that Frobenius norm-based orthogonality regularization is only a rough approximation, and is inaccurate for “fat” matrices as well. The authors relied on a backward error modulation step, as well as group-wise orthogonalization similar to [15]. We also notice that [14] displayed the strong advantage of enforcing orthogonality in training the authors' self-designed plain deep CNNs (i.e., without residual connections).
However, they found less of a performance impact when applying the same to training prevalent network architectures such as ResNet [6]. In comparison, our orthogonality regularizations can be added to CNNs as “plug-and-play” components, without any other modification needed. We observe evident improvements brought by them on the most popular ResNet architectures.
Finally, we briefly outline a few works related to orthogonality in more general senses. One may notice that enforcing a matrix to be (near-)orthogonal during training will keep its spectral norm always equal (or close) to one, which links orthogonality regularization to spectral regularization. In [34], the authors showed that the spectrum of the Extended Data Jacobian Matrix (EDJM) affected the network performance, and proposed a spectral soft regularizer that encourages the major singular values of the EDJM to be closer to the largest one. [35] claimed that the maximum eigenvalue of the Hessian predicts the generalizability of CNNs. Motivated by that, [36] penalized the spectral norm of weight matrices in CNNs. A similar idea was later extended in [37] for training generative adversarial networks, by proposing a spectral normalization technique that normalizes the spectral norm/Lipschitz norm of the weight matrix to one.

3 Deriving New Orthogonality Regularizations

In this section, we derive and discuss several orthogonality regularizers. Note that these regularizers are applicable to both fully-connected and convolutional layers. The default mathematical expressions of the regularizers will be assumed on a fully-connected layer W ∈ R^{m×n} (m could be either larger or smaller than n).
For a convolutional layer C ∈ R^{S×H×C×M}, where S, H, C, M are the filter width, filter height, input channel number and output channel number, respectively, we first reshape C into a matrix form W' ∈ R^{m'×n'}, where m' = S × H × C and n' = M. The setting for regularizing convolutional layers follows [14, 15] to enforce orthogonality across filters, encouraging filter diversity. All our regularizations are directly amenable to almost any CNN: there is no change needed to the network architecture, nor to any other training protocol (unless otherwise specified).

3.1 Baseline: Soft Orthogonality Regularization

Previous works [14, 32, 33] proposed to require the Gram matrix of the weight matrix to be close to identity, which we term Soft Orthogonality (SO) regularization:

λ ||W^T W − I||_F^2,   (SO) (1)

where λ is the regularization coefficient (the same hereinafter). It is a straightforward relaxation of the “hard orthogonality” assumption [12, 13, 15, 38] under the standard Frobenius norm, and can be viewed as a different weight decay term, limiting the set of parameters to be close to a Stiefel manifold rather than inside a hypersphere. The gradient is given in explicit form, 4λW(W^T W − I), and can be directly appended to the original gradient w.r.t. the current weight W.
However, SO (1) is flawed for an obvious reason: the columns of W can be mutually orthogonal if and only if W is undercomplete (m ≥ n). For an overcomplete W (m < n), its Gram matrix W^T W ∈ R^{n×n} cannot even be close to identity, because its rank is at most m, making ||W^T W − I||_F^2 a biased minimization objective. In practice, both cases can be found for layer-wise weight dimensions.
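As a concrete illustration (our sketch, not the paper's released code), the reshaping step and the SO penalty (1) with its stated gradient 4λW(W^T W − I) can be written in a few lines of NumPy; the kernel shape and λ value below are arbitrary examples:

```python
import numpy as np

def reshape_conv_weight(kernel):
    # Flatten an S x H x C x M convolution kernel into the (S*H*C) x M
    # matrix W' whose columns correspond to output filters.
    S, H, C, M = kernel.shape
    return kernel.reshape(S * H * C, M)

def so_penalty(W, lam=0.1):
    # Soft Orthogonality penalty (1): lam * ||W^T W - I||_F^2.
    gram_residual = W.T @ W - np.eye(W.shape[1])
    return lam * np.sum(gram_residual ** 2)

def so_grad(W, lam=0.1):
    # Closed-form gradient 4 * lam * W (W^T W - I), to be added to the
    # task-loss gradient w.r.t. W.
    return 4.0 * lam * W @ (W.T @ W - np.eye(W.shape[1]))
```

A matrix with orthonormal columns drives the penalty to (numerically) zero, and the closed-form gradient agrees with a finite-difference check; for an overcomplete W (m < n) the penalty cannot reach zero, which is exactly the bias discussed above.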
The authors of [15, 14] advocated further dividing an overcomplete W into undercomplete column groups to resolve the rank deficiency trap. In this paper, we choose to simply use the original SO version (1) as a fair comparison baseline.
The authors of [14] argued against the hybrid utilization of the original ℓ2 weight decay and the SO regularization. They suggested sticking to one type of regularization throughout training. Our experiments also find that applying both together throughout training hurts the final accuracy. Instead of simply discarding ℓ2 weight decay, we discover a scheme change approach which is validated to be most beneficial to performance; details can be found in Section 4.1.

3.2 Double Soft Orthogonality Regularization

The double soft orthogonality regularization extends SO in the following form:

λ (||W^T W − I||_F^2 + ||W W^T − I||_F^2).   (DSO) (2)

Note that an orthogonal W will satisfy W^T W = W W^T = I; an overcomplete W can be regularized to have small ||W W^T − I||_F^2 but will likely have a large residual ||W^T W − I||_F^2, and vice versa for an undercomplete W. DSO is thus designed to cover both overcomplete and undercomplete W cases; in either case, at least one term in (2) can be well suppressed, requiring either the rows or the columns of W to stay orthogonal. It is a straightforward extension of SO.
Another similar alternative to DSO is a “selective” soft orthogonality regularization, defined as: λ ||W^T W − I||_F^2 if m > n; λ ||W W^T − I||_F^2 if m ≤ n. Our experiments find that DSO always outperforms the selective regularization, therefore we only report DSO results.

3.3 Mutual Coherence Regularization

The mutual coherence [18] of W is defined as:

μ_W = max_{i≠j} |⟨w_i, w_j⟩| / (||w_i|| · ||w_j||),   (3)

where w_i denotes the i-th column of W, i = 1, 2, ..., n. The mutual coherence (3) takes values in [0, 1], and measures the highest correlation between any two columns of W. In order for W to have orthogonal or near-orthogonal columns, μ_W should be as low as possible (zero if m ≥ n).
We wish to suppress μ_W as an alternative way to enforce orthogonality. Assume W has first been normalized to have unit-norm columns; ⟨w_i, w_j⟩ is then essentially the (i, j)-th element of the Gram matrix W^T W, and i ≠ j requires us to consider off-diagonal elements only. Therefore, we propose the following mutual coherence (MC) regularization term inspired by (3):

λ ||W^T W − I||_∞.   (MC) (4)

Although we do not explicitly normalize the column norms of W to be one, we find experimentally that minimizing (4) often tends to implicitly encourage a close-to-unit-column-norm W too, making the objective of (4) a viable approximation of the mutual coherence (3).²
The gradient of ||W^T W − I||_∞ could be explicitly derived by applying a smoothing technique to the nonsmooth ℓ∞ norm, e.g., [39]. However, it would invoke an iterative routine each time to compute an ℓ1-ball proximal projection, which is less efficient in our scenario where massive gradient computations are needed. In view of that, we turn to auto-differentiation to approximately compute the gradient of (4) w.r.t.
W.

3.4 Spectral Restricted Isometry Property Regularization

Recall that the RIP condition [17] of W assumes:

Assumption 1 For all vectors z ∈ R^n that are k-sparse, there exists a small δ_W ∈ (0, 1) s.t. (1 − δ_W) ≤ ||W z||^2 / ||z||^2 ≤ (1 + δ_W).

² We also tried to first normalize the columns of W and then apply (4), without finding any performance benefits.

The above RIP condition essentially requires that every set of columns in W, with cardinality no larger than k, behave like an orthogonal system. If we take the extreme case k = n, RIP turns into another criterion that enforces the entire W to be close to orthogonal. Note that both mutual incoherence and RIP are well defined for both undercomplete and overcomplete matrices. We rewrite the special RIP condition with k = n in the form below:

| ||W z||^2 / ||z||^2 − 1 | ≤ δ_W,  ∀z ∈ R^n.   (5)

Notice that σ(W) = sup_{z ∈ R^n, z ≠ 0} ||W z|| / ||z|| is the spectral norm of W, i.e., the largest singular value of W. As a result, σ(W^T W − I) = sup_{z ∈ R^n, z ≠ 0} | ||W z||^2 / ||z||^2 − 1 |. In order to enforce orthogonality on W from an RIP perspective, one may wish to minimize the RIP constant δ_W in the special case k = n, which according to the definition should be chosen as sup_{z ∈ R^n, z ≠ 0} | ||W z||^2 / ||z||^2 − 1 | as in (5). Therefore, we end up equivalently minimizing the spectral norm of W^T W − I:

λ · σ(W^T W − I).   (SRIP) (6)

We term this the Spectral Restricted Isometry Property (SRIP) regularization.
The above reveals an interesting hidden link: regularizations with spectral norms were previously investigated in [36, 37], through analyzing small perturbation robustness and Lipschitz constants.
The spectral norm re-arises from enforcing orthogonality when the RIP condition is adopted. Compared to the spectral norm (SN) regularization [36], which minimizes σ(W), SRIP is instead enforced on W^T W − I. Also, compared to [37], which requires the spectral norm of W to be exactly 1 (developed for GANs), SRIP requires all singular values of W to be close to 1, which is essentially stricter because the resulting W also needs to be well conditioned.
We again refer to auto-differentiation to compute the gradient of (6) for simplicity. However, even computing the objective value of (6) can invoke a computationally expensive eigenvalue decomposition (EVD). To avoid that, we approximate the spectral norm using the power iteration method. Starting with a randomly initialized v ∈ R^n, we iteratively perform the following procedure a small number of times (2 times by default):

u ← (W^T W − I) v,   v ← (W^T W − I) u,   σ(W^T W − I) ← ||v|| / ||u||.   (7)

With such a rough approximation, SRIP reduces the computational cost from O(n^3) to O(mn^2), and is practically much faster in implementation.

4 Experiments on Benchmarks

We base our experiments on several popular state-of-the-art models: ResNet [6, 40] (including several different variants), Wide ResNet [21] and ResNeXt [20]. For fairness, all pre-processing, data augmentation and training/validation/testing splits are strictly identical to the original training protocols in [21, 6, 40, 20]. All hyper-parameters and architectural details remain unchanged too, unless otherwise specified.
We structure the experiment section in the following way. In the first part of the experiments, we design a set of intensive experiments on CIFAR-10 and CIFAR-100, which consist of 60,000 images of size 32×32 with a 5:1 training-testing split, divided into 10 and 100 classes respectively.
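Before presenting results, we make the coherence quantities (3)–(4) and the power-iteration estimate (7) of σ(W^T W − I) concrete with a small NumPy sketch (our illustration, not the paper's released code; the re-normalization of v inside the loop is our own numerical safeguard and does not change the estimated ratio):

```python
import numpy as np

def mutual_coherence(W):
    # Mutual coherence (3): largest absolute correlation between any two
    # distinct columns of W.
    Wn = W / np.linalg.norm(W, axis=0)   # unit-norm columns
    G = np.abs(Wn.T @ Wn)
    np.fill_diagonal(G, 0.0)             # keep only i != j entries
    return G.max()

def mc_penalty(W, lam=0.1):
    # MC surrogate (4): lam * ||W^T W - I||_inf, read here as the largest
    # absolute entry of W^T W - I (no explicit column normalization).
    return lam * np.max(np.abs(W.T @ W - np.eye(W.shape[1])))

def srip_penalty(W, lam=0.1, iters=2, seed=0):
    # SRIP penalty (6) via the power-iteration scheme (7), avoiding a full
    # EVD of W^T W - I. More iterations tighten the estimate.
    M = W.T @ W - np.eye(W.shape[1])
    v = np.random.default_rng(seed).standard_normal(W.shape[1])
    for _ in range(iters):
        u = M @ v
        v = M @ u
        sigma = np.linalg.norm(v) / np.linalg.norm(u)   # eq. (7)
        v = v / np.linalg.norm(v)   # safeguard against overflow (ours)
    return lam * sigma
```

With the default two iterations the estimate is rough, as the text notes; increasing `iters` makes it converge to the exact spectral norm of W^T W − I.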
We will train each of the three models with each of the proposed regularizers, and compare their performance with the original versions, in terms of both final accuracy and convergence. In the second part, we further conduct experiments on the ImageNet and SVHN datasets. In both parts, we also compare our best performer, SRIP, with existing regularization methods of similar purpose.

Scheme Change for Regularization Coefficients All the regularizers have an associated regularization coefficient denoted by λ, whose choice plays an important role in the regularized training process. Correspondingly, we denote the regularization coefficient for the ℓ2 weight decay used by the original models as λ2. From experiments, we observe that fully replacing ℓ2 weight decay with orthogonal regularizers will accelerate and stabilize training at the beginning, but will negatively affect the final accuracies achievable. We conjecture that while the orthogonal parameter structure is most beneficial at the initial stage, it might be overly strict when training comes to the final “fine tune” stage, where we should allow more flexibility for the parameters. In view of that, we conducted extensive ablation experiments and identified a switching scheme between the two regularizations at the beginning and late stages of training. Concretely, we gradually reduce λ (initially 0.1–0.2) to 10^-3, 10^-4 and 10^-6, after 20, 50 and 70 epochs, respectively, and finally set it to zero after 120 epochs. For λ2, we start with 10^-8; then for the SO/DSO regularizers, we increase λ2 to 10^-4 / 5 × 10^-4 (respectively), after 20 epochs. For the MC/SRIP regularizers, we find them insensitive to the choice of λ2, potentially due to their stronger effects in enforcing W^T W close to I; we thus stick to the initial λ2 throughout training for them.
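For concreteness, the decay schedule for λ described above can be written as a small helper (our sketch; the text does not specify whether a boundary epoch uses the old or new value, so the thresholds below are one reasonable reading, with the initial value taken from the stated 0.1–0.2 range):

```python
def ortho_reg_coeff(epoch, lam0=0.1):
    # Scheme-change schedule for the orthogonality coefficient lambda:
    # start at lam0, decay to 1e-3 / 1e-4 / 1e-6 after epochs 20 / 50 / 70,
    # and switch the regularizer off entirely after epoch 120.
    if epoch > 120:
        return 0.0
    if epoch > 70:
        return 1e-6
    if epoch > 50:
        return 1e-4
    if epoch > 20:
        return 1e-3
    return lam0
```

The same pattern would apply to the λ2 increase for SO/DSO after epoch 20.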
Such an empirical “scheme change” design is found to work nicely with all models, benefiting both accuracy and efficiency. The above λ/λ2 choices apply to all our experiments.
As pointed out by one anonymous reviewer, applying orthogonal regularization changes the optimization landscape, and its power seems to be a complex and dynamic story throughout training. In general, we find it to show a strong positive impact at the early stage of training (not just initialization), which concurs with previous observations. But this impact is observed to become increasingly negligible, and sometimes (slightly) negative, as training approaches the end. That trend seems to be the same for all our regularizers.

Figure 1: Validation curves during training for ResNet-110. Top: CIFAR-10; Bottom: CIFAR-100.

4.1 Experiments on CIFAR-10 and CIFAR-100

We employ three model configurations on the CIFAR-10 and CIFAR-100 datasets:
ResNet 110 Model [6] The 110-layer ResNet Model [6] is a very strong and popular ResNet version. It uses Bottleneck Residual Units, with a formula setting given by p = 9n + 2, where n denotes the total number of convolutional blocks used and p the total depth. We use the Adam optimizer to train the model for 200 epochs, with the learning rate starting at 10^-2 and then subsequently decreasing to 10^-3, 10^-5 and 10^-6, after 80, 120 and 160 epochs, respectively.

Table 1: Top-1 error rate comparison by ResNet 110, Wide ResNet 28-10 and ResNext 29-8-64 on CIFAR-10 and CIFAR-100.
* indicates results by us running the provided original model.

Model                    Regularizer   CIFAR-10   CIFAR-100
ResNet-110 [6]           None          7.04*      25.42*
                         SO            6.78       25.01
                         DSO           7.04       25.83
                         MC            6.97       25.43
                         SRIP          6.55       25.14
Wide ResNet 28-10 [21]   None          4.16*      20.50*
                         SO            3.76       18.56
                         DSO           3.86       18.21
                         MC            3.68       18.90
                         SRIP          3.60       18.19
ResNext 29-8-64 [20]     None          3.70*      18.53*
                         SO            3.58       17.59
                         DSO           3.85       19.78
                         MC            3.65       17.62
                         SRIP          3.48       16.99

Wide ResNet 28-10 Model [21] For the Wide ResNet model [21], we use depth 28 and width k = 10, as this configuration gives the best accuracies on both CIFAR-10 and CIFAR-100, and is (relatively) computationally efficient. The model uses the Basic Block B(3,3), as defined in ResNet [6]. We use the SGD optimizer with a Nesterov momentum of 0.9 to train the model for 200 epochs. The learning rate starts at 0.1, and is then decreased by a factor of 5 after 60, 120 and 160 epochs, respectively. We follow all other settings of [21] identically.
ResNext 29-8-64 Model [20] For the ResNext model [20], we consider the 29-layer architecture with a cardinality of 8 and a widening factor of 4, which reported the best state-of-the-art CIFAR-10/CIFAR-100 results compared to other contemporary models with similar amounts of trainable parameters. We use the SGD optimizer with a Nesterov momentum of 0.9 to train the model for 300 epochs. The learning rate starts from 0.1, and decays by a factor of 10 after 150 and 225 epochs, respectively.
Results Table 1 compares the top-1 error rates of the three groups of experiments. To summarize, SRIP is clearly the winner in almost all cases (except being second best for ResNet-110 on CIFAR-100), with remarkable performance gains, such as an impressive 2.31% top-1 error reduction for Wide ResNet 28-10. SO acts as a surprisingly strong baseline and is often second only to SRIP. MC can usually outperform the original baseline but remains inferior to SRIP and SO.
DSO seems the most ineffective among all four, and can perform even worse than the original baseline. We also carefully inspect the training curves (in terms of validation accuracy w.r.t. epoch number) of the different methods on CIFAR-10 and CIFAR-100, with the ResNet-110 curves shown in Fig. 1 as an example. All starting from random scratch, we observe that all four regularizers significantly accelerate the training process in the initial training stage, and maintain higher accuracies throughout (most of) training, compared to the un-regularized original version. The regularizers also stabilize training, in terms of fewer fluctuations in the training curves. We defer a more detailed analysis to Section 4.3.
Besides, we validate the helpfulness of the scheme change. For example, we train Wide ResNet 28-10 with SRIP, but without the scheme change (all else remains the same). We witness a 0.33% top-1 error increase on CIFAR-10, and 0.90% on CIFAR-100, although still outperforming the original un-regularized models. The other regularizers perform even worse without the scheme change.
Comparison with Spectral Regularization We compare SRIP with the spectral regularization (SR) developed in [36], (λ_s / 2) σ(W)^2, with the authors' default λ_s = 0.1. All other settings in [36] are followed identically. We apply the SR regularization to training the Wide ResNet 28-10 Model and the ResNext 29-8-64 Model. For the former, we obtain a top-1 error rate of 3.93% on CIFAR-10, and 19.08% on CIFAR-100. For the latter, the top-1 error rate is 3.54% for CIFAR-10, and 17.27% for CIFAR-100. Both are inferior to the SRIP results under the same settings in Table 1.
Comparison with Optimization over Multiple Dependent Stiefel Manifolds (OMDSM) We also compare SRIP with OMDSM developed in [15], which makes a fair comparison between soft regularization forms and hard constraint forms of enforcing orthogonality.
This work trained Wide ResNet 28-10 on CIFAR-10 and CIFAR-100 and obtained error rates of 3.73% and 18.76% respectively, both inferior to SRIP (3.60% for CIFAR-10 and 18.19% for CIFAR-100).
Comparison with Jacobian Norm Regularization A recent work [41] propounds the idea of using the norm of the CNN Jacobian as a training regularizer. The paper used a variant of Wide ResNet [21] with 22 layers of width 5, whose original top-1 error rate was 6.66% on CIFAR-10, and reported a reduced error rate of 5.68% with their proposed regularizer. We trained this same model using SRIP over the same augmented full training set, achieving a 4.28% top-1 error, which shows a large gain over the Jacobian norm-based regularizer.

4.2 Experiments on ImageNet and SVHN

We extend the experiments to two larger and more complicated datasets: ImageNet and SVHN (Street View House Numbers). Since SRIP clearly performs the best among the four proposed regularizers in the above experiments, we will focus on comparing SRIP only.
Experiments on ImageNet We train ResNet 34, Pre-ResNet 34 and ResNet 50 [40] on the ImageNet dataset with and without the SRIP regularizer, respectively. The training hyperparameter settings are consistent with the original models. The initial learning rate is set to 0.1, and decreases at epochs 30, 60, 90 and 120 by a factor of 10. The top-5 error rates are then reported on the ILSVRC-2012 val set, with a single model and single-crop. [15] also reported their top-5 error rates with both ResNet 34 and Pre-ResNet 34 on ImageNet. As seen in Table 2, SRIP clearly performs the best for all three models.

Table 2: Top-5 error rate comparison on ImageNet.

Model                Regularizer   ImageNet
ResNet 34 [6]        None          9.84
                     OMDSM [15]    9.68
                     SRIP          8.32
Pre-ResNet 34 [40]   None          9.79
                     OMDSM [15]    9.45
                     SRIP          8.79
ResNet 50 [6]        None          7.02
                     SRIP          6.87

Experiments on SVHN On the SVHN dataset, we train the original Wide ResNet 16-8 model, following its original implementation in [21], with an initial learning rate of 0.01 that decays at epochs 60, 120 and 160, each time by a factor of 5. We then train the SRIP-regularized version with no change made other than adding the regularizer. While the original Wide ResNet 16-8 gives an error rate of 1.63%, SRIP reduces it to 1.56%.

4.3 Summary, Remarks and Insights

From our extensive experiments with state-of-the-art models on popular benchmarks, we can conclude the following points:
• In response to the question in our title: yes, we can gain a lot from simply adding orthogonality regularizations into training. The gains can be found in both the final achievable accuracy and empirical convergence.
For the former, the three models obtain (at most) 0.49%, 0.56%, and 0.22% top-1 accuracy gains on CIFAR-10, and 0.41%, 2.31%, and 1.54% on CIFAR-100, respectively. For the latter, positive impacts are widely observed in our training and validation curves (Figure 1 as a representative example), in particular faster and smoother curves at the initial stage. Note that those impressive improvements are obtained with no other changes made, and extend to datasets such as ImageNet and SVHN.
• With its nice theoretical grounds, SRIP is also the best practical option among all four proposed regularizations. It consistently performs the best in achieving the highest accuracy as well as in accelerating/stabilizing the training curves. It also outperforms other recent methods utilizing the spectral norm [36], hard orthogonality [15], and the Jacobian norm [41].
• Despite its simplicity (and potential estimation bias), SO is a surprisingly robust baseline and frequently ranks second among all four.
We conjecture that SO benefits from its smooth form and continuous gradient, which facilitate gradient-based optimization, while both SRIP and MC have to deal with non-smooth problems.
• DSO does not seem to be helpful. It often performs worse than SO, and sometimes even worse than the un-regularized original model. We interpret this by recalling how the matrix W is constructed (Section 3 beginning): enforcing W^T W to be close to I has "inter-channel" effects (i.e., requiring different output channels to have orthogonal filter groups), whereas enforcing W W^T to be close to I enforces "intra-channel" orthogonality (i.e., the same spatial locations across different filter groups have to be orthogonal). The former is the better accepted idea. Our results on DSO seem to provide further evidence (from the counter side) that orthogonality should be primarily considered "inter-channel", i.e., between columns of W.
• MC brings in certain improvements, but not as significant as those of SRIP. We notice that (4) approximates (3) well only when W has unit-norm columns. While we find that minimizing (4) generally has the empirical effect of approximately normalizing the columns of W, this is not exactly enforced all the time. As we observed in experiments, large deviations of the column-wise norms can occur at some point during training and potentially bring in negative impacts. We plan to look for a re-parameterization of W that ensures unit norms throughout training, e.g., by integrating MC with weight normalization [24], in future work.
• In contrast to many SVD-based hard orthogonality approaches, our proposed regularizers are light to use and incur negligible extra training complexity. Our experiments show that the per-iteration (batch) running time remains almost unchanged with or without our regularizers. Additionally, the improvements brought by the regularization prove to be stable and reproducible.
For example, we trained Wide ResNet 28-10 with SRIP from three different random initializations (all other protocols unchanged), and found the final accuracies to be very stable (deviation smaller than 0.03%), with a best accuracy of 3.60%.

5 Conclusion

We presented efficient mechanisms for regularizing different flavors of orthogonality on several state-of-the-art deep CNNs [21, 6, 20]. We showed that in all cases, we can achieve better accuracy, more stable training curves and smoother convergence. In almost all cases, the novel SRIP regularizer outperforms all others consistently and remarkably. These regularizations demonstrate outstanding generality and ease of use, suggesting that orthogonality regularizations should be considered as standard tools for training deeper CNNs. As future work, we are interested in extending the evaluation of SRIP to training RNNs and GANs. Summarizing our results, a befitting quote would be: enforce orthogonality in training your CNN, and by no means will you regret it!

Acknowledgments

The work by N. Bansal, X. Chen and Z. Wang is supported in part by NSF RI-1755701. We would also like to thank all anonymous reviewers for their tremendously useful comments that helped improve our work.

References
[1] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.

[2] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 249–256, 2010.

[3] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift.
In International Conference on Machine Learning, pages 448–456, 2015.

[4] Yann N Dauphin, Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, Surya Ganguli, and Yoshua Bengio. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In Advances in Neural Information Processing Systems, pages 2933–2941, 2014.

[5] Andrew M Saxe, James L McClelland, and Surya Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv preprint arXiv:1312.6120, 2013.

[6] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

[7] Jianping Zhou, Minh N Do, and Jelena Kovacevic. Special paraunitary matrices, Cayley transform, and multidimensional orthogonal filter banks. IEEE Transactions on Image Processing, 15(2):511–519, 2006.

[8] Pau Rodríguez, Jordi Gonzalez, Guillem Cucurull, Josep M Gonfaus, and Xavier Roca. Regularizing CNNs with locally constrained decorrelations. arXiv preprint arXiv:1611.01967, 2016.

[9] Guillaume Desjardins, Karen Simonyan, Razvan Pascanu, et al. Natural neural networks. In Advances in Neural Information Processing Systems, pages 2071–2079, 2015.

[10] Dmytro Mishkin and Jiri Matas. All you need is a good init. arXiv preprint arXiv:1511.06422, 2015.

[11] Kui Jia, Dacheng Tao, Shenghua Gao, and Xiangmin Xu. Improving training of deep neural networks via singular value bounding. CoRR, abs/1611.06013, 2016.

[12] Mehrtash Harandi and Basura Fernando. Generalized backpropagation, étude de cas: orthogonality. arXiv preprint arXiv:1611.05927, 2016.

[13] Mete Ozay and Takayuki Okatani. Optimization on submanifolds of convolution kernels in CNNs. arXiv preprint arXiv:1610.07008, 2016.

[14] Di Xie, Jiang Xiong, and Shiliang Pu.
All you need is beyond a good init: Exploring better solution for training extremely deep convolutional neural networks with orthonormality and modulation. arXiv preprint arXiv:1703.01827, 2017.

[15] Lei Huang, Xianglong Liu, Bo Lang, Adams Wei Yu, and Bo Li. Orthogonal weight normalization: Solution to optimization over multiple dependent Stiefel manifolds in deep neural networks. arXiv preprint arXiv:1709.06079, 2017.

[16] Yifan Sun, Liang Zheng, Weijian Deng, and Shengjin Wang. SVDNet for pedestrian retrieval. arXiv preprint, 2017.

[17] Emmanuel J Candes and Terence Tao. Decoding by linear programming. IEEE Transactions on Information Theory, 51(12):4203–4215, 2005.

[18] David L Donoho. Compressed sensing. IEEE Transactions on Information Theory, 52(4):1289–1306, 2006.

[19] Zhaowen Wang, Jianchao Yang, Haichao Zhang, Zhangyang Wang, Yingzhen Yang, Ding Liu, and Thomas S Huang. Sparse Coding and its Applications in Computer Vision. World Scientific, 2016.

[20] Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, pages 5987–5995. IEEE, 2017.

[21] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.

[22] Tong Zhang. Sparse recovery with orthogonal matching pursuit under RIP. IEEE Transactions on Information Theory, 57(9):6215–6221, 2011.

[23] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision, pages 1026–1034, 2015.

[24] Tim Salimans and Diederik P Kingma. Weight normalization: A simple reparameterization to accelerate training of deep neural networks.
In Advances in Neural Information Processing Systems, pages 901–909, 2016.

[25] Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. On the difficulty of training recurrent neural networks. In International Conference on Machine Learning, pages 1310–1318, 2013.

[26] Victor Dorobantu, Per Andre Stromhaug, and Jess Renteria. DizzyRNN: Reparameterizing recurrent neural networks for norm-preserving backpropagation. arXiv preprint arXiv:1612.04035, 2016.

[27] Martin Arjovsky, Amar Shah, and Yoshua Bengio. Unitary evolution recurrent neural networks. In International Conference on Machine Learning, pages 1120–1128, 2016.

[28] Zakaria Mhammedi, Andrew Hellicar, Ashfaqur Rahman, and James Bailey. Efficient orthogonal parametrisation of recurrent neural networks using Householder reflections. arXiv preprint arXiv:1612.00188, 2016.

[29] Eugene Vorontsov, Chiheb Trabelsi, Samuel Kadoury, and Chris Pal. On orthogonality and learning recurrent networks with long term dependencies. arXiv preprint arXiv:1702.00071, 2017.

[30] Scott Wisdom, Thomas Powers, John Hershey, Jonathan Le Roux, and Les Atlas. Full-capacity unitary recurrent neural networks. In Advances in Neural Information Processing Systems, pages 4880–4888, 2016.

[31] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

[32] Randall Balestriero and Richard Baraniuk. A spline theory of deep networks. In Proceedings of the 35th International Conference on Machine Learning (ICML), 2018.

[33] Randall Balestriero and Richard Baraniuk. Mad max: Affine spline insights into deep learning. arXiv preprint arXiv:1805.06576, 2018.

[34] Shengjie Wang, Abdel-rahman Mohamed, Rich Caruana, Jeff Bilmes, Matthai Plilipose, Matthew Richardson, Krzysztof Geras, Gregor Urban, and Ozlem Aslan.
Analysis of deep neural networks with extended data Jacobian matrix. In International Conference on Machine Learning, pages 718–726, 2016.

[35] Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. arXiv preprint arXiv:1609.04836, 2016.

[36] Yuichi Yoshida and Takeru Miyato. Spectral norm regularization for improving the generalizability of deep learning. arXiv preprint arXiv:1705.10941, 2017.

[37] Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial networks. arXiv preprint arXiv:1802.05957, 2018.

[38] Zhangyang Wang, Hongyu Xu, Haichuan Yang, Ding Liu, and Ji Liu. Learning simple thresholded features with sparse support recovery. arXiv preprint arXiv:1804.05515, 2018.

[39] Zhouchen Lin, Canyi Lu, and Huan Li. Optimized projections for compressed sensing via direct mutual coherence minimization. arXiv preprint arXiv:1508.03117, 2015.

[40] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In European Conference on Computer Vision, pages 630–645. Springer, 2016.

[41] Jure Sokolić, Raja Giryes, Guillermo Sapiro, and Miguel RD Rodrigues. Robust large margin deep neural networks. IEEE Transactions on Signal Processing, 65(16):4265–4280, 2017.