{"title": "Efficient Exact Gradient Update for training Deep Networks with Very Large Sparse Targets", "book": "Advances in Neural Information Processing Systems", "page_first": 1108, "page_last": 1116, "abstract": "An important class of problems involves training deep neural networks with sparse prediction targets of very high dimension D. These occur naturally in e.g. neural language models or the learning of word-embeddings, often posed as predicting the probability of next words among a vocabulary of size D (e.g. 200,000). Computing the equally large, but typically non-sparse D-dimensional output vector from a last hidden layer of reasonable dimension d (e.g. 500) incurs a prohibitive $O(Dd)$ computational cost for each example, as does updating the $D \\times d$ output weight matrix and computing the gradient needed for backpropagation to previous layers. While efficient handling of large sparse network inputs is trivial, this case of large sparse targets is not, and has thus so far been sidestepped with approximate alternatives such as hierarchical softmax or sampling-based approximations during training. In this work we develop an original algorithmic approach that, for a family of loss functions that includes squared error and spherical softmax, can compute the exact loss, gradient update for the output weights, and gradient for backpropagation, all in $O(d^2)$ per example instead of $O(Dd)$, remarkably without ever computing the D-dimensional output. The proposed algorithm yields a speedup of $\\frac{D}{4d}$, i.e. 
two orders of magnitude for typical sizes, for that critical part of the computations that often dominates the training time in this kind of network architecture.", "full_text": "Efficient Exact Gradient Update for training Deep Networks with Very Large Sparse Targets

Pascal Vincent*, Alexandre de Brébisson, Xavier Bouthillier
Département d'Informatique et de Recherche Opérationnelle
Université de Montréal, Montréal, Québec, CANADA
* and CIFAR

Abstract

An important class of problems involves training deep neural networks with sparse prediction targets of very high dimension D. These occur naturally in e.g. neural language models or the learning of word-embeddings, often posed as predicting the probability of next words among a vocabulary of size D (e.g. 200 000). Computing the equally large, but typically non-sparse D-dimensional output vector from a last hidden layer of reasonable dimension d (e.g. 500) incurs a prohibitive O(Dd) computational cost for each example, as does updating the D × d output weight matrix and computing the gradient needed for backpropagation to previous layers. While efficient handling of large sparse network inputs is trivial, the case of large sparse targets is not, and has thus so far been sidestepped with approximate alternatives such as hierarchical softmax or sampling-based approximations during training. In this work we develop an original algorithmic approach which, for a family of loss functions that includes squared error and spherical softmax, can compute the exact loss, gradient update for the output weights, and gradient for backpropagation, all in O(d²) per example instead of O(Dd), remarkably without ever computing the D-dimensional output. The proposed algorithm yields a speedup of D/4d, i.e.
two orders of magnitude for typical sizes, for that critical part of the computations that often dominates the training time in this kind of network architecture.

1 Introduction

Many modern applications of neural networks have to deal with data represented, or representable, as very large sparse vectors. Such representations arise in natural language related tasks, where the dimension D of that vector is typically (a multiple of) the size of the vocabulary, and also in the sparse user-item matrices of collaborative-filtering applications. It is trivial to handle very large sparse inputs to a neural network in a computationally efficient manner: the forward propagation and update to the input weight matrix after backpropagation are correspondingly sparse. By contrast, training with very large sparse prediction targets is problematic: even if the target is sparse, the computation of the equally large network output and the corresponding gradient update to the huge output weight matrix are not sparse and thus computationally prohibitive. This has been a practical problem ever since Bengio et al. [1] first proposed using a neural network for learning a language model, in which case the computed output vector represents the probability of the next word and is the size of the considered vocabulary, which is becoming increasingly large in modern applications [2]. Several approaches have been proposed to attempt to address this difficulty, essentially by sidestepping it. They fall in two categories:

• Sampling or selection based approximations consider and compute only a tiny fraction of the output's dimensions, sampled at random or heuristically chosen. The reconstruction sampling of Dauphin et al. [3], the efficient use of biased importance sampling in Jean et al. [4], and the use of Noise Contrastive Estimation [5] in Mnih and Kavukcuoglu [6] and Mikolov et al. [7] all fall under this category.
So does the more recent use of approximate Maximum Inner Product Search based on Locality Sensitive Hashing techniques [8, 9] to select a good candidate subset.

• Hierarchical softmax [10, 7] imposes a heuristically defined hierarchical tree structure for the computation of the normalized probability of the target class.

Compared to the initial problem of considering all D output dimensions, both kinds of approaches are crude approximations. In the present work, we will instead investigate a way to actually perform the exact gradient update that corresponds to considering all D outputs, but do so implicitly, in a computationally efficient manner, without actually computing the D outputs. This approach works for a relatively restricted class of loss functions, the simplest of which is linear output with squared error (a natural choice for sparse real-valued regression targets). The most common choice for multiclass classification, the softmax loss, is not part of that family, but we may use an alternative, the spherical softmax, which will also yield normalized class probabilities. For simplicity and clarity, our presentation will focus on squared error and on an online setting. We will briefly discuss its extension to minibatches and to the class of possible loss functions in sections 3.5 and 3.6.

2 The problem

2.1 Problem definition and setup

We are concerned with gradient-descent based training of a deep feed-forward neural network with target vectors of very high dimension D (e.g. D = 200 000) but that are sparse, i.e. a comparatively small number, at most K ≪ D, of the elements of the target vector are non-zero. Such a K-sparse vector will typically be stored and represented compactly as 2K numbers corresponding to pairs (index, value). A network to be trained with such targets will naturally have an equally large output layer of dimension D.
We can also optionally allow the input to the network to be a similarly high dimensional sparse vector of dimension Din. Between the large sparse target, output, and (optionally large sparse) input, we suppose the network's intermediate hidden layers to be of smaller, more typically manageable, dimension d ≪ D (e.g. d = 500)¹.

Mathematical notation: Vectors are denoted using lower-case letters, e.g. h, and are considered column-vectors; corresponding row vectors are denoted with a transpose, e.g. h^T. Matrices are denoted using upper-case letters, e.g. W, with W^T the transpose of W. The ith column of W is denoted W_i, and its ith row W_{:i} (both viewed as a column vector). U^{-T} = (U^{-1})^T denotes the transpose of the inverse of a square matrix. I_d is the d × d identity matrix.

Network architecture: We consider a standard feed forward neural network architecture as depicted in Figure 1. An input vector x ∈ R^{Din} is linearly transformed into a linear activation a^{(1)} = W^{(1)T} x + b^{(1)} through a Din × d input weight matrix W^{(1)} (and an optional bias vector b^{(1)} ∈ R^d). This is typically followed by a non-linear transformation s to yield the representation of the first hidden layer h^{(1)} = s(a^{(1)}). This first hidden layer representation is then similarly transformed through a number of subsequent non-linear layers (that can be of any usual kind amenable to backpropagation), e.g. h^{(k)} = s(a^{(k)}) with a^{(k)} = W^{(k)T} h^{(k-1)} + b^{(k)}, until we obtain the last hidden layer representation h = h^{(m)}. We then obtain the final D-dimensional network output as o = Wh, where W is a D × d output weight matrix, which will be our main focus in this work.
Finally, the network's D-dimensional output o is compared to the D-dimensional target vector y associated with input x using squared error, yielding loss L = ‖o − y‖².

¹ Our approach does not impose any restriction on the architecture nor size of the hidden layers, as long as they are amenable to usual gradient backpropagation.

Figure 1: The computational problem posed by very large sparse targets. Dealing with sparse input efficiently is trivial, with both the forward and backward propagation phases easily achieved in O(Kd). However this is not the case with large sparse targets. They incur a prohibitive computational cost of O(Dd) at the output layer, as forward propagation, gradient backpropagation and weight update each require accessing all D × d elements of the large output weight matrix.

Training procedure: This architecture is a typical (possibly deep) multi-layer feed forward neural network architecture with a linear output layer and squared error loss. Its parameters (weight matrices and bias vectors) will be trained by gradient descent, using gradient backpropagation [11, 12, 13] to efficiently compute the gradients. The procedure is shown in Figure 1. Given an example from the training set as an (input, target) pair (x, y), a pass of forward propagation proceeds as outlined above, computing the hidden representation of each hidden layer in turn based on the previous one, and finally the network's predicted output o and associated loss L. A pass of gradient backpropagation then works in the opposite direction, starting from ∇o = ∂L/∂o = 2(o − y) and propagating back the gradients ∇h^{(k)} = ∂L/∂h^{(k)} and ∇a^{(k)} = ∂L/∂a^{(k)} upstream through the network. The corresponding gradient contributions on parameters (weights and biases), collected along the way, are straightforward once we have the associated ∇a^{(k)}.
Specifically, they are ∇b^{(k)} = ∇a^{(k)} and ∇W^{(k)} = h^{(k-1)}(∇a^{(k)})^T. Similarly for the input layer, ∇W^{(1)} = x(∇a^{(1)})^T, and for the output layer, ∇W = (o − y)h^T. Parameters are then updated through a gradient descent step W^{(k)} ← W^{(k)} − η∇W^{(k)} and b^{(k)} ← b^{(k)} − η∇b^{(k)}, where η is a positive learning-rate. Similarly for the output layer, which will be our main focus here: W ← W − η∇W.

2.2 The easy part: input layer forward propagation and weight update

It is easy and straightforward to efficiently compute the forward propagation, and the backpropagation and weight update part for the input layer, when we have a very large Din-dimensional but K-sparse input vector x with an appropriate sparse representation. Specifically, we suppose that x is represented as a pair of vectors u, v of length (at most) K, where u contains integer indexes and v the associated real values of the elements of x, such that x_i = 0 if i ∉ u, and x_{u_k} = v_k.

• Forward propagation through the input layer: The sparse representation of x as the positions of K elements together with their value makes it cheap to compute W^{(1)T} x. Even though W^{(1)} may be a huge full Din × d matrix, only K of its rows (those corresponding to the non-zero entries of x) need to be visited and summed to compute W^{(1)T} x. Precisely, with our (u, v) sparse representation of x, this operation can be written as W^{(1)T} x = Σ_{k=1}^{K} v_k W^{(1)}_{:u_k}, where each W^{(1)}_{:u_k} is a d-dimensional vector, making this an O(Kd) operation rather than O(Dd).

• Gradient and update through the input layer: Let us for now suppose that we were able to get gradients (through backpropagation) up to the first hidden layer activations a^{(1)} ∈ R^d, in the form of the gradient vector ∇a^{(1)} = ∂L/∂a^{(1)}. The corresponding gradient-based update to the input layer weights W^{(1)} is simply W^{(1)} ← W^{(1)} − ηx(∇a^{(1)})^T.
This is a rank-one update to W^{(1)}. Here again, we see that only the K rows of W^{(1)} associated to the (at most) K non-zero entries of x need to be modified. Precisely, this operation can be written as W^{(1)}_{:u_k} ← W^{(1)}_{:u_k} − ηv_k∇a^{(1)} for all k ∈ {1, . . . , K}, making this again an O(Kd) operation rather than O(Dd).
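The two O(Kd) input-layer operations of section 2.2 can be illustrated with a toy-scale sketch. This assumes numpy; the sizes Din, d, K and the variable names (W1, u, v, grad_a1) are illustrative choices, not from the paper, and the dense computations serve only as correctness references.

```python
import numpy as np

rng = np.random.default_rng(0)
Din, d, K = 50, 8, 3           # toy-sized input dim, hidden dim, sparsity

W1 = rng.standard_normal((Din, d))
u = np.array([4, 17, 42])      # indexes of the K non-zero entries of x
v = rng.standard_normal(K)     # their associated values

# Dense reference: materialize the full sparse vector x.
x = np.zeros(Din); x[u] = v

# Forward: only the K rows W1[u_k] are visited, O(Kd) instead of O(Din*d).
a1_sparse = (v[:, None] * W1[u]).sum(axis=0)
assert np.allclose(a1_sparse, W1.T @ x)

# Update: the rank-one update x (grad_a1)^T touches only the same K rows.
eta = 0.01
grad_a1 = rng.standard_normal(d)
W1_ref = W1 - eta * np.outer(x, grad_a1)   # O(Din*d) dense reference
W1[u] -= eta * v[:, None] * grad_a1        # O(Kd) sparse update
assert np.allclose(W1, W1_ref)
```

The asserts check that the sparse-path results coincide exactly with the dense references.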
Companion poster (ICLR 2015 workshop track): Efficient Exact Gradient Update for Training Deep Networks with Very Large Sparse Targets. Pascal Vincent (* and CIFAR), Alexandre de Brébisson, Xavier Bouthillier.

With the proposed approach, training time is independent of the output-layer size (i.e. of the number of classes): compared to naive backprop, the proposed algorithm is expected to yield an actual speedup of at least D/4d, i.e. two orders of magnitude for typical sizes, for that critical part of the computation that often dominates the training time in this kind of network architecture.

The Problem
• Training deep neural networks with very large sparse targets is an important problem.
• It arises e.g. in Neural Language Models [1] with a large vocabulary size (e.g. a D = 500 000 one-hot target).
• Efficient handling of large sparse inputs is trivial.
• But backprop training with large sparse targets is prohibitively expensive.
• Focus on the output layer: it maps the last hidden representation h, of reasonable dimension d (e.g. 500), to a very large output o of dimension D (e.g. 500 000) through a D × d parameter matrix W: o = Wh. Throughout, we suppose K ≪ d ≪ D.

Experimental validation
Timing of the output layer computations, for a CPU implementation on a 2 GHz Intel Core i7, with minibatch size m = 10. Both the naive backprop version and the proposed factored-parameter version learn the same actual W.
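The "naive backprop version" used as the timing baseline performs the four output-layer computations of Figure 1, each O(D) or O(Dd). A minimal sketch, numpy assumed, with toy sizes in place of the poster's D = 500 000, d = 500 (variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
D, d, eta = 50, 8, 0.01
W = rng.standard_normal((D, d))           # full D x d output weight matrix
h = rng.standard_normal(d)                # last hidden representation
y = np.zeros(D)
y[[2, 9, 41]] = rng.standard_normal(3)    # K-sparse target (K = 3 here)

o = W @ h                                 # forward:   O(Dd)
L = np.sum((o - y) ** 2)                  # loss:      O(D)
grad_h = 2 * W.T @ (o - y)                # backprop:  O(Dd)
W = W - eta * 2 * np.outer(o - y, h)      # update:    O(Dd), all Dd elements touched
```

Every line here scales with D, which is exactly what the factored-parameter version avoids.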
Detailed algorithm, benefits and limitations

Accepted as a workshop contribution at ICLR 2015

3.5 Putting it all together: algorithm for computing the cost L, the gradient on h, and updating U and V

Efficient computation of the cost L, of the gradient with respect to h (to be later backpropagated further), as well as the updating of U and V and the bookkeeping for U^{-T} and Q. The following table describes the algorithmic steps that we put together from the equations derived above.

Step  Operation                                              Complexity     Multiply-adds
1:    ĥ = Qh                                                 O(d²)          d²
2:    ŷ = U^T(V^T y)                                         O(Kd + d²)     Kd + d²
3:    ẑ = ĥ − ŷ                                              O(d)           d
4:    ∇h = 2ẑ                                                O(d)           d
5:    L = h^T ĥ − 2h^T ŷ + y^T y                             O(2d + K)      2d + K + 1
6:    U_new = U − 2η(Uh)h^T                                  O(d²)          2d² + d
7:    U^{-T}_new = U^{-T} + (2η/(1 − 2η‖h‖²))(U^{-T}h)h^T    O(d²)          2d² + 2d + 3
8:    V_new = V + 2ηy(U^{-T}_new h)^T                        O(d² + Kd)     d² + K + Kd
9:    Q_new = Q − 2η(hẑ^T + ẑh^T) + (4η²L)hh^T               O(d²)          4 + 2d + 3d²

4 Discussion: expected benefits, extensions and limitations

Having K ≪ d ≪ D, we see that the proposed algorithm requires O(d²) operations whereas the standard approach requires O(Dd) operations. If we take K ≈ d, we may state more precisely that the proposed algorithm, for computing the loss and the gradient updates, will require roughly 12d² operations whereas the standard approach required roughly 3Dd operations. So overall the proposed algorithmic change corresponds to a computational speedup by a factor of D/4d. For D = 200 000 and d = 500 the expected speedup is thus 100. Note that the advantage is not only in computational complexity, but also in memory access. For each example, the standard approach needs to access and change all D × d elements of matrix W, whereas the proposed approach only accesses the much smaller number K × d of elements of V, as well as the three d × d matrices U, U^{-T}, and Q. So overall we have a much faster algorithm which, while doing so implicitly, will nevertheless perform the exact same gradient update as the standard approach. We want to emphasize that what we are doing is not at all the same as simply chaining 2 linear layers U and V and performing ordinary gradient descent updates on these: this would result in the same prohibitive computational complexity as the standard approach, and such ordinary separate gradient updates to U and V would not be equivalent to the ordinary gradient update to W = VU.

Our algorithm can be straightforwardly extended to the minibatch case, and is expected to yield the same speedup factor compared to the standard approach. But one needs to be careful in order to keep the computation of U^{-T} reasonably efficient. Indeed, depending on the size m of the minibatch, it may be more efficient to solve the corresponding linear equation for each minibatch from scratch rather than to update U^{-T} with the Woodbury identity (which generalizes the Sherman-Morrison formula to m > 1).

This approach, which we detailed for linear output and squared error, can easily be extended to slightly more exotic loss functions: basically any loss function that can be expressed using only the o_c associated to non-zero y_c and ‖o‖² = Σ_j o_j², the squared norm of the whole output vector, which we can compute cheaply. This family of loss functions does not include the standard softmax, but it does include the so-called spherical softmax: log(o_c² / Σ_j o_j²) (where c is the correct class label). It remains to be seen in practice how this approach performs computationally, and whether we lose something due to using this more limited family of loss functions.

[Poster figure: forward and backward propagation through the network, with input x and target y of large dimension D (e.g. 500 000) but K-sparse (e.g. K = 5), hidden layers of small dimension d (e.g. 500), and a non-sparse output o = Wh. All input-side and hidden-layer computations are O(Kd) or O(d²); the output-layer computations (forward o = Wh, gradient ∇o = 2(o − y), backprop ∇h = W^T∇o, update W ← W − η∇o h^T) are O(D) or O(Dd), altogether O(Dd): prohibitively expensive. We suppose K ≪ d ≪ D.]

Proposed approach

We can do much better than O(Dd). We can compute the loss L, the gradient w.r.t.
the last hidden layer ∇h, and the exact same gradient update to W, all in O(d²), without ever computing the full output o = Wh.

First trick: L and ∇h can be computed efficiently if we keep an up-to-date d × d matrix Q = W^T W.
Second trick: represent W implicitly as the factorization W = VU and update U and V instead.

5.1 Computing the squared error loss efficiently

Suppose we have, for a network input example x, computed the last hidden representation h ∈ R^d through forward propagation. The network's D-dimensional output o = Wh is then in principle compared to the high dimensional target y ∈ R^D. The corresponding squared error loss is L = ‖Wh − y‖². Computing it in the direct naive way would have a prohibitive computational complexity of O(Dd + D) = O(Dd), because computing the output Wh with a full D × d matrix W and a typically non-sparse h is O(Dd). Note however that we can rewrite the loss as:

L = ‖Wh − y‖²
  = (Wh − y)^T (Wh − y)
  = h^T W^T W h − y^T W h − h^T W^T y + y^T y
  = h^T Q h − 2h^T (W^T y) + y^T y
  = h^T (Qh) − 2h^T (U^T (V^T y)) + y^T y

Defining the intermediate vectors ĥ = Qh and ŷ = W^T y = U^T(V^T y), this can be written a little more compactly as L = h^T(ĥ − 2ŷ) + ‖y‖². Supposing we have maintained an up-to-date Q = W^T W, which is a compact d × d matrix (we will see how we update Q cheaply in section 3.4), computing ĥ = Qh has a complexity of O(d²). Thanks to the K-sparsity and sparse representation of y, computing V^T y is O(Kd) and results in a d-dimensional vector, so that computing ŷ = U^T(V^T y) is O(Kd + d²). The last term y^T y is O(K). So the overall computational complexity for computing L in this way is O(Kd + d²) = O((K + d)d), several orders of magnitude cheaper than the prohibitive O(Dd) of the direct approach.
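The rewriting above can be checked numerically. A toy-scale sketch, numpy assumed (sizes, seed and variable names are illustrative): the efficient path never forms the D-dimensional output, yet matches the naive loss.

```python
import numpy as np

rng = np.random.default_rng(1)
D, d, K = 50, 8, 3
V = rng.standard_normal((D, d))
U = rng.standard_normal((d, d))
W = V @ U                       # implicit output matrix, only for the reference below
Q = W.T @ W                     # kept up to date, d x d

h = rng.standard_normal(d)
idx = np.array([3, 10, 20])
val = rng.standard_normal(K)
y = np.zeros(D); y[idx] = val   # K-sparse target

# Naive O(Dd): materialize the D-dimensional output.
L_naive = np.sum((W @ h - y) ** 2)

# Efficient O(Kd + d^2): never touches any D-dimensional vector.
h_hat = Q @ h                           # O(d^2)
y_hat = U.T @ (V[idx].T @ val)          # O(Kd + d^2), uses only K rows of V
L_fast = h @ h_hat - 2 * h @ y_hat + val @ val
assert np.isclose(L_naive, L_fast)
```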
O(Kd + d² + K) = O(d²) altogether for computing the loss L.

5.2 Computing the gradient on h efficiently

To backpropagate the gradient through the network, we need to compute the gradient of the loss L with respect to the last hidden layer representation h. This is ∇h = ∂L/∂h = ∂‖Wh − y‖²/∂h = 2W^T(Wh − y). Again, if we were to compute it directly in this manner, the computational complexity would be a prohibitive O(Dd). But we can instead rewrite it as:

∇h = 2W^T(Wh − y)
   = 2(W^T W h − W^T y)
   = 2(Qh − U^T(V^T y))
   = 2(ĥ − ŷ)

Again, supposing we have maintained an up-to-date Q (we will see how we update Q cheaply in section 3.4), computing ∇h this way is O(Kd + d²) = O((K + d)d), much cheaper than the O(Dd) of the direct approach.

5.3 Efficient gradient update of W

The gradient of the squared error loss with respect to the output layer weight matrix W is ∂L/∂W = ∂‖Wh − y‖²/∂W = 2(Wh − y)h^T, and the corresponding gradient descent update to W would be W_new ← W − 2η(Wh − y)h^T, where η is a positive learning rate. Again, computed in this manner, this induces a prohibitive O(Dd) computational complexity, both to compute the output and residue Wh − y, and then to update all the Dd elements of W (since generally neither Wh − y nor h will be sparse). The naive gradient update is thus a rank-one update to W that modifies all Dd of its elements. To overcome this difficulty, let us first rewrite the update as

W_new = W − 2η(Wh − y)h^T = W − 2ηWhh^T + 2ηyh^T

and note that it can be decomposed into two consecutive update steps, developed next.
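The same numerical check applies to the gradient on h of section 5.2, again as a toy-scale numpy sketch with illustrative sizes and names:

```python
import numpy as np

rng = np.random.default_rng(2)
D, d, K = 50, 8, 3
V = rng.standard_normal((D, d))
U = rng.standard_normal((d, d))
W = V @ U
Q = W.T @ W

h = rng.standard_normal(d)
idx = np.array([1, 7, 30])
val = rng.standard_normal(K)
y = np.zeros(D); y[idx] = val

grad_naive = 2 * W.T @ (W @ h - y)                 # O(Dd) direct form
grad_fast = 2 * (Q @ h - U.T @ (V[idx].T @ val))   # O(Kd + d^2) rewritten form
assert np.allclose(grad_naive, grad_fast)
```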
Here W (of size D × d) is represented in factored form as W = VU, with V of size D × d and U of size d × d.
a) W ← W − 2ηWhh^T
b) W ← W + 2ηyh^T

We will now see how we can perform each of these updates implicitly by updating only U and V respectively, as well as how we maintain correspondingly up-to-date versions of Q = W^T W (needed to efficiently compute the cost L and the gradient on h, as shown above) and of U^{-T} = (U^{-1})^T (which will be needed for update b)).

Solution:

a) U_new = U − 2η(Uh)h^T    (4)
b) V_new = V + 2ηy(U^{-T}_new h)^T    (5)

This results in implicitly updating W as we did explicitly in the naive approach of Eq. 3. Proof:

V_new U_new = (V + 2ηy(U^{-T}_new h)^T) U_new
            = V U_new + 2ηy(U^{-T}_new h)^T U_new
            = V U_new + 2ηyh^T U^{-1}_new U_new
            = V (U − 2η(Uh)h^T) + 2ηyh^T
            = VU − 2ηVUhh^T + 2ηyh^T
            = VU − 2η(VUh − y)h^T
            = W − 2η(Wh − y)h^T
            = W_new

We see that the update of U in Eq. 4 is a simple O(d²) rank-one operation. But we will need an up-to-date U^{-T} in the second update b). Provided we already have U^{-T}, this can be achieved cheaply by using the Sherman-Morrison formula for the rank-one update to the inverse of U:

(U + uv^T)^{-1} = U^{-1} − (1 / (1 + v^T U^{-1} u)) U^{-1} u v^T U^{-1}

Following the simple rank-one update to U of Eq. 4, the corresponding rank-one update to U^{-T} is also O(d²):

U^{-T}_new = U^{-T} + (2η / (1 − 2η‖h‖²)) (U^{-T} h) h^T    (6)

It is then easy to compute U^{-T}_new h, an O(d²) operation needed in Eq. 5. The ensuing rank-one update of V, thanks to the K-sparsity of y, is only O(Kd): it is a sparse update that touches only K rows of V instead of all D rows of W. Altogether these operations have computational complexity O(Kd + d²) = O((K + d)d), much cheaper than the prohibitive O(Dd) of the direct approach.

3.4 Bookkeeping: keeping an up-to-date Q and U^{-T}

We have already seen, in Eq. 6, how we can cheaply maintain an up-to-date U^{-T} following our update of U. Similarly, following our updates to U and V, we need to keep an up-to-date Q = W^T W, which is needed to efficiently compute the loss L (Eq. 1) and the gradient ∇h (Eq. 2). The updates to U and V in Equations 4 and 5 are equivalent to implicitly updating W as in Eq. 3, and this translates into the following update to Q = W^T W:

ẑ = Qh − U^T(V^T y)
Q_new = Q − 2η(hẑ^T + ẑh^T) + (4η²L)hh^T    (7)

The proof is straightforward but not provided here due to space constraints.

Bookkeeping operations as we update U and V:
• Using the factored representation W = VU does not change the complexity of the computation of L and ∇h.
• We need to maintain an up-to-date U^{-1} following the rank-1 update to U: achieved in O(d²) through the Sherman-Morrison formula.
• We need to maintain an up-to-date Q = W^T W following the updates to U and V: achieved in O(d²) by Eq. 7.

Note: this is NOT the same as an ordinary backprop update on two consecutive linear layers U and V, which would still be O(Dd). Altogether, the implicit update is O(d²).

Current workarounds are approximations:
• Sampling based approximations compute only a tiny
fraction of the output's dimensions, sampled at random. Reconstruction sampling [2] and the use of Noise Contrastive Estimation [3] in [4, 5] fall under this category.
• Hierarchical softmax [6, 4] imposes a heuristically defined hierarchical tree structure for the computation of the normalized probability of the target class.

References
[1] Bengio, Y., Ducharme, R., and Vincent, P. (2001). A neural probabilistic language model. NIPS 2000.
[2] Dauphin, Y., Glorot, X., and Bengio, Y. (2011). Large-scale learning of embeddings with reconstruction sampling. ICML 2011.
[3] Gutmann, M. and Hyvärinen, A. (2010). Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. AISTATS 2010.
[4] Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013). Efficient estimation of word representations in vector space. ICLR 2013 workshop track.
[5] Mnih, A. and Kavukcuoglu, K. (2013). Learning word embeddings efficiently with noise-contrastive estimation. NIPS 2013.
[6] Morin, F. and Bengio, Y. (2005). Hierarchical probabilistic neural network language model. AISTATS 2005.

Anticipated benefits (full algorithm, online version):
• Computation: O(12d²) vs. O(3Dd), a speedup of D/4d, i.e. between 50 and 300 for typical sizes.
• Memory access: for each example, access only Kd elements of V and d² elements of each of U, U^{-1} and Q, vs. Dd elements of W.

Limitations:
• The approach is limited to loss functions expressible using only ‖o‖² and the o_c associated to non-zero y_c: linear output + squared error works, as does linear output + spherical softmax log(o_c²/Σ_j o_j²); the regular log softmax does not.
• Step 6 can lead over time to ill-conditioning of U, so a numerical stabilization strategy must be applied periodically.

Extension to a minibatch of size m:
• Straightforward except for step 7: the update of U^{-T} is no longer a simple Sherman-Morrison update.
• Several possibilities: the Woodbury identity (must invert an m × m matrix), iterated Sherman-Morrison, or solving U^T x = h each time.
Best choice will depends on m.!\" complexity remains O(d2) per example.AcceptedasaworkshopcontributionatICLR20153.5PUTTINGITALLTOGETHER:ALGORITHMFORCOMPUTINGTHECOSTL,GRADIENTONh,ANDUPDATINGUANDVEf\ufb01cientcomputationofcostL,gradientwithrespecttoh(tobelaterbackpropagatedfurther)aswellasupdatingUandVandperformingthebookkeepingforU\u2212TandQ.Thefollowingtabledescribesthealgorithmicstepsthatweputtogetherfromtheequationsderivedabove.Step#OperationComputationalcomplexityNumberofmultiply-adds1:\u02c6h=QhO(d2)d22:\u02c6y=UT(VTy)O(Kd+d2)Kd+d23:\u02c6z=\u02c6h\u2212\u02c6yO(d)d4:\u2207h=2\u02c6zO(d)d5:L=hT\u02c6h\u22122hT\u02c6y+yTyO(2d+K)2d+K+16:Unew=U\u22122\u03b7(Uh)hTO(d2)2d2+d7:U\u2212Tnew=U\u2212T+2\u03b71\u22122\u03b7\uffffh\uffff2(U\u2212Th)hTO(d2)2d2+2d+38:Vnew=V+2\u03b7y(U\u2212Tnewh)TO(d2+Kd)d2+K+Kd9:Qnew=Q\u22122\u03b7\uffffh\u02c6zT+\u02c6zhT\uffff+(4\u03b72L)hhTO(d2)4+2d+3d24DISCUSSION:EXPECTEDBENEFITS,EXTENSIONSANDLIMITATIONSHavingK\uffffd\uffffDweseethattheproposedalgorithmrequiresO(d2)operationswhereasthestandardapproachrequiredO(Dd)operations.IfwetakeK\u2248d,wemaystatemorepreciselythattheproposedalgorithm,forcomputingthelossandthegradientupdateswillrequiresroughly12d2operationswhereasthestandardapproachrequiredroughly3Ddoperations.SooveralltheproposedalgorithmchangecorrespondstoacomputationalspeedupbyafactorofD4d.ForD=200000andd=500theexpectedspeedupisthus100.Notethattheadvantageisnotonlyincomputationalcomplexity,butalsoinmemoryaccess.Foreachexample,thestandardapproachneedstoaccessandchangeallD\u00d7delementsofmatrixW,whereastheproposedapproachonlyaccessesthemuchsmallernumberK\u00d7delementofVaswellasthethreed\u00d7dmatricesU,U\u2212T,andQ.Sooverallwehaveamuchfasteralgorithm,whichwhiledoingsoimplicitly,willhoweverperformtheexactsamegradientupdateasthestandardapproach.Wewanttoemphasizeherethatwhatwearedoingisnotatallthesameassimplychaining2linearlayersUandVandperformingordinarygradientdescentupdatesonthese:thiswouldresultinthesameprohibitivecomputationalc
omplexityasthestandardapproach,andsuchordinaryseparategradientupdatestoUandVwouldnotbeequivalenttotheordinarygradientupdatetoW=VU.Ouralgorithmcanbestraightforwardlyextendedtotheminibatchcase,andisexpectedtoyieldthesamespeedupfactorcomparedtothestandardapproach.ButoneneedstobecarefulinordertokeepthecomputationofU\u2212Threasonablyef\ufb01cient.Indeed,dependingonthesizeoftheminibatchm,itmaybemoreef\ufb01cienttoresolvethecorrepsondinglinearequationforeachminibatchfromscratchratherthanupdatingU\u2212TwiththeWoodburyequation(whichgeneralizestheSheman-Morrisonformulaform>1).Thisapproachthatwedetailedforlinearoutputandsquarederrorcaneasilybeextendedtoslightlymoreexoticlossfunctions:basicallyanylossfunctionthatcanbeexpressedusingonlytheocassociatedtonon-zeroycand\uffffo\uffff2=\uffffjo2jthesquarednormofthewholeoutputvector,whichwecancomputecheaply.Thisfamilyoflossfunctionsdoesnotincludethestandardsoftmax,butincludestheso-calledsphericalsoftmax:logo2c\uffffjo2j(wherecisthecorrectclasslabel).Itremainstobeseeninpracticehowthisapproachperformscomputationally,andwhetherwelosesomethingduetousingthismorelimitedfamilyoflossfunctions.7Prohibitive!Time taken by naive backprop (dotted lines) and the proposed factorised parameter version (full lines).Speedup of factorised parameter version v.s. naive backprop (theoretical and experimentally measured).Conclusion and future work\u2023We developed an original algorithm that yields a huge speedup for performing a full exact gradient update in networks with very large sparse targets: remarkably time is independent of output size (number of classes).\u2023Gain is from a fundamental algorithmic computational complexity improvement, not from low-level hardware-speci\ufb01c tricks or tuning. \u2023Future: GPU implementation; spherical softmax cost; compare quality of word embeddings learned with these costs to standard softmax. 
2.3 The hard part: output layer propagation and weight update

Given some network input x, we suppose we can compute without difficulty, through forward propagation, the associated last hidden layer representation h ∈ R^d. From then on:

• Computing the final output o = Wh incurs a prohibitive computational cost of O(Dd), since W is a full D × d matrix. Note that there is a priori no reason for the representation h to be sparse (e.g. with a sigmoid non-linearity), but even if it were, this would not fundamentally change the problem, since it is D that is extremely large, and we supposed d reasonably sized already. Computing the residual (o − y) and the associated squared error loss ‖o − y‖² incurs an additional O(D) cost.

• The gradient on h that we need to backpropagate to lower layers is ∇h = ∂L/∂h = 2W^T(o − y), which is another O(Dd) matrix-vector product.

• Finally, performing the corresponding output weight update W ← W − η(o − y)h^T is a rank-one update that touches all D × d elements of W, which again incurs a prohibitive O(Dd) computational cost.

For very large D, all three of these O(Dd) operations are prohibitive, and the fact that y is sparse, seen from this perspective, doesn't help, since neither o nor o − y will be sparse.

3 A computationally efficient algorithm for performing the exact online gradient update

Previously proposed workarounds are approximate or use stochastic sampling. We propose a different approach that results in the exact same, yet efficient, gradient update, remarkably without ever having to compute the large output o.

3.1 Computing the squared error loss L and the gradient with respect to h efficiently

Suppose that, for a network input example x, we have computed the last hidden representation h ∈ R^d through forward propagation.
The network's D-dimensional output o = Wh is then in principle compared to the high-dimensional target y ∈ R^D. The corresponding squared error loss is L = ‖Wh − y‖². As we saw in Section 2.3, computing it in the direct naive way would have a prohibitive computational complexity of O(Dd + D) = O(Dd), because computing the output Wh with a full D × d matrix W and a typically non-sparse h is O(Dd). Similarly, to backpropagate the gradient through the network, we need to compute the gradient of the loss L with respect to the last hidden layer representation h. This is ∇h = ∂L/∂h = ∂‖Wh − y‖²/∂h = 2W^T(Wh − y). So again, if we were to compute it directly in this manner, the computational complexity would be a prohibitive O(Dd). Provided we have maintained an up-to-date matrix Q = W^T W, which is of reasonable size d × d and can be cheaply maintained as we will see in Section 3.3, we can rewrite these two operations so as to perform them in O(d²):

Loss computation:

    L = ‖Wh − y‖²                                   [naively O(Dd)]
      = (Wh − y)^T (Wh − y)
      = h^T W^T W h − y^T W h − h^T W^T y + y^T y
      = h^T Qh − 2 h^T (W^T y) + y^T y
      = h^T (Qh) − 2 h^T (W^T y) + y^T y            (1)

    where Qh is O(d²), W^T y is O(Kd), and y^T y is O(K).

Gradient on h:

    ∇h = ∂L/∂h = ∂‖Wh − y‖²/∂h
       = 2 W^T (Wh − y)
       = 2 (W^T W h − W^T y)
       = 2 (Qh − W^T y)                             (2)

    where Qh is O(d²) and W^T y is O(Kd).

The terms in O(Kd) and O(K) are due to leveraging the K-sparse representation of the target vector y.
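As a concrete illustration (a minimal NumPy sketch with variable names of our own choosing, not the authors' code), Equations 1 and 2 can be computed from Q and the K non-zero entries of y, and checked against the naive O(Dd) computation:

```python
import numpy as np

rng = np.random.default_rng(0)
D, d, K = 10000, 50, 5            # huge output dim, modest hidden dim, K-sparse target

W = 0.01 * rng.standard_normal((D, d))
h = rng.standard_normal(d)
idx = rng.choice(D, size=K, replace=False)   # positions of the K non-zeros of y
vals = rng.standard_normal(K)                # their values

Q = W.T @ W                                  # d x d; maintained incrementally in practice

# W^T y touches only the K rows of W selected by the sparse target: O(Kd)
Wty = W[idx].T @ vals
Qh = Q @ h                                   # O(d^2)

L_fast = h @ Qh - 2.0 * h @ Wty + vals @ vals   # Eq. 1: O(d^2 + Kd)
grad_h_fast = 2.0 * (Qh - Wty)                  # Eq. 2: O(d^2 + Kd)

# Naive O(Dd) reference, for verification only
y = np.zeros(D)
y[idx] = vals
o = W @ h
assert np.allclose(L_fast, np.sum((o - y) ** 2))
assert np.allclose(grad_h_fast, 2.0 * W.T @ (o - y))
```

The fast path never forms the D-dimensional output o; only the final assertions do, to confirm the two computations agree.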
With K ≪ D and d ≪ D, we get altogether a computational cost of O(d²), which can be several orders of magnitude cheaper than the prohibitive O(Dd) of the direct approach.

3.2 Efficient gradient update of W

The gradient of the squared error loss with respect to the output layer weight matrix W is ∂L/∂W = ∂‖Wh − y‖²/∂W = 2(Wh − y)h^T, and the corresponding gradient descent update to W would be W_new ← W − 2η(Wh − y)h^T, where η is a positive learning rate. Again, computed in this manner, this induces a prohibitive O(Dd) computational complexity, both to compute the output and residual Wh − y, and then to update all the Dd elements of W (since generally neither Wh − y nor h will be sparse). All D × d elements of W must be accessed during this update. On the surface this seems hopeless. But we will now see how we can achieve the exact same update on W in O(d²). The trick is to represent W implicitly as the factorization W = VU, with W of size D × d, V of size D × d and U of size d × d, and to update U and V instead:

    a) U_new = U − 2η (Uh) h^T                    (3)
    b) V_new = V + 2η y (U_new^{-T} h)^T          (4)

This results in implicitly updating W as we did explicitly in the naive approach, as we now prove:

    V_new U_new = (V + 2η y (U_new^{-T} h)^T) U_new
                = V U_new + 2η y (U_new^{-T} h)^T U_new
                = V U_new + 2η y h^T U_new^{-1} U_new
                = V (U − 2η (Uh) h^T) + 2η y h^T (U_new^{-1} U_new)
                = VU − 2η VUh h^T + 2η y h^T
                = VU − 2η (VUh − y) h^T
                = W − 2η (Wh − y) h^T
                = W_new

We see that the update of U in Eq. 3 is a simple O(d²) operation.
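The implicit update of Eqs. 3–4 can be sketched as follows (a NumPy illustration under our own naming; here we obtain U_new^{-T} h by direct inversion for clarity, whereas the algorithm maintains U^{-T} incrementally in O(d²)):

```python
import numpy as np

rng = np.random.default_rng(1)
D, d, K, eta = 2000, 30, 4, 0.01

U = np.eye(d) + 0.01 * rng.standard_normal((d, d))   # d x d, kept invertible
V = 0.01 * rng.standard_normal((D, d))               # D x d
h = rng.standard_normal(d)
idx = rng.choice(D, size=K, replace=False)           # non-zero positions of y
vals = rng.standard_normal(K)                        # their values

W_before = V @ U                                     # implicit W (never formed in practice)

# Eq. 3: rank-one update of U, O(d^2)
U_new = U - 2.0 * eta * np.outer(U @ h, h)
# Eq. 4: touches only the K rows of V selected by y, O(Kd)
# (direct inverse used here only for the demonstration)
u = np.linalg.inv(U_new).T @ h
V_new = V.copy()
V_new[idx] += 2.0 * eta * np.outer(vals, u)

# Verify against the explicit O(Dd) gradient step on W
y = np.zeros(D)
y[idx] = vals
W_explicit = W_before - 2.0 * eta * np.outer(W_before @ h - y, h)
assert np.allclose(V_new @ U_new, W_explicit)
```

Note that only K of the D rows of V are written, yet the product V_new U_new equals the fully updated W.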
Following this simple rank-one update to U, we can use the Sherman-Morrison formula to derive the corresponding rank-one update to U^{-T}, which will also be O(d²):

    U_new^{-T} = U^{-T} + (2η / (1 − 2η‖h‖²)) (U^{-T} h) h^T      (5)

It is then easy to compute U_new^{-T} h, an O(d²) operation needed in Eq. 4. The ensuing rank-one update of V in Eq. 4 is, thanks to the K-sparsity of y, only O(Kd): only the K rows of V associated with the non-zero elements of y are accessed and updated, instead of all D rows of W we had to modify in the naive update! Note that with the factored representation of W as VU, we only have W implicitly, so the W^T y terms that entered into the computation of L and ∇h in the previous paragraph need to be adapted slightly as ŷ = W^T y = U^T(V^T y), which becomes O(d² + Kd) rather than O(Kd) in computational complexity. But this doesn't change the overall O(d²) complexity of these computations.

3.3 Bookkeeping: keeping an up-to-date Q and U^{-T}

We have already seen, in Eq. 5, how we can cheaply maintain an up-to-date U^{-T} following our update of U. Similarly, following our updates to U and V, we need to keep an up-to-date Q = W^T W, which is needed to efficiently compute the loss L (Eq. 1) and the gradient ∇h (Eq. 2). We have shown that the updates to U and V in Equations 3 and 4 are equivalent to implicitly updating W as W_new ← W − 2η(Wh − y)h^T, and this translates into the following update to Q = W^T W:

    ẑ = Qh − U^T(V^T y)
    Q_new = Q − 2η (h ẑ^T + ẑ h^T) + (4η²L) h h^T                 (6)

The proof is straightforward, but due to space constraints we put it in the supplementary material.
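Both bookkeeping updates (Eq. 5 for U^{-T} and Eq. 6 for Q) can be sketched and verified against direct recomputation; this is our own NumPy illustration, not the paper's code:

```python
import numpy as np

rng = np.random.default_rng(2)
D, d, K, eta = 2000, 30, 4, 0.01

U = np.eye(d) + 0.01 * rng.standard_normal((d, d))
V = 0.01 * rng.standard_normal((D, d))
h = rng.standard_normal(d)
idx = rng.choice(D, size=K, replace=False)
y = np.zeros(D)
y[idx] = rng.standard_normal(K)

W = V @ U
Q = W.T @ W
UinvT = np.linalg.inv(U).T

# Eq. 5: Sherman-Morrison rank-one update of U^{-T}, O(d^2)
coef = 2.0 * eta / (1.0 - 2.0 * eta * (h @ h))
UinvT_new = UinvT + coef * np.outer(UinvT @ h, h)

# Eq. 6: O(d^2) update of Q = W^T W using only small matrices and sparse y
y_hat = U.T @ (V[idx].T @ y[idx])                    # = W^T y, O(Kd + d^2)
z_hat = Q @ h - y_hat                                # = W^T (Wh - y)
L = h @ (Q @ h) - 2.0 * h @ y_hat + y[idx] @ y[idx]  # Eq. 1
Q_new = (Q - 2.0 * eta * (np.outer(h, z_hat) + np.outer(z_hat, h))
         + (4.0 * eta ** 2 * L) * np.outer(h, h))

# Verify against direct recomputation after the explicit update of W
U_new = U - 2.0 * eta * np.outer(U @ h, h)
W_new = W - 2.0 * eta * np.outer(W @ h - y, h)
assert np.allclose(UinvT_new, np.linalg.inv(U_new).T)
assert np.allclose(Q_new, W_new.T @ W_new)
```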
One can see that this last bookkeeping operation also has an O(d²) computational complexity.

3.4 Putting it all together: detailed algorithm and expected benefits

We have seen that we can efficiently compute the cost L and the gradient with respect to h (to be later backpropagated further), as well as update U and V and perform the bookkeeping for U^{-T} and Q. Algorithm 1 describes the detailed algorithmic steps that we put together from the equations derived above. Having K ≪ d ≪ D, we see that the proposed algorithm requires O(d²) operations, whereas the standard approach required O(Dd) operations. If we take K ≈ d, we may state more precisely that the proposed algorithm requires roughly 12d² operations for computing the loss and the gradient updates, whereas the standard approach required roughly 3Dd operations. So overall the proposed algorithmic change corresponds to a computational speedup by a factor of D/4d. For D = 200 000 and d = 500 the expected speedup is thus 100. Note that the advantage is not only in computational complexity, but also in memory access. For each example, the standard approach needs to access and change all D × d elements of matrix W, whereas the proposed approach only accesses the much smaller number of K × d elements of V, as well as the three d × d matrices U, U^{-T} and Q. So overall we have a substantially faster algorithm which, while doing so implicitly, will nevertheless perform the exact same gradient update as the standard approach.
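Assembling steps 1–9 of Algorithm 1 into one online update gives the following minimal NumPy sketch (our own function and variable names, verified against the naive dense update; the V.copy() is for the demonstration only, as in practice the K rows are updated in place):

```python
import numpy as np

def lst_online_step(h, idx, vals, U, UinvT, V, Q, eta):
    """One O(d^2)-per-example step of Algorithm 1 (sketch). The output weights
    W = V U are kept implicit; the K-sparse target y has values vals at idx."""
    h_hat = Q @ h                                    # step 1: O(d^2)
    y_hat = U.T @ (V[idx].T @ vals)                  # step 2: U^T (V^T y), O(Kd + d^2)
    z_hat = h_hat - y_hat                            # step 3
    grad_h = 2.0 * z_hat                             # step 4: gradient to backpropagate
    L = h @ h_hat - 2.0 * h @ y_hat + vals @ vals    # step 5: squared error loss
    U_new = U - 2.0 * eta * np.outer(U @ h, h)       # step 6: rank-one update
    coef = 2.0 * eta / (1.0 - 2.0 * eta * (h @ h))   # step 7: Sherman-Morrison
    UinvT_new = UinvT + coef * np.outer(UinvT @ h, h)
    V_new = V.copy()                                 # copy only for this demo
    V_new[idx] += 2.0 * eta * np.outer(vals, UinvT_new @ h)   # step 8: O(Kd)
    Q_new = (Q - 2.0 * eta * (np.outer(h, z_hat) + np.outer(z_hat, h))
             + (4.0 * eta ** 2 * L) * np.outer(h, h))         # step 9
    return L, grad_h, U_new, UinvT_new, V_new, Q_new

# Check one step against the naive dense update of W
rng = np.random.default_rng(0)
D, d, K, eta = 1500, 25, 3, 0.01
U, UinvT = np.eye(d), np.eye(d)
V = 0.01 * rng.standard_normal((D, d))
Q = (V @ U).T @ (V @ U)
h = rng.standard_normal(d)
idx = rng.choice(D, size=K, replace=False)
vals = rng.standard_normal(K)

W0, y = V @ U, np.zeros(D)
y[idx] = vals
L, grad_h, U2, UinvT2, V2, Q2 = lst_online_step(h, idx, vals, U, UinvT, V, Q, eta)
assert np.allclose(L, np.sum((W0 @ h - y) ** 2))
assert np.allclose(V2 @ U2, W0 - 2.0 * eta * np.outer(W0 @ h - y, h))
```

Every statement in the function touches only d × d or K × d quantities, which is the source of the O(d²) per-example cost.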
We want to emphasize here that our approach is completely different from simply chaining two linear layers U and V and performing ordinary gradient descent updates on them: that would result in the same prohibitive computational complexity as the standard approach, and such ordinary separate gradient updates to U and V would not be equivalent to the ordinary gradient update to W = VU.

Algorithm 1: Efficient computation of cost L, gradient on h, and update to parameters U and V

    Step  Operation                                                Complexity    Multiply-adds
    1:    ĥ = Qh                                                   O(d²)         d²
    2:    ŷ = U^T(V^T y)                                           O(Kd + d²)    Kd + d²
    3:    ẑ = ĥ − ŷ                                                O(d)          d
    4:    ∇h = 2ẑ                                                  O(d)          d
    5:    L = h^T ĥ − 2h^T ŷ + y^T y                               O(2d + K)     2d + K + 1
    6:    U_new = U − 2η(Uh)h^T                                    O(d²)         2d² + d
    7:    U_new^{-T} = U^{-T} + (2η/(1 − 2η‖h‖²))(U^{-T}h)h^T      O(d²)         2d² + 2d + 3
    8:    V_new = V + 2ηy(U_new^{-T}h)^T                           O(d² + Kd)    d² + K + Kd
    9:    Q_new = Q − 2η(hẑ^T + ẑh^T) + (4η²L)hh^T                 O(d²)         4 + 2d + 3d²

    Altogether: O(d²) provided K < d ≪ D, i.e. ≈ 12d² elementary operations.

3.5 Controlling numerical stability and extension to the minibatch case

The update of U in Equation 3 may over time lead U to become ill-conditioned. To prevent this, we regularly (every 100 updates) monitor its condition number. If either the smallest or largest singular value moves outside an acceptable range², we bring it back to 1 by doing an appropriate rank-1 update to V (which costs Dd operations, but is only done rarely). Our algorithm can also be straightforwardly extended to the minibatch case (the derivations are given in the supplementary material) and yields the same theoretical speedup factor with respect to the standard naive approach.
But one needs to be careful in order to keep the computation of U^{-T}h reasonably efficient: depending on the size m of the minibatch, it may be more efficient to solve the corresponding linear equation for each minibatch from scratch rather than updating U^{-T} with the Woodbury identity (which generalizes the Sherman-Morrison formula to m > 1).

² More details on our numerical stabilization procedure can be found in the supplementary material.

3.6 Generalization to a broader class of loss functions

The approach that we just detailed for a linear output and squared error can be extended to a broader, though restricted, family of loss functions. We call it the spherical family of loss functions because it includes the spherical alternative to the softmax, thus named in [14]. Basically it contains any loss function that can be expressed as a function of only the o_c associated with the non-zero y_c and of ‖o‖² = Σ_j o_j², the squared norm of the whole output vector, which we can compute cheaply, irrespective of D, as we did above³. This family does not include the standard softmax loss log(exp(o_c) / Σ_j exp(o_j)), but it does include the spherical softmax⁴:

    log( (o_c² + ε) / Σ_j (o_j² + ε) )

Due to space constraints we will not detail this extension here, and only give a sketch of how it can be obtained. Deriving it may not appear obvious at first, but it is relatively straightforward once we realize that: a) the gain in computing the squared error loss comes from being able to very cheaply compute the sum of squared activations ‖o‖² (a scalar quantity), and will thus apply equally well to other losses that can be expressed based on that quantity (like the spherical softmax); b) generalizing our gradient update trick to such losses follows naturally from gradient backpropagation: the gradient is first backpropagated from the final loss to the scalar sum of squared activations, and from there on follows the same path and update procedure as for the squared error loss.

4 Experimental validation

We implemented both a CPU version using BLAS and a parallel GPU (CUDA) version using cuBLAS of the proposed algorithm⁵. We evaluated the GPU and CPU implementations by training word embeddings with simple neural language models, in which a probability map of the next word given its preceding n-gram is learned by a neural network. We used an Nvidia Titan Black GPU and an i7-4820K @ 3.70GHz CPU, and ran experiments on the one billion word dataset [15], which is composed of 0.8 billion words belonging to a vocabulary of 0.8 million words. We evaluated the resulting word embeddings with the recently introduced SimLex-999 score [16], which measures the similarity between words. We also compared our approach to unfactorised versions and to a two-layer hierarchical softmax. Figures 2 and 3 (left) illustrate the practical speedup of our approach for the output layer only. Figure 3 (right) shows that our LST (Large Sparse Target) models are much faster to train than the softmax models and converge to only slightly lower SimLex-999 scores. Table 1 summarizes the speedups for the different output layers we tried, both on CPU and GPU. We also empirically verified that our proposed factored algorithm learns the same model weights (VU) as the corresponding naive unfactored algorithm's W, as it theoretically should, and follows the same learning curves (as a function of the number of iterations, not time!).

5 Conclusion and future work

We introduced a new algorithmic approach to efficiently compute the exact gradient updates for training deep networks with very large sparse targets.
Remarkably, the complexity of the algorithm is independent of the target size, which allows tackling very large problems. Our CPU and GPU implementations yield speedups similar to the theoretical one and can thus be used in practical applications, which could be explored in further work. In particular, neural language models seem good candidates. But it remains unclear how using a loss function other than the usual softmax might affect the quality of the resulting word embeddings, so further research needs to be carried out in this direction. This includes empirically investigating natural extensions of the approach we described to other possible losses in the spherical family, such as the spherical softmax.

Acknowledgements: We wish to thank Yves Grandvalet for stimulating discussions, Çağlar Gülçehre for pointing us to [14], the developers of Theano [17, 18] and Blocks [19] for making these libraries available to build on, and NSERC and Ubisoft for their financial support.

³ In addition, loss functions in this family are also allowed to depend on sum(o) = Σ_j o_j, which we can also compute cheaply without computing o, by tracking w̄ = Σ_j W_{j:}^T, whereby sum(o) = Σ_j W_{j:} h = w̄^T h.

⁴ Here c is the correct class label, and ε is a small positive constant that we added to the spherical interpretation in [14] for numerical stability: to guarantee we never divide by 0 nor take the log of 0.

⁵ Open source code is available at: https://github.com/pascal20100/factored_output_layer

Table 1: Speedups with respect to the baseline naive model on CPU, for a minibatch of 128 and the whole vocabulary of D = 793471 words.
This is a model with two hidden layers of d = 300 neurons.

    Model                        Output layer only speedup    Whole model speedup
    cpu unfactorised (naive)     1                            1
    gpu unfactorised (naive)     6.8                          4.7
    gpu hierarchical softmax     125.2                        178.1
    cpu factorised               763.3                        501
    gpu factorised               3257.3                       1852.3

Figure 2: Timing of different algorithms. Time taken by the forward and backward propagations in the output layer, including the weight update, on a minibatch of size 128, for different sizes of vocabulary D, on both CPU and GPU. The input size d is fixed to 300. The timing of an efficient GPU implementation of a 2-layer hierarchical softmax (h_softmax) is also provided for comparison. The right plot is in log-log scale. As expected, the timings of the factorized versions are independent of the vocabulary size.

Figure 3: Left: Practical and theoretical speedups for different sizes of vocabulary D and fixed input size d = 300. The practical unfact/fact speedup is similar to the theoretical one. Right: Evolution of the SimLex-999 score obtained with different models as a function of training time (CPU softmax times were extrapolated from fewer iterations). Softmax models are zero-hidden-layer models, while our large sparse target (LST) models have two hidden layers. These were the best architectures retained in both cases (surprisingly, the softmax models with hidden layers performed no better on this task). The extra non-linear layers in LST may help compensate for the lack of a softmax.
LST models converge to slightly lower scores at a similar speed as the hierarchical softmax model, but significantly faster than the softmax models.

References

[1] Y. Bengio, R. Ducharme, and P. Vincent. A neural probabilistic language model. In Advances in Neural Information Processing Systems 13 (NIPS'00), pages 932–938, 2001.
[2] R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa. Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12:2493–2537, 2011.
[3] Y. Dauphin, X. Glorot, and Y. Bengio. Large-scale learning of embeddings with reconstruction sampling. In Proceedings of the 28th International Conference on Machine Learning, ICML '11, 2011.
[4] S. Jean, K. Cho, R. Memisevic, and Y. Bengio. On using very large target vocabulary for neural machine translation. In ACL-IJCNLP 2015, 2015. arXiv:1412.2007.
[5] M. Gutmann and A. Hyvarinen. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (AISTATS'10), 2010.
[6] A. Mnih and K. Kavukcuoglu. Learning word embeddings efficiently with noise-contrastive estimation. In Advances in Neural Information Processing Systems 26, pages 2265–2273, 2013.
[7] T. Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In NIPS 2013, pages 3111–3119, 2013.
[8] A. Shrivastava and P. Li. Asymmetric LSH (ALSH) for sublinear time maximum inner product search (MIPS). In Advances in Neural Information Processing Systems 27, pages 2321–2329, 2014.
[9] S. Vijayanarasimhan, J. Shlens, R. Monga, and J. Yagnik. Deep networks with large output spaces. arXiv:1412.7479, 2014.
[10] F. Morin and Y. Bengio. Hierarchical probabilistic neural network language model. In Proceedings of the Tenth International Workshop on Artificial Intelligence and Statistics, pages 246–252, 2005.
[11] D. Rumelhart, G. Hinton, and R. Williams. Learning representations by back-propagating errors. Nature, 323:533–536, 1986.
[12] Y. LeCun. Une procédure d'apprentissage pour réseau à seuil assymétrique. In Cognitiva 85: A la Frontière de l'Intelligence Artificielle, des Sciences de la Connaissance et des Neurosciences, pages 599–604, 1985.
[13] Y. LeCun. Learning processes in an asymmetric threshold network. In Disordered Systems and Biological Organization, pages 233–240. Les Houches 1985, 1986.
[14] Y. Ollivier. Riemannian metrics for neural networks. CoRR, abs/1303.0818, 2013.
[15] C. Chelba, T. Mikolov, M. Schuster, Q. Ge, T. Brants, P. Koehn, and T. Robinson. One billion word benchmark for measuring progress in statistical language modeling. In INTERSPEECH 2014, 15th Annual Conference of the International Speech Communication Association, Singapore, September 14–18, 2014, pages 2635–2639, 2014.
[16] F. Hill, R. Reichart, and A. Korhonen. SimLex-999: Evaluating semantic models with (genuine) similarity estimation. CoRR, abs/1408.3456, 2014.
[17] J. Bergstra, O. Breuleux, F. Bastien, P. Lamblin, R. Pascanu, G. Desjardins, J. Turian, D. Warde-Farley, and Y. Bengio. Theano: a CPU and GPU math expression compiler. In Proceedings of the Python for Scientific Computing Conference (SciPy), 2010. Oral presentation.
[18] F. Bastien, P. Lamblin, R. Pascanu, J. Bergstra, I. J. Goodfellow, A. Bergeron, N. Bouchard, and Y. Bengio. Theano: new features and speed improvements. Deep Learning and Unsupervised Feature Learning NIPS 2012 Workshop, 2012.
[19] B. van Merriënboer, D. Bahdanau, V. Dumoulin, D. Serdyuk, D. Warde-Farley, J. Chorowski, and Y. Bengio. Blocks and Fuel: Frameworks for deep learning. ArXiv e-prints, June 2015.