{"title": "Stochastic Shared Embeddings: Data-driven Regularization of Embedding Layers", "book": "Advances in Neural Information Processing Systems", "page_first": 24, "page_last": 34, "abstract": "In deep neural nets, lower level embedding layers account for a large portion of the total number of parameters. Tikhonov regularization, graph-based regularization, and hard parameter sharing are approaches that introduce explicit biases into training in a hope to reduce statistical complexity. Alternatively, we propose stochastically shared embeddings (SSE), a data-driven approach to regularizing embedding layers, which stochastically transitions between embeddings during stochastic gradient descent (SGD). Because SSE integrates seamlessly with existing SGD algorithms, it can be used with only minor modifications when training large scale neural networks. We develop two versions of SSE: SSE-Graph using knowledge graphs of embeddings; SSE-SE using no prior information. We provide theoretical guarantees for our method and show its empirical effectiveness on 6 distinct tasks, from simple neural networks with one hidden layer in recommender systems, to the transformer and BERT in natural languages. 
We find that when used along with widely-used regularization methods such as weight decay and dropout, our proposed SSE can further reduce overfitting, which often leads to more favorable generalization results.", "full_text": "Stochastic Shared Embeddings: Data-driven Regularization of Embedding Layers\n\nLiwei Wu\nDepartment of Statistics\nUniversity of California, Davis\nDavis, CA 95616\nliwu@ucdavis.edu\n\nShuqing Li\nDepartment of Computer Science\nUniversity of California, Davis\nDavis, CA 95616\nqshli@ucdavis.edu\n\nCho-Jui Hsieh\nDepartment of Computer Science\nUniversity of California, Los Angeles\nLos Angeles, CA 90095\nchohsieh@cs.ucla.edu\n\nJames Sharpnack\nDepartment of Statistics\nUniversity of California, Davis\nDavis, CA 95616\njsharpna@ucdavis.edu\n\nAbstract\n\nIn deep neural nets, lower-level embedding layers account for a large portion of the total number of parameters. Tikhonov regularization, graph-based regularization, and hard parameter sharing are approaches that introduce explicit biases into training in the hope of reducing statistical complexity. Alternatively, we propose stochastic shared embeddings (SSE), a data-driven approach to regularizing embedding layers, which stochastically transitions between embeddings during stochastic gradient descent (SGD). Because SSE integrates seamlessly with existing SGD algorithms, it can be used with only minor modifications when training large-scale neural networks. We develop two versions of SSE: SSE-Graph, which uses knowledge graphs of embeddings, and SSE-SE, which uses no prior information. We provide theoretical guarantees for our method and show its empirical effectiveness on 6 distinct tasks, from simple neural networks with one hidden layer in recommender systems, to the transformer and BERT in natural language processing. 
We find that when used along with widely-used regularization methods such as weight decay and dropout, our proposed SSE can further reduce overfitting, which often leads to more favorable generalization results.\n\n1 Introduction\n\nRecently, embedding representations have been widely used in almost all AI-related fields, from feature maps [13] in computer vision, to word embeddings [15, 20] in natural language processing, to user/item embeddings [17, 10] in recommender systems. Usually, the embeddings are high-dimensional vectors. Take language models as an example: in GPT [22] and the BERT-Base model [3], 768-dimensional vectors are used to represent words. The BERT-Large model utilizes 1024-dimensional vectors, and GPT-2 [23] may have used even higher dimensions in its unreleased large models. In recommender systems, things are slightly different: the dimension of user/item embeddings is usually set to be reasonably small, 50 or 100, but the number of users and items is on a much bigger scale. While a word vocabulary normally ranges in size from 50,000 to 150,000, the number of users and items can reach millions or even billions in large-scale real-world commercial recommender systems [1].\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fGiven the massive number of parameters in modern neural networks with embedding layers, mitigating over-parameterization can play an important role in preventing overfitting in deep learning. We propose a regularization method, Stochastic Shared Embeddings (SSE), that uses prior information about similarities between embeddings, such as semantically and grammatically related words in natural languages or real-world users who share social relationships. 
Critically, SSE progresses by stochastically transitioning between embeddings, as opposed to more brute-force regularizations such as graph-based Laplacian regularization and ridge regularization. Thus, SSE integrates seamlessly with existing stochastic optimization methods, and the resulting regularization is data-driven.\nWe begin the paper with the mathematical formulation of the problem, propose SSE, and provide the motivations behind SSE. We provide a theoretical analysis of SSE that can be compared with excess risk bounds based on empirical Rademacher complexity. We then conduct experiments on a total of 6 tasks, from simple neural networks with one hidden layer in recommender systems, to the transformer and BERT in natural language processing, and find that when used along with widely-used regularization methods such as weight decay and dropout, our proposed methods can further reduce overfitting, which often leads to more favorable generalization results.\n\n2 Related Work\n\nRegularization techniques are used to control model complexity and avoid overfitting. ℓ2 regularization [8] is the most widely used approach and has been used in many matrix factorization models in recommender systems; ℓ1 regularization [29] is used when a sparse model is preferred. For deep neural networks, it has been shown that ℓp regularizations are often too weak, while dropout [7, 27] is more effective in practice. There are many other regularization techniques, including parameter sharing [5], max-norm regularization [26], gradient clipping [19], etc.\nOur proposed SSE-Graph is very different from graph Laplacian regularization [2], in which the distances between any two embeddings connected over the graph are directly penalized. Hard parameter sharing uses one embedding to replace all distinct embeddings in the same group, which inevitably introduces a significant bias. 
Soft parameter sharing [18] is similar to the graph Laplacian, penalizing the ℓ2 distances between any two embeddings. These methods have no dependence on the loss, while the proposed SSE-Graph method is data-driven in that the loss influences the effect of regularization. Unlike graph Laplacian regularization and hard and soft parameter sharing, our method is stochastic by nature. This allows our model to enjoy similar advantages as dropout [27].\nInterestingly, in the original BERT model's pre-training stage [3], a variant of SSE-SE is already implicitly used for token embeddings, but for a different reason. In [3], the authors masked 15% of words and 10% of the time replaced the [mask] token with a random token. In the next section, we discuss how SSE-SE differs from this heuristic. Another closely related technique to ours is label smoothing [28], which is widely used in the computer vision community. We find that in the classification setting, if we apply SSE-SE only to the one-hot encodings associated with the output yi, our SSE-SE is closely related to label smoothing, which can be treated as a special case of our proposed method.\n\n3 Stochastic Shared Embeddings\n\nThroughout this paper, the network input xi and label yi will be encoded into indices j^i_1, . . . , j^i_M, which are elements of I_1 × · · · × I_M, the index sets of the embedding tables. A typical choice is that the indices are the encoding of a dictionary for words in natural language applications, or user and item tables in recommendation systems. Each index j_l within the l-th table is associated with an embedding E_l[j_l], which is a trainable vector in R^{d_l}. The embeddings associated with the label yi are usually non-trainable one-hot vectors corresponding to label look-up tables, while the embeddings associated with the input xi are trainable embedding vectors for embedding look-up tables. 
In natural language applications, we appropriately modify this framework to accommodate sequences such as sentences.\nThe loss function can be written as a function of the embeddings:\n\nR_n(Θ) = Σ_i ℓ(xi, yi|Θ) = Σ_i ℓ(E_1[j^i_1], . . . , E_M[j^i_M]|Θ),   (1)\n\nwhere yi is the label and Θ encompasses all trainable parameters, including the embeddings {E_l[j_l] : j_l ∈ I_l}. The loss function ℓ is a mapping from embedding spaces to the reals. For text input, each E_l[j^i_l] is a word embedding vector in the input sentence or document. For recommender systems, there are usually two embedding look-up tables: one for users and one for items [6]. So the objective function, such as a mean squared loss or some ranking loss, will comprise both user and item embeddings for each input. We can more succinctly write the matrix of all embeddings for the i-th sample as E[j^i] = (E_1[j^i_1], . . . , E_M[j^i_M]), where j^i = (j^i_1, . . . , j^i_M) ∈ I. \n\nAlgorithm 1 SSE-Graph for Neural Networks with Embeddings\n1: Input: input xi, label yi, backpropagate T steps, mini-batch size m, knowledge graphs on embeddings {E_1, . . . , E_M}\n2: Define p_l(·, ·|Φ) based on the knowledge graphs on embeddings, l = 1, . . . , M\n3: for t = 1 to T do\n4:   Sample one mini-batch {x_1, . . . , x_m}\n5:   for i = 1 to m do\n6:     Identify the set of embeddings S_i = {E_1[j^i_1], . . . , E_M[j^i_M]} for input xi and label yi\n7:     for each embedding E_l[j^i_l] ∈ S_i do\n8:       Replace E_l[j^i_l] with E_l[k_l], where k_l ∼ p_l(j^i_l, ·|Φ)\n9:     end for\n10:   end for\n11:   Forward and backward pass with the new embeddings\n12: end for\n13: Return embeddings {E_1, . . . , E_M} and neural network parameters Θ\n\nFigure 1: SSE-Graph, described in Algorithm 1 and Figure 2, can be viewed as adding exponentially many distinct reordering layers above the embedding layer. A modified backpropagation procedure in Algorithm 1 is used to train exponentially many such neural networks at the same time.\n
By an abuse of notation, we write the loss as a function of the embedding matrix, ℓ(E[j^i]|Θ).\nSuppose that we have access to knowledge graphs [16, 14] over embeddings, and we have a prior belief that two embeddings will share information, so that replacing one with the other should not incur a significant change in the loss distribution. For example, if two movies are both comedies and are starred by the same actors, it is very likely that, for the same user, replacing one comedy movie with the other will result in little change in the loss distribution. In stochastic optimization, we can replace the loss gradient for one movie's embedding with that of the other, similar movie's embedding, and this will not significantly bias the gradient if the prior belief is accurate. On the other hand, if this exchange is stochastic, then it will act to smooth the gradient steps in the long run, thus regularizing the gradient updates.\n\n3.1 General SSE with Knowledge Graphs: SSE-Graph\n\nInstead of optimizing the objective function R_n(Θ) in (1), SSE-Graph, described in Algorithm 1, Figure 1, and Figure 2, approximately optimizes the objective function below:\n\nS_n(Θ) = Σ_i Σ_{k ∈ I} p(j^i, k|Φ) ℓ(E[k]|Θ),   (2)\n\nwhere p(j, k|Φ) is the transition probability (with parameters Φ) of exchanging the encoding vector j ∈ I with a new encoding vector k ∈ I in the Cartesian product index set of all embedding tables. When there is a single embedding table (M = 1), there are no hard restrictions on the transition probabilities p(·, ·), but when there are multiple tables (M > 1) we will enforce that p(·, ·) takes a tensor product form (see (4)).\n\nFigure 2: Illustration of how the SSE-Graph algorithm in Figure 1 works for a simple neural network. 
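To make the replacement step of Algorithm 1 concrete, here is a minimal NumPy sketch for the single-table case (M = 1). The function name `sse_graph_replace` and the toy transition matrix are our own illustrative choices, not code from the paper; we assume a precomputed row-stochastic matrix P with P[j, j] = 1 − p0.

```python
import numpy as np

def sse_graph_replace(indices, P, rng):
    """For each index j in the mini-batch, sample a replacement k ~ p(j, .).

    indices : (m,) int array of embedding indices for the mini-batch
    P       : (N, N) row-stochastic transition matrix, P[j, j] = 1 - p0
    rng     : numpy random Generator
    """
    N = P.shape[0]
    return np.array([rng.choice(N, p=P[j]) for j in indices])

# Toy example: 3 embeddings, SSE probability p0 = 0.5 split evenly between
# the two other indices, so each index is kept with probability 0.5.
P = np.array([[0.50, 0.25, 0.25],
              [0.25, 0.50, 0.25],
              [0.25, 0.25, 0.50]])
rng = np.random.default_rng(0)
batch = np.array([0, 1, 2, 0])
new_batch = sse_graph_replace(batch, P, rng)
# The forward and backward pass then use E[new_batch] instead of E[batch].
```

With P equal to the identity matrix (p0 = 0) the procedure reduces to ordinary SGD, which is the sense in which SSE is a minor modification of existing training loops.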
When we assume that there is only a single embedding table (M = 1), we will not bold j, E[j], and we suppress their indices.\nIn the single embedding table case, M = 1, there are many ways to define the transition probability from j to k. One simple and effective way is to use a random walk (with random restart and self-loops) on a knowledge graph G: when embedding j is connected with k but not with l, we set the ratio of p(j, k|Φ) and p(j, l|Φ) to be a constant greater than 1. In more formal notation, we have\n\nj ∼ k, j ≁ l → p(j, k|Φ)/p(j, l|Φ) = ρ,   (3)\n\nwhere ρ > 1 is a tuning parameter. It is motivated by the fact that embeddings connected with each other in knowledge graphs should bear more resemblance and thus be more likely to replace each other. Also, we let p(j, j|Φ) = 1 − p0, where p0 is called the SSE probability and 1 − p0 is the embedding retainment probability. We treat both p0 and ρ as tuning hyper-parameters in experiments. With (3) and Σ_k p(j, k|Φ) = 1, we can derive the transition probabilities between any two embeddings and fill out the transition probability table.\nWhen there are multiple embedding tables, M > 1, we will force the transition from j to k to factor into independent transitions from j_l to k_l within embedding table l (and index set I_l). Each table may have its own knowledge graph, resulting in its own transition probabilities p_l(·, ·). The more general form of the SSE-Graph objective is given below:\n\nS_n(Θ) = Σ_i Σ_{k_1, . . . , k_M} p_1(j^i_1, k_1|Φ) · · · p_M(j^i_M, k_M|Φ) ℓ(E_1[k_1], . . . , E_M[k_M]|Θ),   (4)\n\nIntuitively, this SSE objective could reduce the variance of the estimator.\nOptimizing (4) with SGD or its variants (Adagrad [4], Adam [12]) is simple. We just need to randomly switch each original embedding tensor E[j^i] with another embedding tensor E[k] sampled according to the transition probability (see Algorithm 1). 
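As an illustration of how the ratio constraint (3), the self-loop probability 1 − p0, and the normalization Σ_k p(j, k|Φ) = 1 pin down one row of the transition table, the following sketch (our own naming, not the paper's code) solves for the base rate assigned to non-neighbors and scales neighbors by ρ:

```python
import numpy as np

def transition_row(neighbors, j, N, p0, rho):
    """Row j of an SSE-Graph transition matrix implied by equation (3).

    Mass 1 - p0 stays on j itself; the remaining p0 is split so that each
    neighbor of j is rho times as likely as each non-neighbor.
    """
    n_nbr = len(neighbors)
    # Solve n_nbr * rho * q + (N - 1 - n_nbr) * q = p0 for the base rate q.
    q = p0 / (rho * n_nbr + (N - 1 - n_nbr))
    row = np.full(N, q)            # non-neighbors get the base rate q
    row[list(neighbors)] = rho * q  # neighbors are rho times as likely
    row[j] = 1.0 - p0               # self-loop: keep the embedding
    return row

# Toy graph over 4 embeddings where index 0 is connected only to index 1.
row = transition_row(neighbors={1}, j=0, N=4, p0=0.1, rho=3.0)
# row sums to 1, row[1]/row[2] == rho, and row[0] == 1 - p0.
```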
This is equivalent to having a randomized embedding look-up layer, as shown in Figure 1.\nWe can also accommodate sequences of embeddings, which commonly occur in natural language applications, by considering (j^i_{l,1}, k_{l,1}), . . . , (j^i_{l,n^i_l}, k_{l,n^i_l}) instead of (j^i_l, k_l) for the l-th embedding table in (4), where 1 ≤ l ≤ M and n^i_l is the number of embeddings in table l that are associated with (xi, yi). When there is more than one embedding look-up table, we sometimes prefer to use different p0 and ρ for different look-up tables in (3) and the SSE probability constraint. For example, in recommender systems, we would use p_u, ρ_u for the user embedding table and p_i, ρ_i for the item embedding table.\n\nFigure 3: Projecting 50-dimensional embeddings obtained by training a simple neural network without SSE (Left), with SSE-Graph (Center), and with SSE-SE (Right) into 3D space using PCA.\n\nWe find that SSE with knowledge graphs, i.e., SSE-Graph, can force similar embeddings to cluster when compared to the original neural network without SSE-Graph. In Figure 3, one can easily see that more embeddings tend to cluster into 2 singularities after applying SSE-Graph when the embeddings are projected into 3D space using PCA. Interestingly, a similar phenomenon occurs when assuming the knowledge graph is a complete graph, which we introduce as SSE-SE below.\n\n3.2 Simplified SSE with Complete Graph: SSE-SE\n\nOne clear limitation of applying SSE-Graph is that not every dataset comes with good-quality knowledge graphs on embeddings. For those cases, we can assume there is a complete graph over all embeddings, so there is a small transition probability between every pair of distinct embeddings:\n\np(j, k|Φ) = p0/(N − 1), ∀ 1 ≤ k ≠ j ≤ N,   (5)\n\nwhere N is the size of the embedding table. 
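Because (5) puts equal mass p0/(N − 1) on every other index, SSE-SE needs no transition table at all; replacements can be sampled directly. A minimal NumPy sketch (the function name is ours, not from the paper):

```python
import numpy as np

def sse_se_replace(indices, N, p0, rng):
    """SSE-SE per equation (5): with probability p0, replace each index
    with one of the other N - 1 indices chosen uniformly at random."""
    indices = np.asarray(indices).copy()
    swap = rng.random(indices.shape) < p0
    # Draw from {0, ..., N-2} and shift draws past the original index, so
    # the replacement is uniform over the N - 1 *other* indices.
    draws = rng.integers(0, N - 1, size=indices.shape)
    replacements = draws + (draws >= indices)
    indices[swap] = replacements[swap]
    return indices
```

Applied to both input and label look-ups at every SGD iteration, this one function is the entire SSE-SE modification to a standard training loop.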
The SGD procedure in Algorithm 1 can still be applied, and we call this algorithm SSE-SE (Stochastic Shared Embeddings - Simple and Easy). It is worth noting that SSE-Graph and SSE-SE are applied to the embeddings associated with not only the input xi but also the output yi. Unless there are considerably more embeddings than data points and the model is significantly overfitting, p0 = 0.01 normally gives reasonably good results.\nInterestingly, we found that the SSE-SE framework is related to several techniques used in practice. For example, BERT pre-training unintentionally applied a method similar to SSE-SE to the input xi by replacing the masked word with a random word. This implicitly introduces an SSE layer for the input xi in Figure 1, because the embeddings associated with the input xi are stochastically mapped according to (5). The main difference between this and SSE-SE is that it merely augments the input once, while SSE introduces randomization at every iteration, and we can also accommodate label embeddings. In Section 4.4, we show that SSE-SE improves the original BERT pre-training procedure as well as the fine-tuning procedure.\n\n3.3 Theoretical Guarantees\n\nWe explain why SSE can reduce the variance of estimators and thus lead to better generalization performance. For simplicity, we consider the SSE-Graph objective (2) where there is no transition associated with the label yi, and only the embeddings associated with the input xi undergo a transition. When this is the case, we can think of the loss as a function of the xi embedding and the label, ℓ(E[j^i], yi; Θ). We take this approach because it is more straightforward to compare our resulting theory to existing excess risk bounds.\nThe SSE objective in the case of only input transitions can be written as\n\nS_n(Θ) = Σ_i Σ_k p(j^i, k) · ℓ(E[k], yi|Θ),   (6)\n\nand there may be some constraint on Θ. 
Let Θ̂ denote the minimizer of S_n subject to this constraint. We will show in the subsequent theory that minimizing S_n will get us close to a minimizer of S(Θ) = E S_n(Θ), and that under some conditions this will get us close to the Bayes risk. We will use the standard definitions of empirical and true risk, R_n(Θ) = Σ_i ℓ(xi, yi|Θ) and R(Θ) = E R_n(Θ).\nOur results depend on the following decomposition of the risk. By optimality of Θ̂,\n\nR(Θ̂) = S_n(Θ̂) + [R(Θ̂) − S(Θ̂)] + [S(Θ̂) − S_n(Θ̂)] ≤ S_n(Θ*) + B(Θ̂) + E(Θ̂),   (7)\n\nwhere B(Θ) = |R(Θ) − S(Θ)| and E(Θ) = |S(Θ) − S_n(Θ)|. We can think of B(Θ) as representing the bias due to SSE, and E(Θ) as an SSE form of excess risk. \n\nTable 1: Comparison of SSE-Graph and SSE-SE against ALS-MF with graph Laplacian regularization. Here p_u and p_i are the SSE probabilities for the user and item embedding tables, respectively, as in (5); definitions of ρ_u and ρ_i can be found in (3). Movielens10m does not have user graphs.\n\nModel | Movielens1m: ρ_u, ρ_i, p_u, p_i, RMSE | Movielens10m: ρ_u, ρ_i, p_u, p_i, RMSE\nSGD-MF | -, -, -, -, 1.0984 | -, -, -, -, 1.9490\nGraph Laplacian + ALS-MF | -, -, -, -, 1.0464 | -, -, -, -, 1.9755\nSSE-Graph + SGD-MF | 500, 200, 0.005, 0.005, 1.0145 | 1, 500, 0.01, 0.01, 1.9019\nSSE-SE + SGD-MF | 1, 1, 0.005, 0.005, 1.0150 | 1, 1, 0.01, 0.01, 1.9085\n
Then, by another application of similar bounds,\n\nR(Θ̂) ≤ R(Θ*) + B(Θ̂) + B(Θ*) + E(Θ̂) + E(Θ*).   (8)\n\nThe high-level idea behind the following results is that when the SSE protocol reflects the underlying distribution of the data, the bias term B(Θ) is small, and if the SSE transitions are well mixing, the SSE excess risk E(Θ) will be of smaller order than the standard Rademacher complexity. This will result in a small excess risk.\nTheorem 1. Consider SSE-Graph with only input transitions. Let L(E[j^i]|Θ) = E_{Y|X=xi} ℓ(E[j^i], Y|Θ) be the expected loss conditional on input xi, and let e(E[j^i], y|Θ) = ℓ(E[j^i], y|Θ) − L(E[j^i]|Θ) be the residual loss. Define the conditional and residual SSE empirical Rademacher complexities to be\n\nρ_{L,n} = E sup_Θ | Σ_i σ_i Σ_k p(j^i, k) · L(E[k]|Θ) |,   (9)\n\nρ_{e,n} = E sup_Θ | Σ_i σ_i Σ_k p(j^i, k) · e(E[k], yi; Θ) |,   (10)\n\nrespectively, where σ is a Rademacher ±1 random vector in R^n. Then we can bound the deviation of the SSE empirical risk by\n\nE sup_Θ |S_n(Θ) − S(Θ)| ≤ 2 E[ρ_{L,n} + ρ_{e,n}].   (11)\n\nRemark 1. The transition probabilities in (9), (10) act to smooth the empirical Rademacher complexity. To see this, notice that we can write the inner term of (9) as (Pσ)ᵀL, where we have vectorized σ_i and L(xi; Θ) and formed the transition matrix P. Transition matrices are contractive and will induce dependencies between the Rademacher random variables, thereby stochastically reducing the supremum. In the case of no label noise, namely when Y|X is a point mass, e(x, y; Θ) = 0 and ρ_{e,n} = 0. The use of L, as opposed to the losses ℓ, will also make ρ_{L,n} of smaller order than the standard empirical Rademacher complexity. 
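The contraction in Remark 1 can be checked numerically. For the sketch below we add an assumption of our own, not the paper's: that P is doubly stochastic, so by Birkhoff's theorem it is a convex combination of norm-preserving permutation matrices, and hence E_σ⟨σ, PᵀL⟩² = ‖PᵀL‖² ≤ ‖L‖² = E_σ⟨σ, L⟩², i.e. the smoothed Rademacher term cannot exceed the unsmoothed one:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
L = rng.normal(size=n)  # stand-in for the vector of conditional losses L(E[j^i]|Theta)

# A doubly stochastic transition matrix: keep an index with probability 0.9,
# otherwise move along a random permutation (a convex mix of permutations).
perm = rng.permutation(n)
P = 0.9 * np.eye(n)
P[np.arange(n), perm] += 0.1

# E_sigma <sigma, P^T L>^2 = ||P^T L||^2 <= ||L||^2 = E_sigma <sigma, L>^2.
smoothed = np.linalg.norm(P.T @ L)
raw = np.linalg.norm(L)
assert smoothed <= raw + 1e-12
```

This is only a toy check of the averaging effect; the theorem itself requires no doubly-stochastic assumption.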
We demonstrate this with a partial simulation of ρ_{L,n} on the Movielens1m dataset in Figure 5 of the Appendix.\nTheorem 2. Let the SSE-bias be defined as\n\nB = sup_Θ | E[ Σ_i Σ_k p(j^i, k) · (ℓ(E[k], yi|Θ) − ℓ(E[j^i], yi|Θ)) ] |.\n\nSuppose that 0 ≤ ℓ(·, ·; Θ) ≤ b for some b > 0. Then\n\nP{ R(Θ̂) > R(Θ*) + 2B + 4 E[ρ_{L,n} + ρ_{e,n}] + √n u } ≤ e^{−u²/(2b²)}.\n\nRemark 2. The price for 'smoothing' the Rademacher complexity in Theorem 1 is that SSE may introduce a bias. This will be particularly prominent when the SSE transitions have little to do with the underlying distribution of Y, X. On the other extreme, suppose that p(j, k) is non-zero over a neighborhood N_j of j, and that for data x′, y′ with encoding k ∈ N_j, the pair x′, y′ is identically distributed with xi, yi; then B = 0. In all likelihood, the SSE transition probabilities will not be supported over neighborhoods of iid random pairs, but with a well-chosen SSE protocol the neighborhoods contain approximately iid pairs and B is small.\n\nTable 2: SSE-SE outperforms Dropout for neural networks with one hidden layer, such as the matrix factorization algorithm, regardless of the dimensionality we use. Here p_s is the SSE probability for both user and item embedding tables and p_d is the dropout probability.\n\nModel | Douban: p_d, p_s, RMSE | Movielens10m: p_d, p_s, RMSE | Netflix: p_d, p_s, RMSE\nMF | -, -, 0.7339 | -, -, 0.8851 | -, -, 0.8941\nDropout + MF | 0.1, -, 0.7296 | 0.1, -, 0.8813 | 0.1, -, 0.8897\nSSE-SE + MF | -, 0.008, 0.7201 | -, 0.008, 0.8715 | -, 0.008, 0.8842\nSSE-SE + Dropout + MF | 0.1, 0.005, 0.7185 | 0.1, 0.005, 0.8678 | 0.1, 0.005, 0.8790\n\nTable 3: SSE-SE outperforms Dropout for neural networks with one hidden layer, such as the Bayesian personalized ranking algorithm, regardless of the dimensionality we use. 
We report the precision metric for top-k recommendations as P@k.\n\nModel | Movielens1m: P@1, P@5, P@10 | Yahoo Music: P@1, P@5, P@10 | Foursquare: P@1, P@5, P@10\nSQL-Rank (2018) | 0.7369, 0.6717, 0.6183 | 0.4551, 0.3614, 0.3069 | 0.0583, 0.0194, 0.0170\nBPR | 0.6977, 0.6568, 0.6257 | 0.3971, 0.3295, 0.2806 | 0.0437, 0.0189, 0.0143\nDropout + BPR | 0.7031, 0.6548, 0.6273 | 0.4080, 0.3315, 0.2847 | 0.0437, 0.0184, 0.0146\nSSE-SE + BPR | 0.7254, 0.6813, 0.6469 | 0.4297, 0.3498, 0.3005 | 0.0609, 0.0262, 0.0155\n\n4 Experiments\n\nWe have conducted extensive experiments on 6 tasks, including 3 recommendation tasks (explicit feedback, implicit feedback, and sequential recommendation) and 3 NLP tasks (neural machine translation, BERT pre-training, and BERT fine-tuning for sentiment classification), and found that our proposed SSE can effectively improve generalization performance on a wide variety of tasks. Details about the datasets and parameter settings can be found in the appendix.\n\n4.1 Neural Networks with One Hidden Layer (Matrix Factorization and BPR)\n\nThe matrix factorization algorithm (MF) [17] and the Bayesian personalized ranking algorithm (BPR) [25] can be viewed as neural networks with one hidden layer (latent features) and are quite popular in recommendation tasks. MF uses the squared loss designed for explicit feedback data, while BPR uses the pairwise ranking loss designed for implicit feedback data.\nFirst, we conduct experiments on two explicit feedback datasets: Movielens1m and Movielens10m. For these datasets, we can construct graphs based on the actors/actresses starring in the movies. We compare SSE-Graph and the popular graph Laplacian regularization (GLR) method [24] in Table 1. The results show that SSE-Graph consistently outperforms GLR. 
This indicates that our SSE-Graph has greater potential than graph Laplacian regularization, as we do not explicitly penalize the distances between embeddings but rather implicitly penalize the effects of similar embeddings on the loss. Furthermore, we show that even without existing knowledge graphs of embeddings, our SSE-SE performs only slightly worse than SSE-Graph but still much better than GLR and MF.\nIn general, SSE-SE is a good alternative when graph information is not available. We then show that our proposed SSE-SE can be used together with standard regularization techniques such as dropout and weight decay to improve recommendation results regardless of the loss function and the dimensionality of the embeddings. This is evident in Table 2 and Table 3. With the help of SSE-SE, BPR can perform better than the state-of-the-art listwise approach SQL-Rank [32] in most cases. We include the optimal SSE parameters in the tables for reference and leave the other experimental details to the appendix. In the rest of the paper, we mostly focus on SSE-SE, as we do not have high-quality graphs of embeddings for most datasets.\n\n4.2 Transformer Encoder Model for Sequential Recommendation\n\nSASRec [11] is the state-of-the-art algorithm for the sequential recommendation task. It applies the transformer model [30], where a sequence of items purchased by a user can be viewed as a sentence in the transformer, and next-item prediction is equivalent to next-word prediction in a language model. In Table 4, we apply SSE-SE to the input embeddings (px = 0.1, py = 0), the output embeddings (px = 0, py = 0.1), and both embeddings (px = py = 0.1), and observe that all of them significantly improve over the state-of-the-art SASRec (px = py = 0). The regularization effects of SSE-SE are even more\n\n\fTable 4: SSE-SE has two tuning parameters: the probability px of replacing embeddings associated with the input xi and the probability py of replacing embeddings associated with the output yi. 
We use a dropout probability of 0.1, weight decay of 1e−5, and learning rate of 1e−3 for all experiments.\n\nModel | Dimension d | # of Blocks b | SSE-SE px | SSE-SE py | Movielens1m: NDCG@10, Hit Ratio@10\nSASRec | 100 | 2 | - | - | 0.5941, 0.8182\nSASRec | 100 | 6 | - | - | 0.5996, 0.8272\nSSE-SE + SASRec | 100 | 2 | 0.1 | 0 | 0.6092, 0.8250\nSSE-SE + SASRec | 100 | 2 | 0 | 0.1 | 0.6085, 0.8293\nSSE-SE + SASRec | 100 | 2 | 0.1 | 0.1 | 0.6200, 0.8315\nSSE-SE + SASRec | 100 | 6 | 0.1 | 0.1 | 0.6265, 0.8364\n\nTable 5: Our proposed SSE-SE helps the Transformer achieve better BLEU scores on English-to-German in 10 out of 11 newstest sets between 2008 and 2018.\n\nTest BLEU | 2008 | 2009 | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018\nTransformer | 21.0 | 20.7 | 22.7 | 20.6 | 20.6 | 25.3 | 26.2 | 28.4 | 32.1 | 27.2 | 38.8\nSSE-SE + Transformer | 21.4 | 21.1 | 23.0 | 21.0 | 20.8 | 25.2 | 27.2 | 29.2 | 33.1 | 27.9 | 39.9\n\nobvious when we increase the number of self-attention blocks from 2 to 6, as this leads to a more sophisticated model with many more parameters. This causes the model to overfit terribly even with dropout and weight decay. We can see in Table 4 that when both methods use dropout and weight decay, SSE-SE + SASRec does much better than SASRec without SSE-SE.\n\n4.3 Neural Machine Translation\n\nWe use the transformer model [30] as the backbone for our experiments. The baseline model is the standard 6-layer transformer architecture, and we apply SSE-SE to both the encoder and the decoder by replacing the corresponding vocabularies' embeddings in the source and target sentences. We trained on the standard WMT 2014 English-to-German dataset, which consists of roughly 4.5 million parallel sentence pairs, and tested on the WMT 2008 to 2018 news-test sets. We use the OpenNMT implementation in our experiments. We use the same dropout rate of 0.1 and label smoothing value of 0.1 for the baseline model and our SSE-enhanced model. 
The only difference between the two models is whether or not we use our proposed SSE-SE, with p0 = 0.01 in (5), for both the encoder and decoder embedding layers. We evaluate both models' performance on the test datasets using BLEU scores [21].\nWe summarize our results in Table 5 and find that SSE-SE helps improve accuracy and BLEU scores on both dev and test sets in 10 out of 11 years from 2008 to 2018. In particular, on the last 5 years' test sets from 2014 to 2018, the transformer model with SSE-SE improves BLEU scores by 0.92 on average when compared to the baseline model without SSE-SE.\n\n4.4 BERT for Sentiment Classification\n\nBERT's model architecture [3] is a multi-layer bidirectional Transformer encoder based on the Transformer model in neural machine translation. Although SSE-SE can be used for both the pre-training and fine-tuning stages of BERT, we mainly focus on pre-training, as fine-tuning bears more similarity to the previous section. We use an SSE probability of 0.015 for the embeddings (one-hot encodings) associated with labels and an SSE probability of 0.015 for the embeddings (word-piece embeddings) associated with inputs. It is worth noting that even in the original BERT model's pre-training stage, SSE-SE is already implicitly used for token embeddings. In the original BERT model, the authors masked 15% of the words, for a maximum of 80 words in sequences of maximum length 512, and 10% of the time replaced the [mask] token with a random token. That is roughly equivalent to an SSE probability of 0.015 for replacing input word-piece embeddings.\nWe continue to pre-train the Google pre-trained BERT model on our crawled IMDB movie reviews with and without SSE-SE and compare downstream task performance. In Table 6, we find that the SSE-SE pre-trained BERT-Base model helps us achieve state-of-the-art results on the IMDB sentiment classification task, better than the previous best in [9]. 
We report a test set accuracy of 0.9542 after fine-tuning for one epoch only. For the similar SST-2 sentiment classification task in Table 7, we also find that SSE-SE helps BERT pre-train better. Our SSE-SE pre-trained model achieves 94.3% accuracy on the SST-2 test set after 3 epochs of fine-tuning, while the standard pre-trained BERT\n\n\fTable 6: Our proposed SSE-SE, applied in the pre-training stage on our crawled IMDB data, improves the generalization ability of the pre-trained IMDB model and helps the BERT-Base model outperform the current SOTA results on the IMDB sentiment task after fine-tuning.\n\nModel | IMDB Test Set: AUC, Accuracy, F1 Score\nULMFiT [9] | -, 0.9540, -\nGoogle Pre-trained Model + Fine-tuning | 0.9415, 0.9415, 0.9419\nPre-training + Fine-tuning | 0.9518, 0.9518, 0.9523\n(SSE-SE + Pre-training) + Fine-tuning | 0.9542, 0.9542, 0.9545\n\nTable 7: SSE-SE pre-trained BERT-Base models on IMDB datasets turn out to work better on the new, unseen SST-2 task as well.\n\nModel | SST-2 Dev Set: AUC, Accuracy, F1 Score | SST-2 Test Set: Accuracy (%)\nGoogle Pre-trained + Fine-tuning | 0.9230, 0.9232, 0.9253 | 93.6\nPre-training + Fine-tuning | 0.9265, 0.9266, 0.9281 | 93.8\n(SSE-SE + Pre-training) + Fine-tuning | 0.9276, 0.9278, 0.9295 | 94.3\n(SSE-SE + Pre-training) + (SSE-SE + Fine-tuning) | 0.9323, 0.9323, 0.9336 | 94.5\n\nmodel only reaches 93.8 after fine-tuning. Furthermore, we show that SSE-SE with an SSE probability of 0.01 can also improve dev and test accuracy in the fine-tuning stage. If we use SSE-SE for both the pre-training and fine-tuning stages of the BERT-Base model, we can achieve 94.5% accuracy on the SST-2 test set, approaching the 94.9% accuracy of the BERT-Large model. 
We are optimistic that SSE-SE can be applied to the BERT large model as well in the future.

4.5 Speed and Convergence Comparisons

In Figure 4, it is clear that our one-hidden-layer neural networks with SSE-SE achieve much better generalization than their respective standalone versions. One can also easily see that the SSE versions of the algorithms converge much faster at the same learning rate.

5 Conclusion

We have proposed Stochastic Shared Embeddings, a data-driven approach to regularization that stands in contrast to brute-force regularization such as Laplacian and ridge regularization. Our theory is a first step towards explaining the regularization effect of SSE, in particular by 'smoothing' the Rademacher complexity. Our extensive experiments demonstrate that SSE can be fruitfully integrated into existing deep learning applications.

Acknowledgement. Hsieh acknowledges the support of NSF IIS-1719097, an Intel faculty award, Google Cloud, and Nvidia.

Figure 4: Comparison of the training speed of simple neural networks with one hidden layer, i.e. MF and BPR, with and without SSE-SE.

References

[1] James Bennett, Stan Lanning, et al. The netflix prize. In Proceedings of KDD Cup and Workshop, volume 2007, page 35. New York, NY, USA, 2007.

[2] Deng Cai, Xiaofei He, Jiawei Han, and Thomas S Huang. Graph regularized nonnegative matrix factorization for data representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(8):1548–1560, 2011.

[3] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.

[4] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization.
Journal of Machine Learning Research, 12(Jul):2121–2159, 2011.

[5] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep learning, volume 1. MIT Press, Cambridge, 2016.

[6] Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, and Tat-Seng Chua. Neural collaborative filtering. In Proceedings of the 26th International Conference on World Wide Web, pages 173–182. International World Wide Web Conferences Steering Committee, 2017.

[7] Geoffrey E Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012.

[8] Arthur E Hoerl and Robert W Kennard. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(1):55–67, 1970.

[9] Jeremy Howard and Sebastian Ruder. Universal language model fine-tuning for text classification. arXiv preprint arXiv:1801.06146, 2018.

[10] Yifan Hu, Yehuda Koren, and Chris Volinsky. Collaborative filtering for implicit feedback datasets. In ICDM, volume 8, pages 263–272. Citeseer, 2008.

[11] Wang-Cheng Kang and Julian McAuley. Self-attentive sequential recommendation. arXiv preprint arXiv:1808.09781, 2018.

[12] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[13] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.

[14] Jens Lehmann, Robert Isele, Max Jakob, Anja Jentzsch, Dimitris Kontokostas, Pablo N Mendes, Sebastian Hellmann, Mohamed Morsey, Patrick Van Kleef, Sören Auer, et al. DBpedia – a large-scale, multilingual knowledge base extracted from Wikipedia.
Semantic Web, 6(2):167–195, 2015.

[15] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119, 2013.

[16] George A Miller. WordNet: a lexical database for English. Communications of the ACM, 38(11):39–41, 1995.

[17] Andriy Mnih and Ruslan R Salakhutdinov. Probabilistic matrix factorization. In Advances in Neural Information Processing Systems, pages 1257–1264, 2008.

[18] Steven J Nowlan and Geoffrey E Hinton. Simplifying neural networks by soft weight-sharing. Neural Computation, 4(4):473–493, 1992.

[19] Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. On the difficulty of training recurrent neural networks. In International Conference on Machine Learning, pages 1310–1318, 2013.

[20] Jeffrey Pennington, Richard Socher, and Christopher Manning. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, 2014.

[21] Matt Post. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 186–191, Brussels, Belgium, October 2018. Association for Computational Linguistics. URL https://www.aclweb.org/anthology/W18-6319.

[22] Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding by generative pre-training. URL https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf, 2018.

[23] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. URL https://openai.com/blog/better-language-models, 2019.

[24] Nikhil Rao, Hsiang-Fu Yu, Pradeep K Ravikumar, and Inderjit S Dhillon. Collaborative filtering with graph information: Consistency and scalable methods. In Advances in Neural Information Processing Systems, pages 2107–2115, 2015.

[25] Steffen Rendle, Christoph Freudenthaler, Zeno Gantner, and Lars Schmidt-Thieme. BPR: Bayesian personalized ranking from implicit feedback. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, pages 452–461. AUAI Press, 2009.

[26] Nathan Srebro, Jason Rennie, and Tommi S Jaakkola. Maximum-margin matrix factorization. In Advances in Neural Information Processing Systems, pages 1329–1336, 2005.

[27] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.

[28] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2818–2826, 2016.

[29] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B (Methodological), pages 267–288, 1996.

[30] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.

[31] Liwei Wu, Cho-Jui Hsieh, and James Sharpnack. Large-scale collaborative ranking in near-linear time. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 515–524. ACM, 2017.

[32] Liwei Wu, Cho-Jui Hsieh, and James Sharpnack.
SQL-Rank: A listwise approach to collaborative ranking. In Proceedings of Machine Learning Research (35th International Conference on Machine Learning), volume 80, 2018.