{"title": "Provable Gaussian Embedding with One Observation", "book": "Advances in Neural Information Processing Systems", "page_first": 6764, "page_last": 6774, "abstract": "The success of machine learning methods heavily relies on having an appropriate representation for data at hand. Traditionally, machine learning approaches relied on user-defined heuristics to extract features encoding structural information about data. However, recently there has been a surge in approaches that learn how to encode the data automatically in a low dimensional space. Exponential family embedding provides a probabilistic framework for learning low-dimensional representation for various types of high-dimensional data. Though successful in practice, theoretical underpinnings for exponential family embeddings have not been established. In this paper, we study the Gaussian embedding model and develop the first theoretical results for exponential family embedding models. First, we show that, under a mild condition, the embedding structure can be learned from one observation by leveraging the parameter sharing between different contexts even though the data are dependent with each other. Second, we study properties of two algorithms used for learning the embedding structure and establish convergence results for each of them. The first algorithm is based on a convex relaxation, while the other solved the non-convex formulation of the problem directly. Experiments demonstrate the effectiveness of our approach.", "full_text": "Provable Gaussian Embedding with One Observation\n\nMing Yu \u21e4\n\nZhuoran Yang \u2020\n\nTuo Zhao \u2021 Mladen Kolar \u00a7\n\nZhaoran Wang \u00b6\n\nAbstract\n\nThe success of machine learning methods heavily relies on having an appropriate\nrepresentation for data at hand. Traditionally, machine learning approaches relied\non user-de\ufb01ned heuristics to extract features encoding structural information about\ndata. 
However, recently there has been a surge in approaches that learn how to encode the data automatically in a low-dimensional space. Exponential family embedding provides a probabilistic framework for learning low-dimensional representations for various types of high-dimensional data [20]. Though successful in practice, theoretical underpinnings for exponential family embeddings have not been established. In this paper, we study the Gaussian embedding model and develop the first theoretical results for exponential family embedding models. First, we show that, under a mild condition, the embedding structure can be learned from one observation by leveraging the parameter sharing between different contexts, even though the data are dependent on each other. Second, we study properties of two algorithms used for learning the embedding structure and establish convergence results for each of them. The first algorithm is based on a convex relaxation, while the other solves the non-convex formulation of the problem directly. Experiments demonstrate the effectiveness of our approach.

1 Introduction

Exponential family embedding is a powerful technique for learning a low-dimensional representation of high-dimensional data [20]. The exponential family embedding framework comprises a known graph $G = (V, E)$ and a conditional exponential family. The graph $G$ has $m$ vertices, and with each vertex we observe a $p$-dimensional vector $x_j$, $j = 1, \ldots, m$, representing an observation for which we would like to learn a low-dimensional embedding. The exponential family distribution is used to model the conditional distribution of $x_j$ given the context $\{x_k : (k, j) \in E\}$ specified by the neighborhood of the node $j$ in the graph $G$. In order for learning of the embedding to be possible, one furthermore assumes how the parameters of the conditional distributions are shared across different nodes in the graph. 
The graph structure, the conditional exponential family, and the way parameters are shared across the nodes are modeling choices and are application specific. For example, in the context of word embeddings [1, 11], a word in a document corresponds to a node in a graph, with the corresponding vector $x_j$ being a one-hot vector (the indicator of this word); the context of the word $j$ is given by the surrounding words, and hence the neighbors of the node $j$ in the graph are the nodes corresponding to those words; and the conditional distribution of $x_j$ is a multivariate categorical distribution. As another example, arising in computational neuroscience, consider embedding activities of neurons. Here the graph representing the context encodes spatial proximity of neurons, and the Gaussian distribution is used to model the distribution of a neuron's activations given the activations of nearby neurons.

* Booth School of Business, University of Chicago, Chicago, IL. Email: ming93@uchicago.edu
† Department of Operations Research and Financial Engineering, Princeton University, Princeton, NJ.
‡ School of Industrial and Systems Engineering, Georgia Institute of Technology, Atlanta, GA.
§ Booth School of Business, University of Chicago, Chicago, IL.
¶ Department of Industrial Engineering and Management Sciences, Northwestern University, Evanston, IL.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

While exponential family embeddings have been successful in practice, theoretical underpinnings have been lacking. This paper is a step towards providing a rigorous understanding of exponential family embedding in the case of Gaussian embedding. 
We view the framework of exponential family embeddings through the lens of probabilistic graphical models [6], with the context graph specifying the conditional independencies between nodes and the conditional exponential family specifying the distribution locally. We make several contributions:

1) First, since the exponential family embedding specifies the distribution for each object conditionally on its context, there is no guarantee that there is a joint distribution that is consistent with all the conditional models. The probabilistic graphical models view allows us to provide conditions under which the conditional distributions define a valid joint distribution over all the nodes.

2) Second, the probabilistic graphical model view allows us to learn the embedding from one observation — we get to see only one vector $x_j$ for each node $j \in V$ — by exploiting the shared parameter representation between different nodes of the graph. One might then mistakenly think that we in fact have $m$ observations to learn the embedding. However, the difficulty lies in the fact that these observations are not independent, and the dependence depends intricately on the graph structure. Not every graph structure can be learned from one observation; here we provide sufficient conditions on the graph that allow us to learn the Gaussian embedding from one observation.

3) Finally, we develop two methods for learning the embedding. Our first algorithm is based on a convex relaxation, while the second algorithm directly solves a non-convex optimization problem. Both provably recover the underlying embedding, but in practice the non-convex approach may lead to a faster algorithm.

1.1 Related Work

Exponential family embedding. Exponential family embedding originates from word embedding, where words or phrases from the vocabulary are mapped to embedding vectors [1]. 
Many variants and extensions of word embedding have been developed since [12, 9, 31, 10]. [20] develop a probabilistic framework based on general exponential families that is suitable for a variety of high-dimensional distributions, including Gaussian, Poisson, and Bernoulli embedding. This generalizes the embedding idea to a wider range of applications and types of data, such as real-valued data, count data, and binary data [13, 18, 19]. In this paper, we contribute to the literature by developing theoretical results on Gaussian embedding, which complement the existing empirical results in the literature.

Graphical models. Exponential family embedding is naturally related to the literature on probabilistic graphical models, as the context structure forms a conditional dependence graph among the nodes. The two models are related, but the goals and estimation procedures are very different. Much of the research effort on graphical models focuses on learning the graph structure and hence the conditional dependency among nodes [8, 25, 29, 22]. In contrast, in this paper we focus on the setting where the graph structure is known and learn the embedding.

Low-rank matrix estimation. As we will see in Section 2, the conditional distribution in exponential family embedding takes the form $f(VV^\top)$ for the embedding parameter $V \in \mathbb{R}^{p \times r}$, which embeds the $p$-dimensional vector $x_j$ into an $r$-dimensional space. Hence this is a low-rank matrix estimation problem. Traditional methods focused on convex relaxation with nuclear norm regularization [14, 3, 17]. However, when the dimensionality is large, solving the convex relaxation is usually time-consuming. Recently there has been a lot of research on non-convex optimization formulations, from both theoretical and empirical perspectives [24, 26, 21, 27, 30]. Non-convex optimization has been found to be computationally more tractable, while giving comparable or better results. 
In our paper we consider both convex relaxation and non-convex optimization approaches.

[Figure 1: Some commonly used context structures: (a) chain structure, (b) ω-nearest neighbor structure, (c) lattice structure.]

2 Background

In this section, we briefly review the exponential family embedding framework. Let $X = (x_1, \ldots, x_m) \in \mathbb{R}^{p \times m}$ be the data matrix, where a column $x_j \in \mathbb{R}^p$ corresponds to a vector observed at node $j$. For example, in word embedding, $X$ represents a document consisting of $m$ words, $x_j$ is a one-hot vector representation of the $j$-th word, and $p$ is the size of the dictionary. For each $j$, let $c_j \subseteq \{1, \ldots, m\}$ be the context of $j$, which is assumed to be known and is given by the graph $G$ — in particular, $c_j = \{k \in V : (j, k) \in E\}$. Some commonly used context structures are shown in Figure 1. Figure 1(a) shows the chain structure. Note that this is different from a vector autoregressive model, where the chain structure is directed. Figure 1(b) shows the $\omega$-nearest neighbor structure, where each node is connected with its preceding and subsequent $\omega$ nodes. This structure is common in word embedding, where the preceding and subsequent $\omega$ words are the context. When $\omega = 1$ it boils down to the chain structure. Finally, Figure 1(c) shows the lattice structure that is widely used in the Ising model.

The exponential family embedding model assumes that $x_j$ conditioned on $x_{c_j}$ follows an exponential family distribution

$$x_j \mid x_{c_j} \sim \mathrm{ExponentialFamily}\big(\eta_j(x_{c_j}),\ t(x_j)\big), \qquad (2.1)$$

where $t(x_j)$ is the sufficient statistic and $\eta_j(x_{c_j}) \in \mathbb{R}^p$ is the natural parameter. For the linear embedding, we assume that $\eta$ in (2.1) takes the form

$$\eta_j(x_{c_j}) = f_j\Big(V_j \sum_{k \in c_j} V_k^\top x_k\Big), \qquad (2.2)$$

where the link function $f_j$ is applied elementwise and $V_j \in \mathbb{R}^{p \times r}$. The low-dimensional matrix $V_k$ embeds the vector $x_k \in \mathbb{R}^p$ into a lower $r$-dimensional space, with $V_k^\top x_k \in \mathbb{R}^r$ being the embedding of $x_k$. 
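The linear form (2.2) is simple to compute once the context sets are fixed. The following minimal numpy sketch does this for a chain graph, under the shared-parameter simplification $V_j = V$ discussed below and an identity link; the function names are ours, for illustration only.

```python
import numpy as np

def chain_contexts(m):
    """Contexts for the chain structure: c_j = {j - 1, j + 1} (0-indexed)."""
    return [[k for k in (j - 1, j + 1) if 0 <= k < m] for j in range(m)]

def natural_parameters(X, V, contexts, link=lambda z: z):
    """eta_j = f(V sum_{k in c_j} V^T x_k) for every column x_j of X (eq. 2.2),
    with the shared embedding matrix V_j = V for all nodes j."""
    p, m = X.shape
    eta = np.zeros((p, m))
    for j, c in enumerate(contexts):
        s = V.T @ X[:, c].sum(axis=1)   # r-dimensional sum of context embeddings
        eta[:, j] = link(V @ s)         # map back to p dimensions
    return eta

rng = np.random.default_rng(0)
p, m, r = 6, 10, 2
X = rng.standard_normal((p, m))
V = rng.standard_normal((p, r)) / np.sqrt(p)
eta = natural_parameters(X, V, chain_contexts(m))
```

With the identity link, stacking the context sums over nodes gives the matrix identity $\eta = VV^\top X A$, where $A$ is the adjacency matrix of the graph, which is the form that reappears in the analysis later.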
For example, in word embedding each row of $V_k$ is the embedding rule for a word. Since $x_k$ is a one-hot vector, we see that $V_k^\top x_k$ selects the row of $V_k$ that corresponds to the word at node $k$. A common simplifying assumption is that the embedding structure is shared across the nodes, by assuming that $V_j = V$ for all $j \in V$. In word embedding, this makes the embedding rule independent of the position of the word in the document. We summarize some commonly seen exponential family distributions and show how they define an exponential family embedding model.

Gaussian embedding. In Gaussian embedding it is assumed that the conditional distribution is

$$x_j \mid x_{c_j} \sim N\Big(V \sum_{k \in c_j} V^\top x_k,\ \Sigma_j\Big) = N\Big(M \sum_{k \in c_j} x_k,\ \Sigma_j\Big), \qquad (2.3)$$

where $M = VV^\top$ and $\Sigma_j$ is the conditional covariance matrix for each node $j$. We will prove in Section 3 that, under mild conditions, these conditional distributions define a valid joint Gaussian distribution. The link function for Gaussian embedding is the identity function, but one may choose the link function to be $f(\cdot) = \log(\cdot)$ in order to constrain the parameters to be non-negative. Gaussian embedding is commonly applied to real-valued observations.

Word embedding (cbow [11]). In the word embedding setting, $x_j$ is an indicator of the $j$-th word in a document and the dimension of $x_j$ is equal to the size of the vocabulary. The context of the $j$-th word, $c_j$, is the window of size $\omega$ around $x_j$, that is, $c_j = \{k \in \{1, \ldots, m\} : k \neq j, |k - j| \le \omega\}$. Cbow is a special case of exponential family embedding with

$$p(x_j \mid x_{c_j}) = \frac{\exp\big(x_j^\top V \sum_{k \in c_j} V^\top x_k\big)}{\sum_{x_j} \exp\big(x_j^\top V \sum_{k \in c_j} V^\top x_k\big)}. \qquad (2.4)$$

Poisson embedding. In Poisson embedding, the sufficient statistic is the identity and the natural parameter is the logarithm of the rate. The conditional distribution is given as

$$x_j \mid x_{c_j} \sim \mathrm{Poisson}\Big(\exp\Big(V \sum_{k \in c_j} V^\top x_k\Big)\Big). \qquad (2.5)$$

Poisson embedding can be applied to count data.

3 Gaussian Embedding Model

In this paper, we consider the case of Gaussian embedding, where the conditional distribution of $x_j$ given its context $x_{c_j}$ is given in (2.3) with the conditional covariance matrix $\Sigma_j$ unknown. The parameter matrix $M = VV^\top$ with $V \in \mathbb{R}^{p \times r}$ will be learned from the data matrix $X \in \mathbb{R}^{p \times m}$, and $V^\top x_k$ is the embedding of $x_k$.

Let $X_{\mathrm{col}} = [x_1^\top, x_2^\top, \ldots, x_m^\top]^\top \in \mathbb{R}^{pm \times 1}$ be the column vector obtained by stacking the columns of $X$, and let $[x_j]_\ell$ denote the $\ell$-th coordinate of $x_j$. We first restate a definition on compatibility from [23].

Definition 3.1. A non-negative function $g$ is capable of generating a conditional density function $p(y|x)$ if

$$p(y|x) = \frac{g(y, x)}{\int g(y, x)\, dy}. \qquad (3.1)$$

Two conditional densities are said to be compatible if there exists a function $g$ that can generate both conditional densities. When $g$ is a density, the conditional densities are called strongly compatible.

Since $M$ is symmetric, according to Proposition 1 in [4], we have the following proposition.

Proposition 3.2. The conditional distributions (2.3) are compatible, and the joint distribution of $X_{\mathrm{col}}$ is of the form $p(x_{\mathrm{col}}) \propto \exp\big(-\tfrac12 x_{\mathrm{col}}^\top \Sigma_{\mathrm{col}}^{-1} x_{\mathrm{col}}\big)$ for some $\Sigma_{\mathrm{col}} \in \mathbb{R}^{pm \times pm}$. When the choice of $M$ and $\Sigma_j$ is such that $\Sigma_{\mathrm{col}} \succ 0$, the conditional distributions are strongly compatible and we have $X_{\mathrm{col}} \sim N(0, \Sigma_{\mathrm{col}})$.

The explicit expression for $\Sigma_{\mathrm{col}}$ can be derived from (2.3); however, in general it is quite complicated. The following example provides an explicit formula in the case where $\Sigma_j = I$.

Example 3.3. Suppose that $\Sigma_j = I$ for all $j = 1, \ldots, m$. 
Let $A \in \mathbb{R}^{m \times m}$ denote the adjacency matrix of the graph $G$, with $a_{j,k} = 1$ when there is an edge between nodes $j$ and $k$, and $0$ otherwise. Denoting $\ell^c = \{1, \ldots, \ell-1, \ell+1, \ldots, p\}$, the conditional distribution of $[x_j]_\ell$ is given by

$$[x_j]_\ell \mid x_{c_j}, [x_j]_{\ell^c} \sim N\Big(\Big[M \sum_{k \in c_j} x_k\Big]_\ell,\ 1\Big).$$

Moreover, there exists a joint distribution $X_{\mathrm{col}} \sim N(0, \Sigma_{\mathrm{col}})$, where $\Sigma_{\mathrm{col}} \in \mathbb{R}^{pm \times pm}$ satisfies

$$\Sigma_{\mathrm{col}}^{-1} = I - A \otimes M, \qquad (3.2)$$

and $A \otimes M$ denotes the Kronecker product between $A$ and $M$. Clearly, we need $\Sigma_{\mathrm{col}} \succ 0$, which imposes implicit restrictions on $A$ and $M$. To ensure that $\Sigma_{\mathrm{col}}$ is positive definite, we need to ensure that all the eigenvalues of $A \otimes M$ are smaller than $1$. One sufficient condition for this is $\|A\|_2 \cdot \|M\|_2 < 1$. For example, consider a chain graph with

$$A = \begin{bmatrix} 0 & 1 & & \\ 1 & 0 & \ddots & \\ & \ddots & \ddots & 1 \\ & & 1 & 0 \end{bmatrix} \in \mathbb{R}^{m \times m} \quad \text{and} \quad \Sigma_{\mathrm{col}}^{-1} = \begin{bmatrix} I & -M & & \\ -M & I & \ddots & \\ & \ddots & \ddots & -M \\ & & -M & I \end{bmatrix} \in \mathbb{R}^{pm \times pm}. \qquad (3.3)$$

Then it suffices to have $\|M\|_2 < 1/2$. Similarly, for the $\omega$-nearest neighbor structure it suffices to have $\|M\|_2 < 1/(2\omega)$, and for the lattice structure to have $\|M\|_2 < 1/4$.

3.1 Estimation Procedures

Since $\Sigma_j$ is unknown, we propose to minimize the following loss function based on the conditional log-likelihood:

$$L(M) = m^{-1} \sum_{j=1}^m L_j(M), \qquad (3.4)$$

where $L_j(M) := \frac12 \big\|x_j - M \sum_{k \in c_j} x_k\big\|_2^2$. Let $M^* = V^* V^{*\top}$ denote the true rank-$r$ matrix with $V^* \in \mathbb{R}^{p \times r}$. Note that $V^*$ is not unique, but $M^*$ is. Observe that minimizing (3.4) leads to a consistent estimator, since

$$\mathbb{E}\big[\nabla L_j(M^*)\big] = -\,\mathbb{E}\Big[\Big(x_j - M^* \sum_{k \in c_j} x_k\Big) \sum_{k \in c_j} x_k^\top\Big] = -\,\mathbb{E}_{x_{c_j}}\Big[\mathbb{E}_{x_j}\Big[x_j - M^* \sum_{k \in c_j} x_k \,\Big|\, x_{c_j}\Big] \sum_{k \in c_j} x_k^\top\Big] = 0.$$

In order to find a low-rank solution $\widehat M$ that approximates $M^*$, we consider the following two procedures.

Convex relaxation. We solve the problem

$$\min_{M \in \mathbb{R}^{p \times p},\ M^\top = M,\ M \succeq 0}\ L(M) + \lambda \|M\|_*, \qquad (3.5)$$

where $\|\cdot\|_*$ is the nuclear norm of a matrix and $\lambda$ is the regularization parameter to be specified in the next section. The problem (3.5) is convex and hence can be solved by the proximal gradient descent method [15] with any initialization point.

Non-convex optimization. Although solving the convex relaxation (3.5) is guaranteed to find the global minimum, in practice it may be slow. In our problem, since $M$ is low-rank and positive semidefinite, we can always write $M = VV^\top$ for some $V \in \mathbb{R}^{p \times r}$ and solve the following non-convex problem:

$$\min_{V \in \mathbb{R}^{p \times r}}\ L(VV^\top). \qquad (3.6)$$

With an appropriate initialization $V^{(0)}$, in each iteration we update $V$ by gradient descent,

$$V^{(t+1)} = V^{(t)} - \eta \cdot \nabla_V L(VV^\top)\big|_{V = V^{(t)}},$$

where $\eta$ is the step size. The choice of the initialization $V^{(0)}$ and the step size $\eta$ will be specified in detail in the next section. The unknown rank $r$ can be estimated as in [2].

4 Theoretical Results

We establish convergence rates for the two estimation procedures.

4.1 Convex Relaxation

In order to show that a minimizer of (3.5) gives a good estimator for $M$, we first show that the objective function $L(\cdot)$ is strongly convex, under the assumption that the data are distributed according to (2.3) with the true parameter $M^* = V^* V^{*\top}$, $V^* \in \mathbb{R}^{p \times r}$. Let

$$\delta L(\Delta) = L(M^* + \Delta) - L(M^*) - \langle \nabla L(M^*), \Delta \rangle,$$

where $\langle A, B \rangle = \mathrm{tr}(A^\top B)$ and $\Delta$ is a symmetric matrix. Let $\Delta_i$ denote the $i$-th column of $\Delta$ and let 
$\Delta_{\mathrm{col}} = [\Delta_1^\top, \ldots, \Delta_p^\top]^\top \in \mathbb{R}^{p^2 \times 1}$ be the vector obtained by stacking the columns of $\Delta$. Then a simple calculation shows that

$$\delta L(\Delta) = \sum_{i=1}^p \Delta_i^\top \Big[\frac{1}{m} \sum_{j=1}^m \Big(\sum_{k \in c_j} x_k\Big) \Big(\sum_{k \in c_j} x_k\Big)^\top\Big] \Delta_i,$$

which is a quadratic form in each $\Delta_i$ with the same Hessian matrix $H$. Let

$$\widetilde X = \Big[\sum_{k \in c_1} x_k,\ \sum_{k \in c_2} x_k,\ \ldots,\ \sum_{k \in c_m} x_k\Big] = X \cdot A \in \mathbb{R}^{p \times m},$$

where $A$ is the adjacency matrix of the graph $G$. Then the Hessian matrix is given by

$$H = \frac{1}{m} \widetilde X \widetilde X^\top = \frac{1}{m} X A A^\top X^\top \in \mathbb{R}^{p \times p}, \qquad (4.1)$$

and therefore we can succinctly write $\delta L(\Delta) = \Delta_{\mathrm{col}}^\top \cdot H_{\mathrm{col}} \cdot \Delta_{\mathrm{col}}$, where the total Hessian matrix $H_{\mathrm{col}} = \mathrm{diag}(H, H, \ldots, H) \in \mathbb{R}^{p^2 \times p^2}$ is a block diagonal matrix.

To show that $L(\cdot)$ is strongly convex, it suffices to lower bound the minimum eigenvalue of $H$, defined in (4.1). If the columns of $\widetilde X$ were independent, the minimum eigenvalue of $H$ would be bounded away from zero with overwhelming probability for a large enough $m$ [16]. However, in our setting the columns of $\widetilde X$ are dependent, and we need to prove this lower bound using different tools. As the distribution of $X$ depends on the unknown conditional covariance matrices $\Sigma_j$, $j = 1, \ldots, m$, in a complicated way, we impose the following assumption on the expected version of $H$.

Assumption EC. The minimum and maximum eigenvalues of $\mathbb{E} H$ are bounded from below and from above: $0 < c_{\min} \le \lambda_{\min}(\mathbb{E} H) \le \lambda_{\max}(\mathbb{E} H) \le c_{\max} < \infty$.

Assumption (EC) puts restrictions on the conditional covariance matrices $\Sigma_j$ and can be verified in specific instances of the problem. In the context of Example 3.3, where $\Sigma_j = I$, $j = 1, \ldots, m$, and the graph is a chain, we have the adjacency matrix $A$ and the covariance matrix $\Sigma_{\mathrm{col}}$ given in (3.3). Then simple linear algebra [5] gives us that

$$\mathbb{E} H = m^{-1}\, \mathbb{E}\big[X A A^\top X^\top\big] = 2I + cM^2 + o(M^2),$$

which guarantees that $\lambda_{\min}(\mathbb{E} H) \ge 1$ and $\lambda_{\max}(\mathbb{E} H) \le c + 3$ for large enough $m$.

The following assumption requires that the spectral norms of $A$ and $\Sigma_{\mathrm{col}}$ do not scale with $m$.

Assumption SC. There exists a constant $\rho_0$ such that $\max\big\{\|A\|_2,\ \|\Sigma_{\mathrm{col}}^{1/2}\|_2\big\} \le \rho_0$.

Assumption (SC) gives a sufficient condition on the graph structure; it requires that the dependency among nodes is weak. In fact, it can be relaxed to $\rho_0 = o(m^{1/4})$, which allows the spectral norm to scale with $m$ slowly. In that case, the minimum and maximum eigenvalues in Assumption (EC) also scale with $m$, and this results in a much larger sample complexity in $m$. However, if $\rho_0$ grows even faster, then there is no way to guarantee a reasonable estimation. We see that $\rho_0 \asymp m^{1/4}$ is the critical condition, and we have a phase transition at this boundary.

It is useful to point out that these assumptions are not restrictive. For example, under the simplification that $\Sigma_j = I$, we have $\|\Sigma_{\mathrm{col}}\|_2 = 1/(1 - \|A\|_2 \cdot \|M\|_2)$. The condition $\|A\|_2 \cdot \|M\|_2 < 1$ is satisfied naturally for a valid Gaussian embedding model. Therefore, in order to have $\|\Sigma_{\mathrm{col}}^{1/2}\|_2 \le \rho_0$, we only need $\|A\|_2 \cdot \|M\|_2 \le 1 - 1/\rho_0^2$, i.e., it must be bounded away from $1$ by a constant distance.

It is straightforward to verify that Assumption (SC) holds for the chain structure in Example 3.3. If the graph is fully connected, we have $\|A\|_2 = m - 1$, which violates the assumption.
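Assumption (SC) is easy to check numerically for a given context graph. The sketch below contrasts the chain graph, whose adjacency matrix has spectral norm $2\cos(\pi/(m+1)) < 2$ for every $m$, with the fully connected graph, whose spectral norm $m - 1$ grows with $m$; the helper name is ours.

```python
import numpy as np

def chain_adjacency(m):
    """Adjacency matrix of the chain graph on m nodes."""
    A = np.zeros((m, m))
    for j in range(m - 1):
        A[j, j + 1] = A[j + 1, j] = 1.0
    return A

m = 200
A_chain = chain_adjacency(m)
A_full = np.ones((m, m)) - np.eye(m)      # fully connected graph

norm_chain = np.linalg.norm(A_chain, 2)   # = 2 cos(pi / (m + 1)), bounded by 2
norm_full = np.linalg.norm(A_full, 2)     # = m - 1, grows linearly with m
```

Since the chain's spectral norm stays below the constant 2, Assumption (SC) holds for every $m$; the complete graph violates it for large $m$, matching the discussion above.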
In general, Assumption (SC) gives a sufficient condition on the graph structure so that the embedding is learnable. With these assumptions, the following lemma proves that the minimum and maximum eigenvalues of the sample Hessian matrix $H$ are also bounded from below and above with high probability.

Lemma 4.1. Suppose Assumptions (EC) and (SC) hold. Then for $m \ge c_0 p$ we have $\lambda_{\min}(H) \ge \frac12 c_{\min}$ and $\lambda_{\max}(H) \le 2 c_{\max}$ with probability at least $1 - c_1 \exp(-c_2 m)$, where $c_0, c_1, c_2$ are absolute constants. Therefore,

$$\kappa_\mu \cdot \|\Delta\|_F^2 \le \delta L(\Delta) \le \kappa_L \cdot \|\Delta\|_F^2, \qquad (4.2)$$

with $\kappa_\mu = \frac12 c_{\min}$ and $\kappa_L = 2 c_{\max}$, for any $\Delta \in \mathbb{R}^{p \times p}$.

Lemma 4.1 is the key technical result; it shows that although all the $x_j$ are dependent, the objective function $L(\cdot)$ is still strongly convex and smooth in $\Delta$. Since the loss function $L(\cdot)$ is strongly convex, an application of Theorem 1 in [14] gives the following result on the performance of the convex relaxation approach proposed in the previous section.

Theorem 4.2. Suppose Assumptions (SC) and (EC) are satisfied. The minimizer $\widehat M$ of (3.5) with

$$\lambda \ge \Big\|\frac{1}{m} \sum_{j=1}^m \Big(x_j - M^* \sum_{k \in c_j} x_k\Big) \sum_{k \in c_j} x_k^\top\Big\|_2$$

satisfies

$$\|\widehat M - M^*\|_F \le \frac{32 \sqrt{r}\, \lambda}{\kappa_\mu}.$$

The following lemma gives us a way to set the regularization parameter $\lambda$.

Lemma 4.3. Let $G = \frac{1}{m} \sum_{j=1}^m \Sigma_j$. Assume that the maximum eigenvalue of $G$ is bounded from above as $\lambda_{\max}(G) \le \eta_{\max}$ for some constant $\eta_{\max}$. 
Then there exist constants $c_0, c_1, c_2, c_3 > 0$ such that for $m \ge c_0 p$, we have

$$\mathbb{P}\Big[\Big\|\frac{1}{m} \sum_{j=1}^m \Big(x_j - M^* \sum_{k \in c_j} x_k\Big) \sum_{k \in c_j} x_k^\top\Big\|_2 \ge c_1 \sqrt{\frac{p}{m}}\Big] \le c_2 \exp(-c_3 m).$$

Combining the result of Lemma 4.3 with Theorem 4.2, we see that $\lambda$ should be chosen as $\lambda = O_P\big(\sqrt{p/m}\big)$, which leads to the error rate

$$\|\widehat M - M^*\|_F = O_P\Big(\frac{1}{\kappa_\mu} \sqrt{\frac{rp}{m}}\Big). \qquad (4.3)$$

4.2 Non-convex Optimization

Next, we consider the convergence rate for the non-convex method resulting from minimizing (3.6) in $V$. Since the factorization of $M^*$ is not unique, we measure the subspace distance between $V$ and $V^*$.

Subspace distance. Let $V^*$ be such that $V^* V^{*\top} = M^*$. Define the subspace distance between $V$ and $V^*$ as

$$d^2(V, V^*) = \min_{O \in \mathcal{O}(r)} \|V - V^* O\|_F^2, \qquad (4.4)$$

where $\mathcal{O}(r) = \{O : O \in \mathbb{R}^{r \times r},\ O O^\top = O^\top O = I\}$.

Next, we introduce the notion of the statistical error. Denote

$$\Omega = \big\{\Delta : \Delta \in \mathbb{R}^{p \times p},\ \Delta = \Delta^\top,\ \mathrm{rank}(\Delta) = 2r,\ \|\Delta\|_F = 1\big\}. \qquad (4.5)$$

The statistical error is defined as

$$e_{\mathrm{stat}} = \sup_{\Delta \in \Omega} \big\langle \nabla L(M^*), \Delta \big\rangle.$$

[Figure 2: Estimation accuracy for three context structures: (a) chain structure, (b) ω-nearest neighbor structure, (c) lattice structure. Each panel plots the estimation error of the convex relaxation and the non-convex method against the number of nodes.]

Intuitively, the statistical error quantifies how close the estimator can be to the true value. 
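The subspace distance (4.4) has a closed form: the minimization over $\mathcal{O}(r)$ is an orthogonal Procrustes problem, solved by the SVD of $V^{*\top} V$. A minimal numpy sketch (the function name is ours):

```python
import numpy as np

def subspace_distance_sq(V, V_star):
    """d^2(V, V*) = min_{O in O(r)} ||V - V* O||_F^2  (eq. 4.4).
    The minimizer is O = P Q^T, where V*^T V = P diag(s) Q^T is an SVD
    (the classical orthogonal Procrustes solution)."""
    P, _, Qt = np.linalg.svd(V_star.T @ V)
    O = P @ Qt
    return np.linalg.norm(V - V_star @ O, "fro") ** 2

rng = np.random.default_rng(1)
p, r = 20, 3
V_star = rng.standard_normal((p, r))
Q, _ = np.linalg.qr(rng.standard_normal((r, r)))   # a random element of O(r)
d_rotated = subspace_distance_sq(V_star @ Q, V_star)
```

Because $V^* Q$ and $V^*$ produce the same $M^* = V^* V^{*\top}$, `d_rotated` is zero up to floating point, which is exactly why (4.4), rather than $\|V - V^*\|_F$, is the right error measure for the factored problem.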
Specifically, if $V$ is within $c \cdot e_{\mathrm{stat}}$ distance from $V^*$, then it is already optimal. For any $\Delta \in \Omega$, we have the factorization $\Delta = U V^\top$, where $U, V \in \mathbb{R}^{p \times 2r}$ and $\|U\|_2 = \|V\|_F = 1$. We then have

$$\big\langle \nabla L(M^*), \Delta \big\rangle = \big\langle \nabla L(M^*) V, U \big\rangle \le \|\nabla L(M^*) V\|_F \cdot \|U\|_F \le \|\nabla L(M^*)\|_2\, \|V\|_F\, \|U\|_F \le \sqrt{2r}\, \lambda, \qquad (4.6)$$

where the last inequality follows from Lemma 4.3. In particular, we see that both convex relaxation and non-convex optimization give the same rate.

Initialization. In order to prove a linear rate of convergence for the procedure, we need to initialize it properly. Since the loss function $L(M)$ is quadratic in $M$, we can ignore all the constraints on $M$ and get a closed-form solution:

$$M^{(0)} = \Big[\sum_{j=1}^m x_j \Big(\sum_{k \in c_j} x_k\Big)^\top\Big] \cdot \Big[\sum_{j=1}^m \Big(\sum_{k \in c_j} x_k\Big) \Big(\sum_{k \in c_j} x_k\Big)^\top\Big]^{-1}. \qquad (4.7)$$

We then apply a rank-$r$ eigenvalue decomposition to $\widetilde M^{(0)} = \frac12 \big(M^{(0)} + M^{(0)\top}\big)$ and obtain $[\widetilde V, \widetilde S, \widetilde V] = \text{rank-}r\ \text{svd of}\ \widetilde M^{(0)}$. Then $V^{(0)} = \widetilde V \widetilde S^{1/2}$ is the initial point for the gradient descent. The following lemma quantifies the accuracy of this initialization.

Lemma 4.4. The initialization $M^{(0)}$ and $V^{(0)}$ satisfy

$$\|M^{(0)} - M^*\|_F \le \frac{2 \sqrt{p}\, \lambda}{\kappa_\mu} \quad \text{and} \quad d^2\big(V^{(0)}, V^*\big) \le \frac{20\, p\, \lambda^2}{\kappa_\mu^2 \cdot \sigma_r(M^*)},$$

where $\sigma_r(M^*)$ is the minimum non-zero singular value of $M^*$.

With this initialization, we obtain the following main result for the non-convex optimization approach, which establishes a linear rate of convergence to a point that has the same statistical error rate as the convex relaxation approach studied in Theorem 4.2.

Theorem 4.5. Suppose Assumptions (EC) and (SC) are satisfied, and suppose the step size $\eta$ satisfies $\eta \le \big[32 \|M^{(0)}\|_2^2 \cdot (\kappa_\mu + \kappa_L)\big]^{-1}$. 
For large enough $m$, after $T$ iterations we have

$$d^2\big(V^{(T)}, V^*\big) \le \beta^T\, d^2\big(V^{(0)}, V^*\big) + \frac{C}{\kappa_\mu^2} \cdot e_{\mathrm{stat}}^2, \qquad (4.8)$$

for some constant $\beta < 1$ and a constant $C$.

5 Experiments

In this section, we evaluate our methods through experiments. We first verify that although $\Sigma_j$ is unknown, minimizing (3.4) still leads to a consistent estimator. We compare the estimation accuracy with known and unknown covariance matrices $\Sigma_j$. We set $\Sigma_j = \sigma_j \cdot \mathrm{Toeplitz}(\rho_j)$, where $\mathrm{Toeplitz}(\rho_j)$ denotes the Toeplitz matrix with parameter $\rho_j$. We set $\rho_j \sim U[0, 0.3]$ and $\sigma_j \sim U[0.4, 1.6]$ to make them non-isotropic. The estimation accuracies with known and unknown $\Sigma_j$ are given in Table 1. We can see that although knowing $\Sigma_j$ gives slightly better accuracy, the difference is tiny. Therefore, even if the covariance matrices are not isotropic, ignoring them still gives a consistent estimator.

[Figure 3: Testing loss for three context structures: (a) chain structure, (b) ω-nearest neighbor structure, (c) lattice structure. Each panel plots the test-set loss of the convex relaxation and the non-convex method against the number of nodes.]

Table 1: Comparison of estimation accuracy with known and unknown covariance matrix

            m = 1000   m = 2500   m = 5000   m = 8000   m = 15000
  unknown     0.8184     0.4432     0.3210     0.2472      0.1723
  known       0.7142     0.3990     0.2908     0.2288      0.1649

We then consider the three kinds of graph structures given in Figure 1: the chain structure, the ω-nearest neighbor structure, and the lattice structure. 
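Under the $\Sigma_j = I$ simplification of Example 3.3, one can also sample the joint distribution exactly through the precision matrix (3.2) instead of Gibbs sampling, and then fit $V$ by gradient descent on the non-convex objective (3.6). The following self-contained numpy sketch does both on a small chain graph; the sizes, step size, iteration count, and random initialization are illustrative choices of ours, not the settings used in the paper's experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
m, p, r = 200, 10, 2

# Chain-graph adjacency (Example 3.3) and a true V* scaled so ||M*||_2 = 0.4 < 1/2.
A = np.zeros((m, m))
for j in range(m - 1):
    A[j, j + 1] = A[j + 1, j] = 1.0
V_star = rng.standard_normal((p, r))
V_star *= np.sqrt(0.4 / np.linalg.norm(V_star @ V_star.T, 2))
M_star = V_star @ V_star.T

# Exact sampling: X_col ~ N(0, Sigma_col) with Sigma_col^{-1} = I - kron(A, M*) (eq. 3.2).
prec = np.eye(p * m) - np.kron(A, M_star)
L_chol = np.linalg.cholesky(prec)                    # prec = L L^T
x_col = np.linalg.solve(L_chol.T, rng.standard_normal(p * m))
X = x_col.reshape(m, p).T                            # columns are x_1, ..., x_m

# Gradient descent on the factored objective (3.6).
S = X @ A                                            # s_j = sum_{k in c_j} x_k

def loss_and_grad(V):
    R = V @ (V.T @ S) - X                            # residuals M s_j - x_j
    G_M = R @ S.T / m                                # gradient with respect to M
    return 0.5 * np.sum(R ** 2) / m, (G_M + G_M.T) @ V

V = 0.1 * rng.standard_normal((p, r))                # crude random initialization
loss0, _ = loss_and_grad(V)
for _ in range(1000):
    _, g = loss_and_grad(V)
    V -= 0.01 * g
loss1, _ = loss_and_grad(V)
rel_err = np.linalg.norm(V @ V.T - M_star, "fro") / np.linalg.norm(M_star, "fro")
```

The paper's experiments use Gibbs sampling and the spectral initialization (4.7); here exact sampling and a random start keep the sketch short, at the cost of possibly needing more iterations.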
We generate the data according to the conditional distribution (2.3) using Gibbs sampling. We set $p = 100$, $r = 5$, and vary the number of nodes $m$. For each $j$, we set $\Sigma_j = \Sigma$ to be a Toeplitz matrix with $\Sigma_{i\ell} = \rho^{|i - \ell|}$, where $\rho = 0.3$. We generate independent train, validation, and test sets. For the convex relaxation, the regularization parameter $\lambda$ is selected using the validation set. We consider two metrics: the estimation accuracy $\|\widehat M - M^*\|_F / \|M^*\|_F$, and the loss $L(\widehat M)$ on the test set.

The simulation results for estimation accuracy for the three graph structures are shown in Figure 2, and the results for the loss on the test sets are shown in Figure 3. Each result is based on 20 replicates. For the estimation accuracy, we see that when the number of nodes is small, neither method gives an accurate estimate; for reasonably large $m$, the non-convex method gives better estimation accuracy, since it does not introduce bias; for large enough $m$, both methods give accurate and similar estimates. For the loss on the test sets, we see that, in general, both methods give smaller loss as $m$ increases, with the non-convex method marginally better. This demonstrates the effectiveness of our methods.

6 Conclusion

In this paper, we focus on Gaussian embedding and develop the first theoretical results for exponential family embedding models. We show that for various kinds of context structures we are able to learn the embedding structure with only one observation. Although all the data we observe are dependent, we show that the objective function is still well-behaved, and therefore we can learn the embedding structure reasonably well.

It is useful to point out that the theoretical framework we propose applies to general exponential family embedding models. As long as similar conditions are satisfied, the framework and the theoretical results hold for any general exponential family embedding model as well. 
However, proving these conditions is quite challenging from the probability perspective. Nevertheless, our framework still holds, and all that is needed are more involved probabilistic tools. Extending the results to other embedding models, for example the Ising model, is work in progress.

References

[1] Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. A neural probabilistic language model. Journal of Machine Learning Research, 3(Feb):1137–1155, 2003.

[2] Florentina Bunea, Yiyuan She, and Marten H Wegkamp. Optimal selection of reduced rank estimators of high-dimensional matrices. The Annals of Statistics, pages 1282–1309, 2011.

[3] Emmanuel J Candès and Benjamin Recht. Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9(6):717, 2009.

[4] Shizhe Chen, Daniela M Witten, and Ali Shojaie. Selection and estimation for mixed graphical models. Biometrika, 102(1):47–64, 2014.

[5] GY Hu and Robert F O'Connell. Analytical inversion of symmetric tridiagonal matrices. Journal of Physics A: Mathematical and General, 29(7):1511, 1996.

[6] Steffen L Lauritzen. Graphical Models, volume 17. Clarendon Press, 1996.

[7] Michel Ledoux and Michel Talagrand. Probability in Banach Spaces: Isoperimetry and Processes. Springer Science & Business Media, 2013.

[8] Jason D Lee and Trevor J Hastie. Learning the structure of mixed graphical models. Journal of Computational and Graphical Statistics, 24(1):230–253, 2015.

[9] Omer Levy and Yoav Goldberg. Neural word embedding as implicit matrix factorization. In Advances in Neural Information Processing Systems, pages 2177–2185, 2014.

[10] Omer Levy, Yoav Goldberg, and Ido Dagan. Improving distributional similarity with lessons learned from word embeddings. 
Transactions of the Association for Computational Linguistics, 3:211–225, 2015.

[11] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.

[12] Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. Linguistic regularities in continuous space word representations. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 746–751, 2013.

[13] Tanmoy Mukherjee and Timothy Hospedales. Gaussian visual-linguistic embedding for zero-shot recognition. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 912–918, 2016.

[14] Sahand Negahban and Martin J Wainwright. Estimation of (near) low-rank matrices with noise and high-dimensional scaling. The Annals of Statistics, pages 1069–1097, 2011.

[15] Neal Parikh, Stephen Boyd, et al. Proximal algorithms. Foundations and Trends in Optimization, 1(3):127–239, 2014.

[16] Garvesh Raskutti, Martin J Wainwright, and Bin Yu. Restricted eigenvalue properties for correlated Gaussian designs. Journal of Machine Learning Research, 11(Aug):2241–2259, 2010.

[17] Benjamin Recht, Maryam Fazel, and Pablo A Parrilo. Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Review, 52(3):471–501, 2010.

[18] Maja Rudolph and David Blei. Dynamic Bernoulli embeddings for language evolution. arXiv preprint arXiv:1703.08052, 2017.

[19] Maja Rudolph, Francisco Ruiz, Susan Athey, and David Blei. Structured embedding models for grouped data. In Advances in Neural Information Processing Systems, pages 250–260, 2017.

[20] Maja Rudolph, Francisco Ruiz, Stephan Mandt, and David Blei.
Exponential family embeddings. In Advances in Neural Information Processing Systems, pages 478–486, 2016.

[21] Ruoyu Sun and Zhi-Quan Luo. Guaranteed matrix completion via non-convex factorization. IEEE Transactions on Information Theory, 62(11):6535–6579, 2016.

[22] Jialei Wang and Mladen Kolar. Inference for high-dimensional exponential family graphical models. In Artificial Intelligence and Statistics, pages 1042–1050, 2016.

[23] Yuchung J Wang and Edward H Ip. Conditionally specified continuous distributions. Biometrika, 95(3):735–746, 2008.

[24] Zhaoran Wang, Huanran Lu, and Han Liu. Nonconvex statistical optimization: Minimax-optimal sparse PCA in polynomial time. arXiv preprint arXiv:1408.5352, 2014.

[25] Eunho Yang, Pradeep Ravikumar, Genevera I Allen, and Zhandong Liu. Graphical models via univariate exponential family distributions. Journal of Machine Learning Research, 16(1):3813–3847, 2015.

[26] Ming Yu, Varun Gupta, and Mladen Kolar. An influence-receptivity model for topic based information cascades. In 2017 IEEE International Conference on Data Mining (ICDM), pages 1141–1146, 2017.

[27] Ming Yu, Varun Gupta, and Mladen Kolar. Learning influence-receptivity network structure with guarantee. arXiv preprint arXiv:1806.05730, 2018.

[28] Ming Yu, Varun Gupta, and Mladen Kolar. Recovery of simultaneous low rank and two-way sparse coefficient matrices, a nonconvex approach. arXiv preprint arXiv:1802.06967, 2018.

[29] Ming Yu, Mladen Kolar, and Varun Gupta. Statistical inference for pairwise graphical models using score matching. In Advances in Neural Information Processing Systems, pages 2829–2837, 2016.

[30] Tuo Zhao, Zhaoran Wang, and Han Liu. Nonconvex low rank matrix factorization via inexact first order oracle.
In Advances in Neural Information Processing Systems, 2015.

[31] Will Y Zou, Richard Socher, Daniel Cer, and Christopher D Manning. Bilingual word embeddings for phrase-based machine translation. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1393–1398, 2013.