{"title": "Learning the Dependency Structure of Latent Factors", "book": "Advances in Neural Information Processing Systems", "page_first": 2366, "page_last": 2374, "abstract": "In this paper, we study latent factor models with dependency structure in the latent space. We propose a general learning framework which induces sparsity on the undirected graphical model imposed on the vector of latent factors. A novel latent factor model, SLFA, is then proposed as a matrix factorization problem with a special regularization term that encourages collaborative reconstruction. The main benefit (novelty) of the model is that we can simultaneously learn the lower-dimensional representation for data and model the pairwise relationships between latent factors explicitly. An on-line learning algorithm is devised to make the model feasible for large-scale learning problems. Experimental results on two synthetic and two real-world data sets demonstrate that the pairwise relationships and latent factors learned by our model provide a more structured way of exploring high-dimensional data, and the learned representations achieve state-of-the-art classification performance.", "full_text": "Learning the Dependency Structure of Latent Factors

Yunlong He* (Georgia Institute of Technology), heyunlong@gatech.edu
Koray Kavukcuoglu (NEC Labs America), koray@nec-labs.com
Yanjun Qi (NEC Labs America), yanjun@nec-labs.com
Haesun Park* (Georgia Institute of Technology), hpark@cc.gatech.edu

Abstract

In this paper, we study latent factor models with dependency structure in the latent space. We propose a general learning framework which induces sparsity on the undirected graphical model imposed on the vector of latent factors. A novel latent factor model, SLFA, is then proposed as a matrix factorization problem with a special regularization term that encourages collaborative reconstruction. 
The main benefit (novelty) of the model is that we can simultaneously learn the lower-dimensional representation for data and model the pairwise relationships between latent factors explicitly. An on-line learning algorithm is devised to make the model feasible for large-scale learning problems. Experimental results on two synthetic and two real-world data sets demonstrate that the pairwise relationships and latent factors learned by our model provide a more structured way of exploring high-dimensional data, and the learned representations achieve state-of-the-art classification performance.

1 Introduction

Data samples described in high-dimensional feature spaces are encountered in many important areas. To enable the efficient processing of large data collections, latent factor models (LFMs) have been proposed to find concise descriptions of the members of a data collection. A random vector x ∈ R^M is assumed to be generated by a linear combination of a set of basis vectors, i.e.,

x = Bs + ε = B_1 s_1 + B_2 s_2 + · · · + B_K s_K + ε,    (1)

where B = [B_1, . . . , B_K] stores the set of unknown basis vectors and ε describes noise. The i-th "factor" s_i (i ∈ {1, ..., K}) denotes the i-th variable in the vector s.

In this paper, we consider the problem of learning the hidden dependency structure of latent factors in complex data sets. Our goal includes two main aspects: (1) to learn the interpretable lower-dimensional representations hidden in a set of data samples, and (2) to simultaneously model the pairwise interactions of latent factors. It is difficult to achieve both aspects at the same time using existing models. The statistical structure captured by LFM methods such as Principal Component Analysis (PCA) is limited in interpretability, due to their anti-correlation assumption on the latent factors. 
For example, when a face image is represented as a linear super-position of PCA bases with uncorrelated coefficients learned by PCA, there exist complex cancellations between the basis images [14]. Methods that theoretically assume independence of components, like ICA [10] or sparse coding [15], fail to generate independent representations in practice. Notable results in [13, 17] have shown that the coefficients of linear features for natural images are never independent.

*The work of these authors was supported in part by the National Science Foundation grant CCF-0808863. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.

Instead of imposing this unrealistic assumption, more recent works [18, 25, 27] propose to allow correlated latent factors, which has been shown to be helpful in obtaining better performance on various tasks. However, the graphical structure of latent factors (i.e., conditional dependence/independence) is not considered in these works. In particular, a sparse structure of the latent factor network is often preferred, but has never been explicitly explored in the learning process [2, 8, 23]. For example, when mining the enormous collections of on-line news-text documents, a method discovering semantically meaningful latent topics and a concise graph connecting the topics will greatly assist intelligent browsing, organizing and accessing of these documents.

The main contribution of this paper is a general LFM method that models the pairwise relationships between latent factors by sparse graphical models. By introducing a generalized Tikhonov regularization, we enforce the interaction of latent factors to have an influence on learning latent factors and basis vectors. 
As a result, we learn meaningful latent factors and simultaneously obtain a graph where the nodes represent hidden groups and the edges represent their pairwise relationships. This graphical representation helps us analyze collections of complex data samples in a much more structured and organized way. The latent representations of data samples obtained from our model capture deeper signals hidden in the data, which provide useful features for discriminative tasks and in-depth analysis; e.g., our model achieves state-of-the-art performance on classifying cancer samples in our experiments.

2 Methods
2.1 Sparse Undirected Graphical Model of Latent Factors: A General Formulation
Following [4, 16], our framework considers data samples drawn from the exponential family of distributions, i.e.,

p(x|η) = h(x) exp(η^T T(x) − A(η)),    (2)

where the sufficient statistic T(x) ∈ R^M, η ∈ R^M represents the natural parameter for the model, and T(x), h(x) and A(η) are known functions defining a particular member of the exponential family. This family includes most of the common distributions, like normal, Dirichlet, multinomial, Poisson, and many others.

To learn the hidden factors for generating x, the natural parameter η is assumed to be represented by a linear combination of basis vectors, i.e.,

η = Bs,    (3)

where B = [B_1, . . . , B_K] is the basis matrix. To model the pairwise interaction between latent factors, we introduce a pairwise Markov Random Field (MRF) prior on the vector of factors s ∈ R^K:

p(s|μ, Θ) = (1/Z(μ, Θ)) exp(− Σ_{i=1}^K μ_i s_i − (1/2) Σ_{i=1}^K Σ_{j=1}^K θ_ij s_i s_j),    (4)

with parameter μ = [μ_i], symmetric Θ = [θ_ij], and partition function Z(μ, Θ), which normalizes the distribution. 
The classic Ising model and the Gaussian graphical model are two special cases of the above MRF. Let G = (V, E) denote a graph with K nodes, corresponding to the K latent factors {s_1, . . . , s_K}, and with edge set

E = {(i, j) ∈ V × V : θ_ij ≠ 0}.    (5)

Since θ_ij = 0 indicates that latent factor s_i and latent factor s_j are conditionally independent given the other latent factors, the graph G presents an illustrative view of the statistical dependencies between latent factors.

With such a hierarchical and flexible model, there would be a significant risk of over-fitting, especially when we consider all possible interactions between K latent factors. Therefore, regularization has to be introduced for a better generalization property of the model. As we will see in Section 3, regularization is also necessary to avoid an ill-posed optimization problem. The regularization technique we use is to introduce a sparsity-inducing prior for Θ:

p(Θ) ∝ exp(−(1/2) ρ ||Θ||_1),    (6)

where ρ is a positive hyper-parameter and ||Θ||_1 := Σ_i Σ_j |θ_ij|. 
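Once a sparse Θ has been estimated under this prior, the edge set of Eq. (5) can be read directly off its non-zero off-diagonal entries. A minimal numpy sketch (function name and tolerance are hypothetical; entries below the tolerance are treated as exact zeros):

```python
import numpy as np

def edge_set(theta, tol=1e-8):
    """Return E = {(i, j) : theta_ij != 0, i < j} for a symmetric
    interaction matrix, treating entries below `tol` as exact zeros."""
    K = theta.shape[0]
    return [(i, j) for i in range(K) for j in range(i + 1, K)
            if abs(theta[i, j]) > tol]

# Toy 3-factor example: factors 0 and 1 interact; factor 2 is
# conditionally independent of the others given the rest.
theta = np.array([[ 1.0, -0.5, 0.0],
                  [-0.5,  1.0, 0.0],
                  [ 0.0,  0.0, 1.0]])
print(edge_set(theta))  # [(0, 1)]
```

Only the upper triangle is scanned because Θ is symmetric, so each conditional dependency appears exactly once in E.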
We aim to achieve two goals when designing such a prior distribution: (1) in practice, irrelevant latent factors are not supposed to be conditionally dependent, and hence a concise graphical structure between latent factors is preferred in many applications such as topic mining and image feature learning; and (2) in contrast to L0 regularization, which counts the number of non-zero components, we obtain a convex subproblem in Θ that can be efficiently solved by utilizing recently developed convex optimization techniques.

2.2 Learning Algorithm
We consider the posterior distribution of the parameters, which is proportional to the product of the data likelihood and the prior distributions:

h(x) exp{s^T B^T T(x) − A(Bs)} × (1/Z(μ, Θ)) exp(−μ^T s − (1/2) s^T Θ s) × exp(−(1/2) ρ ||Θ||_1).    (7)

Given a set of data observations {x^(1), . . . , x^(N)}, the Maximum a Posteriori (MAP) estimates of the basis matrix B, the latent factors in S = [s^(1), . . . , s^(N)] and the parameters {μ, Θ} of the latent factor network are therefore the solution of the following problem:

min_{B,S,Θ}  (1/N) Σ_i {−log h(x^(i)) + A(Bs^(i)) − (s^(i))^T B^T T(x^(i))} + log Z(μ, Θ) + (1/N) μ^T S 1_N + (1/(2N)) tr(S^T Θ S) + (1/2) ρ ||Θ||_1
s.t.  B ≥ 0, ||B_k||_2 ≤ 1, k = 1, . . . , K,    (8)

where the additional constraints B ≥ 0 and ||B_k||_2 ≤ 1 are introduced for the identifiability of the model.

The objective function in Eq. (8) is not convex with respect to all three unknowns (B, S and Θ) together. 
Therefore, a good algorithm in general can only exhibit convergence to a stationary point, and we can use the Block Coordinate Descent algorithm [1] to iteratively update B, S and Θ as follows:

while not convergent do
  For i = 1, . . . , N, solve
    min_{s^(i)}  −log h(x^(i)) + A(Bs^(i)) − (s^(i))^T B^T T(x^(i)) + μ^T s^(i) + (1/2) (s^(i))^T Θ s^(i)    (9)
  Solve
    min_{B ≥ 0, ||B_k||_2 ≤ 1}  Σ_i {−log h(x^(i)) + A(Bs^(i)) − (s^(i))^T B^T T(x^(i))}    (10)
  Solve
    min_{μ,Θ}  log Z(μ, Θ) + (1/N) μ^T S 1_N + (1/(2N)) tr(S^T Θ S) + (1/2) ρ ||Θ||_1    (11)
end do

Since p(x|η) is in the exponential family, the subproblem (10) with respect to B is convex and smooth with simple constraints, for which quasi-Newton methods such as projected L-BFGS [22] are among the most efficient methods. Subproblem (9) is easy to solve for real-valued s^(i), but generally hard when the latent factors only admit discrete values. For example, for s ∈ {0, 1}^K and Gaussian p(x|η), subproblem (9) is a 0-1 quadratic programming problem, and we can resort to SDP-based Branch and Bound algorithms [20] to solve it in a reasonable time. Subproblem (11) is the minimization of the sum of a differentiable convex function and an L1 regularization term, for which a few recently developed methods can be very efficient, such as variants of ADMM [6]. For the case of discrete s with large K (usually K << M), evaluation of the partition function Z(μ, Θ) during the iterations is #P-hard, and Schmidt [21] discusses methods for solving the pseudo-likelihood approximation of (11).

3 A Special Case: Structured Latent Factor Analysis
From this section on, we consider a special case of the learning problem in Eq. 
(8) when x follows a multivariate normal distribution and s follows a sparse Gaussian graphical model (SGGM). We name our model under this default setting "structured latent factor analysis" (SLFA) and compare it to related works. Assume p(x|η) = (2π)^{−M/2} exp(−(1/(2σ²)) ||x − η||²) and s ∼ N(μ, Φ^{−1}), with sparse precision matrix Φ (inverse covariance). For simplicity, we assume the given data matrix X = [x^(1), . . . , x^(N)] is centered, and set μ = 0. Then the objective function in Eq. (8) becomes

min_{B,S,Φ}  (1/N) ||X − BS||_F² + σ² ((1/N) tr(S^T Φ S) − log det(Φ) + ρ ||Φ||_1)
s.t.  B ≥ 0, ||B_k||_2 ≤ 1, k = 1, . . . , K, Φ ≻ 0.    (12)

If Φ is fixed, the problem in Eq. (12) is a matrix factorization method with the generalized Tikhonov regularization tr(S^T Φ S). If Φ_{i,j} > 0, minimizing the objective function will avoid s_i and s_j being simultaneously large, and we say the i-th factor and the j-th factor are negatively related. If Φ_{i,j} < 0, the solution is likely to have s_i and s_j of the same sign, and we say the i-th factor and the j-th factor are positively related. If Φ_{i,j} = 0, the regularization doesn't induce any interaction between s_i and s_j in the objective function. Therefore, this regularization term makes SLFA produce a collaborative reconstruction based on the conditional dependencies between latent factors. On one hand, the collaborative nature makes SLFA capture deeper statistical structure hidden in the data set, compared to the matrix factorization problem with the Tikhonov regularization ||S||_F² or sparse coding with a sparsity-inducing regularization such as ||S||_1. 
On the other hand, SLFA encourages sparse interactions, which is very different from previous works such as the correlated topic model [2] and the latent Gaussian model [18], where the latent factors are densely related.

An On-line Algorithm For Learning SLFA: The convex subproblem

min_{Φ ≻ 0}  (1/N) tr(S^T Φ S) − log det(Φ) + ρ ||Φ||_1    (13)

can be efficiently solved by a recent quadratic approximation method in [9]. For the subproblem of S we have the closed-form solution

S = (B^T B + σ²Φ)^{−1} B^T X.

Moreover, considering that many modern high-dimensional data sets include a large number of data observations (e.g. text articles from web-news), we propose an online algorithm for learning SLFA on larger data sets. As summarized in Algorithm 1, at each iteration we randomly fetch a mini-batch of observations and compute their latent factor vectors s. The latent factor vectors are then used to update the basis matrix B in a stochastic gradient descent fashion, with projections onto the constraint set. Lastly, we update the precision matrix Φ.

Algorithm 1 An on-line algorithm for learning SLFA.
Input: X = [x^(1), . . . , x^(N)], initial guess of the basis matrix B, initial precision matrix Φ = I, number of iterations T, parameters σ² and ρ, step-size γ, mini-batch size N′.
  – Draw N′ observations randomly from X = [x^(1), . . .
, x^(N)] to form the matrix Xbatch.
• for t = 1 to T (the random draw above and the steps below are performed at each iteration t)
  – Compute the latent factor vectors Sbatch = (B^T B + σ²Φ)^{−1} B^T Xbatch.
  – Update the basis matrix B using a gradient descent step: B ← B − (γ/N′) [B Sbatch − Xbatch] Sbatch^T.
  – Project the columns of B onto the first orthant and the unit ball, i.e., B ≥ 0 and ||B_i|| ≤ 1.
  – Solve the subproblem (13) to update the sparse inverse covariance matrix Φ, using all available latent factor vectors in S.
• end for

Parameter Selection: The hyper-parameter ρ controls the sparsity of Φ. A large ρ will result in a diagonal precision matrix Φ, indicating that the latent factors are conditionally independent. As ρ → 0, Φ becomes denser. However, if we set ρ = 0, the subproblem with respect to Φ has the closed-form solution Φ = ((1/N) S S^T)^{−1}, i.e., the inverse sample covariance matrix. Plugging it back into Eq. (12), we have

min_{B,S}  (1/N) ||X − BS||_F² + σ² log det((1/N) S S^T),

which doesn't have a lower bound. Therefore the regularization is necessary, and we choose positive values for ρ in the experiments. For supervised tasks, we use cross-validation to choose the proper value of ρ that optimizes the evaluation rule on a validation set. For unsupervised applications, we combine the BIC criterion in [28] with our model to obtain the following criterion:

ρ* = argmin_ρ  (1/N) ||X − B(ρ)S(ρ)||_F² + σ² ((1/N) tr(S(ρ)^T Φ(ρ) S(ρ)) − log det(Φ(ρ)) + (log N / N) ||Φ(ρ)||_0),

where B(ρ), S(ρ) and Φ(ρ) are learned from (12) with parameter ρ. 
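The data flow of Algorithm 1 can be sketched in numpy. Note that this is only a sketch under stated assumptions: the Φ-update below replaces the quadratic-approximation solver of [9] for subproblem (13) with a crude soft-thresholded inverse-covariance stand-in, and the function name, defaults, and ridge terms are hypothetical:

```python
import numpy as np

def slfa_online(X, K, T=50, sigma2=0.1, rho=0.1, gamma=0.05, Nb=32, seed=0):
    """Sketch of the online SLFA loop: mini-batch factor computation,
    projected SGD on B, then a crude stand-in for the Phi subproblem."""
    rng = np.random.default_rng(seed)
    M, N = X.shape
    B = np.abs(rng.standard_normal((M, K)))          # nonnegative init
    B /= np.maximum(np.linalg.norm(B, axis=0), 1.0)  # columns in unit ball
    Phi = np.eye(K)                                  # initial precision
    for t in range(T):
        # draw a mini-batch of observations
        idx = rng.choice(N, size=min(Nb, N), replace=False)
        Xb = X[:, idx]
        # closed-form factors: S = (B'B + sigma^2 Phi)^{-1} B'X
        Sb = np.linalg.solve(B.T @ B + sigma2 * Phi + 1e-8 * np.eye(K),
                             B.T @ Xb)
        # stochastic gradient step on the basis matrix
        B -= (gamma / Xb.shape[1]) * (B @ Sb - Xb) @ Sb.T
        # project columns onto the first orthant and the unit ball
        B = np.maximum(B, 0.0)
        B /= np.maximum(np.linalg.norm(B, axis=0), 1.0)
        # stand-in for subproblem (13): invert the regularized factor
        # covariance and soft-threshold its off-diagonal entries
        S = np.linalg.solve(B.T @ B + sigma2 * Phi + 1e-8 * np.eye(K),
                            B.T @ X)
        P = np.linalg.inv(S @ S.T / N + 1e-3 * np.eye(K))
        Phi = np.sign(P) * np.maximum(np.abs(P) - rho, 0.0)
        np.fill_diagonal(Phi, np.diag(P))            # keep the diagonal
    return B, Phi
```

The projection step guarantees the constraints B ≥ 0 and ||B_i|| ≤ 1 hold after every iteration; an exact implementation would solve (13) with a graphical-lasso-type solver instead of the thresholding shortcut.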
Alternatively, for visual analysis of latent factors, we can select multiple values of ρ to obtain Φ with the desired sparsity.

Relationship to Sparse Gaussian Graphical Model: We can also see SLFA as a generalization of the sparse Gaussian graphical model. In fact, if the reduced dimension K = M, problem (12) has the trivial solution B = I and S = X, and the problem becomes the same as (13). When K < M, the subproblem with respect to s has the solution s = (B^T B + σ²Φ)^{−1} B^T x. Therefore, the lower-dimensional random vector s has fewer variables, each of which is a linear combination of the original variables of x, with the combination weights stored in W = (B^T B + σ²Φ)^{−1} B^T. In this sense, SLFA could be seen as the sparse Gaussian graphical model of s = Wx, i.e., it generalizes the concept from the original (totally M) variables to the merged (totally K) group variables.

A few recent efforts [3, 24] have also combined SGGM with latent factor models. For example, "Kronecker GLasso" in [24] performs a joint learning of row and column covariances for matrix-variate Gaussian models. Different from our SLFA, these methods still aim at modeling the interaction between the original features and do not consider interactions in the latent factor space. Instead, SLFA is a hierarchical model, and the learned pairwise relationships are on the latent factor level. If we apply both SLFA and Kronecker GLasso to a text corpus where each document is represented by a 50,000-dimensional sparse vector and the number of latent factors (topics) is fixed at 50, then Kronecker GLasso will produce a precision matrix of dimension 50,000 × 50,000 and a corresponding sparse graph of 50,000 nodes. 
SLFA, however, can dramatically reduce the problem to learning a 50 × 50 sparse precision matrix and the corresponding graph of 50 nodes.

Relationship to other works: Sparse coding [19] can be modeled as:

min_{B,S}  (1/2) ||X − BS||_F² + λ ||S||_1.    (14)

For many high-dimensional data sets, such as text in natural languages, the input data is already very sparse or high-dimensional, so sparse coding is not easily applicable. Intuitively, sparse coding based works (such as [7]) try to remove the redundancy in the representation of the data, while SLFA encourages a (sparse) collaborative reconstruction of the data from the latent bases.

Recently, Jenatton et al. [12] proposed a method that can learn latent factors with a given tree structure. The optimization problem in Jenatton et al. [12] is a penalized matrix factorization problem similar to our Eq. (12) and Eq. (14), but uses a different regularization term which imposes an overlapped group sparsity on the factors. In contrast, SLFA can learn a more general graphical structure among latent factors and doesn't assume that each data sample maps to a sparse combination of basis vectors. The model of SLFA has a hierarchy similar to the correlated topic model [2] and the latent Gaussian model [18]. 
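The contrast with sparse coding can be made concrete: the λ||S||_1 penalty of Eq. (14) scores each coefficient in isolation, while SLFA's tr(S^T Φ S) term couples pairs of factors through Φ, so same-sign activation of positively related factors (Φ_{i,j} < 0) is cheaper than opposite-sign activation. A toy numpy illustration (all values invented for illustration):

```python
import numpy as np

S = np.array([[1.0, -1.0],
              [1.0,  1.0]])            # two factors (rows), two samples (cols)
Phi = np.array([[ 1.0, -0.8],
                [-0.8,  1.0]])         # factors 1 and 2 positively related

l1 = np.abs(S).sum(axis=0)             # sparse coding penalty, per sample
tik = np.einsum('ks,ks->s', S, Phi @ S)  # s^T Phi s, per sample

print(l1)   # [2. 2.]  -- identical for both samples
print(tik)  # [0.4 3.6] -- same-sign sample is much cheaper
```

The L1 penalty cannot distinguish the two samples, whereas the generalized Tikhonov term rewards the sample whose factor signs agree with the learned dependency structure.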
Besides the key difference of sparsity, SLFA directly uses the precision matrix to learn latent factor networks, while the other two works learn the covariance matrix by Bayesian methods.

4 Experiments
In this section, we conduct experiments on both synthetic and real-world data sets to show that: (1) SLFA recovers latent basis vectors and finds the pairwise relationships of latent factors, and (2) SLFA generates useful features for various tasks such as image analysis, topic visualization and microarray analysis.

4.1 Synthetic Data I: Four Different Graphical Relationships
The first experiment uses randomly generated synthetic data with different graphical structures of latent factors. It aims to test whether SLFA can find the true latent factors and the true relationships among latent factors, and to study the effect of the parameter ρ on the results. We use four special cases of the Sparse Gaussian Graphical Model to generate the latent factors. The underlying graph is either a ring, a grid, a tree or a random sparse graph, as shown in Figure 1.

Figure 1: Recovering structured latent factors from data. The upper row, panels (a) Ring, (b) Grid, (c) Tree and (d) Random, shows four different underlying graphical models of latent factors. A red edge means the two latent factors are positively related (Φ*_ij < 0); a blue edge implies the two latent factors are negatively related (Φ*_ij > 0). The lower row, panels (e)-(h), plots F-score vs. ρ for the four settings. We can observe that SLFA (red lines) is as good as an oracle method (True Basis, green lines). The pink dashed lines of the BIC score (scaled to [0, 1]) demonstrate that the parameter selection method works well.

A sparse positive definite matrix Φ* ∈ R^{10×10} is constructed based on the graph of the SGGM. 
Then we sample 200 Gaussian random vectors, s^(1), . . . , s^(200) ∈ R^10, with precision matrix Φ*. A set of vectors B* ∈ R^{500×10} is randomly generated from a normal distribution and then filtered by the sigmoid function f(b) = 1/(1 + e^{−100b}), such that most components of B* are close to either 0 or 1. B_1, B_2, . . . , B_10 are then normalized as basis vectors. Finally, the synthetic data points are generated by x^(i) = Bs^(i) + 0.1ε_i, i = 1, . . . , 200, where ε_i ∼ N(0, I).

We compare SLFA to four other methods for learning the basis matrix B and the precision matrix Φ from the data. The first one is NMF, where we learn a nonnegative basis B from the data and then learn the sparse precision matrix Φ for the corresponding factor vectors (no nonnegativity constraint on the factors) by SGGM. The second one is an ideal case where we have the "oracle" of the true basis B*; after fitting the data to the true basis, we learn the sparse precision matrix Φ by SGGM. The third one is named the L2 version of SLFA, as we replace the L1 regularization of Φ by a Frobenius norm regularization. The fourth method first applies the L2 version of SLFA and then learns Φ by SGGM. In all cases except the oracle method, we have a non-convex problem, so after we obtain the learned basis vectors, we use the Hungarian algorithm to align them with the true basis vectors based on cosine similarity. We compute the precision and recall rates for recovering the relationships between latent factors by comparing the learned Φ with the true precision matrix Φ*.

We plot the F-score based on the precision and recall rates averaged over 10 experiments. According to Figure 1, when ρ is large, the estimated Φ is diagonal, so the recall rate is 0. 
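The generation process and the signed-edge evaluation described above can be sketched as follows. This is a hedged reconstruction: a ring-graph precision stands in for the four structures of Figure 1, and the F-score helper is one plausible reading of the precision/recall protocol over signed off-diagonal entries (names are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
K, M, N = 10, 500, 200

# Sparse positive definite precision: a ring graph, as in Figure 1(a).
Phi_true = 2.0 * np.eye(K)
for i in range(K):
    Phi_true[i, (i + 1) % K] = Phi_true[(i + 1) % K, i] = -0.9

# s ~ N(0, Phi*^{-1}); basis filtered by a steep sigmoid, then normalized.
S = rng.multivariate_normal(np.zeros(K), np.linalg.inv(Phi_true), size=N).T
B = 1.0 / (1.0 + np.exp(-100.0 * rng.standard_normal((M, K))))
B /= np.linalg.norm(B, axis=0)
X = B @ S + 0.1 * rng.standard_normal((M, N))

def signed_edge_f1(Phi_hat, Phi_ref, tol=1e-8):
    """F-score for recovering the off-diagonal signs of the precision."""
    iu = np.triu_indices_from(Phi_ref, k=1)
    pred = np.sign(Phi_hat[iu]) * (np.abs(Phi_hat[iu]) > tol)
    true = np.sign(Phi_ref[iu]) * (np.abs(Phi_ref[iu]) > tol)
    tp = np.sum((pred == true) & (true != 0))
    prec = tp / max(np.sum(pred != 0), 1)
    rec = tp / max(np.sum(true != 0), 1)
    return 2 * prec * rec / max(prec + rec, 1e-12)

print(signed_edge_f1(Phi_true, Phi_true))  # 1.0 for perfect recovery
```

A diagonal estimate (the large-ρ regime described below) predicts no edges at all, so its recall, and hence its F-score, is 0, matching the behavior discussed in the text.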
As ρ becomes smaller, more nonzero elements appear in the estimated Φ, and both the recall and precision rates for the "positive/negative relationship" increase. When ρ is small enough, the recovered Φ becomes denser and may not even recover the "positive/negative relationship" correctly. We can see that for all four cases, our proposed method SLFA is as good as the "oracle" method at recovering the pairwise relationships between latent factors. NMF most probably fails to find the right basis, since it does not consider any higher-level information about the interactions between basis elements; hence SGGM can't find meaningful relationships between the factors obtained from NMF. The L2 version of SLFA also has a poor F-score, since it can't recover the sparse structure. Since the latent factors have dense interactions in the L2 version of SLFA, combining it with a postprocessing by SGGM improves the performance significantly; however, it still performs worse than SLFA. This experiment also confirms that the idea of performing an integrated learning of the bases together with a regularized precision matrix is essential for recovering the true structure in the data.

4.2 Synthetic Data II: Parts-based Images
The second experiment also utilizes a simulated data set, based on images, to compare SLFA with popular latent factor models. 
We set up an experiment by generating 15000 images of "bugs", each of which is essentially a linear combination of five latent parts, as shown in Figure 2a.

[Figure 1 plot residue removed; axes: F-score vs. −log2(ρ), legend: SLFA, NMF+SGGM, True Basis, L2 version, L2+SGGM, Scaled-BIC.]

Figure 2: (a) True Bases; (b) Creation; (c) Samples; (d) Precision Matrix; (e) SLFA Basis. The table shows the Φ(i, j) values and corresponding B_i and B_j elements learned by SLFA for the six highest and six lowest entries in Φ. For Φ(i, j) > 0, B_i and B_j are negatively related (exclusive); for Φ(i, j) < 0, B_i and B_j are positively related (supportive).

  Φ_{i,j}, (−) rel.:  0.030, 0.020, 0.015, 0.015, 0.014, 0.013
  Φ_{i,j}, (+) rel.: −0.016, −0.015, −0.013, −0.012, −0.011, −0.011

Given 37 basis images, we first randomly select one of the five big circles as the body of the "bug". Each shape of body is associated with four positions where the legs of the bug are located. We then randomly pick 4 legs from its associated set of 4 small circles and 4 small squares. However, for each leg, circle and square are exclusive of each other. We combine the selected five latent parts with random coefficients that are sampled from the uniform distribution and multiplied by −1 with probability 0.5. Finally, we add a randomly selected basis with small random coefficients, plus Gaussian random noise, to each image to introduce noise and confusion into the data set. A few examples of the bug image samples created by the above strategy are shown in Figure 2c. 
The generating process (Figure 2b) induces a positive relationship between one type of body and its associated legs, as well as a negative relationship between the pair of circle and square legs located at the same position.

Using SLFA and two baseline algorithms, PCA and NMF, we learn a set of latent bases and compare the results of the three methods in Figure 2e. We can see that the basis images generated by SLFA are almost exactly the same as the true latent bases. This is due to the fact that SLFA accounts for the sparse interaction between factors in the joint optimization problem and encourages collaborative reconstruction. The NMF basis (shown in the supplementary material due to space considerations) in this case also turns out to be similar to the true basis; however, one can still observe that many components contain mixed structures, since NMF cannot capture the true data generation process. The bases learned by PCA (also shown in the supplementary material) are not interpretable, as expected.

More importantly, SLFA provides the convenience of analyzing the relationships between the bases using the precision matrix Φ. In Figure 2d, we analyze the relational structure learned in the precision matrix Φ. The most negatively related (exclusive) pairs (the (i, j) entries with the highest positive values in Φ) are circular and square legs, which conforms fully to the generation process, since only one of them is chosen for any given location. Accordingly, the most positively related pairs are a body shape and one of its associated legs, since every bug has a body and four legs at fixed positions.

4.3 Real Data I: NIPS Documents
In this section, we apply SLFA to the NIPS corpus (1), which contains 1740 abstracts from the NIPS Conferences 1-12, for the purpose of topic/content modeling. SLFA is used to organize and visualize the relationships between the structured topics. 
SLFA is applied to the 13649-dimensional tf-idf feature vectors, which are normalized to have unit norm. We fix the number of topics to be 40 and tune the parameters σ and ρ to obtain a Φ with a sparsity suitable for the visualization task. In Figure 3, we plot a graph of topics (stand-alone topics removed) with positive interactions between each other, and present the top 5 keywords for each topic. For example, the topic at the top is about general notions shared by many learning algorithms and acts as the hub of the graph, while the surrounding topics contain more specific words that are relevant to a particular learning algorithm or a more specialized topic of interest. It is obvious that SLFA not only extracts the underlying topics, but is also able to capture the (de)correlations between topics. For example, on the far left, the topic related to cells is connected to the "motion, velocity, ...", "objects, image, ..." and "spike, neurons, ..." nodes. This subgraph clearly represents a few topics in computer vision and neuroscience. The node on the far right containing "robot, planning, ..." is connected to the node with "controller, control, ...", which represents a robotics-related topic-cluster. It is also interesting to note that SLFA can obtain a graph of negatively related topics (shown in the supplementary material). One can see that closely related topics tend to exclude each other.

(1) http://cs.nyu.edu/~roweis/data.html

Figure 3: Positively related topics (learned by SLFA) discovered from the NIPS text corpus. Each edge corresponds to a negative element in the sparse precision matrix Φ.

SLFA: 34.22 ± 2.58 | Lasso-overlapped-group: 35.31 ± 2.05 | Lasso: 36.42 ± 2.50 | SVM: 36.93 ± 2.54 | PCA: 36.85 ± 3.02

Table 1: Cross-validation error rate (average and standard deviation) of different methods on the gene microarray data. 
SLFA performs best, even better than Lasso-overlapped-group (t-test at significance level 0.02), which takes advantage of external information (42,594 known edges between gene variables from another biological resource).

4.4 Real Data II: Gene Microarray Data for Cancer Classification
Next, we test our model on a classification task using the breast cancer microarray data set obtained from [11]. This data set contains the gene expression values of 8,141 genes for 295 breast cancer tumor samples. The task is to classify the tumor samples into two classes (78 metastatic and 217 non-metastatic).
Using the classification error rate as the metric, we compare five methods in total: Lasso [26], Lasso-overlapped-group [11], a linear SVM classifier [5], PCA with a linear SVM classifier, and SLFA with a linear SVM classifier. Lasso-overlapped-group, a logistic regression approach with graph-guided sparsity, uses a known biological network as the graphical (overlapped group) regularization on the lasso regression. The other methods, including SLFA, do not use this extra supervised information. We run 10-fold cross-validation and use the averaged error rate to indicate the predictive performance of the different methods. The test is repeated 50 times, and each time all methods use the same split of training and validation sets.
The averaged cross-validation error rates are shown in Table 1. We can observe that SLFA (K = 100) has a lower error rate than the other methods, including Lasso, SVM and PCA. Compared to Lasso-overlapped-group [11], which constructs its regularization from external information (42,594 known edges as prior knowledge), our method based on SLFA performs better even though it does not utilize any extra evidence. This is strong evidence that SLFA can extract deeper structural information hidden in the data.
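The evaluation protocol above (10-fold cross-validation, repeated 50 times, with every method scored on identical splits) can be sketched as follows; `cross_val_error` and `fit_predict` are illustrative names and not part of the paper's code:

```python
import numpy as np

def cross_val_error(fit_predict, X, y, folds=10, repeats=50, seed=0):
    """Average test error over `repeats` shuffled runs of `folds`-fold CV.

    `fit_predict(X_tr, y_tr, X_te)` must return predicted labels for X_te.
    Fixing `seed` ensures every method under comparison sees the same splits.
    """
    rng = np.random.RandomState(seed)
    errors = []
    for _ in range(repeats):
        order = rng.permutation(len(y))
        for fold in range(folds):
            test = order[fold::folds]          # every `folds`-th shuffled index
            train = np.setdiff1d(order, test)  # remaining indices for training
            pred = fit_predict(X[train], y[train], X[test])
            errors.append(np.mean(pred != y[test]))
    return np.mean(errors), np.std(errors)
```

Each classifier (SLFA + SVM, PCA + SVM, plain SVM, Lasso, ...) is passed in as its own `fit_predict`, so the reported means and standard deviations are directly comparable.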
Indeed, genes naturally act in the form of functional modules (gene groups) to carry out specific functions. Gene groups, which usually correspond to biological processes or pathways, exhibit diverse pairwise dependency relationships among each other. SLFA discovers these relationships while learning the latent representation of each data sample at the same time. That is why its learned lower-dimensional representation captures more fundamental and stronger signals and achieves state-of-the-art classification performance. The learned structural information and latent gene groups are also confirmed by the biological function analysis in the supplementary document.

5 Conclusion
In this paper we have introduced a novel structured latent factor model that simultaneously learns latent factors and their pairwise relationships. The model is formulated to represent data drawn from the general exponential family of distributions. The learned sparse interaction between latent factors is crucial for understanding complex data sets and analyzing them visually. The SLFA model is also a hierarchical extension of the sparse Gaussian graphical model: it generalizes the precision matrix from the original variable space to the latent factor space and optimizes the bases together with the precision matrix.
We have also provided an efficient online learning algorithm that scales SLFA training to large-scale datasets, and showed that SLFA not only recovers the true bases and the structured relationships between them, but also achieves state-of-the-art results on a challenging biological classification task.

References
[1] Bertsekas, D.: Nonlinear Programming. Athena Scientific, Belmont, MA (1999)
[2] Blei, D., Lafferty, J.: Correlated topic models. Advances in Neural Information Processing Systems (2006)
[3] Chandrasekaran, V., Parrilo, P., Willsky, A.: Latent variable graphical model selection via convex optimization. arXiv preprint arXiv:1008.1290 (2010)
[4] Collins, M., Dasgupta, S., Schapire, R.: A generalization of principal component analysis to the exponential family. Advances in Neural Information Processing Systems (2002)
[5] Fan, R., Chang, K., Hsieh, C., Wang, X., Lin, C.: LIBLINEAR: A library for large linear classification. JMLR (2008)
[6] Goldfarb, D., Ma, S., Scheinberg, K.: Fast alternating linearization methods for minimizing the sum of two convex functions. arXiv preprint arXiv:0912.4571 (2009)
[7] Gregor, K., Szlam, A., LeCun, Y.: Structured sparse coding via lateral inhibition.
Advances in Neural Information Processing Systems 24 (2011)
[8] Hinton, G., Osindero, S., Bao, K.: Learning causally linked Markov random fields. In: AI & Statistics (2005)
[9] Hsieh, C., Sustik, M., Ravikumar, P., Dhillon, I.: Sparse inverse covariance matrix estimation using quadratic approximation. Advances in Neural Information Processing Systems (NIPS) 24 (2011)
[10] Hyvärinen, A., Hurri, J., Hoyer, P.: Independent component analysis. Natural Image Statistics (2009)
[11] Jacob, L., Obozinski, G., Vert, J.: Group lasso with overlap and graph lasso. Proceedings of the 26th Annual International Conference on Machine Learning (2009)
[12] Jenatton, R., Mairal, J., Obozinski, G., Bach, F.: Proximal methods for sparse hierarchical dictionary learning. Proceedings of the International Conference on Machine Learning (2010)
[13] Karklin, Y., Lewicki, M.S.: Emergence of complex cell properties by learning to generalize in natural scenes. Nature (2009)
[14] Lee, D., Seung, H.: Learning the parts of objects by non-negative matrix factorization. Nature (1999)
[15] Lee, H., Battle, A., Raina, R., Ng, A.: Efficient sparse coding algorithms. Advances in Neural Information Processing Systems (2007)
[16] Lee, H., Raina, R., Teichman, A., Ng, A.: Exponential family sparse coding with applications to self-taught learning. Proceedings of the 21st International Joint Conference on Artificial Intelligence (2009)
[17] Lyu, S., Simoncelli, E.: Nonlinear extraction of independent components of natural images using radial Gaussianization. Neural Computation (2009)
[18] Murray, I., Adams, R.: Slice sampling covariance hyperparameters of latent Gaussian models. arXiv preprint arXiv:1006.0868 (2010)
[19] Olshausen, B., et al.: Emergence of simple-cell receptive field properties by learning a sparse code for natural images.
Nature (1996)
[20] Rendl, F., Rinaldi, G., Wiegele, A.: Solving Max-Cut to optimality by intersecting semidefinite and polyhedral relaxations. Math. Programming 121(2), 307 (2010)
[21] Schmidt, M.: Graphical model structure learning with l1-regularization. Ph.D. thesis, University of British Columbia (2010)
[22] Schmidt, M., Van Den Berg, E., Friedlander, M., Murphy, K.: Optimizing costly functions with simple constraints: A limited-memory projected quasi-Newton algorithm. In: AI & Statistics (2009)
[23] Silva, R., Scheines, R., Glymour, C., Spirtes, P.: Learning the structure of linear latent variable models. The Journal of Machine Learning Research 7, 191–246 (2006)
[24] Stegle, O., Lippert, C., Mooij, J., Lawrence, N., Borgwardt, K.: Efficient inference in matrix-variate Gaussian models with iid observation noise. Advances in Neural Information Processing Systems (2011)
[25] Teh, Y., Seeger, M., Jordan, M.: Semiparametric latent factor models. In: AI & Statistics (2005)
[26] Tibshirani, R.: Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B (Methodological) (1996)
[27] Wainwright, M., Simoncelli, E.: Scale mixtures of Gaussians and the statistics of natural images. Advances in Neural Information Processing Systems (2000)
[28] Yuan, M., Lin, Y.: Model selection and estimation in the Gaussian graphical model. Biometrika (2007)