{"title": "Factorized Latent Spaces with Structured Sparsity", "book": "Advances in Neural Information Processing Systems", "page_first": 982, "page_last": 990, "abstract": "Recent approaches to multi-view learning have shown that factorizing the information into parts that are shared across all views and parts that are private to each view could effectively account for the dependencies and independencies between the different input modalities. Unfortunately, these approaches involve minimizing non-convex objective functions. In this paper, we propose an approach to learning such factorized representations inspired by sparse coding techniques. In particular, we show that structured sparsity allows us to address the multi-view learning problem by alternately solving two convex optimization problems. Furthermore, the resulting factorized latent spaces generalize over existing approaches in that they allow :having latent dimensions shared between any subset of the views instead of between all the views only. We show that our approach outperforms state-of-the-art methods on the task of human pose estimation.", "full_text": "Factorized Latent Spaces with Structured Sparsity\n\nYangqing Jia1, Mathieu Salzmann1,2, and Trevor Darrell1\n\n{jiayq,trevor}@eecs.berkeley.edu, salzmann@ttic.edu\n\n1UC Berkeley EECS and ICSI\n\n2TTI-Chicago\n\nAbstract\n\nRecent approaches to multi-view learning have shown that factorizing the infor-\nmation into parts that are shared across all views and parts that are private to each\nview could effectively account for the dependencies and independencies between\nthe different input modalities. 
Unfortunately, these approaches involve minimiz-\ning non-convex objective functions.\nIn this paper, we propose an approach to\nlearning such factorized representations inspired by sparse coding techniques.\nIn particular, we show that structured sparsity allows us to address the multi-\nview learning problem by alternately solving two convex optimization problems.\nFurthermore, the resulting factorized latent spaces generalize over existing ap-\nproaches in that they allow having latent dimensions shared between any subset\nof the views instead of between all the views only. We show that our approach\noutperforms state-of-the-art methods on the task of human pose estimation.\n\n1\n\nIntroduction\n\nMany computer vision problems inherently involve data that is represented by multiple modalities\nsuch as different types of image features, or images and surrounding text. Exploiting these multiple\nsources of information has proven bene\ufb01cial for many computer vision tasks. Given these multiple\nviews, an important problem therefore is that of learning a latent representation of the data that best\nleverages the information contained in each input view.\nSeveral approaches to addressing this problem have been proposed in the recent years. Multiple\nkernel learning [2, 24] methods have proven successful under the assumption that the views are\nindependent. In contrast, techniques that learn a latent space shared across the views (Fig. 1(a)), such\nas Canonical Correlation Analysis (CCA) [12, 3], the shared Kernel Information Embedding model\n(sKIE) [23], and the shared Gaussian Process Latent Variable Model (shared GPLVM) [21, 6, 15],\nhave shown particularly effective to model the dependencies between the modalities. 
However, they\ndo not account for the independent parts of the views, and therefore either totally fail to represent\nthem, or mix them with the information shared by all views.\nTo generalize over the above-mentioned approaches, methods have been proposed to explicitly ac-\ncount for the dependencies and independencies of the different input modalities. To this end, these\nmethods factorize the latent space into a shared part common to all views and a private part for each\nmodality (Fig. 1(b)). This has been shown for linear mappings [1, 11], as well as for non-linear\nones [7, 14, 20]. In particular, [20] proposed to encourage the shared-private factorization to be non-\nredundant while simultaneously discovering the dimensionality of the latent space. The resulting\nFOLS models were shown to yield more accurate results in the context of human pose estimation.\nThis, however, came at the price of solving a complicated, non-convex optimization problem. FOLS\nalso lacks an ef\ufb01cient inference method, and extension from two views to multiple views is not\nstraightforward since the number of shared/latent spaces that need to be explicitly modeled grows\nexponentially with the number of views.\nIn this paper, we propose a novel approach to \ufb01nding a latent space in which the information is cor-\nrectly factorized into shared and private parts, while avoiding the computational burden of previous\ntechniques [14, 20]. 
Furthermore, our formulation has the advantage over existing shared-private\nfactorizations of allowing shared information between any subset of the views, instead of only be-\n\n1\n\n\f(a)\n\n(b)\n\n(c)\n\n(d)\n\nFigure 1: Graphical models for the two-view case of (a) shared latent space models [23, 21, 6, 15],\n(b) shared-private factorizations [7, 14, 20], (c) the global view of our model, where the shared-\nprivate factorization is automatically learned instead of explicitly separated, and (d) an equivalent\nshared-private spaces interpretation of our model. Due to structured sparsity, rows \u03a0s of \u03b1 are\nshared across the views, whereas rows \u03a01 and \u03a02 are private to view 1 and 2, respectively.\ntween all views. In particular, we represent each view as a linear combination of view-dependent\ndictionary entries. While the dictionaries are speci\ufb01c to each view, the weights of these dictionaries\nact as latent variables and are the same for all the views. Thus, as shown in Fig. 1(c), the data is\nembedded in a latent space that generates all the views. By exploiting the idea of structured spar-\nsity [26, 18, 4, 17, 9], we encourage each view to only use a subset of the latent variables, and at\nthe same time encourage the whole latent space to be low-dimensional. As a consequence, and as\ndepicted in Fig. 1(d), the latent space is factorized into shared parts which represent information\ncommon to multiple views, and private parts which model the remaining information of the individ-\nual views. Training the model can be done by alternately solving two convex optimization problems,\nand inference by solving a convex problem.\nWe demonstrate the effectiveness of our approach on the problem of human pose estimation where\nthe existence of shared and private spaces has been shown [7]. 
We show that our approach correctly\nfactorizes the latent space and outperforms state-of-the-art techniques.\n\n2 Learning a Latent Space with Structured Sparsity\n\nIn this section, we \ufb01rst formulate the problem of learning a latent space for multi-view modeling.\nWe then brie\ufb02y review the concepts of sparse coding and structured sparsity, and \ufb01nally introduce\nour approach within this framework.\n\n2.1 Problem Statement and Notations\nLet X = {X(1), X(2),\u00b7\u00b7\u00b7 , X(V )} be a set of N observations obtained from V views, where X(v) \u2208\n(cid:60)Pv\u00d7N contains the feature vectors for the vth view. We aim to \ufb01nd an embedding \u03b1 \u2208 (cid:60)Nd\u00d7N of\nthe data into an Nd-dimensional latent space and a set of dictionaries D = {D(1), D(2),\u00b7\u00b7\u00b7 , D(V )},\nwith D(v) \u2208 (cid:60)Pv\u00d7Nd the dictionary entries for view v, such that X(v) is generated by D(v)\u03b1, as\ndepicted in Fig. 1(c). More speci\ufb01cally, we seek the latent embedding \u03b1 and the dictionaries that\nbest reconstruct the data in the least square sense by solving the optimization problem\n\nV(cid:88)\n\nv=1\n\nminD,\u03b1\n\n(cid:107)X(v) \u2212 D(v)\u03b1(cid:107)2\n\nFro .\n\n(1)\n\nFurthermore, as explained in Section 1, we aim to \ufb01nd a latent space that naturally separates the\ninformation shared among several views from the information private to each view. 
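Problem (1) can be made concrete with a short sketch (our own illustrative NumPy code, not the authors' implementation): for fixed dictionaries, stacking the views row-wise reduces the sum of per-view Frobenius terms to a single least-squares problem for the shared embedding α.

```python
import numpy as np

def embed_least_squares(X_views, D_views):
    """Solve min_alpha sum_v ||X(v) - D(v) alpha||_F^2 for fixed dictionaries.

    Stacking the views row-wise turns the sum of Frobenius terms into one
    least-squares problem ||X_stack - D_stack alpha||_F^2.
    """
    X_stack = np.vstack(X_views)   # (sum_v P_v) x N
    D_stack = np.vstack(D_views)   # (sum_v P_v) x Nd
    alpha, *_ = np.linalg.lstsq(D_stack, X_stack, rcond=None)
    return alpha

# Tiny example: two views generated from a common 3-dimensional embedding.
rng = np.random.default_rng(0)
alpha_true = rng.standard_normal((3, 50))
D = [rng.standard_normal((8, 3)), rng.standard_normal((5, 3))]
X = [D[0] @ alpha_true, D[1] @ alpha_true]
alpha_hat = embed_least_squares(X, D)
```

On noiseless data with full-rank stacked dictionaries, this recovers the generative embedding exactly; the regularized formulations below matter precisely because real data is noisy and the latent dimensionality is unknown.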
Our approach\nto addressing this problem is inspired by structured sparsity, which we brie\ufb02y review below.\nThroughout this paper, given a matrix A, we will use the term Ai to denote its ith column vector,\nAi,\u00b7 to denote its ith row vector, and A\u00b7,\u2126 (A\u2126,\u00b7) to denote the submatrix formed by taking a subset\nof its columns (rows), where the set \u2126 contains the indices of the chosen columns (rows).\n\n2.2 Sparse Coding and Structured Sparsity\n\nIn the single-view case, sparse coding techniques [16, 25, 13] have been proposed to represent the\nobserved data (e.g., image features) as a linear combination of dictionary entries, while encourag-\ning each observation vector to only employ a subset of all the available dictionary entries. More\n\n2\n\nZX(1)X(2)X(1)X(2)ZsZ1Z2X(1)X(2)D(1)\u03b1D(2)X(1)X(2)\u03b1\u03a0s\u03b1\u03a01\u03b1\u03a02D(2)\u03a02D(2)\u03a0sD(1)\u03a0sD(1)\u03a01\fformally, let X \u2208 (cid:60)P\u00d7N be the matrix of training examples. Sparse coding aims to \ufb01nd a set of\ndictionary entries D \u2208 (cid:60)P\u00d7Nd and the corresponding linear combination weights \u03b1 \u2208 (cid:60)Nd\u00d7N by\nsolving the optimization problem\n\nmin\nD,\u03b1\ns.t.\n\n||X \u2212 D\u03b1||2\n\n1\nFro + \u03bb\u03c6(\u03b1)\nN\n||Di|| \u2264 1 , 1 \u2264 i \u2264 Nd ,\n\n(2)\n\nwhere \u03c6 is a regularizer that encourages sparsity of its input, and \u03bb is the weight that sets the relative\nin\ufb02uence of both terms. In practice, when \u03c6 is a convex function, problem (2) is convex in D for a\n\ufb01xed \u03b1 and vice-versa. Typically, the L1 norm is used to encourage sparsity, which yields\n\nN(cid:88)\n\nN(cid:88)\n\nNd(cid:88)\n\nj=1\n\nj=1\n\ni=1\n\n\u03c6(\u03b1) =\n\n(cid:107)\u03b1j(cid:107)1 =\n\n|\u03b1i,j| .\n\n(3)\n\nWhile sparse coding has proven effective in many domains, it fails to account for any structure in the\nobserved data. 
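As a hedged illustration of the α-step of problem (2) with the L1 regularizer of Eq. (3), the following sketch uses iterative soft-thresholding (ISTA); the function names and the choice of solver are ours, not the paper's:

```python
import numpy as np

def soft_threshold(Z, t):
    """Proximal operator of t * ||.||_1, applied elementwise."""
    return np.sign(Z) * np.maximum(np.abs(Z) - t, 0.0)

def sparse_code_ista(X, D, lam, n_iter=500):
    """Minimize (1/N)||X - D alpha||_F^2 + lam * ||alpha||_1 over alpha."""
    N = X.shape[1]
    # Lipschitz constant of the smooth term's gradient: 2 * sigma_max(D)^2 / N.
    L = 2.0 * np.linalg.norm(D, 2) ** 2 / N
    alpha = np.zeros((D.shape[1], N))
    for _ in range(n_iter):
        grad = 2.0 / N * D.T @ (D @ alpha - X)
        alpha = soft_threshold(alpha - grad / L, lam / L)
    return alpha
```

Because the α-subproblem is convex for fixed D (and vice versa), this kind of proximal-gradient update is a standard way to realize the alternating scheme mentioned in the abstract.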
For instance, in classi\ufb01cation tasks, one would expect the observations belonging to\nthe same class to depend on the same subset of dictionary entries. This problem has been addressed\nby structured sparse coding techniques [26, 4, 9], which encode the structure of the problem in the\nregularizer. Typically, these methods rely on the notion of groups among the training examples to\nencourage members of the same group to rely on the same dictionary entries. This can simply be\ndone by re-writing problem (2) as\n\nNg(cid:88)\n\nNd(cid:88)\n\nmin\nD,\u03b1\n\ns.t.\n\n||X \u2212 D\u03b1||2\n\n1\nN\n||Di|| \u2264 1 , 1 \u2264 i \u2264 Nd ,\n\nFro + \u03bb\n\ng=1\n\n\u03c8(\u03b1\u00b7,\u2126g )\n\n(4)\n\nNd(cid:88)\n\nNd(cid:88)\n\nwhere Ng is the total number of groups, \u2126g represents the indices of the examples that belong to\ngroup g, and \u03b1\u00b7,\u2126g is the matrix containing the weights associated to these examples. To keep the\nproblem convex in \u03b1, \u03c8 is usually taken either as the L1,2 norm, or as the L1,\u221e norm, which yield\n\n\u03c8(\u03b1\u00b7,\u2126g ) =\n\n||\u03b1i,\u2126g||2 , or \u03c8(\u03b1\u00b7,\u2126g ) =\n\n||\u03b1i,\u2126g||\u221e =\n\ni=1\n\ni=1\n\ni=1\n\n|\u03b1i,k| .\n\nmax\nk\u2208\u2126g\n\n(5)\n\nIn general, structured sparsity can lead to more meaningful latent embeddings than sparse coding.\nFor example, [4] showed that the dictionary learned by grouping local image descriptors into images\nor classes achieved better accuracy than sparse coding for small dictionary sizes.\n\n2.3 Multi-view Learning with Structured Sparsity\n\nWhile the previous framework has proven successful for many tasks, it has only been applied to the\nsingle-view case. Here, we propose an approach to multi-view learning inspired by structured sparse\ncoding techniques. 
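The two group regularizers of Eq. (5) are straightforward to compute on a group's weight submatrix; a minimal sketch (helper names are ours):

```python
import numpy as np

def psi_l12(A_group):
    """L1,2 norm of Eq. (5): sum over rows of each row's 2-norm."""
    return np.linalg.norm(A_group, axis=1).sum()

def psi_l1inf(A_group):
    """L1,inf norm of Eq. (5): sum over rows of each row's max absolute value."""
    return np.abs(A_group).max(axis=1).sum()
```

Both are convex, so adding them to the reconstruction term keeps problem (4) convex in α for a fixed dictionary.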
To correctly account for the dependencies and independencies of the views, we\ncast the problem as that of \ufb01nding a factorization of the latent space into subspaces that are shared\nacross several views and subspaces that are private to the individual views. In essence, this can be\nseen as having each view exploiting only a subset of the dimensions of the global latent space, as\ndepicted by Fig. 1(d). Note that this de\ufb01nition is in fact more general than the usual de\ufb01nition of\nshared-private factorizations [7, 14, 20], since it allows latent dimensions to be shared across any\nsubset of the views rather than across all views only.\nMore formally, to \ufb01nd a shared-private factorization of the latent embedding \u03b1 that represents the\nmultiple input modalities, we adopt the idea of structured sparsity and aim to \ufb01nd a set of dictionaries\nD = {D(1), D(2),\u00b7\u00b7\u00b7 , D(V )}, each of which uses only a subspace of the latent space. This can be\nachieved by re-formulating problem (1) as\n\nV(cid:88)\n\nminD,\u03b1\n\ns.t.\n\n(cid:107)X(v) \u2212 D(v)\u03b1(cid:107)2\n1\nN\n||\u03b1\u00b7,i|| \u2264 1 , 1 \u2264 i \u2264 Nd .\n\nv=1\n\nFro + \u03bb\n\nV(cid:88)\n\nv=1\n\n3\n\n\u03c8((D(v))T )\n\n(6)\n\n\fwhere the regularizer \u03c8((D(v))T ) can be de\ufb01ned using the L1,2 or L1,\u221e norm. In practice, we chose\nthe L1,\u221e norm regularizer which has proven more effective than the L1,2 [18, 17]. Note that, here,\nwe enforce structured sparsity on the dictionary entries instead of on the weights \u03b1. Furthermore,\nnote that this sparsity encourages the columns of the individual D(v) to be zeroed-out instead of the\nrows in the usual formulation. The intuition behind this is that we expect each view X(v) to only\ndepend on a subset of the latent dimensions. 
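The stated intuition, that a zero-valued column of D(v) removes the corresponding latent dimension from that view's reconstruction, can be checked numerically; a small illustrative example (all values hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
D_v = rng.standard_normal((6, 4))
D_v[:, 2] = 0.0                       # latent dimension 2 is unused by this view
alpha = rng.standard_normal((4, 10))

alpha_perturbed = alpha.copy()
alpha_perturbed[2] += rng.standard_normal(10)   # change only the unused dimension

# The view's reconstruction is unchanged: dimension 2 plays no role for it,
# so it is free to act as a private dimension for the other views.
assert np.allclose(D_v @ alpha, D_v @ alpha_perturbed)
```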
Since X(v) is generated by D(v)\u03b1, having zero-valued\ncolumns of D(v) removes the in\ufb02uence of the corresponding latent dimensions on the reconstruction.\nWhile the formulation in Eq. 6 encourages each view to only use a limited number of latent di-\nmensions, it doesn\u2019t guarantee that parts of the latent space will be shared across the views. With\na suf\ufb01ciently large number Nd of dictionary entries, the same information can be represented in\nseveral parts of the dictionary. This issue is directly related to the standard problem of \ufb01nding the\ncorrect dictionary size. A simple approach would be to manually choose the dimension of the la-\ntent space, but this introduces an additional hyperparameter to tune. Instead, we propose to address\nthis issue by trying to \ufb01nd the smallest size of dictionary that still allows us to reconstruct the data\nwell. In spirit, the motivation is similar to [8, 20] that use a relaxation of rank constraints to dis-\ncover the dimensionality of the latent space. Here, we further exploit structured sparsity and re-write\nproblem (6) as\n\n(cid:107)X(v) \u2212 D(v)\u03b1(cid:107)2\n\nFro + \u03bb\n\nv=1\n\nv=1\n\n\u03c8((D(v))T ) + \u03b3\u03c8(\u03b1) ,\n\n(7)\n\nwhere we replaced the constraints on \u03b1 by an L1,\u221e norm regularizer that encourages rows of \u03b1 to be\nzeroed-out. This lets us automatically discover the dimensonality of the latent space \u03b1. Furthermore,\nif there is shared information between several views, this regularizer will favor representing it in a\nsingle latent dimension, instead of having redundant parts of the latent space.\nThe optimization problem (7) is convex in D for a \ufb01xed \u03b1 and vice versa. Thus, in practice, we\nalternate between optimizing D with a \ufb01xed \u03b1 and the opposite. Furthermore, to speed up the\nprocess, after each iteration, we remove the latent dimensions whose norm is less than a pre-de\ufb01ned\nthreshold. 
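The alternating scheme for problem (7) can be sketched as follows. This is our own simplified illustration, not the authors' solver: it substitutes the L1,2 norm (whose proximal operator is simple row-wise shrinkage) for the paper's L1,∞ regularizer, uses plain proximal-gradient inner steps rather than the specialized techniques of [17], and applies the pruning heuristic described above:

```python
import numpy as np

def prox_rows_l2(A, t):
    """Proximal operator of t * sum_i ||A_i,.||_2 (row-wise group shrinkage)."""
    norms = np.linalg.norm(A, axis=1, keepdims=True)
    return A * np.maximum(1.0 - t / np.maximum(norms, 1e-12), 0.0)

def alternate(X_views, Nd, lam=0.1, gamma=0.1, n_outer=20, n_inner=50, tol=1e-3):
    """Alternating minimization for a problem-(7)-style objective,
    with pruning of near-zero latent dimensions after each outer iteration."""
    N = X_views[0].shape[1]
    rng = np.random.default_rng(0)
    alpha = rng.standard_normal((Nd, N))
    Ds = [rng.standard_normal((X.shape[0], Nd)) for X in X_views]
    for _ in range(n_outer):
        # D-step: column sparsity of D(v) equals row sparsity of D(v)^T.
        for v, X in enumerate(X_views):
            L = 2.0 * np.linalg.norm(alpha, 2) ** 2 / N + 1e-12
            for _ in range(n_inner):
                grad = 2.0 / N * (Ds[v] @ alpha - X) @ alpha.T
                Ds[v] = prox_rows_l2((Ds[v] - grad / L).T, lam / L).T
        # alpha-step: row-wise shrinkage zeroes out unused latent dimensions.
        D_stack, X_stack = np.vstack(Ds), np.vstack(X_views)
        L = 2.0 * np.linalg.norm(D_stack, 2) ** 2 / N + 1e-12
        for _ in range(n_inner):
            grad = 2.0 / N * D_stack.T @ (D_stack @ alpha - X_stack)
            alpha = prox_rows_l2(alpha - grad / L, gamma / L)
        # Pruning: drop latent dimensions whose embedding row is near zero.
        keep = np.linalg.norm(alpha, axis=1) > tol
        alpha, Ds = alpha[keep], [D[:, keep] for D in Ds]
        if alpha.shape[0] == 0:
            break
    return Ds, alpha
```

Each subproblem in the loop is convex in the variable being updated, mirroring the convex alternation the paper relies on.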
Note that ef\ufb01cient optimization techniques for the L1,\u221e norm have been proposed in the\nliterature [17], enabling ef\ufb01cient optimization algorithms for the problem.\n\nV(cid:88)\n\nminD,\u03b1\n\n1\nN\n\nV(cid:88)\n\nInference\n\n2.4\nAt inference, given a new observation {x(1)\u2217 ,\u00b7\u00b7\u00b7 , x(V )\u2217 }, the corresponding latent embedding \u03b1\u2217 can\nbe obtained by solving the convex problem\n\n(cid:107)x(v)\u2217 \u2212 D(v)\u03b1\u2217(cid:107)2\n\n2 + \u03b3(cid:107)\u03b1\u2217(cid:107)1 ,\n\n(8)\n\nV(cid:88)\n\nv=1\n\nmin\n\u03b1\u2217\n\n(cid:88)\n\nv\u2208Va\n\nmin\n\u03b1\u2217\n\nwhere the regularizer lets us deal with noise in the observations.\nAnother advantage of our model is that it easily allows us to address the case where only a subset\nof the views are observed at test time. This scenario arises, for example, in human pose estimation,\nwhere view X(1) corresponds to image features and view X(2) contains the 3D poses. At inference,\nthe goal is to estimate the pose x(2)\u2217 given new image features x(1)\u2217 . To this end, we seek to estimate\nthe latent variables \u03b1\u2217, as well as the unknown views from the available views. This is equivalent to\n\ufb01rst solving the convex problem\n\n(cid:107)x(v)\u2217 \u2212 D(v)\u03b1\u2217(cid:107)2\n\n2 + \u03b3(cid:107)\u03b1\u2217(cid:107)1 ,\n\n(9)\n\nwhere Va is the set of indices of available views. The remaining unobserved views x(v)\u2217\nare then estimated as x(v)\u2217 = D(v)\u03b1\u2217 .\n\n, v /\u2208 Va\n\n3 Related Work\n\nWhile our method is closely related to the shared-private factorization algorithms which we dis-\ncussed in Section 1, it was inspired by the existing sparse coding literature and therefore is also\n\n4\n\n\fMethod\nPCA\nSC (e.g. 
[25])\nGroup SC [4]\nSSPCA [9]\nGroup Lasso [26]\nOur Method\n\n\u03c8(D)\nnone\nnone\n(cid:107)DT(cid:107)1,2\n(cid:107)D\u00b7,\u2126g(cid:107)\u03be,2\nnone\n\n\u2020\n\n(cid:107)(D\u2126g ,\u00b7)T(cid:107)1,\u221e\n\n\u2126g\n\n(cid:80)\n(cid:80)\n\n\u2126g\n\n(cid:80)\n(cid:80)\n\n\u2126g\n\n\u03c6(\u03b1)\nnone\n(cid:107)\u03b1T(cid:107)1,1\n(cid:107)\u03b1\u00b7,\u2126g(cid:107)1,2\nnone\n(cid:107)(\u03b1\u2126g ,\u00b7)T(cid:107)1,2\n(cid:107)\u03b1(cid:107)1,\u221e\n\n\u2126g\n\nCD or C\u03b1\n\n{D|DT D = I}\n\n{D|(cid:107)Di(cid:107)2 \u2264 1 \u2200i \u2264 Nd}\n\nnone\n\n{\u03b1|(cid:107)\u03b1i,\u00b7(cid:107)2 \u2264 1 \u2200i \u2264 Nd}\n\n{D|DT D = I}\n\nnone\n\n\u2020 Here \u03be denotes the vector l\u03b1/l1 quasi-norm. See [9] for details.\n\nTable 1: Properties of the different algorithms that can be viewed as special cases of RMF.\n\nrelated to it. In this section, we \ufb01rst show that many existing techniques can be considered as special\ncases of a general regularized matrix factorization (RMF) framework, and then discuss the relation-\nships and differences between our method and the existing ones.\nIn general, the RMF problem can be de\ufb01ned as that of factorizing a P \u00d7N matrix X into the product\nof a P \u00d7 M matrix D and an M \u00d7 N matrix \u03b1 so that the residual error is minimized. Furthermore,\nRMF exploits structured or unstructured regularizers to constrain the forms of D and \u03b1. This can\nbe expressed as the optimization problem\n\n1\nN\n\n(cid:107)X \u2212 D\u03b1(cid:107)2\n\nmin\nD,\u03b1\ns.t. D \u2208 CD , \u03b1 \u2208 C\u03b1 ,\n\nFro + \u03bb\u03c8(D) + \u03b3\u03c6(\u03b1)\n\n(10)\n\nwhere CD and C\u03b1 are the domains of the dictionary D and of latent embedding \u03b1, respectively.\nThese domains allow to enforce additional constraints on those matrices. 
Several existing algo-\nrithms, such as PCA, sparse coding (SC), group SC, structured sparse PCA (SSPCA) and group\nLasso, can be considered as special cases of this general framework. Table 1 lists the regularization\nterms and constraints used by these different algorithms.\nAlgorithms relying on structured sparsity exploit different types of matrix norm1 to impose sparsity\nand different ways of grouping the rows or columns of D and \u03b1 using algorithm-speci\ufb01c knowledge.\nGroup sparse coding [4] relies on supervised information such as class labels to de\ufb01ne the groups,\nwhile in our case, we exploit the natural separation provided by the multiple views. As a result,\nwhile group sparse coding \ufb01nds dictionary entries that encode class-related information, our method\n\ufb01nds latent spaces factorized into subspaces shared among different views and subspaces private to\nthe individual views.\nFurthermore, while structured sparsity is typically enforced on \u03b1, our method employs it on the\ndictionary. This also is the case of [9] in their SSPCA algorithm. However, while in our approach\nthe groups are taken as subsets of the rows of D, their method follows the more usual approach\nof de\ufb01ning the groups as subsets of its columns. Their intuition for doing so was to encourage\ndictionary entries to represent the variability of parts of the observation space, such as the variability\nof the eyes in the context of face images.\nFinally, it is worth noting that imposing structured sparsity regularization on both D and \u03b1 naturally\nyields a multi-view, multi-class latent space learning algorithm that can be deemed as a generaliza-\ntion of several algorithms summarized here.\n\n4 Experimental Evaluation\nIn this section, we show the results of our approach on learning factorized latent spaces from multi-\nview inputs. 
We compare our results against those obtained with state-of-the-art techniques on the\ntask of human pose estimation.\n\n4.1 Toy Example\nFirst, we evaluated our approach on the same toy case used by [20]. This shows our method\u2019s ability\nto correctly factorize a latent space into shared and private parts. This toy example consists of two\n\n1In our paper, we de\ufb01ne the Lp,q norm of a matrix A to be the p-norm of the vector containing of the\n\n(cid:13)(cid:13)(cid:13) ((cid:107)A1,\u00b7(cid:107)q,(cid:107)A2,\u00b7(cid:107)q,\u00b7\u00b7\u00b7 ,(cid:107)An,\u00b7(cid:107)q)\n\n(cid:13)(cid:13)(cid:13)p\n\n.\n\nq-norms of the matrix rows, i.e., (cid:107)A(cid:107)p,q =\n\n5\n\n\f(a) Generative Signal (View 1)\n\n(b) Generative Signal (View 2)\n\n(c) Observations\n\n(d) CCA\n\n(e) Our Method\n\n(f) Dictionaries\n\nFigure 2: Latent spaces recovered on a toy example. (a,b) Generative signals for the two views.\n(c) Correlated noise and the two 20D input views. (d) First 3 dimensions recovered by CCA. (e)\n3-dimensional latent space recovered with our method. Note that, as opposed to CCA, our approach\ncorrectly recovered the generative signals and discarded the noise. (f) Dictionaries learned by our\nalgorithm for each view. Fully white columns correspond to zero-valued vectors; note that the\ndictionary for each view uses only the shared dimension and its own private dimension.\n\nviews generated from one shared signal and one private signal per view depicted by Fig. 2(a,b). In\nparticular, we used sinusoidal signals at different frequencies such that\n\n5\u03c0t))] ,\n\n\u221a\n\u03b1(1) = [sin(2\u03c0t); cos(\u03c02t))], \u03b1(2) = [sin(2\u03c0t); cos(\n\n(11)\nwhere t was sampled from a uniform distribution in the interval (\u22121, 1). This yields a 3-dimensional\nground-truth latent space, with 1 shared dimension and 2 private dimensions. 
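Under the stated generative model, with α(1) = [sin(2πt); cos(π²t)] and α(2) = [sin(2πt); cos(√5 πt)], the toy data can be reproduced with a short script; the particular random projections and seed below are our own choices:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 200
t = rng.uniform(-1.0, 1.0, size=N)

# Ground-truth latent signals: one shared dimension, one private per view.
alpha1 = np.vstack([np.sin(2 * np.pi * t), np.cos(np.pi ** 2 * t)])
alpha2 = np.vstack([np.sin(2 * np.pi * t), np.cos(np.sqrt(5) * np.pi * t)])

# Random 20-D projections plus Gaussian noise (variance 0.01), and a
# correlated-noise component 0.02*sin(3.6*pi*t) added to both views.
noise_corr = 0.02 * np.sin(3.6 * np.pi * t)
X = []
for a in (alpha1, alpha2):
    P = rng.standard_normal((20, a.shape[0]))
    X.append(P @ a + 0.1 * rng.standard_normal((20, N)) + noise_corr)
```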
The observations X(v)\nwere generated by randomly projecting the \u03b1(v) into 20-dimensional spaces and adding Gaussian\nnoise with variance 0.01. Finally, we added noise of the form ynoise = 0.02 sin(3.6\u03c0t) to both\nviews to simulate highly correlated noise. The input views are depicted in Fig. 2(c)\nTo initialize our method, we \ufb01rst applied PCA separately on both views, as well as on the con-\ncatenation of the views, and in each case, kept the components representing 95% of the variance.\nWe took \u03b1 as the concatenation of the corresponding weights. Note that the fact that this latent\nspace is redundant is dealt with by our regularization on \u03b1. We then alternately optimized D and\n\u03b1, and let the algorithm determine the optimal latent dimensionality. Fig. 2(e,f) depicts the recon-\nstructed latent spaces for both views, as well as the learned dictionaries, which clearly show the\nshared-private factorization. In Fig. 2(d), we show the results obtained with CCA. Note that our\napproach correctly discovered the original generative signals and discarded the noise, whereas CCA\nrecovered the shared signal, but also the correlated noise and an additional noise. This con\ufb01rms\nthat our approach is well-suited to learn shared-private factorizations, and shows that CCA-based\napproaches [1, 11] tend to be sensitive to noise.\n\n4.2 Human Pose Estimation\n\nWe then applied our method to the problem of human pose estimation, in which the task is to recover\n3D poses from 2D image features. It has been shown that this problem is ambiguous, and that shared-\nprivate factorizations helped accounting for these ambiguities. Here, we used the HumanEva dataset\n[22] which consists of synchronized images and motion capture data describing the 3D locations of\nthe 19 joints of a human skeleton. 
These two types of observations can be seen as two views of the\nsame problem from which we can learn a latent space.\nIn our experiments, we compare our results with those of several regression methods that directly\nlearn a mapping from image features to 3D poses. In particular, we used linear regression (Lin-\nReg), Gaussian Process regression with a linear kernel (GP-lin) and with an RBF kernel (GP-rbf),\nand nearest-neighbor in the feature space (NN). We also compare our results with those obtained\nwith the FOLS-GPLVM [20], which also proposes a shared-private factorization of the latent space.\nNote that we did not compare against other shared-private factorizations [7, 14], or purely shared\n\n6\n\n\u2212101Shared\u2212101Private1\u2212101Shared\u2212101Private2\u22120.0200.02CorrelatedNoise\u22120.500.5X(1)\u22120.500.5X(2)\u2212101\u2212101\u2212101\u2212101\u2212101\u2212101Dictionary for View 11235101520Dictionary for View 21235101520\fData\nJogging\nWalking\n\nLin-Reg GP-lin GP-rbf\n1.396\n1.420\n2.167\n2.330\n\n1.429\n2.363\n\nNN\n1.436\n2.175\n\nFOLS Our Method\n1.461\n2.137\n\n0.954\n1.322\n\nTable 2: Mean squared errors between the ground truth and the reconstructions obtained by different\nmethods.\n\n(a) jogging\n\n(b) walking\n\n(c) walking with multiple features\n\nFigure 3: Dictionaries learned from the HumanEva data. Each column corresponds to a dictionary\nentry. (a) and (b) show the 2-view case, and (c) shows a three-view case. Note that in (c) our model\nfound latent dimensions shared among all views, but also shared between the image features only.\n\nmodels [21, 6, 15, 23], since they were shown to be outperformed by the FOLS-GPLVM [20] for\nhuman pose estimation.\nTo initialize the latent spaces for our model and for the FOLS-GPLVM, we proceeded similarly\nas for the toy example; We applied PCA on both views separately, as well as on the concatenated\nviews, and retained the components representing 95% of the variance. 
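The initialization described above (PCA on each view and on the concatenated views, keeping components that explain 95% of the variance, then concatenating the resulting weights) might be sketched as follows; helper names are ours:

```python
import numpy as np

def pca_weights(X, var_kept=0.95):
    """Project the columns of X onto the leading principal components that
    together explain `var_kept` of the variance; return the component weights."""
    Xc = X - X.mean(axis=1, keepdims=True)
    U, s, _ = np.linalg.svd(Xc, full_matrices=False)
    ratio = np.cumsum(s ** 2) / np.sum(s ** 2)
    k = int(np.searchsorted(ratio, var_kept)) + 1
    return U[:, :k].T @ Xc          # k x N weights

def init_alpha(X_views, var_kept=0.95):
    """Initial latent embedding: per-view PCA weights stacked with the
    weights of a joint PCA on the concatenated views."""
    blocks = [pca_weights(X, var_kept) for X in X_views]
    blocks.append(pca_weights(np.vstack(X_views), var_kept))
    return np.vstack(blocks)
```

The redundancy between the per-view and joint blocks is deliberate: the row-sparsity regularizer on α is what subsequently removes duplicated dimensions.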
In our case, we set \u03b1 to be the\nconcatenation of the corresponding PCA weights. For the FOLS-GPLVM, we initialized the shared\nlatent space with the coef\ufb01cients of the joint PCA, and the private spaces with those of the individual\nPCAs. We performed cross validation on the jogging data, and the optimal setting \u03bb = 0.01 and\n\u03b3 = 0.1 was then \ufb01xed for all experiments.\nAt inference for human pose estimation, only one of the views (i.e., the images) is available. As\nshown in Section 2.4, our model provides a natural way to deal with this case by computing the\nlatent variables from the image features \ufb01rst, and then recovering the 3D coordinates using the\nlearned dictionary. For the FOLS-GPLVM, we followed the same strategy as in [20]; we computed\nthe nearest-neighbor among the training examples in image feature space and took the corresponding\nshared and private latent variables that we mapped to the pose. No special care was required for the\nother baselines, since they explicitly take the images as inputs and the poses as outputs.\nAs a \ufb01rst case, we used hierarchical features [10] computed on the walking and jogging video\nsequences of the \ufb01rst subject seen from a single camera. As the subject moves in circles, we used\nthe \ufb01rst loop to train our model, and the remaining ones for testing. Table 2 summarizes the mean\nsquared reconstruction error for all the methods. Note that our approach yields a smaller error than\nthe other methods. In Fig. 3(a,b), we show the factorization of the latent space obtained by our\napproach by displaying the learned dictionaries 2. For the jogging case our algorithm automatically\nfound a low-dimensional latent space of 10 dimensions, with a 4D private space for the image\nfeatures, a 4D shared space, and a 2D private space for the 3D pose3. 
For the walking case, the\n\n2Note that the latent space per se is a dense, low-dimensional space, and whether a dimension is private or\n\nshared among multiple views is determined by the corresponding dictionary entries.\n\n3A latent dimension is considered private if the norm of the corresponding dictionary entry in the other view\n\nis smaller than 10% of the average norm of the dictionary entries for that view.\n\n7\n\nImage Features246810510153D Pose2468101020304050Image Features2468101214510153D Pose24681012141020304050PHOG51015202040RT51015202040603D Pose51015201020304050\fFeature\nPHOG\nRT\nPHOG+RT\n\nLin-Reg GP-lin GP-rbf\n0.839\n1.190\n0.827\n1.345\n1.159\n0.727\n\n1.167\n1.272\n1.042\n\nNN\n1.279\n1.067\n1.090\n\nFOLS Our Method\n1.277\n1.068\n1.015\n\n0.778\n1.141\n0.769\n\n\u03bb = 0\n2.886\n3.962\n1.306\n\n\u03b3 = 0\n0.863\n1.235\n0.794\n\nTable 3: Mean squared errors for different choices of image features. The last two columns show\nthe result of our method while forcing one regularization term to be zero. See text for details.\n\n(a) PHOG\n\n(b) RT\n\n(c) PHOG+RT\n\nFigure 4: Mean squared error as a function of the number of training examples using PHOG features\nonly, RT features only, or both feature types simultaneously.\n\nprivate space for the image features was found to be higher-dimensional. This can partially explain\nwhy the other methods did not perform as well as in the jogging case.\nNext, we evaluated the performance of the same algorithms for different image features. In particu-\nlar, we used randomized tree (RT) features generated by [19], and PHOG features [5]. For this case,\nwe only considered the walking sequence and similarly trained the different methods using the \ufb01rst\ncycle and tested on the rest of the sequence. The top two rows of Table 3 show the results of the\ndifferent approaches for the individual features. 
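The criterion of footnote 3, under which a latent dimension counts as used by a view when its dictionary-column norm is at least 10% of that view's average column norm, could be implemented as below; the function name and the tuple-of-views output format are our own choices:

```python
import numpy as np

def classify_dimensions(D_views, rel_thresh=0.10):
    """Label each latent dimension with the tuple of views that use it,
    following the 10%-of-average-norm rule: a dimension is private to a view
    when no other view's dictionary column for it passes the threshold."""
    norms = [np.linalg.norm(D, axis=0) for D in D_views]
    Nd = D_views[0].shape[1]
    return [tuple(v for v, n in enumerate(norms)
                  if n[j] >= rel_thresh * n.mean())
            for j in range(Nd)]
```

For example, with a dimension used by both views, one used only by view 0, and one used only by view 1, the labels come out as (0, 1), (0,), and (1,).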
Note that, with the RT features that were designed\nto eliminate the ambiguities in pose estimation, GP regression with an RBF kernel performs slightly\nbetter than us. However, this result is outperformed by our model with PHOG features.\nTo show the ability of our method to model more than two views, we learned a latent space by\nsimultaneously using RT features, PHOG features and 3D poses. The last row of Table 3 shows\nthe corresponding reconstruction errors. In this case, we used the concatenated features as input\nto Lin-Reg, GP-lin and NN. For GP-rbf, we relied on kernel combination to predict the pose from\nmultiple features. For the FOLS model, we applied the following inference strategy. We computed\nthe NN in feature space for both features individually and took the mean of the corresponding\nshared latent variables. We then obtained the private part by computing the NN in shared space\nand taking the corresponding private variables. Note that this proved more accurate than using NN\non a single view, or on the concatenated views. Also, notice in Table 3 that the performance drops\nwhen structured sparsity is only imposed on either D\u2019s or \u03b1, showing the advantage of our model\nover simple structured sparsity approaches. Fig. 3(c) depicts the dictionary found by our method.\nNote that our approach allowed us to \ufb01nd latent dimensions shared among all views, as well as\nshared among the image features only.\nFinally, we studied the in\ufb02uence of the number of training examples on the performance of the\ndifferent approaches. To this end, we varied the training set size from 5 to 100, and, for each size,\nrandomly sampled 10 different training sets on the \ufb01rst walking cycle. In all cases, we kept the same\ntest set as before. Fig. 4 shows the mean squared errors averaged over the 10 different sets as a\nfunction of the number of training examples. 
Note that, with small training sets, our method yields more accurate results than the baselines.
5 Conclusion
In this paper, we have proposed an approach to learning a latent space factorized into dimensions shared across subsets of the views and dimensions private to each individual view. To this end, we have exploited the notion of structured sparsity, and have shown that multi-view learning can be addressed by alternately solving two convex optimization problems. We have demonstrated the effectiveness of our approach on the task of estimating 3D human pose from image features. In the future, we intend to study the application of our model to other tasks, such as classification. To this end, we would extend our approach to incorporate an additional group sparsity regularizer on the latent variables to encode class membership.

References
[1] C. Archambeau and F. Bach. Sparse probabilistic projections. In Neural Information Processing Systems, 2008.
[2] F. Bach, G. Lanckriet, and M. Jordan. Multiple kernel learning, conic duality, and the SMO algorithm. In International Conference on Machine Learning, 2004.
[3] F. R. Bach and M. I. Jordan. A probabilistic interpretation of canonical correlation analysis. Technical Report 688, Department of Statistics, University of California, Berkeley, 2005.
[4] S. Bengio, F. Pereira, Y. Singer, and D. Strelow. Group sparse coding. In Neural Information Processing Systems, 2009.
[5] A. Bosch, A. Zisserman, and X. Munoz. Image classification using random forests and ferns.
In International Conference on Computer Vision, 2007.
[6] C. H. Ek, P. Torr, and N. Lawrence. Gaussian process latent variable models for human pose estimation. In Joint Workshop on Machine Learning and Multimodal Interaction, 2007.
[7] C. H. Ek, P. Torr, and N. Lawrence. Ambiguity modeling in latent spaces. In Joint Workshop on Machine Learning and Multimodal Interaction, 2008.
[8] A. Geiger, R. Urtasun, and T. Darrell. Rank priors for continuous non-linear dimensionality reduction. In Conference on Computer Vision and Pattern Recognition, 2009.
[9] R. Jenatton, G. Obozinski, and F. Bach. Structured sparse principal component analysis. In International Conference on Artificial Intelligence and Statistics, 2010.
[10] A. Kanaujia, C. Sminchisescu, and D. N. Metaxas. Semi-supervised hierarchical models for 3d human pose reconstruction. In Conference on Computer Vision and Pattern Recognition, 2007.
[11] A. Klami and S. Kaski. Probabilistic approach to detecting dependencies between data sets. Neurocomputing, 72:39–46, 2008.
[12] M. Kuss and T. Graepel. The geometry of kernel canonical correlation analysis. Technical Report TR-108, Max Planck Institute for Biological Cybernetics, Tübingen, Germany, 2003.
[13] H. Lee, A. Battle, R. Raina, and A. Y. Ng. Efficient sparse coding algorithms. In Neural Information Processing Systems, 2006.
[14] G. Leen. Context assisted information extraction. PhD thesis, University of the West of Scotland, Paisley, Scotland, 2008.
[15] R. Navaratnam, A. Fitzgibbon, and R. Cipolla. The joint manifold model for semi-supervised multi-valued regression. In International Conference on Computer Vision, 2007.
[16] B. Olshausen and D. Field. Emergence of simple-cell receptive field properties by learning a sparse code for natural images.
Nature, 381:607–609, 1996.
[17] A. Quattoni, X. Carreras, M. Collins, and T. Darrell. An efficient projection for l1,∞ regularization. In International Conference on Machine Learning, 2009.
[18] A. Quattoni, M. Collins, and T. Darrell. Transfer learning for image classification with sparse prototype representations. In Conference on Computer Vision and Pattern Recognition, 2008.
[19] G. Rogez, J. Rihan, S. Ramalingam, C. Orrite, and P. Torr. Randomized trees for human pose detection. In Conference on Computer Vision and Pattern Recognition, 2008.
[20] M. Salzmann, C.-H. Ek, R. Urtasun, and T. Darrell. Factorized orthogonal latent spaces. In International Conference on Artificial Intelligence and Statistics, 2010.
[21] A. P. Shon, K. Grochow, A. Hertzmann, and R. P. N. Rao. Learning shared latent structure for image synthesis and robotic imitation. In Neural Information Processing Systems, 2006.
[22] L. Sigal and M. J. Black. HumanEva: Synchronized video and motion capture dataset for evaluation of articulated human motion. Technical Report CS-06-08, Brown University, 2006.
[23] L. Sigal, R. Memisevic, and D. J. Fleet. Shared kernel information embedding for discriminative inference. In Conference on Computer Vision and Pattern Recognition, 2009.
[24] S. Sonnenburg, G. Rätsch, C. Schäfer, and B. Schölkopf. Large scale multiple kernel learning. Journal of Machine Learning Research, 7:1531–1565, 2006.
[25] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58:267–288, 1996.
[26] M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables.
Journal of the Royal Statistical Society, Series B, 68:49–67, 2006.