{"title": "Sparse Coding for Learning Interpretable Spatio-Temporal Primitives", "book": "Advances in Neural Information Processing Systems", "page_first": 1117, "page_last": 1125, "abstract": "Sparse coding has recently become a popular approach in computer vision to learn dictionaries of natural images. In this paper we extend sparse coding to learn interpretable spatio-temporal primitives of human motion. We cast the problem of learning spatio-temporal primitives as a tensor factorization problem and introduce constraints to learn interpretable primitives. In particular, we use group norms over those tensors, diagonal constraints on the activations as well as smoothness constraints that are inherent to human motion. We demonstrate the effectiveness of our approach to learn interpretable representations of human motion from motion capture data, and show that our approach outperforms recently developed matching pursuit and sparse coding algorithms.", "full_text": "Sparse Coding for Learning Interpretable Spatio-Temporal Primitives\n\nTaehwan Kim, TTI Chicago, taehwan@ttic.edu\nGregory Shakhnarovich, TTI Chicago, gregory@ttic.edu\nRaquel Urtasun, TTI Chicago, rurtasun@ttic.edu\n\nAbstract\n\nSparse coding has recently become a popular approach in computer vision to learn dictionaries of natural images. In this paper we extend the sparse coding framework to learn interpretable spatio-temporal primitives. We formulate the problem as a tensor factorization problem with tensor group norm constraints over the primitives, diagonal constraints on the activations that provide interpretability, as well as smoothness constraints that are inherent to human motion. 
We demonstrate the effectiveness of our approach to learn interpretable representations of human motion from motion capture data, and show that our approach outperforms recently developed matching pursuit and sparse coding algorithms.\n\n1 Introduction\n\nIn recent years sparse coding has become a popular paradigm for learning dictionaries of natural images [10, 1, 4]. The learned representations have proven very effective in computer vision tasks such as image denoising [4], inpainting [10, 8] and object recognition [1]. In these approaches, sparse coding is formulated as the sum of a data fitting term, typically the Frobenius norm, and a regularization term that imposes sparsity. The ℓ1 norm is typically used since, unlike other sparsity penalties such as the ℓ0 pseudo-norm, it is convex.\nHowever, the sparsity induced by these norms is local: the estimated representations are sparse in that most of the activations are zero, but the sparsity has no structure, i.e., there is no preference as to which coefficients are active. Mairal et al. [9] extend the sparse coding formulation of natural images to impose structure by first clustering the set of image patches and then learning a dictionary in which members of the same cluster are encouraged to share sparsity patterns. In particular, they use group norms so that the sparsity patterns are shared within a group.\nHere we are interested in the problem of learning dictionaries of human motion. Learning spatio-temporal representations of motion has been addressed in the neuroscience and motor control literature, in the context of motor synergies [13, 5, 14]. However, most approaches have focused on learning static primitives, such as those obtained by linear subspace models applied to individual frames of motion [12, 15].\nOne notable exception is the work of d'Avella et al. 
[3], where the goal was to recover primitives from time series of EMG signals recorded from a set of frog muscles. Using matching pursuit [11] and an ℓ0-type regularization as the underlying mechanism to learn primitives, [3] performed matrix factorization of the time series. The recovered factors represent the primitive dictionary and the primitive activations. However, this technique suffers from the inherent limitations of the ℓ0 regularization, which is combinatorial in nature and thus difficult to optimize; therefore [3] resorted to a greedy algorithm that is subject to the inherent limitations of such an approach.\nIn this paper we propose to extend the sparse coding framework to learn motion dictionaries. In particular, we cast the problem of learning spatio-temporal primitives as a tensor factorization problem and introduce tensor group norms over the primitives that encourage sparsity in order to learn the number of elements in the dictionary. The introduction of additional diagonal constraints on the activations, as well as smoothness constraints that are inherent to human motion, allows us to learn interpretable representations of human motion from motion capture data. As demonstrated in our experiments, our approach outperforms state-of-the-art matching pursuit [3], as well as recently developed sparse coding algorithms [7].\n\n2 Sparse coding for motion dictionary learning\n\nIn this section we first review the framework of sparse coding, and then show how to extend this framework to learn interpretable dictionaries of human motion.\n\n2.1 Traditional sparse coding\n\nLet Y = [y1, · · · , yN] be the matrix formed by concatenating the set of training examples drawn i.i.d. from p(y). 
Sparse coding is usually formulated as a matrix factorization problem composed of a data fitting term, typically the Frobenius norm, and a regularizer that encourages sparsity of the activations\n\nmin_{W,H} ||Y − WH||_F^2 + λ ψ(H) ,\n\nor equivalently\n\nmin_{W,H} ||Y − WH||_F^2   subject to   ψ(H) ≤ δ_sparse ,\n\nwhere λ and δ_sparse are parameters of the model. Additional bounding constraints on W are typically employed since there is an ambiguity in the scaling of W and H. In this formulation W is the dictionary, with w_i the dictionary elements, H is the matrix of activations, and ψ(H) is a regularizer that induces sparsity. Solving this problem involves a non-convex optimization. However, solving with respect to W or H alone is convex if ψ is a convex function of H. As a consequence, ψ is usually taken to be the ℓ1 norm, i.e., ψ(H) = Σ_{i,j} |h_{i,j}|, and an alternate minimization scheme is typically employed [7].\nIf the problem has more structure, one would like to use this structure in order to learn non-local sparsity patterns. Mairal et al. [9] exploit group norm sparsity priors to learn dictionaries of natural images by first clustering the training image patches, and then learning a dictionary where members of the same cluster are encouraged to share sparsity patterns. In particular, they use the ℓ2,1 norm defined as ψ(H) = Σ_k ||h_k||_2, where h_k are the elements of H that are members of the k-th group. Note that the members of a group do not need to be rows or columns; more complex group structures can be employed [6].\nHowever, the structure imposed by these group norms is not sufficient for learning interpretable motion primitives. We now show how, in the case of motion, we can consider the activations and the primitives as tensors and impose group norm sparsity on the tensors. 
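Before moving to tensors, the traditional ℓ1-regularized formulation above can be sketched in a few lines. The following is an illustrative sketch, not the implementation used in the paper: H is updated by proximal-gradient (ISTA) steps and W by least squares, with the dictionary columns renormalized to resolve the scaling ambiguity mentioned above.

```python
import numpy as np

def soft_threshold(X, t):
    # proximal operator of the l1 norm
    return np.sign(X) * np.maximum(np.abs(X) - t, 0.0)

def sparse_coding(Y, K, lam=0.1, n_iter=50, seed=0):
    """Alternate minimization for min ||Y - W H||_F^2 + lam * sum|H_ij|."""
    rng = np.random.default_rng(seed)
    D, N = Y.shape
    W = rng.standard_normal((D, K))
    W /= np.linalg.norm(W, axis=0, keepdims=True)   # unit-norm dictionary columns
    H = np.zeros((K, N))
    for _ in range(n_iter):
        # ISTA steps on H; step size 1/L with L the Lipschitz constant of the gradient
        L = np.linalg.norm(W, 2) ** 2
        for _ in range(10):
            H = soft_threshold(H - (W.T @ (W @ H - Y)) / L, lam / L)
        # least-squares update of W, then push the scale into H
        W = Y @ np.linalg.pinv(H)
        norms = np.maximum(np.linalg.norm(W, axis=0, keepdims=True), 1e-12)
        W /= norms
        H *= norms.T
    return W, H
```

The column renormalization is one common way to handle the W/H scaling ambiguity; bounding ||W|| by a constant, as in the constrained formulation, would work equally well.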
Moreover, we impose additional constraints such as continuity and differentiability, which are inherent to human motion data, as well as diagonal constraints that ensure interpretability.\n\n2.2 Motion dictionary learning\n\nLet Y ∈ R^{D×L} be a D-dimensional signal of temporal length L. We formulate the problem of learning dictionaries of human motion as a tensor factorization problem where the matrix W is now a tensor, W ∈ R^{D×P×Q}, encoding temporal and spatial information, with D the dimensionality of the observations, P the number of primitives, and Q the length of the primitives. H is now also defined as a tensor, H ∈ R^{Q×P×L}, with L the temporal length of the sequence. For simplicity in the discussion we assume that the primitives have the same length. This restriction can be easily removed by setting Q to be the maximum length of the primitives and padding the remaining elements with zeros. We thus define the data term to be\n\nℓ_data = ||Y − vec(W) vec(H)||_F    (2)\n\nwhere vec(W) ∈ R^{D×PQ} and vec(H) ∈ R^{QP×L} are projections of the tensors to be represented as matrices, i.e., flattening.\n\nFigure 1: Walking dataset composed of multiple walking cycles performed by the same subject. (left, center) Projection of the data onto the first two principal components of walking. This is the data to be recovered. (right) Training error as a function of the number of iterations. Note that our approach converges after only a few iterations.\n\nWhen learning dictionaries of human motion, there is additional structure, and there are constraints that one would like the dictionary elements to satisfy. One important property of human motion is that it is smooth. 
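To make the flattening in Eq. (2) concrete, here is a small sketch. The paper does not spell out the flattening order, so the pairing of the (p, q) axes of W with the (q, p) axes of H below is our assumption; any consistent pairing gives the same data term.

```python
import numpy as np

# Sketch of the flattened data term of Eq. (2), under an assumed flattening order.
rng = np.random.default_rng(0)
D, P, Q, L = 6, 2, 5, 30
W = rng.standard_normal((D, P, Q))   # P primitives of length Q in D dimensions
H = rng.random((Q, P, L))            # activations over L frames

W_mat = W.reshape(D, P * Q)                     # vec(W): D x PQ, p-major columns
H_mat = H.transpose(1, 0, 2).reshape(P * Q, L)  # vec(H): PQ x L, matching row order

Y = rng.standard_normal((D, L))
l_data = np.linalg.norm(Y - W_mat @ H_mat, ord='fro')
```

The matrix product then sums over both the primitive index p and the within-primitive time index q, which is exactly the contraction the tensor factorization requires.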
We impose continuity and differentiability constraints by adding a regularization term that encourages smooth curvature, i.e., φ(W) = Σ_{p=1}^{P} ||∇² W_{p,:,:}||_F.\nOne of the main difficulties with learning motion dictionaries is that the dictionary words might have very different temporal lengths. Note that this problem does not arise in traditional dictionary learning of natural images, since the size of the dictionary words is manually specified [4, 1, 9]. This makes the learning problem more complex, since one would like to identify not only the number of elements in the dictionary, but also the size of each dictionary word. We address this problem by adding a regularization term that prefers dictionaries with a small number of primitives, as well as primitives of short length. In particular, we extend the group norms over matrices to group norms over tensors and define\n\nℓ_{p,q,r}(W) = ( Σ_{i=1}^{P} ( Σ_{j=1}^{Q} ( Σ_{k=1}^{D} |W_{i,j,k}|^p )^{q/p} )^{r/q} )^{1/r}\n\nwhere W_{i,j,k} is the k-th dimension at the j-th time frame of the i-th primitive in W.\nWe would also like to impose additional constraints on the activations H. For interpretability, we would like to have only positive activations. Moreover, since the problem is under-constrained, i.e., H and W can only be recovered up to an invertible transformation WH = (WC⁻¹)(CH), we impose that the elements of the activation tensor should be in the unit interval, i.e., H_{i,j,k} ∈ [0, 1]. As in traditional sparse coding, we encourage the activations to be sparse. We impose this by bounding their ℓ1 norm. 
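The two regularizers just introduced can be written down directly. The sketch below follows the index convention of the tensor group norm (i = primitive, j = time frame, k = dimension) and uses second temporal differences as a discrete ∇²; both choices are our reading of the text, not the authors' code.

```python
import numpy as np

def tensor_group_norm(W, p=2, q=1, r=1):
    """l_{p,q,r}(W) with W indexed as (primitive i, time frame j, dimension k)."""
    inner = np.sum(np.abs(W) ** p, axis=2) ** (q / p)  # collapse dimensions k
    mid = np.sum(inner, axis=1) ** (r / q)             # collapse time frames j
    return np.sum(mid) ** (1.0 / r)                    # collapse primitives i

def smoothness(W):
    """phi(W): sum over primitives of the Frobenius norm of the second
    temporal difference, a discrete curvature measure."""
    d2 = np.diff(W, n=2, axis=1)
    return sum(np.linalg.norm(d2[i], 'fro') for i in range(W.shape[0]))
```

With (p, q, r) = (2, 1, 1), the norm sums the ℓ2 norms of all time-frame slices, so entire frames (and, in turn, entire primitives) are encouraged to switch off together, which is what lets the model prune both primitive length and primitive count.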
Finally, to impose interpretability of the results as spatio-temporal primitives, we impose that when a spatio-temporal primitive is active, it should be active across its whole time-length with constant activation strength, i.e., ∀ i, j, k, H_{i,j,k} = H_{i,j+1,k+1}.\nWe thus formulate the problem of learning motion dictionaries as the one of solving the following optimization problem\n\nmin_{W,H} ||Y − vec(W) vec(H)||_F + λ φ(W) + η ℓ_{p,q,r}(W)\nsubject to 0 ≤ H_{i,j,k} ≤ 1, H_{i,j,k} = H_{i,j+1,k+1}, Σ_{i,j} H_{i,j,k} ≤ δ_train, ∀ i, j, k    (3)\n\nwhere δ_train, λ and η are parameters of our model.\nWhen optimizing over W or H alone the problem is convex. We thus perform alternate minimization. Our algorithm converges to a local minimum; the proof is similar to the convergence proof of block coordinate descent, see Prop. 2.7.1 in [2].\n\nFigure 2: Estimation of W and H when the number of primitives is unknown, using (top) matching pursuit without refractory period (W1-MP-NR, W2-MP-NR, H-MP-NR), (second row) matching pursuit with refractory period [3] (W1-MP, W2-MP, H-MP), (third row) traditional sparse coding (W1-SC, W2-SC, H-SC) and (bottom) our approach (W1-Ours, W2-Ours, H-Ours). Note that our approach is able to recover the primitives, their number and the correct activations. Matching pursuit is able to recover the number of primitives when using a refractory period; however, the activations and the primitives are not correct. 
When we do not use the refractory period, the recovered primitives are very noisy. Sparse coding has a low reconstruction error, but neither the number of primitives, nor the primitives and the activations are correctly recovered.\n\n3 Experimental Evaluation\n\nWe compare our algorithm to two state-of-the-art approaches in the task of discovering interpretable primitives from motion capture data, namely, the sparse coding approach of [7] and matching pursuit [3]. In the following, we first describe the baselines in detail. We then demonstrate our method's ability to estimate the primitives, their number, as well as the activation patterns. We then show that our approach outperforms matching pursuit and sparse coding when learning dictionaries of walking and running motions. For all experiments we set δ_train = 1, δ_test = 1.3, λ = 1 and η = 0.05 and use the ℓ_{2,1,1} norm. Note that similar results were obtained with the ℓ_{2,2,1} norm. For SC we use β = 0.01 and c is set to the maximum value of the ℓ2 norm. 
The threshold for MP with refractory period is set to 0.1.\n\nFigure 3: Error as a function of the dimension when adding Gaussian noise of variance 50 and 100. (Top) Walking, (bottom) running. Panels: (walk, σ² = 50, e_59D), (walk, σ² = 100, e_59D), (walk, σ² = 50, e_PCA), (walk, σ² = 100, e_PCA), (run, σ² = 50, e_59D), (run, σ² = 100, e_59D), (run, σ² = 50, e_PCA), (run, σ² = 100, e_PCA).\n\nMatching pursuit (MP): We follow a similar approach to [3], where an alternate minimization over W and H is employed. For each iteration in the alternate minimization, W is optimized by minimizing ℓ_data defined in Eq. (2) until convergence. For each iteration in the optimization of H, an over-complete dictionary D is created by taking the primitives in W and generating candidates by shifting each primitive in time. Note that the cardinality of the candidate dictionary is |D| = P(L + Q − 1) if W has P primitives and the data is composed of L frames. 
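The shifted-candidate construction just described can be sketched as follows. This is a hypothetical re-implementation for illustration (indexing conventions are ours): each primitive is placed at every onset that overlaps the L-frame signal, giving P(L + Q − 1) zero-padded candidates.

```python
import numpy as np

def shifted_candidates(W, L):
    """Build the over-complete candidate dictionary by time-shifting primitives.

    W: array of shape (D, P, Q) -- P primitives of length Q in D dimensions.
    Returns a list of P*(L+Q-1) candidate signals, each of shape (D, L).
    """
    D, P, Q = W.shape
    candidates = []
    for p in range(P):
        for s in range(-(Q - 1), L):      # onset of the shifted primitive
            c = np.zeros((D, L))
            lo, hi = max(s, 0), min(s + Q, L)   # visible portion of the primitive
            c[:, lo:hi] = W[:, p, lo - s:hi - s]
            candidates.append(c)
    return candidates
```

Greedy selection then repeatedly picks the candidate with the largest scalar product against the current residual; the refractory-period variant additionally discards all candidates overlapping a selected one.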
Once the dictionary is created, a set of primitives is iteratively selected (one at a time) by choosing at each iteration the primitive with the largest scalar product with the residual signal, i.e., the part of the signal that cannot be explained by the already selected primitives. Primitives are chosen until a threshold on the scalar product is reached. Note that this is an instance of Matching Pursuit [11], a greedy algorithm to solve an ℓ0-type optimization. Additionally, in the step of choosing elements of the dictionary, [3] introduced the refractory period, which means that when one element of the dictionary is chosen, all overlapping elements are removed from the dictionary. This is done to avoid multiple activations of primitives. In our experiments we compare our approach to matching pursuit with and without refractory period.\n\nSparse coding (SC): We use the sparse coding formulation of [7], which minimizes the Frobenius norm with an L1 regularization penalty on the activations\n\nmin_{W̄,H̄} ||Y − W̄H̄||_F + β Σ_{i,j} |H̄_{i,j}|\nsubject to ||W̄_{:,j}|| ≤ c, ∀ j\n\nwith β a constant trading off the relative influence of the data fitting term and the regularizer, and c a constant bounding the value of the primitives. Note that now W̄ and H̄ are matrices. Following [7], we solve this optimization problem by alternating between solving with respect to the primitives W̄ and the activations H̄.\n\n3.1 Estimating the number of primitives\n\nIn the first experiment we demonstrate the ability of our approach to infer the number of primitives as well as the length of the existing primitives. For this purpose we created a simple dataset composed of a single sequence of multiple walking cycles performed by the same subject from the CMU mocap dataset¹. 
We apply PCA to the data, reducing the dimensionality of the observations from 59D to 2D for each time instant. Fig. 1 depicts the projections of the data onto the first two principal components as a function of time. In this case it is easy to see that since the motion is periodic, the signal could be represented by a single 2D primitive whose length is equal to the length of the period.\n\n¹The data was obtained from mocap.cs.cmu.edu\n\nFigure 4: Error as a function of the Gaussian noise variance for 4D and 10D spaces learned from a dataset composed of a single subject. (Top) walking, (bottom) running. Panels: (walk, d=4, e_59D), (walk, d=10, e_59D), (walk, d=4, e_PCA), (walk, d=10, e_PCA), (run, d=4, e_59D), (run, d=10, e_59D), (run, d=4, e_PCA), (run, d=10, e_PCA).\n\nTo perform the experiments we initialize our approach and the baselines with a sum of random smooth functions (sinusoids) whose frequencies are different from the principal frequency of the periodic training data, and set the number of primitives to P = 2. One primitive is set to have approximately the same length as a cycle of the periodic motion and the other primitive is set to be 50% larger. 
Note that a rough estimate of the length of the primitives could be easily obtained by analyzing the principal frequencies of the signal. Fig. 2 depicts the results obtained by our approach and the baselines. The first two columns depict the two-dimensional primitives recovered (W1 and W2). Each plot represents vec(W_{i,:,:}) ∈ R^{(Q1+Q2)×1}. The dotted black line separates the two primitives. Note that we expect these primitives to be similar to the original signal, i.e., vec(W_{1,:,:}) similar to a period in Fig. 1 (left) and vec(W_{2,:,:}) to a period in Fig. 1 (right). The third column depicts the activations vec(H) ∈ R^{(Q1+Q2)×L} recovered. We expect the successful activations to be diagonal, and to appear only once every cycle.\nNote that our approach is able to recover the number of primitives as well as the primitives themselves and the correct activations. Matching pursuit without refractory period (first row) is not able to recover the primitives, their number, or the activations. Moreover, the estimated signal has high frequencies. Matching pursuit with refractory period (second row) is able to recover the number of primitives; however, the activations are underestimated and the primitives are not very accurate. Sparse coding has a low reconstruction error, but neither the primitives, their number, nor the activations are correctly recovered. This confirms the inability of traditional sparse coding to recover interpretable primitives, and the importance of having interpretability constraints such as the refractory period of matching pursuit and our diagonal constraints. Note also that, as shown in Fig. 1 (right), our approach converges in a few iterations.\n\n3.2 Quantitative analysis and comparisons\n\nWe evaluate the capabilities of our approach to reconstruct new sequences, and compare our approach to the baselines [3, 7] in a denoising scenario as well as when dealing with missing data. 
We preprocess the data by applying PCA to reduce the dimensionality of the input space. We measure error by computing the Frobenius norm between the test sequences and the reconstruction given by the learned W and the estimated activations H_test,\n\ne_pca = (1/D) ||V_test − vec(W) vec(H_test)||_F ,\n\nas well as the error in the original 59D space, which can be computed by projecting back into the original space using the singular vectors. Note that W is learned at training, and the activations H_test are estimated at inference time.\n\nFigure 5: Multiple subject error as a function of the dimension for noisy data with variance 100 and different numbers of primitives. As expected, one primitive is not enough for accurate reconstruction. Panels: (run, P=1, e_59D), (run, P=2, e_59D), (run, P=1, e_PCA), (run, P=2, e_PCA).\n\nFigure 6: Missing data and influence of initialization: Error in the 59D space when Q/2 and 2Q/3 of the data are missing. The primitives are initialized either randomly or to a smooth set of sinusoids of random frequencies. Panels: (smooth, Q/2, e_59D), (random, Q/2, e_59D), (smooth, 2Q/3, e_59D), (random, 2Q/3, e_59D). 
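The two error measures, e_pca and its 59D counterpart, can be sketched as follows. The 1/D scaling of the 59D error and the use of an orthonormal PCA basis U are our assumptions; the text only states that the reconstruction is projected back with the singular vectors.

```python
import numpy as np

def e_pca(V_test, W_mat, H_test, D):
    """Frobenius error in the d-dimensional PCA space, scaled by 1/D."""
    return np.linalg.norm(V_test - W_mat @ H_test, 'fro') / D

def e_59d(V59_test, W_mat, H_test, U):
    """Error in the original space: the low-dimensional reconstruction is
    mapped back through U (59 x d, top singular vectors) before comparing."""
    return np.linalg.norm(V59_test - U @ (W_mat @ H_test), 'fro') / U.shape[0]
```

Note that the 59D error also accounts for the variance discarded by PCA, since V59_test contains components outside the subspace spanned by U.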
To evaluate the generalization properties of each algorithm, we compute both errors in a denoising scenario, where H_test is obtained using V̂_test = V_test + ε, with ε i.i.d. Gaussian noise, and the errors are computed using the ground truth data V_test. For each experiment we use P = 1, η = 0.05, δ_train = 1, δ_test = 1.3 and a rough estimate of Q, which can be easily obtained by examining the principal frequencies of the data [16]. The primitives are initialized to a sum of sinusoids of random frequencies.\nWe created a walking dataset composed of motions performed by the same subject. In particular we used motions {02, 03, 04, 05, 06, 07, 08, 09, 10, 11} of subject 35 in the CMU mocap dataset. We also performed reconstruction experiments for running motions and used motions {17, 18, 20, 21, 22, 23, 24, 25} from subject 35. In both cases, we use 2 sequences for training and the rest for testing, and report average results over 10 random splits. Fig. 3 depicts reconstruction error in PCA space and in the original space as a function of the noise variance. Fig. 4 depicts reconstruction error as a function of the dimensionality of the PCA space. Our approach outperforms matching pursuit with and without refractory period in all scenarios. Note that our method outperforms sparse coding when the output is noisy. This is due to the fact that, given a big enough dictionary, sparse coding overfits and can perfectly fit the noise.\nWe also performed reconstruction experiments for running motions performed by different subjects. In particular we use motions {03, 04, 05, 06} of subject 9 and motions {21, 23, 24, 25} of subject 35. Fig. 5 depicts reconstruction error for our approach when using different numbers of primitives. As expected, one primitive is not enough for accurate reconstruction. 
When using two primitives our approach performs comparably to sparse coding and clearly outperforms the other baselines.\nIn the next experiment we show the importance of having interpretable primitives. We compare our approach to the baselines in a missing data scenario, where part of the sequence, namely Q/2 or 2Q/3 frames, is missing. We use the single subject walking database.\n\nFigure 7: Influence of η and P on the single subject walking dataset, as well as using soft constraints instead of hard constraints on the activations. (left) Our method is fairly insensitive to the choice of η. As expected, the reconstruction error of the training data decreases when there is less regularization. The test error however is very flat, and increases when there is too much or too little regularization. For missing data, having good primitives is important, and thus regularization is necessary. Note that the horizontal axis depicts − log η, thus η decreases for larger values of this axis. (center) Error with (green) and without (red) missing data as a function of P. Our approach is not sensitive to the value of P; one primitive is enough for accurate reconstruction in this dataset. 
(right) Error when using soft constraints |H_{i,j,k} − H_{i,j+1,k+1}| ≤ α as a function of α. The leftmost point corresponds to α = 0, i.e., H_{i,j,k} = H_{i,j+1,k+1}.\n\nAs shown in Fig. 6, our approach clearly outperforms all the baselines. This is due to the fact that sparse coding has no structure, while the structure imposed by our equality constraints, i.e., ∀ i, j, k, H_{i,j,k} = H_{i,j+1,k+1}, helps “hallucinate” the missing data. We also investigate the influence of initialization by using a random non-smooth initialization and the smooth initialization described above, i.e., sinusoids of random frequencies. Note that, like our approach, sparse coding is not sensitive to initialization. This is in contrast with MP, which is very sensitive due to the ℓ0-type regularization.\nWe also investigated the influence of the amount of regularization on W. Towards this end we use the single subject walking dataset, and compute reconstruction error for the training and test data with and without missing data as a function of η. As shown in Fig. 7 (left), our method is fairly insensitive to the choice of η. As expected, the reconstruction error of the training data decreases when there is less regularization. The test error in the noiseless case is however very flat, and increases slightly when there is too much or too little regularization. When dealing with missing data, having good primitives becomes more important. Note that the horizontal axis depicts − log η, thus η decreases for larger values of the horizontal axis. The test error is higher than the training error for large η since we use δ_train = 1 and δ_test = 1.3; we are more conservative at learning since we want to learn interpretable primitives. We also investigate the sensitivity of our approach to the number of primitives. 
We use the single subject walking dataset and report errors averaged over 10 partitions of the data. As shown in Fig. 7 (middle), our approach is very insensitive to P; in this example a single primitive is enough for accurate reconstruction.\nWe finally investigate the influence of replacing the hard constraints on the activations by soft constraints |H_{i,j,k} − H_{i,j+1,k+1}| ≤ α. Note that our approach is not sensitive to the value of α and that the hard constraints (H_{i,j,k} = H_{i,j+1,k+1}), depicted in the leftmost point of Fig. 7 (right), are almost optimal. This justifies our choice, since when using hard constraints we do not need to search for the optimal value of α.\n\n4 Conclusion\n\nWe have proposed a sparse coding approach to learn interpretable spatio-temporal primitives of human motion. We have formulated the problem as a tensor factorization problem with tensor group norm constraints over the primitives, diagonal constraints on the activations, as well as smoothness constraints that are inherent to human motion. Our approach has proven superior to recently developed matching pursuit and sparse coding algorithms in the task of learning interpretable spatio-temporal primitives of human motion from motion capture data. In the future we plan to investigate applying similar techniques to learn spatio-temporal dictionaries of video data such as dynamic textures.\n\nReferences\n\n[1] S. Bengio, F. Pereira, Y. Singer, and D. Strelow. Group sparse coding. In NIPS, 2009.\n[2] D. P. Bertsekas. Nonlinear Programming. Athena Scientific, Belmont, Massachusetts, 1999.\n[3] A. d'Avella and E. Bizzi. 
Shared and specific muscle synergies in natural motor behaviors. PNAS, 102(8):3076–3081, 2005.\n[4] M. Elad and M. Aharon. Image denoising via sparse and redundant representations over learned dictionaries. IEEE Trans. on Image Processing, 15(12):3736–3745, 2006.\n[5] Z. Ghahramani. Building blocks of movement. Nature, 407:682–683, 2000.\n[6] R. Jenatton, G. Obozinski, and F. Bach. Structured sparse principal component analysis. In AISTATS, 2010.\n[7] H. Lee, A. Battle, R. Raina, and A. Y. Ng. Efficient sparse coding algorithms. In NIPS, 2007.\n[8] J. Mairal, F. Bach, J. Ponce, and G. Sapiro. Online dictionary learning for sparse coding. In ICML, 2009.\n[9] J. Mairal, F. Bach, J. Ponce, G. Sapiro, and A. Zisserman. Non-local sparse models for image restoration. In ICCV, 2009.\n[10] J. Mairal, G. Sapiro, and M. Elad. Learning multiscale sparse representations for image and video restoration. SIAM Multiscale Modeling and Simulation, 7(1):214–241, 2008.\n[11] S. G. Mallat and Z. Zhang. Matching pursuits with time-frequency dictionaries. IEEE Trans. Signal Processing, 41:3397–3415, 1993.\n[12] C. R. Mason, J. E. Gomez, and T. J. Ebner. Hand synergies during reach to grasp. J. of Neurophysiology, 86:2896–2910, 2001.\n[13] F. A. Mussa-Ivaldi and E. Bizzi. Motor learning: the combination of primitives. Phil. Trans. Royal Society London, Series B, 355:1755–1769, 2000.\n[14] F. A. Mussa-Ivaldi and S. Solla. Neural primitives for motion control. IEEE Journal of Oceanic Engineering, 29(3):640–650, 2004.\n[15] E. Todorov and Z. Ghahramani. Analysis of the synergies underlying complex hand manipulation. In Proceedings of the IEEE Engineering in Medicine and Biology Society Conference, pages 4637–4640, 2004.\n[16] R. Urtasun, D. J. Fleet, A. Geiger, J. Popovic, T. Darrell, and N. D. Lawrence. Topologically-constrained latent variable models. 
In ICML, 2008.\n", "award": [], "sourceid": 403, "authors": [{"given_name": "Taehwan", "family_name": "Kim", "institution": null}, {"given_name": "Gregory", "family_name": "Shakhnarovich", "institution": null}, {"given_name": "Raquel", "family_name": "Urtasun", "institution": null}]}