{"title": "Learning with Target Prior", "book": "Advances in Neural Information Processing Systems", "page_first": 2231, "page_last": 2239, "abstract": "In conventional approaches to supervised parametric learning, relations between data and target variables are provided through training sets consisting of pairs of corresponded data and target variables. In this work, we describe a new learning scheme for parametric learning, in which the target variables y can be modeled with a prior model p(y) and the relations between data and target variables are estimated through p(y) and a set of uncorresponded data X in training. We term this method learning with target priors (LTP). Specifically, LTP seeks the parameter θ that maximizes the log likelihood of f_θ(X) on an uncorresponded training set with regard to p(y). Compared to the conventional (semi-)supervised learning approach, LTP can make efficient use of prior knowledge of the target variables in the form of probabilistic distributions, and thus removes or reduces the reliance on training data in learning. Compared to the Bayesian approach, the learned parametric regressor in LTP can be more efficiently implemented and deployed in tasks where running efficiency is critical, such as on-line BCI signal decoding. We demonstrate the effectiveness of the proposed approach on parametric regression tasks for BCI signal decoding and pose estimation from video.", "full_text": "Learning with Target Prior

Zuoguan Wang, Dept. of ECSE, Rensselaer Polytechnic Inst., Troy, NY 12180, wangz6@rpi.edu
Siwei Lyu, Computer Science, Univ. at Albany, SUNY, Albany, NY 12222, lsw@cs.albany.edu
Gerwin Schalk, Wadsworth Center, NYS Dept. of Health, Albany, NY 12201, schalk@wadsworth.org
Qiang Ji, Dept. of ECSE, Rensselaer Polytechnic Inst., Troy, NY 12180, jiq@rpi.edu

Abstract

In conventional approaches to supervised parametric learning, relations between data and target variables are provided through training sets consisting of pairs of corresponded data and target variables. In this work, we describe a new learning scheme for parametric learning, in which the target variables y can be modeled with a prior model p(y) and the relations between data and target variables are estimated with p(y) and a set of uncorresponded data X in training. We term this method learning with target priors (LTP). Specifically, LTP seeks the parameter θ that maximizes the log likelihood of f_θ(X) on an uncorresponded training set with regard to p(y). Compared to the conventional (semi-)supervised learning approach, LTP can make efficient use of prior knowledge of the target variables in the form of probabilistic distributions, and thus removes or reduces the reliance on training data in learning. Compared to the Bayesian approach, the learned parametric regressor in LTP can be more efficiently implemented and deployed in tasks where running efficiency is critical. We demonstrate the effectiveness of the proposed approach on parametric regression tasks for BCI signal decoding and pose estimation from video.

1 Introduction

One of the central problems in machine learning is prediction/inference: given an input datum X, we would like to predict or infer the value of a target variable of interest, y, assuming X and y have some intrinsic relationship. The prediction/inference task in many practical applications involves high-dimensional and structured data and target variables.
Depending on the form of knowledge about X and y and their relationship available to us, there are several different methodologies to solve the prediction/inference problem.

In the Bayesian approach, our knowledge about input and target variables, as well as their relationships, is represented as probability distributions. Correspondingly, the prediction/inference task is solved with optimizations based on the posterior distribution p(y|X), a common choice of which is the maximum a posteriori objective: max_y p(y|X). The posterior distribution can be explicitly constructed from the target prior, p(y), which encodes our knowledge of the internal structure of the target y, and the likelihood, p(X|y), which summarizes the process of generating X from y, as p(y|X) ∝ p(X|y)p(y). Or it can be modeled directly, as in conditional random fields [9], without referring to the target prior or the likelihood. The advantage of the Bayesian approach is that it incorporates prior knowledge about data and target variables into the prediction/inference task in a principled manner. The main downside is that, in many practical problems, the relationship between X and y can be complicated and defy straightforward modeling. Furthermore, except for a few special cases (e.g., Gaussian models), Bayesian prediction/inference of y from data X usually requires expensive numerical optimization or Monte-Carlo sampling.

An alternative approach to prediction/inference is supervised parametric learning, where the information about X and y and their relationship is described in the form of a set of corresponding examples, {X_i, y_i}_{i=1}^m, and the goal of learning is to choose an optimal member from a parametric family f_θ(X) that minimizes the average prediction error under a loss function: min_θ (1/m) Σ_{i=1}^m L(y_i − f_θ(X_i)).
Usually, the optimization may also include a regularization penalty on θ to reduce over-fitting. The most significant drawback of the supervised parametric learning approach is that the learning performance relies heavily on the quality and quantity of the training data. This problem is somewhat alleviated in semi-supervised learning [28], where the training data include unlabeled examples of X. However, unlike the Bayesian approach, it is usually difficult to incorporate prior knowledge in the form of probabilistic distributions into (semi-)supervised parametric learning.

In this work, we describe a new approach to learning a parametric regressor f_θ(X), which we term learning with target priors (LTP). In many practical applications, the target variables y follow regular spatial and temporal patterns that can be described probabilistically, and the observed target variables are samples of such distributions. For instance, to perform an activity like grasping a cup, the traces of finger movements tend to have similar patterns, caused by many factors such as the underlying physiological, anatomical and dynamic constraints. Such regular patterns can benefit the task of decoding finger movements from ECoG signals in a brain-computer interface (BCI) system (Fig. 1), as they regularize the decoder to produce similar patterns. Similarly, the skeleton structure and the dynamic dependencies constrain the body pose to have similar spatial and temporal patterns for the same activity (e.g., walking, running and jumping), which can be used for body pose estimation in computer vision.

In LTP learning, we incorporate such regular spatial and temporal patterns of the target variables into the learning framework.
Specifically, we learn a probability distribution p(y) that captures the spatial and temporal regularities of the target variable y; we then estimate the function parameters θ by maximizing the log-likelihood of the output y = f_θ(X) with respect to the prior distribution. LTP learning can be applied both to unsupervised learning, in which no corresponded inputs and outputs are available, and to semi-supervised learning, in which part of the corresponding outputs are available. We demonstrate the effectiveness of LTP learning on two problems: BCI decoding and pose estimation.

The rest of the paper is organized as follows: Section 2 discusses related work. Section 3 describes the general framework for our method and compares it with other existing methodologies. Sections 4 and 5 describe the deployment and experimental evaluation of this general framework in two applications, namely BCI decoding and pose estimation from video. Section 6 concludes the paper with discussion and future work.

2 Related Work

LTP learning is related to several existing learning schemes. Prior knowledge about the target variables in classification problems is exploited in recent work on learning with uncertain labels, in which a distribution over the target class labels for each data example is used in place of corresponding pairs of data/target variables [10]. A similar idea in semi-supervised learning uses the proportion of different classes [16, 28] to predict the class labels on the uncorresponded training data examples. Knowledge about class proportions conditioned on certain input features is captured by generalized expectation (GE) [12, 13]. Several works directly embed domain constraints about the target variables in learning. For instance, constraint driven learning (CODL) [3] enforces task-specific constraints on the target labels by appending a penalty term to the objective function.
Posterior regularization [5] directly imposes regularization on the posterior of the latent target variables, of which CODL can be seen as a special case with a MAP approximation. A general framework, which incorporates prior information as measurements in the Bayesian framework, is proposed in [11]. However, all these approaches have only been applied to problems with discrete outputs (classification or labeling) and may be difficult to extend to incorporate complex dependencies in high-dimensional continuous target variables.

LTP learning is also related to learning with structured outputs. Dependencies in the target variables can be directly modeled in conditional random fields (CRF) [9], as a probabilistic graphical model over the output components. However, the learned regressor is usually not in closed form, and predictions have to be obtained by numerical optimization or Monte-Carlo sampling. Some recent supervised parametric learning methods can take advantage of structure constraints over the target variables. The max margin Markov network [21] trains an SVM classifier with outputs whose structures are described by graphs. The structured SVM was further extended with high-order loss functions [20] or models with latent variables [27]. These methods can be viewed as special cases of LTP learning, where general probabilistic models for target variables can be incorporated.

Figure 1: Experiment setup for this study.

3 General Framework

In this section, we describe the general framework of learning with target priors. Specifically, our task is to learn the parameter θ in a parametric family of functions of X, f_θ(X), to best predict the corresponding target variable y.
Both the data and target variable can be of high dimension. Knowledge about the target variable is provided through a target prior model in the form of a parametric probability distribution, p_η(y), with model parameter η. The specific form of p_η(y) depends on the application, ranging from simple distributions to more complex models such as Markov random fields. The model parameter θ is estimated by maximizing the log-likelihood log p_η(f_θ(X)). In the following, we apply LTP learning to unsupervised learning, in which no corresponded inputs and outputs are available, as well as to semi-supervised learning, in which part of the corresponding outputs are available.

For unsupervised learning, assume we are given a set of outputs y ∈ R^{Y×m}, as well as a set of uncorresponded inputs X ∈ R^{X×n}, where Y and X are the dimensionalities, and m and n are the temporal lengths of y and X, respectively. This is applicable to the case of BCI, where it is easier to gather inputs X or structured targets y than it is to gather corresponded inputs and targets (X, y). In many real BCI applications the input brain signals X are collected only under thoughts without actual body movement y, while the body movements can easily be collected when the brain signals are not being recorded. In the problem of pose estimation, it is tedious to label poses y on the input images X. In both finger movement decoding and pose estimation, y and X may be collected from different subjects. A prior model p_η(y) is learned from {y_i}_{i=1}^m, where y_i ∈ R^{Y×1} and η is the parameter of the prior model. Then the function parameter θ is estimated by maximizing

max_θ (1/n) Σ_{i=1}^{n} log p_η(f_θ(X_i)),    (1)

where X_i ∈ R^{X×1}. The parameter θ is chosen so that the outputs on {X_i}_{i=1}^n are maximally consistent with the prior distribution p_η(y).
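The unsupervised objective in Eq. (1) can be sketched in a few lines of numpy. Purely for illustration, a linear regressor f_θ(x) = x^T θ is used and a fixed Gaussian N(μ, σ²) stands in for the learned prior p_η(y) (the paper's actual prior, a GB-RBM, appears in Section 4.1); all dimensions, rates and the synthetic data are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
L, n = 5, 200                              # feature dim, number of unpaired inputs
feats = rng.normal(size=(L, n))
X = np.vstack([np.ones((1, n)), feats])    # prepend a bias row so f can match the prior mean
mu, sigma = 1.0, 0.5                       # hypothetical Gaussian target prior N(mu, sigma^2)

theta = np.zeros(L + 1)
lr = 0.05
for _ in range(500):
    y_hat = X.T @ theta                    # predicted targets f_theta(X_i), shape (n,)
    # gradient of (1/n) sum_i log N(y_hat_i; mu, sigma^2) with respect to theta
    grad = X @ ((mu - y_hat) / sigma**2) / n
    theta += lr * grad                     # gradient ascent on Eq. (1)

# with a unimodal stand-in prior, the predictions collapse toward the prior mean
print(float(np.mean(X.T @ theta)))         # close to mu
```

With such a simple prior the learned decoder just matches the prior's first moment; the GB-RBM prior used later supplies richer temporal structure for the outputs to match.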
The setting of semi-supervised learning differs slightly from that of unsupervised learning, in that the inputs {X_i}_{i=1}^m corresponding to the outputs {y_i}_{i=1}^m are also given. The learning then becomes a combination of supervised and unsupervised learning:

min_θ (1/m) Σ_{i=1}^{m} L(y_i − f_θ(X_i)) − (λ/n) Σ_{i=1}^{n} log p_η(f_θ(X_i)),    (2)

where L is the loss function and λ is a constant representing the tradeoff between the two terms. In Eq. (2), the parameter θ is chosen so that the outputs not only minimize the loss function on the training data, but also make the predicted targets on the unlabeled data comply with the target prior.

Next, we adapt unsupervised/semi-supervised LTP learning to the prediction/inference tasks in two applications, namely decoding ECoG signals to predict finger movement in BCI and estimating body poses from videos, where state-of-the-art performance is achieved.

4 Finger Movement Decoding in ECoG based BCI

The main task in brain-computer interface (BCI) systems is to convert electronic signals recorded from the human brain into control commands for patients with motor disabilities (e.g., paralysis). Many recent studies in neurobiology have suggested that electrocorticographic (ECoG) signals recorded near the brain surface show strong correlations with limb motions [2, 8]. ECoG signal decoding is the critical step in ECoG-based BCI systems, the goal of which is to obtain a functional mapping between the ECoG signals and the kinematic variables (e.g., spatial locations and movement velocities of fingers recorded by a digital glove) [8]. The ECoG decoding problem has been widely addressed with supervised parametric learning [26, 8, 25], where corresponded ECoG signals and target kinematic variables are collected from one subject and used to train a parametric regressor.
However, a decoder learned from data collected from one subject in a controlled experiment usually has trouble generalizing for the same subject over time and in an open environment (temporal generalization) [18], or decoding signals from other subjects (cross-subject generalization) [24]. The former is due to the strong variance in ECoG signals caused by other concurrent brain activities, and the latter is due to differences in the shape and volume of the brains of different subjects. These limitations are regarded as the most challenging issues in current BCI systems [7]. There have been several works addressing these issues. For instance, to improve the generalization performance across trials, several adaptive classification methods have been proposed [18], e.g., updating the LDA with labeled feedback data. To generalize better across subjects, a collaborative paradigm was proposed to integrate information from multiple subjects [24]. In [17] it is shown that certain spectral features of ECoG signals can be used across subjects to classify movements. However, these methods do not provide satisfactory solutions, since the central challenge in extending a parametric decoder across time and subjects is that the conventional parametric learning approach, on which all these methods are based, relies on training data to obtain information for learning the regressor, and in these cases such data are difficult to collect. At the same time, in BCI it is typically much easier to gather samples of uncorresponded target variables, i.e., traces of finger movements recorded by digital gloves, than it is to gather corresponding pairs of training samples.

Thus, in this work we propose to improve the temporal and cross-subject generalization of BCI decoders with the learning with target priors framework.
In the first step, we obtain a parametric target prior model using uncorresponded samples of the target data, in this case the traces of finger positions. In the second step, we estimate a linear decoding function using the general method described in Section 3. Let us first define the notation used subsequently: we use a linear decoding function, f_θ(X) = X^T θ, to predict the traces of finger movements y as the target variable. Specifically, we define y ∈ R^Y, where Y corresponds to the number of samples in the finger traces. X ∈ R^{L×Y} is a matrix whose columns are a subset of ECoG signal features of length L. The model parameter θ ∈ R^L is a vector. Linear decoding functions are widely used in BCI decoding [1] for their simplicity and run-time efficiency in constructing hardware-based BCI systems.

4.1 Target Prior Model

We use the Gaussian-Bernoulli restricted Boltzmann machine (GB-RBM) [14] as the parametric target prior model: p_η(y) = (1/Z) Σ_h e^{−E_η(y,h)}, where Z is the normalizing constant and h ∈ {0, 1}^H are binary hidden variables. The pdf is defined in terms of the joint energy function over y and h:

E_η(y, h) = Σ_{i=1}^{Y} (y_i − c_i)² / 2 − Σ_{i=1,j=1}^{Y,H} W_ij y_i h_j − Σ_{j=1}^{H} b_j h_j,

where W_ij is the interaction strength between the visible node y_i and the hidden node h_j, and c and b are the biases of the visible layer and hidden layer, respectively. The target variable y is normalized to have zero mean and unit standard deviation. The parameters of this model, (W, c, b), are collectively represented by η.
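The energy function and the conditionals it induces, which are what contrastive divergence training manipulates, can be sketched as follows (sizes follow Section 4.4; the random parameters and the single Gibbs step are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(1)
Y, H = 12, 64                         # visible trace length and hidden size, as in Section 4.4
W = 0.01 * rng.normal(size=(Y, H))    # illustrative interaction weights W_ij
b = np.zeros(H)                       # hidden biases
c = np.zeros(Y)                       # visible biases

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def energy(y, h):
    # E_eta(y, h) = sum_i (y_i - c_i)^2 / 2 - sum_ij W_ij y_i h_j - sum_j b_j h_j
    return 0.5 * np.sum((y - c) ** 2) - y @ W @ h - b @ h

def p_h_given_y(y):
    # hidden units are conditionally independent: p(h_j = 1 | y) = sigma(W_{.j}^T y + b_j)
    return sigmoid(y @ W + b)

def sample_y_given_h(h, rng):
    # visible units are unit-variance Gaussians centered at c + W h
    return c + W @ h + rng.normal(size=Y)

# one Gibbs step, the core of CD-1 training
y0 = rng.normal(size=Y)
h0 = (rng.random(H) < p_h_given_y(y0)).astype(float)
y1 = sample_y_given_h(h0, rng)
```

The factorized conditionals are what make both CD training and the gradient in Eq. (4) below cheap to evaluate.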
Direct maximum likelihood training of the GB-RBM is intractable due to the normalizing factor Z, so we use contrastive divergence [6] to estimate η from data.

4.2 Learning Regressor Parameter θ

With training data and the GB-RBM as the target prior model, we optimize the objective function of LTP in Eq. (1) or (2) for the parameter θ. With the linear decoding function and squared loss function, the gradient of the first term of Eq. (2) can be computed as −(2/m) Σ_{i=1}^{m} X_i(y_i − X_i^T θ). The derivative over θ of the log-likelihood of X^T θ with regard to the prior model can be computed as

∂ log p_η(X^T θ)/∂θ = Σ_h p_η(h|X^T θ) (−∂E(X^T θ, h)/∂θ).    (3)

Plugging the energy function E into Eq. (3), we can simplify it to

∂ log p_η(X^T θ)/∂θ = X(c − X^T θ) + XW Σ_h p_η(h|X^T θ) h,    (4)

where Σ_h p_η(h|X^T θ) h can be computed in closed form using the property of the GB-RBM that the elements of h are conditionally independent given X^T θ. Specifically, let g = Σ_h p_η(h|X^T θ) h; then g_j = σ(W_j^T X^T θ + b_j), where W_j is the jth column of W and σ is the logistic function σ(x) = 1/(1 + exp(−x)). The expectation of this derivative over all sequences composed of Y successive samples in the training data is written ⟨∂ log p_η(X^T θ)/∂θ⟩_data, where ⟨·⟩_data denotes the expectation over the data.

4.3 Experimental Settings

The ECoG data and target finger movement variables were collected in a clinical setting from five subjects (A-E) who underwent brain surgery [8]. Each subject had a 48- or 64-electrode grid placed over the cortex.
During the experiment, the subjects were required to repeatedly flex and extend specific individual fingers according to visual cues on a video screen. The experiment setup is shown in Fig. 1. The data collection for each subject lasted 10 minutes, which yielded an average of 30 trials for each finger. The flexion of each finger was measured by a data glove. For each channel, features are extracted based on the signal power in three bands (1-60Hz, 60-100Hz, 100-200Hz) [2], which results in 144 or 204 features for subjects with 48 or 64 channels, respectively.

4.4 Learning Target Prior Model and Decoding Function

The training data for the prior model p_η(y) are either from other subjects or from the same subject but collected at a different time, and have no correspondence with the training input data. Here we consider finger movement traces composed only of flexion and extension, as in Fig. 2(A). This simplified model is still practically useful, since we can first classify the trace into a movement state or a rest state and then apply the corresponding regressor for each state [4]. Each subject has around 1400 samples. We model the finger movement trace using a GB-RBM with 64 hidden nodes and 12 visible nodes, which is approximately the length of one round of extension and flexion. All segments of 12 successive samples in the data are then used to train the prior model.

The GB-RBM is trained with stochastic gradient descent with a mini-batch size of 25 sub-sequences. We run 5000 epochs with a fixed learning rate of 0.001. We first validate the prior model by drawing samples from the learned GB-RBM. Figure 2(B) shows 9 samples, which appear to capture some important properties of the temporal dynamics of the finger traces.

Figure 2: (A) Original trace; (B) samples from the GB-RBM.
Each sample is a segment of length 12.

Table 1: Results on the thumb of the five subjects based on 2-fold cross validation (correlation coefficient).

                      A     B     C     D     E
Linear R              0.29  0.26  0.06  0.10  0.11
Semi-supervised LTP   0.38  0.42  0.13  0.15  0.12

With the prior model, the paired features/target variables, if they exist, and the unpaired features, on which the expectation of Eq. (4) is calculated, are used to learn the parameter θ. θ is randomly initialized and learned with stochastic gradient descent with the same batch size of 25. We run 2000 epochs with a fixed learning rate of 10^−4.

4.5 Generalization Across Subjects

We learn the decoding function for new subjects by deploying the unsupervised LTP learning of Section 3. Even though it is difficult to get corresponded samples from new subjects, we always have the input ECoG signals, whose features are used as the input to unsupervised LTP learning. We compare unsupervised LTP learning with linear regression [2] in two ways: 1) linear regression (intra subject), in which corresponded data and target variables are available; its accuracy is calculated based on five-fold cross-validation, that is, 4/5 of the trials (25 trials) are used for training and 1/5 of the trials (5 trials) are used for testing; 2) linear regression (inter subject), trained on one subject and tested on the other subjects. The inter-subject results are calculated based on 5-fold cross-validation (each time, one subject is used for training and the model is tested on the other four subjects). Linear regression is trained on pairs of features and targets, while LTP only uses the targets to train the prior model.
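The decoding accuracies reported in this section are correlation coefficients between recorded and predicted finger traces; a minimal version of that metric on a synthetic trace (the sinusoidal trace and noise level are made-up stand-ins):

```python
import numpy as np

def corr_coef(y_true, y_pred):
    # Pearson correlation between a recorded and a decoded finger trace
    return float(np.corrcoef(y_true, y_pred)[0, 1])

rng = np.random.default_rng(2)
t = np.linspace(0, 6, 300)
trace = np.sin(2 * np.pi * t)                          # stand-in flexion/extension trace
decoded = 0.8 * trace + 0.2 * rng.normal(size=t.size)  # noisy "prediction"
r = corr_coef(trace, decoded)
```

Being scale- and offset-invariant, the correlation coefficient rewards a decoder that recovers the shape of the flexion/extension pattern even when its amplitude is off.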
For linear regression trained and tested on different subjects, the channels across subjects are aligned by the 3-D positions of the sensors.

Figure 3(A) shows the performance comparison of the three models. Note that the performance of unsupervised LTP learning is on par with that of linear regression (intra subject) on subjects A, B, C and D, which suggests that the decoder learned by unsupervised LTP learning can generalize across subjects. Figures 3(B) and (C) show two examples of prediction results from unsupervised LTP learning. On the other hand, not surprisingly, the performance of linear regression (inter subject) suggests that it cannot be extended across subjects, due to the brain differences between subjects stated above. The generalization ability gained by unsupervised LTP learning comes mainly from directly learning the decoding function on the new subject, without using brain signals from existing subjects, which are believed to change dramatically between subjects. One thing we noticed is that unsupervised LTP learning does not work well on subject E; this is because the thumb movement speed of subject E is much slower than that of subject A, on which the prior model was trained. This suggests that the quality of the target prior model is critical for the performance.

Figure 3: (A) Comparison among the three models across subjects; (B) sample results for subject A; (C) sample results for subject B. The dotted line is the ground truth and the solid line is the prediction.

4.6 Online Learning for Decoding Functions

In the next set of experiments, we use the learning with target priors framework for learning decoding functions that generalize over time. This experiment is performed for each subject individually. For each subject, let {X_i, y_i}_{i=1}^m be the training data in the current trial and {X_j}_{j=1}^n be the new samples unknown in training. We first train the prior model on {y_i}_{i=1}^m.
Then the parameter θ is learned using the semi-supervised learning of Section 3.

The new samples arrive sequentially, and thus we want the decoding function to be updated online. The parameter θ is not updated for every new sample, but for every batch of data X ∈ R^{L×Y}. Here we give a brief description of the online batch updating method. At the start, the parameter θ is learned from the corresponding pairs of samples {X_i, y_i}_{i=1}^m. The decoding function with parameter θ is then used to decode the first batch {X_j}_{j=1}^Y. After the batch {X_j}_{j=1}^Y is decoded, {X_j}_{j=1}^Y, not including the predicted target variables, is included as part of the unlabeled training data to update the parameter θ by the semi-supervised learning of Section 3. The updated θ is then used to decode the second batch {X_j}_{j=Y+1}^{2Y}, and the process loops. In summary, after each new batch is decoded using the current parameter θ, it is included as training data to update θ. In general, we try to make maximal use of the "seen" data to prepare the decoding function for the "unseen" samples to come.

The batch size Y is chosen to be 12. The model is tested on the thumb of the five subjects based on 2-fold cross validation; that is, we treat the first 15 trials as the paired data/target variables and then test online on the remaining trials; after that, we in turn treat the last 15 trials as the paired data/target variables and use the first 15 trials for online testing. The results in Table 1 show that the proposed model with online batch updating can significantly improve the results. This means that by regularizing the new features with the target prior, the semi-supervised learning of Section 3 successfully obtains information from the new features and adapts the decoder well to newly arriving samples.

5 Pose Estimation from Videos

In this section, we apply learning with target priors to the pose estimation problem, the goal of which is to extract 3D human pose from images or video sequences. We demonstrate LTP by applying it to learn a linear mapping from image features to poses, although LTP could be used to learn more sophisticated models. We will show that the algorithms learned by LTP are more generalizable both across subjects and over time on the same subject.

In this experiment, we use six walking sequences from the CMU MoCap database (http://mocap.cs.cmu.edu). The data are from 3 subjects, with sequences 1 & 2 from the first subject, sequences 3 & 4 from the second subject and sequences 5 & 6 from the third subject. Each sequence consists of about 70 frames. Our task is to estimate the 3-D pose from videos, described by 59-dimensional joint angles. The image features are extracted from the silhouette image at the side view. For each silhouette image we take 10-dimensional moment features [23].

We represent a video sequence as {X_i, y_i}_{i=1}^n, where X ∈ R^{10×n} are the image features and y ∈ R^{59×n} are the joint angles, with n the length of the sequence (n may differ between sequences). Instead of directly mapping features to the 59-dimensional joint angles, we learn a function that maps the features to a 3-dimensional subspace of the joint angles obtained through PCA. The original space of joint angles is then recovered from the low-dimensional subspace.

Algorithm 1 Learning with target priors
Input: joint angles {y_i}_{i=1}^n, test features X*
Output: y* corresponding to X*
Steps:
1: PCA: y → EZ, where E ∈ R^{59×3}, Z ∈ R^{3×n}
2: learn the prior model p_η(y) on Z
3: learn the mapping function Z* = f_θ(X*) using the unsupervised LTP learning of Section 3
4: recover the original space: y* = EZ*

All possible segments composed of 60 successive frames in a sequence are used to train the GB-RBM, so the length of the vector input to the GB-RBM is 180 (the subspace is 3-dimensional).

Many methods have been proposed to address the pose estimation problem, among which sGPLVM [19], FOLS-GPLVM [15] and imCRBM [22] are three very competitive ones. sGPLVM models a latent space shared by pose and image features through the GPLVM, while FOLS-GPLVM models a shared latent space and a private latent space for each part. imCRBM constructs a pose prior for the Bayesian model using an implicit mixture of CRBMs. However, Taylor's work [22] is not comparable to our method, because it requires a generative model that directly maps a pose to a silhouette, while our method explicitly uses the extracted moment features, and the comparison here focuses on algorithms instead of features. We therefore compare with sGPLVM and FOLS-GPLVM using the same image features. The training of both sGPLVM and FOLS-GPLVM requires corresponded images and poses (X, y), while LTP does not.

For the unsupervised LTP learning, the target prior model is trained on the subspace of the joint angles {y_i}_{i=1}^n of sequence 1 and tested on the features of all 6 sequences. The implementation details are shown in Algorithm 1. Besides sGPLVM and FOLS-GPLVM, the results are also compared with ridge regression.
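Steps 1 and 4 of Algorithm 1 (the PCA reduction y → EZ and the recovery y* = EZ*) can be sketched with plain numpy. Step 3, the LTP-learned mapping, is replaced here by a placeholder so the sketch stays self-contained, and the joint angles are random stand-ins:

```python
import numpy as np

rng = np.random.default_rng(3)
n, D, d = 70, 59, 3                    # frames, joint-angle dim, subspace dim (Section 5)
Yjoint = rng.normal(size=(D, n))       # random stand-in for joint angles, one frame per column

# step 1: PCA, y -> E Z with E in R^{59x3}, Z in R^{3xn}
mean = Yjoint.mean(axis=1, keepdims=True)
U, S, Vt = np.linalg.svd(Yjoint - mean, full_matrices=False)
E = U[:, :d]                           # top-3 principal directions
Z = E.T @ (Yjoint - mean)              # subspace coordinates

# steps 2-3 (the prior model and the LTP-learned mapping Z* = f_theta(X*))
# are omitted; pretend the regressor returned this subspace trajectory
Z_star = Z                             # placeholder

# step 4: recover the original joint-angle space, y* = E Z*
Y_star = E @ Z_star + mean
print(Y_star.shape)                    # (59, 70)
```

Working in the 3-D subspace keeps both the GB-RBM prior and the linear regressor small; the recovery in step 4 is just a matrix product with the fixed basis E.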
Ridge regression, sGPLVM and FOLS-GPLVM are trained on the first sequence with paired samples {X_i, y_i}_{i=1}^n and tested on all 6 sequences. The implementation of ridge regression is similar to that in Algorithm 1; the only difference is that the mapping from features to the PCA subspace is learned by ridge regression.

The results are measured in terms of mean absolute joint angle error and are shown in Table 2. We can see that when testing on the sequence from the same subject (sequence 2), unsupervised LTP learning is not the best. In contrast, when testing on the sequences from subjects B and C, unsupervised LTP learning achieves the best results, slightly better than sGPLVM. Considering that only linear dimension reduction and a linear function are assumed for unsupervised LTP learning, and paired samples are not required, unsupervised LTP learning is even more competitive. FOLS-GPLVM does not perform well on this data set, which is probably due to the limited training samples. Thus the experiments demonstrate that the algorithm learned by unsupervised LTP learning in Section 3 can generalize well across subjects.

Table 2: Train the prior model on the first sequence and test on all sequences. Results are measured with mean absolute joint angle error.

Subject              A           B           C
Sequence Num       1     2     3     4     5     6
Ridge Regression   2.1   4.8   8.3   8.5  10.7  10.7
sGPLVM              —    3.1   5.6   6.1   3.0   3.1
FOLS-GPLVM          —    5.3   6.5   6.4   3.3   4.0
Unsupervised LTP   3.0   4.8   5.3   6.1   2.9   2.9

Table 3: For each subject, train on the first sequence and test on the second sequence. Results are measured with mean absolute joint angle error.

Subject                A     B     C
Ridge Regression      4.8   5.3   3.1
sGPLVM                3.1   5.3   3.0
FOLS-GPLVM            5.3   5.8   3.8
Semi-supervised LTP   2.87  3.97  2.33
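The error metric used in Tables 2 and 3 can be sketched directly (the joint-angle data and 3-degree noise level below are synthetic, chosen only to exercise the metric):

```python
import numpy as np

def mean_abs_joint_angle_error(y_true, y_pred):
    # average absolute error over all 59 joint angles and all frames
    return float(np.mean(np.abs(y_true - y_pred)))

rng = np.random.default_rng(4)
angles = rng.uniform(-180.0, 180.0, size=(59, 70))       # synthetic joint angles
estimate = angles + rng.normal(0.0, 3.0, size=(59, 70))  # roughly 3 degrees of noise
err = mean_abs_joint_angle_error(angles, estimate)
```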
The reason that ridge regression, sGPLVM and FOLS-GPLVM do not generalize well is that the relations between poses and images are learned solely from corresponded poses and images, and these relations may fail to hold for new subjects due to many factors (e.g., the video for a new subject may be recorded from a slightly different angle). LTP avoids this problem by learning the relations from the generalizable prior distribution over the targets together with the images of the new subjects.
We further demonstrate that the algorithm learned through the semi-supervised learning in Section 3 generalizes well across time for the same subject. In this experiment, for each subject we treat the first sequence as the paired samples {X_i, y_i}_{i=1}^m and estimate the 3-D poses of the second sequence {X_j}_{j=1}^n. The prior model is trained on the joint angles of the first sequence {y_i}_{i=1}^m. The algorithm is similar to Algorithm 1, except that the unsupervised LTP learning is replaced with semi-supervised learning. The results in Table 3 show that the semi-supervised learning in Section 3 significantly outperforms the three other methods.

6 Conclusion and Discussion

In this work, we describe a new learning scheme for parametric learning, known as learning with target priors, that uses a prior model over the target variables and a set of uncorresponded data in training. Compared to the conventional (semi-)supervised learning approach, LTP can make efficient use of prior knowledge of the target variables in the form of probabilistic distributions, and thus removes or reduces the reliance on training data in learning. Compared to the Bayesian approach, the learned parametric regressor in LTP can be more efficiently implemented and deployed in tasks where running efficiency is critical, such as on-line BCI signal decoding. We demonstrate the effectiveness of the proposed approach, in terms of generalization, on parametric regression tasks for BCI signal decoding and pose estimation from video.
There are several extensions of this work we would like to pursue further. First, in the current work we only use a simple target prior model in the form of a GB-RBM. There are, however, more flexible probabilistic models, such as Markov random fields or dynamic Bayesian networks, that can better represent the statistical properties of the target variables. We would therefore like to incorporate such models into LTP learning to further improve the performance. Second, we would like to investigate the connection between conventional capacity control methods (e.g., max margin or regularization) and LTP learning. This has the potential to unify and shed light on the deeper relations among different learning methodologies. Last, we would also like to use LTP learning with nonlinear decoding functions.
Acknowledgement The authors would like to thank Jixu Chen for providing the motion capture data and feature extraction code. Zuoguan Wang and Qiang Ji are supported in part by a grant from the US Army Research Office (W911NF-08-1-0216 (GS)) through Albany Medical College. Gerwin Schalk is supported by the US Army Research Office (W911NF-08-1-0216 (GS) and W911NF-07-1-0415 (GS)) and the NIH (EB006356 (GS) and EB000856 (GS)). Siwei Lyu is supported by an NSF CAREER Award (IIS-0953373).

References
[1] Bashashati, Ali, Fatourechi, Mehrdad, Ward, Rabab K., and Birch, Gary E. A survey of signal processing algorithms in brain-computer interfaces based on electrical brain signals. J. Neural Eng., 4, June 2007.
[2] Bougrain, Laurent and Liang, Nanying. Band-specific features improve finger flexion prediction from ECoG. In Jornadas Argentinas sobre Interfaces Cerebro Computadora - JAICC, Paraná, Argentina, 2009.
[3] Chang, Mingwei, Ratinov, Lev, and Roth, Dan.
Guiding semi-supervision with constraint-driven learning. In Proc. of the Annual Meeting of the ACL, 2007.
[4] Flamary, Rémi and Rakotomamonjy, Alain. Decoding finger movements from ECoG signals using switching linear models. Technical report, September 2009.
[5] Ganchev, Kuzman, Graca, Joao, Gillenwater, Jennifer, and Taskar, Ben. Posterior regularization for structured latent variable models. JMLR, 11(July):2001–2049, 2010.
[6] Hinton, Geoffrey. Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8):1771–1800, 2002.
[7] Krusienski, Dean J, Grosse-Wentrup, Moritz, Galán, Ferran, Coyle, Damien, Miller, Kai J, Forney, Elliott, and Anderson, Charles W. Critical issues in state-of-the-art brain-computer interface signal processing. Journal of Neural Engineering, 8(2):025002, 2011.
[8] Kubánek, J, Miller, K J, Ojemann, J G, Wolpaw, J R, and Schalk, G. Decoding flexion of individual fingers using electrocorticographic signals in humans. J Neural Eng, 6(6):066001, Dec 2009.
[9] Lafferty, John, McCallum, Andrew, and Pereira, Fernando. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In ICML, pp. 282–289. Morgan Kaufmann, 2001.
[10] Lefort, Riwal, Fablet, Ronan, and Boucher, Jean-Marc. Weakly supervised classification of objects in images using soft random forests. In ECCV, pp. 185–198, 2010.
[11] Liang, Percy, Jordan, Michael I., and Klein, Dan. Learning from measurements in exponential families. In ICML '09, pp. 641–648, New York, NY, USA, 2009. ACM.
[12] Mann, Gideon S. and McCallum, Andrew. Simple, robust, scalable semi-supervised learning via expectation regularization. In ICML, pp. 593–600, 2007.
[13] Mann, Gideon S. and McCallum, Andrew. Generalized expectation criteria for semi-supervised learning of conditional random fields. In ACL '08, pp.
870–878, 2008.
[14] Mohamed, A., Dahl, G., and Hinton, G. Acoustic modeling using deep belief networks. IEEE Transactions on Audio, Speech, and Language Processing, PP(99):1, 2011.
[15] Salzmann, Mathieu, Ek, Carl Henrik, Urtasun, Raquel, and Darrell, Trevor. Factorized orthogonal latent spaces. JMLR, 9:701–708, 2010.
[16] Schapire, Robert E., Rochery, Marie, Rahim, Mazin G., and Gupta, Narendra. Incorporating prior knowledge into boosting. In ICML, 2002.
[17] Shenoy, P., Miller, K.J., Ojemann, J.G., and Rao, R.P.N. Generalized features for electrocorticographic BCIs. IEEE Transactions on Biomedical Engineering, 55(1), Jan. 2008.
[18] Shenoy, Pradeep, Krauledat, Matthias, Blankertz, Benjamin, Rao, Rajesh P. N., and Müller, Klaus-Robert. Towards adaptive classification for BCI. Journal of Neural Engineering, 2006.
[19] Shon, Aaron P., Grochow, Keith, Hertzmann, Aaron, and Rao, Rajesh P. N. Learning shared latent structure for image synthesis and robotic imitation. In NIPS, pp. 1233–1240, 2006.
[20] Tarlow, Daniel and Zemel, Richard S. Structured output learning with high order loss functions. In AISTATS, 2012.
[21] Taskar, Ben, Guestrin, Carlos, and Koller, Daphne. Max-margin Markov networks. In NIPS. MIT Press, 2003.
[22] Taylor, G.W., Sigal, L., Fleet, D.J., and Hinton, G.E. Dynamical binary latent variable models for 3D human pose tracking. In CVPR, pp. 631–638, June 2010.
[23] Tian, Tai-Peng, Li, Rui, and Sclaroff, S. Articulated pose estimation in a learned smooth space of feasible solutions. In CVPR, pp. 50, June 2005.
[24] Wang, Yijun and Jung, Tzyy-Ping. A collaborative brain-computer interface for improving human performance. PLoS ONE, 6(5):e20422, May 2011.
[25] Wang, Zuoguan, Ji, Qiang, Miller, Kai J., and Schalk, Gerwin. Decoding finger flexion from electrocorticographic signals using a sparse Gaussian process. In ICPR, pp.
3756–3759, 2010.
[26] Wang, Zuoguan, Schalk, Gerwin, and Ji, Qiang. Anatomically constrained decoding of finger flexion from electrocorticographic signals. In NIPS, 2011.
[27] Yu, C.-N. and Joachims, T. Learning structural SVMs with latent variables. In ICML, 2009.
[28] Zhu, Xiaojin. Semi-supervised learning literature survey, 2006. URL http://pages.cs.wisc.edu/~jerryzhu/pub/ssl_survey.pdf.