{"title": "Fast and Scalable Training of Semi-Supervised CRFs with Application to Activity Recognition", "book": "Advances in Neural Information Processing Systems", "page_first": 977, "page_last": 984, "abstract": "We present a new and efficient semi-supervised training method for parameter estimation and feature selection in conditional random fields (CRFs). In real-world applications such as activity recognition, unlabeled sensor traces are relatively easy to obtain whereas labeled examples are expensive and tedious to collect. Furthermore, the ability to automatically select a small subset of discriminatory features from a large pool can be advantageous in terms of computational speed as well as accuracy. In this paper, we introduce the semi-supervised virtual evidence boosting (sVEB) algorithm for training CRFs -- a semi-supervised extension to the recently developed virtual evidence boosting (VEB) method for feature selection and parameter learning. Semi-supervised VEB takes advantage of the unlabeled data via minimum entropy regularization -- the objective function combines the unlabeled conditional entropy with labeled conditional pseudo-likelihood. The sVEB algorithm reduces the overall system cost as well as the human labeling cost required during training, which are both important considerations in building real world inference systems. 
In a set of experiments on synthetic data and real activity traces collected from wearable sensors, we illustrate that our algorithm benefits from both the use of unlabeled data and automatic feature selection, and outperforms other semi-supervised training approaches.", "full_text": "Fast and Scalable Training of Semi-Supervised CRFs with Application to Activity Recognition\n\nMaryam Mahdaviani\nComputer Science Department\nUniversity of British Columbia\nVancouver, BC, Canada\n\nTanzeem Choudhury\nIntel Research\n1100 NE 45th Street\nSeattle, WA 98105, USA\n\nAbstract\n\nWe present a new and efficient semi-supervised training method for parameter estimation and feature selection in conditional random fields (CRFs). In real-world applications such as activity recognition, unlabeled sensor traces are relatively easy to obtain whereas labeled examples are expensive and tedious to collect. Furthermore, the ability to automatically select a small subset of discriminatory features from a large pool can be advantageous in terms of computational speed as well as accuracy. In this paper, we introduce the semi-supervised virtual evidence boosting (sVEB) algorithm for training CRFs -- a semi-supervised extension to the recently developed virtual evidence boosting (VEB) method for feature selection and parameter learning.
The objective function of sVEB combines the unlabeled conditional entropy with the labeled conditional pseudo-likelihood. It reduces the overall system cost as well as the human labeling cost required during training, which are both important considerations in building real-world inference systems. Experiments on synthetic data and real activity traces collected from wearable sensors illustrate that sVEB benefits from both the use of unlabeled data and automatic feature selection, and outperforms other semi-supervised approaches.\n\n1 Introduction\n\nConditional random fields (CRFs) are undirected graphical models that have been successfully applied to the classification of relational and temporal data [1]. Training complex CRF models with large numbers of input features is slow, and exact inference is often intractable. The ability to select the most informative features as needed can reduce the training time and the risk of over-fitting of parameters. Furthermore, in complex modeling tasks, obtaining the large amount of labeled data necessary for training can be impractical. On the other hand, large unlabeled datasets are often easy to obtain, making semi-supervised learning methods appealing in various real-world applications.\nThe goal of our work is to build an activity recognition system that is not only accurate but also scalable, efficient, and easy to train and deploy. An important application domain for activity recognition technologies is health-care, especially in supporting elder care, managing cognitive disabilities, and monitoring long-term health. Activity recognition systems will also be useful in smart environments, surveillance, emergency and military missions. Some of the key challenges faced by current activity inference systems are the amount of human effort spent in labeling and feature engineering and the computational complexity and cost associated with training.
Data labeling also has privacy implications because it often requires human observers or recording of video. In this paper, we introduce a fast and scalable semi-supervised training algorithm for CRFs and evaluate its classification performance on extensive real-world activity traces gathered using wearable sensors. In addition to being computationally efficient, our proposed method reduces the amount of labeling required during training, which makes it appealing for use in real-world applications.\n\nSeveral supervised techniques have been proposed for feature selection in CRFs. For discrete features, McCallum [2] suggested an efficient method for feature induction by iteratively increasing conditional log-likelihood. Dietterich [3] applied gradient tree boosting to select features in CRFs by combining boosting with parameter estimation for 1D linear-chain models. Boosted random fields (BRFs) [4] combine boosting and belief propagation for feature selection and parameter estimation for densely connected graphs that have weak pairwise connections. Recently, Liao et al. [5] developed a more general version of BRFs, called virtual evidence boosting (VEB), that does not make any assumptions about graph connectivity or the strength of pairwise connections. The objective function in VEB is a soft version of maximum pseudo-likelihood (MPL), where the goal is to maximize the sum of local log-likelihoods given soft evidence from a node's neighbors. This objective function is similar to that used in boosting, which makes it suitable for unified feature selection and parameter estimation. The approximation applies to arbitrary CRF structures and leads to a significant reduction in training complexity and time. Semi-supervised training techniques have been extensively explored in the case of generative models and naturally fit under the expectation maximization framework [6].
However, it is not straightforward to incorporate unlabeled data in discriminative models using the traditional conditional likelihood criteria. A few semi-supervised training methods for CRFs have been proposed that introduce dependencies between nearby data points [7, 8]. More recently, Grandvalet and Bengio [9] proposed a minimum entropy regularization framework for incorporating unlabeled data. Jiao et al. [10] used this framework and proposed an objective function that combines the conditional likelihood of the labeled data with the conditional entropy of the unlabeled data to train 1D CRFs, which was extended to 2D lattice structures by Lee et al. [11].\nIn our work, we combine the minimum entropy regularization framework for incorporating unlabeled data with VEB for training CRFs. The contributions of our work are: (i) semi-supervised virtual evidence boosting (sVEB) -- an efficient technique for simultaneous feature selection and semi-supervised training of CRFs, which to the best of our knowledge is the first method of its kind; (ii) experimental results that demonstrate the strength of sVEB, which consistently outperforms other training techniques on synthetic data and real-world activity classification tasks; and (iii) analysis of the time and complexity requirements of our algorithm, and comparison with existing techniques that highlights the significant computational advantages of our approach.
The sVEB algorithm is fast and easy to implement and has the potential to be broadly applicable.\n\n2 Approaches to training of Conditional Random Fields\n\nMaximum likelihood parameter estimation in CRFs involves maximizing the overall conditional log-likelihood, where x is the observation sequence and y is the hidden state sequence:\n\nL(θ) = log p(y|x, θ) - ||θ||²/2 = log [ exp(Σ_{k=1}^K θ_k f_k(x, y)) / Σ_{y'} exp(Σ_{k=1}^K θ_k f_k(x, y')) ] - ||θ||²/2    (1)\n\nThe conditional distribution is defined by a log-linear combination of K feature functions f_k associated with weights θ_k. A regularizer on θ is used to keep the weights from getting too large and to avoid overfitting¹. For large CRFs exact inference is often intractable, and approximate methods such as mean field approximation or loopy belief propagation [12, 13] are used.\nAn alternative to approximating the conditional likelihood is to change the objective function; MPL [14] and VEB [5] are such techniques. For MPL the CRF is cut into a set of independent patches; each patch consists of a hidden node or class label y_i, the true value of its direct neighbors, and the observations, i.e., the Markov blanket (MB_{y_i}) of the node. Parameter estimation then becomes maximizing the pseudo log-likelihood:\n\nL_pseudo(θ) = Σ_{i=1}^N log p(y_i|MB_{y_i}, θ) = Σ_{i=1}^N log [ exp(Σ_{k=1}^K θ_k f_k(MB_{y_i}, y_i)) / Σ_{y'_i} exp(Σ_{k=1}^K θ_k f_k(MB_{y'_i}, y'_i)) ]\n\nMPL has been known to over-estimate the dependency parameters in some cases and there is no general guideline on when it can be safely used [15].\n\n¹When a prior is used in the maximum likelihood objective function as a regularizer -- the second term in eq.
(1), the method is in fact called maximum a posteriori.\n\n2.1 Virtual evidence boosting\n\nBy extending the standard LogitBoost algorithm [16], VEB integrates boosting-based feature selection into CRF training. The objective function used in VEB is very similar to MPL, except that VEB uses the messages from the neighboring nodes as virtual evidence instead of using the true labels of the neighbors. The use of virtual evidence helps to reduce over-estimation of neighborhood dependencies. We briefly explain the approach here; please refer to [5] for more detail.\nVEB incorporates two types of observation nodes: (i) hard evidence corresponding to the observations, ve(x_i), which are indicator functions at the observation values, and (ii) soft evidence corresponding to the messages from neighboring nodes, ve(n(y_i)), which are discrete distributions over the hidden states. Let ve_i ≜ {ve(x_i), ve(n(y_i))}. The objective function of VEB is as follows:\n\nL_VEB(θ) = Σ_{i=1}^N log p(y_i|ve_i, θ), where p(y_i|ve_i, θ) = [ Σ_{ve_i} ve_i exp(Σ_{k=1}^K θ_k f_k(ve_i, y_i)) ] / [ Σ_{y'_i} Σ_{ve_i} ve_i exp(Σ_{k=1}^K θ_k f_k(ve_i, y'_i)) ]    (2)\n\nVEB learns a set of weak learners f_t iteratively and estimates the combined feature F_t = F_{t-1} + f_t by solving the following weighted least-square error (WLSE) problem:\n\nf_t(ve_i) = argmin_f Σ_{i=1}^N w_i E[(f(ve_i) - z_i)²] = argmin_f Σ_{i=1}^N [ Σ_{ve_i} w_i p(y_i|ve_i)(f(ve_i) - z_i)² ]    (3)\n\nwhere w_i = p(y_i|ve_i)(1 - p(y_i|ve_i)) and z_i = (y_i - 0.5)/p(y_i|ve_i)    (4)\n\nThe w_i and z_i in equation (4) are the boosting weight and working response, respectively, for the ith data point, exactly as in LogitBoost. However, the least-square problem for VEB (eq. (3)) involves N · X points because of the virtual evidence, as opposed to N points in LogitBoost. Although eq.
(4) is given for the binary case (i.e., y_i ∈ {0, 1}), it is easily extendible to the multi-class case, and we have done that in our experiments. At each iteration, ve_i is updated as the messages from n(y_i) change with the addition of new features. We run belief propagation (BP) to obtain the virtual evidence before each iteration. The CRF feature weights θ_k are computed by solving the WLSE problem, where for local features n_ki is the count of feature k in data instance i, and for compatibility features n_ki is the virtual evidence from the neighbors:\n\nθ_k = Σ_{i=1}^N w_i z_i n_ki / Σ_{i=1}^N w_i n_ki\n\n2.2 Semi-supervised training\n\nFor semi-supervised training of CRFs, Jiao et al. [10] have proposed an algorithm that utilizes unlabeled data via entropy regularization -- an extension of the approach proposed by [9] to structured CRF models. The objective function that is maximized during semi-supervised training of CRFs is given below, where (x_l, y_l) and (x_u, y_u) represent the labeled and unlabeled data respectively:\n\nL_SS(θ) = log p(y_l|x_l, θ) + α Σ_{y_u} p(y_u|x_u, θ) log p(y_u|x_u, θ) - ||θ||²/2\n\nBy minimizing the conditional entropy of the unlabeled data, the algorithm will generally find a labeling of the unlabeled data that mutually reinforces the supervised labels. One drawback of this objective function is that it is no longer concave and in general there will be local maxima. The authors [10] showed that this method is still effective in improving an initial supervised model.\n\n3 Semi-supervised virtual evidence boosting\n\nIn this work, we develop semi-supervised virtual evidence boosting (sVEB), which combines feature selection with semi-supervised training of CRFs. sVEB extends the VEB framework to take advantage of unlabeled data via minimum entropy regularization similar to [9, 10, 11].
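To make the entropy-regularization idea concrete, here is a minimal numerical sketch (a hypothetical helper, not the authors' code): the regularized objective adds α times the summed p log p of the model's marginals on unlabeled points to the labeled log-likelihood, so confident predictions on unlabeled data raise the objective.

```python
import numpy as np

def entropy_regularized_objective(logp_labeled, probs_unlabeled, alpha=1.5):
    # Sketch of a minimum-entropy-regularized objective: labeled log-likelihood
    # plus alpha * sum_i sum_y p(y) log p(y) over unlabeled points
    # (i.e., minus alpha times their conditional entropy).
    ll = np.sum(logp_labeled)                  # sum of log p(y_i | x_i) on labeled data
    p = np.clip(probs_unlabeled, 1e-12, 1.0)   # guard against log(0)
    neg_entropy = np.sum(p * np.log(p))        # <= 0; equals 0 for fully confident rows
    return ll + alpha * neg_entropy
```

Confident marginals on the unlabeled points (rows close to one-hot) leave the objective near the labeled log-likelihood, while uniform marginals lower it by α times the entropy.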
The new objective function L_sVEB we propose is as follows, where i = 1 ··· N are labeled and i = N+1 ··· M are unlabeled examples:\n\nL_sVEB = Σ_{i=1}^N log p(y_i|ve_i) + α Σ_{i=N+1}^M Σ_{y'_i} p(y'_i|ve_i) log p(y'_i|ve_i)    (5)\n\nThe sVEB algorithm, similar to VEB, maximizes the conditional soft pseudo-likelihood of the labeled data but in addition minimizes the conditional entropy over the unlabeled data. The α is a tuning parameter for controlling how much influence the unlabeled data will have.\nBy considering the soft pseudo-likelihood in L_sVEB and using BP to estimate p(y_i|ve_i), sVEB can use boosting to learn the parameters of CRFs. The virtual evidence from the neighboring nodes captures the label dependencies. Three different types of feature functions f are used: for continuous observations, f1(x_i) is a linear combination of decision stumps; for discrete observations, the weak learner f2(x_i) is expressed as indicator functions; and for virtual evidence, the weak learner f3(y_i) is the weighted sum of two indicator functions (in the binary case). These functions are computed as follows, where δ is an indicator function, h is a threshold for the decision stump, and D is the number of dimensions of the observations:\n\nf1(x_i) = θ1 δ(x_i ≥ h) + θ2 δ(x_i < h),  f2(x_i) = Σ_{d=1}^D θ_d δ(x_i = d),  f3(y_i) = Σ_{k=0}^1 θ_k δ(y_i = k)    (6)\n\nSimilar to LogitBoost and VEB, the sVEB algorithm estimates a combined feature function F that maximizes the objective by sequentially learning a set of weak learners f_t (i.e., iteratively selecting features).
In other words, sVEB solves the following weighted least-square error (WLSE) problem to learn the f_t:\n\nf_t = argmin_f [ Σ_{i=1}^N Σ_{ve_i} w_i p(y_i|ve_i)(f(x_i) - z_i)² + Σ_{i=N+1}^M Σ_{y'_i} Σ_{ve_i} w_i p(y'_i|ve_i)(f(x_i) - z_i)² ]    (7)\n\nFor labeled data (the first term in eq. (7)), the boosting weights w_i and working responses z_i are computed as described in equation (4). But for unlabeled data the expressions for w_i and z_i become more complicated because of the entropy term. We present the equations for w_i and z_i below; please refer to the Appendix for the derivations:\n\nw_i = α²(1 - p(y_i|ve_i))[p(y_i|ve_i)(1 - p(y_i|ve_i)) + log p(y_i|ve_i)]\n\nz_i = (y_i - 0.5) p(y_i|ve_i)(1 - log p(y_i|ve_i)) / ( α[p(y_i|ve_i)(1 - p(y_i|ve_i)) + log p(y_i|ve_i)] )    (8)\n\nThe soft evidence corresponding to messages from the neighboring nodes is obtained by running BP on the entire training dataset (labeled and unlabeled). The CRF feature weights θ_k are computed by solving the WLSE problem (eq. (7)):\n\nθ_k = Σ_{i=1}^M Σ_{y_i} w_i z_i n_ki / Σ_{i=1}^M Σ_{y_i} w_i n_ki\n\nAlgorithm 1 gives the pseudo-code for sVEB. The main difference between VEB and sVEB is steps 7-10, where we compute w_i and z_i for all possible values of y_i based on the virtual evidence and observations of the unlabeled training cases. The boosting weights and working responses are computed using equation (8). The weighted least-square error problem (eq. (7)) solved when obtaining the weak learner in sVEB is different from that of VEB, and its solution results in slightly different CRF feature weights θ. One of the major advantages of VEB and sVEB over ML and sML is that the parameter estimation is done mainly by performing feature counting.
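As a concrete sketch of the per-point computations in equations (4) and (8) (binary case; a hypothetical helper, with p taken from belief propagation), one boosting iteration's weights and responses could be assembled as follows: labeled points contribute one (w_i, z_i) pair via eq. (4), and each unlabeled point contributes one pair per candidate label via eq. (8).

```python
import math

def boosting_weights(p1, y, n_labeled, alpha=1.5):
    # Collect (index, label, w_i, z_i) tuples for one boosting iteration.
    # p1[i] = p(y_i = 1 | ve_i) from BP; the first n_labeled points are labeled.
    out = []
    for i, p in enumerate(p1):
        labels = [y[i]] if i < n_labeled else [0, 1]   # unlabeled: both candidate labels
        for yi in labels:
            pi = p if yi == 1 else 1.0 - p             # p(y_i = yi | ve_i)
            if i < n_labeled:                          # eq. (4)
                w = pi * (1.0 - pi)
                z = (yi - 0.5) / pi
            else:                                      # eq. (8)
                c = pi * (1.0 - pi) + math.log(pi)     # shared bracketed term
                w = alpha ** 2 * (1.0 - pi) * c
                z = (yi - 0.5) * pi * (1.0 - math.log(pi)) / (alpha * c)
            out.append((i, yi, w, z))
    return out
```

A full implementation would feed these pairs, together with the feature counts n_ki, into the WLSE solve of eq. (7) and re-run BP before the next boosting iteration.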
Unlike ML and sML, we do not need to use an optimizer to learn the model parameters, which results in a huge reduction in the time required to train the CRF models. Please refer to the complexity analysis section for details.\n\n4 Experiments\n\nWe conduct two sets of experiments to evaluate the performance of the sVEB method for training CRFs and the advantage of performing feature selection as part of semi-supervised training. In the first set of experiments, we analyze how much the complexity of the underlying CRF and the tuning parameter α affect the performance, using synthetic data. In the second set of experiments, we evaluate the benefit of feature selection and of using unlabeled data on two real-world activity datasets. We compare the performance of the semi-supervised virtual evidence boosting (sVEB) presented in this paper to the semi-supervised maximum likelihood (sML) method [10]. In addition, for the activity datasets, we also evaluate an alternative approach (sML+Boost), where a subset of features is selected in advance using boosting.
To benchmark the performance of the semi-supervised techniques, we also evaluate three different supervised training approaches, namely the maximum likelihood method using all observed features (ML), (ML+Boost) using a subset of features selected in advance, and virtual evidence boosting (VEB). All the learned models are tested using the standard maximum a posteriori (MAP) estimate and belief propagation. We used an l2-norm shrinkage prior as a regularizer for the ML and sML methods.\n\nAlgorithm 1: Training CRFs using semi-supervised VEB\ninputs: structure of CRF and training data (x_i, y_i), with y_i ∈ {0, 1}, 1 ≤ i ≤ M, and F_0 = 0\noutput: learned F_T and the corresponding weights θ\n1  for t = 1, 2, ..., T do\n2    Run BP using F_{t-1} to get virtual evidences ve_i;\n3    for i = 1, 2, ..., N do\n4      Compute likelihood p(y_i|ve_i);\n5      Compute w_i and z_i using equation (4);\n6    end\n7    for i = N + 1, ..., M and y_i = 0, 1 do\n8      Compute likelihood p(y_i|ve_i);\n9      Compute w_i and z_i using equation (8);\n10   end\n11   Obtain the “best” weak learner f_t according to equation (7) and update F_t = F_{t-1} + f_t;\n12 end\n\nFigure 1: Accuracy of sML and sVEB for different numbers of states, local features, and different values of α.\n\n4.1 Synthetic data\n\nThe synthetic data is generated using a first-order Markov chain with self-transition probabilities set to 0.9. For each model, we generate five sequences of length 4,000 and divide each trace into sequences of length 200. We randomly choose 50% of them as labeled and the other 50% as unlabeled training data. We perform leave-one-out cross-validation and report the average accuracies.\nTo measure how the complexity of the CRFs affects the performance of the different semi-supervised methods, we vary the number of local features and the number of states.
First, we compare the performance of sVEB and sML on CRFs with an increasing number of features. The number of states is set to 10 and the dimension of the observations is varied from 20 to 400. Figure (1a) shows the average accuracy for the two semi-supervised training methods and their confidence intervals. The experimental results demonstrate that sVEB outperforms sML as we increase the dimension of the observations (i.e., the number of local features). In the second experiment, we increase the number of classes and keep the dimension of the observations fixed at 100. Figure (1b) demonstrates that sVEB again outperforms sML as we increase the number of states. Given the same amount of training data, sVEB is less likely to overfit because of the feature selection step. In both these experiments we set the value of the tuning parameter α to 1.5. To explore the effect of the tuning parameter α, we vary the value of α from 0.1 to 10, while setting the number of states to 10 and the number of dimensions to 100.
Figure (1c) shows that the performance of both sML and sVEB depends on the value of α, but the accuracy decreases for large α's, similar to the sML results presented in [10].\n\n[Figure 1 panels: (a) accuracy vs. dimension of observations, (b) accuracy vs. number of states, (c) accuracy vs. value of α, for sML and sVEB.]\n\nFigure 2: An example of a sensor trace and a classification trace\n\nTable 1: Accuracy ± 95% confidence interval of the supervised algorithms on activity datasets 1 and 2\n\nAverage Accuracy (%) - Dataset 1\nLabeled  ML+all obs  ML+Boost  VEB\n60%   62.7 ± 6.6   69.4 ± 3.9   82.6 ± 7.3\n80%   73.0 ± 4.2   81.8 ± 4.7   90.3 ± 4.7\n100%  77.8 ± 3.4   87.0 ± 2.3   91.5 ± 3.8\n\nAverage Accuracy (%) - Dataset 2\nLabeled  ML+all obs  ML+Boost  VEB\n60%   74.3 ± 3.7   75.8 ± 3.3   88.5 ± 5.1\n80%   80.6 ± 2.9   84.8 ± 2.9   93.4 ± 3.8\n100%  86.2 ± 3.1   87.5 ± 3.1   93.8 ± 4.6\n\n4.2 Activity dataset\n\nWe collected two activity datasets using wearable sensors, which include audio, acceleration, light, temperature, pressure, and humidity. The first dataset contains instances of 8 basic physical activities (e.g., walking, running, going up/down stairs, going up/down an elevator, sitting, standing, and brushing teeth) from 7 different users. There is on average 30 minutes of data per user and a total of 3.5 hours of data that is manually labeled for training and testing purposes. The data is segmented into 0.25s chunks, resulting in a total of 49613 data points. For each chunk, we compute 651 features, which include signal energy in log and linear frequency bands, autocorrelation, different entropy measures, means, variances, etc. The features are chosen based on what is used in the existing activity recognition literature and a few additional ones that we felt could be useful.
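The per-chunk feature computation described above can be sketched roughly as follows (a hypothetical, much-reduced feature set -- mean, variance, and log energy per 0.25 s chunk -- standing in for the 651 features actually used; the 100 Hz sampling rate is an assumption for illustration):

```python
import numpy as np

def chunk_features(signal, rate_hz=100, chunk_s=0.25):
    # Segment a 1-D sensor signal into fixed-length chunks and compute a few
    # simple per-chunk features (mean, variance, log energy).
    n = int(rate_hz * chunk_s)                        # samples per chunk
    chunks = signal[: len(signal) // n * n].reshape(-1, n)
    mean = chunks.mean(axis=1)
    var = chunks.var(axis=1)
    log_energy = np.log(np.sum(chunks ** 2, axis=1) + 1e-12)
    return np.stack([mean, var, log_energy], axis=1)  # one feature row per chunk
```

Each row would then serve as one observation vector for the linear-chain CRF.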
During training, the data from each person is divided into sequences of length 200 and fed into linear-chain CRFs as observations. The second dataset contains instances of 5 different indoor activities (e.g., computer usage, meal, meeting, watching TV, and sleeping) from a single user. We recorded 15 hours of sensor traces over 12 days. As this set contains longer time-scale activities, the data is segmented into 1-minute chunks and 321 different features are computed, similar to the first dataset. There are a total of 907 data points. These features are fed into CRFs as observations; one linear-chain CRF is created per day.\nWe evaluate the performance of supervised and semi-supervised training algorithms on these two datasets. For the semi-supervised case, we randomly select 40% of the sequences for a given person or a given day as labeled and a different subset as the unlabeled training data. We compare the performance of sML and sVEB as we incorporate more unlabeled data (20%, 40%, and 60%) into the training process. We also compare the supervised techniques, ML, ML+Boost, and VEB, with increasing amounts of labeled data. For all the experiments, the tuning parameter α is set to 1.5. We perform leave-one-person-out cross-validation on dataset 1 and leave-one-day-out cross-validation on dataset 2, and report the average accuracies. The number of features chosen (i.e., through the boosting iterations) is set to 50 for both datasets -- including more features did not significantly improve the classification performance.\nFor both datasets, incorporating more unlabeled data improves accuracy. The sML estimate of the CRF parameters performs the worst: even with the shrinkage prior, the high dimensionality can still cause over-fitting and lower the accuracy, whereas parameter estimation and feature selection via sVEB consistently result in the highest accuracy.
The (sML+Boost) method performs better than sML but does not perform as well as when feature selection and parameter estimation are done within a unified framework, as in sVEB. Table 2 summarizes our results. The results of the supervised learning algorithms are presented in Table 1. Similar to the semi-supervised results, the VEB method performs the best, ML is the worst performer, and the accuracy numbers for the (ML+Boost) method are in between. The accuracy increases if we incorporate more labeled data during training.\n\nTable 2: Accuracy ± 95% confidence interval of semi-supervised algorithms on activity datasets 1 and 2\n\nAverage Accuracy (%) - Dataset 1\nUnlabeled  sML+all obs  sML+Boost  sVEB\n20%  60.8 ± 5.4   66.4 ± 4.2   72.6 ± 2.3\n40%  68.1 ± 4.8   76.8 ± 3.4   78.5 ± 3.4\n60%  74.9 ± 3.1   81.3 ± 3.9   85.3 ± 4.1\n\nAverage Accuracy (%) - Dataset 2\nUnlabeled  sML+all obs  sML+Boost  sVEB\n20%  71.4 ± 3.2   70.5 ± 5.3   79.9 ± 4.2\n40%  73.5 ± 5.8   74.1 ± 4.6   83.5 ± 6.3\n60%  75.6 ± 3.9   77.8 ± 3.2   87.4 ± 4.7\n\n[Figure 2 plot: sensor traces over time, with ground-truth and inferred activity classes.]\n\nTable 3: Accuracy ± 95% confidence interval on activity datasets 1 and 2 with small amounts of labeled data\n\nAverage Accuracy (%) - Dataset 1\nLabeled  ML+all obs  ML+Boost  VEB\n5%   59.2 ± 6.5   65.7 ± 8.3   71.2 ± 5.7\n20%  66.9 ± 5.9   67.3 ± 8.5   77.4 ± 3.6\n\nAverage Accuracy (%) - Dataset 2\nLabeled  ML+all obs  ML+Boost  VEB\n5%   71.2 ± 4.1   68.3 ± 6.7   79.7 ± 7.9\n20%  71.4 ± 6.3   73.8 ± 5.2   83.1 ± 6.4\n\nTo evaluate sVEB when a small amount of labeled data is available, we performed another set of experiments on datasets 1 and 2, where only 5% and 20% of the training data is labeled, respectively. We used all the available unlabeled data during training.
The results are shown in Table 3. These experiments clearly demonstrate that although adding more unlabeled data is not as helpful as incorporating more labeled data, the use of cheap unlabeled data along with feature selection can significantly boost the performance of the models.\n\n4.3 Complexity Analysis\n\nThe sVEB and VEB algorithms are significantly faster than ML and sML because they do not need to use optimizers such as quasi-Newton methods to learn the weight parameters. For each training iteration in sML the cost of running BP is O(c_l n s² + c_u n² s³) [10], whereas the cost of each boosting iteration in sVEB is O((c_l + c_u) n s²). An efficient entropy gradient computation is proposed in [17], which reduces the cost of sML to O((c_l + c_u) n s²) but still requires an optimizer to maximize the log-likelihood. Moreover, the number of training iterations needed is usually much higher than the number of boosting iterations, because optimizers such as L-BFGS require many more iterations to reach convergence in high-dimensional spaces. For example, for dataset 1 we needed about 1000 iterations for sML to converge, but we ran sVEB for only 50 iterations. Table 4 shows the time for performing the experiments on the activity datasets (as described in the previous section)². On the other hand, the space complexity of sVEB is linearly smaller than that of sML and ML. Similar to ML, sML has a space complexity of O(n s² D) in the best case [10]. VEB and sVEB have a lower space cost of O(n s² D_b), because with the feature selection step usually D_b ≪ D. Therefore, the difference
Therefore, the difference\nbecomes signi\ufb01cant when we are dealing with high dimensional data, particularly if they include a\nlarge number of redundant features.\n\nTime (hours)\n\nML ML+Boost VEB sML sML+Boost sVEB\n\nDataset 1 34\nDataset 2 7.5\n\n18\n4.25\n\n5 Conclusion\n\n2.5 96\n0.4 10.5\nTable 4: Training time for the different algorithms.\n\n4\n0.6\n\nD, Db\n\n48\n8\n\nn\ncl\ncu\ns\n\nlength of training sequence\nnumber of labeled training sequences\nnumber of unlabeled training sequences\nnumber of states\ndimension of observations\n\nWe presented sVEB, a new semi-supervised training method for CRFs, that can simultaneously\nselect discriminative features via modi\ufb01ed LogitBoost and utilize unlabeled data via minimum-\nentropy regularization. Our experimental results demonstrate the sVEB signi\ufb01cantly outperforms\nother training techniques in real-world activity recognition problems. The uni\ufb01ed framework for\nfeature selection and semi-supervised training presented in this paper reduces the computational and\nhuman labeling costs, which are often the major bottlenecks in building large classi\ufb01cation systems.\n\nAcknowledgments\nThe authors would like to thank Nando de Freitas and Lin Liao for many helpful discussions. This work was\nsupported by the NSF under grant number IIS 0433637 and NSERC Canada Graduate Scholarship.\n\nReferences\n[1] J. Lafferty, A. McCallum, and F. Pereira. Conditional random \ufb01elds: Probabilistic models for segmenting\nand labeling sequence data. In Proc. of the International Conference on Machine Learning (ICML), 2001.\n\n2The experiments were run in Matlab environment and as a result they took longer.\n\n7\n\n\f[2] Andrew McCallum. Ef\ufb01ciently inducing features or conditional random \ufb01elds. In Proc. of the Conference\n\non Uncertainty in Arti\ufb01cial Intelligence (UAI), 2003.\n\n[3] T. Dietterich, A. Ashenfelter, and Y. Bulatov. 
Training conditional random fields via gradient tree boosting. In Proc. of the International Conference on Machine Learning (ICML), 2004.\n\n[4] A. Torralba, K. P. Murphy, and W. T. Freeman. Contextual models for object detection using boosted random fields. In Advances in Neural Information Processing Systems (NIPS), 2004.\n\n[5] L. Liao, T. Choudhury, D. Fox, and H. Kautz. Training conditional random fields using virtual evidence boosting. In Proc. of the International Joint Conference on Artificial Intelligence (IJCAI), 2007.\n\n[6] K. Nigam, A. McCallum, S. Thrun, and T. Mitchell. Text classification from labeled and unlabeled documents using EM. Machine Learning, 2000.\n\n[7] X. Zhu, Z. Ghahramani, and J. Lafferty. Semi-supervised learning using Gaussian fields and harmonic functions. In Proc. of the International Conference on Machine Learning (ICML), 2003.\n\n[8] W. Li and A. McCallum. Semi-supervised sequence modeling with syntactic topic models. In Proc. of the National Conference on Artificial Intelligence (AAAI), 2005.\n\n[9] Y. Grandvalet and Y. Bengio. Semi-supervised learning by entropy minimization. In Advances in Neural Information Processing Systems (NIPS), 2004.\n\n[10] F. Jiao, S. Wang, C. H. Lee, R. Greiner, and D. Schuurmans. Semi-supervised conditional random fields for improved sequence segmentation and labeling. In Proc. of COLING/ACL, 2006.\n\n[11] C. Lee, S. Wang, F. Jiao, D. Schuurmans, and R. Greiner. Learning to model spatial dependency: Semi-supervised discriminative random fields. In Advances in Neural Information Processing Systems (NIPS), 2006.\n\n[12] J. S. Yedidia, W. T. Freeman, and Y. Weiss. Constructing free-energy approximations and generalized belief propagation algorithms. IEEE Transactions on Information Theory, 51(7):2282-2312, 2005.\n\n[13] Y. Weiss.
Comparing mean field method and belief propagation for approximate inference in MRFs. 2001.\n\n[14] J. Besag. Statistical analysis of non-lattice data. The Statistician, 24, 1975.\n\n[15] C. J. Geyer and E. A. Thompson. Constrained Monte Carlo maximum likelihood for dependent data. Journal of the Royal Statistical Society, 1992.\n\n[16] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. Additive logistic regression: a statistical view of boosting. The Annals of Statistics, 28(2):337-407, 2000.\n\n[17] G. Mann and A. McCallum. Efficient computation of entropy gradient for semi-supervised conditional random fields. In Human Language Technologies, 2007.\n\n6 Appendix\n\nIn this section, we show how we derived the equations for w_i and z_i (eq. (8)):\n\nL_F = L_sVEB = L_VEB - α H_emp = Σ_{i=1}^N log p(y_i|ve_i) + α Σ_{i=N+1}^M Σ_{y'_i} p(y'_i|ve_i) log p(y'_i|ve_i)\n\nAs in LogitBoost, the likelihood function L_F is maximized by learning an ensemble of weak learners. We start with an empty ensemble F = 0 and iteratively add the next best weak learner, f_t, by computing the Newton update\n\nF(ve_i, y_i) ← F(ve_i, y_i) - s/H,\n\nwhere s = ∂(L_F + f)/∂f |_{f=0} and H = ∂²(L_F + f)/∂f² |_{f=0} are the first and second derivatives of L_F with respect to f(ve_i, y_i):\n\ns = Σ_{i=1}^N 2(2y_i - 1)(1 - p(y_i|ve_i)) + α Σ_{i=N+1}^M Σ_{y'_i} [2(2y'_i - 1)(1 - p(y'_i|ve_i)) p(y'_i|ve_i)(1 - log p(y'_i|ve_i))]\n\nH = -[ Σ_{i=1}^N 4 p(y_i|ve_i)(1 - p(y_i|ve_i))(2y_i - 1)² + α² Σ_{i=N+1}^M Σ_{y'_i} 4(2y'_i - 1)²(1 - p(y'_i|ve_i))[p(y'_i|ve_i)(1 - p(y'_i|ve_i)) + log p(y'_i|ve_i)] ]\n\nThis gives the update\n\nF ← F + ( Σ_{i=1}^N z_i w_i + Σ_{i=N+1}^M Σ_{y'_i} z_i w_i ) / ( Σ_{i=1}^N w_i + Σ_{i=N+1}^M Σ_{y'_i} w_i )\n\nwhere\n\nw_i = p(y_i|ve_i)(1 - p(y_i|ve_i))  if 1 ≤ i ≤ N, eq. (4)\nw_i = α²(1 - p(y'_i|ve_i))[p(y'_i|ve_i)(1 - p(y'_i|ve_i)) + log p(y'_i|ve_i)]  if N < i ≤ M, eq. (8)\n\nand\n\nz_i = (y_i - 0.5)/p(y_i|ve_i)  if 1 ≤ i ≤ N, eq. (4)\nz_i = (y'_i - 0.5) p(y'_i|ve_i)(1 - log p(y'_i|ve_i)) / ( α[p(y'_i|ve_i)(1 - p(y'_i|ve_i)) + log p(y'_i|ve_i)] )  if N < i ≤ M, eq. (8)\n\nAt iteration t we obtain the best weak learner, f_t, by solving the WLSE problem in eq. (7).\n", "award": [], "sourceid": 863, "authors": [{"given_name": "Maryam", "family_name": "Mahdaviani", "institution": null}, {"given_name": "Tanzeem", "family_name": "Choudhury", "institution": null}]}