{"title": "Self-Paced Learning for Latent Variable Models", "book": "Advances in Neural Information Processing Systems", "page_first": 1189, "page_last": 1197, "abstract": "Latent variable models are a powerful tool for addressing several tasks in machine learning. However, the algorithms for learning the parameters of latent variable models are prone to getting stuck in a bad local optimum. To alleviate this problem, we build on the intuition that, rather than considering all samples simultaneously, the algorithm should be presented with the training data in a meaningful order that facilitates learning. The order of the samples is determined by how easy they are. The main challenge is that often we are not provided with a readily computable measure of the easiness of samples. We address this issue by proposing a novel, iterative self-paced learning algorithm where each iteration simultaneously selects easy samples and learns a new parameter vector. The number of samples selected is governed by a weight that is annealed until the entire training data has been considered. We empirically demonstrate that the self-paced learning algorithm outperforms the state of the art method for learning a latent structural SVM on four applications: object localization, noun phrase coreference, motif finding and handwritten digit recognition.", "full_text": "Self-Paced Learning for Latent Variable Models\n\nM. Pawan Kumar\n\nBenjamin Packer\n\nDaphne Koller\n\nComputer Science Department\n\nStanford University\n\n{pawan,bpacker,koller}@cs.stanford.edu\n\nAbstract\n\nLatent variable models are a powerful tool for addressing several tasks in machine\nlearning. However, the algorithms for learning the parameters of latent variable\nmodels are prone to getting stuck in a bad local optimum. 
To alleviate this prob-\nlem, we build on the intuition that, rather than considering all samples simulta-\nneously, the algorithm should be presented with the training data in a meaningful\norder that facilitates learning. The order of the samples is determined by how easy\nthey are. The main challenge is that often we are not provided with a readily com-\nputable measure of the easiness of samples. We address this issue by proposing a\nnovel, iterative self-paced learning algorithm where each iteration simultaneously\nselects easy samples and learns a new parameter vector. The number of samples\nselected is governed by a weight that is annealed until the entire training data has\nbeen considered. We empirically demonstrate that the self-paced learning algo-\nrithm outperforms the state of the art method for learning a latent structural SVM\non four applications: object localization, noun phrase coreference, motif \ufb01nding\nand handwritten digit recognition.\n\nIntroduction\n\n1\nLatent variable models provide an elegant formulation for several applications of machine learning.\nFor example, in computer vision, we may have many \u2018car\u2019 images from which we wish to learn a\n\u2018car\u2019 model. However, the exact location of the cars may be unknown and can be modeled as latent\nvariables. In medical diagnosis, learning to diagnose a disease based on symptoms can be improved\nby treating unknown or unobserved diseases as latent variables (to deal with confounding factors).\nLearning the parameters of a latent variable model often requires solving a non-convex optimization\nproblem. Some common approaches for obtaining an approximate solution include the well-known\nEM [8] and CCCP algorithms [9, 23, 24]. 
However, these approaches are prone to getting stuck in a bad local minimum with high training and generalization error.

The machine learning literature is filled with scenarios in which one is required to solve a non-convex optimization task, for example learning perceptrons or deep belief nets. A common approach for avoiding a bad local minimum in these cases is to use multiple runs with random initializations and pick the best solution amongst them (as determined, for example, by testing on a validation set). However, this approach is ad hoc and computationally expensive, as one may be required to use several runs to obtain an accurate solution. Bengio et al. [3] recently proposed an alternative method for training with non-convex objectives, called curriculum learning. The idea is inspired by the way children are taught: start with easier concepts (for example, recognizing objects in simple scenes where an object is clearly visible) and build up to more complex ones (for example, cluttered images with occlusions). Curriculum learning suggests using the easy samples first and gradually introducing the learning algorithm to more complex ones. The main challenge in using the curriculum learning strategy is that it requires the identification of easy and hard samples in a given training dataset. However, in many real-world applications, such a ranking of training samples may be onerous or conceptually difficult for a human to provide. Even if this additional human supervision can be provided, what is intuitively "easy" for a human may not match what is easy for the algorithm in the feature and hypothesis space employed for the given application.

To alleviate these deficiencies, we introduce self-paced learning. In the context of human education, self-paced learning refers to a system where the curriculum is determined by the pupil's abilities rather than being fixed by a teacher.
We build on this intuition for learning latent variable models by\n\n1\n\n\fdesigning an iterative approach that simultaneously selects easy samples and updates the parameters\nat each iteration. The number of samples selected at each iteration is determined by a weight that is\ngradually annealed such that later iterations introduce more samples. The algorithm converges when\nall samples have been considered and the objective function cannot be improved further. Note that,\nin self-paced learning, the characterization of what is \u201ceasy\u201d applies not to individual samples, but\nto sets of samples; a set of samples is easy if it admits a good \ufb01t in the model space.\n\nWe empirically demonstrate that our self-paced learning approach outperforms the state of the art\nalgorithm for learning a recently proposed latent variable model, called latent structural SVM, on\nfour standard machine learning applications using publicly available datasets.\n\n2 Related Work\nSelf-paced learning is related to curriculum learning in that both regimes suggest processing the\nsamples in a meaningful order. Bengio et al. [3] noted that curriculum learning can be seen as a type\nof continuation method [1]. However, in their work, they circumvented the challenge of obtaining\nsuch an ordering by using datasets where there is a clear distinction between easy and hard samples\n(for example, classifying equilateral triangles vs. squares is easier than classifying general triangles\nvs. general quadrilaterals). Such datasets are rarely available in real world applications, so it is not\nsurprising that the experiments in [3] were mostly restricted to small toy examples.\n\nOur approach also has a similar \ufb02avor to active learning, which chooses a sample to learn from at\neach iteration. Active learning approaches differ in their sample selection criteria. 
For example, Tong and Koller [21] suggest choosing a sample that is close to the margin (a "hard" sample), corresponding to anti-curriculum learning. Cohn et al. [6] advocate the use of the most uncertain sample with respect to the current classifier. However, unlike in our setting, in active learning the labels of the samples are not known at the time the samples are chosen.

Another related learning regime is co-training, which works by alternately training classifiers such that the most confidently labeled samples from one classifier are used to train the other [5, 17]. Our approach differs from co-training in that in our setting the latent variables are simply used to assist in predicting the target labels, which are always observed, whereas co-training deals with a semi-supervised setting in which some labels are missing.

3 Preliminaries
We will denote the training data as D = {(x_1, y_1), · · · , (x_n, y_n)}, where x_i ∈ X are the observed variables (which we refer to as input) for the ith sample and y_i ∈ Y are the output variables, whose values are known during training. In addition, latent variable models also contain latent, or hidden, variables that we denote by h_i ∈ H.
For example, when learning a 'car' model using image-level labels, x represents an image, the binary output y indicates the presence or absence of a car in the image, and h represents the car's bounding box (if present).

Given the training data, the parameters w of a latent variable model are learned by optimizing an objective function, for example by maximizing the likelihood of D or minimizing the risk over D. Typically, the learning algorithm proceeds iteratively, with each iteration consisting of two stages: (i) the hidden variables are either imputed or marginalized to obtain an estimate of the objective function that only depends on w; and (ii) the estimate of the objective function is optimized to obtain a new set of parameters. We briefly describe two such well-known algorithms below.

EM Algorithm for Likelihood Maximization. An intuitive objective is to maximize likelihood:

max_w Σ_i log Pr(x_i, y_i; w) = max_w ( Σ_i log Pr(x_i, y_i, h_i; w) − Σ_i log Pr(h_i | x_i, y_i; w) ).   (1)

A common approach for this task is to use the EM method [8] or one of its many variants [12]. Outlined in Algorithm 1, EM iterates between finding the expected value of the latent variables h and maximizing objective (1) subject to this expectation. We refer the reader to [8] for more details.

CCCP Algorithm for Risk Minimization. Given the true output y, we denote the user-specified risk of predicting ŷ(w) as Δ(y, ŷ(w)). The risk is usually highly non-convex in w, and therefore very difficult to minimize. An efficient way to overcome this difficulty is to use the recently proposed latent structural support vector machine (hereby referred to as latent SSVM) formulation [9, 23], which minimizes a regularized upper bound on the risk.
Latent SSVM provides a linear prediction rule of the form f_w(x) = argmax_{y∈Y, h∈H} w⊤Φ(x, y, h). Here, Φ(x, y, h) is the joint feature vector. For instance, in our 'car' model learning example, the joint feature vector can be modeled as the HOG [7] descriptor extracted using pixels in the bounding box h.

Algorithm 1 The EM algorithm for parameter estimation by likelihood maximization.
input D = {(x_1, y_1), · · · , (x_n, y_n)}, w_0, ε.
1: t ← 0
2: repeat
3:   Obtain the expectation of objective (1) under the distribution Pr(h_i | x_i, y_i; w_t).
4:   Update w_{t+1} by maximizing the expectation of objective (1). Specifically,
     w_{t+1} = argmax_w Σ_i Σ_{h_i∈H} Pr(h_i | x_i, y_i; w_t) log Pr(x_i, y_i, h_i; w).
5:   t ← t + 1.
6: until Objective function cannot be increased above tolerance ε.

The parameters w are learned by solving the following optimization problem:

min_{w, ξ_i ≥ 0}  (1/2)||w||² + (C/n) Σ_{i=1}^{n} ξ_i,
s.t.  max_{h_i∈H} w⊤Φ(x_i, y_i, h_i) − w⊤Φ(x_i, ŷ_i, ĥ_i) ≥ Δ(y_i, ŷ_i) − ξ_i,
      ∀ŷ_i ∈ Y, ∀ĥ_i ∈ H, i = 1, · · · , n.   (2)

For any given w, the value of ξ_i can be shown to be an upper bound on the risk Δ(y_i, ŷ_i(w)) (where ŷ_i(w) is the predicted output given w). The risk function can also depend on ĥ_i(w); that is, it can be of the form Δ(y_i, ŷ_i(w), ĥ_i(w)). We refer the reader to [23] for more details.

Problem (2) can be viewed as minimizing the sum of a convex and a concave function. This observation leads to a concave-convex procedure (CCCP) [24], outlined in Algorithm 2, which has been shown to converge to a local minimum or saddle point solution [19].
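As a concrete toy instance of the E/M alternation in Algorithm 1, the sketch below runs EM on a two-component 1-D Gaussian mixture, where the hidden variable is the component that generated each point. The model, the function name, and the initialization are illustrative choices, not taken from the paper:

```python
import numpy as np

def em_gmm_1d(x, n_iter=100, tol=1e-6):
    """EM for a two-component 1-D Gaussian mixture (toy illustration).

    Each iteration mirrors Algorithm 1: the E-step computes the posterior
    Pr(h_i | x_i; w_t) over component assignments, and the M-step
    re-estimates w = (pi, mu, var) by maximizing the expected
    complete-data log-likelihood."""
    pi = np.array([0.5, 0.5])                   # mixing weights
    mu = np.array([x.min(), x.max()])           # crude but deterministic init
    var = np.array([x.var(), x.var()]) + 1e-6
    prev_ll = -np.inf
    for _ in range(n_iter):
        # E-step: responsibility of each component for each sample.
        dens = np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
        joint = pi * dens                       # shape (n, 2)
        ll = np.log(joint.sum(axis=1)).sum()    # current log-likelihood
        resp = joint / joint.sum(axis=1, keepdims=True)
        # M-step: closed-form maximizers of the expected log-likelihood.
        nk = resp.sum(axis=0)
        pi = nk / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk + 1e-6
        if ll - prev_ll < tol:  # objective cannot be increased above tolerance
            break
        prev_ll = ll
    return pi, mu, var
```

On data drawn from two well-separated clusters, the recovered means land near the true cluster centers; like the algorithms discussed here, the result depends on the initialization.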
The CCCP algorithm has two main steps: (i) imputing the hidden variables (step 3), which corresponds to approximating the concave function by a linear upper bound; and (ii) updating the value of the parameters using the imputed values of the hidden variables (step 4). Note that updating the parameters requires us to solve a convex SSVM learning problem (where the output y_i is now concatenated with the hidden variable h*_i), for which several efficient algorithms exist in the literature [14, 20, 22].

Algorithm 2 The CCCP algorithm for parameter estimation of latent SSVM.
input D = {(x_1, y_1), · · · , (x_n, y_n)}, w_0, ε.
1: t ← 0
2: repeat
3:   Update h*_i = argmax_{h_i∈H} w_t⊤Φ(x_i, y_i, h_i).
4:   Update w_{t+1} by fixing the hidden variables for output y_i to h*_i and solving the corresponding convex SSVM problem. Specifically,
     w_{t+1} = argmin_w (1/2)||w||² + (C/n) Σ_i max_{ŷ_i, ĥ_i} max{0, Δ(y_i, ŷ_i) + w⊤Φ(x_i, ŷ_i, ĥ_i) − w⊤Φ(x_i, y_i, h*_i)}.
5:   t ← t + 1.
6: until Objective function cannot be decreased below tolerance ε.

4 Self-Paced Learning for Latent Variable Models
Our self-paced learning strategy alleviates the main difficulty of curriculum learning, namely the lack of a readily computable measure of the easiness of a sample. In the context of a latent variable model, for a given parameter w, this easiness can be defined in two ways: (i) a sample is easy if we are confident about the value of a hidden variable; or (ii) a sample is easy if it is easy to predict its true output. The two definitions are somewhat related: if we are more certain about the hidden variable, we may be more certain about the prediction. They are different in that certainty does not imply correctness, and the hidden variables may not be directly relevant to what makes the output of a sample easy to predict.
We therefore focus on the second definition: easy samples are ones whose correct output can be predicted easily (their likelihood is high, or they lie far from the margin).

In the above argument, we have assumed a given w. However, in order to operationalize self-paced learning, we need a strategy for simultaneously selecting the easy samples and learning the parameter w at each iteration. To this end, we note that the parameter update involves optimizing an objective function that depends on w (for example, see step 4 of both Algorithms 1 and 2). That is,

w_{t+1} = argmin_{w∈R^d} ( r(w) + Σ_{i=1}^{n} f(x_i, y_i; w) ),   (3)

where r(·) is a regularization function and f(·) is the negative log-likelihood for EM or an upper bound on the risk for latent SSVM (or any other criterion for parameter learning). We now modify the above optimization problem by introducing binary variables v_i that indicate whether the ith sample is easy or not. Only easy samples contribute to the objective function. Formally, at each iteration we solve the following mixed-integer program:

(w_{t+1}, v_{t+1}) = argmin_{w∈R^d, v∈{0,1}^n} ( r(w) + Σ_{i=1}^{n} v_i f(x_i, y_i; w) − (1/K) Σ_{i=1}^{n} v_i ).   (4)

K is a weight that determines the number of samples to be considered: if K is large, the problem prefers to consider only "easy" samples with a small value of f(·) (high likelihood, or far from the margin). Importantly, however, the samples are tied together in the objective through the parameter w. Therefore, no sample is considered independently easy; rather, a set of samples is easy if a w can be fit to it such that the corresponding values of f(·) are small. We iteratively decrease the value of K in order to estimate the parameters of a latent variable model via self-paced learning. As K approaches 0, more samples are included until problem (4) reduces to problem (3).
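A minimal sketch of the alternating scheme for problem (4) follows. The model-specific pieces are abstracted away as placeholders (`losses_fn` returns the per-sample values of f, `fit_fn` stands in for the solver of problem (3) on the selected subset); for a fixed w, the optimal v_i is 1 exactly when f(x_i, y_i; w) < 1/K, and K is annealed by a factor μ between rounds:

```python
import numpy as np

def self_paced_round(x, y, losses_fn, fit_fn, w, K):
    """One round of the alternating scheme for problem (4): select the
    easy samples under the current w, then re-fit w on them only."""
    f = losses_fn(x, y, w)           # per-sample losses f(x_i, y_i; w)
    v = (f < 1.0 / K).astype(int)    # closed-form optimal selection
    w = fit_fn(x[v == 1], y[v == 1])
    return w, v

def self_paced_learn(x, y, losses_fn, fit_fn, w0, K0, mu=1.3, n_rounds=20):
    """Outer loop: anneal K so that later rounds admit harder samples.
    When all samples are selected, problem (4) reduces to problem (3)."""
    w, K = w0, K0
    for _ in range(n_rounds):
        w, v = self_paced_round(x, y, losses_fn, fit_fn, w, K)
        K /= mu                      # 1/K threshold grows each round
        if v.all():                  # entire training set considered
            break
    return w
```

On a toy 1-D location-estimation problem (squared loss, mean as the solver), an early round with a tight threshold excludes an outlier and fits only the easy inliers; the final rounds consider every sample, exactly as in the annealing schedule described above.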
We thus begin with only a few easy examples, and gradually introduce more until the entire training dataset is used.

To optimize problem (4), we note that it can be relaxed such that each variable v_i is allowed to take any value in the interval [0, 1]. This relaxation is tight; that is, for any value of w, an optimum value of v_i is either 0 or 1 for all samples. If f(x_i, y_i; w) < 1/K then v_i = 1 yields the optimal objective function value. Similarly, if f(x_i, y_i; w) > 1/K then the objective is optimal when v_i = 0.

Relaxing problem (4) allows us to identify special cases where the optimum parameter update can be found efficiently. One such special case is when r(·) and f(·) are convex in w, as in the latent SSVM parameter update. In this case, the relaxation of problem (4) is a biconvex optimization problem. Recall that a biconvex problem is one where the variables z can be divided into two sets z_1 and z_2 such that, for a fixed value of either set, the optimal value of the other set can be obtained by solving a convex optimization problem. In our case, the two sets of variables are w and v. Biconvex problems have a vast literature, with both global [11] and local [2] optimization techniques. In this work, we use alternate convex search (ACS) [2], which alternately optimizes w and v while keeping the other set of variables fixed. We found in our experiments that ACS obtained accurate results.

Even in the general case with non-convex r(·) and/or f(·), we can use the alternating search strategy to efficiently obtain an approximate solution to problem (4). Given parameters w, we can obtain the optimum v as v_i = δ(f(x_i, y_i; w) < 1/K), where δ(·) is the indicator function. For a fixed v, problem (4) has the same form as problem (3). Thus, the optimization for self-paced learning is as easy (or as difficult) as the original parameter learning algorithm.

Self-Paced Learning for Latent SSVM.
As an illustrative example of self-paced learning, Algo-\nrithm 3 outlines the overall self-paced learning method for latent SSVM, which involves solving a\nmodi\ufb01ed version of problem (2). At each iteration, the weight K is reduced by a factor of \u00b5 > 1,\nintroducing more and more (dif\ufb01cult) samples from one iteration to the next. The algorithm con-\nverges when it considers all samples but is unable to decrease the latent SSVM objective function\nvalue below the tolerance \u01eb. We note that self-paced learning provides the same guarantees as CCCP:\nProperty: Algorithm 3 converges to a local minimum or saddle point solution of problem (2).\nThis follows from the fact that the last iteration of Algorithm 3 is the original CCCP algorithm.\n\nOur algorithm requires an initial parameter w0 (similar to CCCP). In our experiments, we obtained\nan estimate of w0 by initially setting vi = 1 for all samples and running the original CCCP algorithm\nfor a \ufb01xed, small number of iterations T0. As our results indicate, this simple strategy was suf\ufb01cient\nto obtain an accurate set of parameters using self-paced learning.\n\n5 Experiments\nWe now demonstrate the ef\ufb01cacy of self-paced learning in the context of latent SSVM. 
We show\nthat our approach outperforms the state of the art CCCP algorithm on four standard machine learning\n\n4\n\n\fAlgorithm 3 The self-paced learning algorithm for parameter estimation of latent SSVM.\ninput D = {(x1, y1), \u00b7 \u00b7 \u00b7 , (xn, yn)}, w0, K0, \u01eb.\n1: t \u2190 0, K \u2190 K0.\n2: repeat\n3:\n\nt \u03a6(xi, yi, hi).\n\ni = argmaxhi\u2208H w\u22a4\n\nUpdate h\u2217\nUpdate wt+1 by using ACS to minimize the objective 1\nsubject to the constraints of problem (2) as well as v \u2208 {0, 1}n.\nt \u2190 t + 1, K \u2190 K/\u00b5.\n\n2 ||w||2 + C\n\n5:\n6: until vi = 1, \u2200i and the objective function cannot be decreased below tolerance \u01eb.\n\n4:\n\nn Pn\n\ni=1 vi\u03bei \u2212 1\n\ni=1 vi\n\nK Pn\n\n(a)\n\n(b)\n\n(c)\n\nFigure 1: Results for the noun phrase coreference experiment. Top: MITRE score. Bottom: Pairwise score. (a)\nThe relative objective value computed as (objcccp \u2212objspl)/objcccp, where objcccp and objspl are the objective\nvalues of CCCP and self-paced learning respectively. A green circle indicates a signi\ufb01cant improvement (greater\nthan tolerance C\u01eb), while a red circle indicates a signi\ufb01cant decline. The black dashed line demarcates equal\nobjective values. (b) Loss over the training data. Minimum MITRE loss: 14.48 and 14.02 for CCCP and self-\npaced learning respectively; Minimum pairwise loss: 31.10 and 31.03. (c) Loss over the test data. Minimum\nMITRE loss: 15.38 and 14.91; Minimum pairwise loss: 34.10 and 33.93.\n\napplications. In all our experiments, the initial weight K0 is set such that the \ufb01rst iteration selects\nmore than half the samples (as there are typically more easy samples than dif\ufb01cult ones). The\nweight is reduced by a factor \u00b5 = 1.3 at each iteration and the parameters are initialized using\nT0 = 2 iterations of the original CCCP algorithm.\n\n5.1 Noun Phrase Coreference\nProblem Formulation. 
Given the occurrence of all the nouns in a document, the goal of noun phrase coreference is to provide a clustering of the nouns such that each cluster refers to a single object. This task was formulated within the SSVM framework in [10] and extended to include latent variables in [23]. Formally, the input vector x consists of the pairwise features x_ij suggested in [16] between all pairs of noun phrases i and j in the document. The output y represents a clustering of the nouns. A hidden variable h specifies a forest over the nouns such that each tree in the forest consists of all the nouns of one cluster. Imputing the hidden variables involves finding the maximum spanning forest (which can be found by Kruskal's or Prim's algorithm). Similar to [23], we employ two different loss functions, corresponding to the pairwise and MITRE scores.

Dataset. We use the publicly available MUC6 noun phrase coreference dataset, which consists of 60 documents. We use the same split of 30 training and 30 test documents as [23].

Results. We tested CCCP and our self-paced learning method on different values of C; the average training times over all 40 experiments (20 different values of C and two different loss functions) for the two methods were 1183 and 1080 seconds respectively. Fig. 1 compares the two methods in terms of the value of the objective function (which is the main focus of this work), the loss over the training data, and the loss over the test data. Note that self-paced learning significantly improves the objective function value in 11 of the 40 experiments (compared to only one experiment in which CCCP outperforms self-paced learning; see Fig. 1(a)). It also provides a better training and testing loss for both MITRE and pairwise scores when using the optimal value of C (see Fig. 1(b)-(c)).

5.2 Motif Finding
Problem Formulation. We consider the problem of binary classification of DNA sequences, which was cast as a latent SSVM in [23].
Specifically, the input vector x consists of a DNA sequence of length l (where each element of the sequence is a nucleotide of type A, G, T or C) and the output space is Y = {+1, −1}. In our experiments, the classes correspond to two different types of genes: those that bind to a protein of interest with high affinity and those that do not. The positive sequences are assumed to contain particular patterns, called motifs, of length m that are believed to be useful for classification. However, the starting position of the motif within a gene sequence is often not known. Hence, this position is treated as the hidden variable h. For this problem, we use the joint feature vector suggested by [23]. Here, imputing the hidden variables simply involves a search for the starting position of the motif. The loss function Δ is the standard 0-1 classification loss.

Dataset. We use the publicly available UniProbe dataset [4], which provides positive and negative DNA sequences for 177 proteins. For this work, we chose five proteins at random. The total number of sequences per protein is roughly 40,000.
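The imputation step, a search over motif start positions, might look as follows. The position-weight-matrix scoring used here is an illustrative stand-in for the actual joint feature vector of [23]; the names and encoding are our own:

```python
import numpy as np

NUC = {"A": 0, "C": 1, "G": 2, "T": 3}  # nucleotide -> row index

def impute_motif_position(seq, w, m):
    """Impute the hidden variable h = argmax_h (score of the length-m
    window starting at h).  Here w is a 4 x m weight matrix and each
    window is scored by summing the weight of the observed nucleotide
    at each of the m motif positions."""
    best_h, best_score = 0, float("-inf")
    for h in range(len(seq) - m + 1):
        window = seq[h:h + m]
        score = sum(w[NUC[c], j] for j, c in enumerate(window))
        if score > best_score:
            best_h, best_score = h, score
    return best_h
```

For instance, a weight matrix that rewards only G picks out the start of the G-richest window in a sequence.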
For all the sequences, the motif length m is known (provided with the UniProbe dataset) and the background Markov model is assumed to be of order k = 3. In order to specify a classification task for a particular protein, we randomly split the sequences into roughly 50% for training and 50% for testing.

(a) Objective function value
       Protein 1        Protein 2        Protein 3       Protein 4        Protein 5
CCCP   92.77 ± 0.99     106.50 ± 0.38    94.00 ± 0.53    116.63 ± 18.78   75.51 ± 1.97
SPL    92.37 ± 0.65     106.60 ± 0.30    93.51 ± 0.29    107.18 ± 1.48    74.23 ± 0.59

(b) Training error (%)
CCCP   27.10 ± 0.44     32.03 ± 0.31     26.90 ± 0.28    34.89 ± 8.53     20.09 ± 0.81
SPL    26.94 ± 0.26     32.04 ± 0.23     26.81 ± 0.19    30.31 ± 1.14     19.52 ± 0.34

(c) Test error (%)
CCCP   27.10 ± 0.36     32.15 ± 0.31     27.10 ± 0.37    35.42 ± 8.19     20.25 ± 0.65
SPL    27.08 ± 0.38     32.24 ± 0.25     27.03 ± 0.13    30.84 ± 1.38     19.65 ± 0.39

Table 1: Means and standard deviations for the motif finding experiments using the original CCCP algorithm (top row) and the proposed self-paced learning approach (bottom row). Note that self-paced learning provides an improved objective value (the primary concern of this work) for all proteins. The improvement in objective value also translates to an improvement in training and test errors.

Results. We used five different folds for each protein, randomly initializing the motif positions for all training samples using four different seed values (fixed for both methods). We report results for each method using the best seed (chosen according to the value of the objective function). For all experiments we use C = 150 and ε = 0.001 (the large size of the dataset made cross-validation highly time consuming).
The average times over all 100 runs for CCCP and self-paced learning were 824 and 1287 seconds respectively. Although our approach is slower than CCCP for this application, as Table 1 shows, it learns a better set of parameters. While the improvements for most folds are small, for the fourth protein CCCP gets stuck in a bad local minimum despite using multiple random initializations (this is indicated by the large mean and standard deviation values). This behavior is to be expected: in many cases, the objective function landscape is such that CCCP avoids local optima, but in some cases it gets stuck in a poor local optimum. Indeed, over all 100 runs (5 proteins, 5 folds and 4 seed values), CCCP got stuck in a bad local minimum 18 times (where a bad local minimum is one that gave 50% test error), compared to 1 run where self-paced learning got stuck.

Fig. 2 shows the average Hamming distance between the motifs of the selected samples at each iteration of the self-paced learning algorithm. Note that initially the algorithm selects samples whose motifs have a low Hamming distance (which intuitively correspond to the easy samples for this application). It then gradually introduces more difficult samples (as indicated by the rise in the average Hamming distance). Finally, it considers all samples and attempts to find the most discriminative motif across the entire dataset. Note that the motifs found over the entire dataset using self-paced learning have a smaller average Hamming distance than those found using the original CCCP algorithm, indicating a greater coherence in the resulting output.

5.3 Handwritten Digit Recognition
Problem Formulation. Handwritten digit recognition is a special case of multi-class classification, and hence can be formulated within the SSVM framework.
Specifically, given an input vector x, which consists of m grayscale values that represent an image of a handwritten digit, our aim is to predict the digit. In other words, Y = {0, 1, · · · , 9}. It is well known that the accuracy of digit recognition can be greatly improved by explicitly modeling the deformations present in each image, for example see [18]. For simplicity, we assume that the deformations are restricted to an arbitrary rotation of the image, where the angle of rotation is not known beforehand. This angle (which takes a value from a finite discrete set) is modeled as the hidden variable h. We specify the joint feature vector as Φ(x, y, h) = (0_{y(m+1)}; θ_h(x); 1; 0_{(9−y)(m+1)}), where θ_h(x) is the vector representation of the image x rotated by the angle corresponding to h. In other words, the joint feature vector is the rotated image of the digit, padded in the front and back with the appropriate number of zeroes. Imputing the hidden variables simply involves a search over a discrete set of angles.

Figure 2: Average Hamming distance between the motifs found in all selected samples at each iteration. Our approach starts with easy samples (small Hamming distance) and gradually introduces more difficult samples (large Hamming distance) until it starts to consider all samples of the training set. The figure shows results for three different protein-fold pairs. The average Hamming distance (over all proteins and folds) of the motifs obtained at convergence is 0.6155 for CCCP and 0.6099 for self-paced learning.

Figure 3: Four digit pairs from MNIST: 1-7, 2-7, 3-8, 8-9. Relative objective is computed as in Fig. 1. Positive values indicate superior results for self-paced learning. The dotted black lines delineate where the difference is greater than the convergence criterion range (Cε); differences outside this range are highlighted in blue.
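The joint feature vector and the prediction rule for this model can be sketched as below. The `theta_by_angle` input (one feature vector of the rotated image per candidate angle) is an assumed precomputed quantity; how the rotation and PCA are carried out is left abstract:

```python
import numpy as np

def joint_feature(theta_hx, y, n_classes=10):
    """Build Phi(x, y, h) = (0_{y(m+1)}; theta_h(x); 1; 0_{(9-y)(m+1)}).

    theta_hx is the length-m feature vector of the image rotated by the
    angle indexed by h; the constant 1 acts as a per-class bias, and the
    zero padding places the block in the slot belonging to class y."""
    m = len(theta_hx)
    block = np.append(theta_hx, 1.0)        # (theta_h(x); 1), length m + 1
    phi = np.zeros(n_classes * (m + 1))
    phi[y * (m + 1):(y + 1) * (m + 1)] = block
    return phi

def predict(theta_by_angle, w, n_classes=10):
    """Prediction rule f_w(x): maximize w^T Phi(x, y, h) jointly over the
    label y and the discrete rotation index h."""
    scores = [(float(w @ joint_feature(t, y, n_classes)), y, h)
              for h, t in enumerate(theta_by_angle)
              for y in range(n_classes)]
    _, y_best, h_best = max(scores)
    return y_best, h_best
```

Note that the rotation index is recovered for free as a by-product of the joint argmax, which is exactly how the hidden variables are imputed.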
Similar to the motif finding experiment, we use the standard 0-1 classification loss.

Dataset. We use the standard MNIST dataset [15], which represents each handwritten digit as a vector of length 784 (that is, an image of size 28 × 28). For efficiency, we use PCA to reduce the dimensionality of each sample to 10. We perform binary classification on four difficult digit pairs (1-7, 2-7, 3-8, and 8-9), as in [25]. The standard training set size for each digit ranges from 5,851 to 6,742, and the test sets range from 974 to 1,135 digits. The rotation modeled by the hidden variable can take one of 11 discrete values, evenly spaced between −60 and 60 degrees.

Results. For each digit pair, we use C values ranging from 25 to 300, set ε = 0.001, and set K = 10^4/C. Modeling rotation as a hidden variable significantly improves classification performance, allowing the images to be better aligned with each other. Across all experiments for both learning methods, using hidden variables achieves better test error; the improvement over using no hidden variables is 12%, 8%, 11%, and 22%, respectively, for the four digit pairs. CCCP learning took an average of 18 minutes across all runs, while self-paced learning took an average of 53 minutes.

Figure 3 compares the training and test errors and objective values between CCCP and self-paced learning. Self-paced learning achieves significantly better values in 15 runs and is worse in 4 runs, demonstrating that it helps find better solutions to the optimization problems. Though training and test errors do not necessarily correlate with objective values, the best test error across C values is better for self-paced learning for one of the digit pairs (1-7), and is the same for the others.

5.4 Object Localization
Problem Formulation.
Given a set of images along with labels that indicate the presence of a particular object category in the image (for example, a mammal), our goal is to learn discriminative object models for all object categories (that is, models that can distinguish one object, say bison, from another, say elephant). In practice, although it is easy to mine such images from free photo-sharing websites such as Flickr, it is burdensome to obtain ground-truth annotations of the exact location of the object in each image. To avoid requiring these human annotations, we model the location of the object as a hidden variable. Formally, for a given image x, category y and location h, the score is modeled as w^T Φ(x, y, h) = w_y^T Φ_h(x), where w_y are the parameters that correspond to the class y and Φ_h(·) is the HOG [7, 9] feature extracted from the image at position h (the size of the object is assumed to be the same for all images, a reasonable assumption for our datasets). For the above problem, imputing the hidden variables involves a simple search over possible locations in a given image. The loss function Δ(y, ŷ) is again the standard 0-1 classification loss.

Figure 4: The top row shows the imputed bounding boxes of an easy and a hard image using the CCCP algorithm over increasing iterations (left to right). Note that for the hard (deer) image, the bounding box obtained at convergence does not localize the object accurately. In contrast, the self-paced learning approach (bottom row) does not use the hard image during initial iterations (indicated by the red color of the bounding box). In subsequent iterations, it is able to impute accurate bounding boxes for both the easy and hard images.

Dataset. We use images of 6 different mammals (approximately 45 images per mammal) that have been previously employed for object localization [13].
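The latent-position search and the joint prediction over category and location can be sketched as below. All names are illustrative; in particular, `feature_at(h)` is a hypothetical stand-in for HOG extraction at candidate position h.

```python
import numpy as np

def impute_location(w_y, feature_at, positions):
    """Sketch of imputing the hidden location h: score each candidate
    position with w_y^T Phi_h(x) and keep the best.  `feature_at(h)` is
    a hypothetical stand-in for HOG feature extraction at position h."""
    scores = [float(w_y @ feature_at(h)) for h in positions]
    return positions[int(np.argmax(scores))]

def predict_category(ws, feature_at, positions):
    """Prediction maximizes the score jointly over the category y and
    the location h; `ws` maps category -> parameter vector w_y."""
    best = max((float(w_y @ feature_at(h)), y)
               for y, w_y in ws.items() for h in positions)
    return best[1]
```

Since the object size is assumed fixed, the candidate set `positions` is just a grid of top-left corners within the image, so imputation remains a simple exhaustive search.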
We split the images of each category into approximately 90% for training and 10% for testing.

Results. We use five different folds to compare our method with the state of the art CCCP algorithm. For each fold, we randomly initialized the location of the object in each image (the initialization was the same for the two methods). We used a value of C = 10 and ε = 0.001. The average training time over all folds was 362 seconds for CCCP and 482 seconds for self-paced learning. Table 2 shows the mean and standard deviation of three terms: the objective value, the training loss and the test loss. Self-paced learning provided a significantly lower (by more than the tolerance) objective value than CCCP for all folds. The better objective value resulted in a substantial improvement in the training loss (for 4 folds) and the test loss (an improvement of approximately 4% achieved for 2 folds). In these experiments, CCCP never outperformed self-paced learning on any of the three measures of performance.

                 Objective      Train Loss (%)   Test Loss (%)
CCCP             4.70 ± 0.11    0.33 ± 0.18      16.92 ± 5.16
Self-paced       4.53 ± 0.15    0.0 ± 0.0        15.38 ± 3.85

Table 2: Results for the object localization experiment. Note that self-paced learning provides better results on all measures of performance.

Fig. 4 shows the imputed bounding boxes for two images during various iterations of the two algorithms. The proposed self-paced learning algorithm does not use the hard image during the initial iterations (as indicated by the red bounding box). In contrast, CCCP considers all images at each iteration. Note that self-paced learning provides a more accurate bounding box for the hard image at convergence, thereby illustrating the importance of learning in a meaningful order.
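Across all of these experiments, the "easy first" ordering comes from the same selection step: given the current per-sample losses and the self-paced weight K, the minimizer of the biconvex selection objective keeps exactly the samples whose loss falls below 1/K, and annealing K admits progressively harder samples. A minimal sketch of this rule (illustrative, not the authors' code):

```python
import numpy as np

def select_easy_samples(losses, K):
    """Self-paced selection step: sample i is 'easy' (selected) iff its
    current loss l_i is below the threshold 1/K.  As K is annealed
    toward 0, the threshold grows until every sample is selected."""
    return np.asarray(losses, dtype=float) < 1.0 / K

# Annealing K admits harder samples over iterations: with these losses,
# the number of selected samples grows from 1 to all 3.
losses = [0.05, 0.4, 2.0]
schedule = [int(select_easy_samples(losses, K).sum()) for K in (10.0, 1.0, 0.1)]
```

In the full algorithm, each such selection is alternated with a parameter update on the selected samples only, which is the biconvex alternation discussed below.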
In our experience, this was typical behavior of the two algorithms.

6 Discussion
We proposed the self-paced learning regime in the context of parameter estimation for latent variable models. Our method works by iteratively solving a biconvex optimization problem that simultaneously selects easy samples and updates the parameters. Using four standard datasets from disparate domains (natural language processing, computational biology and computer vision), we showed that our method outperforms the state of the art approach.

In the current work, we solve the biconvex optimization problem using an alternate convex search strategy, which only provides us with a local minimum solution. Although our results indicate that such a strategy is more accurate than the state of the art, it is worth noting that the biconvex problem can also be solved using a global optimization procedure, for example the one described in [11]. This is a valuable direction for future work. We are also currently investigating the benefits of self-paced learning on other computer vision applications, where the ability to handle large and rapidly growing weakly supervised data is fundamental to the success of the field.

Acknowledgements. This work is supported by NSF under grant IIS 0917151, MURI contract N000140710747, and the Boeing company.

References

[1] E. Allgower and K. Georg. Numerical Continuation Methods: An Introduction. Springer-Verlag, 1990.

[2] M. Bazaraa, H. Sherali, and C. Shetty. Nonlinear Programming: Theory and Algorithms. John Wiley and Sons, Inc., 1993.

[3] Y. Bengio, J. Louradour, R. Collobert, and J. Weston. Curriculum learning. In ICML, 2009.

[4] M. Berger, G. Badis, A. Gehrke, S. Talukder, et al. Variation in homeodomain DNA binding revealed by high-resolution analysis of sequence preferences. Cell, 27, 2008.

[5] A. Blum and T. Mitchell. Combining labeled and unlabeled data with co-training.
In COLT, 1998.

[6] D. Cohn, Z. Ghahramani, and M. Jordan. Active learning with statistical models. JAIR, 4:129-145, 1996.

[7] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In CVPR, 2005.

[8] A. Dempster, N. Laird, and D. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, 39(1):1-38, 1977.

[9] P. Felzenszwalb, D. McAllester, and D. Ramanan. A discriminatively trained, multiscale, deformable part model. In CVPR, 2008.

[10] T. Finley and T. Joachims. Supervised clustering with support vector machines. In ICML, 2005.

[11] C. Floudas and V. Visweswaran. Primal-relaxed dual global optimization approach. Journal of Optimization Theory and Applications, 78(2):187-225, 1993.

[12] A. Gelman, J. Carlin, H. Stern, and D. Rubin. Bayesian Data Analysis. Chapman and Hall, 1995.

[13] G. Heitz, G. Elidan, B. Packer, and D. Koller. Shape-based object localization for descriptive classification. IJCV, 2009.

[14] T. Joachims, T. Finley, and C.-N. Yu. Cutting-plane training of structural SVMs. Machine Learning, 77(1):27-59, 2009.

[15] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, 1998.

[16] V. Ng and C. Cardie. Improving machine learning approaches to coreference resolution. In ACL, 2002.

[17] K. Nigam and R. Ghani. Analyzing the effectiveness and applicability of co-training. In CIKM, 2000.

[18] P. Simard, B. Victorri, Y. LeCun, and J. Denker. Tangent Prop: a formalism for specifying selected invariances in an adaptive network. In NIPS, 1991.

[19] B. Sriperumbudur and G. Lanckriet. On the convergence of the concave-convex procedure. In NIPS Workshop on Optimization for Machine Learning, 2009.

[20] B. Taskar, C. Guestrin, and D. Koller. Max-margin Markov networks.
In NIPS, 2003.

[21] S. Tong and D. Koller. Support vector machine active learning with applications to text classification. JMLR, 2:45-66, 2001.

[22] I. Tsochantaridis, T. Hofmann, Y. Altun, and T. Joachims. Support vector machine learning for interdependent and structured output spaces. In ICML, 2004.

[23] C.-N. Yu and T. Joachims. Learning structural SVMs with latent variables. In ICML, 2009.

[24] A. Yuille and A. Rangarajan. The concave-convex procedure. Neural Computation, 15, 2003.

[25] K. Zhang, I. Tsang, and J. Kwok. Maximum margin clustering made practical. In ICML, 2007.