{"title": "Self-Paced Learning with Diversity", "book": "Advances in Neural Information Processing Systems", "page_first": 2078, "page_last": 2086, "abstract": "Self-paced learning (SPL) is a recently proposed learning regime inspired by the learning process of humans and animals that gradually incorporates easy to more complex samples into training. Existing methods are limited in that they ignore an important aspect in learning: diversity. To incorporate this information, we propose an approach called self-paced learning with diversity (SPLD) which formalizes the preference for both easy and diverse samples into a general regularizer. This regularization term is independent of the learning objective, and thus can be easily generalized into various learning tasks. Albeit non-convex, the optimization of the variables included in this SPLD regularization term for sample selection can be globally solved in linearithmic time. We demonstrate that our method significantly outperforms the conventional SPL on three real-world datasets. Specifically, SPLD achieves the best MAP so far reported in literature on the Hollywood2 and Olympic Sports datasets.", "full_text": "Self-Paced Learning with Diversity\n\nLu Jiang1, Deyu Meng1,2, Shoou-I Yu1, Zhenzhong Lan1, Shiguang Shan1,3,\n\nAlexander G. Hauptmann1\n\n1School of Computer Science, Carnegie Mellon University\n\n2School of Mathematics and Statistics, Xi\u2019an Jiaotong University\n3Institute of Computing Technology, Chinese Academy of Sciences\n\nlujiang@cs.cmu.edu, dymeng@mail.xjtu.edu.cn\n\n{iyu, lanzhzh}@cs.cmu.edu, sgshan@ict.ac.cn, alex@cs.cmu.edu\n\nAbstract\n\nSelf-paced learning (SPL) is a recently proposed learning regime inspired by the\nlearning process of humans and animals that gradually incorporates easy to more\ncomplex samples into training. Existing methods are limited in that they ignore\nan important aspect in learning: diversity. 
To incorporate this information, we propose an approach called self-paced learning with diversity (SPLD), which formalizes the preference for both easy and diverse samples into a general regularizer. This regularization term is independent of the learning objective, and thus can be easily generalized to various learning tasks. Albeit non-convex, the optimization of the variables included in this SPLD regularization term for sample selection can be globally solved in linearithmic time. We demonstrate that our method significantly outperforms conventional SPL on three real-world datasets. Specifically, SPLD achieves the best MAP reported so far in the literature on the Hollywood2 and Olympic Sports datasets.

1 Introduction

Since it was proposed in 2009, Curriculum Learning (CL) [1] has attracted increasing attention in machine learning and computer vision [2]. The paradigm is inspired by the learning principle underlying the cognitive process of humans and animals, which generally starts with the easier aspects of a task and then gradually takes more complex examples into consideration. It has been empirically demonstrated to help avoid bad local minima and to achieve better generalization [1].

A sequence of gradually added training samples [1] is called a curriculum. A straightforward way to design a curriculum is to select samples based on certain heuristic "easiness" measurements [3, 4, 5]. This ad-hoc implementation, however, is problem-specific and lacks generalization capacity. To alleviate this deficiency, Kumar et al. [6] proposed a method called Self-Paced Learning (SPL) that embeds curriculum design into model learning. SPL introduces a regularization term into the learning objective so that the model is jointly learned with a curriculum consisting of easy to complex samples.
As its name suggests, the curriculum is gradually determined by the model itself based on what it has already learned, as opposed to predefined heuristic criteria. Since the curriculum in SPL is independent of the model objective of the specific problem, SPL represents a general implementation [7, 8] of curriculum learning.

In SPL, samples in a curriculum are selected solely in terms of "easiness". In this work, we reveal that diversity, an important aspect in learning, should also be considered. Ideal self-paced learning should utilize not only easy but also diverse examples that are sufficiently dissimilar from what has already been learned. Theoretically, considering diversity in learning is consistent with the increasing-entropy theory in CL, which holds that a curriculum should increase the diversity of training examples [1]. This can be intuitively explained in the context of human education. A rational curriculum for a pupil not only needs to include examples of suitable easiness matching her learning pace, but also, importantly, should include some diverse examples on the subject in order for her to develop more comprehensive knowledge. Likewise, learning from easy and diverse samples is expected to be better than learning from either criterion alone.

[Figure 1: Illustrative comparison of SPL and SPLD on the "Rock Climbing" event using real samples [15]. Three groups of keyframes ("Outdoor bouldering", "Artificial wall climbing", "Snow mountain climbing") are shown with their loss values. SPL tends to first select the easiest samples from a single group, whereas SPLD inclines to select easy and diverse samples from multiple groups.]

We name the learning paradigm that considers both easiness and diversity Self-Paced Learning with Diversity (SPLD). SPLD proves to be a general learning framework, as its intuition is embedded as a regularization term that is independent of the specific model objective. In addition, by considering diversity in learning, SPLD is capable of obtaining better solutions. For example, Fig. 1 plots some positive samples for the event "Rock Climbing" from a real dataset, MED [15]. Three groups of samples are depicted for illustration. The number under each keyframe indicates its loss; a smaller loss corresponds to an easier sample. Every group contains both easy and complex samples. Having learned some samples from a group, the SPL model prefers to select more samples from the same group, as they appear easy given what the model has already learned. This may lead to overfitting to a data subset while easy samples in other groups are ignored. For example, in Fig. 1, the samples selected in the first iterations of SPL are all from the "Outdoor bouldering" sub-event because they all look like a1. The problem is significant because the overfitting grows more severe as samples from the same group keep being added to the training set. The phenomenon is more evident in real-world data, where the collected samples are usually biased towards some groups. In contrast, SPLD, considering both easiness and diversity, produces a curriculum that reasonably mixes easy samples from multiple groups. The diverse curriculum is expected to help the model quickly grasp easy and comprehensive knowledge and to obtain better solutions.
This hypothesis is substantiated by our experiments.

The contribution of this paper is threefold: (1) We propose a novel idea of considering both easiness and diversity in self-paced learning, and formulate it into a concise regularization term that can be generally applied to various problems (Section 4.1). (2) We introduce an algorithm that globally optimizes a non-convex problem w.r.t. the variables included in the SPLD regularization term for sample selection (Section 4.2). (3) We demonstrate that the proposed SPLD significantly outperforms SPL on three real-world datasets. Notably, SPLD achieves the best MAP reported so far in the literature on two action datasets.

2 Related work

Bengio et al. [1] proposed a new learning paradigm called curriculum learning (CL), in which a model is learned by gradually including samples into training from easy to complex so as to increase the entropy of training samples. Afterwards, Bengio and his colleagues [2] presented insightful explorations of the rationality underlying this learning paradigm, and discussed its relationship to conventional optimization techniques, e.g., continuation and annealing methods. From a human behavioral perspective, Khan et al. [10] provided evidence that CL is consistent with the principles of teaching. The curriculum is often derived from predetermined heuristics in particular problems. For example, Ruvolo and Eaton [3] took the negative distance to the boundary as the indicator of easiness in classification. Spitkovsky et al. [4] used sentence length as an indicator in studying grammar induction: shorter sentences have fewer possible solutions and thus were learned earlier. Lapedriza et al.
[5] proposed a similar approach by first ranking examples based on certain "training values" and then greedily training the model on the sorted examples.

The ad-hoc curriculum design in CL turns out to be onerous or conceptually difficult to implement across different problems. To alleviate this issue, Kumar et al. [6] designed a new formulation, called self-paced learning (SPL), which embeds curriculum design (from easy to more complex samples) into model learning. By virtue of its generality, various applications based on SPL have been proposed recently [7, 8, 11, 12, 13]. For example, Jiang et al. [7] discovered that pseudo relevance feedback is a type of self-paced learning, which explains the rationale of this iterative algorithm starting from the easy examples, i.e., the top-ranked documents/videos. Tang et al. [8] formulated a self-paced domain adaptation approach by training target-domain knowledge starting with easy samples in the source domain. Kumar et al. [11] developed an SPL strategy for the specific-class segmentation task. Supančič and Ramanan [12] designed an SPL method for long-term tracking by setting the smallest increase in the SVM objective as the loss function. To the best of our knowledge, there have been no studies that incorporate diversity into SPL.

3 Self-Paced Learning

Before introducing our approach, we first briefly review SPL. Given the training dataset $D = \{(x_1, y_1), \cdots, (x_n, y_n)\}$, where $x_i \in \mathbb{R}^m$ denotes the $i$th observed sample and $y_i$ represents its label, let $L(y_i, f(x_i, w))$ denote the loss function, which calculates the cost between the ground-truth label $y_i$ and the estimated label $f(x_i, w)$. Here $w$ represents the model parameter inside the decision function $f$.
In SPL, the goal is to jointly learn the model parameter $w$ and the latent weight variable $v = [v_1, \cdots, v_n]$ by minimizing:

\[ \min_{w,v} E(w, v; \lambda) = \sum_{i=1}^{n} v_i L(y_i, f(x_i, w)) - \lambda \sum_{i=1}^{n} v_i, \quad \text{s.t. } v \in [0, 1]^n, \tag{1} \]

where λ is a parameter controlling the learning pace. Eq. (1) indicates that the loss of a sample is discounted by a weight. The objective of SPL is to minimize the weighted training loss together with the negative $l_1$-norm regularizer $-\|v\|_1 = -\sum_{i=1}^{n} v_i$ (since $v_i \geq 0$). This regularization term is general and applicable to various learning tasks with different loss functions [7, 11, 12].

ACS (Alternative Convex Search) is generally used to solve Eq. (1) [6, 8]. It is an iterative method for biconvex optimization, in which the variables are divided into two disjoint blocks. In each iteration, one block of variables is optimized while the other block is kept fixed. When $v$ is fixed, existing off-the-shelf supervised learning methods can be employed to obtain the optimal $w^*$. With $w$ fixed, the global optimum $v^* = [v_1^*, \cdots, v_n^*]$ can be easily calculated by [6]:

\[ v_i^* = \begin{cases} 1, & L(y_i, f(x_i, w)) < \lambda, \\ 0, & \text{otherwise.} \end{cases} \tag{2} \]

There is an intuitive explanation behind this alternative search strategy: 1) when updating $v$ with a fixed $w$, a sample whose loss is smaller than a certain threshold λ is taken as an "easy" sample and is selected in training ($v_i^* = 1$), and is otherwise unselected ($v_i^* = 0$); 2) when updating $w$ with a fixed $v$, the classifier is trained only on the selected "easy" samples. The parameter λ controls the pace at which the model learns new samples; physically, λ corresponds to the "age" of the model. When λ is small, only "easy" samples with small losses will be considered.
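The closed-form update of Eq. (2) amounts to a simple hard threshold on the per-sample losses. A minimal sketch (the function name `spl_select` is illustrative, not from the paper; the update of w is delegated to whatever supervised learner is plugged in):

```python
def spl_select(losses, lam):
    """Eq. (2): with w fixed, select exactly the samples whose loss falls
    below the age parameter lam (v_i = 1), and drop the rest (v_i = 0)."""
    return [1 if loss < lam else 0 for loss in losses]

# Toy example: five samples with increasing losses and a small "age" lam.
losses = [0.05, 0.12, 0.14, 0.40, 0.50]
print(spl_select(losses, lam=0.15))  # -> [1, 1, 1, 0, 0]
```

As λ is increased across iterations, the same rule admits progressively harder samples, which is exactly the pace-growing behavior described above.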
As λ grows, more samples with larger losses are gradually appended to train a more "mature" model.

4 Self-Paced Learning with Diversity

In this section we detail the proposed learning paradigm, SPLD. We first formally define its objective in Section 4.1, and then discuss an efficient algorithm to solve the problem in Section 4.2.

4.1 SPLD Model

Diversity implies that the selected samples should be less similar or clustered. An intuitive way to realize this is to select samples from different groups scattered in the sample space. We assume that the correlation of samples between groups is less than that within a group. This auxiliary group membership is either given, e.g., in object recognition, frames from the same video can be regarded as belonging to the same group, or can be obtained by clustering samples.

The aim of SPLD can be described mathematically as follows. Assume that the training samples $X = (x_1, \cdots, x_n) \in \mathbb{R}^{m \times n}$ are partitioned into $b$ groups: $X^{(1)}, \cdots, X^{(b)}$, where the columns of $X^{(j)} \in \mathbb{R}^{m \times n_j}$ correspond to the samples in the $j$th group, $n_j$ is the number of samples in group $j$, and $\sum_{j=1}^{b} n_j = n$. Accordingly, denote the weight vector as $v = [v^{(1)}, \cdots, v^{(b)}]$, where $v^{(j)} = (v_1^{(j)}, \cdots, v_{n_j}^{(j)})^T \in [0, 1]^{n_j}$. On one hand, SPLD needs to assign the nonzero weights of $v$ to easy samples, as in conventional SPL; on the other hand, it needs to disperse the nonzero elements across as many groups $v^{(j)}$ as possible to increase diversity. Both requirements can be uniformly realized through the following optimization model:

\[ \min_{w,v} E(w, v; \lambda, \gamma) = \sum_{i=1}^{n} v_i L(y_i, f(x_i, w)) - \lambda \sum_{i=1}^{n} v_i - \gamma \|v\|_{2,1}, \quad \text{s.t. } v \in [0, 1]^n, \tag{3} \]

where λ and γ are the parameters imposed on the easiness term (the negative $l_1$-norm: $-\|v\|_1$) and the diversity term (the negative $l_{2,1}$-norm: $-\|v\|_{2,1}$), respectively. For the diversity term, we have:

\[ -\|v\|_{2,1} = -\sum_{j=1}^{b} \|v^{(j)}\|_2. \tag{4} \]

SPLD introduces a new regularization term in Eq. (3) that consists of two components. One is the negative $l_1$-norm inherited from conventional SPL, which favors selecting easy over complex examples. The other is the proposed negative $l_{2,1}$-norm, which favors selecting diverse samples residing in more groups. It is well known that the $l_{2,1}$-norm leads to a group-wise sparse representation of $v$ [14], i.e., the non-zero entries of $v$ tend to be concentrated in a small number of groups. Contrariwise, the negative $l_{2,1}$-norm has the counter-effect to group-wise sparsity, i.e., the nonzero entries of $v$ tend to be scattered across a large number of groups. In other words, this anti-group-sparsity representation is expected to realize the desired diversity. Note that when each group contains only a single sample, Eq. (3) degenerates to Eq. (1).

Unlike the convex regularization term in Eq. (1) of SPL, the term in SPLD is non-convex. Consequently, traditional (sub)gradient-based methods cannot be directly applied to optimize $v$. We discuss an algorithm that resolves this issue in the next subsection.

4.2 SPLD Algorithm

As in SPL, the alternative search strategy can be employed to solve Eq. (3). A challenge, however, is that optimizing $v$ with a fixed $w$ becomes a non-convex problem. We propose a simple yet effective algorithm for extracting the global optimum of this problem, as listed in Algorithm 1. It takes as input the groups of samples, the up-to-date model parameter $w$, and the two self-paced parameters, and outputs the optimal $v$ of $\min_v E(w, v; \lambda, \gamma)$.
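For fixed w (i.e., with per-sample losses precomputed), the SPLD objective of Eq. (3) can be evaluated directly; a minimal sketch, with illustrative names (`spld_objective`, a `groups` dict mapping group ids to sample indices) not taken from the paper:

```python
import math

def spld_objective(losses, v, groups, lam, gamma):
    """Evaluate E(w, v; lam, gamma) of Eq. (3) with w fixed.
    `losses` holds L(y_i, f(x_i, w)); `groups` maps group id -> sample indices."""
    weighted_loss = sum(v[i] * losses[i] for i in range(len(losses)))
    easiness = -lam * sum(v)                      # negative l1-norm term
    diversity = -gamma * sum(                     # negative l2,1-norm term, Eq. (4)
        math.sqrt(sum(v[i] ** 2 for i in idx)) for idx in groups.values()
    )
    return weighted_loss + easiness + diversity

# Two selections with identical total loss and identical l1-norm: spreading the
# nonzero weights over two groups yields a lower (better) objective than
# concentrating them in one group, which is the anti-group-sparsity effect.
losses = [0.1, 0.1, 0.1, 0.1]
groups = {0: [0, 1], 1: [2, 3]}
concentrated = spld_objective(losses, [1, 1, 0, 0], groups, lam=0.2, gamma=0.1)
dispersed = spld_objective(losses, [1, 0, 1, 0], groups, lam=0.2, gamma=0.1)
assert dispersed < concentrated
```

The assertion holds because the concentrated selection contributes $\sqrt{2} \approx 1.41$ to $\|v\|_{2,1}$ while the dispersed one contributes $1 + 1 = 2$, and the diversity term enters the objective with a negative sign.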
The global minimum is proved in the following theorem (see the proof in the supplementary materials):

Theorem 1. Algorithm 1 attains the global optimum of $\min_v E(w, v)$ for any given $w$ in linearithmic time.

As shown, Algorithm 1 selects samples in terms of both easiness and diversity. Specifically:

• Samples with $L(y_i, f(x_i, w)) < \lambda$ will be selected in training ($v_i = 1$) in Step 5. These samples represent the "easy" examples with small losses.

• Samples with $L(y_i, f(x_i, w)) > \lambda + \gamma$ will not be selected in training ($v_i = 0$) in Step 6. These samples represent the "complex" examples with larger losses.

• Other samples will be selected by comparing their losses to the threshold $\lambda + \frac{\gamma}{\sqrt{i} + \sqrt{i-1}}$, where $i$ is the sample's rank w.r.t. its loss value within its group. A sample with a smaller loss than the threshold will be selected in training. Since the threshold decreases considerably as the rank $i$ grows, Step 5 penalizes samples monotonously selected from the same group.

We study a tractable example that allows for clearer diagnosis in Fig. 2, where each keyframe represents a video sample of the event "Rock Climbing" from the TRECVID MED data [15], and the number below it indicates its loss. The samples are clustered into four groups based on visual similarity. A colored block on the right shows a curriculum selected by Algorithm 1.
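The per-group selection rule described in the bullets above can be sketched in a few lines. This is a minimal illustration under the paper's setting (losses already computed for a fixed w); the function and variable names (`spld_select`, `losses_by_group`) are hypothetical:

```python
import math

def spld_select(losses_by_group, lam, gamma):
    """Algorithm 1 sketch: within each group, rank samples by ascending loss and
    select the i-th ranked sample (1-indexed) iff its loss is below
    lam + gamma / (sqrt(i) + sqrt(i - 1)). Returns 0/1 weights per group."""
    v = {}
    for g, losses in losses_by_group.items():
        order = sorted(range(len(losses)), key=lambda k: losses[k])
        weights = [0] * len(losses)
        for rank, k in enumerate(order, start=1):
            threshold = lam + gamma / (math.sqrt(rank) + math.sqrt(rank - 1))
            weights[k] = 1 if losses[k] < threshold else 0
        v[g] = weights
    return v

# A group full of near-duplicate easy losses: the decaying threshold admits
# only the first-ranked one, while another group still contributes a sample.
v = spld_select({"a": [0.05, 0.12, 0.12, 0.12], "b": [0.15, 0.40]}, lam=0.0, gamma=0.2)
print(v)  # -> {'a': [1, 0, 0, 0], 'b': [1, 0]}
```

The shrinking threshold is what discourages monotonously picking from one group: the second pick within a group must beat a much stricter bar than the first pick in a fresh group.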
When γ = 0,

Algorithm 1: Algorithm for solving $\min_v E(w, v; \lambda, \gamma)$.

input : Input dataset $D$, groups $X^{(1)}, \cdots, X^{(b)}$, $w$, λ, γ
output: The global solution $v = (v^{(1)}, \cdots, v^{(b)})$ of $\min_v E(w, v; \lambda, \gamma)$.

1  for $j = 1$ to $b$ do  // for each group
2    Sort the samples in $X^{(j)}$ as $(x_1^{(j)}, \cdots, x_{n_j}^{(j)})$ in ascending order of their loss values $L$;
3    Accordingly, denote the labels and weights of $X^{(j)}$ as $(y_1^{(j)}, \cdots, y_{n_j}^{(j)})$ and $(v_1^{(j)}, \cdots, v_{n_j}^{(j)})$;
4    for $i = 1$ to $n_j$ do  // easy samples first
5      if $L(y_i^{(j)}, f(x_i^{(j)}, w)) < \lambda + \frac{\gamma}{\sqrt{i} + \sqrt{i-1}}$ then $v_i^{(j)} = 1$;  // select this sample
6      else $v_i^{(j)} = 0$;  // do not select this sample
7    end
8  end
9  return $v$

[Figure 2: An example of samples selected by Algorithm 1 on the "Rock Climbing" event. Keyframes with their loss values are clustered into four groups: "Outdoor bouldering", "Artificial wall climbing", "Snow mountain climbing", and "Bear climbing a rock". A colored block denotes a curriculum with given λ and γ, and the bold (red) box indicates the easy samples selected by Algorithm 1. Panels: (a) curriculum a, b, c, d; (b) curriculum a, j, g, b; (c) curriculum a, j, g, n.]

as shown in Fig. 2(a), SPLD, which is then identical to SPL, selects only the easy samples (with the smallest losses) from a single cluster. Its curriculum thus includes duplicate samples such as b, c, d with the same loss value. When λ ≠ 0 and γ ≠ 0 in Fig.
2(b), SPLD balances easiness and diversity, and produces a reasonable and diverse curriculum: a, j, g, b. Note that even though there exist three duplicate samples b, c, d, SPLD selects only one of them due to the decreasing threshold in Step 5 of Algorithm 1. Likewise, samples e and j share the same loss, but only j is selected, as it does more to increase diversity. In an extreme case where λ = 0 and γ ≠ 0, as illustrated in Fig. 2(c), SPLD selects only diverse samples, and thus may choose outliers, such as sample n, a confusable video about a bear climbing a rock. Therefore, considering both easiness and diversity is more reasonable than considering either alone. Physically, the parameters λ and γ together correspond to the "age" of the model, where λ focuses on easiness and γ stresses diversity.

As Algorithm 1 finds the optimal $v$, the alternative search strategy can be readily applied to solve Eq. (3). The details are listed in Algorithm 2. As aforementioned, Step 4 can be implemented using an existing off-the-shelf learning method. Following [6], we initialize $v$ by setting $v_i = 1$ for randomly selected samples. Following SPL [6], the self-paced parameters are increased by the factors $\mu_1, \mu_2$ ($\mu_1, \mu_2 \geq 1$) in Step 6 at the end of every iteration. In practice, it seems more robust to first sort the samples in ascending order of their losses, and then set λ and γ according to statistics collected from the ranked samples (see the discussion in the supplementary materials). According to [6], the alternating search in Algorithm 2 converges, as the objective function is monotonically decreasing and bounded from below.

5 Experiments

We present experimental results for the proposed SPLD on two tasks: event detection and action recognition.
We demonstrate that our approach significantly outperforms SPL on three challenging real-world datasets. The code is available at http://www.cs.cmu.edu/~lujiang/spld.

Algorithm 2: Algorithm of Self-Paced Learning with Diversity.

input : Input dataset $D$, self-paced parameters $\mu_1, \mu_2$
output: Model parameter $w$

1  if no prior clusters exist then cluster the training samples $X$ into $b$ groups $X^{(1)}, \cdots, X^{(b)}$;
2  Initialize $v^*$, λ, γ;  // assign the starting values
3  while not converged do
4    Update $w^* = \arg\min_w E(w, v^*; \lambda, \gamma)$;  // train a classification model
5    Update $v^* = \arg\min_v E(w^*, v; \lambda, \gamma)$ using Algorithm 1;  // select easy and diverse samples
6    $\lambda \leftarrow \mu_1 \lambda$; $\gamma \leftarrow \mu_2 \gamma$;  // update the learning pace
7  end
8  return $w = w^*$

SPLD is compared against four baseline methods: 1) RandomForest, a robust bootstrap method that trains multiple decision trees using randomly selected samples and features [16]. 2) AdaBoost, a classical ensemble approach that combines sequentially trained "base" classifiers in a weighted fashion [18]; samples misclassified by one base classifier are given greater weight when training the next classifier in the sequence. 3) BatchTrain, a standard training approach in which a model is trained on all samples simultaneously. 4) SPL, a state-of-the-art method that trains models gradually from easy to more complex samples [6]. The baselines are a mixture of well-known and state-of-the-art methods for training models on sampled data.

5.1 Multimedia Event Detection (MED)

Problem Formulation. Given a collection of videos, the goal of MED is to detect events of interest, e.g., "Birthday Party" and "Parade", solely based on the video content.
The task is very challenging due to complex scenes, camera motion, occlusions, etc. [17, 19, 8].

Dataset. The experiments are conducted on the largest collection for event detection: TRECVID MED13Test, which consists of about 32,000 Internet videos. There are a total of 3,490 videos from 20 complex events, and the rest are background videos. For each event, 10 positive examples are given to train a detector, which is tested on about 25,000 videos. The official test split released by NIST (National Institute of Standards and Technology) is used [15].

Experimental setting. A deep convolutional neural network is trained on 1.2 million ImageNet challenge images from 1,000 classes [20] to represent each video as a 1,000-dimensional vector. Algorithm 2 is used. By default, the group membership is generated by spectral clustering, and the number of groups is set to 64. Following [9, 8], LibLinear is used as the solver in Step 4 of Algorithm 2 due to its robust performance on this task. Performance is evaluated using MAP, as recommended by NIST. The parameters of all methods are tuned on the same validation set.

Table 1 lists the overall MAP comparison. To reduce the influence of initialization, we repeated the experiments of SPL and SPLD 10 times with random starting values, and report the best run and the mean (with the 95% confidence interval) of the 10 runs. The proposed SPLD outperforms all baseline methods with statistically significant differences at the p-value level of 0.05, according to the paired t-test. It is worth emphasizing that MED is very challenging [15], and a 26% relative (2.5 absolute) improvement over SPL is a notable gain. SPLD outperforms the other baselines on both the best run and the 10-run average. RandomForest and AdaBoost yield poorer performance.
This observation agrees with studies in the literature [15, 9] that SVM is more robust for event detection.

Table 1: MAP (×100) comparison with the baseline methods on MED.

Run Name         RandomForest  AdaBoost  BatchTrain  SPL       SPLD
Best Run         3.0           2.8       8.3         9.6       12.1
10 Runs Average  3.0           2.8       8.3         8.6±0.42  9.8±0.45

BatchTrain, SPL and SPLD are all performed using SVM. Regarding the best run, SPL boosts the MAP of BatchTrain by a relative 15.6% (absolute 1.3%). SPLD yields another relative 26% (absolute 2.5%) over SPL. The MAP gain suggests that optimizing the objective with the diversity term is inclined to attain a better solution. Fig. 3 plots the validation and test AP on three representative events. As illustrated, SPLD attains a better solution within fewer iterations than SPL; e.g., in Fig. 3(a) SPLD obtains the best test AP (0.14) within 6 iterations, as opposed to AP (0.12) in 11 iterations for SPL.

[Figure 3: The validation and test AP across iterations on three events: (a) E006: Birthday party, (b) E008: Flash mob gathering, (c) E023: Dog show. The top row plots the SPL results and the bottom row the proposed SPLD results. The x-axis represents the training iteration. The blue solid curve (Dev AP) denotes the AP on the validation set, the red curve marked by squares (Test AP) denotes the AP on the test set, and the green dashed curve denotes the Test AP of BatchTrain, which remains the same across iterations.]

[Figure 4: Comparison of positive samples used in each iteration by (a) SPL and (b) SPLD, on the events "E006: Birthday party" (indoor vs. outdoor parties) and "E007: Changing a vehicle tire" (car/truck vs. bicycle/scooter).]

Studies [1, 6] have shown that SPL converges fast, and this observation further suggests that SPLD may lead to even faster convergence. We hypothesize that this is because the diverse samples learned in the early iterations of SPLD tend to be more informative.
The best Test APs of both SPL and SPLD are better than that of BatchTrain, which is consistent with the observation in [5] that removing some samples can be beneficial for training a better detector. As shown, Dev AP and Test AP share a similar pattern, justifying the rationale for tuning parameters on the validation set.

Fig. 4 plots the curricula generated by SPL and SPLD in the first few iterations on two representative events. As we can see, SPL tends to select easy samples similar to what it has already learned, whereas SPLD selects samples that are both easy and diverse with respect to the model. For example, for the event "E006 Birthday Party", SPL keeps selecting indoor scenes because of the samples learned first, whereas the samples learned by SPLD are a mixture of indoor and outdoor birthday parties. Both methods leave the complex samples to the last iterations, e.g., the 10th video in "E007".

5.2 Action Recognition

Problem Formulation. The goal is to recognize human actions in videos.

Datasets. Two representative datasets are used. Hollywood2 was collected from 69 Hollywood movies [21]; it contains 1,707 videos belonging to 12 actions, split into a training set (823 videos) and a test set (884 videos). Olympic Sports consists of athletes practicing different sports collected from YouTube [22]; there are 16 sports actions in 783 clips. We use 649 clips for training and 134 for testing, as recommended in [22].

Experimental setting. The improved dense trajectory features are extracted and further represented by Fisher vectors [23, 24]. A setting similar to that in Section 5.1 is applied, except that the groups are generated by K-means (K=128).

Table 2 lists the MAP comparison on the two datasets. A similar pattern can be observed: SPLD outperforms SPL and the other baseline methods with statistically significant differences.
We then compare our MAP with the state-of-the-art MAPs in Table 3. Admittedly, this comparison may be less fair, since different methods use different features. Nevertheless, with the help of SPLD, we are able to achieve the best MAP reported so far on both datasets. Note that the MAPs in Table 3 were obtained by recent and very competitive methods for action recognition. This improvement confirms the assumption that considering diversity in learning is instrumental.

Table 2: MAP (×100) comparison with the baseline methods on Hollywood2 and Olympic Sports.

Run Name        RandomForest  AdaBoost  BatchTrain  SPL    SPLD
Hollywood2      28.20         41.14     58.16       63.72  66.65
Olympic Sports  63.32         69.25     90.61       90.83  93.11

Table 3: Comparison of SPLD to the state-of-the-art on Hollywood2 and Olympic Sports.

Hollywood2                      Olympic Sports
Vig et al. 2012 [25]    59.4%   Brendel et al. 2011 [28]  73.7%
Jiang et al. 2012 [26]  59.5%   Jiang et al. 2012 [26]    80.6%
Jain et al. 2013 [27]   62.5%   Gaidon et al. 2012 [29]   82.7%
Wang et al. 2013 [23]   64.3%   Wang et al. 2013 [23]     91.2%
SPLD                    66.7%   SPLD                      93.1%

5.3 Sensitivity Study

We conduct experiments using different numbers of groups generated by two clustering algorithms: K-means and spectral clustering. Each experiment is fully tuned for the given number of groups and clustering algorithm, and the best run is reported in Table 4. The results suggest that SPLD is relatively insensitive to the clustering method and the number of groups. We hypothesize that SPLD may not improve SPL in cases where the assumption in Section 4.1 is violated and the given groups, e.g.
random clusters, cannot reflect the latent diversity in the data.

Table 4: MAP (x100) comparison of different clustering algorithms and #clusters.

Dataset      SPL        Clustering   #Groups=32    #Groups=64    #Groups=128   #Groups=256
MED          8.6±0.42   K-means      9.16±0.31     9.20±0.36     9.25±0.32     9.03±0.28
                        Spectral     9.29±0.42     9.79±0.45     9.22±0.41     9.38±0.43
Hollywood2   63.72      K-means      66.372        66.358        66.653        66.365
                        Spectral     66.639        66.504        66.264        66.709
Olympic      90.83      K-means      91.86         92.37         93.11         92.65
                        Spectral     91.08         92.51         93.25         92.54

6 Conclusion

We advanced the frontier of self-paced learning by proposing a novel idea that considers both easiness and diversity in learning. We introduced a non-convex regularization term that favors selecting both easy and diverse samples. The proposed regularization term is general and can be applied to various problems. We proposed a linearithmic algorithm that finds the global optimum of this non-convex problem when updating the samples to be included. Using three real-world datasets, we showed that the proposed SPLD outperforms the state-of-the-art approaches.

Possible directions for future work include studying diversity for samples in mixture models, e.g. mixtures of Gaussians, in which a sample is assigned to a mixture of clusters. Another possible direction would be studying how to assign reliable starting values for SPL/SPLD.

Acknowledgments

This work was partially supported by the Intelligence Advanced Research Projects Activity (IARPA) via Department of Interior National Business Center contract number D11PC20068. Deyu Meng was partially supported by the 973 Program of China (3202013CB329404) and the NSFC project (61373114). The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon.
Disclaimer: The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, DoI/NBC, or the U.S. Government.

References

[1] Y. Bengio, J. Louradour, R. Collobert, and J. Weston. Curriculum learning. In ICML, 2009.

[2] Y. Bengio, A. Courville, and P. Vincent. Representation learning: A review and new perspectives. IEEE Trans. PAMI 35(8):1798-1828, 2013.

[3] S. Basu and J. Christensen. Teaching classification boundaries to humans. In AAAI, 2013.

[4] V. I. Spitkovsky, H. Alshawi, and D. Jurafsky. Baby steps: How “Less is More” in unsupervised dependency parsing. In NIPS, 2009.

[5] A. Lapedriza, H. Pirsiavash, Z. Bylinskii, and A. Torralba. Are all training examples equally valuable? CoRR abs/1311.6510, 2013.

[6] M. P. Kumar, B. Packer, and D. Koller. Self-paced learning for latent variable models. In NIPS, 2010.

[7] L. Jiang, D. Meng, T. Mitamura, and A. Hauptmann. Easy samples first: Self-paced reranking for zero-example multimedia search. In MM, 2014.

[8] K. Tang, V. Ramanathan, L. Fei-Fei, and D. Koller. Shifting weights: Adapting object detectors from image to video. In NIPS, 2012.

[9] Z. Lan, L. Jiang, S. I. Yu, et al. CMU-Informedia at TRECVID 2013 multimedia event detection. In TRECVID, 2013.

[10] F. Khan, X. J. Zhu, and B. Mutlu. How do humans teach: On curriculum learning and teaching dimension. In NIPS, 2011.

[11] M. P. Kumar, H. Turki, D. Preston, and D. Koller. Learning specific-class segmentation from diverse data. In ICCV, 2011.

[12] J. S. Supancic and D. Ramanan. Self-paced learning for long-term tracking. In CVPR, 2013.

[13] Y. J. Lee and K. Grauman. Learning the easy things first: Self-paced visual category discovery. In CVPR, 2011.

[14] M. Yuan and Y. Lin.
Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B 68(1):49-67, 2006.

[15] P. Over, G. Awad, M. Michel, et al. TRECVID 2013 - an overview of the goals, tasks, data, evaluation mechanisms and metrics. In TRECVID, 2013.

[16] L. Breiman. Random forests. Machine Learning 45(1):5-32, 2001.

[17] L. Jiang, A. Hauptmann, and G. Xiang. Leveraging high-level and low-level features for multimedia event detection. In MM, 2012.

[18] J. Friedman. Stochastic gradient boosting. Computational Statistics and Data Analysis 38(4):367-378, 2002.

[19] L. Jiang, T. Mitamura, S. Yu, and A. Hauptmann. Zero-example event search using multimodal pseudo relevance feedback. In ICMR, 2014.

[20] J. Deng, W. Dong, R. Socher, L. J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.

[21] M. Marszalek, I. Laptev, and C. Schmid. Actions in context. In CVPR, 2009.

[22] J. C. Niebles, C. W. Chen, and L. Fei-Fei. Modeling temporal structure of decomposable motion segments for activity classification. In ECCV, 2010.

[23] H. Wang and C. Schmid. Action recognition with improved trajectories. In ICCV, 2013.

[24] Z. Lan, X. Li, and A. Hauptmann. Temporal extension of scale pyramid and spatial pyramid matching for action recognition. arXiv preprint arXiv:1408.7071, 2014.

[25] E. Vig, M. Dorr, and D. Cox. Space-variant descriptor sampling for action recognition based on saliency and eye movements. In ECCV, 2012.

[26] Y. G. Jiang, Q. Dai, X. Xue, W. Liu, and C. W. Ngo. Trajectory-based modeling of human actions with motion reference points. In ECCV, 2012.

[27] M. Jain, H. Jégou, and P. Bouthemy. Better exploiting motion for better action recognition. In CVPR, 2013.

[28] W. Brendel and S. Todorovic. Learning spatiotemporal graphs of human activities. In ICCV, 2011.

[29] A. Gaidon, Z. Harchaoui, and C. Schmid.
Recognizing activities with cluster-trees of tracklets. In BMVC, 2012.