{"title": "Efficient Optimization for Average Precision SVM", "book": "Advances in Neural Information Processing Systems", "page_first": 2312, "page_last": 2320, "abstract": "The accuracy of information retrieval systems is often measured using average precision (AP). Given a set of positive (relevant) and negative (non-relevant) samples, the parameters of a retrieval system can be estimated using the AP-SVM framework, which minimizes a regularized convex upper bound on the empirical AP loss. However, the high computational complexity of loss-augmented inference, which is required for learning an AP-SVM, prohibits its use with large training datasets. To alleviate this deficiency, we propose three complementary approaches. The first approach guarantees an asymptotic decrease in the computational complexity of loss-augmented inference by exploiting the problem structure. The second approach takes advantage of the fact that we do not require a full ranking during loss-augmented inference. This helps us to avoid the expensive step of sorting the negative samples according to their individual scores. The third approach approximates the AP loss over all samples by the AP loss over difficult samples (for example, those that are incorrectly classified by a binary SVM), while ensuring the correct classification of the remaining samples. Using the PASCAL VOC action classification and object detection datasets, we show that our approaches provide significant speed-ups during training without degrading the test accuracy of AP-SVM.", "full_text": "Ef\ufb01cient Optimization for Average Precision SVM\n\nPritish Mohapatra\n\nIIIT Hyderabad\n\npritish.mohapatra@research.iiit.ac.in\n\nC.V. Jawahar\nIIIT Hyderabad\n\njawahar@iiit.ac.in\n\nM. Pawan Kumar\n\nEcole Centrale Paris & INRIA Saclay\n\npawan.kumar@ecp.fr\n\nAbstract\n\nThe accuracy of information retrieval systems is often measured using average\nprecision (AP). 
Given a set of positive (relevant) and negative (non-relevant) samples, the parameters of a retrieval system can be estimated using the AP-SVM framework, which minimizes a regularized convex upper bound on the empirical AP loss. However, the high computational complexity of loss-augmented inference, which is required for learning an AP-SVM, prohibits its use with large training datasets. To alleviate this deficiency, we propose three complementary approaches. The first approach guarantees an asymptotic decrease in the computational complexity of loss-augmented inference by exploiting the problem structure. The second approach takes advantage of the fact that we do not require a full ranking during loss-augmented inference. This helps us to avoid the expensive step of sorting the negative samples according to their individual scores. The third approach approximates the AP loss over all samples by the AP loss over difficult samples (for example, those that are incorrectly classified by a binary SVM), while ensuring the correct classification of the remaining samples. Using the PASCAL VOC action classification and object detection datasets, we show that our approaches provide significant speed-ups during training without degrading the test accuracy of AP-SVM.

1 Introduction

Information retrieval systems require us to rank a set of samples according to their relevance to a query. The parameters of a retrieval system can be estimated by minimizing the prediction risk on a training dataset, which consists of positive and negative samples. Here, positive samples are those that are relevant to a query, and negative samples are those that are not relevant to the query. Several risk minimization frameworks have been proposed in the literature, including structured support vector machines (SSVM) [15, 16], neural networks [14], decision forests [11] and boosting [13].
In this work, we focus on SSVMs for clarity, while noting that the methods we develop are also applicable to other learning frameworks.

The SSVM framework provides a linear prediction rule to obtain a structured output for a structured input. Specifically, the score of a putative output is the dot product of the parameters of an SSVM with the joint feature vector of the input and the output. The prediction requires us to maximize the score over all possible outputs for an input. During training, the parameters of an SSVM are estimated by minimizing a regularized convex upper bound on a user-specified loss function. The loss function measures the prediction risk, and should be chosen according to the evaluation criterion for the system. While in theory the SSVM framework can be employed in conjunction with any loss function, in practice its feasibility depends on the computational efficiency of the corresponding loss-augmented inference. In other words, given the current estimate of the parameters, it is important to be able to efficiently maximize the sum of the score and the loss function over all possible outputs.

A common measure of accuracy for information retrieval is average precision (AP), which is used in several standard challenges such as the PASCAL VOC object detection, image classification and action classification tasks [7], and the TREC Web Track corpora. The popularity of AP inspired Yue et al. [19] to propose the AP-SVM framework, which is a special case of SSVM. The input of AP-SVM is a set of samples, the output is a ranking and the loss function is one minus the AP of the ranking. In order to learn the parameters of an AP-SVM, Yue et al. [19] developed an optimal greedy algorithm for loss-augmented inference. Their algorithm consists of two stages. First, it sorts the positive samples P and the negative samples N separately in descending order of their individual scores.
The individual score of a sample is equal to the dot product of the parameters with the feature vector of the sample. Second, starting from the negative sample with the highest score, it iteratively finds the optimal interleaving rank for each of the |N| negative samples. The interleaving rank for a negative sample is the index of the highest ranked positive sample ranked below it. Finding the optimal interleaving rank requires at most O(|P|) time per iteration. The overall algorithm is described in detail in the next section. Note that, typically |N| ≫ |P|, that is, the negative samples significantly outnumber the positive samples.

While the AP-SVM has been successfully applied for ranking using high-order information in mid to large size datasets [5], many methods continue to use the simpler binary SVM framework for large datasets. Unlike AP-SVM, a binary SVM optimizes the surrogate 0-1 loss. Its main advantage is the efficiency of the corresponding loss-augmented inference algorithm, which has a complexity of O(|P| + |N|). However, this gain in training efficiency often comes at the cost of a loss in testing accuracy, which is especially significant when training with weakly supervised datasets [1].

In order to facilitate the use of AP-SVM, we present three complementary approaches to speed up its learning. Our first approach exploits an interesting structure in the problem corresponding to the computation of the rank of the j-th negative sample. Specifically, we show that when j > |P|, the rank of the j-th negative sample is obtained by maximizing a discrete unimodal function. Here, a discrete function defined over points {1,···, p} is said to be unimodal if it is non-decreasing over {1,···, k} and non-increasing over {k,···, p} for some k ∈ {1,···, p}.
Since the mode of a discrete unimodal function can be computed efficiently using binary search, this reduces the computational complexity of computing the rank of the j-th negative sample from O(|P|) to O(log(|P|)). To the best of our knowledge, ours is the first work to improve the speed of loss-augmented inference for AP-SVM by taking advantage of the special structure of the problem. Unlike [2], which proposes an efficient method for the similar framework of structured output ranking, our method optimizes the AP loss.

Our second approach relies on the fact that in many cases we do not need to explicitly compute the optimal interleaving rank for all the negative samples. Specifically, we only need to compute the interleaving rank for the set of negative samples that would have an interleaving rank of less than |P| + 1. We identify this set using a binary search over the list of negative samples. During training, after the initial few iterations the size of this set rapidly reduces, allowing us to significantly reduce the training time in practice.

Our third approach uses the intuition that the 0-1 loss and the AP loss differ only when some of the samples are difficult to classify (that is, when some positive samples can be confused with negatives and vice versa). In other words, when the 0-1 loss over the training dataset is 0, then the AP loss is also 0. Thus, instead of optimizing the AP loss over all the samples, we adopt a two-stage approximate strategy. In the first stage, we identify a subset of difficult samples (specifically, those that are incorrectly classified by a binary SVM). In the second stage, we optimize the AP loss over the subset of difficult samples, while ensuring the correct classification of the remaining easy samples.
Using the PASCAL VOC action classification and object detection datasets, we empirically demonstrate that each of our approaches greatly reduces the training time of AP-SVM while not decreasing the testing accuracy.

2 The AP-SVM Framework

We provide a brief overview of the AP-SVM framework, highlighting only those aspects that are necessary for the understanding of this paper. For a detailed description, we refer the reader to [19].

Input and Output. The input of an AP-SVM is a set of n samples, which we denote by X = {xi, i = 1,···, n}. Each sample can either belong to the positive class (that is, the sample is relevant) or the negative class (that is, the sample is not relevant). The indices for the positive and negative samples are denoted by P and N respectively. In other words, if i ∈ P and j ∈ N then xi belongs to the positive class and xj belongs to the negative class.

The desired output is a ranking matrix R of size n × n, such that (i) Rij = 1 if xi is ranked higher than xj; (ii) Rij = −1 if xi is ranked lower than xj; and (iii) Rij = 0 if xi and xj are assigned the same rank. During training, the ground-truth ranking matrix R* is defined as: (i) R*_ij = 1 and R*_ji = −1 for all i ∈ P and j ∈ N; (ii) R*_ii′ = 0 and R*_jj′ = 0 for all i, i′ ∈ P and j, j′ ∈ N.

Joint Feature Vector. For a sample xi, let ψ(xi) denote its feature vector. The joint feature vector of the input X and an output R is specified as

Ψ(X, R) = (1 / (|P||N|)) Σ_{i∈P} Σ_{j∈N} Rij (ψ(xi) − ψ(xj)).    (1)

In other words, the joint feature vector is the scaled sum of the difference between the features of all pairs of samples, where one sample is positive and the other is negative.

Parameters and Prediction.
The parameter vector of AP-SVM is denoted by w, and is of the same size as the joint feature vector. Given the parameters w, the ranking of an input X is predicted by maximizing the score, that is,

R = argmax_R w^T Ψ(X, R).    (2)

Yue et al. [19] showed that the above optimization can be performed efficiently by sorting the samples xk in descending order of their individual scores, that is, sk = w^T ψ(xk).

Parameter Estimation. Given the input X and the ground-truth ranking matrix R*, we estimate the AP-SVM parameters by optimizing a regularized upper bound on the empirical AP loss. The AP loss of an output R is defined as 1 − AP(R*, R), where AP(·,·) corresponds to the AP of the ranking R with respect to the true ranking R*. Specifically, the parameters are obtained by solving the following convex optimization problem:

min_w (1/2)||w||² + Cξ,
s.t. w^T Ψ(X, R*) − w^T Ψ(X, R) ≥ ∆(R*, R) − ξ, ∀R.    (3)

The computational complexity of solving the above problem depends on the complexity of the corresponding loss-augmented inference, that is,

R̂ = argmax_R w^T Ψ(X, R) + ∆(R*, R).    (4)

For a given set of parameters w, the above problem requires us to find the most violated ranking, that is, the ranking that maximizes the sum of the score and the AP loss. To be more precise, what we require is the joint feature vector Ψ(X, R̂) and the AP loss ∆(R*, R̂) corresponding to the most violated ranking. Yue et al. [19] provided an optimal greedy algorithm for problem (4), which is summarized in Algorithm 1. It consists of two stages. First, it sorts the positive and the negative samples separately in descending order of their scores (steps 1-2). This takes O(|P| log(|P|) + |N| log(|N|)) time.
Second, starting with the highest scoring negative sample, it iteratively finds the interleaving rank of each negative sample xj. This involves maximizing the quantity δj(i), defined in equation (5), over all i ∈ {1,···,|P|} (steps 3-7), which takes O(|P||N|) time.

3 Efficient Optimization for AP-SVM

In this section, we propose three methods to speed up the training procedure of AP-SVM. The first two methods are exact. Specifically, they reduce the time taken to perform loss-augmented inference while ensuring the computation of the same most violated ranking as Algorithm 1. The third method provides a framework for a sensible trade-off between training efficiency and test accuracy.

3.1 Efficient Search for Loss-Augmented Inference

In order to find the most violated ranking, the greedy algorithm of Yue et al. [19] iteratively computes the optimal interleaving rank optj ∈ {1,···,|P| + 1} for each negative sample xj (step 5 of Algorithm 1).

Algorithm 1 The optimal greedy algorithm for loss-augmented inference for training AP-SVM.
input Training samples X containing positive samples P and negative samples N, parameters w.
1: Sort the positive samples in descending order of the scores s^p_i = w^T ψ(xi), i ∈ {1,...,|P|}.
2: Sort the negative samples in descending order of the scores s^n_j = w^T ψ(xj), j ∈ {1,...,|N|}.
3: Set j = 1.
4: repeat
5:   Compute the interleaving rank optj = argmax_{i∈{1,···,|P|}} δj(i), where

     δj(i) = Σ_{k=i}^{|P|} [ (1/|P|) (j/(j + k) − (j − 1)/(j + k − 1)) − 2(s^p_k − s^n_j)/(|P||N|) ].    (5)

     The j-th negative sample is ranked between the (optj − 1)-th and the optj-th positive sample.
6:   Set j ← j + 1.
7: until j > |N|.
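For concreteness, the greedy procedure of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration with our own naming, not the authors' implementation; we also include the empty-sum option i = |P| + 1 (which evaluates to 0 and corresponds to ranking the negative sample below all positives), since that is how an optimal interleaving rank of |P| + 1 arises later in Section 3.2:

```python
# Minimal sketch of Algorithm 1 (naming is ours). Inputs are the individual
# scores s_k = w^T psi(x_k) of the positive and negative samples.

def delta(j, i, pos, neg):
    """delta_j(i) of equation (5); pos and neg are sorted in descending order."""
    P, N = len(pos), len(neg)
    total = 0.0
    for k in range(i, P + 1):  # k = i, ..., |P| (1-based indices)
        total += (j / (j + k) - (j - 1) / (j + k - 1)) / P
        total -= 2.0 * (pos[k - 1] - neg[j - 1]) / (P * N)
    return total

def greedy_loss_augmented_inference(pos_scores, neg_scores):
    """Optimal interleaving rank opt_j for every negative sample."""
    pos = sorted(pos_scores, reverse=True)           # step 1
    neg = sorted(neg_scores, reverse=True)           # step 2
    P = len(pos)
    opt = []
    for j in range(1, len(neg) + 1):                 # steps 3-7
        # linear search over i; i = P + 1 is the empty sum (value 0)
        opt.append(max(range(1, P + 2), key=lambda i: delta(j, i, pos, neg)))
    return opt
```

For example, with positive scores [2.0, 0.5] and a single negative score 1.0, the negative sample receives interleaving rank 2, that is, it is placed between the two positives; a negative with a very low score receives rank |P| + 1 = 3.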
The interleaving rank optj specifies that the negative sample xj must be ranked between the (optj − 1)-th and the optj-th positive sample. The computation of the optimal interleaving rank for a particular negative sample requires us to maximize the discrete function δj(i) over the domain i ∈ {1,···,|P|}. Yue et al. [19] use a simple linear algorithm for this step, which takes O(|P|) time. In contrast, we propose a more efficient algorithm to maximize δj(·), which exploits the special structure of this discrete function.

Before we describe our efficient algorithm in detail, we require the definition of a unimodal function. A discrete function f : {1,···, p} → R is said to be unimodal if and only if there exists a k ∈ {1,···, p} such that

f(i) ≤ f(i + 1), ∀i ∈ {1,···, k − 1},
f(i − 1) ≥ f(i), ∀i ∈ {k + 1,···, p}.    (6)

In other words, a unimodal discrete function is monotonically non-decreasing in the interval [1, k] and monotonically non-increasing in the interval [k, p]. The maximization of a unimodal discrete function over its domain {1,···, p} simply requires us to find the index k that satisfies the above properties. The maximization can be performed efficiently, in O(log(p)) time, using binary search. We are now ready to state the main result that allows us to compute the optimal interleaving rank of a negative sample efficiently.

Proposition 1.
The discrete function δj(i), defined in equation (5), is unimodal in the domain {1,···, p}, where p = min{|P|, j}.

The proof of the above proposition is provided in Appendix A (supplementary material).

Algorithm 2 Efficient search for the optimal interleaving rank of a negative sample.
input {δj(i), i = 1,···,|P|}.
1: p = min{|P|, j}.
2: Compute an interleaving rank i1 as
   i1 = argmax_{i∈{1,···,p}} δj(i).    (7)
3: Compute an interleaving rank i2 as
   i2 = argmax_{i∈{p+1,···,|P|}} δj(i).    (8)
4: Compute the optimal interleaving rank optj as
   optj = i1 if δj(i1) ≥ δj(i2), and optj = i2 otherwise.    (9)

Using the above proposition, the discrete function δj(i) can be optimized over the domain {1,···,|P|} efficiently as described in Algorithm 2. Briefly, our efficient search algorithm finds an interleaving rank i1 over the domain {1,···, p}, where p is set to min{|P|, j} in order to ensure that the function δj(·) is unimodal (step 2 of Algorithm 2). Since i1 can be computed using binary search, the computational complexity of this step is O(log(p)). Furthermore, we find an interleaving rank i2 over the domain {p + 1,···,|P|} (step 3 of Algorithm 2). Since i2 needs to be computed using linear search, the computational complexity of this step is O(|P| − p) when p < |P| and 0 otherwise. The optimal interleaving rank optj of the negative sample xj can then be computed by comparing the values of δj(i1) and δj(i2) (step 4 of Algorithm 2).

Note that, in a typical training dataset, the negative samples significantly outnumber the positive samples, that is, |N| ≫ |P|.
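The core of Algorithm 2 (step 2), locating the mode of a discrete unimodal function by binary search, can be sketched as follows. This is a minimal illustration with our own naming; it assumes strict unimodality (no plateaus), which holds for generic, tie-free scores:

```python
def argmax_unimodal(f, p):
    """Return the index in {1, ..., p} maximizing a strictly unimodal
    discrete function f, using O(log p) evaluations.

    If f(m) < f(m + 1), the mode lies strictly to the right of m;
    otherwise it lies at m or to its left.
    """
    lo, hi = 1, p
    while lo < hi:
        mid = (lo + hi) // 2
        if f(mid) < f(mid + 1):
            lo = mid + 1   # still on the increasing part
        else:
            hi = mid       # the mode is at mid or earlier
    return lo
```

For instance, `argmax_unimodal(lambda i: -(i - 4)**2, 10)` returns 4. In Algorithm 2, this replaces the linear scan over {1,···, p} in step 2, while step 3 still uses a linear search over {p + 1,···,|P|}.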
For all the negative samples xj where j ≥ |P|, p will be equal to |P|. Hence, the maximization of δj(·) can be performed efficiently over the entire domain {1,···,|P|} using binary search in O(log(|P|)) time, as opposed to the O(|P|) time suggested in [19].

3.2 Selective Ranking for Loss-Augmented Inference

While the efficient search algorithm described in the previous subsection allows us to find the optimal interleaving rank for a particular negative sample, the overall loss-augmented inference would still remain computationally inefficient when the number of negative samples is large (as is typically the case). This is due to the following two reasons. First, loss-augmented inference spends a considerable amount of time sorting the negative samples according to their individual scores (step 2 of Algorithm 1). Second, if we were to apply our efficient search algorithm to every negative sample, the total computational complexity of the second stage of loss-augmented inference (steps 3-7 of Algorithm 1) would still be O(|P|² + (|N| − |P|) log(|P|)).

In order to overcome the above computational issues, we exploit two key properties of loss-augmented inference in AP-SVM. First, if a negative sample xj has the optimal interleaving rank optj = |P| + 1, then all the negative samples that have a lower score than xj would also have the same optimal interleaving rank (that is, optk = optj = |P| + 1 for all k > j). This property follows directly from the analysis of Yue et al. [19], who showed that, for k < j, optk ≥ optj, and that for any negative sample xj, optj ∈ [1,|P| + 1]. We refer the reader to [19] for a detailed proof. Second, we note that the desired output of loss-augmented inference is not the most violated ranking R̂, but the joint feature vector Ψ(X, R̂) and the AP loss ∆(R*, R̂).
From the definition of the joint feature vector and the AP loss, it follows that they do not depend on the relative ranking of the negative samples that share the same optimal interleaving rank. Specifically, both the joint feature vector and the AP loss only depend on the number of negatives that are ranked higher and lower than each positive sample.

The above two observations suggest the following alternate strategy to Algorithm 1. Instead of explicitly computing the optimal interleaving rank for each negative sample (which can be computationally expensive), we compute it only for negative samples that are expected to have an optimal interleaving rank of less than |P| + 1. Algorithm 3 outlines the procedure we propose in detail. We first find the score ŝ such that every negative sample xj with score s^n_j < ŝ has optj = |P| + 1. We do a binary search over the list of scores of negative samples to find ŝ (step 2 of Algorithm 3). We do not need to sort the scores of all the negative samples, as we use the quickselect algorithm to find the k-th highest score wherever required. If the output of the loss-augmented inference is such that a large number of negative samples have an optimal interleaving rank of |P| + 1, then this alternate strategy results in a significant speed-up during training. In our experiments, we found that in later iterations of the optimization this is indeed the case in practice. Figure 1 shows how the number of negative samples with optimal interleaving rank equal to |P| + 1 rapidly increases after

Figure 1: A row corresponds to the interleaving ranks of the negative samples after a training iteration. Here, there are 4703 negative samples, and 131 training iterations. The interleaving ranks are represented using a heat map where the deepest red represents an interleaving rank of |P| + 1.
(The figure is best viewed in colour.)

a few training iterations for a typical experiment. A large number of negative samples have an optimal interleaving rank equal to |P| + 1, while the negative samples that take other values of optimal interleaving rank decrease considerably in number.

Algorithm 3 The selective ranking algorithm for loss-augmented inference in AP-SVM.
input S_x, S_x̄, |P|, |N|
1: Sort the positive samples in descending order of their scores S_x.
2: Do binary search over S_x̄ to find ŝ.
3: Set Nl = {j ∈ N | s^n_j ≥ ŝ}.
4: Sort Nl in descending order of the scores.
5: for all j ∈ Nl do
6:   Compute optj using Algorithm 2.
7: end for
8: Set Nr = N − Nl.
9: for all j ∈ Nr do
10:  Set optj = |P| + 1.
11: end for
output optj, ∀j ∈ N

It is worth noting that, even though we take advantage of the fact that a long sequence of negative samples at the end of the list takes the same optimal interleaving rank, such sequences also occur at other locations throughout the list. This can be leveraged for further speed-up by computing the interleaving rank for only the boundary samples of such sequences and setting all the intermediate samples to the same interleaving rank as the boundary samples. We can use a method similar to the one presented in this section to search for such sequences, using the quickselect algorithm to compute the interleaving rank for any particular negative sample on the list.

3.3 Efficient Approximation of AP-SVM

The previous two subsections provide exact algorithms for loss-augmented inference that reduce the time required for training an AP-SVM. However, despite these improvements, AP-SVM might be slower to learn compared to simpler frameworks such as the binary SVM, which optimizes the surrogate 0-1 loss.
The disadvantage of using the binary SVM is that, in general, the 0-1 loss is a poor approximation for the AP loss. However, the quality of the approximation is not uniformly poor for all samples, but depends heavily on their separability. Specifically, when the 0-1 loss of a set of samples is 0 (that is, they are linearly separable by a binary SVM), their AP loss is also 0. This observation inspires us to approximate the AP loss over the entire set of training samples using the AP loss over the subset of difficult samples. In this work, we define the subset of difficult samples as those that are incorrectly classified by a simple binary SVM.

Formally, given the complete input X and the ground-truth ranking matrix R*, we represent individual samples as xi and their class as yi. In other words, yi = 1 if i ∈ P and yi = −1 if i ∈ N. In order to approximate the AP-SVM, we adopt a two-stage strategy. In the first stage, we learn a binary SVM by minimizing the regularized convex upper bound on the 0-1 loss over the entire training set. Since the loss-augmented inference for the 0-1 loss is very fast, the parameters w0 of the binary SVM can be estimated efficiently. We use the binary SVM to define the set of easy samples as Xe = {xi : yi w0^T φ(xi) ≥ 1}. In other words, a positive sample is easy if it is assigned a score that is greater than 1 by the binary SVM. Similarly, a negative sample is easy if it is assigned a score that is less than −1 by the binary SVM. The remaining difficult samples are denoted by Xd = X − Xe and the corresponding ground-truth ranking matrix by R*_d. In the second stage, we approximate the AP loss over the entire set of samples X by the AP loss over the difficult samples Xd, while ensuring that the samples Xe are correctly classified.
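The first stage of this approximation, partitioning the training set by the margin of the learned binary SVM, can be sketched as follows. This is a minimal illustration with our own naming; `scores` stands for the values w0^T φ(xi), and ranking the easy samples by margin yi · w0^T φ(xi) is our reading of the top-k% retention discussed in the text:

```python
def split_easy_difficult(scores, labels, keep_percent=100.0):
    """Partition sample indices into easy and difficult sets.

    A sample is easy when the binary SVM classifies it with margin at
    least 1 (y_i * score_i >= 1); all other samples are difficult.
    Optionally, only the top keep_percent% of the easy samples (ranked
    by margin, most confident first) are retained as easy; the rest are
    pushed back into the difficult set.
    """
    n = len(scores)
    easy = [i for i in range(n) if labels[i] * scores[i] >= 1.0]
    easy.sort(key=lambda i: labels[i] * scores[i], reverse=True)
    easy = easy[: int(len(easy) * keep_percent / 100.0)]
    difficult = [i for i in range(n) if i not in set(easy)]
    return easy, difficult
```

The AP loss is then optimized over the difficult indices only, with the easy samples entering the problem through the margin constraints of equation (10).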
In order to accomplish this, we solve the following optimization problem:

min_w (1/2)||w||² + Cξ
s.t. w^T Ψ(Xd, R*_d) − w^T Ψ(Xd, Rd) ≥ ∆(R*_d, Rd) − ξ, ∀Rd,
     yi (w^T φ(xi)) > 1, ∀xi ∈ Xe.    (10)

In practice, we can choose to retain only the top k% of Xe, ranked in descending order of their scores, and push the remaining samples into the difficult set Xd. This gives the AP-SVM more flexibility to update the parameters at the cost of some additional computation.

4 Experiments

We demonstrate the efficacy of our methods, described in the previous section, on the challenging problems of action classification and object detection.

4.1 Action Classification

Dataset. We use the PASCAL VOC 2011 [7] action classification dataset for our experiments. This dataset consists of 4846 images, which include 10 different action classes. The dataset is divided into two parts: 3347 'trainval' person bounding boxes and 3363 'test' person bounding boxes. We use the 'trainval' bounding boxes for training since their ground-truth action classes are known. We evaluate the accuracy of the different instances of SSVM on the 'test' bounding boxes using the PASCAL evaluation server.

Features. We use the standard poselet [12] activation features to define the sample feature for each person bounding box. The feature vector consists of 2400 action poselet activations and 4 object detection scores. We refer the reader to [12] for details regarding the feature vector.

Methods. We present results on five different methods. First, the standard binary SVM, which optimizes the 0-1 loss. Second, the standard AP-SVM, which uses the inefficient loss-augmented inference described in Algorithm 1.
Third, AP-SVM-SEARCH, which uses efficient search to compute the optimal interleaving rank for each negative sample using Algorithm 2. Fourth, AP-SVM-SELECT, which uses the selective ranking strategy outlined in Algorithm 3. Fifth, AP-SVM-APPX, which employs the approximate AP-SVM framework described in subsection 3.3. Note that AP-SVM, AP-SVM-SEARCH and AP-SVM-SELECT are guaranteed to provide the same set of parameters, since both efficient search and selective ranking are exact methods. The hyperparameters of all five methods are fixed using 5-fold cross-validation on the 'trainval' set.

Results. Table 1 shows the AP for the rankings obtained by the five methods on the 'test' set. Note that AP-SVM (and therefore, AP-SVM-SEARCH and AP-SVM-SELECT) consistently outperforms binary SVM by optimizing a more appropriate loss function during training. The approximate AP-SVM-APPX provides comparable results to the exact AP-SVM formulations by optimizing the AP loss over difficult samples, while ensuring the correct classification of easy samples. The time required to compute the most violated rankings for each of the five methods is shown in Table 2. Note that all three methods described in this paper result in a substantial improvement in training time. The overall time required for loss-augmented inference is reduced by a factor of 5-10 compared to the original AP-SVM approach. It can also be observed that, though each loss-augmented inference step for binary SVM is significantly more efficient than for AP-SVM (Table 3), in some cases binary SVM requires more cutting plane iterations to converge.
As a result, training binary SVM is sometimes slower than training AP-SVM with our proposed speed-ups.

Object class         Binary SVM   AP-SVM   AP-SVM-APPX (k=25%)   (k=50%)   (k=75%)
Jumping              52.580       55.230   54.660                54.570    55.640
Phoning              32.090       32.630   29.610                31.380    30.660
Playing instrument   35.210       41.180   37.260                40.510    38.650
Reading              27.410       26.600   24.980                27.100    25.530
Riding bike          72.240       81.060   78.660                80.660    79.950
Running              73.090       76.850   75.720                72.550    74.670
Taking photo         21.880       25.980   22.860                25.360    23.680
Using computer       30.620       32.050   32.840                32.460    32.810
Walking              54.400       57.090   55.790                57.380    57.430
Riding horse         79.820       83.290   83.650                82.390    83.560

Table 1: Test AP for the different action classes of the PASCAL VOC 2011 action dataset. For AP-SVM-APPX, we report test results for 3 different values of k, which is the percentage of samples that are included in the easy set among all the samples that the binary SVM classified with margin > 1.

Binary SVM   AP-SVM   AP-SVM-SEARCH   AP-SVM-SELECT   AP-SVM-APPX (k=50)   ALL
0.1068       0.5660   0.0671          0.0404          0.2341               0.0251

Table 2: Computation time (in seconds) for computing the most violated ranking when using the different methods. The reported time is averaged over the training for all the action classes.

Binary SVM   AP-SVM   AP-SVM-SEARCH   AP-SVM-SELECT   AP-SVM-APPX (k=50)   ALL
0.637        13.192   1.565           0.942           8.217                0.689

Table 3: Computation time (in milliseconds) for computing the most violated ranking per iteration when using the different methods. The reported time is averaged over all training iterations and over all the action classes.

4.2 Object Detection

Dataset. We use the PASCAL VOC 2007 [6] object detection dataset, which consists of a total of 9963 images. The dataset is divided into a 'trainval' set of 5011 images and a 'test' set of 4952 images.
All the images are labelled to indicate the presence or absence of instances of 20 different object categories. We are also provided with tight bounding boxes around the object instances, which we ignore during training and testing. Instead, we treat the location of the objects as a latent variable. In order to reduce the latent variable space, we use the selective-search algorithm [17] in its fast mode, which generates an average of 2000 candidate windows per image.
Features. For each of the candidate windows, we use a feature representation extracted from a trained Convolutional Neural Network (CNN). Specifically, we pass the image as input to the CNN and use the activation vector of the penultimate layer of the CNN as the feature vector. Inspired by the work of Girshick et al. [9], we use a CNN trained on the ImageNet dataset [4], rescaling each candidate window to a fixed size of 224 × 224. The length of the resulting feature vector is 4096.
Methods. We train latent AP-SVMs [1] as object detectors for the 20 object categories. In our experiments, we determine the values of the hyperparameters using 5-fold cross-validation. During testing, we evaluate each candidate window generated by selective search, and use non-maxima suppression to prune highly overlapping detections.
Results. This experiment places high computational demands due to the size of the dataset (5011 'trainval' images), as well as the size of the latent space (2000 candidate windows per image). We compare the computational efficiency of the loss-augmented inference algorithm proposed in [19] and the exact methods proposed by us.
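The non-maxima suppression step used at test time can be sketched as follows. This is an illustrative greedy IoU-threshold formulation, a common choice; the paper does not specify the exact variant or threshold it uses:

```python
# Greedy non-maxima suppression over candidate windows.
# Illustrative sketch; the threshold value is an assumption, not the paper's.
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, threshold=0.5):
    """Keep the highest-scoring boxes, dropping any box whose IoU with an
    already-kept box exceeds the threshold. Returns indices of kept boxes."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= threshold for j in keep):
            keep.append(i)
    return keep

# Two heavily overlapping windows and one distant one:
# the lower-scoring overlap is pruned.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # -> [0, 2]
```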
The total time taken for loss-augmented inference during training, averaged over all the 20 classes, is 0.3302 sec for our exact methods (SEARCH+SELECT), which is significantly better than the 6.237 sec taken by the algorithm used in [19].
5 Discussion
We proposed three complementary approaches to improve the efficiency of learning AP-SVM. The first two approaches exploit the problem structure to speed up the computation of the most violated ranking using exact loss-augmented inference. The third approach provides an accurate approximation of AP-SVM, which facilitates a trade-off between test accuracy and training time.
As mentioned in the introduction, our approaches can also be used in conjunction with other learning frameworks, such as the popular deep convolutional neural networks. A combination of the methods proposed in this paper and the speed-ups proposed in [10] may prove to be effective in such a framework. The efficacy of optimizing AP efficiently using other frameworks needs to be empirically evaluated. Another computational bottleneck of all SSVM frameworks is the computation of the joint feature vector. An interesting direction of future research would be to combine our approaches with those of sparse feature coding [3, 8, 18] to further improve the speed of AP-SVM learning.
6 Acknowledgement
This work is partially funded by the European Research Council under the European Community's Seventh Framework Programme (FP7/2007-2013)/ERC Grant agreement number 259112. Pritish is supported by the TCS Research Scholar Program.

References
[1] A. Behl, C. V. Jawahar, and M. P. Kumar. Optimizing average precision using weakly supervised data. In CVPR, 2014.
[2] M. Blaschko, A. Mittal, and E. Rahtu. An O(n log n) cutting plane algorithm for structured output ranking. In GCPR, 2014.
[3] X. Boix, G. Roig, C. Leistner, and L. Van Gool. Nested sparse quantization for efficient feature coding.
In ECCV, 2012.
[4] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR, 2009.
[5] P. Dokania, A. Behl, C. V. Jawahar, and M. P. Kumar. Learning to rank using high-order information. In ECCV, 2014.
[6] M. Everingham, L. Van Gool, C. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2007 (VOC2007) Results. http://www.pascal-network.org/challenges/VOC/voc2007/workshop/index.html.
[7] M. Everingham, L. Van Gool, C. Williams, J. Winn, and A. Zisserman. The PASCAL visual object classes (VOC) challenge. IJCV, 2010.
[8] T. Ge, Q. Ke, and J. Sun. Sparse-coded features for image retrieval. In BMVC, 2013.
[9] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.
[10] M. Jaderberg, A. Vedaldi, and A. Zisserman. Speeding up convolutional neural networks with low rank expansions. In BMVC, 2014.
[11] D. Kim. Minimizing structural risk on decision tree classification. In Multi-Objective Machine Learning, Springer, 2006.
[12] S. Maji, L. Bourdev, and J. Malik. Action recognition from a distributed representation of pose and appearance. In CVPR, 2011.
[13] C. Shen, H. Li, and N. Barnes. Totally corrective boosting for regularized risk minimization. arXiv preprint arXiv:1008.5188, 2010.
[14] C. Szegedy, A. Toshev, and D. Erhan. Deep neural networks for object detection. In NIPS, 2013.
[15] B. Taskar, C. Guestrin, and D. Koller. Max-margin Markov networks. In NIPS, 2003.
[16] I. Tsochantaridis, T. Hofmann, Y. Altun, and T. Joachims. Support vector machine learning for interdependent and structured output spaces. In ICML, 2004.
[17] J. Uijlings, K. van de Sande, T. Gevers, and A. Smeulders. Selective search for object recognition. IJCV, 2013.
[18] J. Yang, K. Yu, and T. Huang. Efficient highly over-complete sparse coding using a mixture model. In ECCV, 2010.
[19] Y. Yue, T. Finley, F. Radlinski, and T. Joachims. A support vector method for optimizing average precision. In SIGIR, 2007.