{"title": "Learning from Complementary Labels", "book": "Advances in Neural Information Processing Systems", "page_first": 5639, "page_last": 5649, "abstract": "Collecting labeled data is costly and thus a critical bottleneck in real-world classification tasks. To mitigate this problem, we propose a novel setting, namely learning from complementary labels for multi-class classification. A complementary label specifies a class that a pattern does not belong to. Collecting complementary labels would be less laborious than collecting ordinary labels, since users do not have to carefully choose the correct class from a long list of candidate classes. However, complementary labels are less informative than ordinary labels and thus a suitable approach is needed to better learn from them. In this paper, we show that an unbiased estimator to the classification risk can be obtained only from complementarily labeled data, if a loss function satisfies a particular symmetric condition. We derive estimation error bounds for the proposed method and prove that the optimal parametric convergence rate is achieved. We further show that learning from complementary labels can be easily combined with learning from ordinary labels (i.e., ordinary supervised learning), providing a highly practical implementation of the proposed method. Finally, we experimentally demonstrate the usefulness of the proposed methods.", "full_text": "Learning from Complementary Labels\n\nTakashi Ishida1,2,3 Gang Niu2,3 Weihua Hu2,3 Masashi Sugiyama3,2\n\n1 Sumitomo Mitsui Asset Management, Tokyo, Japan\n\n2 The University of Tokyo, Tokyo, Japan\n\n{ishida@ms., gang@ms., hu@ms., sugi@}k.u-tokyo.ac.jp\n\n3 RIKEN, Tokyo, Japan\n\nAbstract\n\nCollecting labeled data is costly and thus a critical bottleneck in real-world classi-\n\ufb01cation tasks. To mitigate this problem, we propose a novel setting, namely learn-\ning from complementary labels for multi-class classi\ufb01cation. 
A complementary\nlabel speci\ufb01es a class that a pattern does not belong to. Collecting complementary\nlabels would be less laborious than collecting ordinary labels, since users do not\nhave to carefully choose the correct class from a long list of candidate classes.\nHowever, complementary labels are less informative than ordinary labels and thus\na suitable approach is needed to better learn from them. In this paper, we show\nthat an unbiased estimator to the classi\ufb01cation risk can be obtained only from\ncomplementarily labeled data, if a loss function satis\ufb01es a particular symmetric\ncondition. We derive estimation error bounds for the proposed method and prove\nthat the optimal parametric convergence rate is achieved. We further show that\nlearning from complementary labels can be easily combined with learning from\nordinary labels (i.e., ordinary supervised learning), providing a highly practical\nimplementation of the proposed method. Finally, we experimentally demonstrate\nthe usefulness of the proposed methods.\n\n1\n\nIntroduction\n\nIn ordinary supervised classi\ufb01cation problems, each training pattern is equipped with a label which\nspeci\ufb01es the class the pattern belongs to. Although supervised classi\ufb01er training is effective, labeling\ntraining patterns is often expensive and takes a lot of time. For this reason, learning from less\nexpensive data has been extensively studied in the last decades, including but not limited to, semi-\nsupervised learning [4, 38, 37, 13, 1, 21, 27, 20, 35, 16, 18], learning from pairwise/triple-wise\nconstraints [34, 12, 6, 33, 25], and positive-unlabeled learning [7, 11, 32, 2, 8, 9, 26, 17].\nIn this paper, we consider another weakly supervised classi\ufb01cation scenario with less expensive\ndata: instead of any ordinary class label, only a complementary label which speci\ufb01es a class that\nthe pattern does not belong to is available. 
If the number of classes is large, choosing the correct class label from many candidate classes is laborious, while choosing one of the incorrect class labels would be much easier and thus less costly. In the binary classification setup, learning with complementary labels is equivalent to learning with ordinary labels, because complementary label 1 (i.e., not class 1) immediately means ordinary label 2. On the other hand, in $K$-class classification for $K > 2$, complementary labels are less informative than ordinary labels because complementary label 1 only means one of the ordinary labels $2, 3, \\ldots, K$.\nThe complementary classification problem may be solved by the method of learning from partial labels [5], where multiple candidate class labels are provided to each training pattern: complementary label $\\bar{y}$ can be regarded as an extreme case of partial labels given to all $K - 1$ classes other than class $\\bar{y}$. Another possibility to solve the complementary classification problem is to consider a multi-label setup [3], where each pattern can belong to multiple classes: complementary label $\\bar{y}$ is translated into a negative label for class $\\bar{y}$ and positive labels for the other $K - 1$ classes.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\nOur contribution in this paper is to give a direct risk minimization framework for the complementary classification problem. More specifically, we consider a complementary loss that incurs a large loss if a predicted complementary label is not correct. We then show that the classification risk can be empirically estimated in an unbiased fashion if the complementary loss satisfies a certain symmetric condition; the sigmoid loss and the ramp loss (see Figure 1) are shown to satisfy this symmetric condition. 
Theoretically, we establish estimation error bounds for the proposed method, showing that learning from complementary labels is also consistent; the order of these bounds achieves the optimal parametric rate $\\mathcal{O}_p(1/\\sqrt{n})$, where $\\mathcal{O}_p$ denotes the order in probability and $n$ is the number of complementarily labeled data.\nWe further show that our proposed complementary classification can be easily combined with ordinary classification, providing a highly data-efficient classification method. This combination method is particularly useful, e.g., when labels are collected through crowdsourcing [14]: Usually, crowdworkers are asked to give a label to a pattern by selecting the correct class from the list of all candidate classes. This process is highly time-consuming when the number of classes is large. We may instead choose one of the classes randomly and ask crowdworkers whether a pattern belongs to the chosen class or not. Such a yes/no question can be much easier and quicker to answer than selecting the correct class out of a long list of candidates. Then the pattern is treated as ordinarily labeled if the answer is yes; otherwise, the pattern is regarded as complementarily labeled.\nFinally, we demonstrate the practical usefulness of the proposed methods through experiments.\n\n2 Review of ordinary multi-class classification\n\nSuppose that $d$-dimensional pattern $x \\in \\mathbb{R}^d$ and its class label $y \\in \\{1, \\ldots, K\\}$ are sampled independently from an unknown probability distribution with density $p(x, y)$. The goal of ordinary multi-class classification is to learn a classifier $f(x) : \\mathbb{R}^d \\to \\{1, \\ldots, K\\}$ that minimizes the classification risk with multi-class loss $\\mathcal{L}(f(x), y)$:\n\n$$R(f) = \\mathbb{E}_{p(x,y)}\\big[\\mathcal{L}(f(x), y)\\big], \\tag{1}$$\n\nwhere $\\mathbb{E}$ denotes the expectation. Typically, a classifier $f(x)$ is assumed to take the following form:\n\n$$f(x) = \\arg\\max_{y \\in \\{1, \\ldots, K\\}} g_y(x), \\tag{2}$$\n\nwhere $g_y(x) : \\mathbb{R}^d \\to \\mathbb{R}$ is a binary classifier for class $y$ versus the rest. Then, together with a binary loss $\\ell(z) : \\mathbb{R} \\to \\mathbb{R}$ that incurs a large loss for small $z$, the one-versus-all (OVA) loss$^1$ or the pairwise-comparison (PC) loss defined as follows are used as the multi-class loss [36]:\n\n$$\\mathcal{L}_{\\mathrm{OVA}}(f(x), y) = \\ell\\big(g_y(x)\\big) + \\frac{1}{K-1} \\sum_{y' \\neq y} \\ell\\big(-g_{y'}(x)\\big), \\tag{3}$$\n\n$$\\mathcal{L}_{\\mathrm{PC}}(f(x), y) = \\sum_{y' \\neq y} \\ell\\big(g_y(x) - g_{y'}(x)\\big). \\tag{4}$$\n\nFinally, the expectation over unknown $p(x, y)$ in Eq.(1) is empirically approximated using training samples to give a practical classification formulation.\n\n$^1$ We normalize the \u201crest\u201d loss by $K - 1$ to be consistent with the discussion in the following sections.\n\n3 Classification from complementary labels\n\nIn this section, we formulate the problem of complementary classification and propose a risk minimization framework.\nWe consider the situation where, instead of ordinary class label $y$, we are given only complementary label $\\bar{y}$ which specifies a class that pattern $x$ does not belong to. Our goal is to still learn a classifier that minimizes the classification risk (1), but only from complementarily labeled training samples $\\{(x_i, \\bar{y}_i)\\}_{i=1}^{n}$. We assume that $\\{(x_i, \\bar{y}_i)\\}_{i=1}^{n}$ are drawn independently from an unknown probability distribution with density:$^2$\n\n$$\\bar{p}(x, \\bar{y}) = \\frac{1}{K-1} \\sum_{y \\neq \\bar{y}} p(x, y). \\tag{5}$$\n\nLet us consider a complementary loss $\\overline{\\mathcal{L}}(f(x), \\bar{y})$ for a complementarily labeled sample $(x, \\bar{y})$. Then we have the following theorem, which allows unbiased estimation of the classification risk from complementarily labeled samples:\n\nTheorem 1. The classification risk (1) can be expressed as\n\n$$R(f) = (K-1)\\,\\mathbb{E}_{\\bar{p}(x,\\bar{y})}\\big[\\overline{\\mathcal{L}}(f(x), \\bar{y})\\big] - M_1 + M_2, \\tag{6}$$\n\nif there exist constants $M_1, M_2 \\geq 0$ such that for all $x$ and $y$, the complementary loss satisfies\n\n$$\\sum_{\\bar{y}=1}^{K} \\overline{\\mathcal{L}}(f(x), \\bar{y}) = M_1 \\quad \\text{and} \\quad \\mathcal{L}(f(x), y) + \\overline{\\mathcal{L}}(f(x), y) = M_2. \\tag{7}$$\n\nProof. 
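According to (5),\n\n$$\\begin{aligned} (K-1)\\,\\mathbb{E}_{\\bar{p}(x,\\bar{y})}\\big[\\overline{\\mathcal{L}}(f(x), \\bar{y})\\big] &= (K-1) \\int \\sum_{\\bar{y}=1}^{K} \\overline{\\mathcal{L}}(f(x), \\bar{y})\\, \\bar{p}(x, \\bar{y})\\, dx \\\\ &= (K-1) \\int \\sum_{\\bar{y}=1}^{K} \\overline{\\mathcal{L}}(f(x), \\bar{y}) \\Big( \\frac{1}{K-1} \\sum_{y \\neq \\bar{y}} p(x, y) \\Big) dx \\\\ &= \\int \\sum_{y=1}^{K} \\sum_{\\bar{y} \\neq y} \\overline{\\mathcal{L}}(f(x), \\bar{y})\\, p(x, y)\\, dx \\\\ &= \\mathbb{E}_{p(x,y)}\\Big[ \\sum_{\\bar{y} \\neq y} \\overline{\\mathcal{L}}(f(x), \\bar{y}) \\Big] \\\\ &= \\mathbb{E}_{p(x,y)}\\big[ M_1 - \\overline{\\mathcal{L}}(f(x), y) \\big] \\\\ &= M_1 - \\mathbb{E}_{p(x,y)}\\big[ \\overline{\\mathcal{L}}(f(x), y) \\big], \\end{aligned}$$\n\nwhere the fifth equality follows from the first constraint in (7). Subsequently,\n\n$$\\begin{aligned} (K-1)\\,\\mathbb{E}_{\\bar{p}(x,\\bar{y})}\\big[\\overline{\\mathcal{L}}(f(x), \\bar{y})\\big] - \\mathbb{E}_{p(x,y)}\\big[\\mathcal{L}(f(x), y)\\big] &= M_1 - \\mathbb{E}_{p(x,y)}\\big[\\overline{\\mathcal{L}}(f(x), y) + \\mathcal{L}(f(x), y)\\big] \\\\ &= M_1 - \\mathbb{E}_{p(x,y)}[M_2] \\\\ &= M_1 - M_2, \\end{aligned}$$\n\nwhere the second equality follows from the second constraint in (7).\n\nThe first constraint in (7) can be regarded as a multi-class loss version of a symmetric constraint that we later use in Theorem 2. The second constraint in (7) means that the smaller $\\mathcal{L}$ is, the larger $\\overline{\\mathcal{L}}$ should be, i.e., if \u201cpattern $x$ belongs to class $y$\u201d is correct, \u201cpattern $x$ does not belong to class $y$\u201d should be incorrect.\nWith the expression (6), the classification risk (1) can be naively approximated in an unbiased fashion by the sample average as\n\n$$\\widehat{R}(f) = \\frac{K-1}{n} \\sum_{i=1}^{n} \\overline{\\mathcal{L}}(f(x_i), \\bar{y}_i) - M_1 + M_2. \\tag{8}$$\n\nLet us define the complementary losses corresponding to the OVA loss $\\mathcal{L}_{\\mathrm{OVA}}(f(x), y)$ and the PC loss $\\mathcal{L}_{\\mathrm{PC}}(f(x), y)$ as\n\n$$\\overline{\\mathcal{L}}_{\\mathrm{OVA}}(f(x), \\bar{y}) = \\frac{1}{K-1} \\sum_{y \\neq \\bar{y}} \\ell\\big(g_y(x)\\big) + \\ell\\big(-g_{\\bar{y}}(x)\\big), \\tag{9}$$\n\n$$\\overline{\\mathcal{L}}_{\\mathrm{PC}}(f(x), \\bar{y}) = \\sum_{y \\neq \\bar{y}} \\ell\\big(g_y(x) - g_{\\bar{y}}(x)\\big). \\tag{10}$$\n\n$^2$ The coefficient $1/(K-1)$ is for the normalization purpose: it would be natural to assume $\\bar{p}(x, \\bar{y}) = (1/Z) \\sum_{y \\neq \\bar{y}} p(x, y)$ since all $p(x, y)$ for $y \\neq \\bar{y}$ equally contribute to $\\bar{p}(x, \\bar{y})$; in order to ensure that $\\bar{p}(x, \\bar{y})$ is a valid joint density such that $\\mathbb{E}_{\\bar{p}(x,\\bar{y})}[1] = 1$, we must take $Z = K - 1$.\n\nFigure 1: Examples of binary losses that satisfy the symmetric condition (11).\n\nZero-one loss: $$\\ell_{0\\text{-}1}(z) = \\begin{cases} 0 & \\text{if } z > 0, \\\\ 1 & \\text{if } z \\leq 0, \\end{cases} \\tag{12}$$\n\nSigmoid loss: $$\\ell_{\\mathrm{S}}(z) = \\frac{1}{1 + e^{z}}, \\tag{13}$$\n\nRamp loss: $$\\ell_{\\mathrm{R}}(z) = \\frac{1}{2} \\max\\big(0, \\min(2, 1 - z)\\big). \\tag{14}$$\n\nThen we have the following theorem (its proof is given in Appendix A):\n\nTheorem 2. 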
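If binary loss $\\ell(z)$ satisfies\n\n$$\\ell(z) + \\ell(-z) = 1, \\tag{11}$$\n\nthen $\\overline{\\mathcal{L}}_{\\mathrm{OVA}}$ satisfies conditions (7) with $M_1 = K$ and $M_2 = 2$, and $\\overline{\\mathcal{L}}_{\\mathrm{PC}}$ satisfies conditions (7) with $M_1 = K(K-1)/2$ and $M_2 = K - 1$.\n\nFor example, the zero-one loss (12), the sigmoid loss (13), and the ramp loss (14) satisfy the symmetric condition (11) (see Figure 1). Note that these losses are non-convex [8]. In practice, the sigmoid loss or ramp loss may be used for training a classifier, while the zero-one loss may be used for tuning hyper-parameters (see Section 6 for the details).\n\n4 Estimation Error Bounds\n\nIn this section, we establish the estimation error bounds for the proposed method.\nLet $\\mathcal{G} = \\{g(x)\\}$ be a function class for empirical risk minimization and $\\sigma_1, \\ldots, \\sigma_n$ be $n$ Rademacher variables; then the Rademacher complexity of $\\mathcal{G}$ for $\\mathcal{X}$ of size $n$ drawn from $p(x)$ is defined as follows [23]:\n\n$$\\mathfrak{R}_n(\\mathcal{G}) = \\mathbb{E}_{\\mathcal{X}} \\mathbb{E}_{\\sigma_1, \\ldots, \\sigma_n} \\Big[ \\sup_{g \\in \\mathcal{G}} \\frac{1}{n} \\sum_{x_i \\in \\mathcal{X}} \\sigma_i g(x_i) \\Big];$$\n\ndefine the Rademacher complexity of $\\mathcal{G}$ for $\\overline{\\mathcal{X}}$ of size $n$ drawn from $\\bar{p}(x)$ as\n\n$$\\overline{\\mathfrak{R}}_n(\\mathcal{G}) = \\mathbb{E}_{\\overline{\\mathcal{X}}} \\mathbb{E}_{\\sigma_1, \\ldots, \\sigma_n} \\Big[ \\sup_{g \\in \\mathcal{G}} \\frac{1}{n} \\sum_{x_i \\in \\overline{\\mathcal{X}}} \\sigma_i g(x_i) \\Big].$$\n\nNote that $\\bar{p}(x) = p(x)$ and thus $\\overline{\\mathfrak{R}}_n(\\mathcal{G}) = \\mathfrak{R}_n(\\mathcal{G})$, which enables us to express the obtained theoretical results using the standard Rademacher complexity $\\mathfrak{R}_n(\\mathcal{G})$.\nTo begin with, let $\\tilde{\\ell}(z) = \\ell(z) - \\ell(0)$ be the shifted loss such that $\\tilde{\\ell}(0) = 0$ (in order to apply Talagrand\u2019s contraction lemma [19] later), and let $\\widetilde{\\mathcal{L}}_{\\mathrm{OVA}}$ and $\\widetilde{\\mathcal{L}}_{\\mathrm{PC}}$ be losses defined following (9) and (10) but with $\\tilde{\\ell}$ instead of $\\ell$; let $L_\\ell$ be any (not necessarily the best) Lipschitz constant of $\\ell$. Define the corresponding function classes as follows:\n\n$$\\mathcal{H}_{\\mathrm{OVA}} = \\big\\{ (x, \\bar{y}) \\mapsto \\widetilde{\\mathcal{L}}_{\\mathrm{OVA}}(f(x), \\bar{y}) \\mid g_1, \\ldots, g_K \\in \\mathcal{G} \\big\\},$$\n\n$$\\mathcal{H}_{\\mathrm{PC}} = \\big\\{ (x, \\bar{y}) \\mapsto \\widetilde{\\mathcal{L}}_{\\mathrm{PC}}(f(x), \\bar{y}) \\mid g_1, \\ldots, g_K \\in \\mathcal{G} \\big\\}.$$\n\nThen we can obtain the following lemmas (their proofs are given in Appendices B and C):\n\nLemma 3. Let $\\overline{\\mathfrak{R}}_n(\\mathcal{H}_{\\mathrm{OVA}})$ be the Rademacher complexity of $\\mathcal{H}_{\\mathrm{OVA}}$ for $\\overline{\\mathcal{S}}$ of size $n$ drawn from $\\bar{p}(x, \\bar{y})$ defined as\n\n$$\\overline{\\mathfrak{R}}_n(\\mathcal{H}_{\\mathrm{OVA}}) = \\mathbb{E}_{\\overline{\\mathcal{S}}} \\mathbb{E}_{\\sigma_1, \\ldots, \\sigma_n} \\Big[ \\sup_{h \\in \\mathcal{H}_{\\mathrm{OVA}}} \\frac{1}{n} \\sum_{(x_i, \\bar{y}_i) \\in \\overline{\\mathcal{S}}} \\sigma_i h(x_i, \\bar{y}_i) \\Big].$$\n\nThen,\n\n$$\\overline{\\mathfrak{R}}_n(\\mathcal{H}_{\\mathrm{OVA}}) \\leq K L_\\ell \\mathfrak{R}_n(\\mathcal{G}).$$\n\nLemma 4. Let $\\overline{\\mathfrak{R}}_n(\\mathcal{H}_{\\mathrm{PC}})$ be the Rademacher complexity of $\\mathcal{H}_{\\mathrm{PC}}$ defined similarly to $\\overline{\\mathfrak{R}}_n(\\mathcal{H}_{\\mathrm{OVA}})$. Then,\n\n$$\\overline{\\mathfrak{R}}_n(\\mathcal{H}_{\\mathrm{PC}}) \\leq 2K(K-1) L_\\ell \\mathfrak{R}_n(\\mathcal{G}).$$\n\nBased on Lemmas 3 and 4, we can derive the uniform deviation bounds of $\\widehat{R}(f)$ as follows (its proof is given in Appendix D):\n\nLemma 5. For any $\\delta > 0$, with probability at least $1 - \\delta$,\n\n$$\\sup_{g_1, \\ldots, g_K \\in \\mathcal{G}} \\big| \\widehat{R}(f) - R(f) \\big| \\leq 2K(K-1) L_\\ell \\mathfrak{R}_n(\\mathcal{G}) + (K-1) \\sqrt{\\frac{2 \\ln(2/\\delta)}{n}},$$\n\nwhere $\\widehat{R}(f)$ is w.r.t. $\\overline{\\mathcal{L}}_{\\mathrm{OVA}}$, and\n\n$$\\sup_{g_1, \\ldots, g_K \\in \\mathcal{G}} \\big| \\widehat{R}(f) - R(f) \\big| \\leq 4K(K-1)^2 L_\\ell \\mathfrak{R}_n(\\mathcal{G}) + (K-1)^2 \\sqrt{\\frac{\\ln(2/\\delta)}{2n}},$$\n\nwhere $\\widehat{R}(f)$ is w.r.t. $\\overline{\\mathcal{L}}_{\\mathrm{PC}}$.\n\nLet $(g_1^*, \\ldots, g_K^*)$ be the true risk minimizer and $(\\hat{g}_1, \\ldots, \\hat{g}_K)$ be the empirical risk minimizer, i.e.,\n\n$$(g_1^*, \\ldots, g_K^*) = \\arg\\min_{g_1, \\ldots, g_K \\in \\mathcal{G}} R(f) \\quad \\text{and} \\quad (\\hat{g}_1, \\ldots, \\hat{g}_K) = \\arg\\min_{g_1, \\ldots, g_K \\in \\mathcal{G}} \\widehat{R}(f).$$\n\nLet also\n\n$$f^*(x) = \\arg\\max_{y \\in \\{1, \\ldots, K\\}} g_y^*(x) \\quad \\text{and} \\quad \\hat{f}(x) = \\arg\\max_{y \\in \\{1, \\ldots, K\\}} \\hat{g}_y(x).$$\n\nFinally, based on Lemma 5, we can establish the estimation error bounds as follows:\n\nTheorem 6. For any $\\delta > 0$, with probability at least $1 - \\delta$,\n\n$$R(\\hat{f}) - R(f^*) \\leq 4K(K-1) L_\\ell \\mathfrak{R}_n(\\mathcal{G}) + (K-1) \\sqrt{\\frac{8 \\ln(2/\\delta)}{n}},$$\n\nif $(\\hat{g}_1, \\ldots, \\hat{g}_K)$ is trained by minimizing $\\widehat{R}(f)$ w.r.t. $\\overline{\\mathcal{L}}_{\\mathrm{OVA}}$, and\n\n$$R(\\hat{f}) - R(f^*) \\leq 8K(K-1)^2 L_\\ell \\mathfrak{R}_n(\\mathcal{G}) + (K-1)^2 \\sqrt{\\frac{2 \\ln(2/\\delta)}{n}},$$\n\nif $(\\hat{g}_1, \\ldots, \\hat{g}_K)$ is trained by minimizing $\\widehat{R}(f)$ w.r.t. $\\overline{\\mathcal{L}}_{\\mathrm{PC}}$.\n\nProof. Based on Lemma 5, the estimation error bounds can be proven through\n\n$$\\begin{aligned} R(\\hat{f}) - R(f^*) &= \\big( \\widehat{R}(\\hat{f}) - \\widehat{R}(f^*) \\big) + \\big( R(\\hat{f}) - \\widehat{R}(\\hat{f}) \\big) + \\big( \\widehat{R}(f^*) - R(f^*) \\big) \\\\ &\\leq 0 + 2 \\sup_{g_1, \\ldots, g_K \\in \\mathcal{G}} \\big| \\widehat{R}(f) - R(f) \\big|, \\end{aligned}$$\n\nwhere we used that $\\widehat{R}(\\hat{f}) \\leq \\widehat{R}(f^*)$ by the definition of $\\hat{f}$.\n\nTheorem 6 also guarantees that learning from complementary labels is consistent: as $n \\to \\infty$, $R(\\hat{f}) \\to R(f^*)$. 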
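
The estimation error bounds above build on the risk expression of Theorem 1. As a sanity check, the following minimal Python sketch (not part of the paper) evaluates both sides of Eq. (6) exactly on a small, randomly generated finite distribution, using the PC loss with the sigmoid loss and the constants from Theorem 2. The toy distribution, the random scores standing in for g_y(x), the 0-indexed classes, and all function names are assumptions made for illustration only.

```python
import numpy as np

def sigmoid_loss(z):
    # Eq. (13): satisfies the symmetric condition l(z) + l(-z) = 1
    return 1.0 / (1.0 + np.exp(z))

def pc_loss(g, y):
    # Eq. (4): ordinary pairwise-comparison loss for scores g and label y
    return sum(sigmoid_loss(g[y] - g[k]) for k in range(len(g)) if k != y)

def pc_comp_loss(g, ybar):
    # Eq. (10): complementary pairwise-comparison loss for complementary label ybar
    return sum(sigmoid_loss(g[k] - g[ybar]) for k in range(len(g)) if k != ybar)

def check_risk_identity(num_points=6, num_classes=4, seed=0):
    rng = np.random.default_rng(seed)
    K = num_classes
    # Hypothetical joint distribution p(x, y) on a finite set of points
    p = rng.random((num_points, K))
    p /= p.sum()
    # Eq. (5): p_bar(x, ybar) = (1 / (K - 1)) * sum over y != ybar of p(x, y)
    p_bar = (p.sum(axis=1, keepdims=True) - p) / (K - 1)
    # Arbitrary scores g_y(x) standing in for a trained model
    g = rng.normal(size=(num_points, K))
    # Constants from Theorem 2 for the PC loss
    m1, m2 = K * (K - 1) / 2, K - 1
    # Left-hand side of Eq. (6): the ordinary classification risk
    risk = sum(p[i, y] * pc_loss(g[i], y)
               for i in range(num_points) for y in range(K))
    # Right-hand side of Eq. (6): computed from the complementary density only
    comp = sum(p_bar[i, y] * pc_comp_loss(g[i], y)
               for i in range(num_points) for y in range(K))
    return risk, (K - 1) * comp - m1 + m2

risk, risk_from_comp = check_risk_identity()
print(risk, risk_from_comp)
```

Under these assumptions the two printed values should agree up to floating-point error for any seed, since the identity is exact rather than asymptotic.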
Consider a linear-in-parameter model defined by\n\n$$\\mathcal{G} = \\big\\{ g(x) = \\langle w, \\phi(x) \\rangle_{\\mathcal{H}} \\mid \\|w\\|_{\\mathcal{H}} \\leq C_w, \\|\\phi(x)\\|_{\\mathcal{H}} \\leq C_\\phi \\big\\},$$\n\nwhere $\\mathcal{H}$ is a Hilbert space with an inner product $\\langle \\cdot, \\cdot \\rangle_{\\mathcal{H}}$, $w \\in \\mathcal{H}$ is a normal, $\\phi : \\mathbb{R}^d \\to \\mathcal{H}$ is a feature map, and $C_w > 0$ and $C_\\phi > 0$ are constants [29]. It is known that $\\mathfrak{R}_n(\\mathcal{G}) \\leq C_w C_\\phi / \\sqrt{n}$ [23] and thus $R(\\hat{f}) \\to R(f^*)$ in $\\mathcal{O}_p(1/\\sqrt{n})$ if this $\\mathcal{G}$ is used, where $\\mathcal{O}_p$ denotes the order in probability. This order is already the optimal parametric rate and cannot be improved without additional strong assumptions on $p(x, y)$, $\\ell$ and $\\mathcal{G}$ jointly.\n\n5 Incorporation of ordinary labels\n\nIn many practical situations, we may also have ordinarily labeled data in addition to complementarily labeled data. In such cases, we want to leverage both kinds of labeled data to obtain more accurate classifiers. To this end, motivated by [28], let us consider a convex combination of the classification risks derived from ordinarily labeled data and complementarily labeled data:\n\n$$R(f) = \\alpha\\, \\mathbb{E}_{p(x,y)}\\big[\\mathcal{L}(f(x), y)\\big] + (1 - \\alpha) \\Big[ (K-1)\\, \\mathbb{E}_{\\bar{p}(x,\\bar{y})}\\big[\\overline{\\mathcal{L}}(f(x), \\bar{y})\\big] - M_1 + M_2 \\Big], \\tag{15}$$\n\nwhere $\\alpha \\in [0, 1]$ is a hyper-parameter that interpolates between the two risks. The combined risk (15) can be naively approximated by the sample averages as\n\n$$\\widehat{R}(f) = \\frac{\\alpha}{m} \\sum_{j=1}^{m} \\mathcal{L}(f(x_j), y_j) + \\frac{(1 - \\alpha)(K-1)}{n} \\sum_{i=1}^{n} \\overline{\\mathcal{L}}(f(x_i), \\bar{y}_i), \\tag{16}$$\n\nwhere $\\{(x_j, y_j)\\}_{j=1}^{m}$ are ordinarily labeled data and $\\{(x_i, \\bar{y}_i)\\}_{i=1}^{n}$ are complementarily labeled data.\nAs explained in the introduction, we can naturally obtain both ordinarily and complementarily labeled data through crowdsourcing [14]. Our risk estimator (16) can utilize both kinds of labeled data to obtain better classifiers$^3$. 
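
The combined estimator (16) can be sketched in a few lines of NumPy. This is a minimal illustration, not the implementation used in the paper (which was written in Chainer): it assumes the PC loss with the sigmoid loss, takes precomputed score matrices whose rows hold g_1(x), ..., g_K(x), uses 0-indexed class labels, and follows Eq. (16) as printed, so additive constants are not included. All function names are hypothetical.

```python
import numpy as np

def sigmoid_loss(z):
    # Eq. (13)
    return 1.0 / (1.0 + np.exp(z))

def pc_loss_batch(scores, labels):
    # Eq. (4) for each row: sum over y' != y of l(g_y - g_y')
    picked = scores[np.arange(len(labels)), labels][:, None]
    losses = sigmoid_loss(picked - scores)
    # The y' = y term equals l(0) = 0.5 and is removed
    return losses.sum(axis=1) - 0.5

def pc_comp_loss_batch(scores, comp_labels):
    # Eq. (10) for each row: sum over y != ybar of l(g_y - g_ybar)
    picked = scores[np.arange(len(comp_labels)), comp_labels][:, None]
    losses = sigmoid_loss(scores - picked)
    # The y = ybar term equals l(0) = 0.5 and is removed
    return losses.sum(axis=1) - 0.5

def combined_risk(scores_ord, labels, scores_comp, comp_labels, alpha=0.5):
    # Eq. (16): convex combination of the ordinary and complementary terms
    K = scores_comp.shape[1]
    ordinary = pc_loss_batch(scores_ord, labels).mean()
    complementary = (K - 1) * pc_comp_loss_batch(scores_comp, comp_labels).mean()
    return alpha * ordinary + (1.0 - alpha) * complementary

# Tiny usage example with all-zero scores and K = 3 classes
value = combined_risk(np.zeros((2, 3)), np.array([0, 1]),
                      np.zeros((4, 3)), np.array([1, 2, 0, 1]), alpha=0.5)
print(value)
```

Setting alpha to 1 or 0 in this sketch recovers the purely ordinary or purely complementary term, respectively.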
We will experimentally demonstrate the usefulness of this combination method in Section 6.\n\n$^3$ Note that when pattern $x$ has already been equipped with ordinary label $y$, giving complementary label $\\bar{y}$ does not bring us any additional information (unless the ordinary label is noisy).\n\n6 Experiments\n\nIn this section, we experimentally evaluate the performance of the proposed methods.\n\n6.1 Comparison of different losses\n\nHere we first compare the performance among four variations of the proposed method with different loss functions: OVA (9) and PC (10), each with the sigmoid loss (13) and ramp loss (14). We used the MNIST hand-written digit dataset, downloaded from the website of the late Sam Roweis$^4$ (with all patterns standardized to have zero mean and unit variance), with different numbers of classes: 3 classes (digits \u201c1\u201d to \u201c3\u201d) to 10 classes (digits \u201c1\u201d to \u201c9\u201d and \u201c0\u201d). From each class, we randomly sampled 500 data for training and 500 data for testing, and generated complementary labels by randomly selecting one of the complementary classes. From the training dataset, we left out 25% of the data for validating hyperparameters based on (8) with the zero-one loss plugged in (9) or (10).\n\n$^4$ See http://cs.nyu.edu/~roweis/data.html.\n\nTable 1: Means and standard deviations of classification accuracy over five trials in percentage, when the number of classes (\u201ccls\u201d) is changed for the MNIST dataset. \u201cPC\u201d is (10), \u201cOVA\u201d is (9), \u201cSigmoid\u201d is (13), and \u201cRamp\u201d is (14). Best and equivalent methods (with 5% t-test) are highlighted in boldface.\n\nMethod | 3 cls | 4 cls | 5 cls | 6 cls | 7 cls | 8 cls | 9 cls | 10 cls\nOVA Sigmoid | 95.2 (0.9) | 91.4 (0.5) | 87.5 (2.2) | 82.0 (1.3) | 74.5 (2.9) | 73.9 (1.2) | 63.6 (4.0) | 57.2 (1.6)\nOVA Ramp | 95.1 (0.9) | 90.8 (1.0) | 86.5 (1.8) | 79.4 (2.6) | 73.9 (3.9) | 71.4 (4.0) | 66.1 (2.1) | 56.1 (3.6)\nPC Sigmoid | 94.9 (0.5) | 90.9 (0.8) | 88.1 (1.8) | 80.3 (2.5) | 75.8 (2.5) | 72.9 (3.0) | 65.0 (3.5) | 58.9 (3.9)\nPC Ramp | 94.5 (0.7) | 90.8 (0.5) | 88.0 (2.2) | 81.0 (2.2) | 74.0 (2.3) | 71.4 (2.4) | 69.0 (2.8) | 57.3 (2.0)\n\nFor all the methods, we used a linear-in-input model $g_k(x) = w_k^\\top x + b_k$ as the binary classifier, where $\\top$ denotes the transpose, $w_k \\in \\mathbb{R}^d$ is the weight parameter, and $b_k \\in \\mathbb{R}$ is the bias parameter for class $k \\in \\{1, \\ldots, K\\}$. We added an $\\ell_2$-regularization term, with the regularization parameter chosen from $\\{10^{-4}, 10^{-3}, \\ldots, 10^{4}\\}$. Adam [15] was used for optimization with 5,000 iterations, with mini-batch size 100. We reported the test accuracy of the model with the best validation score out of all iterations. All experiments were carried out with Chainer [30].\nWe reported means and standard deviations of the classification accuracy over five trials in Table 1. From the results, we can see that the performance of all four methods deteriorates as the number of classes increases. This is intuitive because supervised information that complementary labels contain becomes weaker with more classes.\nThe table also shows that there is no significant difference in classification accuracy among the four losses. 
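
The complementary labels in this experiment were generated by randomly selecting one of the incorrect classes for each pattern. A minimal sketch of that step is given below; it assumes 0-indexed integer labels and a uniform choice among the K - 1 incorrect classes, which matches the density in Eq. (5). The function name and the synthetic labels are hypothetical and are not taken from the original experiment code.

```python
import numpy as np

def make_complementary_labels(labels, num_classes, rng):
    # Adding a random offset in {1, ..., K-1} modulo K picks one of the
    # K - 1 incorrect classes uniformly at random and never the true class
    offsets = rng.integers(1, num_classes, size=labels.shape[0])
    return (labels + offsets) % num_classes

rng = np.random.default_rng(0)
# Synthetic ordinary labels for a 10-class problem (stand-in for real data)
labels = rng.integers(0, 10, size=5000)
comp_labels = make_complementary_labels(labels, 10, rng)
print(labels[:5], comp_labels[:5])
```

The modular offset avoids rejection sampling, so the cost is a single vectorized draw per dataset.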
Since the PC formulation is regarded as a more direct approach for classification [31] (it takes the sign of the difference of the classifiers, instead of the sign of each classifier as in OVA) and the sigmoid loss is smooth, we use PC with the sigmoid loss as a representative of our proposed method in the following experiments.\n\n6.2 Benchmark experiments\n\nNext, we compare our proposed method, PC with the sigmoid loss (PC/S), with two baseline methods. The first baseline is one of the state-of-the-art partial label (PL) methods [5] with the squared hinge loss$^5$:\n\n$$\\ell(z) = \\big(\\max(0, 1 - z)\\big)^2.$$\n\nThe second baseline is a multi-label (ML) method [3], where every complementary label $\\bar{y}$ is translated into a negative label for class $\\bar{y}$ and positive labels for the other $K - 1$ classes. This yields the following loss:\n\n$$\\overline{\\mathcal{L}}_{\\mathrm{ML}}(f(x), \\bar{y}) = \\sum_{y \\neq \\bar{y}} \\ell\\big(g_y(x)\\big) + \\ell\\big(-g_{\\bar{y}}(x)\\big),$$\n\nwhere we used the same sigmoid loss as the proposed method for $\\ell$. We used a one-hidden-layer neural network ($d$-3-1) with rectified linear units (ReLU) [24] as activation functions, and weight decay candidates were chosen from $\\{10^{-7}, 10^{-4}, 10^{-1}\\}$. Standardization, validation and optimization details follow the previous experiments.\nWe evaluated the classification performance on the following benchmark datasets: WAVEFORM1, WAVEFORM2, SATIMAGE, PENDIGITS, DRIVE, LETTER, and USPS. USPS can be downloaded from the website of the late Sam Roweis$^6$, and all other datasets can be downloaded from the UCI machine learning repository$^7$. We tested several different settings of class labels, with equal number of data in each class.\n\n$^5$ We decided to use the squared hinge loss (which is convex) here since it was reported to work well in the original paper [5].\n$^6$ See http://cs.nyu.edu/~roweis/data.html.\n$^7$ See http://archive.ics.uci.edu/ml/.\n\nTable 2: Means and standard deviations of classification accuracy over 20 trials in percentage. \u201cPC/S\u201d is the proposed method for the pairwise comparison formulation with the sigmoid loss, \u201cPL\u201d is the partial label method with the squared hinge loss, and \u201cML\u201d is the multi-label method with the sigmoid loss. Best and equivalent methods (with 5% t-test) are highlighted in boldface. \u201cClass\u201d denotes the class labels used for the experiment and \u201cDim\u201d denotes the dimensionality $d$ of patterns to be classified. \u201c# train\u201d denotes the total number of training and validation samples in each class. \u201c# test\u201d denotes the number of test samples in each class.\n\nDataset | Class | Dim | # train | # test | PC/S | PL | ML\nWAVEFORM1 | 1~3 | 21 | 1226 | 398 | 85.8(0.5) | 85.7(0.9) | 79.3(4.8)\nWAVEFORM2 | 1~3 | 40 | 1227 | 408 | 84.7(1.3) | 84.6(0.8) | 74.9(5.2)\nSATIMAGE | 1~7 | 36 | 415 | 211 | 68.7(5.4) | 60.7(3.7) | 33.6(6.2)\nPENDIGITS | 1~5 | 16 | 719 | 336 | 87.0(2.9) | 76.2(3.3) | 44.7(9.6)\nPENDIGITS | 6~10 | 16 | 719 | 335 | 78.4(4.6) | 71.1(3.3) | 38.4(9.6)\nPENDIGITS | even # | 16 | 719 | 336 | 90.8(2.4) | 76.8(1.6) | 43.8(5.1)\nPENDIGITS | odd # | 16 | 719 | 335 | 76.0(5.4) | 67.4(2.6) | 40.2(8.0)\nPENDIGITS | 1~10 | 16 | 719 | 335 | 38.0(4.3) | 33.2(3.8) | 16.1(4.6)\nDRIVE | 1~5 | 48 | 3955 | 1326 | 89.1(4.0) | 77.7(1.5) | 31.1(3.5)\nDRIVE | 6~10 | 48 | 3923 | 1313 | 88.8(1.8) | 78.5(2.6) | 30.4(7.2)\nDRIVE | even # | 48 | 3925 | 1283 | 81.8(3.4) | 63.9(1.8) | 29.7(6.3)\nDRIVE | odd # | 48 | 3939 | 1278 | 85.4(4.2) | 74.9(3.2) | 27.6(5.8)\nDRIVE | 1~10 | 48 | 3925 | 1269 | 40.8(4.3) | 32.0(4.1) | 12.7(3.1)\nLETTER | 1~5 | 16 | 565 | 171 | 79.7(5.3) | 75.1(4.4) | 28.3(10.4)\nLETTER | 6~10 | 16 | 550 | 178 | 76.2(6.2) | 66.8(2.5) | 34.0(6.9)\nLETTER | 11~15 | 16 | 556 | 177 | 78.3(4.1) | 67.4(3.3) | 28.6(5.0)\nLETTER | 16~20 | 16 | 550 | 184 | 77.2(3.2) | 68.4(2.1) | 32.7(6.4)\nLETTER | 21~25 | 16 | 585 | 167 | 80.4(4.2) | 75.1(1.9) | 32.0(5.7)\nLETTER | 1~25 | 16 | 550 | 167 | 5.1(2.1) | 5.0(1.0) | 5.2(1.1)\nUSPS | 1~5 | 256 | 652 | 166 | 79.1(3.1) | 70.3(3.2) | 44.4(8.9)\nUSPS | 6~10 | 256 | 542 | 147 | 69.5(6.5) | 66.1(2.4) | 37.3(8.8)\nUSPS | even # | 256 | 556 | 147 | 67.4(5.4) | 66.2(2.3) | 35.7(6.6)\nUSPS | odd # | 256 | 542 | 147 | 77.5(4.5) | 69.3(3.1) | 36.6(7.5)\nUSPS | 1~10 | 256 | 542 | 127 | 30.7(4.4) | 26.0(3.5) | 13.3(5.4)\n\nIn Table 2, we summarized the specification of the datasets and reported the means and standard deviations of the classification accuracy over 10 trials. From the results, we can see that the proposed method is either comparable to or better than the baseline methods on many of the datasets.\n\n6.3 Combination of ordinary and complementary labels\n\nFinally, we demonstrate the usefulness of combining ordinarily and complementarily labeled data. We used (16), with hyperparameter $\\alpha$ fixed at 1/2 for simplicity. We divided our training dataset by $1 : (K - 1)$ ratio, where one subset was labeled ordinarily while the other was labeled complementarily$^8$. From the training dataset, we left out 25% of the data for validating hyperparameters based on the zero-one loss version of (16). 
Other details such as standardization, the model and optimization, and weight-decay candidates follow the previous experiments.\nWe compared three methods: the ordinary label (OL) method corresponding to $\\alpha = 1$, the complementary label (CL) method corresponding to $\\alpha = 0$, and the combination (OL & CL) method with $\\alpha = 1/2$. The PC and sigmoid losses were commonly used for all methods.\nWe reported the means and standard deviations of the classification accuracy over 10 trials in Table 3. From the results, we can see that OL & CL tends to outperform OL and CL, demonstrating the usefulness of combining ordinarily and complementarily labeled data.\n\n$^8$ We used $K - 1$ times more complementarily labeled data than ordinarily labeled data since a single ordinary label corresponds to $(K - 1)$ complementary labels.\n\nTable 3: Means and standard deviations of classification accuracy over 10 trials in percentage. \u201cOL\u201d is the ordinary label method, \u201cCL\u201d is the complementary label method, and \u201cOL & CL\u201d is a combination method that uses both ordinarily and complementarily labeled data. Best and equivalent methods are highlighted in boldface. \u201cClass\u201d denotes the class labels used for the experiment and \u201cDim\u201d denotes the dimensionality $d$ of patterns to be classified. \u201c# train\u201d denotes the number of ordinarily/complementarily labeled data for training and validation in each class. \u201c# test\u201d denotes the number of test data in each class.\n\nDataset | Class | Dim | # train | # test | OL ($\\alpha = 1$) | CL ($\\alpha = 0$) | OL & CL ($\\alpha = 1/2$)\nWAVEFORM1 | 1~3 | 21 | 413/826 | 408 | 85.3(0.8) | 86.0(0.4) | 86.9(0.5)\nWAVEFORM2 | 1~3 | 40 | 411/821 | 411 | 82.7(1.3) | 82.0(1.7) | 84.7(0.6)\nSATIMAGE | 1~7 | 36 | 69/346 | 211 | 74.9(4.9) | 70.1(5.6) | 81.2(1.1)\nPENDIGITS | 1~5 | 16 | 144/575 | 336 | 91.3(2.1) | 84.7(3.2) | 93.1(2.0)\nPENDIGITS | 6~10 | 16 | 144/575 | 335 | 86.3(3.5) | 78.3(6.2) | 87.8(2.8)\nPENDIGITS | even # | 16 | 144/575 | 336 | 94.3(1.7) | 91.0(4.3) | 95.8(0.6)\nPENDIGITS | odd # | 16 | 144/575 | 335 | 85.6(2.0) | 75.9(3.1) | 86.9(1.1)\nPENDIGITS | 1~10 | 16 | 72/647 | 335 | 61.7(4.3) | 41.1(5.7) | 66.9(2.0)\nDRIVE | 1~5 | 48 | 780/3121 | 1305 | 92.1(2.6) | 89.0(2.1) | 94.2(1.0)\nDRIVE | 6~10 | 48 | 795/3180 | 1290 | 87.0(3.0) | 86.5(3.1) | 89.5(2.1)\nDRIVE | even # | 48 | 657/3284 | 1314 | 91.4(2.9) | 81.8(4.6) | 91.8(3.3)\nDRIVE | odd # | 48 | 790/3161 | 1255 | 91.1(1.5) | 86.7(2.9) | 93.4(0.5)\nDRIVE | 1~10 | 48 | 397/3570 | 1292 | 75.2(2.8) | 40.5(7.2) | 77.6(2.2)\nLETTER | 1~5 | 16 | 113/452 | 171 | 85.2(1.3) | 77.2(6.1) | 89.5(1.6)\nLETTER | 6~10 | 16 | 110/440 | 178 | 81.0(1.7) | 77.6(3.7) | 84.6(1.0)\nLETTER | 11~15 | 16 | 111/445 | 177 | 81.1(2.7) | 76.0(3.2) | 87.3(1.6)\nLETTER | 16~20 | 16 | 110/440 | 184 | 81.3(1.8) | 77.9(3.1) | 84.7(2.0)\nLETTER | 21~25 | 16 | 117/468 | 167 | 86.8(2.7) | 81.2(3.4) | 91.1(1.0)\nLETTER | 1~25 | 16 | 22/528 | 167 | 11.9(1.7) | 6.5(1.7) | 31.0(1.7)\nUSPS | 1~5 | 256 | 130/522 | 166 | 83.8(1.7) | 76.5(5.3) | 89.5(1.3)\nUSPS | 6~10 | 256 | 108/434 | 147 | 79.2(2.1) | 67.6(4.3) | 85.5(2.4)\nUSPS | even # | 256 | 108/434 | 166 | 79.6(2.7) | 67.4(4.4) | 84.8(1.4)\nUSPS | odd # | 256 | 111/445 | 147 | 82.7(1.9) | 72.9(6.2) | 87.3(2.2)\nUSPS | 1~10 | 256 | 54/488 | 147 | 43.7(2.6) | 28.5(3.6) | 59.3(2.2)\n\n7 Conclusions\n\nWe proposed a novel problem setting called learning from complementary labels, and showed that an unbiased estimator to the classification risk can be obtained only from complementarily labeled data, if the loss function satisfies 
a certain symmetric condition. Our risk estimator can easily be\nminimized by any stochastic optimization algorithms such as Adam [15], allowing large-scale train-\ning. We theoretically established estimation error bounds for the proposed method, and proved that\nthe proposed method achieves the optimal parametric rate. We further showed that our proposed\ncomplementary classi\ufb01cation can be easily combined with ordinary classi\ufb01cation. Finally, we ex-\nperimentally demonstrated the usefulness of the proposed methods.\nThe formulation of learning from complementary labels may also be useful in the context of privacy-\naware machine learning [10]: a subject needs to answer private questions such as psychological\ncounseling which can make him/her hesitate to answer directly. In such a situation, providing a\ncomplementary label, i.e., one of the incorrect answers to the question, would be mentally less\ndemanding. We will investigate this issue in the future.\nIt is noteworthy that the symmetric condition (11), which the loss should satisfy in our comple-\nmentary classi\ufb01cation framework, also appears in other weakly supervised learning formulations,\ne.g., in positive-unlabeled learning [8]. It would be interesting to more closely investigate the role\nof this symmetric condition to gain further insight into these different weakly supervised learning\nproblems.\n\n9\n\n\fAcknowledgements\nGN and MS were supported by JST CREST JPMJCR1403. We thank Ikko Yamane for the helpful\ndiscussions.\n\nReferences\n[1] M. Belkin, P. Niyogi, and V. Sindhwani. Manifold regularization: a geometric framework for learning\n\nfrom labeled and unlabeled examples. Journal of Machine Learning Research, 7:2399\u20132434, 2006.\n\n[2] G. Blanchard, G. Lee, and C. Scott. Semi-supervised novelty detection. Journal of Machine Learning\n\nResearch, 11:2973\u20133009, 2010.\n\n[3] M. R. Boutell, J. Luo, X. Shen, and C. M. Brown. Learning multi-label scene classi\ufb01cation. 
Pattern\n\nRecognition, 37(9):1757\u20131771, 2004.\n\n[4] O. Chapelle, B. Sch\u00f6lkopf, and A. Zien, editors. Semi-Supervised Learning. MIT Press, 2006.\n\n[5] T. Cour, B. Sapp, and B. Taskar. Learning from partial labels. Journal of Machine Learning Research,\n\n12:1501\u20131536, 2011.\n\n[6] J. Davis, B. Kulis, P. Jain, S. Sra, and I. Dhillon. Information-theoretic metric learning. In ICML, 2007.\n\n[7] F. Denis. PAC learning from positive statistical queries. In ALT, 1998.\n\n[8] M. C. du Plessis, G. Niu, and M. Sugiyama. Analysis of learning from positive and unlabeled data. In\n\nNIPS, 2014.\n\n[9] M. C. du Plessis, G. Niu, and M. Sugiyama. Convex formulation for learning from positive and unlabeled\n\ndata. In ICML, 2015.\n\n[10] C. Dwork. Differential privacy: A survey of results. In TAMC, 2008.\n\n[11] C. Elkan and K. Noto. Learning classi\ufb01ers from only positive and unlabeled data. In KDD, 2008.\n\n[12] J. Goldberger, S. Roweis, G. Hinton, and R. Salakhutdinov. Neighbourhood components analysis. In\n\nNIPS, 2004.\n\n[13] Y. Grandvalet and Y. Bengio. Semi-supervised learning by entropy minimization. In NIPS, 2004.\n\n[14] J. Howe. Crowdsourcing: Why the power of the crowd is driving the future of business. Crown Publishing\n\nGroup, 2009.\n\n[15] D. P. Kingma and J. L. Ba. Adam: A method for stochastic optimization. In ICLR, 2015.\n\n[16] T. N. Kipf and M. Welling. Semi-supervised classi\ufb01cation with graph convolutional networks. In ICLR,\n\n2017.\n\n[17] R. Kiryo, G. Niu, M. C. du Plessis, and M. Sugiyama. Positive-unlabeled learning with non-negative risk\n\nestimator. In NIPS, 2017.\n\n[18] S. Laine and T. Aila. Temporal ensembling for semi-supervised learning. In ICLR, 2017.\n\n[19] M. Ledoux and M. Talagrand. Probability in Banach Spaces: Isoperimetry and Processes. Springer,\n\n1991.\n\n[20] Y.-F. Li and Z.-H. Zhou. 
Towards making unlabeled data never hurt. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(1):175\u2013188, 2015.\n\n[21] G. Mann and A. McCallum. Simple, robust, scalable semi-supervised learning via expectation regularization. In ICML, 2007.\n\n[22] C. McDiarmid. On the method of bounded differences. In J. Siemons, editor, Surveys in Combinatorics, pages 148\u2013188. Cambridge University Press, 1989.\n\n[23] M. Mohri, A. Rostamizadeh, and A. Talwalkar. Foundations of Machine Learning. MIT Press, 2012.\n\n[24] V. Nair and G. Hinton. Rectified linear units improve restricted Boltzmann machines. In ICML, 2010.\n\n[25] G. Niu, B. Dai, M. Yamada, and M. Sugiyama. Information-theoretic semi-supervised metric learning via entropy regularization. Neural Computation, 26(8):1717\u20131762, 2014.\n\n[26] G. Niu, M. C. du Plessis, T. Sakai, Y. Ma, and M. Sugiyama. Theoretical comparisons of positive-unlabeled learning against positive-negative learning. In NIPS, 2016.\n\n[27] G. Niu, W. Jitkrittum, B. Dai, H. Hachiya, and M. Sugiyama. Squared-loss mutual information regularization: A novel information-theoretic approach to semi-supervised learning. In ICML, 2013.\n\n[28] T. Sakai, M. C. du Plessis, G. Niu, and M. Sugiyama. Semi-supervised classification based on classification from positive and unlabeled data. In ICML, 2017.\n\n[29] B. Sch\u00f6lkopf and A. Smola. Learning with Kernels. MIT Press, 2001.\n\n[30] S. Tokui, K. Oono, S. Hido, and J. Clayton. Chainer: a next-generation open source framework for deep learning. In Proceedings of Workshop on Machine Learning Systems in NIPS, 2015.\n\n[31] V. N. Vapnik. Statistical learning theory. John Wiley and Sons, 1998.\n\n[32] G. Ward, T. Hastie, S. Barry, J. Elith, and J. Leathwick. Presence-only data and the EM algorithm. Biometrics, 65(2):554\u2013563, 2009.\n\n[33] K. Weinberger, J. Blitzer, and L. Saul. 
Distance metric learning for large margin nearest neighbor classification. Journal of Machine Learning Research, 10:207\u2013244, 2009.\n\n[34] E. P. Xing, A. Y. Ng, M. I. Jordan, and S. Russell. Distance metric learning with application to clustering with side-information. In NIPS, 2002.\n\n[35] Z. Yang, W. W. Cohen, and R. Salakhutdinov. Revisiting semi-supervised learning with graph embeddings. In ICML, 2016.\n\n[36] T. Zhang. Statistical analysis of some multi-category large margin classification methods. Journal of Machine Learning Research, 5:1225\u20131251, 2004.\n\n[37] D. Zhou, O. Bousquet, T. Navin Lal, J. Weston, and B. Sch\u00f6lkopf. Learning with local and global consistency. In NIPS, 2003.\n\n[38] X. Zhu, Z. Ghahramani, and J. Lafferty. Semi-supervised learning using Gaussian fields and harmonic functions. In ICML, 2003.", "award": [], "sourceid": 2887, "authors": [{"given_name": "Takashi", "family_name": "Ishida", "institution": "Sumitomo Mitsui Asset Management, The University of Tokyo, RIKEN"}, {"given_name": "Gang", "family_name": "Niu", "institution": "The University of Tokyo / RIKEN"}, {"given_name": "Weihua", "family_name": "Hu", "institution": "The University of Tokyo"}, {"given_name": "Masashi", "family_name": "Sugiyama", "institution": "RIKEN / University of Tokyo"}]}