{"title": "Binary Classification from Positive-Confidence Data", "book": "Advances in Neural Information Processing Systems", "page_first": 5917, "page_last": 5928, "abstract": "Can we learn a binary classifier from only positive data, without any negative data or unlabeled data? We show that if one can equip positive data with confidence (positive-confidence), one can successfully learn a binary classifier, which we name positive-confidence (Pconf) classification. Our work is related to one-class classification which is aimed at \"describing\" the positive class by clustering-related methods, but one-class classification does not have the ability to tune hyper-parameters and their aim is not on \"discriminating\" positive and negative classes. For the Pconf classification problem, we provide a simple empirical risk minimization framework that is model-independent and optimization-independent. We theoretically establish the consistency and an estimation error bound, and demonstrate the usefulness of the proposed method for training deep neural networks through experiments.", "full_text": "Binary Classi\ufb01cation from Positive-Con\ufb01dence Data\n\nTakashi Ishida1,2 Gang Niu2 Masashi Sugiyama2,1\n\n1 The University of Tokyo, Tokyo, Japan\n\n{ishida@ms., sugi@}k.u-tokyo.ac.jp, gang.niu@riken.jp\n\n2 RIKEN, Tokyo, Japan\n\nAbstract\n\nCan we learn a binary classi\ufb01er from only positive data, without any negative data\nor unlabeled data? We show that if one can equip positive data with con\ufb01dence\n(positive-con\ufb01dence), one can successfully learn a binary classi\ufb01er, which we name\npositive-con\ufb01dence (Pconf) classi\ufb01cation. Our work is related to one-class classi\ufb01-\ncation which is aimed at \u201cdescribing\u201d the positive class by clustering-related meth-\nods, but one-class classi\ufb01cation does not have the ability to tune hyper-parameters\nand their aim is not on \u201cdiscriminating\u201d positive and negative classes. 
For the Pconf classification problem, we provide a simple empirical risk minimization framework that is model-independent and optimization-independent. We theoretically establish consistency and an estimation error bound, and demonstrate the usefulness of the proposed method for training deep neural networks through experiments.

1 Introduction

Machine learning with big labeled data has been highly successful in applications such as image recognition, speech recognition, recommendation, and machine translation [14]. However, in many other real-world problems, including robotics, disaster resilience, medical diagnosis, and bioinformatics, massive labeled data typically cannot be collected easily. For this reason, machine learning from weak supervision has been actively explored recently, including semi-supervised classification [6, 30, 40, 53, 23, 36], one-class classification [5, 21, 42, 51, 16, 46], positive-unlabeled (PU) classification [12, 33, 8, 9, 34, 24, 41], label-proportion classification [39, 54], unlabeled-unlabeled classification [7, 29, 26], complementary-label classification [18, 55, 19], and similar-unlabeled classification [1].

In this paper, we consider a novel setting of classification from weak supervision called positive-confidence (Pconf) classification, which aims at training a binary classifier only from positive data equipped with confidence, without negative data. Such a Pconf classification scenario is conceivable in various real-world problems. For example, in purchase prediction, we can easily collect customer data from our own company (positive data), but not from rival companies (negative data). Oftentimes, our customers are asked to answer questionnaires/surveys on how strong their buying intention is over rival products.
This may be transformed into a probability between 0 and 1 by pre-processing, and it can then be used as positive-confidence, which is all we need for Pconf classification. Another example is a common task for app developers, who need to predict whether app users will continue using the app or unsubscribe in the future. The critical issue is that, depending on the privacy/opt-out policy or data regulation, they need to fully discard the unsubscribed users' data. Hence, developers will not have access to users who quit using their services, but they can associate a positive-confidence score with each remaining user based on, e.g., how actively that user uses the app. In these applications, as long as positive-confidence data can be collected, Pconf classification allows us to obtain a classifier that discriminates between positive and negative data.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

Figure 1: Illustrations of the Pconf classification and other related classification settings. Best viewed in color. Red points are positive data, blue points are negative data, and gray points are unlabeled data. The dark/light red colors on the rightmost figure show high/low confidence values for positive data.

Related works: Pconf classification is related to one-class classification, which is aimed at “describing” the positive class, typically from hard-labeled positive data without confidence. To the best of our knowledge, previous one-class methods are motivated geometrically [50, 42], by information theory [49], or by density estimation [4]. However, due to the descriptive nature of all previous methods, there is no systematic way to tune hyper-parameters to “classify” positive and negative data.
In the conceptual example in Figure 1, one-class methods (see the second-left illustration) have no knowledge of the negative distribution, e.g., that it lies in the lower right of the positive distribution (see the left-most illustration). Therefore, even with an infinite amount of training data, one-class methods will still require regularization to obtain a tight boundary in all directions, wherever the positive posterior becomes low. Note that even if we knew that the negative distribution lies in the lower right of the positive distribution, it would still be impossible to find the decision boundary, because we would also need to know the degree of overlap between the two distributions and the class prior. One-class methods are designed for, and work well in, anomaly detection, but have critical limitations if the problem of interest is “classification”.

On the other hand, Pconf classification is aimed at constructing a discriminative classifier, and thus hyper-parameters can be objectively chosen to discriminate between positive and negative data. We want to emphasize that the key contribution of our paper is to propose a method that is purely based on empirical risk minimization (ERM) [52], which makes it suitable for binary classification.

Pconf classification is also related to positive-unlabeled (PU) classification, which uses hard-labeled positive data and additional unlabeled data for constructing a binary classifier. A practical advantage of our Pconf classification method over typical PU classification methods is that our method does not involve estimation of the class-prior probability, which is required in standard PU classification methods [8, 9, 24] but is known to be highly challenging in practice [44, 2, 11, 29, 10].
This is enabled by the additional confidence information, which indirectly includes the information of the class-prior probability by bridging class conditionals and class posteriors.

Organization: In this paper, we propose a simple ERM framework for Pconf classification and theoretically establish consistency and an estimation error bound. We then provide an example implementation of Pconf classification using linear-in-parameter models (such as Gaussian kernel models), which is easy to implement and computationally efficient. Finally, we experimentally demonstrate the practical usefulness of the proposed method for training linear-in-parameter models and deep neural networks.

2 Problem formulation

In this section, we formulate our Pconf classification problem. Suppose that a pair of a $d$-dimensional pattern $x \in \mathbb{R}^d$ and its class label $y \in \{+1, -1\}$ follows an unknown probability distribution with density $p(x, y)$. Our goal is to train a binary classifier $g(x): \mathbb{R}^d \to \mathbb{R}$ so that the classification risk $R(g)$ is minimized:

$$R(g) = \mathbb{E}_{p(x,y)}[\ell(yg(x))], \quad (1)$$

where $\mathbb{E}_{p(x,y)}$ denotes the expectation over $p(x, y)$, and $\ell(z)$ is a loss function. When the margin $z$ is small, $\ell(z)$ typically takes a large value. Since $p(x, y)$ is unknown, the ordinary ERM approach [52] replaces the expectation with the average over training data drawn independently from $p(x, y)$.

However, in the Pconf classification scenario, we are only given positive data equipped with confidence, $X := \{(x_i, r_i)\}_{i=1}^n$, where $x_i$ is a positive pattern drawn independently from $p(x \mid y=+1)$ and $r_i$ is the positive-confidence given by $r_i = p(y=+1 \mid x_i)$. Note that this equality does not have to hold strictly, as shown later in Section 4. Since we have no access to negative data in the Pconf classification scenario, we cannot directly employ the standard ERM approach.
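To make the data-generation process above concrete, here is a minimal sketch (our own illustrative example, not from the paper) that builds a Pconf sample $X = \{(x_i, r_i)\}$ from a known one-dimensional two-Gaussian joint, keeping only the positive draws and attaching the exact posterior as confidence:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative joint: p(y=+1) = 0.5, p(x|+1) = N(0,1), p(x|-1) = N(2,1).
# For equal priors and unit variances, the posterior has the closed form
# r(x) = p(y=+1|x) = 1 / (1 + exp(2x - 2)).
def posterior(x):
    return 1.0 / (1.0 + np.exp(2.0 * x - 2.0))

n = 1000
y = rng.choice([+1, -1], size=n)
x = rng.normal(loc=np.where(y == +1, 0.0, 2.0), scale=1.0)

# Pconf data: only the positive patterns, each paired with its confidence.
X_pconf = [(xi, posterior(xi)) for xi in x[y == +1]]
```

A Pconf learner sees only `X_pconf`; the negative draws are discarded, mirroring the problem setting above.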
In the next section, we show how the classification risk can be estimated only from Pconf data.

3 Pconf classification

In this section, we propose an ERM framework for Pconf classification and derive an estimation error bound for the proposed method. Finally, we give examples of practical implementations.

3.1 Empirical risk minimization (ERM) framework

Let $\pi_+ = p(y=+1)$ and $r(x) = p(y=+1 \mid x)$, and let $\mathbb{E}_+$ denote the expectation over $p(x \mid y=+1)$. Then the following theorem holds, which forms the basis of our approach:

Theorem 1. The classification risk (1) can be expressed as

$$R(g) = \pi_+ \mathbb{E}_+\Big[\ell\big(g(x)\big) + \frac{1-r(x)}{r(x)}\,\ell\big(-g(x)\big)\Big], \quad (2)$$

if we have $p(y=+1 \mid x) \neq 0$ for all $x$ sampled from $p(x)$.

A proof is given in Appendix A.1 in the supplementary material. Equation (2) does not include the expectation over negative data, but only the expectation over positive data and their confidence values. Furthermore, when (2) is minimized with respect to $g$, the unknown $\pi_+$ is a proportional constant and thus can be safely ignored. Conceptually, the assumption $p(y=+1 \mid x) \neq 0$ implies that the support of the negative distribution is the same as, or is included in, the support of the positive distribution.

Based on this, we propose the following ERM framework for Pconf classification:

$$\min_g \sum_{i=1}^n \Big[\ell\big(g(x_i)\big) + \frac{1-r_i}{r_i}\,\ell\big(-g(x_i)\big)\Big]. \quad (3)$$

It might be tempting to consider a similar empirical formulation as follows:

$$\min_g \sum_{i=1}^n \Big[r_i\,\ell\big(g(x_i)\big) + (1-r_i)\,\ell\big(-g(x_i)\big)\Big]. \quad (4)$$

Equation (4) means that we weigh the positive loss with positive-confidence $r_i$ and the negative loss with negative-confidence $1-r_i$.
This is quite natural and may look straightforward at a glance. However, if we simply consider the population version of the objective function of (4), we have

$$\mathbb{E}_+\Big[r(x)\ell\big(g(x)\big) + \big(1-r(x)\big)\ell\big(-g(x)\big)\Big] = \mathbb{E}_+\Big[p(y=+1 \mid x)\ell\big(g(x)\big) + p(y=-1 \mid x)\ell\big(-g(x)\big)\Big] = \mathbb{E}_+\Big[\sum_{y\in\{\pm1\}} p(y \mid x)\ell\big(yg(x)\big)\Big] = \mathbb{E}_+\Big[\mathbb{E}_{p(y|x)}\big[\ell\big(yg(x)\big)\big]\Big], \quad (5)$$

which is not equivalent to the classification risk $R(g)$ defined by (1). If the outer expectation were over $p(x)$ instead of $p(x \mid y=+1)$ in (5), then it would equal (1). This implies that if we had a different problem setting in which positive-confidence is equipped for $x$ sampled from $p(x)$, the problem would be trivially solved by this naive weighting idea.

From this viewpoint, (3) can be regarded as an application of importance sampling [13, 48] to (4) to cope with the distribution difference between $p(x)$ and $p(x \mid y=+1)$, but with the advantage of not requiring training data from the test distribution $p(x)$.

In summary, our ERM formulation (3) is different from the naive confidence-weighted classification (4). We further show in Section 3.2 that the minimizer of (3) converges to the true risk minimizer, while the minimizer of (4) converges to a different quantity, and hence learning based on (4) is inconsistent.

3.2 Theoretical analysis

Here we derive an estimation error bound for the proposed method. To begin with, let $\mathcal{G}$ be our function class for ERM. Assume there exists $C_g > 0$ such that $\sup_{g\in\mathcal{G}} \|g\|_\infty \le C_g$, as well as $C_\ell > 0$ such that $\sup_{|z|\le C_g} \ell(z) \le C_\ell$. The existence of $C_\ell$ may be guaranteed for all reasonable $\ell$, given a reasonable $\mathcal{G}$ in the sense that $C_g$ exists. As usual [31], assume $\ell(z)$ is Lipschitz continuous for all $|z| \le C_g$ with a (not necessarily optimal) Lipschitz constant $L_\ell$. Denote by $\hat{R}(g)$ the objective function of (3) times $\pi_+$, which is unbiased in estimating $R(g)$ in (1) according to Theorem 1.
Subsequently, let $g^* = \arg\min_{g\in\mathcal{G}} R(g)$ be the true risk minimizer and $\hat{g} = \arg\min_{g\in\mathcal{G}} \hat{R}(g)$ be the empirical risk minimizer, respectively. The estimation error is defined as $R(\hat{g}) - R(g^*)$, and we are going to bound it from above.

In Theorem 1, $(1-r(x))/r(x)$ plays a role inside the expectation, owing to the fact that

$$r(x) = p(y=+1 \mid x) > 0 \quad \text{for } x \sim p(x \mid y=+1).$$

In order to derive any error bound based on statistical learning theory, we should ensure that $r(x)$ can never be too close to zero. To this end, assume there is $C_r > 0$ such that $r(x) \ge C_r$ almost surely. We may trim $r(x)$ and then analyze the bounded but biased version of $\hat{R}(g)$ alternatively. For simplicity, only the unbiased version is considered, after assuming that $C_r$ exists.

Lemma 2. For any $\delta > 0$, the following uniform deviation bound holds with probability at least $1-\delta$ (over repeated sampling of data for evaluating $\hat{R}(g)$):

$$\sup_{g\in\mathcal{G}} \big|\hat{R}(g) - R(g)\big| \le 2\pi_+\Big(L_\ell + \frac{L_\ell}{C_r}\Big)\mathfrak{R}_n(\mathcal{G}) + \pi_+\Big(C_\ell + \frac{C_\ell}{C_r}\Big)\sqrt{\frac{\ln(2/\delta)}{2n}}, \quad (6)$$

where $\mathfrak{R}_n(\mathcal{G})$ is the Rademacher complexity of $\mathcal{G}$ for $X$ of size $n$ drawn from $p(x \mid y=+1)$.¹

Lemma 2 guarantees that with high probability $\hat{R}(g)$ concentrates around $R(g)$ for all $g \in \mathcal{G}$, and the degree of such concentration is controlled by $\mathfrak{R}_n(\mathcal{G})$. Based on this lemma, we are able to establish an estimation error bound, as follows:

Theorem 3. For any $\delta > 0$, with probability at least $1-\delta$ (over repeated sampling of data for training $\hat{g}$), we have

$$R(\hat{g}) - R(g^*) \le 4\pi_+\Big(L_\ell + \frac{L_\ell}{C_r}\Big)\mathfrak{R}_n(\mathcal{G}) + 2\pi_+\Big(C_\ell + \frac{C_\ell}{C_r}\Big)\sqrt{\frac{\ln(2/\delta)}{2n}}. \quad (7)$$

Theorem 3 guarantees that learning with (3) is consistent [25]: $n \to \infty$ always means $R(\hat{g}) \to R(g^*)$. Consider linear-in-parameter models defined by

$$\mathcal{G} = \{g(x) = \langle w, \phi(x)\rangle_{\mathcal{H}} \mid \|w\|_{\mathcal{H}} \le C_w,\ \|\phi(x)\|_{\mathcal{H}} \le C_\phi\},$$

where $\mathcal{H}$ is a Hilbert space, $\langle\cdot,\cdot\rangle_{\mathcal{H}}$ is the inner product in $\mathcal{H}$, $w \in \mathcal{H}$ is the normal, $\phi: \mathbb{R}^d \to \mathcal{H}$ is a feature map, and $C_w > 0$ and $C_\phi > 0$ are constants [43]. It is known that $\mathfrak{R}_n(\mathcal{G}) \le C_w C_\phi/\sqrt{n}$ [31], and thus $R(\hat{g}) \to R(g^*)$ in $O_p(1/\sqrt{n})$, where $O_p$ denotes the order in probability. This order is already the optimal parametric rate and cannot be improved without additional strong assumptions on $p(x, y)$, $\ell$, and $\mathcal{G}$ jointly [28]. Additionally, if $\ell$ is strictly convex we have $\hat{g} \to g^*$, and if the aforementioned $\mathcal{G}$ is used, $\hat{g} \to g^*$ in $O_p(1/\sqrt{n})$ [3].

At first glance, learning with (4) is numerically more stable; however, it is generally inconsistent, especially when $g$ is linear in parameters and $\ell$ is strictly convex. Denote by $\hat{R}'(g)$ the objective function of (4) times $\pi_+$, which is unbiased for $R'(g) = \pi_+\mathbb{E}_+\mathbb{E}_{p(y|x)}[\ell(yg(x))]$ rather than $R(g)$. By the same technique used for proving (6) and (7), it is not difficult to show that with probability at least $1-\delta$,

$$\sup_{g\in\mathcal{G}} \big|\hat{R}'(g) - R'(g)\big| \le 4\pi_+ L_\ell \mathfrak{R}_n(\mathcal{G}) + 2\pi_+ C_\ell\sqrt{\frac{\ln(2/\delta)}{2n}},$$

and hence

$$R'(\hat{g}') - R'(g'^*) \le 8\pi_+ L_\ell \mathfrak{R}_n(\mathcal{G}) + 4\pi_+ C_\ell\sqrt{\frac{\ln(2/\delta)}{2n}},$$

where $g'^* = \arg\min_{g\in\mathcal{G}} R'(g)$ and $\hat{g}' = \arg\min_{g\in\mathcal{G}} \hat{R}'(g)$. As a result, when the strict convexity of $R'(g)$ and $\hat{R}'(g)$ is also met, we have $\hat{g}' \to g'^*$. This demonstrates the inconsistency of learning with (4), since $R'(g) \neq R(g)$, which leads to $g'^* \neq g^*$ for any reasonable $\mathcal{G}$.

¹$\mathfrak{R}_n(\mathcal{G}) = \mathbb{E}_X \mathbb{E}_{\sigma_1,\ldots,\sigma_n}\big[\sup_{g\in\mathcal{G}} \frac{1}{n}\sum_{x_i\in X} \sigma_i g(x_i)\big]$, where $\sigma_1, \ldots, \sigma_n$ are $n$ Rademacher variables, following [31].

3.3 Implementation

Finally, we give examples of implementations.
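As a first sketch before the formal model description, the estimator (3) can be minimized directly with a linear model and the logistic loss. The following numpy version uses our own illustrative one-dimensional data-generating assumptions (positives from $N(0,1)$, implied negatives from $N(2,1)$ with equal priors, so $r(x) = 1/(1+e^{2x-2})$); it is not the paper's code:

```python
import numpy as np

rng = np.random.default_rng(0)

def softplus(z):                       # log(1 + e^z), numerically stable
    return np.logaddexp(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30.0, 30.0)))

# Illustrative Pconf data with closed-form confidence r(x) = 1/(1+exp(2x-2)).
x = rng.normal(0.0, 1.0, size=1000)
r = 1.0 / (1.0 + np.exp(2.0 * x - 2.0))
Phi = np.stack([x, np.ones_like(x)], axis=1)   # basis functions: [x, 1]

def pconf_risk(alpha):
    # empirical version of (3) with the logistic loss (no regularization)
    z = Phi @ alpha
    return np.mean(softplus(-z) + (1.0 - r) / r * softplus(z))

alpha = np.zeros(2)
loss0 = pconf_risk(alpha)
for _ in range(500):                   # vanilla full-batch gradient descent
    z = Phi @ alpha
    grad = Phi.T @ (-sigmoid(-z) + (1.0 - r) / r * sigmoid(z)) / len(x)
    alpha -= 0.1 * grad
```

After training, `alpha` approximates a decision boundary between the positive region (around 0) and the implied negative region (around 2), even though no negative sample was ever seen.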
As a classifier $g$, let us consider a linear-in-parameter model $g(x) = \alpha^\top\phi(x)$, where $\top$ denotes the transpose, $\phi(x)$ is a vector of basis functions, and $\alpha$ is a parameter vector. Then from (3), the $\ell_2$-regularized ERM is formulated as

$$\min_\alpha \sum_{i=1}^n \Big[\ell\big(\alpha^\top\phi(x_i)\big) + \frac{1-r_i}{r_i}\,\ell\big(-\alpha^\top\phi(x_i)\big)\Big] + \frac{\lambda}{2}\alpha^\top R\alpha,$$

where $\lambda$ is a non-negative constant and $R$ is a positive semi-definite matrix. In practice, we can use any loss function, such as the squared loss $\ell_S(z) = (z-1)^2$, the hinge loss $\ell_H(z) = \max(0, 1-z)$, and the ramp loss $\ell_R(z) = \min(1, \max(0, 1-z))$. In the experiments in Section 4, we use the logistic loss $\ell_L(z) = \log(1+e^{-z})$, which yields

$$\min_\alpha \sum_{i=1}^n \Big[\log\big(1+e^{-\alpha^\top\phi(x_i)}\big) + \frac{1-r_i}{r_i}\log\big(1+e^{\alpha^\top\phi(x_i)}\big)\Big] + \frac{\lambda}{2}\alpha^\top R\alpha. \quad (8)$$

The above objective function is continuous and differentiable, so optimization can be performed efficiently, for example, by quasi-Newton [35] or stochastic gradient methods [45].

4 Experiments

In this section, we numerically illustrate the behavior of the proposed method on synthetic datasets with linear models. We further demonstrate the usefulness of the proposed method on benchmark datasets with deep neural networks, which are highly nonlinear models. The implementation is based on PyTorch [37], Sklearn [38], and mpmath [20]. Our code will be available at http://github.com/takashiishida/pconf.

4.1 Synthetic experiments with linear models

Setup: We used two-dimensional Gaussian distributions with means $\mu_+$ and $\mu_-$ and covariance matrices $\Sigma_+$ and $\Sigma_-$ for $p(x \mid y=+1)$ and $p(x \mid y=-1)$, respectively. For these parameters, we tried the various combinations visually shown in Figure 2.
The specific parameters used for each setup are:

• Setup A: $\mu_+ = [0, 0]^\top$, $\mu_- = [-2, 5]^\top$, $\Sigma_+ = \begin{bmatrix} 7 & -6 \\ -6 & 7 \end{bmatrix}$, $\Sigma_- = \begin{bmatrix} 2 & 0 \\ 0 & 2 \end{bmatrix}$.
• Setup B: $\mu_+ = [0, 0]^\top$, $\mu_- = [0, 4]^\top$, $\Sigma_+ = \begin{bmatrix} 5 & 3 \\ 3 & 5 \end{bmatrix}$, $\Sigma_- = \begin{bmatrix} 5 & -3 \\ -3 & 5 \end{bmatrix}$.
• Setup C: $\mu_+ = [0, 0]^\top$, $\mu_- = [0, 8]^\top$, $\Sigma_+ = \begin{bmatrix} 7 & -6 \\ -6 & 7 \end{bmatrix}$, $\Sigma_- = \begin{bmatrix} 7 & 6 \\ 6 & 7 \end{bmatrix}$.
• Setup D: $\mu_+ = [0, 0]^\top$, $\mu_- = [0, 4]^\top$, $\Sigma_+ = \begin{bmatrix} 4 & 0 \\ 0 & 4 \end{bmatrix}$, $\Sigma_- = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$.

Figure 2: Illustrations based on a single trial of the four setups used in experiments with various Gaussian distributions. The red and green lines are decision boundaries obtained by Pconf and Weighted classification, respectively, where only positive data with confidence are used (no negative data). The black boundary is obtained by O-SVM, which uses only hard-labeled positive data. The blue boundary is obtained by the fully-supervised method using data from both classes. Histograms of confidence of positive data are shown below.

In the case of using two Gaussian distributions, $p(y=+1 \mid x) > 0$ is satisfied for any $x$ sampled from $p(x)$, which is a necessary condition for applying Theorem 1. 500 positive data and 500 negative data were generated independently from each distribution for training.² Similarly, 1,000 positive and 1,000 negative data were generated for testing. We compared our proposed method (3) with the weighted classification method (4), a regression-based method (which predicts the confidence value itself and post-processes the output into a binary signal by comparing it with 0.5), the one-class support vector machine (O-SVM) [42] with the Gaussian kernel, and a fully-supervised method based on the empirical version of (1).
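The true positive-confidence in these two-Gaussian setups has a closed form in terms of the class-conditional densities. A minimal numpy sketch (Setup-D-style parameters, equal class priors assumed; the helper names are ours):

```python
import numpy as np

def gauss_pdf(x, mu, cov):
    # density of a 2-D Gaussian N(mu, cov) evaluated at each row of x
    d = x - mu
    inv = np.linalg.inv(cov)
    quad = np.einsum("ij,jk,ik->i", d, inv, d)
    norm = 2.0 * np.pi * np.sqrt(np.linalg.det(cov))
    return np.exp(-0.5 * quad) / norm

# Setup-D-style parameters (class priors pi_+ = pi_- = 0.5 assumed)
mu_p, cov_p = np.array([0.0, 0.0]), 4.0 * np.eye(2)
mu_n, cov_n = np.array([0.0, 4.0]), 1.0 * np.eye(2)

def confidence(x):
    # r(x) = pi_+ p(x|y=+1) / (pi_+ p(x|y=+1) + pi_- p(x|y=-1))
    pp = 0.5 * gauss_pdf(x, mu_p, cov_p)
    pn = 0.5 * gauss_pdf(x, mu_n, cov_n)
    return pp / (pp + pn)

pts = np.array([[0.0, 0.0], [0.0, 4.0]])
r = confidence(pts)   # high near mu_p, low near mu_n
```

Each positive training point is paired with `confidence(x)` evaluated at that point, which is exactly the quantity the Pconf learner consumes.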
Note that the proposed method, the weighted method, and the regression-based method use only Pconf data; O-SVM uses only (hard-labeled) positive data; and the fully-supervised method uses both positive and negative data.

In the proposed, weighted, and fully-supervised methods, the linear-in-input model $g(x) = \alpha^\top x + b$ and the logistic loss were commonly used, and vanilla gradient descent with 5,000 epochs (full-batch size) and learning rate 0.001 was used for optimization. For the regression-based method, we used the squared loss and the analytical solution [15]. For the purpose of a clear comparison of the risk, we did not use regularization in this toy experiment. An exception was O-SVM, where the user is required to subjectively pre-specify the regularization parameter $\nu$ and the Gaussian bandwidth $\gamma$. We set them at $\nu = 0.05$ and $\gamma = 0.1$.³

Analysis with true positive-confidence: Our first experiments were conducted when the true positive-confidence was known. The positive-confidence $r(x)$ was analytically computed from the two Gaussian densities and given to each positive data point.

²Negative training data are used only in the fully-supervised method, which is tested for performance comparison.
³If we naively use default parameters in Sklearn [38] instead, which is the usual case in the real world without negative data for validation, the classification accuracy of O-SVM is worse for all setups except D in Table 1, which demonstrates the difficulty of using O-SVM.

Table 1: Comparison of the proposed Pconf classification with other methods, with varying degrees of overlap between the positive and negative distributions. We report the mean and standard deviation of the classification accuracy over 20 trials. We show the best and equivalent methods based on the 5% t-test in bold, excluding the fully-supervised method and O-SVM, whose settings are different from the others.

| Setup | Pconf | Weighted | Regression | O-SVM | Supervised |
|---|---|---|---|---|---|
| A | 89.7 ± 0.6 | 88.7 ± 1.2 | 68.4 ± 6.5 | 76.0 ± 3.5 | 89.8 ± 0.7 |
| B | 81.2 ± 1.1 | 78.1 ± 1.8 | 73.2 ± 3.2 | 71.3 ± 2.3 | 81.4 ± 1.0 |
| C | 90.2 ± 9.1 | 82.7 ± 13.1 | 50.5 ± 1.7 | 90.8 ± 1.2 | 93.6 ± 0.5 |
| D | 91.5 ± 0.5 | 90.8 ± 0.7 | 64.6 ± 5.3 | 57.1 ± 4.8 | 91.4 ± 0.5 |

Table 2: Mean and standard deviation of the classification accuracy with noisy positive-confidence. The experimental setup is the same as in Table 1, except that the positive-confidence scores for positive data are noisy. Std. is the standard deviation of the Gaussian noise.

Setup A
| Std. | Pconf | Weighted |
|---|---|---|
| 0.01 | 89.8 ± 0.6 | 88.8 ± 0.9 |
| 0.05 | 89.7 ± 0.6 | 88.3 ± 1.1 |
| 0.10 | 89.2 ± 0.7 | 87.6 ± 1.4 |
| 0.20 | 85.9 ± 2.5 | 85.8 ± 2.5 |

Setup B
| Std. | Pconf | Weighted |
|---|---|---|
| 0.01 | 81.2 ± 0.9 | 78.2 ± 1.4 |
| 0.05 | 80.7 ± 2.3 | 78.1 ± 1.4 |
| 0.10 | 80.8 ± 1.2 | 77.8 ± 1.5 |
| 0.20 | 77.8 ± 1.4 | 77.2 ± 1.9 |

Setup C
| Std. | Pconf | Weighted |
|---|---|---|
| 0.01 | 92.4 ± 1.7 | 84.0 ± 8.2 |
| 0.05 | 92.2 ± 3.3 | 78.5 ± 11.3 |
| 0.10 | 90.8 ± 9.5 | 72.6 ± 12.9 |
| 0.20 | 88.0 ± 9.5 | 65.5 ± 13.1 |

Setup D
| Std. | Pconf | Weighted |
|---|---|---|
| 0.01 | 91.6 ± 0.5 | 90.6 ± 0.9 |
| 0.05 | 91.5 ± 0.5 | 89.9 ± 1.2 |
| 0.10 | 90.8 ± 0.7 | 88.7 ± 1.8 |
| 0.20 | 87.7 ± 0.8 | 85.5 ± 3.7 |

The results in Table 1 show that the proposed Pconf method is significantly better than the baselines in all cases. In most cases, the proposed Pconf method has accuracy similar to the fully-supervised case, excluding Setup C, where there is a few-percent loss.
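The gap between the Pconf estimator (3) and the naive weighted objective (4) can also be checked with a small Monte-Carlo experiment. The sketch below uses an illustrative one-dimensional two-Gaussian model (our own choice of parameters, not one of the setups above) and a fixed classifier:

```python
import numpy as np

rng = np.random.default_rng(1)
softplus = lambda z: np.logaddexp(0.0, z)      # log(1 + e^z)

# Illustrative model: pi_+ = 0.5, p(x|+1) = N(0,1), p(x|-1) = N(2,1),
# hence r(x) = p(y=+1|x) = 1 / (1 + exp(2x - 2)); fixed classifier g(x) = 1 - x.
pi_p, n = 0.5, 1_000_000
g = lambda x: 1.0 - x
loss = lambda z: softplus(-z)                  # logistic loss l(z) = log(1 + e^{-z})

# (a) direct risk estimate from fully labeled (x, y) pairs
y = rng.choice([1.0, -1.0], size=n)
x = rng.normal(np.where(y > 0, 0.0, 2.0), 1.0)
R_direct = np.mean(loss(y * g(x)))

# (b) Pconf estimator (3) and (c) naive weighting (4), from positives only
xp = rng.normal(0.0, 1.0, size=n)
r = 1.0 / (1.0 + np.exp(2.0 * xp - 2.0))
R_pconf = pi_p * np.mean(loss(g(xp)) + (1.0 - r) / r * loss(-g(xp)))
R_naive = pi_p * np.mean(r * loss(g(xp)) + (1.0 - r) * loss(-g(xp)))
```

At this sample size, `R_pconf` should land close to `R_direct`, while `R_naive` should not, mirroring the consistency discussion in Section 3.2.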
Note that the naive weighted method is consistent if the model is correctly specified, but becomes inconsistent if it is misspecified [48].⁴

Analysis with noisy positive-confidence: In the above toy experiments, we assumed that the true positive-confidence $r(x) = p(y=+1 \mid x)$ is exactly accessible, but this can be unrealistic in practice. To investigate the influence of noise in the positive-confidence, we conducted experiments with noisy positive-confidence. As noisy positive-confidence, we added zero-mean Gaussian noise with standard deviation chosen from {0.01, 0.05, 0.1, 0.2}. As the standard deviation gets larger, more noise is incorporated into the positive-confidence. When the modified positive-confidence was over 1 or below 0.01, we clipped it to 1 or rounded it up to 0.01, respectively.

The results are shown in Table 2. As expected, the performance starts to deteriorate as the confidence becomes noisier (i.e., as the standard deviation of the Gaussian noise grows larger), but the proposed method still works reasonably well in almost all cases.

4.2 Benchmark experiments with neural network models

Here, we use more realistic benchmark datasets and more flexible neural network models for the experiments.

⁴Since our proposed method has the coefficient $(1-r_i)/r_i$ in the second term of (3), it may suffer from numerical problems, e.g., when $r_i$ is extremely small. To investigate this, we used the mpmath package [20] to compute the gradient with arbitrary precision. The experimental results were actually not much different from the ones obtained with single precision, implying that the numerical problems are not very troublesome.

Fashion-MNIST: The Fashion-MNIST dataset⁵ consists of 70,000 examples, where each sample is a 28 × 28 gray-scale image (input dimension is 784) associated with a label from 10 fashion item classes.
We standardized the data to have zero mean and unit variance.

First, we chose “T-shirt/top” as the positive class and another item as the negative class. The binary dataset was then divided into four sub-datasets: a training set, a validation set, a test set, and a dataset for training a probabilistic classifier to estimate positive-confidence. Note that in real-world Pconf classification we would ask labelers for positive-confidence values, but here we obtained positive-confidence values through a probabilistic classifier.

We used logistic regression with the same network architecture as a probabilistic classifier to generate confidence.⁶ However, instead of weight decay, we used dropout [47] with rate 50% after each fully-connected layer, and early stopping with 20 epochs, since the softmax output of flexible neural networks tends to be extremely close to 0 or 1 [14], which is not suitable as a representation of confidence. Furthermore, we rounded up positive-confidence of less than 1% to 1% to stabilize the optimization process.

We compared Pconf classification (3) with weighted classification (4) and fully-supervised classification based on the empirical version of (1). We used the logistic loss for these methods. We also compared our method with an Auto-Encoder [17] as a one-class classification method.

Except for the Auto-Encoder, we used a fully-connected neural network of three hidden layers (d-100-100-100-1) with rectified linear units (ReLU) [32] as the activation functions, and weight-decay candidates were chosen from {10⁻⁷, 10⁻⁴, 10⁻¹}.
Adam [22] was again used for optimization, with 200 epochs and mini-batch size 100.

To select hyper-parameters with validation data, we used the zero-one loss versions of (3) and (4) for Pconf classification and weighted classification, respectively, since no negative data were available in the validation process and thus we could not directly use the classification accuracy. On the other hand, the classification accuracy was directly used for hyper-parameter tuning of the fully-supervised method, which is extremely advantageous. We reported the test accuracy of the model with the best validation score out of all epochs.

The Auto-Encoder was trained with (hard-labeled) positive data, and we classified test data into the positive class if the mean squared error (MSE) was below a threshold set at the 70% quantile, and into the negative class otherwise. Since we have no negative data for validating hyper-parameters, we sorted the MSEs of the training positive data in ascending order to set this threshold. We set the weight decay to 10⁻⁴. The architecture is d-100-100-100-100 for encoding and the reversed version for decoding, with ReLU after the hidden layers and Tanh after the final layer.

CIFAR-10: The CIFAR-10 dataset⁷ consists of 10 classes, with 5,000 images in each class. Each image is given in a 32 × 32 × 3 format. We chose “airplane” as the positive class and one of the other classes as the negative class in order to construct a dataset for binary classification. We used the neural network architecture specified in Appendix B.1. For the probabilistic classifier, the same architecture as that for Fashion-MNIST was used, except that dropout with rate 50% was added after the first two fully-connected layers.
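The validation trick described above (the zero-one loss version of (3), computable from Pconf data alone) can be sketched as follows; the function and variable names are ours:

```python
import numpy as np

def pconf_zero_one_score(g_out, r):
    """Zero-one-loss version of (3) on validation Pconf data.

    g_out: classifier outputs g(x_i) on positive validation points
    r: their positive-confidence values r_i
    Lower is better; no negative data is needed.
    """
    zero_one = lambda z: (z <= 0).astype(float)   # 1 on a non-positive margin
    return np.mean(zero_one(g_out) + (1.0 - r) / r * zero_one(-g_out))

# toy check: a classifier that accepts the positive points scores better
g_good = np.array([1.0, 2.0])    # positive outputs on two positive points
g_bad = np.array([-1.0, -2.0])   # same points rejected
conf = np.array([0.9, 0.8])
```

Because the score is an estimate of the zero-one risk up to the constant $\pi_+$, ranking hyper-parameter candidates by it is a reasonable proxy for ranking them by (unavailable) validation accuracy.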
For the Auto-Encoder, the MSE threshold was set at the 80% quantile, and we used the architecture specified in Appendix B.2. Other details, such as the loss function and weight decay, follow the same setup as the Fashion-MNIST experiments.

Results: The results in Tables 3 and 4 show that in most cases, Pconf classification either outperforms or is comparable to the weighted classification baseline, outperforms the Auto-Encoder, and is even comparable to the fully-supervised method in some cases.

⁵https://github.com/zalandoresearch/fashion-mnist
⁶Both positive and negative data are used to train the probabilistic classifier to estimate confidence, and this data is separated from any other process of the experiments.
⁷https://www.cs.toronto.edu/~kriz/cifar.html

Table 3: Mean and standard deviation of the classification accuracy over 20 trials for the Fashion-MNIST dataset with fully-connected three-hidden-layer neural networks. Pconf classification was compared with the baseline Weighted classification method, the Auto-Encoder method, and the fully-supervised method, with T-shirt as the positive class and different choices for the negative class.
The best and equivalent methods are shown in bold based on the 5% t-test, excluding the Auto-Encoder method and the fully-supervised method.

| P / N | Pconf | Weighted | Auto-Encoder | Supervised |
|---|---|---|---|---|
| T-shirt / trouser | 92.14 ± 4.06 | 85.30 ± 9.07 | 71.06 ± 1.00 | 98.98 ± 0.16 |
| T-shirt / pullover | 96.00 ± 0.29 | 96.08 ± 1.05 | 70.27 ± 1.22 | 96.17 ± 0.34 |
| T-shirt / dress | 91.52 ± 1.14 | 89.31 ± 1.08 | 53.82 ± 0.93 | 96.56 ± 0.34 |
| T-shirt / coat | 98.12 ± 0.33 | 98.13 ± 1.12 | 68.74 ± 0.98 | 98.44 ± 0.13 |
| T-shirt / sandal | 99.55 ± 0.22 | 87.83 ± 18.79 | 82.02 ± 0.49 | 99.93 ± 0.09 |
| T-shirt / shirt | 83.70 ± 0.46 | 83.60 ± 0.65 | 57.76 ± 0.55 | 85.57 ± 0.69 |
| T-shirt / sneaker | 89.86 ± 13.32 | 58.26 ± 14.27 | 83.70 ± 0.26 | 100.00 ± 0.00 |
| T-shirt / bag | 97.56 ± 0.99 | 95.34 ± 1.00 | 82.79 ± 0.70 | 99.02 ± 0.29 |
| T-shirt / ankle boot | 98.84 ± 1.43 | 88.87 ± 7.86 | 85.07 ± 0.37 | 99.76 ± 0.07 |

Table 4: Mean and standard deviation of the classification accuracy over 20 trials for the CIFAR-10 dataset with convolutional neural networks. Pconf classification was compared with the baseline Weighted classification method, the Auto-Encoder method, and the fully-supervised method, with airplane as the positive class and different choices for the negative class.
The best and equivalent methods are shown in bold based on the 5% t-test, excluding the Auto-Encoder method and the fully-supervised method.

| P / N | Pconf | Weighted | Auto-Encoder | Supervised |
|---|---|---|---|---|
| airplane / automobile | 82.68 ± 1.89 | 76.21 ± 2.43 | 75.13 ± 0.42 | 93.96 ± 0.58 |
| airplane / bird | 82.23 ± 1.21 | 80.66 ± 1.60 | 54.83 ± 0.39 | 87.76 ± 4.97 |
| airplane / cat | 85.18 ± 1.35 | 89.60 ± 0.92 | 61.03 ± 0.59 | 92.90 ± 0.58 |
| airplane / deer | 87.68 ± 1.36 | 87.24 ± 1.58 | 55.60 ± 0.53 | 93.35 ± 0.77 |
| airplane / dog | 89.91 ± 0.85 | 89.08 ± 1.95 | 62.64 ± 0.63 | 94.61 ± 0.45 |
| airplane / frog | 90.80 ± 0.98 | 81.84 ± 3.92 | 62.52 ± 0.68 | 95.95 ± 0.40 |
| airplane / horse | 89.82 ± 1.07 | 85.10 ± 2.61 | 67.55 ± 0.73 | 95.65 ± 0.37 |
| airplane / ship | 69.71 ± 2.37 | 70.68 ± 1.45 | 52.09 ± 0.42 | 81.45 ± 8.87 |
| airplane / truck | 81.76 ± 2.09 | 86.74 ± 0.85 | 73.74 ± 0.38 | 92.10 ± 0.82 |

5 Conclusion

We proposed a novel problem setting and algorithm for binary classification from positive data equipped with confidence. Our key contribution was to show that an unbiased estimator of the classification risk can be obtained for positive-confidence data, without negative data or even unlabeled data. This was achieved by reformulating the classification risk, which is based on both positive and negative data, into an equivalent expression that requires only positive-confidence data. Theoretically, we established an estimation error bound, and experimentally we demonstrated the usefulness of our algorithm.

Acknowledgments

TI was supported by Sumitomo Mitsui Asset Management. MS was supported by JST CREST JPMJCR1403. We thank Ikko Yamane and Tomoya Sakai for the helpful discussions.
We also thank\nanonymous reviewers for pointing out numerical issues in our experiments, and for pointing out the\nnecessary condition in Theorem 1 in our earlier work of this paper.\n\nReferences\n[1] H. Bao, G. Niu, and M. Sugiyama. Classi\ufb01cation from pairwise similarity and unlabeled data.\n\nIn ICML, 2018.\n\n[2] G. Blanchard, G. Lee, and C. Scott. Semi-supervised novelty detection. Journal of Machine\n\nLearning Research, 11:2973\u20133009, 2010.\n\n[3] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.\n\n[4] M. M. Breunig, H. P. Kriegel, R. T. Ng, and J. Sander. Lof: Identifying density-based local\n\noutliers. In ACM SIGMOD, 2000.\n\n[5] V. Chandola, A. Banerjee, and V. Kumar. Anomaly detection: A survey. ACM Computing\n\nSurveys, 41(3), 2009.\n\n[6] O. Chapelle, B. Sch\u00f6lkopf, and A. Zien, editors. Semi-Supervised Learning. MIT Press, 2006.\n\n[7] M. C. du Plessis, G. Niu, and M. Sugiyama. Clustering unclustered data: Unsupervised binary\n\nlabeling of two datasets having different class balances. In TAAI, 2013.\n\n[8] M. C. du Plessis, G. Niu, and M. Sugiyama. Analysis of learning from positive and unlabeled\n\ndata. In NIPS, 2014.\n\n[9] M. C. du Plessis, G. Niu, and M. Sugiyama. Convex formulation for learning from positive and\n\nunlabeled data. In ICML, 2015.\n\n[10] M. C. du Plessis, G. Niu, and M. Sugiyama. Class-prior estimation for learning from positive\n\nand unlabeled data. Machine Learning, 106(4):463\u2013492, 2017.\n\n[11] M. C. du Plessis and M. Sugiyama. Class prior estimation from positive and unlabeled data.\n\nIEICE Transactions on Information and Systems, E97-D(5):1358\u20131362, 2014.\n\n[12] C. Elkan and K. Noto. Learning classi\ufb01ers from only positive and unlabeled data. In KDD,\n\n2008.\n\n[13] G. S. Fishman. Monte Carlo: Concepts, Algorithms, and Applications. Springer-Verlag, 1996.\n\n[14] I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT Press, 2016.\n\n[15] T. 
Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining,\n\nInference, and Prediction. Springer, 2009.\n\n[16] S. Hido, Y. Tsuboi, H. Kashima, M. Sugiyama, and T. Kanamori. Inlier-based outlier detection\n\nvia direct density ratio estimation. In ICDM, 2008.\n\n[17] G. E. Hinton and R. R. Salakhutdinov. Reducing the dimensionality of data with neural networks.\n\nScience, 313(5786):504\u2013507, 2006.\n\n[18] T. Ishida, G. Niu, W. Hu, and M. Sugiyama. Learning from complementary labels. In NIPS,\n\n2017.\n\n[19] T. Ishida, G. Niu, A. K. Menon, and M. Sugiyama. Complementary-label learning for arbitrary\n\nlosses and models. arXiv preprint arXiv:1810.04327, 2018.\n\n[20] F. Johansson et al. mpmath: a Python library for arbitrary-precision \ufb02oating-point arithmetic\n\n(version 0.18), December 2013. http://mpmath.org/.\n\n10\n\n\f[21] S. S. Khan and M. G. Madden. A survey of recent trends in one class classi\ufb01cation. In Irish\n\nConference on Arti\ufb01cial Intelligence and Cognitive Science, 2009.\n\n[22] D. P. Kingma and J. L. Ba. Adam: A method for stochastic optimization. In ICLR, 2015.\n[23] T. N. Kipf and M. Welling. Semi-supervised classi\ufb01cation with graph convolutional networks.\n\nIn ICLR, 2017.\n\n[24] R. Kiryo, G. Niu, M. C. du Plessis, and M. Sugiyama. Positive-unlabeled learning with\n\nnon-negative risk estimator. In NIPS, 2017.\n\n[25] M. Ledoux and M. Talagrand. Probability in Banach Spaces: Isoperimetry and Processes.\n\nSpringer, 1991.\n\n[26] N. Lu, G. Niu, A. K. Menon, and M. Sugiyama. On the minimal supervision for training any\n\nbinary classi\ufb01er from only unlabeled data. arXiv preprint arXiv:1808.10585v2, 2018.\n\n[27] C. McDiarmid. On the method of bounded differences. In J. Siemons, editor, Surveys in\n\nCombinatorics, pages 148\u2013188. Cambridge University Press, 1989.\n\n[28] S. Mendelson. Lower bounds for the empirical minimization algorithm. 
IEEE Transactions on\n\nInformation Theory, 54(8):3797\u20133803, 2008.\n\n[29] A. Menon, B. Van Rooyen, C. S. Ong, and B. Williamson. Learning from corrupted binary\n\nlabels via class-probability estimation. In ICML, 2015.\n\n[30] T. Miyato, S. Maeda, M. Koyama, K. Nakae, and S. Ishii. Distributional smoothing with virtual\n\nadversarial training. In ICLR, 2016.\n\n[31] M. Mohri, A. Rostamizadeh, and A. Talwalkar. Foundations of Machine Learning. MIT Press,\n\n2012.\n\n[32] V. Nair and G.E. Hinton. Recti\ufb01ed linear units improve restricted boltzmann machines. In\n\nICML, 2010.\n\n[33] N. Natarajan, I. S. Dhillon, P. Ravikumar, and A. Tewari. Learning with noisy labels. In NIPS,\n\n2013.\n\n[34] G. Niu, M. C. du Plessis, T. Sakai, Y. Ma, and M. Sugiyama. Theoretical comparisons of\n\npositive-unlabeled learning against positive-negative learning. In NIPS, 2016.\n\n[35] J. Nocedal and S. Wright. Numerical Optimization. Springer, 2006.\n[36] A. Oliver, A. Odena, C. Raffel, E. D. Cubuk, and I. J. Goodfellow. Realistic evaluation of deep\n\nsemi-supervised learning algorithms. In NeurIPS, 2018.\n\n[37] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison,\nL. Antiga, and A. Lerer. Automatic differentiation in pytorch. In Autodiff Workshop in NIPS,\n2017.\n\n[38] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel,\nP. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher,\nM. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine\nLearning Research, 12:2825\u20132830, 2011.\n\n[39] N. Quadrianto, A. Smola, T. Caetano, and Q. Le. Estimating labels from label proportions. In\n\nICML, 2008.\n\n[40] T. Sakai, M. C. du Plessis, G. Niu, and M. Sugiyama. Semi-supervised classi\ufb01cation based on\n\nclassi\ufb01cation from positive and unlabeled data. In ICML, 2017.\n\n[41] T. Sakai, G. Niu, and M. Sugiyama. 
Semi-supervised AUC optimization based on positive-\n\nunlabeled learning. Machine Learning, 107(4):767\u2013794, 2018.\n\n[42] B. Sch\u00f6lkopf, J. C. Platt, J Shawe-Taylor, A. J. Smola, and R. C. Williamson. Estimating the\n\nsupport of a high-dimensional distribution. Neural Computation, 13:1443\u20131471, 2001.\n\n11\n\n\f[43] B. Sch\u00f6lkopf and A. Smola. Learning with Kernels. MIT Press, 2001.\n[44] C. Scott and G. Blanchard. Novelty detection: Unlabeled data de\ufb01nitely help. In AISTATS,\n\n2009.\n\n[45] S. Shalev-Shwartz and S. Ben-David. Understanding Machine Learning: From Theory to\n\nAlgorithms. Cambridge University Press, 2014.\n\n[46] A. Smola, L. Song, and C. H. Teo. Relative novelty detection. In AISTATS, 2009.\n[47] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A\nsimple way to prevent neural networks from over\ufb01tting. Journal of Machine Learning Research,\n15:1929\u20131958, 2014.\n\n[48] M. Sugiyama and M. Kawanabe. Machine Learning in Non-Stationary Environments: Introduc-\n\ntion to Covariate Shift Adaptation. MIT Press, 2012.\n\n[49] M. Sugiyama, G. Niu, M. Yamada, M. Kimura, and H. Hachiya. Information-maximization\nclustering based on squared-loss mutual information. Neural Computation, 26(1):84\u2013131, 2014.\n[50] D. Tax and R. Duin. Support vextor domain description. In Pattern Recognition Letters, 1999.\n[51] D. M. J. Tax and R. P. W. Duin. Support vector data description. Machine Learning, 54(1):45\u201366,\n\n2004.\n\n[52] V. N. Vapnik. Statistical Learning Theory. John Wiley and Sons, 1998.\n[53] Z. Yang, W. W. Cohen, and R. Salakhutdinov. Revisiting semi-supervised learning with graph\n\nembeddings. In ICML, 2016.\n\n[54] F. X. Yu, D. Liu, S. Kumar, T. Jebara, and S.-F. Chang. /svm for learning with label proportions.\n\nIn ICML, 2013.\n\n[55] X. Yu, T. Liu, M. Gong, and D. Tao. Learning with biased complementary labels. 
In ECCV,\n\n2018.\n\n12\n\n\f", "award": [], "sourceid": 2861, "authors": [{"given_name": "Takashi", "family_name": "Ishida", "institution": "The University of Tokyo, RIKEN, SMAM"}, {"given_name": "Gang", "family_name": "Niu", "institution": "RIKEN"}, {"given_name": "Masashi", "family_name": "Sugiyama", "institution": "RIKEN / University of Tokyo"}]}