{"title": "Analysis of Learning from Positive and Unlabeled Data", "book": "Advances in Neural Information Processing Systems", "page_first": 703, "page_last": 711, "abstract": "Learning a classifier from positive and unlabeled data is an important class of classification problems that are conceivable in many practical applications. In this paper, we first show that this problem can be solved by cost-sensitive learning between positive and unlabeled data. We then show that convex surrogate loss functions such as the hinge loss may lead to a wrong classification boundary due to an intrinsic bias, but the problem can be avoided by using non-convex loss functions such as the ramp loss. We next analyze the excess risk when the class prior is estimated from data, and show that the classification accuracy is not sensitive to class prior estimation if the unlabeled data is dominated by the positive data (this is naturally satisfied in inlier-based outlier detection because inliers are dominant in the unlabeled dataset). Finally, we provide generalization error bounds and show that, for an equal number of labeled and unlabeled samples, the generalization error of learning only from positive and unlabeled samples is no worse than $2\\sqrt{2}$ times the fully supervised case. These theoretical findings are also validated through experiments.", "full_text": "Analysis of Learning from\nPositive and Unlabeled Data\n\nMarthinus C. du Plessis\nThe University of Tokyo\nTokyo, 113-0033, Japan\n\nchristo@ms.k.u-tokyo.ac.jp\n\nGang Niu\nBaidu Inc.\n\nBeijing, 100085, China\nniugang@baidu.com\n\nMasashi Sugiyama\n\nThe University of Tokyo\nTokyo, 113-0033, Japan\n\nsugi@k.u-tokyo.ac.jp\n\nAbstract\n\nLearning a classi\ufb01er from positive and unlabeled data is an important class of\nclassi\ufb01cation problems that are conceivable in many practical applications. 
In this\npaper, we \ufb01rst show that this problem can be solved by cost-sensitive learning\nbetween positive and unlabeled data. We then show that convex surrogate loss\nfunctions such as the hinge loss may lead to a wrong classi\ufb01cation boundary due\nto an intrinsic bias, but the problem can be avoided by using non-convex loss func-\ntions such as the ramp loss. We next analyze the excess risk when the class prior\nis estimated from data, and show that the classi\ufb01cation accuracy is not sensitive to\nclass prior estimation if the unlabeled data is dominated by the positive data (this\nis naturally satis\ufb01ed in inlier-based outlier detection because inliers are dominant\nin the unlabeled dataset). Finally, we provide generalization error bounds and\nshow that, for an equal number of labeled and unlabeled samples, the generaliza-\np\ntion error of learning only from positive and unlabeled samples is no worse than\n2 times the fully supervised case. These theoretical \ufb01ndings are also validated\n2\nthrough experiments.\n\n1 Introduction\n\nLet us consider the problem of learning a classi\ufb01er from positive and unlabeled data (PU classi\ufb01ca-\ntion), which is aimed at assigning labels to the unlabeled dataset [1]. PU classi\ufb01cation is conceivable\nin various applications such as land-cover classi\ufb01cation [2], where positive samples (built-up urban\nareas) can be easily obtained, but negative samples (rural areas) are too diverse to be labeled. Outlier\ndetection in unlabeled data based on inlier data can also be regarded as PU classi\ufb01cation [3, 4].\nIn this paper, we \ufb01rst explain that, if the class prior in the unlabeled dataset is known, PU classi\ufb01ca-\ntion can be reduced to the problem of cost-sensitive classi\ufb01cation [5] between positive and unlabeled\ndata. 
Thus, in principle, the PU classification problem can be solved by a standard cost-sensitive classifier such as the weighted support vector machine [6]. The goal of this paper is to give new insight into this PU classification algorithm. Our contributions are threefold:\n\n\u2022 The use of convex surrogate loss functions such as the hinge loss may potentially lead to a wrong classification boundary being selected, even when the underlying classes are completely separable. To obtain the correct classification boundary, the use of non-convex loss functions such as the ramp loss is essential.\n\n\u2022 When the class prior in the unlabeled dataset is estimated from data, the classification error is governed by what we call the effective class prior, which depends both on the true class prior and the estimated class prior. In addition to gaining intuition behind the classification error incurred in PU classification, a practical outcome of this analysis is that the classification error is not sensitive to class-prior estimation error if the unlabeled data is dominated by positive data. This would be useful in, e.g., inlier-based outlier detection scenarios where inlier samples are dominant in the unlabeled dataset [3, 4]. This analysis can be regarded as an extension of the traditional analysis of class priors in ordinary classification scenarios [7, 8] to PU classification.\n\n\u2022 We establish generalization error bounds for PU classification. For an equal number of positive and unlabeled samples, the convergence rate is no worse than 2\u221a2 times the fully supervised case.\n\nFinally, we numerically illustrate the above theoretical findings through experiments.\n\n2 PU classification as cost-sensitive classification\n\nIn this section, we show that the problem of PU classification can be cast as cost-sensitive classification.\n\nOrdinary classification: The Bayes optimal classifier corresponds to the decision function f(X) \u2208 {1, \u22121} that minimizes the expected misclassification rate w.r.t. a class prior \u03c0:\n\nR(f) := \u03c0R1(f) + (1 \u2212 \u03c0)R\u22121(f),\n\nwhere R\u22121(f) and R1(f) denote the expected false positive rate and the expected false negative rate:\n\nR\u22121(f) = P\u22121(f(X) \u2260 \u22121) and R1(f) = P1(f(X) \u2260 1),\n\nand P1 and P\u22121 denote the marginal probabilities of positive and negative samples. In the empirical risk minimization framework, the above risk is replaced with its empirical version obtained from fully labeled data, leading to practical classifiers [9].\n\nCost-sensitive classification: A cost-sensitive classifier selects a function f(X) \u2208 {1, \u22121} in order to minimize the weighted expected misclassification rate:\n\nR(f) := \u03c0c1R1(f) + (1 \u2212 \u03c0)c\u22121R\u22121(f),   (1)\n\nwhere c1 and c\u22121 are the per-class costs [5]. Since scaling does not matter in (1), it is often useful to interpret the per-class costs as reweighting the problem according to new class priors proportional to \u03c0c1 and (1 \u2212 \u03c0)c\u22121.\n\nPU classification: In PU classification, a classifier is learned using labeled data drawn from the positive class P1 and unlabeled data that is a mixture of positive and negative samples with unknown class prior \u03c0:\n\nPX = \u03c0P1 + (1 \u2212 \u03c0)P\u22121.\n\nSince negative samples are not available, let us train a classifier to minimize the expected misclassification rate between positive and unlabeled samples. Since we do not have negative samples in the PU classification setup, we cannot directly estimate R\u22121(f), and thus we rewrite the risk R(f) so that it does not include R\u22121(f). More specifically, let RX(f) be the probability that the function f(X) gives the positive label over PX [10]:\n\nRX(f) = PX(f(X) = 1)\n= \u03c0P1(f(X) = 1) + (1 \u2212 \u03c0)P\u22121(f(X) = 1)\n= \u03c0(1 \u2212 R1(f)) + (1 \u2212 \u03c0)R\u22121(f).   (2)\n\nThen the risk R(f) can be written as\n\nR(f) = \u03c0R1(f) + (1 \u2212 \u03c0)R\u22121(f)\n= \u03c0R1(f) \u2212 \u03c0(1 \u2212 R1(f)) + RX(f)\n= 2\u03c0R1(f) + RX(f) \u2212 \u03c0.   (3)\n\nLet \u03b7 be the proportion of samples from P1 compared to PX, which is empirically estimated by \u03b7 = n/(n + n\u2032), where n and n\u2032 denote the numbers of positive and unlabeled samples, respectively. The risk R(f) can then be expressed as\n\nR(f) = c1\u03b7R1(f) + cX(1 \u2212 \u03b7)RX(f) \u2212 \u03c0, where c1 = 2\u03c0/\u03b7 and cX = 1/(1 \u2212 \u03b7).\n\nComparing this expression with (1), we can confirm that the PU classification problem is solved by cost-sensitive classification between positive and unlabeled data with costs c1 and cX. Some implementations of support vector machines, such as libsvm [6], allow for assigning weights to classes. 
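The cost computation above is easy to make concrete. Below is a minimal sketch (ours, not the authors' code; function names are our own) that turns a class prior \u03c0 and the sample counts n, n\u2032 into the costs c1 = 2\u03c0/\u03b7 and cX = 1/(1 \u2212 \u03b7), and into per-sample weights that could be passed to a weighted classifier:

```python
# Sketch: per-class costs for PU classification as cost-sensitive learning.
# Follows the reduction R(f) = c1*eta*R1(f) + cX*(1-eta)*RX(f) - pi from the text.

def pu_costs(pi, n, n_unlabeled):
    """Return (c1, cX) for n positive and n_unlabeled unlabeled samples."""
    eta = n / (n + n_unlabeled)      # proportion of positive samples among all samples
    c1 = 2.0 * pi / eta              # cost on the positive (labeled) class
    cX = 1.0 / (1.0 - eta)           # cost on the unlabeled "class"
    return c1, cX

def pu_sample_weights(pi, n, n_unlabeled):
    """Per-sample weights: c1 for each positive sample, cX for each unlabeled one."""
    c1, cX = pu_costs(pi, n, n_unlabeled)
    return [c1] * n + [cX] * n_unlabeled

c1, cX = pu_costs(pi=0.3, n=100, n_unlabeled=300)
print(c1, cX)  # eta = 0.25, so c1 = 2.4 and cX = 4/3
```

With a libsvm-style interface these values would be supplied as class weights; note that scaling both costs by a common factor leaves the minimizer unchanged, since scaling does not matter in (1).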
In practice, the unknown class prior \u03c0 may be estimated by the methods proposed in [10, 1, 11]. In the following sections, we analyze this algorithm.\n\n3 Necessity of non-convex loss functions in PU classification\n\nIn this section, we show that solving the PU classification problem with a convex loss function may lead to a biased solution, and that the use of a non-convex loss function is essential to avoid this problem.\n\nLoss functions in ordinary classification: We first consider ordinary classification problems where samples from both classes are available. Instead of a binary decision function f(X) \u2208 {\u22121, 1}, a continuous decision function g(X) \u2208 R such that sign(g(X)) = f(X) is learned. The objective function then becomes\n\nJ0-1(g) = \u03c0E1[\u21130-1(g(X))] + (1 \u2212 \u03c0)E\u22121[\u21130-1(\u2212g(X))],\n\nwhere Ey is the expectation over Py and \u21130-1(z) is the zero-one loss:\n\n\u21130-1(z) = 0 if z > 0, and 1 if z \u2264 0.\n\nSince the zero-one loss is hard to optimize in practice due to its discontinuous nature, it may be replaced with the ramp loss (as illustrated in Figure 1):\n\n\u2113R(z) = (1/2) max(0, min(2, 1 \u2212 z)),\n\ngiving an objective function of\n\nJR(g) = \u03c0E1[\u2113R(g(X))] + (1 \u2212 \u03c0)E\u22121[\u2113R(\u2212g(X))].   (4)\n\nTo avoid the non-convexity of the ramp loss, the hinge loss is often preferred in practice:\n\n\u2113H(z) = (1/2) max(1 \u2212 z, 0),\n\ngiving an objective of\n\nJH(g) = \u03c0E1[\u2113H(g(X))] + (1 \u2212 \u03c0)E\u22121[\u2113H(\u2212g(X))].   (5)\n\nOne practical motivation to use the convex hinge loss instead of the non-convex ramp loss is that separability (i.e., min_g JR(g) = 0) implies \u2113R(z) = 0 everywhere, and for all values of z for which \u2113R(z) = 0, we have \u2113H(z) = 0. Therefore, the convex hinge loss will give the same decision boundary as the non-convex ramp loss in the ordinary classification setup, under the assumption that the positive and negative samples are non-overlapping.\n\nFigure 1: \u2113R(z) = (1/2) max(0, min(2, 1 \u2212 z)) denotes the ramp loss, and \u2113H(z) = (1/2) max(0, 1 \u2212 z) denotes the hinge loss. (a) Loss functions. (b) Resulting penalties: \u2113R(z) + \u2113R(\u2212z) is constant, but \u2113H(z) + \u2113H(\u2212z) is not and therefore causes a superfluous penalty.\n\nRamp loss function in PU classification: An important question is whether the same interpretation holds for PU classification: can the PU classification problem be solved by using the convex hinge loss? As we show below, the answer to this question is unfortunately \u201cno\u201d. In PU classification, the risk is given by (3), and its ramp-loss version is given by\n\nJPU-R(g) = 2\u03c0R1(f) + RX(f) \u2212 \u03c0   (6)\n= 2\u03c0E1[\u2113R(g(X))] + [\u03c0E1[\u2113R(\u2212g(X))] + (1 \u2212 \u03c0)E\u22121[\u2113R(\u2212g(X))]] \u2212 \u03c0   (7)\n= \u03c0E1[\u2113R(g(X))] + \u03c0E1[\u2113R(g(X)) + \u2113R(\u2212g(X))] + (1 \u2212 \u03c0)E\u22121[\u2113R(\u2212g(X))] \u2212 \u03c0,   (8)\n\nwhere (6) comes from (3) and (7) is due to the substitution of (2). Since the ramp loss is symmetric in the sense of\n\n\u2113R(\u2212z) + \u2113R(z) = 1,\n\n(8) yields\n\nJPU-R(g) = \u03c0E1[\u2113R(g(X))] + (1 \u2212 \u03c0)E\u22121[\u2113R(\u2212g(X))].   (9)\n\n(9) is essentially the same as (4), meaning that learning with the ramp loss in the PU classification setting will give the same classification boundary as in the ordinary classification setting. For non-convex optimization with the ramp loss, see [12, 13].\n\nHinge loss function in PU classification: On the other hand, using the hinge loss to minimize (3) for PU learning gives\n\nJPU-H(g) = 2\u03c0E1[\u2113H(g(X))] + [\u03c0E1[\u2113H(\u2212g(X))] + (1 \u2212 \u03c0)E\u22121[\u2113H(\u2212g(X))]] \u2212 \u03c0\n= \u03c0E1[\u2113H(g(X))] + (1 \u2212 \u03c0)E\u22121[\u2113H(\u2212g(X))] + \u03c0E1[\u2113H(g(X)) + \u2113H(\u2212g(X))] \u2212 \u03c0,   (10)\n\nwhere the first two terms form the ordinary error term, cf. (5), and the third term is a superfluous penalty (see also Figure 1). This penalty term may cause an incorrect classification boundary to be selected. Indeed, even if g(X) perfectly separates the data, it may not minimize JPU-H(g) due to the superfluous penalty. To obtain the correct decision boundary, the loss function should be symmetric (and therefore non-convex). Alternatively, since the superfluous penalty term can be evaluated, it can be subtracted from the objective function. 
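The symmetry condition \u2113R(\u2212z) + \u2113R(z) = 1, and its failure for the hinge loss, can be checked numerically. The short sketch below (ours) evaluates both losses and shows that the hinge excess \u2113H(z) + \u2113H(\u2212z) \u2212 1, which drives the superfluous penalty in (10), grows with the margin |z|:

```python
# Sketch: symmetry of the ramp loss vs. asymmetry of the hinge loss.
# ramp(z) + ramp(-z) = 1 for every z, while hinge(z) + hinge(-z) = 1 only for
# |z| <= 1 and equals (1 + |z|)/2 beyond that.

def ramp(z):
    return 0.5 * max(0.0, min(2.0, 1.0 - z))

def hinge(z):
    return 0.5 * max(1.0 - z, 0.0)

for z in [-3.0, -1.0, -0.2, 0.0, 0.7, 1.0, 2.5]:
    assert abs(ramp(z) + ramp(-z) - 1.0) < 1e-12  # symmetric: no extra penalty

def superfluous(z):
    """hinge(z) + hinge(-z) - 1: the per-sample excess penalty of the hinge loss."""
    return hinge(z) + hinge(-z) - 1.0

print(superfluous(0.5))  # 0.0  (inside the margin, the hinge pair also sums to 1)
print(superfluous(3.0))  # 1.0  ((1 + 3)/2 - 1: grows linearly with the margin)
```

Because this excess term is weighted by \u03c0 in (10), a large-margin separating g is penalized more heavily as the class prior grows, which is exactly the failure mode illustrated next.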
Note that, for the problem of label noise, an identical symmetry condition has been obtained [14].\n\nIllustration: We illustrate the failure of the hinge loss on a toy PU classification problem with class-conditional densities\n\np(x|y = 1) = N(\u22123, 1\u00b2) and p(x|y = \u22121) = N(3, 1\u00b2),\n\nwhere N(\u03bc, \u03c3\u00b2) is a normal distribution with mean \u03bc and variance \u03c3\u00b2. The hinge-loss objective function for PU classification, JPU-H(g), is minimized with a model of g(x) = wx + b (the expectations in the objective function are computed via numerical integration). The optimal decision threshold and the threshold for the hinge loss are plotted in Figure 2(b) for a range of class priors. Note that the threshold for the ramp loss corresponds to the optimal threshold. From this figure, we note that the hinge-loss threshold differs from the optimal threshold. The difference is especially severe for larger class priors, due to the fact that the superfluous penalty is weighted by the class prior. When the class prior is large enough, the large hinge-loss threshold causes all samples to be positively labeled. In such a case, the false negative rate is R1 = 0 but the false positive rate is R\u22121 = 1. Therefore, the overall misclassification rate for the hinge loss will be 1 \u2212 \u03c0.\n\nFigure 2: Illustration of the failure of the hinge loss for PU classification. (a) Class-conditional densities of the problem. (b) Optimal threshold and threshold using the hinge loss. (c) Misclassification rate for the optimal and hinge-loss cases. The optimal threshold and the threshold estimated by the hinge loss differ significantly (Figure 2(b)), causing a difference in the misclassification rates (Figure 2(c)). The threshold for the ramp loss agrees with the optimal threshold.\n\n4 Effect of inaccurate class-prior estimation\n\nTo solve the PU classification problem by the cost-sensitive learning described in Section 2, the true class prior \u03c0 is needed. However, since it is often unknown in practice, it needs to be estimated, e.g., by the methods proposed in [10, 1, 11]. Since many of the estimation methods are biased [1, 11], it is important to understand the influence of inaccurate class-prior estimation on the classification performance. In this section, we elucidate how the error in the estimated class prior \u03c0\u0302 affects the classification accuracy in the PU classification setting.\n\nRisk with true class prior in ordinary classification: In ordinary classification scenarios with positive and negative samples, the risk for a classifier f on a dataset with class prior \u03c0 is given as follows ([8, pp. 26\u201329] and [7]):\n\nR(f, \u03c0) = \u03c0R1(f) + (1 \u2212 \u03c0)R\u22121(f).\n\nThe risk for the optimal classifier according to the class prior \u03c0 is therefore\n\nR*(\u03c0) = min_{f \u2208 F} R(f, \u03c0).\n\nNote that R*(\u03c0) is concave, since it is the minimum of a set of functions that are linear w.r.t. \u03c0. This is illustrated in Figure 3(a).\n\nExcess risk with class prior estimation in ordinary classification: Suppose we have a classifier f\u0302 that minimizes the risk for an estimated class prior \u03c0\u0302:\n\nf\u0302 := argmin_{f \u2208 F} R(f, \u03c0\u0302).\n\nThe risk when applying the classifier f\u0302 on a dataset with true class prior \u03c0 is then on the line tangent to the concave function R*(\u03c0) at \u03c0 = \u03c0\u0302, as illustrated in Figure 3(a):\n\nR\u0302(\u03c0) = \u03c0R1(f\u0302) + (1 \u2212 \u03c0)R\u22121(f\u0302).\n\nThe function f\u0302 is suboptimal at \u03c0, and results in the excess risk [8]:\n\nE\u03c0 = R\u0302(\u03c0) \u2212 R*(\u03c0).\n\nFigure 3: Learning in the PU framework with an estimated class prior \u03c0\u0302 is equivalent to selecting a classifier which minimizes the risk according to an effective class prior \u03c0\u0303. (a) Selecting a classifier to minimize (11) and applying it to a dataset with class prior \u03c0 leads to an excess risk of E\u03c0; the difference between the effective class prior \u03c0\u0303 and the true class prior \u03c0 causes this excess risk. (b) The effective class prior \u03c0\u0303 vs. the estimated class prior \u03c0\u0302 for different true class priors \u03c0.\n\nExcess risk with class prior estimation in PU classification: We wish to select a classifier that minimizes the risk in (3). In practice, however, we only know an estimated class prior \u03c0\u0302. 
Therefore, a classifier is selected to minimize\n\nR(f) = 2\u03c0\u0302R1(f) + RX(f) \u2212 \u03c0\u0302.   (11)\n\nExpanding the above risk based on (2) gives\n\nR(f) = 2\u03c0\u0302R1(f) + \u03c0(1 \u2212 R1(f)) + (1 \u2212 \u03c0)R\u22121(f) \u2212 \u03c0\u0302\n= (2\u03c0\u0302 \u2212 \u03c0)R1(f) + (1 \u2212 \u03c0)R\u22121(f) + \u03c0 \u2212 \u03c0\u0302.\n\nThus, the estimated class prior affects the risk with respect to 2\u03c0\u0302 \u2212 \u03c0 and 1 \u2212 \u03c0. This result immediately shows that PU classification cannot be performed when the estimated class prior is less than half of the true class prior: \u03c0\u0302 \u2264 \u03c0/2.\n\nWe define the effective class prior \u03c0\u0303 so that 2\u03c0\u0302 \u2212 \u03c0 and 1 \u2212 \u03c0 are normalized to sum to one:\n\n\u03c0\u0303 = (2\u03c0\u0302 \u2212 \u03c0) / (2\u03c0\u0302 \u2212 \u03c0 + 1 \u2212 \u03c0) = (2\u03c0\u0302 \u2212 \u03c0) / (2\u03c0\u0302 \u2212 2\u03c0 + 1).\n\nFigure 3(b) shows the profile of the effective class prior \u03c0\u0303 for different \u03c0 (\u03c0 = 0.95, 0.9, 0.7, 0.5). The graph shows that when the true class prior \u03c0 is large, \u03c0\u0303 tends to be flat around \u03c0. When the true class prior is known to be large (such as the proportion of inliers in inlier-based outlier detection), a rough class-prior estimate is sufficient for good classification performance. On the other hand, if the true class prior is small, PU classification tends to be hard and an accurate class-prior estimator is necessary. We also see that when the true class prior is large, overestimation of the class prior is more attenuated. This may explain why some class-prior estimation methods [1, 11] still give a good practical performance in spite of having a positive bias.\n\n5 Generalization error bounds for PU classification\n\nIn this section, we analyze the generalization error for PU classification, where the training samples are clearly not identically distributed. More specifically, we derive error bounds for a classification function f(x) of the form\n\nf(x) = \u03a3_{i=1}^{n} \u03b1i k(xi, x) + \u03a3_{j=1}^{n\u2032} \u03b1\u2032j k(x\u2032j, x),\n\nwhere x1, \u2026, xn are positive training data and x\u20321, \u2026, x\u2032n\u2032 are positive and negative test data. Let\n\nA = {(\u03b11, \u2026, \u03b1n, \u03b1\u20321, \u2026, \u03b1\u2032n\u2032) | x1, \u2026, xn \u223c p(x | y = +1), x\u20321, \u2026, x\u2032n\u2032 \u223c p(x)}\n\nbe the set of all possible optimal solutions returned by the algorithm given some training data and test data according to p(x | y = +1) and p(x). Then define the constants\n\nC\u03b1 = sup_{\u03b1 \u2208 A, x1,\u2026,xn \u223c p(x|y=+1), x\u20321,\u2026,x\u2032n\u2032 \u223c p(x)} (\u03a3_{i,i\u2032=1}^{n} \u03b1i \u03b1i\u2032 k(xi, xi\u2032) + 2 \u03a3_{i=1}^{n} \u03a3_{j=1}^{n\u2032} \u03b1i \u03b1\u2032j k(xi, x\u2032j) + \u03a3_{j,j\u2032=1}^{n\u2032} \u03b1\u2032j \u03b1\u2032j\u2032 k(x\u2032j, x\u2032j\u2032))^{1/2},\n\nCk = sup_{x \u2208 R^d} \u221ak(x, x),\n\nand define the function class\n\nF = {f : x \u21a6 \u03a3_{i=1}^{n} \u03b1i k(xi, x) + \u03a3_{j=1}^{n\u2032} \u03b1\u2032j k(x\u2032j, x) | \u03b1 \u2208 A, x1, \u2026, xn \u223c p(x | y = +1), x\u20321, \u2026, x\u2032n\u2032 \u223c p(x)}.\n\nLet \u2113\u03b7(z) be a surrogate loss for the zero-one loss:\n\n\u2113\u03b7(z) = 0 if z > \u03b7; 1 \u2212 z/\u03b7 if 0 < z \u2264 \u03b7; 1 if z \u2264 0.   (12)\n\nFor any \u03b7 > 0, \u2113\u03b7(z) is lower bounded by \u21130-1(z) and approaches \u21130-1(z) as \u03b7 approaches zero. Moreover, let\n\n\u2113\u0303(yf(x)) = (2/(y + 3)) \u21130-1(yf(x)) and \u2113\u0303\u03b7(yf(x)) = (2/(y + 3)) \u2113\u03b7(yf(x)).\n\nOur key idea is to decompose the generalization error as\n\nE_{p(x,y)}[\u21130-1(yf(x))] = \u03c0* E_{p(x|y=+1)}[\u2113\u0303(f(x))] + E_{p(x,y)}[\u2113\u0303(yf(x))],\n\nwhere \u03c0* := p(y = 1) is the true class prior of the positive class. Then we have the following theorems (proofs are provided in Appendix A).\n\nTheorem 1. Fix f \u2208 F. Then, for any 0 < \u03b4 < 1, with probability at least 1 \u2212 \u03b4 over the repeated sampling of {x1, \u2026, xn} and {(x\u20321, y\u20321), \u2026, (x\u2032n\u2032, y\u2032n\u2032)} for evaluating the empirical error,\u00b9\n\nE_{p(x,y)}[\u21130-1(yf(x))] \u2212 (\u03c0*/n) \u03a3_{i=1}^{n} \u2113\u0303(f(xi)) \u2212 (1/n\u2032) \u03a3_{j=1}^{n\u2032} \u2113\u0303(y\u2032j f(x\u2032j)) \u2264 (\u03c0*/\u221an + 1/\u221an\u2032) \u221a(ln(2/\u03b4)/2).   (13)\n\nTheorem 2. Fix \u03b7 > 0. Then, for any 0 < \u03b4 < 1, with probability at least 1 \u2212 \u03b4 over the repeated sampling of {x1, \u2026, xn} and {(x\u20321, y\u20321), \u2026, (x\u2032n\u2032, y\u2032n\u2032)} for evaluating the empirical error, every f \u2208 F satisfies\n\nE_{p(x,y)}[\u21130-1(yf(x))] \u2212 (\u03c0*/n) \u03a3_{i=1}^{n} \u2113\u0303\u03b7(f(xi)) \u2212 (1/n\u2032) \u03a3_{j=1}^{n\u2032} \u2113\u0303\u03b7(y\u2032j f(x\u2032j)) \u2264 (C\u03b1Ck/\u03b7)(2\u03c0*/\u221an + 2/\u221an\u2032) + (\u03c0*/\u221an + 1/\u221an\u2032) \u221a(ln(2/\u03b4)/2).\n\nIn both theorems, the generalization error bounds are of order O(1/\u221an + 1/\u221an\u2032). This order is optimal for PU classification, where we have n i.i.d. data from one distribution and n\u2032 i.i.d. data from another distribution. The error bounds for fully supervised classification, obtained by assuming these n + n\u2032 data are all i.i.d., would be of order O(1/\u221a(n + n\u2032)). However, this assumption is unreasonable for PU classification, and we cannot train fully supervised classifiers using these n + n\u2032 samples. Although the orders (and the losses) differ slightly, O(1/\u221an + 1/\u221an\u2032) for PU classification is no worse than 2\u221a2 times O(1/\u221a(n + n\u2032)) for fully supervised classification (assuming n and n\u2032 are equal). To the best of our knowledge, no previous work has provided such generalization error bounds for PU classification.\n\n\u00b9The empirical error that we cannot evaluate in practice is in the left-hand side of (13), and the empirical error and confidence terms that we can evaluate in practice are in the right-hand side of (13).\n\nTable 1: Misclassification rate (in percent) for PU classification on the USPS dataset. 
The best result, and those equivalent to it by a 95% t-test, are indicated in bold. Rows correspond to the digit pairs \u201c0 vs 1\u201d through \u201c0 vs 9\u201d; for each class prior \u03c0 \u2208 {0.2, 0.4, 0.6, 0.8, 0.9, 0.95}, the misclassification rates of the Ramp and Hinge losses are reported. [Table 1 entries: the ramp loss attains the lower rate in most cases; at \u03c0 = 0.95 the hinge loss gives about 4.94% and at \u03c0 = 0.9 about 9.9% on every digit pair, i.e., roughly 1 \u2212 \u03c0.]\n\nFigure 4: Examples of the classification boundary for the \u201c0\u201d vs. \u201c7\u201d digits, obtained by PU learning. (a) Loss functions. (b) Class prior \u03c0 = 0.2. (c) Class prior \u03c0 = 0.6. (d) Class prior \u03c0 = 0.9. The unlabeled dataset and the underlying (latent) class labels are shown. Since the discriminant function for the hinge loss is the constant 1 when \u03c0 = 0.9, no decision boundary can be drawn and all negative samples are misclassified.\n\n6 Experiments\n\nIn this section, we experimentally compare the performance of the ramp loss and the hinge loss in PU classification (weighting was performed w.r.t. the true class prior, and the ramp loss was optimized with the method of [12]). We used the USPS dataset, with the dimensionality reduced to 2 via principal component analysis to enable illustration. 550 samples were used for the positive and mixture datasets. 
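The hinge-loss failure analyzed in Section 3 can also be reproduced without any SVM machinery. The sketch below (ours; a coarse grid search stands in for the optimizers actually used in the paper) minimizes the empirical PU objective from (3) with either loss on the 1-D Gaussian toy problem, and measures the resulting test error at a large class prior:

```python
# Sketch: hinge vs. ramp loss for PU classification on the toy problem
# p(x|y=+1) = N(-3, 1), p(x|y=-1) = N(3, 1), with a linear model g(x) = -w*x + b.
import random

def ramp(z):
    return 0.5 * max(0.0, min(2.0, 1.0 - z))

def hinge(z):
    return 0.5 * max(1.0 - z, 0.0)

def pu_objective(loss, w, b, pos, unl, pi):
    # Empirical version of J(g) = 2*pi*E1[loss(g(x))] + EX[loss(-g(x))] - pi, cf. (3).
    t1 = sum(loss(-w * x + b) for x in pos) / len(pos)
    t2 = sum(loss(w * x - b) for x in unl) / len(unl)
    return 2.0 * pi * t1 + t2 - pi

random.seed(0)
pi = 0.9
pos = [random.gauss(-3, 1) for _ in range(4000)]
unl = [random.gauss(-3, 1) if random.random() < pi else random.gauss(3, 1)
       for _ in range(4000)]
test = [(random.gauss(-3, 1), 1) if random.random() < pi else (random.gauss(3, 1), -1)
        for _ in range(4000)]

grid = [(w, b / 10.0) for w in (0.0, 0.1, 0.5, 1.0, 2.0) for b in range(-20, 21)]
results = {}
for name, loss in (("ramp", ramp), ("hinge", hinge)):
    w, b = min(grid, key=lambda p: pu_objective(loss, p[0], p[1], pos, unl, pi))
    err = sum(((1 if -w * x + b > 0 else -1) != y) for x, y in test) / len(test)
    results[name] = err

print(results)  # ramp error is near the Bayes error; hinge error is close to 1 - pi
```

In this setup the hinge objective is minimized by the constant discriminant g = 1 (w = 0, b = 1), which labels every sample positive, while the symmetric ramp loss recovers a boundary near the optimal threshold 0.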
From the results in Table 1, it is clear that the ramp loss gives a much higher classification accuracy than the hinge loss, especially for large class priors. This is due to the fact that the effect of the superfluous penalty term in (10) becomes larger, since it scales with \u03c0.\n\nWhen the class prior is large, the misclassification rate for the hinge loss is often close to 1 \u2212 \u03c0. This can be explained by (10): collecting the terms for the positive expectation, we get an effective loss function for the positive samples (illustrated in Figure 4(a)). When \u03c0 is large enough, this positive loss is minimized by the constant discriminant function g = 1. The misclassification rate then becomes 1 \u2212 \u03c0, since it is a combination of the false negative rate and the false positive rate according to the class prior.\n\nExamples of the discrimination boundary for digits \u201c0\u201d vs. \u201c7\u201d are given in Figure 4. When the class prior is low (Figure 4(b) and Figure 4(c)), the misclassification rate of the hinge loss is slightly higher. For large class priors (Figure 4(d)), the hinge loss causes all samples to be classified as positive (inspection showed that w = 0 and b = 1).\n\n7 Conclusion\n\nIn this paper, we discussed the problem of learning a classifier from positive and unlabeled data. We showed that PU learning can be solved using a cost-sensitive classifier if the class prior of the unlabeled dataset is known. We showed, however, that a non-convex loss must be used in order to prevent a superfluous penalty term in the objective function.\n\nIn practice, the class prior is unknown and estimated from data. We showed that the excess risk is actually controlled by an effective class prior which depends on both the estimated class prior and the true class prior. 
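The flatness argument of Section 4 is easy to check numerically. The sketch below (function name ours) computes the effective class prior \u03c0\u0303 = (2\u03c0\u0302 \u2212 \u03c0)/(2\u03c0\u0302 \u2212 2\u03c0 + 1) and confirms that the same absolute overestimate of \u03c0\u0302 perturbs \u03c0\u0303 far less when the true prior is large:

```python
# Sketch: effective class prior from Section 4.
# pi_tilde = (2*pihat - pi) / (2*pihat - 2*pi + 1), valid when pihat > pi/2;
# pi_tilde = pi exactly when the estimate is correct.

def effective_prior(pi, pihat):
    if 2.0 * pihat <= pi:
        raise ValueError("PU classification breaks down when pihat <= pi/2")
    return (2.0 * pihat - pi) / (2.0 * pihat - 2.0 * pi + 1.0)

print(effective_prior(0.9, 0.9))   # 0.9: a correct estimate leaves the prior unchanged
# The same +0.05 overestimate:
print(effective_prior(0.9, 0.95))  # about 0.909: large true prior, tiny shift
print(effective_prior(0.3, 0.35))  # about 0.364: small true prior, larger shift
```

This mirrors Figure 3(b): for inlier-dominated unlabeled data a rough prior estimate barely moves the effective prior, while for small true priors the same estimation error is amplified.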
Finally, generalization error bounds for the problem were provided.\n\nAcknowledgments\n\nMCdP is supported by the JST CREST program, GN was supported by the 973 Program No. 2014CB340505, and MS is supported by KAKENHI 23120004.\n\nReferences\n\n[1] C. Elkan and K. Noto. Learning classifiers from only positive and unlabeled data. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD2008), pages 213\u2013220, 2008.\n\n[2] W. Li, Q. Guo, and C. Elkan. A positive and unlabeled learning algorithm for one-class classification of remote-sensing data. IEEE Transactions on Geoscience and Remote Sensing, 49(2):717\u2013725, 2011.\n\n[3] S. Hido, Y. Tsuboi, H. Kashima, M. Sugiyama, and T. Kanamori. Inlier-based outlier detection via direct density ratio estimation. In F. Giannotti, D. Gunopulos, F. Turini, C. Zaniolo, N. Ramakrishnan, and X. Wu, editors, Proceedings of the IEEE International Conference on Data Mining (ICDM2008), pages 223\u2013232, Pisa, Italy, Dec. 15\u201319, 2008.\n\n[4] C. Scott and G. Blanchard. Novelty detection: Unlabeled data definitely help. In Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics (AISTATS2009), pages 464\u2013471, Clearwater Beach, Florida, USA, Apr. 16\u201318, 2009.\n\n[5] C. Elkan. The foundations of cost-sensitive learning. In Proceedings of the Seventeenth International Joint Conference on Artificial Intelligence (IJCAI2001), pages 973\u2013978, 2001.\n\n[6] C.C. Chang and C.J. Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1\u201327:27, 2011.\n\n[7] H.L. Van Trees. Detection, Estimation, and Modulation Theory, Part I. John Wiley and Sons, New York, NY, USA, 1968.\n\n[8] R. Duda, P. Hart, and D. Stork. Pattern Classification. John Wiley & Sons, 2nd edition, 2001.\n\n[9] V. Vapnik. The Nature of Statistical Learning Theory. Springer, 2000.\n\n[10] G. Blanchard, G. Lee, and C. Scott. Semi-supervised novelty detection. The Journal of Machine Learning Research, 11:2973\u20133009, 2010.\n\n[11] M. C. du Plessis and M. Sugiyama. Class prior estimation from positive and unlabeled data. IEICE Transactions on Information and Systems, E97-D:1358\u20131362, 2014.\n\n[12] R. Collobert, F.H. Sinz, J. Weston, and L. Bottou. Trading convexity for scalability. In Proceedings of the 23rd International Conference on Machine Learning (ICML2006), pages 201\u2013208, 2006.\n\n[13] S. Suzumura, K. Ogawa, M. Sugiyama, and I. Takeuchi. Outlier path: A homotopy algorithm for robust SVM. In Proceedings of the 31st International Conference on Machine Learning (ICML2014), pages 1098\u20131106, Beijing, China, Jun. 21\u201326, 2014.\n\n[14] A. Ghosh, N. Manwani, and P. S. Sastry. Making risk minimization tolerant to label noise. CoRR, abs/1403.3610, 2014.\n\n[15] M. Mohri, A. Rostamizadeh, and A. Talwalkar. Foundations of Machine Learning. MIT Press, 2012.\n", "award": [], "sourceid": 495, "authors": [{"given_name": "Marthinus", "family_name": "du Plessis", "institution": "Tokyo Institute of Technology"}, {"given_name": "Gang", "family_name": "Niu", "institution": "Baidu, Inc."}, {"given_name": "Masashi", "family_name": "Sugiyama", "institution": "The University of Tokyo"}]}