{"title": "Proper losses for learning from partial labels", "book": "Advances in Neural Information Processing Systems", "page_first": 1565, "page_last": 1573, "abstract": "This paper discusses the problem of calibrating posterior class probabilities from partially labelled data. Each instance is assumed to be labelled as belonging to one of several candidate categories, at most one of them being true. We generalize the concept of proper loss to this scenario, establish a necessary and sufficient condition for a loss function to be proper, and we show a direct procedure to construct a proper loss for partial labels from a conventional proper loss. The problem can be characterized by the mixing probability matrix relating the true class of the data and the observed labels. An interesting result is that the full knowledge of this matrix is not required, and losses can be constructed that are proper in a subset of the probability simplex.", "full_text": "Proper losses for learning from partial labels\n\nJes\u00b4us Cid-Sueiro\n\nDepartment of Signal Theory and Communications\n\nUniversidad Carlos III de Madrid\n\nLegans-Madrid, 28911 Spain\n\njcid@tsc.uc3m.es\n\nAbstract\n\nThis paper discusses the problem of calibrating posterior class probabilities from\npartially labelled data. Each instance is assumed to be labelled as belonging to\none of several candidate categories, at most one of them being true. We generalize\nthe concept of proper loss to this scenario, we establish a necessary and suf\ufb01cient\ncondition for a loss function to be proper, and we show a direct procedure to\nconstruct a proper loss for partial labels from a conventional proper loss. The\nproblem can be characterized by the mixing probability matrix relating the true\nclass of the data and the observed labels. 
The full knowledge of this matrix is not required, and losses can be constructed that are proper for a wide set of mixing probability matrices.

1 Introduction

The problem of learning multiple classes from data with imprecise label information has attracted recent attention in the literature. It arises in many different applications; Cour [1] cites some of them: picture collections containing several faces per image and a caption that only specifies who is in the picture but not which name matches which face, or video collections with labels taken from annotations.

In a partially labelled data set, each instance is assigned to a set of candidate categories, at most one of them true. The problem is closely related to learning from noisy labels, which is common in human-labelled databases with multiple annotators [2] [3], medical imaging, crowdsourcing, etc. Other related problems can be interpreted as particular forms of partial labelling: semisupervised learning, or hierarchical classification in databases where some instances may be labelled with respect to parent categories only. It is also a particular case of the more general problems of learning from soft labels [4] or learning from measurements [5].

Several algorithms have been proposed to deal with partial labelling [1] [2] [6] [7] [8]. Though some theoretical work has addressed the consistency of algorithms [1] or the information provided by uncertain data [8], little effort has been devoted to analyzing the conditions under which the true class can be inferred from partial labels.

In this paper we address the problem of estimating posterior class probabilities from partially labelled data. In particular, we obtain general conditions under which the posterior probability of the true class given the observation can be estimated from training data with ambiguous class labels.
To do so, we generalize the concept of proper losses to losses that are functions of ambiguous labels, and show that the capability to estimate posterior class probabilities using a given loss depends on the probability matrix relating the ambiguous labels to the true class of the data. Each generalized proper loss can be characterized by the set (a convex polytope) of all admissible probability matrices. Analyzing the structure of these losses is one of the main goals of this paper. To our knowledge, the design of proper losses for learning from imperfect labels has not been addressed in the area of statistical learning.

The paper is organized as follows: Sec. 2 formulates the problem discussed in the paper, Sec. 3 generalizes proper losses to scenarios with ambiguous labels, Sec. 4 proposes a procedure to design proper losses for wide sets of mixing matrices, Sec. 5 discusses estimation errors and Sec. 6 states some conclusions.

2 Formulation

2.1 Notation

Vectors are written in boldface, matrices in boldface capitals, and sets in calligraphic letters. For any integer n, e^n_i is the n-dimensional unit vector with all zero components apart from the i-th component, which is equal to one, and 1_n is the n-dimensional all-ones vector. Superindex T denotes transposition. We will use ℓ(·) to denote a loss based on partial labels, and ℓ̃(·) for losses based on true labels. The simplex of n-dimensional probability vectors is P_n = {p ∈ [0,1]^n : \sum_{i=0}^{n-1} p_i = 1}, and the set of all left-stochastic matrices is M = {M ∈ [0,1]^{d×c} : M^T 1_d = 1_c}. The number of classes is c, and the number of possible partial label vectors is d ≤ 2^c.

2.2 Learning from partial labels

Let X be a sample set, Y = {e^c_j, j = 0, 1, ..., c-1} a set of labels, and Z ⊂ {0,1}^c a set of partial labels.
Sample (x, z) ∈ X × Z is drawn from an unknown distribution P. Partial label vector z ∈ Z is a noisy version of the true label y ∈ Y. Several authors [1] [6] [7] [8] assume that the true label is always present in z, i.e., z_j = 1 when y_j = 1, but this assumption is not required in our setting, which admits noisy label scenarios (as, for instance, in [2]). Without loss of generality, we assume that Z contains only partial labels with nonzero probability (i.e., P{z = b} > 0 for any b ∈ Z).

In general, we model the relationship between z and y through an arbitrary d × c conditional mixing probability matrix M(x) with components

    m_{ij}(x) = P{z = b_i | y_j = 1, x}    (1)

where b_i ∈ Z is the i-th element of Z for some arbitrary ordering.

Note that, in general, the mixing matrix could depend on x, though a constant mixing matrix [2] [6] [7] [8] is a common assumption, as is the statistical independence of the incorrect labels [6] [7] [8]. In this paper we do not impose these assumptions. The goal is to infer y given x without knowing model P. To do so, a set of partially labelled samples, S = {(x_k, z_k), k = 1, ..., K}, is available. True labels y_k are not observed.

We will illustrate different partial label scenarios with a 3-class problem. Consider that each column of M^T corresponds to a label pattern (z_0, z_1, z_2) following the ordering (0,0,0), (1,0,0), (0,1,0), (0,0,1), (1,1,0), (1,0,1), (0,1,1), (1,1,1) (e.g., the first column contains P{z = (0,0,0)^T | y_j = 1}, for j = 0, 1, 2).

A. Supervised learning:

    M^T = [ 0  1  0  0  0  0  0  0
            0  0  1  0  0  0  0  0
            0  0  0  1  0  0  0  0 ]

B. Single noisy labels:

    M^T = [ 0  1-α  α/2  α/2  0  0  0  0
            0  β/2  1-β  β/2  0  0  0  0
            0  γ/2  γ/2  1-γ  0  0  0  0 ]

C.
Semisupervised learning:

    M^T = [ α  1-α  0    0    0  0  0  0
            β  0    1-β  0    0  0  0  0
            γ  0    0    1-γ  0  0  0  0 ]

D. True label with independent noisy labels:

    M^T = [ 0  1-α-α₂  0       0       α/2  α/2  0    α₂
            0  0       1-β-β₂  0       β/2  0    β/2  β₂
            0  0       0       1-γ-γ₂  0    γ/2  γ/2  γ₂ ]

E. Two labels, one of them true:

    M^T = [ 0  1-α  0    0    α/2  α/2  0    0
            0  0    1-β  0    β/2  0    β/2  0
            0  0    0    1-γ  0    γ/2  γ/2  0 ]

The question that motivates our work is the following: knowing M (i.e., knowing the scenario and the value of parameters α, β and γ), we can estimate accurate posterior class probabilities from partially labelled data in all these cases; however, is it possible if α, β and γ are unknown? We will see that the answer is negative for scenarios B, C and D, but positive for E. In the positive case, no information is lost by the partial label process for infinite sample sizes. In the negative case, some performance is lost as a consequence of the mixing process, and this loss persists even for infinite sample sizes.¹

2.3 Inference through partial label probabilities

If the mixing matrix is known, a conceptually simple strategy to solve the partial label problem consists of estimating posterior partial label probabilities, using them to estimate posterior class probabilities and predict y.
Since

    P{z = b_i | x} = \sum_{j=0}^{c-1} m_{ij}(x) P{y_j = 1 | x},    (2)

we can define vectors p(x) and η(x) with components p_i = P{z = b_i | x} and η_j = P{y_j = 1 | x}, to write (2) as p(x) = M(x)η(x) and, thus,

    η(x) = M⁺(x) p(x)    (3)

where M⁺(x) = (M^T(x)M(x))⁻¹M^T(x) is the left inverse (pseudoinverse) of M(x). Thus, a first condition to estimate η from p given M is that the conditional mixing matrix has a left inverse (i.e., the columns of M(x) are linearly independent).

There are some trivial cases where the mixing matrix has no pseudoinverse (for instance, if P{z|y, x} = P{z|x}, all rows in M(x) are equal, and M^T(x)M(x) is a rank-1 matrix, which has no inverse), but these are degenerate cases of no practical interest. From a practical point of view, the application of (3) poses two major problems: (1) when the model P is unknown, even knowing M, estimating p from data may be infeasible for d close to 2^c and a large number of classes (furthermore, posterior probability estimates will not be accurate if the sample size is small), and (2) M(x) is generally unknown, and cannot be estimated from the partially labelled set, S.

The solution adopted in this paper for the first problem consists of estimating η from data without estimating p. This is discussed in the next section. The second problem is discussed in Section 4.

3 Loss functions for posterior probability estimation

The estimation of posterior probabilities from labelled data is a well known problem in statistics and machine learning that has received some recent attention in the machine learning literature [9] [10]. In order to estimate posteriors from labelled data, a loss function ℓ̃(y, η̂) is required such that η is a member of arg min_{η̂} E_y{ℓ̃(y, η̂)}. Losses satisfying this property are said to be Fisher consistent and are known as proper scoring rules.
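The recovery in Eq. (3) can be checked numerically. The sketch below is illustrative only (the mixing matrix and posterior values are made-up example numbers, not taken from any experiment in the paper): it builds p = Mη and recovers η with the Moore-Penrose left inverse.

```python
import numpy as np

# Example mixing matrix (d = 6 partial labels, c = 3 classes); column j
# gives P(z = b_i | y_j = 1), so each column sums to one.
M = np.array([
    [0.5, 0.0, 0.0],
    [0.0, 0.5, 0.0],
    [0.0, 0.0, 0.6],
    [0.4, 0.3, 0.0],
    [0.1, 0.0, 0.2],
    [0.0, 0.2, 0.2],
])
eta = np.array([0.45, 0.15, 0.40])   # true posteriors P(y_j = 1 | x)
p = M @ eta                          # partial-label posteriors, Eq. (2)

# Eq. (3): eta = (M^T M)^{-1} M^T p; M has full column rank here, so the
# pseudoinverse is exactly this left inverse.
eta_hat = np.linalg.pinv(M) @ p
assert np.allclose(eta_hat, eta)
```

The recovery fails exactly when the columns of M(x) are linearly dependent, the degenerate case discussed above.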
A loss is strictly proper if η is the only member of this set. A loss is regular if it is finite for any y, except possibly that ℓ̃(y, η̂) = ∞ if y_j = 1 and η̂_j = 0. Proper scoring rules can be characterized by Savage's representation [11] [12]:

Theorem 3.1 A regular scoring rule ℓ̃ : Y × P_c → ℝ is (strictly) proper if and only if

    ℓ̃(y, η̂) = h(η̂) + g(η̂)^T (y - η̂)    (4)

where h is a (strictly) concave function and g(η̂) is a supergradient of h at the point η̂, for all η̂ ∈ P_c.

(Recall that g is a supergradient of h at η̂ if h(η) ≤ h(η̂) + g^T(η - η̂).)

¹ If the sample size is large (in particular for scenarios C and D), one could think of simply ignoring samples with imperfect labels, and training the classifier with the samples whose class is known. However, in general, there is some bias in this process, which eventually can degrade performance.

In order to deal with partial labels, we generalize proper losses as follows.

Definition Let y and z be random vectors taking values in Y and Z, respectively.
A scoring rule ℓ(z, η̂) is proper to estimate η (with components η_j = P{y_j = 1}) from z if

    η ∈ arg min_{η̂} E_z{ℓ(z, η̂)}    (5)

It is strictly proper if η is the only member of this set.

This generalized family of proper scoring rules can be characterized by the following.

Theorem 3.2 Scoring rule ℓ(z, η̂) is (strictly) proper to estimate η from z if and only if the equivalent loss

    ℓ̃(y, η̂) = y^T M^T l(η̂),    (6)

where l(η̂) is a vector with components ℓ_i(η̂) = ℓ(b_i, η̂) and b_i is the i-th element in Z (according to some arbitrary ordering), is (strictly) proper.

Proof The proof is straightforward by noting that the expected loss can be expressed as

    E_z{ℓ(z, η̂)} = \sum_{i=0}^{d-1} P{z = b_i} ℓ_i(η̂) = \sum_{i=0}^{d-1} \sum_{j=0}^{c-1} m_{ij} η_j ℓ_i(η̂)
                 = η^T M^T l(η̂) = E_y{y^T M^T l(η̂)} = E_y{ℓ̃(y, η̂)}    (7)

Therefore, arg min_{η̂} E_z{ℓ(z, η̂)} = arg min_{η̂} E_y{ℓ̃(y, η̂)} and, thus, ℓ is (strictly) proper with respect to y iff ℓ̃ is (strictly) proper.

Note that, defining vector l̃(η̂) with components ℓ̃_j(η̂) = ℓ̃(e^c_j, η̂), we can write

    l̃(η̂) = M^T l(η̂)    (8)

We will use this vector representation of losses extensively in the following. Th. 3.2 states that the proper character of a loss for estimating η from z depends on M. For this reason, in the following we will say that ℓ(z, η̂) is M-proper if it is proper to estimate η from z.

4 Proper losses for sets of mixing matrices

Eq. (8) may be useful to check if a given loss is M-proper.
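Theorem 3.2 also lends itself to a direct numerical check. The sketch below is illustrative (cross entropy as the conventional proper loss, and an example left-stochastic matrix of my own choosing): any l(η̂) satisfying M^T l = l̃, e.g. the minimum-norm solution via the pseudoinverse of M^T, gives a partial-label loss whose expected value is minimized at the true posterior.

```python
import numpy as np

rng = np.random.default_rng(0)
M = np.array([
    [0.5, 0.0, 0.0],
    [0.0, 0.5, 0.0],
    [0.0, 0.0, 0.6],
    [0.4, 0.3, 0.0],
    [0.1, 0.0, 0.2],
    [0.0, 0.2, 0.2],
])                                       # example mixing matrix, columns sum to one

def l_partial(eta_hat):
    lt = -np.log(eta_hat)                # conventional proper loss (cross entropy)
    return np.linalg.pinv(M.T) @ lt      # one of the many solutions of M^T l = l~

eta = np.array([0.45, 0.15, 0.40])       # true posterior

def expected_loss(eta_hat):              # Eq. (7): E_z{loss} = eta^T M^T l(eta_hat)
    return eta @ (M.T @ l_partial(eta_hat))

# Eq. (8) holds for this l, and the expected partial-label loss is
# minimized at eta itself (M-properness):
q = np.array([0.2, 0.3, 0.5])
assert np.allclose(M.T @ l_partial(q), -np.log(q))
for _ in range(200):
    assert expected_loss(eta) <= expected_loss(rng.dirichlet(np.ones(3))) + 1e-12
```

The loop only samples the simplex, so it is a sanity check rather than a proof; the guarantee itself is Theorem 3.2.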
However, note that, since matrix M^T is c × d (with d > c in general), it has no left inverse, and we cannot take M^T out from the left side of (8) to compute ℓ from ℓ̃. For any given M and any given equivalent loss l̃(η̂), there is an uncountable number of losses l(η̂) satisfying (8).

Example Let ℓ̃ be an arbitrary proper loss for a 3-class problem. The losses

    ℓ(z, η̂) = (z_0 - z_1 z_2) ℓ̃_0(η̂) + (z_1 - z_0 z_2) ℓ̃_1(η̂) + (z_2 - z_0 z_1) ℓ̃_2(η̂)    (9)

    ℓ′(z, η̂) = z_0 ℓ̃_0(η̂) + z_1 ℓ̃_1(η̂) + z_2 ℓ̃_2(η̂)    (10)

are M-proper for the mixing matrix M given by

    m_{ij} = 1 if b_i = e^c_j, and 0 otherwise.    (11)

Note that M corresponds to a situation where instances are perfectly labelled, and z contains perfect information about y (in fact, z = y with probability one).

Also, for any ℓ(z, η̂), there are different mixing matrices such that the equivalent loss is the same.

Example The loss given by (9) is M-proper for the mixing matrix M in (11) and it is also N-proper, for N with components

    n_{ij} = 1/2 if b_i = e^c_j + e^c_k for some k ≠ j, and 0 otherwise.    (12)

Matrix N corresponds to a situation where label z contains the true class and another noisy component taken at random from the other classes.

In general, if l(η̂) is M-proper and N-proper with equivalent loss l̃(η̂), then it is also Q-proper with the same equivalent loss, for any Q of the form

    Q = M(I - D) + ND    (13)

where D is a diagonal nonnegative matrix (note that Q is a probability matrix, because Q^T 1_d = 1_c). This is because

    Q^T l(η̂) = (I - D) M^T l(η̂) + D N^T l(η̂) = (I - D) l̃(η̂) + D l̃(η̂) =
l̃(η̂)    (14)

More generally, for arbitrary non-diagonal matrices D, provided that Q is a probability matrix, l(η̂) is Q-proper.

Example Assuming diagonal D, if M and N are the mixing matrices defined in (11) and (12), respectively, the loss (9) is Q-proper for any mixing matrix Q of the form (13). This corresponds to a matrix with components

    q_{ij} = d_{jj} if b_i = e^c_j;  (1 - d_{jj})/2 if b_i = e^c_j + e^c_k for some k ≠ j;  0 otherwise.    (15)

That is, the loss in (9) is proper for any situation where the label z contains the true class and possibly another class taken at random, and the probability that the true label is corrupted may be class-dependent.

4.1 Building proper losses from ambiguity sets

The ambiguity on M for a given loss l(η̂) can be used to deal with the second problem mentioned in Sec. 2.3: in general, the mixing matrix may be unknown or, even if it is known, it may depend on the observation, x. Thus, we need a procedure to design losses that are proper for a wide family of mixing matrices. In general, given a set of mixing matrices, Q, we will say that ℓ is Q-proper if it is M-proper for any M ∈ Q.

The following result provides a way to construct a proper loss ℓ for partial labels from a given conventional proper loss ℓ̃.

Theorem 4.1 For 0 ≤ j ≤ c-1, let V_j = {v^j_i ∈ P_d, 1 ≤ i ≤ n_j} be a set of n_j > 0 probability vectors of dimension d, such that \sum_{j=0}^{c-1} n_j = d and span(∪_{j=0}^{c-1} V_j) = ℝ^d, and let Q = {M ∈ M : M e^c_j ∈ span(V_j) ∩ P_d}. Then, for any (strictly) proper loss ℓ̃(y, η̂), there exists a loss ℓ(z, η̂) which is (strictly) Q-proper.

Proof The proof is constructive.
Let V be a d × d matrix whose columns are the elements of ∪_{j=0}^{c-1} V_j, which is invertible since span(∪_{j=0}^{c-1} V_j) = ℝ^d. Let c(η̂) be a d × 1 vector such that c_i(η̂) = ℓ̃_j(η̂) if V e^d_i ∈ V_j (i.e., if the i-th column of V belongs to V_j). Let ℓ(z, η̂) be the loss defined by vector l(η̂) = (V^T)⁻¹ c(η̂).

Consider the set R = {M ∈ M : M e^c_j ∈ V_j for all j} (which is not empty because n_j > 0). Since the columns of any M ∈ R are also columns of V, then M^T l(η̂) = l̃(η̂) and, thus, ℓ(z, η̂) is M-proper. Therefore, it is also proper for any affine combination of matrices in R inside P_d. But span(R) ∩ P_d = Q. Thus, ℓ(z, η̂) is M-proper for all M ∈ Q (i.e., it is Q-proper).

Theorem 4.1 shows that we can construct proper losses for learning from partial labels by specifying the points of sets V_j, j = 0, ..., c-1. Each of these sets defines an ambiguity set A_j = span(V_j) ∩ P_d which represents all admissible conditional distributions for P(z | y_j = 1). If the columns of the true mixing matrix M are members of the ambiguity sets, the resulting loss can be used to estimate posterior class probabilities from the observed partial labels.

Thus, a general procedure to design a loss function for learning from partial labels is:

1. Select a proper loss, ℓ̃(y, η̂).
2. Define the ambiguity sets by choosing, for each class j, a set V_j of n_j linearly independent basis vectors. The whole set of d basis vectors must be linearly independent.
3. Construct matrix V whose columns comprise all basis vectors.
4. Construct binary matrix U with u_{ji} = 1 if the i-th column of V is in V_j, and u_{ji} = 0 otherwise.
5.
Compute the desired proper loss vector as

    l(η̂) = (V^T)⁻¹ U l̃(η̂)    (16)

Since the ambiguity set A_j is the intersection of an n_j-dimensional linear subspace with the d-dimensional probability simplex, it is an (n_j - 1)-dimensional convex polytope whose vertices lie in distinct (n_j - 1)-faces of P_d. These vertices must have a set of at least n_j - 1 zero components which cannot be a set of zeros in any other vertex. This has two consequences: (1) we can define the ambiguity sets from these vertices, and (2) the choice is not unique, because the number of vertices can be higher than n_j - 1.

If the proper loss ℓ̃(y, η̂) is non-degenerate, Q contains all mixing matrices for which the loss is proper:

Theorem 4.2 Let us assume that, if a^T l̃(η̂) = 0 for any η̂, then a = 0. Under the conditions of Theorem 4.1, for any M ∈ M \ Q, ℓ(z, η̂) is not M-proper.

Proof Since the columns of V are in the ambiguity sets and form a basis of ℝ^d, span(∪_{j=0}^{c-1} A_j) = ℝ^d. Thus, the n-th column of any arbitrary M can be represented as m_n = \sum_j α_{nj} w_j for some w_j ∈ A_j and some coefficients α_{nj}. If M ∉ Q, then α_{nj} ≠ 0 for some j ≠ n and some n. Then m_n^T l(η̂) = \sum_j α_{nj} ℓ̃_j(η̂), which cannot be equal to ℓ̃_n(η̂) for all η̂. Therefore, ℓ(z, η̂) is not M-proper.

4.2 Virtual labels

The analysis above shows a procedure to construct proper losses from ambiguity sets.
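The five-step procedure can be sketched in code for the "two labels, one of them true" scenario with c = 3 and label ordering (1,0,0), (0,1,0), (0,0,1), (1,1,0), (1,0,1), (0,1,1). The particular basis vectors and the parameter values a are my own illustrative choices, not prescribed by the paper:

```python
import numpy as np

c, d = 3, 6
e = np.eye(d)
# Step 2: two basis vectors per class: the pure label, and the uniform
# mixture of the two double labels that contain the class.
V = np.column_stack([
    e[0], (e[3] + e[4]) / 2,   # V_0 (class 0)
    e[1], (e[3] + e[5]) / 2,   # V_1 (class 1)
    e[2], (e[4] + e[5]) / 2,   # V_2 (class 2)
])                              # Step 3: V is d x d and invertible
U = np.zeros((d, c))            # Step 4: U[i, j] = 1 iff column i of V is in V_j
U[0, 0] = U[1, 0] = 1
U[2, 1] = U[3, 1] = 1
U[4, 2] = U[5, 2] = 1

def l_partial(eta_hat):         # Step 5, Eq. (16): l = (V^T)^{-1} U l~
    lt = -np.log(eta_hat)       # Step 1: cross entropy as the base proper loss
    return np.linalg.solve(V.T, U @ lt)

# Any M whose columns lie in the ambiguity sets satisfies Eq. (8),
# whatever the per-class corruption probabilities a:
a = np.array([0.6, 0.2, 0.4])
M = np.column_stack([(1 - a[j]) * V[:, 2 * j] + a[j] * V[:, 2 * j + 1]
                     for j in range(c)])
q = np.array([0.2, 0.3, 0.5])
assert np.allclose(M.T @ l_partial(q), -np.log(q))
```

Changing a leaves the identity in the last line intact, which is exactly the Q-properness that makes scenario E learnable with unknown parameters.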
The main result of this section is to show that (16) is actually a universal representation, in the sense that any proper loss can be represented in this form, and we generalize Savage's representation by providing an explicit formula for Q-proper losses.

Theorem 4.3 Scoring rule ℓ(z, η̂) is (strictly) Q-proper for some matrix set Q with equivalent loss ℓ̃(y, η̂) if and only if

    ℓ(z, η̂) = h(η̂) + g(η̂)^T (U^T V⁻¹ z - η̂)    (17)

where h is the (strictly) concave function from Savage's representation of ℓ̃, g(η̂) is a supergradient of h, V is a d × d non-singular matrix and U is a binary matrix with only one unit value in each row. Moreover, the ambiguity set of class j is A_j = span(V_j) ∩ P_d, where V_j is the set of all columns i of V such that u_{ji} = 1.

Proof See the Appendix.

Comparing (4) with (17), the effect of imperfect labelling becomes clear: the unknown true label y is replaced by a virtual label ỹ = U^T V⁻¹ z, which is a linear combination of the partial labels.

4.3 Admissible scenarios

The previous analysis shows that, in order to calibrate posterior probabilities from partial labels in scenarios where the mixing matrix is not exactly known, two conditions are required: (1) the columns of any admissible mixing matrix (the rows of M^T) must be contained in the ambiguity sets, and (2) the bases of all ambiguity sets must be linearly independent. It is not difficult to see that the parametric matrices in scenarios B, C and D defined in Section 2.2 cannot be generated using a set of bases satisfying these constraints. On the contrary, scenario E is admissible, as we have shown in the example in Section 4.

5 Estimation Errors

If the true mixing matrix M is not in Q, a Q-proper loss may fail to estimate η.
The consequences of this can be analyzed using the expected loss, given by

    L(η, η̂) = E{ℓ(z, η̂)} = η^T M^T l(η̂) = η^T M^T (V^T)⁻¹ U l̃(η̂)    (18)

If M ∈ Q, then L(η, η̂) = η^T l̃(η̂). However, if M ∉ Q, then we can decompose M = M_Q + N, where M_Q is the orthogonal projection of M onto Q. Then

    L(η, η̂) = η^T N^T (V^T)⁻¹ U l̃(η̂) + η^T l̃(η̂)    (19)

Example The effect of a bad choice of the ambiguity set can be illustrated using the loss in (9) in two cases: ℓ̃_j(η̂) = ‖e^c_j - η̂‖² (the square error) and ℓ̃_j(η̂) = -ln(η̂_j) (the cross entropy). As we have discussed before, loss (9) is proper for any scenario where label z contains the true class and possibly another class taken at random. Let us assume that the true mixing matrix is

    M^T = [ 0.5  0    0    0.4  0.1  0
            0    0.5  0    0.3  0    0.2
            0    0    0.6  0    0.2  0.2 ]    (20)

(where each column of M^T corresponds to a label vector (z_0, z_1, z_2) following the ordering (1,0,0), (0,1,0), (0,0,1), (1,1,0), (1,0,1), (0,1,1)). Fig. 1 shows the expected loss in (19) for the square error (left) and the cross entropy (right), as a function of η̂ over the probability simplex P_3, for η = (0.45, 0.15, 0.4)^T.
Since M ∉ Q, the estimated posterior minimizing the expected loss, η̂* (which is unique because both losses are strictly proper), does not coincide with the true posterior.

Figure 1: Square loss (left) and cross entropy (right) in the probability simplex, as a function of η̂ for η = (0.45, 0.15, 0.4)^T.

It is important to note that the minimum η̂* does not depend on the choice of the cost and, thus, the estimation error is invariant to the choice of the strictly proper loss (though this may not be true when η is estimated from an empirical distribution). This is because, using (19) and noting that the expected proper loss is

    L̃(η, η̂) = E_y{ℓ̃(y, η̂)} = η^T l̃(η̂)    (21)

we have

    L(η, η̂) = L̃(U^T V⁻¹ M η, η̂)    (22)

Since (22) is minimized at η̂* = U^T V⁻¹ M η, the estimation error is

    ‖η - η̂*‖² = ‖(I - U^T V⁻¹ M) η‖²    (23)

which is independent of the particular choice of the equivalent loss.

If ℓ̃ is proper but not strictly proper, the minimum may not be unique. For instance, for the 0-1 loss, any η̂ providing the same decisions as η is a minimum of L̃(η, η̂). Therefore, those values of η with η and U^T V⁻¹ M η in the same decision region are not influenced by a bad choice of the ambiguity set. Unfortunately, since the set of boundary decision points is not linear (but piecewise linear), one can always find points η that are affected by this choice. Therefore, a wrong choice of the ambiguity set always changes the decision boundary.
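The minimizer and the error in (22) and (23) can be computed directly for this example. The sketch below is self-contained; V and U encode the "true class plus one random second label" ambiguity sets with a basis of my own choosing, and M is the matrix of Eq. (20) (label ordering (1,0,0), (0,1,0), (0,0,1), (1,1,0), (1,0,1), (0,1,1)):

```python
import numpy as np

d = 6
e = np.eye(d)
V = np.column_stack([e[0], (e[3] + e[4]) / 2,   # per-class bases: pure label,
                     e[1], (e[3] + e[5]) / 2,   # plus the uniform mixture of
                     e[2], (e[4] + e[5]) / 2])  # the two pairs containing it
U = np.zeros((d, 3))
U[0, 0] = U[1, 0] = 1
U[2, 1] = U[3, 1] = 1
U[4, 2] = U[5, 2] = 1
M = np.array([[0.5, 0.0, 0.0],
              [0.0, 0.5, 0.0],
              [0.0, 0.0, 0.6],
              [0.4, 0.3, 0.0],
              [0.1, 0.0, 0.2],
              [0.0, 0.2, 0.2]])    # Eq. (20); not in Q (pair masses are unequal)
eta = np.array([0.45, 0.15, 0.40])

eta_star = U.T @ np.linalg.solve(V, M @ eta)   # minimizer of Eq. (22)
err = np.linalg.norm(eta - eta_star)           # Eq. (23)
print(eta_star, err)   # eta_star still sums to one, but differs from eta
```

The same eta_star comes out for the square error and the cross entropy, illustrating the invariance to the choice of the strictly proper loss.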
Summarizing, the ambiguity set for probability estimation is not larger than that for classification.

6 Conclusions

In this paper we have generalized proper losses to deal with scenarios with partial labels. Proper losses based on partial labels can be designed to cope with different mixing matrices. We have also generalized Savage's representation of proper losses to obtain an explicit expression for proper losses as a function of a concave generator.

Appendix: Proof of Theorem 4.3

Let us assume that ℓ(z, η̂) is (strictly) Q-proper for some matrix set Q with equivalent loss ℓ̃(y, η̂). Let Q_j be the set of the j-th columns of all matrices in Q, and take A_j = span(Q_j) ∩ P_d. Then any vector m ∈ A_j is an affine combination of vectors in Q_j and, thus, m^T l(η̂) = ℓ̃_j(η̂). Therefore, if span(Q_j) has dimension n_j, we can take a basis V_j ⊂ Q_j of n_j linearly independent vectors such that A_j = span(V_j) ∩ P_d.

By construction, l(η̂) = (V^T)⁻¹ U l̃(η̂). Combining this equation with Savage's representation in (4), we get

    ℓ(z, η̂) = z^T l(η̂) = z^T (V^T)⁻¹ U (h(η̂) 1_c + (I - η̂ 1_c^T)^T g(η̂))    (24)
             = h(η̂) z^T 1_d + z^T (V^T)⁻¹ (U - 1_d η̂^T) g(η̂)
             = h(η̂) + g(η̂)^T (U^T V⁻¹ z - η̂)    (25)

which is the desired result.

Now, let us assume that (17) is true.
Then

    l(η̂) = h(η̂) 1_d + ((V^T)⁻¹ U - 1_d η̂^T) g(η̂).

For any matrix M ∈ M such that M e^c_j ∈ A_j, we have

    M^T l(η̂) = h(η̂) M^T 1_d + (M^T (V^T)⁻¹ U - M^T 1_d η̂^T) g(η̂)    (26)

If M ∈ Q, then we can express each column j of M as a convex combination of the columns of V with u_{ji} = 1; thus M = VΛ for some matrix Λ with the coefficients of the convex combinations at the corresponding positions of the unit values in U. Then M^T (V^T)⁻¹ U = Λ^T U = I. Using this in (26), we get

    M^T l(η̂) = h(η̂) 1_c + (I_c - 1_c η̂^T) g(η̂) = l̃(η̂).    (27)

Applying Theorem 3.2, the proof is complete.

Acknowledgments

This work was partially funded by project TEC2011-22480 from the Spanish Ministry of Science and Innovation, project PRI-PIBIN-2011-1266, and by the IST Programme of the European Community under the PASCAL2 Network of Excellence, IST-2007-216886. Thanks to Raúl Santos-Rodríguez and Darío García-García for their constructive comments about this manuscript.

References

[1] T. Cour, B. Sapp, and B. Taskar, "Learning from partial labels," Journal of Machine Learning Research, vol. 12, pp. 1225-1261, 2011.

[2] V. C. Raykar, S. Yu, L. H. Zhao, G. H. Valadez, C. Florin, L. Bogoni, and L. Moy, "Learning from crowds," Journal of Machine Learning Research, vol. 99, pp. 1297-1322, August 2010.

[3] V. S. Sheng, F. Provost, and P. G. Ipeirotis, "Get another label? Improving data quality and data mining using multiple, noisy labelers," in Procs. of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ser. KDD '08. New York, NY, USA: ACM, 2008, pp. 614-622.

[4] E. Côme, L. Oukhellou, T. Denoeux, and P.
Aknin, "Mixture model estimation with soft labels," in Soft Methods for Handling Variability and Imprecision, ser. Advances in Soft Computing, D. Dubois, M. Lubiano, H. Prade, M. Gil, P. Grzegorzewski, and O. Hryniewicz, Eds. Springer Berlin / Heidelberg, 2008, vol. 48, pp. 165-174.

[5] P. Liang, M. Jordan, and D. Klein, "Learning from measurements in exponential families," in Proceedings of the 26th Annual International Conference on Machine Learning. ACM, 2009, pp. 641-648.

[6] R. Jin and Z. Ghahramani, "Learning with multiple labels," Advances in Neural Information Processing Systems, vol. 15, pp. 897-904, 2002.

[7] C. Ambroise, T. Denoeux, G. Govaert, and P. Smets, "Learning from an imprecise teacher: probabilistic and evidential approaches," in Applied Stochastic Models and Data Analysis, 2001, vol. 1, pp. 100-105.

[8] Y. Grandvalet and Y. Bengio, "Semi-supervised learning by entropy minimization," in Advances in Neural Information Processing Systems, 2005.

[9] M. Reid and B. Williamson, "Information, divergence and risk for binary experiments," Journal of Machine Learning Research, vol. 12, pp. 731-817, 2011.

[10] H. Masnadi-Shirazi and N. Vasconcelos, "Risk minimization, probability elicitation, and cost-sensitive SVMs," in Proceedings of the International Conference on Machine Learning, 2010, pp. 204-213.

[11] L. Savage, "Elicitation of personal probabilities and expectations," Journal of the American Statistical Association, pp. 783-801, 1971.

[12] T. Gneiting and A. Raftery, "Strictly proper scoring rules, prediction, and estimation," Journal of the American Statistical Association, vol. 102, no. 477, pp. 359-378, 2007.
", "award": [], "sourceid": 738, "authors": [{"given_name": "Jes\u00fas", "family_name": "Cid-sueiro", "institution": null}]}