{"title": "Worst-Case Analysis of Selective Sampling for Linear-Threshold Algorithms", "book": "Advances in Neural Information Processing Systems", "page_first": 241, "page_last": 248, "abstract": null, "full_text": "Worst-Case Analysis of Selective Sampling for\n\nLinear-Threshold Algorithms(cid:3)\n\nNicol`o Cesa-Bianchi\nDSI, University of Milan\ncesa-bianchi@dsi.unimi.it\n\nClaudio Gentile\n\nUniversit`a dell\u2019Insubria\n\ngentile@dsi.unimi.it\n\nLuca Zaniboni\n\nDTI, University of Milan\n\nzaniboni@dti.unimi.it\n\nAbstract\n\nWe provide a worst-case analysis of selective sampling algorithms for\nlearning linear threshold functions. The algorithms considered in this\npaper are Perceptron-like algorithms, i.e., algorithms which can be ef\ufb01-\nciently run in any reproducing kernel Hilbert space. Our algorithms ex-\nploit a simple margin-based randomized rule to decide whether to query\nthe current label. We obtain selective sampling algorithms achieving on\naverage the same bounds as those proven for their deterministic coun-\nterparts, but using much fewer labels. We complement our theoretical\n\ufb01ndings with an empirical comparison on two text categorization tasks.\nThe outcome of these experiments is largely predicted by our theoreti-\ncal results: Our selective sampling algorithms tend to perform as good\nas the algorithms receiving the true label after each classi\ufb01cation, while\nobserving in practice substantially fewer labels.\n\nIntroduction\n\n1\nIn this paper, we consider learning binary classi\ufb01cation tasks with partially labelled data\nvia selective sampling. A selective sampling algorithm (e.g., [3, 12, 7] and references\ntherein) is an on-line learning algorithm that receives a sequence of unlabelled instances,\nand decides whether or not to query the label of the current instance based on instances\nand labels observed so far. 
The idea is to let the algorithm determine which labels are most useful to its inference mechanism, so that redundant examples can be discarded on the fly and labels can be saved.

The overall goal of selective sampling is to fit real-world scenarios where labels are scarce or expensive. As a by now classical example, in a web-searching task, collecting web pages is a fairly automated process, but assigning them a label (a set of topics) often requires time-consuming and costly human expertise. In these cases, it is clearly important to devise learning algorithms with the ability to exploit the label information as much as possible. Furthermore, when we consider kernel-based algorithms [23, 9, 21], saving labels directly implies saving support vectors in the currently built hypothesis, which, in turn, implies saving running time in both the training and test phases.

Many algorithms have been proposed in the literature to cope with the broad task of learning with partially labelled data, working under both probabilistic and worst-case assumptions, for either on-line or batch settings. These range from active learning algorithms [8, 22], to the query-by-committee algorithm [12], to the adversarial "apple tasting" and label-efficient algorithms investigated in [16] and [17, 6], respectively. In this paper we present a worst-case analysis of two Perceptron-like selective sampling algorithms. Our analysis relies on and contributes to a well-established way of studying linear-threshold algorithms within the mistake bound model of on-line learning (e.g., [18, 15, 11, 13, 14, 5]).

*The authors gratefully acknowledge partial support by the PASCAL Network of Excellence under EC grant no. 506778. This publication only reflects the authors' views.
We show how to turn the standard versions of the (first-order) Perceptron algorithm [20] and the second-order Perceptron algorithm [5] into selective sampling algorithms exploiting a randomized margin-based criterion (inspired by [6]) to select labels, while preserving in expectation the same mistake bounds.

In a sense, this line of research complements an earlier work on selective sampling [7], where a second-order kind of algorithm was analyzed under precise stochastic assumptions about the way data are generated. Here the situation is reversed: we avoid any assumption whatsoever on the data-generating process, but we are still able to prove meaningful statements about the label efficiency features of our algorithms.

In order to give some empirical evidence for our analysis, we ran experiments on two medium-size text categorization tasks. These experiments confirm our theoretical results, and show the effectiveness of our margin-based label selection rule.

2 Preliminaries, notation

An example is a pair $(x, y)$, where $x \in \mathbb{R}^n$ is an instance vector and $y \in \{-1, +1\}$ is the associated binary label. A training set $S$ is any finite sequence of examples $S = (x_1, y_1), \ldots, (x_T, y_T) \in (\mathbb{R}^n \times \{-1, +1\})^T$. We say that $S$ is linearly separable if there exists a vector $u \in \mathbb{R}^n$ such that $y_t u^\top x_t > 0$ for $t = 1, \ldots, T$.

We consider the following selective sampling variant of a standard on-line learning model (e.g., [18, 24, 19, 15] and references therein). This variant has been investigated in [6] for a version of Littlestone's Winnow algorithm [18, 15]. Learning proceeds on-line in a sequence of trials. In the generic trial $t$ the algorithm receives instance $x_t$ from the environment, outputs a prediction $\hat{y}_t \in \{-1, +1\}$ about the label $y_t$ associated with $x_t$, and decides whether or not to query the label $y_t$.
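The trial-by-trial protocol just described can be sketched as a small driver loop. This is an illustrative skeleton, not code from the paper: the function names and the callback interface (`predict`, `query_rule`, `update`) are hypothetical placeholders for a concrete algorithm such as those analyzed in Section 3.

```python
def selective_sampling_run(stream, predict, query_rule, update):
    """Run one pass over `stream`, an iterable of (x_t, y_t) pairs.

    Returns (mistakes, queries). The true label y_t is looked at only
    when the query rule fires, but mistakes are counted on every trial,
    hidden labels included, as in the performance measure of Section 2.
    """
    mistakes = queries = 0
    for x_t, y_t in stream:
        y_hat = predict(x_t)          # prediction in {-1, +1}
        if y_hat != y_t:              # counted even if y_t stays hidden
            mistakes += 1
        if query_rule(x_t):           # decide whether to ask for y_t
            queries += 1
            update(x_t, y_t, y_hat)   # only now does the learner see y_t
    return mistakes, queries
```

The point of the interface is that `update` is reachable only through a positive query decision, so any learner plugged into this loop is label-efficient by construction.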
No matter what the algorithm decides, we say that the algorithm has made a prediction mistake if $\hat{y}_t \neq y_t$. We measure the performance of the algorithm by the total number of mistakes it makes on $S$ (including the trials where the true label remains hidden). Given a comparison class of predictors, the goal of the algorithm is to bound the amount by which this total number of mistakes differs, on an arbitrary sequence $S$, from some measure of the performance of the best predictor in hindsight within the comparison class. Since we are dealing with (zero-threshold) linear-threshold algorithms, it is natural to take the comparison class to be the set of all (zero-threshold) linear-threshold predictors, i.e., all (possibly normalized) vectors $u \in \mathbb{R}^n$. Given a margin value $\gamma > 0$, we measure the performance of $u$ on $S$ by its cumulative hinge loss¹ [11, 13] $\sum_{t=1}^T D_\gamma(u; (x_t, y_t))$, where $D_\gamma(u; (x_t, y_t)) = \max\{0,\, \gamma - y_t u^\top x_t\}$.

Broadly speaking, the goal of the selective sampling algorithm is to achieve the best bound on the number of mistakes with as few queried labels as possible. As in [6], our algorithms exploit a margin-based randomized rule to decide which labels to query. Thus, our mistake bounds are actually worst-case over the training sequence and average-case over the internal randomization of the algorithms. All expectations occurring in this paper are w.r.t. this randomization.

3 The algorithms and their analysis

As a simple example, we start by turning the classical Perceptron algorithm [20] into a worst-case selective sampling algorithm. The algorithm, described in Figure 1, has a real

¹The cumulative hinge loss measures to what extent hyperplane $u$ separates $S$ at margin $\gamma$. This is also called the soft margin in the SVM literature [23, 9, 21].

ALGORITHM Selective sampling Perceptron algorithm
Parameter $b > 0$.
Initialization: $v_0 = 0$; $k = 1$.
For $t = 1, 2, \ldots$ do:

1. 
Get instance vector $x_t \in \mathbb{R}^n$ and set $r_t = v_{k-1}^\top \hat{x}_t$, with $\hat{x}_t = x_t / \|x_t\|$;
2. predict with $\hat{y}_t = \mathrm{SGN}(r_t) \in \{-1, +1\}$;
3. draw a Bernoulli random variable $Z_t \in \{0, 1\}$ of parameter $\frac{b}{b + |r_t|}$;
4. if $Z_t = 1$ then:
   (a) ask for label $y_t \in \{-1, +1\}$,
   (b) if $\hat{y}_t \neq y_t$ then update as follows: $v_k = v_{k-1} + y_t \hat{x}_t$; $k \leftarrow k + 1$.

Figure 1: The selective sampling (first-order) Perceptron algorithm.

parameter $b > 0$ which might be viewed as a noise parameter, ruling the extent to which a linear threshold model fits the data at hand. The algorithm maintains a vector $v \in \mathbb{R}^n$ (whose initial value is zero). In each trial $t$ the algorithm observes an instance vector $x_t \in \mathbb{R}^n$ and predicts the binary label $y_t$ through the sign of the margin value $r_t = v_{k-1}^\top \hat{x}_t$. Then the algorithm decides whether to ask for the label $y_t$ through a simple randomized rule: a coin with bias $b/(b + |r_t|)$ is flipped; if the coin turns up heads ($Z_t = 1$ in Figure 1) then the label $y_t$ is revealed. Moreover, on a prediction mistake ($\hat{y}_t \neq y_t$) the algorithm updates vector $v_k$ according to the usual Perceptron additive rule. On the other hand, if either the coin turns up tails or $\hat{y}_t = y_t$, no update takes place. Notice that $k$ is incremented only when an update occurs. Thus, at the end of trial $t$, subscript $k$ counts the number of updates made so far (plus one). In the following theorem we prove that our selective sampling version of the Perceptron algorithm can achieve, in expectation, the same mistake bound as the standard Perceptron's using fewer labels. See Remark 1 for a discussion.

Theorem 1 Let $S = ((x_1, y_1), (x_2, y_2), \ldots, (x_T, y_T)) \in (\mathbb{R}^n \times \{-1, +1\})^T$ be any sequence of examples and $U_T$ be the (random) set of update trials for the algorithm in Figure 1 (i.e., the set of trials $t \leq T$ such that $\hat{y}_t \neq y_t$ and $Z_t = 1$).
Then the expected number of mistakes made by the algorithm in Figure 1 is upper bounded by

$$\inf_{\gamma > 0}\, \inf_{u \in \mathbb{R}^n} \left( \frac{2b+1}{2b}\, \mathbb{E}\Big[\sum_{t \in U_T} \tfrac{1}{\gamma}\, D_\gamma(u; (\hat{x}_t, y_t))\Big] + \frac{(2b+1)^2}{8b} \frac{\|u\|^2}{\gamma^2} \right).$$

The expected number of labels queried by the algorithm is equal to $\sum_{t=1}^T \mathbb{E}\big[\frac{b}{b + |r_t|}\big]$.

Proof. Let $M_t$ be the Bernoulli variable which is one iff $\hat{y}_t \neq y_t$, and denote by $k(t)$ the value of the update counter $k$ in trial $t$ just before the update $k \leftarrow k + 1$. Our goal is then to bound $\mathbb{E}[\sum_{t=1}^T M_t]$ from above. Consider the case when trial $t$ is such that $M_t Z_t = 1$. Then one can verify by direct inspection that choosing $r_t = v_{k(t)-1}^\top \hat{x}_t$ (as in Figure 1) yields

$$y_t u^\top \hat{x}_t - y_t r_t = \tfrac{1}{2}\|u - v_{k(t)-1}\|^2 - \tfrac{1}{2}\|u - v_{k(t)}\|^2 + \tfrac{1}{2}\|v_{k(t)-1} - v_{k(t)}\|^2,$$

holding for any $u \in \mathbb{R}^n$. On the other hand, if trial $t$ is such that $M_t Z_t = 0$ we have $v_{k(t)-1} = v_{k(t)}$. Hence we conclude that the equality

$$M_t Z_t \big(y_t u^\top \hat{x}_t - y_t r_t\big) = \tfrac{1}{2}\|u - v_{k(t)-1}\|^2 - \tfrac{1}{2}\|u - v_{k(t)}\|^2 + \tfrac{1}{2}\|v_{k(t)-1} - v_{k(t)}\|^2$$

actually holds for all trials $t$. We sum over $t = 1, \ldots, T$ while observing that $M_t Z_t = 1$ implies both $\|v_{k(t)-1} - v_{k(t)}\| = 1$ and $y_t r_t \leq 0$. Recalling that $v_{k(0)} = 0$ and rearranging we obtain

$$\sum_{t=1}^T M_t Z_t \big(y_t u^\top \hat{x}_t + |r_t| - \tfrac{1}{2}\big) \leq \tfrac{1}{2}\|u\|^2, \qquad \forall u \in \mathbb{R}^n. \qquad (1)$$

Now, since the previous inequality holds for any comparison vector $u \in \mathbb{R}^n$, we stretch $u$ to $\frac{b + 1/2}{\gamma}\, u$, where $\gamma > 0$ is a free parameter. Then, by the very definition of $D_\gamma(u; (\hat{x}_t, y_t))$, $y_t u^\top \hat{x}_t \geq \frac{b + 1/2}{\gamma}\big(\gamma - D_\gamma(u; (\hat{x}_t, y_t))\big)$ for all $\gamma > 0$. Plugging into (1) and rearranging,

$$\sum_{t=1}^T M_t Z_t \big(b + |r_t|\big) \leq \big(b + \tfrac{1}{2}\big) \sum_{t \in U_T} \tfrac{1}{\gamma}\, D_\gamma(u; (\hat{x}_t, y_t)) + \frac{(2b+1)^2}{8\gamma^2}\, \|u\|^2. \qquad (2)$$

From Figure 1 we see that $\mathbb{E}[Z_t \mid Z_1, \ldots, Z_{t-1}] = \frac{b}{b + |r_t|}$. Therefore, taking expectations on both sides of (2),

$$\mathbb{E}\Big[\sum_{t=1}^T M_t Z_t (b + |r_t|)\Big] = \sum_{t=1}^T \mathbb{E}\Big[\mathbb{E}\big[M_t Z_t (b + |r_t|) \mid Z_1, \ldots, Z_{t-1}\big]\Big] = \sum_{t=1}^T \mathbb{E}\Big[M_t (b + |r_t|)\, \mathbb{E}[Z_t \mid Z_1, \ldots, Z_{t-1}]\Big] = \mathbb{E}\Big[\sum_{t=1}^T M_t\Big]\, b.$$

Replacing back into (2) and dividing by $b$ proves the claimed bound on $\mathbb{E}[\sum_{t=1}^T M_t]$. The value of $\mathbb{E}[\sum_{t=1}^T Z_t]$ (the expected number of queried labels) trivially follows from $\mathbb{E}[\sum_{t=1}^T Z_t] = \mathbb{E}\big[\sum_{t=1}^T \mathbb{E}[Z_t \mid Z_1, \ldots, Z_{t-1}]\big]$.

We now consider the selective sampling version of the second-order Perceptron algorithm, as defined in [5]. See Figure 2. Unlike the first-order algorithm, the second-order algorithm maintains a vector $v \in \mathbb{R}^n$ and a matrix $A \in \mathbb{R}^{n \times n}$ (whose initial value is the identity matrix $I$).

ALGORITHM Selective sampling second-order Perceptron algorithm
Parameter $b > 0$.
Initialization: $A_0 = I$; $v_0 = 0$; $k = 1$.
For $t = 1, 2, \ldots$ do:

1. Get $x_t \in \mathbb{R}^n$ and set $r_t = v_{k-1}^\top (A_{k-1} + \hat{x}_t \hat{x}_t^\top)^{-1} \hat{x}_t$, with $\hat{x}_t = x_t / \|x_t\|$;
2. predict with $\hat{y}_t = \mathrm{SGN}(r_t) \in \{-1, +1\}$;
3. draw a Bernoulli random variable $Z_t \in \{0, 1\}$ of parameter

$$\frac{b}{b + |r_t| + \frac{1}{2}\, r_t^2 \big(1 + \hat{x}_t^\top A_{k-1}^{-1} \hat{x}_t\big)}; \qquad (3)$$

4. if $Z_t = 1$ then:
   (a) ask for label $y_t \in \{-1, +1\}$,
   (b) if $\hat{y}_t \neq y_t$ then update as follows: $v_k = v_{k-1} + y_t \hat{x}_t$; $A_k = A_{k-1} + \hat{x}_t \hat{x}_t^\top$; $k \leftarrow k + 1$.

Figure 2: The selective sampling second-order Perceptron algorithm.
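For concreteness, the rule of Figure 1 can be sketched in a few lines of primal-form code (a dual, i.e. kernelized, variant is what Remark 2 below refers to). This is an illustrative sketch, not the authors' code: the class name and constructor arguments are invented, SGN(0) is taken to be +1 by convention, and the instance is assumed nonzero.

```python
import math
import random

class SelectiveSamplingPerceptron:
    """First-order selective sampling Perceptron of Figure 1 (primal form)."""

    def __init__(self, n, b=1.0, seed=0):
        self.v = [0.0] * n            # weight vector v_{k-1}, initially zero
        self.b = float(b)             # scale ("noise") parameter b > 0
        self.rng = random.Random(seed)

    def step(self, x, get_label):
        """One trial: predict, flip the coin, and possibly query and update.

        `get_label()` is invoked only when the Bernoulli coin Z_t = 1,
        mirroring the protocol in which labels otherwise stay hidden.
        """
        norm = math.sqrt(sum(xi * xi for xi in x))
        x_hat = [xi / norm for xi in x]                    # x_t / ||x_t||
        r = sum(vi * xi for vi, xi in zip(self.v, x_hat))  # margin r_t
        y_hat = 1 if r >= 0 else -1                        # SGN(r_t), SGN(0) := +1
        queried = self.rng.random() < self.b / (self.b + abs(r))
        if queried and (y := get_label()) != y_hat:        # update only on a mistake
            self.v = [vi + y * xi for vi, xi in zip(self.v, x_hat)]
        return y_hat, queried
```

Note that on the very first trial $r_1 = 0$, so the coin has bias $b/(b+0) = 1$ and the label is always queried; large margins, in contrast, make a query increasingly unlikely.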
The algorithm predicts through the sign of the margin quantity $r_t = v_{k-1}^\top (A_{k-1} + \hat{x}_t \hat{x}_t^\top)^{-1} \hat{x}_t$, and decides whether to ask for the label $y_t$ through a randomized rule similar to the one in Figure 1. The analysis follows the same pattern as the proof of Theorem 1. A key step in this analysis is a one-trial progress equation developed in [10] for a regression framework. See also [4]. Again, the comparison between the second-order Perceptron's bound and the one contained in Theorem 2 reveals that the selective sampling algorithm can achieve, in expectation, the same mistake bound (see Remark 1) using fewer labels.

Theorem 2 Using the notation of Theorem 1, the expected number of mistakes made by the algorithm in Figure 2 is upper bounded by

$$\inf_{\gamma > 0}\, \inf_{u \in \mathbb{R}^n} \left( \frac{1}{\gamma}\, \mathbb{E}\Big[\sum_{t \in U_T} D_\gamma(u; (\hat{x}_t, y_t))\Big] + \frac{b}{2\gamma^2}\, u^\top \mathbb{E}\big[A_{k(T)}\big]\, u + \frac{1}{2b} \sum_{i=1}^n \mathbb{E} \ln(1 + \lambda_i) \right),$$

where $\lambda_1, \ldots, \lambda_n$ are the eigenvalues of the (random) correlation matrix $\sum_{t \in U_T} \hat{x}_t \hat{x}_t^\top$ and $A_{k(T)} = I + \sum_{t \in U_T} \hat{x}_t \hat{x}_t^\top$ (thus $1 + \lambda_i$ is the $i$-th eigenvalue of $A_{k(T)}$). The expected number of labels queried by the algorithm is equal to $\sum_{t=1}^T \mathbb{E}\Big[\frac{b}{b + |r_t| + \frac{1}{2} r_t^2 (1 + \hat{x}_t^\top A_{k-1}^{-1} \hat{x}_t)}\Big]$.

Proof sketch. The proof proceeds along the same lines as the proof of Theorem 1. Thus we only emphasize the main differences. In addition to the notation given there, we define $U_t$ as the set of update trials up to time $t$, i.e., $U_t = \{i \leq t : M_i Z_i = 1\}$, and $R_t$ as the (random) function $R_t(u) = \frac{1}{2}\|u\|^2 + \sum_{i \in U_t} \frac{1}{2}(y_i - u^\top \hat{x}_i)^2$. When trial $t$ is such that $M_t Z_t = 1$ we can exploit a result contained in [10] for linear regression (proof of Theorem 3 therein), where it is essentially shown that choosing $r_t = v_{k(t)-1}^\top A_{k(t)}^{-1} \hat{x}_t$ (as in Figure 2) yields

$$\tfrac{1}{2}(y_t - r_t)^2 = \inf_{u \in \mathbb{R}^n} R_t(u) - \inf_{u \in \mathbb{R}^n} R_{t-1}(u) + \tfrac{1}{2}\Big(\hat{x}_t^\top A_{k(t)}^{-1} \hat{x}_t - r_t^2\, \hat{x}_t^\top A_{k(t)-1}^{-1} \hat{x}_t\Big). \qquad (4)$$

On the other hand, if trial $t$ is such that $M_t Z_t = 0$ we have $U_t = U_{t-1}$, thus $\inf_{u \in \mathbb{R}^n} R_{t-1}(u) = \inf_{u \in \mathbb{R}^n} R_t(u)$. Hence the equality

$$\tfrac{1}{2} M_t Z_t \Big((y_t - r_t)^2 + r_t^2\, \hat{x}_t^\top A_{k(t)-1}^{-1} \hat{x}_t\Big) = \inf_{u \in \mathbb{R}^n} R_t(u) - \inf_{u \in \mathbb{R}^n} R_{t-1}(u) + \tfrac{1}{2} M_t Z_t\, \hat{x}_t^\top A_{k(t)}^{-1} \hat{x}_t \qquad (5)$$

holds for all trials $t$. We sum over $t = 1, \ldots, T$, and observe that by definition $R_T(u) = \frac{1}{2}\|u\|^2 + \sum_{i \in U_T} \frac{1}{2}(y_i - u^\top \hat{x}_i)^2$ and $R_0(u) = \frac{1}{2}\|u\|^2$ (thus $\inf_{u \in \mathbb{R}^n} R_0(u) = 0$). After some manipulation one can see that (5) implies

$$\sum_{t=1}^T M_t Z_t \Big(y_t u^\top \hat{x}_t + |r_t| + \tfrac{1}{2} r_t^2 \big(1 + \hat{x}_t^\top A_{k(t)-1}^{-1} \hat{x}_t\big)\Big) \leq \tfrac{1}{2}\, u^\top A_{k(T)} u + \sum_{t=1}^T \tfrac{1}{2} M_t Z_t\, \hat{x}_t^\top A_{k(t)}^{-1} \hat{x}_t, \qquad (6)$$

holding for any $u \in \mathbb{R}^n$. We continue by elaborating on (6). First, as in [4, 10, 5], we upper bound the quadratic terms $\hat{x}_t^\top A_{k(t)}^{-1} \hat{x}_t$ by² $\ln \frac{\det(A_{k(t)})}{\det(A_{k(t)-1})}$. This gives

$$\sum_{t=1}^T \tfrac{1}{2} M_t Z_t\, \hat{x}_t^\top A_{k(t)}^{-1} \hat{x}_t \leq \tfrac{1}{2} \ln \frac{\det(A_{k(T)})}{\det(A_0)} = \tfrac{1}{2} \sum_{i=1}^n \ln(1 + \lambda_i).$$

Second, as in the proof of Theorem 1, we stretch the comparison vector $u \in \mathbb{R}^n$ to $\frac{b}{\gamma} u$ to introduce hinge loss terms. We obtain:

$$\sum_{t=1}^T M_t Z_t \Big(b + |r_t| + \tfrac{1}{2} r_t^2 \big(1 + \hat{x}_t^\top A_{k(t)-1}^{-1} \hat{x}_t\big)\Big) \leq b \sum_{t \in U_T} \tfrac{1}{\gamma}\, D_\gamma(u; (\hat{x}_t, y_t)) + \frac{b^2}{2\gamma^2}\, u^\top A_{k(T)} u + \tfrac{1}{2} \sum_{i=1}^n \ln(1 + \lambda_i).$$

The bounds on $\mathbb{E}[\sum_{t=1}^T M_t]$ and $\mathbb{E}[\sum_{t=1}^T Z_t]$ can now be obtained by following the proof of Theorem 1.

Remark 1 The bounds in Theorems 1 and 2 depend on the choice of parameter $b$. As a matter of fact, the optimal tuning of this parameter is easily computed. Let us set for brevity $\hat{D}_\gamma(u; S) = \mathbb{E}\big[\sum_{t \in U_T} \tfrac{1}{\gamma} D_\gamma(u; (\hat{x}_t, y_t))\big]$. Choosing³

$$b = \tfrac{1}{2} \sqrt{1 + 4\gamma^2\, \hat{D}_\gamma(u; S) / \|u\|^2} \qquad (7)$$

in Theorem 1 gives the following bound on the expected number of mistakes:

$$\inf_{u \in \mathbb{R}^n} \left( \hat{D}_\gamma(u; S) + \frac{\|u\|^2}{2\gamma^2} + \frac{\|u\|}{\gamma} \sqrt{\hat{D}_\gamma(u; S) + \frac{\|u\|^2}{4\gamma^2}} \right). \qquad (8)$$

This is an expectation version of the mistake bound for the standard (first-order) Perceptron algorithm [14]. Notice that in the special case when the data are linearly separable with margin $\gamma^*$ the optimal tuning simplifies to $b = 1/2$ and yields the familiar Perceptron bound $\|u\|^2 / (\gamma^*)^2$. On the other hand, if we set $b = \gamma \sqrt{\sum_{i=1}^n \mathbb{E} \ln(1 + \lambda_i) \,/\, \big(u^\top \mathbb{E}[A_{k(T)}]\, u\big)}$ in Theorem 2 we are led to the bound

$$\inf_{u \in \mathbb{R}^n} \left( \hat{D}_\gamma(u; S) + \frac{1}{\gamma} \sqrt{\big(u^\top \mathbb{E}[A_{k(T)}]\, u\big) \sum_{i=1}^n \mathbb{E} \ln(1 + \lambda_i)} \right), \qquad (9)$$

which is an expectation version of the mistake bound for the (deterministic) second-order Perceptron algorithm, as proven in [5]. As it turns out, (8) and (9) might be even sharper than their deterministic counterparts. In fact, the set of update trials $U_T$ is on average significantly smaller than the one for the deterministic algorithms. This tends to shrink the three terms $\hat{D}_\gamma(u; S)$, $u^\top \mathbb{E}[A_{k(T)}]\, u$, and $\sum_{i=1}^n \mathbb{E} \ln(1 + \lambda_i)$, the main ingredients of the selective sampling bounds.

²Here det denotes the determinant.
³Clearly, this tuning relies on information not available ahead of time, since it depends on the whole sequence of examples. The same holds for the choice of $b$ giving rise to (9).

Remark 2 Like any Perceptron-like algorithm, the algorithms in Figures 1 and 2 can be efficiently run in any given reproducing kernel Hilbert space (e.g., [9, 21, 23]), just by turning them into equivalent dual forms. This is actually what we did in the experiments reported in the next section.

4 Experiments

The empirical evaluation of our algorithms was carried out on two datasets of free-text documents. The first dataset is made up of the first (in chronological order) 40,000 newswire stories from Reuters Corpus Volume 1 (RCV1) [2]. The resulting set of examples was classified over 101 categories. The second dataset is a specific subtree of the OHSUMED corpus of medical abstracts [1]: the subtree rooted in "Quality of Health Care" (MeSH code N05.712). From this subtree we randomly selected a subset of 40,000 abstracts. The resulting number of categories was 94. We performed a standard preprocessing on the datasets; details will be given in the full paper.

Two kinds of experiments were made on each dataset. In the first experiment we compared the selective sampling algorithms in Figures 1 and 2 (for different values of $b$) with the standard second-order Perceptron algorithm (requesting all labels).
Such a comparison was devoted to studying the extent to which a reduced number of label requests might lead to performance degradation. In the second experiment, we compared variable vs. constant label-request rate. That is, we fixed a few values for parameter $b$, ran the selective sampling algorithm in Figure 2, and computed the fraction of labels requested over the training set. Call this fraction $\hat{p} = \hat{p}(b)$. We then ran a second-order selective sampling algorithm with (constant) label request probability equal to $\hat{p}$ (independent of $t$). The aim of this experiment was to investigate the effectiveness of a margin-based selective sampling criterion, as opposed to a random one.

Figure 3 summarizes the results we obtained on RCV1 (the results on OHSUMED turned out to be similar, and are therefore omitted from this paper). For the purpose of this graphical representation, we selected the 50 most frequent categories from RCV1, those with frequency larger than 1%. The standard second-order algorithm is denoted by 2ND-ORDER-ALL-LABELS, the selective sampling algorithms in Figures 1 and 2 are denoted by 1ST-ORDER and 2ND-ORDER, respectively, whereas the second-order algorithm with constant label request is denoted by 2ND-ORDER-FIXED-BIAS.⁴ As evinced by Figure 3(a), there is a range of values for parameter $b$ that makes 2ND-ORDER achieve almost the same performance as 2ND-ORDER-ALL-LABELS, but with a substantial reduction in the total number of queried labels.⁵ In Figure 3(b) we report the results of running 2ND-ORDER, 1ST-ORDER and 2ND-ORDER-FIXED-BIAS after choosing values for $b$ that make the average F-measure achieved by 2ND-ORDER just slightly larger than those achieved by the other two algorithms. We then compared the resulting label request rates and found 2ND-ORDER largely the best among the three algorithms (its instantaneous label rate after 40,000 examples is less than 19%).
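The two query rules being compared can be captured in a tiny sketch: the margin-based request probability $b/(b + |r_t|)$ of Figure 1 versus a coin with constant bias $\hat{p}$. This is illustrative only; the margin values and the setting of $b$ below are made up, and the helper names are ours, not the paper's.

```python
import random

def margin_query_prob(r, b):
    """Margin-based label-request probability: b / (b + |r|)."""
    return b / (b + abs(r))

def estimate_label_rate(margins, b, seed=0):
    """Fraction of labels a margin-based sampler would request on a given
    (hypothetical) margin sequence; this plays the role of p_hat(b) above."""
    rng = random.Random(seed)
    asked = sum(rng.random() < margin_query_prob(r, b) for r in margins)
    return asked / len(margins)

# Margins far from zero are queried rarely, so the margin-based rule
# concentrates its label budget on uncertain trials, unlike a constant bias.
margins = [0.0, 0.05, 1.0, 2.0]
probs = [margin_query_prob(r, b=0.1) for r in margins]
# probs is approximately [1.0, 0.667, 0.091, 0.048]
```

A fixed-bias sampler spending the same overall budget `estimate_label_rate(margins, b)` would instead query all trials with equal probability, which is exactly the contrast the second experiment measures.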
We made similar experiments for specific categories in RCV1. On the most frequent ones (such as category 70; see Figure 3(c)) this behavior gets emphasized. Finally, in Figure 3(d) we report a direct macroaveraged F-measure comparison

⁴We omitted to report on the first-order algorithms 1ST-ORDER-ALL-LABELS and 1ST-ORDER-FIXED-BIAS, since they are always outperformed by their corresponding second-order algorithms.

⁵Notice that the figures are plotting instantaneous label rates, hence the overall fraction of queried labels is obtained by integration.
Figure 3: Instantaneous F-measure and instantaneous label-request rate on the RCV1 dataset. We solved a binary classification problem for each class and then (macro)averaged the results. All curves tend to flatten after about 24,000 examples (out of 40,000). (a) Instantaneous macroaveraged F-measure of 2ND-ORDER (for three values of b) and the corresponding label-request curves. For the sake of comparison, we also included the F-measure of 2ND-ORDER-ALL-LABELS. (b) Comparison among 2ND-ORDER, 1ST-ORDER and 2ND-ORDER-FIXED-BIAS. (c) The same comparison on a specific category. (d) F-measure of 2ND-ORDER vs. F-measure of 2ND-ORDER-FIXED-BIAS for 5 values of parameter b, after 40,000 examples.

between 2ND-ORDER and 2ND-ORDER-FIXED-BIAS for 5 values of $b$. On the x-axis are the resulting 5 values of the constant bias $\hat{p}(b)$. As expected, 2ND-ORDER outperforms 2ND-ORDER-FIXED-BIAS, though the difference between the two tends to shrink as $b$ (or, equivalently, $\hat{p}(b)$) gets larger.

5 Conclusions and open problems

We have introduced new Perceptron-like selective sampling algorithms for learning linear-threshold functions. We analyzed these algorithms in a worst-case on-line learning setting, providing bounds on both the expected number of mistakes and the expected number of labels requested. Our theoretical investigation naturally arises from the traditional way margin-based algorithms are analyzed in the mistake bound model of on-line learning [18, 15, 11, 13, 14, 5].
This investigation suggests that our worst-case selective sampling algorithms can achieve on average the same accuracy as that of their more standard relatives, while allowing a substantial label saving. These theoretical results are corroborated by our empirical comparison on textual data, where we have shown that: (1) the selective sampling algorithms tend to be unaffected by observing fewer and fewer labels; (2) if we fix ahead of time the total number of label observations, the margin-driven way of distributing these observations over the training set is far more effective than a random one.

We close with two simple open questions. (1) Our selective sampling algorithms depend on a scale parameter $b$ having a significant influence on their practical performance. Is there any principled way of adaptively tuning $b$ so as to reduce the algorithms' sensitivity to tuning parameters? (2) Theorems 1 and 2 do not make any explicit statement about the number of weight updates/support vectors computed by our selective sampling algorithms. We would like to see a theoretical argument that enables us to combine the bound on the number of mistakes with that on the number of labels, giving rise to a meaningful upper bound on the number of updates.

References

[1] The OHSUMED test collection. URL: medir.ohsu.edu/pub/ohsumed/.
[2] Reuters corpus volume 1. URL: about.reuters.com/researchandstandards/corpus/.
[3] Atlas, L., Cohn, R., and Ladner, R. (1990). Training connectionist networks with queries and selective sampling. In NIPS 2. MIT Press.
[4] Azoury, K.S., and Warmuth, M.K. (2001). Relative loss bounds for on-line density estimation with the exponential family of distributions. Machine Learning, 43(3):211-246.
[5] Cesa-Bianchi, N., Conconi, A., and Gentile, C. (2002). A second-order Perceptron algorithm. In Proc. 15th COLT, pp. 121-137. LNAI 2375, Springer.
[6] Cesa-Bianchi, N.,
Lugosi, G., and Stoltz, G. (2004). Minimizing regret with label efficient prediction. In Proc. 17th COLT, to appear.
[7] Cesa-Bianchi, N., Conconi, A., and Gentile, C. (2003). Learning probabilistic linear-threshold classifiers via selective sampling. In Proc. 16th COLT, pp. 373-386. LNAI 2777, Springer.
[8] Campbell, C., Cristianini, N., and Smola, A. (2000). Query learning with large margin classifiers. In Proc. 17th ICML, pp. 111-118. Morgan Kaufmann.
[9] Cristianini, N., and Shawe-Taylor, J. (2001). An Introduction to Support Vector Machines. Cambridge University Press.
[10] Forster, J. (1999). On relative loss bounds in generalized linear regression. In Proc. 12th Int. Symp. FCT, pp. 269-280. Springer.
[11] Freund, Y., and Schapire, R.E. (1999). Large margin classification using the perceptron algorithm. Machine Learning, 37(3):277-296.
[12] Freund, Y., Seung, S., Shamir, E., and Tishby, N. (1997). Selective sampling using the query by committee algorithm. Machine Learning, 28(2/3):133-168.
[13] Gentile, C., and Warmuth, M. (1998). Linear hinge loss and average margin. In NIPS 10, pp. 225-231. MIT Press.
[14] Gentile, C. (2003). The robustness of the p-norm algorithms. Machine Learning, 53(3):265-299.
[15] Grove, A.J., Littlestone, N., and Schuurmans, D. (2001). General convergence results for linear discriminant updates. Machine Learning, 43(3):173-210.
[16] Helmbold, D.P., Littlestone, N., and Long, P.M. (2000). Apple tasting. Information and Computation, 161(2):85-139.
[17] Helmbold, D.P., and Panizza, S. (1997). Some label efficient learning results. In Proc. 10th COLT, pp. 218-230. ACM Press.
[18] Littlestone, N. (1988). Learning quickly when irrelevant attributes abound: a new linear-threshold algorithm. Machine Learning, 2(4):285-318.
[19] Littlestone, N., and Warmuth, M.K. (1994).
The weighted majority algorithm. Information and Computation, 108(2):212-261.
[20] Rosenblatt, F. (1958). The Perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65:386-408.
[21] Schölkopf, B., and Smola, A. (2002). Learning with Kernels. MIT Press.
[22] Tong, S., and Koller, D. (2000). Support vector machine active learning with applications to text classification. In Proc. 17th ICML. Morgan Kaufmann.
[23] Vapnik, V.N. (1998). Statistical Learning Theory. Wiley.
[24] Vovk, V. (1990). Aggregating strategies. In Proc. 3rd COLT, pp. 371-383. Morgan Kaufmann.
", "award": [], "sourceid": 2584, "authors": [{"given_name": "Nicol\u00f2", "family_name": "Cesa-bianchi", "institution": null}, {"given_name": "Claudio", "family_name": "Gentile", "institution": null}, {"given_name": "Luca", "family_name": "Zaniboni", "institution": null}]}