{"title": "Online Learning with Costly Features and Labels", "book": "Advances in Neural Information Processing Systems", "page_first": 1241, "page_last": 1249, "abstract": "This paper introduces the \"online probing\" problem: In each round, the learner is able to purchase the values of a subset of features. After the learner uses this information to come up with a prediction for the given round, he then has the option of paying for seeing the loss that he is evaluated against. Either way, the learner pays for the imperfections of his predictions and whatever he chooses to observe, including the cost of observing the loss function for the given round and the cost of the observed features. We consider two variations of this problem, depending on whether the learner can observe the label for free or not. We provide algorithms and upper and lower bounds on the regret for both variants. We show that a positive cost for observing the label significantly increases the regret of the problem.", "full_text": "Online Learning with Costly Features and Labels\n\nNavid Zolghadr\n\nDepartment of Computing Science\n\nUniversity of Alberta\n\nzolghadr@ualberta.ca\n\nG\u00e1bor Bart\u00f3k\n\nDepartment of Computer Science\n\nETH Z\u00fcrich\n\nbartok@inf.ethz.ch\n\nRussell Greiner\n\nAndr\u00e1s Gy\u00f6rgy\n\nCsaba Szepesv\u00e1ri\n\nDepartment of Computing Science, University of Alberta\n{rgreiner,gyorgy,szepesva}@ualberta.ca\n\nAbstract\n\nThis paper introduces the online probing problem: In each round, the learner is\nable to purchase the values of a subset of features. After the learner uses\nthis information to come up with a prediction for the given round, he then has the\noption of paying to see the loss function that he is evaluated against. 
Either way,\nthe learner pays for both the errors of his predictions and whatever he chooses\nto observe, including the cost of observing the loss function for the given round\nand the cost of the observed features. We consider two variations of this problem,\ndepending on whether the learner can observe the label for free or not. We provide\nalgorithms and upper and lower bounds on the regret for both variants. We show\nthat a positive cost for observing the label significantly increases the regret of the\nproblem.\n\n1 Introduction\n\nIn this paper, we study a variant of online learning, called online probing, which is motivated by\npractical problems where there is a cost to observing the features that may help one\u2019s predictions.\nOnline probing is a class of online learning problems. Just like in standard online learning problems,\nthe learner\u2019s goal is to produce a good predictor. In each time step $t$, the learner produces his\nprediction based on the values of some feature vector $x_t = (x_{t,1}, \\ldots, x_{t,d})^\\top \\in \\mathcal{X} \\subset \\mathbb{R}^d$.\u00b9 However, unlike\nin the standard online learning settings, if the learner wants to use the value of feature $i$ to produce a\nprediction, he has to purchase the value at some fixed, a priori known cost, $c_i \\ge 0$. Features whose\nvalue is not purchased in a given round remain unobserved by the learner. Once a prediction $\\hat{y}_t \\in \\mathcal{Y}$\nis produced, it is evaluated against a loss function $\\ell_t : \\mathcal{Y} \\to \\mathbb{R}$. At the end of a round, the learner\nhas the option of purchasing the full loss function, again at a fixed prespecified cost $c_{d+1} \\ge 0$ (by\ndefault, the loss function is not revealed to the learner). The learner\u2019s performance is measured by his\nregret as he competes against some prespecified set of predictors. Just like the learner, a competing\npredictor also needs to purchase the feature values needed in the prediction. 
If $s_t \\in \\{0, 1\\}^{d+1}$ is the\nindicator vector denoting what the learner purchased in round $t$ ($s_{t,i} = 1$ if the learner purchased\n$x_{t,i}$ for $1 \\le i \\le d$, and purchased the label for $i = d + 1$) and $c \\in [0, \\infty)^{d+1}$ denotes the respective\ncosts, then the regret with respect to a class of prediction functions $\\mathcal{F} \\subset \\{f \\mid f : \\mathcal{X} \\to \\mathcal{Y}\\}$\nis defined by\n\n$$R_T = \\sum_{t=1}^{T} \\{\\ell_t(\\hat{y}_t) + \\langle s_t, c \\rangle\\} - \\inf_{f \\in \\mathcal{F}} \\Big\\{ T \\langle s(f), c_{1:d} \\rangle + \\sum_{t=1}^{T} \\ell_t(f(x_t)) \\Big\\} ,$$\n\nwhere $c_{1:d} \\in \\mathbb{R}^d$ is the vector obtained from $c$ by dropping its last component and for a given function\n$f : \\mathbb{R}^d \\to \\mathcal{Y}$, $s(f) \\in \\{0, 1\\}^d$ is an indicator vector whose $i$th component indicates whether $f$\nis sensitive to its $i$th input (in particular, $s_i(f) = 0$ by definition when\n$f(x_1, \\ldots, x_i, \\ldots, x_d) = f(x_1, \\ldots, x'_i, \\ldots, x_d)$ holds for all\n$(x_1, \\ldots, x_i, \\ldots, x_d), (x_1, \\ldots, x'_i, \\ldots, x_d) \\in \\mathcal{X}$; otherwise\n$s_i(f) = 1$). Note that when defining the best competitor in hindsight, we did not include the cost of\nobserving the loss function. This is because (i) the reference predictors do not need it; and (ii) if we\ndid include the cost of observing the loss function for the reference predictors, then the loss of each\npredictor would just be increased by $c_{d+1} T$, and so the regret $R_T$ would just be reduced by $c_{d+1} T$,\nmaking it substantially easier for the learner to achieve sublinear regret. Thus, we prefer the current\nregret definition as it promotes the study of regret when there is a price attached to observing the\nloss functions.\n\n\u00b9 We use $^\\top$ to denote the transpose of vectors. Throughout, all vectors $x \\in \\mathbb{R}^d$ denote column vectors.\n\nTo motivate our framework, consider the problem of developing a computer-assisted diagnostic tool\nto determine what treatment to apply to a patient in a subpopulation of patients. When a patient\narrives, the computer can order a number of tests that cost money, while other information (e.g., the\nmedical record of the patient) is available for free. 
Based on the available information, the system\nchooses a treatment. Following up on the patient may or may not incur additional cost. In this example,\nthere is typically a delay in learning whether the treatment was effective. However,\nfor simplicity, in this work we have decided not to study the effect of this delay. Several works in\nthe literature show that delays usually increase the regret in a moderate fashion (Mesterharm, 2005;\nWeinberger and Ordentlich, 2006; Agarwal and Duchi, 2011; Joulani et al., 2013).\nAs another example, consider the problem of product testing in a manufacturing process (e.g., the\nproduction of electronic consumer devices). When the product arrives, it can be subjected to a\nlarge number of diagnostic tests that differ in terms of their costs and effectiveness. The goal is to\npredict whether the product is defect-free. Obtaining the ground truth can also be quite expensive,\nespecially for complex products. The challenge is that the effectiveness of the various tests is often\na priori unknown and that different tests may provide complementary information (meaning that\nmany tests may be required). Hence, it might be challenging to decide what form the most\ncost-effective diagnostic procedure may take. Yet another example is the problem of developing a\ncost-effective way of instrument calibration. In this problem, the goal is to predict one or more real-valued\nparameters of some product. Again, various tests with different costs and reliability can be used as\nthe input to the predictor.\nFinally, although we pose the task as an online learning problem, it is easy to show that the\nprocedures we develop can also be used to attack the batch learning problem, when the goal is to learn a\npredictor that will be cost-efficient on future data given a database of examples.\nObviously, when observing the loss is costly, the problem is related to active learning. 
However, to\nthe best of our knowledge, the case when observing the features is costly has not been studied before in\nthe online learning literature. Section 1.1 discusses the relationship of our work to the existing\nliterature in more detail.\nThis paper analyzes two versions of the online problem. In the first version, free-label online\nprobing, there is no cost to seeing the loss function, that is, $c_{d+1} = 0$. (The loss function often compares\nthe predicted value with some label in a known way, in which case learning the value of the label\nfor the round means that the whole loss function becomes known; hence the choice of the name.)\nThus, the learner naturally will choose to see the loss function after he provides his prediction; this\nprovides feedback that the learner can use to improve the predictor he produces. In the second\nversion, non-free-label online probing, the cost of seeing the loss function is positive: $c_{d+1} > 0$.\nIn Section 2 we study the case of free-label online probing. We give an algorithm that enjoys a regret\nof $O(\\sqrt{2^d L T \\ln N_T(1/(TL))})$ when the losses are $L$-equi-Lipschitz (Theorem 2.2), where $N_T(\\varepsilon)$\nis the $\\varepsilon$-covering number of $\\mathcal{F}$ on sequences of length $T$. This leads to an $\\tilde{O}(\\sqrt{2^d L T})$ regret bound\nfor typical function classes, such as the class of linear predictors with bounded weights and bounded\ninputs. We also show that, in the worst case, the exponential dependence on the dimension cannot\nbe avoided in the bound. For the special case of linear prediction with quadratic loss, we give an\nalgorithm whose regret scales only as $\\tilde{O}(\\sqrt{dT})$, a vast improvement in the dependence on $d$.\nThe case of non-free-label online probing is treated in Section 3. Here, in contrast to the free-label\ncase, we prove that the minimax growth rate of the regret is of the order $\\tilde{\\Theta}(T^{2/3})$. The increase of\nregret-rate stems from the fact that the \u201cbest competitor in hindsight\u201d does not have to pay for the\nlabel. 
In contrast to the previous case, since the label is costly here, if the algorithm decides to see the\nlabel it does not even have to reason about which features to observe, as querying the label requires\npaying a cost that is a constant over the cost of the best predictor in hindsight, already resulting in\nthe $\\tilde{\\Theta}(T^{2/3})$ regret rate. However, in practice (for shorter horizons) it still makes sense to select the\nfeatures that provide the best balance between the feature-cost and the prediction loss. Although we\ndo not study this, we note that by combining the algorithmic ideas developed for the free-label case\nwith the ideas developed for the non-free-label case, it is possible to derive an algorithm that reasons\nactively about the cost of observing the features, too.\nIn the part dealing with the free-label problem, we build heavily on the results of Mannor and\nShamir (2011), while in the part dealing with the non-free-label problem we build on the ideas of\nCesa-Bianchi et al. (2006). Due to space limitations, all of our proofs are relegated to the appendix.\n\n1.1 Related Work\n\nThis paper analyzes online learning when features (and perhaps labels) have to be purchased. The\nstandard \u201cbatch learning\u201d framework has a pure explore phase, which gives the learner a set of\nlabeled, completely specified examples, followed by a pure exploit phase, where the learned\npredictor is asked to predict the label for novel instances. Notice that the learner is not required (nor even\nallowed) to decide which information to gather. By contrast, \u201cactive (batch) learning\u201d requires\nthe learner to identify that information (Settles, 2009). Most such active learners begin with\ncompletely specified, but unlabeled instances; they then purchase labels for a subset of the instances.\nOur model, however, requires the learner to purchase feature values as well. 
This is similar to the\n\u201cactive feature-purchasing learning\u201d framework (Lizotte et al., 2003). This is extended in Kapoor\nand Greiner (2005) to a version that requires the eventual predictor (as well as the learner) to pay\nto see feature values. However, these are still in the batch framework: after gathering the\ninformation, the learner produces a predictor, which is not changed afterwards.\nOur problem is an online problem over multiple rounds, where at each round the learner is required\nto predict the label for the current example. Standard online learning algorithms typically assume\nthat each example is given with all the features. For example, Cesa-Bianchi et al. (2005) provided\nupper and lower bounds on the regret where the learner is given all the features for each example,\nbut must pay for any labels he requests. In our problem, the learner must pay to see the values of\nthe features of each example as well as to obtain its true label at each round. This cost\nmodel means there is an advantage to finding a predictor that involves few features, as long as it\nis sufficiently accurate. The challenge, of course, is finding these relevant features, which happens\nduring this online learning process.\nOther works, in particular Rostamizadeh et al. (2011) and Dekel et al. (2010), assume the features\nof different examples might be corrupted, missing, or partially observed due to various problems,\nsuch as failure in sensors gathering these features. Having such missing features is realistic in many\napplications. Rostamizadeh et al. (2011) provided an algorithm for this task in the online setting,\nwith optimal $O(\\sqrt{T})$ regret, where $T$ is the number of rounds. 
Our model differs from this model as\nin our case the learner has the option to obtain the values of only the subset of the features that he\nselects.\n\n2 Free-Label Probing\n\nIn this section we consider the case when the cost of observing the loss function is zero. Thus,\nwe can assume without loss of generality that the learner receives the loss function at the end of\neach round (i.e., $s_{t,d+1} = 1$). We will first consider the general setting where the only restriction is\nthat the losses are equi-Lipschitz and the function set $\\mathcal{F}$ has a finite empirical worst-case covering\nnumber. Then we consider the special case where the set of competitors consists of the linear predictors and\nthe losses are quadratic.\n\n2.1 The Case of Lipschitz losses\n\nIn this section we assume that the loss functions, $\\ell_t$, are Lipschitz with a known, common Lipschitz\nconstant $L$ over $\\mathcal{Y}$ w.r.t. some semi-metric $d_{\\mathcal{Y}}$ of $\\mathcal{Y}$: for all $t \\ge 1$ and all $y, y' \\in \\mathcal{Y}$,\n\n$$|\\ell_t(y) - \\ell_t(y')| \\le L \\, d_{\\mathcal{Y}}(y, y') . \\qquad (1)$$\n\nClearly, the problem is an instance of prediction with expert advice under partial information\nfeedback (Auer et al., 2002), where each expert corresponds to an element of $\\mathcal{F}$. Note that, if the learner\nchooses to observe the values of some features, then he will also be able to evaluate the losses of\nall the predictors $f \\in \\mathcal{F}$ that use only these selected features. This can be formalized as follows:\nBy a slight abuse of notation let $s_t \\in \\{0, 1\\}^d$ be the indicator showing the features selected by\nthe learner at time $t$ (here we drop the last element of $s_t$ as $s_{t,d+1}$ is always 1); similarly, we will\ndrop the last coordinate of the cost vector $c$ throughout this section. Then, the learner can\ncompute the loss of any predictor $f \\in \\mathcal{F}$ such that $s(f) \\le s_t$, where $\\le$ denotes the conjunction of the\ncomponent-wise comparison. However, for some loss functions, it may be possible to estimate the\nlosses of other predictors, too. 
We will exploit this when we study some interesting special cases of\nthe general problem. However, in general, it is not possible to infer the losses for functions such that\n$s_{t,i} < s(f)_i$ for some $i$ (cf. Theorem 2.3).\nThe idea is to study first the case when $\\mathcal{F}$ is finite and then reduce the general case to the finite case\nby considering appropriate finite coverings of the space $\\mathcal{F}$. The regret will then depend on how the\ncovering numbers of the space $\\mathcal{F}$ behave.\nMannor and Shamir (2011) studied problems similar to this in a general framework, where in\naddition to the loss of the selected predictor (expert), the losses of some other predictors are also\ncommunicated to the learner in every round. The connection between the predictors is represented\nby a directed graph whose nodes are labeled as elements of $\\mathcal{F}$ (i.e., as the experts) and there is an\nedge from $f \\in \\mathcal{F}$ to $g \\in \\mathcal{F}$ if, when choosing $f$, the loss of $g$ is also revealed to the learner. It is\nassumed that the graph of any round $t$, $G_t = (\\mathcal{F}, E_t)$, becomes known to the learner at the beginning\nof the round. Further, it is also assumed that $(f, f) \\in E_t$ for every $t \\ge 1$ and $f \\in \\mathcal{F}$. Mannor\nand Shamir (2011) gave an algorithm, called ELP (exponential weights with linear programming),\nto solve this problem, which calls the Exponential Weights algorithm, but modifies it to explore\nless, exploiting the information structure of the problem. The exploration distribution is found by\nsolving a linear program, explaining the name of the algorithm. The regret of ELP is analyzed in the\nfollowing theorem.\nTheorem 2.1 (Mannor and Shamir 2011). Consider a prediction with expert advice problem over\n$\\mathcal{F}$ where in round $t$, $G_t = (\\mathcal{F}, E_t)$ is the directed graph that encodes which losses become available\nto the learner. Assume that for any $t \\ge 1$, at most $\\chi(G_t)$ cliques of $G_t$ can cover all vertices of $G_t$.\nLet $B$ be a bound on the non-negative losses $\\ell_t$: $\\max_{t \\ge 1, f \\in \\mathcal{F}} \\ell_t(f(x_t)) \\le B$. 
Then, there exists\na constant $C_{\\mathrm{ELP}} > 0$ such that for any $T > 0$, the regret of Algorithm 2 (shown in the Appendix)\nwhen competing against the best predictor using ELP satisfies\n\n$$\\mathbb{E}[R_T] \\le C_{\\mathrm{ELP}} B \\sqrt{(\\ln |\\mathcal{F}|) \\sum_{t=1}^{T} \\chi(G_t)} . \\qquad (2)$$\n\nThe algorithm\u2019s computational cost in any given round is $\\mathrm{poly}(|\\mathcal{F}|)$.\n\nFor a finite $\\mathcal{F}$, define $E_t \\equiv E \\doteq \\{(f, g) \\mid s(g) \\le s(f)\\}$. Then clearly, $\\chi(G_t) \\le 2^d$. Further,\n$B = \\|c_{1:d}\\|_1 + \\max_{t \\ge 1, y \\in \\mathcal{Y}} \\ell_t(y) \\doteq C_1 + \\ell_{\\max}$ (i.e., $C_1 = \\|c_{1:d}\\|_1$). Plugging these into (2) gives\n\n$$\\mathbb{E}[R_T] \\le C_{\\mathrm{ELP}} (C_1 + \\ell_{\\max}) \\sqrt{2^d T \\ln |\\mathcal{F}|} . \\qquad (3)$$\n\nTo apply this algorithm in the case when $\\mathcal{F}$ is infinite, we have to approximate $\\mathcal{F}$ with a finite\nset $\\mathcal{F}' \\subset \\{f \\mid f : \\mathcal{X} \\to \\mathcal{Y}\\}$. The worst-case maximum approximation error of $\\mathcal{F}$ using $\\mathcal{F}'$ over\nsequences of length $T$ can be defined as\n\n$$A_T(\\mathcal{F}', \\mathcal{F}) = \\max_{x \\in \\mathcal{X}^T} \\sup_{f \\in \\mathcal{F}} \\inf_{f' \\in \\mathcal{F}'} \\frac{1}{T} \\sum_{t=1}^{T} d_{\\mathcal{Y}}(f(x_t), f'(x_t)) + \\langle (s(f') - s(f))_+, c_{1:d} \\rangle ,$$\n\nwhere $(s(f') - s(f))_+$ denotes the coordinate-wise positive part of $s(f') - s(f)$, that is, the indicator\nvector of the features used by $f'$ and not used by $f$. The average error can also be viewed as a\n(normalized) $d_{\\mathcal{Y}}$-\u201cdistance\u201d between the vectors $(f(x_t))_{1 \\le t \\le T}$ and $(f'(x_t))_{1 \\le t \\le T}$ penalized with\nthe extra feature costs. For a given positive number $\\alpha$, define the worst-case empirical covering\nnumber of $\\mathcal{F}$ at level $\\alpha$ and horizon $T > 0$ by\n\n$$N_T(\\mathcal{F}, \\alpha) = \\min\\{ |\\mathcal{F}'| \\mid \\mathcal{F}' \\subset \\{f \\mid f : \\mathcal{X} \\to \\mathcal{Y}\\}, \\ A_T(\\mathcal{F}', \\mathcal{F}) \\le \\alpha \\} .$$\n\nWe are going to apply the ELP algorithm to $\\mathcal{F}'$ and apply (3) to obtain a regret bound. If $f'$ uses\nmore features than $f$ then the cost-penalized distance between $f'$ and $f$ is bounded from below by\nthe cost of observing the extra features. This means that unless the problem is very special, $\\mathcal{F}'$ has\nto contain, for all $s \\in \\{s(f) \\mid f \\in \\mathcal{F}\\}$, some $f'$ with $s(f') = s$. Thus, if $\\mathcal{F}$ contains a function for\nall $s \\in \\{0, 1\\}^d$, $\\chi(G_t) = 2^d$. 
Selecting a covering $\\mathcal{F}'$ that achieves accuracy $\\alpha$, the approximation\nerror becomes $T L \\alpha$ (using equation 1), giving the following bound:\nTheorem 2.2. Assume that the losses $(\\ell_t)_{t \\ge 1}$ are $L$-Lipschitz (cf. (1)) and $\\alpha > 0$. Then, there exists\nan algorithm such that for any $T > 0$, knowing $T$, the regret satisfies\n\n$$\\mathbb{E}[R_T] \\le C_{\\mathrm{ELP}} (C_1 + \\ell_{\\max}) \\sqrt{2^d T \\ln N_T(\\mathcal{F}, \\alpha)} + T L \\alpha .$$\n\nIn particular, by choosing $\\alpha = 1/(TL)$, we have\n\n$$\\mathbb{E}[R_T] \\le C_{\\mathrm{ELP}} (C_1 + \\ell_{\\max}) \\sqrt{2^d T \\ln N_T(\\mathcal{F}, 1/(TL))} + 1 .$$\n\nWe note in passing that the dependence of the algorithm on the time horizon $T$ can be alleviated,\nusing, for example, the doubling trick.\nIn order to turn the above bound into a concrete bound, one must investigate the behavior of the\nmetric entropy, $\\ln N_T(\\mathcal{F}, \\alpha)$. In many cases, the metric entropy can be bounded independently of\n$T$. In fact, often, $\\ln N_T(\\mathcal{F}, \\alpha) = D \\ln(1 + c/\\alpha)$ for some $c, D > 0$. When this holds, $D$ is often\ncalled the \u201cdimension\u201d of $\\mathcal{F}$ and we get that\n\n$$\\mathbb{E}[R_T] \\le C_{\\mathrm{ELP}} (C_1 + \\ell_{\\max}) \\sqrt{2^d T D \\ln(1 + cTL)} + 1 .$$\n\nAs a specific example, we will consider the case of real-valued linear functions over a ball in a\nEuclidean space with weights belonging to some other ball. For a normed vector space $V$ with norm\n$\\|\\cdot\\|$ and dual norm $\\|\\cdot\\|_*$, $x \\in V$, $r \\ge 0$, let $B_{\\|\\cdot\\|}(x, r) = \\{v \\in V \\mid \\|v - x\\| \\le r\\}$ denote the ball in $V$\ncentered at $x$ that has radius $r$. For $\\mathcal{X} \\subset \\mathbb{R}^d$, $\\mathcal{W} \\subset \\mathbb{R}^d$, let\n\n$$\\mathrm{Lin}(\\mathcal{X}, \\mathcal{W}) \\doteq \\{g : \\mathcal{X} \\to \\mathbb{R} \\mid g(\\cdot) = \\langle w, \\cdot \\rangle, \\ w \\in \\mathcal{W}\\} \\qquad (4)$$\n\nbe the space of linear mappings from $\\mathcal{X}$ to reals with weights belonging to $\\mathcal{W}$. We have the following\nlemma:\nLemma 2.1. Let $X, W > 0$, $d_{\\mathcal{Y}}(y, y') = |y - y'|$, $\\mathcal{X} \\subset B_{\\|\\cdot\\|}(0, X)$ and $\\mathcal{W} \\subset B_{\\|\\cdot\\|_*}(0, W)$.\nConsider a set of real-valued linear predictors $\\mathcal{F} \\subset \\mathrm{Lin}(\\mathcal{X}, \\mathcal{W})$. Then, for any $\\alpha > 0$,\n\n$$\\ln N_T(\\mathcal{F}, \\alpha) \\le d \\ln(1 + 2WX/\\alpha) .$$\n\nThe previous lemma, together with Theorem 2.2, immediately gives the following result:\nCorollary 2.1. Assume that $\\mathcal{F} \\subset \\mathrm{Lin}(\\mathcal{X}, \\mathcal{W})$, $\\mathcal{X} \\subset B_{\\|\\cdot\\|}(0, X)$, $\\mathcal{W} \\subset B_{\\|\\cdot\\|_*}(0, W)$ for some\n$X, W > 0$. Further, assume that the losses $(\\ell_t)_{t \\ge 1}$ are $L$-Lipschitz. Then, there exists an algorithm\nsuch that for any $T > 0$, the regret of the algorithm satisfies\n\n$$\\mathbb{E}[R_T] \\le C_{\\mathrm{ELP}} (C_1 + \\ell_{\\max}) \\sqrt{d \\, 2^d T \\ln(1 + 2TLWX)} + 1 .$$\n\nNote that if one is given an a priori bound $p$ on the maximum number of features that can be used\nin a single round (allowing the algorithm to use fewer than $p$, but not more features) then $2^d$ in\nthe above bound could be replaced by $\\sum_{1 \\le i \\le p} \\binom{d}{i} \\approx d^p$, where the approximation assumes that\n$p < d/2$. Such a bound on the number of features available per round may arise from strict\nbudgetary considerations. When $d^p$ is small, this makes the bound non-vacuous even for small horizons\n$T$. In addition, in such cases the algorithm also becomes computationally feasible. It remains an\ninteresting open question to study the computational complexity when there is no restriction on the\nnumber of features used. In the next theorem, however, we show that the worst-case exponential\ndependence of the regret on the number of features cannot be improved (while keeping the root-$T$\ndependence on the horizon). The bound is based on the lower bound construction of Mannor and\nShamir (2011), which reduces the problem to known lower bounds in the multi-armed bandit case.\nTheorem 2.3. There exists an instance of free-label online probing such that the minimax regret of\nany algorithm is $\\Omega\\big(\\sqrt{\\binom{d}{d/2} T}\\big)$.\n\n2.2 Linear Prediction with Quadratic Losses\n\nIn this section, we study the problem under the assumption that the predictors have a linear form and\nthe loss functions are quadratic. 
That is, $\\mathcal{F} \\subset \\mathrm{Lin}(\\mathcal{X}, \\mathcal{W})$ where $\\mathcal{W} = \\{w \\in \\mathbb{R}^d \\mid \\|w\\|_* \\le w_{\\lim}\\}$\nand $\\mathcal{X} = \\{x \\in \\mathbb{R}^d \\mid \\|x\\| \\le x_{\\lim}\\}$ for some given constants $w_{\\lim}, x_{\\lim} > 0$, while $\\ell_t(y) = (y - y_t)^2$,\nwhere $|y_t| \\le x_{\\lim} w_{\\lim}$. Thus, choosing a predictor is akin to selecting a weight vector $w_t \\in \\mathcal{W}$,\nas well as a binary vector $s_t \\in \\mathcal{G} \\subset \\{0, 1\\}^d$ that encodes the features to be used in round $t$. The\nprediction for round $t$ is then $\\hat{y}_t = \\langle w_t, s_t \\odot x_t \\rangle$, where $\\odot$ denotes coordinate-wise product, while\nthe loss suffered is $(\\hat{y}_t - y_t)^2$. The set $\\mathcal{G}$ is an arbitrary non-empty, a priori specified subset of $\\{0, 1\\}^d$\nthat allows the user of the algorithm to encode extra constraints on what subsets of features can be\nselected.\nIn this section we show that in this case a regret bound of size $\\tilde{O}(\\sqrt{\\mathrm{poly}(d) T})$ is possible. The key\nidea that permits the improvement of the regret bound is that a randomized choice of a weight vector\n$W_t$ (and thus, of a subset) helps one construct unbiased estimates of the losses $\\ell_t(\\langle w, s \\odot x_t \\rangle)$\nfor all weight vectors $w$ and all subsets $s \\in \\mathcal{G}$ under some mild conditions on the distribution of\n$W_t$. That the construction of such unbiased estimates is possible, despite that some feature values\nare unobserved, is because of the special algebraic structure of the prediction and loss functions.\nA similar construction has appeared in a different context, e.g., in the paper of Cesa-Bianchi et al.\n(2010).\nThe construction works as follows. Define the $d \\times d$ matrix $X_t$ by $(X_t)_{i,j} = x_{t,i} x_{t,j}$ ($1 \\le i, j \\le d$).\nExpanding the loss of the prediction $\\hat{y}_t = \\langle w, x_t \\rangle$, we get that the loss of using $w \\in \\mathcal{W}$ is\n\n$$\\ell_t(w) \\doteq \\ell_t(\\langle w, x_t \\rangle) = w^\\top X_t w - 2 w^\\top x_t y_t + y_t^2 ,$$\n\nwhere with a slight abuse of notation we have introduced the loss function $\\ell_t : \\mathcal{W} \\to \\mathbb{R}$ (we\u2019ll keep\nabusing the use of $\\ell_t$ by overloading it based on the type of its argument). Clearly, it suffices to\nconstruct unbiased estimates of $\\ell_t(w)$ for any $w \\in \\mathcal{W}$.\nWe will use a discretization approach. 
Therefore, assume that we are given a finite subset $\\mathcal{W}'$ of\n$\\mathcal{W}$ that will be constructed later. In each step $t$, our algorithm will choose a random weight vector\n$W_t$ from a probability distribution supported on $\\mathcal{W}'$. Let $p_t(w)$ be the probability of selecting the\nweight vector $w \\in \\mathcal{W}'$. For $1 \\le i \\le d$, let\n\n$$q_t(i) = \\sum_{w \\in \\mathcal{W}' : i \\in s(w)} p_t(w)$$\n\nbe the probability that $s(W_t)$ will contain $i$, while for $1 \\le i, j \\le d$, let\n\n$$q_t(i, j) = \\sum_{w \\in \\mathcal{W}' : i, j \\in s(w)} p_t(w)$$\n\nbe the probability that both $i, j \\in s(W_t)$.\u00b2 Assume that $p_t(\\cdot)$ is constructed such that $q_t(i, j) > 0$\nholds for any time $t$ and indices $1 \\le i, j \\le d$. This also implies that $q_t(i) > 0$ for all $1 \\le i \\le d$.\nDefine the vector $\\tilde{x}_t \\in \\mathbb{R}^d$ and matrix $\\tilde{X}_t \\in \\mathbb{R}^{d \\times d}$ using the following equations:\n\n$$\\tilde{x}_{t,i} = \\frac{\\mathbf{1}\\{i \\in s(W_t)\\} \\, x_{t,i}}{q_t(i)} , \\qquad (\\tilde{X}_t)_{i,j} = \\frac{\\mathbf{1}\\{i, j \\in s(W_t)\\} \\, x_{t,i} x_{t,j}}{q_t(i, j)} . \\qquad (5)$$\n\nIt can be readily verified that $\\mathbb{E}[\\tilde{x}_t \\mid p_t] = x_t$ and $\\mathbb{E}[\\tilde{X}_t \\mid p_t] = X_t$. Further, notice that both $\\tilde{x}_t$\nand $\\tilde{X}_t$ can be computed based on the information available at the end of round $t$, i.e., based on the\nfeature values $(x_{t,i})_{i \\in s(W_t)}$. Now, define the estimate of prediction loss\n\n$$\\tilde{\\ell}_t(w) = w^\\top \\tilde{X}_t w - 2 w^\\top \\tilde{x}_t y_t + y_t^2 . \\qquad (6)$$\n\nNote that $y_t$ can be readily computed from $\\ell_t(\\cdot)$, which is available to the algorithm (equivalently,\nwe may assume that the algorithm observed $y_t$). Due to the linearity of expectation, we have\n$\\mathbb{E}[\\tilde{\\ell}_t(w) \\mid p_t] = \\ell_t(w)$. That is, $\\tilde{\\ell}_t(w)$ provides an unbiased estimate of the loss $\\ell_t(w)$ for any\n$w \\in \\mathcal{W}$. Hence, by adding a feature cost term we get $\\tilde{\\ell}_t(w) + \\langle s(w), c \\rangle$ as an estimate of the loss\nthat the learner would have suffered at round $t$ had he chosen the weight vector $w$.\n\n\u00b2 Note that, following our earlier suggestion, we view the $d$-dimensional binary vectors as subsets of\n$\\{1, \\ldots, d\\}$.\n\nAlgorithm 1 The LQDEXP3 Algorithm\n\nParameters: Real numbers $0 \\le \\eta$, $0 < \\gamma \\le 1$, $\\mathcal{W}' \\subset \\mathcal{W}$ finite set, a distribution $\\mu$ over $\\mathcal{W}'$,\nhorizon $T > 0$.\nInitialization: $u_1(w) = 1$ ($w \\in \\mathcal{W}'$).\nfor $t = 1$ to $T$ do\n    Draw $W_t \\in \\mathcal{W}'$ from the probability mass function\n        $p_t(w) = (1 - \\gamma) \\frac{u_t(w)}{U_t} + \\gamma \\mu(w), \\quad w \\in \\mathcal{W}'$.\n    Obtain the feature values, $(x_{t,i})_{i \\in s(W_t)}$.\n    Predict $\\hat{y}_t = \\sum_{i \\in s(W_t)} w_{t,i} x_{t,i}$.\n    for $w \\in \\mathcal{W}'$ do\n        Update the weights using (6) for the definitions of $\\tilde{\\ell}_t(w)$:\n        $u_{t+1}(w) = u_t(w) e^{-\\eta (\\tilde{\\ell}_t(w) + \\langle c, s(w) \\rangle)}, \\quad w \\in \\mathcal{W}'$.\n    end for\nend for\n\n2.2.1 LQDExp3 \u2013 A Discretization-based Algorithm\n\nNext we show that the standard EXP3 Algorithm applied to a discretization of the weight space $\\mathcal{W}$\nachieves $O(\\sqrt{dT})$ regret. The algorithm, called LQDEXP3, is given as Algorithm 1. In the name\nof the algorithm, LQ stands for linear prediction with quadratic losses and D denotes discretization.\nNote that if the exploration distribution $\\mu$ in the algorithm is such that for any $1 \\le i, j \\le d$,\n$\\sum_{w \\in \\mathcal{W}' : i, j \\in s(w)} \\mu(w) > 0$ then $q_t(i, j) > 0$ will be guaranteed for all time steps. Using the notation\n$y_{\\lim} = w_{\\lim} x_{\\lim}$ and $E_{\\mathcal{G}} = \\max_{s \\in \\mathcal{G}} \\sup_{w \\in \\mathcal{W} : \\|w\\|_* = 1} \\|w \\odot s\\|_*$, we can state the following regret\nbound on the algorithm:\nTheorem 2.4. Let $w_{\\lim}, x_{\\lim} > 0$, $c \\in [0, \\infty)^d$ be given, $\\mathcal{W} \\subset B_{\\|\\cdot\\|_*}(0, w_{\\lim})$ convex, $\\mathcal{X} \\subset\nB_{\\|\\cdot\\|}(0, x_{\\lim})$ and fix $T \\ge 1$. Then, there exists a parameter setting for LQDEXP3 such that the\nfollowing holds: Let $R_T$ denote the regret of LQDEXP3 against the best linear predictor from\n$\\mathrm{Lin}(\\mathcal{X}, \\mathcal{W})$ when LQDEXP3 is used in an online free-label probing problem defined with the\nsequence $((x_t, y_t))_{1 \\le t \\le T}$ ($\\|x_t\\| \\le x_{\\lim}$, $|y_t| \\le y_{\\lim}$, $1 \\le t \\le T$), quadratic losses $\\ell_t(y) = (y - y_t)^2$,\nand feature-costs given by the vector $c$. 
Then,\n\n$$\\mathbb{E}[R_T] \\le C \\sqrt{T d \\, (4 y_{\\lim}^2 + \\|c\\|_1)(w_{\\lim}^2 x_{\\lim}^2 + 2 y_{\\lim} w_{\\lim} x_{\\lim} + 4 y_{\\lim}^2 + \\|c\\|_1) \\ln(E_{\\mathcal{G}} T)} ,$$\n\nwhere $C > 0$ is a universal constant (i.e., the value of $C$ does not depend on the problem\nparameters).\n\nThe actual parameter setting to be used with the algorithm is constructed in the proof. The\ncomputational complexity of LQDEXP3 is exponential in the dimension $d$ due to the discretization step,\nhence quickly becomes impractical when the number of features is large. On the other hand, one\ncan easily modify the algorithm to run without discretization by replacing EXP3 with its continuous\nversion. The resulting algorithm enjoys essentially the same regret bound, and can be implemented\nefficiently whenever efficient sampling is possible from the resulting distribution. This approach\nseems to be appealing, since, from a first look, it seems to involve sampling from truncated\nGaussian distributions, which can be done efficiently. However, it is easy to see that when the sampling\nprobabilities of some feature are small, the estimated loss will not be convex as $\\tilde{X}_t$ may not be\npositive semi-definite, and therefore the resulting distributions will not always be truncated Gaussians.\nFinding an efficient sampling procedure for such situations is an interesting open problem.\nThe optimality of LQDEXP3 can be seen by the following lower bound on the regret:\nTheorem 2.5. Let $d > 0$, and consider the online free-label probing problem with linear predictors,\nwhere $\\mathcal{W} = \\{w \\in \\mathbb{R}^d \\mid \\|w\\|_1 \\le w_{\\lim}\\}$ and $\\mathcal{X} = \\{x \\in \\mathbb{R}^d \\mid \\|x\\|_\\infty \\le 1\\}$. Assume, for all $t \\ge 1$,\nthat the loss functions are of the form $\\ell_t(w) = (w^\\top x_t - y_t)^2 + \\langle s(w), c \\rangle$, where $|y_t| \\le 1$ and\n$c = 1/2 \\times \\mathbf{1} \\in \\mathbb{R}^d$. 
Then, for any prediction algorithm and for any $T \\ge \\frac{4d}{8 \\ln(4/3)}$, there exists a\nsequence $((x_t, y_t))_{1 \\le t \\le T} \\in (\\mathcal{X} \\times [-1, 1])^T$ such that the regret of the algorithm can be bounded\nfrom below as\n\n$$\\mathbb{E}[R_T] \\ge \\frac{\\sqrt{2} - 1}{\\sqrt{32 \\ln(4/3)}} \\sqrt{T d} .$$\n\n3 Non-Free-Label Probing\n\nIf $c_{d+1} > 0$, the learner has to pay for observing the true label. This scenario is very similar to the\nwell-known label-efficient prediction case in online learning (Cesa-Bianchi et al., 2006). In fact,\nthe latter problem is a special case of this problem, immediately giving us that the regret of any\nalgorithm is at least of order $T^{2/3}$. It turns out that if one observes the (costly) label in a given round\nthen it does not affect the regret rate if one observes all the features at the same time. The resulting\n\u201crevealing action algorithm\u201d, given in Algorithm 3 in the Appendix, achieves the following regret\nbound for finite expert classes:\nLemma 3.1. Given any non-free-label online probing problem with finitely many experts, Algorithm 3 with\nappropriately set parameters achieves\n\n$$\\mathbb{E}[R_T] \\le C \\max\\Big( T^{2/3} (\\ell_{\\max}^2 \\|c\\|_1 \\ln |\\mathcal{F}|)^{1/3}, \\ \\ell_{\\max} \\sqrt{T \\ln |\\mathcal{F}|} \\Big)$$\n\nfor some constant $C > 0$.\n\nUsing the fact that, in the linear prediction case, approximately $(2TLWX + 1)^d$ experts are needed\nto approximate each expert in $\\mathcal{W}$ with precision $\\alpha = \\frac{1}{LT}$ in worst-case empirical covering, we obtain\nthe following theorem (note, however, that the complexity of the algorithm is again exponential in\nthe dimension $d$, as we need to keep a weight for each expert):\nTheorem 3.1. 
Given any non-free-label online probing problem with linear predictor experts and Lipschitz\nprediction loss function with constant $L$, Algorithm 3 with appropriately set parameters running on\na sufficiently discretized predictor set achieves\n\n$$\\mathbb{E}[R_T] \\le C \\max\\Big( T^{2/3} \\big[ \\ell_{\\max}^2 \\|c\\|_1 d \\ln(TLWX) \\big]^{1/3}, \\ \\ell_{\\max} \\sqrt{T d \\ln(TLWX)} \\Big)$$\n\nfor some universal constant $C > 0$.\n\nThat Algorithm 3 is essentially optimal for linear predictions and quadratic losses is a consequence\nof the following almost matching lower bound:\nTheorem 3.2. There exists a constant $C$ such that, for any non-free-label probing problem with linear\npredictors, quadratic loss, and $c_j > (1/d) \\sum_{i=1}^{d} c_i - 1/2d$ for every $j = 1, \\ldots, d$, the expected regret\nof any algorithm can be lower bounded by\n\n$$\\mathbb{E}[R_T] \\ge C (c_{d+1} d)^{1/3} T^{2/3} .$$\n\n4 Conclusions\n\nWe introduced a new problem called online probing. In this problem, the learner has the option\nof choosing the subset of features he wants to observe as well as the option of observing the true\nlabel, but has to pay for this information. This setup produced new challenges in solving the online\nproblem. We showed that when the labels are free, it is possible to devise algorithms with optimal\nregret rate $\\Theta(\\sqrt{T})$ (up to logarithmic factors), while in the non-free-label case we showed that only\n$\\Theta(T^{2/3})$ is achievable. We gave algorithms that achieve the optimal regret rate (up to logarithmic\nfactors) when the number of experts is finite or in the case of linear prediction. Unfortunately, either\nour bounds or the computational complexity of the corresponding algorithms are exponential in\nthe problem dimension, and it is an open problem whether these disadvantages can be eliminated\nsimultaneously.\n\nAcknowledgements\n\nThe authors thank Yevgeny Seldin for finding a bug in an earlier version of the paper. 
This work was\nsupported in part by DARPA grant MSEE FA8650-11-1-7156, the Alberta Innovates Technology\nFutures, AICML, and the Natural Sciences and Engineering Research Council (NSERC) of Canada.\n\nReferences\n\nAgarwal, A. and Duchi, J. C. (2011). Distributed delayed stochastic optimization. In Shawe-Taylor,\nJ., Zemel, R. S., Bartlett, P. L., Pereira, F. C. N., and Weinberger, K. Q., editors, NIPS, pages\n873\u2013881.\n\nAuer, P., Cesa-Bianchi, N., Freund, Y., and Schapire, R. E. (2002). The nonstochastic multiarmed\nbandit problem. SIAM J. Comput., 32(1):48\u201377.\n\nBart\u00f3k, G. (2012). The role of information in online learning. PhD thesis, Department of Computing\nScience, University of Alberta.\n\nCesa-Bianchi, N. and Lugosi, G. (2006). Prediction, learning, and games. Cambridge University Press.\n\nCesa-Bianchi, N., Lugosi, G., and Stoltz, G. (2005). Minimizing regret with label efficient\nprediction. IEEE Transactions on Information Theory, 51(6):2152\u20132162.\n\nCesa-Bianchi, N., Lugosi, G., and Stoltz, G. (2006). Regret minimization under partial monitoring.\nMath. Oper. Res., 31(3):562\u2013580.\n\nCesa-Bianchi, N., Shalev-Shwartz, S., and Shamir, O. (2010). Efficient learning with partially\nobserved attributes. CoRR, abs/1004.4421.\n\nDekel, O., Shamir, O., and Xiao, L. (2010). Learning to classify with missing and corrupted features.\nMachine Learning, 81(2):149\u2013178.\n\nJoulani, P., Gy\u00f6rgy, A., and Szepesv\u00e1ri, C. (2013). Online learning under delayed feedback. In 30th\nInternational Conference on Machine Learning, Atlanta, GA, USA.\n\nKapoor, A. and Greiner, R. (2005). Learning and classifying under hard budgets. In European\nConference on Machine Learning (ECML), pages 166\u2013173.\n\nLizotte, D., Madani, O., and Greiner, R. (2003). Budgeted learning of naive-Bayes classifiers. In\nConference on Uncertainty in Artificial Intelligence (UAI).\n\nMannor, S. and Shamir, O. (2011). 
From bandits to experts: On the value of side-observations.\nCoRR, abs/1106.2436.\n\nMesterharm, C. (2005). On-line learning with delayed label feedback. In Proceedings of the 16th\nInternational Conference on Algorithmic Learning Theory, ALT\u201905, pages 399\u2013413, Berlin,\nHeidelberg. Springer-Verlag.\n\nRostamizadeh, A., Agarwal, A., and Bartlett, P. L. (2011). Learning with missing features. In UAI,\npages 635\u2013642.\n\nSettles, B. (2009). Active learning literature survey. Technical report.\n\nWeinberger, M. J. and Ordentlich, E. (2006). On delayed prediction of individual sequences. IEEE\nTrans. Inf. Theor., 48(7):1959\u20131976.\n", "award": [], "sourceid": 642, "authors": [{"given_name": "Navid", "family_name": "Zolghadr", "institution": "University of Alberta"}, {"given_name": "Gabor", "family_name": "Bartok", "institution": "ETH Zurich"}, {"given_name": "Russell", "family_name": "Greiner", "institution": "University of Alberta"}, {"given_name": "Andr\u00e1s", "family_name": "Gy\u00f6rgy", "institution": "University of Alberta"}, {"given_name": "Csaba", "family_name": "Szepesvari", "institution": "University of Alberta"}]}