{"title": "Efficient Sublinear-Regret Algorithms for Online Sparse Linear Regression with Limited Observation", "book": "Advances in Neural Information Processing Systems", "page_first": 4099, "page_last": 4108, "abstract": "Online sparse linear regression is the task of applying linear regression analysis to examples arriving sequentially subject to a resource constraint that a limited number of features of examples can be observed. Despite its importance in many practical applications, it has been recently shown that there is no polynomial-time sublinear-regret algorithm unless NP$\\subseteq$BPP, and only an exponential-time sublinear-regret algorithm has been found. In this paper, we introduce mild assumptions to solve the problem. Under these assumptions, we present polynomial-time sublinear-regret algorithms for the online sparse linear regression. In addition, thorough experiments with publicly available data demonstrate that our algorithms outperform other known algorithms.", "full_text": "Ef\ufb01cient Sublinear-Regret Algorithms for Online\nSparse Linear Regression with Limited Observation\n\nShinji Ito\n\nNEC Corporation\n\ns-ito@me.jp.nec.com\n\nDaisuke Hatano\n\nNational Institute of Informatics\n\nhatano@nii.ac.jp\n\nHanna Sumita\n\nNational Institute of Informatics\n\nsumita@nii.ac.jp\n\nAkihiro Yabe\n\nNEC Corporation\n\na-yabe@cq.jp.nec.com\n\nTakuro Fukunaga\n\nJST, PRESTO\n\ntakuro@nii.ac.jp\n\nNaonori Kakimura\n\nKeio University\n\nkakimura@math.keio.ac.jp\n\nKen-ichi Kawarabayashi\n\nNational Institute of Informatics\n\nk-keniti@nii.ac.jp\n\nAbstract\n\nOnline sparse linear regression is the task of applying linear regression analysis\nto examples arriving sequentially subject to a resource constraint that a limited\nnumber of features of examples can be observed. 
Despite its importance in many practical applications, it has recently been shown that there is no polynomial-time sublinear-regret algorithm unless NP ⊆ BPP, and only an exponential-time sublinear-regret algorithm has been found. In this paper, we introduce mild assumptions to solve the problem. Under these assumptions, we present polynomial-time sublinear-regret algorithms for online sparse linear regression. In addition, thorough experiments with publicly available data demonstrate that our algorithms outperform other known algorithms.

1 Introduction

In online regression, a learner receives examples one by one and aims to make a good prediction from the features of the arriving examples, learning a model in the process. Online regression has recently attracted attention in the research community as a way of managing massive learning data. In real-world scenarios with resource constraints, however, it is desirable to make a prediction with only a limited number of features per example. Such scenarios arise in the context of medical diagnosis of a disease [3] and in generating a ranking of web pages in a search engine, where it is costly to obtain features or only partial features are available in each round. In both examples, predictions need to be made sequentially because a patient or a search query arrives online.

To resolve the above issue of limited access to features, Kale [7] proposed online sparse regression. In this problem, a learner makes predictions for the labels of examples arriving sequentially over a number of rounds. Each example has d features that can potentially be accessed by the learner. However, in each round, the learner can acquire the values of at most k′ of the d features, where k′ is a parameter set in advance. The learner then makes a prediction for the label of the example.
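As a concrete (and purely illustrative) rendering of this interaction, the sketch below runs the protocol with a placeholder learner; the data model and the random selection strategy are our own stand-ins, not the algorithms studied in this paper.

```python
import numpy as np

def run_protocol(learner, T, d, k_prime, rng):
    """Run T rounds of the online sparse regression protocol and
    return the learner's cumulative squared prediction loss."""
    total_loss = 0.0
    for t in range(T):
        x = rng.uniform(-1, 1, size=d) / np.sqrt(d)   # features, ||x|| <= 1
        y = float(np.clip(x[0] + x[1] + 0.1 * rng.standard_normal(), -1, 1))
        S = learner.select(t)
        assert len(S) <= k_prime                      # observation budget
        observed = {i: float(x[i]) for i in S}        # only these entries revealed
        y_hat = learner.predict(observed)
        total_loss += (y_hat - y) ** 2
        learner.update(observed, y)                   # true label revealed last
    return total_loss

class RandomLearner:
    """Placeholder learner: random feature set, constant zero prediction."""
    def __init__(self, d, k_prime, rng):
        self.d, self.k_prime, self.rng = d, k_prime, rng
    def select(self, t):
        return set(self.rng.choice(self.d, size=self.k_prime, replace=False))
    def predict(self, observed):
        return 0.0
    def update(self, observed, y):
        pass

rng = np.random.default_rng(0)
loss = run_protocol(RandomLearner(8, 3, rng), T=50, d=8, k_prime=3, rng=rng)
```

Any learner implementing `select`, `predict`, and `update` can be plugged into this loop; the point is only that the learner never sees more than k′ of the d feature values per round.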
After the prediction, the true label is revealed to the learner, and the learner suffers a loss for making an incorrect prediction. The performance of the prediction is measured here by the standard notion of regret, which is the difference between the total loss of the learner and the total loss of the best predictor. In [7], the best predictor is defined as the best k-sparse linear predictor, i.e., the label is defined as a linear combination of at most k features.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

Table 1: Computational complexity of online sparse linear regression.

  Assumptions (1), (2):       Hard [5]; Hard (Theorem 1)
  Assumptions (1), (2), (a):  Polynomial time (Algorithms 1, 2)
  Assumptions (1), (2), (b):  Polynomial time (Algorithm 3)

Online sparse regression is a natural online variant of sparse regression; however, its computational complexity was not well understood until recently, when Kale [7] raised the question of whether it is possible to achieve sublinear regret in polynomial time for online sparse linear regression. Foster et al. [5] answered the question by proving that no polynomial-time algorithm achieves sublinear regret unless NP ⊆ BPP. Indeed, this hardness result holds even when observing Ω(k log d) features per example. On the positive side, they also proposed an exponential-time algorithm with sublinear regret for the case where at least k + 2 features can be observed in each round. However, their algorithm is not expected to work efficiently in practice: it enumerates all (d choose k′) possibilities for the set of k′ features observed in each round, which requires exponential time on any instance.

Our contributions. In this paper, we show that online sparse linear regression admits a polynomial-time algorithm with sublinear regret under mild practical assumptions. First, we assume that the features of examples arriving online are determined by a hidden distribution (Assumption (1)), and that the labels of the examples are determined by a weighted average of k features, where the weights are fixed through all rounds (Assumption (2)). These are natural assumptions in online linear regression. However, Foster et al. [5] showed that no polynomial-time algorithm can achieve sublinear regret unless NP ⊆ BPP, even under these two assumptions.¹

Owing to this hardness, we introduce two types of conditions on the distribution of features, both closely related to the restricted isometry property (RIP) studied in the literature on sparse recovery. The first condition, which we call linear independence of features (Assumption (a)), is stronger than RIP; roughly, it says that all the features are linearly independent. The second condition, which we call compatibility (Assumption (b)), is weaker than RIP; thus, an instance having RIP always satisfies the compatibility condition. Under these assumptions, we propose the following three algorithms, where T is the number of rounds.

• Algorithm 1: A polynomial-time algorithm that achieves O((d/(k′ − k)) √T) regret under Assumptions (1), (2), and (a); it requires at least k + 2 features to be observed per example.

• Algorithm 2: A polynomial-time algorithm that achieves O(√(dT) + d^16/k′^16) regret under Assumptions (1), (2), and (a); it requires at least k features to be observed per example.

• Algorithm 3: A polynomial-time algorithm that achieves O(√(dT) + d^16/k′^16) regret under Assumptions (1), (2), and (b); it requires at least k features to be observed per example.

We can also construct an algorithm achieving O((d/(k′ − k)) √T) regret under Assumption (b) for the case where k′ ≥ k + 2, analogous to Algorithm 1, but we omit it due to space limitations.

Assumptions (1)+(2)+(a) or (1)+(2)+(b) seem to be minimal assumptions needed to achieve sublinear regret in polynomial time. Indeed, as listed in Table 1, the problem is hard if any one of the assumptions is violated, where hard means that no polynomial-time algorithm can achieve sublinear regret unless NP ⊆ BPP. Note that Assumption (a) is stronger than (b).

In addition to proving theoretical regret bounds for our algorithms, we perform thorough experiments to evaluate them. We verified that our algorithms outperform the exponential-time algorithm of [5] in terms of computational cost as well as prediction performance. Our algorithms also outperform baseline heuristics and the algorithms proposed in [2, 6] for online learning based on limited observation. Moreover, we observe that our algorithms perform well even on a real dataset, which may not satisfy our assumptions (deciding whether a model satisfies our assumptions is difficult; for example, the RIP parameter cannot be approximated within any constant factor under a reasonable complexity assumption [9]). Thus, we can conclude that our algorithms are applicable in practice.

¹ Although the statement in [5] does not mention the assumptions, its proof indicates that the hardness holds even with these assumptions.

Overview of our techniques.
One naive strategy for choosing a limited number of features is to choose "large-weight" features in terms of estimated ground-truth regression weights. This strategy, however, does not achieve sublinear regret, as it ignores small-weight features. Under Assumption (a), we show that if we observe two more features chosen uniformly at random, together with the largest k features, we can make a good prediction. More precisely, using the observed features, we output the label that minimizes the least-squares loss function, based on the technique of using an unbiased estimator of the gradient [2, 6] and the regularized dual averaging (RDA) method (see, e.g., [11, 4]). This idea gives Algorithm 1, and the details are given in Section 4. The reason why we use RDA is that it is efficient in terms of computational time and memory, as pointed out in [11], and, more importantly, we will later combine it with ℓ1 regularization. However, this approach requires at least k + 2 features to be observed in each round.

To avoid the requirement of two extra observations, the main idea is to employ Algorithm 1 with a partial dataset. As a by-product of Algorithm 1, we can estimate the ground-truth regression weight vector with high probability, even without observing extra features in each round. We use the ground-truth weight vector estimated by Algorithm 1 to choose k features. Combining this idea with RDA adapted for sparse regression gives Algorithm 2 (Section 5.1) under Assumption (a).

The compatibility condition (Assumption (b)) is often used in LASSO (Least Absolute Shrinkage and Selection Operator), and it is known that minimization with an ℓ1 regularizer converges to the sparse solution under the compatibility condition [1]. We introduce ℓ1 regularization into Algorithm 1 to estimate the ground-truth regression weight vector when we have Assumption (b) instead of Assumption (a).
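As a self-contained illustration of why ℓ1 regularization recovers sparse weights, the toy sketch below runs iterative soft-thresholding (ISTA), the standard proximal method for a LASSO-type objective ℓ(w) + λ‖w‖₁; the data, constants, and method are our own example, not the paper's Algorithm 3.

```python
import numpy as np

rng = np.random.default_rng(5)
d, k, n = 10, 2, 200
w_star = np.zeros(d)
w_star[:k] = [1.0, -0.8]                  # sparse ground-truth weights
X = rng.standard_normal((n, d))
y = X @ w_star                            # noiseless labels for simplicity

lam = 0.01                                # l1 regularization strength
eta = 1.0 / np.linalg.eigvalsh(X.T @ X / n).max()   # safe step size

w = np.zeros(d)
for _ in range(500):                      # ISTA: gradient step + soft-threshold
    grad = X.T @ (X @ w - y) / n
    w = w - eta * grad
    w = np.sign(w) * np.maximum(np.abs(w) - eta * lam, 0.0)

support = set(np.nonzero(np.abs(w) > 0.1)[0])   # estimated support of w*
```

With a well-conditioned design this recovers the true support exactly, which is the property exploited when the estimated weights are used to select features.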
This gives Algorithm 3 (Section 5.2).

Related work. In the online learning problem, a learner aims to predict a model based on the arriving examples. Specifically, in the linear case, a learner predicts the coefficients wt of a linear function wt⊤ xt whenever an example with features xt arrives in round t. The learner then suffers a loss ℓt(wt) = (yt − wt⊤ xt)². The aim is to minimize the total loss Σ_{t=1}^T (ℓt(wt) − ℓt(w)) for an arbitrary w. It is known that both the gradient descent method [12] and the dual averaging method [11] attain O(√T) regret even in the more general convex case. However, these methods require access to all features of the examples.

In linear regression with limited observation, limited access to features has been considered in [2, 6]. In this problem, a learner can acquire only the values of at most k′ features among the d features. The purpose here is to estimate a good weight vector, e.g., to minimize the loss function ℓ(w) or the loss function with an ℓ1 regularizer, ℓ(w) + ‖w‖₁. Let us note that, even if we obtain a good weight vector w with small ℓ(w), we cannot always compute w⊤ xt from a limited observation of xt; hence, in our setting the prediction error might not be as small as ℓ(w). Thus, our setting uses a different loss function, defined in Section 2, to minimize the prediction error.

Another problem incorporating limited access was proposed by Zolghadr et al. [13]. Here, instead of observing k′ features, one considers the situation where obtaining a feature has an associated cost. In each round, one chooses a set of features and pays some amount of money, and the purpose is to minimize the sum of the regret and the total cost.
They designed an exponential-time algorithm for the problem.

Online sparse linear regression has been studied in [5, 7], but only an exponential-time algorithm has been proposed so far. In fact, Foster et al. [5] suggested designing an efficient algorithm for a special class of the problem as future work. The present paper aims to follow this suggestion.

Recently, Kale et al. [8]² presented computationally efficient algorithms that achieve sublinear regret under the assumption that the input features satisfy RIP. Though this study includes results similar to ours, there are some differences. Our paper considers the assumption of the compatibility condition without extra observation (i.e., the case of k′ = k), whereas Kale et al. [8] study a stronger assumption with extra observation (k′ ≥ k + 2) that yields a smaller regret bound than ours. They also study the agnostic (adversarial) setting.

² The paper [8] was published after our manuscript was submitted.

2 Problem setting

Online sparse linear regression. We suppose that there are T rounds, and an example arrives online in each round. Each example is represented by d features and is associated with a label, where features and labels are all real numbers. We denote the features of the example arriving in round t by xt = (xt1, . . . , xtd)⊤ ∈ {x ∈ R^d | ‖x‖ ≤ 1}, where the norm ‖·‖ without subscripts denotes the ℓ2 norm. The label of each example is denoted by yt ∈ [−1, 1].

The purpose of online sparse regression is to predict the label yt ∈ R from a partial observation of xt in each round t = 1, . . . , T. The prediction is made through the following four steps: (i) we choose a set St ⊆ [d] := {1, . . . , d} of features to observe, where |St| is restricted to be at most k′; (ii) we observe the selected features {xti}_{i∈St}; (iii) on the basis of the observation {xti}_{i∈St}, we estimate a predictor ŷt of yt; and (iv) we observe the true value of yt.

From St, we define Dt ∈ R^{d×d} to be the diagonal matrix whose (i, i)th entry is 1 for i ∈ St and whose other entries are 0. Then, observing the selected features {xti}_{i∈St} in (ii) is equivalent to observing Dt xt. The predictor ŷt is computed by ŷt = wt⊤ Dt xt in (iii).

Throughout the paper, we assume the following conditions, corresponding to Assumptions (1) and (2) in Section 1, respectively.

Assumption (1): There exists a weight vector w* ∈ R^d with ‖w*‖ ≤ 1 such that yt = w*⊤ xt + εt for all t = 1, . . . , T, where the εt ~ Dε are independent and identically distributed (i.i.d.) with E[εt] = 0 and E[εt²] = σ². There exists a distribution Dx on R^d such that the xt ~ Dx are i.i.d. and independent of {εt}.

Assumption (2): The true weight vector w* is k-sparse, i.e., S* = supp(w*) = {i ∈ [d] | w*_i ≠ 0} satisfies |S*| ≤ k.

Regret. The performance of the prediction is evaluated by the regret R_T(w) defined by

  R_T(w) = Σ_{t=1}^T (ŷt − yt)² − Σ_{t=1}^T (w⊤ xt − yt)².   (1)

Our goal is to achieve small regret R_T(w) for an arbitrary w ∈ R^d such that ‖w‖ ≤ 1 and ‖w‖₀ ≤ k. For random inputs and randomized algorithms, we consider the expected regret max_{w: ‖w‖₀ ≤ k, ‖w‖ ≤ 1} E[R_T(w)].

Define the loss function ℓt(w) = (w⊤ xt − yt)². If we compute a predictor ŷt = wt⊤ Dt xt using a weight vector wt = (wt1, . . .
, wtd)⊤ ∈ R^d in each step, we can rewrite the regret R_T(w) in (1) using Dt and wt as

  R_T(w) = Σ_{t=1}^T (ℓt(Dt wt) − ℓt(w))   (2)

because (ŷt − yt)² = (wt⊤ Dt xt − yt)² = ℓt(Dt wt). It is worth noting that if our goal were only to construct wt minimizing the loss function ℓt(wt), then the definition of the regret would be

  R′_T(w) = Σ_{t=1}^T (ℓt(wt) − ℓt(w)).   (3)

However, the goal of online sparse regression involves predicting yt from the limited observation. Hence, we use (2) to evaluate the performance. In terms of the regret defined by (3), several algorithms based on limited observation have been developed. For example, the algorithms proposed by Cesa-Bianchi et al. [3] and Hazan and Koren [6] achieve O(√T) regret in the sense of (3).

3 Extra assumptions on features of examples

Foster et al. [5] showed that Assumptions (1) and (2) are not sufficient to achieve sublinear regret. Owing to this observation, we impose extra assumptions.

Let V := E[xt xt⊤] ∈ R^{d×d} and let L be the Cholesky decomposition of V (i.e., V = L⊤ L). Denote the largest and the smallest singular values of L by σ1 and σd, respectively. Under Assumption (1) in Section 2, we have σ1 ≤ 1 because, for an arbitrary unit vector u ∈ R^d, it holds that u⊤ V u = E[(u⊤ x)²] ≤ 1. For a vector w ∈ R^d and S ⊆ [d], we let w_S denote the restriction of w onto S. For S ⊆ [d], S^c denotes [d] \ S.
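Given sample access to Dx, the quantities just defined (V, its Cholesky factor L, and the singular values σ1 and σd) can be estimated empirically; the sketch below uses an arbitrary feature distribution of our own choosing, purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
d, n = 5, 20000

# Draw features with ||x|| <= 1 (an illustrative distribution, not D_x
# from the paper) and estimate V = E[x x^T] from the sample.
X = rng.uniform(-1, 1, size=(n, d))
X /= np.maximum(1.0, np.linalg.norm(X, axis=1, keepdims=True))
V_hat = X.T @ X / n

L = np.linalg.cholesky(V_hat).T            # upper-triangular L with V_hat = L^T L
svals = np.linalg.svd(L, compute_uv=False)
sigma_1, sigma_d = float(svals.max()), float(svals.min())
```

Since every sample satisfies ‖x‖ ≤ 1, the estimated σ1 is at most 1, matching the bound stated above; a strictly positive σd is exactly what condition (a) below requires.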
We assume that one of the following conditions holds.

(a) Linear independence of features: σd > 0.

(b) Compatibility: There exists a constant φ0 > 0 that satisfies φ0² ‖w_{S*}‖₁² ≤ k w⊤ V w for all w ∈ R^d with ‖w_{(S*)^c}‖₁ ≤ 2 ‖w_{S*}‖₁.

We assume the linear independence of features in Sections 4 and 5.1, and compatibility in Section 5.2, to develop efficient algorithms.

Note that condition (a) means that L is non-singular, and so is V. In other words, condition (a) indicates that the features in xt are linearly independent. This is the reason why we call condition (a) the "linear independence of features" assumption. Note that the linear independence of features does not imply the stochastic independence of features.

Conditions (a) and (b) are closely related to RIP. Indeed, condition (b) is a weaker assumption than RIP, and RIP is weaker than condition (a); i.e., (a) linear independence of features ⟹ RIP ⟹ (b) compatibility (see, e.g., [1]). We now clarify how the above two assumptions are connected to the regret. The expectation of the loss function ℓt(w) is equal to

  E_{xt,yt}[ℓt(w)] = E_{xt~Dx, εt~Dε}[(w⊤ xt − w*⊤ xt − εt)²]
                   = E_{xt~Dx}[((w − w*)⊤ xt)²] + E_{εt~Dε}[εt²] = (w − w*)⊤ V (w − w*) + σ²

for all t, where the second equality comes from E[εt] = 0 and the independence of xt and εt. Denote this function by ℓ(w); then ℓ(w) is minimized when w = w*.
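For a distribution simple enough to enumerate, this identity can be verified exactly; the uniform-over-standard-basis distribution below (so that V = I/d) is our own toy choice.

```python
import numpy as np

d = 4
# Toy D_x: x_t is a uniformly random standard basis vector, so that
# V = E[x x^T] = I/d can be written down and enumerated exactly.
outcomes = np.eye(d)                       # the d equally likely feature vectors
V = sum(np.outer(x, x) for x in outcomes) / d
L = np.linalg.cholesky(V).T                # V = L^T L
sigma_d = float(np.linalg.svd(L, compute_uv=False).min())

w_star = np.array([0.6, -0.3, 0.0, 0.0])   # a 2-sparse true weight vector
w = np.array([0.2, 0.1, 0.5, 0.0])
sigma2 = 0.01                              # noise variance E[eps_t^2]

# E[l_t(w)] by exact enumeration over D_x, versus the closed form above.
expected_loss = sum((w @ x - w_star @ x) ** 2 for x in outcomes) / d + sigma2
closed_form = (w - w_star) @ V @ (w - w_star) + sigma2
```

Here σd = 1/√d > 0, so this toy distribution also satisfies condition (a).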
If Dt and wt are determined independently of xt and yt, the expectation of the regret R_T(w) satisfies

  E[R_T(w)] = E[ Σ_{t=1}^T (ℓ(Dt wt) − ℓ(w)) ] ≤ E[ Σ_{t=1}^T (ℓ(Dt wt) − ℓ(w*)) ]
            = E[ Σ_{t=1}^T (Dt wt − w*)⊤ V (Dt wt − w*) ] = E[ Σ_{t=1}^T ‖L(Dt wt − w*)‖² ].   (4)

We bound (4) in the analysis.

Hardness result. Similarly to [5], we can show that the problem remains hard under Assumptions (1), (2), and (a). Refer to Appendix A for the proof.

Theorem 1. Let D be any positive constant, and let c_D ∈ (0, 1) be a constant dependent on D. Suppose that Assumptions (1) and (2) hold with k = O(d^{c_D}) and k′ = ⌊kD ln d⌋. If an algorithm for the online sparse regression problem runs in poly(d, T) time per iteration and achieves a regret at most poly(d, 1/σd) T^{1−δ} in expectation for some constant δ > 0, then NP ⊆ BPP.

4 Algorithm with extra observations and linear independence of features

In this section, we present Algorithm 1. Here we assume k′ ≥ k + 2, in addition to the linear independence of features (Assumption (a)). The additional assumption will be removed in Section 5.

As noted in Section 2, in each round t our algorithm first computes a weight vector wt, chooses a set St of k′ features to be observed, and computes a label ŷt = wt⊤ Dt xt. In addition, our algorithm constructs an unbiased estimator ĝt of the gradient gt of the loss function ℓt(w) at w = wt, i.e., gt = ∇_w ℓt(wt) = 2 xt (xt⊤ wt − yt), at the end of the round. In the following, we describe how to compute wt, St, and ĝt in round t, assuming that wt′, St′, and ĝt′ have been computed in the previous rounds t′ = 1, . . . , t − 1.
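Under stated simplifications, the three per-round quantities just listed can be sketched concretely. The dual-averaging weight, the top-plus-random feature selection, and the importance-weighted estimator below follow the formulas of this section in spirit, but the code is our own illustration; in particular, the inclusion probabilities are computed by brute-force enumeration of all random draws, which is only feasible for toy sizes.

```python
import itertools
import numpy as np

d, k1, k_prime = 4, 1, 3

# Dual-averaging step: w_t = -h / max(lambda_t, ||h||) is the closed form
# of argmin_{||w|| <= 1} { h^T w + (lambda_t / 2) ||w||^2 }.
h = np.array([2.0, -1.0, 0.0, 0.5])        # running sum of gradient estimates
lam = 1.0
w_t = -h / max(lam, float(np.linalg.norm(h)))

# Selection: the k1 largest |w_t| entries plus k_prime - k1 coordinates
# drawn uniformly at random from the rest.
top = list(np.argsort(-np.abs(w_t))[:k1])
rest = [i for i in range(d) if i not in top]
subsets = list(itertools.combinations(rest, k_prime - k1))

# Exact inclusion probabilities p_i = Prob[i in S_t], by enumeration.
p = np.zeros(d)
for sub in subsets:
    for i in list(sub) + top:
        p[i] += 1.0 / len(subsets)

# Importance-weighted estimator z_t of x_t: averaging z_t over every
# possible random draw of S_t recovers x_t exactly (unbiasedness).
x_t = np.array([0.3, -0.2, 0.1, 0.4])
z_mean = np.zeros(d)
for sub in subsets:
    S = set(sub) | set(top)
    z = np.array([x_t[i] / p[i] if i in S else 0.0 for i in range(d)])
    z_mean += z / len(subsets)
```

The same importance-weighting idea applied to the pairwise probabilities gives the unbiased estimator of xt xt⊤ used in the gradient estimate.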
The entire algorithm is described in Algorithm 1.

Algorithm 1
Input: {xt, yt} ⊆ R^d × R, {λt} ⊆ R_{>0}, k′ ≥ 2, and k1 ≥ 0 such that k1 ≤ k′ − 2.
1: Set ĥ0 = 0.
2: for t = 1, . . . , T do
3:   Define wt by (5) and define St by Observe(wt, k′, k1).
4:   Observe Dt xt and output ŷt := wt⊤ Dt xt.
5:   Observe yt, define ĝt by (6), and set ĥt = ĥ_{t−1} + ĝt.
6: end for

Computing wt. We use ĝ1, . . . , ĝ_{t−1} to compute wt by the dual averaging method as follows. Define ĥ_{t−1} = Σ_{j=1}^{t−1} ĝj, the sum of all gradient estimators computed in the previous rounds. Moreover, let (λ1, . . . , λT) be a monotonically non-decreasing sequence of positive numbers. From these, we define wt by

  wt = argmin_{w ∈ R^d, ‖w‖ ≤ 1} { ĥ_{t−1}⊤ w + (λt/2) ‖w‖² } = − ĥ_{t−1} / max{λt, ‖ĥ_{t−1}‖}.   (5)

Computing St. Let k1 be an integer such that k1 ≤ k′ − 2. We define Ut ⊆ [d] as the set of the k1 largest features with respect to wt, i.e., we choose Ut so that |Ut| = k1 and all i ∈ Ut and j ∈ [d] \ Ut satisfy |wti| ≥ |wtj|. Let Vt be a set of k′ − k1 elements chosen from [d] \ Ut uniformly at random. Our algorithm then observes the set St = Ut ∪ Vt of k′ features. We call this procedure for obtaining St Observe(wt, k′, k1).

Observation 1. We have Ut ⊆ St and Prob[i, j ∈ St] ≥ (k′ − k1)(k′ − k1 − 1)/(d(d − 1)) =: C_{d,k′,k1}. Thus, Prob[i, j ∈ St] > 0 for all i, j ∈ [d] if k′ ≥ k1 + 2.

For simplicity, we write p_i^{(t)} = Prob[i ∈ St] and p_{ij}^{(t)} = Prob[i, j ∈ St] for i, j ∈ [d].

Computing ĝt. Define X̃t = (x̃tij) ∈ R^{d×d} by X̃t = Dt xt xt⊤ Dt, and let Xt ∈ R^{d×d} be the matrix whose (i, j)th entry is x̃tij / p_{ij}^{(t)}. It follows that Xt is an unbiased estimator of xt xt⊤. Similarly, defining zt = (zti) ∈ R^d by zti = xti / p_i^{(t)} for i ∈ St and zti = 0 for i ∉ St, we see that zt is an unbiased estimator of xt. Using Xt and zt, we define ĝt to be

  ĝt = 2 Xt wt − 2 yt zt.   (6)

Regret bound of Algorithm 1. Let us show that the regret achieved by Algorithm 1 is O((d/(k′ − k)) √T) in expectation.

Theorem 2. Suppose that the linear independence of features is satisfied and k ≤ k′ − 2. Let k1 be an arbitrary integer such that k ≤ k1 ≤ k′ − 2. Then, for arbitrary w ∈ R^d with ‖w‖ ≤ 1, Algorithm 1 achieves

  E[R_T(w)] ≤ (3/σd²) ( (16/C_{d,k′,k1}) Σ_{t=1}^T 1/λt + λ_{T+1}/2 ).

By setting λt = 8 √(t/C_{d,k′,k1}) for each t = 1, . . . , T, we obtain

  E[R_T(w)] ≤ (24/σd²) √( d(d − 1) / ((k′ − k1)(k′ − k1 − 1)) ) · √(T + 1).   (7)

The rest of this section is devoted to proving Theorem 2. By (4), it suffices to evaluate E[ Σ_{t=1}^T ‖L(Dt wt − w*)‖² ] instead of E[R_T(w)]. The following lemma asserts that each term of (4) can be bounded, assuming the linear independence of features. Proofs of all lemmas are given
Proofs of all lemmas are given\nin the supplementary material.\nLemma 3. Suppose that the linear independence of features is satis\ufb01ed. If St \u2287 Ut,\n\n(cid:107)L(Dtwt \u2212 w\u2217)(cid:107)2 \u2264 3\n\u03c32\nd\n\n(cid:107)L(wt \u2212 w\u2217)(cid:107)2.\n\n(8)\n\n6\n\n\fProof. We have\n\n(cid:107)L(Dtwt \u2212 w\u2217)(cid:107)2 \u2264 \u03c32\n\n1(cid:107)Dtwt \u2212 w\u2217(cid:107)2 = \u03c32\n\n1\n\n\u2264 \u03c32\n\n1\n\n(wti \u2212 w\u2217\n\ni )2 +\n\ni\u2208S\u2217\u2229St\n\n\uf8eb\uf8ed (cid:88)\n\uf8eb\uf8ed(cid:107)wt \u2212 w\u2217(cid:107)2 +\n(cid:0)2w2\n\n(cid:88)\n\uf8f6\uf8f8 ,\n\ni\n\nw\u22172\n\ni\u2208S\u2217\\St\n\n(cid:88)\ni )2(cid:1)\n\ni\u2208S\u2217\\St\n\n(cid:88)\n\ni\u2208St\\S\u2217\n\nw\u22172\ni +\n\n\uf8f6\uf8f8\n\nw2\nti\n\n(9)\n\n(10)\n\nwhere the second inequality holds since w\u2217\n\ni = 0 for i \u2208 [d] \\ S\u2217. It holds that\n\n(cid:88)\n\ni\u2208S\u2217\\St\n\u2264 2\n\ni \u2264 (cid:88)\n(cid:88)\n\nw\u22172\n\nw2\n\nti + 2\n\ni\u2208S\u2217\\Ut\n\ni \u2264 (cid:88)\n(cid:88)\n\nw\u22172\n\ni\u2208S\u2217\\Ut\n(wti \u2212 w\u2217\n\ni\u2208Ut\\S\u2217\n\ni\u2208S\u2217\\Ut\n\nti + 2(wti \u2212 w\u2217\n\ni )2 \u2264 2(cid:107)wt \u2212 w\u2217(cid:107)2.\n\nThe \ufb01rst and third inequalities come from Ut \u2286 St and the de\ufb01nition of Ut. Putting (10) into (9),\nwe have\n\n(cid:107)L(Dtwt \u2212 w\u2217)(cid:107)2 \u2264 3\u03c32\n\n1(cid:107)wt \u2212 w\u2217(cid:107)2 \u2264 3\u03c32\n1\n\u03c32\nd\n\n(cid:107)L(wt \u2212 w\u2217)(cid:107)2.\n\n.\n\u221a\nt), the right-hand side of (11) is O(\n\nGt +\n\nt=1\n\n2\n\nincludes the support of w\u2217. Moreover, it holds that(cid:80)T\n\nt=1 E[(cid:107)L(wt \u2212 w\u2217)(cid:107)2] = E[(cid:80)T\nT (w\u2217)], since wt is independent of xt and yt. 
Thus, to bound(cid:80)T\n\nIt follows from the above lemma that, if wt converges to w\u2217, we have Dtwt = w\u2217, and hence St\nt=1((cid:96)t(wt)\u2212\nt=1 E[(cid:107)L(wt \u2212\n(cid:96)t(w\u2217))] = E[R(cid:48)\nw\u2217)(cid:107)2], we shall evaluate E[R(cid:48)\nLemma 4 ([11]). Suppose that wt is de\ufb01ned by (5) for each t = 1, . . . , T , and w \u2208 Rd satis\ufb01es\n(cid:107)w(cid:107) \u2264 1. Let Gt = E[(cid:107)\u02c6gt(cid:107)2] for t = 1, . . . , T . Then,\n1\n\u03bbt\n\nT (w)] \u2264 T(cid:88)\n\nT (w\u2217)].\n\nE[R(cid:48)\n\n\u03bbT +1\n\n(11)\n\nij = \u2126(1).\n\nT ). The following lemma shows\n\n\u221a\nIf Gt = O(1) and \u03bbt = \u0398(\nthat this is true if p(t)\nLemma 5. Suppose that the linear independence of features is satis\ufb01ed. Let t \u2208 [T ], and let q be a\npositive number such that q \u2264 min{p(t)\nWe are now ready to prove Theorem 2.\n\nij }. Then we have Gt \u2264 16/q.\n, p(t)\n\nProof of Theorem 2. The expectation E[RT (w)] of the regret\n\n(cid:80)T\nt=1 E[(cid:107)L(wt \u2212 w\u2217)(cid:107)2] = 3\nT (w\u2217)] \u2264 HT := (cid:80)T\n\n(cid:80)T\nt=1 E[(cid:107)L(Dtwt \u2212 w\u2217)(cid:107)2] \u2264 3\nGt \u2264 16/Cd,k(cid:48),k1. Hence, for \u03bbt = 8(cid:112)Cd,k(cid:48),k1t, HT satis\ufb01es HT \u2264(cid:80)T\n(cid:80)T\n\nis bounded as E[RT (w)] \u2264\nT (w\u2217)], where the \ufb01rst\nT (w\u2217)]\n. Lemma 5 and Observation 1 yield\n2 =\nT + 1. Combining the above three inequali-\n\ninequality comes from (4) and the second comes from Lemma 3. 
From Lemma 4, E[R(cid:48)\nis bounded by E[R(cid:48)\n\nT + 1 \u2264 8\n\nGt + \u03bbT +1\n\n16\nCd,k(cid:48),k1\n\n+ \u03bbT +1\n\nE[R(cid:48)\n\n\u221a\n\n\u221a\n\n1\n\u03bbt\n\nt=1\n\nt=1\n\n\u03c32\nd\n\n\u03c32\nd\n\n\u03bbt\n\n2\n\ni\n\n1\u221a\nCd,k(cid:48) ,k1\n\n2\u221a\nCd,k(cid:48) ,k1\n\n+\nties, we obtain (7).\n\nt=1\n\nt\n\n4\u221a\nCd,k(cid:48) ,k1\n\n5 Algorithms without extra observations\n\n5.1 Algorithm 2: Assuming (a) the linear independence of features\nIn Section 4, Lemma 3 showed a connection between RT and R(cid:48)\nunder Ut \u2286 St. Then, Lemmas 4 and 5 gave an upper bound of E[R(cid:48)\n\nT : E[RT (w)] \u2264 3\u03c32\nT (w\u2217)]: E[R(cid:48)\n\n\u03c3d2 E[R(cid:48)\nT (w\u2217)] = O(\n\nT (w\u2217)]\nT )\n\n\u221a\n\n1\n\n7\n\n\f(cid:80)s\n\nij = \u2126(1). In the case of k(cid:48) = k, however, the conditions Ut \u2286 St and p(t)\n\nunder p(t)\nij = \u2126(1) may\nnot be satis\ufb01ed simultaneously, since, if Ut \u2286 St and |St| = k(cid:48) = k \u2265 k1 = |Ut|, then we have\nij = 0 for i /\u2208 Ut or j /\u2208 Ut. Thus, we cannot use both relationships for the\nUt = St, which means p(t)\nanalysis. In Algorithm 2, we bound RT (w) without bounding R(cid:48)\nT (w).\nof {1, 2, . . . , T} by the set of squares, i.e., J = {s2 | s = 1, . . . ,(cid:98)\u221a\nLet us describe an idea of Algorithm 2. To achieve the claimed regret, we \ufb01rst de\ufb01ne a subset J\nT(cid:99)}. Let ts denote the s-th\nsmallest number in J for each s = 1, . . . ,|J|. In each round t, the algorithm computes St, a weight\nvector \u02dcwt, and a vector Dt\u02dcgt, where \u02dcgt is the gradient of (cid:96)t(w) at w = Dt \u02dcwt. In addition, if t = ts,\nj=1 wj, and an unbiased estimator\nthe algorithm computes other weight vectors ws and \u00afws := 1\ns\n\u02c6gs of the gradient of the loss function (cid:96)t(w) at ws.\nAt the beginning of round t, if t = ts, the algorithm \ufb01rst computes ws, and \u00afws is de\ufb01ned as the\naverage of w1, . . . 
, ws. Roughly speaking, ws is the weight vector computed with Algorithm 1\napplied to the examples (xt1, yt1), . . . , (xts, yts), setting k1 to be at most k \u2212 2. Then, we can\nshow that \u00afws is a consistent estimator of w\u2217. This step is only performed if t \u2208 J. Then St is\nde\ufb01ned from \u00afws, where s is the largest number such that ts \u2264 t. Thus, St does not change for any\nt \u2208 [ts, ts+1 \u2212 1]. After this, the algorithm computes \u02dcwt from D1\u02dcg1, . . . , Dt\u22121\u02dcgt\u22121, and predicts\nthe label of xt as \u02c6yt := \u02dcw(cid:62)\nt Dtxt. At the end of the round, the true label yt is observed, and Dt\u02dcgt\nis computed from wt and (Dtxt, yt). In addition, if t = ts, \u02c6gs is computed as in Algorithm 1. We\nneed \u02c6gs for computing ws(cid:48) with s(cid:48) > s in the subsequent rounds ts(cid:48).\nThe following theorem bounds the regret of Algorithm 2. See the supplementary material for details\nof the algorithm and the proof of the theorem.\nTheorem 6. Suppose that (a), the linear independence of features, is satis\ufb01ed and k \u2264 k(cid:48). 
Then, there exists a polynomial-time algorithm such that $\mathbb{E}[R_T(w)]$ is at most

$$8(1+\sqrt{d})\sqrt{T+1} + 12T\sum_{i\in S^*}|w^*_i|\exp\!\Big(-\frac{C_{d,k',0}^2\,(T^{1/4}-1)\,|w^*_i|^2\sigma_d^2}{18432}\Big) + 4\sum_{i\in S^*}|w^*_i|\Big(\frac{4096}{C_{d,k',0}^2\,|w^*_i|^4\sigma_d^4}+1\Big)^2,$$

for arbitrary $w \in \mathbb{R}^d$ with $\|w\| \le 1$, where $C_{d,k',0} = \frac{k'(k'-1)}{d(d-1)} = O(k'^2/d^2)$.

5.2 Algorithm 3: Assuming (b) the compatibility condition

Algorithm 3 adopts the same strategy as Algorithm 2 except for the procedure for determining $w_s$ and $\bar{w}_s$. In the analysis of Algorithm 2, we show that, to achieve the claimed regret, it suffices to generate $\{S_t\}$ satisfying $\sum_{t=1}^{T}\mathrm{Prob}[i \notin S_t] = O(\sqrt{T})$ for $i \in S^*$. This condition was satisfied by defining $S_t$ as the set of the $k$ largest features with respect to the weight vector $\bar{w}_s = \sum_{j=1}^{s} w_j/s$. The linear independence of features guarantees that $\bar{w}_s$ computed in Algorithm 2 converges to $w^*$, and hence $\{S_t\}$ defined as above possesses the required property. Unfortunately, if the assumption of the independence of features is not satisfied, e.g., if some features are almost identical, then $\bar{w}_s$ does not converge to $w^*$. However, if we introduce $\ell_1$-regularization into the minimization problem in the definition of $w_s$ and change the definition of $\bar{w}_s$ to a weighted average of the modified vectors $w_1, \dots, w_s$, then we can generate a required set $\{S_t\}$ under the compatibility assumption. See the supplementary material for details and the proof of the following theorem.

Theorem 7. Suppose that (b), the compatibility assumption, is satisfied and $k \le k'$.
Then, there exists a polynomial-time algorithm such that $\mathbb{E}[R_T(w)]$ is at most

$$8(1+\sqrt{d})\sqrt{T+1} + 12T\sum_{i\in S^*}|w^*_i|\exp\!\Big(-\frac{C_{d,k',0}\,(T^{1/4}-1)\,|w^*_i|^2\phi_0^2}{5832k}\Big) + 4\sum_{i\in S^*}|w^*_i|\Big(\frac{64\cdot 36^4 k^2}{C_{d,k',0}^2\,|w^*_i|^4\phi_0^4}+1\Big)^2,$$

for arbitrary $w \in \mathbb{R}^d$ with $\|w\| \le 1$, where $C_{d,k',0} = \frac{k'(k'-1)}{d(d-1)} = O(k'^2/d^2)$.3,4

3 The asymptotic regret bound mentioned in Section 1 can be obtained by bounding the second term with the aid of the identity $\max_{T\ge 0} T\exp(-\alpha T^\beta) = (\alpha\beta)^{-1/\beta}\exp(-1/\beta)$ for arbitrary $\alpha > 0$, $\beta > 0$.

4 Note that $\phi_0$ is the constant appearing in Assumption (b) in Section 3.

6 Experiments

In this section, we compare our algorithms with the following four baseline algorithms: (i) a greedy method that chooses the $k'$ largest features with respect to $w_t$ computed as in Algorithm 1; (ii) a uniform-random method that chooses $k'$ features uniformly at random; (iii) the algorithm of [6] (called AELR); and (iv) the algorithm of [5] (called FKK). Owing to space limitations, we present only typical results here. Other results and detailed descriptions of the experimental settings are provided in the supplementary material.

Synthetic data. First we show results on two kinds of synthetic datasets: instances with $(d, k, k')$ and instances with $(d, k_1, k)$. We set $k_1 = k$ in the setting of $(d, k, k')$ and $k' = k$ in the setting of $(d, k_1, k)$. The instances with $(d, k, k')$ assume that Algorithm 1 can use the ground-truth $k$, while Algorithm 1 cannot use $k$ in the instances with $(d, k_1, k)$.
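Before turning to the numbers, the support-selection schedule shared by Algorithms 2 and 3, which the experiments below compare against the baselines, can be sketched as follows. This is a schematic illustration only: the noisy estimates standing in for the inner Algorithm 1 update, the shrinkage weight `lam`, and all parameter values are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def checkpoint_rounds(T):
    """Checkpoint set J = {s^2 : s = 1, ..., floor(sqrt(T))} used by Algorithms 2 and 3."""
    return [s * s for s in range(1, int(np.sqrt(T)) + 1)]

def select_support(w_bar, k):
    """S_t: indices of the k largest-magnitude coordinates of the averaged weights."""
    return np.argsort(-np.abs(w_bar))[:k]

def soft_threshold(w, lam):
    """l1-shrinkage (proximal) step, illustrating the regularization idea of Algorithm 3."""
    return np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)

# Schematic run: the noisy draws below stand in for the inner Algorithm 1-style
# estimate w_s (an assumption -- the actual update is given in the paper).
rng = np.random.default_rng(0)
d, k, T = 20, 5, 5000
w_star = np.zeros(d)
w_star[:k] = 1.0                                   # true sparse weight vector

J = checkpoint_rounds(T)
w_sum = np.zeros(d)
supports = {}
for s, t_s in enumerate(J, start=1):
    w_s = w_star + rng.normal(scale=1.0 / np.sqrt(s), size=d)  # placeholder estimate
    w_sum += soft_threshold(w_s, lam=0.05)         # Algorithm 3-style shrinkage
    w_bar = w_sum / s                              # running average \bar{w}_s
    supports[t_s] = select_support(w_bar, k)       # S_t stays fixed on [t_s, t_{s+1} - 1]
```

Algorithm 2 corresponds to averaging the raw estimates $w_s$ directly; the `soft_threshold` call illustrates how Algorithm 3 modifies each $w_s$ before averaging so that the selection tolerates nearly identical features.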
For each $(d, k, k')$ and $(d, k_1, k)$, we executed all algorithms on five instances with $T = 5000$ and computed the averages of regrets and run times, respectively. When $(d, k, k') = (20, 5, 7)$, FKK spent 1176 s on average, while AELR spent 6 s and the others spent at most 1 s.

Figures 1 and 2 plot the regrets given by (1) over the number of rounds on a typical instance with $(d, k, k') = (20, 5, 7)$. Tables 2 and 3 summarize the average regrets at $T = 5000$, where A1, A2, A3, G, and U denote Algorithms 1, 2, and 3, the greedy method, and the uniform-random method, respectively. We observe that Algorithm 1 achieves the smallest regrets in the setting of $(d, k, k')$, whereas Algorithms 2 and 3 are better than Algorithm 1 in the setting of $(d, k_1, k)$. These results match our theoretical results.

Figure 1: Plot of regrets with $(d, k, k') = (20, 5, 7)$
Figure 2: Plot of regrets with $(d, k_1, k) = (20, 5, 7)$
Figure 3: CT-slice datasets

Table 2: Values of $R_T/10^2$ when changing $(d, k, k')$.

  (d, k, k')   A1    A2    A3    G      U      AELR   FKK
  (10, 2, 4)   1.53  2.38  3.60  33.28  25.73  24.05  60.76

Table 3: Values of $R_T/10^2$ when changing $(d, k_1, k)$.

  (d, k_1, k)  A1     A2     A3     G      U      AELR   FKK
  (10, 2, 4)   26.88  20.59  17.19  43.03  60.02  58.71  64.75

Real data. We next conducted experiments using the CT-slice dataset, which is available online [10]. Each example consists of 384 features retrieved from 53,500 CT images, together with a label denoting the relative position of the image along the axial axis.

We executed all algorithms except FKK, which was infeasible due to its expensive run time. Since we do not know the ground-truth regression weights, we measure the performance by the first term of (1), i.e., the square loss of predictions. Figure 3 plots the losses over the number of rounds. The parameters are $k_1 = 60$ and $k' = 70$.
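A minimal harness in the spirit of these comparisons can be sketched as follows. This is a hypothetical sketch, not the authors' experiment code: the synthetic generator, the learning rate, and the gradient-step update used for the greedy baseline (i) are illustrative assumptions.

```python
import numpy as np

def greedy_baseline_loss(X, y, k_obs, eta=0.01):
    """Cumulative square loss of a greedy learner in the spirit of baseline (i):
    each round it observes the k' features with the largest current weights,
    predicts from those features only, then updates the observed coordinates."""
    T, d = X.shape
    w = np.zeros(d)
    losses = []
    for t in range(T):
        S = np.argsort(-np.abs(w))[:k_obs]     # k' features chosen for observation
        y_hat = w[S] @ X[t, S]                 # prediction from observed features only
        err = y_hat - y[t]
        losses.append(err ** 2)
        w[S] -= eta * err * X[t, S]            # gradient step on observed coordinates
    return np.cumsum(losses)

# Illustrative synthetic instance (an assumption, not the paper's generator).
rng = np.random.default_rng(1)
T, d, k, k_obs = 2000, 20, 5, 7
w_star = np.zeros(d)
w_star[:k] = 1.0
X = rng.normal(size=(T, d))
y = X @ w_star + 0.1 * rng.normal(size=T)
cum_loss = greedy_baseline_loss(X, y, k_obs)
```

The returned array is the cumulative square loss, i.e., the quantity plotted in Figure 3 for the real-data experiment; the same loop structure accommodates the other feature-selection rules by swapping out the choice of `S`.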
For this instance, the run times of Algorithms 1 and 2, greedy, uniform random, and AELR were 195, 35, 147, 382, and 477 s, respectively.

We observe that Algorithms 2 and 3 are superior to the others, which implies that Algorithms 2 and 3 are suitable for instances where the ground-truth $k$ is not known, such as instances derived from real data.

Acknowledgement

This work was supported by JST ERATO Grant Number JPMJER1201, Japan.

References

[1] P. Bühlmann and S. van de Geer. Statistics for High-Dimensional Data. Springer, 2011.

[2] N. Cesa-Bianchi, S. Shalev-Shwartz, and O. Shamir. Some impossibility results for budgeted learning. In Joint ICML-COLT Workshop on Budgeted Learning, 2010.

[3] N. Cesa-Bianchi, S. Shalev-Shwartz, and O. Shamir. Efficient learning with partially observed attributes. Journal of Machine Learning Research, 12:2857–2878, 2011.

[4] X. Chen, Q. Lin, and J. Pena. Optimal regularized dual averaging methods for stochastic optimization. In Advances in Neural Information Processing Systems, pages 395–403, 2012.

[5] D. Foster, S. Kale, and H. Karloff. Online sparse linear regression. In 29th Annual Conference on Learning Theory, pages 960–970, 2016.

[6] E. Hazan and T. Koren. Linear regression with limited observation. In Proceedings of the 29th International Conference on Machine Learning (ICML-12), pages 807–814, 2012.

[7] S. Kale. Open problem: Efficient online sparse regression. In Proceedings of The 27th Conference on Learning Theory, pages 1299–1301, 2014.

[8] S. Kale, Z. Karnin, T. Liang, and D.
Pál. Adaptive feature selection: Computationally efficient online sparse linear regression under RIP. In Proceedings of the 34th International Conference on Machine Learning (ICML-17), pages 1780–1788, 2017.

[9] P. Koiran and A. Zouzias. Hidden cliques and the certification of the restricted isometry property. IEEE Transactions on Information Theory, 60(8):4999–5006, 2014.

[10] M. Lichman. UCI Machine Learning Repository, 2013.

[11] L. Xiao. Dual averaging methods for regularized stochastic learning and online optimization. Journal of Machine Learning Research, 11:2543–2596, 2010.

[12] M. Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pages 928–936, 2003.

[13] N. Zolghadr, G. Bartók, R. Greiner, A. György, and C. Szepesvári. Online learning with costly features and labels. In Advances in Neural Information Processing Systems, pages 1241–1249, 2013.