{"title": "Sample Efficient Stochastic Gradient Iterative Hard Thresholding Method for Stochastic Sparse Linear Regression with Limited Attribute Observation", "book": "Advances in Neural Information Processing Systems", "page_first": 5312, "page_last": 5321, "abstract": "We develop new stochastic gradient methods for efficiently solving sparse linear regression in a partial attribute observation setting, where learners are only allowed to observe a fixed number of actively chosen attributes per example at training and prediction times. It is shown that the methods achieve essentially a sample complexity of $O(1/\\varepsilon)$ to attain an error of $\\varepsilon$ under a variant of restricted eigenvalue condition, and the rate has better dependency on the problem dimension than existing methods. Particularly, if the smallest magnitude of the non-zero components of the optimal solution is not too small, the rate of our proposed {\\it Hybrid} algorithm can be boosted to near the minimax optimal sample complexity of {\\it full information} algorithms. The core ideas are (i) efficient construction of an unbiased gradient estimator by the iterative usage of the hard thresholding operator for configuring an exploration algorithm; and (ii) an adaptive combination of the exploration and an exploitation algorithms for quickly identifying the support of the optimum and efficiently searching the optimal parameter in its support. Experimental results are presented to validate our theoretical findings and the superiority of our proposed methods.", "full_text": "Sample Ef\ufb01cient Stochastic Gradient Iterative Hard\nThresholding Method for Stochastic Sparse Linear\n\nRegression with Limited Attribute Observation\n\nTomoya Murata\n\nNTT DATA Mathematical Systems Inc. 
, Tokyo, Japan
murata@msi.co.jp

Taiji Suzuki
Department of Mathematical Informatics,
Graduate School of Information Science and Technology,
The University of Tokyo, Tokyo, Japan
Center for Advanced Intelligence Project, RIKEN, Tokyo, Japan
taiji@mist.i.u-tokyo.ac.jp

Abstract

We develop new stochastic gradient methods for efficiently solving sparse linear regression in a partial attribute observation setting, where learners are only allowed to observe a fixed number of actively chosen attributes per example at training and prediction times. It is shown that the methods achieve a sample complexity of essentially $O(1/\varepsilon)$ to attain an error of $\varepsilon$ under a variant of the restricted eigenvalue condition, and the rate has better dependency on the problem dimension than existing methods. In particular, if the smallest magnitude of the non-zero components of the optimal solution is not too small, the rate of our proposed Hybrid algorithm can be boosted to near the minimax optimal sample complexity of full information algorithms. The core ideas are (i) efficient construction of an unbiased gradient estimator by the iterative usage of the hard thresholding operator for configuring an exploration algorithm; and (ii) an adaptive combination of the exploration and exploitation algorithms for quickly identifying the support of the optimum and efficiently searching for the optimal parameter on its support. Experimental results are presented to validate our theoretical findings and the superiority of our proposed methods.

1 Introduction

In real-world sequential prediction scenarios, the features (or attributes) of examples are typically high-dimensional, and constructing all the features for each example may be expensive or impossible. One example of such scenarios arises in the context of medical diagnosis of a disease, where each attribute is the result of a medical test on a patient [4].
In such scenarios, observing all the features of each patient may be impossible, because conducting every medical test on each patient is undesirable due to its physical and mental burden.
In limited attribute observation settings [1, 4], learners are only allowed to observe a given number of attributes per example at training time. Hence learners need to update their predictors based on actively chosen attributes, which may differ from example to example.
Several methods have been proposed to deal with this setting in linear regression problems. Cesa-Bianchi et al. [4] have proposed a generalized stochastic gradient descent algorithm [16, 5, 14] based

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

on the idea of picking the observed attributes randomly and constructing a noisy version of all the attributes from them. Hazan and Koren [7] have proposed an algorithm combining a stochastic variant of the EG algorithm [11] with the idea in [4], which improves the dependency on the problem dimension of the convergence rate proven in [4].
In these works, the limited attribute observation setting has been considered only at training time. However, it is natural to assume that the observable number of attributes at prediction time is the same as the one at training time. This assumption naturally requires the sparsity of the output predictors.
Despite the importance of this sparsity requirement, a hardness result is known in this setting. Foster et al. [6] have considered online (agnostic) sparse linear regression in the limited attribute observation setting. They have shown that no algorithm running in polynomial time per example can achieve any sub-linear regret unless NP ⊂ BPP. It has also been shown that this hardness result holds in stochastic i.i.d. (non-agnostic) settings [8].
These hardness results suggest that some additional assumptions are needed.
More recently, Kale and Karnin [10] have proposed an algorithm based on the Dantzig Selector [3], which runs in polynomial time per example and achieves sub-linear regret under the restricted isometry condition [2], which is well known in the sparse recovery literature. In non-agnostic settings in particular, the proposed algorithm achieves a sample complexity of $\tilde{O}(1/\varepsilon)$¹, but the rate has bad dependency on the problem dimension. Additionally, this algorithm has a large memory cost, since it needs to store all the observed samples due to the applications of the Dantzig Selector to the updated design matrices. Independently, Ito et al. [8] have proposed three runtime-efficient algorithms based on regularized dual averaging [15] with their proposed exploration-exploitation strategies in non-agnostic settings, under linear independence of features or compatibility [2]. One of the three algorithms achieves a sample complexity of $O(1/\varepsilon^2)$ under linear independence of features, which is worse than the one in [10] but has better dependency on the problem dimension. The other two algorithms also achieve a sample complexity of $O(1/\varepsilon^2)$, but the additional term independent of $\varepsilon$ has unacceptable dependency on the problem dimension.
As mentioned above, there exist several runtime-efficient algorithms which solve the sparse linear regression problem with limited attribute observation under suitable conditions. However, the convergence rates of these algorithms have bad dependency on the problem dimension or on the desired accuracy. Whether more efficient algorithms exist is a quite important and interesting question.

Main contribution In this paper, we focus on stochastic i.i.d. (non-agnostic) sparse linear regression in the limited attribute observation setting and propose new sample efficient algorithms in this setting.
The main feature of the proposed algorithms is summarized as follows:

Our algorithms achieve a sample complexity of $\tilde{O}(1/\varepsilon)$ with much better dependency on the problem dimension than the ones in existing work. In particular, if the smallest magnitude of the non-zero components of the optimal solution is not too small, the rate can be boosted to near the minimax optimal sample complexity of full information algorithms.

Additionally, our algorithms also possess run-time efficiency and memory efficiency, since the average run-time cost per example and the memory cost of the proposed algorithms are of the order of the number of observed attributes per example and of the problem dimension respectively, which are better than or comparable to those of existing methods.
We list the comparisons of our methods with several preceding methods in our setting in Table 1.

2 Notation and Problem Setting

In this section, we formally describe the problem considered in this paper and the assumptions for our theory.

¹$\tilde{O}$ hides extra log-factors.
†Note that the necessary number of observed attributes per example at prediction time is $s_*$, which is nearly the same as for the other algorithms in Table 1.

Table 1: Comparisons of our methods with existing methods in our problem setting. Sample complexity means the necessary number of samples to attain an error $\varepsilon$. \"# of observed attrs per ex.\" indicates the minimum number of observed attributes per example at training time which the algorithm requires. $s'$ is the number of observed attributes per example, $s_*$ is the size of the support of the optimal solution, $d$ is the problem dimension, $\varepsilon$ is the desired accuracy and $r_{\min}$ is the smallest magnitude of the non-zero components of the optimal solution.
We regard the smoothness and strong convexity parameters of the objectives derived from the additional assumptions and the boundedness parameter of the input data distribution as constants. $\tilde{O}$ hides extra log-factors for simplifying the notation.

| Method | Sample complexity | # of observed attrs per ex. | Additional assumptions | Objective type |
|---|---|---|---|---|
| Dantzig [10] | $\tilde{O}\big( \frac{d s_*^2}{s'} \big(\sigma + \frac{d}{s'}\big)^2 \frac{1}{\varepsilon} \big)$ | $s_* + 2$ † | restricted isometry condition | Regret |
| RDA1 [8] | $O\big( \frac{d^2 \sigma^2}{\varepsilon^2} \big)$ | $s_*$ | linear independence of features | Regret |
| RDA2 [8] | $O\big( \frac{d^{16}}{s'^{16}} + \frac{d \sigma^2}{\varepsilon^2} \big)$ | $s_*$ | linear independence of features | Regret |
| RDA3 [8] | $O\big( \frac{d^{16}}{s'^{16}} + \frac{d \sigma^2}{\varepsilon^2} \big)$ | $s_*$ | compatibility | Regret |
| Exploration | $\tilde{O}\big( \frac{d s_*^2}{s'} + \frac{d s_*}{s'} \frac{\sigma^2}{\varepsilon} \big)$ | $O(s_*)$ | restricted smoothness & restricted strong convexity | Expected risk |
| Hybrid | $\tilde{O}\big( \frac{d s_*^2}{s'} + \big( \frac{s_*}{r_{\min}^2} \wedge \frac{d s_*}{s'} \big) \frac{\sigma^2}{\varepsilon} \big)$ | $O(s_*)$ | restricted smoothness & restricted strong convexity | Expected risk |

2.1 Notation

We use the following notation in this paper.

• $\|\cdot\|$ denotes the Euclidean $L_2$ norm $\|\cdot\|_2$: $\|x\| = \|x\|_2 = \sqrt{\sum_i x_i^2}$.
• For natural numbers $m, n$, $[m, n]$ denotes the set $\{m, m+1, \ldots, n\}$. If $m = 1$, $[1, n]$ is abbreviated as $[n]$.
• $H_s$ denotes the projection onto $s$-sparse vectors, i.e., $H_s(\theta') = \mathrm{argmin}_{\theta \in \mathbb{R}^d, \|\theta\|_0 \le s} \|\theta - \theta'\|$ for $s \in \mathbb{N}$, where $\|\theta\|_0$ denotes the number of non-zero elements of $\theta$.
• For $x \in \mathbb{R}^d$, $x|_j$ denotes the $j$-th element of $x$.
For $S \subset [d]$, we use $x|_S \in \mathbb{R}^{|S|}$ to denote the restriction of $x$ to $S$: $(x|_S)|_j = x_j$ for $j \in S$.

2.2 Problem definition

In this paper, we consider the following sparse linear regression model:

$$y = \theta_*^\top x + \xi, \quad \text{where } \|\theta_*\|_0 = s_*, \; x \sim D_X, \tag{1}$$

where $\xi$ is a mean-zero sub-Gaussian random variable with parameter $\sigma^2$, which is independent of $x \sim D_X$. We denote by $D$ the joint distribution of $x$ and $y$.
For finding the true parameter $\theta_*$ of model (1), we focus on the following optimization problem:

$$\min_{\|\theta\|_0 \le s, \, \theta \in \mathbb{R}^d} \Big\{ L(\theta) \overset{\mathrm{def}}{=} \mathbb{E}_{(x,y) \sim D}[\ell_y(\theta^\top x)] \Big\}, \tag{2}$$

where $\ell_y(a)$ is the standard squared loss $(a - y)^2$ and $s \ge s_*$ is some integer. We can easily see that the true parameter $\theta_*$ is an optimal solution of problem (2).

Limited attribute observation We assume that only a small subset of the attributes, which we actively choose per example, rather than all attributes, can be observed at both training and prediction time. In this paper, we aim to construct algorithms which solve problem (2) while observing only $s' (\ge s \ge s_*) \in [d]$ attributes per example. Typically, the situation $s' \ll d$ is considered.

2.3 Assumptions

We make the following assumptions for our analysis.

Assumption 1 (Boundedness of data). For $x \sim D_X$, $\|x\|_\infty \le R_\infty$ with probability one.
Assumption 2 (Restricted smoothness of $L$).
Objective function $L$ satisfies the following restricted smoothness condition:

$$\forall s \in [d], \exists L_s > 0, \forall \theta_1, \theta_2 \in \mathbb{R}^d: \|\theta_1\|_0, \|\theta_2\|_0 \le s \;\Rightarrow\; L(\theta_1) \le L(\theta_2) + \langle \nabla L(\theta_2), \theta_1 - \theta_2 \rangle + \frac{L_s}{2}\|\theta_1 - \theta_2\|^2.$$

Assumption 3 (Restricted strong convexity of $L$). Objective function $L$ satisfies the following restricted strong convexity condition:

$$\forall s \in [d], \exists \mu_s > 0, \forall \theta_1, \theta_2 \in \mathbb{R}^d: \|\theta_1\|_0, \|\theta_2\|_0 \le s \;\Rightarrow\; L(\theta_2) + \langle \nabla L(\theta_2), \theta_1 - \theta_2 \rangle + \frac{\mu_s}{2}\|\theta_1 - \theta_2\|^2 \le L(\theta_1).$$

By the restricted strong convexity of $L$, we can easily see that the true parameter of model (1) is the unique optimal solution of optimization problem (2). We denote the condition number $L_s/\mu_s$ by $\kappa_s$.
Remark. In linear regression settings, Assumptions 2 and 3 are equivalent to assuming

$$\forall s \in [d], \exists L_s > 0: \sup_{\theta \in \mathbb{R}^d \setminus \{0\}, \, \|\theta\|_0 \le 2s} \frac{\theta^\top \mathbb{E}_{x \sim D_X}[x x^\top] \theta}{\|\theta\|^2} \le L_s$$

and

$$\forall s \in [d], \exists \mu_s > 0: \inf_{\theta \in \mathbb{R}^d \setminus \{0\}, \, \|\theta\|_0 \le 2s} \frac{\theta^\top \mathbb{E}_{x \sim D_X}[x x^\top] \theta}{\|\theta\|^2} \ge \mu_s$$

respectively. Note that these conditions are stronger than the restricted eigenvalue condition, but weaker than the restricted isometry condition.

3 Approach and Algorithm Description

In this section, we illustrate our main ideas and describe the proposed algorithms in detail.

3.1 Exploration algorithm

One of the difficulties in partial information settings is that the standard stochastic gradient is no longer available.
In linear regression settings, the gradient we want to estimate is given by $\mathbb{E}_{(x,y)\sim D}[\ell'_y(\theta^\top x)x] = \mathbb{E}_{(x,y)\sim D}[2(\theta^\top x - y)x]$. In general, we need to construct unbiased estimators of $\mathbb{E}_{(x,y)\sim D}[yx]$ and $\mathbb{E}_{x\sim D_X}[xx^\top]$. A standard technique is the use of $\hat{x}$, which is defined as $\hat{x}|_j = x|_j$ ($j \in S$) and $\hat{x}|_j = 0$ ($j \notin S$), where $S \subset [d]$ is randomly observed with $|S| = s'$ and $x \sim D_X$. Then we obtain an unbiased estimator of $\mathbb{E}_{(x,y)\sim D}[yx]$ as $y \frac{d}{s'} \hat{x}$. Similarly, an unbiased estimator of $\mathbb{E}_{x\sim D_X}[xx^\top]$ is given by $\hat{x}\hat{x}^\top$ with adequate element-wise scaling. Note that the latter estimator in particular has a quite large variance, because the probability that the $(i,j)$-entry of $\hat{x}\hat{x}^\top$ becomes non-zero is $O(s'^2/d^2)$ when $i \ne j$, which is very small when $s' \ll d$.
If the updated solution $\theta$ is sparse, computing $\theta^\top x$ requires observing only the attributes of $x$ which correspond to the support of $\theta$, and there is no need to estimate $\mathbb{E}_{x\sim D_X}[xx^\top]$, which has a potentially large variance. However, this idea cannot be applied to existing methods, because they do not ensure the sparsity of the updated solutions at training time and generate sparse output solutions only at prediction time by using the hard thresholding operator.
Iterative application of hard thresholding to the updated solutions at training time ensures their sparsity and enables an efficient construction of unbiased gradient estimators. Moreover, we can fully utilize the restricted smoothness and restricted strong convexity of the objective (Assumptions 2 and 3) due to the sparsity of the updated solutions, provided the optimal solution of the objective is sufficiently sparse.
Now we present our proposed estimator.
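Before presenting it, the masking-and-rescaling construction just described can be illustrated concretely. The following NumPy sketch builds the unbiased estimator $y\frac{d}{s'}\hat{x}$ of $\mathbb{E}[yx]$ from a randomly observed subset $S$ (the function name and the small example are ours, for illustration only):

```python
import numpy as np
from itertools import combinations

def masked_estimator(x, y, S, s_prime):
    """Unbiased estimator of E[y x]: observe only the attributes in S,
    zero out the rest, and rescale by d / s' to correct the bias."""
    d = x.shape[0]
    x_hat = np.zeros(d)
    idx = np.fromiter(S, dtype=int)
    x_hat[idx] = x[idx]            # x_hat agrees with x on S and is 0 elsewhere
    return y * (d / s_prime) * x_hat

# Unbiasedness check: averaging over all subsets S of size s' recovers y * x
# exactly, since each coordinate is observed with probability s' / d.
d, s_prime = 5, 2
x = np.arange(1.0, d + 1.0)        # a fixed example
y = 3.0
subsets = list(combinations(range(d), s_prime))
avg = sum(masked_estimator(x, y, S, s_prime) for S in subsets) / len(subsets)
print(np.allclose(avg, y * x))     # True
```

The same exhaustive average applied to $\hat{x}\hat{x}^\top$ would reveal the variance issue noted above: off-diagonal entries are non-zero for only an $O(s'^2/d^2)$ fraction of subsets.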
Motivated by the above discussion, we adopt the iterative usage of hard thresholding at training time. Thanks to the hard thresholding operator, which projects dense vectors to $s$-sparse ones, we are guaranteed that the updated solutions are $s(< s')$-sparse, where $s'$ is the number of observable attributes per example. Hence we can efficiently estimate $\mathbb{E}_{x\sim D_X}[\theta^\top x \, x]$ as $\theta^\top x \, \hat{x}$ with adequate scaling. As described above, computing $\theta^\top x$ can be executed efficiently and requires observing only $s$ attributes of $x$. Thus a naive algorithm based on this idea is as follows, for $t = 1, 2, \ldots, T$:

Sample $(x_t, y_t) \sim D$.
Observe $x_t|_{\mathrm{supp}(\theta_{t-1}) \cup S}$, where $S$ is a random subset of $[d]$ with $|S| = s' - s$.
Compute $g_t = 2(\theta_{t-1}^\top x_t - y_t) \frac{d}{s' - s} \hat{x}_t$.
Update $\theta_t = H_s(\theta_{t-1} - \eta_t g_t)$.

Unfortunately, this algorithm has no theoretical guarantee due to the use of hard thresholding. Generally, stochastic gradient methods need to decrease the learning rate $\eta_t$ as $t \to \infty$ to reduce the noise caused by the randomness in the construction of the gradient estimators. A large number of stochastic gradients with small step sizes must then accumulate for proper updates of the solutions. However, the hard thresholding operator clears the accumulated effect outside the support of the current solution at every update, and thus the convergence of the above algorithm is not ensured if a decreasing learning rate is used. To overcome this problem, we adopt the standard mini-batch strategy, which reduces the variance of the gradient estimator without decreasing the learning rate.
We provide the concrete procedure based on the above ideas in Algorithm 1. We sample $\lceil \frac{d}{s'-s} \rceil \times B_t$ examples per update.
The support of the current solution and $s' - s$ deterministically selected attributes are observed for each example. To construct the unbiased gradient estimator $g_t$, we average $B_t$ unbiased gradient estimators, where each estimator is the concatenation of the block-wise unbiased gradient estimators of $\lceil \frac{d}{s'-s} \rceil$ examples. Note that a constant step size is adopted. We call Algorithm 1 Exploration, since each coordinate is treated equally with respect to the construction of the gradient estimator.

3.2 Refinement of Algorithm 1 using exploitation and its adaptation

As we will state in Theorem 4.1 of Section 4, Exploration (Algorithm 1) achieves linear convergence when an adequate learning rate, support size $s$ and mini-batch sizes $\{B_t\}_{t=1}^\infty$ are chosen. Using this fact, we can show that Algorithm 1 identifies the optimal support in finitely many iterations with high probability. Once we find the optimal support, it is much more efficient to optimize the parameter on it rather than globally. We call this algorithm Exploitation and describe the details in Algorithm 2. Ideally, it would be desirable first to run Exploration (Algorithm 1) and, once the optimal support is found, to switch from Exploration to Exploitation (Algorithm 2). However, whether the optimal support has been found cannot be checked in practice, and the theoretical number of updates needed to find it depends on the smallest magnitude of the non-zero components of the optimal solution, which is unknown. Therefore, we need to construct an algorithm which combines Exploration and Exploitation and is adaptive to this unknown value. We give this adaptive algorithm in Algorithm 3. This algorithm alternately uses Exploration and Exploitation.
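As a concrete reference point for these building blocks, here is a minimal NumPy sketch of the hard thresholding operator $H_s$ and of the naive single-example exploration update from Section 3.1 (illustrative only; this is not the mini-batched Algorithm 1, and all function names are ours):

```python
import numpy as np

def hard_threshold(theta, s):
    """H_s: keep the s largest-magnitude entries of theta, zero out the rest."""
    out = np.zeros_like(theta)
    keep = np.argsort(np.abs(theta))[-s:]     # indices of the s largest |theta_j|
    out[keep] = theta[keep]
    return out

def naive_exploration_step(theta, x, y, eta, s, s_prime, rng):
    """One naive update: observe supp(theta) plus s' - s random attributes,
    form the rescaled gradient estimate, gradient-step, then hard-threshold."""
    d = theta.shape[0]
    S = rng.choice(d, size=s_prime - s, replace=False)
    x_hat = np.zeros(d)
    x_hat[S] = x[S]                            # randomly observed attributes
    residual = theta @ x - y                   # only supp(theta) of x matters here
    g = 2.0 * residual * (d / (s_prime - s)) * x_hat
    return hard_threshold(theta - eta * g, s)

theta = hard_threshold(np.array([3.0, -1.0, 0.5, 2.0]), 2)
print(theta)  # [3. 0. 0. 2.]
```

Because the output of `hard_threshold` has at most $s$ non-zero entries, the next residual again needs only $s$ observed attributes, which is exactly the property the Exploration algorithm exploits.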
We can show that Algorithm 3 achieves at least the same convergence rate as Exploration and, thanks to the usage of Exploitation, its rate can be much boosted when the smallest magnitude of the non-zero components of the optimal solution is not too small. We call this algorithm Hybrid.

4 Convergence Analysis

In this section, we provide the convergence analysis of our proposed algorithms. We use $\tilde{O}$ notation to hide extra log-factors for simplifying the statements. Here, the log-factors have the form $O(\log(\frac{\kappa_s d}{\delta}))$, where $\delta$ is a confidence parameter used in the statements.

4.1 Analysis of Algorithm 1

The following theorem implies that Algorithm 1 with sufficiently large mini-batch sizes $\{B_t\}_{t=1}^\infty$ achieves linear convergence.
Theorem 4.1 (Exploration). Let $T \in \mathbb{N}$ and $\theta_0 \in \mathbb{R}^d$. For Algorithm 1, if we adequately choose $s = O(\kappa_s^2 s_*)$, $\eta = \Theta(\frac{1}{L_s})$ and $\check\alpha = \Theta(\frac{1}{\kappa_s})$, then for any $s'(> s) \in [d]$, $\delta \in (0,1)$ and $\Delta > 0$ there exists $B_t = \tilde{O}\big(\kappa_s^2\big(\frac{R_\infty^4}{L_s^2}s^2 \vee \frac{R_\infty^2}{L_s}\frac{\sigma^2 T s}{(1-\check\alpha)^T \Delta}\big)\big)$ $(t = 1, \ldots, \infty)$ such that

$$P\big(L(\theta_T) - L(\theta_*) \le (1-\check\alpha)^T (L(\theta_0) - L(\theta_*) + \Delta)\big) \ge 1 - \delta.$$

Algorithm 1: Exploration($\theta_0 \in \mathbb{R}^d$, $\eta > 0$, $s', s \in [d]$ ($s' > s$), $\{B_t\}$, $T \in \mathbb{N}$)
  Set $\theta_0 = H_s(\theta_0)$, $d' = \lceil \frac{d}{s'-s} \rceil$ and $J_i = [(s'-s)(i-1)+1, (s'-s)i \wedge d]$ for $i \in [d']$.
  for $t = 1$ to $T$ do
    Set $S_{t-1} = \mathrm{supp}(\theta_{t-1})$.
    Sample $(x_i^{(b)}, y_i^{(b)}) \sim D$ for $i \in [d']$ and $b \in [B_t]$.
    Observe $x_i^{(b)}|_{J_i}$, $x_i^{(b)}|_{S_{t-1}}$ and $y_i^{(b)}$ for $i \in [d']$ and $b \in [B_t]$.
    Compute $g_t|_{J_i} = \frac{1}{B_t}\sum_{b=1}^{B_t} \ell'_{y_i^{(b)}}\big(\theta_{t-1}|_{S_{t-1}}^\top x_i^{(b)}|_{S_{t-1}}\big)\, x_i^{(b)}|_{J_i}$ for $i \in [d']$.
    Update $\theta_t = H_s(\theta_{t-1} - \eta g_t)$.
  end for
  return $\theta_T$.

Algorithm 2: Exploitation($\theta_0 \in \mathbb{R}^d$, $\eta > 0$, $\{B_t\}$, $T \in \mathbb{N}$)
  Set $S_0 = \mathrm{supp}(\theta_0)$.
  for $t = 1$ to $T$ do
    Sample $(x^{(b)}, y^{(b)}) \sim D$ for $b \in [B_t]$.
    Observe $x^{(b)}|_{S_0}$ and $y^{(b)}$ for $b \in [B_t]$.
    Compute $g_t|_{S_0} = \frac{1}{B_t}\sum_{b=1}^{B_t} \ell'_{y^{(b)}}\big(\theta_{t-1}|_{S_0}^\top x^{(b)}|_{S_0}\big)\, x^{(b)}|_{S_0}$.
    Set $g_t|_{S_0^c} = 0$.
    Update $\theta_t = \theta_{t-1} - \eta g_t$.
  end for
  return $\theta_T$.

The proof of Theorem 4.1 is found in Section A.1 of the supplementary material.
From Theorem 4.1, we obtain the following corollary, which gives a sample complexity for the algorithm.
Corollary 4.2 (Exploration). For Algorithm 1, under the settings of Theorem 4.1 with $\Delta = L(\theta_0) - L(\theta_*)$, the necessary number of observed samples to achieve $P(L(\theta_T) - L(\theta_*) \le \varepsilon) \ge 1 - \delta$ is

$$\tilde{O}\Big( \frac{\kappa_s R_\infty^4}{\mu_s^2} \frac{d s^2}{s'-s} + \frac{\kappa_s R_\infty^2}{\mu_s} \frac{d s}{s'-s} \frac{\sigma^2}{\varepsilon} \Big).$$

The proof of Corollary 4.2 is given in Section A.2 of the supplementary material.
Remark. If we set $s'-s = \Theta(s)$ and assume that $\kappa_s$, $R_\infty^2$ and $\mu_s$ are $\Theta(1)$, Corollary 4.2 gives a sample complexity of $\tilde{O}(d s_* + d\sigma^2/\varepsilon)$.
Remark.
Corollary 4.2 implies that in full information settings, i.e., $s' - s = \Theta(d)$, Algorithm 1 achieves a sample complexity of $\tilde{O}(s_*^2 + s_* \sigma^2/\varepsilon)$ if $\kappa_s$, $R_\infty^2$ and $\mu_s$ are regarded as $\Theta(1)$. This rate is near the minimax optimal sample complexity of $\tilde{O}(s_* \sigma^2/\varepsilon)$ for full information settings [13].
Remark. The estimator $\theta_T$ is guaranteed to be asymptotically consistent, because it is easily seen that $\|\theta_T - \theta_*\|^2$ converges to $0$ as $T \to \infty$ by the restricted strong convexity of the objective $L$, and its convergence rate is nearly the same as that of the objective gap $L(\theta_T) - L(\theta_*)$.

4.2 Analysis of Algorithm 2

Generally, Algorithm 2 does not ensure convergence. However, the following theorem shows that running Algorithm 2 with sufficiently large batch sizes will not increase the objective values too much. Moreover, if the support of the optimal solution is included in that of the initial point, then Algorithm 2 also achieves linear convergence.

Algorithm 3: Hybrid($\tilde\theta_0 \in \mathbb{R}^d$, $\eta > 0$, $s', s \in [d]$ ($s' > s$), $\{B^-_{t,k}\}$, $\{B_{t,k}\}$, $\{T^-_k\}$, $\{T_k\}$, $K \in \mathbb{N}$)
  for $k = 1$ to $K$ do
    Update $\tilde\theta^-_k = \mathrm{Exploration}(\tilde\theta_{k-1}, \eta, s', s, \{B^-_{t,k}\}_{t=1}^\infty, T^-_k)$.
    Update $\tilde\theta_k = \mathrm{Exploitation}(\tilde\theta^-_k, \eta, \{B_{t,k}\}_{t=1}^\infty, T_k)$.
  end for
  return $\tilde\theta_K$.

Theorem 4.3 (Exploitation). Let $T \in \mathbb{N}$, $\theta_0 \in \mathbb{R}^d$ and $s \ge |\mathrm{supp}(\theta_0)| \vee |\mathrm{supp}(\theta_*)| \in \mathbb{N}$.
For Algorithm 2, if we adequately choose $\eta = \Theta(\frac{1}{L_s})$ and $\check\alpha = \Theta(\frac{1}{\kappa_s})$, then for any $\delta \in (0,1)$ and $\Delta > 0$ there exists $B_t = \tilde{O}\big(\frac{R_\infty^4}{\mu_s^2} T s^2 \vee \frac{R_\infty^2}{L_s}\frac{\sigma^2 T s}{(1-\check\alpha)^T \Delta}\big)$ $(t = 1, \ldots, \infty)$ such that

$$P\big(L(\theta_T) - L(\theta_*) \le \tfrac{1}{1-\check\alpha}(L(\theta_0) - L(\theta_*)) + \Delta\big) \ge 1 - \delta \quad \text{(generally)},$$
$$P\big(L(\theta_T) - L(\theta_*) \le (1-\check\alpha)^T (L(\theta_0) - L(\theta_*) + \Delta)\big) \ge 1 - \delta \quad \text{(if } \mathrm{supp}(\theta_*) \subset \mathrm{supp}(\theta_0)\text{)}.$$

The proof of Theorem 4.3 is found in Section B of the supplementary material.

4.3 Analysis of Algorithm 3

Combining Theorem 4.1 and Theorem 4.3, we obtain the following theorem and corollary. They imply that, with adequate numbers of inner loops $\{T^-_k\}$, $\{T_k\}$ and mini-batch sizes $\{B^-_{t,k}\}$, $\{B_{t,k}\}$ for Algorithm 1 and Algorithm 2 respectively, Algorithm 3 is guaranteed to achieve at least the same sample complexity as Algorithm 1. Furthermore, if the smallest magnitude of the non-zero components of the optimal solution is not too small, its sample complexity can be much reduced.
Theorem 4.4 (Hybrid).
We denote $r_{\min} = \min_{j \in \mathrm{supp}(\theta_*)} |\theta_*|_j|$ and $B_k(T, s, \check\alpha) = \kappa_s^2\big(T s^2 \vee \frac{\sigma^2 T s}{\check\alpha(1-\check\alpha)^k(L(\tilde\theta_0) - L(\theta_*))}\big)$. Let $K \in \mathbb{N}$ and $\tilde\theta_0 \in \mathbb{R}^d$. If we adequately choose $s = O(\kappa_s^2 s_*)$, $\eta = O(\frac{1}{L_s})$ and $\check\alpha = \Theta(\frac{1}{\kappa_s})$, then for any $s'(> s) \in [d]$ and $\delta \in (0, \frac{1}{3})$, Algorithm 3 with $T^-_k = 3$, adequate $T_k = \tilde{T} = \big\lceil \frac{1}{\log((1-\check\alpha)^{-1})}\log\big(\frac{4(L(\tilde\theta_0)-L(\theta_*))}{r_{\min}^2 \mu_s}\big)\big\rceil$, $B^-_{t,k} = \tilde{O}(B_k(T^-_k, s, \check\alpha))$ and $B_{t,k} = \tilde{O}(B_k(T_k, s, \check\alpha))$ satisfies

$$P\big(L(\tilde\theta_K) - L(\theta_*) \le 2(1-\check\alpha)^K(L(\tilde\theta_0) - L(\theta_*))\big) \ge 1 - \delta \quad \text{(generally)},$$
$$P\big(L(\tilde\theta_K) - L(\theta_*) \le 2(1-\check\alpha)^{K+\tilde{T}}(L(\tilde\theta_0) - L(\theta_*))\big) \ge 1 - 2\delta \quad \text{(if } K \ge \check{k} + 1\text{)},$$

where $\check{k} = \big\lceil \frac{1}{\log((1-\check\alpha)^{-1})}\log\big(\frac{4(L(\tilde\theta_0)-L(\theta_*))}{r_{\min}^2 \mu_s}\big)\big\rceil$.

The proof of Theorem 4.4 is found in Section C.1 of the supplementary material.
Corollary 4.5 (Hybrid).
Under the settings of Theorem 4.4, the necessary number of observed samples to achieve $P(L(\tilde\theta_K) - L(\theta_*) \le \varepsilon) \ge 1 - \delta$ for Algorithm 3 is

$$\tilde{O}\Big( \frac{\kappa_s^3 R_\infty^4}{\mu_s^2} s^2 + \frac{\kappa_s R_\infty^4}{\mu_s^2} \frac{d s^2}{s'-s} + \frac{\kappa_s R_\infty^2}{\mu_s} \Big( \frac{\kappa_s^2 s}{\mu_s r_{\min}^2} \wedge \frac{d s}{s'-s} \Big) \frac{\sigma^2}{\varepsilon} \Big).$$

The proof of Corollary 4.5 is given in Section C.2 of the supplementary material.
Remark. From Corollary 4.5, if $\kappa_s^2/(\mu_s r_{\min}^2) \ll d/(s'-s)$, the sample complexity of Hybrid can be much better than that of Exploration alone. In particular, if we assume that $\kappa_s$, $R_\infty/\mu_s$ and $\mu_s r_{\min}^2$ are $\Theta(1)$ and $s'-s = \Theta(s)$, Algorithm 3 achieves a sample complexity of $\tilde{O}(d s_* + s_* \sigma^2/\varepsilon)$, which is asymptotically near the minimax optimal sample complexity of full information algorithms even in partial information settings. In this case, the complexity is significantly smaller than the $\tilde{O}(d s_* + d \sigma^2/\varepsilon)$ of Algorithm 1.

5 Relation to Existing Work

In this section, we describe the relation between our methods and the most relevant existing methods.
The methods of [4] and [7] solve stochastic linear regression with limited attribute observation, but the limited information setting is assumed only at training time and not at prediction time, which is different from ours. Also, their theoretical sample complexities are $O(1/\varepsilon^2)$, which is worse than ours. The method of [10] solves sparse linear regression with limited information based on the Dantzig Selector. It has been shown that the method achieves sub-linear regret in both agnostic (online) and non-agnostic (stochastic) settings under an online variant of the restricted isometry condition.
The convergence rate in the non-agnostic case is much worse than ours in terms of the dependency on the problem dimension $d$, but the method is highly versatile, since it also has theoretical guarantees in agnostic settings, which are not the focus of our work. The methods of [8] are based on regularized dual averaging with their exploration-exploitation strategies and achieve a sample complexity of $O(1/\varepsilon^2)$ under linear independence of features or compatibility, which is worse than our $\tilde{O}(1/\varepsilon)$. Also, the rate of Algorithm 1 in [8] has worse dependency on the dimension $d$ than ours. Additionally, the theoretical analysis of that method assumes linear independence of features, which is much stronger than the restricted isometry condition or our restricted smoothness and strong convexity conditions. The rates of Algorithms 2 and 3 in [8] have an additional term with quite severe dependency on $d$, though it is independent of $\varepsilon$. Their exploration-exploitation idea is different from ours. Roughly speaking, these methods observe the $s_*$ attributes corresponding to the coordinates of the updated solution with large magnitude, and $s' - s_*$ attributes uniformly at random. This means that exploration and exploitation are combined within single updates. In contrast, our proposed Hybrid updates a predictor by alternating between Exploration and Exploitation. This is a big difference: if their scheme is adopted, the variance of the gradient estimator on the coordinates of large magnitude becomes small, but this variance reduction effect is buried in the large noise from the other coordinates, which makes efficient exploitation impossible.
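The alternating structure that distinguishes Hybrid can be sketched abstractly on a toy full-information quadratic objective (this is a schematic of Algorithm 3's control flow only, not of its sampling scheme; all names and parameters are ours):

```python
import numpy as np

def hard_threshold(theta, s):
    """Keep the s largest-magnitude entries, zero the rest."""
    out = np.zeros_like(theta)
    keep = np.argsort(np.abs(theta))[-s:]
    out[keep] = theta[keep]
    return out

def hybrid(theta, grad, eta, s, n_outer, n_explore, n_exploit):
    """Alternate exploration (hard-thresholded global steps) with
    exploitation (plain gradient steps restricted to the current support)."""
    for _ in range(n_outer):
        for _ in range(n_explore):                  # exploration phase
            theta = hard_threshold(theta - eta * grad(theta), s)
        support = np.flatnonzero(theta)
        for _ in range(n_exploit):                  # exploitation phase
            g = grad(theta)
            masked = np.zeros_like(g)
            masked[support] = g[support]            # ignore off-support coords
            theta = theta - eta * masked
    return theta

# Toy objective L(theta) = ||theta - theta_star||^2 with a 2-sparse optimum.
theta_star = np.array([1.0, 0.0, 0.0, -2.0])
grad = lambda th: 2.0 * (th - theta_star)
out = hybrid(np.zeros(4), grad, eta=0.25, s=2, n_outer=3, n_explore=2, n_exploit=5)
print(np.allclose(out, theta_star))  # True
```

The key design point is visible in the two inner loops: exploitation steps never touch coordinates outside the support identified by the preceding exploration phase, so their noise (in the stochastic setting) is confined to those coordinates.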
In [9] and [12], (stochastic) gradient iterative hard thresholding methods for solving empirical risk minimization with sparsity constraints in full information settings have been proposed. Our Exploration algorithm can be regarded as a generalization of these methods to limited information settings.

6 Numerical Experiments

In this section, we provide numerical experiments demonstrating the performance of the proposed algorithms on synthetic data and real data.

We compare our proposed Exploration and Hybrid with the state-of-the-art Dantzig [10] and RDA (Algorithm 1³ in [8]) in our limited attribute observation setting on a synthetic and a real dataset. We randomly split each dataset into a training (90%) and a test (10%) set, trained each algorithm on the training set, and evaluated the mean squared error on the test set. We independently repeated the experiments 5 times and averaged the mean squared error. For each algorithm, we appropriately tuned the hyper-parameters and selected the ones with the lowest mean squared error.

Synthetic dataset  Here we compare the performances on synthetic data. We generated n = 10^5 samples with dimension d = 500. Each feature was generated from an i.i.d. standard normal distribution. The optimal predictor was constructed as follows: θ*|_j = 1 for j ∈ [13], θ*|_j = −1 for j ∈ [14, 25] and θ*|_j = 0 for the other j. The optimal predictor has only 25 non-zero components and thus s* = 25. The output was generated as y = θ*ᵀx + ξ, where ξ was generated from an i.i.d. standard normal distribution. We set the number of observed attributes per example s′ to 50. Figure 1 shows the averaged mean squared error as a function of the number of observed samples. The error bars depict two standard deviations of the measurements. Our proposed Hybrid and Exploration outperformed the other two methods. RDA initially performed well, but its convergence slowed down.
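The synthetic setup above can be reproduced with a few lines; the following sketch uses 0-based indexing for the coordinates and, as an assumption for brevity, a reduced n = 10^4 instead of the paper's 10^5:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 10_000, 500               # the paper uses n = 10^5; reduced here for speed
theta_star = np.zeros(d)
theta_star[:13] = 1.0            # theta*|_j = 1 for j in [13] (1-based in the paper)
theta_star[13:25] = -1.0         # theta*|_j = -1 for j in [14, 25]
X = rng.standard_normal((n, d))              # i.i.d. standard normal features
y = X @ theta_star + rng.standard_normal(n)  # y = theta*^T x + xi
```

The resulting θ* has exactly 25 non-zero components (s* = 25), matching the experimental description.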
Dantzig showed worse performance than all the other methods. Hybrid performed better than Exploration and showed rapid convergence.

³In [8], three algorithms have been proposed (Algorithms 1, 2 and 3). We did not implement the latter two because their theoretical sample complexity makes no sense unless d/s′ is quite small, due to the additional term d¹⁶/s′¹⁶ in it.

Real dataset  Finally, we show the experimental results on the real dataset CT-slice⁴. The CT-slice dataset consists of n = 53,500 CT images with d = 383 features. The target variable of each image denotes the relative location of the image on the axial axis. We set the number of observable attributes per example s′ to 20. In Figure 2, the mean squared error is depicted against the number of observed examples. The error bars show two standard deviations of the measurements. Again, our proposed methods surpass the performances of the existing methods. In particular, the convergence of Hybrid was significantly fast and stable. On this dataset, Dantzig showed nice convergence and was comparable to our Exploration. The convergence of RDA was quite slow and a bit unstable.

Figure 1: Comparison on synthetic data.

Figure 2: Comparison on CT-slice data.

7 Conclusion

We presented sample efficient algorithms for the stochastic sparse linear regression problem with limited attribute observation. We developed the Exploration algorithm based on an efficient construction of an unbiased gradient estimator, taking advantage of the iterative usage of hard thresholding in the updates of predictors. We then refined Exploration by adaptively combining it with Exploitation and proposed the Hybrid algorithm. We have shown that Exploration and Hybrid achieve a sample complexity of Õ(1/ε) with much better dependency on the problem dimension than the ones in existing work.
Particularly, if the smallest magnitude of the non-zero components of the optimal solution is not too small, the rate of Hybrid can be boosted to near the minimax optimal sample complexity of full information algorithms. In numerical experiments, our methods showed superior convergence behavior compared to preceding methods on synthetic and real datasets.

Acknowledgement

TS was partially supported by MEXT Kakenhi (25730013, 25120012, 26280009, 15H05707 and 18H03201), Japan Digital Design, and JST-CREST.

References

[1] S. Ben-David and E. Dichterman. Learning with restricted focus of attention. In Proceedings of the Sixth Annual Conference on Computational Learning Theory, pages 287–296. ACM, 1993.

[2] P. Bühlmann and S. Van De Geer. Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer Science & Business Media, 2011.

[3] E. Candes, T. Tao, et al. The Dantzig selector: Statistical estimation when p is much larger than n. The Annals of Statistics, 35(6):2313–2351, 2007.

[4] N. Cesa-Bianchi, S. Shalev-Shwartz, and O. Shamir. Efficient learning with partially observed attributes. Journal of Machine Learning Research, 12(Oct):2857–2878, 2011.

⁴This dataset is publicly available at https://archive.ics.uci.edu/ml/datasets/Relative+location+of+CT+slices+on+axial+axis.

[5] J. Duchi and Y. Singer. Efficient online and batch learning using forward backward splitting. Journal of Machine Learning Research, 10(Dec):2899–2934, 2009.

[6] D. Foster, S. Kale, and H. Karloff. Online sparse linear regression. In Conference on Learning Theory, pages 960–970, 2016.

[7] E. Hazan and T. Koren. Linear regression with limited observation. arXiv preprint arXiv:1206.4678, 2012.

[8] S. Ito, D. Hatano, H. Sumita, A. Yabe, T. Fukunaga, N. Kakimura, and K.-I.
Kawarabayashi. Efficient sublinear-regret algorithms for online sparse linear regression with limited observation. In Advances in Neural Information Processing Systems, pages 4102–4111, 2017.

[9] P. Jain, A. Tewari, and P. Kar. On iterative hard thresholding methods for high-dimensional M-estimation. In Advances in Neural Information Processing Systems, pages 685–693, 2014.

[10] S. Kale, Z. Karnin, T. Liang, and D. Pál. Adaptive feature selection: Computationally efficient online sparse linear regression under RIP. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 1780–1788, 2017.

[11] J. Kivinen and M. K. Warmuth. Exponentiated gradient versus gradient descent for linear predictors. Information and Computation, 132(1):1–63, 1997.

[12] N. Nguyen, D. Needell, and T. Woolf. Linear convergence of stochastic iterative greedy algorithms with sparse constraints. IEEE Transactions on Information Theory, 63(11):6869–6895, 2017.

[13] G. Raskutti, M. J. Wainwright, and B. Yu. Minimax rates of estimation for high-dimensional linear regression over ℓ_q-balls. IEEE Transactions on Information Theory, 57(10):6976–6994, 2011.

[14] S. Shalev-Shwartz, Y. Singer, N. Srebro, and A. Cotter. Pegasos: Primal estimated sub-gradient solver for SVM. Mathematical Programming, 127(1):3–30, 2011.

[15] L. Xiao. Dual averaging methods for regularized stochastic learning and online optimization. Journal of Machine Learning Research, 11(Oct):2543–2596, 2010.

[16] M. Zinkevich. Online convex programming and generalized infinitesimal gradient ascent.
In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pages 928–936, 2003.