{"title": "Regret Bounds for Online Portfolio Selection with a Cardinality Constraint", "book": "Advances in Neural Information Processing Systems", "page_first": 10588, "page_last": 10597, "abstract": "Online portfolio selection is a sequential decision-making problem in which a learner repetitively selects a portfolio over a set of assets, aiming to maximize long-term return. In this paper, we study the problem with the cardinality constraint that the number of assets in a portfolio is restricted to be at most k, and consider two scenarios: (i) in the full-feedback setting, the learner can observe price relatives (rates of return to cost) for all assets, and (ii) in the bandit-feedback setting, the learner can observe price relatives only for invested assets. We propose efficient algorithms for these scenarios that achieve sublinear regrets. We also provide regret (statistical) lower bounds for both scenarios which nearly match the upper bounds when k is a constant. In addition, we give a computational lower bound which implies that no algorithm maintains both computational efficiency, as well as a small regret upper bound.", "full_text": "Regret Bounds for Online Portfolio Selection\n\nwith a Cardinality Constraint\n\nShinji Ito\n\nNEC Corporation\n\nDaisuke Hatano\n\nRIKEN AIP\n\nHanna Sumita\n\nTokyo Metropolitan University\n\nAkihiro Yabe\n\nNEC Corporation\n\nTakuro Fukunaga\n\nRIKEN AIP, JST PRESTO\n\nNaonori Kakimura\n\nKeio University\n\nKen-ichi Kawarabayashi\n\nNational Institute of Informatics\n\nAbstract\n\nOnline portfolio selection is a sequential decision-making problem in which a\nlearner repetitively selects a portfolio over a set of assets, aiming to maximize\nlong-term return. 
In this paper, we study the problem with the cardinality constraint\nthat the number of assets in a portfolio is restricted to be at most k, and consider\ntwo scenarios: (i) in the full-feedback setting, the learner can observe price relatives\n(rates of return to cost) for all assets, and (ii) in the bandit-feedback setting, the\nlearner can observe price relatives only for invested assets. We propose ef\ufb01cient\nalgorithms for these scenarios, which achieve sublinear regrets. We also provide\nregret (statistical) lower bounds for both scenarios which nearly match the upper\nbounds when k is a constant. In addition, we give a computational lower bound,\nwhich implies that no algorithm maintains both computational ef\ufb01ciency, as well\nas a small regret upper bound.\n\n1\n\nIntroduction\n\nOnline portfolio selection [10, 22] is a fundamental problem in \ufb01nancial engineering, in which a\nlearner sequentially selects a portfolio over a set of assets, aiming to maximize cumulative wealth. For\nthis problem, principled algorithms (e.g., the universal portfolio algorithm [10]) have been proposed,\nwhich behave as if one knew the empirical distribution of future market performance. On the other\nhand, these algorithms work only under the strong assumption that we can hold portfolios of arbitrary\ncombinations of assets, and that we can observe price relatives, the multiplicative factors by which\nprices change, for all assets. Due to these limitations, this framework does not directly apply to such\nreal-world applications as investment in advertising or R&D, where the available combination of\nassets is restricted and/or price relatives (return on investment) are revealed only for assets that have\nbeen invested in.\nIn order to overcome such issues, we consider the following problem setting: Suppose that there are\nT rounds and a market has d assets, represented by [d] := {1, . . . , d}. 
In each round t, we design a portfolio that represents the proportion of the current wealth invested in each of the d assets. That is, a portfolio can be expressed as a vector xt = [xt1, . . . , xtd]\u22a4 such that xti \u2265 0 for all i \u2208 [d] and \u2211_{i=1}^d xti \u2264 1. The combination of assets is restricted by a set of available combinations S \u2286 2^[d]; that is, a portfolio xt must satisfy supp(xt) = {i \u2208 [d] | xti \u2260 0} \u2208 S. Thus, in each period t, we choose St from S and determine a portfolio xt only from assets in St. A typical example of S is given by cardinality constraints, i.e., Sk := {S \u2286 [d] | |S| = k} for some k \u2264 d. We denote by rt = [rt1, . . . , rtd]\u22a4 a price relative vector, where 1 + rti is the price relative for the i-th asset in the t-th period. Then the wealth AT resulting from the sequentially rebalanced portfolios x1, . . . , xT is given by AT = \u220f_{t=1}^T (1 + rt\u22a4 xt). The best constant portfolio strategy earns the wealth A\u2217T := max_x \u220f_{t=1}^T (1 + rt\u22a4 x) subject to the constraint that x is a portfolio satisfying supp(x) \u2208 S. The performance of our portfolio selection is measured by RT = log A\u2217T \u2212 log AT, which we call the regret. The reason that we use log AT rather than AT comes from capital growth theory [16, 21].1\nIn terms of the observable information, we consider two different settings: (i) in the full-feedback setting, we can observe all the price relatives rti for i = 1, . . . , d, and (ii) in the bandit-feedback setting, we can observe the price relatives rti only for i \u2208 St. Note that in each round t a portfolio xt has to be determined before knowing rti in either setting. Note also that we do not make any statistical assumption about the behavior of rti, but we assume that rti is bounded in a closed interval [C1, C2], where C1 and C2 are constants satisfying \u22121 < C1 \u2264 C2.\nOur problem is a generalization of the standard online portfolio selection problem. In fact, if portfolios combining all assets are available, i.e., if S = 2^[d], then our problem coincides with the standard online portfolio selection problem. For this special case, it has been shown that some online convex optimization (OCO) methods [18, 17, 27] (e.g., the online Newton step method) achieve regret of O(d log T), and that any algorithm will suffer regret of \u2126(d log T) in the worst case [26].\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\nTable 1: Regret bounds for the full-feedback setting.\nConstraints | Upper bound by Algorithm 1 | Lower bound\nSingle asset (S = S1) | RT = O(\u221a(T log d)) | RT = \u2126(\u221a(T log d))\nCombination (S = Sk) | RT = O(\u221a(T k log(d/k))) (run in T (d choose k) poly(k)-time) | RT = \u2126(\u221a(T log(d/k))) for d \u2265 17k, and no poly(d, k, T)-time algorithm achieves RT \u2264 T^{1\u2212\u03b4} poly(d, k)\n\nTable 2: Regret bounds for the bandit-feedback setting.\nConstraint | Upper bound by Algorithm 2 | Lower bound\nSingle asset (S = S1) | RT = O(\u221a(dT log T)) | RT = \u2126(\u221a(dT))\nCombination (S = Sk) | RT = O(\u221a(T k (d choose k) log T)) (run in T poly(d, k)-time) | RT = \u2126(\u221a(T (d/(Ck^3))^k)) for d > Ck^3, and no poly(d, k, T)-time algorithm achieves RT \u2264 T^{1\u2212\u03b4} poly(d, k)\n\nOur contribution is twofold: algorithms with sublinear regret upper bounds, and analyses proving regret lower bounds. 
First, we propose the following two algorithms:\n\n\u2022 Algorithm 1 for the full-feedback setting, achieving regret of O((cid:112)T log |S|).\n\u2022 Algorithm 2 for the bandit-feedback setting, achieving regret of O((cid:112)T k|S| log T ), where\n\nk denotes the largest cardinality among elements in S, i.e., k = maxS\u2208S |S|.\n\nTables 1 and 2 summarize the regret bounds for the special case in which the cardinality of assets\nis restricted to be at most 1 or at most k. As shown in Table 1, Algorithm 1 can achieve regret of\n\nO(\u221aT poly(d)) even if k = \u2126(d) when S has an exponentially large size with respect to d. In such a\n\ncase, however, Algorithm 1 requires exponentially large computational time. For the bandit-feedback\nsetting, the regret upper bound can be exponential w.r.t. d if k = \u2126(d), but it is still sublinear in\nT . One main idea behind our algorithms is to combine the multiplicative weight update method\n(MWU) [3, 14] (in the full-feedback setting) / multi-armed bandit algorithms (MAB) [5, 6] (in the\nbandit-feedback setting) with OCO. Speci\ufb01cally, for choosing the combination St of assets, we\nemploy MWU/MAB, which are online decision making methods over a \ufb01nite set of actions. For\nmaintaining the proportion xt of portfolios, we use OCO, that is, online decision making methods for\nconvex objectives over a convex set of actions.\nSecond, we show regret lower bounds for both the full-feedback setting and the bandit-feedback\nsetting where S = Sk, which give insight into the tightness of regret upper bounds achieved with our\nalgorithms. As shown in Table 1, the proven lower bounds for the full-feedback setting are tight up to\nthe O(\u221ak) term. For the bandit-feedback setting, the lower bounds are also tight up to the O(\u221alog T )\nterm, if k = O(1). 
Note that if k = d, then the problem coincides with the standard online portfolio selection problem, and hence there exist algorithms achieving RT = O(\u221a(T log d)). This implies that the assumption of d = \u2126(k) is essential for proving the lower bounds of \u2126(\u221aT). We also note that these statistical lower bounds are valid for arbitrary learners, including exponential-time algorithms. Besides the statistical bounds, we also show computational lower bounds suggesting that there is no polynomial-time algorithm achieving a regret bound with a sublinear term in T and a polynomial term w.r.t. d and k, unless NP \u2286 BPP. This means that we cannot improve the computational efficiency of Algorithm 1 to O(poly(d, k, T))-time while preserving its regret upper bound.\nTo prove the regret lower bounds, we use three different techniques: for the statistical lower bound in the full-feedback setting, we consider a completely random market and evaluate how well the \u201cbest\u201d strategy performs after observing the market behavior, in a manner similar to that used for the lower bound for MWU [3]; for the bandit-feedback setting, we construct a \u201cgood\u201d combination S\u2217 \u2208 S of assets so that it is hard to distinguish from the others, and we bound the number of times this \u201cgood\u201d combination is chosen via a technique similar to that used in the proof of the regret lower bound for MAB [5]; to prove the computational lower bound, we reduce the 3-dimensional matching problem (3DM), one of Karp\u2019s 21 NP-complete problems [20], to our problem.\n\n1 For more details, see Appendix A in the supplementary material.\n\n2 Related work\n\nOnline portfolio selection has been studied in many research areas, including finance, statistics, machine learning, and optimization [1, 10, 19, 22, 23], since Cover [10] formulated the problem setting and proposed the universal portfolio algorithm, which achieves regret of O(d log T) with exponential computation cost. This regret upper bound was shown to be optimal by Ordentlich and Cover [26]. The computation cost was reduced by the celebrated work on the online gradient method of Zinkevich [29] for solving online convex optimization (OCO) [17, 27], a general framework that includes online portfolio selection, but the regret bound is O(d\u221aT), which is suboptimal for online portfolio selection. A breakthrough w.r.t. this suboptimality came with the online Newton step and the follow-the-approximate-leader (FTAL) method of Hazan et al. [18], which are computationally efficient and achieve regret of O(d log T) for a special case of OCO that includes online portfolio selection. Among studies on online portfolio selection, the work by Das et al. [12] has a motivation similar to ours: the aim of selecting portfolios with a group-sparse structure. However, their problem setting differs from ours in that they did not impose sparsity constraints but rather defined a regret containing a regularizer that induces group sparsity, and in that they assumed that a learner can observe the price relatives of all assets after determining portfolios. In contrast, our work deals with a sparsity constraint on portfolios, and our methods work even in the bandit-feedback setting, in which feedback is observed only for assets that have been invested in.\nAnother closely related topic is the multi-armed bandit problem (MAB) [4, 5, 6]. For nonstochastic MAB problems, a nearly optimal regret bound is achieved by the Exp3 algorithm [5], on which our algorithm strongly relies. For combinatorial bandit problems [7, 8, 9], in which each arm corresponds to a subset, the work by Chen et al. [8] gives solutions to a wide range of problems. 
However, this\nwork does not directly apply to our setting, because we need to maintain not only subsets St but also\ncontinuous variables xt, and both of them affect regret.\n\n3 Upper bounds\n\n3.1 Notation and preliminary consideration\n\n(cid:110)\nx | xi \u2265 0 (i \u2208 [d]),(cid:80)d\n\nLet us introduce some notations. For S \u2286 [d], denote by \u2206S the set of portfolios whose supports are\n. Let (S\u2217, x\u2217) denote\nincluded in S, i.e., \u2206S =\nthe optimal \ufb01xed strategy for T rounds, i.e., (S\u2217, x\u2217) \u2208 arg max\nt x). Let xt denote\nS\u2208S,x\u2208\u2206S\nthe output of an algorithm for the t-th round. Then the regret RT of the algorithm can be expressed as\nT(cid:88)\n\ni=1 xi \u2264 1, supp(x) \u2286 S\n\nT(cid:88)\n\nT(cid:88)\n\n(cid:111)\n(cid:80)T\nt=1 log(1+r(cid:62)\n\nRT = max\n\nS\u2208S,x\u2208\u2206S\n\nt=1\n\nlog(1 + r(cid:62)\n\nt x) \u2212\n\nlog(1 + r(cid:62)\n\nt x\u2217) \u2212\n\nlog(1 + r(cid:62)\n\nt xt).\n\nlog(1 + r(cid:62)\n\nt xt) =\n\nt=1\n\nt=1\n\n3\n\nT(cid:88)\n\nt=1\n\n\fAlgorithm 1 An algorithm for the full-feedback setting.\nInput: The number T of rounds. The number d of assets. The set of available subsets S \u2286 2[d].\n1: Set w1 = (wS\n2: for t = 1, . . . , T do\n3:\n\n1 = 0, respectively, for S \u2208 S.\n\n1 )S\u2208S \u2208 RS and (xS\n\nParameters \u03b7 > 0 and \u03b2 > 0.\n\n1 = 1 and xS\n\n1 )S\u2208S by wS\n\nt , i.e., choose S with\n\nSet St by randomly choosing S \u2208 S with a probability proportional to wS\nprobability wS\nt /(cid:107)wt(cid:107)1.\nOutput St and xt = xSt\nUpdate wt; set wt+1 by wS\nUpdate xS\n\nt and observe rti for all i \u2208 [d].\n\nt xS\nt+1 by equation (3) for S \u2208 S.\n\nt )\u03b7 for S \u2208 S.\n\nt (1 + r(cid:62)\n\nt+1 = wS\n\nt ; set xS\n\n4:\n5:\n6:\n7: end for\n\nThe algorithms presented in this section maintain vectors xS\nof the t-th round. 
They then choose St from S, and output (St, xSt\n(S (cid:54)= St) do not appear in the output, they are used to compute outputs in subsequent rounds.\nIn the computation of xS\n\nt \u2208 \u2206S for all S \u2208 S at the beginning\nt ). Although other vectors xS\nt\n\nt+1, we refer to the following vectors gt and matrices H S\nt :\n\nt have the following property which plays an important role in our analysis:\n\ngS\nt =\nt1, . . . , r(cid:48)\n\nrt|S\n, H S\nt =\n1 + r(cid:62)\ntd](cid:62) is de\ufb01ned by r(cid:48)\n\nt xS\nt\n\n(1 + C1)2\n(1 + C2)2 gS\n\nt gS(cid:62)\nti = rti for i \u2208 S and r(cid:48)\n\nwhere rt|S = [r(cid:48)\nand H S\nLemma 1. For any x \u2208 \u2206S, it holds that\nt x) \u2212 log(1 + r(cid:62)\nt xS\n\nlog(1 + r(cid:62)\n\nt ) \u2212\nFor the proof, see Appendix B in the supplementary material.\n\n(x \u2212 xS\n\nt ) \u2264 gS(cid:62)\n\nt\n\nt = C3gS\n\nt gS(cid:62)\n\nt\n\n,\n\n(1)\n\nti = 0 for i \u2208 [d] \\ S. These gS\n\nt\n\n1\n2\n\n(x \u2212 xS\n\nt )(cid:62)H S\n\nt (x \u2212 xS\nt ).\n\n(2)\n\n3.2 Algorithm for the full-feedback setting\n\nWe propose an algorithm for the full-feedback setting, created by combining the multiplicative\nweight update method (MWU) [3] and the follow-the-approximate-leader method (FTAL) [18]. More\nspeci\ufb01cally, our proposed algorithm updates the probability of choosing a subset S \u2208 S by MWU\nt by FTAL. The entire algorithm is summarized in Algorithm 1.\nand updates the portfolio vector xS\nOur algorithm maintains weight wS\nt for each subset S \u2208 S at the\nt \u2265 0 and a portfolio vector xS\nbegining of the t-th round, where wS\n1 = 0 for all S \u2208 S.\n1 = 1 and xS\n1 are initialized by wS\n1 and xS\nIn each round t, a subset St is chosen with a probability proportional to wS\nt . 
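As a concrete illustration, the subset-selection layer of Algorithm 1 (sampling St with probability proportional to wS_t, then applying the multiplicative weight update) can be sketched as follows. This is our own minimal NumPy sketch, not the authors' code; the function name and data layout are ours, and the FTAL update of the portfolio vectors is omitted:

```python
import numpy as np

rng = np.random.default_rng(0)

def mwu_step(weights, portfolios, r_t, eta):
    """One round of the subset-selection layer of Algorithm 1 (sketch).

    weights[S] plays the role of w^S_t, portfolios[S] of x^S_t, r_t is the
    price-relative vector, and eta > 0 is the learning rate.  The FTAL
    update of the portfolio vectors themselves is not shown here.
    """
    subsets = list(weights)
    p = np.array([weights[S] for S in subsets], dtype=float)
    p /= p.sum()                    # choose S with probability w^S_t / ||w_t||_1
    chosen = subsets[rng.choice(len(subsets), p=p)]
    # Multiplicative weight update: w^S_{t+1} = w^S_t * (1 + r_t^T x^S_t)^eta.
    new_weights = {S: weights[S] * (1.0 + float(r_t @ portfolios[S])) ** eta
                   for S in subsets}
    return chosen, new_weights
```

In the full-feedback setting every weight can be updated in this way, since the entire vector r_t is observed.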
Given the feedback\nt by multiplying\nrt, the algorithm computes wS\n(1 + r(cid:62)\nt+1 is computed\nby FTAL as follows:\n\nt+1 is obtained from wS\nt )\u03b7, where \u03b7 > 0 is a parameter we optimize later. The portfolio vector xS\n\nt+1. The weight wS\n\nt+1 and xS\n\nt xS\n\n(cid:19)\n\n\uf8fc\uf8fd\uf8fe ,\n\n(cid:18)\n\n\uf8f1\uf8f2\uf8f3 t(cid:88)\n2 = (cid:80)d\n\nj=1\n\nxS\nt+1 \u2208 arg max\nx\u2208\u2206S\n\ngS(cid:62)\n\nj\n\n(x \u2212 xS\n\nj ) \u2212\n\n1\n2\n\n(x \u2212 xS\n\nj )(cid:62)H S\n\nj (x \u2212 xS\nj )\n\n\u03b2\n2 (cid:107)x(cid:107)2\n\n2\n\n\u2212\n\n(3)\n\ni=1 x2\n\nwhere \u03b2 is a regularization parameter optimized later, and (cid:107) \u00b7 (cid:107) stands for the (cid:96)2 norm:\n(cid:107)[x1, . . . , xd](cid:62)\ni . Since (3) is a convex quadratic programming problem with lin-\n(cid:107)2\near constraints, xS\nt+1 can be computed ef\ufb01ciently by, e.g., interior point methods [24]. Recently, Ye\n(cid:17)\net al. [28] have proposed a more ef\ufb01cient algorithm for solving (3). For the special case of the single\n{i}\nasset selection setting, i.e., if S = S1 = {{i} | i \u2208 [d]}, then x\nt+1 = (0, . . . , 0, xt+1,i, 0, . . . , 0)\nhas a closed-form expression: xt+1,i = \u03c0[0,1]\nand \u03c0[0,1](\u00b7)\nstands for a projection onto [0, 1] de\ufb01ned by \u03c0[0,1](y) = 0 for y < 0, \u03c0[0,1](y) = y for 0 \u2264 y \u2264 1,\nand \u03c0[0,1](y) = 1 for y > 1.\nOur algorithm achieves the regret described below for arbitrary inputs, where constants C3, C4, C5\nare given by C3 = (1+C1)2\n\n(cid:16) (cid:80)t\n(cid:80)t\n\n, and C5 = max{C2\n\n, where gji := rji\n\n(1+C2)2 , C4 = log 1+C2\n\n2}\n1 ,C2\n(1+C1)2\n\n1+rjixji\n\nj=1 g2\nji\n\nj=1 gji\n\n\u03b2+C3\n\n1+C1\n\n.\n\n4\n\n\f1 )S\u2208S \u2208 RS and (xS\n\nParameters \u03b7 > 0, \u03b3 \u2208 (0, 1) and \u03b2 > 0.\n\nAlgorithm 2 An algorithm for the bandit-feedback setting.\nInput: The number T of rounds. The number d of assets. 
The set of available subsets S \u2286 2[d].\n1: Set w1 = (wS\n2: for t = 1, . . . , T do\n3:\n4:\n5:\n\n1 = 0, respectively, for S \u2208 S.\nt = \u03b3|S| + (1 \u2212 \u03b3) wS\nt(cid:107)wt(cid:107)1\n\nSet the probability vector pt = (pS\nRandomly choose St \u2208 S on the basis of the probability vector pt.\nOutput St and xt = xSt\nUpdate wt; set wS\nt ; set xS\nUpdate xS\n\n1 = 1 and xS\nt )S\u2208S \u2208 [0, 1]S by pS\n(cid:16) 1+r(cid:62)\n\n(cid:17)\u03b7/ptit and wS\n\nt+1 by wSt\nt+1 = wtit\nt+1 by equation (7).\n\nt , and observe rti for i \u2208 St.\n\nt for S \u2208 S \\ {St}.\n\n1 )S\u2208S by wS\n\nt+1 = wS\n\nt xt\n1+C1\n\n.\n\n6:\n7:\n8: end for\n\n(cid:19)\nTheorem 2. Algorithm 1 achieves the following regret upper bound if \u03b7 \u2264 1/C4:\n\n+ C 2\n\n1\n2\n\n\u03b2 +\n\n(cid:27)\n\n4 \u03b7T +\n\n(cid:18)\n(cid:113) log |S|\n(cid:16)(cid:112)T log |S| + k log T + log |S|\n(cid:17)\n\nand \u03b2 = 1, we obtain\n\nk\nC3\n\n1 +\n\nlog\n\nT\n\n\u03b2\n\n.\n\nC3C5T\n\n.\n\nE[RT ] \u2264\nIn particular, setting \u03b7 = 1\nC4\n\nlog |S|\n\n\u03b7\n\n(cid:26)\n\nmin\n\n1,\n\nE[RT ] = O\n\n(4)\n\n(5)\n\nRunning time\nIf (3) can be computed in p(k)-time, Algorithm 1 runs in O(|S|p(k))-time per round.\nIf S is an exponentially large set, e.g., if S = {S \u2286 [d] | |S| = k} and k = \u0398(d), the computational\ntime for O(|S|p(k)) will be exponentially large w.r.t. d. This computational complexity is shown\nto be inevitable in Section 4.1. For the special case of the single asset selection setting, i.e., if\nS = S1 = {{i} | i \u2208 [d]}, Algorithm 1 runs in O(d)-time per round since each x\ncan be updated\nin constant time.\n\n{i}\nt\n\n3.3 Algorithm for the bandit-feedback setting\n\nWe construct an algorithm for the bandit-feedback setting by combining the Exp3 algorithm [5]\nfor the multi-armed bandit problem and FTAL. 
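Before the detailed description, the Exp3-style sampling and importance-weighted update that appear in the pseudocode of Algorithm 2 above can be sketched as follows. This is our own illustrative NumPy sketch under the paper's notation (the helper name observed_growth is ours); the FTAL portfolio update is again omitted:

```python
import numpy as np

rng = np.random.default_rng(0)

def exp3_step(weights, observed_growth, gamma, eta, C1=-0.5):
    """Exp3-style step of Algorithm 2 (sketch; names are ours).

    weights: w^S_t over the |S| candidate subsets.
    observed_growth(i): returns 1 + r_t^T x_t for the chosen subset i
        (in the bandit setting, only this value is observed).
    """
    w = np.asarray(weights, dtype=float)
    p = gamma / len(w) + (1.0 - gamma) * w / w.sum()  # p^S_t: uniform/weight mix
    i = int(rng.choice(len(w), p=p))                  # play subset S_t
    gain = observed_growth(i) / (1.0 + C1)            # (1 + r_t^T x_t)/(1 + C1)
    w[i] *= gain ** (eta / p[i])   # importance-weighted exponential update
    return i, w, p
```

Only the weight of the played subset is updated, with the exponent scaled by 1/p^{St}_t, which is what makes the update an unbiased importance-weighted analogue of the full-feedback MWU step.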
Similarly to the process used in Algorithm 1, the\nalgorithm updates the probability of choosing St \u2208 S by the Exp3 algorithm (in place of MWU)\nand updates portfolios xS\nt by FTAL. The main dif\ufb01culty comes from the fact that the learner cannot\nt for all S \u2208 S.\nobserve all the entries of (rti)d\nIn order to deal with this problem, we construct unbiased estimators of gS\nt for each S \u2208 S\nby\n(6)\n\ni=1. Due to this limitation, we cannot always update xS\n\nt and H S\n\n\u02c6H St\n\n,\n\n,\n\n\u02c6gS\nt = 0,\n\n\u02c6gSt\nt =\n\nt =\n\n\u02c6H S\nt = O (S \u2208 S \\ {St}),\n\nH St\nt\npSt\nt\n\ngSt\nt\npSt\nt\n\nj=1\n\n1\n2\n\n(cid:19)\n\n(cid:18)\n\n\u02c6gS(cid:62)\n\nj\n\nj ) \u2212\n\nt+1 = xS\n\nt and \u02c6H S\n\nj )(cid:62) \u02c6H S\n\n(x \u2212 xS\n\n\uf8fc\uf8fd\uf8fe .\n\n\uf8f1\uf8f2\uf8f3 t(cid:88)\n\nxS\nt+1 \u2208 arg max\nx\u2208\u2206S\n\n(x \u2212 xS\nt = 0 and \u02c6H S\n\nt for each S \u2208 S \\{St} since \u02c6gS\n\nwhere pS\nt is the probability of choosing S in round t, which is computed by a procedure similar to that\nused in the Exp3 algorithm. Note that \u02c6gS\nt can be calculated from the observed information\nt+1 by FTAL as follows:\nalone. Using these unbiased estimators, we compute the portfolio vectors xS\n(7)\nj (x \u2212 xS\nj )\nNote that xS\nt = O. Hence the convex quadratic\nprogramming problem (7) is solved only once in each round. The entire algorithm is summarized in\nAlgorithm 2.\nTheorem 3. 
Algorithm 2 achieves the following regret upper bound if \u03b7 \u2264 \u03b3\n(cid:17)\n\n(cid:18)\n(cid:113) log |S|\n(cid:110)\n(cid:111)\n(cid:112)k log |S| log T + |S|k\n\n(cid:27)\n(cid:16)(cid:112)T|S|k log T + |S|\n\n(cid:113) k|S| log(1+T )\n\nand \u03b2 = C3C5, we obtain\n\n4 \u03b7|S| + C4\u03b3)T +\n\nT\nE[RT ] = O\n\nC4|S| :\nC3C5T\n\nSetting \u03b3 = min\n\nE[RT ] \u2264\n\nC4|S| min\n\n\u03b2(cid:107)x(cid:107)2\n\nlog |S|\n\n, \u03b7 = \u03b3\n\nk|S|\nC3\u03b3\n\n+ (C 2\n\n(cid:26)\n\nlog\n\n1 +\n\n(cid:19)\n\n1\n2\n\n\u2212\n\n\u03b2\n\n.\n\nk log(1+T )\n\n\u03b2 +\n\n(8)\n\n1\n2\n\n1,\n\n1,\n\n\u03b7\n\n2\n\n.\n\n5\n\n\ffor some S \u2208 S and computing the pre\ufb01x sum(cid:80)i\n\nRunning time Algorithm 2 runs in O(p(k) + log2(|S|))-time per round, assuming that (7) can\nbe computed in p(k)-time. In fact, from the de\ufb01nition (6) of \u02c6gS\nt given\nby (7) is needed only for S = St. Furthermore, for S = {S1, S2, . . . , S|S|}, both updating wS\nt\nfor some i \u2208 [|S|] can be performed in\nt = \u03b3|S| + wS\nt(cid:107)wt(cid:107)S\n\nO(log |S|)-time by using a Fenwick tree [13]. This implies that sampling St w.r.t. pS\ncan be performed in O(log2 |S|)-time.\n4 Lower bounds\n\nt , the update of xS\n\nt and \u02c6H S\n\nj=1 wSj\n\nt\n\nIn this section, we present lower bounds on regrets achievable by algorithms for the online portfolio\nselection problem. We focus on the case of S = Sk = {S \u2286 [d] | |S| = k} throughout this section.\n4.1 Computational complexity\n\nWe show that, unless the complexity class BPP includes NP, there exists no algorithm for the\nonline problem with a cardinality constraint such that its running time is polynomial both on d and T\nand its regret is bounded by a polynomial in d and sublinear in T . This fact is shown by presenting a\nreduction from the 3-dimensional matching problem (3DM). An instance U of 3DM consists of\n3-tuples (x1, y1, z1), . . . , (xd, yd, zd) \u2208 [k] \u00d7 [k] \u00d7 [k]. 
Two tuples, (xi, yi, zi) and (xj, yj, zj), are\ncalled disjoint if xi (cid:54)= xj, yi (cid:54)= yj, and zi (cid:54)= zj. The task of 3DM is to determine whether or not\nthere exist k pairwise-disjoint tuples; if they do exist, we write U \u2208 3DM.\nFrom a 3DM instance U = {(xj, yj, zj)}d\nj=1, we construct an input sequence (rt)t=1,...,T of\nthe online portfolio selection problem as follows. Let A = (aij) \u2208 {0, 1}3k\u00d7d be a matrix such\nthat aij = 1 if i = xj or i = k + yj or i = 2k + zj, and aij = 0 otherwise. From A, we\nconstruct B \u2208 R3k\u00d7(d+1) by B = 1\n3k [A,\u221213k], where 13k is the all-one vector of dimension 3k.\nLet T \u2265 max{(4 \u00b7 5184k4)2, (5184k4 \u00b7 p2(d)) 1\n\u03b4 } for an arbitrary polynomial p2 and an arbitrary\npositive parameter \u03b4. For each t \u2208 [T ], take zt from the uniform random distribution on {\u22121, 1}3k,\nindependently. Then, rt can be de\ufb01ned by rt = 1d+1 + B(cid:62)zt for each t \u2208 [T ]. Note that\nrt \u2208 [0, 2](d+1) holds for each t \u2208 [T ].\nWe give the sequence (rt)t=1,...,T to an algorithm A. Let (xt)t=1,...,T denote the sequence output\n5184k4 ) holds, while\notherwise we determine U /\u2208 3DM to hold. We can prove that this determination is correct with a\nprobability of at least 2/3. For the proof, see Appendix E in the supplementary material.\nTheorem 4. Let \u03b4 be an arbitrary positive number, and p1 and p2 be arbitrary polynomials. Assume\nthat there exists a p1(d, T )-time algorithm A for the full-feedback online portfolio selection problem\nwith S = Sk+1 that achieves regret RT \u2264 p2(d)T 1\u2212\u03b4 with a probability of at least 2/3. Then, given\na 3DM instance U \u2286 [k] \u00d7 [k] \u00d7 [k], one can decide if U \u2208 3DM with a probability of at least 2/3\nin p1(|U|, max{k8, (k4p2(|U|)) 1\nCorollary 5. 
Under the assumption of NP (cid:54)\u2286 BPP, if an algorithm achieves O(p(d, k)T 1\u2212\u03b4)\nregret for arbitrary d and arbitrary k, the algorithm will not run in polynomial time, i.e., the running\ntime will be larger than any polynomial for some d and some k.\n\nby A. We determine that U \u2208 3DM if(cid:80)T\n\nt xt) \u2265 T (log 2 \u2212 1\n\nt=1 log(1 + r(cid:62)\n\n\u03b4 })-time.\n\nNote that the computational lower bounds described in Theorem 4 and Corollary 5 are also valid for\nthe bandit-feedback setting, since algorithms for the bandit-feedback settings can be used for the\nfull-feedback setting.\n\n4.2 Regret lower bound for the full-feedback setting\nWe show here that, for the full-feedback setting of the online portfolio selection problem with S = Sk,\nin the\nevery algorithm (including exponential-time algorithms) suffers from regret of \u2126\nworst case. We can show this by analyzing the behavior of an algorithm for a certain random input.\nIn the analysis, we use the fact that the following two inequalities hold when rt follows the discrete\nuniform distribution on {0, 1}d independently:\n\n(cid:18)(cid:113)\n\nT log d\nk\n\n(cid:19)\n\n6\n\n\f(cid:34) T(cid:88)\nT(cid:88)\n\nt=1\n\nt=1\n\nE\nrt,xt\n\nlog(1 + r(cid:62)\n\nt xt)\n\nmax\n\nS\u2208Sk,x\u2208\u2206S\n\nlog(1 + r(cid:62)\n\nt x)\n\n(cid:35)\n(cid:35)\n\n\u2264 T E\n\nX\n\n(cid:20)\n\nlog\n\n(cid:20)\n\n(cid:18)\n(cid:18)\n\n1 +\n\n\u2265 T \u00b7 E\n\nX\n\nlog\n\n1 +\n\n1\nk\n\nX\n\n1\nk\n\n(cid:19)(cid:21)\n(cid:19)(cid:21)\n\n,\n\n(cid:32)(cid:114)\n\nX\n\n+ \u2126\n\nT log\n\n(cid:33)\n\n,\n\nd\nk\n\n(cid:34)\n\nE\nrt,xt\n\nwhere X is a binomial random variable following B(k, 1/2). See Appendix F for details regarding\nthe proof.\n(cid:19)\nTheorem 6. Let d \u2265 17k, and consider the online portfolio selection problem with d assets and\nt=1 such\navailable combinations S = Sk. 
There is a probability distribution of input sequences {rt}T\nthat the regret of any algorithm for the full-feedback setting is bounded as E[RT ] = \u2126\n,\nT log d\nk\nwhere the expectation is with respect to the randomness of both r and the algorithm.\n\n(cid:18)(cid:113)\n\n4.3 Regret lower bound for the bandit-feedback setting\n\n(cid:18)(cid:113)\n\n(cid:19)\n\ni\u2208S\n\nzi =\n\n,\n\nzi =\n\ni\u2208S\u2217\n\n(cid:89)\n\nT ( d\n\nCk3 )k\n\n\u22121 w.p. 1/2 + \u0001\n\n(S \u2208 2[d] \\ {\u2205, S\u2217\n\nwhen the input sequence is de\ufb01ned as follows. Let S\u2217\n\n(cid:26) 1 w.p. 1/2\n(cid:89)\n(cid:26) 1 w.p. 1/2 \u2212 \u0001\n\n\u22121 w.p. 1/2\n\nIn this subsection, we consider the bandit-feedback setting of the online portfolio selection problem\nwith S = Sk. We show that every algorithm (including exponential-time algorithms) for this setting\nsuffers from regret of \u2126\n\u2208 Sk.\nWe de\ufb01ne a random distribution DS\u2217 on {\u22121, 1}d so that a random vector z = [z1, . . . , zd](cid:62)\nfollowing this distribution satis\ufb01es\n(cid:26) 1 w.p. 1/2\n\n(cid:26) 1 w.p. 1/2 \u2212 \u0001\n\n\u2208 S\u2217, let zi =\nSuch a distribution can be constructed as follows: \ufb01x an index i\u2217\n\u22121 w.p. 1/2\n(cid:81)\nfor each i \u2208 [d] \\ {i\u2217\nindependently. De\ufb01ne zi\u2217 =\n}, and let z0 =\n\u22121 w.p. 1/2 + \u0001\ni\u2208S\u2217\\{i\u2217} zi. Then z = [z1, . . . , zd](cid:62)\n\u223c DS\u2217. The price relative vector rt in the t-th round can\nbe de\ufb01ned by rt = 1d \u2212 zt, where zt \u223c D\u2217\nS independently for t \u2208 [T ]. We can show that rt|S\nfollows a uniform distribution for any S \u2208 Sk \\ {S\u2217\n} and only rt|S\u2217 follows a slightly different\ndistribution. Because of this, it is dif\ufb01cult for algorithms to distinguish S\u2217 from others, which makes\ntheir regrets large. For more details, see Appendix G.\nTheorem 7. 
Let d \u2265 k \u2212 1, and consider the online portfolio selection problem with d assets and\n(cid:18)\navailable combinations S = Sk. There is a probability distribution of input sequences {rt}T\nt=1\nsuch that the regret of any algorithm for the bandit-feedback setting is bounded as E[RT ] =\n, where the expectation is with respect to the randomness of both r\n\n(cid:27)(cid:19)\n\n(cid:113)\n\nT\n\nmin\n\nk(Ck)k ,\n\n\u2126\nand the algorithm, and C is a constant depending on C1 and C2.\n\nCk3 )k\n\nT ( d\n\n(cid:26)\n\n}).\n\nz0\n\n5 Experimental evaluation\n\nWe show the empirical performance of our algorithms through experiments over synthetic and real-\nworld data. In this section, we consider the online portfolio selection problem with S = S1. A\nproblem instance is parameterized by a tuple (d, T,{rt}T\nt=1). A synthetic instance is generated as\nfollows: given parameters d, T , C1, and C2, we randomly choose an asset i\u2217 from [d], and generate\nrti\u2217 \u223c U ((C2 + C1)/2, C2) and rti \u223c U (C1, C2) for i \u2208 [d] \\ {i\u2217\nWe also conduct our experiments for two real-world instances. The \ufb01rst is based on crypto coin\nhistorical data2, including dates and price data for 19 crypto coins. From this data, we select 7 crypto\ncoins, each having 929 prices, and obtain price relatives rti of coin i at time t by (pti/pt\u22121,i) \u2212 1,\nwhere pti indicates the price of coin i at time t. Thus, d = 7 and T = 928 in this instance. The other\n\n}.\n\n2https://www.kaggle.com/sudalairajkumar/cryptocurrencypricehistory\n\n7\n\n\finstance is based on S&P 500 stock data3, including dates and price data for 505 companies. From\nthis data, we choose d = 470 companies, each having 1259 stock prices, and compute T = 1258\nprice relatives for each company in the same way.\nFor purposes of comparison, we prepare three baseline algorithms: Exp3_cont, Exp3_disc, and\nMWU_disc. 
MWU_disc (based on MWU [3]) works in the full-feedback setting and is compared\nwith Algorithm 1. Exp3_cont and Exp3_disc (based on Exp3 [5]) work in the bandit-feedback setting\nand are compared with Algorithm 2. These baseline algorithms have different ways of updating xS\nt\nt can be expressed as\nfrom those of Algorithms 1 and 2. Note that since S = S1 = {{i} | i \u2208 [d]}, xS\nxS\nt = x\n\n{i}\nt = [0, . . . , 0, xti, 0, . . . , 0](cid:62). Below, we offer a brief explanation of the comparisons.\n\nj=1 rji \u2265 0 and xti = 0 otherwise. For each t \u2208 [T ], select it\ni=1, and output\n\nby MWU, where rewards in the t-th round are given by [log(1 + rtixti)]d\nit, x\n\nMWU_disc Set xti = 1 if(cid:80)t\u22121\nExp3_disc Set xti = 1 if(cid:80)\nj\u2208[t\u22121]:ij =i rji \u2265 0 and xti = 0 otherwise. For each t \u2208 [T ], select it\n{it}\n.\nby Exp3, where reward in the t-th round is given by log(1 + rtitxtit), and output it, x\nt\nExp3_cont Set a parameter B \u2208 N, and consider an MAB problem instance with d(B + 1)\narms in which the rewards for the d(B + 1) arms in the t-th round are given by\n(log(1 + rtib/B))1\u2264i\u2264d,0\u2264b\u2264B. Apply Exp3 to this MAB problem instance.\n\n{it}\nt\n\n.\n\nWe assess the performance of the algorithms on the basis of regrets for synthetic instances and of\ncumulative price relatives for real-world instances, where regrets and cumulative price relatives are\naveraged over 10 executions. 
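As an illustration of how the Exp3_cont baseline discretizes the continuous action space, the arm construction described above can be sketched as follows (a minimal sketch; the function names are ours):

```python
import numpy as np

def discretized_arms(d, B):
    """Arm set used by the Exp3_cont baseline (sketch): each arm is a pair
    (asset i, investment fraction b/B), giving d*(B+1) arms in total."""
    return [(i, b / B) for i in range(d) for b in range(B + 1)]

def arm_reward(r_t, arm):
    """Reward of an arm in round t: log(1 + r_ti * b/B)."""
    i, frac = arm
    return np.log1p(r_t[i] * frac)
```

A standard Exp3 instance is then run over this finite arm set, trading off the discretization parameter B against the number of arms.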
We set the parameter η according to Theorem 2 for Algorithm 1 and MWU_disc, and η and γ according to Theorem 3 for Algorithm 2, Exp3_disc, and Exp3_cont.

Figure 1 shows average regrets for a synthetic instance with (d, T, C1, C2) = (20, 10000, −0.5, 0.5). We observe that both Algorithms 1 and 2 converge faster than MWU_disc, Exp3_cont, and Exp3_disc. In addition, the empirical results are consistent with our theoretical regret bounds.

Figures 2 and 3 show average cumulative price relatives for a real-world instance of S&P 500 stock data with (d, T, C1, C2) = (470, 1258, −0.34, 1.04) and for a real-world instance of crypto coin data with (d, T, C1, C2) = (7, 928, −0.7, 3.76), respectively. From these figures, we observe that the cumulative price relatives of our algorithms are higher than those of the baseline algorithms.

Figure 1: The average regrets over the synthetic dataset with (d, T, C1, C2) = (20, 10000, −0.5, 0.5)

Figure 2: The average cumulative price relatives over the S&P 500 stock dataset

Figure 3: The average cumulative price relatives over the crypto coin historical dataset

Acknowledgement

This work was supported by JST ERATO Grant Number JPMJER1201, Japan, and JSPS KAKENHI Grant Number JP18H05291.

3 https://www.kaggle.com/camnugent/sandp500

References

[1] A. Agarwal, E. Hazan, S. Kale, and R. E. Schapire. Algorithms for portfolio management based on the Newton method. In Proceedings of the 23rd International Conference on Machine Learning (ICML '06), pages 9–16, 2006.

[2] N. Alon and J. H. Spencer. The Probabilistic Method. John Wiley & Sons, 2004.

[3] S. Arora, E. Hazan, and S. Kale.
The multiplicative weights update method: a meta-algorithm and applications. Theory of Computing, 8(1):121–164, 2012.

[4] P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235–256, 2002.

[5] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48–77, 2002.

[6] S. Bubeck, N. Cesa-Bianchi, et al. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends® in Machine Learning, 5(1):1–122, 2012.

[7] N. Cesa-Bianchi and G. Lugosi. Combinatorial bandits. Journal of Computer and System Sciences, 78(5):1404–1422, 2012.

[8] W. Chen, Y. Wang, and Y. Yuan. Combinatorial multi-armed bandit: General framework and applications. In International Conference on Machine Learning, pages 151–159, 2013.

[9] R. Combes, M. S. T. M. Shahi, A. Proutiere, et al. Combinatorial bandits revisited. In Advances in Neural Information Processing Systems, pages 2116–2124, 2015.

[10] T. M. Cover. Universal portfolios. Mathematical Finance, 1(1):1–29, 1991.

[11] T. M. Cover and J. A. Thomas. Elements of Information Theory. John Wiley & Sons, 2012.

[12] P. Das, N. Johnson, and A. Banerjee. Online portfolio selection with group sparsity. In Twenty-Eighth AAAI Conference on Artificial Intelligence, 2014.

[13] P. M. Fenwick. A new data structure for cumulative frequency tables. Software: Practice and Experience, 24(3):327–336, 1994.

[14] Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139, 1997.

[15] J. E. Gentle. Computational Statistics. Springer Science & Business Media, 2009.

[16] N. H. Hakansson and W. T. Ziemba. Capital growth theory.
Handbooks in Operations Research and Management Science, 9:65–86, 1995.

[17] E. Hazan. Introduction to online convex optimization. Foundations and Trends® in Optimization, 2(3-4):157–325, 2016.

[18] E. Hazan, A. Agarwal, and S. Kale. Logarithmic regret algorithms for online convex optimization. Machine Learning, 69(2-3):169–192, 2007.

[19] A. Kalai and S. Vempala. Efficient algorithms for universal portfolios. Journal of Machine Learning Research, 3(Nov):423–440, 2002.

[20] R. M. Karp. Reducibility among combinatorial problems. In Complexity of Computer Computations, pages 85–103. Springer, 1972.

[21] J. Kelly. A new interpretation of information rate. Bell System Technical Journal, 35:917–926, 1956.

[22] B. Li and S. C. Hoi. Online portfolio selection: A survey. ACM Computing Surveys (CSUR), 46(3):35, 2014.

[23] B. Li and S. C. H. Hoi. On-line portfolio selection with moving average reversion. In Proceedings of the 29th International Conference on Machine Learning (ICML-12), pages 273–280, 2012.

[24] M. S. Lobo, L. Vandenberghe, S. Boyd, and H. Lebret. Applications of second-order cone programming. Linear Algebra and Its Applications, 284(1-3):193–228, 1998.

[25] J. Matoušek and J. Vondrák. The probabilistic method. Lecture Notes, Department of Applied Mathematics, Charles University, Prague, 2001.

[26] E. Ordentlich and T. M. Cover. The cost of achieving the best portfolio in hindsight. Mathematics of Operations Research, 23(4):960–982, 1998.

[27] S. Shalev-Shwartz. Online learning and online convex optimization. Foundations and Trends® in Machine Learning, 4(2):107–194, 2012.

[28] Y. Ye, L. Lei, and C. Ju. Hones: A fast and tuning-free homotopy method for online Newton step.
In Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics (AISTATS-18), pages 2008–2017, 2018.

[29] M. Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pages 928–936, 2003.