Learning Eigenvectors for Free

Advances in Neural Information Processing Systems, pages 945-953

Wouter M. Koolen (Royal Holloway and CWI, wouter@cs.rhul.ac.uk)
Wojtek Kotłowski (Centrum Wiskunde & Informatica, kotlowsk@cwi.nl)
Manfred K. Warmuth (UC Santa Cruz, manfred@cse.ucsc.edu)

Abstract

We extend the classical problem of predicting a sequence of outcomes from a finite alphabet to the matrix domain. In this extension, the alphabet of $n$ outcomes is replaced by the set of all dyads, i.e. outer products $uu^\top$ where $u$ is a vector in $\mathbb{R}^n$ of unit length. Whereas in the classical case the goal is to learn (i.e.
sequentially predict as well as) the best multinomial distribution, in the matrix case we desire to learn the density matrix that best explains the observed sequence of dyads. We show how popular online algorithms for learning a multinomial distribution can be extended to learn density matrices. Intuitively, learning the $n^2$ parameters of a density matrix is much harder than learning the $n$ parameters of a multinomial distribution. Completely surprisingly, we prove that the worst-case regrets of certain classical algorithms and their matrix generalizations are identical. The reason is that the worst-case sequences of dyads share a common eigensystem, i.e. the worst-case regret is achieved in the classical case. So these matrix algorithms learn the eigenvectors without any regret.

1 Introduction

We consider the extension of the classical online problem of predicting outcomes from a finite alphabet to the matrix domain. In this extension, the alphabet of $n$ outcomes is replaced by the set of all dyads, i.e. outer products $uu^\top$ where $u$ is a unit vector in $\mathbb{R}^n$. Whereas classically the goal is to learn as well as the best multinomial distribution over outcomes, in the matrix case we desire to learn the distribution over dyads that best explains the sequence of dyads seen so far. A distribution on dyads is summarized as a density matrix, i.e. a symmetric positive-definite¹ matrix of unit trace. Such matrices are heavily used in quantum physics, where dyads represent states. We will show how popular online algorithms for learning multinomials can be extended to learn density matrices.

Considerable attention has been placed recently on generalizing algorithms for learning and optimization problems from probability vector parameters to density matrices [17, 19].
Efficient semidefinite programming algorithms have been devised [1] and better approximation algorithms for NP-hard problems have been obtained [2] by employing online algorithms that update a density matrix parameter. Also, two important quantum complexity classes were shown to collapse based on these algorithms [8]. Even though the matrix generalization led to progress in many contexts, in the original domain of online learning the regret bounds proven for the algorithms in the matrix case are often the same as those provable for the original classical finite-alphabet case [17, 19]. Therefore it was posed as an open problem to determine whether this is just a case of a loose classical bound, or whether there truly exists a "free matrix lunch" for some of these algorithms [18]. Such algorithms would essentially learn the eigensystem of the data for free, without incurring any additional regret. This is non-intuitive, since one would expect a matrix with $n^2$ parameters to be much harder to learn than an $n$-dimensional parameter vector.

¹We use positive in the non-strict sense, and omit 'symmetric' and 'definite'. Our matrices are real-valued.

Figure 1: Protocols

  Probability vector prediction:
    for trial $t = 1, 2, \ldots, T$ do
      Algorithm predicts with probability vector $\omega_{t-1}$.
      Nature responds with outcome $x_t$.
      Algorithm incurs loss $-\log \omega_{t-1,x_t}$.
    end for

  Density matrix prediction:
    for trial $t = 1, 2, \ldots, T$ do
      Algorithm predicts with density matrix $W_{t-1}$.
      Nature responds with density matrix $X_t$.
      Algorithm incurs loss $-\mathrm{tr}\bigl(X_t \log(W_{t-1})\bigr)$.
    end for

In this paper we investigate this frivolously named but deep "free matrix lunch" question in arguably the simplest context: learning a multinomial distribution.
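As a concrete rendering of the left protocol of Figure 1, the loop below simulates classical prediction under log loss and reports the regret against the best multinomial in hindsight (a minimal stdlib sketch; the function names are ours and the uniform predictor is just a placeholder baseline):

```python
import math
from collections import Counter

def classical_protocol(outcomes, n, predictor):
    """Run the probability-vector protocol of Figure 1.

    Returns (total log loss, regret against the best fixed multinomial)."""
    total, history = 0.0, []
    for x in outcomes:                    # x is an outcome in {0, ..., n-1}
        w = predictor(history, n)         # algorithm predicts a probability vector
        total += -math.log(w[x])          # incur log loss on the revealed outcome
        history.append(x)
    # the best fixed multinomial in hindsight is the empirical distribution
    # sigma_T / T, whose total loss is T * H(sigma_T / T)
    T = len(outcomes)
    best = -sum(c * math.log(c / T) for c in Counter(outcomes).values())
    return total, total - best

uniform = lambda history, n: [1.0 / n] * n   # placeholder prediction strategy
loss, regret = classical_protocol([0, 0, 1, 0, 2], 3, uniform)
assert regret > 0
```

Any strategy that maps the observed history to a probability vector can be plugged in as `predictor`; the estimators discussed below all have this form.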
In the classical case, there are $n \ge 2$ outcomes and a distribution is parametrized by an $n$-dimensional probability vector $\omega$, where $\omega_i$ is the probability of outcome $i$. One can view the base vectors $e_i$ as the elementary events and the probability vector as a mixture of these events: $\omega = \sum_i \omega_i e_i$. We define a "matrix generalization" of a multinomial which is parametrized by a density matrix $W$ (a positive matrix of unit trace). Now the elementary events are dyads of the form $uu^\top$, where $u$ is a unit vector in $\mathbb{R}^n$. Dyads are the representations of states used in quantum physics [20]. A density matrix is a mixture of dyads. Whereas probability vectors represent uncertainty over $n$ basis vectors, density matrices can be viewed as representing uncertainty over infinitely many dyads in $\mathbb{R}^n$.

In the classical case, the algorithm predicts at trial $t$ with multinomial $\omega_{t-1}$. Nature produces an outcome $x_t \in \{1, \ldots, n\}$, and the algorithm incurs loss $-\log(\omega_{t-1,x_t})$. The most common heuristic (a.k.a. the Laplace estimator) chooses $\omega_{t-1,i}$ proportional to 1 plus the number of previous trials in which outcome $i$ was observed. The online algorithms are evaluated by their worst-case regret over data sequences, where the regret is the additional loss of the algorithm over the total loss of the best probability vector chosen in hindsight.

In this paper we develop the corresponding matrix setting, where the algorithm predicts with a density matrix $W_{t-1}$, Nature produces a dyad $x_t x_t^\top$, and the algorithm incurs loss $-x_t^\top \log(W_{t-1})\, x_t$. Here $\log$ denotes the matrix logarithm. We are particularly interested in how the regret changes when the algorithms are generalized to the matrix case. Surprisingly, we can show that for the Laplace as well as the Krichevsky-Trofimov [10] estimators the worst-case regret is the same in the matrix case as it is in the classical case.
For the Last-Step Minimax algorithm [16], we can prove the same regret bound for the matrix case that was proven for the classical case.

Why are we doing this? Most machine learning algorithms deal with vector parameters. The goal of this line of research is to develop methods for handling matrix parameters. We are used to dealing with probability vectors. Recently a probability calculus was developed for density matrices [20], including various Bayes rules for updating generalized conditionals. The vector problems are typically retained as special cases of the matrix problems, where the eigensystem is fixed and only the vector of eigenvalues has to be learned. We exhibit for the first time a basic fundamental problem for which the regret achievable in the matrix case is no higher than the regret achievable in the original vector setting.

Paper outline. Definitions and notation are given in the next section, followed by proofs of the free matrix lunch for the three discussed algorithms in Section 3. At the core of our proofs is a new technical lemma for mixing quantum entropies. We also discuss the minimax algorithm for multinomials due to Shtarkov, and the corresponding minimax algorithm for density matrices. We provide strong experimental evidence that the free matrix lunch holds for this algorithm as well. To put the results into context, we motivate and discuss our choice of the loss function, and compare it to several alternatives in Section 4. More discussion and perspective is provided in Section 5.

2 Setup

The protocols for the classical probability vector prediction problem and the new density matrix prediction problem are displayed side-by-side in Figure 1. We explain the latter problem. Learning proceeds in trials. During trial $t$ the algorithm predicts with a density matrix $W_{t-1}$.
We use index $t-1$ to indicate that it is based on the $t-1$ previous outcomes. Then Nature responds with an outcome density matrix $X_t$. The discrepancy between prediction and outcome is measured by the matrix entropic loss

$$\ell(W_{t-1}, X_t) := -\mathrm{tr}\bigl(X_t \log(W_{t-1})\bigr), \tag{1}$$

where $\log$ denotes the matrix logarithm². When the outcome density matrix $X_t$ is a dyad $x_t x_t^\top$, this loss becomes $-x_t^\top \log(W_{t-1})\, x_t$, which is the simplified form of the entropic loss discussed in the introduction. Also, if the prediction density matrix is diagonal, i.e. it has the form $W_{t-1} = \sum_i \omega_{t-1,i}\, e_i e_i^\top$ for some probability vector $\omega_{t-1}$, and the outcome $X_t$ is an eigendyad $e_j e_j^\top$ of the same eigensystem, then this loss simplifies to the classical log loss: $\ell(W_{t-1}, X_t) = -\log(\omega_{t-1,j})$. The above definition is not the only way to promote the log loss to the matrix domain; in Section 4 we justify this choice.

We aim to design algorithms with low regret compared to the best fixed density matrix in hindsight. The loss of the best fixed density matrix can be expressed succinctly in terms of the von Neumann entropy, which is defined for any density matrix $D$ as $H(D) := -\mathrm{tr}(D \log D)$, and the sufficient statistic $S_T = \sum_{t=1}^T X_t$, as follows: $\inf_W \sum_{t=1}^T \ell(W, X_t) = T\, H\bigl(\frac{S_T}{T}\bigr)$. For fixed data $X_1, \ldots, X_T$, the regret of a strategy that issues prediction $W_t$ after observing $X_1, \ldots, X_t$ is

$$\sum_{t=1}^T \ell(W_{t-1}, X_t) - T\, H\Bigl(\frac{S_T}{T}\Bigr), \tag{2}$$

and the worst-case regret on $T$ trials is obtained by taking $\sup_{X_1,\ldots,X_T}$ over (2).
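For $n = 2$ the quantities of this section can be computed directly from a closed-form eigendecomposition. The stdlib-only sketch below (all helper names are ours) implements the matrix entropic loss (1) and the von Neumann entropy, and checks that the diagonal case reduces to the classical log loss:

```python
import math

def eig2(a, b, d):
    """Eigenvalues and orthonormal eigenvectors of the symmetric [[a, b], [b, d]]."""
    half_tr = (a + d) / 2.0
    disc = math.sqrt(max(half_tr * half_tr - (a * d - b * b), 0.0))
    l1, l2 = half_tr + disc, half_tr - disc          # l1 >= l2
    if abs(b) < 1e-14:
        v1 = (1.0, 0.0) if a >= d else (0.0, 1.0)
    else:
        v1 = (l1 - d, b)                             # eigenvector for l1
    norm = math.hypot(v1[0], v1[1])
    v1 = (v1[0] / norm, v1[1] / norm)
    return (l1, l2), (v1, (-v1[1], v1[0]))

def entropic_loss(W, X):
    """Matrix entropic loss (1): -tr(X log W), for symmetric 2x2 W and X."""
    (l1, l2), (v1, v2) = eig2(W[0][0], W[0][1], W[1][1])
    loss = 0.0
    for lam, v in ((l1, v1), (l2, v2)):
        vXv = sum(v[i] * X[i][j] * v[j] for i in range(2) for j in range(2))
        loss -= vXv * math.log(lam)                  # accumulate -tr(X log W)
    return loss

def vn_entropy(D):
    """Von Neumann entropy H(D) = -tr(D log D)."""
    (l1, l2), _ = eig2(D[0][0], D[0][1], D[1][1])
    return -sum(l * math.log(l) for l in (l1, l2) if l > 1e-14)

# diagonal prediction and an axis-aligned dyad outcome: the classical log loss
W = [[0.7, 0.0], [0.0, 0.3]]
X = [[0.0, 0.0], [0.0, 1.0]]                         # the dyad e_2 e_2^T
assert abs(entropic_loss(W, X) + math.log(0.3)) < 1e-12
```

The loss of the best fixed density matrix in hindsight is then `T * vn_entropy(S_T / T)`, applied to the normalized sufficient statistic.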
Our aim is to design strategies for density matrix prediction that have low worst-case regret.

3 Free Matrix Lunches

In this section we show how four popular online algorithms for learning multinomials can be extended to learning density matrices. We start with the simple Laplace estimator, continue with its improved version known as the Krichevsky-Trofimov estimator, and also extend the less known Last Step Minimax strategy, which has even less regret. We prove a version of the free matrix lunch (FML) for all three algorithms. Finally, we discuss the minimax algorithm, for which we have experimental evidence that the free matrix lunch holds as well.

3.1 Laplace

After observing classical data with sufficient statistic vector $\sigma_t = \sum_{q=1}^t e_{x_q}$, classical Laplace predicts with the probability vector $\omega_t := \frac{\sigma_t + 1}{t + n}$ consisting of the normalized smoothed counts. By analogy, after observing matrix data with sufficient statistic $S_t = \sum_{q=1}^t X_q$, matrix Laplace predicts with the correspondingly smoothed matrix $W_t := \frac{S_t + I}{t + n}$. Classical Laplace is commonly motivated either as the Bayes predictive distribution w.r.t. the uniform prior, or as a loss minimization with virtual outcomes [3]. The latter motivation can be "lifted" to the matrix domain by adding $n$ virtual outcomes at $I/n$:

$$W_t = \operatorname*{argmin}_{W \text{ dens.\ mat.}} \Bigl\{ n\, \ell(W, I/n) + \sum_{q=1}^t \ell(W, X_q) \Bigr\} = \frac{S_t + I}{t + n}. \tag{3}$$

The worst-case regret of classical Laplace after $T$ iterations equals $\log \binom{T+n-1}{n-1} \le (n-1)\log(T+1)$ (see e.g. [6]). We now show that in the matrix case, no additional regret is incurred.

Theorem 1 (Laplace FML). The worst-case regrets of classical and matrix Laplace coincide.

Proof. Let $W_t^*$ denote the best density matrix for the first $t$ outcomes.
The regret (2) of matrix Laplace can be bounded as follows:

$$\sum_{t=1}^T \ell(W_{t-1}, X_t) - \sum_{t=1}^T \ell(W_T^*, X_t) \le \sum_{t=1}^T \bigl( \ell(W_{t-1}, X_t) - \ell(W_t^*, X_t) \bigr). \tag{4}$$

Now consider each term in the right-hand sum separately. The $t$th term equals

$$-\mathrm{tr}\Bigl( X_t \Bigl( \log\frac{S_{t-1} + I}{t-1+n} - \log\frac{S_t}{t} \Bigr) \Bigr) = \log\Bigl(\frac{t-1+n}{t}\Bigr) - \mathrm{tr}\bigl( X_t \bigl( \log(S_{t-1} + I) - \log S_t \bigr) \bigr).$$

Note that the first term constitutes the "classical" part of the per-round regret, while the second term is the "matrix" part. The matrix part is non-positive since $S_{t-1} + I \succeq S_t$, and the logarithm is a matrix monotone operation (i.e. $A \succeq B$ implies $\log A \succeq \log B$). By omitting it, we obtain an upper bound on the regret of matrix Laplace that is tight: for any sequence of identical dyads the matrix part is zero and (4) holds with equality, since $W_t^* = W_T^*$ for all $t \le T$. The same upper bound is also met by classical Laplace on any sequence of identical outcomes [6]. □

We just showed that matrix Laplace has the same worst-case regret as classical Laplace, although matrix Laplace learns a matrix of $n^2$ parameters whereas classical Laplace only learns $n$ probabilities. No additional regret is incurred for learning the eigenvectors. Matrix Laplace can update $W_t$ in $O(n^2)$ time per trial. The same will be true for our next algorithm.

²For any positive matrix with eigendecomposition $A = \sum_i \alpha_i\, a_i a_i^\top$, $\log(A) := \sum_i \log(\alpha_i)\, a_i a_i^\top$.

3.2 Krichevsky-Trofimov (KT)
\u03c9t := \u03c3t+1/2\n\nt+n/2 and Wt := St+I/2\nClassical and matrix KT smooth by adding 1\nt+n/2 .\nThe former can again be obtained as the Bayes predictive distribution w.r.t. Jeffreys\u2019 prior, the latter\nas the solution to the matrix entropic loss minimization problem (3) with n/2 virtual outcomes\ninstead of n for Laplace.\nThe leading term in the worst-case regret for classical KT is the optimal 1\n2 log(T ) rate per parameter\ninstead of the log(T ) rate for Laplace. More precisely, classical KT\u2019s worst-case regret after T\niterations is known to be log \u0393(T +n/2)\nAgain we show that no additional regret is incurred in the matrix case.\nTheorem 2 (KT FML). The worst-case regrets of classical and matrix KT coincide.\n\n(cid:0)log(T + 1) + log(\u03c0)(cid:1) (see e.g. [6]).\n\n\u0393(T +1/2) + log \u0393(1/2)\n\n\u0393(n/2) \u2264 n\u22121\n\n2\n\nLemma 1. For positive matrices A, B with A =(cid:80)\n\nThe proof uses the following key entropy decomposition lemma (proven in Appendix A):\n\nthe eigendecomposition of A:\n\ni\n\ni \u03b1i aia(cid:62)\n\nH(cid:0)A + tr(B) aia(cid:62)\n\ni\n\na(cid:62)\ni Bai\ntr(B)\n\n(cid:18) St\u22121 + Xt\n\n(cid:19)\n\nt\n\n(cid:1),\n(cid:18) St\u22121\n\nt \u2212 1\n\n(cid:19)(cid:19)\n\n+ (t \u2212 1)H\n\n.\n\n(5)\n\n(cid:80)n\ni=1 \u03c3i sis(cid:62)\n\ni=1\n\nt=1\n\nT(cid:88)\n\nH(A + B) \u2265 n(cid:88)\n(cid:18)\n\u2212 tr(cid:0)Xt log(Wt\u22121)(cid:1) \u2212 tH\nn(cid:88)\n\u2265 n(cid:88)\n\n\u2212 tr(cid:0)Xt log(Wt\u22121)(cid:1) = \u2212 tr(cid:0)Xt\n(cid:19)\n(cid:18) St\u22121 + Xt\n(cid:18)\n\ni=1\n\ni=1\n\nH\n\nt\n\ns(cid:62)\ni Xtsi\n\n\u2212 log(\u03c9t\u22121,i) \u2212 tH\n\n\u03b4t :=\n\nn(cid:88)\n\ni=1\n\nMoreover, it follows from Lemma 1 that:\n\nwhich, in turn, is at most:\n\n(cid:18)\n\nProof of Theorem 2. We start by telescoping the regret (2) of matrix KT as follows\n\nWe bound each term separately. Let us denote the eigendecomposition of St\u22121 by St\u22121 =\n\ni . 
Notice that since $W_{t-1}$ plays in the eigensystem of $S_{t-1}$, we have:

$$-\mathrm{tr}\bigl(X_t \log(W_{t-1})\bigr) = -\mathrm{tr}\Bigl(X_t \sum_{i=1}^n \log(\omega_{t-1,i})\, s_i s_i^\top\Bigr) = -\sum_{i=1}^n s_i^\top X_t s_i \log(\omega_{t-1,i}).$$

Moreover, it follows from Lemma 1 that:

$$H\Bigl(\frac{S_{t-1}+X_t}{t}\Bigr) \ge \sum_{i=1}^n s_i^\top X_t s_i\, H\Bigl(\frac{S_{t-1}+s_i s_i^\top}{t}\Bigr).$$

Taking this equality and inequality into account, the $t$th term in (5) is bounded above by:

$$\delta_t := \sup_i \Bigl( -\log(\omega_{t-1,i}) - t\, H\Bigl(\frac{S_{t-1}+s_i s_i^\top}{t}\Bigr) + (t-1)\, H\Bigl(\frac{S_{t-1}}{t-1}\Bigr) \Bigr). \tag{6}$$

In other words, the per-round regret increase is largest for one of the eigenvectors of the sufficient statistic $S_{t-1}$, i.e. for classical data. To get an upper bound, maximize over $S_0, \ldots, S_{T-1}$ independently, each with the constraint that $\mathrm{tr}(S_t) = t$. A particular maximizer is $S_t = t\, e_1 e_1^\top$, which is the sufficient statistic of the sequence of outcomes all equal to $X_t = e_1 e_1^\top$. For this sequence all bounding steps hold with equality. Hence the matrix KT regret is below the classical KT regret. The reverse is obvious. □

3.3 Last Step Minimax

The bounding technique developed using Lemma 1 and applied to KT can be used to prove bounds for a much broader class of prediction strategies. The crucial part of the KT proof was showing that each term in the telescoped regret (5) can be bounded above by $\delta_t$ as defined in (6), in which all matrices share the same eigensystem, and which is hence equivalent to the corresponding classical expression. The only property of the prediction strategy that we used was that it plays in the eigensystem of the past sufficient statistic.
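Because matrix KT plays in the eigensystem of $S_{t-1}$, the conclusion of Theorem 2 can also be checked numerically: on a worst-case sequence of $T$ identical dyads $uu^\top$, where $u$ need not be axis-aligned, the matrix KT regret computed with the general machinery equals the classical KT regret on $T$ identical outcomes. A stdlib-only sketch for $n = 2$ (the closed-form `eig2` helper and all names are ours):

```python
import math

def eig2(a, b, d):
    """Eigenvalues and orthonormal eigenvectors of the symmetric [[a, b], [b, d]]."""
    half_tr = (a + d) / 2.0
    disc = math.sqrt(max(half_tr * half_tr - (a * d - b * b), 0.0))
    l1, l2 = half_tr + disc, half_tr - disc
    if abs(b) < 1e-14:
        v1 = (1.0, 0.0) if a >= d else (0.0, 1.0)
    else:
        v1 = (l1 - d, b)
    norm = math.hypot(v1[0], v1[1])
    v1 = (v1[0] / norm, v1[1] / norm)
    return (l1, l2), (v1, (-v1[1], v1[0]))

def matrix_kt_regret(u, T):
    """Regret of matrix KT on T copies of the dyad u u^T (n = 2).

    The best density matrix in hindsight is u u^T itself, with zero loss,
    so the total entropic loss equals the regret."""
    S = [[0.0, 0.0], [0.0, 0.0]]
    total = 0.0
    for t in range(1, T + 1):
        denom = (t - 1) + 1.0                        # t - 1 + n/2 with n = 2
        W = [[(S[i][j] + (0.5 if i == j else 0.0)) / denom
              for j in range(2)] for i in range(2)]  # W = (S + I/2)/(t-1+n/2)
        (l1, l2), (v1, v2) = eig2(W[0][0], W[0][1], W[1][1])
        for lam, v in ((l1, v1), (l2, v2)):
            uv = u[0] * v[0] + u[1] * v[1]
            total -= uv * uv * math.log(lam)         # -tr(u u^T log W)
        for i in range(2):
            for j in range(2):
                S[i][j] += u[i] * u[j]               # update sufficient statistic
    return total

def classical_kt_regret(T, n=2):
    """Regret of classical KT on T copies of the same outcome."""
    return -sum(math.log((t - 0.5) / (t - 1 + n / 2.0)) for t in range(1, T + 1))

theta = 0.6                                          # an arbitrary rotation
u = (math.cos(theta), math.sin(theta))
assert abs(matrix_kt_regret(u, 20) - classical_kt_regret(20)) < 1e-8
```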
Therefore, using the same line of argument, we can show that if for some classical prediction strategy we can obtain a meaningful regret bound by bounding each term $\delta_t$ in the regret independently, then we can obtain the same bound for the corresponding matrix strategy, i.e. its spectral promotion.

In particular, we can push this argument to its limit by considering the algorithm designed to minimize $\delta_t$ in each iteration. This algorithm is known as Last Step Minimax. In fact, the Last Step Minimax (LSM) principle is a general recipe for online prediction, which states that the algorithm should minimize the worst-case regret with respect to the next outcome [16]. In other words, it should act as the minimax algorithm given that the time horizon is one iteration ahead. In the classical case for the multinomial distribution, after observing data with sufficient statistic $\sigma_{t-1}$, classical LSM predicts with

$$\omega_{t-1} := \operatorname*{argmin}_{\omega} \max_{x_t} \Bigl\{ -\log(\omega_{x_t}) - t\, H\Bigl(\frac{\sigma_t}{t}\Bigr) \Bigr\} = \sum_{i=1}^n \frac{\exp\bigl(-t\, H\bigl(\frac{\sigma_{t-1}+e_i}{t}\bigr)\bigr)}{\sum_j \exp\bigl(-t\, H\bigl(\frac{\sigma_{t-1}+e_j}{t}\bigr)\bigr)}\, e_i, \tag{7}$$

where $\sigma_t = \sigma_{t-1} + e_{x_t}$, the first term is the loss $\ell(\omega, x_t)$, and $t\, H(\frac{\sigma_t}{t}) = \sum_{q=1}^t \ell(\omega_t^*, x_q)$ is the loss of the best multinomial $\omega_t^*$ in hindsight. Classical LSM is analyzed in [16] for the Bernoulli ($n = 2$) case. For our straightforward generalization to the classical multinomial case, the regret is bounded by $\frac{n-1}{2} \ln(T+1) + 1$. LSM is therefore slightly better than KT.

Applying the Last Step Minimax principle to density matrix prediction, we obtain matrix LSM, which issues prediction:

$$W_{t-1} := \operatorname*{argmin}_{W} \max_{X_t} \Bigl\{ -\mathrm{tr}\bigl(X_t \log(W)\bigr) - t\, H\Bigl(\frac{S_t}{t}\Bigr) \Bigr\}, \quad\text{where } S_t = S_{t-1} + X_t.$$

We show that matrix LSM learns the eigenvectors without additional regret.

Theorem 3 (LSM FML).
The regrets of classical and matrix LSM are at most $\frac{n-1}{2} \ln(T+1) + 1$.

Proof. We determine the form of $W_{t-1}$. By Sion's minimax theorem [15]:

$$\min_W \max_{X_t} \Bigl\{ -\mathrm{tr}\bigl(X_t \log(W)\bigr) - t\, H\Bigl(\frac{S_t}{t}\Bigr) \Bigr\} = \max_P \min_W \mathbb{E}_P\Bigl[ -\mathrm{tr}\bigl(X_t \log(W)\bigr) - t\, H\Bigl(\frac{S_t}{t}\Bigr) \Bigr],$$

where $P$ ranges over probability distributions on density matrices $X_t$. Plugging in the minimizer $W = \mathbb{E}_P[X_t]$, the right-hand side becomes:

$$\max_P \Bigl\{ H\bigl(\mathbb{E}_P[X_t]\bigr) - \mathbb{E}_P\Bigl[ t\, H\Bigl(\frac{S_t}{t}\Bigr) \Bigr] \Bigr\}. \tag{8}$$

Now decompose $S_{t-1}$ as $\sum_{i=1}^n \sigma_i\, s_i s_i^\top$. Using Lemma 1, we can bound the second expression inside the maximum:

$$\mathbb{E}_P\Bigl[ t\, H\Bigl(\frac{S_t}{t}\Bigr) \Bigr] \ge \mathbb{E}_P\Bigl[ t \sum_{i=1}^n s_i^\top X_t s_i\, H\Bigl(\frac{S_{t-1}+s_i s_i^\top}{t}\Bigr) \Bigr] = t \sum_{i=1}^n s_i^\top\, \mathbb{E}_P[X_t]\, s_i\, H\Bigl(\frac{S_{t-1}+s_i s_i^\top}{t}\Bigr).$$

On the other hand, we know that the entropy does not decrease when we replace the argument $\mathbb{E}_P[X_t]$ by its pinching (a.k.a. projective measurement) $\sum_{i=1}^n (u_i^\top\, \mathbb{E}_P[X_t]\, u_i)\, u_i u_i^\top$ w.r.t. any eigensystem $u_i$ [12]. Therefore, we have:

$$H\bigl(\mathbb{E}_P[X_t]\bigr) \le H\Bigl( \sum_{i=1}^n (s_i^\top\, \mathbb{E}_P[X_t]\, s_i)\, s_i s_i^\top \Bigr) = H(p),$$

where the last entropy is a classical entropy and $p$ is a vector such that $p_i = s_i^\top\, \mathbb{E}_P[X_t]\, s_i$.
Combining those two results together, we have:

$$H\bigl(\mathbb{E}_P[X_t]\bigr) - \mathbb{E}_P\Bigl[ t\, H\Bigl(\frac{S_t}{t}\Bigr) \Bigr] \le H(p) - t \sum_{i=1}^n p_i\, H\Bigl(\frac{\sigma_{t-1}+e_i}{t}\Bigr).$$

Note that we have equality only when the distribution $P$ puts nonzero mass only on the eigenvectors $s_1, \ldots, s_n$. This means that when $p$ is fixed, we maximize (8) by using a distribution with this property, i.e. $P$ is restricted to the eigensystem of $S_{t-1}$. This, in turn, means that $W_{t-1} = \mathbb{E}_P[X_t]$ will play in the eigensystem of $S_{t-1}$ as well. It follows that $W_{t-1}$ is the classical LSM strategy in the eigensystem of $S_{t-1}$, i.e. $W_{t-1} = \sum_i \omega_{t-1,i}\, s_i s_i^\top$, where $\omega_{t-1}$ is taken as in (7).

The proof of the classical LSM guarantee is based on bounding the per-round regret increase

$$\delta_t := -\log(\omega_{t-1,x_t}) - t\, H\Bigl(\frac{\sigma_{t-1}+e_{x_t}}{t}\Bigr) + (t-1)\, H\Bigl(\frac{\sigma_{t-1}}{t-1}\Bigr)$$

by choosing the worst case w.r.t. $x_t$ and $\sigma_{t-1}$. Since, for matrices, the worst case for the corresponding matrix version of $\delta_t$, see (6), is the diagonal case, the whole analysis immediately goes through and we get the same bound as for classical LSM. □

Note that the bound for LSM is not tight, i.e. there exists no data sequence for which the bound is achieved. Therefore, the bound for matrix LSM is also not tight. This theorem is a weaker FML because it only relates worst-case regret bounds. We have verified experimentally that the actual regrets coincide in dimension $n = 2$ for up to $T = 5$ outcomes, using a grid of 30 dyads per trial with uniformly spaced $(x^\top e_1)^2$. So we believe that in fact:

Conjecture 1 (LSM FML). The worst-case regrets of classical and matrix LSM coincide.

To execute the matrix LSM strategy, we need the eigendecomposition of the sufficient statistic. For density matrix data $X_t$, we may need to recompute it each trial in $\Omega(n^3)$ time.
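The classical LSM prediction (7) itself is cheap to evaluate: each weight is an exponentiated entropy of a hypothetically completed empirical distribution. A stdlib sketch (function names are ours):

```python
import math

def entropy(p):
    """Shannon entropy in nats."""
    return -sum(q * math.log(q) for q in p if q > 0)

def lsm_predict(sigma, t):
    """Classical Last Step Minimax prediction (7).

    sigma holds the outcome counts after t - 1 trials (sum(sigma) == t - 1);
    the returned vector is the prediction omega_{t-1} for trial t."""
    n = len(sigma)
    # omega_{t-1,i} is proportional to exp(-t * H((sigma + e_i) / t))
    weights = [math.exp(-t * entropy([(c + (1 if j == i else 0)) / t
                                      for j, c in enumerate(sigma)]))
               for i in range(n)]
    z = sum(weights)
    return [w / z for w in weights]

# after three identical outcomes, LSM still hedges a little on the unseen one
omega = lsm_predict([3, 0], 4)
assert omega[0] > omega[1] and abs(sum(omega) - 1.0) < 1e-12
```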
For dyadic data $x_t x_t^\top$ it can be incrementally updated in $O(n^2)$ time per trial with methods along the lines of [11].

3.4 Shtarkov

Fix horizon $T$. The minimax algorithm for multinomials, due to Shtarkov [14], minimizes the worst-case regret

$$\inf_{\omega_0} \sup_{x_1} \cdots \inf_{\omega_{T-1}} \sup_{x_T} \Bigl( \sum_{t=1}^T \ell(\omega_{t-1}, x_t) - T\, H\Bigl(\frac{\sigma_T}{T}\Bigr) \Bigr). \tag{9}$$

After observing data with sufficient statistic $\sigma_t$, and hence with $r := T - t$ rounds remaining, classical Shtarkov predicts with

$$\omega_t := \sum_{i=1}^n \frac{\phi_{r-1}(\sigma_t + e_i)}{\phi_r(\sigma_t)}\, e_i \quad\text{where}\quad \phi_r(\sigma) := \sum_{\substack{c_1, \ldots, c_n \\ \sum_{i=1}^n c_i = r}} \binom{r}{c_1, \ldots, c_n} \exp\Bigl( -T\, H\Bigl(\frac{\sigma + c}{T}\Bigr) \Bigr). \tag{10}$$

The so-called Shtarkov sum $\phi_r$ can be evaluated in time $O\bigl(n\, r \log(r)\bigr)$ using a straightforward extension of the method described in [9] for computing $\phi_T(0)$, which is based on dynamic programming and Fast Fourier Transforms. The regret of classical Shtarkov equals $\log \phi_T(0) \approx \frac{n-1}{2}\bigl(\log(T) - \log(n-2) + 1\bigr)$ [6]. This is again better than Last Step Minimax, which is in turn better than KT, which dominates Laplace.

The minimax algorithm for density matrices, called matrix Shtarkov, optimizes the worst-case regret

$$\inf_{W_0} \sup_{X_1} \cdots \inf_{W_{T-1}} \sup_{X_T} \Bigl( \sum_{t=1}^T \ell(W_{t-1}, X_t) - T\, H\Bigl(\frac{S_T}{T}\Bigr) \Bigr). \tag{11}$$

To this end, after observing data with sufficient statistic $S_t$, with $r$ rounds remaining, it predicts with

$$W_t := \operatorname*{argmin}_{W} \sup_X\, \ell(W, X) + R_{r-1}(S_t + X),$$

where $R_r$ is the tail sequence of inf/sups of (11) of length $r$. We now argue that the FML holds for matrix Shtarkov.
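For small $n$ and $r$, the Shtarkov sum in (10) can also be evaluated by brute-force enumeration of count vectors instead of the FFT method of [9]. The sketch below (our own naive stdlib implementation) computes $\phi_r(\sigma)$, the prediction (10), and the minimax regret $\log \phi_T(0)$:

```python
import math
from itertools import product

def entropy(p):
    return -sum(q * math.log(q) for q in p if q > 0)

def shtarkov_sum(sigma, r, T):
    """phi_r(sigma) of (10), enumerating all count vectors c with sum(c) = r."""
    n = len(sigma)
    total = 0.0
    for c in product(range(r + 1), repeat=n):
        if sum(c) != r:
            continue
        log_multinom = math.lgamma(r + 1) - sum(math.lgamma(ci + 1) for ci in c)
        p = [(s + ci) / T for s, ci in zip(sigma, c)]
        total += math.exp(log_multinom - T * entropy(p))
    return total

def shtarkov_predict(sigma, t, T):
    """Classical Shtarkov prediction (10), with r = T - t rounds remaining."""
    r = T - t
    z = shtarkov_sum(sigma, r, T)
    return [shtarkov_sum([s + (1 if j == i else 0) for j, s in enumerate(sigma)],
                         r - 1, T) / z for i in range(len(sigma))]

# minimax regret for n = 2, T = 5 is log phi_T(0), roughly 1.256 nats
minimax_regret = math.log(shtarkov_sum([0, 0], 5, 5))
```

The enumeration makes the horizon-dependence explicit: unlike Laplace, KT, and LSM, every prediction needs $T$ up front.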
Matrix Shtarkov is surprisingly difficult to analyze. However, we provide a simplifying conjecture that we verified experimentally. A rigorous proof remains an open problem. Our conjecture is that Lemma 1 holds with the entropy $H$ replaced by the minimax regret tail $R_r$:

Conjecture 2. For each integer $r$ and for each pair of positive matrices $A$ and $B$,

$$R_r(A + B) \ge \sum_i \frac{a_i^\top B a_i}{\mathrm{tr}(B)}\, R_r\bigl(A + \mathrm{tr}(B)\, a_i a_i^\top\bigr).$$

Note that this conjecture generalizes Lemma 1, which is retained as the case $r = 0$. It follows from this conjecture, using the same argument as for LSM, that matrix Shtarkov predicts in the eigensystem of $S_t$, i.e. with $W_t = \sum_i \omega_{t,i}\, s_i s_i^\top$ with $\omega_t$ as in (10), and furthermore that:

Conjecture 3 (Shtarkov FML). The worst-case regrets of classical and matrix Shtarkov coincide.

We have verified Conjecture 3 for the matrix Bernoulli case ($n = 2$) up to $T = 5$ outcomes, using a grid of 30 dyads per trial with uniformly spaced $(x^\top e_1)^2$. Then, assuming that $R_r(S) = \log(\phi_r(\sigma))$, where $\sigma$ are the eigenvalues of $S$, for each $n$ from 2 to 5 we drew $10^5$ trace pairs uniformly from $[0, 10]$, and then drew matrix pairs $A$ and $B$ uniformly at random with those traces. Conjecture 2 always held.

Obtaining the FML for the minimax algorithm is mathematically challenging and of academic interest, but of minor practical relevance. First, the time horizon $T$ must be specified in advance, so the minimax algorithm cannot be used in a purely online fashion. Secondly, its running time is superlinear in the number of rounds remaining, while it is constant for the previous three algorithms.

4 Motivation and Discussion of the Loss Function

The matrix entropic loss (1) that we chose as our loss function has a coding interpretation and is a proper scoring rule.
The latter seems to be a necessary condition for the free matrix lunch.

Quantum coding. Classical log-loss forecasting can be motivated from the point of view of data compression and variable-length coding [7]. In information theory, the Kraft-McMillan inequality states that, ignoring rounding issues, for every uniquely decodable code with a code length function $\lambda$ there is a probability distribution $\omega$ such that $\lambda_i = -\log \omega_i$ for all symbols $i = 1, \ldots, n$, and vice versa. Therefore, the log loss can be interpreted as the code length assigned to the observed outcome. Quantum information theory [13, 5] generalizes variable-length coding to the quantum/density matrix case. Instead of messages composed of bits, the sender and the receiver exchange messages described by density matrices, and the role analogous to the message length is now played by the dimension of the density matrix. Variable-length quantum coding requires the definition of a code length operator $L$, which is a positive matrix such that for any density matrix $X$, $\mathrm{tr}(XL)$ gives the expected dimension ("length") of the message assigned to $X$. The quantum version of Kraft's inequality states that, ignoring rounding issues, for every variable-length quantum code with code-length operator $L$ there exists a density matrix $W$ such that $L = -\log W$. Therefore, the matrix entropic loss can be interpreted as the (expected) code length of the observed outcome.

Proper score function. In decision theory, the loss function $\ell(\omega, x)$ assessing the quality of predictions is also referred to as a score function. A score function is said to be proper if for any distribution $p$ on outcomes, the expected loss is minimized by predicting with $p$ itself, i.e. $\operatorname*{argmin}_{\omega} \mathbb{E}_{x \sim p}[\ell(\omega, x)] = p$. Minimization of a proper score function leads to well-calibrated forecasting.
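The definition can be checked numerically for the log loss by brute force: minimize the expected log loss over a grid on the simplex and observe that the minimizer is the data distribution $p$ itself (a stdlib sketch of ours, not part of the paper's experiments):

```python
import math
from itertools import product

def expected_log_loss(w, p):
    """E_{x ~ p}[ -log w_x ], the expected log loss of prediction w."""
    return -sum(pi * math.log(wi) for pi, wi in zip(p, w))

p = [0.2, 0.3, 0.5]
best_w, best_loss = None, float("inf")
for a, b in product(range(1, 100), repeat=2):     # grid on the 3-simplex
    if a + b >= 100:
        continue
    w = [a / 100, b / 100, (100 - a - b) / 100]
    loss = expected_log_loss(w, p)
    if loss < best_loss:
        best_w, best_loss = w, loss
# the grid minimizer is p itself: the log loss is proper, and the minimum
# value of the expected loss is the Shannon entropy H(p)
assert best_w == [0.2, 0.3, 0.5]
```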
The log loss is known to be a proper score function [4].

We will say that a matrix loss function $\ell(W, X)$ is proper if for any distribution $P$ on density matrix outcomes, the expected loss with respect to $P$ is minimized by predicting with the mean outcome of $P$, i.e. $\operatorname*{argmin}_W \mathbb{E}_{X \sim P}[\ell(W, X)] = \mathbb{E}_{X \sim P}[X]$. The matrix entropic loss (1) is proper, for $\mathbb{E}_{X \sim P}[-\mathrm{tr}(X \log W)] = -\mathrm{tr}\bigl(\mathbb{E}_{X \sim P}[X] \log W\bigr)$ is minimized at $W = \mathbb{E}_{X \sim P}[X]$ [12]. Therefore, minimization of the matrix entropic loss leads to well-calibrated forecasting, as in the classical case.

A second generalization of the log loss to the matrix domain used in quantum physics [12] is the log trace loss $\ell(W, X) := -\log\bigl(\mathrm{tr}(XW)\bigr)$. Note that here the trace and the logarithm are exchanged compared to (1). The expression $\mathrm{tr}(XW)$ plays an important role in quantum physics as the expected value of a measurement outcome, and for $X = xx^\top$, $\mathrm{tr}(xx^\top W)$ is interpreted as a probability. However, the log trace loss is not proper. The counterexample is straightforward: if we take $P$ uniform on $\{x_1 x_1^\top, x_2 x_2^\top\}$, then the minimizer of the expected log trace loss is $W \propto (x_1 + x_2)(x_1 + x_2)^\top$, which differs from $\mathbb{E}_{X \sim P}[X] = \frac12 (x_1 x_1^\top + x_2 x_2^\top)$. Also, for the log trace loss we found an example (not presented) against the FML for the minimax algorithm.

A third generalization of the loss is $\ell(W, X) := -\log\bigl(\mathrm{tr}(X \odot W)\bigr)$, where $\odot$ denotes the commutative "product" between matrices that underlies the probability calculus of [20].³ This loss upper bounds the log trace loss. We do not know whether it is a proper scoring function. However, it equals the matrix entropic loss when $X$ is a dyad.

Finally, another loss explored in the online learning community is the trace loss $\ell(W, X) := \mathrm{tr}(WX)$.
This loss is not a proper scoring function (it behaves like the absolute loss in the vector case) and we have an example that shows that there is no FML for the minimax algorithm in this case (not presented).

In summary, for there to exist a FML, properness of the loss function seems to be required.

³We can compute A ⊙ B as the matrix exponential of the sum of the matrix logarithms of A and B.

5 Conclusion

We showed that the free matrix lunch holds for the matrix version of the KT estimator. Thus the conjectured free matrix lunch [18] is realized. Our paper raises many open questions. Perhaps the main one is whether the free matrix lunch holds for the matrix minimax algorithm. Also we would like to know what properties of the loss function and algorithm cause the free matrix lunch to occur. From the examples given in this paper it is tempting to believe that you always get a free matrix lunch when upgrading any classical sufficient-statistics-based predictor to a matrix version by simply playing this predictor in the eigensystem of the current matrix sufficient statistics. However, the following counterexample shows that a general reduction must be more subtle: consider floored KT, which predicts with ω_{t,i} ∝ ⌊σ_{t,i}⌋ + 1/2. For T = 5 trials in dimension n = 2, the worst-case regret is 1.297 for the classical log loss and 1.992 for the matrix entropic loss.

A Proof of Lemma 1

We prove the following slightly stronger inequality for all γ ≥ 0. The lemma is the case γ = 1:

    f(γ) := H(A + γB) − Σ_{i=1}^n (a_i⊤ B a_i / tr(B)) H(A + γ tr(B) a_i a_i⊤) ≥ 0.

Since f(0) = 0, it suffices to show that f′(γ) ≥ 0. Since ∂H(D)/∂D = −log(D) − I,

    f′(γ) = −tr(B log(A + γB)) + Σ_{i=1}^n a_i⊤ B a_i tr(a_i a_i⊤ log(A + γ tr(B) a_i a_i⊤))
          = tr(B log(A + γ tr(B) I)) − tr(B log(A + γB)).

Since tr(B) I ⪰ B, we have A + γ tr(B) I ⪰ A + γB, and hence the matrix monotonicity of the logarithm implies that log(A + γ tr(B) I) ⪰ log(A + γB), so that f′(γ) ≥ 0.

References

[1] S. Arora, E. Hazan, and S. Kale. Fast algorithms for approximate semidefinite programming using the multiplicative weights update method. In FOCS, pages 339–348, 2005.
[2] S. Arora and S. Kale. A combinatorial, primal-dual approach to semidefinite programs. In STOC, pages 227–236, 2007.
[3] K. S. Azoury and M. K. Warmuth. Relative loss bounds for on-line density estimation with the exponential family of distributions. Machine Learning, 43(3):211–246, 2001.
[4] J. M. Bernardo and A. F. M. Smith. Bayesian Theory. Wiley, 1994.
[5] K. Bostroem and T. Felbinger. Lossless quantum data compression and variable-length coding. Phys. Rev. A, 65(3):032313, 2002.
[6] N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, New York, NY, USA, 2006.
[7] T. M. Cover and J. A. Thomas. Elements of Information Theory. John Wiley & Sons, 1991.
[8] R. Jain, Z. Ji, S. Upadhyay, and J. Watrous. QIP = PSPACE. In Proceedings of the 42nd ACM Symposium on Theory of Computing, STOC, pages 573–582, 2010.
[9] P. Kontkanen and P. Myllymäki. A fast normalized maximum likelihood algorithm for multinomial data. In Proceedings of the Nineteenth International Joint Conference on Artificial Intelligence (IJCAI-05), pages 1613–1616, 2005.
[10] R. E. Krichevsky and V. K. Trofimov. The performance of universal encoding.
IEEE Transactions on Information Theory, 27(2):199–207, Mar. 1981.
[11] J. T. Kwok and H. Zhao. Incremental eigen decomposition. In Proc. ICANN, pages 270–273, 2003.
[12] M. A. Nielsen and I. L. Chuang. Quantum Computation and Quantum Information. Cambridge University Press, 2000.
[13] B. Schumacher and M. D. Westmoreland. Indeterminate-length quantum coding. Phys. Rev. A, 64(4):042304, 2001.
[14] Y. M. Shtarkov. Universal sequential coding of single messages. Problems of Information Transmission, 23(3):3–17, 1987.
[15] M. Sion. On general minimax theorems. Pacific Journal of Mathematics, 8(1):171–176, 1958.
[16] E. Takimoto and M. Warmuth. The last-step minimax algorithm. In Proceedings of the 13th Annual Conference on Computational Learning Theory, pages 100–106, 2000.
[17] K. Tsuda, G. Rätsch, and M. K. Warmuth. Matrix exponentiated gradient updates for on-line learning and Bregman projections. Journal of Machine Learning Research, 6:995–1018, June 2005.
[18] M. K. Warmuth. When is there a free matrix lunch? In Proc. of the 20th Annual Conference on Learning Theory (COLT '07). Springer-Verlag, June 2007. Open problem.
[19] M. K. Warmuth and D. Kuzmin. Online variance minimization. In Proceedings of the 19th Annual Conference on Learning Theory (COLT '06), Pittsburgh, June 2006. Springer-Verlag.
[20] M. K. Warmuth and D. Kuzmin. Bayesian generalized probability calculus for density matrices. Machine Learning, 78(1-2):63–101, January 2010.