{"title": "A scaled Bregman theorem with applications", "book": "Advances in Neural Information Processing Systems", "page_first": 19, "page_last": 27, "abstract": "Bregman divergences play a central role in the design and analysis of a range of machine learning algorithms through a handful of popular theorems. We present a new theorem which shows that \u201cBregman distortions\u201d (employing a potentially non-convex generator) may be exactly re-written as a scaled Bregman divergence computed over transformed data. This property can be viewed from the standpoints of geometry (a scaled isometry with adaptive metrics) or convex optimization (relating generalized perspective transforms). Admissible distortions include geodesic distances on curved manifolds and projections or gauge-normalisation. Our theorem allows one to leverage the wealth and convenience of Bregman divergences when analysing algorithms relying on the aforementioned Bregman distortions. We illustrate this with three novel applications of our theorem: a reduction from multi-class density ratio to class-probability estimation, a new adaptive projection-free yet norm-enforcing dual norm mirror descent algorithm, and a reduction from clustering on flat manifolds to clustering on curved manifolds. Experiments on each of these domains validate the analyses and suggest that the scaled Bregman theorem might be a worthy addition to the popular handful of Bregman divergence properties that have been pervasive in machine learning.", "full_text": "A scaled Bregman theorem with applications\n\nRichard Nock\u2020,\u2021,\u00a7, Aditya Krishna Menon\u2020,\u2021, Cheng Soon Ong\u2020,\u2021\n\u2020Data61, \u2021the Australian National University and \u00a7the University of Sydney\n{richard.nock, aditya.menon, chengsoon.ong}@data61.csiro.au\n\nAbstract\n\nBregman divergences play a central role in the design and analysis of a range of\nmachine learning algorithms through a handful of popular theorems. 
We present\na new theorem which shows that \u201cBregman distortions\u201d (employing a potentially\nnon-convex generator) may be exactly re-written as a scaled Bregman divergence\ncomputed over transformed data. This property can be viewed from the standpoints\nof geometry (a scaled isometry with adaptive metrics) or convex optimization (relating\ngeneralized perspective transforms). Admissible distortions include geodesic\ndistances on curved manifolds and projections or gauge-normalisation.\nOur theorem allows one to leverage the wealth and convenience of Bregman\ndivergences when analysing algorithms relying on the aforementioned Bregman\ndistortions. We illustrate this with three novel applications of our theorem: a\nreduction from multi-class density ratio to class-probability estimation, a new\nadaptive projection-free yet norm-enforcing dual norm mirror descent algorithm,\nand a reduction from clustering on flat manifolds to clustering on curved manifolds.\nExperiments on each of these domains validate the analyses and suggest that the\nscaled Bregman theorem might be a worthy addition to the popular handful of\nBregman divergence properties that have been pervasive in machine learning.\n\n1 Introduction: Bregman divergences as a reduction tool\n\nBregman divergences play a central role in the design and analysis of a range of machine learning\n(ML) algorithms. In recent years, Bregman divergences have arisen in procedures for convex\noptimisation [4], online learning [9, Chapter 11], clustering [3], matrix approximation [13],\nclass-probability estimation [7, 26, 29, 28], density ratio estimation [35], boosting [10], variational inference\n[18], and computational geometry [5]. 
Despite these being very different applications, many of\nthese algorithms and their analyses basically rely on three beautiful analytic properties of Bregman\ndivergences, properties that we summarize for differentiable scalar convex functions \u03d5 with derivative\n\u03d5\u2032, conjugate \u03d5\u22c6, and divergence D\u03d5:\n\u2022 the triangle equality: D\u03d5(x\u2016y) + D\u03d5(y\u2016z) \u2212 D\u03d5(x\u2016z) = (\u03d5\u2032(z) \u2212 \u03d5\u2032(y))(x \u2212 y);\n\u2022 the dual symmetry property: D\u03d5(x\u2016y) = D\u03d5\u22c6(\u03d5\u2032(y)\u2016\u03d5\u2032(x));\n\u2022 the right-centroid (population minimizer) is the average: arg min_\u00b5 E[D\u03d5(X\u2016\u00b5)] = E[X].\nCasting a problem as a Bregman minimisation allows one to employ these properties to simplify\nanalysis; for example, by interpreting mirror descent as applying a particular Bregman regulariser,\nBeck and Teboulle [4] relied on the triangle equality above to simplify its proof of convergence.\nAnother intriguing possibility is that one may derive reductions amongst learning problems by\nconnecting their underlying Bregman minimisations. 
Menon and Ong [24] recently established how\n(binary) density ratio estimation (DRE) can be exactly reduced to class-probability estimation (CPE).\nThis was facilitated by interpreting CPE as a Bregman minimisation [7, Section 19], and a new\nproperty of Bregman divergences: Menon and Ong [24, Lemma 2] showed that for any twice\ndifferentiable scalar convex \u03d5, for g(x) = 1 + x and \u02c7\u03d5(x) .= g(x) \u00b7 \u03d5(x/g(x)),\n\ng(x) \u00b7 D\u03d5(x/g(x)\u2016y/g(y)) = D \u02c7\u03d5(x\u2016y) , \u2200x, y. (1)\n\nSince the binary class-probability function \u03b7(x) = Pr(Y = 1|X = x) is related to the class-\nconditional density ratio r(x) = Pr(X = x|Y = 1)/ Pr(X = x|Y = \u22121) via Bayes\u2019 rule as\n\u03b7(x) = r(x)/g(r(x)) ([24] assume Pr(Y = 1) = 1/2), any \u02c6\u03b7 with small D\u03d5(\u03b7\u2016\u02c6\u03b7) implicitly\nproduces an \u02c6r with low D \u02c7\u03d5(r\u2016\u02c6r) i.e. a good estimate of the density ratio. The Bregman property of\neq. (1) thus establishes a reduction from DRE to CPE. Two questions arise from this analysis: can we\ngeneralise eq. (1) to other g(\u00b7), and if so, can we similarly relate other problems to each other?\n\nTable 1: Applications of our scaled Bregman Theorem (Theorem 1). \u201cReduction\u201d encompasses\nshortcuts on algorithms and on analyses (algorithm/proof A uses algorithm/proof B as subroutine).\n\nProblem A | Problem B that Theorem 1 reduces A to | Reference\nMulticlass density-ratio estimation | Multiclass class-probability estimation | \u00a73, Lemma 2\nOnline optimisation on Lq ball | Convex unconstrained online learning | \u00a74, Lemma 4\nClustering on curved manifolds | Clustering on flat manifolds | \u00a75, Lemma 5\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\nThis paper presents a new Bregman identity (Theorem 1), the scaled Bregman theorem, a significant\ngeneralisation of Menon and Ong [24, Lemma 2]. 
It shows that general distortions D \u02c7\u03d5 \u2013 which are not\nnecessarily convex, positive, bounded or symmetric \u2013 may be re-expressed as a Bregman divergence\nD\u03d5 computed over transformed data, and thus inherit their good properties despite appearing prima\nfacie to be a very different object. This transformation can be as simple as a projection or normalisation\nby a gauge, or more involved like the exponential map on lifted coordinates for a curved manifold.\nOur theorem can be summarized in two ways. The \ufb01rst is geometric as it specializes to a scaled\nisometry involving adaptive metrics. The second calls to a fundamental object of convex analysis,\ngeneralized perspective transforms [11, 22, 23]. Indeed, our theorem states when\n\n\"the perspective of a Bregman divergence equals the distortion of a perspective\",\n\nfor a perspective ( \u02c7\u03d5 in eq. 1) which is analytically a generalized perspective transform but does\nnot rely on the same convexity and sign requirements as in Mar\u00e9chal [22, 23]. We note that the\nperspective of a Bregman divergence (the left-hand side of eq. 1) is a special case of conformal\ndivergence [27], yet to our knowledge it has never been formally de\ufb01ned. As with the aforementioned\nkey properties of Bregman divergences, Theorem 1 has potentially wide implications for ML. 
We\ngive three such novel applications to vastly different problems (see Table 1):\n\u2022 a reduction of multiple density ratio estimation to multiclass-probability estimation (\u00a73), generalising\nthe results of [24] for the binary label case,\n\u2022 a projection-free yet norm-enforcing mirror gradient algorithm (enforced norms are those of\nmirrored vectors and of the offset) with guarantees for adaptive filtering (\u00a74), and\n\u2022 a seeding approach for clustering on positively or negatively (constant) curved manifolds based\non a popular seeding for flat manifolds and with the same approximation guarantees (\u00a75).\n\nExperiments on each of these domains (\u00a76) validate our analysis. The Supplementary Material (SM)\ndetails the proofs of all results, provides the experimental results in extenso and some additional\n(nascent) applications to exponential families and computational information geometry.\n\n2 Main result: the scaled Bregman theorem\n\nIn the remainder, [k] .= {0, 1, ..., k} and [k]\u2217 .= {1, 2, ..., k} for k \u2208 N. For any differentiable (but\nnot necessarily convex) \u03d5 : X \u2192 R, we define the Bregman distortion D\u03d5 as\n\nD\u03d5(x\u2016y) .= \u03d5(x) \u2212 \u03d5(y) \u2212 (x \u2212 y)^\u22a4 \u2207\u03d5(y) . (2)\n\nIf \u03d5 is convex, D\u03d5 is the familiar Bregman divergence with generator \u03d5. Without further ado, we\npresent our main result.\nTheorem 1 Let \u03d5 : X \u2192 R be convex differentiable, and g : X \u2192 R\u2217 be differentiable. 
Then,\n\ng(x) \u00b7 D\u03d5((1/g(x)) \u00b7 x \u2016 (1/g(y)) \u00b7 y) = D \u02c7\u03d5(x \u2016 y) , \u2200x, y \u2208 X , (3)\n\nwhere \u02c7\u03d5(x) .= g(x) \u00b7 \u03d5((1/g(x)) \u00b7 x) , (4)\n\nif and only if (i) g is affine on X, or (ii) for every z \u2208 Xg .= {(1/g(x)) \u00b7 x : x \u2208 X},\n\n\u03d5(z) = z^\u22a4 \u2207\u03d5(z) . (5)\n\nTable 2: Examples of (D\u03d5, D \u02c7\u03d5, g) for which eq. (3) holds. Function xS .= f (x) : Rd \u2192 Rd+1 and\nxH .= f (x) : Rd \u2192 Rd \u00d7 C are the Sphere and Hyperbolic lifting maps defined in SM, eqs. 51, 62.\nW > 0 is a constant. DG denotes the Geodesic distance on the sphere (for xS) or the hyperboloid\n(for xH). S(d) is the set of symmetric real matrices. 
Related proofs are in SM, Section III.\n\nX | D\u03d5(x\u2016y) | D \u02c7\u03d5(x\u2016y) | g(x)\nR^d | (1/2) \u00b7 \u2016x \u2212 y\u2016_2^2 | \u2016x\u2016_2 \u00b7 (1 \u2212 cos \u2220x, y) | \u2016x\u2016_2\nR^d | (1/2) \u00b7 (\u2016x\u2016_q^2 \u2212 \u2016y\u2016_q^2) \u2212 \u2211_i (xi \u2212 yi) \u00b7 sign(yi) \u00b7 |yi|^{q\u22121} / \u2016y\u2016_q^{q\u22122} | W \u00b7 \u2016x\u2016_q \u2212 W \u00b7 \u2211_i xi \u00b7 sign(yi) \u00b7 |yi|^{q\u22121} / \u2016y\u2016_q^{q\u22121} | \u2016x\u2016_q/W\nR^d \u00d7 R | (1/2) \u00b7 \u2016xS \u2212 yS\u2016_2^2 | (\u2016x\u2016_2 / sin \u2016x\u2016_2) \u00b7 (1 \u2212 cos DG(x, y)) | \u2016x\u2016_2 / sin \u2016x\u2016_2\nR^d \u00d7 C | (1/2) \u00b7 \u2016xH \u2212 yH\u2016_2^2 | \u2212(\u2016x\u2016_2 / sinh \u2016x\u2016_2) \u00b7 (cosh DG(x, y) \u2212 1) | \u2212\u2016x\u2016_2 / sinh \u2016x\u2016_2\nR^d_+ | \u2211_i xi log(xi/yi) \u2212 1^\u22a4(x \u2212 y) | \u2211_i xi log(xi/yi) \u2212 d \u00b7 E[X] \u00b7 log(E[X]/E[Y]) | 1^\u22a4x\nR^d_+ | \u2211_i xi/yi \u2212 \u2211_i log(xi/yi) \u2212 d | \u2211_i xi (\u220f_j yj)^{1/d} / yi \u2212 d (\u220f_j xj)^{1/d} | \u220f_i xi^{1/d}\nS(d) | tr(X log X \u2212 X log Y) \u2212 tr(X) + tr(Y) | tr(X log X \u2212 X log Y) \u2212 tr(X) \u00b7 log(tr(X)/tr(Y)) | tr(X)\nS(d) | tr(XY^{\u22121}) \u2212 log det(XY^{\u22121}) \u2212 d | det(Y^{1/d}) tr(XY^{\u22121}) \u2212 d \u00b7 det(X^{1/d}) | det(X^{1/d})\n\nTable 2 presents some examples of (sometimes involved) triplets (D\u03d5, D \u02c7\u03d5, g) for which eq. (3) holds;\nrelated proofs are in Appendix III. Depending on \u03d5 and g, there are at least two ways to summarize\nTheorem 1. One is geometric: Theorem 1 sometimes states a scaled isometry between X and Xg. The\nother one comes from convex optimisation: Theorem 1 defines generalized perspective transforms on\nBregman divergences and roughly states the identity between the perspective transform of a Bregman\ndivergence and the Bregman distortion of the perspective transform. Appendix VIII gives more\ndetails for both properties. We refer to Theorem 1 as the scaled Bregman theorem.\nRemark. If Xg is a vector space, \u03d5 satisfies eq. (5) if and only if it is positive homogeneous of\ndegree 1 on Xg (i.e. \u03d5(\u03b1z) = \u03b1 \u00b7 \u03d5(z) for any \u03b1 > 0) from Euler\u2019s homogeneous function theorem.\nWhen Xg is not a vector space, this only holds for \u03b1 such that \u03b1z \u2208 Xg as well. We thus call the\ngradient condition of eq. (5) \u201crestricted positive homogeneity\u201d for simplicity.\nRemark. 
Appendix IV gives a \u201cdeep composition\u201d extension of Theorem 1.\nFor the special case where X = R, and g(x) = 1 + x, Theorem 1 is exactly [24, Lemma 2] (c.f. eq.\n1). We wish to highlight a few points with regard to our more general result. First, the \u201cdistortion\u201d\ngenerator \u02c7\u03d5 may be\u00b9 non-convex, as the following illustrates.\nExample. Suppose \u03d5(x) = (1/2)\u2016x\u2016_2^2, the generator for squared Euclidean distance. Then, for\ng(x) = 1 + 1^\u22a4x, we have \u02c7\u03d5(x) = (1/2) \u00b7 \u2016x\u2016_2^2/(1 + 1^\u22a4x), which is non-convex on X = Rd.\nWhen \u02c7\u03d5 is non-convex, the right hand side in eq. (3) is an object that ostensibly bears only a\nsuperficial similarity to a Bregman divergence; it is somewhat remarkable that Theorem 1 shows\nthis general \u201cdistortion\u201d between a pair (x, y) to be entirely equivalent to a (scaling of a) Bregman\ndivergence between some transformation of the points. Second, when g is affine, eq. (3) holds for any\nconvex \u03d5 (this was the case considered in [24]). When g is not affine, however, \u03d5 must be chosen\ncarefully so that (\u03d5, g) satisfies the restricted homogeneity condition\u00b2 of eq. (5). In general, given a\nconvex \u03d5, one can \u201creverse engineer\u201d a suitable g, as illustrated by the following example.\nExample. Suppose\u00b3 \u03d5(x) = (1 + \u2016x\u2016_2^2)/2. Then, eq. (5) requires that \u2016x\u2016_2^2 = 1 for every x \u2208 Xg,\ni.e. Xg is (a subset of) the unit sphere. This is afforded by the choice g(x) = \u2016x\u2016_2.\nThird, Theorem 1 is not merely a mathematical curiosity: we now show that it facilitates novel\nresults in three very different domains, namely estimating multiclass density ratios, constrained\nonline optimisation, and clustering data on a manifold with non-zero curvature. We discuss nascent\napplications to exponential families and computational geometry in Appendices V and VI.\n\n\u00b9 Evidently, \u02c7\u03d5 is convex iff g is non-negative, by eq. (3) and the fact that a function is convex iff its Bregman\n\u201cdistortion\u201d is nonnegative [6, Section 3.1.3].\n\u00b2 We stress that this condition only needs to hold on Xg \u2286 X; it would not be really interesting in general for\n\u03d5 to be homogeneous everywhere in its domain, since we would basically have \u02c7\u03d5 = \u03d5.\n\u00b3 The constant 1/2 added in \u03d5 does not change D\u03d5, since a Bregman divergence is invariant to affine terms;\nremoving this however would make the divergences D\u03d5 and D \u02c7\u03d5 differ by a constant.\n\n
3 Multiclass density-ratio estimation via class-probability estimation\n\nGiven samples from a number of densities, density ratio estimation concerns estimating the ratio\nbetween each density and some reference density. This has applications in the covariate shift problem\nwherein the train and test distributions over instances differ [33]. Our first application of Theorem 1\nis to show how density ratio estimation can be reduced to class-probability estimation [7, 29].\nTo proceed, we fix notation. For some integer C \u2265 1, consider a distribution P(X, Y) over an\n(instance, label) space X \u00d7 [C]. Let ({Pc}_{c=1}^C, \u03c0) be densities giving P(X|Y = c) and P(Y = c)\nrespectively, and M giving P(X) accordingly. Fix c\u2217 \u2208 [C] a reference class, and suppose for\nsimplicity that c\u2217 = C. Let \u02dc\u03c0 \u2208 \u25b3^{C\u22121} such that \u02dc\u03c0c .= \u03c0c/(1 \u2212 \u03c0C). Density ratio estimation\n[35] concerns inferring the vector r(x) \u2208 R^{C\u22121} of density ratios relative to C, with rc(x) .=\nP(X = x|Y = c)/P(X = x|Y = C) , while class-probability estimation [7] concerns inferring the\nvector \u03b7(x) \u2208 R^{C\u22121} of class-probabilities, with \u03b7c(x) .= P(Y = c|X = x)/\u02dc\u03c0c . 
In both cases, we\nestimate the respective quantities given an iid sample S \u223c P(X, Y)^m (m is the training sample size).\nThe genesis of the reduction from density ratio to class-probability estimation is the fact that r(x) =\n(\u03c0C/(1 \u2212 \u03c0C)) \u00b7 \u03b7(x)/\u03b7C(x). In practice one will only have an estimate \u02c6\u03b7, typically derived by\nminimising a suitable loss on the given S [37], with a canonical example being multiclass logistic\nregression. Given \u02c6\u03b7, it is natural to estimate the density ratio via:\n\n\u02c6r(x) = \u02c6\u03b7(x)/\u02c6\u03b7C(x) . (6)\n\nWhile this estimate is intuitive, to establish a formal reduction we must relate the quality of \u02c6r to\nthat of \u02c6\u03b7. Since the minimisation of a suitable loss for class-probability estimation is equivalent to a\nBregman minimisation [7, Section 19], [37, Proposition 7], this is however immediate by Theorem 1:\nLemma 2 Given a class-probability estimator \u02c6\u03b7 : X \u2192 [0, 1]^{C\u22121}, let the density ratio estimator \u02c6r be\nas per Equation 6. Then for any convex differentiable \u03d5 : [0, 1]^{C\u22121} \u2192 R,\n\nE_{X\u223cM} [D\u03d5(\u03b7(X)\u2016\u02c6\u03b7(X))] = (1 \u2212 \u03c0C) \u00b7 E_{X\u223cPC} [D\u03d5\u2020(r(X)\u2016\u02c6r(X))] , (7)\n\nwhere \u03d5\u2020 is as per Equation 4 with g(x) .= \u03c0C/(1 \u2212 \u03c0C) + \u02dc\u03c0^\u22a4x .\n\nLemma 2 generalises [24, Proposition 3], which focussed on the binary case with \u03c0 = 1/2 (See\nAppendix VII for a review of that result). Unpacking the Lemma, the LHS in Equation 7 represents\nthe object minimised by some suitable loss for class-probability estimation. Since g is affine, we\ncan use any convex, differentiable \u03d5, and so can use any suitable class-probability loss to estimate\n\u02c6\u03b7. 
Lemma 2 thus implies that producing \u02c6\u03b7 by minimising any class-probability loss equivalently\nproduces an \u02c6r as per Equation 6 that minimises a Bregman divergence to the true r. Thus, Theorem 1\nprovides a reduction from density ratio to multiclass probability estimation.\nWe now detail two applications where g(\u00b7) is no longer affine, and \u03d5 must be chosen more carefully.\n\n4 Dual norm mirror descent: projection-free online learning on Lp balls\n\nA substantial amount of work in the intersection of ML and convex optimisation has focused on\nconstrained optimisation within a ball [32, 14]. This optimisation is typically via projection operators\nthat can be expensive to compute [17, 19]. We now show that gauge functions can be used as an\ninexpensive alternative, and that Theorem 1 easily yields guarantees for this procedure in online\nlearning. We consider the adaptive filtering problem, closely related to the online least squares\nproblem with linear predictors [9, Chapter 11]. Here, over a sequence of T rounds, we observe some\nxt \u2208 X. We must then predict a target value \u02c6yt = w_{t\u22121}^\u22a4xt using our current weight vector wt\u22121.\nThe true target yt = u^\u22a4xt + \u03b5t is then revealed, where \u03b5t is some unknown noise, and we may update\nour weight to wt. Our goal is to minimise the regret of the sequence {wt}_{t=0}^T,\n\nR(w1:T|u) .= \u2211_{t=1}^T (u^\u22a4xt \u2212 w_{t\u22121}^\u22a4xt)^2 \u2212 \u2211_{t=1}^T (u^\u22a4xt \u2212 yt)^2 . (8)\n\nLet q \u2208 (1, 2] and p be such that 1/p + 1/q = 1. 
For \u03d5 .= (1/2) \u00b7 \u2016x\u2016_q^2 and loss \u2113t(w) =\n(1/2) \u00b7 (yt \u2212 w^\u22a4xt)^2, the p-LMS algorithm [20] employs the stochastic mirror gradient updates:\n\nwt .= argmin_w \u03b7t \u00b7 \u2113t(w) + D\u03d5(w\u2016wt\u22121) = (\u2207\u03d5)^{\u22121} (\u2207\u03d5(wt\u22121) \u2212 \u03b7t \u00b7 \u2207\u2113t) , (9)\n\nwhere \u03b7t is a learning rate to be specified by the user. [20, Theorem 2] shows that for appropriate \u03b7t,\none has R(w1:T|u) \u2264 (p \u2212 1) \u00b7 max_{x\u2208X} \u2016x\u2016_p^2 \u00b7 \u2016u\u2016_q^2.\nThe p-LMS updates do not provide any explicit control on \u2016wt\u2016, i.e. there is no regularisation.\nExperiments (\u00a76) suggest that leaving \u2016wt\u2016 uncontrolled may not be a good idea as the\nincrease of the norm sometimes prevents (significant) updates (eq. (9)). Also, the wide success of\nregularisation in ML calls for regularised variants that retain the regret guarantees and computational\nefficiency of p-LMS. (Adding a projection step to eq. (9) would not achieve both.) We now do just\nthis. For fixed W > 0, let \u03d5 .= (1/2) \u00b7 (W^2 + \u2016x\u2016_q^2), a translation of that used in p-LMS. Invoking\nTheorem 1 with the admissible gq(x) = \u2016x\u2016_q/W yields \u02c7\u03d5 .= \u02c7\u03d5q = W\u2016x\u2016_q (see Table 2). Using\nthe fact that Lp and Lq norms are dual of each other, we replace eq. (9) by:\n\nwt .= \u2207 \u02c7\u03d5p (\u2207 \u02c7\u03d5q(wt\u22121) \u2212 \u03b7t \u00b7 \u2207\u2113t) . (10)\n\nSee Lemma A of the Appendix for the simple forms of \u2207 \u02c7\u03d5{p,q}. We call update (10) the dual norm\np-LMS (DN-p-LMS) algorithm, noting that the dual refers to the polar transform of the norm, and g\nstems from a gauge normalization for Bq(W ), the closed Lq ball with radius W > 0. 
Namely, we\nhave \u03b3GAU(x) = W/\u2016x\u2016_q = g(x)^{\u22121} for the gauge \u03b3GAU(x) .= sup{z \u2265 0 : z \u00b7 x \u2208 Bq(W )}, so\nthat \u02c7\u03d5q implicitly performs gauge normalisation of the data. This update is no more computationally\nexpensive than eq. (9), since we simply need to compute the p- and q-norms of appropriate terms, but,\ncrucially, it automatically constrains the norms of wt and its image by \u2207 \u02c7\u03d5q.\nLemma 3 For the update in eq. (10), \u2016wt\u2016_q = \u2016\u2207 \u02c7\u03d5q(wt)\u2016_p = W, \u2200t > 0.\n\nLemma 3 is remarkable, since nowhere in eq. (10) do we project onto the Lq ball. Nonetheless, for\nthe DN-p-LMS updates to be principled, we need a similar regret guarantee to the original p-LMS.\nFortunately, this may be done using Theorem 1 to exploit the original proof of [20]. For any u \u2208 Rd,\ndefine the q-normalised regret of {wt}_{t=0}^T by\n\nRq(w1:T|u) .= \u2211_{t=1}^T ((1/gq(u)) \u00b7 u^\u22a4xt \u2212 w_{t\u22121}^\u22a4xt)^2 \u2212 \u2211_{t=1}^T ((1/gq(u)) \u00b7 u^\u22a4xt \u2212 yt)^2 . (11)\n\nWe have the following bound on Rq for the DN-p-LMS updates (We cannot expect a bound on the\nunnormalised R(\u00b7) of eq. (8), since by Lemma 3 we can only compete against norm W vectors).\nLemma 4 Pick any u \u2208 Rd, p, q satisfying 1/p + 1/q = 1 and p > 2, and W > 0. Suppose\n\u2016xt\u2016_p \u2264 Xp and |yt| \u2264 Y, \u2200t \u2264 T . Let {wt} be as per eq. (10), using learning rate\n\n\u03b7t .= \u03b3t \u00b7 W / (4(p \u2212 1) max{W, Xp}XpW + |yt \u2212 w_{t\u22121}^\u22a4xt|Xp) , (12)\n\nfor any desired \u03b3t \u2208 [1/2, 1]. Then,\n\nRq(w1:T|u) \u2264 4(p \u2212 1)Xp^2 W^2 + (16p \u2212 8) max{W, Xp}Xp^2 W + 8Y Xp^2 . (13)\n\nSeveral remarks can be made. 
First, the bound depends on the maximal signal value Y , but this is the\nmaximal signal in the observed sequence, so it may not be very large in practice; if it is comparable to\nW , then our bound is looser than [20] by just a constant factor. Second, the learning rate is adaptive\nin the sense that its choice depends on the last mistake made. There is a nice way to represent the\n\u201coffset\u201d vector \u03b7t \u00b7 \u2207\u2113t in eq. (10), since we have, for Q\u2032\u2032 .= 4(p \u2212 1) max{W, Xp}XpW ,\n\n\u03b7t \u00b7 \u2207\u2113t = W \u00b7 (|yt \u2212 w_{t\u22121}^\u22a4xt|Xp / (Q\u2032\u2032 + |yt \u2212 w_{t\u22121}^\u22a4xt|Xp)) \u00b7 sign(yt \u2212 w_{t\u22121}^\u22a4xt) \u00b7 ((1/Xp) \u00b7 xt) , (14)\n\nso the Lp norm of the offset is actually equal to W \u00b7 Q, where Q \u2208 [0, 1] is all the smaller as the\nvector w. gets better. Hence, the update in eq. (10) controls in fact all norms (that of w., its image by\n\u2207 \u02c7\u03d5q and the offset). Third, because of the normalisation of u, the bound actually does not depend on\nu, but on the radius W chosen for the Lq ball.\n\nFigure 1: (L) Lifting map into Rd \u00d7 R for clustering on the sphere with k-means++. (M) Drec in Eq.\n(15) in vertical thick red line. (R) Lifting map into Rd \u00d7 C for the hyperboloid. [Panel titles: Sphere, Hyperboloid.]\n\n5 Clustering on a curved manifold via clustering on a flat manifold\n\nOur final application can be related to two problems that have received a steadily growing interest\nover the past decade in unsupervised ML: clustering on a non-linear manifold [12], and subspace\nclustering [36]. We consider two fundamental manifolds investigated by [16] to compute centers of\nmass from relativistic theory: the sphere Sd and the hyperboloid Hd, the former being of positive\ncurvature, and the latter of negative curvature. 
Applications involving these speci\ufb01c manifolds are\nnumerous in text processing, computer vision, geometric modelling, computer graphics, to name a\nfew [8, 12, 15, 21, 30, 34]. We emphasize the fact that the clustering problem has signi\ufb01cant practical\nimpact for d as small as 2 in computer vision [34].\nThe problem is non-trivial for two separate reasons. First, the ambient space, i.e.\nthe space of\nregistration of the input data, is often implicitly Euclidean and therefore not the manifold [12]: if the\nmapping to the manifold is not carefully done, then geodesic distances measured on the manifold may\nbe inconsistent with respect to the ambient space. Second, the fact that the manifold has non-zero\ncurvature essentially prevents the direct use of Euclidean optimization algorithms [38] \u2014 put simply,\nthe average of two points that belong to a manifold does not necessarily belong to the manifold, so\nwe have to be careful on how to compute centroids for hard clustering [16, 27, 30, 31].\nWhat we show now is that Riemannian manifolds with constant sectional curvature may be clustered\nwith the k-means++ seeding for \ufb02at manifolds [2], without even touching a line of the algorithm.\nTo formalise the problem, we need three key components of Riemannian geometry: tangent planes,\nexponential map and geodesics [1]. We assume that the ambient space is a tangent plane to the\nmanifold M, which conveniently makes it look Euclidean (see Figure 1). The point of tangency is\ncalled q, and the tangent plane TqM. The exponential map, expq : TqM \u2192 M, performs a distance\npreserving mapping: the geodesic length between q and expq(x) in M is the same as the Euclidean\nlength between q and x in TqM. 
Our clustering objective is to find C .= {c1, c2, ...ck} \u2282 M such\nthat Drec(S : C) = inf_{C\u2032\u2282M, |C\u2032|=k} Drec(S : C\u2032), with\n\nDrec(S : C) .= \u2211_{i\u2208[m]\u2217} min_{j\u2208[k]\u2217} Drec(expq(xi), cj) , (15)\n\nwhere Drec is a reconstruction loss, a function of the geodesic distance between expq(xi) and cj.\nWe use two loss functions defined from [16] and used in ML for more than a decade [12]:\n\nR+ \u220b Drec(y, c) .= 1 \u2212 cos DG(y, c) for M = Sd , and Drec(y, c) .= cosh DG(y, c) \u2212 1 for M = Hd . (16)\n\nHere, DG(y, c) is the corresponding geodesic distance of M between y and c. Figure 1 shows\nthat Drec(y, c) is the orthogonal distance between TcM and y when M = Sd. The solution to the\nclustering problem in eq. (15) is therefore the one that minimizes the error between tangent planes\ndefined at the centroids, and points on the manifold.\nIt turns out that both distances in eq. (16) can be engineered as Bregman divergences via Theorem 1, as seen\nin Table 2. Furthermore, they imply the same \u03d5, which is just the generator of Mahalanobis distortion,\nbut a different g. The construction involves a third party, a lifting map (lift(.)) that increases the\ndimension by one. The Sphere lifting map Rd \u220b x \u21a6 xS \u2208 Rd+1 is indicated in Table 3 (left). The\nnew coordinate depends on the norm of x. The Hyperbolic lifting map, Rd \u220b x \u21a6 xH \u2208 Rd \u00d7 C,\ninvolves a pure imaginary additional coordinate, and is indicated in Table 3 (right, with a slight abuse\nof notation) and Figure 1. 
Both xS and xH live on a d-dimensional manifold, depicted in Figure 1.\nWhen they are scaled by the corresponding g.(.), they happen to be mapped to Sd or Hd, respectively,\nby what happens to be the manifold\u2019s exponential map for the original x (see Appendix III).\nTheorem 1 is interesting in this case because \u03d5 corresponds to a Mahalanobis distortion: this shows\nthat k-means++ seeding [2, 25] can be used directly on the scaled coordinates (g{S,H}^{\u22121}(x{S,H}) \u00b7\nx{S,H}) to pick centroids that yield an approximation of the global optimum for the clustering\nproblem on the manifold which is just as good as the original Euclidean approximation bound [2].\n\nTable 3: How to use k-means++ to cluster points on the sphere (left) or the hyperboloid (right).\n\n(Sphere) Sk-means++(S, k)\nInput: dataset S \u2282 TqSd, k \u2208 N\u2217;\nStep 1: S+ \u2190 {gS^{\u22121}(xS) \u00b7 xS : xS \u2208 lift(S)};\nStep 2: C+ \u2190 k-means++_seeding(S+, k);\nStep 3: C \u2190 expq^{\u22121}(C+);\nOutput: Cluster centers C \u2208 TqSd;\nxS .= [x1 x2 \u00b7\u00b7\u00b7 xd \u2016x\u2016_2 cot \u2016x\u2016_2]\ngS(xS) .= \u2016x\u2016_2/ sin \u2016x\u2016_2\n\n(Hyperboloid) Hk-means++(S, k)\nInput: dataset S \u2282 TqHd, k \u2208 N\u2217;\nStep 1: S+ \u2190 {gH^{\u22121}(xH) \u00b7 xH : xH \u2208 lift(S)};\nStep 2: C+ \u2190 k-means++_seeding(S+, k);\nStep 3: C \u2190 expq^{\u22121}(C+);\nOutput: Cluster centers C \u2208 TqHd;\nxH .= [x1 x2 \u00b7\u00b7\u00b7 xd i\u2016x\u2016_2 coth \u2016x\u2016_2]\ngH(xH) .= \u2212\u2016x\u2016_2/ sinh \u2016x\u2016_2\n\nLemma 5 The expected potential of Sk-means++ seeding over the random choices of C+ satisfies:\n\nE[Drec(S : C)] \u2264 8(2 + log k) \u00b7 inf_{C\u2032\u2208Sd} Drec(S : C\u2032) . (17)\n\nThe same approximation bound holds for Hk-means++ seeding on the hyperboloid (C\u2032, C+ \u2208 Hd).\n\nLemma 5 is notable since it was only recently shown that such a bound is possible for the sphere [15],\nand to our knowledge, no such approximation quality is known for clustering on the hyperboloid [30,\n31]. 
Notice that Lloyd iterations on non-linear manifolds would require repetitive renormalizations\nto keep centers on the manifold [12], an additional disadvantage compared to clustering on flat\nmanifolds that {G, K}-means++ seedings do not bear.\n\nTable 4: Summary of the experiments displaying (y) the error of p-LMS minus error of DN-p-LMS\n(when > 0, DN-p-LMS beats p-LMS) as a function of t, in the setting of [20], for various values of\n(p, q) (columns). Left panel: (D)ense target; Right panel: (S)parse target.\n[Plots omitted; columns in each panel: (p, q) = (1.17, 6.9), (2.0, 2.0), (6.9, 1.17); plot annotations: \u03c1 = 1.0, 1.0, 1.0, 0.5, 1.3, 0.2.]\n\n6 Experimental validation\n\nWe present some experiments validating our theoretical analysis for the applications above.\nMultiple density ratio estimation. See Appendix IX for experiments in this domain.\nDual norm p-LMS (DN-p-LMS). We ran p-LMS and the DN-p-LMS of \u00a74 on the experimental\nsetting of [20]. 
We refer to that paper for an exhaustive description of the experimental setting, which\nwe brie\ufb02y summarize: it is a noisy signal processing setting, involving a dense or a sparse target. We\ncompute, over the signal received, the error of our predictor on the signal. We keep all parameters as\nthey are in [20], except for one: we make sure that data are scaled to \ufb01t in an Lp ball of prescribed\nradius, to test the assumption reported in [20] that \ufb01xing the learning rate \u03b7t is not straightforward\nin p-LMS. Knowing the true value of Xp, we then scale it by a misestimation factor \u03c1, typically\nin [0.1, 1.7]. We use the same misestimation in DN-p-LMS. Thus, both algorithms suffer the same\nsource of uncertainty. Also, we periodically change the signal (every 1000 iterations), to assess the\nperformance of the algorithms in tracking changes in the signal.\nExperiments, given in extenso in Appendix X, are summarized in Table 4. The following trends emerge:\nin the mid to long run, DN-p-LMS is never beaten by p-LMS by more than a fraction of a percent.\nOn the other hand, DN-p-LMS can beat p-LMS by very signi\ufb01cant differences (exceeding 40%), in\nparticular when p < 2, i.e. when we are outside the regime of the proof of [20]. This indicates that\n\nTable 5: (L) Relative improvement (decrease) in k-means potential of SKM\u25e6Sk-means++ compared\nto SKM alone. (R) Relative improvement of Sk-means++ over Forgy initialization on the sphere.\n\nTable 6: (L) Percentage of runs of SKM whose output (when it has converged) is better than\nSk-means++. (C) Maximal # of iterations for SKM after which it beats Sk-means++ (ignoring runs\nof SKM that do not beat Sk-means++). 
(R) Average # of iterations for SKM to converge.\n\nsigni\ufb01cantly stronger and more general results than the one of Lemma 4 may be expected. Also,\np-LMS appears to suffer from an \u201cexploding\u201d norm problem: in various cases, we observe\nthat \u2016wt\u2016 (in any norm) blows up with t, and this correlates with a very signi\ufb01cant degradation of its\nperformance. Clearly, DN-p-LMS does not have this problem since all relevant norms are under\ntight control. Finally, even when the norm does not explode, DN-p-LMS can still beat p-LMS, though by\nsmaller margins. Of course, the output of p-LMS can repeatedly be normalised, but the\nnormalisation would escape the theory of [20] and it is not clear which normalisation would be best.\nClustering on the sphere. For k \u2208 [50]\u2217, we simulate on T0S2 a mixture of spherical Gaussian and\nuniform densities in random rectangles with 2k components. We run three algorithms: (i) SKM [12]\non the data embedded on S2 with random (Forgy) initialization, (ii) Sk-means++ and (iii) SKM with\nSk-means++ initialisation. Results are averaged over the algorithms\u2019 runs.\nTable 5 (left) shows that using Sk-means++ as initialization for SKM brings a very signi\ufb01cant\ngain over SKM alone, since we almost divide the k-means potential by a factor of 2 on some runs.\nThe right plot of Table 5 shows that Sk-means++ consistently reduces the k-means potential by at\nleast a factor of 2 over Forgy. The left plot in Table 6 shows that even when it has converged, SKM\ndoes not necessarily beat Sk-means++. 
Finally, the center and right plots in Table 6 show that even\nwhen SKM does beat Sk-means++ at convergence, the iteration number after which it does so\nincreases with k, and in the worst case may exceed the average number of iterations\nneeded for SKM to converge (we stopped SKM when the relative improvement was not above 1\u2030).\n\n7 Conclusion\n\nWe presented a new scaled Bregman identity, and used it to derive novel results in several \ufb01elds\nof machine learning: multiple density ratio estimation, adaptive \ufb01ltering, and clustering on curved\nmanifolds. We believe that, like other known key properties of Bregman divergences, there is potential\nfor other applications of the result; Appendices V, VI present preliminary thoughts in this direction.\n\n8 Acknowledgments\n\nThe authors wish to thank Bob Williamson and the reviewers for insightful comments.\n\nReferences\n[1] S.-I. Amari and H. Nagaoka. Methods of Information Geometry. Oxford University Press, 2000.\n[2] D. Arthur and S. Vassilvitskii. k-means++: the advantages of careful seeding. In 19th SODA, 2007.\n[3] A. Banerjee, S. Merugu, I. Dhillon, and J. Ghosh. Clustering with Bregman divergences. JMLR, 6:1705\u20131749, 2005.\n[4] A. Beck and M. Teboulle. Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 31(3):167\u2013175, 2003.\n[5] J.-D. Boissonnat, F. Nielsen, and R. Nock. Bregman Voronoi diagrams. DCG, 44(2):281\u2013307, 2010.\n[6] S. Boyd and L. Vandenberghe. Convex Optimization. 
Cambridge University Press, 2004.\n[7] A. Buja, W. Stuetzle, and Y. Shen. Loss functions for binary class probability estimation and classi\ufb01cation: Structure and applications, 2005. Unpublished manuscript.\n[8] S.-R. Buss and J.-P. Fillmore. Spherical averages and applications to spherical splines and interpolation. ACM Transactions on Graphics, 20:95\u2013126, 2001.\n[9] N. Cesa-Bianchi and G. Lugosi. Prediction, Learning and Games. Cambridge University Press, 2006.\n[10] M. Collins, R. Schapire, and Y. Singer. Logistic regression, AdaBoost and Bregman distances. MLJ, 2002.\n[11] B. Dacorogna and P. Mar\u00e9chal. The role of perspective functions in convexity, polyconvexity, rank-one convexity and separate convexity. J. Convex Analysis, 15:271\u2013284, 2008.\n[12] I. Dhillon and D.-S. Modha. Concept decompositions for large sparse text data using clustering. MLJ, 42:143\u2013175, 2001.\n[13] I.-S. Dhillon and J.-A. Tropp. Matrix nearness problems with Bregman divergences. SIAM Journal on Matrix Analysis and Applications, 29(4):1120\u20131146, 2008.\n[14] J. Duchi, S. Shalev-Shwartz, Y. Singer, and T. Chandra. Ef\ufb01cient projections onto the \u21131-ball for learning in high dimensions. In ICML \u201908, pages 272\u2013279, New York, NY, USA, 2008. ACM.\n[15] Y. Endo and S. Miyamoto. Spherical k-means++ clustering. In Proc. of the 12th MDAI, pages 103\u2013114, 2015.\n[16] G.-A. Galperin. A concept of the mass center of a system of material points in the constant curvature spaces. Communications in Mathematical Physics, 154:63\u201384, 1993.\n[17] E. Hazan and S. Kale. Projection-free online learning. In John Langford and Joelle Pineau, editors, ICML \u201912, pages 521\u2013528, New York, NY, USA, 2012. ACM.\n[18] M. Hern\u00e1ndez-Lobato, Y. Li, M. Rowland, D. Hern\u00e1ndez-Lobato, T. Bui, and R.-E. Turner. Black-box alpha-divergence minimization. In 33rd ICML, 2016.\n[19] M. Jaggi. Revisiting Frank-Wolfe: Projection-free sparse convex optimization. In 30th ICML, 2013.\n[20] J. Kivinen, M. Warmuth, and B. Hassibi. The p-norm generalization of the LMS algorithm for adaptive \ufb01ltering. IEEE Trans. SP, 54:1782\u20131793, 2006.\n[21] D. Kuang, S. Yun, and H. Park. SymNMF: nonnegative low-rank approximation of a similarity matrix for graph clustering. J. Global Optimization, 62:545\u2013574, 2014.\n[22] P. Mar\u00e9chal. On a functional operation generating convex functions, part 1: duality. J. of Optimization Theory and Applications, 126:175\u2013189, 2005.\n[23] P. Mar\u00e9chal. On a functional operation generating convex functions, part 2: algebraic properties. J. of Optimization Theory and Applications, 126:375\u2013366, 2005.\n[24] A.-K. Menon and C.-S. Ong. Linking losses for class-probability and density ratio estimation. In ICML, 2016.\n[25] R. Nock, P. Luosto, and J. Kivinen. Mixed Bregman clustering with approximation guarantees. In ECML, 2008.\n[26] R. Nock and F. Nielsen. Bregman divergences and surrogates for learning. IEEE PAMI, 31:2048\u20132059, 2009.\n[27] R. Nock, F. Nielsen, and S.-I. Amari. On conformal divergences and their population minimizers. IEEE Trans. IT, 62:527\u2013538, 2016.\n[28] M. Reid and R. Williamson. Information, divergence and risk for binary experiments. JMLR, 12:731\u2013817, 2011.\n[29] M.-D. Reid and R.-C. Williamson. Composite binary losses. JMLR, 11:2387\u20132422, 2010.\n[30] G. Rong, M. Jin, and X. Guo. Hyperbolic centroidal Voronoi tessellation. In 14th ACM SPM, 2010.\n[31] O. Schwander and F. Nielsen. Matrix Information Geometry, chapter Learning Mixtures by Simplifying Kernel Density Estimators, pages 403\u2013426. Springer Berlin Heidelberg, 2013.\n[32] S. Shalev-Shwartz, Y. Singer, and N. Srebro. Pegasos: Primal estimated sub-gradient solver for SVM. In ICML \u201908, pages 807\u2013814. ACM, 2007.\n[33] H. Shimodaira. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference, 90(2):227\u2013244, 2000.\n[34] J. Straub, G. Rosman, O. Freifeld, J.-J. Leonard, and J.-W. Fisher III. A mixture of Manhattan frames: Beyond the Manhattan world. In Proc. of the 27th IEEE CVPR, pages 3770\u20133777, 2014.\n[35] M. Sugiyama, T. Suzuki, and T. Kanamori. Density-ratio matching under the Bregman divergence: a uni\ufb01ed framework of density-ratio estimation. AISM, 64(5):1009\u20131044, 2012.\n[36] R. Vidal. Subspace clustering. IEEE Signal Processing Magazine, 28:52\u201368, 2011.\n[37] R.-C. Williamson, E. Vernet, and M.-D. Reid. Composite multiclass losses, 2014. Unpublished manuscript.\n[38] H. Zhang and S. Sra. First-order methods for geodesically convex optimization. CoRR, abs/1602.06053, 2016.\n", "award": [], "sourceid": 8, "authors": [{"given_name": "Richard", "family_name": "Nock", "institution": "Data61 and ANU"}, {"given_name": "Aditya", "family_name": "Menon", "institution": "NICTA"}, {"given_name": "Cheng Soon", "family_name": "Ong", "institution": "Data61"}]}