{"title": "On Regularizing Rademacher Observation Losses", "book": "Advances in Neural Information Processing Systems", "page_first": 37, "page_last": 45, "abstract": "It has recently been shown that supervised learning linear classifiers with two of the most popular losses, the logistic and square loss, is equivalent to optimizing an equivalent loss over sufficient statistics about the class: Rademacher observations (rados). It has also been shown that learning over rados brings solutions to two prominent problems for which the state of the art of learning from examples can be comparatively inferior and in fact less convenient: protecting and learning from private examples, learning from distributed datasets without entity resolution. Bis repetita placent: the two proofs of equivalence are different and rely on specific properties of the corresponding losses, so whether these can be unified and generalized inevitably comes to mind. This is our first contribution: we show how they can be fit into the same theory for the equivalence between example and rado losses. As a second contribution, we show that the generalization unveils a surprising new connection to regularized learning, and in particular a sufficient condition under which regularizing the loss over examples is equivalent to regularizing the rados (i.e. the data) in the equivalent rado loss, in such a way that an efficient algorithm for one regularized rado loss may be as efficient when changing the regularizer. This is our third contribution: we give a formal boosting algorithm for the regularized exponential rado-loss which boost with any of the ridge, lasso, \\slope, l_\\infty, or elastic nets, using the same master routine for all. Because the regularized exponential rado-loss is the equivalent of the regularized logistic loss over examples we obtain the first efficient proxy to the minimisation of the regularized logistic loss over examples using such a wide spectrum of regularizers. 
Experiments with readily available code show that regularization significantly improves rado-based learning and compares favourably with example-based learning.", "full_text": "On Regularizing Rademacher Observation Losses\n\nRichard Nock\n\nData61, The Australian National University & The University of Sydney\n\nrichard.nock@data61.csiro.au\n\nAbstract\n\nIt has recently been shown that supervised learning of linear classifiers with two of the most popular losses, the logistic and square loss, is equivalent to optimizing an equivalent loss over sufficient statistics about the class: Rademacher observations (rados). It has also been shown that learning over rados brings solutions to two prominent problems for which the state of the art of learning from examples can be comparatively inferior and in fact less convenient: (i) protecting and learning from private examples, (ii) learning from distributed datasets without entity resolution. Bis repetita placent: the two proofs of equivalence are different and rely on specific properties of the corresponding losses, so whether these can be unified and generalized inevitably comes to mind. This is our first contribution: we show how they can be fit into the same theory for the equivalence between example and rado losses. As a second contribution, we show that the generalization unveils a surprising new connection to regularized learning, and in particular a sufficient condition under which regularizing the loss over examples is equivalent to regularizing the rados (i.e. the data) in the equivalent rado loss, in such a way that an efficient algorithm for one regularized rado loss may be as efficient when changing the regularizer. 
This is our third contribution: we give a formal boosting algorithm for the regularized exponential rado-loss which boosts with any of the ridge, lasso, SLOPE, ℓ∞, or elastic net regularizers, using the same master routine for all. Because the regularized exponential rado-loss is the equivalent of the regularized logistic loss over examples, we obtain the first efficient proxy to the minimization of the regularized logistic loss over examples using such a wide spectrum of regularizers. Experiments with readily available code show that regularization significantly improves rado-based learning and compares favourably with example-based learning.\n\n1\n\nIntroduction\n\nWhat kind of data should we use to train a supervised learner? A recent result has shown that minimising the popular logistic loss over examples with linear classifiers (in supervised learning) is equivalent to the minimisation of the exponential loss over sufficient statistics about the class known as Rademacher observations (rados, [Nock et al., 2015]), for the same classifier. In short, we fit a classifier over data that is different from examples, and the same classifier generalizes well to new observations. It has been shown that rados offer solutions for two problems for which the state of the art involving examples can be comparatively significantly inferior:\n\n• protection of the examples' privacy from various algebraic, geometric, statistical and computational standpoints, and learning from private data [Nock et al., 2015];\n\n• learning from a large number of distributed datasets without having to perform entity resolution between datasets [Patrini et al., 2016].\n\nQuite remarkably, the training time of the algorithms involved can be smaller than it would be on examples, by orders of magnitude [Patrini et al., 2016]. 
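To make the central object concrete early on, here is a minimal sketch (Python; the function and variable names are ours, not the paper's) of how a Rademacher observation is assembled from labeled examples, following the definition πσ := (1/2)·Σ_i (σi + yi)·xi of Nock et al. [2015], recalled in Section 3 below: only the examples whose label yi agrees with σi contribute, each adding its edge vector yi·xi.

```python
def rado(X, y, sigma):
    """pi_sigma = (1/2) * sum_i (sigma_i + y_i) * x_i  (Nock et al., 2015).

    (sigma_i + y_i)/2 is y_i when sigma_i == y_i and 0 otherwise, so only the
    agreeing examples contribute, each adding its edge vector y_i * x_i."""
    d = len(X[0])
    return [0.5 * sum((s + yi) * x[k] for x, yi, s in zip(X, y, sigma))
            for k in range(d)]

# three examples in R^2, labels in {-1, +1}
X = [[1.0, 0.0], [0.0, 2.0], [3.0, 1.0]]
y = [1.0, -1.0, 1.0]

# sigma = y recovers the sum of all edge vectors y_i * x_i
assert rado(X, y, y) == [4.0, -1.0]

# a generic sigma keeps only the examples where sigma_i == y_i
assert rado(X, y, [1.0, 1.0, -1.0]) == [1.0, 0.0]
```

With m examples there are 2^m possible σ, hence 2^m rados; the algorithms later in the paper work with a small random subset of them.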
30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\nTwo key problems remain, however. First, the accuracy of learning from rados can compete experimentally with that of learning from examples, yet there is a gap to reduce for rados to be not just a good material to learn from in a privacy/distributed setting, but also a serious alternative to learning from examples at large, yielding new avenues to supervised learning. Second, theoretically speaking, it is now known that two widely popular losses over examples admit an equivalent loss in the rado world: the logistic loss and the square loss [Nock et al., 2015, Patrini et al., 2016]. This inevitably suggests that this property may hold for more losses, yet barely anything in the existing proofs displays patterns of generalizability.\n\nOur contributions: in this paper, we provide answers to these two questions, with three main contributions. Our first contribution is to show that this generalization indeed holds: other example losses admit equivalent losses in the rado world, meaning in particular that their minimiser classifier is the same, regardless of the dataset of examples. The technique we use exploits a two-player zero-sum game representation of convex losses, which has been very useful to analyse boosting algorithms [Schapire, 2003, Telgarsky, 2012], with one key difference: payoffs are non-linear convex, possibly non-differentiable. These also resemble the entropic dual losses [Reid et al., 2015], with the difference that we do not enforce conjugacy over the simplex. The conditions of the game are slightly different for examples and rados. We provide necessary and sufficient conditions for the resulting losses over examples and rados to be equivalent. Informally, equivalence happens iff the convex functions of the games satisfy a symmetry relationship and the weights satisfy a linear system of equations. 
Some popular losses fit in the equivalence [Nair and Hinton, 2010, Gentile and Warmuth, 1998, Nock and Nielsen, 2008, Telgarsky, 2012, Vapnik, 1998, van Rooyen et al., 2015].\n\nOur second contribution came unexpectedly through this equivalence. Regularizing a loss is standard in machine learning [Bach et al., 2011]. We show a sufficient condition for the equivalence under which regularizing the example loss is equivalent to regularizing the rados in the equivalent rado loss, i.e. making a Minkowski sum of the rado set with a classifier-based set. This property is independent of the regularizer, and incidentally happens to hold for all our cases of equivalence (cf. first contribution). A regularizer added to a loss over examples thus transfers to data in the rado world, in essentially the same way for all regularizers, and if one can solve the non-trivial computational and optimization problem that this data modification poses for one regularized rado loss, then, basically,\n\n"A good optimization algorithm for this regularized rado loss may fit other regularizers as well."\n\nOur third contribution exemplifies this. We propose an iterative boosting algorithm, Ω-R.ADABOOST, that learns a classifier from rados using the exponential regularized rado loss, with regularization choice belonging to the ridge, lasso, ℓ∞, or the recently coined SLOPE [Bogdan et al., 2015]. Since rado regularization would theoretically require modifying the data at each iteration, such schemes are computationally non-trivial. We show that this modification can in fact be bypassed for the exponential rado loss, and the resulting algorithm, Ω-R.ADABOOST, is as fast as ADABOOST. 
Ω-R.ADABOOST has however a key advantage over ADABOOST that to our knowledge is new in the boosting world: for any of these four regularizers, Ω-R.ADABOOST is a boosting algorithm. Thus, because of the equivalence between the minimization of the logistic loss over examples and the minimization of the exponential rado loss, Ω-R.ADABOOST is in fact an efficient proxy to boost the regularized logistic loss over examples using whichever of the four regularizers, and by extension, linear combinations of them (e.g., elastic net regularization [Zou and Hastie, 2005]). We are not aware of any formal boosting algorithm for the regularized logistic loss with such a wide spectrum of regularizers. Extensive experiments validate this property: Ω-R.ADABOOST improves over ADABOOST (unregularized or regularized) all the more as the domain gets larger, and is able to rapidly learn both accurate and sparse classifiers, making it an especially good contender for supervised learning at large on big domains.\n\nThe rest of this paper is as follows. Sections 2, 3 and 4 respectively present the equivalence between example and rado losses, its extension to regularized learning, and Ω-R.ADABOOST. Sections 5 and 6 respectively present experiments, and conclude. In order not to overload the paper's body, a Supplementary Material (SM) contains the proofs and additional theoretical and experimental results.\n\n2 Games and equivalent example/rado losses\n\nTo avoid notational load, we briefly present our learning setting so as to point out the key quantity in our formulation of the general two-player game. Let [m] := {1, 2, ..., m} and Σm := {−1, 1}^m, for m > 0. The classical (batch) supervised learner is example-based: it is given a set of examples S = {(xi, yi), i ∈ [m]} where xi ∈ R^d, yi ∈ Σ1, ∀i ∈ [m]. It returns a classifier h : R^d → R from a predefined set H. 
Let zi(h) := yi·h(xi) and abbreviate z(h) by z for short. The learner fits h to the minimization of a loss. Table 1, column ℓe, presents some losses that can be used: we remark that h appears only through z, so let us consider in this section that the learner rather fits vector z ∈ R^m.\n\nWe can now define our two-player game setting. Let φe : R → R and φr : R → R be two convex and lower-semicontinuous generators. We define functions Le : R^m × R^m → R and Lr : R^{2^m} × R^m → R:\n\nLe(p, z) := Σ_{i∈[m]} pi·zi + μe·Σ_{i∈[m]} φe(pi) ,  (1)\n\nLr(q, z) := Σ_{I⊆[m]} qI·Σ_{i∈I} zi + μr·Σ_{I⊆[m]} φr(qI) ,  (2)\n\nwhere μe, μr > 0 do not depend on z. For the notation to be meaningful, the coordinates in q are assumed (wlog) to be in bijection with 2^[m]. The dependence of both problems on their respective generators is implicit and shall be clear from context. The adversary's goal is to fit\n\np*(z) := argmin_{p ∈ R^m} Le(p, z) ,  (3)\n\nq*(z) := argmin_{q ∈ H_{2^m}} Lr(q, z) ,  (4)\n\nwith H_{2^m} := {q ∈ R^{2^m} : 1⊤q = 1}, so as to attain\n\nLe(z) := Le(p*(z), z) ,  (5)\n\nLr(z) := Lr(q*(z), z) ,  (6)\n\nand let ∂Le(z) and ∂Lr(z) denote their subdifferentials. We view the learner's task as the problem of maximising the corresponding problems in eq. (5) (with examples; this is already sketched above) or (6) (with what we shall call Rademacher observations, or rados), or equivalently minimising the negative of the corresponding function, and then resorting to a loss function. 
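As a quick numerical illustration of the inner minimization (a sketch under our own assumptions, not code from the paper): for the exponential/logistic case of Table 1, row I, the generator φr(z) = z log z − z symmetrises (Definition 3 below) into φe(p) = p log p + (1 − p) log(1 − p) − 1. Minimising p·z + μ·φe(p) coordinate-wise over p ∈ (0, 1) gives p* = 1/(1 + e^{z/μ}) and optimal value −μ·(log(1 + e^{−z/μ}) + 1), so −Le(z) = μe·Σ_i log(1 + exp(ze_i)) + μe·m: fe is linear with ae = μe, matching row I. A brute-force check:

```python
import math

def phi_e(p):
    # symmetrised generator for the logistic/exponential case (Table 1, row I):
    # phi_r(z) = z log z - z, hence phi_e(p) = phi_r(p) + phi_r(1 - p)
    return (p * math.log(p) - p) + ((1 - p) * math.log(1 - p) - (1 - p))

def inner_min(z, mu, grid=100000):
    # brute-force min over p in (0, 1) of the adversary's objective p*z + mu*phi_e(p)
    return min(p / grid * z + mu * phi_e(p / grid) for p in range(1, grid))

mu = 2.0
for z in (-3.0, -0.5, 0.0, 1.0, 4.0):
    # closed form of the coordinate-wise minimum derived above
    closed_form = -mu * (math.log(1.0 + math.exp(-z / mu)) + 1.0)
    assert abs(inner_min(z, mu) - closed_form) < 1e-6
```

The additive constant −μ per coordinate is absorbed by the linear map fe of Definition 5, which is exactly why this case will satisfy the linearity hypothesis of Theorem 7 below.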
The question of when these two problems are equivalent from the learner's standpoint motivates the following definition.\n\nDefinition 1 Two generators φe, φr are said proportionate iff ∀m > 0, there exists (μe, μr) such that\n\nLe(z) = Lr(z) + b , ∀z ∈ R^m  (7)\n\n(b does not depend on z). ∀m ∈ N*, let\n\nGm := [ 0⊤_{2^{m−1}}  1⊤_{2^{m−1}} ; G_{m−1}  G_{m−1} ]  (∈ {0, 1}^{m×2^m})  (8)\n\nif m > 1, and G1 := [0 1] otherwise (notation z_d indicates a vector in R^d).\n\nTheorem 2 φe, φr are proportionate iff the optima p*(z) and q*(z) to eqs (3) and (4) satisfy:\n\np*(z) ∈ ∂Lr(z) ,  (9)\n\nGm·q*(z) ∈ ∂Le(z) .  (10)\n\nIf φe, φr are differentiable and strictly convex, they are proportionate iff p*(z) = Gm·q*(z).\n\nWe can relax the requirement that convexity be strict, which results in a set-valued identity for φe, φr to be proportionate. This gives a necessary and sufficient condition for two generators to be proportionate. It does not say how to construct one from the other, if possible. We now show that it is indeed possible and prune the search space: if φe is proportionate to some φr, then it has to be a "symmetrized" version of φr, according to the following definition.\n\nDefinition 3 Let φr s.t. dom(φr) ⊇ (0, 1). 
φ_{s(r)}(z) := φr(z) + φr(1 − z) is the symmetrisation of φr.\n\nLemma 4 If φe and φr are proportionate, then φe(z) = (μr/μe)·φ_{s(r)}(z) + (b/μe) (b is in (7)).\n\n# | ℓe(z, μe) | ℓr(z, μr) | φr(z) | μe and μr | ae\nI | Σ_{i∈[m]} log(1 + exp(ze_i)) | Σ_{I⊆[m]} exp(zr_I) | z log z − z | ∀μe = μr | μe\nII | Σ_{i∈[m]} (1 + ze_i)^2 | −(E_I[−zr_I] − μr·V_I[−zr_I]) | (1/2)·z^2 | ∀μe = μr | μe/4\nIII | Σ_{i∈[m]} max{0, ze_i} | max{0, max_{I⊆[m]} zr_I} | χ_{[0,1]}(z) | ∀μe, μr | μe\nIV | Σ_{i∈[m]} ze_i | E_I[zr_I] | χ_{[1/2^m, 1/2]}(z) | ∀μe, μr | μe\n\nTable 1: Examples of equivalent example and rado losses. Names of the rado-losses ℓr(z, μr) are respectively the Exponential (I), Mean-variance (II), ReLU (III) and Unhinged (IV) rado loss. We use shorthands ze_i := −(1/μe)·zi and zr_I := −(1/μr)·Σ_{i∈I} zi. Parameter ae appears in eq. (14). Column "μe and μr" gives the constraints for the equivalence to hold. E_I and V_I are the expectation and variance over uniform sampling in sets I ⊆ [m] (see text for details).\n\nTo summarize, φe and φr are proportionate iff (i) they meet the structural property that φe is (proportional to) the symmetrized version of φr (according to Definition 3), and (ii) the optimal solutions p*(z) and q*(z) to problems (1) and (2) satisfy the conditions of Theorem 2. Depending on the direction, we have two cases to craft proportionate generators. First, if we have φr, then necessarily φe ∝ φ_{s(r)}, so we merely have to check Theorem 2. 
Second, if we have φe, then it matches Definition 3 (see footnote 1). In this case, we have to find φr = f + g where g(z) = −g(1 − z) and φe(z) = f(z) + f(1 − z).\n\nWe now come back to Le(z), Lr(z) (Definition 1), and make the connection with example and rado losses. In the next definition, an e-loss ℓe(z) is a function defined over the coordinates of z, and an r-loss ℓr(z) is a function defined over the sums of subsets of coordinates. Functions can depend on other parameters as well.\n\nDefinition 5 Suppose e-loss ℓe(z) and r-loss ℓr(z) are such that there exist fe : R → R and fr : R → R, both strictly increasing, such that ∀z ∈ R^m,\n\n−Le(z) = fe(ℓe(z)) ,  (11)\n\n−Lr(z) = fr(ℓr(z)) ,  (12)\n\nwhere Le(z) and Lr(z) are defined via two proportionate generators φe and φr (Definition 1). Then the couple (ℓe, ℓr) is called a couple of equivalent example-rado losses.\n\nFollowing is the main Theorem of this Section, which summarizes all the cases of equivalence between example and rado losses, and shows that the theory developed on example/rado losses with proportionate generators encompasses the specific proofs and cases already known [Nock et al., 2015, Patrini et al., 2016]. Table 1 also displays generator φr.\n\nTheorem 6 In each row of Table 1, ℓe(z, μe) and ℓr(z, μr) are equivalent for μe and μr as indicated.\n\nThe proof (SM, Subsection 2.3) details for each case the proportionate generators φe and φr.\n\n3 Learning with (rado) regularized losses\n\nWe now detail further the learning setting. In the preceding Section, we have defined zi(h) := yi·h(xi), which we plug into the losses of Table 1 to obtain the corresponding example and rado losses. 
Losses simplify conveniently when H consists of linear classifiers, h(x) := θ⊤x for some θ ∈ Θ ⊆ R^d. In this case, the example loss can be described using edge vectors Se := {yi·xi, i = 1, 2, ..., m}, since zi = θ⊤(yi·xi), and the rado loss can be described using Rademacher observations [Nock et al., 2015], since Σ_{i∈I} zi = θ⊤πσ for σi = yi iff i ∈ I (and −yi otherwise), where πσ := (1/2)·Σ_i (σi + yi)·xi. Let us define S*r := {πσ, σ ∈ Σm}, the set of all Rademacher observations. We rewrite any couple of equivalent example and rado losses as ℓe(Se, θ) and ℓr(S*r, θ) respectively (see footnote 2), omitting parameters μe and μr, assumed to be fixed beforehand for the equivalence to hold (see Table 1). Let us regularize the example loss, so that the learner's goal is to minimize\n\nℓe(Se, θ, Ω) := ℓe(Se, θ) + Ω(θ) ,  (13)\n\nwith Ω a regularizer [Bach et al., 2011].\n\nFootnote 1: Alternatively, −φe is permissible [Kearns and Mansour, 1999].\nFootnote 2: To prevent notational overload, we blend notions of (pointwise) loss and (samplewise) risk, as just "losses".\n\nAlgorithm 1 Ω-R.ADABOOST\nInput: set of rados Sr := {π1, π2, ..., πn}; T ∈ N*; parameters γ ∈ (0, 1), ω ∈ R+;\nStep 1: let θ0 ← 0, w0 ← (1/n)·1;\nStep 2: for t = 1, 2, ..., T:\n  Step 2.1: call the weak learner: (ι(t), rt) ← Ω-WL(Sr, wt, γ, ω, θ_{t−1});\n  Step 2.2: compute update parameters α_{ι(t)} and δt (here, π*_k := max_j |π_{jk}|):\n    α_{ι(t)} ← (1/(2π*_{ι(t)}))·log((1 + rt)/(1 − rt))  and  δt ← ω·(Ω(θt) − Ω(θ_{t−1})) ;  (16)\n  Step 2.3: update and normalize weights: for j = 1, 2, ..., n,\n    w_{tj} ← w_{(t−1)j}·exp(−α_t·π_{jι(t)} + δt)/Zt ;  (17)\nReturn θT;\n\nThe following shows that when fe in eq. (11) is linear, there is a rado-loss equivalent to this regularized loss, regardless of Ω.\n\nTheorem 7 Suppose H contains linear classifiers. Let (ℓe(Se, θ), ℓr(S*r, θ)) be any couple of equivalent example-rado losses such that fe in eq. (11) is linear:\n\nfe(z) = ae·z + be ,  (14)\n\nfor some ae > 0, be ∈ R. Then for any regularizer Ω(.) (assuming wlog Ω(0) = 0), the regularized example loss ℓe(Se, θ, Ω) is equivalent to the rado loss ℓr(S*_r^{Ω,θ}, θ) computed over regularized rados:\n\nS*_r^{Ω,θ} := S*r ⊕ {−Ω̃(θ)·θ} ,  (15)\n\nwhere ⊕ is the Minkowski sum and Ω̃(θ) := ae·Ω(θ)/‖θ‖_2^2 if θ ≠ 0 (and 0 otherwise).\n\nTheorem 7 applies to all rado losses (I-IV) in Table 1. The effect of regularization on rados is intuitive from the margin standpoint: assume that a "good" classifier θ is one that ensures lowerbounded inner products θ⊤z ≥ τ for some margin threshold τ. Then any good classifier on a regularized rado πσ shall actually meet, over examples, Σ_{i:yi=σi} θ⊤(yi·xi) ≥ τ + ae·Ω(θ). This inequality ties an "accuracy" of θ (edges, left-hand side) to its sparsity (right-hand side). Clearly, Theorem 7 has an unfamiliar shape, since regularisation modifies data in the rado world: a different θ, or a different Ω, yields a different S*_r^{Ω,θ}, and therefore it may seem very tricky to minimize such a regularized loss. 
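The data modification of eq. (15) is simple to state operationally. The following sketch (Python, with our own hypothetical helper names) shifts every rado by −Ω̃(θ)·θ and checks the margin remark above: every inner product θ⊤π drops by exactly ae·Ω(θ), whatever the regularizer.

```python
def regularized_rados(rados, theta, omega_val, a_e=1.0):
    # eq. (15): each rado is shifted by -tilde_Omega(theta) * theta, with
    # tilde_Omega(theta) = a_e * Omega(theta) / ||theta||_2^2
    n2 = sum(t * t for t in theta)
    scale = 0.0 if n2 == 0 else a_e * omega_val / n2
    return [[p - scale * t for p, t in zip(pi, theta)] for pi in rados]

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

theta = [1.0, -2.0, 0.5]
omega_val = sum(abs(t) for t in theta)      # e.g. lasso: Omega(theta) = ||theta||_1
rados = [[3.0, 1.0, 0.0], [-1.0, 0.5, 2.0]]

shifted = regularized_rados(rados, theta, omega_val)
# every inner product theta^T pi drops by exactly a_e * Omega(theta):
for pi, pi_s in zip(rados, shifted):
    assert abs(dot(theta, pi_s) - (dot(theta, pi) - omega_val)) < 1e-9
```

This is just the algebra θ⊤(π − c·θ) = θ⊤π − c·‖θ‖²₂ with c = ae·Ω(θ)/‖θ‖²₂; the non-trivial part, addressed next, is that the shift depends on the current θ.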
Even more, iterative algorithms like boosting algorithms look at first glance like a poor choice, since any update on θ implies an update on the rados as well. What we show in the following Section is essentially the opposite for the exponential rado loss: a generalization of the RADOBOOST algorithm of Nock et al. [2015], which does not modify rados, is a formal boosting algorithm for a broad set of regularizers. Also, remarkably, only the high-level code of the weak learner depends on the regularizer; that of the strong learner is not affected.\n\n4 Boosting with (rado) regularized losses\n\nΩ-R.ADABOOST presents our approach to learning with rados regularized with regularizer Ω, to minimise loss ℓ^exp_r(Sr, θ, Ω) in eq. (45). Classifier θt is defined as θt := Σ_{t'=1}^{t} α_{ι(t')}·1_{ι(t')}, where 1_k is the kth canonical basis vector. The expected edge rt used to compute αt in eq. (16) is based on the following basis assignation:\n\nr_{ι(t)} ← (1/π*_{ι(t)})·Σ_{j=1}^{n} w_{tj}·π_{jι(t)}  (∈ [−1, 1]) .  (19)\n\nThe computation of rt is optionally tweaked by the weak learner, as displayed in Algorithm Ω-WL. We investigate four choices for Ω. 
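For concreteness, here is a compact sketch of the main loop (Algorithm 1 together with eq. (19); Python, with our own simplifications: the weak learner just picks the feature with the largest |r_k|, and we clamp the edge for all regularizers for numerical safety, whereas the paper's Ω-WL clamps for the ridge only).

```python
import math

def exp_rado_loss(theta, rados):
    # exponential rado loss (Table 1, row I), up to scaling: sum_j exp(-theta^T pi_j)
    return sum(math.exp(-sum(t * p for t, p in zip(theta, pi))) for pi in rados)

def omega_r_adaboost(rados, T, omega, Omega, gamma=0.9):
    n, d = len(rados), len(rados[0])
    pi_star = [max(abs(pi[k]) for pi in rados) for k in range(d)]  # pi*_k = max_j |pi_jk|
    theta, w = [0.0] * d, [1.0 / n] * n
    for _ in range(T):
        # Step 2.1 (weak learner): edges r_k of eq. (19); pick the largest |r_k|
        r = [sum(wj * pi[k] for wj, pi in zip(w, rados)) / pi_star[k] for k in range(d)]
        k = max(range(d), key=lambda i: abs(r[i]))
        rt = max(-gamma, min(gamma, r[k]))  # clamped for all Omega here (paper: ridge only)
        # Step 2.2: leveraging coefficient (16) and regularization offset delta_t
        alpha = (1.0 / (2.0 * pi_star[k])) * math.log((1.0 + rt) / (1.0 - rt))
        prev = Omega(theta)
        theta[k] += alpha
        delta = omega * (Omega(theta) - prev)
        # Step 2.3: weight update (17); delta is shared by all j, so it cancels in Z_t
        w = [wj * math.exp(-alpha * pi[k] + delta) for wj, pi in zip(w, rados)]
        Z = sum(w)
        w = [wj / Z for wj in w]
    return theta

rados = [[2.0, 1.0], [1.0, -1.0], [3.0, 0.5], [2.0, -0.5]]
theta = omega_r_adaboost(rados, T=20, omega=0.01,
                         Omega=lambda th: sum(abs(t) for t in th))  # lasso
# boosting drives the exponential rado loss below its value at theta = 0 (= n)
assert exp_rado_loss(theta, rados) < exp_rado_loss([0.0, 0.0], rados)
```

Note how the rados themselves are never modified: the regularizer only enters through the offset δt, which matters in the analysis but leaves the normalized weights unchanged, which is how the data modification of eq. (15) is bypassed.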
For each of them, we prove the boosting ability of Ω-R.ADABOOST (Γ is symmetric positive definite, S_d is the symmetric group of order d, |θ| is the vector whose coordinates are the absolute values of the coordinates of θ):\n\nΩ(θ) = ‖θ‖1 := |θ|⊤1 (Lasso), or ‖θ‖²_Γ := θ⊤Γθ (Ridge), or ‖θ‖∞ := max_k |θ_k| (ℓ∞), or ‖θ‖Φ := max_{M∈S_d} (M|θ|)⊤ξ (SLOPE)  (20)\n\n[Bach et al., 2011, Bogdan et al., 2015, Duchi and Singer, 2009, Su and Candès, 2015]. The coordinates of ξ in SLOPE are ξ_k := Φ^{−1}(1 − kq/(2d)), where Φ^{−1}(.) is the quantile of the standard normal distribution and q ∈ (0, 1); thus, the largest coordinates (in absolute value) of θ are more penalized.\n\nAlgorithm 2 Ω-WL, for Ω ∈ {‖.‖1, ‖.‖²_Γ, ‖.‖∞, ‖.‖Φ}\nInput: set of rados Sr := {π1, π2, ..., πn}; weights w ∈ △n; parameters γ ∈ (0, 1), ω ∈ R+; classifier θ ∈ R^d;\nStep 1: pick weak feature ι* ∈ [d];\n  Optional: use preference order ι ⪰ ι' ⇔ |r_ι| − δ_ι ≥ |r_{ι'}| − δ_{ι'}  (18)\n  // δ_ι := ω·(Ω(θ + α_ι·1_ι) − Ω(θ)), r_ι is given in (19) and α_ι is given in (16)\nStep 2: if Ω = ‖.‖²_Γ then r* ← r_{ι*} if r_{ι*} ∈ [−γ, γ], and r* ← sign(r_{ι*})·γ otherwise; else r* ← r_{ι*};\nReturn (ι*, r*);\n
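The four penalties of display (20) are straightforward to implement. In the sketch below (Python, our own code), SLOPE uses the fact that the maximum over permutation matrices M is attained by sorting: since the ξ_k are decreasing in k, pairing the magnitudes |θ| sorted in decreasing order with ξ maximizes (M|θ|)⊤ξ, by the rearrangement inequality.

```python
import statistics

def lasso(theta):
    return sum(abs(t) for t in theta)

def ridge(theta, Gamma):
    # theta^T Gamma theta, for Gamma symmetric positive definite
    d = len(theta)
    return sum(theta[i] * Gamma[i][j] * theta[j]
               for i in range(d) for j in range(d))

def linf(theta):
    return max(abs(t) for t in theta)

def slope(theta, q):
    # ||theta||_Phi = max_{M in S_d} (M|theta|)^T xi; the xi_k = Phi^{-1}(1 - kq/(2d))
    # are decreasing in k, so the max pairs sorted |theta| with xi (rearrangement)
    d = len(theta)
    ndist = statistics.NormalDist()
    xi = [ndist.inv_cdf(1.0 - (k + 1) * q / (2.0 * d)) for k in range(d)]
    mags = sorted((abs(t) for t in theta), reverse=True)
    return sum(m * x for m, x in zip(mags, xi))

theta = [0.5, -2.0, 1.0]
assert lasso(theta) == 3.5
assert linf(theta) == 2.0
assert ridge(theta, [[1, 0, 0], [0, 1, 0], [0, 0, 1]]) == 5.25
# SLOPE penalizes the largest |theta_k| the most: shrinking the top
# coordinate reduces the penalty faster than shrinking a smaller one
assert slope(theta, 0.1) > slope([0.5, -1.0, 1.0], 0.1)
```

For q ∈ (0, 1) all the ξ_k are positive, so the sorting argument applies; `statistics.NormalDist().inv_cdf` is the standard-library normal quantile Φ^{−1}.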
We now establish the boosting ability of Ω-R.ADABOOST. We give no direction for Step 1 in Ω-WL, which is consistent with the definition of a weak learner in the boosting theory: all we require from the weak learner is an edge |r.| no smaller than some weak learning threshold γWL > 0.\n\nDefinition 8 Fix any constant γWL ∈ (0, 1). Ω-WL is said to be a γWL-Weak Learner iff the feature ι(t) it picks at iteration t satisfies |r_{ι(t)}| ≥ γWL, for any t = 1, 2, ..., T.\n\nWe also provide an optional step for the weak learner in Ω-WL, which we exploit in the experiments, and which gives a total preference order on features to further optimise Ω-R.ADABOOST.\n\nTheorem 9 (boosting with ridge). Take Ω(.) = ‖.‖²_Γ. Fix any 0 < a < 1/5, and suppose that ω and the number of iterations T of Ω-R.ADABOOST are chosen so that\n\nω < (2a·min_k max_j π²_{jk})/(T·λΓ) ,  (21)\n\nwhere λΓ > 0 is the largest eigenvalue of Γ. Then there exists some γ > 0 (depending on a, and given to Ω-WL) such that for any fixed 0 < γWL < γ, if Ω-WL is a γWL-Weak Learner, then Ω-R.ADABOOST returns at the end of the T boosting iterations a classifier θT which meets:\n\nℓ^exp_r(Sr, θT, ‖.‖²_Γ) ≤ exp(−a·γ²WL·T/2) .  (22)\n\nFurthermore, if we fix a = 1/7, then we can fix γ = 0.98, and if a = 1/10, then we can fix γ = 0.999.\n\nTwo remarks are in order. First, the cases a = 1/7, 1/10 show that Ω-WL can still obtain large edges in eq. (19), so even a "strong" weak learner might fit in for Ω-WL, without clamping edges. Second, the right-hand side of ineq. (21) may be very large if we consider that min_k max_j π²_{jk} may be proportional to m². 
So the constraint on ω is in fact loose.\n\nTheorem 10 (boosting with lasso or ℓ∞). Take Ω(.) ∈ {‖.‖1, ‖.‖∞}. Suppose Ω-WL is a γWL-Weak Learner for some γWL > 0. Suppose ∃ 0 < a < 3/11 s.t. ω satisfies:\n\nω = a·γWL·min_k max_j |π_{jk}| .  (23)\n\nThen Ω-R.ADABOOST returns at the end of the T boosting iterations a classifier θT which meets:\n\nℓ^exp_r(Sr, θT, Ω) ≤ exp(−˜T·γ²WL/2) ,  (24)\n\nwhere ˜T = a·γWL·T if Ω = ‖.‖1, and ˜T = (T − T*) + a·γWL·T* if Ω = ‖.‖∞; T* is the number of iterations where the feature computing the ℓ∞ norm was updated (see footnote 3).\n\nWe finally investigate the SLOPE choice. The Theorem is proven for ω = 1 in Ω-R.ADABOOST, for two reasons: it matches the original definition [Bogdan et al., 2015], and furthermore it unveils an interesting connection between boosting and SLOPE properties.\n\nTheorem 11 (boosting with SLOPE). Take Ω(.) = ‖.‖Φ. Let a := min{3γWL/11, Φ^{−1}(1 − q/(2d))/(min_k max_j |π_{jk}|)}. Suppose wlog that |θ_{Tk}| ≥ |θ_{T(k+1)}|, ∀k, and fix ω = 1. 
Suppose (i) Ω-WL is a γWL-Weak Learner for some γWL > 0, and (ii) the q-value is chosen to meet:\n\nq ≥ 2·max_k { (1 − Φ((3γWL/11)·max_j |π_{jk}|)) / (k/d) } .\n\nThen the classifier θT returned by Ω-R.ADABOOST at the end of the T boosting iterations satisfies:\n\nℓ^exp_r(Sr, θT, ‖.‖Φ) ≤ exp(−a·γ²WL·T/2) .  (25)\n\nConstraint (ii) on q is interesting in the light of the properties of SLOPE [Su and Candès, 2015]. Modulo some assumptions, SLOPE yields a control of the false discovery rate (FDR), i.e., of the negligible coefficients in the "true" linear model θ* that are found significant in the learned θ. Constraint (ii) links the "small" achievable FDR (upperbounded by q) to the "boostability" of the data: the fact that each feature k can be chosen by the weak learner for a "large" γWL, or has max_j |π_{jk}| large, precisely flags potential significant features, thus reducing the risk of sparsity errors, and allowing small q, which is constraint (ii). Using the second-order approximation of normal quantiles [Su and Candès, 2015], a sufficient condition for (ii) is that, for some K > 0,\n\nγWL·min_k max_j |π_{jk}| ≥ K·√(log d + log q^{−1}) ;  (26)\n\nbut min_k max_j |π_{jk}| is proportional to m, so ineq. (26), and thus (ii), may hold even for small samples and q-values. An additional Theorem, deferred to SM for space considerations, shows that for any applicable choice of regularization (eq. 
20), the regularized log-loss of θT over examples enjoys with high probability a monotonically decreasing upperbound with T, as: ℓ^log_e(Se, θ, Ω) ≤ log 2 − κ·T + τ(m), with τ(m) → 0 when m → ∞ (and τ does not depend on T), and κ > 0 does not depend on T. Hence, Ω-R.ADABOOST is an efficient proxy to boost the regularized log-loss over examples, using whichever of the ridge, lasso, ℓ∞ or SLOPE regularization (establishing the first boosting algorithm for this choice), or linear combinations of the choices, e.g. for elastic nets. If we were to compare Theorems 9-11 (eqs (22), (24), (25)), then the convergence looks best for ridge (the unsigned exponent is ˜O(γ²WL)), while it looks slightly worse for ℓ∞ and SLOPE (the unsigned exponent is now ˜O(γ³WL)), the lasso being in between.\n\n5 Experiments\n\nWe have implemented Ω-WL (see footnote 4) using the suggested preference order to retrieve the topmost feature in the order. Hence, the weak learner returns the feature maximising |r_ι| − δ_ι. The rationale for this comes from the proofs of Theorems 9-11, showing that ∏_t exp(−(r²_{ι(t)}/2 − δ_{ι(t)})) is an upperbound on the exponential regularized rado-loss. We do not clamp the weak learner for Ω(.) = ‖.‖²_Γ, so the weak learner is restricted to Step 1 in Ω-WL (see footnote 5).\n\nThe objective of these experiments is to evaluate Ω-R.ADABOOST as a contender for supervised learning per se. We compared Ω-R.ADABOOST to ADABOOST and ℓ1-regularized ADABOOST [Schapire and Singer, 1999, Xi et al., 2009]. 
All algorithms are run for a total of T = 1000 iterations, and at the end of the iterations, the classifier in the sequence that minimizes the empirical loss is kept. Notice therefore that rado-based classifiers are evaluated on the training set which computes the rados. To obtain very sparse solutions for regularized-ADABOOST, we pick its ω (β in [Xi et al., 2009]) in {10⁻⁴, 1, 10⁴}. The complete results aggregate experiments on twenty (20) domains, all but one coming from the UCI repository [Bache and Lichman, 2013] (plus the Kaggle competition domain "Give me some credit"), with up to d = 500+ features and m = 100 000+ examples. Two tables in the SM (Tables 1 and 2 in Section 3) report respectively the test errors and sparsity of classifiers, whose summary is given here in Table 2. The experimental setup is a ten-fold stratified cross-validation for all algorithms on each domain. ADABOOST/regularized-ADABOOST is trained using the complete training fold. When the domain size m ≤ 40000, the n rados used for Ω-R.ADABOOST form a random subset of rados of size equal to that of the training fold. When the domain size exceeds 40000, a random set of n = 10000 rados is computed from the training fold. Thus, (i) there is no optimisation of the examples chosen to compute rados, (ii) we always keep a very small number of rados compared to the maximum available, and (iii) when the domain size gets large, we keep a comparatively tiny number of rados.

3. If several features match this criterion, T∗ is the total number of iterations for all these features.
4. Code available at: http://users.cecs.anu.edu.au/~rnock/
5. The values for ω that we test, in {10⁻ᵘ, u ∈ {0, 1, 2, 3, 4, 5}}, are small with respect to the upperbound in ineq. (21) given the number of boosting steps (T = 1000), and would yield on most domains a maximal γ ≈ 1.
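The rado subsampling protocol above can be sketched as follows. The rado construction π_σ = ½ Σ_i (σ_i + y_i) x_i follows [Nock et al., 2015]; the function names and the uniform draw of σ are illustrative assumptions:

```python
import random

def rado_budget(m):
    # Number of rados n used in the experiments: the training-fold
    # size up to m = 40000 examples, a fixed n = 10000 beyond.
    return m if m <= 40000 else 10000

def sample_rados(xs, ys, n, seed=0):
    # Draw n rados pi_sigma = 0.5 * sum_i (sigma_i + y_i) * x_i, each from
    # a uniformly random sigma in {-1,+1}^m; with labels y_i in {-1,+1},
    # example i contributes y_i * x_i when sigma_i = y_i and 0 otherwise.
    rng = random.Random(seed)
    m, d = len(xs), len(xs[0])
    rados = []
    for _ in range(n):
        sigma = [rng.choice((-1, 1)) for _ in range(m)]
        rados.append([0.5 * sum((sigma[i] + ys[i]) * xs[i][k]
                                for i in range(m)) for k in range(d)])
    return rados
```

Since 2^m rados are available in principle, drawing only `rado_budget(m)` of them keeps a tiny fraction of the maximum, as stated in (ii) and (iii).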
Hence, the performances of Ω-R.ADABOOST do not stem from any optimization in the choice or size of the rado sample.

Table 2: Number of domains for which the algorithm in row beats the algorithm in column, over rows and columns Ada, ∅, ‖.‖²_Γ, ‖.‖₁, ‖.‖∞, ‖.‖_Φ (Ada = best result of ADABOOST, ∅ = Ω-R.ADABOOST not regularized, see text).

Experiments support several key observations. First, regularization consistently reduces the test error of Ω-R.ADABOOST, by more than 15% on Magic and 20% on Kaggle. In Table 2, Ω-R.ADABOOST unregularized ("∅") is virtually always beaten by its SLOPE-regularized version. Second, Ω-R.ADABOOST is able to obtain both very sparse and accurate classifiers (Magic, Hardware, Marketing, Kaggle). Third, Ω-R.ADABOOST competes with or beats ADABOOST on all domains, and is all the better as the domain gets bigger. Even qualitatively, as seen in Table 2, the best result obtained by ADABOOST (regularized or not) does not manage to beat any of the regularized versions of Ω-R.ADABOOST on the majority of the domains. Fourth, it is important to have several choices of regularizers at hand: on domain Statlog, the difference in test error between the worst and the best regularization of Ω-R.ADABOOST exceeds 15%. Fifth, as already remarked [Nock et al., 2015], significantly subsampling rados (e.g. Marketing, Kaggle) still yields very accurate classifiers.
Sixth, regularization in Ω-R.ADABOOST successfully reduces sparsity to learn more accurate classifiers on several domains (Spectf, Transfusion, Hill-noise, Winered, Magic, Marketing), achieving efficient adaptive sparsity control. Last, the comparatively very poor results of ADABOOST on the biggest domains seem to come from another advantage of rados that the theory developed so far does not take into account: on domains for which some features are significantly correlated with the class and for which we have a large number of examples, the concentration of the expected feature value in rados seems to provide leveraging coefficients that tend to have much larger (absolute) value than in ADABOOST, making the convergence of Ω-R.ADABOOST significantly faster than that of ADABOOST. For example, we have checked that it takes much more than T = 1000 iterations for ADABOOST to start converging to the results of regularized Ω-R.ADABOOST on Hardware or Kaggle.

6 Conclusion

We have shown that the recent equivalences between two example and rado losses can be unified and generalized via a principled representation of a loss function in a two-player zero-sum game. Furthermore, we have shown that this equivalence extends to regularized losses, where the regularization in the rado loss is performed over the rados themselves with Minkowski sums. Our theory and experiments on Ω-R.ADABOOST with prominent regularizers (including ridge, lasso, ℓ∞, SLOPE) indicate that when such a simple regularized form of the rado loss is available, it may help to devise accurate and efficient workarounds to boost a regularized loss over examples via the rado loss, even when the regularizer is significantly more involved, like e.g.
for group norms [Bach et al., 2011].

Acknowledgments

Thanks are due to Stephen Hardy and Giorgio Patrini for stimulating discussions around this material.

References

F. Bach, R. Jenatton, J. Mairal, and G. Obozinski. Optimization with sparsity-inducing penalties. Foundations and Trends in Machine Learning, 4:1–106, 2011.

K. Bache and M. Lichman. UCI machine learning repository, 2013.

M. Bogdan, E. van den Berg, C. Sabatti, W. Su, and E.-J. Candès. SLOPE – adaptive variable selection via convex optimization. Annals of Applied Statistics, 2015. Also arXiv:1310.1969v2.

J.-C. Duchi and Y. Singer. Efficient learning using forward-backward splitting. In NIPS*22, pages 495–503, 2009.

C. Gentile and M. Warmuth. Linear hinge loss and average margin. In NIPS*11, pages 225–231, 1998.

M. J. Kearns and Y. Mansour. On the boosting ability of top-down decision tree learning algorithms. J. Comp. Syst. Sc., 58:109–128, 1999.

V. Nair and G. Hinton. Rectified linear units improve restricted Boltzmann machines. In 27th ICML, pages 807–814, 2010.

R. Nock and F. Nielsen. On the efficient minimization of classification-calibrated surrogates. In NIPS*21, pages 1201–1208, 2008.

R. Nock, G. Patrini, and A. Friedman. Rademacher observations, private data, and boosting. In 32nd ICML, pages 948–956, 2015.

G. Patrini, R. Nock, S. Hardy, and T. Caetano. Fast learning from distributed datasets without entity matching. In 26th IJCAI, 2016.

M.-D. Reid, R.-M. Frongillo, R.-C. Williamson, and N.-A. Mehta. Generalized mixability via entropic duality. In 28th COLT, pages 1501–1522, 2015.

R.-E. Schapire. The boosting approach to machine learning: An overview. In D.-D. Denison, M.-H. Hansen, C.-C. Holmes, B. Mallick, and B. Yu, editors, Nonlinear Estimation and Classification, volume 171 of Lecture Notes in Statistics, pages 149–171.
Springer Verlag, 2003.

R. E. Schapire and Y. Singer. Improved boosting algorithms using confidence-rated predictions. MLJ, 37:297–336, 1999.

W. Su and E.-J. Candès. SLOPE is adaptive to unknown sparsity and asymptotically minimax. CoRR, abs/1503.08393, 2015.

M. Telgarsky. A primal-dual convergence analysis of boosting. JMLR, 13:561–606, 2012.

B. van Rooyen, A. Menon, and R.-C. Williamson. Learning with symmetric label noise: The importance of being unhinged. In NIPS*28, 2015.

V. Vapnik. Statistical Learning Theory. John Wiley, 1998.

Y.-T. Xi, Z.-J. Xiang, P.-J. Ramadge, and R.-E. Schapire. Speed and sparsity of regularized boosting. In 12th AISTATS, pages 615–622, 2009.

H. Zou and T. Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society B, 67:301–321, 2005.