{"title": "On Coresets for Logistic Regression", "book": "Advances in Neural Information Processing Systems", "page_first": 6561, "page_last": 6570, "abstract": "Coresets are one of the central methods to facilitate the analysis of large data. We continue a recent line of research applying the theory of coresets to logistic regression. First, we show the negative result that no strongly sublinear sized coresets exist for logistic regression. To deal with intractable worst-case instances we introduce a complexity measure $\\mu(X)$, which quantifies the hardness of compressing a data set for logistic regression. $\\mu(X)$ has an intuitive statistical interpretation that may be of independent interest. For data sets with bounded $\\mu(X)$-complexity, we show that a novel sensitivity sampling scheme produces the first provably sublinear $(1\\pm\\eps)$-coreset. We illustrate the performance of our method by comparing to uniform sampling as well as to state of the art methods in the area. The experiments are conducted on real world benchmark data for logistic regression.", "full_text": "On Coresets for Logistic Regression\n\nAlexander Munteanu\n\nDepartment of Computer Science\n\nTU Dortmund University\n44227 Dortmund, Germany\n\nChris Schwiegelshohn\n\nDepartment of Computer Science\n\nSapienza University of Rome\n\n00185 Rome, Italy\n\nalexander.munteanu@tu-dortmund.de\n\nschwiegelshohn@diag.uniroma1.it\n\nChristian Sohler\n\nDepartment of Computer Science\n\nTU Dortmund University\n44227 Dortmund, Germany\n\nchristian.sohler@tu-dortmund.de\n\nDavid P. Woodruff\n\nDepartment of Computer Science\n\nCarnegie Mellon University\nPittsburgh, PA 15213, USA\ndwoodruf@cs.cmu.edu\n\nAbstract\n\nCoresets are one of the central methods to facilitate the analysis of large data. We\ncontinue a recent line of research applying the theory of coresets to logistic regres-\nsion. First, we show the negative result that no strongly sublinear sized coresets\nexist for logistic regression. 
To deal with intractable worst-case instances we introduce a complexity measure µ(X), which quantifies the hardness of compressing a data set for logistic regression. µ(X) has an intuitive statistical interpretation that may be of independent interest. For data sets with bounded µ(X)-complexity, we show that a novel sensitivity sampling scheme produces the first provably sublinear (1 ± ε)-coreset. We illustrate the performance of our method by comparing to uniform sampling as well as to state of the art methods in the area. The experiments are conducted on real world benchmark data for logistic regression.

1 Introduction

Scalability is one of the central challenges of modern data analysis and machine learning. Algorithms with polynomial running time might be regarded as efficient in a conventional sense, but nevertheless become intractable when facing massive data sets. As a result, applying data reduction techniques in a preprocessing step to speed up a subsequent optimization problem has received considerable attention. A natural approach is to sub-sample the data according to a certain probability distribution. This approach has been successfully applied to a variety of problems including clustering [31, 22, 6, 4], mixture models [21, 33], low rank approximation [16], spectral approximation [2, 32], and Nyström methods [2, 37].

The unifying feature of these works is that the probability distribution is based on the sensitivity score of each point. Informally, the sensitivity of a point corresponds to the importance of the point with respect to the objective function we wish to minimize. If the total sensitivity, i.e., the sum of all sensitivity scores, is bounded by a reasonably small value S, there exists a collection of input points known as a coreset with very strong aggregation properties. 
Given any candidate solution (e.g., a set of k centers for k-means, or a hyperplane for linear regression), the objective function computed on the coreset evaluates to the objective function of the original data up to a small multiplicative error. See Sections 2 and 4 for formal definitions of sensitivity and coresets.

Our Contribution  We investigate coresets for logistic regression within the sensitivity framework. Logistic regression is an instance of a generalized linear model where we are given data Z ∈ Rn×d and labels Y ∈ {−1, 1}n. The optimization task consists of minimizing the negative log-likelihood ∑_{i=1}^n ln(1 + exp(−YiZiβ)) with respect to the parameter β ∈ Rd [34].

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

• Our first contribution is an impossibility result: logistic regression has no sublinear streaming algorithm. Due to a standard reduction between coresets and streaming algorithms, this also implies that logistic regression admits no coresets or bounded sensitivity scores in general.
• Our second contribution is an investigation of available sensitivity sampling distributions for logistic regression. For points with large contribution, where −YiZiβ ≫ 0, the objective function increases by a term almost linear in −YiZiβ. This calls into question the use of sensitivity scores designed for problems with squared cost functions such as ℓ2-regression, k-means, and ℓ2-based low-rank approximation. Instead, we propose sampling from a mixture distribution with one component proportional to the square roots of the squared ℓ2 leverage scores. Though seemingly similar to the sampling distributions of, e.g., [6, 4] at first glance, it is important to note that sampling according to squared ℓ2 scores is different from sampling according to their square roots. 
The former is good for ℓ2-related loss functions, while the latter preserves ℓ1-related functions such as the linear part of the original logistic regression loss function studied here. The other mixture component is uniform sampling to deal with the remaining domain, where the cost function consists of an exponential decay towards zero. Our experiments show that this distribution outperforms uniform and k-means based sensitivity sampling by a wide margin on real data sets. The algorithm is space efficient, can be implemented in a variety of models used to handle large data sets, such as 2-pass streaming and massively parallel frameworks like Hadoop and MapReduce, and runs in input sparsity time Õ(nnz(Z)), where nnz(Z) denotes the number of non-zero entries of the data [12].
• Our third contribution is an analysis of our sampling distribution for a parametrized class of instances we call µ-complex, placing our work in the framework of beyond worst-case analysis [5, 39]. The parameter µ roughly corresponds to the ratio between the log of correctly estimated odds and the log of incorrectly estimated odds. The condition of small µ is justified by the fact that for instances with large µ, logistic regression exhibits methodological problems like imbalance and separability, cf. [35, 26]. We show that the total sensitivity of logistic regression can be bounded in terms of µ, and that our sampling scheme produces the first coreset of provably sublinear size, provided that µ is small.

Related Work  There is more than a decade of extensive work on sampling based methods relying on the sensitivity framework for ℓ2-regression [19, 20, 32, 15] and ℓ1-regression [10, 40, 11]. These were generalized to ℓp-regression for all p ∈ [1, ∞) [17, 44]. More recent works study sampling methods for M-estimators [14, 13] and extensions to generalized linear models [27, 36]. 
The contemporary theory behind coresets has been applied to logistic regression, first by [38] using first order gradient methods, and subsequently via sensitivity sampling by [27]. In the latter work, the authors recovered the result that bounded sensitivity scores for logistic regression imply coresets. Explicit sublinear bounds on the sensitivity scores, as well as an algorithm for computing them, were left as an open question. Instead, they proposed using sensitivity scores derived from any k-means clustering for logistic regression. While high sensitivity scores of an input point for k-means provably do not imply a high sensitivity score of the same point for logistic regression, the authors observed that they can outperform uniform random sampling on a number of instances with a clustering structure. Recently and independently of our work, [41] gave a coreset construction for logistic regression in a more general framework. Our construction is without regularization and can therefore also be applied to any regularized version of logistic regression, but we have constraints regarding the µ-complexity of the input. Their result is for squared ℓ2 regularization, which significantly changes the objective and does not carry over to the unconstrained version. They do not constrain the input, but their domain of optimization is bounded. This indicates that both results differ in many important points and are of independent interest.

All proofs and additional plots from the experiments are in Appendices A and B, respectively.

2 Preliminaries and Problem Setting

In logistic regression we are given a data matrix Z ∈ Rn×d and labels Y ∈ {−1, 1}n. 
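The negative log-likelihood recalled in the introduction can be evaluated in a numerically stable way. A minimal NumPy sketch (the function name is ours, not from the paper; np.logaddexp(0, t) computes ln(1 + exp(t)) without overflow):

```python
import numpy as np

def logistic_nll(Z, Y, beta, w=None):
    """Weighted negative log-likelihood sum_i w_i ln(1 + exp(-Y_i <Z_i, beta>))."""
    t = -Y * (Z @ beta)  # fold the labels and the sign into the margin
    w = np.ones(len(Y)) if w is None else np.asarray(w, dtype=float)
    return float(np.sum(w * np.logaddexp(0.0, t)))

# at beta = 0 every point contributes ln 2
Z = np.array([[1.0, 0.0], [0.0, 1.0]])
Y = np.array([1.0, -1.0])
print(logistic_nll(Z, Y, np.zeros(2)))  # 2 ln 2 ≈ 1.386
```

The naive expression ln(1 + exp(t)) overflows already for moderate margins, which is why the logaddexp form is preferable in practice.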
Logistic regression has a negative log-likelihood [34]

L(β|Z, Y) = ∑_{i=1}^n ln(1 + exp(−YiZiβ)),

which, from a learning and optimization perspective, is the objective function that we would like to minimize over β ∈ Rd. For brevity we fold, for all i ∈ [n], the labels Yi as well as the factor −1 in the exponent into X ∈ Rn×d comprising the row vectors xi = −YiZi. Let g(z) = ln(1 + exp(z)). For technical reasons we deal with a weighted version for weights w ∈ Rn>0, where each weight satisfies wi > 0. Any positive scaling of the all ones vector 1 = {1}n corresponds to the unweighted case. We denote by Dw a diagonal matrix carrying the entries of w, i.e., (Dw)ii = wi, so that multiplying Dw to a vector or matrix has the effect of scaling row i by a factor of wi. The objective function becomes

fw(Xβ) = ∑_{i=1}^n wi g(xiβ) = ∑_{i=1}^n wi ln(1 + exp(xiβ)).

In this paper we assume we have a very large number of observations in a moderate number of dimensions, that is, n ≫ d. In order to speed up the computation and to lower memory and storage requirements we would like to significantly reduce the number of observations without losing much information in the original data. A suitable data compression reduces the size to a sublinear number of o(n) data points, while the dependence on d and the approximation parameters may be polynomials of low degree. To achieve this, we design a so-called coreset construction for the objective function. A coreset is a possibly (re)weighted and significantly smaller subset of the data that approximates the objective value for any possible query point. More formally, we define coresets for the weighted logistic regression function.

Definition 1 ((1 ± ε)-coreset for logistic regression). 
Let X ∈ Rn×d be a set of points weighted by w ∈ Rn>0. Then a set C ∈ Rk×d, (re)weighted by u ∈ Rk>0, is a (1 ± ε)-coreset of X for fw if k ≪ n and

∀β ∈ Rd : |fw(Xβ) − fu(Cβ)| ≤ ε · fw(Xβ).

µ-Complex Data Sets  We will see in Section 3 that in general, there is no sublinear one-pass streaming algorithm approximating the objective function up to any finite constant factor. More specifically, there exists no sublinear summary or coreset construction that works for all data sets. For the sake of developing coreset constructions that work reasonably well, as well as conducting a formal analysis beyond worst-case instances, we introduce a measure µ that quantifies the complexity of compressing a given data set.

Definition 2. Given a data set X ∈ Rn×d weighted by w ∈ Rn>0 and a vector β ∈ Rd, let (DwXβ)− denote the vector comprising only the negative entries of DwXβ. Similarly, let (DwXβ)+ denote the vector of positive entries. We define for X weighted by w

µw(X) = sup_{β ∈ Rd\{0}} ‖(DwXβ)+‖1 / ‖(DwXβ)−‖1 .

X weighted by w is called µ-complex if µw(X) ≤ µ.

The size of our (1 ± ε)-coreset constructions for logistic regression for a given µ-complex data set X will have low polynomial dependency on µ, d, 1/ε but only sublinear dependency on its original size parameter n. So for µ-complex data sets having small µ(X) ≤ µ we have the first (1 ± ε)-coreset of provably sublinear size. The above definition implies, for µ(X) ≤ µ, the following inequalities. 
The reader should keep in mind that for all β ∈ Rd

µ−1 ‖(DwXβ)−‖1 ≤ ‖(DwXβ)+‖1 ≤ µ ‖(DwXβ)−‖1 .

We conjecture that computing the value of µ(X) exactly is hard. However, it can be approximated in polynomial time. It is not necessary to do so in practical applications, but we include this result for those who wish to evaluate whether their data has nice µ-complexity.

Theorem 3. Let X ∈ Rn×d be weighted by w ∈ Rn>0. Then a poly(d)-approximation to the value of µw(X) can be computed in O(poly(nd)) time.

The parameter µ(X) has an intuitive interpretation and might be of independent interest. The odds of a binary random variable V are defined as P[V = 1]/P[V = 0]. The model assumption of logistic regression is that for every sample Xi, the logarithm of the odds is the linear function Xiβ. For a candidate β, multiplying all odds and taking the logarithm is then exactly ‖Xβ‖1. Our definition now relates the probability mass due to the incorrectly predicted odds to the probability mass due to the correctly predicted odds. We say that the ratio between these two is upper bounded by µ. For logistic regression, assuming they are within some order of magnitude of each other is not uncommon. One extreme is the (degenerate) case where the data set is exactly separable. Choosing β to parameterize a separating hyperplane for which Xβ is all positive implies that µ(X) = ∞. Another case is when we have a large ratio between the number of positively and negatively labeled points, which is a lower bound on µ. Under either of these conditions, logistic regression exhibits methodological weaknesses due to the separation or imbalance between the given classes, cf. 
[35, 26].

3 Lower Bounds

At first glance, one might think of taking a uniform sample as a coreset. We demonstrate and discuss on worst-case instances in Appendix C that this does not work, in theory or in practice. In the following we show a much stronger result, namely that no efficient streaming algorithms or coresets for logistic regression can exist in general, even if we assume that the points lie in 2-dimensional Euclidean space. To this end we reduce from the INDEX communication game. In its basic variant, there are two players, Alice and Bob. Alice is given a binary bit string x ∈ {0, 1}n and Bob is given an index i ∈ [n]. The goal is to determine the value of xi with constant probability while using as little communication as possible. Clearly, the difficulty of the problem is inherently one-way; otherwise Bob could simply send his index to Alice. If the entire communication consists of only a single message sent by Alice to Bob, the message must contain Ω(n) bits [30].

Theorem 4. Let Z ∈ Rn×2, Y ∈ {−1, 1}n be an instance of logistic regression in 2-dimensional Euclidean space. Any one-pass streaming algorithm that approximates the optimal solution of logistic regression up to any finite multiplicative approximation factor requires Ω(n/log n) bits of space.

A similar reduction also holds if Alice's message consists of points forming a coreset. Hence, the following corollary holds.

Corollary 5. Let Z ∈ Rn×2, Y ∈ {−1, 1}n be an instance of logistic regression in 2-dimensional Euclidean space. Any coreset of Z, Y for logistic regression consists of at least Ω(n/log n) points.

We note that the proof can be slightly modified to rule out any finite additive error as well. This indicates that the notion of lightweight coresets with multiplicative and additive error [4] is not a sufficient relaxation. 
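The failure mode behind these negative results can be made concrete on a toy 1-dimensional instance (our own construction, in the spirit of the worst-case discussion deferred to Appendix C): n − 1 points whose cost vanishes as β grows, plus a single point that alone caps the optimum. A uniform sample that misses the single point is misled arbitrarily badly. A minimal NumPy sketch:

```python
import numpy as np

def g(t):
    return np.logaddexp(0.0, t)  # g(z) = ln(1 + exp(z)), evaluated stably

n = 1000
x = -np.ones(n)   # n - 1 "easy" points: cost g(-beta) -> 0 as beta -> +inf
x[0] = 1e3        # one point whose cost g(1e3 * beta) alone caps the optimum

betas = np.linspace(-20.0, 20.0, 4001)
full = np.array([g(x * b).sum() for b in betas])

# a uniform sample of 10 points misses x[0] with probability ~0.99;
# we take such a sample and reweight each sampled point by n / 10
sample = x[1:11]
sub = np.array([(n / 10) * g(sample * b).sum() for b in betas])

b_sub = betas[sub.argmin()]              # driven to the grid boundary
ratio = full[sub.argmin()] / full.min()  # true cost there explodes
print(b_sub, ratio)
```

On the subsample the objective is monotonically decreasing in β, so any optimizer runs off towards the boundary, where the cost on the full data is a large factor above the optimum.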
Independently of our work, [41] gave a linear lower bound in a more general context, based on a worst-case instance for the sensitivity approach due to [27]. Our lower bounds and theirs are incomparable; they show that a coreset that may only consist of input points comprises the entire data set in the worst case. We show that no coreset with o(n/log n) points can exist, irrespective of whether input points are used. While the distinction may seem minor, a number of coreset constructions in the literature necessitate the use of non-input points, see [1] and [23].

4 Sampling via Sensitivity Scores

Our sampling based coreset constructions are obtained with the following approach, called sensitivity sampling. Suppose we are given a data set X ∈ Rn×d together with weights w ∈ Rn>0 as in Definition 1. Recall the function under study is fw(Xβ) = ∑_{i=1}^n wi · g(xiβ). Associate with each point xi the function gi(β) = g(xiβ). Then we have the following definition.

Definition 6. [31] Consider a family of functions F = {g1, . . . , gn} mapping from Rd to [0, ∞) and weighted by w ∈ Rn>0. The sensitivity of gi for fw(β) = ∑_{i=1}^n wi gi(β) is

ςi = sup wi gi(β) / fw(β),   (1)

where the sup is over all β ∈ Rd with fw(β) > 0. If this set is empty then ςi = 0. The total sensitivity is S = ∑_{i=1}^n ςi.

The sensitivity of a point measures its worst-case importance for approximating the objective function on the entire input data set. Performing importance sampling proportional to the sensitivities of the input points thus yields a good approximation. Computing the sensitivities is often intractable and involves solving the original optimization problem to near-optimality, which is the problem we want to solve in the first place, as pointed out in [8]. To get around this, it was shown that any upper bound 
To get around this, it was shown that any upper bound\n\n4\n\n\fi=1 si \u2265(cid:80)n\n\ndepends on the total sensitivity, that is, the sum of their estimates S =(cid:80)n\n\non the sensitivities si \u2265 \u03c2i also has provable guarantees. However, the number of samples needed\ni=1 \u03c2i = S, so\nwe need to carefully control this quantity. Another complexity measure that plays a crucial role in the\nsampling complexity is the VC dimension of the range space induced by the set of functions under\nstudy.\nDe\ufb01nition 7. A range space is a pair R = (F, ranges) where F is a set and ranges is a family of\nsubsets of F. The VC dimension \u2206(R) of R is the size |G| of the largest subset G \u2286 F such that G\nis shattered by ranges, i.e., |{G \u2229 R | R \u2208 ranges}| = 2|G|.\nDe\ufb01nition 8. Let F be a \ufb01nite set of functions mapping from Rd to R\u22650. For every \u03b2 \u2208 Rd and\nr \u2208 R\u22650, let rangeF (\u03b2, r) = {f \u2208 F | f (\u03b2) \u2265 r}, and ranges(F) = {rangeF (\u03b2, r) | \u03b2 \u2208 Rd, r \u2208\nR\u22650}, and RF = (F, ranges(F)) be the range space induced by F.\nRecently a framework combining the sensitivity scores with a theory on the VC dimension of range\nspaces was developed in [8]. For technical reasons we use a slightly modi\ufb01ed version.\nTheorem 9. Consider a family of functions F = {f1, . . . , fn} mapping from Rd to [0,\u221e) and a\nvector of weights w \u2208 Rn\ni=1 si \u2265 S. Given si one\ncan compute in time O(|F|) a set R \u2282 F of\n\n(cid:19)(cid:19)(cid:19)\n\n>0. Let \u03b5, \u03b4 \u2208 (0, 1/2). Let si \u2265 \u03c2i. 
Let S = ∑_{i=1}^n si ≥ S. Given the si one can compute in time O(|F|) a set R ⊂ F of

O( (S/ε²) (∆ log S + log(1/δ)) )

weighted functions such that with probability 1 − δ we have for all β ∈ Rd simultaneously

| ∑_{f∈F} wi fi(β) − ∑_{f∈R} ui fi(β) | ≤ ε ∑_{f∈F} wi fi(β),

where each element of R is sampled i.i.d. with probability pj = sj/S from F, ui = S wj / (sj |R|) denotes the weight of a function fi ∈ R that corresponds to fj ∈ F, and where ∆ is an upper bound on the VC dimension of the range space RF∗ induced by F∗, obtained by defining F∗ to be the set of functions fj ∈ F where each function is scaled by S wj / (sj |R|).

Now we show that the VC dimension of the range space induced by the set of functions studied in logistic regression can be related to the VC dimension of the set of linear classifiers. We first start with a fixed common weight and then generalize the result to a finite set of distinct weights.

Lemma 10. Let X ∈ Rn×d, c ∈ R>0. The range space induced by Fc_log = {c · g(xiβ) | i ∈ [n]} satisfies ∆(R_{Fc_log}) ≤ d + 1.

Lemma 11. Let X ∈ Rn×d be weighted by w ∈ Rn where wi ∈ {v1, . . . , vt} for all i ∈ [n]. The range space induced by Flog = {wi · g(xiβ) | i ∈ [n]} satisfies ∆(R_{Flog}) ≤ t · (d + 1).

We will see later how to bound the number of distinct weights t by a logarithmic term in the range of the involved weights. 
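Once upper bounds si are available, the sampling and reweighting step of Theorem 9 is mechanical. A minimal NumPy sketch (the function name is ours; the score vector is an assumed input):

```python
import numpy as np

def sensitivity_sample(s, w, k, rng):
    """Draw k i.i.d. indices j with probability p_j = s_j / S and assign the
    coreset weights u_j = S * w_j / (s_j * k) as in Theorem 9, so that the
    weighted subsample is an unbiased estimator of f_w for every fixed beta."""
    s = np.asarray(s, dtype=float)
    w = np.asarray(w, dtype=float)
    S = s.sum()
    idx = rng.choice(len(s), size=k, replace=True, p=s / S)
    u = S * w[idx] / (s[idx] * k)
    return idx, u

rng = np.random.default_rng(1)
idx, u = sensitivity_sample(np.ones(4), np.ones(4), 2, rng)
print(idx, u)  # with equal scores every coreset weight is S / k = 4 / 2 = 2
```

Unbiasedness is immediate: E[∑_j u_j f_j(β)] = k · ∑_i p_i · (S w_i / (s_i k)) f_i(β) = ∑_i w_i f_i(β).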
It remains for us to derive tight and efficiently computable upper bounds on the sensitivities.

Base Algorithm  We show that sampling proportional to the square root of the ℓ2-leverage scores, augmented by wi / ∑_{j∈[n]} wj, yields a coreset whose size is roughly linear in µ and whose dependence on the input size is roughly √n. In what follows, let W = ∑_{i∈[n]} wi.

We make a case distinction covered by Lemmas 12 and 13. The intuition in the first case is that for a sufficiently large positive entry z, we have |z| ≤ g(z) ≤ 2|z|. The lower bound holds even for all non-negative entries. Moreover, for µ-complex inputs we are able to relate the ℓ1 norm of all entries to the positive ones, which will yield the desired bound, arguing similarly to the techniques of [13], though adapted here for logistic regression.

Lemma 12. Let X ∈ Rn×d weighted by w ∈ Rn>0 be µ-complex. Let U be an orthonormal basis for the columnspace of DwX. If for index i the β attaining the supremum in (1) satisfies 0.5 ≤ xiβ, then wi g(xiβ) ≤ 2(1 + µ) ‖Ui‖2 fw(Xβ).

In the second case, the element under study is bounded by a constant. We consider two sub-cases. If there are a lot of contributions which are not too small, and thus cost at least a constant each, then we can lower bound the total cost by a constant times their total weight. If on the other hand there are many very small negative values, then this implies again that the cost is within a µ fraction of the total weight.

Lemma 13. Let X ∈ Rn×d weighted by w ∈ Rn>0 be µ-complex. If for index i the β attaining the supremum in (1) satisfies 0.5 ≥ xiβ, then wi g(xiβ) ≤ (20 + µ) (wi/W) fw(Xβ).

Combining both lemmas yields general upper bounds on the sensitivities that we can use as an importance sampling distribution. We also derive an upper bound on the total sensitivity that will be used to bound the sampling complexity.

Lemma 14. Let X ∈ Rn×d weighted by w ∈ Rn>0 be µ-complex. Let U be an orthonormal basis for the columnspace of DwX. For each i ∈ [n], the sensitivity of gi(β) = g(xiβ) for the weighted logistic regression function is bounded by ςi ≤ si = (20 + 2µ) · (‖Ui‖2 + wi/W). The total sensitivity is bounded by S ≤ S ≤ 44µ √(nd).

We combine the above results into the following theorem.

Theorem 15. Let X ∈ Rn×d weighted by w ∈ Rn be µ-complex. Let ω = wmax/wmin be the ratio between the maximum and minimum weight in w. Let ε ∈ (0, 1/2). There exists a (1 ± ε)-coreset of X, w for logistic regression of size k ∈ O( (√n µ/ε²) d^{3/2} log(µnd) log(ωn) ). Such a coreset can be constructed in two passes over the data, in O(nnz(X) log n + poly(d) log n) time, and with success probability 1 − 1/n^c for any absolute constant c > 1.

Recursive Algorithm  Here we develop a recursive algorithm, inspired by the recursive sampling technique of [14] for the Huber M-estimator, though adapted here for logistic regression. This yields a better dependence on the input size. More specifically, we can diminish the leading √n factor to only log^c(n) for an absolute constant c. One complication is that the parameter µ grows in the recursion, which we need to control, while another complication is having to deal with the separate ℓ1 and uniform parts of our sampling distribution.

We apply the algorithm of Theorem 15 recursively. 
To do so, we need to ensure that after one stage of subsampling and reweighting, the resulting data set remains µ′-complex for a value µ′ that is not too much larger than µ. To this end, we first bound the VC dimension of a range space induced by an ℓ1-related family of functions.

Lemma 16. The range space induced by Fℓ1 = {hi(β) = wi |xiβ| | i ∈ [n]} satisfies ∆(R_{Fℓ1}) ≤ 10(d + 1).

Applying Theorem 9 to Fℓ1 implies that the subsample of Theorem 15 satisfies a so-called ε-subspace embedding property for ℓ1. Note that, by linearity of the ℓ1-norm, we can fold the weights into DwX.

Lemma 17. Let T be a sampling and reweighting matrix according to Theorem 15, i.e., T DwX is the resulting reweighted sample when Theorem 15 is applied to µ-complex input X, w. Then with probability 1 − 1/n^c, for all β ∈ Rd simultaneously

(1 − ε′) ‖DwXβ‖1 ≤ ‖T DwXβ‖1 ≤ (1 + ε′) ‖DwXβ‖1

holds, where ε′ = ε/√(µ + 1).

Using this, we can show that the µ-complexity is not violated too much after one stage of sampling.

Lemma 18. Let T be a sampling and reweighting matrix according to Theorem 15, where the parameter ε is replaced by ε/√(µ + 1). That is, T DwX is the resulting reweighted sample when Theorem 15 succeeds on µ-complex input X, w. Suppose that simultaneously Lemma 17 holds. Let

µ′ = µTw(X) = sup_{β∈Rd} ‖(T DwXβ)+‖1 / ‖(T DwXβ)−‖1 .

Then we have µ′ ≤ (1 + ε)µ.

Now we are ready to prove our theorem regarding the recursive subsampling algorithm.

Theorem 19. Let X ∈ Rn×d be µ-complex. Let ε ∈ (0, 1/2). 
There exists a (1 ± ε)-coreset of X for logistic regression of size k ∈ O( (µ³/ε⁴) d³ log²(µnd) log² n (log log n)⁴ ). Such a coreset can be constructed in time O((nnz(X) + poly(d)) log n log log n) in 2 log(1/η) passes over the data for a small η > 0, assuming the machine has access to sufficient memory to store and process Õ(n^η) weighted points. The success probability is 1 − 1/n^c for any absolute constant c > 1.

5 Experiments

We ran a series of experiments to illustrate the performance of our coreset method. All experiments were run on a Linux machine using an Intel i7-6700, 4 core CPU at 3.4 GHz, and 32 GB of RAM. We implemented our algorithms in Python. We compare our basic algorithm to simple uniform sampling and to sampling proportional to the sensitivity upper bounds given by [27].

Implementation Details  The approach of [27] is based on a k-means++ clustering [3] of a small uniform sample of the data and was performed using standard parameters taken from the publication. For this purpose we used parts of their original Python code. However, we removed the restriction of the domain of optimization to a region of small radius around the origin. This way, we enabled unconstrained regression in the domain Rd. The exact QR-decomposition is rather slow on large data matrices. We thus optimized the running time of our approach in the following way. We used a fast approximation algorithm based on the sketching techniques of [12], cf. [43]. This yields a provable constant-factor approximation of the square roots of the leverage scores with constant probability, cf. [18], which means that the total sensitivity bounds given in our theory grow by only a small constant factor. A detailed description of the algorithm is in the proof of Theorem 15. 
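For concreteness, the exact (non-sketched) variant of the score computation reduces to one thin QR decomposition. A simplified NumPy sketch (the constant factor (20 + 2µ) of Lemma 14 is dropped, since only the proportions matter when the scores are normalized into sampling probabilities):

```python
import numpy as np

def qr_scores(X, w):
    """Sensitivity upper bounds ~ ||U_i||_2 + w_i / W (Lemma 14), where U is
    an orthonormal basis of the columnspace of D_w X. The exact basis comes
    from a thin QR; the paper's fast variant replaces it by a sketched
    approximation in the spirit of [12, 43]."""
    w = np.asarray(w, dtype=float)
    Q, _ = np.linalg.qr(w[:, None] * X)   # D_w X = Q R with Q orthonormal
    lev_sqrt = np.linalg.norm(Q, axis=1)  # ||U_i||_2, the square roots of the
                                          # squared-l2 leverage scores
    return lev_sqrt + w / w.sum()

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3))
s = qr_scores(X, np.ones(100))
print(s.shape)  # one score per row
```

By Cauchy–Schwarz the leverage part sums to at most √(nd) and the uniform part to 1, matching the sublinear total sensitivity bound of Lemma 14 up to constants.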
The subsequent optimization was done for all approaches with the standard gradient based optimizer from the scipy.optimize package, see http://www.scipy.org/.

Data Sets  We briefly introduce the data sets that we used. The WEBB SPAM1 data consists of 350,000 unigrams with 127 features from web pages which have to be classified as spam or normal pages (61% positive). The COVERTYPE2 data consists of 581,012 cartographic observations of different forests with 54 features. The task is to predict the type of trees at each location (49% positive). The KDD CUP '993 data comprises 494,021 network connections with 41 features, and the task is to detect network intrusions (20% positive).

Experimental Assessment  For each data set we assessed the total running time for computing the sampling probabilities, sampling, and optimizing on the sample. In order to assess the approximation accuracy we examined the relative error |L(β∗|X) − L(β̃|X)| / L(β∗|X) of the negative log-likelihood for the maximum likelihood estimators obtained from the full data set, β∗, and from the subsamples, β̃. For each data set, we ran all three subsampling algorithms for thirty regular subsampling steps in the range k ∈ [⌊2√n⌋, ⌈n/16⌉]. For each step, we present the mean relative error as well as the trade-off between mean relative error and running time, taken over twenty independent repetitions, in Figure 1. Relative running times, standard deviations, and absolute values are presented in Figure 2 and Table 1 in Appendix B, respectively.

Evaluation  The accuracy of the QR-sampling distribution outperforms uniform sampling and the distribution derived from k-means on all instances. This is especially true for small sampling sizes, where the relative error, especially for uniform sampling, tends to deteriorate. 
While k-means sampling occasionally improved over uniform sampling for small sample sizes, the behavior of both distributions was similar for larger sampling sizes. The standard deviations had a similarly low magnitude as the mean values, where the QR method usually showed the lowest values. The trade-off between running time and relative error shows a common picture for WEBB SPAM and COVERTYPE: QR is nearly always more accurate than the other algorithms for a similar time budget, except for regions where the relative error is large, say above 5–10%. For larger time budgets, QR is better by a factor between 1.5 and 3 and drops more quickly towards 0. The conclusion so far could be that for a quick guess, say a 1.1-approximation, the competitors are faster, but to provably obtain a reasonably small relative error below 5%, QR outperforms its competitors. However, for KDD CUP '99, QR always has a lower error than its competitors. Their relative errors remain above 15% or much worse, while QR never exceeds 22% and drops quickly below 4%. Our estimates for µ support that KDD CUP '99 seems more difficult to approximate than the others: the estimates were 4.39 for WEBB SPAM, 1.86 for COVERTYPE, and 35.18 for KDD CUP '99.
The relative running time for the QR distribution was comparable to k-means and only slightly higher than uniform sampling. However, it never exceeded a factor of two compared to its competitors and remained negligible compared to the full optimization task, see Figure 2 in Appendix B.
The standard deviations were negligible except for the k-means algorithm and the KDD CUP '99 data set, where the uniform and k-means based algorithms showed larger values. The QR method had much lower standard deviations. This indicates that the resulting coresets are more stable for the subsequent numerical optimization. We note that the savings of all presented data reduction methods become even more significant when performing more time consuming data analysis tasks like MCMC sampling in a Bayesian setting, see e.g., [27, 24].

¹ https://www.cc.gatech.edu/projects/doi/WebbSpamCorpus.html
² https://archive.ics.uci.edu/ml/datasets/covertype
³ http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html

Figure 1 (columns: WEBB SPAM, COVERTYPE, KDD CUP '99): Each column shows the results for one data set comprising thirty different coreset sizes (depending on the individual size of the data sets). The plotted values are means taken over twenty independent repetitions of each experiment. The plots in the upper row show the mean relative log-likelihood errors of the three subsampling distributions: uniform sampling (blue), our QR-derived distribution (red), and the k-means based distribution (green). All values are relative to the corresponding optimal log-likelihood values of the optimization task on the full data set. The plots in the lower row show the trade-off between running time and relative errors (lower is better).

6 Conclusions

We first showed that (sublinear) coresets for logistic regression do not exist in general. It is thus necessary to make further assumptions on the nature of the data. To this end we introduced a new complexity measure µ(X), which quantifies the amount of overlap of positive and negative classes and the balance in their cardinalities.
We developed the first provably sublinear (1 ± ε)-coresets for logistic regression, given that the original data has small µ-complexity. The leading factor is O(ε^-2 µ √n). We further developed a recursive coreset construction that reduces the dependence on the input size to only O(log^c n) for an absolute constant c. This comes at the cost of an increased dependence on µ, but it is beneficial for very large and well-behaved data. Our algorithms are space efficient and can be implemented in a variety of models used to tackle the challenges of large data sets, such as 2-pass streaming and massively parallel frameworks like Hadoop and MapReduce; they can moreover be implemented to run in input sparsity time Õ(nnz(X)), which is especially beneficial for sparsely encoded input data. Our experimental evaluation shows that our implementation of the basic algorithm outperforms uniform sampling as well as state-of-the-art methods in the area of coresets for logistic regression, while being competitive to both regarding its running time.

Acknowledgments

We thank the anonymous reviewers for their valuable comments. We also thank our assistant Moritz Paweletz. This work was supported by the German Science Foundation (DFG) Collaborative Research Center SFB 876, projects A2 and C4. Chris Schwiegelshohn is supported in part by an ERC Advanced Grant 788893 AMDROMA. David P.
Woodruff is supported in part by an Office of Naval Research (ONR) grant N00014-18-1-2562, and part of this work was done while he was visiting the Simons Institute for the Theory of Computing.

References

[1] P. K. Agarwal, S. Har-Peled, and K. R. Varadarajan. Approximating extent measures of points. Journal of the ACM, 51(4):606–635, 2004.

[2] A. E. Alaoui and M. W. Mahoney. Fast randomized kernel ridge regression with statistical guarantees. In Advances in Neural Information Processing Systems 28 (NIPS), pages 775–783, 2015.

[3] D. Arthur and S. Vassilvitskii. k-means++: the advantages of careful seeding. In Proceedings of the 18th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 1027–1035, 2007.

[4] O. Bachem, M. Lucic, and A. Krause. Scalable k-means clustering via lightweight coresets. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD), pages 1119–1127, 2018.

[5] M.-F. Balcan, B. Manthey, H. Röglin, and T. Roughgarden. Analysis of algorithms beyond the worst case (Dagstuhl seminar 14372). Dagstuhl Reports, 4(9):30–49, 2015.

[6] A. Barger and D. Feldman. k-means for streaming and distributed big sparse data.
In Proceedings of the SIAM International Conference on Data Mining (SDM), pages 342–350, 2016.

[7] A. Blumer, A. Ehrenfeucht, D. Haussler, and M. K. Warmuth. Learnability and the Vapnik-Chervonenkis dimension. Journal of the ACM, 36(4):929–965, 1989.

[8] V. Braverman, D. Feldman, and H. Lang. New frameworks for offline and streaming coreset constructions. arXiv preprint CoRR, abs/1612.00889, 2016.

[9] M. T. Chao. A general purpose unequal probability sampling plan. Biometrika, 69(3):653–656, 1982.

[10] K. L. Clarkson. Subgradient and sampling algorithms for ℓ1 regression. In Proceedings of the 16th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 257–266, 2005.

[11] K. L. Clarkson, P. Drineas, M. Magdon-Ismail, M. W. Mahoney, X. Meng, and D. P. Woodruff. The fast Cauchy transform and faster robust linear regression. SIAM J. Comput., 45(3):763–810, 2016.

[12] K. L. Clarkson and D. P. Woodruff. Low rank approximation and regression in input sparsity time. In Symposium on Theory of Computing (STOC), pages 81–90, 2013.

[13] K. L. Clarkson and D. P. Woodruff. Input sparsity and hardness for robust subspace approximation. In IEEE 56th Annual Symposium on Foundations of Computer Science (FOCS), pages 310–329, 2015.

[14] K. L. Clarkson and D. P. Woodruff. Sketching for M-estimators: A unified approach to robust regression. In Proceedings of the 26th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 921–939, 2015.

[15] M. B. Cohen, Y. T. Lee, C. Musco, C. Musco, R. Peng, and A. Sidford. Uniform sampling for matrix approximation. In Proceedings of the Conference on Innovations in Theoretical Computer Science (ITCS), pages 181–190, 2015.

[16] M. B. Cohen, C. Musco, and C. Musco. Input sparsity time low-rank approximation via ridge leverage score sampling.
In Proceedings of the 28th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 1758–1777, 2017.

[17] A. Dasgupta, P. Drineas, B. Harb, R. Kumar, and M. W. Mahoney. Sampling algorithms and coresets for ℓp regression. SIAM Journal on Computing, 38(5):2060–2078, 2009.

[18] P. Drineas, M. Magdon-Ismail, M. W. Mahoney, and D. P. Woodruff. Fast approximation of matrix coherence and statistical leverage. Journal of Machine Learning Research, 13:3475–3506, 2012.

[19] P. Drineas, M. W. Mahoney, and S. Muthukrishnan. Sampling algorithms for ℓ2 regression and applications. In Proceedings of the 17th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 1127–1136, 2006.

[20] P. Drineas, M. W. Mahoney, and S. Muthukrishnan. Relative-error CUR matrix decompositions. SIAM Journal on Matrix Analysis and Applications, 30(2):844–881, 2008.

[21] D. Feldman, M. Faulkner, and A. Krause. Scalable training of mixture models via coresets. In Advances in Neural Information Processing Systems 24 (NIPS), pages 2142–2150, 2011.

[22] D. Feldman and M. Langberg. A unified framework for approximating and clustering data. In Proceedings of the 43rd ACM Symposium on Theory of Computing (STOC), pages 569–578, 2011.

[23] D. Feldman, M. Schmidt, and C. Sohler. Turning big data into tiny data: Constant-size coresets for k-means, PCA and projective clustering. In Proceedings of the 24th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 1434–1453, 2013.

[24] L. N. Geppert, K. Ickstadt, A. Munteanu, J. Quedenfeld, and C. Sohler. Random projections for Bayesian regression. Statistics and Computing, 27(1):79–101, 2017.

[25] G. H. Golub and C. F. van Loan. Matrix Computations (4th ed.). Johns Hopkins University Press, 2013.

[26] G. Heinze and M. Schemper. A solution to the problem of separation in logistic regression.
Statistics in Medicine, 21(16):2409–2419, 2002.

[27] J. H. Huggins, T. Campbell, and T. Broderick. Coresets for scalable Bayesian logistic regression. In Advances in Neural Information Processing Systems 29 (NIPS), pages 4080–4088, 2016.

[28] W. B. Johnson and J. Lindenstrauss. Extensions of Lipschitz mappings into a Hilbert space. Contemporary Mathematics, 26(1):189–206, 1984.

[29] M. J. Kearns and U. V. Vazirani. An Introduction to Computational Learning Theory. MIT Press, 1994.

[30] I. Kremer, N. Nisan, and D. Ron. On randomized one-round communication complexity. Computational Complexity, 8(1):21–49, 1999.

[31] M. Langberg and L. J. Schulman. Universal ε-approximators for integrals. In Proceedings of the 21st Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 598–607, 2010.

[32] M. Li, G. L. Miller, and R. Peng. Iterative row sampling. In 54th Annual IEEE Symposium on Foundations of Computer Science (FOCS), pages 127–136, 2013.

[33] M. Lucic, O. Bachem, and A. Krause. Strong coresets for hard and soft Bregman clustering with applications to exponential family mixtures. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics (AISTATS), pages 1–9, 2016.

[34] P. McCullagh and J. A. Nelder. Generalized Linear Models. Chapman & Hall, London, 1989.

[35] C. R. Mehta and N. R. Patel. Exact logistic regression: Theory and examples. Statistics in Medicine, 14(19):2143–2160, 1995.

[36] A. Molina, A. Munteanu, and K. Kersting. Core dependency networks. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence (AAAI), 2018.

[37] C. Musco and C. Musco. Recursive sampling for the Nyström method. In Advances in Neural Information Processing Systems 30 (NIPS), pages 3836–3848, 2017.

[38] S. J. Reddi, B. Póczos, and A. J. Smola.
Communication efficient coresets for empirical loss minimization. In Proceedings of the Thirty-First Conference on Uncertainty in Artificial Intelligence (UAI), pages 752–761, 2015.

[39] T. Roughgarden. Beyond worst-case analysis. Invited talk held at the Highlights of Algorithms conference (HALG), 2017.

[40] C. Sohler and D. P. Woodruff. Subspace embeddings for the L1-norm with applications. In Proceedings of the 43rd ACM Symposium on Theory of Computing (STOC), pages 755–764, 2011.

[41] E. Tolochinsky and D. Feldman. Coresets for monotonic functions with applications to deep learning. CoRR, abs/1802.07382, 2018.

[42] V. N. Vapnik. The Nature of Statistical Learning Theory. Springer, New York, USA, 1995.

[43] D. P. Woodruff. Sketching as a tool for numerical linear algebra. Foundations and Trends in Theoretical Computer Science, 10(1-2):1–157, 2014.

[44] D. P. Woodruff and Q. Zhang. Subspace embeddings and ℓp-regression using exponential random variables. In The 26th Conference on Learning Theory (COLT), pages 546–567, 2013.