{"title": "On the Fine-Grained Complexity of Empirical Risk Minimization: Kernel Methods and Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 4308, "page_last": 4318, "abstract": "Empirical risk minimization (ERM) is ubiquitous in machine learning and underlies most supervised learning methods. While there is a large body of work on algorithms for various ERM problems, the exact computational complexity of ERM is still not understood. We address this issue for multiple popular ERM problems including kernel SVMs, kernel ridge regression, and training the final layer of a neural network. In particular, we give conditional hardness results for these problems based on complexity-theoretic assumptions such as the Strong Exponential Time Hypothesis. Under these assumptions, we show that there are no algorithms that solve the aforementioned ERM problems to high accuracy in sub-quadratic time. We also give similar hardness results for computing the gradient of the empirical loss, which is the main computational burden in many non-convex learning tasks.", "full_text": "On the Fine-Grained Complexity of\n\nEmpirical Risk Minimization:\n\nKernel Methods and Neural Networks\n\nArturs Backurs\n\nCSAIL\nMIT\n\nbackurs@mit.edu\n\nPiotr Indyk\n\nCSAIL\nMIT\n\nindyk@mit.edu\n\nLudwig Schmidt\n\nCSAIL\nMIT\n\nludwigs@mit.edu\n\nAbstract\n\nEmpirical risk minimization (ERM) is ubiquitous in machine learning and under-\nlies most supervised learning methods. While there is a large body of work on\nalgorithms for various ERM problems, the exact computational complexity of ERM\nis still not understood. We address this issue for multiple popular ERM problems\nincluding kernel SVMs, kernel ridge regression, and training the \ufb01nal layer of a neu-\nral network. In particular, we give conditional hardness results for these problems\nbased on complexity-theoretic assumptions such as the Strong Exponential Time\nHypothesis. 
Under these assumptions, we show that there are no algorithms that\nsolve the aforementioned ERM problems to high accuracy in sub-quadratic time.\nWe also give similar hardness results for computing the gradient of the empirical\nloss, which is the main computational burden in many non-convex learning tasks.\n\n1\n\nIntroduction\n\nEmpirical risk minimization (ERM) has been highly in\ufb02uential in modern machine learning [37].\nERM underpins many core results in statistical learning theory and is one of the main computational\nproblems in the \ufb01eld. Several important methods such as support vector machines (SVM), boosting,\nand neural networks follow the ERM paradigm [34]. As a consequence, the algorithmic aspects of\nERM have received a vast amount of attention over the past decades. This naturally motivates the\nfollowing basic question:\n\nWhat are the computational limits for ERM algorithms?\n\nIn this work, we address this question both in convex and non-convex settings. Convex ERM problems\nhave been highly successful in a wide range of applications, giving rise to popular methods such as\nSVMs and logistic regression. Using tools from convex optimization, the resulting problems can be\nsolved in polynomial time. However, the exact time complexity of many important ERM problems\nsuch as kernel SVMs is not yet well understood. As the size of data sets in machine learning continues\nto grow, this question is becoming increasingly important. For ERM problems with millions of\nhigh-dimensional examples, even quadratic time algorithms can become painfully slow (or expensive)\nto run.\nNon-convex ERM problems have also attracted extensive research interest, e.g., in the context of deep\nneural networks. First order methods that follow the gradient of the empirical loss are not guaranteed\nto \ufb01nd the global minimizer in this setting. Nevertheless, variants of gradient descent are by far the\nmost common method for training large neural networks. 
Here, the computational bottleneck is to\ncompute a number of gradients, not necessarily to minimize the empirical loss globally. Although we\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\n\fcan compute gradients in polynomial time, the large number of parameters and examples in modern\ndeep learning still makes this a considerable computational challenge.\nUnfortunately, there are only few existing results concerning the exact time complexity of ERM or\ngradient computations. Since the problems have polynomial time algorithms, the classical machinery\nfrom complexity theory (such as NP hardness) is too coarse to apply. Oracle lower bounds from\noptimization offer useful guidance for convex ERM problems, but the results only hold for limited\nclasses of algorithms. Moreover, they do not account for the cost of executing the oracle calls, as\nthey simply lower bound their number. Overall, we do not know if common ERM problems allow\nfor algorithms that compute a high-accuracy solution in sub-quadratic or even nearly-linear time for\nall instances.1 Furthermore, we do not know if there are more ef\ufb01cient techniques for computing\n(mini-)batch gradients than simply treating each example in the batch independently.2\nWe address both questions for multiple well-studied ERM problems.\n\nHardness of ERM. First, we give conditional hardness results for minimizing the empirical risk in\nseveral settings, including kernel SVMs, kernel ridge regression (KRR), and training the top layer\nof a neural network. Our results give evidence that no algorithms can solve these problems to high\naccuracy in strongly sub-quadratic time. Moreover, we provide similar conditional hardness results\nfor kernel PCA. All of these methods are popular learning algorithms due to the expressiveness of the\nkernel or network embedding. 
Our results show that this expressiveness also leads to an expensive\ncomputational problem.\n\nHardness of gradient computation in neural networks. Second, we address the complexity of\ncomputing a gradient for the empirical risk of a neural network. In particular, we give evidence that\ncomputing (or even approximating, up to polynomially large factors) the norm of the gradient of\nthe top layer in a neural network takes time that is \u201crectangular\u201d. The time complexity cannot be\nsigni\ufb01cantly better than O(n \u00b7 m), where m is the number of examples and n is the number of units\nin the network. Hence, there are no algorithms that compute batch gradients faster than handling each\nexample individually, unless common complexity-theoretic assumptions fail.\nOur hardness results for gradient computation apply to common activation functions such as ReLU\nor sigmoid units. We remark that for polynomial activation functions (for instance, studied in [24]),\nsigni\ufb01cantly faster algorithms do exist. Thus, our results can be seen as mapping the \u201cef\ufb01ciency\nlandscape\u201d of basic machine learning sub-routines. They distinguish between what is possible and\n(likely) impossible, suggesting further opportunities for improvement.\nOur hardness results are based on recent advances in \ufb01ne-grained complexity and build on conjectures\nsuch as the Strong Exponential Time Hypothesis (SETH) [23, 22, 38]. SETH concerns the classic\nsatis\ufb01ability problem for formulas in Conjunctive Normal Form (CNF). Informally, the conjecture\nstates that there is no algorithm for checking satis\ufb01ability of a formula with n variables and m clauses\nin time less than O(cn \u00b7 poly(m)) for some c < 2.3 While our results are conditional, SETH has been\nemployed in many recent hardness results. 
Its plausibility stems from the fact that, despite 60 years of research on satisfiability algorithms, no such improvement has been discovered.

Our results hold for a significant range of the accuracy parameter. For kernel methods, our bounds hold for algorithms approximating the empirical risk up to a factor of 1 + ε, for log(1/ε) = ω(log^2 n). Thus, they provide conditional quadratic lower bounds for algorithms with, say, a log(1/ε) runtime dependence on the approximation error ε. A (doubly) logarithmic dependence on 1/ε is generally seen as the ideal rate of convergence in optimization, and algorithms with this property have been studied extensively in the machine learning community (cf. [12]). At the same time, approximate

^1 More efficient algorithms exist if the running time is allowed to be polynomial in the accuracy parameter, e.g., [35] give such an algorithm for the kernel SVM problem that we consider as well. See also the discussion at the end of this section.

^2 Consider a network with one hidden layer containing n units and a training set with m examples, for simplicity in small dimension d = O(log n). No known results preclude an algorithm that computes a full gradient in time O((n + m) log n). This would be significantly faster than the standard O(n · m · log n) approach of computing the full gradient example by example.

^3 Note that SETH can be viewed as a significant strengthening of the P ≠ NP conjecture, which only postulates that there is no polynomial time algorithm for CNF satisfiability. The best known algorithms for CNF satisfiability have running times of the form O(2^((1−o(1))n) · poly(m)).

solutions to ERM problems can be sufficient for good generalization in learning tasks.
Indeed,\nstochastic gradient descent (SGD) is often advocated as an ef\ufb01cient learning algorithm despite its\npolynomial dependence on 1/\u03b5 in the optimization error [35, 15]. Our results support this viewpoint\nsince SGD sidesteps the quadratic time complexity of our lower bounds.\nFor other problems, our assumptions about the accuracy parameter are less stringent. In particular,\nfor training the top layer of the neural network, we only need to assume that \u03b5 \u2248 1/n. Finally, our\nlower bounds for approximating the norm of the gradient in neural networks hold even if \u03b5 = nO(1),\ni.e., for polynomial approximation factors (or alternatively, a constant additive factor for ReLU and\nsigmoid activation functions).\nFinally, we note that our results do not rule out algorithms that achieve a sub-quadratic running\ntime for well-behaved instances, e.g., instances with low-dimensional structure. Indeed, many such\napproaches have been investigated in the literature, for instance the Nystr\u00f6m method or random\nfeatures for kernel problems [40, 30]. Our results offer an explanation for the wide variety of\ntechniques. The lower bounds are evidence that there is no \u201csilver bullet\u201d algorithm for solving the\naforementioned ERM problems in sub-quadratic time, to high accuracy, and for all instances.\n\n2 Background\n\nFine-grained complexity. We obtain our conditional hardness results via reductions from two\nwell-studied problems: Orthogonal Vectors and Bichromatic Hamming Close Pair.\nDe\ufb01nition 1 (Orthogonal Vectors problem (OVP)). Given two sets A = {a1, . . . , an} \u2286 {0, 1}d and\nB = {b1, . . . 
, bn} ⊆ {0, 1}^d of n binary vectors, decide if there exists a pair a ∈ A and b ∈ B such that a^T b = 0.

For OVP, we can assume without loss of generality that all vectors in B have the same number of 1s. This can be achieved by appending d entries to every bi and setting the necessary number of them to 1 and the rest to 0. We then append d entries to every ai and set all of them to 0.

Definition 2 (Bichromatic Hamming Close Pair (BHCP) problem). Given two sets A = {a1, . . . , an} ⊆ {0, 1}^d and B = {b1, . . . , bn} ⊆ {0, 1}^d of n binary vectors and an integer t ∈ {2, . . . , d}, decide if there exists a pair a ∈ A and b ∈ B such that the number of coordinates in which they differ is less than t (formally, Hamming(a, b) := ‖a − b‖_1 < t). If there is such a pair (a, b), we call it a close pair.

It is known that both OVP and BHCP require almost quadratic time (i.e., n^(2−o(1))) for any d = ω(log n) assuming SETH [5].^4 Furthermore, if we allow the sizes |A| = n and |B| = m to be different, both problems require (nm)^(1−o(1)) time assuming SETH, as long as m = n^α for some constant α ∈ (0, 1) [17]. Our proofs proceed by embedding OVP and BHCP instances into ERM problems. Such a reduction implies that the ERM problem requires almost quadratic time if SETH is true: if we could solve the ERM problem faster, we would also obtain a faster algorithm for the satisfiability problem.

3 Our contributions

3.1 Kernel ERM problems

We provide hardness results for multiple kernel problems. In the following, let x1, . . . , xn ∈ R^d be the n input vectors, where d = ω(log n). We use y1, . . . , yn ∈ R as n labels or target values. Finally, let k(x, x′) denote a kernel function and let K ∈ R^(n×n) be the corresponding kernel matrix, defined as Ki,j := k(xi, xj) [33].
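As a point of reference (not from the paper), the two problems just defined have straightforward quadratic-time baselines that simply enumerate all pairs; the conditional lower bounds cited above assert that these baselines are essentially optimal. A minimal sketch:

```python
import numpy as np

def has_orthogonal_pair(A, B):
    """Brute-force OVP: check all pairs (a, b) for a^T b = 0.

    A, B: arrays of shape (n, d) with 0/1 entries. Runs in O(n^2 * d) time.
    """
    return any(int(a @ b) == 0 for a in A for b in B)

def has_close_pair(A, B, t):
    """Brute-force BHCP: is some pair at Hamming distance < t?"""
    return any(int(np.abs(a - b).sum()) < t for a in A for b in B)
```

Under SETH, no algorithm can improve on this n^2 behavior by a polynomial factor once d = ω(log n).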
Concretely, we focus on the Gaussian kernel k(x, x′) := exp(−C‖x − x′‖_2^2) for some C > 0. We note that our results can be generalized to any kernel with an exponential tail.

^4 We use ω(g(n)) to denote any function f such that lim_{n→∞} f(n)/g(n) = ∞. Similarly, we use o(g(n)) to denote any function f such that lim_{n→∞} f(n)/g(n) = 0. Consequently, we will refer to functions of the form ω(1) as super-constant and to n^(ω(1)) as super-polynomial.

Kernel SVM. For simplicity, we present our result for hard-margin SVMs without bias terms. This gives the following optimization problem.

Definition 3 (Hard-margin SVM). A (primal) hard-margin SVM is an optimization problem of the following form:

  minimize_{α1,...,αn ≥ 0}  (1/2) Σ_{i,j=1}^n αi αj yi yj k(xi, xj)        (1)
  subject to  yi f(xi) ≥ 1,  i = 1, . . . , n,

where f(x) := Σ_{i=1}^n αi yi k(xi, x).

The following theorem is our main result for SVMs, described in more detail in Section 4. In Sections B, C, and D of the supplementary material we provide similar hardness results for other common SVM variants, including the soft-margin version.

Theorem 4. Let k(a, a′) be the Gaussian kernel with C = 100 log n and let ε = exp(−ω(log^2 n)). Then approximating the optimal value of Equation (1) within a multiplicative factor 1 + ε requires almost quadratic time assuming SETH.

Kernel Ridge Regression. Next we consider kernel ridge regression, which is formally defined as follows.

Definition 5 (Kernel ridge regression). Given a real value λ ≥ 0, the goal of kernel ridge regression is to output

  argmin_{α ∈ R^n}  (1/2)‖y − Kα‖_2^2 + (λ/2) α^T K α.

This problem is equivalent to computing the vector (K + λI)^(−1) y.
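As a concrete illustration (not from the paper), the exact KRR solution just described can be computed with a dense linear solve; note that merely forming the Gaussian kernel matrix already costs Θ(n^2 d) time, which is the kind of cost the hardness results say cannot be avoided for high-accuracy solutions:

```python
import numpy as np

def gaussian_kernel_matrix(X, C):
    """K[i, j] = exp(-C * ||x_i - x_j||_2^2) for rows x_i of X."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    return np.exp(-C * np.maximum(d2, 0.0))  # clamp tiny negative round-off

def krr_solve(X, y, C, lam):
    """Exact kernel ridge regression: alpha = (K + lam * I)^{-1} y."""
    K = gaussian_kernel_matrix(X, C)
    return np.linalg.solve(K + lam * np.eye(len(y)), y)
```

The dense solve itself takes O(n^3) time; faster approximate schemes (Nyström, random features) trade accuracy for speed, consistent with the lower bounds here applying only to high-accuracy solutions.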
For the hardness result, we focus on the special case where λ = 0 and the vector y has all entries equal: y1 = . . . = yn = 1. In this case, the sum of the entries of K^(−1)y is equal to the sum of the entries of K^(−1). Thus, we show hardness for computing the latter quantity (see Section F in the supplementary material for the proof).

Theorem 6. Let k(a, a′) be the Gaussian kernel for any parameter C = ω(log n) and let ε = exp(−ω(log^2 n)). Then computing the sum of the entries of K^(−1) up to a multiplicative factor of 1 + ε requires almost quadratic time assuming SETH.

Kernel PCA. Finally, we turn to the kernel PCA problem, which we define as follows [26].

Definition 7 (Kernel Principal Component Analysis (PCA)). Let 1_n be an n × n matrix where each entry takes value 1/n, and define K′ := (I − 1_n)K(I − 1_n). The goal of the kernel PCA problem is to output the n eigenvalues of the matrix K′.

In the above definition, the output consists only of the eigenvalues, not the eigenvectors. This is because computing all n eigenvectors trivially takes at least quadratic time, since the output itself has quadratic size. Our hardness proof applies to the potentially simpler problem where only the eigenvalues are desired. Specifically, we show that computing the sum of the eigenvalues (i.e., the trace of the matrix) is hard. See Section E in the supplementary material for the proof.

Theorem 8. Let k(a, a′) be the Gaussian kernel with C = 100 log n and let ε = exp(−ω(log^2 n)). Then approximating the sum of the eigenvalues of K′ = (I − 1_n)K(I − 1_n) within a multiplicative factor of 1 + ε requires almost quadratic time assuming SETH.

We note that the argument in the proof shows that even approximating the sum of the entries of K is hard.
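For illustration (not from the paper), the centered matrix K′ of Definition 7 and the quantity Theorem 8 concerns can be written down directly; since K′ is symmetric, the sum of its eigenvalues is simply its trace:

```python
import numpy as np

def centered_kernel(K):
    """K' = (I - 1_n) K (I - 1_n), where 1_n is the all-(1/n) matrix."""
    n = K.shape[0]
    P = np.eye(n) - np.full((n, n), 1.0 / n)  # centering projection
    return P @ K @ P

def eigenvalue_sum(Kp):
    """Sum of eigenvalues of a symmetric matrix = its trace."""
    return np.trace(Kp)
```

Even with this trace shortcut, the hardness lies in that every entry of K depends on a pair of inputs, so writing down the trace of K′ to high accuracy implicitly aggregates all n^2 kernel evaluations.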
The hardness of approximating the sum of the entries of K also provides evidence for the hardness of the kernel density estimation problem for Gaussian kernels, complementing recent upper bounds of [20].

3.2 Neural network ERM problems

We now consider neural networks. We focus on the problem of optimizing the top layer while keeping the lower layers unchanged. An instance of this problem is transfer learning with large networks that would take a long time and many examples to train from scratch [31]. We consider neural networks of depth 2, with the sigmoid or ReLU activation function. Our hardness result holds for a more general class of "nice" activation functions S as described later (see Definition 12).

Given n weight vectors w1, . . . , wn ∈ R^d and n weights α1, . . . , αn ∈ R, consider the function f : R^d → R using a non-linearity S : R → R:

  f(u) := Σ_{j=1}^n αj · S(u^T wj).

This function can be implemented as a neural net that has d inputs, n nonlinear activations (units), and one linear output.

To complete the ERM problem, we also require a loss function. Our hardness results hold for a large class of "nice" loss functions, which includes the hinge loss and the logistic loss.^5 Given a nice loss function and m input vectors u1, . . . , um ∈ R^d with corresponding labels yi, we consider the following problem:

  minimize_{α1,...,αn ∈ R}  Σ_{i=1}^m loss(yi, f(ui)).        (2)

Our main result is captured by the following theorem (see Section 5 for the proof). For simplicity, we set m = n.

Theorem 9. For any d = ω(log n), approximating the optimal value in Equation (2) up to a multiplicative factor of 1 + 1/(4n) requires almost quadratic time assuming SETH.

3.3 Hardness of gradient computation

Finally, we consider the problem of computing the gradient of the loss function for a given set of examples. We focus on the network architecture from the previous section.
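For concreteness (a sketch, not code from the paper), the depth-2 objective of Equation (2) with the ReLU activation and hinge loss, together with its gradient in the top-layer weights α, can be written as follows; the O(n · m · d) cost of forming the hidden activations is exactly the "rectangular" running time the hardness result targets:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def top_layer_loss_and_grad(alpha, W, U, y):
    """Depth-2 net f(u) = sum_j alpha_j * S(u^T w_j) with hinge loss.

    W: (n, d) fixed hidden weights, U: (m, d) inputs, y: (m,) labels in {-1, +1}.
    Returns the empirical hinge loss and its (sub)gradient w.r.t. alpha.
    """
    H = relu(U @ W.T)          # (m, n): hidden activations S(u_i^T w_j)
    f = H @ alpha              # (m,): network outputs
    margins = y * f
    loss = np.sum(np.maximum(0.0, 1.0 - margins))
    active = (margins < 1.0).astype(float)   # examples with nonzero hinge slope
    grad = -(H * (active * y)[:, None]).sum(axis=0)
    return loss, grad
```

Theorem 10 below gives evidence that no algorithm can even approximate the norm of this gradient significantly faster than the example-by-example computation above.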
Formally, we obtain the following result:

Theorem 10. Consider the empirical risk in Equation (2) under the following assumptions: (i) the function f is represented by a neural network with n units, n · d parameters, and the ReLU activation function; (ii) we have d = ω(log n); (iii) the loss function is the logistic loss or the hinge loss. Then approximating the ℓp-norm (for any p ≥ 1) of the gradient of the empirical risk for m examples within a multiplicative factor of n^C, for any constant C > 0, requires (nm)^(1−o(1)) time assuming SETH.

See Section 6 for the proof. We also prove a similar statement for the sigmoid activation function. At the same time, we remark that for polynomial activation functions, significantly faster algorithms do exist, using the polynomial lifting argument. Specifically, for a polynomial activation function of the form x^r for some integer r ≥ 2, all gradients can be computed in O((n + m)d^r) time. Note that the running time of the standard backpropagation algorithm is O(dnm) for networks with this architecture. Thus one can improve over backpropagation for a non-trivial range of parameters, especially for the quadratic activation function (r = 2). See Section H in the supplementary material for more details.

3.4 Related work

Recent work has demonstrated conditional quadratic hardness results for many combinatorial optimization problems over graphs and sequences. These results include computing the diameter of sparse graphs [32, 21], Local Alignment [2], Fréchet distance [16], Edit Distance [13], Longest Common Subsequence, and Dynamic Time Warping [1, 17]. In the machine learning literature, [14] recently showed a tight lower bound for the problem of inferring the most likely path in a Hidden Markov Model, matching the upper bound achieved by the Viterbi algorithm [39]. As in our paper, the SETH and related assumptions underlie these lower bounds.
To the best of our knowledge, our paper is\nthe \ufb01rst application of this methodology to continuous (as opposed to combinatorial) optimization\nproblems.\nThere is a long line of work on the oracle complexity of optimization problems, going back to [28].\nWe refer the reader to [29] for these classical results. The oracle complexity of ERM problems is still\n\n5In the binary setting we consider, the logistic loss is equivalent to the softmax loss commonly employed in\n\ndeep learning.\n\n5\n\n\fsubject of active research, e.g., see [3, 19, 41, 9, 10]. The work closest to ours is [19], which gives\nquadratic time lower bounds for ERM algorithms that access the kernel matrix through an evaluation\noracle or a low-rank approximation.\nThe oracle results are fundamentally different from the lower bounds presented in our paper. Oracle\nlower bounds are typically unconditional, but inherently apply only to a limited class of algorithms\ndue to their information-theoretic nature. Moreover, they do not account for the cost of executing\nthe oracle calls, as they merely lower bound their number. In contrast, our results are conditional\n(based on the SETH and related assumptions), but apply to any algorithm and account for the total\ncomputational cost. This signi\ufb01cantly broadens the reach of our results. We show that the hardness is\nnot due to the oracle abstraction but instead inherent in the computational problem.\n\n4 Overview of the hardness proof for kernel SVMs\nLet A = {a1, . . . , an} \u2286 {0, 1}d and B = {b1, . . . , bn} \u2286 {0, 1}d be the two sets of binary vectors\nfrom a BHCP instance with d = \u03c9(log n). Our goal is to determine whether there is a close pair of\nvectors. We show how to solve this BHCP instance by reducing it to three computations of SVM,\nde\ufb01ned as follows:\n\n1. 
We take the first set A of binary vectors, assign label 1 to all vectors, and solve the corresponding SVM on the n vectors:

  minimize_{α1,...,αn ≥ 0}  (1/2) Σ_{i,j=1}^n αi αj k(ai, aj)        (3)
  subject to  Σ_{j=1}^n αj k(ai, aj) ≥ 1,  i = 1, . . . , n.

Note that the yi do not appear in the expressions because all labels are 1.

2. We take the second set B of binary vectors, assign label −1 to all vectors, and solve the corresponding SVM on the n vectors:

  minimize_{β1,...,βn ≥ 0}  (1/2) Σ_{i,j=1}^n βi βj k(bi, bj)        (4)
  subject to  −Σ_{j=1}^n βj k(bi, bj) ≤ −1,  i = 1, . . . , n.

3. We take both sets A and B of binary vectors, assign label 1 to all vectors from the first set A and label −1 to all vectors from the second set B. We then solve the corresponding SVM on the 2n vectors:

  minimize_{α1,...,αn ≥ 0; β1,...,βn ≥ 0}
    (1/2) Σ_{i,j=1}^n αi αj k(ai, aj) + (1/2) Σ_{i,j=1}^n βi βj k(bi, bj) − Σ_{i,j=1}^n αi βj k(ai, bj)        (5)
  subject to  Σ_{j=1}^n αj k(ai, aj) − Σ_{j=1}^n βj k(ai, bj) ≥ 1,  i = 1, . . . , n,
              −Σ_{j=1}^n βj k(bi, bj) + Σ_{j=1}^n αj k(bi, aj) ≤ −1,  i = 1, . . . , n.

Intuition behind the construction. To show a reduction from the BHCP problem to SVM computation, we have to consider two cases:

• The YES case of the BHCP problem, when there are two vectors that are close in Hamming distance. That is, there exist ai ∈ A and bj ∈ B such that Hamming(ai, bj) < t.
• The NO case of the BHCP problem, when there is no close pair of vectors. That is, for all ai ∈ A and bj ∈ B, we have Hamming(ai, bj) ≥ t.

We show that we can distinguish between these two cases by comparing the objective values of the first two SVM instances above to the objective value of the third.

Intuition for the NO case. We have Hamming(ai, bj) ≥ t for all ai ∈ A and bj ∈ B. The Gaussian kernel then gives the inequality

  k(ai, bj) = exp(−100 log n · ‖ai − bj‖_2^2) ≤ exp(−100 log n · t)

for all ai ∈ A and bj ∈ B. This means that the value k(ai, bj) is very small. For simplicity, assume that it is equal to 0, i.e., k(ai, bj) = 0 for all ai ∈ A and bj ∈ B.

Consider the third SVM (5). It contains three terms involving k(ai, bj): the third term in the objective function, the second term in the inequalities of the first type, and the second term in the inequalities of the second type. We assumed that these terms are equal to 0, and we observe that the rest of the third SVM is equal to the sum of the first SVM (3) and the second SVM (4). Thus we expect that the optimal value of the third SVM is approximately equal to the sum of the optimal values of the first and second SVMs. If we denote the optimal value of the first SVM (3) by value(A), the optimal value of the second SVM (4) by value(B), and the optimal value of the third SVM (5) by value(A, B), then we can express this intuition as the approximate equality

  value(A, B) ≈ value(A) + value(B).

Intuition for the YES case. In this case, there is a close pair of vectors ai ∈ A and bj ∈ B such that Hamming(ai, bj) ≤ t − 1.
Since we are using the Gaussian kernel, we have the following inequality for this pair of vectors:

  k(ai, bj) = exp(−100 log n · ‖ai − bj‖_2^2) ≥ exp(−100 log n · (t − 1)).

We therefore have a large summand in each of the three terms from the above discussion. Thus the three terms do not (approximately) disappear, and there is no reason to expect that the approximate equality holds. We can thus expect

  value(A, B) ≉ value(A) + value(B).

Thus, by computing value(A, B) and comparing it to value(A) + value(B), we can distinguish between the YES and NO instances of BHCP. This completes the reduction. The full proofs are given in Section B of the supplementary material.

5 Overview of the hardness proof for training the final layer of a neural network

We start by formally defining the classes of "nice" loss functions and "nice" activation functions.

Definition 11. For a label y ∈ {−1, 1} and a prediction w ∈ R, we call the loss function loss(y, w) : {−1, 1} × R → R≥0 nice if the following three properties hold:

• loss(y, w) = l(yw) for some convex function l : R → R≥0.
• For some sufficiently large constant K > 0, we have that (i) l(x) ≤ o(1) for all x ≥ n^K, (ii) l(x) ≥ ω(n) for all x ≤ −n^K, and (iii) l(x) = l(0) ± o(1/n) for all x ∈ ±O(n^(−K)).
• l(0) > 0 is some constant strictly larger than 0.

We note that the hinge loss function loss(y, x) = max(0, 1 − y · x) and the logistic loss function loss(y, x) = (1/ln 2) · ln(1 + e^(−y·x)) are nice loss functions according to the above definition.

Definition 12. A non-decreasing activation function S : R → R≥0 is "nice" if it satisfies the following property: for all sufficiently large constants T > 0 there exist v0 > v1 > v2 such that S(v0) = Θ(1), S(v1) = 1/n^T, S(v2) = 1/n^(ω(1)), and v1 = (v0 + v2)/2.

The ReLU activation S(z) = max(0, z) satisfies these properties since we can choose v0 = 1, v1 = 1/n^T, and v2 = −1 + 2/n^T. For the sigmoid function S(z) = 1/(1 + e^(−z)), we can choose v1 = −log(n^T − 1), v0 = v1 + C, and v2 = v1 − C for some C = ω(log n). In the rest of the proof we set T = 1000K, where K is the constant from Definition 11.

We now describe the proof of Theorem 9. We use the notation α := (α1, . . . , αn)^T. Invoking the first property from Definition 11, we observe that the optimization problem (2) is equivalent to the following optimization problem:

  minimize_{α ∈ R^n}  Σ_{i=1}^m l(yi · (Mα)i),        (6)

where M ∈ R^(m×n) is the matrix defined as Mi,j := S(ui^T wj) for i = 1, . . . , m and j = 1, . . . , n. For the rest of the section we will use m = Θ(n).^6

Let A = {a1, . . . , an} ⊆ {0, 1}^d and B = {b1, . . . , bn} ⊆ {0, 1}^d with d = ω(log n) be the input to the Orthogonal Vectors problem. To show hardness we define the matrix M as a vertical concatenation of three smaller matrices: M1, M2, and M2 (repeated). Both M1, M2 ∈ R^(n×n) are of size n × n. Thus the number of rows of M (equivalently, the number of training examples) is m = 3n.

Reduction overview.
We select the input examples and weights so that the matrices M1 and M2 have the following properties:

• M1: if the vectors ai and bj are orthogonal, then the corresponding entry is (M1)i,j = S(v0) = Θ(1); otherwise (M1)i,j ≈ 0.^7
• M2: (M2)i,i = S(v1) = 1/n^(1000K) and (M2)i,j ≈ 0 for all i ≠ j.

To complete the description of the optimization problem (6), we assign labels to the inputs corresponding to the rows of the matrix M. We assign label 1 to all inputs corresponding to rows of the matrix M1 and the first copy of the matrix M2. We assign label −1 to all remaining rows of M, corresponding to the second copy of the matrix M2.

The proof of the theorem is completed by the following two lemmas. See Section G in the supplementary material for the proofs.

Lemma 13. If there is a pair of orthogonal vectors, then the optimal value of (6) is upper bounded by (3n − 1) · l(0) + o(1).

Lemma 14. If there is no pair of orthogonal vectors, then the optimal value of (6) is lower bounded by 3n · l(0) − o(1).

6 Hardness proof for gradient computation

Finally, we consider the problem of computing the gradient of the loss function for a given set of examples. We focus on the network architecture from the previous section. Specifically, let F_{α,B}(a) := Σ_{j=1}^n αj S(a, bj) be the output of a neural net with activation function S, where: (1) a is an input vector from the set A := {a1, . . . , am} ⊆ {0, 1}^d; (2) B := {b1, . . . , bn} ⊆ {0, 1}^d is a set of binary vectors; (3) α = (α1, . . . , αn)^T ∈ R^n is an n-dimensional real-valued vector. We first prove the following lemma.

Lemma 15. For some loss function l : R → R, let l(F_{α,B}(a)) be the loss for input a when the label of the input a is +1. Consider the gradient of the total loss l_{α,A,B} := Σ_{a∈A} l(F_{α,B}(a)) at α1 = . . . = αn = 0 with respect to α1, . . . , αn. The sum of the entries of the gradient is equal to l′(0) · Σ_{a∈A, b∈B} S(a, b), where l′(0) is the derivative of the loss function l at 0.

For the hinge loss function, the loss is l(x) = max(0, 1 − x) if the label is +1, so l′(0) = −1. For the logistic loss function, the loss is l(x) = (1/ln 2) · ln(1 + e^(−x)) if the label is +1, so l′(0) = −1/(2 ln 2).

^6 Note that our reduction does not explicitly construct M. Instead, the values of the matrix are induced by the input examples and weights.
^7 We write x ≈ y if x = y up to an inversely superpolynomial additive factor, i.e., |x − y| ≤ n^(−ω(1)).

Proof of Theorem 10. Since all ℓp-norms are within a polynomial factor of each other, it suffices to show the statement for the ℓ1-norm. We set S(a, b) := max(0, 1 − 2a^Tb). Using Lemma 15, we get that the ℓ1-norm of the gradient of the total loss function is equal to |l′(0)| · Σ_{a∈A, b∈B} 1[a^Tb = 0]. Since l′(0) ≠ 0, this reduces OV to the gradient computation problem. Note that if there is no orthogonal pair, then the ℓ1-norm is 0, and otherwise it is a constant strictly greater than 0. Thus approximating the ℓ1-norm within any finite factor allows us to distinguish the two cases.

See Section H in the supplementary material for other results.

7 Conclusions

We have shown that a range of kernel problems require quadratic time for obtaining a high-accuracy solution unless the Strong Exponential Time Hypothesis is false. These problems include variants of kernel SVM, kernel ridge regression, and kernel PCA. We also gave a similar hardness result for training the final layer of a depth-2 neural network.
This result is general and applies to multiple loss and activation functions. Finally, we proved that computing the empirical loss gradient for such networks takes time that is essentially "rectangular", i.e., proportional to the product of the network size and the number of examples.
We note that our quadratic (rectangular) hardness results hold for general inputs. There is a long line of research on algorithms for kernel problems with running times depending on various input parameters, such as the statistical dimension [42], degrees of freedom [11], or effective dimensionality [27]. It would be interesting to establish lower bounds on the complexity of kernel problems as a function of the aforementioned input parameters.
Our quadratic hardness results for kernel problems apply to kernels with exponential tails. A natural question is whether similar results can be obtained for "heavy-tailed" kernels, e.g., the Cauchy kernel. We note that similar results for the linear kernel do not seem achievable using our techniques.8
Several of our results are obtained by a reduction from the (exact) Bichromatic Hamming Closest Pair problem or the Orthogonal Vectors problem. This demonstrates a strong connection between kernel methods and similarity search, and suggests that perhaps a reverse reduction is also possible. Such a reduction could potentially lead to faster approximate algorithms for kernel methods: although the exact closest pair problem has no known sub-quadratic solution, efficient and practical sub-quadratic time algorithms for the approximate version of the problem exist (see, e.g., [6, 36, 8, 7, 4]).

Acknowledgements

Ludwig Schmidt is supported by a Google PhD fellowship. Arturs Backurs is supported by an IBM Research fellowship. This research was supported by grants from the NSF and the Simons Foundation.

References

[1] A. Abboud, A. Backurs, and V. V. Williams.
Tight hardness results for LCS and other sequence similarity measures. In Symposium on Foundations of Computer Science (FOCS), 2015.

[2] A. Abboud, V. V. Williams, and O. Weimann. Consequences of faster alignment of sequences. In International Colloquium on Automata, Languages, and Programming (ICALP), 2014.

[3] A. Agarwal and L. Bottou. A lower bound for the optimization of finite sums. In International Conference on Machine Learning (ICML), 2015.

8 In particular, assuming a certain strengthening of SETH, known as the "non-deterministic SETH" [18], it is provably impossible to prove SETH hardness for any of the linear variants of the studied ERM problems, at least via deterministic reductions. This is due to the fact that these problems have short certificates of optimality via duality arguments. Also, it should be noted that linear analogs of some of the problems considered in this paper (e.g., linear ridge regression) can be solved in O(nd^2) time using SVD methods.

[4] J. Alman, T. M. Chan, and R. Williams. Polynomial representations of threshold functions and algorithmic applications. 2016.

[5] J. Alman and R. Williams. Probabilistic polynomials and Hamming nearest neighbors. In Symposium on Foundations of Computer Science (FOCS), 2015.

[6] A. Andoni and P. Indyk. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. In Symposium on Foundations of Computer Science (FOCS), 2006.

[7] A. Andoni, P. Indyk, T. Laarhoven, I. Razenshteyn, and L. Schmidt. Practical and optimal LSH for angular distance. In Advances in Neural Information Processing Systems (NIPS), 2015.

[8] A. Andoni and I. Razenshteyn. Optimal data-dependent hashing for approximate near neighbors. In Symposium on Theory of Computing (STOC), 2015.

[9] Y. Arjevani and O. Shamir. Dimension-free iteration complexity of finite sum optimization problems.
In Advances in Neural Information Processing Systems (NIPS), 2016.

[10] Y. Arjevani and O. Shamir. Oracle complexity of second-order methods for finite-sum problems. CoRR, abs/1611.04982, 2016.

[11] F. Bach. Sharp analysis of low-rank kernel matrix approximations. In Conference on Learning Theory (COLT), 2013.

[12] F. Bach and S. Sra. Stochastic optimization: Beyond stochastic gradients and convexity. NIPS Tutorial, 2016. http://suvrit.de/talks/vr_nips16_bach.pdf.

[13] A. Backurs and P. Indyk. Edit distance cannot be computed in strongly subquadratic time (unless SETH is false). In Symposium on Theory of Computing (STOC), 2015.

[14] A. Backurs and C. Tzamos. Improving Viterbi is hard: Better runtimes imply faster clique algorithms. In International Conference on Machine Learning (ICML), 2017.

[15] L. Bottou and O. Bousquet. The tradeoffs of large scale learning. In Advances in Neural Information Processing Systems (NIPS), 2007.

[16] K. Bringmann. Why walking the dog takes time: Fréchet distance has no strongly subquadratic algorithms unless SETH fails. In Symposium on Foundations of Computer Science (FOCS), 2014.

[17] K. Bringmann and M. Künnemann. Quadratic conditional lower bounds for string problems and dynamic time warping. In Symposium on Foundations of Computer Science (FOCS), 2015.

[18] M. L. Carmosino, J. Gao, R. Impagliazzo, I. Mihajlin, R. Paturi, and S. Schneider. Nondeterministic extensions of the strong exponential time hypothesis and consequences for non-reducibility. In Innovations in Theoretical Computer Science (ITCS), pages 261–270, 2016.

[19] N. Cesa-Bianchi, Y. Mansour, and O. Shamir. On the complexity of learning with kernels. In Conference on Learning Theory (COLT), 2015.

[20] M. Charikar and P. Siminelakis. Hashing-based estimators for kernel density in high dimensions. In Symposium on Foundations of Computer Science (FOCS), 2017.

[21] S. Chechik, D. H. Larkin, L.
Roditty, G. Schoenebeck, R. E. Tarjan, and V. V. Williams. Better approximation algorithms for the graph diameter. In Symposium on Discrete Algorithms (SODA), 2014.

[22] R. Impagliazzo and R. Paturi. On the complexity of k-SAT. Journal of Computer and System Sciences, 62(2):367–375, 2001.

[23] R. Impagliazzo, R. Paturi, and F. Zane. Which problems have strongly exponential complexity? Journal of Computer and System Sciences, 63:512–530, 2001.

[24] R. Livni, S. Shalev-Shwartz, and O. Shamir. On the computational efficiency of training neural networks. In Advances in Neural Information Processing Systems (NIPS), pages 855–863, 2014.

[25] K.-R. Müller, S. Mika, G. Rätsch, K. Tsuda, and B. Schölkopf. An introduction to kernel-based learning algorithms. IEEE Transactions on Neural Networks, 12(2):181–201, 2001.

[26] K. P. Murphy. Machine Learning: A Probabilistic Perspective. The MIT Press, 2012.

[27] C. Musco and C. Musco. Recursive sampling for the Nyström method. In Advances in Neural Information Processing Systems (NIPS), 2016.

[28] A. S. Nemirovski and D. B. Yudin. Problem Complexity and Method Efficiency in Optimization. Wiley Interscience, 1983.

[29] Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Kluwer Academic Publishers, 2004.

[30] A. Rahimi and B. Recht. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems (NIPS), 2008.

[31] A. S. Razavian, H. Azizpour, J. Sullivan, and S. Carlsson. CNN features off-the-shelf: An astounding baseline for recognition. In Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2014.

[32] L. Roditty and V. Vassilevska Williams. Fast approximation algorithms for the diameter and radius of sparse graphs. In Symposium on Theory of Computing (STOC), 2013.

[33] B. Schölkopf and A. J. Smola.
Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, 2001.

[34] S. Shalev-Shwartz and S. Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.

[35] S. Shalev-Shwartz, Y. Singer, and N. Srebro. Pegasos: Primal estimated sub-gradient solver for SVM. In International Conference on Machine Learning (ICML), 2007.

[36] G. Valiant. Finding correlations in subquadratic time, with applications to learning parities and juntas. In Symposium on Foundations of Computer Science (FOCS), 2012.

[37] V. Vapnik. Statistical Learning Theory. Wiley, 1998.

[38] V. Vassilevska Williams. Hardness of easy problems: Basing hardness on popular conjectures such as the Strong Exponential Time Hypothesis (invited talk). In LIPIcs-Leibniz International Proceedings in Informatics, volume 43. Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2015.

[39] A. Viterbi. Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Transactions on Information Theory, 13(2):260–269, 1967.

[40] C. K. Williams and M. Seeger. Using the Nyström method to speed up kernel machines. In Advances in Neural Information Processing Systems (NIPS), 2001.

[41] B. E. Woodworth and N. Srebro. Tight complexity bounds for optimizing composite objectives. In Advances in Neural Information Processing Systems (NIPS), 2016.

[42] Y. Yang, M. Pilanci, and M. J. Wainwright. Randomized sketches for kernels: Fast and optimal nonparametric regression. The Annals of Statistics, 45(3):991–1023, 2017.