{"title": "Computational Separations between Sampling and Optimization", "book": "Advances in Neural Information Processing Systems", "page_first": 15023, "page_last": 15033, "abstract": "Two commonly arising computational tasks in Bayesian learning are Optimization (Maximum A Posteriori estimation) and Sampling (from the posterior distribution). In the convex case these two problems are efficiently reducible to each other. Recent work (Ma et al. 2019) shows that in the non-convex case, sampling can sometimes be provably faster. We present a simpler and stronger separation.\nWe then compare sampling and optimization in more detail and show that they are provably incomparable: there are families of continuous functions for which optimization is easy but sampling is NP-hard, and vice versa. Further, we show function families that exhibit a sharp phase transition in the computational complexity of sampling, as one varies the natural temperature parameter. Our results draw on a connection to analogous separations in the discrete setting which are well-studied.", "full_text": "Computational Separations between Sampling and\n\nOptimization\n\nKunal Talwar\n\nGoogle Brain\n\nMountain View, CA\nkunal@google.com\n\nAbstract\n\nTwo commonly arising computational tasks in Bayesian learning are Optimization\n(Maximum A Posteriori estimation) and Sampling (from the posterior distribu-\ntion). In the convex case these two problems are ef\ufb01ciently reducible to each\nother. Recent work [Ma et al., 2019] shows that in the non-convex case, sampling\ncan sometimes be provably faster. We present a simpler and stronger separation.\nWe then compare sampling and optimization in more detail and show that they\nare provably incomparable: there are families of continuous functions for which\noptimization is easy but sampling is NP-hard, and vice versa. 
Further, we show function families that exhibit a sharp phase transition in the computational complexity of sampling, as one varies the natural temperature parameter. Our results draw on a connection to analogous separations in the discrete setting which are well-studied.

1 Introduction

Given a compact set X ⊆ R^d and a function f : X → R, one can define two natural problems:

Optimize(f, X, ε): Find x ∈ X such that f(x) ≤ f(x′) + ε for all x′ ∈ X.
Sample(f, X, η): Sample from a distribution on X that is η-close to μ*(x) ∝ exp(−f(x)).

These problems arise naturally in machine learning settings. When f is the negative log likelihood function of a posterior distribution, the optimization problem corresponds to the Maximum A Posteriori (MAP) estimate, whereas the Sampling problem gives us a sample from the posterior. In this work we are interested in the computational complexities of these tasks for specific families of functions.

When f and X are both convex, these two problems have a deep connection (see e.g. Lovasz and Vempala [2006]) and are efficiently reducible to each other in a very general setting. There has been considerable interest in both these problems in the non-convex setting. Given that we are often able to optimize certain non-convex loss functions in practice, it would be appealing to extend this equivalence beyond the convex case. If sampling could be reduced to optimization for our function of interest (e.g. differentiable Lipschitz functions), that might allow us to design sampling algorithms for the function that are usually efficient in practice. Ma et al. [2019] recently showed that in the case when f is not necessarily convex (and X = R^d), these problems are not equivalent. They exhibit a family of continuous, differentiable functions for which approximate sampling can be done efficiently, but where approximate optimization requires exponential time (in an oracle model à la Nemirovsky and Yudin [1983]). In this work, we study the relationship of these two problems in more detail.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

To aid the discussion, it will be convenient to consider a more general sampling problem where we want to sample with probability proportional to exp(−λf(x)) for a parameter λ > 0. Such a scaling has no effect on the optimization problem, up to scaling of ε. However, changing λ can significantly change the distribution for the sampling problem. In the statistical physics literature, this parameter is the inverse temperature. For families F that are invariant to multiplication by a positive scalar (such as the family of convex functions), this λ parameter has no impact on the complexity of sampling from the family. We will however be looking at families of functions that are controlled in some way (e.g. bounded, Lipschitz, or smooth) and do not enjoy such an invariance to scale. E.g. in some Bayesian settings, each sample may give us a 1-smooth negative log likelihood function, so we may want to consider the family F_smooth of 1-smooth functions. Given n i.i.d. samples, the posterior log likelihood would be −nf, where f = (1/n) Σ_i f_i is in F_smooth. The parameter λ then corresponds naturally to the number of samples n.

This phenomenon of sampling being easier than optimization is primarily a "high temperature" or "low signal" phenomenon.
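To make the role of the inverse temperature concrete, here is a small numerical illustration (our own toy example, not from the paper; the particular double-well f and the grid are arbitrary choices): at small λ the tempered distribution ∝ exp(−λf) spreads its mass over both wells, while at large λ it concentrates near the global minimizer.

```python
import math

# Toy illustration (not from the paper): a non-convex "energy" f on a 1-D grid,
# and the tempered distribution p_lambda(x) proportional to exp(-lambda * f(x)).
def tempered(f_vals, lam):
    w = [math.exp(-lam * v) for v in f_vals]
    z = sum(w)                  # partition function Z_lambda on the grid
    return [wi / z for wi in w]

xs = [i / 100 for i in range(-200, 201)]          # grid on [-2, 2]
f = [(x * x - 1) ** 2 + 0.1 * x for x in xs]      # two wells; global min near x = -1

p_hot = tempered(f, 1.0)     # high temperature: mass spread over both wells
p_cold = tempered(f, 200.0)  # low temperature: mass concentrates at the global min

def mass_near_min(p):
    """Probability mass within 0.25 of the global minimizer x = -1."""
    return sum(pi for xi, pi in zip(xs, p) if abs(xi + 1) < 0.25)

print(mass_near_min(p_hot), mass_near_min(p_cold))
```

At λ = 200 essentially all of the mass sits in the left well, matching the picture that sampling at low temperature amounts to locating a minimizer.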
As λ approaches infinity, the distribution exp(−λf) approaches a point mass at the minimizer of f. This connection goes back at least to Kirkpatrick et al. [1983], and one can easily derive a quantitative finite-λ version of this statement for many function families. Ma et al. [2019] reconcile this with their separation by pointing out that their sampling algorithm becomes inefficient as λ increases.

We first show a more elementary and stronger separation. We give a simple family of continuous Lipschitz functions which are efficiently samplable but hard even to approximately optimize. This improves on the separation in Ma et al. [2019] since our sampler is exact (modulo representation issues), and much simpler. The hardness of optimization here is in the oracle model, where the complexity is measured in terms of the number of point evaluations of the function or its gradients.

While these oracle model separations rule out black-box optimization, they leave open the possibility of efficient algorithms that access the function in a different way. We next show that this hardness can be strengthened to NP-hardness for an efficiently computable f. This allows for the implementation of any oracle for f or its derivatives. Thus, assuming the Exponential Time Hypothesis [Impagliazzo and Paturi, 2001], our result implies the oracle model lower bounds. Additionally, it rules out efficient non-blackbox algorithms that could examine the implementation of f beyond point evaluations. We leave open the question of whether other oracle lower bounds [Nemirovsky and Yudin, 1983, Nesterov, 2014, Bubeck, 2015, Hazan, 2016] in optimization can be strengthened to NP-hardness results.

We next look at the large-λ case. As discussed above, for large enough λ sampling must be hard if optimization is. Is hardness of optimization the only obstruction to efficient sampling? We answer this question in the negative. We exhibit a family of functions for which optimization is easy, but where sampling is NP-hard for large enough λ. We draw on extensive work on the discrete analogs of these questions, where f is simple (e.g. linear) but X is a discrete set.

Our upper bound on optimization for this family can be strengthened to work under weaker models of access to the function, where we only have blackbox access to the function. In other words, there are functions that can be optimized via gradient descent for which sampling is NP-hard. Conceptually, this separation is a result of the fact that finding one minimizer suffices for optimization, whereas sampling intuitively requires finding all minima.

Both the separation result of Ma et al. [2019] and our small-λ result have the property that the sampling algorithm's complexity increases exponentially in poly(λ). Thus as we increase λ, the problem gradually becomes harder. Is there always a smooth transition in the complexity of sampling? Our final result gives a surprising negative answer. We exhibit a family of easy-to-optimize functions for which there is a sharp threshold: there is a λ_c such that for λ < λ_c, sampling can be done efficiently, whereas for λ > λ_c, sampling becomes NP-hard. In the process, this demonstrates that for some families of functions, efficient sampling algorithms can be very structure-dependent, and do not fall into the usual Langevin-dynamics or rejection-sampling categories.

Our results show that once we go beyond the convex setting, the problems of sampling and optimization exhibit a rich set of computational behaviors. The connection to the discrete case should help further understand the complexities of these problems.

The rest of the paper is organized as follows. We start with some preliminaries in Section 2.
We give a simple separation between optimization and sampling in Section 3 and derive a computational version of this separation. Section 4 relates the discrete sampling/optimization problems on the hypercube to their continuous counterparts, and uses this connection to derive NP-hardness results for sampling for function families where optimization is easy. In Section 5, we prove the sharp threshold for λ. We describe additional related work in Section 6. Some missing proofs and strengthenings of our results are deferred to the supplementary material.

2 Preliminaries

We consider real-valued functions f : R^d → R. We will be restricting ourselves to functions that are continuous and bounded. We say a function f is L-Lipschitz if f(x) − f(x′) ≤ L · ‖x − x′‖ for all x, x′ ∈ R^d. In this work, ‖·‖ will denote the Euclidean norm.

We will look at specific families of functions which have compact representations, and ask questions about efficiency of optimization and sampling. We will think of d as a parameter, and look at function families such that any function in the family can be computed in poly(d) time and space.

We will look at constrained optimization in this work, and our constraint set X will be a Euclidean ball. Our hardness results however do not stem from the constraint set, and nearly all of our results can be extended easily to the unconstrained case.

Given λ > 0 and a function f, we let D^{λ,X}_f denote the distribution on X with Pr[x] ∝ exp(−λf(x)). When X is obvious from context, we will usually omit it and write D^λ_f. We write Z^{λ,X}_f for the partition function ∫_X exp(−λf(x)) dx.

We will also look at real-valued functions h : H_d → R, where H_d = {−1, 1}^d is the d-dimensional hypercube. We will often think of a y ∈ H_d as being contained in R^d. Analogous to the Euclidean space case, we define D^{λ,H_d}_h as the distribution over the hypercube with Pr[y] ∝ exp(−λh(y)), and define Z^{λ,H_d}_h to be Σ_{y∈H_d} exp(−λh(y)).

We say that an algorithm η-samples from D^λ_f if it samples from a distribution that is η-close to D^λ_f in statistical distance. We will also use the Wasserstein distance between distributions on R^d, defined as W(P, Q) = inf_π E_{(x,x′)∼π}[‖x − x′‖_2], where the inf is taken over all couplings π between P and Q. We remark that our results are not sensitive to the choice of distance between distributions and extend in a straightforward way to other distances. As is common in the literature on sampling from continuous distributions, we will for the most part assume that we can sample from a real-valued distribution such as a Gaussian and ignore bit-precision issues. Statements such as our NP-hardness results usually require finite precision arithmetic. This issue is discussed at length by Tosh and Dasgupta [2019] and, following them, we will discuss using Wasserstein distance in those settings. An ε-optimizer of f is a point x ∈ X such that f(x) ≤ f(x′) + ε for any x′ ∈ X.

In the supplementary material, we quantify the folklore results showing that sampling for high λ implies approximate optimization. Quantitatively, they say that for L-Lipschitz functions over a ball of radius R, sampling implies ε-approximate optimization if λ ≥ Ω(d ln(LR/ε)/ε).
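This folklore direction can be checked numerically. Below is a small sketch (our own illustration, not the paper's proof; the particular 1-Lipschitz f on a one-dimensional "ball" is an arbitrary choice, and a grid discretization stands in for exact sampling): at large λ, even a handful of draws from D^λ_f contains a point with near-optimal f value.

```python
import math
import random

# Toy check (not from the paper) of "sampling implies optimization" at large lambda:
# draw from the tempered distribution and keep the best point seen.
random.seed(0)

xs = [i / 200 for i in range(-200, 201)]   # grid on the interval [-1, 1]
f = lambda x: abs(x - 0.3)                 # 1-Lipschitz, global minimum 0 at x = 0.3

lam = 100.0
w = [math.exp(-lam * f(x)) for x in xs]
total = sum(w)
p = [wi / total for wi in w]               # D_lambda restricted to the grid

samples = random.choices(xs, weights=p, k=20)   # 20 draws from D_lambda
best_val = min(f(x) for x in samples)           # best-of-samples as an approximate optimum
print(best_val)
```

Here almost all of the mass of D^λ_f sits within O(1/λ) of the minimizer, so the best of a few draws is an ε-optimizer with high probability, in line with the quantitative bound above.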
Similarly, for β-smooth functions over a ball of radius R, λ ≥ Ω(d ln(βR/ε)/ε) suffices to get ε-close to an optimum.

3 A Simple Separation

We consider the case when X = B_d(0, 1) is the unit norm Euclidean ball in d dimensions. We let F_Lip be the family of all 1-Lipschitz functions from X to R. We show that for any f ∈ F_Lip, exact sampling can be done in time exp(O(λ)). On the other hand, for any algorithm, there is an f ∈ F_Lip that forces the algorithm to use Ω(λ/ε)^d queries to f to get an ε-approximate optimal solution. Thus, e.g., for constant λ, sampling can be done in poly(d) time, whereas optimization requires time exponential in the dimension. Moreover, for any λ ≤ d, there is an exponential gap between the complexities of these two problems. Our lower bound proof is similar to the analogous claim in Ma et al. [2019], but has better parameters due to the simpler setting. Our upper bound proof is significantly simpler and gives an exact sampler.

Theorem 1 (Fast Sampling). There is an algorithm that, for any f ∈ F_Lip, outputs a sample from D^λ_f and makes an expected O(exp(2λ)) oracle calls to f.

Proof. The algorithm is based on rejection sampling. We first compute f(0) and let M = f(0) − 1. By the Lipschitzness assumption, f(x) ∈ [M, M + 2] for all x in the unit ball. The algorithm now repeatedly samples a random point x from the unit ball. With probability exp(λ(M − f(x))), this point is accepted and we output it. Otherwise we continue.

Since exp(λ(M − f(x))) ∈ [exp(−2λ), 1], this is a rejection sampling algorithm, and it outputs an exact sample from D^λ_f. Each step accepts with probability at least exp(−2λ). Thus the algorithm terminates in an expected O(exp(2λ)) many steps, each of which requires one evaluation of f.

Remark 1.
The above algorithm assumes access to an oracle to sample from B_d(0, 1) to arbitrary precision. This can be replaced by sampling from a grid of finite precision points in the ball. This creates two sources of error. Firstly, the function is not constant in a grid cell. This error is easily bounded since f is Lipschitz. Secondly, some grid cells may cross the boundary of B_d(0, 1). This is a probability d2^{−b} event when sampling a grid point with b bits of precision. Taking these errors into account gives us a sample within Wasserstein distance at most O((d + λ)2^{−b}).

Remark 2. The above is a Las Vegas algorithm. One can similarly derive a Monte Carlo algorithm by aborting and outputting a random x after exp(2λ) log(1/η) steps.

Remark 3. Under the assumptions in Ma et al. [2019] (β-smooth f, ∇f(0) = 0, κβ-strong convexity outside a ball of radius R), a direct reduction to our setting will be lossy and a rejection-sampling-based approach will not be efficient. The Langevin dynamics based sampler in that work is more efficient under their assumptions.

Theorem 2 (No Fast Optimization). For any algorithm A that queries f or any of its derivatives at fewer than (1/4ε)^d points, there is an f ∈ F_Lip for which A fails to output an ε-optimizer of f except with negligible probability.

Proof. Consider the function f_x that is zero everywhere except for a small ball of radius 2ε around x, where it is f_x(y) = ‖y − x‖ − 2ε; i.e., the function¹ is f_x(y) = min(0, ‖y − x‖ − 2ε). This function has optimum −2ε. Let g be the zero function.

Let A be an algorithm (possibly randomized) that queries f or its derivatives at T ≤ (1/4ε)^d points. Consider running A on a function f chosen randomly as: f = g with probability 1/2, and f = f_x for an x chosen u.a.r. from B(0, 1) otherwise.

Until A has queried a point in B(x, 2ε), the behavior of the algorithm on f_x and g is identical, since the functions and all their derivatives agree outside this ball. Since an ε-approximation must distinguish these two cases, for A to succeed, it must query this ball. The probability that A queries this ball in any given step is at most vol(B(x, 2ε))/vol(B(0, 1)) = (2ε)^d. As A makes only (1/4ε)^d queries in total, the expected number of queries A makes to the ball B(x, 2ε) is at most 2^{−d}. Thus with probability at least 1 − 1/2^d, the algorithm fails to distinguish g from f_x, and hence cannot succeed.

¹As defined, this function is Lipschitz but not smooth. It can be easily modified to a 2-Lipschitz, 2/ε-smooth function by replacing its linear behavior in the ball by an appropriately Huberized version.

3.1 Making the Separation Computational

The oracle-hardness of Theorem 2 stems from possibly "hiding" a minimizer x of f. The computational version of this hardness result will instead possibly hide an appropriate encoding of a satisfying assignment to a 3SAT formula.

Theorem 3 (No Fast Optimization: Computational). There is a constant ε > 0 such that it is NP-hard to ε-optimize an efficiently computable Lipschitz function over the unit ball.

Proof. Let φ : {0, 1}^d → B_d(0, 1) be a map such that φ is efficiently computable, ‖φ(y) − φ(y′)‖ ≥ 4ε for y ≠ y′, and such that given x ∈ B_d(φ(y), 2ε), we can efficiently recover y. For a small enough absolute constant ε, such maps can be easily constructed using error correcting codes, and we defer details to the supplementary material.

We start with an instance I of 3SAT on d variables, and define f as follows. Given a point x, we first find a y ∈ {0, 1}^d, if any, such that x ∈ B_d(φ(y), 2ε). If no such y exists, f(x) is set to 0. If such a y exists, we interpret it as an assignment to the variables in the 3SAT instance I and set f(x) to be min(0, ‖φ(y) − x‖ − 2ε) if y is a satisfying assignment to instance I, and to 0 otherwise.

It is clear from the definition that f is efficiently computable. Moreover, the minimum attained value of f is −2ε if I is satisfiable, and 0 otherwise. Since distinguishing between these two cases is NP-hard, so is ε-optimization of f.

We note that assuming the Exponential Time Hypothesis, this implies the exp(Ω(d)) oracle complexity lower bound of Theorem 2.

Figure 1: (Left) An example of a function h for d = 2, with the 2-d hypercube shown in gray and the values of h denoted by blue points. (Right) The corresponding function f that results from the transformation, for M = 4, R = 2.

4 Relating Discrete and Continuous Settings

For any function h on the hypercube, we can construct a function f on R^d such that optimization of f and h are reducible to each other, and similarly sampling from f and h are reducible to each other. This allows us to use separation results for the hypercube case to establish analogous separation results for the continuous case.

Theorem 4. Let h : H_d → R have range [0, d]. Fix M ≥ 2d and R ≥ 2√d. Then there is a function f : B_d(0, R) → R satisfying the following properties:

Efficiency: Given x ∈ B_d(0, R) and oracle access to h, f can be computed in polynomial time.

Lipschitzness: f is continuous and L-Lipschitz for L = 2M.

Sampler Equivalence: Fix λ ≥ 4d ln(24R)/M. Given access to an η-sampler for D^{λ,H_d}_h, there is an efficient η′-sampler for D^λ_f, for η′ = η + exp(−Ω(d)). Conversely, given access to an η-sampler for D^λ_f, there is an efficient η′-sampler for D^{λ,H_d}_h, for η′ = η + exp(−Ω(d)).

Proof. The function f is fairly natural: it takes a large value M ≥ 2d at most points, except in small balls around the hypercube vertices. At each hypercube vertex, f is equal to the value of h at that vertex, and we interpolate linearly in a small ball. See Figure 1 for an illustration.

Formally, let round : R → {−1, 1} be the function that takes the value 1 for x ≥ 0 and −1 otherwise, and let round : R^d → H_d be its natural vectorized form that applies the function coordinate-wise. Let g(x) = ‖x − round(x)‖ denote the Euclidean distance from x to round(x). Let s = 2M. The function f is defined as follows:

    f(x) = h(round(x)) + s · g(x)   if g(x) ≤ (M − h(round(x)))/s,
    f(x) = M                        if g(x) ≥ (M − h(round(x)))/s.

It is easy to verify that f is continuous. Moreover, since M ≥ 2d and h has range [0, d], the value (M − h(round(x)))/s is in the range [1/4, 1/2]. It follows that f takes the value M outside balls of radius 1/2 around the hypercube vertices, and is strictly smaller than M in balls of radius 1/4.

Since round(x) is easy to compute, this implies that f can be computed in polynomial time, using a single oracle call to h. Moreover, it is immediate from the definition that the function f has Lipschitz constant s.

Note that f(y) = h(y) for y ∈ H_d and f(x) ≥ h(round(x)) for all x ∈ B_d(0, R). This implies that the minimum value of f is the same as the minimum value of h, and indeed any (ε-)minimizer y of h also (ε-)minimizes f. Conversely, let x be an ε-minimizer of f. Since h(round(x)) ≤ f(x), it follows that round(x) is an ε-minimizer of h.

Finally we prove the equivalence of approximate sampling. Towards that goal, we define an intermediate distribution on B_d(0, R). Let D̂^λ_f be the distribution D^λ_f conditioned on being in ∪_{y∈H_d} B_d(y, 1/4).

We first argue that η-samplability of D̂^λ_f is equivalent to η-samplability of D^{λ,H_d}_h. Suppose that X is a sample from D̂^λ_f. Then for any y* ∈ H_d,

    Pr[X ∈ B_d(y*, 1/4)] = ∫_{B_d(y*,1/4)} exp(−λf(x)) dx / Σ_{y∈H_d} ∫_{B_d(y,1/4)} exp(−λf(x)) dx
                         = exp(−λh(y*)) · ∫_{B_d(y*,1/4)} exp(−sλg(x)) dx / Σ_{y∈H_d} exp(−λh(y)) · ∫_{B_d(y,1/4)} exp(−sλg(x)) dx
                         = exp(−λh(y*)) / Σ_{y∈H_d} exp(−λh(y)).        (1)

Thus round(X) is a sample from D^{λ,H_d}_h. Conversely, the same calculation implies that given a sample Y from D^{λ,H_d}_h, and a vector Z ∈ B_d(0, 1/4) sampled proportional to exp(−sλ‖z‖), Y + Z is a sample from D̂^λ_f. Noting that Z is a sample from an efficiently sample-able log-concave distribution completes the equivalence.

We next argue that D^λ_f and D̂^λ_f are exp(−Ω(d))-close as distributions. Since D̂^λ_f is a conditioning of D^λ_f, this is equivalent to showing that nearly all of the mass of D^λ_f lies in ∪_{y∈H_d} B_d(y, 1/4). We write

    Z^f_λ = ∫_{B_d(0,R)} exp(−λf(x)) dx ≥ Σ_{y∈H_d} ∫_{B_d(y,1/4)} exp(−λf(x)) dx = Σ_{y∈H_d} exp(−λh(y)) · ∫_{B_d(y,1/4)} exp(−sλg(x)) dx.

Let Ẑ^f_λ denote this last expression. We will argue that Z^f_λ ≤ (1 + exp(−Ω(d))) Ẑ^f_λ. We write Z^f_λ as

    Z^f_λ ≤ Σ_{y∈H_d} ∫_{B_d(y,1/2)} exp(−λf(x)) dx + ∫_{B_d(0,R)} exp(−λM) dx
          ≤ Σ_{y∈H_d} exp(−λh(y)) · ∫_{B_d(y,1/2)} exp(−sλg(x)) dx    [term (A)]
            + ∫_{B_d(0,R)} exp(−λM) dx.                                [term (B)]

A simple calculation, formalized as Lemma 16 in the supplementary material, shows that the integral ∫_{B_d(y,1/2)} exp(−sλg(x)) dx is within a factor (1 + 2 exp(−d)) of ∫_{B_d(y,1/4)} exp(−sλg(x)) dx for sλ > 16d. This implies that the term (A) above is at most (1 + 2 exp(−d)) Ẑ^f_λ. To bound the second term (B), we argue that a ball of radius 1/8 around any single vertex y of the hypercube contributes significantly more than the term (B). Indeed

    ∫_{B_d(y,1/8)} exp(−λf(x)) dx ≥ ∫_{B_d(y,1/8)} exp(−λ(h(y) + s/8)) dx ≥ exp(−λ(d + M/4)) · (1/8)^d ∫_{B_d(0,1)} dx ≥ exp(−3λM/4) · (1/8)^d ∫_{B_d(0,1)} dx,

whereas

    ∫_{B_d(0,R)} exp(−λM) dx = exp(−λM) · R^d · ∫_{B_d(0,1)} dx.

Thus (B)/Ẑ^f_λ is at most exp(−λM/4 + d ln 8R). Under the assumptions on λ, it follows that (B) is at most exp(−Ω(d)) times Ẑ^f_λ. In other words, we have shown that Z^f_λ is at most (1 + exp(−Ω(d))) Ẑ^f_λ.

Remark 4. The equivalence of sampling extends immediately to Wasserstein distance. Indeed, given a sampler for D^{λ,H_d}_h, one gets a Wasserstein sampler for D̂^λ_f by sampling from a simple isotropic log-concave distribution. A Wasserstein sampler for a ball suffices for this. Since W(P, Q) is bounded by the diameter times the statistical distance, this gives an (η + exp(−Ω(d))) Wasserstein sampler for D^λ_f. Similarly, an η Wasserstein sampler for D^λ_f conditioned on the support of D̂^λ_f immediately yields an O(η)-sampler for D^{λ,H_d}_h. Moreover, it is easy to check that this conditioning is on a constant probability event as long as η < 1/16.

4.1 Optimization can be Easier than Sampling

Given the reduction from the previous section, there are many options for a starting discrete problem to apply the reduction. We will start from one of the most celebrated NP-hard problems. The NP-hardness of Hamiltonian Cycle dates back to Karp [1972].

Theorem 5 (Hardness of HAMCYCLE).
Given a constant-degree graph G = (V, E), it is NP-hard to distinguish the following two cases.

YES Case: G has a Hamiltonian Cycle.
NO Case: G has no Hamiltonian Cycle.

We can then amplify the gap between the number of long cycles in the two cases.

Theorem 6 (#CYCLE hardness). Given a constant-degree graph G = (V, E) and for L = |V|/2, it is NP-hard to distinguish the following two cases.

YES Case: G has at least 1 + 2^L simple cycles of length L.
NO Case: G has exactly one simple cycle C^(planted) of length L, and no longer simple cycle.

Moreover, C^(planted) can be found efficiently in polynomial time.

The proof of the above uses a simple extension of a relatively standard reduction (see e.g. Vadhan [2002]) from Hamiltonian Cycle. Starting with an instance G_1 of Hamiltonian Cycle, we replace each edge by a two-connected path of length t, for some integer t. For L = nt, this gives us 2^L cycles of length L for every Hamiltonian cycle in G_1. Moreover, any simple cycle of length L must correspond to a Hamiltonian Cycle in G_1. We add to G a simple cycle of length L on a separate set of L vertices. This "planted" cycle is easy to find, since it forms its own connected component of size L. A full proof is deferred to the supplementary material.

Armed with these, we form a function on the hypercube in d = |E| dimensions such that optimizing it is easy, but sampling is hard.

Theorem 7. There exists a function h : H_d → [0, d] satisfying the following properties.

Efficiency: h can be computed efficiently on H_d.
Easy Optimization: One can efficiently find a particular minimizer y^(planted) of h on H_d.
Sampling is hard: Let λ ≥ 2d. It is NP-hard to distinguish the following two cases, for L = Ω(d):

    YES Case: Pr_{y∼D^{λ,H_d}_h}[y = y^(planted)] ≤ 1/2^L.
    NO Case: Pr_{y∼D^{λ,H_d}_h}[y = y^(planted)] ≥ 1 − 1/2^L.

In particular this implies that (1 − 1/2^{L−2})-sampling from D^{λ,H_d}_h is NP-hard.

Proof. Let G = (V, E) be a graph produced by Theorem 6 and let d = |E|. A vertex y of the hypercube H_d is easily identified with a set E_y ⊆ E consisting of the edges {e ∈ E : y_e = 1}. The function h_1(y) is equal to zero if E_y does not define a simple cycle, and is equal to the length of the cycle otherwise. To convert this into a minimization problem, we define h(y) = d − h_1(y). It is immediate that a minimizer of h corresponds to a longest simple cycle in G.

Given a vertex y, testing whether E_y is a simple cycle can be done efficiently, and the length of the cycle is simply |E_y|. This implies that h can be efficiently computed on H_d. Further, since we can find the planted cycle in G efficiently, we can efficiently construct a minimizer y^(planted) of h.

Suppose that G has at least 2^L + 1 cycles of length L. In this case, the distribution D^{λ,H_d}_h restricted to the minimizers is uniform, and thus the probability mass on a specific minimizer y^(planted) is at most 1/(2^L + 1). This also therefore upper bounds the probability mass on y^(planted) in D^{λ,H_d}_h.

On the other hand, suppose that the planted cycle is the unique longest simple cycle in G. Then the probability mass on y^(planted) is exp(−λ(d − L))/Z^{λ,H_d}_h. Since every other cycle is of length at most L − 1, and there are at most 2^d cycles, it follows that Z^{λ,H_d}_h / exp(−λ(d − L)) ≤ 1 + 2^d exp(−λ). For λ ≥ 2d, this ratio is at most 1 + exp(−d) ≤ (1 − 1/2^d)^{−1}. The claim follows.

We can now apply Theorem 4 to derive the following result.

Theorem 8. There is a family F of functions f : B_d(0, R) → R such that the following hold.

Efficiency: Every f ∈ F is computable in time poly(d).
Easy Optimization: An optimizer of f can be found in time poly(d).
Sampling is NP-hard: For λ ≥ 2d and η < 1 − exp(−Ω(d)), there is no efficient η-sampler for D^λ_f unless NP = RP.

Remark 5. In the statement above, efficiently computable means that given a t-bit representation of x, one can compute f(x) to t bits of accuracy in time poly(d, t).

Remark 6. The easy optimization result above can be considerably strengthened. We can ensure that 0 is the optimizer of f and that, except with negligible probability, gradient descent starting at a random point will converge to this minimizer. Further, one can ensure that all local minima are global and that f is strict-saddle. Thus not only is the function easy to optimize given the representation, black box oracle access to f and its gradients suffices to optimize f. We defer details to the supplementary material.

Remark 7. The hardness of sampling holds also for Wasserstein distance 1/16, given Remark 4.

5 A Sharp Threshold for λ

We start with the following threshold result for sampling from the Gibbs distribution on independent sets, due to Weitz [2006] and Sly and Sun [2012].

Theorem 9.
For any Δ ≥ 6, there is a threshold λc(Δ) > 0 such that the following are true.

FPRAS for small λ: For any λ < λc(Δ), the problem of sampling independent sets with Pr[I] ∝ exp(λ|I|) on Δ-regular graphs has a fully polynomial time randomized approximation scheme.

NP-hard for large λ: For any λ > λc(Δ), unless NP = RP, there is no fully polynomial time randomized approximation scheme for the problem of sampling independent sets with Pr[I] ∝ exp(λ|I|) on Δ-regular graphs.

In the supplementary material, we show that this implies the following result.

Theorem 10. There is a family F of efficiently computable functions f : B_d(0, R) → R for which the sampling problem has a sharp computational threshold. There is a constant λc > 0 such that for any 1/d < λ < λc, there is a poly(d/η)-time η-sampler for the distribution D^λ_f. On the other hand, for any λ > λc, there is a constant η' > 0 such that no polynomial-time algorithm η'-samples from D^λ_f unless NP = RP.

6 Related Work

The problems of counting solutions and sampling solutions are intimately related and well studied in the discrete case. The class #P was defined by Valiant [1979], who showed that the problem of computing the permanent of a matrix is complete for this class. The class has been studied extensively, and Toda [1991] showed that an efficient exact algorithm for any #P-complete problem would imply a collapse of the polynomial hierarchy. Many common problems in #P, however, admit efficient approximation schemes that, for any ε > 0, allow for a randomized (1 + ε)-approximation in time polynomial in n/ε.
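As a concrete example of the chains underlying such schemes, single-site (Glauber) dynamics for the independent-set distribution Pr[I] ∝ exp(λ|I|) of Theorem 9 can be sketched as follows. This is our own illustrative code, not the algorithm of Weitz [2006], and rapid mixing is only guaranteed in the regime λ < λc(Δ):

```python
import math
import random

def glauber_independent_set(adj, lam, steps, rng=random.Random(0)):
    """Single-site Glauber dynamics for the hard-core model: the stationary
    distribution is Pr[I] proportional to exp(lam * |I|) over independent
    sets I of the graph given by adjacency lists `adj`."""
    n = len(adj)
    occ = [False] * n  # start from the empty independent set
    p_in = math.exp(lam) / (1.0 + math.exp(lam))
    for _ in range(steps):
        v = rng.randrange(n)
        if any(occ[u] for u in adj[v]):
            occ[v] = False  # an occupied neighbour forces v out
        else:
            # Resample v's occupancy from its conditional distribution.
            occ[v] = rng.random() < p_in
    return {v for v in range(n) if occ[v]}
```

Each step resamples one vertex's occupancy conditioned on the rest of the configuration, so every state reached is an independent set and the chain is reversible with the intended stationary distribution.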
Such Fully Polynomial Randomized Approximation Schemes (FPRASes) are known for many problems in #P, perhaps the most celebrated of them being that for the permanent of a non-negative matrix [Jerrum et al., 2004].

These FPRASes are nearly always based on Markov chain methods and their Metropolis-Hastings [Metropolis et al., 1953, Hastings, 1970] variants. These techniques have been used both in the discrete case (e.g. Jerrum et al. [2004]) and the continuous case (e.g. Lovasz and Vempala [2006]). The closely related technique of Langevin dynamics [Rossky et al., 1978, Roberts and Stramer, 2002, Durmus and Moulines, 2017] and its Metropolis-adjusted variant are often faster in practice and have only recently been analyzed.

References

Stéphane Boucheron, Gábor Lugosi, and Pascal Massart. Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, 2013.

Sébastien Bubeck. Convex optimization: Algorithms and complexity. Found. Trends Mach. Learn., 8(3-4):231–357, November 2015. ISSN 1935-8237. doi: 10.1561/2200000050. URL http://dx.doi.org/10.1561/2200000050.

Alain Durmus and Éric Moulines. Nonasymptotic convergence analysis for the unadjusted Langevin algorithm. Ann. Appl. Probab., 27(3):1551–1587, 06 2017. doi: 10.1214/16-AAP1238. URL https://doi.org/10.1214/16-AAP1238.

W. K. Hastings. Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57(1):97–109, 1970. ISSN 00063444. URL http://www.jstor.org/stable/2334940.

Elad Hazan. Introduction to online convex optimization. Foundations and Trends in Optimization, 2(3-4):157–325, 2016. ISSN 2167-3888. doi: 10.1561/2400000013. URL http://dx.doi.org/10.1561/2400000013.

Russell Impagliazzo and Ramamohan Paturi. On the complexity of k-SAT. Journal of Computer and System Sciences, 62(2):367–375, 2001. ISSN 0022-0000.
doi: https://doi.org/10.1006/jcss.2000.1727. URL http://www.sciencedirect.com/science/article/pii/S0022000000917276.

Mark Jerrum, Alistair Sinclair, and Eric Vigoda. A polynomial-time approximation algorithm for the permanent of a matrix with nonnegative entries. J. ACM, 51(4):671–697, July 2004. ISSN 0004-5411. doi: 10.1145/1008731.1008738. URL http://doi.acm.org/10.1145/1008731.1008738.

Jørn Justesen. Class of constructive asymptotically good algebraic codes. IEEE Trans. Information Theory, 18:652–656, 1972.

R. Karp. Reducibility among combinatorial problems. In R. Miller and J. Thatcher, editors, Complexity of Computer Computations, pages 85–103. Plenum Press, 1972.

S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi. Optimization by simulated annealing. Science, 220(4598):671–680, 1983. ISSN 0036-8075. doi: 10.1126/science.220.4598.671. URL https://science.sciencemag.org/content/220/4598/671.

L. Lovasz and S. Vempala. Fast algorithms for logconcave functions: Sampling, rounding, integration and optimization. In 2006 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS'06), pages 57–68, Oct 2006. doi: 10.1109/FOCS.2006.28.

Yi-An Ma, Yuansi Chen, Chi Jin, Nicolas Flammarion, and Michael I. Jordan. Sampling can be faster than optimization. Proceedings of the National Academy of Sciences, 116(42):20881–20885, 2019. ISSN 0027-8424. doi: 10.1073/pnas.1820003116. URL https://www.pnas.org/content/116/42/20881.

Nicholas Metropolis, Arianna W. Rosenbluth, Marshall N. Rosenbluth, Augusta H. Teller, and Edward Teller. Equation of state calculations by fast computing machines. The Journal of Chemical Physics, 21(6):1087–1092, 1953. doi: 10.1063/1.1699114. URL https://doi.org/10.1063/1.1699114.

Arkadii Semenovich Nemirovsky and David Borisovich Yudin. Problem complexity and method efficiency in optimization. John Wiley & Sons, 1983.

Yurii Nesterov.
Introductory Lectures on Convex Optimization: A Basic Course. Springer Publishing Company, Incorporated, 1st edition, 2014. ISBN 1461346916, 9781461346913.

G. O. Roberts and Osnat Stramer. Langevin diffusions and Metropolis-Hastings algorithms. Methodology and Computing in Applied Probability, 4:337–357, 01 2002. doi: 10.1023/A:1023562417138.

P. J. Rossky, J. D. Doll, and H. L. Friedman. Brownian dynamics as smart Monte Carlo simulation. The Journal of Chemical Physics, 69(10):4628–4633, 1978. doi: 10.1063/1.436415. URL https://doi.org/10.1063/1.436415.

Allan Sly and Nike Sun. The computational hardness of counting in two-spin models on d-regular graphs. In Proceedings of the 2012 IEEE 53rd Annual Symposium on Foundations of Computer Science, FOCS '12, pages 361–369, Washington, DC, USA, 2012. IEEE Computer Society. ISBN 978-0-7695-4874-6. doi: 10.1109/FOCS.2012.56. URL https://doi.org/10.1109/FOCS.2012.56.

S. Toda. PP is as hard as the polynomial-time hierarchy. SIAM Journal on Computing, 20(5):865–877, 1991. doi: 10.1137/0220053. URL https://doi.org/10.1137/0220053.

Christopher Tosh and Sanjoy Dasgupta. The relative complexity of maximum likelihood estimation, MAP estimation, and sampling. In Alina Beygelzimer and Daniel Hsu, editors, Proceedings of the Thirty-Second Conference on Learning Theory, volume 99 of Proceedings of Machine Learning Research, pages 2993–3035, Phoenix, USA, 25–28 Jun 2019. PMLR. URL http://proceedings.mlr.press/v99/tosh19a.html.

Salil Vadhan. Computational complexity lecture notes. 2002. URL https://people.seas.harvard.edu/~salil/cs221/fall02/scribenotes/nov22.ps.

Leslie G. Valiant. The complexity of computing the permanent. Theoretical Computer Science, 8(2):189–201, 1979. ISSN 0304-3975. doi: 10.1016/0304-3975(79)90044-6. URL http://www.sciencedirect.com/science/article/pii/0304397579900446.

Dror Weitz.
Counting independent sets up to the tree threshold. In Proceedings of the Thirty-eighth Annual ACM Symposium on Theory of Computing, STOC '06, pages 140–149, New York, NY, USA, 2006. ACM. ISBN 1-59593-134-1. doi: 10.1145/1132516.1132538. URL http://doi.acm.org/10.1145/1132516.1132538.