{"title": "Large-Scale Quadratically Constrained Quadratic Program via Low-Discrepancy Sequences", "book": "Advances in Neural Information Processing Systems", "page_first": 2297, "page_last": 2307, "abstract": "We consider the problem of solving a large-scale Quadratically Constrained Quadratic Program. Such problems occur naturally in many scientific and web applications. Although there are efficient methods which tackle this problem, they are mostly not scalable. In this paper, we develop a method that transforms the quadratic constraint into a linear form by a sampling a set of low-discrepancy points. The transformed problem can then be solved by applying any state-of-the-art large-scale solvers. We show the convergence of our approximate solution to the true solution as well as some finite sample error bounds. Experimental results are also shown to prove scalability in practice.", "full_text": "Large-Scale Quadratically Constrained Quadratic\n\nProgram via Low-Discrepancy Sequences\n\nKinjal Basu, Ankan Saha, Shaunak Chatterjee\n\nLinkedIn Corporation\n\nMountain View, CA 94043\n\n{kbasu, asaha, shchatte}@linkedin.com\n\nAbstract\n\nWe consider the problem of solving a large-scale Quadratically Constrained\nQuadratic Program. Such problems occur naturally in many scienti\ufb01c and web\napplications. Although there are ef\ufb01cient methods which tackle this problem, they\nare mostly not scalable. In this paper, we develop a method that transforms the\nquadratic constraint into a linear form by sampling a set of low-discrepancy points\n[16]. The transformed problem can then be solved by applying any state-of-the-art\nlarge-scale quadratic programming solvers. 
We show the convergence of our ap-\nproximate solution to the true solution as well as some \ufb01nite sample error bounds.\nExperimental results are also shown to prove scalability as well as improved quality\nof approximation in practice.\n\nIntroduction\n\n1\nIn this paper we consider the class of problems called quadratically constrained quadratic program-\nming (QCQP) which take the following form:\n\nMinimize\n\nx\n\nsubject to\n\nxT P0x + qT\n\n0 x + r0\n\nxT Pix + qT\n\ni x + ri \u2264 0,\n\n1\n2\n1\n2\nAx = b,\n\ni = 1, . . . , m\n\n(1)\n\nwhere P0, . . . , Pm are n \u00d7 n matrices. If each of these matrices are positive de\ufb01nite, then the\noptimization problem is convex. In general, however, solving QCQP is NP-hard, which can be\nveri\ufb01ed by easily reducing a 0 \u2212 1 integer programming problem (known to be NP-hard) to a QCQP\n[4]. In spite of that challenge, they form an important class of optimization problems, since they\narise naturally in many engineering, scienti\ufb01c and web applications. Two famous examples of QCQP\ninclude the max-cut and boolean optimization [11]. Other examples include alignment of kernels\nin semi-supervised learning [29], learning the kernel matrix in discriminant analysis [28] as well as\nmore general learning of kernel matrices [21], steering direction estimation for radar detection [15],\nseveral applications in signal processing [20], the triangulation in computer vision [3] among others.\nInternet applications handling large scale of data, often model trade-offs between key utilities using\nconstrained optimization formulations [1, 2]. When there is independence among the expected utilities\n(e.g., click, time spent, revenue obtained) of items, the objective or the constraints corresponding to\nthose utilities are linear. However, in most real life scenarios, there is dependence among expected\nutilities of items presented together on a web page or mobile app. 
Examples of such dependence are\nabundant in newsfeeds, search result pages and most lists of recommendations on the internet. If\nthis dependence is expressed through a linear model, it makes the corresponding objective and/or\nconstraint quadratic. This makes the constrained optimization problem a very large scale QCQP, if\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\n\fthe dependence matrix (often enumerated by a very large number of members or updates) is positive\nde\ufb01nite with co-dependent utilities [6].\nAlthough there are a plethora of such applications, solving this problem on a large scale is still\nextremely challenging. There are two main relaxation techniques that are used to solve a QCQP,\nnamely, semi-de\ufb01nite programming (SDP) and reformulation-linearization technique (RLT) [11].\nHowever, both of them introduce a new variable X = xxT so that the problem becomes linear in X.\nThen they relax the condition X = xxT by different means. Doing so unfortunately increases the\nnumber of variables from n to O(n2). This makes these methods prohibitively expensive for most\nlarge scale applications. There is literature comparing these methods which also provides certain\ncombinations and generalizations[4, 5, 22]. However, they all suffer from the same curse of dealing\nwith O(n2) variables. Even when the problem is convex, there are techniques such as second order\ncone programming [23], which can be ef\ufb01cient, but scalability still remains an important issue with\nprior QCQP solvers.\nThe focus of this paper is to introduce a novel approximate solution to the convex QCQP problem\nwhich can tackle such large-scale situations. We devise an algorithm which approximates the\nquadratic constraints by a set of linear constraints, thus converting the problem into a quadratic\nprogram (QP) [11]. In doing so, we remain with a problem having n variables instead of O(n2). 
We\nthen apply ef\ufb01cient QP solvers such as Operator Splitting or ADMM [10, 26] which are well adapted\nfor distributed computing, to get the \ufb01nal solution for problems of much larger scale. We theoretically\nprove the convergence of our technique to the true solution in the limit. We also provide experiments\ncomparing our algorithm to existing state-of-the-art QCQP solvers to show comparative solutions for\nsmaller data size as well as signi\ufb01cant scalability in practice, particularly in the large data regime\nwhere existing methods fail to converge. To the best of our knowledge, this technique is new and has\nnot been previously explored in the optimization literature.\nNotation: Throughout the paper, bold small case letters refer to vectors while bold large-case letters\nrefer to matrices.\nThe rest of the paper is structured as follows. In Section 2, we describe the approximate problem,\nimportant concepts to understand the sampling scheme as well as the approximation algorithm\nto convert the problem into a QP. Section 3 contains the proof of convergence, followed by the\nexperimental results in Section 4. Finally, we conclude with some discussion in Section 5.\n\n2 QCQP to QP Approximation\nFor sake of simplicity throughout the paper, we deal with a QCQP having a single quadratic constraint.\nThe procedure detailed in this paper can be easily generalized to multiple constraints. Thus, for the\nrest of the paper, without loss of generality we consider the problem of the form,\n\nMinimize\n\nx\n\nsubject to\n\n(x \u2212 a)T A(x \u2212 a)\n(x \u2212 b)T B(x \u2212 b) \u2264 \u02dcb,\n\nCx = c.\n\n(2)\n\nThis is a special case of the general formulation in (1). For this paper, we restrict our case to A,\nB \u2208 Rn\u00d7n being positive de\ufb01nite matrices so that the objective function is strongly convex.\nIn this section, we describe the linearization technique to convert the quadratic constraint into a set of\nN linear constraints. 
The main idea behind this approximation, is the fact that given any convex set\nin the Euclidean plane, there exists a convex polytope that covers the set. Let us begin by introducing\na few notations. Let P denote the optimization problem (2). De\ufb01ne,\n\nS := {x \u2208 Rn : (x \u2212 b)T B(x \u2212 b) \u2264 \u02dcb}.\n\n(3)\nLet \u2202S denote the boundary of the ellipsoid S. To generate the N linear constraints for this one\nquadratic constraint, we generate a set of N points, XN = {x1, . . . , xN} such that each xj \u2208 \u2202S for\nj = 1, . . . , N . The sampling technique to select the point set is given in Section 2.1. Corresponding\nto these N points we get the following set of N linear constraints,\n\n(4)\nLooking at it geometrically, it is not hard to see that each of these linear constraints are just tangent\nplanes to S at xj for j = 1, . . . , N. Figure 1 shows a set of six linear constraints for a ellipsoidal\n\n(x \u2212 b)T B(xj \u2212 b) \u2264 \u02dcb\n\nfor j = 1, . . . , N.\n\n2\n\n\ffeasible set in two dimensions. Thus, using these N linear constraints we can write the approximate\noptimization problem, P(XN ), as follows.\n\nMinimize\n\nx\n\nsubject to\n\n(x \u2212 a)T A(x \u2212 a)\n(x \u2212 b)T B(xj \u2212 b) \u2264 \u02dcb\n\nCx = c.\n\nfor j = 1, . . . , N\n\n(5)\n\nNow instead of solving P, we solve P(XN ) for a large enough value of N. Note that as we sample\nmore points (N \u2192 \u221e), our approximation keeps getting better.\n\nFigure 1: Converting a quadratic constraint into linear constraints. The tangent planes through the 6\npoints x1, . . . , x6 create the approximation to S.\n2.1 Sampling Scheme\nThe accuracy of the solution of P(XN ) solely depends on the choice of XN . The tangent planes to\nS at those N points create a cover of S. We use the notion of a bounded cover, which we de\ufb01ne as\nfollows.\nDe\ufb01nition 1. Let T be the convex polytope generated by the tangent planes to S at the points\nx1, . . . 
, xN \u2208 \u2202S. T is said to be a bounded cover of S if

d(T, S) := sup_{t \u2208 T} d(t, S) < \u221e,

where d(t, S) := inf_{x \u2208 S} \u2016t \u2212 x\u2016 and \u2016 \u00b7 \u2016 denotes the Euclidean distance.

The first result shows that there exists a bounded cover with only n + 1 points.

Lemma 1. Let S be an n-dimensional ellipsoid as defined in (3). Then there exists a bounded cover with n + 1 points.

Proof. Since S is a compact convex body in R^n, there exists a location-translated version of an n-dimensional simplex T = {x \u2208 R^n_+ : \u2211_{i=1}^n x_i = K} such that S is contained in the interior of T. We can always shrink T so that each face touches S tangentially. Since there are n + 1 faces, we obtain n + 1 points whose tangent planes create a bounded cover.

Although Lemma 1 gives a simple constructive proof of a bounded cover, it is not what we are truly interested in. What we want is to construct a bounded cover T which is as close as possible to S, leading to a better approximation. Note, however, that choosing the points via naive sampling can lead to arbitrarily bad enlargements of the feasible set, and in the worst case may even create a cover which is not bounded. Hence we need a set of points which creates an optimal bounded cover. Formally,

Definition 2. T* = T(x*_1, . . . , x*_N) is said to be an optimal bounded cover if

sup_{t \u2208 T*} d(t, S) \u2264 sup_{t \u2208 T} d(t, S)

for any bounded cover T generated by any other N-point set. Moreover, {x*_1, . . . , x*_N} is defined to be the optimal N-point set.

Note that we can think of the optimal N-point set as the set of N points which minimizes the maximum distance between T and S, i.e.,

T* = argmin_T d(T, S).

It is not hard to see that the optimal N-point set on the unit circle in two dimensions is the set of N-th roots of unity, unique up to rotation. This point set also has a very good property: it has been shown that the N-th roots of unity minimize the discrete Riesz energy for the unit circle [14, 17]. The concept of Riesz energy also exists in higher dimensions. Generalizing this result, we choose our optimal N-point set on \u2202S to (approximately) minimize the Riesz energy. We briefly describe it below.

2.1.1 Riesz Energy

The Riesz energy of a point set A_N = {x_1, . . . , x_N} is defined as

E_s(A_N) := \u2211_{i \u2260 j} \u2016x_i \u2212 x_j\u2016^{\u2212s}

for a positive real parameter s. There is a vast literature on Riesz energy and its association with "good" configurations of points. It is well known that the measures associated with the optimal point sets minimizing the Riesz energy on \u2202S converge to the normalized surface measure of \u2202S [17]. Using this fact, we associate the optimal N-point set with the set of N points that minimize the Riesz energy on \u2202S. For more details see [18, 19] and the references therein. To describe these good configurations of points, we introduce the concept of equidistribution. We begin with a "good" or equidistributed point set in the unit hypercube (described in Section 2.1.2) and map it to \u2202S such that the equidistribution property still holds (described in Section 2.1.3).

2.1.2 Equidistribution

Informally, a set of points in the unit hypercube is said to be equidistributed if the expected number of points inside any axis-parallel subregion matches the true number of points. 
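To make the notion of equidistribution concrete, the following sketch (ours, not from the paper; it uses the classical base-2 Hammersley construction rather than the authors' particular net) builds a (0, m, 2)-net in base 2 and checks the defining box-counting property: every dyadic box of volume 2^{-m} contains exactly one of the 2^m points.

```python
# Sketch (assumption: the standard 2-D Hammersley set {(i/2^m, vdc(i))}, with
# vdc the van der Corput radical inverse, which is a (0, m, 2)-net in base 2).

def van_der_corput(i: int, base: int = 2) -> float:
    """Radical inverse of i: reflect its base-b digits about the radix point."""
    inv, denom = 0.0, 1.0
    while i > 0:
        i, digit = divmod(i, base)
        denom *= base
        inv += digit / denom
    return inv

def hammersley_net(m: int):
    """Return the 2^m points of a (0, m, 2)-net in base 2 on [0, 1)^2."""
    n = 2 ** m
    return [(i / n, van_der_corput(i)) for i in range(n)]

def count_in_dyadic_box(points, j1, k1, j2, k2):
    """Count points in [k1/2^j1, (k1+1)/2^j1) x [k2/2^j2, (k2+1)/2^j2)."""
    lo1, hi1 = k1 / 2 ** j1, (k1 + 1) / 2 ** j1
    lo2, hi2 = k2 / 2 ** j2, (k2 + 1) / 2 ** j2
    return sum(lo1 <= x < hi1 and lo2 <= y < hi2 for x, y in points)

m = 7
pts = hammersley_net(m)  # 128 points, as in the (0, 7, 2)-net of Figure 2
# Every dyadic box shape with j1 + j2 = m (volume 2^-m) holds exactly one point.
for j1 in range(m + 1):
    j2 = m - j1
    counts = {count_in_dyadic_box(pts, j1, k1, j2, k2)
              for k1 in range(2 ** j1) for k2 in range(2 ** j2)}
    assert counts == {1}
```

In practice any (t, m, n)-net construction from [16, 24] can be substituted; the check above is exactly the "expected count equals true count" property stated informally in the text.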
One such point set in [0, 1]^n is the (t, m, n)-net in base \u03b7, defined as a set of N = \u03b7^m points in [0, 1]^n such that any axis-parallel \u03b7-adic box with volume \u03b7^{t\u2212m} contains exactly \u03b7^t points. Formally, it is a point set that can attain the optimal integration error of O((log N)^{n\u22121}/N) [16] and is usually referred to as a low-discrepancy point set. There is a vast literature on easy construction of these point sets. For more details on nets we refer to [16, 24].

2.1.3 Area-preserving map to \u2202S

Once we have a point set on [0, 1]^n, we map it to \u2202S using a measure-preserving transformation so that the equidistribution property remains intact. We describe the mapping in two steps. First we map the point set from [0, 1]^n to the hypersphere S^n = {x \u2208 R^{n+1} : x^T x = 1}. Then we map it to \u2202S. The mapping from [0, 1]^n to S^n is based on [12].

The cylindrical coordinates of the n-sphere can be written as

x = x_n = (\u221a(1 \u2212 t_n^2) x_{n\u22121}, t_n), . . . , x_2 = (\u221a(1 \u2212 t_2^2) x_1, t_2), x_1 = (cos \u03c6, sin \u03c6),

where 0 \u2264 \u03c6 \u2264 2\u03c0, \u22121 \u2264 t_d \u2264 1, x_d \u2208 S^d and d = 1, . . . , n. Thus, an arbitrary point x \u2208 S^n can be represented through the angle \u03c6 and heights t_2, . . . , t_n as

x = x(\u03c6, t_2, . . . , t_n), 0 \u2264 \u03c6 \u2264 2\u03c0, \u22121 \u2264 t_2, . . . , t_n \u2264 1.

We map a point y = (y_1, . . . , y_n) \u2208 [0, 1)^n to x \u2208 S^n using

\u03d5_1(y_1) = 2\u03c0 y_1, \u03d5_d(y_d) = 1 \u2212 2y_d (d = 2, . . . , n),

and cylindrical coordinates x = \u03a6_n(y) = x(\u03d5_1(y_1), \u03d5_2(y_2), . . . , \u03d5_n(y_n)). The fact that \u03a6_n : [0, 1)^n \u2192 S^n is an area-preserving map has been proved in [12].

Remark. Instead of using (t, m, n)-nets and mapping to S^n, we could also have used spherical t-designs, the existence of which was proved in [9]. 
However, construction of such sets is still a tough problem in high dimensions. We refer to [13] for more details.

Finally, we consider the map \u03c8 to translate the point set from S^{n\u22121} to \u2202S. Specifically, we define

\u03c8(x) = \u221ab\u0303 B^{\u22121/2} x + b.    (6)

From the definition of S in (3), it is easy to see that \u03c8(x) \u2208 \u2202S. The next result shows that this is also an area-preserving map, in the sense of normalized surface measures.

Lemma 2. Let \u03c8 be the mapping from S^{n\u22121} \u2192 \u2202S defined in (6). Then for any set A \u2286 \u2202S,

\u03c3_n(A) = \u03bb_n(\u03c8^{\u22121}(A)),

where \u03c3_n, \u03bb_n are the normalized surface measures of \u2202S and S^{n\u22121} respectively.

Proof. Pick any A \u2286 \u2202S. Then we can write

\u03c8^{\u22121}(A) = { (1/\u221ab\u0303) B^{1/2}(x \u2212 b) : x \u2208 A }.

Since the linear shift does not change the surface area, we have

\u03bb_n(\u03c8^{\u22121}(A)) = \u03bb_n({ (1/\u221ab\u0303) B^{1/2}(x \u2212 b) : x \u2208 A }) = \u03bb_n({ (1/\u221ab\u0303) B^{1/2} x : x \u2208 A }) = \u03c3_n(A),

where the last equality follows from the definition of normalized surface measures and noting that B^{1/2}x/\u221ab\u0303 \u2208 S^{n\u22121}. This completes the proof.

Using Lemma 2 we see that the map \u03c8 \u25e6 \u03a6_{n\u22121} : [0, 1)^{n\u22121} \u2192 \u2202S is a measure-preserving map. Using this map and a (t, m, n \u2212 1)-net in base \u03b7, we derive the optimal \u03b7^m-point set on \u2202S. Figure 2 shows how we transform a (0, 7, 2)-net in base 2 to a sphere and then to an ellipsoid. 
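A minimal sketch of this two-step map (our illustration, not the paper's code; we fix n = 3 so \u03a6_2 maps [0, 1)^2 onto the 2-sphere, and B, b, b\u0303 are arbitrary illustrative values). Each mapped point lands exactly on \u2202S, i.e. satisfies the quadratic constraint with equality:

```python
# Sketch (assumptions: n = 3; B, b, b_tilde are made-up illustrative data).
import numpy as np

def phi(y):
    """Cylindrical map Phi_2 : [0,1)^2 -> S^2 (angle 2*pi*y1, height 1 - 2*y2)."""
    angle = 2 * np.pi * y[:, 0]
    t = 1 - 2 * y[:, 1]
    r = np.sqrt(1 - t ** 2)
    return np.column_stack([r * np.cos(angle), r * np.sin(angle), t])

def psi(x, B, b, b_tilde):
    """Map (6): psi(x) = sqrt(b_tilde) * B^{-1/2} x + b, sending S^{n-1} to dS."""
    w, V = np.linalg.eigh(B)                 # B symmetric positive definite
    B_inv_half = V @ np.diag(w ** -0.5) @ V.T
    return np.sqrt(b_tilde) * x @ B_inv_half + b

rng = np.random.default_rng(0)
M = rng.standard_normal((3, 3))
B = M @ M.T + 3 * np.eye(3)                  # random SPD matrix (illustrative)
b = np.array([1.0, -2.0, 0.5])
b_tilde = 4.0

y = rng.random((256, 2))      # any point set in [0,1)^2; Algorithm 1 uses a net
pts = psi(phi(y), B, b, b_tilde)

# Every mapped point satisfies (x - b)^T B (x - b) = b_tilde exactly.
residual = np.einsum('ij,jk,ik->i', pts - b, B, pts - b) - b_tilde
assert np.allclose(residual, 0)
```

Replacing the uniform draws `y` with a (t, m, 2)-net, as in Algorithm 1 below, gives the low-discrepancy boundary points the method requires.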
For more general geometric constructions we refer to [7, 8].

Figure 2: The left panel shows a (0, 7, 2)-net in base 2, which is mapped to a sphere in 3 dimensions (middle panel) and then mapped to the ellipsoid (right panel).

2.2 Algorithm and Efficient Solution

With the description in the previous section in place, we can now state the approximation algorithm. We approximate the problem P by P(X_N) using a set of points x_1, . . . , x_N generated as described in Algorithm 1.

Algorithm 1 Point Simulation on \u2202S
1: Input: B, b, b\u0303 to specify S, and N = \u03b7^m points
2: Output: x_1, . . . , x_N \u2208 \u2202S
3: Generate y_1, . . . , y_N as a (t, m, n \u2212 1)-net in base \u03b7.
4: for i \u2208 1, . . . , N do
5:     x_i = \u03c8 \u25e6 \u03a6_{n\u22121}(y_i)
6: end for
7: return x_1, . . . , x_N

Once we formulate the problem P as P(X_N), we solve the resulting large-scale QP via state-of-the-art solvers such as Operator Splitting or Block Splitting approaches [10, 25, 26].

3 Convergence of P(X_N) to P

In this section, we show that if we follow Algorithm 1 to generate the approximate problem P(X_N), then we converge to the original problem P as N \u2192 \u221e. We also prove some finite sample results giving error bounds on the solution to P(X_N). We start by introducing some notation. Let x* and x*(N) denote the solutions to P and P(X_N) respectively, and let f(\u00b7) denote the strongly convex objective function in (2), i.e., for ease of notation,

f(x) = (x \u2212 a)^T A(x \u2212 a).

We begin with our main result.

Theorem 1. Let P be the QCQP defined in (2) and P(X_N) be the approximate QP defined in (5) via Algorithm 1. Then P(X_N) \u2192 P as N \u2192 \u221e, in the sense that lim_{N\u2192\u221e} \u2016x*(N) \u2212 x*\u2016 = 0.

Proof. Fix any N. Let T_N denote the optimal bounded cover constructed with N points on \u2202S. 
Note that to prove the result, it is enough to show that T_N \u2192 S as N \u2192 \u221e. This guarantees that the linear constraints of P(X_N) converge to the quadratic constraint of P, and hence the two problems match. Since S \u2286 T_N for all N, it is easy to see that S \u2286 lim_{N\u2192\u221e} T_N.

To prove the converse, let t_0 \u2208 lim_{N\u2192\u221e} T_N but t_0 \u2209 S, so that d(t_0, S) > 0. Let t_1 denote the projection of t_0 onto S; thus t_0 \u2260 t_1 \u2208 \u2202S. Choose \u03b5 arbitrarily small and consider any region A_\u03b5 around t_1 on \u2202S such that d(x, t_1) \u2264 \u03b5 for all x \u2208 A_\u03b5, where d denotes the surface distance function. By the equidistribution property of Algorithm 1, as N \u2192 \u221e there exists a point t* \u2208 A_\u03b5 the tangent plane through which cuts the line joining t_0 and t_1. Thus t_0 \u2209 lim_{N\u2192\u221e} T_N. Hence we get a contradiction, and the result is proved.

As a simple corollary to Theorem 1, it is easy to see that lim_{N\u2192\u221e} |f(x*(N)) \u2212 f(x*)| = 0. We now move to some finite sample results.

Theorem 2. Let g : N \u2192 R be such that lim_{n\u2192\u221e} g(n) = 0. Further assume that \u2016x*(N) \u2212 x*\u2016 \u2264 C_1 g(N) for some constant C_1 > 0. Then |f(x*(N)) \u2212 f(x*)| \u2264 C_2 g(N), where C_2 > 0 is a constant.

Proof. We begin by bounding \u2016x*\u2016. Since x* satisfies the constraint of the optimization problem, we have b\u0303 \u2265 (x* \u2212 b)^T B(x* \u2212 b) \u2265 \u03c3_min(B) \u2016x* \u2212 b\u2016^2, where \u03c3_min(B) denotes the smallest singular value of B. Thus,

\u2016x*\u2016 \u2264 \u2016b\u2016 + \u221a(b\u0303 / \u03c3_min(B)).    (7)

Now, since f(x) = (x \u2212 a)^T A(x \u2212 a) and \u2207f(x) = 2A(x \u2212 a), we can write

f(x) = f(x*) + \u222b_0^1 \u27e8\u2207f(x* + t(x \u2212 x*)), x \u2212 x*\u27e9 dt
     = f(x*) + \u27e8\u2207f(x*), x \u2212 x*\u27e9 + \u222b_0^1 \u27e8\u2207f(x* + t(x \u2212 x*)) \u2212 \u2207f(x*), x \u2212 x*\u27e9 dt
     = I_1 + I_2 + I_3 (say).

We can bound the last term as follows. Using the Cauchy\u2013Schwarz inequality,

|I_3| \u2264 \u222b_0^1 |\u27e8\u2207f(x* + t(x \u2212 x*)) \u2212 \u2207f(x*), x \u2212 x*\u27e9| dt
     \u2264 \u222b_0^1 \u2016\u2207f(x* + t(x \u2212 x*)) \u2212 \u2207f(x*)\u2016 \u2016x \u2212 x*\u2016 dt
     \u2264 2\u03c3_max(A) \u222b_0^1 \u2016t(x \u2212 x*)\u2016 \u2016x \u2212 x*\u2016 dt = \u03c3_max(A) \u2016x \u2212 x*\u2016^2,

where \u03c3_max(A) denotes the largest singular value of A. Thus we have

f(x) = f(x*) + \u27e8\u2207f(x*), x \u2212 x*\u27e9 + C\u0303 \u2016x \u2212 x*\u2016^2,    (8)

where |C\u0303| \u2264 \u03c3_max(A). Furthermore,

|\u27e8\u2207f(x*), x*(N) \u2212 x*\u27e9| = |\u27e82A(x* \u2212 a), x*(N) \u2212 x*\u27e9|
     \u2264 2\u03c3_max(A)(\u2016x*\u2016 + \u2016a\u2016) \u2016x*(N) \u2212 x*\u2016
     \u2264 2C_1 \u03c3_max(A) ( \u221a(b\u0303 / \u03c3_min(B)) + \u2016b\u2016 + \u2016a\u2016 ) g(N),    (9)

where the last inequality follows from (7). 
Combining (8) and (9), the result follows.

Note that the function g gives us an idea of how fast x*(N) converges to x*. To help identify the function g, we state the following results.

Lemma 3. If f(x*) = f(x*(N)), then x* = x*(N). Furthermore, if f(x*) \u2265 f(x*(N)), then x* \u2208 \u2202U and x*(N) \u2209 U, where U = S \u2229 {x : Cx = c} is the feasible set for (2).

Proof. Let V = T_N \u2229 {x : Cx = c}. It is easy to see that U \u2286 V. Assume f(x*) = f(x*(N)) but x* \u2260 x*(N). Note that x*, x*(N) \u2208 V. Since V is convex, consider the line joining x* and x*(N). For any point \u03bb_t = t x* + (1 \u2212 t) x*(N),

f(\u03bb_t) \u2264 t f(x*) + (1 \u2212 t) f(x*(N)) = f(x*(N)).

Thus f is constant on the line joining x* and x*(N). But f is strongly convex since A is positive definite [27], so it has a unique minimum. Hence we have a contradiction, which proves x* = x*(N). Now assume f(x*) \u2265 f(x*(N)). Clearly x*(N) \u2209 U. Suppose x* \u2208 int U, the interior of U. Let x\u0303 \u2208 \u2202U denote the point on the line joining x* and x*(N). Clearly x\u0303 = t x* + (1 \u2212 t) x*(N) for some t > 0. Thus f(x\u0303) < t f(x*) + (1 \u2212 t) f(x*(N)) \u2264 f(x*). But x* is the minimizer over U. Thus we have a contradiction, which gives x* \u2208 \u2202U. This completes the proof.

Lemma 4. Following the notation of Lemma 3, if x*(N) \u2209 U, then x* lies on \u2202U and no point on the line joining x* and x*(N) lies in S.

Proof. Since the gradient of f is linear, the result follows from an argument similar to Lemma 3.

Based on the above two results, we can identify the function g by considering the maximum distance of the points lying on a conic cap to the hyperplanes forming it. That is, g(N) is the maximum distance between a point x \u2208 \u2202S and a point t \u2208 T such that the line joining x and t does not intersect S and hence lies completely within the conic section. This is highly dependent on the shape of S and on the cover T_N. For example, if S is the unit circle in two dimensions, then the optimal N-point set is the set of N-th roots of unity, in which case there are N equivalent conic sections C_1, . . . , C_N created by the intersections of \u2202S with T_N. Figure 3 illustrates these regions.

Figure 3: The shaded region shows the 6 equivalent conic regions, C_1, . . . , C_6.

To formally define g(N) in this situation, let A(t, x) denote the set of all points on the line joining t \u2208 T and x \u2208 \u2202S. It is then easy to see that

g(N) := max_{i=1,...,N} sup_{t,x : A(t,x) \u2286 C_i} \u2016t \u2212 x\u2016 = tan(\u03c0/N) = O(1/N),    (10)

where the bound follows from the Taylor series expansion of tan(x). Combining this observation with Theorem 2 shows that in order to get an objective value within \u03b5 of the true optimum, we need N to be a constant multiple of \u03b5^{\u22121}. More such results can be obtained by similar explicit calculations over various domains S.

4 Experimental Results

We compare our proposed technique to current state-of-the-art QCQP solvers. Specifically, we compare it to the SDP and RLT relaxation procedures described in [4]. For small enough problems, we also compare our method to the exact solution obtained by interior point methods. 
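The unit-circle value g(N) = tan(\u03c0/N) in (10) can be checked numerically by intersecting the tangent lines at two adjacent roots of unity and measuring the distance from the resulting polygon vertex back to a tangency point (a small verification script of ours, not part of the paper's experiments):

```python
# Sketch (ours): verify g(N) = tan(pi/N) for the unit circle, eq. (10).
import math

def vertex_distance(N: int) -> float:
    """Distance from a tangency point to the adjacent circumscribed-polygon
    vertex, found by intersecting tangent lines at two neighbouring roots
    of unity. The tangent at angle a is cos(a)*x + sin(a)*y = 1."""
    a1, a2 = 0.0, 2 * math.pi / N
    det = math.cos(a1) * math.sin(a2) - math.sin(a1) * math.cos(a2)
    vx = (math.sin(a2) - math.sin(a1)) / det   # Cramer's rule for the 2x2 system
    vy = (math.cos(a1) - math.cos(a2)) / det
    return math.hypot(vx - math.cos(a1), vy - math.sin(a1))

for N in (8, 64, 512):
    g = vertex_distance(N)
    assert abs(g - math.tan(math.pi / N)) < 1e-12
    # g(N) * N tends to pi, confirming the O(1/N) rate claimed in (10)
    print(N, g, g * N)
```

The printed products g(N)\u00b7N approach \u03c0, consistent with tan(\u03c0/N) \u2248 \u03c0/N for large N.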
Furthermore, we provide empirical evidence that our sampling technique is better than simpler sampling procedures, such as uniform sampling on the unit square or on the unit sphere followed by the mapping to our domain as in Algorithm 1. We begin by considering a very simple QCQP of the form

Minimize_x  x^T A x
subject to  (x \u2212 x_0)^T B (x \u2212 x_0) \u2264 b\u0303,
            l \u2264 x \u2264 u.    (11)

We randomly sample A, B, x_0 and b\u0303, keeping the problem convex. The lower bound l and upper bound u are chosen such that the box intersects the ellipsoid. We vary the dimension n of the problem and tabulate the final objective value as well as the time taken for the different procedures to converge in Table 1. The stopping criterion throughout our simulations is the same as that of the Operator Splitting algorithm presented in [26].

Table 1: The Optimal Objective Value and Convergence Time

  n     | Our Method       | Sampling on [0,1]^n | Sampling on S^n  | SDP              | RLT              | Exact
  5     | 3.00 (4.61s)     | 2.99 (4.74s)        | 2.95 (6.11s)     | 3.07 (0.52s)     | 3.08 (0.51s)     | 3.07 (0.49s)
  10    | 206.85 (5.04s)   | 205.21 (5.65s)      | 206.5 (5.26s)    | 252.88 (0.53s)   | 252.88 (0.51s)   | 252.88 (0.51s)
  20    | 6291.4 (6.56s)   | 4507.8 (6.28s)      | 5052.2 (6.69s)   | 6841.6 (2.05s)   | 6841.6 (1.86s)   | 6841.6 (0.54s)
  50    | 99668 (15.55s)   | 15122 (18.98s)      | 26239 (17.32s)   | 1.11e5 (4.31s)   | 1.08e5 (2.96s)   | 1.11e5 (0.64s)
  100   | 1.40e6 (58.41s)  | 69746 (1.03m)       | 1.24e6 (54.69s)  | 1.62e6 (30.41s)  | 1.52e6 (15.36s)  | 1.62e6 (2.30s)
  1000  | 2.24e7 (14.87m)  | 8.34e6 (15.63m)     | 9.02e6 (15.32m)  | NA               | NA               | NA
  10^5  | 3.10e8 (25.82m)  | 7.12e7 (24.59m)     | 8.39e7 (27.23m)  | NA               | NA               | NA
  10^6  | 3.91e9 (38.30m)  | 2.69e8 (39.15m)     | 7.53e8 (37.21m)  | NA               | NA               | NA

Throughout our simulations, we choose \u03b7 = 2 and the number of optimal points as N = max(1024, 2^m), where m is the smallest integer such that 2^m \u2265 10n. Note that even though the SDP and interior point methods converge very efficiently for small values of n, they cannot scale to n \u2265 1000, which is where the strength of our method becomes evident. From Table 1 we observe that the relaxation procedures SDP and RLT fail to converge within an hour of computation time for n \u2265 1000, whereas all the approximation procedures easily scale up to n = 10^6 variables. Moreover, since A and B were randomly sampled, the true optimal solution occurred at the boundary; relaxing the constraint to linear therefore pushed the solution outside the feasible set, as seen in Table 1 and as predicted by Lemma 3. This is not a concern, since increasing N brings us closer to the feasible set. The exact choice of N differs from problem to problem, but can be computed as in the small example in (10). Finally, the last column of Table 1 is obtained by solving the problem with cvx in MATLAB via SeDuMi and SDPT3, which gives the true x*.

Furthermore, our procedure gives the best approximation result compared to the remaining two sampling schemes. 
Lemma 3 shows that if both objective values are the same, then we indeed recover the exact solution. To see how much the approximation deviates from the truth, we also plot the log of the relative squared error, log(\u2016x*(N) \u2212 x*\u2016^2 / \u2016x*\u2016^2), for each of the sampling procedures in Figure 4. Throughout this simulation we keep N fixed at 1024, which is why the error level increases with the dimension. We omit the SDP and RLT results in Figure 4, since both produce a solution very close to the exact minimum for small n.

Figure 4: The log of the relative squared error log(\u2016x*(N) \u2212 x*\u2016^2 / \u2016x*\u2016^2) with N fixed at 1024 and varying dimension n.

If we grow N with the dimension, the increasing trend vanishes and we get much more accurate results, as seen in Figure 5. We plot both the log of the relative squared error and the log of the feasibility error, where the feasibility error is defined as

Feasibility Error = ( (x*(N) \u2212 x_0)^T B (x*(N) \u2212 x_0) \u2212 b\u0303 )_+ ,

where (x)_+ denotes the positive part of x.

Figure 5: The left and right panels show the decay in the relative squared error and the feasibility error, respectively, for our method as we increase N for various dimensions.

From these results it is clear that our procedure attains the smallest relative error compared to the other sampling schemes, and that increasing N brings us closer to the feasible set, with more accurate results.

5 Discussion and Future Work

In this paper, we look at the problem of solving a large-scale QCQP by relaxing the quadratic constraint via a near-optimal sampling scheme. 
This approximate method can scale up to very large problem sizes while generating solutions with good theoretical convergence properties. Theorem 2 gives an upper bound as a function of g(N), which can be explicitly calculated for different problems. To get the rate as a function of the dimension n, we need to understand how the maximum and minimum eigenvalues of the matrices A and B grow with n. One idea is to use random matrix theory to derive a probabilistic bound. Because of the complexity of these problems, we believe they deserve special attention, and we leave them to future work. We also believe that this technique can be immensely useful in several applications. Our next step is a detailed study in which we apply this technique to some of these applications and empirically compare it with other existing large-scale commercial solvers such as CPLEX and ADMM-based techniques for SDP.

Acknowledgment
We would sincerely like to thank the anonymous referees for their helpful comments, which have tremendously improved the paper. We would also like to thank Art Owen, Souvik Ghosh, Ya Xu and Bee-Chung Chen for the helpful discussions.

References
[1] D. Agarwal, S. Chatterjee, Y. Yang, and L. Zhang. Constrained optimization for homepage relevance. In Proceedings of the 24th International Conference on World Wide Web Companion, pages 375\u2013384. International World Wide Web Conferences Steering Committee, 2015.

[2] D. Agarwal, B.-C. Chen, P. Elango, and X. Wang. 
Personalized click shaping through Lagrangian duality for online recommendation. In Proceedings of the 35th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 485–494. ACM, 2012.

[3] C. Aholt, S. Agarwal, and R. Thomas. A QCQP approach to triangulation. In Proceedings of the 12th European Conference on Computer Vision - Volume Part I, ECCV'12, pages 654–667, Berlin, Heidelberg, 2012. Springer-Verlag.

[4] K. M. Anstreicher. Semidefinite programming versus the reformulation-linearization technique for nonconvex quadratically constrained quadratic programming. Journal of Global Optimization, 43(2):471–484, 2009.

[5] X. Bao, N. V. Sahinidis, and M. Tawarmalani. Semidefinite relaxations for quadratically constrained quadratic programming: A review and comparisons. Mathematical Programming, 129(1):129–157, 2011.

[6] K. Basu, S. Chatterjee, and A. Saha. Constrained multi-slot optimization for ranking recommendations. arXiv:1602.04391, 2016.

[7] K. Basu and A. B. Owen. Low discrepancy constructions in the triangle. SIAM Journal on Numerical Analysis, 53(2):743–761, 2015.

[8] K. Basu and A. B. Owen. Scrambled geometric net integration over general product spaces. Foundations of Computational Mathematics, 17(2):467–496, 2017.

[9] A. V. Bondarenko, D. Radchenko, and M. S. Viazovska. Optimal asymptotic bounds for spherical designs. Annals of Mathematics, 178(2):443–452, 2013.

[10] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–122, 2011.

[11] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

[12] J. S. Brauchart and J. Dick. Quasi-Monte Carlo rules for numerical integration over the unit sphere S^2.
Numerische Mathematik, 121(3):473–502, 2012.

[13] J. S. Brauchart and P. J. Grabner. Distributing many points on spheres: minimal energy and designs. Journal of Complexity, 31(3):293–326, 2015.

[14] J. S. Brauchart, D. P. Hardin, and E. B. Saff. The Riesz energy of the Nth roots of unity: an asymptotic expansion for large N. Bulletin of the London Mathematical Society, 41(4):621–633, 2009.

[15] A. De Maio, Y. Huang, D. P. Palomar, S. Zhang, and A. Farina. Fractional QCQP with applications in ML steering direction estimation for radar detection. IEEE Transactions on Signal Processing, 59(1):172–185, 2011.

[16] J. Dick and F. Pillichshammer. Digital Nets and Sequences: Discrepancy Theory and Quasi-Monte Carlo Integration. Cambridge University Press, Cambridge, 2010.

[17] M. Götz. On the Riesz energy of measures. Journal of Approximation Theory, 122(1):62–78, 2003.

[18] P. J. Grabner. Point sets of minimal energy. In Applications of Algebra and Number Theory (Lectures on the Occasion of Harald Niederreiter's 70th Birthday), edited by G. Larcher, F. Pillichshammer, A. Winterhof, and C. Xing, pages 109–125, 2014.

[19] D. Hardin and E. Saff. Minimal Riesz energy point configurations for rectifiable d-dimensional manifolds. Advances in Mathematics, 193(1):174–204, 2005.

[20] Y. Huang and D. P. Palomar. Randomized algorithms for optimal solutions of double-sided QCQP with applications in signal processing. IEEE Transactions on Signal Processing, 62(5):1093–1108, 2014.

[21] G. R. G. Lanckriet, N. Cristianini, P. L. Bartlett, L. E. Ghaoui, and M. I. Jordan. Learning the kernel matrix with semidefinite programming. In Proceedings of the 19th International Conference on Machine Learning (ICML 2002), pages 323–330, 2002.

[22] J. B. Lasserre. Semidefinite programming vs. LP relaxations for polynomial programming. Mathematics of Operations Research, 27(2):347–360, 2002.

[23] Y.
Nesterov and A. Nemirovskii. Interior-Point Polynomial Algorithms in Convex Programming. SIAM, 1994.

[24] H. Niederreiter. Random Number Generation and Quasi-Monte Carlo Methods. SIAM, Philadelphia, PA, 1992.

[25] B. O'Donoghue, E. Chu, N. Parikh, and S. Boyd. Conic optimization via operator splitting and homogeneous self-dual embedding. Journal of Optimization Theory and Applications, 169(3):1042–1068, 2016.

[26] N. Parikh and S. Boyd. Block splitting for distributed optimization. Mathematical Programming Computation, 6(1):77–102, 2014.

[27] R. T. Rockafellar. Convex Analysis. Princeton University Press, 1970.

[28] J. Ye, S. Ji, and J. Chen. Learning the kernel matrix in discriminant analysis via quadratically constrained quadratic programming. In Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 854–863, 2007.

[29] X. Zhu, J. Kandola, J. Lafferty, and Z. Ghahramani. Graph kernels by spectral transforms. In Semi-Supervised Learning, pages 277–291. MIT Press, 2006.