{"title": "Sparse Approximate Conic Hulls", "book": "Advances in Neural Information Processing Systems", "page_first": 2534, "page_last": 2544, "abstract": "We consider the problem of computing a restricted nonnegative matrix factorization (NMF) of an m\\times n matrix X. Specifically, we seek a factorization X\\approx BC, where the k columns of B are a subset of those from X and C\\in\\Re_{\\geq 0}^{k\\times n}. Equivalently, given the matrix X, consider the problem of finding a small subset, S, of the columns of X such that the conic hull of S \\eps-approximates the conic hull of the columns of X, i.e., the distance of every column of X to the conic hull of the columns of S should be at most an \\eps-fraction of the angular diameter of X. If k is the size of the smallest \\eps-approximation, then we produce an O(k/\\eps^{2/3}) sized O(\\eps^{1/3})-approximation, yielding the first provable, polynomial time \\eps-approximation for this class of NMF problems, where also desirably the approximation is independent of n and m. Furthermore, we prove an approximate conic Carath\u00e9odory theorem, a general sparsity result, that shows that any column of X can be \\eps-approximated with an O(1/\\eps^2) sparse combination from S. Our results are facilitated by a reduction to the problem of approximating convex hulls, and we prove that both the convex and conic hull variants are d-sum-hard, resolving an open problem. Finally, we provide experimental results for the convex and conic algorithms on a variety of feature selection tasks.", "full_text": "Sparse Approximate Conic Hulls\n\nGregory Van Buskirk, Benjamin Raichel, and Nicholas Ruozzi\n\nDepartment of Computer Science\n\nUniversity of Texas at Dallas\n\n{greg.vanbuskirk, benjamin.raichel, nicholas.ruozzi}@utdallas.edu\n\nRichardson, TX 75080\n\nAbstract\n\nWe consider the problem of computing a restricted nonnegative matrix factorization\n(NMF) of an m \u00d7 n matrix X. 
Specifically, we seek a factorization X ≈ BC, where the k columns of B are a subset of those from X and C ∈ R^{k×n}_{≥0}. Equivalently, given the matrix X, consider the problem of finding a small subset, S, of the columns of X such that the conic hull of S ε-approximates the conic hull of the columns of X, i.e., the distance of every column of X to the conic hull of the columns of S should be at most an ε-fraction of the angular diameter of X. If k is the size of the smallest ε-approximation, then we produce an O(k/ε^{2/3}) sized O(ε^{1/3})-approximation, yielding the first provable, polynomial time ε-approximation for this class of NMF problems, where also desirably the approximation is independent of n and m. Furthermore, we prove an approximate conic Carathéodory theorem, a general sparsity result, which shows that any column of X can be ε-approximated with an O(1/ε^2) sparse combination from S. Our results are facilitated by a reduction to the problem of approximating convex hulls, and we prove that both the convex and conic hull variants are d-SUM-hard, resolving an open problem. Finally, we provide experimental results for the convex and conic algorithms on a variety of feature selection tasks.

1 Introduction

Matrix factorizations of all sorts (SVD, NMF, CU, etc.) are ubiquitous in machine learning and computer science. In general, given an m × n matrix X, the goal is to find a decomposition into a product of two matrices B ∈ R^{m×k} and C ∈ R^{k×n} such that the Frobenius norm of the difference X − BC is minimized. If no further restrictions are placed on the matrices B and C, this problem can be solved optimally by computing the singular value decomposition. However, imposing restrictions on B and C can lead to factorizations which are more desirable for reasons such as interpretability and sparsity.
One of the most common restrictions is non-negative matrix factorization (NMF), requiring B and C to consist only of non-negative entries (see [Berry et al., 2007] for a survey). Practically, NMF has seen widespread usage as it often produces nice factorizations that are frequently sparse. Typically NMF is accomplished by applying local search heuristics, and while NMF can be solved exactly in certain cases (see [Arora et al., 2016]), in general NMF is not only NP-hard [Vavasis, 2009] but also d-SUM-hard [Arora et al., 2016].

One drawback of factorizations such as SVD or NMF is that they can represent the data using a basis that may have no clear relation to the data. CU decompositions [Mahoney and Drineas, 2009] address this by requiring the basis to consist of input points. While it appears that the hardness of this problem has not been resolved, approximate solutions are known. Most notable is the additive approximation of Frieze et al. [2004], though more recently there have been advances on the multiplicative front [Drineas et al., 2008, Çivril and Magdon-Ismail, 2012, Guruswami and Sinop, 2012]. Similar restrictions have also been considered for NMF. Donoho and Stodden [2003] introduced a separability assumption for NMF, and Arora et al. [2016] showed that an NMF can be computed in polynomial time under this assumption. Various other methods have since been proposed for NMF under the separability (or near separability) assumption [Recht et al., 2012, Kumar et al., 2013, Benson et al., 2014, Gillis and Vavasis, 2014, Zhou et al., 2014, Kumar and Sindhwani, 2015]. The separability assumption requires that there exists a subset S of the columns of X such that X = X_S C for some nonnegative matrix C. This assumption can be restrictive in practice, e.g., when an exact subset does not exist but a close approximate subset does, i.e., X ≈ X_S C.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.
To our knowledge, no exact or approximate polynomial time algorithms have been proposed for the general problem of computing an NMF under only the restriction that the columns must be selected from those of X.

In this work, we fill this gap by arguing that a simple greedy algorithm can be used to provide a polynomial time ε-approximation algorithm for NMF under the column subset restriction. Note that the separability assumption is not required here: our theoretical analysis bounds the error of our selected columns versus the best possible columns that could have been chosen. The algorithm is based on recent work on fast algorithms for approximately computing the convex hull of a set of points [Blum et al., 2016]. As in previous approaches [Donoho and Stodden, 2003, Kumar et al., 2013], we formulate restricted NMF geometrically as finding a subset, S, of the columns of the matrix X whose conic hull, the set of all nonnegative combinations of columns of S, well-approximates the conic hull of X. Using gnomonic projection, we reduce the conic hull problem to a convex hull problem and then apply the greedy strategy of Blum et al. [2016] to compute the convex hull of the projected points. Given a set of points P in R^m, the convex hull of S ⊆ P, denoted Convex(S), is said to ε-approximate Convex(P) if the Hausdorff distance between Convex(S) and Convex(P) is at most ε · diameter(P). For a fixed ε > 0, if the minimum sized subset of P whose convex hull ε-approximates the convex hull of P has size k, then Blum et al. [2016] show that a simple greedy algorithm gives an ε′ = O(ε^{1/3}) approximation using at most k′ = O(k/ε^{2/3}) points of P, with an efficient O(nc(m + c/ε^2 + c^2)) running time, where c = O(kopt/ε^{2/3}). By careful analysis, we show that our reduction achieves the same guarantees for the conic problem.
(Note Blum et al. [2016] present other trade-offs between k′ and ε′, which we argue carry to the conic case as well). Significantly, k′ and ε′ are independent of n and m, making this algorithm desirable for large high dimensional point sets. Note that our bounds on the approximation quality and the number of points do not explicitly depend on the dimension as they are relative to the size of the optimal solution, which itself may or may not depend on dimension. Like the X-RAY algorithm [Kumar et al., 2013], our algorithm is easy to parallelize, allowing it to be applied to large-scale problems.

In addition to the above ε-approximation algorithm, we also present two additional theoretical results of independent interest. The first theoretical contribution provides justification for empirical observations about the sparsity of NMF [Lee and Seung, 1999, Ding et al., 2010]. Due to the high dimensional nature of many data sets, there is significant interest in sparse representations requiring far fewer points than the dimension. Our theoretical justification for sparsity is based on Carathéodory's theorem: any point q in the convex hull of P can be expressed as a convex combination of at most m + 1 points from P. This is tight in the worst case for exact representation; however, the approximate Carathéodory theorem [Clarkson, 2010, Barman, 2015] states there is a point q′ which is a convex combination of O(1/ε^2) points of P (i.e., independent of n and m) such that ||q − q′|| ≤ ε · diameter(P). This result has a long history with significant implications in machine learning, e.g., relating to the analysis of the perceptron algorithm [Novikoff, 1962], though the clean geometric statement of this theorem appears not to be well known outside the geometry community.
Moreover, this approximation is easily computable with a greedy algorithm (e.g., [Blum et al., 2016]) similar to the Frank-Wolfe algorithm. The analogous statement for the linear case does not hold, so it is not immediately obvious whether such an approximate Carathéodory theorem should hold for the conic case, a question which we answer in the affirmative. As a second theoretical contribution, we address the question of whether or not the convex/conic hull problems are actually hard, i.e., whether approximations are actually necessary. We answer this question for both problems in the affirmative, resolving an open question of Blum et al. [2016], by showing that both the conic and convex problems are d-SUM-hard.

Finally, we evaluate the performance of the greedy algorithms for computing the convex and conic hulls on a variety of feature selection tasks against existing methods. We observe that both the conic and convex algorithms perform well for a variety of feature selection tasks, though, somewhat surprisingly, the convex hull algorithm, for which previously no experimental results had been produced, yields consistently superior results on text datasets. We use our theoretical results to provide intuition for these empirical observations.

2 Preliminaries

Let P be a point set in R^m. For any p ∈ P, we interchangeably use the terms vector and point, depending on whether or not we wish to emphasize the direction from the origin. Let ray(p) denote the unbounded ray passing through p, whose base lies at the origin. Let unit(p) denote the unit vector in the direction of p; equivalently, unit(p) is the intersection of ray(p) with the unit hypersphere S^{m−1}. For any subset X = {x_1, . . . , x_k} ⊆ P, ray(X) = {ray(x_1), . . . , ray(x_k)} and unit(X) = {unit(x_1), . . . , unit(x_k)}.

Given points p, q ∈ P, let d(p, q) = ||p − q|| denote their Euclidean distance, and let ⟨p, q⟩ denote their dot product. Let angle(ray(p), ray(q)) = angle(p, q) = cos^{-1}(⟨unit(p), unit(q)⟩) denote the angle between the rays ray(p) and ray(q), or equivalently between the vectors p and q. For two sets P, Q ⊆ R^m, we write d(P, Q) = min_{p∈P, q∈Q} d(p, q), and for a single point q we write d(q, P) = d({q}, P); the same definitions apply to angle().

For any subset X = {x_1, . . . , x_k} ⊆ P, let Convex(X) = {Σ_i α_i x_i | α_i ≥ 0, Σ_i α_i = 1} denote the convex hull of X. Similarly, let Conic(X) = {Σ_i α_i x_i | α_i ≥ 0} denote the conic hull of X, and DualCone(X) = {z ∈ R^m | ⟨x, z⟩ ≥ 0 ∀x ∈ X} the dual cone. For any point q ∈ R^m, the projection of q onto Convex(X) is the closest point to q in Convex(X), proj(q) = proj(q, Convex(X)) = arg min_{x∈Convex(X)} d(q, x). Similarly, the angular projection of q onto Conic(X) is the angularly closest point to q in Conic(X), aproj(q) = aproj(q, Conic(X)) = arg min_{x∈Conic(X)} angle(q, x). Note that angular projection defines an entire ray of Conic(X) rather than a single point, so without loss of generality we choose the point on the ray minimizing the Euclidean distance to q. In fact, abusing notation, we sometimes equivalently view Conic(X) as a set of rays rather than points, in which case aproj(ray(q)) = aproj(q) is the entire ray.

For X ⊂ R^m, let Δ = ΔX = max_{p,q∈X} d(p, q) denote the diameter of X. The angular diameter of X is φ = φX = max_{p,q∈X} angle(p, q). Similarly, φX(q) = max_{p∈X} angle(p, q) denotes the angular radius of the minimum radius cone centered around the ray through q and containing all of X.

Definition 2.1. Consider a subset X of a point set P ⊂ R^m.
X is an ε-approximation to Convex(P) if dconvex(X, P) = max_{p∈Convex(P)} d(p, Convex(X)) ≤ εΔ. Note dconvex(X, P) is the Hausdorff distance between Convex(X) and Convex(P). Similarly, X is an ε-approximation to Conic(P) if dconic(X, P) = max_{p∈Conic(P)} angle(p, Conic(X)) ≤ εφP.

Note that the definition of ε-approximation for Conic(P) uses angular rather than Euclidean distance in order to be defined for rays, i.e., scaling a point outside the conic hull changes its Euclidean distance, but its angular distance is unchanged since its ray stays the same. Thus we find that angles better capture what it means to approximate the conic hull than the distance-based Frobenius norm which is often used to evaluate the quality of approximation for NMF.

As we are concerned only with angles, without loss of generality we will often assume that all points in the input set P have been scaled to have unit length, i.e., P = unit(P). In our theoretical results, we will always assume that φP < π/2. Note that if P lies in the non-negative orthant, then for any strictly positive q, φP(q) < π/2. In the case that P is not strictly inside the positive orthant, the points can be uniformly translated a small amount to ensure that φP < π/2.

3 A Simple Greedy Algorithm

Let P be a finite point set in R^m (with unit lengths). Call a point p ∈ P extreme if it lies on the boundary of the conic hull (resp. convex hull). Observe that for any X ⊆ P containing all the extreme points, it holds that Conic(X) = Conic(P) (resp. Convex(X) = Convex(P)). Consider the simple greedy algorithm which builds a subset of points S by iteratively adding to S the point angularly furthest from the conic hull of the current point set S (for the convex hull take the furthest point in distance).
One can argue that in each round this algorithm selects an extreme point, and thus it can be used to find a subset of points whose hull captures that of P. Note that if the hull is not degenerate, i.e., no point on the boundary is expressible as a combination of other points on the boundary, then this produces the minimum sized subset capturing P. Otherwise, one can solve a recursive subproblem as discussed by Kumar et al. [2013] to exactly recover S.

Here instead we consider finding a small subset of points (potentially much smaller than the number of extreme points) to approximate the hull. The question is then whether this greedy approach still yields a reasonable solution, which is not clear as there are simple examples showing the best approximate subset includes non-extreme points. Moreover, arguing about the conic approximation directly is challenging as it involves angles and hence spherical (rather than planar) geometry. For the convex case, Blum et al. [2016] argued that this greedy strategy does yield a good approximation. Thus we seek a way to reduce our conic problem to an instance of the convex problem, without introducing too much error in the process, which brings us to the gnomonic projection. Let hplane(q) be the hyperplane defined by the equation ⟨(q − x), q⟩ = 0, where q ∈ R^m is a unit length normal vector. The gnomonic projection of P onto hplane(q) is defined as gpq(P) = {ray(P) ∩ hplane(q)} (see Figure 3.1). Note that gpq(q) = q. For any point x in hplane(q), the inverse gnomonic projection is pgq(x) = ray(x) ∩ S^{m−1}.
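Concretely, for a unit-length q, hplane(q) is the set {x : ⟨x, q⟩ = 1}, so the gnomonic projection of a point p is simply p/⟨p, q⟩, and the inverse map just renormalizes. A minimal numpy sketch of the two maps (the function names are ours, and points are stored as rows):

```python
import numpy as np

def gnomonic(P, q):
    """Gnomonic projection gpq: map each row p of P to ray(p) ∩ hplane(q),
    where hplane(q) = {x : <x, q> = 1} is the hyperplane tangent to the unit
    sphere at the unit vector q. The image of p is p / <p, q>, which requires
    <p, q> > 0, i.e., every point makes an acute angle with q."""
    dots = P @ q
    assert np.all(dots > 0), "every point must make an acute angle with q"
    return P / dots[:, None]

def inverse_gnomonic(X):
    """Inverse projection pgq: map x in hplane(q) back to ray(x) ∩ S^{m-1}."""
    return X / np.linalg.norm(X, axis=1, keepdims=True)
```

Both maps only move points along their rays, which is why conic questions about P become convex questions about the projected set.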
Similar to other work [Kumar et al., 2013], we allow projections onto any hyperplane tangent to the unit hypersphere with normal q in the strictly positive orthant.

A key property of the gnomonic projection is that the problem of finding the extreme points of the convex hull of the projected points is equivalent to finding the extreme points of the conic hull of P. (Additional properties of the gnomonic projection are discussed in the full version.) Thus the strategy to approximate the conic hull should now be clear. Let P′ = gpq(P). We apply the greedy strategy of Blum et al. [2016] to P′ to build a set of extreme points S, by iteratively adding to S the point furthest from the convex hull of the current point set S. This procedure is shown in Algorithm 1.

We show that Algorithm 1 can be used to produce an ε-approximation to the restricted NMF problem. Formally, for ε > 0, let opt(P, ε) denote any minimum cardinality subset X ⊆ P which ε-approximates Conic(P), and let kopt = |opt(P, ε)|. We consider the following problem.

Problem 3.1. Given a set P of n points in R^m such that φP ≤ π/2 − γ, for a constant γ > 0, and a value ε > 0, compute opt(P, ε).

Alternatively one can fix k rather than ε, defining opt(P, k) = arg min_{X⊆P, |X|=k} dconic(X, P) and εopt = dconic(opt(P, k), P). Our approach works for either variant, though here we focus on the version in Problem 3.1.
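As a rough illustration, this greedy procedure can be sketched in Python as follows: gnomonically project the points, then repeatedly add the point furthest from the convex hull of the current selection. Here the distance-to-hull step uses a simple Frank-Wolfe-style greedy projection; the function names are ours, and this is a sketch of the strategy, not the authors' implementation:

```python
import numpy as np

def dist_to_hull(p, S, iters=200):
    """Approximate the distance from p to Convex(S) (rows of S) greedily:
    repeatedly move the current estimate t toward the vertex of S that is
    most extreme in the direction p - t."""
    t = S[0].copy()
    for _ in range(iters):
        i = int(np.argmax(S @ (p - t)))      # most extreme vertex toward p
        d = S[i] - t
        denom = d @ d
        if denom == 0:
            break
        a = float(np.clip((p - t) @ d / denom, 0.0, 1.0))
        t = t + a * d                         # best point on segment [t, S[i]]
    return np.linalg.norm(p - t)

def greedy_conic_hull(P, q, k):
    """Sketch of the greedy selection: gnomonically project P (rows) onto
    hplane(q), then greedily pick k points, each time taking the point
    furthest from the convex hull of the current selection.
    Returns the selected row indices."""
    Y = P / (P @ q)[:, None]                  # gnomonic projection
    S = [0]                                    # arbitrary starting point
    for _ in range(k - 1):
        dists = [dist_to_hull(y, Y[S]) for y in Y]
        S.append(int(np.argmax(dists)))
    return S
```

In a real implementation the inner projection would be the approximate Carathéodory subroutine of Section 3.1, whose error and sparsity are controlled by the theorems below.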
Note the bounded angle assumption applies to any collection of points in the strictly positive orthant (a small translation can be used to ensure this for any nonnegative data set).

In this section we argue that Algorithm 1 produces an (α, β)-approximation to an instance (P, ε) of Problem 3.1, that is, a subset X ⊆ P such that dconic(X, P) ≤ α and |X| ≤ β · kopt = β · |opt(P, ε)|. For ε > 0, similarly define optconvex(P, ε) to be any minimum cardinality subset X ⊆ P which ε-approximates Convex(P). Blum et al. [2016] gave an (α, β)-approximation for the following.

Problem 3.2. Given a set P of n points in R^m, and a value ε > 0, compute optconvex(P, ε).

Note that the proofs of correctness and approximation quality from Blum et al. [2016] for Problem 3.2 do not immediately imply the same results for using Algorithm 1 for Problem 3.1. To see this, consider any points u, v on S^{m−1}. Note the angle between u and v is the same as their geodesic distance on S^{m−1}. Intuitively, we want to claim the geodesic distance between u and v is roughly the same as the Euclidean distance between gpq(u) and gpq(v). While this is true for points near q, as we move away from q the correspondence breaks down (and is unbounded as you approach π/2). This non-uniform distortion requires care, and thus the proofs are deferred to the full version.

Finally, observe that Algorithm 1 requires being able to compute the point furthest from the convex hull. To do so we use the (convex) approximate Carathéodory algorithm, which is both theoretically and practically very efficient, and produces provably sparse solutions. As a standalone result, we first prove the conic analog of the approximate Carathéodory theorem.
This result is of independent interest since it can be used to sparsify the returned solution from Algorithm 1, or any other algorithm.

3.1 Sparsity and the Approximate Conic Carathéodory Theorem

Our first result is a conic approximate Carathéodory theorem. That is, given a point set P ⊆ R^m and a query point q, the angularly closest point to q in Conic(P) can be approximately expressed as a sparse combination of points from P. More precisely, one can compute a point t which is a conic combination of O(1/ε^2) points from P such that angle(q, t) ≤ angle(q, Conic(P)) + εφP.

Algorithm 1: Greedy Conic Hull
  Data: A set of n points, P, in R^m such that φP < π/2, a positive integer k, and a normal vector q in DualCone(P).
  Result: S ⊆ P such that |S| = k
  Y ← gpq(P);
  Select an arbitrary starting point p0 ∈ Y;
  S ← {p0};
  for i = 2 to k do
    Select p* ∈ arg max_{p∈Y} dconvex(p, S);
    S ← S ∪ {p*};

Figure 3.1: Side view of gnomonic projection.

The significance of this result is as follows. Recall that we seek a factorization X ≈ BC, where the k columns of B are a subset of those from X and the entries of C are non-negative. Ideally each point in X is expressed as a sparse combination from the basis B, that is, each column of C has very few non-zero entries. So suppose we are given any factorization BC, but C is dense. Then no problem: just throw out C, and use our Carathéodory theorem to compute a new matrix C′ with sparse columns. Namely, treat each column of X as the query q and run the theorem for the point set P = B; the non-zero entries of the corresponding column of C′ are then just the selected combination from B.
Not only does this mean we can sparsify any solution to our NMF problem (including those obtained by other methods), but it also means conceptually that rather than finding a good pair BC, one only needs to focus on finding the subset B, as is done in Algorithm 1. Note that Algorithm 1 allows non-negative inputs in P because φP < π/2 ensures P can be rotated into the positive orthant.

While it appears the conic approximate Carathéodory theorem had not previously been stated, the convex version has a long history (e.g., implied by [Novikoff, 1962]). The algorithm to compute this sparse convex approximation is again a simple and fast greedy algorithm, which roughly speaking is a simplification of the Frank-Wolfe algorithm for this particular problem. Specifically, to find the projection of q onto Convex(P), start with any point t0 ∈ Convex(P). In the ith round, find the point pi ∈ P most extreme in the direction of q from ti−1 (i.e., maximizing ⟨q − ti−1, pi⟩) and set ti to be the closest point to q on the segment ti−1pi (thus simplifying Frank-Wolfe, as we ignore step size issues). The standard analysis of this algorithm (e.g., [Blum et al., 2016]) gives the following.

Theorem 3.3 (Convex Carathéodory). For a point set P ⊆ R^m, ε > 0, and q ∈ R^m, one can compute, in O(|P|m/ε^2) time, a point t ∈ Convex(P) such that d(q, t) ≤ d(q, Convex(P)) + εΔ, where Δ = ΔP. Furthermore, t is a convex combination of O(1/ε^2) points of P.

Again by exploiting properties of the gnomonic projection we are able to prove a conic analog of the above theorem. Note that for P ⊂ R^m, P is contained in the linear span of at most m points from P, and similarly the exact Carathéodory theorem states that any point q ∈ Convex(P) is expressible as a convex combination of at most m + 1 points from P.
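As a concrete sketch, the greedy procedure behind Theorem 3.3 can be written in a few lines of numpy; tracking the segment steps also recovers the sparse convex weights. The function name is ours, and this illustrates the analysis above rather than reproducing the authors' code:

```python
import numpy as np

def approx_caratheodory(P, q, iters):
    """Greedy sparse approximate projection of q onto Convex(P) (rows of P).
    Each round moves the iterate t toward the vertex P[i] maximizing
    <q - t, P[i]>, keeping the closest point to q on the segment [t, P[i]].
    Per Theorem 3.3, O(1/eps^2) rounds give ||q - t|| within eps * diam(P)
    of d(q, Convex(P)); t is a convex combination of at most iters + 1 rows."""
    w = np.zeros(len(P))
    i0 = int(np.argmax(P @ q))    # start from the vertex most extreme toward q
    w[i0] = 1.0
    t = P[i0].copy()
    for _ in range(iters):
        i = int(np.argmax(P @ (q - t)))
        d = P[i] - t
        denom = d @ d
        if denom == 0:
            break
        a = float(np.clip((q - t) @ d / denom, 0.0, 1.0))
        t = t + a * d
        w = (1 - a) * w            # convex update of the weight vector
        w[i] += a
    return t, w                    # t == w @ P, w >= 0, sum(w) == 1
```

Applied column-by-column with P set to the chosen basis, this is one way to obtain the sparse matrix C′ discussed in Section 3.1; the conic variant (Theorem 3.4) first applies the gnomonic projection.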
As the conic hull lies between the linear case (with all combinations) and the convex case (with non-negative combinations summing to one), it is not surprising that an exact conic Carathéodory theorem holds. However, the linear analog of the approximate convex Carathéodory theorem does not hold, and so the following conic result is not a priori obvious.

Theorem 3.4. Let P ⊂ R^m be a point set, let q be such that φP(q) < π/2 − γ for some constant γ > 0, and let ε > 0 be a parameter. Then one can find, in O(|P|m/ε^2) time, a point t ∈ Conic(P) such that angle(q, t) ≤ angle(q, Conic(P)) + εφP(q). Moreover, t is a conic combination of O(1/ε^2) points from P.

Due to space constraints, the detailed proof of Theorem 3.4 appears in the full version. In the proof the dependence on γ is made clear, but we make a remark about it here. If ε is kept fixed, γ shows up in the running time roughly by a factor of tan^2(π/2 − γ). Alternatively, if the running time is fixed, the approximation error will roughly depend on the factor 1/tan(π/2 − γ).

We now give a simple example of a high dimensional point set which shows our bounded angle assumption is required for the conic Carathéodory theorem to hold. Let P consist of the standard basis vectors in R^m, let q be the all ones vector, and let ε be a parameter. Let X be a subset of P of size k, and consider aproj(q) = aproj(q, X). As P consists of basis vectors, each of which has all but one entry set to zero, aproj(q) will have at most k non-zero entries. By the symmetry of q it is also clear that all non-zero entries in aproj(q) should have the same value. Without loss of generality assume that this value is 1, and hence the magnitude of aproj(q) is √k.
Thus for aproj(q) to be an ε-approximation to q, we need angle(aproj(q), q) = cos^{-1}(⟨aproj(q), q⟩/(||aproj(q)|| · ||q||)) = cos^{-1}(k/(√k · √m)) = cos^{-1}(√(k/m)) < ε. Hence for a fixed ε, the number of points required to ε-approximate q depends on m, while the conic Carathéodory theorem should be independent of m.

3.2 Approximating the Conic Hull

We now prove that Algorithm 1 yields an approximation to the conic hull of a given point set and hence an approximation to the nonnegative matrix factorization problem. As discussed above, previously Blum et al. [2016] provided the following (α, β)-approximation for Problem 3.2.

Theorem 3.5 ([Blum et al., 2016]). For a set P of n points in R^m, and ε > 0, the greedy strategy, which iteratively adds the point furthest from the current convex hull, gives a ((8ε^{1/3} + ε)Δ, O(1/ε^{2/3}))-approximation to Problem 3.2, and has running time O(nc(m + c/ε^2 + c^2)), where c = O(kopt/ε^{2/3}).

Our second result is a conic analog of the above theorem.

Theorem 3.6. Given a set P of n points in R^m such that φP ≤ π/2 − γ for a constant γ > 0, and a value ε > 0, Algorithm 1 gives an ((8ε^{1/3} + ε)φP, O(1/ε^{2/3}))-approximation to Problem 3.1, and has running time O(nc(m + c/ε^2 + c^2)), where c = O(kopt/ε^{2/3}).

Bounding the approximation error requires carefully handling the distortion due to the gnomonic projection, and the details are presented in the full version. Additionally, Blum et al. [2016] provide other (α, β)-approximations for different values of α and β, and in the full version these other results are also shown to hold for the conic case.

4 Hardness of the Convex and Conic Problems

This section gives a reduction from d-SUM to the convex approximation of Problem 3.2, implying it is d-SUM-hard.
In the full version a similar setup is used to argue that the conic approximation of Problem 3.1 is d-SUM-hard. Actually, if Problem 3.1 allowed instances where φP = π/2, the reduction would be virtually the same. However, arguing that the problem remains hard under our requirement that φP ≤ π/2 − γ is non-trivial, and some of the calculations become challenging and lengthy. The reductions to both problems are partly inspired by Arora et al. [2016]. However, here we use the somewhat non-standard version of d-SUM where repetitions are allowed, as described below.

Problem 4.1 (d-SUM). In the d-SUM problem we are given a set S = {s_1, s_2, ..., s_N} of N values, each in the interval [0, 1], and the goal is to determine if there is a set of d numbers (not necessarily distinct) whose sum is exactly d/2.

It was shown by Patrascu and Williams [2010] that if d-SUM can be solved in N^{o(d)} time then 3-SAT has a sub-exponential time algorithm, i.e., that the Exponential Time Hypothesis is false.

Theorem 4.2 (d-SUM-hard). Let d < N^{0.99}, δ < 1. If d-SUM on N numbers of O(d log(N)) bits can be solved in O(N^{δd}) time, then 3-SAT on n variables can be solved in 2^{o(n)} time.

We will prove that the following decision version of Problem 3.2 is d-SUM-hard. Note that in this section the dimension will be denoted by d rather than m, as this is standard for d-SUM reductions.

Problem 4.3. Given a set P of n points in R^d, a value ε > 0, and an integer k, is there a subset X ⊆ P of k points such that dconvex(X, P) ≤ εΔ, where Δ is the diameter of P?

Given an instance of d-SUM with N values S = {s_1, s_2, ..., s_N}, we construct an instance of Problem 4.3 where P ⊂ R^{d+2}, k = d, and ε = 1/3 (or any sufficiently small value). The idea is to create d clusters, each containing N points corresponding to a choice of one of the s_i values.
The clusters are positioned such that exactly one point from each cluster must be chosen. The d + 2 coordinates are labeled a_i for i ∈ [d], w, and v. Together, a_1, ..., a_d determine the cluster. The w dimension is used to compute the sum of the chosen s_i values. The v dimension is used as a threshold to determine whether d-SUM is a yes or no instance to Problem 4.3. Let w(p_j) denote the w value of an arbitrary point p_j.

We assume d ≥ 2 as d-SUM is trivial for d = 1. Let e_1, e_2, ..., e_d ∈ R^d be the standard basis in R^d, e_1 = (1, ..., 0), e_2 = (0, 1, ..., 0), ..., and e_d = (0, ..., 1). Together they form the unit d-simplex, and they define the d clusters in the construction. Finally, let Δ* = √(2 + (εs_max − εs_min)^2) be a constant, where s_max and s_min are, respectively, the maximum and minimum values in S.

Definition 4.4. The set of points P ⊂ R^{d+2} consists of the following:
p^i_j points: for each i ∈ [d], j ∈ [N], set (a_1, ..., a_d) = e_i, w = εs_j, and v = 0.
q point: for each i ∈ [d], a_i = 1/d, w = ε/2, v = 0.
q′ point: for each i ∈ [d], a_i = 1/d, w = ε/2, and v = εΔ*.

Lemma 4.5 (Proof in full version). The diameter of P, ΔP, is equal to Δ*.

We prove completeness and soundness of the reduction. Below, P^i = ∪_j p^i_j denotes the ith cluster.

Observation 4.6. If max_{p∈P} d(p, Convex(X)) ≤ εΔ, then dconvex(X, P) ≤ εΔ: for point sets A and B = {b_1, . . . , b_m}, if we fix a ∈ Convex(A), then for any b = Σ_i α_i b_i ∈ Convex(B), with α_i ≥ 0 and Σ_i α_i = 1, we have ||a − b|| = ||a − Σ_i α_i b_i|| = ||Σ_i α_i (a − b_i)|| ≤ Σ_i α_i ||a − b_i|| ≤ max_i ||a − b_i||.

Lemma 4.7 (Completeness).
If there is a subset {sk1 , sk2,\u00b7\u00b7\u00b7 , skd} of d values (not necessarily\ni\u2208[d] ski = d/2, then the above described instance of Problem 4.3 is a true\ninstance, i.e. there is a d sized subset X \u2286 P with dconvex(X, P ) \u2264 \u03b5\u2206.\nProof: For each value ski consider the point xi = (ei, \u03b5 \u00b7 ski, 0), which by De\ufb01nition 4.4 is a\n(cid:113)\npoint in P . Let X = {x1, . . . , xd}. We now prove maxp\u2208P d(p, Convex(X)) \u2264 \u03b5\u2206, which by\nObservation 4.6 implies that dconvex(X, P ) \u2264 \u03b5\u2206.\nj) \u2212 w(xi))2 \u2264 |\u03b5sj \u2212 \u03b5ski| \u2264 \u03b5\u2206. The\nFirst observe that for any pi\nonly other points in P are q and q(cid:48). Note that d(q, q(cid:48)) = \u03b5\u2206\u2217 = \u03b5\u2206 from Lemma 4.5. Thus\n(cid:80)d\nif we can prove that q \u2208 Convex(X) then we will have shown maxp\u2208P d(p, Convex(X)) \u2264 \u03b5\u2206.\nSpeci\ufb01cally, we prove that the convex combination x = 1\ni xi is the point q. As X contains\nd\nexactly one point from each set P i, and in each such set all points have ai = 1 and all other\naj = 0, it holds that x has 1/d for all the a coordinates. All points in X have v = 0 and so this\nholds for x as well. Thus we only need to verify that w(x) = w(q) = \u03b5/2, for which we have\nw(x) = 1\nd\n\ni \u03b5ski = 1\n\nd (\u03b5d/2) = \u03b5/2.\n\nj in P , d(pi\n\nj, xi) =\n\n(w(pi\n\n(cid:80)\n\ni w(xi) = 1\nd\n\n(cid:80)\n\nProving soundness requires some helper lemmas. Note that in the above proof we constructed a\nsolution to Problem 4.3 that selected exactly one point from each cluster P i. We now prove that this\nis a required property.\nLemma 4.8 (Proof in full version). Let P \u2282 Rd+2 be as de\ufb01ned above, and let X \u2286 P be a subset\nof size d. 
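The construction and the completeness argument are easy to sanity-check numerically. The sketch below (our own illustration, assuming NumPy; `build_instance` is a hypothetical helper, not the authors' code) builds P per Definition 4.4 for a small yes-instance, checks that the average of the chosen x_i is exactly q, and checks that the diameter of P equals ∆* as Lemma 4.5 asserts:

```python
import numpy as np

def build_instance(S, d, eps=1/3):
    """Build the point set of Definition 4.4 in R^{d+2} (coords a_1..a_d, w, v)
    from a d-SUM instance S. Returns (clusters, q, q_prime, delta_star),
    where clusters[i][j] is the point p^i_j."""
    S = np.asarray(S, dtype=float)
    delta_star = np.sqrt(2 + (eps * S.max() - eps * S.min()) ** 2)
    clusters = []
    for i in range(d):
        cluster = []
        for s in S:
            p = np.zeros(d + 2)
            p[i] = 1.0              # (a_1, ..., a_d) = e_i
            p[d] = eps * s          # w coordinate encodes the chosen value
            cluster.append(p)       # v coordinate stays 0
        clusters.append(cluster)
    q = np.full(d + 2, 1.0 / d)
    q[d], q[d + 1] = eps / 2, 0.0
    q_prime = q.copy()
    q_prime[d + 1] = eps * delta_star
    return clusters, q, q_prime, delta_star

# Yes-instance of 2-SUM: 0.2 + 0.8 = 1 = d/2.
clusters, q, q_prime, delta_star = build_instance([0.2, 0.8], d=2)
X = [clusters[0][0], clusters[1][1]]       # x_i = (e_i, eps*s_{k_i}, 0)
assert np.allclose(np.mean(X, axis=0), q)  # the proof's convex combination is q

# Lemma 4.5 check: the diameter of P equals delta_star.
P = np.array([p for c in clusters for p in c] + [q, q_prime])
gaps = np.linalg.norm(P[:, None, :] - P[None, :, :], axis=-1)
assert np.isclose(gaps.max(), delta_star)
```

Here the farthest pair is a p^i_j / p^{i'}_{j'} pair from different clusters with extreme w values, matching the ∆* = √(2 + (εs_max − εs_min)²) formula.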
If dconvex(X, P) ≤ ε∆, then for all i, X contains exactly one point from P^i.

Figure 4.1: Experimental results for feature selection on six different data sets (USPS, COIL20, Isolet, Reuters, BBC, and warpPIE10P), plotting SVM accuracy against the number of selected features (25–150) for the Conic, Convex, X-RAY, Mutant X-RAY, and Conic+γ methods. Best viewed in color.

Lemma 4.9 (Proof in full version). If dconvex(X, P) ≤ ε∆, then q ∈ Convex(X) and moreover q = (1/d) Σ_{x_i∈X} x_i.

Lemma 4.10 (Soundness). Let P be an instance of Problem 4.3 generated from a d-SUM instance S, as described in Definition 4.4. If there is a subset X ⊆ P of size d such that dconvex(X, P) ≤ ε∆, then there is a choice of d values from S that sum to exactly d/2.

Proof: From Lemma 4.8 we know that X consists of exactly one point from each cluster P^i. Thus for each x_i ∈ X, w(x_i) = εs_{k_i} for some s_{k_i} ∈ S. By Lemma 4.9, q = (1/d) Σ_i x_i, which implies w(q) = (1/d) Σ_i w(x_i) = (1/d) Σ_i εs_{k_i}. By Definition 4.4, w(q) = ε/2, which implies ε/2 = (1/d) Σ_i εs_{k_i}. Thus we have a set {s_{k_1}, . . . , s_{k_d}} of d values from S such that Σ_i s_{k_i} = d/2.

Lemma 4.7 and Lemma 4.10 immediately imply the following.

Theorem 4.11.
For point sets in R^{d+2}, Problem 4.3 is d-SUM-hard.

5 Experimental Results

We report an experimental comparison of the proposed greedy algorithm for conic hulls, the greedy algorithm for convex hulls (the conic hull algorithm without the projection step) [Blum et al., 2016], the X-RAY (max) algorithm [Kumar et al., 2013], a modified version of X-RAY, dubbed mutant X-RAY, which simply selects the point furthest away from the current cone (i.e., with the largest residual), and a γ-shifted version of the conic hull algorithm described below. Other methods, such as Hottopixx [Recht et al., 2012, Gillis and Luce, 2014] and SPA [Gillis and Vavasis, 2014], were not included due to their similar performance to the above methods. For our experiments, we considered the performance of each of the methods when used to select features for a variety of SVM classification tasks on various image, text, and speech data sets, including several from the Arizona State University feature selection repository [Li et al., 2016] as well as the UCI Reuters dataset and the BBC News dataset [Greene and Cunningham, 2006]. The Reuters and BBC text datasets are represented using the TF-IDF representation. For the Reuters dataset, only the ten most frequent topics were used for classification. In all datasets, columns (corresponding to features) that were identically equal to zero were removed from the data matrix.

For each problem, the data is divided using a 30/70 train/test split, the features are selected by the indicated method, and then an SVM classifier is trained using only the selected features. For the conic and convex hull methods, ε is set to 0.1. The accuracy (percent of correctly classified instances) is plotted versus the number of selected features for each method in Figure 4.1. Additional experimental results can be found in the full version.
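To make one of the baselines concrete, here is a minimal sketch (our own simplified reading, not the authors' code) of the mutant X-RAY selection rule described above: greedily add the column with the largest residual after nonnegatively projecting every column onto the cone of the columns chosen so far. We assume SciPy's NNLS solver for the conic projection.

```python
import numpy as np
from scipy.optimize import nnls

def mutant_xray(X, k):
    """Greedy 'mutant X-RAY' sketch: repeatedly select the column of X with
    the largest residual after nonnegative projection (NNLS) onto the cone
    spanned by the columns chosen so far."""
    n = X.shape[1]
    selected = []
    for _ in range(k):
        resid = np.empty(n)
        for j in range(n):
            if selected:
                _, resid[j] = nnls(X[:, selected], X[:, j])  # residual norm
            else:
                resid[j] = np.linalg.norm(X[:, j])  # empty cone: residual = norm
        resid[selected] = -1.0                      # never re-pick a column
        selected.append(int(resid.argmax()))
    return selected
```

For example, on X = [[3, 0, 1], [0, 2, 1]] the third column lies in the cone of the first two, so `mutant_xray(X, 2)` selects columns 0 and 1.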
Generally speaking, the convex, mutant X-RAY, and shifted conic algorithms seem to consistently perform the best on these tasks. The difference in performance between convex and conic is most striking on the two text data sets, Reuters and BBC. In the case of BBC and Reuters, this is likely due to the fact that many of the columns of the TF-IDF matrix are orthogonal. We note that the quality of both X-RAY and conic is improved if thresholding is used when constructing the feature matrix, but they still seem to underperform the convex method on text datasets.

The text datasets are also interesting in that not only do they violate the explicit assumption in our theorems that the angular diameter of the conic hull be strictly less than π/2, but there are many such mutually orthogonal columns of the document-feature matrix. This observation motivates the γ-shifted version of the conic hull algorithm, which simply takes the input matrix X, adds γ to all of the entries (essentially translating the data along the all-ones vector), and then applies the conic hull algorithm. Let 1_{a,b} denote the a × b matrix of ones. After a nonnegative shift, the angular assumption is satisfied, and the restricted NMF problem is that of approximating (X + γ1_{m,n}) as (B + γ1_{m,k})C, where the columns of B are again chosen from those of X. Under the Frobenius norm, ||(X + γ1_{m,n}) − (B + γ1_{m,k})C||²_F = Σ_{i,j} (X_{ij} − B_{i,:}C_{:,j} + γ(1 − ||C_{:,j}||₁))². As C must be a nonnegative matrix, the shifted conic case acts like the original conic case plus a penalty that encourages the columns of C to sum to one (i.e., it is a hybrid between the conic case and the convex case). The plots illustrate the performance of the γ-shifted conic hull algorithm for γ = 10.
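The algebra behind the shifted objective is easy to verify numerically (our own check, assuming NumPy): for nonnegative C, the shift contributes exactly γ(1 − ||C_{:,j}||₁) to each entry of column j, because (γ1_{m,k}C)_{ij} = γ||C_{:,j}||₁.

```python
import numpy as np

def shifted_frobenius_gap(X, B, C, gamma):
    """Evaluate both sides of the identity
    ||(X + g*1_{m,n}) - (B + g*1_{m,k})C||_F^2
        = sum_{i,j} (X_ij - (BC)_ij + g*(1 - ||C_{:,j}||_1))^2,
    which holds for C >= 0 since (g*1_{m,k} C)_{ij} = g * ||C_{:,j}||_1."""
    lhs = np.linalg.norm((X + gamma) - (B + gamma) @ C, "fro") ** 2
    rhs = ((X - B @ C + gamma * (1 - C.sum(axis=0))) ** 2).sum()
    return lhs, rhs

rng = np.random.default_rng(0)
X, B, C = rng.random((5, 7)), rng.random((5, 3)), rng.random((3, 7))
lhs, rhs = shifted_frobenius_gap(X, B, C, gamma=10.0)
assert np.isclose(lhs, rhs)
```

The `gamma * (1 - C.sum(axis=0))` term is exactly the ℓ1 penalty on the columns of C that pushes the shifted conic problem toward the convex one.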
After the shift, the performance more closely matches that of the convex and mutant X-RAY methods on TF-IDF features.

Given these experimental results and the simplicity of the proposed convex and conic methods, we suggest that both methods should be added to practitioners' toolboxes. In particular, the superior performance of the convex algorithm on text datasets, compared to X-RAY and the conic algorithm, seems to suggest that these types of "convex" factorizations may be more desirable for TF-IDF features.

Acknowledgments

Greg Van Buskirk and Ben Raichel were partially supported by NSF CRII Award 1566137. Nicholas Ruozzi was partially supported by the DARPA Explainable Artificial Intelligence Program under contract number N66001-17-2-4032 and NSF grant III-1527312.

References

M. Berry, M. Browne, A. Langville, V. Pauca, and R. Plemmons. Algorithms and applications for approximate nonnegative matrix factorization. Computational Statistics & Data Analysis, 52(1):155–173, 2007.

S. Arora, R. Ge, R. Kannan, and A. Moitra. Computing a nonnegative matrix factorization - provably. SIAM J. Comput., 45(4):1582–1611, 2016.

S. Vavasis. On the complexity of nonnegative matrix factorization. SIAM Journal on Optimization, 20(3):1364–1377, 2009.

M. Mahoney and P. Drineas. CUR matrix decompositions for improved data analysis. Proceedings of the National Academy of Sciences, 106(3):697–702, 2009.

A. Frieze, R. Kannan, and S. Vempala. Fast Monte Carlo algorithms for finding low-rank approximations. J. ACM, 51(6):1025–1041, 2004.

P. Drineas, M. Mahoney, and S. Muthukrishnan. Relative-error CUR matrix decompositions. SIAM J. Matrix Analysis Applications, 30(2):844–881, 2008.

A. Çivril and M. Magdon-Ismail. Column subset selection via sparse approximation of SVD. Theor. Comput. Sci., 421:1–14, 2012.

V. Guruswami and A. Sinop.
Optimal column-based low-rank matrix reconstruction. In Proceedings of the Twenty-Third Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 1207–1214, 2012.

D. L. Donoho and V. Stodden. When does non-negative matrix factorization give a correct decomposition into parts? In Advances in Neural Information Processing Systems (NIPS), 2003.

B. Recht, C. Re, J. Tropp, and V. Bittorf. Factoring nonnegative matrices with linear programs. In Advances in Neural Information Processing Systems (NIPS), pages 1214–1222, 2012.

A. Kumar, V. Sindhwani, and P. Kambadur. Fast conical hull algorithms for near-separable non-negative matrix factorization. In International Conference on Machine Learning (ICML), pages 231–239, 2013.

A. R. Benson, J. D. Lee, B. Rajwa, and D. F. Gleich. Scalable methods for nonnegative matrix factorizations of near-separable tall-and-skinny matrices. In Advances in Neural Information Processing Systems (NIPS), pages 945–953, 2014.

N. Gillis and S. A. Vavasis. Fast and robust recursive algorithms for separable nonnegative matrix factorization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(4):698–714, 2014.

T. Zhou, J. A. Bilmes, and C. Guestrin. Divide-and-conquer learning by anchoring a conical hull. In Advances in Neural Information Processing Systems (NIPS), pages 1242–1250, 2014.

A. Kumar and V. Sindhwani. Near-separable non-negative matrix factorization with ℓ1 and Bregman loss functions, pages 343–351, 2015.

A. Blum, S. Har-Peled, and B. Raichel. Sparse approximation via generating point sets. In Proceedings of the Twenty-Seventh Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 548–557, 2016.

D. D. Lee and H. S. Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401(6755):788–791, 1999.

C. H. Q. Ding, T. Li, and M. I. Jordan.
Convex and semi-nonnegative matrix factorizations. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(1):45–55, 2010.

K. L. Clarkson. Coresets, sparse greedy approximation, and the Frank-Wolfe algorithm. ACM Transactions on Algorithms, 6(4), 2010.

S. Barman. Approximating Nash equilibria and dense bipartite subgraphs via an approximate version of Carathéodory's theorem. In Proceedings of the Forty-Seventh Annual ACM Symposium on Theory of Computing (STOC), pages 361–369, 2015.

A. B. J. Novikoff. On convergence proofs on perceptrons. In Proc. Symp. Math. Theo. Automata, volume 12, pages 615–622, 1962.

M. Patrascu and R. Williams. On the possibility of faster SAT algorithms. In Proceedings of the Twenty-First Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 1065–1075, 2010.

N. Gillis and R. Luce. Robust near-separable nonnegative matrix factorization using linear optimization. Journal of Machine Learning Research, 15(1):1249–1280, 2014.

J. Li, K. Cheng, S. Wang, F. Morstatter, R. Trevino, J. Tang, and H. Liu. Feature selection: A data perspective. arXiv:1601.07996, 2016.

D. Greene and P. Cunningham. Practical solutions to the problem of diagonal dominance in kernel document clustering. In Proceedings of the 23rd International Conference on Machine Learning (ICML), pages 377–384, 2006.