{"title": "Learning Sparse Distributions using Iterative Hard Thresholding", "book": "Advances in Neural Information Processing Systems", "page_first": 6760, "page_last": 6769, "abstract": "Iterative hard thresholding (IHT) is a projected gradient descent algorithm, known to achieve state of the art performance for a wide range of structured estimation problems, such as sparse inference. In this work, we consider IHT as a solution to the problem of learning sparse discrete distributions. We study the hardness of using IHT on the space of measures. As a practical alternative, we propose a greedy approximate projection which simultaneously captures appropriate notions of sparsity in distributions, while satisfying the simplex constraint, and investigate the convergence behavior of the resulting procedure in various settings. Our results show, both in theory and practice, that IHT can achieve state of the art results for learning sparse distributions.", "full_text": "Learning Sparse Distributions using Iterative Hard\n\nThresholding\n\nJacky Y. Zhang\n\nDepartment of Computer Science\n\nUniversity of Illinois at Urbana-Champaign\n\nyiboz@illinois.edu\n\nRajiv Khanna\n\nDepartment of Statistics\n\nUniversity of California at Berkeley\n\nrajivak@berkeley.edu\n\nAnastasios Kyrillidis\n\nDepartment of Computer Science\n\nRice University\n\nrajivak@berkeley.edu\n\nOluwasanmi Koyejo\n\nDepartment of Computer Science\n\nUniversity of Illinois at Urbana-Champaign\n\nsanmi@illinois.edu\n\nAbstract\n\nIterative hard thresholding (IHT) is a projected gradient descent algorithm, known\nto achieve state of the art performance for a wide range of structured estimation\nproblems, such as sparse inference. In this work, we consider IHT as a solution\nto the problem of learning sparse discrete distributions. We study the hardness\nof using IHT on the space of measures. 
As a practical alternative, we propose a greedy approximate projection which simultaneously captures appropriate notions of sparsity in distributions while satisfying the simplex constraint, and investigate the convergence behavior of the resulting procedure in various settings. Our results show, both in theory and practice, that IHT can achieve state of the art results for learning sparse distributions.

1 Introduction
Probabilistic models provide a flexible approach for capturing uncertainty in real world processes, with a variety of applications which include latent variable models and density estimation, among others. Like other machine learning tools, probabilistic models can be enhanced by encouraging parsimony, as this captures useful inductive biases. In practice, this often improves the interpretability and generalization performance of the resulting models, and is particularly useful in applied settings with limited samples compared to the model degrees of freedom. One of the most effective parsimonious assumptions is sparsity. As such, learning sparse distributions is a problem of broad interest in machine learning, with many applications [1-7].
The majority of approaches for sparse probabilistic modeling have focused on the construction of appropriate priors based on inputs from domain experts. The technical challenges there involve prior design and inference [3, 1, 8], including methods that are additionally designed to exploit special structures [5, 4, 7].
More recently, there has been interest in studying these algorithmic approaches from an optimization perspective [9-11], with the goal of a deeper understanding and, in some cases, even suggesting improvements over previous methods [12, 13]. In this work, we consider an optimization-based approach to learning sparse discrete distributions. Despite wide applicability, when compared to classical constrained optimization, there are limited studies that focus on the understanding, both in theory and in practice, of optimization methods over the space of probability densities under sparsity constraints.
Our present work proposes and investigates the use of Iterative Hard Thresholding (IHT [14-18]) for the problem of sparse probabilistic estimation. IHT is an iterative algorithm that is well-studied in the classical optimization literature. Further, there are known worst-case convergence guarantees and empirical studies [19, 20] that vouch for its performance. Our goal in this work is to investigate the convergence properties of IHT when applied to probabilistic densities, and to evaluate its efficacy for learning sparse distributions.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

However, transferring this algorithm from vector and matrix spaces to the space of measures is not straightforward. While several of the technical pieces –such as the existence of a variational derivative and normed structure– fall into place, the algorithm is an iterative one that involves solving a projection subproblem in each iteration. We show that this subproblem is computationally hard in general, but provide an approximate procedure that we analyze under certain assumptions.
Our contributions in this work are algorithmic and theoretical, with proof of concept empirical evaluation.
We briefly summarize our contributions below.
• We propose the use of classical IHT for learning sparse distributions, and show that the space of measures meets the structural requirements for IHT.
• We study in depth the hardness of the projection subproblem, showing that it is NP-hard and that no polynomial-time algorithm exists that can solve it with guarantees.
• Since the projection problem is solved in every iteration, we propose a simple greedy algorithm and provide sufficient theoretical conditions under which the algorithm provably approximates the otherwise hard projection problem.
• We draw on techniques from classical optimization to provide convergence rates for the overall IHT algorithm: i.e., we study after how many iterations the algorithm is guaranteed to be within some small ε of the true optimum.
In addition to our conceptual and theoretical results, we present empirical studies that support our claims.

2 Problem statement
Preliminaries. We use bold characters to denote vectors. Given a vector v, we use v_i to represent its i-th entry. We use calligraphic upper case letters to denote sets, e.g., S. With a slight abuse of notation, we will use lower case letters to denote probability distributions, e.g., p, q, as well as functions, e.g., f. The distinction from scalars will be apparent from the context; we usually append functions with parentheses to distinguish them from scalars. We use upper case letters to denote functionals, i.e., functions that take other functions as input, e.g., F[p(·)]. We use [n] to denote the set {1, 2, ..., n}. Given a set of indices S ⊆ [n], we denote the cardinality of S as |S|. Given a vector x, we denote its support set, i.e., the set of non-zero entries, as supp(x). We use P{e} to denote the probability of event e.
Let P denote the set of discrete n-dimensional probability densities on an n-dimensional domain X:
P = { p(·) : X → R+ | Σ_{x∈X} p(x) = 1 }.
Let S ⊆ [n] denote a support set where |S| = k < n. Let X_S ⊆ X denote the set of variables with support S, i.e.,
X_S = { x ∈ X | supp(x) ⊆ S }.
The set of domain restricted densities, denoted by P_S, is the set of probability density functions supported on X_S, i.e.,
P_S = { q(·) ∈ P | ∀x ∉ X_S, q(x) = 0 }.
Conversely, we denote the support of a domain restricted density q(·) ∈ P_S as supp(q) = S. Next, we define the notion of sparse distributions.
Definition 1 (Distribution Sparsity [5]). Let D_k = ∪_{|S|≤k} P_S ⊆ P, i.e., the union of all densities restricted to a k-sparse support domain. We say that p(·) is k-sparse if p(·) ∈ D_k.
Note that while each component P_S is a convex set, the union D_k is not. To see this, consider the convex combination of two k-sparse distributions p1 and p2 with disjoint supports S1 and S2, respectively. In general, the convex combination αp1(·) + (1−α)p2(·), 0 < α < 1, has larger support, i.e., |S1 ∪ S2| > k. As an aside, we note that, unlike the vector case, it is straightforward to construct multiple definitions of distribution sparsity. For instance, another reasonable definition is via the set D'_k = { p(·) ∈ P | p(x) = 0 for all ‖x‖_0 > k }, i.e., distributions that assign zero probability mass to non-k-sparse vectors. Interestingly, D_k ⊆ D'_k ⊆ P in general, as any distribution in D_k must have a support of size at most k, which is not necessary for distributions in D'_k. Motivated by prior work [5], we use Definition 1 in this work.
Vector sparsity. While the proposed framework is developed for a specialized notion of sparsity, i.e., along the dimensions of a multivariate discrete distribution, it is also applicable to alternative notions of distribution sparsity. One common setting is sparsity of the distribution itself p(·) when represented as a vector, e.g., sparsifying the number of valid states of a univariate distribution such as a histogram. We outline how our framework can be applied to this setting in Appendix A.
Problem setting. In this work, we focus on studying sparsity for the case of discrete densities. In particular, X ⊆ Z^n; i.e., x is an integer vector such that:
X = { x ∈ Z^n | ∀i ∈ [n], 0 ≤ x_i ≤ m − 1 },
where m is an integer. Therefore, x has m^n valid positions. In other words, if we denote X as a random variable from that distribution, then X ∈ X has m^n possible values, and P{X = x} = p(x).
Given a cost functional over distributions F[·] : P → R, we are interested in the following optimization criterion:
min_q F[q] subject to q ∈ D_k,   (1)
where D_k = ∪_{S:|S|≤k} P_S ⊆ P is the k-sparsity constraint, as in Definition 1. In words, we are interested in finding a distribution, denoted as q(·), that "lives" in the k-sparse set of distributions and minimizes the cost functional F[·]. This is similar to classical sparse optimization problems in the literature [21-24], but there are fundamental difficulties, both in theory and in practice, that require a different approach than standard iterative hard thresholding algorithms [14-18].
We assume that the objective F[·] is a convex functional over distributions.
Definition 2 (Convexity of F[·]). The functional F[·] : P → R is convex if:
F[θq(·) + (1−θ)p(·)] ≤ θF[q(·)] + (1−θ)F[p(·)],
for all q(·), p(·) ∈ P and θ ∈ [0, 1].
Observe that, while F[·] is a convex functional, and P and P_S are convex sets, D_k is not a convex set. Hence, the optimization problem (1) is not a convex program.
Following the projected gradient descent approach, we require definitions of the gradient of F[·], as well as definitions of the projection.
Definition 3 (Variational Derivative [25]). The variational derivative of F[·] : P → R is a function, denoted as δF/δq(·) : X → R, that satisfies:
Σ_{x∈X} (δF/δq)(x) φ(x) = ∂F[q + εφ]/∂ε |_{ε=0},
where φ : X → R is an arbitrary function.
Definition 4 (First-order Convexity). The functional F[·] : P → R is convex if:
F[q(·)] ≥ F[p(·)] + ⟨(δF/δp)(·), q(·) − p(·)⟩,
for all q(·), p(·) ∈ P.
Here, we use the standard inner product for two densities: ⟨q(·), p(·)⟩ = ∫_x q(x)p(x), or ⟨q(·), p(·)⟩ = Σ_x q(x)p(x) in the discrete setting.

3 Algorithms
Recall that our goal is to solve the optimization problem (1). A natural way to solve it in an iterative fashion is using projected gradient descent, where the projection step is over the set of sparse distributions D_k. This analogy makes the connection to iterative hard thresholding (IHT) algorithms, where the iterative recursion is:
p_{t+1}(·) = Π_{D_k}( p_t(·) − µ (δF/δp_t)(·) ),
where p_t(·) denotes the current iterate, and Π_{D_k}(·) denotes, in an abstract sense, the projection of the distribution function onto the set of sparse distribution functions.
The consequent steps are analogous to those of regular IHT: given an initialization point, we iteratively i) compute the gradient, ii) perform the gradient step with step size µ, iii) ensure the computed approximate solution satisfies our constraint in each iteration by projecting onto D_k.

Algorithm 1 Distribution IHT
1: Input: F[·] : P → R, k ∈ Z+, number of iters T, p_0(·) ∈ D_k, µ. Output: p_T ∈ D_k
2: t ← 0
3: while t < T do
4:   q_{t+1}(·) = p_t(·) − µ (δF/δp_t)(·)
5:   p_{t+1}(·) = Π_{D_k}(q_{t+1})
6:   t ← t + 1
7: end while
8: return p_T(·)

3.1 Projection onto D_k
Consider the projection step with respect to the ℓ2-norm, i.e.,
Π_{D_k}(p(·)) := arg min_{q(·)∈D_k} ‖q(·) − p(·)‖_2^2,   (2)
where the ℓ2-norm is defined by the aforementioned inner product ⟨q(·), p(·)⟩ = Σ_x q(x)p(x). The set D_k = ∪_{|S|≤k} P_S is a union of (n choose k) = O(n^k) sparse sets P_S of different supports. Thus, if we denote T_proj as the time to compute Π_{P_S}(p(·)), then we need O(n^k · T_proj) time for the D_k projection using naive enumeration. One may reasonably conjecture the existence of more efficient implementations of the exact projection in (2), e.g., in polynomial time. In the following, we show that this is not the case.
3.2 On the tractability of the sparse distribution ℓ2-norm projection
The projection (2) is solved in every iteration of IHT (step 5 in Algorithm 1). Thus, for the algorithm to be practical, it is important to study the tractability of the projection step. The combinatorial nature of D_k hints that this might not be the case.
Theorem 1.
The sparse distribution ℓ2-norm projection problem (2) is NP-hard.
Sketch of proof: We show that the subset selection problem [26] can be reduced to the ℓ2-norm projection problem. The complete proof is provided in the supplementary material.

Figure 1: Illustration of projection onto D_k, with q = Π_{D_k}(p).

As an alternative route, NP-hard problems can often be tackled sufficiently well by using approximate methods. However, the following theorem states that the sparsity constrained optimization problem in (2) is hard even to approximate, in the sense that no deterministic approximation algorithm exists that solves it in polynomial time.
Theorem 2. There exists no deterministic algorithm that can provide a constant factor approximation for the sparse distribution ℓ2-norm projection problem in polynomial time. Formally, for a given q : X → R with X ⊆ R^n, let p*(·) be the optimal ℓ2-norm projection onto D_k, and let p̂(·) be the solution found by any algorithm that operates in O(poly(n)) time. Then, we can design problem instances where the approximation ratio
φ = ‖q(·) − p̂(·)‖_2^2 / ‖q(·) − p*(·)‖_2^2 ≥ 1
cannot be bounded.
The proof of the theorem is provided in the supplementary material. Through Theorems 1 and 2, we have shown that the distribution sparse ℓ2-norm projection problem is hard, and thus the applicability of IHT on the space of densities does not seem practically well-established. This may be surprising, in light of results in a variety of domains where it is known to be effective.
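To make the overall procedure concrete before turning to the greedy surrogate, here is a minimal sketch of Algorithm 1 for tiny instances, pairing the IHT loop with the naive enumeration projection of Section 3.1 (feasible only at toy scale, consistent with Theorems 1 and 2). The helper names are our own; the per-support step uses a standard sort-based simplex projection in the style of [29], not an implementation from the paper.

```python
import itertools
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto the probability simplex
    (sort-based method in the style of Duchi et al. [29])."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > (css - 1))[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

def project_Dk_bruteforce(q, n, m, k):
    """Exact l2 projection onto D_k by enumerating all O(n^k) supports S.
    q is indexed by the points x in {0,...,m-1}^n, flattened in
    itertools.product order."""
    X = np.array(list(itertools.product(range(m), repeat=n)))
    best, best_err = None, np.inf
    for S in itertools.combinations(range(n), k):
        mask = np.ones(len(X), dtype=bool)
        for i in range(n):
            if i not in S:
                mask &= (X[:, i] == 0)   # X_S = {x : supp(x) subset of S}
        p = np.zeros_like(q)
        p[mask] = project_simplex(q[mask])  # optimal restriction onto P_S
        err = np.sum((p - q) ** 2)
        if err < best_err:
            best, best_err = p, err
    return best

def distribution_iht(grad_F, n, m, k, mu=0.1, T=50):
    """Algorithm 1 sketch: gradient step, then (here, brute-force) projection."""
    size = m ** n
    p = project_Dk_bruteforce(np.full(size, 1.0 / size), n, m, k)
    for _ in range(T):
        p = project_Dk_bruteforce(p - mu * grad_F(p), n, m, k)
    return p
```

For example, with n = 3, m = 2, k = 1 and the ℓ2 objective F[p] = ‖p − q0‖² for a 1-sparse q0 (so grad_F(p) = 2(p − q0)), the iterates contract toward q0.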
For example, in the case of vectors, a simple O(n) selection algorithm solves the projection problem optimally [27]. Similarly, on the space of matrices for low rank IHT, the projection onto the top-k ranks is optimally solved by an SVD [28].
3.3 A greedy approximation
In contrast to the results of Theorems 1 and 2, we have observed that a simple greedy support selection seems effective in practice. Thus, we simply consider replacing the exact projection onto D_k with greedy selection.

Algorithm 2 Greedy Sparse Projection (GSProj)
1: Input: n-dimensional function q : X → R and sparsity level k.
2: Output: A distribution p(·) ∈ D_k
3: S := ∅
4: while |S| < k do
5:   j ∈ arg min_{i∈[n]\S} { min_{p∈P_{S∪i}} ‖p(·) − q(·)‖_2^2 }
6:   S := S ∪ {j}
7: end while
8: return arg min_{p∈P_S} ‖p(·) − q(·)‖_2^2

Consider Algorithm 2 when the input is not necessarily a distribution, i.e., Σ_{x∈X} q(x) ≠ 1. The key procedure of the projection is line 5, where the inner min(·) is the projection of q(·) onto a set of domain restricted densities. Let p̂(·) denote this projection, i.e., p̂(·) = arg min_{p(·)∈P_S} ‖p(·) − q(·)‖_2^2. Since, by definition, p̂(x) = 0 for any x ∉ X_S, we only need to calculate p̂(x) for x ∈ X_S, and this can be reformulated as:
arg min_{p(·)} Σ_{x∈X_S} (p(x) − q(x))^2  s.t. Σ_{x∈X_S} p(x) = 1 and ∀x ∈ X_S, p(x) ≥ 0,
which is essentially the ℓ2-norm projection onto the simplex { p(x) | Σ_{x∈X_S} p(x) = 1, ∀x ∈ X_S, p(x) ≥ 0 }. This ℓ2-norm projection onto the simplex can be solved efficiently and easily (see [29]).
When p(·) is a distribution, we can analytically compute its projection onto any support restricted domain. Given support S, the exact projection of a distribution p(·) onto P_S is:
arg min_{q∈P_S} ‖q(·) − p(·)‖_2^2.   (3)
In our setting, the above problem can be written as
arg min_{q∈P_S} ‖q(·) − p(·)‖_2^2 = arg min_{q∈P_S} ⟨q(·) − p(·), q(·) − p(·)⟩ = arg min_{q∈P_S} Σ_{x∈X} (q(x) − p(x))^2
= arg min_{q∈P_S} [ Σ_{x∈X_S} (q(x) − p(x))^2 + Σ_{x∈X, x∉X_S} p(x)^2 ].
The last equation is due to the definitions of P_S and X_S: since q ∈ P_S, we have q(x) = 0 for every x ∉ X_S. Since p(·) is constant, we can eliminate the last term. The resulting problem is:
arg min_{q∈P_S} Σ_{x∈X_S} (q(x) − p(x))^2  s.t. Σ_{x∈X_S} q(x) = 1.   (4)
Denote Σ_{x∈X_S} p(x) = C ≤ 1. Applying the Quadratic Mean-Arithmetic Mean inequality to equation (4), we have:
Σ_{x∈X_S} (q(x) − p(x))^2 ≥ (1 − C)^2 / |X_S|  s.t. Σ_{x∈X_S} q(x) = 1.
Equality is achieved when q(x) − p(x) is the same for every x ∈ X_S. Therefore, we have the optimal solution to problem (3):
q*_S(x) = p(x) + (1 − C)/|X_S| for x ∈ X_S, and q*_S(x) = 0 for x ∉ X_S.
Computational complexity. The time needed to solve problem (3) is O(|X_S|), i.e., the time to compute C. However, to compute the norm ‖q(·) − p(·)‖_2^2 we still need O(|X|) time, as p(x) is not necessarily zero at any x ∈ X. As a result, we need O(n^k (|X| + |X_S|)) time to enumerate for an optimal solution of the ℓ2-norm projection. If we consider the integer lattice X, as stated in the problem setting, then |X| = m^n and |X_S| = m^k, rendering the time complexity O(n^k m^n). However, Algorithm 2 has much lower time complexity. In each iteration, the greedy method selects an element to add to S that maximizes the gain, and this is repeated for k iterations. It need not consider the exact ℓ2-norm ‖q(·) − p(·)‖_2^2 in each iteration, only the increment for each of the n candidate elements. To compute the increment, no more than |X_S| terms are added, which requires O(|X_S|) time.
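As a quick sanity check of the closed form q*_S above, here is a short sketch (function name our own) that projects a distribution onto P_S by spreading the missing mass (1 − C)/|X_S| uniformly over X_S:

```python
import itertools
import numpy as np

def project_onto_PS(p, n, m, S):
    """Closed-form l2 projection of a distribution p onto P_S:
    q*(x) = p(x) + (1 - C)/|X_S| on X_S, and 0 elsewhere,
    where C = sum of p over X_S (so C <= 1 and q* >= 0)."""
    X = np.array(list(itertools.product(range(m), repeat=n)))
    outside = [i for i in range(n) if i not in S]
    in_XS = np.all(X[:, outside] == 0, axis=1)   # X_S membership mask
    C = p[in_XS].sum()
    q = np.zeros_like(p)
    q[in_XS] = p[in_XS] + (1.0 - C) / in_XS.sum()
    return q
```

For instance, with n = 2, m = 2, p = (0.4, 0.1, 0.3, 0.2) over (0,0), (0,1), (1,0), (1,1), and S = {0}, the mass C = 0.7 on X_S is topped up by 0.15 per point, giving a valid distribution on X_S.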
All together, the greedy method requires O(nk|X_S|) time to operate, or O(nk m^k) in our integer lattice setting, which is far less than the enumeration method's O(n^k m^n).
3.4 When Greedy is Good
We have shown in the proof of Theorem 2 that there always exist extreme examples that are hard to solve. Thus, in the most general sense and without further assumptions, one can find pathological cases which make the problem hard. However, we find that the greedy approach works well empirically. In this section, we consider sufficient conditions for tractability of the problem. Our conditions boil down to structural assumptions on F[·] which match standard assumptions in the literature.
To build further intuition, consider line 4 in Algorithm 1, where the parameter passed to the greedy method is q(·) = p(·) − µ (δF/δp)(·), and p(·) is already a k-sparse distribution. Denote the support of p(·) as S; we can see that |S| ≤ k. Therefore, q(·) is close to k-sparse when the step size µ is small. Thus, while the general problem (2) may be much harder, there is reason to conjecture that, under certain conditions, a simple greedy algorithm performs well. Next, we state these assumptions formally.
Assumption 1 (Strong Convexity/Smoothness). The objective F[·] satisfies strong convexity/smoothness with respect to α and β if:
(α/2) ‖p1(·) − p2(·)‖_2^2 ≤ F[p2(·)] − F[p1(·)] − ⟨(δF/δp1)(·), p2(·) − p1(·)⟩ ≤ (β/2) ‖p1(·) − p2(·)‖_2^2.
For the sake of simplicity in exposition, we have assumed strong convexity to hold over the entire domain (which can be a restrictive assumption). As will be clear from the proof analysis, this assumption can easily be tightened to a restricted strong convexity assumption; see, e.g., [30]. This
This\ndetail is left for a longer version of this manuscript.\nAssumption 2 (Lipschitz Condition). The functional F : P! R satis\ufb01es the Lipschitz condition\nwith respect to L, in k-sparse domain Dk is\n\nThis assumption implies that\n\n|F [p1(\u00b7)] F [p2(\u00b7)]|\uf8ff Lkp1(\u00b7) p2(\u00b7)k2\n\np(\u00b7) b\u21e7Dk (p(\u00b7))\n\n6\n\nF\np\n\n\n\n(\u00b7)2 \uf8ff L.\n\nUsing the strong convexity, smoothness, and Lipschitz assumptions, we are able to provide analysis\nfor when greedy works well. This is encapsulated in Theorem 3.\nTheorem 3. Given n-dimensional function q(\u00b7) = p(\u00b7) \u00b5 F\np (\u00b7), where p(\u00b7) is an n-dimensional\nk-sparse distribution and supp(p(\u00b7)) = S0, Algorithm 2 \ufb01nds the optimal projection to domain PS0\nif F [\u00b7] satis\ufb01es Assumption 2, \u00b5 is suf\ufb01ciently small and there are enough positions x 2X S0 where\np(x) > 0, i.e., satis\ufb01es inequality (6) and inequality (9).\n\n3.5 Convergence Analysis\nNext, we analyze the convergence of the overall Algorithm 1 with greedy projections. While\nTheorem 3 provides suf\ufb01cient conditions for exact projection using the greedy approach, in practice\ndue to computational precision issues and/or violation of the stated assumptions, the solution may\nnot provide an exact projection. Thus, it is prudent to assume that the inner projection subproblem is\nsolved within some approximation as quanti\ufb01ed in the following.\n\nonto sparsity domain and distribution space, with approximation parameter, , as:\n\nDe\ufb01nition 5. Approximate `2-norm projection. We de\ufb01ne b\u21e7Dk (\u00b7) as the approximate projection\n\n2\n\n2 \uf8ff (1 + )kp(\u00b7) \u21e7Dk (p(\u00b7))k2\n\n2\n\nNext, we present our main convergence theorem.\nTheorem 4. Suppose F satis\ufb01es assumptions 1 and 2. Furthermore, assume that the projection step\nin Algorithm 1 is solved -approximately. 
Let the step size µ = 1/β, and suppose ‖p_0(·) − p*(·)‖_2 ≤ L/(2α). Then, if β/α ∈ (2 − 1/(1+δ), 2), IHT (Algorithm 1) with T ≥ log_η( (F[p_0(·)] − F[p*(·)] − c)/ε ) iterations achieves F[p_T(·)] ≤ F[p*(·)] + c + ε, where η = 1/((1+δ)(2 − β/α)) and c = (δ/(2β) + (1+δ)(β − α)/(2µβ^2)) L^2.

Figure 2: Simulated experiments. (a) Normalized ℓ2-norm minimization. (b) Normalized KL minimization.

4 Experiments
We evaluate our algorithm on different convex objectives, namely, ℓ2-norm distance and KL divergence. As mentioned before, there are no theoretically guaranteed algorithms for ℓ2-norm distance minimization under a sparsity constraint. To investigate optimality of the algorithms, we consider simulated experiments of sufficiently small size that the global optimum can be exhaustively enumerated.

Algorithm 3 Greedy Selection
1: Input: F[·] : P → R, k ∈ Z+. Output: p_T ∈ D_k
2: S := ∅
3: while |S| < k do
4:   j ∈ arg min_{i∈[n]\S} { min_{p∈P_{S∪i}} F[p(·)] }
5:   S := S ∪ {j}
6: end while
7: return arg min_{p∈P_S} F[p(·)]

IHT implementation details. For IHT, the step size is chosen by a simple strategy: given an initial step size, we double the step size when IHT is trapped in a local optimum, and return to the initial step size after escaping. We evaluate the algorithm along the entire solution path.
Baseline: Forward Greedy Selection. Unfortunately, we are unaware of optimization algorithms for sparse probability estimation with general losses. As a simple baseline, we consider greedy selection wrt. the objective.
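For the KL objective in particular, the inner minimization min_{p∈P_S} KL(p‖q) is solved by renormalizing q on X_S, with optimal value −log Σ_{x∈X_S} q(x) (a standard I-projection fact we use here as a simplification, not a statement from the text). Greedy selection with respect to the objective then reduces to greedily covering probability mass; a sketch under that observation, with hypothetical names:

```python
import itertools
import numpy as np

def greedy_kl_selection(q, n, m, k):
    """Greedy support selection for F[p] = KL(p || q): since
    min_{p in P_S} KL(p||q) = -log sum_{x in X_S} q(x),
    greedily add the dimension that most increases mass covered by X_S."""
    X = np.array(list(itertools.product(range(m), repeat=n)))
    S = []
    for _ in range(k):
        best_i, best_mass = None, -1.0
        for i in range(n):
            if i in S:
                continue
            outside = [j for j in range(n) if j not in S + [i]]
            mass = q[np.all(X[:, outside] == 0, axis=1)].sum()
            if mass > best_mass:
                best_i, best_mass = i, mass
        S.append(best_i)
    outside = [j for j in range(n) if j not in S]
    mask = np.all(X[:, outside] == 0, axis=1)
    p = np.zeros_like(q)
    p[mask] = q[mask] / q[mask].sum()   # I-projection: renormalize on X_S
    return sorted(S), p
```

Each round scans the n candidate dimensions and keeps the one covering the most mass, matching the arg min structure of the greedy loop above.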
This is equivalent to Algorithm 3. For certain special cases, e.g., the KL objective, Algorithm 3 can be applied efficiently and is effective in practice [5].
4.1 Simulated Data
We set dimension n = 15, number of entries m = 2, and sparsity level k = 7. That is, X = {0, 1}^15 is a 15-dimensional binary vector space, with cardinality |X| = 2^15 = 32768. The distribution p : X → [0, 1] satisfies Σ_{x∈X} p(x) = 1. The sparsity constraint is designed to fix a support S : |S| ≤ 7, such that any x with p(x) > 0 has supp(x) ⊆ S. Thus, finding the optimal solution requires enumerating (15 choose 7) = 6435 possible supports.
The ℓ2-norm minimization objective is F[p(·)] = ‖p(·) − q(·)‖_2^2, where q(·) is a distribution generated by randomly choosing 50 positions x_1, ..., x_50 ∈ X and assigning random real numbers c_1, ..., c_50 with Σ_{i=1}^{50} c_i = 1, while the other positions are assigned 0; i.e., q(x_i) = c_i for i ∈ [50], and q(x) = 0 otherwise. The initial step size is µ = 0.008. Results are shown in Figure 2 (a). The KL divergence objective is F[p(·)] = KL(p(·)||q(·)) = Σ_{x∈X} p(x) log(p(x)/q(x)), where q(·) is a random distribution generated similarly to the q(·) in the ℓ2-norm objective. The only difference is that q(x) cannot be zero, as that would render the KL undefined. For simulated experiments, we use the optimum to normalize the objective function as F̃[p] = F[p] − F[p*], so that at the optimum F̃[p*] = 0.
Three algorithms are compared in each experiment, i.e., IHT, Greedy, and IHT after Greedy. While IHT starts randomly, IHT after Greedy is initialized with the result of Greedy. In each run, the distribution q(·) and the starting distribution for IHT, p_0(·), are randomly generated. Each of the experiments is run 20 times. Results are presented in Figure 2, showing the mean and standard deviation of Greedy and IHT after Greedy. The standard deviation of IHT is similar to that of IHT after Greedy.
We use the ℓ2-norm greedy projection in IHT in both experiments. Interestingly, this not only outperforms the ℓ2-norm greedy projection itself (Figure 2 (a)), but also outperforms Greedy on the KL objective (Figure 2 (b)), where [5] suggests provably good performance. In particular, while the performance of Greedy can fluctuate severely, IHT (after Greedy) is stable in obtaining good results. Note that low variance is especially desirable when the algorithm is only applied a few times to save computation, as in large discrete optimization problems.
4.2 Benchmark Data
Distribution Compression / Compressed Sensing. We apply our IHT to the task of expectation-preserving distribution compression, useful for efficiently storing large probability tables. Given a distribution p(·), our goal is to construct a sparse approximation q(·), such that q(·) approximately preserves expectations with respect to p(·). Interestingly, this model compression problem is equivalent to compressed sensing, but with the distributional constraints. Specifically, our goal is to find q which minimizes ‖Aq − Ap‖_2^2 subject to a k-sparsity constraint on q. The model is evaluated with respect to moment reconstruction ‖Bq − Bp‖_2^2 for a new "sensing" matrix B. Our experiments use real data from the Texas hospital discharge public use dataset. IHT is compared to post-processed Lasso and Random. Lasso ignores the simplex constraints during optimization, then projects the results onto the simplex, while Random is a naïve baseline of random distributions. Figure 3(a) shows that IHT significantly outperforms the baselines. Additional details are provided in Appendix H due to limited space.
Dataset compression. We study representative prototype selection for the Digits data [31]. Prototypes are representative examples chosen from the data in order to achieve dataset compression. Our optimization objective is the Maximum Mean Discrepancy (MMD) between the discrete data distribution and the sparse data distribution representing the selected samples. We evaluate performance using the prototype nearest neighbor classification error on a test dataset. We compare against two forward selection greedy variants (Local Greedy and Global Greedy) proposed by [32] and the means algorithm (labeled as PS) proposed by [33], both state of the art. The results are presented in Figure 3(b), showing that IHT outperforms all baselines. Additional experimental details are provided in Appendix H due to limited space.

Figure 3: (a) Compression / Compressed sensing: test error at varying sparsity k. (b) Dataset compression: test classification error of the prototype nearest neighbor classifier.

5 Conclusion and Future Work
In this work, we proposed the use of IHT for learning discrete sparse distributions. We studied several theoretical properties of the algorithm from an optimization viewpoint, and proposed practical solutions to otherwise hard problems. There are several possible future directions of research. We have analyzed discrete distributions with sparsity constraints. The obvious extensions are to the space of continuous measures and to structured sparsity constraints. Is there a bigger class of constraints for which a tractable projection algorithm exists? Can we improve the sufficient conditions under which projections are provably close to the optimum projection?
Finally, more in-depth empirical studies comparing against other state of the art algorithms would be very interesting and useful to the community.

References
[1] T. J. Mitchell and J. J. Beauchamp. Bayesian variable selection in linear regression. Journal of the American Statistical Association, 83(404):1023–1032, 1988.
[2] Hemant Ishwaran and J. Sunil Rao. Spike and slab variable selection: Frequentist and Bayesian strategies. Ann. Statist., 33(2):730–773, 2005.
[3] Edward I. George and Robert E. McCulloch. Variable selection via Gibbs sampling. Journal of the American Statistical Association, 88(423):881–889, 1993.
[4] Trevor Park and George Casella. The Bayesian lasso. Journal of the American Statistical Association, 103(482):681–686, 2008.
[5] Oluwasanmi O. Koyejo, Rajiv Khanna, Joydeep Ghosh, and Russell Poldrack. On prior distributions and approximate inference for structured variables. In Advances in Neural Information Processing Systems, pages 676–684, 2014.
[6] Rajiv Khanna, Joydeep Ghosh, Russell Poldrack, and Oluwasanmi Koyejo. Sparse submodular probabilistic PCA. In Artificial Intelligence and Statistics, pages 453–461, 2015.
[7] Rajiv Khanna, Joydeep Ghosh, Russell Poldrack, and Oluwasanmi Koyejo. Information projection and approximate inference for structured sparse variables. In Aarti Singh and Jerry Zhu, editors, Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, volume 54 of Proceedings of Machine Learning Research, pages 1358–1366, Fort Lauderdale, FL, USA, 20–22 Apr 2017. PMLR.
[8] Carlos M. Carvalho, Nicholas G. Polson, and James G. Scott. The horseshoe estimator for sparse signals. Biometrika, 97(2):465–480, 2010.
[9] Andre Wibisono. Sampling as optimization in the space of measures: The Langevin dynamics as a composite optimization problem.
In Sébastien Bubeck, Vianney Perchet, and Philippe Rigollet, editors, Proceedings of the 31st Conference On Learning Theory, volume 75 of Proceedings of Machine Learning Research, pages 2093–3027. PMLR, 2018.

[10] Francesco Locatello, Rajiv Khanna, Joydeep Ghosh, and Gunnar Rätsch. Boosting variational inference: an optimization perspective. In Amos Storkey and Fernando Perez-Cruz, editors, Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics, volume 84 of Proceedings of Machine Learning Research, pages 464–472. PMLR, 2018.

[11] Arnak S. Dalalyan and Avetik Karagulyan. User-friendly guarantees for the Langevin Monte Carlo with inaccurate gradient. Stochastic Processes and their Applications, 2019.

[12] Ferenc Huszár and David Duvenaud. Optimally-weighted herding is Bayesian quadrature. In Proceedings of the Twenty-Eighth Conference on Uncertainty in Artificial Intelligence, UAI'12, pages 377–386, 2012.

[13] Francesco Locatello, Gideon Dresdner, Rajiv Khanna, Isabel Valera, and Gunnar Rätsch. Boosting black box variational inference. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 3401–3411. Curran Associates, Inc., 2018.

[14] Thomas Blumensath and Mike E. Davies. Iterative hard thresholding for compressed sensing. CoRR, 2008.

[15] Anastasios Kyrillidis and Volkan Cevher. Recipes on hard thresholding methods. In 2011 4th IEEE International Workshop on Computational Advances in Multi-Sensor Adaptive Processing (CAMSAP), pages 353–356. IEEE, 2011.

[16] Deanna Needell and Joel A. Tropp. CoSaMP: Iterative signal recovery from incomplete and inaccurate samples.
Applied and Computational Harmonic Analysis, 26(3):301–321, 2009.

[17] Wei Dai and Olgica Milenkovic. Subspace pursuit for compressive sensing signal reconstruction. IEEE Transactions on Information Theory, 55(5):2230–2249, 2009.

[18] Joel A. Tropp. Greed is good: Algorithmic results for sparse approximation. IEEE Transactions on Information Theory, 50(10):2231–2242, 2004.

[19] Prateek Jain, Ambuj Tewari, and Purushottam Kar. On iterative hard thresholding methods for high-dimensional M-estimation. In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 1, NIPS'14, pages 685–693, 2014.

[20] Rajiv Khanna and Anastasios Kyrillidis. IHT dies hard: Provable accelerated iterative hard thresholding. In AISTATS, 2018.

[21] David L. Donoho. Compressed sensing. IEEE Transactions on Information Theory, 52(4):1289–1306, 2006.

[22] Robert Tibshirani, Martin Wainwright, and Trevor Hastie. Statistical learning with sparsity: the lasso and generalizations. Chapman and Hall/CRC, 2015.

[23] Suvrit Sra, Sebastian Nowozin, and Stephen J. Wright. Optimization for machine learning. MIT Press, 2012.

[24] Junzhou Huang, Tong Zhang, and Dimitris Metaxas. Learning with structured sparsity. Journal of Machine Learning Research, 12(Nov):3371–3412, 2011.

[25] Eberhard Engel and Reiner M. Dreizler. Density functional theory. Springer.

[26] Jon Kleinberg and Éva Tardos. Algorithm design. Pearson Education India, 2006.

[27] John Duchi, Shai Shalev-Shwartz, Yoram Singer, and Tushar Chandra. Efficient projections onto the ℓ1-ball for learning in high dimensions. In Proceedings of the 25th International Conference on Machine Learning, pages 272–279. ACM, 2008.

[28] Anastasios Kyrillidis and Volkan Cevher. Matrix recipes for hard thresholding methods.
Journal of Mathematical Imaging and Vision, 48(2):235–265, 2014.

[29] Anastasios Kyrillidis, Stephen Becker, Volkan Cevher, and Christoph Koch. Sparse projections onto the simplex. In International Conference on Machine Learning, pages 235–243, 2013.

[30] Alekh Agarwal, Sahand Negahban, and Martin J. Wainwright. Fast global convergence rates of gradient methods for high-dimensional statistical recovery. In Advances in Neural Information Processing Systems, pages 37–45, 2010.

[31] J. J. Hull. A database for handwritten text recognition research. IEEE Trans. Pattern Anal. Mach. Intell., 16(5):550–554, May 1994.

[32] Been Kim, Rajiv Khanna, and Oluwasanmi O. Koyejo. Examples are not enough, learn to criticize! Criticism for interpretability. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 2280–2288. Curran Associates, Inc., 2016.

[33] Jacob Bien and Robert Tibshirani. Prototype selection for interpretable classification. Ann. Appl. Stat., 5(4):2403–2424, 2011.