{"title": "Fast greedy algorithms for dictionary selection with generalized sparsity constraints", "book": "Advances in Neural Information Processing Systems", "page_first": 4744, "page_last": 4753, "abstract": "In dictionary selection, several atoms are selected from finite candidates that successfully approximate given data points in the sparse representation. We propose a novel efficient greedy algorithm for dictionary selection. Not only does our algorithm work much faster than the known methods, but it can also handle more complex sparsity constraints, such as average sparsity. Using numerical experiments, we show that our algorithm outperforms the known methods for dictionary selection, achieving competitive performances with dictionary learning algorithms in a smaller running time.", "full_text": "Fast greedy algorithms for dictionary selection\n\nwith generalized sparsity constraints\n\nKaito Fujii\n\nGraduate School of Information Sciences and Technology\n\nThe University of Tokyo\n\nkaito_fujii@mist.i.u-tokyo.ac.jp\n\nTasuku Soma\n\nGraduate School of Information Sciences and Technology\n\nThe University of Tokyo\n\ntasuku_soma@mist.i.u-tokyo.ac.jp\n\nAbstract\n\nIn dictionary selection, several atoms are selected from \ufb01nite candidates that suc-\ncessfully approximate given data points in the sparse representation. We propose\na novel ef\ufb01cient greedy algorithm for dictionary selection. Not only does our\nalgorithm work much faster than the known methods, but it can also handle more\ncomplex sparsity constraints, such as average sparsity. Using numerical experi-\nments, we show that our algorithm outperforms the known methods for dictionary\nselection, achieving competitive performances with dictionary learning algorithms\nin a smaller running time.\n\n1\n\nIntroduction\n\nLearning sparse representations of data and signals has been extensively studied for the past decades in\nmachine learning and signal processing [16]. 
In these methods, a specific set of basis signals (atoms), called a dictionary, is required and used to approximate a given signal in a sparse representation. The design of a dictionary is highly nontrivial, and many studies have been devoted to the construction of a good dictionary for each signal domain, such as natural images and sounds. Recently, approaches that construct a dictionary from data have shown state-of-the-art results in various domains. The standard approach is called dictionary learning [3, 32, 1]. Although many studies have been devoted to dictionary learning, it is usually difficult to solve, as it requires non-convex optimization that often suffers from local minima. Moreover, standard dictionary learning methods (e.g., MOD [14] or k-SVD [2]) have high time complexity.

Krause and Cevher [22] proposed a combinatorial analogue of dictionary learning, called dictionary selection. In dictionary selection, given a finite set of candidate atoms, a dictionary is constructed by selecting a few atoms from the set. Dictionary selection can be faster than dictionary learning due to its discrete nature. Another advantage of dictionary selection is that its approximation guarantees hold even in agnostic settings, i.e., we do not need stochastic generative models of the data. Furthermore, dictionary selection algorithms can be used for media summarization, in which the atoms must be selected from given data points [8, 9].

The basic dictionary selection problem is formalized as follows. Let V be a finite set of candidate atoms and n = |V|. Throughout the paper, we assume that the atoms are unit vectors in R^d without loss of generality. We represent the candidate atoms as a matrix A ∈ R^{d×n} whose columns are the atoms in V. Let y_t ∈ R^d (t ∈ [T]) be data points, where [T ] = {1, . . .
, T}, and k and s be positive integers with k ≥ s.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

We assume that a utility function u : R^d × R^d → R_{≥0} exists, which measures the similarity of the input vectors; for example, one can use the ℓ2-utility function u(y, x) = ‖y‖₂² − ‖y − x‖₂², as in Krause and Cevher [22]. Then, dictionary selection finds a set X ⊆ V of size k that maximizes

    h(X) = ∑_{t=1}^T max_{w ∈ R^k : ‖w‖₀ ≤ s} u(y_t, A_X w),   (1)

where ‖w‖₀ is the number of nonzero entries of w and A_X is the column submatrix of A with respect to X. That is, we approximate a data point y_t with a sparse representation in the atoms in X, where the approximation quality is measured by u. Letting f_t(Z_t) := max_{w : supp(w) ⊆ Z_t} u(y_t, A_{Z_t} w) (t ∈ [T]), we can rewrite this as the following two-stage optimization:

    h(X) = ∑_{t=1}^T max_{Z_t ⊆ X : |Z_t| ≤ s} f_t(Z_t).

Here Z_t is the set of atoms used in a sparse representation of the data point y_t. The main challenges in dictionary selection are that evaluating h is NP-hard in general [25] and that the objective function h is not submodular [17], so the well-known greedy algorithm [27] cannot be applied. Previous approaches construct a good proxy of dictionary selection that can be solved easily, and analyze its approximation ratio.

1.1 Our contribution

Our main contribution is a novel and efficient algorithm called replacement orthogonal matching pursuit (Replacement OMP) for dictionary selection. This algorithm is based on a previous approach called Replacement Greedy [30] for two-stage submodular maximization, a problem similar to dictionary selection. However, that algorithm was not analyzed for dictionary selection.
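To make the two-stage objective above concrete, the following sketch evaluates h(X) for the squared ℓ2-utility by brute force over all supports of size at most s. This is our own illustrative code, not from the paper, and it is feasible only for tiny instances (evaluating h is NP-hard in general); the function and variable names are our assumptions.

```python
import itertools

import numpy as np


def h_value(A, X, ys, s):
    """Brute-force evaluation of the dictionary-selection objective h(X)
    for the squared l2-utility u(y, x) = ||y||^2 - ||y - x||^2.

    For each data point y_t, try every support Z_t of size <= s inside X
    and fit the weights w by least squares (tiny instances only)."""
    X = list(X)
    total = 0.0
    for y in ys:
        best = 0.0  # w = 0 gives utility ||y||^2 - ||y||^2 = 0
        for r in range(1, s + 1):
            for Z in itertools.combinations(X, r):
                AZ = A[:, list(Z)]
                # Least-squares fit: this w maximizes u(y, A_Z w)
                w, *_ = np.linalg.lstsq(AZ, y, rcond=None)
                x = AZ @ w
                best = max(best, y @ y - (y - x) @ (y - x))
        total += best
    return total
```

For instance, if some y_t is exactly one of the selected atoms, its inner maximum equals ‖y_t‖₂².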
We extend their approach to dictionary selection in the present work, with an additional improvement that exploits techniques from orthogonal matching pursuit. We compare our method with the previous methods in Table 1. Replacement OMP has a smaller running time than SDSOMP [10] and Replacement Greedy. The only exception is SDSMA [10], which intuitively ignores any correlation among the atoms. In our experiments, we demonstrate that Replacement OMP outperforms SDSMA in terms of test residual variance. We note that the constant approximation ratios of SDSMA, Replacement Greedy, and Replacement OMP are incomparable in general. In addition, we demonstrate in numerical experiments that Replacement OMP achieves a performance competitive with dictionary learning algorithms in a smaller running time.

Generalized sparsity constraint  Incorporating further prior knowledge on the data domain often improves the quality of dictionaries [28, 29, 11]. A typical example is a combinatorial constraint imposed independently on each support Z_t. This can be regarded as a natural extension of structured sparsity [19] in sparse regression, which requires the support to satisfy some combinatorial constraint rather than a cardinality constraint. A global structure on the supports is also useful prior information. Cevher and Krause [6] proposed a global sparsity constraint called average sparsity, in which they add a global constraint ∑_{t=1}^T |Z_t| ≤ s′. Intuitively, the average sparsity constraint requires that most data points can be represented by a small number of atoms. If the data points are patches of a natural image, most patches are simple background, and therefore the total size of the supports must be small. Average sparsity has also been intensively studied in dictionary learning [11].
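As a small illustration of the average sparsity constraint just described, the following check tests whether given supports are feasible. This is our own illustrative code, not from the paper; the per-point limits s_t anticipate the formal definition given later in Example 3.5.

```python
def average_sparsity_ok(supports, per_point_limits, global_budget):
    """Check the average sparsity constraint of Cevher and Krause:
    |Z_t| <= s_t for every t, and sum_t |Z_t| <= s'.

    supports: list of sets Z_1, ..., Z_T of atom indices
    per_point_limits: list of integers s_1, ..., s_T
    global_budget: integer s'"""
    sizes = [len(Z) for Z in supports]
    # per-data-point limits |Z_t| <= s_t
    if any(size > limit for size, limit in zip(sizes, per_point_limits)):
        return False
    # global budget sum_t |Z_t| <= s'
    return sum(sizes) <= global_budget
```

Note that the family of feasible supports is down-closed: removing an atom from any Z_t never violates the constraint.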
To deal with these generalized sparsities in a uni\ufb01ed manner,\nwe propose a novel class of sparsity constraints, namely p-replacement sparsity families. We prove\nthat Replacement OMP can be applied for the generalized sparsity constraint with a slightly worse\napproximation ratio. We emphasize that the OMP approach is essential for ef\ufb01ciency; in contrast,\nReplacement Greedy cannot be extended to the average sparsity setting because it can only handle\nlocal constraints on Zt, and yields an exponential running time.\n\nOnline extension In some practical situations, it is not always feasible to store all data points yt,\nbut these data points arrive in an online fashion. We show that Replacement OMP can be extended\nto the online setting, with a sublinear approximate regret. The details are given in Section 5.\n\n1.2 Related work\n\nKrause and Cevher [22] \ufb01rst introduced dictionary selection as a combinatorial analogue of dictionary\nlearning. They proposed SDSMA and SDSOMP, and analyzed the approximation ratio using the\ncoherence of the matrix A. Das and Kempe [10] introduced the concept of the submodularity ratio\n\n2\n\n\fTable 1: Comparison of known methods with Replacement OMP. The constants ms, Ms, and\nMs,2 are the restricted concavity and smoothness constants of u(yt,\u00b7) (t \u2208 [T ]); see Section 2. 
The running times are for the ℓ2-utility function u and the individual sparsity constraint.

Method                   | Approximation ratio                                          | Running time     | Generalized sparsity
SDSMA [22]               | (m_1 m_s)/(M_1 M_s) · (1 − 1/e) [10]                         | O((k + d)nT)     | No
SDSOMP [22]              | O(1/k) [10]                                                  | O((s + k)sdknT)  | No
Replacement Greedy [30]  | (m_{2s}/M_{s,2})² (1 − exp(−m_{2s}/M_{s,2}))                 | O(s²dknT)        | No
Replacement OMP          | (m_{2s}/M_{s,2})² (1 − exp(−m_{2s}/M_{s,2}))                 | O((n + ds)kT)    | Yes

and refined the analysis via the restricted isometry property [5]. A connection to restricted concavity and the submodularity ratio has been investigated by Elenberg et al. [13] and Khanna et al. [21] for sparse regression and matrix completion. Balkanski et al. [4] studied two-stage submodular maximization as a submodular proxy of dictionary selection, devising various algorithms. Stan et al. [30] proposed Replacement Greedy for two-stage submodular maximization. It is unclear whether these methods provide an approximation guarantee for the original dictionary selection problem.

To the best of our knowledge, no existing research in the literature addresses online dictionary selection. For a related problem in sparse optimization, namely online linear regression, Kale et al. [20] proposed an algorithm based on supermodular minimization [23] with a sublinear approximate regret guarantee. Elenberg et al. [12] devised a streaming algorithm for weakly submodular function maximization. Chen et al. [7] dealt with online maximization of weakly DR-submodular functions.

Organization  The rest of this paper is organized as follows. Section 2 provides the basic concepts and definitions. Section 3 formally defines dictionary selection with generalized sparsity constraints. Section 4 presents our algorithm, Replacement OMP.
Section 5 sketches the extension to the online setting. The experimental results are presented in Section 6.

2 Preliminaries

Notation  For a positive integer n, [n] denotes the set {1, 2, . . . , n}. The sets of reals and nonnegative reals are denoted by R and R_{≥0}, respectively. We define Z and Z_{≥0} similarly. Vectors and matrices are denoted by lower- and upper-case letters in boldface, respectively: a, x, y for vectors and A, X, Y for matrices. The ith standard unit vector is denoted by e_i; that is, e_i is the vector whose ith entry is equal to one and whose other entries are zero. For a matrix A ∈ R^{d×n} and X ⊆ [n], A_X denotes the column submatrix of A with respect to X. The maximum and minimum singular values of a matrix A are denoted by σ_max(A) and σ_min(A), respectively. For a positive integer k, we define σ_max(A, k) := max_{X ⊆ [n] : |X| ≤ k} σ_max(A_X); we define σ_min(A, k) in a similar way. For t ∈ [T], let u_t(w) := u(y_t, Aw), and let w_t^{(Z_t)} denote the maximizer of u_t(w) subject to supp(w) ⊆ Z_t. Throughout the paper, V denotes the fixed finite ground set. For X ⊆ V and a ∈ V \ X, we define X + a := X ∪ {a}. Similarly, for a ∈ V \ X and b ∈ X, we define X − b + a := (X \ {b}) ∪ {a}.

2.1 Restricted concavity and smoothness

The following concepts of restricted strong concavity and restricted smoothness are crucial in our analysis.

Definition 2.1 (Restricted strong concavity and restricted smoothness [26]). Let Ω be a subset of R^d × R^d and let u : R^d → R be a continuously differentiable function. We say that u is restricted
We say that u is restricted\nstrongly concave with parameter m\u2126 and restricted smooth with parameter M\u2126 if,\n\n(cid:107)y \u2212 x(cid:107)2\n\n2 \u2265 u(y) \u2212 u(x) \u2212 (cid:104)\u2207u(x), y \u2212 x(cid:105) \u2265 \u2212 M\u2126\n2\n\n\u2212 m\u2126\n2\nfor all (x, y) \u2208 \u2126.\nWe de\ufb01ne \u2126s,p := {(x, y) \u2208 Rd \u00d7 Rd : (cid:107)x(cid:107)0,(cid:107)y(cid:107)0 \u2264 s,(cid:107)x\u2212 y(cid:107)0 \u2264 p} and \u2126s := \u2126s,s for positive\nintegers s and p. We often abbreviate M\u2126s, M\u2126s,p, and m\u2126s as Ms, Ms,p, and ms, respectively.\n\n(cid:107)y \u2212 x(cid:107)2\n\n2\n\n3\n\n\f3 Dictionary selection with generalized sparsity constraints\n\nIn this section, we formalize our problem, dictionary selection with generalized sparsity constraints.\nIn this setting, the supports Zt for each t \u2208 [T ] cannot be independently selected, but we impose\n(cid:81)T\na global constraint on them. We formally write such constraints as a down-closed 1 family I \u2286\nt=1 2V . Therefore, we aim to \ufb01nd X \u2286 V with |X| \u2264 k maximizing\n\nT(cid:88)\n\nt=1\n\nh(X) =\n\nZ1,...,Zt\u2286X : (Z1,...,Zt)\u2208I\n\nmax\n\nft(Zt)\n\n(2)\n\nSince a general down-closed family is too abstract, we focus on the following class. First, we de\ufb01ne\nthe set of feasible replacements for the current support Z1,\u00b7\u00b7\u00b7 , ZT and an atom a as\n\nFa(Z1,\u00b7\u00b7\u00b7 , ZT ) = {(Z(cid:48)\n\n1,\u00b7\u00b7\u00b7 , Z(cid:48)\n\nT ) \u2208 I : Z(cid:48)\n\nt \u2286 Zt + a, |Zt \\ Z(cid:48)\n\n(3)\nThat is, the set of members in I obtained by adding a and removing at most one element from each\na\u2208V Fa(Z1,\u00b7\u00b7\u00b7 , ZT ). If Z1, . . . , ZT are clear from the context, we\nsimply write it as Fa.\n\nZt. Let F(Z1,\u00b7\u00b7\u00b7 , ZT ) =(cid:83)\nDe\ufb01nition 3.1 (p-replacement sparsity). A sparsity constraint I \u2286 (cid:81)T\n\nt| \u2264 1 (\u2200t \u2208 [T ])} .\n\nT ) \u2208 F(Z1, . . . 
, ZT ) (p(cid:48) \u2208 [p]) such that each element in Z\u2217\n\np(cid:48)=1 and each element in Zt \\ Z\u2217\n\nt=1 2V is p-replacement\nT ) \u2208 I, there is a sequence of p feasible replacements\nt \\ Zt appears at least once\nt appears at most once in the sequence\n\n1 , . . . , Z\u2217\n\nsparse if for any (Z1, . . . , ZT ), (Z\u2217\n(Z p(cid:48)\n1 , . . . , Z p(cid:48)\nin the sequence (Z p(cid:48)\n(Zt \\ Z p(cid:48)\nt )p\n\nt \\ Zt)p\n\np(cid:48)=1.\n\nThe following sparsity constraints are all p-replacement sparsity families. See Appendix B for proof.\nExample 3.2 (individual sparsity). The sparsity constraint for the standard dictionary selection can\nbe written as I = {(Z1,\u00b7\u00b7\u00b7 , ZT ) | |Zt| \u2264 s (\u2200t \u2208 [T ])}. We call it the individual sparsity constraint.\nThis constraint is a special case of an individual matroid constraint, described below.\nExample 3.3 (individual matroids). This was proposed by [30] as a sparsity constraint for two-stage\nsubmodular maximization. An individual matroid constraint can be written as I = {(Z1,\u00b7\u00b7\u00b7 , ZT ) |\nZt \u2208 It (\u2200t \u2208 [T ])} where (V,It) is a matroid2 for each t \u2208 [T ]. An individual sparsity constraint is\na special case of an individual matroid constraint where (V,It) is the uniform matroid for all t.\nExample 3.4 (block sparsity). Block sparsity was proposed by Krause and Cevher [22]. This sparsity\nrequires that the support must be sparse within each prespeci\ufb01ed block. That is, disjoint blocks\nB1,\u00b7\u00b7\u00b7 , Bb \u2286 [T ] of data points are given in advance, and an only small subset of atoms can be used\nt\u2208Bb(cid:48) Zt| \u2264 sb(cid:48) (\u2200b(cid:48) \u2208 [b])} where sb(cid:48) \u2208 Z\u22650 for\neach b(cid:48) \u2208 [b] are sparsity parameters.\nExample 3.5 (average sparsity [6]). This sparsity imposes a constraint on the average number of\nused atoms among all data points. 
The number of atoms used for each data point is also restricted.\nt=1 |Zt| \u2264 s(cid:48)} where st \u2208 Z\u22650 for each t \u2208 [T ] and\ns(cid:48) \u2208 Z\u22650 are sparsity parameters.\nProposition 3.6. The replacement sparsity parameters of individual matroids, block sparsity, and\naverage sparsity are upper-bounded by k, k, and 3k \u2212 1, respectively.\n\nin each block. Formally, I = {(Z1,\u00b7\u00b7\u00b7 , ZT ) | |(cid:83)\nFormally, I = {(Z1,\u00b7\u00b7\u00b7 , ZT ) | |Zt| \u2264 st,(cid:80)T\n\n4 Algortihms\n\nIn this section, we present Replacement Greedy [30] and Replacement OMP for dictionary\nselection with generalized sparsity constraints.\n\n1A set family I is said to be down-closed if X \u2208 I and Y \u2286 X then Y \u2208 I.\n2A matroid is a pair of a \ufb01nite ground set V and a non-empty down-closed family I \u2286 2V that satisfy that\n\nfor all Z, Z(cid:48) \u2208 I with |Z| < |Z(cid:48)|, there is an element a \u2208 Z(cid:48) \\ Z such that Z \u222a {a} \u2208 I\n\n4\n\n\f4.1 Replacement Greedy\n\nT(cid:88)\n\nReplacement Greedy was \ufb01rst proposed as an algorithm for a different problem, two-stage sub-\nmodular maximization [4]. In two-stage submodular maximization, the goal is to maximize\n\nt=1\n\nmax\n\nft(Zt),\n\nh(X) =\n\nZt\u2286X : Zt\u2208It\n\n(4)\nwhere ft is a nonnegative monotone submodular function (t \u2208 [T ]) and It is a matroid. Despite the\nsimilarity of the formulation, in dictionary selection, the functions ft are not necessarily submodular,\nbut come from the continuous function ut. Furthermore, in two-stage submodular maximization, the\nconstraints on Zt are individual for each t \u2208 [T ], while we pose a global constraint I. In the following,\nwe present an adaptation of Replacement Greedy to dictionary selection with generalized sparsity\nconstraints.\nReplacement Greedy stores the current dictionary X and supports Zt \u2286 X such that\n(Z1, . . . 
, ZT ) ∈ I, which are initialized as X = ∅ and Z_t = ∅ (t ∈ [T]). At each step, the algorithm considers the gain of adding an element a ∈ V to X with respect to each function f_t; i.e., the algorithm selects a that maximizes max_{(Z′_1, ..., Z′_T) ∈ F_a} ∑_{t=1}^T {f_t(Z′_t) − f_t(Z_t)}. See Algorithm 1 for a pseudocode description. Note that for an individual matroid constraint I, the algorithm coincides with the original Replacement Greedy [30].

Algorithm 1 Replacement Greedy & Replacement OMP
1: Initialize X ← ∅ and Z_t ← ∅ for t = 1, . . . , T.
2: for i = 1, . . . , k do
3:   Pick a* ∈ V that maximizes
       max_{(Z′_1, ··· , Z′_T) ∈ F_{a*}} ∑_{t=1}^T {f_t(Z′_t) − f_t(Z_t)}   (Replacement Greedy)
       max_{(Z′_1, ··· , Z′_T) ∈ F_{a*}} ∑_{t=1}^T {(1/M_{s,2}) ‖(∇u_t(w_t^{(Z_t)}))_{Z′_t \ Z_t}‖₂² − M_{s,2} ‖(w_t^{(Z_t)})_{Z_t \ Z′_t}‖₂²}   (Replacement OMP)
     and let (Z′_1, ··· , Z′_T) be a replacement achieving the maximum.
4:   Set X ← X + a* and Z_t ← Z′_t for each t ∈ [T].
5: return X.

Stan et al. [30] showed that Replacement Greedy achieves a ((1 − 1/√e)/2)-approximation when the f_t are monotone submodular. We extend their analysis to our non-submodular setting. The proof can be found in Appendix C.

Theorem 4.1. Assume that u_t is m_{2s}-strongly concave on Ω_{2s} and M_{s,2}-smooth on Ω_{s,2} for t ∈ [T], and that the sparsity constraint I is p-replacement sparse. Let (Z*_1, ··· , Z*_T) ∈ I be optimal supports of an optimal dictionary X*.
Then the solution (Z_1, ··· , Z_T) ∈ I of Replacement Greedy after k′ steps satisfies

    ∑_{t=1}^T f_t(Z_t) ≥ (m_{2s}²/M_{s,2}²) (1 − exp(−(k′/p) · (m_{2s}/M_{s,2}))) ∑_{t=1}^T f_t(Z*_t).

4.2 Replacement OMP

Now we propose our algorithm, Replacement OMP. A downside of Replacement Greedy is its heavy computation: in each greedy step, we need to evaluate ∑_{t=1}^T f_t(Z′_t) for each (Z′_1, . . . , Z′_T) ∈ F_a(Z_1, . . . , Z_T), which amounts to solving linear regression problems snT times if u is the ℓ2-utility function. To avoid this heavy computation, we propose a proxy for this quantity, borrowing an idea from orthogonal matching pursuit. Replacement OMP selects an atom a ∈ V that maximizes

    max_{(Z′_1, ··· , Z′_T) ∈ F_a(Z_1, ··· , Z_T)} ∑_{t=1}^T {(1/M_{s,2}) ‖(∇u_t(w_t^{(Z_t)}))_{Z′_t \ Z_t}‖₂² − M_{s,2} ‖(w_t^{(Z_t)})_{Z_t \ Z′_t}‖₂²}.   (5)

This algorithm requires the smoothness parameter M_{s,2} before execution. Computing M_{s,2} is difficult in general, but for the squared ℓ2-utility function this parameter can be bounded by σ²_max(A, 2). This value can be computed in O(n²d) time.

Theorem 4.2. Assume that u_t is m_{2s}-strongly concave on Ω_{2s} and M_{s,2}-smooth on Ω_{s,2} for t ∈ [T], and that the sparsity constraint I is p-replacement sparse. Let (Z*_1, ··· , Z*_T) ∈ I be optimal supports of an optimal dictionary X*.
Then the solution (Z_1, ··· , Z_T) ∈ I of Replacement OMP after k′ steps satisfies

    ∑_{t=1}^T f_t(Z_t) ≥ (m_{2s}²/M_{s,2}²) (1 − exp(−(k′/p) · (m_{2s}/M_{s,2}))) ∑_{t=1}^T f_t(Z*_t).

4.3 Complexity

Now we analyze the time complexity of Replacement Greedy and Replacement OMP. In general, F_a has O(n^T) members, and therefore it is difficult to compute F_a directly. Nevertheless, we show that Replacement OMP can be run much faster for the examples presented in Section 3.
In Replacement Greedy, it is difficult to find an atom with the largest gain at each step, because we need to maximize the nonlinear function ∑_{t=1}^T f_t(Z′_t). Conversely, in Replacement OMP, if we can calculate w_t^{(Z_t)} and ∇u_t(w_t^{(Z_t)}) for all t ∈ [T], the problem of calculating the gain of each atom reduces to maximizing a linear function.
In the following, we consider the ℓ2-utility function and the average sparsity constraint, because it is the most complex constraint; a similar result holds for the other examples. In fact, we show that this task reduces to maximum weight bipartite matching. The Hungarian method returns a maximum weight bipartite matching in O(T³) time. We can further improve the running time to O(T log T) by utilizing the structure of this problem. Due to space limitations, we defer the details to Appendix C. In summary, we obtain the following:

Theorem 4.3. Assume that the assumptions of Theorem 4.2 hold. Further assume that u is the ℓ2-utility function and I is the average sparsity constraint.
Then Replacement OMP finds a solution (Z_1, ··· , Z_T) ∈ I with

    ∑_{t=1}^T f_t(Z_t) ≥ (σ²_min(A, 2s)/σ²_max(A, 2))² (1 − exp(−(1/3) · (σ²_min(A, 2s)/σ²_max(A, 2)))) ∑_{t=1}^T f_t(Z*_t)

in O(Tk(n log T + ds)) time.

Remark 4.4. If finding an atom with the largest gain is computationally intractable, we can instead add an atom whose gain is no less than τ times the largest gain. In this case, the approximation ratio can be bounded by replacing k′ with τk′ in Theorems 4.1 and 4.2.

5 Extensions to the online setting

Our algorithms can be extended to the following online setting. The problem is formalized as a two-player game between a player and an adversary. At each round t = 1, . . . , T, the player must select (possibly in a randomized manner) a dictionary X_t ⊆ V with |X_t| ≤ k. Then, the adversary reveals a data point y_t ∈ R^d and the player gains f_t(X_t) = max_{w ∈ R^k : ‖w‖₀ ≤ s} u(y_t, A_{X_t} w). The performance measure of a player's strategy is the expected α-regret:

    regret_α(T) = α max_{X* : |X*| ≤ k} ∑_{t=1}^T f_t(X*) − E[∑_{t=1}^T f_t(X_t)],

where α > 0 is a constant independent of T corresponding to the offline approximation ratio, and the expectation is taken over the randomness of the player.
For this online setting, we present extensions of Replacement Greedy and Replacement OMP with sublinear α-regret, where α is the corresponding offline approximation ratio. The details are provided in Appendix D.

6 Experiments

In this section, we empirically evaluate our proposed algorithms on several dictionary selection problems with synthetic and real-world datasets. We use the squared ℓ2-utility function for all
We use the squared (cid:96)2-utility function for all\n\n6\n\n\f(a) synthetic, T = 100, time\n\n(b) synthetic, T = 100, residual\n\n(c) voc, T = 100, residual\n\n(d) synthetic, T = 1000, time\n\n(e) synthetic, T = 1000, residual\n\n(f) voc, T = 1000, residual\n\n(g) synthetic, T = 1000, time\n\n(h) synthetic, T = 1000, residual\n\n(i) voc, T = 1000, residual\n\nFigure 1: The experimental results for the of\ufb02ine setting. In all \ufb01gures, the horizontal axis indicates\nthe size of the output dictionary. (a), (b), and (c) are the results for T = 100. (d), (e), and (f) are\nthe results for T = 1000. (g), (h), and (i) are the results for T = 1000 with an average sparsity\nconstraint. For each setting, we provide the plot of the running time for the synthetic dataset, test\nresidual variance for the synthetic dataset, and test residual variance for VOC2006 image dataset.\n\nof the experiments. Since evaluating the value of the objective function is NP-hard, we plot the\napproximated residual variance obtained by orthogonal matching pursuit.\n\nGround set We use the ground set consisting of several orthonormal bases that are standard choices\nin signal and image processing, such as 2D discrete cosine transform and several 2D discrete wavelet\ntransforms (Haar, Daubechies 4, and coi\ufb02et). In all of the experiments, the dimension is set to d = 64,\nwhich corresponds to images of size 8 \u00d7 8 pixels. The size of the ground set is n = 256.\n\nMachine All the algorithms are implemented in Python 3.6. We conduct the experiments in a\nmachine with Intel Xeon E3-1225 V2 (3.20 GHz and 4 cores) and 16 GB RAM.\n\nDatasets We conduct experiments on two types of datasets. The \ufb01rst one is a synthetic dataset. In\neach trial, we randomly pick a dictionary with size k out of the ground set, and generate sparse linear\ncombinations of the columns of this dictionary. The weights of the linear combinations are generated\nfrom the standard normal distribution. 
The second one is a dataset of real-world images extracted from the PASCAL VOC2006 image dataset [15]. In each trial, we randomly select an image out of 2618 images and divide it into patches of 8 × 8 pixels, then select T patches uniformly at random. All the patches are normalized to zero mean and unit variance. We make the training and test datasets in the same way, and use the training dataset to obtain a dictionary and the test dataset to measure the quality of the output dictionary.

(a) synthetic  (b) voc

Figure 2: The experimental results for the online setting. In both figures, the horizontal axis indicates the number of rounds. (a) is the result with synthetic datasets, and (b) is the result with VOC2006 image datasets.

6.1 Experiments on the offline setting

We implement our proposed methods, Replacement Greedy (RG) and Replacement OMP (RepOMP), as well as the existing methods for dictionary selection, SDSMA and SDSOMP.
We also implement a heuristically modified version of RepOMP, which we call RepOMPd. In RepOMPd, we replace M_{s,2} with a parameter that decreases as the size of the current dictionary grows, which prevents the gains of all the atoms from becoming zero; here we use M_{s,2}/√i as the decreasing parameter, where i is the number of iterations so far. In addition, we compare these methods with standard methods for dictionary learning, MOD [14] and KSVD [2], each of which is set to stop when the change of the objective value becomes no more than 10⁻⁶ or after 200 iterations. Orthogonal matching pursuit is used as a subroutine in both methods.
First, we compare the methods for dictionary selection on small datasets with T = 100. The parameter of the sparsity constraints is set to s = 5. The results averaged over 20 trials are shown in Figure 1(a), (b), and (c). The plot of the running time for the VOC2006 dataset is omitted, as it is very similar to that for the synthetic dataset. In terms of running time, SDSMA is the fastest, but the quality of its output dictionary is unsatisfactory. RepOMP is several orders of magnitude faster than SDSOMP and RG, while its quality is almost the same as theirs: in Figure 1(b), the test residual variance curves of SDSOMP, RG, and RepOMP overlap, and in Figure 1(c), the test residual variance of RepOMP is only slightly worse than that of SDSOMP and RG. From these results, we conclude that RepOMP is by far the most practical method for dictionary selection.
Next, we compare the dictionary selection methods with the dictionary learning methods on larger datasets with T = 1000. SDSOMP and RG are omitted because they are too slow to be applied to datasets of this size. The results averaged over 20 trials are shown in Figure 1(d), (e), and (f).
In terms of running time, RepOMP and RepOMPd are much faster than MOD and KSVD, while their performance is competitive with that of MOD and KSVD.
Finally, we conduct experiments with the average sparsity constraints. We compare RepOMP and RepOMPd (Algorithm 2 in Appendix C) with a variant of SDSMA proposed for average sparsity by Cevher and Krause [6]. The parameters of the constraints are set to s_t = 8 for all t ∈ [T] and s′ = 5T. The results averaged over 20 trials are shown in Figure 1(g), (h), and (i). RepOMP and RepOMPd outperform SDSMA both in running time and in the quality of the output.
In Appendix E, we provide further experimental results, including examples of image restoration in which average sparsity works better than the standard dictionary selection.

6.2 Experiments on the online setting

Here we give the experimental results for the online setting. We implement the online versions of SDSMA, RG, and RepOMP, as well as the online dictionary learning algorithm proposed by Mairal et al. [24]. For all the online dictionary selection methods, the hedge algorithm is used as a subroutine. The parameters are set to k = 20 and s = 5. The results averaged over 50 trials are shown in Figure 2(a) and (b). For both datasets, Online RepOMP shows a better performance than Online SDSMA, Online RG, and the online dictionary learning algorithm.

Acknowledgement

The authors would like to thank Taihei Oki and Nobutaka Shimizu for their stimulating discussions. K.F. was supported by JSPS KAKENHI Grant Number JP 18J12405. T.S. was supported by ACT-I, JST. This work was supported by JST CREST, Grant Number JPMJCR14D2, Japan.

References

[1] A. Agarwal, A. Anandkumar, P. Jain, and P. Netrapalli.
Learning sparsely used overcomplete dictionaries via alternating minimization. SIAM Journal on Optimization, 26(4):2775–2799, 2016.

[2] M. Aharon, M. Elad, and A. Bruckstein. K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation. IEEE Transactions on Signal Processing, 54(11):4311–4322, 2006.

[3] S. Arora, R. Ge, and A. Moitra. New algorithms for learning incoherent and overcomplete dictionaries. In Proceedings of the Conference on Learning Theory (COLT), pages 779–806, 2014.

[4] E. Balkanski, B. Mirzasoleiman, A. Krause, and Y. Singer. Learning sparse combinatorial representations via two-stage submodular maximization. In Proceedings of the 33rd International Conference on Machine Learning (ICML), pages 2207–2216, 2016.

[5] E. J. Candes and T. Tao. Decoding by linear programming. IEEE Transactions on Information Theory, 51(12):4203–4215, 2005.

[6] V. Cevher and A. Krause. Greedy dictionary selection for sparse representation. IEEE Journal of Selected Topics in Signal Processing, 5(5):979–988, 2011.

[7] L. Chen, H. Hassani, and A. Karbasi. Online continuous submodular maximization. In Proceedings of the 21st International Conference on Artificial Intelligence and Statistics (AISTATS), volume 84, pages 1896–1905, 2018.

[8] Y. Cong, J. Yuan, and J. Luo. Towards scalable summarization of consumer videos via sparse dictionary selection. IEEE Transactions on Multimedia, 14(1):66–75, 2012.

[9] Y. Cong, J. Liu, G. Sun, Q. You, Y. Li, and J. Luo. Adaptive greedy dictionary selection for web media summarization. IEEE Transactions on Image Processing, 26(1):185–195, 2017.

[10] A. Das and D. Kempe. Submodular meets spectral: Greedy algorithms for subset selection, sparse approximation and dictionary selection. In Proceedings of the 28th International Conference on Machine Learning (ICML), pages 1057–1064, 2011.

[11] B.
Dumitrescu and P. Irofti. Dictionary Learning Algorithms and Applications. Springer, 2018.

[12] E. Elenberg, A. G. Dimakis, M. Feldman, and A. Karbasi. Streaming weak submodularity: Interpreting neural networks on the fly. In Advances in Neural Information Processing Systems (NIPS) 30, pages 4047–4057, 2017.

[13] E. R. Elenberg, R. Khanna, and A. G. Dimakis. Restricted strong convexity implies weak submodularity. In Proceedings of NIPS Workshop on Learning in High Dimensions with Structure, 2016.

[14] K. Engan, S. O. Aase, and J. Hakon Husoy. Method of optimal directions for frame design. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 05, pages 2443–2446, 1999.

[15] M. Everingham, A. Zisserman, C. K. I. Williams, and L. Van Gool. The PASCAL Visual Object Classes Challenge 2006 (VOC2006) Results. http://www.pascal-network.org/challenges/VOC/voc2006/results.pdf.

[16] S. Foucart and H. Rauhut. A Mathematical Introduction to Compressive Sensing. Springer, 2013.

[17] S. Fujishige. Submodular Functions and Optimization. Elsevier, 2nd edition, 2005.

[18] G. H. Golub and C. F. Van Loan. Matrix Computations, volume 3. JHU Press, 2012.

[19] J. Huang, T. Zhang, and D. Metaxas. Learning with structured sparsity. The Journal of Machine Learning Research, 12:3371–3412, 2009.

[20] S. Kale, Z. Karnin, T. Liang, and D. Pál. Adaptive feature selection: Computationally efficient online sparse linear regression under RIP. In Proceedings of the 34th International Conference on Machine Learning (ICML), pages 1–22, 2017.

[21] R. Khanna, E. Elenberg, A. Dimakis, J. Ghosh, and S. Neghaban. On approximation guarantees for greedy low rank optimization. In Proceedings of the 34th International Conference on Machine Learning (ICML), pages 1837–1846, 2017.

[22] A. Krause and V. Cevher.
Submodular dictionary selection for sparse representation. In Proceedings of the 27th International Conference on Machine Learning (ICML), pages 567–574, 2010.

[23] E. Liberty and M. Sviridenko. Greedy minimization of weakly supermodular set functions. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2017), volume 81, pages 19:1–19:11, 2017.

[24] J. Mairal, F. Bach, J. Ponce, and G. Sapiro. Online learning for matrix factorization and sparse coding. Journal of Machine Learning Research, 11:19–60, 2010.

[25] B. K. Natarajan. Sparse approximate solutions to linear systems. SIAM Journal on Computing, 24(2):227–234, 1995.

[26] S. N. Negahban, P. Ravikumar, M. J. Wainwright, and B. Yu. A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. Statistical Science, 27(4):538–557, 2012.

[27] G. L. Nemhauser, L. A. Wolsey, and M. L. Fisher. An analysis of approximations for maximizing submodular set functions - I. Mathematical Programming, 14(1):265–294, 1978.

[28] R. Rubinstein, M. Zibulevsky, and M. Elad. Double sparsity: Learning sparse dictionaries for sparse signal approximation. IEEE Transactions on Signal Processing, 58(3):1553–1564, 2010.

[29] C. Rusu, B. Dumitrescu, and S. A. Tsaftaris. Explicit shift-invariant dictionary learning. IEEE Signal Processing Letters, 21(1):6–9, 2014.

[30] S. Stan, M. Zadimoghaddam, A. Krause, and A. Karbasi. Probabilistic submodular maximization in sub-linear time. In Proceedings of the 34th International Conference on Machine Learning (ICML), pages 3241–3250, 2017.

[31] M. Streeter and D. Golovin. An online algorithm for maximizing submodular functions. In Advances in Neural Information Processing Systems (NIPS), pages 1577–1584, 2009.

[32] M. Zhou, H. Chen, L. Ren, G. Sapiro, L. Carin, and J. W. Paisley.
Non-parametric Bayesian dictionary learning for sparse image representations. In Advances in Neural Information Processing Systems (NIPS) 22, pages 2295–2303, 2009.