{"title": "Efficiently Learning Fourier Sparse Set Functions", "book": "Advances in Neural Information Processing Systems", "page_first": 15120, "page_last": 15129, "abstract": "Learning set functions is a key challenge arising in many domains, ranging from sketching graphs to black-box optimization with discrete parameters. In this paper we consider the problem of efficiently learning set functions that are defined over a ground set of size $n$ and that are sparse (say $k$-sparse) in the Fourier domain. This is a wide class, that includes graph and hypergraph cut functions, decision trees and more. Our central contribution is the first algorithm that allows learning functions whose Fourier support only contains low degree (say degree $d=o(n)$) polynomials using $O(k d \\log n)$ sample complexity and runtime $O( kn \\log^2 k \\log n \\log d)$. This implies that sparse graphs with $k$ edges can, for the first time, be learned from $O(k \\log n)$ observations of cut values and in linear time in the number of vertices. \n Our algorithm can also efficiently learn (sums of) decision trees of small depth.\n The algorithm exploits techniques from the sparse Fourier transform literature and is easily implementable. Lastly, we also develop an efficient robust version of our algorithm and prove $\\ell_2/\\ell_2$ approximation guarantees without any statistical assumptions on the noise.", "full_text": "Ef\ufb01ciently Learning Fourier Sparse Set Functions\n\nAndisheh Amrollahi \u2217\n\nETH Zurich\n\nZurich, Switzerland\namrollaa@ethz.ch\n\nAmir Zandieh \u2217\n\nEPFL\n\nLausanne, Switzerland\n\namir.zandieh@epfl.ch\n\nMichael Kapralov\u2020\n\nEPFL\n\nLausanne, Switzerland\n\nmichael.kapralov@epfl.ch\n\nAndreas Krause\n\nETH Zurich\n\nZurich, Switzerland\nkrausea@ethz.ch\n\nAbstract\n\nLearning set functions is a key challenge arising in many domains, ranging from\nsketching graphs to black-box optimization with discrete parameters. 
In this paper we consider the problem of efficiently learning set functions that are defined over a ground set of size n and that are sparse (say k-sparse) in the Fourier domain. This is a wide class that includes graph and hypergraph cut functions, decision trees and more. Our central contribution is the first algorithm that allows learning functions whose Fourier support only contains low degree (say degree d = o(n)) polynomials using O(kd log n) sample complexity and runtime O(kn log^2 k log n log d). This implies that sparse graphs with k edges can, for the first time, be learned from O(k log n) observations of cut values and in linear time in the number of vertices. Our algorithm can also efficiently learn (sums of) decision trees of small depth. The algorithm exploits techniques from the sparse Fourier transform literature and is easily implementable. Lastly, we also develop an efficient robust version of our algorithm and prove ℓ2/ℓ2 approximation guarantees without any statistical assumptions on the noise.

1 Introduction

How can we learn the structure of a graph by observing the values of a small number of cuts? Can we learn a decision tree efficiently by observing its evaluation on a few samples? Both of these important applications are instances of the more general problem of learning set functions.

Consider a set function which maps subsets of a ground set V of size n to real numbers, x : 2^V → R. Set functions that arise in applications often exhibit structure, which can be effectively captured in the Fourier (also called Walsh-Hadamard) basis. One commonly studied structure for set functions is Fourier sparsity [2]. A k-Fourier-sparse set function contains no more than k nonzero Fourier coefficients. Natural examples of k-Fourier-sparse set functions are cut functions of graphs with k edges or evaluations of a decision tree of depth d [7, 8, 12].
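To make the decision-tree example concrete, the following toy check (our own illustration, not code from the paper; the tree, its depth, and its leaf values are arbitrary choices) verifies by brute force that a depth-2 tree on n = 4 variables has its entire Walsh-Hadamard spectrum on frequencies of Hamming weight at most 2. The transform itself is defined formally in Section 2.

```python
import numpy as np
from itertools import product

n = 4
N = 2 ** n

def tree(t):
    # an arbitrary depth-2 decision tree querying bits t[0], t[1], t[2]
    if t[0] == 0:
        return 3.0 if t[1] == 0 else -1.0
    return 2.0 if t[2] == 0 else 5.0

# enumerate all inputs t in F_2^n and evaluate the tree
T = np.array(list(product([0, 1], repeat=n)))
x = np.array([tree(t) for t in T])

# dense Walsh-Hadamard transform: x_hat[f] = (1/sqrt(N)) * sum_t x[t] * (-1)^<f,t>
H = (-1.0) ** ((T @ T.T) % 2)
x_hat = H @ x / np.sqrt(N)

support = [tuple(T[i]) for i in range(N) if abs(x_hat[i]) > 1e-9]
assert all(sum(f) <= 2 for f in support)  # degree is at most the tree depth
assert len(support) <= 4 ** 2             # a depth-d tree has at most 4^d coefficients
```

Running the same check with deeper trees illustrates the general pattern: a depth-d tree is a degree-at-most-d polynomial in the Fourier basis with at most 4^d nonzero coefficients.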
The cut function of a graph only contains polynomials of degree at most two in the Fourier basis and, in the general case, the cut function of a hypergraph of degree d only contains polynomials of degree at most d in the Fourier basis [12]. Intuitively this means that these set functions can be written as sums of terms where each term depends on at most d elements of the ground set. Likewise, a decision tree of depth d only contains polynomials of degree at most d in the Fourier basis [7, 8]. Learning such functions has recently found applications in neural network hyper-parameter optimization [5]. Therefore, the family of Fourier sparse set functions whose Fourier support only contains low order terms is a natural and important class of functions to consider.

∗The first two authors contributed equally.
†Supported by ERC Starting Grant SUBLINEAR.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Related work One approach for learning Fourier sparse functions uses Compressive Sensing (CS) methods [12]. Suppose we know that the Fourier transform x̂ of our function is k-sparse, i.e. |supp(x̂)| ≤ k, and that supp(x̂) ⊆ P for some known set P of size p. In [12] it is shown that recovery of x̂ is possible (with high probability) by observing the value of x on O(k log^4 p) subsets chosen independently and uniformly at random. They utilize results from [10, 13] which prove that picking O(k log^4 p) rows of the Walsh-Hadamard matrix independently and uniformly at random results in a matrix satisfying the RIP, which is required for recovery. For the case of graphs, p = (n choose 2) = O(n^2), and one can essentially learn the underlying graph with O(k log^4 n) samples. In fact this result can be further improved, and O(k log^2 k log n) samples suffice [4].
Computationally, for the CS approach, one may use matching pursuit, which takes Ω(kp) time and thus results in a runtime of Ω(kn^d) for k-Fourier-sparse functions of order d. This equals Ω(kn^2) for graphs, where d = 2. In [12], proximal methods are used to optimize the Lagrangian form of the ℓ1 norm minimization problem. Optimization is performed on p variables, which results in Ω(n^2) runtime for graphs and Ω(n^d) time for the general order-d sparse recovery case. Hence, these algorithms scale exponentially with d and have at least quadratic dependence on n even in the simple case of learning graph cut functions.

There is another line of work on this problem in the sparse Fourier transform literature. [11] provides a non-robust version of the sparse Walsh-Hadamard transform (WHT). This algorithm makes restrictive assumptions on the signal, namely that the k nonzero Fourier coefficients are chosen uniformly at random from the Fourier domain. This is a strong assumption that does not hold for the case of cut functions or decision trees. This work is extended in [4] to a robust sparse WHT called SPRIGHT. In addition to the random uniform support assumption, [4] further presumes that the Fourier coefficients are finite valued and the noise is Gaussian. Furthermore, all existing sparse WHT algorithms are unable to exploit low-degree Fourier structure.

Our results We build on techniques from the sparse Fourier transform literature [3, 6, 2] and develop an algorithm to compute the Walsh-Hadamard transform (WHT) of a k-Fourier-sparse signal whose Fourier support is constrained to low degree frequencies (low degree polynomials). For recovering frequencies with low degree, we utilize ideas that are related to compressive sensing over finite fields [1].
We show that if the frequencies present in the support of x̂ are of low order then there exists an algorithm that computes the WHT in O(kn log^2 k log n log d) time using O(kd log n) samples. As opposed to [11], we avoid distributional assumptions on the support by using hashing schemes. Our approach is the first one to achieve the sampling complexity of O(kd log n). Moreover, its running time scales linearly in n and there is no exponential dependence on d. For the important special case of graphs, where d = 2, our sampling complexity is the near-optimal O(k log n) and our runtime is O(kn log^2 k log n), which is strictly better than CS methods that take at least quadratic time in n. This allows us to learn sparse graphs which have in the range of 800 vertices in ≈ 2 seconds, whereas the previous methods [12] were constrained to the range of 100 vertices for similar runtimes.

For the case where x̂ is not exactly k-sparse, we provide novel robust algorithms that recover the k dominant Fourier coefficients with provable ℓ2/ℓ2 approximation guarantees. We provide a robust algorithm using appropriate hashing schemes and a novel analysis. We further develop a robust recovery algorithm that uses O(kd log n log(d log n)) samples and runs in time O(nk log^3 k + nk log^2 k log n log(d log n) log d).

2 Problem Statement

Here we define the problem of learning set functions. Consider a set function which maps subsets of a ground set V ≜ {1, ..., n} = [n] of size n to real numbers, x : 2^V → R. We assume oracle access to this function; that is, we can observe the function value x(A) for any subset A that we desire. The goal is to learn the function, that is, to be able to evaluate it for all subsets B ⊆ V.
A problem which has received considerable interest is learning cut functions of sparse (in terms of edges) graphs [12]. Given a weighted undirected graph G = (V, E, w), the cut function associated to G is defined as x(A) = Σ_{s∈A, t∈V\A} w(s, t), for every A ⊆ V.

Note that we can equivalently represent each subset A ⊆ V by a vector t ∈ F_2^n which is the indicator of the set A. Here F_2 = {0, 1} denotes the finite field with two elements. Hence the set function can be viewed as x : F_2^n → R. We denote the Walsh-Hadamard transform of x : F_2^n → R by x̂ : F_2^n → R. It is defined as

x̂_f = (1/√N) Σ_{t∈F_2^n} x_t · (−1)^⟨f,t⟩,   f ∈ F_2^n,

where N = 2^n. The inner product ⟨f, t⟩ throughout the paper is performed modulo 2.

The Fourier transform of the graph cut function x̂ is the following:

x̂_f = (1/2) Σ_{s,t∈V} w(s, t)   if f = (0, ..., 0),
x̂_f = −w(s, t)/2                if f_s = f_t = 1 and f_i = 0 for all i ∉ {s, t},
x̂_f = 0                        otherwise.

It is clear that the Fourier support of the cut function for graph G contains only |E| + 1 nonzero elements (and hence it is sparse). Furthermore, the nonzero Fourier coefficients correspond to frequencies with Hamming weight at most 2.

One of the classes of set functions that we consider is that of exactly low order Fourier sparse functions. Under this model we address the following problem:

Input: oracle access to x : F_2^n → R such that ‖x̂‖_0 ≤ k and |f| ≤ d for all f ∈ supp(x̂)
Output: the nonzero coefficients of x̂ and their corresponding frequencies    (1)

where |f| denotes the Hamming weight of f.

We also consider the robust version of problem (1), where we only have access to noisy measurements of the input set function.
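As a small numerical sanity check (a sketch with an arbitrary toy graph, not the paper's code), one can verify the sparsity and degree structure stated above by brute force for small n: the spectrum of a cut function has exactly |E| + 1 nonzero coefficients, all on frequencies of Hamming weight at most 2, and the weight-2 coefficients are negative, matching the −w(s, t)/2 pattern.

```python
import numpy as np
from itertools import product

n = 5                                             # ground set: vertices {0, ..., 4}
N = 2 ** n
edges = [(0, 1, 1.0), (1, 2, 2.0), (3, 4, 0.5)]   # toy graph: (s, t, weight)

def cut_value(t):
    # t is the 0/1 indicator vector of a subset A of the vertices
    return sum(w for (s, u, w) in edges if t[s] != t[u])

T = np.array(list(product([0, 1], repeat=n)))
x = np.array([cut_value(t) for t in T])

# x_hat[f] = (1/sqrt(N)) * sum_t x[t] * (-1)^<f,t>, inner product mod 2
H = (-1.0) ** ((T @ T.T) % 2)
x_hat = H @ x / np.sqrt(N)

support = [tuple(T[i]) for i in range(N) if abs(x_hat[i]) > 1e-9]
assert len(support) == len(edges) + 1             # k-sparse with k = |E| + 1
assert all(sum(f) <= 2 for f in support)          # only degree <= 2 frequencies
# weight-2 coefficients carry a negative sign, as in the formula above
assert all(x_hat[i] < 0 for i in range(N)
           if sum(T[i]) == 2 and abs(x_hat[i]) > 1e-9)
```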
We make no assumption about the noise, which can be chosen adversarially. Equivalently, one can think of a general set function whose spectrum is well approximated by a low order sparse function, which we refer to as the head. The head of x̂ is just the top k Fourier coefficients x̂_f whose frequencies have low Hamming weight |f| ≤ d. We refer to the noise spectrum as the tail.

Definition 1 (Head and tail norm). For all integers n, d, and k we define the head of x̂ : F_2^n → R as

x̂_head := argmin { ‖x̂ − y‖_2 : y : F_2^n → R, ‖y‖_0 ≤ k, |j| ≤ d for all j ∈ supp(y) }.

The tail norm of x̂ is defined as Err(x̂, k, d) := ‖x̂ − x̂_head‖_2^2.

Since the set function to be learned is only approximately in the low order Fourier sparse model, it makes sense to consider the approximate version of problem (1). We use the well known ℓ2/ℓ2 approximation to formally define the robust version of problem (1) as follows:

Input: oracle access to x : F_2^n → R
Output: a function χ̂ : F_2^n → R such that ‖χ̂ − x̂‖_2^2 ≤ (1 + δ) Err(x̂, k, d) and |f| ≤ d for all f ∈ supp(χ̂)    (2)

Note that no assumptions are made about the function x; it can be any general set function.

3 Algorithm and Analysis

In this section we present our algorithm and analysis. We use techniques from the sparse FFT literature [3, 6, 2]. Our main technical novelty is a new primitive for estimating a low order frequency, i.e., |f| ≤ d, efficiently using an optimal number of samples O(d log n), given in Section 3.1. This primitive relies heavily on the fact that a low order frequency is constrained to a subset of size (n choose d), as opposed to the whole universe of size 2^n.
We show that problem (1) can be solved quickly and using few samples from the function x by proving the following theorem.

Theorem 2. For any integers n, k, and d, the procedure EXACTSHT solves problem (1) with probability 9/10. Moreover, the runtime of this algorithm is O(kn log^2 k log n log d) and the sample complexity of this procedure is O(kd log n).

We also show that problem (2) can be solved efficiently by proving the following theorem in the full version of this paper.

Theorem 3. For any integers n, k, and d, the procedure ROBUSTSHT solves problem (2) with probability 9/10. Moreover, the runtime of this procedure is O(nk log^3 k + nk log^2 k log n log(d log n) log d) and the sample complexity of the procedure is O(kd log n log(d log n)).

Remark: This theorem proves that for any arbitrary input signal, we are able to achieve the ℓ2/ℓ2 guarantee using O(kd · log n · log(d log n)) samples. Using the techniques of [9] one can prove that this sample complexity is optimal up to the log(d log n) factor. Note that it is impossible to achieve this sample complexity without exploiting the low degree structure of the Fourier support.

3.1 Low order frequency recovery

In this section we provide a novel method for recovering a frequency f ∈ F_2^n with bounded Hamming weight |f| ≤ d from measurements ⟨m_i, f⟩, i ∈ [s], for some s = O(d log n).
The goal of this section is to design a measurement matrix M ∈ F_2^{s×n} with small s such that, for any f ∈ F_2^n with |f| ≤ d, the following system of constraints has, with constant probability, the unique solution j = f and admits an efficient solver:

M j = M f,   |j| ≤ d,   j ∈ F_2^n.

To design an efficient solver for the above problem with optimal s, we first need an optimal algorithm for recovering frequencies of weight one, |f| ≤ 1. In this case, we can locate the index of the nonzero coordinate of f optimally via binary search, using O(log n) measurements and runtime.

Definition 4 (Binary search vectors). For any integer n, the ensemble of vectors {v^l}_{l=0}^{⌈log_2 n⌉} ⊆ F_2^n corresponding to binary search on n elements is defined as follows. Let v^0 = {1}^n (the all ones vector). For every l ∈ {1, ..., ⌈log_2 n⌉} and every j ∈ [n], v^l_j = ⌊(j mod 2^l) / 2^{l−1}⌋.

Lemma 5. There exists a set of measurements {m_i}_{i=1}^{s} for s = ⌈log_2 n⌉ + 1, together with an algorithm such that, for every f ∈ F_2^n with |f| ≤ 1, the algorithm can recover f from the measurements ⟨f, m_i⟩ in time O(log n).

To recover a frequency f with Hamming weight d, we hash the coordinates of f randomly into O(d) buckets. In expectation, a constant fraction of the nonzero elements of f get isolated in buckets, and hence the problem reduces to weight one recovery, which we can solve via binary search as shown in Lemma 5 in time O(log n) and with sample complexity O(log n). We recover a constant fraction of the nonzero indices of f, subtract those from f, and recurse on the residual. The pseudocode of the recovery procedure is presented in Algorithm 1.

Lemma 6. For any integers n and d, any power of two integer D ≥ 128d, and any frequency f ∈ F_2^n with |f| ≤ d, the procedure RECOVERFREQUENCY given in Algorithm 1 outputs f with probability at least 7/8, if we have access to the following:

1. For every r = 0, 1, ..., log_4 D, a hash function h_r : [n] → [D/2^r] which is an instance from a pairwise independent hash family.

2. For every l = 0, 1, ..., ⌈log_2 n⌉ and every r = 0, 1, ..., log_4 D, the measurements φ^l_r(i) = Σ_{j ∈ h_r^{−1}(i)} f_j · v^l_j for every i ∈ [D/2^r].

Moreover, the runtime of this procedure is O(D log D log n) and the number of measurements is O(D log n).

Proof. The proof is by induction on the iteration number r = 0, 1, ..., T.
We denote by E_r the event |f − f̃^(r)| ≤ d/4^r, that is, the sparsity of the residual goes down by a factor of 4 in every iteration up to the r-th iteration. The inductive hypothesis is Pr[E_{r+1} | E_r] ≥ 1 − 1/(16 · 2^r).

Algorithm 1 RECOVERFREQUENCY
input: power of two integer D, hash functions h_r : [n] → [D/2^r] for every r ∈ {0, 1, ..., log_4 D}, measurement vectors φ^l_r ∈ F_2^{D/2^r} for every l = 0, 1, ..., ⌈log_2 n⌉ and every r = 0, 1, ..., log_4 D.
output: recovered frequency f̃.
1: {v^l}_{l=0}^{⌈log_2 n⌉} ← binary search vectors on n elements (Definition 4), T ← log_4 D, f̃^(0) ← {0}^n.
2: for r = 0 to T do
3:   w ← {0}^n.
4:   for i = 1 to D/2^r do
5:     if φ^0_r(i) − Σ_{j ∈ h_r^{−1}(i)} f̃^(r)_j · v^0_j = 1 then
6:       index ← {0}^{⌈log_2 n⌉}, a ⌈log_2 n⌉-bit pointer.
7:       for l = 1 to ⌈log_2 n⌉ do
8:         if φ^l_r(i) − Σ_{j ∈ h_r^{−1}(i)} f̃^(r)_j · v^l_j = 1 then
9:           [index]_l ← 1, set the l-th bit of index to 1.
10:      w(index) ← 1, set the coordinate of w positioned at index to 1.
11:  f̃^(r+1) ← f̃^(r) + w.
12: return f̃^(T+1).

Conditioning on E_r we have that |f − f̃^(r)| ≤ d/4^r. For every i ∈ [D/2^r] and every l ∈ {0, 1, ..., ⌈log_2 n⌉} it follows from the definition of φ^l_r that

φ^l_r(i) − Σ_{j ∈ h_r^{−1}(i)} f̃^(r)_j · v^l_j = Σ_{j ∈ h_r^{−1}(i)} (f_j − f̃^(r)_j) · v^l_j.

Let us denote by S the support of the vector f − f̃^(r), namely let S = supp(f − f̃^(r)). From the pairwise independence of the hash function h_r the following holds for every a ∈ S:

Pr[h_r(a) ∈ h_r(S \ {a})] ≤ 2^r · |S| / D ≤ 1/(128 · 2^r).

This shows that for every a ∈ S, with probability 1 − 1/(128 · 2^r), the bucket h_r(a) contains no other element of S. Because the vector f − f̃^(r) restricted to the elements in the bucket h_r^{−1}(h_r(a)) then has Hamming weight one, for every a ∈ S,

Pr[ |(f − f̃^(r))_{h_r^{−1}(h_r(a))}| = 1 ] ≥ 1 − 1/(128 · 2^r).

If the above condition holds, then it is possible to find the index of the nonzero element via binary search as in Lemma 5; the for loop in line 7 of Algorithm 1 implements this. Therefore, with probability 1 − 1/(16 · 2^r), by Markov's inequality a 1 − 1/8 fraction of the support elements S gets recovered correctly and at most a 1/8 fraction of the elements remain unrecovered and possibly result in false positives. Since the algorithm recovers at most one element per bucket, the total number of falsely recovered indices is no more than the number of non-isolated buckets, which is at most (1/8) · |S|. Therefore, with probability 1 − 1/(16 · 2^r), the residual at the end of the r-th iteration has sparsity at most (1/8) · |S| + (1/8) · |S| = (1/4) · |S|, i.e.

|f − f̃^(r+1)| ≤ |S|/4 ≤ d/4^{r+1}.

This proves the inductive step.

It follows from the event E_T for T = log_4 D that f̃^(T) = f, and hence the output f̃^(T+1) of Algorithm 1 equals f. The inductive hypothesis along with the union bound implies that

Pr[¬E_T] ≤ Σ_{r=1}^{T} Pr[¬E_r | E_{r−1}] + Pr[¬E_0] ≤ Σ_{r=0}^{T} 1/(16 · 2^r) ≤ 1/8.

Runtime: the algorithm has three nested loops and the total number of repetitions of all loops together is O(D log n). The recovered frequency f̃^(r) always has at most O(D) nonzero entries, therefore the time to calculate Σ_{j ∈ h_r^{−1}(i)} f̃^(r)_j · v^l_j for a fixed r and a fixed l and all i ∈ [D/2^r] is O(D). Therefore the total runtime is O(D log D log n).

Number of measurements: the number of measurements is the total size of the measurement vectors φ^l_r, which is O(D log n).

3.2 Signal reduction

We now develop the main tool for estimating the frequencies of a sparse signal, namely the HASH2BINS primitive. If we hash the frequencies of a k-sparse signal into O(k) buckets, we expect most buckets to contain at most one of the elements of the support of our signal. The next definition shows how we compute the hashing of a signal in the time domain.

Definition 7.
For every n, b ∈ N, every a ∈ F_2^n, every σ ∈ F_2^{n×b}, and every x : F_2^n → R, we define the hashing of x̂ as u^a_σ : F_2^b → R, where u^a_σ(t) = √(2^n/2^b) · x_{σt+a} for every t ∈ F_2^b.

We denote by B ≜ 2^b the number of buckets of the hash function. In the next claim we show that the Fourier transform of u^a_σ corresponds to hashing x̂ into B buckets.

Claim 8. For every j ∈ F_2^b, û^a_σ(j) = Σ_{f ∈ F_2^n : σ^⊤f = j} x̂_f · (−1)^⟨a,f⟩.

Let h(f) ≜ σ^⊤f. For every j ∈ F_2^b, û^a_σ(j) is the sum of x̂_f · (−1)^⟨a,f⟩ over all frequencies f ∈ F_2^n such that h(f) = j; hence h(f) can be thought of as the bucket that f is hashed into. If the matrix σ is chosen uniformly at random then the hash function h(·) is pairwise independent.

Claim 9. For any n, b ∈ N, if the hash function h : F_2^n → F_2^b is defined as h(·) = σ^⊤(·), where σ ∈ F_2^{n×b} is a random matrix whose entries are distributed independently and uniformly at random on F_2, then for any f ≠ f' ∈ F_2^n it holds that Pr[h(f) = h(f')] = 1/B, where the probability is over picking the n · b random bits of σ.

Algorithm 2 HASH2BINS
input: signal x ∈ R^{2^n}, signal χ̂ ∈ R^{2^n}, integer b, binary matrix σ ∈ F_2^{n×b}, shift vector a ∈ F_2^n.
output: hashed signal û^a_σ.
1: Compute û^a_σ = FHT(√(2^n/2^b) · x_{σ(·)+a})   ▷ FHT is the fast Hadamard transform algorithm
2: û^a_σ(j) ← û^a_σ(j) − Σ_{f ∈ F_2^n : σ^⊤f = j} χ̂_f · (−1)^⟨a,f⟩ for every j ∈ F_2^b.
3: return û^a_σ.

The HASH2BINS primitive computes the Fourier coefficients of the residue signal that are hashed to each of the buckets.
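Definition 7 and Claim 8 can be checked numerically on a toy instance (a sketch with arbitrary parameters; the helper `idx` and the dense transforms are our own scaffolding, not the paper's code): subsampling x on the coset σt + a and taking a B-point Hadamard transform yields, in each bucket j, exactly the signed sum of the coefficients x̂_f with σ^⊤f = j.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)
n, b = 6, 3
N, B = 2 ** n, 2 ** b

Tn = np.array(list(product([0, 1], repeat=n)))  # all vectors in F_2^n
Tb = np.array(list(product([0, 1], repeat=b)))  # all vectors in F_2^b

def idx(v):
    # index of a 0/1 vector in the enumeration order used above
    return int("".join(map(str, v)), 2)

# an arbitrary dense signal and its Walsh-Hadamard transform
x = rng.standard_normal(N)
Hn = (-1.0) ** ((Tn @ Tn.T) % 2)
x_hat = Hn @ x / np.sqrt(N)

sigma = rng.integers(0, 2, size=(n, b))  # random hashing matrix
a = rng.integers(0, 2, size=n)           # random shift vector

# time-domain hashing (Definition 7): u(t) = sqrt(N/B) * x[sigma t + a]
u = np.array([np.sqrt(N / B) * x[idx((sigma @ t + a) % 2)] for t in Tb])
Hb = (-1.0) ** ((Tb @ Tb.T) % 2)
u_hat = Hb @ u / np.sqrt(B)

# Claim 8: u_hat(j) = sum over f with sigma^T f = j of x_hat[f] * (-1)^<a,f>
for j in Tb:
    rhs = sum(x_hat[idx(f)] * (-1) ** int(f @ a % 2)
              for f in Tn if np.array_equal((sigma.T @ f) % 2, j))
    assert abs(u_hat[idx(j)] - rhs) < 1e-8
```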
We denote by χ̂ the estimate of x̂ in each iteration. As we will see in Section 3.3, the recovery algorithm is iterative in the sense that we iterate on x̂ − χ̂ (the residue), whose sparsity is guaranteed to decrease by a constant factor in each step.

Claim 10. For any signals x, χ̂ : F_2^n → R, integer b, matrix σ ∈ F_2^{n×b}, and vector a ∈ F_2^n, the procedure HASH2BINS(x, χ̂, b, σ, a) given in Algorithm 2 computes the following using O(B) samples from x in time O(Bn log B + ‖χ̂‖_0 · n log B):

û^a_σ(j) = Σ_{f ∈ F_2^n : σ^⊤f = j} (x̂ − χ̂)_f · (−1)^⟨a,f⟩.

3.3 Exact Fourier recovery

In this section, we present our algorithm for solving the exact low order Fourier sparse problem defined in (1) and prove Theorem 2. Let S ≜ supp(x̂). Problem (1) assumes that |S| ≤ k and also that |f| ≤ d for every f ∈ S. The recovery algorithm hashes the frequencies into B = 2^b buckets using Algorithm 2. Every frequency in the support f ∈ S is recoverable, with constant probability, if no other frequency from the support collides with it in the hashed signal. The collision event is formally defined below.

Definition 11 (Collision). For any frequency f ∈ F_2^n and every sparse signal x̂ with support S = supp(x̂), the collision event E_coll(f) corresponding to the hash function h(f) = σ^⊤f holds iff h(f) ∈ h(S \ {f}).

Claim 12 (Probability of collision).
For every f ∈ F_2^n, if the hash function h : F_2^n → F_2^b is defined as h(·) = σ^⊤(·), where σ ∈ F_2^{n×b} is a random matrix whose entries are distributed independently and uniformly at random on F_2, then Pr[E_coll(f)] ≤ k/B (see Definition 11). The probability is over the randomness of the matrix σ.

If the hash function h(·) = σ^⊤(·) is such that the collision event E_coll(f) does not occur for a frequency f, then it follows from Claim 8 and Definition 11 that for every a ∈ F_2^n,

û^a_σ(h(f)) = x̂_f · (−1)^⟨a,f⟩.

If a = {0}^n then û^a_σ(h(f)) = x̂_f. Hence for any m ∈ F_2^n, one can learn the inner product ⟨m, f⟩ by comparing the sign of û^m_σ(h(f)) = x̂_f · (−1)^⟨m,f⟩ with the sign of û^a_σ(h(f)). If the signs are the same then (−1)^⟨m,f⟩ = 1, meaning that ⟨m, f⟩ = 0, and if the signs are different then ⟨m, f⟩ = 1. In Section 3.1 we gave an algorithm for learning a low order frequency |f| ≤ d from measurements of the form ⟨m, f⟩. Therefore, under this condition, the problem reduces to d-sparse recovery. Putting these together gives the inner subroutine for our sparse fast Hadamard transform, which performs one round of hashing, presented in Algorithm 3.

Algorithm 3 SHTINNER
input: signal x ∈ R^{2^n}, signal χ̂ ∈ R^{2^n}, failure probability p, integer b, integer d.
output: recovered signal χ̂'.
1: Let {v^l}_{l=0}^{⌈log_2 n⌉} be the binary search vectors on n elements (Definition 4).
2: D ← smallest power of two integer such that D ≥ 128d, R ← ⌈2 log_2(1/p)⌉.
3: For every r ∈ {0, 1, ..., log_4 D} and every s ∈ [R], let h^s_r : [n] → [D/2^r] be an independent copy of a pairwise independent hash function.
4: For every r ∈ {0, 1, ..., log_4 D}, every s ∈ [R], and every j ∈ [D/2^r], let w^j_{r,s} ∈ F_2^n be the binary indicator vector of the set (h^s_r)^{−1}(j).
5: For every s ∈ [R], every r ∈ {0, 1, ..., log_4 D}, every l ∈ {0, 1, ..., ⌈log_2 n⌉}, and every j ∈ [D/2^r], add w^j_{r,s} · v^l to the set A_s (· denotes the coordinate-wise product of vectors).
6: Let σ ∈ F_2^{n×b} be a random matrix, each entry independent and uniform on F_2.
7: For every a ∈ ∪_{s∈[R]} A_s compute û^a_σ = HASH2BINS(x, χ̂, b, σ, a).
8: for j = 1 to B do
9:   Let L be an empty multi-set.
10:  for s ∈ [R] do
11:    if û^c_σ(j) ≠ 0, where c = {0}^n, then
12:      for every r ∈ {0, ..., log_4 D}, every i ∈ [D/2^r], and every l ∈ {0, ..., ⌈log_2 n⌉} do
13:        if û^c_σ(j) and û^{w^i_{r,s}·v^l}_σ(j) have the same sign then φ^l_r(i) ← 0 else φ^l_r(i) ← 1.
14:      f̃ ← RECOVERFREQUENCY(D, {h^s_r}_{r=0}^{log_4 D}, {{φ^l_r}_{l=0}^{⌈log_2 n⌉}}_{r=0}^{log_4 D}).
15:      Append f̃ to the multi-set L.
16:  f ← majority(L)
17:  χ̂'_f ← û^c_σ(j), where c = {0}^n.
18: return χ̂'.

Lemma 13.
For all integers b and d, every pair of signals x, χ̂ ∈ R^{2^n} such that |ξ| ≤ d for every ξ ∈ supp(x̂ − χ̂), and any parameter p > 0, Algorithm 3 outputs a signal χ̂' ∈ R^{2^n} such that |supp(χ̂')| ≤ |supp(x̂ − χ̂)| and, for every frequency f ∈ supp(x̂ − χ̂), if the collision event E_coll(f) does not happen then

Pr[ χ̂'_f = (x̂ − χ̂)_f ] ≥ 1 − p.

Moreover, the sample complexity of this procedure is O(Bd log n log(1/p)) and its time complexity is O(B log B (n + d log n log(1/p)) + nB log n log d log(1/p) + ‖χ̂‖_0 · n (log B + log n log d log(1/p))).

Lemma 14. For any parameter p > 0, all integers k, d, and b ≥ log_2(k/p), and every pair of signals x, χ̂ ∈ R^{2^n} such that ‖x̂ − χ̂‖_0 ≤ k and |ξ| ≤ d for every ξ ∈ supp(x̂ − χ̂), the output χ̂' of SHTINNER(x, χ̂, p, b, d) satisfies the following with probability at least 1 − 32p:

‖x̂ − χ̂ − χ̂'‖_0 ≤ k/8.

Our sparse Hadamard transform algorithm iteratively calls the primitive SHTINNER to reduce the sparsity of the residual signal by a constant factor in every iteration. Hence, it terminates in O(log k) iterations.
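A single round of this hash-and-compare-signs idea can be illustrated end to end on a toy instance. The sketch below deliberately simplifies the paper's procedure in two ways that are our own choices: it recovers frequency bits using the n single-coordinate shifts a = e_i (rather than the O(d log n) binary-search measurements of Section 3.1), and it fixes σ = [I_b; 0] on frequencies chosen so that no collision occurs, so one round recovers everything.

```python
import numpy as np
from itertools import product

n, b = 6, 3
N, B = 2 ** n, 2 ** b

Tn = np.array(list(product([0, 1], repeat=n)))
Tb = np.array(list(product([0, 1], repeat=b)))

def idx(v):
    return int("".join(map(str, v)), 2)

def wht(z):
    # dense Walsh-Hadamard transform with 1/sqrt(len) normalization (self-inverse)
    Tm = np.array(list(product([0, 1], repeat=int(np.log2(len(z))))))
    return ((-1.0) ** ((Tm @ Tm.T) % 2)) @ z / np.sqrt(len(z))

# a 3-sparse spectrum; the frequencies differ in their first b bits, so
# hashing with sigma = [I_b; 0] puts them into distinct buckets (no collisions)
true_spec = {(1, 0, 0, 0, 1, 0): 2.0,
             (0, 1, 0, 1, 0, 0): -1.5,
             (0, 0, 1, 0, 0, 1): 0.5}
x_hat = np.zeros(N)
for f, v in true_spec.items():
    x_hat[idx(f)] = v
x = wht(x_hat)  # the WHT is self-inverse, so this is the time-domain signal

sigma = np.vstack([np.eye(b, dtype=int), np.zeros((n - b, b), dtype=int)])

def hashed(a):
    # per Claim 8: u_hat^a(j) = sum_{sigma^T f = j} x_hat[f] * (-1)^<a,f>
    u = np.array([np.sqrt(N / B) * x[idx((sigma @ t + a) % 2)] for t in Tb])
    return wht(u)

u0 = hashed(np.zeros(n, dtype=int))
shifts = [hashed(e) for e in np.eye(n, dtype=int)]

recovered = {}
for j in range(B):
    if abs(u0[j]) > 1e-9:  # non-empty bucket, assumed collision-free
        # bit i of the frequency: signs of u0 and the e_i-shifted bucket agree iff f_i = 0
        f = tuple(0 if shifts[i][j] * u0[j] > 0 else 1 for i in range(n))
        recovered[f] = u0[j]

assert {f: round(v, 6) for f, v in recovered.items()} == true_spec
```

In the paper's actual algorithm, collisions are instead handled probabilistically over a random σ, the residual is peeled over O(log k) rounds, and the frequency bits come from the measurement ensemble of Section 3.1 so that only O(d log n) shifts are needed.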
See Algorithm 4.

Algorithm 4 EXACTSHT
input: signal x ∈ R^{2^n}, failure probability q, sparsity k, integer d.
output: estimate χ̂ ∈ R^{2^n}.
1: p^(1) ← q/32, b^(1) ← ⌈log_2(64k/q)⌉, w^(0) ← {0}^{2^n}, T ← ⌈log_8 k⌉.
2: for r = 1 to T do
3:   χ̃ ← SHTINNER(x, w^(r−1), p^(r), b^(r), d)
4:   w^(r) ← w^(r−1) + χ̃.
5:   p^(r+1) ← p^(r)/2, b^(r+1) ← b^(r) − 2.
6: χ̂ ← w^(T).
7: return χ̂.

Proof of Theorem 2: The proof is by induction. We denote by E_r the event corresponding to ‖x̂ − w^(r)‖_0 ≤ k/8^r. The inductive hypothesis is Pr[E_r | E_{r−1}] ≥ 1 − 32p^(r). Conditioned on E_{r−1} we have that ‖x̂ − w^(r−1)‖_0 ≤ k/8^{r−1}. The number of buckets in iteration r of the algorithm is B^(r) = 2^{b^(r)} ≥ 64k/(q · 4^{r−1}) and the error probability is p^(r) = q/(32 · 2^{r−1}). Hence it follows from Lemma 14 that, with probability 1 − 32p^(r), ‖x̂ − w^(r)‖_0 ≤ k/8^r. This proves the inductive step.

Runtime and sample complexity: In iteration r ∈ [⌈log_8 k⌉], the number of buckets is B^(r) ≥ 64k/(q · 4^{r−1}), and at most Σ_r B^(r) elements are added to χ̂; hence we can assume that ‖χ̂‖_0 ≤ 128k/q. From Lemma 13 it follows that the total runtime is O(kn log^2 k log n log d). The sample complexity of iteration r is O((kd/2^r) log n log 2^r), hence the total sample complexity is dominated by that of the first iteration, which is equal to O(kd log n).

4 Experiments

We test our EXACTSHT algorithm for graph sketching on a real world data set.
We utilize the autonomous systems dataset from the SNAP data collection.³ In order to compare our method with [12] we reproduce their experimental setup. The dataset consists of 9 snapshots of an autonomous system in Oregon on 9 different dates. The goal is to detect which edges are added and removed when comparing the system on two different dates. As a pre-processing step, we find the common vertices that exist on all dates and look at the induced subgraphs on these vertices. We take the symmetric difference (over the edges) of dates 7 and 9; results for other date combinations can be found in the supplementary material. This results in a graph that is sparse in the number of edges. Recall that the running time of our algorithm is $O(kn \log^2 k \log n \log d)$, which reduces to $O(nk \log^2 k \log n)$ for the case of cut functions, where $d = 2$.

4.1 Sample and time complexities as the number of vertices varies

In the first experiment, depicted in Figure 1b, we order the vertices of the graph by their degree and look at the induced subgraph on the $n$ vertices of largest degree, for varying $n$. For each $n$ we pick $e = 50$ edges uniformly at random. The goal is to learn the underlying graph by observing the values of cuts. We choose the parameters of our algorithm such that the probability of success is at least 0.9; the parameters tuned to reach this error probability are the initial number of buckets the frequencies are hashed to and the ratio by which this number shrinks in each iteration. We plot running times as $n$ varies. We compare our algorithm with that of [12], which utilizes a compressive sensing (CS) approach; we fine-tune their algorithm by tuning its sample complexity. Both algorithms are run in a way such that each sample (each observation of a cut value) takes the same time. As one can see, our algorithm scales linearly with $n$ (up to log factors) whereas the CS approach scales quadratically. Our algorithm continues to work in a reasonable amount of time for as many as 900 vertices, in under 2 seconds. Error bars depict standard deviations.

In Table 1 we include both sample complexities (number of observed cuts) and running times as $n$ varies. Our sample complexity is $O(k \log n)$; in practice we require around a constant factor of 10 more samples than the compressive sensing method, which is not provably optimal (see Section 1) but performs well in practice. In terms of computational cost, however, the CS approach quickly becomes intractable, taking large amounts of time on instance sizes around 200 and larger [12]. Asterisks in Table 1 refer to experiments that took too long to be feasible to run.

³ snap.stanford.edu/data/

[Figure 1: (a) Avg. time vs. no. edges. (b) Avg. time vs. no. vertices.]

Table 1: Sampling and computational complexity

No. of vertices | Our method runtime (s) | Our method samples | CS method runtime (s) | CS method samples
 70 | 0.85 | 6428 | 1.14 | 767
 90 | 0.92 | 6490 | 1.88 | 812
110 | 0.82 | 6491 | 3.00 | 850
130 | 1.01 | 7549 | 4.31 | 880
150 | 1.16 | 7942 | 5.34 | 905
170 | 1.22 | 7942 | 6.13 | 927
190 | 1.18 | 7271 | 7.36 | 947
210 | 1.28 | 7271 | 8.24 | 965
230 | 1.38 | 7942 | *    | *
250 | 1.38 | 7271 | *    | *
300 | 1.66 | 8051 | *    | *
400 | 2.06 | 8794 | *    | *
500 | 2.42 | 8794 | *    | *
600 | 3.10 | 9646 | *    | *
700 | 3.35 | 9646 | *    | *
800 | 3.60 | 9646 | *    | *

4.2 Time complexities as the number of edges varies

Here we fix the number of vertices to $n = 100$ and consider the induced subgraph on these vertices. We randomly pick $e$ edges to include in the graph.
We plot running times; our running time provably scales linearly in the number of edges, as can be seen in Figure 1a.

References

[1] Abhik Kumar Das and Sriram Vishwanath. On finite alphabet compressive sensing. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 5890-5894. IEEE, 2013.

[2] Oded Goldreich and Leonid A. Levin. A hard-core predicate for all one-way functions. In Proceedings of the Twenty-First Annual ACM Symposium on Theory of Computing, pages 25-32. ACM, 1989.

[3] Haitham Hassanieh, Piotr Indyk, Dina Katabi, and Eric Price. Nearly optimal sparse Fourier transform. In Proceedings of the Forty-Fourth Annual ACM Symposium on Theory of Computing, pages 563-578. ACM, 2012.

[4] Ishay Haviv and Oded Regev. The restricted isometry property of subsampled Fourier matrices. In Geometric Aspects of Functional Analysis, pages 163-179. Springer, 2017.

[5] Elad Hazan, Adam Klivans, and Yang Yuan. Hyperparameter optimization: a spectral approach. In International Conference on Learning Representations, 2018.

[6] Piotr Indyk, Michael Kapralov, and Eric Price. (Nearly) sample-optimal sparse Fourier transform. In Proceedings of the Twenty-Fifth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 480-499. Society for Industrial and Applied Mathematics, 2014.

[7] Eyal Kushilevitz and Yishay Mansour. Learning decision trees using the Fourier spectrum. SIAM Journal on Computing, 22(6):1331-1348, 1993.

[8] Yishay Mansour. Learning Boolean functions via the Fourier transform. In Theoretical Advances in Neural Computation and Learning, pages 391-424. Springer, 1994.

[9] Eric Price and David P. Woodruff. (1 + ε)-approximate sparse recovery. In Proceedings of the 2011 IEEE 52nd Annual Symposium on Foundations of Computer Science, FOCS '11, pages 295-304, Washington, DC, USA, 2011.
IEEE Computer Society.

[10] Mark Rudelson and Roman Vershynin. On sparse reconstruction from Fourier and Gaussian measurements. Communications on Pure and Applied Mathematics, 61(8):1025-1045, 2008.

[11] Robin Scheibler, Saeid Haghighatshoar, and Martin Vetterli. A fast Hadamard transform for signals with sublinear sparsity in the transform domain. IEEE Transactions on Information Theory, 61(4):2115-2132, 2015.

[12] Peter Stobbe and Andreas Krause. Learning Fourier sparse set functions. In Artificial Intelligence and Statistics, pages 1125-1133, 2012.

[13] Roman Vershynin. Introduction to the non-asymptotic analysis of random matrices. arXiv preprint arXiv:1011.3027, 2010.