{"title": "Sparse Polynomial Learning and Graph Sketching", "book": "Advances in Neural Information Processing Systems", "page_first": 3122, "page_last": 3130, "abstract": "Let $f: \\{-1,1\\}^n \\rightarrow \\mathbb{R}$ be a polynomial with at most $s$ non-zero real coefficients. We give an algorithm for exactly reconstructing $f$ given random examples from the uniform distribution on $\\{-1,1\\}^n$ that runs in time polynomial in $n$ and $2^{s}$ and succeeds if the function satisfies the \\textit{unique sign property}: there is one output value which corresponds to a unique set of values of the participating parities. This sufficient condition is satisfied when every coefficient of $f$ is perturbed by a small random noise, or satisfied with high probability when $s$ parity functions are chosen randomly or when all the coefficients are positive. Learning sparse polynomials over the Boolean domain in time polynomial in $n$ and $2^{s}$ is considered notoriously hard in the worst-case. Our result shows that the problem is tractable for almost all sparse polynomials. Then, we show an application of this result to hypergraph sketching which is the problem of learning a sparse (both in the number of hyperedges and the size of the hyperedges) hypergraph from uniformly drawn random cuts. 
We also provide experimental results on a real world dataset.", "full_text": "Sparse Polynomial Learning and Graph Sketching

Murat Kocaoglu¹∗, Karthikeyan Shanmugam¹†, Alexandros G. Dimakis¹‡, Adam Klivans²⋆

¹Department of Electrical and Computer Engineering, ²Department of Computer Science
The University of Texas at Austin, USA
∗mkocaoglu@utexas.edu, †karthiksh@utexas.edu, ‡dimakis@austin.utexas.edu, ⋆klivans@cs.utexas.edu

Abstract

Let f : {−1, 1}^n → R be a polynomial with at most s non-zero real coefficients. We give an algorithm for exactly reconstructing f given random examples from the uniform distribution on {−1, 1}^n that runs in time polynomial in n and 2^s and succeeds if the function satisfies the unique sign property: there is one output value which corresponds to a unique set of values of the participating parities. This sufficient condition is satisfied when every coefficient of f is perturbed by a small random noise, or satisfied with high probability when s parity functions are chosen randomly or when all the coefficients are positive. Learning sparse polynomials over the Boolean domain in time polynomial in n and 2^s is considered notoriously hard in the worst case. Our result shows that the problem is tractable for almost all sparse polynomials. Then, we show an application of this result to hypergraph sketching, which is the problem of learning a sparse (both in the number of hyperedges and the size of the hyperedges) hypergraph from uniformly drawn random cuts. We also provide experimental results on a real world dataset.

1 Introduction

Learning sparse polynomials over the Boolean domain is one of the fundamental problems of computational learning theory and has been studied extensively over the last twenty-five years [1–6].
In almost all cases, known algorithms for learning or interpolating sparse polynomials require query access to the unknown polynomial. An outstanding open problem is to find an algorithm for learning s-sparse polynomials with respect to the uniform distribution on {−1, 1}^n that runs in time polynomial in n and g(s) (where g is any fixed function independent of n) and requires only randomly chosen examples to succeed. In particular, such an algorithm would imply a breakthrough result for the problem of learning k-juntas (functions that depend on only k ≪ n input variables; it is not known how to learn ω(1)-juntas in polynomial time).

We present an algorithm and a set of natural conditions such that any sparse polynomial f satisfying these conditions can be learned from random examples in time polynomial in n and 2^s. In particular, any f whose coefficients have been subjected to a small perturbation (the smoothed analysis setting) satisfies these conditions; for example, if a Gaussian with arbitrarily small variance is added independently to each coefficient, f satisfies these conditions with probability 1. We state our main result here:

Theorem 1. Let f be an s-sparse function that satisfies at least one of the following properties: a) (smoothed analysis setting) The coefficients {c_i}_{i=1}^s are in general position or all of them are perturbed by a small random noise. b) The s parity functions are linearly independent. c) All the coefficients are positive. Then we can learn f with high probability in time poly(n, 2^s).

We note that smoothed analysis, pioneered in [7], has now become a common alternative for problems that seem intractable in the worst case.

Our algorithm also succeeds in the presence of noise:

Theorem 2. Let f = f1 + f2 be a polynomial such that f1 and f2 depend on mutually disjoint sets of parity functions, f1 is s-sparse, and the values of f1 are 'well separated'. Further, ‖f2‖_1 ≤ ν (i.e., f is approximately sparse). If observations are corrupted by additive noise bounded by ε, then there exists an algorithm, taking ε + ν as an input, that outputs g in time polynomial in n and 2^s such that ‖f − g‖_2 ≤ O(ν + ε) with high probability.

The treatment of the noisy case, i.e., the formal statement of this theorem, the corresponding algorithm, and the related proofs are relegated to the supplementary material. All these results are based on what we call the unique sign property: if there is one value that f takes which uniquely specifies the signs of the parity functions involved, then the function is efficiently learnable. Note that our results cannot be used for learning juntas or other Boolean-valued sparse polynomials, since the unique sign property does not hold in these settings.

We show that this property holds for the complement of the cut function on a hypergraph (number of hyperedges − cut value). This fact can be used to learn the cut complement function and eventually infer the structure of a sparse hypergraph from random cuts. Sparsity implies that the number of hyperedges and the size of each hyperedge are of constant size. Hypergraphs can be used to represent relations in many real world data sets. For example, one can represent the relation between the books and the readers (users) in the Amazon dataset with a hypergraph. Book titles and Amazon users can be mapped to nodes and hyperedges, respectively [8]. Then a node belongs to a hyperedge if the corresponding book is read by the user represented by that hyperedge. When such graphs evolve over time (and space), the difference graph filtered by time and space is often sparse. Locating and learning the few hyperedges from random cuts in such difference graphs constitutes hypergraph sketching.
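To make the unique sign property concrete, here is a toy check (illustrative Python, not code from the paper; `has_unique_sign_property` is a hypothetical helper). For coefficients in general position the maximum of f is attained by exactly one sign pattern of the participating parities, while for a Boolean-valued example such as f = x0 + x1 − x0·x1 three patterns tie at the maximum, so the property fails:

```python
import itertools
import math

def has_unique_sign_property(parities, coeffs, n):
    # parities: list of index sets S_i; coeffs: matching real coefficients c_i.
    # Enumerate x in {-1,1}^n, record which parity sign patterns attain max f(x).
    best, patterns = None, set()
    for x in itertools.product([-1, 1], repeat=n):
        signs = tuple(math.prod(x[i] for i in S) for S in parities)
        val = sum(c * s for c, s in zip(coeffs, signs))
        if best is None or val > best:
            best, patterns = val, {signs}
        elif val == best:
            patterns.add(signs)
    return len(patterns) == 1

# Positive coefficients (property (c)): f = 2 x0 x1 + x2 has a unique max pattern.
assert has_unique_sign_property([{0, 1}, {2}], [2.0, 1.0], 3)
# Boolean-valued f = x0 + x1 - x0 x1 takes values {1, -3}; the maximum 1 is
# attained by three distinct sign patterns, so the property fails.
assert not has_unique_sign_property([{0}, {1}, {0, 1}], [1.0, 1.0, -1.0], 2)
```

This brute-force check is exponential in n and only meant to illustrate the definition on tiny examples.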
We test our algorithms on hypergraphs generated from a dataset that contains the time-stamped record of messages between Yahoo! messenger users, marked with the user locations (zip codes).

1.1 Approach and Related Work

The problem of recovering the sparsest solution of a set of underdetermined linear equations has received significant recent attention in the context of compressed sensing [9–11]. In compressed sensing, one tries to recover an unknown sparse vector using few linear observations (measurements), possibly in the presence of noise.

The recent papers [12, 13] are of particular relevance to us since they establish a connection between learning sparse polynomials and compressed sensing. The authors show that the problem of learning a sparse polynomial is equivalent to recovering the unknown sparse coefficient vector using linear measurements. By applying techniques from compressed sensing theory, namely the Restricted Isometry Property (see [12]) and incoherence (see [13]), the authors independently established results for reconstructing sparse polynomials using convex optimization. The results have near-optimal sample complexity. However, the running time of these algorithms is exponential in the underlying dimension n, because the measurement matrix of the equivalent compressed sensing problem requires one column for every possible non-zero monomial.

In this paper, we show how to solve this problem in time polynomial in n and 2^s under the assumption that the sparse polynomial has the unique sign property. Our key contribution is a novel identification procedure that reduces the list of potentially non-zero coefficients from the naive bound of 2^n to 2^s when the function has this property.

On the theoretical side, there has been interesting recent work [14] that approximately learns sparse polynomial functions when the underlying domain is Gaussian. Their results do not seem to translate to the Boolean domain. We also note the work of [15], which gives an algorithm for learning sparse Boolean functions with respect to a randomly chosen product distribution on {−1, 1}^n. Their work does not apply to the uniform distribution on {−1, 1}^n.

On the practical side, we give an application of the theory to the problem of hypergraph sketching. We generalize prior work [12] that applied the compressed sensing approach discussed above to graph sketching on evolving social network graphs. In our algorithm, while the sample complexity requirements are higher, the time complexity is greatly reduced in comparison. We test our algorithms on a real dataset and show that the algorithm scales well on sparse hypergraphs created from the Yahoo! messenger dataset by filtering through time and location stamps.

2 Definitions

Consider a real-valued function over the Boolean hypercube f : {−1, 1}^n → R. Given a sequence of labeled samples of the form ⟨f(x), x⟩, where x is sampled from the uniform distribution U over the hypercube {−1, 1}^n, we are interested in an efficient algorithm that learns the function f with high probability. Through Fourier expansion, f can be written as a linear combination of monomials:

f(x) = ∑_{S⊆[n]} c_S χ_S(x), ∀ x ∈ {−1, 1}^n,    (1)

where [n] is the set of integers from 1 to n, χ_S(x) = ∏_{i∈S} x_i, and c_S ∈ R. Let c be the vector of coefficients c_S. A monomial χ_S(x) is also called a parity function. More background on Boolean functions and the Fourier expansion can be found in [16].

In this work, we restrict ourselves to sparse polynomials f with sparsity s in the Fourier domain, i.e., f is a linear combination of unknown parity functions χ_{S_1}(x), χ_{S_2}(x), . . . ,
χ_{S_s}(x) with s unknown real coefficients {c_{S_i}}_{i=1}^s such that c_{S_i} ≠ 0 for all 1 ≤ i ≤ s; all other coefficients are 0. Let the subsets corresponding to the s parity functions form a family of sets I = {S_i}_{i=1}^s. Finding I is equivalent to finding the s parity functions.

Note: In certain places, where the context makes it clear, we slightly abuse notation so that the set S_i identifying a specific parity function is replaced by just the index i. The coefficients may be denoted simply by c_i and the parity functions by χ_i(·).

Let F_2 denote the binary field. Every parity function χ_i(·) can be represented by a vector p_i ∈ F_2^{n×1}. The j-th entry p_i(j) of the vector p_i is 1 if j ∈ S_i, and 0 otherwise.

Definition 1. A set of s parity functions {χ_i(·)}_{i=1}^s is said to be linearly independent if the corresponding set of vectors {p_i}_{i=1}^s is linearly independent over F_2. Similarly, they are said to have rank r if the dimension of the subspace spanned by {p_i}_{i=1}^s is r.

Definition 2. The coefficients {c_i}_{i=1}^s are said to be in general position if for all possible sets of values b_i ∈ {0, 1, −1}, 1 ≤ i ≤ s, with at least one nonzero b_i, ∑_{i=1}^s c_i b_i ≠ 0.

Definition 3. The coefficients {c_i}_{i=1}^s are said to be µ-separated if for all possible sets of values b_i ∈ {0, 1, −1}, 1 ≤ i ≤ s, with at least one nonzero b_i, |∑_{i=1}^s c_i b_i| > µ.

Definition 4. A sign pattern is a distinct vector of signs a = [χ_1(·), χ_2(·), . . . , χ_s(·)] ∈ {−1, 1}^{1×s} assumed by the set of s parity functions.

Since this work involves switching representations between the real and the binary field, we define a function q that does the switch.

Definition 5. q : {−1, 1}^{a×b} → F_2^{a×b} is the function that converts a sign matrix X to a matrix Y over F_2 such that Y_ij = q(X_ij) = 1 ∈ F_2 if X_ij = −1, and Y_ij = q(X_ij) = 0 ∈ F_2 if X_ij = 1. Clearly, it has an inverse function q^{−1} such that q^{−1}(Y) = X.

We also present some definitions to deal with the case when the polynomial f is not exactly s-sparse and observations are noisy. Let 2^{[n]} denote the power set of [n].

Definition 6. A polynomial f : {−1, 1}^n → R is called approximately (s, ν)-sparse if there exists I ⊂ 2^{[n]} with |I| = s such that ∑_{S∈I^c} |c_S| < ν, where {c_S} are the Fourier coefficients as in (1). In other words, the sum of the absolute values of all the coefficients except the ones corresponding to I is rather small.

3 Problem Setting

Suppose m labeled samples ⟨f(x_i), x_i⟩_{i=1}^m are drawn from the uniform distribution U on the Boolean hypercube. For any B ⊆ 2^{[n]}, let c_B ∈ R^{2^n×1} be the vector of real coefficients such that c_B(S) = c_S for all S ∈ B and c_B(S) = 0 for all S ∉ B. Let A ∈ R^{m×2^n} be such that every row of A corresponds to one random input sample x ~ U. Let x also denote the row index and S ⊆ [n] the column index of A, with A(x, S) = χ_S(x). Let A_S denote the submatrix formed by the columns corresponding to the subsets in S.
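The representation switch of Definition 5, and the way a parity χ_p acts on a sample through it, can be sketched in a few lines (a toy illustration with hypothetical helper names, not code from the paper):

```python
import numpy as np

def q(X):
    # Definition 5: map a {-1,+1} sign matrix to F_2, with -1 -> 1 and +1 -> 0.
    X = np.asarray(X)
    return ((1 - X) // 2).astype(np.uint8)

def q_inv(Y):
    # Inverse map: 1 -> -1, 0 -> +1.
    return 1 - 2 * np.asarray(Y).astype(int)

def parity(x, p):
    # chi_p(x) = prod_{j : p(j)=1} x_j, computed over F_2 via q.
    return (-1) ** (int(np.dot(q(x), p)) % 2)

x = np.array([-1, 1, -1, 1])
p = np.array([1, 1, 0, 0])      # parity on coordinates {0, 1}
assert parity(x, p) == x[0] * x[1]
assert np.array_equal(q_inv(q(x)), x)
```

The identity parity(x, p) = (−1)^{q(x)·p mod 2} is what lets the algorithm later reason about sign patterns as linear-algebra questions over F_2.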
Let I be the set consisting of the s parity functions of interest in both the sparse and the approximately sparse cases. A sparse representation of an approximately (s, ν)-sparse function f is f_I = A(x) c_I, where c_I is as defined above.

We review the compressed sensing framework used in [12] and [13]. Specifically, for the remainder of the paper, we rely on [13] as the point of reference. We review their framework and explain how we use it to obtain our results, particularly for the noisy case.

Let y ∈ R^m and β_S ∈ R^{2^n}, such that β_S = 0 for all S ∈ S^c. Note that here S is a subset of the power set 2^{[n]}. Now, consider the following convex program for noisy compressed sensing in this setting:

min ‖β_S‖_1 subject to ‖A β_S − y‖_2 ≤ ε.    (2)

Let β_S^opt be an optimum of program (2). Note that only the columns of A in S are used in the program. The convex program runs in time poly(m, |S|). The incoherence property of the matrix A in [13] implies the following.

Theorem 3 ([13]). For any family of subsets I ⊆ 2^{[n]} such that |I| = s, with m = 4096 n s^2 and c_1 = 4, c_2 = 8, for any feasible point β_S of program (2) we have

‖β_S − β_S^opt‖_2 ≤ c_1 ε √(1/m) + c_2 (n/m)^{1/4} ‖β_{I^c ∩ S}‖_1    (3)

with probability at least 1 − O(1/4^n).

When S is set to the power set 2^{[n]}, ε = 0 and y is the vector of observed values of an s-sparse polynomial, the s-sparse vector c_I is a feasible point of program (2). By Theorem 3, the program recovers the sparse vector c_I and hence learns the function.
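With ε = 0, program (2) reduces to equality-constrained ℓ1 minimization (basis pursuit), which reappears as program (5) in Algorithm 1. A minimal sketch using SciPy's linear programming solver and the standard split β = u − v, u, v ≥ 0 (toy sizes far below the m = 4096 n s² of Theorem 3; `basis_pursuit` is a hypothetical helper name):

```python
import numpy as np
from scipy.optimize import linprog

def basis_pursuit(A, y):
    """min ||beta||_1 s.t. A beta = y, via the LP split beta = u - v, u, v >= 0."""
    m, N = A.shape
    c = np.ones(2 * N)                      # objective: sum(u) + sum(v) = ||beta||_1
    A_eq = np.hstack([A, -A])               # equality constraint: A u - A v = y
    res = linprog(c, A_eq=A_eq, b_eq=y, bounds=[(0, None)] * (2 * N))
    u, v = res.x[:N], res.x[N:]
    return u - v

# Toy instance: a 2-sparse coefficient vector over 8 candidate parities,
# measured by 6 random {-1,+1} rows (illustrative sizes only).
rng = np.random.default_rng(0)
A = rng.choice([-1.0, 1.0], size=(6, 8))
beta_true = np.zeros(8)
beta_true[[2, 5]] = [1.5, -0.7]
beta_hat = basis_pursuit(A, A @ beta_true)
```

Any optimum returned this way is feasible (A β̂ = y) and has ℓ1 norm no larger than that of the true coefficient vector; exact recovery at these toy sizes is likely but, unlike in the regime of Theorem 3, not guaranteed.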
The only caveat is that the complexity is exponential in n.

The main idea behind our algorithms for noiseless and noisy sparse function learning is to 'capture' the actual s-sparse set I of interest in a small set S, with |S| = O(2^s), of coefficients by a separate algorithm that runs in time poly(n, 2^s). Using the restricted set of coefficients S, we then search for the sparse solution in both the noisy and noiseless cases using program (2).

Lemma 1. Given an algorithm that runs in time poly(n, 2^s) and generates a set of parities S such that |S| = O(2^s) and I ⊆ S with |I| = s, program (2) with S and m = 4096 n s^2 random samples as inputs runs in time poly(n, 2^s) and learns the correct function with probability 1 − O(1/4^n).

Unique Sign Pattern Property: The key property that lets us find a small S efficiently is the unique sign pattern property. Observe that an s-sparse function can produce at most 2^s different real values. If the maximum value obtained always corresponds to a unique pattern of signs of the parities, then by looking only at the random samples x corresponding to the subsequent O(n) occurrences of this maximum value, we show that all the parity functions needed to learn f are captured in a small set of size 2^{s+1} (see Lemma 2 and its proof). The unique sign property again plays an important role, along with Theorem 3 and some added technicalities, in the noisy case, which we visit in Section 2 of the supplementary material.

In the next section, we provide an algorithm that generates the bounded set S in the noiseless case for an s-sparse function f, and we formally state guarantees for the algorithm.

4 Algorithm and Guarantees: Noiseless case

Let I be the family of s subsets {S_i}_{i=1}^s, each corresponding to one of the s parity functions χ_{S_i}(·) in an s-sparse function f.
In this section, we provide an algorithm, named LearnBool, that first finds a small subset S of the power set 2^{[n]} that contains the elements of I, and then uses program (2) with S. We show that the algorithm learns f in time poly(n, 2^s) from uniformly randomly drawn labeled samples from the Boolean hypercube, with high probability, under some natural conditions.

Recall that if the function f is such that f(x) attains its maximum value only if [χ_1(x), χ_2(x), . . . , χ_s(x)] = a_max ∈ {−1, 1}^s for some unique sign pattern a_max, then the function is said to possess the unique sign property. We now state the main technical lemma for the unique sign property.

Lemma 2. If an s-sparse function f has the unique sign property, then in Algorithm 1 the set S satisfies I ⊆ S and |S| ≤ 2^{s+1} with probability 1 − O(1/n), and the algorithm runs in time poly(n, 2^s).

Proof. See the supplementary material.

The proof of the above lemma involves showing that the random matrix Y_max (see Algorithm 1) has rank at least n − s, leading to at most 2^s solutions for each equation in (4). The feasible solutions can be obtained by Gaussian elimination over the binary field.

Theorem 4. Let f be an s-sparse function that satisfies at least one of the following properties:
(a) The coefficients {c_i}_{i=1}^s are in general position.
(b) The s parity functions are linearly independent.
(c) All the coefficients are positive.
Given labeled samples, Algorithm 1 learns f exactly (i.e., v_opt = c) in time poly(n, 2^s) with probability 1 − O(1/n).

Proof. See the supplementary material.

Smoothed Analysis Setting: Perturbing the c_i's with Gaussian random variables of standard deviation σ > 0, or by random variables drawn from any reasonable continuous distribution, ensures that the perturbed function satisfies property (a) with probability 1.

Random Parity Functions: When the c_i's are arbitrary and the set of s parity functions is drawn uniformly at random from 2^{[n]}, property (b) holds with high probability if s is a constant.

Algorithm 1: LearnBool
Input: Sparsity parameter s, m_1 = 2n 2^s random labeled samples {⟨f(x_i), x_i⟩}_{i=1}^{m_1}.
Pick the samples {x_{i_j}}_{j=1}^{n_max} corresponding to the maximum value of f observed among the m_1 samples.
Stack all x_{i_j} row-wise into a matrix X_max of dimensions n_max × n.
Initialise S = ∅. Let Y_max = q(X_max).
Find all feasible solutions p ∈ F_2^{n×1} such that:
1_{n_max×1} = Y_max p or 0_{n_max×1} = Y_max p.    (4)
Collect all feasible solutions p to either of the above equations in the set P ⊆ F_2^{n×1}. Set S = {{j ∈ [n] : p(j) = 1} | p ∈ P}.
Using m = 4096 n s^2 more samples (the m rows of A correspond to these new samples), solve:
β_S^opt = min ‖β_S‖_1 such that A β_S = y,    (5)
where y is the vector of m observed values.
Set v_opt = β_S^opt.
Output: v_opt.

5 A Sparse Polynomial Learning Application: Hypergraph Sketching

Hypergraphs can be used to model the relations in real world data sets (e.g., books read by users on Amazon). We show that the cut functions of hypergraphs satisfy the unique sign property. Learning a cut function of a sparse hypergraph from random cuts is a special case of learning a sparse polynomial from samples drawn uniformly from the Boolean hypercube.
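The identification step of Algorithm 1 above — keep the samples attaining the observed maximum, convert them with q, and retain every parity vector p with Y_max p all-ones or all-zeros over F_2 — can be sketched as follows. For readability this toy version enumerates all 2^n candidate vectors rather than performing the Gaussian-elimination solve used in the paper, so it is only meant for very small n (`candidate_parities` is a hypothetical helper name):

```python
import itertools
import numpy as np

def q(X):
    # Sign matrix to F_2 (Definition 5): -1 -> 1, +1 -> 0.
    return ((1 - np.asarray(X)) // 2).astype(np.uint8)

def candidate_parities(samples, values):
    """Identification step of LearnBool (brute force over F_2^n; small n only)."""
    X = np.asarray(samples)
    v = np.asarray(values)
    Xmax = X[v == v.max()]          # rows attaining the maximum observed value
    Ymax = q(Xmax)
    n = X.shape[1]
    S = []
    for p in itertools.product([0, 1], repeat=n):
        r = Ymax @ np.array(p) % 2  # Y_max p over F_2
        if r.min() == r.max():      # all-ones or all-zeros right-hand side
            S.append(frozenset(j for j in range(n) if p[j]))
    return S

# f(x) = 2 x0 x1 + x2: positive coefficients, so the unique sign property holds.
rng = np.random.default_rng(1)
X = rng.choice([-1, 1], size=(200, 4))
y = 2 * X[:, 0] * X[:, 1] + X[:, 2]
S = candidate_parities(X, y)
# The true parities {0,1} and {2} both appear among the candidates.
assert frozenset({0, 1}) in S and frozenset({2}) in S
```

The returned S is only a superset of the true support (it may contain spurious sets, including the empty set); the ℓ1 program (5) over these candidates then zeroes out everything outside I.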
To track the evolution of large hypergraphs over a small time interval, it is enough to learn the cut function of the difference graph, which is often sparse. This is called the graph sketching problem. Previously, graph sketching was applied to social network evolution [12]. We generalize this to hypergraphs, showing that they satisfy the unique sign property, which enables faster algorithms, and provide experimental results on real data sets.

5.1 Graph Sketching

A hypergraph G = (V, E) is a set of vertices V along with a set E of subsets of V called the hyperedges. The size of a hyperedge is the number of vertices that the hyperedge connects. Let d be the maximum hyperedge size of graph G. Let |V| = n and |E| = s.

A random cut S ⊆ V is a set of vertices selected uniformly at random. Define the value of the cut S to be c(S) = |{e ∈ E : e ∩ S ≠ ∅, e ∩ (V − S) ≠ ∅}|. Graph sketching is the problem of identifying the graph structure from random queries that evaluate the value of a random cut, where s ≪ n (the sparse setting). Hypergraphs naturally specify relations among a set of objects through hyperedges. For example, Amazon users can form the set E and Amazon books can form the set V. Each user may read a subset of books, which represents the hyperedge. Learning the hypergraph corresponds to identifying the sets of books read by each user. For more examples of hypergraphs in real data sets, we refer the reader to [8]. Such hypergraphs evolve over time. The difference graph between two consecutive time instants is expected to be sparse (the number of edges s and the maximum hyperedge size d are small). We are interested in learning such hypergraphs from random cut queries.

For simplicity and convenience, we consider the cut complement query, i.e., c−cut, which returns s − c(S). One can easily represent the c−cut query with a sparse polynomial as follows. Let node i correspond to variable x_i ∈ {−1, +1}. A random cut involves choosing x_i uniformly at random from {−1, +1}; the variables assigned to +1 belong to the random cut S. The value is given by the polynomial

f_{c−cut}(x) = ∑_{I∈E} ( ∏_{i∈I} (1 + x_i)/2 + ∏_{i∈I} (1 − x_i)/2 ) = ∑_{I∈E} (1/2^{|I|−1}) ∑_{J⊆I, |J| even} ∏_{i∈J} x_i.    (6)

Hence, the c−cut function is a sparse polynomial where the sparsity is at most s 2^{d−1}. The variables corresponding to the nodes that belong to some hyperedge appear in the polynomial. We call these the relevant variables, and the number of relevant variables is denoted by k. Note that in our sparse setting k ≤ sd. We note that for a hypergraph with no singleton hyperedge, given the c−cut function, it is easy to recover the hyperedges from (6). Therefore, we focus on learning the c−cut function to sketch the hypergraph.

When G is a graph with edges (of cardinality 2), the compressed sensing approach (using program (2)) with the cut (or c−cut) values as measurements is shown to be very efficient in [12] in terms of the sample complexity, i.e., the required number of queries. The run time is efficient because the total number of candidate parities is O(n^2). However, when we consider hypergraphs, i.e., when d is a large constant, the compressed sensing approach cannot scale computationally (poly(n^d) runtime). Here, based on the theory developed, we give a faster algorithm based on the unique sign property with sample complexity m_1 = O(2^k d log n + 2^{2d+1} s^2 (log n + k)) and run time O(2^k m_1 + n^2 d log n).

We observe that the c−cut polynomial satisfies the unique sign property.
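The identity in (6) says that the c−cut value counts the hyperedges left uncut, i.e., those whose vertices all receive the same sign. A quick exhaustive check of this on a toy hypergraph (illustrative Python, with hypothetical helper names):

```python
import itertools
import numpy as np

def cut_value(edges, x):
    # c(S): hyperedges with vertices on both sides of the cut S = {i : x_i = +1}.
    return sum(1 for e in edges if len({x[i] for i in e}) == 2)

def ccut_value(edges, x):
    # s - c(S): hyperedges whose vertices all lie on the same side.
    return sum(1 for e in edges if len({x[i] for i in e}) == 1)

def ccut_poly(edges, x):
    # Right-hand side of (6) before expansion: sum over edges of
    # prod (1+x_i)/2 + prod (1-x_i)/2, which is 1 iff the edge is uncut.
    return sum(
        np.prod([(1 + x[i]) / 2 for i in e]) + np.prod([(1 - x[i]) / 2 for i in e])
        for e in edges
    )

# A small hypergraph with s = 3 hyperedges of size <= d = 3 (toy example).
edges = [{0, 1, 2}, {1, 3}, {2, 3, 4}]
for x in itertools.product([-1, 1], repeat=5):
    assert ccut_poly(edges, x) == ccut_value(edges, x) == len(edges) - cut_value(edges, x)
# The maximum value s is attained, e.g., when every relevant variable is +1.
assert ccut_poly(edges, (1, 1, 1, 1, 1)) == len(edges)
```

Since each per-edge term is a sum of products of non-negative factors, the positivity of the coefficients claimed after (6) is visible directly from the left-hand form.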
From (6), it is evident that the polynomial has only positive coefficients. Therefore, by Theorem 4, algorithm LearnBool succeeds. The maximum value of the c−cut function is the number of edges. Notice that the maximum value is definitely observed in two configurations of the relevant variables: either all relevant variables are +1 or all are −1. Therefore, the maximum value is observed on average once in every 2^{k−1} ≤ 2^{sd} samples. Thus, a direct application of LearnBool yields poly(n, 2^{k−1}) time complexity, which improves on the O(n^d) bound for small s and d.

Improving further, we provide a more efficient algorithm tailored to the hypergraph sketching problem, which makes use of the unique sign property and some other properties of the cut function. Algorithm LearnGraph (Algorithm 4) is provided in the supplementary material.

Figure 1: Performance figures comparing LearnGraph and the compressed sensing approach. (a) Runtime vs. number of variables, d = 3 and s = 1. (b) Probability of error vs. α.

Theorem 5. Algorithm 4 exactly learns the c−cut function with probability 1 − O(1/n), with sample complexity m_1 = O(2^k d log n + 2^{2d+1} s^2 (log n + k)) and time complexity O(2^k m_1 + n^2 d log n).

Proof. See the supplementary material.

5.2 Yahoo! Messenger User Communication Pattern Dataset

We performed simulations using MATLAB on an Intel(R) Xeon(R) quad-core 3.6 GHz machine with 16 GB RAM and 10M cache. We run our algorithm on the Yahoo! Messenger User Communication Pattern Dataset [17]. This dataset contains timestamped user communication data, i.e., information about a large number of messages sent over Yahoo! Messenger, for a duration of 28 days.

Dataset: Each row represents a message. The first two columns show the day and time (time stamp) of the message, respectively. The third and fifth columns show the IDs of the transmitting and receiving users, respectively.
The fourth column shows the zipcode (spatial stamp) from which this particular message is transmitted. The sixth column shows whether the transmitter was in the contact list of the receiving user (y) or not (n). If a transmitter sends the same receiver more than one message from the same zipcode, only the first message is shown in the dataset. In total, there are 100000 unique users and 5649 unique zipcodes.

We form a hypergraph from the dataset as follows: the transmitting users form the hyperedges and the receiving users form the nodes of the hypergraph. A hyperedge connects a set T of users if there is a transmitting user that sends a message to all the users in T. In any given short time interval δt and small set of locations δx, specified by the number of zip codes, there are few users who transmit (s) and they transmit to very few users (d). The complete set of nodes in the hypergraph (n) is taken to be those receiving users who are active during m consecutive intervals of length δt and in a set of δx zipcodes. This gives rise to a sparse graph. We identify the active set of transmitting users (hyperedges) and their corresponding receivers (nodes in these hyperedges) during a short time interval δt and a randomly selected space interval (δx, i.e., zip codes) from a large pool of receivers (nodes) observed during m intervals of length δt. Details of the δt, m and δx chosen for the experiments are given in Table 1. We note that n is usually on the order of 1000.

Remark: Our task is to learn the c−cut function from the random queries, i.e., random ±1 assignments of variables and the corresponding c−cut values. The generated sparse graph contains only hyperedges that have more than one node. Other hyperedges (transmitting users) with just one node in the sparse hypergraph are not taken into account.
This is because a singleton hyperedge i is always counted in the c−cut function, so its presence is effectively masked. First, we identify the relevant variables that participate in the sparse graph. After identifying this set of candidates, correlating the corresponding candidate parities with the function output yields the Fourier coefficient of each parity (see Algorithm 4).

Table 1: Runtime for different graphs. LG: LearnGraph, CS: compressed sensing based algorithm.

(a) Runtime (seconds) for a d = 4, s = 1 graph.
n  | 88     | 159  | 288  | 556  | 1221
LG | 1.96   | 2.13 | 2.23 | 2.79 | 4.94
CS | 265.63 | -    | -    | -    | -

(b) Runtime (seconds) for a d = 4, s = 3 graph.
n  | 52    | 104     | 246  | 412  | 1399
LG | 1.91  | 2.08    | 2.08 | 2.30 | 4.98
CS | 39.89 | > 10823 | -    | -    | -

(c) Simulation parameters for Fig. 1b.
Setting No. | Interval | # of Int. | n    | max(d) | max(s) | Zip. Set Size
Setting 1   | 5 min.   | 20        | 6822 | 10     | 19     | 20
Setting 2   | 20 sec.  | 200       | 5730 | 22     | 4      | 200
Setting 3   | 10 min.  | 10        | 6822 | 11     | 13     | 2
Setting 4   | 2 min.   | 50        | 6822 | 30     | 21     | 50

5.2.1 Performance Comparison with Compressed Sensing Approach

First, we compare the runtime of our implementation LearnGraph with the compressed sensing based algorithm from [12]. Both algorithms correctly identify the relevant variables over the entire considered range of parameters. The last step of finding the corresponding Fourier coefficients is omitted; it can be easily implemented (Algorithm 4) without significantly affecting the running time. As can be seen in Tables 1a and 1b and Fig. 1a, LearnGraph scales well to graphs on thousands of nodes. On the contrary, the compressed sensing approach must handle a measurement matrix of size O(n^d), which becomes prohibitively large on graphs involving more than a few hundred nodes.

5.2.2 Error Performance of LearnGraph

Error probability (the probability that the correct c−cut function is not recovered) versus the number of samples used is plotted for four different experimental settings of δt, δx and m in Fig. 1b. For each time interval, the error probability is calculated by averaging the number of errors over 100 different trials. For each value of α (number of samples), the error probability is averaged over time intervals to illustrate the error performance. We only keep the intervals for which the graph filtered with the considered zipcodes contains at least one user with more than one neighbor. We find that for the first three settings, the error probability decreases with more samples. For the fourth setting, d and s are very large and hence a large number of samples is required; for that reason, the error probability does not improve significantly. The probability of error can be reduced by repeating the experiment multiple times and taking a majority, at the cost of significantly more samples. Our plot shows only the probability of error without such majority amplification.

6 Conclusions

We presented a novel algorithm for learning sparse polynomials from random samples on the Boolean hypercube. While the general problem of learning all sparse polynomials is notoriously hard, we show that almost all sparse polynomials can be efficiently learned using our algorithm. This is because our unique sign property holds for randomly perturbed coefficients, in addition to several other natural settings.
As an application, we show that graph and hypergraph sketching lead to sparse polynomial learning problems that always satisfy the unique sign property. This allows us to obtain efficient reconstruction algorithms that outperform the previous state of the art for these problems. An important open problem is to achieve the sample complexity of [12] while keeping the computational complexity polynomial in n.

Acknowledgments

M.K., K.S. and A.D. acknowledge the support of NSF via CCF 1422549, 1344364 and 1344179, a DARPA STTR, and an ARO YIP award.

References

[1] E. Kushilevitz and Y. Mansour, “Learning decision trees using the Fourier spectrum,” SIAM J. Comput., vol. 22, no. 6, pp. 1331–1348, 1993.

[2] Y. Mansour, “Randomized interpolation and approximation of sparse polynomials,” SIAM J. Comput., vol. 24, no. 2, pp. 357–368, 1995.

[3] R. Schapire and R. Sellie, “Learning sparse multivariate polynomials over a field with queries and counterexamples,” Journal of Computer and System Sciences, vol. 52, 1996.

[4] A. C. Gilbert, S. Guha, P. Indyk, S. Muthukrishnan, and M. Strauss, “Near-optimal sparse Fourier representations via sampling,” in Proceedings of STOC, 2002, pp. 152–161.

[5] P. Gopalan, A. Kalai, and A. Klivans, “Agnostically learning decision trees,” in Proceedings of STOC, 2008, pp. 527–536.

[6] A. Akavia, “Deterministic sparse Fourier approximation via fooling arithmetic progressions,” in Proceedings of COLT, 2010, pp. 381–393.

[7] D. Spielman and S. Teng, “Smoothed analysis of algorithms: Why the simplex algorithm usually takes polynomial time,” Journal of the ACM, vol. 51, 2004.

[8] P. Li, “Relational learning with hypergraphs,” Ph.D.
dissertation, École Polytechnique Fédérale de Lausanne, 2013.

[9] E. J. Candès, J. Romberg, and T. Tao, “Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information,” IEEE Transactions on Information Theory, vol. 52, no. 2, pp. 489–509, 2006.

[10] E. J. Candès and T. Tao, “Decoding by linear programming,” IEEE Transactions on Information Theory, vol. 51, no. 12, pp. 4203–4215, 2005.

[11] D. L. Donoho, “Compressed sensing,” IEEE Transactions on Information Theory, vol. 52, no. 4, pp. 1289–1306, 2006.

[12] P. Stobbe and A. Krause, “Learning Fourier sparse set functions,” in Proceedings of the International Conference on Artificial Intelligence and Statistics, 2012, pp. 1125–1133.

[13] S. Negahban and D. Shah, “Learning sparse Boolean polynomials,” in Proceedings of the 50th Annual Allerton Conference on Communication, Control, and Computing, 2012, pp. 2032–2036.

[14] A. Andoni, R. Panigrahy, G. Valiant, and L. Zhang, “Learning sparse polynomial functions,” in Proceedings of SODA, 2014.

[15] A. T. Kalai, A. Samorodnitsky, and S.-H. Teng, “Learning and smoothed analysis,” in Proceedings of FOCS, 2009, pp. 395–404.

[16] R. O’Donnell, Analysis of Boolean Functions. Cambridge University Press, 2014.

[17] Yahoo, “Yahoo!
webscope dataset ydata-ymessenger-user-communication-pattern-v1_0,” http://research.yahoo.com/Academic_Relations.