{"title": "An Active Learning Framework using Sparse-Graph Codes for Sparse Polynomials and Graph Sketching", "book": "Advances in Neural Information Processing Systems", "page_first": 2170, "page_last": 2178, "abstract": "Let $f: \\{-1,1\\}^n \\rightarrow \\mathbb{R}$ be an $n$-variate polynomial consisting of $2^n$ monomials, in which only $s\\ll 2^n$ coefficients are non-zero. The goal is to learn the polynomial by querying the values of $f$. We introduce an active learning framework that is associated with a low query cost and computational runtime. The significant savings are enabled by leveraging sampling strategies based on modern coding theory, specifically, the design and analysis of {\\it sparse-graph codes}, such as Low-Density-Parity-Check (LDPC) codes, which represent the state-of-the-art of modern packet communications. More significantly, we show how this design perspective leads to exciting, and to the best of our knowledge, largely unexplored intellectual connections between learning and coding. The key is to relax the worst-case assumption with an ensemble-average setting, where the polynomial is assumed to be drawn uniformly at random from the ensemble of all polynomials (of a given size $n$ and sparsity $s$). Our framework succeeds with high probability with respect to the polynomial ensemble with sparsity up to $s={O}(2^{\\delta n})$ for any $\\delta\\in(0,1)$, where $f$ is exactly learned using ${O}(ns)$ queries in time ${O}(n s \\log s)$, even if the queries are perturbed by Gaussian noise. We further apply the proposed framework to graph sketching, which is the problem of inferring sparse graphs by querying graph cuts. By writing the cut function as a polynomial and exploiting the graph structure, we propose a sketching algorithm to learn the an arbitrary $n$-node unknown graph using only few cut queries, which scales {\\it almost linearly} in the number of edges and {\\it sub-linearly} in the graph size $n$. 
Experiments on real datasets show significant reductions in the runtime and query complexity compared with competitive schemes.", "full_text": "An Active Learning Framework using Sparse-Graph Codes for Sparse Polynomials and Graph Sketching

Xiao Li
UC Berkeley
xiaoli@berkeley.edu

Kannan Ramchandran∗
UC Berkeley
kannanr@berkeley.edu

Abstract

Let f : {−1, 1}^n → R be an n-variate polynomial consisting of 2^n monomials, in which only s ≪ 2^n coefficients are non-zero. The goal is to learn the polynomial by querying the values of f. We introduce an active learning framework that is associated with a low query cost and computational runtime. The significant savings are enabled by leveraging sampling strategies based on modern coding theory, specifically, the design and analysis of sparse-graph codes, such as Low-Density-Parity-Check (LDPC) codes, which represent the state-of-the-art of modern packet communications. More significantly, we show how this design perspective leads to exciting, and to the best of our knowledge, largely unexplored intellectual connections between learning and coding.

The key is to relax the worst-case assumption with an ensemble-average setting, where the polynomial is assumed to be drawn uniformly at random from the ensemble of all polynomials (of a given size n and sparsity s). Our framework succeeds with high probability with respect to the polynomial ensemble with sparsity up to s = O(2^{δn}) for any δ ∈ (0, 1), where f is exactly learned using O(ns) queries in time O(ns log s), even if the queries are perturbed by Gaussian noise. We further apply the proposed framework to graph sketching, which is the problem of inferring sparse graphs by querying graph cuts. 
By writing the cut function as a polynomial and exploiting the graph structure, we propose a sketching algorithm to learn an arbitrary n-node unknown graph using only a few cut queries, which scales almost linearly in the number of edges and sub-linearly in the graph size n. Experiments on real datasets show significant reductions in the runtime and query complexity compared with competitive schemes.

1 Introduction

One of the central problems in computational learning theory is the efficient learning of polynomials f(x) : x ∈ {−1, 1}^n → R. The task of learning an s-sparse polynomial f has been studied extensively in the literature, often in the context of Fourier analysis for pseudo-boolean functions (real-valued functions defined on a set of binary variables). Many concept classes, such as ω(1)-juntas, polynomial-sized circuits, decision trees and disjunctive normal form (DNF) formulas, have been proven very difficult [1] to learn in the worst case with random examples. Almost all existing efficient algorithms are based on the membership query model [1, 6–8, 10, 11, 17], which provides arbitrary access to the value of f(x) given any x ∈ {−1, 1}^n. This makes a richer set of concept classes learnable in polynomial time poly(s, n). This is a form of what is now popularly referred to as active learning, which makes queries using different sampling strategies. For instance, [3, 10] use regular subsampling and [9, 14, 18] use random sampling based on compressed sensing. 
However, they remain difficult to scale computationally, especially for large s and n.

∗This work was supported by grant NSF CCF EAGER 1439725.

In this paper, we are interested in learning polynomials with s = O(2^{δn}) for some δ ∈ (0, 1). Although this regime is not typically considered in the literature, we show that by relaxing the “worst-case” mindset to an ensemble-average setting (explained later), we can handle this more challenging regime and reduce both the number of queries and the runtime complexity, even if the queries are corrupted by Gaussian noise. In the spirit of active learning, we design a sampling strategy that makes queries to f based on modern coding theory and signal processing. The queries are formed by “strategically” subsampling the input to induce aliasing patterns in the dual domain based on sparse-graph codes. Then, our framework exploits the aliasing pattern (code structure) to reconstruct f by peeling off the sparse coefficients with a simple iterative peeling decoder. Through a coding-theoretic lens, our algorithm achieves a low query complexity (capacity-approaching codes) and low computational complexity (peeling decoding).

Further, we apply our proposed framework to graph sketching, which is the problem of inferring hidden sparse graphs with n nodes by actively querying graph cuts (see Fig. 1). Motivated by bioinformatics applications [2], learning hidden graphs from additive or cross-additive queries (i.e. edge counts within a set or across two sets) has gained considerable interest. This problem closely pertains to our learning framework because the cut function of any graph can be written as a sparse polynomial with respect to the binary variables x ∈ {−1, +1}^n indicating a graph partition for the cut [18]. 
Given query access to the cut value for an arbitrary partition of the graph, how many cut queries are needed to infer the hidden graph structure? What is the runtime for such inference?

Figure 1: Given a set of n nodes, infer the graph structure by querying graph cuts. (a) Unknown Graph; (b) Cut Query; (c) Inferred Graph.

Most existing algorithms that achieve the optimal query cost for graph sketching (see [13]) are non-constructive, except for a few algorithms [4, 5, 9, 18] that run in polynomial time in the graph size n. Inspired by our active learning framework, we derive a sketching algorithm associated with a query cost and runtime that are both sub-linear in the graph size n and almost-linear in the number of edges. To the best of our knowledge, this is the first constructive non-adaptive sketching scheme with sub-linear costs in the graph size n. In the following, we introduce the problem setup, our learning model, and summarize our contributions.

1.1 Problem Setup

Our goal is to learn the following polynomial in terms of its coefficients:

f(x) = \sum_{k ∈ F_2^n} α[k] χ_k(x),  ∀ x ∈ {−1, 1}^n,  F_2 := {0, 1},    (1)

where k := [k[1], · · · , k[n]]^T ∈ F_2^n is the index of the monomial¹ χ_k(x) = ∏_{i ∈ [n]} x_i^{k[i]}, and α[k] ∈ R is the coefficient. In this work, we consider an ensemble-average setting for learning.

Definition 1 (Polynomial Ensemble). 
The polynomial ensemble F(s, n, A) is a collection of polynomials f : {−1, 1}^n → R satisfying the following conditions:

• the vector α := [· · · , α[k], · · · ]^T is s-sparse with s = O(2^{δn}) for some 0 < δ < 1;
• the support supp(α) := {k : α[k] ≠ 0, k ∈ F_2^n} is chosen uniformly at random over F_2^n;
• each non-zero coefficient α[k] takes values from some set A according to α[k] ∼ P_A for all k ∈ supp(α), where P_A is some probability distribution over A.

¹The notation is defined as [n] := {1, · · · , n}.

We consider active learning under the membership query model. Each query to f at x ∈ {−1, 1}^n returns the data-label pair (x, f(x) + ε), where ε is some additive noise. We propose a query framework that leads to a fast reconstruction algorithm, which outputs an estimate α̂ of the polynomial coefficients. The performance of our framework is evaluated by the probability of failing to recover the exact coefficients, P_F := Pr(α̂ ≠ α) = E[1_{α̂ ≠ α}], where 1(·) is the indicator function and the expectation is taken with respect to the noise ε, the randomized construction of our queries, as well as the random polynomial ensemble F(s, n, A).

1.2 Our Approach and Contributions

Particularly relevant to this work are the algorithms on learning decision trees and boolean functions by uncovering the Fourier spectrum of f [3, 5, 10, 12]. Recent papers further show that this problem can be formulated and solved as a compressed sensing problem using random queries [14, 18]. Specifically, [14] gives an algorithm using O(s^2 n) queries based on mutual coherence, whereas the Restricted Isometry Property (RIP) is used in [18] to give a query complexity of O(sn^4). 
However, this formulation needs to estimate a length-2^n vector and hence the complexity is exponential in n. To alleviate the computational burden, [9] proposes a pre-processing scheme to reduce the number of unknowns to 2^s, which shortens the runtime to poly(2^s, n) using O(n2^s) samples. However, this method only works with very small s due to the exponential scaling. Under the sparsity regime s = O(2^{δn}) for some 0 < δ < 1, existing algorithms [3, 9, 10, 14, 18], irrespective of using membership queries or random examples, do not immediately apply here because they may require 2^n samples (and a large runtime) due to the obscured polynomial scaling in s.

In our framework, we show that f can be learned exactly in time almost-linear in s and strictly-linear in n, even when the queries are perturbed by random Gaussian noise.

Theorem 1 (Noisy Learning). Let f ∈ F(s, n, A), where A is some arbitrarily large but finite set. In the presence of noise ε ∼ N(0, σ²), our algorithm learns f exactly in terms of the coefficients, α̂ = α; it runs in time O(ns log s) using O(ns) queries with probability at least 1 − O(1/s).

The proposed algorithm and proofs are given in the supplementary material. Further, we apply this framework to learning hidden graphs from cut queries. We consider an undirected weighted graph G = (V, E, W) with |E| = r edges and weights W ∈ R^r, where V = {1, · · · , n} is given but the edge set E ⊆ V × V is unknown. This generalizes to hypergraphs, where an edge can connect at most d nodes, called the rank of the graph. For a d-rank hypergraph with r edges, the cut function is an s-sparse d-bounded pseudo-boolean function (i.e. 
each monomial depending on at most d variables), where the sparsity is bounded by s = O(r2^{d−1}) [9].

On the graph sketching problem, [18] uses O(sn^4) random queries to sketch the sparse temporal changes of a hypergraph in polynomial time poly(n^d). However, [9] shows that it becomes computationally infeasible for small graphs (e.g. n = 200 nodes, r = 3 edges with d = 4), while the LearnGraph algorithm [9] runs in time O(2^{rd}M + n^2 d log n) using M = O(2^{rd} d log n + 2^{2d+1} d^2 (log n + rd)) queries. Although this significantly reduces the runtime compared to [14, 18], the algorithm only tackles very sparse graphs due to the scaling in 2^r and n^2. This implies that the sketching needs to be done on relatively small graphs (i.e. n = 1000 nodes) over fine sketching intervals (i.e. minutes) to suppress the sparsity (i.e. r = 10 within the sketching interval). In this work, we adapt and apply our learning framework to derive an efficient sketching algorithm, whose runtime scales as O(ds log s(log n + log s)) using O(ds(log n + log s)) queries. We use our adapted algorithm on real datasets and find that we can handle much coarser sketching intervals (e.g. half an hour) and much larger hypergraphs (e.g. n = 10^5 nodes).

2 Learning Framework

Our learning framework consists of a query generator and a reconstruction engine. Given the sparsity s and the number of variables n, the query generator strategically constructs queries (randomly) and the reconstruction engine recovers the s-sparse vector α. For notational convenience, we replace each boolean variable x_i = (−1)^{m[i]} with a binary variable m[i] ∈ F_2 for all i ∈ [n]. Using the notation m = [m[1], · · · , m[n]]^T in the Fourier expansion (1), we have

u[m] = \sum_{k ∈ F_2^n} α[k] (−1)^{⟨m, k⟩} + ε[m],    (2)

where ⟨m, k⟩ = ⊕_{i ∈ [n]} m[i]k[i] over F_2. 
Now the coefficients α[k] can be interpreted as the Walsh-Hadamard Transform (WHT) coefficients of the polynomial f(x) for x ∈ {−1, 1}^n.

2.1 Membership Query: A Coding-Theoretic Design

The building block of our query generator is the basic query set obtained by subsampling and tiny WHTs:

• Subsampling: we choose B = 2^b samples u[m] indexed selectively by m = Mℓ + d for ℓ ∈ F_2^b, where M ∈ F_2^{n×b} is the subsampling matrix and d ∈ F_2^n is the subsampling offset.
• WHT: a very small B-point WHT is performed over the samples u[Mℓ + d] for ℓ ∈ F_2^b, where each output coefficient can be obtained according to the aliasing property of the WHT:

U[j] = \sum_{k : M^T k = j} α[k] (−1)^{⟨d, k⟩} + W[j],  j ∈ F_2^b,    (3)

where W[j] = (1/√B) \sum_{ℓ ∈ F_2^b} ε[Mℓ + d] (−1)^{⟨ℓ, j⟩} is the observation noise with variance σ².

The B-point basic query set (3) implies that each coefficient U[j] is the weighted hash output of the α[k]'s under the hash function M^T k = j. From a coding-theoretic perspective, the coefficient U[j] constitutes a parity constraint on the coefficients α[k], where α[k] enters the j-th parity if M^T k = j. If we can induce a set of parity constraints that mimic good error-correcting codes with respect to the unknown coefficients α[k], the coefficients can be recovered iteratively in the spirit of peeling decoding, similar to that in LDPC codes. 
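As a sanity check on the aliasing property (3), the subsampled WHT can be simulated directly. The sketch below (our own minimal Python illustration, not the authors' implementation) builds the noiseless bins for the s = 4 example of Section 2.2, using M1 that hashes each index k by its last two bits:

```python
import itertools
import numpy as np

def u(m, alpha):
    """Noiseless query u[m] = sum_k alpha[k] * (-1)^<m,k>, as in Eq. (2)."""
    return sum(a * (-1) ** (int(np.dot(m, k)) % 2) for k, a in alpha.items())

def basic_query_set(alpha, M, d):
    """B-point WHT of the subsampled queries u[M l + d] over F_2, Eq. (3).

    Returns a dict j -> U[j]; by the aliasing property, U[j] equals the sum
    of alpha[k] * (-1)^<d,k> over all k hashing to j, i.e. with M^T k = j.
    """
    n, b = M.shape
    B = 2 ** b
    U = {}
    for j in itertools.product((0, 1), repeat=b):
        acc = 0.0
        for l in itertools.product((0, 1), repeat=b):
            m = (M @ np.array(l) + d) % 2
            acc += u(m, alpha) * (-1) ** (int(np.dot(l, j)) % 2)
        U[j] = acc / B
    return U

# s = 4 non-zero coefficients out of 2^4 (values here are arbitrary picks)
alpha = {(0, 1, 0, 0): 1.2, (0, 1, 1, 0): -0.7,
         (1, 0, 1, 0): 0.5, (1, 1, 1, 1): 2.0}
M1 = np.array([[0, 0], [0, 0], [1, 0], [0, 1]])  # M1^T k = (k[3], k[4])
U1 = basic_query_set(alpha, M1, d=np.zeros(4, dtype=int))
# U1[(0,0)] = alpha[0100] = 1.2;  U1[(1,0)] = alpha[0110] + alpha[1010] = -0.2
```

The resulting four bins match the reduced observations U1[00], . . . , U1[11] worked out in the example below.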
Now it boils down to the following questions:

• How to choose the subsampling matrix M, and how to choose the query set size B?
• How to recover the coefficients α[k] from their aliased observations U[j]?

In the following, we illustrate the principle of our learning framework through a simple example with n = 4 boolean variables and sparsity s = 4.

2.2 Main Idea: A Simple Example

Suppose that the s = 4 non-zero coefficients are α[0100], α[0110], α[1010] and α[1111]. We choose B = s = 4 and use two patterns M1 = [0_{2×2}^T, I_{2×2}^T]^T and M2 = [I_{2×2}^T, 0_{2×2}^T]^T for subsampling, where all queries made using the same pattern M_i are called a query group.

In this example, by enforcing a zero subsampling offset d = 0, we generate only one set of queries {U_c[j]}_{j ∈ F_2^b} under each pattern M_c according to (3). For example, under pattern M1, the chosen samples are u[0000], u[0001], u[0010], u[0011]. Then, the observations are obtained as the B-point WHT coefficients of these chosen samples. For illustration we assume the queries are noiseless:

U1[00] = α[0000] + α[0100] + α[1000] + α[1100],
U1[01] = α[0001] + α[0101] + α[1001] + α[1101],
U1[10] = α[0010] + α[0110] + α[1010] + α[1110],
U1[11] = α[0011] + α[0111] + α[1011] + α[1111].

Generally speaking, it is impossible to reconstruct the coefficients from these queries. However, since the coefficients are sparse, the observations reduce to

U1[00] = α[0100],                 U2[00] = 0,
U1[01] = 0,                       U2[01] = α[0100] + α[0110],
U1[10] = α[0110] + α[1010],       U2[10] = α[1010],
U1[11] = α[1111],                 U2[11] = α[1111].

The observations are captured by a bipartite graph, which consists of s = 4 left nodes and 8 right nodes (see Fig. 
2).

Figure 2: Example of a bipartite graph for the observations.

2.2.1 Oracle-based Decoding

We illustrate how to decode the unknown α[k] from the bipartite graph in Fig. 2 with the help of an "oracle", and then introduce how to get rid of this oracle. The right nodes can be categorized as:

• Zero-ton: a right node is a zero-ton if it is not connected to any left node.
• Single-ton: a right node is a single-ton if it is connected to only one left node. We refer to the index k and its associated value α[k] as the index-value pair (k, α[k]).
• Multi-ton: a right node is a multi-ton if it is connected to more than one left node.

The oracle informs the decoder exactly which right nodes are single-tons, as well as the corresponding index-value pair (k, α[k]). Then, we can learn the coefficients iteratively as follows:

Step (1) select all edges in the bipartite graph with right degree 1 (i.e. detect the presence of single-tons and the index-value pairs informed by the oracle);
Step (2) remove (peel off) these edges as well as the left and right end nodes of these single-ton edges;
Step (3) remove (peel off) the other edges connected to the left nodes removed in Step (2);
Step (4) remove the contributions of the left nodes removed in Step (3) from the remaining right nodes.

Finally, decoding is successful if all edges are removed. Clearly, this simple example is only an illustration. 
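Steps (1)-(4) can be sketched compactly in code. The following Python illustration (ours, under the simplifying assumption of noiseless bins; the ground-truth edge structure stands in for the oracle) peels the running example to completion:

```python
import numpy as np

def oracle_peeling(alpha, Ms):
    """Oracle-based peeling decoder over the hashed bins, steps (1)-(4).

    alpha: dict mapping index tuples k in F_2^n to non-zero coefficients
           (ground truth; used to build the bins and to play the oracle).
    Ms:    list of n x b subsampling matrices M_c over F_2; coefficient k
           feeds bin (c, M_c^T k mod 2) in every group c.
    Returns the recovered index-value pairs.
    """
    # Build the bipartite graph: each bin keeps its connected left nodes.
    edges = {}
    for c, M in enumerate(Ms):
        for k, a in alpha.items():
            j = tuple((M.T @ np.array(k)) % 2)
            edges.setdefault((c, j), {})[k] = a
    recovered = {}
    progress = True
    while progress:
        progress = False
        for members in edges.values():
            if len(members) == 1:               # step (1): a single-ton
                (k, a), = members.items()       # oracle's pair (k, alpha[k])
                recovered[k] = a
                for c, M in enumerate(Ms):      # steps (2)-(4): peel k from
                    j = tuple((M.T @ np.array(k)) % 2)   # every bin it hits
                    edges[(c, j)].pop(k, None)
                progress = True
    return recovered

alpha = {(0, 1, 0, 0): 1.2, (0, 1, 1, 0): -0.7,
         (1, 0, 1, 0): 0.5, (1, 1, 1, 1): 2.0}
M1 = np.array([[0, 0], [0, 0], [1, 0], [0, 1]])   # hash by last two bits
M2 = np.array([[1, 0], [0, 1], [0, 0], [0, 0]])   # hash by first two bits
recovered = oracle_peeling(alpha, [M1, M2])        # recovers all of alpha
```

Peeling the single-ton U1[00] = α[0100] turns the multi-ton U2[01] into a fresh single-ton, and so on until every edge is removed, exactly the cascade described above.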
In general, if there are C query groups associated with the subsampling patterns {M_c}_{c=1}^C and query set size B, we define the bipartite graph ensemble below and derive the guidelines for choosing them to guarantee successful peeling-based recovery.

Definition 2 (Sparse Graph Ensemble). The bipartite graph ensemble G(s, η, C, {M_c}_{c ∈ [C]}) is a collection of C-regular bipartite graphs where

• there are s left nodes, each associated with a distinct non-zero coefficient α[k];
• there are C groups of right nodes and B = 2^b = ηs right nodes per group, and each right node is characterized by the observation U_c[j] indexed by j ∈ F_2^b in each group;
• there exists an edge between left node α[k] and right node U_c[j] in group c if M_c^T k = j, and thus each left node has a regular degree C.

Using the construction of {M_c}_{c=1}^C given in the supplemental material, the decoding is successful over the ensemble G(s, η, C, {M_c}_{c ∈ [C]}) if C and B are chosen appropriately. The key idea is to avoid excessive aliasing by exploiting a sufficiently large but finite number of groups C for diversity, while maintaining the query set size B on par with the sparsity O(s).

Lemma 1. If we construct our query generator using C query groups with B = ηs = 2^b for some redundancy parameter η > 0 exceeding the corresponding threshold in Table 1, then the oracle-based decoder learns f in O(s) peeling iterations with probability 1 − O(1/s).

C | 2      | 3      | 4      | 5      | 6      | · · ·
η | 1.0000 | 0.4073 | 0.3237 | 0.2850 | 0.2616 | · · ·

Table 1: Minimum value of η given the number of groups C.

2.2.2 Getting Rid of the Oracle

Now we explain how to detect single-tons and obtain the index-value pair without an oracle. We exploit the diversity of subsampling offsets d from (3). 
Let D_c ∈ F_2^{P×n} be the offset matrix containing P subsampling offsets, where each row is a chosen offset. Denoting by U_c[j] := [· · · , U_{c,p}[j], · · · ]^T the vector of observations (called the observation bin) associated with the P offsets at the j-th right node, we have the following general observation model for each right node in the bipartite graph.

Proposition 1. Given the offset matrix D_c ∈ F_2^{P×n}, we have

U_c[j] = \sum_{k : M_c^T k = j} α[k] (−1)^{D_c k} + w_c[j],    (4)

where w_c[j] ≜ [· · · , W_{c,p}[j], · · · ]^T contains noise samples with variance σ², (−1)^{(·)} is an element-wise exponentiation operator, and (−1)^{D_c k} is the offset signature associated with α[k].

In the same simple example, we keep the subsampling matrix M1 and use the set of offsets d_0 = [0, 0, 0, 0]^T, d_1 = [1, 0, 0, 0]^T, d_2 = [0, 1, 0, 0]^T, d_3 = [0, 0, 1, 0]^T and d_4 = [0, 0, 0, 1]^T, such that D_1 = [0_{1×4}; I_4]. The observation bin associated with the subsampling pattern M1 is

U_1[j] = [U_{1,0}[j], U_{1,1}[j], U_{1,2}[j], U_{1,3}[j], U_{1,4}[j]]^T.    (5)

For example, the observations U_1[01] and U_1[10] are given as

U_1[01] = α[0100] × [1, (−1)^0, (−1)^1, (−1)^0, (−1)^0]^T,
U_1[10] = α[0110] × [1, (−1)^0, (−1)^1, (−1)^1, (−1)^0]^T + α[1010] × [1, (−1)^1, (−1)^0, (−1)^1, (−1)^0]^T.

With these bin observations, one can effectively determine if a check node is a zero-ton, a single-ton or a multi-ton. 
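In the noiseless case the three bin types can be told apart, and a single-ton's index read off, with a simple ratio test on the offset signatures. A minimal sketch (ours, assuming the offsets D_1 = [0_{1×4}; I_4] above, so bit p of k is exposed by the p-th extra offset):

```python
import numpy as np

def singleton_ratio_test(U_bin, tol=1e-9):
    """Noiseless single-ton test on a bin observed under offsets D = [0; I_n].

    U_bin: length n+1 vector [U_0, ..., U_n]; U_0 uses the all-zero offset
    and U_p flips input bit p.  For a single-ton alpha[k] the signatures
    give U_p / U_0 = (-1)^{k[p]}, so the ratios reveal k bit by bit.
    Returns (k, value) if consistent with a single-ton, else None.
    """
    U0, rest = U_bin[0], U_bin[1:]
    if abs(U0) < tol:
        return None                     # zero-ton
    ratios = rest / U0
    k = []
    for r in ratios:
        if abs(r - 1) < tol:
            k.append(0)
        elif abs(r + 1) < tol:
            k.append(1)
        else:
            return None                 # multi-ton: ratios are not all +/-1
    return tuple(k), U0

# Single-ton bin U_1[01] = alpha[0100] from the running example:
a = 1.2
U = a * np.array([1, (-1) ** 0, (-1) ** 1, (-1) ** 0, (-1) ** 0], dtype=float)
assert singleton_ratio_test(U) == ((0, 1, 0, 0), a)
```

A multi-ton such as U_1[10] mixes two signatures, so at least one ratio falls outside {+1, −1} and the test rejects it, which is exactly the behavior exploited next.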
For example, a single-ton, say U_1[01], satisfies |U_{1,0}[01]| = |U_{1,1}[01]| = |U_{1,2}[01]| = |U_{1,3}[01]| = |U_{1,4}[01]|. Then, the index k = [k[1], k[2], k[3], k[4]]^T and the value of a single-ton can be obtained by a simple ratio test:

(−1)^{k̂[1]} = U_{1,1}[01]/U_{1,0}[01] = (−1)^0  ⟹  k̂[1] = 0,
(−1)^{k̂[2]} = U_{1,2}[01]/U_{1,0}[01] = (−1)^1  ⟹  k̂[2] = 1,
(−1)^{k̂[3]} = U_{1,3}[01]/U_{1,0}[01] = (−1)^0  ⟹  k̂[3] = 0,
(−1)^{k̂[4]} = U_{1,4}[01]/U_{1,0}[01] = (−1)^0  ⟹  k̂[4] = 0,
α̂[k̂] = U_{1,0}[01].

The above tests are easy to verify for all observations, so the index-value pair is obtained for peeling. In fact, this detection scheme for obtaining the oracle information is mentioned in the noiseless scenario of [16], using P = n + 1 offsets. However, this procedure fails in the presence of noise. In the following, we propose a general detection scheme for the noisy scenario while using P = O(n) offsets.

3 Learning in the Presence of Noise

In this section, we propose a robust bin detection scheme that identifies the type of each observation bin and estimates the index-value pair (k, α[k]) of a single-ton in the presence of noise. For convenience, we drop the group index c and the node index j without loss of clarity, because the detection scheme is identical for all nodes from all groups. The bin detection scheme consists of a single-ton detection scheme and a zero-ton/multi-ton detection scheme, as described next.

3.1 Single-ton Detection

Proposition 2. 
Given a single-ton with (k, α[k]) observed in the presence of noise N(0, σ²), then by collecting the signs of the observations, we have

c = Dk ⊕ sgn[α[k]] ⊕ z,

where z contains P independent Bernoulli variables, each taking value 1 with probability at most P_e = e^{−ηBα_min²/(2σ²)}, and the sign function is defined as sgn[x] = 1 if x < 0 and sgn[x] = 0 if x > 0.

Note that the P-bit vector c is a received codeword of the n-bit message k over a binary symmetric channel (BSC), under an unknown flip sgn[α[k]]. Therefore, we can design the offset matrix D according to linear block codes. The code should include the all-ones vector 1 as a valid codeword, such that both Dk and Dk ⊕ 1 can be decoded correctly, yielding the correct codeword Dk and hence k.

Definition 3. Let the offset matrix D ∈ F_2^{P×n} constitute a P × n generator matrix of some linear code, which satisfies a minimum distance βP with a code rate R(β) > 0 and β > P_e.

Since there are n information bits in the index k, there exists some linear code (i.e. D) with block length P = n/R(β) that achieves a minimum distance of βP, where R(β) is the rate of the code [15]. As long as β > P_e, it is obvious that the unknown k can be decoded with exponentially decaying probability of error. Excellent examples include the class of expander codes or LDPC codes, which admit a linear-time decoding algorithm. Therefore, the single-ton detection can be performed in time O(n), the same as in the noiseless case.

3.2 Zero-ton and Multi-ton Detection

The single-ton detection scheme works when the underlying bin is indeed a single-ton. However, it does not work for isolating single-tons from zero-tons and multi-tons. We address this issue by further introducing P extra random offsets.

Definition 4. 
Let the offset matrix D̃ ∈ F_2^{P×n} constitute a P × n random matrix consisting of independent identically distributed (i.i.d.) Bernoulli entries with probability 1/2.

Denoting by Ũ = [Ũ_1, · · · , Ũ_P]^T the observations associated with D̃, we perform the following:

• zero-ton verification: the bin is a zero-ton if ‖Ũ‖²/P ≤ (1 + γ)σ²/B for some γ ∈ (0, 1);
• multi-ton verification: the bin is a multi-ton if ‖Ũ − α̂[k̂](−1)^{D̃k̂}‖²/P ≥ (1 + γ)σ²/B, where (k̂, α̂[k̂]) are the single-ton detection estimates.

It is shown in the supplemental material that this bin detection scheme works with probability at least 1 − O(1/s). Together with Lemma 1, the learning framework in the presence of noise succeeds with probability at least 1 − O(1/s). As detailed in the supplemental material, this leads to an overall sample complexity of O(sn) and a runtime of O(ns log s).

4 Application in Hypergraph Sketching

Consider a d-rank hypergraph G = (V, E) with |E| = r edges, where V = {1, · · · , n}. A cut S ⊆ V is a set of selected vertices, denoted by the boolean cube x = [x_1, · · · , x_n] over {±1}^n, where x_i = −1 if i ∈ S and x_i = 1 if i ∉ S. The value of a specific cut x can be written as

f(x) = \sum_{e ∈ E} [1 − (∏_{i ∈ e} (1 + x_i)/2 + ∏_{i ∈ e} (1 − x_i)/2)].    (6)

Letting x_i = (−1)^{m[i]} for all i ∈ [n], we have f(x) = u[m] = \sum_{k ∈ F_2^n} c[k](−1)^{⟨k, m⟩}, where the coefficient c[k] is a scaled WHT coefficient. 
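This representation is easy to check numerically: brute-forcing the WHT of a small cut function and counting its support verifies that each rank-d hyperedge contributes at most 2^{d−1} monomials. A short sketch (ours; assumes unweighted edges and a toy graph chosen for illustration):

```python
import itertools

def cut_value(x, edges):
    """f(x) as in Eq. (6): a hyperedge is cut unless all its nodes agree."""
    total = 0.0
    for e in edges:
        same_side = all(x[i] == 1 for i in e) or all(x[i] == -1 for i in e)
        total += 0.0 if same_side else 1.0
    return total

def wht_support(edges, n):
    """Brute-force WHT of the cut function; returns its non-zero c[k]."""
    coeffs = {}
    for k in itertools.product((0, 1), repeat=n):
        c = 0.0
        for m in itertools.product((0, 1), repeat=n):
            x = [(-1) ** mi for mi in m]
            sign = (-1) ** (sum(ki * mi for ki, mi in zip(k, m)) % 2)
            c += cut_value(x, edges) * sign
        c /= 2 ** n
        if abs(c) > 1e-9:
            coeffs[k] = c
    return coeffs

# r = 2 hyperedges of rank d = 3 on n = 5 nodes
edges = [(0, 1, 2), (2, 3, 4)]
coeffs = wht_support(edges, 5)
assert len(coeffs) <= 2 * 2 ** (3 - 1)   # sparsity bound s <= r * 2^(d-1)
```

Each edge's support lands on the even-sized subsets of its own vertex set, which is why the overall support stays within r2^{d−1} regardless of n.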
Clearly, if the number of hyperedges is small, r ≪ 2^n, and the maximum size of each hyperedge is small, d ≪ n, then the coefficients c[k] are sparse, and the sparsity can be well upper bounded by s ≤ r2^{d−1}. Now, we can use our learning framework to compute the sparse coefficients c[k] from only a few cut queries. Note that in the graph sketching problem, the weight of k is bounded by d due to the special structure of the cut function. Therefore, in the noiseless setting, we can leverage the sparsity d and use far fewer offsets, P ≪ n, in the spirit of compressed sensing. In the supplemental material, we adapt our framework to derive the GraphSketch bin detection scheme with even lower query costs and runtime.

Proposition 3. The GraphSketch bin detection scheme uses P = O(d(log n + log s)) offsets and successfully detects single-tons and their index-value pairs with probability at least 1 − O(1/s).

Next, we provide numerical experiments of our learning algorithm for sketching large random hypergraphs, as well as actual hypergraphs formed by real datasets². In Fig. 3, we compare the probability of success in sketching hypergraphs with n = 1000 nodes over 100 trials against the LearnGraph procedure³ of [9], by randomly generating r = 1 to 10 hyperedges with rank d = 5. The performance is plotted against the number of edges r and the query complexity of learning. As seen from Fig. 3, the query complexity of our framework is significantly lower (≤ 1%) than that of [9].

4.1 Sketching the Yahoo! Messenger User Communication Pattern Dataset

We sketch the hypergraphs extracted from the Yahoo! Messenger User Communication Pattern Dataset [19], which records communications for 28 days. 
The dataset is recorded entry-wise as (day, time, transmitter, origin-zipcode, receiver, flag), where day and time represent the time stamp of each message, the transmitter and receiver represent the IDs of the sender and the recipient, the zipcode is a spatial stamp of each message, and the flag indicates if the recipient is in the contact list. There are 10^5 unique users and 5649 unique zipcodes. A hidden hypergraph structure is captured as follows.

²We used MATLAB on a Macbook Pro with an Intel Core i5 processor at 2.4 GHz and 8 GB RAM.
³We would like to acknowledge and thank the authors of [9] for providing their source code.

Figure 3: Sketching performance of random hypergraphs with n = 1000 nodes. (a)-(b) Our Framework; (c)-(d) LearnGraph.

Over an interval δt, each sender with a unique zipcode forms a hyperedge, and the recipients are the members of the hyperedge. By considering T consecutive intervals δt over a set of δz ≪ 5649 zipcodes, the communication pattern gives rise to a hypergraph with only a few hyperedges in each interval, where each hyperedge contains only a few (d) nodes. The complete set of n nodes in the hypergraph is the set of recipients who are active during the T intervals. In Table 2, we choose the sketching interval δt = 0.5 hr and consider T = 5 intervals. For each interval, we extract the communication hypergraph from the dataset by sketching the communications originating from a set of δz = 20 zipcodes⁴, posing queries constructed at random in our framework. We average our performance over 100 trial runs and obtain the success probability.

Temporal Graph           | n     | # of edges (E) | degree (d) | Run-time (sec) | 1 − P_F
(9:00 a.m. ∼ 9:30 a.m.)  | 12648 | 87             | 9          | 422.3          | 0.97
(9:30 a.m. ∼ 10:00 a.m.) | 12648 | 102            | 8          | 310.1          | 0.99
(10:00 a.m. ∼ 10:30 a.m.)| 12648 | 109            | 7          | 291.4          | 0.99
(10:30 a.m. ∼ 11:00 a.m.)| 12648 | 84             | 9          | 571.3          | 0.93
(11:00 a.m. ∼ 11:30 a.m.)| 12648 | 89             | 10         | 295.1          | 0.93

Table 2: Sketching performance with C = 8 groups and P = 421 query sets of size B = 128.

We maintain C = 8 groups of queries with P = 421 query sets of size B = 256 per group throughout all the experiments (i.e., 8.6 × 10^5 queries ≈ 60n). It is also seen that we can sketch the temporal communication hypergraphs from the real dataset over much larger intervals (0.5 hr) than LearnGraph can (around 30 sec to 5 min), and more reliably in terms of success probability.

5 Conclusions

In this paper, we introduce a coding-theoretic active learning framework for sparse polynomials under a much more challenging sparsity regime than previously considered. The proposed framework effectively lowers the query complexity and, especially, the computational complexity. Our framework is useful in sketching large hypergraphs, where the queries are obtained by specific graph cuts. We further show via experiments that our learning algorithm performs very well on real datasets compared with existing approaches.

⁴We did not show the performance of LearnGraph because it fails to work on hypergraphs with this many hyperedges using a reasonable number of queries (i.e., ≤ 1000n), as mentioned in [9].

References

[1] D. Angluin. Computational learning theory: survey and selected bibliography. In Proceedings of the twenty-fourth annual ACM symposium on Theory of computing, pages 351-369. ACM, 1992.

[2] M. Bouvel, V. Grebinski, and G. Kucherov. 
Combinatorial search on graphs motivated by bioinformatics applications: A brief survey. In Graph-Theoretic Concepts in Computer Science, pages 16–27. Springer, 2005.

[3] N. Bshouty and Y. Mansour. Simple learning algorithms for decision trees and multivariate polynomials. In 36th Annual Symposium on Foundations of Computer Science, pages 304–311, Oct. 1995.

[4] N. H. Bshouty and H. Mazzawi. Optimal query complexity for reconstructing hypergraphs. In 27th International Symposium on Theoretical Aspects of Computer Science (STACS 2010), pages 143–154, 2010.

[5] S.-S. Choi, K. Jung, and J. H. Kim. Almost tight upper bound for finding Fourier coefficients of bounded pseudo-Boolean functions. Journal of Computer and System Sciences, 77(6):1039–1053, 2011.

[6] S. A. Goldman. Computational learning theory. In Algorithms and Theory of Computation Handbook, pages 26–26. Chapman & Hall/CRC, 2010.

[7] J. Jackson. An efficient membership-query algorithm for learning DNF with respect to the uniform distribution. In 35th Annual Symposium on Foundations of Computer Science, pages 42–53. IEEE, 1994.

[8] M. J. Kearns. The Computational Complexity of Machine Learning. MIT Press, 1990.

[9] M. Kocaoglu, K. Shanmugam, A. G. Dimakis, and A. Klivans. Sparse polynomial learning and graph sketching. In Advances in Neural Information Processing Systems, pages 3122–3130, 2014.

[10] E. Kushilevitz and Y. Mansour. Learning decision trees using the Fourier spectrum. SIAM Journal on Computing, 22(6):1331–1348, 1993.

[11] Y. Mansour. Learning Boolean functions via the Fourier transform. In Theoretical Advances in Neural Computation and Learning, pages 391–424. Springer, 1994.

[12] Y. Mansour. Randomized interpolation and approximation of sparse polynomials. SIAM Journal on Computing, 24(2):357–368, 1995.

[13] H. Mazzawi.
Reconstructing Graphs Using Edge Counting Queries. PhD thesis, Technion - Israel Institute of Technology, Faculty of Computer Science, 2011.

[14] S. Negahban and D. Shah. Learning sparse Boolean polynomials. In 50th Annual Allerton Conference on Communication, Control, and Computing, pages 2032–2036. IEEE, 2012.

[15] T. Richardson and R. Urbanke. Modern Coding Theory. Cambridge University Press, 2008.

[16] R. Scheibler, S. Haghighatshoar, and M. Vetterli. A fast Hadamard transform for signals with sub-linear sparsity. arXiv preprint arXiv:1310.1803, 2013.

[17] B. Settles. Active learning literature survey. University of Wisconsin, Madison, 52:55–66, 2010.

[18] P. Stobbe and A. Krause. Learning Fourier sparse set functions. In International Conference on Artificial Intelligence and Statistics, pages 1125–1133, 2012.

[19] Yahoo. Yahoo! Webscope dataset ydata-ymessenger-user-communication-pattern-v1_0.