{"title": "What do row and column marginals reveal about your dataset?", "book": "Advances in Neural Information Processing Systems", "page_first": 2166, "page_last": 2174, "abstract": "Numerous datasets ranging from group memberships within social networks to purchase histories on e-commerce sites are represented by binary matrices. While this data is often either proprietary or sensitive, aggregated data, notably row and column marginals, is often viewed as much less sensitive, and may be furnished for analysis. Here, we investigate how these data can be exploited to make inferences about the underlying matrix H. Instead of assuming a generative model for H, we view the input marginals as constraints on the dataspace of possible realizations of H and compute the probability density function of particular entries H(i,j) of interest. We do this, for all the cells of H simultaneously, without generating realizations but rather via implicitly sampling the datasets that satisfy the input marginals. The end result is an efficient algorithm with running time equal to the time required by standard sampling techniques to generate a single dataset from the same dataspace. Our experimental evaluation demonstrates the efficiency and the efficacy of our framework in multiple settings.", "full_text": "What do row and column marginals reveal about\n\nyour dataset?\n\nBehzad Golshan\nBoston University\n\nbehzad@cs.bu.edu\n\nJohn W. Byers\nBoston University\n\nbyers@cs.bu.edu\n\nEvimaria Terzi\nBoston University\n\nevimaria@cs.bu.edu\n\nAbstract\n\nNumerous datasets ranging from group memberships within social networks to\npurchase histories on e-commerce sites are represented by binary matrices. While\nthis data is often either proprietary or sensitive, aggregated data, notably row and\ncolumn marginals, is often viewed as much less sensitive, and may be furnished\nfor analysis. 
Here, we investigate how these data can be exploited to make infer-\nences about the underlying matrix H. Instead of assuming a generative model for\nH, we view the input marginals as constraints on the dataspace of possible real-\nizations of H and compute the probability density function of particular entries\nH(i, j) of interest. We do this for all the cells of H simultaneously, without gen-\nerating realizations, but rather via implicitly sampling the datasets that satisfy the\ninput marginals. The end result is an ef\ufb01cient algorithm with asymptotic running\ntime the same as that required by standard sampling techniques to generate a sin-\ngle dataset from the same dataspace. Our experimental evaluation demonstrates\nthe ef\ufb01ciency and the ef\ufb01cacy of our framework in multiple settings.\n\n1\n\nIntroduction\n\nOnline marketplaces such as Walmart, Net\ufb02ix, and Amazon store information about their customers\nand the products they purchase in binary matrices. Likewise, information about the groups that social\nnetwork users participate in, the \u201cLikes\u201d they make, and the other users they \u201cfollow\u201d can also be\nrepresented using large binary matrices. In all these domains, the underlying data (i.e., the binary\nmatrix itself) is often viewed as proprietary or as sensitive information. However, the data owners\nmay view certain aggregates as much less sensitive. Examples include revealing the popularity of a\nset of products by reporting total purchases, revealing the popularity of a group by reporting the size\nof its membership, or specifying the in- and out-degree distributions across all users.\nHere, we tackle the following question: \u201cGiven the row and column marginals of a hidden binary\nmatrix H, what can one infer about H?\u201d.\nOptimization-based methods for addressing this question, e.g., least squares or maximum likelihood,\n\nassume a generative model for the hidden matrix and output an estimate (cid:98)H of H. 
However, this\nestimate gives little guidance as to the structure of the feasible solution space; for example, (cid:98)H may\n\nbe one of many solutions that achieve the same value of the objective function. Moreover, these\nmethods provide little insight about the estimates of particular entries H(i, j).\nIn this paper, we do not make any assumptions about the generative process of H. Rather, we\napproach the above question by viewing the row and column marginals as constraints that induce\na dataspace X , de\ufb01ned by the set of all matrices satisfying the input constraints. Then, we explore\nthis dataspace not by estimating H at large, but rather by computing the entry-wise PDF P(i, j), for\nevery entry (i, j), where we de\ufb01ne P(i, j) to be the probability that cell (i, j) takes on value 1 in the\ndatasets in X . From the application point of view, the value of P(i, j) can provide the data analyst\n\n1\n\n\fwith valuable insight: for example, values close to 1 (respectively 0) give high con\ufb01dence to the\nanalyst that H(i, j) = 1 (respectively H(i, j) = 0).\nA natural way to compute entry-wise PDFs is by sampling datasets from the induced dataspace\nX . However, this dataspace can be vast, and existing techniques for sampling binary tables with\n\ufb01xed marginals [6, 9] fail to scale. In this paper, we propose a new ef\ufb01cient algorithm for computing\nentry-wise PDFs by implicitly sampling the dataspace X . Our technique can compute the entry-wise\nPDFs of all entries in running time the same as that required for state-of-the art sampling techniques\nto generate just a single sample from X . Our experimental evaluation demonstrates the ef\ufb01ciency\nand the ef\ufb01cacy of our technique for both synthetic and real-world datasets.\nRelated work: To the best of our knowledge, we are the \ufb01rst to introduce the notion of entry-\nwise PDFs for binary matrices and to develop implicit sampling techniques for computing them\nef\ufb01ciently. 
However, our work is related to the problem of sampling from the space of binary\nmatrices with \ufb01xed marginals, studied extensively in many domains [2, 6, 7, 9, 21], primarily due to\nits applications in statistical signi\ufb01cance testing [14, 17, 20]. Existing sampling techniques all rely\non explicitly sampling the underlying dataspaces (either using MCMC or importance sampling) and\nwhile these methods can be used to compute entry-wise PDFs, they are inef\ufb01cient for large datasets.\nOther related studies focus on identifying interesting patterns in binary data given itemset frequen-\ncies or other statistics [3, 15]. These works either assume a generative model for the data or build\nthe maximum entropy distribution that approximates the observed statistics; whereas our approach\nmakes no such assumptions and focuses only on exact solutions. Finally, considerable work has\nfocused on counting binary matrices with \ufb01xed marginals [1, 8, 10, 23]. One can compute the\nentry-wise PDFs using these results, albeit in exponential time.\n\n2 Dataspace Exploration\nThroughout the rest of the discussion we will assume an n\u00d7 m 0\u20131 matrix H, which is hidden. The\ninput to our problem consists of the dimensionality of H and its row and column marginals provided\nas a pair of vectors (cid:104)r, c(cid:105). That is, r and c are n-dimensional and m-dimensional integer vectors\nrespectively; entry r(i) stores the number of 1s in the ith row of H, and similarly for c(i). In this\npaper we address the following high-level problem:\nProblem 1. Given (cid:104)r, c(cid:105), what can we infer about H? More speci\ufb01cally, can we reason about\nentries H(i, j) without access to H but only its row and column marginals?\n\nClearly, there are many possible ways of formalizing the above problem into a concrete problem\nde\ufb01nition. 
In Section 2.1 we describe some mainstream formulations and discuss their drawbacks. In Section 2.2 we introduce our dataspace exploration framework, which overcomes these drawbacks.

2.1 Optimization-based approaches

Standard optimization-based approaches for Problem 1 usually assume a generative model for H and estimate it by computing Ĥ, the best estimate of H under a specific objective function (e.g., likelihood, squared error). Instantiations of these methods for our setting are described next.

Maximum-likelihood (ML): The ML approach assumes that the hidden matrix H is generated by a model that only depends on the observed marginals. Then, the goal is to find the model parameters that provide an estimate Ĥ of H while maximizing the likelihood of the observed row and column marginals. A natural choice of such a model for our setting is the Rasch model [4, 19], where the probability of entry H(i, j) taking on value 1 is given by:

Pr[H(i, j) = 1] = e^(α_i − β_j) / (1 + e^(α_i − β_j)).

The maximum-likelihood estimates of the (n + m) parameters α_i and β_j of this model can be computed in polynomial time [4, 19]. For the rest of the discussion, we will use the term Rasch to refer to the experimental method that computes an estimate of H using this ML technique.

Least-squares (LS): One can view the task of estimating H(i, j) from the input aggregates as solving the linear system defined by the equations r = H × 1 and c = H^T × 1, where 1 denotes the all-ones vector. Unfortunately, such a system of equations is typically highly under-determined, and standard LS methods approach it as a regression problem that asks for an estimate Ĥ of H minimizing F(Ĥ) = ||(Ĥ × 1) − r||_F + ||(Ĥ^T × 1) − c||_F, where || ||_F is the Frobenius norm [13]. Even when the entries of Ĥ are restricted to lie in [0, 1], it is not guaranteed that this regression-based formulation will output a reasonable estimate of H. For example, all tables with row and column marginals r and c are 0-error solutions, yet there may be exponentially many such matrices. Alternatively, one can incorporate a "regularization" factor J(·) and search for the Ĥ that minimizes F(Ĥ) + J(Ĥ). For the rest of this paper, we consider this latter approach with J(Ĥ) = Σ_{i,j} (Ĥ(i, j) − h)², where h is the average value over all entries of Ĥ. We refer to this approach as the LS method.

Although one can solve (any of) the above estimation problems via standard optimization techniques, the output of such methods is a holistic estimate of H that gives no insight into how many solutions Ĥ with the same value of the objective function exist, or into the confidence in the value of every cell. Moreover, these techniques are based on assumptions about the generative model of the hidden data. While these assumptions may be plausible, they may not be valid in real data.

2.2 The dataspace exploration framework

To overcome the drawbacks of the optimization-based methods, we now introduce our dataspace exploration framework, which does not make any structural assumptions about H and considers the set of all possible datasets that are consistent with the input row and column marginals ⟨r, c⟩. We call the set of such datasets the ⟨r, c⟩-dataspace, denoted by X⟨r,c⟩, or X for short.

We translate the high-level Problem 1 into the following question: Given ⟨r, c⟩, what is the probability that the entry H(i, j) of the hidden dataset takes on value 1?
That is, for each entry H(i, j) we are interested in computing the quantity:

P(i, j) = Σ_{H′ ∈ X} Pr(H′) Pr[H′(i, j) = 1].   (1)

Here, Pr(H′) encodes the prior probability distribution over all hidden matrices in X. For a uniform prior, P(i, j) encodes the fraction of matrices in X that have 1 in position (i, j). Clearly, for binary matrices, P(i, j) determines the PDF of the values that appear in cell (i, j). Thus, we call P(i, j) the entry-wise PDF of entry (i, j), and P the PDF matrix. If P(i, j) is very close to 1 (or 0), then over all possible instantiations of H, the entry (i, j) is, with high confidence, 1 (or 0). On the other hand, P(i, j) ≈ 0.5 signals that, in the absence of additional information, a high-confidence prediction of entry H(i, j) cannot be made.

Next, we discuss algorithms for estimating entry-wise PDFs efficiently. Throughout the rest of the discussion we will adopt Matlab notation for matrices: for any matrix M, we will use M(i, :) to refer to the i-th row, and M(:, j) to refer to the j-th column of M.

3 Basic Techniques

First, we review some basic facts and observations about ⟨r, c⟩ and the dataspace X⟨r,c⟩.

Validity of marginals: Given ⟨r, c⟩ we can decide in polynomial time whether |X⟨r,c⟩| > 0, either by verifying the Gale-Ryser condition [5] or by constructing a binary matrix with the input row and column marginals, as proposed by Erdős, Gallai, and Hakimi [18, 11]. The second option has the comparative advantage that, if |X⟨r,c⟩| > 0, it also outputs a binary matrix from X⟨r,c⟩.

Nested matrices: Building upon existing results [18, 11, 16, 24], we have the following:

Lemma 1. Given the row and column marginals of a binary matrix as ⟨r, c⟩, we can decide in polynomial time whether |X⟨r,c⟩| = 1 and, if so, completely recover the hidden matrix H.

The binary matrices that can be uniquely recovered are called nested matrices and have the property that in their representation as bipartite graphs they do not have any switch boxes [16]: a pair of edges (u, v) and (u′, v′) for which neither (u, v′) nor (u′, v) exists.

Explicit sampling: One way of approximating P(i, j) for large dataspaces is to first obtain a uniform sample of N binary matrices X_1, . . . , X_N from X and, for each (i, j), compute P(i, j) as the fraction of samples for which X_ℓ(i, j) = 1. We can obtain random (near-uniform) samples from X using either the Markov chain Monte Carlo (MCMC) method proposed by Gionis et al. [9] or the Sequential Importance Sampling (Sequential) algorithm proposed by Chen et al. [6]. MCMC guarantees uniformity of the samples, but it does not converge in polynomial time. Sequential produces near-uniform samples in polynomial time, but it requires O(n³m) time per sample, so using this algorithm to produce N samples (N ≫ n) is beyond practical consideration. To recap, explicit sampling methods are impractical for large datasets; moreover, their accuracy depends on the number of samples N and on the size of the dataspace X, which is itself hard to estimate.

4 Computing entry-wise PDFs

4.1 Warmup: The SimpleIS algorithm

With the aim of providing some intuition and insight, we start by presenting a simplified version of our algorithm, called SimpleIS, shown in Algorithm 1. SimpleIS computes the P matrix one column at a time, in arbitrary order. Each such computation consists of two steps: (a) propose and (b) adjust. The Propose step associates with every row i a weight w(i) that is proportional to the row marginal of row i.
A naive way of assigning these weights is by setting w(i) = r(i)/m. We refer to these weights w as the raw probabilities. The Adjust step takes as input the column sum x = c(j) of the jth column and the raw probabilities w, and adjusts these probabilities into the final probabilities p_x such that for column j we have that Σ_i p_x(i) = x.

Algorithm 1 The SimpleIS algorithm.
Input: ⟨r, c⟩
Output: Estimate of the PDF matrix P
1: w = Propose(r)
2: for j = 1 . . . m do
3:    x = c(j)
4:    p_x = Adjust(w, x)
5:    P(:, j) = p_x

This adjustment is not a simple normalization; it computes the final values of p_x(i) by implicitly considering all possible realizations of the jth column with column sum x and computing the probability that the ith cell of that column is equal to 1. Formally, if we use x to denote the binary vector that represents one realization of the j-th column of the hidden matrix, then p_x(i) is computed as:

p_x(i) := Pr[ x(i) = 1 | Σ_{i′=1}^{n} x(i′) = x ].   (2)

It can be shown [6] that Equation (2) can be evaluated in polynomial time as follows: for any vector x, let N = {1, . . . , n} be the set of all possible positions of 1s within x, and let R(x, N) be the probability that exactly x elements of N are set to 1, i.e.,

R(x, N) := Pr[ Σ_{i ∈ N} x(i) = x ].

Using this definition, p_x(i) is then derived as follows:

p_x(i) = w(i) R(x − 1, N \ {i}) / R(x, N).   (3)

The evaluation of all of the necessary terms R(·, ·) can be accomplished by the following dynamic-programming recursion: for all a ∈ {1, . . . , x}, and for all B and i such that |B| > a and i ∈ B ⊆ N:

R(a, B) = (1 − w(i)) R(a, B \ {i}) + w(i) R(a − 1, B \ {i}).

Running time: The Propose step is linear in the number of rows, and Adjust evaluates Equation (3) and thus needs at most O(n²) time.
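To make the Adjust step concrete, here is a minimal Python sketch of Equations (2)-(3). The function names are ours, not from the paper's implementation, and for simplicity the sketch recomputes the leave-one-out distribution R(·, N \ {i}) from scratch for every i, which costs O(n³) rather than the O(n²) achievable by sharing the dynamic-programming table:

```python
def count_pmf(w):
    # pmf[a] = Pr[exactly a of the independent Bernoulli(w[i]) cells are 1];
    # these are the terms R(a, N) of the dynamic-programming recursion.
    pmf = [1.0]
    for wi in w:
        nxt = [0.0] * (len(pmf) + 1)
        for a, p in enumerate(pmf):
            nxt[a] += (1.0 - wi) * p      # cell i takes value 0
            nxt[a + 1] += wi * p          # cell i takes value 1
        pmf = nxt
    return pmf

def adjust(w, x):
    # Equation (3): p_x(i) = w(i) * R(x-1, N \ {i}) / R(x, N), assuming x >= 1.
    full = count_pmf(w)
    return [wi * count_pmf(w[:i] + w[i + 1:])[x - 1] / full[x]
            for i, wi in enumerate(w)]
```

By construction, the adjusted probabilities returned by `adjust` sum to x, as the Adjust step requires.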
Thus, SimpleIS runs in time O(mn²).

Discussion: A natural question to consider is: why could the estimates of P produced by SimpleIS be inaccurate? To answer this question, consider a hidden 5 × 5 binary table with r = (4, 4, 2, 2, 2) and c = (2, 3, 1, 4, 4), and assume that SimpleIS starts by computing the entry-wise PDFs of the first column. While evaluating Equation (2), SimpleIS generates all possible columns of matrices with row marginals r and a column with column sum 2, ignoring the values of the rest of the column marginals. Thus, the realization (0, 0, 0, 1, 1)^T of the first column is taken into consideration by SimpleIS, despite the fact that it cannot lead to a matrix that respects r and c. This is because four more 1s would need to be placed in the empty cells of each of the first two rows, which in turn would lead to a violation of the column marginal of the third column. This situation occurs exactly because SimpleIS never considers the constraints imposed by the rest of the column marginals when aggregating the possible realizations of column j. Ultimately, the SimpleIS algorithm results in estimates p_x(i) that reflect the probability of entry i in a column being equal to 1, conditioned over all matrices with row marginals r and a single column with column sum x [6]. But this dataspace is not our target dataspace.

4.2 The IS algorithm

In the IS algorithm, we remedy this weakness of SimpleIS by taking into account the constraints imposed by all the column marginals when aggregating the realizations of a particular column j. Referring again to the previous example, the input vectors r and c impose the following constraints on any matrix in X⟨r,c⟩: column 1 must have at least one 1 in the first two rows and (exactly) two 1s in the first five rows.
These types of constraints, known as knots, are formally de\ufb01ned as follows.\nDe\ufb01nition 1. A knot is a subvector of a column characterized by three integer values (cid:104)[s, e] | b(cid:105),\nwhere s and e are the starting and ending indices de\ufb01ning the subvector, and b is a lower bound on\nthe number of 1s that must be placed in the \ufb01rst e rows of the column.\nInterestingly, given (cid:104)r, c(cid:105), the knots of any column j of the hidden matrix can be identi\ufb01ed in linear\ntime using an algorithm that recursively applies the Gale-Ryser condition on realizability of bipartite\ngraphs. This method, and the notion of knots, were \ufb01rst introduced by Chen et al. [6].\nAt a high level, IS (Algorithm 2) identi\ufb01es the knots within each column and uses them\nto achieve a better estimation of P. Here,\nthe process of obtaining the \ufb01nal probabilities\nis more complicated since it requires: (a) identifying the knots of every column j (line 3),\n(b) computing the entry-wise PDFs for the en-\ntries in every knot (denoted by qk) (lines 4-7),\nand (c) creating the jth column of P by putting\nthe computed entry-wise PDFs back together\n(line 8). Note that we use wk to refer to the\nvector of raw probabilities associated with cells\nin knot k. Also, vector pk,x is used to store the\nadjusted probabilities of cells in knot k given\nthat x 1s are placed within the knot.\nStep (a) is described by Chen et al. [6], and\nstep (c) is straightforward, so we focus on (b),\nwhich is the main part of IS. This step con-\nsiders all the knots of the jth column sequen-\ntially. Assume that the kth knot of this column\nis given by the tuple (cid:104)[sk, ek] | bk(cid:105). Let x be the number of 1s inside this knot. If we know the value\nof x, then we can simply use the Adjust routine to adjust the raw probabilities wk. 
But since the value of x may vary over different realizations of column j, we need to compute the probability distribution of the different values of x. For this, we first observe that if we know that y 1s have been placed prior to knot k, then we can compute lower and upper bounds on x as:

L_{k|y} = max{0, b_k − y},   U_{k|y} = min{e_k − s_k + 1, c(j) − y}.

Algorithm 2 The IS algorithm.
Input: ⟨r, c⟩
Output: Estimate of the PDF matrix P
1: w = Propose(r)
2: for j = 1 . . . m do
3:    FindKnots(j, r, c)
4:    for each knot k ∈ {1 . . . l} do
5:       for x: number of 1s in knot k do
6:          p_{k,x} = Adjust(w_k, x)
7:       q_k = E_x[p_{k,x}]
8:    P(:, j) = [q_1; . . . ; q_l]

Clearly, the number of 1s in the knot must be an integer value in the interval [L_{k|y}, U_{k|y}]. Lacking prior knowledge, we assume that x takes any value in this interval uniformly at random. Thus, the probability of x 1s occurring inside the knot, given the value of y (i.e., the number of 1s prior to the knot), is:

P_k(x|y) = 1 / (U_{k|y} − L_{k|y} + 1)   for x ∈ [L_{k|y}, U_{k|y}], and 0 otherwise.   (4)

Based on this conditional probability, we can write the probability of each value of x as

P_k(x) = Σ_{y=0}^{c(j)} Q_k(y) P_k(x|y),   (5)

in which P_k(x|y) is computed by Equation (4) and Q_k(y) refers to the probability of having y 1s prior to knot k. In order to evaluate Equation (5), we need to compute the values of Q_k(y). We observe that for every knot k and for every value of y, Q_k(y) can be computed by dynamic programming as:

Q_k(y) = Σ_{z=0}^{y} Q_{k−1}(z) P_{k−1}(y − z | z).   (6)

Figure 1: Panels (a) "Blocked matrices" and (b) "Matrices with knots" depict Error (log scale) for six different algorithms on the two classes of matrices; panel (c) depicts relative running times. [Plots omitted.]

Running time and speedups: If there is a single knot in every column, SimpleIS and IS are identical.
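Under the uniform-within-bounds assumption, the knot-by-knot dynamic program of Equations (4)-(6) can be sketched as follows. This is a minimal illustration based on our reading of the knot definition; the function name and the tuple representation of knots are ours:

```python
def knot_count_distributions(knots, col_sum):
    # knots: list of (s, e, b) triples encoding <[s, e] | b>; col_sum: c(j).
    # Returns, for each knot k, the distribution P_k(x) of the number of 1s
    # placed inside that knot (Equation (5)).
    Q = {0: 1.0}                  # Q_k(y): probability of y ones before knot k
    dists = []
    for (s, e, b) in knots:
        Pk, Q_next = {}, {}
        for y, qy in Q.items():
            lo = max(0, b - y)                    # L_{k|y}
            hi = min(e - s + 1, col_sum - y)      # U_{k|y}
            if hi < lo:
                continue                          # infeasible prefix
            p = 1.0 / (hi - lo + 1)               # Equation (4): uniform on [lo, hi]
            for x in range(lo, hi + 1):
                Pk[x] = Pk.get(x, 0.0) + qy * p               # Equation (5)
                Q_next[y + x] = Q_next.get(y + x, 0.0) + qy * p   # Equation (6)
        dists.append(Pk)
        Q = Q_next
    return dists
```

For example, with knots ⟨[1, 2] | 1⟩ and ⟨[3, 5] | 2⟩ and c(j) = 2 (matching the constraints on column 1 of the running example), the first knot receives one or two 1s with probability 0.5 each.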
For a column j with ℓ knots, IS requires O(ℓ²c(j) + n·c(j)²) time, or O(n³) in the worst case. Thus, a sequential implementation of IS has running time O(n³m). This is the same as the time required by Sequential for generating a single sample from X⟨r,c⟩, providing a clear indication of the computational speedups afforded by IS over sampling. Moreover, IS treats each column independently, and thus it is parallelizable. Finally, since the entry-wise PDFs of two columns with the same column marginals are identical, our method only needs to compute the PDFs of columns with distinct column marginals. Further speedups can be achieved for large datasets by binning columns with similar marginals into a bin with a representative column sum. When the columns in the same bin differ by at most t, we call this speedup t-Binning.

Discussion: We point out here that IS is highly motivated by Sequential, the most practical algorithm to date for sampling (almost) uniformly matrices from the dataspace X⟨r,c⟩. Although Sequential was designed for a different purpose, IS uses some of its building blocks (e.g., knots). However, the connection is high-level, and there is no clear quantification of the relationship between the values of P computed by IS and those produced by repeated sampling from X⟨r,c⟩ using Sequential.
While we study this relationship experimentally, we leave the formal investigation as an open problem.

5 Experiments

Accuracy evaluation: We measure the accuracy of different methods by comparing the estimates P̂ they produce against the known ground-truth P and evaluating the average absolute error:

Error(P̂, P) = ( Σ_{i,j} |P̂(i, j) − P(i, j)| ) / mn.   (7)

Figure 2: Panel (a) shows the CDF of the entry-wise PDFs estimated by IS for the DBLP dataset. Panels (b) and (c) show Error(P, H) as a function of the percentage of "revealed" cells for DBLP and NOLA, respectively. [Plots omitted.]

We compare the Error of our methods, i.e., SimpleIS and IS, to the Error of the two optimization methods, Rasch and LS, described in Section 2.1, and of the two explicit-sampling methods, Sequential and MCMC, described in Section 3. For MCMC, we use double the burn-in period (i.e., four times the number of ones in the table) suggested by Gionis et al. [9]. For both Sequential and MCMC, we vary the sample size and use up to 250 samples; for this number of samples these methods can take up to 2 hours to complete for 100 × 100 matrices. In fact, our experiments were ultimately restricted by the inability of the other methods to handle larger matrices.

Since exhaustive enumeration is not an option, it is very hard to obtain ground-truth values of P for arbitrary matrices, so we focus on two specific types of matrices: blocked matrices and matrices with knots.

An n × n blocked matrix has marginals r = (1, n − 1, . . .
, n − 1) and c = (1, n − 1, . . . , n − 1). Any table with these marginals either has a value of 1 in entry (1, 1), or it has two distinct entries with value 1 in the first row and the first column (excluding cell (1, 1)). Also note that, given a realization of the first row and the first column, the rest of the table is fully determined. This implies that there are exactly (2n − 1) tables with such marginals, and the entry-wise PDFs are: P(1, 1) = 1/(2n − 1); for i ≠ 1, P(1, i) = P(i, 1) = (n − 1)/(2n − 1); and for i ≠ 1 and j ≠ 1, P(i, j) = (2n − 2)/(2n − 1).

The matrices with knots are binary matrices generated by diagonally concatenating smaller matrices for which the ground truth is known through exhaustive enumeration. The concatenation is done in such a way that no new switch boxes are introduced (as defined in Section 3). While the details of the construction are omitted due to lack of space, the key characteristic of these matrices is that they have a large number of knots.

Figure 1(a) shows the Error (in log scale) of the different methods as a function of the matrix size n; SimpleIS and IS perform identically in terms of Error and are much better than the other methods. Moreover, they become increasingly accurate as the size of the dataset increases, which means that our methods remain relatively accurate even for large matrices. In this experiment, Rasch appears to be the second-best method. However, as our next experiment indicates, the success of Rasch hinges on the fact that the marginals of this experiment do not introduce many knots.

The results on matrices with many knots are shown in Figure 1(b). Here, the relative performance of the different algorithms is different: SimpleIS is among the worst-performing algorithms, together with LS and Rasch, with an average Error of 0.1.
On the other hand, IS, together with Sequential and MCMC, is clearly among the best-performing algorithms. This is mainly due to the fact that the matrices we create for this experiment have many knots: since SimpleIS, LS and Rasch are all knot-oblivious, they produce estimates with large errors, whereas Sequential, MCMC and IS take knots into account and therefore perform much better than the rest.

Looking at the running times (Figure 1(c)), we observe that the running time of our methods is clearly better than that of all the other algorithms for larger values of n. For example, while both SimpleIS and IS compute P within a second, Rasch requires a couple of seconds, and the other methods need minutes or even hours to complete.

Utilizing entry-wise PDFs: Next, we demonstrate the practical utility of entry-wise PDFs. For this experiment we use the following real-world datasets as hidden matrices.

DBLP: The rows of this hidden matrix correspond to authors and the columns correspond to conferences in DBLP. Entry (i, j) has value 1 if author i has a publication in conference j. This subset of DBLP, obtained by Hyvönen et al. [12], has size 17,702 × 19 and density 8.3%.

NOLA: This hidden matrix records the membership of 15,965 Facebook users from New Orleans across 92 different groups [22].
The density of this 0\u20131 matrix is 1.1%1.\nWe start with an experiment that addresses the following question: \u201cCan entry-wise PDFs help us\nidentify the values of the cells of the hidden matrix?\u201d To quantify this, we \ufb01rst look at the distribution\nof values of entry-wise PDFs per dataset, shown in Figure 2(a) for the DBLP dataset (the distribution\nof entry-wise PDFs is similar for the NOLA dataset). The \ufb01gure demonstrates that the overwhelming\nmajority of the P(i, j) entries are small, smaller than 0.1.\nWe then address the question: \u201cCan entry-wise PDFs guide us towards effectively querying the\nhidden matrix H so that its entries are more accurately identi\ufb01ed?\u201d For this, we iteratively query\nentries of H. At each iteration, we query 10% of unknown cells and we compute the entry-wise\nPDFs P after having these entries \ufb01xed. Figures 2(b) and 2(c) show the Error(P, H) after each itera-\ntion for the DBLP and NOLA datasets; values of Error(P, H) close to 0 imply that our method could\nreconstruct H almost exactly. The two lines in the plots correspond to RANDOMFIX and INFOR-\nMATIVEFIX strategies for selecting the queries at every step. The former picks 10% of unknown\ncells to query uniformly at random at every step. The latter selects 10% of cells with PDF values\nclosest to 0.5 at every step. The results demonstrate that INFORMATIVEFIX is able to reconstruct\nthe table with signi\ufb01cantly fewer queries than RANDOMFIX. Interestingly, using INFORMATIVEFIX\nwe can fully recover the hidden matrix of the NOLA dataset by just querying 30% of entries. Thus,\nthe values of entry-wise PDFs can be used to guide adaptive exploration of the hidden datasets.\n\nScalability:\nIn a \ufb01nal experiment, we explored the accuracy/speedup tradeoff obtained by t-\nBinning. 
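A minimal sketch of the binning idea (our own illustration; the paper does not spell out the exact grouping rule): sort the column sums and greedily open a new bin whenever a sum differs from the current bin's anchor by more than t, so that the per-column PDF computation runs once per bin rather than once per column.

```python
def t_binning(col_marginals, t):
    # Group columns whose sums differ by at most t; each bin is represented
    # by a single anchor column sum, so Adjust runs once per bin.
    order = sorted(range(len(col_marginals)), key=lambda j: col_marginals[j])
    reps, assignment = [], [0] * len(col_marginals)
    bin_start = None
    for j in order:
        s = col_marginals[j]
        if bin_start is None or s - bin_start > t:
            bin_start = s          # open a new bin anchored at this sum
            reps.append(s)
        assignment[j] = len(reps) - 1
    return reps, assignment
```

Because every sum in a bin lies within t of the bin's anchor, any two columns in the same bin differ by at most t, matching the t-Binning condition described above.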
For the DBLP and NOLA datasets, we observed that using t = 1, 2, 4 reduced the number of columns (and thus the running time) by a factor of at least 2, 3, and 4, respectively. For the same datasets, we evaluate the accuracy of the t-Binning results by comparing the values of P_t computed for t ∈ {1, 2, 4} with the values of P_0 (obtained by IS on the original dataset). In all cases, and for all values of t, we observe that Error(P_t, P_0) (defined in Equation (7)) is low, never exceeding 1.5%. Even the maximum entry-wise difference between P_0 and P_t is consistently about 0.1; note that such high error values occur in only one out of the millions of entries in P. Finally, we also experimented with an even larger dataset obtained through the Yahoo! Research Webscope program. This is a 140,000 × 4252 matrix of users and their participation in groups. For this dataset we observe that an 80% reduction in the number of columns introduces an average error of only 1.7e−4 (for t = 4).

6 Conclusions

We started with a simple question: "Given the row and column marginals of a hidden binary matrix H, what can we infer about the matrix itself?" We demonstrated that existing optimization-based approaches for addressing this question fail to provide detailed intuition about the possible values of particular cells of H. We then introduced the notion of entry-wise PDFs, which capture the probability that a particular cell of H is equal to 1. From the technical point of view, we developed IS, a parallelizable algorithm that efficiently and accurately approximates the values of the entry-wise PDFs for all cells simultaneously. The key characteristic of IS is that it computes the entry-wise PDFs without generating any of the matrices in the dataspace defined by the input row and column marginals, doing so by implicitly sampling from the dataspace.
Our experiments with synthetic and real data demonstrated the accuracy of IS in computing entry-wise PDFs, as well as the practical utility of these PDFs for better understanding the hidden matrix.

Acknowledgements

This research was partially supported by NSF grants CNS-1017529 and III-1218437, and by a gift from Microsoft.

¹The dataset is available at: http://socialnetworks.mpi-sws.org/data-wosn2009.html