Smoothed Analysis of Discrete Tensor Decomposition and Assemblies of Neurons

Advances in Neural Information Processing Systems (NeurIPS 2018), pages 10857–10867

Nima Anari (Computer Science, Stanford University) anari@cs.stanford.edu
Constantinos Daskalakis (EECS, MIT) costis@csail.mit.edu
Wolfgang Maass (Theoretical Computer Science, Graz University of Technology) maass@igi.tugraz.at
Christos H. Papadimitriou (Computer Science, Columbia University) christos@cs.columbia.edu
Amin Saberi (MS&E, Stanford University) saberi@stanford.edu
Santosh Vempala (Computer Science, Georgia Tech) vempala@gatech.edu

Abstract

We analyze linear independence of rank one tensors produced by tensor powers of randomly perturbed vectors. This enables efficient decomposition of sums of high-order tensors. Our analysis builds upon Bhaskara et al. [3] but allows for a wider range of perturbation models, including discrete ones. We give an application to recovering assemblies of neurons.

Assemblies are large sets of neurons representing specific memories or concepts. The size of the intersection of two assemblies has been shown in experiments to represent the extent to which these memories co-occur or these concepts are related; the phenomenon is called association of assemblies. This suggests that an animal's memory is a complex web of associations, and poses the problem of recovering this representation from cognitive data. Motivated by this problem, we study the following more general question: Can we reconstruct the Venn diagram of a family of sets, given the sizes of their ℓ-wise intersections? We show that as long as the family of sets is randomly perturbed, it is enough for the number of measurements to be polynomially larger than the number of nonempty regions of the Venn diagram to fully reconstruct the diagram.

1 Introduction

Tensor decomposition is one of the key algorithmic tools for learning many latent variable models [1, 5, 14, 19]. In practice, tensor decomposition methods based on gradient descent and the power method have been observed to work well [9, 16]. Theoretically, determining the minimum number of rank one components in the tensor decomposition is known to be NP-hard in the worst case [11, 12], so tensor decomposition is usually analyzed in the average case.
Several algorithms have been analyzed in the average case, where the input tensor is produced according to some probabilistic model; for example, see Bhaskara et al. [3], De Lathauwer et al. [7], Goyal et al. [10], as well as sum-of-squares-based algorithms like Barak et al. [2], Ge and Ma [8], Hopkins et al. [13], Ma et al. [18].
The average case models studied in the literature generally fall into two categories. They either assume components of the tensor are fully random, i.e., generated from a known distribution (e.g., Gaussian), or they follow a smoothed analysis setting where some adversarially chosen instance is perturbed by random noise; see for example Bhaskara et al. [3], Goyal et al. [10], Ma et al. [18]. Our work falls into the second category.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

We build upon the framework used in Bhaskara et al. [3], which reduces decomposing sums of rank one tensors to showing robust linear independence of related rank one tensors, by using Jennrich's algorithm, also known as Chang's lemma [5, 17]. The main departing point of our work is our smoothed analysis of linear independence, which we base on a new notion we call echelon trees, a generalization of Gaussian elimination and echelon form to high-order tensors, which might be of independent interest. We also get improved guarantees compared to Bhaskara et al. [3] when the tensors are of high enough order.
The main feature of our analysis is that it can handle discrete perturbations. To illustrate, suppose that vectors X_1, ..., X_m ∈ R^n are drawn from some unknown distribution and our goal is to recover them by (noisily) observing Σ_i X_i^{⊗ℓ} for small values of ℓ. Bhaskara et al. [3] showed that, up to constant factor blow-ups in ℓ, an efficient algorithm can do this as long as the X_i^{⊗ℓ} are linearly independent in a robust sense. Note that the set of vector tuples (X_1, ..., X_m) for which X_1^{⊗ℓ}, ..., X_m^{⊗ℓ} are linearly dependent can be defined by polynomial equations, using determinants, and is therefore an algebraic variety. As long as m ≪ n^ℓ, this variety will have dimension smaller than the whole space, so we expect most vector tuples to fall outside it. Bhaskara et al. [3] showed that starting from an arbitrary set of vectors X_1, ..., X_m, by adding Gaussian noise, the new tuple will lie far away from this variety. Our analysis, on the other hand, handles a much wider class of perturbations. For example, if each X_i is independently chosen at random from a "large enough" discrete set such as the vertices of an arbitrary hypercube, we show that with very high probability the resulting tensors are linearly independent, again in a robust sense.
For our main application, described in the next section, it is important to assume components of the tensor come from a discrete set.

1.1 Assemblies of neurons and recovering sparse Venn diagrams

Experiments by neuroscientists over the past three decades [21] have identified neurons which are selectively activated when a real-world object¹ is seen (or more generally sensed).
It is now widely accepted [4] that these neurons are part of large cell assemblies, stable sets of highly interconnected neurons whose firing (more or less simultaneously and in unison) is tantamount to a cognitive event such as the sensing or imagining of a person, or of a word or concept (hence the other common name "concept cells").
In a recent experiment [15], a neuron firing when one real-world entity is seen (say, the Eiffel tower) but not another (e.g., Barack Obama) may start firing on presentation of an image of Obama after a visual experience associating the two, for example a picture of Obama in front of the Eiffel tower. This experiment has taught us that assemblies seem to be "mobile" and able to intersect in complex ways reflecting perceived varying degrees of association between the corresponding entities. The stronger the association between the entities, the larger the intersection of the corresponding assemblies will be. During one's life, presumably a complex mesh of entities and associations will be created, of some degree of permanence, reflecting the sum total of one's cognitive experiences.
All said, this complex mesh of memories in somebody's brain can be modeled as a Venn diagram where each set or assembly consists of neurons firing for a particular concept, and each region of the Venn diagram, a minimal set obtained from an intersection of assemblies and their complements, represents a class of neurons behaving the same way towards all concepts.
Alternatively to the Venn diagram, one may record associations between assemblies in a hypergraph. The entities are the sets or nodes, and the edges reflect associations between the nodes.
Furthermore, the hypergraph representing a person's state of knowledge can be adorned with edge weights reflecting the degree of affinity between a set of nodes (or equivalently, the size of the intersection of their corresponding sets).
This gives rise to several natural questions. The first question concerns reconstruction. How many experiments or observations are needed to identify the structure of cell assembly intersections, or in other words the Venn diagram? Here, we make two crucial assumptions. First, we assume that we can only measure the degree of association between a small number of entities or concepts. Second, the total number of classes of neurons (which behave similarly in response to stimuli) is bounded. In the language of sets, we assume the number of non-empty regions of the Venn diagram is upper bounded by some number m, and we can measure the sizes of k-wise intersections of any k of our n sets for 1 ≤ k ≤ ℓ for some small ℓ. We also allow for measurement errors.

¹ Or person; these are commonly known as "Jennifer Aniston neurons".

Our main result here is as follows: As long as the cell assemblies are slightly randomly perturbed, and as long as the number of measurements, (n choose 1) + (n choose 2) + ··· + (n choose ℓ), is polynomially larger than the number of nonempty regions of the Venn diagram, m, we can fully reconstruct the Venn diagram.
The perturbation of cell assemblies, a process which likely occurs naturally in the brain, is a mild assumption that we need in order to escape idiosyncratic cases. We solve the problem of reconstructing the Venn diagram by casting it as a tensor decomposition problem where the elements of the decomposition come from high-order tensors of the vertices of the hypercube.
We also explore a simpler graph-theoretic model of assembly association, motivated by more recent experimental findings [6, 15]: Assume that all assemblies have the same size K, and that two assemblies are associated if their intersection is of size at least b, and are not associated if the intersection is less than another threshold a < b; the results of De Falco et al. [6], Ison et al. [15] suggest that a is 4% of K, while b is 8% of K. We show that an unreasonably rich and complex family of graphs can be realized by associations (roughly, any graph of degree O(K/a)).

1.2 Problem formulation

Suppose that we have a Venn diagram formed by some n sets S_1, ..., S_n. We will assume that this Venn diagram has at most m nonempty regions. For our main application, each set S_i corresponds to neurons that respond to a particular stimulus, so we are assuming that there are at most m classes of neurons. We let U denote the set of neuron classes. We also have a weight function w : U → R_{≥0} representing the sizes of the various classes. Each set S_i ⊆ U is an assembly, and w(S_i) = Σ_{u∈S_i} w(u) is its weight. Our main question is the following:

Question 1. Given the sizes of the ℓ-wise intersections of S_1, ..., S_n for some constant ℓ, i.e., w(S_{i_1} ∩ ··· ∩ S_{i_ℓ}) for all i_1, ..., i_ℓ ∈ [n], can we recover the full Venn diagram of S_1, ..., S_n, i.e., the weights of all intersections formed by these sets and their complements?

Our main result is that as long as the set memberships of elements are slightly perturbed to avoid worst case scenarios, and as long as n^ℓ is polynomially larger than m = |U|, the answer is yes, and moreover there is an efficient algorithm that performs recovery.
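As a concrete illustration of the measurement model in Question 1, the following sketch (our own, with made-up sizes and random memberships) builds all pairwise intersection weights directly from set memberships, and checks that they coincide with the entries of a rank-one sum of membership vectors, which is how the tensor formulation organizes them:

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n, m, ell = 6, 8, 2        # n assemblies, m neuron classes, pairwise (ell = 2) data

# chi[u] in {0,1}^n encodes which of the sets S_1..S_n class u belongs to
chi = rng.integers(0, 2, size=(m, n))
w = rng.random(m)          # w(u): size (weight) of neuron class u

# measured data: w(S_{i1} ∩ ... ∩ S_{i_ell}) for every tuple of ell sets
def intersection_weight(idx):
    return sum(w[u] for u in range(m) if all(chi[u, i] == 1 for i in idx))

T = np.zeros((n,) * ell)
for idx in itertools.product(range(n), repeat=ell):
    T[idx] = intersection_weight(idx)

# the same measurements, organized as the rank-one sum  sum_u w(u) chi(u)^{⊗ ell}
T2 = sum(w[u] * np.multiply.outer(chi[u], chi[u]) for u in range(m))
assert np.allclose(T, T2)
```

The equality holds because, for 0/1 membership vectors, the product of coordinates chi(u)_{i_1} ··· chi(u)_{i_ell} is 1 exactly when class u lies in all the chosen sets.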
Our algorithm is also robust to inverse polynomial noise in the input.
We pose the question as a tensor decomposition problem in the following way: To each element u ∈ U assign a vector χ(u) ∈ {0,1}^n, where χ(u)_i indicates whether u ∈ S_i. Then the entries of the following tensor capture all ℓ-wise intersections:

T = Σ_{u∈U} w(u) · χ(u) ⊗ ··· ⊗ χ(u)   (ℓ times).

For simplicity of exposition, we assume weights are all equal to 1, but our results easily generalize, since each weight w(u) can be absorbed into χ(u)^{⊗ℓ}.

2 Notations and preliminaries

We denote the set {1, ..., n} by [n]. For a matrix A, we denote the minimum and maximum singular values of A by σ_min(A) and σ_max(A). We use ⟨·,·⟩ to denote the standard inner product.
We denote the tensor product of two vectors φ ∈ R^n and φ' ∈ R^m by φ ⊗ φ', which belongs to R^n ⊗ R^m ≃ R^{n×m}. We use the notation φ^{⊗ℓ} to denote φ ⊗ ··· ⊗ φ (ℓ times).
By abuse of notation we identify tensors T ∈ R^{n_1} ⊗ ··· ⊗ R^{n_ℓ} with multilinear maps from R^{n_1} × ··· × R^{n_ℓ} to R. In other words we let T(v_1, ..., v_ℓ) denote ⟨T, v_1 ⊗ ··· ⊗ v_ℓ⟩. We also use the notation T(·, v_2, ..., v_ℓ) to denote the map from R^{n_1} to R given by

T(·, v_2, ..., v_ℓ)(v_1) = T(v_1, ..., v_ℓ).

In general we can use · in place of any of the arguments of T. So for example T(·, ·, v_3, ..., v_ℓ) is interpreted as living in R^{n_1} ⊗ R^{n_2}. With a slight abuse of notation we let some of the inputs of T be merged together by tensor operations. In other words we let T(v_1 ⊗ v_2, v_3, ..., v_ℓ) be the same as T(v_1, ..., v_ℓ).
We use e_1, ..., e_n to denote the standard basis of R^n. For a tuple of coordinates I = (i_1, ..., i_ℓ) we let e_I denote e_{i_1} ⊗ ··· ⊗ e_{i_ℓ}.
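As a quick sanity check of this notation, a small NumPy sketch (our own illustration; the dimensions are arbitrary) showing that evaluating T as a multilinear map agrees with the inner product against a rank one tensor, and that partial application yields a vector:

```python
import numpy as np

rng = np.random.default_rng(1)
n1, n2, n3 = 3, 4, 5
T = rng.random((n1, n2, n3))                 # T in R^{n1} ⊗ R^{n2} ⊗ R^{n3}
v1, v2, v3 = rng.random(n1), rng.random(n2), rng.random(n3)

# T(v1, v2, v3) = <T, v1 ⊗ v2 ⊗ v3>
full = np.einsum('ijk,i,j,k->', T, v1, v2, v3)
outer = np.einsum('i,j,k->ijk', v1, v2, v3)  # the rank one tensor v1 ⊗ v2 ⊗ v3
assert np.isclose(full, np.sum(T * outer))

# partial application T(·, v2, v3) is a map from R^{n1} to R, i.e. a vector
partial = np.einsum('ijk,j,k->i', T, v2, v3)
assert np.isclose(partial @ v1, full)
```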
With this notation, the entry corresponding to coordinate (i_1, ..., i_ℓ) of a tensor T can be written as T(e_I) = T(e_{i_1}, ..., e_{i_ℓ}).

3 Tensor decomposition

Suppose that we have a finite universe U of elements with a vector χ(u) ∈ R^n assigned to each u ∈ U. Our goal is to recover the χ(u)'s by observing Σ_u χ(u)^{⊗ℓ}. A necessary condition is for the χ(u)^{⊗ℓ}'s to be linearly independent; otherwise it is an easy exercise to show that there is another decomposition Σ_u (c_u χ(u))^{⊗ℓ} for some positive weights {c_u}_{u∈U} not all equal to 1. The framework introduced by Bhaskara et al. [3] shows that linear independence is not just necessary, but up to a constant factor blow-up in ℓ, it is sufficient. A more detailed account is given in the supplementary materials.
We also use another trick from this framework, which allows us to replace the symmetric tensors χ(u)^{⊗ℓ} with asymmetric ones. If we divide the coordinates [n] into ℓ roughly equal-sized parts I_1, ..., I_ℓ and define χ(u)^{(i)} to be the projection of χ(u) onto the i-th part, then χ(u)^{(1)} ⊗ ··· ⊗ χ(u)^{(ℓ)} is a subtensor of χ(u)^{⊗ℓ}. So linear independence of these tensors proves linear independence of the χ(u)^{⊗ℓ}'s. The advantage of this trick is that when we introduce perturbations to χ(u)^{(1)}, ..., χ(u)^{(ℓ)}, we do not have to worry about consistently perturbing the same coordinates and we can potentially use independent randomness. For simplicity of notation, from here on, we use n (as opposed to n/ℓ) to denote the dimension of each χ(u)^{(i)}. So now we can work with the following tensor:

T = Σ_{u∈U} χ(u)^{(1)} ⊗ ··· ⊗ χ(u)^{(ℓ)}.

Our main result is that the components of this sum are robustly linearly independent, assuming the components χ(u)^{(i)} are randomly perturbed.
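The decomposition algorithm itself is deferred to [3] and the supplementary material; as a rough illustration of the simultaneous-diagonalization idea behind Jennrich's algorithm, here is our own sketch (made-up dimensions; order 3 and, for simplicity, as many components as dimensions, with generic random components):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 8                                   # dimension; n components for simplicity
A, B, C = rng.standard_normal((3, n, n))

# T = sum_u A[:, u] ⊗ B[:, u] ⊗ C[:, u]
T = np.einsum('iu,ju,ku->ijk', A, B, C)

# contract the last mode with two random vectors
x, y = rng.standard_normal((2, n))
M1 = np.einsum('ijk,k->ij', T, x)       # = A @ diag(C.T @ x) @ B.T
M2 = np.einsum('ijk,k->ij', T, y)       # = A @ diag(C.T @ y) @ B.T

# eigenvectors of M1 @ inv(M2) = A @ diag((C.T@x)/(C.T@y)) @ inv(A)
# are the columns of A, up to scaling and permutation
_, eigvecs = np.linalg.eig(M1 @ np.linalg.inv(M2))

# cosine similarities between recovered directions and true components:
# each row of cos should have a (near) 1 entry marking the matched column of A
V = eigvecs.real / np.linalg.norm(eigvecs.real, axis=0)
cos = np.abs(V.T @ (A / np.linalg.norm(A, axis=0)))
```

The robustness of this recovery is exactly what hinges on the components being robustly linearly independent, which is the subject of the rest of this section.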
We remark that this implies robust linear independence of {χ(u)^{⊗ℓ}}_{u∈U} as well, so we can recover them from the sum Σ_{u∈U} χ(u)^{⊗ℓ}.
We first define our model of perturbations:

Definition 2. Assume that a vector X ∈ R^d is drawn according to some distribution D. We call D a (δ, p)-nondeterministic distribution if for every coordinate i ∈ [d] and any interval of the form (t − δ, t + δ) we have

P[X_i ∈ (t − δ, t + δ) | X_{−i}] ≤ p,

where X_{−i} represents the projection of X onto the coordinates [d] − {i}.

For a set of random vectors {X_i}, we call their joint distribution (δ, p)-nondeterministic iff their concatenation is (δ, p)-nondeterministic. In our setting, we will assume that for each u ∈ U, the vectors χ(u)^{(1)}, ..., χ(u)^{(ℓ)} are chosen from a (δ, p)-nondeterministic distribution.
Two examples of (δ, p)-nondeterministic perturbations can be obtained as follows:

Example 3. Suppose that each χ(u)^{(i)} is chosen adversarially from {0,1}^n, but then each bit is independently flipped with some probability q. This distribution is (1/2, max(q, 1−q))-nondeterministic.

Example 4. Suppose that each χ(u)^{(i)} is chosen adversarially from R^n, but a standard Gaussian noise of total variance ρ² is added to each one. Then for any δ > 0, this distribution is (δ, erf(δ√n/ρ))-nondeterministic.

Gaussian perturbations are the model used in Bhaskara et al. [3]. Our main result is the following:

Theorem 5. Assume that for each u ∈ U, the concatenation of the n-dimensional vectors {χ(u)^{(i)}}_{i∈[ℓ]} is drawn from a distribution D that is (δ, p)-nondeterministic. Let A be the matrix whose columns are given by the flattened a(u) = χ(u)^{(1)} ⊗ ··· ⊗ χ(u)^{(ℓ)} for the various u. Then, assuming |U| ≤ (cn)^ℓ, we have

P[σ_min(A) < (δ/n)^ℓ] ≤ n^{2ℓ} p^{(1−c)n}.

This theorem shows how the (δ, p)-nondeterministic property ensures robust linear independence. To prove it, we use a strategy similar to Bhaskara et al. [3], by proving a bound on the leave-one-out distance. The leave-one-out distance is closely related to σ_min(A), and only differs from it by a factor of at most √|U| ≤ n^{ℓ/2} [3]. It is enough to prove that for any fixed u,

dist(a(u), span{a(u')}_{u'∈U−{u}}) ≥ (δ/√n)^ℓ

with probability at least 1 − n^ℓ p^{(1−c)n}. Here dist measures the distance of a vector to the closest point in a linear subspace. A union bound then implies that the leave-one-out distance is large for all u. As in Bhaskara et al. [3], we simplify the analysis by treating span{a(u')}_{u'∈U−{u}} as a generic linear subspace V ⊆ (R^n)^{⊗ℓ}, and only using the fact that dim(V) < (cn)^ℓ. Noting that n^ℓ ≥ 1 + n + n² + ··· + n^{ℓ−1}, it is enough to prove the following:

Lemma 6. Assume that the vectors φ^{(1)}, ..., φ^{(ℓ)} are drawn according to a (δ, p)-nondeterministic distribution. Further assume that V ⊆ (R^n)^{⊗ℓ} is a subspace of dimension at most (cn)^ℓ. Then

P[dist(φ^{(1)} ⊗ ··· ⊗ φ^{(ℓ)}, V) < (δ/√n)^ℓ] ≤ (1 + n + n² + ··· + n^{ℓ−1}) p^{(1−c)n}.

In the rest of this section we prove lemma 6.
Let W = V^⊥ ⊆ (R^n)^{⊗ℓ} be the linear subspace of all tensors that vanish on V, or in other words have zero dot product with every member of V. Then dim(W) ≥ (1 − c^ℓ) n^ℓ. We will show that with high probability there is an element T ∈ W such that ‖T‖ ≤ n^{ℓ/2} and

|⟨T, φ^{(1)} ⊗ ··· ⊗ φ^{(ℓ)}⟩| = |T(φ^{(1)}, ..., φ^{(ℓ)})| ≥ δ^ℓ.

This implies that

dist(φ^{(1)} ⊗ ··· ⊗ φ^{(ℓ)}, V) ≥ δ^ℓ / ‖T‖ ≥ n^{−ℓ/2} δ^ℓ = (δ/√n)^ℓ,

and the proof would be complete.
We find it instructive to first prove this fact for ℓ = 1 and then for general ℓ.

3.1 The case ℓ = 1

Proof of lemma 6 for ℓ = 1. We will generate a sequence T_1, ..., T_{dim(W)} ∈ W such that ‖T_i‖_∞ ≤ 1 for all i. This ensures that ‖T_i‖ ≤ √n. We will then show that

P[∃i : |T_i(φ)| ≥ δ] ≥ 1 − p^{(1−c)n}.    (1)

We will first pick T_1 to be any nonzero element of W.
By rescaling, we can assume that ‖T_1‖_∞ = 1 and that T_1(e_j) = 1 for some j. Let us call j the pivot point of T_1. By rearranging the coordinates we can assume without loss of generality that j = 1. In other words, T_1(e_1) = 1 and ‖T_1‖_∞ = 1.
In order to pick T_2, consider the subspace {T ∈ W | T(e_1) = 0}. This subspace has dimension at least dim(W) − 1, and we can pick T_2 to be any nonzero element of it. As before, we can, without loss of generality and by rescaling, assume that T_2(e_2) = 1 and ‖T_2‖_∞ = 1.
When picking T_i, we pick any nonzero element of {T ∈ W | T(e_j) = 0 for all j < i} and by rescaling and rearranging the coordinates assume that T_i(e_i) = 1 and ‖T_i‖_∞ = 1. Thus we make sure that the pivot point of T_i is i. A keen observer will notice that T_1, ..., T_{dim(W)} can also be obtained by a modified Gaussian elimination procedure run on some basis of the space W.
Now that we have fixed T_1, ..., T_{dim(W)}, it remains to prove eq. (1).
To do this, let us fix the coordinates of the random vector φ = φ^{(1)} one by one, starting from φ_n and going backwards to φ_1. Once we have fixed φ_{dim(W)+1}, ..., φ_n, we can argue about the probability of the event |T_{dim(W)}(φ)| < δ. Since T_{dim(W)}(e_i) = 0 for i < dim(W), we have

T_{dim(W)}(φ) = φ_{dim(W)} + T_{dim(W)}(e_{dim(W)+1}) φ_{dim(W)+1} + ··· + T_{dim(W)}(e_n) φ_n.

But t := T_{dim(W)}(e_{dim(W)+1}) φ_{dim(W)+1} + ··· + T_{dim(W)}(e_n) φ_n is a constant once we have fixed φ_{dim(W)+1}, ..., φ_n. So |T_{dim(W)}(φ)| < δ if and only if φ_{dim(W)} ∈ (−t − δ, −t + δ). Because φ is distributed according to a (δ, p)-nondeterministic distribution, this event happens with probability at most p. In other words,

P[|T_{dim(W)}(φ)| < δ] ≤ p.

If this event does not occur, we are already done. Otherwise we can condition on φ_{dim(W)}, ..., φ_n, and look at the event |T_{dim(W)−1}(φ)| < δ. Once we condition on φ_{dim(W)}, ..., φ_n, this event becomes independent of the previous event and we can again upper bound its probability by p. So we have

P[|T_{dim(W)−1}(φ)| < δ | |T_{dim(W)}(φ)| < δ] ≤ p,

which implies

P[|T_{dim(W)−1}(φ)| < δ ∧ |T_{dim(W)}(φ)| < δ] ≤ p².

By continuing this, in the end we get

P[∧_{i=1}^{dim(W)} |T_i(φ)| < δ] ≤ p^{dim(W)} ≤ p^{(1−c)n},

which is the complement of eq. (1).

3.2 The general case

Here we describe a structure that we name the echelon tree. This definition is motivated by the Gaussian elimination procedure for matrices that produces an echelon form. Our definition can be seen as a generalization of this form to tensor spaces.
We first describe an index tree for R^{n_1×···×n_ℓ}: Consider an abstract rooted tree T of height ℓ where the nodes at level k are labeled by different partial indices from [n_1] × [n_2] × ··· × [n_k]; the root has the empty label and resides at level 0, and all leaves reside at level ℓ. We require the indices to be consistent with the tree structure, i.e., all children (and by extension descendants) of a node labeled I = (i_1, ..., i_k) must contain I as the prefix of their label. We further assume that T is ordered, i.e., each node of T has an ordering over its children. This enables us to talk about a post-order traversal of the tree, a linear ordering of the nodes of the tree, which we denote by the binary relation ≺. For two nodes labeled I and J, we let I ≺ J exactly when (i) I is a descendant of J, or (ii) there are ancestors I', J' of I, J with a common parent who places I' before J' (according to the ordering induced by the parent on its children).

Definition 7. An index tree for R^{n_1×···×n_ℓ} is a height-ℓ tree T of partial indices together with a post-order traversal ordering on its nodes, as described above.

We emphasize that the nodes of an index tree have different labels, so we consider the partial indices the same as the nodes.
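The modified Gaussian elimination alluded to in section 3.1 can be made concrete in code; the following is our own illustrative sketch (it assumes the given basis of W is linearly independent, and records pivot positions instead of reordering coordinates):

```python
import numpy as np

def echelon_pivots(basis):
    """Row-reduce a linearly independent basis of W (rows) so that each T_i has
    sup-norm 1, a pivot entry of magnitude 1, and zeros at all earlier pivots."""
    B = np.array(basis, dtype=float)
    pivots = []
    for i in range(B.shape[0]):
        B[i] /= np.abs(B[i]).max()           # rescale so ||T_i||_inf = 1
        j = int(np.argmax(np.abs(B[i])))     # pivot point: |T_i(e_j)| = 1
        pivots.append(j)
        for k in range(i + 1, B.shape[0]):   # make later rows vanish at e_j
            B[k] -= (B[k, j] / B[i, j]) * B[i]
    return B, pivots

rng = np.random.default_rng(3)
W_basis = rng.standard_normal((4, 9))        # basis of a 4-dimensional W in R^9
T, piv = echelon_pivots(W_basis)
```

Since only row operations are used, every output row still lies in W; the pivot structure is exactly the height-1 echelon property discussed next.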
For example, an index tree of height 1 is identical to an ordered list i^{(1)}, ..., i^{(s)} of elements of [n_1], with no repetitions allowed. Next we define an echelon tree.

Definition 8. An echelon tree is an index tree where each leaf I is additionally labeled by an element T_I ∈ R^{n_1×···×n_ℓ}. We require that T_I(e_I) ≠ 0 and that for every node J that appears before I in the post-order traversal, i.e., J ≺ I, the following identity holds:

T_I(e_J, ·, ..., ·) = 0.

Note that the identity in the above definition requires an entire sub-array of T_I to be zero. For example, a height-1 echelon tree is a list of distinct indices i^{(1)}, ..., i^{(s)} in [n_1] together with vectors T^{(1)}, ..., T^{(s)} ∈ R^{n_1} such that T^{(j)} has zeros in the i^{(1)}, ..., i^{(j−1)} entries and has a nonzero i^{(j)}-th entry. Notice the similarity to the echelon form obtained by Gaussian elimination on a matrix. In particular, for a height-1 echelon tree, the vectors T^{(1)}, ..., T^{(s)} must be linearly independent.
We say that T is an echelon tree for the linear subspace W ⊆ R^{n_1×···×n_ℓ} if for all leaves I we have T_I ∈ W. Notice that we can collapse or flatten consecutive levels of an echelon tree, and the result remains an echelon tree. In this operation, nodes of a particular level i are removed, and each orphaned node of level i+1 is assigned to its grandparent (of level i−1). We then treat the indices as coming from [n_1] × ··· × [n_i n_{i+1}] × ··· × [n_ℓ], i.e., we merge the indices of levels i and i+1. This also corresponds to partially flattening the tensors T_I and considering them as elements of R^{n_1×···×n_i n_{i+1}×···×n_ℓ}. It is easy to check that these operations preserve the properties in definition 8:

Fact 9. Collapsing an echelon tree at level i produces an echelon tree.

The main question we would like to address here is how large an echelon tree can be constructed for a subspace W. For example, for R^{n_1×···×n_ℓ} one can get a full tree, where nodes at level i−1 have branching factor n_i, by simply placing the standard basis of R^{n_1×···×n_ℓ} at the leaves. We measure the size of a tree by its fractional branching factor.

Definition 10. An echelon tree T has fractional branching (α_1, ..., α_ℓ) ∈ [0,1]^ℓ if each node I at level i−1 has at least α_i n_i children. For a single number α ∈ [0,1], we say T has fractional branching α when it has fractional branching (α, α, ..., α).

Note that fractional branching α implies that the tree has at least α^ℓ n_1 ··· n_ℓ leaves. On the other hand, repeated applications of fact 9 on the echelon tree would produce a height-1 echelon tree, and we have already observed that the vectors assigned to the leaves of such a tree must be linearly independent. So this implies that dim(W) ≥ α^ℓ n_1 ··· n_ℓ. There is a partial inverse to this statement: If W ⊆ R^{n_1×···×n_ℓ} has dimension at least (1 − c^ℓ) · n_1 ··· n_ℓ, then there is an echelon tree with fractional branching 1 − c for W. However, this fact is not "robust", since the elements of W assigned to the leaves can have arbitrarily small or large entries. Instead we prove the following:

Theorem 11. If W ⊆ R^{n_1×···×n_ℓ} has dimension at least (1 − c^ℓ) · n_1 ··· n_ℓ, then there is an echelon tree with fractional branching 1 − c for W such that for every leaf I we have ‖T_I‖_∞ = 1 and |T_I(e_I)| = 1.

Let us first see why theorem 11 is enough to prove lemma 6.

Proof of lemma 6 for general ℓ. Note that ‖T_I‖_∞ = 1 implies that ‖T_I‖ ≤ n^{ℓ/2}. So it suffices to show that |T_I(φ^{(1)}, ..., φ^{(ℓ)})| ≥ δ^ℓ for some I with high probability.
Let us say that an echelon tree is x-large when |T_I(e_I)| ≥ x for all leaves I. Theorem 11 guarantees that the echelon tree produced by it is 1-large.
Our strategy is to fix φ^{(ℓ)}, φ^{(ℓ−1)}, ..., φ^{(1)} in that order, and simultaneously reduce the height of our echelon tree by 1 each time. When we fix φ^{(ℓ)}, we can get a smaller echelon tree in the following way: For each leaf I in the echelon tree, consider the reduced tensor T_I(·, ..., ·, φ^{(ℓ)}) ∈ R^{n_1×···×n_{ℓ−1}} as a candidate tensor for the parent of I. Now let J be a node of level ℓ−1. Its children have produced candidate tensors for J. Pick the candidate T with the highest |T(e_J)| to be T_J. In this way we have removed the lowest level of the tree and have assigned appropriate tensors to the new leaves.
Our goal is to prove that if we start with an x-large echelon tree, then with high probability the next echelon tree is δx-large. Inductively this proves that, with high probability over the choice of φ^{(1)}, ..., φ^{(ℓ)}, we have |T_I(φ^{(1)}, ..., φ^{(ℓ)})| ≥ δ^ℓ for some leaf I of the original echelon tree, completing the proof.
For a fixed node J of level ℓ−1, we want to show that the quantity T_I(e_J, φ^{(ℓ)}) is at least δx in magnitude for some child I of J. But this is very similar to the ℓ = 1 case of lemma 6, which we have already proved. The difference is that the pivots are not necessarily equal to 1, but are at least x in magnitude. This implies that

P[for all children I of J : |T_I(e_J, φ^{(ℓ)})| ≤ δx] ≤ p^{(1−c)n}.

The number of nodes at level ℓ−1 is at most n^{ℓ−1}, so by a union bound, we get that with probability at least 1 − n^{ℓ−1} p^{(1−c)n}, the tree produced at the next level is δx-large (the union bound is over fewer than n^{ℓ−1} events, each corresponding to one J). Induction completes the proof.

Now we give a proof of theorem 11. We use induction to prove a stronger version.
Theorem 11 will be a corollary of the following by setting α_1 = ··· = α_ℓ = 1 − c.

Theorem 12. If W ⊆ R^{n_1×···×n_ℓ} is a subspace, and α_1, ..., α_ℓ ∈ [0,1] are such that

(1 − α_1)(1 − α_2) ··· (1 − α_ℓ) ≥ 1 − dim(W)/(n_1 ··· n_ℓ),

then there is an echelon tree for W with fractional branching (α_1, ..., α_ℓ) such that for each leaf I we have ‖T_I‖_∞ = 1 and |T_I(e_I)| = 1.

Proof. We use induction on ℓ. For the base case of ℓ = 1, we have α_1 ≤ dim(W)/n_1 and we want an echelon tree with branching factor α_1 n_1 ≤ dim(W). We have already proved this case.
Now assume we have proved the statement for ℓ−1 and want to prove it for ℓ. Consider partially flattening the tensor space by merging the first two dimensions, i.e., considering W as a subspace of R^{n_1 n_2 × n_3 × ··· × n_ℓ}. Let us fix β ∈ [0,1] such that the premise of the induction hypothesis holds, so that we can get an echelon tree of height ℓ−1 with fractional branching (β, α_3, ..., α_ℓ). Nodes at level 1 of this tree have indices in [n_1 n_2], and there are at least βn_1n_2 of them. Considering these indices as living in [n_1] × [n_2], by the pigeonhole principle at least βn_1n_2/n_1 = βn_2 of them will have the same first component; let's call this component i_1 ∈ [n_1]. We can now extract the subtrees of these βn_2 elements and join them into an echelon tree of height ℓ. The common parent of these nodes will have index i_1. So far we have constructed an echelon tree of height ℓ with fractional branching (1/n_1, β, α_3, ..., α_ℓ).
Now consider the subspace {T ∈ W | T(e_{i_1}, ·, ..., ·) = 0}. We think of this subspace as living in R^{(n_1−1)n_2×n_3×···×n_ℓ}, since index i_1 has been eliminated from the first dimension.
We can again apply the induction hypothesis to this space and, as long as the premise holds, obtain an echelon tree of height ℓ−1 with fractional branching (β, α_3, ..., α_ℓ). We can apply the pigeonhole principle again to find β(n_1−1)n_2/(n_1−1) = βn_2 level-1 nodes having the same first index i_2. We extract a height-ℓ echelon tree from them and join it with the height-ℓ echelon tree we already have. At the end we will have an echelon tree with fractional branching (2/n_1, β, α_3, ..., α_ℓ).
Suppose we have repeated this procedure γn_1 many times and currently have a height-ℓ echelon tree with fractional branching (γ, β, α_3, ..., α_ℓ). As long as the premise of the induction hypothesis holds we can grow this echelon tree. The current subspace is {T ∈ W | T(e_{i_j}, ·, ..., ·) = 0 for j ∈ [γn_1]}, which lives in R^{(1−γ)n_1n_2×n_3×···×n_ℓ}. The dimension of this subspace is at least dim(W) − γ n_1n_2 ··· n_ℓ. So the premise of the induction hypothesis holds as long as

(1 − β)(1 − α_3) ··· (1 − α_ℓ) ≥ 1 − (dim(W) − γn_1 ··· n_ℓ)/((1 − γ)n_1n_2 ··· n_ℓ) = (1 − dim(W)/(n_1 ··· n_ℓ))/(1 − γ).

This means that as long as (1 − γ)(1 − β)(1 − α_3) ··· (1 − α_ℓ) ≥ 1 − dim(W)/(n_1 ··· n_ℓ), we can grow the echelon tree.
To finish the proof, we set β = α_2, which means that while γ < α_1, we can grow the echelon tree. So when this procedure stops we have an echelon tree with fractional branching (α_1, ..., α_ℓ).

3.3 Implications for the main question

Our result, theorem 5, together with results from [3] (see the supplementary material), implies that under very mild assumptions we can recover S_1, ..., S_n from their ℓ-wise intersections as long as |U| ≤ n^{Θ(ℓ)}.
These mild assumptions are necessary to prevent adversarially constructed examples that have no hope of unique recovery.

To get a sense of the mild assumptions that we need, let us discuss the parameters that appear in theorem 5. We assume that $\ell$ is a constant that does not grow with $n$. We can take $c$ to be some fixed constant as well, for example $1/2$, or even $1/\sqrt[\ell]{2}$. If we perturb our cell assemblies according to example 3, i.e., flip assembly memberships for each neuron class and assembly pair with probability $q$, how large a $q$ do we need for the conditions of theorem 5 and [3] to be satisfied? The distribution we get for the $\chi(u)$'s is going to be $(1/2, 1 - q)$-nondeterministic as long as $q \leq 1/2$. So $\delta = 1/2$ is a constant. The only condition we now need is for the failure probability to be small. This roughly translates to
$$n^{O(\ell)} (1 - q)^{(1 - c) n} \ll 1,$$
which will be satisfied for $q = \Omega(\log n / n)$. In other words, we only have to flip each coordinate of $\chi(u)$ with probability $O(\log n / n)$. On average, each neuron's membership will be changed in about $O(\log n)$ of the assemblies, which is a very small fraction of the assemblies. For slightly larger values of $q$, e.g., $q = n^{\epsilon - 1}$, the probability of failure becomes exponentially small, similar to [3].

We also assumed that $w(u) = 1$ for all $u \in U$. In general this is not needed. As long as the weights $w(u)$ are in a range whose upper bound is at most a polynomially bounded factor larger than the lower bound, we can absorb the weights into the vector $\chi(u)$, and the running time and accuracy will only suffer by a polynomially bounded factor.

We also remark that recovering a $\{0, 1\}^n$ vector within an additive error of $1/n$ is the same as exact recovery (by rounding the coordinates).
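The rounding step is elementary; a minimal sketch (the function name is ours, not from the paper):

```python
import numpy as np

def round_to_binary(v_hat):
    # If every coordinate of v_hat is within 1/n (in particular, within 1/2)
    # of the true 0/1 value, rounding each coordinate recovers it exactly.
    return (np.asarray(v_hat) >= 0.5).astype(int)

true_v = np.array([1, 0, 1, 1, 0])
v_hat = true_v + np.array([0.1, -0.05, -0.12, 0.08, 0.11])  # additive error < 1/2
assert np.array_equal(round_to_binary(v_hat), true_v)
```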
So by setting the recovery error (see the supplementary material) to $1/n$ we get exact recovery.

Finally, we remark that even though we are mostly interested in the case where $\ell = O(1)$, our dependence on $\ell$ seems to be better than the results of [3], even in the setting of Gaussian perturbations. In particular, our running time (as well as our tolerance for error) grows polynomially with $n^\ell$, whereas the running time of [3] grows with $n^{3\ell}$. When adding Gaussian noise of total variance $\rho^2$ as in example 4, we can treat our vectors as coming from an $(O(\rho/\sqrt{n}), 1/2)$-nondeterministic distribution. This means our probability of failure will be at most $n^{2\ell} / 2^{(1 - c) n}$. To have a fair comparison, we need to allow the number of components to be roughly half the total dimension, so we need to let $c = 1/\sqrt[\ell]{2} \simeq 1 - \Theta(1/\ell)$. So the probability of failure will be roughly $\exp(O(\ell \log n) - \Omega(n/\ell))$. For large enough values of $\ell$ this is much better than the guarantee of $\exp(-\Theta(n^{1/3^\ell}))$ of [3].

4 Association graphs and the soft model

When the number of observations is smaller than what is needed for reconstruction, we can still ask whether there exists some Venn diagram that is consistent with the observations. Which classes of weighted graphs (or hypergraphs) can be represented by Venn diagrams? Interestingly, a similar model was formulated almost three decades ago, motivated by quantum mechanics and spin glass systems, and a mathematical object called the correlation polytope was defined to frame that investigation [20].
It is not hard to show that membership in the polytope is an NP-hard problem, and that natural optimization variants of it are hard to approximate.

In this section we formulate a promise version of the problem, where either the intersection is above a certain threshold (corresponding to association) or below another (corresponding to non-association), which seems to be more tractable.

More precisely, we are now given an unweighted graph. The nodes still stand for assemblies of neurons, all of the same size $K$, out of a universe of $N$ neurons, and the edges signify association; the difference is that, in this model, if two assemblies are associated then they have an intersection of size at least $a$, whereas if they are not, then their intersection is at most $b$. The intended relationship between these numbers is that $N$ is much larger than $K$ (we take it to be a power of $K$), and $K$ is in turn much larger than $a$, while $a$ is quite a bit larger than $b$. To fix ideas, in the sequel we take $N = K^2$ and $b < a$ to be small constant fractions of $K$; in the experiments in [6, 15], $a$ and $b$ are found to be about 8% and 4% of $K$, respectively. We call a graph $G = (V, E)$ representable with parameters $(N, K, a, b)$ if every node of $G$ can be associated with a set of $K$ neurons such that for any two adjacent nodes the corresponding sets have intersection at least $a$, while for any two non-adjacent nodes the corresponding sets have intersection at most $b$. The question is, which graphs are representable?

Theorem 13. Any graph of maximum degree at most $2K/a$ is representable, and so is any tree of maximum degree $2K^2/a^2$.

The $2K/a$ bound follows from the fact that the edges of a regular Eulerian graph can be decomposed into cycles, while the $2K^2/a^2$ bound follows from the theory of block designs. Recalling that $a$ is a small fraction of $K$, we conclude that rather rich and complex "association graphs" can be represented in principle.
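The constructions behind theorem 13 (cycle decompositions, block designs) are involved, but a weaker bound is easy to realize directly: give each edge a private block of $a$ fresh neurons shared by its two endpoints, and pad every set with fresh private neurons up to size $K$. This handles maximum degree only up to $K/a$ (not the $2K/a$ of theorem 13) and makes non-adjacent intersections empty. A sketch, with all names ours:

```python
import itertools

def represent(edges, num_nodes, K, a):
    """Naive representation: each edge gets a private block of `a` neurons
    shared by its two endpoints; each node is then padded with fresh private
    neurons up to size K. Valid whenever the maximum degree is at most K // a
    (weaker than the 2K/a bound of Theorem 13)."""
    fresh = itertools.count()  # an endless supply of new neuron ids
    sets = [set() for _ in range(num_nodes)]
    for u, v in edges:
        block = {next(fresh) for _ in range(a)}
        sets[u] |= block
        sets[v] |= block
    for s in sets:
        while len(s) < K:
            s.add(next(fresh))
    return sets

K, a, b = 100, 8, 4  # a ~ 8% and b ~ 4% of K, as in [6, 15]
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]  # a 4-cycle; max degree 2 <= K // a
S = represent(edges, 4, K, a)
assert all(len(s) == K for s in S)
assert all(len(S[u] & S[v]) >= a for u, v in edges)
assert len(S[0] & S[2]) <= b and len(S[1] & S[3]) <= b  # non-adjacent pairs
```

The point of the theorem is precisely to do better than this naive bound by letting distinct edges at a node reuse neurons.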
But can these sophisticated combinatorial constructions be carried out with surgical precision in the wet chaos of the brain? Here is a more realistic framework, which we call the soft model: Suppose that we are given an association graph $G = (V, E)$. We wish to determine whether a model of $G$ exists, i.e., $|V|$ sets corresponding to nodes of $G$ whose pairwise intersections realize $G$ according to the rules above involving $a$ and $b$. We wish to create sets of expected size $K$ representing the nodes, starting from the universe of neurons $[N]$ and executing instructions of the following form (in the following $C, C_1, C_2$ are previously constructed sets, and $A$ is the set being constructed):
$$A \leftarrow C_1 - C_2, \quad A \leftarrow C_1 \cup C_2, \quad A \leftarrow C_1 \cap C_2, \quad A \leftarrow S(C, p),$$
where by $S(C, p)$ we denote the result of sampling each element of set $C$ with probability $p$, a simple and realistic enough primitive. The question is, which graphs can be realized in such a way that the intended relations between the nodes and their intersections are not corrupted, with high enough probability, by the randomness of the process? We can show the following:

Theorem 14. Any graph with maximum degree $\frac{1}{e} \cdot \frac{K}{a}$ can be realized in the soft model with high probability.

References

[1] Animashree Anandkumar, Daniel Hsu, and Sham M. Kakade. A method of moments for mixture models and hidden Markov models. In Conference on Learning Theory, 2012.

[2] Boaz Barak, Jonathan A. Kelner, and David Steurer. Dictionary learning and tensor decomposition via the sum-of-squares method. In Proceedings of the Forty-Seventh Annual ACM Symposium on Theory of Computing, pages 143-151. ACM, 2015.

[3] Aditya Bhaskara, Moses Charikar, Ankur Moitra, and Aravindan Vijayaraghavan. Smoothed analysis of tensor decompositions. In Proceedings of the Forty-Sixth Annual ACM Symposium on Theory of Computing, pages 594-603. ACM, 2014.

[4] György Buzsáki.
Neural syntax: cell assemblies, synapsembles, and readers. Neuron, 68(3):362-385, 2010.

[5] Joseph T. Chang. Full reconstruction of Markov models on evolutionary trees: identifiability and consistency. Mathematical Biosciences, 137(1):51-73, 1996.

[6] Emanuela De Falco, Matias J. Ison, Itzhak Fried, and Rodrigo Quian Quiroga. Long-term coding of personal and universal associations underlying the memory web in the human brain. Nature Communications, 7:13408, 2016.

[7] Lieven De Lathauwer, Joséphine Castaing, and Jean-François Cardoso. Fourth-order cumulant-based blind identification of underdetermined mixtures. IEEE Transactions on Signal Processing, 55(6):2965-2973, 2007.

[8] Rong Ge and Tengyu Ma. Decomposing overcomplete 3rd order tensors using sum-of-squares algorithms. arXiv preprint arXiv:1504.05287, 2015.

[9] Rong Ge and Tengyu Ma. On the optimization landscape of tensor decompositions. In Advances in Neural Information Processing Systems, pages 3656-3666, 2017.

[10] Navin Goyal, Santosh Vempala, and Ying Xiao. Fourier PCA and robust tensor decomposition. In Proceedings of the Forty-Sixth Annual ACM Symposium on Theory of Computing, pages 584-593. ACM, 2014.

[11] Johan Håstad. Tensor rank is NP-complete. Journal of Algorithms, 11(4):644-654, 1990.

[12] Christopher J. Hillar and Lek-Heng Lim. Most tensor problems are NP-hard. Journal of the ACM (JACM), 60(6):45, 2013.

[13] Samuel B. Hopkins, Tselil Schramm, Jonathan Shi, and David Steurer. Fast spectral algorithms from sum-of-squares proofs: tensor decomposition and planted sparse vectors. In Proceedings of the Forty-Eighth Annual ACM Symposium on Theory of Computing, pages 178-191. ACM, 2016.

[14] Daniel Hsu, Sham M. Kakade, and Tong Zhang. A spectral algorithm for learning hidden Markov models.
Journal of Computer and System Sciences, 78(5):1460-1480, 2012.

[15] Matias J. Ison, Rodrigo Quian Quiroga, and Itzhak Fried. Rapid encoding of new memories by individual neurons in the human brain. Neuron, 87(1):220-230, 2015.

[16] Tamara G. Kolda and Jackson R. Mayo. Shifted power method for computing tensor eigenpairs. SIAM Journal on Matrix Analysis and Applications, 32(4):1095-1124, 2011.

[17] S. E. Leurgans, R. T. Ross, and R. B. Abel. A decomposition for three-way arrays. SIAM Journal on Matrix Analysis and Applications, 14(4):1064-1083, 1993.

[18] Tengyu Ma, Jonathan Shi, and David Steurer. Polynomial-time tensor decompositions with sum-of-squares. In Foundations of Computer Science (FOCS), 2016 IEEE 57th Annual Symposium on, pages 438-446. IEEE, 2016.

[19] Elchanan Mossel and Sébastien Roch. Learning nonsingular phylogenies and hidden Markov models. In Proceedings of the Thirty-Seventh Annual ACM Symposium on Theory of Computing, pages 366-375. ACM, 2005.

[20] Itamar Pitowsky. Correlation polytopes: their geometry and complexity. Mathematical Programming, 50(1):395-414, 1991.

[21] Rodrigo Quian Quiroga. Concept cells: the building blocks of declarative memory functions. Nature Reviews Neuroscience, 13(8):587, 2012.