{"title": "Shallow vs. Deep Sum-Product Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 666, "page_last": 674, "abstract": "We investigate the representational power of sum-product networks (computation networks analogous to neural networks, but whose individual units compute either products or weighted sums), through a theoretical analysis that compares deep (multiple hidden layers) vs. shallow (one hidden layer) architectures. We prove there exist families of functions that can be represented much more efficiently with a deep network than with a shallow one, i.e. with substantially fewer hidden units. Such results were not available until now, and help to motivate recent research on learning deep sum-product networks, and more generally research in Deep Learning.", "full_text": "Shallow vs. Deep Sum-Product Networks\n\nOlivier Delalleau\n\nDepartment of Computer Science and Operations Research\n\nUniversit\u00e9 de Montr\u00e9al\n\ndelallea@iro.umontreal.ca\n\nYoshua Bengio\n\nDepartment of Computer Science and Operations Research\n\nUniversit\u00e9 de Montr\u00e9al\n\nyoshua.bengio@umontreal.ca\n\nAbstract\n\nWe investigate the representational power of sum-product networks (computation\nnetworks analogous to neural networks, but whose individual units compute either\nproducts or weighted sums), through a theoretical analysis that compares deep\n(multiple hidden layers) vs. shallow (one hidden layer) architectures. We prove\nthere exist families of functions that can be represented much more efficiently\nwith a deep network than with a shallow one, i.e. with substantially fewer hidden\nunits. 
Such results were not available until now, and help to motivate recent\nresearch on learning deep sum-product networks, and more generally\nresearch in Deep Learning.\n\n1 Introduction and prior work\n\nMany learning algorithms are based on searching a family of functions so as to identify one member\nof said family which minimizes a training criterion. The choice of this family of functions, and of how\nmembers of that family are parameterized, can be a crucial one. Although there is no universally\noptimal choice of parameterization or family of functions (or \u201carchitecture\u201d), as demonstrated by\nthe no-free-lunch results [37], it may be the case that some architectures are appropriate (or inappropriate)\nfor a large class of learning tasks and data distributions, such as those related to Artificial\nIntelligence (AI) tasks [4]. Different families of functions have different characteristics that can be\nappropriate or not depending on the learning task of interest. One of the characteristics that has\nspurred much interest and research in recent years is the depth of the architecture. In the case of a\nmulti-layer neural network, depth corresponds to the number of (hidden and output) layers. A fixed-kernel\nSupport Vector Machine is considered to have depth 2 [4] and boosted decision trees to have\ndepth 3 [7]. Here we use the word circuit or network to talk about a directed acyclic graph, where\neach node is associated with some output value which can be computed based on the values associated\nwith its predecessor nodes. The arguments of the learned function are set at the input nodes of\nthe circuit (which have no predecessor) and the outputs of the function are read off the output nodes\nof the circuit. Different families of functions correspond to different circuits and allowed choices\nof computations in each node. 
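The circuit formalism just described can be made concrete in a few lines of code. The sketch below is our own illustration (the dictionary representation and node names are not from the paper): it evaluates a directed acyclic graph of nodes and computes its depth as the longest input-to-output path.

```python
# Minimal sketch of a computation circuit as a directed acyclic graph.
# Each node maps to (operation, list of predecessor names); input nodes
# have no predecessors and take their values from the provided inputs.

def evaluate(circuit, inputs):
    """Evaluate every node of the DAG, given values for the input nodes."""
    values = dict(inputs)
    for name, (op, preds) in circuit.items():  # nodes in topological order
        if preds:
            values[name] = op(*(values[p] for p in preds))
    return values

def depth(circuit):
    """Length (in edges) of the longest path from an input node to any node."""
    d = {}
    for name, (_, preds) in circuit.items():
        d[name] = 1 + max(d[p] for p in preds) if preds else 0
    return max(d.values())

# A depth-2 circuit computing (x1 * x2) + (x3 * x4).
circuit = {
    "x1": (None, []), "x2": (None, []), "x3": (None, []), "x4": (None, []),
    "p1": (lambda a, b: a * b, ["x1", "x2"]),
    "p2": (lambda a, b: a * b, ["x3", "x4"]),
    "out": (lambda a, b: a + b, ["p1", "p2"]),
}
vals = evaluate(circuit, {"x1": 1.0, "x2": 2.0, "x3": 3.0, "x4": 4.0})
```

With these inputs the output node evaluates to 1·2 + 3·4 = 14, and the circuit has depth 2, matching the paper's definition of depth as the longest input-to-output path.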
Learning can be performed by changing the computation associated\nwith a node, or by rewiring the circuit (possibly changing the number of nodes). The depth of the circuit\nis the length of the longest path in the graph from an input node to an output node.\nDeep Learning algorithms [3] are tailored to learning circuits with variable depth, typically greater\nthan depth 2. They are based on the idea of multiple levels of representation, with the intuition that\nthe raw input can be represented at different levels of abstraction, with more abstract features of\nthe input or more abstract explanatory factors represented by deeper circuits. These algorithms are\noften based on unsupervised learning, opening the door to semi-supervised learning and efficient\nuse of large quantities of unlabeled data [3]. Analogies with the structure of the cerebral cortex (in\nparticular the visual cortex) [31] and similarities between features learned with some Deep Learning\nalgorithms and those hypothesized in the visual cortex [17] further motivate investigations into deep\narchitectures. It has been suggested that deep architectures are more powerful in the sense of being\nable to more efficiently represent highly-varying functions [4, 3]. In this paper, we measure \u201cefficiency\u201d\nin terms of the number of computational units in the network. An efficient representation\nis important mainly because: (i) it uses less memory and is faster to compute, and (ii) given a fixed\namount of training samples and computational power, better generalization is expected.\nThe first successful algorithms for training deep architectures appeared in 2006, with efficient training\nprocedures for Deep Belief Networks [14] and deep auto-encoders [13, 27, 6], both exploiting\nthe general idea of greedy layer-wise pre-training [6]. 
Since then, these ideas have been investigated\nfurther and applied in many settings, demonstrating state-of-the-art learning performance\nin object recognition [16, 28, 18, 15] and segmentation [20], audio classification [19, 10], natural\nlanguage processing [9, 36, 21, 32], collaborative filtering [30], modeling textures [24], modeling\nmotion [34, 33], information retrieval [29, 26], and semi-supervised learning [36, 22].\nPoon and Domingos [25] introduced deep sum-product networks as a method to compute partition\nfunctions of tractable graphical models. These networks are analogous to traditional artificial neural\nnetworks but with nodes that compute either products or weighted sums of their inputs. Analogously\nto neural networks, we define \u201chidden\u201d nodes as those nodes that are neither input nodes nor\noutput nodes. If the nodes are organized in layers, we define the \u201chidden\u201d layers to be those that\nare neither the input layer nor the output layer. Poon and Domingos [25] report experiments with\nnetworks much deeper (30+ hidden layers) than those typically used until now, e.g. in Deep Belief\nNetworks [14, 3], where the number of hidden layers is usually on the order of three to five.\nWhether such deep architectures have theoretical advantages compared to so-called \u201cshallow\u201d architectures\n(i.e. those with a single hidden layer) remains an open question. 
After all, in the case of a\nsum-product network, the output value can always be written as a sum of products of input variables\n(possibly raised to some power by allowing multiple connections from the same input), and consequently\nit is easily rewritten as a shallow network with a sum output unit and product hidden units.\nThe argument supported by our theoretical analysis is that a deep architecture is able to compute\nsome functions much more efficiently than a shallow one.\nUntil recently, very few theoretical results supported the idea that deep architectures could present\nan advantage in terms of representing some functions more efficiently. Most related results originate\nfrom the analysis of boolean circuits (see e.g. [2] for a review). Well-known results include the\nproof that solving the n-bit parity task with a depth-2 circuit requires an exponential number of\ngates [1, 38], and more generally that there exist functions computable with a polynomial-size depth-k\ncircuit that would require exponential size when restricted to depth k \u2212 1 [11]. Another recent\nresult on boolean circuits by Braverman [8] offers proof of a longstanding conjecture, showing that\nbounded-depth boolean circuits are unable to distinguish some (non-uniform) input distributions\nfrom the uniform distribution (i.e. they are \u201cfooled\u201d by such input distributions). In particular,\nBraverman\u2019s result suggests that shallow circuits can in general be fooled more easily than deep\nones, i.e., that they would have more difficulty efficiently representing high-order dependencies\n(those involving many input variables).\nIt is not obvious that circuit complexity results (that typically consider only boolean or at least discrete\nnodes) are directly applicable in the context of typical machine learning algorithms such as\nneural networks (that compute continuous representations of their input). 
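The flattening argument above can be illustrated numerically (a sketch of our own, not from the paper): multiplying out a product of n/2 two-variable sums picks one variable per factor, so the equivalent shallow sum-of-products form has one product unit per choice, i.e. exponentially many.

```python
from itertools import product

def expand(factors):
    """Monomials of a product of sums of distinct variables.

    Each factor is a list of variable indices; expanding
    (x1 + x2)(x3 + x4)...(x_{n-1} + x_n) selects one variable
    per factor, yielding 2**(n/2) distinct monomials.
    """
    return {tuple(choice) for choice in product(*factors)}

n = 20  # number of input variables
factors = [[2 * i + 1, 2 * i + 2] for i in range(n // 2)]
monomials = expand(factors)
# A shallow (depth-2) sum-of-products network needs one product unit per
# monomial: here 2**(n//2) = 1024 hidden units for only n = 20 inputs,
# whereas the factored (deep) form uses n/2 sum units and one product unit.
```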
Orponen [23] surveys theoretical\nresults in computational complexity that are relevant to learning algorithms. For instance,\nH\u00e5stad and Goldmann [12] extended some results to the case of networks of linear threshold units\nwith positivity constraints on the weights. Bengio et al. [5, 7] investigate, respectively, complexity\nissues in networks of Gaussian radial basis functions and in decision trees, showing intrinsic limitations\nof these architectures e.g. on tasks similar to the parity problem. Utgoff and Stracuzzi [35] informally\ndiscuss the advantages of depth in boolean circuits in the context of learning architectures.\nBengio [3] suggests that some polynomials could be represented more efficiently by deep sum-product\nnetworks, but without providing any formal statement or proofs. This work partly addresses\nthis void by demonstrating families of circuits for which a deep architecture can be exponentially\nmore efficient than a shallow one in the context of real-valued polynomials.\nNote that we do not address in this paper the problem of learning these parameters: even if an\nefficient deep representation exists for the function we seek to approximate, in general there is no\nguarantee for standard optimization algorithms to easily converge to this representation. This paper\nfocuses on the representational power of deep sum-product circuits compared to shallow ones, and\nstudies it by considering particular families of target functions (to be represented by the learner).\nWe first formally define sum-product networks. We then consider two families of functions represented\nby deep sum-product networks (families F and G). 
For each family, we establish a lower bound on\nthe minimal number of hidden units a depth-2 sum-product network would require to represent a\nfunction of this family, showing it is much less efficient than the deep representation.\n\n2 Sum-product networks\n\nDefinition 1. A sum-product network is a network composed of units that either compute the product\nof their inputs or a weighted sum of their inputs (where weights are strictly positive).\n\nHere, we restrict our definition of the generic term \u201csum-product network\u201d to networks whose summation\nunits have positive incoming weights\u00b9, while others are called \u201cnegative-weight\u201d networks.\nDefinition 2. A \u201cnegative-weight\u201d sum-product network may contain summation units whose\nweights are non-positive (i.e. less than or equal to zero).\n\nFinally, we formally define what we mean by deep vs. shallow networks in the rest of the paper.\nDefinition 3. A \u201cshallow\u201d sum-product network contains a single hidden layer (i.e. a total of three\nlayers when counting the input and output layers, and a depth equal to two).\nDefinition 4. A \u201cdeep\u201d sum-product network contains more than one hidden layer (i.e. a total of at\nleast four layers, and a depth at least three).\n\n3 The family F\n\n3.1 Definition\n\nThe first family of functions we study, denoted by F, is made of functions built from deep sum-product\nnetworks that alternate layers of product and sum units with two inputs each (details are\nprovided below). The basic idea we use here is that composing layers (i.e. using a deep architecture)\nis equivalent to using a factorized representation of the polynomial function computed by the\nnetwork. Such a factorized representation can be exponentially more compact than its expansion as\na sum of products (which can be associated to a shallow network with product units in its hidden\nlayer and a sum unit as output). 
This is what we formally show in what follows.\n\n[Figure 1 diagram: inputs x1, x2, x3, x4 feed two product units \u2113^1_1 = x1x2 and \u2113^1_2 = x3x4, which feed a sum output unit \u2113^2_1 = \u03bb11 \u2113^1_1 + \u00b511 \u2113^1_2 = x1x2 + x3x4 = f(x1, x2, x3, x4), with \u03bb11 = \u00b511 = 1.]\n\nFigure 1: Sum-product network computing the function f \u2208 F such that i = \u03bb11 = \u00b511 = 1.\n\nLet n = 4^i, with i a positive integer value. Denote by \u2113^0 the input layer containing scalar variables\n{x1, . . . , xn}, such that \u2113^0_j = xj for 1 \u2264 j \u2264 n. Now define f \u2208 F as any function computed by a\nsum-product network (deep for i \u2265 2) composed of alternating product and sum layers:\n\n\u2022 \u2113^{2k+1}_j = \u2113^{2k}_{2j\u22121} \u00b7 \u2113^{2k}_{2j} for 0 \u2264 k \u2264 i \u2212 1 and 1 \u2264 j \u2264 2^{2(i\u2212k)\u22121}\n\u2022 \u2113^{2k}_j = \u03bbjk \u2113^{2k\u22121}_{2j\u22121} + \u00b5jk \u2113^{2k\u22121}_{2j} for 1 \u2264 k \u2264 i and 1 \u2264 j \u2264 2^{2(i\u2212k)}\n\nwhere the weights \u03bbjk and \u00b5jk of the summation units are strictly positive.\nThe output of the network is given by f(x1, . . . , xn) = \u2113^{2i}_1 \u2208 R, the unique unit in the last layer.\nThe corresponding (shallow) network for i = 1 and additive weights set to one is shown in Figure 1\n(this architecture is also the basic building block of bigger networks for i > 1). Note that both the\ninput size n = 4^i and the network\u2019s depth 2i increase with parameter i.\n\n\u00b9This condition is required by some of the proofs presented here.\n\n3.2 Theoretical results\n\nThe main result of this section is presented below in Corollary 1, providing a lower bound on the\nminimum number of hidden units required by a shallow sum-product network to represent a function\nf \u2208 F. 
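The recursive construction of Section 3.1 can be checked mechanically. The sketch below is our own illustration, with all weights \u03bbjk = \u00b5jk set to 1: since every product in the expansion then has coefficient 1 and uses a disjoint set of variables, each unit can be represented simply as a set of monomials, each monomial a set of input indices.

```python
# Sketch of a family-F network with all summation weights set to 1.
# A unit is represented by its polynomial expansion: a set of monomials,
# each monomial a frozenset of input-variable indices.

def family_f_output(i):
    """Expand the output unit of the depth-2i family-F network (n = 4**i inputs)."""
    n = 4 ** i
    layer = [{frozenset([j])} for j in range(1, n + 1)]  # l^0_j = x_j
    for _ in range(i):
        # Product layer: l_j = l_{2j-1} * l_{2j}; monomials combine pairwise.
        layer = [{a | b for a in layer[2 * j] for b in layer[2 * j + 1]}
                 for j in range(len(layer) // 2)]
        # Sum layer with unit weights: l_j = l_{2j-1} + l_{2j}; monomials unite.
        layer = [layer[2 * j] | layer[2 * j + 1] for j in range(len(layer) // 2)]
    return layer[0]

# For i = 2 (n = 16 inputs), the expansion contains 2**(sqrt(n) - 1) = 8
# distinct products, while the deep network itself has only n - 1 = 15 units.
out = family_f_output(2)
```

For i = 3 (n = 64) the same check gives 2^7 = 128 products in the expansion against 63 units in the deep network, matching the counts derived in Proposition 1 and Section 3.3 below.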
The high-level proof sketch consists in the following steps:\n(1) Count the number of unique products found in the polynomial representation of f (Lemma 1 and\nProposition 1).\n(2) Show that the only possible architecture for a shallow sum-product network to compute f is to\nhave a hidden layer made of product units, with a sum unit as output (Lemmas 2 to 5).\n(3) Conclude that the number of hidden units must be at least the number of unique products counted\nin step (1) (Lemma 6 and Corollary 1).\nLemma 1. Any element \u2113^k_j can be written as a (positively) weighted sum of products of input variables,\nsuch that each input variable xt is used in exactly one unit of \u2113^k. Moreover, the number m_k of\nproducts found in the sum computed by \u2113^k_j does not depend on j and obeys the following recurrence\nrule for k \u2265 0: if k + 1 is odd, then m_{k+1} = m_k^2, otherwise m_{k+1} = 2 m_k.\n\nProof. We prove the lemma by induction on k. It is obviously true for k = 0 since \u2113^0_j = xj.\nAssuming this is true for some k \u2265 0, we consider two cases:\n\n\u2022 If k + 1 is odd, then \u2113^{k+1}_j = \u2113^k_{2j\u22121} \u00b7 \u2113^k_{2j}. By the inductive hypothesis, it is the product of\ntwo (positively) weighted sums of products of input variables, and no input variable can\nappear in both \u2113^k_{2j\u22121} and \u2113^k_{2j}, so the result is also a (positively) weighted sum of products\nof input variables. Additionally, if the number of products in \u2113^k_{2j\u22121} and \u2113^k_{2j} is m_k, then\nm_{k+1} = m_k^2, since all products involved in the multiplication of the two units are different\n(since they use disjoint subsets of input variables), and the sums have positive weights.\nFinally, by the induction assumption, an input variable appears in exactly one unit of \u2113^k.\nThis unit is an input to a single unit of \u2113^{k+1}, that will thus be the only unit of \u2113^{k+1} where\nthis input variable appears.\n\u2022 If k + 1 is even, then \u2113^{k+1}_j = \u03bbjk \u2113^k_{2j\u22121} + \u00b5jk \u2113^k_{2j}. Again, from the induction assumption, it\nmust be a (positively) weighted sum of products of input variables, but with m_{k+1} = 2 m_k\nsuch products. As in the previous case, an input variable will appear in the single unit of\n\u2113^{k+1} that has as input the single unit of \u2113^k in which this variable must appear.\n\nProposition 1. The number of products in the sum computed in the output unit \u2113^{2i}_1 of a network\ncomputing a function in F is m_{2i} = 2^{\u221an\u22121}.\n\nProof. We first prove by induction on k \u2265 1 that for odd k, m_k = 2^{2^{(k+1)/2}\u22122}, and for even k,\nm_k = 2^{2^{k/2}\u22121}. This is obviously true for k = 1 since 2^{2^1\u22122} = 2^0 = 1, and all units in \u2113^1 are\nsingle products of the form x_r x_s. Assuming this is true for some k \u2265 1, then:\n\n\u2022 if k + 1 is odd, then from Lemma 1 and the induction assumption, we have:\nm_{k+1} = m_k^2 = (2^{2^{k/2}\u22121})^2 = 2^{2^{k/2+1}\u22122} = 2^{2^{((k+1)+1)/2}\u22122}\n\u2022 if k + 1 is even, then instead we have:\nm_{k+1} = 2 m_k = 2 \u00b7 2^{2^{(k+1)/2}\u22122} = 2^{2^{(k+1)/2}\u22121}\n\nwhich shows the desired result for k + 1, and thus concludes the induction proof. Applying this\nresult with k = 2i (which is even) yields\nm_{2i} = 2^{2^i\u22121} = 2^{\u221a(4^i)\u22121} = 2^{\u221an\u22121}.\n\nLemma 2. The products computed in the output unit \u2113^{2i}_1 can be split in two groups, one with products\ncontaining only variables x_1, . . . , x_{n/2} and one containing only variables x_{n/2+1}, . . . , x_n.\n\nProof. This is obvious since the last unit is a \u201csum\u201d unit that adds two terms whose inputs are these\ntwo groups of variables (see e.g. Fig. 1).\n\nLemma 3. The products computed in the output unit \u2113^{2i}_1 involve more than one input variable.\n\nProof. 
It is straightforward to show by induction on k \u2265 1 that the products computed by \u2113^k_j all\ninvolve more than one input variable, thus it is true in particular for the output layer (k = 2i).\n\nLemma 4. Any shallow sum-product network computing f \u2208 F must have a \u201csum\u201d unit as output.\n\nProof. By contradiction, suppose the output unit of such a shallow sum-product network is multiplicative.\nThis unit must have more than one input, because in the case that it has only one input,\nthe output would be either a (weighted) sum of input variables (which would violate Lemma 3), or\na single product of input variables (which would violate Proposition 1), depending on the type (sum\nor product) of the single input hidden unit. Thus the last unit must compute a product of two or\nmore hidden units. It can be re-written as a product of two factors, where each factor corresponds to\neither one hidden unit, or a product of multiple hidden units (it does not matter here which specific\nfactorization is chosen among all possible ones). Regardless of the type (sum or product) of the\nhidden units involved, those two factors can thus be written as weighted sums of products of variables\nxt (with positive weights, and input variables potentially raised to powers above one). From\nLemma 1, both x1 and xn must be present in the final output, and thus they must appear in at least\none of these two factors. Without loss of generality, assume x1 appears in the first factor. Variables\nx_{n/2+1}, . . . , x_n then cannot be present in the second factor, since otherwise one product in the output\nwould contain both x1 and one of these variables (this product cannot cancel out since weights must\nbe positive), violating Lemma 2. But with a similar reasoning, since as a result xn must appear in\nthe first factor, variables x_1, . . . , x_{n/2} cannot be present in the second factor either. Consequently, no\ninput variable can be present in the second factor, leading to the desired contradiction.\n\nLemma 5. Any shallow sum-product network computing f \u2208 F must have only multiplicative units\nin its hidden layer.\n\nProof. By contradiction, suppose there exists a \u201csum\u201d unit in the hidden layer, written s =\n\u03a3_{t\u2208S} \u03b1_t x_t with S the set of input indices appearing in this sum, and \u03b1_t > 0 for all t \u2208 S. Since\naccording to Lemma 4 the output unit must also be a sum (and have positive weights according to\nDefinition 1), the final output will also contain terms of the form \u03b2_t x_t for t \u2208 S, with \u03b2_t > 0.\nThis violates Lemma 3, establishing the contradiction.\n\nLemma 6. Any shallow negative-weight sum-product network (see Definition 2) computing f \u2208 F\nmust have at least 2^{\u221an\u22121} hidden units, if its output unit is a sum and its hidden units are products.\n\nProof. Such a network computes a weighted sum of its hidden units, where each hidden unit is a\nproduct of input variables, i.e. its output can be written as \u03a3_j w_j \u03a0_t x_t^{\u03b3_jt} with w_j \u2208 R and \u03b3_jt \u2208\n{0, 1}. In order to compute a function in F, this shallow network thus needs a number of hidden\nunits at least equal to the number of unique products in that function. From Proposition 1, this\nnumber is equal to 2^{\u221an\u22121}.\n\nCorollary 1. Any shallow sum-product network computing f \u2208 F must have at least 2^{\u221an\u22121} hidden\nunits.\n\nProof. This is a direct corollary of Lemmas 4 (showing the output unit is a sum), 5 (showing that\nhidden units are products), and 6 (showing the desired result for any shallow network with this\nspecific structure \u2013 regardless of the sign of weights).\n\n3.3 Discussion\n\nCorollary 1 above shows that in order to compute some function in F with n inputs, the number of\nunits in a shallow network has to be at least 2^{\u221an\u22121} (i.e. 
grows exponentially in \u221an). On the other hand, the total number of units in the deep (for i > 1) network computing the same function, as\ndescribed in Section 3.1, is equal to 1 + 2 + 4 + 8 + . . . + 2^{2i\u22121} (since all units are binary), which is\nalso equal to 2^{2i} \u2212 1 = n \u2212 1 (i.e. grows only quadratically in \u221an). It shows that some deep sum-product\nnetwork with n inputs and depth O(log n) can represent with O(n) units what would\nrequire O(2^{\u221an}) units for a depth-2 network. Lemma 6 also shows a similar result regardless\nof the sign of the weights in the summation units of the depth-2 network, but assumes a specific\narchitecture for this network (products in the hidden layer with a sum as output).\n\n4 The family G\n\nIn this section we present similar results with a different family of functions, denoted by G. Compared\nto F, one important difference of deep sum-product networks built to define functions in G\nis that they can vary their input size independently of their depth. Their analysis thus provides additional\ninsight when comparing the representational efficiency of deep vs. shallow sum-product\nnetworks in the case of a fixed dataset.\n\n4.1 Definition\n\nNetworks in family G also alternate sum and product layers, but their units have as inputs all units\nfrom the previous layer except one. More formally, define the family G = \u222a_{n\u22652,i\u22650} G_{in} of functions\nrepresented by sum-product networks, where the sub-family G_{in} is made of all sum-product\nnetworks with n input variables and 2i + 2 layers (including the input layer \u2113^0), such that:\n\n1. \u2113^1 contains summation units; further layers alternate multiplicative and summation units.\n2. Summation units have positive weights.\n3. All layers are of size n, except the last layer \u2113^{2i+1} that contains a single sum unit that sums\nall units in the previous layer \u2113^{2i}.\n4. 
In each layer \u2113^k for 1 \u2264 k \u2264 2i, each unit \u2113^k_j takes as inputs {\u2113^{k\u22121}_m | m \u2260 j}.\n\nAn example of a network belonging to G_{1,3} (i.e. with i = 1 and three input variables) is shown\nin Figure 2.\n\n[Figure 2 diagram: inputs x1, x2, x3 feed sum units \u2113^1_1 = x2 + x3, \u2113^1_2 = x1 + x3, \u2113^1_3 = x1 + x2; these feed product units \u2113^2_1 = \u2113^1_2 \u00b7 \u2113^1_3 = x1^2 + x1x2 + x1x3 + x2x3 (and similarly \u2113^2_2, \u2113^2_3); the output unit computes \u2113^3_1 = x1^2 + x2^2 + x3^2 + 3(x1x2 + x1x3 + x2x3) = g(x1, x2, x3).]\n\nFigure 2: Sum-product network computing a function of G_{1,3} (summation units\u2019 weights are all 1\u2019s).\n\n4.2 Theoretical results\n\nThe main result is stated in Proposition 3 below, establishing a lower bound on the number of hidden\nunits of a shallow sum-product network computing g \u2208 G. The proof sketch is as follows:\n\n1. We show that the polynomial expansion of g must contain a large set of products (Proposition 2\nand Corollary 2).\n2. We use both the number of products in that set as well as their degree to establish the\ndesired lower bound (Proposition 3).\n\nWe will also need the following lemma, which states that when n \u2212 1 items each belong to at least n \u2212 1\nsets among a total of n sets, then we can associate to each item one of the sets it belongs to without\nusing the same set for different items.\nLemma 7. Let S_1, . . . , S_n be n sets (n \u2265 2) containing elements of {P_1, . . . , P_{n\u22121}}, such that for\nany q, |{r | P_q \u2208 S_r}| \u2265 n \u2212 1 (i.e. each element P_q belongs to at least n \u2212 1 sets). Then there\nexist n \u2212 1 different indices r_1, . . . , r_{n\u22121} such that P_q \u2208 S_{r_q} for 1 \u2264 q \u2264 n \u2212 1.\n\nProof. Omitted due to lack of space (very easy to prove by construction).\n\nProposition 2. 
For any 0 \u2264 j \u2264 i, and any product of variables P = \u03a0^n_{t=1} x_t^{\u03b1_t} such that \u03b1_t \u2208 N and\n\u03a3_t \u03b1_t = (n \u2212 1)^j, there exists a unit in \u2113^{2j} whose computed value, when expanded as a weighted\nsum of products, contains P among these products.\n\nProof. We prove this proposition by induction on j.\nFirst, for j = 0, this is obvious since any P of this form must be made of a single input variable x_t,\nthat appears in \u2113^0_t = x_t.\nSuppose now the proposition is true for some j < i. Consider a product P = \u03a0^n_{t=1} x_t^{\u03b1_t} such that\n\u03b1_t \u2208 N and \u03a3_t \u03b1_t = (n \u2212 1)^{j+1}. P can be factored in n \u2212 1 sub-products of degree (n \u2212 1)^j,\ni.e. written P = P_1 . . . P_{n\u22121} with P_q = \u03a0^n_{t=1} x_t^{\u03b2_qt}, \u03b2_qt \u2208 N and \u03a3_t \u03b2_qt = (n \u2212 1)^j for all q. By\nthe induction hypothesis, each P_q can be found in at least one unit \u2113^{2j}_{k_q}. As a result, by property 4\n(in the definition of family G), each P_q will also appear in the additive layer \u2113^{2j+1}, in at least n \u2212 1\ndifferent units (the only sum unit that may not contain P_q is the one that does not have \u2113^{2j}_{k_q} as input).\nBy Lemma 7, we can thus find a set of units \u2113^{2j+1}_{r_q} such that for any 1 \u2264 q \u2264 n \u2212 1, the product\nP_q appears in \u2113^{2j+1}_{r_q}, with indices r_q being different from each other. Let 1 \u2264 s \u2264 n be such that\ns \u2260 r_q for all q. Then, from property 4 of family G, the multiplicative unit \u2113^{2(j+1)}_s computes the\nproduct \u03a0^{n\u22121}_{q=1} \u2113^{2j+1}_{r_q}, and as a result, when expanded as a sum of products, it contains in particular\nP_1 . . . P_{n\u22121} = P. The proposition is thus true for j + 1, and by induction, is true for all j \u2264 i.\n\nCorollary 2. The output g_{in} of a sum-product network in G_{in}, when expanded as a sum of products,\ncontains all products of variables of the form \u03a0^n_{t=1} x_t^{\u03b1_t} such that \u03b1_t \u2208 N and \u03a3_t \u03b1_t = (n \u2212 1)^i.\n\nProof. Applying Proposition 2 with j = i, we obtain that all products of this form can be found in\nthe multiplicative units of \u2113^{2i}. Since the output unit \u2113^{2i+1}_1 computes a sum of these multiplicative\nunits (weighted with positive weights), those products are also present in the output.\n\nProposition 3. A shallow negative-weight sum-product network computing g_{in} \u2208 G_{in} must have at\nleast (n \u2212 1)^i hidden units.\n\nProof. First suppose the output unit of the shallow network is a sum. Then it may be able to compute\ng_{in}, assuming we allow multiplicative units in the hidden layer to use powers\nof their inputs in the product they compute (which we allow here for the proof to be more generic).\nHowever, it will require at least as many of these units as the number of unique products that can\nbe found in the expansion of g_{in}. In particular, from Corollary 2, it will require at least the number\nof unique tuples of the form (\u03b1_1, . . . , \u03b1_n) such that \u03b1_t \u2208 N and \u03a3^n_{t=1} \u03b1_t = (n \u2212 1)^i. Denoting\nd_{ni} = (n \u2212 1)^i, this number is known to be equal to C(n + d_{ni} \u2212 1, d_{ni}), and it is easy to verify it is higher\nthan (or equal to) d_{ni} for any n \u2265 2 and i \u2265 0.\nNow suppose the output unit is multiplicative. Then there can be no multiplicative hidden unit,\notherwise it would mean one could factor some input variable x_t in the computed function output:\nthis is not possible since by Corollary 2, for any variable x_t there exist products in the output function\nthat do not involve x_t. 
So all hidden units must be additive, and since the computed function contains\nproducts of degree d_{ni}, there must be at least d_{ni} such hidden units.\n\n4.3 Discussion\n\nProposition 3 shows that in order to compute the same function as g_{in} \u2208 G_{in}, the number of units\nin the shallow network has to grow exponentially in i, i.e. in the network\u2019s depth (while the deep\nnetwork\u2019s size grows linearly in i). The shallow network also needs to grow polynomially in the\nnumber of input variables n (with a degree equal to i), while the deep network grows only linearly in\nn. It means that some deep sum-product network with n inputs and depth O(i) can represent\nwith O(ni) units what would require O((n \u2212 1)^i) units for a depth-2 network.\nNote that in the similar results found for family F, the depth-2 network computing the same function\nas a function in F had to be constrained to either have a specific combination of sum and product\nunits (in Lemma 6) or to have non-negative weights (in Corollary 1). On the contrary, the result\npresented here for family G holds without requiring any of these assumptions.\n\n5 Conclusion\n\nWe compared a deep sum-product network and a shallow sum-product network representing the\nsame function, taken from two families of functions F and G. For both families, we have shown that\nthe number of units in the shallow network has to grow exponentially, compared to a linear growth\nin the deep network, so as to represent the same functions. The deep version thus offers a much\nmore compact representation of the same functions.\nThis work focuses on two specific families of functions: finding more general parameterizations of\nfunctions leading to similar results would be an interesting topic for future research. Another open\nquestion is whether it is possible to represent such functions only approximately (e.g. up to an\nerror bound \u03b5) with a much smaller shallow network. 
Results by Braverman [8] on boolean circuits\nsuggest that similar results to those presented in this paper may still hold, but this topic has yet to be\nformally investigated in the context of sum-product networks. A related problem is also to look into\nfunctions defined only on discrete input variables: our proofs do not trivially extend to this situation\nbecause we cannot assume anymore that two polynomials yielding the same output values must have\nthe same expansion coefficients (since the number of input combinations becomes finite).\n\nAcknowledgments\n\nThe authors would like to thank Razvan Pascanu and David Warde-Farley for their help in improving\nthis manuscript, as well as the anonymous reviewers for their careful reviews. This work was\npartially funded by NSERC, CIFAR, and the Canada Research Chairs.\n\nReferences\n\n[1] Ajtai, M. (1983). \u03a3\u00b9\u2081-formulae on finite structures. Annals of Pure and Applied Logic, 24(1), 1\u201348.\n[2] Allender, E. (1996). Circuit complexity before the dawn of the new millennium. In 16th Annual Conference\non Foundations of Software Technology and Theoretical Computer Science, pages 1\u201318. Lecture Notes in\nComputer Science 1180, Springer Verlag.\n[3] Bengio, Y. (2009). Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1),\n1\u2013127. Also published as a book, Now Publishers, 2009.\n[4] Bengio, Y. and LeCun, Y. (2007). Scaling learning algorithms towards AI. In L. Bottou, O. Chapelle,\nD. DeCoste, and J. Weston, editors, Large Scale Kernel Machines. MIT Press.\n[5] Bengio, Y., Delalleau, O., and Le Roux, N. (2006). The curse of highly variable functions for local kernel\nmachines. In NIPS\u201905, pages 107\u2013114. MIT Press, Cambridge, MA.\n[6] Bengio, Y., Lamblin, P., Popovici, D., and Larochelle, H. (2007). Greedy layer-wise training of deep\nnetworks. In NIPS 19, pages 153\u2013160. MIT Press.\n[7] Bengio, Y., Delalleau, O., and Simard, C. 
(2010). Decision trees do not generalize to new variations. Computational Intelligence, 26(4), 449–467.

[8] Braverman, M. (2011). Poly-logarithmic independence fools bounded-depth boolean circuits. Communications of the ACM, 54(4), 108–115.

[9] Collobert, R. and Weston, J. (2008). A unified architecture for natural language processing: Deep neural networks with multitask learning. In ICML 2008, pages 160–167.

[10] Dahl, G. E., Ranzato, M., Mohamed, A., and Hinton, G. E. (2010). Phone recognition with the mean-covariance restricted Boltzmann machine. In Advances in Neural Information Processing Systems (NIPS).

[11] Håstad, J. (1986). Almost optimal lower bounds for small depth circuits. In Proceedings of the 18th Annual ACM Symposium on Theory of Computing, pages 6–20, Berkeley, California. ACM Press.

[12] Håstad, J. and Goldmann, M. (1991). On the power of small-depth threshold circuits. Computational Complexity, 1, 113–129.

[13] Hinton, G. E. and Salakhutdinov, R. (2006). Reducing the dimensionality of data with neural networks. Science, 313(5786), 504–507.

[14] Hinton, G. E., Osindero, S., and Teh, Y. (2006). A fast learning algorithm for deep belief nets. Neural Computation, 18, 1527–1554.

[15] Kavukcuoglu, K., Sermanet, P., Boureau, Y.-L., Gregor, K., Mathieu, M., and LeCun, Y. (2010). Learning convolutional feature hierarchies for visual recognition. In NIPS'10.

[16] Larochelle, H., Erhan, D., Courville, A., Bergstra, J., and Bengio, Y. (2007). An empirical evaluation of deep architectures on problems with many factors of variation. In ICML'07, pages 473–480. ACM.

[17] Lee, H., Ekanadham, C., and Ng, A. (2008). Sparse deep belief net model for visual area V2. In NIPS'07, pages 873–880. MIT Press, Cambridge, MA.

[18] Lee, H., Grosse, R., Ranganath, R., and Ng, A. Y. (2009a).
Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In ICML 2009, Montreal (Qc), Canada.

[19] Lee, H., Pham, P., Largman, Y., and Ng, A. (2009b). Unsupervised feature learning for audio classification using convolutional deep belief networks. In NIPS'09, pages 1096–1104.

[20] Levner, I. (2008). Data Driven Object Segmentation. Ph.D. thesis, Department of Computer Science, University of Alberta.

[21] Mnih, A. and Hinton, G. E. (2009). A scalable hierarchical distributed language model. In NIPS'08, pages 1081–1088.

[22] Mobahi, H., Collobert, R., and Weston, J. (2009). Deep learning from temporal coherence in video. In ICML'2009, pages 737–744.

[23] Orponen, P. (1994). Computational complexity of neural networks: a survey. Nordic Journal of Computing, 1(1), 94–110.

[24] Osindero, S. and Hinton, G. E. (2008). Modeling image patches with a directed hierarchy of Markov random fields. In NIPS'07, pages 1121–1128, Cambridge, MA. MIT Press.

[25] Poon, H. and Domingos, P. (2011). Sum-product networks: A new deep architecture. In UAI'2011, Barcelona, Spain.

[26] Ranzato, M. and Szummer, M. (2008). Semi-supervised learning of compact document representations with deep networks. In ICML.

[27] Ranzato, M., Poultney, C., Chopra, S., and LeCun, Y. (2007). Efficient learning of sparse representations with an energy-based model. In NIPS'06, pages 1137–1144. MIT Press.

[28] Ranzato, M., Boureau, Y.-L., and LeCun, Y. (2008). Sparse feature learning for deep belief networks. In NIPS'07, pages 1185–1192, Cambridge, MA. MIT Press.

[29] Salakhutdinov, R. and Hinton, G. E. (2007). Semantic hashing. In Proceedings of the 2007 Workshop on Information Retrieval and Applications of Graphical Models (SIGIR 2007), Amsterdam. Elsevier.

[30] Salakhutdinov, R., Mnih, A., and Hinton, G. E.
(2007). Restricted Boltzmann machines for collaborative filtering. In ICML 2007, pages 791–798, New York, NY, USA.

[31] Serre, T., Kreiman, G., Kouh, M., Cadieu, C., Knoblich, U., and Poggio, T. (2007). A quantitative theory of immediate visual recognition. Progress in Brain Research, Computational Neuroscience: Theoretical Insights into Brain Function, 165, 33–56.

[32] Socher, R., Lin, C., Ng, A. Y., and Manning, C. (2011). Learning continuous phrase representations and syntactic parsing with recursive neural networks. In ICML'2011.

[33] Taylor, G. and Hinton, G. (2009). Factored conditional restricted Boltzmann machines for modeling motion style. In ICML 2009, pages 1025–1032.

[34] Taylor, G., Hinton, G. E., and Roweis, S. (2007). Modeling human motion using binary latent variables. In NIPS'06, pages 1345–1352. MIT Press, Cambridge, MA.

[35] Utgoff, P. E. and Stracuzzi, D. J. (2002). Many-layered learning. Neural Computation, 14, 2497–2539.

[36] Weston, J., Ratle, F., and Collobert, R. (2008). Deep learning via semi-supervised embedding. In ICML 2008, pages 1168–1175, New York, NY, USA.

[37] Wolpert, D. H. (1996). The lack of a priori distinction between learning algorithms. Neural Computation, 8(7), 1341–1390.

[38] Yao, A. (1985). Separating the polynomial-time hierarchy by oracles. In Proceedings of the 26th Annual IEEE Symposium on Foundations of Computer Science, pages 1–10.