{"title": "On the Representational Efficiency of Restricted Boltzmann Machines", "book": "Advances in Neural Information Processing Systems", "page_first": 2877, "page_last": 2885, "abstract": "This paper examines the question: What kinds of distributions can be efficiently represented by Restricted Boltzmann Machines (RBMs)? We characterize the RBM's unnormalized log-likelihood function as a type of neural network (called an RBM network), and through a series of simulation results relate these networks to types that are better understood. We show the surprising result that RBM networks can efficiently compute any function that depends on the number of 1's in the input, such as parity. We also provide the first known example of a particular type of distribution which provably cannot be efficiently represented by an RBM (or equivalently, cannot be efficiently computed by an RBM network), assuming a realistic exponential upper bound on the size of the weights. By formally demonstrating that a relatively simple distribution cannot be represented efficiently by an RBM our results provide a new rigorous justification for the use of potentially more expressive generative models, such as deeper ones.", "full_text": "On the Representational Ef\ufb01ciency of Restricted\n\nBoltzmann Machines\n\nJames Martens\u2217\n\nArkadev Chattopadhyay+\n\nToniann Pitassi\u2217\n\nRichard Zemel\u2217\n\n\u2217Department of Computer Science\n\nUniversity of Toronto\n\n{jmartens,toni,zemel}@cs.toronto.edu\n\n+School of Technology & Computer Science\nTata Institute of Fundamental Research\n\narkadev.c@tifr.res.in\n\nAbstract\n\nThis paper examines the question: What kinds of distributions can be ef\ufb01ciently\nrepresented by Restricted Boltzmann Machines (RBMs)? 
We characterize the RBM's unnormalized log-likelihood function as a type of neural network, and through a series of simulation results relate these networks to ones whose representational properties are better understood. We show the surprising result that RBMs can efficiently capture any distribution whose density depends on the number of 1's in the input. We also provide the first known example of a particular type of distribution that provably cannot be efficiently represented by an RBM, assuming a realistic exponential upper bound on the weights. By formally demonstrating that a relatively simple distribution cannot be represented efficiently by an RBM, our results provide a new rigorous justification for the use of potentially more expressive generative models, such as deeper ones.

1 Introduction

Standard Restricted Boltzmann Machines (RBMs) are a type of Markov Random Field (MRF) characterized by a bipartite dependency structure between a group of binary visible units x ∈ {0, 1}^n and binary hidden units h ∈ {0, 1}^m. Their energy function is given by:

E_θ(x, h) = −x^T W h − c^T x − b^T h

where W ∈ R^{n×m} is the matrix of weights, c ∈ R^n and b ∈ R^m are vectors that store the input and hidden biases (respectively), and together these are referred to as the RBM's parameters θ = {W, c, b}. The energy function specifies the probability distribution over the joint space (x, h) via the Boltzmann distribution p(x, h) = (1/Z_θ) exp(−E_θ(x, h)), with the partition function Z_θ given by Z_θ = Σ_{x,h} exp(−E_θ(x, h)).
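The definitions above are easy to transcribe and check by brute force. The sketch below (not from the paper; all function names are ours) enumerates every joint state of a toy RBM and normalizes by the partition function. The exhaustive sum over 2^(n+m) states is exactly what becomes intractable at realistic sizes.

```python
import itertools, math

def energy(x, h, W, b, c):
    """E_theta(x, h) = -x^T W h - c^T x - b^T h."""
    n, m = len(x), len(h)
    return (-sum(x[i] * W[i][j] * h[j] for i in range(n) for j in range(m))
            - sum(ci * xi for ci, xi in zip(c, x))
            - sum(bj * hj for bj, hj in zip(b, h)))

def boltzmann(W, b, c):
    """Exhaustive Boltzmann distribution p(x, h) = exp(-E) / Z for a tiny RBM.

    Enumerates all 2^(n+m) joint states, so this is only feasible for toy
    sizes; the intractability of Z for realistic n, m is the point made above.
    """
    n, m = len(W), len(W[0])
    states = [(x, h)
              for x in itertools.product((0, 1), repeat=n)
              for h in itertools.product((0, 1), repeat=m)]
    Z = sum(math.exp(-energy(x, h, W, b, c)) for x, h in states)
    return {(x, h): math.exp(-energy(x, h, W, b, c)) / Z for x, h in states}, Z
```

Even at n = m = 10 the joint state space already has over a million entries, which is why quantities involving Z_θ are only computed exactly on toy examples like this.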
Based on this definition, the probability for any subset of variables can be obtained by conditioning and marginalization, although this can only be done efficiently up to a multiplicative constant due to the intractability of the RBM's partition function (Long and Servedio, 2010).

RBMs have been widely applied to various modeling tasks, both as generative models (e.g. Salakhutdinov and Murray, 2008; Hinton, 2000; Courville et al., 2011; Marlin et al., 2010; Tang and Sutskever, 2011), and for pre-training feed-forward neural nets in a layer-wise fashion (Hinton and Salakhutdinov, 2006). This method has led to many new applications in general machine learning problems including object recognition and dimensionality reduction. While promising for practical applications, the scope and basic properties of these statistical models have only begun to be studied.

As with any statistical model, it is important to understand the expressive power of RBMs, both to gain insight into the range of problems where they can be successfully applied, and to provide justification for the use of potentially more expressive generative models. In particular, we are interested in the question of how large the number of hidden units m must be in order to capture a particular distribution to arbitrarily high accuracy. The question of size is of practical interest, since very large models will be computationally more demanding (or totally impractical), and will tend to overfit a lot more during training.

It was shown by Freund and Haussler (1994), and later by Le Roux and Bengio (2008), that for binary-valued x, any distribution over x can be realized (up to an approximation error which vanishes exponentially quickly in the magnitude of the parameters) by an RBM, as long as m is allowed to grow exponentially fast in the input dimension (n). Intuitively, this construction works by instantiating, for each of the up to 2^n possible values of x that have support, a single hidden unit which turns on only for that particular value of x (with overwhelming probability), so that the corresponding probability mass can be individually set by manipulating that unit's bias parameter. An improvement to this result was obtained by Montufar and Ay (2011); however this construction still requires that m grow exponentially fast in n.

Recently, Montufar et al. (2011) generalized the construction used by Le Roux and Bengio (2008) so that each hidden unit turns on for, and assigns probability mass to, not just a single x, but a "cubical set" of possible x's, which is defined as a subset of {0, 1}^n where some entries of x are fixed/determined, and the rest are free. By combining such hidden units that are each specialized to a particular cubical set, they showed that any k-component mixture of product distributions over the free variables of mutually disjoint cubical sets can be approximated arbitrarily well by an RBM with m = k hidden units.

Unfortunately, families of distributions that are of this specialized form (for some m = k bounded by a polynomial function of n) constitute only a very limited subset of all distributions that have some kind of meaningful/interesting structure. For example, this result would not allow us to efficiently construct simple distributions where the mass is a function of Σ_i x_i (e.g., for p(x) ∝ PARITY(x)).

In terms of what kinds of distributions provably cannot be efficiently represented by RBMs, even less is known. Cueto et al. (2009) characterized the distributions that can be realized by an RBM with k parameters as residing within a manifold inside the entire space of distributions on {0, 1}^n whose dimension depends on k. For sub-exponential k this implies the existence of distributions which cannot be represented.
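The single-hidden-unit intuition described above can be made concrete with a small numerical sketch. The parameterization below is our own illustration, not the one from Le Roux and Bengio (2008): the unit's weights are M(2x*_i − 1) and its bias is −M(|x*| − 1/2), so its input is +M/2 at the target vector x* and at most −M/2 at every other binary vector; M = 20 is an arbitrary choice, and visible biases are taken to be zero.

```python
import itertools, math

def indicator_unit(x_star, M=20.0):
    # Our own parameterization of the idea above: weights M*(2*x_star - 1),
    # bias -M*(#ones(x_star) - 1/2). The unit's input x.w + b is +M/2 at
    # x_star and at most -M/2 at every other binary vector.
    w = [M * (2 * s - 1) for s in x_star]
    b = -M * (sum(x_star) - 0.5)
    return w, b

def marginal(units, n):
    # p(x) is proportional to sum_h exp(-E(x, h)) with visible biases c = 0;
    # the sum over h factorizes into prod_j (1 + exp(x.w_j + b_j)).
    def unnorm(x):
        p = 1.0
        for w, b in units:
            p *= 1.0 + math.exp(sum(a * wi for a, wi in zip(x, w)) + b)
        return p
    xs = list(itertools.product((0, 1), repeat=n))
    Z = sum(unnorm(x) for x in xs)
    return {x: unnorm(x) / Z for x in xs}
```

Adding one such unit per supported vector (and tuning the biases to set the individual masses) reproduces, in miniature, the exponential-size construction described above.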
However, this kind of result gives us no indication of what these hard-to-represent distributions might look like, leaving the possibility that they might all be structureless or otherwise uninteresting.

In this paper we first develop some tools and simulation results which relate RBMs to certain easier-to-analyze approximations, and to neural networks with 1 hidden layer of threshold units, for which many results about representational efficiency are already known (Maass, 1992; Maass et al., 1994; Hajnal et al., 1993). This opens the door to a range of potentially relevant complexity results, some of which we apply in this paper.

Next, we present a construction that shows how RBMs with m = n^2 + 1 can produce arbitrarily good approximations to any distribution where the mass is a symmetric function of the inputs (that is, it depends on Σ_i x_i). One example of such a function is the (in)famous PARITY function, which was shown to be hard to compute in the perceptron model by the classic Minsky and Papert book from 1968. This distribution is highly non-smooth and has exponentially many modes.

Having ruled out distributions with symmetric mass functions as candidates for ones that are hard for RBMs to represent, we provide a concrete example of one whose mass computation involves only one additional operation vs computing PARITY, and yet whose representation by an RBM provably requires m to grow exponentially with n (assuming an exponential upper bound on the size of the RBM's weights). Because this distribution is particularly simple, it can be viewed as a special case of many other more complex types of distributions, and thus our results speak to the hardness of representing those distributions with RBMs as well.

Our results provide a fine delineation between what is "easy" for RBMs to represent, and what is "hard". Perhaps more importantly, they demonstrate that the distributions that cannot be efficiently represented by RBMs can have a relatively basic structure, and are not simply random in appearance as one might hope given the previous results. This provides perhaps the first completely rigorous justification for the use of deeper generative models such as Deep Boltzmann Machines (Salakhutdinov and Hinton, 2009) and contrastive backpropagation networks (Hinton et al., 2006) over standard RBMs.

The rest of the paper is organized as follows. Section 2 characterizes the unnormalized log-likelihood as a type of neural network (called an "RBM network") and shows how this type is related to single hidden layer neural networks of threshold neurons, and to an easier-to-analyze approximation (which we call a "hardplus RBM network"). Section 3 describes an m = n^2 + 1 construction for distributions whose mass is a function of Σ_i x_i, and in Section 4 we present an exponential lower bound on m for a slightly more complicated class of explicit distributions. Note that all proofs can be found in the Appendix.

Figure 1: Left: An illustration of a basic RBM network with n = 3 and m = 5. The hidden biases are omitted to avoid clutter. Right: A plot comparing the soft and hard activation functions.

2 RBM networks

2.1 Free energy function

In an RBM, the (negative) unnormalized log probability of x, after h has been marginalized out, is known as the free energy.
Denoted by F_θ(x), the free energy satisfies the property that p(x) = exp(−F_θ(x))/Z_θ, where Z_θ is the usual partition function.

It is well known (see Appendix A.1 for a derivation) that, due to the bipartite structure of RBMs, computing F is tractable and has a particularly nice form:

F_θ(x) = −c^T x − Σ_j log(1 + exp(x^T [W]_j + b_j))    (1)

where [W]_j is the j-th column of W.

Because the free energy completely determines the log probability of x, it fully characterizes an RBM's distribution. So studying what kinds of distributions an RBM can represent amounts to studying the kinds of functions that can be realized by the free energy function for some setting of θ.

2.2 RBM networks

The form of an RBM's free energy function can be expressed as a standard feed-forward neural network, or equivalently, a real-valued circuit, where instead of using hidden units with the usual sigmoidal activation functions, we have m "neurons" (a term we will use to avoid confusion with the original meaning of a "unit" in the context of RBMs) that use the softplus activation function:

soft(y) = log(1 + exp(y))

Note that at the cost of increasing m by one (which does not matter asymptotically) and introducing an arbitrarily small approximation error, we can assume that the visible biases (c) of an RBM are all zero.
To see this, note that up to an additive constant, we can very closely approximate c^T x by soft(K + c^T x) ≈ K + c^T x for a suitably large value of K (i.e., K ≫ ‖c‖_1 ≥ max_x(c^T x)). Proposition 11 in the Appendix quantifies the very rapid convergence of this approximation as K increases.

These observations motivate the following definition of an RBM network, which computes functions with the same form as the negative free energy function of an RBM (assumed to have c = 0), or equivalently the log probability (negative energy) function of an RBM. RBM networks are illustrated in Figure 1.

Definition 1. An RBM network with parameters W, b is defined as a neural network with one hidden layer containing m softplus neurons and weights and biases given by W and b, so that each neuron j's output is soft(x^T [W]_j + b_j). The output layer contains one neuron whose weights and bias are given by 1 ≡ [11...1]^T and the scalar B, respectively.

For convenience, we include the bias constant B so that RBM networks can shift their output by an additive constant (which does not affect the probability distribution implied by the RBM network, since any additive constant is canceled out by log Z in the full log probability).

2.3 Hardplus RBM networks

A function which is somewhat easier to analyze than the softplus function is the so-called hardplus function (aka "plus" or "rectification"), defined by:

hard(y) = max(0, y)

As their names suggest, the softplus function can be viewed as a smooth approximation of the hardplus, as illustrated in Figure 1.
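Both activation functions, and the two approximations just discussed (softplus vs. hardplus, and the visible-bias trick), can be checked directly in a few lines. The pointwise bound log(1 + e^(−|y|)) ≤ e^(−|y|) used below is standard calculus rather than a statement from the paper, and the helper names are ours.

```python
import math

def soft(y):
    # softplus, log(1 + exp(y)), computed in a numerically stable way
    return max(y, 0.0) + math.log1p(math.exp(-abs(y)))

def hard(y):
    # hardplus, i.e. rectification
    return max(y, 0.0)

def gap(y):
    # |soft(y) - hard(y)| = log(1 + exp(-|y|)), which is at most exp(-|y|),
    # so the two functions agree exponentially fast in |y|
    return abs(soft(y) - hard(y))
```

The same exponential convergence is behind the visible-bias trick: once K dominates |c^T x|, soft(K + c^T x) differs from K + c^T x by less than e^(−K).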
We define a hardplus RBM network in the obvious way: as an RBM network with the softplus activation functions of the hidden neurons replaced with hardplus functions.

The strategy we use to prove many of the results in this paper is to first establish them for hardplus RBM networks, and then show how they can be adapted to the standard softplus case via simulation results given in the following section.

2.4 Hardplus RBM networks versus (Softplus) RBM networks

In this section we present some approximate simulation results which relate hardplus and standard (softplus) RBM networks.

The first result formalizes the simple observation that for large input magnitudes, the softplus and hardplus functions behave very similarly (see Figure 1, and Proposition 11 in the Appendix).

Lemma 2. Suppose we have a softplus and a hardplus RBM network with identical sizes and parameters. If, for each possible input x ∈ {0, 1}^n, the magnitude of the input to each neuron is bounded from below by C, then the two networks compute the same real-valued function, up to an error (measured by |·|) which is bounded by m exp(−C).

The next result demonstrates how to approximately simulate an RBM network with a hardplus RBM network while incurring an approximation error which shrinks as the number of neurons increases. The basic idea is to simulate individual softplus neurons with groups of hardplus neurons that compute what amounts to a piece-wise linear approximation of the smooth region of a softplus function.

Theorem 3. Suppose we have a (softplus) RBM network with m hidden neurons with parameters bounded in magnitude by C. Let p > 0.
Then there exists a hardplus RBM network with ≤ 2m^2 p log(mp) + m hidden neurons and with parameters bounded in magnitude by C which computes the same function, up to an approximation error of 1/p.

Note that if p and m are polynomial functions of n, then the simulation produces hardplus RBM networks whose size is also polynomial in n.

2.5 Thresholded Networks and Boolean Functions

Many relevant results and proof techniques concerning the properties of neural networks focus on the case where the output is thresholded to compute a Boolean function (i.e. a binary classification). In this section we define some key concepts regarding output thresholding, and present some basic propositions that demonstrate how hardness results for computing Boolean functions via thresholding yield analogous hardness results for computing certain real-valued functions.

We say that a real-valued function g represents a Boolean function f with margin δ if for all x, g satisfies |g(x)| ≥ δ and thresh(g(x)) = f(x), where thresh is the 0/1-valued threshold function defined by:

thresh(a) = 1 if a ≥ 0, and 0 if a < 0

We define a thresholded neural network (a distinct concept from a "threshold network", which is a neural network with hidden neurons whose activation function is thresh) to be a neural network whose output is a single real value, which is followed by an application of the threshold function. Such a network will be said to compute a given Boolean function f with margin δ (similar to the concept of "separation" from Maass et al.
(1994)) if the real-valued input g to the final threshold represents f according to the above definition.

While the output of a thresholded RBM network does not correspond to the log probability of an RBM, the following observation spells out how we can use thresholded RBM networks to establish lower bounds on the size of an RBM network required to compute certain simple functions (i.e., real-valued functions that represent certain Boolean functions):

Proposition 4. If an RBM network of size m can compute a real-valued function g which represents f with margin δ, then there exists a thresholded RBM network of size m that computes f with margin δ.

This statement clearly holds if we replace each instance of "RBM network" with "hardplus RBM network" above.

Using Theorem 3 we can prove a more interesting result, which states that any lower bound result for thresholded hardplus RBMs implies a somewhat weaker lower bound result for standard RBM networks:

Proposition 5.
If an RBM network of size ≤ m with parameters bounded in magnitude by C computes a function which represents a Boolean function f with margin δ, then there exists a thresholded hardplus RBM network of size ≤ 4m^2 log(2m/δ)/δ + m with parameters bounded in magnitude by C (C can be ∞) that computes f(x) with margin δ/2.

This proposition implies that any exponential lower bound on the size of a thresholded hardplus RBM network will yield an exponential lower bound for (softplus) RBM networks that compute functions of the given form, provided that the margin δ is bounded from below by some function of the form 1/poly(n).

Intuitively, if f is a Boolean function and no RBM network of size m can compute a real-valued function that represents f (with a margin δ), this means that no RBM of size m can represent any distribution where the log probability of each member of {x | f(x) = 1} is at least 2δ higher than each member of {x | f(x) = 0}. In other words, RBMs of this size cannot generate any distribution where the two "classes" implied by f are separated in log probability by more than 2δ.

2.6 RBM networks versus standard neural networks

Viewing the RBM log probability function through the formalism of neural networks (or real-valued circuits) allows us to make use of known results for general neural networks, and helps highlight important differences between what an RBM can effectively "compute" (via its log probability) and what a standard neural network can compute.

There is a rich literature studying the complexity of various forms of neural networks, with diverse classes of activation functions, e.g., Maass (1992); Maass et al. (1994); Hajnal et al. (1993).
RBM networks are distinguished from these, primarily because they have a single hidden layer and because the upper level weights are constrained to be 1.

For some activation functions this restriction may not be significant, but for soft/hard-plus neurons, whose output is always positive, it makes particular computations much more awkward (or perhaps impossible) to express efficiently. Intuitively, the j-th softplus neuron acts as a "feature detector", which when "activated" by an input such that x^T w_j + b_j ≫ 0, can only contribute positively to the log probability of x, according to an (asymptotically) affine function of x given by that neuron's input.

For example, it is easy to design an RBM network that can (approximately) output 1 for the all-zeros input x = 0 and 0 otherwise (i.e., have a single hidden neuron with weights −M·1 for a large M and bias b such that soft(b) = 1), but it is not immediately obvious how an RBM network could efficiently compute (or approximate) the function which is 1 on all inputs except x = 0, and 0 otherwise (it turns out that a non-obvious construction exists for m = n). By comparison, a standard threshold network requires only 1 hidden neuron to compute such a function.

In fact, it is easy to show¹ that without the constraint on upper level weights, an RBM network would be, up to a linear factor, at least as efficient at representing real-valued functions as a neural network with 1 hidden layer of threshold neurons.
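The footnoted claim can be illustrated numerically. The sketch below is our own, under the stated assumption that the unit-output-weight constraint is dropped: combining two softplus neurons with output weights +1 and −1 (not legal in an RBM network) gives a bounded, monotone "sigmoid-like" response, and scaling the incoming weight saturates it into an approximate 0/1 threshold.

```python
import math

def soft(y):
    # numerically stable softplus log(1 + exp(y))
    return max(y, 0.0) + math.log1p(math.exp(-abs(y)))

def sigmoid_like(y):
    # two softplus neurons combined with output weights +1 and -1;
    # the difference rises monotonically from 0 (y -> -inf) to 1 (y -> +inf)
    return soft(y) - soft(y - 1.0)

def approx_step(y, scale=50.0):
    # scaling the incoming weight saturates the sigmoid-like unit in both
    # directions, approximating a threshold of y arbitrarily well
    return sigmoid_like(scale * y)
```

The particular offset (1.0) and scale (50.0) are arbitrary illustrative choices; only the saturation behavior matters for the argument.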
From this, and from Theorem 4.1 of Maass et al. (1994), it follows that a thresholded RBM network is, up to a polynomial increase in size, at least as efficient at computing Boolean functions as 1-hidden-layer neural networks with any "sigmoid-like" activation function² and polynomially bounded weights.

¹To see this, note that we could use 2 softplus neurons to simulate a single neuron with a "sigmoid-like" activation function (i.e., by setting the weights that connect them to the output neuron to have opposite signs). Then, by increasing the size of the weights so the sigmoid saturates in both directions for all inputs, we could simulate a threshold function arbitrarily well, thus allowing the network to compute any function computable by a one-hidden-layer threshold network while using only twice as many neurons.

²This is a broad class and includes the standard logistic sigmoid. See Maass et al. (1994) for a precise technical definition.

Figure 2: Left: The functions computed by the 5 building-blocks as constructed by Theorem 7 when applied to the PARITY function for n = 5. Right: The output total of the hardplus RBM network constructed in Theorem 7. The dotted lines indicate the target 0 and 1 values. Note: For purposes of illustration we have extended the function outputs over all real values of X in the obvious way.

2.7 Simulating hardplus RBM networks by a one-hidden-layer threshold network

Here we provide a natural simulation of hardplus RBM networks by threshold networks with one hidden layer. Because this is an efficient (polynomial) and exact simulation, it implies that a hardplus RBM network can be no more powerful than a threshold network with one hidden layer, for which several lower bound results are already known.

Theorem 6.
Let f be a real-valued function computed by a hardplus RBM network of size m. Then f can be computed by a single hidden layer threshold network of size mn. Furthermore, if the weights of the RBM network have magnitude at most C, then the weights of the corresponding threshold network have magnitude at most (n + 1)C.

3 n^2 + 1-sized RBM networks can compute any symmetric function

In this section we present perhaps the most surprising results of this paper: a construction of an n^2 + 1-sized RBM network (or hardplus RBM network) for computing any given symmetric function of x. Here, a symmetric function is defined as any real-valued function whose output depends only on the number of 1-bits in the input x. This quantity is denoted X ≡ Σ_i x_i. A well-known example of a symmetric function is PARITY.

Symmetric functions are already known³ to be computable by single hidden layer threshold networks (Hajnal et al., 1993) with m = n. Meanwhile (qualified) exponential lower bounds on m exist for functions which are only slightly more complicated (Hajnal et al., 1993; Forster, 2002).

Given that hardplus RBM networks appear to be strictly less expressive than such threshold networks (as discussed in Section 2.6), it is surprising that they can nonetheless efficiently compute functions that test the limits of what those networks can compute efficiently.

Theorem 7. Let f : {0, 1}^n → R be a symmetric function defined by f(x) = t_k for Σ_i x_i = k. Then (i) there exists a hardplus RBM network, of size n^2 + 1, and with weights polynomial in n and t_0, . . . , t_n, that computes f exactly, and (ii) for every ε there is a softplus RBM network of size n^2 + 1, and with weights polynomial in n, t_0, . . . , t_n and log(1/ε), that computes f within an additive error ε.

The high level idea of this construction is as follows.
Our hardplus RBM network consists of n "building blocks", each composed of n hardplus neurons, plus one additional hardplus neuron, for a total size of m = n^2 + 1. Each of these building blocks is designed to compute a function of the form:

max(0, γX(e − X))

for parameters γ > 0 and e > 0. This function, examples of which are illustrated in Figure 2, is quadratic from X = 0 to X = e and is 0 otherwise.

The main technical challenge is then to choose the parameters of these building blocks so that the sum of n of these "rectified quadratics", plus the output of the extra hardplus neuron (which handles the X = 0 case), yields a function that matches f, up to an additive constant (which we then fix by setting the bias B of the output neuron). This would be easy if we could compute more general rectified quadratics of the form max(0, γ(X − g)(e − X)), since we could just take g = k − 1/2 and e = k + 1/2 for each possible value k of X. But the requirement that g = 0 makes this more difficult, since significant overlap between non-zero regions of these functions will be unavoidable. Further complicating the situation is the fact that we cannot exploit linear cancelations due to the restriction on the RBM network's second layer weights. Figure 2 depicts an example of the solution to this problem as given in our proof of Theorem 7.

³The construction in Hajnal et al. (1993) is only given for Boolean-valued symmetric functions but can be generalized easily.

Note that this construction is considerably more complex than the well-known construction used for computing symmetric functions with 1 hidden layer threshold networks (Hajnal et al., 1993).
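The building-block shape just described is easy to transcribe (the helper below is ours; the parameter values in the checks are arbitrary). The last assertion illustrates the overlap problem mentioned above: because every block's non-zero region starts at X = 0, blocks with different endpoints e necessarily have overlapping supports.

```python
def building_block(X, gamma, e):
    # rectified quadratic max(0, gamma * X * (e - X)):
    # quadratic in X on the interval (0, e), and exactly 0 outside it
    return max(0.0, gamma * X * (e - X))

# zero at the endpoints and beyond
assert building_block(0, 1.0, 4) == 0.0
assert building_block(5, 1.0, 4) == 0.0
# quadratic inside (0, e): gamma * X * (e - X)
assert building_block(2, 1.0, 4) == 4.0
# supports of blocks with different e overlap near X = 0
assert building_block(1, 1.0, 3) > 0 and building_block(1, 1.0, 5) > 0
```

Reproducing the actual parameter choices of Theorem 7 requires the proof in the Appendix; this sketch only shows the shape of the basis functions being combined.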
While we cannot prove that ours is the most efficient possible construction for RBM networks, we can prove that a construction directly analogous to the one used for 1 hidden layer threshold networks (where each individual neuron computes a symmetric function) cannot possibly work for RBM networks.

To see this, first observe that any neuron that computes a symmetric function must compute a function of the form g(βX + b), where g is the activation function and β is some scalar. Then, noting that both soft(y) and hard(y) are convex functions of y, and that the composition of an affine function and a convex function is convex, we have that each neuron computes a convex function of X. Then, because the positive sum of convex functions is convex, the output of the RBM network (which is the unweighted sum of the outputs of its neurons, plus a constant) is itself convex in X. Thus the symmetric functions computable by such RBM networks must be convex in X, a severe restriction which rules out most examples.

4 Lower bounds on the size of RBM networks for certain functions

4.1 Existential results

In this section we prove a result which establishes the existence of functions which cannot be computed by RBM networks that are not exponentially large.

Instead of identifying non-representable distributions as lying in the complement of some low-dimensional manifold (as was done previously), we will establish the existence of Boolean functions which cannot be represented with a sufficiently large margin by the output of any sub-exponentially large RBM network.
However, this result, like previous such existential results, will say nothing about what these Boolean functions actually look like.

To prove this result, we will make use of Proposition 5 and a classical result of Muroga (1971) which allows us to discretize the incoming weights of a threshold neuron (without changing the function it computes), thus allowing us to bound the number of possible Boolean functions computable by 1-layer threshold networks of size m.

Theorem 8. Let F_{m,δ,n} represent the set of those Boolean functions on {0, 1}^n that can be computed by a thresholded RBM network of size m with margin δ. Then there exists a fixed number K such that |F_{m,δ,n}| ≤ 2^{poly(s, m, n, δ)}, where

s(m, δ, n) = (4m^2 n / δ) log(2m / δ) + m.

In particular, when m^2 ≤ δ·2^{αn}, for any constant α < 1/2, the ratio of the size of the set F_{m,δ,n} to the total number of Boolean functions on {0, 1}^n (which is 2^{2^n}) rapidly converges to zero with n.

4.2 Qualified lower bound results for the IP function

While interesting, existential results such as the one above do not give us a clear picture of what a particular hard-to-compute function for RBM networks might look like. Perhaps these functions will resemble purely random maps without any interesting structure. Perhaps they will consist only of functions that require exponential time to compute on a Turing machine, or even worse, ones that are non-computable.
In such cases, not being able to compute such functions would not constitute a meaningful limitation on the expressive efficiency of RBM networks.

In this sub-section we present strong evidence that this is not the case by exhibiting a simple Boolean function that provably requires exponentially many neurons to be computed by a thresholded RBM network, provided that the margin is not allowed to be exponentially smaller than the weights. Prior to these results, there was no formal separation between the kinds of unnormalized log-likelihoods realizable by polynomially sized RBMs, and the class of functions computable efficiently by almost any reasonable model of computation, such as arbitrarily deep Boolean circuits.

The Boolean function we will consider is the well-known "inner product mod 2" function, denoted IP(x), which is defined as the parity of the inner product of the first half of x with the second half (we assume for convenience that n is even). This function can be thought of as a strictly harder to compute version of PARITY (since PARITY is trivially reducible to it), which as we saw in Section 3 can be efficiently computed by a thresholded RBM network (indeed, an RBM network can efficiently compute any possible real-valued representation of PARITY). Intuitively, IP(x) should be harder than PARITY, since it involves an extra "stage" or "layer" of sequential computation, and our formal results with RBMs agree with this intuition.

There are many computational problems that IP can be reduced to, so showing that RBM networks cannot compute IP thus proves that RBMs cannot efficiently model a wide range of distributions whose unnormalized log-likelihoods are sufficiently complex in a computational sense. Examples of such log-likelihoods include ones given by the multiplication of binary-represented integers, or the evaluation of the connectivity of an encoded graph.
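IP is a direct transcription of its definition (the helper names below are ours). The loop in the final check spells out the remark that PARITY is trivially reducible to IP: pairing an input z with a block of all 1's makes the inner product mod 2 equal the parity of z.

```python
from itertools import product

def parity(bits):
    # PARITY: 1 iff the number of 1-bits is odd
    return sum(bits) % 2

def ip(x):
    # inner product mod 2 of the first and second halves of x (n even)
    half = len(x) // 2
    return sum(x[i] * x[half + i] for i in range(half)) % 2

# PARITY reduces to IP: fix the second half of the input to all 1's
assert all(parity(z) == ip(tuple(z) + (1,) * len(z))
           for z in product((0, 1), repeat=4))
```

The reduction only fixes half of IP's input bits, which is why IP being hard for RBM networks is consistent with PARITY being easy for them.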
For other examples, see Corollary 3.5 of Hajnal et al. (1993).
Using the simulation of hardplus RBM networks by 1 hidden layer threshold networks (Theorem 6), Proposition 5, and an existing result about the hardness of computing IP by 1 hidden layer thresholded networks of bounded weights due to Hajnal et al. (1993), we can prove the following basic result:

Theorem 9. If
$$m < \min\left\{ 2^{n/3}\sqrt{\frac{\delta}{C}},\ \ 2^{n/6}\sqrt{\frac{\delta}{4C\log(2/\delta)}},\ \ 2^{n/9}\sqrt[3]{\frac{\delta}{4C}} \right\},$$
then no RBM network of size $m$, whose weights are bounded in magnitude by $C$, can compute a function which represents $n$-dimensional IP with margin $\delta$. In particular, for $C$ and $1/\delta$ bounded by polynomials in $n$, for $n$ sufficiently large, this condition is satisfied whenever $m < 2^{(1/9-\epsilon)n}$ for some $\epsilon > 0$.

Translating the definitions, this result says the following about the limitations of efficient representation by RBMs: unless either the weights or the number of units of an RBM are exponentially large in $n$, an RBM cannot capture any distribution that has the property that the $x$'s s.t. $IP(x) = 1$ are significantly more probable than the remaining $x$'s.
While the above theorem is easy to prove from known results and the simulation/hardness results given in previous sections, by generalizing the techniques used in Hajnal et al. (1993) we can (with much more effort) derive a stronger result. This gives an improved bound on $m$ and lets us partially relax the magnitude bound on the parameters so that they can be arbitrarily negative:

Theorem 10. If $m < \frac{\delta}{2\max\{\log 2,\ nC + \log 2\}}\cdot 2^{n/4}$, then no RBM network of size $m$, whose weights are upper bounded in value by $C$, can compute a function which represents $n$-dimensional IP with margin $\delta$.
In particular, for $C$ and $1/\delta$ bounded by polynomials in $n$, for $n$ sufficiently large, this condition is satisfied whenever $m < 2^{(1/4-\epsilon)n}$ for some $\epsilon > 0$.
The general theorem we use to prove this second result (Theorem 17 in the Appendix) requires only that the neural network have 1 hidden layer of neurons with activation functions that are monotonic and contribute to the top neuron (after multiplication by the outgoing weight) a quantity which can be bounded by a certain exponentially growing function of $n$ (that also depends on $\delta$). This technique can therefore be applied to produce lower bounds for much more general types of neural networks, and may be independently interesting.

5 Conclusions and Future Work

In this paper we significantly advanced the theoretical understanding of the representational efficiency of RBMs. We treated the RBM's unnormalized log-likelihood as a neural network, which allowed us to relate an RBM's representational efficiency to that of threshold networks, which are much better understood. We showed that, quite surprisingly, RBMs can efficiently represent distributions that are given by symmetric functions such as PARITY, but cannot efficiently represent distributions which are slightly more complicated, assuming an exponential bound on the weights. This provides rigorous justification for the use of potentially more expressive generative models, such as deeper ones.
Going forward, some promising research directions and open problems include characterizing the expressive power of Deep Boltzmann Machines and more general Boltzmann machines, and proving an exponential lower bound for some specific distribution without any qualifications on the weights.

Acknowledgments
This research was supported by NSERC.
JM is supported by a Google Fellowship; AC by a Ramanujan Fellowship of the DST, India.

References

Aaron Courville, James Bergstra, and Yoshua Bengio. Unsupervised models of images by spike-and-slab RBMs. In Proceedings of the 28th International Conference on Machine Learning, pages 952–960, 2011.

Maria Angélica Cueto, Jason Morton, and Bernd Sturmfels. Geometry of the Restricted Boltzmann Machine. arXiv:0908.4425v1, 2009.

J. Forster. A linear lower bound on the unbounded error probabilistic communication complexity. J. Comput. Syst. Sci., 65(4):612–625, 2002.

Yoav Freund and David Haussler. Unsupervised learning of distributions on binary vectors using two layer networks, 1994.

A. Hajnal, W. Maass, P. Pudlák, M. Szegedy, and G. Turán. Threshold circuits of bounded depth. J. Comput. System. Sci., 46:129–154, 1993.

G. E. Hinton and R. R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006. ISSN 1095-9203.

Geoffrey Hinton. Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8):1771–1800, 2002.

Geoffrey E. Hinton, Simon Osindero, Max Welling, and Yee Whye Teh. Unsupervised discovery of nonlinear structure using contrastive backpropagation. Cognitive Science, 30(4):725–731, 2006.

Nicolas Le Roux and Yoshua Bengio. Representational power of Restricted Boltzmann Machines and deep belief networks. Neural Computation, 20(6):1631–1649, 2008.

Philip Long and Rocco Servedio. Restricted Boltzmann Machines are hard to approximately evaluate or simulate. In Proceedings of the 27th International Conference on Machine Learning, pages 952–960, 2010.

Wolfgang Maass. Bounds for the computational power and learning complexity of analog neural nets (extended abstract). In Proc. of the 25th ACM Symp.
Theory of Computing, pages 335–344, 1992.

Wolfgang Maass, Georg Schnitger, and Eduardo D. Sontag. A comparison of the computational power of sigmoid and boolean threshold circuits. In Theoretical Advances in Neural Computation and Learning, pages 127–151. Kluwer, 1994.

Benjamin M. Marlin, Kevin Swersky, Bo Chen, and Nando de Freitas. Inductive principles for Restricted Boltzmann Machine learning. Journal of Machine Learning Research - Proceedings Track, 9:509–516, 2010.

G. Montufar, J. Rauh, and N. Ay. Expressive power and approximation errors of Restricted Boltzmann Machines. In Advances in Neural Information Processing Systems, 2011.

Guido Montufar and Nihat Ay. Refinements of universal approximation results for deep belief networks and Restricted Boltzmann Machines. Neural Comput., 23(5):1306–1319, May 2011.

Saburo Muroga. Threshold logic and its applications. Wiley, 1971.

Ruslan Salakhutdinov and Geoffrey E. Hinton. Deep Boltzmann machines. Journal of Machine Learning Research - Proceedings Track, 5:448–455, 2009.

Ruslan Salakhutdinov and Iain Murray. On the quantitative analysis of Deep Belief Networks. In Andrew McCallum and Sam Roweis, editors, Proceedings of the 25th Annual International Conference on Machine Learning (ICML 2008), pages 872–879. Omnipress, 2008.

Yichuan Tang and Ilya Sutskever. Data normalization in the learning of Restricted Boltzmann Machines. Technical Report UTML-TR-11-2, Department of Computer Science, University of Toronto, 2011.

A Appendix

A.1 Free-energy derivation

The following is a derivation of the well-known formula for the free-energy of an RBM.
This tractable form is made possible by the bipartite interaction structure of the RBM's units:
$$p(x) = \frac{1}{Z_\theta}\sum_h \exp(x^\top W h + c^\top x + b^\top h)$$
$$= \frac{1}{Z_\theta}\exp(c^\top x)\prod_j \sum_{h_j \in \{0,1\}} \exp(x^\top [W]_j h_j + b_j h_j)$$
$$= \frac{1}{Z_\theta}\exp(c^\top x)\exp\Big(\sum_j \log \sum_{h_j \in \{0,1\}} \exp(x^\top [W]_j h_j + b_j h_j)\Big)$$
$$= \frac{1}{Z_\theta}\exp\Big(c^\top x + \sum_j \log[1 + \exp(x^\top [W]_j + b_j)]\Big)$$
$$= \frac{1}{Z_\theta}\exp(-F_\theta(x))$$

A.2 Proofs for Section 2.4

We begin with a useful technical result:
Proposition 11. For arbitrary $y \in \mathbb{R}$ the following basic facts for the softplus function hold:
$$y - \mathrm{soft}(y) = -\mathrm{soft}(-y) \qquad \mathrm{soft}(y) \le \exp(y)$$

Proof. The first fact follows from:
$$y - \mathrm{soft}(y) = \log(\exp(y)) - \log(1 + \exp(y)) = \log\left(\frac{\exp(y)}{1 + \exp(y)}\right) = \log\left(\frac{1}{\exp(-y) + 1}\right) = -\log(1 + \exp(-y)) = -\mathrm{soft}(-y)$$
To prove the second fact, we will show that the function $f(y) = \exp(y) - \mathrm{soft}(y)$ is positive. Note that $f$ tends to 0 as $y$ goes to $-\infty$, since both $\exp(y)$ and $\mathrm{soft}(y)$ do. It remains to show that $f$ is monotonically increasing, which we establish by showing that its derivative is positive:
$$f'(y) = \exp(y) - \frac{1}{1 + \exp(-y)} > 0 \Leftrightarrow \frac{\exp(y)(1 + \exp(-y)) - 1}{1 + \exp(-y)} > 0 \Leftrightarrow \exp(y) + 1 - 1 > 0 \Leftrightarrow \exp(y) > 0$$

Proof of Lemma 2. Consider a single neuron in the RBM network and the corresponding neuron in the hardplus RBM network, whose net-inputs are given by $y = w^\top x + b$.
For each $x$, there are two cases for $y$.
If $y \ge 0$, we have by hypothesis that $y \ge C$, and so:
$$|\mathrm{hard}(y) - \mathrm{soft}(y)| = |y - \mathrm{soft}(y)| = |-\mathrm{soft}(-y)| = \mathrm{soft}(-y) \le \exp(-y) \le \exp(-C)$$
And if $y < 0$, we have by hypothesis that $y \le -C$, and so:
$$|\mathrm{hard}(y) - \mathrm{soft}(y)| = |0 - \mathrm{soft}(y)| = \mathrm{soft}(y) \le \exp(y) \le \exp(-C)$$
Thus, each corresponding pair of neurons computes the same function up to an error bounded by $\exp(-C)$. From this it is easy to show that the entire circuits compute the same function, up to an error bounded by $m\exp(-C)$, as required.

Proof of Theorem 3. Suppose we have a softplus RBM network with a number of hidden neurons given by $m$. To simulate this with a hardplus RBM network we will replace each neuron with a group of hardplus neurons with weights and biases chosen so that the sum of their outputs approximates the output of the original softplus neuron, to within a maximum error of $1/p$ where $p$ is some constant $> 0$.
First we describe the construction for the simulation of a single softplus neuron by a group of hardplus neurons.
Let $g$ be a positive integer and $a > 0$. We will define these more precisely later, but for what follows their precise values are not important.
At a high level, this construction works by approximating $\mathrm{soft}(y)$, where $y$ is the input to the neuron, by a piece-wise linear function expressed as the sum of a number of hardplus functions, whose "corners" all lie inside $[-a, a]$. Outside this range of values, we use the fact that $\mathrm{soft}(y)$ converges exponentially fast (in $a$) to 0 on the left, and to $y$ on the right (both of which can be trivially computed by hardplus functions).
Formally, for $i = 1, 2, \ldots, g, g+1$, let:
$$q_i = (i-1)\frac{2a}{g} - a$$
For $i = 1, 2, \ldots, g$, let:
$$\nu_i = \frac{\mathrm{soft}(q_{i+1}) - \mathrm{soft}(q_i)}{q_{i+1} - q_i}$$
and also let $\nu_0 = 0$ and $\nu_{g+1} = 1$.
Finally, for $i = 1, 2, \ldots, g, g+1$, let:
$$\eta_i = \nu_i - \nu_{i-1}$$
With these definitions it is straightforward to show that $1 \ge \nu_i > 0$ and $\nu_i > \nu_{i-1}$, and consequently $0 < \eta_i < 1$ for each $i$. It is also easy to show that $q_i > q_{i-1}$, $q_1 = -a$ and $q_{g+1} = a$.
For $i = 1, 2, \ldots, g, g+1$, we will set the weight vector $w_i$ and bias $b_i$ of the $i$-th hardplus neuron in our group so that the neuron outputs $\mathrm{hard}(\eta_i(y - q_i))$. This is accomplished by taking $w_i = \eta_i w$ and $b_i = \eta_i(b - q_i)$, where $w$ and $b$ (without the subscripts) are the weight vector and bias of the original softplus neuron.
Note that since $|\eta_i| \le 1$, the weights of these hard neurons are smaller in magnitude than the weights of the original soft neuron, and thus bounded by $C$ as required.
The total output (sum) for this group is:
$$T(y) = \sum_{i=1}^{g+1} \mathrm{hard}(\eta_i(y - q_i))$$
We will now bound the approximation error $|T(y) - \mathrm{soft}(y)|$ of our single-neuron simulation.
Note that for a given $y$ the $i$-th hardplus neuron in the group has a non-negative input iff $y \ge q_i$. Thus for $y < -a$ all of the neurons have a negative input. And for $y \ge -a$, if we take $j$ to be the largest index $i$ s.t. $q_i \le y$, then each neuron from $i = 1$ to $i = j$ will have non-negative input and each neuron from $i = j+1$ to $i = g+1$ will have negative input.
Consider the case that $y < -a$. Since the input to each neuron is negative, they each output 0 and thus $T(y) = 0$. This results in an approximation error $\le \exp(-a)$:
$$|T(y) - \mathrm{soft}(y)| = |0 - \mathrm{soft}(y)| = \mathrm{soft}(y) < \mathrm{soft}(-a) \le \exp(-a)$$
Next, consider the case that $y \ge -a$, and let $j$ be as given above.
In such a case we have:
$$T(y) = \sum_{i=1}^{g+1}\mathrm{hard}(\eta_i(y - q_i)) = \sum_{i=1}^{j}\eta_i(y - q_i) + 0 = \sum_{i=1}^{j}(\nu_i - \nu_{i-1})(y - q_i)$$
$$= y\sum_{i=1}^{j}(\nu_i - \nu_{i-1}) - \sum_{i=1}^{j}(\nu_i - \nu_{i-1})q_i = y\nu_j - y\nu_0 - \nu_j q_j + \sum_{i=1}^{j-1}\nu_i(q_{i+1} - q_i) + \nu_0 q_1$$
$$= \nu_j(y - q_j) + \sum_{i=1}^{j-1}(\mathrm{soft}(q_{i+1}) - \mathrm{soft}(q_i)) = \nu_j(y - q_j) + \mathrm{soft}(q_j) - \mathrm{soft}(q_1)$$
For $y \le a$ we note that $\nu_j(y - q_j) + \mathrm{soft}(q_j)$ is a secant approximation to $\mathrm{soft}(y)$ generated by the secant from $q_j$ to $q_{j+1}$, and upper bounds $\mathrm{soft}(y)$ for $y \in [q_j, q_{j+1}]$. Thus a crude bound on the error is $\mathrm{soft}(q_{j+1}) - \mathrm{soft}(q_j)$, which only makes use of the fact that $\mathrm{soft}(y)$ is monotonic. Then, because the slope (derivative) of $\mathrm{soft}(y)$ is $\sigma(y) = 1/(1 + \exp(-y)) < 1$, we can further (crudely) bound this by $q_{j+1} - q_j$. Thus the approximation error at such $y$'s may be bounded as:
$$|T(y) - \mathrm{soft}(y)| = |(\nu_j(y - q_j) + \mathrm{soft}(q_j) - \mathrm{soft}(q_1)) - \mathrm{soft}(y)| \le \max\{|\nu_j(y - q_j) + \mathrm{soft}(q_j) - \mathrm{soft}(y)|,\ \mathrm{soft}(q_1)\}$$
$$\le \max\{q_{j+1} - q_j,\ \exp(-a)\} = \max\left\{\frac{2a}{g},\ \exp(-a)\right\}$$
where we have also used $\mathrm{soft}(q_1) = \mathrm{soft}(-a) \le \exp(-a)$.
For the case $y > a$, we have $y > q_i$ for all $i$, so the largest index $j$ such that $q_j \le y$ is $j = g+1$. So $\nu_j(y - q_j) + \mathrm{soft}(q_j) - \mathrm{soft}(q_1) = y - a + \mathrm{soft}(a) - \mathrm{soft}(-a) = y$.
Thus the approximation error at such $y$'s is:
$$|T(y) - \mathrm{soft}(y)| = |y - \mathrm{soft}(y)| = |-\mathrm{soft}(-y)| = \mathrm{soft}(-y) \le \mathrm{soft}(-a) \le \exp(-a)$$
Having covered all cases for $y$, we conclude that the general approximation error for a single softplus neuron satisfies the following bound:
$$|T(y) - \mathrm{soft}(y)| \le \max\left\{\frac{2a}{g},\ \exp(-a)\right\}$$
For a softplus RBM network with $m$ neurons, the hardplus RBM network constructed by replacing each neuron with a group of hardplus neurons as described above will require a total of $m(g+1)$ neurons, and have an approximation error bounded by the sum of the individual approximation errors, which is itself bounded by:
$$m\max\left\{\frac{2a}{g},\ \exp(-a)\right\}$$
Taking $a = \log(mp)$ and $g = \lceil 2mpa \rceil$ gives:
$$m\max\left\{\frac{2a}{\lceil 2mpa \rceil},\ \frac{1}{mp}\right\} \le m\max\left\{\frac{2a}{2mpa},\ \frac{1}{mp}\right\} = m\max\left\{\frac{1}{mp},\ \frac{1}{mp}\right\} = \frac{1}{p}$$
Thus we see that with $m(g+1) = m(\lceil 2mp\log(mp)\rceil + 1) \le 2m^2 p\log(mp) + m$ neurons we can produce a hardplus RBM network which approximates the output of our softplus RBM network with error bounded by $1/p$.

Remark 12. Note that the construction used in the above lemma is likely far from optimal, as the placement of the $q_i$'s could be done more carefully. Also, the error bound we proved is crude and does not make strong use of the properties of the softplus function. Nonetheless, it is good enough for our purposes.

A.3 Proofs for Section 2.5

Proof of Proposition 5. Suppose that there is an RBM network of size $m$, with weights bounded in magnitude by $C$, that computes a function $g$ which represents $f$ with margin $\delta$.
Then taking $p = 2/\delta$ and applying Theorem 3, we have that there exists a hardplus RBM network of size $4m^2\log(2m/\delta)/\delta + m$ which computes a function $g'$ s.t.
$|g(x) - g'(x)| \le 1/p = \delta/2$ for all $x$.
Note that $f(x) = 1 \Rightarrow \mathrm{thresh}(g(x)) = 1 \Rightarrow g(x) \ge \delta \Rightarrow g'(x) \ge \delta - \delta/2 = \delta/2$, and similarly, $f(x) = 0 \Rightarrow \mathrm{thresh}(g(x)) = 0 \Rightarrow g(x) \le -\delta \Rightarrow g'(x) \le -\delta + \delta/2 = -\delta/2$. Thus we conclude that $g'$ represents $f$ with margin $\delta/2$.

A.4 Proofs for Section 2.7

Proof of Theorem 6. Let $f$ be a Boolean function on $n$ variables computed by a size $s$ hardplus RBM network with parameters $(W, b, d)$. We will first construct a three-layer hybrid Boolean/threshold circuit/network where the output gate is a simple weighted sum, the middle layer consists of AND gates, and the bottom hidden layer consists of threshold neurons. There will be $n \cdot m$ AND gates, one for every $i \in [n]$ and $j \in [m]$. The $(i,j)$-th AND gate will have inputs: (1) $x_i$ and (2) $(x^\top [W]_j \ge b_j)$. The weight going from the $(i,j)$-th AND gate to the output will be given by $[W]_{i,j}$. It is not hard to see that our three-layer network computes the same Boolean function as the original hardplus RBM network.
In order to obtain a single hidden layer threshold network, we replace each sub-network rooted at an AND gate of the middle layer by a single threshold neuron. Consider a general sub-network consisting of an AND of: (1) a variable $x_j$ and (2) a threshold neuron computing $(\sum_{i=1}^n a_i x_i \ge b)$. Let $Q$ be some number greater than the sum of all the $a_i$'s. We replace this sub-network by a single threshold gate that computes $(\sum_{i=1}^n a_i x_i + Q x_j \ge b + Q)$. Note that if the input $x$ is such that $\sum_i a_i x_i \ge b$ and $x_j = 1$, then $\sum_i a_i x_i + Q x_j$ will be at least $b + Q$, so the threshold gate will output 1. In all other cases, the threshold will output zero.
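This replacement step is easy to sanity-check by brute force. The sketch below (our own illustration, not code from the paper) compares the AND-of-a-variable-and-a-threshold sub-network against the single merged threshold gate on every input; for safety against arbitrary signs of the weights and bias, it takes $Q$ larger than $\sum_i |a_i| + |b|$, which is always sufficient:

```python
import itertools
import random

def and_of_threshold(x, j, a, b):
    # AND of the variable x[j] with the threshold gate (sum_i a_i*x_i >= b)
    return int(x[j] == 1 and sum(ai * xi for ai, xi in zip(a, x)) >= b)

def merged_threshold(x, j, a, b, Q):
    # The single replacement gate (sum_i a_i*x_i + Q*x[j] >= b + Q)
    return int(sum(ai * xi for ai, xi in zip(a, x)) + Q * x[j] >= b + Q)

def replacement_is_exact(n=6, trials=20, seed=0):
    # Brute-force check over all x in {0,1}^n for random integer weights.
    rng = random.Random(seed)
    for _ in range(trials):
        a = [rng.randint(-5, 5) for _ in range(n)]
        b = rng.randint(-5, 5)
        j = rng.randrange(n)
        # Q exceeding sum_i |a_i| + |b| guarantees the two gates agree.
        Q = sum(abs(ai) for ai in a) + abs(b) + 1
        for x in itertools.product((0, 1), repeat=n):
            if and_of_threshold(x, j, a, b) != merged_threshold(x, j, a, b, Q):
                return False
    return True
```

The check passes because when $x_j = 1$ the added $Q$ cancels on both sides of the inequality, and when $x_j = 0$ the sum $\sum_i a_i x_i$ can never reach $b + Q$.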
(If $\sum_i a_i x_i < b$, then even if $x_j = 1$, the sum will still be less than $Q + b$. Similarly, if $x_j = 0$, then since $\sum_i a_i x_i$ is never greater than $\sum_i a_i$, the total sum will be less than $Q \le (n+1)C$.)

A.5 Proof of Theorem 7

Proof. We will first describe how to construct a hardplus RBM network which satisfies the properties required for part (i). It will be composed of $n$ special groups of hardplus neurons (which are defined and discussed below), and one additional one we call the "zero-neuron", which will be defined later.

Definition 13. A "building block" is a group of $n$ hardplus neurons, parameterized by the scalars $\gamma$ and $e$, where the weight vector $w \in \mathbb{R}^n$ between the $i$-th neuron in the group and the input layer is given by $w_i = M - \gamma$ and $w_j = -\gamma$ for $j \ne i$, and the bias will be given by $b = \gamma e - M$, where $M$ is a constant chosen so that $M > \gamma e$.
For a given $x$, the input to the $i$-th neuron of a particular building block is given by:
$$\sum_{j=1}^n w_j x_j + b = w_i x_i + \sum_{j \ne i} w_j x_j + b = (M - \gamma)x_i - \gamma(X - x_i) + \gamma e - M = \gamma(e - X) - M(1 - x_i)$$
When $x_i = 0$, this is $\gamma(e - X) - M < 0$, and so the neuron will output 0 (by definition of the hardplus function).
On the other hand, when $x_i = 1$, the input to the neuron will be $\gamma(e - X)$ and thus the output will be $\max(0, \gamma(e - X))$.
In general, we have that the output will be given by:
$$x_i\max(0, \gamma(e - X))$$
From this it follows that the combined output from the neurons in the building block is:
$$\sum_{i=1}^n x_i\max(0, \gamma(e - X)) = \max(0, \gamma(e - X))\sum_{i=1}^n x_i = \max(0, \gamma(e - X))X = \max(0, \gamma X(e - X))$$
Note that whenever $X$ is positive, the output is a concave quadratic function in $X$, with zeros at $X = 0$ and $X = e$, and maximized at $X = e/2$, with value $\gamma e^2/4$.
Next we show how the parameters of the $n$ building blocks used in our construction can be set to produce a hardplus RBM network with the desired output.
First, define $d$ to be any number greater than or equal to $2n^2\sum_j |t_j|$.
Indexing the building blocks by $j$ for $1 \le j \le n$, we define their respective parameters $\gamma_j, e_j$ as follows:
$$\gamma_n = \frac{t_n + d}{n^2}, \qquad e_n = 2n$$
$$\gamma_j = \frac{t_j + d}{j^2} - \frac{t_{j+1} + d}{(j+1)^2}, \qquad e_j = \frac{2}{\gamma_j}\left(\frac{t_j + d}{j} - \frac{t_{j+1} + d}{j+1}\right) \quad \text{for } j < n$$
where we have assumed that $\gamma_j \ne 0$ (which will be established, along with some other properties of these definitions, in the next claim).

Claim 1. For all $j$, $1 \le j \le n$: (i) $\gamma_j > 0$; and (ii) for all $j$, $1 \le j \le n-1$, $j \le e_j \le j+1$.

Proof of Claim 1. Part (i): For $j = n$, by definition we know that $\gamma_n = \frac{t_n + d}{n^2}$. Since $d \ge 2n^2\sum_j |t_j| > |t_n|$, the numerator will be positive and therefore $\gamma_n$ will be positive.
For $j < n$, we have:
$$\gamma_j > 0 \Leftrightarrow \frac{t_j + d}{j^2} > \frac{t_{j+1} + d}{(j+1)^2} \Leftrightarrow (j+1)^2(t_j + d) > j^2(t_{j+1} + d)$$
$$\Leftrightarrow d((j+1)^2 - j^2) > j^2 t_{j+1} - (j+1)^2 t_j \Leftrightarrow d > \frac{j^2 t_{j+1} - (j+1)^2 t_j}{2j + 1}$$
The right side of the above inequality is less than or equal to $\frac{(j+1)^2(|t_{j+1}| + |t_j|)}{2j+1} \le (j+1)(|t_{j+1}| + |t_j|)$, which is strictly upper bounded by $2n^2\sum_j |t_j|$, and thus by $d$. So it follows that $\gamma_j > 0$ as needed.
Part (ii):
$$j \le e_j = \frac{2}{\gamma_j}\left(\frac{t_j + d}{j} - \frac{t_{j+1} + d}{j+1}\right) \Leftrightarrow j\gamma_j \le 2\left(\frac{t_j + d}{j} - \frac{t_{j+1} + d}{j+1}\right)$$
$$\Leftrightarrow \frac{t_j + d}{j} - \frac{j(t_{j+1} + d)}{(j+1)^2} \le 2\left(\frac{t_j + d}{j} - \frac{t_{j+1} + d}{j+1}\right)$$
$$\Leftrightarrow -\frac{j(t_{j+1} + d)}{(j+1)^2} \le \frac{t_j + d}{j} - 2\,\frac{t_{j+1} + d}{j+1}$$
$$\Leftrightarrow -(t_{j+1} + d)j^2 \le (t_j + d)(j+1)^2 - 2(t_{j+1} + d)j(j+1)$$
$$\Leftrightarrow d(j^2 - 2j(j+1) + (j+1)^2) \ge -j^2 t_{j+1} + 2j(j+1)t_{j+1} - (j+1)^2 t_j$$
$$\Leftrightarrow d \ge -j^2 t_{j+1} + 2j(j+1)t_{j+1} - (j+1)^2 t_j$$
where we have used $j^2 - 2j(j+1) + (j+1)^2 = (j - (j+1))^2 = 1^2 = 1$ at the last line. Thus it suffices to make $d$ large enough to ensure that $j \le e_j$.
For our choice of $d$, this will be true.
For the upper bound we have:
$$e_j \le j+1 \Leftrightarrow 2\left(\frac{t_j + d}{j} - \frac{t_{j+1} + d}{j+1}\right) \le (j+1)\gamma_j = \frac{(j+1)(t_j + d)}{j^2} - \frac{t_{j+1} + d}{j+1}$$
$$\Leftrightarrow 2(t_j + d)j(j+1) - (t_{j+1} + d)j^2 \le (j+1)^2(t_j + d)$$
$$\Leftrightarrow -j^2(d + t_{j+1}) + 2j(j+1)(d + t_j) \le (j+1)^2(d + t_j)$$
$$\Leftrightarrow d(j^2 - 2j(j+1) + (j+1)^2) \ge -j^2 t_{j+1} + 2j(j+1)t_j - (j+1)^2 t_j$$
$$\Leftrightarrow d \ge -j^2 t_{j+1} + 2j(j+1)t_j - (j+1)^2 t_j$$
where we have used $j^2 - 2j(j+1) + (j+1)^2 = 1$ at the last line. Again, for our choice of $d$ the above inequality is satisfied.

Finally, define $M$ to be any number greater than $\max(t_0 + d,\ \max_i\{\gamma_i e_i\})$.
In addition to the $n$ building blocks, our hardplus RBM will include an additional unit that we will call the zero-neuron, which handles $x = 0$.
The zero-neuron will have weights $w$ defined by $w_i = -M$ for each $i$, and bias $b = t_0 + d$.
Finally, the output bias $B$ of our hardplus RBM network will be set to $-d$.
The total output of the network is simply the sum of the outputs of the $n$ different building blocks, the zero-neuron, and the constant bias $-d$.
To show part (i) of the theorem we want to prove that for all $k$, whenever $X = k$, our circuit outputs the value $t_k$.
We make the following definitions:
$$a_k \equiv -\sum_{j=k}^n \gamma_j \qquad b_k \equiv \sum_{j=k}^n \gamma_j e_j$$

Claim 2.
$$a_k = \frac{-(t_k + d)}{k^2} \qquad b_k = \frac{2(t_k + d)}{k} \qquad b_k = -2k a_k$$
This claim is self-evidently true by examining the basic definitions of $\gamma_j$ and $e_j$ and realizing that $a_k$ and $b_k$ are telescoping sums.
Given these facts, we can prove the following:

Claim 3. For all $k$, $1 \le k \le n$, when $X = k$ the sum of the outputs of all the $n$ building blocks is given by $t_k + d$.

Proof of Claim 3. For $X = n$, the $(\gamma_n, e_n)$-block computes $\max(0, \gamma_n X(e_n - X)) = \max(0, -\gamma_n X^2 + \gamma_n e_n X)$. By the definition of $e_n$, $n \le e_n$, and thus when $X \le n$, $\gamma_n X(e_n - X) \ge 0$. For all other building blocks $(\gamma_j, e_j)$, $j < n$, since $e_j \le j+1$, the block outputs zero, since $\gamma_j X(e_j - X)$ is less than or equal to zero. Thus the sum of all of the building blocks when $X = n$ is just the output of the $(\gamma_n, e_n)$-block, which is
$$\gamma_n \cdot n(e_n - n) = -\gamma_n n^2 + \gamma_n e_n n = -(t_n + d) + 2(t_n + d) = t_n + d$$
as desired.
For $X = k$, $1 \le k < n$, the argument is similar.
For all building blocks $j \ge k$, by Claim 1 we know that $e_j \ge j$, and therefore this block's output on $X = k$ is nonnegative and therefore contributes to the sum. On the other hand, for all building blocks $j < k$, by Claim 1 we know that $e_j \le j+1$, and therefore the block outputs 0 and so does not contribute to the sum.
Thus the sum of all of the building blocks is equal to the sum of the non-zero regions of the building blocks $j$ for $j \ge k$. Since each of these is a quadratic function of $X$, the sum can be written as a single quadratic polynomial of the form $a_k X^2 + b_k X$, where $a_k$ and $b_k$ are defined as before.
Plugging in the above expressions for $a_k$ and $b_k$ from Claim 2, we see that the value of this polynomial at $X = k$ is:
$$a_k k^2 + b_k k = \frac{-(t_k + d)}{k^2}k^2 + \frac{2(t_k + d)}{k}k = -(t_k + d) + 2(t_k + d) = t_k + d$$
Finally, it remains to ensure that our hardplus RBM network outputs $t_0$ for $X = 0$. Note that the sum of the outputs of all $n$ building blocks and the output bias is $-d$ at $X = 0$. To correct this, we set the incoming weights and the bias of the zero-neuron according to $w_i = -M$ for each $i$, and $b = t_0 + d$. When $X = 0$, this neuron will output $t_0 + d$, making the total output of the network $-d + t_0 + d = t_0$ as needed. Furthermore, note that the addition of the zero-neuron does not affect the output of the network when $X = k > 0$, because the zero-neuron outputs 0 on all of these inputs as long as $M \ge t_0 + d$.
This completes the proof of part (i) of the theorem, and it remains to prove part (ii).
Observe that the size of the weights grows linearly in $M$ and $d$, which follows directly from their definitions. And note that the magnitude of the input to each neuron is lower bounded by a positive linear function of $M$ and $d$ (a non-trivial fact which we will prove below).
From these two observations it follows that to achieve the condition that the magnitude of the input to each neuron is greater than $C(n)$, for some function $C$ of $n$, the weights need to grow only linearly with $C$. Noting that the error bound condition $\epsilon \le (n^2 + 1)\exp(-C)$ in Lemma 2 can be rewritten as $C \le \log(n^2 + 1) + \log(1/\epsilon)$, part (ii) of the theorem then follows.
There are two cases where a hardplus neuron in building block $j$ has a negative input. Either the input is $\gamma_j(e_j - X) - M$, or it is $\gamma_j(e_j - X)$ for $X \ge j+1$. In the first case it is clear that as $M$ grows the net input becomes more negative, since $e_j$ doesn't depend on $M$ at all.
The second case requires more work. First note that from its definition, $e_j$ can be rewritten as $e_j = \frac{2((j+1)a_{j+1} - ja_j)}{\gamma_j}$. Then for any $X \ge j+1$ and $j \le n-1$ we have:
$$\gamma_j(e_j - X) \le \gamma_j(e_j - (j+1)) = \gamma_j\left(\frac{2((j+1)a_{j+1} - ja_j)}{\gamma_j} - (j+1)\right)$$
$$= 2(j+1)a_{j+1} - 2ja_j - (j+1)\gamma_j = 2(j+1)a_{j+1} - 2ja_j - (j+1)(a_{j+1} - a_j) = (j+1)a_{j+1} - 2ja_j + (j+1)a_j$$
$$= -\frac{d + t_{j+1}}{j+1} + 2\,\frac{d + t_j}{j} - \frac{(j+1)(d + t_j)}{j^2}$$
$$= \frac{-j^2(d + t_{j+1}) + 2j(j+1)(d + t_j) - (j+1)^2(d + t_j)}{j^2(j+1)}$$
$$= \frac{-(j^2 - 2j(j+1) + (j+1)^2)d - j^2 t_{j+1} + 2j(j+1)t_j - (j+1)^2 t_j}{j^2(j+1)}$$
$$= \frac{-d}{j^2(j+1)} + \frac{-j^2 t_{j+1} + 2j(j+1)t_j - (j+1)^2 t_j}{j^2(j+1)}$$
So we see that as $d$ increases, this bound guarantees that $\gamma_j(e_j - X)$ becomes more negative for each $X \ge j+1$.
Also note that for the special zero-neuron, for $X \ge 1$ the net input will be $-MX + t_0 + d \le -M + t_0 + d$, which will shrink as $M$ grows.
For neurons belonging to building block $j$ which have a positive-valued input, we have that $X < e_j$. Note that for any $X \le j$ and $j < n$ we have:
$$\gamma_j(e_j - X) \ge \gamma_j(e_j - j) = \gamma_j\left(\frac{2((j+1)a_{j+1} - ja_j)}{\gamma_j} - j\right)$$
$$= 2(j+1)a_{j+1} - 2ja_j - j\gamma_j = 2(j+1)a_{j+1} - 2ja_j - j(a_{j+1} - a_j) = 2(j+1)a_{j+1} - ja_j - ja_{j+1}$$
$$= -2\,\frac{d + t_{j+1}}{j+1} + \frac{d + t_j}{j} + \frac{j(d + t_{j+1})}{(j+1)^2}$$
$$= \frac{-2j(j+1)(d + t_{j+1}) + (j+1)^2(d + t_j) + j^2(d + t_{j+1})}{j(j+1)^2}$$
$$= \frac{((j+1)^2 - 2j(j+1) + j^2)d + (j+1)^2 t_j - 2j(j+1)t_{j+1} + j^2 t_{j+1}}{j(j+1)^2}$$
$$= \frac{d}{j(j+1)^2} + \frac{(j+1)^2 t_j - 2j(j+1)t_{j+1} + j^2 t_{j+1}}{j(j+1)^2}$$
And for the case $j = n$, we have for $X \le j$ that:
$$\gamma_n(e_n - X) \ge \gamma_n(e_n - n) = \frac{d + t_n}{n^2}(2n - n) = \frac{d}{n} + \frac{t_n}{n}$$
So in all cases we see that as $d$ increases, this bound guarantees that $\gamma_j(e_j - X)$ grows linearly. Also note that for the special zero-neuron, the net input will be $t_0 + d$ for $X = 0$, which will grow linearly as $d$ increases.

A.6 Proofs for Section 4

A.6.1 Proof of Theorem 8

We first state some basic facts which we need.
Fact 14 (Muroga (1971)). Let $f : \{0,1\}^n \to \{0,1\}$ be a Boolean function computed by a threshold neuron with arbitrary real incoming weights and bias.
There exists a constant $K$ and another threshold neuron computing $f$, all of whose incoming weights and bias are integers with magnitude at most $2^{Kn\log n}$.

A direct consequence of the above fact is the following fact, by now folklore, whose simple proof we present for the sake of completeness.
Fact 15. Let $f_n$ be the set of all Boolean functions on $\{0,1\}^n$, and let $f_{s,n}$ be the subset of such Boolean functions that are computable by threshold networks with one hidden layer with at most $s$ neurons. Then there exists a constant $K$ such that
$$|f_{s,n}| \le 2^{K(n^2 s\log n + s^2\log s)}.$$

Proof. Let $s$ be the number of hidden neurons in our threshold network. By using Fact 14 repeatedly for each of the hidden neurons, we obtain another threshold network, still having $s$ hidden units, computing the same Boolean function, such that the incoming weights and biases of all hidden neurons are bounded by $2^{Kn\log n}$. Finally, applying Fact 14 to the output neuron, we convert it to a threshold gate with parameters bounded by $2^{Ks\log s}$. Henceforth, we count only the total number of Boolean functions that can be computed by such threshold networks with integer weights. We do this by establishing a simple upper bound on the total number of distinct such networks. Clearly, there are at most $2^{Kn^2\log n}$ ways to choose the incoming weights of a given neuron in the hidden layer. There are $s$ incoming weights to choose for the output threshold, each of which is an integer of magnitude at most $2^{Ks\log s}$. Combining these observations, there are at most $2^{Ksn^2\log n} \times 2^{Ks^2\log s}$ distinct networks. Hence, the total number of distinct Boolean functions that can be computed is at most $2^{K(n^2 s\log n + s^2\log s)}$.

With these basic facts in hand, we prove below Theorem 8 using Proposition 5 and Theorem 6.

Proof of Theorem 8.
Consider any thresholded RBM network with $m$ hidden units that is computing an $n$-dimensional Boolean function with margin $\delta$. Using Proposition 5, we can obtain a thresholded hardplus RBM network of size $(4m^2/\delta)\log(2m/\delta) + m$ that computes the same Boolean function as the original thresholded RBM network. Applying Theorem 6 and thresholding the output, we obtain a network with one hidden layer of threshold units, larger in size by at most a factor of $n$, that computes the same Boolean function. This argument shows that the set of Boolean functions computed by thresholded RBM networks of $m$ hidden units and margin $\delta$ is a subset of the Boolean functions computed by 1-hidden-layer threshold networks of size $(4m^2 n/\delta)\log(2m/\delta) + mn$. Hence, invoking Fact 15 establishes our theorem.

A.6.2 Proof of Theorem 9

Note that the theorems from Hajnal et al. (1993) assume integer weights, but this hypothesis can be easily removed from their Theorem 3.6. In particular, Theorem 3.6 assumes nothing about the lower weights, and as we will see, the integrality assumption on the top-level weights can be easily replaced with a margin condition.
First note that their Lemma 3.3 only uses the integrality of the upper weights to establish that the margin must be $\ge 1$. Otherwise it is easy to see that with a margin $\delta$, Lemma 3.3 implies that a threshold neuron in a thresholded network of size $m$ is a $\frac{2\delta}{\alpha}$-discriminator ($\alpha$ is the sum of the absolute values of the 2nd-level weights in their notation). Then Theorem 3.6's proof gives $m \ge \delta\, 2^{(1/3 - \epsilon)n}$ for sufficiently large $n$ (instead of just $m \ge 2^{(1/3 - \epsilon)n}$). A more precise bound that they implicitly prove in Theorem 3.6 is $m \ge 6\delta\, 2^{n/3}/C$.
Thus we have the following fact adapted from Hajnal et al. (1993):
Fact 16.
For a neural network of size $m$ with a single hidden layer of threshold neurons and weights bounded by $C$ that computes a function that represents IP with margin $\delta$, we have $m \ge 6\delta\, 2^{n/3}/C$.

Proof of Theorem 9. By Proposition 5 it suffices to show that no thresholded hardplus RBM network of size $\le (4m^2/\delta)\log(2m/\delta) + m$ with parameters bounded by $C$ can compute IP with margin $\delta/2$. Suppose by contradiction that such a thresholded RBM network exists. Then by Theorem 6 there exists a single-hidden-layer threshold network of size $\le (4m^2 n/\delta)\log(2m/\delta) + mn$ with weights bounded in magnitude by $(n+1)C$ that computes the same function, i.e. one which represents IP with margin $\delta/2$.
Applying the above fact we have $(4m^2 n/\delta)\log(2m/\delta) + mn \ge 3\delta\, 2^{n/3}/((n+1)C)$.
It is simple to check that this bound is violated if $m$ is bounded as in the statement of this theorem.

A.6.3 Proof of Theorem 10

We prove a more general result here from which we easily derive Theorem 10 as a special case. To state this general result, we introduce some simple notions. Let $h : \mathbb{R} \to \mathbb{R}$ be an activation function. We say $h$ is monotone if it satisfies the following: either $h(x) \le h(y)$ for all $x < y$, or $h(x) \ge h(y)$ for all $x < y$. Let $\ell : \{0,1\}^n \to \mathbb{R}$ be an inner function. An $(h, \ell)$ gate/neuron $G_{h,\ell}$ is just one that is obtained by composing $h$ and $\ell$ in the natural way, i.e. $G_{h,\ell}(x) = h(\ell(x))$. We notate $\|(h, \ell)\|_\infty = \max_{x \in \{0,1\}^n} |G_{h,\ell}(x)|$.

We assume for the discussion here that the number of input variables (or observables) is even and is divided into two halves, called $x$ and $y$, each being a Boolean string of $n$ bits.
In this language, the inner product Boolean function, denoted by $\mathrm{IP}(x, y)$, is just defined as $x_1 y_1 + \cdots + x_n y_n \pmod 2$. We call an inner function of a neuron/gate $(x, y)$-separable if it can be expressed as $g(x) + f(y)$. For instance, all affine inner functions are $(x, y)$-separable. Finally, given a set of activation functions $H$ and a set of inner functions $I$, an $(H, I)$-network is one each of whose hidden units is a neuron of the form $G_{h,\ell}$ for some $h \in H$ and $\ell \in I$. Let $\|(H, I)\|_\infty = \sup\{\|(h, \ell)\|_\infty : h \in H,\ \ell \in I\}$.

Theorem 17. Let $H$ be any set of monotone activation functions and $I$ be a set of $(x, y)$-separable inner functions. Then, every $(H, I)$-network with one layer of $m$ hidden units computing IP with a margin of $\delta$ must satisfy the following:
$$m \ge \frac{\delta}{2\|(H, I)\|_\infty}\, 2^{n/4}.$$

In order to prove Theorem 17, it would be convenient to consider the following $1/{-1}$ valued function: $(-1)^{\mathrm{IP}(x,y)} = (-1)^{x_1 y_1 + \cdots + x_n y_n}$. Please note that when IP evaluates to 0, $(-1)^{\mathrm{IP}}$ evaluates to 1, and when IP evaluates to 1, $(-1)^{\mathrm{IP}}$ evaluates to $-1$.
We also consider a matrix $M_n$ with entries in $\{1, -1\}$ which has $2^n$ rows and $2^n$ columns. Each row of $M_n$ is indexed by a unique Boolean string in $\{0,1\}^n$; the columns of the matrix are indexed similarly. The entry $M_n[x, y]$ is just the $1/{-1}$ value of $(-1)^{\mathrm{IP}(x,y)}$. We need the following fact that is a special case of the classical result of Lindsey.
Lemma 18 (Chor and Goldreich, 1988). The magnitude of the sum of elements in every $r \times s$ submatrix of $M_n$ is at most $\sqrt{rs\, 2^n}$.

We use Lemma 18 to prove the following key fact about monotone activation functions:
Lemma 19.
Let $G_{h,\ell}$ be any neuron with a monotone activation function $h$ and inner function $\ell$ that is $(x, y)$-separable. Then,
$$\left| \mathbb{E}_{x,y}\!\left[ G_{h,\ell}(x, y)\, (-1)^{\mathrm{IP}(x,y)} \right] \right| \le \|(h, \ell)\|_\infty \cdot 2^{-\Omega(n)}. \qquad (2)$$

Proof. Let $\ell(x, y) = g(x) + f(y)$ and let $0 < \alpha < 1$ be some constant specified later. Define a total order $\prec_g$ on $\{0,1\}^n$ by setting $x \prec_g x'$ whenever $g(x) < g(x')$, or $g(x) = g(x')$ and $x$ occurs before $x'$ in the lexicographic ordering. We divide $\{0,1\}^n$ into $t = 2^{(1-\alpha)n}$ groups of equal size as follows: the first group contains the first $2^{\alpha n}$ elements in the order specified by $\prec_g$, the second group has the next $2^{\alpha n}$ elements, and so on. The $i$th such group is denoted by $X_i$ for $i \le 2^{(1-\alpha)n}$. Likewise, we define the total order $\prec_f$ and use it to define equal-sized blocks $Y_1, \ldots, Y_{2^{(1-\alpha)n}}$.
The way we estimate the LHS of (2) is to pair points in the block $(X_i, Y_j)$ with $(X_{i+1}, Y_{j+1})$ in the following manner: wlog assume that the activation function $h$ is non-decreasing. Then, $G_{h,\ell}(x, y) \le G_{h,\ell}(x', y')$ for each $(x, y) \in (X_i, Y_j)$ and $(x', y') \in (X_{i+1}, Y_{j+1})$. Further, applying Lemma 18, we will argue that the total number of points in $(X_i, Y_j)$ at which the product in the LHS evaluates negative (positive) is very close to the number of points in $(X_{i+1}, Y_{j+1})$ at which the product evaluates to positive (negative). Moreover, by assumption, the composed function $(h, \ell)$ does not take very large values in our domain. These observations will be used to show that the points in blocks that are diagonally across each other will almost cancel each other's contribution to the LHS.
Only a few uncancelled blocks remain, and hence the sum in the LHS will be small. Forthwith the details.
Let $P^+_{i,j} = \{(x, y) \in (X_i, Y_j) \mid (-1)^{\mathrm{IP}(x,y)} = 1\}$ and $P^-_{i,j} = \{(x, y) \in (X_i, Y_j) \mid (-1)^{\mathrm{IP}(x,y)} = -1\}$.
Let $t = 2^{(1-\alpha)n}$. Let $h_{i,j}$ be the max value that the gate takes on points in $(X_i, Y_j)$. Note that the non-decreasing assumption on $h$ implies that $h_{i,j} \le h_{i+1,j+1}$. Using this observation, we get the following:
$$\mathbb{E}_{x,y}\!\left[ G_{h,\ell}(x, y)\, (-1)^{\mathrm{IP}(x,y)} \right] \le \frac{1}{4^n} \Bigg( \sum_{i,j < t} h_{i,j}\Big(\big|P^+_{i,j}\big| - \big|P^-_{i+1,j+1}\big|\Big) + \sum_{i = t \text{ or } j = t} h_{i,j}\big|P_{i,j}\big| \Bigg). \qquad (3)$$

We apply Lemma 18 to conclude that $\big|P^+_{i,j}\big| - \big|P^-_{i+1,j+1}\big|$ is at most $2 \cdot 2^{(\alpha + 1/2)n}$. Thus, we get
$$\text{RHS of (3)} \le \|(h, \ell)\|_\infty \cdot \Big( 2 \cdot 2^{-(\alpha - \frac{1}{2})n} + 4 \cdot 2^{-(1-\alpha)n} \Big). \qquad (4)$$

Thus, setting $\alpha = 3/4$ gives us the bound that the RHS above is arbitrarily close to $\|(h, \ell)\|_\infty \cdot 2^{-n/4}$. Similarly, pairing things slightly differently, we get
$$\mathbb{E}_{x,y}\!\left[ G_{h,\ell}(x, y)\, (-1)^{\mathrm{IP}(x,y)} \right] \ge -\frac{1}{4^n} \Bigg( \sum_{i,j < t} h_{i+1,j+1}\Big(\big|P^-_{i,j}\big| - \big|P^+_{i+1,j+1}\big|\Big) + \sum_{i = t \text{ or } j = t} h_{i,j}\big|P_{i,j}\big| \Bigg).$$
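Lemma 18 and the correlation bound it drives can be spot-checked numerically for small $n$. The following Python sketch (ours, not part of the proof) builds the $\pm 1$ matrix $M_4$, verifies Lindsey's submatrix bound on random row/column subsets, and measures the correlation with $(-1)^{\mathrm{IP}}$ of one monotone gate with a separable inner function; the particular linear forms `g` and `f` and the choice of a hardplus activation are illustrative assumptions only.

```python
import math
import random

n = 4          # small input dimension, so everything can be checked exhaustively
N = 2 ** n     # number of Boolean strings of length n

def ip_sign(x, y):
    """(-1)^{IP(x,y)} where IP(x,y) = x_1 y_1 + ... + x_n y_n (mod 2)."""
    return -1 if bin(x & y).count("1") % 2 == 1 else 1

# M_n[x][y] = (-1)^{IP(x,y)} is a 2^n x 2^n matrix with entries in {1, -1}.
M = [[ip_sign(x, y) for y in range(N)] for x in range(N)]

# Lemma 18 (Lindsey): every r x s submatrix of M_n sums to at most
# sqrt(r * s * 2^n) in magnitude.  Spot-check random row/column subsets.
rng = random.Random(0)
for _ in range(500):
    r, s = rng.randint(1, N), rng.randint(1, N)
    rows = rng.sample(range(N), r)
    cols = rng.sample(range(N), s)
    block_sum = sum(M[x][y] for x in rows for y in cols)
    assert abs(block_sum) <= math.sqrt(r * s * N) + 1e-9

# One monotone gate with a separable inner function l(x, y) = g(x) + f(y):
# h is the (non-decreasing) hardplus, and g, f are fixed linear forms whose
# weights are arbitrary choices of ours.
def bits(z):
    return [(z >> k) & 1 for k in range(n)]

g = lambda x: sum((k + 1) * b for k, b in enumerate(bits(x)))      # assumed weights
f = lambda y: sum((2 * k - 3) * b for k, b in enumerate(bits(y)))  # assumed weights
G = lambda x, y: max(0.0, g(x) + f(y))                             # hardplus gate

sup_norm = max(G(x, y) for x in range(N) for y in range(N))
corr = sum(G(x, y) * M[x][y] for x in range(N) for y in range(N)) / (N * N)

# Lemma 19 predicts |corr| <= ||(h, l)||_inf * 2^{-Omega(n)}; even at n = 4
# the correlation is a small fraction of the gate's sup norm.
assert abs(corr) < sup_norm
```

The block-pairing in the proof exploits exactly the structure checked here: within any block of $2^{\alpha n} \times 2^{\alpha n}$ inputs, Lindsey's bound forces the $+1$ and $-1$ entries of $M_n$ to be nearly balanced, so a bounded monotone gate cannot correlate well with $(-1)^{\mathrm{IP}}$.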