{"title": "Multilayer Neural Networks: One or Two Hidden Layers?", "book": "Advances in Neural Information Processing Systems", "page_first": 148, "page_last": 154, "abstract": null, "full_text": "Multilayer neural networks: \none or two hidden layers? \n\nG. Brightwell \n\nDept of Mathematics \nLSE, Houghton Street \n\nLondon WC2A 2AE, U.K. \n\nc. Kenyon, H. Paugam-Moisy \n\nLIP, URA 1398 CNRS \n\nENS Lyon, 46 alIee d'Italie \n\nF69364 Lyon cedex, FRANCE \n\nAbstract \n\nWe study the number of hidden layers required by a multilayer neu(cid:173)\nral network with threshold units to compute a function f from n d \nto {O, I}. In dimension d = 2, Gibson characterized the functions \ncomputable with just one hidden layer, under the assumption that \nthere is no \"multiple intersection point\" and that f is only defined \non a compact set. We consider the restriction of f to the neighbor(cid:173)\nhood of a multiple intersection point or of infinity, and give neces(cid:173)\nsary and sufficient conditions for it to be locally computable with \none hidden layer. We show that adding these conditions to Gib(cid:173)\nson's assumptions is not sufficient to ensure global computability \nwith one hidden layer, by exhibiting a new non-local configuration, \nthe \"critical cycle\", which implies that f is not computable with \none hidden layer. \n\n1 \n\nINTRODUCTION \n\nThe number of hidden layers is a crucial parameter for the architecture of multilayer \nneural networks. Early research, in the 60's, addressed the problem of exactly real(cid:173)\nizing Boolean functions with binary networks or binary multilayer networks. On the \none hand, more recent work focused on approximately realizing real functions with \nmultilayer neural networks with one hidden layer [6, 7, 11] or with two hidden units \n[2]. On the other hand , some authors [1, 12] were interested in finding bounds on \nthe architecture of multilayer networks for exact realization of a finite set of points. 
\nAnother approach is to search for the minimal architecture of multilayer networks exactly realizing real functions from R^d to {0,1}. Our work, of the latter kind, is a continuation of the effort of [4, 5, 8, 9] towards characterizing the real dichotomies which can be exactly realized with a single hidden layer neural network composed of threshold units. \n\n1.1 NOTATIONS AND BACKGROUND \n\nA finite set of hyperplanes {H_i}_{1 <= i <= h} divides the space R^d into polyhedral open regions. \n\nConsider a point P where k essential hyperplanes intersect. The 2k regions surrounding P can be labelled 1, ..., 2k in cyclic order, in such a way that hidden unit m, of weight w_m, is active exactly on regions m, ..., m+k-1; let theta denote the threshold of the output unit. Realizing f locally around P with one hidden layer amounts to solving the system (S) of 2k strict inequalities, one per region i: \n\n(S) sum over the units m active on region i of w_m < theta, if region i is in class 0 \n(S) sum over the units m active on region i of w_m > theta, if region i is in class 1 \n\nThe system (S) can be rewritten in the matrix form Ax <= b, where \n\nx^T = [w_1, w_2, ..., w_k, theta] and b^T = [b_1, b_2, ..., b_k, b_{k+1}, ..., b_{2k}] \n\nwhere b_i = -eps for all i, and eps is an arbitrarily small positive number. Matrix A can be seen in figure 2, where eps_j = +1 or -1 depending on whether region j is in class 0 or in class 1. The next step is to apply Farkas lemma, or an equivalent version [10], which gives a necessary and sufficient condition for the existence of a solution of Ax <= b. \n\nLemma 1 (Farkas lemma) There exists a vector x in R^n such that Ax <= b iff there does not exist a vector y in R^m such that y^T A = 0, y >= 0 and y^T b < 0. \n\nAssume that Ax <= b is not solvable. Then, by Lemma 1 with n = k + 1 and m = 2k, a vector y can be found such that y^T A = 0, y >= 0 and y^T b < 0. Since in addition y^T b = -eps sum_{j=1}^{2k} y_j, the condition y^T b < 0 implies that y_{j_1} > 0 for some index j_1. But y^T A = 0 is equivalent to the system (E) of k + 1 equations \n\n(E) for 1 <= i <= k: sum_{m=i}^{i+k-1} y_{m/class 0} = sum_{m=i}^{i+k-1} y_{m/class 1} \n(E) for i = k + 1: sum_{m=1}^{2k} y_{m/class 0} = sum_{m=1}^{2k} y_{m/class 1} \n\nwhere y_{m/class lambda} stands for y_m if region m is in class lambda, and for 0 otherwise. Since y_{j_1} > 0, the last equation (E_{k+1}) of system (E) implies that y_{j_2} > 0 for some index j_2 with class(region j_2) != class(region j_1). 
Without loss of generality, assume that j_1 and j_2 are less than k, that region j_1 is in class 0 and that region j_2 is in class 1. Comparing two successive equations of (E), for i < k, we can write \n\nfor all lambda in {0,1}: sum_{(E_{i+1})} y_{m/class lambda} = sum_{(E_i)} y_{m/class lambda} - y_{i/class lambda} + y_{i+k/class lambda} \n\nSince y_{j_1} > 0 and region j_1 is in class 0, the transition from E_{j_1} to E_{j_1+1} implies that y_{j_1+k} = y_{j_1} > 0 and that region j_1+k, which is opposite to region j_1, is also in class 0. Similarly, the transition from E_{j_2} to E_{j_2+1} implies that both opposite regions j_2 and j_2+k are in class 1. These conditions are necessary for the system (E) to have a non-negative solution, and they correspond exactly to the definition of an XOR-bow-tie at point P. The converse comes from theorem 1. \n\n2.2 UNBOUNDED REGIONS \n\nIf no two essential hyperplanes are parallel, the case of unbounded regions is exactly the same as that of a multiple intersection: all the unbounded regions can be labelled as in figure 2, and the same argument proves that, if the local system (S) Ax <= b is not solvable, then there exists an XOR-at-infinity. The case of parallel hyperplanes is more intricate because matrix A is more complex; the proof requires a heavy case-by-case analysis and cannot be given in full in this paper (see [3]). \n\nTheorem 4 Let f be a polyhedral dichotomy on R^2. Let C_infinity be the complementary region of the convex hull of the essential points of f. The restriction of f to C_infinity is realizable by a one-hidden-layer network iff f has no XOR-at-infinity. \n\nFrom theorems 3 and 4 we can deduce that a polyhedral dichotomy is locally realizable in R^2 by a one-hidden-layer network iff f has no XOR-bow-tie and no XOR-at-infinity. Unfortunately this result cannot be extended to the global realization of f in R^2, because more intricate distant configurations can involve contradictions in the complete system of inequalities. 
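Checking local realizability thus amounts to testing feasibility of a small linear program. A numerical sketch follows; it is ours, not from the paper, assumes scipy is available, and encodes regions directly by their 0/1 hidden-unit activation patterns rather than by the paper's matrix A.

```python
import numpy as np
from scipy.optimize import linprog

def locally_realizable(patterns, classes, eps=1e-3):
    # Feasibility of the local system Ax <= b (Farkas-lemma setup).
    # patterns[j] is the 0/1 activation vector of the k hidden
    # threshold units on region j; classes[j] is the desired output.
    # We look for weights w and a threshold t with
    #   w . patterns[j] >= t + eps   on class-1 regions,
    #   w . patterns[j] <= t - eps   on class-0 regions.
    # This encoding is ours; it plays the role of the paper's matrix A.
    P = np.asarray(patterns, dtype=float)
    n_regions, k = P.shape
    A, b = [], []
    for j in range(n_regions):
        row = np.append(P[j], -1.0)        # unknowns x = (w, t)
        if classes[j] == 1:
            A.append(-row)                 # -(w.P[j] - t) <= -eps
        else:
            A.append(row)                  # w.P[j] - t <= -eps
        b.append(-eps)
    res = linprog(np.zeros(k + 1), A_ub=np.array(A), b_ub=np.array(b),
                  bounds=[(None, None)] * (k + 1))
    return res.status == 0                 # status 0: a solution exists

# Four quadrants around the origin; hidden units = sides of the two axes.
quadrants = [[1, 1], [0, 1], [0, 0], [1, 0]]
print(locally_realizable(quadrants, [1, 0, 1, 0]))  # XOR labelling: False
print(locally_realizable(quadrants, [1, 0, 0, 0]))  # single quadrant: True
```

The infeasible first call is precisely the XOR situation around a point: no weight vector separates the alternating labels with one hidden layer.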
The object of the next section is to point out such a situation, by producing a new geometric configuration, called a critical cycle, which implies that f cannot be realized with one hidden layer. \n\n3 CRITICAL CYCLES \n\nIn contrast to section 2, the results of this section hold for any dimension d >= 2. We first need some definitions. Consider a pair of regions {T, T'} which are in the same class and which both contain an essential point P in their closure. This pair is called critical with respect to P and H if there is an essential hyperplane H going through P such that T' is adjacent along H to the region opposite to T. Note that T and T' are then both on the same side of H. \n\nWe define a graph G whose nodes correspond to the critical pairs of regions of f. There is a red edge between {T, T'} and {U, U'} if the pairs, in different classes, are both critical with respect to the same point (e.g., {B_P, B'_P} and {W_P, W'_P} in figure 3). There is a green edge between {T, T'} and {U, U'} if the pairs are both critical with respect to the same hyperplane H, and either the two pairs are on the same side of H but in different classes (e.g., {W_P, W'_P} and {B_Q, B'_Q}), or they are on different sides of H but in the same class (e.g., {B_P, B'_P} and {B_R, B'_R}). \n\nDefinition 1 A critical cycle is a cycle in graph G with alternating colors. \n\nFigure 3: Geometrical configuration and graph of a critical cycle, in the plane; red edges join the two pairs critical with respect to a common point, green edges join pairs critical with respect to a common hyperplane. Note that one can augment the figure in such a way that there is no XOR-situation, no XOR-bow-tie, and no XOR-at-infinity. 
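Definition 1 is easy to operationalize: a critical cycle is an alternating cycle in the edge-colored graph G, and its existence can be tested by walking a state graph whose states are pairs (node, color of the edge just used). A minimal sketch of ours follows; the node names are an illustrative encoding of the six critical pairs of figure 3, not the paper's notation.

```python
# Detect a critical cycle: a cycle of graph G whose edges alternate
# between red and green.  An alternating closed walk exists iff the
# directed state graph on (node, last-edge color) has a cycle.
def has_alternating_cycle(edges):
    adj = {}
    for u, v, c in edges:                  # undirected colored edges
        adj.setdefault(u, []).append((v, c))
        adj.setdefault(v, []).append((u, c))

    def dfs(state, stack, done):
        if state in stack:
            return True                    # closed alternating walk found
        if state in done:
            return False
        stack.add(state)
        v, c = state
        for u, c2 in adj.get(v, []):
            if c2 != c and dfs((u, c2), stack, done):
                return True
        stack.remove(state)
        done.add(state)
        return False

    return any(dfs((v, c), set(), set()) for u, v, c in edges)

# Six critical pairs as in figure 3: red edges join the two pairs at a
# common point, green edges join pairs sharing a hyperplane.
cycle = [('BP', 'WP', 'red'), ('BQ', 'WQ', 'red'), ('BR', 'WR', 'red'),
         ('WP', 'BQ', 'green'), ('WQ', 'BR', 'green'), ('WR', 'BP', 'green')]
print(has_alternating_cycle(cycle))        # True
print(has_alternating_cycle(cycle[:-1]))   # False: the cycle is broken
```

The color-alternation test forbids retracing the edge just used, which is exactly the alternation required by Definition 1.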
\n\nTheorem 5 If a polyhedral dichotomy f, from R^d to {0,1}, can be realized by a one-hidden-layer network, then it cannot have a critical cycle. \n\nProof: For the sake of simplicity, we restrict ourselves to doing the proof for a case similar to the example of figure 3, with notation as given in that figure, but without any restriction on the dimension d of f. Assume, for a contradiction, that f has a critical cycle and can be realized by a one-hidden-layer network. Consider the sets of regions {B_P, B'_P, B_Q, B'_Q, B_R, B'_R} and {W_P, W'_P, W_Q, W'_Q, W_R, W'_R}, and consider the regions defined by all the hyperplanes associated to the hidden layer units (in general, these hyperplanes are a large superset of the essential hyperplanes). There is a region b_P contained in B_P whose border contains P and a (d-1)-dimensional subset of H_1. Similarly we can define b'_P, ..., b'_R, w_P, ..., w'_R. Let B be the set of such regions which are in class 1 and W be the set of such regions which are in class 0. \n\nLet H be the hyperplane associated to one of the hidden units. For T a region, let H(T) be the digit label of T w.r.t. H, i.e. H(T) = 1 or 0 according to whether T is above or below H (cf. section 1.1). We do a case-by-case analysis. \n\nIf H does not go through P, then H(b_P) = H(b'_P) = H(w_P) = H(w'_P); similar equalities hold for hyperplanes not going through Q or R. If H goes through P but is equal neither to H_1 nor to H_2, then, from the viewpoint of H, things are as if b'_P were opposite to b_P and w'_P were opposite to w_P, so the two regions of each pair are on different sides of H, and hence H(b_P) + H(b'_P) = H(w_P) + H(w'_P) = 1; similar equalities hold for hyperplanes going through Q or R. If H = H_1, then we use the fact that there is a green edge between {W_P, W'_P} and {B_Q, B'_Q}, meaning in the case of the figure that all four regions are on the same side of H_1 but in different classes. Then H(b_P) + H(b'_P) + H(b_Q) + H(b'_Q) = H(w_P) + H(w'_P) + H(w_Q) + H(w'_Q). 
In fact, this equality would also hold in the other case, as can easily be checked. Thus for all H we have sum_{b in B} H(b) = sum_{w in W} H(w). But such an equality is impossible: since each b is in class 1 and each w is in class 0, it implies a contradiction in the system of inequalities, and f cannot be realized by a one-hidden-layer network. Obviously there can exist cycles of length longer than 3, but the extension of the proof is straightforward. \n\n4 CONCLUSION AND PERSPECTIVES \n\nThis paper makes partial progress towards characterizing the functions which can be realized by a one-hidden-layer network, with a particular focus on dimension 2. Higher dimensions are more challenging, and it is difficult to even propose a conjecture: new cases of inconsistency emerge in subspaces of intermediate dimension. Gibson gives an example of an inconsistent line (dimension 1) resulting from its intersection with two hyperplanes (dimension 2) which are not inconsistent in R^3. \n\nThe principle of using Farkas lemma for proving local realizability still holds, but the matrix A becomes more and more complex. In R^d, even for d = 3, the labelling of the regions, for instance around a point P of multiple intersection, can become very complex. \n\nIn conclusion, it seems that neither the topological method of Gibson nor our algebraic point of view can easily be extended to higher dimensions. Nevertheless, we conjecture that in dimension 2, a function can be realized by a one-hidden-layer network iff it does not have any of the four forbidden types of configurations: XOR-situation, XOR-bow-tie, XOR-at-infinity, and critical cycle. \n\nAcknowledgements \n\nThis work was supported by European Esprit III Project no 8556, NeuroCOLT. \n\nReferences \n\n[1] E. B. Baum. On the capabilities of multilayer perceptrons. Journal of Complexity, 4:193-215, 1988. \n\n[2] E. K. Blum and L. K. 
Li. Approximation theory and feedforward networks. Neural Networks, 4(4):511-516, 1991. \n\n[3] G. Brightwell, C. Kenyon, and H. Paugam-Moisy. Multilayer neural networks: one or two hidden layers? Research Report 96-37, LIP, ENS Lyon, 1996. \n\n[4] M. Cosnard, P. Koiran, and H. Paugam-Moisy. Complexity issues in neural network computations. In I. Simon, editor, Proc. of LATIN'92, volume 583 of LNCS, pages 530-544. Springer Verlag, 1992. \n\n[5] M. Cosnard, P. Koiran, and H. Paugam-Moisy. A step towards the frontier between one-hidden-layer and two-hidden-layer neural networks. In Proc. of IJCNN'93-Nagoya, volume 3, pages 2292-2295, 1993. \n\n[6] G. Cybenko. Approximation by superpositions of a sigmoidal function. Math. Control, Signals, and Systems, 2:303-314, 1989. \n\n[7] K. Funahashi. On the approximate realization of continuous mappings by neural networks. Neural Networks, 2(3):183-192, 1989. \n\n[8] G. J. Gibson. A combinatorial approach to understanding perceptron decision regions. IEEE Trans. Neural Networks, 4:989-992, 1993. \n\n[9] G. J. Gibson. Exact classification with two-layer neural nets. Journal of Computer and System Sciences, 52(2):349-356, 1996. \n\n[10] M. Grotschel, L. Lovasz, and A. Schrijver. Geometric Algorithms and Combinatorial Optimization. Springer-Verlag, Berlin, Heidelberg, 1988. \n\n[11] K. Hornik, M. Stinchcombe, and H. White. Multilayer feedforward networks are universal approximators. Neural Networks, 2(5):359-366, 1989. \n\n[12] S.-C. Huang and Y.-F. Huang. Bounds on the number of hidden neurons in multilayer perceptrons. IEEE Trans. Neural Networks, 2:47-55, 1991. \n\n[13] P. J. Zwietering. The complexity of multi-layered perceptrons. PhD thesis, Technische Universiteit Eindhoven, 1994. 
\n\n\f", "award": [], "sourceid": 1239, "authors": [{"given_name": "Graham", "family_name": "Brightwell", "institution": null}, {"given_name": "Claire", "family_name": "Kenyon", "institution": null}, {"given_name": "H\u00e9l\u00e8ne", "family_name": "Paugam-Moisy", "institution": null}]}*