{"title": "Optimal Architectures in a Solvable Model of Deep Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 4781, "page_last": 4789, "abstract": "Deep neural networks have received a considerable attention due to the success of their training for real world machine learning applications. They are also of great interest to the understanding of sensory processing in cortical sensory hierarchies. The purpose of this work is to advance our theoretical understanding of the computational benefits of these architectures. Using a simple model of clustered noisy inputs and a simple learning rule, we provide analytically derived recursion relations describing the propagation of the signals along the deep network. By analysis of these equations, and defining performance measures, we show that these model networks have optimal depths. We further explore the dependence of the optimal architecture on the system parameters.", "full_text": "Optimal Architectures in a Solvable Model of Deep\n\nNetworks\n\nJonathan Kadmon\n\nThe Racah Institute of Physics and ELSC\n\nThe Hebrew University, Israel\n\njonathan.kadmon@mail.huji.ac.il\n\nHaim Sompolinsky\n\nThe Racah Institute of Physics and ELSC\n\nThe Hebrew University, Israel\n\nand\n\nCenter for Brain Science\n\nHarvard University\n\nAbstract\n\nDeep neural networks have received a considerable attention due to the success\nof their training for real world machine learning applications. They are also\nof great interest to the understanding of sensory processing in cortical sensory\nhierarchies. The purpose of this work is to advance our theoretical understanding of\nthe computational bene\ufb01ts of these architectures. Using a simple model of clustered\nnoisy inputs and a simple learning rule, we provide analytically derived recursion\nrelations describing the propagation of the signals along the deep network. 
By analysis of these equations, and by defining performance measures, we show that these model networks have optimal depths. We further explore the dependence of the optimal architecture on the system parameters.\n\n1 Introduction\n\nThe use of deep feedforward neural networks in machine learning applications has become widespread and has drawn considerable research attention in the past few years. Novel approaches for training these structures to perform various computations are in constant development. However, there is still a gap between our ability to produce and train deep structures to complete a task and our understanding of the underlying computations. One interesting class of previously proposed models uses a series of sequential de-noising autoencoders (dA) to construct deep architectures [5, 14]. At its base, the dA receives a noisy version of a pre-learned pattern and retrieves the noiseless representation. Other methods of constructing deep networks by unsupervised learning have been proposed, including the use of Restricted Boltzmann Machines (RBMs) [3, 12, 7]. Deep architectures have also been of interest to neuroscience, as many biological sensory systems (e.g., vision, audition, olfaction and somatosensation; see e.g. [9, 13]) are organized in hierarchies of multiple processing stages. Despite the impressive recent success in training deep networks, fundamental understanding of the merits and limitations of signal processing in such architectures is still lacking.\n\nA theory of deep networks entails two dynamical processes. One is the dynamics of the weight matrices during learning. This problem is challenging even for linear architectures, and progress has been made recently on this front (see e.g. [11]). The other dynamical process is the propagation of the signal and the information it carries through the nonlinear feedforward stages. 
In this work we focus on the second challenge, by analyzing the 'signal and noise' neural dynamics in a solvable model of deep networks. We assume a simple clustered structure of inputs, where inputs take the form of corrupted versions of a discrete set of cluster centers or 'patterns'. The goal of the multiple processing layers is to reformat the inputs such that the noise is suppressed, allowing a linear readout to perform classification tasks based on the top representations. We assume a simple learning rule for the synaptic matrices, the well-known pseudo-inverse rule [10]. The advantage of this choice, besides its mathematical tractability, is its capacity for storing patterns. In particular, when the input is noiseless, the propagating signals retain their desired representations with no distortion up to a reasonable capacity limit. In addition, previous studies of this rule showed that these systems have considerable basins of attraction for pattern completion in a recurrent setting [8]. Here we study this system in a deep feedforward architecture. Using mean field theory we derive recursion relations for the propagation of signal and noise across the network layers, which are exact in the limit of large network sizes. Analyzing these recursion dynamics, we show that for a fixed overall number of neurons, there is an optimal depth that minimizes the average readout classification error. We analyze the optimal depth as a function of the system parameters such as load, sparsity, and the overall system size.\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\n2 Model of Feedforward Processing of Clustered Inputs\n\nWe consider a network model of sensory processing composed of three or more layers of neurons arranged in a feedforward architecture (figure 1). The first layer, composed of N_0 neurons, is the input or stimulus layer. 
The input layer projects into a sequence of one or more intermediate layers, which we also refer to as processing layers. These layers can represent neurons in sensory cortices or cortical-like structures. The simplest case is a single processing layer (figure 1.A). More generally, we consider L processing layers with possibly different widths (figure 1.B). The last layer in the model is the readout layer, which represents a downstream neural population that receives input from the top processing layer and performs a specific computation, such as recognition of a specific stimulus or classification of stimuli. For concreteness, we will use a layer of one or more binary readout neurons that perform binary classifications on the inputs. For simplicity, all neurons in the network are binary units, i.e., the activity level of each neuron is either 0 (silent) or 1 (firing). We denote by S^i_l ∈ {0, 1} the activity of neuron i ∈ {1, . . . , N_l} in layer l ∈ {1, . . . , L}; N_l denotes the size of the layer. The level of sparsity of the neural code, i.e., the fraction f of active neurons for each stimulus, is set by tuning the threshold T_l of the neurons in each layer (see below). For simplicity we will assume all neurons (except for the readout) have the same sparsity, f.\n\nFigure 1: Schematics of the network. The network receives input from N_0 neurons and then projects them onto an intermediate layer composed of N_t processing neurons. The neurons can be arranged in a single (A) or multiple (B) layers. The readout layer receives input from the last processing layer.\n\nInput  The input to the network is organized as clusters around P activity patterns. At its center, each cluster has a prototypical representation of an underlying specific stimulus, denoted S̄^i_{0,μ}, where i = 1, ..., N_0 denotes the index of the neuron in the input layer l = 0, and the index μ = 1, ..., P denotes the pattern number. 
The probability of an input neuron firing is denoted by f_0. Other members of a cluster are noisy versions of the central pattern, representing natural variations in the stimulus representation due to changes in physical features in the world, input noise, or neural noise. We model the noise as i.i.d. Bernoulli: each noisy input S^i_{0,ν} from the νth cluster equals S̄^i_{0,ν} with probability (1 + m_0)/2 and 1 − S̄^i_{0,ν} with probability (1 − m_0)/2. Thus, the average overlap of the noisy inputs with the central pattern, say μ = 1, is\n\nm_0 = (1 / (N_0 f(1−f))) ⟨ Σ_{i=1}^{N_0} (S^i_0 − f)(S̄^i_{0,1} − f) ⟩ ,   (1)\n\nranging from m_0 = 1, denoting the noiseless limit, to m_0 = 0, where the inputs are uncorrelated with the centers. Topologically, the inputs are organized into clusters with radius 1 − m_0.\n\nUpdate rule  The state S^i_l of the i-th neuron in a layer l > 0 is determined by thresholding the weighted sum of the activities in the antecedent layer:\n\nS^i_l = Θ( h^i_l − T_l ) .   (2)\n\nHere Θ is the step function and the field h^i_l represents the synaptic input to the neuron,\n\nh^i_l = Σ_{j=1}^{N_{l−1}} W^{ij}_{l,l−1} ( S^j_{l−1} − f ) ,   (3)\n\nwhere the sparsity f is the mean activity level of the preceding layer (set by thresholding, Eq. (2)).\n\nSynaptic matrix  A key question is how the connectivity matrix W^{ij}_{l,l−1} is chosen. Here we construct the weight matrix by first allocating for each layer l a set of P random templates ξ_{l,μ} ∈ {0, 1}^{N_l} (with mean activity f), which are to serve as the representations of the P stimulus clusters in the layer. Next, W has to be trained to ensure that the response, S̄_{l,μ}, of the layer l to a noiseless input, S̄_{0,μ}, equals ξ_{l,μ}. 
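The cluster statistics and the overlap measure of Eq. (1) can be checked with a short simulation. The following is a minimal pure-Python sketch (function names are illustrative, not from the paper), which draws one noisy cluster member under the Bernoulli flip model and verifies that its normalized overlap with the center concentrates around m_0:

```python
import random

def make_center(n, f, rng):
    # prototypical pattern: each neuron fires independently with probability f
    return [1 if rng.random() < f else 0 for _ in range(n)]

def noisy_member(center, m0, rng):
    # Bernoulli noise: each bit keeps the center's value with prob (1 + m0)/2
    # and is flipped with prob (1 - m0)/2
    return [s if rng.random() < (1 + m0) / 2 else 1 - s for s in center]

def overlap(s, center, f):
    # normalized overlap, Eq. (1): 1 = noiseless, 0 = uncorrelated
    n = len(center)
    return sum((si - f) * (ci - f) for si, ci in zip(s, center)) / (n * f * (1 - f))

rng = random.Random(0)
n, f, m0 = 20000, 0.1, 0.8
center = make_center(n, f, rng)
member = noisy_member(center, m0, rng)
print(round(overlap(member, center, f), 2))  # concentrates around m0 = 0.8
```

A short calculation confirms the normalization: averaging (S − f)(S̄ − f) over the flip noise and over S̄ ~ Bernoulli(f) gives f(1−f)m_0, so dividing by N_0 f(1−f) recovers m_0 up to finite-size fluctuations.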
Here we use an explicit recipe to enforce these relations, namely the pseudo-inverse (PI) model [10, 8, 6], given by\n\nW^{ij}_{l,l−1} = (1 / (N_{l−1} f(1−f))) Σ_{μ,ν=1}^{P} ( ξ^i_{l,μ} − f ) [C_{l−1}]^{−1}_{μν} ( ξ^j_{l−1,ν} − f ) ,   (4)\n\nwhere\n\nC^l_{μν} = (1 / (N_l f(1−f))) Σ_{i=1}^{N_l} ( ξ^i_{l,μ} − f )( ξ^i_{l,ν} − f )   (5)\n\nis the correlation matrix of the random templates in the lth layer. For completeness we also denote ξ_{0,μ} = S̄_{0,μ}. This learning rule guarantees that for noiseless inputs, i.e., S_0 = ξ_{0,μ}, the states of all the layers are S_l = ξ_{l,μ}. This will in turn allow for perfect readout performance if the noise is zero. The capacity of this system is limited by the rank of C_l, so we require P < N_l [8]. A similar model of clustered inputs fed into a single processing layer has been studied in [1], using simpler, Hebbian projection weights.\n\n3 Mean Field Equations for the Signal Propagation\n\nTo study the dynamics of the signal along the network layers, we assume that the input to the network is a noisy version of one of the clusters, say, cluster μ = 1. In the notation above, the input is a state {S^i_0} with an overlap m_0 with the pattern ξ_{0,1}. Information about the cluster identity of the input is represented in subsequent layers through the overlap of the propagated state with the representation of the same cluster in each layer; in our case, the overlap between the response of the layer l, S_l, and ξ_{l,1}, defined similarly to Eq. (1) as\n\nm_l = (1 / (N_l f(1−f))) ⟨ Σ_{i=1}^{N_l} (S^i_l − f)(ξ^i_{l,1} − f) ⟩ .   (6)\n\nIn each layer the load is defined as\n\nα_l = P / N_l .   (7)\n\nUsing analytical mean field techniques (detailed in the supplementary material), exact in the limit of large N, we find a recursive equation for the overlaps of different layers. 
In this limit the fields h^i_l and their fluctuations assume Gaussian statistics as the realizations of the noisy input vary. The overlaps are evaluated by thresholding these Gaussian variables. For l ≥ 2,\n\nm_{l+1} = H[ (T_{l+1} − (1−f) m_l) / √(Δ_{l+1} + Q_{l+1}) ] − H[ (T_{l+1} + f m_l) / √(Δ_{l+1} + Q_{l+1}) ] ,   (8)\n\nwhere H(x) = (2π)^{−1/2} ∫_x^∞ dt exp(−t²/2). The threshold T_{l+1} is set for each layer by solving\n\nf = f H[ (T_{l+1} − (1−f) m_l) / √(Δ_{l+1} + Q_{l+1}) ] + (1−f) H[ (T_{l+1} + f m_l) / √(Δ_{l+1} + Q_{l+1}) ] .   (9)\n\nThe factor Δ_{l+1} + Q_{l+1} is the variance of the fields, ⟨(δh^i_{l+1})²⟩, which has two contributions. The first is due to the variance in the noisy responses of the previous layers, yielding\n\nΔ_{l+1} = f(1−f) (α_l / (1−α_l)) (1 − m_l²) .   (10)\n\nThe second contribution comes from the spatial correlations between noisy responses of the previous layers, yielding\n\nQ_{l+1} = ((1 − 2α_l) / (2π(1 − α_l))) [ f exp( −(T_l − (1−f) m_{l−1})² / (2(Δ_l + Q_l)) ) + (1−f) exp( −(T_l + f m_{l−1})² / (2(Δ_l + Q_l)) ) ]² .   (11)\n\nNote that despite the fact that the noise in the different nodes of the input layer is uncorrelated, as the signals propagate through the network, correlations between the noisy responses of different neurons in the same layer emerge. These correlations depend on the particular realization of the random templates, and average to zero upon averaging over the templates. Nevertheless, they contribute a non-random contribution to the total variance of the fields at each layer. Interestingly, for α_l > 1/2 this term becomes negative, and reduces the overall variance of the fields. The above recursion equations hold for l ≥ 2. 
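The recursion can be iterated numerically. Below is a pure-Python sketch (stdlib only; names are illustrative) that starts from the layer-1 initial conditions stated next (Q_1 = 0, Δ_1 from the input load α_0), solves the threshold equation by bisection, and propagates the overlap through the layers; the small variance floor is only a numerical guard once the noise is fully suppressed, not part of the theory:

```python
import math

def H(x):
    # Gaussian tail: H(x) = (2*pi)^(-1/2) * integral from x to inf of exp(-t^2/2) dt
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def solve_threshold(m, sigma, f):
    # bisect Eq. (9): f = f*H[(T-(1-f)m)/sigma] + (1-f)*H[(T+f*m)/sigma];
    # the right-hand side decreases monotonically in T
    lo, hi = -20.0, 20.0
    for _ in range(80):
        T = 0.5 * (lo + hi)
        rhs = f * H((T - (1 - f) * m) / sigma) + (1 - f) * H((T + f * m) / sigma)
        lo, hi = (T, hi) if rhs > f else (lo, T)
    return 0.5 * (lo + hi)

def propagate(m0, f, alpha0, alpha, depth):
    """Iterate the mean-field recursion; returns the overlaps m_1 .. m_depth."""
    overlaps = []
    # layer 1: Q_1 = 0 and Delta_1 set by the input load alpha0
    var = f * (1 - f) * alpha0 / (1 - alpha0) * (1 - m0 ** 2)
    m_prev = m0
    sigma = math.sqrt(var)
    T = solve_threshold(m0, sigma, f)
    m = H((T - (1 - f) * m0) / sigma) - H((T + f * m0) / sigma)
    overlaps.append(m)
    for _ in range(depth - 1):
        # induced-correlation term Q, Eq. (11): uses previous layer's T, m, variance
        g = (f * math.exp(-(T - (1 - f) * m_prev) ** 2 / (2 * var))
             + (1 - f) * math.exp(-(T + f * m_prev) ** 2 / (2 * var)))
        Q = (1 - 2 * alpha) / (2 * math.pi * (1 - alpha)) * g ** 2
        delta = f * (1 - f) * alpha / (1 - alpha) * (1 - m ** 2)   # Eq. (10)
        var = max(delta + Q, 1e-30)  # numerical floor once noise is suppressed
        m_prev, sigma = m, math.sqrt(var)
        T = solve_threshold(m, sigma, f)
        m = H((T - (1 - f) * m) / sigma) - H((T + f * m) / sigma)  # Eq. (8)
        overlaps.append(m)
    return overlaps

ms = propagate(m0=0.6, f=0.1, alpha0=0.2, alpha=0.2, depth=12)
print([round(m, 3) for m in ms])  # inside the basin, the overlaps flow toward m = 1
```

For an initial condition inside the basin of attraction (here m_0 = 0.6 at f = 0.1), the iterated overlap climbs rapidly toward the noiseless fixed point m = 1, as described in the fixed-point analysis below.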
The initial conditions, for the first layer, are Q_1 = 0, with m_1 and Δ_1 given by:\n\nm_1 = H[ (T_1 − (1−f) m_0) / √Δ_1 ] − H[ (T_1 + f m_0) / √Δ_1 ] ,   (12)\n\nf = f H[ (T_1 − (1−f) m_0) / √Δ_1 ] + (1−f) H[ (T_1 + f m_0) / √Δ_1 ] ,   (13)\n\nand\n\nΔ_1 = f(1−f) (α_0 / (1−α_0)) (1 − m_0²) ,   (14)\n\nwhere α_0 = P/N_0.\n\nFinally, we note that a previous analysis of the feedforward PI model (in the dense case, f = 0.5) [6] neglected the contribution Q_l of the induced correlations to the field variance. Indeed, their approximate equations fail to correctly describe the behavior of the system. As we will show, our recursion relations fully account for the behavior of the network in the limit of large N.\n\nInfinitely deep homogeneous network  The above equations, Eqs. (8)-(11), describe the dynamics of the average overlap of the network states and the variance of the inputs to the neurons in each layer. These dynamics depend on the sizes (and sparsity) of the different processing layers. Although the above equations are general, from now on we will assume a homogeneous architecture in which N_l = N = N_t/L (all with the same sparsity). To find the behavior of the signals as they propagate along this infinitely deep homogeneous network (l → ∞) we look for the fixed points of the recursion equations.\n\nSolution of the equations reveals three fixed points of the trajectories. Two of them are stable fixed points, one at m = 0 and the other at m = 1. The third is an unstable fixed point at some intermediate\n\nFigure 2: Overlap dynamics. (A) Trajectories of the overlaps across layers from Eqs. (8)-(11) (solid lines) and simulations (circles). The dashed red line shows the predicted separatrix m†. The deviations from the theoretical prediction near the separatrix are due to finite-size effects in the simulations (α = 0.4, f = 0.1). (B) Basin of attraction for two values of f as a function of α. 
Lines show the theoretical prediction and shaded areas the simulations. (C) Convergence time (number of layers) to the m = 1 attractor. Near the unstable fixed point (dashed vertical lines) the convergence time diverges, and it rapidly decreases for larger initial conditions, m_0 > m†.\n\nvalue m†. Initial conditions with overlaps obeying m_0 > m† converge to 1, implying complete suppression of the input noise, while those with m_0 < m† lose all overlap with the central pattern [figure 2.A], which depicts the values of the overlaps for different initial conditions. As expected, the curves (analytical results derived by numerically iterating the above mean field equations) terminate either at m_l = 1 or m_l = 0 for large l. The same holds for the numerical simulations (dots), except for a few intermediate values of the initial conditions that converge to intermediate asymptotic values of the overlap. These intermediate fixed points are 'finite size effects': as the system size (N_t and correspondingly N) increases, the range of initial conditions that converge to intermediate fixed points shrinks to zero. In general, increasing the sparsity of the representations (i.e., reducing f) improves the performance of the network. As seen in [figure 2.B], the basin of attraction of the noiseless fixed point increases as f decreases.\n\nConvergence time  In general, the overlaps approach the noiseless state relatively fast, i.e., within 5-10 layers. This holds for initial conditions well within the basin of attraction of this fixed point. If the initial condition is close to the boundary of the basin, i.e., m_0 ≈ m†, convergence is slow. In this case, the convergence time diverges as m_0 → m† from above [figure 2.C].\n\n4 Optimal Architecture\n\nWe evaluate the performance of the network by the ability of readout neurons to correctly perform randomly chosen binary linear classifications of the clusters. 
For concreteness we consider the performance of a single readout neuron performing a binary classification in which, for each central pattern, the desired label is ξ_{ro,μ} ∈ {0, 1}. The readout weights, projecting from the last processing layer into the readout [figure 1], are assumed to be learned to perform the correct classification by a pseudo-inverse rule, similar to the design of the processing weight matrices. The readout weight matrix is given by\n\nW^j_{ro} = (1 / (N f_{ro}(1−f_{ro}))) Σ_{μ,ν=1}^{P} ( ξ_{ro,μ} − f_{ro} ) [C_L]^{−1}_{μν} ( ξ^j_{L,ν} − f ) .   (15)\n\nWe assume the readout labels are i.i.d. Bernoulli variables with zero bias (f_{ro} = 0.5), though a bias can easily be incorporated. The error of the readout is the probability of the neuron being in the opposite state than the label,\n\nε = (1 − m_{ro}) / 2 ,   (16)\n\nwhere m_{ro} is the average overlap of the readout layer, which can be calculated using the recursion equations (8)-(11). However, since generally f ≠ f_{ro}, the activity factors need to be replaced in the proper positions in the equations; we give the exact form of the readout equations in the supplementary material.\n\n4.1 Single infinite layer\n\nIn the following we explore the utility of deep architectures in performing the above tasks. Before assessing quantitatively different architectures, we present a simple comparison between a single infinitely wide layer and a deep network with a small number of finite-width layers. An important result of our theory is that for a model with a single processing layer with finite f, the overlap m_1, and hence the classification error, do not vanish even for a layer with an infinite number of neurons. This holds for all levels of input noise, i.e., as long as m_0 < 1. This can be seen by setting α = 0 in equations (8)-(11) for L = 2. 
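The exact-retrieval property of the pseudo-inverse construction (Eqs. (4)-(5), and Eq. (15) for the readout) can be verified numerically. The sketch below is a self-contained pure-Python check (the Gauss-Jordan inverse stands in for a linear-algebra library; all names are illustrative): for a noiseless stored input, every unit's field equals exactly ξ − f, so any threshold between −f and 1 − f recovers the target template.

```python
import random

def invert(a):
    # Gauss-Jordan inversion with partial pivoting for a small square matrix
    n = len(a)
    aug = [row[:] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[piv] = aug[piv], aug[col]
        p = aug[col][col]
        aug[col] = [x / p for x in aug[col]]
        for r in range(n):
            if r != col:
                fac = aug[r][col]
                aug[r] = [x - fac * y for x, y in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

rng = random.Random(1)
N, P, f = 60, 5, 0.3
# random binary templates for the source (pre) and target (post) layers
pre = [[1 if rng.random() < f else 0 for _ in range(N)] for _ in range(P)]
post = [[1 if rng.random() < f else 0 for _ in range(N)] for _ in range(P)]

norm = 1.0 / (N * f * (1 - f))
C = [[norm * sum((pre[m][i] - f) * (pre[v][i] - f) for i in range(N))
      for v in range(P)] for m in range(P)]          # template correlations, Eq. (5)
Cinv = invert(C)

def layer_output(s):
    # fields h_i = sum_j W_ij (s_j - f) with the PI weights of Eq. (4),
    # computed via template projections so W is never formed explicitly
    proj = [sum((pre[v][j] - f) * (s[j] - f) for j in range(N)) for v in range(P)]
    coef = [sum(Cinv[m][v] * proj[v] for v in range(P)) for m in range(P)]
    return [1 if norm * sum((post[m][i] - f) * coef[m] for m in range(P)) > 0.5 - f
            else 0 for i in range(N)]

# noiseless stored inputs are mapped exactly onto their target templates
print(all(layer_output(pre[mu]) == post[mu] for mu in range(P)))  # True
```

The factorization through the P template projections is also how the P < N_l capacity limit shows up in practice: once C becomes rank-deficient, the inversion step fails.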
Note that although the variance contribution to the noise in the field, Δ_{ro}, vanishes, the contribution from the correlations, Q_1, remains finite and is responsible for the fact that m_{ro} < 1 and ε > 0 [1]. In contrast, in a deep network, if the initial overlap is within the basin of attraction of the m = 1 solution, the overlap quickly approaches m = 1 [figure 2.C]. This suggests that a deep architecture will generally perform better than a single layer, as can be seen in the example in figure 3.A.\n\nMean error  The readout error depends on the level of the initial noise (i.e., the value of m_0). Here we introduce a global measure of performance, E, defined as the readout error averaged over the initial overlaps,\n\nE = ∫_0^1 dm_0 ρ(m_0) ε(m_0) ,   (17)\n\nwhere ρ(m_0) is the distribution of cluster sizes. For simplicity we use here a uniform distribution, ρ = 1. The mean error is a function of the parameters of the network, namely the sparsity f, the input and total loads α_0 = P/N_0 and α_t = P/N_t respectively, and the number of layers L, which describes the layout of the network. We are now ready to compare the performance of different architectures.\n\n4.2 Limited resources\n\nIn any real setting, the resources of the network are limited. This may be due to a finite number of available neurons or a limit on the computational power. To evaluate the optimal architecture under a constraint of a fixed total number of neurons, we assume that the total number of neurons is fixed at N_t = κN_0, where N_0 is the size of the input layer. As in the analysis above, we consider for simplicity uniform architectures in which all processing layers are of equal size, N = N_t/L. 
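Under this constraint the depth tradeoff can be illustrated numerically. The sketch below is a simplified pure-Python illustration, not the paper's exact procedure: it re-implements the mean-field recursion of Eqs. (8)-(14) and, as a simplification, scores an architecture by its top-layer overlap, ε ≈ (1 − m_L)/2, in place of the full readout equations. It sweeps the depth L at fixed total size N_t = κN_0 and estimates the mean error of Eq. (17) on a grid of initial overlaps:

```python
import math

def H(x):
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def solve_threshold(m, sigma, f):
    lo, hi = -20.0, 20.0   # bisection on Eq. (9); the RHS decreases in T
    for _ in range(60):
        T = 0.5 * (lo + hi)
        rhs = f * H((T - (1 - f) * m) / sigma) + (1 - f) * H((T + f * m) / sigma)
        lo, hi = (T, hi) if rhs > f else (lo, T)
    return 0.5 * (lo + hi)

def top_overlap(m0, f, alpha0, alpha, depth):
    # iterate Eqs. (8)-(14) for `depth` layers, returning the top overlap m_L
    var = f * (1 - f) * alpha0 / (1 - alpha0) * (1 - m0 ** 2)
    m_prev = m0
    sigma = math.sqrt(var)
    T = solve_threshold(m0, sigma, f)
    m = H((T - (1 - f) * m0) / sigma) - H((T + f * m0) / sigma)
    for _ in range(depth - 1):
        g = (f * math.exp(-(T - (1 - f) * m_prev) ** 2 / (2 * var))
             + (1 - f) * math.exp(-(T + f * m_prev) ** 2 / (2 * var)))
        Q = (1 - 2 * alpha) / (2 * math.pi * (1 - alpha)) * g ** 2
        var = max(f * (1 - f) * alpha / (1 - alpha) * (1 - m ** 2) + Q, 1e-30)
        m_prev, sigma = m, math.sqrt(var)
        T = solve_threshold(m, sigma, f)
        m = H((T - (1 - f) * m) / sigma) - H((T + f * m) / sigma)
    return m

f, alpha0, kappa = 0.15, 0.7, 5.0
depths = range(1, 7)
grid = [i / 20.0 for i in range(1, 20)]   # uniform rho(m0) in Eq. (17)
errors = []
for L in depths:
    alpha = alpha0 * L / kappa            # per-layer load grows with depth, Eq. (19)
    eps = [(1 - top_overlap(m0, f, alpha0, alpha, L)) / 2 for m0 in grid]
    errors.append(sum(eps) / len(eps))
print([round(e, 3) for e in errors])      # estimated E(L) for L = 1..6
```

With these parameters one expects, as in figure 3.B, the competition between too few noise-cleaning iterations (small L) and too high a per-layer load (large L); the exact values and the location of the minimum depend on the simplified readout used here.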
The performance as a function of the number of layers is shown in figure 3.B, which depicts the mean error against the number of processing layers L for several values of the expansion factor κ. These curves show that the error has a minimum at a finite depth\n\nL_opt = argmin_L E(L) .   (18)\n\nThe reason for this is that in shallower networks the overlaps have not been iterated a sufficient number of times and hence remain further from the noiseless fixed point. On the other hand, deeper networks have an increased load at each layer, since\n\nα = (P / (κN_0)) L ,   (19)\n\nthereby reducing the noise suppression at each layer. As seen in the figure, increasing the total number of neurons yields a lower mean error E_opt and increases the optimal depth of the network. Note, however, that for large κ the mean error rises slowly for L larger than its optimal value; this is because the error changes very slowly with α for small α and remains close to its α = 0 value. Thus, increasing the depth moderately above L_opt may not harm the performance significantly. Ultimately, if L increases to the order of κN_0/P, the load α in each processing layer approaches 1, and the performance deteriorates drastically. Other considerations, such as the time required for computation, may favor shallower architectures, and in practice will limit the utility of architectures deeper than L_opt.\n\nFigure 3: Optimal layout. (A) Comparison of the readout error produced by the same initial condition (m_0 = 0.6) for a single, infinitely wide processing layer and for a deep architecture with α = 0.2. For both networks α_0 = 0.7, f = 0.15 and m_0 = 0.6. (B) Mean error as a function of the number of processing layers for three values of the expansion factor κ = N_t/N_0. Dashed line shows the error of a single infinite layer. 
(C) Optimal number of layers as a function of the inverse of the input load (α_0 ∝ P), for different values of the sparsity. Lines show linear regressions on the data points. (D) Minimal error as a function of the input load (number of stored templates). Same color code as (C).\n\nThe effect of load on the optimal architecture  If the overall number of neurons in the network is fixed, then the optimal layout L_opt is a function of the size of the dataset, i.e., P. For large P, the optimal network becomes shallow. This is because when the load is high, resources are better allocated to constrain α as much as possible, due to the high readout error when α is close to 1 [figures 3.C and 3.D]. As shown in [figure 3.D], L_opt increases with decreasing load, scaling as\n\nL_opt ∝ P^{−1/2} .   (20)\n\nThis implies that the width N_opt scales as\n\nN_opt ∝ P^{1/2} .   (21)\n\n4.3 Autoencoder example\n\nThe model above assumes inputs in the form of random patterns (ξ_{0,μ}) corrupted by noise. Here we illustrate the qualitative behavior of the network for inputs generated from handwritten digits (MNIST dataset) with random corruptions. To visualize the suppression of noise by the deep pseudo-inverse network, we train the network with an autoencoder readout layer, namely we use a readout layer of size N_0 and readout labels equal to the original noiseless images, ξ_{ro,μ} = ξ_{0,μ}. The readout weights are pseudo-inverse weights with output labels identical to the input patterns, following Eq. (15) [2]. A perfect overlap at the readout layer implies perfect reconstruction of the original noiseless pattern.\n\nIn figure 4, two networks were trained as autoencoders on a set of templates composed of 3-digit numbers (see experimental procedures in the supplementary material). Both networks have the same number of neurons. 
In the first, all processing neurons are placed in a single wide layer, while in the other the neurons were divided into 10 equally sized layers. As the theory predicts, the deep structure is able to reproduce the original templates for a wide range of initial noise levels, while the single layer typically reduces the noise but fails to reproduce the original image.\n\nFigure 4: Visual example of the difference between a single processing layer and a deep structure. Input data were prepared using the MNIST handwritten digit database. Examples of the templates are shown on the top row. Two different networks were trained to autoencode the inputs, one with all the processing neurons in a single layer (figure 1.A) and one in which the neurons were divided equally between 10 layers (figure 1.B) (see experimental procedures in the supplementary material for details). Noisy versions of the templates were introduced to the two networks, and the outputs are presented on the third and fourth rows, for different levels of initial noise (columns).\n\n5 Summary and Final Remarks\n\nOur paper aims at gaining a better understanding of the functionality of deep networks. Whereas the operation of the bottom (low-level processing of the signals) and the top (fully supervised) stages is well understood, an understanding of the rationale for multiple intermediate stages and of the tradeoffs between competing architectures is lacking. The model we study is simplified both in its task, suppressing noise, and in its learning rule (pseudo-inverse). With respect to the first, we believe that changing the noise model to the more realistic variability inherent in objects will exhibit the same qualitative behaviors. With respect to the learning rule, the pseudo-inverse is close to the SVM rule in the regime in which we work, so we believe it is a good tradeoff between realism and tractability. 
Thus, despite the unavoidable simplicity of our model, we believe its analysis yields important insights which will likely carry over to the more realistic domains of deep networks studied in machine learning and neuroscience.\n\nEffects of sparseness  Our results show that the performance of the network improves as the sparsity of the representation increases. In the extreme case of f → 0, perfect suppression of the noise occurs already after a single processing layer. Cortical sensory representations exhibit only moderate sparsity levels, f ≈ 0.1. Computational considerations of robustness to 'representational noise' at each layer will also limit the value of f. Thus, deep architectures may be necessary for good performance at realistic moderate levels of sparsity (or for dense representations).\n\nInfinitely wide shallow architectures  A central result of our model is that a finite deep network may perform better than a network with a single processing layer of infinite width. An infinitely wide shallow network has been studied in the past (e.g., [4]). In principle, an infinitely wide network, even with random projection weights, may serve as a universal approximator, yielding readout performance as good as or superior to that of any finite deep network. This, however, requires a complex training of the readout weights. Our relatively simple readout weights are incapable of extracting this information from the infinite, shallow architecture. Similar behavior is seen with simpler readout weights, such as Hebbian weights, as well as with more complex readouts generated by training the readout weights using SVMs with noiseless patterns or noisy inputs [1]. 
Thus, our results hold qualitatively for a broad range of plausible readout learning algorithms (such as Hebb, PI, or SVM) but not for an arbitrarily complex search that finds the optimal readout weights.\n\nAcknowledgements\n\nThis work was partially supported by IARPA (contract #D16PC00002), the Gatsby Charitable Foundation, and a Simons Foundation SCGB grant.\n\nReferences\n\n[1] Baktash Babadi and Haim Sompolinsky. Sparseness and expansion in sensory representations. Neuron, 83(5):1213-1226, September 2014.\n\n[2] Pierre Baldi and Kurt Hornik. Neural networks and principal component analysis: Learning from examples without local minima. Neural Networks, 2(1):53-58, 1989.\n\n[3] Maneesh Bhand, Ritvik Mudur, Bipin Suresh, Andrew Saxe, and Andrew Y. Ng. Unsupervised learning models of primary cortical receptive fields and receptive field plasticity. Advances in Neural Information Processing Systems, pages 1971-1979, 2011.\n\n[4] Y. Cho and L. K. Saul. Large-margin classification in infinite neural networks. Neural Computation, 22(10):2678-2697, 2010.\n\n[5] Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th International Conference on Machine Learning (ICML), 2008.\n\n[6] E. Domany, W. Kinzel, and R. Meir. Layered neural networks. Journal of Physics A: Mathematical and General, 22(12):2081-2102, June 1989.\n\n[7] G. E. Hinton and R. R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504-507, July 2006.\n\n[8] I. Kanter and Haim Sompolinsky. Associative recall of memory without errors. Physical Review A, 35(1):380-392, 1987.\n\n[9] Honglak Lee, Chaitanya Ekanadham, and Andrew Y. Ng. Sparse deep belief net model for visual area V2. Advances in Neural Information Processing Systems, pages 873-880, 2008.\n\n[10] L. Personnaz, I. Guyon, and G. Dreyfus. Information storage and retrieval in spin-glass like neural networks. 
Journal de Physique Lettres, 46(8):359-365, April 1985.\n\n[11] Andrew M. Saxe, James L. McClelland, and Surya Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv preprint arXiv:1312.6120, December 2013.\n\n[12] Paul Smolensky. Information processing in dynamical systems: Foundations of harmony theory. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Volume 1. MIT Press, 1986.\n\n[13] Glenn C. Turner, Maxim Bazhenov, and Gilles Laurent. Olfactory representations by Drosophila mushroom body neurons. Journal of Neurophysiology, 99(2):734-746, February 2008.\n\n[14] Pascal Vincent, Hugo Larochelle, Isabelle Lajoie, Yoshua Bengio, and Pierre-Antoine Manzagol. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. The Journal of Machine Learning Research, 11:3371-3408, 2010.", "award": [], "sourceid": 2433, "authors": [{"given_name": "Jonathan", "family_name": "Kadmon", "institution": "Hebrew University"}, {"given_name": "Haim", "family_name": "Sompolinsky", "institution": "Hebrew University and Harvard University"}]}