{"title": "An Information Maximization Approach to Overcomplete and Recurrent Representations", "book": "Advances in Neural Information Processing Systems", "page_first": 612, "page_last": 618, "abstract": null, "full_text": "An Information Maximization Approach to \nOvercomplete and Recurrent Representations \n\nOren Shriki and Haim Sompolinsky \n\nRacah Institute of Physics and \nCenter for Neural Computation \n\nHebrew University \n\nDaniel D. Lee \n\nBell Laboratories \n\nLucent Technologies \nMurray Hill, NJ 07974 \n\nJerusalem, 91904, Israel \n\nAbstract \n\nThe principle of maximizing mutual information is applied to learning \novercomplete and recurrent representations. The underlying model con(cid:173)\nsists of a network of input units driving a larger number of output units \nwith recurrent interactions. In the limit of zero noise, the network is de(cid:173)\nterministic and the mutual information can be related to the entropy of \nthe output units. Maximizing this entropy with respect to both the feed(cid:173)\nforward connections as well as the recurrent interactions results in simple \nlearning rules for both sets of parameters. The conventional independent \ncomponents (ICA) learning algorithm can be recovered as a special case \nwhere there is an equal number of output units and no recurrent con(cid:173)\nnections. The application of these new learning rules is illustrated on a \nsimple two-dimensional input example. \n\n1 Introduction \n\nMany unsupervised learning algorithms such as principal component analysis, vector quan(cid:173)\ntization, self-organizing feature maps, and others use the principle of minimizing recon(cid:173)\nstruction error to learn appropriate features from multivariate data [1, 2]. Independent \ncomponents analysis (ICA) can similarly be understood as maximizing the likelihood of \nthe data under a non-Gaussian generative model, and thus is related to minimizing a re(cid:173)\nconstruction cost [3, 4, 5]. 
On the other hand, the same ICA algorithm can also be derived without regard to a particular generative model by maximizing the mutual information between the data and a nonlinearly transformed version of the data [6]. This principle of information maximization has also been previously applied to explain optimal properties for single units, linear networks, and symplectic transformations [7, 8, 9]. \n\nIn these proceedings, we show how the principle of maximizing mutual information can be generalized to overcomplete as well as recurrent representations. In the limit of zero noise, we derive gradient descent learning rules for both the feedforward and recurrent weights. Finally, we show the application of these learning rules to some simple illustrative examples. \n\nFigure 1: Network diagram of an overcomplete, recurrent representation. The N input variables x influence the M output signals s through feedforward connections W. The signals s also interact with each other through the recurrent interactions K. \n\n2 Information Maximization \n\nThe \"Infomax\" formulation of ICA considers the problem of maximizing the mutual information between N-dimensional data observations {x} which are input to a network resulting in N-dimensional output signals {s} [6]. Here, we consider the general problem where the signals s are M-dimensional with M ≥ N. Thus, the representation is overcomplete because there are more signal components than data components. We also consider the situation where a signal component s_i can influence another component s_j through a recurrent interaction K_ji. As a network, this is diagrammed in Fig. 1, with the feedforward connections described by the M x N matrix W and the recurrent connections by the M x M matrix K. 
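For concreteness, the response of such a network can be computed numerically as the fixed point of s = g(Wx + Ks) (cf. Eq. (1) and Appendix 1). A minimal NumPy sketch with g = tanh, as used in the examples below; the function name and the damping scheme are our own, and the iteration is only guaranteed to converge when the recurrent feedback is sufficiently weak:

```python
import numpy as np

def network_response(x, W, K, n_iter=200, tol=1e-10):
    """Solve s = tanh(W x + K s) by damped fixed-point iteration.

    W is the M x N feedforward matrix, K the M x M recurrent matrix.
    Assumes the map is a contraction (e.g. small recurrent weights).
    """
    s = np.zeros(W.shape[0])
    for _ in range(n_iter):
        s_new = np.tanh(W @ x + K @ s)
        if np.max(np.abs(s_new - s)) < tol:
            return s_new
        s = 0.5 * s + 0.5 * s_new  # damping improves stability
    return s
```

For M = N and K = 0 this reduces to the familiar feedforward ICA network s = g(Wx).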
The network response s is a deterministic function of the input x: \n\ns = g(Wx + Ks),  (1) \n\nwhere g is some nonlinear squashing function. In this case, the mutual information between the inputs x and outputs s is functionally only dependent on the entropy of the outputs: \n\nI(s, x) = H(s) - H(s|x) ∼ H(s).  (2) \n\nThe distribution of s is an N-dimensional manifold embedded in an M-dimensional vector space and nominally has a negatively divergent entropy. However, as shown in Appendix 1, the probability density of s can be related to the input distribution via the relation: \n\nP(s) ∝ P(x) / √det(χ^T χ),  (3) \n\nwhere the susceptibility (or Jacobian) matrix χ is defined as: \n\nχ_ij = ∂s_i/∂x_j.  (4) \n\nThis result can be understood in terms of the singular value decomposition (SVD) of the matrix χ. The transformation performed by χ can be decomposed into a series of three transformations: an orthogonal transformation that rotates the axes, a diagonal transformation that scales each axis, followed by another orthogonal transformation. A volume element in the input space is mapped onto a volume element in the output space, and its volume change is described by the diagonal scaling operation. This scale change is given by the product of the square roots of the eigenvalues of χ^T χ. Thus, the relationship between the probability distributions in the input and output spaces includes the proportionality factor √det(χ^T χ), as formally derived in Appendix 1. \n\nWe now get the following expression for the entropy of the outputs: \n\nH(s) ∼ -∫ dx P(x) log( P(x) / √det(χ^T χ) ) = (1/2) ⟨log det(χ^T χ)⟩ + H(x),  (5) \n\nwhere the brackets indicate averaging over the input distribution. \n\n3 Learning rules \n\nFrom Eq. (5), we see that minimizing the following cost function: \n\nE = -(1/2) Tr log(χ^T χ)  (6) \n\nis equivalent to maximizing the mutual information. We first note that the susceptibility χ satisfies the following recursion relation: \n\nχ_ij = g'_i (W_ij + Σ_k K_ik χ_kj) = (GW + GKχ)_ij,  (7) \n\nwhere G_ij = δ_ij g'_i and g'_i ≡ g'(Σ_j W_ij x_j + Σ_k K_ik s_k). Solving for χ in Eq. (7) yields the result: \n\nχ = (G^{-1} - K)^{-1} W = ΦW,  (8) \n\nwhere Φ^{-1} ≡ G^{-1} - K. Φ_ij can be interpreted as the sensitivity in the recurrent network of the ith unit's output to changes in the total input of the jth unit. \n\nWe next derive the learning rules for the network parameters using gradient descent, as shown in detail in Appendix 2. The resulting expression for the learning rule for the feedforward weights is: \n\nΔW = -η ∂E/∂W = η (Γ^T + Φ^T γ x^T),  (9) \n\nwhere η is the learning rate, the matrix Γ is defined as \n\nΓ = (χ^T χ)^{-1} χ^T Φ,  (10) \n\nand the vector γ is given by \n\nγ_i = (χΓ)_ii g''_i/(g'_i)^3.  (11) \n\nMultiplying the gradient in Eq. (9) by the matrix (WW^T) yields an expression analogous to the \"natural\" gradient learning rule [10]: \n\nΔW = η W (I + χ^T γ x^T).  (12) \n\nSimilarly, the learning rule for the recurrent interactions is \n\nΔK = -η ∂E/∂K = η ((χΓ)^T + Φ^T γ s^T).  (13) \n\nIn the case when there are equal numbers of input and output units, M = N, and there are no recurrent interactions, K = 0, most of the previous expressions simplify. The sensitivity matrix Φ = G is diagonal, and Γ = W^{-1}. Substituting back into Eq. (9) for the learning rule for W results in the update rule: \n\nΔW = η [(W^T)^{-1} + z x^T],  (14) \n\nwhere z_i = g''_i/g'_i. Thus, the well-known Infomax ICA learning rule is recovered as a special case of Eq. (9) [6]. \n\nFigure 2 (panels (a)-(c)): Results of fitting 3 filters to a 2-dimensional hexagon distribution with 10000 sample points. \n\n4 Examples \n\nWe now apply the preceding learning algorithms to a simple two-dimensional (N = 2) input example. 
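Before turning to the example, the square, non-recurrent special case of Eq. (14) can be written down in a few lines. A hedged sketch (our own code, not from the paper): with g = tanh, z_i = g''_i/g'_i = -2 s_i, which recovers the familiar Bell-Sejnowski update:

```python
import numpy as np

def infomax_ica_step(W, x, eta=0.01):
    """One Infomax ICA update, Eq. (14): the M = N, K = 0 special case.

    For g = tanh, z = g''/g' = -2 s, so the update is
    dW = eta * ((W^T)^{-1} + z x^T).
    """
    s = np.tanh(W @ x)
    z = -2.0 * s
    return W + eta * (np.linalg.inv(W.T) + np.outer(z, x))
```

In practice the update is averaged over a batch of inputs, and the natural-gradient form of Eq. (12) avoids the explicit matrix inverse.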
Each input point is generated by a linear combination of three (two-dimensional) unit vectors with angles of 0°, 120° and 240°. The coefficients are taken from a uniform distribution on the unit interval. The resulting distribution has the shape of a unit hexagon, which is slightly more dense close to the origin than at the boundaries. Samples of the input distribution are shown in Fig. 2. The second order cross correlations vanish, so that all the structure in the data is described only by higher order correlations. We fix the sigmoidal nonlinearity to be g(x) = tanh(x). \n\n4.1 Feedforward weights \n\nA set of M = 3 overcomplete filters for W are learned by applying the update rule in Eq. (9) to random normalized initial conditions while keeping the recurrent interactions fixed at K = 0. The lengths of the rows of W were constrained to be identical so that the filters are projections along certain directions in the two-dimensional space. The algorithm converged after about 20 iterations. Examples of the resulting learned filters are shown by plotting the rows of W as vectors in Fig. 2. As shown in the figure, there are several different local-minimum solutions. If the lengths of the rows of W are left unconstrained, slight deviations from these solutions occur, but relative orientation differences of 60° or 120° between the various filters are preserved. \n\n4.2 Recurrent interactions \n\nTo investigate the effect of recurrent interactions on the representation, we fixed the feedforward weights in W to point in the directions shown in Fig. 2(a), and learned the optimal recurrent interactions K using Eq. (13). Depending upon the length of the rows of W, which scaled the input patterns, different optimal values are seen for the recurrent connections. This is shown in Fig. 3 by plotting the value of the cost function against the strength of the uniform recurrent interaction. 
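The curve in Fig. 3 can be reproduced in outline. A sketch under our own assumptions (NumPy; we read "uniform recurrent interaction" as a K with equal off-diagonal entries k and zero diagonal, and we evaluate the cost of Eq. (6) with χ from Eq. (8) at the network's fixed point):

```python
import numpy as np

def fixed_point(x, W, K, n_iter=300):
    # Iterate s = tanh(W x + K s); assumes weak enough recurrence to converge.
    s = np.zeros(W.shape[0])
    for _ in range(n_iter):
        s = np.tanh(W @ x + K @ s)
    return s

def cost(x, W, K):
    # E = -1/2 log det(chi^T chi) with chi = (G^{-1} - K)^{-1} W  (Eqs. 6, 8).
    s = fixed_point(x, W, K)
    G_inv = np.diag(1.0 / (1.0 - s**2))   # g = tanh, so g' = 1 - s^2
    chi = np.linalg.inv(G_inv - K) @ W
    return -0.5 * np.linalg.slogdet(chi.T @ chi)[1]

def cost_curve(samples, W, ks):
    # Average cost versus a uniform off-diagonal recurrent strength k.
    M = W.shape[0]
    curve = []
    for k in ks:
        K = k * (np.ones((M, M)) - np.eye(M))
        curve.append(np.mean([cost(x, W, K) for x in samples]))
    return curve
```

Sweeping k for rows of W of length 1 and 5 should qualitatively reproduce the two curves of Fig. 3, with the optimum at negative k for weakly scaled inputs and positive k for strongly scaled ones.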
For small scaled inputs, the optimal recurrent strength is negative, which effectively amplifies the output signals since the 3 signals are negatively correlated. With large scaled inputs, the optimal recurrent strength is positive, which tends to decrease the outputs. Thus, in this example, optimizing the recurrent connections performs gain control on the inputs. \n\nFigure 3: Effect of adding recurrent interactions to the representation. The cost function is plotted as a function of the recurrent interaction strength k, for two different input scaling parameters (|W| = 1 and |W| = 5). \n\n5 Discussion \n\nThe learned feedforward weights are similar to the results of another ICA model that can learn overcomplete representations [11]. Our algorithm, however, does not need to perform approximate inference on a generative model. Instead, it directly maximizes the mutual information between the outputs and inputs of a nonlinear network. Our method also has the advantage of being able to learn recurrent connections that can enhance the representational power of the network. We also note that this approach can be easily generalized to undercomplete representations by simply changing the order of the matrix product in the cost function. However, more work still needs to be done in order to understand technical issues regarding speed of convergence and local minima in larger applications. Possible extensions of this work would be to optimize the nonlinearity that is used, or to adaptively change the number of output units to best match the input distribution. \n\nWe acknowledge the financial support of Bell Laboratories, Lucent Technologies, and the US-Israel Binational Science Foundation. 
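For anyone implementing these rules, the analytic gradient of Eq. (9) is easy to verify against a finite-difference derivative of the cost. A self-contained sketch (our code; for g = tanh, g' = 1 - s² and g'' = -2s(1 - s²)):

```python
import numpy as np

def fixed_point(x, W, K, n_iter=400):
    # Iterate s = tanh(W x + K s) to the network's deterministic response.
    s = np.zeros(W.shape[0])
    for _ in range(n_iter):
        s = np.tanh(W @ x + K @ s)
    return s

def cost_and_grad_W(x, W, K):
    """Cost E = -1/2 log det(chi^T chi) and dE/dW = -(Gamma^T + Phi^T gamma x^T),
    following Eqs. (8)-(11) for a single input x."""
    s = fixed_point(x, W, K)
    g1 = 1.0 - s**2            # g'
    g2 = -2.0 * s * g1         # g''
    Phi = np.linalg.inv(np.diag(1.0 / g1) - K)
    chi = Phi @ W
    E = -0.5 * np.linalg.slogdet(chi.T @ chi)[1]
    Gamma = np.linalg.solve(chi.T @ chi, chi.T @ Phi)   # (chi^T chi)^{-1} chi^T Phi
    gamma = np.diag(chi @ Gamma) * g2 / g1**3
    dE_dW = -(Gamma.T + Phi.T @ np.outer(gamma, x))
    return E, dE_dW
```

A central-difference check (re-solving the fixed point for each perturbed W, so that the dependence of s on W is included) should agree with the analytic expression to high precision; this is a useful sanity test before running the full learning rule.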
\n\n6 Appendix 1: Relationship between input and output distributions \n\nIn general, the relation between the input and output distributions is given by \n\nP(s) = ∫ dx P(x) P(s|x).  (15) \n\nSince we use a deterministic mapping, the conditional distribution of the response given the input is given by P(s|x) = δ(s - g(Wx + Ks)). By adding independent Gaussian noise to the responses of the output units and considering the limit where the variance of the noise goes to zero, we can write this term as \n\nP(s|x) = lim_{Δ→0} (2πΔ^2)^{-N/2} exp( -||s - g(Wx + Ks)||^2 / (2Δ^2) ).  (16) \n\nThe output space can be partitioned into those points which belong to the image of the input space, and those which do not. For points outside the image of the input space, P(s) = 0. Consider a point s inside the image. This means that there exists x_0 such that s = g(Wx_0 + Ks). For small Δ, we can expand g(Wx + Ks) - s ≈ χ δx, where χ is evaluated at x_0 and δx = x - x_0. This gives \n\nP(s|x) = (1/√det(χ^T χ)) [ lim_{Δ→0} (2πΔ^2)^{-N/2} √det(χ^T χ) exp( -δx^T χ^T χ δx / (2Δ^2) ) ].  (17) \n\nThe expression in the square brackets is a delta function in x around x_0. Using Eq. (15) we finally get \n\nP(s) = P(x) Θ(s) / √det(χ^T χ),  (18) \n\nwhere the characteristic function Θ(s) is 1 if s belongs to the image of the input space and is zero otherwise. Note that for the case when χ is a square matrix (M = N), this expression reduces to the relation P(s) = P(x) / |det(χ)|. \n\n7 Appendix 2: Derivation of the learning rules \n\nTo derive the appropriate learning rules, we need to calculate the derivatives of E with respect to some set of parameters λ. In general, these derivatives are obtained from the expression: \n\n∂E/∂λ = -Tr[ (χ^T χ)^{-1} χ^T (∂χ/∂λ) ].  (19) \n\n7.1 Feedforward weights \n\nIn order to derive the learning rule for the weights W, we first calculate \n\n∂χ_ab/∂W_lm = Σ_c [ (∂Φ_ac/∂W_lm) W_cb + Φ_ac (∂W_cb/∂W_lm) ] = Φ_al δ_bm + Σ_c (∂Φ_ac/∂W_lm) W_cb.  (20) \n\nFrom the definition of Φ, we see that: \n\n∂Φ_ac/∂W_lm = -Σ_ij Φ_ai (∂G^{-1}_ij/∂W_lm) Φ_jc  (21) \n\nand \n\n∂G^{-1}_ij/∂W_lm = -(δ_ij/(g'_i)^2) ∂g'_i/∂W_lm = -δ_ij (g''_i/(g'_i)^3) ∂s_i/∂W_lm,  (22) \n\nwhere g''_i ≡ g''(Σ_j W_ij x_j + Σ_k K_ik s_k). The derivatives of s also satisfy a recursion relation similar to Eq. (7): \n\n∂s_i/∂W_lm = g'_i ( δ_il x_m + Σ_j K_ij ∂s_j/∂W_lm ),  (23) \n\nwhich has the solution: \n\n∂s_i/∂W_lm = Φ_il x_m.  (24) \n\nPutting all these results together in Eq. (19) and taking the trace, we get the gradient descent rule in Eq. (9). \n\n7.2 Recurrent interactions \n\nTo derive the learning rules for the recurrent weights K, we first calculate the derivatives of χ_ab with respect to K_lm: \n\n∂χ_ab/∂K_lm = Σ_c (∂Φ_ac/∂K_lm) W_cb = -Σ_{c,i,j} Φ_ai (∂Φ^{-1}_ij/∂K_lm) Φ_jc W_cb.  (25) \n\nFrom the definition of Φ, we obtain: \n\n∂Φ^{-1}_ij/∂K_lm = -(δ_ij/(g'_i)^2) ∂g'_i/∂K_lm - δ_il δ_jm.  (26) \n\nThe derivatives of g' are obtained from the following relations: \n\n∂g'_i/∂K_lm = (g''_i/g'_i) ∂s_i/∂K_lm  (27) \n\nand \n\n∂s_i/∂K_lm = Φ_il s_m,  (28) \n\nwhich results from a recursion relation similar to Eq. (23). Finally, after combining these results and calculating the trace, we get the gradient descent learning rule in Eq. (13). \n\nReferences \n\n[1] Jolliffe, IT (1986). Principal Component Analysis. New York: Springer-Verlag. \n\n[2] Haykin, S (1999). Neural Networks: A Comprehensive Foundation. 2nd ed., Prentice-Hall, Upper Saddle River, NJ. \n\n[3] Jutten, C & Herault, J (1991). Blind separation of sources, part I: An adaptive algorithm based on neuromimetic architecture. Signal Processing 24, 1-10. \n\n[4] Hinton, G & Ghahramani, Z (1997). Generative models for discovering sparse distributed representations. Philosophical Transactions of the Royal Society B 352, 1177-1190. \n\n[5] Pearlmutter, B & Parra, L (1996). A context-sensitive generalization of ICA. In ICONIP'96, 151-157. \n\n[6] Bell, AJ & Sejnowski, TJ (1995). 
An information maximization approach to blind separation and blind deconvolution. Neural Computation 7, 1129-1159. \n\n[7] Barlow, HB (1989). Unsupervised learning. Neural Computation 1, 295-311. \n\n[8] Linsker, R (1992). Local synaptic learning rules suffice to maximize mutual information in a linear network. Neural Computation 4, 691-702. \n\n[9] Parra, L, Deco, G, & Miesbach, S (1996). Statistical independence and novelty detection with information preserving nonlinear maps. Neural Computation 8, 260-269. \n\n[10] Amari, S, Cichocki, A & Yang, H (1996). A new learning algorithm for blind signal separation. Advances in Neural Information Processing Systems 8, 757-763. \n\n[11] Lewicki, MS & Sejnowski, TJ (2000). Learning overcomplete representations. Neural Computation 12, 337-365. \n", "award": [], "sourceid": 1863, "authors": [{"given_name": "Oren", "family_name": "Shriki", "institution": null}, {"given_name": "Haim", "family_name": "Sompolinsky", "institution": null}, {"given_name": "Daniel", "family_name": "Lee", "institution": null}]}