{"title": "Towards an Organizing Principle for a Layered Perceptual Network", "book": "Neural Information Processing Systems", "page_first": 485, "page_last": 494, "abstract": null, "full_text": "TOWARDS AN ORGANIZING PRINCIPLE FOR A LAYERED PERCEPTUAL NETWORK \n\nRalph Linsker \n\nIBM Thomas J. Watson Research Center, Yorktown Heights, NY 10598 \n\nAbstract \n\nAn information-theoretic optimization principle is proposed for the development of each processing stage of a multilayered perceptual network. This principle of \"maximum information preservation\" states that the signal transformation that is to be realized at each stage is one that maximizes the information that the output signal values (from that stage) convey about the input signal values (to that stage), subject to certain constraints and in the presence of processing noise. The quantity being maximized is a Shannon information rate. I provide motivation for this principle and -- for some simple model cases -- derive some of its consequences, discuss an algorithmic implementation, and show how the principle may lead to biologically relevant neural architectural features such as topographic maps, map distortions, orientation selectivity, and extraction of spatial and temporal signal correlations. A possible connection between this information-theoretic principle and a principle of minimum entropy production in nonequilibrium thermodynamics is suggested. \n\nIntroduction \n\nThis paper describes some properties of a proposed information-theoretic organizing principle for the development of a layered perceptual network. The purpose of this paper is to provide an intuitive and qualitative understanding of how the principle leads to specific feature-analyzing properties and signal transformations in some simple model cases. 
More detailed analysis is required in order to apply the principle to cases involving more realistic patterns of signaling activity as well as specific constraints on network connectivity. \n\nThis section gives a brief summary of the results that motivated the formulation of the organizing principle, which I call the principle of \"maximum information preservation.\" In later sections the principle is stated and its consequences studied. \n\nIn previous work [1] I analyzed the development of a layered network of model cells with feedforward connections whose strengths change in accordance with a Hebb-type synaptic modification rule. I found that this development process can produce cells that are selectively responsive to certain input features, and that these feature-analyzing properties become progressively more sophisticated as one proceeds to deeper cell layers. These properties include the analysis of contrast and of edge orientation, and are qualitatively similar to properties observed in the first several layers of the mammalian visual pathway [2]. \n\nWhy does this happen? Does a Hebb-type algorithm (which adjusts synaptic strengths depending upon correlations among signaling activities [3]) cause a developing perceptual network to optimize some property that is deeply connected with the mature network's functioning as an information processing system? \n\n\u00a9 American Institute of Physics 1988 \n\nFurther analysis [4,5] has shown that a suitable Hebb-type rule causes a linear-response cell in a layered feedforward network (without lateral connections) to develop so that the statistical variance of its output activity (in response to an ensemble of inputs from the previous layer) is maximized, subject to certain constraints. 
The mature cell thus performs an operation similar to principal component analysis (PCA), an approach used in statistics to expose regularities (e.g., clustering) present in high-dimensional input data. (Oja [6] had earlier demonstrated a particular form of Hebb-type rule that produces a model cell that implements PCA exactly.) \n\nFurthermore, given a linear device that transforms inputs into an output, and given any particular output value, one can use optimal estimation theory to make a \"best estimate\" of the input values that gave rise to that output. Of all such devices, I have found that an appropriate Hebb-type rule generates that device for which this \"best estimate\" comes closest to matching the input values [4,5]. Under certain conditions, such a cell has the property that its output preserves the maximum amount of information about its input values [5]. \n\nMaximum Information Preservation \n\nThe above results have suggested a possible organizing principle for the development of each layer of a multilayered perceptual network [5]. The principle can be applied even if the cells of the network respond to their inputs in a nonlinear fashion, and even if lateral as well as feedforward connections are present. (Feedback from later to earlier layers, however, is absent from this formulation.) This principle of \"maximum information preservation\" states that for a layer of cells L that is connected to and provides input to another layer M, the connections should develop so that the transformation of signals from L to M (in the presence of processing noise) has the property that the set of output values M conveys the maximum amount of information about the input values L, subject to various constraints on, e.g., the range of lateral connections and the processing power of each cell. 
The statistical properties of the ensemble of inputs L are assumed stationary, and the particular L-to-M transformation that achieves this maximization depends on those statistical properties. The quantity being maximized is a Shannon information rate [7]. \n\nAn equivalent statement of this principle is: The L-to-M transformation is chosen so as to minimize the amount of information that would be conveyed by the input values L to someone who already knows the output values M. \n\nWe shall regard the set of input signal values L (at a given time) as an input \"message\"; the message is processed to give an output message M. Each message is in general a set of real-valued signal activities. Because noise is introduced during the processing, a given input message may generate any of a range of different output messages when processed by the same set of connections. \n\nThe Shannon information rate (i.e., the average information transmitted from L to M per message) is [7] \n\nR = \u03a3_L \u03a3_M P(L,M) log [P(L,M)/P(L)P(M)].   (1) \n\nFor a discrete message space, P(L) [resp. P(M)] is the probability of the input (resp. output) message being L (resp. M), and P(L,M) is the joint probability of the input being L and the output being M. [For a continuous message space, probabilities are replaced by probability densities, and sums (over states) by integrals.] This rate can be written as \n\nR = I_L - I_{L|M},   (2) \n\nwhere \n\nI_L \u2261 - \u03a3_L P(L) log P(L)   (3) \n\nis the average information conveyed by message L, and \n\nI_{L|M} \u2261 - \u03a3_L \u03a3_M P(L,M) log P(L|M)   (4) \n\nis the average information conveyed by message L to someone who already knows M. Since I_L is fixed by the properties of the input ensemble, maximizing R means minimizing I_{L|M}, as stated above. \n\nThe information rate R can also be written as \n\nR = I_M - I_{M|L},   (5) \n\nwhere I_M and I_{M|L} are defined by interchanging L and M in Eqns. 3 and 4. 
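As a concrete illustration (not part of the original analysis), the identities of Eqns. 1-5 -- R = I_L - I_{L|M} = I_M - I_{M|L} -- can be checked numerically for a small discrete message space; the joint distribution used below is an arbitrary example, not taken from the paper.

```python
import math

def info_rate(joint):
    """Shannon information rate R of Eqn. 1 for a discrete joint
    distribution, given as a nested list with joint[l][m] = P(L=l, M=m)."""
    p_l = [sum(row) for row in joint]        # marginal P(L)
    p_m = [sum(col) for col in zip(*joint)]  # marginal P(M)
    return sum(p * math.log(p / (p_l[l] * p_m[m]))
               for l, row in enumerate(joint)
               for m, p in enumerate(row) if p > 0)

def avg_info(p):
    """I_L of Eqn. 3: average information conveyed by a message."""
    return -sum(q * math.log(q) for q in p if q > 0)

def cond_info(joint, p_m):
    """I_{L|M} of Eqn. 4: average information conveyed by L to someone
    who already knows M; joint[l][m] = P(L=l, M=m), p_m = marginal P(M)."""
    return -sum(p * math.log(p / p_m[m])
                for row in joint
                for m, p in enumerate(row) if p > 0)

# A hypothetical 2 x 2 joint distribution P(L, M):
joint = [[0.3, 0.1], [0.2, 0.4]]
# Eqn. 2 gives R = avg_info(P(L)) - cond_info(joint, P(M)); transposing the
# joint distribution and interchanging the marginals gives Eqn. 5.
```
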
This form is heuristically useful, since it suggests that one can attempt to make R large by (if possible) simultaneously making I_M large and I_{M|L} small. The term I_M is largest when each message M occurs with equal probability. The term I_{M|L} is smallest when each L is transformed into a unique M, and more generally is made small by \"sharpening\" the P(M|L) distribution, so that for each L, P(M|L) is near zero except for a small set of messages M. \n\nHow can one gain insight into biologically relevant properties of the L \u2192 M transformation that may follow from the principle of maximum information preservation (which we also call the \"infomax\" principle)? In a network, this L \u2192 M transformation may be a function of the values of one or a few variables (such as a connection strength) for each of the allowed connections between and within layers, and for each cell. The search space is quite large, particularly from the standpoint of gaining an intuitive or qualitative understanding of network behavior. We shall therefore consider a simple model in which the dimensionalities of the L and M signal spaces are greatly reduced, yet one for which the infomax analysis exhibits features that may also be important under more general conditions relevant to biological and synthetic network development. \n\nThe next four sections are organized as follows. (i) A model is introduced in which the L and M messages, and the L-to-M transformation, have simple forms. The infomax principle is found to be satisfied when some simple geometric conditions (on the transformation) are met. (ii) I relate this model to the analysis of signal processing and noise in an interconnection network. The formation of topographic maps is discussed. (iii) The model is applied to simplified versions of biologically relevant problems, such as the emergence of orientation selectivity. 
(iv) I show that the main properties of the infomax principle for this model can be realized by certain local algorithms that have been proposed to generate topographic maps using lateral interactions. \n\nA Simple Geometric Model \n\nIn this model, each input message L is described by a point in a low-dimensional vector space, and the output message M is one of a number of discrete states. For definiteness, we will take the L space to be two-dimensional (the extension to higher dimensionality is straightforward). The L \u2192 M transformation consists of two steps. (i) A noise process alters L to a message L' lying within a neighborhood of radius \u03bd centered on L. (ii) The altered message L' is mapped deterministically onto one of the output messages M. \n\nA given L' \u2192 M mapping corresponds to a partitioning of the L space into regions labeled by the output states M. (We do not exclude a priori the possibility that multiple disjoint regions may be labeled by the same M.) Let A denote the total area of the L state space. For each M, let A(M) denote the area of L space that is labeled by M. Let s(M) denote the total border length that the region(s) labeled M share with regions of unlike M-label. A point L lying within distance \u03bd of a border can be mapped onto either M-value (because of the noise process L \u2192 L'). Call this a \"borderline\" L. A point L that is more than a distance \u03bd from every border can only be mapped onto the M-value of the region containing it. \n\nSuppose \u03bd is sufficiently small that (for the partitionings of interest) the area occupied by borderline L states is small compared to the total area of the L space. Consider first the case in which P(L) is uniform over L. Then the information rate R (using Eqn. 5) is given approximately (through terms of order \u03bd) by \n\nR = - \u03a3_M [A(M)/A] log [A(M)/A] - (\u03b3\u03bd/A) \u03a3_M s(M).   (6) \n\nTo see this, note that P(M) = A(M)/A and that P(M|L) log P(M|L) is zero except for borderline L (since 0 log 0 = 1 log 1 = 0). Here \u03b3 is a positive number whose value depends upon the details of the noise process, which determines P(M|L) for borderline L as a function of distance from the border. \n\nFor small \u03bd (low noise) the first term (I_M) on the RHS of Eqn. 6 dominates. It is maximized when the A(M) [and hence the P(M)] values are equal for all M. The second term (with its minus sign), which equals -I_{M|L}, is maximized when the sum of the border lengths of all M regions is minimized. This corresponds to \"sharpening\" the P(M|L) distribution in our earlier, more general, discussion. This suggests that the infomax solution is obtained by partitioning the L space into M-regions (one for each M value) that are of substantially equal area, with each M-region tending to have near-minimum border length. \n\nAlthough this simple analysis applies to the low-noise case, it is plausible that even when \u03bd is comparable to the spatial scale of the M regions, infomax will favor making the M regions have approximately the same extent in all directions (rather than be elongated), in order to \"sharpen\" P(M|L) and reduce the probability of the noise process mapping L onto many different M states. \n\nWhat if P(L) is nonuniform? Then the same result (equal areas, minimum border) is obtained except that both the area and border-length elements must now be weighted by the local value of P(L). Therefore the infomax principle tends to produce maps in which greater representation in the output space is given to regions of the input signal space that are activated more frequently. 
\n\nTo see how lateral interactions within the M layer can affect these results, let us suppose that the L \u2192 M mapping has three, not two, process steps: L \u2192 L' \u2192 M' \u2192 M, where the first two steps are as above, and the third step changes the output M' into any of a number of states M (which by definition comprise the \"M-neighborhood\" of M'). We consider the case in which this M-neighborhood relation is symmetric. \n\nThis type of \"lateral interaction\" between M states causes the infomax principle to favor solutions for which M regions sharing a border in L space are M-neighbors in the sense defined. For a simple example in which each state M has n M-neighbors (including itself), and each M-neighbor has an equal chance of being the final state (given M), infomax tends to favor each M-neighborhood having similar extent in all directions (in L space). \n\nRelation Between the Geometric Model and Network Properties \n\nThe previous section dealt with certain classes of transformations from one message space to another, and made no specific reference to the implementation of these transformations by an interconnected network of processor cells. Here we show how some of the features discussed in the previous section are related to network properties. \n\nFor simplicity suppose that we have a two-dimensional layer of uniformly distributed cells, and that the signal activity of each cell at any given time is either 1 (active) or 0 (quiet). We need to specify the ensemble of input patterns. Let us first consider a simple case in which each pattern consists of a disk of activity of fixed radius, but arbitrary center position, against a quiet background. In this case the pattern is fully defined by specifying the coordinates of the disk center. In a two-dimensional L state space (previous section), each pattern would be represented by a point having those coordinates. 
\n\nNow suppose that each input pattern consists not of a sharply defined disk of activity, but of a \"fuzzy\" disk whose boundary (and center position) are not sharply defined. [Such a pattern could be generated by choosing (from a specified distribution) a position x_c as the nominal disk center, then setting the activity of the cell at position x to 1 with a probability that decreases with distance |x - x_c|.] Any such pattern can be described by giving the coordinates of the \"center of activity\" along with many other values describing (for example) various moments of the activity pattern relative to the center. \n\nFor the noise process L \u2192 L' we suppose that the activity of an L cell can be \"misread\" (by the cells of the M layer) with some probability. This set of distorted activity values is the \"message\" L'. We then suppose that the set of output activities M is a deterministic function of L'. \n\nWe have constructed a situation in which (for an appropriate choice of noise level) two of the dimensions of the L state space -- namely, those defined by the disk center coordinates -- have large variance compared to the variance induced by the noise process, while the other dimensions have variance comparable to that induced by noise. In other words, the center position of a pattern is changed only a small amount by the noise process (compared to the typical difference between the center positions of two patterns), whereas the values of the other attributes of an input pattern differ as much from their noise-altered values as two typical input patterns differ from each other. (Those attributes are \"lost in the noise.
\") \n\nSince the distance between L states in our geometric model (previous section) corresponds to the likelihood of one L state being changed into the other by the noise process, we can heuristically regard the L state space (for the present example) as a \"slab\" that is elongated in two dimensions and very thin in all other dimensions. (In general this space could have a much more complicated topology, and the noise process, which we here treat as defining a simple metric structure on the L state space, need not do so. These complications are beyond the scope of the present discussion.) \n\nThis example, while simple, illustrates a feature that is key to understanding the operation of the infomax principle: The character of the ensemble statistics and of the noise process jointly determine which attributes of the input pattern are statistically most significant; that is, have largest variance relative to the variance induced by noise. We shall see that the infomax principle selects a number of these most significant attributes to be encoded by the L \u2192 M transformation. \n\nWe turn now to a description of the output state space M. We shall assume that this space is also of low dimensionality. For example, each M pattern may also be a disk of activity having a center defined within some tolerance. A discrete set of discriminable center-coordinate values can then be used as the M-region \"labels\" in our geometric model. \n\nRestricting the form of the output activity in this particular way restricts us to considering positional encodings L \u2192 M, rather than encodings that make use of the shape of the output pattern, its detailed activity values, etc. However, this restriction on the form of the output does not determine which features of the input patterns are to be encoded, nor whether or not a topographic (neighbor-preserving) mapping is to be used. 
These properties will be seen to emerge from the operation of the infomax principle. \n\nIn the previous section we saw that the infomax principle will tend to lead to a partitioning of the L space into M regions having equal areas [if P(L) is uniform in the coordinates of the L disk center] and minimum border length. For the present case this means that the M regions will tend to \"tile\" the two long dimensions of the L state space \"slab,\" and that a single M value will represent all points in L space that differ only in their low-variance coordinates. If P(L) is nonuniform, then the area of the M region at L will tend to be inversely proportional to P(L). Furthermore, if there are local lateral connections between M cells, then (depending upon the particular form of such interaction) M states corresponding to nearby localized regions of layer-M activity can be M-neighbors in the sense of the previous section. In this case the mapping from the two high-variance coordinates of L space to M space will tend to be topographic. \n\nExamples: Orientation Selectivity and Temporal Feature Maps \n\nThe simple example in the previous section illustrates how infomax can lead to topographic maps, and to map distortions [which provide greater M-space representation for regions of L having large P(L)]. Let us now consider a case in which information about input features is positionally encoded in the output layer as a result of the infomax principle. \n\nConsider a model case in which an ensemble of patterns is presented to the input layer L. Each pattern consists of a rectangular bar of activity (of fixed length and width) against a quiet background. The bar's center position and orientation are chosen for each pattern from uniform distributions over some spatial interval for the position, and over all orientation angles (i.e., from 0\u00b0 to 180\u00b0). 
The bar need not be sharply defined, but can be \"fuzzy\" in the sense described above. We assume, however, that all properties that distinguish different patterns of the ensemble -- except for center position and orientation -- are \"lost in the noise\" in the sense we discussed. \n\nTo simplify the representation of the solution, we further assume that only one coordinate is needed to describe the center position of the bar for the given ensemble. For example, the ensemble could consist of bar patterns all of which have the same y coordinate of center position, but differ in their x coordinate and in orientation \u03b8. \n\nWe can then represent each input state by a point in a rectangle (the L state space defined in a previous section) whose abscissa is the center-position coordinate x and whose ordinate is the angle \u03b8. The horizontal sides of this rectangle are identified with each other, since orientations of 0\u00b0 and 180\u00b0 are identical. (The interior of the rectangle can thus be thought of as the surface of a horizontal cylinder.) \n\nThe number N_x of different x positions that are discriminable is given by the range of x values in the input ensemble divided by the tolerance with which x can be measured (given the noise process L \u2192 L'); similarly for N_\u03b8. The relative lengths \u0394x and \u0394\u03b8 of the sides of the L state space rectangle are given by \u0394x/\u0394\u03b8 = N_x/N_\u03b8. We discuss below the case in which N_x >> N_\u03b8; if N_\u03b8 were >> N_x, the roles of x and \u03b8 in the resulting mappings would be reversed. \n\nThere is one complicating feature that should be noted, although in the interest of clarity we will not include it in the present analysis. 
Two horizontal bar patterns that are displaced by a horizontal distance that is small compared with the bar length are more likely to be rendered indiscriminable by the noise process than are two vertical bar patterns that are displaced by the same horizontal distance (which may be large compared with the bar's width). The Hamming distance, or number of binary activity values that need to be altered to change one such pattern into the other, is greater in the latter case than in the former. Therefore, the distance in L state space between the two states should be greater in the latter case. This leads to a \"warped\" rather than simple rectangular state space. We ignore this effect here, but it must be taken into account in a fuller treatment of the emergence of orientation selectivity. \n\nFigure 1. Orientation Selectivity in a Simple Model: As the input domain size (see text) is reduced [from (a) upper left, to (b) upper right, to (c) lower left figure], infomax favors the emergence of an orientation-selective L \u2192 M mapping. (d) Lower right figure shows a solution obtained by applying Kohonen's relaxation algorithm with 50 M-points (shown as dots) to this mapping problem. \n\nConsider now an L \u2192 M transformation that consists of the three-step process (discussed above): (i) noise-induced L \u2192 L'; (ii) deterministic L' \u2192 M'; (iii) lateral-interaction-induced M' \u2192 M. Step (ii) maps the two-dimensional L state space of points (x, \u03b8) onto a one-dimensional M state space. For the present discussion, we consider L' \u2192 M' maps satisfying the following Ansatz: Points corresponding to the M states are spaced uniformly, and in topographic order, along a helical line in L state space (which we recall is represented by the surface of a horizontal cylinder). The pitch of the helix (or the slope d\u03b8/dx) remains to be determined by the infomax principle. 
\nEach M-neighborhood of M states (previous section) then corresponds to an interval on such a helix. A state L' is mapped onto a state in a particular M-neighborhood if L' is closer (in L space) to the corresponding interval of the helix than to any other portion of the helix. We call this set of L states (for an M-neighborhood centered on M) the \"input domain\" of M. It has rectangular shape and lies on the cylindrical surface of the L space. \n\nWe have seen (previous sections) that infomax tends to produce maps having (i) equal M-region areas, (ii) topographic organization, and (iii) an input domain (for each M-neighborhood) that has similar extent in all directions (in L space). Our choice of Ansatz enforces (i) and (ii) explicitly. Criterion (iii) is satisfied by choosing d\u03b8/dx such that the input domain is square (for a given M-neighborhood size). \n\nFigure 1a (having d\u03b8/dx = 0) shows a map in which the output M encodes only information about bar center position x, and is independent of bar orientation \u03b8. The size of the M-neighborhood is relatively large in this case. The input domain of the state M denoted by the 'x' is shown enclosed by dotted lines. (The particular \u03b8 value at which we chose to draw the M line in Fig. 1a is irrelevant.) For this M-neighborhood size, the length of the border of the input domain is as small as it can be. \n\nAs the M-neighborhood size is reduced, the dotted lines move closer together. A vertically oblong input domain (which would result if we kept d\u03b8/dx = 0) would not satisfy the infomax criterion. The helix for which the input domain is square (for this smaller choice of M-neighborhood size) is shown in Fig. 1b. The M states for this solution encode information about bar orientation as well as center position. 
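Under this Ansatz the square-domain condition fixes the pitch in closed form. The sketch below is my own reconstruction of that geometry, with hypothetical dimensions for the L space and a hypothetical number of M states; it illustrates how the pitch rises from zero as the M-neighborhood shrinks, in the manner of Figs. 1a-1c.

```python
import math

def square_domain_pitch(n_nbhd, n_m=50, dx=1.0, dtheta=0.2):
    """Helix pitch s = d(theta)/dx that makes the input domain square.
    The helix traverses the x-range once, so its length is dx*sqrt(1+s^2);
    an M-neighborhood of n_nbhd of the n_m uniformly spaced M states spans
    the fraction n_nbhd/n_m of that length, while adjacent turns of the
    helix lie a perpendicular distance dtheta/sqrt(1+s^2) apart.  Equating
    the two extents gives 1 + s^2 = n_m*dtheta/(n_nbhd*dx).  If the
    right-hand side is <= 1, the neighborhood is too large for winding to
    help, and s = 0 (the position-only map of Fig. 1a)."""
    ratio = n_m * dtheta / (n_nbhd * dx)
    return math.sqrt(ratio - 1.0) if ratio > 1.0 else 0.0
```

For the dimensions assumed here, a neighborhood of 20 states gives pitch 0 (Fig. 1a's regime), while shrinking it to 5 and then 2 states gives successively steeper, more orientation-selective helices.
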
If each M state corresponds to a localized output activity pattern centered at some position in a one-dimensional array of M cells, then this solution corresponds to orientation-selective cells organized in \"orientation columns\" (really \"orientation intervals\" in this one-dimensional model). A \"labeling\" of the linear array of cells according to whether their orientation preferences lie between 0 and 60, 60 and 120, or 120 and 180 degrees is indicated by the bold, light, and dotted line segments beneath the rectangle in Fig. 1b (and 1c). \n\nAs the M-neighborhood size is decreased still further, the mapping shown in Fig. 1c becomes favored over that of either Fig. 1a or 1b. The \"orientation columns\" shown in the lower portion of Fig. 1c are narrower than in Fig. 1b. \n\nA more detailed analysis of the information rate function for various mappings confirms the main features we have here obtained by a simple geometric argument. \n\nThe same type of analysis can be applied to different types of input pattern ensembles. To give just one other example, consider a network that receives an ensemble of simple patterns of acoustic input. Each such pattern consists of a tone of some frequency that is sensed by two \"ears\" with some interaural time delay. Suppose that the initial network layers organize the information from each ear (separately) into tonotopic maps, and that (by means of connections having a range of different time delays) the signals received by both ears over some time interval appear as patterns of cell activity at some intermediate layer L. We can then apply the infomax principle to the signal transformation from layer L to the next layer M. The L state space can (as before) be represented as a rectangle, whose axes are now frequency and interaural delay (rather than spatial position and bar orientation). 
Apart from certain differences (the density of L states may be nonuniform, and states at the top and bottom of the rectangle are no longer identical), the infomax analysis can be carried out as it was for the simplified case of orientation selectivity. \n\nLocal Algorithms \n\nThe information rate (Eqn. 1), which the infomax principle states is to be maximized subject to constraints (and possibly as part of an optimization function containing other cost terms not discussed here), has a very complicated mathematical form. How might this optimization process, or an approximation to it, be implemented by a network of cells and connections each of which has limited computational power? The geometric form in which we have cast the infomax principle for some very simple model cases suggests how this might be accomplished. \n\nAn algorithm due to Kohonen [8] demonstrates how topographic maps can emerge as a result of lateral interactions within the output layer. I applied this algorithm to a one-dimensional M layer and a two-dimensional L layer, using a Euclidean metric and imposing periodic boundary conditions on the short dimension of the L layer. A resulting map is shown in Fig. 1d. This map is very similar to those of Figs. 1b and 1c, except for one reversal of direction. The reversal is not surprising, since the algorithm involves only local moves (of the M-points) while the infomax principle calls for a globally optimal solution. \n\nMore generally, Kohonen's algorithm tends empirically [8] to produce maps having the property that if one constructs the Voronoi diagram corresponding to the positions of the M-points (that is, assigns each point L to an M region based on which M-point L is closest to), one obtains a set of M regions that tend to have areas inversely proportional to P(L), and neighborhoods (corresponding to our input domains) that tend to have similar extent in all directions rather than being elongated. 
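The experiment just described can be approximated in a few lines of code. The following is a minimal sketch of Kohonen's algorithm in this setting (my own implementation, with arbitrary learning-rate and neighborhood schedules, not the parameters used for Fig. 1d): a chain of 50 M-points adapting to inputs drawn uniformly from a two-dimensional L slab that is periodic in its short (orientation) dimension.

```python
import math
import random

THETA = 0.2  # extent of the short, periodic dimension of the L slab (hypothetical units)

def wrap_delta(a, b, period=THETA):
    """Signed shortest displacement from b to a in a periodic coordinate."""
    d = (a - b) % period
    return d - period if d > period / 2 else d

def dist2(p, w):
    """Squared Euclidean distance on the cylinder (periodic in theta)."""
    return (p[0] - w[0]) ** 2 + wrap_delta(p[1], w[1]) ** 2

def train_chain(n_units=50, n_steps=4000, seed=0):
    rng = random.Random(seed)
    # M-points start at random positions in the slab [0, 1] x [0, THETA]
    w = [[rng.uniform(0, 1), rng.uniform(0, THETA)] for _ in range(n_units)]
    for t in range(n_steps):
        x = (rng.uniform(0, 1), rng.uniform(0, THETA))  # input pattern (position, orientation)
        j = min(range(n_units), key=lambda i: dist2(x, w[i]))  # best-matching M-point
        frac = t / n_steps
        eta = 0.3 * (1 - frac) + 0.01    # decaying learning rate
        sigma = 6.0 * (1 - frac) + 0.5   # decaying neighborhood width (in chain index)
        for i in range(n_units):
            # move the winner and its chain neighbors toward the input
            h = math.exp(-((i - j) ** 2) / (2 * sigma ** 2))
            w[i][0] += eta * h * (x[0] - w[i][0])
            w[i][1] = (w[i][1] + eta * h * wrap_delta(x[1], w[i][1])) % THETA
    return w
```

After training, consecutive M-points lie close together in L space (a topographic chain); plotting them should qualitatively resemble helix-like maps such as Fig. 1d, possibly with the local reversals of direction noted in the text.
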
\n\nThe Kohonen algorithm makes no reference to noise, to information content, or \neven to an optimization principle. Nevertheless, it appears to implement, at least in a \nqualitative way, the geometric conditions that infomax imposes in some simple cases. \nThis suggests that local algorithms along similar lines may be capable of implementing \nthe infomax principle in more general situations. \n\nOur geometric formulation of the infomax principle also suggests a connection \nwith an algorithm proposed by von der Malsburg and Willshaw9 to generate topographic \nmaps. In their \"tea trade\" model, neighborhood relationships are postulated within the \nsource and the target spaces, and the algorithm's operation leads to the establishment \nof a neighborhood-preserving mapping from source to target space. Such neighborhood \nrelationships arise naturally in our analysis when the infomax principle is applied to our \nthree-step L - L' - M' - M \ninduces a \n\nThe noise process \n\ntransformation. \n\n\f494 \n\nneighborhood relation on the L space, and lateral connections in the M cell layer can \ninduce a neighborhood relation on the M space. \n\nMore recently, Durbin and Willshaw lO have devised an approach to solving certain \ngeometric optimization problems (such as the traveling salesman problem) by a gradient \ndescent method bearing some similarity to Kohonen's algorithm. \n\nThere is a complementary relationship between the infomax principle and a local \nalgorithm that may be found to implement it. On the one hand, the principle may \nexplain what the algorithm is \"for\" -- that is, how the algorithm may contribute to the \ngeneration of a useful perceptual system. This in turn can shed light on the system-level \nrole of lateral connections and synaptic modification mechanisms in biological networks. 
\nOn the other hand, the existence of such a local algorithm is important for demonstrating that a network of relatively simple processors -- biological or synthetic -- can in fact find global near-maxima of the Shannon information rate. \n\nA Possible Connection Between Infomax and a Thermodynamic Principle \n\nThe principle of \"maximum preservation of information\" can be viewed equivalently as a principle of \"minimum dissipation of information.\" When the principle is satisfied, the loss of information from layer to layer is minimized, and the flow of information is in this sense as \"nearly reversible\" as the constraints allow. There is a resemblance between this principle and the principle of \"minimum entropy production\" [11] in nonequilibrium thermodynamics. It has been suggested by Prigogine and others that the latter principle is important for understanding self-organization in complex systems. There is also a resemblance, at the algorithmic level, between a Hebb-type modification rule and the autocatalytic processes [12] considered in certain models of evolution and natural selection. This raises the possibility that the connection I have drawn between synaptic modification rules and an information-theoretic optimization principle may be an example of a more general relationship that is important for the emergence of complex and apparently \"goal-oriented\" structures and behaviors from relatively simple local interactions, in both neural and non-neural systems. \n\nReferences \n\n[1] R. Linsker, Proc. Natl. Acad. Sci. USA 83, 7508, 8390, 8779 (1986). \n[2] D. H. Hubel and T. N. Wiesel, Proc. Roy. Soc. London B198, 1 (1977). \n[3] D. O. Hebb, The Organization of Behavior (Wiley, N.Y., 1949). \n[4] R. Linsker, in: R. Cotterill (ed.), Computer Simulation in Brain Science (Copenhagen, 20-22 August 1986; Cambridge Univ. Press, in press), p. 416. \n[5] R. Linsker, Computer (March 1988, in press). \n[6] E. Oja, J. Math. Biol. 15, 267 (1982). \n[7] C. E. Shannon, Bell Syst. Tech. J. 27, 623 (1948). \n[8] T. Kohonen, Self-Organization and Associative Memory (Springer-Verlag, N.Y., 1984). \n[9] C. von der Malsburg and D. J. Willshaw, Proc. Natl. Acad. Sci. USA 74, 5176 (1977). \n[10] R. Durbin and D. J. Willshaw, Nature 326, 689 (1987). \n[11] P. Glansdorff and I. Prigogine, Thermodynamic Theory of Structure, Stability, and Fluctuations (Wiley-Interscience, N.Y., 1971). \n[12] M. Eigen and P. Schuster, Die Naturwissenschaften 64, 541 (1977). \n", "award": [], "sourceid": 5, "authors": [{"given_name": "Ralph", "family_name": "Linsker", "institution": null}]}