{"title": "Scaling Laws and Local Minima in Hebbian ICA", "book": "Advances in Neural Information Processing Systems", "page_first": 495, "page_last": 501, "abstract": null, "full_text": "Scaling laws and local minima in Hebbian ICA\n\nMagnus Rattray and Gleb Basalyga\n\nDepartment of Computer Science, University of Manchester,\n\nManchester M13 9PL, UK.\n\nmagnus@cs.man.ac.uk, basalygg@cs.man.ac.uk\n\nAbstract\n\nWe study the dynamics of a Hebbian ICA algorithm extracting a sin-\ngle non-Gaussian component from a high-dimensional Gaussian back-\nground. For both on-line and batch learning we \ufb01nd that a surprisingly\nlarge number of examples are required to avoid trapping in a sub-optimal\nstate close to the initial conditions. To extract a skewed signal at least\n\u0002\u0001\u0004\u0003\u0006\u0005\b\u0007 examples are required for \u0003\n-dimensional data and \u0002\u0001\t\u0003\u000b\n\f\u0007 exam-\nples are required to extract a symmetrical signal with non-zero kurtosis.\n\n1 Introduction\n\nIndependent component analysis (ICA) is a statistical modelling technique which has at-\ntracted a signi\ufb01cant amount of research interest in recent years (for a review, see Hyv\u00a8arinen,\n1999). The goal of ICA is to \ufb01nd a representation of data in terms of a combination of sta-\ntistically independent variables. A number of neural learning algorithms have been applied\nto this problem, as detailed in the aforementioned review.\n\nTheoretical studies of ICA algorithms have mainly focussed on asymptotic stability and\nef\ufb01ciency, using the established results of stochastic approximation theory. However, in\npractice the transient stages of learning will often be more signi\ufb01cant in determining the\nsuccess of an algorithm. In this paper a Hebbian ICA algorithm is analysed in both on-line\nand batch mode, highlighting the critical importance of the transient dynamics. 
We find that a surprisingly large number of training examples are required in order to avoid trapping in a sub-optimal state close to the initial conditions. To detect a skewed signal at least O(N^2) examples are required for N-dimensional data, while O(N^3) examples are required for a symmetric signal with non-zero kurtosis. In addition, for on-line learning we show that the maximal initial learning rate which allows successful learning is unusually low, being O(N^{-3/2}) for a skewed signal and O(N^{-2}) for a symmetric signal.

In order to obtain a tractable model, we consider the limit of high-dimensional data and study an idealised data set in which a single non-Gaussian source is mixed into a large number of Gaussian sources. Recently, one of us considered a more general model in which an arbitrary, but relatively small, number of non-Gaussian sources were mixed into a high-dimensional Gaussian background (Rattray, 2002). In that work a solution to the dynamics of the on-line algorithm was obtained in closed form for O(N) learning iterations, and a simple solution to the asymptotic dynamics under the optimal learning rate decay was also obtained. However, it was noted there that modelling the dynamics on an O(N) timescale is not always appropriate, because the algorithm typically requires much longer in order to escape from a class of metastable states close to the initial conditions.
In order to elucidate this effect in greater detail we focus here on the simplest case of a single non-Gaussian source and we limit our analysis to the dynamics close to the initial conditions.

In recent years a number of on-line learning algorithms, including back-propagation and Sanger's PCA algorithm, have been studied using techniques from statistical mechanics (see, for example, Biehl (1994); Biehl and Schwarze (1995); Saad and Solla (1995) and contributions in Saad (1998)). These analyses exploited the 'self-averaging' property of certain macroscopic variables in order to obtain ordinary differential equations describing the deterministic evolution of these quantities over time in the large N limit. In the present case the appropriate macroscopic quantity does not self-average and fluctuations have to be considered even in the limit. In this case it is more natural to model the on-line learning dynamics as a diffusion process (see, for example, Gardiner, 1985).

2 Data Model

In order to apply the Hebbian ICA algorithm we must first sphere the data, i.e. linearly transform the data so that it has zero mean and an identity covariance matrix. This can be achieved by standard transformations in a batch setting, or for on-line learning an adaptive sphering algorithm, such as the one introduced by Cardoso and Laheld (1996), could be used. To simplify the analysis it is assumed here that the data has already been sphered. Without loss of generality it can also be assumed that the sources each have unit variance.

Each data point x is generated from a noiseless linear mixture of sources, which are decomposed into a single non-Gaussian source s and N-1 uncorrelated Gaussian components n ~ N(0, I_{N-1}). We also decompose the mixing matrix A into a column vector a and an N x (N-1) rectangular matrix A_G associated with the non-Gaussian and Gaussian components respectively,

x = A [s, n]^T = a s + A_G n.   (1)

We will consider both the on-line case, in which a new IID example x_t is presented to the algorithm at each time t and then discarded, and also the batch case, in which a finite set of p examples is available to the algorithm. To conform with the model assumptions the mixing matrix A must be unitary, which leads to the following constraints,

A_G A_G^T = I_N - a a^T,  A_G^T a = 0,   (2)
a^T a = 1.   (3)

The goal of ICA is to find a vector w such that the projection y = w^T x recovers the source, i.e. y -> ±s. Defining the overlap R = w^T a we obtain,

y = w^T x = R s + sqrt(1 - R^2) z,  where z ~ N(0, 1),   (4)

where we have made use of the constraint in eqn. (2). This assumes zero correlation between w and the Gaussian components, which is true for on-line learning but is only strictly true for the first iteration of batch learning (see section 4). In the algorithm described below we impose a normalisation constraint on w such that ||w|| = 1. In this case we see that the goal is to find w such that R -> ±1.

3 On-line learning

A simple Hebbian (or anti-Hebbian) learning rule was studied by Hyvärinen and Oja (1998), who showed it to have a remarkably simple stability condition. We will consider the deflationary form in which a single source is learned at one time. The algorithm is closely related to Projection Pursuit algorithms, which seek interesting projections in high-dimensional data. A typical criterion for an interesting projection is to find one which is maximally non-Gaussian in some sense. Maximising some such measure (simple examples would be skewness or kurtosis) leads to the following simple algorithm (see Hyvärinen and Oja, 1998, for details). The change in w at time t is given by,

Delta w_t = sigma eta x_t phi(y_t),   (5)

followed by normalisation such that ||w|| = 1. Here, eta is the learning rate and phi is some non-linear function which we will take to be at least three times differentiable. An even non-linearity, e.g. phi(y) = y^2, is appropriate for detecting asymmetric signals, while a more common choice is an odd function, e.g. phi(y) = y^3 or phi(y) = tanh(y), which can be used to detect symmetric non-Gaussian signals.
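As a concrete (and purely illustrative) sketch of this setup, the following Python snippet generates sphered data from the model of section 2 and runs the on-line rule of eqn. (5) with an even non-linearity. The exponential source, dimension, iteration count and learning rate are our own assumptions, not values taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 50                                             # data dimension (illustrative)
A = np.linalg.qr(rng.standard_normal((N, N)))[0]   # random unitary mixing matrix
a = A[:, 0]                                        # column mixing the non-Gaussian source

def sample_x(batch):
    """Sphered data: one skewed source plus N-1 Gaussian sources, eqn. (1)."""
    s = rng.exponential(1.0, batch) - 1.0          # zero-mean, unit-variance, skewed
    n = rng.standard_normal((batch, N - 1))
    return s[:, None] * a + n @ A[:, 1:].T

w = rng.standard_normal(N)
w /= np.linalg.norm(w)                             # normalisation constraint ||w|| = 1
eta = N ** -1.5                                    # O(N^{-3/2}) rate for even phi
phi = lambda y: y ** 2                             # even non-linearity, sigma = 1

for t in range(50_000):
    x = sample_x(1)[0]
    y = float(w @ x)
    w += eta * x * phi(y)                          # Hebbian update, eqn. (5)
    w /= np.linalg.norm(w)                         # renormalise

R = float(w @ a)                                   # overlap with the mixing direction
print(abs(R))
```

With a skewed source the overlap typically grows towards |R| = 1 on the O(N^2) timescale analysed below, although individual runs are stochastic.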
For an odd non-linearity a sign sigma in {-1, +1} has to be chosen in order to ensure stability of the correct solution, as described by Hyvärinen and Oja (1998), either adaptively or using a priori knowledge. We set sigma = 1 in the case of an even non-linearity. Remarkably, the same non-linearity can be used to separate both sub- and super-Gaussian signals, in contrast to maximum likelihood methods for which this is typically not the case.

We can write the above algorithm as,

w_{t+1} = (w_t + sigma eta x_t phi(y_t)) / ||w_t + sigma eta x_t phi(y_t)||.   (6)

For large N (two different scalings of eta will be considered below) we can expand out to get a weight decay normalisation,

w_{t+1} = w_t + sigma eta phi(y_t) (x_t - y_t w_t) - (eta^2 N / 2) phi^2(y_t) w_t + ...,   (7)

where we have used ||x_t||^2 = N + O(N^{1/2}). Taking the dot-product with a gives the following update increment for the overlap R,

Delta R_t = sigma eta phi(y_t) (s_t - y_t R_t) - (eta^2 N / 2) phi^2(y_t) R_t + ...,   (8)

where we used the constraint in eqn. (3) to set a^T a = 1. Below we calculate the mean and variance of Delta R for two different scalings of the learning rate. Because the conditional distribution for y given s depends only on R (setting ||w|| = 1 in eqn. (4)), these expressions will depend only on R and statistics of the non-Gaussian source distribution.

If the entries in a and w are initially of similar order then one would expect R = O(N^{-1/2}). This is the typical case if we consider a random and uncorrelated choice for A and the initial entries in w. Larger initial values of R could only be obtained with some prior knowledge of the mixing matrix, which we will not assume. In the following discussion we therefore set x = R N^{1/2}, where x is assumed to be an O(1) quantity. The discussion below is thus restricted to describing the dynamics close to the initial conditions. For an account of the transient dynamics far from the initial conditions and of the asymptotic dynamics close to an optimal solution, see Rattray (2002).

3.1 Dynamics close to the initial conditions

Figure 1: Close to the initial conditions (where R = O(N^{-1/2})) the learning dynamics is equivalent to diffusion in a polynomial potential Phi(x). For asymmetrical source distributions we can use an even non-linearity, in which case the potential is cubic, as shown on the left. For symmetrical source distributions with non-zero kurtosis we should use an odd non-linearity, in which case the potential is quartic, as shown on the right. The dynamics is initially confined in a metastable state near x = 0, with a potential barrier Delta Phi to overcome.

3.1.1 phi even, eta = O(N^{-3/2})

If the signal is asymmetrical then an even non-linearity can be used, for example phi(y) = y^2 is a common choice. In this case the appropriate (i.e. maximal) scaling for the learning rate is O(N^{-3/2}) and we set eta = nu N^{-3/2}, where nu is an O(1) scaled learning rate parameter. In this case we find that the mean and variance of the change in x at each iteration are given by (to leading order in N^{-1/2}),

E[Delta x]   = ( (nu/2) kappa_3 <phi''> x^2 - (nu^2/2) <phi^2> x ) N^{-2},   (9)
Var[Delta x] = nu^2 <phi^2> N^{-2},   (10)

where kappa_3 is the third cumulant of the source distribution (the third central moment), which measures skewness, and angled brackets denote averages over z ~ N(0, 1), e.g. <phi^2> = E[phi^2(z)]. We also find that higher moments of the change in x are of smaller order in N^{-1/2}. In this case the system can be described by a Fokker-Planck equation for large N (see, for example, Gardiner, 1985) with a characteristic timescale of O(N^2). The system is locally equivalent to a diffusion in the following cubic potential,

Phi(x) = (nu^2/4) <phi^2> x^2 - (nu/6) kappa_3 <phi''> x^3,   (11)

with a diffusion coefficient D = (nu^2/2) <phi^2> which is independent of x. The shape of this potential is shown on the left of fig. 1. A potential barrier of Delta Phi must be overcome to escape a metastable state close to the initial conditions.

3.1.2 phi odd, eta = O(N^{-2})

If the signal is symmetrical, or only weakly asymmetrical, it will be necessary to use an odd non-linearity, for example phi(y) = y^3 or phi(y) = tanh(y) are popular choices. In this case a lower learning rate is required in order to achieve successful separation. The appropriate scaling for the learning rate is O(N^{-2}) and we set eta = nu N^{-2}, where again nu is an O(1) scaled learning rate parameter. In this case we find that the mean and variance of the change in x at each iteration are given by,

E[Delta x]   = ( (sigma nu/6) kappa_4 <phi'''> x^3 - (nu^2/2) <phi^2> x ) N^{-3},   (12)
Var[Delta x] = nu^2 <phi^2> N^{-3},   (13)

where kappa_4 is the fourth cumulant of the source distribution (measuring kurtosis) and brackets again denote averages over z ~ N(0, 1). Again the system can be described by a Fokker-Planck equation for large N, but in this case the timescale for learning is O(N^3), an order of N slower than in the asymmetrical case.
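The leading-order drift term in eqn. (9) can be checked by simple Monte Carlo. The sketch below is our own illustration, using an exponential source (so kappa_3 = 2) and phi(y) = y^2 (so <phi''> = 2) as assumed examples; it estimates the per-step drift E[phi(y)(s - yR)] at small R and compares it with (kappa_3/2)<phi''>R^2:

```python
import numpy as np

rng = np.random.default_rng(1)
M = 4_000_000                            # Monte Carlo sample size (illustrative)
R = 0.1                                  # small overlap, near the initial conditions

s = rng.exponential(1.0, M) - 1.0        # zero-mean, unit-variance source, kappa_3 = 2
z = rng.standard_normal(M)
y = R * s + np.sqrt(1.0 - R ** 2) * z    # conditional law of y given s, eqn. (4)

drift = np.mean((y ** 2) * (s - y * R))  # E[phi(y)(s - yR)] for phi(y) = y^2

kappa3, phi2dd = 2.0, 2.0                # phi'' = 2 for phi(y) = y^2
prediction = 0.5 * kappa3 * phi2dd * R ** 2   # leading drift term of eqn. (9)
print(drift, prediction)
```

The O(R) contributions cancel by the Gaussian identity E[z phi(z)] = E[phi'(z)], which is why the leading drift is O(R^2) for an even non-linearity.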
In the odd case the system is locally equivalent to diffusion in the following quartic potential,

Phi(x) = (nu^2/4) <phi^2> x^2 - (sigma nu/24) kappa_4 <phi'''> x^4,   (14)

with a diffusion coefficient D = (nu^2/2) <phi^2>. We have assumed sigma kappa_4 <phi'''> > 0, which is a necessary condition for successful learning. In the case of a cubic non-linearity this is also the condition for stability of the optimal fixed point, although in general these two conditions may not be equivalent (Rattray, 2002). The shape of this potential is shown on the right of fig. 1, and again a potential barrier of Delta Phi must be overcome to escape a metastable state close to the initial conditions.

3.1.3 Escape times from a metastable state at R = 0

Near x = 0 the dynamics of x corresponds to an Ornstein-Uhlenbeck process with a Gaussian stationary distribution of fixed unit variance. Thus, if one chooses too large a value of nu initially, the dynamics will become localised close to R = 0 (recall, x = R N^{1/2}). As nu is reduced, the potential barrier confining the dynamics is reduced. The timescale for escape for large N (the mean first passage time) is mainly determined by the effective size of the barrier (see, for example, Gardiner, 1985),

t_escape ~ tau exp(Delta Phi / D),   (15)

where Delta Phi is the potential barrier, D is the diffusion coefficient and tau is the unit of time in the diffusion process (N^2 iterations for even phi, N^3 for odd phi). For the two cases considered above we obtain,

t_escape proportional to N^2 exp( nu^2 <phi^2>^2 / (6 kappa_3^2 <phi''>^2) )  for even phi,
t_escape proportional to N^3 exp( 3 nu <phi^2> / (4 sigma kappa_4 <phi'''>) )  for odd phi.   (16)

The constants of proportionality depend on the shape of the potential and not on N. As the learning rate parameter nu is reduced, so the timescale for escape is also reduced. However, the choice of optimal learning rate is non-trivial and cannot be determined by considering only the leading order terms in R as above, because although a small nu results in a quicker escape from the unstable fixed point near R = 0, it in turn leads to a very slow learning transient after escape. Notice that the escape time is shortest when the cumulants kappa_3 or kappa_4 are large, suggesting that deflationary ICA algorithms will tend to find these signals first.

From the above discussion one can draw two important conclusions. Firstly, the initial learning rate must be at most O(N^{-3/2}) (O(N^{-2}) for an odd non-linearity) in order to avoid trapping close to the initial conditions. Secondly, the number of iterations required to escape the initial transient will be greater than O(N), resulting in an extremely slow initial stage of learning for large N. The most extreme case is for symmetric source distributions with non-zero kurtosis, in which case O(N^3) learning iterations are required.

In fig. 2 we show results of learning with an asymmetric source (top) and a uniform source (bottom) for different scaled learning rates. As the learning rate is increased (left to right) we observe that the dynamics becomes increasingly stochastic, with the potential barrier becoming increasingly significant (potential maxima are shown as dashed lines). For the largest value of the learning rate (nu = 5) the algorithm becomes trapped close to the initial conditions for the whole simulation time.
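The potential maxima of the kind shown as dashed lines in fig. 2, together with the barrier height and Kramers exponent, follow directly from the cubic potential of eqn. (11). A short sketch (our own illustration; the values kappa_3 = 2, <phi^2> = 3 and <phi''> = 2, corresponding to phi(y) = y^2 and a source with skewness 2, are assumptions):

```python
import numpy as np

def cubic_potential_stats(nu, kappa3, phi2, phi2dd):
    """For Phi(x) = (nu^2/4) phi2 x^2 - (nu/6) kappa3 phi2dd x^3 as in eqn. (11):
    return the potential maximum x*, the barrier Phi(x*) and the Kramers
    exponent Phi(x*)/D with diffusion coefficient D = (nu^2/2) phi2."""
    x_star = nu * phi2 / (kappa3 * phi2dd)       # solves Phi'(x*) = 0, x* > 0
    barrier = (nu ** 2 * phi2 / 12.0) * x_star ** 2
    D = 0.5 * nu ** 2 * phi2
    return x_star, barrier, barrier / D

# Scaled learning rates comparable to those used in fig. 2.
for nu in (0.1, 1.0, 5.0):
    x_star, barrier, expo = cubic_potential_stats(nu, 2.0, 3.0, 2.0)
    print(nu, x_star, barrier, np.exp(expo))     # escape time ~ N^2 exp(expo)
```

Both the barrier location and its height grow with nu, which is why larger scaled learning rates confine the dynamics near the initial conditions for longer.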
From the time axis we observe that the learning timescale is O(N^2) for the asymmetrical signal and O(N^3) for the symmetric signal, as predicted by our theory.

Figure 2: 100-dimensional data (N = 100) is produced from a mixture containing a single non-Gaussian source. The panels plot the overlap R against t/N^2 (top row, phi(y) = y^2, kappa_3 != 0) and against t/N^3 (bottom row, phi(y) = y^3, kappa_4 != 0). In the top row we show results for a binary, asymmetrical source with non-zero skewness. In the bottom row we show results for a uniformly distributed source, for which kappa_4 = -6/5. Each row shows learning with the same initial conditions and data but with different scaled learning rates (left to right, nu = 0.1, 1 and 5), where eta = nu N^{-3/2} (top) or eta = nu N^{-2} (bottom). Dashed lines are maxima of the potentials in fig. 1.

4 Batch learning

The batch version of eqn. (5) for sufficiently small learning rates can be written,

Delta w = (sigma eta / p) sum_{mu=1}^{p} phi(y^mu) (x^mu - y^mu w),   (17)

where p is the number of training examples. Here we argue that such an update requires at least the same order of examples as in the on-line case in order to be successful. Less data will result in a low signal-to-noise ratio initially and the possibility of trapping in a sub-optimal fixed point close to the initial conditions.

As in the on-line case we can write the update in terms of R,

Delta R = (sigma eta / p) sum_{mu=1}^{p} phi(y^mu) (s^mu - y^mu R).   (18)

We make the assumption that successful learning is unlikely unless the initial increment in R is in the desired direction. For example, with an asymmetric signal and quadratic non-linearity the initial increment should have the sign of kappa_3 <phi''>, while for a symmetric signal and odd non-linearity we require R Delta R > 0. We have carried out simulations of batch learning which confirm that runs either succeed, in which case R -> ±1, or fail badly with R remaining O(N^{-1/2}). As in the on-line case we observe that a relatively low percentage of the runs in which the initial increment was incorrect result in successful learning, compared to typical performance.

As before, R = O(N^{-1/2}) initially and we can therefore expand the right-hand side of eqn. (18) in orders of R. The first increment Delta R_init (Delta R at the first iteration) is a sum over randomly sampled terms, and the central limit theorem states that for large p the distribution from which Delta R_init is sampled will be Gaussian, with mean and variance given by (to leading order in R),

E[Delta R_init]   = sigma eta ( (kappa_3/2) <phi''> R^2 + (kappa_4/6) <phi'''> R^3 ),   (19)
Var[Delta R_init] = (eta^2 / p) <phi^2>.   (20)

Notice that one of the two terms in the mean vanishes according to the parity of the non-linearity (for odd phi, <phi''> = 0, while for even phi, <phi'''> = 0), which is why we have left both terms in eqn. (19). The algorithm will be likely to fail when the standard deviation of Delta R_init is of the same order as (or greater than) the mean. Since R = O(N^{-1/2}) initially, we see that this is avoided only for p = O(N^2) in the case of an even non-linearity and asymmetric signal, or for p = O(N^3) in the case of an odd non-linearity and a signal with non-zero kurtosis. We expect these results to be necessary but not necessarily sufficient for successful learning, since we have only shown that this order of examples is the minimum required to avoid a low signal-to-noise ratio in the first learning iteration. A complete treatment of the batch learning problem would require a much more sophisticated formulation, such as the mean-field theory of Wong et al. (2000).
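The signal-to-noise argument behind eqns. (19) and (20) can be illustrated numerically. The sketch below is our own toy experiment (not the simulation reported in the paper): for phi(y) = y^2, an exponential source (kappa_3 = 2) and R = N^{-1/2}, it estimates the mean-to-noise ratio of Delta R_init, which scales like sqrt(p)/N and so only becomes O(1) once p reaches O(N^2):

```python
import numpy as np

rng = np.random.default_rng(2)
N = 50                                 # data dimension (illustrative)
R0 = N ** -0.5                         # typical initial overlap

def delta_R_init(p, trials=1000):
    """Draw `trials` first-iteration increments (in units of sigma*eta)
    for batch size p, following eqn. (18) with phi(y) = y^2."""
    s = rng.exponential(1.0, (trials, p)) - 1.0    # skewed source, kappa_3 = 2
    z = rng.standard_normal((trials, p))
    y = R0 * s + np.sqrt(1.0 - R0 ** 2) * z        # eqn. (4), valid at iteration one
    return np.mean((y ** 2) * (s - y * R0), axis=1)

results = {}
for p in (N, N ** 2):
    d = delta_R_init(p)
    results[p] = np.mean(d) / np.std(d)            # signal-to-noise ratio ~ sqrt(p)/N
    print(p, results[p])
```

With p = N the increment is noise-dominated and its sign is essentially random, while with p = N^2 the mean and the fluctuations are of the same order, matching the batch scaling law.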
5 Conclusions and future work

In both the batch and on-line Hebbian ICA algorithm we find that a surprisingly large number of examples is required to avoid a sub-optimal fixed point close to the initial conditions. We expect similar scaling laws to apply in the case of small numbers of non-Gaussian sources. Analysis of the square demixing problem appears to be much more challenging, as in this case there may be no simple macroscopic description of the system for large N. It is therefore unclear at present whether ICA algorithms based on maximum-likelihood and information-theoretic principles (see, for example, Bell and Sejnowski, 1995; Amari et al., 1996; Cardoso and Laheld, 1996), which estimate a square demixing matrix, exhibit similar classes of fixed point to those studied here.

Acknowledgements: This work was supported by an EPSRC award (ref. GR/M48123). We would like to thank Jon Shapiro for useful comments on a preliminary version of this paper.

References

S-I Amari, A Cichocki, and H H Yang. In D S Touretzky, M C Mozer, and M E Hasselmo, editors, Neural Information Processing Systems 8, pages 757-763. MIT Press, Cambridge MA, 1996.
A J Bell and T J Sejnowski. Neural Computation, 7:1129-1159, 1995.
M Biehl. Europhys. Lett., 25:391-396, 1994.
M Biehl and H Schwarze. J. Phys. A, 28:643-656, 1995.
J-F Cardoso and B Laheld. IEEE Trans. on Signal Processing, 44:3017-3030, 1996.
C W Gardiner. Handbook of Stochastic Methods. Springer-Verlag, New York, 1985.
A Hyvärinen. Neural Computing Surveys, 2:94-128, 1999.
A Hyvärinen and E Oja. Signal Processing, 64:301-313, 1998.
M Rattray. Neural Computation, 14, 2002 (in press).
D Saad, editor. On-line Learning in Neural Networks. Cambridge University Press, 1998.
D Saad and S A Solla. Phys. Rev. Lett., 74:4337-4340, 1995.
K Y M Wong, S Li, and P Luo. In S A Solla, T K Leen, and K-R Müller, editors, Neural Information Processing Systems 12. MIT Press, Cambridge MA, 2000.
", "award": [], "sourceid": 2004, "authors": [{"given_name": "Magnus", "family_name": "Rattray", "institution": null}, {"given_name": "Gleb", "family_name": "Basalyga", "institution": null}]}