{"title": "Online Independent Component Analysis with Local Learning Rate Adaptation", "book": "Advances in Neural Information Processing Systems", "page_first": 789, "page_last": 795, "abstract": null, "full_text": "Online Independent Component Analysis \n\nWith Local Learning Rate Adaptation \n\nNicol N. Schraudolph \n\nXavier Giannakopoulos \n\nnic 0) oPt!OPt-i = 0; this signifies a certain dependence on \nan appropriate choice of meta-learning rate p.. Note that there is an efficient O(n) \nalgorithm to calculate HtVt without ever having to compute or store the matrix \nH t itself [20]; we shall elaborate on this technique for the case of independent \ncomponent analysis below. \n\nMeta-level conditioning. The gradient descent in P at the meta-level (2) may \nof course suffer from ill-conditioning just like the descent in W at the main level \n(1); the meta-descent in fact squares the condition number when v is defined as the \nprevious gradient, or an exponential average of past gradients. Special measures to \nimprove conditioning are thus required to make meta-descent work in non-trivial \nsystems. \nMany researchers [11, 12, 13, 14] use the sign function to radically normalize the \np-update. Unfortunately such a nonlinearity does not preserve the zero-mean prop(cid:173)\nerty that characterizes stochastic gradients in equilibrium -- in particular, it will \ntranslate any skew in the equilibrium distribution into a non-zero mean change in \np. This causes convergence to non-optimal step sizes, and renders such methods un(cid:173)\nsuitable for online learning. Notably, Almeida et al. [15] avoid this pitfall by using \na running estimate of the gradient's stochastic variance as their meta-normalizer. 
\nIn addition to modeling the long-term effect of a change in local learning rate, our \niterative gradient trace serves as a highly effective conditioner for the meta-descent: \nthe fixpoint of (6) is given by \n\nVt = \n\n[AHt + (I-A) diag(I/Pi)]-llt \n\n(7) \n\na modified Newton step, which for typical values of A (i. e., close to 1) scales with \n-\nthe inverse of the gradient. Consequently, we can expect the product It . Vt in (2) \nto be a very well-conditioned quantity. Experiments with feedforward multi-layer \nperceptrons [3, 4] have confirmed that SMD does not require explicit meta-level \nnormalization, and converges faster than alternative methods. \n\n3 Application to leA \n\nWe now apply the SMD technique to independent component analysis, using the \nBell-Sejnowski algorithm [5] as our base method. The goal is to find an unmixing \n\n\f792 \n\nN. N. Schraudolph and X Giannakopoulos \n\nup to scaling and permutation -\n\nmatrix Wt which -\nprovides a good linear esti(cid:173)\nmate Vt == WtXt of the independent sources St present in a given mixture signal Xt\u00b7 \nThe mixture is generated linearly according to Xt = Atst , where At is an unknown \n(and unobservable) full rank matrix. \n\nWe include the well-known natural gradient [6] and kurtosis estimation [7] modifi(cid:173)\ncations to the basic algorithm, as well as a matrix Pt of local learning rates. The \nresulting online update for the weight matrix Wt is \n\nwhere the gradient D t is given by \n\nDt \n\n== \n\n8f;;~~t) = ([Vt \u00b1 tanh(Vt)] vt - 1) Wt , \n\n(8) \n\n(9) \n\nwith the sign for each component of the tanh(Vt) term depending on its current \nkurtosis estimate. \nFollowing Pearlmutter [20], we now define the differentiation operator \n\nRVt (g(Wt\u00bb \n\n== \n\n8g(W~r+ rVt) Ir=o \n\nu \n\n(10) \n\nwhich describes the effect on 9 of a perturbation of the weights in the direction of \nVt. 
We can use R_{V_t} to efficiently calculate the Hessian-vector product

  H_t vec(V_t) = vec(R_{V_t}(D_t)) ,    (11)

where "vec" is the operator that concatenates all columns of a matrix into a single column vector. Since R_{V_t} is a linear operator, we have

  R_{V_t}(W_t) = V_t ,    (12)
  R_{V_t}(y_t) = R_{V_t}(W_t x_t) = V_t x_t ,    (13)
  R_{V_t}(tanh(y_t)) = diag(tanh′(y_t)) V_t x_t ,    (14)

and so forth (cf. [20]). Starting from (9), we apply the R_{V_t} operator to obtain

  H_t∗V_t = R_{V_t}[([y_t ± tanh(y_t)] y_tᵀ − I) W_t]
          = ([y_t ± tanh(y_t)] y_tᵀ − I) V_t + R_{V_t}([y_t ± tanh(y_t)] y_tᵀ − I) W_t
          = ([y_t ± tanh(y_t)] y_tᵀ − I) V_t
            + [(I ± diag[tanh′(y_t)]) V_t x_t y_tᵀ + [y_t ± tanh(y_t)](V_t x_t)ᵀ] W_t .    (15)

In conjunction with the matrix versions of our learning rate update (3),

  P_t = P_{t−1} ⊙ exp(−μ D_t ⊙ V_t) ,    (16)

and gradient trace (6),

  V_{t+1} = λ V_t − P_t ⊙ (D_t + λ H_t∗V_t) ,    (17)

this constitutes our SMD-ICA algorithm.

4 Experiment

The algorithm was tested on an artificial problem in which 10 sources follow elliptic trajectories according to

  x_t = (A_base + A_1 sin(ωt) + A_2 cos(ωt)) s_t ,    (18)

where A_base is a normally distributed mixing matrix, as are A_1 and A_2, whose columns represent the axes of the ellipses on which the sources travel. The velocities ω are normally distributed around a mean of one revolution for every 6000 data samples. All sources are supergaussian.

The ICA-SMD algorithm was implemented with only online access to the data, including on-line whitening [21]. Whenever the condition number of the estimated whitening matrix exceeded a large threshold (set to 350 here), updates (16) and (17) were disabled to prevent the algorithm from diverging. Other parameter settings were μ = 0.1, λ = 0.999, and p = 0.2. Runs that did not separate the 10 sources unambiguously were discarded.
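One step of the resulting procedure can be summarized in a minimal NumPy sketch. This is our own reading, not the authors' code: it assumes all sources are supergaussian (so the ± tanh term always takes the '+' branch), fixes the sign conventions of the rate and trace updates as we read (8), (9), and (15)-(17), and omits the on-line whitening and condition-number safeguard used in the experiment.

```python
import numpy as np

def smd_ica_step(W, P, V, x, mu=0.1, lam=0.999):
    """One SMD-ICA update: a sketch of (8), (9), (15)-(17).

    W : unmixing matrix, P : local learning rates, V : gradient trace,
    x : one (whitened) mixture sample. All sources assumed supergaussian.
    """
    n = len(x)
    I = np.eye(n)
    y = W @ x
    t = np.tanh(y)

    # (9) natural gradient, '+' branch of the +/- tanh term
    D = (np.outer(y + t, y) - I) @ W

    # (15) Hessian-vector product H*V via the R-operator
    Vx = V @ x
    dt = 1.0 - t ** 2                       # tanh'(y)
    HV = (np.outer(y + t, y) - I) @ V \
       + (np.outer((1.0 + dt) * Vx, y) + np.outer(y + t, Vx)) @ W

    # (16) local learning rates, (17) gradient trace, (8) weight update
    P = P * np.exp(-mu * D * V)             # elementwise products throughout
    V = lam * V - P * (D + lam * HV)
    W = W - P * D
    return W, P, V
```

In a full run, each incoming whitened sample would be passed through smd_ica_step, with the mini-batching and the divergence guard of the experiment wrapped around this call.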
Figure 1 shows the performance index from [6] (the lower the better, zero being the ideal case) along with the condition number of the mixing matrix, showing that the algorithm is robust to a temporary confusion in the separation. The abscissa represents 3000 data samples, divided into mini-batches of 10 each for efficiency. Figure 2 shows the match between an actual mixing column and its estimate, in the subspace spanned by the elliptic trajectory. The singularity occurring halfway through does not damage performance. Globally, the algorithm remains stable as long as degenerate inputs are handled correctly.

Figure 1: Global view of the quality of separation (curves: error index and cond(A)/20).

Figure 2: Projection of a column from the mixing matrix. Arrows link the exact point with its estimate; the trajectory proceeds from lower right to upper left.

5 Conclusions

Once SMD-ICA has found a separating solution, we find it possible to simultaneously track ten sources that move independently at very different, a priori unknown speeds. To continue tracking over extended periods it is necessary to handle momentary singularities, through online estimation of the number of sources or some other heuristic solution. SMD's adaptation of local learning rates can then facilitate continuous, online use of ICA in rapidly changing environments.

Acknowledgments

This work was supported by the Swiss National Science Foundation under grants number 2000-052678.97/1 and 2100-054093.98.

References

[1] J. Karhunen and P.
Pajunen, \"Blind source separation and tracking using \nnonlinear PCA criterion: A least-squares approach\", in Proc. IEEE Int. Conf. \non Neural Networks, Houston, Texas, 1997, pp. 2147- 2152. \n\n[2] N. Murata, K.-R. Milller, A. Ziehe, and S.-i. Amari, \"Adaptive on-line learning \nin changing environments\", \nin Advances in Neural Information Processing \nSystems, M. C. Mozer, M. I. Jordan, and T . Petsche, Eds. 1997, vol. 9, pp. \n599- 605, The MIT Press, Cambridge, MA. \n\n[3] N. N. Schraudolph, \"Local gain adaptation in stochastic gradient descent\", in \nProceedings of the 9th International Conference on Artificial Neural Networks, \nEdinburgh, Scotland, 1999, pp. 569-574, lEE, London, ftp://ftp.idsia.ch/ \npub/nic/smd.ps.gz. \n\n[4] N. N. Schraudolph, \"Online learning with adaptive local step sizes\", in Neural \nNets - WIRN Vietri-99; Proceedings of the 11th Italian Workshop on Neural \nNets, M. Marinaro and R. Tagliaferri, Eds., Vietri suI Mare, Salerno, Italy, \n1999, Perspectives in Neural Computing, pp. 151-156, Springer Verlag, Berlin. \n\n\fOnline leA with Local Rate Adaptation \n\n795 \n\n[5] A. J. Bell and T. J. Sejnowski, \"An information-maximization approach to \nblind separation and blind deconvolution\", Neural Computation, 7(6):1129-\n1159,1995. \n\n[6] S.-i. Amari, A. Cichocki, and H. H. Yang, \"A new learning algorithm for blind \nsignal separation\", \nin Advances in Neural Information Processing Systems, \nD. S. Touretzky, M. C. Mozer, and M. E. Hasselmo, Eds. 1996, vol. 8, pp. \n757-763, The MIT Press, Cambridge, MA. \n\n[7] M. Girolami and C. Fyfe, \n\n\"Generalised independent component analysis \nthrough unsupervised learning with emergent bussgang properties\", in Proc. \nIEEE Int. Conf. on Neural Networks, Houston, Texas, 1997, pp. 1788-179l. \n\n[8] J. Kivinen and M. K. Warmuth, \"Exponentiated gradient verSus gradient \ndescent for linear predictors\", Tech. Rep. 
UCSC-CRL-94-16, University of California, Santa Cruz, June 1994.

[9] J. Kivinen and M. K. Warmuth, "Additive versus exponentiated gradient updates for linear prediction", in Proc. 27th Annual ACM Symposium on Theory of Computing, New York, NY, May 1995, pp. 209-218, The Association for Computing Machinery.

[10] N. N. Schraudolph, "A fast, compact approximation of the exponential function", Neural Computation, 11(4):853-862, 1999.

[11] R. Jacobs, "Increased rates of convergence through learning rate adaptation", Neural Networks, 1:295-307, 1988.

[12] T. Tollenaere, "SuperSAB: fast adaptive back propagation with good scaling properties", Neural Networks, 3:561-573, 1990.

[13] F. M. Silva and L. B. Almeida, "Speeding up back-propagation", in Advanced Neural Computers, R. Eckmiller, Ed., Amsterdam, 1990, pp. 151-158, Elsevier.

[14] M. Riedmiller and H. Braun, "A direct adaptive method for faster backpropagation learning: The RPROP algorithm", in Proc. International Conference on Neural Networks, San Francisco, CA, 1993, pp. 586-591, IEEE, New York.

[15] L. B. Almeida, T. Langlois, J. D. Amaral, and A. Plakhov, "Parameter adaptation in stochastic optimization", in On-Line Learning in Neural Networks, D. Saad, Ed., Publications of the Newton Institute, chapter 6, Cambridge University Press, 1999, ftp://146.193.2.131/pub/lba/papers/adsteps.ps.gz.

[16] M. E. Harmon and L. C. Baird III, "Multi-player residual advantage learning with general function approximation", Tech. Rep. WL-TR-1065, Wright Laboratory, WL/AACF, 2241 Avionics Circle, Wright-Patterson Air Force Base, OH 45433-7308, 1996, http://www.leemon.com/papers/sim_tech/sim_tech.ps.gz.

[17] R. S. Sutton, "Adapting bias by gradient descent: an incremental version of delta-bar-delta", in Proc. 10th National Conference on Artificial Intelligence, 1992, pp.
171-176, The MIT Press, Cambridge, MA, ftp://ftp.cs.umass.edu/pub/anw/pub/sutton/sutton-92a.ps.gz.

[18] R. S. Sutton, "Gain adaptation beats least squares?", in Proc. 7th Yale Workshop on Adaptive and Learning Systems, 1992, pp. 161-166, ftp://ftp.cs.umass.edu/pub/anw/pub/sutton/sutton-92b.ps.gz.

[19] R. Williams and D. Zipser, "A learning algorithm for continually running fully recurrent neural networks", Neural Computation, 1:270-280, 1989.

[20] B. A. Pearlmutter, "Fast exact multiplication by the Hessian", Neural Computation, 6(1):147-160, 1994.

[21] J. Karhunen, E. Oja, L. Wang, R. Vigario, and J. Joutsensalo, "A class of neural networks for independent component analysis", IEEE Trans. on Neural Networks, 8(3):486-504, 1997.
", "award": [], "sourceid": 1648, "authors": [{"given_name": "Nicol", "family_name": "Schraudolph", "institution": null}, {"given_name": "Xavier", "family_name": "Giannakopoulos", "institution": null}]}