Part of Advances in Neural Information Processing Systems 8 (NIPS 1995)

*David Saad, Sara Solla*

We consider the problem of on-line gradient descent learning for general two-layer neural networks. An analytic solution is presented and used to investigate the role of the learning rate in controlling the evolution and convergence of the learning process.

Learning in layered neural networks refers to the modification of internal parameters $\{J\}$, which specify the strength of the interneuron couplings, so as to bring the map $f_J$ implemented by the network as close as possible to a desired map $\tilde{f}$. The degree of success is monitored through the generalization error, a measure of the dissimilarity between $f_J$ and $\tilde{f}$. Consider maps from an $N$-dimensional input space $\xi$ onto a scalar $\zeta$, as arise in the formulation of classification and regression tasks. Two-layer networks with an arbitrary number of hidden units have been shown to be universal approximators [1] for such $N$-to-one dimensional maps. Information about the desired map $\tilde{f}$ is provided through independent examples $(\xi^\mu, \zeta^\mu)$, with $\zeta^\mu = \tilde{f}(\xi^\mu)$ for all $\mu$. The examples are used to train a student network with $N$ input units, $K$ hidden units, and a single linear output unit; the target map $\tilde{f}$ is defined through a teacher network of similar architecture except for the number $M$ of hidden units. We investigate the emergence of generalization ability in an on-line learning scenario [2], in which the couplings are modified after the presentation of each example so as to minimize the corresponding error. The resulting changes in $\{J\}$ are described as a dynamical evolution; the number of examples plays the role of time. In this paper we limit our discussion to the case of the soft-committee machine [2], in which all the hidden units are connected to the output unit with positive couplings of unit strength, and only the input-to-hidden couplings are adaptive.

*D.Saad@aston.ac.uk †On leave from AT&T Bell Laboratories, Holmdel, NJ 07733, USA

Dynamics of On-line Gradient Descent Learning for Multilayer Neural Networks


Consider the student network: hidden unit $i$ receives information from input unit $r$ through the weight $J_{ir}$, and its activation under presentation of an input pattern $\xi = (\xi_1, \ldots, \xi_N)$ is $x_i = J_i \cdot \xi$, with $J_i = (J_{i1}, \ldots, J_{iN})$ defined as the vector of incoming weights onto the $i$-th hidden unit. The output of the student network is $\sigma(J, \xi) = \sum_{i=1}^{K} g(J_i \cdot \xi)$, where $g$ is the activation function of the hidden units, taken here to be the error function $g(x) \equiv \mathrm{erf}(x/\sqrt{2})$, and $J \equiv \{J_i\}_{1 \le i \le K}$.
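As an illustration only, the student network defined above and a single on-line update can be sketched in plain Python. The function names (`fields`, `student_output`, `online_step`) are hypothetical. The quadratic per-example error and the step size `eta / N` are assumptions made for this sketch; the text so far says only that the couplings are modified after each example so as to minimize the corresponding error.

```python
import math

def g(x):
    """Hidden-unit activation g(x) = erf(x / sqrt(2)), as in the text."""
    return math.erf(x / math.sqrt(2.0))

def g_prime(x):
    """Derivative of g: sqrt(2 / pi) * exp(-x**2 / 2)."""
    return math.sqrt(2.0 / math.pi) * math.exp(-0.5 * x * x)

def fields(J, xi):
    """Hidden-unit activations x_i = J_i . xi, one per weight vector J_i."""
    return [sum(w * x for w, x in zip(J_i, xi)) for J_i in J]

def student_output(J, xi):
    """Soft-committee output: sum_i g(J_i . xi), with unit output couplings."""
    return sum(g(x_i) for x_i in fields(J, xi))

def online_step(J, xi, zeta, eta):
    """Return updated weights after one example (xi, zeta).

    Assumption (not stated in the text above): the per-example error is
    0.5 * (sigma - zeta)**2 and the gradient step is scaled by eta / N.
    """
    n = len(xi)
    x = fields(J, xi)
    err = zeta - sum(g(x_i) for x_i in x)  # zeta - sigma(J, xi)
    return [
        [w + (eta / n) * g_prime(x_i) * err * xi_r for w, xi_r in zip(J_i, xi)]
        for J_i, x_i in zip(J, x)
    ]

# Usage: K = 2 hidden units, N = 3 inputs, one example with target zeta.
J = [[0.1, -0.2, 0.3], [0.0, 0.1, -0.1]]
xi, zeta = [1.0, -1.0, 0.5], 1.0
J_new = online_step(J, xi, zeta, eta=0.1)
```

In a teacher-student setting, `zeta` would be produced by applying the same `student_output` function to a teacher weight matrix with $M$ rows, and repeated calls to `online_step` on fresh examples would play the role of time steps.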
