{"title": "Neural Network Exploration Using Optimal Experiment Design", "book": "Advances in Neural Information Processing Systems", "page_first": 679, "page_last": 686, "abstract": null, "full_text": "Neural Network Exploration Using \nOptimal Experiment Design \n\nDavid A. Cohn \nDept. of Brain and Cognitive Sciences \nMassachusetts Inst. of Technology \nCambridge, MA 02139 \n\nAbstract \n\nConsider the problem of learning input/output mappings through exploration, e.g. learning the kinematics or dynamics of a robotic manipulator. If actions are expensive and computation is cheap, then we should explore by selecting a trajectory through the input space which gives us the most information in the fewest steps. I discuss how results from the field of optimal experiment design may be used to guide such exploration, and demonstrate its use on a simple kinematics problem. \n\n1 Introduction \n\nMost machine learning research treats the learner as a passive receptacle for data to be processed. This approach ignores the fact that, in many situations, a learner is able, and sometimes required, to act on its environment to gather data. \n\nLearning control inherently involves being active; the controller must act in order to learn the result of its action. When training a neural network to control a robotic arm, one may explore by allowing the controller to \"flail\" for a length of time, moving the arm at random through coordinate space while it builds up data from which to build a model [Kuperstein, 1988]. This is not feasible, however, if actions are expensive and must be conserved. In these situations, we should choose a training trajectory that will get the most information out of a limited number of steps. Manually designing such trajectories is a slow process, and intuitively \"good\" trajectories often fail to sufficiently explore the state space [Armstrong, 1989]. 
In this paper I discuss another alternative for exploration: automatic, incremental generation of training trajectories using results from \"optimal experiment design.\" \n\nThe study of optimal experiment design (OED) [Fedorov, 1972] is concerned with the design of experiments that are expected to minimize the variances of a parameterized model. Viewing actions as experiments that move us through the state space, we can use the techniques of OED to design training trajectories. \n\nThe intent of optimal experiment design is usually to maximize confidence in a given model, minimize parameter variances for system identification, or minimize the model's output variance. Armstrong [1989] used a form of OED to identify link masses and inertial moments of a robot arm, and found that automatically generated training trajectories provided a significant improvement over human-designed trajectories. Automatic exploration strategies have been tried for neural networks (e.g. [Thrun and Moller, 1992], [Moore, 1994]), but use of OED in the neural network community has been limited. Plutowski and White [1993] successfully used it to filter a data set for maximally informative points, but its application to selecting new data has only been proposed [MacKay, 1992], not demonstrated. \n\nThe following section gives a brief description of the relevant results from optimal experiment design. Section 3 describes how these results may be adapted to guide neural network exploration, and Section 4 presents experimental results of implementing this adaptation. Finally, Section 5 discusses implications of the results and logical extensions of the current experiments. \n\n2 Optimal experiment design \n\nOptimal experiment design draws heavily on the technique of Maximum Likelihood Estimation (MLE) [Thisted, 1988]. 
Given a set of assumptions about the learner's architecture and sources of noise in the output, MLE provides a statistical basis for learning. Although the specific MLE techniques we use hold exactly only for linear models, making certain computational approximations allows them to be used with nonlinear systems such as neural networks. \n\nWe begin with a training set of input-output pairs (x_i, y_i), i = 1, ..., n, and a learner f_w(·). We define f_w(x) to be the learner's output given input x and weight vector w. Under an assumption of additive Gaussian noise, the maximum likelihood estimate for the weight vector, ŵ, is that which minimizes the sum squared error E_sse = Σ_{i=1}^{n} (f_w(x_i) - y_i)^2. The estimate ŵ gives us an estimate of the output at a novel input: ŷ = f_ŵ(x) (see e.g. Figure 1a). \n\nMLE allows us to compute the variances of our weight and output estimates. Writing the output sensitivity as g_w(x) = ∂f_w(x)/∂w, the covariances of ŵ are \n\ncov(ŵ) ≈ A^{-1}, where A = ∂^2 E_sse/∂w^2 ≈ Σ_{i=1}^{n} g_w(x_i) g_w(x_i)^T, \n\nwhere the last approximation assumes local linearity of g_w(x). (For brevity, the output sensitivity will be abbreviated to g(x) in the rest of the paper.) \n\nFigure 1: a) A set of training examples for a classification problem, and the network's best fit to the data. b) Maximum likelihood estimate of the network's output variance for the same problem. \n\nFor a given reference input x_r, the estimated output variance is \n\nvar(x_r) = g(x_r)^T A^{-1} g(x_r). (1) \n\nOutput variance corresponds to the model's estimate of the expected squared distance between its output f_w(x) and the unknown \"true\" output y. Output variance, then, corresponds to the model's estimate of its mean squared error (MSE) (see Figure 1b). If the estimates are accurate, then minimizing the output variance would correspond to minimizing the network's MSE. 
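As a concrete illustration of Equation 1, the matrix A and the output variance can be computed directly once the sensitivities g(x_i) are available. The sketch below is my own illustration, not code from the paper; the function names and the random toy sensitivities are assumptions.

```python
import numpy as np

def information_matrix(G):
    # G: (n, p) matrix whose i-th row is the sensitivity g(x_i).
    # A is the sum of outer products g(x_i) g(x_i)^T, i.e. G^T G.
    return G.T @ G

def output_variance(A, g_r):
    # Equation 1: var(x_r) = g(x_r)^T A^{-1} g(x_r), computed via a
    # linear solve rather than an explicit inverse.
    return float(g_r @ np.linalg.solve(A, g_r))

# Toy usage: 50 random sensitivity vectors for a 3-parameter model.
rng = np.random.default_rng(0)
G = rng.normal(size=(50, 3))
A = information_matrix(G)
g_r = rng.normal(size=3)
v = output_variance(A, g_r)
```

Since A is a sum of outer products, it is symmetric positive definite (given enough data), so the estimated variance is always positive.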
\n\nIn optimal experiment design, we estimate how adding a new training example is expected to change the computed variances. Given a novel x_{n+1}, we can use OED to predict the effect of adding x_{n+1} and its as-yet-unknown y_{n+1} to the training set. We make the assumption that \n\nA_{n+1}^{-1} ≈ (A_n + g(x_{n+1}) g(x_{n+1})^T)^{-1}, \n\nwhich corresponds to assuming that our current model is already fairly good. Based on this assumption, the new parameter variances will be \n\nA_{n+1}^{-1} = A_n^{-1} - A_n^{-1} g(x_{n+1}) g(x_{n+1})^T A_n^{-1} / (1 + g(x_{n+1})^T A_n^{-1} g(x_{n+1})). \n\nCombined with Equation 1, this predicts that if we take a new example at x_{n+1}, the change in output variance at reference input x_r will be \n\nΔvar(x_r) = (g(x_r)^T A_n^{-1} g(x_{n+1}))^2 / (1 + g(x_{n+1})^T A_n^{-1} g(x_{n+1})) = cov(x_r, x_{n+1})^2 / (1 + var(x_{n+1})). (2) \n\nTo minimize the expected value of var(x_r), we should select x_{n+1} so as to maximize the right side of Equation 2. For other interesting OED measures, see MacKay [1992]. \n\n3 Adapting OED to Exploration \n\nWhen building a world model, the learner is trying to build a mapping, e.g. from joint angles to Cartesian coordinates (or from state-action pairs to next states). If it is allowed to select arbitrary joint angles (inputs) in successive time steps, then the problem is one of selecting the next \"query\" to make ([Cohn, 1990], [Baum and Lang, 1991]). In exploration, however, one's choices for a next input are constrained by the current input. We cannot instantaneously \"teleport\" to remote parts of the state space, but must choose among inputs that are available in the next time step. \n\nOne approach to selecting a next input is to use selective sampling: evaluate a number of possible random inputs and choose the one with the highest expected gain. In a high-dimensional action space, this is inefficient. 
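Before turning to gradient search, note that the rank-one update of Equation 2 is the Sherman-Morrison identity and is easy to check numerically. The sketch below is not the paper's code; the function names and random toy sensitivities are assumptions. It verifies that the predicted change in output variance matches recomputing the variance after the update.

```python
import numpy as np

def sherman_morrison(A_inv, g_new):
    # (A + g g^T)^{-1} = A^{-1} - (A^{-1} g)(A^{-1} g)^T / (1 + g^T A^{-1} g),
    # valid here because A (hence A^{-1}) is symmetric.
    Ag = A_inv @ g_new
    return A_inv - np.outer(Ag, Ag) / (1.0 + g_new @ Ag)

def delta_var(A_inv, g_r, g_new):
    # Equation 2: predicted decrease in var(x_r) from sampling at x_{n+1}.
    cov = g_r @ A_inv @ g_new        # cov(x_r, x_{n+1})
    var_new = g_new @ A_inv @ g_new  # var(x_{n+1})
    return cov ** 2 / (1.0 + var_new)

rng = np.random.default_rng(1)
G = rng.normal(size=(20, 4))         # toy sensitivity matrix, 4 parameters
A_inv = np.linalg.inv(G.T @ G)
g_r, g_new = rng.normal(size=4), rng.normal(size=4)

before = g_r @ A_inv @ g_r                              # var(x_r) under A_n
after = g_r @ sherman_morrison(A_inv, g_new) @ g_r      # var(x_r) under A_{n+1}
```

The difference `before - after` agrees exactly with `delta_var`, and the variance can only decrease, since the predicted change is a ratio of non-negative quantities.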
The approach followed here is that of gradient search, differentiating Equation 2 and hillclimbing on ∂Δvar(x_r)/∂x_{n+1}. \n\nNote that Equation 2 gives the expected change in variance only at a single point x_r, while we wish to minimize the average variance over the entire domain. Explicitly integrating over the domain is intractable, so we must make do with an approximation. MacKay [1992] proposed using a fixed set of reference points and measuring the expected change in variance over them. This produces spurious local maxima at the reference points, and has the undesirable effect of arbitrarily quantizing the input space. Our approach is to iteratively draw reference points at random (either uniformly or according to a distribution of interest), and compute a stochastic approximation of Δvar. \n\nBy climbing the stochastically approximated gradient, either to convergence or to the horizon of available next inputs, we will settle on an input/action with a (locally) optimal decrease in expected variance. \n\n4 Experimental Results \n\nIn this section, I describe two sets of experiments. The first attempts to confirm that the gains predicted by optimal experiment design may actually be realized in practice, and the second studies the application of OED to a simple learning task. \n\n4.1 Expected versus actual gain \n\nIt must be emphasized that the gains predicted by OED are expected gains. These expectations are based on the relatively strong assumptions of MLE, which may not strictly hold. In order for the expected gains to materialize, two \"bridges\" must be crossed. First, the expected decrease in model variance must be realized as an actual decrease in variance. Second, the actual decrease in model variance must translate into an actual decrease in model MSE. 
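The exploration step of Section 3 can be sketched as follows. This is a minimal illustration under assumed details the paper does not specify: a linear-in-features stand-in for the sensitivity g(x), finite-difference gradients rather than analytic ones, and arbitrary step sizes. Because fresh reference points are drawn on every evaluation, the gradient is itself the stochastic approximation described above.

```python
import numpy as np

def features(x):
    # Stand-in sensitivity g(x): for a model linear in these features,
    # the output sensitivity is just the feature vector. (Assumption.)
    return np.array([1.0, x[0], x[1], x[0] * x[1]])

def expected_gain(A_inv, x_cand, rng, n_ref=100):
    # Stochastic approximation of the average of Equation 2 over the
    # domain: average Delta-var over freshly drawn reference points.
    g_new = features(x_cand)
    var_new = g_new @ A_inv @ g_new
    refs = rng.uniform(-1, 1, size=(n_ref, 2))
    covs = np.array([features(r) @ A_inv @ g_new for r in refs])
    return np.mean(covs ** 2) / (1.0 + var_new)

def hillclimb(A_inv, x0, step_limit, rng, iters=50, eps=1e-3, lr=0.05):
    # Climb the (noisy) gradient of the approximated gain, staying
    # within the inputs reachable from x0 in one time step.
    x = x0.copy()
    for _ in range(iters):
        grad = np.zeros_like(x)
        for d in range(len(x)):
            dx = np.zeros_like(x)
            dx[d] = eps
            grad[d] = (expected_gain(A_inv, x + dx, rng)
                       - expected_gain(A_inv, x - dx, rng)) / (2 * eps)
        x = np.clip(x + lr * grad, x0 - step_limit, x0 + step_limit)
    return x

rng = np.random.default_rng(2)
G = np.array([features(x) for x in rng.uniform(-1, 1, size=(30, 2))])
A_inv = np.linalg.inv(G.T @ G + 1e-6 * np.eye(4))
x_next = hillclimb(A_inv, np.zeros(2), step_limit=0.3, rng=rng)
```

The clipping plays the role of the "horizon of available next inputs": the chosen action never moves farther than the step limit from the current input.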
\n\n4.1.1 Expected decreases in variance → actual decreases in variance \n\nThe translation from expected to actual changes in variance requires coordination between the exploration strategy and the learning algorithm: to predict how the variance of a weight will change with a new piece of data, the predictor must know how the weight itself (and its neighboring weights) will change. \n\nFigure 2: a) Correlation between expected change in output variance and actual change in output variance. b) Correlation between actual change in output variance and change in mean squared error. Correlations are plotted for a network trained on 50 examples from the arm kinematics task. \n\nUsing a black box routine like backpropagation to update the weights virtually guarantees that there will be some mismatch between expected and actual decreases in variance. Experiments indicate that, in spite of this, the correlation between predicted and actual changes in variance is relatively good (Figure 2a). \n\n4.1.2 Decreases in variance → decreases in MSE \n\nA more troubling translation is the one from model variance to model correctness. 
\nGiven the highly nonlinear nature of a neural network, local minima may leave us in situations where the model is very confident but entirely wrong. Due to high confidence, the learner may reject actions that would reduce its mean squared error, and explore areas where the model is correct but has low confidence. Evidence of this behavior is seen in the lower right corner of Figure 2b, where some actions which produce a large decrease in variance have little effect on the network's MSE. While this decreases the utility of OED, it is not crippling. We discuss one possible solution to this problem at the end of this paper. \n\n4.2 Learning kinematics \n\nWe have used the stochastic approximation of Δvar to guide exploration on several simple tasks involving classification and regression. Below, I detail the experiments involving exploration of the kinematics of a simple two-dimensional, two-joint arm. The task was to learn a forward model Θ1 × Θ2 → X × Y through exploration, which could then be used to build a controller following Jordan [1992]. The model was to be learned by a feedforward network with a sigmoid transfer function using a single hidden layer of 8 or 20 hidden units. \n\nFigure 3: Learning 2D arm kinematics with 8 hidden units. a) Geometry of the 2D, two-joint arm. b) Sample trajectory using OED-based greedy exploration. \n\nOn each time step, the learner was allowed to select inputs θ1 and θ2 and was then given tip position x and y to incorporate into its training set. It then hillclimbed to find the next θ1 and θ2 within its limits of movement that would maximize the stochastic approximation of Δvar. On each time step, θ1 and θ2 were limited to change by no more than \u00b136\u00b0 and \u00b118\u00b0 respectively. 
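For reference, the forward kinematics being learned is the standard planar two-joint model; the link lengths below are arbitrary assumptions, since the paper does not state them.

```python
import numpy as np

def tip_position(theta1, theta2, l1=1.0, l2=1.0):
    # Cartesian tip position of a planar two-joint arm with shoulder
    # angle theta1, elbow angle theta2, and (assumed) link lengths l1, l2.
    x = l1 * np.cos(theta1) + l2 * np.cos(theta1 + theta2)
    y = l1 * np.sin(theta1) + l2 * np.sin(theta1 + theta2)
    return x, y

x, y = tip_position(0.0, 0.0)  # arm fully extended along the x-axis
```

With both angles at zero the tip sits at (l1 + l2, 0), a quick sanity check for any learned model of this mapping.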
Simulations were performed on the Xerion simulator (made available by the University of Toronto), approximating the variance gradient on each step with 100 randomly drawn points. A sample tip trajectory is illustrated in Figure 3b. \n\nWe compared the performance of this one-step optimal (greedy) learner, in terms of mean squared error, with that of an identical learner which explored randomly by \"flailing.\" Not surprisingly, the improvement of greedy exploration over random exploration is significant (Figure 4b). The asymptotic performance of the greedy learner was better than that of the random learner, and it reached its asymptote in far fewer steps. \n\n5 Discussion \n\nThe experiments described in this paper indicate that optimal experiment design is a promising tool for guiding neural network exploration. It requires no arbitrary discretization of state or action spaces, and is amenable to gradient search techniques. It does, however, have high computational costs and, as discussed in Section 4.1.2, may be led astray if the model settles in a local minimum. \n\n5.1 Alternatives to greedy OED \n\nThe greedy approach is prone to \"boxing itself into a corner\" while leaving important parts of the domain unexplored. One heuristic for avoiding local minima is to \n\nFigure 4: Learning 2D arm kinematics. a) MSE for a single exploration trajectory (20 hidden units). 
b) Plot of MSE for random and greedy exploration vs. number of training examples, averaged over 12 runs (8 hidden units). \n\noccasionally check the expected gain in other parts of the input space and move towards them if they promise much greater gain than a greedy step. \n\nThe theoretically correct but computationally expensive approach is to optimize over an entire trajectory. Trajectory optimization entails starting with an initial trajectory, computing the expected gain over it, and iteratively perturbing points on the trajectory towards optimal expected gain (subject to other points along the trajectory being explored). Experiments are currently underway to determine how much of an improvement may be realized with trajectory optimization; it is unclear whether the improvement over the greedy approach will be worth the added computational cost. \n\n5.2 Computational Costs \n\nThe computational costs of even greedy OED are great. Selecting a next action requires computation and inversion of the Hessian ∂^2 E_sse/∂w^2. Each time an action is selected and taken, the new data must be incorporated into the training set, and the learner retrained. In comparison, when using a flailing strategy or a fixed trajectory, the data may be gathered with little computation, and the learner trained only once on the batch. In this light, the cost of data must be much greater than the cost of computation for optimal experiment design to be a preferable strategy. \n\nThere are many approximations one can make which significantly bring down the cost of OED. By only considering covariances of weights leading to the same neuron, the Hessian may be reduced to a block diagonal form, with each neuron computing its own (simpler) covariances in parallel. As an extreme, one can do away with covariances entirely and rely only on individual weight variances, whose computation is simple. 
By the same token, one can incorporate the new examples in small batches, only retraining every 5 or so steps. While suboptimal from a data-gathering perspective, these approximations appear to still outperform random exploration, and are much cheaper than \"full-blown\" optimization. \n\n5.3 Alternative architectures \n\nWe may be able to bring down computational costs and improve performance by using a different architecture for the learner. With a standard feedforward neural network, not only is the repeated computation of variances expensive, it sometimes fails to yield estimates suitable for use as confidence intervals (as we saw in Section 4.1.2). A solution to both of these problems may lie in the selection of a more amenable architecture and learning algorithm. One such architecture, in which output variances have a direct role in estimation, is a mixture of Gaussians, which may be efficiently trained using an EM algorithm [Ghahramani and Jordan, 1994]. We expect that it is along these lines that our future research will be most fruitful. \n\nAcknowledgements \n\nI am indebted to Michael I. Jordan and David J. C. MacKay for their help in making this research possible. This work was funded by ATR Human Information Processing Laboratories, Siemens Corporate Research and NSF grant CDA-9309300. \n\nBibliography \n\nB. Armstrong. (1989) On finding exciting trajectories for identification experiments. Int. J. of Robotics Research, 8(6):28-48. \n\nE. Baum and K. Lang. (1991) Constructing hidden units using examples and queries. In R. Lippmann et al., eds., Advances in Neural Information Processing Systems 3, Morgan Kaufmann, San Francisco, CA. \n\nD. Cohn, L. Atlas and R. Ladner. (1990) Training connectionist networks with queries and selective sampling. In D. Touretzky, editor, Advances in Neural Information Processing Systems 2, Morgan Kaufmann, San Francisco, CA. \n\nV. Fedorov. 
(1972) Theory of Optimal Experiments. Academic Press, New York. \n\nZ. Ghahramani and M. Jordan. (1994) Supervised learning from incomplete data via an EM approach. In this volume. \n\nM. Jordan and D. Rumelhart. (1992) Forward models: Supervised learning with a distal teacher. Cognitive Science, 16(3):307-354. \n\nD. MacKay. (1992) Information-based objective functions for active data selection. Neural Computation, 4(4):590-604. \n\nA. Moore. (1994) The parti-game algorithm for variable resolution reinforcement learning in multidimensional state-spaces. In this volume. \n\nM. Plutowski and H. White. (1993) Selecting concise training sets from clean data. IEEE Trans. on Neural Networks, 4(2):305-318. \n\nR. Thisted. (1988) Elements of Statistical Computing. Chapman and Hall, NY. \n\nS. Thrun and K. Moller. (1992) Active exploration in dynamic environments. In J. Moody et al., editors, Advances in Neural Information Processing Systems 4. Morgan Kaufmann, San Francisco, CA. \n", "award": [], "sourceid": 765, "authors": [{"given_name": "David", "family_name": "Cohn", "institution": null}]}