{"title": "Repeat Until Bored: A Pattern Selection Strategy", "book": "Advances in Neural Information Processing Systems", "page_first": 1001, "page_last": 1008, "abstract": null, "full_text": "Repeat Until Bored: A Pattern Selection Strategy \n\nPaul W. Munro \nDepartment of Information Science \nUniversity of Pittsburgh \nPittsburgh, PA 15260 \n\nABSTRACT \n\nAn alternative to the typical technique of selecting training examples independently from a fixed distribution is formulated and analyzed, in which the current example is presented repeatedly until the error for that item is reduced to some criterion value, β; then another item is randomly selected. The convergence time can be dramatically increased or decreased by this heuristic, depending on the task, and is very sensitive to the value of β. \n\n1 INTRODUCTION \n\nIn order to implement the back propagation learning procedure (Werbos, 1974; Parker, 1985; Rumelhart, Hinton and Williams, 1986), several issues must be addressed. In addition to designing an appropriate network architecture and determining appropriate values for the learning parameters, the batch size and a scheme for selecting training examples must be chosen. The batch size is the number of patterns presented for which the corresponding weight changes are computed before they are actually implemented; immediate update is equivalent to a batch size of one. The principal pattern selection schemes are independent selection from a stationary distribution (independent identically distributed, or i.i.d.) and epochal, in which the training set is presented cyclically (here, each cycle through the training set is called an epoch). Under i.i.d. pattern selection, the learning performance is sensitive to the sequence of training examples. This observation suggests that there may exist selection strategies that facilitate learning. 
Several studies have shown the benefit of strategic pattern selection (e.g., Mozer and Bachrach, 1990; Atlas, Cohn, and Ladner, 1990; Baum and Lang, 1991). \n\nTypically, online learning is implemented by independent identically distributed pattern selection, which cannot (by definition) take advantage of a useful sequencing strategy. It seems likely, or certainly plausible, that the success of learning depends to some extent on the order in which stimuli are presented. An extreme, though negative, example would be to restrict learning to a portion of the available training set, i.e., to reduce the effective training set. Let sampling functions that depend on the state of the learner in a constructive way be termed pedagogical. \n\nDetermination of a particular input may require information exogenous to the learner; that is, just as training algorithms have been classified as supervised and unsupervised, so can pedagogical pattern selection techniques. For example, selection may depend on the network's performance relative to a desired schedule. The intent of this study is to explore an unsupervised selection procedure (even though a supervised learning rule, backpropagation, is used). The initial selection heuristic investigated was to evaluate the errors across the entire pattern set for each iteration and to present the pattern with the highest error; of course, this technique has a large computational overhead, but the question was whether it would reduce the number of learning trials. The results were quite to the contrary: preliminary trials on small tasks (two- and three-bit parity) show that this scheme performs very poorly, with all patterns maintaining high error. \n\nA new unsupervised selection technique is introduced here. 
The \"Repeat-Until-Bored\" heuristic is easily implemented and simply stated: if the current training example generates a high error (i.e., greater than a fixed criterion value), it is repeated; otherwise, another one is randomly selected. This approach was motivated by casual observations of behavior in small children; they seem to repeat seemingly arbitrary tasks several times, and then abruptly stop and move to some seemingly arbitrary alternative (Piaget, 1952). For the following discussion, IID and RUB will denote the two selection procedures to be compared. \n\n2 METHODOLOGY \n\nRUB can be implemented by adding a condition to the IID statement; in C, this is simply \n\nold (IID): patno = random() % numpats; \nnew (RUB): if (paterror < beta) patno = random() % numpats; \n\n.001 (i.e., there is no dead zone), the effect of β is reflected in another performance measure: average number of iterations to convergence (Figure 4). However, experiments with the 5-2-5 encoder task show an effect. While backprop converges for all values of β (except very small values), the performance, as measured by number of pattern presentations, does show a pronounced decrement. The 8-3-8 encoder shows a significant, but less dramatic, effect. \n\nFigure 4. Encoder performance profiles for the 5-2-5, 4-2-4, and 8-3-8 tasks (average iterations to convergence, 0-6000, vs. β from .001 to 10; first data value 8691.0). See text. \n\n3.3 THE MESH \n\nThe mesh (Figure 5, left) is a 2-D classification task that can be solved by a strictly layered net with five hidden units. Like the encoder and unlike parity, IID is found to converge on 100% of trials; however, there is a critical value of β and a well-defined dead zone (Figure 5, right). 
Note that the curve depicting average number of iterations to convergence decreases monotonically, interrupted at the dead zone but continuing its apparent trend for higher values of β. \n\nFigure 5. The mesh task. Left: the task. Right: performance profile (β from .0001 to 10). Number of simulations that converge is plotted along the bold line (left vertical axis). Average number of iterations is plotted as squares (right vertical axis). \n\n3.4 NONCONVERGENCE \n\nNonconvergence was examined in detail for three values of β, corresponding to high performance, poor performance (the dead zone), and IID, for the three-bit parity task. The error for each of the eight patterns is plotted over time. For trials that do not converge (Figure 6), the patterns interact differently, depending on the value of β. At β=0.05 (a \"good\" value of β for this task), the error traces for the four odd-parity patterns are strongly correlated in an irregular oscillatory mode, as are the four even-parity traces, but the two groups are strongly anticorrelated. In the odd-parity group, the error remains low for three of the patterns (001, 010, and 100), but ranges from less than 0.1 to values greater than 0.95 for the fourth (111). Traces for the even-parity patterns correspond almost identically; i.e., not only are they correlated, but all four maintain virtually the same value. \n\nAt this point, the dead zone phenomenon has only been observed in tasks with a single output unit. This property hints at the following explanation. 
Note first that each input/output pair in the training set divides the weight space into two halves, characterized by the sign of the linear activation into the output unit; that is, whether the output is above or below 0.5, and hence whether the magnitude of the difference between the actual and desired responses is above or below 0.5. Since β is the value of the squared error, learning is repeated for β=0.25 only for examples for which the state is on the wrong half of weight space. Just when it is about to cross the category boundary, which would bring the absolute value of the error below 0.5, RUB switches to another example, and the state is not pushed to the other side of the boundary. This conjecture suggests that for tasks with multiple output units, this effect might be reduced or eliminated, as has been demonstrated in the encoder examples. \n\nFigure 6. Error traces for individual patterns. For each of three values of the error criterion (β = 0.05, 0.3, and 2.0), the variation of the error for each pattern is plotted for 100 iterations of the three-bit parity task that did not converge. Note the large amplitude swings for low values (upper left), and the small amplitude oscillations in the \"dead zone\" (upper right). \n\n4 DISCUSSION \n\nActive learning and boredom. The sequence of training examples has an undeniable effect on learning, both in the real world and in simulated learning systems. 
While the RUB procedure influences this sequence such that the learning performance is either positively or negatively affected, it is just a minimal instance of active learning; more elaborate learning systems have explored similar notions of \"boredom\" (e.g., Scott and Markovitch, 1989). \n\nNonconvergence. From Figure 6 it can be seen, for both RUB and IID, that nonconvergence does not correspond to a local minimum in weight space. In situations where the overall error is \"stuck\" at a non-zero value, the error on the individual patterns continues to change. The weight trajectory is thus \"trapped\" in a nonoptimal orbit, rather than at a nonoptimal equilibrium point. \n\nAcknowledgements \n\nThis research was supported in part by NSF grant 00-8910368 and by Siemens Corporate Research, which kindly provided the author with financial support and a stimulating research environment during the summer of 1990. David Cohn and Rik Belew were helpful in bringing relevant work to my attention. \n\nReferences \n\nBaum, E. and Lang, K. (1991) Constructing multi-layer neural networks by searching input space rather than weight space. In: Advances in Neural Information Processing Systems 3. D. S. Touretzky, ed. Morgan Kaufmann. \n\nCohn, D., Atlas, L., and Ladner, R. (1990) Training connectionist networks with queries and selective sampling. In: Advances in Neural Information Processing Systems 2. D. S. Touretzky, ed. Morgan Kaufmann. \n\nMozer, M. and Bachrach, J. (1990) Discovering the structure of a reactive environment by exploration. In: Advances in Neural Information Processing Systems 2. D. S. Touretzky, ed. Morgan Kaufmann. \n\nParker, D. (1985) Learning logic. TR-47. MIT Center for Computational Research in Economics and Management Science. Cambridge, MA. \n\nPiaget, J. (1952) The Origins of Intelligence in Children. Norton. \n\nRumelhart, D., Hinton, G., and Williams, R. 
(1986) Learning representations by back-propagating errors. Nature 323:533-536. \n\nScott, P. D. and Markovitch, S. (1989) Uncertainty based selection of learning experiences. Sixth International Workshop on Machine Learning. pp. 358-361. \n\nWerbos, P. (1974) Beyond regression: new tools for prediction and analysis in the behavioral sciences. Unpublished doctoral dissertation, Harvard University. \n", "award": [], "sourceid": 470, "authors": [{"given_name": "Paul", "family_name": "Munro", "institution": null}]}