{"title": "Is Learning The n-th Thing Any Easier Than Learning The First?", "book": "Advances in Neural Information Processing Systems", "page_first": 640, "page_last": 646, "abstract": null, "full_text": "Is Learning The n-th Thing Any Easier Than \n\nLearning The First? \n\nSebastian Thrun I \n\nComputer Science Department \n\nCarnegie Mellon University \nPittsburgh, PA 15213-3891 \n\nWorld Wide Web: http://www.cs.cmu.edul'''thrun \n\nAbstract \n\nThis paper investigates learning in a lifelong context. Lifelong learning \naddresses situations in which a learner faces a whole stream of learn(cid:173)\ning tasks. Such scenarios provide the opportunity to transfer knowledge \nacross multiple learning tasks, in order to generalize more accurately from \nless training data. In this paper, several different approaches to lifelong \nlearning are described, and applied in an object recognition domain. It \nis shown that across the board, lifelong learning approaches generalize \nconsistently more accurately from less training data, by their ability to \ntransfer knowledge across learning tasks. \n\n1 Introduction \n\nSupervised learning is concerned with approximating an unknown function based on exam(cid:173)\nples. Virtually all current approaches to supervised learning assume that one is given a set \nof input-output examples, denoted by X, which characterize an unknown function, denoted \nby f. The target function f is drawn from a class of functions, F, and the learner is given a \nspace of hypotheses, denoted by H, and an order (preference/prior) with which it considers \nthem during learning. For example, H might be the space of functions represented by an \nartificial neural network with different weight vectors. \nWhile this formulation establishes a rigid framework for research in machine learning, it \ndismisses important aspects that are essential for human learning. 
Psychological studies have shown that humans often employ more than just the training data for generalization. They are often able to generalize correctly even from a single training example [2, 10]. One of the key aspects of the learning problem faced by humans, which differs from the vast majority of problems studied in the field of neural network learning, is the fact that humans encounter a whole stream of learning problems over their entire lifetime. When faced with a new thing to learn, humans can usually exploit an enormous amount of training data and experiences that stem from other, related learning tasks. For example, when learning to drive a car, years of learning experience with basic motor skills, typical traffic patterns, logical reasoning, language and much more precede and influence this learning task. The transfer of knowledge across learning tasks seems to play an essential role for generalizing accurately, particularly when training data is scarce.

¹Also affiliated with: Institut für Informatik III, Universität Bonn, Römerstr. 164, Germany

A framework for the study of the transfer of knowledge is the lifelong learning framework. In this framework, it is assumed that a learner faces a whole collection of learning problems over its entire lifetime. Such a scenario opens the opportunity for synergy. When facing its n-th learning task, a learner can re-use knowledge gathered in its previous n − 1 learning tasks to boost the generalization accuracy.
In this paper we will be interested in the most simple version of the lifelong learning problem, in which the learner faces a family of concept learning tasks. More specifically, the functions to be learned over the lifetime of the learner, denoted by f_1, f_2, f_3, ... ∈ F, are all of the type f : I → {0, 1} and sampled from F. Each function f ∈ {f_1, f_2, f_3, ...
} is an indicator function that defines a particular concept: a pattern x ∈ I is a member of this concept if and only if f(x) = 1. When learning the n-th indicator function, f_n, the training set X contains examples of the type (x, f_n(x)) (which may be distorted by noise). In addition to the training set, the learner is also given n − 1 sets of examples of other concept functions, denoted by X_k (k = 1, ..., n − 1). Each X_k contains training examples that characterize f_k. Since this additional data is intended to support learning f_n, X_k is called a support set for the training set X.
An example of the above is the recognition of faces [5, 7]. When learning to recognize the n-th person, say f_Bob, the learner is given a set of positive and negative examples of face images of this person. In lifelong learning, it may also exploit training information stemming from other persons, such as f ∈ {f_Rich, f_Mike, f_Dave, ...}. The support sets usually cannot be used directly as training patterns when learning a new concept, since they describe different concepts (hence have different class labels). However, certain features (like the shape of the eyes) are more important than others (like the facial expression, or the location of the face within the image). Once the invariances of the domain are learned, they can be transferred to new learning tasks (new people) and hence improve generalization.
To illustrate the potential importance of related learning tasks in lifelong learning, this paper does not present just one particular approach to the transfer of knowledge. Instead, it describes several, all of which extend conventional memory-based or neural network algorithms. These approaches are compared with more traditional learning algorithms, i.e., those that do not transfer knowledge.
The goal of this research is to demonstrate that, independent of a particular learning approach, more complex functions can be learned from less training data if learning is embedded into a lifelong context.

2 Memory-Based Learning Approaches

Memory-based algorithms memorize all training examples explicitly and interpolate them at query time. We will first sketch two simple, well-known approaches to memory-based learning, then propose extensions that take the support sets into account.

2.1 Nearest Neighbor and Shepard's Method

Probably the most widely used memory-based learning algorithm is K-nearest neighbor (KNN) [15]. Suppose x is a query pattern, for which we would like to know the output y. KNN searches the set of training examples X for those K examples (x_i, y_i) ∈ X whose input patterns x_i are nearest to x (according to some distance metric, e.g., the Euclidean distance). It then returns the mean output value K^{-1} Σ y_i of these nearest neighbors.
Another commonly used method, which is due to Shepard [13], averages the output values of all training examples but weights each example according to the inverse distance to the query point x:

\[ s(x) := \left( \sum_{(x_i, y_i) \in X} \frac{y_i}{\|x - x_i\| + \epsilon} \right) \cdot \left( \sum_{(x_i, y_i) \in X} \frac{1}{\|x - x_i\| + \epsilon} \right)^{-1} \tag{1} \]

Here ε > 0 is a small constant that prevents division by zero. Plain memory-based learning uses exclusively the training set X for learning. There is no obvious way to incorporate the support sets, since they carry the wrong class labels.

2.2 Learning A New Representation

The first modification of memory-based learning proposed in this paper employs the support sets to learn a new representation of the data. More specifically, the support sets are employed to learn a function, denoted by g : I → I', which maps input patterns in I to a new space, I'. This new space I' forms the input space for a memory-based algorithm.
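For concreteness, the two plain memory-based baselines of Section 2.1 can be sketched in a few lines. This is a minimal illustration; the function names, the Euclidean metric, and the NumPy formulation are choices made here, not prescribed by the paper:

```python
import numpy as np

def shepard_predict(X, Y, x, eps=1e-6):
    """Inverse-distance-weighted prediction (Shepard's method, equation (1))."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    # One weight per stored example: inverse distance to the query point.
    w = 1.0 / (np.linalg.norm(X - np.asarray(x, dtype=float), axis=1) + eps)
    return float(np.dot(w, Y) / w.sum())

def knn_predict(X, Y, x, k=3):
    """K-nearest-neighbor prediction: mean output of the k closest examples."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    # Indices of the k training inputs nearest to the query.
    idx = np.argsort(np.linalg.norm(X - np.asarray(x, dtype=float), axis=1))[:k]
    return float(Y[idx].mean())
```

Both predictors interpolate the stored examples at query time; neither has any mechanism for using the support sets, which is the limitation the following subsections address.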
\nObviously, the key property of a good data representations is that multiple examples of a \nsingle concept should have a similar representation, whereas the representation of an example \nand a counterexample of a concept should be more different. This property can directly be \ntransformed into an energy function for g: \n\n( \n\nn-I \n\nE:= ~ (X,y~EXk (X\"y~EXk Ilg(x)-g(x')11 \nAdjusting 9 to minimize E forces the distance between pairs of examples of the same \nconcept to be small, and the distance between an example and a counterexample of a concept \nto be large. In our implementation, 9 is realized by a neural network and trained using the \nBack-Propagation algorithm [12]. \n\n(X\"y~EXk Ilg( x )-g(x')11 \n\n(2) \n\n) \n\nNotice that the new representation, g, is obtained through the support sets. Assuming that \nthe learned representation is appropriate for new learning tasks, standard memory-based \nlearning can be applied using this new representation when learning the n-th concept. \n\n2.3 Learning A Distance Function \n\nAn alternative way for exploiting support sets to improve memory-based learning is to learn \na distance function [3, 9]. This approach learns a function d : I x I --+ [0, I] which accepts \ntwo input patterns, say x and x' , and outputs whether x and x' are members of the same \nconcept, regardless what the concept is. Training examples for d are \n\n(( x , x'),I) \n((x, x'), 0) \n\nify=y'=l \nif(y=IAy'=O)or(y=OAy'=I). \n\nThey are derived from pairs of examples (x , y) , (x', y') E Xk taken from a single support \nset X k (k = 1, . .. , n -\nI). In our implementation, d is an artificial neural network trained \nwith Back-Propagation. Notice that the training examples for d lack information concerning \nthe concept for which they were originally derived. Hence, all support sets can be used to \ntrain d. After training, d can be interpreted as the probability that two patterns x, x' E I are \nexamples of the same concept. 
\nOnce trained, d can be used as a generalized distance function for a memory-based approach. \nSuppose one is given a training set X and a query point x E I. Then, for each positive \nexample (x' , y' = I) EX , d( x , x') can be interpreted as the probability that x is a member \nof the target concept. Votes from multiple positive examples (XI, I) , (X2' I), ... E X are \ncombined using Bayes' rule, yielding \n\n.- 1- (I + \n\nII \n\n(x' ,y'=I)EXk \n\nI:(~(::~,))-I \n\nProb(fn(x)=I) \n\n(3) \n\n\fIs Learning the n-th Thing Any Easier Than Learning the First? \n\n643 \n\nNotice that d is not a distance metric. It generalizes the notion of a distance metric, because \nthe triangle inequality needs not hold, and because an example of the target concept x' can \nprovide evidence that x is not a member of that concept (if d(x, x') < 0.5). \n\n3 Neural Network Approaches \n\nTo make our comparison more complete, we will now briefly describe approaches that rely \nexclusively on artificial neural networks for learning In. \n\n3.1 Back-Propagation \nStandard Back-Propagation can be used to learn the indicator function In, using X as training \nset. This approach does not employ the support sets, hence is unable to transfer knowledge \nacross learning tasks. \n\n3.2 Learning With Hints \n\nLearning with hints [1, 4, 6, 16] constructs a neural network with n output units, one for \neach function Ik (k = 1,2, .. . , n). This network is then trained to simultaneously minimize \nthe error on both the support sets {Xk} and the training set X. By doing so, the internal \nrepresentation of this network is not only determined by X but also shaped through the \nsupport sets {X k }. If similar internal representations are required for al1 functions Ik \n(k = 1,2, .. . , n), the support sets provide additional training examples for the internal \nrepresentation. 
\n\n3.3 Explanation-Based Neural Network Learning \n\nThe last method described here uses the explanation-based neural network learning al(cid:173)\ngorithm (EBNN), which was original1y proposed in the context of reinforcement learning \n[8, 17]. EBNN trains an artificial neural network, denoted by h : I ----+ [0, 1], just like \nBack-Propagation. However, in addition to the target values given by the training set X, \nEBNN estimates the slopes (tangents) of the target function In for each example in X. More \nspecifically, training examples in EBNN are of the sort (x, In (x), \\7 xl n (x)), which are fit \nusing the Tangent-Prop algorithm [14]. The input x and target value In(x) are taken from \nthe trai ning set X. The third term, the slope \\7 xl n ( X ), is estimated using the learned distance \nfunction d described above. Suppose (x', y' = 1) E X is a (positive) training example. \nThen, the function dx ' : I ----+ [0, 1] with dx ' (z) := d(z , x') maps a single input pattern to \n[0, 1], and is an approximation to In. Since d( z, x') is represented by a neural network and \nneural networks are differentiable, the gradient 8dx ' (z) /8z is an estimate of the slope of In \nat z. Setting z := x yields the desired estimate of \\7 xln (x) . As stated above, both the target \nvalue In (x) and the slope vector \\7 x In (x) are fit using the Tangent-Prop algorithm for each \ntraining example x EX . \nThe slope \\7 xln provides additional information about the target function In. Since d is \nlearned using the support sets, EBNN approach transfers knowledge from the support sets \nto the new learning task. EBNN relies on the assumption that d is accurate enough to yield \nhelpful sensitivity information. However, since EBNN fits both training patterns (values) \nand slopes, misleading slopes can be overridden by training examples. See [17] for a more \ndetailed description of EBNN and further references. 
\n\n4 Experimental Results \n\nAll approaches were tested using a database of color camera images of different objects \n(see Fig. 3.3). Each of the object in the database has a distinct color or size. The n-th \n\n\f644 \n\nl1 \n\n.... \n\n'I I't' 'I \n\n\u2022 \n\n, \n\n. \n\n< \n\n> \n\n-\n-.... \n\n'\" \n\n~ \n\n1:1 ,I \n\n:. ~~~ \n\n-~\"\"\":::.~ ~ \n\n-,:~~,} I \n\n, \n\n\u00a3 \n\n...... \n\n~.. \n\n' \n\n. ~ \n\n~ <.-\n\n~~- -_,1-\n\n:t \n\n\" \n\n~ \n\n~_ \n\n, \n\n~,-l/> ;' ;'j III \n\n'1 ~' \n.d!t~)ltI!{iH-\"\" \n\n''',ll t! ~[~ -\n\n... \n\n, \n\n, \n\nc- _ . \n\nML~._ ... , I \n\n'''!!!i!~, \n\n=' \n;~~~ , \n\nS. THRUN \n\nFigure 1: The sup(cid:173)\nport sets were com(cid:173)\npiled out of a hundred \nimages of a bottle, a \nhat, a hammer, a coke \ncan, and a book. The \nn-th learning tasks \ninvolves distinguish(cid:173)\ning the shoe from the \nImages \nsunglasses. \nwere subsampled to \na 100x 100 pixel ma(cid:173)\ntrix (each pixel has a \ncolor, saturation, and \na brightness value), \nshown on the right \nside. \n\n\u00bb.~ <.~ \n\n,,-\n\n~ \n\n-... \n~ ~_l_~ __ E~ \n'~~ \nII _e\u00b7m;, ;1 ~ t \n\n~,~,AA( \n\n, \n\" \n\n~ \n\n:R;1-; \n, \n\"\"111':'i, It \nf4~ r \n\nlearning task was the recognition of one of these objects, namely the shoe. The previous \nn - 1 learning tasks correspond to the recognition of five other objects, namely the bottle, \nthe hat, the hammer, the coke can, and the book. To ensure that the latter images could \nnot be used simply as additional training data for In, the only counterexamples of the shoe \nwas the seventh object, the sunglasses. Hence, the training set for In contained images of \nthe shoe and the sunglasses, and the support sets contained images of the other five objects. \nThe object recognition domain is a good testbed for the transfer of knowledge in lifelong \nlearning. 
This is because finding a good approximation to f_n involves recognizing the target object invariant to rotation, translation, scaling in size, change of lighting, and so on. Since these invariances are common to all object recognition tasks, images showing other objects can provide additional information and boost the generalization accuracy.
Transfer of knowledge is most important when training data is scarce. Hence, in an initial experiment we tested all methods using a single image of the shoe and the sunglasses only. Those methods that are able to transfer knowledge were also provided 100 images of each of the other five objects. The results are intriguing. The generalization accuracies

  Back-Prop:         59.7% ± 9.0%
  KNN:               60.4% ± 8.3%
  Shepard:           60.4% ± 8.3%
  hints*:            62.1% ± 11.1%
  distance d*:       74.4% ± 18.5%
  repr. g + Shep.*:  75.2% ± 18.9%
  EBNN*:             74.8% ± 10.2%

illustrate that all approaches that transfer knowledge (marked with an asterisk) generalize significantly better than those that do not. With the exception of the hint learning technique, the approaches can be grouped into two categories: those which classify approximately 60% of the testing set correctly, and those which achieve approximately 75% generalization accuracy. The former group contains the standard supervised learning algorithms, and the latter contains the "new" algorithms proposed here, which are capable of transferring knowledge. The differences within each group are statistically not significant, while the differences between them are (at the 95% level). Notice that random guessing classifies 50% of the testing examples correctly.
These results suggest that the generalization accuracy depends only weakly on the particular choice of the learning algorithm (memory-based vs. neural networks).
Instead, the main factor determining the generalization accuracy is whether or not knowledge is transferred from past learning tasks.

[Figure: generalization accuracy (55%–95%) as a function of the amount of training data, comparing the distance function d, Shepard's method with the learned representation g, and plain Shepard's method.]