{"title": "Human Active Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 241, "page_last": 248, "abstract": "We investigate a topic at the interface of machine learning and cognitive science. Human active learning, where learners can actively query the world for information, is contrasted with passive learning from random examples. Furthermore, we compare human active learning performance with predictions from statistical learning theory. We conduct a series of human category learning experiments inspired by a machine learning task for which active and passive learning error bounds are well understood, and dramatically distinct. Our results indicate that humans are capable of actively selecting informative queries, and in doing so learn better and faster than if they are given random training data, as predicted by learning theory. However, the improvement over passive learning is not as dramatic as that achieved by machine active learning algorithms. To the best of our knowledge, this is the first quantitative study comparing human category learning in active versus passive settings.", "full_text": "Human Active Learning\n\nRui Castro1, Charles Kalish2, Robert Nowak3, Ruichen Qian4, Timothy Rogers2, Xiaojin Zhu4\u2217\n\n1Department of Electrical Engineering\n\nDepartment of {2Psychology, 3Electrical and Computer Engineering, 4Computer Sciences}\n\nColumbia University. New York, NY 10027\n\nUniversity of Wisconsin-Madison. Madison, WI 53706\n\nAbstract\n\nWe investigate a topic at the interface of machine learning and cognitive science.\nHuman active learning, where learners can actively query the world for informa-\ntion, is contrasted with passive learning from random examples. Furthermore,\nwe compare human active learning performance with predictions from statistical\nlearning theory. 
We conduct a series of human category learning experiments\ninspired by a machine learning task for which active and passive learning error\nbounds are well understood, and dramatically distinct. Our results indicate that\nhumans are capable of actively selecting informative queries, and in doing so\nlearn better and faster than if they are given random training data, as predicted\nby learning theory. However, the improvement over passive learning is not as dra-\nmatic as that achieved by machine active learning algorithms. To the best of our\nknowledge, this is the \ufb01rst quantitative study comparing human category learning\nin active versus passive settings.\n\n1 Introduction\n\nActive learning is a paradigm in which the learner has the ability to sequentially select examples\nfor labeling. The selection process can take advantage of information gained from previously ob-\nserved labeled examples in order to accelerate the learning process. In contrast, passive learning is\na paradigm in which the learner has no control over the labeled examples it is given. In machine\nlearning, active learning has been a topic of intense interest. In certain machine learning problems\nit has been shown that active learning algorithms perform much better than passive learning, with\nsuperior convergence bounds (see [1, 4] and references therein) and/or superior empirical perfor-\nmance [5, 19]. In this paper we focus on the application of active learning to classi\ufb01cation, in both\nmachines and humans.\n\nTo our knowledge, no previous work has attempted to quantify human active learning performance\nin probabilistic category learning (i.e., classi\ufb01cation), contrast human active and passive learning,\nand compare against theoretically optimal theory bounds. 
Theories of human category learning\noften cast the learner as a passive learner, who observes some object (typically represented as a\nfeature vector), is presented with the object\u2019s category label, and does some statistical processing to\ndetermine how the label should generalize. Anyone who has ever interacted with a three-year-old\nwill recognize that this scenario is exceedingly unrealistic in at least one respect. Certainly toddlers\nobserve their environment, and certainly they pay attention when adults label objects for them \u2013 but\nthey also ask a lot of questions. Active querying provides children with information that they would\notherwise be less likely to encounter through passive observation; and so, presumably, such active\nquerying has important implications for category learning.\n\n\u2217Correspondence concerning this article should be sent to jerryzhu@cs.wisc.edu.\n\nFigure 1: The two-category learning\ntask with boundary \u03b8 and noise level \u03b5.\n\nFigure 2: Probabilistic bisection strategy. Shaded areas\nhave 1/2 probability mass.\n\nEarly research in human concept attainment suggested that learners do bene\ufb01t from the opportunity\nto actively select examples during learning [11]. However, it proved very dif\ufb01cult to establish\ncriteria for assessing the magnitude of the active learning bene\ufb01t (e.g., compared to theoretical ideals,\nor to passive learning). Partly as a result, nearly all contemporary research in classi\ufb01cation and\ncategorization has ignored active learning. Furthermore, a rich literature on decision-making and\nscienti\ufb01c inference has produced con\ufb02icting claims regarding people\u2019s capacities to select optimal\nlearning examples [7, 10, 12, 13, 14, 15, 16, 17, 20]. 
Most famously, people make inappropriate\nqueries to assess simple logical hypotheses such as \u201cif p then q\u201d (frequently examining q instances\nto see if they are p, and failing to explore not-q instances [20]). Several authors have argued that\npessimistic views of the human ability to choose relevant queries are based on faulty task analyses;\nand that, when the learning task is properly construed, humans do an excellent, even optimal job of\nselection [7, 14]. As much of the debate in the psychological literature turns on task analysis and the\nproper metric for assessing performance, there is signi\ufb01cant opportunity to bene\ufb01t from the formal\ndescriptions characteristic of machine learning research. The current study exploits one such analy-\nsis of a relatively simple binary classi\ufb01cation task with \ufb01xed error rate in feedback. Speci\ufb01cation of\nthe theoretical bene\ufb01ts of active learning in this context allows us to address the following questions\nregarding human performance:\n[Q1] Do humans perform better when they can select their own examples for labeling, compared to\npassive observation of labeled examples?\n[Q2] If so, do they achieve the full bene\ufb01t of active learning suggested by statistical learning theory?\n[Q3] If they do not, can machine learning be used to enhance human performance?\n[Q4] Do the answers to these questions vary depending upon the dif\ufb01culty of the learning problem?\n\nThe goal of this paper is to answer these questions in a quantitative way by studying human and\nmachine performance in one well-understood classi\ufb01cation task. Answers to these questions have\nimportant theoretical and practical implications for our understanding of human learning and cog-\nnition. As previously noted, most theories of human category learning assume passive sampling\nof the environment. 
Some researchers have argued that the environment provides little information\nregarding the category structure of the world, and so conclude that human category learning must\nbe subject to strong initial constraints [6, 3, 9]. If, however, human learning bene\ufb01ts from active\nquerying of the environment, it is not clear that such conclusions are justi\ufb01ed. From an applied\nperspective, if machines can be shown to aid human learning in certain predictable circumstances,\nthis has clear implications for the design of intelligent tutoring systems and other machine-human\nhybrid applications.\n\n2 A Two-Category Learning Task\n\nFor the study in this paper we consider learning in a relatively simple setting, where there is a good\ntheoretical understanding of both active and passive machine learning, offering an ideal test-bed for\nassessing active learning in humans. The task is essentially a two-category learning problem (binary\nclassi\ufb01cation) in the interval [0, 1]. Let \u03b8 \u2208 [0, 1] be the unknown but \ufb01xed decision boundary. To\nthe left of \u03b8 the category is \u201czero\u201d and to the right of \u03b8 the category is \u201cone.\u201d The goal of the learning\ntask is to infer \u03b8 as accurately as possible from a set of examples. The training data (set of examples)\nconsists of n sample-label pairs $\\{(X_i, Y_i)\\}_{i=1}^{n}$, where Xi \u2208 [0, 1] and Yi \u2208 {0, 1}. The label\nYi is related to the sample Xi in the following noisy way: Yi is equal to the category of Xi with\nprobability 1 \u2212 \u03b5 and equal to the other category with probability \u03b5, where 0 \u2264 \u03b5 < 1/2. In other\nwords, each label is more likely to be correct than incorrect, and \u03b5 is the probability of an incorrect\nlabel (see footnote 1). Note that the label Yi is simply a noisy answer to the question \u201cis Xi larger than \u03b8?\u201d Figure 1\nillustrates this model. 
Furthermore assume that, given Xi, Yi is statistically independent of $\\{Y_j\\}_{j \\neq i}$.\nAt this point we have not speci\ufb01ed how the sample locations Xi are generated, and in this lies the\nmajor difference between passive and active learning. In the passive learning setting the sample\nlocations are randomly distributed, independent of the labels. On the other hand, in the active\nlearning setting the learner can choose the sample locations in a sequential way depending on the\npast, that is $X_i = h(X_1, \\ldots, X_{i-1}, Y_1, \\ldots, Y_{i-1})$, where h is a (possibly random) function that\ntakes into account past experiences and proposes a new query Xi.\nIf \u03b5 = 0, that is, when there is no label noise, the optimal methodologies for passive and active\nlearning are quite obvious. In passive learning, the optimal inference is that \u03b8 lies somewhere\nbetween the rightmost location where a label of zero was observed and the leftmost location where a\nlabel of one was observed. If the n sample locations are (approximately) evenly distributed between\n0 and 1, then the error of the inference is on the order of 1/n. On the other hand, in active learning\nthe optimal strategy is a deterministic binary bisection: begin by taking X1 = 1/2. If Y1 = 0, then\n\u03b8 > 1/2, otherwise \u03b8 \u2264 1/2. Suppose Y1 = 1; then the next sample point is X2 = 1/4, and if\nY2 = 1 then \u03b8 < 1/4, otherwise \u03b8 \u2265 1/4. Proceeding in this fashion we see that the length of the\ninterval of possible values of \u03b8 is halved at every observation. Therefore after n samples the error\nof the active learning inference is at most $2^{-(n+1)}$. Clearly active learning, where the error decays\nexponentially with the number of samples, is much better than passive learning, where the error can\ndecay only polynomially.\nIf \u03b5 > 0 there is uncertainty in our label observation process and estimating \u03b8 becomes more\ndelicate. 
Under passive learning, the maximum likelihood estimator yields the optimal rate of error\nconvergence. Furthermore it is possible to show a performance lower bound that clari\ufb01es what is\nthe best possible performance of any passive learning algorithm. In particular we have the following\nresult.\n\n$$\\inf_{\\hat{\\theta}_n} \\sup_{\\theta \\in [0,1]} E\\big[|\\hat{\\theta}_n - \\theta|\\big] \\geq \\frac{1}{4} \\left( \\frac{1 + 2\\epsilon}{1 - 2\\epsilon} \\right)^{2\\epsilon} \\frac{1}{n + 1}, \\qquad (1)$$\n\nwhere \u02c6\u03b8n is the estimate of \u03b8 obtained after n observations, and the in\ufb01mum is taken over all possible\npassive learning procedures. This is a so-called minimax lower bound, and gives an indication of the\nbest achievable performance of any passive learning algorithm. That is, no passive algorithm can\nlearn more rapidly. This bound can be easily shown using Theorem 2.2 of [18], and the performance\nof the maximum likelihood estimator is within a constant factor of (1).\n\nFor active learning, deterministic bisection cannot be used due to the label noise. Nevertheless\nactive learning is still extremely bene\ufb01cial in this setting. Horstein [8] proposed a method that is\nsuitable for our purposes. The key idea stems from Bayesian estimation. Suppose that we have a\nprior probability density function p0(\u00b7) on the unknown parameter \u03b8, namely that \u03b8 is uniformly\ndistributed over the interval [0, 1]. To make the exposition clear let us assume \u03b8 = 1/4. As\nbefore, we start by making a query at X1 = 1/2. With probability 1 \u2212 \u03b5 we observe the correct\nlabel Y1 = 1, and with probability \u03b5 we observe the incorrect label Y1 = 0. Suppose Y1 = 1 was\nobserved. Given these facts we can update the posterior density by applying Bayes\u2019 rule. In this case\nwe obtain p1(t|X1, Y1) = 2(1 \u2212 \u03b5) if t \u2264 1/2, or 2\u03b5 if t > 1/2. The next step is to choose the\nsample location X2. 
We choose X2 so that it bisects the posterior probability mass, that is, we take\nX2 such that $\\Pr_{t \\sim p_1(\\cdot)}(t > X_2 \\mid X_1, Y_1) = \\Pr_{t \\sim p_1(\\cdot)}(t < X_2 \\mid X_1, Y_1)$. In other words X2 is just the\nmedian of the posterior distribution. We continue iterating this procedure until we have collected n\nsamples. The estimate \u02c6\u03b8n is then de\ufb01ned as the median of the \ufb01nal posterior distribution. Figure 2\nillustrates the procedure. Note that if \u03b5 = 0 then this probabilistic bisection is simply the binary\nbisection described above.\n\nFootnote 1: We use a constant noise level \u03b5 because the theoretical distinction between active and passive learning is\ndramatic in this case. Other (perhaps more natural) noise models are possible, for example \u03b5 can decrease away\nfrom the true class boundary. Noise models like this are well understood theoretically [4]; we will investigate\nthem in future work.\n\nFigure 3: A few 3D visual stimuli and their X values used in our experiment. [Stimuli shown at X = 0, 0.125, 0.25, 0.375, 0.5, 0.625, 0.75, 0.875, 1.]\n\nThe above algorithm works extremely well in practice, but it is hard to analyze. In [2] a slightly\nmodi\ufb01ed method was introduced, which is more amenable to analysis; the major difference involves\na discretization of the possible query locations. For this method it can be shown [2] that\n\n$$\\sup_{\\theta \\in [0,1]} E\\big[|\\hat{\\theta}_n - \\theta|\\big] \\leq 2 \\left( \\sqrt{\\frac{1}{2} + \\sqrt{\\epsilon(1 - \\epsilon)}} \\right)^{n}. \\qquad (2)$$\n\nNote that the expected estimation error decays exponentially with the number of observations, as\nopposed to the polynomial decay achievable using passive learning (1). 
This shows that the accuracy\nof active learning is signi\ufb01cantly better than that of passive learning, even in the presence of uncertainty.\nFurthermore no active (or passive) learning algorithm can have its expected error decay faster\nthan exponentially with the number of samples, as in (2).\n\n3 Human Passive and Active Learning Experiments\n\nEquipped with the theoretical performance of passive learning (1) and active learning (2), we now\ndescribe a behavioral study designed to answer Q1-Q4 posed earlier. The experiment is essentially\na human analog of the abstract learning problem described in the previous section in which the\nlearner tries to \ufb01nd the boundary between two classes de\ufb01ned along a single dimension, a setting\nused to demonstrate semi-supervised learning behavior in humans in our previous work [21]. We\nare particularly interested in comparing three distinct conditions:\nCondition \u201cRandom\u201d. This is the passive learning condition where the human subject cannot\nselect the queries, and is instead presented sequentially with examples $\\{X_i\\}_{i=1}^{n}$ sampled uniformly\nat random from [0, 1], and their noisy labels $\\{Y_i\\}_{i=1}^{n}$. The subject is regularly asked to guess the\nboundary from these observations (without feedback). As in (1), the expected estimation error\n|\u02c6\u03b8n \u2212 \u03b8| of an optimal machine learning algorithm decreases at the rate 1/n. If humans are capable\nof learning from passive observation of random samples, their boundary estimates should approach\nthe true boundary with this polynomial rate too.\nCondition \u201cHuman-Active\u201d. This is the active learning condition where the human subject, at\niteration i, selects a query Xi based on her previous queries and their noisy labels $\\{(X_j, Y_j)\\}_{j=1}^{i-1}$.\nShe then receives a subsequent noisy label Yi. 
If humans are making good use of previously collected\nexamples by selecting informative queries then the rate of error decrease should be exponential,\nfollowing (2).\nCondition \u201cMachine-Yoked\u201d. This is a hybrid human-machine-learning condition in which the\nhuman passively observes samples selected by the active learning algorithm in [2], observes the\nnoisy label generated in response to each query, and is regularly asked to guess, without feedback,\nwhere the boundary is \u2013 as though the machine is teaching the human. It is motivated by question\nQ3: Can machine learning assist human category learning?\nMaterials. Each sample X is a novel arti\ufb01cial 3D shape displayed to the subject on a computer\nscreen. The shapes change with X smoothly in several aspects simultaneously. Figure 3 shows a few\nshapes and their X values. A difference of 0.06 in X value corresponds roughly to the psychological\n\u201cJust Noticeable Difference\u201d determined by a pilot study. For implementation reasons our shapes\nare discretized to a resolution of about 0.003 in X values, beyond which the visual difference is too\nsmall to be of interest.\nParticipants. Participants were 33 university students, participating voluntarily or for partial course\ncredit. They were told that the 3D shapes are alien eggs. Spiky eggs (X close to 0) most likely hatch\nalien snakes (category zero), and smooth eggs (X close to 1) most likely hatch alien birds (category\none), but there could be exceptions (label noise). Their task was to identify as precisely as possible\nthe egg shape (decision boundary) at which it switches from most likely snakes to most likely birds.\n\n4\n\n\fProcedure. Each participant was assigned one of the three conditions: Random (13 subjects),\nHuman-Active (14 subjects), Machine-Yoked (6 subjects). Machine-Yoked receives approximately\nhalf the number of other groups, as pilot studies indicated that performance was much less variable in\nthis condition. 
In all conditions, subjects were explicitly informed of the one-dimensional nature of\nthe task. The participant \ufb01rst completed a short practice session to familiarize her with the computer\ninterface and basic task, followed by 5 longer sessions of 45 iterations each. The noise level \u03b5, which\ndetermines the dif\ufb01culty of the learning task, varied across sessions, taking the values 0, 0.05, 0.1,\n0.2, 0.4 with order determined randomly for each participant. For each session and participant the\ntrue decision boundary \u03b8 was randomly set in [1/16, 15/16] to avoid dependencies on the location\nof the true boundary. The experiment thus involved one between-subjects factor (learning condition)\nand one within-subjects factor (noise level \u03b5).\nAt iteration i of the learning task, a single shape at Xi was displayed on a CRT monitor at a normal\nviewing distance. In the Human-Active condition, the participant then used a computer mouse\nwheel to scroll through the range of shapes. Once the participant found the shape she wished to\nquery ($X_{i+1}$), she clicked a \u201chatch\u201d button and observed the outcome (bird or snake, corresponding\nto the noisy label), then clicked a \u201cContinue\u201d button to move on to the next query. In the Random\nand Machine-Yoked conditions, each sample $X_{i+1}$ was generated by the computer with no user\nintervention, and a short animation was displayed showing shapes smoothly transitioning from Xi\nto $X_{i+1}$ in order to match the visual experience in the Human-Active condition. Once the transition\nwas completed, the outcome (label) for $X_{i+1}$ was observed, and participants clicked a \u201cContinue\u201d\nbutton to observe the next sample and outcome. In all conditions, the computer generated the noisy\nlabel $Y_{i+1}$ according to the true boundary \u03b8 and noise level \u03b5, and displayed it to the participant with\neither a snake picture ($Y_{i+1} = 0$) or a bird picture ($Y_{i+1} = 1$). 
The display was reset to the initial\nshape after every 3 queries to ensure that participants paid attention to the precise shape corresponding\nto their estimate of the boundary location rather than simply searching locally around the current\nshape (total 15 re-starts over 45 queries; 45 re-starts would be too tedious for the subjects).\nThe participant was asked to guess the decision boundary (\u02c6\u03b8) after every three iterations. In these\n\u201cboundary queries,\u201d the computer began by displaying the shape at X = 1/2, and the participant\nused the mouse wheel to change the shape until it matched her current best guess about the boundary\nshape. Once satis\ufb01ed, she clicked a \u201csubmit boundary\u201d button. We thus collected \u02c6\u03b83, \u02c6\u03b86, \u02c6\u03b89, . . . , \u02c6\u03b845\nfor each session. These boundary estimates allowed us to compute mean (across subjects) human\nestimation errors |\u02c6\u03b8n \u2212 \u03b8| for different n, under different conditions and different noise levels. We\ncompare these means (i) across the different experimental conditions and (ii) to the theoretical\npredictions in (1) and (2).\n\n4 Experimental Results\n\nFigure 4 shows, for each condition and noise level, how every participant\u2019s boundary guesses\napproach the true boundary \u03b8. Qualitatively, human active learning (Human-Active) appears better\nthan passive learning (Random) because the curves are more concentrated around zero. Machine-\nassisted human learning (Machine-Yoked) seems even better. As the task becomes harder (larger\nnoise \u03b5), performance suffers in all conditions, though less so for the Machine-Yoked learners. These\nconclusions are further supported by our quantitative analysis below.\n\nIt is worth noting that the behavior of a few participants stands out in Figure 4. 
For example, one\nsubject\u2019s boundary guesses shift considerably within a session, resulting in a rather zigzagged curve\nin (Human-Active, \u03b5 = 0.1). All participants, however, perform relatively well in at least some\nnoise settings, suggesting that they took the experiment seriously. Any strange-looking behavior\nlikely re\ufb02ects genuine dif\ufb01culties in the task, and for this reason we have not removed any apparent\noutliers in the following analyses. We now answer questions Q1\u2013Q4 raised in Section 1.\n[Q1] Do humans perform better when they can actively select samples for labeling compared\nto passive observation of randomly-selected samples?\n[A1] Yes \u2013 at least for low noise levels. For higher noise the two are similar.\nTo support our answer, we show that the human estimation error |\u02c6\u03b8n \u2212 \u03b8| is smaller in the Human-\nActive condition than in the Random condition. This is plotted in Figure 5, with \u00b11 standard error bars.\nWhen noise is low, the Human-Active curve is well below the Random curve throughout the session.\n\nFigure 4: Overview of experiment results (panel columns: noise \u03b5 = 0, 0.05, 0.1, 0.2, 0.4; panel rows: Random, Human-Active, Machine-Yoked). The x-axis is iteration n, y-axis is the (signed) difference\nbetween human boundary guess and true boundary \u02c6\u03b8n \u2212 \u03b8. Each curve shows performance from one\nhuman subject (though they overlap, it is suf\ufb01cient to note the trends). Overall, human active\nlearning (Human-Active) is better than passive learning (Random), and machine-assisted human learning\n(Machine-Yoked) is even better. As the task becomes harder (larger noise \u03b5), all performances suffer.\n\nFigure 5: Human estimate error |\u02c6\u03b8n \u2212 \u03b8| under different conditions and noise levels. The x-axis is\niteration n. The error bars are \u00b11 standard error. 
Human-Active is better than Random when noise\nis low; Machine-Yoked is better than Human-Active when noise is high.\n\nThat is, with active learning the subjects quickly come up with better guesses and maintain this\nadvantage till the end. Human-Active performance deteriorates with higher noise levels, however, and\nat the highest noise levels it appears indistinguishable from performance in the Random condition.\n[Q2] Can humans achieve the full bene\ufb01t of active learning suggested by learning theory?\n[A2] Human active learning does have exponential convergence, but with slower decay\nconstants than the upper bound in (2). Human passive learning, on the other hand, sometimes\ndoes not even achieve polynomial convergence as predicted in (1), and in no condition does the\nrate approach optimal performance.\nTo support these conclusions, consider that, for active learning, the theoretical estimation\nerror bound in (2) has the form $2e^{-\\lambda n}$ and decays exponentially with n. The decay constant\n$\\lambda = -\\frac{1}{2} \\log\\left(\\frac{1}{2} + \\sqrt{\\epsilon(1 - \\epsilon)}\\right)$ is determined by the noise level \u03b5. The larger the decay\nconstant, the faster the error approaches zero. If one plots the log of the bound vs. n, the result is a line with\nslope \u2212\u03bb. To determine whether human error decays exponentially as predicted, and with a\ncomparable slope, one can similarly plot the logarithm of human active learning estimation error vs. n. If\nhuman active learning decreases error exponentially (which is desirable), this relationship is linear,\nas Figure 6 (Upper) shows it to be. This exponential decay of error offers further evidence that\nhuman active learning exceeds passive learning performance, where error can only decay polynomially\n(Figure 6, Lower). The exponential decay in human active learning is,\nhowever, slower (it has a smaller decay constant) than that of the theoretical upper bound (2). 
To see this, we \ufb01t one line per noise level in\nFigure 6 and use the negative slope of the \ufb01tted lines as the estimate of the decay constant in human\nactive learning. For comparison, we computed the decay constant of the theoretical bound (2), $\\lambda = -\\frac{1}{2} \\log\\left(\\frac{1}{2} + \\sqrt{\\epsilon(1 - \\epsilon)}\\right)$. Table 1\ncompares these decay constants under different noise levels. It is clear that human active learning\u2019s\nerror decays at a slower rate, especially when the noise is low.\n\nFigure 6: (Upper) Human active learning decreases error exponentially, as indicated by the linear\ndistribution of log(|\u02c6\u03b8n \u2212 \u03b8|) (the y-axis) versus n (the x-axis). (Lower) Human passive learning in\nthe Random condition is slower than O(1/n), since the slopes are shallower than -1 on log(|\u02c6\u03b8n \u2212 \u03b8|)\n(the y-axis) versus log(n) (the x-axis).\n\nNoise \u03b5         0        0.05     0.1      0.2      0.4\nHuman-Active    0.031    0.042    0.037    0.030    0.005\nbound (2)       0.347    0.166    0.112    0.053    0.005\n\nTable 1: The exponential decay constants of human active learning are smaller than those predicted by\nstatistical learning theory at lower noise levels.\n\nFor passive learning, the minimax lower bound (1) has a polynomial decay of O(1/n), which is a\nline with slope -1 on a plot of log(|\u02c6\u03b8n \u2212 \u03b8|) vs. log(n). 
As shown in Figure 6 (Lower), the analogous\nlog-log plot from human passive learning in the Random condition does seem to \ufb01t a line, but the\nslope is much shallower than -1. Indeed, for 2 of the 5 noise levels (0.1 and 0.2), the estimated slope\nis not signi\ufb01cantly different from zero! These results suggest that humans either fail to learn or learn\nat a much lower rate than formal analysis suggests is possible.\n[Q3] Can machine learning be used to enhance human learning?\n[A3] Apparently in high noise levels \u2013 But what really happened?\nAs shown in Figure 5, the Machine-Yoked curve is no different than Human-Active in low noise\nlevels, but substantially better in high noise levels. It is important to remember that Machine-Yoked\nis human performance, not that of the machine learning algorithm. The results seem to indicate that\nhumans can utilize the training data chosen by a machine active learning algorithm to enhance their\nperformance in settings where humans are not generally performing well. Upon closer inspection,\nhowever, we noticed that almost all subjects in the Machine-Yoked condition used the following\nstrategy. They quickly learned that the computer was generating training examples that soon con-\nverge to the true boundary. They then simply placed their boundary guess at (or near) the latest\ntraining example generated by the machine. This \u201cmemorizing\u201d strategy worked very well in our\nsetting, but it is dif\ufb01cult to believe that the subjects were really \u201clearning\u201d the decision boundary.\nInstead, they likely learned to trust and depend upon the computer. 
In view of this, we consider\nQ3 inconclusive, but hope these observations provoke thoughts on how to actually improve human\nlearning.\n[Q4] Do answers to the above questions depend upon the dif\ufb01culty of the learning task?\n[A4] One form of dif\ufb01culty, the label noise level \u03b5, has profound effects on human learning.\nSpeci\ufb01cally, the advantage of active learning diminishes with noise; and at high noise levels active\nlearning arguably has no advantage over passive learning for humans in this setting. Formal analysis\nsuggests that the advantage of active over passive sampling should diminish with increasing noise;\nbut it also suggests that some bene\ufb01t to active sampling should always be obtained. An important\ngoal for future research, then, is to understand why human performance is so adversely affected by\nnoise.\n\n5 Conclusions and Future Work\n\nWe have conducted behavioral experiments to compare active versus passive learning by humans in a\nsimple classi\ufb01cation task, and compared human performance to that predicted by statistical learning\ntheory. In short, humans are able to actively select queries and use them to achieve faster category\nlearning; but the advantages of active-learning diminish under higher noise conditions and do not\napproach theoretical bounds. One important conclusion from this work is that passive learning may\nnot be a very good model for how human beings learn to categorize. 
Our research also raises several\ninteresting further questions, including how the current conclusions extend to more realistic learning\nscenarios. The bene\ufb01t of the current work is that it capitalizes on a simple learning task for which\npassive and active performance has been formally characterized. The drawback is that the task is\nnot especially natural. In future work we plan to extend the current approach to learning situations\nmore similar to those faced by people in their day-to-day lives.\nAcknowledgments: This work is supported in part by the Wisconsin Alumni Research Foundation,\nand NSF Grant 0745423 from Developmental Learning Sciences.\n\nReferences\n[1] N. Balcan, S. Hanneke, and J. Wortman. The true sample complexity of active learning. To appear in COLT 2008, Helsinki, Finland, 2008.\n[2] M. V. Burnashev and K. Sh. Zigangirov. An interval estimation problem for controlled observations. Problems in Information Transmission, 10:223\u2013231, 1974.\n[3] S. Carey. Conceptual change in childhood. MIT Press, 1985.\n[4] R. Castro and R. Nowak. Minimax bounds for active learning. IEEE Transactions on Information Theory, 54(5):2339\u20132353, 2008.\n[5] D. Cohn, L. Atlas, and R. Ladner. Improving generalization with active learning. Machine Learning, 15(2):201\u2013221, 1994.\n[6] R. Gelman and E. M. Williams. Handbook of child psychology, chapter Enabling constraints for cognitive development and learning: A domain-speci\ufb01c epigenetic theory. John Wiley and Sons, 1998.\n[7] G. Gigerenzer and R. Selten. Bounded rationality: The adaptive toolbox. The MIT Press, 2001.\n[8] M. Horstein. Sequential decoding using noiseless feedback. IEEE Trans. Info. Theory, 9(3):136\u2013143, 1963.\n[9] F. Keil. Concepts, kinds, and cognitive development. MIT Press, 1989.\n[10] J. K. Kruschke. Bayesian approaches to associative learning: From passive to active learning. Learning & Behavior, 36(3):210\u2013226, 2008.\n[11] P. A. 
Laughlin. Focusing strategy in concept attainment as a function of instructions and task complexity.\n\nJournal of Experimental Psychology, 98(2):320\u2013327, May 1973.\n\n[12] C. R. Mynatt, M. E. Doherty, and R. D. Tweney. Con\ufb01rmation bias in a simulated research environ-\nment: An experimental study of scienti\ufb01c inference. The Quarterly Journal of Experimental Psychology,\n29(1):85\u201395, Feb 1977.\n\n[13] J. Nelson. Finding useful questions: On Bayesian diagnosticity, probability, impact, and information gain.\n\nPsychological Review, 112(4):979\u2013999, 2005.\n\n[14] M. Oaksford and N. Chater. Bayesian rationality the probabilistic approach to human reasoning. Oxford\n\nUniversity Press, 2007.\n\n[15] L. E. Schulz, T. Kushnir, and A. Gopnik. Causal Learning; Psychology, Philosophy and Computation,\n\nchapter Learning from doing: Interventions and causal inference. Oxford University Press, 2007.\n\n[16] D. Sobel and T. Kushnir. Interventions do not solely bene\ufb01t causal learning: Being told what to do results\nin worse learning than doing it yourself. In Proceedings of the 25th Annual Meeting of the Cognitive\nScience Society, 2003.\n\n[17] M. Steyvers, J. Tenenbaumb, E. Wagenmakers, and B. Blum. Inferring causal networks from observations\n\nand interventions. Cognitive Science, 27:453\u2013489, 2003.\n\n[18] Alexandre B. Tsybakov. Introduction `a l\u2019estimation non-param\u00b4etrique. Math\u00b4ematiques et Applications,\n\n[19] G. Tur, D. Hakkani-T\u00a8ur, and R. E. Schapire. Combining active and semi-supervised learning for spoken\n\nlanguage understanding. Speech Communication, 45:171\u2013186, 2005.\n\n[20] P. C. Wason and P. N. Johnson-Laird. Psychology of reasoning: Structure and content. Harvard U. Press,\n\n41. Springer, 2004.\n\n1972.\n\n[21] X. Zhu, T. Rogers, R. Qian, and C. Kalish. 
Humans perform semi-supervised classi\ufb01cation too.\n\nIn\n\nTwenty-Second AAAI Conference on Arti\ufb01cial Intelligence, 2007.\n\n8\n\n\f", "award": [], "sourceid": 860, "authors": [{"given_name": "Rui", "family_name": "Castro", "institution": null}, {"given_name": "Charles", "family_name": "Kalish", "institution": null}, {"given_name": "Robert", "family_name": "Nowak", "institution": null}, {"given_name": "Ruichen", "family_name": "Qian", "institution": null}, {"given_name": "Tim", "family_name": "Rogers", "institution": null}, {"given_name": "Jerry", "family_name": "Zhu", "institution": null}]}