{"title": "Optimal Teaching for Limited-Capacity Human Learners", "book": "Advances in Neural Information Processing Systems", "page_first": 2465, "page_last": 2473, "abstract": "Basic decisions, such as judging a person as a friend or foe, involve categorizing novel stimuli. Recent work finds that people\u2019s category judgments are guided by a small set of examples that are retrieved from memory at decision time. This limited and stochastic retrieval places limits on human performance for probabilistic classification decisions. In light of this capacity limitation, recent work finds that idealizing training items, such that the saliency of ambiguous cases is reduced, improves human performance on novel test items. One shortcoming of previous work in idealization is that category distributions were idealized in an ad hoc or heuristic fashion. In this contribution, we take a first principles approach to constructing idealized training sets. We apply a machine teaching procedure to a cognitive model that is either limited capacity (as humans are) or unlimited capacity (as most machine learning systems are). As predicted, we find that the machine teacher recommends idealized training sets. We also find that human learners perform best when training recommendations from the machine teacher are based on a limited-capacity model. As predicted, to the extent that the learning model used by the machine teacher conforms to the true nature of human learners, the recommendations of the machine teacher prove effective. 
Our results provide a normative basis (given capacity constraints) for idealization procedures and offer a novel selection procedure for models of human learning.", "full_text": "Optimal Teaching for\n\nLimited-Capacity Human Learners\n\nKaustubh Raosaheb Patil\nAffective Brain Lab, UCL\n\n& MIT Sloan Neuroeconomics Lab\nkaustubh.patil@gmail.com\n\n\u0141ukasz Kope\u00b4c\n\nExperimental Psychology\nUniversity College London\nl.kopec.12@ucl.ac.uk\n\nXiaojin Zhu\n\nDepartment of Computer Sciences\nUniversity of Wisconsin-Madison\n\njerryzhu@cs.wisc.edu\n\nBradley C. Love\n\nExperimental Psychology\nUniversity College London\n\nb.love@ucl.ac.uk\n\nAbstract\n\nBasic decisions, such as judging a person as a friend or foe, involve categorizing\nnovel stimuli. Recent work \ufb01nds that people\u2019s category judgments are guided by\na small set of examples that are retrieved from memory at decision time. This\nlimited and stochastic retrieval places limits on human performance for proba-\nbilistic classi\ufb01cation decisions. In light of this capacity limitation, recent work\n\ufb01nds that idealizing training items, such that the saliency of ambiguous cases is\nreduced, improves human performance on novel test items. One shortcoming of\nprevious work in idealization is that category distributions were idealized in an ad\nhoc or heuristic fashion. In this contribution, we take a \ufb01rst principles approach\nto constructing idealized training sets. We apply a machine teaching procedure\nto a cognitive model that is either limited capacity (as humans are) or unlimited\ncapacity (as most machine learning systems are). As predicted, we \ufb01nd that the\nmachine teacher recommends idealized training sets. We also \ufb01nd that human\nlearners perform best when training recommendations from the machine teacher\nare based on a limited-capacity model. 
As predicted, to the extent that the learning model used by the machine teacher conforms to the true nature of human learners, the recommendations of the machine teacher prove effective. Our results provide a normative basis (given capacity constraints) for idealization procedures and offer a novel selection procedure for models of human learning.

1 Introduction

Judging a person as a friend or foe, a mushroom as edible or poisonous, or a sound as an /l/ or /r/ are examples of categorization tasks. Category knowledge is often acquired from examples that are either provided by a teacher or drawn from past experience. One important research challenge is determining the best set of examples to provide a human learner to facilitate learning and the use of that knowledge when making decisions, such as classifying novel stimuli. Such a teacher would be helpful in a pedagogical setting for curriculum design [1, 2].
Recent work suggests that people's categorization decisions are guided by a small set of examples retrieved at the time of decision [3]. This limited and stochastic retrieval places limits on human performance for probabilistic classification decisions, such as predicting the winner of a sports contest or classifying a mammogram as normal or tumorous [4]. In light of these capacity limits, Giguère and Love [3] determined and empirically verified that humans perform better at test after being trained on idealized category distributions that minimize the saliency of ambiguous cases during training. 
Unlike machine learning systems that can have unlimited retrieval capacity, people per-\nformed better when trained on non-representative samples of category members, which is contrary\nto common machine learning practices where the aim is to match training and test distributions [5].\nOne shortcoming of previous work in idealization is that category distributions were idealized in\nan ad hoc or heuristic fashion, guided only by the intuitions of the experimenters in contrast to a\nrigorous systematic approach. In this contribution, we take a \ufb01rst principles approach to constructing\nidealized training sets. We apply a machine teaching procedure [6] to a cognitive model that is\neither limited capacity (as humans are) or unlimited capacity (as most machine learning systems\nare). One general prediction is that the machine teacher will idealize training sets. Such a result\nwould establish a conceptual link between idealization manipulations from psychology and optimal\nteaching procedures from machine learning [7, 6, 8, 2, 9, 10, 11]. A second prediction is that\nhuman learners will perform best with training sets recommended by a machine teacher that adopts\na limited capacity model of the learner. To the extent that the learning model used by the machine\nteacher conforms to the true nature of human learners, the recommendations of the machine teacher\nshould prove more effective. This latter prediction advances a novel method to evaluate theories of\nhuman learning. Overall, our work aims to provide a normative basis (given capacity constraints)\nfor idealization procedures.\n\n2 Limited- and In\ufb01nite-Capacity Models\n\nAlthough there are many candidate models of human learning (see [12] for a review), to cement the\nconnection with prior work [3] and to facilitate evaluation of model variants differing in capacity\nlimits, we focus on exemplar models of human learning. 
Exemplar models have proven successful\nin accounting for human learning performance [13, 14], are consistent with neural representations of\nacquired categories [15], and share strong theoretical connections with machine learning approaches\n[16, 17]. Exemplar models represent categories as a collection of experienced training examples. At\nthe time of decision, category examples (i.e., exemplars) are activated (i.e., retrieved) in proportion\nto their similarity to the stimulus. The category with the greatest total similarity across members\ntends to be chosen as the category response. Formally, the categorization problem is to estimate the\nlabel \u02c6y of a test item x from its similarity with the training exemplars {(x1, y1), . . . , (xn, yn)}.\nExemplar models are consistent with the notion that people stochastically and selectively sample\nfrom memory at the time of decision. For example, in the Exemplar-Based Random Walk (EBRW)\nmodel [18], exemplars are retrieved sequentially and stochastically as a function of their similarity\nto the stimulus. Retrieved exemplars provide evidence for category responses. When accumulated\nevidence (i.e., retrieved exemplars) for a response exceeds a threshold, the corresponding response\nis made. The number of steps in the diffusion process is the predicted response time.\nOne basic feature of EBRW is that not all exemplars in memory need feed into the decision process.\nAs discussed by Gigu`ere and Love [3], \ufb01nite decision thresholds in EBRW can be interpreted as\na capacity limit in memory retrieval. When decision thresholds are \ufb01nite, a limited number of\nexemplars are retrieved from memory. 
When capacity is limited in this fashion, models perform better when training sets are idealized. Idealization reduces the noise injected into the decision process by limited and stochastic sampling of information in memory.
We aim to show that a machine teacher, particularly one using a limited-capacity model of the learner, will idealize training sets. Such a result would provide a normative basis (given capacity constraints) for idealization procedures. To evaluate our predictions, we formally specify a limited- and unlimited-capacity exemplar model. Rather than work with EBRW, we instead choose a simpler mathematical model, the Generalized Context Model (GCM, [14]), which offers numerous advantages for our purposes. As discussed below, a parameter in GCM can be interpreted as specifying capacity and can be related to decision threshold placement in EBRW's drift-diffusion process.
Given a finite training set (or a teaching set; we will use the two terms interchangeably) D = {(x_1, y_1), . . . , (x_n, y_n)} and a test item (i.e., stimulus) x, GCM estimates the label probability as:

p̂(y = 1 | x, D) = (b + Σ_{i∈D: y_i=1} e^{−c d(x, x_i)})^γ / [(b + Σ_{i∈D: y_i=1} e^{−c d(x, x_i)})^γ + (b + Σ_{i∈D: y_i=−1} e^{−c d(x, x_i)})^γ]   (1)

where d is the distance function that specifies the distance (e.g., the difference in length between two line stimuli) between the stimulus x and exemplar x_i, c is a scaling parameter that specifies the rate at which similarity decreases with distance (i.e., the bandwidth parameter for a kernel), and the parameter b is background similarity, which is related to irrelevant information activated in memory. Critically, the response scaling parameter, γ, has been shown to bear a relationship to decision threshold placement in EBRW [18]. 
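Equation 1 is compact enough to sketch directly. The following is an illustrative NumPy transcription, not the authors' code; the default parameter values are arbitrary placeholders rather than the fitted values reported later.

```python
import numpy as np

def gcm_prob(x, train_x, train_y, b=1.0, c=3.0, gamma=5.0):
    """P(y = 1 | x, D) under the GCM response rule (Equation 1).

    train_x: exemplar stimuli in [0, 1]; train_y: labels in {-1, +1}.
    b: background similarity, c: similarity gradient, gamma: response
    scaling (low gamma ~ limited capacity, high gamma ~ high capacity).
    """
    train_x, train_y = np.asarray(train_x), np.asarray(train_y)
    sim = np.exp(-c * np.abs(x - train_x))          # e^{-c d(x, x_i)} with d = |.|
    s_pos = (b + sim[train_y == 1].sum()) ** gamma  # summed similarity, category +1
    s_neg = (b + sim[train_y == -1].sum()) ** gamma # summed similarity, category -1
    return s_pos / (s_pos + s_neg)
```

With a teaching set symmetric about the stimulus the model is indifferent (probability 0.5), and raising γ sharpens the response toward the better-supported category, which is the sense in which high γ approximates an unlimited-capacity learner.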
In particular, Equation 1 is equivalent to EBRW's mean response (averaged over many trials) with decision threshold bounds placed γ units away from the starting point for evidence accumulation. Thus, GCM with a low value of γ can be viewed as a limited-capacity model, whereas GCM with a high value of γ converges to the predictions of an infinite-capacity model. These two model variations (low and high γ as surrogates for low and high capacity) will figure prominently in our study and analyses.
To select a binary response, the learner samples a label according to the probability ŷ ∼ Bernoulli(p̂(y = 1 | x, D)). Therefore, the learner makes stochastic predictions. When measuring the classification error of the learner, we will take an expectation over this randomness. Let the distance function be d(x_i, x_j) = |x_i − x_j|. Thus a GCM learner can be represented using three parameters {b, c, γ}.

3 Machine Teaching for the GCM Learners

Machine teaching is the inverse problem of machine learning. Given a learner and a test distribution, machine teaching designs a small (typically non-iid) teaching set D such that the learner trained on D has the smallest test error [6]. The machine teaching framework poses an optimization problem:

min_{D∈𝒟} loss(D) + effort(D).   (2)

The optimization is over D, the teaching set that we present to the learner. For our task, D = (x_1, y_1), . . . , (x_n, y_n), where x_i ∈ [0, 1] represents the 1D feature of the ith stimulus and y_i ∈ {−1, 1} represents the ith label. The search space 𝒟 = {(X × Y)^n : n ∈ ℕ} is the (infinite) set of finite teaching sets. Importantly, D is not required to consist of iid items drawn from the test distribution p(x, y). Rather, D will usually contain specially arranged items. 
This is a major difference from standard machine learning.
Since we want to minimize classification error on future test items, we define the teaching loss function to be the generalization error:

loss(D) = E_{(x,y)∼p(x,y)} E_{ŷ∼p̂(y|x,D)} 1[y ≠ ŷ].   (3)

The first expectation is with respect to the test distribution p(x, y). That is, we still assume that test items are drawn iid from the test distribution. The second expectation is w.r.t. the stochastic predictions that the GCM learner makes. Note that the teaching set D enters the loss() function through the GCM model p̂(y | x, D) in (1). We observe that:

loss(D) = E_{x∼p(x)} [p(y = 1 | x) p̂(y = −1 | x, D) + p(y = −1 | x) p̂(y = 1 | x, D)]
        = ∫ ( p(y = 1 | x) + (1 − 2p(y = 1 | x)) / (1 + ((b + Σ_{i∈D: y_i=−1} e^{−c d(x, x_i)}) / (b + Σ_{i∈D: y_i=1} e^{−c d(x, x_i)}))^γ) ) p(x) dx.   (4)

The teaching effort function effort(D) is a powerful way to specify preferences over the teaching-set space 𝒟. For example, if we use effort(D) = |D|, the size of D, then the machine teaching problem (2) will prefer smaller teaching sets. In this paper, we use a simple definition of effort(): effort(D) = 0 if |D| = n, and ∞ otherwise. This infinity indicator function simply acts as a hard constraint so that D must have exactly n items. Equivalently, we may drop the effort() term from (2) altogether while requiring the search space 𝒟 to consist of teaching sets of size exactly n.
In this paper, we consider test distributions p(x, y) whose marginal on x has a special form. Specifically, we assume that p(x) is a uniform distribution over m distinct test stimuli z_1, . . . , z_m ∈ [0, 1]. In other words, there are only m distinct test stimuli. 
The test label y for stimulus z_j in any given test set is randomly sampled from p(y | z_j). Besides matching the actual behavioral experiments, this discrete marginal test distribution affords a further simplification to our teaching problem: the integral in (4) is replaced with a summation:

min_{x_1...x_n∈[0,1]; y_1...y_n∈{−1,1}} (1/m) Σ_{j=1}^m ( p(y = 1 | z_j) + (1 − 2p(y = 1 | z_j)) / (1 + ((b + Σ_{i: y_i=−1} e^{−c d(z_j, x_i)}) / (b + Σ_{i: y_i=1} e^{−c d(z_j, x_i)}))^γ) ).   (5)

It is useful to keep in mind that y_1 . . . y_n are the training item labels that we can design, while y is a dummy variable for the stochastic test label.
In fact, equation (5) is a mixed integer program because we design both the continuous training stimuli x_1 . . . x_n and the discrete training labels y_1 . . . y_n. It is computationally challenging. We will relax this problem to arrive at our final optimization problem. We consider a smaller search space 𝒟 where each training item label y_i is uniquely determined by the position of x_i w.r.t. the true decision boundary θ* = 0.5. That is, y_i = 1 if x_i ≥ θ* and y_i = −1 if x_i < θ*. We do not have evidence that this reduced freedom in training labels adversely affects the power of the teaching-set solution. 
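The relaxed objective can be transcribed directly. Below is a sketch of the summand in (5), with each label already tied to the side of θ* = 0.5 as just described; the parameter defaults and the inputs are illustrative assumptions, not the fitted values used later.

```python
import numpy as np

def teaching_loss(train_x, test_z, p1, b=1.0, c=3.0, gamma=5.0):
    """Expected 0-1 test error of a GCM learner (the objective in (5)).

    train_x: teaching stimuli in [0, 1]; each label is fixed by the side
    of the true boundary theta* = 0.5 (y_i = +1 iff x_i >= 0.5).
    test_z: the m distinct test stimuli; p1[j] = p(y = 1 | z_j).
    """
    train_x = np.asarray(train_x)
    total = 0.0
    for zj, pj in zip(test_z, p1):
        sim = np.exp(-c * np.abs(zj - train_x))          # e^{-c d(z_j, x_i)}
        s_pos = (b + sim[train_x >= 0.5].sum()) ** gamma
        s_neg = (b + sim[train_x < 0.5].sum()) ** gamma
        phat1 = s_pos / (s_pos + s_neg)                  # GCM, Equation 1
        # Probability that a stochastic Bernoulli responder disagrees
        # with the stochastic test label at z_j.
        total += pj * (1.0 - phat1) + (1.0 - pj) * phat1
    return total / len(p1)
```

For a modest γ, a pair of items placed far from the boundary already yields a lower expected error than a pair hugging it, anticipating the idealization result reported below: for example, `teaching_loss([0.1, 0.9], [0.25, 0.75], [0.0, 1.0])` is smaller than `teaching_loss([0.49, 0.51], [0.25, 0.75], [0.0, 1.0])`.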
Having removed the difficult discrete optimization aspect, we arrive at the following continuous optimization problem for finding an optimal teaching set (note the change to the selector variables i):

min_{x_1...x_n∈[0,1]} (1/m) Σ_{j=1}^m ( p(y = 1 | z_j) + (1 − 2p(y = 1 | z_j)) / (1 + ((b + Σ_{i: x_i<0.5} e^{−c d(z_j, x_i)}) / (b + Σ_{i: x_i≥0.5} e^{−c d(z_j, x_i)}))^γ) ).   (6)

4 Experiments

Using the machine teacher, we derive a variety of optimal training sets for low- and high-capacity GCM learners. We then evaluate how humans perform when trained on these recommended items (i.e., training sets). The main predictions are that the machine teacher will idealize training sets and that humans will perform better on optimal training sets calculated using the low-capacity GCM variant. In what follows, we first specify parameter values for the GCM variants, present the optimal teaching sets we calculate, and then discuss the human experiments.

4.1 Specifying GCM parameters

The machine teacher requires a full specification of the learner, including its parameters. Parameters were set for the low-capacity GCM model by fitting the behavioral data from Experiment 2 of Giguère and Love [3]. GCM was fit to the aggregated data representing an average human learner by solving the following optimization problem:

{b̂, ĉ, γ̂} = argmin_{b,c,γ} Σ_{i∈X^(1)} (g^(1)(x_i) − f^(1)(x_i))^2 + Σ_{j∈X^(2)} (g^(2)(x_j) − f^(2)(x_j))^2,   (7)

where X^(1) and X^(2) are the sets of unique test stimuli for the two training conditions (actual and idealized) in Experiment 2. 
We define two functions to describe the estimated and empirical probabilities, respectively: g^(cond)(x_i) = p̂(y_i = 1 | x_i, D^(cond)) and f^(cond)(x_i) = Σ_{j∈D^(cond): y_j=1} 1(x_j = x_i) / Σ_{j'∈D^(cond)} 1(x_{j'} = x_i). The function g above is defined using GCM in Equation 1. We solved Equation 7 to obtain the low-capacity GCM parameters that best capture human performance: {b̂, ĉ, γ̂} = {5.066, 2.964, 4.798}. We define a high-capacity GCM by changing only the γ̂ parameter, which is set an order of magnitude higher at γ̂ = 47.98.

4.2 Optimal Teaching Sets

The machine teacher was used to generate a variety of training sets that we evaluated on human learners. All training sets had size n = 20, which was chosen to maximize expected differences in human test performance across training sets. All conditions involved the same test conditional distribution p(y | x) (see Figure 1).

Figure 1: The test conditional distribution. Each point shows a test item z_i and its conditional probability of being in the category y = 1. The vertical dashed line shows the location of the true decision boundary θ* = 0.5.

The test set consisted of m = 60 representative items evenly spaced over the stimulus domain [0, 1] with a probabilistic category structure. The conditional distribution p(y = 1 | x = z_j) for j = 1 . . . 60 was adapted from a related study [3]. We then solved the machine teaching problem (6) to obtain the optimal teaching sets for low- and high-capacity learners.
The optimal training set for the low-capacity GCM places the items for each category in a clump far from the boundary (see Figure 2 for the optimal training sets). We refer to this optimal training set as Clump-Far. 
The placement of these items far from the boundary re\ufb02ects the low-capacity (i.e., low\n\u03b3 value) of the GCM. By separating the items from the two categories, the machine teacher makes\nit less likely that low-capacity GCM will erroneously retrieve items from the opposing category at\nthe time of test. As predicted, the machine teacher idealized the Clump-Far training set.\nA mathematical property of the high-capacity GCM suggests that it is sensitive only to the placement\nof training items adjacent to the decision boundary \u03b8\u2217 (all other training items have exponentially\nsmall in\ufb02uence). Therefore, for the high-capacity model up to computer precision, there is no unique\noptimal teaching set but rather a family of optimal sets (i.e., multiple teaching sets with the same\nloss or expected test error). We generated two training sets that are both optimal for the high-\ncapacity model. The Clump-Near training set has one clump of similar items for each category close\nto the boundary. In contrast, the Spread training set uniformly spaces items outward, mimicking\nthe idealization procedure in Gigu`ere and Love [3]. We also generated Random teaching sets by\nsampling from the joint distribution U (x)p(y | x), where U (x) is uniform in [0, 1] and p(y | x) is\nthe test conditional distribution. Note Random is the traditional iid training set in machine learning.\nThe test error of the low- and high-capacity GCM under Random teaching sets was estimated by\ngenerating 10,000 random teaching sets.\nTable 1 shows that Clump-Far outperforms other training sets for the low-capacity GCM. In con-\ntrast, Clump-Far, Clump-Near, and Spread are all optimal for high-capacity GCM, re\ufb02ecting the\nfact that for high-capacity GCM the symmetry of the inner-most training item pair about the true\ndecision boundary \u03b8\u2217 determines the learned model. 
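The teaching sets above were obtained by solving (6). As an illustration of the search, even a crude hill climber over item positions captures the qualitative behavior for a low-γ learner. The logistic test conditional below is a hypothetical stand-in for the distribution adapted from [3], and all parameter values are assumptions rather than the fitted ones.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 60
test_z = np.linspace(0.0, 1.0, m)                   # the m distinct test stimuli
p1 = 1.0 / (1.0 + np.exp(-8.0 * (test_z - 0.5)))    # assumed p(y = 1 | z)

def loss(train_x, b=1.0, c=3.0, gamma=5.0):
    """Objective (6): labels are tied to the side of theta* = 0.5."""
    train_x = np.asarray(train_x)
    total = 0.0
    for zj, pj in zip(test_z, p1):
        sim = np.exp(-c * np.abs(zj - train_x))
        s_pos = (b + sim[train_x >= 0.5].sum()) ** gamma
        s_neg = (b + sim[train_x < 0.5].sum()) ** gamma
        phat1 = s_pos / (s_pos + s_neg)
        total += pj * (1.0 - phat1) + (1.0 - pj) * phat1
    return total / m

# Hill climbing with Gaussian perturbations in place of a proper
# continuous solver; (6) is non-convex, so restarts would help.
n = 20
x = rng.uniform(0.0, 1.0, n)
best_x, best = x, loss(x)
for _ in range(2000):
    cand = np.clip(best_x + rng.normal(0.0, 0.05, n), 0.0, 1.0)
    cand_loss = loss(cand)
    if cand_loss < best:
        best_x, best = cand, cand_loss
```

Under a low γ, the surviving item positions tend to drift away from the boundary, mirroring the Clump-Far solution, while a clump placed just next to the boundary scores markedly worse under this loss.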
Not surprisingly, Random teaching sets lead to suboptimal test errors for both low- and high-capacity GCM.

Table 1: Loss (i.e., test error) for different teaching sets on low- and high-capacity GCM. Note that the smallest loss, 0.216, matches the optimal Bayes error rate.

GCM model     | Clump-Far | Spread | Clump-Near | Random
Low-capacity  | 0.245     | 0.261  | 0.397      | M=0.332, SD=0.040
High-capacity | 0.216     | 0.216  | 0.216      | M=0.262, SD=0.066

In summary, we produced four kinds of teaching sets: (1) Clump-Far, which is the optimal teaching set for the low-capacity GCM; (2) Spread; (3) Clump-Near, where the first three are all optimal teaching sets for the high-capacity GCM; and (4) Random. The next section discusses how human participants fare with each of these four training sets. Consistent with our predictions, the machine teacher's choices idealized the training sets, with parallels to the idealization procedures used in Giguère and Love [3]. They found that human learners benefited when within-category variance was reduced (akin to the clumping in Clump-Far and Clump-Near), training items were shifted away from the category boundary (akin to Clump-Far), and feedback was idealized (as in all the machine teaching sets considered). Their actual condition, in which training sets were not idealized, resembles the Random condition here. As hoped, low-capacity and high-capacity GCM make radically different

Figure 2: (A) The teaching sets. The points show the machine teaching sets. Overlapping training points are shown as clumps along with the number of items. A particular Random teaching set is shown. All training labels y were in {1, −1}, but dithered vertically for viewing clarity. 
(B) The\npredictive distribution \u02c6p(y = 1 | z, D) produced by the low-capacity GCM given a teaching set D.\nThe vertical dashed lines show the position of the true decision boundary \u03b8\u2217. The curves for the\nhigh-capacity GCM were omitted for space.\n\npredictions. Whereas high-capacity GCM is insensitive to variations across the machine teaching\nsets, low-capacity GCM should perform better under Clump-Far and Spread. The Clump-Near set\nleads to more errors in low-capacity GCM because items are confusable in memory and therefore\nlimited samples from memory can lead to suboptimal classi\ufb01cation decisions. In the next section,\nwe evaluate how humans perform with these four training sets, and compare human performance to\nthat of low- and high-capacity GCM.\n\n4.3 Human Study\n\nHuman participants were trained on one of the four training sets: Clump-Far, Spread, Clump-Near,\nand Random. Participants in all four conditions were tested (no corrective feedback provided) on\nthe m = 60 grid test items z1 . . . zm in [0, 1].\nParticipants. US-based participants (N = 600) were recruited via Amazon Mechanical Turk, a\npaid online crowd-sourcing platform, which is an effective method for recruiting demographically\ndiverse samples [19] and has been shown to yield results consistent with decision making studies in\nthe laboratory [20]. In our sample, 297 of the 600 participants were female and the average age was\n34.86. Participants were paid $1.00 for completing the study with the highest performing participant\nreceiving a $20 bonus.\nDesign. Participants were randomly assigned to one of the four teaching conditions (see Figure 2).\nNotice that feedback was deterministic in all the teaching sets provided by the machine teacher, but\nwas probabilistic as a function of stimulus for the Random condition. For the Random condition,\neach participant received a different sample of training items. 
The test set always consisted of 60 stimuli (see Figure 1). In both training and test trials, stimuli were presented sequentially in a random order (without replacement) determined for each participant.
Materials and Procedure. The stimuli were horizontal lines of various lengths. Participants learned to categorize these stimuli. The teaching-set values x_i ∈ [0, 1] were converted into pixels by multiplying them by 400 and adding an offset. The offset for each participant was a uniformly selected random number from 30 to 100. As the study was performed online (see below), screen size varied across participants (height x̄=879.16, s=143.34 and width x̄=1479.6, s=271.04).
During the training phase, on every trial, participants were instructed to fixate on a small cross appearing in a random position on the screen. After 1000 ms, a line stimulus replaced the cross at the same position. Participants were then to indicate their category decision by pressing a key ("F" or "J") as quickly as possible without sacrificing accuracy. Once the participant responded, the stimulus was immediately replaced by a feedback message ("Correct" or "Wrong"), which was displayed for 2000 ms. 

Figure 3: Human experiment results. Each bar corresponds to one of the training conditions. (A) The proportion of agreement of the individual training responses with the Bayes classifier. (B) The proportion of agreement of the individual test responses with the Bayes classifier. (C) Inconsistency in individual test responses. The error bars are 95% confidence intervals.

The screen coordinates (horizontal/vertical) de\ufb01ning the stimulus (i.e., \ufb01xation cross and\nline) position were randomized on each trial to prevent participants from using marks or smudges\non the screen as an aid. Participants completed 20 training trials.\nThe procedure was identical for test trials, except corrective feedback was not provided. Instead,\n\u201cThank You!\u201d was displayed following a response. The test phase consisted of 60 trials. At the\nend of the test phase each subject was asked to discriminate between the short and long lines from\nthe Clump-Near training set (i.e. x = 0.435 and x = 0.565, closest stimuli in the deterministically\nlabeled training sets). Both lines were presented side-by-side, with their order counterbalanced\nbetween participants. Each participant was asked to indicate which one of those is longer.\nResults. It is important that people could perceptually discriminate the categories for the exemplars\nclose to the boundary, especially for the Clump-Near condition in which all the exemplars are close\nto the boundary. At the end of the main study, this was measured by asking each participant to\nindicate the longer line between the two. Overall 97% participants correctly indicated the longer\nline. This did not differ across conditions, F (3, 596) < 0.84, p \u2248 0.47.\nThe optimal (i.e. Bayes) classi\ufb01er deterministically assigns correct class label \u02c6y = sign(x \u2212 \u03b8\u2217) to\nan item x. The agreement between training responses and the optimal classi\ufb01er were signi\ufb01cantly\ndifferent across the four teaching conditions, F (3, 596) = 66.97, p < 0.05. As expected, the random\nsets resulted in the lowest accuracy (M=65.2%) and the Clump-Far condition resulted in the highest\naccuracy (M=89.9%) (Figure 3A).\nFigure 3B shows how well the test responses agree with the Bayes classi\ufb01er. 
The proportional agreement was significantly different across conditions, F(3, 596) = 9.16, p < 0.05. The Clump-Far and Spread conditions differed significantly from the Clump-Near condition, t(228.05) = 3.22, p < 0.05 and t(243.84) = 4.21, p < 0.05, respectively, and from the Random condition, t(290.84) = 2.39, p < 0.05 and t(297.37) = 3.71, p < 0.05, respectively. The Clump-Far and Spread conditions did not differ, t(294.32) = 1.55, p ≈ 0.12. This result shows that the subjects in the Clump-Far and Spread conditions performed more similarly to the Bayes classifier than the subjects in the other two conditions.
Individual test response inconsistency can be calculated as the number of neighboring stimuli that are categorized in opposite categories [3]. This measure of inconsistency attempts to quantify stochastic memory retrieval; higher inconsistency reflects noisier memory sampling. Inconsistency significantly differed between the conditions, F(3, 596) = 7.73, p < 0.05 (Figure 3C). Both the Clump-Far and Spread teaching sets showed lower inconsistency, suggesting that those teaching sets lead to less noisy memory sampling. The inconsistencies for these two conditions did not differ significantly, two-sample t test, t(290.42) = 1.54, p ≈ 0.12. Inconsistencies in the Clump-Far and Spread conditions significantly differed from Clump-Near, t(281.7) = −2.53, p < 0.05 and t(291.04) = −2.58, p < 0.05, respectively, and from Random, t(259.18) = −3.98, p < 0.05 and t(272.12) = −4.14, p < 0.05, respectively.

We then calculated the test loss for each subject as Σ_{i=1}^m (1 − p(h_i | z_i)), where h_i is the response for stimulus z_i. Figure 4 compares the observed and estimated test performance (i.e., 1 − loss()) in the four conditions. 
Overall, human performance is more closely followed by the low-capacity GCM.

Figure 4: Empirical test performance of human learners and of the low- and high-capacity GCM in the four teaching conditions. Test performance is measured as 1 − loss() (see (3)). Humans follow the low-capacity GCM more closely. The error bars are 95% confidence intervals.

Human performance across the four conditions was significantly different, F(3, 596) = 11.15, p < 0.05. The Clump-Far and Spread conditions did not significantly differ, t(295.96) = −0.8, p ≈ 0.42. Test performance in the Clump-Far and Spread conditions significantly differed from the Clump-Near condition, t(226.9) = 4.12, p < 0.05 and t(287.97) = 2.19, p < 0.05, respectively, and from the Random condition, t(238.41) = 4.59, p < 0.05 and t(294.72) = 2.85, p < 0.05, respectively. Humans performed significantly worse in the Clump-Near condition than in the Random condition, t(253.94) = −2.394, p < 0.05. A similar pattern was observed for the low-capacity GCM, while the opposite held for the high-capacity GCM. Inconsistency, as defined above, significantly correlated with the test loss, Pearson's r = 0.56, t(148) = 8.34, p < 0.05. Taken together, these results provide support for the low-capacity account of human decision making [3].
In order to check whether the variability within the training set is predictive of test performance, we correlated the observed test loss with the estimated loss for the subjects in the Random condition. We observed a significant correlation between the test loss and the estimated loss for both the low- and high-capacity models, Pearson's r = 0.273, t(148) = 3.45, p < 0.05 and r = 0.203, t(148) = 2.52, p < 0.05, respectively. 
This result indicates that, owing to their limited capacity, human learners benefit from lower variability in the training sets, i.e., idealization.
The individual median reaction time in the training phase differed significantly across teaching conditions, F(3, 596) = 10.66, p < 0.05. The training median reaction time for the Clump-Far condition was the shortest (M = 761 ms, SD = 223) and differed significantly from all other conditions, two-sample t tests, all p < 0.05. The other conditions did not differ significantly from each other. The individual median reaction times in the test phase (M = 767 ms, SD = 187) did not differ across teaching conditions, F(3, 596) = 0.95, p ≈ 0.42.
Taken together, our results suggest that the recommendations of the machine teacher for the low-capacity GCM are indeed effective for human learners. Furthermore, the lower inconsistency observed in this condition suggests that the machine teacher performs idealization, which helps by reducing noise in the stochastic memory-sampling process.

5 Discussion

A major aim of cognitive science is to understand human learning and to improve learning performance. We devised an optimal teacher for human category learning, a fundamental problem in cognitive science. Based on recent research, we focused on the GCM, which models humans' limited capacity for exemplar retrieval during decision making. We developed optimal teaching sets for the low- and high-capacity variants of the GCM learner. Using a 1D category learning task, we showed that the optimal teaching set for the low-capacity GCM is clumped, symmetrical, and located far from the decision boundary, which makes it intuitively easy to learn. This provides a normative basis (given capacity limits) for idealization procedures that reduce the saliency of ambiguous cases [2, 3]. The optimal teaching set indeed proved effective for human learning.
Future work will pursue several extensions.
One interesting topic not considered here is how the order of training examples affects learning. One possibility is that the optimal teacher will recommend easy examples early in training and then gradually progress to harder cases [2, 21]. Another important extension is the use of multi-dimensional stimuli.

Acknowledgments

The authors thank the anonymous reviewers for their comments. This work is partly supported by Leverhulme Trust grant RPG-2014-075 to BCL, National Science Foundation grant IIS-0953219 to XZ, and WT-MIT fellowship 103811AIA to KRP.

References

[1] P. Shafto and N. Goodman. A Bayesian Model of Pedagogical Reasoning. In AAAI Fall Symposium: Naturally-Inspired Artificial Intelligence '08, pages 101-102, 2008.
[2] Y. Bengio, J. Louradour, R. Collobert, and J. Weston. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning (ICML '09), pages 1-8, New York, USA, June 2009. ACM Press.
[3] G. Giguère and B. C. Love. Limits in decision making arise from limits in memory retrieval. Proceedings of the National Academy of Sciences, 110(19):7613-7618, May 2013.
[4] A. N. Hornsby and B. C. Love. Improved classification of mammograms following idealized training. Journal of Applied Research in Memory and Cognition, 3:72-76, 2014.
[5] J. Q. Candela, M. Sugiyama, A. Schwaighofer, and N. D. Lawrence, editors. Dataset Shift in Machine Learning. MIT Press, first edition, 2009.
[6] X. Zhu. Machine Teaching for Bayesian Learners in the Exponential Family. In Advances in Neural Information Processing Systems, pages 1905-1913, 2013.
[7] S. A. Goldman and M. J. Kearns. On the Complexity of Teaching. Journal of Computer and System Sciences, 50(1):20-31, 1995.
[8] F. Khan, X. Zhu, and B. Mutlu.
How Do Humans Teach: On Curriculum Learning and Teaching Dimension. In Advances in Neural Information Processing Systems, pages 1449-1457, 2011.
[9] F. J. Balbach and T. Zeugmann. Recent Developments in Algorithmic Teaching. In A. H. Dediu, A. M. Ionescu, and C. Martín-Vide, editors, Language and Automata Theory and Applications, volume 5457 of Lecture Notes in Computer Science, pages 1-18. Springer, Berlin-Heidelberg, March 2009.
[10] M. Cakmak and M. Lopes. Algorithmic and Human Teaching of Sequential Decision Tasks. In AAAI Conference on Artificial Intelligence (AAAI-12), July 2012.
[11] R. Lindsey, M. Mozer, W. J. Huggins, and H. Pashler. Optimizing Instructional Policies. In Advances in Neural Information Processing Systems, pages 2778-2786, 2013.
[12] B. C. Love. Categorization. In K. N. Ochsner and S. M. Kosslyn, editors, Oxford Handbook of Cognitive Neuroscience, pages 342-358. Oxford University Press, 2013.
[13] D. L. Medin and M. M. Schaffer. Context theory of classification learning. Psychological Review, 85(3):207-238, 1978.
[14] R. M. Nosofsky. Attention, similarity, and the identification-categorization relationship. Journal of Experimental Psychology: General, 115(1):39-61, March 1986.
[15] M. L. Mack, A. R. Preston, and B. C. Love. Decoding the brain's algorithm for categorization from its neural implementation. Current Biology, 23:2023-2027, 2013.
[16] Y. Chen, E. K. Garcia, M. R. Gupta, A. Rahimi, and L. Cazzanti. Similarity-based Classification: Concepts and Algorithms. Journal of Machine Learning Research, 10:747-776, December 2009.
[17] F. Jäkel, B. Schölkopf, and F. A. Wichmann. Does cognitive science need kernels? Trends in Cognitive Sciences, 13(9):381-388, 2009.
[18] R. M. Nosofsky and T. J. Palmeri. An exemplar-based random walk model of speeded classification.
Psychological Review, 104(2):266-300, April 1997.
[19] M. Buhrmester, T. Kwang, and S. D. Gosling. Amazon's Mechanical Turk: A New Source of Inexpensive, Yet High-Quality, Data? Perspectives on Psychological Science, 6(1):3-5, February 2011.
[20] M. J. C. Crump, J. V. McDonnell, and T. M. Gureckis. Evaluating Amazon's Mechanical Turk as a tool for experimental behavioral research. PLoS ONE, 8(3):e57410, January 2013.
[21] H. Pashler and M. C. Mozer. When does fading enhance perceptual category learning? Journal of Experimental Psychology: Learning, Memory, and Cognition, 39(4):1162-1173, July 2013.