{"title": "Machine Teaching for Bayesian Learners in the Exponential Family", "book": "Advances in Neural Information Processing Systems", "page_first": 1905, "page_last": 1913, "abstract": "What if there is a teacher who knows the learning goal and wants to design good training data for a machine learner? We propose an optimal teaching framework aimed at learners who employ Bayesian models. Our framework is expressed as an optimization problem over teaching examples that balance the future loss of the learner and the effort of the teacher. This optimization problem is in general hard. In the case where the learner employs conjugate exponential family models, we present an approximate algorithm for finding the optimal teaching set. Our algorithm optimizes the aggregate sufficient statistics, then unpacks them into actual teaching examples. We give several examples to illustrate our framework.", "full_text": "Machine Teaching for Bayesian Learners\n\nin the Exponential Family\n\nDepartment of Computer Sciences, University of Wisconsin-Madison\n\nXiaojin Zhu\n\nMadison, WI, USA 53706\n\njerryzhu@cs.wisc.edu\n\nAbstract\n\nWhat if there is a teacher who knows the learning goal and wants to design good\ntraining data for a machine learner? We propose an optimal teaching framework\naimed at learners who employ Bayesian models. Our framework is expressed as\nan optimization problem over teaching examples that balance the future loss of the\nlearner and the effort of the teacher. This optimization problem is in general hard.\nIn the case where the learner employs conjugate exponential family models, we\npresent an approximate algorithm for \ufb01nding the optimal teaching set. Our algo-\nrithm optimizes the aggregate suf\ufb01cient statistics, then unpacks them into actual\nteaching examples. We give several examples to illustrate our framework.\n\n1\n\nIntroduction\n\nConsider the simple task of learning a threshold classi\ufb01er in 1D (Figure 1). 
There is an unknown threshold θ ∈ [0, 1]. For any item x ∈ [0, 1], its label y is white if x < θ and black otherwise. After seeing n training examples the learner's estimate is θ̂. What is the error |θ̂ − θ|? The answer depends on the learning paradigm. If the learner receives iid noiseless training examples where x ∼ uniform[0, 1], then with large probability |θ̂ − θ| = O(1/n). This is because the inner-most white and black items are 1/(n + 1) apart on average. If the learner performs active learning and an oracle provides noiseless labels, then the error reduces faster, |θ̂ − θ| = O(1/2^n), since the optimal strategy is binary search. However, a helpful teacher can simply teach with n = 2 items, (θ − ε/2, white) and (θ + ε/2, black), to achieve an arbitrarily small error ε. The key difference is that an active learner still needs to explore the boundary, while a teacher can guide.

Figure 1: Teaching can require far fewer examples than passive or active learning

We impose the restriction that teaching be conducted only via teaching examples (rather than somehow directly giving the parameter θ to the learner). What, then, are the best teaching examples? Understanding the optimal teaching strategies is important for both machine learning and education: (i) When the learner is a human student (as modeled in cognitive psychology), optimal teaching theory can design the best lessons for education. (ii) In cyber-security the teacher may be an adversary attempting to mislead a machine learning system via "poisonous training examples." Optimal teaching quantifies the power and limits of such adversaries.
(iii) Optimal teaching informs robots as to the best ways to utilize human teaching, and vice versa.

[Figure 1 panels: passive learning "waits" (error O(1/n)); active learning "explores" (error O(1/2^n)); teaching "guides" (two examples straddling θ).]

Our work builds upon three threads of research. The first thread is the teaching dimension theory by Goldman and Kearns [10] and its extensions in computer science (e.g., [1, 2, 11, 12, 14, 25]). Our framework allows for probabilistic, noisy learners with infinite hypothesis spaces, arbitrary loss functions, and the notion of teaching effort. Furthermore, in Section 3.2 we will show that the original teaching dimension is a special case of our framework. The second thread is the research on representativeness and pedagogy in cognitive science. Tenenbaum and Griffiths were the first to note that representative data is data that maximizes the posterior probability of the target model [22]. Their work on Gaussian distributions, and later work by Rafferty and Griffiths on multinomial distributions [19], find representative data by matching sufficient statistics. Our framework can be viewed as a generalization. Specifically, their work corresponds to the specific choice (to be defined in Section 2) of loss() = KL divergence and effort() being either zero or an indicator function that fixes the data set size at n. We make it explicit that these functions can have other designs. Importantly, we also show that there are non-trivial interactions between loss() and effort(), such as not teaching at all in Example 4, or non-brute-force teaching in Example 5. An interesting variant studied in cognitive science is when the learner expects to be taught [20, 8]. We defer the discussion of this variant, known as "collusion" in computational teaching theory, and its connection to information theory to Section 5.
In addition, our optimal teaching framework may shed light on the optimality of different methods of teaching humans [9, 13, 17, 18]. The third thread is the research on better ways to train machine learners, such as curriculum learning or easy-to-hard ordering of training items [3, 15, 16], and optimal reward design in reinforcement learning [21]. Interactive systems have been built which employ or study teaching heuristics [4, 6]. Our framework provides a unifying optimization view that balances the future loss of the learner and the effort of the teacher.

2 Optimal Teaching for General Learners

We start with a general framework for teaching and gradually specialize the framework in later sections. Our framework consists of three entities: the world, the learner, and the teacher. (i) The world is defined by a target model θ*. Future test items for the learner will be drawn iid from this model. This is the same as in standard machine learning. (ii) The learner has to learn θ* from training data. Without loss of generality let θ* ∈ Θ, the hypothesis space of the learner (if not, we can always admit approximation error and define θ* to be the distribution in Θ closest to the world distribution). The learner is the same as in standard machine learning (learners who anticipate being taught are discussed in Section 5). The training data, however, is provided by a teacher. (iii) The teacher is the new entity in our framework. It is almost omnipotent: it knows the world θ*, the learner's hypothesis space Θ, and, importantly, how the learner learns given any training data.¹ However, it can only teach the learner by providing teaching (or, from the learner's perspective, training) examples. The teacher's goal is to design a teaching set D so that the learner learns θ* as accurately and effortlessly as possible.
In this paper, we consider batch teaching, where the teacher presents D to the learner all at once, and the teacher can use any item in the example domain. Being completely general, we leave many details unspecified. For instance, the world's model can be supervised p(x, y; θ*) or unsupervised p(x; θ*); the learner may or may not be probabilistic; and when it is, Θ can be parametric or nonparametric. Nonetheless, we can already propose a generic optimization problem for optimal teaching:

    min_D  loss(f̂_D, θ*) + effort(D).    (1)

The function loss() measures the learner's deviation from the desired θ*. The quantity f̂_D represents the state of the learner after seeing the teaching set D. The function effort() measures the difficulty the teacher experiences when teaching with D. Despite its appearance, the optimal teaching problem (1) is completely different from regularized parameter estimation in machine learning. The desired parameter θ* is known to the teacher. The optimization is instead over the teaching set D. This can be a difficult combinatorial problem – for instance we need to optimize over the cardinality of D. Neither is the effort function a regularizer. The optimal teaching problem (1) so far is rather abstract. For the sake of concreteness we next focus on a rich family of learners, namely Bayesian models. However, we note that our framework can be adapted to other types of learners, as long as we know how they react to the teaching set D.

¹This is a strong assumption.
It can be relaxed in future work, where the teacher has to estimate the state of the learner by "probing" it with tests.

3 Optimal Teaching for Bayesian Learners

We focus on Bayesian learners because they are widely used in both machine learning and cognitive science [7, 23, 24] and because of their predictability: they react to any teaching examples in D by performing Bayesian updates.² Before teaching, a Bayesian learner's state is captured by its prior distribution p0(θ). Given D, the learner's likelihood function is p(D | θ). Both the prior and the likelihood are assumed to be known to the teacher. The learner's state after seeing D is the posterior distribution

    f̂_D ≡ p(θ | D) = (∫_Θ p0(π) p(D | π) dπ)⁻¹ p0(θ) p(D | θ).

3.1 The KL Loss and Various Effort Functions, with Examples

The choice of loss() and effort() is problem-specific and depends on the teaching goal. In this paper, we will use the Kullback-Leibler divergence, so that loss(f̂_D, θ*) = KL(δ_θ* ‖ p(θ | D)), where δ_θ* is a point mass distribution at θ*.³ This loss encourages the learner's posterior to concentrate around the world model θ*. With the KL loss, it is easy to verify that the optimal teaching problem (1) can be equivalently written as

    min_D  − log p(θ* | D) + effort(D).    (2)

We remind the reader that this is not a MAP estimation problem. Instead, the intuition is to find a good teaching set D that makes θ* "stand out" in the posterior distribution. The effort() function reflects resource constraints on the teacher and the learner: how hard is it to create the teaching examples, to deliver them to the learner, and to have the learner absorb them?
For most of the paper we use the cardinality of the teaching set, effort(D) = c|D|, where c is a positive per-item cost. This assumes that the teaching effort is proportional to the number of teaching items, which is reasonable in many problems. We will demonstrate a few other effort functions in the examples below.

How good is any teaching set D? We hope D guides the learner's posterior toward the world's θ*, but we also hope D takes little effort to teach. The proper quality measure is the objective value (2), which balances the loss() and effort() terms.

Definition 1 (Teaching Impedance). The Teaching Impedance (TI) of a teaching set D is the objective value − log p(θ* | D) + effort(D). The lower the TI, the better.

We now give examples to illustrate our optimal teaching framework for Bayesian learners.

Example 1 (Teaching a 1D threshold classifier). The classification task is the same as in Figure 1, with x ∈ [0, 1] and y ∈ {−1, 1}. The parameter space is Θ = [0, 1]. The world has a threshold θ* ∈ Θ. Let the learner's prior be uniform, p0(θ) = 1. The learner's likelihood function is p(y = 1 | x, θ) = 1 if x ≥ θ and 0 otherwise.

The teacher wants the learner to arrive at a posterior p(θ | D) peaked at θ* by designing a small D. As discussed above, this can be formulated as (2) with the KL loss() and the cardinality effort() functions: min_D − log p(θ* | D) + c|D|. For any teaching set D = {(x1, y1), . . . , (xn, yn)}, the learner's posterior is simply p(θ | D) = uniform[max_{i:yi=−1}(xi), min_{i:yi=1}(xi)], namely uniform over the version space consistent with D. The optimal teaching problem becomes

    min_{n, x1, y1, ..., xn, yn}  − log ( 1 / (min_{i:yi=1}(xi) − max_{i:yi=−1}(xi)) ) + cn.
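The version-space posterior above makes the objective easy to evaluate: −log p(θ* | D) is the log-width of the interval [max_{i:yi=−1} xi, min_{i:yi=1} xi]. The following is a minimal sketch of that computation (our code, not from the paper; the function name and the cost c = 0.1 are illustrative choices):

```python
import math

def teaching_impedance_1d(D, theta_star, c):
    """Teaching Impedance for Example 1's threshold learner.

    The posterior is uniform over the version space, so
    -log p(theta* | D) = log(interval width) when theta* lies inside it.
    """
    lo = max([x for x, y in D if y == -1], default=0.0)
    hi = min([x for x, y in D if y == +1], default=1.0)
    if not (lo <= theta_star <= hi):
        return math.inf  # theta* fell outside the version space
    return math.log(hi - lo) + c * len(D)

# Shrinking the gap eps drives TI = log(eps) + 2c toward -infinity.
theta, c = 0.4, 0.1
for eps in (0.1, 0.01, 0.001):
    D = [(theta - eps / 2, -1), (theta + eps / 2, +1)]
    print(eps, teaching_impedance_1d(D, theta, c))
```

With an empty teaching set the version space is all of [0, 1], giving TI = log 1 = 0, which the two well-placed examples quickly beat.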
One solution is the limiting case with a teaching set of size two, D = {(θ* − ε/2, −1), (θ* + ε/2, 1)} as ε → 0, since the Teaching Impedance TI = log(ε) + 2c approaches −∞. In other words, the teacher teaches with two examples arbitrarily close to, but on opposite sides of, the decision boundary, as in Figure 1 (right).

Example 2 (Learner cannot tell small differences apart). Same as Example 1, but the learner has poor perception (e.g., children or robots) and cannot distinguish similar items very well. We may encode this in effort() as, for example, effort(D) = c / min_{xi,xj∈D} |xi − xj|. That is, the teaching examples require more effort to learn if any two items are too close. With two teaching examples as in Example 1, TI = log(ε) + c/ε. It attains its minimum at ε = c. The optimal teaching set is D = {(θ* − c/2, −1), (θ* + c/2, 1)}.

Example 3 (Teaching to pick one model out of two). There are two Gaussian distributions, θ_A = N(−1/4, 1/2) and θ_B = N(1/4, 1/2). The learner has Θ = {θ_A, θ_B}, and we want to teach it the fact that the world is using θ* = θ_A. Let the learner have equal prior p0(θ_A) = p0(θ_B) = 1/2. The learner observes examples x ∈ R, and its likelihood function is p(x | θ) = N(x | θ). Let D = {x1, . . .

²Bayesian learners typically assume that the training data is iid; optimal teaching intentionally violates this assumption because the designed teaching examples in D will typically be non-iid. However, the learners are oblivious to this fact and will perform Bayesian updates as usual.

³If we allow the teacher to be uncertain about the world θ*, we may encode the teacher's own belief as a distribution p*(θ) and replace δ_θ* with p*(θ).
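The trade-off in Example 2 can be checked numerically: the derivative of log(ε) + c/ε vanishes at ε = c. The grid search below is our sketch (the cost c = 0.05 is an arbitrary illustrative value):

```python
import math

def ti_example2(eps, c):
    # TI for Example 2: two examples eps apart, effort() = c / (minimum gap).
    return math.log(eps) + c / eps

c = 0.05
# d/d(eps) [log(eps) + c/eps] = 1/eps - c/eps^2 = 0 at eps = c;
# a grid over [0.5c, 1.5c] should land on that point.
grid = [c * (0.5 + 0.01 * k) for k in range(101)]
best = min(grid, key=lambda e: ti_example2(e, c))
print(best)  # approximately c
```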
, xn}. With these specific parameters, the KL loss can be shown to be − log p(θ* | D) = log(1 + ∏_{i=1}^n exp(xi)).

For this example, let us suppose that teaching with extreme item values is undesirable (note that xi → −∞ minimizes the KL loss). We combine cardinality and range preferences in effort(D) = cn + Σ_{i=1}^n I(|xi| ≤ d), where the indicator function I(z) = 0 if z is true, and +∞ otherwise. In other words, the teaching items must be in some interval [−d, d]. This leads to the optimal teaching problem

    min_{n, x1, ..., xn}  log(1 + ∏_{i=1}^n exp(xi)) + cn + Σ_{i=1}^n I(|xi| ≤ d).

This is a mixed integer program (even harder – the number of variables has to be optimized as well). We first relax n to real values. By inspection, the solution is to let all xi = −d and let n minimize TI = log(1 + exp(−dn)) + cn. The minimum is achieved at n = (1/d) log(d/c − 1). We then round n and force nonnegativity: n = max(0, [(1/d) log(d/c − 1)]). This D is sensible: θ* = θ_A is the model on the left, and showing the learner n copies of −d lends the most support to that model. Note, however, that n = 0 for certain combinations of c, d (e.g., when c ≥ d): the effort of teaching outweighs the benefit. The teacher may choose not to teach at all and maintain the status quo (prior p0) of the learner!

3.2 Teaching Dimension is a Special Case

In this section we provide a comparison to one of the most influential teaching models, namely the original teaching dimension theory [10]. It may seem that our optimal teaching setting (2) is more restrictive than theirs, since we make strong assumptions about the learner (that it is Bayesian, and the form of the prior and likelihood).
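The closed-form cardinality of Example 3, including the no-teaching regime c ≥ d, can be sketched directly (our code; the rounding mirrors the max(0, [·]) step in the text, and the concrete c, d values are illustrative):

```python
import math

def optimal_n(c, d):
    """Relaxed optimum n = (1/d) log(d/c - 1) of Example 3, rounded and
    clipped at zero (the teacher declines to teach when effort wins)."""
    if c >= d:                      # log argument would be <= 0
        return 0
    return max(0, round(math.log(d / c - 1) / d))

def ti_example3(n, c, d):
    # All n items placed at x = -d, supporting theta_A on the left.
    return math.log(1 + math.exp(-d * n)) + c * n

print(optimal_n(0.1, 1.0))  # a couple of copies of -d
print(optimal_n(2.0, 1.0))  # 0: better to keep the learner's prior
```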
Their query learning setting in fact makes equally strong assumptions, in that the learner updates its version space to be consistent with all teaching items. Indeed, we can cast their setting as a Bayesian learning problem, showing that their problem is a special case of (2). Corresponding to the concept class C = {c} in [10], we define the conditional probability

    P(y = 1 | x, θ_c) = 1 if c(x) = +, and 0 if c(x) = −,

and the joint distribution P(x, y | θ_c) = P(x) P(y | x, θ_c), where P(x) is uniform over the domain X. The world has θ* = θ_{c*}, corresponding to the target concept c* ∈ C. The learner has Θ = {θ_c | c ∈ C}. The learner's prior is p0(θ) = uniform(Θ) = 1/|C|, and its likelihood function is P(x, y | θ_c). The learner's posterior after teaching with D is

    P(θ_c | D) = 1/(number of concepts in C consistent with D) if c is consistent with D, and 0 otherwise.    (3)

The teaching dimension TD(c*) is the minimum cardinality of D that uniquely identifies the target concept. We can formulate this using our optimal teaching framework:

    min_D  − log P(θ_{c*} | D) + γ|D|,    (4)

where we used the cardinality effort() function (and renamed the cost γ for clarity). We can make sure that the loss term is minimized to 0, corresponding to successfully identifying the target concept, if γ < 1/TD(c*). But since TD(c*) is unknown beforehand, we can set γ ≤ 1/|C|, since |C| ≥ TD(c*) (one can at least eliminate one concept from the version space with each well-designed teaching item).
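Formulation (4) can be made concrete on a toy concept class. The brute-force search below is our illustration (the threshold class, the domain {0, ..., 3}, and the choice γ = 1/|C| are assumptions for the example, not from the paper):

```python
from itertools import combinations
from math import log

# Five 1D threshold concepts c_t(x) = [x >= t] on the domain X = {0, 1, 2, 3}.
concepts = [lambda x, t=t: x >= t for t in range(5)]
c_star = concepts[2]                       # target concept
labeled = [(x, c_star(x)) for x in range(4)]
gamma = 1.0 / len(concepts)                # gamma <= 1/|C|, as in the text

def objective(D):
    consistent = [c for c in concepts if all(c(x) == y for x, y in D)]
    # -log P(theta_{c*} | D) = log(#consistent concepts), by (3)
    return log(len(consistent)) + gamma * len(D), len(consistent)

best = min((D for r in range(len(labeled) + 1)
            for D in combinations(labeled, r)),
           key=lambda D: objective(D)[0])
print(len(best), objective(best)[1])  # minimum teaching set; one concept left
```

The minimizer is the pair straddling the target threshold, so |D| = TD(c*) = 2 for this class, matching the version-space intuition.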
The solution D to (4) is then a minimum teaching set for the target concept c*, and |D| = TD(c*).

4 Optimal Teaching for Bayesian Learners in the Exponential Family

While we have proposed an optimization-based framework for teaching any Bayesian learner and provided three examples, it is not clear whether there is a unified approach to solving the optimization problem (2). In this section, we further restrict ourselves to a subset of Bayesian learners whose prior and likelihood are in the exponential family and are conjugate. For this subset of Bayesian learners, finding the optimal teaching set D naturally decomposes into two steps: In the first step one solves a convex optimization problem to find the optimal aggregate sufficient statistics for D. In the second step one "unpacks" the aggregate sufficient statistics into actual teaching examples. We present an approximate algorithm for doing so.

We recall that an exponential family distribution (see e.g. [5]) takes the form p(x | θ) = h(x) exp(θᵀT(x) − A(θ)), where T(x) ∈ R^D is the D-dimensional sufficient statistics of x, θ ∈ R^D is the natural parameter, A(θ) is the log partition function, and h(x) modifies the base measure. For a set D = {x1, . . . , xn}, the likelihood function under the exponential family takes a similar form, p(D | θ) = (∏_{i=1}^n h(xi)) exp(θᵀs − nA(θ)), where we define

    s ≡ Σ_{i=1}^n T(xi)    (5)

to be the aggregate sufficient statistics over D. The corresponding conjugate prior is the exponential family distribution with natural parameters (λ1, λ2) ∈ R^D × R: p(θ | λ1, λ2) = h0(θ) exp(λ1ᵀθ − λ2 A(θ) − A0(λ1, λ2)).
The posterior distribution is p(θ | D, λ1, λ2) = h0(θ) exp((λ1 + s)ᵀθ − (λ2 + n)A(θ) − A0(λ1 + s, λ2 + n)). The posterior has the same form as the prior but with natural parameters (λ1 + s, λ2 + n). Note that the data D enters the posterior only via the aggregate sufficient statistics s and the cardinality n. If we further assume that effort(D) can be expressed in n and s, then we can write our optimal teaching problem (2) as

    min_{n,s}  −θ*ᵀ(λ1 + s) + A(θ*)(λ2 + n) + A0(λ1 + s, λ2 + n) + effort(n, s),    (6)

where n ∈ Z_{≥0} and s ∈ {t ∈ R^D | ∃{xi}_{i∈I} such that t = Σ_{i∈I} T(xi)}. We relax the problem to n ∈ R and s ∈ R^D, resulting in a lower bound of the original objective.⁴ Since the log partition function A0() is convex in its parameters, we have a convex optimization problem (6) at hand if we design effort(n, s) to be convex, too. Therefore, the main advantage of using the exponential family distribution and conjugacy is this convex formulation, which we use to efficiently optimize over n and s. This forms the first step in finding D.

However, we cannot directly teach with the aggregate sufficient statistics. We first turn n back into an integer by max(0, [n]), where [] denotes rounding.⁵ We then need to find n teaching examples whose aggregate sufficient statistics is s. The difficulty of this second "unpacking" step depends on the form of the sufficient statistics T(x). For some exponential family distributions unpacking is trivial. For example, the exponential distribution has T(x) = x. Given n and s we can easily unpack the teaching set D = {x1, . . . , xn} by x1 = . . . = xn = s/n. The Poisson distribution has T(x) = x as well, but the items x need to be integers. This is still relatively easy to achieve by rounding x1, . .
. , xn and making adjustments to make sure they still sum to s. The univariate Gaussian distribution has T(x) = (x, x²) and unpacking is harder: given n = 3, s = (3, 5) it may not be immediately obvious that we can unpack into {x1 = 0, x2 = 1, x3 = 2} or even {x1 = 1/2, x2 = (5 + √13)/4, x3 = (5 − √13)/4}. Clearly, unpacking is not unique.

In this paper, we use an approximate unpacking algorithm. We initialize the n teaching examples by xi ∼ iid p(x | θ*), i = 1 . . . n.⁶ We then improve the examples by solving an unconstrained optimization problem to match the examples' aggregate sufficient statistics to the given s:

    min_{x1, ..., xn}  ‖s − Σ_{i=1}^n T(xi)‖².    (7)

This problem is non-convex in general but can be solved up to a local minimum. The gradient is ∂/∂xj = −2 (s − Σ_i T(xi))ᵀ T′(xj). Additional post-processing, such as enforcing x to be integers, is then carried out if necessary.

⁴For higher solution quality we may impose certain convex constraints on s based on the structure of T(x). For example, the univariate Gaussian has T(x) = (x, x²). Let s = (s1, s2). It is easy to show that s must satisfy the constraint s2 ≥ s1²/n.

⁵Better results can be obtained by comparing the objective of (6) under several integers around n and picking the smallest one.

⁶As we will see later, such iid samples from the target distribution are not great teaching examples for two main reasons: (i) We really should compensate for the learner's prior by aiming not at the target distribution but overshooting a bit in the opposite direction of the prior. (ii) Randomness in the samples also prevents them from achieving the aggregate sufficient statistics.
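For the univariate Gaussian, T(x) = (x, x²), the unpacking objective (7) can be attacked with plain gradient descent. The following is a minimal sketch (our code; the step size and iteration count are arbitrary choices):

```python
import random

def unpack(n, s_target, steps=4000, lr=0.01):
    """Gradient descent on ||s - sum_i T(x_i)||^2 with T(x) = (x, x^2)."""
    xs = [random.gauss(0, 1) for _ in range(n)]   # iid initialization
    for _ in range(steps):
        r1 = s_target[0] - sum(xs)                # residual in sum x_i
        r2 = s_target[1] - sum(x * x for x in xs) # residual in sum x_i^2
        # d/dx_j of the objective is -2 (r1 + 2 * r2 * x_j)
        xs = [x + 2 * lr * (r1 + 2 * r2 * x) for x in xs]
    return xs

random.seed(0)
xs = unpack(3, (3.0, 5.0))
print(sum(xs), sum(x * x for x in xs))  # both close to the targets (3, 5)
```

Different random seeds land on different local solutions, illustrating that unpacking is not unique.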
The complete algorithm is summarized in Algorithm 1.

Algorithm 1 Approximately optimal teaching for Bayesian learners in the exponential family
input: target θ*; learner information T(), A(), A0(), λ1, λ2; effort()
  Step 1: Solve for the aggregate sufficient statistics n, s by convex optimization (6)
  Step 2: Unpacking: n ← max(0, [n]); find x1, . . . , xn by (7)
output: D = {x1, . . . , xn}

We illustrate Algorithm 1 with several examples.

Example 4 (Teaching the mean of a univariate Gaussian). The world consists of a Gaussian N(x; μ*, σ²), where σ² is fixed and known to the learner while μ* is to be taught. In exponential family form p(x | θ) = h(x) exp(θT(x) − A(θ)) with T(x) = x alone (since σ² is fixed), θ = μ/σ², A(θ) = μ²/(2σ²) = θ²σ²/2, and h(x) = (√(2π)σ)⁻¹ exp(−x²/(2σ²)). Its conjugate prior (which is the learner's initial state) is Gaussian with the form p(θ | λ1, λ2) = h0(θ) exp(λ1θ − λ2 θ²σ²/2 − A0(λ)), where A0(λ1, λ2) = λ1²/(2σ²λ2) − (1/2) log(σ²λ2).

To find a good teaching set D, in step 1 we first find its optimal cardinality n and aggregate sufficient statistics s = Σ_{i∈D} xi using (6). The optimization problem becomes

    min_{n,s}  −θ*s + (σ²θ*²/2) n + (λ1 + s)²/(2σ²(λ2 + n)) − (1/2) log(σ²(λ2 + n)) + effort(n, s)    (8)

where θ* = μ*/σ². The result is more intuitive if we rewrite the conjugate prior in its standard form μ ∼ N(μ | μ0, σ0²), with the relation λ1 = μ0σ²/σ0², λ2 = σ²/σ0².
With this notation, the optimal aggregate sufficient statistics is

    s = (σ²/σ0²)(μ* − μ0) + μ*n.    (9)

Note an interesting fact here: the average of the teaching examples, s/n, is not the target μ*, but should compensate for the learner's initial belief μ0. This is the "overshoot" discussed earlier. Putting (9) back in (8), the optimization over n becomes min_n −(1/2) log σ²(σ²/σ0² + n) + effort(n). Consider any differentiable effort function (w.r.t. the relaxed n) with derivative effort′(n); the optimal n is the solution to σ²/σ0² + n − 1/(2 effort′(n)) = 0. For example, with the cardinality effort(n) = cn we have n = 1/(2c) − σ²/σ0².

In step 2 we unpack n and s into D. We discretize n by max(0, [n]). Another interesting fact is that the optimal teaching strategy may be to not teach at all (n = 0). This is the case when the learner has literally a narrow mind to start with: σ0² < 2cσ² (recall σ0² is the learner's prior variance on the mean). Intuitively, the learner is too stubborn to change its prior belief by much, and such a minuscule change does not justify the teaching effort. Having picked n, unpacking s is trivial since T(x) = x. For example, we can let D be x1 = . . . = xn = s/n as discussed earlier, without employing optimization (7). Yet another interesting fact is that such an alarming teaching set (with n identical examples) is likely to contradict the world's model variance σ², but the discrepancy does not affect teaching because the learner fixes σ².

Example 5 (Teaching a multinomial distribution). The world is a multinomial distribution π* = (π1*, . . . , πK*) of dimension K.
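Putting (9) and the cardinality-effort solution n = 1/(2c) − σ²/σ0² together, the whole of Example 4 fits in a few lines. The function below is our sketch of that recipe (names and concrete parameter values are illustrative):

```python
def teach_gaussian_mean(mu_star, mu0, sigma2, sigma2_0, c):
    """Example 4: teach the mean of N(mu*, sigma^2) to a learner with
    prior mean mu0 and prior variance sigma2_0, at per-item cost c."""
    ratio = sigma2 / sigma2_0
    n = max(0, round(1 / (2 * c) - ratio))
    if n == 0:
        return []  # narrow-minded learner: teaching is not worth the effort
    s = ratio * (mu_star - mu0) + mu_star * n  # eq. (9): overshoot past mu*
    return [s / n] * n  # unpacking is trivial since T(x) = x

D = teach_gaussian_mean(mu_star=1.0, mu0=0.0, sigma2=1.0, sigma2_0=1.0, c=0.1)
print(len(D), D[0])  # 4 copies of 1.25: past mu* = 1, away from mu0 = 0
```

With the relaxed n, the design makes the learner's posterior mean (κ0 μ0 + s)/(κ0 + n), with κ0 = σ²/σ0², land exactly on μ*.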
The learner starts with a conjugate Dirichlet prior p(π | β) = (Γ(Σ_k βk) / ∏_k Γ(βk)) ∏_{k=1}^K πk^{βk−1}. Each teaching item is x ∈ {1, . . . , K}. The teacher needs to decide the total number of teaching items n and the split s = (s1, . . . , sK), where n = Σ_{k=1}^K sk.

In step 1, the sufficient statistics are s1, . . . , s_{K−1}, but for clarity we write (6) using s and the standard parameters:

    min_s  − log Γ(Σ_{k=1}^K (βk + sk)) + Σ_{k=1}^K log Γ(βk + sk) − Σ_{k=1}^K (βk + sk − 1) log πk* + effort(s).    (10)

This is an integer program; we relax s ∈ R^K_{≥0}, making it a continuous optimization problem with nonnegativity constraints. Assuming a differentiable effort(), the optimal aggregate sufficient statistics can be readily solved with the gradient

    ∂/∂sk = −ψ(Σ_{k=1}^K (βk + sk)) + ψ(βk + sk) − log πk* + ∂effort(s)/∂sk,

where ψ() is the digamma function. In step 2, unpacking is again trivial: we simply let sk ← [sk] for k = 1 . . . K.

Let us look at a concrete problem. Let the teaching target be π* = (1/10, 3/10, 6/10). Let the learner's prior Dirichlet parameters be quite different: β = (6, 3, 1). If we say that teaching requires no effort by setting effort(s) = 0, then the optimal teaching set D found by Algorithm 1 is s = (317, 965, 1933), as implemented with Matlab fmincon. The MLE from D is (0.099, 0.300, 0.601), which is very close to π*. In fact, in our experiments fmincon stopped because it exceeded the default function evaluation limit. Otherwise, the counts would grow even higher, with MLE → π*.
This is "brute-force teaching": using unlimited data to overwhelm the prior in the learner.

But if we say teaching is costly by setting effort(s) = 0.3 Σ_{k=1}^K sk, the optimal D found by Algorithm 1 is instead s = (0, 2, 8), with merely ten items. Note that it did not pick (1, 3, 6), which also has ten items and whose MLE is π*: this is again to compensate for the biased prior Dir(β) in the learner. Our optimal teaching set (0, 2, 8) has Teaching Impedance TI = 2.65. In contrast, the set (1, 3, 6) has TI = 4.51 and the previous set (317, 965, 1933) has TI = 956.25 due to its size. We can also attempt to sample teaching sets of size ten from multinomial(10, π*). In 100,000 simulations with such random teaching sets the average TI = 4.97 ± 1.88 (standard deviation), the minimum TI = 2.65, and the maximum TI = 18.7. In summary, our optimal teaching set (0, 2, 8) is very good.

We remark that one can teach complex models using simple ones as building blocks. For instance, with the machinery in Example 5 one can teach the learner a full generative model for a Naïve Bayes classifier. Let the target Naïve Bayes classifier have K classes with class probability p(y = k) = πk*. Let v be the vocabulary size. Let the target class conditional probability be p(x = i | y = k) = θki* for word type i = 1 . . . v and label k = 1 . . . K. Then the aggregate sufficient statistics are n1 . . . nK, m11 . . . m1v, . . . , mK1 . . . mKv, where nk is the number of documents with label k, and mki is the number of times word i appears in all documents with label k. The optimal choice of these n's and m's for teaching can be solved separately as in Example 5, as long as effort() can be separated. The unpacking step is easy: we know we need nk teaching documents with label k.
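The TI numbers quoted above follow directly from the Dirichlet posterior density. A short check (our code; the helper name is ours, the constants are the paper's):

```python
from math import lgamma, log

def ti_dirichlet(s, beta, pi_star, cost_per_item=0.3):
    """TI of Example 5: -log Dir(pi* | beta + s) + 0.3 * (number of items)."""
    alpha = [b + si for b, si in zip(beta, s)]
    log_post = (lgamma(sum(alpha)) - sum(lgamma(a) for a in alpha)
                + sum((a - 1) * log(p) for a, p in zip(alpha, pi_star)))
    return -log_post + cost_per_item * sum(s)

beta, pi_star = (6, 3, 1), (0.1, 0.3, 0.6)
print(round(ti_dirichlet((0, 2, 8), beta, pi_star), 2))  # 2.65
print(round(ti_dirichlet((1, 3, 6), beta, pi_star), 2))  # 4.51
```

The gap between the two size-ten sets is exactly the prior-compensation effect: (0, 2, 8) undercounts the categories the prior already favors.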
These $n_k$ documents together need $m_{ki}$ counts of word type $i$. They can evenly split those counts. In the end, each teaching document with label $k$ will have the bag-of-words $\left(\frac{m_{k1}}{n_k}, \ldots, \frac{m_{kv}}{n_k}\right)$, subject to rounding.

Example 6 (Teaching a multivariate Gaussian). Now we consider the general case of teaching both the mean and the covariance of a multivariate Gaussian. The world has the target $\mu^* \in \mathbb{R}^D$ and $\Sigma^* \in \mathbb{R}^{D \times D}$. The likelihood is $N(x \mid \mu, \Sigma)$. The learner starts with a conjugate Normal-Inverse-Wishart (NIW) prior
$$
p(\mu, \Sigma \mid \mu_0, \kappa_0, \nu_0, \Lambda_0^{-1}) = \left( 2^{\frac{\nu_0 D}{2}}\, \pi^{\frac{D(D-1)}{4}} \prod_{i=1}^{D}\Gamma\!\left(\tfrac{\nu_0 + 1 - i}{2}\right) \left(\tfrac{2\pi}{\kappa_0}\right)^{\frac{D}{2}} |\Lambda_0|^{-\frac{\nu_0}{2}} \right)^{-1} |\Sigma|^{-\frac{\nu_0 + D + 2}{2}} \exp\!\left( -\tfrac{1}{2}\,\mathrm{tr}(\Sigma^{-1}\Lambda_0) - \tfrac{\kappa_0}{2}(\mu - \mu_0)^\top \Sigma^{-1} (\mu - \mu_0) \right).
$$
Given data $x_1, \ldots, x_n \in \mathbb{R}^D$, the aggregate sufficient statistics are $s = \sum_{i=1}^n x_i$ and $S = \sum_{i=1}^n x_i x_i^\top$. The posterior is NIW $p(\mu, \Sigma \mid \mu_n, \kappa_n, \nu_n, \Lambda_n^{-1})$ with parameters $\mu_n = \frac{\kappa_0}{\kappa_0 + n} \mu_0 + \frac{1}{\kappa_0 + n} s$, $\kappa_n = \kappa_0 + n$, $\nu_n = \nu_0 + n$, and $\Lambda_n = \Lambda_0 + S + \frac{\kappa_0 n}{\kappa_0 + n} \mu_0 \mu_0^\top - \frac{\kappa_0}{\kappa_0 + n} \left(\mu_0 s^\top + s \mu_0^\top\right) - \frac{1}{\kappa_0 + n} s s^\top$. We formulate the optimal aggregate sufficient statistics problem by putting the posterior into (6). Note that $S$ by definition needs to be positive semi-definite. In addition, with the Cauchy-Schwarz inequality one can show that $S_{ii} \geq s_i^2 / n$ for $i = 1 \ldots D$. Step 1 is thus the following SDP:
$$
\begin{aligned}
\min_{n,\,s,\,S}\quad & \sum_{i=1}^{D} \log\Gamma\!\left(\frac{\nu_n + 1 - i}{2}\right) + \frac{\nu_n D}{2}\log 2 + \frac{D(D-1)}{4}\log\pi - \frac{\nu_n}{2}\log|\Lambda_n| - \frac{D}{2}\log\kappa_n \\
& + \frac{\nu_n + D + 2}{2}\log|\Sigma^*| + \frac{1}{2}\,\mathrm{tr}(\Sigma^{*-1}\Lambda_n) + \frac{\kappa_n}{2}(\mu^* - \mu_n)^\top \Sigma^{*-1} (\mu^* - \mu_n) + \mathrm{effort}(n, s, S) \qquad (11) \\
\text{s.t.}\quad & S \succeq 0; \qquad S_{ii} \geq s_i^2 / n, \; \forall i. \qquad (12)
\end{aligned}
$$

In step 2, we unpack $s, S$ by initializing $x_1, \ldots, x_n \stackrel{iid}{\sim} N(\mu^*, \Sigma^*)$. Again, such iid samples are typically not good teaching examples. We improve them with the optimization (7), where $T(x)$ is the $(D + D^2)$-dimensional vector formed by the elements of $x$ and $x x^\top$, and similarly the aggregate sufficient statistics vector $s$ is formed by the elements of $s$ and $S$.

We illustrate the results on a concrete problem in $D = 3$. The target Gaussian is $\mu^* = (0, 0, 0)$ and $\Sigma^* = I$. The target mean is visualized in each plot of Figure 2 as a black dot. The learner's initial state is captured by the NIW with parameters $\mu_0 = (1, 1, 1)$, $\kappa_0 = 1$, $\nu_0 = 2 + 10^{-5}$, $\Lambda_0 = 10^{-5} I$. Note that the learner's prior mean $\mu_0$ is different than $\mu^*$, and is shown by the red dot in Figure 2; the red dot has a stem extending to the $z = 0$ plane for better visualization. We used an "expensive" effort function $\mathrm{effort}(n, s, S) = n$. Algorithm 1 decides to use $n = 4$ teaching examples with $s = (-1, -1, -1)$ and $S = \begin{pmatrix} 4.63 & -1 & -1 \\ -1 & 4.63 & -1 \\ -1 & -1 & 4.63 \end{pmatrix}$. These unpack into $D = \{x_1, \ldots, x_4\}$, visualized by the four empty blue circles. The three panels of Figure 2 show unpacking results starting from different initial seeds sampled from $N(\mu^*, \Sigma^*)$. These teaching examples form a tetrahedron (edges added for clarity).
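As a sanity check on the conjugate update, the short sketch below (Python; the NIW posterior formulas are the ones stated above, instantiated with the reported $n$, $s$, $S$) confirms that the aggregate statistics chosen by Algorithm 1 drive the learner's posterior mean exactly onto $\mu^*$:

```python
import numpy as np

# Learner's NIW prior from the example
mu0 = np.array([1.0, 1.0, 1.0])
kappa0, nu0 = 1.0, 2.0 + 1e-5
Lambda0 = 1e-5 * np.eye(3)

# Aggregate sufficient statistics chosen by Algorithm 1
n = 4
s = np.array([-1.0, -1.0, -1.0])
S = np.full((3, 3), -1.0)
np.fill_diagonal(S, 4.63)

# Conjugate NIW posterior update
kappa_n = kappa0 + n
nu_n = nu0 + n
mu_n = (kappa0 * mu0 + s) / kappa_n
Lambda_n = (Lambda0 + S
            + (kappa0 * n / kappa_n) * np.outer(mu0, mu0)
            - (kappa0 / kappa_n) * (np.outer(mu0, s) + np.outer(s, mu0))
            - np.outer(s, s) / kappa_n)

print(mu_n)      # -> [0. 0. 0.], exactly the target mean mu*
print(Lambda_n)  # approximately 5.63 * I: diagonal, like Sigma* = I
```

Note how the off-diagonal $-1$ entries of $S$ exactly cancel the rank-one terms contributed by the prior mean $\mu_0 = (1, 1, 1)$, leaving $\Lambda_n$ proportional to the identity.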
The tetrahedron structure is sensible: in fact, one can show that the minimum teaching set for a $D$-dimensional Gaussian is the $D + 1$ points at the vertices of a $D$-dimensional tetrahedron (a simplex). Importantly, the mean of $D$, namely $(-1/4, -1/4, -1/4)$, shown as the solid blue dot with a stem, is offset from the target $\mu^*$, and to the opposite side of the learner's prior mean $\mu_0$. This again shows that $D$ compensates for the learner's prior. Our optimal teaching set $D$ has $TI = 1.69$. In contrast, teaching sets with four iid random samples from the target $N(\mu^*, \Sigma^*)$ have worse TI: in 100,000 simulations, such random teaching sets have average $TI = 9.06 \pm 3.34$, minimum $TI = 1.99$, and maximum $TI = 35.51$.

Figure 2: Teaching a multivariate Gaussian

5 Discussions and Conclusion

What if the learner anticipates teaching? Then the teaching set may be further reduced. For example, the task in Figure 1 may only require a single teaching example $D = \{x_1 = \theta^*\}$, and the learner can figure out that this $x_1$ encodes the decision boundary. Smart learning behaviors similar to this have been observed in humans by Shafto and Goodman [20]. In fact, this is known as "collusion" in computational teaching theory (see e.g. [10]), and it has strong connections to compression in information theory. In one extreme of collusion, the teacher and the learner agree upon an information-theoretic coding scheme beforehand. Then the teaching set $D$ is not used in the traditional machine-learning sense of a training set, but rather as source coding. For example, $x_1$ itself could be a floating-point encoding of $\theta^*$ up to machine precision. In contrast, the present paper assumes that the learner does not collude.

We introduced an optimal teaching framework that balances teaching loss and effort.
We hope this paper provides a "stepping stone" for follow-up work, such as 0-1 loss() for classification, non-Bayesian learners, uncertainty in the learner's state, and teaching materials beyond training items.

Acknowledgments

We thank Bryan Gibson, Robert Nowak, Stephen Wright, Li Zhang, and the anonymous reviewers for suggestions that improved this paper. This research is supported in part by National Science Foundation grants IIS-0953219 and IIS-0916038.

References

[1] D. Angluin. Queries revisited. Theor. Comput. Sci., 313(2):175-194, 2004.
[2] F. J. Balbach and T. Zeugmann. Teaching randomized learners. In COLT, pages 229-243. Springer, 2006.
[3] Y. Bengio, J. Louradour, R. Collobert, and J. Weston. Curriculum learning. In ICML, 2009.
[4] B. Biggio, B. Nelson, and P. Laskov. Poisoning attacks against support vector machines. In ICML, 2012.
[5] L. D. Brown. Fundamentals of statistical exponential families: with applications in statistical decision theory. Institute of Mathematical Statistics, Hayward, CA, USA, 1986.
[6] M. Cakmak and M. Lopes. Algorithmic and human teaching of sequential decision tasks. In AAAI Conference on Artificial Intelligence, 2012.
[7] N. Chater and M. Oaksford. The probabilistic mind: prospects for Bayesian cognitive science. Oxford University Press, 2008.
[8] M. C. Frank and N. D. Goodman. Predicting Pragmatic Reasoning in Language Games. Science, 336(6084):998, May 2012.
[9] G. Giguère and B. C. Love. Limits in decision making arise from limits in memory retrieval. Proceedings of the National Academy of Sciences, Apr. 2013.
[10] S. Goldman and M. Kearns. On the complexity of teaching. Journal of Computer and Systems Sciences, 50(1):20-31, 1995.
[11] S. Hanneke.
Teaching dimension and the complexity of active learning. In COLT, pages 66-81, 2007.
[12] T. Hegedüs. Generalized teaching dimensions and the query complexity of learning. In COLT, pages 108-117, 1995.
[13] F. Khan, X. Zhu, and B. Mutlu. How do humans teach: On curriculum learning and teaching dimension. In Advances in Neural Information Processing Systems (NIPS) 25, 2011.
[14] H. Kobayashi and A. Shinohara. Complexity of teaching by a restricted number of examples. In COLT, pages 293-302, 2009.
[15] M. P. Kumar, B. Packer, and D. Koller. Self-paced learning for latent variable models. In NIPS, 2010.
[16] Y. J. Lee and K. Grauman. Learning the easy things first: Self-paced visual category discovery. In CVPR, 2011.
[17] B. D. McCandliss, J. A. Fiez, A. Protopapas, M. Conway, and J. L. McClelland. Success and failure in teaching the [r]-[l] contrast to Japanese adults: Tests of a Hebbian model of plasticity and stabilization in spoken language perception. Cognitive, Affective, & Behavioral Neuroscience, 2(2):89-108, 2002.
[18] H. Pashler and M. C. Mozer. When does fading enhance perceptual category learning? Journal of Experimental Psychology: Learning, Memory, and Cognition, 2013. In press.
[19] A. N. Rafferty and T. L. Griffiths. Optimal language learning: The importance of starting representative. In 32nd Annual Conference of the Cognitive Science Society, 2010.
[20] P. Shafto and N. Goodman. Teaching Games: Statistical Sampling Assumptions for Learning in Pedagogical Situations. In CogSci, pages 1632-1637, 2008.
[21] S. Singh, R. L. Lewis, A. G. Barto, and J. Sorg. Intrinsically motivated reinforcement learning: An evolutionary perspective. IEEE Trans. on Auton. Ment. Dev., 2(2):70-82, June 2010.
[22] J. B. Tenenbaum and T. L. Griffiths. The rational basis of representativeness. In 23rd Annual Conference of the Cognitive Science Society, 2001.
[23] J. B.
Tenenbaum, T. L. Griffiths, and C. Kemp. Theory-based Bayesian models of inductive learning and reasoning. Trends in Cognitive Sciences, 10(7):309-318, 2006.
[24] F. Xu and J. B. Tenenbaum. Word learning as Bayesian inference. Psychological Review, 114(2), 2007.
[25] S. Zilles, S. Lange, R. Holte, and M. Zinkevich. Models of cooperative teaching and learning. Journal of Machine Learning Research, 12:349-384, 2011.