{"title": "Convergence Rates of Active Learning for Maximum Likelihood Estimation", "book": "Advances in Neural Information Processing Systems", "page_first": 1090, "page_last": 1098, "abstract": "An active learner is given a class of models, a large set of unlabeled examples, and the ability to interactively query labels of a subset of these examples; the goal of the learner is to learn a model in the class that fits the data well. Previous theoretical work has rigorously characterized label complexity of active learning, but most of this work has focused on the PAC or the agnostic PAC model. In this paper, we shift our attention to a more general setting -- maximum likelihood estimation. Provided certain conditions hold on the model class, we provide a two-stage active learning algorithm for this problem. The conditions we require are fairly general, and cover the widely popular class of Generalized Linear Models, which in turn, include models for binary and multi-class classification, regression, and conditional random fields. We provide an upper bound on the label requirement of our algorithm, and a lower bound that matches it up to lower order terms. Our analysis shows that unlike binary classification in the realizable case, just a single extraround of interaction is sufficient to achieve near-optimal performance in maximum likelihood estimation. On the empirical side, the recent work in (Gu et al. 2012) and (Gu et al. 2014) (on active linear and logistic regression) shows the promise of this approach.", "full_text": "Convergence Rates of Active Learning\nfor Maximum Likelihood Estimation\n\nKamalika Chaudhuri \u21e4\n\nSham M. 
Kakade \u2020\n\nPraneeth Netrapalli \u2021\n\nSujay Sanghavi \u00a7\n\nAbstract\n\nAn active learner is given a class of models, a large set of unlabeled examples, and\nthe ability to interactively query labels of a subset of these examples; the goal of\nthe learner is to learn a model in the class that \ufb01ts the data well.\nPrevious theoretical work has rigorously characterized label complexity of active\nlearning, but most of this work has focused on the PAC or the agnostic PAC model.\nIn this paper, we shift our attention to a more general setting \u2013 maximum likeli-\nhood estimation. Provided certain conditions hold on the model class, we provide\na two-stage active learning algorithm for this problem. The conditions we re-\nquire are fairly general, and cover the widely popular class of Generalized Linear\nModels, which in turn, include models for binary and multi-class classi\ufb01cation,\nregression, and conditional random \ufb01elds.\nWe provide an upper bound on the label requirement of our algorithm, and a lower\nbound that matches it up to lower order terms. Our analysis shows that unlike\nbinary classi\ufb01cation in the realizable case, just a single extra round of interaction is\nsuf\ufb01cient to achieve near-optimal performance in maximum likelihood estimation.\nOn the empirical side, the recent work in [12] and [13] (on active linear and\nlogistic regression) shows the promise of this approach.\n\nIntroduction\n\n1\nIn active learning, we are given a sample space X , a label space Y, a class of models that map X to\nY, and a large set U of unlabelled samples. 
The goal of the learner is to learn a model in the class\nwith small target error while interactively querying the labels of as few of the unlabelled samples as\npossible.\nMost theoretical work on active learning has focussed on the PAC or the agnostic PAC model, where\nthe goal is to learn binary classi\ufb01ers that belong to a particular hypothesis class [2, 14, 10, 7, 3, 4, 22],\nand there has been only a handful of exceptions [19, 9, 20]. In this paper, we shift our attention to\na more general setting \u2013 maximum likelihood estimation (MLE), where Pr(Y |X) is described by a\nmodel \u2713 belonging to a model class \u21e5. We show that when data is generated by a model in this class,\nwe can do active learning provided the model class \u21e5 has the following simple property: the Fisher\ninformation matrix for any model \u2713 2 \u21e5 at any (x, y) depends only on x and \u2713. This condition is\nsatis\ufb01ed in a number of widely applicable model classes, such as Linear Regression and Generalized\nLinear Models (GLMs), which in turn includes models for Multiclass Classi\ufb01cation and Conditional\nRandom Fields. Consequently, we can provide active learning algorithms for maximum likelihood\nestimation in all these model classes.\nThe standard solution to active MLE estimation in the statistics literature is to select samples for\nlabel query by optimizing a class of summary statistics of the asymptotic covariance matrix of the\n\n\u21e4Dept. of CS, University of California at San Diego. Email: kamalika@cs.ucsd.edu\n\u2020Dept. of CS and of Statistics, University of Washington. Email: sham@cs.washington.edu\n\u2021Microsoft Research New England. Email:praneeth@microsoft.com\n\u00a7Dept. of ECE, The University of Texas at Austin. Email:sanghavi@mail.utexas.edu\n\n1\n\n\festimator [6]. 
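The key structural property above -- that the Fisher information at (x, y) depends only on x and the model parameter, not on y -- can be sanity-checked numerically. The sketch below is our own illustration (not code from the paper), for logistic regression: the numerical Hessian of the negative log-likelihood comes out the same for y = +1 and y = -1, and matches the closed form s(1 - s) x x^T with s = sigmoid(theta^T x):

```python
import numpy as np

def nll(y, x, theta):
    # Negative log-likelihood of logistic regression: log(1 + exp(-y * theta^T x))
    return np.log1p(np.exp(-y * (theta @ x)))

def num_hessian(y, x, theta, eps=1e-5):
    # Numerical Hessian of the negative log-likelihood w.r.t. theta
    # (central second differences)
    d = len(theta)
    H = np.zeros((d, d))
    for i in range(d):
        for j in range(d):
            tpp = theta.copy(); tpp[i] += eps; tpp[j] += eps
            tpm = theta.copy(); tpm[i] += eps; tpm[j] -= eps
            tmp = theta.copy(); tmp[i] -= eps; tmp[j] += eps
            tmm = theta.copy(); tmm[i] -= eps; tmm[j] -= eps
            H[i, j] = (nll(y, x, tpp) - nll(y, x, tpm)
                       - nll(y, x, tmp) + nll(y, x, tmm)) / (4 * eps**2)
    return H

rng = np.random.default_rng(0)
x = rng.standard_normal(3)
theta = rng.standard_normal(3)

H_pos = num_hessian(+1, x, theta)   # Hessian when the label is +1
H_neg = num_hessian(-1, x, theta)   # Hessian when the label is -1

# Closed form: s * (1 - s) * x x^T with s = sigmoid(theta^T x); no y anywhere.
s = 1.0 / (1.0 + np.exp(-theta @ x))
I_closed = s * (1 - s) * np.outer(x, x)

print(np.allclose(H_pos, H_neg, atol=1e-4))      # label does not matter
print(np.allclose(H_pos, I_closed, atol=1e-4))   # matches the closed form
```

This is exactly why a sampling distribution over the unlabelled pool can be designed before any further labels are seen: the Fisher information of every candidate point is computable from x alone.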
The literature, however, does not provide any guidance towards which summary statis-\ntic should be used, or any analysis of the solution quality when a \ufb01nite number of labels or samples\nare available. There has also been some recent work in the machine learning community [12, 13, 19]\non this problem; but these works focus on simple special cases (such as linear regression [19, 12] or\nlogistic regression [13]), and only [19] involves a consistency and \ufb01nite sample analysis.\nIn this work, we consider the problem in its full generality, with the goal of minimizing the expected\nlog-likelihood error over the unlabelled data. We provide a two-stage active learning algorithm for\nthis problem. In the \ufb01rst stage, our algorithm queries the labels of a small number of random samples\nfrom the data distribution in order to construct a crude estimate \u27131 of the optimal parameter \u2713\u21e4. In\nthe second stage, we select a set of samples for label query by optimizing a summary statistic of the\ncovariance matrix of the estimator at \u27131; however, unlike the experimental design work, our choice\nof statistic is directly motivated by our goal of minimizing the expected log-likelihood error, which\nguides us towards the right objective.\nWe provide a \ufb01nite sample analysis of our algorithm when some regularity conditions hold and\nwhen the negative log likelihood function is convex. Our analysis is still fairly general, and applies\nto Generalized Linear Models, for example. We match our upper bound with a corresponding lower\nbound, which shows that the convergence rate of our algorithm is optimal (except for lower order\nterms); the \ufb01nite sample convergence rate of any algorithm that uses (perhaps multiple rounds of)\nsample selection and maximum likelihood estimation is either the same or higher than that of our\nalgorithm. 
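To make the choice of summary statistic concrete, here is a small self-contained sketch (our own illustration, not the paper's code). It assumes linear regression, where I(x, theta) = x x^T is parameter-free, and uses the structured pool from Section 5.2 of the paper (mostly e_1, occasionally a rare basis direction), so the trace objective has a closed form over standard-basis points:

```python
import numpy as np

d = 50  # ambient dimension

# Pool distribution U: e_1 with prob. 1 - (d-1)/d^2, e_j with prob. 1/d^2 (j >= 2).
p_U = np.full(d, 1.0 / d**2)
p_U[0] = 1.0 - (d - 1) / d**2

# With I(x) = x x^T and standard-basis points, all Fisher matrices are diagonal,
# so Tr(I_Gamma^{-1} I_U) = sum_j p_U[j] / p_Gamma[j].
def trace_objective(p_gamma):
    return float(np.sum(p_U / p_gamma))

# Passive learning labels points drawn from U itself: objective = d.
passive = trace_objective(p_U)

# An active sampler reweights toward the rare directions (cf. Section 5.2):
# e_1 w.p. ~ 1 - (d-1)/(2d), e_j w.p. ~ 1/(2d).
p_act = np.full(d, 1.0 / (2 * d))
p_act[0] = 1.0 - (d - 1) / (2 * d)
active = trace_objective(p_act)

print(passive, active)  # 50.0 and ~3.88: bounded vs. growing linearly in d
```

Dividing either objective value by the label budget m gives the leading term of the expected log-likelihood error, which is the quantity the second stage of the algorithm optimizes.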
This implies that unlike what is observed in learning binary classi\ufb01ers, a single round of\ninteraction is suf\ufb01cient to achieve near-optimal log likelihood error for ML estimation.\n\n1.1 Related Work\n\nPrevious theoretical work on active learning has focussed on learning a classi\ufb01er belonging to a\nhypothesis class H in the PAC model. Both the realizable and non-realizable cases have been con-\nsidered. In the realizable case, a line of work [7, 18] has looked at a generalization of binary search;\nwhile their algorithms enjoy low label complexity, this style of algorithms is inconsistent in the\npresence of noise. The two main styles of algorithms for the non-realizable case are disagreement-\nbased active learning [2, 10, 4], and margin or con\ufb01dence-based active learning [3, 22]. While active\nlearning in the realizable case has been shown to achieve an exponential improvement in label com-\nplexity over passive learning [2, 7, 14], in the agnostic case, the gains are more modest (sometimes\na constant factor) [14, 10, 8]. Moreover, lower bounds [15] show that the label requirement of any\nagnostic active learning algorithm is always at least \u2326(\u232b2/\u270f2), where \u232b is the error of the best hy-\npothesis in the class, and \u270f is the target error. In contrast, our setting is much more general than\nbinary classi\ufb01cation, and includes regression, multi-class classi\ufb01cation and certain kinds of condi-\ntional random \ufb01elds that are not covered by previous work.\n[19] provides an active learning algorithm for linear regression problem under model mismatch.\nTheir algorithm attempts to learn the location of the mismatch by \ufb01tting increasingly re\ufb01ned par-\ntitions of the domain, and then uses this information to reweight the examples. If the partition is\nhighly re\ufb01ned, then the computational complexity of the resulting algorithm may be exponential\nin the dimension of the data domain. 
In contrast, our algorithm applies to a more general setting,\nand while we do not address model mismatch, our algorithm has polynomial time complexity. [1]\nprovides an active learning algorithm for Generalized Linear Models in an online selective sampling\nsetting; however, unlike ours, their input is a stream of unlabelled examples, and at each step, they\nneed to decide whether the label of the current example should be queried.\nOur work is also related to the classical statistical work on optimal experiment design, which mostly\nconsiders maximum likelihood estimation [6]. For uni-variate estimation, they suggest selecting\nsamples to maximize the Fisher information which corresponds to minimizing the variance of the\nregression coef\ufb01cient. When \u2713 is multi-variate, the Fisher information is a matrix; in this case, there\nare multiple notions of optimal design which correspond to maximizing different parameters of the\nFisher information matrix. For example, D-optimality maximizes the determinant, and A-optimality\nmaximizes the trace of the Fisher information. In contrast with this work, we directly optimize\nthe expected log-likelihood over the unlabelled data which guides us to the appropriate objective\nfunction; moreover, we provide consistency and \ufb01nite sample guarantees.\n\n2\n\n\fFinally, on the empirical side, [13] and [12] derive algorithms similar to ours for logistic and linear\nregression based on projected gradient descent. Notably, these works provide promising empirical\nevidence for this approach to active learning; however, no consistency guarantees or convergence\nrates are provided (the rates presented in these works are not stated in terms of the sample size). In\ncontrast, our algorithm applies more generally, and we provide consistency guarantees and conver-\ngence rates. 
Moreover, unlike [13], our logistic regression algorithm uses a single extra round of\ninteraction, and our results illustrate that a single round is suf\ufb01cient to achieve a convergence rate\nthat is optimal except for lower order terms.\n\n2 The Model\n\nWe begin with some notation. We are given a pool U = {x1, . . . , xn} of n unlabelled examples\ndrawn from some instance space X , and the ability to interactively query labels belonging to a label\nspace Y of m of these examples. In addition, we are given a family of models M = {p(y|x, \u2713),\u2713 2\n\u21e5} parameterized by \u2713 2 \u21e5 \u2713 Rd. We assume that there exists an unknown parameter \u2713\u21e4 2 \u21e5 such\nthat querying the label of an xi 2 U generates a yi drawn from the distribution p(y|xi,\u2713 \u21e4). We also\nabuse notation and use U to denote the uniform distribution over the examples in U.\nWe consider the \ufb01xed-design (or transductive) setting, where our goal is to minimize the error on\nthe \ufb01xed set of points U. For any x 2X , y 2Y and \u2713 2 \u21e5, we de\ufb01ne the negative log-likelihood\nfunction L(y|x, \u2713) as:\nOur goal is to \ufb01nd a \u02c6\u2713 to minimize LU (\u02c6\u2713), where\n\nL(y|x, \u2713) = log p(y|x, \u2713)\n\nLU (\u2713) = EX\u21e0U,Y \u21e0p(Y |X,\u2713\u21e4)[L(Y |X, \u2713)]\n\nby interactively querying labels for a subset of U of size m, where we allow label queries with\nreplacement i.e., the label of an example may be queried multiple times.\nAn additional quantity of interest to us is the Fisher information matrix, or the Hessian of the nega-\ntive log-likelihood L(y|x, \u2713) function, which determines the convergence rate. For our active learn-\ning procedure to work correctly, we require the following condition.\nCondition 1. 
For any x in X, y in Y, and theta in Theta, the Fisher information d^2 L(y|x, theta)/d theta^2 is a function of only x and theta (and does not depend on y).\n\nCondition 1 is satisfied by a number of models of practical interest; examples include linear regression and generalized linear models. Section 5.1 provides a brief derivation of Condition 1 for generalized linear models.\nFor any x, y and theta, we use I(x, theta) to denote the Hessian d^2 L(y|x, theta)/d theta^2; observe that by Condition 1, this is just a function of x and theta. Let Gamma be any distribution over the unlabelled samples in U; for any theta in Theta, we use:\n\nI_Gamma(theta) = E_{X ~ Gamma}[I(X, theta)]\n\n3 Algorithm\n\nThe main idea behind our algorithm is to sample x_i from a well-designed distribution Gamma over U, query the labels of these samples and perform ML estimation over them. To ensure good performance, Gamma should be chosen carefully, and our choice of Gamma is motivated by Lemma 1. Suppose the labels y_i are generated according to y_i ~ p(y|x_i, theta*). Lemma 1 states that the expected log-likelihood error of the ML estimate with respect to m samples from Gamma in this case is essentially Tr(I_Gamma(theta*)^{-1} I_U(theta*))/m.\nThis suggests selecting Gamma as the distribution Gamma* that minimizes Tr(I_{Gamma*}(theta*)^{-1} I_U(theta*)). Unfortunately, we cannot do this as theta* is unknown. We resolve this problem through a two-stage algorithm: in the first stage, we use a small number m_1 of samples to construct a coarse estimate theta_1 of theta* (Steps 1-2). In the second stage, we calculate a distribution Gamma_1 which minimizes Tr(I_{Gamma_1}(theta_1)^{-1} I_U(theta_1)) and draw samples from (a slight modification of) this distribution for a finer estimation of theta* (Steps 3-5). The distribution Gamma_1 is modified slightly to Gamma-bar (in Step 4) to ensure that I_{Gamma-bar}(theta*) is well conditioned with respect to I_U(theta*). The algorithm is formally presented in Algorithm 1.\n\nAlgorithm 1 ActiveSetSelect\nInput: Samples x_i, for i = 1, ..., n\n1: Draw m_1 samples u.a.r. from U, and query their labels to get S_1.\n2: Use S_1 to solve the MLE problem:\n\ntheta_1 = argmin_{theta in Theta} Sum_{(x_i, y_i) in S_1} L(y_i|x_i, theta)\n\n3: Solve the following SDP (refer Lemma 3):\n\na* = argmin_a Tr(S^{-1} I_U(theta_1)) s.t. S = Sum_i a_i I(x_i, theta_1), 0 <= a_i <= 1, Sum_i a_i = m_2\n\n4: Draw m_2 examples using probability Gamma-bar = alpha Gamma_1 + (1 - alpha) U, where the distribution Gamma_1(i) = a*_i / m_2 and alpha = 1 - 1/m_2^{1/6}. Query their labels to get S_2.\n5: Use S_2 to solve the MLE problem:\n\ntheta_2 = argmin_{theta in Theta} Sum_{(x_i, y_i) in S_2} L(y_i|x_i, theta)\n\nOutput: theta_2\n\nFinally, note that Steps 1-2 are necessary because I_U and I_Gamma are functions of theta. In certain special cases such as linear regression, I_U and I_Gamma are independent of theta. In those cases, Steps 1-2 are unnecessary, and we may skip directly to Step 3.\n\n4 Performance Guarantees\n\nThe following regularity conditions are essentially a quantified version of the standard Local Asymptotic Normality (LAN) conditions for studying maximum likelihood estimation (see [5, 21]).\nAssumption 1. (Regularity conditions for LAN)\n\n1. Smoothness: The first three derivatives of L(y|x, theta) exist in all interior points of Theta, a subset of R^d.\n2. Compactness: Theta is compact and theta* is an interior point of Theta.\n3. 
Strong Convexity: I_U(theta*) = (1/n) Sum_{i=1}^n I(x_i, theta*) is positive definite with smallest singular value sigma_min > 0.\n4. Lipschitz continuity: There exists a neighborhood B of theta* and a constant L_3 such that for all x in U, I(x, theta) is L_3-Lipschitz in this neighborhood:\n\n||I_U(theta*)^{-1/2} (I(x, theta) - I(x, theta')) I_U(theta*)^{-1/2}||_2 <= L_3 ||theta - theta'||_{I_U(theta*)}\n\nfor every theta, theta' in B.\n5. Concentration at theta*: For any x in U and y, we have (with probability one),\n\n||grad L(y|x, theta*)||_{I_U(theta*)^{-1}} <= L_1, and ||I_U(theta*)^{-1/2} I(x, theta*) I_U(theta*)^{-1/2}||_2 <= L_2.\n\n6. Boundedness: max_{(x, y)} sup_{theta in Theta} |L(y|x, theta)| <= R.\n\nIn addition to the above, we need one extra condition which is essentially a pointwise self concordance. This condition is satisfied by a vast class of models, including the generalized linear models.\nAssumption 2. Point-wise self concordance:\n\n-L_4 ||theta - theta*||_2 I(x, theta*) <= I(x, theta) - I(x, theta*) <= L_4 ||theta - theta*||_2 I(x, theta*)   (in the PSD ordering).\n\nDefinition 1. [Optimal Sampling Distribution Gamma*] We define the optimal sampling distribution Gamma* over the points in U as the distribution Gamma* = (Gamma*_1, ..., Gamma*_n) for which Gamma*_i >= 0, Sum_i Gamma*_i = 1, and Tr(I_{Gamma*}(theta*)^{-1} I_U(theta*)) is as small as possible.\n\nDefinition 1 is motivated by Lemma 1, which indicates that under some mild regularity conditions, a ML estimate calculated on samples drawn from Gamma* will provide the best convergence rates (including the right constant factor) for the expected log-likelihood error.\nWe now present the main result of our paper. 
The proof of the following theorem and all the supporting lemmas will be presented in Appendix A.\nTheorem 1. Suppose the regularity conditions in Assumptions 1 and 2 hold, and let the number of samples used in step 1 be\n\nm_1 > O( max( L_2^2 log^2 d, L_1^2 (L_3^2 + 1/sigma_min) log^2 d, beta diameter(Theta) Tr(I_U(theta*)^{-1}), 2 L_4^2 beta^4 Tr(I_U(theta*)^{-1}) ) ).\n\nThen with probability 1 - delta, the expected log likelihood error of the estimate theta_2 of Algorithm 1 is bounded as:\n\nE[L_U(theta_2)] - L_U(theta*) <= (1 + 2/beta)(1 + eps~_{m_2}) Tr(I_{Gamma*}(theta*)^{-1} I_U(theta*)) / m_2 + R / m_2^2,   (1)\n\nwhere Gamma* is the optimal sampling distribution in Definition 1 and eps~_{m_2} = O( (L_1 L_3 + sqrt(L_2)) sqrt(log(d m_2)) / m_2^{1/6} ). Moreover, for any sampling distribution Gamma satisfying I_Gamma(theta*) >= c I_U(theta*) (in the PSD ordering) and label constraint of m_2, we have the following lower bound on the expected log likelihood error for the ML estimate:\n\nE[L_U(theta-hat)] - L_U(theta*) >= (1 - eps_{m_2}) Tr(I_Gamma(theta*)^{-1} I_U(theta*)) / m_2 - L_1^2 / (c m_2^2),   (2)\n\nwhere eps_{m_2} := eps~_{m_2} m_2^{1/3} / c^2.\nRemark 1. (Restricting to Maximum Likelihood Estimation) Our restriction to maximum likelihood estimators is minor, as this is close to minimax optimal (see [16]). Minor improvements with certain kinds of estimators, such as the James-Stein estimator, are possible.\n\n4.1 Discussion\n\nSeveral remarks about Theorem 1 are in order.\nThe high probability bound in Theorem 1 is with respect to the samples drawn in S_1; provided these samples are representative (which happens with probability 1 - delta), the output theta_2 of Algorithm 1 will satisfy (1). 
Additionally, Theorem 1 assumes that the labels are sampled with replacement; in other words, we can query the label of a point x_i multiple times. Removing this assumption is an avenue for future work.\nSecond, the highest order term in both (1) and (2) is Tr(I_{Gamma*}(theta*)^{-1} I_U(theta*)) / m_2. The terms involving eps_{m_2} and eps~_{m_2} are lower order as both eps_{m_2} and eps~_{m_2} are o(1). Moreover, if beta = omega(1), then the term involving beta in (1) is of a lower order as well. Observe that beta also measures the tradeoff between m_1 and m_2, and as long as beta = o(sqrt(m_2)), m_1 is also of a lower order than m_2. Thus, provided beta is omega(1) and o(sqrt(m_2)), the convergence rate of our algorithm is optimal except for lower order terms.\nFinally, the lower bound (2) applies to distributions Gamma for which I_Gamma(theta*) >= c I_U(theta*), where c occurs in the lower order terms of the bound. This constraint is not very restrictive, and does not affect the asymptotic rate. Observe that I_U(theta*) is full rank. If I_Gamma(theta*) is not full rank, then the expected log likelihood error of the ML estimate with respect to Gamma will not be consistent, and thus such a Gamma will never achieve the optimal rate. If I_Gamma(theta*) is full rank, then there always exists a c for which I_Gamma(theta*) >= c I_U(theta*). Thus (2) essentially states that for distributions Gamma where I_Gamma(theta*) is close to being rank-deficient, the asymptotic convergence rate of O(Tr(I_Gamma(theta*)^{-1} I_U(theta*)) / m_2) is achieved at larger values of m_2.\n\n4.2 Proof Outline\n\nOur main result relies on the following three steps.\n\n4.2.1 Bounding the Log-likelihood Error\n\nFirst, we characterize the log likelihood error (wrt U) of the empirical risk minimizer (ERM) estimate obtained using a sampling distribution Gamma. Concretely, let Gamma be a distribution on U, and let theta-hat be the ERM estimate using the distribution Gamma:\n\ntheta-hat = argmin_{theta in Theta} (1/m_2) Sum_{i=1}^{m_2} L(Y_i|X_i, theta),   (3)\n\nwhere X_i ~ Gamma and Y_i ~ p(y|X_i, theta*). The core of our analysis is Lemma 1, which shows a precise estimate of the log likelihood error E[L_U(theta-hat) - L_U(theta*)].\nLemma 1. Suppose L satisfies the regularity conditions in Assumptions 1 and 2. Let Gamma be a distribution on U and theta-hat be the ERM estimate (3) using m_2 labeled examples. Suppose further that I_Gamma(theta*) >= c I_U(theta*) for some constant c < 1. Then, for any p >= 2 and m_2 large enough such that eps_{m_2} < 1, we have:\n\n(1 - eps_{m_2}) tau^2 / m_2 - L_1^2 / (c m_2^{p/2}) <= E[L_U(theta-hat) - L_U(theta*)] <= (1 + eps_{m_2}) tau^2 / m_2 + R / m_2^p,\n\nwhere eps_{m_2} := O( (1/c^2)(L_1 L_3 + sqrt(L_2)) sqrt(p log(d m_2) / m_2) ) and tau^2 := Tr(I_Gamma(theta*)^{-1} I_U(theta*)).\n\n4.2.2 Approximating theta*\n\nLemma 1 motivates sampling from the optimal sampling distribution Gamma* that minimizes Tr(I_{Gamma*}(theta*)^{-1} I_U(theta*)). However, this quantity depends on theta*, which we do not know. To resolve this issue, our algorithm first queries the labels of a small fraction of points (m_1) and solves a ML estimation problem to obtain a coarse estimate theta_1 of theta*.\nHow close should theta_1 be to theta*? Our analysis indicates that it is sufficient for theta_1 to be close enough that for any x, I(x, theta_1) is a constant factor spectral approximation to I(x, theta*); the number of samples needed to achieve this is analyzed in Lemma 2.\nLemma 2. Suppose L satisfies the regularity conditions in Assumptions 1 and 2. If the number of
If the number of\nsamples used in the \ufb01rst step\n\nm1 > O0@max0@L2 log2 d, L2\n1\u2713L2\n\nthen, we have:\n\n3 +\n\n1\n\nmin\u25c6 log2 d,\n\ndiameter(\u21e5)\n\nTr\u21e3IU (\u2713\u21e4)1\u2318 ,\n\n2L2\n4\n\n\n\nTr\u21e3IU (\u2713\u21e4)1\u23181A1A ,\n\n1\n\n\n\n\nI (x, \u2713\u21e4) I (x, \u27131) I (x, \u2713\u21e4) \n\n1\n\n\nI (x, \u2713\u21e4) 8 x 2 X\n\nwith probability greater than 1 .\n4.2.3 Computing 1\nThird, we are left with the task of obtaining a distribution 1 that minimizes the log likelihood error.\nWe now pose this optimization problem as an SDP.\nFrom Lemmas 1 and 2, it is clear that we should aim to obtain a sampling distribution = ( ai\nm2\n\n[n]) minimizing Tr\u21e3I(\u27131)1IU (\u27131)\u2318. Let IU (\u27131) =Pj jvjvj> be the singular value decompo-\nsition (svd) of IU (\u27131). Since Tr\u21e3I(\u27131)1IU (\u27131)\u2318 = Pd\n\nj=1 jvj>I(\u27131)1vj, this is equivalent\n\n: i 2\n\n6\n\n\fto solving:\n\njcj\n\nmin\na,c\n\ns.t.8><>:\ndXj=1\nAmong the above constraints, the constraint vj>S1vj \uf8ff cj seems problematic. However, Schur\nS \u232b 0 , S \u232b 0 and vj>S1vj \uf8ff cj. In our case,\ncomplement formula tells us that:\uf8ff cj\nwe know that S \u232b 0, since it is a sum of positive semi de\ufb01nite matrices. The above argument proves\nthe following lemma.\nLemma 3. The following two optimization programs are equivalent:\n\nS =Pi aiI(xi,\u2713 1)\nvj>S1vj \uf8ff cj\nPi ai = m2.\n\nai 2 [0, 1]\n\nvj>\n\n(4)\n\nvj\n\nmina\ns.t.\n\nTrS1IU (\u27131)\nS =Pi aiI(xi,\u2713 1)\nPi ai = m2.\n\nai 2 [0, 1]\n\nmina,c\n\ns.t.\n\n\u2318\n\nwhere IU (\u27131) =Pj jvjvj> denotes the svd of IU (\u27131).\n\n5\n\nIllustrative Examples\n\nvj>\n\nj=1 jcj\n\nPd\nS =Pi aiI(xi,\u2713 1)\n\uf8ff cj\nS \u232b 0\nPi ai = m2,\n\nvj\nai 2 [0, 1]\n\nWe next present some examples that illustrate Theorem 1. 
We begin by showing that Condition 1 is satisfied by the popular class of Generalized Linear Models.\n\n5.1 Derivations for Generalized Linear Models\n\nA generalized linear model is specified by three parameters -- a linear model, a sufficient statistic, and a member of the exponential family. Let eta be a linear model: eta = theta^T X. Then, in a Generalized Linear Model (GLM), Y is drawn from an exponential family distribution with parameter eta. Specifically, p(Y = y|eta) = e^{eta^T t(y) - A(eta)}, where t(.) is the sufficient statistic and A(.) is the log-partition function. From properties of the exponential family, the log-likelihood is written as log p(y|eta) = eta^T t(y) - A(eta). If we take eta = theta^T x, and take the derivative with respect to theta, we have: d log p(y|theta, x)/d theta = x t(y) - x A'(theta^T x). Taking derivatives again gives us d^2 log p(y|theta, x)/d theta^2 = -x x^T A''(theta^T x), which is independent of y.\n\n5.2 Specific Examples\n\nWe next present three illustrative examples of problems that our algorithm may be applied to.\nLinear Regression. Our first example is linear regression. In this case, x in R^d and Y in R are generated according to the distribution Y = theta*^T X + eta, where eta is a noise variable drawn from N(0, 1). In this case, the negative log-likelihood function is L(y|x, theta) = (y - theta^T x)^2, and the corresponding Fisher information matrix I(x, theta) is given as I(x, theta) = x x^T. Observe that in this (very special) case, the Fisher information matrix does not depend on theta; as a result we can eliminate the first two steps of the algorithm, and proceed directly to step 3. If Sigma = (1/n) Sum_i x_i x_i^T is the covariance matrix of U, then Theorem 1 tells us that we need to query labels from a distribution Gamma* with covariance matrix Lambda* such that Tr(Lambda*^{-1} Sigma) is minimized.\nWe illustrate the advantages of active learning through a simple example. Suppose U is the unlabelled distribution:\n\nx_i = e_1 w.p. 1 - (d-1)/d^2,  e_j w.p. 1/d^2 for j in {2, ..., d},\n\nwhere e_j is the standard unit vector in the j-th direction. The covariance matrix Sigma of U is a diagonal matrix with Sigma_11 = 1 - (d-1)/d^2 and Sigma_jj = 1/d^2 for j >= 2. For passive learning over U, we query labels of examples drawn from U, which gives us a convergence rate of Tr(Sigma^{-1} Sigma)/m = d/m. On the other hand, active learning chooses to sample examples from the distribution Gamma* such that\n\nx_i = e_1 w.p. ~ 1 - (d-1)/(2d),  e_j w.p. ~ 1/(2d) for j in {2, ..., d},\n\nwhere ~ indicates that the probabilities hold up to O(1/d^2). This has a diagonal covariance matrix Lambda* such that Lambda*_11 ~ 1 - (d-1)/(2d) and Lambda*_jj ~ 1/(2d) for j >= 2, and convergence rate of\n\nTr(Lambda*^{-1} Sigma)/m <= (1/m)( (2d/(d+1))(1 - (d-1)/d^2) + (d-1) . 2d . (1/d^2) ) <= 4/m,\n\nwhich does not grow with d!\nLogistic Regression. Our second example is logistic regression for binary classification. In this case, x in R^d, Y in {-1, 1}, and the negative log-likelihood function is L(y|x, theta) = log(1 + e^{-y theta^T x}), and the corresponding Fisher information I(x, theta) is given as I(x, theta) = e^{theta^T x}/(1 + e^{theta^T x})^2 . x x^T. For illustration, suppose ||theta*||_2 and ||x||_2 are bounded by a constant and the covariance matrix Sigma is sandwiched between two multiples of identity in the PSD ordering, i.e., (c/d) I <= Sigma <= (C/d) I for some constants c and C. Then the regularity Assumptions 1 and 2 are satisfied for constant values of L_1, L_2, L_3 and L_4. In this case, Theorem 1 states that choosing m_1 to be omega(Tr(I_U(theta*)^{-1})) = omega(d) gives us the optimal convergence rate of (1 + o(1)) Tr(I_{Gamma*}(theta*)^{-1} I_U(theta*)) / m_2.\nMultinomial Logistic Regression. Our third example is multinomial logistic regression for multi-class classification. In this case, Y in {1, ..., K}, x in R^d, and the parameter matrix theta in R^{(K-1) x d}. The negative log-likelihood function is written as L(y|x, theta) = -theta_y^T x + log(1 + Sum_{k=1}^{K-1} e^{theta_k^T x}) if y != K, and L(y = K|x, theta) = log(1 + Sum_{k=1}^{K-1} e^{theta_k^T x}) otherwise. The corresponding Fisher information matrix is a (K-1)d x (K-1)d matrix, which is obtained as follows. Let F be the (K-1) x (K-1) matrix with:\n\nF_ii = e^{theta_i^T x}(1 + Sum_{k != i} e^{theta_k^T x}) / (1 + Sum_k e^{theta_k^T x})^2,  F_ij = -e^{theta_i^T x + theta_j^T x} / (1 + Sum_k e^{theta_k^T x})^2.\n\nThen, I(x, theta) = F (x) x x^T (Kronecker product).\nSimilar to the example in the logistic regression case, suppose ||theta*_y||_2 and ||x||_2 are bounded by a constant and the covariance matrix Sigma satisfies (c/d) I <= Sigma <= (C/d) I for some constants c and C. Since F* = diag(p*_i) - p* p*^T, where p*_i = P(y = i|x, theta*), the boundedness of ||theta*_y||_2 and ||x||_2 implies that c~ I <= F* <= C~ I for some constants c~ and C~ (depending on K). This means that (c c~/d) I <= I(x, theta*) <= (C C~/d) I, and so the regularity Assumptions 1 and 2 are satisfied with L_1, L_2, L_3 and L_4 being constants. Theorem 1 again tells us that using omega(d) samples in the first step gives us the optimal convergence rate of maximum likelihood error.\n\n6 Conclusion\n\nIn this paper, we provide an active learning algorithm for maximum likelihood estimation which provably achieves the optimal convergence rate (up to lower order terms) and uses only two rounds of interaction. Our algorithm applies in a very general setting, which includes Generalized Linear Models.\nThere are several avenues of future work. 
Our algorithm involves solving an SDP which is computa-\ntionally expensive; an open question is whether there is a more ef\ufb01cient, perhaps greedy, algorithm\nthat achieves the same rate. A second open question is whether it is possible to remove the with\nreplacement sampling assumption. A \ufb01nal question is what happens if IU (\u2713\u21e4) has a high condition\nnumber. In this case, our algorithm will require a large number of samples in the \ufb01rst stage; an open\nquestion is whether we can use a more sophisticated procedure in the \ufb01rst stage to reduce the label\nrequirement.\nAcknowledgements. KC thanks NSF under IIS 1162581 for research support.\n\n8\n\n\fReferences\n[1] A. Agarwal. Selective sampling algorithms for cost-sensitive multiclass prediction. In Pro-\nceedings of the 30th International Conference on Machine Learning, ICML 2013, Atlanta,\nGA, USA, 16-21 June 2013, pages 1220\u20131228, 2013.\n\n[2] M.-F. Balcan, A. Beygelzimer, and J. Langford. Agnostic active learning. J. Comput. Syst.\n\nSci., 75(1):78\u201389, 2009.\n\n[3] M.-F. Balcan and P. M. Long. Active and passive learning of linear separators under log-\n\nconcave distributions. In COLT, 2013.\n\n[4] A. Beygelzimer, D. Hsu, J. Langford, and T. Zhang. Agnostic active learning without con-\n\nstraints. In NIPS, 2010.\n\n[5] L. Cam and G. Yang. Asymptotics in Statistics: Some Basic Concepts. Springer Series in\n\nStatistics. Springer New York, 2000.\n\n[6] J. Cornell. Experiments with Mixtures: Designs, Models, and the Analysis of Mixture Data\n\n(third ed.). Wiley, 2002.\n\n[7] S. Dasgupta. Coarse sample complexity bounds for active learning. In NIPS, 2005.\n[8] S. Dasgupta. Two faces of active learning. Theor. Comput. Sci., 412(19), 2011.\n[9] S. Dasgupta and D. Hsu. Hierarchical sampling for active learning. In ICML, 2008.\n[10] S. Dasgupta, D. Hsu, and C. Monteleoni. A general agnostic active learning algorithm. In\n\nNIPS, 2007.\n\n[11] R. Frostig, R. Ge, S. M. 
Kakade, and A. Sidford. Competing with the empirical risk minimizer in a single pass. arXiv preprint arXiv:1412.6606, 2014.\n[12] Q. Gu, T. Zhang, C. Ding, and J. Han. Selective labeling via error bound minimization. In Proc. of Advances in Neural Information Processing Systems (NIPS) 25, Lake Tahoe, Nevada, United States, 2012.\n[13] Q. Gu, T. Zhang, and J. Han. Batch-mode active learning via error bound minimization. In 30th Conference on Uncertainty in Artificial Intelligence (UAI), 2014.\n[14] S. Hanneke. A bound on the label complexity of agnostic active learning. In ICML, 2007.\n[15] M. Kääriäinen. Active learning in the non-realizable case. In ALT, 2006.\n[16] L. Le Cam. Asymptotic Methods in Statistical Decision Theory. Springer, 1986.\n[17] E. L. Lehmann and G. Casella. Theory of Point Estimation, volume 31. Springer Science & Business Media, 1998.\n[18] R. D. Nowak. The geometry of generalized binary search. IEEE Transactions on Information Theory, 57(12):7893-7906, 2011.\n[19] S. Sabato and R. Munos. Active regression through stratification. In NIPS, 2014.\n[20] R. Urner, S. Wulff, and S. Ben-David. PLAL: Cluster-based active learning. In COLT, 2013.\n[21] A. W. van der Vaart. Asymptotic Statistics. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 2000.\n[22] C. Zhang and K. Chaudhuri. Beyond disagreement-based agnostic active learning. In Proc. of Neural Information Processing Systems, 2014.\n", "award": [], "sourceid": 679, "authors": [{"given_name": "Kamalika", "family_name": "Chaudhuri", "institution": "UCSD"}, {"given_name": "Sham", "family_name": "Kakade", "institution": "University of Washington"}, {"given_name": "Praneeth", "family_name": "Netrapalli", "institution": "Microsoft Research"}, {"given_name": "Sujay", "family_name": "Sanghavi", "institution": "UTexas-Austin"}]}