{"title": "Active Learning for Non-Parametric Regression Using Purely Random Trees", "book": "Advances in Neural Information Processing Systems", "page_first": 2537, "page_last": 2546, "abstract": "Active learning is the task of using labelled data to select additional points to label, with the goal of fitting the most accurate model with a fixed budget of labelled points. In binary classification active learning is known to produce faster rates than passive learning for a broad range of settings. However in regression restrictive structure and tailored methods were previously needed to obtain theoretically superior performance. In this paper we propose an intuitive tree based active learning algorithm for non-parametric regression with provable improvement over random sampling. When implemented with Mondrian Trees our algorithm is tuning parameter free, consistent and minimax optimal for Lipschitz functions.", "full_text": "Active Learning for Non-Parametric Regression\n\nUsing Purely Random Trees\n\nJack Goetz\n\nAmbuj Tewari\n\nUniversity of Michigan\nAnn Arbor, MI 48109\n\n{jrgoetz, tewaria, paulzim}@umich.edu\n\nPaul Zimmerman\n\nAbstract\n\nActive learning is the task of using labelled data to select additional points to\nlabel, with the goal of \ufb01tting the most accurate model with a \ufb01xed budget of la-\nbelled points. In binary classi\ufb01cation active learning is known to produce faster\nrates than passive learning for a broad range of settings. However in regression\nrestrictive structure and tailored methods were previously needed to obtain theo-\nretically superior performance. In this paper we propose an intuitive tree based ac-\ntive learning algorithm for non-parametric regression with provable improvement\nover random sampling. 
When implemented with Mondrian Trees our algorithm is\ntuning parameter free, consistent and minimax optimal for Lipschitz functions.\n\n1\n\nIntroduction\n\nIn this paper we study active learning for regression in the pool setting. In our setup we are given\na pool of unlabelled data points and want to build the best model with a \ufb01xed number of samples,\nallowing selection of new points to use labels already obtained. Active learning is motivated by\nscenarios where the experimenter has control over the data labelling process and where unlabelled\npoints are cheap but labels are expensive.\nOur primary motivation comes from computational chemistry, where chemical properties of inter-\nest can be computed by solving approximations to the Schr\u00f6dinger equation. One key property to\nchemists, the rate of chemical reaction, can be quanti\ufb01ed via the activation energy, which controls\nthe rate of reaction as a function of temperature [9]. While calculating the activation energy is\nexpensive, there are a small number of readily available features of the reaction that in\ufb02uence the\nactivation energy. This incentivizes building a metamodel for the activation energy to avoid exces-\nsive analysis of undesirable (high activation energy) reactions. Since we are restricted in the number\nof simulations used to build our metamodel, we want to use the most informative data points. Be-\ncause chemical reactions are discrete entities, we are restricted to a \ufb01nite (but often large) pool of\nreactions, thus requiring pool setting active learning even though we are selecting simulations.\nActive learning methods are usually built on top of existing prediction algorithms. Decision trees\nand forests are a popular class of such predictors due to their simplicity, expressiveness, state-of-\nthe-art performance and tuning parameter free nature. 
In this paper we focus our attention on purely random trees [4], decision trees built independently of any data, due to their amenability to theoretical analysis. We use a recently proposed version called Mondrian Trees [17], which have been shown to produce trees with many attractive properties such as consistency and minimax optimal rate of convergence for Lipschitz functions [19].\nAs in some previous work [7], our active learning algorithm will be developed in two stages. First we introduce a simple and intuitive oracle querying algorithm for purely random trees which is optimal among a natural class of sampling schemes which includes random sampling (Theorem 4.4). This algorithm is not active but uses statistics of the true joint distribution which are generally unknown. Second we propose an active learning scheme where we first sample passively to estimate the required statistics, and then use those estimates to approximate the oracle algorithm. We show this algorithm is consistent for the oracle algorithm (Theorem 5.1) and behaves well when our labels are normally distributed (Theorem 5.4). Finally we examine the empirical performance of our active learning algorithm to show that benefits, though sometimes modest, can be significant.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n2 Setting and background\n\nWe begin by describing the pool based active learning setting, as well as introducing purely random and Mondrian trees. We have a pool of m data points {X1, . . . , Xm}, with Xi \u2208 [0, 1]d (rescaling our X as needed) and Xi \u223c pX, which are always available to the algorithm. For each Xi we have a corresponding label Yi \u2208 R with the relationship Yi = f(Xi) + \u03c3(Xi)\u03b5i with \u03b5i \u223c p\u03b5 iid, \u03b5i \u22a5 Xj \u2200 j, E(\u03b5i) = 0, Var(\u03b5i) = 1, and \u03c3 : [0, 1]d \u2192 R+, meaning our noise is the product of a function of X with an independent random variable. We assume the (Xi, Yi) = Di have been drawn iid from a joint distribution pX,Y. We will assume that f(x) and \u03c3(x) are bounded. Initially none of these Yi are known to the algorithm. Instead we have the ability to gain access to any of the Yi, and the task is to select n \u226a m labels with the goal of building a model with the lowest quadratic risk E[( \u02c6f(X) \u2212 f(X))\u00b2], where the expectation is taken over our test point X, the random process which builds our tree and the labelled data we select. Throughout we will assume that our pool is arbitrarily large; in particular we will assume that the marginal density pX is known, and that there is enough unlabelled data to implement any sampling scheme for selecting n points.\nWe use active sampling (or learning) to describe any sampling scheme which samples in multiple batches and uses both the X\u2032i as well as the known Y\u2032i from previous batches when picking points for the next batch. We use passive sampling to denote any sampling scheme which only uses the Xi to pick points, and we use random sampling to denote picking the points uniformly at random from our pool (which is the same as sampling from pX,Y).\nOur active learning method is for purely random trees [4], which are decision trees (or partitions of the space) built using a random process that is independent of the data. We will interchangeably discuss the partition of the space generated by the tree and the leaves of the tree. Let Ik \u2208 I enumerate the leaves of a tree (partitions of the space), where k \u2208 {1...K}. We will abuse notation slightly and use the set of partitions I to denote our tree. 
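Since the tree is built before seeing any labels, its construction is easy to sketch. The following snippet is our illustration, not the paper's code; a genuine Mondrian tree would choose the leaf and split coordinate with probability proportional to the leaf's side lengths, whereas here both are chosen uniformly. It grows a purely random axis-aligned partition of [0, 1]^d, independent of any data:

```python
import random

def grow_purely_random_tree(d, n_splits, seed=0):
    """Partition [0,1]^d with axis-aligned splits drawn independently of data.

    Each leaf is a box, stored as a list of (low, high) intervals, one per
    coordinate. Simplified stand-in for a purely random / Mondrian tree.
    """
    rng = random.Random(seed)
    leaves = [[(0.0, 1.0)] * d]  # start with the whole cube as one leaf
    for _ in range(n_splits):
        i = rng.randrange(len(leaves))   # pick a leaf uniformly at random
        box = leaves.pop(i)
        axis = rng.randrange(d)          # pick a coordinate uniformly
        lo, hi = box[axis]
        cut = rng.uniform(lo, hi)        # pick a split point uniformly
        left, right = list(box), list(box)
        left[axis], right[axis] = (lo, cut), (cut, hi)
        leaves.extend([left, right])
    return leaves

def leaf_index(leaves, x):
    """Return the index k of a leaf I_k containing the point x."""
    for k, box in enumerate(leaves):
        if all(lo <= xi <= hi for xi, (lo, hi) in zip(x, box)):
            return k
    raise ValueError("x outside [0,1]^d")

# each split turns one leaf into two, so n_splits splits give n_splits + 1 leaves
leaves = grow_purely_random_tree(d=2, n_splits=9)
```

A regressogram over such a partition then only needs `leaf_index` to route labelled points and test points to their leaves.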
These partitions can be used to build regressograms, which make predictions using the average of labelled points within the partition of the test point. With the partitions fixed, the best (in L2) approximation to f which is piece-wise constant on each partition predicts the conditional mean on that partition [14]. We will denote true values and estimates of this approximation using \"tilde\" and \"hat\" notation as shown below.\n\nTrue best approximation:\n\n\u02dcfI(x) = \u03a3k 1(x \u2208 Ik) \u02dc\u03b2k,  where \u02dc\u03b2k = EpX,Y[Y | X \u2208 Ik]\n\nEstimate of best approximation:\n\n\u02c6fI(x) = \u03a3k 1(x \u2208 Ik) \u02c6\u03b2k,  where \u02c6\u03b2k = (\u03a3Xi\u2208Ik Yi) / (\u03a3i 1(Xi \u2208 Ik))\n\nOur experiments and some results will use particular purely random trees built using the Mondrian Process [17]. The Mondrian Process is a stochastic process for partitioning a hypercube in Rd; a single realization of this process gives a Mondrian Tree. The Mondrian Process iteratively splits existing partitions, and the number of partitions is controlled by a parameter \u03bb which, since the Mondrian Process is a generalization of a Poisson Process, is referred to as the lifetime parameter. As this parameter increases the number of partitions increases, and the rate at which the number of partitions increases depends on the dimension and size of the hypercube. We will use Mondrian Trees on a fixed domain [0, 1]d with varying lifetime as in [19], which describes how these random partitions are built.\n\n3 Related work on Active Learning\n\nThe majority of theoretical work in active learning has taken place in binary classification, and there are many approaches which have been studied (see, e.g. [13], [10], [22], [16], [3], [2]). 
These algorithms are studied under fairly nonrestrictive assumptions (except occasionally requiring a linear classification boundary). It has been shown that for a variety of realistic noise conditions active learning provides a better minimax learning rate than passive learning ([15]).\nIn contrast, the theory for active learning in regression is less well developed. A negative result [23] showed that for a Lipschitz regression function and constant noise variance, the minimax learning rate for active learning is the same as that for passive learning (up to a constant). Additional assumptions are required to obtain better rates. Such structure includes a piece-wise constant regression function [23], approximation of a non-linear model by a linear one [21], locally varying smoothness [6], a well-specified parametric model [8], or heteroskedasticity [11], [7].\nWhile many of these regression methods are able to provide provably better learning rates in terms of n, d, they are often tailored to their specific assumptions and may perform poorly if the assumptions do not hold. As a recent summary [18] of numerous flexible but guarantee-free methods shows, there is great demand for active learning methods without such stringent conditions. Our active learning algorithm will make very mild assumptions, but the improvement will not be in rates in n, d (since it is known this is not always possible). Rather we will adopt the approach of [13], comparing the sampling generated by our algorithm to an optimal sampling scheme, as well as to random sampling.\n\n4 Oracle label querying algorithm\n\nWe first describe a simple family of querying algorithms for a fixed purely random tree I which are not active. 
In the first two subsections below, we will be implicitly conditioning on the tree I, but will suppress this in the notation.\n\n4.1 Generic algorithm\n\nIn our generic algorithm family, the tree is built without using any data. So we build the tree first and query based on the tree's structure. We call it an \"oracle\" algorithm since it requires pX,Y.\n\nAlgorithm 1: Generic \"oracle\" querying algorithm\nInput: Leaves of our tree I, pool of data points {X1, . . . , Xm}, label budget n and joint distribution pX,Y\nOutput: The set of points to label\nforeach Ik \u2208 I do\n  Calculate qk, the proportion of points to select from leaf Ik, using I, {X1, . . . , Xm}, n, pX,Y;\n  Select nk = n \u00b7 qk points uniformly at random from the pool of unlabelled points in that leaf;\nend\n\nThe algorithm is described as picking nk deterministically for simplification of notation in proofs. However it is clear that if the nk are random then it is easy (in principle) to discuss the probabilistic properties of the algorithm, and the details of the risk under random versions of Algorithm 1 are discussed in the proof for Corollary 4.6. The pool marginal distribution pX and the proportions in each leaf qk from the querying algorithm above induce a marginal distribution p\u2032X, as well as a joint distribution p\u2032X,Y = pY|X p\u2032X. The scheme is very general, and it is worth noting that random sampling is a (randomized) version of Algorithm 1. But this is enough structure to produce a somewhat obvious but very important property of our sampling distribution restricted to each leaf.\nProposition 4.1. Fix a tree structure I, pool marginal density pX and version of Algorithm 1, giving us an induced marginal density p\u2032X. Let p\u2032X(X | X \u2208 Ik) denote the induced marginal density conditioned on X \u2208 Ik. Then as long as qk \u2260 0, p\u2032X(X | Ik) = pX(X | Ik) for any version of Algorithm 1.\n\nOne important property this gives us is that Ep\u2032X,Y[\u02c6\u03b2k] = \u02dc\u03b2k (as long as Ik has at least 1 labelled point to estimate \u02c6\u03b2k), meaning our sampling scheme produces unbiased estimates of the optimal regressogram for this tree. It also allows for a bias-variance decomposition of the risk of the tree. This decomposition was already known [12] under the assumption of independence between tree structure and the data. We relax this assumption slightly, as the distribution of the data depends on the structure of the tree, but this still permits the decomposition.\nCorollary 4.2. For a fixed tree structure I, under any sampling distribution generated by Algorithm 1 we have the following bias-variance decomposition of our risk:\n\nE[( \u02c6fI(X) \u2212 f(X))\u00b2] = E[( \u02dcfI(X) \u2212 f(X))\u00b2] + E[( \u02c6fI(X) \u2212 \u02dcfI(X))\u00b2].\n\nWe will refer to these as the risk bias term and the risk variance term. The risk bias term depends only on the structure of the tree, which does not depend on our sampling scheme. We thus focus on the risk variance term. Again using Proposition 4.1 we show this term for a single leaf takes a simple form.\nLemma 4.3. 
For a fixed tree structure I, under any sampling distribution generated by Algorithm 1 we have that the variance error term on the leaf Ik is:\n\nE[( \u02c6fI(X) \u2212 \u02dcfI(X))\u00b2 | X \u2208 Ik] = (1/nk)(bias\u00b2k + \u03c3\u00b2\u03b5,k) = (1/nk) Var(Y | X \u2208 Ik),\n\nwhere bias\u00b2k := EpX,Y[(f(X) \u2212 \u02dc\u03b2k)\u00b2 | X \u2208 Ik] and \u03c3\u00b2\u03b5,k := EpX,Y[(\u03c3(X)\u03b5)\u00b2 | X \u2208 Ik].\n\n4.2 Optimal algorithm\n\nIn the above lemma we have emphasized that the terms bias\u00b2k and \u03c3\u00b2\u03b5,k have expectations taken with respect to the data generating distribution pX,Y and do not depend on the induced distribution p\u2032X,Y. Thus the only way our sampling distribution affects the variance term is through nk. Averaging out over the contribution of each leaf we get that our overall variance error term is:\n\nE[( \u02c6fI(X) \u2212 \u02dcfI(X))\u00b2] = \u03a3k P(X \u2208 Ik) (1/nk)(bias\u00b2k + \u03c3\u00b2\u03b5,k).    (1)\n\nLet pk = P(X \u2208 Ik) under the pool marginal distribution and \u03c3\u00b2Y,k = bias\u00b2k + \u03c3\u00b2\u03b5,k. Now we are given a budget of n data points, and we want to minimize our variance error term subject to this budget. This gives us the following optimization problem, which can be easily solved:\n\nminimize over (nk): \u03a3k (1/nk) pk \u03c3\u00b2Y,k subject to \u03a3k nk = n  \u2192  n\u2217k = n \u221a(pk \u03c3\u00b2Y,k) / \u03a3k\u2032 \u221a(pk\u2032 \u03c3\u00b2Y,k\u2032).\n\nThe proportions are very intuitive; cells with high bias and/or noise, or high (test) marginal density, will get more samples. These results are summarized in the following theorem:\nTheorem 4.4. Let Yi = f(Xi) + \u03c3(Xi)\u03b5i and fix the partitions I of our tree. 
The risk minimizing oracle querying algorithm out of the family of algorithms described by Algorithm 1 is the one with the following nk and error:\n\nn\u2217k = n \u221a(pk \u03c3\u00b2Y,k) / \u03a3k\u2032 \u221a(pk\u2032 \u03c3\u00b2Y,k\u2032),   E[( \u02c6fI(X) \u2212 \u02dcfI(X))\u00b2] = (1/n) (\u03a3k \u221a(pk \u03c3\u00b2Y,k))\u00b2.\n\nDefinition 4.5. The distribution induced by the sampling in Theorem 4.4 will be referred to as p\u2217X.\nRemark. This has a similar flavour to uncertainty sampling methods from classification in that regions with greater variation will get more samples. However whereas in classification sampling can focus locally near the decision boundary, in regression sampling must remain global.\n\nRandom sampling is a randomized version of Algorithm 1, so the risk under random sampling is the bias term plus a weighted average of the variance terms for different (n1, ..., nK). The sampling scheme from Theorem 4.4 has the same bias term, but minimizes the variance term, meaning our optimal sampling scheme is better than any randomized version of Algorithm 1 (as long as m > n), including random sampling.\nCorollary 4.6. For a fixed tree structure I, the risk from any randomized version of Algorithm 1 is greater than the risk from sampling according to p\u2217X unless P(n\u22171, ..., n\u2217K) = 1. In particular sampling according to p\u2217X is strictly better than random sampling.\n\nWe can also calculate the excess error if we use incorrect values of \u03c3\u00b2Y,k. Let \u02dc\u03c3\u00b2Y,k = ak \u03c3\u00b2Y,k, so ak is a multiplicative error (we will see that our errors will be multiplicative). Given fixed leaf errors a1, ..., aK we can calculate the additional risk generated by using \u02dc\u03c3\u00b2Y,k in our optimal algorithm instead of the true \u03c3\u00b2Y,k.\nLemma 4.7. 
For a fixed tree structure I, if nk = n \u221a(pk \u02dc\u03c3\u00b2Y,k) / \u03a3k\u2032 \u221a(pk\u2032 \u02dc\u03c3\u00b2Y,k\u2032) and the variance error term for each leaf is as in Lemma 4.3, then our risk variance is:\n\nE[( \u02c6fI(X) \u2212 \u02dcfI(X))\u00b2] = (1/n) [ \u03a3k pk \u03c3\u00b2Y,k + \u03a3k<l \u221a(pk \u03c3\u00b2Y,k pl \u03c3\u00b2Y,l) (\u221a(ak/al) + \u221a(al/ak)) ].\n\nTheorem 5.1. If there exist constants c, C s.t. c \u2264 pX \u2264 C, then our estimates \u02c6nk \u2192 n\u2217k in probability.\nEven with consistency, our finite sample estimates will give us some error in \u02c6nk. The variance of our sample variance is Var(\u02c6\u03c3\u00b2Y,k) = (1/nk)(\u03baY,k \u2212 1)(\u03c3\u00b2Y,k)\u00b2 + O(1/n\u00b2k), with nk \u2248 n(1)/K, so our errors will scale multiplicatively with \u03c3\u00b2Y,k when the kurtoses \u03baY,k are bounded. This allows us to use Lemma 4.7 to bound our excess error given bounds on the (multiplicative) error ak = \u02c6\u03c3\u00b2Y,k/\u03c3\u00b2Y,k.\n\n5.2 Reusing data\n\nSince we are using the data in Stage 1 both to estimate \u02c6nk as well as in our estimator \u02c6\u03b2k, we have introduced dependence between the estimated optimal leaf sample size \u02c6nk and the leaf mean estimate contribution from Stage 1. To understand the effects of this dependence we will break up our estimate of the leaf mean as \u02c6\u03b2k = (n(1),k \u02c6\u03b2(1),k + n(2),k \u02c6\u03b2(2),k) / (n(1),k + n(2),k), where n(i),k and \u02c6\u03b2(i),k are the number of points and the mean estimate from sampling round i \u2208 {1, 2}. By writing our final mean estimate in terms of our stage-wise mean estimates we can find an expression for this dependence.\nLemma 5.3. 
For a fixed tree structure I, under Algorithm 2 the risk variance term becomes:\n\nE[( \u02c6\u03b2k \u2212 \u02dc\u03b2k)\u00b2] = En(2),k[ (n(1),k)\u00b2/(n(1),k + n(2),k)\u00b2 \u00b7 ED1:n(1)[( \u02c6\u03b2(1),k \u2212 \u02dc\u03b2k)\u00b2 | n(2),k] + n(2),k \u03c3\u00b2Y,k/(n(1),k + n(2),k)\u00b2 ].\n\nThe term ED1:n(1)[( \u02c6\u03b2(1),k \u2212 \u02dc\u03b2k)\u00b2 | n(2),k] quantifies the dependency introduced by reusing the samples from n(1). The dependency is between the variance of part of our mean estimators ( \u02c6\u03b2(1),1, ..., \u02c6\u03b2(1),K) and (n(2),1, ..., n(2),K) = g(\u02c6\u03c3\u00b2Y,1, ..., \u02c6\u03c3\u00b2Y,K). When \u02c6\u03b2(1),k \u22a5 n(2),k we get back our risk variance term from Lemma 4.3. However when there is dependence we no longer have that the n\u2217k from Theorem 4.4 are optimal over algorithms with an active stage as in Algorithm 2, since the optimal nk will depend on the sampling during Stage 1. This dependency can be complex and is generally unknown, though as long as the effect is not too large the n\u2217k will still provide a very good solution, and the n\u2217k are still better than random sampling. It is worth noting that our active algorithm can take advantage of this dependency in some cases to outperform Algorithm 1, and we informally discuss this in the appendix.\n\n5.3 The Normal case\n\nThe complications above depend on the distribution of ak = \u02c6\u03c3\u00b2Y,k/\u03c3\u00b2Y,k and the function g, which in general are extremely complicated and hard to analyze for arbitrary f, p\u03b5, pX. However in the case where Y are normally distributed these become tractable.\nTheorem 5.4. Let Y \u223c N(\u03bc(X), \u03c3\u00b2(X)) and X be queried according to Algorithm 2 for a fixed tree I. 
Then the risk variance term for a leaf is as in Lemma 4.3, and with probability at least 1 \u2212 \u03a3Kk=1 e^(\u2212(n(1),k \u2212 1)\u03b1\u00b2/8) the excess error is bounded by:\n\nEXCESS \u2264 (1/n) [ ((1 + \u03b1)/(1 \u2212 \u03b1))^(1/4) \u2212 ((1 \u2212 \u03b1)/(1 + \u03b1))^(1/4) ]\u00b2 \u03a3k<l \u221a(pk \u03c3\u00b2Y,k pl \u03c3\u00b2Y,l).\n\nIn practice the \u02c6nk will be fractional, a leaf may not contain \u02c6nk unlabelled points, and n may not equal \u03a3k \u02c6nk. These issues will be less significant as n \u2192 \u221e and we discuss how each is dealt with in the appendix.\n\n6 Simulations and experiments\n\nWe now examine the benefits of active learning on both simulated and real world data. We simulate 2 data sets, one with differing noise variance (our \u03c3\u00b2\u03b5,k term), the other with differing function complexity (our bias\u00b2k term), in different regions of [0, 1]d. We also examine performance on the Wine quality data set from UCI and a data set of activation energies of Claisen rearrangement reactions (Cl). We compare the performance of selecting points to label using random sampling, our active algorithm, and a naive uncertainty sampling version of our active algorithm, where each leaf's nk is proportional to its variance. In all experiments n(1) = n/2 and Mondrian Trees are grown using \u03bbn = n^(1/(2+d)) \u2212 1, which is theoretically motivated, but corrected so that when n = 1, \u03bbn = 0. We use both Mondrian and Breiman Trees [5] as our final regressor. Details of the data sets are in the appendix, which also contains forest versions of these experiments. Additionally all code and experiments (as well as other experiments) are available at https://github.com/jackrgoetz/Mondrian_Tree_AL.\nWhen using Mondrian Trees as the final regressor, the active learning method always provides some improvement, and in the simulations this improvement persists when using Breiman Trees. 
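To make the comparison concrete, the two-stage scheme can be sketched as follows: Stage 1 spends n(1) = n/2 labels roughly uniformly across leaves to estimate the leaf variances, and Stage 2 tops leaves up towards n̂_k ∝ √(p_k σ̂²_{Y,k}) from Theorem 4.4. This is our simplified illustration, not the released implementation; the partition is taken as given, labels are "queried" from the front of each leaf's pool, and rounding and leaf exhaustion are handled crudely:

```python
import math
import statistics

def two_stage_allocation(pool_labels_by_leaf, p, n):
    """Simplified sketch of the two-stage sampling scheme on a fixed partition.

    pool_labels_by_leaf[k]: labels of the pooled points in leaf I_k, queried
                            front to back (stand-in for uniform selection).
    p[k]: pool marginal probability p_k = P(X in I_k).
    n:    total label budget; Stage 1 uses n // 2 of it.
    Returns the number of labels taken from each leaf.
    """
    K = len(p)
    n1 = n // 2
    # Stage 1: spread n1 labels evenly and estimate sigma^2_{Y,k} per leaf.
    per_leaf = max(2, n1 // K)  # need >= 2 points for a sample variance
    taken = [min(per_leaf, len(pool_labels_by_leaf[k])) for k in range(K)]
    var_hat = [statistics.variance(pool_labels_by_leaf[k][:taken[k]])
               for k in range(K)]
    # Oracle-style targets: n_hat_k proportional to sqrt(p_k * var_hat_k).
    weights = [math.sqrt(p[k] * var_hat[k]) for k in range(K)]
    total = sum(weights) or 1.0
    target = [n * w / total for w in weights]
    # Stage 2: spend the remaining budget on leaves furthest below target
    # (rounding means the total spend only approximately matches n in general).
    n2 = n - sum(taken)
    deficits = [max(0.0, target[k] - taken[k]) for k in range(K)]
    tot_def = sum(deficits) or 1.0
    for k in range(K):
        extra = int(round(n2 * deficits[k] / tot_def))
        extra = min(extra, len(pool_labels_by_leaf[k]) - taken[k])
        taken[k] += extra
    return taken
```

Running this on one low-noise and one high-noise leaf of equal mass sends most of the Stage 2 budget to the noisy leaf, mirroring the oracle allocation.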
Additionally the uncertainty sampling method sometimes produces worse sampling than random sampling, which is common for direct translations of classification active learning methods. In the real data our benefits are less pronounced, with active learning even being slightly harmful when used with Breiman Trees (although with forests the active learning is beneficial). We believe this performance drop may be due to the inability of the Mondrian Tree to adapt to differing variable importance. It is also possible that our assumption that Y has changing variance does not hold; even here the active algorithm is not harmful, whereas the naive uncertainty sampling algorithm can be detrimental.\n\nFigure 1: Active learning experiments. Panels (MSE vs. final number of labelled points, for Mondrian Tree (MT) and Breiman Tree (BT) regressors with Active, Random and Uncertainty sampling): Heteroskedastic simulation (m = 40000, d = 10); Varying complexity simulation (m = 40000, d = 10); Wine experiment (m = 4898, d = 11); Cl experiment (m = 1508, d = 12).\n\n7 Conclusion and further directions\n\nIn this paper we provide a theoretically justified active learning method for non-parametric regression which can take advantage of beneficial structure when present without being detrimental when such structure is absent. When used with Mondrian Trees the method requires no tuning parameters (which are difficult to tune while actively sampling [1]), is asymptotically minimax optimal for Lipschitz regression functions, and is consistent. 
Although the improvement for active learning\nin regression is often restricted to constant factor improvements, these constant improvements are\nimportant in real world applications.\nDespite technical theoretical arguments needed for the theory, the method itself is simple, leading\nto many interesting avenues for further exploration. One direction would be extending theory to\nensembles of trees, or developing tools to deal with high dimensions. Another possibility is to\nexploit the online nature of Mondrian Trees to develop a parallel theory for streaming based active\nlearning. Finally it may be possible to extend the ideas here to non tree based active learning for\nregression.\n\nAcknowledgements\n\nJG acknowledges the support of NSF via grant DMS-1646108. AT acknowledges the support of a\nSloan Research Fellowship.\n\nReferences\n[1] Attenberg, J. and Provost, F. (2011). Inactive learning?: dif\ufb01culties employing active learning\n\nin practice. ACM SIGKDD Explorations Newsletter, 12(2):36\u201341.\n\n[2] Awasthi, P., Balcan, M. F., and Long, P. M. (2014). The power of localization for ef\ufb01ciently\nlearning linear separators with noise. In Proceedings of the forty-sixth annual ACM symposium\non Theory of computing, pages 449\u2013458. ACM.\n\n[3] Balcan, M.-F., Beygelzimer, A., and Langford, J. (2009). Agnostic active learning. Journal of\n\nComputer and System Sciences, 75(1):78\u201389.\n\n[4] Breiman, L. (2000). Some in\ufb01nity theory for predictor ensembles. Technical report, Technical\n\nReport 579, Statistics Dept. UCB.\n\n[5] Breiman, L. (2017). Classi\ufb01cation and regression trees. Routledge.\n\n[6] Bull, A. D. (2013). Spatially-adaptive sensing in nonparametric regression. The Annals of\n\nStatistics, 41(1):41\u201362.\n\n[7] Chaudhuri, K., Jain, P., and Natarajan, N. (2017). Active heteroscedastic regression. In Inter-\n\nnational Conference on Machine Learning, pages 694\u2013702.\n\n[8] Chaudhuri, K., Kakade, S. 
M., Netrapalli, P., and Sanghavi, S. (2015). Convergence rates of ac-\ntive learning for maximum likelihood estimation. In Advances in Neural Information Processing\nSystems, pages 1090\u20131098.\n\n[9] Cramer, C. J. (2013). Essentials of computational chemistry: theories and models. John Wiley\n\n& Sons.\n\n[10] Dasgupta, S., Hsu, D. J., and Monteleoni, C. (2008). A general agnostic active learning algo-\n\nrithm. In Advances in neural information processing systems, pages 353\u2013360.\n\n[11] Efromovich, S. (2008). Optimal sequential design in a controlled non-parametric regression.\n\nScandinavian Journal of Statistics, 35(2):266\u2013285.\n\n[12] Genuer, R. (2012). Variance reduction in purely random forests. Journal of Nonparametric\n\nStatistics, 24(3):543\u2013562.\n\n9\n\n\f[13] Golovin, D. and Krause, A. (2011). Adaptive submodularity: Theory and applications in active\n\nlearning and stochastic optimization. Journal of Arti\ufb01cial Intelligence Research, 42:427\u2013486.\n\n[14] Gy\u00f6r\ufb01, L., Kohler, M., Krzyzak, A., and Walk, H. (2006). A distribution-free theory of non-\n\nparametric regression. Springer Science & Business Media.\n\n[15] Hanneke, S. and Yang, L. (2015). Minimax analysis of active learning. Journal of Machine\n\nLearning Research, 16(12):3487\u20133602.\n\n[16] Hoang, T. N., Low, B. K. H., Jaillet, P., and Kankanhalli, M. (2014). Nonmyopic \u0001-bayes-\noptimal active learning of gaussian processes. In Proceedings of the 31st International Confer-\nence on Machine Learning, volume 32 of Proceedings of Machine Learning Research, pages\n739\u2013747.\n\n[17] Lakshminarayanan, B., Roy, D. M., and Teh, Y. W. (2014). Mondrian forests: Ef\ufb01cient online\n\nrandom forests. In Advances in neural information processing systems, pages 3140\u20133148.\n\n[18] Liu, H., Ong, Y.-S., and Cai, J. (2017). A survey of adaptive sampling for global metamodeling\nin support of simulation-based complex engineering design. 
Structural and Multidisciplinary\nOptimization, pages 1\u201324.\n\n[19] Mourtada, J., Ga\u00efffas, S., and Scornet, E. (2017). Universal consistency and minimax rates for\nonline mondrian forests. In Advances in Neural Information Processing Systems, pages 3761\u2013\n3770.\n\n[20] Mourtada, J., Ga\u00efffas, S., and Scornet, E. (2018). Minimax optimal rates for mondrian trees\n\nand forests. arXiv preprint arXiv:1803.05784.\n\n[21] Sabato, S. and Munos, R. (2014). Active regression by strati\ufb01cation. In Advances in Neural\n\nInformation Processing Systems, pages 469\u2013477.\n\n[22] Sourati, J., Akcakaya, M., Leen, T. K., Erdogmus, D., and Dy, J. G. (2017). Asymptotic\nanalysis of objectives based on \ufb01sher information in active learning. Journal of Machine Learning\nResearch, 18(34):1\u201341.\n\n[23] Willett, R., Nowak, R., and Castro, R. M. (2006). Faster rates in regression via active learning.\n\nIn Advances in Neural Information Processing Systems, pages 179\u2013186.\n\n10\n\n\f", "award": [], "sourceid": 1262, "authors": [{"given_name": "Jack", "family_name": "Goetz", "institution": "University of Michigan"}, {"given_name": "Ambuj", "family_name": "Tewari", "institution": "University of Michigan"}, {"given_name": "Paul", "family_name": "Zimmerman", "institution": "University of Michigan"}]}