{"title": "Low-Complexity Nonparametric Bayesian Online Prediction with Universal Guarantees", "book": "Advances in Neural Information Processing Systems", "page_first": 14581, "page_last": 14590, "abstract": "We propose a novel nonparametric online predictor for discrete labels conditioned on multivariate continuous features. The predictor is based on a feature space discretization induced by a full-fledged k-d tree with randomly picked directions and a recursive Bayesian distribution, which allows us to automatically learn the most relevant feature scales characterizing the conditional distribution. We prove its pointwise universality, i.e., it achieves a normalized log loss performance asymptotically as good as the true conditional entropy of the labels given the features. The time complexity to process the n-th sample point is O(log n) in probability with respect to the distribution generating the data points, whereas other exact nonparametric methods require processing all past observations. Experiments on challenging datasets show the computational and statistical efficiency of our algorithm in comparison to standard and state-of-the-art methods.", "full_text": "Low-Complexity Nonparametric Bayesian Online Prediction with Universal Guarantees\n\nAlix Lh\u00e9ritier\nAmadeus SAS\nF-06902 Sophia-Antipolis, France\nalix.lheritier@amadeus.com\n\nFr\u00e9d\u00e9ric Cazals\nUniversit\u00e9 C\u00f4te d\u2019Azur\nInria\nF-06902 Sophia-Antipolis, France\nfrederic.cazals@inria.fr\n\nAbstract\n\nWe propose a novel nonparametric online predictor for discrete labels conditioned on multivariate continuous features. The predictor is based on a feature space discretization induced by a full-fledged k-d tree with randomly picked directions and a recursive Bayesian distribution, which allows us to automatically learn the most relevant feature scales characterizing the conditional distribution. 
We prove its pointwise universality, i.e., it achieves a normalized log loss performance asymptotically as good as the true conditional entropy of the labels given the features. The time complexity to process the n-th sample point is O(log n) in probability with respect to the distribution generating the data points, whereas other exact nonparametric methods require processing all past observations. Experiments on challenging datasets show the computational and statistical efficiency of our algorithm in comparison to standard and state-of-the-art methods.\n\n1 Introduction\n\nUniversal online predictors. An online (or sequential) probability predictor processes sequentially input symbols l1, l2, . . . belonging to some alphabet L. Before observing the next symbol in the sequence, it predicts it by estimating the probability of observing each symbol of the alphabet. Then, it observes the symbol and some loss is incurred depending on the estimated probability of the current symbol. Subsequently, it adapts its model in order to better predict future symbols. The goal of universal prediction is to achieve an asymptotically optimal performance independently of the generating mechanism (see, e.g., the survey of Merhav and Feder [22]). When performance is measured in terms of the logarithmic loss, prediction is intimately related to data compression, gambling and investing (see, e.g., [7, 6]).\nBarron\u2019s theorem [3] (see also [10, Ch. 15]) establishes a fundamental link between prediction under logarithmic loss and learning: the better we can sequentially predict data from a probabilistic source, the faster we can identify a good approximation of it. This is of paramount importance when applied to nonparametric models of infinite dimensionality, where overfitting is a serious concern. This is our case, since the predictor observes some associated side information (i.e. 
features) zi \u2208 Rd before predicting li \u2208 L, where L = {\u03bb1, . . . , \u03bb|L|}. We consider the probabilistic setup where the pairs of observations (zi, li) are i.i.d. realizations of some random variables (Z, L) with joint probability measure P. Therefore, we aim at estimating a nonparametric model of the conditional measure PL|Z.\nNonparametric distributions can be approximated by universal distributions over countable unions of parametric models (see, e.g., [10, Ch. 13]). This approach requires defining parametric models that can arbitrarily approximate the nonparametric distribution as the number of parameters tends to infinity. For example, models based on histograms with arbitrarily many bins have been proposed to approximate univariate nonparametric densities (e.g., [13, 26, 36]).\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nBayesian mixtures make it possible to obtain universal distributions for countable unions of parametric models (e.g., [35, 34]). Nevertheless, standard Bayesian mixtures suffer from the catch-up phenomenon, i.e., their convergence rate is not optimal. In [31], it has been shown that a better convergence rate can be achieved by allowing models to change over time, by considering, instead of a set of distributions M, a (larger) set constituted by sequences of distributions of M. The resulting switch distribution still has a Bayesian form, but the mixture is taken over sequences of models.\nPrevious works on prediction with side information are either non-sequential (e.g. PAC learning [30]), use other losses (e.g. [11, 12]), or consider side information in more restrictive spaces (e.g. 
[1, 5]).\nOur work bears similarities to [16, 29, 32], but the objectives are different and so are the guarantees. Recently, [19] proposed a universal online predictor for side information in Rd based on a mixture of nearest-neighbors regressors with different k(n) functions specifying the number of neighbors at time n. In practice, the performance depends on the particular set of functions\u2014a design choice\u2014and its time complexity is linear in n due to the exact nearest neighbor search. Gaussian Processes (see, e.g., [25]) are nonparametric Bayesian methods which can be used for online prediction with side information. It is conjectured that exact Gaussian processes with the radial basis function (RBF) kernel are universal under some conditions on the marginal measure PZ [10, Sec. 13.5.3]. In practice, approximations are required to compute the predictive posterior for discrete labels (e.g. Laplace) and the kernel width strongly affects the results. In addition, their time complexity to predict each observation is O(n^3), making them practical for small sample sizes only.\nWe propose a novel nonparametric online predictor with universal guarantees for continuous side information exhibiting two distinctive features. First, it relies on a hierarchical feature space discretization and a recursive Bayesian distribution, automatically learning the relevant feature scales and making it scale-hyperparameter free. Second, in contrast to other nonparametric approaches, its time complexity to process the n-th sample point is O(log n) in probability. Due to space constraints, proofs are presented in the supplementary material.\n\n2 Basic definitions and notations\nIn order to represent sequences, we use the notation xn \u2261 x1, . . . , xn. The functions |\u00b7| and |\u00b7|\u03bb give, respectively, the length of a sequence and the number of occurrences of a symbol \u03bb in it. 
Let P be the\njoint probability measure of L, Z. Let PL, PZ be their respective marginal measures and PZ|L the\nprobability measure of Z conditioned on L. The entropy of random variables is denoted H (\u00b7), while\nthe entropy of L conditioned on Z is denoted H (L|Z). The mutual information between L and Z is\ndenoted I (L; Z). Logarithms are taken in base 2.\nA \ufb01nite-measurable partition A = (\u03b31, . . . , \u03b3n) of some set \u2126 is a subdivision of \u2126 into a \ufb01nite\nnumber of disjoint measurable sets or cells \u03b3i whose union is \u2126. An n-sample partition rule \u03c0n(\u00b7) is\na mapping from \u2126n to the space of \ufb01nite-measurable partitions for \u2126, denoted A(\u2126). A partitioning\nscheme for \u2126 is a countable collection of k-sample partition rules \u03a0 \u2261 {\u03c0k}k\u2208N+. The partitioning\nscheme at time n de\ufb01nes the set of partition rules \u03a0n \u2261 {\u03c0k}k=1..n. For a given n-sample partition\nrule \u03c0n(\u00b7) and a sequence zn \u2208 \u2126n, \u03c0n(z|zn) denotes the unique cell in \u03c0n(zn) containing a given\npoint z \u2208 \u2126. For a given partition A, let A(z) denote the unique cell of A containing z. Let \u03b3 (\u00b7)\ndenote the operator that extracts the subsequences whose symbols have corresponding zi \u2208 \u03b3.\n\n3 The kd-switch distribution\n\nWe de\ufb01ne the kd-switch distribution Pkds using a k-d tree based hierarchical partitioning and a\nswitch distribution de\ufb01ned over the union of multinomial distributions implied by the partitioning.\nFull-\ufb02edged k-d tree based spatial partitioning. We obtain a hierarchical partitioning of \u2126 = Rd\nusing a full-\ufb02edged k-d tree [8, Sec. 20.4] that is naturally amenable to an online construction\nsince pivot points are chosen in the same order as sample points are observed. 
Instead of rotating the axis of the projections deterministically, we sample the axis uniformly at each node of the tree. Formally, let \u03a0kd \u2261 {\u03c0k}k\u2208N+ be the nested partitioning scheme such that \u03c0n(zn) is the spatial partition generated by a full-fledged k-d tree after observing zn. In order to define it recursively, let the base case be \u03c00(z0) \u2261 Rd, where z0 is the empty string. Then, \u03c0n+1(zn+1) is obtained by uniformly drawing a direction J in 1..d and by replacing the cell \u03b3 \u2208 \u03c0n(zn) such that zn+1 \u2208 \u03b3 by the following cells\n\n\u03b31 \u2261 {z \u2208 \u03b3 : z[J] \u2264 zn+1[J]},  \u03b32 \u2261 {z \u2208 \u03b3 : z[J] > zn+1[J]}    (1)\n\nwhere \u00b7[J] extracts the J-th coordinate of the given vector. A spatial partition A = {\u03b31, \u03b32, . . . , \u03b3|A|} of Rd defines a class of piecewise multinomial distributions characterized by \u03b8A \u2261 [\u03b81, . . . , \u03b8|A|], \u03b8i \u2208 \u2206|L|, where \u2206|L| is the standard |L|-simplex. More precisely, P\u03b8A(\u00b7|z) is multinomial with parameter \u03b8i if z \u2208 \u03b3i.\n\nContext Tree Switching. We adapt the Context Tree Switching (CTS) distribution [33] to use spatial cells as contexts. Since these contexts are created as sample points zi are observed, the chronology of their creation has to be taken into account. Given a nested partitioning scheme \u03a0 whose instantiation with zn creates a cell \u03b3 and splits it into \u03b31 and \u03b32, we define the cell splitting index \u03c4n(\u03b3) as the index in the subsequence \u03b3(zn) when \u03b31 and \u03b32 are created (see Fig. 1). If \u03b3 is not split by \u03a0 instantiated with zn, then we define \u03c4n(\u03b3) \u2261 \u221e.\nAt each cell \u03b3, two models, defined later and denoted a and b, are considered. 
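To make the splitting rule concrete, the randomized online cell split of Eq. 1 can be sketched in a few lines of Python (a minimal illustration under our own naming, not the released kd-switch code):

```python
import random

class Cell:
    """A node of the full-fledged k-d tree. Internal nodes store the pivot
    point z that split them and the uniformly drawn direction J; leaves
    store neither."""
    def __init__(self):
        self.pivot = None
        self.axis = None
        self.left = None   # {z in cell : z[J] <= pivot[J]}
        self.right = None  # {z in cell : z[J] >  pivot[J]}

    def is_leaf(self):
        return self.pivot is None

def find_leaf(root, z):
    """Route z to the unique leaf cell containing it."""
    node = root
    while not node.is_leaf():
        node = node.left if z[node.axis] <= node.pivot[node.axis] else node.right
    return node

def insert(root, z, d, rng):
    """Online construction: the leaf containing z is split with pivot z
    along a direction drawn uniformly in 0..d-1 (cf. Eq. 1)."""
    leaf = find_leaf(root, z)
    leaf.pivot, leaf.axis = z, rng.randrange(d)
    leaf.left, leaf.right = Cell(), Cell()

def count_leaves(node):
    return 1 if node.is_leaf() else count_leaves(node.left) + count_leaves(node.right)
```

Each insertion replaces one leaf by two children, so after n points the partition has n + 1 cells; pivots are taken in observation order, which is what makes the construction online.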
Let w_\u03b3(\u00b7) be a prior over model index sequences i^m \u2261 i_1, . . . , i_m \u2208 {a, b}^m at cell \u03b3, recursively defined by\n\nw_\u03b3(i^m) \u2261 1 if m = 0; 1/2 if m = 1; w_\u03b3(i^{m\u22121})((1 \u2212 \u03b1^\u03b3_m) 1_E + \u03b1^\u03b3_m 1_{\u00acE}) if m > 1, where E \u2261 {i_m = i_{m\u22121}} and \u03b1^\u03b3_m = m^{\u22121}.\n\nFigure 1: Cell creation process and cell splitting index. (a) \u03c01(z1): \u03b31 and \u03b32 are created, z1 \u2208 \u03b31. (b) \u03c02(z2): \u03b32,1 and \u03b32,2 are created, z2 \u2208 \u03b32,2. The cell splitting index is defined w.r.t. its subsequence: \u03c42(\u2126) = 1, \u03c42(\u03b32) = 1 since \u03b32(z2) = z2, and \u03c42(\u03b31) = \u03c42(\u03b32,1) = \u03c42(\u03b32,2) = \u221e.\n\nIn order to define the CTS distribution, we need the Jeffreys\u2019 mixture over multinomial distributions, also known as the Krichevsky-Trofimov estimator [17]:\n\nPkt(ln) \u2261 \u222b_{\u03b8\u2208\u2206|L|} \u220f_{j\u22081...|L|} \u03b8[j]^{|ln|\u03bbj} w(\u03b8) d\u03b8    (2)\n\nwith \u03b8[j] being the j-th component of the vector \u03b8, |ln|\u03bbj the number of occurrences of \u03bbj in ln, and w(\u00b7) the Jeffreys\u2019 prior for the multinomial distribution [14], i.e., a Dirichlet distribution with parameters (1/2, . . . , 1/2).\nConsider any cell \u03b3 created by the partitioning scheme \u03a0 instantiated with zn. \u03b3 can either be refined into two child cells \u03b31 and \u03b32 or have \u03c4n(\u03b3) = \u221e. 
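For intuition, the KT estimator of Eq. 2 admits the classical sequential form P(next = \u03bb | l^n) = (|l^n|_\u03bb + 1/2) / (n + |L|/2); a minimal sketch with exact rationals (our own toy code, not the released implementation):

```python
from fractions import Fraction

class KTEstimator:
    """Krichevsky-Trofimov estimator over an alphabet of size k: the
    Jeffreys (Dirichlet(1/2,...,1/2)) mixture of multinomials, computed
    sequentially via P(next = s | past) = (count(s) + 1/2) / (n + k/2)."""
    def __init__(self, k):
        self.k = k
        self.counts = [0] * k
        self.n = 0

    def predict(self, s):
        return (self.counts[s] + Fraction(1, 2)) / (self.n + Fraction(self.k, 2))

    def update(self, s):
        self.counts[s] += 1
        self.n += 1

def kt_probability(seq, k):
    """Joint probability Pkt(seq) as a product of predictive terms."""
    est, p = KTEstimator(k), Fraction(1)
    for s in seq:
        p *= est.predict(s)
        est.update(s)
    return p
```

For a binary alphabet the first symbol gets probability 1/2, and the estimate is then pulled toward the empirical frequencies, e.g. Pkt([0, 0]) = 1/2 \u00b7 3/4 = 3/8.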
Given a sequence of labels ln such that all the corresponding positions zi \u2208 \u03b3, the modified CTS distribution is given by\n\nP^{\u03a0,\u03b3}_cts(ln|zn) \u2261 \u2211_{i^n\u2208{a,b}^n} w_\u03b3(i^n) \u220f_{k=1..n} [1{ik=a} \u03c6a(lk|lk\u22121) + 1{ik=b} \u03c6^\u03b3_b(lk|lk\u22121, zk)]    (3)\n\nwhere the predictive distributions of models a and b are given by\n\n\u03c6a(lk|lk\u22121) \u2261 Pkt(lk) / Pkt(lk\u22121)    (4)\n\n\u03c6^\u03b3_b(lk|lk\u22121, zk) \u2261 Pkt(lk|lk\u22121) if k < \u03c4k(\u03b3), otherwise P^{\u03a0,\u03b3j}_cts(\u03b3j(lk)|\u03b3j(zk)) / P^{\u03a0,\u03b3j}_cts(\u03b3j(lk) -1 |\u03b3j(zk) -1) with j : zk \u2208 \u03b3j    (5)\n\nwhere \u00b7 -1 removes the last symbol of a sequence and, for the empty sequences l0, z0, P^{\u03a0,\u03b3}_cts(l0|z0) \u2261 1 and Pkt(l0) \u2261 1.\n\nDefinition of Pkds. The kd-switch distribution is obtained from the modified CTS distribution on the context cells defined by the full-fledged k-d tree spatial partitioning scheme, i.e.,\n\nPkds(ln|zn) \u2261 P^{\u03a0kd,Rd}_cts(ln|zn).    (6)\n\nRemark 1. In [33], the authors observe better empirical performance with \u03b1^\u03b3_m = n^{\u22121} for any cell \u03b3, where n is the number of samples observed at the root partition \u2126 when the m-th sample is observed in \u03b3. With this switching rate they were able to provide a good redundancy bound for bounded depth trees. In our unbounded case, we observed a better empirical performance with \u03b1^\u03b3_m = m^{\u22121}.\nRemark 2. A Context Tree Weighting [35] scheme can be obtained by setting \u03b1^\u03b3_m = 0. The corresponding distribution is denoted Pkdw.\n\n4 Pointwise universality\n\nIn this section, we show that Pkds is pointwise universal, i.e., it achieves a normalized log loss asymptotically as good as the true conditional entropy of the source generating the samples. More formally, we state the following theorem.\nTheorem 1. The kd-switch distribution is pointwise universal, i.e.,\n\n\u2212 lim_{n\u2192\u221e} (1/n) log Pkds(Ln|Zn) \u2264 H(L|Z) a.s.    (7)\n\nfor any probability measure P generating the samples such that PZ|L are absolutely continuous with respect to the Lebesgue measure.\nIn order to prove Thm. 1, we first show that P^{\u03a0,\u2126}_cts is universal with respect to the class of piecewise multinomial distributions defined by any nested partitioning scheme \u03a0. Then, we show that \u03a0kd makes it possible to approximate any conditional distribution arbitrarily well.\nUniversality with respect to the class of piecewise multinomial distributions. Consider a nested partitioning scheme \u03a0 for \u2126. \u03a0n instantiated with some zn \u2208 \u2126n naturally defines a tree structure whose root node represents \u2126. Given an arbitrary set of internal nodes, we can prune the tree by transforming these internal nodes into leaves and discarding the corresponding subtrees. The new set of leaf nodes defines a partition of \u2126. Let Pn(zn) be the set of all the partitions that can be obtained by pruning the tree induced by \u03a0n instantiated with zn.\nThe next lemma shows that P^{\u03a0,\u2126}_cts, defined in Eq. 3, is universal with respect to the class of piecewise multinomial distributions defined on the partitions Pn(zn).\nLemma 1. Consider arbitrary sequences ln \u2208 Ln, zn \u2208 \u2126n, n \u2265 0. 
Then, for any A \u2208 Pn(zn) and for any piecewise multinomial distribution P_{\u03b8A}, the following holds\n\n\u2212 log P^{\u03a0,\u2126}_cts(ln|zn) \u2264 \u2212 log P_{\u03b8A}(ln|zn) + |A| \u03b6(n/|A|) + \u0393_A log 2n + O(1)    (8)\n\nand\n\n\u2212 lim_{n\u2192\u221e} (1/n) log P^{\u03a0,\u2126}_cts(Ln|Zn) \u2264 H(L|A(Z)) a.s.    (9)\n\nwhere \u0393_A is the number of nodes in the tree, induced by \u03a0n instantiated with zn, that represents A (i.e., the code length given by a natural code for unbounded trees) and\n\n\u03b6(x) \u2261 x log|L| if 0 \u2264 x < 1, and \u03b6(x) \u2261 ((|L| \u2212 1)/2) log x + log|L| if x \u2265 1.    (10)\n\nRemark 3. In a Context Tree Weighting scheme (\u03b1^\u03b3_m = 0), the log 2n factor in Eq. 8 disappears; see the proof of Lemma 1. Thus, universality holds for this case too.\nUniversal discretization of the feature space. In order to prove that the k-d tree based partitions can approximate the conditional entropy H(L|Z) arbitrarily well, we use the following corollary of [28, Thm. 4.2].\nCorollary 1. Let diam(\u03b3) \u2261 sup_{x,y\u2208\u03b3} \u2016x \u2212 y\u2016. Let P be any probability measure such that PZ|L are absolutely continuous with respect to the Lebesgue measure. Given a partition scheme \u03a0 \u2261 {\u03c0k}k\u2208N+, if \u2200\u03b4 > 0\n\nPZ({z \u2208 Rd : diam(\u03c0n(z|Zn)) > \u03b4}) \u2192 0 a.s.    (11)\n\nthen \u03a0 universally discretizes the feature space, i.e.,\n\nH(L|\u03c0n(Z|Zn)) \u2192 H(L|Z) a.s.    (12)\n\nThe next lemma provides the required shrinking condition for the k-d tree based partitioning.\nLemma 2. \u03a0kd satisfies the shrinking condition of Eq. 11 and, thus, universally discretizes the feature space.\nPointwise universality. The proof of Thm. 
1 on the pointwise universality of Pkds stems from a combination of Lemmas 1 and 2\u2014see Appendix.\n\n5 Online algorithm\n\nSince a direct computation of Eq. (3) is intractable and an online implementation is desired, we use the recursive form of [33, Algorithm 1], which performs the exact calculation. We denote by P^\u03b3_s the sequentially computed kd-switch distribution at node \u03b3. In Section 5.2, we show that P^\u03b3_s(ln|zn) = P^{\u03a0kd,\u03b3}_cts(ln|zn).\n\n5.1 Algorithm\nOutline. For each node of the k-d tree, the algorithm maintains two weights denoted w^a_\u03b3 and w^b_\u03b3. As follows from [33, Lemma 2], if lt, zt are the subsequences observed in \u03b3 and w^a_\u03b3 is the weight before processing lt, then w^a_\u03b3 Pkt(lt|lt\u22121) corresponds to the contribution of all possible model sequences ending in model a (KT) to the total probability assigned to lt by the CTS distribution. Analogously, w^b_\u03b3 P^\u03b3_r(lt|lt\u22121, zt) corresponds to the contribution of all possible model sequences ending in model b (CTS).\nWe now describe the three steps that allow the online computation of P^\u03b3_s(ln|zn) given by Eq. 14. The algorithm starts with only a root node representing Rd. When a new point z\u2217 is observed, the following steps are performed.\nStep 1: k-d tree update and new cells\u2019 initialization. The point z\u2217 is passed down the k-d tree until it reaches a leaf cell \u03b3. Then, a coordinate J is uniformly drawn from 1 . . . d and two child nodes, corresponding to the new cells defined in Eq. 1 with z\u2217 as splitting point, are created. Let ln, zn be the subsequences observed in \u03b3 and thus zn = z\u2217. Since the new cells may contain some of the symbols in ln\u22121, the following initialization is performed at each new node \u03b3i, i \u2208 {1, 2}:\n\nw^a_{\u03b3i} \u2190 (1/2) Pkt(\u03b3i(ln\u22121)),  w^b_{\u03b3i} \u2190 (1/2) Pkt(\u03b3i(ln\u22121)), with Pkt(\u03b3i(ln\u22121)) = 1 if \u03b3i(ln\u22121) is empty.    (13)\n\nStep 2: Prediction. The probability assigned to the subsequence ln given zn observed in \u03b3 is\n\nP^\u03b3_s(ln|zn) \u2190 w^a_\u03b3 Pkt(ln|ln\u22121) + w^b_\u03b3 P^\u03b3_r(ln|ln\u22121, zn)    (14)\n\nwhere\n\nP^\u03b3_r(ln|ln\u22121, zn) \u2190 Pkt(ln|ln\u22121) if n < \u03c4n(\u03b3), otherwise P^{\u03b3j}_s(\u03b3j(ln)|\u03b3j(zn)) / P^{\u03b3j}_s(\u03b3j(ln) -1 |\u03b3j(zn) -1) with j : zn \u2208 \u03b3j.    (15)\n\nStep 3: Updates. Having computed the probability assignment of Eq. 14, the weights of the nodes corresponding to the cells {\u03b3 : z\u2217 \u2208 \u03b3} are updated. Given a node \u03b3 to be updated, let ln, zn be the subsequences observed in \u03b3. The following updates are applied:\n\nw^a_\u03b3 \u2190 \u03b1^\u03b3_{n+1} P^\u03b3_s(ln|zn) + \u03b2^\u03b3_{n+1} w^a_\u03b3 Pkt(ln|ln\u22121),  w^b_\u03b3 \u2190 \u03b1^\u03b3_{n+1} P^\u03b3_s(ln|zn) + \u03b2^\u03b3_{n+1} w^b_\u03b3 P^\u03b3_r(ln|ln\u22121, zn)    (16)\n\nwhere \u03b2^\u03b3_n \u2261 (1 \u2212 2\u03b1^\u03b3_n). When \u03b3 has just been created (i.e. \u03b3 is a leaf node), these updates reduce to\n\nw^a_\u03b3 \u2190 w^a_\u03b3 Pkt(ln|ln\u22121),  w^b_\u03b3 \u2190 w^b_\u03b3 Pkt(ln|ln\u22121).    (17)\n\nRemark 4. The KT estimator can be computed sequentially using the following formula [27]:\n\nPkt(ln|ln\u22121) = (|ln\u22121|ln + 1/2) / (|ln\u22121| + |L|/2).    (18)\n\nTherefore, the sequential computation only requires maintaining the counters |ln\u22121|\u03bb for each cell.\nRemark 5. Samples zi only need to be stored at leaf nodes. Once a leaf node is split, they are moved to their corresponding child nodes.\n\n5.2 Correctness\n\nThe steps of our algorithm are the same as those of [33, Algorithm 1] except for the initialization of Eq. 13. In fact, as shown in the next lemma, it is equivalent to building the partitioning tree from the beginning (assuming, without loss of generality, that zn is known in advance) and applying the original algorithm at every relevant context.\nLemma 3. Let n \u2208 N+ and assume the partitioning tree for zn is built from the beginning. Let \u03b3 be any node of the tree. If the original initialization and update equations from [33, Algorithm 1] (corresponding respectively to Eq. 13 with an empty sequence and Eq. 16) are applied, the weights, after observing lt in \u03b3 with t < \u03c4n(\u03b3), are w^a_\u03b3 = (1/2) Pkt(lt) and w^b_\u03b3 = (1/2) Pkt(lt), which correspond to those obtained after the initialization of Eq. 13 and the updates of Eq. 17.\nThe correctness of our algorithm follows from Lem. 3 and [33, Thm. 4], since for t \u2265 \u03c4n(\u03b3) the original update equations are used.\n\n5.3 Complexity\n\nThe cost of processing ln, zn is linear in the depth Dn of the node split by the insertion of zn, since the algorithm updates the weights at each node in the path leading to this node. If PZ is absolutely continuous with respect to the Lebesgue measure, since the full-fledged k-d tree is monotone transformation invariant, we can assume without loss of generality that the marginal distributions of Z are uniform in [0, 1] (see [8, Sec. 
20.1]) and thus its profile is equivalent to that of a random binary search tree under the random permutation model (see [21, Sec. 2.3]). Then, Dn corresponds to the cost of an unsuccessful search and Dn/(2 log n) \u2192 1 in probability (see [21, Sec. 2.4]). Therefore, the complexity of processing ln, zn is O(log n) in probability with respect to PZn.\n\n6 Experiments\nSoftware-hardware setup. Python code and data used for the experiments are available at https://github.com/alherit/kd-switch. Experiments were carried out on a machine running Debian 3.16, equipped with two Intel(R) Xeon(R) E5-2667 v2 @ 3.30GHz processors and 62 GB of RAM.\nBoosting finite length performance with ensembling. When considering finite length performance, we can be unlucky and obtain a bad set of hierarchical partitions (i.e., with low discrimination power). In order to boost the probability of finding good partitions, we can use a Bayesian mixture of J trees. Bayesian mixtures trivially maintain universality.\nTwo sampling scenarios for labels. In the first one, labels are sampled from a Bernoulli distribution such that P(L = 0) = \u03b80, where \u03b80 is a known parameter. We then sample from PZ|L. In this case, the root node distribution Pkt(ln|ln\u22121) is replaced by P(Ln = ln) = \u03b80^{1{ln=0}} (1 \u2212 \u03b80)^{1{ln=1}}, since \u03b80 is known. In the second one, observations come in random order and P(L) is unknown.\n\n6.1 Normalized log loss (NLL) convergence\nDatasets. We use the following datasets, detailed in Appendix B.1: (L-i) A 2D dataset consisting of two Gaussian Mixtures spanning three different scales. (L-ii) A dataset in dimension d = 784 composed of both real MNIST digits, as well as digits generated by a Generative Adversarial Network [24] trained on the MNIST dataset. (L-iii) The Higgs dataset [20], the goal being to distinguish the signature of processes producing Higgs bosons. (L-iv) The Breast Cancer Wisconsin (Diagnostic) Data Set [20]\u2014dimension d = 30.\n\nFigure 2: (Left and Middle) Convergence of NLL as a function of n, for a 30-minute calculation. (Right) Running time as a function of n. Error bands represent the std dev. w.r.t. the randomness in the tree generation, except for dataset (L-iv), where they represent the std dev. w.r.t. the shuffling of the data.\n\nFor cases (L-i, L-ii, L-iii), in order to feed the online predictors, we apply the first sampling scenario for labels. For case (L-iv), we apply the second one and, in each trial, we take the pooled dataset in a random order to feed the online predictors.\nResults. We focus on the cumulative normalized log loss performance (NLL), and the trade-off with the computational requirements\u2014by limiting the running time to 30 minutes.\nWe compare the performance of our online predictors Pkds and Pkdw (see Rmk. 2) with a number of trees J \u2208 {1, 50}, against the following contenders. The Bayesian mixture of knn-based sequential regressors proposed in [19], with a switch distribution using a horizon-free prior as Pkds. In practice, this predictor depends on a given set of functions of n specifying the number of neighbors. We use the same set specified in [19]. We also consider a Bayesian Mixture of Gaussian Processes Classifiers (gp) with RBF kernel width \u03c3 \u2208 {2^{4i}}_{i=\u22125...7}. (Our implementation uses the scikit-learn GaussianProcessClassifier [23]. For each observation, we retrain the classifier using all past observations\u2014a step requiring samples from the two populations. Thus, we predict with a uniform distribution (or P(L) when known) until at least one instance of each label has been observed.) In the case (L-i), we also compare to the true conditional probability, which can be easily derived since PZ|L are known. 
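The cumulative normalized log loss used throughout these experiments is simply the average code length per symbol; a minimal sketch (our own helper, assuming a predictor object exposing predict/update methods as in the sequential setting):

```python
import math

def normalized_log_loss(predictor, labels):
    """NLL after n symbols: -(1/n) * sum_t log2 P(l_t | l^{t-1})."""
    total = 0.0
    for l in labels:
        total -= math.log2(predictor.predict(l))
        predictor.update(l)
    return total / len(labels)

class UniformPredictor:
    """Toy baseline assigning probability 1/|L| to every label."""
    def __init__(self, k):
        self.k = k
    def predict(self, l):
        return 1.0 / self.k
    def update(self, l):
        pass
```

A uniform predictor over two labels incurs exactly 1 bit per symbol; a pointwise universal predictor's NLL instead approaches the conditional entropy H(L|Z).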
Note that the normalized log loss of the true conditional probability converges to\nthe conditional entropy by the Shannon\u2013McMillan\u2013Breiman theorem [7].\nFig. 2 (Left and Middle) shows the NLL convergence with respect to the number of samples. Notice\nthat due to the 30\u2019 running time budget, curves stop at different n. Fig. 2 (Right) illustrates the\ncomputational complexity of each method. For our predictors, the statistical ef\ufb01ciency increases with\nthe number of trees\u2014at the expense of the computational burden. Weighting performs better than\nswitching for datasets (L-i, L-ii, L-iv), and the other way around for (L-iii). knn takes some time to\nget the right scale, then converges fast in most cases\u2014with a plateau though on dataset L-ii though.\nThis owes to the particular set of exponents used to de\ufb01ne the mixture of regressors [19]. Also, knn\nis computationally demanding, in particular when compared to our predictors with J = 1.\n\n7\n\n102103104105106n0.9960.9981.0001.0021.0041.006NLLHIGGS n_trials=10algorithmgpkds-J1kds-J50kdw-J1kdw-J50knn102103104105106n0.860.880.900.920.940.960.981.001.02NLLGMM n_trials=10algorithmgpkds-J1kds-J50kdw-J1kdw-J50knntrue02000004000006000008000001000000n02505007501000125015001750elapsed_time (s)GMM n_trials=10algorithmgpkds-J1kds-J50kdw-J1kdw-J50knn0100200300400500n0.40.60.81.01.2NLLBREASTCANCER n_trials=30algorithmgpkds-J1kds-J50kdw-J1kdw-J50knn102103104105n0.00.20.40.60.81.0NLLGANMNIST n_trials=10algorithmgpkds-J1kds-J50kdw-J1kdw-J50knn020000400006000080000100000n02505007501000125015001750elapsed_time (s)GANMNIST n_trials=10algorithmgpkds-J1kds-J50kdw-J1kdw-J50knn\f(a) SG. d = 50.\n\n(b) GMD. d = 100.\n\n(c) GVD. d = 50.\n\n(d) Blobs. d = 2.\n\nFigure 3: Tests on randomly rotated Gaussian datasets from [15]. The abscissa represents the\ntest sample size ntest for each of the two samples. Thus, for sequential methods, n = 4ntest.\n\n6.2 Two-sample testing (TST)\nConstruction. 
Given samples from two distributions, whose corresponding random variables X \u2208 Rd and Y \u2208 Rd are i.i.d., a nonparametric two-sample test tries to determine whether the null hypothesis PX = PY holds or not (see, e.g., [18, Section 6.9]). Consistent sequential two-sample tests with optional stop (i.e. the p-value is valid at any time n) can be built from a pointwise universal online predictor Q [19] by defining (L, Z) as: (0, X) with probability \u03b80, or (1, Y) with probability 1 \u2212 \u03b80, where \u03b80 is a design parameter set to 1/2 in the following experiments. The p-value is the likelihood ratio P(ln) / Q(ln|zn). Note this corresponds to the first sampling scenario for labels. The instantiation of this construction with Pkds and J = 50 is denoted KDS-seq.\nContenders. We compare KDS-seq against SW\u03c0S from [19], denoted KNN-seq: a sequential two-sample test obtained by instantiating the construction described above with the online knn predictor described in the previous section. We also compare KDS-seq against the kernel tests from [15]: ME-full, ME-grid, SCF-full, SCF-grid, MMD-quad, MMD-lin, and the classical Hotelling\u2019s T2 test for differences in means under Gaussian assumptions. These tests depend on a kernel width \u03c3 learned on a training set\u2014the train-test paradigm\u2014as opposed to KDS-seq, which automatically detects the pertinent scales. Contenders were launched with the hyperparameters specified in their respective papers. For a fair comparison between sequential methods and those tests using the train-test paradigm with ntest used for testing, we use a number of samples n = 4ntest\u2014details in Appendix B.2.\nDatasets. 
We use the four datasets from [15, Table 1]: (T-i) Same Gaussians in dimension d = 50, to assess the type I error; (T-ii) Gaussian Mean Difference (GMD): normal distributions with a difference in means along one direction, d = 100; (T-iii) Gaussian Variance Difference (GVD): normal distributions with a difference in variance along one direction, d = 50; (T-iv) Blobs (Mixture of Gaussian distributions centered on a lattice) [9]. Datasets (T-ii, T-iii, T-iv) are meant to assess type II error. To prevent k-d tree cuts from exploiting the particular direction where the difference lies, these datasets undergo a random rotation (one per tree). See Appendix B.3 for results without rotations.\nResults. The significance level is set to \u03b1 = .01 in all the cases. The Type I error rate and the power (1 \u2212 Type II error rate) are computed over 500 trials. In the SG case (Fig. 3(a)), all the tests have a Type I error rate around the specified \u03b1, as expected. In the GMD and Blobs cases (Fig. 3(b,d)), KDS-seq matches or outperforms all the contenders. On Blobs, KDS-seq outperforms KNN-seq thanks to its automatic scale detection, even though the mixture used by the latter allows it to handle the multiple scales. For GVD (Fig. 3(c)), our results are weaker. To see why, recall that GMD is generated by adding one unit to one coordinate of the mean vector, while GVD is obtained by doubling the variance along one direction. The span of the latter dataset is larger, and upon rotating the data\u2014see comment above\u2014all directions are impacted. Given the high dimensionality, the partitioning of k-d trees has more difficulty reducing the diameter of cells, which is key to convergence\u2014see Corollary 1.\n\n7 Outlook\n\nWe foresee the following research directions. A first open question is to characterize the situations where switching should be preferred over weighting. 
A second core question is to quantify the ability of our k-d tree based construction to cope with multiple scales in the data. A third one is the derivation of finite-length bounds related to the complexity of the underlying conditional distribution. Finally, accommodating data in a metric (non-Euclidean) space, using, e.g., metric trees, would widen the application spectrum of the method.

[Figure 3: Type I error (SG) and test power (GMD, GVD, Blobs) as a function of the test sample size, for ME-full, ME-grid, SCF-full, SCF-grid, MMD-quad, MMD-lin, T², KDS-seq and KNN-seq.]

Acknowledgments

We would like to thank María Zuluaga, Eoin Thomas, Nicolas Bondoux and Rodrigo Acuña-Agost for insightful comments, and Wittawat Jitkrittum and Arthur Gretton for providing us with the complete output of their experiments.

References

[1] P. Algoet. Universal schemes for prediction, gambling and portfolio selection. The Annals of Probability, 20(2):901–941, 1992.

[2] P. Baldi, P. Sadowski, and D. Whiteson. Searching for exotic particles in high-energy physics with deep learning. Nature Communications, 5:4308, 2014.

[3] A. R. Barron. Information-theoretic characterization of Bayes performance and the choice of priors in parametric and nonparametric problems. In J. M. Bernardo, J. O. Berger, A. P. Dawid, and A. F. M. Smith, editors, Bayesian Statistics 6, pages 27–52. Oxford University Press, 1998.

[4] R. Begleiter and R. El-Yaniv. Superior guarantees for sequential prediction and lossless compression via alphabet decomposition. Journal of Machine Learning Research, 7(Feb):379–411, 2006.

[5] H. Cai, S. R. Kulkarni, and S. Verdú. A universal lossless compressor with side information based on context tree weighting. In Information Theory, 2005. ISIT 2005. Proceedings.
International Symposium on, pages 2340–2344. IEEE, 2005.

[6] N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.

[7] T. Cover and J. Thomas. Elements of Information Theory. Wiley & Sons, 2006.

[8] L. Devroye, L. Györfi, and G. Lugosi. A Probabilistic Theory of Pattern Recognition, volume 31. Springer-Verlag, 1996.

[9] A. Gretton, D. Sejdinovic, H. Strathmann, S. Balakrishnan, M. Pontil, K. Fukumizu, and B. Sriperumbudur. Optimal kernel choice for large-scale two-sample tests. In Advances in Neural Information Processing Systems, pages 1205–1213, 2012.

[10] P. D. Grünwald. The Minimum Description Length Principle. MIT Press, 2007.

[11] L. Györfi and G. Lugosi. Strategies for sequential prediction of stationary time series. In Modeling Uncertainty, pages 225–248. Springer, 2005.

[12] L. Györfi, G. Lugosi, and G. Morvai. A simple randomized algorithm for sequential prediction of ergodic time series. IEEE Transactions on Information Theory, 45(7):2642–2650, 1999.

[13] P. Hall and E. Hannan. On stochastic complexity and nonparametric density estimation. Biometrika, 75(4):705–714, 1988.

[14] H. Jeffreys. An invariant form for the prior probability in estimation problems. Proceedings of the Royal Society of London. Series A. Mathematical and Physical Sciences, 186(1007):453–461, 1946.

[15] W. Jitkrittum, Z. Szabó, K. P. Chwialkowski, and A. Gretton. Interpretable distribution features with maximum testing power. In Advances in Neural Information Processing Systems, pages 181–189, 2016.

[16] S. S. Kozat, A. C. Singer, and G. C. Zeitler. Universal piecewise linear prediction via context trees. IEEE Transactions on Signal Processing, 55(7):3730–3745, 2007.

[17] R. Krichevsky and V. Trofimov. The performance of universal encoding.
IEEE Transactions on Information Theory, 27(2):199–207, 1981.

[18] E. L. Lehmann and J. P. Romano. Testing Statistical Hypotheses. Springer Texts in Statistics. Springer, New York, third edition, 2005.

[19] A. Lhéritier and F. Cazals. A sequential non-parametric multivariate two-sample test. IEEE Transactions on Information Theory, 64(5):3361–3370, 2018.

[20] M. Lichman. UCI machine learning repository, 2013.

[21] H. Mahmoud. Evolution of Random Search Trees. Wiley-Interscience, 1992.

[22] N. Merhav and M. Feder. Universal prediction. IEEE Transactions on Information Theory, 44(6):2124–2147, 1998.

[23] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.

[24] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. In ICLR, 2016.

[25] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning). The MIT Press, 2005.

[26] J. Rissanen, T. Speed, and B. Yu. Density estimation by stochastic complexity. IEEE Transactions on Information Theory, 38(2):315–323, 1992.

[27] Y. Shtar'kov. Universal sequential coding of single messages. Problemy Peredachi Informatsii, 23(3):3–17, 1987.

[28] J. Silva. On optimal signal representation for statistical learning and pattern recognition. PhD thesis, University of Southern California, 2008.

[29] N. Tziortziotis, C. Dimitrakakis, and K. Blekas. Cover tree Bayesian reinforcement learning. The Journal of Machine Learning Research, 15(1):2313–2335, 2014.

[30] L. G. Valiant. A theory of the learnable.
Communications of the ACM, 27(11):1134–1142, 1984.

[31] T. van Erven, P. Grünwald, and S. de Rooij. Catching up faster by switching sooner: a predictive approach to adaptive estimation with an application to the AIC–BIC dilemma. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 74(3):361–417, 2012.

[32] J. Veness, T. Lattimore, A. Bhoopchand, A. Grabska-Barwinska, C. Mattern, and P. Toth. Online learning with gated linear networks. arXiv preprint arXiv:1712.01897, 2017.

[33] J. Veness, K. S. Ng, M. Hutter, and M. Bowling. Context tree switching. In Data Compression Conference (DCC), pages 327–336. IEEE, 2012.

[34] F. M. J. Willems. The context-tree weighting method: Extensions. IEEE Transactions on Information Theory, 44(2):792–798, 1998.

[35] F. M. J. Willems, Y. M. Shtarkov, and T. J. Tjalkens. The context-tree weighting method: Basic properties. IEEE Transactions on Information Theory, 41(3):653–664, 1995.

[36] B. Yu and T. Speed. Data compression and histograms. Probability Theory and Related Fields, 92(2):195–229, 1992.