{"title": "Improvements to the Sequence Memoizer", "book": "Advances in Neural Information Processing Systems", "page_first": 685, "page_last": 693, "abstract": "The sequence memoizer is a model for sequence data with state-of-the-art performance on language modeling and compression. We propose a number of improvements to the model and inference algorithm, including an enlarged range of hyperparameters, a memory-efficient representation, and inference algorithms operating on the new representation. Our derivations are based on precise definitions of the various processes that will also allow us to provide an elementary proof of the mysterious\" coagulation and fragmentation properties used in the original paper on the sequence memoizer by Wood et al. (2009). We present some experimental results supporting our improvements.\"", "full_text": "Improvements to the Sequence Memoizer\n\nJan Gasthaus\n\nYee Whye Teh\n\nGatsby Computational Neuroscience Unit\n\nGatsby Computational Neuroscience Unit\n\nUniversity College London\nLondon, WC1N 3AR, UK\n\nUniversity College London\nLondon, WC1N 3AR, UK\n\nj.gasthaus@gatsby.ucl.ac.uk\n\nywteh@gatsby.ucl.ac.uk\n\nAbstract\n\nThe sequence memoizer is a model for sequence data with state-of-the-art per-\nformance on language modeling and compression. We propose a number of\nimprovements to the model and inference algorithm, including an enlarged range\nof hyperparameters, a memory-ef\ufb01cient representation, and inference algorithms\noperating on the new representation. Our derivations are based on precise de\ufb01-\nnitions of the various processes that will also allow us to provide an elementary\nproof of the \u201cmysterious\u201d coagulation and fragmentation properties used in the\noriginal paper on the sequence memoizer by Wood et al. (2009). 
We present some\nexperimental results supporting our improvements.\n\n1\n\nIntroduction\n\nThe sequence memoizer (SM) is a Bayesian nonparametric model for discrete sequence data producing\nstate-of-the-art results for language modeling and compression [1, 2]. It models each symbol of a\nsequence using a predictive distribution that is conditioned on all previous symbols, and thus can be\nunderstood as a non-Markov sequence model. Given the very large (in\ufb01nite) number of predictive\ndistributions needed to model arbitrary sequences, it is essential that statistical strength be shared in\ntheir estimation. To do so, the SM uses a hierarchical Pitman-Yor process prior over the predictive\ndistributions [3]. One innovation of the SM over [3] is its use of coagulation and fragmentation\nproperties [4, 5] that allow for ef\ufb01cient representation of the model using a data structure whose size\nis linear in the sequence length. However, in order to make use of these properties, all concentration\nparameters, which were allowed to vary freely in [3], were \ufb01xed to zero.\nIn this paper we explore a number of further innovations to the SM. Firstly, we propose a more \ufb02exible\nsetting of the hyperparameters with potentially non-zero concentration parameters that still allow the\nuse of the coagulation/fragmentation properties. In addition to better predictive performance, the\nsetting also partially mitigates a problem observed in [1], whereby on encountering a long sequence\nof the same symbol, the model becomes overly con\ufb01dent that it will continue with the same symbol.\nThe second innovation addresses memory usage issues in inference algorithms for the SM. In\nparticular, current algorithms use a Chinese restaurant franchise representation for the HPYP, where\nthe seating arrangement of customers in each restaurant is represented by a list, each entry being the\nnumber of customers sitting around one table [3]. 
This is already an improvement over the naïve Chinese restaurant franchise in [6], which stores pointers from customers to the tables they sit at, but can still lead to huge memory requirements when restaurants contain many tables. One approach to mitigate this problem has been explored in [7], which uses a representation that stores a histogram of table sizes instead of the table sizes themselves. Our proposal is to store even less, namely only the minimal statistics about each restaurant required to make predictions: the number of customers and the number of tables occupied by the customers. Inference algorithms will have to be adapted to this compact representation, and we describe and compare a number of these.

In Section 2 we will give precise definitions of Pitman-Yor processes and Chinese restaurant processes. These will be used to define the SM model in Section 3, and to derive the results about the extended hyperparameter setting in Section 4 and the memory-efficient representation in Section 5. As a side benefit we will also be able to give an elementary proof of the coagulation and fragmentation properties in Section 4, which was presented as a fait accompli in [1], while the general and rigorous treatment in the original papers [4, 5] is somewhat inaccessible to a wider audience.

2 Pitman-Yor Processes and Chinese Restaurant Processes

A Pitman-Yor process (PYP) is a particular distribution over distributions over some probability space Σ [8, 9]. We denote by PY(α, d, G_0) a PYP with concentration parameter α > −d, discount parameter d ∈ [0, 1), and base distribution G_0 over Σ. We can describe a Pitman-Yor process using its associated Chinese restaurant process (CRP). A Chinese restaurant has customers sitting around tables which serve dishes. If there are c customers we index them with [c] = {1, . . . , c}.
We define a seating arrangement of the customers as a set of disjoint non-empty subsets partitioning [c]. Each subset is a table and consists of the customers sitting around it, e.g. {{1, 3}, {2}} means customers 1 and 3 sit at one table and customer 2 sits at another by itself. Let A_c be the set of seating arrangements of c customers, and A_ct those with exactly t tables. The CRP describes a distribution over seating arrangements as follows: customer 1 sits at a table; for customer c + 1, if A ∈ A_c is the current seating arrangement, then she joins a table a ∈ A with probability (|a| − d)/(α + c) and starts a new table with probability (α + |A|d)/(α + c). We denote the resulting distribution over A_c as CRP_c(α, d). Multiplying the conditional probabilities together,

$$P(A) = \frac{[\alpha + d]_d^{|A|-1}}{[\alpha + 1]_1^{c-1}} \prod_{a \in A} [1 - d]_1^{|a|-1} \quad \text{for each } A \in \mathcal{A}_c, \tag{1}$$

where $[y]_d^n = \prod_{i=0}^{n-1} (y + id)$ is Kramp's symbol. Note that the denominator is the normalization constant. Fixing the number of tables to be t ≤ c, the distribution, denoted as CRP_ct(d), becomes:

$$P(A) = \frac{1}{S_d(c, t)} \prod_{a \in A} [1 - d]_1^{|a|-1} \quad \text{for each } A \in \mathcal{A}_{ct}, \tag{2}$$

where the normalization constant $S_d(c, t) = \sum_{A \in \mathcal{A}_{ct}} \prod_{a \in A} [1 - d]_1^{|a|-1}$ is a generalized Stirling number of type (−1, −d, 0) [10]. These can be computed recursively [3] (see also Section 5). Note that conditioning on a fixed t, the seating arrangement does not depend on α, only on d.

Suppose G ∼ PY(α, d, G_0) and z_1, . . . , z_c | G iid∼ G. The CRP describes the PYP in terms of its effect on z_{1:c} = z_1, . . . , z_c.
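The definitions above are easy to check exhaustively for small c. The following sketch is our own illustration (the function names are ours, not the paper's): it enumerates all seating arrangements, evaluates eq. (1), and computes S_d(c, t) by the standard recursion S_d(c, t) = S_d(c−1, t−1) + (c−1−td) S_d(c−1, t), which can then be compared against the sum defining it in eq. (2).

```python
def kramp(y, d, n):
    """Kramp's symbol [y]^n_d = prod_{i=0}^{n-1} (y + i*d)."""
    out = 1.0
    for i in range(n):
        out *= y + i * d
    return out

def partitions(elems):
    """Enumerate all seating arrangements (set partitions) of a list."""
    if not elems:
        yield []
        return
    first, rest = elems[0], elems[1:]
    for part in partitions(rest):
        yield [[first]] + part                    # first starts a new table
        for i in range(len(part)):                # or joins an existing one
            yield part[:i] + [part[i] + [first]] + part[i + 1:]

def crp_prob(A, alpha, d, c):
    """P(A) under CRP_c(alpha, d), eq. (1)."""
    p = kramp(alpha + d, d, len(A) - 1) / kramp(alpha + 1, 1.0, c - 1)
    for a in A:
        p *= kramp(1.0 - d, 1.0, len(a) - 1)
    return p

def stirling(d, c, t):
    """Generalized Stirling number S_d(c, t), computed recursively."""
    if c == 0 and t == 0:
        return 1.0
    if t <= 0 or t > c:
        return 0.0
    return stirling(d, c - 1, t - 1) + (c - 1 - t * d) * stirling(d, c - 1, t)
```

For c = 4 there are Bell(4) = 15 arrangements; the probabilities in eq. (1) sum to one over them, and for each t the unnormalized weights of the arrangements with t tables sum to S_d(4, t), as eq. (2) requires.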
In particular, marginalizing out G, the distribution of z_{1:c} can be described as follows: draw A ∼ CRP_c(α, d), on each table serve a dish which is an iid draw from G_0, and finally let variable z_i take on the value of the dish served at the table that customer i sat at. Now suppose we wish to perform inference given observation of z_{1:c}. This is equivalent to conditioning on the dishes that each customer is served. Since customers at the same table are served the same dish, the different values among the z_i's split the restaurant into multiple sections, with customers and tables in each section being served a distinct dish. There can be more than one table in each section since multiple tables can serve the same dish (if G_0 has atoms). If s ∈ Σ is a dish, let c_s be the number of z_i's with value s (the number of customers served dish s), t_s the number of tables, and A_s ∈ A_{c_s t_s} the seating arrangement of customers around the tables serving dish s (we reindex the c_s customers to be [c_s]). The joint distribution over seating arrangements and observations is then:1

$$P(\{c_s, t_s, A_s\}, z_{1:c}) = \frac{[\alpha + d]_d^{t_\cdot - 1}}{[\alpha + 1]_1^{c_\cdot - 1}} \left( \prod_{s \in \Sigma} G_0(s)^{t_s} \right) \left( \prod_{s \in \Sigma} \prod_{a \in A_s} [1 - d]_1^{|a|-1} \right), \tag{3}$$

where $t_\cdot = \sum_{s \in \Sigma} t_s$ and similarly for $c_\cdot$. We can marginalize out {A_s} from (3) using (2):

$$P(\{c_s, t_s\}, z_{1:c}) = \frac{[\alpha + d]_d^{t_\cdot - 1}}{[\alpha + 1]_1^{c_\cdot - 1}} \left( \prod_{s \in \Sigma} G_0(s)^{t_s} \right) \left( \prod_{s \in \Sigma} S_d(c_s, t_s) \right). \tag{4}$$

Inference then amounts to computing the posterior of either {t_s, A_s} or only {t_s} given z_{1:c} (the c_s are fixed) and can be achieved by Gibbs sampling or other means.

1We have omitted the set subscript {·}_{s∈Σ}.
We will drop these subscripts when they are clear from context.

3 The Sequence Memoizer and its Chinese Restaurant Representation

In this section we review the sequence memoizer (SM) and its representation using Chinese restaurants [3, 11, 1, 2]. Let Σ be the discrete set of symbols making up the sequences to be modeled, and let Σ* be the set of finite sequences of symbols from Σ. The SM models a sequence x_{1:T} = x_1, x_2, . . . , x_T ∈ Σ* using a set of conditional distributions:

$$P(x_{1:T}) = \prod_{i=1}^{T} P(x_i \mid x_{1:i-1}) = \prod_{i=1}^{T} G_{x_{1:i-1}}(x_i), \tag{5}$$

where G_u(s) is the conditional probability of the symbol s ∈ Σ occurring after a context u ∈ Σ* (the sequence of symbols occurring before s). The parameters of the model consist of all the conditional distributions {G_u}_{u∈Σ*}, and are given a hierarchical Pitman-Yor process (HPYP) prior:

$$G_\epsilon \sim \mathrm{PY}(\alpha_\epsilon, d_\epsilon, H), \qquad G_u \mid G_{\sigma(u)} \sim \mathrm{PY}(\alpha_u, d_u, G_{\sigma(u)}) \quad \text{for } u \in \Sigma^* \setminus \{\epsilon\}, \tag{6}$$

where ε is the empty sequence, σ(u) is the sequence obtained by dropping the first symbol in u, and H is the overall base distribution over Σ (we take H to be uniform over a finite Σ). Note that we have generalized the model to allow each G_u to have its own concentration and discount parameters, whereas [1, 2] worked with α_u = 0 and d_u = d_{|u|} (i.e. context length-dependent discounts).

As in previous works, the hierarchy over {G_u} is represented using a Chinese restaurant franchise [6]. Each G_u has a corresponding restaurant indexed by u.
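To make the indexing in eqs. (5) and (6) concrete, a few lines of code (our own, purely illustrative) can list the (context, symbol) pairs a sequence uses and the parent map σ that drops the first symbol of a context:

```python
def contexts(x):
    """(context, symbol) pairs appearing in the factorization, eq. (5)."""
    return [(tuple(x[:i]), x[i]) for i in range(len(x))]

def sigma(u):
    """Parent of context u in the HPYP hierarchy: drop the first symbol."""
    return u[1:]

def ancestors(u):
    """Chain of contexts from u up to the empty context (the root)."""
    chain = [u]
    while u:
        u = sigma(u)
        chain.append(u)
    return chain
```

For the sequence "oacac", the last symbol 'c' is modeled by G_u with u = ('o','a','c','a'), whose ancestor chain passes through ('a','c','a'), ('c','a'), ('a',) and ends at the empty context.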
Customers in the restaurant are draws from G_u, tables are draws from its base distribution G_{σ(u)}, and dishes are the drawn values from Σ. For each s ∈ Σ and u ∈ Σ*, let c_us and t_us be the numbers of customers and tables in restaurant u served dish s, and let A_us ∈ A_{c_us t_us} be their seating arrangement. Each observation of x_i in context x_{1:i−1} corresponds to a customer in restaurant x_{1:i−1} who is served dish x_i, and each table in each restaurant u, being a draw from the base distribution G_{σ(u)}, corresponds to a customer in the parent restaurant σ(u). Thus, the numbers of customers and tables have to satisfy the constraints

$$c_{us} = c^x_{us} + \sum_{v : \sigma(v) = u} t_{vs}, \tag{7}$$

where $c^x_{us} = 1$ if s = x_i and u = x_{1:i−1} for some i, and 0 otherwise.

The goal of inference is to compute the posterior over the states {c_us, t_us, A_us}_{s∈Σ, u∈Σ*} of the restaurants (and possibly the concentration and discount parameters). The joint distribution can be obtained by multiplying the probabilities of all seating arrangements (3) in all restaurants:

$$P(\{c_{us}, t_{us}, A_{us}\}, x_{1:T}) = \left( \prod_{s \in \Sigma} H(s)^{t_{\epsilon s}} \right) \prod_{u \in \Sigma^*} \left( \frac{[\alpha_u + d_u]_{d_u}^{t_{u\cdot} - 1}}{[\alpha_u + 1]_1^{c_{u\cdot} - 1}} \prod_{s \in \Sigma} \prod_{a \in A_{us}} [1 - d_u]_1^{|a|-1} \right). \tag{8}$$

The first parentheses contain the probability of draws from the overall base distribution H, and the second parentheses contain the probability of the seating arrangement in restaurant u.
Given a state of the restaurants drawn from the posterior, the predictive probability of symbol s in context v can then be computed recursively (with $P^*_{\sigma(\epsilon)}(s)$ defined to be H(s)):

$$P^*_v(s) = \frac{c_{vs} - t_{vs} d_v}{\alpha_v + c_{v\cdot}} + \frac{\alpha_v + t_{v\cdot} d_v}{\alpha_v + c_{v\cdot}} P^*_{\sigma(v)}(s). \tag{9}$$

4 Non-zero Concentration Parameters

In [1] the authors proposed setting all the concentration parameters to zero. Though limiting the flexibility of the model, this allowed them to take advantage of coagulation and fragmentation properties of PYPs [4, 5] to marginalize out all but a linear number (in T) of restaurants from the hierarchy. We propose the following enlarged family of hyperparameter settings: let α_ε = α > 0 be free to vary at the root of the hierarchy, and set α_u = α_{σ(u)} d_u for each u ∈ Σ* \ {ε}. The discounts can vary freely. In addition to more flexible modeling, this also partially mitigates the overconfidence problem [2]. To see why, notice from (9) that the predictive probability is a weighted average of predictive probabilities given contexts of various lengths. Since α_v > 0, the model gives higher weights to the predictive probabilities of shorter contexts (compared to α_v = 0). These typically give less extreme values since they include influences not just from the sequence of identical symbols, but also from observations of other symbols in other contexts.

Our hyperparameter settings also retain the coagulation and fragmentation properties which allow us to marginalize out many PYPs in the hierarchy for efficient inference. We will provide an elementary proof of these results in terms of CRPs in the following. First we describe the coagulation and fragmentation operations.
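The recursion (9) translates directly into code. The sketch below is our own, under the assumed (hypothetical) storage convention that `counts[u]` maps each symbol to its (c_us, t_us) pair for restaurant u, represented as a tuple of symbols:

```python
def predictive(s, v, counts, params, H):
    """P*_v(s) by eq. (9): back off recursively to shorter contexts.

    counts[u]: dict symbol -> (c_us, t_us) for restaurant u (a tuple context);
    params[u]: (alpha_u, d_u); H: the overall base distribution, as a function.
    Past the root (v is None) the recursion bottoms out at H.
    """
    if v is None:
        return H(s)
    alpha, d = params[v]
    rest = counts.get(v, {})
    c_tot = sum(c for c, _ in rest.values())
    t_tot = sum(t for _, t in rest.values())
    parent = v[1:] if v else None       # sigma(v); sigma(empty context) -> None
    p_up = predictive(s, parent, counts, params, H)
    if alpha + c_tot == 0.0:            # empty restaurant with alpha = 0
        return p_up
    c_s, t_s = rest.get(s, (0, 0))
    return ((c_s - t_s * d) + (alpha + t_tot * d) * p_up) / (alpha + c_tot)
```

With H uniform over a binary alphabet, a root restaurant holding (c, t) = (2, 1) for symbol 'a' with (α, d) = (0.5, 0.3) gives P*_ε('a') = (2 − 0.3 + 0.8 · 0.5)/2.5 = 0.84, and the predictive probabilities at any context sum to one over the alphabet.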
Let c ≥ 1 and suppose A_2 ∈ A_c and A_1 ∈ A_{|A_2|} are two seating arrangements where the number of customers in A_1 is the same as that of tables in A_2. Each customer in A_1 can be put in one-to-one correspondence to a table in A_2, and sits at a table in A_1. Now consider re-representing A_1 and A_2. Let C ∈ A_c be the seating arrangement obtained by coagulating (merging) tables of A_2 corresponding to customers in A_1 sitting at the same table. Further, split A_2 into sections, one for each table a ∈ C, where each section F_a ∈ A_{|a|} contains the |a| customers and the tables merged to make up a. The converse of coagulating tables of A_2 into C is of course to fragment each table a ∈ C into the smaller tables in F_a. Note that there is a one-to-one correspondence between tables in C and in A_1, and the number of customers at each table of A_1 is the number of tables in the corresponding F_a. Thus A_1 and A_2 can be reconstructed from C and {F_a}_{a∈C}.

Theorem 1 ([4, 5]). Suppose A_2 ∈ A_c, A_1 ∈ A_{|A_2|}, C ∈ A_c and F_a ∈ A_{|a|} for each a ∈ C are related as above. Then the following describe equivalent distributions:
(I) A_2 ∼ CRP_c(αd_2, d_2) and A_1 | A_2 ∼ CRP_{|A_2|}(α, d_1).
(II) C ∼ CRP_c(αd_2, d_1 d_2) and F_a | C ∼ CRP_{|a|}(−d_1 d_2, d_2) for each a ∈ C.

Proof. We simply show that the joint distributions are the same.
Starting with (I) and using (1),

$$P(A_1, A_2) = \left( \frac{[\alpha + d_1]_{d_1}^{|A_1|-1}}{[\alpha + 1]_1^{|A_2|-1}} \prod_{a \in A_1} [1 - d_1]_1^{|a|-1} \right) \left( \frac{[\alpha d_2 + d_2]_{d_2}^{|A_2|-1}}{[\alpha d_2 + 1]_1^{c-1}} \prod_{b \in A_2} [1 - d_2]_1^{|b|-1} \right)$$
$$= \frac{[\alpha d_2 + d_1 d_2]_{d_1 d_2}^{|A_1|-1}}{[\alpha d_2 + 1]_1^{c-1}} \left( \prod_{a \in A_1} [d_2 - d_1 d_2]_{d_2}^{|a|-1} \right) \left( \prod_{b \in A_2} [1 - d_2]_1^{|b|-1} \right).$$

We used the identity $[\beta\delta + \delta]_\delta^{n-1} = \delta^{n-1} [\beta + 1]_1^{n-1}$ for all β, δ, n. Re-grouping the products and expressing the same quantities in terms of C and {F_a},

$$= \frac{[\alpha d_2 + d_1 d_2]_{d_1 d_2}^{|C|-1}}{[\alpha d_2 + 1]_1^{c-1}} \prod_{a \in C} \left( [d_2 - d_1 d_2]_{d_2}^{|F_a|-1} \prod_{b \in F_a} [1 - d_2]_1^{|b|-1} \right) = P(C, \{F_a\}_{a \in C}).$$

We see that conditioning on C, each $F_a \sim \mathrm{CRP}_{|a|}(-d_1 d_2, d_2)$. Marginalizing {F_a} out using (1),

$$P(C) = \frac{[\alpha d_2 + d_1 d_2]_{d_1 d_2}^{|C|-1}}{[\alpha d_2 + 1]_1^{c-1}} \prod_{a \in C} [1 - d_1 d_2]_1^{|a|-1}.$$

So $C \sim \mathrm{CRP}_c(\alpha d_2, d_1 d_2)$ and (I) ⇒ (II). Reversing the same argument shows that (II) ⇒ (I).

Figure 1: Illustration of the relationship between the restaurants A_1, A_2, C and F_a.

Statement (I) of the theorem is exactly the Chinese restaurant franchise of the hierarchical model $G_1 \mid G_0 \sim \mathrm{PY}(\alpha, d_1, G_0)$, $G_2 \mid G_1 \sim \mathrm{PY}(\alpha d_2, d_2, G_1)$ with c iid draws from G_2. The theorem shows that the clustering structure of the c customers in the franchise is equivalent to the seating arrangement in a CRP with parameters αd_2, d_1d_2, i.e.
G2|G0 \u223c PY(\u03b1d2, d1d2, G0) with G1 marginalized out.\nConversely, the fragmentation operation (II) regains Chinese restaurant representations for both\nG2|G1 and G1|G0 from one for G2|G0.\nThis result can be applied to marginalize out all but a linear number of PYPs from (6) [1]. The\nresulting model is still a HPYP of the same form as (6), except that it only need be de\ufb01ned over the\npre\ufb01xes of x1:T as well as some subset of their ancestors. In the rest of this paper we will refer to (6)\nand its Chinese restaurant franchise representation (8) with the understanding that we are operating in\nthis reduced hierarchy. Let U denote the reduced set of contexts, and rede\ufb01ne \u03c3(u) to be the parent\nof u in U. The concentration and discount parameters need to be modi\ufb01ed accordingly.\n\n5 Compact Representation\n\nCurrent inference algorithms for the SM and hierarchical Pitman-Yor processes operate in the Chinese\nrestaurant franchise representation, and use either Gibbs sampling [3, 11, 1] or particle \ufb01ltering [2].\nTo lower memory requirements, instead of storing the precise seating arrangement of each restaurant,\nthe algorithms only store the numbers of customers, numbers of tables and sizes of all tables in the\nfranchise. This is suf\ufb01cient for sampling and for prediction. However, for large data sets the amount\nof memory required to store the sizes of the tables can still be very large. We propose algorithms that\nonly store the numbers of customers and tables but not the table sizes. This compact representation\nneeds to store only two integers (cus, tus) per context/symbol pair, as opposed to tus integers.2 These\ncounts are already suf\ufb01cient for prediction, as (9) does not depend on the table sizes. We will also\nconsider a number of sampling algorithms in this representation.\nOur starting point is the joint distribution over the Chinese restaurant franchise (8). 
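Before deriving the compact representation, we note that Theorem 1 can also be verified exhaustively for small c. The check below is our own illustration (not from the paper): it enumerates all pairs (A_2, A_1), builds the corresponding (C, {F_a}), and confirms that the joint probabilities under (I) and (II) agree, with CRP probabilities computed from eq. (1).

```python
def kramp(y, d, n):
    """Kramp's symbol [y]^n_d = prod_{i=0}^{n-1} (y + i*d)."""
    out = 1.0
    for i in range(n):
        out *= y + i * d
    return out

def partitions(elems):
    """Enumerate all set partitions (seating arrangements) of a list."""
    if not elems:
        yield []
        return
    first, rest = elems[0], elems[1:]
    for part in partitions(rest):
        yield [[first]] + part
        for i in range(len(part)):
            yield part[:i] + [part[i] + [first]] + part[i + 1:]

def crp_prob(A, alpha, d):
    """P(A) under CRP_c(alpha, d), eq. (1); c is recovered from A."""
    c = sum(len(a) for a in A)
    p = kramp(alpha + d, d, len(A) - 1) / kramp(alpha + 1, 1.0, c - 1)
    for a in A:
        p *= kramp(1.0 - d, 1.0, len(a) - 1)
    return p

def coagulate(A2, A1):
    """Map (A2, A1) to (C, {F_a}): order A2's tables by smallest customer,
    identify customer i of A1 with the i-th table, merge co-seated tables."""
    tables = sorted((sorted(a) for a in A2), key=lambda a: a[0])
    C, frags = [], []
    for block in A1:                       # block: indices of merged A2-tables
        C.append(sorted(x for i in block for x in tables[i - 1]))
        frags.append([tables[i - 1] for i in block])
    return C, frags

def check_theorem1(c, alpha, d1, d2):
    for A2 in partitions(list(range(1, c + 1))):
        for A1 in partitions(list(range(1, len(A2) + 1))):
            p1 = crp_prob(A2, alpha * d2, d2) * crp_prob(A1, alpha, d1)
            C, frags = coagulate(A2, A1)
            p2 = crp_prob(C, alpha * d2, d1 * d2)
            for Fa in frags:
                p2 *= crp_prob(Fa, -d1 * d2, d2)
            assert abs(p1 - p2) < 1e-12, (A2, A1)
    return True
```

Running the check for a few parameter settings (e.g. c = 4, α = 0.7, d_1 = 0.4, d_2 = 0.6) confirms the equality of the two joint distributions configuration by configuration.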
Integrating out the seating arrangements {A_us} using (2) gives the joint distribution over {c_us, t_us}:

$$P(\{c_{us}, t_{us}\}, x_{1:T}) = \left( \prod_{s \in \Sigma} H(s)^{t_{\epsilon s}} \right) \prod_{u \in U} \left( \frac{[\alpha_u + d_u]_{d_u}^{t_{u\cdot} - 1}}{[\alpha_u + 1]_1^{c_{u\cdot} - 1}} \prod_{s \in \Sigma} S_{d_u}(c_{us}, t_{us}) \right). \tag{10}$$

Note that each c_us is determined by (7), so the only unobserved variables in (10) are {t_us}. With this joint distribution we can now derive various sampling algorithms.

5.1 Sampling Algorithms

Direct Gibbs Sampling of {c_us, t_us}. It is straightforward to derive a Gibbs sampler from (10). Since each c_us is determined by c^x_us and the t_vs at child restaurants v, it is sufficient to update each t_us, which for t_us in the range {1, . . . , c_us} has conditional distribution

$$P(t_{us} \mid \text{rest}) \propto \frac{[\alpha_u + d_u]_{d_u}^{t_{u\cdot} - 1}}{[\alpha_{\sigma(u)} + 1]_1^{c_{\sigma(u)\cdot} - 1}} \, S_{d_u}(c_{us}, t_{us}) \, S_{d_{\sigma(u)}}(c_{\sigma(u)s}, t_{\sigma(u)s}), \tag{11}$$

where t_{u·}, c_{σ(u)·} and c_{σ(u)s} all depend on t_us through the constraints (7). One problem with this sampler is that we need to compute S_{d_u}(c, t) for all 1 ≤ c, t ≤ c_us. If d_u is fixed these can be precomputed and stored, but the resulting memory requirement is again large since each restaurant typically has its own d_u value. If d_u is updated during sampling, then these numbers will need to be recomputed each time as well, costing O(c²_us) per iteration. Further, S_d(c, t) typically has a very high dynamic range, so care has to be taken to avoid numerical under-/overflow (e.g. by performing the computations in the log domain, involving many expensive log and exp computations).

Re-instantiating Seating Arrangements.
Another strategy is to re-instantiate the seating arrange-\nment by sampling Aus \u223c CRPcustus(du) from its conditional distribution given cus, tus (see\nSection 5.2 below), then performing the original Gibbs sampling of seating arrangements [3, 11].\nThis produces a new number of tables tus and the seating arrangement can be discarded. Note\nhowever that when tus changes this sampler will introduce changes to ancestor restaurants (by adding\n2In both representations one may also want to store the total number of customers and tables in each restaurant\nfor ef\ufb01ciency. In practice, where there is additional overhead due to the data structures involved, storage space\nfor the full representation can be reduced by treating context/symbol pairs with only one customer separately.\n\n5\n\n\for removing customers), so these will need to have their seating arrangements instantiated as well. To\nimplement this sampler ef\ufb01ciently, we visit restaurants in depth-\ufb01rst order, keeping in memory only\nthe seating arrangements of all restaurants on the path to the current one. The computational cost is\nO(custus), but with a potentially smaller hidden constant (no log/exp computations are required).\nOriginal Gibbs Sampling of {cus, tus}. A third strategy is to \u201cimagine\u201d having a seating arrange-\nment and running the original Gibbs sampler, incrementing tus if a table would have been created,\nand decrementing tus if a table would have been deleted. Recall that the original Gibbs sampler\noperates by iterating over customers, treating each as the last customer in the restaurant, removing\nit, then adding it back into the restaurant. When removing, if the customer were sitting by himself,\na table would need to be deleted too, so the probability of decrementing tus is the probability of a\ncustomer sitting by himself. 
From (2), this can be worked out to be

$$P(\text{decrement } t_{us}) = \frac{S_{d_u}(c_{us} - 1, t_{us} - 1)}{S_{d_u}(c_{us}, t_{us})}. \tag{12}$$

The numerator is due to a sum over all seating arrangements where the other c_us − 1 customers sit at the other t_us − 1 tables. When adding back the customer, the probability of incrementing the number of tables is the probability that the customer sits at a new table serving the same dish s:

$$P(\text{increment } t_{us}) = \frac{(\alpha_u + d_u t_{u\cdot}) P^*_{\sigma(u)}(s)}{(\alpha_u + d_u t_{u\cdot}) P^*_{\sigma(u)}(s) + c_{us} - t_{us} d_u}, \tag{13}$$

where $P^*_{\sigma(u)}(s)$ is the predictive (9) with the current value of t_us, and c_us, t_us are the values with the customer removed. This sampler also requires computation of S_{d_u}(c, t), but only for 1 ≤ t ≤ t_us, which can be significantly smaller than c_us. The computational cost is O(c_us t_us) (but again with a larger constant due to computing the Stirling numbers in a stable way). We did not find a sampling method taking less time than O(c_us t_us).

Particle Filtering. Equation (13) gives the probability of incrementing t_us (and adding a customer to the parent restaurant) when a customer is added into a restaurant. This can be used as the basis for a particle filter, which iterates through the sequence x_{1:T}, adding a customer corresponding to s = x_i in context u = x_{1:i−1} at each step. Since no customer deletion is required, the cost is very small: just O(c_us) for the c_us customers per s and u (plus the cost of traversing the hierarchy to the current restaurant, which is always necessary). Particle filtering works very well in online settings, e.g. compression [2], and as initialization for Gibbs sampling.

5.2 Re-instantiating A_us given c_us, t_us

To simplify notation, here we will let d = d_u, c = c_us, t = t_us and A = A_us ∈ A_ct. We will use the forward-backward algorithm in an undirected chain to sample A from CRP_ct(d) given in (2).
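This forward-backward construction (detailed next in terms of a chain z_1, . . . , z_c counting occupied tables, plus table labels y_i) can be sketched in a few lines. The code below is our own illustration: it builds backward messages for the z-chain with the constraint z_c = t, then forward-samples z and seats each customer. As a sanity check, the backward normalizer at z_1 = 1 must equal the Stirling number S_d(c, t).

```python
import random

def stirling(d, c, t):
    """Generalized Stirling number S_d(c, t), the normalizer of CRP_{c,t}(d)."""
    if c == 0 and t == 0:
        return 1.0
    if t <= 0 or t > c:
        return 0.0
    return stirling(d, c - 1, t - 1) + (c - 1 - t * d) * stirling(d, c - 1, t)

def sample_arrangement(c, t, d, rng=None):
    """Sample A ~ CRP_{c,t}(d) by backward filtering / forward sampling.

    z_i is the number of tables among the first i customers (z_1 = 1, z_c = t);
    customer i joins an existing table (total weight i - 1 - z_i * d) or opens
    a new one (weight 1).  Returns the backward normalizer, which should equal
    S_d(c, t), together with the sampled arrangement.
    """
    rng = rng or random.Random(0)
    # backward messages m[i][z]: total weight of completions reaching z_c = t
    m = [[0.0] * (t + 2) for _ in range(c + 1)]
    m[c][t] = 1.0
    for i in range(c, 1, -1):
        for z in range(1, t + 1):
            m[i - 1][z] = (i - 1 - z * d) * m[i][z] + m[i][z + 1]
    # forward-sample z_2..z_c, seating customers as we go
    tables = [[1]]
    z = 1
    for i in range(2, c + 1):
        stay = (i - 1 - z * d) * m[i][z]
        new = m[i][z + 1]
        if rng.random() * (stay + new) < new:
            z += 1
            tables.append([i])          # new table, labelled by customer i
        else:                           # join table a with prob (|a| - d)/(i-1-z*d)
            r = rng.random() * (i - 1 - z * d)
            for a in tables:
                r -= len(a) - d
                if r <= 0:
                    a.append(i)
                    break
            else:
                tables[-1].append(i)    # guard against floating-point round-off
    return m[1][1], tables
```

Every sample has exactly t tables and c customers by construction, and the normalizer matches the recursive computation of S_d(c, t), which also makes the decrement ratio in eq. (12) cheap to evaluate from the same table of messages.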
First we re-express A using two sets of variables z_1, . . . , z_c and y_1, . . . , y_c. Label a table a ∈ A using the index of the first customer at the table, i.e. the smallest element of a. Let z_i be the number of tables occupied by the first i customers, and y_i the label of the table that customer i sits at. The variables satisfy the following constraints: z_1 = 1, z_c = t, and either z_i = z_{i−1}, in which case y_i ∈ [i − 1], or z_i = z_{i−1} + 1, in which case y_i = i. This gives a one-to-one correspondence between seating arrangements in A_ct and settings of the variables satisfying the above constraints. Consider the following distribution over the variables satisfying the constraints: z_1, . . . , z_c is distributed according to a Markov network with z_1 = 1, z_c = t, and edge potentials:

$$f(z_i, z_{i-1}) = \begin{cases} i - 1 - z_i d & \text{if } z_i = z_{i-1}, \\ 1 & \text{if } z_i = z_{i-1} + 1, \\ 0 & \text{otherwise.} \end{cases} \tag{14}$$

It is easy to see that the normalization constant is simply S_d(c, t) and

$$P(z_{1:c}) = \frac{\prod_{i : z_i = z_{i-1}} (i - 1 - z_i d)}{S_d(c, t)}. \tag{15}$$

Given z_{1:c}, we give each y_i the following distribution conditioned on y_{1:i−1}:

$$P(y_i \mid z_{1:c}, y_{1:i-1}) = \begin{cases} 1 & \text{if } y_i = i \text{ and } z_i = z_{i-1} + 1, \\ \frac{\sum_{j=1}^{i-1} \mathbb{1}(y_j = y_i) - d}{i - 1 - z_i d} & \text{if } z_i = z_{i-1} \text{ and } y_i \in [i - 1]. \end{cases} \tag{16}$$

Figure 2: (a), (b) Number of context/symbol pairs and total number of tables (counted after particle filter initialization and 10 sampling iterations using the compact original sampler) as a function of input size. Subfigure (a) shows the counts obtained from a byte-level model of the news file in the Calgary corpus, whereas (b) shows the counts for a word-level model of the Brown corpus (training set).
The space required for the compact\nrepresentation is proportional to the number of context/symbol pairs, whereas for the full representation it is\nproportional to the number of tables. Note also that sampling tends to increase the number of tables over the\nparticle \ufb01lter initialization. (c) Time per iteration (seconds) as a function of input size for the original Gibbs\nsampler in the compact representation and the re-instantiating sampler (on the Brown corpus).\n\nMultiplying all the probabilities together, we see that P (z1:c, y1:c) is exactly equal to P (A) in (2).\nThus we can sample A by \ufb01rst sampling z1:c from (15), then each yi conditioned on previous ones\nusing (16), and converting this representation into A. We use a backward-\ufb01ltering-forward-sampling\nalgorithm to sample z1:c, as this avoids numerical under\ufb02ow problems that can arise when using\nforward-\ufb01ltering. Backward-\ufb01ltering avoids these problems by incorporating the constraint that zc\nhas to equal t into the messages from the beginning.\nFragmenting a Restaurant. In particle \ufb01ltering and in prediction, we often need to re-instantiate a\nrestaurant which was previously marginalized out. We can do so by sampling Aus given cus, tus for\neach s, then fragmenting each Aus using Theorem 1, counting the resulting numbers of customers\nand tables, then forgetting the seating arrangements.\n\n6 Experiments\n\nIn order to evaluate the proposed improvements in terms of reduced memory requirements and to\ncompare the performance of the different sampling schemes we performed three sets of experiments.3\nIn the \ufb01rst experiment we evaluated the potential space saving due to the compact representation.\nFigure 2 shows the number of context/symbol pairs and the total number of tables as a function\nof data set size. 
While the difference does not seem dramatic, there is still a signi\ufb01cant amount of\nmemory that can be saved by using the compact representation, as there is no additional overhead and\nmemory fragmentation due to variable-size arrays. The comparison between the byte-level model and\nthe word-level model in Figure 2 also demonstrates that the compact representation saves more space\nwhen |\u03a3| is small (which leads to context/symbol pairs having larger cus\u2019s and tus\u2019s). Finally, Figure\n2 illustrates another interesting effect: the number of tables is generally larger after a few iterations of\nGibbs sampling have been performed after the initialization using a single-particle particle \ufb01lter [2].\nThe second experiment compares the computational cost of the compact original sampler and\nthe sampler that re-instantiates full seating arrangements. The main computational cost of the\noriginal sampler is computing the ratio (12), while sampling the seating arrangements is the main\ncomputational cost of the re-instantiating sampler. Figure 2(c) shows the time needed for one iteration\nof Gibbs sampling as a function of data set size. The re-instantiating sampler is found to be much\nmore ef\ufb01cient, as it avoids the overhead involved in computing the Stirling numbers in a stable\nmanner (e.g. log/exp computations). For the original sampler, time can be traded off with space\n\n3All experiments were performed on two data sets: the news \ufb01le from the Calgary corpus (modeled as a\nsequence of 377,109 bytes; |\u03a3| = 256), and the Brown corpus (preprocessed as in [12]), modeled as a sequence\nof words (800,000 words training set; 181,041 words test set; |\u03a3| = 16383). 
Following [1], the discount parameters were fixed to .62, .69, .74, .80 for the first 4 levels and .95 for all subsequent levels of the hierarchy.

        Particle Filter only    Gibbs (1 sample)    Gibbs (50 samples averaged)    Online
  α     Fragment   Parent       Fragment   Parent   Fragment   Parent              PF      Gibbs
  0     8.41       8.45         8.44       8.41     8.43       8.39                8.04    8.04
  1     8.39       8.41         8.40       8.39     8.39       8.38                8.01    8.01
  3     8.37       8.37         8.37       8.37     8.35       8.35                7.98    7.98
  10    8.34       8.33         8.33       8.33     8.32       8.32                7.94    7.95
  20    8.33       8.32         8.32       8.32     8.31       8.31                7.94    7.94
  50    8.33       8.32         8.31       8.32     8.31       8.31                7.95    7.95

Table 1: Average log-loss on the Brown corpus (test set) for different values of α, different inference strategies, and different modes of prediction. Inference is performed by either just using the particle filter or using the particle filter followed by 50 burn-in iterations of Gibbs sampling. Subsequently either 1 or 50 samples are collected for prediction. Prediction is performed either using fragmentation or by predicting from the parent node. The final two columns labelled Online show the results obtained by using the particle filter on the test set as well, after training with either just the particle filter or the particle filter followed by 50 Gibbs iterations. Non-zero values of α can be seen to provide a significant increase in performance, while the gains due to averaging samples or proper fragmentation during prediction are small.

by tabulating all required Stirling numbers along the path down the tree (as was done in these experiments).
However, this leads to an additional memory overhead that mostly undoes any savings from the compact representation.

The third set of experiments uses the re-instantiating sampler and compares different modes of prediction as well as the effect of the non-zero concentration parameter. The results are shown in Table 1. Predictions with the SM can be made in several different ways. After obtaining one or more samples from the posterior distribution over customers and tables (either using particle filtering or Gibbs sampling on the training set), one has a choice of either using particle filtering on the test set as well (online setting), or making predictions while keeping the model fixed. One also has a choice when making predictions involving contexts that were marginalized out from the model: one can either re-instantiate these contexts by fragmentation, or simply predict from the parent (or even the child) of the required node. While one ultimately wants to average predictions over the posterior distribution, one may consider using just a single sample for computational reasons.

7 Discussion

In this paper we proposed an enlarged set of hyperparameters for the sequence memoizer that retains the coagulation/fragmentation properties important for efficient inference, and we proposed a new minimal representation of the Chinese restaurant processes to reduce the memory requirement of the sequence memoizer. We developed novel inference algorithms for the new representation, and presented experimental results exploring their behavior. We found that the algorithm which re-instantiates seating arrangements is significantly more efficient than the other two Gibbs samplers, while particle filtering is most efficient but produces slightly worse predictions.
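The "predict from the parent" option compared in Table 1 amounts to evaluating the standard hierarchical Pitman-Yor predictive rule at the parent restaurant instead of fragmenting out the marginalized context. A minimal sketch of that rule (cf. Teh [11]) follows; the function and argument names are illustrative, not the paper's code, and `counts`/`tables` stand for the cus/tus bookkeeping of one restaurant.

```python
def pyp_predict(counts, tables, d, theta, parent_predict, symbol):
    """Predictive probability of `symbol` in one Pitman-Yor restaurant,
    recursing to the parent (shorter-context) restaurant for the
    back-off mass.

    counts:  dict symbol -> c_us (customers eating dish s)
    tables:  dict symbol -> t_us (tables serving dish s)
    d, theta: discount and concentration of this restaurant
    parent_predict: function symbol -> predictive prob. in the parent
    """
    c_total = sum(counts.values())
    t_total = sum(tables.values())
    if c_total == 0:
        # empty restaurant: all mass is backed off to the parent
        return parent_predict(symbol)
    c_s = counts.get(symbol, 0)
    t_s = tables.get(symbol, 0)
    # mass from customers already seated at tables serving `symbol`
    p_local = (c_s - d * t_s) / (theta + c_total)
    # mass reserved for new tables, distributed via the parent
    p_backoff = (theta + d * t_total) / (theta + c_total) * parent_predict(symbol)
    return p_local + p_backoff
```

Chaining `parent_predict` up the context tree bottoms out at the uniform base distribution over Σ; predicting from the parent of a marginalized-out node simply calls the parent's predictive distribution directly at that point.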
Along the way, we formalized the metaphorical language often used to describe Chinese restaurant processes in the machine learning literature, and were able to provide an elementary proof of the coagulation/fragmentation properties. We believe this more precise language will be of use to researchers interested in the hierarchical Dirichlet process and its various generalizations.

We are currently exploring methods to compute or approximate the generalized Stirling numbers, and efficient methods to optimize the hyperparameters in the sequence memoizer. A parting remark is that the posterior distribution over {cus, tus} in (10) is in the form of a standard Markov network with sum constraints (7). Thus other inference algorithms, such as loopy belief propagation or variational inference, can potentially be applied. There are, however, two difficulties to be resolved before these become possible: the large domains of the variables, and the large dynamic ranges of the factors.

Acknowledgments

We would like to thank the Gatsby Charitable Foundation for generous funding.

References

[1] F. Wood, C. Archambeau, J. Gasthaus, L. F. James, and Y. W. Teh. A stochastic memoizer for sequence data. In Proceedings of the International Conference on Machine Learning, volume 26, pages 1129–1136, 2009.

[2] J. Gasthaus, F. Wood, and Y. W. Teh. Lossless compression based on the Sequence Memoizer. In James A. Storer and Michael W. Marcellin, editors, Data Compression Conference, pages 337–345, Los Alamitos, CA, USA, 2010. IEEE Computer Society.

[3] Y. W. Teh. A Bayesian interpretation of interpolated Kneser-Ney. Technical Report TRA2/06, School of Computing, National University of Singapore, 2006.

[4] J. Pitman. Coalescents with multiple collisions. Annals of Probability, 27:1870–1902, 1999.

[5] M. W. Ho, L. F. James, and J. W. Lau.
Coagulation fragmentation laws induced by general coagulations of two-parameter Poisson-Dirichlet processes. http://arxiv.org/abs/math.PR/0601608, 2006.

[6] Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei. Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101(476):1566–1581, 2006.

[7] P. Blunsom, T. Cohn, S. Goldwater, and M. Johnson. A note on the implementation of hierarchical Dirichlet processes. In Proceedings of the ACL-IJCNLP 2009 Conference Short Papers, pages 337–340, Suntec, Singapore, August 2009. Association for Computational Linguistics.

[8] J. Pitman and M. Yor. The two-parameter Poisson-Dirichlet distribution derived from a stable subordinator. Annals of Probability, 25:855–900, 1997.

[9] H. Ishwaran and L. F. James. Gibbs sampling methods for stick-breaking priors. Journal of the American Statistical Association, 96(453):161–173, 2001.

[10] L. C. Hsu and P. J.-S. Shiue. A unified approach to generalized Stirling numbers. Advances in Applied Mathematics, 20:366–384, 1998.

[11] Y. W. Teh. A hierarchical Bayesian language model based on Pitman-Yor processes. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, pages 985–992, 2006.

[12] Y. Bengio, R. Ducharme, P. Vincent, and C. Jauvin. A neural probabilistic language model. Journal of Machine Learning Research, 3:1137–1155, 2003.