{"title": "Expectation Propagation for t-Exponential Family Using q-Algebra", "book": "Advances in Neural Information Processing Systems", "page_first": 2245, "page_last": 2254, "abstract": "Exponential family distributions are highly useful in machine learning since their calculation can be performed efficiently through natural parameters. The exponential family has recently been extended to the t-exponential family, which contains Student-t distributions as family members and thus allows us to handle noisy data well. However, since the t-exponential family is defined by the deformed exponential, an efficient learning algorithm for the t-exponential family such as expectation propagation (EP) cannot be derived in the same way as the ordinary exponential family. In this paper, we borrow the mathematical tools of q-algebra from statistical physics and show that the pseudo additivity of distributions allows us to perform calculation of t-exponential family distributions through natural parameters. We then develop an expectation propagation (EP) algorithm for the t-exponential family, which provides a deterministic approximation to the posterior or predictive distribution with simple moment matching. We finally apply the proposed EP algorithm to the Bayes point machine and Student-t process classification, and demonstrate their performance numerically.", "full_text": "Expectation Propagation for t-Exponential Family\n\nUsing q-Algebra\n\nFutoshi Futami\n\nThe University of Tokyo, RIKEN\nfutami@ms.k.u-tokyo.ac.jp\n\nIssei Sato\n\nThe University of Tokyo, RIKEN\n\nsato@k.u-tokyo.ac.jp\n\nMasashi Sugiyama\n\nRIKEN, The University of Tokyo\n\nsugi@k.u-tokyo.ac.jp\n\nAbstract\n\nExponential family distributions are highly useful in machine learning since their\ncalculation can be performed e\ufb03ciently through natural parameters. 
The exponen-\ntial family has recently been extended to the t-exponential family, which contains\nStudent-t distributions as family members and thus allows us to handle noisy data\nwell. However, since the t-exponential family is de\ufb01ned by the deformed exponen-\ntial, an e\ufb03cient learning algorithm for the t-exponential family such as expectation\npropagation (EP) cannot be derived in the same way as the ordinary exponential\nfamily. In this paper, we borrow the mathematical tools of q-algebra from statisti-\ncal physics and show that the pseudo additivity of distributions allows us to perform\ncalculation of t-exponential family distributions through natural parameters. We\nthen develop an expectation propagation (EP) algorithm for the t-exponential fam-\nily, which provides a deterministic approximation to the posterior or predictive\ndistribution with simple moment matching. We \ufb01nally apply the proposed EP\nalgorithm to the Bayes point machine and Student-t process classi\ufb01cation, and\ndemonstrate their performance numerically.\n\n1\n\nIntroduction\n\nExponential family distributions play an important role in machine learning, due to the fact that their\ncalculation can be performed e\ufb03ciently and analytically through natural parameters or expected\nsu\ufb03cient statistics [1]. This property is particularly useful in the Bayesian framework since a\nconjugate prior always exists for an exponential family likelihood and the prior and posterior are\noften in the same exponential family. Moreover, parameters of the posterior distribution can be\nevaluated only through natural parameters.\nAs exponential family members, Gaussian distributions are most commonly used because their\nmoments, conditional distribution, and joint distribution can be computed analytically. Gaussian\nprocesses are a typical Bayesian method based on Gaussian distributions, which are used for various\npurposes such as regression, classi\ufb01cation, and optimization [8]. 
However, Gaussian distributions are sensitive to outliers, and heavier-tailed distributions are often preferred in practice. For example, Student-t distributions and Student-t processes are good alternatives to Gaussian distributions [4] and Gaussian processes [10], respectively.\nA technical problem of the Student-t distribution is that, unlike the Gaussian distribution, it does not belong to the exponential family and thus cannot enjoy the good properties of the exponential family. To cope with this problem, the exponential family was recently generalized to the t-exponential family [3], which contains Student-t distributions as family members.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\n\fFollowing this line, the Kullback-Leibler divergence was generalized to the t-divergence, and approximation methods based on t-divergence minimization have been explored [2]. However, the t-exponential family does not allow us to employ standard useful mathematical tricks; e.g., the logarithmic transformation does not reduce a product of t-exponential family functions to a summation. For this reason, the t-exponential family unfortunately does not inherit an important property of the original exponential family, namely that calculation can be performed through natural parameters. Furthermore, while the dimensionality of the sufficient statistics equals that of the natural parameters in the exponential family, so that there is no need to increase the parameter size to incorporate new information [9], this useful property does not hold in the t-exponential family.\nThe purpose of this paper is to further explore mathematical properties of natural parameters of the t-exponential family through pseudo additivity of distributions, based on the q-algebra used in statistical physics [7, 11]. More specifically, our contributions in this paper are three-fold:\n1. 
We show that, in the same way as for ordinary exponential family distributions, q-algebra allows us to handle the calculation of t-exponential family distributions through natural parameters.\n2. Our q-algebra based method enables us to extend assumed density filtering (ADF) [2] and to develop an expectation propagation (EP) algorithm [6] for the t-exponential family. In the same way as the original EP algorithm for ordinary exponential family distributions, our EP algorithm provides a deterministic approximation to the posterior or predictive distribution for t-exponential family distributions with simple moment matching.\n3. We apply the proposed EP algorithm to the Bayes point machine [6] and to Student-t process classification, and we demonstrate their usefulness as alternatives to the Gaussian approaches numerically.\n\n2 t-exponential Family\n\nIn this section, we review the t-exponential family [3, 2], which is a generalization of the exponential family.\nThe t-exponential family is defined as\n\np(x; θ) = exp_t(⟨Φ(x), θ⟩ − g_t(θ)),   (1)\n\nwhere exp_t(x) is the deformed exponential function defined as\n\nexp_t(x) = exp(x) if t = 1; [1 + (1 − t)x]^{1/(1−t)} otherwise,   (2)\n\nand g_t(θ) is the log-partition function that satisfies\n\n∇_θ g_t(θ) = E_{p^es}[Φ(x)].   (3)\n\nThe notation E_{p^es} denotes the expectation over p^es(x), where p^es(x) is the escort distribution of p(x) defined as\n\np^es(x) = p(x)^t / ∫ p(x)^t dx.   (4)\n\nWe call θ a natural parameter and Φ(x) sufficient statistics.\nLet us express the k-dimensional Student-t distribution with v degrees of freedom as\n\nSt(x; v, μ, Σ) = Γ((v + k)/2) / ((πv)^{k/2} Γ(v/2) |Σ|^{1/2}) · (1 + (x − μ)^⊤ (vΣ)^{−1} (x − μ))^{−(v+k)/2},   (5)\n\nwhere Γ(x) is the gamma function, |A| is the determinant of matrix A, and A^⊤ is the transpose of matrix A. 
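The deformed exponential and its inverse, the deformed logarithm, are easy to experiment with numerically; the following is a minimal sketch in plain Python (our own illustration, not code from the paper):

```python
import math

def exp_t(x, t):
    """Deformed exponential of Eq.(2): exp(x) at t = 1,
    otherwise [1 + (1 - t) x]^(1 / (1 - t)) on its positive domain."""
    if t == 1.0:
        return math.exp(x)
    base = 1.0 + (1.0 - t) * x
    return base ** (1.0 / (1.0 - t)) if base > 0 else 0.0

def ln_t(x, t):
    """Deformed logarithm (x^(1-t) - 1) / (1 - t), inverse of exp_t for x > 0."""
    if t == 1.0:
        return math.log(x)
    return (x ** (1.0 - t) - 1.0) / (1.0 - t)

t = 1.5
# ln_t inverts exp_t ...
assert abs(ln_t(exp_t(0.3, t), t) - 0.3) < 1e-12
# ... but the ordinary product rule fails for t != 1; the deformed
# exponential is instead pseudo-additive in its arguments:
lhs = exp_t(0.2, t) * exp_t(0.4, t)
assert abs(lhs - exp_t(0.6, t)) > 1e-3
assert abs(lhs - exp_t(0.6 + (1.0 - t) * 0.2 * 0.4, t)) < 1e-9
```

The last assertion is the pseudo additivity mentioned in the abstract, which Section 4 exploits via q-algebra.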
We can confirm that the Student-t distribution is a member of the t-exponential family as follows. First, we have\n\nSt(x; v, μ, Σ) = (Ψ + Ψ (x − μ)^⊤ (vΣ)^{−1} (x − μ))^{1/(1−t)},   (6)\n\nwhere Ψ = (Γ((v + k)/2) / ((πv)^{k/2} Γ(v/2) |Σ|^{1/2}))^{1−t}.   (7)\n\nNote that the relation −(v + k)/2 = 1/(1 − t) holds, from which we have\n\n⟨Φ(x), θ⟩ = (Ψ/(1 − t)) (x^⊤ K x − 2 μ^⊤ K x),   (8)\n\ng_t(θ) = −(Ψ/(1 − t)) (μ^⊤ K μ + 1) + 1/(1 − t),   (9)\n\nwhere K = (vΣ)^{−1}. Then, we can express the Student-t distribution as a member of the t-exponential family as\n\nSt(x; v, μ, Σ) = (1 + (1 − t)(⟨Φ(x), θ⟩ − g_t(θ)))^{1/(1−t)} = exp_t(⟨Φ(x), θ⟩ − g_t(θ)).   (10)\n\nIf t = 1, the deformed exponential function is reduced to the ordinary exponential function, and therefore the t-exponential family is reduced to the ordinary exponential family; this corresponds to the Student-t distribution with infinite degrees of freedom. For t-exponential family distributions, the t-divergence is defined as follows [2]:\n\nD_t(p ∥ p̃) = ∫ (p^es(x) ln_t p(x) − p^es(x) ln_t p̃(x)) dx,   (11)\n\nwhere ln_t x := (x^{1−t} − 1)/(1 − t) (x ≥ 0, t ∈ R+) and p^es(x) is the escort function of p(x).\n\n3 Assumed Density Filtering and Expectation Propagation\n\nWe briefly review assumed density filtering (ADF) and expectation propagation (EP) [6].\nLet D = {(x_1, y_1), . . . , (x_i, y_i)} be input-output paired data. 
We denote the likelihood for the i-th datum as l_i(w) and the prior distribution of parameter w as p_0(w). The total likelihood is given as ∏_i l_i(w), and the posterior distribution can be expressed as p(w|D) ∝ p_0(w) ∏_i l_i(w).\n\n3.1 Assumed Density Filtering\n\nADF is an online approximation method for the posterior distribution. Suppose that i − 1 samples (x_1, y_1), . . . , (x_{i−1}, y_{i−1}) have already been processed and an approximation to the posterior distribution, p̃_{i−1}(w), has already been obtained. Given the i-th sample (x_i, y_i), the posterior distribution p_i(w) can be obtained as\n\np_i(w) ∝ p̃_{i−1}(w) l_i(w).   (12)\n\nSince the true posterior distribution p_i(w) cannot be obtained analytically, it is approximated in ADF by minimizing the Kullback-Leibler (KL) divergence from p_i(w) to its approximation:\n\np̃_i = arg min_{p̃} KL(p_i ∥ p̃).   (13)\n\nNote that if p_i and p̃ are both exponential family members, the above calculation is reduced to moment matching.\n\n3.2 Expectation Propagation\n\nAlthough ADF is an effective method for online learning, it is not favorable for non-online situations, because the approximation quality depends heavily on the permutation of the data [6]. To overcome this problem, EP was proposed [6].\nIn EP, an approximation of the posterior that contains all data terms is prepared beforehand, typically as a product of data-corresponding terms:\n\np̃(w) = (1/Z) ∏_i l̃_i(w),   (14)\n\nwhere Z is the normalizing constant. In the above expression, each factor l̃_i(w), often called a site approximation [9], corresponds to the local likelihood l_i(w). If each l̃_i(w) is an exponential family member, the total approximation also belongs to the exponential family.\nDifferently from ADF, EP keeps these site approximations and updates them iteratively with the following four steps. First, when we update site l̃_j(w), we eliminate the effect of site j from the total approximation as\n\np̃^{∖j}(w) = p̃(w) / l̃_j(w),   (15)\n\nwhere p̃^{∖j}(w) is often called a cavity distribution [9]. If an exponential family distribution is used, the above calculation is reduced to subtraction of natural parameters. Second, we incorporate the likelihood l_j(w) by minimizing the divergence KL(p̃^{∖j}(w) l_j(w)/Z^{∖j} ∥ p̃(w)), where Z^{∖j} is the normalizing constant. Note that this minimization is reduced to moment matching for the exponential family. After this step, we obtain p̃(w). Third, we exclude the effect of sites other than j, which is equivalent to calculating a new site approximation as l̃_j(w)^{new} ∝ p̃(w) / p̃^{∖j}(w). Finally, we update the site approximation by replacing l̃_j(w) with l̃_j(w)^{new}.\nNote again that the calculation of EP is reduced to addition or subtraction of natural parameters for the exponential family.\n\n3.3 ADF for t-exponential Family\n\nADF for the t-exponential family was proposed in [2], which uses the t-divergence instead of the KL divergence:\n\np̃ = arg min_{p'} D_t(p ∥ p') = arg min_{p'} ∫ (p^es(x) ln_t p(x) − p^es(x) ln_t p'(x; θ)) dx.   (16)\n\nWhen an approximate distribution is chosen from the t-exponential family, we can utilize the property ∇_θ g_t(θ) = E_{p̃^es}[Φ(x)], where p̃^es is the escort function of p̃(x). Then, minimization of the t-divergence yields\n\nE_{p^es}[Φ(x)] = E_{p̃^es}[Φ(x)].   (17)\n\nThis is moment matching, which is a celebrated property of the exponential family. Since the expectation is taken with respect to the escort function, this is called escort moment matching.
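The claim in Section 3.1 that KL minimization onto an exponential family reduces to moment matching can be illustrated with a one-dimensional Gaussian ADF step; below is a schematic sketch using a toy sigmoid likelihood of our own choosing and simple grid quadrature (not the paper's implementation):

```python
import math

def adf_step(mu, var, likelihood, n=4001, half_width=10.0):
    """One ADF update: form the tilted distribution
    p(w) proportional to N(w; mu, var) * likelihood(w) on a grid,
    then project it back to a Gaussian by matching the first two moments."""
    s = math.sqrt(var)
    xs = [mu - half_width * s + i * (2.0 * half_width * s) / (n - 1)
          for i in range(n)]
    ws = [math.exp(-((x - mu) ** 2) / (2.0 * var)) * likelihood(x) for x in xs]
    z = sum(ws)
    m1 = sum(x * w for x, w in zip(xs, ws)) / z
    m2 = sum(x * x * w for x, w in zip(xs, ws)) / z
    return m1, m2 - m1 * m1

# A likelihood favoring positive w pulls the matched Gaussian to the right
# and makes it more concentrated than the N(0, 1) prior.
mu_new, var_new = adf_step(0.0, 1.0, lambda w: 1.0 / (1.0 + math.exp(-3.0 * w)))
assert mu_new > 0.0 and var_new < 1.0
```

For the t-exponential case the same projection is performed with the t-divergence, so the matched moments are taken under the escort distribution instead.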
As an example, consider the situation where the prior is a Student-t distribution and the posterior is approximated by a Student-t distribution: p(w|D) ≈ p̃(w) = St(w; μ̃, Σ̃, v). Then the approximated posterior p̃_i(w) = St(w; μ̃^{(i)}, Σ̃^{(i)}, v) can be obtained from p_i(w) ∝ p̃_{i−1}(w) l_i(w) by minimizing the t-divergence as\n\narg min_{μ', Σ'} D_t(p_i(w) ∥ St(w; μ', Σ', v)).   (18)\n\nThis allows us to obtain an analytical update expression for t-exponential family distributions.\n\n4 Expectation Propagation for t-exponential Family\n\nAs shown in the previous section, ADF has been extended to EP (which resulted in moment matching for the exponential family) and to the t-exponential family (which yielded escort moment matching for the t-exponential family). In this section, we combine these two extensions and propose EP for the t-exponential family.\n\n4.1 Pseudo Additivity and q-Algebra\n\nDifferently from ordinary exponential functions, deformed exponential functions do not satisfy the product rule:\n\nexp_t(x) exp_t(y) ≠ exp_t(x + y).   (19)\n\nFor this reason, the cavity distribution cannot be computed analytically for the t-exponential family. On the other hand, the following equality holds for deformed exponential functions:\n\nexp_t(x) exp_t(y) = exp_t(x + y + (1 − t)xy),   (20)\n\nwhich is called pseudo additivity.\nIn statistical physics [7, 11], a special algebra called q-algebra has been developed to handle systems with pseudo additivity. We will use the q-algebra for efficiently handling t-exponential distributions.\n\nDefinition 1 (q-product) The operation ⊗_q, called the q-product, is defined as\n\nx ⊗_q y := [x^{1−q} + y^{1−q} − 1]^{1/(1−q)} if x > 0, y > 0, x^{1−q} + y^{1−q} − 1 > 0; and 0 otherwise.   (21)\n\nDefinition 2 (q-division) The operation ⊘_q, called the q-division, is defined as\n\nx ⊘_q y := [x^{1−q} − y^{1−q} + 1]^{1/(1−q)} if x > 0, y > 0, x^{1−q} − y^{1−q} + 1 > 0; and 0 otherwise.   (22)\n\nDefinition 3 (q-logarithm) The q-logarithm is defined as\n\nln_q x := (x^{1−q} − 1)/(1 − q) (x ≥ 0, q ∈ R+).   (23)\n\nThe q-division is the inverse of the q-product (and vice versa), and the q-logarithm is the inverse of the q-exponential (and vice versa). From the above definitions, the q-logarithm and q-exponential satisfy the following relations:\n\nln_q(x ⊗_q y) = ln_q x + ln_q y,   (24)\nexp_q(x) ⊗_q exp_q(y) = exp_q(x + y),   (25)\n\nwhich are called the q-product rules. Also for the q-division, similar properties hold:\n\nln_q(x ⊘_q y) = ln_q x − ln_q y,   (26)\nexp_q(x) ⊘_q exp_q(y) = exp_q(x − y),   (27)\n\nwhich are called the q-division rules.\n\n4.2 EP for t-exponential Family\n\nThe q-algebra allows us to recover many useful properties of the ordinary exponential family. 
For example, the q-product of two t-exponential family distributions yields an unnormalized t-exponential distribution:\n\nexp_t(⟨Φ(x), θ_1⟩ − g_t(θ_1)) ⊗_t exp_t(⟨Φ(x), θ_2⟩ − g_t(θ_2)) = exp_t(⟨Φ(x), θ_1 + θ_2⟩ − g̃_t(θ_1, θ_2)).   (28)\n\nBased on this q-product rule, we develop EP for the t-exponential family.\nConsider the situation where the prior distribution p^{(0)}(w) is a member of the t-exponential family. As an approximation to the posterior, we choose a t-exponential family distribution\n\np̃(w; θ) = exp_t(⟨Φ(w), θ⟩ − g_t(θ)).   (29)\n\nIn the original EP for the ordinary exponential family, we considered an approximate posterior of the form\n\np̃(w) ∝ p^{(0)}(w) ∏_i l̃_i(w);   (30)\n\nthat is, we factorized the posterior into a product of site approximations corresponding to the data. For the t-exponential family, on the other hand, we propose to use the following form, called the t-factorization:\n\np̃(w) ∝ p^{(0)}(w) ⊗_t (⊗_t)_i l̃_i(w),   (31)\n\nwhere (⊗_t)_i denotes the q-product taken over all sites i. The t-factorization is reduced to the original factorization form when t = 1.\nThis t-factorization enables us to calculate EP update rules through natural parameters for the t-exponential family, in the same way as for the ordinary exponential family. 
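The q-product rule behind Eq.(28), namely that natural parameters simply add under the q-product, can be checked numerically for scalars; a small self-contained sketch (our own code, not the paper's):

```python
def exp_t(x, t):
    """Deformed exponential [1 + (1 - t) x]^(1 / (1 - t)) on its positive domain."""
    base = 1.0 + (1.0 - t) * x
    return base ** (1.0 / (1.0 - t)) if base > 0 else 0.0

def q_product(x, y, q):
    """Definition 1: the q-product [x^(1-q) + y^(1-q) - 1]^(1/(1-q)) when defined."""
    s = x ** (1.0 - q) + y ** (1.0 - q) - 1.0
    return s ** (1.0 / (1.0 - q)) if (x > 0.0 and y > 0.0 and s > 0.0) else 0.0

t = 1.5
# The q-product of two deformed exponentials adds their arguments, so the
# natural parameters of t-exponential factors add under the q-product.
lhs = q_product(exp_t(0.2, t), exp_t(0.3, t), t)
assert abs(lhs - exp_t(0.5, t)) < 1e-9
```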
More specifically, consider the case where factor j of the t-factorization is updated, in four steps analogous to those of the original EP.\n(I) First, we calculate the cavity distribution by using the q-division as\n\np̃^{∖j}(w) ∝ p̃(w) ⊘_t l̃_j(w) ∝ p^{(0)}(w) ⊗_t (⊗_t)_{i≠j} l̃_i(w).   (32)\n\nBy the q-algebra rules, the above calculation is reduced to subtraction of natural parameters:\n\nθ^{∖j} = θ − θ^{(j)}.   (33)\n\n(II) The second step is inclusion of the site likelihood l_j(w), which is performed by p̃^{∖j}(w) l_j(w). The site likelihood l_j(w) is incorporated into the approximate posterior by the ordinary product, not the q-product. Thus, moment matching is performed to obtain a new approximation. For this purpose, the following theorem is useful.\n\nTheorem 1 The expected sufficient statistic,\n\nη = ∇_θ g_t(θ) = E_{p̃^es}[Φ(w)],   (34)\n\ncan be derived as\n\nη = η^{∖j} + (1/Z_2) ∇_{θ^{∖j}} Z_1,   (35)\n\nwhere Z_1 = ∫ p̃^{∖j}(w) (l_j(w))^t dw and Z_2 = ∫ p̃^{es,∖j}(w) (l_j(w))^t dw.   (36)\n\nA proof of Theorem 1 is given in Appendix A of the supplementary material. After moment matching, we obtain a new approximation, p̃^{new}(w).\n(III) Third, we exclude the effect of sites other than j. This is achieved by\n\nl̃_j^{new}(w) ∝ p̃^{new}(w) ⊘_t p̃^{∖j}(w),   (37)\n\nwhich is reduced to subtraction of natural parameters:\n\nθ^{(j),new} = θ^{new} − θ^{∖j}.   (38)\n\n(IV) Finally, we update the site approximation by replacing l̃_j(w) with l̃_j^{new}(w).\nThese four steps constitute our proposed EP method for the t-exponential family. As we have seen, these steps are reduced to the ordinary EP steps if t = 1. 
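In natural-parameter form, steps (I) and (III) above are plain vector subtraction, with all problem-specific work isolated in the moment-matching step (II); a schematic skeleton of the bookkeeping (names are our own, moment matching left as a callback):

```python
def subtract(a, b):
    """Component-wise difference of natural parameter vectors."""
    return [x - y for x, y in zip(a, b)]

def ep_site_update(theta_total, theta_site, moment_match):
    """One pass over site j: (I) cavity by subtraction, (II) include the
    site likelihood via (escort) moment matching, (III) recover the new
    site parameters by subtraction, then (IV) return the updated values."""
    theta_cavity = subtract(theta_total, theta_site)    # step (I)
    theta_new = moment_match(theta_cavity)              # step (II), problem-specific
    theta_site_new = subtract(theta_new, theta_cavity)  # step (III)
    return theta_new, theta_site_new

# With an identity moment match, the updated site carries no information:
theta_new, site_new = ep_site_update([1.0, 2.0], [0.5, 0.5], lambda th: th)
assert theta_new == [0.5, 1.5] and site_new == [0.0, 0.0]
```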
Thus, the proposed method can be regarded as an extension of the original EP to the t-exponential family.\n\n4.3 Marginal Likelihood for t-exponential Family\n\nIn the above, we omitted the normalization term of the site approximation to simplify the derivation. Here, we derive the marginal likelihood, which requires us to explicitly take into account the normalization term C̃_i:\n\nl̃_i(w | C̃_i, μ̃_i, σ̃_i²) = C̃_i ⊗_t exp_t(⟨Φ(w), θ⟩).   (39)\n\nWe assume that this normalizer corresponds to Z_1, which is the same assumption as that for the ordinary EP. To calculate Z_1, we use the following theorem (its proof is available in Appendix B of the supplementary material):\n\nTheorem 2 For the Student-t distribution, we have\n\n∫ exp_t(⟨Φ(w), θ⟩ − g) dw = (exp_t(g_t(θ)/Ψ − g/Ψ))^{(3−t)/2},   (40)\n\nwhere g is a constant, g_t(θ) is the log-partition function, and Ψ is defined in (7).\n\n\fFigure 1: Boundaries obtained by ADF (left two, with different sample orders) and EP (right).\n\nThis theorem yields\n\nlog_t Z_1^{2/(3−t)} = g_t(θ)/Ψ − g_t^{∖j}(θ)/Ψ + (log_t C̃_j)/Ψ,   (41)\n\nand therefore the marginal likelihood can be calculated as follows (see Appendix C for details):\n\nZ_EP = ∫ p^{(0)}(w) ⊗_t (⊗_t)_i l̃_i(w) dw = (exp_t(Σ_i (log_t C̃_i)/Ψ + g_t(θ)/Ψ − g_t^{prior}(θ)/Ψ))^{(3−t)/2}.   (42)\n\nBy substituting C̃_i into Eq.(42), we obtain the marginal likelihood. Note that, if t = 1, the above expression of Z_EP is reduced to the ordinary marginal likelihood expression [9]. 
Therefore, this marginal likelihood can be regarded as a generalization of the ordinary exponential family marginal likelihood to the t-exponential family.\nIn Appendices D and E of the supplementary material, we derive specific EP algorithms for the Bayes point machine (BPM) and Student-t process classification.\n\n5 Numerical Experiments\n\nIn this section, we numerically illustrate the behavior of our proposed EP applied to BPM and Student-t process classification. Suppose that data (x_1, y_1), . . . , (x_n, y_n) are given, where y_i ∈ {+1, −1} expresses a class label for covariate x_i. We consider a model whose likelihood term can be expressed as\n\nl_i(w) = p(y_i | x_i, w) = ε + (1 − 2ε) Θ(y_i ⟨w, x_i⟩),   (43)\n\nwhere Θ(x) is the step function taking 1 if x > 0 and 0 otherwise.\n\n5.1 BPM\n\nWe compare EP and ADF to confirm that EP does not depend on data permutation. We generate a toy dataset in the following way: 1000 data points x are generated from the Gaussian mixture model 0.05 N(x; [1, 1]^⊤, 0.05I) + 0.25 N(x; [−1, 1]^⊤, 0.05I) + 0.45 N(x; [−1, −1]^⊤, 0.05I) + 0.25 N(x; [1, −1]^⊤, 0.05I), where N(x; μ, Σ) denotes the Gaussian density with respect to x with mean μ and covariance matrix Σ, and I is the identity matrix. For x, we assign label y = +1 when x comes from N(x; [1, 1]^⊤, 0.05I) or N(x; [−1, −1]^⊤, 0.05I), and label y = −1 when x comes from N(x; [−1, 1]^⊤, 0.05I) or N(x; [1, −1]^⊤, 0.05I). We evaluate the dependence of the performance of BPM (see Appendix D of the supplementary material for details) on data permutation.\nFig.1 shows the labeled samples by blue and red points and the decision boundaries by black lines, derived from ADF and EP for the Student-t distribution with v = 10 under different data permutations. The first two graphs show an obvious dependence on data permutation for ADF (to clarify the dependence, we show the most different boundaries in the figure), while the last graph exhibits almost no dependence on data permutation for EP.\n\n\fFigure 2: Classification boundaries.\n\n5.2 Student-t Process Classification\n\nWe compare the robustness of Student-t process classification (STC) and Gaussian process classification (GPC) visually.\nWe apply our EP method to Student-t process binary classification, where the latent function follows a Student-t process (see Appendix E of the supplementary material for details). We compare this model with Gaussian process binary classification with the likelihood expressed in Eq.(43). This kind of model is called robust Gaussian process classification [5]. Since the posterior distribution cannot be obtained analytically even for the Gaussian process, we use EP for the ordinary exponential family to approximate the posterior.\nWe use a two-dimensional toy dataset, where we generate two-dimensional data points x_i (i = 1, . . . , 300) following the normal distributions p(x | y_i = +1) = N(x; [1.5, 1.5]^⊤, 0.5I) and p(x | y_i = −1) = N(x; [−1, −1]^⊤, 0.5I). We add eight outliers to the dataset (about 3% outliers) and evaluate the robustness against them. In the experiment, we used v = 10 for the Student-t processes. 
We furthermore used the following kernel:\n\nk(x_i, x_j) = θ_0 exp{− Σ_{d=1}^D θ_1^d (x_i^d − x_j^d)²} + θ_2 + θ_3 δ_{i,j},   (44)\n\nwhere x_i^d is the d-th element of x_i, and θ_0, θ_1, θ_2, θ_3 are hyperparameters to be optimized.\nFig.2 shows the labeled samples by blue and red points, the obtained decision boundaries by black lines, and the added outliers by blue and red stars. As we can see, the decision boundaries obtained by the Gaussian process classifier are heavily affected by the outliers, while those obtained by the Student-t process classifier are more stable. Thus, as expected, Student-t process classification is more robust against outliers than Gaussian process classification, thanks to the heavy-tailed structure of the Student-t distribution.\n\nTable 1: Classification Error Rates (%)\n\nDataset | Outliers | GPC | STC\nPima | 0 | 34.0(3.0) | 32.3(2.6)\nPima | 5% | 34.9(3.1) | 32.9(3.1)\nPima | 10% | 36.2(3.3) | 34.4(3.5)\nIonosphere | 0 | 7.5(2.0) | 9.6(1.7)\nIonosphere | 5% | 9.6(3.2) | 9.9(2.8)\nIonosphere | 10% | 13.0(5.2) | 11.9(5.4)\nThyroid | 0 | 4.3(1.3) | 4.4(1.3)\nThyroid | 5% | 4.8(1.8) | 5.5(2.3)\nThyroid | 10% | 5.4(1.4) | 7.2(3.4)\nSonar | 0 | 15.4(3.6) | 15.0(3.2)\nSonar | 5% | 18.3(4.4) | 17.5(3.3)\nSonar | 10% | 19.4(3.8) | 19.4(3.1)\n\nTable 2: Approximate log evidence\n\nDataset | Outliers | GPC | STC\nPima | 0 | -74.1(2.4) | -37.1(6.1)\nPima | 5% | -77.8(2.9) | -37.2(6.5)\nPima | 10% | -78.6(1.8) | -36.8(6.5)\nIonosphere | 0 | -59.5(5.2) | -36.9(7.4)\nIonosphere | 5% | -75.0(3.6) | -35.8(7.0)\nIonosphere | 10% | -90.3(5.2) | -37.4(7.2)\nThyroid | 0 | -32.5(1.6) | -41.2(4.3)\nThyroid | 5% | -39.1(2.3) | -45.8(5.5)\nThyroid | 10% | -46.9(1.8) | -45.8(4.5)\nSonar | 0 | -55.8(1.2) | -41.6(1.2)\nSonar | 5% | -59.4(2.5) | -41.3(1.6)\nSonar | 10% | -65.8(1.1) | -67.8(2.1)\n\n5.3 Experiments on Benchmark Datasets\n\nWe compared the performance of Gaussian process and Student-t process classification on UCI datasets1. We used the kernel given in Eq.(44). 
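For reference, the kernel of Eq.(44) is straightforward to implement; a minimal sketch (argument handling and names are our own, with per-dimension weights th1):

```python
import math

def kernel(xi, xj, i, j, th0, th1, th2, th3):
    """Kernel of Eq.(44): th0 * exp(-sum_d th1[d] * (xi_d - xj_d)^2)
    + th2 + th3 * delta_ij, with an ARD-style weight per dimension."""
    sq = sum(t1 * (a - b) ** 2 for t1, a, b in zip(th1, xi, xj))
    return th0 * math.exp(-sq) + th2 + th3 * (1.0 if i == j else 0.0)

# Diagonal entries pick up the th3 jitter term; distant points decay
# toward the th2 bias.
k_diag = kernel([1.0, 2.0], [1.0, 2.0], 0, 0, 1.0, [0.5, 0.5], 0.1, 0.01)
k_off = kernel([1.0, 2.0], [0.0, 0.0], 0, 1, 1.0, [0.5, 0.5], 0.1, 0.01)
assert abs(k_diag - 1.11) < 1e-9
assert k_off < k_diag
```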
The detailed explanation of the experimental settings is given in Appendix F. Results are shown in Tables 1 and 2, where the outlier percentage indicates what fraction of training labels we randomly flipped to create additional outliers. As we can see, Student-t process classification outperforms Gaussian process classification in many cases.\n\n6 Conclusions\n\nIn this work, we enabled the t-exponential family to inherit an important property of the exponential family, namely that calculation can be efficiently performed through natural parameters, by using the q-algebra. With this natural-parameter-based calculation, we developed EP for the t-exponential family by introducing the t-factorization approach. The key concept of our proposed approach is that the t-exponential family has pseudo additivity. When t = 1, our proposed EP for the t-exponential family is reduced to the original EP for the ordinary exponential family, and the t-factorization yields the ordinary data-dependent factorization. Therefore, our proposed EP method can be viewed as a generalization of the original EP. Through illustrative experiments, we confirmed that our proposed EP applied to the Bayes point machine can overcome the drawback of ADF, i.e., the proposed EP method is independent of data permutations. We also experimentally illustrated that the proposed EP applied to Student-t process classification exhibits high robustness to outliers compared with Gaussian process classification. Experiments on benchmark data also demonstrated the superiority of the Student-t process.\nIn future work, we will further extend the proposed EP method to more general message passing methods or double-loop EP. 
We would also like to make our method more scalable to large datasets and to develop other approximation methods such as variational inference.\n\nAcknowledgement\n\nFF acknowledges support by JST CREST JPMJCR1403, and MS acknowledges support by KAKENHI 17H00757.\n\n1 https://archive.ics.uci.edu/ml/index.php\n\n\fReferences\n[1] Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.\n[2] Nan Ding, Yuan Qi, and S. V. N. Vishwanathan. t-divergence based approximate inference. In Advances in Neural Information Processing Systems, pages 1494-1502, 2011.\n[3] Nan Ding and S. V. N. Vishwanathan. t-logistic regression. In Advances in Neural Information Processing Systems, pages 514-522, 2010.\n[4] Pasi Jylänki, Jarno Vanhatalo, and Aki Vehtari. Robust Gaussian process regression with a Student-t likelihood. Journal of Machine Learning Research, 12(Nov):3227-3257, 2011.\n[5] Hyun-Chul Kim and Zoubin Ghahramani. Outlier robust Gaussian process classification. Structural, Syntactic, and Statistical Pattern Recognition, pages 896-905, 2008.\n[6] Thomas Peter Minka. A Family of Algorithms for Approximate Bayesian Inference. PhD thesis, Massachusetts Institute of Technology, 2001.\n[7] Laurent Nivanen, Alain Le Mehaute, and Qiuping A. Wang. Generalized algebra within a nonextensive statistics. Reports on Mathematical Physics, 52(3):437-444, 2003.\n[8] Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, Cambridge, 2006.\n[9] Matthias Seeger. Expectation propagation for exponential families. Technical report, 2005. URL https://infoscience.epfl.ch/record/161464/files/epexpfam.pdf\n[10] Amar Shah, Andrew Wilson, and Zoubin Ghahramani. Student-t processes as alternatives to Gaussian processes. In Artificial Intelligence and Statistics, pages 877-885, 2014.\n[11] Hiroki Suyari and Makoto Tsukada. 
Law of error in Tsallis statistics. IEEE Transactions on\n\nInformation Theory, 51(2):753\u2013757, 2005.\n\n10\n\n\f", "award": [], "sourceid": 1327, "authors": [{"given_name": "Futoshi", "family_name": "Futami", "institution": "University of Tokyo/RIKEN"}, {"given_name": "Issei", "family_name": "Sato", "institution": "The University of Tokyo/RIKEN"}, {"given_name": "Masashi", "family_name": "Sugiyama", "institution": "RIKEN / University of Tokyo"}]}