{"title": "Relative Density-Ratio Estimation for Robust Distribution Comparison", "book": "Advances in Neural Information Processing Systems", "page_first": 594, "page_last": 602, "abstract": "Divergence estimators based on direct approximation of density-ratios without going through separate approximation of numerator and denominator densities have been successfully applied to machine learning tasks that involve distribution comparison such as outlier detection, transfer learning, and two-sample homogeneity test. However, since density-ratio functions often possess high fluctuation, divergence estimation is still a challenging task in practice. In this paper, we propose to use relative divergences for distribution comparison, which involves approximation of relative density-ratios. Since relative density-ratios are always smoother than corresponding ordinary density-ratios, our proposed method is favorable in terms of the non-parametric convergence speed. Furthermore, we show that the proposed divergence estimator has asymptotic variance independent of the model complexity under a parametric setup, implying that the proposed estimator hardly overfits even with complex models. Through experiments, we demonstrate the usefulness of the proposed approach.", "full_text": "Relative Density-Ratio Estimation\nfor Robust Distribution Comparison\n\nMakoto Yamada\n\nTokyo Institute of Technology\n\nTaiji Suzuki\n\nThe University of Tokyo\n\nyamada@sg.cs.titech.ac.jp\n\ns-taiji@stat.t.u-tokyo.ac.jp\n\nTakafumi Kanamori\nNagoya University\n\nkanamori@is.nagoya-u.ac.jp\n\nHirotaka Hachiya Masashi Sugiyama\n\nTokyo Institute of Technology\n\n{hachiya@sg. 
sugi@}cs.titech.ac.jp\n\nAbstract\n\nDivergence estimators based on direct approximation of density-ratios without going through separate approximation of numerator and denominator densities have been successfully applied to machine learning tasks that involve distribution comparison such as outlier detection, transfer learning, and two-sample homogeneity test. However, since density-ratio functions often possess high fluctuation, divergence estimation is still a challenging task in practice. In this paper, we propose to use relative divergences for distribution comparison, which involves approximation of relative density-ratios. Since relative density-ratios are always smoother than corresponding ordinary density-ratios, our proposed method is favorable in terms of the non-parametric convergence speed. Furthermore, we show that the proposed divergence estimator has asymptotic variance independent of the model complexity under a parametric setup, implying that the proposed estimator hardly overfits even with complex models. Through experiments, we demonstrate the usefulness of the proposed approach.\n\n1 Introduction\n\nComparing probability distributions is a fundamental task in statistical data processing. It can be used for, e.g., outlier detection [1, 2], two-sample homogeneity test [3, 4], and transfer learning [5, 6].\n\nA standard approach to comparing probability densities p(x) and p0(x) would be to estimate a divergence from p(x) to p0(x), such as the Kullback-Leibler (KL) divergence [7]:\n\nKL[p(x), p0(x)] := Ep(x)[log r(x)], where r(x) := p(x)/p0(x),\n\nand Ep(x) denotes the expectation over p(x). A naive way to estimate the KL divergence is to separately approximate the densities p(x) and p0(x) from data and plug the estimated densities in the above definition. However, since density estimation is known to be a hard task [8], this approach does not work well unless a good parametric model is available. 
Recently, a divergence estimation approach which directly approximates the density-ratio r(x) without going through separate approximation of the densities p(x) and p0(x) has been proposed [9, 10]. Such density-ratio approximation methods were proved to achieve the optimal non-parametric convergence rate in the minimax sense.\n\nHowever, the KL divergence estimation via density-ratio approximation is computationally rather expensive due to the non-linearity introduced by the \u2018log\u2019 term. To cope with this problem, another divergence called the Pearson (PE) divergence [11] is useful. The PE divergence is defined as\n\nPE[p(x), p0(x)] := (1/2) Ep0(x)[(r(x) \u2212 1)^2].\n\nThe PE divergence is a squared-loss variant of the KL divergence, and they both belong to the class of the Ali-Silvey-Csisz\u00e1r divergences (also known as the f-divergences, see [12, 13]). Thus, the PE and KL divergences share similar properties, e.g., they are non-negative and vanish if and only if p(x) = p0(x).\n\nSimilarly to the KL divergence estimation, the PE divergence can also be accurately estimated based on density-ratio approximation [14]: the density-ratio approximator called unconstrained least-squares importance fitting (uLSIF) gives the PE divergence estimator analytically, which can be computed just by solving a system of linear equations. The practical usefulness of the uLSIF-based PE divergence estimator was demonstrated in various applications such as outlier detection [2], two-sample homogeneity test [4], and dimensionality reduction [15].\n\nIn this paper, we first establish the non-parametric convergence rate of the uLSIF-based PE divergence estimator, which elucidates its superior theoretical properties. However, it also reveals that its convergence rate is actually governed by the \u2018sup\u2019-norm of the true density-ratio function: max_x r(x). 
This implies that, in the region where the denominator density p0(x) takes small values, the density-ratio r(x) = p(x)/p0(x) tends to take large values and therefore the overall convergence speed becomes slow. More critically, density-ratios can even diverge to infinity under a rather simple setting, e.g., when the ratio of two Gaussian functions is considered [16]. This makes the paradigm of divergence estimation based on density-ratio approximation unreliable.\n\nIn order to overcome this fundamental problem, we propose an alternative approach to distribution comparison called \u03b1-relative divergence estimation. In the proposed approach, we estimate the \u03b1-relative divergence, which is the divergence from p(x) to the \u03b1-mixture density:\n\nq\u03b1(x) = \u03b1p(x) + (1 \u2212 \u03b1)p0(x) for 0 \u2264 \u03b1 < 1.\n\nFor example, the \u03b1-relative PE divergence is given by\n\nPE\u03b1[p(x), p0(x)] := PE[p(x), q\u03b1(x)] = (1/2) Eq\u03b1(x)[(r\u03b1(x) \u2212 1)^2], (1)\n\nwhere r\u03b1(x) is the \u03b1-relative density-ratio of p(x) and p0(x):\n\nr\u03b1(x) := p(x)/q\u03b1(x) = p(x)/(\u03b1p(x) + (1 \u2212 \u03b1)p0(x)). (2)\n\nWe propose to estimate the \u03b1-relative divergence by direct approximation of the \u03b1-relative density-ratio.\n\nA notable advantage of this approach is that the \u03b1-relative density-ratio is always bounded above by 1/\u03b1 when \u03b1 > 0, even when the ordinary density-ratio is unbounded. Based on this feature, we theoretically show that the \u03b1-relative PE divergence estimator based on \u03b1-relative density-ratio approximation is more favorable than the ordinary density-ratio approach in terms of the non-parametric convergence speed.\n\nWe further prove that, under a correctly-specified parametric setup, the asymptotic variance of our \u03b1-relative PE divergence estimator does not depend on the model complexity. 
This means that the proposed \u03b1-relative PE divergence estimator hardly overfits even with complex models.\n\nThrough experiments on outlier detection, two-sample homogeneity test, and transfer learning, we demonstrate that our proposed \u03b1-relative PE divergence estimator compares favorably with alternative approaches.\n\n2 Estimation of Relative Pearson Divergence via Least-Squares Relative Density-Ratio Approximation\n\nSuppose we are given independent and identically distributed (i.i.d.) samples {xi}_{i=1}^{n} from a d-dimensional distribution P with density p(x) and i.i.d. samples {x0j}_{j=1}^{n0} from another d-dimensional distribution P0 with density p0(x). Our goal is to compare the two underlying distributions P and P0 only using the two sets of samples {xi}_{i=1}^{n} and {x0j}_{j=1}^{n0}.\n\nIn this section, we give a method for estimating the \u03b1-relative PE divergence based on direct approximation of the \u03b1-relative density-ratio.\n\nDirect Approximation of \u03b1-Relative Density-Ratios: Let us model the \u03b1-relative density-ratio r\u03b1(x) (2) by the following kernel model:\n\ng(x; \u03b8) := \u03a3_{l=1}^{n} \u03b8_l K(x, x_l),\n\nwhere \u03b8 := (\u03b81, . . . , \u03b8n)\u22a4 are parameters to be learned from data samples, \u22a4 denotes the transpose of a matrix or a vector, and K(x, x0) is a kernel basis function. 
In the experiments, we use the Gaussian kernel.\n\nThe parameters \u03b8 in the model g(x; \u03b8) are determined so that the following expected squared-error J is minimized:\n\nJ(\u03b8) := (1/2) Eq\u03b1(x)[(g(x; \u03b8) \u2212 r\u03b1(x))^2]\n= (\u03b1/2) Ep(x)[g(x; \u03b8)^2] + ((1\u2212\u03b1)/2) Ep0(x)[g(x; \u03b8)^2] \u2212 Ep(x)[g(x; \u03b8)] + Const.,\n\nwhere we used r\u03b1(x)q\u03b1(x) = p(x) in the third term. Approximating the expectations by empirical averages, we obtain the following optimization problem:\n\n\u03b8\u0302 := argmin_{\u03b8 \u2208 R^n} [(1/2)\u03b8\u22a4H\u0302\u03b8 \u2212 h\u0302\u22a4\u03b8 + (\u03bb/2)\u03b8\u22a4\u03b8], (3)\n\nwhere the penalty term (\u03bb/2)\u03b8\u22a4\u03b8 is included for regularization purposes, and \u03bb (\u2265 0) denotes the regularization parameter. H\u0302 and h\u0302 are defined as\n\nH\u0302_{l,l'} := (\u03b1/n) \u03a3_{i=1}^{n} K(xi, x_l)K(xi, x_{l'}) + ((1\u2212\u03b1)/n0) \u03a3_{j=1}^{n0} K(x0j, x_l)K(x0j, x_{l'}),\nh\u0302_l := (1/n) \u03a3_{i=1}^{n} K(xi, x_l).\n\nIt is easy to confirm that the solution of Eq.(3) can be analytically obtained as\n\n\u03b8\u0302 = (H\u0302 + \u03bbI_n)^{-1} h\u0302,\n\nwhere I_n denotes the n-dimensional identity matrix. Finally, a density-ratio estimator is given as\n\nr\u0302\u03b1(x) := g(x; \u03b8\u0302) = \u03a3_{l=1}^{n} \u03b8\u0302_l K(x, x_l).\n\nWhen \u03b1 = 0, the above method is reduced to a direct density-ratio estimator called unconstrained least-squares importance fitting (uLSIF) [14]. Thus, the above method can be regarded as an extension of uLSIF to the \u03b1-relative density-ratio. 
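As a concrete illustration, the closed-form procedure above can be sketched in a few lines of NumPy (a minimal sketch, not the authors' reference implementation; the Gaussian width `sigma` and the regularization constant `lam` are fixed illustrative values standing in for model-selected choices, and the kernel centers are placed on all numerator samples as described above):

```python
import numpy as np

def rulsif_fit(x_nu, x_de, alpha=0.5, sigma=1.0, lam=0.1):
    """Fit g(x; theta) = sum_l theta_l K(x, x_l) with Gaussian kernels centered
    on the numerator samples x_nu, returning an estimator of the relative ratio
    r_alpha(x) = p(x) / (alpha p(x) + (1 - alpha) p0(x))."""
    def gram(a, b):
        # Gaussian kernel matrix K(a_i, b_j)
        sq = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=2)
        return np.exp(-sq / (2 * sigma ** 2))

    K_nu = gram(x_nu, x_nu)          # n  x n : kernels at numerator samples
    K_de = gram(x_de, x_nu)          # n' x n : kernels at denominator samples
    n, n_de = len(x_nu), len(x_de)
    # H-hat and h-hat as in the text, then theta = (H + lam I)^{-1} h
    H = alpha * K_nu.T @ K_nu / n + (1 - alpha) * K_de.T @ K_de / n_de
    h = K_nu.mean(axis=0)
    theta = np.linalg.solve(H + lam * np.eye(n), h)
    return lambda x: gram(x, x_nu) @ theta

rng = np.random.default_rng(0)
x_p = rng.normal(0.0, 1.0, size=(200, 1))   # samples from p(x)
x_q = rng.normal(0.5, 1.0, size=(200, 1))   # samples from p0(x)
r_hat = rulsif_fit(x_p, x_q, alpha=0.5)
ratios = r_hat(x_p)                          # r_alpha-hat at numerator samples
print(ratios.shape)
```

The linear system is of size n regardless of the input dimension d, which is why the solution is obtained just by solving a system of linear equations.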
Because it extends uLSIF in this way, we refer to our method as relative uLSIF (RuLSIF).\n\nThe performance of RuLSIF depends on the choice of the kernel function (the kernel width in the case of the Gaussian kernel) and the regularization parameter \u03bb. Model selection of RuLSIF is possible based on cross-validation (CV) with respect to the squared-error criterion J.\n\nUsing an estimator of the \u03b1-relative density-ratio r\u03b1(x), we can construct estimators of the \u03b1-relative PE divergence (1). After a few lines of calculation, we can show that the \u03b1-relative PE divergence (1) is equivalently expressed as\n\nPE\u03b1 = \u2212(\u03b1/2) Ep(x)[r\u03b1(x)^2] \u2212 ((1\u2212\u03b1)/2) Ep0(x)[r\u03b1(x)^2] + Ep(x)[r\u03b1(x)] \u2212 1/2 = (1/2) Ep(x)[r\u03b1(x)] \u2212 1/2.\n\nNote that the middle expression can also be obtained via Legendre-Fenchel convex duality of the divergence functional [17].\n\nBased on these expressions, we consider the following two estimators:\n\nPE\u0302\u03b1 := \u2212(\u03b1/(2n)) \u03a3_{i=1}^{n} r\u0302\u03b1(xi)^2 \u2212 ((1\u2212\u03b1)/(2n0)) \u03a3_{j=1}^{n0} r\u0302\u03b1(x0j)^2 + (1/n) \u03a3_{i=1}^{n} r\u0302\u03b1(xi) \u2212 1/2, (4)\nPE\u0303\u03b1 := (1/(2n)) \u03a3_{i=1}^{n} r\u0302\u03b1(xi) \u2212 1/2. (5)\n\nWe note that the \u03b1-relative PE divergence (1) can have further different expressions than the above ones, and corresponding estimators can also be constructed similarly. However, the above two expressions will be particularly useful: the first estimator PE\u0302\u03b1 has superior theoretical properties (see Section 3) and the second one PE\u0303\u03b1 is simple to compute.\n\n3 Theoretical Analysis\n\nIn this section, we analyze theoretical properties of the proposed PE divergence estimators. 
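As a concrete reference point for the analysis, the two estimators (4) and (5) reduce to a few lines (a sketch under the notation above; the arrays `r_nu` and `r_de` are assumed to hold fitted ratio values r\u0302\u03b1(xi) and r\u0302\u03b1(x0j)):

```python
import numpy as np

def pe_alpha_hat(r_nu, r_de, alpha):
    # Estimator (4): uses the fitted ratio at both numerator (r_nu)
    # and denominator (r_de) samples.
    return (-alpha / 2 * np.mean(r_nu ** 2)
            - (1 - alpha) / 2 * np.mean(r_de ** 2)
            + np.mean(r_nu) - 0.5)

def pe_alpha_tilde(r_nu):
    # Estimator (5): uses the fitted ratio at numerator samples only.
    return np.mean(r_nu) / 2 - 0.5

# Sanity check: if p = p0 the true relative ratio is identically 1,
# and both estimators return exactly PE_alpha = 0.
ones = np.ones(100)
print(pe_alpha_hat(ones, ones, 0.5), pe_alpha_tilde(ones))  # 0.0 0.0
```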
Since our theoretical analysis is highly technical, we focus on explaining practical insights we can gain from the theoretical results here; we describe all the mathematical details in the supplementary material.\n\nFor theoretical analysis, let us consider a rather abstract form of our relative density-ratio estimator described as\n\nargmin_{g \u2208 G} [(\u03b1/(2n)) \u03a3_{i=1}^{n} g(xi)^2 + ((1\u2212\u03b1)/(2n0)) \u03a3_{j=1}^{n0} g(x0j)^2 \u2212 (1/n) \u03a3_{i=1}^{n} g(xi) + (\u03bb/2) R(g)^2], (6)\n\nwhere G is some function space (i.e., a statistical model) and R(\u00b7) is some regularization functional.\n\nNon-Parametric Convergence Analysis: First, we elucidate the non-parametric convergence rate of the proposed PE estimators. Here, we practically regard the function space G as an infinite-dimensional reproducing kernel Hilbert space (RKHS) [18] such as the Gaussian kernel space, and R(\u00b7) as the associated RKHS norm.\n\nLet us represent the complexity of the function space G by \u03b3 (0 < \u03b3 < 2); the larger \u03b3 is, the more complex the function class G is (see the supplementary material for its precise definition). 
We analyze the convergence rate of our PE divergence estimators as n\u0304 := min(n, n0) tends to infinity for \u03bb = \u03bb_n\u0304 under\n\n\u03bb_n\u0304 \u2192 0 and \u03bb_n\u0304^{-1} = o(n\u0304^{2/(2+\u03b3)}).\n\nThe first condition means that \u03bb_n\u0304 tends to zero, and the second condition means that its shrinking speed should not be too fast.\n\nUnder several technical assumptions detailed in the supplementary material, we have the following asymptotic convergence results for the two PE divergence estimators PE\u0302\u03b1 (4) and PE\u0303\u03b1 (5):\n\nPE\u0302\u03b1 \u2212 PE\u03b1 = Op(n\u0304^{-1/2} c ||r\u03b1||\u221e + \u03bb_n\u0304 max(1, R(r\u03b1)^2)), (7)\nPE\u0303\u03b1 \u2212 PE\u03b1 = Op(\u03bb_n\u0304^{1/2} ||r\u03b1||\u221e^{1/2} max{1, R(r\u03b1)} + \u03bb_n\u0304 max{1, ||r\u03b1||\u221e^{(1\u2212\u03b3/2)/2}, R(r\u03b1)||r\u03b1||\u221e^{(1\u2212\u03b3/2)/2}, R(r\u03b1)}), (8)\n\nwhere Op denotes the asymptotic order in probability,\n\nc := (1 + \u03b1) \u221aVp(x)[r\u03b1(x)] + (1 \u2212 \u03b1) \u221aVp0(x)[r\u03b1(x)],\n\nand Vp(x) denotes the variance over p(x):\n\nVp(x)[f(x)] = \u222b (f(x) \u2212 \u222b f(x)p(x)dx)^2 p(x)dx.\n\nIn both Eq.(7) and Eq.(8), the coefficients of the leading terms (i.e., the first terms) of the asymptotic convergence rates become smaller as ||r\u03b1||\u221e gets smaller. Since\n\n||r\u03b1||\u221e = ||(\u03b1 + (1 \u2212 \u03b1)/r(x))^{-1}||\u221e < 1/\u03b1 for \u03b1 > 0,\n\nlarger \u03b1 would be more preferable in terms of the asymptotic approximation error. Note that when \u03b1 = 0, ||r\u03b1||\u221e can tend to infinity even under a simple setting, e.g., when the ratio of two Gaussian functions is considered [16]. 
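This boundedness is easy to check numerically. The sketch below evaluates the \u03b1-relative ratio of two Gaussian densities, a case where the plain ratio r(x) is unbounded (the grid and the means are illustrative choices, not taken from the paper's experiments):

```python
import numpy as np

# Two Gaussian densities whose plain ratio r(x) = p(x)/p0(x) diverges as x grows
x = np.linspace(-10, 10, 2001)
p = np.exp(-(x - 1.0) ** 2 / 2) / np.sqrt(2 * np.pi)
p0 = np.exp(-x ** 2 / 2) / np.sqrt(2 * np.pi)

r = p / p0  # unbounded: equals exp(x - 1/2), so it blows up on the right tail
for alpha in (0.1, 0.5, 0.95):
    r_alpha = p / (alpha * p + (1 - alpha) * p0)
    # the relative ratio is capped by 1/alpha everywhere on the grid
    print(alpha, bool(r_alpha.max() <= 1 / alpha))
```

The cap follows directly from \u03b1p(x) being a lower bound of the mixture denominator q\u03b1(x).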
Thus, our proposed approach of estimating the \u03b1-relative PE divergence (with \u03b1 > 0) would be more advantageous than the naive approach of estimating the plain PE divergence (which corresponds to \u03b1 = 0) in terms of the non-parametric convergence rate.\n\nThe above results also show that PE\u0302\u03b1 and PE\u0303\u03b1 have different asymptotic convergence rates: the leading term in Eq.(7) is of order n\u0304^{-1/2}, while the leading term in Eq.(8) is of order \u03bb_n\u0304^{1/2}, which is slightly slower (depending on the complexity \u03b3) than n\u0304^{-1/2}. Thus, PE\u0302\u03b1 would be more accurate than PE\u0303\u03b1 in large sample cases. Furthermore, when p(x) = p0(x), Vp(x)[r\u03b1(x)] = 0 holds and thus c = 0 holds. Then the leading term in Eq.(7) vanishes and therefore PE\u0302\u03b1 has the even faster convergence rate of order \u03bb_n\u0304, which is slightly slower (depending on the complexity \u03b3) than n\u0304^{-1}. Similarly, if \u03b1 is close to 1, r\u03b1(x) \u2248 1 and thus c \u2248 0 holds.\n\nWhen n\u0304 is not large enough to be able to neglect the terms of o(n\u0304^{-1/2}), the terms of O(\u03bb_n\u0304) matter. If ||r\u03b1||\u221e and R(r\u03b1) are large (this can happen, e.g., when \u03b1 is close to 0), the coefficient of the O(\u03bb_n\u0304)-term in Eq.(7) can be larger than that in Eq.(8). Then PE\u0303\u03b1 would be more favorable than PE\u0302\u03b1 in terms of the approximation accuracy.\n\nSee the supplementary material for numerical examples illustrating the above theoretical results.\n\nParametric Variance Analysis: Next, we analyze the asymptotic variance of the PE divergence estimator PE\u0302\u03b1 (4) under a parametric setup.\n\nAs the function space G in Eq.(6), we consider the following parametric model: G = {g(x; \u03b8) | \u03b8 \u2208 \u0398 \u2282 R^b} for a finite b. 
Here we assume that this parametric model is correctly specified, i.e., it includes the true relative density-ratio function r\u03b1(x): there exists \u03b8* such that g(x; \u03b8*) = r\u03b1(x). Here, we use RuLSIF without regularization, i.e., \u03bb = 0 in Eq.(6).\n\nLet us denote the variance of PE\u0302\u03b1 (4) by V[PE\u0302\u03b1], where randomness comes from the draw of samples {xi}_{i=1}^{n} and {x0j}_{j=1}^{n0}. Then, under a standard regularity condition for the asymptotic normality [19], V[PE\u0302\u03b1] can be expressed and upper-bounded as\n\nV[PE\u0302\u03b1] = Vp(x)[r\u03b1(x) \u2212 \u03b1r\u03b1(x)^2/2]/n + Vp0(x)[(1 \u2212 \u03b1)r\u03b1(x)^2/2]/n0 + o(n^{-1}, n0^{-1}) (9)\n\u2264 ||r\u03b1||\u221e^2/n + \u03b1^2||r\u03b1||\u221e^4/(4n) + (1 \u2212 \u03b1)^2||r\u03b1||\u221e^4/(4n0) + o(n^{-1}, n0^{-1}). (10)\n\nLet us denote the variance of PE\u0303\u03b1 by V[PE\u0303\u03b1]. Then, under a standard regularity condition for the asymptotic normality [19], the variance of PE\u0303\u03b1 is asymptotically expressed as\n\nV[PE\u0303\u03b1] = Vp(x)[(r\u03b1(x) + (1 \u2212 \u03b1r\u03b1(x))Ep(x)[\u2207g]\u22a4H\u03b1^{-1}\u2207g)/2]/n + Vp0(x)[((1 \u2212 \u03b1)r\u03b1(x)Ep(x)[\u2207g]\u22a4H\u03b1^{-1}\u2207g)/2]/n0 + o(n^{-1}, n0^{-1}), (11)\n\nwhere \u2207g is the gradient vector of g with respect to \u03b8 at \u03b8 = \u03b8* and\n\nH\u03b1 = \u03b1Ep(x)[\u2207g\u2207g\u22a4] + (1 \u2212 \u03b1)Ep0(x)[\u2207g\u2207g\u22a4].\n\nEq.(9) shows that, up to O(n^{-1}, n0^{-1}), the variance of PE\u0302\u03b1 depends only on the true relative density-ratio r\u03b1(x), not on the estimator of r\u03b1(x). This means that the model complexity does not affect the asymptotic variance. 
Therefore, overfitting would hardly occur in the estimation of the relative PE divergence even when complex models are used. We note that the above superior property is applicable only to relative PE divergence estimation, not to relative density-ratio estimation. This implies that overfitting occurs in relative density-ratio estimation, but the approximation error cancels out in relative PE divergence estimation.\n\nOn the other hand, Eq.(11) shows that the variance of PE\u0303\u03b1 is affected by the model G, since the factor Ep(x)[\u2207g]\u22a4H\u03b1^{-1}\u2207g depends on the model in general. When the equality Ep(x)[\u2207g]\u22a4H\u03b1^{-1}\u2207g(x; \u03b8*) = r\u03b1(x) holds, the variances of PE\u0303\u03b1 and PE\u0302\u03b1 are asymptotically the same. However, in general, the use of PE\u0302\u03b1 would be more recommended.\n\nEq.(10) shows that the variance V[PE\u0302\u03b1] can be upper-bounded by a quantity depending on ||r\u03b1||\u221e, which is monotonically lowered if ||r\u03b1||\u221e is reduced. 
Since ||r\u03b1||\u221e monotonically decreases as \u03b1 increases, our proposed approach of estimating the \u03b1-relative PE divergence (with \u03b1 > 0) would be more advantageous than the naive approach of estimating the plain PE divergence (which corresponds to \u03b1 = 0) in terms of the parametric asymptotic variance.\n\nSee the supplementary material for numerical examples illustrating the above theoretical results.\n\n4 Experiments\n\nIn this section, we experimentally evaluate the performance of the proposed method in two-sample homogeneity test, outlier detection, and transfer learning tasks.\n\nTwo-Sample Homogeneity Test: First, we apply the proposed divergence estimator to two-sample homogeneity test.\n\nGiven two sets of samples X = {xi}_{i=1}^{n} drawn i.i.d. from P and X0 = {x0j}_{j=1}^{n0} drawn i.i.d. from P0, the goal of the two-sample homogeneity test is to test the null hypothesis that the probability distributions P and P0 are the same against its complementary alternative (i.e., the distributions are different). By using an estimator Div\u0302 of some divergence between the two distributions P and P0, homogeneity of two distributions can be tested based on the permutation test procedure [20].\n\nTable 1: Experimental results of two-sample test. The mean (and standard deviation in the bracket) rate of accepting the null hypothesis (i.e., P = P0) for the IDA benchmark repository under the significance level 5% is reported. Left: when the two sets of samples are both taken from the positive training set (i.e., the null hypothesis is correct). Methods having the mean acceptance rate 0.95 according to the one-sample t-test at the significance level 5% are specified by bold face. 
Right: when\nthe set of samples corresponding to the numerator of the density-ratio are taken from the positive\ntraining set and the set of samples corresponding to the denominator of the density-ratio are taken\nfrom the positive training set and the negative training set (i.e., the null hypothesis is not correct).\nThe best method having the lowest mean accepting rate and comparable methods according to the\ntwo-sample t-test at the signi\ufb01cance level 5% are speci\ufb01ed by bold face.\n\nDatasets\nbanana\nthyroid\ntitanic\ndiabetes\nb-cancer\nf-solar\nheart\ngerman\nringnorm\nwaveform\n\nd n = n0\n100\n2\n19\n5\n21\n5\n85\n8\n29\n9\n100\n9\n13\n38\n100\n20\n100\n20\n21\n66\n\nMMD\n\n.96 (.20)\n.96 (.20)\n.94 (.24)\n.96 (.20)\n.98 (.14)\n.93 (.26)\n1.00 (.00)\n.99 (.10)\n.97 (.17)\n.98 (.14)\n\nP = P 0\n\nLSTT\n\n(\u03b1 = 0.0)\n.93 (.26)\n.95 (.22)\n.86 (.35)\n.87 (.34)\n.91 (.29)\n.91 (.29)\n.85 (.36)\n.91 (.29)\n.93 (.26)\n.92 (.27)\n\nLSTT\n\n(\u03b1 = 0.5)\n.92 (.27)\n.95 (.22)\n.92 (.27)\n.91 (.29)\n.94 (.24)\n.95 (.22)\n.91 (.29)\n.92 (.27)\n.91 (.29)\n.93 (.26)\n\nLSTT\n\n(\u03b1 = 0.95)\n.92\n(.27)\n(.33)\n.88\n(.31)\n.89\n(.39)\n.82\n(.27)\n.92\n(.26)\n.93\n.93\n(.26)\n(.31)\n.89\n(.36)\n.85\n.88\n(.33)\n\nMMD\n.52 (.50)\n.52 (.50)\n.87 (.34)\n.31 (.46)\n.87 (.34)\n.51 (.50)\n.53 (.50)\n.56 (.50)\n.00 (.00)\n.00 (.00)\n\nP 6= P 0\n\nLSTT\n\n(\u03b1 = 0.0)\n.10 (.30)\n.81 (.39)\n.86 (.35)\n.42 (.50)\n.75 (.44)\n.81 (.39)\n.28 (.45)\n.55 (.50)\n.00 (.00)\n.00 (.00)\n\nLSTT\n\n(\u03b1 = 0.5)\n.02 (.14)\n.65 (.48)\n.87 (.34)\n.47 (.50)\n.80 (.40)\n.55 (.50)\n.40 (.49)\n.44 (.50)\n.00 (.00)\n.02 (.14)\n\nLSTT\n\n(\u03b1 = 0.95)\n.17\n(.38)\n(.40)\n.80\n(.33)\n.88\n(.50)\n.57\n(.41)\n.79\n(.48)\n.66\n.62\n(.49)\n(.47)\n.68\n(.14)\n.02\n.00\n(.00)\n\nWhen an asymmetric divergence such as the KL divergence [7] or the PE divergence [11] is adopted\nfor two-sample test, the test results depend on the choice of directions: a divergence from P to\nP 0 or from 
P0 or from P0 to P. [4] proposed to choose the direction that gives a smaller p-value: it was experimentally shown that, when the uLSIF-based PE divergence estimator is used for the two-sample test (which is called the least-squares two-sample test; LSTT), the heuristic of choosing the direction with a smaller p-value contributes to reducing the type-II error (the probability of accepting incorrect null hypotheses, i.e., two distributions are judged to be the same when they are actually different), while the increase of the type-I error (the probability of rejecting correct null hypotheses, i.e., two distributions are judged to be different when they are actually the same) is kept moderate.\n\nWe apply the proposed method to the binary classification datasets taken from the IDA benchmark repository [21]. We test LSTT with the RuLSIF-based PE divergence estimator for \u03b1 = 0, 0.5, and 0.95; we also test the maximum mean discrepancy (MMD) [22], which is a kernel-based two-sample test method. The performance of MMD depends on the choice of the Gaussian kernel width. Here, we adopt a version proposed by [23], which automatically optimizes the Gaussian kernel width. The p-values of MMD are computed in the same way as LSTT based on the permutation test procedure.\n\nFirst, we investigate the rate of accepting the null hypothesis when the null hypothesis is correct (i.e., the two distributions are the same). We split all the positive training samples into two sets and perform two-sample test for the two sets of samples. The experimental results are summarized in the left half of Table 1, showing that LSTT with \u03b1 = 0.5 compares favorably with those with \u03b1 = 0 and 0.95 and with MMD in terms of the type-I error.\n\nNext, we consider the situation where the null hypothesis is not correct (i.e., the two distributions are different). 
The numerator samples are generated in the same way as above, but half of the denominator samples are replaced with negative training samples. Thus, while the numerator sample set contains only positive training samples, the denominator sample set includes both positive and negative training samples. The experimental results are summarized in the right half of Table 1, showing that LSTT with \u03b1 = 0.5 again compares favorably with those with \u03b1 = 0 and 0.95. Furthermore, LSTT with \u03b1 = 0.5 tends to outperform MMD in terms of the type-II error.\n\nOverall, LSTT with \u03b1 = 0.5 is shown to be a useful method for two-sample homogeneity test. See the supplementary material for more experimental evaluation.\n\nInlier-Based Outlier Detection: Next, we apply the proposed method to outlier detection.\n\nLet us consider an outlier detection problem of finding irregular samples in a dataset (called an \u201cevaluation dataset\u201d) based on another dataset (called a \u201cmodel dataset\u201d) that only contains regular samples. Defining the density-ratio over the two sets of samples, we can see that the density-ratio values for regular samples are close to one, while those for outliers tend to be significantly deviated from one.\n\nTable 2: Experimental results of outlier detection. Mean AUC score (and standard deviation in the bracket) over 100 trials is reported. The best method having the highest mean AUC score and comparable methods according to the two-sample t-test at the significance level 5% are specified by bold face. The datasets are sorted in the ascending order of the input dimensionality d.\n\nDatasets (d) | OSVM(\u03bd=0.05), OSVM(\u03bd=0.1), RuLSIF(\u03b1=0), RuLSIF(\u03b1=0.5), RuLSIF(\u03b1=0.95)\nIDA:banana (2) | .668 (.105), .676 (.120), .597 (.097), .619 (.101), .623 (.115)\nIDA:thyroid (5) | .760 (.148), .782 (.165), .804 (.148), .796 (.178), .722 (.153)\nIDA:titanic (5) | .757 (.205), .752 (.191), .750 (.182), .701 (.184), .712 (.185)\nIDA:diabetes (8) | .636 (.099), .610 (.090), .594 (.105), .575 (.105), .663 (.112)\nIDA:breast-cancer (9) | .741 (.160), .691 (.147), .707 (.148), .737 (.159), .733 (.160)\nIDA:flare-solar (9) | .594 (.087), .590 (.083), .626 (.102), .612 (.100), .584 (.114)\nIDA:heart (13) | .714 (.140), .694 (.148), .748 (.149), .769 (.134), .726 (.127)\nIDA:german (20) | .612 (.069), .604 (.084), .605 (.092), .597 (.101), .605 (.095)\nIDA:ringnorm (20) | .991 (.012), .993 (.007), .944 (.091), .971 (.062), .992 (.010)\nIDA:waveform (21) | .812 (.107), .843 (.123), .879 (.122), .875 (.117), .885 (.102)\nSpeech (50) | .788 (.068), .830 (.060), .804 (.101), .821 (.076), .836 (.083)\n20News (\u2018rec\u2019) (100) | .598 (.063), .593 (.061), .628 (.105), .614 (.093), .767 (.100)\n20News (\u2018sci\u2019) (100) | .592 (.069), .589 (.071), .620 (.094), .609 (.087), .704 (.093)\n20News (\u2018talk\u2019) (100) | .661 (.084), .658 (.084), .672 (.117), .670 (.102), .823 (.078)\nUSPS (1 vs. 2) (256) | .889 (.052), .926 (.037), .848 (.081), .878 (.088), .898 (.051)\nUSPS (2 vs. 3) (256) | .823 (.053), .835 (.050), .803 (.093), .818 (.085), .879 (.074)\nUSPS (3 vs. 4) (256) | .901 (.044), .939 (.031), .950 (.056), .961 (.041), .984 (.016)\nUSPS (4 vs. 5) (256) | .871 (.041), .890 (.036), .857 (.099), .874 (.082), .941 (.031)\nUSPS (5 vs. 6) (256) | .825 (.058), .859 (.052), .863 (.078), .867 (.068), .901 (.049)\nUSPS (6 vs. 7) (256) | .910 (.034), .950 (.025), .972 (.038), .984 (.018), .994 (.010)\nUSPS (7 vs. 8) (256) | .938 (.030), .967 (.021), .941 (.053), .951 (.039), .980 (.015)\nUSPS (8 vs. 9) (256) | .721 (.072), .728 (.073), .721 (.084), .728 (.083), .761 (.096)\nUSPS (9 vs. 0) (256) | .920 (.037), .966 (.023), .982 (.048), .989 (.022), .994 (.011)
Thus, density-ratio values could be used as an index of the degree of outlyingness [1, 2].\n\nSince the evaluation dataset usually has a wider support than the model dataset, we regard the evaluation dataset as samples corresponding to the denominator density p0(x), and the model dataset as samples corresponding to the numerator density p(x). Then, outliers tend to have smaller density-ratio values (i.e., close to zero). Thus, density-ratio approximators can be used for outlier detection.\n\nWe evaluate the proposed method using various datasets: the IDA benchmark repository [21], an in-house French speech dataset, the 20 Newsgroup dataset, and the USPS hand-written digit dataset (the detailed specification of the datasets is explained in the supplementary material).\n\nWe compare the area under the ROC curve (AUC) [24] of RuLSIF with \u03b1 = 0, 0.5, and 0.95, and one-class support vector machine (OSVM) with the Gaussian kernel [25]. We used the LIBSVM implementation of OSVM [26]. The Gaussian width is set to the median distance between samples, which has been shown to be a useful heuristic [25]. Since there is no systematic method to determine the tuning parameter \u03bd in OSVM, we report the results for \u03bd = 0.05 and 0.1.\n\nThe mean and standard deviation of the AUC scores over 100 runs with random sample choice are summarized in Table 2, showing that RuLSIF overall compares favorably with OSVM. Among the RuLSIF methods, small \u03b1 tends to perform well for low-dimensional datasets, and large \u03b1 tends to work well for high-dimensional datasets.\n\nTransfer Learning: Finally, we apply the proposed method to transfer learning.\n\nLet us consider a transductive transfer learning setup where labeled training samples {(xtr_j, ytr_j)}_{j=1}^{ntr} drawn i.i.d. from p(y|x)ptr(x) and unlabeled test samples {xte_i}_{i=1}^{nte} drawn i.i.d. from pte(x) (which is generally different from ptr(x)) are available. 
The use of exponentially-weighted importance weighting was shown to be useful for adaptation from $p_{\mathrm{tr}}(x)$ to $p_{\mathrm{te}}(x)$ [5]:

$$\min_{f \in \mathcal{F}} \frac{1}{n_{\mathrm{tr}}} \sum_{j=1}^{n_{\mathrm{tr}}} \left( \frac{p_{\mathrm{te}}(x_j^{\mathrm{tr}})}{p_{\mathrm{tr}}(x_j^{\mathrm{tr}})} \right)^{\tau} \mathrm{loss}\bigl(y_j^{\mathrm{tr}}, f(x_j^{\mathrm{tr}})\bigr),$$

where f(x) is a learned function and 0 ≤ τ ≤ 1 is the exponential flattening parameter. τ = 0 corresponds to plain empirical-error minimization, which is statistically efficient, while τ = 1 corresponds to importance-weighted empirical-error minimization, which is statistically consistent; 0 < τ < 1 gives an intermediate estimator that balances the trade-off between statistical efficiency and consistency. τ can be determined by importance-weighted cross-validation [6] in a data-dependent fashion.

Table 3: Experimental results of transfer learning in human activity recognition. Mean classification accuracy (with the standard deviation in parentheses) over 100 runs for human activity recognition of a new user is reported. We compare plain kernel logistic regression (KLR) without importance weights, KLR with relative importance weights (RIW-KLR), KLR with exponentially-weighted importance weights (EIW-KLR), and KLR with plain importance weights (IW-KLR). The method having the highest mean classification accuracy and methods comparable to it according to the two-sample t-test at the 5% significance level are specified by bold face.

Task                KLR              RIW-KLR          EIW-KLR          IW-KLR
                    (α = 0, τ = 0)   (α = 0.5)        (τ = 0.5)        (α = 1, τ = 1)
Walks vs. run       0.803 (0.082)    0.889 (0.035)    0.882 (0.039)    0.882 (0.035)
Walks vs. bicycle   0.880 (0.025)    0.892 (0.035)    0.867 (0.054)    0.854 (0.070)
Walks vs. train     0.985 (0.017)    0.992 (0.008)    0.989 (0.011)    0.983 (0.021)

However, a potential drawback is that estimation of r(x) (i.e., τ = 1) is rather hard, as shown in this paper. Here we propose to use relative importance weights instead:

$$\min_{f \in \mathcal{F}} \frac{1}{n_{\mathrm{tr}}} \sum_{j=1}^{n_{\mathrm{tr}}} \frac{p_{\mathrm{te}}(x_j^{\mathrm{tr}})}{(1-\alpha)\,p_{\mathrm{te}}(x_j^{\mathrm{tr}}) + \alpha\, p_{\mathrm{tr}}(x_j^{\mathrm{tr}})} \,\mathrm{loss}\bigl(y_j^{\mathrm{tr}}, f(x_j^{\mathrm{tr}})\bigr).$$

We apply the above transfer learning technique to human activity recognition using accelerometer data. Subjects were asked to perform a specific task such as walking, running, and bicycle riding; the data were collected with an iPod touch. The duration of each task was arbitrary, and the sampling rate was 20 Hz with small variations (the detailed experimental setup is explained in the supplementary material). Let us consider a situation where a new user wants to use the activity recognition system. However, since the new user is not willing to go through the trouble of labeling his/her accelerometer data, no labeled samples are available for the new user. On the other hand, unlabeled samples for the new user and labeled data obtained from existing users are available. Let the labeled training data $\{(x_j^{\mathrm{tr}}, y_j^{\mathrm{tr}})\}_{j=1}^{n_{\mathrm{tr}}}$ be the set of labeled accelerometer data for 20 existing users. Each user has at most 100 labeled samples for each action.
Let the unlabeled test data $\{x_i^{\mathrm{te}}\}_{i=1}^{n_{\mathrm{te}}}$ be unlabeled accelerometer data obtained from the new user.
The experiments are repeated 100 times with different sample choices for $n_{\mathrm{tr}} = 500$ and $n_{\mathrm{te}} = 200$. The classification accuracy for 800 test samples from the new user (which are different from the 200 unlabeled samples) is summarized in Table 3, showing that the proposed method using relative importance weights with α = 0.5 works better than the other methods.

5 Conclusion

In this paper, we proposed to use a relative divergence for robust distribution comparison. We gave a computationally efficient method for estimating the relative Pearson divergence based on direct relative density-ratio approximation. We theoretically elucidated the convergence rate of the proposed divergence estimator under a non-parametric setup, which showed that the proposed approach of estimating the relative Pearson divergence is preferable to the existing approach of estimating the plain Pearson divergence. Furthermore, we proved that the asymptotic variance of the proposed divergence estimator is independent of the model complexity under a correctly-specified parametric setup. Thus, the proposed divergence estimator hardly overfits even with complex models. Experimentally, we demonstrated the practical usefulness of the proposed divergence estimator in two-sample homogeneity test, inlier-based outlier detection, and transfer learning tasks.
In addition to two-sample homogeneity test, inlier-based outlier detection, and transfer learning, density-ratios can be useful for tackling various machine learning problems, for example, multi-task learning, independence test, feature selection, causal inference, independent component analysis, dimensionality reduction, unpaired data matching, clustering, conditional density estimation, and probabilistic classification.
Thus, it would be promising to explore more applications of the proposed relative density-ratio approximator beyond two-sample homogeneity test, inlier-based outlier detection, and transfer learning.

Acknowledgments
MY was supported by the JST PRESTO program. TS was partially supported by MEXT KAKENHI 22700289 and the Aihara Project, the FIRST program from JSPS, initiated by CSTP. TK was partially supported by a Grant-in-Aid for Young Scientists (20700251). HH was supported by the FIRST program, and MS was partially supported by SCAT, AOARD, and the FIRST program.

References
[1] A. J. Smola, L. Song, and C. H. Teo. Relative novelty detection. In Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics (AISTATS2009), pages 536–543, 2009.
[2] S. Hido, Y. Tsuboi, H. Kashima, M. Sugiyama, and T. Kanamori. Statistical outlier detection using direct density ratio estimation. Knowledge and Information Systems, 26(2):309–336, 2011.
[3] A. Gretton, K. M. Borgwardt, M. Rasch, B. Schölkopf, and A. J. Smola. A kernel method for the two-sample-problem. In B. Schölkopf, J. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems 19, pages 513–520. MIT Press, Cambridge, MA, 2007.
[4] M. Sugiyama, T. Suzuki, Y. Itoh, T. Kanamori, and M. Kimura. Least-squares two-sample test. Neural Networks, 24(7):735–751, 2011.
[5] H. Shimodaira. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference, 90(2):227–244, 2000.
[6] M. Sugiyama, M. Krauledat, and K.-R. Müller. Covariate shift adaptation by importance weighted cross validation. Journal of Machine Learning Research, 8:985–1005, May 2007.
[7] S. Kullback and R. A. Leibler. On information and sufficiency. Annals of Mathematical Statistics, 22:79–86, 1951.
[8] V. N. Vapnik. Statistical Learning Theory. Wiley, New York, NY, 1998.
[9] M. Sugiyama, T. Suzuki, S. Nakajima, H. Kashima, P. von Bünau, and M. Kawanabe. Direct importance estimation for covariate shift adaptation. Annals of the Institute of Statistical Mathematics, 60:699–746, 2008.
[10] X. Nguyen, M. J. Wainwright, and M. I. Jordan. Estimating divergence functionals and the likelihood ratio by convex risk minimization. IEEE Transactions on Information Theory, 56(11):5847–5861, 2010.
[11] K. Pearson. On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. Philosophical Magazine, 50:157–175, 1900.
[12] S. M. Ali and S. D. Silvey. A general class of coefficients of divergence of one distribution from another. Journal of the Royal Statistical Society, Series B, 28:131–142, 1966.
[13] I. Csiszár. Information-type measures of difference of probability distributions and indirect observation. Studia Scientiarum Mathematicarum Hungarica, 2:229–318, 1967.
[14] T. Kanamori, S. Hido, and M. Sugiyama. A least-squares approach to direct importance estimation. Journal of Machine Learning Research, 10:1391–1445, 2009.
[15] T. Suzuki and M. Sugiyama. Sufficient dimension reduction via squared-loss mutual information estimation. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (AISTATS2010), pages 804–811, 2010.
[16] C. Cortes, Y. Mansour, and M. Mohri. Learning bounds for importance weighting. In J. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel, and A. Culotta, editors, Advances in Neural Information Processing Systems 23, pages 442–450. 2010.
[17] R. T. Rockafellar. Convex Analysis. Princeton University Press, Princeton, NJ, USA, 1970.
[18] N. Aronszajn. Theory of reproducing kernels. Transactions of the American Mathematical Society, 68:337–404, 1950.
[19] A. W. van der Vaart. Asymptotic Statistics. Cambridge University Press, 2000.
[20] B. Efron and R. J. Tibshirani. An Introduction to the Bootstrap. Chapman & Hall, New York, NY, 1993.
[21] G. Rätsch, T. Onoda, and K.-R. Müller. Soft margins for AdaBoost. Machine Learning, 42(3):287–320, 2001.
[22] K. M. Borgwardt, A. Gretton, M. J. Rasch, H.-P. Kriegel, B. Schölkopf, and A. J. Smola. Integrating structured biological data by kernel maximum mean discrepancy. Bioinformatics, 22(14):e49–e57, 2006.
[23] B. Sriperumbudur, K. Fukumizu, A. Gretton, G. Lanckriet, and B. Schölkopf. Kernel choice and classifiability for RKHS embeddings of probability distributions. In Y. Bengio, D. Schuurmans, J. Lafferty, C. K. I. Williams, and A. Culotta, editors, Advances in Neural Information Processing Systems 22, pages 1750–1758. 2009.
[24] A. P. Bradley. The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognition, 30:1145–1159, 1997.
[25] B. Schölkopf, J. C. Platt, J. Shawe-Taylor, A. J. Smola, and R. C. Williamson. Estimating the support of a high-dimensional distribution. Neural Computation, 13(7):1443–1471, 2001.
[26] C.-C. Chang and C.-J. Lin. LIBSVM: A Library for Support Vector Machines, 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.