{"title": "Trimmed Density Ratio Estimation", "book": "Advances in Neural Information Processing Systems", "page_first": 4518, "page_last": 4528, "abstract": "Density ratio estimation is a vital tool in both machine learning and statistical community. However, due to the unbounded nature of density ratio, the estimation proceudre can be vulnerable to corrupted data points, which often pushes the estimated ratio toward infinity. In this paper, we present a robust estimator which automatically identifies and trims outliers. The proposed estimator has a convex formulation, and the global optimum can be obtained via subgradient descent. We analyze the parameter estimation error of this estimator under high-dimensional settings. Experiments are conducted to verify the effectiveness of the estimator.", "full_text": "Trimmed Density Ratio Estimation\n\nSong Liu\u2217\n\nUniversity of Bristol\n\nsong.liu@bristol.ac.uk\n\nTaiji Suzuki\n\nUniversity of Tokyo,\n\nSakigake (PRESTO), JST,\n\nAIP, RIKEN,\n\nThe Institute of Statistical Mathematics,\n\nAkiko Takeda\n\nAIP, RIKEN,\n\natakeda@ism.ac.jp\n\nKenji Fukumizu\n\nThe Institute of Statistical Mathematics,\n\nfukumizu@ism.ac.jp\n\ntaiji@mist.i.u-tokyo.ac.jp\n\nAbstract\n\nDensity ratio estimation is a vital tool in both machine learning and statistical\ncommunity. However, due to the unbounded nature of density ratio, the estimation\nprocedure can be vulnerable to corrupted data points, which often pushes the\nestimated ratio toward in\ufb01nity. In this paper, we present a robust estimator which\nautomatically identi\ufb01es and trims outliers. The proposed estimator has a convex\nformulation, and the global optimum can be obtained via subgradient descent. We\nanalyze the parameter estimation error of this estimator under high-dimensional\nsettings. 
Experiments are conducted to verify the effectiveness of the estimator.

1 Introduction

Density ratio estimation (DRE) [18, 11, 27] is an important tool in various branches of machine learning and statistics. Due to its ability to directly model the difference between two probability density functions, DRE finds applications in change detection [13, 6], two-sample testing [32] and outlier detection [1, 26]. In recent years, a sampling framework called the Generative Adversarial Network (GAN) (see e.g., [9, 19]) has used the density ratio function to compare artificial samples from a generative distribution with real samples from an unknown distribution. DRE has also been widely discussed in the statistics literature for adjusting non-parametric density estimation [5], stabilizing the estimation of heavy-tailed distributions [7] and fitting multiple distributions at once [8].

However, as a density ratio function can grow unbounded, DRE can suffer from robustness and stability issues: a few corrupted points may completely mislead the estimator (see Figure 2 in Section 6 for an example). Considering a density ratio p(x)/q(x), a point x that is extremely far away from the high-density region of q may have an almost infinite ratio value, and DRE results can be dominated by such points. This makes DRE performance very sensitive to rare pathological data or small modifications of the dataset. Here we give two examples:

Cyber-attack In change detection applications, a density ratio p(x)/q(x) is used to determine how the data-generating model differs between p and q. A "hacker" who can spy on our data may inject a few data points into p which are extremely far away from the high-density region of q. This would result in excessively large p(x)/q(x), tricking us into believing there is a significant change from q(x) to p(x), even if there is no change at all.
*This work was done when Song Liu was at The Institute of Statistical Mathematics, Japan.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

If the generated outliers are also far away from the high-density region of p(x), we end up with a very different density ratio function and the original parametric pattern in the ratio is ruined. We give such an example in Section 6.

Volatile Samples A change of external environment may provoke responses in unpredictable ways. It is possible that a small portion of samples react more "aggressively" to the change than the others. These samples may be skewed and show very high density ratios, even if the change of distribution is relatively mild once these volatile samples are excluded. For example, when testing a new fertilizer, a small number of plants may fail to adapt, even if the vast majority of crops are healthy.

Overly large density ratio values can cause further trouble when the ratio is used to weight samples. For example, in the domain adaptation setting, we may reweight samples from one task and reuse them in another task. The density ratio is a natural choice for such an "importance weighting" scheme [28, 25]. However, if one or a few samples have extremely high ratios, then after renormalization the other samples will have almost zero weight and little impact on the learning task.

Several methods have been proposed to solve this problem. Relative density ratio estimation [33] estimates a "biased" version of the density ratio controlled by a mixture parameter α. The relative density ratio is always upper-bounded by 1/α, which can give a more robust estimator. However, it is not clear how to de-bias such an estimator to recover the true density ratio function. [26] took a more direct approach.
It estimates a thresholded density ratio by setting a tolerance t on the density ratio value: all likelihood ratio values bigger than t are clipped to t. The estimator was derived from Fenchel duality for f-divergences [18]. However, the optimization for this estimator is not convex if one uses log-linear models. The formulation also relies on a non-parametric approximation of the density ratio function (or the log-ratio function), making the learned model hard to interpret. Moreover, there is no intuitive way to directly control the proportion of ratios that are thresholded. Nonetheless, the concept studied in our paper is inspired by this pioneering work.

In this paper, we propose a novel method based on the "trimmed Maximum Likelihood Estimator" [17, 10]. This idea relies on a specific type of density ratio estimator (called log-linear KLIEP) [30] which can be written in a maximum likelihood formulation. We simply "ignore" samples that make the empirical likelihood take exceedingly large values. The trimmed density ratio estimator can be formulated as a convex optimization problem and translated into a weighted M-estimator. This helps us develop a simple subgradient-based algorithm that is guaranteed to reach the global optimum.

Moreover, we shall prove that in addition to recovering the correct density ratio under the outlier setting, the estimator can also obtain a "corrected" density ratio function under a truncation setting: it ignores "pathological" samples and recovers the density ratio using only "healthy" samples.

Although trimming will usually result in a more robust estimate of the density ratio function, we also point out that it should not be abused. For example, in two-sample testing, a diverging density ratio might indicate interesting structural differences between the two distributions.

In Section 2, we explain some preliminaries on the trimmed maximum likelihood estimator.
In Section 3, we introduce a trimmed DRE. We solve it using a convex formulation whose optimization procedure is explained in Section 4. In Section 5, we prove an estimation error upper-bound with respect to a sparsity-inducing regularizer. Finally, experimental results are shown in Section 6 and we conclude our work in Section 7.

2 Preliminary: Trimmed Maximum Likelihood Estimation

Although our main purpose is to estimate the density ratio, we first introduce the basic concept of a trimmed estimator using density functions as examples. Given n samples X := {x^(i)}_{i=1}^n drawn i.i.d. from a distribution P, x ∈ R^d, we want to estimate the density function p(x). Suppose the true density function is a member of the exponential family [20],

p(x; θ) = exp[⟨θ, f(x)⟩ − log Z(θ)],  Z(θ) = ∫ q(x) exp⟨θ, f(x)⟩ dx,  (1)

where f(x) is the sufficient statistic, Z(θ) is the normalization function and q(x) is the base measure.

The Maximum Likelihood Estimator (MLE) maximizes the empirical likelihood over the entire dataset. In contrast, a trimmed MLE maximizes the likelihood only over a subset of samples selected according to their likelihood values (see e.g., [10, 31]). This paradigm can be used to derive a popular outlier detection method, the one-class Support Vector Machine (one-SVM) [24]. The derivation is crucial to the development of our trimmed density ratio estimator in later sections.

Without loss of generality, we can set the log-likelihood function as log p(x^(i); θ) − τ0, where τ0 is a constant.
As samples corresponding to high likelihood values are likely to be inliers, we can trim all samples whose likelihood is bigger than τ0 using a clipping function [·]−, i.e.,

θ̂ = argmax_θ Σ_{i=1}^n [log p(x^(i); θ) − τ0]−,

where [ℓ]− returns ℓ if ℓ ≤ 0 and 0 otherwise. This optimization has a convex formulation:

min_{θ, ε≥0} ⟨ε, 1⟩,  s.t. ∀i, log p(x^(i); θ) ≥ τ0 − ε_i,  (2)

where ε is the slack variable measuring the difference between log p(x^(i); θ) and τ0. However, formulation (2) is not practical, since computing the normalization term Z(θ) in (1) is intractable for a general f, and it is unclear how to set the trimming level τ0. Therefore we ignore the normalization term and introduce other control terms:

min_{θ, ε≥0, τ≥0} (1/2)‖θ‖² − ντ + (1/n)⟨ε, 1⟩,  s.t. ∀i, ⟨θ, f(x^(i))⟩ ≥ τ − ε_i.  (3)

The ℓ2 regularization term is introduced to keep θ from reaching unbounded values. A new hyper-parameter ν ∈ (0, 1] replaces τ0 to control the number of trimmed samples. It can be proven using KKT conditions that at most a 1 − ν fraction of samples is discarded (see e.g., [24], Proposition 1 for details). Now we have reached the standard formulation of one-SVM.

This trimmed estimator ignores the large likelihood values and focuses only on the low-density region. Such a trimming strategy allows us to discover "novel" points or outliers, which are usually far away from the high-density area.

3 Trimmed Density Ratio Estimation

In this paper, our main focus is to derive a robust density ratio estimator following a similar trimming strategy.
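The one-SVM-style trimmed objective (3) from the previous section can be prototyped with projected subgradient descent. This is a toy sketch under our own assumptions (the function name, step sizes and synthetic data are ours, not the paper's); standard one-SVM implementations solve the dual QP instead:

```python
import numpy as np

def one_svm_subgradient(F, nu=0.1, lr=0.01, iters=2000):
    """Projected subgradient descent on the one-SVM-style objective (3):
    (1/2)||theta||^2 - nu * tau + (1/n) * sum_i max(0, tau - <theta, f_i>).
    F is the (n, d) matrix of sufficient statistics f(x_i)."""
    n, d = F.shape
    theta = np.zeros(d)
    tau = 0.0
    for _ in range(iters):
        margin = F @ theta              # <theta, f(x_i)> for every sample
        viol = margin < tau             # samples with positive slack
        # subgradients of the objective in theta and in tau
        g_theta = theta - F[viol].sum(axis=0) / n
        g_tau = -nu + viol.mean()       # viol.mean() = (# violators) / n
        theta -= lr * g_theta
        tau = max(tau - lr * g_tau, 0.0)   # keep tau >= 0
    return theta, tau
```

At a stationary point the tau-subgradient vanishes, so roughly a ν fraction of samples falls below the margin τ, matching the claim that at most 1 − ν of the samples are kept un-trimmed by the clipping.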
First, we briefly review a density ratio estimator [27] from the perspective of Kullback-Leibler divergence minimization.

3.1 Density Ratio Estimation (DRE)

For two sets of data Xp := {x_p^(1), ..., x_p^(np)} i.i.d. ∼ P and Xq := {x_q^(1), ..., x_q^(nq)} i.i.d. ∼ Q, assume both densities p(x) and q(x) are in the exponential family (1). We know p(x; θp)/q(x; θq) ∝ exp[⟨θp − θq, f(x)⟩]. Observing that the data x interact with the parameter θp − θq only through f, we can keep using f(x) as the sufficient statistic of the density ratio model, and merge the two parameters θp and θq into a single parameter δ = θp − θq. Now we can model our density ratio as

r(x; δ) := exp[⟨δ, f(x)⟩ − log N(δ)],  N(δ) := ∫ q(x) exp⟨δ, f(x)⟩ dx,  (4)

where N(δ) is the normalization term that guarantees ∫ q(x)r(x; δ) dx = 1, so that q(x)r(x; δ) is a valid density function normalized over its domain.

Interestingly, despite the re-parameterization (changing from θ to δ), (4) is exactly the same as (1), where q(x) appeared as the base measure. The difference is that here q(x) is a density function from which Xq is drawn, so that N(δ) can be approximated accurately from samples of Q.
Let us define

r̂(x; δ) := exp[⟨δ, f(x)⟩ − log N̂(δ)],  N̂(δ) := (1/nq) Σ_{j=1}^{nq} exp⟨δ, f(x_q^(j))⟩.  (5)

Note this model can be computed for any f, even if the integral in N(δ) does not have a closed form.

In order to estimate δ, we minimize the Kullback-Leibler divergence between p and q·r_δ:

min_δ KL[p | q·r_δ] = min_δ ∫ p(x) log[p(x)/(q(x)r(x; δ))] dx = c − max_δ ∫ p(x) log r(x; δ) dx ≈ c − max_δ (1/np) Σ_{i=1}^{np} log r̂(x_p^(i); δ),  (6)

where c is a constant irrelevant to δ. It can be seen that the minimization of the KL divergence boils down to maximizing the log-likelihood ratio over the dataset Xp. Now we have reached the log-linear Kullback-Leibler Importance Estimation Procedure (log-linear KLIEP) estimator [30, 14].

3.2 Trimmed Maximum Likelihood Ratio

As stated in Section 1, to rule out the influence of large density ratios, we trim samples with large likelihood ratio values from (6). Similarly to the one-SVM in (2), we can consider a trimmed MLE δ̂ = argmax_δ Σ_{i=1}^{np} [log r̂(x_p^(i); δ) − t0]−, where t0 is a threshold above which the likelihood ratios are ignored. It has a convex formulation:

min_{δ, ε≥0} ⟨ε, 1⟩,  s.t. ∀x_p^(i) ∈ Xp, log r̂(x_p^(i); δ) ≥ t0 − ε_i.  (7)

(7) is similar to (2), since we have only replaced p(x; θ) with r̂(x; δ).
However, the ratio model r̂(x; δ) in (7) comes with a tractable normalization term N̂, while the normalization term Z in p(x; θ) is in general intractable.

Similarly to (3), we can directly control the trimming quantile via a hyper-parameter ν:

min_{δ, ε≥0, t≥0} (1/np)⟨ε, 1⟩ − ν·t + λR(δ),  s.t. ∀x_p^(i) ∈ Xp, log r̂(x_p^(i); δ) ≥ t − ε_i,  (8)

where R(δ) is a convex regularizer. (8) is also convex, but it has np non-linear constraints, and the search for the globally optimal solution can be time-consuming. To avoid this problem, one could derive and solve the dual problem of (8). However, in some applications we rely on the primal parameter structure (such as sparsity) for model interpretation and feature engineering. In Section 4, we translate (8) into an equivalent form whose solution is obtained via a subgradient ascent method that is guaranteed to converge to the global optimum.

One common way to construct a convex robust estimator is to use a Huber loss [12]. Although the proposed trimming technique arises from a different setting, it shares the same guiding principle as the Huber loss: avoid assigning dominating values to outlier likelihoods in the objective function.

In Section 8.1 in the supplementary material, we show the relationship between trimmed DRE and binary Support Vector Machines [23, 4].

4 Optimization

The key to solving (8) efficiently is reformulating it into an equivalent max-min problem.

Proposition 1.
Assuming ν is chosen such that t̂ > 0 for all optimal solutions of (8), then δ̂ is an optimal solution of (8) if and only if it is also an optimal solution of the following max-min problem:

max_δ min_{w ∈ [0, 1/np]^{np}, ⟨1, w⟩ = ν} L(δ, w) − λR(δ),  L(δ, w) := Σ_{i=1}^{np} w_i · log r̂(x_p^(i); δ).  (9)

The proof is in Section 8.2 in the supplementary material. We define (δ̂, ŵ) as a saddle point of (9):

∇δL(δ̂, ŵ) − λ∇δR(δ̂) = 0,  ŵ ∈ argmin_{w ∈ [0, 1/np]^{np}, ⟨w, 1⟩ = ν} L(δ̂, w),  (10)

where the second ∇δ means the subgradient if R is sub-differentiable.

Algorithm 1: Gradient Ascent and Trimming
Input: Xp, Xq, ν and step sizes {η_it}_{it=1}^{itmax}. Initialize δ_0, w_0, iteration counter it = 0, maximum number of iterations itmax, and the best objective-parameter pair (O_best = −∞, δ_best, w_best).
while not converged and it ≤ itmax do
  Obtain a sorted set {x_p^(i)}_{i=1}^{np} so that log r̂(x_p^(1); δ_it) ≤ log r̂(x_p^(2); δ_it) ≤ ... ≤ log r̂(x_p^(np); δ_it).
  Set w_{it+1,i} = 1/np for all i ≤ νnp, and w_{it+1,i} = 0 otherwise.
  Gradient ascent with respect to δ: δ_{it+1} = δ_it + η_it · ∇δ[L(δ_it, w_{it+1}) − λR(δ_it)].
  O_best = max(O_best, L(δ_{it+1}, w_{it+1})), and update (δ_best, w_best) accordingly.
  it = it + 1.
end while
Output: (δ_best, w_best)

Now the "trimming" process of our estimator can be clearly seen from (9): the max procedure estimates a density ratio given the currently assigned weights w, and the min procedure trims the large log-likelihood ratio values by assigning the corresponding w_i to 0 (or to values smaller than 1/np). For simplicity, we only consider the cases where ν is a multiple of 1/np. Intuitively, 1 − ν is the proportion of likelihood ratios that are trimmed, thus ν should not be greater than 1. Note that if we set ν = 1, (9) is equivalent to the standard density ratio estimator (6). Downweighting outliers while estimating the model parameter δ is commonly used by robust estimators (see e.g., [3, 29]).

The search for (δ̂, ŵ) is straightforward. It is easy to solve with respect to w or δ while the other is fixed: given a parameter δ, the optimization with respect to w is a linear program, and one of its extreme optimal solutions is attained by assigning weight 1/np to the elements that correspond to the νnp smallest log-likelihood ratios log r̂(x^(i); δ). This observation leads to a simple "gradient ascent and trimming" algorithm (see Algorithm 1).
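A minimal NumPy sketch of this gradient-ascent-and-trimming loop, assuming a linear sufficient statistic f(x) = x precomputed into feature matrices Fp and Fq (the function name and hyper-parameter defaults are ours, not the paper's):

```python
import numpy as np

def trimmed_dre(Fp, Fq, nu=0.9, lam=0.01, lr=0.05, iters=500):
    """Sketch of Algorithm 1 (gradient ascent and trimming).
    Fp: (n_p, d) sufficient statistics f(x) of samples from P.
    Fq: (n_q, d) sufficient statistics of samples from Q.
    nu: fraction of P-samples kept; the 1 - nu largest log-ratios are trimmed.
    """
    n_p, d = Fp.shape
    keep = int(round(nu * n_p))            # number of untrimmed samples
    delta = np.zeros(d)
    for _ in range(iters):
        # log r_hat(x; delta) on Xp, using the empirical normalizer N_hat
        s = Fq @ delta
        log_norm = np.log(np.mean(np.exp(s - s.max()))) + s.max()
        log_ratio = Fp @ delta - log_norm
        # min over w: weight 1/n_p on the keep smallest log-ratios, 0 elsewhere
        w = np.zeros(n_p)
        w[np.argsort(log_ratio)[:keep]] = 1.0 / n_p
        # gradient of L(delta, w): kept-sample statistics minus
        # nu times the softmax-weighted Q-sample statistics
        soft = np.exp(s - s.max())
        soft /= soft.sum()
        grad = Fp.T @ w - nu * (Fq.T @ soft)
        # subgradient step, including the l1 penalty lam * ||delta||_1
        delta += lr * (grad - lam * np.sign(delta))
    return delta
```

With f(x) = x and two Gaussians with shifted means, the population log-ratio is linear with slope equal to the mean shift, so the estimate should roughly track that shift; outliers injected into Xp fall into the trimmed 1 − ν fraction and stop influencing the gradient.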
In Algorithm 1,

∇δL(δ, w) = Σ_{i=1}^{np} w_i · f(x_p^(i)) − ν · Σ_{j=1}^{nq} [e^(j) / Σ_{k=1}^{nq} e^(k)] f(x_q^(j)),  e^(i) := exp(⟨δ, f(x_q^(i))⟩).

In fact, Algorithm 1 is a subgradient method [2, 16], since the optimal value function of the inner problem of (9) is not differentiable at some δ where the inner problem has multiple optimal solutions. The subdifferential of the optimal value of the inner problem with respect to δ can be a set, but Algorithm 1 only computes the subgradient obtained from the extreme point solution w_{it+1} of the inner linear program. Under mild conditions, this subgradient ascent approach converges to optimal results with a diminishing step size rule as it → ∞. See [2] for details.

Algorithm 1 is a simple gradient ascent procedure and can be implemented in deep learning software such as TensorFlow², which benefits from GPU acceleration. In contrast, the original problem (8), due to its heavily constrained nature, cannot easily be programmed in such a framework.

5 Estimation Consistency in High-dimensional Settings

In this section, we show how the estimated parameter δ̂ in (10) converges to the "optimal parameter" δ* as both sample size and dimensionality go to infinity, under the "outlier" and the "truncation" setting respectively.

In the outlier setting (Figure 1a), we assume Xp is contaminated by outliers and all "inlier" samples in Xp are i.i.d.. The outliers are injected into our dataset Xp after looking at our inliers. For example, hackers can spy on our data and inject fake samples so that our estimator exaggerates the degree of change.

In the truncation setting, there are no outliers; Xp and Xq are i.i.d. samples from P and Q respectively.
However, we have a subset of "volatile" samples in Xp (the rightmost mode of the histogram in Figure 1b) that are pathological and exhibit large density ratio values.

²https://www.tensorflow.org/

Figure 1: Two settings of theoretical analysis. (a) Outlier setting: blue and red points are i.i.d. (b) Truncation setting: there are no outliers.

In the theoretical results of this section, we focus on analyzing the performance of our estimator for high-dimensional data, assuming the number of non-zero elements of the optimal δ* is k, and we use the ℓ1 regularizer R(θ) = ‖θ‖1, which induces sparsity on δ̂. The proofs rely on a recent development [35, 34] where a "weighted" high-dimensional estimator was studied. We also assume the optimization of δ in (9) is conducted within an ℓ1 ball of width ρ, i.e., Ball(ρ), and ρ is wisely chosen so that the optimal parameter δ* ∈ Ball(ρ). The same technique was used in previous works [15, 35].

Notations: We denote by w* ∈ R^{np} the "optimal" weights, which depend on δ* and our data. To lighten notation, we shorten the log density ratio model as z_δ(x) := log r(x; δ) and ẑ_δ(x) := log r̂(x; δ). The proofs of Theorems 1, 2 and 3 can be found in Sections 8.4, 8.5 and 8.6 in the supplementary material.

5.1 A Base Theorem

We now provide a base theorem giving an upper bound on ‖δ̂ − δ*‖. We state this theorem with respect to an arbitrary pair (δ*, w*); the pair is set properly later in Sections 5.2 and 5.3. We make a few regularity conditions on samples from Q: samples of Xq should be well behaved in terms of their log-likelihood ratio values.

Assumption 1.
∃ 0 < c1 < 1 and 1 < c2 < ∞ such that ∀xq ∈ Xq and ∀u ∈ Ball(ρ), c1 ≤ exp⟨δ* + u, f(xq)⟩ ≤ c2; collectively, c2/c1 = Cr.

We also assume the Restricted Strong Convexity (RSC) condition on the covariance of Xq, i.e., cov(Xq) = (1/nq)(Xq − 1x̄q⊤)⊤(Xq − 1x̄q⊤), where x̄q is the sample mean of the rows of Xq. Note this property has been verified for various design matrices Xq, such as Gaussian or sub-Gaussian (see, e.g., [21, 22]).

Assumption 2. The RSC condition of cov(Xq) holds for all u, i.e., there exist κ′1 > 0 and c > 0 such that u⊤cov(Xq)u ≥ κ′1‖u‖² − (c/√nq)‖u‖1², with high probability.

Theorem 1. In addition to Assumptions 1 and 2, suppose there is coherence between the parameters w and δ at a saddle point (δ̂, ŵ):

⟨∇δL(δ̂, ŵ) − ∇δL(δ̂, w*), û⟩ ≥ −κ2‖û‖² − τ2(n, d)‖û‖1,  (11)

where û := δ̂ − δ*, κ2 > 0 is a constant and τ2(n, d) > 0. It can be shown that if

λn ≥ 2 max[ ‖∇δL(δ*, w*)‖∞ + 2Cr²c√(ρν)/√nq , τ2(n, d) ]

and νκ′1 > 2Cr²κ2, where c > 0 is a constant determined by the RSC condition, then we are guaranteed that

‖δ̂ − δ*‖ ≤ [Cr² / (νκ′1 − 2Cr²κ2)] · (3√k λn / 2)

with probability converging to one.

Condition (11) states that if we swap ŵ for w*, the change of the gradient ∇δL is limited. Intuitively, it shows that our estimator (9) is not "picky" about w: even if we cannot obtain the optimal weight assignment w*, we can still use "the next best thing", ŵ, to compute a gradient that is close enough. We later show how (11) is satisfied. Note that if ‖∇δL(δ*, w*)‖∞ and τ2(n, d) converge to zero as np, nq, d → ∞, then by taking λn as above, Theorem 1 guarantees the consistency of δ̂. In Sections 5.2 and 5.3, we explore two different settings of (δ*, w*) that make ‖δ̂ − δ*‖ converge to zero.

5.2 Consistency under Outlier Setting

Setting: Suppose the dataset Xp is the union of two disjoint sets G (good points) and B (bad points) such that G i.i.d. ∼ p(x) and min_{j∈B} z_{δ*}(x_p^(j)) > max_{i∈G} z_{δ*}(x_p^(i)) (see Figure 1a). The dataset Xq i.i.d. ∼ q(x) does not contain any outliers. We set ν = |G|/np. The optimal parameter δ* is set such that p(x) = q(x)r(x; δ*). We set w*_i = 1/np for all x_p^(i) ∈ G, and 0 otherwise.

Remark: Knowing the inlier proportion |G|/np is a strong assumption; however, it is only imposed for the theoretical analysis.
As we show in Section 6, our method works well even if ν is only a rough guess (like 90%). Loosening this assumption will be important future work.

Assumption 3. ∀u ∈ Ball(ρ), sup_x |ẑ_{δ*+u}(x) − ẑ_{δ*}(x)| ≤ C_lip‖u‖1.

This assumption says that the log density ratio model is Lipschitz continuous around its optimal parameter δ*, so there is a limit to how much a log-ratio model can deviate from the optimal model under a small perturbation u. As our estimated weights ŵ_i depend on the relative ranking of ẑ_{δ̂}(x_p^(i)), this assumption implies that the relative ranking of two points remains unchanged under a small perturbation u if they are far apart. The following theorem shows that if we have enough clearance between "good" and "bad" samples, δ̂ converges to the optimal parameter δ*.

Theorem 2. In addition to Assumptions 1, 2 and a few mild technical conditions (see Section 8.5 in the supplementary material), suppose Assumption 3 holds, min_{j∈B} z_{δ*}(x_p^(j)) − max_{i∈G} z_{δ*}(x_p^(i)) ≥ 3C_lip ρ, ν = |G|/np and nq = Ω(|G|²). If λn ≥ 2·max[√(K1 log d/|G|), 2Cr²c√(ρν)/√nq], where K1 > 0 and c > 0 are constants, then we are guaranteed that ‖δ̂ − δ*‖ ≤ (Cr²/(νκ′1))·3√k λn with probability converging to 1.

It can be seen that ‖δ̂ − δ*‖ = O(√(log d / min(|G|, nq))) if d is reasonably large.

5.3 Consistency under Truncation Setting

In this setting, we do not assume there are outliers in the observed data.
Instead, we examine the ability of our estimator to recover the density ratio up to a certain quantile of our data. This ability is especially useful when the behavior of the tail quantile is volatile and makes the standard estimator (6) output unpredictable results.

Notations: Given ν ∈ (0, 1], we call t_ν(δ) the ν-th quantile of z_δ if P[z_δ < t_ν(δ)] ≤ ν and P[z_δ ≤ t_ν(δ)] ≥ ν. In this setting, ν is fixed by the user, so we drop the subscript ν from all subsequent discussions. Let us define a truncated domain X(δ) = {x ∈ R^d | z_δ(x) < t(δ)}, with X_p(δ) = Xp ∩ X(δ) and X_q(δ) = Xq ∩ X(δ). See Figure 1b for a visualization of t(δ) and X(δ) (the dark shaded region).

Setting: Suppose the dataset Xp i.i.d. ∼ P and Xq i.i.d. ∼ Q. The truncated densities p_δ and q_δ are the unbounded densities p and q restricted to the truncated domain X(δ). Note that the truncated densities depend on the parameter δ and on ν. We show that under some assumptions, the parameter δ̂ obtained from (9) with a fixed hyper-parameter ν converges to the δ* such that q_{δ*}(x)r(x; δ*) = p_{δ*}(x). We also define the "optimal" weight assignment w* such that w*_i = 1/np for all i with x_p^(i) ∈ X(δ*), and 0 otherwise.
Interestingly, the constraint in (9), ⟨w*, 1⟩ = ν, may not hold, but our analysis in this section suggests we can always find a pair (δ̂, ŵ) in the feasible region so that ‖δ̂ − δ*‖ converges to 0 under mild conditions.

We first assume the log density ratio model and its CDF are Lipschitz continuous.

Assumption 4. ∀u ∈ Ball(ρ), sup_x |ẑ_{δ*+u}(x) − ẑ_{δ*}(x)| ≤ C_lip‖u‖.  (12)

Define T(u, ε) := {x ∈ R^d | |z_{δ*}(x) − t(δ*)| ≤ 2C_lip‖u‖ + ε}, where 0 < ε ≤ 1. We assume that ∀u ∈ Ball(ρ) and 0 < ε ≤ 1,

P[x_p ∈ T(u, ε)] ≤ C_CDF·‖u‖ + ε.

In this assumption, we define a "zone" T(u, ε) near the ν-th quantile t(δ*) and assume the CDF of our ratio model is upper-bounded over this region. Differently from Assumption 3, the RHS of (12) is with respect to the ℓ2 norm of u. In the following assumption, we assume regularity of P and Q.

Assumption 5. ∀xq ∈ R^d, ‖f(xq)‖∞ ≤ Cq, and ∀u ∈ Ball(ρ), ∀xp ∈ T(u, 1), ‖f(xp)‖∞ ≤ Cp.

Theorem 3. In addition to Assumptions 1 and 2 and other mild assumptions (see Section 8.6 in the supplementary material), suppose Assumptions 4 and 5 hold.
If 1 ≥ ν ≥ (8C_CDF√k CpCr²/κ′1)·λn, nq = Ω(|X_p(δ*)|²), and

λn ≥ 2 max[ √(K′1 log d/|X_p(δ*)|) + 2Cr²Cq|Xq\X_q(δ*)|/nq , 2L·Cp/√np , 2Cr²c√(ρν)/√nq ],

where K′1 > 0 and c > 0 are constants, then we are guaranteed that ‖δ̂ − δ*‖ ≤ (4Cr²/(νκ′1))·3√k λn with high probability.

It can be seen that ‖δ̂ − δ*‖ = O(√(log d / min(|X_p(δ*)|, nq))) if d is reasonably large and |Xq\X_q(δ*)|/nq decays fast.

6 Experiments

6.1 Detecting Sparse Structural Changes between Two Markov Networks (MNs) [14]

In the first experiment³, we learn changes between two Gaussian MNs under the outlier setting. The ratio between two Gaussian MNs can be parametrized as p(x)/q(x) ∝ exp(−Σ_{i,j≤d} Δ_{i,j} x_i x_j), where Δ_{i,j} := Θ^p_{i,j} − Θ^q_{i,j} is the difference between the two precision matrices. We generate 500 samples as Xp and Xq using two randomly structured Gaussian MNs. One point [10, ..., 10] is added as an outlier to Xp. To induce sparsity, we set R(Δ) = Σ_{i,j=1, i≤j}^d |Δ_{i,j}| and fix λ = 0.0938. We then run DRE and TRimmed-DRE to learn the sparse differential precision matrix Δ; the results are plotted in Figures 2a and 2b⁴, where the ground truth (the positions i, j with Δ*_{i,j} ≠ 0) is marked by red boxes. It can be seen that the outlier completely misleads DRE, while TR-DRE performs reasonably well. We also run experiments with two different settings (d = 25, d = 36) and plot True Negative Rate (TNR) - True Positive Rate (TPR) curves.
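The quadratic sufficient statistics behind this parametrization can be generated with a small helper (a sketch; the function name is ours, and we assume the samples are the rows of a matrix X):

```python
import numpy as np

def pairwise_features(X):
    """Map each sample x to the statistics x_i * x_j for i <= j,
    matching the Gaussian MN ratio model p/q ∝ exp(-<Delta, f(x)>)."""
    n, d = X.shape
    iu = np.triu_indices(d)                  # index pairs with i <= j
    prods = X[:, :, None] * X[:, None, :]    # (n, d, d) outer products
    return prods[:, iu[0], iu[1]]            # (n, d*(d+1)/2)
```

Feeding these features into the trimmed estimator with an ℓ1 penalty on the coefficients then yields a sparse estimate of the differential precision matrix Δ.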
We fix ν in TR-DRE to 90% and compare the performance of DRE and TR-DRE, using DRE without any outliers as the gold standard (see Figure 2c). It can be seen that the added outlier makes DRE fail completely, while TR-DRE can almost reach the gold standard. It also shows the price we pay: TR-DRE does lose some power by discarding samples. However, the loss of performance is still acceptable.

6.2 Relative Novelty Detection from Images

In the second experiment, we collect four images (see Figure 3a) containing three objects on a textured background: a pencil, an earphone and an earphone case. We create data points from these four images using sliding windows of 48 × 48 pixels (the green box on the lower right picture of Figure 3a). We extract 899 features from each window using the MATLAB HOG method and construct an 899-dimensional sample. Although our theorems in Section 5 are proved for linear models, here f(x) is an RBF kernel using all samples in Xp as kernel bases. We pick the top left image as Xp and use all three other images as Xq, then run TR-DRE, THresholded-DRE [26], and one-SVM.

In this task, we select high density ratio super pixels in image Xp. It can be expected that the super pixels containing the pencil will exhibit high density ratio values, as it did not appear in the reference dataset Xq, while super pixels containing the earphone case, the earphones and the background, which repeat similar patches in Xq, will have lower density ratio values. This is different from

³ Code can be found at http://allmodelsarewrong.org/software.html
⁴ Figures are best viewed in color.

Figure 2: Using DRE to learn changes between two MNs. (a) $\hat{\Delta}$ obtained by DRE, d = 20, with one outlier.
We set $R(\cdot) = \|\cdot\|_1$ and $f(x_i, x_j) = x_i x_j$. (b) $\hat{\Delta}$ obtained by TR-DRE, ν = 90%, with one outlier. (c) TNR-TPR plot, ν = 90%.

Figure 3: Relative object detection using super pixels. We set $R(\cdot) = \|\cdot\|_2$; $f(x)$ is an RBF kernel. (a) Dataset. (b) ν = 97%. (c) ν = 90%. (d) ν = 85%. (e) TH-DRE. (f) one-SVM.

a conventional novelty detection, as the density ratio function helps us capture only the relative novelty. For TR-DRE, we use the trimming threshold $\hat{t}$ as the threshold for selecting high density ratio points. It can be seen in Figures 3b, 3c and 3d that, as we tune ν to allow more and more high density ratio windows to be selected, more relative novelties are detected: first the pencil, then the case, and finally the earphones, as the lack of appearance in the reference dataset Xq elevates the density ratio value by different degrees. In comparison, we run TH-DRE with the top 3% highest density ratio values thresholded, which corresponds to ν = 97% in our method. The pattern of the thresholded windows (shaded in red) in Figure 3e is similar to Figure 3b, though some parts of the case are mistakenly shaded. Finally, one-SVM with 3% support vectors (see Figure 3f) does not utilize the knowledge of a reference dataset Xq and labels all salient objects in Xp, as they correspond to the "outliers" in Xp.

7 Conclusion

We present a robust density ratio estimator based on the idea of trimmed MLE. It has a convex formulation, and the optimization can be easily conducted using a subgradient ascent method. We also investigate its theoretical properties through an equivalent weighted M-estimator, whose $\ell_2$ estimation error bound is proved under two high-dimensional robust settings.
Experiments confirm the effectiveness and robustness of our trimmed estimator.

Acknowledgments

We thank three anonymous reviewers for their detailed and helpful comments. Akiko Takeda thanks Grant-in-Aid for Scientific Research (C), 15K00031. Taiji Suzuki was partially supported by MEXT KAKENHI (25730013, 25120012, 26280009 and 15H05707), JST-PRESTO and JST-CREST. Song Liu and Kenji Fukumizu have been supported in part by MEXT Grant-in-Aid for Scientific Research on Innovative Areas (25120012).

References

[1] F. Azmandian, J. G. Dy, J. A. Aslam, and D. R. Kaeli. Local kernel density ratio-based feature selection for outlier detection. In Proceedings of 8th Asian Conference on Machine Learning (ACML2012), JMLR Workshop and Conference Proceedings, pages 49–64, 2012.

[2] S. Boyd. Subgradient methods. Technical report, Stanford University, 2014. Notes for EE364b, Stanford University, Spring 2013–14.

[3] W. S. Cleveland. Robust locally weighted regression and smoothing scatterplots. Journal of the American Statistical Association, 74(368):829–836, 1979.

[4] N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines and Other Kernel-based Learning Methods. Cambridge University Press, 2000.

[5] B. Efron and R. Tibshirani. Using specially designed exponential families for density estimation. The Annals of Statistics, 24(6):2431–2461, 1996.

[6] F. Fazayeli and A. Banerjee. Generalized direct change estimation in Ising model structure. In Proceedings of The 33rd International Conference on Machine Learning (ICML2016), pages 2281–2290, 2016.

[7] W. Fithian and S. Wager. Semiparametric exponential families for heavy-tailed data. Biometrika, 102(2):486–493, 2015.

[8] K. Fokianos. Merging information for semiparametric density estimation.
Journal of the Royal Statistical Society: Series B (Statistical Methodology), 66(4):941–958, 2004.

[9] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.

[10] A. S. Hadi and A. Luceno. Maximum trimmed likelihood estimators: a unified approach, examples, and algorithms. Computational Statistics & Data Analysis, 25(3):251–272, 1997.

[11] J. Huang, A. Gretton, K. M. Borgwardt, B. Schölkopf, and A. J. Smola. Correcting sample selection bias by unlabeled data. In Advances in Neural Information Processing Systems, pages 601–608, 2007.

[12] P. J. Huber. Robust estimation of a location parameter. The Annals of Mathematical Statistics, 35(1):73–101, 1964.

[13] Y. Kawahara and M. Sugiyama. Sequential change-point detection based on direct density-ratio estimation. Statistical Analysis and Data Mining, 5(2):114–127, 2012.

[14] S. Liu, T. Suzuki, R. Relator, J. Sese, M. Sugiyama, and K. Fukumizu. Support consistency of direct sparse-change learning in Markov networks. Annals of Statistics, 45(3):959–990, 2017.

[15] P.-L. Loh and M. J. Wainwright. Regularized M-estimators with nonconvexity: Statistical and algorithmic theory for local optima. Journal of Machine Learning Research, 16:559–616, 2015.

[16] A. Nedić and A. Ozdaglar. Subgradient methods for saddle-point problems. Journal of Optimization Theory and Applications, 142(1):205–228, 2009.

[17] N. Neykov and P. N. Neytchev. Robust alternative of the maximum likelihood estimators. COMPSTAT'90, Short Communications, pages 99–100, 1990.

[18] X. Nguyen, M. J. Wainwright, and M. I. Jordan.
Estimating divergence functionals and the likelihood ratio by convex risk minimization. IEEE Transactions on Information Theory, 56(11):5847–5861, 2010.

[19] S. Nowozin, B. Cseke, and R. Tomioka. f-GAN: Training generative neural samplers using variational divergence minimization. In Advances in Neural Information Processing Systems, pages 271–279, 2016.

[20] E. J. G. Pitman. Sufficient statistics and intrinsic accuracy. Mathematical Proceedings of the Cambridge Philosophical Society, 32(4):567–579, 1936.

[21] G. Raskutti, M. J. Wainwright, and B. Yu. Restricted eigenvalue properties for correlated Gaussian designs. Journal of Machine Learning Research, 11:2241–2259, 2010.

[22] M. Rudelson and S. Zhou. Reconstruction from anisotropic random measurements. IEEE Transactions on Information Theory, 59(6):3434–3447, 2013.

[23] B. Schölkopf and A. J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, 2001.

[24] B. Schölkopf, R. C. Williamson, A. J. Smola, J. Shawe-Taylor, and J. C. Platt. Support vector method for novelty detection. In Advances in Neural Information Processing Systems 12, pages 582–588. MIT Press, 2000.

[25] H. Shimodaira. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference, 90(2):227–244, 2000.

[26] A. Smola, L. Song, and C. H. Teo. Relative novelty detection. In Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics (AISTATS), volume 5, pages 536–543, 2009.

[27] M. Sugiyama, T. Suzuki, and T. Kanamori. Density Ratio Estimation in Machine Learning. Cambridge University Press, 2012.

[28] M. Sugiyama, T. Suzuki, S. Nakajima, H. Kashima, P. von Bünau, and M. Kawanabe. Direct importance estimation for covariate shift adaptation.
Annals of the Institute of Statistical Mathematics, 60(4):699–746, 2008.

[29] J. A. K. Suykens, J. De Brabanter, L. Lukas, and J. Vandewalle. Weighted least squares support vector machines: robustness and sparse approximation. Neurocomputing, 48(1):85–105, 2002.

[30] Y. Tsuboi, H. Kashima, S. Hido, S. Bickel, and M. Sugiyama. Direct density ratio estimation for large-scale covariate shift adaptation. Journal of Information Processing, 17:138–155, 2009.

[31] D. L. Vandev and N. M. Neykov. About regression estimators with high breakdown point. Statistics: A Journal of Theoretical and Applied Statistics, 32(2):111–129, 1998.

[32] M. Wornowizki and R. Fried. Two-sample homogeneity tests based on divergence measures. Computational Statistics, 31(1):291–313, 2016.

[33] M. Yamada, T. Suzuki, T. Kanamori, H. Hachiya, and M. Sugiyama. Relative density-ratio estimation for robust distribution comparison. Neural Computation, 25(5):1324–1370, 2013.

[34] E. Yang, A. Lozano, and A. Aravkin. High-dimensional trimmed estimators: A general framework for robust structured estimation. arXiv preprint arXiv:1605.08299, 2016.

[35] E. Yang and A. C. Lozano. Robust Gaussian graphical modeling with the trimmed graphical lasso. In Advances in Neural Information Processing Systems, pages 2602–2610, 2015.