{"title": "Uncoupled Regression from Pairwise Comparison Data", "book": "Advances in Neural Information Processing Systems", "page_first": 3992, "page_last": 4002, "abstract": "Uncoupled regression is the problem of learning a model from unlabeled data and a set of target values when the correspondence between them is unknown. Such a situation arises in predicting anonymized targets that involve sensitive information, e.g., one's annual income. Since existing methods for uncoupled regression often require strong assumptions on the true target function, and thus their range of applications is limited, in this paper we introduce a novel framework that does not require such assumptions. Our key idea is to utilize \emph{pairwise comparison data}, which consists of pairs of unlabeled data points for which we know which one has the larger target value. Such pairwise comparison data is easy to collect, as typically discussed in the learning-to-rank scenario, and does not break the anonymity of the data. We propose two practical methods for uncoupled regression from pairwise comparison data and show that the learned regression model converges to the optimal model at the optimal parametric rate when the target variable is uniformly distributed. Moreover, we empirically show that for linear models the proposed methods are comparable to ordinary supervised regression with labeled data.", "full_text": "Uncoupled Regression\n\nfrom Pairwise Comparison Data\n\nLiyuan Xu 1,2 *\n\nliyuan@ms.k.u-tokyo.ac.jp\n\nGang Niu 2\n\ngang.niu@riken.jp\n\nJunya Honda 1,2\n\nhonda@stat.t.u-tokyo.ac.jp\n\nMasashi Sugiyama 2,1\nsugi@k.u-tokyo.ac.jp\n\n1The University of Tokyo\n\n2RIKEN\n\nAbstract\n\nUncoupled regression is the problem of learning a model from unlabeled data and a set of target values when the correspondence between them is unknown. Such a situation arises in predicting anonymized targets that involve sensitive information, e.g., one's annual income. 
Since existing methods for uncoupled regression often\nrequire strong assumptions on the true target function, and thus, their range of\napplications is limited, we introduce a novel framework that does not require such\nassumptions in this paper. Our key idea is to utilize pairwise comparison data,\nwhich consists of pairs of unlabeled data that we know which one has a larger target\nvalue. Such pairwise comparison data is easy to collect, as typically discussed\nin the learning-to-rank scenario, and does not break the anonymity of data. We\npropose two practical methods for uncoupled regression from pairwise comparison\ndata and show that the learned regression model converges to the optimal model\nwith the optimal parametric convergence rate when the target variable distributes\nuniformly. Moreover, we empirically show that for linear models the proposed\nmethods are comparable to ordinary supervised regression with labeled data.\n\n1\n\nIntroduction\n\nIn supervised regression, we need a vast amount of labeled data in the training phase, which is costly\nand laborious to collect in many real-world applications. To deal with this problem, weakly-supervised\nregression has been proposed in various settings, such as semi-supervised learning (see Kostopoulos\net al. [17] for the survey), multiple instance regression [27, 34], and transductive regression [4, 5].\nSee [35] for a thorough review of the weakly-supervised learning in binary classi\ufb01cation, which can\nbe extended to regression with slight modi\ufb01cations.\nUncoupled regression [2] is one variant of weakly-supervised learning.\nIn ordinary \u201ccoupled\u201d\nregression, the pairs of features and targets are provided, and we aim to learn a model that minimizes\na certain prediction error on test data. On the other hand, in the uncoupled regression problem, we\nonly have access to unlabeled data and the set of target values, and thus, we do not know the true target\nfor each data point. 
Such a situation often arises when we aim to predict people's sensitive attributes, such as one's annual salary or total bank deposit, the data for which is often anonymized due to privacy concerns. Note that uncoupled regression may be impossible to solve without further assumptions, since no labeled data is provided.\nCarpentier and Schlueter [2] showed that uncoupled regression is solvable when the feature is one-dimensional and the true target function is monotonic in it. Although their algorithm is of limited\n\n*Now at Gatsby Computational Neuroscience Unit\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fpractical use due to its strong assumption, their work offers a valuable insight: a model is learnable from uncoupled data if we know the ranking within the dataset. In this paper, we show that, instead of imposing the monotonicity assumption, we can infer such ranking information from data to solve uncoupled regression. We use pairwise comparison data as a source of ranking information, which consists of pairs of unlabeled data points for which we know which one has the larger target value.\nNote that pairwise comparison data is easy to collect even for sensitive matters such as one's annual earnings. Although people often hesitate to give explicit answers, it might be easier for them to answer an indirect question such as "Which person earns more than you?"2, which yields exactly the pairwise comparison data we need. The difficulty here is that the comparison is based on the target value, which contains noise; hence, the comparison data is also affected by this noise. Since we do not place any assumptions on the true target function, our methods are applicable to many situations. A similar problem was considered in Bao et al. 
[1] as well.\nOne naive method for uncoupled regression with pairwise comparison data is to use a score-based\nranking method [29], which learns a score function with the minimum inversions in pairwise com-\nparison data. With such a score function, we can match unlabeled data and the set of target values,\nand then, conduct supervised learning. However, as discussed in Rigollet and Weed [28], we cannot\nconsistently recover the true target function even if we know the true order of missing target values in\nunlabeled data due to the noise in them.\nIn contrast, our methods directly minimize the regression risk. We \ufb01rst rewrite the regression risk so\nthat it can be estimated from unlabeled and pairwise comparison data, and learn a model through\nempirical risk minimization. Such an approach based on risk rewriting has been extensively studied\nin the classi\ufb01cation scenario [7, 6, 23, 30, 18] and exhibits promising performance. We propose two\nestimators of the risk de\ufb01ned based on the expected Bregman divergence [11], which is a natural\nchoice of the risk function. We show that if the marginal distribution of the target variable is uniform\nthen the estimators are unbiased and the learned model converges to the optimal model with the\noptimal rate. In general cases, however, we prove that it is impossible to have such an unbiased\nestimator in any marginal distributions and the learned model may not converge to the optimal one.\nStill, our empirical evaluations based on synthetic data and benchmark datasets show that our methods\nexhibit similar performance to a model learned from coupled data for ordinary supervised regression.\nThe paper is structured as follows. After discussing the related work in Section 2, we formulate the\nuncoupled regression problem with pairwise comparison data in detail in Section 3. In Sections 4\nand 5, we discuss two methods for uncoupled regression and derive estimation error bounds for each\nmethod. 
Finally, we show empirical results in Section 6 and conclude the paper in Section 7.\n\n2 Related work\n\nSeveral methods have been proposed to match two independently collected data sources. In the\ncontext of data integration [3], the matching is conducted based on some contextual data provided for\nboth data sources. For example, Walter and Fritsch [31] used spatial information as contextual data\nto integrate two data sources. Some work evaluated the quality of matching by some information\ncriterion and found the best matching by maximizing the metric. This problem is called cross-domain\nobject matching (CDOM), which was formulated in Jebara [15]. A number of methods have been\nproposed for CDOM, such as Quadrianto et al. [26], Yamada and Sugiyama [33], and Jitta and Klami\n[16].\nAnother line of related work in the uncoupled regression problem imposed an assumption on the true\ntarget function. For example, Carpentier and Schlueter [2] assumed that the true target function is\nmonotonic to a single feature, and it was re\ufb01ned theoretically by Rigollet and Weed [28]. Another\ncommon assumption is that the true target function is exactly expressed as a linear function of the\nfeatures, which was studied in Hsu et al. [14] and Pananjady et al. [24]. Although the model learned\nfrom these methods converges to the true target function with in\ufb01nite uncoupled data, they are of\nless practical use due to their strong assumptions. On the other hand, our methods do not require any\nassumptions on such mapping functions and are applicable to wider scenarios.\n\n2This questioning can be regarded as one type of randomized response (indirect questioning) techniques [32],\n\nwhich is a survey method to avoid social desirability bias.\n\n2\n\n\fIt is worth noting that some methods use uncoupled data to enhance the performance of semi-\nsupervised learning. 
For example, in label regularization [19], uncoupled data is used to regularize a regression model so that the distribution of predictions on unlabeled data is close to the marginal distribution of the target variable, which was reported to increase the accuracy.\nPairwise comparison data was originally considered in the ranking problem [29, 22], which aims to learn a score function that can rank data correctly. In fact, we can apply ranking methods, such as rankSVM [13], to our problem. However, a naive application of them performs poorly compared to the proposed methods, as we will show empirically, since our goal is not to order data correctly but to predict the true target values.\n\n3 Problem settings\n\nIn this section, we formulate the uncoupled regression problem and introduce pairwise comparison data.\n\n3.1 Uncoupled regression problem\nWe first formulate the standard regression problem briefly. Let $\mathcal{X} \subset \mathbb{R}^d$ be a $d$-dimensional feature space and $\mathcal{Y} \subset \mathbb{R}$ be a target space. We denote by $X, Y$ random variables on the spaces $\mathcal{X}, \mathcal{Y}$, respectively, and assume that they follow the joint distribution $P_{X,Y}$. The goal of the regression problem is to obtain a model $h : \mathcal{X} \to \mathcal{Y}$ in a hypothesis space $\mathcal{H}$ which minimizes the risk defined as\n\n$R(h) = \mathbb{E}_{X,Y}[l(h(X), Y)]$, (1)\n\nwhere $\mathbb{E}_{X,Y}$ denotes the expectation over $P_{X,Y}$ and $l : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}_+$ is a loss function.\nThe loss function $l(z, t)$ measures the closeness between a true target $t \in \mathcal{Y}$ and a model output $z \in \mathcal{Y}$, and generally grows as the prediction $z$ gets far from the target $t$. In this paper, we mainly consider $l(z, t)$ to be the Bregman divergence $d_\phi(t, z)$, which is defined as\n\n$d_\phi(t, z) = \phi(t) - \phi(z) - (t - z)\phi'(z)$\n\nfor some convex function $\phi : \mathbb{R} \to \mathbb{R}$, where $\phi'$ denotes the derivative of $\phi$. It is natural to use such a loss function since the minimizer of the risk $R$ is $\mathbb{E}_{Y|X=x}[Y]$ when the hypothesis space $\mathcal{H}$ is rich enough [11], where $\mathbb{E}_{Y|X=x}$ is the conditional expectation over the distribution of $Y$ given $X = x$. Many common loss functions can be interpreted as Bregman divergences; for instance, when $\phi(x) = x^2$, then $d_\phi(t, z)$ becomes the $l_2$-loss, and when $\phi(x) = x \log x + (1 - x)\log(1 - x)$, then $d_\phi(t, z)$ is the Kullback\u2013Leibler divergence between the Bernoulli distributions with parameters $t$ and $z$.\nIn the standard regression scenario, we are given labeled training data $D = \{(x_i, y_i)\}_{i=1}^{n}$ drawn independently and identically from $P_{X,Y}$. Then, based on the training data, we empirically estimate the risk $R(h)$ and obtain a model $\hat{h}$ by minimizing the empirical risk. In uncoupled regression, on the other hand, we are given unlabeled data $D_U = \{x_i\}_{i=1}^{n_U}$ and target values $D_Y = \{y_i\}_{i=1}^{n_Y}$ without correspondence, where $n_U$ is the size of the unlabeled data. Furthermore, we denote the marginal distribution of the feature $X$ by $P_X$ and its probability density function by $f_X$. Similarly, $P_Y$ stands for the marginal distribution of the target $Y$, and $f_Y$ is the density function of $P_Y$. We use $\mathbb{E}_{X,Y}$, $\mathbb{E}_X$ and $\mathbb{E}_Y$ to denote the expectations over $P_{X,Y}$, $P_X$, and $P_Y$, respectively.\nUnlike Carpentier and Schlueter [2], we do not try to match unlabeled data and target values. In fact, our methods do not use each target value in $D_Y$ but only the density function $f_Y$ of the target, which can be estimated from $D_Y$. For simplicity, we assume that the true density function $f_Y$ is known; the case where we need to estimate $f_Y$ from $D_Y$ is discussed in Appendix B.\n\n3.2 Pairwise comparison data\n\nHere, we introduce pairwise comparison data. It consists of two random variables $(X^+, X^-)$, where the target value of $X^+$ is larger than that of $X^-$. Formally, $(X^+, X^-)$ are defined as\n\n$X^+ = X$ if $Y \geq Y'$ and $X^+ = X'$ if $Y < Y'$; $\quad X^- = X'$ if $Y \geq Y'$ and $X^- = X$ if $Y < Y'$, (2)\n\n3\n\n\fwhere $(X, Y), (X', Y')$ are two independent pairs of random variables following $P_{X,Y}$. We denote the joint distribution of $(X^+, X^-)$ by $P_{X^+,X^-}$ and the marginal distributions by $P_{X^+}$, $P_{X^-}$. 
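The sampling scheme in (2) is straightforward to simulate. The following sketch draws pairs and labels the winner and loser; the 2-dimensional feature and noisy linear target are hypothetical choices made purely to keep the example runnable:

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical toy target model, only to make the sketch runnable
theta = np.array([1.0, -0.5])

def sample_xy():
    x = rng.normal(size=2)
    return x, theta @ x + 0.1 * rng.normal()   # noisy linear target

def sample_pairwise(n_pairs):
    """Draw pairwise comparison data as in (2): sample two independent
    (X, Y) pairs and label the feature with the larger target as X+."""
    xp, xm = [], []
    for _ in range(n_pairs):
        (x, y), (x2, y2) = sample_xy(), sample_xy()
        if y >= y2:
            xp.append(x); xm.append(x2)
        else:
            xp.append(x2); xm.append(x)
    return np.array(xp), np.array(xm)

x_plus, x_minus = sample_pairwise(1000)
```

Note that the comparison is made on the noisy targets, so `x_plus` is only more likely, not guaranteed, to have the larger noise-free target value.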
The density functions $f_{X^+,X^-}, f_{X^+}, f_{X^-}$ and expectations $\mathbb{E}_{X^+,X^-}, \mathbb{E}_{X^+}, \mathbb{E}_{X^-}$ are defined in the same way. We assume that we have access to $n_R$ i.i.d. sample pairs of $(X^+, X^-)$ as $D_R = \{(x_i^+, x_i^-)\}_{i=1}^{n_R}$, in addition to the unlabeled data $D_U$ and the density function $f_Y$ of the target variable $Y$. In the following sections, we show that uncoupled regression can be solved from this information alone. In fact, our methods only require samples of either one of $X^+, X^-$, which corresponds to the case where only the winner or only the loser of each comparison is observable.\nOne naive approach to uncoupled regression with $D_R$ would be to adopt a ranking method, which learns a ranker $r : \mathcal{X} \to \mathbb{R}$ that minimizes the following expected ranking loss:\n\n$R_R(r) = \mathbb{E}_{X^+,X^-}[\mathbb{1}[r(X^+) - r(X^-) < 0]]$, (3)\n\nwhere $\mathbb{1}$ is the indicator function. By minimizing the empirical estimate of (3) based on $D_R$, we can learn a ranker $\hat{r}$ that can sort data points by the target $Y$. Then, we can predict quantiles of test data by ranking them against $D_U$, which leads to a prediction by applying the inverse of the cumulative distribution function (CDF) of $Y$. Formally, if the test point $x_{\mathrm{test}}$ is ranked top $n'$-th in $D_U$, we can predict the target value for $x_{\mathrm{test}}$ as\n\n$\hat{h}(x_{\mathrm{test}}) = F_Y^{-1}\left(\frac{n_U - n'}{n_U}\right)$, (4)\n\nwhere $F_Y(t) = P(Y \leq t)$ is the CDF of $Y$.\nThis approach, however, is known to be highly sensitive to randomness in the target variable, as discussed in Rigollet and Weed [28]. This is because noise in a single data point changes the ranking of all other data points and affects their predictions. 
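The naive baseline in (4) can be sketched as follows. The score function `r` and inverse CDF `F_inv` are assumed to be supplied (e.g., `r` fitted by a ranking method such as rankSVM); the 1-dimensional pool with an identity score and a uniform target is only an illustration:

```python
import numpy as np

def predict_by_ranking(r, x_test, pool, F_inv):
    """Naive baseline (4): rank x_test within the unlabeled pool D_U by the
    score function r, then push its quantile through the inverse CDF of Y."""
    scores = np.asarray([r(x) for x in pool])
    n_u = len(scores)
    n_above = int((scores > r(x_test)).sum())   # x_test is ranked top n'-th
    return F_inv((n_u - n_above) / n_u)

# toy illustration: 1-d pool, identity score, Y uniform on [0, 1]
pool = np.linspace(0.0, 1.0, 101)
pred = predict_by_ranking(lambda x: x, 0.25, pool, F_inv=lambda q: q)
```

Here `pred` is close to 0.25, the quantile of the test point in the pool; the sensitivity discussed above arises because a single perturbed score shifts the ranks, and hence the predictions, of many other points.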
As illustrated in Rigollet and Weed [28], even when we have a perfect ranker, i.e., we know the true order in $D_U$, the model (4) is still different from the expected target $Y$ given the feature $X$ in the presence of noise.\n\n4 Empirical risk minimization by risk approximation\n\nIn this section, we propose a method to learn a model from pairwise comparison data $D_R$, unlabeled data $D_U$, and the density function $f_Y$ of the target variable $Y$. The method follows the empirical risk minimization principle, while the risk is approximated so that it can be empirically estimated from the available data. We therefore call this method the risk approximation (RA) method. Here, we present the approximated risk and derive its estimation error bound.\nFrom the definition of the Bregman divergence, the risk function in (1) is expressed as\n\n$R(h) = \mathbb{E}_Y[\phi(Y)] - \mathbb{E}_X[\phi(h(X)) - h(X)\phi'(h(X))] - \mathbb{E}_{X,Y}[Y\phi'(h(X))]$. (5)\n\nIn this decomposition, the last term is the only problematic part in uncoupled regression, since it requires the expectation over the joint distribution. Here, we consider approximating the last term based on the following expectations over the distributions of $X^+, X^-$.\nLemma 1. We have\n\n$\mathbb{E}_{X^+}[\phi'(h(X^+))] = 2\mathbb{E}_{X,Y}[F_Y(Y)\phi'(h(X))]$,\n$\mathbb{E}_{X^-}[\phi'(h(X^-))] = 2\mathbb{E}_{X,Y}[(1 - F_Y(Y))\phi'(h(X))]$.\n\nThe proof can be found in Appendix C.1. From Lemma 1, we can see that $\mathbb{E}_{X,Y}[Y\phi'(h(X))] = \mathbb{E}_{X^+}[\phi'(h(X^+))]/2$ if $F_Y(y) = y$, which corresponds to the case where the target variable $Y$ is marginally uniform on $[0, 1]$. This leads us to consider an approximation of the form\n\n$\mathbb{E}_{X,Y}[Y\phi'(h(X))] \simeq w_1\mathbb{E}_{X^+}[\phi'(h(X^+))] + w_2\mathbb{E}_{X^-}[\phi'(h(X^-))]$ (6)\n\nfor some constants $w_1, w_2 \in \mathbb{R}$. Note that the above uniform case corresponds to $(w_1, w_2) = (1/2, 0)$. In general, if the target $Y$ is marginally uniform on $[a, b]$ for $b > a$, that is, $F_Y(y) = (y - a)/(b - a)$ for all $y \in [a, b]$, we can see that the approximation (6) becomes exact for $(w_1, w_2) = (b/2, a/2)$ from Lemma 1. 
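Lemma 1 and the exactness of (6) for a uniform target can be checked numerically. The following Monte Carlo sketch uses a noise-free toy case of our own making: X = Y uniform on [0, 1], h(x) = x, and the l2-loss, so phi'(z) = 2z and the winner of a comparison is simply the larger draw:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# noise-free toy case: X = Y uniform on [0, 1], h(x) = x, phi(x) = x^2
y, y2 = rng.uniform(0, 1, n), rng.uniform(0, 1, n)
x_plus, x_minus = np.maximum(y, y2), np.minimum(y, y2)  # winner/loser features

F = lambda t: t            # CDF of the uniform target on [0, 1]
dphi = lambda z: 2.0 * z   # phi'(z) for the l2-loss

# Lemma 1: E[phi'(h(X+))] = 2 E[F_Y(Y) phi'(h(X))]
lhs = dphi(x_plus).mean()
rhs = 2.0 * (F(y) * dphi(y)).mean()

# exactness of (6) with (w1, w2) = (1/2, 0):
# E[Y phi'(h(X))] = w1 E[phi'(h(X+))] + w2 E[phi'(h(X-))]
target_term = (y * dphi(y)).mean()
approx = 0.5 * dphi(x_plus).mean() + 0.0 * dphi(x_minus).mean()
```

Both `lhs` and `rhs` concentrate around 4/3, and `target_term` and `approx` around 2/3, up to Monte Carlo error.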
In such a case, we can construct an unbiased estimator of the true risk $R$ from\n\n4\n\n\funlabeled and pairwise comparison data. For non-uniform target marginal distributions, we choose $(w_1, w_2)$ to minimize an upper bound on the estimation error, which we discuss in detail later.\nSince we have $\mathbb{E}_X[\phi'(h(X))] = \frac{1}{2}\mathbb{E}_{X^+}[\phi'(h(X^+))] + \frac{1}{2}\mathbb{E}_{X^-}[\phi'(h(X^-))]$ from Lemma 1, (6) can be rewritten as\n\n$\mathbb{E}_{X,Y}[Y\phi'(h(X))] \simeq \lambda\mathbb{E}_X[\phi'(h(X))] + \left(w_1 - \frac{\lambda}{2}\right)\mathbb{E}_{X^+}[\phi'(h(X^+))] + \left(w_2 - \frac{\lambda}{2}\right)\mathbb{E}_{X^-}[\phi'(h(X^-))]$ (7)\n\nfor arbitrary $\lambda \in \mathbb{R}$. Hence, by approximating (5) by (7), we can write the approximated risk $R_{\mathrm{RA}}$ as\n\n$R_{\mathrm{RA}}(h; \lambda, w_1, w_2) = C - \mathbb{E}_X[\phi(h(X)) - (h(X) - \lambda)\phi'(h(X))] - \left(w_1 - \frac{\lambda}{2}\right)\mathbb{E}_{X^+}[\phi'(h(X^+))] - \left(w_2 - \frac{\lambda}{2}\right)\mathbb{E}_{X^-}[\phi'(h(X^-))]$.\n\nHere, $C = \mathbb{E}_Y[\phi(Y)]$ can be ignored in the optimization procedure. The empirical estimator of $R_{\mathrm{RA}}$ is now\n\n$\hat{R}_{\mathrm{RA}}(h; \lambda, w_1, w_2) = C - \frac{1}{n_U}\sum_{x_i \in D_U}\left(\phi(h(x_i)) - (h(x_i) - \lambda)\phi'(h(x_i))\right) - \frac{1}{n_R}\sum_{(x_i^+, x_i^-) \in D_R}\left(\left(w_1 - \frac{\lambda}{2}\right)\phi'(h(x_i^+)) + \left(w_2 - \frac{\lambda}{2}\right)\phi'(h(x_i^-))\right)$,\n\n5\n\n\fwhich is to be minimized in the RA method. Again, we would like to emphasize that if the marginal distribution $P_Y$ is uniform on $[a, b]$ and $(w_1, w_2)$ is set to $(b/2, a/2)$, we have $R_{\mathrm{RA}} = R$ and $\hat{R}_{\mathrm{RA}}$ is an unbiased estimator of $R$ for any $\lambda \in \mathbb{R}$.\nFrom the definition of $\hat{R}_{\mathrm{RA}}$, we can see that by setting $\lambda$ to either $2w_1$ or $2w_2$, $\hat{R}_{\mathrm{RA}}$ becomes independent of either $X^+$ or $X^-$. This means that we can conduct uncoupled regression even if one of $X^+, X^-$ is missing from the data, which corresponds to the case where only winners or only losers of the comparisons are observed.\nAnother advantage of tuning the free parameter $\lambda$ is that we can reduce the variance of the empirical risk $\hat{R}_{\mathrm{RA}}$, as discussed in Sakai et al. [30] and Bao et al. [1]. As in Sakai et al. [30], the optimal $\lambda$ that minimizes the variance of $\hat{R}_{\mathrm{RA}}$ as $n_U \to \infty$ is derived as follows.\nTheorem 1. For a given model $h$, let $\sigma_+^2, \sigma_-^2$ be\n\n$\sigma_+^2 = \mathrm{Var}_{X^+}[\phi'(h(X^+))]$, $\quad \sigma_-^2 = \mathrm{Var}_{X^-}[\phi'(h(X^-))]$,\n\nrespectively, where $\mathrm{Var}_X[\cdot]$ is the variance with respect to the random variable $X$. Then, setting\n\n$\lambda = \frac{2(w_1\sigma_+^2 + w_2\sigma_-^2)}{\sigma_+^2 + \sigma_-^2}$\n\nyields the estimator with the minimum variance among estimators of the form $\hat{R}_{\mathrm{RA}}$ when $n_U \to \infty$.\nThe proof can be found in Appendix C.3. From Theorem 1, we can see that the optimal $\lambda$ does not equal zero, which means that, with a sufficient amount of unlabeled data, we can reduce the variance of the empirical estimate by tuning $\lambda$. This situation is natural, since unlabeled data is easier to collect than pairwise comparison data, as discussed in Duh and Kirchhoff [9].\nNow, using the notion of the pseudo-dimension [12], we establish an upper bound on the estimation error, which is used to choose the weights $(w_1, w_2)$. Let $\hat{h}_{\mathrm{RA}}$ and $h^*$ be the minimizers of $\hat{R}_{\mathrm{RA}}$ and $R$ in the hypothesis class $\mathcal{H}$, respectively. Then, we have the following theorem that bounds the excess risk in terms of the parameters $(w_1, w_2)$.\nTheorem 2. Suppose that the pseudo-dimensions of $\{x \mapsto \phi'(h(x)) \mid h \in \mathcal{H}\}$ and $\{x \mapsto h(x)\phi'(h(x)) - \phi(h(x)) \mid h \in \mathcal{H}\}$ are finite, and that there exist constants $m, M$ such that $|h(x)\phi'(h(x)) - \phi(h(x))| \leq m$ and $|\phi'(h(x))| \leq M$ for all $x \in \mathcal{X}$ and all $h \in \mathcal{H}$. Then,\n\n$R(\hat{h}_{\mathrm{RA}}) \leq R(h^*) + O\left(\sqrt{\frac{\log 1/\delta}{n_U}}\right) + O\left(\sqrt{\frac{\log 1/\delta}{n_R}}\right) + M \, \mathrm{Err}(w_1, w_2)$\n\nholds with probability $1 - \delta$, where $\mathrm{Err}$ is defined as
In particular, if target Y is uniformly distributed on [a, b], we have\nErr(w1, w2) = 0 by setting (w1, w2) = (b/2, a/2). In such a case, \u02c6hRA becomes a consistent model,\ni.e., R(\u02c6hRA) ! R(h\u21e4) as nU, nR ! 1. The convergence rate is O(1/pnU + 1/pnR), which is\nthe optimal parametric rate for the empirical risk minimization without additional assumptions when\nthe enough amount of unlabeled and pairwise comparison data is provided jointly [21].\nOne important case where target variable Y distributes uniformly is when the target is a \u201cquantile\nvalue\u201d. For instance, we are to build a screening system for credit cards. Then, what we are interested\nin is \u201chow much is an applicant credible in the population?\u201d, which means that we want to predict the\nquantile value of the \u201ccredit score\u201d in the marginal distribution. By de\ufb01nition, we know that such a\nquantile value distributes uniformly, and thus we can have a consistent model by minimizing \u02c6RRA.\nIn general cases, however, we may have Err(w1, w2) > 0, and \u02c6hRA becomes not consistent. Never-\ntheless, this is inevitable as suggested in the following theorem.\nTheorem 3. There exists a pair of joint distributions PX,Y , \u02dcPX,Y that yields the same marginal\ndistributions of feature PX and target PY , and the same distributions of the pairwise comparison\ndata PX +,X but have different conditional expectation EY |X=x [Y ].\nTheorem 3 states that there exists a pair of distributions that cannot be distinguished from available\ndata. Considering that h\u21e4(x) = EY |X=x [Y ] when hypothesis space H is rich enough [11], this\ntheorem implies that we cannot always obtain a consistent model. 
Still, in Section 6, we show that $\hat{h}_{\mathrm{RA}}$ empirically exhibits accuracy similar to that of a model learned from ordinary coupled data.\n\n5 Empirical risk minimization by target transformation\n\nIn this section, we introduce another method for uncoupled regression with pairwise comparison data, called the target transformation (TT) method. Whereas the RA method minimizes an approximation of the original risk, the TT method transforms the target variable so that it is marginally uniform and minimizes an unbiased estimator of the risk defined on the transformed variable.\nAlthough there are several ways to map $Y$ to a uniformly distributed random variable, one natural candidate is the CDF $F_Y(Y)$, which leads to the following risk:\n\n$R_{\mathrm{TT}}(h) = \mathbb{E}_{X,Y}[d_\phi(F_Y(Y), F_Y(h(X)))]$. (9)\n\nSince $F_Y(Y)$ is uniform on $[0, 1]$ by definition, we can construct the following unbiased estimator of $R_{\mathrm{TT}}$ by the same argument as in the previous section:\n\n$\hat{R}_{\mathrm{TT}}(h; \lambda) = C - \frac{1}{n_U}\sum_{x_i \in D_U}\left(\phi(F_Y(h(x_i))) - (F_Y(h(x_i)) - \lambda)\phi'(F_Y(h(x_i)))\right) - \frac{1}{n_R}\sum_{(x_i^+, x_i^-) \in D_R}\left(\frac{1 - \lambda}{2}\phi'(F_Y(h(x_i^+))) - \frac{\lambda}{2}\phi'(F_Y(h(x_i^-)))\right)$,\n\nwhere $\lambda$ is a hyper-parameter to be tuned. The TT method minimizes $\hat{R}_{\mathrm{TT}}$ to learn a model. However, the learned model is, again, not always consistent in terms of the original risk $R$. This is because, in a rich enough hypothesis space $\mathcal{H}$, the minimizer $h_{\mathrm{TT}}(x) = F_Y^{-1}(\mathbb{E}_{Y|X=x}[F_Y(Y)])$ of (9) is different from $\mathbb{E}_{Y|X=x}[Y]$, the minimizer of (1), unless the target $Y$ is uniformly distributed. Hence, for a non-uniform target, we cannot always obtain a consistent model. 
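The TT prediction pipeline, i.e., composing a learned quantile model with the inverse CDF, can be sketched as follows; `g` and `F_inv` are hypothetical placeholders, and the fitting of `g` by minimizing the transformed risk is omitted:

```python
import numpy as np

def tt_predict(g, F_inv, x):
    """TT prediction: g(x) estimates the conditional expected quantile
    E[F_Y(Y) | X = x]; composing with the inverse CDF F_Y^{-1} maps it
    back to the original target scale."""
    return F_inv(np.clip(g(x), 0.0, 1.0))

# toy illustration: Y uniform on [10, 20], hypothetical quantile model g
F_inv = lambda q: 10.0 + 10.0 * q
g = lambda x: 0.1 * x
pred = tt_predict(g, F_inv, 5.0)   # -> 15.0
```

The clipping keeps the estimated quantile inside [0, 1] so that the inverse CDF is always well defined.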
However, we can still derive an estimation error bound if $h_{\mathrm{TT}} \in \mathcal{H}$ and the target variable $Y$ is generated as\n\n$Y = h_{\mathrm{true}}(X) + \varepsilon$, (10)\n\nwhere $h_{\mathrm{true}} : \mathcal{X} \to \mathcal{Y}$ is the true target function and $\varepsilon$ is a zero-mean noise variable bounded in $[-\sigma, \sigma]$ for some constant $\sigma > 0$.\n\n6\n\n\fTheorem 4. Assume that the target variable $Y$ is generated by (10) and $h_{\mathrm{TT}} \in \mathcal{H}$. If the pseudo-dimension of $\{x \mapsto \phi'(F_Y(h(x))) \mid h \in \mathcal{H}\}$ is finite and there exist constants $P > p > 0$ such that $p \leq f_Y(y) \leq P$ for all $y \in \mathcal{Y}$, then\n\n$R(\hat{h}_{\mathrm{TT}}) \leq R(h_{\mathrm{true}}) + \left(\frac{P\sigma}{p}\right)^2 + O\left(\sqrt{\frac{\log 1/\delta}{n_U}}\right) + O\left(\sqrt{\frac{\log 1/\delta}{n_R}}\right)$\n\nholds with probability $1 - \delta$ for $\phi(x) = x^2$, where $\hat{h}_{\mathrm{TT}}$ is the minimizer of $\hat{R}_{\mathrm{TT}}$ in $\mathcal{H}$.\nThe proof can be found in Appendix C.5. From Theorem 4, we can see that $\hat{h}_{\mathrm{TT}}$ is not necessarily consistent; again, this is inevitable for the same reason as in the RA method. We can also see that the error of the TT method depends explicitly on the noise, and thus it is advantageous when the target contains less noise. In Section 6, we empirically compare the two methods and show that which method is more suitable differs from case to case.\n\n6 Experiments\n\nIn this section, we present the empirical performance of the proposed methods in experiments based on synthetic data and benchmark data. We show that our proposed methods outperform the naive method described in (4) and an existing method [24]. Moreover, our methods show performance similar to that of a model learned by ordinary supervised regression with coupled data.\nBefore presenting the results, we describe the detailed experimental procedure. In all experiments, we consider the $l_2$-loss $l(z, t) = (z - t)^2$, which corresponds to setting $\phi(x) = x^2$ in the Bregman divergence $d_\phi(t, z)$. The performance is evaluated by the mean squared error (MSE) on held-out test data. 
We repeat each experiment 100 times and report the mean and the standard deviation. We employ the hypothesis space of linear functions $\mathcal{H} = \{h(x) = \theta^\top x \mid \theta \in \mathbb{R}^d\}$ for the RA method. A slightly different hypothesis space $\mathcal{H}' = \{h(x) = F_Y^{-1}(\sigma(\theta^\top x)) \mid \theta \in \mathbb{R}^d\}$ is employed for the TT method in order to simplify the loss, where $\sigma$ is the logistic function $\sigma(x) = 1/(1 + \exp(-x))$. The procedure for tuning the hyper-parameters of $R_{\mathrm{RA}}$ and $R_{\mathrm{TT}}$ can be found in Appendix A.\n\n6.1 Comparison with baseline methods\n\nWe introduce two types of baseline methods here. One is the naive application of a ranking method described in (4), in which we use SVMRank [13] with the linear kernel as the ranking method. The other is ordinary supervised linear regression (LR), in which we fit a linear model using the true labels of the unlabeled data $D_U$. Note that LR does not use the pairwise comparison data $D_R$.\nResult for synthetic data. First, we show the results for synthetic data, for which we know the true marginal $P_Y$. We sample 5-dimensional unlabeled data $D_U$ from the normal distribution $N(0, I_d)$, where $I_d$ is the identity matrix. Then, we sample a true unknown parameter $\theta$ with $\|\theta\|_2 = 1$ uniformly at random. The target $Y$ is generated as $Y = \theta^\top X + \varepsilon$, where $\varepsilon$ is noise following $N(0, 0.01)$. Consequently, $P_Y$ corresponds to $N(0, \sqrt{1.01})$, which is utilized in the proposed methods and the ranking baseline. The pairwise comparison data is generated by (2): we first sample two features $X, X'$ from $N(0, I_d)$, and then compare them based on the target values $Y, Y'$ calculated as $Y = \theta^\top X + \varepsilon$. We fix $n_U$ to 100,000 and vary $n_R$ from 20 to 10,240 to see how the performance changes with the size of the pairwise comparison data.\nThe results are presented in Figure 1. 
From this figure, we can see that with sufficient pairwise comparison data, the performance of our methods is significantly better than the SVMRank baseline and close to LR. This is remarkable, since LR uses the true labels of $D_U$ while our methods do not. In this experiment, the RA method consistently performs better than the TT method, though this is not universal, as shown in the experiments on benchmark datasets.\nNote that the TT method is unstable when the size of the pairwise comparison data is small; we observed this phenomenon in all experiments. This is because we learn the quantile value when we minimize $R_{\mathrm{TT}}$, and this can be severely inaccurate when the size of the pairwise comparison data is small. On the other hand, $R_{\mathrm{RA}}$ directly minimizes an approximation of the true risk $R$, which is less sensitive to the size of $D_R$.\n\n7\n\n\fFigure 1: MSE for synthetic data\n\nFigure 2: MSE for the housing dataset\n\nTable 1: MSE for benchmark datasets when $n_R$ is 5,000. Bold face indicates the outstanding method among the uncoupled regression methods (SVMRank, RA and TT), chosen by the Welch t-test with significance level 5%. Note that LR does not solve uncoupled regression, since it uses the labels of $D_U$.\n\nDataset    | LR (supervised) | SVMRank       | RA            | TT\nhousing    | 24.5(5.0)       | 110.3(29.5)   | 29.5(6.9)     | 22.5(6.2)\ndiabetes   | 3041.9(219.8)   | 8575.9(883.1) | 3087.3(256.3) | 3127.3(278.8)\nairfoil    | 23.3(2.2)       | 62.1(7.6)     | 23.7(2.0)     | 22.7(2.2)\nconcrete   | 109.5(13.3)     | 322.9(45.8)   | 111.7(13.2)   | 139.1(17.9)\npowerplant | 20.6(0.9)       | 372.2(34.8)   | 21.8(1.1)     | 22.0(1.0)\nmpg        | 12.1(2.04)      | 125(15.1)     | 12.8(2.16)    | 10.3(2.08)\nredwine    | 0.412(0.0361)   | 1.28(0.112)   | 0.442(0.0473) | 0.466(0.0412)\nwhitewine  | 0.574(0.0325)   | 1.58(0.0691)  | 0.597(0.0382) | 0.644(0.0414)\nabalone    | 5.05(0.375)     | 20.9(1.44)    | 5.26(0.372)   | 5.54(0.424)\n\nResult for benchmark datasets. 
We conducted the experiments for the benchmark datasets as\nwell, in which we do not know true marginal PY . The details of benchmark datasets can be found in\nAppendix A. We use the original features as unlabeled data DU. Density function fY is estimated\nfrom target values in the dataset by kernel density estimation [25] with the Gaussian kernel. Here, the\nbandwidth of Gaussian kernel is determined by cross-validation. The pairwise comparison data is\nconstructed by comparing the true target values of two data points uniformly sampled from DU.\nFigure 2 shows3 the performance of each method with respect to the size of pairwise comparison data\nfor the housing dataset. We can see that the proposed methods signi\ufb01cantly outperform SVMRank\nand approach to LR with increasing nR. This fact suggests that the estimation error in fY has little\nimpact on the performance. The results for various datasets when nR is 5,000 are presented in Table 1,\nin which both proposed methods show the promising performance. Note that the method with less\nMSE differs by each dataset, which means that we cannot easily judge which method is better.\n\n6.2 Comparison with other uncoupled regression methods\n\nHere, we show the results of the empirical comparison between our methods and the method proposed\nin Pananjady et al. [24], which is another uncoupled regression method. Note that Pananjady et al.\n[24] considered a different problem, since they assume that the true regression function is exactly a\nlinear function of the features and ignore the comparative data. Hence, we synthetically create data\nthat all three methods are applicable and conduct comparison based on it.\n\n3From Figure 2, we can again see that the TT method performs unstably when nR is small for benchmark\ndata, which causes the strange pro\ufb01le in the log plot. 
Since the standard deviation of the TT method is large, the mean accuracy minus the standard deviation goes negative, which diverges to $-\infty$ in the log plot.

Figure 3: MSE for synthetic data following normal distribution
Figure 4: MSE for 1-dimensional synthetic data following uniform distribution

This method learns the optimal coupling of the unlabeled data $D_U = \{x_i\}_{i=1}^{n_U}$ and the target values $D_Y = \{y_i\}_{i=1}^{n_U}$, assuming a linear relationship between them. Formally, the method finds the optimal parameter $\hat{\theta}$ and permutation $\hat{\pi} \colon [n_U] \to [n_U]$ that minimize the MSE:
$$(\hat{\theta}, \hat{\pi}) = \operatorname*{arg\,min}_{\theta \in \mathbb{R}^d,\ \pi \in \mathcal{P}_n} \sum_{i=1}^{n_U} \left(y_{\pi(i)} - x_i^\top \theta\right)^2,$$
where $\mathcal{P}_n$ is the set of permutations of $n$ items. Pananjady et al. [24] proved that this minimization is NP-hard in general but can be solved with $O(n_U \log n_U)$ computation when $d = 1$. Hence, we conduct the experiment based on synthetic 1-dimensional data.
The data is generated as follows. Unlabeled data $D_U = \{x_i\}_{i=1}^{n_U}$ is sampled from a certain distribution, where the size of the data is fixed as $n_U = 100{,}000$. Here, we used the normal distribution $N(0, 1)$ and the uniform distribution on $[-1, 1]$. We set $\theta = 1$ and generate the target $Y$ as $Y = \theta X + \varepsilon$, where $\varepsilon$ is noise following $N(0, 0.25)$. We randomly shuffle the target values $y_i$ to build $D_Y$. The comparative data is constructed in the same way as for the synthetic data described in Section 6.1. Note that the method in Pananjady et al. [24] ignores the comparative data; hence its performance does not depend on the amount of comparative data.
The results^4 are shown in Figures 3 and 4, which show the superiority of our methods. This is mainly due to the difference in the data used by each method. The method in Pananjady et al.
[24] only uses the unlabeled data and the target values, while our methods utilize the comparative data as well. We can see that this additional information greatly contributes to the better performance.

7 Conclusions

In this paper, we proposed novel methods for uncoupled regression that utilize pairwise comparison data. We introduced two methods for the problem: the risk approximation (RA) method and the target transformation (TT) method. The RA method approximates the expected Bregman divergence by a linear combination of expectations over the given data, and the TT method learns a model for quantile values and uses the inverse of the CDF to predict the target. We derived estimation error bounds for each method and showed that the learned model is consistent when the target variable is uniformly distributed. Furthermore, the empirical evaluations based on both synthetic data and benchmark datasets suggested the competence of our methods. The empirical results also indicated the instability of the TT method when the size of the pairwise comparison data is small; we may need some regularization scheme to prevent this, which is left for future work.

Acknowledgements
LX utilized the facility provided by the Masason Foundation. JH acknowledges support by KAKENHI 18K17998, and MS was supported by JST CREST Grant Number JPMJCR18A2.

^4 The standard deviation of the method in Pananjady et al. [24] is large but finite. The mean minus the standard deviation diverges to $-\infty$ in the plot for the same reason as in Figure 2.

References

[1] H. Bao, G. Niu, and M. Sugiyama. Classification from pairwise similarity and unlabeled data. In Proceedings of the 35th International Conference on Machine Learning, 2018.

[2] A. Carpentier and T. Schlueter.
Learning relationships between data obtained independently. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, 2016.

[3] W. W. Cohen and J. Richman. Learning to match and cluster large high-dimensional data sets for data integration. In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2002.

[4] C. Cortes and M. Mohri. On transductive regression. In Proceedings of the 19th Advances in Neural Information Processing Systems, 2007.

[5] C. Cortes, M. Mohri, D. Pechyony, and A. Rastogi. Stability of transductive regression algorithms. In Proceedings of the 25th International Conference on Machine Learning, 2008.

[6] M. du Plessis, G. Niu, and M. Sugiyama. Convex formulation for learning from positive and unlabeled data. In Proceedings of the 32nd International Conference on Machine Learning, 2015.

[7] M. C. du Plessis, G. Niu, and M. Sugiyama. Analysis of learning from positive and unlabeled data. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Proceedings of the 27th Advances in Neural Information Processing Systems, 2014.

[8] D. Dua and C. Graff. UCI machine learning repository, 2017. URL http://archive.ics.uci.edu/ml.

[9] K. Duh and K. Kirchhoff. Semi-supervised ranking for document retrieval. Computer Speech and Language, 25(2):261–281, 2011.

[10] B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression. Annals of Statistics, 32:407–499, 2004.

[11] A. Frigyik, S. Srivastava, and M. R. Gupta. Functional Bregman divergence and Bayesian estimation of distributions. IEEE Transactions on Information Theory, 54(11):5130–5139, 2008.

[12] D. Haussler. Decision theoretic generalizations of the PAC model for neural net and other learning applications. Information and Computation, 100(1):78–150, 1992.

[13] R. Herbrich, T. Graepel, and K.
Obermayer. Large margin rank boundaries for ordinal regression. In A. Smola, P. Bartlett, B. Schölkopf, and D. Schuurmans, editors, Advances in Large Margin Classifiers, pages 115–132, Cambridge, MA, 2000. MIT Press.

[14] D. J. Hsu, K. Shi, and X. Sun. Linear regression without correspondence. In Proceedings of the 30th Advances in Neural Information Processing Systems, pages 1531–1540, 2017.

[15] T. Jebara. Kernelizing sorting, permutation and alignment for minimum volume PCA. In Proceedings of the 17th Annual Conference on Learning Theory, 2004.

[16] A. Jitta and A. Klami. Few-to-few cross-domain object matching. In Proceedings of the 3rd International Workshop on Advanced Methodologies for Bayesian Networks, 2017.

[17] G. Kostopoulos, S. Karlos, S. Kotsiantis, and O. Ragos. Semi-supervised regression: A recent review. Journal of Intelligent and Fuzzy Systems, 35:1–18, 2018.

[18] N. Lu, G. Niu, A. K. Menon, and M. Sugiyama. On the minimal supervision for training any binary classifier from only unlabeled data. In Proceedings of the 7th International Conference on Learning Representations, 2019.

[19] G. S. Mann and A. McCallum. Generalized expectation criteria for semi-supervised learning with weakly labeled data. The Journal of Machine Learning Research, 11:955–984, 2010.

[20] P. Massart. The tight constant in the Dvoretzky–Kiefer–Wolfowitz inequality. The Annals of Probability, 18(3):1269–1283, 1990.

[21] S. Mendelson. Lower bounds for the empirical minimization algorithm. IEEE Transactions on Information Theory, 54:3797–3803, 2008.

[22] M. Mohri, A. Rostamizadeh, and A. Talwalkar. Foundations of Machine Learning. The MIT Press, 2012.

[23] G. Niu, M. C. du Plessis, T. Sakai, Y. Ma, and M. Sugiyama.
Theoretical comparisons of positive-unlabeled learning against positive-negative learning. In Proceedings of the 29th Advances in Neural Information Processing Systems, 2016.

[24] A. Pananjady, M. J. Wainwright, and T. A. Courtade. Linear regression with shuffled data: Statistical and computational limits of permutation recovery. IEEE Transactions on Information Theory, 64(5):3286–3300, 2018.

[25] E. Parzen. On estimation of a probability density function and mode. The Annals of Mathematical Statistics, 33(3):1065–1076, 1962.

[26] N. Quadrianto, L. Song, and A. J. Smola. Kernelized sorting. In Proceedings of the 21st Advances in Neural Information Processing Systems, 2009.

[27] S. Ray and D. Page. Multiple instance regression. In Proceedings of the 18th International Conference on Machine Learning, 2001.

[28] P. Rigollet and J. Weed. Uncoupled isotonic regression via minimum Wasserstein deconvolution. Information and Inference: A Journal of the IMA, 2019.

[29] C. Rudin, C. Cortes, M. Mohri, and R. E. Schapire. Margin-based ranking meets boosting in the middle. In Proceedings of the 18th Annual Conference on Learning Theory, 2005.

[30] T. Sakai, M. C. du Plessis, G. Niu, and M. Sugiyama. Semi-supervised classification based on classification from positive and unlabeled data. In Proceedings of the 34th International Conference on Machine Learning, 2017.

[31] V. Walter and D. Fritsch. Matching spatial data sets: A statistical approach. International Journal of Geographical Information Science, 13(5):445–473, 1999.

[32] S. L. Warner. Randomized response: A survey technique for eliminating evasive answer bias. Journal of the American Statistical Association, 60(309):63–69, 1965.

[33] M. Yamada and M. Sugiyama.
Cross-domain object matching with model selection. In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics, 2011.

[34] Q. Zhang and S. A. Goldman. EM-DD: An improved multiple-instance learning technique. In Proceedings of the 15th Advances in Neural Information Processing Systems, 2002.

[35] Z. Zhou. A brief introduction to weakly supervised learning. National Science Review, 5(1):44–53, 2018.