{"title": "Hypothesis Transfer Learning via Transformation Functions", "book": "Advances in Neural Information Processing Systems", "page_first": 574, "page_last": 584, "abstract": "We consider the Hypothesis Transfer Learning (HTL) problem where one incorporates a hypothesis trained on the source domain into the learning procedure of the target domain. Existing theoretical analysis either only studies specific algorithms or only presents upper bounds on the generalization error but not on the excess risk. In this paper, we propose a unified algorithm-dependent framework for HTL through a novel notion of transformation functions, which characterizes the relation between the source and the target domains. We conduct a general risk analysis of this framework and in particular, we show for the first time, if two domains are related, HTL enjoys faster convergence rates of excess risks for Kernel Smoothing and Kernel Ridge Regression than those of the classical non-transfer learning settings. We accompany this framework with an analysis of cross-validation for HTL to search for the best transfer technique and gracefully reduce to non-transfer learning when HTL is not helpful. Experiments on robotics and neural imaging data demonstrate the effectiveness of our framework.", "full_text": "Hypothesis Transfer Learning via\n\nTransformation Functions\n\nSimon S. Du\n\nCarnegie Mellon University\n\nssdu@cs.cmu.edu\n\nJayanth Koushik\n\nCarnegie Mellon University\n\njayanthkoushik@cmu.edu\n\nAarti Singh\n\nCarnegie Mellon University\naartisingh@cmu.edu\n\nBarnab\u00e1s P\u00f3czos\n\nCarnegie Mellon University\nbapoczos@cs.cmu.edu\n\nAbstract\n\nWe consider the Hypothesis Transfer Learning (HTL) problem where one incor-\nporates a hypothesis trained on the source domain into the learning procedure\nof the target domain. 
Existing theoretical analyses either study only specific algorithms or present upper bounds only on the generalization error, not on the excess risk. In this paper, we propose a unified algorithm-dependent framework for HTL through a novel notion of transformation function, which characterizes the relation between the source and the target domains. We conduct a general risk analysis of this framework and, in particular, we show for the first time that if two domains are related, HTL enjoys faster convergence rates of excess risk for Kernel Smoothing and Kernel Ridge Regression than those of the classical non-transfer learning settings. Experiments on real world data demonstrate the effectiveness of our framework.\n\n1 Introduction\nIn a classical transfer learning setting, we have a large amount of data from a source domain and a relatively small amount of data from a target domain. These two domains are related but not necessarily identical, and the usual assumption is that the hypothesis learned from the source domain is useful in the learning task of the target domain.\nIn this paper, we focus on the regression problem where the functions we want to estimate in the source and target domains are different but related. Figure 1a shows a 1D toy example of this setting, where the source function is f^so(x) = sin(4\u03c0x) and the target function is f^ta(x) = sin(4\u03c0x) + 4\u03c0x. Many real world problems can be formulated as transfer learning problems. For example, in the task of predicting the reaction time of an individual from his/her fMRI images, we have about 30 subjects, but each subject has only about 100 data points. To learn the mapping from neural images to the reaction time of a specific subject, we can treat all but this subject as the source domain, and this subject as the target domain.
In Section 6, we show how our proposed method helps us learn this mapping more accurately.\nThis paradigm, hypothesis transfer learning (HTL), has been explored empirically with success in many applications [Fei-Fei et al., 2006, Yang et al., 2007, Orabona et al., 2009, Tommasi et al., 2010, Kuzborskij et al., 2013, Wang and Schneider, 2014]. Kuzborskij and Orabona [2013, 2016] pioneered the theoretical analysis of HTL for linear regression, and recently Wang and Schneider [2015] analyzed Kernel Ridge Regression. However, most existing works only provide generalization bounds, i.e., the difference between the true risk and the training error or the leave-one-out error. These analyses are incomplete because minimizing the generalization error does not necessarily\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\n[Figure 1: Experimental results on synthetic data. (a) A toy example of transfer learning; we have many more samples from the source domain than from the target domain. (b) Transfer learning with the Offset Transformation. (c) Transfer learning with the Scale Transformation.]\n\nreduce the true risk. Further, these works often rely on a particular form of transformation from the source domain to the target domain. For example, Wang and Schneider [2015] studied the offset transformation: instead of estimating the target domain function directly, they learn the residual between the target domain function and the source domain function.
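This residual idea is easy to state in code. The sketch below is ours, not the cited authors' implementation: it fits a simple kernel-smoothing estimate on plentiful source data, then regresses the residual on the small target sample, using the toy functions of Figure 1a. The Gaussian kernel and all bandwidths are illustrative choices (the analysis later in the paper assumes a compactly supported kernel).

```python
import numpy as np

def ks_fit(X, Y, h):
    """Nadaraya-Watson kernel smoother with a Gaussian kernel (illustrative choice)."""
    def f_hat(x):
        w = np.exp(-((x - X) ** 2) / (2 * h ** 2))
        return np.sum(w * Y) / np.sum(w)
    return f_hat

rng = np.random.default_rng(0)
f_so = lambda x: np.sin(4 * np.pi * x)                   # source regression function
f_ta = lambda x: np.sin(4 * np.pi * x) + 4 * np.pi * x   # target = source + linear offset

X_so = rng.uniform(0, 1, 2000)
Y_so = f_so(X_so) + 0.1 * rng.standard_normal(2000)      # many source samples
X_ta = rng.uniform(0, 1, 100)
Y_ta = f_ta(X_ta) + 0.1 * rng.standard_normal(100)       # few target samples

fhat_so = ks_fit(X_so, Y_so, h=0.02)                     # accurate: n_so is large
# Offset transfer: the residual w(x) = f_ta(x) - f_so(x) = 4*pi*x is simple,
# so a wide bandwidth and few target samples suffice to estimate it.
W = Y_ta - np.array([fhat_so(x) for x in X_ta])
what = ks_fit(X_ta, W, h=0.2)
fhat_ta = lambda x: fhat_so(x) + what(x)
```

Estimating the wiggly target function directly from 100 points would require a small bandwidth the target sample cannot support; estimating only the smooth residual sidesteps this.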
It is natural to ask what happens if we use other transformation functions and how the choice affects the risk on the target domain.\nIn this paper, we propose a general framework for HTL. Instead of analyzing a specific form of transfer, we treat it as an input to our learning algorithm. We call this input a transformation function since, intuitively, it captures the relevance between the two domains.1 This framework unifies many previous works [Wang and Schneider, 2014, Kuzborskij and Orabona, 2013, Wang et al., 2016] and naturally induces a class of new learning procedures.\nTheoretically, we develop an excess risk analysis for this framework. The performance depends on the stability [Bousquet and Elisseeff, 2002] of the algorithm used as a subroutine: if the algorithm is stable, then the estimation error in the source domain does not much affect the estimation in the target domain. To our knowledge, this connection was first established by Kuzborskij et al. [2013] in the linear regression setting; here we generalize it to a broader context. In particular, we provide explicit risk bounds for two widely used nonlinear estimators, Kernel Smoothing (KS) and Kernel Ridge Regression (KRR), as subroutines. To the best of our knowledge, these are the first results showing that when two domains are related, transfer learning techniques achieve a faster statistical convergence rate of excess risk than non-transfer learning for kernel based methods. Further, we accompany this framework with a theoretical analysis showing that a small amount of data for cross-validation enables us (1) to avoid using HTL when it is not useful and (2) to choose the best transformation function as input from a large pool.\nThe rest of the paper is organized as follows. In Section 2 we introduce HTL and provide the necessary background for KS and KRR. We formalize our transformation function based framework in Section 3.
Our main theoretical results are in Section 4; specifically, in Section 4.1 and Section 4.2 we provide explicit risk bounds for KS and KRR, respectively. In Section 5 we analyze cross-validation in the HTL setting, and in Section 6 we conduct experiments on real world data. We conclude with a brief discussion of avenues for future work.\n\n2 Preliminaries\n2.1 Problem Setup\n\nIn this paper, we assume both X \u2208 Rd and Y \u2208 R lie in compact subsets: ||X||_2 \u2264 C_X and |Y| \u2264 C_Y for some constants C_X, C_Y \u2208 R+. Throughout the paper, we use T = {(X_i, Y_i)}_{i=1}^n to denote a set of samples. Let (X^so, Y^so) be the sample from the source domain, and (X^ta, Y^ta) the sample from the target domain. In our setting, there are n_so samples drawn i.i.d. from the source distribution, T^so = {(X^so_i, Y^so_i)}_{i=1}^{n_so}, and n_ta samples drawn i.i.d. from the target distribution, T^ta = {(X^ta_i, Y^ta_i)}_{i=1}^{n_ta}. In addition, we also use n_val samples drawn i.i.d. from the target domain for cross-validation. We model the joint relation between X and Y by Y^so = f^so(X^so) + \u03b5^so and Y^ta = f^ta(X^ta) + \u03b5^ta, where f^so and f^ta are regression functions and the noise terms are i.i.d. and bounded with E[\u03b5^so] = E[\u03b5^ta] = 0. We use A : T \u2192 \u02c6f to denote an algorithm that takes a set of samples and produces an estimator. Given an estimator \u02c6f, we define the integrated L2 risk as R(\u02c6f) = E[(\u02c6f(X) \u2212 Y)^2], where the expectation is taken over the distribution of (X, Y). Similarly, the empirical L2 risk on a set of samples T is defined as \u02c6R(\u02c6f) = (1/n) \u2211_{i=1}^n (Y_i \u2212 \u02c6f(X_i))^2. In the HTL setting, we use \u02c6f^so, an estimator from the source domain, to facilitate the learning procedure for f^ta.\n\n1We formally define the transformation functions in Section 3.\n\n2.2 Kernel Smoothing\n\nWe say a function f is in the (\u03bb, \u03b1) H\u00f6lder class [Wasserman, 2006] if for any x, x' \u2208 Rd, f satisfies |f(x) \u2212 f(x')| \u2264 \u03bb||x \u2212 x'||_2^\u03b1, for some \u03b1 \u2208 (0, 1). The kernel smoothing method uses a kernel K on [0, 1] that is positive, highest at 0, decreasing on [0, 1], 0 outside [0, 1], and satisfies \u222b u^2 K(u) du < \u221e. Using T = {(X_i, Y_i)}_{i=1}^n, the kernel smoothing estimator is defined as \u02c6f(x) = \u2211_{i=1}^n w_i(x) Y_i, where w_i(x) = K(||x \u2212 X_i||/h) / \u2211_{j=1}^n K(||x \u2212 X_j||/h) \u2208 [0, 1].\n\n2.3 Kernel Ridge Regression\n\nAnother popular non-linear estimator is kernel ridge regression (KRR), which uses the theory of reproducing kernel Hilbert spaces (RKHS) for regression [Vovk, 2013]. Any symmetric positive semidefinite kernel function K : Rd \u00d7 Rd \u2192 R defines a RKHS H. For each x \u2208 Rd, the function z \u2192 K(z, x) is contained in the Hilbert space H; moreover, the Hilbert space is endowed with an inner product \u27e8\u00b7,\u00b7\u27e9_H such that K(\u00b7, x) acts as the kernel of the evaluation functional, meaning \u27e8f, K(x,\u00b7)\u27e9_H = f(x) for f \u2208 H. In this paper we assume K is bounded: sup_{x\u2208Rd} K(x, x) = k < \u221e. Given the inner product, the H norm of a function g \u2208 H is defined as ||g||_H = \u221a\u27e8g, g\u27e9_H, and similarly the L2 norm is ||g||_2 = (\u222b g(x)^2 dP_X)^{1/2} for a given P_X. The kernel also induces an integral operator T_K : L2(P_X) \u2192 L2(P_X), T_K[f](x) = \u222b K(x', x) f(x') dP_X(x'), with countably many non-zero eigenvalues {\u03bc_i}_{i\u22651}. For a given function f, the approximation error is defined as A_f(\u03bb) = inf_{h\u2208H} ( ||h \u2212 f||^2_{L2(P_X)} + \u03bb||h||^2_H ) for \u03bb \u2265 0. Finally, the estimated function evaluated at a point x can be written as \u02c6f(x) = K(X, x) (K(X, X) + n\u03bbI)^{\u22121} Y, where X \u2208 R^{n\u00d7d} are the inputs of the training samples and Y \u2208 R^{n\u00d71} are the training labels [Vovk, 2013].\n\n2.4 Related work\n\nBefore we present our framework, it is helpful to give a brief overview of the existing literature on theoretical analysis of transfer learning. Many previous works focused on settings where only unlabeled data from the target domain are available [Huang et al., 2006, Sugiyama et al., 2008, Yu and Szepesv\u00e1ri, 2012]. In particular, a line of research has been established based on distribution discrepancy, a loss-induced metric between the source and target distributions [Mansour et al., 2009, Ben-David et al., 2007, Blitzer et al., 2008, Cortes and Mohri, 2011, Mohri and Medina, 2012]. For example, recently Cortes and Mohri [2014] gave generalization bounds for kernel based methods under convex loss in terms of discrepancy.\nIn many real world applications, such as yield prediction from pictures [Nuske et al., 2014] or prediction of response time from fMRI [Verstynen, 2014], some labeled data from the target domain is also available. Cortes et al. [2015] used these data to improve their discrepancy minimization algorithm. Zhang et al. [2013] focused on modeling target shift (P(Y) changes), conditional shift (P(X|Y) changes), and a combination of both. Recently, Wang and Schneider [2014] proposed a kernel mean embedding method to match the conditional probability in the kernel space and later derived a generalization bound for this problem [Wang and Schneider, 2015]. Kuzborskij and Orabona [2013, 2016], Kuzborskij et al.
[2016] gave excess risk bounds for a target domain estimator in the form of a linear combination of estimators from multiple source domains and an additional linear function. Ben-David and Urner [2013] showed a similar bound for the same setting with different quantities capturing the relatedness. Wang et al. [2016] showed that if the features of the source and target domains lie in [0, 1]^d, then using an orthonormal basis function estimator, transfer learning achieves a better excess risk guarantee if f^ta \u2212 f^so can be approximated by the basis functions more easily than f^ta. Their work can be viewed as a special case of our framework using the transformation function G(a, b) = a + b.\n\n3 Transformation Functions\nIn this section, we first define our class of models and give a meta-algorithm to learn the target regression function. Our models are based on the idea that transfer learning is helpful when one transforms the target domain regression problem into a simpler regression problem using source domain knowledge. Consider the following example.\n\nExample: Offset Transfer. Let f^so(x) = \u221a(x(1 \u2212 x)) sin(2.1\u03c0/(x + 0.05)) and f^ta(x) = f^so(x) + x. f^so is the so-called Doppler function. It requires a large number of samples to estimate well because of its lack of smoothness [Wasserman, 2006]. For the same reason, f^ta is also difficult to estimate directly. However, if we have enough data from the source domain, we can obtain a fairly good estimate of f^so. Further, notice that the offset function w(x) = f^ta(x) \u2212 f^so(x) = x is just a linear function.
Thus, instead of directly using T^ta to estimate f^ta, we can use the target domain samples to find an estimate of w(x), denoted by \u02c6w(x), and our estimator for the target domain is simply \u02c6f^ta(x) = \u02c6f^so(x) + \u02c6w(x). Figure 1b shows that this technique gives an improved fit for f^ta.\nThe previous example exploits the fact that the function w(x) = f^ta(x) \u2212 f^so(x) is a simpler function than f^ta. We now generalize this idea. Formally, we define a transformation function as G(a, b) : R^2 \u2192 R, where we assume that given a \u2208 R, G(a, \u00b7) is invertible. Here a will be the regression function of the source domain evaluated at some point, and the output of G will be the regression function of the target domain evaluated at the same point. Let G^{\u22121}_a(\u00b7) denote the inverse of G(a, \u00b7), so that G(a, G^{\u22121}_a(c)) = c. For example, if G(a, b) = a + b, then G^{\u22121}_a(c) = c \u2212 a. For a given G and a pair (f^so, f^ta), they together induce a function w_G(x) = G^{\u22121}_{f^so(x)}(f^ta(x)). In the offset transfer example, w_G(x) = x. By this definition, for any x, we have G(f^so(x), w_G(x)) = f^ta(x). We call w_G the auxiliary function of the transformation function G. In the HTL setting, G is a user-defined transformation that represents the user's prior knowledge of the relation between the source and target domains. We now list some other examples:\nExample: Scale Transfer. Consider G(a, b) = ab. This transformation function is useful when f^so and f^ta satisfy a smooth scale transfer. For example, if f^ta = c f^so for some constant c, then w_G(x) = c because f^ta(x) = G(f^so(x), w_G(x)) = f^so(x) w_G(x) = f^so(x) c. See Figure 1c.\nExample: Non-Transfer. Consider G(a, b) = b. Notice that f^ta(x) = w_G(x), so f^so is irrelevant. Thus this model is equivalent to traditional regression on the target domain, since data from the source domain do not help.\n\n3.1 A Meta Algorithm\n\nGiven the transformation G and data, we provide a general procedure to estimate f^ta. The spirit of the algorithm is to turn learning a complex function f^ta into learning an easier function w_G. First, we use an algorithm A_so that takes T^so and outputs \u02c6f^so. Since we have sufficient data from the source domain, \u02c6f^so should be close to the true regression function f^so. Second, we construct a new data set using the n_ta data points from the target domain: T^{w_G} = {(X^ta_i, H_G(\u02c6f^so(X^ta_i), Y^ta_i))}_{i=1}^{n_ta}, where H_G : R^2 \u2192 R satisfies E[H_G(f^so(X^ta_i), Y^ta_i)] = G^{\u22121}_{f^so(X^ta_i)}(f^ta(X^ta_i)) = w_G(X^ta_i), where the expectation is taken over \u03b5^ta. Thus, we can use these newly constructed data to learn w_G with an algorithm A_{w_G}: \u02c6w_G = A_{w_G}(T^{w_G}). Finally, we plug the trained \u02c6f^so and \u02c6w_G into the transformation G to obtain an estimator for f^ta: \u02c6f^ta(X) = G(\u02c6f^so(X), \u02c6w_G(X)). Pseudocode is shown in Algorithm 1.\n\nUnbiased Estimator H_G(f^so(X^ta), Y^ta): In Algorithm 1, we require an unbiased estimator for w_G(X^ta).
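A minimal sketch of this procedure, ours for illustration only: the vectorized kernel smoother below is an arbitrary stand-in for the subroutines A_so and A_wG, and H_G is taken to be G^{-1}_a applied to Y, which (as discussed next) is valid when G is linear in b.

```python
import numpy as np

def ks(X, Y, h=0.1):
    """A toy vectorized Nadaraya-Watson smoother, standing in for A_so / A_wG."""
    def f_hat(x):
        x = np.atleast_1d(np.asarray(x, dtype=float))[:, None]
        w = np.exp(-((x - X[None, :]) ** 2) / (2 * h ** 2))
        out = (w @ Y) / w.sum(axis=1)
        return out if out.size > 1 else out[0]
    return f_hat

def transfer_learn(T_so, T_ta, G, H_G, A_so, A_wG):
    """The four steps of the meta-algorithm (Algorithm 1)."""
    X_so, Y_so = T_so
    X_ta, Y_ta = T_ta
    f_so_hat = A_so(X_so, Y_so)                  # Step 1: train the source regression
    W = H_G(f_so_hat(X_ta), Y_ta)                # Step 2: construct T^{w_G}
    w_G_hat = A_wG(X_ta, W)                      # Step 3: train the auxiliary function
    return lambda x: G(f_so_hat(x), w_G_hat(x))  # Step 4: combine

# Offset Transfer: G(a, b) = a + b, so H_G(a, y) = y - a.
X_so = np.linspace(0, 1, 200); Y_so = 2 * X_so      # noiseless toy source data
X_ta = np.linspace(0, 1, 20);  Y_ta = 2 * X_ta + 1  # target is source plus offset 1
f_ta_hat = transfer_learn((X_so, Y_so), (X_ta, Y_ta),
                          G=lambda a, b: a + b, H_G=lambda a, y: y - a,
                          A_so=ks, A_wG=ks)
```

Swapping in the Scale Transfer only changes the two lambdas (G and H_G); the meta-algorithm itself is unchanged.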
Note that if G(a, b) is linear in b or \u03b5^ta = 0, we can simply set H_G(f^so(X), Y) = G^{\u22121}_{f^so(X)}(Y). For other scenarios, G^{\u22121}_{f^so(X^ta_i)}(Y^ta_i) is biased, E[G^{\u22121}_{f^so(X^ta_i)}(Y^ta_i)] \u2260 G^{\u22121}_{f^so(X^ta_i)}(f^ta(X^ta_i)), and we need to design an estimator using the structure of G.\n\nAlgorithm 1 Transformation Function based Transfer Learning\nInputs: source domain data T^so = {(X^so_i, Y^so_i)}_{i=1}^{n_so}; target domain data T^ta = {(X^ta_i, Y^ta_i)}_{i=1}^{n_ta}; transformation function G; algorithm A_so to train f^so; algorithm A_{w_G} to train w_G; unbiased estimator H_G for estimating w_G.\nOutput: regression function \u02c6f^ta for the target domain.\n1: Train the source domain regression function \u02c6f^so = A_so(T^so).\n2: Construct new data using \u02c6f^so and T^ta: T^{w_G} = {(X^ta_i, W_i)}_{i=1}^{n_ta}, where W_i = H_G(\u02c6f^so(X^ta_i), Y^ta_i).\n3: Train the auxiliary function: \u02c6w_G = A_{w_G}(T^{w_G}).\n4: Output the estimated regression function for the target domain: \u02c6f^ta(X) = G(\u02c6f^so(X), \u02c6w_G(X)).\n\nRemark 1: Many transformation functions are equivalent to a transformation function G'(a, b) that is linear in b. For example, for G(a, b) = ab^2, i.e., f^ta(x) = f^so(x) w_G^2(x), consider G'(a, b) = ab where b in G' stands for b^2 in G, i.e., f^ta(x) = f^so(x) w_{G'}(x). Therefore w_{G'} = w_G^2, and we only need to estimate w_{G'} well instead of estimating w_G. More generally, if G(a, b) can be factorized as G(a, b) = g_1(a) g_2(b), i.e., f^ta(x) = g_1(f^so(x)) g_2(w_G(x)), we only need to estimate g_2(w_G(x)), and the convergence rate depends on the structure of g_2(w_G(x)).\nRemark 2: When G is not linear in b and \u03b5^ta \u2260 0, observe that in Algorithm 1 we treat the Y^ta_i as noisy covariates for estimating the w_G(X_i). This problem is called error-in-variables or measurement error and has been widely studied in the statistics literature. For details, we refer the reader to the seminal work by Carroll et al. [2006]. There is no universal estimator for the measurement error problem. In Section B, we provide a common technique, regression calibration, to deal with the measurement error problem.\n\n4 Excess Risk Analyses\nIn this section, we present theoretical analyses for the proposed class of models and estimators. First, we need to impose some conditions on G. The first assures that if the estimates of f^so and w_G are close to the source regression function and the auxiliary function, then our estimator for f^ta is close to the true target regression function. The second assures that we are estimating a regular function.\n\nAssumption 1 G(a, b) is L-Lipschitz, |G(a, b) \u2212 G(a', b')| \u2264 L||(a, b) \u2212 (a', b')||_2, and is invertible with respect to b given a, i.e., if G(x, y) = z then G^{\u22121}_x(z) = y.\n\nAssumption 2 Given G, the induced auxiliary function w_G is bounded: for x with ||x||_2 bounded as in Section 2.1, w_G(x) \u2264 B for some B > 0.\nOffset Transfer and Non-Transfer satisfy these conditions with L = 1 and B equal to the bound on |Y| from Section 2.1. Scale Transfer satisfies these assumptions when f^so is bounded away from 0.
Lastly, we assume our unbiased estimator is also regular.\n\nAssumption 3 For x and y in the compact domains of Section 2.1, H_G(x, y) \u2264 B for some B > 0, and H_G is Lipschitz continuous in its first argument: |H_G(x, y) \u2212 H_G(x', y)| \u2264 L|x \u2212 x'| for some L > 0.\nWe begin with a general result which only requires the stability of A_{w_G}:\n\nTheorem 1 Suppose that for any two sets of samples that have the same features but different labels, T = {(X^ta_i, W_i)}_{i=1}^{n_ta} and \u02dcT = {(X^ta_i, \u02dcW_i)}_{i=1}^{n_ta}, the algorithm A_{w_G} for training w_G satisfies\n||A_{w_G}(T) \u2212 A_{w_G}(\u02dcT)||_\u221e \u2264 \u2211_{i=1}^{n_ta} c_i(X^ta_i) |W_i \u2212 \u02dcW_i|,   (1)\nwhere c_i only depends on X^ta_i. Then for any x,\n|\u02c6f^ta(x) \u2212 f^ta(x)|^2 = O( |\u02c6f^so(x) \u2212 f^so(x)|^2 + |\u02dcw_G(x) \u2212 w_G(x)|^2 + ( \u2211_{i=1}^{n_ta} c_i(X^ta_i) |\u02c6f^so(X^ta_i) \u2212 f^so(X^ta_i)| )^2 ),\nwhere \u02dcw_G = A_{w_G}({(X^ta_i, H_G(f^so(X^ta_i), Y^ta_i))}_{i=1}^{n_ta}) is the auxiliary function estimate trained using the true source domain regression function.\nTheorem 1 shows how the estimation error in the source domain function propagates to our estimate of the target domain function. Notice that if we happen to know f^so, then the error is bounded by O(|\u02dcw_G(x) \u2212 w_G(x)|^2), the estimation error of w_G. However, since we use the estimated f^so to construct training samples for w_G, the error might accumulate as n_ta increases. Though the third term in Theorem 1 might increase with n_ta, it also depends on the estimation error of f^so, which is relatively small because of the large amount of source domain data.\nThe stability condition (1) is related to the uniform stability introduced by Bousquet and Elisseeff [2002], who consider how much the output changes if one training instance is removed or replaced by another, whereas our condition compares two different training label sets. The connection between transfer learning and stability has been observed by Kuzborskij and Orabona [2013], Liu et al. [2016], and Zhang [2015] in different settings, but they only showed bounds for the generalization error, not for the excess risk.\n\n4.1 Kernel Smoothing\n\nWe first analyze the kernel smoothing method.\n\nTheorem 2 Suppose the support of X^ta is a subset of the support of X^so and the probability densities of P_{X^so} and P_{X^ta} are uniformly bounded away from zero on their supports. Further assume f^so is (\u03bb_so, \u03b1_so) H\u00f6lder and w_G is (\u03bb_{w_G}, \u03b1_{w_G}) H\u00f6lder. If we use kernel smoothing estimators for f^so and w_G with bandwidths h_so \u224d n_so^{\u22121/(2\u03b1_so+d)} and h_{w_G} \u224d n_ta^{\u22121/(2\u03b1_{w_G}+d)}, then with probability at least 1 \u2212 \u03b4 the risk satisfies\nE[R(\u02c6f^ta)] \u2212 R(f^ta) = O( ( n_so^{\u22122\u03b1_so/(2\u03b1_so+d)} + n_ta^{\u22122\u03b1_{w_G}/(2\u03b1_{w_G}+d)} ) log(1/\u03b4) ),\nwhere the expectation is taken over T^so and T^ta.\nTheorem 2 shows that the risk has two sources, one from the estimation of f^so and one from the estimation of w_G. The first term is relatively small in the setting we focus on, since in typical transfer learning scenarios n_so >> n_ta. The second term shows the power of transfer learning in transforming a possibly complex target regression function into a simpler auxiliary function.
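The bandwidth choices and the resulting rate comparison in Theorem 2 can be made concrete in a few lines. The sketch below is illustrative: the smoothness exponents and sample sizes are made-up values, chosen only so that the auxiliary function is smoother than the target function.

```python
# Theorem 2 rates: transfer risk = O(n_so^(-2*a_so/(2*a_so+d)) + n_ta^(-2*a_wG/(2*a_wG+d))),
# versus Omega(n_ta^(-2*a_fta/(2*a_fta+d))) for learning from target data alone.
def ks_bandwidth(n, alpha, d):
    """Theorem 2 bandwidth choice, h ~ n^(-1/(2*alpha + d))."""
    return n ** (-1.0 / (2 * alpha + d))

def transfer_rate(n_so, n_ta, a_so, a_wG, d):
    return n_so ** (-2 * a_so / (2 * a_so + d)) + n_ta ** (-2 * a_wG / (2 * a_wG + d))

def target_only_rate(n_ta, a_fta, d):
    return n_ta ** (-2 * a_fta / (2 * a_fta + d))

# Rough target function (a_fta = 0.3) but smooth auxiliary function (a_wG = 0.9),
# with abundant source data: the transfer rate beats the target-only rate.
n_so, n_ta, d = 100_000, 100, 1
print(transfer_rate(n_so, n_ta, a_so=0.3, a_wG=0.9, d=d))
print(target_only_rate(n_ta, a_fta=0.3, d=d))
```

Note that both summands of the transfer rate must be small: plentiful source data suppresses the first, and smoothness of w_G suppresses the second.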
It is well known that learning f^ta using only target domain data incurs a risk of order \u03a9(n_ta^{\u22122\u03b1_{f^ta}/(2\u03b1_{f^ta}+d)}). Thus, if the auxiliary function is smoother than the target regression function, i.e., \u03b1_{w_G} > \u03b1_{f^ta}, we obtain a better statistical rate.\n\n4.2 Kernel Ridge Regression\n\nNext, we give an upper bound for the excess risk using KRR.\n\nTheorem 3 Suppose P_{X^so} = P_{X^ta}, the eigenvalues of the integral operator T_K satisfy \u03bc_i \u2264 a i^{\u22121/p} for i \u2265 1 with a \u2265 16 C_Y^4 and p \u2208 (0, 1), and there exists a constant C \u2265 1 such that ||f||_\u221e \u2264 C ||f||_H^p \u00b7 ||f||_{L2(P_X)}^{1\u2212p} for f \u2208 H. Further assume that A_{f^so}(\u03bb) \u2264 c\u03bb^{\u03b2_so} and A_{w_G}(\u03bb) \u2264 c\u03bb^{\u03b2_{w_G}}. If we use KRR for estimating f^so and w_G with regularization parameters \u03bb_so \u224d n_so^{\u22121/(\u03b2_so+p)} and \u03bb_{w_G} \u224d n_ta^{\u22121/(\u03b2_{w_G}+p)}, then with probability at least 1 \u2212 \u03b4 the excess risk satisfies\nE[R(\u02c6f^ta)] \u2212 R(f^ta) = O( ( n_ta^{2/(\u03b2_{w_G}+p)} log(n_ta) \u00b7 n_so^{\u2212\u03b2_so/(\u03b2_so+p)} + n_ta^{\u2212\u03b2_{w_G}/(\u03b2_{w_G}+p)} ) log(1/\u03b4) ),\nwhere the expectation is taken over T^so and T^ta.\nSimilar to Theorem 2, Theorem 3 shows that the estimation error comes from two sources. For estimating the auxiliary function w_G, the statistical rate depends on properties of the kernel-induced RKHS and on how far the auxiliary function is from this space. For ease of presentation, we assume P_{X^so} = P_{X^ta}, so the approximation errors A_{f^so} and A_{f^ta} are defined on the same domain. The error of estimating f^so is amplified by O(\u03bb_{w_G}^{\u22122} log(n_ta)), which is worse than for nonparametric kernel smoothing. We believe this \u03bb_{w_G}^{\u22122} factor is nearly tight, because Bousquet and Elisseeff [2002] have shown that the uniform algorithmic stability parameter for KRR is O(\u03bb^{\u22122}). Steinwart et al. [2009] showed that for non-transfer learning the optimal statistical rate for the excess risk is \u03a9(n_ta^{\u2212\u03b2_ta/(\u03b2_ta+p)}), so if \u03b2_{w_G} \u2265 \u03b2_ta and n_so is sufficiently large, then we achieve an improved convergence rate through transfer learning.\nRemark: Theorems 2 and 3 are not directly comparable because the assumptions on the function spaces in the two theorems are different. In general, a H\u00f6lder space is only a Banach space, not a Hilbert space. We refer readers to Theorem 1 in Zhou [2008] for details.\n\n5 Finding the Best Transformation Function\nIn the previous section we showed that for a specific transformation function G, if the auxiliary function is smoother than the target regression function, then we obtain smaller excess risk. In practice, we would like to try a class of transformation functions G, which is possibly uncountable. We can construct a finite subset \u1e20 \u2282 G such that for each G in G there is a \u1e20-member close to G. Here we give an example. Consider transformation functions of the form G = {G(a, b) = \u03b1a + b : |\u03b1| \u2264 L_\u03b1, |a| \u2264 L_a}. We can quantize this set by considering the subset \u1e20 = {G(a, b) = k\u03b5a + b : k = \u2212K, ..., 0, ..., K}, with \u03b5 = L_\u03b1/K and |a| \u2264 L_a.
Here \u03b5 is the quantization unit.\nThe next theorem shows that we only need to search for the transformation function in \u1e20 whose corresponding estimator \u02c6f^ta_G has the lowest empirical risk on the validation dataset.\n\nTheorem 4 Let G be a class of transformation functions and \u1e20 be its \u03b5-cover in ||\u00b7||_\u221e norm. Suppose w_G satisfies the same assumption as in Theorem 1 and that for any two G_1, G_2 \u2208 G, ||w_{G_1} \u2212 w_{G_2}||_\u221e \u2264 L||G_1 \u2212 G_2||_\u221e for some constant L. Denote G^* = argmin_{G\u2208G} R(\u02c6f^ta_G) and \u011c = argmin_{G\u2208\u1e20} \u02c6R(\u02c6f^ta_G), where \u02c6R is the empirical risk on the validation set. If we choose \u03b5 = O( R(\u02c6f^ta_{G^*}) / \u2211_{i=1}^{n_ta} c_i ) and n_val = \u03a9( log(|\u1e20|/\u03b4) ), then with probability at least 1 \u2212 \u03b4, E[R(\u02c6f^ta_\u011c)] \u2212 R(f^ta) = O( E[R(\u02c6f^ta_{G^*})] \u2212 R(f^ta) ), where the expectation is taken over T^so and T^ta.\nRemark 1: This theorem implies that if the non-transfer function (G(a, b) = b) is in \u1e20, then we end up choosing a transformation function whose excess risk is of the same order as that of the no-transfer learning algorithm, thus avoiding negative transfer.\nRemark 2: The number of validation samples depends only logarithmically on the size of the set of transformation functions. Therefore, we only need a very small amount of data from the target domain for cross-validation.\n\n6 Experiments\nIn this section we use robotics and neural imaging data to demonstrate the effectiveness of the proposed framework.
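Before turning to the data, the selection rule analyzed in Section 5 (fit one estimator per candidate transformation, then keep the one with the lowest empirical risk on held-out validation data) can be sketched as follows. The candidate list, the function names, and the trivial fitting subroutine used in testing are all illustrative.

```python
import numpy as np

def select_transformation(candidates, fit, X_val, Y_val):
    """Pick the fitted estimator with lowest empirical L2 risk on validation data.

    candidates: dict mapping a name to a pair (G, H_G), where G(a, b) is the
    transformation and H_G(a, y) inverts it in b (the unbiased estimator).
    fit: a callable that trains an estimator f_hat given (G, H_G).
    """
    best_name, best_f, best_risk = None, None, np.inf
    for name, (G, H_G) in candidates.items():
        f_hat = fit(G, H_G)
        risk = np.mean((Y_val - np.array([f_hat(x) for x in X_val])) ** 2)
        if risk < best_risk:
            best_name, best_f, best_risk = name, f_hat, risk
    return best_name, best_f

# Including non-transfer G(a, b) = b guards against negative transfer (Remark 1).
candidates = {
    "offset":       (lambda a, b: a + b, lambda a, y: y - a),
    "scale":        (lambda a, b: a * b, lambda a, y: y / a),
    "non-transfer": (lambda a, b: b,     lambda a, y: y),
}
```

Per Remark 2, only a logarithmic number of validation points is needed relative to the number of candidates, so this search is cheap in target-domain data.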
We conduct experiments on real-world data sets with the following procedures.\n\n7\n\n\fnta = 10\nOnly Target KS\n0.086 \u00b1 0.022\nOnly Target KRR\n0.080 \u00b1 0.017\nOnly Source KRR 0.098 \u00b1 0.017\nCombined KS\n0.092 \u00b1 0.011\nCombined KRR\n0.087 \u00b1 0.025\nCDM\n0.105 \u00b1 0.023\nOffset KS\n0.080 \u00b1 0.026\nOffset KRR\n0.146 \u00b1 0.112\n0.078 \u00b1 0.022\nScale KS\nScale KRR\n0.102 \u00b1 0.033\n\nnta = 320\n0.063 \u00b1 0.005\n0.040 \u00b1 0.005\n0.098 \u00b1 0.017\n0.067 \u00b1 0.006\n0.041 \u00b1 0.004\n0.056 \u00b1 0.004\n0.052 \u00b1 0.004\n0.041 \u00b1 0.003\n0.055 \u00b1 0.004\n0.042 \u00b1 0.002\nTable 1: 1 standard deviation intervals for the mean squared errors of various algorithms when\ntransferring from kin-8fm to kin-8nh. The values in bold are the smallest errors for each nta. Only\nSource KS has much worse performance than other algorithms so we do not show its result here.\n\nnta = 160\n0.065 \u00b1 0.006\n0.048 \u00b1 0.006\n0.098 \u00b1 0.017\n0.074 \u00b1 0.006\n0.047 \u00b1 0.003\n0.053 \u00b1 0.009\n0.050 \u00b1 0.003\n0.043 \u00b1 0.004\n0.054 \u00b1 0.008\n0.044 \u00b1 0.004\n\nnta = 20\n0.076 \u00b1 0.010\n0.078 \u00b1 0.022\n0.098 \u00b1 0.017\n0.084 \u00b1 0.008\n0.077 \u00b1 0.015\n0.074 \u00b1 0.020\n0.066 \u00b1 0.023\n0.066 \u00b1 0.017\n0.065 \u00b1 0.013\n0.095 \u00b1 0.100\n\nnta = 40\n0.066 \u00b1 0.008\n0.063 \u00b1 0.013\n0.098 \u00b1 0.017\n0.077 \u00b1 0.009\n0.062 \u00b1 0.009\n0.064 \u00b1 0.008\n0.052 \u00b1 0.006\n0.053 \u00b1 0.007\n0.056 \u00b1 0.009\n0.057 \u00b1 0.014\n\nnta = 80\n0.064 \u00b1 0.007\n0.050 \u00b1 0.007\n0.098 \u00b1 0.017\n0.075 \u00b1 0.006\n0.061 \u00b1 0.005\n0.060 \u00b1 0.007\n0.054 \u00b1 0.006\n0.048 \u00b1 0.006\n0.056 \u00b1 0.005\n0.052 \u00b1 0.010\n\n\u2022 Directly training on the target data T ta (Only Target KS, Only Target KRR).\n\u2022 Only training on the source data T so (Only Source KS, Only Source KRR).\n\u2022 Training on the combined source and target data 
(Combined KS, Combined KRR).
• The CDM algorithm proposed by Wang and Schneider [2014] with KRR (CDM).
• The algorithm described in this paper with G(a, b) = (a + α)b, where α is a hyper-parameter (Scale KS, Scale KRR).
• The algorithm described in this paper with G(a, b) = αa + b, where α is a hyper-parameter (Offset KS, Offset KRR).

For the first experiment, we vary the size of the target domain to study the effect of nta relative to nso. We use two datasets from the 'kin' family in Delve [Rasmussen et al., 1996]: 'kin-8fm' and 'kin-8nh', both with 8-dimensional inputs. kin-8fm has fairly linear output and low noise; kin-8nh, on the other hand, has non-linear output and high noise. We consider the task of transfer learning from kin-8fm to kin-8nh. In this experiment, we set nso to 320, and vary nta in {10, 20, 40, 80, 160, 320}. Hyper-parameters were picked using grid search with 10-fold cross-validation on the target data (or source domain data when not using the target domain data).
Table 1 shows the mean squared errors on the target data. To better understand the results, we show a box plot of the mean squared errors for nta = 40 onwards in Figure 2(a). The results for nta = 10 and nta = 20 have high variance, so we do not show them in the plot. We also omit the results of Only Source KRR because of its poor performance. We note that our proposed algorithm outperforms other methods across nearly all values of nta, especially when nta is small. Only when there are as many points in the target as in the source does simply training on the target give the best performance. This is to be expected, since the primary purpose of transfer learning is to alleviate the lack of data in the target domain. Though quite comparable, the performance of the scale methods was worse than that of the offset methods in this experiment.
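Both two-stage constructions admit a compact implementation: fit the source hypothesis first, then fit a second regressor on the transformed target labels. Below is a minimal sketch with Gaussian-kernel ridge regression in NumPy, using the toy functions from the introduction (f_so(x) = sin(4πx), f_ta(x) = sin(4πx) + 4πx). The function names, data sizes, and hyper-parameter values here are illustrative only; in practice α and the kernel parameters would be chosen by cross-validation as described above.

```python
import numpy as np

def krr_fit(X, y, gamma=100.0, lam=1e-3):
    """Gaussian-kernel ridge regression; returns a predictor function.

    Solves (K + lam*I) coef = y, where K_ij = exp(-gamma * ||x_i - x_j||^2).
    """
    K = np.exp(-gamma * np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2))
    coef = np.linalg.solve(K + lam * np.eye(len(X)), y)

    def predict(Z):
        Kz = np.exp(-gamma * np.sum((Z[:, None, :] - X[None, :, :]) ** 2, axis=2))
        return Kz @ coef

    return predict

def offset_krr(Xso, yso, Xta, yta, alpha=1.0, **kw):
    """Offset transfer, G(a, b) = alpha*a + b: fit f_so on the source,
    then fit the residual y_ta - alpha*f_so(x) on the target data."""
    f_so = krr_fit(Xso, yso, **kw)
    f_off = krr_fit(Xta, yta - alpha * f_so(Xta), **kw)
    return lambda Z: alpha * f_so(Z) + f_off(Z)

def scale_krr(Xso, yso, Xta, yta, alpha=1.0, **kw):
    """Scale transfer, G(a, b) = (a + alpha)*b: fit f_so on the source,
    then fit the ratio y_ta / (f_so(x) + alpha) on the target data."""
    f_so = krr_fit(Xso, yso, **kw)
    f_sc = krr_fit(Xta, yta / (f_so(Xta) + alpha), **kw)
    return lambda Z: (f_so(Z) + alpha) * f_sc(Z)

# Toy setting from the introduction: the two domains differ by the
# offset 4*pi*x, so the offset construction only has to learn a
# smooth linear residual from the small target sample.
Xso = np.linspace(0, 1, 200)[:, None]
yso = np.sin(4 * np.pi * Xso[:, 0])
Xta = np.linspace(0, 1, 20)[:, None]
yta = np.sin(4 * np.pi * Xta[:, 0]) + 4 * np.pi * Xta[:, 0]

f_hat = offset_krr(Xso, yso, Xta, yta, alpha=1.0)
Xte = np.linspace(0.05, 0.95, 50)[:, None]
mse = np.mean((f_hat(Xte) - (np.sin(4 * np.pi * Xte[:, 0])
                             + 4 * np.pi * Xte[:, 0])) ** 2)
```

With nso = 200 source points and only nta = 20 target points, the offset predictor recovers the target function well because the residual it must learn is much simpler than the target function itself, which is exactly the intuition behind the faster rates in the analysis.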
In general, we would use cross-validation to choose between the two.
We now consider another real-world dataset where the covariates are fMRI images taken while subjects perform a Stroop task [Stroop, 1935]. We use the dataset collected by Verstynen [2014], which contains fMRI data of 28 subjects. Each participant was presented with a total of 120 trials; fMRI data were collected throughout the trials and went through a standard post-processing scheme. The result is a feature vector for each trial that describes the activity of brain regions (voxels), and the goal is to use this to predict the response time.
To frame the problem in the transfer learning setting, we treat the data of all but one subject as the source; the goal is to predict on the remaining subject. We performed five repetitions for each algorithm by drawing nso = 300 data points randomly from the 3000 points in the source domain. We used nta = 80 points from the target domain for training and cross-validation; evaluation was done on the 35 remaining points in the target domain. Figure 2(b) shows a box plot of the coefficient of determination (R-squared) for the best-performing algorithms. R-squared is defined as 1 − SSres/SStot, where SSres is the sum of squared residuals and SStot is the total sum of squares. Note that R-squared can be negative when predicting on unseen samples that were not used to fit the model, as in our case. When positive, it indicates the proportion of explained variance in the dependent variable (the higher the better). From the plot, it is clear that Offset KRR and Only Target KRR perform best on average, and Offset KRR has smaller variance.

Figure 2: Box plots of experimental results on real datasets. Each box extends from the first to third quartile, and the horizontal lines in the middle are medians.
For the robotics data, we report mean squared error (the lower the better), and for the fMRI data, we report R-squared (the higher the better). For ease of presentation, we only show results of algorithms with good performance.

                   Mean      Median    Standard Deviation
Only Target KS    -0.0096    0.0444    0.1041
Only Target KRR    0.1041    0.1186    0.2361
Only Source KS    -0.4932   -0.5366    0.4555
Only Source KRR   -0.8763   -0.9363    0.6265
Combined KS       -0.7540   -0.2023    1.5109
Combined KRR      -0.5868   -0.0691    1.3223
CDM               -3.1183   -3.4510    2.6473
Offset KS          0.1190    0.1081    0.0612
Offset KRR         0.1080    0.1221    0.0682
Scale KS           0.0017   -0.0321    0.0632
Scale KRR          0.0897    0.1107    0.1104

Table 2: Mean, median, and standard deviation of the coefficient of determination (R-squared) of various algorithms on the fMRI dataset.

Table 2 shows the full results for the fMRI task. Using only the source data produces large negative R-squared values, and while Only Target KRR does produce a positive mean R-squared, it comes with high variance. On the other hand, both Offset methods have low variance, showing consistent performance. For this particular case, the Scale methods do not perform as well as the Offset methods, and, as noted earlier, in general we would use cross-validation to select an appropriate transformation function.

7 Conclusion and Future Work
In this paper, we proposed a general transfer learning framework for the HTL regression problem when some data are available from the target domain. Our theoretical analysis shows that transfer learning can achieve a better statistical rate than standard supervised learning.
We now list two future directions in which our results could be further improved.
First, in many real-world applications, a large amount of unlabeled data from the target domain is also available. Combining our proposed framework with previous work for this scenario [Cortes and Mohri, 2014, Huang et al., 2006] is a promising direction to pursue. Second, we only present upper bounds in this paper; obtaining lower bounds for HTL and other transfer learning scenarios is an interesting direction.

8 Acknowledgements
S.S.D. and B.P. were supported by NSF grant IIS1563887 and the ARPA-E Terra program. A.S. was supported by AFRL grant FA8750-17-2-0212.
References

Shai Ben-David and Ruth Urner. Domain adaptation as learning with auxiliary information. In New Directions in Transfer and Multi-Task Workshop @ NIPS, 2013.

Shai Ben-David, John Blitzer, Koby Crammer, and Fernando Pereira. Analysis of representations for domain adaptation. Advances in Neural Information Processing Systems, 19:137, 2007.

John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jennifer Wortman. Learning bounds for domain adaptation. In Advances in Neural Information Processing Systems, pages 129–136, 2008.

Olivier Bousquet and André Elisseeff. Stability and generalization. Journal of Machine Learning Research, 2(Mar):499–526, 2002.

Raymond J Carroll, David Ruppert, Leonard A Stefanski, and Ciprian M Crainiceanu. Measurement error in nonlinear models: a modern perspective. CRC Press, 2006.

Corinna Cortes and Mehryar Mohri. Domain adaptation in regression. In Algorithmic Learning Theory, pages 308–323. Springer, 2011.

Corinna Cortes and Mehryar Mohri. Domain adaptation and sample bias correction theory and algorithm for regression. Theoretical Computer Science, 519:103–126, 2014.

Corinna Cortes, Mehryar Mohri, and Andrés Muñoz Medina. Adaptation algorithm and theory based on generalized discrepancy. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 169–178. ACM, 2015.

Cecil C Craig. On the Tchebychef inequality of Bernstein. The Annals of Mathematical Statistics, 4(2):94–102, 1933.

Li Fei-Fei, Rob Fergus, and Pietro Perona. One-shot learning of object categories.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(4):594–611, 2006.

Jiayuan Huang, Arthur Gretton, Karsten M Borgwardt, Bernhard Schölkopf, and Alex J Smola. Correcting sample selection bias by unlabeled data. In Advances in Neural Information Processing Systems, pages 601–608, 2006.

Samory Kpotufe and Vikas Garg. Adaptivity to local smoothness and dimension in kernel regression. In Advances in Neural Information Processing Systems, pages 3075–3083, 2013.

Ilja Kuzborskij and Francesco Orabona. Stability and hypothesis transfer learning. In ICML (3), pages 942–950, 2013.

Ilja Kuzborskij and Francesco Orabona. Fast rates by transferring from auxiliary hypotheses. Machine Learning, pages 1–25, 2016.

Ilja Kuzborskij, Francesco Orabona, and Barbara Caputo. From n to n+1: Multiclass transfer incremental learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3358–3365, 2013.

Ilja Kuzborskij, Francesco Orabona, and Barbara Caputo. Scalable greedy algorithms for transfer learning. Computer Vision and Image Understanding, 2016.

Tongliang Liu, Dacheng Tao, Mingli Song, and Stephen Maybank. Algorithm-dependent generalization bounds for multi-task learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2016.

Yishay Mansour, Mehryar Mohri, and Afshin Rostamizadeh. Domain adaptation: Learning bounds and algorithms. arXiv preprint arXiv:0902.3430, 2009.

Mehryar Mohri and Andres Munoz Medina. New analysis and algorithm for learning with drifting distributions. In Algorithmic Learning Theory, pages 124–138. Springer, 2012.

Stephen Nuske, Kamal Gupta, Srinivasa Narasimhan, and Sanjiv Singh. Modeling and calibrating visual yield estimates in vineyards. In Field and Service Robotics, pages 343–356.
Springer, 2014.

Francesco Orabona, Claudio Castellini, Barbara Caputo, Angelo Emanuele Fiorilla, and Giulio Sandini. Model adaptation with least-squares SVM for adaptive hand prosthetics. In IEEE International Conference on Robotics and Automation (ICRA), pages 2897–2903. IEEE, 2009.

Carl Edward Rasmussen, Radford M Neal, Geoffrey Hinton, Drew van Camp, Michael Revow, Zoubin Ghahramani, Rafal Kustra, and Rob Tibshirani. Delve data for evaluating learning in valid experiments. URL http://www.cs.toronto.edu/delve, 1996.

Ingo Steinwart, Don R Hush, and Clint Scovel. Optimal rates for regularized least squares regression. In COLT, 2009.

J Ridley Stroop. Studies of interference in serial verbal reactions. Journal of Experimental Psychology, 18(6):643, 1935.

Masashi Sugiyama, Shinichi Nakajima, Hisashi Kashima, Paul V Buenau, and Motoaki Kawanabe. Direct importance estimation with model selection and its application to covariate shift adaptation. In Advances in Neural Information Processing Systems, pages 1433–1440, 2008.

Tatiana Tommasi, Francesco Orabona, and Barbara Caputo. Safety in numbers: Learning categories from few examples with multi model knowledge transfer. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3081–3088. IEEE, 2010.

Timothy D Verstynen. The organization and dynamics of corticostriatal pathways link the medial orbitofrontal cortex to future behavioral responses. Journal of Neurophysiology, 112(10):2457–2469, 2014.

Vladimir Vovk. Kernel ridge regression. In Empirical Inference, pages 105–116. Springer, 2013.

Xuezhi Wang and Jeff Schneider. Flexible transfer learning under support and model shift. In Advances in Neural Information Processing Systems, pages 1898–1906, 2014.

Xuezhi Wang and Jeff Schneider.
Generalization bounds for transfer learning under model shift. 2015.

Xuezhi Wang, Junier B Oliva, Jeff Schneider, and Barnabás Póczos. Nonparametric risk and stability analysis for multi-task learning problems. In 25th International Joint Conference on Artificial Intelligence (IJCAI), volume 1, page 2, 2016.

Larry Wasserman. All of Nonparametric Statistics. Springer Science & Business Media, 2006.

Jun Yang, Rong Yan, and Alexander G Hauptmann. Cross-domain video concept detection using adaptive SVMs. In Proceedings of the 15th ACM International Conference on Multimedia, pages 188–197. ACM, 2007.

Yaoliang Yu and Csaba Szepesvári. Analysis of kernel mean matching under covariate shift. arXiv preprint arXiv:1206.4650, 2012.

Kun Zhang, Krikamol Muandet, and Zhikun Wang. Domain adaptation under target and conditional shift. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pages 819–827, 2013.

Yu Zhang. Multi-task learning and algorithmic stability. In AAAI, volume 2, pages 6–2, 2015.

Ding-Xuan Zhou. Derivative reproducing properties for kernel methods in learning theory. Journal of Computational and Applied Mathematics, 220(1):456–463, 2008.