{"title": "Flexible Transfer Learning under Support and Model Shift", "book": "Advances in Neural Information Processing Systems", "page_first": 1898, "page_last": 1906, "abstract": "Transfer learning algorithms are used when one has sufficient training data for one supervised learning task (the source/training domain) but only very limited training data for a second task (the target/test domain) that is similar but not identical to the first. Previous work on transfer learning has focused on relatively restricted settings, where specific parts of the model are considered to be carried over between tasks. Recent work on covariate shift focuses on matching the marginal distributions on observations $X$ across domains. Similarly, work on target/conditional shift focuses on matching marginal distributions on labels $Y$ and adjusting conditional distributions $P(X|Y)$, such that $P(X)$ can be matched across domains. However, covariate shift assumes that the support of test $P(X)$ is contained in the support of training $P(X)$, i.e., the training set is richer than the test set. Target/conditional shift makes a similar assumption for $P(Y)$. Moreover, not much work on transfer learning has considered the case when a few labels in the test domain are available. Also little work has been done when all marginal and conditional distributions are allowed to change while the changes are smooth. In this paper, we consider a general case where both the support and the model change across domains. We transform both $X$ and $Y$ by a location-scale shift to achieve transfer between tasks. 
Since we allow more flexible transformations, the proposed method yields better results on both synthetic data and real-world data.", "full_text": "Flexible Transfer Learning under Support and Model\n\nShift\n\nXuezhi Wang\n\nComputer Science Department\nCarnegie Mellon University\nxuezhiw@cs.cmu.edu\n\nJeff Schneider\nRobotics Institute\n\nCarnegie Mellon University\nschneide@cs.cmu.edu\n\nAbstract\n\nTransfer learning algorithms are used when one has suf\ufb01cient training data for\none supervised learning task (the source/training domain) but only very limited\ntraining data for a second task (the target/test domain) that is similar but not iden-\ntical to the \ufb01rst. Previous work on transfer learning has focused on relatively\nrestricted settings, where speci\ufb01c parts of the model are considered to be car-\nried over between tasks. Recent work on covariate shift focuses on matching\nthe marginal distributions on observations X across domains. Similarly, work on\ntarget/conditional shift focuses on matching marginal distributions on labels Y\nand adjusting conditional distributions P (X|Y ), such that P (X) can be matched\nacross domains. However, covariate shift assumes that the support of test P (X)\nis contained in the support of training P (X), i.e., the training set is richer than the\ntest set. Target/conditional shift makes a similar assumption for P (Y ). Moreover,\nnot much work on transfer learning has considered the case when a few labels in\nthe test domain are available. Also little work has been done when all marginal\nand conditional distributions are allowed to change while the changes are smooth.\nIn this paper, we consider a general case where both the support and the model\nchange across domains. We transform both X and Y by a location-scale shift to\nachieve transfer between tasks. 
Since we allow more flexible transformations, the\nproposed method yields better results on both synthetic data and real-world data.\n\n1 Introduction\n\nIn a classical transfer learning setting, we have sufficient fully labeled data from the source domain\n(or the training domain) where we fully observe the data points X tr, and all corresponding labels\nY tr are known. On the other hand, we are given data points, X te, from the target domain (or the\ntest domain), but few or none of the corresponding labels, Y te, are given. The source and the target\ndomains are related but not identical, thus the joint distributions, P (X tr, Y tr) and P (X te, Y te), are\ndifferent across the two domains. Without any transfer learning, a statistical model learned from the\nsource domain does not directly apply to the target domain. The use of transfer learning algorithms\nminimizes, or at least reduces, the labeling work needed in the target domain. It learns and transfers a\nmodel based on the labeled data from the source domain and the data with few or no labels from\nthe target domain, and should perform well on the unlabeled data in the target domain. Some real-world\napplications of transfer learning include adapting a classification model that is trained on\nsome products to help learn classification models for other products [17], and learning a model on\nthe medical data for one disease and transferring it to another disease.\nThe real-world application we consider is an autonomous agriculture application where we want to\nmanage the growth of grapes in a vineyard [3]. Recently, robots have been developed to take images\nof the crop throughout the growing season. When the product is weighed at harvest at the end of\neach season, the yield for each vine will be known. The measured yield can be used to learn a model\nto predict yield from images. 
Farmers would like to know their yield early in the season so they\ncan make better decisions on selling the produce or nurturing the growth. Acquiring training labels\nearly in the season is very expensive because it requires a human to go out and manually estimate the\nyield. Ideally, we can apply a transfer-learning model which learns from previous years and/or on\nother grape varieties to minimize this manual yield estimation. Furthermore, if we decide that some\nof the vines have to be assessed manually to learn the model shift, a simultaneously applied active\nlearning algorithm will tell us which vines should be measured manually such that the labeling cost\nis minimized. Finally, there are two different objectives of interest. To better nurture the growth we\nneed an accurate estimate of the current yield of each vine. However, to make informed decisions\nabout pre-selling an appropriate amount of the crops, only an estimate of the sum of the vine yields\nis needed. We call these problems active learning and active surveying respectively and they lead to\ndifferent selection criteria.\nIn this paper, we focus our attention on real-valued regression problems. We propose a transfer\nlearning algorithm that allows both the support on X and Y , and the model P (Y |X) to change\nacross the source and target domains. We assume only that the change is smooth as a function\nof X. In this way, more \ufb02exible transformations are allowed than mean-centering and variance-\nscaling. Speci\ufb01cally, we build a Gaussian Process to model the prediction on the transformed X,\nthen the prediction is matched with a few observed labels Y (also properly transformed) available\nin the target domain such that both transformations on X and on Y can be learned. The GP-based\napproach naturally lends itself to the active learning setting where we can sequentially choose query\npoints from the target dataset. 
Its final predictive covariance, which combines the uncertainty in the\ntransfer function and the uncertainty in the target label prediction, can be plugged into various\nGP-based active query selection criteria. In this paper we consider (1) Active Learning which reduces\ntotal predictive covariance [18, 19]; and (2) Active Surveying [20, 21] which uses an estimation\nobjective that is the sum of all the labels in the test set.\nAs an illustration, we show a toy problem in Fig. 1. As we can see, the support of P (X) in the\ntraining domain (red stars) and the support of P (X) in the test domain (blue line) do not overlap,\nnor do the supports of Y across the two domains. The goal is to learn a model on the training\ndata, with a few labeled test data (the filled blue circles), such that we can successfully recover the\ntarget function (the blue line). In Fig. 3, we show two real-world grape image datasets. The goal\nis to transfer the model learned from one kind of grape dataset to another. In Fig. 2, we show the\nlabels (the yield) of each grape image dataset, along with the 3rd dimension of its feature space.\nWe can see that the real-world problem is quite similar to the toy problem, which suggests that the\nalgorithm we propose in this paper will be both useful and practical for real applications.\n\nFigure 1: Toy problem\n\nFigure 2: Real grape data\n\nFigure 3: A part of one image\nfrom each grape dataset\n\nWe evaluate our methods on synthetic data and real-world grape image data. The experimental\nresults show that our transfer learning algorithms significantly outperform existing methods with\nfew labeled target data points.\n\n2 Related Work\n\nTransfer learning is applied when joint distributions differ across source and target domains. 
Traditional\nmethods for transfer learning use Markov logic networks [4], parameter learning [5, 6], and\nBayesian Network structure learning [7], where specific parts of the model are considered to be\ncarried over between tasks.\nRecently, a large part of transfer learning work has focused on the problem of covariate shift [8,\n9, 10]. These methods consider the case where only P (X) differs across domains, while the conditional\ndistribution P (Y |X) stays the same. The kernel mean matching (KMM) method [9, 10] is one of\nthe algorithms that deal with covariate shift. It minimizes $\\|\\mu(P_{te}) - \\mathbb{E}_{x \\sim P_{tr}(x)}[\\beta(x)\\phi(x)]\\|$ over a\nre-weighting vector $\\beta$ on training data points such that P (X) are matched across domains. However,\nthis work suffers from two major problems. First, the conditional distribution P (Y |X) is assumed to be\nthe same, which might not be true in many real-world cases. The algorithm we propose will\nallow more than just the marginal on X to shift. Second, the KMM method requires that the support\nof P (X te) is contained in the support of P (X tr), i.e., the training set is richer than the test set. This\nis not necessarily true in many real cases either. Consider the task of transferring yield prediction\nusing images taken from different vineyards. If the images are taken from different grape varieties\nor during different times of the year, the texture/color could be very different across transferring\ntasks. In these cases one might mean-center (and possibly also variance-scale) the data to ensure\nthat the support of P (X te) is contained in (or at least largely overlapped with) P (X tr). 
In this\npaper, we provide an alternative way to solve the support shift problem that allows more \ufb02exible\ntransformations than mean-centering and variance-scaling.\nSome more recent research [12] has focused on modeling target shift (P (Y ) changes), conditional\nshift (P (X|Y ) changes), and a combination of both. The assumption on target shift is that X de-\npends causally on Y , thus P (Y ) can be re-weighted to match the distributions on X across domains.\nIn conditional shift, the authors apply a location-scale transformation on P (X|Y ) to match P (X).\nHowever, the authors still assume that the support of P (Y te) is contained in the support of P (Y tr).\nIn addition, they do not assume they can obtain additional labels, Y te, from the target domain, and\nthus make no use of the labels Y te, even if some are available.\nThere also have been a few papers handling differences in P (Y |X). [13] designed speci\ufb01c methods\n(change of representation, adaptation through prior, and instance pruning) to solve the label adap-\ntation problem. [14] relaxed the requirement that the training and testing examples be drawn from\nthe same source distribution in the context of logistic regression. Similar to work on covariate shift,\n[15] weighted the samples from the source domain to deal with domain adaptation. These settings\nare relatively restricted while we consider a more general case that both the data points X and the\ncorresponding labels Y can be transformed smoothly across domains. Hence all data will be used\nwithout any pruning or weighting, with the advantage that the part of source data which does not\nhelp prediction in the target domain will automatically be corrected via the transformation model.\nThe idea of combining transfer learning and active learning has also been studied recently. Both\n[22] and [23] perform transfer and active learning in multiple stages. The \ufb01rst work uses the source\ndata without any domain adaptation. 
The second work performs domain adaptation at the beginning\nwithout further refinement. [24] and [25] consider active learning under covariate shift and still\nassume P (Y |X) stays the same. In [16], the authors propose a combined active transfer learning\nalgorithm to handle the general case where P (Y |X) changes smoothly across domains. However,\nthe authors still apply covariate shift algorithms to solve the problem that P (X) might differ across\ndomains, which inherits the assumption that covariate shift makes on the support of P (X). In this paper,\nwe propose an algorithm that allows more flexible transformations (location-scale transform on both\nX and Y ). Our experiments on real data show this additional flexibility pays off in real applications.\n\n3 Approach\n\n3.1 Problem Formulation\n\nWe are given a set of n labeled training data points, $(X^{tr}, Y^{tr})$, from the source domain where\neach $X^{tr}_i \\in \\mathbb{R}^{d_x}$ and each $Y^{tr}_i \\in \\mathbb{R}^{d_y}$. We are also given a set of m test data points, $X^{te}$, from\nthe target domain. Some of these will have corresponding labels, $Y^{teL}$. When necessary we will\nseparately denote the subset of $X^{te}$ that has labels as $X^{teL}$, and the subset that does not as $X^{teU}$.\nFor simplicity we restrict Y to be univariate in this paper, but the algorithm we propose extends\neasily to the multivariate case.\nFor static transfer learning, the goal is to learn a predictive model using all the given data that\nminimizes the squared prediction error on the test data, $\\sum_{i=1}^{m} (\\hat{Y}^{te}_i - Y^{te}_i)^2$, where $\\hat{Y}_i$ and $Y_i$ are the\npredicted and true labels for the ith test data point. We will evaluate the transfer learning algorithms\nby including a subset of labeled test data chosen uniformly at random. For active transfer learning\nthe performance metric is the same. 
The difference is that the active learning algorithm chooses the\ntest points for labeling rather than being given a randomly chosen set.\n\n3.2 Transfer Learning\nOur strategy is to simultaneously learn a nonlinear mapping $X^{te} \\to X^{new}$ and $Y^{te} \\to Y^{*}$. This\nallows flexible transformations on both X and Y , and our smoothness assumption using a GP prior\nmakes the estimation stable. We call this method Support and Model Shift (SMS).\nWe apply the following steps (K in the following represents the Gaussian kernel, $K_{XY}$ represents\nthe kernel between matrices X and Y , and $\\lambda$ ensures that the kernel matrix is invertible):\n\n\u2022 Transform $X^{teL}$ to $X^{new(L)}$ by a location-scale shift: $X^{new(L)} = W^{teL} \\odot X^{teL} + B^{teL}$,\nsuch that the support of $P(X^{new(L)})$ is contained in the support of $P(X^{tr})$;\n\u2022 Build a Gaussian Process on $(X^{tr}, Y^{tr})$ and predict on $X^{new(L)}$ to get $Y^{new(L)}$;\n\u2022 Transform $Y^{teL}$ to $Y^{*}$ by a location-scale shift: $Y^{*} = w^{teL} \\odot Y^{teL} + b^{teL}$, then we\noptimize the following empirical loss:\n\n$$\\arg\\min_{W^{teL}, B^{teL}, w^{teL}, b^{teL}, w^{te}} \\|Y^{*} - Y^{new(L)}\\|^2 + \\lambda_{reg} \\|w^{te} - 1\\|^2, \\qquad (1)$$\n\nwhere $W^{teL}$, $B^{teL}$ are matrices with the same size as $X^{teL}$, $w^{teL}$, $b^{teL}$ are vectors with the same\nsize as $Y^{teL}$ (l by 1, where l is the number of labeled samples in the target domain), and $w^{te}$ is an\nm by 1 scale vector on all $Y^{te}$. $\\lambda_{reg}$ is a regularization parameter.\nTo ensure the smoothness of the transformation w.r.t. X, we parameterize $W^{teL}$, $B^{teL}$, $w^{teL}$, $b^{teL}$\nusing: $W^{teL} = R^{teL} G$, $B^{teL} = R^{teL} H$, $w^{teL} = R^{teL} g$, $b^{teL} = R^{teL} h$, where\n$R^{teL} = L^{teL}(L^{teL} + \\lambda I)^{-1}$, $L^{teL} = K_{X^{teL} X^{teL}}$. Following the same smoothness constraint we also have:\n$w^{te} = R^{te} g$, where $R^{te} = K_{X^{te} X^{teL}}(L^{teL} + \\lambda I)^{-1}$. 
This parametrization results in the new\nobjective function:\n\n$$\\arg\\min_{G,H,g,h} \\|(R^{teL} g \\odot Y^{teL} + R^{teL} h) - Y^{new(L)}\\|^2 + \\lambda_{reg} \\|R^{te} g - 1\\|^2. \\qquad (2)$$\n\nIn the objective function, although we minimize the discrepancy between the transformed labels and\nthe predicted labels for only the labeled points in the test domain, we put a regularization term on\nthe transformation for all $X^{te}$ to ensure overall smoothness in the test domain. Note that the\nnonlinearity of the transformation makes the SMS approach capable of recovering a fairly wide set of\nchanges, including non-monotonic ones. However, because of the smoothness constraint imposed\non the location-scale transformation, it might not recover some extreme cases where the scale or\nlocation change is non-smooth/discontinuous. In such cases, though, the learning problem by\nitself would be very challenging.\nWe use a Metropolis-Hastings algorithm to optimize the objective (Eq. 2), which is multi-modal due\nto the use of the Gaussian kernel. The proposal distribution is given by $\\theta_t \\sim \\mathcal{N}(\\theta_{t-1}, \\Sigma)$, where $\\Sigma$\nis a diagonal matrix with diagonal elements determined by the magnitude of $\\theta \\in \\{G, H, g, h\\}$. In\naddition, the transformation on X requires that the support of $P(X^{new})$ is contained in the support\nof $P(X^{tr})$, which might be hard to achieve on real data, especially when X has a high-dimensional\nfeature space. 
To ensure that the training data can be better utilized, we relax the support-containing\ncondition by enforcing an overlapping ratio between the transformed $X^{new}$ and $X^{tr}$, i.e., we reject\nthose proposals which do not lead to a transformation that exceeds this ratio.\nAfter obtaining G, H, g, h, we make predictions on $X^{teU}$ by:\n\n\u2022 Transform $X^{teU}$ to $X^{new(U)}$ with the optimized G, H: $X^{new(U)} = W^{teU} \\odot X^{teU} + B^{teU} = R^{teU} G \\odot X^{teU} + R^{teU} H$;\n\u2022 Build a Gaussian Process on $(X^{tr}, Y^{tr})$ and predict on $X^{new(U)}$ to get $Y^{new(U)}$;\n\u2022 Predict using optimized g, h: $\\hat{Y}^{teU} = (Y^{new(U)} - b^{teU}) ./ w^{teU} = (Y^{new(U)} - R^{teU} h) ./ R^{teU} g$,\n\nwhere $R^{teU} = K_{X^{teU} X^{teL}}(L^{teL} + \\lambda I)^{-1}$ and ./ denotes element-wise division.\nWith the use of $W = RG$, $B = RH$, $w = Rg$, $b = Rh$, we allow more flexible transformations\nthan mean-centering and variance-scaling while assuming that the transformations are smooth w.r.t.\nX. We will illustrate the advantage of the proposed method in the experimental section.\n\n3.3 A Kernel Mean Embedding Point of View\n\nAfter the transformation from $X^{teL}$ to $X^{new(L)}$, we build a Gaussian Process on $(X^{tr}, Y^{tr})$\nand predict on $X^{new(L)}$ to get $Y^{new(L)}$. This is equivalent to estimating $\\hat{\\mu}[Y^{new(L)}]$ using\nconditional embeddings [11] with a Gaussian kernel on X and a linear kernel on Y :\n$\\hat{\\mu}[Y^{new(L)}] = \\hat{U}[P_{Y^{tr}|X^{tr}}]\\phi[X^{new(L)}] = \\psi(y^{tr})(\\phi(x^{tr})^{\\top}\\phi(x^{tr}) + \\lambda I)^{-1}\\phi^{\\top}(x^{tr})\\phi(x^{new(L)}) = (K_{X^{new(L)} X^{tr}}(K_{X^{tr} X^{tr}} + \\lambda I)^{-1} Y^{tr})^{\\top}$, where $\\hat{U}[P_{Y^{tr}|X^{tr}}]$ denotes the empirical estimator of the\nconditional embedding, and $\\phi$, $\\psi$ denote the feature mappings on X, Y , respectively. Finally the\nobjective function Eq. 
2 is effectively minimizing the mean discrepancy: $\\|\\hat{\\mu}[Y^{*}] - \\hat{\\mu}[Y^{new(L)}]\\|^2 = \\|\\hat{\\mu}[Y^{*}] - \\hat{U}[P_{Y^{tr}|X^{tr}}]\\phi[X^{new(L)}]\\|^2$.\nThe transformations {W, B, w, b} are smooth w.r.t. X.\nTake w for example: $\\hat{\\mu}[w] = \\hat{U}[P_{w|X^{teL}}]\\phi[X^{teL}] = g(\\phi^{\\top}(x^{teL})\\phi(x^{teL}) + \\lambda I)^{-1}\\phi^{\\top}(x^{teL})\\phi(x^{teL}) = g(L^{teL} + \\lambda I)^{-1} L^{teL} = (R^{teL} g)^{\\top}$.\n\n3.4 Active Learning\n\nWe consider two active learning goals and apply a myopic selection criterion to each:\n(1) Active Learning which reduces total predictive covariance [18, 19]. An optimal myopic selection\nis achieved by choosing the point which minimizes the trace of the predictive covariance matrix\nconditioned on that selection.\n(2) Active Surveying [20, 21] which uses an estimation objective that is the sum of all the labels in\nthe test set. An optimal myopic selection is achieved by choosing the point which minimizes the\nsum over all elements of the predictive covariance conditioned on that selection.\nNow we derive the predictive covariance of the SMS approach. Note the transformation between\n$\\hat{Y}^{teU}$ and $Y^{new(U)}$ is given by: $\\hat{Y}^{teU} = (Y^{new(U)} - b^{teU}) ./ w^{teU}$. Hence we have $\\mathrm{Cov}[\\hat{Y}^{teU}] =$\n
Hence we have Cov[ \u02c6Y teU ] =\ndiag{1./wteU} \u00b7 Cov(Y new(U )) \u00b7 diag{1./wteU}.\nAs for Y new(U ), since we build on Gaussian Processes for the prediction from X new(U ) to Y new(U ),\nit follows: Y new(U )|X new(U ) \u223c N (\u00b5, \u03a3), where \u00b5 = KX new(U )X tr (KX trX tr + \u03bbI)\u22121Y tr, and\n\u03a3 = KX new(U )X new(U ) \u2212 KX new(U )X tr (KX trX tr + \u03bbI)\u22121KX trX new(U ).\n(cid:82)\nNote\n=\nWteU (cid:12) X teU + BteU .\ni.e., P ( \u02c6Y new(U )|X teU , D) =\nX new(U ) P ( \u02c6Y teU|X new(U ), D)P (X new(U )|X teU )dX new(U ), with D = {X tr, Y tr, X teL, Y teL}.\nUsing the empirical form of P (X new(U )|X teU ) which has probability 1/|X teU| for each sample,\nwe get: Cov[ \u02c6Y new(U )|X teU , X tr, Y tr, X teL, Y teL] = \u03a3. Plugging the covariance of Y new(U ) into\nCov[ \u02c6Y teU ] we can get the \ufb01nal predictive covariance:\n\nand X teU is given by: X new(U )\n\nIntegrating over X new(U ),\n\nthe\n\ntransformation between X new(U )\n\nCov( \u02c6Y teU ) = diag{1./wteU} \u00b7 \u03a3 \u00b7 diag{1./wteU}\n\n(3)\n\n4 Experiments\n\n4.1 Synthetic Dataset\n\n4.1.1 Data Description\n\nWe generate the synthetic data with (using matlab notation): X tr = randn(80, 1), Y tr =\nsin(2X tr +1)+0.1\u2217randn(80, 1); X te = [w\u2217min(X tr)+b : 0.03 : w\u2217max(X tr)/3+b], Y te =\nsin(2(revw \u2217 X te + revb) + 1) + 2. In words, X tr is drawn from a standard normal distribution,\nand Y tr is a sine function with Gaussian noise. X te is drawn from a uniform distribution with a\n\n5\n\n\flocation-scale transform on a subset of X tr. Y te is the same sine function plus a constant offset.\nThe synthetic dataset used is with w = 0.5; b = 5; revw = 2; revb = \u221210, as shown in Fig. 
1.\n\n4.1.2 Results\n\nWe compare the SMS approach with the following approaches:\n(1) Only test x: prediction using labeled test data only;\n(2) Both x: prediction using both the training data and labeled test data without transformation;\n(3) Offset: the offset approach [16];\n(4) DM: the distribution matching approach [16];\n(5) KMM: Kernel mean matching [9];\n(6) T/C shift: Target/Conditional shift [12], code is from http://people.tuebingen.mpg.de/kzhang/Code-TarS.zip.\nTo ensure the fairness of comparison, we apply (3) to (6) using: the original data, the\nmean-centered data, and the mean-centered+variance-scaled (mean-var-centered) data.\nA detailed comparison with different numbers of labeled test points is shown in Fig. 4, averaged\nover 10 experiments. The selection of which test points to label is done uniformly at random for each\nexperiment. The parameters are chosen by cross-validation. Since KMM and T/C shift do not utilize\nthe labeled test points, the MSEs of these two approaches are constant, as shown in the text box. As\nwe can see from the results, our proposed approach performs better than all other approaches.\nAs an example, the results for transfer learning with 5 labeled test points on the synthetic dataset\nare shown in Fig. 5. The 5 labeled test points are shown as filled blue circles. First, our proposed\nmodel, SMS, can successfully learn both the transformation on X and the transformation on Y , thus\nresulting in almost a perfect fit on unlabeled test points. Using only labeled test points results in a\npoor fit towards the right part of the function because there are no observed test labels in that part.\nUsing both training and labeled test points results in a similar fit as using the labeled test points\nonly, because the supports of the training and test domains do not overlap. 
The offset approach with\nmean-centered+variance-scaled data also results in a poor fit because the training model is not true\nany more. It would have performed well if the variances were similar across domains. The support\nof the test data we generated, however, only consists of part of the support of the training data and\nhence simple variance-scaling does not yield a good match on P (Y |X). The distribution matching\napproach suffers from the same problem. The KMM approach, as mentioned before, applies the same\nconditional model P (Y |X) across domains, hence it does not perform well. The Target/Conditional\nShift approach does not perform well either since it does not utilize any of the labeled test points.\nIts predicted support of P (Y te) is constrained to the support of P (Y tr), which results in a poor\nprediction of Y te once there exists an offset between the Y \u2019s.\n\n[Figure 4 panels: Mean Squared Error with 2, 5, and 10 labeled test points. Legend: SMS; use only test x; use both x; offset (original, mean-centered, mean-var-centered); DM (original, mean-centered, mean-var-centered). Text box, MSE on original / mean-centered / mean-var-centered data: KMM 4.46 / 2.25 / 4.63; T/C 1.97 / 3.51 / 4.71.]\n\nFigure 4: Comparison of MSE on the synthetic dataset with {2, 5, 10} labeled test points\n\n4.2 Real-world Dataset\n\n4.2.1 Data Description\n\nWe have two datasets with grape images taken from vineyards and the number of grapes on them as\nlabels: one is riesling (128 labeled images), the other is traminette (96 labeled images), as shown in\nFigure 3. The goal is to transfer the model learned from one kind of grape dataset to another. The\ntotal numbers of grapes for these two datasets are 19,253 and 30,360, respectively.\n\nFigure 5: Comparison of results on the synthetic dataset: An example\n\nWe extract raw-pixel features from the images, and use Random Kitchen Sinks [1] to get the\ncoefficients as feature vectors [2], resulting in 2177 features. 
On the traminette dataset we have achieved\na cross-validated R-squared correlation of 0.754. Previously, specifically designed image processing\nmethods have achieved an R-squared correlation of 0.73 [3]. This grape-detection method takes lots\nof manual labeling work and cannot be directly applied across different varieties of grapes (due to\ndifference in size and color). Our proposed approach for transfer learning, however, can be directly\nused for different varieties of grapes or even different kinds of crops.\n\n4.2.2 Results\n\nThe results for transfer learning are shown in Table 1. We compare the SMS approach with the\nsame baselines as in the synthetic experiments. For {DM, offset, KMM, T/C shift}, we only\nshow their best results after applying them on the original data, the mean-centered data, and the\nmean-centered+variance-scaled data. In each row the result in bold indicates the result with the\nbest RMSE. The result with a star mark indicates that the best result is statistically significant at a\np = 0.05 level with unpaired t-tests. We can see that our proposed algorithm yields better results in\nmost cases, especially when the number of labeled test points is small. This means our proposed\nalgorithm can better utilize the source data and will be particularly useful in the early stage of learning\nmodel transfer, when only a small number of labels in the target domain is available/required.\nThe Active Learning/Active Surveying results are shown in Fig. 6. We compare the SMS approach\n(covariance matrix in Eq. 
3 for test point selection, and SMS for prediction) with:\n(1) combined+SMS: combined covariance [16] for selection, and SMS for prediction;\n(2) random+SMS: random selection, and SMS for prediction;\n(3) combined+offset: the Active Learning/Surveying algorithm proposed in [16], using combined\ncovariance for selection, and the corresponding offset approach for prediction.\n\n[Figure 5 panels (axes X, Y; legend: source data, target, selected test x, prediction): SMS; use only labeled test x; offset (mean-var-centered data), predictions for w=1 and w=5; DM (mean-var-centered data), predictions for p=1e\u22123 and p=0.1; KMM/TC Shift (mean-centered data), predictions for KMM and T/C shift; KMM/TC Shift (mean-var-centered data), predictions for KMM and T/C shift.]\n\nFrom the results we can see that SMS is the best model overall. SMS is better than the Active\nLearning/Surveying approach proposed in [16] (combined+offset), especially in the Active Surveying\nresult. Moreover, the combined+SMS result is better than combined+offset, which also indicates\nthat the SMS model is better for prediction than the offset approach in [16]. Also, given the better\nmodel that SMS has, there is not much difference in which active learning algorithm we use. 
However,\nSMS with active selection is better than SMS with random selection, especially in the Active\nLearning result.\n\nTable 1: RMSE for transfer learning on real data\n\n# X teL  SMS        DM        Offset   Only test x  Both x    KMM   T/C Shift\n5        1197\u00b123*   1359\u00b154   1303\u00b139  1479\u00b169      2094\u00b160   2127  2330\n10       1046\u00b135*   1196\u00b159   1234\u00b153  1323\u00b191      1939\u00b141   2127  2330\n15       993\u00b128     1055\u00b127   1063\u00b130  1104\u00b146      1916\u00b136   2127  2330\n20       985\u00b113     1056\u00b154   1024\u00b120  1086\u00b174      1832\u00b146   2127  2330\n25       982\u00b114     1030\u00b129   1040\u00b127  1039\u00b131      1839\u00b141   2127  2330\n30       960\u00b119     921\u00b129    961\u00b130   937\u00b129       1663\u00b131   2127  2330\n40       890\u00b126     898\u00b130    938\u00b130   901\u00b131       1621\u00b134   2127  2330\n50       893\u00b116     925\u00b159    935\u00b159   926\u00b164       1558\u00b151   2127  2330\n70       860\u00b140     805\u00b138    819\u00b140   804\u00b137       1399\u00b163   2127  2330\n90       791\u00b198     838\u00b1102   863\u00b199   838\u00b1104      1288\u00b1117  2127  2330\n\nFigure 6: Active Learning/Surveying results on the real dataset (legend: selection+prediction).\n\n5 Discussion and Conclusion\n\nSolving objective Eq. 2 is relatively involved. Gradient methods can be a faster alternative but the\nnon-convex property of the objective makes it harder to find the global optimum using gradient\nmethods. In practice we find it is relatively efficient to solve Eq. 2 with proper initializations (like\nusing the ratio of scale on the support for w, and the offset between the scaled-means for b). In our\nreal-world dataset with 2177 features, it takes about 2.54 minutes on average in a single-threaded\nMATLAB process on a 3.1 GHz CPU with 8 GB RAM to solve the objective and recover the\ntransformation. 
As part of the future work we are working on faster ways to solve the proposed objective.\nIn this paper, we proposed a transfer learning algorithm that handles both support and model shift\nacross domains. The algorithm transforms both X and Y by a location-scale shift across domains,\nthen the labels in these two domains are matched such that both transformations can be learned.\nSince we allow more flexible transformations than mean-centering and variance-scaling, the proposed\nmethod yields better results than traditional methods. Results on both the synthetic dataset and\nthe real-world dataset show the advantage of our proposed method.\n\nAcknowledgments\n\nThis work is supported in part by the US Department of Agriculture under grant number\n20126702119958.\n\n[Figure 6 panels: Active Learning (RMSE vs. number of labeled test points) and Active Surveying (absolute error vs. number of labeled test points). Legend: SMS, combined+SMS, random+SMS, combined+offset.]\n\nReferences\n\n[1] Rahimi, A. and Recht, B. Random features for large-scale kernel machines. Advances in Neural Information\nProcessing Systems, 2007.\n[2] Oliva, Junier B., Neiswanger, Willie, Poczos, Barnabas, Schneider, Jeff, and Xing, Eric. Fast distribution to\nreal regression. AISTATS, 2014.\n[3] Nuske, S., Gupta, K., Narasimhan, S., and Singh, S. Modeling and calibrating visual yield estimates in\nvineyards. International Conference on Field and Service Robotics, 2012.\n[4] Mihalkova, Lilyana, Huynh, Tuyen, and Mooney, Raymond J. Mapping and revising Markov logic networks\nfor transfer learning. Proceedings of the 22nd AAAI Conference on Artificial Intelligence (AAAI-2007), 2007.\n[5] Do, Cuong B and Ng, Andrew Y. Transfer learning for text classification. Neural Information Processing\nSystems Foundation, 2005.\n[6] Raina, Rajat, Ng, Andrew Y., and Koller, Daphne. 
Constructing informative priors using transfer learning. Proceedings of the Twenty-third International Conference on Machine Learning, 2006.\n[7] Niculescu-Mizil, Alexandru and Caruana, Rich. Inductive transfer for Bayesian network structure learning. Proceedings of the Eleventh International Conference on Artificial Intelligence and Statistics (AISTATS), 2007.\n[8] Shimodaira, Hidetoshi. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference, 90 (2): 227-244, 2000.\n[9] Huang, Jiayuan, Smola, Alex, Gretton, Arthur, Borgwardt, Karsten, and Sch\u00f6lkopf, Bernhard. Correcting sample selection bias by unlabeled data. NIPS 2007, 2007.\n[10] Gretton, Arthur, Borgwardt, Karsten M., Rasch, Malte, Sch\u00f6lkopf, Bernhard, and Smola, Alex. A kernel method for the two-sample-problem. NIPS 2007, 2007.\n[11] Song, Le, Huang, Jonathan, Smola, Alex, and Fukumizu, Kenji. Hilbert space embeddings of conditional distributions with applications to dynamical systems. ICML 2009, 2009.\n[12] Zhang, Kun, Sch\u00f6lkopf, Bernhard, Muandet, Krikamol, and Wang, Zhikun. Domain adaptation under target and conditional shift. ICML 2013, 2013.\n[13] Jiang, J. and Zhai, C. Instance weighting for domain adaptation in NLP. Proc. 45th Ann. Meeting of the Assoc. Computational Linguistics, pp. 264-271, 2007.\n[14] Liao, X., Xue, Y., and Carin, L. Logistic regression with an auxiliary data source. Proc. 21st Intl Conf. Machine Learning, 2005.\n[15] Sun, Qian, Chattopadhyay, Rita, Panchanathan, Sethuraman, and Ye, Jieping. A two-stage weighting framework for multi-source domain adaptation. NIPS, 2011.\n[16] Wang, Xuezhi, Huang, Tzu-Kuo, and Schneider, Jeff. Active transfer learning under model shift. ICML, 2014.\n[17] Pan, Sinno Jialin and Yang, Qiang. A survey on transfer learning. TKDE 2009, 2009.\n[18] Seo, Sambu, Wallat, Marko, Graepel, Thore, and Obermayer, Klaus. 
Gaussian process regression: Active data selection and test point rejection. IJCNN, 2000.\n[19] Ji, Ming and Han, Jiawei. A variance minimization criterion to active learning on graphs. AISTATS, 2012.\n[20] Garnett, Roman, Krishnamurthy, Yamuna, Xiong, Xuehan, Schneider, Jeff, and Mann, Richard. Bayesian optimal active search and surveying. ICML, 2012.\n[21] Ma, Yifei, Garnett, Roman, and Schneider, Jeff. Sigma-optimality for active learning on Gaussian random fields. NIPS, 2013.\n[22] Shi, Xiaoxiao, Fan, Wei, and Ren, Jiangtao. Actively transfer domain knowledge. ECML, 2008.\n[23] Rai, Piyush, Saha, Avishek, Daum\u00e9 III, Hal, and Venkatasubramanian, Suresh. Domain adaptation meets active learning. Active Learning for NLP (ALNLP), Workshop at NAACL-HLT, 2010.\n[24] Saha, Avishek, Rai, Piyush, Daum\u00e9 III, Hal, Venkatasubramanian, Suresh, and DuVall, Scott L. Active supervised domain adaptation. ECML, 2011.\n[25] Chattopadhyay, Rita, Fan, Wei, Davidson, Ian, Panchanathan, Sethuraman, and Ye, Jieping. Joint transfer and batch-mode active learning. ICML, 2013.", "award": [], "sourceid": 1038, "authors": [{"given_name": "Xuezhi", "family_name": "Wang", "institution": "Carnegie Mellon University"}, {"given_name": "Jeff", "family_name": "Schneider", "institution": "CMU"}]}