{"title": "Non-Linear Domain Adaptation with Boosting", "book": "Advances in Neural Information Processing Systems", "page_first": 485, "page_last": 493, "abstract": "A common assumption in machine vision is that the training and test samples are drawn from the same distribution. However, there are many problems when this assumption is grossly violated, as in bio-medical applications where different acquisitions can generate drastic variations in the appearance of the data due to changing experimental conditions. This problem is accentuated with 3D data, for which annotation is very time-consuming, limiting the amount of data that can be labeled in new acquisitions for training. In this paper we present a multi-task learning algorithm for domain adaptation based on boosting. Unlike previous approaches that learn task-specific decision boundaries, our method learns a single decision boundary in a shared feature space, common to all tasks. We use the boosting-trick to learn a non-linear mapping of the observations in each task, with no need for specific a-priori knowledge of its global analytical form. This yields a more parameter-free domain adaptation approach that successfully leverages learning on new tasks where labeled data is scarce. We evaluate our approach on two challenging bio-medical datasets and achieve a significant improvement over the state-of-the-art.", "full_text": "Non-Linear Domain Adaptation with Boosting\n\nCarlos Becker\u2217\n\nC. Mario Christoudias\n\nPascal Fua\n\nCVLab, \u00b4Ecole Polytechnique F\u00b4ed\u00b4erale de Lausanne, Switzerland\n\nfirstname.lastname@epfl.ch\n\nAbstract\n\nA common assumption in machine vision is that the training and test samples\nare drawn from the same distribution. However, there are many problems when\nthis assumption is grossly violated, as in bio-medical applications where differ-\nent acquisitions can generate drastic variations in the appearance of the data due\nto changing experimental conditions. 
This problem is accentuated with 3D data,\nfor which annotation is very time-consuming, limiting the amount of data that\ncan be labeled in new acquisitions for training. In this paper we present a multi-\ntask learning algorithm for domain adaptation based on boosting. Unlike previous\napproaches that learn task-speci\ufb01c decision boundaries, our method learns a sin-\ngle decision boundary in a shared feature space, common to all tasks. We use\nthe boosting-trick to learn a non-linear mapping of the observations in each task,\nwith no need for speci\ufb01c a-priori knowledge of its global analytical form. This\nyields a more parameter-free domain adaptation approach that successfully lever-\nages learning on new tasks where labeled data is scarce. We evaluate our approach\non two challenging bio-medical datasets and achieve a signi\ufb01cant improvement\nover the state of the art.\n\n1\n\nIntroduction\n\nObject detection and segmentation approaches often assume that the training and test samples are\ndrawn from the same distribution. There are many problems in Computer Vision, however, where\nthis assumption can be grossly violated, such as in bio-medical applications that usually involve\nexpensive and complicated data acquisition processes that are not easily repeatable. As illustrated\nin Fig. 1, this can result in newly-acquired data that is signi\ufb01cantly different from the data used for\ntraining. As a result, a classi\ufb01er trained on data from one acquisition often cannot generalize well to\ndata obtained from a new one. 
Furthermore, although it is possible to expect supervised data from\na new acquisition, it is unreasonable to expect the practitioner to re-label large amounts of data for\neach new image that is acquired, particularly in the case of 3D image stacks.\nA possible solution is to treat each acquisition as a separate, but related classi\ufb01cation problem, and\nexploit their possible relationship to learn from the supervised data available across all of them.\nTypically, each such classi\ufb01cation problem is called a task, which is associated with a domain.\nFor example, for Fig. 1(a,b) the task is mitochondria segmentation in both acquisitions. However,\nthe domains are different, namely Striatum and Hippocampus EM stacks. Techniques in domain\nadaptation [1] and more generally multi-task learning [2, 3] seek to leverage data from a set of\ndifferent yet related tasks or domains to help learn a classi\ufb01er in a seemingly new task. In domain\nadaptation, it is typically assumed that there is a fairly large amount of labeled data in one domain,\ncommonly referred to as the source domain, and that a limited amount of supervision is available in\nthe other, often called the target domain. Our goal is to exploit the labeled data in the source domain\nto learn an accurate classi\ufb01er in the target domain despite having only a few labeled samples in the\nlatter.\n\n\u2217This work was supported in part by the ERC grant MicroNano.\n\n1\n\n\fMitochondria Segmentation\n\n(3D stacks)\n\nPath Classi\ufb01cation\n\n(2D images to 3D stacks)\n\n(a) Striatum\n\n(b) Hippocampus\n\n(c) Aerial road images\n\n(d) Neural Axons (OPF)\n\nFigure 1: (a,b) Slice cuts from two 3D Electron Microscopy acquisitions from different parts of the\nbrain of a rat. (c,d) 2D aerial road images and 3D neural axons from Olfactory Projection Fibers\n(OPF). 
Top and bottom rows show example images and ground truth respectively.\n\nThe data acquisition problem is unique to many multi-task learning problems, however, in that each\ntask is in fact the same, but what has changed is that the features across different acquisitions have\nundergone some unknown transformation. That is to say that each task can be well described by a\nsingle decision boundary in some common feature space that preserves the task-relevant features and\ndiscards the domain speci\ufb01c ones corresponding to unwanted acquisition artifacts. This contrasts the\nmore general multi-task setting where each task is comprised of both a common and task-speci\ufb01c\nboundary, even when mapped to a common feature space, as illustrated in Fig. 2. A method that can\njointly optimize over the common decision boundary and shared feature space is therefore desired.\nLinear latent variable methods such as those based on Canonical Correlation Analysis (CCA) [4,\n5] can be applied to learn a shared feature space across the different acquisitions. However, the\nsituation is further complicated by the fact that the unknown transformations are generally non-\nlinear. Although kernel methods can be applied to model the non-linearity [4, 6, 7], this requires\nthe existence of a well-de\ufb01ned kernel function that can often be dif\ufb01cult to specify a priori. Also,\nthe computational complexity of kernel methods scales quadratically with the number of training\nexamples, limiting their application to large datasets.\nIn this paper we propose a solution to the data acquisition problem and devise a method that can\njointly solve for the non-linear decision boundary and transformations across tasks. As illustrated\nin Fig. 2 our approach maps features from possibly high-dimensional, task-speci\ufb01c feature spaces\nto a low-dimensional space common to all tasks. 
We assume that only the mappings are task-\ndependent and that in the shared space the problem is linearly separable and the decision boundary\nis common to all tasks. We use the boosting-trick [8, 9, 10] to simultaneously learn the non-linear\ntask-speci\ufb01c mappings as well as the decision boundary, with no need for speci\ufb01c a-priori knowledge\nof their global analytical form. This yields a more parameter-free domain adaptation approach that\nsuccessfully leverages learning on new tasks where labeled data is scarce.\nWe evaluate our approach on the two challenging bio-medical datasets depicted by Fig. 1. We\n\ufb01rst consider the classi\ufb01cation of curvilinear structures in 3D image stacks of Olfactory Projection\nFibers (OPF) [11] using labeled 2D aerial road images. We then perform mitochondria segmentation\nin large 3D Electron Microscopy (EM) stacks of neural rat tissue, demonstrating the ability of our\nalgorithm to leverage labeled data from different data acquisitions on this challenging task. On both\ndatasets our approach obtains a signi\ufb01cant improvement over using labeled data from either domain\nalone and outperforms recent multi-task learning baseline methods.\n\n2 Related Work\n\nInitial ideas to multi-task learning exploited supervised data from related tasks to de\ufb01ne a form of\nregularization in the target problem [2, 12]. In this setting, related tasks, also sometimes referred to\n\n2\n\n\f(a) Standard Multi-task Learning\n\n(b) Domain Adaptation\n\nFigure 2: Illustration of the difference between (a) standard Multi-task Learning (MTL) and (b) our\nDomain Adaptation (DA) approach on two tasks. MTL assumes a single, pre-de\ufb01ned transformation\n\u03c6(x) : X \u2192 Z and learns shared and task-speci\ufb01c linear boundaries in Z, namely \u03b2o, \u03b21 and\n\u03b22 \u2208 Z. 
In contrast, our DA approach learns a single linear boundary \u03b2 in a common feature space\nZ, and task-speci\ufb01c mappings \u03c61(x), \u03c62(x) : X \u2192 Z. Best viewed in color.\n\nas auxiliary problems [13], are used to learn a latent representation and \ufb01nd discriminative features\nshared across tasks. This representation is then transferred to the target task to help regularize the\nsolution and learn from fewer labeled examples. The success of these approaches crucially hinges\non the ability to de\ufb01ne auxiliary tasks. Although this can be easily done in certain situations, e.g., as\nin [13], in many cases it is unclear how to generate them and the solution can be limiting, especially\nwhen provided only a few auxiliary problems. Unlike such methods, our approach is able to \ufb01nd an\ninformative shared representation even with as little as one related task.\nMore recent multi-task learning methods jointly optimize over both the shared and task-speci\ufb01c\ncomponents of each task [3, 14, 10, 15]. In [3] it was shown how the two step iterative optimiza-\ntion of [13] can be cast into a single convex optimization problem. In particular, for each task their\napproach computes a linear decision boundary de\ufb01ned as a linear combination between a shared\nhyperplane, shared across tasks, and a task-speci\ufb01c one in either the original or a kernelized feature\nspace. This idea was later further generalized to allow for more generic forms [14, 16, 17, 15], as\nin [14] that investigated the use of a hierarchically combined decision boundary. The use of boost-\ning for multi-task learning was explored in [10] as an alternative to kernel-based approaches. For\neach task they optimize for a shared and task-speci\ufb01c decision boundary similar to [3], except non-\nlinearities are modeled using a boosted feature space. 
As with other methods, however, additional\nparameters are required to control the degree of sharing between tasks, and these can be difficult to set,\nespecially when one or more tasks have only a few labeled samples.\nFor many problems, such as those common to domain adaptation [1], the decision problem is in fact\nthe same across tasks; however, the features of each task have undergone some unknown transformation.\nFeature-based approaches seek to uncover this transformation by learning a mapping between\nthe features across tasks [18, 19, 7]. A cross-domain Mahalanobis distance metric was introduced\nin [18] that leverages across-task correspondences to learn a transformation from the source to target\ndomain. A similar method was later developed in [20] to handle cross-domain feature spaces of a\ndifferent dimensionality. Shared latent variable models have also been proposed to learn a shared\nrepresentation across multiple feature sources or tasks [4, 19, 6, 7, 21].\nFeature-based methods generally rely on the kernel-trick to model non-linearities, which requires the\nselection of a pre-defined kernel function and is difficult to scale to large datasets. In this paper,\nwe exploit the boosting-trick [10] to handle non-linearities and learn a shared representation across\ntasks, overcoming these limitations. This results in a more parameter-free, scalable domain adaptation\napproach that can leverage learning on new tasks where labeled data is scarce.\n\n3 Our Approach\n\nWe consider the problem of learning a binary decision function from supervised data collected across\nmultiple tasks or domains. In our setting, each task is an instance of the same underlying decision\nproblem; however, its features are assumed to have undergone some unknown non-linear transformation.\n\nAssume that we are given training samples $X^t = \\{x^t_i, y^t_i\\}_{i=1}^{N^t}$ from t = 1, . . . 
, T tasks, where\n$x^t_i \\in \\mathbb{R}^D$ represents a feature vector for sample i in task t and $y^t_i \\in \\{-1, 1\\}$ its label. For each task,\nwe seek to learn a non-linear transformation, $\\phi_t(x^t)$, that maps $x^t$ to a common, task-independent\nfeature space, Z, accounting for any unwanted feature shift. Instead of relying on cleverly chosen\nkernel functions we model each transformation using a set of task-specific non-linear functions\n$H_t = \\{h^t_1, \\ldots, h^t_M\\}$, $h^t_j : \\mathbb{R}^D \\to \\mathbb{R}$, to define $\\phi_t : X^t \\to Z$ as $\\phi_t(x^t) = [h^t_1(x^t), \\ldots, h^t_M(x^t)]^\\top$.\nA wide variety of task-specific feature functions can be explored within our framework. We consider\nfunctions of the form,\n\n$$h^t_j(x^t) = h_j(x^t - \\tau^t_j), \\quad j = 1, \\ldots, M \\qquad (1)$$\n\nwhere $H = \\{h_1, \\ldots, h_M\\}$ are shared across tasks and $\\tau^t_j \\in \\mathbb{R}^D$. This seems like an appropriate\nmodel in the case of feature shift between tasks, for example due to different acquisition parameters.\nEach $h_j$ can be interpreted as a weak non-linear predictor of the task label and in practice M is\nlarge, possibly infinite. In what follows, we set H to be the set of regression trees or stumps [8] that\nin combination with $\\tau^t$ can be used to model highly complex, non-linear transformations.\nAssuming that the problem is linearly separable in Z the predictive function $f_t(\\cdot) : \\mathbb{R}^D \\to \\mathbb{R}$ for\neach task can then be written as\n\n$$f_t(x^t) = \\beta^\\top \\phi_t(x^t) = \\sum_{j=1}^{M} \\beta_j h_j(x^t - \\tau^t_j) \\qquad (2)$$\n\nwhere $\\beta \\in \\mathbb{R}^M$ is a linear decision boundary in Z that is common to all tasks. 
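

As a minimal sketch of Eq. (2), assuming simple threshold stumps as the shared weak learners, the snippet below evaluates the predictive function for two tasks that share the stumps and the weights beta but use different shifts tau. All names (stump, predict, taus) and the toy numbers are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
# Hypothetical illustration of Eq. (2): shared stumps h_j, task-specific shifts tau_j^t.

def stump(dim, eta_low, eta_high):
    # Shared weak learner h_j: thresholds one feature dimension at zero.
    def h(x):
        return eta_high if x[dim] > 0.0 else eta_low
    return h

def predict(x, task, stumps, betas, taus):
    # f_t(x) = sum_j beta_j * h_j(x - tau_j^t)
    score = 0.0
    for h, beta, tau in zip(stumps, betas, taus[task]):
        shifted = [xi - ti for xi, ti in zip(x, tau)]
        score += beta * h(shifted)
    return score

# Two shared stumps and one shared weight vector (assumed values).
stumps = [stump(0, -1.0, 1.0), stump(1, -1.0, 1.0)]
betas = [0.7, 0.3]

# One shift vector per stump and per task (assumed values).
taus = {
    'source': [[0.0, 0.0], [0.0, 0.0]],
    'target': [[2.0, 0.0], [0.0, -1.0]],
}

x = [1.0, -0.5]
score_source = predict(x, 'source', stumps, betas, taus)
score_target = predict(x, 'target', stumps, betas, taus)
```

The same feature vector receives a positive score in the source task and a negative one in the target task, although the weak learners and beta are identical: only the task-specific shifts differ.
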
This contrasts with\nprevious approaches to multi-task learning such as [3, 10] that learn a separate decision boundary\nper task and, as we show later, is better suited for problems in domain adaptation.\nWe learn the functions $f_t(\\cdot)$ by minimizing the exponential loss on the training data across all tasks\n\n$$\\beta^*, \\Gamma^* = \\arg\\min_{\\beta, \\Gamma} \\sum_{t=1}^{T} L(\\beta, \\Gamma^t; X^t), \\qquad (3)$$\n\nwhere\n\n$$L(\\beta, \\Gamma^t; X^t) = \\sum_{i=1}^{N^t} \\exp\\left[-y^t_i f_t(x^t_i)\\right] = \\sum_{i=1}^{N^t} \\exp\\left[-y^t_i \\sum_{j=1}^{M} \\beta_j h_j(x^t_i - \\tau^t_j)\\right], \\qquad (4)$$\n\nand $\\Gamma = [\\Gamma^1, \\ldots, \\Gamma^T]$ with $\\Gamma^t = [\\tau^t_1, \\ldots, \\tau^t_M]$.\nThe explicit minimization of Eq. (3) can be very difficult, since in practice, M can be prohibitively\nlarge and the $h_j$'s are typically discontinuous and highly non-linear. Luckily, this is a problem for\nwhich boosting is particularly well suited [8], as it has been demonstrated to be an effective method\nfor constructing a highly accurate classifier from a possibly large collection of weak prediction\nfunctions. Similar to the kernel-trick, the resulting boosting-trick [8, 9, 10] can be used to define a\nnon-linear mapping to a high dimensional feature space for which we assume the data to be linearly\nseparable. Unlike the kernel-trick, however, the boosting-trick defines an explicit mapping for which\n$\\beta$ is assumed to be sparse [22, 10].\nWe propose to use gradient boosting [8, 9] to solve for $f_t(\\cdot)$. Given any twice-differentiable loss\nfunction, gradient boosting minimizes the loss in a stage-wise manner for iterations k = 1 to K. In\nparticular, we use the quadratic approximation introduced by [9]. When applied to minimize Eq. (3),\nthe goal at each boosting iteration is to find the weak learner \u02dch \u2208 H and the set of { \u02dc\u03c4 1, . . . 
, \u02dc\u03c4 T}\nthat minimize\n\n$$\\sum_{t=1}^{T} \\left( \\sum_{i=1}^{N^t} w^t_{ik} \\left[\\tilde{h}(x^t_i - \\tilde{\\tau}^t) - r^t_{ik}\\right]^2 \\right), \\qquad (5)$$\n\nwhere $w^t_{ik}$ and $r^t_{ik}$ can be computed by differentiating the loss of Eq. (4), obtaining $w^t_{ik} = e^{-y^t_i f_t(x^t_i)}$\nand $r^t_{ik} = y^t_i$. Once $\\tilde{h}$ and $\\{\\tilde{\\tau}^1, \\ldots, \\tilde{\\tau}^T\\}$ are found, a line-search procedure is applied to determine\nthe optimal weighting for $\\tilde{h}$ and the predictive functions $f_t(\\cdot)$ are updated, as described in Alg. 1.\nShrinkage may be applied to help regularize the solution, particularly when using powerful weak\nlearners such as regression trees [8].\nOur proposed approach is summarized in Alg. 1.\n\nAlgorithm 1 Non-Linear Domain Adaptation with Boosting\nInput: Training samples and labels for T tasks $X^t = \\{(x^t_i, y^t_i)\\}_{i=1}^{N^t}$\nNumber of iterations K, shrinkage factor $0 < \\gamma \\le 1$\n1: Set $f_t(\\cdot) = 0 \\;\\forall\\, t = 1, \\ldots, T$\n2: for k = 1 to K do\n3:   Let $w^t_{ik} = e^{-y^t_i f_t(x^t_i)}$ and $r^t_{ik} = y^t_i$\n4:   Find $\\{\\tilde{h}(\\cdot), \\tilde{\\tau}^1, \\ldots, \\tilde{\\tau}^T\\} = \\arg\\min_{h \\in H, \\tau^1, \\ldots, \\tau^T} \\sum_{t=1}^{T} \\sum_{i=1}^{N^t} w^t_{ik} \\left[h(x^t_i - \\tau^t) - r^t_{ik}\\right]^2$\n5:   Find $\\tilde{\\alpha}$ through line search: $\\tilde{\\alpha} = \\arg\\min_{\\alpha} \\sum_{t=1}^{T} \\sum_{i=1}^{N^t} \\exp\\left[-y^t_i \\left(f_t(x^t_i) + \\alpha \\tilde{h}(x^t_i - \\tilde{\\tau}^t)\\right)\\right]$\n6:   Set $\\tilde{\\beta} = \\gamma \\tilde{\\alpha}$\n7:   Update $f_t(\\cdot) = f_t(\\cdot) + \\tilde{\\beta} \\tilde{h}(\\cdot - \\tilde{\\tau}^t) \\;\\forall\\, t = 1, \\ldots, T$\n8: end for\n9: return $f_t(\\cdot) \\;\\forall\\, t = 1, \\ldots, T$\n\n
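

The steps of Alg. 1 can be sketched in a few lines of Python for the simplest case, decision stumps that return +1 or -1. This is an illustrative sketch under stated assumptions rather than the authors' implementation: the function names, the midpoint candidate thresholds, and the toy data are hypothetical, and the line search of step 5 is replaced by the closed-form AdaBoost weight, which minimizes the exponential loss exactly only when the weak learner outputs +1 or -1.

```python
import math

def stump_predict(x, dim, polarity, tau):
    # Shared stump h evaluated on the shifted feature x[dim] - tau.
    return polarity if x[dim] - tau > 0.0 else -polarity

def best_threshold(xs, ys, ws, dim, polarity):
    # Step 4, solved independently per task: for +1/-1 outputs the weighted
    # squared error of Eq. (5) is proportional to the weighted error count.
    values = sorted(set(x[dim] for x in xs))
    mids = [(a + b) / 2.0 for a, b in zip(values, values[1:])]
    candidates = [values[0] - 1.0] + mids + [values[-1] + 1.0]
    best = None
    for tau in candidates:
        err = sum(w for x, y, w in zip(xs, ys, ws)
                  if stump_predict(x, dim, polarity, tau) != y)
        if best is None or err < best[0]:
            best = (err, tau)
    return best

def fit(tasks, num_iters=20, shrinkage=1.0):
    # tasks: dict mapping a task name to (feature vectors, labels in {-1, +1}).
    num_dims = len(next(iter(tasks.values()))[0][0])
    scores = {t: [0.0] * len(ys) for t, (xs, ys) in tasks.items()}
    model = []
    for _ in range(num_iters):
        # Step 3: exponential-loss weights w = exp(-y * f_t(x)).
        weights = {t: [math.exp(-y * f) for y, f in zip(tasks[t][1], scores[t])]
                   for t in tasks}
        # Step 4: shared dimension and polarity, one threshold per task.
        best = None
        for dim in range(num_dims):
            for polarity in (1.0, -1.0):
                total, taus = 0.0, {}
                for t, (xs, ys) in tasks.items():
                    err, tau = best_threshold(xs, ys, weights[t], dim, polarity)
                    total += err
                    taus[t] = tau
                if best is None or total < best[0]:
                    best = (total, dim, polarity, taus)
        wrong, dim, polarity, taus = best
        # Step 5 (assumption): closed-form weight instead of a line search.
        total_weight = sum(sum(w) for w in weights.values())
        eps = 1e-10
        alpha = 0.5 * math.log((total_weight - wrong + eps) / (wrong + eps))
        # Steps 6 and 7: shrink the weight and update every task's scores.
        beta = shrinkage * alpha
        model.append((beta, dim, polarity, taus))
        for t, (xs, ys) in tasks.items():
            for i, x in enumerate(xs):
                scores[t][i] += beta * stump_predict(x, dim, polarity, taus[t])
    return model

def predict(model, x, task):
    return sum(beta * stump_predict(x, dim, polarity, taus[task])
               for beta, dim, polarity, taus in model)

# Toy data: the same decision problem, with the target feature shifted by 5.
tasks = {
    'source': ([[-2.0], [-1.0], [1.0], [2.0]], [-1, -1, 1, 1]),
    'target': ([[3.0], [4.0], [6.0], [7.0]], [-1, -1, 1, 1]),
}
model = fit(tasks, num_iters=5, shrinkage=0.5)
```

On this toy problem the first stump is shared by both tasks while its threshold is learned per task, so the same input value can be classified differently in the source and target domains.
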
The main difficulty in applying this method is\nin line 4 of Alg. 1, which finds the optimal values of $\\tilde{h}$ and $\\{\\tilde{\\tau}^1, \\ldots, \\tilde{\\tau}^T\\}$ that minimize Eq. (5). This can be\nvery expensive, depending on the type of weak learners employed. In the next section we show that\nregression trees and boosted stumps can be used efficiently to minimize Eq. (5) at train time.\n\n3.1 Weak Learners\n\nRegression trees have proven very effective when used as weak learners with gradient boosting [23].\nAn important advantage is that training regression trees needs practically no parameter tuning and\nis very efficient when a greedy top-down approach is used [8].\nDecision stumps represent a special case of single-level regression trees. Despite their simplicity,\nthey have been demonstrated to achieve a high performance in challenging tasks such as face and\nobject detection [24, 25]. In cases where feature dimensionality D is very large, decision stumps\nmay be preferred over regression trees to reduce training time.\n\nRegression Trees: We use trees whose splits operate on a single dimension of the feature vector,\nand follow the top-down greedy tree learning approach described in [8]. The top split is learned first,\nseeking to minimize\n\n$$\\arg\\min_{n \\in \\{1, \\ldots, D\\},\\; \\eta_1, \\eta_2, \\{\\tau^1, \\ldots, \\tau^T\\}} \\sum_{t=1}^{T} \\left( \\sum_{i=1}^{N^t} \\bar{\\mathbf{1}}\\{x^t_i[n] - \\tau^t\\}\\, w^t_{ik} \\left[\\eta_1 - r^t_{ik}\\right]^2 + \\sum_{i=1}^{N^t} \\mathbf{1}\\{x^t_i[n] - \\tau^t\\}\\, w^t_{ik} \\left[\\eta_2 - r^t_{ik}\\right]^2 \\right), \\qquad (6)$$\n\nwhere $x[n] \\in \\mathbb{R}$ denotes the value of the nth dimension of x, $\\mathbf{1}\\{\\cdot\\}$ is the indicator function, and\n$\\bar{\\mathbf{1}}\\{\\cdot\\} = 1 - \\mathbf{1}\\{\\cdot\\}$. The difference w.r.t. classic regression trees is that, besides learning the values of\n$\\eta_1$, $\\eta_2$ and n, our approach requires the tree to also learn a threshold $\\tau^t \\in \\mathbb{R}$ per task. 
Given that\neach split operates on a single attribute x[n], the resulting \u02dc\u03c4 t is sparse, and learned one component\nat a time as the tree is built.\nOnce the top split is learned, a new split is trained on each of its child leaves, in a recursive manner.\nThis process is repeated until the maximum depth L, given as a parameter, is reached, or there are\nnot enough samples to learn a new node at a given leaf.\n\n5\n\n\fDecision Stumps: Decision stumps consist of a single split and return values \u03b71, \u03b72 = \u00b11. If also\nik = \u00b11, which is true when boosting with the exponential loss, then it can be demonstrated that\nrt\nminimizing Eq (6) can be separated into T independent minimization problems for all D attributes\nfor each n. Once this is done, a quick search can be performed to determine the n that minimizes\nEq. (6). This makes decision stumps feasible for large-scale applications with very high dimensional\nfeature spaces.\nIn the special case of the exponential loss and decision stumps, it can be shown that Alg. 1 reduces\nto a procedure similar to classic AdaBoost [26], with the exception that weak learner search is done\nin the multi-task manner described above.\n4 Evaluation\nWe evaluated our approach on two challenging domain adaptation problems for which annotation\nis very time-consuming, representative of the problems we seek to address. We \ufb01rst describe the\ndatasets, our experimental setup and baselines, and \ufb01nally present and discuss the obtained results.\n\n4.1 Datasets\n\nPath Classi\ufb01cation Tracing arbors of curvilinear structures is a well studied problem that \ufb01nds\napplications in a broad range of \ufb01elds from neuroscience to photogrammetry. We consider the\ndetection of 3D curvilinear structures in 3D image stacks of Olfactory Projection Fibers (OPF)\nusing 2D aerial road images (see Fig. 1(c,d)). 
For this problem, the task is to predict whether a\ntubular path between two image locations belongs to a curvilinear structure. We used a publicly-\navailable dataset [11] of 2D aerial images of road networks as the source domain and 3D stacks of\nOlfactory Projection Fibers (OPF) from the DIADEM challenge as the target domain. The source\ndomain consists of six fully-labeled 2D aerial road images and the target domain contains eight\nfully-labeled 3D stacks. We aim at using large amounts of labeled data from 2D road images to\nleverage learning in the 3D stacks. This is a clear scenario where transfer learning can be highly\nbene\ufb01cial, because labeling 2D images is much easier than annotating 3D stacks. Therefore, being\nable to take advantage of 2D data is essential to reduce tedious 3D labeling effort.\n\nMitochondria Segmentation: Mitochondria are organelles that play an important role in cellular\nfunctioning. The goal of this task is to segment mitochondria from large 3D Electron Microscopy\n(EM) stacks of 5 nm voxel size, acquired from the brain of a rat. As in the path classi\ufb01cation\nproblem, 3D annotations are time-consuming and exploiting already-annotated stacks is essential\nto speed up analysis. The source domain is a fully-labeled EM stack from the Striatum region\nof 853x506x496 voxels with 39 labeled mitochondria. The target domain consists of two stacks\nacquired from the Hippocampus, one a training volume of size 1024x653x165 voxels and the other\na test volume that is 1024x883x165 voxels, with 10 and 42 labeled mitochondria in each respectively.\nThe target test volume is fully-labeled, while the training one is partially annotated, similar to a real\nscenario. 
To capture contextual information, state-of-the-art methods typically use \ufb01lter response\nvectors of more than 100k dimensions, which in combination with the size of this dataset, makes the\nuse of linear latent space models dif\ufb01cult and the direct application of kernel methods infeasible.\n4.2 Experimental Setup\n\nFor path classi\ufb01cation we employ a dictionary whose codewords are Histogram of Gradient Devi-\nations (HGD) descriptors, as in [11]. This is well suited for characterizing tubular structures and\ncan be applied in the same way to 2D and 3D images. This allows us, in theory, to apply a classi-\n\ufb01er trained on 2D images to 3D volumes. However, differences in appearance and geometry of the\nstructures may potentially adversely affect classi\ufb01er accuracy when 2D-trained ones are applied to\n3D stacks, which motivates domain adaptation. We use half of the target domain for training and\nhalf for testing. 2500 positive and negative samples are extracted from each image through random\nsampling, as in [11]. This results in balanced sets of 30k samples for training in the source domain,\nand 20k for training and 20k for testing in the target domain.\nTo simulate the lack of training data, we randomly pick an equal number of positive and negative\nsamples for training from the target domain. The HGD codewords are extracted from the road\nimages and used for both domains to generate consistent feature vectors. We employ gradient\nboosted trees, which in our experiments outperformed boosted stumps and kernel SVMs. For all\n\n6\n\n\fFigure 3: Path Classi\ufb01cation: Median, lower and upper quartiles of the test error as the number of\ntraining samples is varied. Our approach nears Full TD performance with as few as 70 training sam-\nples in the target domain and signi\ufb01cantly outperforms the baseline methods. 
Best viewed in color.\n\nthe boosting-based baselines we set the maximum tree depth to L = 3, equivalent to a maximum of\n8 leaves, and shrinkage \u03b3 = 0.1, as in [8]. The number of boosting iterations is set to K = 500. For\nthis dataset we report the test error computed as the percentage of mis-classi\ufb01ed examples.\nFor mitochondria segmentation we use the boosting-based method of [27], which is optimized for 3D\nstacks and whose source code is publicly available. This method is based on boosted stumps, which\nmakes it very ef\ufb01cient at both train and test time. Similar to [27], we group voxels into supervoxels to\nreduce training and testing time, which yields 15k positive and 275k negative supervoxel samples in\nthe source domain. In the target domain it renders 12k negative training samples. To simulate a real\nscenario, we create 10 different transfer learning problems using the samples from one mitochondria\nat a time as positives, which translates into approximately 300 positive training supervoxels each.\nWe use the default parameters provided by the authors of [27] in their source code (K = 2000), and\nwe evaluate segmentation performance with the Jaccard Index, as in [27].\n\n4.3 Baselines\n\nOn each dataset, we compare our approach against the following baselines: training with reference\nor target domain data only (shown as SD only and TD only), training a single classi\ufb01er with both tar-\nget and source domain data (Pooling), and with the multi-task approach of [10] (shown as Chapelle\net al.). 
We evaluate performance with varying amounts of supervision in the target domain, and also\nshow the performance of a classi\ufb01er trained with all the available labeled data, shown as Full TD,\nwhich represents fully supervised performance on this domain and is useful in gauging the relative\nperformance improvement of each method.\nWe compare to linear Canonical Correlation Analysis (CCA) and Kernel CCA (KCCA) [4] for learn-\ning a shared latent space on the path classi\ufb01cation dataset, and use a Radial Basis kernel function\nfor KCCA, which is a commonly used kernel. Its bandwidth is set to the mean distance across the\ntraining observations. The data size and dimensionality of the mitochondria dataset is prohibitive\nfor these methods, and instead we compare to Mean-Variance Normalization (MVN) and Histogram\nMatching (HM) that are common normalizations one might apply to compensate for acquisition ar-\ntifacts. MVN normalizes each input 3D intensity patch to have a unit variance and zero-mean, useful\nfor compensating for linear brightness and contrast changes in the image. HM applies a non-linear\ntransformation and normalizes the intensity values of one data volume such that the histogram of its\nintensities matches the other.\n\n4.4 Results: Path Classi\ufb01cation\n\nThe results of applying our method and the baselines for path classi\ufb01cation are shown in Fig. 3. Our\napproach outperforms the baselines, and the difference in performance is particularly accentuated\nin the case of very few training samples. The next best competitor is the multi-task method of [10],\nalthough it exhibits a much higher variance than our approach and performs poorly when only pro-\nvided a few labeled target examples. This is also the case for KCCA. 
The results of linear CCA\nare not shown in the plots because it yielded very low performance compared to the other baselines,\nachieving a 14% error rate with 1k labeled examples, with its performance degrading significantly\nwith fewer training samples. Similarly, training with SD only yields a 16% error rate.\n\n[Figure 3 plot: test error (2% to 10%) versus number of training samples in TD (20 to 1000); curves: Our Approach, Kernel CCA, Chapelle et al., Pooling, TD only, Full TD]\n\nFigure 4: Mitochondria Segmentation: Box plot of the Jaccard index measure for our method and\nthe baselines over 10 runs on the target domain. Simple Mean-Variance Normalization (MVN)\nand Histogram Matching (HM), although helpful, are unable to fully correct for differences between\nacquisitions. In contrast, our method yields a higher performance without the need for such priors\nand is able to faithfully leverage the source domain data to learn from relatively few examples in the\ntarget domain, outperforming the baseline methods.\n\nOur approach is very close to Full TD in performance when using as few as 70 training samples, even\nthough the Full TD classifier was trained with 20k samples from the target domain. This highlights\nthe ability of our method to effectively leverage the large amounts of source-domain data. As shown\nin Fig. 3, there is a clear tendency for all methods to converge at the value of Full TD, although\nour approach does so significantly faster. The low performance of Chapelle et al. [10] suggests\nthat modeling the domain shift using shared and task-specific boundaries, as is commonly done in\nmulti-task learning methods, is not a good model for domain adaptation problems such as the ones\nshown in Fig. 1. This is accentuated by the parameter tuning required by [10], done through\ncross-validation, which performs poorly when only afforded a few labeled samples in the target domain and\nyields a longer training time. 
The method of [10] took 25 minutes to train, while our approach only\ntook between 2 and 15 minutes, depending on the amount of labeled target data.\n\n4.5 Results: Mitochondria Segmentation\n\nA box plot showing the distribution of the Jaccard index (VOC score) over 10 different runs is shown in\nFig. 4. Our approach significantly outperforms the multi-task method of [10] and the other baselines,\nfollowed in performance by pooling with mean-variance normalization (MVN) and histogram\nmatching (HM). In contrast, our method yields higher performance and smaller variance over the\ndifferent runs without the need for such priors. From a practical point of view, our approach does\nnot require parameter tuning, so cross-validation is not necessary. Cross-validation can be a bottleneck in\nscenarios where large volumes of data are used for training. For this task, training our method took\nless than an hour per run, while [10] took over 7 hours due to cross-validation.\n\n5 Conclusion\n\nIn this paper we presented an approach for performing non-linear domain adaptation with boosting.\nOur method learns a task-independent decision boundary in a common feature space, obtained via\na non-linear mapping of the features in each task. This contrasts with recent approaches that learn\ntask-specific boundaries and is better suited for problems in domain adaptation where each task is an instance of the\nsame decision problem, but whose features have undergone an unknown transformation. In this setting,\nwe illustrated how the boosting-trick can be used to define task-specific feature mappings and\neffectively model non-linearity, offering distinct advantages over kernel-based approaches both in\naccuracy and efficiency. 
We evaluated our approach on two challenging bio-medical datasets where\nit achieved a significant gain over using labeled data from either domain alone and outperformed\nrecent multi-task learning methods.\n\n[Box plot, Jaccard Index axis: TD only, Pooling, Pooling + MVN, Pooling + HM, Chapelle et al., Our Approach; also labeled: Full TD, SD only.]\n\nReferences\n[1] Jiang, J.: A literature survey on domain adaptation of statistical classifiers. (2008)\n[2] Caruana, R.: Multitask Learning. Machine Learning 28 (1997)\n[3] Evgeniou, T., Micchelli, C., Pontil, M.: Learning Multiple Tasks with Kernel Methods. JMLR 6 (2005)\n[4] Bach, F.R., Jordan, M.I.: Kernel Independent Component Analysis. JMLR 3 (2002) 1\u201348\n[5] Ek, C.H., Torr, P.H., Lawrence, N.D.: Ambiguity Modelling in Latent Spaces. In: MLMI. (2008)\n[6] Salzmann, M., Ek, C.H., Urtasun, R., Darrell, T.: Factorized Orthogonal Latent Spaces. In: AISTATS. (2010)\n[7] Memisevic, R., Sigal, L., Fleet, D.J.: Shared Kernel Information Embedding for Discriminative Inference. PAMI (April 2012) 778\u2013790\n[8] Hastie, T., Tibshirani, R., Friedman, J.: The Elements of Statistical Learning. Springer (2001)\n[9] Zheng, Z., Zha, H., Zhang, T., Chapelle, O., Sun, G.: A General Boosting Method and Its Application to Learning Ranking Functions for Web Search. In: NIPS. (2007)\n[10] Chapelle, O., Shivaswamy, P., Vadrevu, S., Weinberger, K., Zhang, Y., Tseng, B.: Boosted Multi-Task Learning. Machine Learning (2010)\n[11] Turetken, E., Benmansour, F., Fua, P.: Automated Reconstruction of Tree Structures Using Path Classifiers and Mixed Integer Programming. In: CVPR. (June 2012)\n[12] Baxter, J.: A Model of Inductive Bias Learning. Journal of Artificial Intelligence Research (2000)\n[13] Ando, R.K., Zhang, T.: A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data. 
JMLR 6 (2005) 1817\u20131853\n[14] Daum\u00e9, H.: Bayesian Multitask Learning with Latent Hierarchies. In: UAI. (2009)\n[15] Kumar, A., Daum\u00e9, H.: Learning Task Grouping and Overlap in Multi-task Learning. In: ICML. (2012)\n[16] Xue, Y., Liao, X., Carin, L., Krishnapuram, B.: Multi-task Learning for Classification with Dirichlet Process Priors. JMLR 8 (2007)\n[17] Jacob, L., Bach, F., Vert, J.P.: Clustered Multi-task Learning: a Convex Formulation. In: NIPS. (2008)\n[18] Saenko, K., Kulis, B., Fritz, M., Darrell, T.: Adapting Visual Category Models to New Domains. In: ECCV. (2010)\n[19] Shon, A.P., Grochow, K., Hertzmann, A., Rao, R.P.N.: Learning Shared Latent Structure for Image Synthesis and Robotic Imitation. In: NIPS. (2006) 1233\u20131240\n[20] Kulis, B., Saenko, K., Darrell, T.: What You Saw is Not What You Get: Domain Adaptation Using Asymmetric Kernel Transforms. In: CVPR. (2011)\n[21] Gopalan, R., Li, R., Chellappa, R.: Domain Adaptation for Object Recognition: An Unsupervised Approach. In: ICCV. (2011)\n[22] Rosset, S., Zhu, J., Hastie, T.: Boosting as a Regularized Path to a Maximum Margin Classifier. JMLR (2004)\n[23] Caruana, R., Niculescu-Mizil, A.: An Empirical Comparison of Supervised Learning Algorithms. In: ICML. (2006)\n[24] Viola, P., Jones, M.: Rapid Object Detection Using a Boosted Cascade of Simple Features. In: CVPR. (2001)\n[25] Ali, K., Fleuret, F., Hasler, D., Fua, P.: A Real-Time Deformable Detector. PAMI 34(2) (February 2012) 225\u2013239\n[26] Freund, Y., Schapire, R.: A Short Introduction to Boosting. Journal of Japanese Society for Artificial Intelligence 14(5) (1999) 771\u2013780\n[27] Becker, C., Ali, K., Knott, G., Fua, P.: Learning Context Cues for Synapse Segmentation. 
TMI (2013) In Press.\n", "award": [], "sourceid": 332, "authors": [{"given_name": "Carlos", "family_name": "Becker", "institution": "EPFL"}, {"given_name": "Christos", "family_name": "Christoudias", "institution": "EPFL"}, {"given_name": "Pascal", "family_name": "Fua", "institution": "EPFL"}]}