{"title": "Learning To Learn Around A Common Mean", "book": "Advances in Neural Information Processing Systems", "page_first": 10169, "page_last": 10179, "abstract": "The problem of learning-to-learn (LTL) or meta-learning is gaining increasing attention due to recent empirical evidence of its effectiveness in applications. The goal addressed in LTL is to select an algorithm that works well on tasks sampled from a meta-distribution. In this work, we consider the family of algorithms given by a variant of Ridge Regression, in which the regularizer is the square distance to an unknown mean vector. We show that, in this setting, the LTL problem can be reformulated as a Least Squares (LS) problem and we exploit a novel meta- algorithm to efficiently solve it. At each iteration the meta-algorithm processes only one dataset. Specifically, it firstly estimates the stochastic LS objective function, by splitting this dataset into two subsets used to train and test the inner algorithm, respectively. Secondly, it performs a stochastic gradient step with the estimated value. Under specific assumptions, we present a bound for the generalization error of our meta-algorithm, which suggests the right splitting parameter to choose. When the hyper-parameters of the problem are fixed, this bound is consistent as the number of tasks grows, even if the sample size is kept constant. Preliminary experiments confirm our theoretical findings, highlighting the advantage of our approach, with respect to independent task learning.", "full_text": "Learning To Learn Around A Common Mean\n\nGiulia Denevi1,2, Carlo Ciliberto3,4, Dimitris Stamos4 and Massimiliano Pontil1,4\n\n1Istituto Italiano di Tecnologia (Italy), 2University of Genoa (Italy),\n\n3Imperial College of London (UK), 4University College of London (UK)\n\nAbstract\n\nThe problem of learning-to-learn (LTL) or meta-learning is gaining increasing\nattention due to recent empirical evidence of its effectiveness in applications. 
The\ngoal addressed in LTL is to select an algorithm that works well on tasks sampled\nfrom a meta-distribution. In this work, we consider the family of algorithms given\nby a variant of Ridge Regression, in which the regularizer is the square distance\nto an unknown mean vector. We show that, in this setting, the LTL problem can\nbe reformulated as a Least Squares (LS) problem and we exploit a novel meta-\nalgorithm to ef\ufb01ciently solve it. At each iteration the meta-algorithm processes only\none dataset. Speci\ufb01cally, it \ufb01rstly estimates the stochastic LS objective function,\nby splitting this dataset into two subsets used to train and test the inner algorithm,\nrespectively. Secondly, it performs a stochastic gradient step with the estimated\nvalue. Under speci\ufb01c assumptions, we present a bound for the generalization error\nof our meta-algorithm, which suggests the right splitting parameter to choose.\nWhen the hyper-parameters of the problem are \ufb01xed, this bound is consistent as\nthe number of tasks grows, even if the sample size is kept constant. Preliminary\nexperiments con\ufb01rm our theoretical \ufb01ndings, highlighting the advantage of our\napproach, with respect to independent task learning.\n\n1 Introduction\n\nLearning-to-learn (LTL) or meta-learning addresses the problem of learning an algorithm that \u201cworks\nwell\u201d on a class of learning tasks, which are randomly observed via a corresponding \ufb01nite set of\ntraining examples, see [4, 19, 29] and references therein. The learning tasks are assumed to share\nspeci\ufb01c similarities, in that they are sampled from a common meta-distribution, often referred to\nas environment in literature [4]. 
The LTL process aims at leveraging such similarities in order to\nlearn an algorithm which is well suited to learn new tasks sampled from the same environment.\nThis approach brings a substantial improvement over learning the tasks in isolation \u2013 known as\nindependent task learning (ITL) \u2013 especially when the sample size per task is small, a common\nsetting in many applications [6, 27, 30]. These ideas are strongly related to multitask learning (MTL)\n[7, 8, 9, 10, 14, 18, 23]. The key difference is that in MTL the goal is to perform well on the observed\ntasks while in LTL the aim is to perform well on \u201cfuture\u201d tasks.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\nIn this work, we study a particular kind of environment, in which the randomly observed tasks are\nlinear regression problems and the underlying family of learning algorithms is Ridge Regression\naround a common mean, that is, Ridge Regression in which we introduce in the regularizer a bias\nterm, playing the role of a common mean among the tasks. Starting from a stream of datasets sampled\nfrom this environment, our goal is to learn the common mean by minimizing the transfer risk [4, 19],\nwhich measures the average error of the underlying learning algorithm, trained on a random dataset\nfrom the environment. Previous theoretical investigations of LTL minimize a proxy for the\ntransfer risk, given by the average multitask empirical error [4, 19, 20, 22, 24] or the so-called future\nempirical risk [12] of the algorithm. In contrast, as a first contribution of this work, we show that the specific\nfamily of algorithms considered here naturally lends itself to directly minimizing the transfer risk of\nthe algorithm trained with fewer points. More precisely, we first observe that, in this setting, the LTL\nproblem can be reformulated as a Least Squares (LS) problem, the structure of which depends on\nthe environment. 
After this, in order to compute it, motivated by recent empirical studies on few\nshot-learning and meta-learning [15, 16, 26], we split the datasets of the training tasks in a subset\nused to train the algorithm and a subset used to estimate its risk.\nLTL is particularly appealing when considered from an online or incremental perspective, in which\nwe receive in input a sequence of datasets and the goal is to ef\ufb01ciently update an underlying learning\nalgorithm, which will then be applied to the next yet-to-be-encountered task, without the need to keep\nin memory previously encountered datasets. Our second contribution is to show that the speci\ufb01c LTL\nproblem studied in this paper can be naturally tackled incrementally by Stochastic Gradient Descent\n(SGD) procedures and, when the environment satis\ufb01es certain assumptions given in the paper, we\nprovide a complete statistical analysis of this approach, which highlights the role of the splitting\nparameter, namely, the number of points we use to train the inner algorithm. Moreover, in such a case,\nwhen the hyper-parameters of the problem are \ufb01xed, a remarkable feature of our learning approach is\nthat it provides a consistent estimate of the transfer risk as the number of tasks grows, even if the\ntasks\u2019 sample size is kept constant, whereas classical approaches would need the sample size to grow\nas well. Our proof technique leverages previous work on stochastic optimization for LS [13] with\ntools from classical LTL theory [4, 19, 20, 22].\nPaper organization. In Sec. 2 we introduce the LTL problem for the class of algorithms based\non Ridge Regression around a common mean and we show that this problem is equivalent to a LS\nproblem. In Sec. 3 we describe the online approach through which we directly attempt to minimize\nthe transfer risk for this family of learning algorithms. In Sec. 
4 we provide the statistical analysis\nfor this approach, under specific assumptions on the environment. Sec. 5 presents preliminary\nexperiments confirming our theoretical observations and, finally, in Sec. 6, we draw our conclusions\nand highlight possible future directions. All the technical proofs are reported in the appendix.\nPrevious work. Although LTL is naturally suited for the incremental setting, we are aware of only\na few theoretical investigations of it [1, 3, 12, 17, 25]. Most related to our work are [1], where the\nauthors consider a general PAC-Bayesian approach to LTL, and [12], which considers the problem of\nlearning a linear representation shared by the tasks. However, both these papers address a different\nclass of learning algorithms. Furthermore, the approach presented here follows a different strategy\nof directly minimizing the transfer risk. In the literature, the LTL problem has been almost exclusively\nconsidered in the setting in which the tasks are given in one batch [4, 19, 20, 21, 22, 24], as opposed\nto sequentially. Perhaps most related to our work is the paper [24], where the authors consider the\nsame family of learning algorithms analyzed in this paper, but in a PAC-Bayesian setting.\n\n2 Learning-to-Learn Problem\n\nIn this work we focus on the standard linear regression setting. Let Z = X × Y be the data space,\nwhere X ⊆ R^d is the input space and Y ⊆ R the output space. We denote by x, y, z generic elements\nof X, Y, Z, respectively, where z = (x, y). The discrepancy between two outputs y, y' ∈ Y is\nmeasured by the square loss ℓ(y, y') = (1/2) (y - y')^2. The symbols ‖·‖ and ⟨·,·⟩ denote the standard\nnorm and the standard scalar product in the Euclidean space and ·^⊤ represents the standard transpose\noperation. We also let B_1 ⊆ R^d be the zero-centered unit ball in R^d. For a real matrix M, we denote\nby ‖M‖_∞ its spectral (operator) norm and by ‖M‖ its Frobenius norm. 
We let λ_max(M) = ‖M‖_∞\nand λ_min(M) be the largest and smallest eigenvalue of a symmetric positive semidefinite (PSD) real\nmatrix M, respectively. The symbols ⪰ and ≻ denote the classical ordering among real symmetric\nmatrices and, for any positive integer k ∈ N, we define the set of integers [k] = {1, . . . , k}.\nFinally, all the expectations are to be understood according to the context.\n\n2.1 Environment and Transfer Risk\nWe consider the setting where we sequentially receive a stream of datasets Z_n^{(1)}, . . . , Z_n^{(T)}, . . .\nsampled from a fixed environment ρ, each of which is formed by n points¹.\n\n¹ Specifically, ρ is a distribution on the set of probability distributions on Z and each dataset is observed by\nfirst sampling a task μ ∼ ρ and then sampling a dataset Z_n ∼ μ^n of n i.i.d. points. However, with some abuse\nof notation, we refer to the environment both as the meta-distribution and the induced distribution on n-samples.\n\nStarting from these datasets, we wish to learn a learning algorithm (hence the name LTL) that performs well on a\nnew random task sampled from the environment. Specifically, we wish to find an inner algorithm\nA : ∪_{n∈N} (X × Y)^n → R^d, such that, when we train it with a new dataset composed of n points² and\nsampled from the environment, the corresponding error is low. This objective translates into requiring\nthat the transfer risk of the algorithm A trained with n points over the environment ρ, defined as\n\nℰ_n(A) = 𝔼_{μ∼ρ} 𝔼_{Z_n∼μ^n} 𝔼_{z∼μ} [ (1/2) (⟨x, A(Z_n)⟩ - y)^2 ],    (1)\n\nis as small as possible. This quantity measures the expected error (risk) that the algorithm A, trained\non the dataset Z_n, incurs on average with respect to the distribution of tasks μ induced by ρ. 
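As a rough illustration of Eq. (1), the transfer risk of a given inner algorithm can be approximated by Monte Carlo sampling. The sketch below is not taken from the paper: the toy environment (Gaussian regression vectors around a common mean, inputs uniform on the unit sphere, Gaussian noise) and all function names are assumptions made only for this example.

```python
import numpy as np

def sample_task(rng, w_mean, sigma_w):
    # Hypothetical environment: regression vectors scattered around a common mean.
    return w_mean + sigma_w * rng.standard_normal(w_mean.shape[0])

def sample_points(rng, w, n, sigma_eps):
    # Inputs uniform on the unit sphere, labels y = <x, w> + eps with Gaussian noise.
    X = rng.standard_normal((n, w.shape[0]))
    X /= np.linalg.norm(X, axis=1, keepdims=True)
    y = X @ w + sigma_eps * rng.standard_normal(n)
    return X, y

def transfer_risk(algorithm, rng, w_mean, sigma_w, sigma_eps, n,
                  n_tasks=500, n_test=100):
    # Monte Carlo estimate of Eq. (1): average over tasks, training sets and test points.
    risks = []
    for _ in range(n_tasks):
        w = sample_task(rng, w_mean, sigma_w)                   # draw a task
        X, y = sample_points(rng, w, n, sigma_eps)              # draw its training set
        w_hat = algorithm(X, y)                                 # run the inner algorithm
        X_te, y_te = sample_points(rng, w, n_test, sigma_eps)   # fresh test points
        risks.append(0.5 * np.mean((X_te @ w_hat - y_te) ** 2))
    return float(np.mean(risks))

rng = np.random.default_rng(0)
w_mean = np.full(5, 4.0)

def constant_at_mean(X, y):
    # Ignores the data and always predicts with the common mean.
    return w_mean

def constant_at_zero(X, y):
    # Ignores the data and always predicts with the zero vector.
    return np.zeros(5)

print(transfer_risk(constant_at_mean, rng, w_mean, 1.0, 0.1, n=20))
print(transfer_risk(constant_at_zero, rng, w_mean, 1.0, 0.1, n=20))
```

Averaging over several test points per task, rather than a single one, leaves the expectation unchanged and only reduces the variance of the estimate.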
That\nis, to compute the transfer risk, we first draw a task μ ∼ ρ and a corresponding n-sample Z_n ∈ Z^n\nfrom μ^n, we then apply the learning algorithm to obtain the estimator A(Z_n) and finally we measure\nthe risk of this estimator on the distribution μ by taking the expectation over a test point z.\nThroughout the paper, we make the following assumption.\nAssumption 1 (Linear regression tasks). The meta-distribution ρ samples linear regression tasks\nparametrized as μ = (w, p, η), where w ∈ R^d is the regression vector, p is the input marginal and η\nis the noise distribution such that η | p has zero mean, for almost every p. In particular, the sampling\nof a datapoint from μ is to be understood as\n\n(x, y) ∼ μ ⟺ x ∼ p,  ε ∼ η,  y = ⟨x, w⟩ + ε.    (2)\n\nMoreover, denoting by Σ_μ = 𝔼_{z∼μ} x x^⊤, we assume that 𝔼_{μ∼ρ} λ_min(Σ_μ) > 0³.\nIn this work, each dataset Z_n is given by an i.i.d. sample of n points (x_i, y_i)_{i=1}^n ∼ μ^n. In the\nfollowing, we will use the more compact notation Z_n = (X_n, y_n), where X_n ∈ R^{n×d} denotes\nthe matrix having the vectors x_i as rows, y_n = (y_i)_{i=1}^n ∈ R^n is the vector of labels and ε_n =\n(ε_i)_{i=1}^n ∈ R^n is the vector containing the noise on the labels. Moreover, by Asm. 1 the model\nequation y_n = X_n w + ε_n holds. We will denote by z = (x, y) ∼ μ a test point, which is\nalways understood to be independent of the training set and, analogously, satisfies\ny = ⟨x, w⟩ + ε. 
The above assumption on the environment is mainly made to simplify our exposition;\nthe general case may be pursued by considering the approximation error due to the choice of the class\nof linear functions and a possible dependency of the noise on the inputs (heteroscedastic noise).\n\n2.2 The Family of Learning Algorithms\nThe subject of our study in this paper is the family of learning algorithms based on Ridge Regression,\nwhere we introduce a further bias term in the regularizer. More precisely, for any training dataset\nZ_n = (X_n, y_n) ∼ μ^n and for any h ∈ R^d, we consider the following algorithm\n\nZ_n ↦ A(Z_n) ≡ w_h(Z_n) = argmin_{w∈R^d} (1/n) ‖X_n w - y_n‖^2 + λ ‖w - h‖^2,    (3)\n\nwhere λ > 0 is a hyper-parameter of the algorithm. A direct computation gives the closed form\n\nw_h(Z_n) = C_{λ,n}^{-1} ( X_n^⊤ y_n / n + λ h ),    C_{λ,n} = X_n^⊤ X_n / n + λ I.    (4)\n\nOur goal is to leverage similarities between the tasks in the environment via the common mean h.\nIntuitively, if the regression vectors w sampled from the environment have a large mean w̄ and a\nsmall variance, then applying Ridge Regression around w̄ should bring better estimation risk of these\nregression vectors than solving each task independently, for instance, with standard Ridge Regression\n(h = 0). In this work, we address the problem of choosing an algorithm in the above family that\nperforms well on tasks sampled from the environment according to the above LTL setting. This\ntranslates into selecting a parameter h such that the associated algorithm in the family has a small\ntransfer risk. Hence, considering λ as a hyper-parameter external to our LTL problem and using the\nnotation ℰ_n(h) ≡ ℰ_n(w_h) for the transfer risk in Eq. (1), our goal is to minimize ℰ_n with respect to\nh. In the next section, we will see that, thanks to the fact that the algorithm is an affine transformation\nof the parameter h (see Eq. 
(4)), this function is not only convex (an unusual property in the LTL\nsetting), but also has a particular LS structure.\n\n² For simplicity we assume that every dataset is composed of the same number of points n.\n³ This is the meta-version of a frequent assumption in the single-task LS literature, where usually the invertibility\nof the covariance matrix is required, see e.g. [13] and references therein.\n\n2.3 Minimizing the Transfer Risk: a Least Squares Problem\nThe following proposition establishes that the problem of minimizing ℰ_n over the common mean h\ncan be reformulated as a LS problem with transformed inputs and outputs, sampled from a distribution\ninduced by sampling an n-dataset from the original environment. The proof is reported in App. B.\nProposition 1 (LS Problem Around a Common Mean for ℰ_n). For any λ > 0 and h ∈ R^d, the\ntransfer risk ℰ_n in Eq. (1) of the learning algorithm w_h in Eqs. (3)-(4) can be rewritten as\n\nℰ_n(h) = (1/2) 𝔼_{x̄_n, ȳ_n} [ (⟨x̄_n, h⟩ - ȳ_n)^2 ],    (5)\n\nwhere the meta-data are given by\n\nx̄_n = λ C_{λ,n}^{-1} x  and  ȳ_n = y - ⟨x̄_n, X_n^⊤ y_n / (λ n)⟩.\n\nMoreover, under Asm. 1, the meta-covariance matrix Σ̄_n = 𝔼 x̄_n x̄_n^⊤ = λ^2 𝔼[C_{λ,n}^{-1} Σ_μ C_{λ,n}^{-1}] is invertible\nand h*_n = Σ̄_n^{-1} 𝔼[ȳ_n x̄_n] is the unique minimizer of the LS function in Eq. (5). In such a case,\nletting v = w - h*_n, we have that ȳ_n = ⟨x̄_n, h*_n⟩ + ε̄_n, with ε̄_n = ε + ⟨x̄_n, v - X_n^⊤ ε_n / (λ n)⟩.\nRemark 1 (A Misspecified LS Problem with Heteroscedastic Noise for ℰ_n). From Prop. 
1 we observe\nthat, even though the original tasks are well-specified linear models with homoscedastic (i.e. not\ndepending on the inputs) noise, without further assumptions on the environment, the linear model for\nthe meta-LS problem is usually not well-specified, that is, in general, it may hold that 𝔼[ε̄_n | x̄_n] ≠ 0,\nand, moreover, the noise is heteroscedastic.\n\nAs a consequence of the above proposition, we can exploit standard results about LS. For instance,\nit is easy to show that, for a generic vector h ∈ R^d, the excess transfer risk ℰ_n(h) - ℰ_n(h*_n)\ncoincides with the weighted square norm (1/2) ‖Σ̄_n^{1/2} (h - h*_n)‖^2. In particular, the gap between the\nperformance of standard Ridge Regression (h = 0), corresponding to solving each task independently,\nand the best algorithm in our class (i.e. the algorithm associated to the parameter h*_n) is given by\nℰ_n(0) - ℰ_n(h*_n) = (1/2) ‖Σ̄_n^{1/2} h*_n‖^2. Characterizing situations in which this gap is significant is not\nobvious; in the following we will see that, by making further assumptions on the environment, we\nwill be able to answer this question (see Rem. 3 below).\n\n2.4 Two Examples of Environments\nSo far we have described the basic requirements of the environment considered in this work. However,\nmaking further assumptions about the data generation process will allow us to further analyze the\nLTL problem. We now discuss two specific examples. Their proofs are reported in App. B.\nExample 1. Let X ⊆ B_1 and let the environment ρ satisfy Asm. 1. Furthermore, assume for\nalmost every p that: i) η | p has variance bounded by σ_ε^2, for σ_ε ≥ 0, ii) 𝔼[w | p] = w̄, and iii)\n𝔼[(w - w̄)(w - w̄)^⊤ | p] ⪯ σ_w^2 I, for σ_w ≥ 0. 
Then, for any λ > 0 and h ∈ R^d, h*_n = w̄ and, letting\nA_n = C_{λ,n}^{-1} x x^⊤ C_{λ,n}^{-1}, we have\n\nℰ_n(h) ≤ (1/2) ‖Σ̄_n^{1/2} (h - w̄)‖^2 + (σ_w^2 / 2) tr Σ̄_n + (σ_ε^2 / 2) ( 1 + tr( 𝔼[ X_n^⊤ X_n A_n / n^2 ] ) ).    (6)\n\nThe fact that the minimizer of the transfer risk in Ex. 1 does not depend on n will be fundamental in\nthe subsequent analysis and we remark that the proof of this statement exploits only Asm. 1 and\npoint ii). Moreover, the upper bound in Eq. (6) is tight. For a discussion on the link between our\nLTL problem for the setting in Ex. 1 and the Mean Estimation problem (see e.g. [11]) we refer to\nRem. 9 in App. B. We now point out that Ex. 1 is an exception to what was observed in Rem. 1.\nRemark 2 (A Well-Specified LS Problem for Ex. 1). For the environment in Ex. 1, the linear model\nfor the LS problem described in Prop. 1 is well-specified. Indeed, we have that\n\n𝔼[ε̄_n | x̄_n] = 𝔼[ε | p] + ⟨x̄_n, 𝔼[v | p] - (λ n)^{-1} X_n^⊤ 𝔼[ε_n | p]⟩ = 0.\n\nSpecifically, to get the above statement, according to Asm. 1, we have exploited the independence of\nthe sampling of x, w, ε, conditioned with respect to p, the fact that the linear model is well-specified\nfor the original tasks and the relation 𝔼[v | p] = 𝔼[w | p] - w̄ = 0, which holds in the setting of Ex. 1.\n\nIn the setting of Ex. 1 we can give conditions under which the gap between the performance of ITL\nand Ridge Regression with the best mean is significant. We state this in the following remark, the\nproof of which is reported in App. B.\nRemark 3 (Advantage of Learning Around the Best Mean over ITL in Ex. 1). Consider the setting\nof Ex. 1. If the noise satisfies σ_ε^2 ≪ (n^{-1} λ^{-2} + 1)^{-1} ‖Σ̄_n^{1/2} w̄‖^2 and the regression vectors are such\nthat σ_w^2 ≪ tr(Σ̄_n)^{-1} ‖Σ̄_n^{1/2} w̄‖^2, then ℰ_n(0) - ℰ_n(w̄) ≫ ℰ_n(w̄).\n\nA special case of Ex. 1 is when the input marginal is always the same. 
The next example generalizes\nthis to a mixture and provides a scenario in which the minimizer h*_n of the transfer risk varies with n.\nExample 2. Let X ⊆ B_1 and consider the environment ρ formed by K ∈ N\\{0} clusters of tasks\nparametrized by the triplet (w, p, η) as in Asm. 1. Assume that each cluster k ∈ [K] is associated to a\nmarginal distribution p_k that is sampled with probability P(p = p_k) = ν_k > 0. For any k ∈ [K] and\nλ > 0, let w̄_k = 𝔼[w | p = p_k], Σ̄_{n,k} = (λ n)^2 𝔼[A_n | p = p_k] with A_n = C_{λ,n}^{-1} x x^⊤ C_{λ,n}^{-1} and assume\nthat i) η | p = p_k has variance bounded by σ_{ε,k}^2 for σ_{ε,k} ≥ 0, ii) 𝔼[(w - w̄_k)(w - w̄_k)^⊤ | p = p_k] ⪯\nσ_{w,k}^2 I for σ_{w,k} ≥ 0. Then, for any λ > 0 and h ∈ R^d, h*_n = (∑_{k=1}^K ν_k Σ̄_{n,k})^{-1} ∑_{k=1}^K ν_k Σ̄_{n,k} w̄_k\nand ℰ_n(h) = ∑_{k=1}^K ν_k ℰ_{n,k}(h), where\n\nℰ_{n,k}(h) ≤ (1/2) ‖Σ̄_{n,k}^{1/2} (h - w̄_k)‖^2 + (σ_{w,k}^2 / 2) tr Σ̄_{n,k} + (σ_{ε,k}^2 / 2) ( 1 + tr( 𝔼[ X_n^⊤ X_n A_n / n^2 | p = p_k ] ) ).\n\nIn the subsequent analysis we will give theoretical guarantees for our LTL approach for Ex. 1 only.\nHowever, it is natural to expect that, also in the setting described in Ex. 2, when both the variance\nwithin each cluster and the variance between the clusters themselves are small, there should be an\nadvantage in applying our LTL method around a single mean, in comparison to solving each task\nindependently. The experiments in Sec. 5 will confirm this. In the next section, we describe the\nonline meta-algorithm that we propose in order to address the LTL problem outlined above.\n\n3 The Splitting Stochastic Meta-Algorithm\n\nWe recall that our goal is, given a sequence of datasets Z_n^{(1)}, . . . , Z_n^{(T)} sampled from ρ, to find a\nparameter h so that the associated inner algorithm in Eqs. (3)-(4) works well on new datasets of n\npoints sampled from the same environment. This can be translated into minimizing the transfer risk\nℰ_n in Eqs. 
(1)-(5). However, in our setting, we cannot directly minimize this function, as we would\nneed a further test point to compute the risk of the inner algorithm. Hence, we proceed as follows.\n\n3.1 The Splitting Step\n\nInspired by recent work on few-shot learning [15, 16, 26], we do not use all the n points in each dataset\nto train the inner algorithm (i.e. we do not work on what in the literature is called future empirical risk\n[12, 20]), but we sacrifice a subset of them for testing. More precisely, we fix a value r ∈ [n - 1] and,\nwhen we receive a new dataset Z_n, we split it into two parts, Z_n = (Z_r, Z_{n-r}), where Z_r = (X_r, y_r)\ncontains the first r points of Z_n and Z_{n-r} = (X_{n-r}, y_{n-r}) contains the remaining n - r points.\nNote that the two datasets Z_r and Z_{n-r} are independent of one another, conditioned on\nthe task. Once this splitting is performed, we use Z_r to train the inner algorithm in Eqs. (3)-(4),\nreplacing in its functional form Z_n by Z_r, while the remaining part of the data Z_{n-r} is used to estimate\nthe transfer risk of the corresponding algorithm by the formula\n\nℰ_r(h) = (1/2) 𝔼_{μ∼ρ} 𝔼_{Z_r∼μ^r} 𝔼_{Z_{n-r}∼μ^{n-r}} [ (1/(n - r)) ‖X_{n-r} w_h(Z_r) - y_{n-r}‖^2 ].    (7)\n\nSome remarks about the definition of the inner algorithm and Eq. (7) are in order.\nRemark 4 (Normalization Factor). As described before, for any r ∈ [n - 1], the inner algorithm in\nEq. (3) is applied to the dataset Z_r, with the same normalization factor 1/n (and not 1/r), i.e.\n\nw_h(Z_r) = argmin_{w∈R^d} (1/n) ‖X_r w - y_r‖^2 + λ ‖w - h‖^2 = argmin_{w∈R^d} (1/r) ‖X_r w - y_r‖^2 + (λ n / r) ‖w - h‖^2.    (8)\n\nAccording to this definition, we are using the biased Ridge Regression with the standard normalization\nfactor, in which we divide the regularization parameter λ by the fraction of points used for training.\nNote that the fewer the training points, the stronger the effect of the regularization. 
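The equivalence in Eq. (8) and the splitting step can be checked numerically. The following is a minimal sketch under stated assumptions, not the authors' code: the function names are hypothetical, the closed form is the one in Eq. (4) with Z_n replaced by Z_r, and the data are random placeholders.

```python
import numpy as np

def inner_algorithm(X_r, y_r, h, lam, n):
    # Closed form of the biased ridge problem with the 1/n normalization:
    #   argmin_w (1/n) ||X_r w - y_r||^2 + lam ||w - h||^2
    d = X_r.shape[1]
    C = X_r.T @ X_r / n + lam * np.eye(d)
    return np.linalg.solve(C, X_r.T @ y_r / n + lam * h)

def inner_algorithm_rescaled(X_r, y_r, h, lam, n):
    # Same estimator written with the 1/r normalization and the
    # rescaled regularization parameter lam * n / r (requires r >= 1).
    r, d = X_r.shape
    lam_r = lam * n / r
    C = X_r.T @ X_r / r + lam_r * np.eye(d)
    return np.linalg.solve(C, X_r.T @ y_r / r + lam_r * h)

def split_and_estimate(X, y, h, lam, r):
    # Train on the first r points, then measure half the mean squared
    # error on the remaining n - r points of the same dataset.
    n = X.shape[0]
    w = inner_algorithm(X[:r], y[:r], h, lam, n)
    residual = X[r:] @ w - y[r:]
    return 0.5 * float(np.mean(residual ** 2))

rng = np.random.default_rng(0)
n, d, r, lam = 20, 5, 8, 0.1
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)
h = rng.standard_normal(d)

w_a = inner_algorithm(X[:r], y[:r], h, lam, n)
w_b = inner_algorithm_rescaled(X[:r], y[:r], h, lam, n)
print(np.allclose(w_a, w_b))
print(split_and_estimate(X, y, h, lam, r))
```

In this sketch the held-out value returned by split_and_estimate is a single-dataset estimate of the quantity inside the expectation in Eq. (7).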
In such a case the\noutput of the algorithm will be encouraged to stay closer to the estimated mean h, thereby transferring\nmore knowledge among the tasks.\nRemark 5 (Conditional Mini-Batch). Since the test points in Z_{n-r} are conditionally i.i.d. with respect\nto the training points Z_r, the definition of ℰ_r with more test points in Eq. (7) is equivalent to the\none with just one test point. However, from an algorithmic point of view, considering more than\none test point in the definition of ℰ_r will be an important aspect. In fact, even if this technique\ncannot be properly interpreted as a standard mini-batch \u2013 since the test points are not independent\nwith respect to the global distribution (they are just conditionally independent) \u2013 we will see that,\nin the setting described in Ex. 1, working with more test points brings similar benefits to standard\nmini-batches. More precisely, it will reduce the variance of the unbiased estimates of the gradient\nused by our stochastic approach (see Lemma 19 in App. D). We stress that, in our analysis, the above\nstatement derives from both the specific characteristics of Ex. 1 and the normalization factor 1/n in\nthe algorithm.\nIn analogy with the case analyzed before for ℰ_n in Prop. 1, exploiting again the following closed\nform of the algorithm in Eq. (8)\n\nw_h(Z_r) = C_{λ,r}^{-1} ( X_r^⊤ y_r / n + λ h ),    C_{λ,r} = X_r^⊤ X_r / n + λ I,    (9)\n\nwe can rewrite the transfer risk ℰ_r in Eq. (7) as a LS function, but, differently from the setting in\nProp. 1, the LS function in this case is vector-valued. The following proposition formalizes this. Its\nproof follows along the same lines as that of Prop. 1 and is reported in App. B.\nProposition 2 (LS Problem Around a Common Mean for ℰ_r). For any λ > 0, h ∈ R^d and r ∈ [n - 1],\nthe transfer risk ℰ_r in Eq. (7) of the learning algorithm w_h in Eqs. 
(8)-(9) can be rewritten as\n\nℰ_r(h) = (1/2) 𝔼_{X̄_r, ȳ_r} [ ‖X̄_r h - ȳ_r‖^2 ],    (10)\n\nwhere the meta-data are given by\n\nX̄_r = (λ / √(n - r)) X_{n-r} C_{λ,r}^{-1},    ȳ_r = (1 / √(n - r)) ( y_{n-r} - (√(n - r) / (λ n)) X̄_r X_r^⊤ y_r ).    (11)\n\nMoreover, under Asm. 1, the meta-covariance matrix Σ̄_r = 𝔼 X̄_r^⊤ X̄_r = λ^2 𝔼[C_{λ,r}^{-1} Σ_μ C_{λ,r}^{-1}] is\ninvertible and h*_r = Σ̄_r^{-1} 𝔼[X̄_r^⊤ ȳ_r] is the unique minimizer of the LS function in Eq. (10). In such a\ncase, letting v = w - h*_r, we have that ȳ_r = X̄_r h*_r + ε̄_r, with\n\nε̄_r = (1 / √(n - r)) ε_{n-r} + X̄_r ( v - X_r^⊤ ε_r / (λ n) ).    (12)\n\nNotice that, by virtue of Rem. 4 and Rem. 5, all the statements in Ex. 1 and Ex. 2 can be extended to\nℰ_r by a rescaling of the parameter λ. We now point out some observations regarding the LS problem\nintroduced in Prop. 2.\nRemark 6 (A Misspecified Vector-Valued LS Problem with Heteroscedastic Noise for ℰ_r). Eq. (12)\nimplies that the transformed data points, i.e. the rows of X̄_r and the components of ȳ_r, are not\nindependent. Indeed, due to the common training dataset Z_r, there are dependencies between the\ncomponents of the meta-noise in Eq. (12). Similarly to what was observed in Rem. 1 for ℰ_n, the\nnoise in the LS problem associated to ℰ_r is also heteroscedastic; moreover, the linear model is usually not\nwell-specified, with the exception of the setting in Ex. 1 (as already observed for ℰ_n in Rem. 2).\nWe conclude this section by describing how the previous setting can be naturally extended to the case\nin which we use all the points in each dataset to test the inner algorithm.\nRemark 7 (The Case r = 0). 
Interpreting Z_0 = (X_0, y_0) as the empty set and the associated\nTikhonov matrix as C_{λ,0} = λ I, all the results stated above can be extended to the case r = 0.\nSpecifically, we define the algorithm as w_h(Z_0) = argmin_{w∈R^d} λ ‖w - h‖^2 = h and its transfer risk\nas ℰ_0(h) = (1/2) 𝔼_{μ∼ρ} 𝔼_{Z_n∼μ^n} [ (1/n) ‖X_n h - y_n‖^2 ]. Again, the points in Z_n are i.i.d., but we keep all of\nthem because of what was observed in Rem. 5. Moreover, making the identifications X̄_r = X_n/√n and\nȳ_r = y_n/√n, the statements in Prop. 2 can be automatically extended to the case r = 0 as well.\n\nAlgorithm 1 The Splitting Stochastic Meta-algorithm\n\nInput λ > 0, r ∈ {0} ∪ [n - 1] (splitting parameter), 0 < γ ≤ 1/2 (step size)\nInitialization h^{(0)} ∈ R^d\nFor t = 1 to T\n    Receive Z_n^{(t)} = (X_n^{(t)}, y_n^{(t)})\n    Split Z_n^{(t)} = (Z_r^{(t)}, Z_{n-r}^{(t)}) with Z_r^{(t)} = (X_r^{(t)}, y_r^{(t)}), Z_{n-r}^{(t)} = (X_{n-r}^{(t)}, y_{n-r}^{(t)})    → Splitting step\n    Build X̄_r^{(t)} and ȳ_r^{(t)} by Eq. (11)\n    Compute ∇L_r(h^{(t-1)}, X̄_r^{(t)}, ȳ_r^{(t)}) = X̄_r^{(t)⊤} (X̄_r^{(t)} h^{(t-1)} - ȳ_r^{(t)})\n    Update h^{(t)} = h^{(t-1)} - γ ∇L_r(h^{(t-1)}, X̄_r^{(t)}, ȳ_r^{(t)})    → Stochastic step\nReturn h̄_{T,r,λ} = (1/(T + 1)) ∑_{t=0}^T h^{(t)}\n\n3.2 The Stochastic Step\nOnce we have received a dataset and we have split it in order to compute the transfer risk ℰ_r in Eq.\n(10), we apply Stochastic Gradient Descent (SGD) [13, 28] to minimize the function ℰ_r, see Alg. 1.\nWe remark that, in our case, the application of the algorithm is slightly different from the classical\nsetting of LS regression, since, at each iteration, we process a data point of the form (X, y) with X\na matrix and y a vector, while, in the standard setting, we usually sample a vector x and a real number y.\nMore precisely, thanks to Prop. 2, introducing the function\n\nL_r(h, X̄_r, ȳ_r) = (1/2) ‖X̄_r h - ȳ_r‖^2,\n\nwe can rewrite the transfer risk ℰ_r in Eq. 
(10) as ℰ_r(h) = 𝔼_{X̄_r, ȳ_r} L_r(h, X̄_r, ȳ_r). The estimator\nh̄_{T,r,λ} returned by Alg. 1 is given by the average of the iterates. The next section is devoted to\nthe analysis of the statistical properties of this estimator, remembering that, since we are interested\nin testing the performance of the algorithm using all the n data points, our final aim is to minimize\nℰ_n and not ℰ_r. Note, however, that the meta-algorithm allows us to minimize directly ℰ_r for every\nr ≤ n - 1, which may be useful when we observe future datasets of sample size smaller than n.\n\n4 Statistical Analysis\n\nIn the proposition below, we give an upper bound on the expected excess transfer risk ℰ_n(h̄_{T,r,λ}) -\nℰ_n(h*_n) for Ex. 1. The bound also suggests how to choose the splitting parameter r for that specific\nenvironment. A complete statistical analysis for more general environments remains an interesting\nopen problem, which will be addressed in future work.\nProposition 3. Assume X ⊆ B_1 and, for any r ∈ {0} ∪ [n - 1] and λ > 0, let h̄_{T,r,λ} be the output\nof Alg. 1. Then, the expected excess transfer risk of the algorithm in Eqs. (8)-(9) with parameter\nh̄_{T,r,λ} trained with n points over the environment in Ex. 1 is bounded by\n\n𝔼[ℰ_n(h̄_{T,r,λ}) - ℰ_n(h*_n)] ≤ (4 K_ρ (r/n + λ)^2 / λ^2) (1 / (T + 1)) ( (1/γ) ‖h^{(0)} - w̄‖^2 + ( σ_w^2 + ( 1/(n - r) + r/(λ n)^2 ) σ_ε^2 ) d ),\n\nwhere the expectation is over the datasets Z_n^{(1)}, . . . , Z_n^{(T)} and K_ρ is the condition number of the environment, defined as\n\nK_ρ = ‖𝔼_{μ∼ρ} Σ_μ‖_∞ / 𝔼_{μ∼ρ} λ_min(Σ_μ).    (13)\n\nWe now briefly sketch the proof of Prop. 3; all the details are reported in App. E.\n\nProof. We start by observing that, since we are dealing with a LS problem (see Prop. 
1), the excess\ntransfer risk coincides with ℰ_n(h̄_{T,r,λ}) - ℰ_n(h*_n) = (1/2) ‖Σ̄_n^{1/2} (h̄_{T,r,λ} - h*_n)‖^2. Adding and subtracting\nΣ̄_n^{1/2} h*_r inside the norm, taking the expectation and using the inequality ‖a + b‖^2 ≤ 2‖a‖^2 + 2‖b‖^2\nfor any two vectors a, b ∈ R^d, we get that\n\n𝔼[ℰ_n(h̄_{T,r,λ}) - ℰ_n(h*_n)] ≤ A + B,    A = 𝔼[ ‖Σ̄_n^{1/2} (h̄_{T,r,λ} - h*_r)‖^2 ],    B = 𝔼[ ‖Σ̄_n^{1/2} (h*_r - h*_n)‖^2 ].\n\nThanks to the structure of the environment we are considering, we have that h*_r = w̄ for any r (see\nEx. 1), and consequently, the term B vanishes. Regarding the term A, it is possible to show that\n\nA ≤ (2 K_ρ (r/n + λ)^2 / λ^2) 𝔼[ℰ_r(h̄_{T,r,λ}) - ℰ_r(h*_r)],    (14)\n\nwhere K_ρ is defined in Eq. (13). Finally, the term 𝔼[ℰ_r(h̄_{T,r,λ}) - ℰ_r(h*_r)] can be bounded via\nstandard convergence rates for SGD for LS problems (see Thm. 12 in App. C). The result follows by\nestimating the quantities in this rate in the specific setting of Ex. 1 (see Thm. 20 in App. D).\n\nThe bound in Prop. 3 is increasing in r; consequently, it suggests that, in the specific case of Ex. 1,\nthe optimal choice of the splitting parameter is r = 0, corresponding to using all the points\nin each dataset to test the inner algorithm that outputs h, for any training set. Our analysis reveals\nthat this fact is due to both the specific characteristics of Ex. 1 and the normalization of the inner\nalgorithm. Moreover, as we will see in the next section, this optimal choice of r is also confirmed by\nour experiments. Finally, we observe that, differently from standard bounds for LTL [12, 20, 22], the\nbound in Prop. 3 is consistent as T → ∞, for any fixed value of the sample size n and any fixed value\nof the hyper-parameter λ. We note that obtaining such strong guarantees for more general environments\nis more challenging since in general h*_r ≠ h*_n (see e.g. Ex. 2). 
Thus, as explained in the proof sketch above, the decomposition of the excess transfer risk usually involves a further term measuring the weighted distance between the solutions of the two different LS problems.\n\n5 Experiments\n\nWe report the empirical evaluation of our LTL estimator on synthetic and real data (the code used for the following experiments is available at https://github.com/dstamos). In the experiments, $\\lambda$ and the splitting parameter $r$ were tuned by cross-validation (see App. F for more details). Specifically, we considered 20 candidate values of $\\lambda$ in the range $[10^{-6}, 10^{2}]$, logarithmically spaced.\n\nSynthetic Datasets. We considered two different data generation protocols aimed at empirically investigating our analysis in Ex. 1 and Ex. 2. In both settings we considered a LTL scenario with 100 training datasets (tasks) observed one at a time. We used 50 tasks to perform model selection and 200 tasks for test. Each task corresponds to a dataset $(x_i, y_i)_{i=1}^n$, $x_i \\in \\mathbb{R}^d$ with $d = 30$ and $y_i$ generated according to the specific setting (see below). For the training tasks we fixed $n = 20$.\n\n\u2022 Ex. 1. We generated the datasets according to an environment satisfying the hypotheses of Ex. 1. For each task we generated the inputs $x$ uniformly on the sphere of radius 1 and the task vector $w$ from a Gaussian distribution with mean $\\bar{w} = 4 \\in \\mathbb{R}^d$, the vector with entries all equal to 4, and standard deviation $\\sigma_w = 1$. The labels were generated as $y = \\langle x, w \\rangle + \\epsilon$, with $\\epsilon$ sampled from a zero-mean Gaussian distribution with standard deviation chosen to have signal-to-noise ratio equal to 5 for each task. Fig. 1 (Left) reports the performance of our LTL estimator for different split sizes $r$. Interestingly, we observe exactly the behavior predicted by the theoretical analysis for Ex. 1: in this setting, the optimal LTL strategy is to set $r = 0$. This motivates us to consider the following more challenging environment.\n\n\u2022 Ex. 2. We generated the data according to the environment described in Ex. 2 for $K = 2$ groups of tasks. The two marginals $p_k$ were chosen as Gaussian distributions with the same standard deviation $\\sigma = 1$ and means $m_1 = 0$ and $m_2 = 2$. Analogously, the task vectors $w$ were sampled from one of two Gaussians with the same standard deviation $\\sigma = 1$ and means $\\bar{w}_1 = 2$ and $\\bar{w}_2 = 4$, respectively. The two clusters were randomly selected with the same probability $\\nu_k = 1/2$, $k \\in [2]$. We generated $y = \\langle x, w \\rangle + \\epsilon$ with $\\epsilon$ sampled with the same strategy as above. Fig. 1 (Right) reports the performance of our LTL estimator for different split sizes $r$. Here, differently from the previous case, the best choice for $r$ is a trade-off between $0$ and $n - 1$. We also compared our LTL estimator with the ITL baseline (applying Ridge Regression with $h = 0$ independently for each task) as the number of tasks increases. This comparison is reported in Fig. 2 (Left). For LTL we report the performance for the best split size $r$ chosen by cross-validation. Note that LTL rapidly improves with respect to ITL.\n\nFigure 1: Mean square error (MSE) on the validation tasks after 100 training tasks of the LTL estimator for varying dataset split sizes. Data generated according to (Left) Ex. 1 and (Right) Ex. 2. The dot indicates the minimum. The results are averaged over 30 independent runs (dataset generations).\n\nFigure 2: Performance of LTL vs ITL with respect to an increasing number of tasks. (Left) MSE on the test tasks for the dataset generation of Ex. 2. (Right) Explained variance on the test tasks for the School Dataset. The results are averaged over 30 independent runs (datasets/train-split generations).\n\nSchool Dataset. We compared the performance of LTL and ITL on the School dataset (see [2]), which contains 139 tasks of dimension $d = 26$ each. For all the experiments we randomly sampled 75 tasks for training, 25 for validation and the rest for test. 
Performance of the two methods was measured in terms of the explained variance [2] (the higher the better). Fig. 2 (Right) reports the performance of the two methods. Here too, we report the performance of LTL for the best $r$ chosen by cross-validation. We observe that, again, LTL very rapidly outperforms ITL. This experiment shows that the LTL framework around a common mean is a knowledge-transfer method that can be successfully applied not only to synthetic toy datasets, but also to appropriate real data.\n\n6 Conclusion and Future Work\n\nIn this paper we considered the Learning-To-Learn setting in which the underlying algorithm is Ridge Regression around a common mean. We showed that the associated LTL problem coincides with a Least Squares problem with data distribution induced by the tasks\u2019 meta-distribution and, in order to solve it, we presented a novel stochastic meta-algorithm based on direct optimization of the transfer risk. Under specific assumptions on the environment, we derived an analysis of the generalization performance of the method, which highlights the role of the splitting parameter in the learning process. Preliminary experiments confirmed our theoretical findings. Future work will be devoted to extending the statistical analysis of our meta-algorithm to more general environments and to comparing our approach with the more standard one working with the future empirical risk.\n\nAcknowledgments\n\nThis work was supported in part by the UK Defence Science and Technology Laboratory (Dstl) and Engineering and Physical Research Council (EPSRC) under grant EP/P009069/1. This is part of the collaboration between US DOD, UK MOD and UK EPSRC under the Multidisciplinary University Research Initiative.\n\nReferences\n\n[1] P. Alquier, T. T. Mai, and M. Pontil. Regret bounds for lifelong learning. 
In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, volume 54 of Proceedings of Machine Learning Research, pages 261\u2013269, 2017.\n\n[2] A. Argyriou, T. Evgeniou, and M. Pontil. Convex multi-task feature learning. Machine Learning, 73(3):243\u2013272, 2008.\n\n[3] M.-F. Balcan, A. Blum, and S. Vempala. Efficient representations for lifelong learning and autoencoding. In Conference on Learning Theory, pages 191\u2013210, 2015.\n\n[4] J. Baxter. A model of inductive bias learning. J. Artif. Intell. Res., 12:149\u2013198, 2000.\n\n[5] R. Bhatia. Matrix Analysis. Springer, 1997.\n\n[6] R. Camoriano, G. Pasquale, C. Ciliberto, L. Natale, L. Rosasco, and G. Metta. Incremental robot learning of new objects with fixed update time. In International Conference on Robotics and Automation, 2017.\n\n[7] R. Caruana. Multitask learning. Machine Learning, 28(1):41\u201375, 1997.\n\n[8] G. Cavallanti, N. Cesa-Bianchi, and C. Gentile. Linear algorithms for online multitask classification. Journal of Machine Learning Research, 11:2901\u20132934, 2010.\n\n[9] C. Ciliberto, L. Rosasco, and S. Villa. Learning multiple visual tasks while discovering their structure. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 131\u2013139, 2015.\n\n[10] C. Ciliberto, A. Rudi, L. Rosasco, and M. Pontil. Consistent multitask learning with nonlinear output relations. In Advances in Neural Information Processing Systems, pages 1986\u20131996, 2017.\n\n[11] K. Crammer, O. Dekel, J. Keshet, S. Shalev-Shwartz, and Y. Singer. Online passive-aggressive algorithms. Journal of Machine Learning Research, 7(Mar):551\u2013585, 2006.\n\n[12] G. Denevi, C. Ciliberto, D. Stamos, and M. Pontil. Incremental learning-to-learn with statistical guarantees. In Proc. 34th Conference on Uncertainty in Artificial Intelligence (UAI), 2018.\n\n[13] A. Dieuleveut, N. Flammarion, and F. Bach. Harder, better, faster, stronger convergence rates for least-squares regression. The Journal of Machine Learning Research, 18(1):3520\u20133570, 2017.\n\n[14] T. Evgeniou, C. A. Micchelli, and M. Pontil. Learning multiple tasks with kernel methods. Journal of Machine Learning Research, 6:615\u2013637, 2005.\n\n[15] C. Finn, P. Abbeel, and S. Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 1126\u20131135. PMLR, 2017.\n\n[16] L. Franceschi, P. Frasconi, S. Salzo, R. Grazzi, and M. Pontil. Bilevel programming for hyperparameter optimization and meta-learning. In International Conference on Machine Learning, PMLR 80, pages 1568\u20131577, 2018.\n\n[17] M. Herbster, S. Pasteris, and M. Pontil. Mistake bounds for binary matrix completion. In Advances in Neural Information Processing Systems, pages 3954\u20133962, 2016.\n\n[18] L. Jacob, J.-P. Vert, and F. R. Bach. Clustered multi-task learning: A convex formulation. In Advances in Neural Information Processing Systems, pages 745\u2013752, 2009.\n\n[19] A. Maurer. Algorithmic stability and meta-learning. Journal of Machine Learning Research, 6:967\u2013994, 2005.\n\n[20] A. Maurer. Transfer bounds for linear feature learning. Machine Learning, 75(3):327\u2013350, 2009.\n\n[21] A. Maurer, M. Pontil, and B. Romera-Paredes. Sparse coding for multitask and transfer learning. In International Conference on Machine Learning, 2013.\n\n[22] A. Maurer, M. Pontil, and B. Romera-Paredes. The benefit of multitask representation learning. The Journal of Machine Learning Research, 17(1):2853\u20132884, 2016.\n\n[23] A. M. McDonald, M. Pontil, and D. Stamos. New perspectives on k-support and cluster norms. Journal of Machine Learning Research, 17(155):1\u201338, 2016.\n\n[24] A. Pentina and C. Lampert. A PAC-Bayesian bound for lifelong learning. In International Conference on Machine Learning, pages 991\u2013999, 2014.\n\n[25] A. Pentina and R. Urner. Lifelong learning with weighted majority votes. In Advances in Neural Information Processing Systems, pages 3612\u20133620, 2016.\n\n[26] S. Ravi and H. Larochelle. Optimization as a model for few-shot learning. In 5th International Conference on Learning Representations, 2017.\n\n[27] S.-A. Rebuffi, A. Kolesnikov, and C. H. Lampert. iCaRL: Incremental classifier and representation learning. In Proc. CVPR, 2017.\n\n[28] H. Robbins and S. Monro. A stochastic approximation method. The Annals of Mathematical Statistics, pages 400\u2013407, 1951.\n\n[29] S. Thrun and L. Pratt. Learning to Learn. Springer, 1998.\n\n[30] Y. Wu, Y. Su, and Y. Demiris. A morphable template framework for robot learning by demonstration: Integrating one-shot and incremental learning approaches. Robotics and Autonomous Systems, 2014.\n", "award": [], "sourceid": 6532, "authors": [{"given_name": "Giulia", "family_name": "Denevi", "institution": "IIT/UNIGE"}, {"given_name": "Carlo", "family_name": "Ciliberto", "institution": "Imperial College London"}, {"given_name": "Dimitris", "family_name": "Stamos", "institution": "University College London"}, {"given_name": "Massimiliano", "family_name": "Pontil", "institution": "IIT & UCL"}]}