{"title": "Learning from Multiple Sources", "book": "Advances in Neural Information Processing Systems", "page_first": 321, "page_last": 328, "abstract": null, "full_text": "Learning from Multiple Sources\n\nKoby Crammer, Michael Kearns, Jennifer Wortman\nDepartment of Computer and Information Science\nUniversity of Pennsylvania\nPhiladelphia, PA 19104\n\nAbstract\n\nWe consider the problem of learning accurate models from multiple sources of \"nearby\" data. Given distinct samples from multiple data sources and estimates of the dissimilarities between these sources, we provide a general theory of which samples should be used to learn models for each source. This theory is applicable in a broad decision-theoretic learning framework, and yields results for classification and regression generally, and for density estimation within the exponential family. A key component of our approach is the development of approximate triangle inequalities for expected loss, which may be of independent interest.\n\n1 Introduction\n\nWe introduce and analyze a theoretical model for the problem of learning from multiple sources of \"nearby\" data. As a hypothetical example of where such problems might arise, consider the following scenario: For each web user in a large population, we wish to learn a classifier for which sites that user is likely to find \"interesting.\" Assuming we have at least a small amount of labeled data for each user (as might be obtained either through direct feedback, or via indirect means such as clickthroughs following a search), one approach would be to apply standard learning algorithms to each user's data in isolation. However, if there are natural and accessible measures of similarity between the interests of pairs of users (as might be obtained through their mutual labelings of common web sites), an appealing alternative is to aggregate the data of \"nearby\" users when learning a classifier for each particular user. 
This alternative is intuitively subject to a trade-off between the increased sample size and how different the aggregated users are.\n\nWe treat this problem in some generality and provide a bound addressing the aforementioned trade-off. In our model there are $K$ unknown data sources, with source $i$ generating a distinct sample $S_i$ of $n_i$ observations. We assume we are given only the samples $S_i$ and a disparity matrix $D$ whose entry $D(i, j)$ bounds the difference between source $i$ and source $j$. (Footnote 1: We avoid using the term distance since our results include settings in which the underlying loss measures may not be formal distances.) Given these inputs, we wish to decide which subset of the samples $S_j$ will result in the best model for each source $i$. Our framework includes settings in which the sources produce data for classification, regression, and density estimation (and more generally any additive-loss learning problem obeying certain conditions).\n\nOur main result is a general theorem establishing a bound on the expected loss incurred by using all data sources within a given disparity of the target source. Optimization of this bound then yields a recommended subset of the data to be used in learning a model of each source. Our bound clearly expresses a trade-off between three quantities: the sample size used (which increases as we include data from more distant models), a weighted average of the disparities of the sources whose data is used, and a model complexity term. It can be applied to any learning setting in which the underlying loss function obeys an approximate triangle inequality, and in which the class of hypothesis models under consideration obeys uniform convergence of empirical estimates of loss to expectations. For classification problems, the standard triangle inequality holds. 
For regression we prove a 2-approximation to the triangle inequality, and for density estimation for members of the exponential family, we apply Bregman divergence techniques to provide approximate triangle inequalities. We believe these approximations may find independent applications within machine learning. Uniform convergence bounds for the settings we consider may be obtained via standard data-independent model complexity measures such as VC dimension and pseudo-dimension, or via more recent data-dependent approaches such as Rademacher complexity.\n\nThe research described here grew out of an earlier paper by the same authors [1] which examined the considerably more limited problem of learning a model when all data sources are corrupted versions of a single, fixed source, for instance when each data source provides noisy samples of a fixed binary function, but with varying levels of noise. In the current work, each source may be entirely unrelated to all others except as constrained by the bounds on disparities, requiring us to develop new techniques. Wu and Dietterich studied similar problems experimentally in the context of SVMs [2]. The framework examined here can also be viewed as a type of transfer learning [3, 4].\n\nIn Section 2 we introduce a decision-theoretic framework for probabilistic learning that includes classification, regression, density estimation and many other settings as special cases, and then give our multiple source generalization of this model. In Section 3 we provide our main result, which is a general bound on the expected loss incurred by using all data within a given disparity of a target source. Section 4 then applies this bound to a variety of specific learning problems. 
In Section 5 we briefly examine data-dependent applications of our general theory using Rademacher complexity.\n\n2 Learning models\n\nBefore detailing our multiple-source learning model, we first introduce a standard decision-theoretic learning framework in which our goal is to find a model minimizing a generalized notion of empirical loss [5]. Let the hypothesis class $\\mathcal{H}$ be a set of models (which might be classifiers, real-valued functions, densities, etc.), and let $f$ be the target model, which may or may not lie in the class $\\mathcal{H}$. Let $z$ be a (generalized) data point or observation. For instance, in (noise-free) classification and regression, $z$ will consist of a pair $\\langle x, y \\rangle$ where $y = f(x)$. In density estimation, $z$ is the observed value $x$. We assume that the target model $f$ induces some underlying distribution $P_f$ over observations $z$. In the case of classification or regression, $P_f$ is induced by drawing the inputs $x$ according to some underlying distribution $P$, and then setting $y = f(x)$ (possibly corrupted by noise). In the case of density estimation $f$ simply defines a distribution $P_f$ over observations $x$.\n\nEach setting we consider has an associated loss function $L(h, z)$. For example, in classification we typically consider the 0/1 loss: $L(h, \\langle x, y \\rangle) = 0$ if $h(x) = y$, and 1 otherwise. In regression we might consider the squared loss function $L(h, \\langle x, y \\rangle) = (y - h(x))^2$. In density estimation we might consider the log loss $L(h, x) = \\log(1/h(x))$. In each case, we are interested in the expected loss of a model $g_2$ on target $g_1$, $e(g_1, g_2) = E_{z \\sim P_{g_1}}[L(g_2, z)]$. Expected loss is not necessarily symmetric.\n\nIn our multiple source model, we are presented with $K$ distinct samples or piles of data $S_1, \\ldots, S_K$, and a symmetric $K \\times K$ matrix $D$. Each pile $S_i$ contains $n_i$ observations that are generated from a fixed and unknown model $f_i$, and $D$ satisfies $e(f_i, f_j), e(f_j, f_i) \\le D(i, j)$. 
(Footnote 2: While it may seem restrictive to assume that $D$ is given, notice that $D(i, j)$ can often be estimated from data, for example in a classification setting in which common instances labeled by both $f_i$ and $f_j$ are available.) Our goal is to decide which piles $S_j$ to use in order to learn the best approximation (in terms of expected loss) to each $f_i$.\n\nWhile we are interested in accomplishing this goal for each $f_i$, it suffices and is convenient to examine the problem from the perspective of a fixed $f_i$. Thus without loss of generality let us suppose that we are given piles $S_1, \\ldots, S_K$ of size $n_1, \\ldots, n_K$ from models $f_1, \\ldots, f_K$ such that $\\epsilon_1 \\equiv D(1, 1) \\le \\epsilon_2 \\equiv D(1, 2) \\le \\cdots \\le \\epsilon_K \\equiv D(1, K)$, and our goal is to learn $f_1$. Here we have simply taken the problem in the preceding paragraph, focused on the problem for $f_1$, and reordered the other models according to their proximity to $f_1$. To highlight the distinguished role of the target $f_1$ we shall denote it $f$. We denote the observations in $S_j$ by $z^j_1, \\ldots, z^j_{n_j}$. In all cases we will analyze, for any $k \\le K$, the hypothesis $\\hat{h}_k$ minimizing the empirical loss $\\hat{e}_k(h)$ on the first $k$ piles $S_1, \\ldots, S_k$, i.e.\n\n$$\\hat{h}_k = \\operatorname*{argmin}_{h \\in \\mathcal{H}} \\hat{e}_k(h) = \\operatorname*{argmin}_{h \\in \\mathcal{H}} \\frac{1}{n_{1:k}} \\sum_{j=1}^{k} \\sum_{i=1}^{n_j} L(h, z^j_i)$$\n\nwhere $n_{1:k} = n_1 + \\cdots + n_k$. We also denote the expected error of function $h$ with respect to the first $k$ piles of data as\n\n$$e_k(h) = E[\\hat{e}_k(h)] = \\sum_{i=1}^{k} \\frac{n_i}{n_{1:k}} e(f_i, h).$$\n\n3 General theory\n\nIn this section we provide the first of our main results: a general bound on the expected loss of the model minimizing the empirical loss on the nearest $k$ piles. Optimization of this bound leads to a recommended number of piles to incorporate when learning $f = f_1$. The key ingredients needed to apply this bound are an approximate triangle inequality and a uniform convergence bound, which we define below. In the subsequent sections we demonstrate that these ingredients can indeed be provided for a variety of natural learning problems. 
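As a concrete illustration of the estimator analyzed in this section, the hypothesis minimizing the pooled empirical loss on the first k piles can be sketched for a small finite hypothesis class. This is only a minimal sketch under stated assumptions: the names pooled_empirical_loss, erm_on_prefix, zero_one_loss and make_threshold are hypothetical, the hypothesis class is a toy set of one-dimensional threshold classifiers, and the piles are assumed to be already ordered by their disparity to the target.

```python
def pooled_empirical_loss(h, piles, k, loss):
    # Average loss of hypothesis h over every observation in the first k piles.
    total = 0.0
    count = 0
    for pile in piles[:k]:
        for z in pile:
            total += loss(h, z)
            count += 1
    return total / count


def erm_on_prefix(hypotheses, piles, k, loss):
    # Hypothesis with the smallest pooled empirical loss on piles 1..k.
    return min(hypotheses, key=lambda h: pooled_empirical_loss(h, piles, k, loss))


def zero_one_loss(h, z):
    # 0/1 loss on an observation z = (x, y).
    x, y = z
    return 0.0 if h(x) == y else 1.0


def make_threshold(t):
    # Toy classifier (an assumption for illustration): label 1 when x >= t.
    return lambda x: 1 if x >= t else 0


hypotheses = [make_threshold(t) for t in [0.0, 0.25, 0.5, 0.75, 1.0]]
# Two toy piles of (x, y) observations, assumed sorted by proximity to the target.
piles = [
    [(0.1, 0), (0.6, 1)],
    [(0.3, 0), (0.4, 0), (0.55, 1), (0.9, 1)],
]
best = erm_on_prefix(hypotheses, piles, 2, zero_one_loss)
```

With only the first pile, several thresholds fit the two observations equally well; pooling the second pile narrows the choice, which is the benefit that the bound below weighs against the added disparity.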
\n\nDefinition 1 For $\\alpha \\ge 1$, we say that the $\\alpha$-triangle inequality holds for a class of models $\\mathcal{F}$ and expected loss function $e$ if for all $g_1, g_2, g_3 \\in \\mathcal{F}$ we have $e(g_1, g_2) \\le \\alpha (e(g_1, g_3) + e(g_3, g_2))$. The parameter $\\alpha \\ge 1$ is a constant that depends on $\\mathcal{F}$ and $e$.\n\nThe choice $\\alpha = 1$ yields the standard triangle inequality. We note that the restriction to models in the class $\\mathcal{F}$ may in some cases be quite weak (for instance, when $\\mathcal{F}$ is all possible classifiers or real-valued functions with bounded range) or stronger, as in densities from the exponential family. Our results will require only that the unknown source models $f_1, \\ldots, f_K$ lie in $\\mathcal{F}$, even when our hypothesis models are chosen from some possibly much more restricted class $\\mathcal{H} \\subseteq \\mathcal{F}$. For now we simply leave $\\mathcal{F}$ as a parameter of the definition.\n\nDefinition 2 A uniform convergence bound for a hypothesis space $\\mathcal{H}$ and loss function $L$ is a bound that states that for any $0 < \\delta < 1$, with probability at least $1 - \\delta$ for any $h \\in \\mathcal{H}$\n\n$$|e(h) - \\hat{e}(h)| \\le \\beta(n, \\delta)$$\n\nwhere $\\hat{e}(h) = \\frac{1}{n} \\sum_{i=1}^{n} L(h, z_i)$ for $n$ observations $z_1, \\ldots, z_n$ generated independently according to distributions $P_1, \\ldots, P_n$, and $e(h) = E[\\hat{e}(h)]$ where the expectation is taken over $z_1, \\ldots, z_n$. $\\beta$ is a function of the number of observations $n$ and the confidence $\\delta$, and depends on $\\mathcal{H}$ and $L$.\n\nThis definition simply asserts that for every model in $\\mathcal{H}$, its empirical loss on a sample of size $n$ and the expectation of this loss will be \"close.\" In general the function $\\beta$ will incorporate standard measures of the complexity of $\\mathcal{H}$, and will be a decreasing function of the sample size $n$, as in the classical $O(\\sqrt{d/n})$ bounds of VC theory. Our bounds will be derived from the rich literature on uniform convergence. The only twist to our setting is the fact that the observations are no longer necessarily identically distributed, since they are generated from multiple sources. However, generalizing the standard uniform convergence results to this setting is straightforward. 
We are now ready to present our general bound.\n\nTheorem 1 Let $e$ be the expected loss function for loss $L$, and let $\\mathcal{F}$ be a class of models for which the $\\alpha$-triangle inequality holds with respect to $e$. Let $\\mathcal{H} \\subseteq \\mathcal{F}$ be a class of hypothesis models for which there is a uniform convergence bound $\\beta$ for $L$. Let $K \\in \\mathbb{N}$, $f = f_1, f_2, \\ldots, f_K \\in \\mathcal{F}$, $\\{\\epsilon_i\\}_{i=1}^{K}$, $\\{n_i\\}_{i=1}^{K}$, and $\\hat{h}_k$ be as defined above. For any $\\delta$ such that $0 < \\delta < 1$, with probability at least $1 - \\delta$, for any $k \\in \\{1, \\ldots, K\\}$\n\n$$e(f, \\hat{h}_k) \\le (\\alpha + \\alpha^2) \\sum_{i=1}^{k} \\frac{n_i}{n_{1:k}} \\epsilon_i + 2\\alpha\\beta(n_{1:k}, \\delta/2K) + \\alpha^2 \\min_{h \\in \\mathcal{H}} \\{e(f, h)\\}$$\n\nBefore providing the proof, let us examine the bound of Theorem 1, which expresses a natural and intuitive trade-off. The first term in the bound is a weighted sum of the disparities of the $k \\le K$ models whose data is used with respect to the target model $f = f_1$. We expect this term to increase as we increase $k$ to include more distant piles. The second term is determined by the uniform convergence bound. We expect this term to decrease with added piles due to the increased sample size. The final term is what is typically called the approximation error: the residual loss that we incur simply by limiting our hypothesis model to fall in the restricted class $\\mathcal{H}$. All three terms are influenced by the strength of the approximate triangle inequality that we have, as quantified by $\\alpha$.\n\nThe bounds given in Theorem 1 can be loose, but provide an upper bound necessary for optimization and suggest a natural choice for the number of piles $k^*$ to use to estimate the target $f$:\n\n$$k^* = \\operatorname*{argmin}_{k} \\left\\{ (\\alpha + \\alpha^2) \\sum_{i=1}^{k} \\frac{n_i}{n_{1:k}} \\epsilon_i + 2\\alpha\\beta(n_{1:k}, \\delta/2K) \\right\\}.$$\n\nTheorem 1 and this optimization make the implicit assumption that the best subset of piles to use will be a prefix of the piles, that is, that we should not \"skip\" a nearby pile in favor of more distant ones. 
This assumption will generally be true for typical data-independent uniform convergence bounds such as VC dimension bounds, and true on average for data-dependent bounds, where we expect uniform convergence bounds to improve with increased sample size. We now give the proof of Theorem 1.\n\nProof: (Theorem 1) By Definition 1, for any $h \\in \\mathcal{H}$, any $k \\in \\{1, \\ldots, K\\}$, and any $i \\in \\{1, \\ldots, k\\}$,\n\n$$\\frac{n_i}{n_{1:k}} e(f, h) \\le \\frac{n_i}{n_{1:k}} \\alpha \\left( e(f, f_i) + e(f_i, h) \\right).$$\n\nSumming over all $i \\in \\{1, \\ldots, k\\}$, we find\n\n$$e(f, h) \\le \\sum_{i=1}^{k} \\frac{n_i}{n_{1:k}} \\alpha \\left( e(f, f_i) + e(f_i, h) \\right) = \\alpha \\sum_{i=1}^{k} \\frac{n_i}{n_{1:k}} e(f, f_i) + \\alpha \\sum_{i=1}^{k} \\frac{n_i}{n_{1:k}} e(f_i, h) \\le \\alpha \\sum_{i=1}^{k} \\frac{n_i}{n_{1:k}} \\epsilon_i + \\alpha e_k(h).$$\n\nIn the first step above we have used the $\\alpha$-triangle inequality to deliberately introduce a weighted summation involving the $f_i$. In the second step, we have broken up the summation. Notice that the first summation is a weighted average of the expected loss of each $f_i$, while the second summation is the expected loss of $h$ on the data. Using the uniform convergence bound, we may assert that with high probability $e_k(h) \\le \\hat{e}_k(h) + \\beta(n_{1:k}, \\delta/2K)$, and with high probability\n\n$$\\hat{e}_k(\\hat{h}_k) = \\min_{h \\in \\mathcal{H}} \\{\\hat{e}_k(h)\\} \\le \\min_{h \\in \\mathcal{H}} \\left\\{ \\sum_{i=1}^{k} \\frac{n_i}{n_{1:k}} e(f_i, h) + \\beta(n_{1:k}, \\delta/2K) \\right\\}.$$\n\nPutting these pieces together, we find that with high probability\n\n$$e(f, \\hat{h}_k) \\le \\alpha \\sum_{i=1}^{k} \\frac{n_i}{n_{1:k}} \\epsilon_i + 2\\alpha\\beta(n_{1:k}, \\delta/2K) + \\alpha \\min_{h \\in \\mathcal{H}} \\left\\{ \\sum_{i=1}^{k} \\frac{n_i}{n_{1:k}} e(f_i, h) \\right\\}$$\n\n$$\\le \\alpha \\sum_{i=1}^{k} \\frac{n_i}{n_{1:k}} \\epsilon_i + 2\\alpha\\beta(n_{1:k}, \\delta/2K) + \\alpha \\min_{h \\in \\mathcal{H}} \\left\\{ \\sum_{i=1}^{k} \\frac{n_i}{n_{1:k}} \\alpha e(f_i, f) + \\sum_{i=1}^{k} \\frac{n_i}{n_{1:k}} \\alpha e(f, h) \\right\\}$$\n\n$$\\le (\\alpha + \\alpha^2) \\sum_{i=1}^{k} \\frac{n_i}{n_{1:k}} \\epsilon_i + 2\\alpha\\beta(n_{1:k}, \\delta/2K) + \\alpha^2 \\min_{h \\in \\mathcal{H}} \\{e(f, h)\\}.$$\n\nFigure 1: Visual demonstration of Theorem 2. 
In this problem there are $K = 100$ classifiers, each defined by 2 parameters represented by a point $f_i$ in the unit square, such that the expected disagreement rate between two such classifiers equals the $L_1$ distance between their parameters. (It is easy to create simple input distributions and classifiers that generate exactly this geometry.) We chose the 100 parameter vectors $f_i$ uniformly at random from the unit square (the circles in the left panel). To generate varying pile sizes, we let $n_i$ decrease with the distance of $f_i$ from a chosen \"central\" point at (0.75, 0.75) (marked \"MAX DATA\" in the left panel); the resulting pile sizes for each model are shown in the bar plot in the right panel, where the origin (0, 0) is in the near corner, (1, 1) in the far corner, and the pile sizes clearly peak near (0.75, 0.75). Given these $f_i$, $n_i$ and the pairwise distances, the undirected graph on the left includes an edge between $f_i$ and $f_j$ if and only if the data from $f_j$ is used to learn $f_i$ and/or the converse when Theorem 2 is used to optimize the distance of the data used. The graph simultaneously displays the geometry implicit in Theorem 2 as well as its adaptivity to local circumstances. Near the central point the graph is quite sparse and the edges quite short, corresponding to the fact that for such models we have enough direct data that it is not advantageous to include data from distant models. Far from the central point the graph becomes dense and the edges long, as we are required to aggregate a larger neighborhood to learn the optimal model. In addition, decisions are affected locally by how many models are \"nearby\" a given model.\n\n4 Applications to standard learning settings\n\nIn this section we demonstrate the applicability of the general theory given by Theorem 1 to several standard learning settings. We begin with the most straightforward application, classification. 
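The prefix selection rule suggested by Theorem 1, which is what the Figure 1 experiment optimizes for each model, can be sketched in a few lines. This is a hedged illustration rather than the authors' implementation: choose_prefix and vc_beta are hypothetical names, the piles are assumed to be sorted by increasing disparity, and the VC-style complexity term is an assumed form read off the square-root term of the classification bound, with delta already divided by 2K by the caller.

```python
import math


def vc_beta(n, delta, d):
    # Assumed VC-style uniform convergence term for 0/1 loss:
    # sqrt((d * log(2 e n / d) + log(8 / delta)) / (8 n)).
    return math.sqrt((d * math.log(2 * math.e * n / d) + math.log(8 / delta)) / (8 * n))


def choose_prefix(sizes, eps, alpha, beta, delta):
    # sizes[i] and eps[i] describe pile i + 1; eps must be in increasing order.
    # Returns the prefix length k minimizing
    # (alpha + alpha^2) * sum_i (n_i / n_1:k) * eps_i + 2 * alpha * beta(n_1:k, delta / 2K)
    # together with the value of that objective.
    K = len(sizes)
    best_k, best_value = None, float('inf')
    n_total = 0
    weighted_eps = 0.0
    for k in range(1, K + 1):
        n_total += sizes[k - 1]
        weighted_eps += sizes[k - 1] * eps[k - 1]
        value = ((alpha + alpha ** 2) * weighted_eps / n_total
                 + 2 * alpha * beta(n_total, delta / (2 * K)))
        if value < best_value:
            best_k, best_value = k, value
    return best_k, best_value


# Hypothetical inputs: four piles, growing in size and in disparity to the target.
sizes = [50, 200, 1000, 5000]
eps = [0.0, 0.01, 0.05, 0.4]
k_star, bound_value = choose_prefix(
    sizes, eps, alpha=1.0, beta=lambda n, dl: vc_beta(n, dl, d=5), delta=0.05)
```

Because the approximation error term does not depend on k, it is omitted from the objective. The loop keeps running totals, so all K prefixes are scored in a single pass.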
\n\n4.1 Binary classification\n\nIn binary classification, we assume that our target model is a fixed, unknown and arbitrary function $f$ from some input set $X$ to $\\{0, 1\\}$, and that there is a fixed and unknown distribution $P$ over $X$. Note that the distribution $P$ over the inputs does not depend on the target function $f$. The observations are of the form $z = \\langle x, y \\rangle$ where $y \\in \\{0, 1\\}$. The loss function $L(h, \\langle x, y \\rangle)$ is defined as 0 if $y = h(x)$ and 1 otherwise, and the corresponding expected loss is $e(g_1, g_2) = E_{\\langle x, y \\rangle \\sim P_{g_1}}[L(g_2, \\langle x, y \\rangle)] = \\Pr_{x \\sim P}[g_1(x) \\ne g_2(x)]$. For 0/1 loss it is well-known and easy to see that the (standard) 1-triangle inequality holds, and classical VC theory [6] provides us with uniform convergence. The conditions of Theorem 1 are thus easily satisfied, yielding the following.\n\nTheorem 2 Let $\\mathcal{F}$ be the set of all functions from an input set $X$ into $\\{0, 1\\}$ and let $d$ be the VC dimension of $\\mathcal{H} \\subseteq \\mathcal{F}$. Let $e$ be the expected 0/1 loss. Let $K \\in \\mathbb{N}$, $f = f_1, f_2, \\ldots, f_K \\in \\mathcal{F}$, $\\{\\epsilon_i\\}_{i=1}^{K}$, $\\{n_i\\}_{i=1}^{K}$, and $\\hat{h}_k$ be as defined above in the multi-source learning model. For any $\\delta$ such that $0 < \\delta < 1$, with probability at least $1 - \\delta$, for any $k \\in \\{1, \\ldots, K\\}$\n\n$$e(f, \\hat{h}_k) \\le 2 \\sum_{i=1}^{k} \\frac{n_i}{n_{1:k}} \\epsilon_i + \\min_{h \\in \\mathcal{H}} \\{e(f, h)\\} + 2 \\sqrt{\\frac{d \\log(2 e n_{1:k} / d) + \\log(16K/\\delta)}{8 n_{1:k}}}$$\n\nIn Figure 1 we provide a visual demonstration of the behavior of Theorem 1 applied to a simple classification problem.\n\n4.2 Regression\n\nWe now turn to regression with squared loss. Here our target model $f$ is any function from an input class $X$ into some bounded subset of $\\mathbb{R}$. (Frequently we will have $X \\subseteq \\mathbb{R}^d$, but this is not required.) We again assume a fixed but unknown distribution $P$ (that does not depend on $f$) on the inputs. Our observations are of the form $z = \\langle x, y \\rangle$. Our loss function is $L(h, \\langle x, y \\rangle) = (y - h(x))^2$, and the expected loss is thus $e(g_1, g_2) = E_{\\langle x, y \\rangle \\sim P_{g_1}}[L(g_2, \\langle x, y \\rangle)] = E_{x \\sim P}[(g_1(x) - g_2(x))^2]$. For regression it is known that the standard 1-triangle inequality does not hold. 
However, a 2-triangle inequality does hold and is stated in the following lemma. The proof is given in Appendix A (see footnote 3).\n\nLemma 1 Given any three functions $g_1, g_2, g_3 : X \\to \\mathbb{R}$, a fixed and unknown distribution $P$ on the inputs $X$, and the expected loss $e(g_1, g_2) = E_{x \\sim P}[(g_1(x) - g_2(x))^2]$,\n\n$$e(g_1, g_2) \\le 2 \\left( e(g_1, g_3) + e(g_3, g_2) \\right).$$\n\nThe other required ingredient is a uniform convergence bound for regression with squared loss. There is a rich literature on such bounds and their corresponding complexity measures for the model class $\\mathcal{H}$, including the fat-shattering generalization of VC dimension [7], $\\epsilon$-nets and entropy [6] and the combinatorial and pseudo-dimension approaches beautifully surveyed in [5]. For concreteness here we adopt the latter approach, since it serves well in the following section on density estimation.\n\nWhile a detailed exposition of the pseudo-dimension $\\dim(\\mathcal{H})$ of a class $\\mathcal{H}$ of real-valued functions exceeds both our space limitations and scope, it suffices to say that it generalizes the VC dimension for binary functions and plays a similar role in uniform convergence bounds. More precisely, in the same way that the VC dimension measures the largest set of points on which a set of classifiers can exhibit \"arbitrary\" behavior (by achieving all possible labelings of the points), $\\dim(\\mathcal{H})$ measures the largest set of points on which the output values induced by $\\mathcal{H}$ are \"full\" or \"space-filling.\" (Technically we ask whether $\\{\\langle h(x_1), \\ldots, h(x_d) \\rangle : h \\in \\mathcal{H}\\}$ intersects all orthants of $\\mathbb{R}^d$ with respect to some chosen origin.) Ignoring constant and logarithmic factors, uniform convergence bounds can be derived in which the complexity penalty is $\\sqrt{\\dim(\\mathcal{H})/n}$. As with the VC dimension, $\\dim(\\mathcal{H})$ is ordinarily closely related to the number of free parameters defining $\\mathcal{H}$. Thus for linear functions in $\\mathbb{R}^d$ it is $O(d)$ and for neural networks with $W$ weights it is $O(W)$, and so on. 
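Returning briefly to Lemma 1: the 2-triangle inequality for squared loss follows from the pointwise fact (a - b)^2 <= 2 (a - c)^2 + 2 (c - b)^2, so it also holds for any empirical average and is easy to check numerically. The sketch below is an illustration only; the random linear functions, the uniform input sample and the helper names are assumptions, not part of the paper. It also shows three constant functions for which the standard 1-triangle inequality fails while the 2-approximation holds with equality.

```python
import random


def expected_sq_loss(g1, g2, xs):
    # Monte Carlo estimate of E_x[(g1(x) - g2(x))^2] over the sample points xs.
    return sum((g1(x) - g2(x)) ** 2 for x in xs) / len(xs)


random.seed(0)
xs = [random.uniform(-1.0, 1.0) for _ in range(1000)]


def random_linear():
    # Hypothetical model family for the check: x -> a * x + b.
    a, b = random.uniform(-3, 3), random.uniform(-3, 3)
    return lambda x: a * x + b


# Count triples (g1, g2, g3) violating e(g1, g2) <= 2 (e(g1, g3) + e(g3, g2)).
violations = 0
for _ in range(200):
    g1, g2, g3 = random_linear(), random_linear(), random_linear()
    lhs = expected_sq_loss(g1, g2, xs)
    rhs = 2 * (expected_sq_loss(g1, g3, xs) + expected_sq_loss(g3, g2, xs))
    if lhs > rhs + 1e-9:
        violations += 1

# Constant functions 0, 1, 2: e(0, 2) = 4 while e(0, 1) + e(1, 2) = 2.
zero, one, two = (lambda x: 0.0), (lambda x: 1.0), (lambda x: 2.0)
direct = expected_sq_loss(zero, two, xs)
via_middle = expected_sq_loss(zero, one, xs) + expected_sq_loss(one, two, xs)
standard_holds = direct <= via_middle
two_approx_holds = direct <= 2 * via_middle
```

The constant example shows that the factor 2 cannot be improved in general for squared loss.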
\n\nCareful application of pseudo-dimension results from [5] along with Lemma 1 and Theorem 1 yields the following. A sketch of the proof appears in Appendix A.\n\nTheorem 3 Let $\\mathcal{F}$ be the set of functions from $X$ into $[-B, B]$ and let $d$ be the pseudo-dimension of $\\mathcal{H} \\subseteq \\mathcal{F}$ under squared loss. Let $e$ be the expected squared loss. Let $K \\in \\mathbb{N}$, $f = f_1, f_2, \\ldots, f_K \\in \\mathcal{F}$, $\\{\\epsilon_i\\}_{i=1}^{K}$, $\\{n_i\\}_{i=1}^{K}$, and $\\hat{h}_k$ be as defined in the multi-source learning model. Assume that $n_1 \\ge d/16e$. For any $\\delta$ such that $0 < \\delta < 1$, with probability at least $1 - \\delta$, for any $k \\in \\{1, \\ldots, K\\}$\n\n$$e(f, \\hat{h}_k) \\le 6 \\sum_{i=1}^{k} \\frac{n_i}{n_{1:k}} \\epsilon_i + 4 \\min_{h \\in \\mathcal{H}} \\{e(f, h)\\} + 128 B^2 \\left( \\sqrt{\\frac{d}{n_{1:k}}} + \\sqrt{\\frac{\\ln(16K/\\delta)}{n_{1:k}}} \\right) \\sqrt{\\ln \\frac{16 e^2 n_{1:k}}{d}}$$\n\n4.3 Density estimation\n\nWe turn to the more complex application to density estimation. Here our models are no longer functions, but densities $P$. The loss function for an observation $x$ is the log loss $L(P, x) = \\log(1/P(x))$. The expected loss is then $e(P_1, P_2) = E_{x \\sim P_1}[L(P_2, x)] = E_{x \\sim P_1}[\\log(1/P_2(x))]$.\n\nAs we are not aware of an $\\alpha$-triangle inequality that holds simultaneously for all density functions, we provide general mathematical tools to derive specialized $\\alpha$-triangle inequalities for specific classes of distributions. We focus on the exponential family of distributions, which is quite general and has nice properties which allow us to derive the necessary machinery to apply Theorem 1. We start by defining the exponential family and explaining some of its properties. We proceed by deriving an $\\alpha$-triangle inequality for Kullback-Leibler divergence in exponential families that implies an $\\alpha$-triangle inequality for our expected loss function. This inequality and a uniform convergence bound based on pseudo-dimension yield a general method for deriving error bounds in the multiple source setting which we illustrate using the example of multinomial distributions. (Footnote 3: A version of this paper with the appendix included can be found on the authors' websites.)\n\n
Let $x \\in X$ be a random variable, in either a continuous space (e.g. $X \\subseteq \\mathbb{R}^d$) or a discrete space (e.g. $X \\subseteq \\mathbb{Z}^d$). We define the exponential family of distributions in terms of the following components. First, we have a vector function of the sufficient statistics needed to compute the distribution, denoted $\\Psi : \\mathbb{R}^d \\to \\mathbb{R}^d$. Associated with $\\Psi$ is a vector of expectation parameters $\\theta \\in \\mathbb{R}^d$ which parameterizes a particular distribution. Next we have a convex vector function $F : \\mathbb{R}^d \\to \\mathbb{R}$ (defined below) which is unique for each family of exponential distributions, and a normalization function $P_0(x)$. Using this notation we define a probability distribution (in the expectation parameters) to be\n\n$$P_F(x \\mid \\theta) = e^{\\nabla F(\\theta) \\cdot (\\Psi(x) - \\theta) + F(\\theta)} P_0(x). \\qquad (1)$$\n\nFor all distributions we consider it will hold that $E_{x \\sim P_F(\\cdot \\mid \\theta)}[\\Psi(x)] = \\theta$. Using this fact and the linearity of expectation, we can derive the Kullback-Leibler (KL) divergence between two distributions of the same family (which use the same functions $F$ and $\\Psi$) and obtain\n\n$$\\mathrm{KL}(P_F(x \\mid \\theta_1) \\,\\|\\, P_F(x \\mid \\theta_2)) = F(\\theta_1) - \\left[ F(\\theta_2) + \\nabla F(\\theta_2) \\cdot (\\theta_1 - \\theta_2) \\right]. \\qquad (2)$$\n\nWe define the quantity on the right to be the Bregman divergence between the two (parameter) vectors $\\theta_1$ and $\\theta_2$, denoted $B_F(\\theta_1 \\,\\|\\, \\theta_2)$. The Bregman divergence measures the difference between $F$ and its first-order Taylor expansion about $\\theta_2$ evaluated at $\\theta_1$. Eq. (2) states that the KL divergence between two members of the exponential family is equal to the Bregman divergence between the two corresponding expectation parameters. We refer the reader to [8] for more details about Bregman divergences and to [9] for more information about exponential families.\n\nWe will use the above relation between the KL divergence for exponential families and Bregman divergences to derive a triangle inequality as required by our theory. The following lemma shows that if we can provide a triangle inequality for the KL function, we can do so for expected log loss.\n\nLemma 2 Let $e$ be the expected log loss, i.e. $e(P_1, P_2) = E_{x \\sim P_1}[\\log(1/P_2(x))]$. 
For any three probability distributions $P_1$, $P_2$, and $P_3$, if $\\mathrm{KL}(P_1 \\,\\|\\, P_2) \\le \\alpha (\\mathrm{KL}(P_1 \\,\\|\\, P_3) + \\mathrm{KL}(P_3 \\,\\|\\, P_2))$ for some $\\alpha \\ge 1$ then $e(P_1, P_2) \\le \\alpha (e(P_1, P_3) + e(P_3, P_2))$.\n\nThe proof is given in Appendix B. The next lemma gives an approximate triangle inequality for the KL divergence. We assume that there exists a closed set $\\mathcal{P} = \\{\\theta\\}$ which contains all the parameter vectors. The proof (again see Appendix B) uses Taylor's Theorem to derive upper and lower bounds on the Bregman divergence and then uses Eq. (2) to relate these bounds to the KL divergence.\n\nLemma 3 Let $P_1$, $P_2$, and $P_3$ be distributions from an exponential family with parameters $\\theta$ and function $F$. Then $\\mathrm{KL}(P_1 \\,\\|\\, P_2) \\le \\alpha (\\mathrm{KL}(P_1 \\,\\|\\, P_3) + \\mathrm{KL}(P_3 \\,\\|\\, P_2))$ where $\\alpha = 2 \\sup_{\\xi \\in \\mathcal{P}} \\lambda_1(H(F(\\xi))) / \\inf_{\\xi \\in \\mathcal{P}} \\lambda_d(H(F(\\xi)))$. Here $\\lambda_1(\\cdot)$ and $\\lambda_d(\\cdot)$ are the highest and lowest eigenvalues of a given matrix, and $H(\\cdot)$ is the Hessian matrix.\n\nThe following theorem, which states bounds for multinomial distributions in the multi-source setting, is provided to illustrate the type of results that can be obtained using the machinery described in this section. More details on the application to the multinomial distribution are given in Appendix B.\n\nTheorem 4 Let $\\mathcal{F} \\supseteq \\mathcal{H}$ be the set of multinomial distributions over $N$ values with the probability of each value bounded from below by $\\gamma$ for some $\\gamma > 0$, and let $\\alpha = 2/\\gamma$. Let $d$ be the pseudo-dimension of $\\mathcal{H}$ under log loss, and let $e$ be the expected log loss. Let $K \\in \\mathbb{N}$, $f = f_1, f_2, \\ldots, f_K \\in \\mathcal{F}$, $\\{\\epsilon_i\\}_{i=1}^{K}$ (see footnote 4), $\\{n_i\\}_{i=1}^{K}$, and $\\hat{h}_k$ be as defined above in the multi-source learning model. Assume that $n_1 \\ge d/16e$. For any $0 < \\delta < 1$, with probability at least $1 - \\delta$ for any $k \\in \\{1, \\ldots, K\\}$,\n\n$$e(f, \\hat{h}_k) \\le (\\alpha + \\alpha^2) \\sum_{i=1}^{k} \\frac{n_i}{n_{1:k}} \\epsilon_i + \\alpha^2 \\min_{h \\in \\mathcal{H}} \\{e(f, h)\\} + 128 \\log^2\\left( \\frac{\\alpha}{2} \\right) \\left( \\sqrt{\\frac{d}{n_{1:k}}} + \\sqrt{\\frac{\\ln(16K/\\delta)}{n_{1:k}}} \\right) \\sqrt{\\ln \\frac{16 e^2 n_{1:k}}{d}}$$\n\n(Footnote 4: Here we can actually make the weaker assumption that the $\\epsilon_i$ bound the KL divergences rather than the expected log loss, which avoids our needing upper bounds on the entropy of each source distribution.)\n\n5 Data-dependent bounds\n\nGiven the interest in data-dependent convergence methods (such as maximum margin, PAC-Bayes, and others) in recent years, it is natural to ask how our multi-source theory can exploit these modern bounds. We examine one specific case for classification here using Rademacher complexity [10, 11]; analogs can be derived in a similar manner for other learning problems.\n\nIf $\\mathcal{H}$ is a class of functions mapping from a set $X$ to $\\mathbb{R}$, we define the empirical Rademacher complexity of $\\mathcal{H}$ on a fixed set of observations $x_1, \\ldots, x_n$ as\n\n$$\\hat{R}_n(\\mathcal{H}) = E\\left[ \\sup_{h \\in \\mathcal{H}} \\left| \\frac{2}{n} \\sum_{i=1}^{n} \\sigma_i h(x_i) \\right| \\;\\Big|\\; x_1, \\ldots, x_n \\right]$$\n\nwhere the expectation is taken over independent uniform $\\{\\pm 1\\}$-valued random variables $\\sigma_1, \\ldots, \\sigma_n$. The Rademacher complexity for $n$ observations is then defined as $R_n(\\mathcal{H}) = E[\\hat{R}_n(\\mathcal{H})]$ where the expectation is over $x_1, \\ldots, x_n$.\n\nWe can apply Rademacher-based convergence bounds to obtain a data-dependent multi-source bound for classification. A proof sketch using techniques and theorems of [10] is in Appendix C.\n\nTheorem 5 Let $\\mathcal{F}$ be the set of all functions from an input set $X$ into $\\{-1, 1\\}$ and let $\\hat{R}_{n_{1:k}}$ be the empirical Rademacher complexity of $\\mathcal{H} \\subseteq \\mathcal{F}$ on the first $k$ piles of data. Let $e$ be the expected 0/1 loss. Let $K \\in \\mathbb{N}$, $f = f_1, f_2, \\ldots, f_K \\in \\mathcal{F}$, $\\{\\epsilon_i\\}_{i=1}^{K}$, $\\{n_i\\}_{i=1}^{K}$, and $\\hat{h}_k$ be as defined in the multi-source learning model. Assume that $n_1 \\ge d/16e$. For any $\\delta$ such that $0 < \\delta < 1$, with probability at least $1 - \\delta$, for any $k \\in \\{1, \\ldots, K\\}$\n\n$$e(f, \\hat{h}_k) \\le 2 \\sum_{i=1}^{k} \\frac{n_i}{n_{1:k}} \\epsilon_i + \\min_{h \\in \\mathcal{H}} \\{e(f, h)\\} + \\hat{R}_{n_{1:k}}(\\mathcal{H}) + 4 \\sqrt{\\frac{2 \\ln(4K/\\delta)}{n_{1:k}}}$$\n\nWhile the use of data-dependent complexity measures can be expected to yield more accurate bounds and thus better decisions about the number $k$ of piles to use, it is not without its costs in comparison to the more standard data-independent approaches. In particular, in principle the optimization of the bound of Theorem 5 to choose $k$ may actually involve running the learning algorithm on all possible prefixes of the piles, since we cannot know the data-dependent complexity term for each prefix without doing so. In contrast, the data-independent bounds can be computed and optimized for $k$ without examining the data at all, and the learning performed only once on the first $k$ piles.\n\nReferences\n\n[1] K. Crammer, M. Kearns, and J. Wortman. Learning from data of variable quality. In NIPS 18, 2006.\n[2] P. Wu and T. Dietterich. Improving SVM accuracy by training on auxiliary data sources. In ICML, 2004.\n[3] J. Baxter. Learning internal representations. In COLT, 1995.\n[4] S. Ben-David. Exploiting task relatedness for multiple task learning. In COLT, 2003.\n[5] D. Haussler. Decision theoretic generalizations of the PAC model for neural net and other learning applications. Information and Computation, 1992.\n[6] V. N. Vapnik. Statistical Learning Theory. Wiley, 1998.\n[7] M. Kearns and R. Schapire. Efficient distribution-free learning of probabilistic concepts. JCSS, 1994.\n[8] Y. Censor and S. A. Zenios. Parallel Optimization: Theory, Algorithms, and Applications. Oxford University Press, New York, NY, USA, 1997.\n[9] M. J. Wainwright and M. I. Jordan. Graphical models, exponential families, and variational inference. Technical Report 649, Department of Statistics, University of California, Berkeley, 2003.\n[10] P. L. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 2002.\n[11] V. Koltchinskii. 
Rademacher penalties and structural risk minimization. IEEE Trans. Info. Theory, 2001.\n", "award": [], "sourceid": 2972, "authors": [{"given_name": "Koby", "family_name": "Crammer", "institution": null}, {"given_name": "Michael", "family_name": "Kearns", "institution": null}, {"given_name": "Jennifer", "family_name": "Wortman", "institution": null}]}