{"title": "Multi-step learning and underlying structure in statistical models", "book": "Advances in Neural Information Processing Systems", "page_first": 4815, "page_last": 4823, "abstract": "In multi-step learning, where a final learning task is accomplished via a sequence of intermediate learning tasks, the intuition is that successive steps or levels transform the initial data into representations more and more ``suited\" to the final learning task. A related principle arises in transfer-learning where Baxter (2000) proposed a theoretical framework to study how learning multiple tasks transforms the inductive bias of a learner. The most widespread multi-step learning approach is semi-supervised learning with two steps: unsupervised, then supervised. Several authors (Castelli-Cover, 1996; Balcan-Blum, 2005; Niyogi, 2008; Ben-David et al, 2008; Urner et al, 2011) have analyzed SSL, with Balcan-Blum (2005) proposing a version of the PAC learning framework augmented by a ``compatibility function\" to link concept class and unlabeled data distribution. We propose to analyze SSL and other multi-step learning approaches, much in the spirit of Baxter's framework, by defining a learning problem generatively as a joint statistical model on $X \\times Y$. This determines in a natural way the class of conditional distributions that are possible with each marginal, and amounts to an abstract form of compatibility function. It also allows to analyze both discrete and non-discrete settings. As tool for our analysis, we define a notion of $\\gamma$-uniform shattering for statistical models. We use this to give conditions on the marginal and conditional models which imply an advantage for multi-step learning approaches. 
In particular, we recover a more general version of a result of Poggio et al (2012): under mild hypotheses, a multi-step approach which learns features invariant under successive factors of a finite group of invariances has sample complexity requirements that are additive rather than multiplicative in the size of the subgroups.", "full_text": "Multi-step learning and underlying structure in statistical models

Maia Fraser
Dept. of Mathematics and Statistics
Brain and Mind Research Institute
University of Ottawa
Ottawa, ON K1N 6N5, Canada
mfrase8@uottawa.ca

Abstract

In multi-step learning, where a final learning task is accomplished via a sequence of intermediate learning tasks, the intuition is that successive steps or levels transform the initial data into representations more and more "suited" to the final learning task. A related principle arises in transfer learning, where Baxter (2000) proposed a theoretical framework to study how learning multiple tasks transforms the inductive bias of a learner. The most widespread multi-step learning approach is semi-supervised learning with two steps: unsupervised, then supervised. Several authors (Castelli-Cover, 1996; Balcan-Blum, 2005; Niyogi, 2008; Ben-David et al, 2008; Urner et al, 2011) have analyzed SSL, with Balcan-Blum (2005) proposing a version of the PAC learning framework augmented by a "compatibility function" to link concept class and unlabeled data distribution. We propose to analyze SSL and other multi-step learning approaches, much in the spirit of Baxter's framework, by defining a learning problem generatively as a joint statistical model on X × Y. This determines in a natural way the class of conditional distributions that are possible with each marginal, and amounts to an abstract form of compatibility function. It also allows us to analyze both discrete and non-discrete settings.
As a tool for our analysis, we define a notion of γ-uniform shattering for statistical models. We use this to give conditions on the marginal and conditional models which imply an advantage for multi-step learning approaches. In particular, we recover a more general version of a result of Poggio et al (2012): under mild hypotheses, a multi-step approach which learns features invariant under successive factors of a finite group of invariances has sample complexity requirements that are additive rather than multiplicative in the size of the subgroups.

1 Introduction

The classical PAC learning framework of Valiant (1984) considers a learning problem with unknown true distribution p on X × Y, Y = {0, 1}, and fixed concept class C consisting of (deterministic) functions f : X → Y. The aim of learning is to select a hypothesis h : X → Y, say from C itself (realizable case), that best recovers f. More formally, the class C is said to be PAC learnable if there is a learning algorithm that with high probability selects h ∈ C having arbitrarily low generalization error for all possible distributions D on X. The distribution D governs both the sampling of points z = (x, y) ∈ X × Y by which the algorithm obtains a training sample and also the cumulation of error over all x ∈ X which gives the generalization error. A modification of this model, together with the notion of learnable with a model of probability (resp. decision rule) (Haussler, 1989; Kearns and Schapire, 1994), allows one to treat non-deterministic functions f : X → Y and the case Y = [0, 1] analogously. Polynomial dependence of the algorithms on sample size and reciprocals of probability bounds is further required in both frameworks for efficient learning.

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.
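The realizable PAC setup just described can be made concrete with a toy empirical risk minimization (ERM) run. This is a minimal sketch of our own, not from the paper: a hypothetical class of threshold concepts on ten points, a uniform distribution D, and ERM over a labeled sample.

```python
import random

# Toy illustration of the PAC setup described above (our own example, not the
# paper's): X = {0,...,9}, Y = {0,1}, and a finite concept class C of threshold
# functions f_t(x) = 1[x >= t]. We draw an iid labeled sample generated by an
# assumed true concept and select a hypothesis by empirical risk minimization.

X = list(range(10))

def threshold(t):
    return lambda x: 1 if x >= t else 0

C = {t: threshold(t) for t in range(11)}   # 11 threshold concepts
true_f = C[6]                              # assumed true concept (realizable case)

random.seed(0)
sample = [(x, true_f(x)) for x in random.choices(X, k=50)]  # iid from uniform D

def empirical_error(f, sample):
    return sum(f(x) != y for x, y in sample) / len(sample)

# ERM: pick the concept with smallest training error.
best_t = min(C, key=lambda t: empirical_error(C[t], sample))

# Generalization error under the uniform distribution D on X.
gen_error = sum(C[best_t](x) != true_f(x) for x in X) / len(X)
```

Because the true concept lies in C, the ERM hypothesis always achieves zero training error; its generalization error is small once the sample covers the points adjacent to the true threshold.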
Not only do these frameworks consider worst case error, in the sense of requiring the generalization error to be small for arbitrary distributions D on X, they assume the same concept class C regardless of the true underlying distribution D. In addition, the choice of hypothesis class is taken as part of the inductive bias of the algorithm and not addressed.

Various, by now classic, measures of complexity of a hypothesis space (e.g., VC dimension or Rademacher complexity; see Mohri et al. (2012) for an overview) allow one to prove upper bounds on generalization error in the above setting, and distribution-specific variants of these such as annealed VC-entropy (see Devroye et al. (1996)) or Rademacher averages (beginning with Koltchinskii (2001)) can be used to obtain more refined upper bounds.

The widespread strategy of semi-supervised learning (SSL) is known not to fit well into PAC-style frameworks (Valiant, 1984; Haussler, 1989; Kearns and Schapire, 1994). SSL algorithms perform a first step using unlabeled training data drawn from a distribution on X, followed by a second step using labeled training data from a joint distribution on X × Y. This has been studied by several authors (Balcan and Blum, 2005; Ben-David et al., 2008; Urner et al., 2011; Niyogi, 2013) following the seminal work of Castelli and Cover (1996) comparing the value of unlabeled and labeled data. One immediate observation is that without some tie between the possible marginals D on X and the concept class C which records possible conditionals p(y|x), there is no benefit to unlabeled data: if D can be arbitrary then it conveys no information about the true joint distribution that generated labeled data. Within PAC-style frameworks, however, C and D are completely independent. Balcan and Blum therefore proposed augmenting the PAC learning framework by the addition of a compatibility function χ : C × D →
[0, 1], which records the amount of compatibility we believe each concept from C to have with each D ∈ D, the class of "all" distributions on X. This function is required to be learnable from D and is then used to reduce the concept class from C to a sub-class which will be used for the subsequent (supervised) learning step. If χ is a good compatibility function, this sub-class should have lesser complexity than C (Balcan and Blum, 2005). While PAC-style frameworks in essence allow the true joint distribution to be anything in C × D, the existence of a good compatibility function in the sense of Balcan and Blum (2005) implicitly assumes the joint model that we believe in is smaller. We return to this point in Section 2.1.

In this paper we study properties of multi-step learning strategies, those which involve multiple training steps, by considering the advantages of breaking a single learning problem into a sequence of two learning problems. We start by assuming a true distribution which comes from a class of joint distributions, i.e. a statistical model, P on X × Y. We prove that underlying structure of a certain kind in P, together with differential availability of labeled vs. unlabeled data, implies a quantifiable advantage to multi-step learning at finite sample size. The structure we need is the existence of a representation t(x) of x ∈ X which is a sufficient statistic for the classification or regression of interest. Two common settings where this assumption holds are manifold learning and group-invariant feature learning. In these settings we have respectively

1. t = t_{p_X} is determined by the marginal p_X, and p_X is concentrated on a submanifold of X;
2.
t = t_G is determined by a group action on X and p(y|x) is invariant[1] under this action.

Learning t in these cases corresponds respectively to learning manifold features or group-invariant features; various approaches exist (see (Niyogi, 2013; Poggio et al., 2012) for more discussion) and we do not assume any fixed method. Our framework is also not restricted to these two settings. As a tool for analysis we define a variant of VC dimension for statistical models which we use to prove a useful lower bound on generalization error even[2] under the assumption that the true distribution comes from P. This allows us to establish a gap at finite sample size between the error achievable by a single-step purely supervised learner and that achievable by a semi-supervised learner. We do not claim an asymptotic gap. The purpose of our analysis is rather to show that differential finite availability of data can dictate a multi-step learning approach. Our applications are respectively a strengthening of a manifold learning example analyzed by Niyogi (2013) and a group-invariant features example related to a result of Poggio et al. (2012). We also discuss the relevance of these to biological learning.

Our framework has commonalities with a framework of Baxter (2000) for transfer learning. In that work, Baxter considered learning the inductive bias (i.e., the hypothesis space) for an algorithm for a "target" learning task, based on experience from previous "source" learning tasks. For this purpose he defined a learning environment E to be a class of probability distributions on X × Y together with an unknown probability distribution Q on E, and assumed E to restrict the possible joint distributions which may arise.

[1] This means there is a group G of transformations of X such that p(y|x) = p(y|g·x) for all g ∈ G.
[2] Distribution-specific lower bounds are by definition weaker than distribution-free ones.
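The invariance condition in footnote 1 is easy to illustrate computationally. A minimal sketch of our own (not the paper's construction): X is the set of binary strings of length 4, G = Z_4 acts by cyclic shifts, the label rule depends only on the orbit of x, and the canonical orbit representative plays the role of the feature t_G.

```python
from itertools import product

# Toy version of footnote 1 (our own example): a label rule invariant under a
# group G acting on X. Here X = binary 4-tuples, G = Z_4 acting by cyclic shift,
# and p(y|x) is deterministic and depends only on the orbit of x.

def shift(x, g):                      # action of g in Z_4 on a 4-tuple
    return x[g:] + x[:g]

def t(x):                             # orbit feature: lexicographically minimal rotation
    return min(shift(x, g) for g in range(4))

def label(x):                         # assumed invariant rule: parity of the ones
    return sum(x) % 2

# Check invariance: label(x) == label(g.x) and t(x) == t(g.x) for all x, g.
X = list(product([0, 1], repeat=4))
invariant = all(label(x) == label(shift(x, g)) and t(x) == t(shift(x, g))
                for x in X for g in range(4))
```

After quotienting by the action, the 16 points of X collapse to the 6 orbits (binary necklaces of length 4), so a learner working through t faces a much smaller effective input space.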
We also make a generative assumption, assuming joint distributions come from P, but we do not use a prior Q. Within his framework Baxter studied the reduction in generalization error for an algorithm to learn a new task, defined by p ∈ E, when given access to a sample from p and a sample from each of m other learning tasks, p_1, ..., p_m ∈ E, chosen randomly according to Q, compared with an algorithm having access to only a sample from p. The analysis produced upper bounds on generalization error in terms of covering numbers, and a lower bound was also obtained in terms of VC dimension in the specific case of shallow neural networks. In proving our lower bound in terms of a variant of VC dimension we use a minimax analysis.

2 Setup

We assume a learning problem is specified by a joint probability distribution p on Z = X × Y and a particular (regression, classification or decision) function f_p : X → R determined entirely by p(y|x). Moreover, we postulate a statistical model P on X × Y and assume p ∈ P. Despite the simplified notation, f_p(x) depends on the conditionals p(y|x) and not the entire joint distribution p.

There are three main types of learning problem our framework addresses (reflected in three types of f_p). When y is noise-free, i.e. p(y|x) is concentrated at a single y-value v_p(x) ∈ {0, 1}, f_p = v_p : X → {0, 1} (classification); here f_p(x) = E_p(y|x). When y is noisy, then either f_p : X → {0, 1} (classification/decision) or f_p : X → [0, 1] (regression) and f_p(x) = E_p(y|x). In all three cases the parameters which define f_p, the learning goal, depend only on p(y|x). We assume the learner knows the model P and the type of learning problem, i.e., the hypothesis class is the "concept class" C := {f_p : p ∈ P}.
To be more precise, for the first type of f_p listed above, this is the concept class (Kearns and Vazirani, 1994); for the second type, it is a class of decision rules; and for the third type, it is a class of p-concepts (Kearns and Schapire, 1994). For specific choices of loss function, we seek worst-case bounds on learning rates over all distributions p ∈ P.

Our results for all three types of learning problem are stated in Theorem 3. To keep the presentation simple, we give a detailed proof for the first two types, i.e., assuming labels are binary. This shows how classic PAC-style arguments for discrete X can be adapted to our framework where X may be smooth. Extending these arguments to handle non-binary Y proceeds by the same modifications as for discrete X (c.f. Kearns and Schapire (1994)). We remark that in the presence of noise, better bounds can be obtained (see Theorem 3 for details) if a more technical version of Definition 1 is used, but we leave this for a subsequent paper.

We define the following probabilistic version of fat shattering dimension:

Definition 1. Given P, a class of probability distributions on X × {0, 1}, let γ ∈ (0, 1), α ∈ (0, 1/2) and n ∈ N = {0, 1, ...}. Suppose there exist (disjoint) sets S_i ⊂ X, i ∈ {1, ..., n}, with S = ∪_i S_i, a reference probability measure q on X, and a sub-class P_n ⊂ P of cardinality |P_n| = 2^n with the following properties:

1. q(S_i) ≥ γ/n for every i ∈ {1, ..., n};
2. q lower bounds the marginals of all p ∈ P_n on S, i.e. ∫_B dp_X ≥ ∫_B dq for any p-measurable subset B ⊂ S;
3. ∀ e ∈ {0, 1}^n, ∃ p ∈ P_n such that E_p(y|x) > 1/2 + α for x ∈ S_i when e_i = 1 and E_p(y|x) < 1/2 − α for x ∈ S_i when e_i = 0;

then we say P α-shatters S_1, ..., S_n γ-uniformly using P_n.
The γ-uniform α-shattering dimension of P is the largest n such that P α-shatters some collection of n subsets of X γ-uniformly. This provides a measure of complexity of the class P of distributions in the sense that it indicates the variability of the expected y-values for x constrained to lie in the region S with measure at least γ under corresponding marginals. The reference measure q serves as a lower bound on the marginals and ensures that they "uniformly" assign probability at least γ to S. Richness (variability) of conditionals is thus traded off against uniformity of the corresponding marginal distributions.

Remark 2 (Uniformity of measure). The technical requirement of a reference distribution q is automatically satisfied if all marginals p_X for p ∈ P_n are uniform over S. For simplicity this is the situation considered in all our examples. The weaker condition (in terms of q) that we postulate in Definition 1 is however sufficient for our main result, Theorem 3.

If f_p is binary and y is noise-free then P shatters S_1, ..., S_n γ-uniformly if and only if there is a sub-class P_n ⊂ P with the specified uniformity of measure, such that each f_p(·) = E_p(y|·), p ∈ P_n, is constant on each S_i and the induced set-functions shatter {S_1, ..., S_n} in the usual (Vapnik-Chervonenkis) sense. In that case, α may be chosen arbitrarily in (0, 1/2) and we omit mention of it. If f_p takes values in [0, 1], or f_p is binary and y noisy, then γ-uniform shattering can be expressed in terms of fat-shattering (both at scale α).

We show that the γ-uniform α-shattering dimension of P can be used to lower bound the sample size required by even the most powerful learner of this class of problems. The proof is in the same spirit as purely combinatorial proofs of lower bounds using VC-dimension.
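In the finite, uniform-marginal situation of Remark 2, Definition 1 can be checked directly. A minimal sketch with a toy construction of our own (not one of the paper's examples): X has n points, each S_i is a singleton, q is uniform (so γ = 1), and P_n consists of the 2^n conditionals taking values 0.9 or 0.1 according to a sign pattern e.

```python
from itertools import product

# Toy check of Definition 1 (our own construction): X = {0,...,n-1}, S_i = {i},
# q uniform (so q(S_i) = 1/n >= gamma/n with gamma = 1), and P_n the 2^n
# conditionals with E_p(y|x) = 0.9 if e_x = 1 else 0.1. This family
# alpha-shatters S_1,...,S_n gamma-uniformly for any alpha < 0.4.

n = 4
S = list(range(n))
P_n = [{x: (0.9 if e[x] else 0.1) for x in S} for e in product([0, 1], repeat=n)]

def shatters(P_n, alpha):
    # Condition 3 of Definition 1: every sign pattern e is realized by some p,
    # with E_p(y|x) above 1/2 + alpha on S_i when e_i = 1 and below 1/2 - alpha
    # when e_i = 0.
    for e in product([0, 1], repeat=n):
        ok = any(all((p[x] > 0.5 + alpha) == bool(e[x]) and
                     (p[x] > 0.5 + alpha or p[x] < 0.5 - alpha) for x in S)
                 for p in P_n)
        if not ok:
            return False
    return True
```

With conditional expectations at 0.9 and 0.1, the family shatters at margin α = 0.3 but fails at α = 0.45, since 0.9 is not above 0.95.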
Essentially the added condition on P in terms of γ allows one to convert the risk calculation to a combinatorial problem. As a counterpoint to the lower bound result, we consider an alternative two-step learning strategy which makes use of underlying structure in X implied by the model P, and we obtain upper bounds for the corresponding risk.

2.1 Underlying structure

We assume a representation t : X → R^k of the data, such that p(y|x) can be expressed in terms of p(y|t(x)), say f_p(x) = g_θ(t(x)) for some parameter θ ∈ Θ. Such a t is generally known in Statistics as a sufficient dimension reduction for f_p, but here we make no assumption on the dimension k (compared with the dimension of X). This is in keeping with the paradigm of feature extraction for use in kernel machines, where the dimension of t(X) may even be higher than the original dimension of X. As in that setting, what will be important is rather that the intermediate representation t(x) reduce the complexity of the concept space. While t depends on p, we will assume it does so only via X. For example t could depend on p through the marginal p_X on X or a possible group action on X; it is a manifestation in the data X, possibly over time, of underlying structure in the true joint distribution p ∈ P. The representation t captures structure in X induced by p. On the other hand, the regression function itself depends only on the conditional p(y|t(x)).

In general, the natural factorization π : P → P_X, p ↦ p_X, determines for each marginal q ∈ P_X a collection π^{−1}(q) of possible conditionals, namely those p(y|x) arising from joint p ∈ P that have marginal p_X = q. More generally any sufficient statistic t induces a similar factorization (c.f. the Fisher-Neyman characterization) π_t : P → P_t, p ↦ p_t, where P_t is the marginal model with respect to t, and only conditionals p(y|t) are needed for learning.
As before, given a known marginal q ∈ P_t, this implies a collection π_t^{−1}(q) of possible conditionals p(y|t) relevant to learning. Knowing q thus reduces the original problem, where p(y|x) or p(y|t) can come from any p ∈ P, to one where it comes from p in a reduced class π^{−1}(q) or π_t^{−1}(q) ⊊ P. Note the similarity with the assumption of Balcan and Blum (2005) that a good compatibility function reduce the concept class. In our case the concept class C consists of f_p defined by p(y|t) in ∪_t P_{Y|t} with P_{Y|t} := {p(y|t) : p ∈ P}, and marginals come from P_t. The joint model P that we postulate, meanwhile, corresponds to a subset of C × P_t (pairs (f_p, q) where f_p uses p ∈ π_t^{−1}(q)). The indicator function for this subset is an abstract (binary) version of compatibility function (recall the compatibility function of Balcan-Blum should be a [0, 1]-valued function on C × D, satisfying further practical conditions that our function typically would not). Thus, in a sense, our assumption of a joint model P and sufficient statistic t amounts to a general form of compatibility function that links C and D without making assumptions on how t might be learned. This is enough to imply the original learning problem can be factored into first learning the structure t and then learning the parameter θ for f_p(x) = g_θ(t(x)) in a reduced hypothesis space. Our goal is to understand when and why one should do so.

2.2 Learning rates

We wish to quantify the benefits achieved by using such a factorization in terms of the bounds on the expected loss (i.e. risk) for a sample of size m ∈ N drawn iid from any p ∈ P. We assume the learner is provided with a sample z̄ = (z_1, z_2, ..., z_m), with z_i = (x_i, y_i) ∈ X × Y = Z, drawn iid from the distribution p, and uses an algorithm A : Z^m → C = H to select A(z̄) to approximate f_p. Let ℓ(A(z̄), f_p) denote a specific loss.
It might be 0/1, absolute, squared, hinge or logistic loss. We define L(A(z̄), f_p) to be the global expectation or L²-norm of one of those pointwise losses ℓ:

L(A(z̄), f_p) := E_x ℓ(A(z̄)(x), f_p(x)) = ∫_X ℓ(A(z̄)(x), f_p(x)) dp_X(x)   (1)

or

L(A(z̄), f_p) := ||ℓ(A(z̄), f_p)||_{L²(p_X)} = ( ∫_X ℓ(A(z̄)(x), f_p(x))² dp_X )^{1/2}.   (2)

Then the worst case expected loss (i.e. minimax risk) for the best learning algorithm with no knowledge of t_{p_X} is

R(m) := inf_A sup_{p∈P} E_z̄ L(A(z̄), f_p) = inf_A sup_{q∈P_X} sup_{p(y|t_q) s.t. p∈P, p_X=q} E_z̄ L(A(z̄), f_p),   (3)

while for the best learning algorithm with oracle knowledge of t_{p_X} it is

Q(m) := sup_{q∈P_X} inf_A sup_{p(y|t_q) s.t. p∈P, p_X=q} E_z̄ L(A(z̄), f_p).   (4)

Some clarification is in order regarding the classes over which the suprema are taken. In principle the worst case expected loss for a given A is the supremum over P of the expected loss. Since f_p(x) is determined by p(y|t_{p_X}(x)), and t_{p_X} is determined by p_X, this is a supremum over q ∈ P_X of a supremum over p(y|t_q(·)) such that p_X = q. Finding the worst case expected error for the best A therefore means taking the infimum of the supremum just described. In the case of Q(m), since the algorithm knows t_q, the order of the supremum over t changes with respect to the infimum: the learner can select the best algorithm A using knowledge of t_q.

Clearly R(m) ≥ Q(m) by definition. In the next section, we lower bound R(m) and upper bound Q(m) to establish a gap between R(m) and Q(m).

3 Main Result

We show that γ-uniform shattering dimension n or more implies a lower bound on the worst case expected error, R(m), when the sample size m ≤ n.
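The mechanism behind the gap between R(m) and Q(m) can be previewed numerically. The following Monte-Carlo sketch is our own illustration, not the paper's proof: with 2m shattered cells, a uniform marginal, and only m labeled samples, any learner without structural knowledge must guess on the cells it has not seen, and averaged over a uniformly random member of P_{2m} those guesses are wrong half the time.

```python
import random

# Monte-Carlo sketch (ours, not from the paper) of why R(m) stays large when the
# shattering dimension exceeds the sample size: 2m cells, m iid draws, noise-free
# labels given by a random sign pattern e. A learner that memorizes seen cells and
# guesses 0 elsewhere errs on about half of the unseen cells, so its 0/1 risk is
# roughly 0.5 * (1 - 1/(2m))^m, about 0.30 for m = 20.

random.seed(1)
m = 20
cells = list(range(2 * m))

def trial():
    e = [random.randint(0, 1) for _ in cells]        # random member of P_2m
    xs = random.choices(cells, k=m)                  # m iid draws, uniform marginal
    learned = {x: e[x] for x in xs}                  # noise-free labels on seen cells
    h = lambda x: learned.get(x, 0)                  # guess 0 on unseen cells
    return sum(h(x) != e[x] for x in cells) / len(cells)   # 0/1 risk under uniform p_X

avg_risk = sum(trial() for _ in range(2000)) / 2000
```

An oracle learner that knew the labels were determined by a low-complexity function of a known feature t would not pay this per-cell price, which is the gap the next section quantifies.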
In particular, in the setup specified in the previous section, if {g_θ(·) : θ ∈ Θ} has much smaller VC dimension than n this results in a distinct gap between rates for a learner with oracle access to t_{p_X} and a learner without.

Theorem 3. Consider the framework defined in the previous section with Y = {0, 1}. Assume {g_θ(·) : θ ∈ Θ} has VC dimension d < m and P has γ-uniform α-shattering dimension n ≥ (1+ε)m. Then, for sample size m,

Q(m) ≤ 16 √( (d log(m+1) + log 8 + 1) / (2m) )   while   R(m) > εbc γ^{m+1}/8,

where b depends both on the type of loss and the presence of noise, while c depends on noise. Assume the standard definition in (1). If the f_p are binary (in the noise-free or noisy setting), b = 1 for absolute, squared, 0-1, hinge or logistic loss. In the noisy setting, if f_p = E(y|x) ∈ [0, 1], b = α for absolute loss and b = α² for squared loss. In general, c = 1 in the noise-free setting and c = (1/2 + α)^m in the noisy setting. By requiring P to satisfy a stronger notion of γ-uniform α-shattering one can obtain c = 1 even in the noisy case.

Note that for sample size m and γ-uniform α-shattering dimension 2m, we have ε = 1, so the lower bound in its simplest form becomes γ^{m+1}/8. This is the bound we will use in the next section to derive implications of Theorem 3.

Remark 4. We have stated in the Theorem a simple upper bound, sticking to Y = {0, 1} and using VC dimension, in order to focus the presentation on the lower bound, which uses the new complexity measure. The upper bound could be improved. It could also be replaced with a corresponding upper bound assuming instead Y = [0, 1] and fat shattering dimension d.

Proof. The upper bound on Q(m) holds for an ERM algorithm (by the classic argument, see for example Corollary 12.1 in Devroye et al. (1996)).
We focus here on the lower bound for R(m). Moreover, we stick to the simpler definition of γ-uniform shattering in Definition 1 and omit proof of the final statement of the Theorem, which is slightly more involved. We let n = 2m (i.e. ε = 1) and we comment in a footnote on the result for general ε. Let S_1, ..., S_{2m} be sets which are γ-uniformly α-shattered using the family P_{2m} ⊂ P and denote their union by S. By assumption S has measure at least γ under a reference measure q which is dominated by all marginals p_X for p ∈ P_{2m} (see Definition 1). We divide our argument into three parts.

1. If we prove a lower bound for the average over P_{2m},

∀A,  (1/2^{2m}) Σ_{p∈P_{2m}} E_z̄ L(A(z̄), f_p) ≥ bc γ^{m+1}/8,   (5)

it will also be a lower bound for the supremum over P_{2m}:

∀A,  sup_{p∈P_{2m}} E_z̄ L(A(z̄), f_p) ≥ bc γ^{m+1}/8,

and hence for the supremum over P. It therefore suffices to prove (5).

2. Given x ∈ S, define v_p(x) to be the more likely label for x under the joint distribution p ∈ P_{2m}. This notation extends to the noisy case the definition of v_p already given for the noise-free case. The uniform shattering condition implies p(v_p(x)|x) > 1/2 + α in the noisy case and p(v_p(x)|x) = 1 in the noise-free case. Given x̄ = (x_1, ..., x_m) ∈ S^m, write z̄_p(x̄) := (z_1, ..., z_m) where z_j = (x_j, v_p(x_j)). Then

E_z̄ L(A(z̄), f_p) = ∫_{Z^m} L(A(z̄), f_p) dp^m(z̄) ≥ ∫_{S^m×Y^m} L(A(z̄), f_p) dp^m(z̄) ≥ c ∫_{S^m} L(A(z̄_p(x̄)), f_p) dp_X^m(x̄),

where c is as specified in the Theorem. Note the sets

V_l := {x̄ ∈ S^m ⊂ X^m : the x_j occupy exactly l of the S_i},

for l = 1, ..., m, define a partition of S^m. Recall that dp_X ≥ dq on S for all p ∈ P_{2m}, so

∫_{S^m} L(A(z̄_p(x̄)), f_p) dp_X^m(x̄) ≥ Σ_{l=1}^{m} ∫_{x̄∈V_l} L(A(z̄_p(x̄)), f_p) dq^m(x̄),

and hence, averaging over p ∈ P_{2m},

(1/2^{2m}) Σ_{p∈P_{2m}} Σ_{l=1}^{m} ∫_{x̄∈V_l} L(A(z̄_p(x̄)), f_p) dq^m(x̄) = Σ_{l=1}^{m} ∫_{x̄∈V_l} I dq^m(x̄),   where I := (1/2^{2m}) Σ_{p∈P_{2m}} L(A(z̄_p(x̄)), f_p).

We claim the integrand I is bounded below by bγ/8 (this computation is performed in part 3, and depends on knowing x̄ ∈ V_l). At the same time, S has measure at least γ under q, so

Σ_{l=1}^{m} ∫_{x̄∈V_l} dq^m(x̄) = ∫_{x̄∈S^m} dq^m(x̄) ≥ γ^m,

which will complete the proof of (5).

3.
We now assume a fixed but arbitrary x̄ ∈ V_l and prove I ≥ bγ/8. To simplify the discussion, we will refer to sets S_i which contain a component x_j of x̄ as "S_i with data". We also need notation for the elements of P_{2m}: for each L ⊂ [2m] denote by p(L) the unique element of P_{2m} such that v_{p(L)}|_{S_i} = 1 if i ∈ L, and v_{p(L)}|_{S_i} = 0 if i ∉ L. Now, let L_x̄ := {i ∈ [2m] : x̄ ∩ S_i ≠ ∅}. These are the indices of the sets S_i with data. By assumption |L_x̄| = l, and so |L_x̄^c| = 2m − l.

Every subset L ⊂ [2m], and hence every p ∈ P_{2m}, is determined by L ∩ L_x̄ and L ∩ L_x̄^c. We will collect together all p(L) having the same L ∩ L_x̄, namely for each D ⊂ L_x̄ define

P_D := {p(L) ∈ P_{2m} : L ∩ L_x̄ = D}.

These 2^l families partition P_{2m} and in each P_D there are 2^{2m−l} probability distributions. Most importantly, z̄_p(x̄) is the same for all p ∈ P_D (because D determines v_p on the S_i with data). This implies A(z̄_p(x̄)) : X → R is the same function[3] of X for all p in a given P_D. To simplify notation, since we will be working within a single P_D, we write f := A(z̄(x̄)).

While f is the hypothesized regression function given data x̄, f_p is the true regression function when p is the underlying distribution.
For each set S_i, let v_i be 1 if f is above 1/2 on a majority of S_i using reference measure q (a q-majority) and 0 otherwise. We now focus on the "unseen" S_i where no data lie (i.e., i ∈ L_x̄^c) and use the v_i to specify a 1-1 correspondence between elements p ∈ P_D and subsets K ⊂ L_x̄^c:

p ∈ P_D ↦ K_p := {i ∈ L_x̄^c : v_p ≠ v_i}.

Take a specific p ∈ P_D with its associated K_p. We have |f(x) − f_p(x)| > α on the q-majority of the set S_i for all i ∈ K_p. The condition |f(x) − f_p(x)| > α with f(x) and f_p(x) on opposite sides of 1/2 implies a lower bound on ℓ(f(x), f_p(x)) for each of the pointwise loss functions ℓ that we consider (0/1, absolute, square, hinge, logistic). The value of b, however, differs from case to case (see Appendix). For now we have

∫_{S_i} ℓ(f(x), f_p(x)) dp_X(x) ≥ ∫_{S_i} ℓ(f(x), f_p(x)) dq(x) ≥ b (1/2) ∫_{S_i} dq(x) ≥ bγ/(4m).

Summing over all i ∈ K_p, and letting k = |K_p|, we obtain (still for the same p)

L(f, f_p) ≥ k bγ/(4m)

(assuming L is defined by equation (1))[4]. There are \binom{2m−l}{k} possible K with cardinality k, for any k = 0, ..., 2m − l. Therefore,

Σ_{p∈P_D} L(f, f_p) ≥ Σ_{k=0}^{2m−l} \binom{2m−l}{k} k bγ/(4m) = (bγ/(4m)) (2m−l) 2^{2m−l−1} ≥ bγ 2^{2m−l}/8

(using 2m − l ≥ 2m − m = m)[5]. Since D was an arbitrary subset of L_x̄, this same lower bound holds for each of the 2^l families P_D, and so

I = (1/2^{2m}) Σ_{p∈P_{2m}} L(f, f_p) ≥ (1/2^{2m}) 2^l bγ 2^{2m−l}/8 = bγ/8.

In the constructions of the next section it is often the case that one can prove a different level of shattering for different n, namely γ(n)-uniform shattering of n subsets for various n. The following Corollary is an immediate consequence of the Theorem for such settings. We state it for binary f_p without noise.

Corollary 5. Let C ∈ (0, 1) and M ∈ N.
If P γ(n)-uniformly α-shatters n subsets of X and γ(n)^{n+1}/8 > C for all n < M, then no learning algorithm can achieve worst case expected error below αC using a training sample of size less than M/2. If such uniform shattering holds for all n ∈ N, then the same lower bound applies regardless of sample size.

Even when γ(n)-uniform shattering holds for all n ∈ N and lim_{n→∞} γ(n) = 1, if γ(n) approaches 1 sufficiently slowly then it is possible that γ(n)^{n+1} → 0 and there is no asymptotic obstacle to learning. By contrast, the next section shows an extreme situation where lim_{n→∞} γ(n)^{n+1} ≥ e^{−1} > 0. In that case, learning is impossible.

4 Applications and conclusion

Manifold learning. We now describe a simpler, finite dimensional version of the example in Niyogi (2013). Let X = R^D, D ≥ 2, and Y = {0, 1}. Fix N ∈ N and consider a very simple type of 1-dimensional manifold in X, namely the union of N linear segments, connected in circular fashion (see Figure 1). Let P_X be the collection of marginal distributions, each of which is supported on and assigns uniform probability along a curve of this type. There is a 1-1 correspondence between the elements of P_X and the curves just described.

[3] Warning: f need not be an element of {f_p : p ∈ P_{2n}}; we only know f ∈ H = {f_p : p ∈ P}.
[4] In the L² version, using √x ≥ x for x ∈ [0, 1], the reader can verify the same lower bound holds.
[5] In the case where we use (1 + ε)m instead of 2m, we would have (1 + ε)m − l ≥ εm here.

Figure 1: An example of M with N = 12. The dashed curve is labeled 1, the solid curve 0 (in the next Figure as well).

Figure 2: M with N = 28 = 4(n + 1) pieces, used to prove uniform shattering of n sets (shown for the case n = 6 with e = 010010).

On each curve M, choose two distinct points x′, x″. Removing these disconnects M. Let one component be labeled 0 and the other 1, then label x′ and x″ oppositely.
Let P be the class of joint distributions on X × Y with conditionals as described and marginals in P_X. This is a noise-free setting and f_p is binary. Given M (or circular coordinates on M), consider the reduced class P′ := {p ∈ P : support(p_X) = M}. Then H′ := {f_p : p ∈ P′} has VC dimension 3. On the other hand, for n < N/4 − 1 it can be shown that P γ(n)-uniformly shatters n sets with f_p, where γ(n) = 1 − 1/(n+1) (see Appendix and Figure 2). Since (1 − 1/(n+1))^{n+1} → e^{−1} > 0 as n → ∞, it follows from Corollary 5 that the worst case expected error is bounded below by e^{−1}/8 for any sample of size n ≤ N/8 − 1/2. If many linear pieces are allowed (i.e. N is high) this could be an impractical number of labeled examples. By contrast with this example, γ(n) in Niyogi's example cannot be made arbitrarily close to 1.

Group-invariant features. We give a simplified, partially-discrete example (for a smooth version and Figures, see Appendix). Let Y = {0, 1} and let X = J × I where J = {0, 1, ..., n1 − 1} × {0, 1, ..., n2 − 1} is an n1 by n2 grid (n1, n2 ∈ N) and I = [0, 1] is a real line segment. One should picture X as a rectangular array of vertical sticks. Above each grid point (j1, j2) consider two special points on the stick I, one with i = i₊ := 1 − ε and the other with i = i₋ := 0 + ε. Let P_X contain only the uniform distribution on X and assume the noise-free setting. For each ē ∈ {+, −}^{n1 n2}, on each segment (j1, j2) × I assign, via p_ē, the label 1 above the special point (determined by ē) and 0 below the point. This determines a family of 2^{n1 n2} conditional distributions and thus a family P := {p_ē : ē ∈ {+, −}^{n1 n2}} of 2^{n1 n2} joint distributions. The reader can verify that P has 2ε-uniform shattering dimension n1 n2.
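The sticks construction above is small enough to enumerate directly. The following sketch is our own numeric check with tiny hypothetical sizes n1 = 3, n2 = 4: it builds the family of sign patterns (one free bit per stick, hence 2^{n1 n2} joint distributions) and evaluates the noise-free labels, showing that for an invariant pattern the label depends only on the orbit coordinate i.

```python
from itertools import product

# Toy check of the sticks example (our own, tiny sizes): each stick (j1, j2)
# carries a threshold at height 1 - eps ("+") or eps ("-"). Single-step learning
# faces one free sign per stick, i.e. 2**(n1*n2) joint distributions.

n1, n2, eps = 3, 4, 0.1

patterns = list(product("+-", repeat=n1 * n2))   # the family P, one p_e per pattern

def f(e_bar, j1, j2, i):
    """Noise-free label above/below the special point on stick (j1, j2)."""
    special = 1 - eps if e_bar[j1 * n2 + j2] == "+" else eps
    return 1 if i > special else 0

# For a pattern invariant under the full group action (constant e_bar), the label
# depends only on the orbit coordinate i in I, so after quotienting the concept
# class reduces to a single threshold on I (VC dimension 1).
e_const = ("+",) * (n1 * n2)
orbit_labels = {f(e_const, j1, j2, 0.95) for j1 in range(n1) for j2 in range(n2)}
```

Learning invariance under Z_{n1} × {0} first and then under the remaining factor leaves, loosely speaking, n1 + n2 stick-classes to resolve across the two steps rather than n1 n2 at once, which is the additive-versus-multiplicative contrast the paper quantifies.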
Note that when the true distribution is $p_{\bar e}$ for some $\bar e \in \{+,-\}^{n_1 n_2}$, the labels will be invariant under the action $a_{\bar e}$ of $\mathbb{Z}_{n_1} \times \mathbb{Z}_{n_2}$ defined as follows. Given $(z_1, z_2) \in \mathbb{Z}_{n_1} \times \mathbb{Z}_{n_2}$ and $(j_1, j_2) \in J$, let the group element $(z_1, z_2)$ move the vertical stick at $(j_1, j_2)$ to the one at $(z_1 + j_1 \bmod n_1,\; z_2 + j_2 \bmod n_2)$ without flipping the stick over, just stretching it as needed so the special point $i_\pm$ determined by $\bar e$ on the first stick goes to the one on the second stick. The orbit space of the action can be identified with $I$. Let $t : X \times Y \to I$ be the projection of $X \times Y$ to this orbit space; then there is an induced labelling of this orbit space (because labels were invariant under the action of the group). Given access to $t$, the resulting concept class has VC dimension 1. On the other hand, given instead access to a projection $s$ for the action of the subgroup $\mathbb{Z}_{n_1} \times \{0\}$, the class $\widetilde{\mathcal{P}} := \{p(\cdot \mid s) : p \in \mathcal{P}\}$ has $2\epsilon$-uniform shattering dimension $n_2$. Thus we have a general setting where the overall complexity requirements for two-step learning are $n_1 + n_2$ while for single-step learning they are $n_1 n_2$.

Conclusion  We used a notion of uniform shattering to demonstrate both manifold learning and invariant feature learning situations where learning becomes impossible unless the learner has access to very large amounts of labeled data or else uses a two-step semi-supervised approach in which suitable manifold- or group-invariant features are learned first in unsupervised fashion. Our examples also provide a complexity manifestation of the advantages, observed by Poggio and Mallat, of forming intermediate group-invariant features according to sub-groups of a larger transformation group.

Acknowledgements  The author is deeply grateful to Partha Niyogi for the chance to have been his student.
This paper is directly inspired by discussions with him which were cut short much too soon. The author also thanks Ankan Saha and Misha Belkin for very helpful input on preliminary drafts.

References

M. Ahissar and S. Hochstein. The reverse hierarchy theory of visual perceptual learning. Trends in Cognitive Sciences, 8(10):457–464, 2004.

G. Alain and Y. Bengio. What regularized auto-encoders learn from the data generating distribution. Technical report, 2012. arXiv:1211.4246 [cs.LG].

M.-F. Balcan and A. Blum. A PAC-style model for learning from labeled and unlabeled data. In Learning Theory, volume 3559, pages 111–126. Springer LNCS, 2005.

J. Baxter. A model of inductive bias learning. Journal of Artificial Intelligence Research, 12:149–198, 2000.

M. Belkin, P. Niyogi, and V. Sindhwani. Manifold regularization: a geometric framework for learning from labeled and unlabeled examples. Journal of Machine Learning Research, 7:2399–2434, 2006.

S. Ben-David, T. Lu, and D. Pál. Does unlabeled data provably help? Worst-case analysis of the sample complexity of semi-supervised learning. In COLT, pages 33–44, 2008.

J. Bourne and M. Rosa. Hierarchical development of the primate visual cortex, as revealed by neurofilament immunoreactivity: early maturation of the middle temporal area (MT). Cerebral Cortex, 16(3):405–414, 2006.

V. Castelli and T. Cover. The relative value of labeled and unlabeled samples in pattern recognition. IEEE Transactions on Information Theory, 42:2102–2117, 1996.

L. Devroye, L. Györfi, and G. Lugosi. A Probabilistic Theory of Pattern Recognition, volume 31 of Applications of Mathematics. Springer, New York, 1996.

D. Haussler. Generalizing the PAC model: sample size bounds from metric dimension-based uniform convergence results. pages 40–45, 1989.

M. Kearns and R. Schapire. Efficient distribution-free learning of probabilistic concepts. Journal of Computer and System Sciences, 48:464–497, 1994.

M. J. Kearns and U. V. Vazirani. An Introduction to Computational Learning Theory. MIT Press, Cambridge, Massachusetts, 1994.

V. Koltchinskii. Rademacher penalties and structural risk minimization. IEEE Transactions on Information Theory, 47(5):1902–1914, 2001.

S. Mallat. Group invariant scattering. CoRR, abs/1101.2286, 2011. http://arxiv.org/abs/1101.2286.

M. Mohri, A. Rostamizadeh, and A. Talwalkar. Foundations of Machine Learning. MIT Press, 2012.

P. Niyogi. Manifold regularization and semi-supervised learning: some theoretical analyses. Journal of Machine Learning Research, 14:1229–1250, 2013.

T. Poggio, J. Mutch, F. Anselmi, L. Rosasco, J. Leibo, and A. Tacchetti. The computational magic of the ventral stream: sketch of a theory (and why some deep architectures work). Technical report, Massachusetts Institute of Technology, 2012. MIT-CSAIL-TR-2012-035.

R. Urner, S. Shalev-Shwartz, and S. Ben-David. Access to unlabeled data can speed up prediction time. In ICML, 2011.

L. Valiant. A theory of the learnable. Communications of the ACM, 27(11):1134–1142, 1984.