{"title": "Learning with Group Invariant Features: A Kernel Perspective.", "book": "Advances in Neural Information Processing Systems", "page_first": 1558, "page_last": 1566, "abstract": "We analyze in this paper a random feature map based on a theory of invariance (\\emph{I-theory}) introduced in \\cite{AnselmiLRMTP13}. More specifically, a group invariant signal signature is obtained through cumulative distributions of group-transformed random projections. Our analysis bridges invariant feature learning with kernel methods, as we show that this feature map defines an expected Haar-integration kernel that is invariant to the specified group action. We show how this non-linear random feature map approximates this group invariant kernel uniformly on a set of $N$ points. Moreover, we show that it defines a function space that is dense in the equivalent Invariant Reproducing Kernel Hilbert Space. Finally, we quantify error rates of the convergence of the empirical risk minimization, as well as the reduction in the sample complexity of a learning algorithm using such an invariant representation for signal classification, in a classical supervised learning setting", "full_text": "Learning with Group Invariant Features:\n\nA Kernel Perspective.\n\nYoussef Mroueh\nIBM Watson Group\n\nmroueh@us.ibm.com\n\nTomaso Poggio\nCBMM, MIT .\n\ntp@ai.mit.edu\n\nStephen Voinea\u2217\nCBMM, MIT.\n\u2217Co-\ufb01rst author\n\nvoinea@mit.edu\n\nAbstract\n\nWe analyze in this paper a random feature map based on a theory of invariance\n(I-theory) introduced in [1]. More speci\ufb01cally, a group invariant signal signature\nis obtained through cumulative distributions of group-transformed random pro-\njections. Our analysis bridges invariant feature learning with kernel methods, as\nwe show that this feature map de\ufb01nes an expected Haar-integration kernel that is\ninvariant to the speci\ufb01ed group action. 
We show how this non-linear random feature\nmap approximates this group invariant kernel uniformly on a set of N points.\nMoreover, we show that it defines a function space that is dense in the equivalent\nInvariant Reproducing Kernel Hilbert Space. Finally, we quantify error rates of\nthe convergence of the empirical risk minimization, as well as the reduction in the\nsample complexity of a learning algorithm using such an invariant representation\nfor signal classification, in a classical supervised learning setting.\n\n1 Introduction\n\nEncoding signals or building similarity kernels that are invariant to the action of a group is a key\nproblem in unsupervised learning, as it reduces the complexity of the learning task and mimics how\nour brain represents information invariantly to symmetries and various nuisance factors (change in\nlighting in image classification and pitch variation in speech recognition) [1, 2, 3, 4]. Convolutional\nneural networks [5, 6] achieve state-of-the-art performance in many computer vision and speech\nrecognition tasks, but require a large amount of labeled examples as well as augmented data, where\nwe reflect symmetries of the world through virtual examples [7, 8] obtained by applying\nidentity-preserving transformations such as shearing, rotation, translation, etc., to the training data. In this\nwork, we adopt the approach of [1], where the representation of the signal is designed to reflect\nthe invariant properties and model the world symmetries with group actions. The ultimate aim is\nto bridge unsupervised learning of invariant representations with invariant kernel methods, where\nwe can use tools from classical supervised learning to easily address the statistical consistency and\nsample complexity questions [9, 10]. Indeed, many invariant kernel methods and related invariant\nkernel networks have been proposed. 
We refer the reader to the related work section for a review\n(Section 5) and we start by showing how to accomplish this invariance through group-invariant\nHaar-integration kernels [11], and then show how random features derived from a memory-based theory\nof invariances introduced in [1] approximate such a kernel.\n\n1.1 Group Invariant Kernels\n\nWe start by reviewing group-invariant Haar-integration kernels introduced in [11], and their use in a\nbinary classification problem. This section highlights the conceptual advantages of such kernels as\nwell as their practical inconvenience, putting into perspective the advantage of approximating them\nwith explicit and invariant random feature maps.\n\nInvariant Haar-Integration Kernels. We consider a subset X of the hypersphere in d dimensions\n$S^{d-1}$. Let $\\rho_X$ be a measure on X . Consider a kernel $k_0$ on X , such as a radial basis function kernel.\nLet G be a group acting on X , with a normalized Haar measure \u00b5. G is assumed to be a compact\nand unitary group. Define an invariant kernel K between x, z \u2208 X through Haar-integration [11] as\nfollows:\n\n$$K(x, z) = \\int_G \\int_G k_0(gx, g'z) d\\mu(g) d\\mu(g'). \\qquad (1)$$\n\nAs we are integrating over the entire group, it is easy to see that: $K(g'x, gz) = K(x, z)$, \u2200g, g' \u2208\nG, \u2200x, z \u2208 X . Hence the Haar-integration kernel is invariant to the group action. The symmetry of\nK is obvious. Moreover, if $k_0$ is a positive definite kernel, it follows that K is positive definite as\nwell [11]. One can see the Haar-integration kernel framework as another form of data augmentation,\nsince we have to produce group-transformed points in order to compute the kernel.\nInvariant Decision Boundary. Turning now to a binary classification problem, we assume that we\nare given a labeled training set: $S = \\{(x_i, y_i) \\mid x_i \\in X, y_i \\in Y = \\{\\pm 1\\}\\}_{i=1}^{N}$. 
In order to learn a\ndecision function f : X \u2192 Y, we minimize the following empirical risk induced by an L-Lipschitz,\nconvex loss function V , with $V'(0) < 0$ [12]:\n\n$$\\min_{f \\in \\mathcal{H}_K} \\hat{E}_V(f) := \\frac{1}{N} \\sum_{i=1}^{N} V(y_i f(x_i)),$$\n\nwhere we\nrestrict f to belong to a hypothesis class induced by the invariant kernel K, the so-called Reproducing\nKernel Hilbert Space $\\mathcal{H}_K$. The representer theorem [13] shows that the solution of such a problem,\nor the optimal decision boundary $f^*_N$, has the following form: $f^*_N(x) = \\sum_{i=1}^{N} \\alpha^*_i K(x, x_i)$. Since the\nkernel K is group-invariant, it follows that\n\n$$f^*_N(gx) = \\sum_{i=1}^{N} \\alpha^*_i K(gx, x_i) = \\sum_{i=1}^{N} \\alpha^*_i K(x, x_i) = f^*_N(x), \\quad \\forall g \\in G.$$\n\nHence the decision boundary $f^*_N$ is group-invariant as well, and we have:\n$f^*_N(gx) = f^*_N(x)$, \u2200g \u2208 G, \u2200x \u2208 X .\nReduced Sample Complexity. We have shown that a group-invariant kernel induces a\ngroup-invariant decision boundary, but how does this translate to the sample complexity of the learning\nalgorithm? To answer this question, we will assume that the input set X has the following structure:\n$X = X_0 \\cup GX_0$, $GX_0 = \\{z \\mid z = gx, x \\in X_0, g \\in G \\setminus \\{e\\}\\}$, where e is the identity group element.\nThis structure implies that for a function f in the invariant RKHS $\\mathcal{H}_K$, we have:\n\n$\\forall z \\in GX_0, \\exists x \\in X_0, \\exists g \\in G$ such that $z = gx$ and $f(z) = f(x)$.\n\nLet $\\rho_y(x) = P(Y = y \\mid x)$ be the label posteriors. We assume that $\\rho_y(gx) = \\rho_y(x)$, \u2200g \u2208 G. This\nis a natural assumption since the label is unchanged given the group action. Assume that the set X\nis endowed with a measure $\\rho_X$ that is also group-invariant. 
Let f be the group-invariant decision\nfunction and consider the expected risk induced by the loss V , $E_V(f)$, defined as follows:\n\n$$E_V(f) = \\int_X \\sum_{y \\in Y} V(y f(x)) \\rho_y(x) \\rho_X(x) dx. \\qquad (2)$$\n\n$E_V(f)$ is a proxy to the misclassification risk [12]. Using the invariant properties of the function\nclass and the data distribution we have by invariance of f, $\\rho_y$, and $\\rho_X$:\n\n$$\\begin{aligned} E_V(f) &= \\int_{X_0} \\sum_{y \\in Y} V(y f(x)) \\rho_y(x) \\rho_X(x) dx + \\int_{GX_0} \\sum_{y \\in Y} V(y f(z)) \\rho_y(z) \\rho_X(z) dz \\\\ &= \\int_G d\\mu(g) \\int_{X_0} \\sum_{y \\in Y} V(y f(gx)) \\rho_y(gx) \\rho_X(x) dx \\\\ &= \\int_G d\\mu(g) \\int_{X_0} \\sum_{y \\in Y} V(y f(x)) \\rho_y(x) \\rho_X(x) dx \\quad \\text{(by invariance of } f, \\rho_y, \\text{ and } \\rho_X \\text{)} \\\\ &= \\int_{X_0} \\sum_{y \\in Y} V(y f(x)) \\rho_y(x) \\rho_X(x) dx. \\end{aligned}$$\n\nHence, given an invariant kernel to a group action that is identity preserving, it is sufficient to\nminimize the empirical risk on the core set $X_0$, and it generalizes to samples in $GX_0$.\nLet us imagine that X is finite with cardinality |X|; the cardinality of the core set $X_0$ is a small\nfraction of the cardinality of X : $|X_0| = \\alpha |X|$, where 0 < \u03b1 < 1. Hence, when we sample training\npoints from $X_0$, the maximum size of the training set is $N = \\alpha |X| \\ll |X|$, yielding a reduction in\nthe sample complexity.\n\n1.2 Contributions\n\nWe have just reviewed the group-invariant Haar-integration kernel. 
In summary, a group-invariant\nkernel implies the existence of a decision function that is invariant to the group action, as well as\na reduction in the sample complexity due to sampling training points from a reduced set, a.k.a. the\ncore set $X_0$.\nKernel methods with Haar-integration kernels come at a very expensive computational price at both\ntraining and test time: computing the kernel is computationally cumbersome as we have to integrate\nover the group and produce virtual examples by transforming points explicitly through the group\naction. Moreover, the training complexity of kernel methods scales cubically in the sample size.\nThese practical considerations make the usefulness of such kernels very limited.\nThe contributions of this paper are as follows:\n\n1. We first show that a non-linear random feature map $\\Phi : X \\to \\mathbb{R}^D$ derived from a\nmemory-based theory of invariances introduced in [1] induces an expected group-invariant\nHaar-integration kernel K. For fixed points x, z \u2208 X , we have: $\\mathbb{E} \\langle \\Phi(x), \\Phi(z) \\rangle = K(x, z)$,\nwhere K satisfies: $K(gx, g'z) = K(x, z)$, \u2200g, g' \u2208 G, x, z \u2208 X .\n\n2. We show a Johnson-Lindenstrauss type result that holds uniformly on a set of N points and\nassesses the concentration of this random feature map around its expected induced kernel. For\nsufficiently large D, we have $\\langle \\Phi(x), \\Phi(z) \\rangle \\approx K(x, z)$, uniformly on an N-point set.\n\n3. We show that, with a linear model, an invariant decision function can be learned in this\nrandom feature space by sampling points from the core set $X_0$, i.e. $f^*_N(x) \\approx \\langle w^*, \\Phi(x) \\rangle$,\nand generalizes to unseen points in $GX_0$, reducing the sample complexity. 
Moreover, we\nshow that those features define a function space that approximates a dense subset of the\ninvariant RKHS, and assess the error rates of the empirical risk minimization using such\nrandom features.\n\n4. We demonstrate the validity of these claims on three datasets: text (artificial), vision\n(MNIST), and speech (TIDIGITS).\n\n2 From Group Invariant Kernels to Feature Maps\n\nIn this paper we show that a random feature map based on I-theory [1], $\\Phi : X \\to \\mathbb{R}^D$, approximates\na group-invariant Haar-integration kernel K having the form given in Equation (1):\n\n$$\\langle \\Phi(x), \\Phi(z) \\rangle \\approx K(x, z).$$\n\nWe start with some notation that will be useful for defining the feature map. Denote the cumulative\ndistribution function of a random variable X by\n\n$$F_X(\\tau) = P(X \\leq \\tau).$$\n\nFix x \u2208 X . Let g \u2208 G be a random variable drawn according to the normalized Haar measure \u00b5 and\nlet t be a random template whose distribution will be defined later. For s > 0, define the following\ntruncated cumulative distribution function (CDF) of the dot product $\\langle x, gt \\rangle$:\n\n$$\\psi(x, t, \\tau) = P_g(\\langle x, gt \\rangle \\leq \\tau) = F_{\\langle x, gt \\rangle}(\\tau), \\quad \\tau \\in [-s, s], \\quad x \\in X.$$\n\nLet \u03b5 \u2208 (0, 1). We consider the following Gaussian vectors (sampling with rejection) for the\ntemplates t:\n\n$$t = n \\sim \\mathcal{N}\\left(0, \\frac{1}{d} I_d\\right) \\text{ if } \\|n\\|_2^2 < 1 + \\varepsilon, \\qquad t = \\perp \\text{ else}.$$\n\nThe reason behind this sampling is to keep the range of $\\langle x, gt \\rangle$ under control: the squared norm\n$\\|n\\|_2^2$ will be bounded by 1 + \u03b5 with high probability by a classical concentration result (see the proof\nof Theorem 1 for more details). The group being unitary and $x \\in S^{d-1}$, we know that\n$|\\langle x, gt \\rangle| \\leq \\|n\\|_2 < \\sqrt{1 + \\varepsilon} \\leq 1 + \\varepsilon$, for \u03b5 \u2208 (0, 1).\n\nRemark 1. 
We can also consider templates t drawn uniformly on the unit sphere $S^{d-1}$. Uniform\ntemplates on the sphere can be drawn as follows:\n\n$$t = \\frac{\\nu}{\\|\\nu\\|_2}, \\quad \\nu \\sim \\mathcal{N}(0, I_d).$$\n\nSince the norm of a Gaussian vector is highly concentrated around its mean $\\sqrt{d}$, we can use the\nGaussian sampling with rejection. Results proved for Gaussian templates (with rejection) will hold\ntrue for templates drawn uniformly on the sphere, with different constants.\n\nDefine the following kernel function:\n\n$$K_s(x, z) = E_t \\int_{-s}^{s} \\psi(x, t, \\tau) \\psi(z, t, \\tau) d\\tau,$$\n\nwhere s will be fixed throughout the paper to be s = 1 + \u03b5, since the Gaussian sampling with rejection\ncontrols the dot product to be in that range.\nLet $\\bar{g} \\in G$. As the group is closed, we have\n\n$$\\psi(\\bar{g}x, t, \\tau) = \\int_G \\mathbf{1}_{\\langle g\\bar{g}x, t \\rangle \\leq \\tau} d\\mu(g) = \\int_G \\mathbf{1}_{\\langle gx, t \\rangle \\leq \\tau} d\\mu(g) = \\psi(x, t, \\tau),$$\n\nand hence $K_s(gx, g'z) = K_s(x, z)$ for all g, g' \u2208 G. It is clear\nnow that $K_s$ is a group-invariant kernel.\nIn order to approximate $K_s$, we sample |G| elements uniformly and independently from the group\nG, i.e. $g_i$, i = 1 . . . |G|, and define the normalized empirical CDF:\n\n$$\\phi(x, t, \\tau) = \\frac{1}{|G| \\sqrt{m}} \\sum_{i=1}^{|G|} \\mathbf{1}_{\\langle g_i t, x \\rangle \\leq \\tau}, \\quad -s \\leq \\tau \\leq s.$$\n\nWe discretize the continuous threshold \u03c4 as follows:\n\n$$\\phi\\left(x, t, \\frac{sk}{n}\\right) = \\frac{\\sqrt{s}}{\\sqrt{nm} |G|} \\sum_{i=1}^{|G|} \\mathbf{1}_{\\langle g_i t, x \\rangle \\leq \\frac{s}{n} k}, \\quad -n \\leq k \\leq n.$$\n\nWe sample m templates independently according to the Gaussian sampling with rejection, $t_j$, j =\n1 . . . m. 
We are now ready to define the random feature map \u03a6:\n\n$$\\Phi(x) = \\left[ \\phi\\left(x, t_j, \\frac{sk}{n}\\right) \\right]_{j=1 \\ldots m, k=-n \\ldots n} \\in \\mathbb{R}^{(2n+1) \\times m}.$$\n\nIt is easy to see that:\n\n$$\\lim_{n \\to \\infty} E_{t,g} \\langle \\Phi(x), \\Phi(z) \\rangle_{\\mathbb{R}^{(2n+1) \\times m}} = \\lim_{n \\to \\infty} E_{t,g} \\sum_{j=1}^{m} \\sum_{k=-n}^{n} \\phi\\left(x, t_j, \\frac{sk}{n}\\right) \\phi\\left(z, t_j, \\frac{sk}{n}\\right) = K_s(x, z).$$\n\nIn Section 3 we study the geometric information captured by this kernel by stating explicitly the\nsimilarity it computes.\nRemark 2 (Efficiency of the representation). 1) The main advantage of such a feature map, as\noutlined in [1], is that we store transformed templates in order to compute \u03a6, while if we wanted\nto compute an invariant kernel of type K (Equation (1)), we would need to explicitly transform\nthe points. The latter is computationally expensive. Storing transformed templates and computing\nthe signature \u03a6 is much more efficient. It falls in the category of memory-based learning, and is\nbiologically plausible [1].\n2) As |G|, m, n get large enough, the feature map \u03a6 approximates a group-invariant kernel, as we\nwill see in the next section.\n\n3 An Equivalent Expected Kernel and a Uniform Concentration Result\n\nIn this section we present our main results, with proofs given in the supplementary material.\nTheorem 1 shows that the random feature map \u03a6, defined in the previous section, corresponds in\nexpectation to a group-invariant Haar-integration kernel $K_s(x, z)$. Moreover, $s - K_s(x, z)$ computes the\naverage pairwise distance between all points in the orbits of x and z, where the orbit is defined as\nthe collection of all group-transformations of a given point x: $O_x = \\{gx, g \\in G\\}$.\nTheorem 1 (Expectation). Let \u03b5 \u2208 (0, 1) and x, z \u2208 X . 
Define the distance $d_G$ between the orbits\n$O_x$ and $O_z$:\n\n$$d_G(x, z) = \\frac{1}{\\sqrt{2 \\pi d}} \\int_G \\int_G \\|gx - g'z\\|_2 d\\mu(g) d\\mu(g'),$$\n\nand the group-invariant expected kernel\n\n$$K_s(x, z) = \\lim_{n \\to \\infty} E_{t,g} \\langle \\Phi(x), \\Phi(z) \\rangle_{\\mathbb{R}^{(2n+1) \\times m}} = E_t \\int_{-s}^{s} \\psi(x, t, \\tau) \\psi(z, t, \\tau) d\\tau, \\quad s = 1 + \\varepsilon.$$\n\n1. The following inequality holds with probability 1:\n\n$$\\varepsilon - \\delta_2(d, \\varepsilon) \\leq K_s(x, z) - (1 - d_G(x, z)) \\leq \\varepsilon + \\delta_1(d, \\varepsilon), \\qquad (3)$$\n\nwhere $\\delta_1(d, \\varepsilon) = \\frac{e^{-d\\varepsilon^2/16}}{\\sqrt{d}} - \\frac{1}{2} \\frac{e^{-\\varepsilon d/2} (1+\\varepsilon)^{d/2}}{\\sqrt{d}}$ and $\\delta_2(d, \\varepsilon) = \\frac{e^{-d\\varepsilon^2/16}}{\\sqrt{d}} + (1 + \\varepsilon) e^{-d\\varepsilon^2/8}$.\n\n2. For any \u03b5 \u2208 (0, 1), as the dimension d \u2192 \u221e we have $\\delta_1(d, \\varepsilon) \\to 0$ and $\\delta_2(d, \\varepsilon) \\to 0$, and\nwe have asymptotically $K_s(x, z) \\to 1 - d_G(x, z) + \\varepsilon = s - d_G(x, z)$.\n\n3. $K_s$ is symmetric and positive semi-definite.\n\nRemark 3. 1) \u03b5, $\\delta_1(d, \\varepsilon)$, and $\\delta_2(d, \\varepsilon)$ are not errors due to results holding with high probability\nbut are due to the truncation and are a technical artifact of the proof. 2) Local invariance can be\ndefined by restricting the sampling of the group elements to a subset $\\mathcal{G} \\subset G$. 
Assuming that for each\n$g \\in \\mathcal{G}$, $g^{-1} \\in \\mathcal{G}$, the equivalent kernel has asymptotically the following form:\n\n$$K_s(x, z) \\approx s - \\frac{1}{\\sqrt{2 \\pi d}} \\int_{\\mathcal{G}} \\int_{\\mathcal{G}} \\|gx - g'z\\|_2 d\\mu(g) d\\mu(g').$$\n\n3) The norm-one constraint can be relaxed: let $R = \\sup_{x \\in X} \\|x\\|_2 < \\infty$; we can then set s =\nR(1 + \u03b5), and\n\n$$-\\delta_2(d, \\varepsilon) \\leq K_s(x, z) - (R(1 + \\varepsilon) - d_G(x, z)) \\leq \\delta_1(d, \\varepsilon), \\qquad (4)$$\n\nwhere $\\delta_1(d, \\varepsilon) = R \\frac{e^{-d\\varepsilon^2/16}}{\\sqrt{d}} - \\frac{R}{2} \\frac{e^{-\\varepsilon d/2} (1+\\varepsilon)^{d/2}}{\\sqrt{d}}$ and $\\delta_2(d, \\varepsilon) = R \\frac{e^{-d\\varepsilon^2/16}}{\\sqrt{d}} + R(1 + \\varepsilon) e^{-d\\varepsilon^2/8}$.\n\nTheorem 2 is, in a sense, an invariant Johnson-Lindenstrauss [14] type result where we show that\nthe dot product defined by the random feature map \u03a6, i.e. $\\langle \\Phi(x), \\Phi(z) \\rangle$, is concentrated around the\ninvariant expected kernel uniformly on a data set of N points, given a sufficiently large number of\ntemplates m, a large number of sampled group elements |G|, and a large bin number n. The error\nnaturally decomposes into a numerical error $\\varepsilon_0$ and statistical errors $\\varepsilon_1$, $\\varepsilon_2$ due to the sampling of the\ntemplates and the group elements respectively.\nTheorem 2 (Johnson-Lindenstrauss type Theorem, N-point Set). Let $D = \\{x_i \\mid x_i \\in X\\}_{i=1}^{N}$\nbe a finite dataset. Fix $\\varepsilon_0, \\varepsilon_1, \\varepsilon_2, \\delta_1, \\delta_2 \\in (0, 1)$. For a number of bins $n \\geq \\frac{1}{\\varepsilon_0}$, templates $m \\geq \\frac{C_1}{\\varepsilon_1^2} \\log(\\frac{N}{\\delta_1})$,\nand group elements $|G| \\geq \\frac{C_2}{\\varepsilon_2^2} \\log(\\frac{Nm}{\\delta_2})$, where $C_1$, $C_2$ are universal numeric constants,\nwe have:\n\n$$|\\langle \\Phi(x_i), \\Phi(x_j) \\rangle - K_s(x_i, x_j)| \\leq \\varepsilon_0 + \\varepsilon_1 + \\varepsilon_2, \\quad i = 1 \\ldots N, \\quad j = 1 \\ldots N, \\qquad (5)$$\n\nwith probability $1 - \\delta_1 - \\delta_2$.\n
Putting together Theorems 1 and 2, the following Corollary shows how the group-invariant random\nfeature map \u03a6 captures the invariant distance between points uniformly on a dataset of N points.\nCorollary 1 (Invariant Features Maps and Distances between Orbits). Let $D = \\{x_i \\mid x_i \\in X\\}_{i=1}^{N}$\nbe a finite dataset. Fix $\\varepsilon_0, \\delta \\in (0, 1)$. For a number of bins $n \\geq \\frac{3}{\\varepsilon_0}$, templates $m \\geq \\frac{9 C_1}{\\varepsilon_0^2} \\log(\\frac{N}{\\delta})$,\nand group elements $|G| \\geq \\frac{9 C_2}{\\varepsilon_0^2} \\log(\\frac{Nm}{\\delta})$, where $C_1$, $C_2$ are universal numeric constants, we have:\n\n$$\\varepsilon - \\delta_2(d, \\varepsilon) - \\varepsilon_0 \\leq \\langle \\Phi(x_i), \\Phi(x_j) \\rangle - (1 - d_G(x_i, x_j)) \\leq \\varepsilon_0 + \\varepsilon + \\delta_1(d, \\varepsilon), \\qquad (6)$$\n\ni = 1 . . . N, j = 1 . . . N, with probability 1 \u2212 2\u03b4.\nRemark 4. Assuming that the templates are unitary and drawn from a general distribution p(t), the\nequivalent kernel has the following form:\n\n$$K_s(x, z) = \\int_G \\int_G d\\mu(g) d\\mu(g') \\left( \\int \\left( s - \\max(\\langle x, gt \\rangle, \\langle z, g't \\rangle) \\right) p(t) dt \\right).$$\n\nIndeed, when we use the Gaussian sampling with rejection for the templates, the integral\n$\\int \\max(\\langle x, gt \\rangle, \\langle z, g't \\rangle) p(t) dt$ is asymptotically proportional to $\\|g^{-1}x - {g'}^{-1}z\\|_2$. It is interesting\nto consider different distributions that are domain-specific for the templates and assess the number\nof the templates needed to approximate such kernels. It is also interesting to find the optimal\ntemplates that achieve the minimum distortion in Equation (6), in a data dependent way, but we will\naddress these points in future work.\n\n4 Learning with Group Invariant Random Features\n\nIn this section, we show that learning a linear model in the invariant, random feature space, on a\ntraining set sampled from the reduced core set $X_0$, has a low expected risk, and generalizes to unseen\ntest points generated from the distribution on $X = X_0 \\cup GX_0$. The architecture of the proof follows\nideas from [15] and [16]. Recall that given an L-Lipschitz convex loss function V , our aim is to\nminimize the expected risk given in Equation (2). Denote the CDF by $\\psi(x, t, \\tau) = P(\\langle gt, x \\rangle \\leq \\tau)$,\nand the empirical CDF by $\\hat{\\psi}(x, t, \\tau) = \\frac{1}{|G|} \\sum_{i=1}^{|G|} \\mathbf{1}_{\\langle g_i t, x \\rangle \\leq \\tau}$. Let p(t) be the distribution of templates t. 
The RKHS defined by the invariant kernel $K_s$, $K_s(x, z) = \\int \\int_{-s}^{s} \\psi(x, t, \\tau) \\psi(z, t, \\tau) p(t) dt d\\tau$,\ndenoted $\\mathcal{H}_{K_s}$, is the completion of the set of all finite linear combinations of the form:\n\n$$f(x) = \\sum_i \\alpha_i K_s(x, x_i), \\quad x_i \\in X, \\quad \\alpha_i \\in \\mathbb{R}. \\qquad (7)$$\n\nSimilarly to [16], we define the following infinite-dimensional function space:\n\n$$\\mathcal{F}_p = \\left\\{ f(x) = \\int \\int_{-s}^{s} w(t, \\tau) \\psi(x, t, \\tau) dt d\\tau \\mid \\sup_{\\tau, t} \\frac{|w(t, \\tau)|}{p(t)} \\leq C \\right\\}.$$\n\nLemma 1. $\\mathcal{F}_p$ is dense in $\\mathcal{H}_{K_s}$. For $f \\in \\mathcal{F}_p$ we have $E_V(f) = \\int_{X_0} \\sum_{y \\in Y} V(y f(x)) \\rho_y(x) d\\rho_X(x)$,\nwhere $X_0$ is the reduced core set.\nSince $\\mathcal{F}_p$ is dense in $\\mathcal{H}_{K_s}$, we can learn an invariant decision function in the space $\\mathcal{F}_p$, instead\nof learning in $\\mathcal{H}_{K_s}$. Let $\\Psi(x) = \\left[ \\hat{\\psi}\\left(x, t_j, \\frac{sk}{n}\\right) \\right]_{j=1 \\ldots m, k=-n \\ldots n}$. \u03a8 and \u03a6 are equivalent up to\nconstants. We will approximate the set $\\mathcal{F}_p$ as follows:\n\n$$\\tilde{\\mathcal{F}} = \\left\\{ f(x) = \\langle w, \\Psi(x) \\rangle = \\frac{s}{n} \\sum_{j=1}^{m} \\sum_{k=-n}^{n} w_{j,k} \\hat{\\psi}\\left(x, t_j, \\frac{sk}{n}\\right), \\quad t_j \\sim p, \\quad j = 1 \\ldots m \\mid \\|w\\|_\\infty \\leq \\frac{C}{m} \\right\\}.$$\n\n
Hence, we learn the invariant decision function via empirical risk minimization where we restrict\nthe function to belong to $\\tilde{\\mathcal{F}}$, and the sampling in the training set is restricted to the core set $X_0$. Note\nthat with this function space we are regularizing, for convenience, the infinity norm of the weights,\nbut this can be relaxed in practice to a classical Tikhonov regularization.\nTheorem 3 (Learning with Group invariant features). Let $S = \\{(x_i, y_i) \\mid x_i \\in X_0, y_i \\in Y, i = 1 \\ldots N\\}$\nbe a training set sampled from the core set $X_0$. Let $f^*_N = \\arg\\min_{f \\in \\tilde{\\mathcal{F}}} \\hat{E}_V(f)$, where $\\hat{E}_V(f) = \\frac{1}{N} \\sum_{i=1}^{N} V(y_i f(x_i))$. Fix \u03b4 > 0; then\n\n$$E_V(f^*_N) \\leq \\min_{f \\in \\mathcal{F}_p} E_V(f) + 2 \\frac{1}{\\sqrt{N}} \\left( 4LsC + 2V(0) + LC \\sqrt{\\frac{1}{2} \\log\\left(\\frac{1}{\\delta}\\right)} \\right) + \\frac{2sLC}{\\sqrt{m}} \\left( 1 + \\sqrt{2 \\log\\left(\\frac{1}{\\delta}\\right)} \\right) + L \\left( \\frac{2sC}{\\sqrt{|G|}} \\left( 1 + \\sqrt{2 \\log\\left(\\frac{m}{\\delta}\\right)} \\right) + \\frac{2sC}{n} \\right),$$\n\nwith probability at least 1 \u2212 3\u03b4 on the training set and the choice of templates and group elements.\nThe proof of Theorem 3 is given in Appendix B. Theorem 3 shows that learning a linear model\nin the invariant random feature space defined by \u03a6 (or equivalently \u03a8), has a low expected\nrisk. More importantly, this risk is arbitrarily close to the optimal risk achieved in an\ninfinite-dimensional class of functions, namely $\\mathcal{F}_p$. The training set is sampled from the reduced core\nset $X_0$, and invariant learning generalizes to unseen test points generated from the distribution\non $X = X_0 \\cup GX_0$, hence the reduction in the sample complexity. 
Recall that $\\mathcal{F}_p$ is dense in\nthe RKHS of the Haar-integration invariant kernel, and so the expected risk achieved by a linear\nmodel in the invariant random feature space is not far from the one attainable in the invariant\nRKHS. Note that the error decomposes into two terms. The first, $O(\\frac{1}{\\sqrt{N}})$, is statistical and it\ndepends on the training sample complexity N. The other is governed by the approximation error of\nfunctions in $\\mathcal{F}_p$ with functions in $\\tilde{\\mathcal{F}}$, and depends on the number of templates m, the number of group\nelements sampled |G|, and the number of bins n, and has the following form: $O(\\frac{1}{\\sqrt{m}}) + O\\left(\\sqrt{\\frac{\\log m}{|G|}} + \\frac{1}{n}\\right)$.\n\n5 Relation to Previous Work\n\nWe now put our contributions in perspective by outlining some of the previous work on invariant\nkernels and approximating kernels with random features.\nApproximating Kernels. Several schemes have been proposed for approximating a non-linear\nkernel with an explicit non-linear feature map in conjunction with linear methods, such as the Nystr\u00f6m\nmethod [17] or random sampling techniques in the Fourier domain for translation-invariant kernels\n[15]. Our features fall under the random sampling techniques where, unlike previous work, we\nsample both projections and group elements to induce invariance with an integral representation. We\nnote that the relation between random features and quadrature rules has been thoroughly studied in\n[18], where sharper bounds and error rates are derived, and can apply to our setting.\nInvariant Kernels. We focused in this paper on Haar-integration kernels [11], since they have an\nintegral representation and hence can be represented with random features [18]. Other invariant\nkernels have been proposed: In [19] authors introduce transformation invariant kernels, but unlike\nour general setting, the analysis is concerned with dilation invariance. 
In [20], multilayer arccosine\nkernels are built by composing kernels that have an integral representation, but do not explicitly\ninduce invariance. More closely related to our work is [21], where kernel descriptors are built for\nvisual recognition by introducing a kernel view of histogram of gradients that corresponds in our case\nto the cumulative distribution on the group variable. Explicit feature maps are obtained via kernel\nPCA, while our features are obtained via random sampling. Finally the convolutional kernel network\nof [22] builds a sequence of multilayer kernels that have an integral representation, by convolution,\nconsidering spatial neighborhoods in an image. Our future work will consider the composition of\nHaar-integration kernels, where the convolution is applied not only to the spatial variable but to the\ngroup variable akin to [2].\n\n6 Numerical Evaluation\n\nIn this paper, and specifically in Theorems 2 and 3, we showed that the random, group-invariant\nfeature map \u03a6 captures the invariant distance between points, and that learning a linear model\ntrained in the invariant, random feature space will generalize well to unseen test points.\nIn this\nsection, we validate these claims through three experiments. For the claims of Theorem 2, we\nwill use a nearest neighbor classifier, while for Theorem 3, we will rely on the regularized least\nsquares (RLS) classifier, one of the simplest algorithms for supervised learning. While our proofs\nfocus on norm-infinity regularization, RLS corresponds to Tikhonov regularization with square\nloss. 
Specifically, for performing T-way classification on a batch of N training points in $\\mathbb{R}^d$,\nsummarized in the data matrix $X \\in \\mathbb{R}^{N \\times d}$ and label matrix $Y \\in \\mathbb{R}^{N \\times T}$, RLS will perform the\noptimization\n\n$$\\min_{W \\in \\mathbb{R}^{m \\times T}} \\left\\{ \\frac{1}{N} \\|Y - \\Phi(X) W\\|_F^2 + \\lambda \\|W\\|_F^2 \\right\\},$$\n\nwhere $\\|\\cdot\\|_F$ is the Frobenius norm,\n\u03bb is the regularization parameter, and \u03a6 is the feature map, which for the representation described\nin this paper will be a CDF pooling of the data projected onto group-transformed random templates.\nAll RLS experiments in this paper were completed with the GURLS toolbox [23]. The three\ndatasets we explore are:\nXperm (Figure 1): An artificial dataset consisting of all sequences of length 5 whose elements\ncome from an alphabet of 8 characters. We want to learn a function which assigns a positive value\nto any sequence that contains a target set of characters (in our case, two of them) regardless of their\nposition. Thus, the function label is globally invariant to permutation, and so we project our data\nonto all permuted versions of our random template sequences.\nMNIST (Figure 2): We seek local invariance to translation and rotation, and so all random templates\nare translated by up to 3 pixels in all directions and rotated between -20 and 20 degrees.\nTIDIGITS (Figure 3): We use a subset of TIDIGITS consisting of 326 speakers (men, women,\nchildren) reading the digits 0-9 in isolation, and so each datapoint is a waveform of a single word.\nWe seek local invariance to pitch and speaking rate [25], and so all random templates are pitch\nshifted up and down by 400 cents and warped to play at half and double speed. The task is 10-way\nclassification with one class-per-digit. 
See [24] for more detail.\n\nAcknowledgements: Stephen Voinea acknowledges the support of a Nuance Foundation Grant.\nThis work was also supported in part by the Center for Brains, Minds and Machines (CBMM),\nfunded by NSF STC award CCF 1231216.\n\nFigure 1: Classification accuracy as a function of training set size, averaged over 100 random\ntraining samples at each size. \u03a6 = CDF(n, m) refers to a random feature map with n bins and m\ntemplates. With 25 templates, the random feature map outperforms the raw features and a\nbag-of-words representation (also invariant to permutation) and even approaches an RLS classifier with a\nHaar-integration kernel. Error bars were removed from the RLS plot for clarity. See supplement.\n\nFigure 2: Left Plot) Mean classification accuracy as a function of number of bins and templates,\naveraged over 30 random sets of templates. Right Plot) Classification accuracy as a function of\ntraining set size, averaged over 100 random samples of the training set at each size. At 1000\nexamples per class, we achieve an accuracy of 98.97%.\n\nFigure 3: Mean classification accuracy as a function of number of bins and templates, averaged\nover 30 random sets of templates. In the \u201cSpeaker\u201d dataset, we test on unseen speakers, and in the\n\u201cGender\u201d dataset, we test on a new gender, giving us an extreme train/test mismatch. 
[25].\n\n[Plots omitted. Figure 1 panels: \u201cXperm Sample Complexity RLS\u201d and \u201cXperm Sample Complexity 1-NN\u201d (x-axis: Number of Training Points Per Class; y-axis: Accuracy). Figure 2 panels: \u201cMNIST Accuracy RLS (1000 Points Per Class)\u201d (x-axis: Number of Templates; y-axis: Accuracy) and \u201cMNIST Sample Complexity RLS\u201d (x-axis: Number of Training Points Per Class). Figure 3 panels: \u201cTIDIGITS Speaker RLS\u201d and \u201cTIDIGITS Gender RLS\u201d (x-axis: Number of Templates; y-axis: Accuracy).]\n\nReferences\n[1] F. Anselmi, J. Z. Leibo, L. Rosasco, J. Mutch, A. Tacchetti, and T. Poggio, \u201cUnsupervised learning of invariant representations in hierarchical architectures,\u201d CoRR, vol. abs/1311.4158, 2013.\n[2] J. Bruna and S. Mallat, \u201cInvariant scattering convolution networks,\u201d CoRR, vol. abs/1203.1513, 2012.\n[3] G. Hinton, A. Krizhevsky, and S. Wang, \u201cTransforming auto-encoders,\u201d ICANN-11, 2011.\n[4] Y. Bengio, A. C. Courville, and P. Vincent, \u201cRepresentation learning: A review and new perspectives,\u201d IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 8, pp. 1798\u20131828, 2013.\n[5] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, \u201cGradient-based learning applied to document recognition,\u201d in Proceedings of the IEEE, vol. 86, pp. 2278\u20132324, 1998.\n[6] A. Krizhevsky, I. Sutskever, and G. E. Hinton, \u201cImagenet classification with deep convolutional neural networks,\u201d in NIPS, pp. 1106\u20131114, 2012.\n[7] P. Niyogi, F. Girosi, and T. Poggio, \u201cIncorporating prior information in machine learning by creating virtual examples,\u201d in Proceedings of the IEEE, pp. 2196\u20132209, 1998.\n[8] Y. S. Abu-Mostafa, \u201cLearning from hints in neural networks,\u201d Journal of Complexity, vol. 6, pp. 
192\u2013198, June 1990.\n[9] V. N. Vapnik, Statistical learning theory. A Wiley-Interscience Publication, 1998.\n[10] I. Steinwart and A. Christmann, Support vector machines. Information Science and Statistics, New York: Springer, 2008.\n[11] B. Haasdonk, A. Vossen, and H. Burkhardt, \u201cInvariance in kernel methods by Haar-integration kernels,\u201d in SCIA, Springer, 2005.\n[12] P. L. Bartlett, M. I. Jordan, and J. D. McAuliffe, \u201cConvexity, classification, and risk bounds,\u201d Journal of the American Statistical Association, vol. 101, no. 473, pp. 138\u2013156, 2006.\n[13] G. Wahba, Spline models for observational data, vol. 59 of CBMS-NSF Regional Conference Series in Applied Mathematics. Philadelphia, PA: SIAM, 1990.\n[14] W. B. Johnson and J. Lindenstrauss, \u201cExtensions of Lipschitz mappings into a Hilbert space,\u201d Conference in modern analysis and probability, 1984.\n[15] A. Rahimi and B. Recht, \u201cWeighted sums of random kitchen sinks: Replacing minimization with randomization in learning,\u201d in NIPS, 2008.\n[16] A. Rahimi and B. Recht, \u201cUniform approximation of functions with random bases,\u201d in Proceedings of the 46th Annual Allerton Conference, 2008.\n[17] C. Williams and M. Seeger, \u201cUsing the Nystr\u00f6m method to speed up kernel machines,\u201d in NIPS, 2001.\n[18] F. R. Bach, \u201cOn the equivalence between quadrature rules and random features,\u201d CoRR, vol. abs/1502.06800, 2015.\n[19] C. Walder and O. Chapelle, \u201cLearning with transformation invariant kernels,\u201d in NIPS, 2007.\n[20] Y. Cho and L. K. Saul, \u201cKernel methods for deep learning,\u201d in NIPS, pp. 342\u2013350, 2009.\n[21] L. Bo, X. Ren, and D. Fox, \u201cKernel descriptors for visual recognition,\u201d in NIPS, 2010.\n[22] J. Mairal, P. Koniusz, Z. Harchaoui, and C. Schmid, \u201cConvolutional kernel networks,\u201d in NIPS, 2014.\n[23] A. Tacchetti, P. K. Mallapragada, M. 
Santoro, and L. Rosasco, \u201cGurls: a least squares library for supervised learning,\u201d CoRR, vol. abs/1303.0934, 2013.\n[24] S. Voinea, C. Zhang, G. Evangelopoulos, L. Rosasco, and T. Poggio, \u201cWord-level invariant representations from acoustic waveforms,\u201d vol. 14, pp. 3201\u20133205, September 2014.\n[25] M. Benzeghiba, R. De Mori, O. Deroo, S. Dupont, T. Erbes, D. Jouvet, L. Fissore, P. Laface, A. Mertins, C. Ris, R. Rose, V. Tyagi, and C. Wellekens, \u201cAutomatic speech recognition and speech variability: A review,\u201d Speech Communication, vol. 49, pp. 763\u2013786, 01 2007.", "award": [], "sourceid": 971, "authors": [{"given_name": "Youssef", "family_name": "Mroueh", "institution": "IBM"}, {"given_name": "Stephen", "family_name": "Voinea", "institution": "MIT"}, {"given_name": "Tomaso", "family_name": "Poggio", "institution": "MIT"}]}