{"title": "Correlated random features for fast semi-supervised learning", "book": "Advances in Neural Information Processing Systems", "page_first": 440, "page_last": 448, "abstract": "This paper presents Correlated Nystrom Views (XNV), a fast semi-supervised algorithm for regression and classification. The algorithm draws on two main ideas. First, it generates two views consisting of computationally inexpensive random features. Second, multiview regression, using Canonical Correlation Analysis (CCA) on unlabeled data, biases the regression towards useful features. It has been shown that CCA regression can substantially reduce variance with a minimal increase in bias if the views contains accurate estimators. Recent theoretical and empirical work shows that regression with random features closely approximates kernel regression, implying that the accuracy requirement holds for random views. We show that XNV consistently outperforms a state-of-the-art algorithm for semi-supervised learning: substantially improving predictive performance and reducing the variability of performance on a wide variety of real-world datasets, whilst also reducing runtime by orders of magnitude.", "full_text": "Correlated random features for\nfast semi-supervised learning\n\nBrian McWilliams\n\nETH Z\u00a8urich, Switzerland\n\nbrian.mcwilliams@inf.ethz.ch\n\nDavid Balduzzi\n\nETH Z\u00a8urich, Switzerland\n\ndavid.balduzzi@inf.ethz.ch\n\nJoachim M. Buhmann\nETH Z\u00a8urich, Switzerland\n\njbuhmann@inf.ethz.ch\n\nAbstract\n\nThis paper presents Correlated Nystr\u00a8om Views (XNV), a fast semi-supervised al-\ngorithm for regression and classi\ufb01cation. The algorithm draws on two main ideas.\nFirst, it generates two views consisting of computationally inexpensive random\nfeatures. Second, multiview regression, using Canonical Correlation Analysis\n(CCA) on unlabeled data, biases the regression towards useful features. 
It has been shown that CCA regression can substantially reduce variance with a minimal increase in bias if the views contain accurate estimators. Recent theoretical and empirical work shows that regression with random features closely approximates kernel regression, implying that the accuracy requirement holds for random views. We show that XNV consistently outperforms a state-of-the-art algorithm for semi-supervised learning: substantially improving predictive performance and reducing the variability of performance on a wide variety of real-world datasets, whilst also reducing runtime by orders of magnitude.

1 Introduction

As the volume of data collected in the social and natural sciences increases, the computational cost of learning from large datasets has become an important consideration. For learning non-linear relationships, kernel methods achieve excellent performance but naïvely require operations cubic in the number of training points.

Randomization has recently been considered as an alternative to optimization that, surprisingly, can yield comparable generalization performance at a fraction of the computational cost [1, 2]. Random features have been introduced to approximate kernel machines when the number of training examples is very large, rendering exact kernel computation intractable. Among several different approaches, the Nyström method for low-rank kernel approximation [1] exhibits good theoretical properties and empirical performance [3–5].

A second problem arising with large datasets concerns obtaining labels: often a domain expert must manually assign a label to each instance, which can be very expensive – requiring significant investments of both time and money – as the size of the dataset increases. 
Semi-supervised learning aims to improve prediction by extracting useful structure from the unlabeled data points and using this in conjunction with a function learned on a small number of labeled points.

Contribution. This paper proposes a new semi-supervised algorithm for regression and classification, Correlated Nyström Views (XNV), that addresses both problems simultaneously. The method consists of essentially two steps. First, we construct two "views" using random features. We investigate two ways of doing so: one based on the Nyström method and another based on random Fourier features (so-called kitchen sinks) [2, 6]. It turns out that the Nyström method almost always outperforms Fourier features by quite a large margin, so we only report these results in the main text.

The second step, following [7], uses Canonical Correlation Analysis (CCA, [8, 9]) to bias the optimization procedure towards features that are correlated across the views. Intuitively, if both views contain accurate estimators, then penalizing uncorrelated features reduces variance without increasing the bias by much. Recent theoretical work by Bach [5] shows that Nyström views can be expected to contain accurate estimators.

We perform an extensive evaluation of XNV on 18 real-world datasets, comparing against a modified version of the SSSL (simple semi-supervised learning) algorithm introduced in [10]. We find that XNV outperforms SSSL by around 10-15% on average, depending on the number of labeled points available, see §3. We also find that the performance of XNV exhibits dramatically less variability than SSSL, with a typical reduction of 30%.

We chose SSSL since it was shown in [10] to outperform a state-of-the-art algorithm, Laplacian Regularized Least Squares [11]. 
However, since SSSL does not scale up to large sets of unlabeled data, we modify SSSL by introducing a Nyström approximation to improve runtime performance. This reduces runtime by a factor of 1000 on N = 10,000 points, with further improvements as N increases. Our approximate version of SSSL outperforms kernel ridge regression (KRR) by > 50% on the 18 datasets on average, in line with the results reported in [10], suggesting that we lose little by replacing the exact SSSL with our approximate implementation.

Related work. Multiple view learning was first introduced in the co-training method of [12] and has also recently been extended to unsupervised settings [13, 14]. Our algorithm builds on an elegant proposal for multi-view regression introduced in [7]. Surprisingly, despite guaranteeing improved prediction performance under a relatively weak assumption on the views, CCA regression has not been widely used since its proposal – to the best of our knowledge this is the first empirical evaluation of multi-view regression's performance. A possible reason for this is the difficulty of obtaining naturally occurring data equipped with multiple views that can be shown to satisfy the multi-view assumption. We overcome this problem by constructing random views that satisfy the assumption by design.

2 Method

This section introduces XNV, our semi-supervised learning method. The method builds on two main ideas. First, given two equally useful but sufficiently different views on a dataset, penalizing regression using the canonical norm (computed via CCA) can substantially improve performance [7]. The second is the Nyström method for constructing random features [1], which we use to construct the views.

2.1 Multi-view regression

Suppose we have data $T = \{(x_1, y_1), \ldots, (x_n, y_n)\}$ for $x_i \in \mathbb{R}^D$ and $y_i \in \mathbb{R}$, sampled according to joint distribution $P(x, y)$. 
Further suppose we have two views on the data

$z^{(\nu)} : \mathbb{R}^D \to \mathcal{H}^{(\nu)} = \mathbb{R}^M : x \mapsto z^{(\nu)}(x) =: z^{(\nu)}$ for $\nu \in \{1, 2\}$.

We make the following assumption about linear regressors which can be learned on these views.

Assumption 1 (Multi-view assumption [7]). Define the mean-squared error loss function $\ell(g, x, y) = (g(x) - y)^2$ and let $\mathrm{loss}(g) := \mathbb{E}_P[\ell(g(x), y)]$. Further let $L(Z)$ denote the space of linear maps from a linear space $Z$ to the reals, and define:

$f^{(\nu)} := \operatorname{argmin}_{g \in L(\mathcal{H}^{(\nu)})} \mathrm{loss}(g)$ for $\nu \in \{1, 2\}$, and $f := \operatorname{argmin}_{g \in L(\mathcal{H}^{(1)} \oplus \mathcal{H}^{(2)})} \mathrm{loss}(g)$.

The multi-view assumption is that

$\mathrm{loss}(f^{(\nu)}) - \mathrm{loss}(f) \leq \epsilon$ for $\nu \in \{1, 2\}$.   (1)

In short, the best predictor in each view is within $\epsilon$ of the best overall predictor.

Canonical correlation analysis. Canonical correlation analysis [8, 9] extends principal component analysis (PCA) from one to two sets of variables. CCA finds bases for the two sets of variables such that the correlation between projections onto the bases is maximized.

The first pair of canonical basis vectors, $(b^{(1)}_1, b^{(2)}_1)$, is found by solving

$\operatorname{argmax}_{b^{(1)}, b^{(2)} \in \mathbb{R}^M} \mathrm{corr}(b^{(1)\top} z^{(1)}, b^{(2)\top} z^{(2)})$.   (2)

Subsequent pairs are found by maximizing correlations subject to being orthogonal to previously found pairs. The result of performing CCA is two sets of bases, $B^{(\nu)} = [b^{(\nu)}_1, \ldots, b^{(\nu)}_M]$ for $\nu \in \{1, 2\}$, such that the projection of $z^{(\nu)}$ onto $B^{(\nu)}$, which we denote $\bar{z}^{(\nu)}$, satisfies

1. Orthogonality: $\mathbb{E}_T[\bar{z}^{(\nu)}_j \bar{z}^{(\nu)}_k] = \delta_{jk}$, where $\delta_{jk}$ is the Kronecker delta, and
2. Correlation: $\mathbb{E}_T[\bar{z}^{(1)}_j \bar{z}^{(2)}_k] = \lambda_j \cdot \delta_{jk}$, where w.l.o.g. we assume $1 \geq \lambda_1 \geq \lambda_2 \geq \cdots \geq 0$.

Here $\lambda_j$ is referred to as the $j$th canonical correlation coefficient.

Definition 1 (canonical norm). Given a vector $\bar{z}^{(\nu)}$ in the canonical basis, define its canonical norm as

$\|\bar{z}^{(\nu)}\|_{\mathrm{CCA}} := \sqrt{\sum_{j=1}^{D} \frac{1 - \lambda_j}{\lambda_j} (\bar{z}^{(\nu)}_j)^2}$.

Canonical ridge regression. Assume we observe $n$ pairs of views coupled with real-valued labels $\{z^{(1)}_i, z^{(2)}_i, y_i\}_{i=1}^n$. Canonical ridge regression finds coefficients $\hat{\beta}^{(\nu)} = [\hat{\beta}^{(\nu)}_1, \ldots, \hat{\beta}^{(\nu)}_M]^\top$ such that

$\hat{\beta}^{(\nu)} := \operatorname{argmin}_{\beta^{(\nu)}} \frac{1}{n} \sum_{i=1}^{n} (y_i - \beta^{(\nu)\top} \bar{z}^{(\nu)}_i)^2 + \|\beta^{(\nu)}\|^2_{\mathrm{CCA}}$.   (3)

The resulting estimator, referred to as the canonical shrinkage estimator, is

$\hat{\beta}^{(\nu)}_j = \frac{\lambda_j}{n} \sum_{i=1}^{n} \bar{z}^{(\nu)}_{i,j} y_i$.   (4)

Penalizing with the canonical norm biases the optimization towards features that are highly correlated across the views. Good regressors exist in both views by Assumption 1. Thus, intuitively, penalizing uncorrelated features significantly reduces variance without increasing the bias by much. More formally:

Theorem 1 (canonical ridge regression, [7]). Assume $\mathbb{E}[y^2 | x] \leq 1$ and that Assumption 1 holds. Let $f^{(\nu)}_{\hat{\beta}}$ denote the estimator constructed with the canonical shrinkage estimator, Eq. (4), on training set $T$, and let $f$ denote the best linear predictor across both views. For $\nu \in \{1, 2\}$ we have

$\mathbb{E}_T[\mathrm{loss}(f^{(\nu)}_{\hat{\beta}})] - \mathrm{loss}(f) \leq 5\epsilon + \frac{1}{n} \sum_{j=1}^{M} \lambda^2_j$,

where the expectation is with respect to training sets $T$ sampled from $P(x, y)$.

The first term, $5\epsilon$, bounds the bias of the canonical estimator, whereas the second, $\frac{1}{n} \sum_j \lambda^2_j$, bounds the variance. 
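To make the construction concrete, the following is a minimal numpy sketch (not the authors' implementation) of CCA on two views followed by the canonical shrinkage estimator of Eq. (4). The synthetic views, sample sizes, and dimensions are hypothetical stand-ins for the random views constructed later.

```python
import numpy as np

def inv_sqrt(C, eps=1e-8):
    # symmetric inverse square root via eigendecomposition
    w, V = np.linalg.eigh(C)
    return V @ np.diag(1.0 / np.sqrt(np.maximum(w, eps))) @ V.T

def cca(Z1, Z2):
    """CCA bases B1, B2 and canonical correlations for two centered views."""
    N = Z1.shape[0]
    C11, C22 = Z1.T @ Z1 / N, Z2.T @ Z2 / N
    C12 = Z1.T @ Z2 / N
    W1, W2 = inv_sqrt(C11), inv_sqrt(C22)
    # SVD of the whitened cross-covariance gives the canonical directions
    U, lam, Vt = np.linalg.svd(W1 @ C12 @ W2)
    return W1 @ U, W2 @ Vt.T, np.clip(lam, 0.0, 1.0)

# toy data: two noisy views of a common signal (hypothetical stand-in
# for the Nystrom views of the paper)
rng = np.random.default_rng(0)
N, n, M = 2000, 50, 10
s = rng.standard_normal((N, M))
Z1 = s + 0.5 * rng.standard_normal((N, M))
Z2 = s + 0.5 * rng.standard_normal((N, M))
B1, B2, lam = cca(Z1 - Z1.mean(0), Z2 - Z2.mean(0))

# canonical shrinkage estimator (Eq. 4) on the n labeled points:
# beta_j = (lambda_j / n) * sum_i zbar_ij * y_i
y = Z1[:n] @ rng.standard_normal(M)          # synthetic labels
zbar = (Z1[:n] - Z1.mean(0)) @ B1            # labeled view in the canonical basis
beta = lam * (zbar.T @ y) / n
y_hat = zbar @ beta
```

Note how correlated directions (large $\lambda_j$) are shrunk only mildly, while weakly correlated directions are suppressed, with no extra tuning parameter.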
The quantity $\sum_j \lambda^2_j$ can be thought of as a measure of the "intrinsic dimensionality" of the unlabeled data, which controls the rate of convergence. If the canonical correlation coefficients decay sufficiently rapidly, then the increase in bias is more than made up for by the decrease in variance.

2.2 Constructing random views

We construct two views satisfying Assumption 1 in expectation, see Theorem 3 below. To ensure our method scales to large sets of unlabeled data, we use random features generated using the Nyström method [1].

Suppose we have data $\{x_i\}_{i=1}^N$. When $N$ is very large, constructing and manipulating the $N \times N$ Gram matrix $[K]_{ii'} = \langle \phi(x_i), \phi(x_{i'}) \rangle = \kappa(x_i, x_{i'})$ is computationally expensive. Here, $\phi(x)$ defines a mapping from $\mathbb{R}^D$ to a high-dimensional feature space and $\kappa(\cdot, \cdot)$ is a positive semi-definite kernel function.

The idea behind random features is to instead define a lower-dimensional mapping, $z(x_i) : \mathbb{R}^D \to \mathbb{R}^M$, through a random sampling scheme such that $[K]_{ii'} \approx z(x_i)^\top z(x_{i'})$ [6, 15]. Thus, using random features, non-linear functions in $x$ can be learned as linear functions in $z(x)$, leading to significant computational speed-ups. Here we give a brief overview of the Nyström method, which uses random subsampling to approximate the Gram matrix.

The Nyström method. Fix an $M \ll N$ and randomly (uniformly) sample a subset $\mathcal{M} = \{\hat{x}_i\}_{i=1}^M$ of $M$ points from the data $\{x_i\}_{i=1}^N$. Let $\hat{K}$ denote the Gram matrix with entries $[\hat{K}]_{ii'}$ for $i, i' \in \mathcal{M}$. The Nyström method [1, 3] constructs a low-rank approximation $K \approx \tilde{K}$ to the Gram matrix, entrywise, as

$[\tilde{K}]_{ii'} := [\kappa(x_i, \hat{x}_1), \ldots, \kappa(x_i, \hat{x}_M)] \, \hat{K}^{\dagger} \, [\kappa(x_{i'}, \hat{x}_1), \ldots, \kappa(x_{i'}, \hat{x}_M)]^\top$,   (5)

where $\hat{K}^{\dagger} \in \mathbb{R}^{M \times M}$ is the pseudo-inverse of $\hat{K}$. Vectors of random features can be constructed as

$z(x_i) = \hat{D}^{-1/2} \hat{V}^\top [\kappa(x_i, \hat{x}_1), \ldots, \kappa(x_i, \hat{x}_M)]^\top$,

where the columns of $\hat{V}$ are the eigenvectors of $\hat{K}$ and $\hat{D}$ is the diagonal matrix whose entries are the corresponding eigenvalues. Constructing features in this way reduces the time complexity of learning a non-linear prediction function from $O(N^3)$ to $O(N)$ [15].

An alternative perspective on the Nyström approximation, which will be useful below, is as follows. Consider the integral operators

$L_N[f](\cdot) := \frac{1}{N} \sum_{i=1}^{N} \kappa(x_i, \cdot) f(x_i)$ and $L_M[f](\cdot) := \frac{1}{M} \sum_{i=1}^{M} \kappa(\hat{x}_i, \cdot) f(\hat{x}_i)$,   (6)

and introduce the Hilbert space $\hat{\mathcal{H}} = \mathrm{span}\{\hat{\varphi}_1, \ldots, \hat{\varphi}_r\}$, where $r$ is the rank of $\hat{K}$ and the $\hat{\varphi}_i$ are the first $r$ eigenfunctions of $L_M$. The following proposition shows that using the Nyström approximation is equivalent to performing linear regression in the feature space ("view") $z : \mathcal{X} \to \hat{\mathcal{H}}$ spanned by the eigenfunctions of the linear operator $L_M$ in Eq. (6):

Proposition 2 (random Nyström view, [3]). Solving

$\min_{w \in \mathbb{R}^r} \frac{1}{N} \sum_{i=1}^{N} \ell(w^\top z(x_i), y_i) + \frac{\lambda}{2} \|w\|^2_2$   (7)

is equivalent to solving

$\min_{f \in \hat{\mathcal{H}}} \frac{1}{N} \sum_{i=1}^{N} \ell(f(x_i), y_i) + \frac{\lambda}{2} \|f\|^2_{\mathcal{H}_\kappa}$.   (8)

2.3 The proposed algorithm: Correlated Nyström Views (XNV)

Algorithm 1 details our approach to semi-supervised learning, based on generating two views consisting of Nyström random features and penalizing features which are weakly correlated across views. The setting is that we have labeled data $\{x_i, y_i\}_{i=1}^n$ and a large amount of unlabeled data $\{x_i\}_{i=n+1}^N$.

Step 1 generates a set of random features. The next two steps implement multi-view regression using the randomly generated views $z^{(1)}(x)$ and $z^{(2)}(x)$. Eq. 
(9) yields a solution for which unimportant features are heavily downweighted in the CCA basis without introducing an additional tuning parameter. The further penalty on the $\ell_2$ norm (in the CCA basis) is introduced as a practical measure to control the variance of the estimator $\hat{\beta}$, which can become large if there are many highly correlated features (i.e. the ratio $\frac{1 - \lambda_j}{\lambda_j} \approx 0$ for large $j$). In practice most of the shrinkage is due to the CCA norm: cross-validation obtains optimal values of $\lambda$ in the range $[0.00001, 0.1]$.

Algorithm 1 Correlated Nyström Views (XNV).
Input: Labeled data $\{x_i, y_i\}_{i=1}^n$ and unlabeled data $\{x_i\}_{i=n+1}^N$.
1: Generate features. Sample $\hat{x}_1, \ldots, \hat{x}_{2M}$ uniformly from the dataset, compute the eigendecompositions of the sub-sampled kernel matrices $\hat{K}^{(1)}$ and $\hat{K}^{(2)}$, which are constructed from the samples $1, \ldots, M$ and $M + 1, \ldots, 2M$ respectively, and featurize the input:
   $z^{(\nu)}(x_i) \leftarrow \hat{D}^{(\nu),-1/2} \hat{V}^{(\nu)\top} [\kappa(x_i, \hat{x}_1), \ldots, \kappa(x_i, \hat{x}_M)]^\top$ for $\nu \in \{1, 2\}$.
2: Unlabeled data. Compute the CCA bases $B^{(1)}, B^{(2)}$ and canonical correlations $\lambda_1, \ldots, \lambda_M$ for the two views and set $\bar{z}_i \leftarrow B^{(1)\top} z^{(1)}(x_i)$.
3: Labeled data. Solve
   $\hat{\beta} = \operatorname{argmin}_{\beta} \frac{1}{n} \sum_{i=1}^{n} \ell(\beta^\top \bar{z}_i, y_i) + \|\beta\|^2_{\mathrm{CCA}} + \lambda \|\beta\|^2_2$.   (9)
Output: $\hat{\beta}$

Computational complexity. XNV is extremely fast. Nyström sampling, step 1, reduces the $O(N^3)$ operations required for kernel learning to $O(N)$. Computing the CCA basis, step 2, using standard algorithms is in $O(NM^2)$; however, we reduce the runtime to $O(NM)$ by applying the recently proposed randomized CCA algorithm of [16]. Finally, step 3 is a computationally cheap regularized least-squares problem on $n$ samples and $M$ features.

Performance guarantees. 
The quality of the kernel approximation in (5) has been the subject of detailed study in recent years, leading to a number of strong empirical and theoretical results [3–5, 15]. Recent work of Bach [5] provides theoretical guarantees on the quality of Nyström estimates in the fixed design setting that are relevant to our approach.¹

Theorem 3 (Nyström generalization bound, [5]). Let $\xi \in \mathbb{R}^N$ be a random vector with finite variance and zero mean, let $y = [y_1, \ldots, y_N]^\top$, and define the smoothed estimate $\hat{y}^{\mathrm{kernel}} := (K + N\lambda I)^{-1} K (y + \xi)$ and the smoothed Nyström estimate $\hat{y}^{\mathrm{Nystrom}} := (\tilde{K} + N\lambda I)^{-1} \tilde{K} (y + \xi)$, both computed by minimizing the MSE with ridge penalty $\lambda$. Let $\eta \in (0, 1)$. For sufficiently large $M$ (depending on $\eta$, see [5]), we have

$\mathbb{E}_{\mathcal{M}} \mathbb{E}_{\xi} [\|y - \hat{y}^{\mathrm{Nystrom}}\|^2_2] \leq (1 + 4\eta) \cdot \mathbb{E}_{\xi} [\|y - \hat{y}^{\mathrm{kernel}}\|^2_2]$,

where $\mathbb{E}_{\mathcal{M}}$ refers to the expectation over the subsampled columns used to construct $\tilde{K}$.

In short, the best smoothed estimators in the Nyström views are close to the optimal smoothed estimator. Since the kernel estimate is consistent, $\mathrm{loss}(f) \to 0$ as $n \to \infty$. Thus, Assumption 1 holds in expectation and the generalization performance of XNV is controlled by Theorem 1.

Random Fourier features. An alternative approach to constructing random views is to use Fourier features instead of Nyström features in Step 1. We refer to this approach as Correlated Kitchen Sinks (XKS) after [2]. It turns out that the performance of XKS is consistently worse than that of XNV, in line with the detailed comparison presented in [3]. 
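For reference, the random Fourier feature map underlying kitchen sinks [2, 6] can be sketched as follows. This is a minimal numpy illustration for the Gaussian kernel; the bandwidth and feature count are hypothetical choices, not settings from the paper.

```python
import numpy as np

def fourier_features(X, M, sigma, rng):
    """Random Fourier features z(x) whose inner products approximate the
    Gaussian kernel k(x, x') = exp(-||x - x'||^2 / (2 sigma^2))."""
    D = X.shape[1]
    W = rng.standard_normal((M, D)) / sigma   # frequencies from the kernel's spectral density
    b = rng.uniform(0.0, 2.0 * np.pi, M)      # random phases
    return np.sqrt(2.0 / M) * np.cos(X @ W.T + b)

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
Z = fourier_features(X, M=4000, sigma=2.0, rng=rng)

# inner products of the features approximate the exact Gaussian kernel
K_approx = Z @ Z.T
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K_exact = np.exp(-sq_dists / (2.0 * 2.0 ** 2))
max_err = np.abs(K_approx - K_exact).max()
```

Unlike the Nyström features, which are adapted to the sampled data via the eigendecomposition of the subsampled Gram matrix, these features are entirely data-independent.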
We therefore do not discuss Fourier features in the main text; see §SI.3 for details on implementation and experimental results.

¹Extending to a random design requires techniques from [17].

Table 1: Datasets used for evaluation.

Set  Name          Task  N       D
1    abalone²      C     2,089   6
2    adult²        C     32,561  14
3    ailerons⁴     R     7,154   40
4    bank8⁴        C     4,096   8
5    bank32⁴       C     4,096   32
6    cal housing⁴  R     10,320  8
7    census²       R     18,186  119
8    CPU²          R     6,554   21
9    CT²           R     30,000  385
10   elevators⁴    R     8,752   18
11   HIVa³         C     21,339  1,617
12   house⁴        R     11,392  16
13   ibn Sina³     C     10,361  92
14   orange³       C     25,000  230
15   sarcos 1⁵     R     44,484  21
16   sarcos 5⁵     R     44,484  21
17   sarcos 7⁵     R     44,484  21
18   sylva³        C     72,626  216

2.4 A fast approximation to SSSL

The SSSL (simple semi-supervised learning) algorithm proposed in [10] finds the first $s$ eigenfunctions $\varphi_i$ of the integral operator $L_N$ in Eq. (6) and then solves

$\operatorname{argmin}_{w \in \mathbb{R}^s} \sum_{i=1}^{n} \left( \sum_{j=1}^{s} w_j \varphi_j(x_i) - y_i \right)^2$,   (10)

where $s$ is set by the user. SSSL outperforms Laplacian Regularized Least Squares [11], a state-of-the-art semi-supervised learning method, see [10]. It also has good generalization guarantees under reasonable assumptions on the distribution of eigenvalues of $L_N$. However, since SSSL requires computing the full $N \times N$ Gram matrix, it is extremely computationally intensive for large $N$. Moreover, tuning $s$ is difficult since it is discrete.

We therefore propose SSSLM, an approximation to SSSL. First, instead of constructing the full Gram matrix, we construct a Nyström approximation by sampling $M$ points from the labeled and unlabeled training set. 
Second, instead of thresholding eigenfunctions, we use the easier-to-tune ridge penalty, which penalizes directions proportionally to the inverse square of their eigenvalues [18]. As justification, note that Proposition 2 states that the Nyström approximation to kernel regression actually solves a ridge regression problem in the span of the eigenfunctions of $\hat{L}_M$. As $M$ increases, the span of $\hat{L}_M$ tends towards that of $L_N$ [15]. We will also refer to the Nyström approximation to SSSL using $2M$ features as SSSL2M. See the experiments below for further discussion of the quality of the approximation.

3 Experiments

Setup. We evaluate the performance of XNV on 18 real-world datasets, see Table 1. The datasets cover a variety of regression (denoted by R) and two-class classification (C) problems. The sarcos dataset involves predicting the joint position of a robot arm; following convention we report results on the 1st, 5th and 7th joint positions.

The SSSL algorithm was shown to exhibit state-of-the-art performance over fully and semi-supervised methods in scenarios where few labeled training examples are available [10]. However, as discussed in §2.4, due to its computational cost we compare the performance of XNV to the Nyström approximations SSSLM and SSSL2M.

We used a Gaussian kernel for all datasets. We set the kernel width $\sigma$ and the $\ell_2$ regularisation strength $\lambda$ for each method using 5-fold cross validation with 1000 labeled training examples. We trained all methods using a squared error loss function, $\ell(f(x_i), y_i) = (f(x_i) - y_i)^2$, with $M = 200$ random features, and $n = 100, 150, 200, . . . 
, 1000 randomly selected training examples.

²Taken from the UCI repository http://archive.ics.uci.edu/ml/datasets.html
³Taken from http://www.causality.inf.ethz.ch/activelearning.php
⁴Taken from http://www.dcc.fc.up.pt/~ltorgo/Regression/DataSets.html
⁵Taken from http://www.gaussianprocess.org/gpml/data/

Runtime performance. The SSSL algorithm of [10] is not computationally feasible on large datasets, since it has time complexity $O(N^3)$. For illustrative purposes, we report run times⁶ in seconds of the SSSL algorithm against SSSL2M and XNV on three datasets of different sizes.

runtimes   bank8   cal housing   sylva
SSSL       72s     2300s         -
SSSL2M     0.3s    0.6s          24s
XNV        0.9s    1.3s          26s

For the cal housing dataset, XNV exhibits an almost 1800× speed up over SSSL. For the largest dataset, sylva, exact SSSL is computationally intractable. Importantly, the computational overhead of XNV over SSSL2M is small.

Generalization performance. We report the prediction performance averaged over 100 experiments. For regression tasks we report the mean squared error (MSE) on the testing set normalized by the variance of the test output. For classification tasks we report the percentage of the test set that was misclassified.

The table below shows the improvement in performance of XNV over SSSLM and SSSL2M (taking whichever performs better out of M or 2M on each dataset), averaged over all 18 datasets. 
Observe that XNV is considerably more accurate and more robust than SSSLM.

XNV vs SSSLM/2M            n = 100   n = 200   n = 300   n = 400   n = 500
Avg reduction in error     11%       16%       15%       12%       9%
Avg reduction in std err   15%       30%       31%       33%       30%

The reduced variability is to be expected from Theorem 1.

Figure 1: Comparison of mean prediction error and standard deviation on a selection of datasets: (a) adult, (b) cal housing, (c) census, (d) elevators, (e) ibn Sina, (f) sarcos 5. Each panel plots prediction error for SSSLM, SSSL2M and XNV against the number of labeled training points (100 to 1000).

Table 2 presents a more detailed comparison of performance for individual datasets when n = 200, 400. The plots in Figure 1 show a representative comparison of mean prediction errors for several datasets when n = 100, . . . , 1000. Error bars represent one standard deviation. Observe that XNV almost always improves prediction accuracy and reduces variance compared with SSSLM and SSSL2M when the labeled training set contains between 100 and 500 labeled points. A complete set of results is provided in §SI.1.

Discussion of SSSLM. Our experiments show that going from M to 2M features does not improve generalization performance in practice. This suggests that when there are few labeled points, obtaining a more accurate estimate of the eigenfunctions of the kernel does not necessarily improve predictive performance. Indeed, when more random features are added, stronger regularization is required to reduce the influence of uninformative features; this also has the effect of downweighting informative features. This suggests that the low rank approximation SSSLM to SSSL suffices.

⁶Computed in Matlab 7.14 on a Core i5 with 4GB memory.

Finally, §SI.2 compares the performance of SSSLM and XNV to fully supervised kernel ridge regression (KRR). We observe dramatic improvements, between 48% and 63%, consistent with the results observed in [10] for the exact SSSL algorithm.

Random Fourier features. Nyström features significantly outperform Fourier features, in line with observations in [3]. 
The table below shows the relative improvement of XNV over XKS:

XNV vs XKS                 n = 100   n = 200   n = 300   n = 400   n = 500
Avg reduction in error     30%       28%       26%       25%       24%
Avg reduction in std err   36%       44%       34%       37%       36%

Further results and discussion for XKS are included in the supplementary material.

Table 2: Performance (normalized MSE/classification error rate). Standard errors in parentheses.

n = 200
set  SSSLM          SSSL2M         XNV            set  SSSLM          SSSL2M         XNV
1    0.054 (0.005)  0.055 (0.006)  0.053 (0.004)  10   0.309 (0.059)  0.358 (0.077)  0.226 (0.020)
2    0.198 (0.014)  0.184 (0.010)  0.175 (0.010)  11   0.146 (0.048)  0.072 (0.024)  0.036 (0.001)
3    0.218 (0.016)  0.231 (0.020)  0.213 (0.016)  12   0.761 (0.075)  0.787 (0.091)  0.792 (0.100)
4    0.558 (0.027)  0.567 (0.029)  0.561 (0.030)  13   0.109 (0.017)  0.109 (0.017)  0.068 (0.010)
5    0.058 (0.004)  0.060 (0.005)  0.055 (0.003)  14   0.019 (0.001)  0.019 (0.001)  0.019 (0.000)
6    0.567 (0.081)  0.634 (0.103)  0.459 (0.045)  15   0.076 (0.008)  0.078 (0.009)  0.071 (0.006)
7    0.020 (0.012)  0.022 (0.014)  0.019 (0.005)  16   0.172 (0.032)  0.192 (0.036)  0.119 (0.014)
8    0.395 (0.395)  0.463 (0.414)  0.263 (0.352)  17   0.041 (0.004)  0.043 (0.005)  0.040 (0.004)
9    0.437 (0.096)  0.367 (0.060)  0.222 (0.015)  18   0.036 (0.007)  0.039 (0.007)  0.028 (0.009)

n = 400
set  SSSLM          SSSL2M         XNV            set  SSSLM          SSSL2M         XNV
1    0.051 (0.003)  0.052 (0.003)  0.050 (0.002)  10   0.218 (0.022)  0.233 (0.027)  0.192 (0.010)
2    0.177 (0.008)  0.172 (0.006)  0.167 (0.005)  11   0.051 (0.009)  0.122 (0.031)  0.036 (0.001)
3    0.199 (0.011)  0.209 (0.013)  0.193 (0.010)  12   0.691 (0.040)  0.701 (0.051)  0.709 (0.058)
4    0.517 (0.018)  0.527 (0.019)  0.510 (0.016)  13   0.070 (0.009)  0.072 (0.008)  0.054 (0.004)
5    0.050 (0.003)  0.051 (0.003)  0.050 (0.002)  14   0.019 (0.001)  0.019 (0.001)  0.019 (0.000)
6    0.513 (0.055)  0.555 (0.063)  0.432 (0.036)  15   0.059 (0.004)  0.060 (0.005)  0.057 (0.003)
7    0.019 (0.010)  0.021 (0.012)  0.014 (0.003)  16   0.105 (0.014)  0.106 (0.014)  0.090 (0.007)
8    0.209 (0.171)  0.286 (0.248)  0.110 (0.107)  17   0.032 (0.002)  0.033 (0.003)  0.032 (0.002)
9    0.249 (0.024)  0.304 (0.037)  0.201 (0.013)  18   0.029 (0.006)  0.032 (0.005)  0.023 (0.006)

4 Conclusion

We have introduced the XNV algorithm for semi-supervised learning. By combining two randomly generated views of Nyström features via an efficient implementation of CCA, XNV outperforms the prior state-of-the-art, SSSL, by 10-15% (depending on the number of labeled points) on average over 18 datasets. Furthermore, XNV is over 3 orders of magnitude faster than SSSL on medium sized datasets (N = 10,000), with further gains as N increases. An interesting research direction is to investigate using the recently developed deep CCA algorithm, which extracts higher order correlations between views [19], as a preprocessing step.

In this work we use a uniform sampling scheme for the Nyström method for computational reasons, since it has been shown to perform well empirically relative to more expensive schemes [20]. 
Since CCA gives us a criterion by which to measure the importance of random features, in the future we aim to investigate active sampling schemes based on canonical correlations, which may yield better performance by selecting the most informative indices to sample.

Acknowledgements. We thank Haim Avron for help with implementing randomized CCA and Patrick Pletscher for drawing our attention to the Nyström method.

References
[1] Williams C, Seeger M: Using the Nyström method to speed up kernel machines. In NIPS 2001.
[2] Rahimi A, Recht B: Weighted sums of random kitchen sinks: Replacing minimization with randomization in learning. In NIPS 2008.
[3] Yang T, Li YF, Mahdavi M, Jin R, Zhou ZH: Nyström method vs random Fourier features: A theoretical and empirical comparison. In NIPS 2012.
[4] Gittens A, Mahoney MW: Revisiting the Nyström method for improved large-scale machine learning. In ICML 2013.
[5] Bach F: Sharp analysis of low-rank kernel approximations. In COLT 2013.
[6] Rahimi A, Recht B: Random features for large-scale kernel machines. In NIPS 2007.
[7] Kakade S, Foster DP: Multi-view regression via canonical correlation analysis. In COLT 2007.
[8] Hotelling H: Relations between two sets of variates. Biometrika 1936, 28:312–377.
[9] Hardoon DR, Szedmak S, Shawe-Taylor J: Canonical correlation analysis: An overview with application to learning methods. Neural Computation 2004, 16(12):2639–2664.
[10] Ji M, Yang T, Lin B, Jin R, Han J: A simple algorithm for semi-supervised learning with improved generalization error bound. In ICML 2012.
[11] Belkin M, Niyogi P, Sindhwani V: Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. 
JMLR 2006, 7:2399–2434.
[12] Blum A, Mitchell T: Combining labeled and unlabeled data with co-training. In COLT 1998.
[13] Chaudhuri K, Kakade SM, Livescu K, Sridharan K: Multiview clustering via canonical correlation analysis. In ICML 2009.
[14] McWilliams B, Montana G: Multi-view predictive partitioning in high dimensions. Statistical Analysis and Data Mining 2012, 5:304–321.
[15] Drineas P, Mahoney MW: On the Nyström method for approximating a Gram matrix for improved kernel-based learning. JMLR 2005, 6:2153–2175.
[16] Avron H, Boutsidis C, Toledo S, Zouzias A: Efficient dimensionality reduction for canonical correlation analysis. In ICML 2013.
[17] Hsu D, Kakade S, Zhang T: An analysis of random design linear regression. In COLT 2012.
[18] Dhillon PS, Foster DP, Kakade SM, Ungar LH: A risk comparison of ordinary least squares vs ridge regression. JMLR 2013, 14:1505–1511.
[19] Andrew G, Arora R, Bilmes J, Livescu K: Deep canonical correlation analysis. In ICML 2013.
[20] Kumar S, Mohri M, Talwalkar A: Sampling methods for the Nyström method. JMLR 2012, 13:981–1006.", "award": [], "sourceid": 289, "authors": [{"given_name": "Brian", "family_name": "McWilliams", "institution": "ETH Zurich"}, {"given_name": "David", "family_name": "Balduzzi", "institution": "ETH Zurich"}, {"given_name": "Joachim", "family_name": "Buhmann", "institution": "ETH Zurich"}]}