{"title": "Solving Interpretable Kernel Dimensionality Reduction", "book": "Advances in Neural Information Processing Systems", "page_first": 7915, "page_last": 7925, "abstract": "Kernel dimensionality reduction (KDR) algorithms find a low dimensional representation of the original data by optimizing kernel dependency measures that are capable of capturing nonlinear relationships. The standard strategy is to first map the data into a high dimensional feature space using kernels prior to a projection onto a low dimensional space. While KDR methods can be easily solved by keeping the most dominant eigenvectors of the kernel matrix, its features are no longer easy to interpret. Alternatively, Interpretable KDR (IKDR) is different in that it projects onto a subspace \\textit{before} the kernel feature mapping, therefore, the projection matrix can indicate how the original features linearly combine to form the new features. Unfortunately, the IKDR objective requires a non-convex manifold optimization that is difficult to solve and can no longer be solved by eigendecomposition. Recently, an efficient iterative spectral (eigendecomposition) method (ISM) has been proposed for this objective in the context of alternative clustering. However, ISM only provides theoretical guarantees for the Gaussian kernel. This greatly constrains ISM's usage since any kernel method using ISM is now limited to a single kernel. This work extends the theoretical guarantees of ISM to an entire family of kernels, thereby empowering ISM to solve any kernel method of the same objective. In identifying this family, we prove that each kernel within the family has a surrogate $\\Phi$ matrix and the optimal projection is formed by its most dominant eigenvectors. With this extension, we establish how a wide range of IKDR applications across different learning paradigms can be solved by ISM. 
To support reproducible results, the source code is made publicly available on \\url{https://github.com/ANONYMIZED}.", "full_text": "Solving Interpretable Kernel Dimension Reduction\n\nChieh Wu, Jared Miller, Yale Chang, Mario Sznaier, and Jennifer Dy\n\nElectrical and Computer Engineering Dept., Northeastern University, Boston, MA\n\nAbstract\n\nKernel dimensionality reduction (KDR) algorithms find a low dimensional representation of the original data by optimizing kernel dependency measures that are capable of capturing nonlinear relationships. The standard strategy is to first map the data into a high dimensional feature space using kernels prior to a projection onto a low dimensional space. While KDR methods can be easily solved by keeping the most dominant eigenvectors of the kernel matrix, their features are no longer easy to interpret. Alternatively, Interpretable KDR (IKDR) differs in that it projects onto a subspace before the kernel feature mapping; therefore, the projection matrix can indicate how the original features linearly combine to form the new features. Unfortunately, the IKDR objective requires a non-convex manifold optimization that is difficult to solve and can no longer be solved by eigendecomposition. Recently, an efficient iterative spectral (eigendecomposition) method (ISM) has been proposed for this objective in the context of alternative clustering. However, ISM only provides theoretical guarantees for the Gaussian kernel. This greatly constrains ISM's usage, since any kernel method using ISM is limited to a single kernel. This work extends the theoretical guarantees of ISM to an entire family of kernels, thereby empowering ISM to solve any kernel method of the same objective. In identifying this family, we prove that each kernel within the family has a surrogate Φ matrix and that the optimal projection is formed by its most dominant eigenvectors.
With this extension, we establish how a wide range of IKDR applications across different learning paradigms can be solved by ISM. To support reproducible results, the source code is made publicly available on https://github.com/chieh-neu/ISM_supervised_DR.\n\n1 Introduction\n\nThe most important information for a given dataset often lies in a low dimensional space [1; 2; 3; 4; 5; 6; 7; 8; 9; 10; 11; 12; 13; 14]. Due to the ability of kernel dependence measures to capture both linear and nonlinear relationships, they are powerful criteria for nonlinear DR [15; 16]. The standard approach is to first map the data into a high dimensional feature space prior to a projection onto a low dimensional space [17]. This approach has been preferred because it captures the nonlinear relationship with an established solution, i.e., the most dominant eigenvectors of the kernel matrix. However, since the high dimensional feature space maps the original features nonlinearly, it is no longer interpretable. Alternatively, if the projection onto a subspace precedes the feature mapping, the projection matrix can be obtained to inform how the original features linearly combine to form the new features. Exploiting this insight, many formulations have leveraged kernel alignment or the Hilbert-Schmidt Independence Criterion (HSIC) [18] to model this approach [19; 8; 16; 20; 15; 1; 21; 22; 23; 12]. Together, we refer to these approaches as Interpretable Kernel Dimension Reduction (IKDR). Unfortunately, this formulation can no longer be solved via eigendecomposition; instead, it becomes a highly non-convex manifold optimization that is computationally expensive.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nNumerous approaches have been proposed to solve this complex objective.
With its orthogonality constraint, it is a form of optimization on a manifold: i.e., the constraint can be modeled geometrically as a Stiefel or Grassmann manifold [24; 25; 26]. In earlier work, Boumal and Absil [27] propose to recast a similar problem on the Grassmann manifold and then apply first and second-order Riemannian trust-region methods to solve it. Theis et al. [28] employ a trust-region method for minimizing the cost function on the Stiefel manifold. Wen and Yin [29] later propose to unfold the Stiefel manifold into a flat plane and optimize on the flattened representation. While the manifold approaches perform well on smaller datasets, they quickly become inefficient as the dimension or sample size increases, which poses a serious challenge for larger modern problems. Besides manifold approaches, Niu et al. [30] propose Dimension Growth (DG), which performs gradient descent via a greedy algorithm one column at a time. By keeping the descent direction of the current column orthogonal to all previously discovered columns, DG ensures constraint compliance.\n\nThe approaches discussed thus far have remained inefficient. Recently, Wu et al. [19] proposed the Iterative Spectral Method (ISM) for alternative clustering, where their experiments on a dataset of 600 samples showed that DG took almost 2 days while ISM finished in under 2 seconds with a lower objective cost. Moreover, ISM retains the ability to use eigendecomposition to solve IKDR. Instead of finding the eigenvectors of kernel matrices, ISM uses a small surrogate matrix Φ to replace the kernel matrix, thereby allowing for a much faster eigendecomposition. Yet, ISM is not without its limitations.
Since ISM's theoretical guarantees are specific to the Gaussian kernel, repurposing ISM for other kernel methods becomes impractical, i.e., a kernel method restricted to a single kernel significantly limits its flexibility and representational power.\n\nIn this paper, we expand ISM's theoretical guarantees to an entire family of kernels, thereby realizing ISM's potential for a wide range of applications. Within this family, each kernel is associated with a matrix Φ whose most dominant eigenvectors form the solution. Here, Φ matrices replace the concept of kernels to serve as an interchangeable component of any applicable kernel method. We further extend the family to kernels that are conic combinations of the ISM family of kernels. Here, we prove that any conic combination of kernels within the family also has an associated Φ matrix, constructed using the respective conic combination of Φs.\n\nEmpowered by extending ISM's theoretical guarantees to other kernels, we present ISM as a solution to IKDR problems across several learning paradigms, including supervised DR [8; 16; 20], unsupervised DR [15; 1], semi-supervised DR [21; 22], and alternative clustering [19; 23; 30].
Indeed, we demonstrate how many of these applications can be reformulated into an identical optimization objective which ISM solves, implying a significant role for ISM that has been previously unknown.\n\nOur Contributions.\n• We generalize the theoretical guarantees of ISM to an entire family of kernels and propose the necessary criteria for a kernel to be included in this family.\n• We generalize ISM to conic combinations of kernels from the ISM family.\n• We establish that ISM can be used to solve general classes of IKDR learning paradigms.\n• We present experimental evidence to highlight the generalization of ISM to a wide range of learning paradigms under a family of kernels and demonstrate its efficiency in terms of speed and better accuracy compared to competing methods.\n\n2 A General Form for Interpretable Kernel Dimension Reduction\n\nLet X ∈ R^{n×d} be a dataset of n samples with d features and let Y ∈ R^{n×k} be the corresponding labels, where k denotes the number of classes. Let κ_X(·,·) and κ_Y(·,·) be two kernel functions that apply respectively to X and Y to construct kernel matrices K_X ∈ R^{n×n} and K_Y ∈ R^{n×n}. Also let H ∈ R^{n×n} be the centering matrix H = I − (1/n) 1_n 1_n^T, with I the identity matrix and 1_n a vector of 1s. HSIC measures the nonlinear dependence between X and Y; its empirical estimate is H(X, Y) = (1/(n−1)^2) Tr(H K_X H K_Y), with H(X, Y) = 0 denoting complete independence and H(X, Y) ≫ 0 high dependence [18]. Additional background regarding HSIC is provided in Appendix N.\n\nA general IKDR problem can be posed as discovering a subspace W ∈ R^{d×q} such that H(XW, Y) is maximized. Since W induces a reduction of dimension, we can assume that q < d. To prevent an unbounded solution, the subspace W is constrained such that W^T W = I.
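The empirical HSIC estimate above can be written in a few lines of NumPy (our own sketch; the function name `empirical_hsic` is hypothetical and not from the paper's released code):

```python
import numpy as np

def empirical_hsic(K_x, K_y):
    """Empirical HSIC: Tr(H K_X H K_Y) / (n - 1)^2, with H the centering matrix."""
    n = K_x.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n      # centering matrix H = I - (1/n) 1 1^T
    return np.trace(H @ K_x @ H @ K_y) / (n - 1) ** 2
```

Note that H annihilates constant kernel matrices, so the estimate is exactly 0 when one of the two kernels carries no information, and it is non-negative whenever both kernel matrices are PSD.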
Since this formulation has a wide range of applications across different learning paradigms, our work investigates the commonality of these problems and discovers that various IKDR learning paradigms can be expressed as the following optimization problem:\n\nmax_W Tr(Γ K_XW) s.t. W^T W = I, (1)\n\nwhere Γ is a symmetric matrix commonly derived from K_Y. Although this objective is shared among many IKDR problems, the highly non-convex objective continues to pose a serious challenge. Therefore, the realization of ISM's ability to solve Eq. (1) impacts many applications. Here, we provide several examples of this connection.\n\nSupervised Dimension Reduction. In supervised DR [16; 20], both the data X and the label Y are known. We wish to discover a low dimensional subspace W such that XW is maximally dependent on Y in the nonlinear high dimensional feature space. This problem can be cast as maximizing the HSIC between XW and Y, where we maximize Tr(K_XW H K_Y H). Since H K_Y H includes only known variables, it can be treated as a constant Γ = H K_Y H. Eq. (1) is obtained by rotating the trace terms and constraining W to W^T W = I.\n\nUnsupervised Dimension Reduction. Niu et al. [1] introduced a DR algorithm for spectral clustering based on an HSIC formulation. In unsupervised DR, we also discover a low dimensional subspace W such that XW is maximally dependent on Y. Therefore, the objective here is actually identical to the supervised objective Tr(K_XW H K_Y H), except that since Y is unknown here, both W and Y need to be learned. By setting K_Y = Y Y^T, this problem can be solved by alternating maximization between Y and W. When W is fixed, the problem reduces to spectral clustering [30] and Y can be solved via eigendecomposition as shown in Niu et al. [1]. When Y is fixed, the objective becomes the supervised formulation previously discussed.\n\nSemi-Supervised Dimension Reduction.
In semi-supervised DR clustering problems [22], scores Ŷ ∈ R^{n×r} are provided by subject experts for each sample. It is assumed that if two samples are similar, their scores should also be similar. In this case, the objective is to cluster the data given some supervised guidance from the experts. The clustering portion can be accomplished by spectral clustering [31], and HSIC can capture the supervised expert knowledge. By simultaneously maximizing the clustering quality of spectral clustering and the HSIC between the data and the expert scores, this problem is formulated as\n\nmax_{W,Y} Tr(Y^T L_W Y) + μ Tr(K_XW H K_Ŷ H), (2)\ns.t. L_W = D^{−1/2} K_XW D^{−1/2}, W^T W = I, Y^T Y = I, (3)\n\nwhere μ is a constant that balances the importance of the first and second terms of the objective, and D ∈ R^{n×n} is the degree matrix, a diagonal matrix with diagonal elements defined as D_diag = K_XW 1_n. Similar to the unsupervised DR problem, this objective is solved by alternating optimization of Y and W. Since the second term does not include Y, when W is fixed, the objective reduces to spectral clustering. By initializing W to an identity matrix, Y is initialized to the solution of spectral clustering. When Y is fixed, W can be solved by isolating K_XW. If we let Ψ = H K_Ŷ H and Ω = D^{−1/2} Y Y^T D^{−1/2}, maximizing Eq. (2) is equivalent to maximizing Tr[(Ω + μΨ) K_XW] subject to W^T W = I. At this point, it is easy to see that by setting Γ = Ω + μΨ, the problem is again equivalent to Eq. (1).\n\nAlternative Clustering. In alternative clustering [30], a set of labels Ŷ ∈ R^{n×k} is provided as the original clustering labels. The objective of alternative clustering is to discover an alternative set of labels that is high in clustering quality while different from the original labels.
In a way, this is a form of semi-supervised learning. Instead of having extra information about the clusters we desire, the supervision here indicates what we wish to avoid. Therefore, this problem can be formulated almost identically to the semi-supervised problem as\n\nmax_{W,Y} Tr(Y^T L_W Y) − μ Tr(K_XW H K_Ŷ H), (4)\ns.t. L_W = D^{−1/2} K_XW D^{−1/2}, W^T W = I, Y^T Y = I. (5)\n\nGiven that the only difference here is a sign change before the second term, this problem can be solved identically to the semi-supervised DR problem, and the sub-problem of maximizing over W when Y is fixed can be reduced to Eq. (1).\n\n3 Extending the Theoretical Guarantees to a Family of Kernels\n\nThe ISM algorithm. The ISM algorithm, as proposed by Wu et al. [19], solves Eq. (1) by setting the q most dominant eigenvectors of a special matrix Φ as its solution W; we define Φ in a later section. We denote these eigenvectors in our context as V_max and their eigenvalues as Λ. Since the Φ derived by Wu et al. [19] is a function of W, the new W is used to construct the next Φ, whose V_max is again set as the next W. This process iterates until the change in Λ between iterations falls below a predefined threshold δ. To initialize the first W, a 2nd order Taylor expansion is used to approximate Φ, which yields a matrix Φ_0 that is independent of W. We supply extra detail in Appendix Q and its pseudo-code in Algorithm 1.\n\nAlgorithm 1 ISM Algorithm\nInput: Data X, kernel, subspace dimension q\nOutput: Projected subspace W\nInitialization: Initialize Φ_0 using Table 2. Set W_0 to V_max of Φ_0.\nwhile ||Λ_i − Λ_{i−1}||_2 / ||Λ_i||_2 > δ do\n  Compute Φ using Table 3\n  Set W_k to V_max of Φ\nend\n\nExtending ISM Algorithm.
Unfortunately, the theoretical foundation of ISM is specifically tailored to the Gaussian kernel. Since the proof relies heavily on the exponential structure of the Gaussian function, extending the algorithm to other kernels seems unlikely. However, we discovered that there exists a family of kernels where each kernel possesses its own distinct pair of Φ/Φ_0 matrices. From our proof, we discovered a general formulation of Φ/Φ_0 for any kernel within the family. Moreover, since the only change is the Φ/Φ_0 pair, the ISM algorithm holds by simply substituting the appropriate Φ/Φ_0 matrices based on the kernel. We have derived several examples of Φ_0/Φ in Tables 2 and 3 and supplied the derivation for each kernel in Appendices B and C.\n\nTo clarify the notation of Tables 2 and 3: given a matrix Ψ, we define D_Ψ and L_Ψ respectively as the degree matrix and the Laplacian of Ψ, where D_Ψ = Diag(Ψ 1_n) and L_Ψ = D_Ψ − Ψ. Here, Diag is a function that places the elements of a vector on the diagonal of a zero square matrix. While K_XW is the kernel computed from XW, we denote by K_XW,p specifically a polynomial kernel of order p. We also denote by ⊙ the Hadamard product between matrices.\n\nΦ for Common Kernels. After deriving Φ/Φ_0 pairs for the most common kernels, we note several recurrent characteristics. First, Φ scales with the dimension d instead of the size of the data n. Since n ≫ d is common across many datasets, the eigendecomposition performed on Φ ∈ R^{d×d} can be significantly faster while requiring less memory. Second, following Eq.
(6), they are highly efficient to compute, since a vectorized formulation of Φ can be derived for each kernel as shown in Table 3; commonly, they reduce to a dot product between a Laplacian matrix L and the data matrices X. The occurrence of the Laplacian is particularly surprising, since nothing in Eq. (1) suggests this relationship. Third, observe from Table 3 that Φ can be expressed as X^T Ω X, where Ω is a positive semi-definite (PSD) matrix. Since the formulation of Φ without Ω is the covariance matrix X^T X, Ω adjusts the covariance matrix by incorporating both the kernel and the label information. In addition, by applying the Cholesky decomposition on Ω to rewrite Φ as (X^T L)(L^T X), L becomes a matrix that adjusts the data itself. Therefore, IKDR can be interpreted as applying PCA on the adjusted data L^T X, where the kernel and label information is included.\n\nTable 1: Converting common kernels to f(β).\nKernel Name | f(β) | a(x_i, x_j) | b(x_i, x_j)\nLinear | β | x_i | x_j\nSquared | β | x_i − x_j | x_i − x_j\nPolynomial | (β + c)^p | x_i | x_j\nGaussian | e^{−β/(2σ^2)} | x_i − x_j | x_i − x_j\nMultiquadratic | √(β + c^2) | x_i − x_j | x_i − x_j\n\nTable 2: Equations for the approximate Φs for the common kernels.\nKernel | Approximation of Φs\nLinear | Φ_0 = X^T Γ X\nSquared | Φ_0 = X^T L_Γ X\nPolynomial | Φ_0 = X^T Γ X\nGaussian | Φ_0 = −X^T L_Γ X\nMultiquadratic | Φ_0 = X^T L_Γ X\n\nTable 3: Equations for Φs for the common kernels.\nKernel | Φ Equations\nLinear | Φ = X^T Γ X\nSquared | Φ = X^T L_Γ X\nPolynomial | Φ = X^T Ψ X, Ψ = Γ ⊙ K_XW,p−1\nGaussian | Φ = −X^T L_Ψ X, Ψ = Γ ⊙ K_XW\nMultiquadratic | Φ = X^T L_Ψ X, Ψ = Γ ⊙ K_XW^{(−1)}\n\nExtending the ISM Theoretical
Guarantees. The main theorem in Wu et al. [19] proves that a fixed point W* of Algorithm 1 is a local maximum of Eq. (1) only if the Gaussian kernel is used. Our work extends the theorem to a family of kernels which we refer to as the ISM family. Here, we supply the theoretical foundation for this claim by first providing the following definition.\n\nDefinition 1. Given β = a(x_i, x_j)^T W W^T b(x_i, x_j), with a(x_i, x_j) and b(x_i, x_j) functions of x_i and x_j, any twice differentiable kernel that can be written in terms of f(β) while retaining its symmetric positive semi-definite property is an ISM kernel belonging to the ISM family, with an associated Φ matrix defined as\n\nΦ = (1/2) Σ_{i,j} Γ_{i,j} f′(β) A_{i,j}, (6)\n\nwhere A_{i,j} = b(x_i, x_j) a(x_i, x_j)^T + a(x_i, x_j) b(x_i, x_j)^T.\n\nSince the equations for different kernels vary vastly, it is not clear how they can be reformulated into a single structure that simultaneously satisfies all ISM guarantees. Definition 1 is the key realization that unites a set of kernels into a family. Under this definition, we prove later in Theorem 1 that the Gaussian kernel within the original proof of ISM can be replaced by f(β). Therefore, the ISM guarantees simultaneously extend to any kernel that satisfies Definition 1. As a result, a general Φ for any ISM kernel can be derived as shown in Eq. (6). Moreover, the family of potential kernels is not limited to a finite set of known kernels; instead, it extends to any conic combination of ISM kernels. We prove in Appendix O the following proposition.\n\nProposition 1. Any conic combination of ISM kernels is still an ISM kernel.\n\nProperties of Φ. Since each ISM kernel is coupled with its own Φ matrix, Φs can conceptually replace kernels. Recall that the V_max of Φ for any kernel in the ISM family is the local maximum of Eq. (1).
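To make the Table 3 entry for the Gaussian kernel concrete, the following sketch (our own code with hypothetical names, not the authors' implementation) computes Φ = −X^T L_Ψ X with Ψ = Γ ⊙ K_XW; for the Gaussian kernel this agrees with Eq. (6) up to the positive factor 1/(2σ²), which does not affect the eigenvectors:

```python
import numpy as np

def phi_gaussian(X, W, Gamma, sigma=1.0):
    """Surrogate matrix for the Gaussian kernel (Table 3):
    Phi = -X^T L_Psi X, with Psi = Gamma ⊙ K_XW and L_Psi = D_Psi - Psi."""
    XW = X @ W
    sq = np.sum(XW ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * XW @ XW.T   # pairwise ||x_i W - x_j W||^2
    K = np.exp(-d2 / (2 * sigma ** 2))               # Gaussian kernel on the projected data
    Psi = Gamma * K                                  # Hadamard product
    L = np.diag(Psi.sum(axis=1)) - Psi               # Laplacian of Psi
    return -X.T @ L @ X                              # d x d surrogate matrix
```

For a symmetric Γ the result is a symmetric d × d matrix, so its eigendecomposition is well defined and cheap when d ≪ n.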
This central property is established in the following two theorems.\n\nTheorem 1. Given a full rank Φ with an eigengap as defined by Eq. (80) in Appendix D, a fixed point W* of Algorithm 1 satisfies the 2nd Order Necessary Conditions (Theorem 12.5 [32]) for Eq. (1) using any ISM kernel.\n\nTheorem 2. A sequence of subspaces {W_k W_k^T}_{k∈N} generated by Algorithm 1 contains a convergent subsequence.\n\nSince the entire ISM proof along with its convergence guarantee must be revised and generalized under Definition 1, we leave the details to Appendices D and P, presenting here only the main conclusions. Functionally, our proof is separated into two lemmas that establish Φ as a kernel surrogate. Lemma 1 concludes that given any Φ of an ISM kernel, the gradient of the Lagrangian for Eq. (1) is equivalent to −ΦW − WΛ. Therefore, when the gradient is set to 0, the eigenvectors of Φ are equivalent to the stationary points of Eq. (1). For Lemma 2, given Λ̄ as the eigenvalues associated with the eigenvectors not chosen and C as a constant, it concludes that the 2nd order necessary condition is satisfied when (min_i Λ̄_i − max_j Λ_j) ≥ C. This inequality indicates the necessity for the smallest eigenvalue among the un-chosen eigenvectors to be greater than the maximum eigenvalue of the chosen ones by at least C. Therefore, given the choice of q eigenvectors, the q smallest eigenvalues will maximize the gap. This is equivalent to finding the most dominant eigenvectors of Φ. Putting both lemmas together, we conclude that the most dominant eigenvectors of any Φ within the ISM family are the solution to Eq. (1).\n\nInitializing W with Φ_0. After generalizing ISM, different Φs may or may not be a function of W. When Φ is not a function of W, the V_max of Φ is immediately the solution. However, if Φ is a
However, if \u03a6 is a\n\n5\n\n\ffunction of W , \u03a6 iteratively updates from the previous W . This process is initialized using a \u03a60 that\nis independent of W . To obtain \u03a60, ISM approximates the Gaussian kernel up to the 2nd order of\nthe Taylor series around \u03b2 = 0 and discovers that the approximation of \u03a6 is independent of W . Our\nwork leverages De\ufb01nition 1 and proves that a common formulation for \u03a60 is possible. We formalize\nour \ufb01nding in the following theorem and provided the proof in Appendix F.\nTheorem 3. For any kernel within the ISM family, a \u03a6 independent of W can be approximated with\n\n(cid:88)\n\n\u03a6 \u2248 sign(\u2207\u03b2f (0))\n\n\u0393i,jAi,j.\n\n(7)\n\ni,j\n\nExtending ISM to Conic Combination of Kernels. The two lemmas of Theorem 1 highlights the\nconceptual convenience of working with \u03a6 in place of kernels. This conceptual replacement extends\neven to conic combinations of ISM kernels. As a corollary to Theorem 1, we discovered that when a\nkernel is constructed through a conic combination of ISM kernels, it also has an associated \u03a6 matrix.\nRemarkably, it is equivalent to the conic combination of \u03a6s from individual kernels using the same\ncoef\ufb01cients. Formally, we propose the following corollary with its proof in Appendix M.\nCorollary 1. The \u03a6 matrix associated with a conic combination of kernels is the conic combination\nof \u03a6s associated with each individual kernel.\nComplexity analysis. Let t be the number of iterations required for convergence and n (cid:29) d, ISM\u2019s\ntime complexity is dominated by the dot product between L \u2208 Rn\u00d7n and X \u2208 Rn\u00d7d. Together\nISM has a time complexity of O(n2dt); a signi\ufb01cant improvement from DG O(n2dq2t), or SM\nat O(n2dqt). ISM is also faster since t is signi\ufb01cantly smaller. While t ranges from hundreds to\nthousands for competing algorithms, ISM normally converges at t < 5. 
In terms of memory, ISM faces similar challenges as all kernel methods, where the memory complexity is upper bounded at O(n^2).\n\n4 Experiments\n\nDatasets. The experiments include 5 real datasets of commonly encountered data types. Wine [33] consists of continuous data, while the Cancer dataset [34] features are discrete. The Face dataset [35] is a standard dataset used for alternative clustering; it includes images of 20 people in various poses. The MNIST [36] dataset includes images of handwritten characters. The Face and MNIST datasets are chosen to highlight ISM's ability to handle images. The Flower image by Alain Nicolas [37] is another dataset chosen for alternative clustering, where we seek alternative ways to perform image segmentation. For more in-depth details on each dataset, see Appendix J.\n\nExperimental Setup. We showcase ISM's efficacy on three different learning paradigms, i.e., supervised dimension reduction [20], unsupervised clustering [1], and semi-supervised alternative clustering [22]. As an optimization technique, we compare ISM in Table 4 against competing state-of-the-art manifold optimization algorithms: Dimension Growth (DG) [30], the Stiefel Manifold approach (SM) [29], and the Grassmann Manifold approach (GM) [27; 38]. To emphasize the ISM family of kernels, supervised and unsupervised results using several less conventional kernels are included in Table 5. Within this table, we also investigate using a conic combination of Φs by combining the Gaussian and polynomial kernels with center alignment [39]. Since center alignment is specific to supervised cases, this is not repeated for the unsupervised case.\n\nFor supervised dimension reduction, we perform SVM on XW using 10-fold cross validation. For each of the 10-fold experiments, we trained W and the SVM classifier only on the training set while reporting the result only on the test set, i.e., the test set was never used during the training.
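The protocol above, where both W and the classifier see only the training folds, can be sketched with scikit-learn (our own illustration; `learn_W` is a hypothetical placeholder for any subspace learner such as ISM):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC

def supervised_dr_cv(X, y, learn_W, q, n_splits=10):
    """k-fold CV where W and the SVM are fit strictly on the training folds."""
    accs = []
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
    for tr, te in skf.split(X, y):
        W = learn_W(X[tr], y[tr], q)               # subspace from training data only
        clf = SVC().fit(X[tr] @ W, y[tr])          # classifier on projected training data
        accs.append(clf.score(X[te] @ W, y[te]))   # evaluate on the held-out fold
    return float(np.mean(accs)), float(np.std(accs))
```

Fitting W inside each fold, rather than once on the full data, is what keeps the reported test accuracy free of information leakage.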
We repeat this process for each fold of cross-validation. From the 10-fold results in Table 4, we record the mean and standard deviation of the run-time, cost, and accuracy. We investigate scalability in Figure 1b by comparing the change in run-time as we increase the sample size. For unsupervised dimension reduction, we perform spectral clustering on XW after learning W, and record the run-time, cost, and NMI. For alternative clustering, we highlight the ISM family of kernels by reproducing the original ISM results (generated with the Gaussian kernel) using the polynomial kernel. On the Flower image, each sample is a vector in R^3. We supply the original image segmentation result as semi-supervised labels and learn an alternative way to segment the image. The original segmentation and the alternative segmentation are shown in Figure 1a. For the Face dataset, each sample is a vector obtained by vectorizing a grayscale image of an individual. We provide the identity of the individuals as the original clustering label and search for an alternative way to cluster the data.\n\nEvaluation Metric. In the supervised case, the test classification accuracy from the 10-fold cross validation is recorded along with the cost and run-time. The time is broken down into days (d), hours (h), minutes (m), and seconds (s). The best results are bold for each experiment. In the unsupervised case, we report the Normalized Mutual Information (NMI) [40] to compare the clustering labels against the ground truth. For details on how NMI is computed, see Appendix L.\n\nExperiment Settings. The median of the pairwise Euclidean distances is used as σ for all experiments using the Gaussian kernel. A degree of 3 is used for all polynomial kernels. The dimension of the subspace q is set to the number of classes/clusters. The convergence threshold δ is set to 0.01. All competing algorithms use their default initialization.
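The σ setting above, the median of all pairwise Euclidean distances, is a standard heuristic; in our own sketch:

```python
import numpy as np

def median_sigma(X):
    """Median pairwise Euclidean distance, used as sigma for the Gaussian kernel."""
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2 * X @ X.T, 0.0)  # squared distances
    iu = np.triu_indices(X.shape[0], k=1)                          # distinct pairs only
    return float(np.median(np.sqrt(d2[iu])))
```

For example, with only the two points (0, 0) and (3, 4), the single pairwise distance is 5, so the heuristic returns 5.0.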
All datasets are centered to 0 and scaled to a standard deviation of 1. All sources are written in Python using NumPy and Scikit-learn [41; 42]. All experiments were conducted on dual Intel Xeon E5-2680 v2 @ 2.80GHz, with 20 total cores. Due to limited computational resources, each run is limited to 3 days.\n\nComplexity Analysis of Competing Methods. The run-time as a function of linearly increasing sample size is shown for the polynomial kernel in Figure 1b. Since the complexity analysis for ISM suggests a relationship of O(n^2) with respect to the sample size, log2(·) is used for the Y-axis. As expected, ISM's linear run-time growth in Figure 1b supports our analysis of the O(n^2) relationship. The plots for competing algorithms report a similar linear relationship with comparable slopes. This indicates that the difference in speed is not a function of the data size, but of other factors such as q and t. Using DG's complexity of O(n^2 d q^2 t) as an example, it normally converges when t is in the range of thousands. Since q = 20 was used in the figure, the significant speed improvement of ISM can be attributed to the q^2 t factor, since ISM generally converges at t below 5.\n\nResults. Comparing against other optimization algorithms in Table 4, the results confirm ISM as a significantly faster algorithm that consistently achieves a lower cost. This disparity is especially prominent when the subspace dimension q is higher. We highlight that for the Face dataset with the Gaussian kernel, DG took 1.92 days while ISM finished within 0.99 seconds: a 10^5-fold speed difference. To further confirm these advantages, the same experiment is repeated using the polynomial kernel, where similar results can be observed. Besides the execution time and cost, the classification accuracy across the 5 datasets never falls below 95% in the supervised setting. The same datasets and techniques are repeated in an unsupervised clustering problem.
While the clustering quality is comparable across the datasets, ISM clearly produces the lowest cost with the fastest execution time.
Table 5 focuses on the generalization of ISM to a family of kernels. Since Table 4 already supplied results for the Gaussian and polynomial kernels, we feature 4 more kernels to support the claim. Just as kernel methods treat kernels as interchangeable components of the algorithm, ISM achieves a similar effect by replacing the Φ matrix. As evidenced by the table, similar accuracy and run-time can be achieved with this replacement without affecting the rest of the algorithm. In many cases, the multiquadratic kernel outperforms even the Gaussian and the polynomial kernel. In a similar spirit, we repeated the same experiments in the unsupervised case and obtained further confirmation.
To support Corollary 1, results using a Gaussian + polynomial (G+P) kernel are also supplied in Table 5. It is not surprising that a combination of Φs is the best-performing kernel: since the union of the two kernels covers a larger feature space, the expressiveness is also greater. This result supports the claim that a conic combination of Φs can replace the same combination of kernels in Eq. (1).
To study the generalized ISM on a (semi-supervised) alternative clustering problem, we use it to recreate the results from the original paper on alternative clustering. We emphasize that our results differ in the choice of the polynomial kernel instead of the Gaussian. In the Flower experiment, it is visually clear that the original image segmentation into 2 clusters (separated by black and white) is completely different from the alternative segmentation. For the Face data, the original clusters were grouped by the identity of the individuals, while the algorithm produced 4 alternative clusters.
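To make the G+P conic combination discussed above concrete, the Gram matrix of alpha*K_gauss + beta*K_poly (with alpha, beta >= 0) can be sketched as follows. The coefficient values, polynomial offset c, and function names are illustrative assumptions, not the paper's implementation; Corollary 1 states that the corresponding Φ matrices combine in exactly the same way.

```python
import math

def gaussian_k(x, y, sigma=1.0):
    """Gaussian kernel k(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    return math.exp(-sum((a - b) ** 2 for a, b in zip(x, y)) / (2.0 * sigma ** 2))

def poly_k(x, y, degree=3, c=1.0):
    """Polynomial kernel (x.y + c)^degree; degree 3 as in the experiments."""
    return (sum(a * b for a, b in zip(x, y)) + c) ** degree

def gp_gram(X, alpha=1.0, beta=1.0, sigma=1.0):
    """Gram matrix of the G+P kernel: a conic (non-negative) combination
    of the Gaussian and polynomial kernels, itself a valid kernel."""
    n = len(X)
    return [[alpha * gaussian_k(X[i], X[j], sigma) + beta * poly_k(X[i], X[j])
             for j in range(n)] for i in range(n)]
```

Because each summand is a positive semi-definite kernel and the weights are non-negative, the combined Gram matrix remains symmetric positive semi-definite.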
By averaging the images of each alternative cluster, the new clustering pattern can be seen visually in Figure 1a; the samples are alternatively clustered by pose.
By applying ISM to 3 different learning paradigms, we showcase ISM as an extremely fast optimization algorithm that can solve a wide range of IKDR problems, thereby drawing a deeper connection between these domains. Hence, the impact of generalizing ISM to other kernels is also conveniently translated to these applications.

[Figure 1: (a) Reproducing results from the original ISM paper using polynomial kernels. (b) Log2 run-time as a function of increasing samples.]

[Table 4 appears here: run-time, cost, and accuracy (supervised) or NMI (unsupervised), reported as mean ± standard deviation over the 10 folds, for ISM, DG, SM, and GM under the Gaussian and polynomial kernels on the Wine, Cancer, Face, and MNIST datasets; runs exceeding the 3-day limit are marked > 3d / N/A.]

Table 4: Run-time, cost, and objective performance are recorded under supervised/unsupervised objectives. ISM is significantly faster compared to other optimization techniques while achieving a lower objective cost.

[Table 5 appears here: run-time and accuracy (supervised) or NMI (unsupervised) for the Linear, Squared, Multiquadratic, and G+P kernels on the Wine, Cancer, Face, and MNIST datasets.]

Table 5: Run-time and objective performance are recorded across several kernels within the ISM family. It confirms the usage of Φ or a linear combination of Φ in place of kernels.

5 Conclusion

We have extended the theoretical guarantees of ISM to a family of kernels beyond the Gaussian kernel via the discovery of the Φ matrix. Our theoretical analysis proves that the family of ISM kernels extends even to conic combinations of ISM kernels. With this extension, ISM becomes an efficient solution for a wide range of supervised, unsupervised, and semi-supervised applications.
Our experimental results confirm the efficiency of the algorithm while showcasing its wide impact across many domains.

Acknowledgments

We would like to acknowledge support for this project from NSF grant IIS-1546428. We would also like to thank Zulqarnain Khan for his insightful discussions.

References

[1] Donglin Niu, Jennifer Dy, and Michael Jordan. Dimensionality reduction for spectral clustering. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 552-560, 2011.

[2] Taiji Suzuki and Masashi Sugiyama. Sufficient dimension reduction via squared-loss mutual information estimation. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 804-811, 2010.

[3] Ian Jolliffe. Principal component analysis. Springer, 2011.

[4] Sam T Roweis and Lawrence K Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323-2326, 2000.

[5] Florian Wickelmaier. An introduction to MDS. Sound Quality Research Unit, Aalborg University, Denmark, 46(5):1-26, 2003.

[6] Joshua B Tenenbaum, Vin De Silva, and John C Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319-2323, 2000.

[7] Nandakishore Kambhatla and Todd K Leen. Dimension reduction by local principal component analysis. Neural Computation, 9(7):1493-1516, 1997.

[8] Kenji Fukumizu, Francis R Bach, Michael I Jordan, et al. Kernel dimension reduction in regression. The Annals of Statistics, 37(4):1871-1905, 2009.

[9] Le Song, Arthur Gretton, Karsten Borgwardt, and Alex J Smola. Colored maximum variance unfolding. In Advances in Neural Information Processing Systems, pages 1385-1392, 2008.

[10] Le Song, Alex Smola, Arthur Gretton, Karsten M Borgwardt, and Justin Bedo. Supervised feature selection via dependence estimation.
In Proceedings of the 24th International Conference on Machine Learning, pages 823-830. ACM, 2007.

[11] Max Vladymyrov and Miguel Á Carreira-Perpiñán. Locally linear landmarks for large-scale manifold learning. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 256-271. Springer, 2013.

[12] Le Song, Alex Smola, Arthur Gretton, Justin Bedo, and Karsten Borgwardt. Feature selection via dependence maximization. Journal of Machine Learning Research, 13(May):1393-1434, 2012.

[13] Mikhail Belkin and Partha Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6):1373-1396, 2003.

[14] Ehsan Elhamifar and Rene Vidal. Sparse subspace clustering: Algorithm, theory, and applications. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(11):2765-2781, 2013.

[15] Bernhard Schölkopf, Alexander Smola, and Klaus-Robert Müller. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10(5):1299-1319, 1998.

[16] Elnaz Barshan, Ali Ghodsi, Zohreh Azimifar, and Mansoor Zolghadri Jahromi. Supervised principal component analysis: Visualization, classification and regression on subspaces and submanifolds. Pattern Recognition, 44(7):1357-1371, 2011.

[17] Bernhard Schölkopf, Alexander Smola, and Klaus-Robert Müller. Kernel principal component analysis. In International Conference on Artificial Neural Networks, pages 583-588. Springer, 1997.

[18] Arthur Gretton, Olivier Bousquet, Alex Smola, and Bernhard Schölkopf. Measuring statistical dependence with Hilbert-Schmidt norms. In International Conference on Algorithmic Learning Theory, pages 63-77. Springer, 2005.

[19] Chieh Wu, Stratis Ioannidis, Mario Sznaier, Xiangyu Li, David Kaeli, and Jennifer Dy. Iterative spectral method for alternative clustering.
In International Conference on Artificial Intelligence and Statistics, pages 115-123, 2018.

[20] Mahdokht Masaeli, Jennifer G Dy, and Glenn M Fung. From transformation-based dimensionality reduction to feature selection. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 751-758, 2010.

[21] Mehrdad J Gangeh, Safaa MA Bedawi, Ali Ghodsi, and Fakhri Karray. Semi-supervised dictionary learning based on Hilbert-Schmidt independence criterion. In International Conference on Image Analysis and Recognition, pages 12-19. Springer, 2016.

[22] Yale Chang, Junxiang Chen, Michael H Cho, Peter J Castaldi, Edwin K Silverman, and Jennifer G Dy. Clustering with domain-specific usefulness scores. In Proceedings of the 2017 SIAM International Conference on Data Mining, pages 207-215. SIAM, 2017.

[23] Donglin Niu, Jennifer G Dy, and Michael I Jordan. Multiple non-redundant spectral clustering views. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 831-838, 2010.

[24] Ioan Mackenzie James. The topology of Stiefel manifolds, volume 24. Cambridge University Press, 1976.

[25] Yasunori Nishimori and Shotaro Akaho. Learning algorithms utilizing quasi-geodesic flows on the Stiefel manifold. Neurocomputing, 67:106-135, 2005.

[26] Alan Edelman, Tomás A Arias, and Steven T Smith. The geometry of algorithms with orthogonality constraints. SIAM Journal on Matrix Analysis and Applications, 20(2):303-353, 1998.

[27] Nicolas Boumal and Pierre-Antoine Absil. RTRMC: A Riemannian trust-region method for low-rank matrix completion. In Advances in Neural Information Processing Systems, pages 406-414, 2011.

[28] Fabian J Theis, Thomas P Cason, and P-A Absil. Soft dimension reduction for ICA by joint diagonalization on the Stiefel manifold.
In International Conference on Independent Component Analysis and Signal Separation, pages 354-361. Springer, 2009.

[29] Zaiwen Wen and Wotao Yin. A feasible method for optimization with orthogonality constraints. Mathematical Programming, 142(1-2):397-434, 2013.

[30] Donglin Niu, Jennifer G Dy, and Michael I Jordan. Iterative discovery of multiple alternative clustering views. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(7):1340-1353, 2014.

[31] Ulrike Von Luxburg. A tutorial on spectral clustering. Statistics and Computing, 17(4):395-416, 2007.

[32] Stephen Wright and Jorge Nocedal. Numerical optimization. Springer Science, 35:67-68, 1999.

[33] Dua Dheeru and Efi Karra Taniskidou. UCI machine learning repository, 2017. URL http://archive.ics.uci.edu/ml.

[34] William H Wolberg. Wisconsin breast cancer dataset. University of Wisconsin Hospitals, 1992.

[35] Stephen D Bay, Dennis Kibler, Michael J Pazzani, and Padhraic Smyth. The UCI KDD archive of large data sets for data mining research and experimentation. ACM SIGKDD Explorations Newsletter, 2(2):81-85, 2000.

[36] Li Deng. The MNIST database of handwritten digit images for machine learning research [best of the web]. IEEE Signal Processing Magazine, 29(6):141-142, 2012.

[37] Particles of tessellations. http://en.tessellations-nicolas.com/. Accessed: 2017-04-25.

[38] N. Boumal, B. Mishra, P.-A. Absil, and R. Sepulchre. Manopt, a Matlab toolbox for optimization on manifolds. Journal of Machine Learning Research, 15:1455-1459, 2014. URL http://www.manopt.org.

[39] Corinna Cortes, Mehryar Mohri, and Afshin Rostamizadeh. Algorithms for learning kernels based on centered alignment. Journal of Machine Learning Research, 13(Mar):795-828, 2012.

[40] Alexander Strehl and Joydeep Ghosh. Cluster ensembles: a knowledge reuse framework for combining multiple partitions.
Journal of Machine Learning Research, 3(Dec):583-617, 2002.

[41] Eric Jones, Travis Oliphant, Pearu Peterson, et al. SciPy: Open source scientific tools for Python, 2001-. URL http://www.scipy.org/. [Online; accessed ].

[42] Lars Buitinck, Gilles Louppe, Mathieu Blondel, Fabian Pedregosa, Andreas Mueller, Olivier Grisel, Vlad Niculae, Peter Prettenhofer, Alexandre Gramfort, Jaques Grobler, Robert Layton, Jake VanderPlas, Arnaud Joly, Brian Holt, and Gaël Varoquaux. API design for machine learning software: experiences from the scikit-learn project. In ECML PKDD Workshop: Languages for Data Mining and Machine Learning, pages 108-122, 2013.

[43] Arthur Gretton, Karsten M Borgwardt, Malte J Rasch, Bernhard Schölkopf, and Alexander Smola. A kernel two-sample test. Journal of Machine Learning Research, 13(Mar):723-773, 2012.

[44] Andrew V Knyazev and Peizhen Zhu. Principal angles between subspaces and their tangents. 2012.