{"title": "Kernel Hyperalignment", "book": "Advances in Neural Information Processing Systems", "page_first": 1790, "page_last": 1798, "abstract": "We offer a regularized, kernel extension of the multi-set, orthogonal Procrustes problem, or hyperalignment. Our new method, called Kernel Hyperalignment, expands the scope of hyperalignment to include nonlinear measures of similarity and enables the alignment of multiple datasets with a large number of base features. With direct application to fMRI data analysis, kernel hyperalignment is well-suited for multi-subject alignment of large ROIs, including the entire cortex. We conducted experiments using real-world, multi-subject fMRI data.", "full_text": "Kernel Hyperalignment\n\nAlexander Lorbert & Peter J. Ramadge\n\nDepartment of Electrical Engineering\n\nPrinceton University\n\nAbstract\n\nWe offer a regularized, kernel extension of the multi-set, orthogonal Procrustes\nproblem, or hyperalignment. Our new method, called Kernel Hyperalignment,\nexpands the scope of hyperalignment to include nonlinear measures of similar-\nity and enables the alignment of multiple datasets with a large number of base\nfeatures. With direct application to fMRI data analysis, kernel hyperalignment is\nwell-suited for multi-subject alignment of large ROIs, including the entire cortex.\nWe report experiments using real-world, multi-subject fMRI data.\n\n1\n\nIntroduction\n\nOne of the goals of multi-set data analysis is forming qualitative comparisons between datasets. To\nthe extent that we can control and design experiments to facilitate these comparisons, we must \ufb01rst\nask whether the data are aligned. In its simplest form, the primary question of interest is whether\ncorresponding features among the datasets measure the same quantity. 
If yes, we say the data are aligned; if not, we must first perform an alignment of the data.
The alignment problem is crucial to multi-subject fMRI data analysis, which is the motivation for this work. An appreciable amount of effort is devoted to designing experiments that maintain the focus of a subject. This ensures temporal alignment across subjects for a common stimulus. However, with each subject exhibiting his/her own unique spatial response patterns, there is also a need for spatial alignment. Specifically, we want between-subject correspondence of voxel j at TR i (TR = repetition time). The typical approach taken is anatomical alignment [20], whereby anatomical landmarks are used to anchor spatial commonality across subjects. In linear algebra parlance, anatomical alignment is an affine transformation with 9 degrees of freedom.
Recently, Haxby et al. [9] proposed Hyperalignment, a function-based alignment procedure. Instead of a 9-parameter transformation, a higher-order, orthogonal transformation is derived from voxel time-series data. The underlying assumption of hyperalignment is that, for a fixed stimulus, a subject's time-series data will possess a common geometry. Accordingly, the role of alignment is to find isometric transformations of the per-subject trajectories traced out in voxel space so that the transformed time series best match each other. Using their method, the authors were able to achieve a between-subject classification accuracy on par with, and even greater than, within-subject accuracy.
Suppose that subject data are recorded in matrices X_1:m ∈ R^{t×n}. This could be data from an experiment involving m subjects, t TRs, and n voxels. We are interested in extending the regularized hyperalignment problem

    minimize  Σ_{i<j} ‖X_i R_i − X_j R_j‖²_F   subject to  R_i^T A_i R_i = I,  i = 1, . . . , m,   (1)

where the matrices A_1:m ∈ R^{n×n} are symmetric and positive definite. In general, the above problem manifests itself in many application areas.
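For the two-set case with A_1 = A_2 = I, problems of this type reduce to the classical orthogonal Procrustes problem, whose solution is closed-form via an SVD [19]. A minimal NumPy sketch (illustrative, not the authors' code; all names are ours):

```python
import numpy as np

def procrustes(X1, X2):
    """Closed-form orthogonal Procrustes: argmin_{R: R^T R = I} ||X1 @ R - X2||_F.

    The minimizer is U @ Vt, where U @ diag(s) @ Vt is the SVD of X1^T X2.
    """
    U, _, Vt = np.linalg.svd(X1.T @ X2)
    return U @ Vt

rng = np.random.default_rng(0)
X1 = rng.standard_normal((100, 20))                       # t = 100 "TRs", n = 20 "voxels"
R_true, _ = np.linalg.qr(rng.standard_normal((20, 20)))   # a random orthogonal map
X2 = X1 @ R_true                                          # second view: rotated copy of the first
R = procrustes(X1, X2)
print(np.allclose(X1 @ R, X2))                            # True: the rotation is recovered
```

Because X1 has full column rank here, X1^T X2 is nonsingular and the SVD-based polar factor recovers the planted rotation exactly.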
For example, when A_k = I we have hyperalignment. In the kernel setting, an arbitrary constraint R_i^T A_i R_i = I would lack any intuition. Therefore, we restrict A_i = αI + βΦ_i^TΦ_i, where α > 0 and β ≥ 0. As with regularized hyperalignment [22], when (α, β) = (1, 0) we obtain hyperalignment and when (α, β) ≈ (0, 1) we obtain a form of CCA.
Let K_i have eigen-decomposition V_iΛ_iV_i^T, where Λ_i = diag{λ_{i1}, . . . , λ_{it}}, or diag_j{λ_{ij}} for short. We introduce two symmetric, positive definite matrices:

    B_i = V_i diag_j{ 1/√(α + βλ_{ij}) } V_i^T   and   C_i = V_i diag_j{ (1/λ_{ij}) ( 1/√(α + βλ_{ij}) − 1/√α ) } V_i^T.

Lemma 3.1. For A_i = αI + βΦ_i^TΦ_i we have A_i^{−1/2} = (1/√α) I + Φ_i^T C_i Φ_i and Φ_i A_i^{−1/2} = B_iΦ_i.
We can use Lemma 3.1 to transform (7) into

    Q_i = arg min_{Q^TQ=I} ‖B_iΦ_iQ − Ψ‖²_F   or   Q_i = arg max_{Q^TQ=I} tr( Q^T Φ_i^T B_i [ (1/|A|) Σ_{j∈A} B_jΦ_j Q̂_j ] ),   (8)

where Q̂_j is the current estimate of Q_j. Solving for the matrix Q is still well beyond practical computation. The following lemma is the gateway for managing this problem.
Lemma 3.2. If Ũ ∈ St(N, d) and G̃ ∈ O(d), then Q̃ = I_N − Ũ(I_d − G̃)Ũ^T ∈ O(N).
Familiar applications of the above lemma include the identity matrix (G̃ = I_d) and Householder reflections (G̃ = −I_d). If G̃ is block diagonal with 2 × 2 blocks of Givens rotations, then the columns of Ũ, taken two at a time, span the two-dimensional planes of rotation [7]. We therefore refer to Ũ as the plane support matrix.
Lemma 3.2 can be interpreted as a lifting mechanism for identity deviations.
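Lemma 3.2 is easy to verify numerically: lifting any orthogonal G̃ through a Stiefel matrix Ũ yields an orthogonal Q̃ whose deviation from the identity has rank at most d. A small sketch (dimensions chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(1)
N, d = 12, 4

# U: a point on the Stiefel manifold St(N, d), i.e., orthonormal columns
U, _ = np.linalg.qr(rng.standard_normal((N, d)))

# G: a random d x d orthogonal matrix
G, _ = np.linalg.qr(rng.standard_normal((d, d)))

# Lemma 3.2: Q = I_N - U (I_d - G) U^T lies in O(N)
Q = np.eye(N) - U @ (np.eye(d) - G) @ U.T
print(np.allclose(Q.T @ Q, np.eye(N)))               # True: Q is orthogonal

# The deviation I_N - Q = U (I_d - G) U^T has rank at most d
print(np.linalg.matrix_rank(np.eye(N) - Q) <= d)     # True
```

Expanding Q^T Q, the cross terms (I − G^T) + (I − G) cancel exactly against (I − G^T)(I − G), which is the algebra behind the lemma.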
The difference I_d − G̃ represents an O(d) deviation from identity. Applying Ũ(I_d − G̃)Ũ^T = I_N − Q̃ "lifts" this difference to an O(N) deviation from identity. Reversing directions, we can also utilize Lemma 3.2 for compressing O(N): from I_N − Q̃ = Ũ(I_d − G̃)Ũ^T, the rank of the deviation I_N − Q̃ is upper bounded by d, producing a subset of O(N).
Motivated by Lemma 3.2 we impose

    Q_i = I_N − U(I − G_i)U^T,   (9)

where U ∈ St(N, r), G_i ∈ O(r), and 1 ≤ r ≤ N. Ideally, we want r small to benefit from a reduced dimension. As is typically the case when using kernel methods, leveraging the Representer Theorem shifts the dimensionality of the problem from the feature cardinality to the number of examples, i.e., r = mt. We pool all of the data, forming the mt × N matrix

    Φ_0 = [Φ_1^T Φ_2^T ··· Φ_m^T]^T,   (10)

and set U = Φ_0^T K_0^{−1/2} ∈ R^{N×r} with K_0 = Φ_0Φ_0^T assumed positive definite. As long as r ≤ N, the orthogonality constraint is met because (Φ_0^T K_0^{−1/2})^T (Φ_0^T K_0^{−1/2}) = K_0^{−1/2} K_0 K_0^{−1/2} = I_r.
Theorem 3.3 (Hyperalignment Representer Theorem). Within the set of global minimizers of (6) there exists a solution {R*_1, . . . , R*_m} = {A_1^{−1/2} Q*_1, . . . , A_m^{−1/2} Q*_m} that admits a representation Q*_i = I_N − U(I − G*_i)U^T, where U = Φ_0^T K_0^{−1/2} and G*_i ∈ O(mt) (i = 1, . . . , m).
Here St(N, d) ≜ {Z : Z ∈ R^{N×d}, Z^T Z = I_d} is the (N, d) Stiefel manifold (N ≥ d), and O(N) ≜ {Z : Z ∈ R^{N×N}, Z^T Z = I_N} is the orthogonal group of N × N matrices.

Algorithm 1: Regularized Hyperalignment
  Input: X_1:m ∈ R^{t×n}, A_1:m ∈ R^{n×n}
  Output: R_1:m ∈ R^{n×n}
  Initialize Q_1:m as identity (n × n); set X̃_i ← X_i A_i^{−1/2}
  foreach round do
    foreach subject/view i do
      A ← {1, 2, . . . , m} (sample mean) or {1, 2, . . . , m} \ {i} (LOO mean)
      Y ← (1/|A|) Σ_{j∈A} X̃_j Q_j
      [Ū, Σ̄, V̄] ← SVD(X̃_i^T Y)
      Q_i ← Ū V̄^T
    end
  end
  foreach subject/view i do R_i ← A_i^{−1/2} Q_i end

Algorithm 2: Regularized Kernel Hyperalignment
  Input: k̂(·, ·), α, β, X_1:m ∈ R^{t×n}
  Output: R_1:m, linear maps in feature space
  Initialize feature maps Φ_1, . . . , Φ_m ∈ R^{t×N}; plane support Φ_0 = [Φ_1^T Φ_2^T ··· Φ_m^T]^T; G_1:m ∈ R^{r×r} as identity (r = mt)
  foreach round do
    foreach subject/view i do
      A ← {1, 2, . . . , m} (sample mean) or {1, 2, . . . , m} \ {i} (LOO mean)
      Y ← (1/|A|) Σ_{j∈A} B̃_j G_j
      [Ū, Σ̄, V̄] ← SVD(B̃_i^T Y)
      G_i ← Ū V̄^T
    end
  end
  foreach subject/view i do
    Q_i ← I − Φ_0^T K_0^{−1/2} (I_r − G_i) K_0^{−1/2} Φ_0
    R_i ← A_i^{−1/2} Q_i
  end

When mt is large enough that evaluating an SVD of numerous mt × mt matrices is prohibitive, we can first perform a PCA-like reduction. Let K_0 have eigen-decomposition V_0 Λ_0 V_0^T, where the nonnegative diagonal entries of Λ_0 are sorted in decreasing order.
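Algorithm 1 can be sketched in a few lines of NumPy. The sketch below fixes A_i = I (plain hyperalignment), uses the LOO-mean template, and runs a fixed number of rounds; these are all illustrative choices, not the authors' implementation:

```python
import numpy as np

def hyperalign(X, rounds=20):
    """Alternating solution in the spirit of Algorithm 1, with A_i = I.

    X: list of m arrays, each t x n. Each inner step is an orthogonal
    Procrustes problem against the leave-one-out mean template Y.
    """
    m, n = len(X), X[0].shape[1]
    Q = [np.eye(n) for _ in range(m)]
    for _ in range(rounds):
        for i in range(m):
            Y = sum(X[j] @ Q[j] for j in range(m) if j != i) / (m - 1)  # LOO mean
            U, _, Vt = np.linalg.svd(X[i].T @ Y)
            Q[i] = U @ Vt                                               # Procrustes update
    return Q  # with A_i = I, R_i = Q_i

rng = np.random.default_rng(2)
base = rng.standard_normal((50, 8))
# three views of a shared trajectory, each under its own orthogonal map
X = [base @ np.linalg.qr(rng.standard_normal((8, 8)))[0] for _ in range(3)]
Q = hyperalign(X)
aligned = [Xi @ Qi for Xi, Qi in zip(X, Q)]

def disagreement(Z):  # sum of pairwise Frobenius distances between views
    m = len(Z)
    return sum(np.linalg.norm(Z[i] - Z[j]) for i in range(m) for j in range(i + 1, m))

print(disagreement(aligned) < disagreement(X))  # True: alignment reduces disagreement
```

With the LOO mean, each Q_i update exactly minimizes its terms of the pairwise objective given the other maps, so the disagreement is monotonically non-increasing across sweeps.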
We set Φ_0' = V_0'^T Φ_0, where V_0' is formed by the first r columns of V_0, and then use U = Φ_0'^T K_0'^{−1/2}. In general, rather than compute Q according to (7), involving N(N−1)/2 = O(N²) degrees of freedom (when N is finite), we end up with r(r−1)/2 = O(r²) degrees of freedom via the kernel trick.
Let B̃_i = B_i K_{i0} K_0^{−1/2} ∈ R^{t×r}. We reduce (8) in terms of G_i and obtain (Supplementary Material)

    G_i = arg max_{G∈O(r)} tr( G^T B̃_i^T [ (1/|A|) Σ_{j∈A} B̃_j Ĝ_j ] ),   (11)

where Ĝ_j is the current estimate of G_j. Equation (11) is the classical orthogonal Procrustes problem. If Ū Σ̄ V̄^T is the SVD of B̃_i^T [ (1/|A|) Σ_{j∈A} B̃_j Ĝ_j ], then a maximizer is given by Ū V̄^T [7].
The kernel hyperalignment procedure is given in Algorithm 2. Using the approach taken in this section also leads to an efficient solution of the standard orthogonal Procrustes problem for n ≥ 2t (Supplementary Material). In turn, this leads to an efficient iterative solution for the hyperalignment problem when n is large.

4 Alignment Assessment

An alignment procedure is not subject to the typical train-and-test paradigm. The lack of spatial correspondence demands an align-train-test approach. We assume these three sets have within-subject (or within-view) alignment. With all other parameters fixed, if the aligned test error is smaller than the unaligned test error, there is strong evidence suggesting that alignment was the underlying cause.
Kernel hyperalignment returns linear transformations R_1:m that act on data living in feature space. In general, we cannot directly train and test in the feature space due to its large size.
We can, however, learn from relational data. For example, we can compute distances between examples and, subsequently, produce nearest-neighbor classifiers. Assume (α, β) = (1, 0), i.e., the R_1:m are orthogonal. If x_1 ∈ R^n is a view-i example and x_2 ∈ R^n is a view-j example, the respective pre-aligned and post-aligned squared distances between the two examples are given by

    ‖Φ(x_1^T) − Φ(x_2^T)‖²_F = k̂(x_1, x_1) + k̂(x_2, x_2) − 2k̂(x_1, x_2)   (12)
    ‖Φ(x_1^T)R_i − Φ(x_2^T)R_j‖²_F = k̂(x_1, x_1) + k̂(x_2, x_2) − 2Φ(x_1^T)R_iR_j^TΦ(x_2^T)^T.   (13)

The cross-term in (13) has not been expanded for a simple reason: it is too messy. We realized early on that the alignment and training phases would be replete with lengthy expansions and, consequently, sought to simplify matters with a computer science solution. Both binary and unary operations in feature space can be accomplished with a simple class. Our Phi class stores expressions of the following forms:

    Type 1: Σ_{k=1}^K M_k Φ(X_{a(k)}),   Type 2: Σ_{k=1}^K Φ(X_{ā(k)})^T M_k,   Type 3: bI_N + Σ_{k=1}^K Φ(X_{ā(k)})^T M_k Φ(X_{a(k)}).   (14)

Each class instance stores matrices M_1:K, scalar b, right address vector a, and left address vector ā. The address vectors are pointers to the input data. This allows for faster manipulation and smaller memory allocation. Addition and subtraction require a common type. If types match, then the M matrices must be checked for compatible sizes. Multiplication is performed for types 1 with 2, 1 with 3, 2 with 1, 3 with 2, and 3 with 3. The first of these cases, for example, produces a numeric result via the kernel trick.
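The pre-aligned squared distance (12), for instance, needs only kernel evaluations and never an explicit feature vector. A small sketch with a Gaussian kernel (our choice, purely illustrative):

```python
import numpy as np

def rbf(x, y, gamma=0.5):
    """Gaussian (RBF) kernel, k(x, y) = exp(-gamma * ||x - y||^2)."""
    return float(np.exp(-gamma * np.sum((x - y) ** 2)))

def prealigned_dist2(x1, x2, k=rbf):
    """Squared feature-space distance as in (12), via the kernel trick only."""
    return k(x1, x1) + k(x2, x2) - 2.0 * k(x1, x2)

rng = np.random.default_rng(4)
x1, x2 = rng.standard_normal(6), rng.standard_normal(6)
d2 = prealigned_dist2(x1, x2)
print(0.0 <= d2 < 2.0)  # True: for an RBF kernel d2 = 2 - 2*k(x1, x2), with k in (0, 1]
```

Distances like these are exactly what a nearest-neighbor classifier on relational data consumes.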
We also define scalar multiplication and division for all types, and matrix multiplication for types 1 and 2. A transpose operator applies for all types and maps type 1 to 2, 2 to 1, and 3 to 3. More advanced operations, such as powers and inverses, are also possible. Our implementation was done in Matlab.
The construction of the Phi class allows us to stay in feature space and avoid lengthy expansions. In turn, this facilitates implementing the richer set of SVM classifiers. Let X_1̄, . . . , X_m̄ ∈ R^{s×n} be our training data with feature representation Φ_ī = Φ(X_ī) ∈ R^{s×N}. Recall that kernel hyperalignment seeks to align in feature space. Before alignment we might have considered K_īj̄ = Φ_īΦ_j̄^T; we now consider the Gram matrix (Φ_īR_i)(Φ_j̄R_j)^T = Φ_īR_iR_j^TΦ_j̄^T. If every row of X_ī has a corresponding label, we can train an SVM with the aligned kernel matrix K_Ā = K_Ā^T ∈ R^{ms×ms}. The unaligned kernel matrix, K_Ū, is also an m × m block matrix, with ij-th block K_īj̄.
Using the dual formulation of an SVM, a classifier can be constructed from the relational data exhibited among the examples [4]. Similar to a k-nearest-neighbor classifier relying on pairwise distances, an SVM relies on the kernel matrix. The kernel matrix is a matrix of inner products and is therefore linear. This enables us to assess a partition-based alignment. In fMRI, we perform two alignments, one for each hemisphere. The two alignments produce two aligned kernel matrices, which we sum and then input into an SVM.
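For a linear kernel the feature map is explicit, so the per-hemisphere aligned Gram matrices and their sum can be sketched directly (all dimensions and names below are illustrative):

```python
import numpy as np

def aligned_gram(Phis, Rs):
    """Aligned kernel matrix: the m x m block matrix with ij-th block
    Phi_i R_i (Phi_j R_j)^T, here for an explicit (linear) feature map."""
    Z = np.vstack([Phi @ R for Phi, R in zip(Phis, Rs)])  # stack aligned examples
    return Z @ Z.T

rng = np.random.default_rng(3)
m, s, n = 3, 5, 7   # subjects, labeled examples per subject, voxels per partition
# per-"hemisphere" data and per-subject orthogonal alignment maps (illustrative)
left  = [rng.standard_normal((s, n)) for _ in range(m)]
right = [rng.standard_normal((s, n)) for _ in range(m)]
R_left  = [np.linalg.qr(rng.standard_normal((n, n)))[0] for _ in range(m)]
R_right = [np.linalg.qr(rng.standard_normal((n, n)))[0] for _ in range(m)]

# one aligned kernel matrix per partition; their sum is again a valid kernel matrix
K = aligned_gram(left, R_left) + aligned_gram(right, R_right)
print(K.shape)                                   # (15, 15): ms x ms
print(np.allclose(K, K.T))                       # True: Gram matrices are symmetric
print(np.min(np.linalg.eigvalsh(K)) > -1e-8)     # True: positive semidefinite
```

The summed matrix can be handed to any SVM implementation that accepts a precomputed kernel.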
Thus, linearity provides us the means to handle finer partitions by simply summing the aligned kernel matrices. In full, the aligned kernel matrix is

    K_Ā = [Φ_1̄R_1; Φ_2̄R_2; . . . ; Φ_m̄R_m] [Φ_1̄R_1; Φ_2̄R_2; . . . ; Φ_m̄R_m]^T,   (15)

the m × m block matrix whose ij-th block is Φ_īR_iR_j^TΦ_j̄^T.

Table 1: Seven-label classification using movie-based alignment. Below is the cross-validated, between-subject classification accuracy (within-subject in brackets) with (α, β) = (1, 0). Four hundred TRs per subject were used for the alignment. Chance = 1/7 ≈ 14.29%.

                Ventral Temporal (2,997 voxels/hemisphere)    Entire Cortex (133,590 voxels/hemisphere)
    Kernel      Anatomical         Kernel Hyp.                Anatomical         Kernel Hyp.
    Linear      35.71% [42.68%]    48.57% [42.68%]            34.64% [26.79%]    36.25% [26.79%]
    Quadratic   35.00% [43.32%]    50.36% [42.32%]            36.07% [25.54%]    36.43% [25.54%]
    Gaussian    36.25% [43.39%]    48.57% [43.39%]            36.07% [26.07%]    36.43% [26.07%]
    Sigmoid     35.89% [43.21%]    48.21% [43.21%]            35.00% [26.79%]    36.25% [26.79%]

5 Experiments

The data used in this section consisted of fMRI time-series data from 10 subjects who viewed a movie and also engaged in a block-design visualization experiment [17]. Each subject saw Raiders of the Lost Ark (1981), lasting a total of 2213 TRs. In the visualization experiment, subjects were shown images belonging to a specific class for 16 TRs followed by 10 TRs of rest.
The 7 classes\nwere: (1) female face, (2) male face, (3) monkey, (4) house, (5) chair, (6) shoe and (7) dog. There\nwere 8 runs total, and each run had every image class represented once.\nWe assess alignment by classi\ufb01cation accuracy. To provide the same number of voxels per ROI for\nall subjects, we \ufb01rst performed anatomical alignment. We then selected a contiguous block of 400\nTRs from the movie data to serve as the per-subject input of the kernel hyperalignment. Next, we\nextracted labeled examples from the visualization experiment by taking an offset time average of\neach 16 TR class exposure. An offset of 6 seconds factored in the hemodynamic response. This\nproduced 560 labeled examples: 10 subjects \u00d7 8 runs/subject \u00d7 7 examples/run.\nKernel hyperalignment allows us to (a) use nonlinear measures of similarity, and (b) consider more\nvoxels for the alignment. Consequently, we (a) experiment with a variety of kernels, and (b) do not\nneed to pre-select or screen voxels as was done in [9]\u2014we include them all. Table 1 features results\nfrom a 7-label classi\ufb01cation experiment. Recall that a linear kernel reduces to hyperalignment. We\nclassi\ufb01ed using a multi-label \u03bd-SVM [3]. We used the \ufb01rst 400 TRs from each subject\u2019s movie data,\nand aligned each hemisphere separately. The kernel functions are supplied in the Supplementary\nMaterial. As observed in [9] and repeated here, hyperalignment leads to increased between-subject\naccuracy and outperforms within-subject accuracy. Thus, we are extracting more common structure\nacross subjects. Whereas employing Algorithm 1 for 2,997 voxels is feasible (and slow), 133,590\nvoxels is not feasible at all.\nTo complete the picture, we plot the effects of regularization. Figure 1 displays the cross-validated,\nbetween-subject classi\ufb01cation accuracy for varying (\u03b1, \u03b2) where \u03b1 = 1\u2212\u03b2. 
This traces out a route\nfrom CCA (\u03b1 \u2248 0) to hyperalignment (\u03b1 = 1). When compared to the alignments in [9], our voxel\ncounts are orders of magnitude larger. For our four chosen kernels, hyperalignment (\u03b1 = 1) presents\nitself as the option with near-greatest accuracy.\nOur results support the robustness of hyperalignment and imply that voxel selection may be a crucial\npre-processing step when dealing with the whole volume. More voxels mean more noisy voxels,\nand hyperalignment does not distinguish itself from anatomical alignment when the entire cortex is\nconsidered. We can visualize this phenomenon with Multidimensional Scaling (MDS) [21].\nMDS takes as input all of the pairwise distances between subjects (the previous section discussed\ndistance calculations). Figure 2 depicts the optimal Euclidean representation of our 10 subjects be-\nfore and after kernel hyperalignment ((\u03b1, \u03b2) = (1, 0)) with respect to the \ufb01rst 400 TRs of the movie\ndata. Focusing on VT, kernel hyperalignment manages to cluster 7 of the 10 subjects. However,\nwhen we shift to the entire cortex, we see that anatomical alignment has already succeeded in a sim-\nilar clustering. Kernel hyperalignment manages to group the subjects closer together, and manifests\nitself as a re-centering.\n\n7\n\n\fFigure 1: Cross-validated between-subject classi\ufb01cation accuracy (7 labels) as a function of the\nregularization parameter, \u03b1 = 1\u2212\u03b2, for various kernels after alignment. The solid curves are for\nVentral Temporal and the dashed curves are for the entire cortex. 
Chance = 1/7 ≈ 14.29%.

Figure 2: Visualizing alignment with MDS. Each locus pair approximates the normalized relationship among the 10 subjects in 2D, before (left) and after (right) applying kernel hyperalignment. Centroids are translated to the origin and numbers correspond to individual subjects.

6 Conclusion

We have extended hyperalignment in both scale and feature space. Kernel hyperalignment can handle a large number of original features and incorporate nonlinear measures of similarity. We have also shown how to use the linear maps, applied in feature space, for post-alignment classification. In the setting of fMRI, we have demonstrated successful alignment with a variety of kernels. Kernel hyperalignment achieved better between-subject classification than anatomical alignment for VT. There was no noticeable difference when we considered the entire cortex. Nevertheless, kernel hyperalignment proved robust and did not degrade with increasing voxel count.
We envision a fruitful path for kernel hyperalignment. Empirically, we have noticed a tradeoff between feature cardinality and classification accuracy, motivating the need for intelligent feature selection within our established framework. Although we have limited our focus to fMRI data analysis, kernel hyperalignment can be applied to other research areas which rely on multi-set Procrustes problems.

References
[1] F.R. Bach and M.I. Jordan. 
Kernel independent component analysis. The Journal of Machine Learning Research, 3:1–48, 2003.
[2] C.M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
[3] C.C. Chang and C.J. Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1–27:27, 2011. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
[4] P.H. Chen, C.J. Lin, and B. Schölkopf. A tutorial on ν-support vector machines. Applied Stochastic Models in Business and Industry, 21(2):111–136, 2005.
[5] A. Edelman, T.A. Arias, and S.T. Smith. The geometry of algorithms with orthogonality constraints. SIAM Journal on Matrix Analysis and Applications, 20(2):303–353, 1998.
[6] C. Goodall. Procrustes methods in the statistical analysis of shape. Journal of the Royal Statistical Society, Series B (Methodological), pages 285–339, 1991.
[7] J.C. Gower and G.B. Dijksterhuis. Procrustes Problems, volume 30. Oxford University Press, USA, 2004.
[8] D.R. Hardoon, S. Szedmak, and J. Shawe-Taylor. Canonical correlation analysis: An overview with application to learning methods. Neural Computation, 16(12):2639–2664, 2004.
[9] J.V. Haxby, J.S. Guntupalli, A.C. Connolly, Y.O. Halchenko, B.R. Conroy, M.I. Gobbini, M. Hanke, and P.J. Ramadge. A common, high-dimensional model of the representational space in human ventral temporal cortex. Neuron, 72(2):404–416, 2011.
[10] T. Hofmann, B. Schölkopf, and A.J. Smola. Kernel methods in machine learning. The Annals of Statistics, pages 1171–1220, 2008.
[11] R.A. Horn and C.R. Johnson. Matrix Analysis. Cambridge University Press, 1990.
[12] H. Hotelling. Relations between two sets of variates. Biometrika, 28(3/4):321–377, 1936.
[13] J.R. Kettenring. Canonical analysis of several sets of variables. Biometrika, 58(3):433, 1971.
[14] G.S. Kimeldorf and G. Wahba. 
A correspondence between Bayesian estimation on stochastic processes and smoothing by splines. The Annals of Mathematical Statistics, 41(2):495–502, 1970.
[15] M. Kuss and T. Graepel. The geometry of kernel canonical correlation analysis. Technical report, Max Planck Institute, 2003.
[16] P.L. Lai and C. Fyfe. Kernel and nonlinear canonical correlation analysis. International Journal of Neural Systems, 10(5):365–378, 2000.
[17] M.R. Sabuncu, B.D. Singer, B. Conroy, R.E. Bryan, P.J. Ramadge, and J.V. Haxby. Function-based inter-subject alignment of human cortical anatomy. Cerebral Cortex, 2009.
[18] B. Schölkopf, R. Herbrich, and A. Smola. A generalized representer theorem. In Computational Learning Theory, pages 416–426. Springer, 2001.
[19] P.H. Schönemann. A generalized solution of the orthogonal Procrustes problem. Psychometrika, 31(1):1–10, March 1966.
[20] J. Talairach and P. Tournoux. Co-planar Stereotaxic Atlas of the Human Brain: 3-Dimensional Proportional System: An Approach to Cerebral Imaging. Thieme, 1988.
[21] J.B. Tenenbaum, V. de Silva, and J.C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319–2323, 2000.
[22] H. Xu, A. Lorbert, P.J. Ramadge, J.S. Guntupalli, and J.V. Haxby. Regularized hyperalignment of multi-set fMRI data. In Proceedings of the 2012 IEEE Statistical Signal Processing Workshop, Ann Arbor, Michigan, 2012.
", "award": [], "sourceid": 884, "authors": [{"given_name": "Alexander", "family_name": "Lorbert", "institution": null}, {"given_name": "Peter", "family_name": "Ramadge", "institution": null}]}