{"title": "Demixed Principal Component Analysis", "book": "Advances in Neural Information Processing Systems", "page_first": 2654, "page_last": 2662, "abstract": "In many experiments, the data points collected live in high-dimensional observation spaces, yet can be assigned a set of labels or parameters. In electrophysiological recordings, for instance, the responses of populations of neurons generally depend on mixtures of experimentally controlled parameters. The heterogeneity and diversity of these parameter dependencies can make visualization and interpretation of such data extremely difficult. Standard dimensionality reduction techniques such as principal component analysis (PCA) can provide a succinct and complete description of the data, but the description is constructed independent of the relevant task variables and is often hard to interpret. Here, we start with the assumption that a particularly informative description is one that reveals the dependency of the high-dimensional data on the individual parameters. We show how to modify the loss function of PCA so that the principal components seek to capture both the maximum amount of variance about the data, while also depending on a minimum number of parameters. We call this method demixed principal component analysis (dPCA) as the principal components here segregate the parameter dependencies. We phrase the problem as a probabilistic graphical model, and present a fast Expectation-Maximization (EM) algorithm. We demonstrate the use of this algorithm for electrophysiological data and show that it serves to demix the parameter-dependence of a neural population response.", "full_text": "Demixed Principal Component Analysis

Wieland Brendel
Ecole Normale Supérieure, Paris, France
Champalimaud Neuroscience Programme, Lisbon, Portugal

Ranulfo Romo
Instituto de Fisiología Celular, Universidad Nacional Autónoma de México, Mexico City, Mexico

Christian K. Machens
Ecole Normale Supérieure, Paris, France
Champalimaud Neuroscience Programme, Lisbon, Portugal

Abstract

In many experiments, the data points collected live in high-dimensional observation spaces, yet can be assigned a set of labels or parameters. In electrophysiological recordings, for instance, the responses of populations of neurons generally depend on mixtures of experimentally controlled parameters. The heterogeneity and diversity of these parameter dependencies can make visualization and interpretation of such data extremely difficult. Standard dimensionality reduction techniques such as principal component analysis (PCA) can provide a succinct and complete description of the data, but the description is constructed independent of the relevant task variables and is often hard to interpret. Here, we start with the assumption that a particularly informative description is one that reveals the dependency of the high-dimensional data on the individual parameters. We show how to modify the loss function of PCA so that the principal components seek to capture both the maximum amount of variance about the data, while also depending on a minimum number of parameters. We call this method demixed principal component analysis (dPCA) as the principal components here segregate the parameter dependencies. We phrase the problem as a probabilistic graphical model, and present a fast Expectation-Maximization (EM) algorithm. We demonstrate the use of this algorithm for electrophysiological data and show that it serves to demix the parameter-dependence of a neural population response.

1 Introduction

Samples of multivariate data are often connected with labels or parameters. 
In fMRI data or electrophysiological data from awake behaving humans and animals, for instance, the multivariate data may be the voxels of brain activity or the firing rates of a population of neurons, and the parameters may be sensory stimuli, behavioral choices, or simply the passage of time. In these cases, it is often of interest to examine how the external parameters or labels are represented in the data set.

Such data sets can be analyzed with principal component analysis (PCA) and related dimensionality reduction methods [4, 2]. While these methods are usually successful in reducing the dimensionality of the data, they do not take the parameters or labels into account. Not surprisingly, then, they often fail to represent the data in a way that simplifies the interpretation in terms of the underlying parameters. On the other hand, dimensionality reduction methods that can take parameters into account, such as canonical correlation analysis (CCA) or partial least squares (PLS) [1, 5], impose a specific model of how the data depend on the parameters (e.g. linearly), which can be too restrictive.

We illustrate these issues with neural recordings collected from the prefrontal cortex (PFC) of monkeys performing a two-frequency discrimination task [9, 3, 7]. In this experiment a monkey received two mechanical vibrations with frequencies f1 and f2 on its fingertip, delayed by three seconds. The monkey then had to make a binary decision d depending on whether f1 > f2. In the data set, each neuron has a unique firing pattern, leading to a large diversity of neural responses. The firing rates of three neurons (out of a total of 842) are plotted in Fig. 1, top row. 
The responses of the neurons mix information about the different task parameters, a common observation for data sets of recordings in higher-order brain areas, and a problem that complicates interpretation of the data.

Here we address this problem by modifying PCA such that the principal components depend on individual task parameters while still capturing as much variance as possible. Previous work has addressed the question of how to demix data depending on two [7] or several parameters [8], but did not allow components that capture nonlinear mixtures of parameters. Here we extend this previous work threefold: (1) we show how to systematically split the data into univariate and multivariate parameter dependencies; (2) we show how this split suggests a simple loss function, capable of demixing data with arbitrary combinations of parameters; (3) we introduce a probabilistic model for our method and derive a fast algorithm using expectation-maximization.

2 Principal component analysis and the demixing problem

The firing rates of the neurons in our dataset depend on three external parameters: the time t, the stimulus s = f1, and the decision d of the monkey. We omit the second frequency f2 since this parameter is highly correlated with f1 and d (the monkey makes errors in < 10% of the trials). Each sample of firing rates in the population, y_n, is therefore tagged with parameter values (t_n, s_n, d_n). For notational simplicity, we will assume that each data point is associated with a unique set of parameter values so that the parameter values themselves can serve as indices for the data points y_n. In turn, we drop the index n, and simply write y_tsd.

The main aim of PCA is to find a new coordinate system in which the data can be represented in a more succinct and compact fashion. 
The covariance matrix of the firing rates summarizes the second-order statistics of the data set,

C = ⟨y_tsd y_tsd^T⟩_tsd,   (1)

and has size D × D where D is the number of neurons in the data set (we will assume the data are centered throughout the paper). The angular bracket denotes averaging over all parameter values (t, s, d) which corresponds to averaging over all data points. Given the covariance matrix, we can compute the firing rate variance that falls along arbitrary directions in state space. For instance, the variance captured by a coordinate axis given by a normalized vector w is simply

L = w^T C w.   (2)

The first principal component corresponds to the axis that captures most of the variance of the data, and thereby maximizes the function L subject to the normalization constraint w^T w = 1. The second principal component maximizes variance in the orthogonal subspace and so on [4, 2].

PCA succeeds nicely in summarizing the population response for our data set: the first ten principal components capture more than 90% of the variance of the data. However, PCA completely ignores the causes of firing rate variability. Whether firing rates have changed due to the first stimulus frequency s = f1, due to the passage of time, t, or due to the decision, d, they will enter equally into the computation of the covariance matrix and therefore do not influence the choice of the coordinate system constructed by PCA. To clarify this observation, we will segregate the data y_tsd into pieces capturing the variability caused by different parameters.

Marginalized average. Let us denote the set of parameters by S = {t, s, d}. 
For every subset of S we construct a 'marginalized average',

ȳ_t := ⟨y_tsd⟩_sd,   ȳ_s := ⟨y_tsd⟩_td,   ȳ_d := ⟨y_tsd⟩_ts,   (3)

ȳ_ts := ⟨y_tsd⟩_d − ȳ_t − ȳ_s,   ȳ_td := ⟨y_tsd⟩_s − ȳ_t − ȳ_d,   ȳ_sd := ⟨y_tsd⟩_t − ȳ_s − ȳ_d,   (4)

ȳ_tsd := y_tsd − ȳ_ts − ȳ_td − ȳ_sd − ȳ_t − ȳ_s − ȳ_d,   (5)

where ⟨y_tsd⟩_φ denotes the average of the data over the subset φ ⊆ S. In ȳ_t = ⟨y_tsd⟩_sd, for instance, we average over all parameter values (s, d) such that the remaining variation of the averaged data only comes from t. In ȳ_ts, we subtract all variation due to t or s individually, leaving only variation that depends on combined changes of (t, s).

Figure 1: (Top row) Firing rates of three (out of D = 842) neurons recorded in the PFC of monkeys discriminating two vibratory frequencies. The two stimuli were presented during the shaded periods. The rainbow colors indicate different stimulus frequencies f1, black and gray indicate the decision of the monkey during the interval [3.5,4.5] sec. (Bottom row) Relative contribution of time (blue), stimulus (light blue), decision (green), and non-linear mixtures (yellow) to the total variance for a sample of 14 neurons (left), the top 14 principal components (middle), and naive demixing (right).

These marginalized averages are orthogonal, so that

⟨ȳ_φ^T ȳ_φ'⟩ = 0 if φ ≠ φ', for all φ, φ' ⊆ S.   (6)

At the same time, their sum reconstructs the original data,

y_tsd = ȳ_t + ȳ_s + ȳ_d + ȳ_ts + ȳ_td + ȳ_sd + ȳ_tsd.   (7)

The latter two properties allow us to segregate the covariance matrix of y_tsd into 'marginalized covariance matrices' that capture the variance in a subset of parameters φ ⊆ S,

C = C_t + C_s + C_d + C_ts + C_td + C_sd + C_tsd, with C_φ = ⟨ȳ_φ ȳ_φ^T⟩.

Note that here we use the parameters {t, s, d} as labels, whereas they are indices in Eq. (3)-(5), and (7). For a given component w, the marginalized covariance matrices allow us to calculate the variance x_φ^2 of w conditioned on φ ⊆ S as

x_φ^2 = w^T C_φ w,

so that the total variance is given by L = Σ_φ x_φ^2 =: ||x||_2^2.

Using this segregation, we are able to examine the distribution of variance in the PCA components and the original data. In Fig. 1, bottom row, we plot the relative contributions of time (blue; computed as x_t^2 / ||x||_2^2), stimulus (light blue; computed as (x_s^2 + x_ts^2) / ||x||_2^2), decision (green; computed as (x_d^2 + x_td^2) / ||x||_2^2), and nonlinear mixtures of stimulus and decision (yellow; computed as (x_sd^2 + x_tsd^2) / ||x||_2^2) for a set of sample neurons (left) and for the first fourteen components of PCA (center). The left plot shows that individual neurons carry varying degrees of information about the different task parameters, reaffirming the heterogeneity of neural responses. 
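The decomposition of Eqs. (3)-(7) and the segregation of the covariance matrix can be checked numerically. The following sketch is our own illustration, with a small synthetic data array in place of the recorded firing rates; it verifies the reconstruction property, the additivity of the marginalized covariances, and the variance split along an arbitrary direction w:

```python
import numpy as np

rng = np.random.default_rng(0)
T, S_, D_, N = 10, 6, 2, 20            # time bins, stimuli, decisions, neurons (made-up sizes)
Y = rng.standard_normal((T, S_, D_, N))
Y -= Y.mean(axis=(0, 1, 2))            # center the data, as assumed for Eq. (1)

# Marginalized averages, Eqs. (3)-(5); keepdims=True lets broadcasting expand them back.
m = lambda *ax: Y.mean(axis=ax, keepdims=True)
y_t, y_s, y_d = m(1, 2), m(0, 2), m(0, 1)
y_ts = m(2) - y_t - y_s
y_td = m(1) - y_t - y_d
y_sd = m(0) - y_s - y_d
y_tsd = Y - y_ts - y_td - y_sd - y_t - y_s - y_d

parts = [y_t, y_s, y_d, y_ts, y_td, y_sd, y_tsd]
# Eq. (7): the marginalized averages sum back to the original data.
recon = sum(np.broadcast_to(p, Y.shape) for p in parts)
assert np.allclose(recon, Y)

# Segregation of the covariance matrix into marginalized covariances C_phi.
flat = lambda p: np.broadcast_to(p, Y.shape).reshape(-1, N)
C = flat(Y).T @ flat(Y) / (T * S_ * D_)
C_phi = [flat(p).T @ flat(p) / (T * S_ * D_) for p in parts]
assert np.allclose(C, sum(C_phi))      # a consequence of the orthogonality in Eq. (6)

# Variance fractions x_phi^2 = w^T C_phi w along an arbitrary unit direction w.
w = rng.standard_normal(N)
w /= np.linalg.norm(w)
x2 = np.array([w @ Cp @ w for Cp in C_phi])
assert np.isclose(x2.sum(), w @ C @ w)  # total variance L = sum_phi x_phi^2
```

Note that the exact additivity of the C_φ relies on the balanced, fully crossed design assumed here, which is what makes Eq. (6) hold exactly.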
While the situation is slightly better for the PCA components, we still find a strong mixing of the task parameters.

To improve visualization of the data and to facilitate the interpretation of individual components, we would prefer components that depend on only a single parameter, or, more generally, that depend on the smallest number of parameters possible. At the same time, we would want to keep the attractive properties of PCA in which every component captures as much variance as possible about the data.

Naively, we could simply combine eigenvectors from the marginalized covariance matrices. For example, consider the first Q eigenvectors of each marginalized covariance matrix. Apply symmetric orthogonalization to these eigenvectors and choose the Q coordinates that capture the most variance. The resulting variance distribution is plotted in Fig. 1 (bottom, right). While the parameter dependence of the components is sparser than in PCA, there is a strong bias towards time, and variance induced by the decision of the monkey is squeezed out. As a further drawback, naive demixing covers only 84.6% of the total variance compared with 91.7% for PCA. 

Figure 2: Illustration of the objective functions. The PCA objective function corresponds to the L2-norm in the space of standard deviations, x. Whether a solution falls into the center or along the axis does not matter, as long as it captures a maximum of overall variance. The dPCA objective functions (with parameters λ = 1 and λ = 4) prefer solutions along the axes over solutions in the center, even if the solutions along the axes capture less overall variance.

We conclude that we have to rely on a more systematic approach based specifically on an objective that promotes demixing.

3 Demixed principal component analysis (dPCA): Loss function

With respect to the segregated covariances, the PCA objective function, Eq. (2), can be written as L = w^T C w = Σ_φ w^T C_φ w = Σ_φ x_φ^2 = ||x||_2^2. This function is illustrated in Fig. 2 (left), and shows that PCA will maximize variance, no matter whether this variance comes about through a single marginalized variance, or through mixtures thereof.

Consequently, we need to modify this objective function such that solutions w that do not mix variances, thereby falling along one of the axes in x-space, are favored over solutions w that fall into the center in x-space. Hence, we seek an objective function L = L(x) that grows monotonically with any x_φ such that more variance is better, just as in PCA, and that grows faster along the axes than in the center so that mixtures of variances get punished. A simple way of imposing this is

L_dPCA = ||x||_2^2 ( ||x||_2 / ||x||_1 )^λ,   (8)

where λ ≥ 0 controls the tradeoff. This objective function is illustrated in Fig. 2 (center and right) for two values of λ. Here, solutions w that lead to mixtures of variances are punished against solutions that do not mix variances.

Note that the objective function is a function of the coordinate axis w, and the aim is to maximize L_dPCA with respect to w. A generalization to a set of Q components w1, . . . , wQ is straightforward by maximizing L in steps for every component and ensuring orthonormality by means of symmetric orthogonalization [6] after each step. 
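The behavior of the objective in Eq. (8) can be sketched with made-up variance vectors (our own illustration): a demixed direction and a mixed direction with the same total variance are tied under PCA (λ = 0), but the demixed one wins for λ > 0, matching Fig. 2.

```python
import numpy as np

def dpca_loss(x2, lam):
    """L_dPCA of Eq. (8): ||x||_2^2 * (||x||_2 / ||x||_1)**lam, where the entries
    of x are the marginalized standard deviations x_phi (x2 holds the x_phi^2)."""
    x = np.sqrt(np.asarray(x2, dtype=float))
    l2 = np.linalg.norm(x)
    l1 = x.sum()
    return l2**2 * (l2 / l1)**lam

# Two candidate axes with the same total variance: one demixed, one mixed.
demixed = [100.0, 0.0, 0.0]
mixed = [40.0, 30.0, 30.0]
assert np.isclose(dpca_loss(demixed, lam=0), dpca_loss(mixed, lam=0))  # lambda = 0: plain PCA, a tie
assert dpca_loss(demixed, lam=4) > dpca_loss(mixed, lam=4)             # lambda > 0 favors the axes
```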
We call the resulting algorithm demixed principal component analysis (dPCA), since it essentially can be seen as a generalization of standard PCA.

4 Probabilistic principal component analysis with orthogonality constraint

We introduced dPCA by means of a modification of the objective function of PCA. It is straightforward to build a gradient ascent algorithm to solve Eq. (8). However, we aim for a superior algorithm by framing dPCA in a probabilistic framework. A probabilistic model provides several benefits that include dealing with missing data and the inclusion of prior knowledge [see 2, p. 570]. Since the probabilistic treatment of dPCA requires two modifications over the conventional expectation-maximization (EM) algorithm for probabilistic PCA (PPCA), we here review PPCA [11, 10], and show how to introduce an explicit orthogonality constraint on the mixing matrix.

In PPCA, the observed data y are linear combinations of latent variables z,

y = Wz + ε_y,   (9)

where ε_y ~ N(0, σ^2 I_D) is isotropic Gaussian noise with variance σ^2 and W ∈ R^{D×Q} is the mixing matrix. In turn, p(y|z) = N(y | Wz, σ^2 I_D). The latent variables are assumed to follow a zero-mean, unit-covariance Gaussian prior, p(z) = N(z | 0, I_Q). These equations completely specify the model of the data and allow us to compute the marginal distribution p(y).

Let Y = {y_n} be the set of data points, with n = 1 . . . N, and Z = {z_n} the corresponding values of the latent variables. Our aim is to maximize the likelihood of the data, p(Y) = Π_n p(y_n), with respect to the parameters W and σ. 
To this end, we use the EM algorithm, in which we first calculate the statistics (mean and covariance) of the posterior distribution, p(Z|Y), given fixed values for W and σ^2 (Expectation step). Then, using these statistics, we compute the expected complete-data likelihood, E[p(Y, Z)], and maximize it with respect to W and σ^2 (Maximization step). We cycle through the two steps until convergence.

Expectation step. The posterior distribution p(Z|Y) is again Gaussian and given by

p(Z|Y) = Π_{n=1}^N N( z_n | M^{-1} W^T y_n, σ^2 M^{-1} ) with M = W^T W + σ^2 I_Q.   (10)

Mean and covariance can be read off the arguments, and we note in particular that E[z_n z_n^T] = σ^2 M^{-1} + E[z_n] E[z_n]^T. We can then take the expectation of the complete-data log likelihood with respect to this posterior distribution, so that

E[ln p(Y, Z | W, σ^2)] = − Σ_{n=1}^N { (D/2) ln(2πσ^2) + (1/2σ^2) ||y_n||^2 − (1/σ^2) E[z_n]^T W^T y_n + (1/2σ^2) Tr( E[z_n z_n^T] W^T W ) + (Q/2) ln(2π) + (1/2) Tr( E[z_n z_n^T] ) }.   (11)

Maximization step. Next, we need to maximize Eq. (11) with respect to σ and W. For σ, we obtain

(σ*)^2 = (1/ND) Σ_{n=1}^N { ||y_n||^2 − 2 E[z_n]^T W^T y_n + Tr( E[z_n z_n^T] W^T W ) }.   (12)

For W, we need to deviate from the conventional PPCA algorithm, since the development of probabilistic dPCA requires an explicit orthogonality constraint on W, which had so far not been included in PPCA. 
To impose this constraint, we factorize W into an orthogonal and a diagonal matrix,

W = UΓ, U^T U = I_Q,   (13)

where U ∈ R^{D×Q} has orthogonal columns of unit length and Γ ∈ R^{Q×Q} is diagonal. In order to maximize Eq. (11) with respect to U and Γ we make use of infinitesimal translations in the respective restricted space of matrices,

U → (I_D + εA) U,   Γ → (I_Q + ε diag(b)) Γ,   (14)

where A ∈ Skew_D is D × D skew-symmetric, b ∈ R^Q, and ε ≪ 1. The set of D × D skew-symmetric matrices are the generators of rotations in the space of orthogonal matrices. The necessary conditions for a maximum of the likelihood function at U*, Γ* are

E[ln p(Y, Z | (I_D + εA) U* Γ, σ^2)] − E[ln p(Y, Z | U* Γ, σ^2)] = 0 + O(ε^2) for all A ∈ Skew_D,   (15)

E[ln p(Y, Z | U (I_Q + ε diag(b)) Γ*, σ^2)] − E[ln p(Y, Z | U Γ*, σ^2)] = 0 + O(ε^2) for all b ∈ R^Q.   (16)

Given the reduced singular value decomposition^1 of (Σ_n y_n E[z_n^T]) Γ = KΣL^T, the maximum is

U* = KL^T,   (17)

Γ* = diag( U^T Σ_n y_n E[z_n^T] ) diag( Σ_n E[z_n z_n^T] )^{-1}.   (18)

^1 The reduced singular value decomposition factorizes a D×Q matrix A as A = KDL*, where K is a D×Q unitary matrix, D is a Q × Q nonnegative, real diagonal matrix, and L* is a Q × Q unitary matrix.

Figure 3: (a) Graphical representation of the general idea of dPCA. Here, the data y are projected on a subspace z of latent variables. 
Each latent variable z_i depends on a set of parameters θ_j ∈ S. To ease interpretation of the latent variables z_i, we impose a sparse mapping between the parameters and the latent variables. (b) Full graphical model of dPCA.

where diag(A) returns a square matrix with the same diagonal as A but with all off-diagonal elements set to zero.

5 Probabilistic demixed principal component analysis

We described a PPCA EM-algorithm with an explicit constraint on the orthogonality of the columns of W. So far, variance due to different parameters in the data set is completely mixed in the latent variables z. The essential idea of dPCA is to demix these parameter dependencies by sparsifying the mapping from parameters to latent variables (see Fig. 3a). Since we do not want to impose the nature of this mapping (which is to remain non-parametric), we suggest a model in which each latent variable z_i is segregated into (and replaced by) a set of R latent variables {z_φ,i}, each of which depends on a subset φ ⊆ S of parameters. Note that R is the number of all subsets of S, exempting the empty set. We require z_i = Σ_{φ⊆S} z_φ,i, so that

y = Σ_{φ⊆S} W z_φ + ε_y,   (19)

with ε_y ~ N(0, σ^2 I_D), see also Fig. 3b. The priors over the latent variables are specified as

p(z_φ) = N( z_φ | 0, diag Λ_φ ),   (20)

where Λ_φ is a row in Λ ∈ R^{R×Q}, the matrix of variances for all latent variables. The covariance of the sum of the latent variables shall again be the identity matrix,

Σ_{φ⊆S} diag Λ_φ = I_Q.   (21)

This completely specifies our model. As before, we will use the EM-algorithm to maximize the model evidence p(Y) with respect to the parameters Λ, W, σ. 
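The generative model just specified, Eqs. (19)-(21), can be sampled directly. The following is our own illustration for three parameters (so R = 7); sizes, names, and the random placement of the prior variance are arbitrary, with the sparsity of Λ anticipating the constraint introduced next.

```python
import numpy as np

rng = np.random.default_rng(2)
D, Q = 6, 4                                          # made-up sizes
subsets = ["t", "s", "d", "ts", "td", "sd", "tsd"]   # all non-empty phi of S, so R = 7
R = len(subsets)

W = np.linalg.qr(rng.standard_normal((D, Q)))[0]     # W = U Gamma with Gamma = I_Q here
# Lambda holds the prior variances; Eq. (21) requires the columns to sum to one,
# and the sparsity constraint below puts each column's mass on a single subset.
Lam = np.zeros((R, Q))
Lam[rng.integers(0, R, size=Q), np.arange(Q)] = 1.0
assert np.allclose(Lam.sum(axis=0), np.ones(Q))      # Eq. (21)

# Eqs. (19)-(20): draw z_phi ~ N(0, diag Lambda_phi) and y = sum_phi W z_phi + eps_y.
sigma = 0.1
z = {phi: np.sqrt(Lam[i]) * rng.standard_normal(Q) for i, phi in enumerate(subsets)}
y = W @ sum(z.values()) + sigma * rng.standard_normal(D)
assert y.shape == (D,)
```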
However, we additionally impose that each column Λ_i of Λ shall be sparse, thereby ensuring that the diversity of parameter dependencies of the latent variables z_i = Σ_φ z_φ,i is reduced. Note that Λ_i is proportional to the vector x with elements x_φ introduced in section 3. This links the probabilistic model to the loss function in Eq. (8).

Expectation step. Due to the implicit parameter dependencies of the latent variables, the sets of variables Z_φ = {z_φ^n} can only depend on the respective marginalized averages of the data. The posterior distribution over all latent variables Z = {Z_φ} therefore factorizes such that

p(Z|Y) = Π_{φ⊆S} p(Z_φ | Ȳ_φ),   (22)

where Ȳ_φ = {ȳ_φ^n} are the marginalized averages over the complete data set. For three parameters, the marginalized averages were specified in Eq. (3)-(7). For more than three parameters, we obtain

ȳ_φ^n = ⟨y⟩^n_{S\φ} + Σ_{∅≠τ⊆φ} (−1)^{|τ|} ⟨y⟩^n_{(S\φ)∪τ},   (23)

where ⟨y⟩^n_ψ denotes averaging of the data over the parameter subset ψ. The index n refers the average to the respective data point.^2 In turn, the posterior of Z_φ takes the form

p(Z_φ | Ȳ_φ) = Π_{n=1}^N N( z_φ^n | M_φ^{-1} W^T ȳ_φ^n, σ^2 M_φ^{-1} ),   (24)

where

M_φ = W^T W + σ^2 diag(Λ_φ)^{-1}.   (25)

Hence, the expectation of the complete-data log-likelihood function is modified from Eq. (11),

E[ln p(Y, Z | W, σ^2)] = − Σ_{n=1}^N { (D/2) ln(2πσ^2) + (1/2σ^2) ||y_n||^2 + Σ_{φ⊆S} { (Q/2) ln(2π) + (1/2) ln det diag(Λ_φ) − (1/σ^2) E[z_φ^n]^T W^T y_n + (1/2σ^2) Tr( E[z_φ^n z_φ^{n,T}] W^T W ) + (1/2) Tr( E[z_φ^n z_φ^{n,T}] diag(Λ_φ)^{-1} ) } }.   (26)

Algorithm 1: demixed Principal Component Analysis (dPCA)
Input: Data Y, # components Q
Algorithm:
U^(k=1) ← first Q principal components of Y, Γ^(k=1) ← I_Q
repeat
  M_φ^(k), U^(k), Γ^(k), σ^(k), Λ^(k) ← update using (25), (17), (18), (12) and (30)
  k ← k + 1
until p(Y) converges

Maximization Step. Comparison of Eq. (11) and Eq. (26) shows that the maximum-likelihood estimates of W = UΓ and of σ^2 are unchanged (this can be seen by substituting z for the sum of marginalized averages, z = Σ_φ z_φ, so that E[z] = Σ_φ E[z_φ] and E[zz^T] = Σ_φ E[z_φ z_φ^T]). The maximization with respect to Λ is more involved because we have to respect constraints from two sides. First, Eq. (21) constrains the L1-norm of the columns Λ_i of Λ. Second, since we aim for components depending only on a small subset of parameters, we have to introduce another constraint to promote sparsity of Λ_i. Though this constraint is rather arbitrary, we found that constraining all but one entry of Λ_i to be zero works quite effectively, so that ||Λ_i||_0 = 1. Consequently, for each column Λ_i of Λ, the maximization of the expected likelihood, L, Eq. (26), is given by

Λ_i → argmax_{Λ_i} L(Λ_i) s.t. ||Λ_i||_1 = 1 and ||Λ_i||_0 = 1.   (27)

Defining B_φi = Σ_n E[z_φi^n z_φi^n], the relevant terms in the likelihood can be written as

L(Λ_i) = − Σ_φ ( ln Λ_φi + B_φi Λ_φi^{-1} )   (28)

= − ln(1 − mε) − B_φ'i (1 − mε)^{-1} − Σ_{φ∈J} ( ln ε + B_φi ε^{-1} ),   (29)

^2 To see through this notation, notice that the n-th data point y_n is tagged with parameter values θ_n = (θ_{1,n}, θ_{2,n}, . . .). Any average over a subset ψ = S \ φ of the parameters leaves vectors ⟨y⟩_ψ that still depend on some remaining parameters, φ = θ_rest. We can therefore take their values for the n-th data point, θ^n_rest, and assign the respective value of the average to the n-th data point as well, writing ⟨y⟩^n_ψ.

Figure 4: On the left we plot the relative variance of the fourteen highest components in dPCA conditioned on time (blue), stimulus (light blue), decision (green) and non-linear mixtures (yellow). On the right the firing rates of six dPCA components are displayed in three columns separated into components with the highest variance in time (left), in decision (middle) and in the stimulus (right).

where φ' is the index of the non-zero entry of Λ_i, and J is the complementing index set (of length m = R − 1) of all zero-entries which have been set to ε ≪ 1 for regularization purposes. Since ε is small, its inverse is very large. 
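Taking this limit yields the winner-take-all update stated next as Eq. (30): each column of Λ puts all its mass on the subset with the largest B_φi. A minimal sketch (our own illustration, with a made-up matrix B):

```python
import numpy as np

def update_Lambda(B):
    """Winner-take-all update of Eq. (30): for each component i (a column of B),
    put all prior variance on the subset phi with the largest
    B_phi,i = sum_n E[z_phi,i^n z_phi,i^n]."""
    Lam = np.zeros_like(B, dtype=float)
    Lam[np.argmax(B, axis=0), np.arange(B.shape[1])] = 1.0
    return Lam

B = np.array([[0.2, 5.0],    # rows: parameter subsets phi
              [3.0, 1.0],    # columns: latent components i
              [0.1, 0.5]])
assert np.array_equal(update_Lambda(B), np.array([[0.0, 1.0],
                                                  [1.0, 0.0],
                                                  [0.0, 0.0]]))
```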
Accordingly, the likelihood is maximized for the index φ' referring to the largest entry in B_φi, so that

Λ_φi = 1 if Σ_n E[z_φi^n z_φi^n] ≥ Σ_n E[z_ψi^n z_ψi^n] for all ψ ≠ φ, and Λ_φi = 0 otherwise.   (30)

More generally, it is possible to substitute the sparsity constraint with ||Λ_i||_0 = K for K > 1 and maximize L(Λ_i) numerically. The full algorithm for dPCA is summarized in Algorithm 1.

6 Experimental results

The results of the dPCA algorithm applied to the electrophysiological data from the PFC are shown in Fig. 4. With 90% of the total variance in the first fourteen components, dPCA captures a comparable amount of variance as PCA (91.7%). The distribution of variances in the dPCA components is shown in Fig. 4, left. Note that, compared with the distribution in the PCA components (Fig. 1, bottom, center), the dPCA components clearly separate the different sources of variability. More specifically, the neural population is dominated by components that only depend on time (blue), yet also features separate components for the monkey's decision (green) and the perception of the stimulus (light blue). The components of dPCA, of which the six most prominent are displayed in Fig. 4, right, therefore reflect and separate the parameter dependencies of the data, even though these dependencies were completely intermingled on the single neuron level (compare Fig. 1, bottom, left).

7 Conclusions

Dimensionality reduction methods that take labels or parameters into account have recently found a resurgence in interest. Our study was motivated by the specific problems related to electrophysiological data sets. The main aim of our method, demixing parameter dependencies of high-dimensional data sets, may be useful in other contexts as well. 
Very similar problems arise in fMRI data, for instance, and dPCA could provide a useful alternative to other dimensionality reduction methods such as CCA, PLS, or Supervised PCA [1, 12, 5]. Furthermore, the general aim of demixing dependencies could likely be extended to other methods (such as ICA) as well. Ultimately, we see dPCA as a particular data visualization technique that will prove useful if a demixing of parameter dependencies aids in understanding data.

The source code both for Python and Matlab can be found at https://sourceforge.net/projects/dpca/.

References

[1] F. R. Bach and M. I. Jordan. A probabilistic interpretation of canonical correlation analysis. Technical Report 688, University of California, Berkeley, 2005.

[2] C. M. Bishop. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer, 2006.

[3] C. D. Brody, A. Hernández, A. Zainos, and R. Romo. Timing and neural encoding of somatosensory parametric working memory in macaque prefrontal cortex. Cerebral Cortex, 13(11):1196-1207, 2003.

[4] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Springer, 2001.

[5] A. Krishnan, L. J. Williams, A. R. McIntosh, and H. Abdi. Partial least squares (PLS) methods for neuroimaging: a tutorial and review. NeuroImage, 56:455-475, 2011.

[6] P.-O. Löwdin. On the non-orthogonality problem connected with the use of atomic wave functions in the theory of molecules and crystals. The Journal of Chemical Physics, 18(3):365, 1950.

[7] C. K. Machens. Demixing population activity in higher cortical areas. Frontiers in Computational Neuroscience, 4(October):8, 2010.

[8] C. K. Machens, R. Romo, and C. D. Brody. 
Functional, but not anatomical, separation of “what” and “when” in prefrontal cortex. Journal of Neuroscience, 30(1):350-360, 2010.

[9] R. Romo, C. D. Brody, A. Hernández, and L. Lemus. Neuronal correlates of parametric working memory in the prefrontal cortex. Nature, 399(6735):470-473, 1999.

[10] S. Roweis. EM algorithms for PCA and SPCA. Advances in Neural Information Processing Systems, 10:626-632, 1998.

[11] M. E. Tipping and C. M. Bishop. Probabilistic principal component analysis. Journal of the Royal Statistical Society, Series B: Statistical Methodology, 61(3):611-622, 1999.

[12] S. Yu, K. Yu, V. Tresp, H. P. Kriegel, and M. Wu. Supervised probabilistic principal component analysis. Proceedings of the 12th ACM SIGKDD International Conference on KDD, 10, 2006.
", "award": [], "sourceid": 1440, "authors": [{"given_name": "Wieland", "family_name": "Brendel", "institution": null}, {"given_name": "Ranulfo", "family_name": "Romo", "institution": null}, {"given_name": "Christian", "family_name": "Machens", "institution": null}]}