{"title": "Unsupervised Feature Extraction by Time-Contrastive Learning and Nonlinear ICA", "book": "Advances in Neural Information Processing Systems", "page_first": 3765, "page_last": 3773, "abstract": "Nonlinear independent component analysis (ICA) provides an appealing framework for unsupervised feature learning, but the models proposed so far are not identifiable. Here, we first propose a new intuitive principle of unsupervised deep learning from time series which uses the nonstationary structure of the data. Our learning principle, time-contrastive learning (TCL), finds a representation which allows optimal discrimination of time segments (windows). Surprisingly, we show how TCL can be related to a nonlinear ICA model, when ICA is redefined to include temporal nonstationarities. In particular, we show that TCL combined with linear ICA estimates the nonlinear ICA model up to point-wise transformations of the sources, and this solution is unique --- thus providing the first identifiability result for nonlinear ICA which is rigorous, constructive, as well as very general.", "full_text": "Unsupervised Feature Extraction by\n\nTime-Contrastive Learning and Nonlinear ICA\n\nAapo Hyv\u00e4rinen1,2 and Hiroshi Morioka1\n\n1 Department of Computer Science and HIIT\n\nUniversity of Helsinki, Finland\n\n2 Gatsby Computational Neuroscience Unit\n\nUniversity College London, UK\n\nAbstract\n\nNonlinear independent component analysis (ICA) provides an appealing framework\nfor unsupervised feature learning, but the models proposed so far are not identi\ufb01able.\nHere, we \ufb01rst propose a new intuitive principle of unsupervised deep learning\nfrom time series which uses the nonstationary structure of the data. Our learning\nprinciple, time-contrastive learning (TCL), \ufb01nds a representation which allows\noptimal discrimination of time segments (windows). 
Surprisingly, we show how\nTCL can be related to a nonlinear ICA model, when ICA is redefined to include\ntemporal nonstationarities. In particular, we show that TCL combined with linear\nICA estimates the nonlinear ICA model up to point-wise transformations of the\nsources, and this solution is unique, thus providing the first identifiability result\nfor nonlinear ICA which is rigorous, constructive, as well as very general.\n\n1 Introduction\n\nUnsupervised nonlinear feature learning, or unsupervised representation learning, is one of the\nbiggest challenges facing machine learning. Various approaches have been proposed, many of them\nin the deep learning framework. Some of the most popular methods are multi-layer belief nets and\nRestricted Boltzmann Machines [13] as well as autoencoders [14, 31, 21], which form the basis for\nthe ladder networks [30]. While some success has been obtained, the general consensus is that the\nexisting methods are lacking in scalability, theoretical justification, or both; more work is urgently\nneeded to make machine learning applicable to big unlabeled data.\nBetter methods may be found by using the temporal structure in time series data. One approach which\nhas shown great promise recently is based on a set of methods variously called temporal coherence\n[17] or slow feature analysis [32]. The idea is to find features which change as slowly as possible,\noriginally proposed in [6] for learning invariant features. Kernel-based methods [12, 26] and deep\nlearning methods [23, 27, 9] have been developed to extend this principle to the general nonlinear\ncase. 
However, it is not clear how one should optimally de\ufb01ne the temporal stability criterion; these\nmethods typically use heuristic criteria and are not based on generative models.\nIn fact, the most satisfactory solution for unsupervised deep learning would arguably be based\non estimation of probabilistic generative models, because probabilistic theory often gives optimal\nobjectives for learning. This has been possible in linear unsupervised learning, where sparse coding\nand independent component analysis (ICA) use independent, typically sparse, latent variables that\ngenerate the data via a linear mixing. Unfortunately, at least without temporal structure, the nonlinear\nICA model is seriously unidenti\ufb01able [18], which means that the original sources cannot be found.\nIn spite of years of research [20], no generally applicable identi\ufb01ability conditions have been found.\nNevertheless, practical algorithms have been proposed [29, 1, 5] with the hope that some kind of\nuseful solution can still be found even for data with no temporal structure (that is, an i.i.d. sample).\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\n\fFigure 1: An illustration of how we combine a new generative nonlinear ICA model with the new\nlearning principle called time-contrastive learning (TCL). A) The probabilistic generative model is\na nonlinear version of ICA, where the observed signals are given by a nonlinear transformation of\nsource signals, which are mutually independent, and have segment-wise nonstationarity. B) In TCL\nwe train a feature extractor to be sensitive to the nonstationarity of the data by using a multinomial\nlogistic regression which attempts to discriminate between the segments, labelling each data point\nwith the segment label 1, . . . , T . 
The feature extractor and the logistic regression together can be\nimplemented by a conventional multi-layer perceptron with back-propagation training.\n\nHere, we combine a new heuristic principle for analysing temporal structure with a rigorous treatment\nof a nonlinear ICA model, leading to a new identi\ufb01ability proof. The structure of our theory is\nillustrated in Figure 1.\nFirst, we propose to learn features using the (temporal) nonstationarity of the data. The idea is that\nthe learned features should enable discrimination between different time windows; in other words,\nwe search for features that provide maximal information on which part of the time series a given data\npoint comes from. This provides a new, intuitively appealing method for feature extraction, which we\ncall time-contrastive learning (TCL).\nSecond, we formulate a generative model in which independent components have different distri-\nbutions in different time windows, and we observe nonlinear mixtures of the components. While\na special case of this principle, using nonstationary variances, has been very successfully used in\nlinear ICA [22], our extension to the nonlinear case is completely new. Such nonstationarity of\nvariances seems to be prominent in many kinds of data, for example EEG/MEG [2], natural video\n[17], and closely related to changes in volatility in \ufb01nancial time series; but we further generalize the\nnonstationarity to modulated exponential families.\nFinally, we show that as a special case, TCL estimates the nonlinear part of the nonlinear ICA model,\nleaving only a simple linear mixing to be determined by linear ICA, and a \ufb01nal indeterminacy in\nterms of a component-wise nonlinearity similar to squaring. For modulated Gaussian sources, even\nthe squaring can be removed and we have \u201cfull\u201d identi\ufb01ability. 
This gives the very first identifiability\nproof for a high-dimensional, nonlinear, ICA mixing model, together with a practical method for\nits estimation.\n\n2 Time-contrastive learning\n\nTCL is a method to train a feature extractor by using a multinomial logistic regression (MLR)\nclassifier which aims to discriminate all segments (time windows) in a time series, given the segment\nindices as the labels of the data points. In more detail, TCL proceeds as follows:\n\n1. Divide a multivariate time series xt into segments, i.e. time windows, indexed by \u03c4 = 1, . . . , T . Any temporal segmentation method can be used, e.g. simple equal-sized bins.\n\n2. Associate each data point with the corresponding segment index \u03c4 in which the data point is contained; i.e. the data points in the segment \u03c4 are all given the same segment label \u03c4.\n\n3. Learn a feature extractor h(xt; \u03b8) together with an MLR with a linear regression function\nw_\u03c4^T h(xt; \u03b8) + b_\u03c4 to classify all data points with the corresponding segment labels \u03c4 used as\nclass labels Ct, as defined above. (For example, by ordinary deep learning with h(xt; \u03b8)\nbeing outputs in the last hidden layer and \u03b8 being network weights.)\n\nThe purpose of the feature extractor is to extract a feature vector that enables the MLR to discriminate\nthe segments. Therefore, it seems intuitively clear that the feature extractor needs to learn a useful\nrepresentation of the temporal structure of the data, in particular the differences of the distributions\nacross segments. Thus, we are effectively using a classification method (MLR) to accomplish\nunsupervised learning. 
Methods such as noise-contrastive estimation [11] and generative adversarial\nnets [8], see also [10], are similar in spirit, but clearly distinct from TCL which uses the temporal\nstructure of the data by contrasting different time segments.\nIn practice, the feature extractor needs to be capable of approximating a general nonlinear relationship\nbetween the data points and the log-odds of the classes, and it must be easy to learn from data\nsimultaneously with the MLR. To satisfy these requirements, we use here a multilayer perceptron\n(MLP) as the feature extractor. Essentially, we use ordinary MLP/MLR training according to very\nwell-known neural network theory, with the last hidden layer working as the feature extractor. Note\nthat the MLR is here only used as an instrument for training the feature extractor, and has no practical\nmeaning after the training.\n\n3 TCL as approximator of log-pdf ratios\n\nWe next show how the combination of the optimally discriminative feature extractor and MLR learns\nto model the nonstationary probability density functions (pdf\u2019s) of the data. The posterior over classes\nfor one data point xt in the multinomial logistic regression of TCL is given by well-known theory as\n\n$$p(C_t = \\tau \\mid x_t; \\theta, W, b) = \\frac{\\exp(w_\\tau^T h(x_t; \\theta) + b_\\tau)}{1 + \\sum_{j=2}^{T} \\exp(w_j^T h(x_t; \\theta) + b_j)} \\qquad (1)$$\n\nwhere Ct is a class label of the data at time t, xt is the n-dimensional data point at time t, \u03b8 is the\nparameter vector of the m-dimensional feature extractor (MLP) denoted by h, W = [w1, . . . , wT ] \u2208\nR^{m\u00d7T}, and b = [b1, . . . , bT ]^T are the weight and bias parameters of the MLR. 
We fixed the elements\nof w1 and b1 to zero to avoid the well-known indeterminacy of the softmax function.\nOn the other hand, the true posteriors of the segment labels can be written, by the Bayes rule, as\n\n$$p(C_t = \\tau \\mid x_t) = \\frac{p_\\tau(x_t)\\, p(C_t = \\tau)}{\\sum_{j=1}^{T} p_j(x_t)\\, p(C_t = j)}, \\qquad (2)$$\n\nwhere p(Ct = \u03c4) is a prior distribution of the segment label \u03c4, and p\u03c4(xt) = p(xt|Ct = \u03c4).\nAssume that the feature extractor has a universal approximation capacity (in the sense of well-known\nneural network theory), and that the amount of data is infinite, so that the MLR converges to the\noptimal classifier. Then, we will have equality between the model posterior in Eq. (1) and the true\nposterior in Eq. (2) for all \u03c4. Well-known developments, intuitively based on equating the numerators\nin those equations and taking the pivot into account, lead to the relationship\n\n$$w_\\tau^T h(x_t; \\theta) + b_\\tau = \\log p_\\tau(x_t) - \\log p_1(x_t) + \\log \\frac{p(C_t = \\tau)}{p(C_t = 1)}, \\qquad (3)$$\n\nwhere the last term on the right-hand side is zero if the segments have equal prior probability (i.e.\nequal length). In other words, what the feature extractor computes after TCL training (under optimal\nconditions) is the log-pdf of the data point in each segment (relative to that in the first segment which\nwas chosen as pivot above). This gives a clear probabilistic interpretation of the intuitive principle of\nTCL, and will be used below to show its connection to nonlinear ICA.\n\n4 Nonlinear nonstationary ICA model\n\nIn this section, seemingly unrelated to the preceding section, we define a probabilistic generative\nmodel; the connection will be explained in the next section. We assume, as is typical in nonlinear ICA,\nthat the observed multivariate time series xt is a smooth and invertible nonlinear mixture of a vector\nof source signals st = (s1(t), . . . 
, sn(t)); in other words:\n\n$$x_t = f(s_t). \\qquad (4)$$\n\nThe components si(t) in st are assumed mutually independent over i (but not over time t). The\ncrucial question is how to define a suitable model for the sources, which is general enough while\nallowing strong identifiability results.\nHere, we start with the fundamental assumption that the source signals si(t) are nonstationary,\nand use such nonstationarity for source separation. For example, the variances (or similar scaling\ncoefficients) could be changing as proposed earlier in the linear case [22, 24, 16]. We generalize\nthat idea and propose a generative model for nonstationary sources based on the exponential family.\nMerely for mathematical convenience, we assume that the nonstationarity is much slower than the\nsampling rate, so the time series can be divided into segments in each of which the distribution is\napproximately constant (but the distribution is different in different segments). The log-pdf of the\nsource signal with index i in the segment \u03c4 is then defined as:\n\n$$\\log p_\\tau(s_i) = q_{i,0}(s_i) + \\sum_{v=1}^{V} \\lambda_{i,v}(\\tau)\\, q_{i,v}(s_i) - \\log Z(\\lambda_{i,1}(\\tau), \\ldots, \\lambda_{i,V}(\\tau)) \\qquad (5)$$\n\nwhere qi,0 is a \u201cstationary baseline\u201d log-pdf of the source, and the qi,v, v \u2265 1 are nonlinear scalar\nfunctions defining the exponential family for source i; the index t is dropped for simplicity. The\nessential point is that the parameters \u03bbi,v(\u03c4) of the source i depend on the segment index \u03c4, which\ncreates nonstationarity. The normalization constant Z disappears in all our proofs below.\nA simple example would be obtained by setting qi,0 = 0, V = 1, i.e., using a single modulated\nfunction qi,1 with qi,1(si) = \u2212s_i^2/2, which means that the variance of a Gaussian source is modulated,\nor qi,1(si) = \u2212|si|, a modulated Laplacian source. 
Another interesting option might be to use\ntwo nonlinearities similar to \u201crectified linear units\u201d (ReLU) given by qi,1(si) = \u2212 max(si, 0) and\nqi,2(si) = \u2212 max(\u2212si, 0) to model both changes in scale (variance) and location (mean). Yet another\noption is to use a Gaussian baseline qi,0(si) = \u2212s_i^2/2 with a nonquadratic function qi,1.\nOur definition thus generalizes the linear model [22, 24, 16] to the nonlinear case, as well as to very\ngeneral modulated non-Gaussian densities by allowing qi,v to be non-quadratic, using more than one\nqi,v per source (i.e. we can have V > 1) as well as a non-stationary baseline. We emphasize that our\nprinciple of nonstationarity is clearly distinct from the principle of linear autocorrelations previously\nused in the nonlinear case [12, 26]. Note further that some authors prefer to use the term blind source\nseparation (BSS) for generative models with temporal structure.\n\n5 Solving nonlinear ICA by TCL\n\nNow we consider the case where TCL as defined in Section 2 is applied to data generated by the\nnonlinear ICA model in Section 4. We refer again to Figure 1 which illustrates the total system. For\nsimplicity, we consider the case qi,0 = 0, V = 1, i.e. the exponential family has a single modulated\nfunction qi,1 per source, and this function is the same for all sources; we will discuss the general case\nseparately below. The modulated function will be simply denoted by q := qi,1 in the following.\nFirst, we show that the nonlinear functions q(si), i = 1, . . . , n, of the sources can be obtained as\nunknown linear transformations of the outputs of the feature extractor hi trained by TCL:\nTheorem 1. Assume the following:\n\nA1. We observe data which is obtained by generating independent sources1 according to (5),\nand mixing them as in (4) with a smooth invertible f. For simplicity, we assume only a single\nfunction defining the exponential family, i.e. 
qi,0 = 0, V = 1 and q := qi,1 as explained\nabove.\n\nA2. We apply TCL to the data so that the dimension of the feature extractor h is equal to the dimension of the data vector xt, i.e., m = n.\n\nA3. The modulation parameter matrix L with elements [L]_{\u03c4,i} = \u03bbi,1(\u03c4) \u2212 \u03bbi,1(1), \u03c4 = 1, . . . , T ; i = 1, . . . , n has full column rank n. (Intuitively: the variances of the components are modulated sufficiently independently of each other. Note that many segments\nare actually allowed to have equal distributions since this matrix is typically very tall.)\n\nThen, in the limit of infinite data, the outputs of the feature extractor are equal to q(s) =\n(q(s1), q(s2), . . . , q(sn))^T up to an invertible linear transformation. In other words,\n\n$$q(s_t) = A h(x_t; \\theta) + d \\qquad (6)$$\n\nfor some constant invertible matrix A \u2208 R^{n\u00d7n} and a constant vector d \u2208 R^n.\n\n1 More precisely: the sources are generated independently given the \u03bbi,v. Depending on how the \u03bbi,v are generated, there may or may not be marginal dependency between the si; see Corollary 1 below.\n\nSketch of proof: (see Supplementary Material for full proof) The basic idea is that after convergence\nwe must have equality between the model of the log-pdf in each segment given by TCL in Eq. (3)\nand that given by nonlinear ICA, obtained by summing the RHS of Eq. (5) over i:\n\n$$w_\\tau^T h(x; \\theta) - k_1(x) = \\sum_{i=1}^{n} \\lambda_{i,1}(\\tau)\\, q(s_i) - k_2(\\tau) \\qquad (7)$$\n\nwhere k1 does not depend on \u03c4, and k2(\u03c4) does not depend on x or s. We see that the functions hi(x)\nand q(si) must span the same linear subspace. (TCL looks at differences of log-pdf\u2019s, introducing\nthe baseline k1(x), but this does not actually change the subspace). 
This implies that the q(si) must\nbe equal to some invertible linear transformation of h(x; \u03b8) and a constant bias term, which gives\n(6).\nTo further estimate the linear transformation A in (6), we can simply use linear ICA, under a further\nindependence assumption regarding the generation of the \u03bbi,1:\nCorollary 1. Assume the \u03bbi,1 are randomly generated, independently for each i. The estimation\n(identification) of the q(si) can then be performed by first performing TCL, and then linear ICA on\nthe hidden representation h(x).\n\nProof: We only need to combine the well-known identifiability proof of linear ICA [3] with Theorem 1,\nnoting that the quantities q(si) are now independent, and since q has a strict upper bound (which is\nnecessary for integrability), q(si) must be non-Gaussian.\nIn general, TCL followed by linear ICA does not allow us to exactly recover the independent\ncomponents because the function q(\u00b7) can hardly be invertible, typically being something like\nsquaring or absolute values. However, for a specific class of q including the modulated Gaussian\nfamily, we can prove a stricter form of identifiability. Slightly counterintuitively, we can recover the\nsigns of the si, since we also know the corresponding x and the transformation is invertible:\nCorollary 2. Assume q(s) is a strictly monotonic function of |s|. Then, we can further identify the\noriginal si, up to strictly monotonic transformations of each source.\nProof: To make p\u03c4(s) integrable, necessarily q(s) \u2192 \u2212\u221e when |s| \u2192 \u221e, and q(s) must have a\nfinite maximum, which we can set to zero without restricting generality. For each fixed i, consider the\nmanifold defined by q(gi(x)) = 0, where g denotes the inverse of the mixing f, so that si = gi(x). By invertibility of g, this divides the space of x into two halves.\nIn one half, define \u02dcsi = q(si), and in the other, \u02dcsi = \u2212q(si). 
With such \u02dcsi, we have thus recovered\nthe original sources, up to the strictly monotonic transformation \u02dcsi = c sign(si)q(si), where c is\neither +1 or \u22121. (Note that in general, the si are meaningfully defined only up to a strictly monotonic\ntransformation, analogous to multiplication by an arbitrary constant in the linear case [3].)\n\nSummary of Theory What we have proven is that in the special case of a single q(s) which is a\nmonotonic function of |s|, our nonlinear ICA model is identifiable, up to inevitable component-wise\nmonotonic transformations. We also provided a practical method for the estimation of the nonlinear\ntransformations q(si) for any general q, given by TCL followed by linear ICA. (The method provided\nfor \u201cinverting\u201d q in the proof of Corollary 2 may be very difficult to implement in practice.)\n\nExtensions First, allowing a stationary baseline qi,0 does not change the Theorem at all, and a\nweaker form of Corollary 1 holds as well. Second, with many qi,v (V > 1), the left-hand side of (6)\nwill have Vn entries given by all the possible qi,v(si), and the dimension of the feature extractor must\nbe equally increased; the condition of full rank on L is likewise more complicated. Corollary 1 must\nthen consider an independent subspace model, but it can still be proven in the same way. (The details\nand the proof will be presented in a later paper.) Third, the case of combining ICA with dimension\nreduction is treated in Supplementary Material.\n\n6 Simulation on artificial data\n\nData generation We created data from the nonlinear ICA model in Section 4, using the simplified\ncase of the Theorem (a single function q) as follows. Nonstationary source signals (n = 20, segment\nlength 512) were randomly generated by modulating Laplacian sources by \u03bbi,1(\u03c4) randomly drawn\nso that the std\u2019s inside the segments have a uniform distribution in [0, 1]. 
As the nonlinear mixing\nfunction f (s), we used an MLP (\u201cmixing-MLP\u201d). In order to guarantee that the mixing-MLP is\ninvertible, we used leaky ReLU\u2019s and the same number of units in all layers.\n\nTCL settings, training, and \ufb01nal linear ICA As the feature extractor to be trained by TCL, we\nadopted an MLP (\u201cfeature-MLP\u201d). The segmentation in TCL was the same as in the data generation,\nand the number of layers was the same in the mixing-MLP and the feature-MLP. Note that when\nL = 1, both the mixing-MLP and feature-MLP are a one layer model, and then the observed signals\nare simply linear mixtures of the source signals as in a linear ICA model. As in the Theorem, we\nset m = n. As the activation function in the hidden layers, we used a \u201cmaxout\u201d unit, constructed by\ntaking the maximum across G = 2 af\ufb01ne fully connected weight groups. However, the output layer\nhas \u201cabsolute value\u201d activation units exclusively. This is because the output of the feature-MLP (i.e.,\nh(x; \u03b8)) should resemble q(s), based on Theorem 1, and here we used the Laplacian distribution\nfor generating the sources. The initial weights of each layer were randomly drawn from a uniform\ndistribution for each layer, scaled as in [7]. To train the MLP, we used back-propagation with a\nmomentum term. To avoid over\ufb01tting, we used (cid:96)2 regularization for the feature-MLP and MLR.\nAccording to Corollary 1 above, after TCL we further applied linear ICA (FastICA, [15]) to the\nh(x; \u03b8), and used its outputs as the \ufb01nal estimates of q(si). To evaluate the performance of source\nrecovery, we computed the mean correlation coef\ufb01cients between the true q(si) and their estimates.\nFor comparison, we also applied a linear ICA method based on nonstationarity of variance (NSVICA)\n[16], a kernel-based nonlinear ICA method (kTDSEP) [12], and a denoising autoencoder (DAE) [31]\nto the observed data. 
We took absolute values of the estimated sources to make a fair comparison\nwith TCL. In kTDSEP, we selected the 20 estimated components with the highest correlations with\nthe source signals. We initialized the DAE by the stacked DAE scheme [31], and sigmoidal units\nwere used in the hidden layers; we omitted the case L > 3 because of instability of training.\n\nResults Figure 2a) shows that after training the feature-MLP by TCL, the MLR achieved higher\nclassification accuracies than chance level, which implies that the feature-MLP was able to learn\na representation of the data nonstationarity. (Here, chance level denotes the performance of the\nMLP with a randomly initialized feature-MLP.) We can see that the larger the number of layers is\n(which means that the nonlinearity in the mixing-MLP is stronger), the more difficult it is to train the\nfeature-MLP and the MLR. The classification accuracy also goes down when the number of segments\nincreases, since when there are more and more classes, some of them will inevitably have very similar\ndistributions and are thus difficult to discriminate; this is why we computed the chance level as above.\nFigure 2b) shows that the TCL method could reconstruct the q(si) reasonably well even for the\nnonlinear mixture case (L > 1), while all other methods failed (NSVICA obviously performed very\nwell in the linear case). The figure also shows that (1) the larger the number of segments (amount of\ndata) is, the higher the performance of the TCL method is (i.e. the method seems to converge), and\n(2) again, more layers make learning more difficult.\nTo summarize, this simulation confirms that TCL is able to estimate the nonlinear ICA model based\non nonstationarity. 
Using more data increases performance, perhaps obviously, while making the\nmixing more nonlinear decreases performance.\n\n7 Experiments on real brain imaging data\n\nTo evaluate the applicability of TCL to real data, we applied it to magnetoencephalography (MEG),\ni.e. measurements of the electrical activity in the human brain. In particular, we used data measured\nin a resting-state session, during which the subjects had no task and received no\nparticular stimulation. In recent years, many studies have shown the existence of networks of brain\nactivity in resting state, with MEG as well [2, 4]. Such networks mean that the data is nonstationary,\nand thus this data provides an excellent target for TCL.\n\nFigure 2: Simulation on artificial data. a) Mean classification accuracies of the MLR in TCL, as a\nfunction of the numbers of layers and segments. (Accuracies are on training data since it is not obvious\nhow to define test data.) Note that chance levels (dotted lines) change as a function of the number\nof segments (see text). The MLR achieved higher accuracy than chance level. b) Mean absolute\ncorrelation coefficients between the true q(s) and the features learned by TCL (solid line) and, for\ncomparison: nonstationarity-based linear ICA (NSVICA, dashed line), kernel-based nonlinear ICA\n(kTDSEP, dotted line), and denoising autoencoder (DAE, dash-dot line). TCL has much higher\ncorrelations than DAE or kTDSEP, and in the nonlinear case (L > 1), higher than NSVICA.\n\nData and preprocessing We used MEG data from an earlier neuroimaging study [25], graciously\nprovided by P. Ramkumar. MEG signals were measured from nine healthy volunteers by a Vectorview\nhelmet-shaped neuromagnetometer at a sampling rate of 600 Hz with 306 channels. The experiment\nconsisted of two kinds of sessions, i.e., resting sessions (2 sessions of 10 min) and task sessions (2\nsessions of 12 min). 
In the task sessions, the subjects were exposed to a sequence of 6\u201333 s blocks of\nauditory, visual and tactile stimuli, which were interleaved with 15 s rest periods. We exclusively\nused the resting-session data for the training of the network, and task-session data was only used in\nthe evaluation. The modality of the sensory stimulation (incl. no stimulation, i.e. rest) provided a\nclass label that we used in the evaluation, giving in total four classes. We preprocessed the MEG\nsignals by Morlet \ufb01ltering around the alpha frequency band.\n\nTCL settings We used segments of equal size, of length 12.5 s or 625 data points (downsampling\nto 50 Hz); the length was based on prior knowledge about the time-scale of resting-state networks.\nThe number of layers took the values L \u2208 {1, 2, 3, 4}, and the number of nodes of each hidden\nlayer was a function of L so that we always \ufb01xed the number of output layer nodes to 10, and\nincreased gradually the number of nodes when going to earlier layer as L = 1 : 10, L = 2 : 20 \u2212 10,\nL = 3 : 40 \u2212 20 \u2212 10, and L = 4 : 80 \u2212 40 \u2212 20 \u2212 10. We used ReLU\u2019s in the middle layers, and\nadaptive units \u03c6(x) = max(x, ax) exclusively for the output layer, which is more \ufb02exible than the\n\u201cabsolute value\u201d unit used in the Simulation above. To prevent over\ufb01tting, we applied dropout [28] to\ninputs, and batch normalization [19] to hidden layers. Since different subjects and sessions are likely\nto have uninteresting technical differences, we used a multi-task learning scheme, with a separate\ntop-layer MLR classi\ufb01er for each measurement session and subject, but a shared feature-MLP. (In\nfact, if we use the MLR to discriminate all segments of all sessions, it tends to mainly learn such\ninter-subject and inter-session differences.) 
Otherwise, all the settings were as in Section 6.\n\nEvaluation methods To evaluate the obtained features, we performed classification of the sensory\nstimulation categories (modalities) by applying feature extractors trained with (unlabeled) resting-session\ndata to (labeled) task-session data. Classification was performed using a linear support\nvector machine (SVM) classifier trained on the stimulation modality labels, and its performance was\nevaluated by a session-average of session-wise one-block-out cross-validation (CV) accuracies. The\nhyperparameters of the SVM were determined by nested CV without using the test data. The average\nactivities of the feature extractor during each block were used as feature vectors in the evaluation\nof TCL features. However, we used log-power activities for the other (baseline) methods because\nthe average activities had much lower performance with those methods. We balanced the number of\nblocks between the four categories. We measured the CV accuracy 10 times by changing the initial\nvalues of the feature extractor training, and showed their average performance. We also visualized\nthe spatial activity patterns obtained by TCL, using weighted-averaged sensor signals; i.e., the sensor\nsignals are averaged while weighted by the activities of the feature extractor.\n\n[Figure 2 plots: a) accuracy (%) versus number of segments (8 to 512) for L = 1, . . . , 5, with the corresponding chance levels; b) mean correlation versus number of segments (8 to 512) for TCL, NSVICA and kTDSEP (L = 1, . . . , 5) and DAE (L = 1, 2, 3).]\n\nFigure 3: Real MEG data. a) Classification accuracies of linear SVMs newly trained with task-session\ndata to predict stimulation labels in task-sessions, with feature extractors trained in advance\nwith resting-session data. 
Error bars give standard errors of the mean across ten repetitions. For TCL\nand DAE, accuracies are given for different numbers of layers L. Horizontal line shows the chance\nlevel (25%). b) Example of spatial patterns of nonstationary components learned by TCL. Each\nsmall panel corresponds to one spatial pattern with the measurement helmet seen from three different\nangles (left, back, right); red/yellow is positive and blue is negative. L3: approximate total spatial\npattern of one selected third-layer unit. L2: the patterns of the three second-layer units maximally\ncontributing to this L3 unit. L1: for each L2 unit, the two most strongly contributing \ufb01rst-layer units.\n\nResults Figure 3a) shows the comparison of classi\ufb01cation accuracies between the different methods,\nfor different numbers of layers L = {1, 2, 3, 4}. The classi\ufb01cation accuracies by the TCL method\nwere consistently higher than those by the other (baseline) methods.2 We can also see a superior\nperformance of multi-layer networks (L \u2265 3) compared with that of the linear case (L = 1), which\nindicates the importance of nonlinear demixing in the TCL method.\nFigure 3b) shows an example of spatial patterns learned by the TCL method. For simplicity of\nvisualization, we plotted spatial patterns for the three-layer model. We manually picked one out of\nthe ten hidden nodes from the third layer, and plotted its weighted-averaged sensor signals (Figure 3b,\nL3). We also visualized the most strongly contributing second- and \ufb01rst-layer nodes. We see\nprogressive pooling of L1 units to form left temporal, right temporal, and occipito-parietal patterns in\nL2, which are then all pooled together in the L3 resulting in a bilateral temporal pattern with negative\ncontribution from the occipito-parietal region. 
Most of the spatial patterns in the third layer (not\nshown) are actually similar to those previously reported using functional magnetic resonance imaging\n(fMRI), and MEG [2, 4]. Interestingly, none of the hidden units seems to represent artefacts (i.e.\nnon-brain signals), in contrast to ordinary linear ICA of EEG or MEG.\n\n8 Conclusion\nWe proposed a new learning principle for unsupervised feature (representation) learning. It is based\non analyzing nonstationarity in temporal data by discriminating between time segments. The ensuing\n\u201ctime-contrastive learning\u201d is easy to implement since it only uses ordinary neural network training: a\nmulti-layer perceptron with logistic regression. However, we showed that, surprisingly, it can estimate\nindependent components in a nonlinear mixing model up to certain indeterminacies, assuming that\nthe independent components are nonstationary in a suitable way. The indeterminacies include a linear\nmixing (which can be resolved by a further linear ICA step), and component-wise nonlinearities,\nsuch as squares or absolute values. TCL also avoids the computation of the gradient of the Jacobian,\nwhich is a major problem with maximum likelihood estimation [5].\nOur developments also give by far the strongest identi\ufb01ability proof of nonlinear ICA in the literature.\nThe indeterminacies actually reduce to just inevitable monotonic component-wise transformations in\nthe case of modulated Gaussian sources. Thus, our results pave the way for further developments in\nnonlinear ICA, which has so far seriously suffered from the lack of almost any identi\ufb01ability theory,\nand provide a new principled approach to unsupervised deep learning.\nExperiments on real MEG found neuroscienti\ufb01cally interesting networks. 
Other promising future\napplication domains include video data, econometric data, and biomedical data such as EMG and\nECG, in which nonstationary variances seem to play a major role.3\n\n2 Note that classi\ufb01cation using the \ufb01nal linear ICA is equivalent to using whitening since ICA only makes a\nfurther orthogonal rotation.\n\n3 This research was supported in part by JSPS KAKENHI 16J08502 and the Academy of Finland.\n\nReferences\n[1] L. B. Almeida. MISEP\u2014linear and nonlinear ICA based on mutual information. J. of Machine Learning\n\nResearch, 4:1297\u20131318, 2003.\n\n[2] M. J. Brookes et al. Investigating the electrophysiological basis of resting state networks using\n\nmagnetoencephalography. Proc. Natl. Acad. Sci., 108(40):16783\u201316788, 2011.\n\n[3] P. Comon. Independent component analysis\u2014a new concept? Signal Processing, 36:287\u2013314, 1994.\n[4] F. de Pasquale et al. A cortical core for dynamic integration of functional networks in the resting human\n\nbrain. Neuron, 74(4):753\u2013764, 2012.\n\n[5] L. Dinh, D. Krueger, and Y. Bengio. NICE: Non-linear independent components estimation.\n\narXiv:1410.8516 [cs.LG], 2015.\n\n[6] P. F\u00f6ldi\u00e1k. Learning invariance from transformation sequences. Neural Computation, 3:194\u2013200, 1991.\n[7] X. Glorot and Y. Bengio. Understanding the dif\ufb01culty of training deep feedforward neural networks. In\n\nAISTATS\u201910, 2010.\n\n[8] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio.\n\nGenerative adversarial nets. In NIPS, pages 2672\u20132680, 2014.\n\n[9] R. Goroshin, J. Bruna, J. Tompson, D. Eigen, and Y. LeCun. Unsupervised feature learning from temporal\n\ndata. arXiv:1504.02518, 2015.\n\n[10] M. U. Gutmann, R. Dutta, S. Kaski, and J. Corander. Likelihood-free inference via classi\ufb01cation.\n\narXiv:1407.4981 [stat.CO], 2014.\n\n[11] M. U. 
Gutmann and A. Hyv\u00e4rinen. Noise-contrastive estimation of unnormalized statistical models, with\n\napplications to natural image statistics. J. of Machine Learning Research, 13:307\u2013361, 2012.\n\n[12] S. Harmeling, A. Ziehe, M. Kawanabe, and K.-R. M\u00fcller. Kernel-based nonlinear blind source separation.\n\nNeural Comput., 15(5):1089\u20131124, 2003.\n\n[13] G. E. Hinton. Learning multiple layers of representation. Trends Cogn. Sci., 11:428\u2013434, 2007.\n[14] G. E. Hinton and R. S. Zemel. Autoencoders, minimum description length, and Helmholtz free energy. Adv.\n\nNeural Inf. Process. Syst., 1994.\n\n[15] A. Hyv\u00e4rinen. Fast and robust \ufb01xed-point algorithms for independent component analysis. IEEE Trans.\n\nNeural Netw., 10(3):626\u2013634, 1999.\n\n[16] A. Hyv\u00e4rinen. Blind source separation by nonstationarity of variance: A cumulant-based approach. IEEE\n\nTransactions on Neural Networks, 12(6):1471\u20131474, 2001.\n\n[17] A. Hyv\u00e4rinen, J. Hurri, and P. O. Hoyer. Natural Image Statistics. Springer-Verlag, 2009.\n[18] A. Hyv\u00e4rinen and P. Pajunen. Nonlinear independent component analysis: Existence and uniqueness\n\nresults. Neural Netw., 12(3):429\u2013439, 1999.\n\n[19] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal\n\ncovariate shift. CoRR, abs/1502.03167, 2015.\n\n[20] C. Jutten, M. Babaie-Zadeh, and J. Karhunen. Nonlinear mixtures. Handbook of Blind Source Separation,\n\nIndependent Component Analysis and Applications, pages 549\u2013592, 2010.\n\n[21] D. P. Kingma and M. Welling. Auto-encoding variational bayes. arXiv:1312.6114 [stat.ML], 2014.\n[22] K. Matsuoka, M. Ohya, and M. Kawamoto. A neural net for blind separation of nonstationary signals.\n\nNeural Netw., 8(3):411\u2013419, 1995.\n\n[23] H. Mobahi, R. Collobert, and J. Weston. Deep learning from temporal coherence in video. 
In Proceedings\n\nof the 26th Annual International Conference on Machine Learning, pages 737\u2013744, 2009.\n\n[24] D.-T. Pham and J.-F. Cardoso. Blind separation of instantaneous mixtures of non stationary sources. IEEE\n\nTrans. Signal Processing, 49(9):1837\u20131848, 2001.\n\n[25] P. Ramkumar, L. Parkkonen, R. Hari, and A. Hyv\u00e4rinen. Characterization of neuromagnetic brain rhythms\nover time scales of minutes using spatial independent component analysis. Hum. Brain Mapp., 33(7):1648\u2013\n1662, 2012.\n\n[26] H. Sprekeler, T. Zito, and L. Wiskott. An extension of slow feature analysis for nonlinear blind source\n\nseparation. J. of Machine Learning Research, 15(1):921\u2013947, 2014.\n\n[27] J. T. Springenberg and M. Riedmiller. Learning temporal coherent features through life-time sparsity. In\n\nNeural Information Processing, pages 347\u2013356. Springer, 2012.\n\n[28] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A simple way to\n\nprevent neural networks from over\ufb01tting. J. Mach. Learn. Res., 15(1):1929\u20131958, 2014.\n\n[29] Y. Tan, J. Wang, and J.M. Zurada. Nonlinear blind source separation using a radial basis function network.\n\nIEEE Transactions on Neural Networks, 12(1):124\u2013134, 2001.\n\n[30] H. Valpola. From neural PCA to deep unsupervised learning. In Advances in Independent Component\n\nAnalysis and Learning Machines, pages 143\u2013171. Academic Press, 2015.\n\n[31] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol. Stacked denoising autoencoders:\nLearning useful representations in a deep network with a local denoising criterion. J. Mach. Learn. Res.,\n11:3371\u20133408, 2010.\n\n[32] L. Wiskott and T. J. Sejnowski. Slow feature analysis: Unsupervised learning of invariances. 
Neural\n\nComput., 14(4):715\u2013770, 2002.", "award": [], "sourceid": 1870, "authors": [{"given_name": "Aapo", "family_name": "Hyvarinen", "institution": "University of Helsinki"}, {"given_name": "Hiroshi", "family_name": "Morioka", "institution": "University of Helsinki"}]}