{"title": "Entropy and mutual information in models of deep neural networks", "book": "Advances in Neural Information Processing Systems", "page_first": 1821, "page_last": 1831, "abstract": "We examine a class of stochastic deep learning models with a tractable method to compute information-theoretic quantities. Our contributions are three-fold: (i) We show how entropies and mutual informations can be derived from heuristic statistical physics methods, under the assumption that weight matrices are independent and orthogonally-invariant. (ii) We extend particular cases in which this result is known to be rigorously exact by providing a proof for two-layers networks with Gaussian random weights, using the recently introduced adaptive interpolation method. (iii) We propose an experiment framework with generative models of synthetic datasets, on which we train deep neural networks with a weight constraint designed so that the assumption in (i) is verified during learning. We study the behavior of entropies and mutual information throughout learning and conclude that, in the proposed setting, the relationship between compression and generalization remains elusive.", "full_text": "Entropy and mutual information in\n\nmodels of deep neural networks\n\nMarylou Gabri\u00e9\u22171, Andre Manoel2,3, Cl\u00e9ment Luneau4, Jean Barbier1,4,5, Nicolas Macris4,\n\nFlorent Krzakala1,6,7 and Lenka Zdeborov\u00e13,6\n\n1Laboratoire de Physique Statistique, \u00c9cole Normale Sup\u00e9rieure, PSL University\n2Parietal Team, INRIA, CEA, Universit\u00e9 Paris-Saclay & Owkin Inc., New York\n\n3Institut de Physique Th\u00e9orique, CEA, CNRS, Universit\u00e9 Paris-Saclay\n\n4Laboratoire de Th\u00e9orie des Communications, \u00c9cole Polytechnique F\u00e9d\u00e9rale de Lausanne\n\n5International Center for Theoretical Physics, Trieste, Italy\n6Department of Mathematics, Duke University, Durham NC\n\n7Sorbonne Universit\u00e9s & LightOn Inc., Paris\n\nAbstract\n\nWe examine a class of 
stochastic deep learning models with a tractable method to compute information-theoretic quantities. Our contributions are three-fold: (i) We show how entropies and mutual informations can be derived from heuristic statistical physics methods, under the assumption that weight matrices are independent and orthogonally-invariant. (ii) We extend particular cases in which this result is known to be rigorously exact by providing a proof for two-layer networks with Gaussian random weights, using the recently introduced adaptive interpolation method. (iii) We propose an experimental framework with generative models of synthetic datasets, on which we train deep neural networks with a weight constraint designed so that the assumption in (i) is verified during learning. We study the behavior of entropies and mutual informations throughout learning and conclude that, in the proposed setting, the relationship between compression and generalization remains elusive.

The successes of deep learning methods have spurred efforts towards quantitative modeling of the performance of deep neural networks. In particular, an information-theoretic approach linking generalization capabilities to compression has been receiving increasing interest. The intuition behind the study of mutual informations in latent variable models dates back to the information bottleneck (IB) theory of [1]. Although recently reformulated in the context of deep learning [2], verifying its relevance in practice requires the computation of mutual informations for high-dimensional variables, a notoriously hard problem.
Thus, pioneering works in this direction focused either on small network models with discrete (continuous, eventually binned) activations [3], or on linear networks [4, 5].
In the present paper we follow a different direction, and build on recent results from statistical physics [6, 7] and information theory [8, 9] to propose, in Section 1, a formula to compute information-theoretic quantities for a class of deep neural network models. The models we approach, described in Section 2, are non-linear feed-forward neural networks trained on synthetic datasets with constrained weights. Such networks capture some of the key properties of the deep learning setting that are usually difficult to include in tractable frameworks: non-linearities, arbitrary large width and depth, and correlations in the input data. We demonstrate the proposed method in a series of numerical experiments in Section 3. First observations suggest a rather complex picture, where the role of compression in the generalization ability of deep neural networks is yet to be elucidated.

*Corresponding author: marylou.gabrie@ens.fr

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

1 Multi-layer model and main theoretical results

A stochastic multi-layer model— We consider a model of multi-layer stochastic feed-forward neural network where each element x_i of the input layer x ∈ R^{n_0} is distributed independently as P_0(x_i), while hidden units t_{ℓ,i} at each successive layer t_ℓ ∈ R^{n_ℓ} (vectors are column vectors) come from P_ℓ(t_{ℓ,i} | W_{ℓ,i}^T t_{ℓ−1}), with t_0 ≡ x and W_{ℓ,i} denoting the i-th row of the matrix of weights W_ℓ ∈ R^{n_ℓ × n_{ℓ−1}}. In other words,

    t_{0,i} ≡ x_i ∼ P_0(·),   t_{1,i} ∼ P_1(· | W_{1,i}^T x),   . . . ,   t_{L,i} ∼ P_L(· | W_{L,i}^T t_{L−1}),    (1)

given a set of weight matrices {W_ℓ}_{ℓ=1}^L and distributions {P_ℓ}_{ℓ=1}^L which encode possible non-linearities and stochastic noise applied to the hidden layer variables, and P_0 that generates the visible variables. In particular, for a non-linearity t_{ℓ,i} = φ_ℓ(h, ξ_{ℓ,i}), where ξ_{ℓ,i} ∼ P_ξ(·) is the stochastic noise (independent for each i), we have P_ℓ(t_{ℓ,i} | W_{ℓ,i}^T t_{ℓ−1}) = ∫ dP_ξ(ξ_{ℓ,i}) δ(t_{ℓ,i} − φ_ℓ(W_{ℓ,i}^T t_{ℓ−1}, ξ_{ℓ,i})).
Model (1) thus describes a Markov chain which we denote by X → T_1 → T_2 → · · · → T_L, with T_ℓ = φ_ℓ(W_ℓ T_{ℓ−1}, ξ_ℓ), ξ_ℓ = {ξ_{ℓ,i}}_{i=1}^{n_ℓ}, and the activation function φ_ℓ applied componentwise.
Replica formula— We shall work in the asymptotic high-dimensional statistics regime where all α̃_ℓ ≡ n_ℓ/n_0 are of order one while n_0 → ∞, and make the important assumption that all matrices W_ℓ are orthogonally-invariant random matrices independent from each other; in other words, each matrix W_ℓ ∈ R^{n_ℓ × n_{ℓ−1}} can be decomposed as a product of three matrices, W_ℓ = U_ℓ S_ℓ V_ℓ, where U_ℓ ∈ O(n_ℓ) and V_ℓ ∈ O(n_{ℓ−1}) are independently sampled from the Haar measure, and S_ℓ is a diagonal matrix of singular values.
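To make the generative process concrete, the Markov chain of model (1) with orthogonally-invariant weights can be sampled in a few lines of numpy. This is our own illustrative sketch; the layer sizes, activation, noise level and all-ones singular values below are arbitrary choices, not values used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def haar_orthogonal(n):
    """Sample an orthogonal matrix from the Haar measure via QR decomposition."""
    A = rng.standard_normal((n, n))
    Q, R = np.linalg.qr(A)
    return Q * np.sign(np.diag(R))  # sign fix makes the distribution exactly Haar

def usv_matrix(n_out, n_in, singular_values):
    """Orthogonally-invariant weight matrix W = U S V."""
    U, V = haar_orthogonal(n_out), haar_orthogonal(n_in)
    S = np.zeros((n_out, n_in))
    k = min(n_out, n_in)
    S[:k, :k] = np.diag(singular_values)
    return U @ S @ V

def forward(x, weights, phi, noise_std=1e-2):
    """Markov chain X -> T_1 -> ... -> T_L with componentwise noisy activations,
    T_l = phi(W_l T_{l-1} + xi_l)."""
    ts = [x]
    for W in weights:
        xi = noise_std * rng.standard_normal(W.shape[0])
        ts.append(phi(W @ ts[-1] + xi))
    return ts

n0, n1, n2 = 200, 150, 100
x = rng.standard_normal(n0)          # separable prior P_0 = N(0, 1)
Ws = [usv_matrix(n1, n0, np.ones(n1)),
      usv_matrix(n2, n1, np.ones(n2))]
ts = forward(x, Ws, np.tanh)
print([t.shape for t in ts])  # [(200,), (150,), (100,)]
```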
The main technical tool we use is a formula for the entropies of the hidden variables, H(T_ℓ) = −E_{T_ℓ} ln P_{T_ℓ}(t_ℓ), and the mutual information between adjacent layers I(T_ℓ; T_{ℓ−1}) = H(T_ℓ) + E_{T_ℓ,T_{ℓ−1}} ln P_{T_ℓ|T_{ℓ−1}}(t_ℓ|t_{ℓ−1}), based on the heuristic replica method [10, 11, 6, 7]:

Claim 1 (Replica formula). Assume model (1) with L layers in the high-dimensional limit with componentwise activation functions and weight matrices generated from the ensemble described above, and denote by λ_{W_k} the eigenvalues of W_k^T W_k. Then for any ℓ ∈ {1, . . . , L} the normalized entropy of T_ℓ is given by the minimum among all stationary points of the replica potential:

    lim_{n_0→∞} (1/n_0) H(T_ℓ) = min extr_{A, V, Ã, Ṽ} φ_ℓ(A, V, Ã, Ṽ),    (2)

which depends on ℓ-dimensional vectors A, V, Ã, Ṽ, and is written in terms of mutual informations I and conditional entropies H of scalar variables as

    φ_ℓ(A, V, Ã, Ṽ) = I(t_0; t_0 + ξ_0/√Ã_1) + Σ_{k=1}^{ℓ−1} α̃_k [ H(t_k|ξ_k; Ã_{k+1}, Ṽ_k, ρ̃_k) − (1/2) log(2πe Ã_{k+1}^{−1}) ]
                      + α̃_ℓ H(t_ℓ|ξ_ℓ; Ṽ_ℓ, ρ̃_ℓ) − (1/2) Σ_{k=1}^{ℓ} α̃_{k−1} [ Ã_k V_k + α_k A_k Ṽ_k − F_{W_k}(A_k V_k) ],    (3)

where α_k = n_k/n_{k−1}, α̃_k = n_k/n_0, ρ_k = ∫ dP_{k−1}(t) t², ρ̃_k = (E_{λ_{W_k}} λ_{W_k}) ρ_k / α_k, and ξ_k ∼ N(0, 1) for k = 0, . . . , ℓ. In the computation of the conditional entropies in (3), the scalar t_k-variables are generated from P(t_0) = P_0(t_0) and

    P(t_k|ξ_k; A, V, ρ) = E_{ξ̃,z̃} P_k(t_k + ξ̃/√A | √(ρ − V) ξ_k + √V z̃),   k = 1, . . . , ℓ − 1,    (4)

    P(t_ℓ|ξ_ℓ; V, ρ) = E_{z̃} P_ℓ(t_ℓ | √(ρ − V) ξ_ℓ + √V z̃),    (5)

where ξ̃ and z̃ are independent N(0, 1) random variables. Finally, the function F_{W_k}(x) depends on the distribution of the eigenvalues λ_{W_k} following

    F_{W_k}(x) = min_{θ∈R} { 2α_k θ + (α_k − 1) ln(1 − θ) + E_{λ_{W_k}} ln[x λ_{W_k} + (1 − θ)(1 − α_k θ)] }.    (6)

The computation of the entropy in the large dimensional limit, a computationally difficult task, has thus been reduced to an extremization of a function of 4ℓ variables, which requires evaluating single or bidimensional integrals. This extremization can be done efficiently by means of a fixed-point iteration starting from different initial conditions, as detailed in the Supplementary Material [12]. Moreover, a user-friendly Python package is provided [13], which performs the computation for different choices of prior P_0, activations φ_ℓ and spectra λ_{W_ℓ}.
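The building blocks of the replica potential (3) are information-theoretic quantities of scalar channels, which are cheap to evaluate numerically. As a hedged illustration of one such scalar quantity (our own example, with a binary prior that is not one of the priors used in the paper), the mutual information I(t_0; t_0 + σξ) of a one-dimensional Gaussian channel can be computed by direct numerical integration:

```python
import numpy as np
from scipy.integrate import quad

def scalar_mi_binary(sigma):
    """I(t0; t0 + sigma*xi) for t0 uniform on {-1, +1} and xi ~ N(0, 1),
    computed as H(Y) - H(Y|t0) with H(Y|t0) = 0.5*ln(2*pi*e*sigma^2)."""
    def p(y):  # mixture density of the channel output
        return 0.5 * (np.exp(-(y - 1) ** 2 / (2 * sigma ** 2))
                      + np.exp(-(y + 1) ** 2 / (2 * sigma ** 2))) / np.sqrt(2 * np.pi * sigma ** 2)
    # output entropy by adaptive quadrature (the tails beyond are negligible)
    h_y = quad(lambda y: -p(y) * np.log(p(y)),
               -1 - 10 * sigma, 1 + 10 * sigma, points=[-1.0, 1.0], limit=200)[0]
    h_y_given_t = 0.5 * np.log(2 * np.pi * np.e * sigma ** 2)
    return h_y - h_y_given_t

print(scalar_mi_binary(0.1))   # ~ln 2: the binary input is perfectly resolved
print(scalar_mi_binary(10.0))  # ~0: the input is drowned in noise
```

In the full formula (3), such one-dimensional integrals are evaluated for the actual priors P_0 and conditional channels (4)-(5) inside the fixed-point iteration.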
Finally, the mutual information between successive layers I(T_ℓ; T_{ℓ−1}) can be obtained from the entropy at the cost of an additional bidimensional integral, see Section 1.6.1 of the Supplementary Material [12].
Our approach in the derivation of (3) builds on recent progress in statistical estimation and information theory for generalized linear models following the application of methods from the statistical physics of disordered systems [10, 11] in communication [14], statistics [15] and machine learning problems [16, 17]. In particular, we use advanced mean field theory [18] and the heuristic replica method [10, 6], along with its recent extension to multi-layer estimation [7, 8], in order to derive the above formula (3). This derivation is lengthy and thus given in the Supplementary Material [12]. In a related contribution, Reeves [9] proposed a formula for the mutual information in the multi-layer setting, using heuristic information-theoretic arguments. Like ours, it exhibits layer-wise additivity, and the two formulas are conjectured to be equivalent.
Rigorous statement— We recall the assumptions under which the replica formula of Claim 1 is conjectured to be exact: (i) weight matrices are drawn from an ensemble of random orthogonally-invariant matrices, (ii) matrices at different layers are statistically independent and (iii) layers have a large dimension and the respective sizes of adjacent layers are such that weight matrices have aspect ratios {α_k, α̃_k}_{k=1}^ℓ of order one. While we could not prove the replica prediction in full generality, we stress that it comes with multiple credentials: (i) for Gaussian prior P_0 and Gaussian distributions P_ℓ, it corresponds to the exact analytical solution when weight matrices are independent of each other (see Section 1.6.2 of the Supplementary Material [12]).
(ii) In the single-layer case with a Gaussian weight matrix, it reduces to formula (6) in the Supplementary Material [12], which has been recently rigorously proven for (almost) all activation functions φ [19]. (iii) In the case of Gaussian distributions P_ℓ, it has also been proven for a large ensemble of random matrices [20] and (iv) it is consistent with all the results of the AMP [21, 22, 23] and VAMP [24] algorithms, and their multi-layer versions [7, 8], known to perform well for these estimation problems.
In order to go beyond results for the single-layer problem and heuristic arguments, we prove Claim 1 for the more involved multi-layer case, assuming Gaussian i.i.d. matrices and two non-linear layers:

Theorem 1 (Two-layer Gaussian replica formula). Suppose (H1) the input units distribution P_0 is separable and has bounded support; (H2) the activations φ_1 and φ_2 corresponding to P_1(t_{1,i}|W_{1,i}^T x) and P_2(t_{2,i}|W_{2,i}^T t_1) are bounded C² with bounded first and second derivatives w.r.t. their first argument; and (H3) the weight matrices W_1, W_2 have Gaussian i.i.d. entries. Then for model (1) with two layers L = 2 the high-dimensional limit of the entropy verifies Claim 1.

The theorem, which settles the conjecture presented in [7], is proven using the adaptive interpolation method of [25, 19] in a multi-layer setting, as first developed in [26]. The lengthy proof, presented in detail in the Supplementary Material [12], is of independent interest; it adds further credentials to the replica formula and offers a clear direction for further developments. Note that, following the same approximation arguments as in [19], where the proof is given for the single-layer case, hypothesis (H1) can be relaxed to the existence of the second moment of the prior, (H2) can be dropped and (H3) extended to matrices with i.i.d.
entries of zero mean, O(1/n_0) variance and finite third moment.

2 Tractable models for deep learning

The multi-layer model presented above can be leveraged to simulate two prototypical settings of deep supervised learning on synthetic datasets amenable to the tractable replica computation of entropies and mutual informations.
The first scenario is the so-called teacher-student setting (see Figure 1, left). Here, we assume that the input x is distributed according to a separable prior distribution P_X(x) = Π_i P_0(x_i), factorized in the components of x, and the corresponding label y is given by applying a mapping x → y, called the teacher. After generating a train and test set in this manner, we perform the training of a deep neural network, the student, on the synthetic dataset. In this case, the data themselves have a simple structure given by P_0.
In contrast, the second scenario allows generative models (see Figure 1, right) that create more structure, and that are reminiscent of the generative-recognition pair of models of a Variational Autoencoder (VAE). A code vector y is sampled from a separable prior distribution P_Y(y) = Π_i P_0(y_i) and a corresponding data point x is generated by a possibly stochastic neural network, the generative model. This setting allows to create input data x featuring correlations, differently from the teacher-student scenario. The studied supervised learning task then consists in training a deep neural net, the recognition model, to recover the code y from x.
In both cases, the chain going from X to any later layer is a Markov chain in the form of (1). In the first scenario, model (1) directly maps to the student network. In the second scenario however, model (1) actually maps to the feed-forward combination of the generative model followed by the recognition model. This shift is necessary to verify the assumption that the starting point (now given by Y) has a separable distribution. In particular, it generates correlated input data X while still allowing for the computation of the entropy of any T_ℓ.

Figure 1: Two models of synthetic data

At the start of a neural network training, weight matrices initialized as i.i.d. Gaussian random matrices satisfy the necessary assumptions of the formula of Claim 1. In their singular value decomposition

    W_ℓ = U_ℓ S_ℓ V_ℓ,    (7)

the matrices U_ℓ ∈ O(n_ℓ) and V_ℓ ∈ O(n_{ℓ−1}) are typical independent samples from the Haar measure across all layers. To make sure weight matrices remain close enough to independent during learning, we define a custom weight constraint which consists in keeping U_ℓ and V_ℓ fixed while only the matrix S_ℓ, constrained to be diagonal, is updated. The number of parameters is thus reduced from n_ℓ × n_{ℓ−1} to min(n_ℓ, n_{ℓ−1}). We refer to layers following this weight constraint as USV-layers. For the replica formula of Claim 1 to be correct, the matrices S_ℓ from different layers should furthermore remain uncorrelated during the learning. In Section 3, we consider the training of linear networks, for which information-theoretic quantities can be computed analytically, and confirm numerically that with USV-layers the replica-predicted entropy is correct at all times. In the following, we assume that this is also the case for non-linear networks.
In Section 3.2 of the Supplementary Material [12] we train a neural network with USV-layers on a simple real-world dataset (MNIST), showing that these layers can learn to represent complex functions despite their restriction.
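The USV-constraint can be sketched in plain numpy (an illustrative sketch of the constraint, not the released Keras implementation [50]): U and V are frozen Haar samples, and only the vector s of singular values, min(n_out, n_in) numbers, would be exposed to the optimizer.

```python
import numpy as np

rng = np.random.default_rng(0)

def haar(n):
    """Haar-distributed orthogonal matrix via QR with sign fix."""
    Q, R = np.linalg.qr(rng.standard_normal((n, n)))
    return Q * np.sign(np.diag(R))

class USVLayer:
    """W = U diag(s) V with U, V frozen; only s is a trainable parameter."""
    def __init__(self, n_in, n_out):
        self.U, self.V = haar(n_out), haar(n_in)
        self.s = np.ones(min(n_in, n_out))  # the only learnable parameters

    def weight(self):
        S = np.zeros((self.U.shape[0], self.V.shape[0]))
        k = len(self.s)
        S[:k, :k] = np.diag(self.s)
        return self.U @ S @ self.V

    def __call__(self, x):
        return self.weight() @ x

layer = USVLayer(n_in=300, n_out=100)
y = layer(rng.standard_normal(300))
print(y.shape, layer.s.size)  # (100,) 100 -- O(n) parameters instead of n_out*n_in
```

Since U and V are orthogonal, the entries of s are exactly the singular values of the effective weight matrix, which is what keeps the learned W inside the orthogonally-invariant ensemble assumed by Claim 1.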
We further note that such a product decomposition is reminiscent of a series of works on adaptive structured efficient linear layers (SELLs and ACDC) [27, 28], motivated this time by speed gains, where only diagonal matrices are learned (in these works the matrices U_ℓ and V_ℓ are chosen instead as permutations of Fourier or Hadamard matrices, so that the matrix multiplication can be replaced by fast transforms). In Section 3, we discuss learning experiments with USV-layers on synthetic datasets.
While we have defined model (1) as a stochastic model, traditional feed-forward neural networks are deterministic. In the numerical experiments of Section 3, we train and test networks without injecting noise, and only assume a noise model in the computation of information-theoretic quantities. Indeed, for continuous variables the presence of noise is necessary for mutual informations to remain finite (see the discussion of Appendix C in [5]). We assume at layer ℓ an additive white Gaussian noise of small amplitude just before passing through its activation function to obtain H(T_ℓ) and I(T_ℓ; T_{ℓ−1}), while keeping the mapping X → T_{ℓ−1} deterministic. This choice attempts to stay as close as possible to the deterministic neural network, but remains inevitably somewhat arbitrary (see again the discussion of Appendix C in [5]).
Other related works— The strategy of studying neural network models with random weight matrices and/or random data, using methods originating in statistical physics heuristics such as the replica and the cavity methods [10], has a long history. Before the deep learning era, this approach led to pioneering results in learning for the Hopfield model [29] and for the random perceptron [30, 31, 16, 17].
Recently, the successes of deep learning, along with the disqualifying complexity of studying real-world problems, have sparked a revived interest in the direction of random weight matrices. Recent results –without exhaustivity– were obtained on the spectrum of the Gram matrix at each layer using random matrix theory [32, 33], on the expressivity of deep neural networks [34], on the dynamics of propagation and learning [35, 36, 37, 38], on the high-dimensional non-convex landscape where the learning takes place [39], and on the universal random Gaussian neural nets of [40].
The information bottleneck theory [1] applied to neural networks consists in computing the mutual information between the data and the learned hidden representations on the one hand, and between the labels and again the hidden learned representations on the other hand [2, 3]. A successful training should maximize the information with respect to the labels and simultaneously minimize the information with respect to the input data, preventing overfitting and leading to a good generalization. While this intuition suggests new learning algorithms and regularizers [41, 42, 43, 44, 45, 46, 47], we can also hypothesize that this mechanism is already at play in a priori unrelated, commonly used optimization methods such as simple stochastic gradient descent (SGD). It was first tested in practice by [3] on very small neural networks, which allowed the entropy to be estimated by binning the activities of the hidden neurons. Afterwards, the authors of [5] reproduced the results of [3] on small networks using the continuous entropy estimator of [45], but found that the overall behavior of mutual information during learning is greatly affected when changing the nature of the non-linearities. Additionally, they investigated the training of larger linear networks on i.i.d.
normally distributed inputs, where entropies at each hidden layer can be computed analytically for an additive Gaussian noise. The strategy proposed in the present paper allows us to evaluate entropies and mutual informations in non-linear networks larger than those in [5, 3].

3 Numerical experiments

We present a series of experiments aiming both at further validating the replica estimator and at leveraging its power in noteworthy applications. A first application, presented in paragraph 3.1, consists in using the replica formula, in settings where it is proven to be rigorously exact, as a basis of comparison for other entropy estimators. The same experiment also contributes to the discussion of the information bottleneck theory for neural networks by showing how, without any learning, information-theoretic quantities behave differently for different non-linearities. In the following paragraph 3.2, we validate the accuracy of the replica formula in a learning experiment with USV-layers — where it is not proven to be exact — by considering the case of linear networks, for which information-theoretic quantities can be otherwise computed in closed form. We finally consider, in paragraph 3.3, a second application testing the information bottleneck theory for large non-linear networks. To this aim, we use the replica estimator to study compression effects during learning.
3.1 Estimators and activation comparisons— Two non-parametric estimators have already been considered by [5] to compute entropies and/or mutual informations during learning. The kernel-density approach of Kolchinsky et al. [45] consists in fitting a mixture of Gaussians (MoG) to samples of the variable of interest and subsequently computing an upper bound on the entropy of the MoG [48]. The method of Kraskov et al. [49] uses nearest-neighbor distances between samples to directly build an estimate of the entropy.
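The nearest-neighbor idea underlying the Kraskov et al. estimator can be sketched with the classical Kozachenko-Leonenko estimate, its basic building block. This is our own minimal sketch, not the exact estimator of [49]:

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import digamma, gammaln

def knn_entropy(samples, k=3):
    """Kozachenko-Leonenko k-NN differential entropy estimate (in nats):
    H ~ psi(N) - psi(k) + ln(V_d) + (d/N) * sum_i ln(eps_i),
    where eps_i is the distance from sample i to its k-th nearest neighbor."""
    n, d = samples.shape
    tree = cKDTree(samples)
    # query returns the point itself first, so ask for k+1 neighbors
    eps = tree.query(samples, k=k + 1)[0][:, -1]
    log_vd = (d / 2) * np.log(np.pi) - gammaln(d / 2 + 1)  # log volume of unit d-ball
    return digamma(n) - digamma(k) + log_vd + d * np.mean(np.log(eps))

rng = np.random.default_rng(0)
x = rng.standard_normal((2000, 2))            # 2-d standard Gaussian samples
exact = 0.5 * 2 * np.log(2 * np.pi * np.e)    # closed form: (d/2) ln(2*pi*e)
print(knn_entropy(x), exact)
```

The cost of the pairwise-distance queries grows quickly with dimension and sample size, which is the practical limitation discussed next.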
Both methods require the computation of the matrix of distances between samples. Recently, [46] proposed a new non-parametric estimator for mutual informations which involves the optimization of a neural network to tighten a bound. It is unfortunately computationally hard to test how these estimators behave in high dimension, as even for a known distribution the computation of the entropy is intractable (#P-complete) in most cases. However, the replica method proposed here is a valuable point of comparison for cases where it is rigorously exact.
In the first numerical experiment we place ourselves in the setting of Theorem 1: a 2-layer network with i.i.d. weight matrices, where the formula of Claim 1 is thus rigorously exact in the limit of large networks, and we compare the replica results with the non-parametric estimators of [45] and [49]. Note that the requirement for smooth activations (H2) of Theorem 1 can be relaxed (see the discussion below the theorem). Additionally, non-smooth functions can be approximated arbitrarily closely by smooth functions with equal information-theoretic quantities, up to numerical precision.
We consider a neural network with layers of equal size n = 1000 that we denote X → T_1 → T_2. The input variable components are i.i.d. Gaussian with mean 0 and variance 1. The weight matrix entries are also i.i.d. Gaussian with mean 0. Their standard deviation is rescaled by a factor 1/√n and then multiplied by a coefficient σ varying between 0.1 and 10, i.e. around the recommended value for training initialization.
To compute entropies, we consider noisy versions of the latent variables where an additive white Gaussian noise of very small variance (σ²_noise = 10⁻⁵) is added right before the activation function, T_1 = f(W_1 X + ε_1) and T_2 = f(W_2 f(W_1 X) + ε_2) with ε_{1,2} ∼ N(0, σ²_noise I_n); the same is done in the remaining experiments to guarantee that the mutual informations remain finite. The non-parametric estimators [45, 49] were evaluated using 1000 samples, as the cost of computing pairwise distances is significant in such high dimension, and we checked that the entropy estimate is stable over independent draws of a sample of this size (error bars smaller than marker size). In Figure 2, we compare the different estimates of H(T_1) and H(T_2) for different activation functions: linear, hardtanh or ReLU. The hardtanh activation is a piecewise linear approximation of the tanh, hardtanh(x) = −1 for x < −1, x for −1 < x < 1, and 1 for x > 1, for which the integrals in the replica formula can be evaluated faster than for the tanh.
In the linear and hardtanh cases, the non-parametric methods follow the tendency of the replica estimate as σ is varied, but appear to systematically over-estimate the entropy. For linear networks with Gaussian inputs and additive Gaussian noise, every layer is also a multivariate Gaussian and therefore entropies can be directly computed in closed form ("exact" in the plot legend). When using the Kolchinsky estimate in the linear case we also check the consistency of two strategies: either fitting the MoG to the noisy sample, or fitting the MoG to the deterministic part of the T_ℓ and augmenting the resulting variance with σ²_noise, as done in [45] ("Kolchinsky et al. parametric" in the plot legend).
In the network with hardtanh non-linearities, we check that for small weight values the entropies are the same as in a linear network with the same weights ("linear approx" in the plot legend, computed using the exact analytical result for linear networks and therefore plotted in a similar color to "exact"). Lastly, in the case of the ReLU-ReLU network, we note that the non-parametric methods predict an entropy increasing like that of a linear network with identical weights, whereas the replica computation reflects its knowledge of the cut-off and accurately features a slope equal to half of the linear-network entropy ("1/2 linear approx" in the plot legend). While non-parametric estimators are invaluable tools able to approximate entropies from the mere knowledge of samples, they inevitably introduce estimation errors. The replica method takes the opposite view: while restricted to a class of models, it can leverage its knowledge of the neural network structure to provide a reliable estimate. To our knowledge, there is no other entropy estimator able to incorporate such information about the underlying multi-layer model.
Beyond informing about estimator accuracy, this experiment also unveils a simple but possibly important distinction between activation functions. For the hardtanh activation, as the random weight magnitude increases, the entropies decrease after reaching a maximum, whereas they only increase for the unbounded activation functions we consider – even for the single-side saturating ReLU. This loss of information for bounded activations was also observed by [5], where entropies were computed by discretizing the neuron outputs into bins of equal size.
In this setting, as the tanh activation starts to saturate for large inputs, the extreme bins (at −1 and 1) concentrate more and more probability mass, which explains the information loss. Here we confirm that the phenomenon is also observed when computing the entropy of the hardtanh (without binning and with small noise injected before the non-linearity). We check via the replica formula that the same phenomenology arises for the mutual informations I(X; T_ℓ) (see Section 3.1 of the Supplementary Material [12]).
3.2 Learning experiments with linear networks— In the following, and in Section 3.3 of the Supplementary Material [12], we discuss training experiments of different instances of the deep learning models defined in Section 2. We seek to study the simplest possible training strategies achieving good generalization. Hence for all experiments we use plain stochastic gradient descent (SGD) with constant learning rates, without momentum and without any explicit form of regularization. The sizes of the training and testing sets are taken equal and scale typically as a few hundred times the size of the input layer. Unless otherwise stated, plots correspond to single runs, yet we checked over a few repetitions that independent runs lead to identical qualitative behaviors. The values of the mutual informations I(X; T_ℓ) are computed by considering noisy versions of the latent variables where an additive white Gaussian noise of very small variance (σ²_noise = 10⁻⁵) is added right before the activation function, as in the previous experiment. This noise is neither present at training time, where it could act as a regularizer, nor at testing time.
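In the linear case, the noisy latent variable is itself Gaussian, so the "exact" entropies reduce to the closed form H = ½ ln det(2πe Σ) with Σ = WWᵀ + σ²_noise I. A minimal sketch of this closed-form computation, with illustrative sizes:

```python
import numpy as np

def gaussian_entropy(cov):
    """Differential entropy of N(0, cov) in nats: 0.5 * ln det(2*pi*e*cov)."""
    n = cov.shape[0]
    sign, logdet = np.linalg.slogdet(cov)  # stable log-determinant
    return 0.5 * (n * np.log(2 * np.pi * np.e) + logdet)

rng = np.random.default_rng(0)
n, sigma, sigma_noise2 = 200, 1.0, 1e-5
W = sigma * rng.standard_normal((n, n)) / np.sqrt(n)

# T1 = W X + eps with X ~ N(0, I_n): covariance of T1 is W W^T + sigma_noise^2 I
H1 = gaussian_entropy(W @ W.T + sigma_noise2 * np.eye(n))
print(H1 / n)  # normalized entropy per neuron
```

This is the quantity the replica prediction is benchmarked against for linear networks, both here and in the USV-layer training experiments below.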
Given that the noise is only assumed at the last layer, the second-to-last layer is a deterministic mapping of the input variable; hence the replica formula yielding mutual informations between adjacent layers gives us directly I(T_ℓ; T_{ℓ−1}) = H(T_ℓ) − H(T_ℓ|T_{ℓ−1}) = H(T_ℓ) − H(T_ℓ|X) = I(T_ℓ; X). We provide a second Python package [50] to implement in Keras learning experiments on synthetic datasets, using USV-layers and interfacing the first Python package [13] for replica computations.

Figure 2: Entropy of latent variables in stochastic networks X → T_1 → T_2, with equally sized layers n = 1000, inputs drawn from N(0, I_n), weights with i.i.d. N(0, σ²/n) entries, as a function of the weight scaling parameter σ. An additive white Gaussian noise N(0, 10⁻⁵ I_n) is added inside the non-linearity. Left column: linear network. Center column: hardtanh-hardtanh network. Right column: ReLU-ReLU network.

To start with, we consider the training of a linear network in the teacher-student scenario. The teacher also has to be linear to be learnable: we consider a simple single-layer network with additive white Gaussian noise, Y = W̃_teach X + ε, with input x ∼ N(0, I_n) of size n, teacher matrix W̃_teach i.i.d. normally distributed as N(0, 1/n), noise ε ∼ N(0, 0.01 I_{n_Y}), and output of size n_Y = 4. We train a student network of three USV-layers plus one fully connected unconstrained layer, X → T_1 → T_2 → T_3 → Ŷ, on the regression task, using plain SGD for the MSE loss (Ŷ − Y)². We recall that in the USV-layers (7) only the diagonal matrix is updated during learning. On the left panel of Figure 3, we report the learning curve and the mutual informations between the hidden layers and the input in the case where all layers but the outputs have size n = 1500.
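The teacher described above can be sketched as a dataset generator in a few lines of numpy (our own illustration of the described setup; the sample count is an arbitrary choice):

```python
import numpy as np

def teacher_dataset(n=1500, n_y=4, n_samples=3000, noise_var=0.01, seed=0):
    """Synthetic regression data from the linear teacher Y = W_teach X + eps,
    with X ~ N(0, I_n), W_teach entries ~ N(0, 1/n), eps ~ N(0, noise_var * I)."""
    rng = np.random.default_rng(seed)
    W_teach = rng.standard_normal((n_y, n)) / np.sqrt(n)
    X = rng.standard_normal((n_samples, n))
    Y = X @ W_teach.T + np.sqrt(noise_var) * rng.standard_normal((n_samples, n_y))
    return X, Y, W_teach

X, Y, W = teacher_dataset()
print(X.shape, Y.shape)  # (3000, 1500) (3000, 4)
```

With the 1/n scaling of the teacher entries, each output component has variance close to 1 + noise_var, so the regression targets are O(1) regardless of the input size.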
Again, this linear setting is analytically tractable and does not require the replica formula; a similar situation was studied in [5]. In agreement with their observations, we find that the mutual informations I(X; Tℓ) keep on increasing throughout the learning, without compromising the generalization ability of the student. We also use this linear setting to demonstrate (i) that the replica formula remains correct throughout the learning of the USV-layers and (ii) that the replica method gets closer and closer to the exact result in the limit of large networks, as theoretically predicted (2). To this aim, we repeat the experiment for n varying between 100 and 1500, and report the maximum and the mean value of the squared error on the estimation of I(X; Tℓ) over all epochs of 5 independent training runs. We find that even if errors tend to increase with the number of layers, they remain very small and decrease drastically as the size of the layers increases.
3.3 Learning experiments with deep non-linear networks— Finally, we apply the replica formula to estimate mutual informations during the training of non-linear networks on correlated input data. We consider a simple single-layer generative model X = W̃gen Y + ε with normally distributed code Y ∼ N(0, I_nY) of size nY = 100, data of size nX = 500 generated with matrix W̃gen i.i.d. normally distributed as N(0, 1/nY), and noise ε ∼ N(0, 0.01I_nX). We then train a recognition model to solve the binary classification problem of recovering the label y = sign(Y1), the sign of the first neuron in Y, using plain SGD, but this time to minimize the cross-entropy loss. Note that the rest of the initial code (Y2, ..., YnY) acts as noise/nuisance with respect to the learning task.
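The tractability of the linear setting above can be made concrete: for a single noisy linear-Gaussian layer T = W X + ε with X ∼ N(0, I) and ε ∼ N(0, σ²I), the mutual information has the standard closed form I(X; T) = ½ log det(I + W Wᵀ/σ²). A minimal numpy sketch (ours, shown only as an illustration of the kind of exact baseline the replica estimate is compared against; the multi-layer linear case reduces to compositions of such Gaussian maps):

```python
import numpy as np

def mi_linear_gaussian(W, sigma2):
    """Exact I(X; T) in nats for T = W X + eps,
    X ~ N(0, I), eps ~ N(0, sigma2 I):  0.5 * logdet(I + W W^T / sigma2)."""
    m = W.shape[0]
    # slogdet is numerically safer than log(det(...)) in high dimension
    _, logdet = np.linalg.slogdet(np.eye(m) + (W @ W.T) / sigma2)
    return 0.5 * logdet
```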
We compare two 5-layer recognition models with 4 USV-layers plus one unconstrained layer, of sizes 500-1000-500-250-100-2, and activations either linear-ReLU-linear-ReLU-softmax (top row of Figure 4) or linear-hardtanh-linear-hardtanh-softmax (bottom row).

Figure 3: Training of a 4-layer linear student of varying size on a regression task generated by a linear teacher of output size nY = 4. Upper-left: MSE loss on the training and testing sets during training by plain SGD for layers of size n = 1500. Best training loss is 0.004735, best testing loss is 0.004789. Lower-left: corresponding mutual information evolution between hidden layers and input. Center-left, center-right, right: maximum and mean of the squared error of the replica estimation of the mutual information as a function of layer size n, over the course of 5 independent trainings for each value of n, for the first, second and third hidden layers.

Because USV-layers only feature O(n) parameters instead of O(n²), we observe that they generally require more iterations to train. In the case of the ReLU network, adding interleaved linear layers was key to successful training with 2 non-linearities, which explains the somewhat unusual architecture proposed. For the recognition model using hardtanh this was not an issue (see Supplementary Material [12] for an experiment using only hardtanh activations); however, we consider a similar architecture for fair comparison.
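The synthetic classification data described above can be generated in a few lines (a sketch assuming the stated sizes, not the released lsd package [50] itself; the function name is ours):

```python
import numpy as np

def make_classification_data(n_samples, n_y=100, n_x=500, noise_var=0.01,
                             seed=0):
    """X = W_gen Y + eps with Y ~ N(0, I_{n_y}), W_gen i.i.d. N(0, 1/n_y),
    eps ~ N(0, noise_var I); the label is y = sign(Y_1), so the rest of the
    code (Y_2, ..., Y_{n_y}) acts as a nuisance for the task."""
    rng = np.random.default_rng(seed)
    W_gen = rng.normal(scale=np.sqrt(1.0 / n_y), size=(n_x, n_y))
    Y = rng.standard_normal((n_samples, n_y))
    eps = rng.normal(scale=np.sqrt(noise_var), size=(n_samples, n_x))
    X = Y @ W_gen.T + eps
    labels = np.sign(Y[:, 0])
    return X, labels
```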
We further discuss the learning ability of USV-layers in the Supplementary Material [12].
This experiment is reminiscent of the setting of [3], yet is now tractable for networks of larger sizes. For both types of non-linearities, we observe that the mutual information between the input and all hidden layers decreases during learning, except at the very beginning of training, where we can sometimes observe a short phase of increase (see zooms in insets). For the hardtanh layers this phase is longer and the initial increase is of noticeable amplitude.
In this particular experiment, the claim of [3] that compression can occur during training even with non-double-saturated activations seems corroborated (a phenomenon that was not observed by [5]). Yet we do not observe that the compression is more pronounced in deeper layers, and its link to generalization remains elusive. For instance, we do not see a delay in generalization w.r.t. training accuracy/loss in the recognition model with hardtanh, despite an initial phase without compression in two layers. Further learning experiments, including a second run of this last experiment, are presented in the Supplementary Material [12].

4 Conclusion and perspectives

We have presented a class of deep learning models together with a tractable method to compute the entropy and mutual information between layers. This, we believe, offers a promising framework for further investigations, and to this aim we provide Python packages that facilitate both the computation of mutual informations and the training, for an arbitrary implementation of the model. In the future, allowing for biases by extending the proposed formula would improve the fitting power of the considered neural network models.
We observe in our high-dimensional experiments that compression can happen during learning, even when using ReLU activations.
While we did not observe a clear link between generalization and compression in our setting, there are many directions to be further explored within the models presented in Section 2. Studying the entropic effect of regularizers is a natural step towards formulating an entropic interpretation of generalization. Furthermore, while our experiments focused on supervised learning, the replica formula derived for multi-layer models is general and can be applied in unsupervised contexts, for instance in the theory of VAEs. On the rigorous side, the greater perspective remains proving the replica formula in the general case of multi-layer models, and further confirming that the replica formula stays true after the learning of the USV-layers.

Figure 4: Training of two recognition models on a binary classification task with correlated input data and either ReLU (top) or hardtanh (bottom) non-linearities. Left: training and generalization cross-entropy loss (left axis) and accuracies (right axis) during learning. Best training-testing accuracies are 0.995 - 0.991 for the ReLU version (top row) and 0.998 - 0.996 for the hardtanh version (bottom row). Remaining columns: mutual information between the input and successive hidden layers. Insets zoom on the first epochs.
Another question worthy of future investigation is whether the replica method can be used to describe not only entropies and mutual informations for learned USV-layers, but also the optimal learning of the weights itself.

Acknowledgments

The authors would like to thank Léon Bottou, Antoine Maillard, Marc Mézard, Léo Miolane, and Galen Reeves for insightful discussions. This work has been supported by the ERC under the European Union's FP7 Grant Agreement 307087-SPARCS and the European Union's Horizon 2020 Research and Innovation Program 714608-SMiLe, as well as by the French Agence Nationale de la Recherche under grant ANR-17-CE23-0023-01 PAIL. Additional funding is acknowledged by MG from "Chaire de recherche sur les modèles et sciences des données", Fondation CFM pour la Recherche-ENS; by AM from Labex DigiCosme; and by CL from the Swiss National Science Foundation under grant 200021E-175541. We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan Xp GPU used for this research.

References

[1] N. Tishby, F. C. Pereira, and W. Bialek. The Information Bottleneck Method. In 37th Annual Allerton Conference on Communication, Control, and Computing, 1999.

[2] N. Tishby and N. Zaslavsky. Deep learning and the information bottleneck principle. In IEEE Information Theory Workshop (ITW), 2015.

[3] R. Shwartz-Ziv and N. Tishby. Opening the Black Box of Deep Neural Networks via Information. arXiv:1703.00810, 2017.

[4] G. Chechik, A. Globerson, N. Tishby, and Y. Weiss. Information bottleneck for Gaussian variables. Journal of Machine Learning Research, 6(Jan):165–188, 2005.

[5] A. M. Saxe, Y. Bansal, J. Dapello, M. Advani, A. Kolchinsky, B. D. Tracey, and D. D. Cox. On the Information Bottleneck Theory of Deep Learning. In International Conference on Learning Representations (ICLR), 2018.

[6] Y. Kabashima.
Inference from correlated patterns: a unified theory for perceptron learning and linear vector channels. Journal of Physics: Conference Series, 95(1):012001, 2008.

[7] A. Manoel, F. Krzakala, M. Mézard, and L. Zdeborová. Multi-layer generalized linear estimation. In IEEE International Symposium on Information Theory (ISIT), 2017.

[8] A. K. Fletcher and S. Rangan. Inference in Deep Networks in High Dimensions. arXiv:1706.06549, 2017.

[9] G. Reeves. Additivity of Information in Multilayer Networks via Additive Gaussian Noise Transforms. In 55th Annual Allerton Conference on Communication, Control, and Computing, 2017.

[10] M. Mézard, G. Parisi, and M. Virasoro. Spin Glass Theory and Beyond. World Scientific Publishing Company, 1987.

[11] M. Mézard and A. Montanari. Information, Physics, and Computation. Oxford University Press, 2009.

[12] ArXiv version of this work. https://arxiv.org/abs/1805.09785.

[13] dnner: Deep Neural Networks Entropy with Replicas, Python library. https://github.com/sphinxteam/dnner.

[14] A. M. Tulino, G. Caire, S. Verdú, and S. Shamai (Shitz). Support Recovery With Sparsely Sampled Free Random Matrices. IEEE Transactions on Information Theory, 59(7):4243–4271, 2013.

[15] D. Donoho and A. Montanari.
High dimensional robust M-estimation: asymptotic variance via approximate\n\nmessage passing. Probability Theory and Related Fields, 166(3-4):935\u2013969, 2016.\n\n[16] H. S. Seung, H. Sompolinsky, and N. Tishby. Statistical mechanics of learning from examples. Physical\n\nReview A, 45(8):6056, 1992.\n\n[17] A. Engel and C. Van den Broeck. Statistical Mechanics of Learning. Cambridge University Press, 2001.\n\n[18] M. Opper and D. Saad. Advanced mean \ufb01eld methods: Theory and practice. MIT press, 2001.\n\n[19] J. Barbier, F. Krzakala, N. Macris, L. Miolane, and L. Zdeborov\u00e1. Phase Transitions, Optimal Errors and\n\nOptimality of Message-Passing in Generalized Linear Models. arXiv:1708.03395, 2017.\n\n[20] J. Barbier, N. Macris, A. Maillard, and F. Krzakala. The Mutual Information in Random Linear Estimation\n\nBeyond i.i.d. Matrices. In IEEE International Symposium on Information Theory (ISIT), 2018.\n\n[21] D. Donoho, A. Maleki, and A. Montanari. Message-passing algorithms for compressed sensing. Proceed-\n\nings of the National Academy of Sciences, 106(45):18914\u201318919, 2009.\n\n[22] L. Zdeborov\u00e1 and F. Krzakala. Statistical physics of inference: thresholds and algorithms. Advances in\n\nPhysics, 65(5):453\u2013552, 2016.\n\n[23] S. Rangan. Generalized approximate message passing for estimation with random linear mixing. In IEEE\n\nInternational Symposium on Information Theory (ISIT), 2011.\n\n[24] S. Rangan, P. Schniter, and A. K. Fletcher. Vector approximate message passing. In IEEE International\n\nSymposium on Information Theory (ISIT), 2017.\n\n[25] J. Barbier and N. Macris. The adaptive interpolation method: a simple scheme to prove replica formulas in\n\nBayesian inference. arXiv:1705.02780 to appear in Probability Theory and Related Fields, 2017.\n\n[26] J. Barbier, N. Macris, and L. Miolane. The Layered Structure of Tensor Estimation and its Mutual\n\nInformation. 
In 55th Annual Allerton Conference on Communication, Control, and Computing, 2017.

[27] M. Moczulski, M. Denil, J. Appleyard, and N. de Freitas. ACDC: A Structured Efficient Linear Layer. In International Conference on Learning Representations (ICLR), 2016.

[28] Z. Yang, M. Moczulski, M. Denil, N. de Freitas, A. Smola, L. Song, and Z. Wang. Deep fried convnets. In IEEE International Conference on Computer Vision (ICCV), 2015.

[29] D. J. Amit, H. Gutfreund, and H. Sompolinsky. Storing infinite numbers of patterns in a spin-glass model of neural networks. Physical Review Letters, 55(14):1530, 1985.

[30] E. Gardner and B. Derrida. Three unfinished works on the optimal storage capacity of networks. Journal of Physics A, 22(12):1983, 1989.

[31] M. Mézard. The space of interactions in neural networks: Gardner's computation with the cavity method. Journal of Physics A, 22(12):2181, 1989.

[32] C. Louart and R. Couillet. Harnessing neural networks: A random matrix approach. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017.

[33] J. Pennington and P. Worah. Nonlinear random matrix theory for deep learning. In Advances in Neural Information Processing Systems (NIPS), 2017.

[34] M. Raghu, B. Poole, J. Kleinberg, S. Ganguli, and J. Sohl-Dickstein. On the Expressive Power of Deep Neural Networks. In International Conference on Machine Learning (ICML), 2017.

[35] A. Saxe, J. McClelland, and S. Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. In International Conference on Learning Representations (ICLR), 2014.

[36] S.S. Schoenholz, J. Gilmer, S. Ganguli, and J. Sohl-Dickstein. Deep information propagation. In International Conference on Learning Representations (ICLR), 2017.

[37] M. Advani and A. Saxe.
High-dimensional dynamics of generalization error in neural networks. arXiv:1710.03667, 2017.

[38] C. Baldassi, A. Braunstein, N. Brunel, and R. Zecchina. Efficient supervised learning in networks with binary synapses. Proceedings of the National Academy of Sciences, 104:11079–11084, 2007.

[39] Y. Dauphin, R. Pascanu, C. Gulcehre, K. Cho, S. Ganguli, and Y. Bengio. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In Advances in Neural Information Processing Systems, 2014.

[40] R. Giryes, G. Sapiro, and A. M. Bronstein. Deep neural networks with random Gaussian weights: a universal classification strategy? IEEE Transactions on Signal Processing, 64(13):3444–3457, 2016.

[41] M. Chalk, O. Marre, and G. Tkacik. Relevant sparse codes with variational information bottleneck. In Advances in Neural Information Processing Systems, 2016.

[42] A. Achille and S. Soatto. Information Dropout: Learning Optimal Representations Through Noisy Computation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.

[43] A. Alemi, I. Fischer, J. Dillon, and K. Murphy. Deep variational information bottleneck. In International Conference on Learning Representations (ICLR), 2017.

[44] A. Achille and S. Soatto. Emergence of Invariance and Disentangling in Deep Representations. In ICML 2017 Workshop on Principled Approaches to Deep Learning, 2017.

[45] A. Kolchinsky, B. D. Tracey, and D. H. Wolpert. Nonlinear Information Bottleneck. arXiv:1705.02436, 2017.

[46] M.I. Belghazi, A. Baratin, S. Rajeswar, S. Ozair, Y. Bengio, A. Courville, and R.D. Hjelm. MINE: Mutual Information Neural Estimation. In International Conference on Machine Learning (ICML), 2018.

[47] S. Zhao, J. Song, and S. Ermon. InfoVAE: Information Maximizing Variational Autoencoders. arXiv:1706.02262, 2017.

[48] A. Kolchinsky and B. D. Tracey.
Estimating mixture entropy with pairwise distances. Entropy, 19(7):361, 2017.

[49] A. Kraskov, H. Stögbauer, and P. Grassberger. Estimating mutual information. Physical Review E, 69(6):066138, 2004.

[50] lsd: Learning with Synthetic Data, Python library. https://github.com/marylou-gabrie/learning-synthetic-data.