{"title": "Exact Rate-Distortion in Autoencoders via Echo Noise", "book": "Advances in Neural Information Processing Systems", "page_first": 3889, "page_last": 3900, "abstract": "Compression is at the heart of effective representation learning. However, lossy compression is typically achieved through simple parametric models like Gaussian noise to preserve analytic tractability, and the limitations this imposes on learning are largely unexplored. Further, the Gaussian prior assumptions in models such as variational autoencoders (VAEs) provide only an upper bound on the compression rate in general. We introduce a new noise channel, Echo noise, that admits a simple, exact expression for mutual information for arbitrary input distributions. The noise is constructed in a data-driven fashion that does not require restrictive distributional assumptions. With its complex encoding mechanism and exact rate regularization, Echo leads to improved bounds on log-likelihood and dominates beta-VAEs across the achievable range of rate-distortion trade-offs. Further, we show that Echo noise can outperform flow-based methods without the need to train additional distributional transformations.", "full_text": "Exact Rate-Distortion in Autoencoders via Echo Noise\n\nRob Brekelmans, Daniel Moyer, Aram Galstyan, Greg Ver Steeg\n\nInformation Sciences Institute\n\nUniversity of Southern California\n\nbrekelma, moyerd@usc.edu; galstyan, gregv@isi.edu\n\nMarina del Rey, CA 90292\n\nAbstract\n\nCompression is at the heart of effective representation learning. However, lossy\ncompression is typically achieved through simple parametric models like Gaussian\nnoise to preserve analytic tractability, and the limitations this imposes on learning\nare largely unexplored. Further, the Gaussian prior assumptions in models such as\nvariational autoencoders (VAEs) provide only an upper bound on the compression\nrate in general. 
We introduce a new noise channel, Echo noise, that admits a simple, exact expression for mutual information for arbitrary input distributions. The noise is constructed in a data-driven fashion that does not require restrictive distributional assumptions. With its complex encoding mechanism and exact rate regularization, Echo leads to improved bounds on log-likelihood and dominates β-VAEs across the achievable range of rate-distortion trade-offs. Further, we show that Echo noise can outperform flow-based methods without the need to train additional distributional transformations.

1 Introduction

Rate-distortion theory provides an organizing principle for representation learning that is enshrined in machine learning as the Information Bottleneck principle [39]. The goal is to compress input random variables X into a representation Z with mutual information rate I(X; Z), while minimizing a distortion measure that captures our ability to use the representation for a task. For the rate to be restricted, some information must be lost through noise. Despite the use of increasingly complex encoding functions via neural networks, simple noise models like Gaussians still dominate the literature because of their analytic tractability. Unfortunately, the effect of these assumptions on the quality of learned representations is not well understood.

The Variational Autoencoding (VAE) framework [21, 36] has provided the basis for a number of recent developments in representation learning [1, 10, 11, 18, 20, 41]. While VAEs were originally motivated as performing posterior inference under a generative model, several recent works have viewed the Evidence Lower Bound objective as corresponding to an unsupervised rate-distortion problem [1, 3, 35].
From this perspective, reconstruction of the input provides the distortion measure, while the KL divergence between encoder and prior gives an upper bound on the information rate that depends heavily on the choice of prior [3, 37, 40].

In this work, we deconstruct this interpretation of VAEs and their extensions. Do the restrictive assumptions of the Gaussian noise model limit the quality of VAE representations? Does forcing the latent space to be independent and Gaussian constrain the expressivity of our models? We find evidence to support both claims, showing that a powerful noise model can achieve more efficient lossy compression and that relaxing prior or marginal assumptions can lead to better bounds on both the information rate and log-likelihood.

The main contribution of this paper is the introduction of the Echo noise channel, a powerful, data-driven improvement over Gaussian channels whose compression rate can be precisely expressed for

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.
[Figure 1; labels recovered from the original layout: input distribution q(x); Echo noise vs. Gaussian noise; noisy channel q(z|x); encoder distribution q(z); rates I(Z; X) = 3 bits and I(Z; X) ≤ 3 bits.]

Figure 1: For a noisy channel characterized by z = x + sε, we compare drawing the noise, ε, from a Gaussian distribution (as in VAEs) or an Echo distribution.

arbitrary input distributions.
Echo noise is constructed from the empirical distribution of its inputs, allowing its variation to reflect that of the source (see Fig. 1). We leverage this relationship to derive an analytic form for mutual information that avoids distributional assumptions on either the noise or the encoding marginal. Further, the Echo channel avoids the need to specify a prior, and instead implicitly uses the optimal prior in the Evidence Lower Bound. This marginal distribution is neither Gaussian nor independent in general.

After introducing the Echo noise channel and an exact characterization of its information rate in Sec. 2, we proceed to interpret Variational Autoencoders from an encoding perspective in Sec. 3. We formally define our rate-distortion objective in Sec. 3.1, and draw connections with recent related works in Sec. 4. Finally, we report log likelihood results, visualize the space of compression-reconstruction trade-offs, and evaluate disentanglement in Echo representations in Sec. 5.

2 Echo Noise

To avoid learning representations that memorize the data, we would like to constrain the mutual information between the input X and the representation Z. Since we have freedom to choose how to encode the data, we can design a noise model that facilitates calculating this generally intractable quantity.

The Echo noise channel has a shift-and-scale form that mirrors the reparameterization trick in VAEs. Referring to the observed data distribution as q(x), with z ∈ R^{d_z}, x ∈ R^{d_x}, we can define the stochastic encoder q(z|x) using:

    z = f(x) + S(x) ε    (1)

For brevity, we omit the subscripts that indicate that the function f : R^{d_x} → R^{d_z} and matrix function S : R^{d_x} → R^{d_z × d_z} depend on neural networks parameterized by φ. All that remains to specify the encoder is to fix the distribution of the noise variable, q(ε). For VAEs, the noise is typically chosen to be Gaussian, ε ∼ N(0, I_{d_z}).
With the goal of calculating mutual information, we will need to compare the marginal entropy H(Z), which integrates over samples x, and the conditional entropy H(Z|X), whose stochasticity is only due to the noise for deterministic f(x) and S(x). The choice of noise will affect both quantities, and our approach is to relate them by enforcing an equivalence between the distributions q(z) and q(ε).

Since q(z) = ∫ q(z|x) q(x) dx is defined in terms of the source, we can also imagine constructing the noise in a data-driven way. For instance, we could draw ε = f(x′), x′ ∼ q(x) iid, in an effort to make the noise match the channel output. However, this changes the distribution of Z and the noise would need to be updated to continue resembling the output.

(Footnote 1: Our approach is also easily adapted to multiplicative noise, such as in Achille and Soatto [1].)

Instead, by iteratively applying Eq. 1, we can guarantee that the noise and marginal distributions match in the limit. Using superscripts to indicate iid samples x^ℓ ∼ q(x), we draw ε according to:

    ε = f(x⁰) + S(x⁰)( f(x¹) + S(x¹)( f(x²) + S(x²) ··· ))
      = f(x⁰) + S(x⁰) f(x¹) + S(x⁰) S(x¹) f(x²) + ···    (2)

Echo noise is thus constructed using an infinite sum over attenuated "echoes" of the transformed data samples. This can be written more compactly as follows.

Definition: Echo Noise. The Echo noise distribution E(f(x), S(x), q(x)) is defined for functions f, S, and probability density function q over x ∈ R^{d_x}, by sampling according to the following procedure:

    ε = Σ_{ℓ=0}^{∞} ( Π_{ℓ′=0}^{ℓ−1} S(x^{ℓ′}) ) f(x^ℓ),    x^ℓ ∼ q(x) iid    (3)

where the empty product for ℓ = 0 is the identity. Although the noise distribution may be complex, it has the interesting property that it exactly matches the eventual output marginal q(z).

Lemma 2.1 (Echo noise matches channel output).
If ε ∼ Echo(f(x), S(x), q(x)) and z = f(x) + S(x) ε, then z has the same distribution as ε.

We can observe this relationship by simply re-labeling the sample indices in the expanded expression for the noise in Eq. 2. In particular, the training example that we condition on in Eq. 1 corresponds to the first sample x⁰ in a draw from the noise. This equivalence is the key insight leading to an exact expression for the mutual information:

Theorem 2.2 (Echo Information). For any source distribution q(x), and a noisy channel defined by Eq. 1 that satisfies Lemma 2.3, the mutual information is as follows:

    I(X; Z) = −E_x log |det S(x)|    (4)

Proof. We start by expanding the definition of mutual information in terms of entropies. Since f(x) and S(x) are deterministic, we treat them as constants after conditioning on X = x. The stochasticity underlying H(Z|X = x) is thus only due to the random variable ε.

    I(X; Z) = H(Z) − H(Z|X)
            = H(Z) − E_x H(f(x) + S(x)E | X = x)
            = H(Z) − E_x H(S(x)E | X = x)
            = H(Z) − H(E) − E_x log |det S(x)|
            = −E_x log |det S(x)|

We have used the translation invariance of differential entropy in the third line, and the scaling property in the fourth line [12]. The entropy terms cancel as a result of Lemma 2.1.

In this work, we consider only diagonal S(x) ≡ diag(s_1(x), ..., s_{d_z}(x)) as is typical for VAEs, so that the determinant in Eq. 4 simplifies as I(X; Z) = −Σ_j E_x log |s_j(x)| = Σ_j I(X; Z_j).

Finally, we note that the noise distribution q(ε) is only defined implicitly through a sampling procedure. For this to be meaningful, we must ensure that the infinite sum converges.

Lemma 2.3. The infinite sum in Eq. 3 converges, and thus Echo noise sampling is well-behaved, if ∀x, ∃M s.t. |f(x)| ≤ M and ρ(S(x)) < 1, where ρ is the spectral radius.

In App.
B, we discuss several implementation choices to guarantee that these conditions are met and that Echo noise can be accurately sampled using a finite number of terms. This is particularly difficult in the high noise, low information regime, as zero mutual information (s_j(x) = 1 ∀x, j) would imply an infinite amount of noise. To avoid this issue and ensure precise sampling, we clip the magnitude of s_j(x) so that, for a given M and number of samples, the sum of remainder terms is guaranteed to be within machine precision. This imposes a lower bound on the achievable rate across the Echo channel, which depends on the number of terms considered and can be tuned by the practitioner.

2.1 Properties of Echo Noise

We can visualize applying Echo noise to a complex input distribution in Fig. 1, using the identity transformation f(x) = x and constant noise scaling s_j(x) = 0.5. Here, we directly observe the equivalence of the noise and output distributions. Further, the data-driven nature of the Echo channel means it can leverage the structure in the (transformed) input to destroy information in a more targeted way than spherical Gaussian noise.

In particular, Echo's ability to add noise that is correlated across dimensions distinguishes it from common diagonal noise models. It is important to note that the noise still reflects the dependence in f(x) even when S(x) is diagonal. In fact, we show in App. C that TC(Z) = TC(Z|X) for the diagonal case, where total correlation measures the divergence from independence, e.g. TC(Z|X) = D_KL[q(z|x) || Π_j q(z_j|x)] [43].

In the setting of learned f(x) and S(x), notice that the noise depends on the parameters. This means that training gradients are propagated through ε, unlike traditional VAEs where q(ε) is fixed.
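The construction above can be sketched numerically for diagonal S. The toy mixture source, identity f, and constant s_j(x) = 0.5 below mirror the Fig. 1 setting; the fixed truncation length is an arbitrary stand-in for the machine-precision criterion of App. B:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_x(n):
    """Toy 2-D source q(x): a two-component Gaussian mixture (illustrative)."""
    comp = rng.integers(0, 2, size=n)
    means = np.array([[-2.0, 0.0], [2.0, 1.0]])
    return means[comp] + 0.5 * rng.standard_normal((n, 2))

f = lambda x: x                          # identity transformation, as in Fig. 1
s = 0.5                                  # constant diagonal scale s_j(x) = 0.5

def sample_echo_noise(n, n_terms=40):
    """Truncate the infinite sum of Eq. 3: eps = sum_l (prod of S) * f(x^l)."""
    eps = np.zeros((n, 2))
    scale = np.ones((n, 2))              # running product S(x^0)...S(x^{l-1})
    for _ in range(n_terms):
        eps += scale * f(sample_x(n))    # add the next attenuated "echo"
        scale *= s
    return eps

# Channel output (Eq. 1) and Lemma 2.1: z matches the noise distribution.
n = 20000
x, eps = sample_x(n), sample_echo_noise(n)
z = f(x) + s * eps
print(np.round([z.mean(0), eps.mean(0)], 2))                   # means agree
print(np.round([np.cov(z.T)[0, 0], np.cov(eps.T)[0, 0]], 2))   # variances agree

# Exact rate of Thm. 2.2 for diagonal S: I(X;Z) = -sum_j E_x log|s_j(x)|.
rate_bits = -2 * np.log(s) / np.log(2)
print(rate_bits)                         # 2.0 bits for s_j = 0.5, d_z = 2
```

With constant scale the rate needs no estimation at all, while the match between the sampled z and eps statistics illustrates Lemma 2.1 empirically.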
This may be a factor in improved performance: data samples are used as both signal and noise in different parts of the optimization, leading to a more efficient use of data.

Finally, the Echo channel fulfills several of the desirable properties that often motivate Gaussian noise and prior assumptions. Eqs. 1 and 3 define a simple sampling procedure that only requires a supply of iid samples from the input distribution. It is easy to sample both the noise and conditional distributions for the purposes of evaluating expectations, while Echo also provides a natural way to sample from the true encoding marginal q(z) via its equivalence with q(ε). While we cannot evaluate the density of a given z under q(z|x) or q(z), as might be useful in importance sampling [8], we can characterize their relationship on average using the mutual information in Eq. 4. These ingredients make Echo noise useful for learning representations within the autoencoder framework.

3 Lossy Compression in VAEs

Variational Autoencoders (VAEs) [21, 36] seek to maximize the log-likelihood of data under a latent factor generative model defined by p_θ(x, z) = p(z) p_θ(x|z), where θ represents parameters of the generative model decoder and p(z) is the prior distribution over latent variables. However, maximum likelihood is intractable in general due to the difficult integral over Z: log p_θ(x) = log ∫ p(z) p_θ(x|z) dz.

To avoid this problem, VAEs introduce a variational distribution, q(z|x), which encodes the input data q(x) and approximates the generative model posterior p_θ(z|x).
This leads to the tractable (average) Evidence Lower Bound (ELBO) on likelihood:

    E_q log p_θ(x) ≥ E_q log p_θ(x) − D_KL[q(z|x) || p_θ(z|x)]
                   = E_q log p_θ(x|z) − D_KL[q(z|x) || p(z)]    (5)

The connection between VAEs and rate-distortion theory can be seen using a decomposition of the KL divergence term from Hoffman and Johnson [19] (expectations over q(x) are left implicit):

    D_KL[q(z|x) || p(z)] = D_KL[q(z|x) || q(z)] + D_KL[q(z) || p(z)]
                         ≥ D_KL[q(z|x) || q(z)] = I_q(X; Z)    (6)

This decomposition lends insight into the orthogonal goals of the ELBO regularization term. The mutual information I_q(X; Z) encourages lossy compression of the data into a latent code, while the marginal divergence enforces consistency with the prior. The non-negativity of the KL divergence implies that each of these terms detracts from our likelihood bound.

Similarly, we observe that D_KL[q(z|x) || p(z)] gives an upper bound on the mutual information, with a gap of D_KL[q(z) || p(z)]. From this perspective, a static Gaussian prior can be seen as a particular, and possibly loose, marginal approximation [3, 14, 37]. The true encoding marginal q(z) provides the unique, optimal choice of prior and leads to a tighter bound on the likelihood:

    E_q log p_θ(x) ≥ E_q log p_θ(x|z) − I_q(X; Z)    (7)

Our exact expression for the mutual information over an Echo channel provides the first general method to directly optimize this objective. This corresponds to adaptively setting p(z) equal to q(z) throughout training, so that Eq. 7 can be seen as bounding the likelihood under the generative model p(x) = ∫ q(z) p_θ(x|z) dz.

3.1 Rate-Distortion Objective

While the VAE is motivated as performing amortized inference of the latent variables in a generative model, the prior is rarely leveraged to encode domain-specific structure. Further, we have shown that enforcing prior consistency can detract from likelihood bounds.

We instead follow Alemi et al.
[3] in advocating that representation learning be motivated from an encoding perspective using rate-distortion theory. In particular, we choose reconstruction under the generative model as the distortion measure d(x, z) = −log p_θ(x|z), and study the following optimization problem:

    max_{θ,φ}  E_q log p_θ(x|z) − β I_q(X; Z)    (8)

While this resembles the β-VAE objective of Higgins et al. [18], we highlight two notable distinctions. First, treating I_q(X; Z) rather than the upper bound D_KL[q(z|x) || p(z)] avoids the need to specify a prior and facilitates a direct interpretation in terms of lossy compression. Further, the parameter β is naturally interpreted as a Lagrange multiplier enforcing a constraint on I_q(X; Z). The special choice of β = 1 gives a bound on log-likelihood according to Eq. 7, which we use to compare results across methods in Sec. 5. We direct the reader to App. A for a more formal treatment of rate-distortion.

4 Related Work

Rate-Distortion Theory: A number of recent works have made connections between the Evidence Lower Bound objective and rate-distortion theory [1, 3, 25, 35], with the average distortion corresponding to the cross entropy reconstruction loss as above. In particular, Alemi et al. [3] consider the following upper and lower bounds on the mutual information I_q(X; Z):

    H − D = H_q(X) + E_q log p_θ(x|z) ≤ I_q(X; Z) ≤ D_KL[q(z|x) || r(z)] = R

With the data entropy as a constant, minimizing the cross entropy distortion corresponds to the variational information maximization lower bound of Barber and Agakov [4]. The upper bound matches the decomposition in Eq. 6 for the generalized choice of marginal r(z). Several recent works have also considered 'learned priors' or flow-based density estimators [3, 11, 40] that seek to reduce the marginal divergence by approximating q(z) (see below). Using this upper bound on the rate term, Alemi et al.
[3] and Rezende and Viola [35] obtain objectives similar to Eq. 8.

Existing models are usually trained with a static β [3, 18] or a heuristic annealing schedule [6, 9], which implicitly correspond to constant constraints (see App. A). However, setting target values for either the rate or distortion remains an interesting direction for future discussion. Rezende and Viola [35] view the distortion as an intuitive quantity to specify in practice, while Zhao et al. [46] train a separate model to provide constraint values. As both works show, specifying a constant and optimizing the Lagrange multiplier with gradient descent can lead to improved performance.

Mutual Information in Unsupervised Learning: A number of recent works have argued that the maximum likelihood objective may be insufficient to guarantee useful representations of data [3, 45]. In particular, when paired with powerful decoders that can match the data distribution, VAEs may learn to completely ignore the latent code [6, 11].

To rectify these issues, a commonly proposed solution has been to add terms to the objective function that maximize, minimize, or constrain the mutual information between data and representation [3, 7, 32, 45, 46]. However, justifications for these approaches have varied and numerous methods have been employed for estimating the mutual information. These include sampling [32], indirect optimization via other divergences [45], mixture entropy estimation [23], learned mixtures [40], autoregressive density estimation [3], and a dual form of the KL divergence [5]. Poole et al. [33] provide a thorough review and analysis of variational upper and lower bounds on mutual information, although recent results have shown limits on our ability to construct high confidence estimators directly from samples [29]. Echo notably avoids this limitation by providing an analytic expression for the rate whenever the representation is sampled according to Eq.
3.

Table 1: Test Log Likelihood Bounds (per dataset: Rate / Distortion / −ELBO ± std over runs)

Method     | Binary MNIST              | Omniglot                   | Fashion MNIST              | Params (·10^6)
Echo       | 26.4 / 62.4 / 88.8 ± .18  | 30.2 / 84.4 / 114.6 ± .30  | 16.6 / 218.3 / 234.9 ± .21 | 1.40
VAE        | 26.2 / 63.6 / 89.8 ± .18  | 30.5 / 86.5 / 117.0 ± .44  | 15.7 / 219.3 / 235.0 ± .10 | 1.40
InfoVAE    | 26.0 / 64.0 / 90.0 ± .14  | 30.3 / 87.3 / 117.6 ± .51  | 15.6 / 219.5 / 235.1 ± .10 | 1.40
VAE-MAF    | 26.1 / 63.7 / 89.8 ± .15  | 30.5 / 86.4 / 116.9 ± .31  | 15.7 / 219.3 / 234.9 ± .14 | 3.12
VAE-Vamp   | 26.3 / 63.0 / 89.3 ± .19  | 30.8 / 84.3 / 115.1 ± .28  | 15.9 / 218.5 / 234.4 ± .08 | 1.99
IAF-Prior  | 26.5 / 63.5 / 90.0 ± .13  | 30.5 / 86.7 / 117.2 ± .36  | 15.8 / 219.1 / 234.9 ± .10 | 3.12
IAF+MMD    | 26.3 / 63.6 / 90.1 ± .15  | 30.7 / 86.4 / 117.1 ± .28  | 15.7 / 219.2 / 234.9 ± .13 | 3.12
IAF-MAF    | 26.4 / 63.6 / 89.9 ± .18  | 30.6 / 86.5 / 117.1 ± .24  | 15.8 / 219.1 / 234.9 ± .14 | 4.84
IAF-Vamp   | 26.4 / 62.8 / 89.2 ± .18  | 30.4 / 85.0 / 115.4 ± .20  | 16.0 / 218.3 / 234.3 ± .16 | 3.71

Among the approaches above, the InfoVAE model of Zhao et al. [45] provides a potentially interesting comparison with our method. The objective adds a parameter λ to more heavily regularize the marginal divergence and a parameter α to control mutual information. However, since D_KL[q(z)||p(z)] is intractable, the Maximum Mean Discrepancy (MMD) [16] between the encoding outputs and a standard Gaussian is used as a proxy. For the choice of λ = 1000 (as in the original paper) and α = 0 (no information preference), the objective simplifies to:

    L_InfoVAE = L_ELBO − 999 · D_MMD[q(z) || p(z)]

The sizeable MMD penalty encourages q(z) ≈ p(z), so that D_KL[q(z|x) || p(z)] ≈ D_KL[q(z|x) || q(z)] = I_q(X; Z).
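The MMD proxy can be estimated directly from two sample sets. A minimal sketch using a standard (biased, V-statistic) estimate with an RBF kernel; the kernel and bandwidth here are illustrative assumptions rather than the settings of Zhao et al. [45]:

```python
import numpy as np

def rbf_mmd2(a, b, bandwidth=1.0):
    """Biased (V-statistic) estimate of squared MMD between sample sets a, b."""
    def gram(u, v):
        d2 = ((u[:, None, :] - v[None, :, :]) ** 2).sum(-1)  # pairwise dists
        return np.exp(-d2 / (2.0 * bandwidth ** 2))
    return gram(a, a).mean() + gram(b, b).mean() - 2.0 * gram(a, b).mean()

rng = np.random.default_rng(0)
prior = rng.standard_normal((500, 2))      # samples from p(z) = N(0, I)
z_match = rng.standard_normal((500, 2))    # an encoder marginal matching p(z)
z_shift = z_match + 2.0                    # a mismatched encoder marginal
print(rbf_mmd2(z_match, prior) < rbf_mmd2(z_shift, prior))  # True
```

Penalizing this quantity pushes the aggregate encoder outputs toward the prior, which is what licenses reading the ELBO's KL term as an approximate rate.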
Thus, the KL divergence term in the ELBO should more closely reflect a mutual information regularizer, facilitating comparison with the rate in Echo models.

Flow models, which evaluate densities on simple distributions such as Gaussians but apply complex transformations with tractable Jacobians, are another prominent recent development in unsupervised learning [15, 22, 31, 34]. Flows can be used both as an encoding mechanism and marginal approximation for our purposes. In particular, Inverse Autoregressive Flow [22] can be seen as transforming the output of a Gaussian noise channel into an approximate posterior sample using a stack of autoregressive networks. Masked Autoregressive Flow [31] models a similar transformation with computational tradeoffs suited for density estimation, mapping latent samples to high probability under a Gaussian base distribution to approximate q(z).

Finally, the VampPrior [40] may also be used as a marginal approximation, modeling q(z) using a mixture distribution (1/K) Σ_k q(z|u_k) evaluated on a set of 'pseudo-inputs' u_k ∈ R^{d_x} learned by backpropagation.

5 Results

In this section, we would ideally like to quantify the impact of three key elements of the Echo approach: a data-driven noise model, exact rate regularization throughout training, and a flexible marginal distribution. In App. D.2, we observe that the dimension-wise marginals learned by Echo appear Gaussian despite our lack of explicit constraints. However, the joint marginal over q(z) (or equivalently q(ε)) may still have a complex dependence structure, which is not penalized for deviating from independence or Gaussianity. We calculate a second-order approximation of total correlation in App.
C to confirm that this noise is indeed dependent across dimensions.

5.1 ELBO Results

We proceed to analyze the log-likelihood performance of relevant models on three image datasets: static Binary MNIST [38], Omniglot [24] as adapted by Burda et al. [8], and Fashion MNIST (fMNIST) [44]. All models are trained with 32 latent variables using the same convolutional architecture as in Alemi et al. [3] except with ReLU activations. We trained using Adam optimization for 200 epochs, with an initial learning rate of 0.0003 decaying linearly to 0 over the last 100 epochs.

Figure 2: Binary MNIST R-D and Visualization

Figure 3: Omniglot R-D and Visualization

Table 1 shows negative test ELBO values, with the rate column reported as the appropriate upper bound for comparison methods. Results are averaged from ten runs of each model after removing the highest and lowest outliers. We compare Echo against diagonal Gaussian noise and IAF encoders, each with four marginal approximations: a Gaussian prior with and without the MMD penalty (e.g. IAF-Prior, IAF+MMD), MAF [31], and VampPrior [40]. Note that VAE is still used to denote the Gaussian encoder when paired with a different marginal (e.g. VAE-Vamp).

We find that the Echo noise autoencoder obtains improved likelihood bounds on Binary MNIST and Omniglot, with competitive results on fMNIST. We emphasize that Echo achieves this performance with significantly fewer parameters than comparison methods. IAF and MAF each require training an additional autoregressive model with size similar to the original network, while the VampPrior uses 750 learned pseudo-inputs of the same dimension as the data.
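For concreteness, the VampPrior log-marginal log q(z) = log (1/K) Σk q(z|uk) reduces to a log-sum-exp over the K component densities. A minimal numpy sketch, assuming diagonal Gaussian components whose means and log-variances would come from the encoder applied to the pseudo-inputs (all function names here are our own):

```python
import numpy as np

def log_diag_gaussian(z, mu, logvar):
    # log N(z; mu, diag(exp(logvar))), summed over latent dimensions
    return -0.5 * np.sum(
        logvar + np.log(2.0 * np.pi) + (z - mu) ** 2 / np.exp(logvar), axis=-1)

def vamp_log_prior(z, pseudo_mus, pseudo_logvars):
    # log q(z) for the mixture (1/K) sum_k N(z; mu_k, var_k), where
    # (mu_k, logvar_k) are encoder outputs on pseudo-input u_k
    comps = log_diag_gaussian(z[None, :], pseudo_mus, pseudo_logvars)  # shape (K,)
    m = comps.max()  # stabilized log-sum-exp
    return m + np.log(np.exp(comps - m).sum()) - np.log(len(comps))
```

With K identical standard-normal components this collapses to the usual Gaussian prior, a useful sanity check on the mixture evaluation.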
Although Echo involves special computation to construct the noise for each training example, it has the same number of parameters as a standard VAE and runs in approximately the same wall clock time.

We observe only minor differences based on the choice of encoding mechanism, which is somewhat surprising given the additional expressivity of the IAF transformation. The benefit of the flow transformations may be more readily observed on more difficult datasets or with more advanced architecture tuning [22].

We do find that a more complex marginal approximation can help performance. Although we see minimal gains from the MMD penalty and MAF marginal, the VampPrior bridges much of the performance gap with Echo noise. Recall that a learned prior can help ensure a tight rate bound while providing flexibility to learn a more complex marginal (in this case, a mixture model). However, the relative contribution of these effects is difficult to decouple. Echo instead provides both an exact rate and an adaptive prior by directly linking the choice of encoder and marginal.

5.2 Rate Distortion Curves

Moving beyond the special case of β = 1, rate-distortion theory provides the practitioner with an entire space of compression-relevance tradeoffs corresponding to constraints on the rate. We plot R-D curves for Binary MNIST in Fig. 2, Omniglot in Fig. 3, and Fashion MNIST in App. D.1.
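As a fixed point of reference for reading such curves, the classical rate-distortion function of a Gaussian source under squared error, R(D) = max(0, ½ log(σ²/D)) [12], traces the same qualitative shape: rate falls monotonically as allowed distortion grows and hits zero once D ≥ σ². A small numpy sketch (purely illustrative; this is not the image-model curve plotted in our figures):

```python
import numpy as np

def gaussian_rate(distortion, variance=1.0):
    # Classical R(D) = max(0, 0.5 * ln(variance / D)) for a Gaussian source
    # under squared-error distortion, measured in nats
    return np.maximum(0.0, 0.5 * np.log(variance / np.asarray(distortion)))

distortions = np.linspace(0.05, 1.2, 6)
rates = gaussian_rate(distortions)
```

Each achievable (R, D) pair on this frontier corresponds to some Lagrange multiplier β on the rate term, which is how varying β sweeps out the empirical curves.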
We also show model reconstructions at several points along the curve, with the output averaged over 10 encoding samples to observe how stochasticity in the latent space is translated through the decoder.

Table 2: Disentanglement Scores

|         | Independent: Factor (Echo / VAE) | Independent: MIG (Echo / VAE) | Dependent: Factor (Echo / VAE) | Dependent: MIG (Echo / VAE) |
|---------|----------------------------------|-------------------------------|--------------------------------|-----------------------------|
| β = 1   | 0.83 / 0.65 | 0.16 / 0.07 | 0.70 / 0.60 | 0.11 / 0.08 |
| β = 4   | 0.78 / 0.65 | 0.18 / 0.10 | 0.67 / 0.60 | 0.11 / 0.07 |
| β = 8   | 0.75 / 0.69 | 0.18 / 0.13 | 0.56 / 0.56 | 0.06 / 0.06 |
| λ = 0   | 0.83 / 0.65 | 0.16 / 0.07 | 0.70 / 0.60 | 0.10 / 0.08 |
| λ = 20  | 0.78 / 0.72 | 0.30 / 0.17 | 0.65 / 0.60 | 0.16 / 0.07 |
| λ = 50  | 0.79 / 0.73 | 0.30 / 0.18 | 0.58 / 0.53 | 0.16 / 0.07 |
| λ = 100 | 0.77 / 0.70 | 0.29 / 0.18 | 0.49 / 0.53 | 0.09 / 0.08 |

Figure 4: Echo λ = 0, β = 1

These visualizations are organized to compare models with similar rates, which we emphasize may occur at different values of β for different methods depending on the shape of their respective curves. The Echo rate-distortion curve indeed exhibits several notable differences with comparison methods. We first note that Echo performance begins to drop off as we approach the lower limit on achievable rate, which is shown with a dashed vertical line and ensures that the rate calculation accurately reflects the noise for a finite number of samples (see App. B). In this regime, the sigmoids parameterizing sj(x) are saturated for much of training, and unused dimensions still count against the objective since we cannot achieve zero rate. We reiterate that this low rate limit may be adjusted by considering more terms in the infinite sum or decreasing the number of latent factors.

At low rates, our models maintain only high level features of the input image, and the blurred average reconstructions reflect that different samples can lead to semantically different generations.
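The sample-averaged reconstructions described above amount to pushing several independent draws from the noisy encoder through the decoder and averaging the outputs. A minimal sketch with toy stand-ins for the encoder and decoder (the real models are convolutional networks; these lambdas are purely illustrative):

```python
import numpy as np

def averaged_reconstruction(x, encode, decode, n_samples=10, seed=0):
    # Average decoder outputs over several stochastic encodings of the same
    # input, showing how latent noise is translated through the decoder
    rng = np.random.default_rng(seed)
    recons = [decode(encode(x, rng)) for _ in range(n_samples)]
    return np.mean(recons, axis=0)

# toy stand-ins: an additive-noise channel and an identity decoder
encode = lambda x, rng: x + rng.normal(scale=0.1, size=x.shape)
decode = lambda z: z
x = np.linspace(0.0, 1.0, 8)
avg = averaged_reconstruction(x, encode, decode, n_samples=50)
```

With a linear decoder the average simply washes out the noise; with a nonlinear decoder, the blur in the averaged output instead reveals where different latent samples decode to semantically different images.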
On both datasets, Echo gives qualitatively different output variation than Gaussian noise at low rate and similar distortion. Intermediate-rate models still reflect some of this sample diversity, particularly on the more difficult Omniglot dataset.

For very high capacity models, we observe that Echo slightly extends its gains on both datasets, with three to five nats lower distortion than comparison methods at the same rates. Intuitively, a more complex encoding marginal may be harder to match to a (learned) prior, loosening the upper bound on mutual information. The Echo approach can be particularly useful in this regime, as it avoids explicitly constructing the marginal while still providing exact rate regularization.

5.3 Disentangled Representations

Significant recent attention has been devoted to learning disentangled representations of data, which reflect the true generative factors of variation in the data [10, 27] and may be useful for downstream tasks [26, 42]. While prevailing definitions and metrics for disentanglement have recently been challenged [26], existing methods often rely on the inductive bias of independent ground truth factors, either via total correlation (TC) regularization [10, 20], or by using higher β to more strongly penalize the KL divergence to an independent prior [9, 18]. Since Echo does not assume a factorized encoder or marginal, we investigate whether it can better preserve disentanglement when the ground truth factors are not independent.

To evaluate the quality of Echo noise representations, we compare against VAE models with diagonal Gaussian noise and priors, and consider the effects of increasing β or adding independence regularization with parameter λ [10, 20]:

L = Eq log pθ(x|z) − β Iq(X; Z) − λ TC(Z)

TC regularization is implemented as in [20], where a discriminator is trained to distinguish samples from q(z) and Πj q(zj). We keep β = 1 when modifying λ.
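The discriminator in [20] needs samples from the product of marginals Πj q(zj); these are obtained by independently permuting each latent dimension across the batch. A minimal numpy sketch of that permutation step (discriminator training itself is omitted, and the function name is ours):

```python
import numpy as np

def permute_dims(z, rng):
    # Shuffle each latent dimension independently across the batch, turning
    # samples from q(z) into samples from the product of marginals prod_j q(z_j)
    z_perm = z.copy()
    for j in range(z_perm.shape[1]):
        rng.shuffle(z_perm[:, j])
    return z_perm

rng = np.random.default_rng(0)
z = rng.normal(size=(2000, 3))
z[:, 1] = z[:, 0]  # introduce a strong cross-dimensional dependence
z_perm = permute_dims(z, rng)
```

The permuted batch preserves every dimension-wise marginal exactly while destroying cross-dimensional dependence, so the discriminator's density-ratio estimate targets the total correlation.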
Note that enforcing marginal independence will also limit the dependence in the noise learned by Echo, since TC(Z|X) and TC(Z) are linked as described in Sec. 2.1.

We calculate disentanglement scores on the dSprites dataset [28], where the ground truth factors of shape, scale, x-y position, and rotation are known and sampled independently across the dataset. To induce dependence in the ground truth factors, we downsample the dataset by partitioning each factor into 4 bins and randomly excluding pairwise combinations of bins with probability 0.15. This leads to a dataset of 15% of the original size, with a total correlation of 1.49 nats in the generative factors. We use both the implementation and experimental setup of Locatello et al. [26] and average scores over ten runs of each method.

Table 2 reports FactorVAE [20] and Mutual Information Gap [10] scores for both independent and dependent ground truth factors. We find that Echo provides superior disentanglement scores to VAEs across the board, although the relative improvement does not increase in the case of dependent latent factors. On the full dataset, independence regularization improves the MIG score for Echo and both scores for VAE, but may guide both models toward more entangled representations when this inductive bias does not match the ground truth. Finally, we note that increasing β need not improve disentanglement for Echo noise, since we have relaxed assumptions of independence in both the encoder and marginal. Higher β values actually appear to hurt disentanglement scores on the dependent dataset for both methods.

In Figure 4, we visualize an Echo model that has successfully learned to disentangle position and scale, but not rotation, on the full dSprites dataset. Each row represents a single latent dimension, and each column shows mean f(x) values as a function of the respective ground truth factors.
Note that the first column shows a heatmap in the x-y plane, while the orange, blue, and green lines indicate ellipse, square, and heart, respectively (see [10]). In general, we observed that Echo models achieved their highest MIG scores on position, scale, and shape, with rotation often entangled across two or more dimensions.

6 Conclusion

VAEs can be interpreted as performing a rate-distortion optimization, but may be handicapped by their weak compression mechanism, independent Gaussian marginal assumptions, and upper bound on rate. We introduced a new type of channel, Echo noise, that provides a more flexible, data-driven approach to constructing noise and admits an exact expression for mutual information. Our results demonstrate that using Echo noise in autoencoders can lead to better bounds on log-likelihood, favorable trade-offs between compression and reconstruction, and more disentangled representations.

The Echo channel can be substituted for Gaussian noise in most scenarios where VAEs are used, with similar runtime and the same number of parameters. Echo should also translate to other rate-distortion problems via the choice of distortion measure, including supervised learning with the traditional Information Bottleneck method [2, 39] and invariant representation learning as in [30]. Exploring further settings where mutual information provides meaningful regularization for neural network representations remains an exciting avenue for future work.

References

[1] Alessandro Achille and Stefano Soatto. Information dropout: Learning optimal representations through noisy computation. arXiv preprint arXiv:1611.01353, 2016.

[2] Alexander Alemi, Ian Fischer, Joshua Dillon, and Kevin Murphy. Deep variational information bottleneck. arXiv preprint arXiv:1612.00410, 2016.

[3] Alexander Alemi, Ben Poole, Ian Fischer, Joshua Dillon, Rif A Saurous, and Kevin Murphy. Fixing a broken ELBO.
In International Conference on Machine Learning, pages 159–168, 2018.

[4] David Barber and Felix V Agakov. The IM algorithm: A variational approach to information maximization. In Advances in Neural Information Processing Systems, 2003.

[5] Ishmael Belghazi, Sai Rajeswar, Aristide Baratin, R Devon Hjelm, and Aaron Courville. MINE: Mutual information neural estimation. arXiv preprint arXiv:1801.04062, 2018.

[6] Samuel R. Bowman, Luke Vilnis, Oriol Vinyals, Andrew M. Dai, Rafal Józefowicz, and Samy Bengio. Generating sentences from a continuous space. CoRR, abs/1511.06349, 2015. URL http://arxiv.org/abs/1511.06349.

[7] DT Braithwaite and W Bastiaan Kleijn. Bounded information rate variational autoencoders. arXiv preprint arXiv:1807.07306, 2018.

[8] Yuri Burda, Roger Grosse, and Ruslan Salakhutdinov. Importance weighted autoencoders. arXiv preprint arXiv:1509.00519, 2015.

[9] Christopher P Burgess, Irina Higgins, Arka Pal, Loic Matthey, Nick Watters, Guillaume Desjardins, and Alexander Lerchner. Understanding disentangling in beta-VAE. arXiv preprint arXiv:1804.03599, 2018.

[10] Tian Qi Chen, Xuechen Li, Roger Grosse, and David Duvenaud. Isolating sources of disentanglement in variational autoencoders. In Advances in Neural Information Processing Systems, 2018.

[11] Xi Chen, Diederik P Kingma, Tim Salimans, Yan Duan, Prafulla Dhariwal, John Schulman, Ilya Sutskever, and Pieter Abbeel. Variational lossy autoencoder. arXiv preprint arXiv:1611.02731, 2016.

[12] Thomas M Cover and Joy A Thomas. Elements of Information Theory. Wiley-Interscience, 2006.

[13] Joshua V Dillon, Ian Langmore, Dustin Tran, Eugene Brevdo, Srinivas Vasudevan, Dave Moore, Brian Patton, Alex Alemi, Matt Hoffman, and Rif A Saurous. TensorFlow Distributions. arXiv preprint arXiv:1711.10604, 2017.

[14] Shuyang Gao, Rob Brekelmans, Greg Ver Steeg, and Aram Galstyan. Auto-encoding total correlation explanation.
AISTATS, 2019.

[15] Mathieu Germain, Karol Gregor, Iain Murray, and Hugo Larochelle. MADE: Masked autoencoder for distribution estimation. In International Conference on Machine Learning, pages 881–889, 2015.

[16] Arthur Gretton, Karsten M Borgwardt, Malte J Rasch, Bernhard Schölkopf, and Alexander Smola. A kernel two-sample test. Journal of Machine Learning Research, 13(Mar):723–773, 2012.

[17] Virgil Griffith and Christof Koch. Quantifying synergistic mutual information. In Guided Self-Organization: Inception, pages 159–190. Springer, 2014.

[18] Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-VAE: Learning basic visual concepts with a constrained variational framework. In Proceedings of the International Conference on Learning Representations (ICLR), 2017.

[19] Matthew D Hoffman and Matthew J Johnson. ELBO surgery: yet another way to carve up the variational evidence lower bound. In Workshop in Advances in Approximate Bayesian Inference, NIPS, 2016.

[20] Hyunjik Kim and Andriy Mnih. Disentangling by factorising. arXiv preprint arXiv:1802.05983, 2018.

[21] Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.

[22] Diederik P Kingma, Tim Salimans, Rafal Jozefowicz, Xi Chen, Ilya Sutskever, and Max Welling. Improved variational inference with inverse autoregressive flow. In Advances in Neural Information Processing Systems, pages 4743–4751, 2016.

[23] Artemy Kolchinsky, Brendan D Tracey, and David H Wolpert. Nonlinear information bottleneck. arXiv preprint arXiv:1705.02436, 2017.

[24] Brenden M Lake, Ruslan Salakhutdinov, and Joshua B Tenenbaum. Human-level concept learning through probabilistic program induction. Science, 350(6266):1332–1338, 2015.

[25] Luis A. Lastras-Montano.
Information theoretic lower bounds on negative log likelihood. 2018. URL https://openreview.net/forum?id=rkemqsC9Fm.

[26] Francesco Locatello, Stefan Bauer, Mario Lucic, Gunnar Raetsch, Sylvain Gelly, Bernhard Schölkopf, and Olivier Bachem. Challenging common assumptions in the unsupervised learning of disentangled representations. In International Conference on Machine Learning, pages 4114–4124, 2019.

[27] Emile Mathieu, Tom Rainforth, N Siddharth, and Yee Whye Teh. Disentangling disentanglement in variational autoencoders. In International Conference on Machine Learning, 2019.

[28] Loic Matthey, Irina Higgins, Demis Hassabis, and Alexander Lerchner. dSprites: Disentanglement testing sprites dataset. https://github.com/deepmind/dsprites-dataset/, 2017.

[29] David McAllester and Karl Statos. Formal limitations on the measurement of mutual information. arXiv preprint arXiv:1811.04251, 2018.

[30] Daniel Moyer, Shuyang Gao, Rob Brekelmans, Aram Galstyan, and Greg Ver Steeg. Invariant representations without adversarial training. In Advances in Neural Information Processing Systems, pages 9084–9093, 2018.

[31] George Papamakarios, Iain Murray, and Theo Pavlakou. Masked autoregressive flow for density estimation. In Advances in Neural Information Processing Systems, pages 2338–2347, 2017.

[32] Mary Phuong, Max Welling, Nate Kushman, Ryota Tomioka, and Sebastian Nowozin. The mutual autoencoder: Controlling information in latent code representations. 2018. URL https://openreview.net/forum?id=HkbmWqxCZ.

[33] Ben Poole, Sherjil Ozair, Aaron Van Den Oord, Alex Alemi, and George Tucker. On variational bounds of mutual information. In International Conference on Machine Learning, pages 5171–5180, 2019.

[34] Danilo Rezende and Shakir Mohamed.
Variational inference with normalizing flows. In International Conference on Machine Learning, pages 1530–1538, 2015.

[35] Danilo Jimenez Rezende and Fabio Viola. Taming VAEs. arXiv preprint arXiv:1810.00597, 2018.

[36] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In International Conference on Machine Learning, pages 1278–1286, 2014.

[37] Mihaela Rosca, Balaji Lakshminarayanan, and Shakir Mohamed. Distribution matching in variational inference. arXiv preprint arXiv:1802.06847, 2018.

[38] Ruslan Salakhutdinov and Iain Murray. On the quantitative analysis of deep belief networks. In Proceedings of the 25th International Conference on Machine Learning, pages 872–879. ACM, 2008.

[39] Naftali Tishby, Fernando C Pereira, and William Bialek. The information bottleneck method. arXiv preprint physics/0004057, 2000.

[40] Jakub M Tomczak and Max Welling. VAE with a VampPrior. In AISTATS, 2018. arXiv preprint arXiv:1705.07120.

[41] Michael Tschannen, Olivier Bachem, and Mario Lucic. Recent advances in autoencoder-based representation learning. arXiv preprint arXiv:1812.05069, 2018.

[42] Sjoerd van Steenkiste, Francesco Locatello, Jürgen Schmidhuber, and Olivier Bachem. Are disentangled representations helpful for abstract visual reasoning? arXiv preprint arXiv:1905.12506, 2019.

[43] Satosi Watanabe. Information theoretical analysis of multivariate correlation. IBM Journal of Research and Development, 4(1):66–82, 1960.

[44] Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. 2017.

[45] Shengjia Zhao, Jiaming Song, and Stefano Ermon. InfoVAE: Information maximizing variational autoencoders. CoRR, abs/1706.02262, 2017.
URL http://arxiv.org/abs/1706.02262.

[46] Shengjia Zhao, Jiaming Song, and Stefano Ermon. The information autoencoding family: A Lagrangian perspective on latent variable generative models. CoRR, abs/1806.06514, 2018. URL http://arxiv.org/abs/1806.06514.