{"title": "Deepcode: Feedback Codes via Deep Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 9436, "page_last": 9446, "abstract": "The design of codes for communicating reliably over a statistically well defined channel is an important endeavor involving deep mathematical research and wide- ranging practical applications. In this work, we present the first family of codes obtained via deep learning, which significantly beats state-of-the-art codes designed over several decades of research. The communication channel under consideration is the Gaussian noise channel with feedback, whose study was initiated by Shannon; feedback is known theoretically to improve reliability of communication, but no practical codes that do so have ever been successfully constructed.\n\nWe break this logjam by integrating information theoretic insights harmoniously with recurrent-neural-network based encoders and decoders to create novel codes that outperform known codes by 3 orders of magnitude in reliability. We also demonstrate several desirable properties in the codes: (a) generalization to larger block lengths; (b) composability with known codes; (c) adaptation to practical constraints. 
This result also presents broader ramifications to coding theory: even when the channel has a clear mathematical model, deep learning methodologies, when combined with channel specific information-theoretic insights, can potentially beat state-of-the-art codes, constructed over decades of mathematical research.", "full_text": "Deepcode: Feedback Codes via Deep Learning\n\nHyeji Kim\u21e4, Yihan Jiang\u2020, Sreeram Kannan\u2020, Sewoong Oh\u2021, Pramod Viswanath\u2021\n\nSamsung AI Centre Cambridge*, University of Washington\u2020, University of Illinois at Urbana Champaign\u2021\n\nAbstract\n\nThe design of codes for communicating reliably over a statistically well de\ufb01ned\nchannel is an important endeavor involving deep mathematical research and wide-\nranging practical applications. In this work, we present the \ufb01rst family of codes\nobtained via deep learning, which signi\ufb01cantly beats state-of-the-art codes designed\nover several decades of research. The communication channel under consideration\nis the Gaussian noise channel with feedback, whose study was initiated by Shannon;\nfeedback is known theoretically to improve reliability of communication, but no\npractical codes that do so have ever been successfully constructed.\nWe break this logjam by integrating information theoretic insights harmoniously\nwith recurrent-neural-network based encoders and decoders to create novel codes\nthat outperform known codes by 3 orders of magnitude in reliability. We also\ndemonstrate several desirable properties in the codes: (a) generalization to larger\nblock lengths; (b) composability with known codes; (c) adaptation to practical con-\nstraints. 
This result also presents broader ramifications to coding theory: even when the channel has a clear mathematical model, deep learning methodologies, when combined with channel-specific information-theoretic insights, can potentially beat state-of-the-art codes, constructed over decades of mathematical research.

1 Introduction

The ubiquitous digital communication enabled via wireless (e.g. WiFi, mobile, satellite) and wired (e.g. ethernet, storage media, computer buses) media has been the plumbing underlying the current information age. The advances of reliable and efficient digital communication have been primarily driven by the design of codes which allow the receiver to recover messages reliably and efficiently under noisy environments. The discipline of coding theory has made significant progress in the past seven decades since Shannon's celebrated work in 1948 [1]. As a result, we now have near optimal codes in a canonical setting, namely, the Additive White Gaussian Noise (AWGN) channel. However, several channel models of great practical interest lack efficient and practical coding schemes.
Channels with feedback (from the receiver to the transmitter) are an example of a long-standing open problem with significant practical importance. Modern wireless communication includes feedback, in one form or another; for example, the feedback can be the received value itself, a quantization of the received value, or an automatic repeat request (ARQ) [2]. Accordingly, there are different models for channels with feedback, and among them, the AWGN channel with output feedback is a model that captures the essence of channels with feedback; this model is also classical, introduced by Shannon in 1956 [3]. In this channel model, the received value is fed back (with unit time delay) to the transmitter without any processing (we refer to Figure 1 for an illustration of the channel). Designing codes for this channel via deep learning approaches is the central focus of this paper.
While the output feedback does not improve the Shannon capacity of the AWGN channel [3], it is known to provide better reliability at finite block lengths [4]. On the other hand, practical coding schemes have not been successful in harnessing the feedback gain, thereby significantly limiting the use of feedback in practice. This state of the art is at odds with the theoretical predictions of the gains in reliability from using feedback: the seminal work of Schalkwijk-Kailath [4] proposed the S-K scheme, a (theoretically) achievable scheme with superior reliability guarantees, but which suffers from extreme sensitivity to both the precision of the numerical computation and noise in the feedback [5, 6]. Another competing scheme [7] is designed for channels with noisy feedback, but not only is its reliability poor, it is almost independent of the feedback quality, suggesting that the feedback data is not being fully exploited. More generally, it has been proven that no linear code incorporating the noisy output feedback can perform well [8].

*H. Kim is with Samsung AI Centre Cambridge, UK. Email: hkim1505@gmail.com
†Y. Jiang and S. Kannan are with the Department of Electrical Engineering at University of Washington. Email: yihanrogerjiang@gmail.com and ksreeram@uw.edu.
‡S. Oh and P. Viswanath are with the Coordinated Science Lab at University of Illinois at Urbana Champaign (UIUC). S. Oh is with the Department of Industrial and Enterprise Systems Engineering at UIUC. P. Viswanath is with the Department of Electrical Engineering at UIUC. Email: {swoh,pramodv}@illinois.edu

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.
This is especially troubling since all practical codes are linear and linear codes are known to achieve capacity (without feedback) [9].
In this paper, we demonstrate new neural network-driven encoders (with matching decoders) that operate significantly better (100–1000 times) than the state of the art, on the AWGN channel with (noisy) output feedback. We show that architectural insights from simple communication channels with feedback, when coupled with recurrent neural network architectures, can discover novel codes. We consider Recurrent Neural Network (RNN) parameterized encoders (and decoders), which are inherently nonlinear and map information bits directly to real-valued transmissions in a sequential manner.
Designing codes driven by deep learning has been of significant interest recently [10–25], starting from [10], which proposes an autoencoder framework for communications. In [10], it is demonstrated that for classical AWGN channels, feedforward neural codes can mimic the performance of a well-known code for a short block length (4 information bits). Extending this idea to orthogonal frequency division multiplexing (OFDM), [11, 12] show that neural codes can mimic the performance of state-of-the-art codes for short block lengths (8 information bits). Several results extend the autoencoder idea to other settings of AWGN channels [13] and modulation [26]. Beyond AWGN channels, [14] considers the problem of communicating a complicated source (text) over erasure channels and shows that RNN-based neural codes that map raw text directly to a codeword can beat state-of-the-art codes, when the reliability is evaluated by humans (as opposed to bit error rate). Deep learning has also been applied to the problem of designing decoders for existing encoders [15–19], demonstrating the efficiency, robustness, and adaptivity of neural decoders over existing decoders.
In a different context, for distributed computation, where the encoder adds redundant computations so that the decoder can reliably approximate the desired computations under unavailabilities, [20] showed that neural network based codes can beat the state-of-the-art codes.
While several works in the past years apply deep learning to channel coding, very few of them consider the design of novel codes using deep learning (rather than decoders). Furthermore, none of them are able to beat state-of-the-art codes on a standard (well studied) channel. We demonstrate the first family of codes obtained via deep learning which beats state-of-the-art codes, signaling a potential shift in code design, which historically has been driven by individual human ingenuity with sporadic progress over the decades. Henceforth, we call this new family of codes Deepcode. We also demonstrate the superior performance of variants of Deepcode under a variety of practical constraints. Furthermore, Deepcode has complexity comparable to traditional codes, even without any effort at optimizing the storage and run-time complexity of the neural network architectures. Our main contributions are as follows:

1. We demonstrate Deepcode, a new family of RNN-driven neural codes that have three orders of magnitude better reliability than the state of the art with both noiseless and noisy feedback. Our results are significantly driven by the intuition obtained from information and coding theory, in designing a series of progressive improvements in the neural network architectures (Sections 3 and 4).

2. We show that variants of Deepcode significantly outperform state-of-the-art codes under a variety of practical constraints (for example: delayed feedback, very noisy feedback link) (Section 4).

3. 
We show composability: Deepcode naturally concatenates with a traditional inner code and demonstrates continued improvements in reliability as the block length increases (Section 4).
4. Our interpretation and analysis of Deepcode provide guidance on the fundamental understanding of how the feedback can be used and some information-theoretic insights into designing codes for channels with feedback (Section 5).

2 Problem formulation

The most canonical channel studied in the literature (example: textbook material [27]) and also used in modeling practical scenarios (example: 5G LTE standards) is the Additive White Gaussian Noise (AWGN) channel without feedback. Concretely, the encoder takes in K information bits jointly, b = (b1, ..., bK) ∈ {0,1}^K, and outputs n real-valued signals to be transmitted over a noisy channel (sequentially). At the i-th transmission, for each i ∈ {1, ..., n}, a transmitted symbol xi ∈ ℝ is corrupted by an independent Gaussian noise ni ∼ N(0, σ²), and the decoder receives yi = xi + ni ∈ ℝ. After receiving the n received symbols, the decoder makes a decision on which information bit vector b was sent, out of 2^K possible choices. The goal is to maximize the probability of correctly decoding the received symbols and recovering b.
Both the encoder and the decoder are functions, mapping b ∈ {0,1}^K to x ∈ ℝ^n and y ∈ ℝ^n to b̂ ∈ {0,1}^K, respectively. The design of a good code (an encoder and a corresponding decoder) addresses both (i) the statistical challenge of achieving a small error rate; and (ii) the computational challenge of achieving the desired error rate with an efficient encoder and decoder.
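The formulation above is easy to exercise numerically. The following sketch is our own illustration (not code from the paper): it simulates uncoded transmission over the AWGN channel and estimates the bit error rate; `sigma_from_snr_db` assumes the unit average power constraint, under which SNR = −10 log10 σ².

```python
import numpy as np

def sigma_from_snr_db(snr_db):
    # Under the unit average power constraint, SNR = -10*log10(sigma^2).
    return 10.0 ** (-snr_db / 20.0)

def uncoded_awgn(bits, snr_db, rng):
    """Send each bit as +/-1, add N(0, sigma^2) noise, decode by sign."""
    x = 2.0 * bits - 1.0                      # b in {0,1} -> x in {-1,+1}
    y = x + sigma_from_snr_db(snr_db) * rng.normal(size=x.shape)
    return (y > 0).astype(int)                # symbol-wise nearest-neighbor rule

rng = np.random.default_rng(0)
b = rng.integers(0, 2, size=(10_000, 50))     # 10k blocks of K = 50 bits
b_hat = uncoded_awgn(b, snr_db=0.0, rng=rng)
ber = np.mean(b != b_hat)                     # empirical bit error rate
```

At 0 dB the per-symbol error probability of this uncoded scheme is Q(1) ≈ 0.16; the point of a code is to drive this error down dramatically at the same average power.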
Almost a century of progress in this domain of coding theory has produced several innovative codes that efficiently achieve small error rates, including convolutional codes, Turbo codes, LDPC codes, and polar codes. These codes are known to perform close to the fundamental limits on reliable communication [28].

Figure 1: AWGN channel with noisy feedback (left). Deepcode significantly outperforms the baseline of S-K and the state-of-the-art codes, on block-length 50 and noiseless feedback (right).

In a canonical AWGN channel with noisy feedback, the received symbol yi is transmitted back to the encoder after one unit time of delay and via another additive white Gaussian noise feedback channel (Figure 1). The encoder can use this feedback symbol to sequentially and adaptively decide what symbol to transmit next. At time i the encoder receives a noisy view of what was received at the receiver (in the past by one unit time), ỹ(i−1) = y(i−1) + w(i−1) ∈ ℝ, where the noise is independent and distributed as w(i−1) ∼ N(0, σ²_F). Formally, an encoder is now a function that sequentially maps the information bit vector b and the feedback symbols ỹ1, ..., ỹ(i−1) received thus far to a transmit symbol xi, i.e., fi : (b, ỹ1, ..., ỹ(i−1)) ↦ xi for i ∈ {1, ..., n}, and a decoder is a function that maps the received sequence y = (y1, ..., yn) into estimated information bits, g : y ↦ b̂ ∈ {0,1}^K.
The standard measures of performance are the average bit error rate (BER), defined as BER ≜ (1/K) Σ_{i=1}^{K} P(bi ≠ b̂i), and the block error rate (BLER), defined as BLER ≜ P(b ≠ b̂), where the randomness comes from the forward and feedback channels and any other sources of randomness that might be used in the encoding and decoding processes. It is standard (both theoretically and practically) to have an average power constraint, i.e., (1/n) E[‖x‖²] ≤ 1, where x = (x1, ..., xn) and the expectation is over the randomness in choosing the information bits b uniformly at random, the randomness in the noisy feedback symbols ỹi, and any other randomness used in the encoder.
While the capacity of the channel remains the same in the presence of feedback [3], the reliability can increase significantly, as demonstrated by the celebrated result of Schalkwijk and Kailath (S-K) [4], which is described in detail in Appendix D. Although the optimal theoretical performance is met by the S-K code, critical drawbacks make it fragile. Theoretically, the scheme critically relies on exactly noiseless feedback (i.e., σ²_F = 0), and does not extend to channels with even an arbitrarily small amount of noise in the feedback (i.e., σ²_F > 0). Practically, the scheme is extremely sensitive to numerical precision; we see this in Figure 1, where numerical errors dominate the performance of the S-K scheme with a practical choice of MATLAB implementation using a precision of 16 bits to represent floating-point numbers.
Even with a noiseless feedback channel with σ²_F = 0, which the S-K scheme is designed for, it is outperformed significantly by our proposed Deepcode (described in detail in Section 3).
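The channel model above (forward noise plus unit-delay noisy output feedback) is simple to simulate. The sketch below is our own minimal illustration of the interface an encoder sees, not the paper's code; `encoder` is a hypothetical callback that is only ever shown the feedback symbols available so far.

```python
import numpy as np

def feedback_channel(encoder, n, sigma, sigma_f, rng):
    """n uses of the AWGN channel with unit-delay noisy output feedback.

    encoder(i, fb) must return the i-th transmit symbol x_i given the
    feedback symbols fb = [y~_1, ..., y~_{i-1}] received so far.
    """
    received, fb = [], []
    for i in range(n):
        x_i = encoder(i, fb)                      # may use only past feedback
        y_i = x_i + sigma * rng.normal()          # forward channel: y_i = x_i + n_i
        received.append(y_i)
        fb.append(y_i + sigma_f * rng.normal())   # feedback: y~_i = y_i + w_i
    return np.array(received)

# Example: an encoder that ignores the feedback entirely (for illustration).
rng = np.random.default_rng(1)
y = feedback_channel(lambda i, fb: 1.0, n=150, sigma=1.0, sigma_f=0.1, rng=rng)
```

With `sigma_f = 0` the encoder sees the receiver's output exactly (the S-K setting); `sigma_f > 0` gives the noisy-feedback setting studied in this paper.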
At a moderate SNR of 2 dB, Deepcode can outperform the S-K scheme by three orders of magnitude in BLER. The resulting BLER is shown as a function of the Signal-to-Noise Ratio (SNR), defined as −10 log10 σ². Also shown as baselines are the state-of-the-art polar, LDPC, and tail-biting convolutional codes (TBCC) in a 3GPP document for the 5G meeting [29] (we refer to Appendix A for the details of these codes used in the simulation). Deepcode significantly improves over all state-of-the-art codes of similar block-length and the same rate. Also plotted as a baseline is the theoretically estimated performance of the best code with no efficient decoding scheme. This impractical baseline lies between an approximate achievable BLER (labelled Normapx in the figure) and a converse to the BLER (labelled Converse in the figure) from [28, 30]. More recently proposed schemes address the S-K scheme's sensitivity to noise in the feedback, but either still suffer from similar sensitivity to numerical precision at the decoder [31], or are incapable of exploiting the feedback information [7], as we illustrate in Figure 4.

3 Neural encoder and decoder

A natural strategy to create a feedback code is to utilize a recurrent neural network (RNN) as an encoder since (i) communication with feedback is naturally a sequential process and (ii) the sequential structure can be exploited for efficient decoding. We propose representing the encoder and the decoder as RNNs, training them jointly under AWGN channels with noisy feedback, and minimizing the error in decoding the information bits.
However, in our experiments, we \ufb01nd that this strategy by itself is\ninsuf\ufb01cient to achieve any performance improvement with feedback.\nWe exploit information theoretic insights to enable improved performance, by considering the erasure\nchannel with feedback: here transmitted bits are either received perfectly or erased and whether the\nprevious bit was erased or received perfectly is fed back to the transmitter. In such a channel, the\nfollowing two-phase scheme can be used: transmit a block of symbols, and then transmit whichever\nsymbols were erased in the \ufb01rst block (and ad in\ufb01nitum). This motivates a two-phase scheme, where\nuncoded bits are sent in the \ufb01rst phase, and then based on the feedback in the \ufb01rst phase, coded bits\nare sent in the second phase; thus the code only needs to be designed for the second phase. Even\ninside this two-phase paradigm, several architectural choices need to be made. We show in this\nsection that these intuitions can be critically employed to innovate neural network architectures.\nOur experiments focus on the setting of rate 1/3 and information block length of 50 for concreteness1.\nThat is, the encoder maps K = 50 message bits to a codeword of length n = 150. We discuss\ngeneralizations to longer block lengths in Section 4.\nA. RNN feedback encoder/decoder (RNN (linear) and RNN (tanh)). We propose an encoding\nscheme that progresses in two phases. In the \ufb01rst phase, the K information bits are sent raw (uncoded)\nover the AWGN channel. In the second phase, 2K coded bits are generated based on the information\nbits b and (delayed) output feedback and sequentially transmitted. We propose a decoding scheme\nusing two layers of bidirectional Gated Recurrent Units (GRU). When jointly trained, a linear RNN\nencoder achieves performance close to Turbo code that does not use the feedback information at all\nas shown in Figure 2. 
With a non-linear activation function of tanh(·), the performance improves, achieving BER close to the existing S-K scheme. Such a gain of non-linear codes over linear ones is in line with theory [31].

1Source codes are available under https://github.com/hyejikim1/feedback_code (Keras) and https://github.com/yihanjiang/feedback_code (PyTorch).

Encoder A: RNN feedback encoder

Figure 2: Building upon a simple linear RNN encoder (left), we progressively improve the architecture. Eventually, with the RNN(tanh)+ZP+W+A architecture formally described in Section 3, we significantly outperform the baseline of the S-K scheme and Turbo code, by several orders of magnitude in the bit error rate, on block-length 50 and noiseless feedback (σ²_F = 0).

Encoding. The architecture of the encoder is shown in Figure 2. The encoding process has two phases. In the first phase, the encoder simply transmits the K raw message bits. That is, the encoder maps bk to ck = 2bk − 1 for k ∈ {1, ..., K}, and stores the feedback ỹ1, ..., ỹK for later use. In the second phase, the encoder generates a coded sequence of length 2K (length (1/r − 1)K for a general rate-r code) through a single directional RNN. In particular, the k-th RNN cell generates two coded bits ck,1, ck,2 for k ∈ {1, ..., K}, using both the information bits and the (delayed) output feedback from the earlier raw information bit transmissions. The input to the k-th RNN cell is of size four: bk, ỹk − ck (the estimated noise added to the k-th message bit in phase 1), and the most recent two noisy feedbacks from phase 2: ỹ(k−1),1 − c(k−1),1 and ỹ(k−1),2 − c(k−1),2.
Note that we use ỹk,j = ck,j + nk,j + wk,j to denote the feedback received from the transmission of ck,j for k ∈ {1, ..., K} and j ∈ {1, 2}, where nk,j and wk,j are the corresponding forward and feedback channel noises, respectively.
To generate codewords that satisfy the power constraint, we add a normalization layer to the RNN outputs so that each coded bit has mean 0 and variance 1. During training, the normalization layer subtracts the batch mean from the output of the RNN and divides by the standard deviation of the batch. After training, we compute the mean and the variance of the RNN outputs over 10^6 examples. In testing, we use the precomputed means and variances. Further implementation details are in Appendix B.
Decoding. Based on the received sequence y = (y1, ..., yK, y1,1, y1,2, y2,1, y2,2, ..., yK,1, yK,2) of length 3K, the decoder estimates the K information bits. For the decoder, we use a two-layered bidirectional Gated Recurrent Unit (GRU), where the input to the k-th GRU cell is a tuple of three received symbols, (yk, yk,1, yk,2). We refer to Appendix B for more implementation details.
Training. Both the encoder and decoder are trained jointly using binary cross-entropy as the loss function over 4 × 10^6 examples, with batch size 200, via an Adam optimizer (β1 = 0.9, β2 = 0.999, ϵ = 1e-8). The input to the neural network is the K information bits and the output is the K estimated bits (as in the autoencoder setting). During training, we let K = 100. AWGN channels are simulated for the channels from the encoder to the decoder and from the decoder to the encoder. In training, we set the forward SNR equal to the test SNR and the feedback SNR equal to the test feedback SNR. We randomly initialize the weights of the encoder and the decoder.
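To make the indexing in the Encoding paragraph above concrete, the following sketch (our own, with names that are not from the paper) builds the 4-dimensional input to each phase-2 RNN cell. In the real encoder the phase-2 symbols and their feedback are produced step by step; here they are assumed given, to show only the bookkeeping.

```python
import numpy as np

def phase2_cell_inputs(b, fb_phase1, fb_phase2, c_phase2):
    """Inputs (b_k, y~_k - c_k, y~_{k-1,1} - c_{k-1,1}, y~_{k-1,2} - c_{k-1,2}).

    b         : (K,)   information bits in {0, 1}
    fb_phase1 : (K,)   feedback for the phase-1 (uncoded) transmissions
    fb_phase2 : (K, 2) feedback for the phase-2 coded symbols
    c_phase2  : (K, 2) the phase-2 coded symbols that were sent
    """
    K = b.shape[0]
    c1 = 2.0 * b - 1.0                        # phase-1 mapping c_k = 2 b_k - 1
    inp = np.zeros((K, 4))
    inp[:, 0] = b
    inp[:, 1] = fb_phase1 - c1                # estimated phase-1 noise
    delayed = np.zeros((K, 2))
    delayed[1:] = fb_phase2[:-1] - c_phase2[:-1]   # unit delay; zeros at k = 1
    inp[:, 2:] = delayed                      # delayed phase-2 noise estimates
    return inp
```

Feeding the cell noise estimates (ỹ − c) rather than raw feedback is what lets the second phase concentrate on correcting the first phase.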
We observed that training with random initialization of the encoder-decoder gives a better encoder-decoder compared to initializing with a pre-trained encoder/decoder from sequential channel codes for non-feedback AWGN channels (e.g. convolutional codes). We also use a decaying learning rate and gradient clipping; we reduce the learning rate by a factor of 10 after training with 10^6 examples, starting from 0.02. Gradients are clipped if their L2 norm exceeds 1, so that we prevent the gradients from getting too large.
Typical error analysis. Due to the recurrent structure in generating the coded bits (ck,1, ck,2), the coded bit stream carries more information on the first few bits than on the last few bits (e.g. b1 than bK). This results in more errors in the last information bits, as shown in Figure 3, where we plot the average BER of bk for k ∈ {1, ..., K}.

Figure 3: (Left) A naive RNN(tanh) code gives a high BER in the last few information bits. With the idea of zero padding and power allocation, the RNN(tanh)+ZP+W+A architecture gives a BER that varies less across the bit position, and the overall BER is significantly improved over the naive RNN(tanh) code. (Middle) Noise variances across bit positions which result in a block error: high noise variance on the second parity bit stream (c1,2, ..., cK,2) causes a block error. (Right) Noise covariance: a noise sequence which results in a block error does not have a significant correlation across positions.

B. RNN feedback code with zero padding (RNN (tanh) + ZP). In order to reduce the high errors in the last information bits, as shown in Figure 3, we apply the zero padding (ZP) technique; we pad a zero at the end of the information bits, and transmit a codeword for the padded information bits (the encoder and decoder with zero padding are illustrated in Appendix B).
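The zero-padding bookkeeping is tiny but easy to get wrong. A minimal sketch (ours, not the paper's code), using the rate-1/3 accounting in which each padded bit costs three channel symbols:

```python
import numpy as np

def zero_pad(bits, n_pad):
    """Append n_pad known zeros so the recurrent encoder terminates cleanly."""
    return np.concatenate([bits, np.zeros(n_pad, dtype=bits.dtype)])

def codeword_length(K, n_pad, rate=1/3):
    """Each (padded) bit costs 1/rate channel symbols: K=50, n_pad=1 -> 153."""
    return round((K + n_pad) / rate)

padded = zero_pad(np.ones(50, dtype=int), n_pad=1)
```

Because the decoder knows the padded positions are zeros, they act as anchors that pull down the error rate of the genuinely unknown bits near the end of the block.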
With zero padding, the BER of the last information bits, as well as of the other bits, drops significantly, as shown in Figure 3. Zero padding requires a few extra channel uses (e.g. with 3-symbol zero padding, we map 50 information bits to a codeword of length 153). However, due to the significant improvement in BER, it is widely used in sequential codes (e.g. convolutional codes and turbo codes).
Typical error analysis. To see if there is a pattern in the noise sequence which makes the decoder fail, we study the first- and second-order noise statistics which result in an error in decoding. In Figure 3 (Middle), we plot the average variance of the noise added to bk in the first phase and to ck,1 and ck,2 in the second phase, as a function of k. From the figure, we make two observations: (i) large noise in the last bits causes an error, and (ii) large noise in ck,2 is likely to cause an error, which implies that the raw bit stream and the coded bit streams are not equally robust to the noise, an observation that will be exploited next. In Figure 3 (Right), we plot the noise covariances that result in a decoding error. From Figure 3 (Right), we see that there is no particular correlation within the noise sequence that makes the decoder fail. This suggests that there is no particular error pattern to be exploited to further improve the BER performance.
C. RNN feedback code with power allocation (RNN(tanh) + ZP + W). Based on the observation that the raw bit ck and the coded bits ck,1, ck,2 are not equally robust, as shown in Figure 3 (Middle), we introduce trainable weights which allow allocating different amounts of power to the raw bit and the coded bits. Appendix B provides all implementation details. By introducing and training these weights, we achieve the improvement in BER shown in Figures 2 and 3.
Typical error analysis. 
While the average BER is improved by about an order of magnitude for most bit positions, as shown in Figure 3 (Left), the BER of the last bit remains about the same. On the other hand, the BER of the first few bits is now smaller, suggesting the following bit-specific power allocation method.
D. Deepcode: RNN feedback code with bit power allocation (RNN(tanh) + ZP + W + A). We introduce a weight vector allowing the power of bits in different positions to be different, as illustrated in Figure 10. Ideally, we would like to reduce the power for the first information bits and increase the power for the last information bits. The resulting BER curve is shown in Figure 2 (-o-). We can see that the BER is noticeably decreased. In Figure 3 (-o-), we can see that the BER in the last bits is reduced, and that the BER in the first bits is increased. Our use of unequal power allocation across information bits is in line with other approaches from information theory [32], [33]. We call this neural code Deepcode.
Typical error analysis. As shown in Figure 3, the BER at each position remains about the same except for the last few bits. This suggests a symmetry in our code and a nearest-neighbor-like decoder. For an AWGN channel without feedback, it is known that the optimal decoder (the nearest neighbor decoder) under a symmetric code (in particular, where each coded bit follows a Gaussian distribution) is robust to the distribution of the noise [34]; the BER does not increase if we keep the power of the noise fixed and only change its distribution. As an experiment demonstrating the robustness of Deepcode, in Appendix E we show that the BER of Deepcode does not increase if we keep the power of the noise fixed and change the distribution from i.i.d. Gaussian to bursty Gaussian noise.
Complexity. Complexity and latency, as well as reliability, are important metrics in practice, as the encoder and decoder need to run in real time on mobile devices.
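As a rough sanity check on the scale of the networks involved, one can count parameters for the layer sizes quoted in this section. The formulas below are the standard counts for a vanilla RNN cell and a GRU cell (exact totals vary slightly across implementations, e.g. in bias conventions); they are our own back-of-envelope illustration, not figures reported in the paper.

```python
def rnn_cell_params(input_size, hidden_size):
    # Vanilla RNN: W_ih (h x x), W_hh (h x h), bias (h) -> h*(x + h + 1).
    return hidden_size * (input_size + hidden_size + 1)

def gru_cell_params(input_size, hidden_size):
    # GRU: three gates (update, reset, candidate), each with W_ih, W_hh, bias.
    return 3 * hidden_size * (input_size + hidden_size + 1)

# Encoder: single RNN cell, 4-dim input, 50 hidden units.
enc = rnn_cell_params(4, 50)              # = 2750 parameters
# Decoder: one bidirectional GRU layer, 3-dim input, 50 units per direction.
dec_layer1 = 2 * gru_cell_params(3, 50)
```

A few thousand parameters per cell, i.e., small dense matrix multiplications, which is consistent with the complexity discussion here.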
Deepcode has computational\ncomplexity and latency comparable to currently used codes (without feedback) that are already in\ncommunication standards. Turbo decoder, for example, is a belief-propagation decoder with many\n(e.g., 10 \u2013 20) iterations, and each iteration is followed by a permutation. Turbo encoder also includes\na permutation of information bits (of length K). On the other hand, the proposed neural encoder in\nDeepcode is a single layered RNN encoder with 50 hidden units, and the neural decoder in Deepcode\nis a 2-layered GRU decoder, also with 50 hidden units, all of which are matrix multiplications that\ncan be parallelized. Ideas such as knowledge distillation [35] and network binarization [36] can be\nused to potentially further reduce the complexity of the network.\n\n4 Practical considerations: noise, delay, coding in feedback, and blocklength\n\nWe considered so far the AWGN channel with noiseless output feedback with a unit time-step delay.\nIn this section, we demonstrate the robustness of Deepcode (and its variants) under two variations on\nthe feedback channel, noise and delay, and present generalization to longer block lengths. We show\nthat (a) Deepcode and its variant that allows a K-step delayed feedback are more reliable than the\nstate-of-the-art schemes in channels with noisy feedback; (b) by allowing the receiver to feed back an\nRNN encoded output instead of its raw output, and learning this RNN encoder, we achieve a further\nimprovement in reliability, demonstrating the power of encoding in the feedback link; (c) Deepcode\nconcatenated with turbo code achieves superior error rate decay as block length increases with noisy\nfeedback.\nNoisy feedback. We show that Deepcode trained under AWGN channels with noisy output feedback,\nachieves a signi\ufb01cantly smaller BER than both S-K and C-L schemes under AWGN channels with\nnoisy output feedback. 
In Figure 4 (Left), we plot the BER as a function of the feedback SNR for the S-K scheme, the C-L scheme, and Deepcode at rate 1/3 with 50 information bits, where we fix the forward channel SNR to be 0 dB. As the feedback SNR increases, we expect the BER to decrease. However, as shown in Figure 4 (Left), the C-L scheme, which is designed for noisy feedback, and the S-K scheme are very sensitive to noise in the feedback, and their reliability is almost independent of the feedback quality. Deepcode outperforms these two baseline (linear) codes by a large margin, with decaying error as the feedback SNR increases, showing that Deepcode harnesses noisy feedback information to make communication more reliable. This is highly promising, as performance under noisy feedback is directly related to practical communication channels.
Noisy feedback with delay. We model the practical constraint of delay in the feedback by introducing a variant of Deepcode that works with a K time-step delayed feedback (discussed in detail in Appendix B.5); recall that K is the number of information bits, so this code tolerates a large delay in the feedback. Perhaps unexpectedly, we see from Figure 4 (Left) that this neural code for delayed feedback achieves a BER similar to that with no delay in the feedback; this is true especially at small feedback SNRs, up to around 12 dB.
Noisy feedback with delay and coding. It is natural to allow the receiver to send back a general function of its past received values, i.e., the receiver encodes its output and sends the coded (real-valued) bit. Designing the code for this setting is challenging as it involves designing two encoders and one decoder jointly in a sequential manner. We propose using an RNN as an encoder that maps the noisy output to the transmitted feedback, with implementation details in Appendix B.5.
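A minimal sketch of that idea (our illustration only, with fixed made-up weights standing in for trained parameters): the receiver runs a small recurrent state over everything it has received and feeds back one real value per step, rather than the raw output.

```python
import numpy as np

def coded_feedback(received, w_in, w_rec):
    """Receiver-side recurrent encoder: each feedback symbol is a function of
    *all* received values so far, summarized in a recurrent state."""
    h = np.zeros(w_rec.shape[0])
    out = []
    for y in received:
        h = np.tanh(w_in * y + w_rec @ h)   # update state from the new output
        out.append(h[0])                    # emit one real-valued feedback symbol
    return np.array(out)

# With zero recurrent weights this degenerates to a memoryless tanh of y.
fb = coded_feedback(np.array([0.0, 1.0, -1.0]), np.ones(2), np.zeros((2, 2)))
```

The point of the recurrence is that the feedback can summarize the whole received history, which is what raw (or delayed) output feedback cannot do.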
Figure 4 demonstrates the powerful encoding of the received output, as learnt by the neural architecture; the BER improves by a factor of two to three.
Generalization to longer block lengths. In wireless communications, a wide range of blocklengths is of interest (e.g., 40 to 6144 information bits in LTE standards). In previous sections, we considered a block length of 50 information bits. Here we show how to generalize Deepcode to longer block lengths and achieve improved reliability as we increase the block length.

Figure 4: BER vs. SNR of the feedback channel (Left), BER vs. SNR (Middle), and BER vs. blocklength (Right). (Left) Deepcode (introduced in Section 3) and a variant of the neural code that allows a K time-step delay significantly outperform the two baseline schemes under noisy feedback. Another variant of Deepcode, which allows the receiver to feed back an RNN-encoded output (with K-step delay), performs even better than the neural code with raw output feedback (with unit delay), demonstrating the power of coding in the feedback. (Middle) By unrolling the RNN cells of Deepcode, the BER of Deepcode remains the same for block lengths 50 to 500. (Right) Concatenated Deepcode and turbo code (with and without noise in the feedback) achieves a BER that decays exponentially as the block length increases, faster than turbo codes (without feedback) at the same rate.

A natural generalization of the RNN-based neural code is to unroll the RNN cells. In Figure 4 (Middle), we plot the BER as a function of the SNR for block lengths of 50 and 500 information bits (under noiseless feedback) when we unroll the RNN cells. We can see that the BER remains the same as we increase the block length. This is not an entirely satisfying generalization because, typically, it is possible to design a code for which the error rate decays faster as the block length increases.
For example, turbo codes have an error rate decaying exponentially (log BER decays linearly) in the block length, as shown in Figure 4 (Right). This critically relies on the interleaver, which creates long-range dependencies between information bits that are far apart in the block. Since the neural encoder is a sequential code, there is no strong long-range dependence: each transmitted bit depends on only a few past information bits and their feedback (we refer to Section 5 for a detailed discussion).
To resolve this problem, we propose a new concatenated code which uses Deepcode as the inner code and a turbo code as the outer code. The outer code is not restricted to a turbo code, and we refer to Appendix C for a detailed discussion.
In Figure 4 (Right), we plot the BERs of the concatenated code, under both noiseless and noisy feedback (feedback SNR 10dB), and of the turbo code, all at rate 1/9 and (forward) SNR 6.5dB. From the figure, we see that even with noisy feedback, the BER drops almost exponentially (log BER drops linearly) as the block length increases, and the slope is sharper than that for turbo codes. We also note that in this setting, the C-L scheme suggests not using the feedback.

5 Interpretation

Thus far we have used information-theoretic insights to drive our deep learning designs. Here, we ask whether the deep learning architectures we have learnt can provide insight into the information theory of communication with feedback. We aim to understand the behavior of Deepcode (i.e., the coded bits generated via the RNN in Phase II). We show that in the second phase, (a) the encoder focuses on refining information bits that were corrupted by large noise in the first phase; and (b) each coded bit depends on past as well as current information bits, i.e., there is coupling in the coding process.
Correcting bits highly corrupted by noise in Phase I.
The major motivation for the two-phase encoding scheme is that after Phase I, the encoder knows which of the K information bits were corrupted by large noise, and in Phase II the encoder can focus on refining those bits. In Figure 5, we plot samples of (nk, ck,1) (left) and (nk, ck,2) (right) for bk = 1 and bk = 0, where nk denotes the noise added to the transmission of bk in the first phase. Consider bk = 1. The figure shows that if the noise added to bit bk in Phase I is large, the encoder generates coded bits close to zero (i.e., it does not further refine bk). Otherwise, the encoder generates coded bits of large magnitude (i.e., it uses more power to refine bk).

Figure 5: Noise in the first phase (nk, horizontal axis) vs. the first parity bit ck,1 (left) and the second parity bit ck,2 (right). Blue (x) points are for bk = 1 and red (o) points are for bk = 0.

Coupling. A natural question is whether our feedback code exploits the memory of the RNN and codes information bits jointly. To answer this question, we look at the correlation between the information bits and the coded bits. If the memory of the RNN were not used, we would expect the coded bits (ck,1, ck,2) to depend only on bk. We find that E[ck,1bk] = 0.42, E[ck,1bk−1] = 0.24, E[ck,1bk−2] = 0.1, E[ck,1bk−3] = 0.05, and E[ck,2bk] = 0.57, E[ck,2bk−1] = 0.11, E[ck,2bk−2] = 0.05, E[ck,2bk−3] = 0.02 (for the encoder trained at forward SNR 0dB with noiseless feedback). This result implies that the RNN encoder does make use of memory, of length two to three.
Overall, our analysis suggests that Deepcode exploits memory and selectively enhances bits that were subject to larger noise, properties reminiscent of a good code. We also observe that the relationship between the transmitted bit and the previous feedback is non-linear, as expected. Thus our code has the features requisite of a strong feedback code.
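The correlation probe just described can be reproduced on synthetic data. In the sketch below, the coded stream is a toy linear stand-in with an assumed three-tap memory (our own coefficients, not Deepcode's learned, non-linear mapping); the empirical estimates of E[ck bk−i] then recover a decaying memory profile analogous to the one reported above.

```python
import numpy as np

# Empirical estimate of E[c_k * b_{k-i}] from simulated blocks. The coded
# symbols below are a TOY linear stand-in with three memory taps; Deepcode's
# actual mapping is non-linear and learned.
rng = np.random.default_rng(2)
N, K = 2000, 50
corr = np.zeros(4)
for _ in range(N):
    b = 2.0 * rng.integers(0, 2, size=K) - 1.0               # i.i.d. +/-1 information bits
    c = 0.6 * b + 0.3 * np.roll(b, 1) + 0.1 * np.roll(b, 2)  # assumed memory taps
    for i in range(4):
        shifted = np.roll(b, i)                               # shifted[k] = b[k - i]
        # skip the first few indices to avoid np.roll wrap-around artifacts
        corr[i] += np.mean(c[4:] * shifted[4:]) / N
print(np.round(corr, 2))  # approximately [0.6, 0.3, 0.1, 0.0]
```

Since the bits are i.i.d. and equiprobable, E[ck bk−i] recovers exactly the i-th memory tap of the toy code; applying the same probe to Deepcode's coded bits yields the correlations reported in the text (memory of length two to three).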
Furthermore, improvements can be obtained if, instead of transmitting two coded symbols per bit during Phase II, an attention-type mechanism is used to zoom in on bits that were prone to high noise in Phase I. These insights suggest the following generic feedback code: a sequential code with long cumulative memory, in which the importance of a given bit in the memory is dynamically weighted based on the feedback.

6 Conclusion

In this paper we have shown that appropriately designed and trained RNN codes (encoder and decoder), which we call Deepcode, outperform the state-of-the-art codes by a significant margin on the challenging problem of communicating over AWGN channels with noisy output feedback, both in the theoretical model and with practical considerations taken into account. By concatenating Deepcode with a traditional outer code, the BER curve drops significantly with increasing block length, allowing the learned neural network architectures to generalize. The encoding and decoding capabilities of the RNN architectures suggest that new codes could be found for other open problems in information theory (e.g., network settings), where practical codes are sorely missing.

7 Acknowledgment

We thank Shrinivas Kudekar and Saurabh Tavildar for helpful discussions and for providing references to the state-of-the-art feedforward codes. We thank Dina Katabi for a detailed discussion that prompted work on system implementation. This work is in part supported by National Science Foundation awards CCF-1553452 and RI-1815535, Army Research Office grant W911NF-18-1-0384, and an Amazon Catalyst award. Y. Jiang and S. Kannan would also like to acknowledge NSF awards 1651236 and 1703403.

References

[1] C. E. Shannon, "A mathematical theory of communication, part I, part II," Bell Syst. Tech. J., vol. 27, pp. 623–656, 1948.

[2] J. I. Choi, M. Jain, K. Srinivasan, P. Levis, and S. Katti, "Achieving single channel, full duplex wireless communication," in Proceedings of the 16th Annual International Conference on Mobile Computing and Networking, MOBICOM 2010, Chicago, Illinois, USA, September 20–24, 2010.

[3] C. Shannon, "The zero error capacity of a noisy channel," IRE Transactions on Information Theory, vol. 2, no. 3, pp. 8–19, 1956.

[4] J. Schalkwijk and T. Kailath, "A coding scheme for additive noise channels with feedback–I: No bandwidth constraint," IEEE Transactions on Information Theory, vol. 12, no. 2, pp. 172–182, 1966.

[5] J. Schalkwijk, "A coding scheme for additive noise channels with feedback–II: Band-limited signals," IEEE Transactions on Information Theory, vol. 12, no. 2, pp. 183–189, April 1966.

[6] R. G. Gallager and B. Nakiboglu, "Variations on a theme by Schalkwijk and Kailath," IEEE Transactions on Information Theory, vol. 56, no. 1, pp. 6–17, Jan 2010.

[7] Z. Chance and D. J. Love, "Concatenated coding for the AWGN channel with noisy feedback," IEEE Transactions on Information Theory, vol. 57, no. 10, pp. 6633–6649, Oct 2011.

[8] Y.-H. Kim, A. Lapidoth, and T. Weissman, "The Gaussian channel with noisy feedback," in IEEE International Symposium on Information Theory, 2007, pp. 1416–1420.

[9] P. Elias, "Coding for noisy channels," in IRE Convention Record, vol. 4, 1955, pp. 37–46.

[10] T. J. O'Shea and J. Hoydis, "An introduction to deep learning for the physical layer," arXiv preprint arXiv:1702.00832, 2017.

[11] A. Felix, S. Cammerer, S. Dörner, J. Hoydis, and S. ten Brink, "OFDM-autoencoder for end-to-end learning of communications systems," arXiv preprint arXiv:1803.05815, 2018.

[12] S. Cammerer, S. Dörner, J. Hoydis, and S.
ten Brink, "End-to-end learning for physical layer communications," in International Zurich Seminar on Information and Communication (IZS 2018) Proceedings. ETH Zurich, 2018, pp. 51–52.

[13] T. J. O'Shea, T. Erpek, and T. C. Clancy, "Deep learning based MIMO communications," CoRR, vol. abs/1707.07980, 2017.

[14] N. Farsad, M. Rao, and A. Goldsmith, "Deep learning for joint source-channel coding of text," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018.

[15] H. Kim, Y. Jiang, R. Rana, S. Kannan, S. Oh, and P. Viswanath, "Communication algorithms via deep learning," in International Conference on Learning Representations (ICLR), Vancouver, April 2018.

[16] E. Nachmani, Y. Be'ery, and D. Burshtein, "Learning to decode linear codes using deep learning," in IEEE 54th Annual Allerton Conference on Communication, Control, and Computing (Allerton), 2016, pp. 341–346.

[17] E. Nachmani, E. Marciano, L. Lugosch, W. J. Gross, D. Burshtein, and Y. Be'ery, "Deep learning methods for improved decoding of linear codes," IEEE Journal of Selected Topics in Signal Processing, 2018.

[18] X. Tan, W. Xu, Y. Be'ery, Z. Zhang, X. You, and C. Zhang, "Improving massive MIMO belief propagation detector with deep neural network," arXiv preprint arXiv:1804.01002, 2018.

[19] J. Seo, J. Lee, and K. Kim, "Decoding of polar code by using deep feed-forward neural networks," in International Conference on Computing, Networking and Communications (ICNC), March 2018, pp. 238–242.

[20] J. Kosaian, K. Rashmi, and S. Venkataraman, "Learning a code: Machine learning for approximate non-linear coded computation," arXiv preprint arXiv:1806.01259, 2018.

[21] J. Zhao and Z.
Gao, "Research on the blind equalization technology based on the complex BP neural network with tunable activation functions," in IEEE 2nd Advanced Information Technology, Electronic and Automation Control Conference (IAEAC), March 2017, pp. 813–817.

[22] N. Farsad and A. Goldsmith, "Neural network detection of data sequences in communication systems," IEEE Transactions on Signal Processing, vol. 66, no. 21, pp. 5663–5678, Nov 2018.

[23] H. He, C. Wen, S. Jin, and G. Y. Li, "Deep learning-based channel estimation for beamspace mmWave massive MIMO systems," CoRR, vol. abs/1802.01290, 2018.

[24] H. Sun, X. Chen, M. Hong, Q. Shi, X. Fu, and N. D. Sidiropoulos, "Learning to optimize: Training deep neural networks for interference management," IEEE Transactions on Signal Processing, August 2018.

[25] C.-K. Wen, W.-T. Shih, and S. Jin, "Deep learning for massive MIMO CSI feedback," IEEE Wireless Communications Letters, December 2017.

[26] T. J. O'Shea, K. Karra, and T. C. Clancy, "Learning to communicate: Channel auto-encoders, domain specific regularizers, and attention," in IEEE International Symposium on Signal Processing and Information Technology (ISSPIT), 2016, pp. 223–228.

[27] T. M. Cover and J. A. Thomas, Elements of Information Theory. John Wiley & Sons, 2012.

[28] Y. Polyanskiy, H. V. Poor, and S. Verdú, "Channel coding rate in the finite blocklength regime," IEEE Transactions on Information Theory, vol. 56, no. 5, pp. 2307–2359, 2010.

[29] Huawei, "Performance evaluation of channel codes for control channel," 3GPP TSG-RAN WG1 #87, Reno, U.S.A., November 14–18, 2016, R1-1611257. [Online]. Available: www.3gpp.org/ftp/tsg_ran/WG1_RL1/TSGR1_87/Docs/R1-1611257.zip

[30] T.
Erseghe, "On the evaluation of the Polyanskiy-Poor-Verdu converse bound for finite block-length coding in AWGN," IEEE Transactions on Information Theory, vol. 61, January 2014.

[31] Y. H. Kim, A. Lapidoth, and T. Weissman, "The Gaussian channel with noisy feedback," in IEEE International Symposium on Information Theory, June 2007, pp. 1416–1420.

[32] T. Duman and M. Salehi, "On optimal power allocation for turbo codes," in IEEE International Symposium on Information Theory, 1997, p. 104.

[33] H. Qi, D. Malone, and V. Subramanian, "Does every bit need the same power? An investigation on unequal power allocation for irregular LDPC codes," in 2009 International Conference on Wireless Communications Signal Processing, Nov 2009, pp. 1–5.

[34] A. Lapidoth, "Nearest neighbor decoding for additive non-Gaussian noise channels," IEEE Transactions on Information Theory, vol. 42, no. 5, pp. 1520–1529, Sep 1996.

[35] G. Hinton, O. Vinyals, and J. Dean, "Distilling the knowledge in a neural network," arXiv preprint arXiv:1503.02531, 2015.

[36] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, "XNOR-Net: ImageNet classification using binary convolutional neural networks," in European Conference on Computer Vision. Springer, 2016, pp. 525–542.

[37] A. Ben-Yishai and O. Shayevitz, "Interactive schemes for the AWGN channel with noisy feedback," IEEE Transactions on Information Theory, vol. 63, no. 4, pp. 2409–2427, April 2017.

[38] G. D. Forney, Jr., Concatenated Codes. MIT Press, Cambridge, MA, 1966.

[39] K. Miwa, N. Miki, T. Kawamura, and M. Sawahashi, "Performance of decision-directed channel estimation using low-rate turbo codes for DFT-precoded OFDMA," in IEEE 75th Vehicular Technology Conference (VTC Spring), May 2012, pp.
1\u20135.\n\n11\n\n\f", "award": [], "sourceid": 5743, "authors": [{"given_name": "Hyeji", "family_name": "Kim", "institution": "Samsung AI Center Cambridge"}, {"given_name": "Yihan", "family_name": "Jiang", "institution": "University of Washington Seattle"}, {"given_name": "Sreeram", "family_name": "Kannan", "institution": "University of Washington"}, {"given_name": "Sewoong", "family_name": "Oh", "institution": "University of Washington"}, {"given_name": "Pramod", "family_name": "Viswanath", "institution": "UIUC"}]}