{"title": "Turbo Autoencoder: Deep learning based channel codes for point-to-point communication channels", "book": "Advances in Neural Information Processing Systems", "page_first": 2758, "page_last": 2768, "abstract": "Designing codes that combat the noise in a communication medium has remained a significant area of research in information theory as well as wireless communications. Asymptotically optimal channel codes have been developed by mathematicians for communicating under canonical models after over 60 years of research. On the other hand, in many non-canonical channel settings, optimal codes do not exist and the codes designed for canonical models are adapted via heuristics to these channels and are thus not guaranteed to be optimal. In this work, we make significant progress on this problem by designing a fully end-to-end jointly trained neural encoder and decoder, namely, Turbo Autoencoder (TurboAE), with the following contributions: (a) under moderate block lengths, TurboAE approaches state-of-the-art performance under canonical channels; (b) moreover, TurboAE outperforms the state-of-the-art codes under non-canonical settings in terms of reliability. 
TurboAE shows that the development of channel coding design can be automated via deep learning, with near-optimal performance.", "full_text": "Turbo Autoencoder: Deep learning based channel\ncodes for point-to-point communication channels\n\nYihan Jiang\n\nECE Department\n\nUniversity of Washington\n\nSeattle, United States\n\nyij021@uw.edu\n\nSreeram Kannan\nECE Department\n\nUniversity of Washington\n\nSeattle, United States\n\nksreeram@ee.washington.edu\n\nHyeji Kim\n\nSamsung AI Center\n\nCambridge\n\nCambridge, United\n\nKingdom\n\nhkim1505@gmail.com\n\nSewoong Oh\nAllen School of\n\nComputer Science &\n\nEngineering\n\nUniversity of Washington\n\nSeattle, United States\n\nsewoong@cs.washington.edu\n\nAbstract\n\nHimanshu Asnani\nSchool of Technology\nand Computer Science\n\nTata Institute of\n\nFundamental Research\n\nMumbai, India\n\nhimanshu.asnani@tifr.res.in\n\nPramod Viswanath\nECE Department\n\nUniversity of Illinois at\nUrbana Champaign\nIllinois, United States\npramodv@illinois.edu\n\nDesigning codes that combat the noise in a communication medium has remained\na signi\ufb01cant area of research in information theory as well as wireless communica-\ntions. Asymptotically optimal channel codes have been developed by mathemati-\ncians for communicating under canonical models after over 60 years of research.\nOn the other hand, in many non-canonical channel settings, optimal codes do not\nexist and the codes designed for canonical models are adapted via heuristics to\nthese channels and are thus not guaranteed to be optimal. In this work, we make\nsigni\ufb01cant progress on this problem by designing a fully end-to-end jointly trained\nneural encoder and decoder, namely, Turbo Autoencoder (TurboAE), with the\nfollowing contributions: (a) under moderate block lengths, TurboAE approaches\nstate-of-the-art performance under canonical channels; (b) moreover, TurboAE\noutperforms the state-of-the-art codes under non-canonical settings in terms of\nreliability. 
TurboAE shows that the development of channel coding design can be\nautomated via deep learning, with near-optimal performance.\n\nIntroduction\n\n1\nAutoencoder is a powerful unsupervised learning framework to learn latent representations by\nminimizing reconstruction loss of the input data [1]. Autoencoders have been widely used in\nunsupervised learning tasks such as representation learning [1] [2], denoising [3], and generative\nmodel [4] [5]. Most autoencoders are under-complete autoencoders, for which the latent space is\nsmaller than the input data [2]. Over-complete autoencoders have latent space larger than input\ndata. While the goal of under-complete autoencoder is to \ufb01nd a low dimensional representation of\ninput data, the goal of over-complete autoencoder is to \ufb01nd a higher dimensional representation of\ninput data so that from a noisy version of the higher dimensional representation, original data can be\nreliably recovered. Over-complete autoencoders are used in sparse representation learning [3] [6] and\nrobust representation learning [7].\nChannel coding aims at communicating a message over a noisy random channel [8]. As shown in\nFigure 1 left, the transmitter maps a message to a codeword via adding redundancy (this mapping is\ncalled encoding). A channel between the transmitter and the receiver randomly corrupts the codeword\nso that the receiver observes a noisy version which is used by the receiver to estimate the transmitted\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fmessage (this process is called decoding). The encoder and the decoder together can be naturally\nviewed as an over-complete autoencoder, where the noisy channel in the middle corrupts the hidden\nrepresentation (codeword). 
Therefore, designing a reliable autoencoder can have a strong bearing on alternative ways of designing new encoding and decoding schemes for wireless communication systems.
Traditionally, the design of communication algorithms first involves designing a 'code' (i.e., the encoder) via optimizing certain mathematical properties of the encoder, such as minimum code distance [9]. The associated decoder that minimizes the bit-error-rate is then derived based on the maximum a posteriori (MAP) principle. However, while the optimal MAP decoder is computationally simple for some simple codes (e.g., convolutional codes), for known capacity-achieving codes the MAP decoder is not computationally efficient; hence, alternative decoding principles such as belief propagation are employed (e.g., for decoding turbo codes). The progress on the design of optimal channel codes with computationally efficient decoders has been quite sporadic due to its reliance on human ingenuity. Since Shannon's seminal work in 1948 [8], it took several decades of research to finally reach the current state-of-the-art codes [10].
Near-optimal channel codes such as Turbo [11], Low Density Parity Check (LDPC) [12], and Polar codes [10] show Shannon capacity-approaching [8] performance on AWGN channels, and they have had a tremendous impact on the Long Term Evolution (LTE) and 5G standards. The traditional approach has the following caveats:
(a) Decoder design heavily relies on handcrafted optimal decoding algorithms for the canonical Additive White Gaussian Noise (AWGN) channel, where the signal is corrupted by i.i.d. Gaussian noise. In practical channels, when the channel deviates from the AWGN setting, heuristics are often used to compensate for the non-Gaussian properties of the noise, which leaves room for potential improvement in the reliability of a decoder [9] [13].
(b) Channel codes are designed for a finite block length K. 
Channel codes are guaranteed to be optimal only when the block length approaches infinity, and thus are near-optimal in practice only when the block length is large. On the other hand, under short and moderate block length regimes, there is room for improvement [14].
(c) The encoder designed for the AWGN channel is used across a large family of channels, while the decoder is adapted. This design methodology fails to utilize the flexibility of the encoder.
Related work. Deep learning has pushed the state-of-the-art performance of computer vision and natural language processing to a new level far beyond handcrafted algorithms in a data-driven fashion [15]. There has also been a recent movement in applying deep learning to wireless communications. Deep learning based channel decoder design has been studied since [16] [17], where the encoder is fixed as a near-optimal code. It is shown that belief propagation decoders for LDPC and Polar codes can be imitated by neural networks [18] [19] [20] [21] [22]. It is also shown that convolutional and turbo codes can be decoded optimally via Recurrent Neural Networks (RNN) [23] and Convolutional Neural Networks (CNN) [24]. Equipping a decoder with a learnable neural network also allows fast adaptation via meta-learning [25]. Recent works also extend deep learning to multiple-input multiple-output (MIMO) settings [26]. While neural decoders show improved performance on various communication channels, there has been limited success in inventing novel codes using this paradigm. Training methods for improving both modulation and channel coding are introduced in [16] [17], where a (7,4) neural code mapping a 4-bit message to a length-7 codeword can match the (7,4) Hamming code performance. Current research includes training an encoder and a decoder with noisy feedback [27], improving modulation gain [28], as well as extensions to multi-terminal settings [29]. 
Joint source-channel coding shows improved results combining source\ncoding (compression) along with channel coding (noise mitigation) [30]. Neural codes were shown\nto outperform existing state-of-the-art codes on the feedback channel [31]. However, in the canonical\nsetting of AWGN channel, neural codes are still far from capacity-approaching performance due to\nthe following challenges.\n(Challenge A) Encoding with randomness is critical to harvest coding gain on long block lengths [8].\nHowever, existing sequential neural models, both CNN and even RNN, can only learn limited local\ndependency [32]. Hence, neural encoder cannot suf\ufb01ciently utilize the bene\ufb01ts of even moderate\nblock length.\n\n2\n\n\f(Challenge B) Training neural encoder and decoder jointly (with a random channel in between)\nintroduces optimization issues where the algorithm gets stuck at local optima. Hence, a novel training\nalgorithm is needed.\nContributions. In this paper, we confront the above challenges by introducing Turbo Autoencoder\n(henceforth, TurboAE) \u2013 the \ufb01rst channel coding scheme with both encoder and decoder powered\nby neural networks that achieves reliability close to the state-of-the-art channel codes under AWGN\nchannels for a moderate block length. We demonstrate that channel coding, which has been a focus\nof study by mathematicians for several decades [9], can be learned in an end-to-end fashion from\ndata alone. Our major contributions are:\n\n\u2022 We introduce TurboAE, a neural network based over-complete autoencoder parameterized\nas Convolutional Neural Networks (CNN) along with interleavers (permutation) and de-\ninterleavers (de-permutation) inspired by turbo codes (Section 3.1). 
We introduce TurboAE-\nbinary, which binarizes the codewords via straight-through estimator (Section 3.2).\n\n\u2022 We propose techniques that are critical for training TurboAE which includes mechanisms\nof alternate training of encoder and decoder as well as strategies to choose right training\nexamples. Our training methodology ensures stable training of TurboAE without getting\ntrapped at locally optimal encoder-decoder solutions. (Section 3.3)\n\n\u2022 Compared to multiple capacity-approaching codes on AWGN channels, TurboAE shows\nsuperior performance in the low to middle SNR range when the block length is of moderate\nsize (K \u223c 100). To the best of our knowledge, this is the \ufb01rst result demonstrating the deep\nlearning powered discovered neural codes can outperform traditional codes in the canonical\nAWGN setting (Section 4.1).\n\n\u2022 On a non-AWGN channel, \ufb01ne-tuned TurboAE shows signi\ufb01cant improvements over state-\nof-the-art coding schemes due to the \ufb02exibility of encoder design, which shows that TurboAE\nhas advantages on designing codes where handcrafted solutions fail (Section 4.2).\n\n2 Problem Formation\nThe channel coding problem is illustrated in Figure 1 left, which consists of three blocks \u2013 an encoder\nf\u03b8(\u00b7), a channel c(\u00b7), and a decoder g\u03c6(.). A channel c(\u00b7) randomly corrupts an input x and is\nrepresented as a probability transition function py|x. A canonical example of channel c(\u00b7) is an\nidentically and independently distributed (i.i.d.) AWGN channel, which generates yi = xi + zi for\nzi \u223c N (0, \u03c32), i = 1,\u00b7\u00b7\u00b7 , K. The encoder x = f\u03b8(u) maps a random binary message sequence\nu = (u1,\u00b7\u00b7\u00b7 , uK) \u2208 {0, 1}K of block length K to a codeword x = (x1,\u00b7\u00b7\u00b7 , xN ) of length N,\nwhere x must satisfy either soft power constraint where E(x) = 0 and E(x2) = 1, or hard power\nconstraint x \u2208 {\u22121, +1}. 
Code rate is defined as R = K/N, where N > K. The decoder gφ(y) maps a real-valued received sequence y = (y1, ···, yN) ∈ R^N to an estimate of the transmitted message sequence û = (û1, ···, ûK) ∈ {0, 1}^K.
The AWGN channel allows closed-form mathematical analysis, and has remained the major playground for channel coding researchers. The noise level is defined via the signal-to-noise ratio, SNR = −10 log10 σ². The decoder recovers the original message as û = gφ(y) using the received signal y.
Channel coding aims to minimize the error rate of the recovered message û. The standard metrics are bit error rate (BER), defined as BER = (1/K) ∑_{i=1}^{K} Pr(ûi ≠ ui), and block error rate (BLER), defined as BLER = Pr(û ≠ u).
While canonical capacity-approaching channel codes work well as the block length goes to infinity, when the block length is short they are not guaranteed to be optimal. We show the benchmarks at block length 100 in Figure 1 right with the widely used LDPC, Turbo, Polar, and Tail-biting Convolutional Code (TBCC), generated via the Vienna 5G simulator [33] [34], with code rate 1/3.
Naively applying deep learning models by replacing the encoder and decoder with general-purpose neural networks does not perform well. Direct applications of the fully connected neural network (FCNN) cannot scale to longer block lengths; the performance of FCNN-AE is even worse than a repetition code [35]. Direct applications where both the encoder and the decoder are convolutional autoencoders (termed CNN-AE [36]) show better performance than TBCC, but are far from capacity-approaching codes such as LDPC, Polar, and Turbo.

Figure 1: Channel coding can be viewed as an over-complete autoencoder with the channel in the middle (left). TurboAE performs well under moderate block length in low and middle SNR (right).
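As a concrete illustration of the definitions above (noise level σ derived from SNR, BER, and BLER), the following sketch simulates an i.i.d. AWGN channel and evaluates a simple rate-1/3 repetition code. The repetition code and all function names are illustrative stand-ins for this problem setup, not the paper's encoder or decoder:

```python
import numpy as np

def awgn(x, snr_db, rng):
    # SNR = -10 log10(sigma^2)  =>  sigma = 10^(-snr_db / 20)
    sigma = 10 ** (-snr_db / 20)
    return x + sigma * rng.standard_normal(x.shape)

def repetition_encode(u, rate_inv=3):
    # Map bits {0,1} -> BPSK symbols {-1,+1}, each repeated rate_inv times (rate 1/3).
    return np.repeat(2.0 * u - 1.0, rate_inv, axis=-1)

def repetition_decode(y, rate_inv=3):
    # Majority vote: average each group of rate_inv noisy symbols, then threshold at 0.
    return (y.reshape(*y.shape[:-1], -1, rate_inv).mean(-1) > 0).astype(int)

def ber_bler(u, u_hat):
    errors = u != u_hat
    # BER averages bit errors over all positions; BLER counts blocks with any error.
    return errors.mean(), errors.any(axis=-1).mean()

rng = np.random.default_rng(0)
u = rng.integers(0, 2, size=(10000, 100))          # 10k messages, block length K = 100
y = awgn(repetition_encode(u), snr_db=0.0, rng=rng)
ber, bler = ber_bler(u, repetition_decode(y))
```

At 0 dB the per-bit error rate of this baseline is roughly 4%, while almost every length-100 block contains at least one error, illustrating why BER and BLER are reported separately.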
Bidirectional RNN and LSTM [35] have similar performance to CNN-AE and are not shown in the figure for clarity. Thus neither CNN- nor RNN-based autoencoders can directly approach state-of-the-art performance. A key reason for their shortcoming is that they have only local memory: the encoder only remembers information locally. To have high protection against channel noise, it is necessary to have long-term memory.
We propose TurboAE with interleaved encoding and iterative decoding, which creates long-term memory in the code and shows a significant improvement compared to CNN-AE. TurboAE has two versions: TurboAE-continuous, which faces a soft power constraint (i.e., the total power across a codeword is bounded), and TurboAE-binary, which faces a hard power constraint (i.e., each transmitted symbol has a power constraint and is thus forced to be binary). Both TurboAE-binary and TurboAE-continuous perform comparably to or better than all other capacity-approaching codes at low SNR, while at high SNR (over 2 dB with BER < 10⁻⁵), the performance is worse only than the LDPC and Polar codes.

3 TurboAE: Architecture Design and Training

3.1 Design of TurboAE

Turbo code and turbo principle: The Turbo code is the first capacity-approaching code ever designed [11]. There are two novel components of the Turbo code which led to its success: an interleaved encoder and an iterative decoder. The starting point of the Turbo code is a recursive systematic convolutional (RSC) code, which has an optimal decoding algorithm (the Bahl-Cocke-Jelinek-Raviv (BCJR) algorithm [37]). A key disadvantage of the RSC code is that the algorithm lacks long-range memory (since the convolutional code operates on a sliding window). 
The key insight of Berrou was to introduce long-range memory by creating two copies of the input bits: the first goes through the RSC code and the second copy goes through an interleaver (which is a permutation of the bits) before going through the same code. Such a code can be decoded by iteratively alternating between soft-decoding based on the signal received from the first copy and then using the de-interleaved version as a prior to decode the second copy. The 'Turbo principle' [38] refers to this iterative decoding, successively refining the posterior distribution on the transmitted bits across decoding stages in the original and interleaved order. This code is known to have excellent performance, and inspired by this, we design TurboAE featuring both a learnable interleaved encoder and an iterative decoder.
Interleaved Encoding Structure: Interleaving is widely used in communication systems to mitigate bursty noise [39]. Formally, the interleaver xπ = π(x) and the de-interleaver x = π⁻¹(xπ) shuffle and shuffle back the input sequence x with a pseudo-random interleaving array known to both the encoder and the decoder, as shown in Figure 2 left. In the context of the Turbo code and TurboAE, the interleaving is not used to mitigate bursty errors (since we are mainly concerned with i.i.d. channels) but rather to add long-range memory to the structure of the code.
We take code rate 1/3 as an example for the interleaved encoder fθ, which consists of three learnable encoding blocks fi,θ(·) for i ∈ {1, 2, 3}, where bi = fi,θ(u) for i ∈ {1, 2} and b3 =
The power constraint of channel coding is enforced via\npower constraint block xi = h(bi).\n\nFigure 2: Visualization of Interleaver (\u03c0) and De-interleaver (\u03c0\u22121) (left); TurboAE encoder on code\nrate 1/3 (right)\n\nIterative Decoding Structure: As received codewords are encoded from original message u and\ninterleaved message \u03c0(u), decoding interleaved code requires iterative decoding on both interleaved\nand de-interleaved order shown in Figure 3. Let y1, y2, y3 denote noisy versions of x1, x2, x3,\nrespectively. The decoder runs multiple iterations, with each iteration contains two decoders g\u03c6i,1\nand g\u03c6i,2 for interleaved and de-interleaved order on the i-th iteration.\nThe \ufb01rst decoder g\u03c6i,1 takes received signal y1, y2 and de-interleaved prior p with shape (K, F ),\nwhere as F is the information feature size for each code bit, to produce the posterior q with same\nshape (K, F ). The second decoder g\u03c6i,2 takes interleaved signal \u03c0(y1), y3 and interleaved prior p to\nproduce posterior q. The posterior of previous stage q serves as the prior of next stage p. The \ufb01rst\niteration takes 0 as a prior, and at last iteration the posterior is of shape (K, 1), are decoded as by\nsigmoid function as \u02c6u = sigmoid(q).\nBoth encoder and decoder structure can be considered as a parametrization of Turbo code. Once we\nparametrize the encoder and the decoder, since the encoder, channel, and decoder are differentiable,\nTurboAE can be trained end-to-end via gradient descent and its variants.\n\nFigure 3: TurboAE iterative decoder on code rate 1/3\n\nEncoder and Decoder Design: The space of messages and codewords are exponential (For a length-\nK binary sequence, there are 2K distinct messages). Hence, the encoder and decoder must have some\nstructural restrictions to ensure generalization to messages unseen during the training [40]. 
Applying parameter-sharing sequential neural models such as CNNs and RNNs is a natural parametrization method for both the encoding and the decoding blocks.
RNN models such as the Gated Recurrent Unit (GRU) and Long Short-Term Memory (LSTM) are commonly used for sequential modeling problems [41]. The RNN is widely used in deep learning based communication systems [23] [31] [24] [35], as the RNN has a natural connection to sequential encoding and decoding algorithms such as the convolutional code and the BCJR algorithm [23].
However, RNN models are: (1) of higher complexity than CNN models, (2) harder to train due to gradient explosion, and (3) harder to run in parallel [32]. In this paper, we use the one-dimensional CNN (1D-CNN) as the alternative encoding and decoding model. Although the longest dependency length is fixed, the 1D-CNN has lower complexity, better trainability [42], and easier parallel implementation on AI chips [43]. The learning curve comparison between CNN and RNN is shown in Figure 4 left. Training the CNN-based model converges faster and more stably than the RNN-based GRU model. The TurboAE complexity is shown in the appendix.
Power Constraint Block: The operation of the power constraint block (i.e., h(·) in x = h(b)) depends on the requirement of the power constraint.
The soft power constraint normalizes the power of the code so that E(x) = 0 and E(x²) = 1. TurboAE-continuous with the soft power constraint allows the code x to be continuous. Addressing the statistical estimation issue given a limited batch size, we use the normalization method [44]: xi = (bi − µ(b))/σ(b), where µ(b) = (1/K) ∑_{i=1}^{K} bi and σ(b) = sqrt((1/K) ∑_{i=1}^{K} (bi − µ(b))²) are scalar mean and standard deviation estimates of the whole block. During the training phase, µ(b) and σ(b) are estimated from the whole batch. On the other hand, during the testing phase, µ(b) and σ(b) are pre-computed over multiple batches. 
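A minimal sketch of the power-constraint block h(·), assuming plain NumPy in place of an autograd framework: the soft constraint is the per-block normalization just described, and the hard constraint applies sign(·) while returning the straight-through surrogate-gradient mask ∂x/∂b = 1(|b| ≤ 1) of Section 3.2 explicitly (a real implementation would instead override the backward pass):

```python
import numpy as np

def normalize(b, mu=None, sigma=None):
    """Soft power constraint: x = (b - mu(b)) / sigma(b).
    During training, mu/sigma come from the current batch; at test time,
    pre-computed values can be passed in."""
    mu = b.mean() if mu is None else mu
    sigma = b.std() if sigma is None else sigma
    return (b - mu) / sigma

def binarize_ste(b):
    """Hard power constraint via sign(.). Returns the binary code and the
    straight-through estimator's gradient mask 1(|b| <= 1), which a training
    framework would use in place of the true (zero) gradient of sign."""
    x = np.where(b >= 0, 1.0, -1.0)
    grad_mask = (np.abs(b) <= 1.0).astype(float)
    return x, grad_mask
```

Composing the two, x = sign(normalize(b)), mirrors the pre-training recipe in the text: the soft constraint is learned first, and the sign is stacked on top of it afterwards.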
The normalization layer can also be considered as BatchNorm without the affine projection, which is critical to stabilize the training of the encoder [45].

3.2 Design of TurboAE-binary – Binarization via Straight-Through Estimator

Some wireless communication systems require a hard power constraint, where the encoder output is binary, x ∈ {−1, +1} [46], so that every symbol has exactly the same power and the information is conveyed in the sign. The hard power constraint is not differentiable, since restricting x ∈ {−1, +1} via x = sign(b) has zero gradient almost everywhere. We combine normalization and the Straight-Through Estimator (STE) [47] [48] to bypass this differentiability issue. STE passes the gradient of x = sign(b) as ∂x/∂b = 1(|b| ≤ 1) and enables training of an encoder by passing estimated gradients to the encoder, while enforcing the hard power constraint.
Simply training with STE cannot learn a good encoder, as shown in Figure 4 right. To mitigate the trainability issue, we apply pre-training, which first pre-trains TurboAE-continuous and then adds the hard power constraint on top of the soft power constraint as x = sign((b − µ(b))/σ(b)), where the gradient is estimated via STE. Figure 4 right shows that with pre-training, TurboAE-binary reaches Turbo
Figure 4 right shows that with pre-training, TurboAE-binary reaches Turbo\nperformance within 100 epochs of \ufb01ne-tuning.\nTurboAE-binary is slightly worse than TurboAE-continuous as shown in Figure 1, especially at high\nSNR, since: (a) TurboAE-continuous can be considered as a joint coding and high order modulation\nscheme, which has a larger capacity than binary coding at high SNR [46], and (b) STE is an estimated\ngradient, which makes training encoder more noisy and less stable.\n\nFigure 4: Learning Curves on CNN vs GRU: CNN shows faster training convergence (left); Training\nwith STE requires soft-constraint pre-training (right)\n\n3.3 Neural Trainability Design\n\nThe training algorithms for training TurboAE are shown in Algorithm 1. Compared to the conventional\ndeep learning model training, training TurboAE has the following differences:\n\n\u2022 Very Large Batch Size Large batch size is critical to average the channel noise effects.\nEmpirically, TurboAE reaches Turbo performance only when the batch size is grater than\n500.\n\n\u2022 Train Encoder and Decoder Separately We train encoder and decoder separately as shown\n\nin Algorithm 1, to avoid getting stuck in local optimum [27] [35].\n\n6\n\n\fAlgorithm 1 Training Algorithm for TurboAE\nRequire: Batch Size B, Train Encoder Steps Tenc, Train Decoder Steps Tdec, Number of Epoch M\n\nEncoder Training SNR \u03c3enc, Decoder Training SNR \u03c3dec\nfor i \u2264 M do\n\nfor j \u2264 Tenc do\n\nend for\nfor j \u2264 Tdec do\n\nend for\n\nend for\n\nGenerate random training example u, and random noise z \u223c N (0, \u03c3enc).\nTrain encoder f\u03b8 with decoder \ufb01xed, with u and z.\n\nGenerate random training example u, and random noise z \u223c N (0, \u03c3dec).\nTrain decoder g\u03c6 with encoder \ufb01xed, with u and z.\n\n\u2022 Different Training Noise Level for Encoder and Decoder Empirically, while it is best to\ntrain a decoder at a low training SNR as discussed in [23], it is best to train 
an encoder at\na training SNR that matches the testing SNR, e.g training encoder at 2dB results in good\nencoder when testing at 2dB [35]. In this work, we use random selection of -1.5 to 2 dB for\ntraining the decoder, and test and train the encoder at the same SNR.\n\nWe do a detailed analysis of training algorithms in the supplementary materials. The hyper-parameters\nare shown in Table 1.\n\nLoss\nEncoder\nDecoder\nDecoder Iterations\nInfo Feature Size F\nBatch Size\nOptimizer\nTraining Schedule for Each Epoch\nBlock Length K\nNumber of Epochs M\n\nBinary Cross-Entropy (BCE)\n2 layers 1D-CNN, kernel size 5, 100 \ufb01lters for each fi,\u03b8(.) block\n5 layers 1D-CNN, kernel size 5, 100 \ufb01lters for each g\u03c6i,j (.) block\n6\n5\n500 when start, double when saturates for 20 epochs, till reaches 2000\nAdam with initial learning rate 0.0001\nTrain encoder Tenc = 100 times, then train decoder Tdec = 500 times\n100\n800\n\nTable 1: Hyper-parameters of TurboAE\n\n4 Experiment Results\n\n4.1 Block length coding gain of TurboAE\n\nAs block length increases, better reliability can be achieved via channel coding, which is referred to\nas blocklength gain [11]. We compare TurboAE (only TurboAE-continuous is shown in this section)\nwith the Turbo code and CNN-AE, tested at BER at 2dB on different block lengths, shown in Figure\n5 left. Both CNN-AE and TurboAE are trained with block length 100, and tested on various block\nlengths. As the block length increases, CNN-AE shows saturating blocklength gain, while TurboAE\nand Turbo code reduce the error rate as the block length increases. Naively applying general purpose\nneural network such as CNN to channel coding problem cannot gain performance on long block\nlengths.\nNote that TurboAE is still worse than Turbo when the block length is large, since long block length\nrequires large memory usage and more complicated structure to train. 
Improving TurboAE at very long block lengths remains open as an interesting future direction.
The BER performance boost from the neural architecture design is shown in Figure 5 right. We compare the fine-tuned performance among CNN-AE, TurboAE, and TurboAE without the interleaving xπ = π(x). TurboAE with interleaving significantly outperforms both TurboAE without interleaving and CNN-AE.

Figure 5: Interleaving improves the blocklength gain (left); Neural architecture improves BER performance (right).

4.2 Performance on non-AWGN channels

Typically there are no closed-form solutions for non-AWGN and non-iid channels. We compare two benchmarks: (1) the canonical Turbo code, and (2) the DeepTurbo decoder [24], a neural decoder fine-tuned on the given channel. We test the performance on both iid and non-iid channels in the following settings:
(a) iid Additive T-distribution Noise (ATN) channel, with yi = xi + zi, where iid zi ∼ T(ν, σ²) is heavy-tailed T-distribution noise (tail weight controlled by the parameter ν = 3.0) with variance σ². The performance is shown in Figure 6 left.
(b) non-iid Markovian-AWGN channel, a special AWGN channel with two states, {good, bad}. In the bad state the noise is 1 dB worse than the nominal SNR, and in the good state the noise is 1 dB better than the nominal SNR. The state transition probabilities between the good and bad states are symmetric, pbg = pgb = 0.8. The performance is shown in Figure 6 right.
For both the ATN and Markovian-AWGN channels, DeepTurbo outperforms the canonical Turbo code. TurboAE-continuous with a learnable encoder outperforms DeepTurbo in both cases. 
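The two test channels above can be simulated roughly as follows. The rescaling of the t-distributed samples to variance σ² and the ±1 dB state offsets follow the text, but the function names and the exact state-switching convention are assumptions, not the Vienna simulator's implementation:

```python
import numpy as np

def atn_noise(shape, nu=3.0, sigma=1.0, rng=None):
    """i.i.d. additive T-distribution noise z ~ T(nu, sigma^2): heavy-tailed
    standard-t samples rescaled to variance sigma^2 (valid for nu > 2, where
    the standard t-distribution has variance nu / (nu - 2))."""
    rng = rng or np.random.default_rng()
    return rng.standard_t(nu, size=shape) * sigma / np.sqrt(nu / (nu - 2))

def markovian_awgn_noise(shape, snr_db, p_switch=0.8, rng=None):
    """Two-state (good/bad) Markovian-AWGN noise: the effective SNR is 1 dB
    above (good) or below (bad) the nominal SNR, with symmetric switching
    probability p_switch between consecutive symbols."""
    rng = rng or np.random.default_rng()
    n = int(np.prod(shape))
    # Simulate the two-state Markov chain over symbol positions (0 = good).
    states = np.empty(n, dtype=int)
    states[0] = rng.integers(2)
    flips = rng.random(n - 1) < p_switch
    for i in range(1, n):
        states[i] = states[i - 1] ^ flips[i - 1]
    # sigma = 10^(-snr_eff/20), with snr_eff = snr_db +/- 1 dB per state.
    sigma = 10 ** (-(snr_db + np.where(states == 0, 1.0, -1.0)) / 20)
    return (sigma * rng.standard_normal(n)).reshape(shape)
```

Fine-tuning on samples from such channel models, rather than the AWGN channel, is what lets the learnable encoder exploit their non-Gaussian and non-iid structure.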
TurboAE-binary outperforms DeepTurbo on the ATN channel, while on the Markovian-AWGN channel TurboAE-binary does not perform better than DeepTurbo in high SNR regimes (but still outperforms the canonical Turbo). With the flexibility of designing an encoder, TurboAE designs better codes than the handcrafted Turbo code for channels without a closed-form mathematical solution.

Figure 6: TurboAE on the iid ATN channel (left) and the non-iid Markovian-AWGN channel (right)

5 Conclusion and discussion

In summary, in this paper we propose TurboAE, an end-to-end learnt channel coding scheme with a novel neural structure and training algorithms. TurboAE learns capacity-approaching codes on various channels under moderate block lengths by building upon the 'turbo principle' and thus exhibits discovery of codes for channels where a closed-form representation may not exist. TurboAE hence opens an interesting research direction: designing channel coding algorithms via joint encoder and decoder design.
A few pending issues hamper further improving TurboAE. Large block lengths require extensive training memory. With enough computing resources, we believe that TurboAE's performance at larger block lengths can potentially improve. High-SNR training remains hard, as at high SNR the error events become extremely rare. Optimizing BLER requires a novel and stable training objective. Such pending issues are interesting future directions.

Acknowledgments

This work was supported in part by NSF awards 1908003, 651236 and 1703403.

References
[1] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol, "Extracting and composing robust features with denoising autoencoders," in Proceedings of the 25th International Conference on Machine Learning. ACM, 2008, pp. 1096–1103.

[2] A. Krizhevsky and G. E. Hinton, "Using very deep autoencoders for content-based image retrieval," in ESANN, 2011.

[3] A. Makhzani and B. 
Frey, \u201cK-sparse autoencoders,\u201d arXiv preprint arXiv:1312.5663, 2013.\n\n[4] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel, \u201cInfogan: Interpretable\nrepresentation learning by information maximizing generative adversarial nets,\u201d in Advances in neural\ninformation processing systems, 2016, pp. 2172\u20132180.\n\n[5] D. P. Kingma and M. Welling, \u201cAuto-encoding variational bayes,\u201d arXiv preprint arXiv:1312.6114, 2013.\n\n[6] L. Deng, M. L. Seltzer, D. Yu, A. Acero, A.-r. Mohamed, and G. Hinton, \u201cBinary coding of speech\nspectrograms using a deep auto-encoder,\u201d in Eleventh Annual Conference of the International Speech\nCommunication Association, 2010.\n\n[7] Q. V. Le, A. Karpenko, J. Ngiam, and A. Y. Ng, \u201cIca with reconstruction cost for ef\ufb01cient overcomplete\n\nfeature learning,\u201d in Advances in neural information processing systems, 2011, pp. 1017\u20131025.\n\n[8] C. E. Shannon, \u201cA mathematical theory of communication, part i, part ii,\u201d Bell Syst. Tech. J., vol. 27, pp.\n\n623\u2013656, 1948.\n\n[9] T. Richardson and R. Urbanke, Modern coding theory. Cambridge university press, 2008.\n\n[10] E. Arikan, \u201cChannel polarization: A method for constructing capacity-achieving codes,\u201d in 2008 IEEE\n\nInternational Symposium on Information Theory.\n\nIEEE, 2008, pp. 1173\u20131177.\n\n[11] C. Berrou, A. Glavieux, and P. Thitimajshima, \u201cNear shannon limit error-correcting coding and decoding:\nTurbo-codes. 1,\u201d in Proceedings of ICC\u201993-IEEE International Conference on Communications, vol. 2.\nIEEE, 1993, pp. 1064\u20131070.\n\n[12] D. J. MacKay and R. M. Neal, \u201cNear shannon limit performance of low density parity check codes,\u201d\n\nElectronics letters, vol. 33, no. 6, pp. 457\u2013458, 1997.\n\n[13] J. Li, X. Wu, and R. Laroia, OFDMA mobile broadband communications: A systems approach. Cambridge\n\nUniversity Press, 2013.\n\n[14] Y. Polyanskiy, H. V. 
Poor, and S. Verd\u00fa, \u201cChannel coding rate in the \ufb01nite blocklength regime,\u201d IEEE\n\nTransactions on Information Theory, vol. 56, no. 5, p. 2307, 2010.\n\n[15] I. Goodfellow, Y. Bengio, and A. Courville, Deep learning. MIT press, 2016.\n\n9\n\n\f[16] T. J. O\u2019Shea, K. Karra, and T. C. Clancy, \u201cLearning to communicate: Channel auto-encoders, domain\nspeci\ufb01c regularizers, and attention,\u201d in Signal Processing and Information Technology (ISSPIT), 2016\nIEEE International Symposium on.\n\nIEEE, 2016, pp. 223\u2013228.\n\n[17] T. J. O\u2019Shea and J. Hoydis, \u201cAn introduction to machine learning communications systems,\u201d arXiv preprint\n\narXiv:1702.00832, 2017.\n\n[18] E. Nachmani, Y. Be\u2019ery, and D. Burshtein, \u201cLearning to decode linear codes using deep learning,\u201d in\nIEEE,\n\nCommunication, Control, and Computing (Allerton), 2016 54th Annual Allerton Conference on.\n2016, pp. 341\u2013346.\n\n[19] E. Nachmani, E. Marciano, L. Lugosch, W. J. Gross, D. Burshtein, and Y. Be\u2019ery, \u201cDeep learning methods\n\nfor improved decoding of linear codes,\u201d IEEE Journal of Selected Topics in Signal Processing, 2018.\n\n[20] W. Xu, Z. Wu, Y.-L. Ueng, X. You, and C. Zhang, \u201cImproved polar decoder based on deep learning,\u201d in\n\n2017 IEEE International Workshop on Signal Processing Systems (SiPS).\n\nIEEE, 2017, pp. 1\u20136.\n\n[21] T. Gruber, S. Cammerer, J. Hoydis, and S. ten Brink, \u201cOn deep learning-based channel decoding,\u201d in\n\nInformation Sciences and Systems (CISS), 2017 51st Annual Conference on.\n\nIEEE, 2017, pp. 1\u20136.\n\n[22] S. Cammerer, T. Gruber, J. Hoydis, and S. ten Brink, \u201cScaling deep learning-based decoding of polar codes\nIEEE, 2017, pp.\n\nvia partitioning,\u201d in GLOBECOM 2017-2017 IEEE Global Communications Conference.\n1\u20136.\n\n[23] H. Kim, Y. Jiang, R. Rana, S. Kannan, S. Oh, and P. 
Viswanath, \u201cCommunication algorithms via deep learn-\ning,\u201d in The International Conference on Representation Learning (ICLR 2018) Proceedings. Vancouver,\n2018.\n\n[24] Y. Jiang, H. Kim, S. Kannan, S. Oh, and P. Viswanath, \u201cDeepturbo: Deep turbo decoder,\u201d 2019.\n\n[25] Y. Jiang, H. Kim, H. Asnani, and S. Kannan, \u201cMind: Model independent neural decoder,\u201d arXiv preprint\n\narXiv:1903.02268, 2019.\n\n[26] N. Samuel, T. Diskin, and A. Wiesel, \u201cDeep mimo detection,\u201d in 2017 IEEE 18th International Workshop\n\non Signal Processing Advances in Wireless Communications (SPAWC).\n\nIEEE, 2017, pp. 1\u20135.\n\n[27] F. A. Aoudia and J. Hoydis, \u201cEnd-to-end learning of communications systems without a channel model,\u201d\n\nin 2018 52nd Asilomar Conference on Signals, Systems, and Computers.\n\nIEEE, 2018, pp. 298\u2013303.\n\n[28] A. Felix, S. Cammerer, S. D\u00f6rner, J. Hoydis, and S. t. Brink, \u201cOfdm-autoencoder for end-to-end learning\n\nof communications systems,\u201d arXiv preprint arXiv:1803.05815, 2018.\n\n[29] T. J. O\u2019Shea, T. Erpek, and T. C. Clancy, \u201cDeep learning based MIMO communications,\u201d CoRR, vol.\n\nabs/1707.07980, 2017.\n\n[30] K. Choi, K. Tatwawadi, T. Weissman, and S. Ermon, \u201cNecst: Neural joint source-channel coding,\u201d arXiv\n\npreprint arXiv:1811.07557, 2018.\n\n[31] H. Kim, Y. Jiang, S. Kannan, S. Oh, and P. Viswanath, \u201cDeepcode: Feedback codes via deep learning,\u201d in\n\nAdvances in Neural Information Processing Systems, 2018, pp. 9436\u20139446.\n\n[32] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, \u201cEmpirical evaluation of gated recurrent neural networks\n\non sequence modeling,\u201d arXiv preprint arXiv:1412.3555, 2014.\n\n[33] M. K. Muller, F. Ademaj, T. Dittrich, A. Fastenbauer, B. R. Elbal, A. Nabavi, L. Nagel, S. Schwarz, and\nM. 
Rupp, \u201cFlexible multi-node simulation of cellular mobile communications: the Vienna 5G System\nLevel Simulator,\u201d EURASIP Journal on Wireless Communications and Networking, vol. 2018, no. 1, p. 17,\nSep. 2018.\n\n[34] B. Tahir, S. Schwarz, and M. Rupp, \u201cBer comparison between convolutional, turbo, ldpc, and polar codes,\u201d\n\nin 2017 24th International Conference on Telecommunications (ICT).\n\nIEEE, 2017, pp. 1\u20137.\n\n[35] Y. Jiang, H. Kim, H. Asnani, S. Kannan, S. Oh, and P. Viswanath, \u201cLearn codes: Inventing low-latency\n\ncodes via recurrent neural networks,\u201d arXiv preprint arXiv:1811.12707, 2018.\n\n[36] B. Zhu, J. Wang, L. He, and J. Song, \u201cJoint transceiver optimization for wireless communication phy with\n\nconvolutional neural network,\u201d arXiv preprint arXiv:1808.03242, 2018.\n\n[37] L. Bahl, J. Cocke, F. Jelinek, and J. Raviv, \u201cOptimal decoding of linear codes for minimizing symbol error\n\nrate (corresp.),\u201d IEEE Transactions on information theory, vol. 20, no. 2, pp. 284\u2013287, 1974.\n\n10\n\n\f[38] J. Hagenauer, \u201cThe turbo principle: Tutorial introduction and state of the art,\u201d in Proc. International\n\nSymposium on Turbo Codes and Related Topics, 1997, pp. 1\u201311.\n\n[39] H. R. Sadjadpour, N. J. Sloane, M. Salehi, and G. Nebe, \u201cInterleaver design for turbo codes,\u201d IEEE Journal\n\non Selected Areas in Communications, vol. 19, no. 5, pp. 831\u2013837, 2001.\n\n[40] S. D\u00f6rner, S. Cammerer, J. Hoydis, and S. t. Brink, \u201cDeep learning-based communication over the air,\u201d\n\narXiv preprint arXiv:1707.03384, 2017.\n\n[41] D. Bahdanau, K. Cho, and Y. Bengio, \u201cNeural machine translation by jointly learning to align and translate,\u201d\n\narXiv preprint arXiv:1409.0473, 2014.\n\n[42] W. Yin, K. Kann, M. Yu, and H. Sch\u00fctze, \u201cComparative study of cnn and rnn for natural language\n\nprocessing,\u201d arXiv preprint arXiv:1702.01923, 2017.\n\n[43] K. Ovtcharov, O. 
Ruwase, J.-Y. Kim, J. Fowers, K. Strauss, and E. S. Chung, \u201cAccelerating deep convolu-\ntional neural networks using specialized hardware,\u201d Microsoft Research Whitepaper, vol. 2, no. 11, pp.\n1\u20134, 2015.\n\n[44] J. Lei Ba, J. R. Kiros, and G. E. Hinton, \u201cLayer normalization,\u201d arXiv preprint arXiv:1607.06450, 2016.\n\n[45] S. Santurkar, D. Tsipras, A. Ilyas, and A. Madry, \u201cHow does batch normalization help optimization?\u201d in\n\nAdvances in Neural Information Processing Systems, 2018, pp. 2483\u20132493.\n\n[46] D. Tse and P. Viswanath, Fundamentals of wireless communication. Cambridge university press, 2005.\n\n[47] Y. Bengio, N. L\u00e9onard, and A. Courville, \u201cEstimating or propagating gradients through stochastic neurons\n\nfor conditional computation,\u201d arXiv preprint arXiv:1308.3432, 2013.\n\n[48] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio, \u201cBinarized neural networks,\u201d in\n\nAdvances in neural information processing systems, 2016, pp. 4107\u20134115.\n\n11\n\n\f", "award": [], "sourceid": 1585, "authors": [{"given_name": "Yihan", "family_name": "Jiang", "institution": "University of Washington Seattle"}, {"given_name": "Hyeji", "family_name": "Kim", "institution": "Samsung AI Center Cambridge"}, {"given_name": "Himanshu", "family_name": "Asnani", "institution": "University of Washington, Seattle"}, {"given_name": "Sreeram", "family_name": "Kannan", "institution": "University of Washington"}, {"given_name": "Sewoong", "family_name": "Oh", "institution": "University of Washington"}, {"given_name": "Pramod", "family_name": "Viswanath", "institution": "UIUC"}]}
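For concreteness, the two non-canonical channels compared above can be simulated in a few lines of NumPy. This is a minimal sketch, not the experimental setup of the paper: the degrees of freedom `nu` of the t-noise, the bad-state SNR, and the state-switching probability are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def awgn(x, snr_db):
    # iid AWGN: unit-power signal, noise std set by the SNR in dB
    sigma = 10 ** (-snr_db / 20.0)
    return x + sigma * rng.standard_normal(x.shape)

def atn(x, snr_db, nu=3.0):
    # iid Additive T-Noise: heavy-tailed Student-t noise, rescaled so its
    # variance matches the AWGN noise power (valid for nu > 2).
    # nu=3 is an illustrative choice, not necessarily the paper's setting.
    sigma = 10 ** (-snr_db / 20.0)
    t = rng.standard_t(nu, size=x.shape)
    return x + sigma * np.sqrt((nu - 2.0) / nu) * t

def markov_awgn(x, snr_db, snr_bad_db=-1.0, p_switch=0.2):
    # non-iid Markovian-AWGN: a two-state Markov chain toggles the noise
    # level between a "good" and a "bad" (bursty) SNR.
    # snr_bad_db and p_switch are illustrative assumptions.
    sig = [10 ** (-snr_db / 20.0), 10 ** (-snr_bad_db / 20.0)]
    y = np.empty(x.size)
    state = 0  # 0 = good, 1 = bad
    for i, v in enumerate(x.ravel()):
        y[i] = v + sig[state] * rng.standard_normal()
        if rng.random() < p_switch:
            state = 1 - state
    return y.reshape(x.shape)
```

Because the t-noise is heavy-tailed, a decoder trained only on Gaussian statistics over-trusts large received values, which is one intuition for why a jointly learnt code can beat the handcrafted one here.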
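The 'turbo principle' that TurboAE builds on rests on encoding the message twice, once in natural order and once after a fixed pseudo-random interleaver, so the decoder can iterate between the two views while exchanging soft information. A toy sketch of the interleave/de-interleave pair (block length and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(7)
K = 100                           # block length (arbitrary for this demo)
perm = rng.permutation(K)         # fixed interleaver, shared by both ends
inv = np.argsort(perm)            # its inverse: the de-interleaver

u = rng.integers(0, 2, K)         # message bits
u_interleaved = u[perm]           # copy fed to the second encoder branch
u_recovered = u_interleaved[inv]  # de-interleaving restores natural order
assert np.array_equal(u_recovered, u)
```

The permutation spreads bursts of channel errors across the block, so an error pattern that is hard for one decoder view tends to be easy for the other.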