{"title": "Predictive Coding with Neural Nets: Application to Text Compression", "book": "Advances in Neural Information Processing Systems", "page_first": 1047, "page_last": 1054, "abstract": null, "full_text": "PREDICTIVE CODING WITH NEURAL NETS: APPLICATION TO TEXT COMPRESSION \n\nJürgen Schmidhuber \n\nStefan Heil \n\nFakultät für Informatik \n\nTechnische Universität München \n\n80290 München, Germany \n\nAbstract \n\nTo compress text files, a neural predictor network P is used to approximate the conditional probability distribution of possible \"next characters\", given n previous characters. P's outputs are fed into standard coding algorithms that generate short codes for characters with high predicted probability and long codes for highly unpredictable characters. Tested on short German newspaper articles, our method outperforms widely used Lempel-Ziv algorithms (used in UNIX functions such as \"compress\" and \"gzip\"). \n\n1  INTRODUCTION \n\nThe method presented in this paper is an instance of a strategy known as \"predictive coding\" or \"model-based coding\". To compress text files, a neural predictor network P approximates the conditional probability distribution of possible \"next characters\", given n previous characters. P's outputs are fed into algorithms that generate short codes for characters with low information content (characters with high predicted probability) and long codes for characters conveying a lot of information (highly unpredictable characters) [5]. Two such standard coding algorithms are employed: Huffman Coding (see e.g. [1]) and Arithmetic Coding (see e.g. [7]). \n\nWith the off-line variant of the approach, P's training phase is based on a set F of training files. After training, the weights are frozen. Copies of P are installed at all machines functioning as message receivers or senders. 
From then on, P is used to encode and decode unknown files without being changed any more. The weights become part of the code of the compression algorithm. Note that the storage occupied by the network weights does not have to be taken into account when measuring the performance on unknown files, just as the code of a conventional data compression algorithm is not taken into account. \n\nThe more sophisticated on-line variant of our approach will be addressed later. \n\n2  A PREDICTOR OF CONDITIONAL PROBABILITIES \n\nAssume that the alphabet contains k possible characters $z_1, z_2, \ldots, z_k$. The (local) representation of $z_i$ is a binary k-dimensional vector $r(z_i)$ with exactly one non-zero component (at the i-th position). P has nk input units and k output units. n is called the \"time-window size\". We insert n default characters $z_0$ at the beginning of each file. The representation of the default character, $r(z_0)$, is the k-dimensional zero vector. The m-th character of file f (counting from the first default character) is called $c^f_m$. \n\nFor all $f \in F$ and all possible $m > n$, P receives as input \n\n$r(c^f_{m-n}) \circ r(c^f_{m-n+1}) \circ \cdots \circ r(c^f_{m-1}),$   (1) \n\nwhere $\circ$ is the concatenation operator for vectors. P produces as output $P^f_m$, a k-dimensional vector. Using back-propagation [6][2][3][4], P is trained to minimize \n\n$\frac{1}{2} \sum_{f \in F} \sum_{m > n} \| r(c^f_m) - P^f_m \|^2 .$   (2) \n\nExpression (2) is minimal if $P^f_m$ always equals \n\n$E(r(c^f_m) \mid c^f_{m-n}, \ldots, c^f_{m-1}),$   (3) \n\nthe conditional expectation of $r(c^f_m)$, given $r(c^f_{m-n}) \circ r(c^f_{m-n+1}) \circ \cdots \circ r(c^f_{m-1})$. Due to the local character representation, this is equivalent to $(P^f_m)_i$ being equal to the conditional probability \n\n$P(c^f_m = z_i \mid c^f_{m-n}, \ldots, c^f_{m-1})$   (4) \n\nfor all f and all appropriate $m > n$, where $(P^f_m)_j$ denotes the j-th component of the vector $P^f_m$. 
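The input encoding and the output normalization defined by expression (1) and equation (5) can be sketched in a few lines. The following is an illustrative Python sketch, not the authors' implementation; the toy alphabet and window size are assumptions made for brevity (the paper uses k = 80 characters and n = 5):

```python
# Sketch of the local character representation and its use as network
# input/output. ALPHABET, one_hot, window_input, and normalize are
# illustrative names, not from the paper.

ALPHABET = "abc"        # toy alphabet, k = 3
k = len(ALPHABET)

def one_hot(ch):
    """Local representation r(z_i): a k-dimensional vector with exactly
    one non-zero component; the default character z_0 (here: None)
    maps to the k-dimensional zero vector."""
    v = [0.0] * k
    if ch is not None:
        v[ALPHABET.index(ch)] = 1.0
    return v

def window_input(chars):
    """Expression (1): concatenate the representations of the n
    previous characters into an n*k-dimensional network input."""
    return [x for c in chars for x in one_hot(c)]

def normalize(outputs):
    """Equation (5): rescale the raw output vector so its components
    sum to one and can be read as probabilities."""
    total = sum(outputs)
    return [o / total for o in outputs]

x = window_input([None, "a", "b"])    # n = 3 window with one default char
```

The network itself then maps the 9-dimensional `x` to a k-dimensional output whose normalized components approximate equation (4).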
\n\nIn general, the $(P^f_m)_i$ will not exactly match the corresponding conditional probabilities. For normalization purposes, we define \n\n$P^f_m(i) = \frac{(P^f_m)_i}{\sum_{j=1}^{k} (P^f_m)_j} .$   (5) \n\nNo normalization is used during training, however. \n\n3  HOW TO USE THE PREDICTOR FOR COMPRESSION \n\nWe use a standard procedure for predictive coding. With the help of a copy of P, an unknown file f can be compressed as follows. Again, n default characters are inserted at the beginning. For each character $c^f_m$ ($m > n$), the predictor emits its output $P^f_m$ based on the n previous characters. There will be a k such that $c^f_m = z_k$. The estimate of $P(c^f_m = z_k \mid c^f_{m-n}, \ldots, c^f_{m-1})$ is given by $P^f_m(k)$. The code of $c^f_m$, code($c^f_m$), is generated by feeding $P^f_m(k)$ into the Huffman Coding algorithm or, alternatively, into the Arithmetic Coding algorithm. code($c^f_m$) is written into the compressed file. The basic ideas of both coding algorithms are described next. \n\n3.1  HUFFMAN CODING \n\nGiven a probability distribution on a set of possible characters, Huffman Coding (e.g. [1]) encodes characters by bitstrings as follows. \n\nCharacters are the terminal nodes of a binary tree to be built in an incremental fashion. The probability of a terminal node is defined as the probability of the corresponding character. The probability of a non-terminal node is defined as the sum of the probabilities of its sons. Starting from the terminal nodes, the binary tree is built as follows: \n\nRepeat as long as possible: among those nodes that are not children of any non-terminal node created earlier, pick the two with the lowest associated probabilities, and make them the two sons of a newly generated non-terminal node. \n\nThe branch to the \"left\" son of each non-terminal node is labeled 0; the branch to its \"right\" son is labeled 1. 
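The tree-building loop above can be sketched with a priority queue. This is an illustrative Python sketch, not the authors' implementation; the function name and the toy probabilities are our own:

```python
import heapq
import itertools

def huffman_codes(probs):
    """Build the binary tree described above and read off a prefix code.
    `probs` maps each character to its probability. A tree is either a
    character (terminal node) or a (left, right) pair (non-terminal)."""
    counter = itertools.count()          # tiebreak for equal probabilities
    heap = [(p, next(counter), ch) for ch, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Pick the two parentless nodes with lowest probability and make
        # them the sons of a new non-terminal node whose probability is
        # the sum of theirs.
        p1, _, t1 = heapq.heappop(heap)
        p2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (p1 + p2, next(counter), (t1, t2)))
    codes = {}
    def walk(tree, prefix):
        if isinstance(tree, tuple):      # non-terminal node
            walk(tree[0], prefix + "0")  # branch to "left" son: 0
            walk(tree[1], prefix + "1")  # branch to "right" son: 1
        else:
            codes[tree] = prefix or "0"  # single-character edge case
    _, _, root = heap[0]
    walk(root, "")
    return codes

codes = huffman_codes({"a": 0.5, "b": 0.25, "c": 0.25})
```

For these probabilities (all integer powers of 1/2), the resulting code lengths match the information content exactly: one bit for "a", two bits each for "b" and "c".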
The code of a character c, code(c), is the bitstring obtained by following the path from the root to the corresponding terminal node. Obviously, if $c \neq d$, then code(c) cannot be a prefix of code(d). This makes the code uniquely decipherable. \n\nCharacters with high associated probability are encoded by short bitstrings; characters with low associated probability are encoded by long bitstrings. Huffman Coding guarantees minimal expected code length, provided all character probabilities are integer powers of 1/2. \n\n3.2  ARITHMETIC CODING \n\nIn general, Arithmetic Coding works slightly better than Huffman Coding. For sufficiently long messages, Arithmetic Coding achieves expected code lengths arbitrarily close to the information-theoretic lower bound. This is true even if the character probabilities are not powers of 1/2 (see e.g. [7]). \n\nThe basic idea of Arithmetic Coding is: a message is encoded by an interval of real numbers from the unit interval [0, 1[. The output of Arithmetic Coding is a binary representation of the boundaries of the corresponding interval. This binary representation is generated incrementally during message processing. Starting with the unit interval, for each observed character the interval is made smaller, essentially in proportion to the probability of the character. A message with low information content (and high corresponding probability) is encoded by a comparatively large interval whose precise boundaries can be specified with comparatively few bits. A message with a lot of information content (and low corresponding probability) is encoded by a comparatively small interval whose boundaries require comparatively many bits to specify. \n\nAlthough the basic idea is elegant and simple, additional technical considerations are necessary to make Arithmetic Coding practicable. See [7] for details. 
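The interval-narrowing idea can be illustrated with a short floating-point sketch. This is a simplification for exposition only (the function names are our own); a practical coder works with integer arithmetic and incremental bit output, as detailed in [7]:

```python
def narrow_interval(low, high, cum_lo, cum_hi):
    """Shrink the current interval to the sub-interval assigned to the
    observed character, whose cumulative probability range within the
    current model is [cum_lo, cum_hi)."""
    width = high - low
    return low + width * cum_lo, low + width * cum_hi

def encode(message, probs):
    """Narrow the unit interval once per character; any number in the
    final interval [low, high) identifies the message."""
    # Cumulative distribution over a fixed character order.
    cum, total = {}, 0.0
    for ch in sorted(probs):
        cum[ch] = (total, total + probs[ch])
        total += probs[ch]
    low, high = 0.0, 1.0
    for ch in message:
        low, high = narrow_interval(low, high, *cum[ch])
    return low, high
```

The final interval width equals the product of the character probabilities, so likelier messages get wider intervals, whose boundaries need fewer bits to pin down.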
\n\nNeither Huffman Coding nor Arithmetic Coding requires that the probability distribution on the characters remain fixed. This allows for using \"time-varying\" conditional probability distributions as generated by the neural predictor. \n\n3.3  HOW TO \"UNCOMPRESS\" DATA \n\nThe information in the compressed file is sufficient to reconstruct the original file without loss of information. This is done with the \"uncompress\" algorithm, which works as follows. Again, for each character $c^f_m$ ($m > n$), the predictor (sequentially) emits its output $P^f_m$ based on the n previous characters, where the $c^f_l$ with $n < l < m$ were obtained sequentially by feeding the approximations $P^f_l(k)$ of the probabilities $P(c^f_l = z_k \mid c^f_{l-n}, \ldots, c^f_{l-1})$ into the inverse Huffman Coding procedure (see e.g. [1]) or, alternatively (depending on which coding procedure was used), into the inverse Arithmetic Coding procedure (e.g. [7]). Both variants allow for correct decoding of $c^f_l$ from code($c^f_l$). With both variants, to correctly decode some character, we first need to decode all previous characters. Both variants are guaranteed to restore the original file from the compressed file. \n\nWHY NOT USE A LOOK-UP TABLE INSTEAD OF A NETWORK? \n\nBecause a look-up table would be extremely inefficient. A look-up table requires $k^{n+1}$ entries for all the conditional probabilities corresponding to all possible combinations of n previous characters and possible next characters. In addition, a special procedure is required for dealing with previously unseen combinations of input characters. 
In contrast, the size of a neural net typically grows only in proportion to $n^2$ (assuming the number of hidden units grows in proportion to the number of input units), and its inherent \"generalization capability\" can take care of previously unseen combinations of input characters (hopefully by coming up with good predicted probabilities). \n\n4  SIMULATIONS \n\nWe implemented both variants of the encoding and decoding procedure described above. \n\nOur current computing environment prohibits extensive experimental evaluation of the method. The predictor updates turn out to be quite time-consuming, which makes special neural net hardware recommendable. The limited software simulations presented in this section, however, show that the \"neural\" compression technique can achieve \"excellent\" compression ratios. Here the term \"excellent\" is defined by a statement from [1]: \n\n\"In general, good algorithms can be expected to achieve an average compression ratio of 1.5, while excellent algorithms based upon sophisticated processing techniques will achieve an average compression ratio exceeding 2.0.\" \n\nHere the average compression ratio is the average ratio between the lengths of original and compressed files. \n\nThe method was applied to German newspaper articles. The results were compared to those obtained with standard encoding techniques provided by the UNIX operating system, namely \"pack\", \"compress\", and \"gzip\". The corresponding decoding algorithms are \"unpack\", \"uncompress\", and \"gunzip\", respectively. \"pack\" is based on Huffman Coding (e.g. [1]), while \"compress\" and \"gzip\" are based on techniques developed by Lempel and Ziv (e.g. [9]). As the file size goes to infinity, Lempel-Ziv coding becomes asymptotically optimal in a certain information-theoretic sense [8]. This does not necessarily mean, however, that Lempel-Ziv is optimal for finite file sizes. 
\n\nThe training set for the predictor was given by a set of 40 articles from the newspaper Münchner Merkur, each containing between 10000 and 20000 characters. The alphabet consisted of k = 80 possible characters, including upper-case and lower-case letters, digits, punctuation symbols, and special German letters like \"ö\", \"ü\", \"ä\". P had 430 hidden units. A \"true\" unit with constant activation 1.0 was connected to all hidden and output units. The learning rate was 0.2. The training phase consisted of 25 sweeps through the training set. \n\nThe test set consisted of newspaper articles excluded from the training set, each containing between 10000 and 20000 characters. Table 1 lists the average compression ratios. The \"neural\" method outperformed the strongest conventional competitor, the UNIX \"gzip\" function based on a Lempel-Ziv algorithm. \n\nMethod | Compression Ratio \nHuffman Coding (UNIX: pack) | 1.74 \nLempel-Ziv Coding (UNIX: compress) | 1.99 \nImproved Lempel-Ziv (UNIX: gzip -9) | 2.29 \nNeural predictor + Huffman Coding, n = 5 | 2.70 \nNeural predictor + Arithmetic Coding, n = 5 | 2.72 \n\nTable 1: Compression ratios of various compression algorithms for short German text files (< 20000 bytes) from the unknown test set. \n\nMethod | Compression Ratio \nHuffman Coding (UNIX: pack) | 1.67 \nLempel-Ziv Coding (UNIX: compress) | 1.71 \nImproved Lempel-Ziv (UNIX: gzip -9) | 2.03 \nNeural predictor + Huffman Coding, n = 5 | 2.25 \nNeural predictor + Arithmetic Coding, n = 5 | 2.20 \n\nTable 2: Compression ratios for articles from a different newspaper. The neural predictor was not retrained. \n\nHow does a neural net trained on articles from Münchner Merkur perform on articles from other sources? 
Without retraining the neural predictor, we applied all competing methods to 10 articles from another German newspaper (the Frankenpost). The results are given in Table 2. The Frankenpost articles were harder to compress for all algorithms, but relative performance remained comparable. \n\nNote that the time window was quite small (n = 5). In general, larger time windows make more information available to the predictor; in turn, this improves the prediction quality and increases the compression ratio. Therefore we expect to obtain even better results for n > 5 and for recurrent predictor networks. \n\n5  DISCUSSION / OUTLOOK \n\nOur results show that neural networks are promising tools for loss-free data compression. It was demonstrated that even off-line methods based on small time windows can lead to excellent compression ratios; at least with small text files, they can outperform conventional standard algorithms. We have hardly begun, however, to exhaust the potential of the basic approach. \n\n5.1  ON-LINE METHODS \n\nA disadvantage of the off-line technique above is precisely that it is off-line: the predictor does not adapt to the specific text file it sees. This limitation is not essential, however; it is straightforward to construct an on-line variant of the approach. \n\nWith the on-line variant, the predictor continues to learn during compression. The on-line variant proceeds like this: both the sender and the receiver start with exactly the same initial predictor. Whenever the sender sees a new character, it encodes it using its current predictor. The code is sent to the receiver, who decodes it. Both the sender and the receiver use exactly the same learning protocol to modify their weights. 
This implies that the modified weights need not be sent from the sender to the receiver and do not have to be taken into account when computing the average compression ratio. Of course, the on-line method promises much higher compression ratios than the off-line method. \n\n5.2  LIMITATIONS \n\nThe main disadvantage of both on-line and off-line variants is their computational complexity. The current off-line implementation is clearly slower than conventional standard techniques, by about three orders of magnitude (although no attempt was made to optimize the code with respect to speed). The complexity of the on-line method is even worse (the exact slow-down factor depends on the precise nature of the learning protocol, of course). For this reason, the promising on-line variants in particular can be recommended only if special neural net hardware is available. Note, however, that there are many commercial data compression applications which rely on specialized electronic chips. \n\n5.3  ONGOING RESEARCH \n\nThere are a few obvious directions for ongoing experimental research: (1) Use larger time windows; they seem promising even for off-line methods (see the last paragraph of Section 4). (2) Thoroughly test the potential of on-line methods. Both (1) and (2) should greatly benefit from fast hardware. (3) Compare the performance of predictive coding based on neural predictors to the performance of predictive coding based on different kinds of predictors. \n\n6  ACKNOWLEDGEMENTS \n\nThanks to David MacKay for directing our attention towards Arithmetic Coding. Thanks to Margit Kinder, Martin Eldracher, and Gerhard Weiss for useful comments. \n\nReferences \n\n[1] G. Held. Data Compression. Wiley and Sons Ltd., New York, 1991. \n\n[2] Y. LeCun. Une procédure d'apprentissage pour réseau à seuil asymétrique. Proceedings of Cognitiva 85, Paris, pages 599-604, 1985. \n\n[3] D. B. Parker. 
Learning-logic. Technical Report TR-47, Center for Computational Research in Economics and Management Science, MIT, 1985. \n\n[4] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning internal representations by error propagation. In Parallel Distributed Processing, volume 1, pages 318-362. MIT Press, 1986. \n\n[5] J. H. Schmidhuber and S. Heil. Sequential neural text compression. IEEE Transactions on Neural Networks, 1994. Accepted for publication. \n\n[6] P. J. Werbos. Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences. PhD thesis, Harvard University, 1974. \n\n[7] I. H. Witten, R. M. Neal, and J. G. Cleary. Arithmetic coding for data compression. Communications of the ACM, 30(6):520-540, 1987. \n\n[8] A. Wyner and J. Ziv. Fixed data base version of the Lempel-Ziv data compression algorithm. IEEE Transactions on Information Theory, 37:878-880, 1991. \n\n[9] J. Ziv and A. Lempel. A universal algorithm for sequential data compression. IEEE Transactions on Information Theory, IT-23(5):337-343, 1977. \n", "award": [], "sourceid": 897, "authors": [{"given_name": "J\u00fcrgen", "family_name": "Schmidhuber", "institution": null}, {"given_name": "Stefan", "family_name": "Heil", "institution": null}]}