{"title": "Connectionist Temporal Classification with Maximum Entropy Regularization", "book": "Advances in Neural Information Processing Systems", "page_first": 831, "page_last": 841, "abstract": "Connectionist Temporal Classification (CTC) is an objective function for end-to-end sequence learning, which adopts dynamic programming algorithms to directly learn the mapping between sequences. CTC has shown promising results in many sequence learning applications including speech recognition and scene text recognition. However, CTC tends to produce highly peaky and overconfident distributions, which is a symptom of overfitting. To remedy this, we propose a regularization method based on maximum conditional entropy which penalizes peaky distributions and encourages exploration. We also introduce an entropy-based pruning method to dramatically reduce the number of CTC feasible paths by ruling out unreasonable alignments. Experiments on scene text recognition show that our proposed methods consistently improve over the CTC baseline without the need to adjust training settings. Code has been made publicly available at: https://github.com/liuhu-bigeye/enctc.crnn.", "full_text": "Connectionist Temporal Classi\ufb01cation with\n\nMaximum Entropy Regularization\n\nHu Liu Sheng Jin Changshui Zhang\n\nInstitute for Arti\ufb01cial Intelligence, Tsinghua University (THUAI)\n\nBeijing National Research Center for Information Science and Technology (BNRist)\n\nState Key Lab of Intelligent Technologies and Systems\n\nDepartment of Automation, Tsinghua University, Beijing, P.R.China\n\n{liuhu15, js17}@mails.tsinghua.edu.cn\n\nzcs@mail.tsinghua.edu.cn\n\nAbstract\n\nConnectionist Temporal Classi\ufb01cation (CTC) is an objective function for end-to-\nend sequence learning, which adopts dynamic programming algorithms to directly\nlearn the mapping between sequences. 
CTC has shown promising results in\nmany sequence learning applications including speech recognition and scene text\nrecognition. However, CTC tends to produce highly peaky and overcon\ufb01dent\ndistributions, which is a symptom of over\ufb01tting. To remedy this, we propose a\nregularization method based on maximum conditional entropy which penalizes\npeaky distributions and encourages exploration. We also introduce an entropy-\nbased pruning method to dramatically reduce the number of CTC feasible paths by\nruling out unreasonable alignments. Experiments on scene text recognition show\nthat our proposed methods consistently improve over the CTC baseline without\nthe need to adjust training settings. Code has been made publicly available at:\nhttps://github.com/liuhu-bigeye/enctc.crnn.\n\n1 Introduction\n\nThe past few years have witnessed signi\ufb01cant progress in sequence learning tasks. Currently, recurrent\nneural networks (RNNs) with Connectionist Temporal Classi\ufb01cation (CTC) [5] have become a popular\nframework and are widely used in areas such as speech recognition [6, 8, 1, 2, 22], sign language\nrecognition [4], video segmentation [10, 18] and scene text recognition [29, 19, 7]. CTC views the\noutputs of the RNN as a probability distribution over all possible alignments and directly learns the\nmapping from input sequences to target sequences. It has proven effective in weakly supervised\nsequence modeling, where only temporal order supervision is available and no alignment information is provided.\nCTC can be regarded as a kind of Multiple Instance Learning (MIL) [21]. From the perspective of\nMIL, the label sequence is a bag containing all feasible paths. CTC learns by maximum likelihood\nestimation (MLE) over the summation of all feasible path probabilities. However, as the number of\nfeasible paths grows exponentially with the input sequence length, it is hard for CTC to \ufb01nd the most\nsuitable one. 
More seriously, once CTC \ufb01nds a dominant feasible path during the training process,\nthe error signal will concentrate on the vicinity of this path, and the prediction of this feasible path\nwill continuously strengthen until this path completely dominates the prediction output. As blanks are\nincluded in most of the feasible paths, dominant paths are often overwhelmed by blanks, interspersed\nwith sharp spikes (narrow regions along the time axis) of non-blank labels, which is known as the CTC\npeaky distribution problem [5, 22].\nThis problem of CTC leads to the following consequences.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\u2022 Harm the training process. The positive feedback-like error signal means CTC lacks\nexploration and is prone to fall into worse local minima.\n\u2022 Output overcon\ufb01dent paths. CTC tends to concentrate all its output distribution over\none speci\ufb01c path. On the one hand, it is not suitable to handle the situation where the\nsegmentation boundary is ambiguous, e.g. adjacent syllables in speech recognition and\naction switching in continuous sign language recognition. On the other hand, the low-entropy\noutput distribution is a symptom of over\ufb01tting [31], leading to low prediction accuracy.\n\u2022 Output paths with peaky distribution. The peaky distribution is not desirable for sequence\nsegmentation tasks when the model needs to densely predict labels for each time-step.\nEven if only the temporal order of labels is required, learning the correct segmentation will\nimprove model generalization and interpretation abilities.\n\nMotivated by the maximum entropy principle [13], we propose a maximum conditional entropy based\nregularization for CTC (EnCTC). We consider the conditional distribution of feasible paths given the\ninput sequence and label sequence. 
EnCTC prevents the entropy of feasible paths from decreasing\ntoo fast, thus alleviating the impact of CTC\u2019s positive feedback-like error signal and encouraging\nexploration during training. It also prevents the probability from being dominated by a single path\nand solves the peaky distribution problem. It mitigates over\ufb01tting and is more suitable for depicting\nambiguous segmentation boundaries.\nWe further consider another solution to the problem of \ufb01nding a reasonable feasible path: limiting\nthe size of the feasible set. We observe that in many sequence learning tasks, the intervals between\nadjacent elements in the input sequence are almost the same, e.g. characters in text recognition,\nsyllables in speech recognition and gestures in continuous sign language recognition. We summarize\nthis phenomenon as the equal spacing prior. Therefore, we propose an algorithm to limit the size\nof the CTC feasible set by eliminating those unreasonable paths that seriously violate the equal\nspacing prior (EsCTC). Moreover, the equal spacing prior can be explained theoretically from the\nperspective of maximum entropy, which indicates that equal spacing is the best prior without any\nadditional subjective assumptions.\nThe main contributions of this paper can be summarized as: (1) We propose a maximum conditional\nentropy regularization for CTC (EnCTC), which encourages exploration for CTC training and\nprevents peaky output distributions. (2) We derive from the equal spacing prior a pruning algorithm\n(EsCTC) to effectively limit the size of the CTC feasible set and give theoretical explanations from the\nperspective of maximum entropy. (3) We provide polynomial-time dynamic programming algorithms\nfor calculating EnCTC, EsCTC and their combination (EnEsCTC). 
(4) We validate the proposed\nmethods on scene text recognition tasks and show that these methods are able to improve the baseline\nmodel without changing training settings.\n\n2 Related Work\n\nCTC [5] is a popular framework for end-to-end sequence learning tasks, such as speech recognition\n[6, 8, 1, 2, 22], scene text recognition [29, 19, 7], sign language recognition [4] and video\nsegmentation [10, 18]. However, CTC tends to output highly peaky distributions [5, 22], which is a\nsign of model overcon\ufb01dence. To remedy this, some intuitive smoothing methods have been proposed. [32]\nintroduces path sampling to speed up CTC training, which has a side-effect of reducing the posterior\nspikiness. [18, 22] smooth the estimation of path label priors by discounting the majority number\nof \"background\" labels and increasing the counts of rare actions, then normalizing the posterior\ndistribution using the priors when decoding. However, their smoothing methods are performed during\ndecoding and do not affect the training process.\nModel overcon\ufb01dence is a universal problem in the \ufb01eld of machine learning and several regularization\napproaches have been proposed to handle it. Weight decay [16] regularizes by limiting the range\nof parameters; Dropout [9] and DropConnect [33] by adding noise to the model structure; Stochastic\nPooling [37] by adding noise to the pooling operations. These methods regularize model parameters,\nbut do not directly regularize the output distributions.\nMaximum entropy based regularization [23] has long been investigated in the literature to\nregularize model behavior. Maximum entropy estimation [13] involves no additional assumptions\nwhen estimating the distribution. It assigns positive weight to every possible situation and produces\nthe maximum entropy predictions under certain constraints. 
In reinforcement learning [36, 25],\nmaximum entropy regularization is proposed to encourage exploration and prevent early convergence.\nIn supervised learning, [27] proposes to penalize low-entropy (high-con\ufb01dence) output softmax\ndistributions by adding negative entropy to the objective function.\nThis work is most closely related to those CTC based methods that use regularization to reduce\novercon\ufb01dent predictions. In [3], label smoothing [31] and increasing the temperature of the SoftMax function\nare employed to improve beam search. [15] simply adds a con\ufb01dence penalty regularization term [27]\nto regularize the output distribution. To some extent, their approaches improve generalization and\nreduce the peaky distribution. However, label smoothing [31] and con\ufb01dence penalty [27] regularize\nthe model prediction at each time-step, which corresponds to regularizing both feasible and invalid\npaths. We instead regularize the entropy among the feasible paths of CTC to handle the peaky distribution\nproblem. Alignment constraints on CTC paths have been previously explored in [28, 32]. We further\npropose an entropy based pruning method to reduce the search space of possible alignments and\nfacilitate convergence.\n\n3 Method\n\n3.1 Problem De\ufb01nition\n\nThroughout the paper, the sequence learning problem is de\ufb01ned as follows:\n\n\u2022 The dataset consists of pairs of input sequences X and corresponding target sequences l.\n\u2022 Each element of the target sequence is de\ufb01ned in a \ufb01xed-length label alphabet L.\n\u2022 Each input sequence X should be longer than its corresponding target sequence l.\n\u2022 The alignment between X and l is unknown, but it preserves sequential order.\n\n3.2 Connectionist Temporal Classi\ufb01cation (CTC)\n\nCTC [5] is a popular method for sequence learning. It enables end-to-end model training with no\npre-de\ufb01ned alignment information required. 
In the framework of CTC, given an input sequence $X_{1:T}$\nof length T, the model predicts a sequence $y^{1:T}$ of length T, where $y^t$ denotes the probability vector\nof observing labels over the \ufb01xed-length label alphabet L' at time-step t. L' = L \u222a {\u2205} contains all the pre-de\ufb01ned\nlabels including a \u2018blank\u2019 label \u2205. We call the concatenation of observed labels at all\ntime-steps a path \u03c0.\nIn order to relate a path \u03c0 to the target sequence l, CTC de\ufb01nes a many-to-one\nmapping operation B. B \ufb01rst removes the repeated labels and then removes all blanks from the given\npath. Given a label sequence l, we de\ufb01ne feasible paths as all those \u03c0 that can be mapped onto l\nthrough B. The conditional probability of a given target sequence is de\ufb01ned as the sum of probabilities\nof all feasible paths.\n\n$$p(l|X_{1:T}) = \\sum_{\\pi \\in B^{-1}(l)} p(\\pi|X_{1:T}), \\tag{1}$$\n\nwhere the probability of \u03c0 is de\ufb01ned as\n\n$$p(\\pi|X_{1:T}) = \\prod_{t=1}^{T} y^t_{\\pi_t}, \\quad \\forall \\pi \\in L'^{T}. \\tag{2}$$\n\nCTC guides the end-to-end model training by directly optimizing the loss function $L_{ctc} = -\\log p(l|X_{1:T})$.\nIt uses dynamic programming to ef\ufb01ciently sum up all the feasible paths.\nHowever, due to the massive number of feasible paths in Equation 1, directly optimizing the CTC loss\nmay not result in a good alignment. More speci\ufb01cally, the error signal of the CTC loss with respect to $y^t_k$\nis computed as:\n\n$$\\frac{\\partial L_{ctc}}{\\partial y^t_k} = -\\frac{1}{p(l|X)\\, y^t_k} \\sum_{\\{\\pi | \\pi \\in B^{-1}(l), \\pi_t = k\\}} p(\\pi|X). \\tag{3}$$\n\nWe can see that the error signal is proportional to the share of feasible-path probability carried by paths that go through\nsymbol k at time t. 
That means once a feasible path is dominant, the error signal of $y^t_{\\pi_t}$ will\ndominate $y^t$ at all time-steps t, causing all the probabilities to focus on a single path while ignoring\nits alternatives. This positive feedback-like error signal makes CTC lack exploration during training\nand tend to over\ufb01t.\n\n3.3 Maximum Conditional Entropy Regularization for CTC (EnCTC)\n\nIn order to facilitate training and be more likely to \ufb01nd the feasible paths, we propose a regularization\nmethod based on maximum conditional entropy (EnCTC). EnCTC constitutes an entropy-based\nregularization term that prevents the entropy of the feasible paths from decreasing too fast, leading to\nbetter generalization and exploration.\n\n$$L_{enctc} = L_{ctc} - \\beta H(p(\\pi|l, X)), \\tag{4}$$\n\nwhere \u03b2 controls the strength of the maximum conditional entropy regularization.\nThe entropy of the feasible paths given input sequence X and target sequence l is de\ufb01ned as:\n\n$$H(p(\\pi|l, X)) = -\\sum_{\\pi \\in B^{-1}(l)} p(\\pi|X, l) \\log p(\\pi|X, l) = -\\frac{1}{p(l|X)} \\sum_{\\pi \\in B^{-1}(l)} p(\\pi|X) \\log p(\\pi|X) + \\log p(l|X). \\tag{5}$$\n\nFast convergence to one feasible path means that the entropy term of Equation 5 reduces rapidly.\nMore speci\ufb01cally, the error signal of entropy regularization term \u2212H(p(\u03c0|l, X)) with respect to $y^t_k$\nis computed as:\n\n$$\\frac{\\partial (-H(p(\\pi|l, X)))}{\\partial y^t_k} = \\frac{Q(l)}{p(l|X)\\, y^t_k} \\left( \\frac{\\sum_{\\{\\pi | \\pi \\in B^{-1}(l), \\pi_t = k\\}} p(\\pi|X) \\log p(\\pi|X)}{Q(l)} - \\frac{\\sum_{\\{\\pi | \\pi \\in B^{-1}(l), \\pi_t = k\\}} p(\\pi|X)}{p(l|X)} \\right), \\tag{6}$$\n\nwhere $Q(l) = \\sum_{\\pi \\in B^{-1}(l)} p(\\pi|X) \\log p(\\pi|X)$. We can see from Equation 6 that this error signal is\nproportional to the fraction of p(\u03c0|X) log p(\u03c0|X) minus the fraction of p(\u03c0|X) for all feasible paths\nthat go through symbol k at time t. 
Notice that these two fractions are zero when p(\u03c0|X) = 0, but\np(\u03c0|X) log p(\u03c0|X) decreases more rapidly around zero and reaches its minimum when p(\u03c0|X) = 1/e.\nTherefore, paths with probability between 0 and 1/e, i.e. paths near the dominant path,\ncontribute the most to the error signal. This error signal will in turn increase the probability of the\nnearby paths and improve the exploration during training.\n\n3.4 Equal Spacing CTC (EsCTC)\n\nAs discussed previously, CTC is prone to output degenerate paths due to its large search space. In\nthis section, we present a pruning method to dramatically reduce the search space of possible\nalignments.\nWe observe that in many sequence learning tasks, such as scene text recognition, the spacing of two\nconsecutive elements (or the width of elements) is nearly the same. We assume that this property\nwill hold for all reasonable alignments. We thus propose to measure the spacing equality and set a\nthreshold to rule out those unreasonable alignments. This protects our model from distractions and\nhelps convergence.\nHere we provide a theoretical justification of the equal spacing prior based on maximum\nentropy over segmentation.\nGiven an input sequence $X_{1:T}$ of length T and a label sequence l, de\ufb01ne the segmentation sequence\n$z_{1:|l|}$ splitting X into a series of short segments. Each segment corresponds to a duration $z_s$ and\n$\\sum_{s=1}^{|l|} z_s \\leq T$ (with an optional all-blank suf\ufb01x). Sequence Z consists of the starting\ni