{"title": "Unsupervised Sequence Classification using Sequential Output Statistics", "book": "Advances in Neural Information Processing Systems", "page_first": 3550, "page_last": 3559, "abstract": "We consider learning a sequence classifier without labeled data by using sequential output statistics. The problem is highly valuable since obtaining labels in training data is often costly, while the sequential output statistics (e.g., language models) could be obtained independently of input data and thus with low or no cost. To address the problem, we propose an unsupervised learning cost function and study its properties. We show that, compared to earlier works, it is less inclined to be stuck in trivial solutions and avoids the need for a strong generative model. Although it is harder to optimize in its functional form, a stochastic primal-dual gradient method is developed to effectively solve the problem. Experiment results on real-world datasets demonstrate that the new unsupervised learning method gives drastically lower errors than other baseline methods. Specifically, it reaches test errors about twice of those obtained by fully supervised learning.", "full_text": "Unsupervised Sequence Classi\ufb01cation using\n\nSequential Output Statistics\n\nYu Liu \u2020, Jianshu Chen \u21e4, and Li Deng\u2020\n\n\u21e4 Microsoft Research, Redmond, WA 98052, USA\u21e4\n\njianshuc@microsoft.com\n\n\u2020 Citadel LLC, Seattle/Chicago, USA\u2020\n\nLi.Deng@citadel.com\n\nAbstract\n\nWe consider learning a sequence classi\ufb01er without labeled data by using sequential\noutput statistics. The problem is highly valuable since obtaining labels in training\ndata is often costly, while the sequential output statistics (e.g., language models)\ncould be obtained independently of input data and thus with low or no cost. To\naddress the problem, we propose an unsupervised learning cost function and study\nits properties. 
We show that, compared to earlier works, it is less inclined to get stuck in trivial solutions and avoids the need for a strong generative model. Although it is harder to optimize in its functional form, a stochastic primal-dual gradient method is developed to solve the problem effectively. Experimental results on real-world datasets demonstrate that the new unsupervised learning method gives drastically lower errors than other baseline methods. Specifically, it reaches test errors only about twice those obtained by fully supervised learning.\n\n1 Introduction\n\nUnsupervised learning is one of the most challenging problems in machine learning. It is often formulated as modeling how the world works without requiring a huge amount of human labeling effort, e.g., [8]. To reach this grand goal, it is necessary to first solve a sub-goal of unsupervised learning with high practical value: learning to predict output labels from input data without requiring costly labeled data. Toward this end, we study in this paper the learning of a sequence classifier without labels by using sequential output statistics. The problem is highly valuable since the sequential output statistics, such as language models, can be obtained independently of the input data and thus with no labeling cost.\nThe problem we consider here is different from most studies on unsupervised learning, which concern the automatic discovery of inherent regularities in the input data in order to learn their representations [13, 28, 18, 17, 5, 1, 31, 20, 14, 12]. When these methods are applied to prediction tasks, either the learned representations are used as feature vectors [22] or the learned unsupervised models are used to initialize a supervised learning algorithm [9, 18, 2, 24, 10]. 
In both cases, the above unsupervised methods play only an auxiliary role in supervised learning when applied to prediction tasks.\nRecently, various solutions have been proposed to address the input-to-output prediction problem without using labeled training data, though none with demonstrated success [11, 30, 7]. Similar to this work, the authors in [7] proposed an unsupervised cost that also exploits the sequence prior of the output samples to train classifiers. The power of such a strong prior in the form of language models in unsupervised learning was also demonstrated in earlier studies [21, 3]. However, these earlier methods did not perform well in practical prediction tasks on real-world data without using additional strong generative models. Possible reasons are inappropriately formulated cost functions and inappropriate choices of optimization methods. For example, it was shown in [7] that optimizing the highly non-convex unsupervised cost function could easily get stuck in trivial solutions, although adding a special regularization mitigated the problem somewhat.\nThe solution provided in this paper fundamentally improves on these prior works [11, 30, 7] in the following aspects. First, we propose a novel cost function for unsupervised learning, and find that it has a desired coverage-seeking property that makes the learning algorithm less inclined to get stuck in trivial solutions than the cost function in [7]. \n\n*All three authors contributed equally to the paper.\n†The work was done while Yu Liu and Li Deng were at Microsoft Research.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\n
Second, we develop a special empirical formulation of this cost function that avoids the need for a strong generative model as in [30, 11].3 Third, although the proposed cost function is more difficult to optimize in its functional form, we develop a stochastic primal-dual gradient (SPDG) algorithm to solve the problem effectively. Our analysis of SPDG demonstrates how it is able to reduce the high barriers in the cost function by transforming it into a primal-dual domain. Finally and most importantly, we demonstrate that the new cost function and the associated SPDG optimization algorithm work well on two real-world classification tasks. In the rest of the paper, we proceed to demonstrate these points and discuss related works along the way.\n\n2 Empirical-ODM: An unsupervised learning cost for sequence classifiers\n\nIn this section, we extend the earlier work of [30] and propose an unsupervised learning cost named Empirical Output Distribution Match (Empirical-ODM) for training classifiers without labeled data. We first formulate the unsupervised learning problem with sequential output structures. Then, we introduce the Empirical-ODM cost and discuss its important properties that are closely related to unsupervised learning.\n\n2.1 Problem formulation\n\nWe consider the problem of learning a sequence classifier that predicts an output sequence (y1, . . . , yT0) from an input sequence (x1, . . . , xT0) without using labeled data, where T0 denotes the length of the sequence. Specifically, the learning algorithm does not have access to a labeled training set D_XY ≜ {((x^n_1, . . . , x^n_{Tn}), (y^n_1, . . . , y^n_{Tn})) : n = 1, . . . , M}, where Tn denotes the length of the n-th sequence. Instead, what is available is a collection of input sequences, denoted as D_X ≜ {(x^n_1, . . . , x^n_{Tn}) : n = 1, . . . , M}. In addition, we assume that the sequential output statistics (or sequence prior), in the form of an N-gram probability, are available:\n\np_LM(i1, . . . , iN) ≜ p_LM(y^n_{t−N+1} = i1, . . . , y^n_t = iN)\n\nwhere i1, . . . , iN ∈ {1, . . . , C} and the subscript “LM” stands for language model. Our objective is to train the sequence classifier using only D_X and p_LM(·). Note that the sequence prior p_LM(·), in the form of language models, is a type of structure commonly found in natural language data, which can be learned from a large amount of freely available text data without labeling cost. For example, in optical character recognition (OCR) tasks, y^n_t could be an English character and x^n_t the input image containing this character. We can estimate an N-gram character-level language model p_LM(·) from a separate text corpus. Therefore, our learning algorithm works in a fully unsupervised manner, without any human labeling cost. In our experiment section, we will demonstrate the effectiveness of our method on such a real OCR task. Other potential applications include speech recognition, machine translation, and image/video captioning.\nIn this paper, we focus on sequence classifiers of the form p_θ(y^n_t | x^n_t); that is, the classifier computes the posterior probability p_θ(y^n_t | x^n_t) based only on the current input sample x^n_t in the sequence. Furthermore, we restrict our choice of p_θ(y^n_t | x^n_t) to linear classifiers4 and focus our attention on designing and understanding unsupervised learning costs and methods for label-free prediction. In\n\n3The work [11] only proposed a conceptual idea of using generative models to integrate the output structure and the output-to-input structure for unsupervised learning in speech recognition. Specifically, the generative models are built from the domain knowledge of the speech waveform generation mechanism. 
No mathematical formulation or successful experimental results are provided in [11].\n\n4p_θ(y^n_t = i | x^n_t) = exp(w_i^T x^n_t) / Σ_{j=1}^C exp(w_j^T x^n_t), where the model parameter is θ ≜ {w_i ∈ R^d, i = 1, . . . , C}.\n\nfact, as we will show in later sections, even with linear models, the unsupervised learning problem is still highly nontrivial and the cost function is highly non-convex. We emphasize that developing a successful unsupervised learning approach for linear classifiers, as we do in this paper, provides important insights and is an important first step towards more advanced nonlinear models (e.g., deep neural networks). We expect that, in future work, the insights obtained here will help us generalize our techniques to nonlinear models.\nA recent work that shares the same motivation as ours is [29], which also recognizes the high cost of obtaining labeled data and seeks label-free prediction. Different from our setting, it exploits domain knowledge from the laws of physics in computer vision applications, whereas our approach exploits sequential statistics in the natural language outputs. Finally, our problem is fundamentally different from the sequence transduction method in [15], although the latter also exploits language models for sequence prediction. Specifically, the method in [15] is fully supervised in that it requires supervision at the sequence level; that is, for each input sequence, a corresponding output sequence (of possibly different length) is provided as a label. The use of a language model in [15] only serves the purpose of regularization in sequence-level supervised learning. 
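As a concrete reference for footnote 4, the linear (softmax) classifier can be sketched in a few lines of numpy; the function name and array shapes here are our own choices, not from the paper.

```python
import numpy as np

def softmax_posterior(W, x):
    """Posterior p_theta(y = i | x) of the linear classifier in footnote 4.

    W is a (C, d) array stacking the weight vectors w_i; x is a (d,) feature
    vector. Returns the length-C vector of class probabilities.
    """
    scores = W @ x
    scores = scores - scores.max()  # subtract the max for numerical stability
    e = np.exp(scores)
    return e / e.sum()
```

The max-subtraction does not change the probabilities but keeps the exponentials from overflowing for large scores.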
In stark contrast, the unsupervised learning we propose does not require supervision at any level, including the sequence level; we do not need the sequence labels but only the prior distribution p_LM(·) of the output sequences.\n\n2.2 The Empirical-ODM\n\nWe now introduce an unsupervised learning cost that exploits the sequence structure in p_LM(·). It is mainly inspired by the approach to breaking the Caesar cipher, one of the simplest forms of encryption [23]. The Caesar cipher is a substitution cipher in which each letter of the original message is replaced with the letter a certain number of positions up or down the alphabet. For example, the letter “D” is replaced by the letter “A”, the letter “E” by the letter “B”, and so on. In this way, the original message that was readable ends up being less understandable. The amount of shifting is known to the intended receiver of the message, who can decode the message by shifting back each letter of the encrypted message. However, the Caesar cipher can also be broken by an unintended receiver (not knowing the shift) by analyzing the frequencies of the letters in the encrypted messages and matching them up with the letter distribution of the original text [4, pp. 9-11]. More formally, let yt = f(xt) denote a function that maps each encrypted letter xt into an original letter yt, and let p_LM(i) ≜ p_LM(yt = i) denote the prior letter distribution of the original message, estimated from a regular text corpus. When f(·) is constructed in such a way that all mapped letters {yt : yt = f(xt), t = 1, . . . , T} have the same distribution as the prior p_LM(i), it breaks the Caesar cipher and recovers the original letters at the mapping outputs.\nInspired by the above approach, the posterior probability p_θ(y^n_t | x^n_t) in our classification problem can be interpreted as a stochastic mapping, which maps each input vector x^n_t (the “encrypted letter”) into an output vector y^n_t (the “original letter”) with probability p_θ(y^n_t | x^n_t). Then, in a samplewise manner, each input sequence (x^n_1, . . . , x^n_{Tn}) is stochastically mapped into an output sequence (y^n_1, . . . , y^n_{Tn}). We move a step further than the above approach by requiring that the distribution of the N-grams among all the mapped output sequences be close to the prior N-gram distribution p_LM(i1, . . . , iN). With this motivation, we propose to learn the classifier p_θ(yt | xt) by minimizing the cross entropy between the prior distribution and the expected N-gram frequency of the output sequences:\n\nmin_θ { J(θ) ≜ − Σ_{i1,...,iN} p_LM(i1, . . . , iN) ln p_θ(i1, . . . , iN) } (1)\n\nwhere p_θ(i1, . . . , iN) denotes the expected frequency of a given N-gram (i1, . . . , iN) among all the output sequences. In Appendix B of the supplementary material, we derive its expression as\n\np_θ(i1, . . . , iN) ≜ (1/T) Σ_{n=1}^M Σ_{t=1}^{Tn} ∏_{k=0}^{N−1} p_θ(y^n_{t−k} = i_{N−k} | x^n_{t−k}) (2)\n\nwhere T ≜ T1 + ··· + TM is the total number of samples in all the sequences. Note that minimizing the cross entropy in (1) is equivalent to minimizing the Kullback-Leibler (KL) divergence between the two distributions, since they differ only by a constant term, Σ p_LM ln p_LM. Therefore, the cost function (1) seeks to estimate θ by matching the two output distributions, where the expected N-gram distribution in (2) is an empirical average over all the samples in the training set. 
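To make the definitions in (1) and (2) concrete, the following numpy sketch computes the expected N-gram frequency and the cross-entropy cost by brute force over all N-grams. It is didactic (cost O(C^N · T)), and the names and data layout are our own, not the authors' implementation.

```python
import numpy as np
from itertools import product

def empirical_odm_cost(p_lm, posteriors, N):
    """Empirical-ODM cost (1) using the expected N-gram frequency (2).

    p_lm: dict mapping each N-gram tuple (i1, ..., iN) to its prior probability.
    posteriors: list of (T_n, C) arrays, posteriors[n][t, i] = p_theta(y_t = i | x_t).
    """
    T = sum(p.shape[0] for p in posteriors)
    cost = 0.0
    for ngram, prior in p_lm.items():
        if prior == 0.0:
            continue  # zero-prior N-grams contribute nothing to (1)
        # expected frequency (2): average the product of posteriors
        # over every length-N window of every sequence
        freq = sum(
            np.prod([p[t - k, ngram[N - 1 - k]] for k in range(N)])
            for p in posteriors
            for t in range(N - 1, p.shape[0])
        ) / T
        cost -= prior * np.log(freq)
    return cost
```

For instance, with C = 2, N = 2, a uniform bigram prior, and uniform posteriors over a length-3 sequence, every bigram gets expected frequency 1/6 and the cost equals ln 6.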
For this reason, we name the cost (1) the Empirical Output Distribution Match (Empirical-ODM) cost.\nIn [30], the authors proposed to minimize an output distribution match (ODM) cost, defined as the KL-divergence between the prior output distribution and the marginalized output distribution, D(p_LM(y) || p_θ(y)), where p_θ(y) ≜ ∫ p_θ(y|x) p(x) dx. However, evaluating p_θ(y) requires integrating over the input space using a generative model p(x). Due to the lack of such a generative model, they were not able to optimize this proposed ODM cost. Instead, alternative approaches such as dual autoencoders and GANs were proposed as heuristics. Their results were not successful without using a few labeled data. Our proposed Empirical-ODM cost differs from the ODM cost in [30] in three key aspects. (i) We do not need any labeled data for training. (ii) We exploit the sequence structure of the output statistics; i.e., in our case y = (y1, . . . , yN) (N-gram), whereas in [30] y = yt (unigram, i.e., no sequence structure). This is crucial in developing a working unsupervised learning algorithm: the change from unigram to N-gram allows us to explicitly exploit the sequence structures at the output, which turns the technique from non-working into working (see Table 2 in Section 4). It may also explain why the method in [30] failed, as it does not exploit the sequence structure. (iii) We replace the marginalized distribution p_θ(y) by the expected N-gram frequency in (2). This is critical in that it allows us to directly minimize the divergence between the two output distributions without the need for a generative model, which [30] could not do. In fact, we can further show that p_θ(i1, . . . , iN) is an empirical approximation of p_θ(y) with y = (y1, . . . , yN) (see Appendix B.2 of the supplementary material). 
In this way, our cost (1) can be understood as an N-gram and empirical version of the ODM cost up to an additive constant; i.e., y is replaced by y = (y1, . . . , yN) and p_θ(y) is replaced by its empirical approximation.\n\n2.3 Coverage-seeking versus mode-seeking\n\nWe now discuss an important property of the proposed Empirical-ODM cost (1) by comparing it with the cost proposed in [7]. We show that the Empirical-ODM cost has a coverage-seeking property, which makes it more suitable for unsupervised learning than the mode-seeking cost in [7].\nIn [7], the authors proposed the expected negative log-likelihood as the unsupervised learning cost function that exploits the output sequential statistics. The intuition was to maximize the aggregated log-likelihood of all the output sequences assumed to be generated by the stochastic mapping p_θ(y^n_t | x^n_t). We show in Appendix A of the supplementary material that their cost is equivalent to\n\n− Σ_{i1,...,iN−1} Σ_{iN} p_θ(i1, . . . , iN) ln p_LM(iN | iN−1, . . . , i1) (3)\n\nwhere p_LM(iN | iN−1, . . . , i1) ≜ p(y^n_t = iN | y^n_{t−1} = iN−1, . . . , y^n_{t−N+1} = i1), and the summations are over all possible values of i1, . . . , iN ∈ {1, . . . , C}. In contrast, we can rewrite our cost (1) as\n\n− Σ_{i1,...,iN−1} p_LM(i1, . . . , iN−1) · Σ_{iN} p_LM(iN | iN−1, . . . , i1) ln p_θ(i1, . . . , iN) (4)\n\nwhere we used the chain rule of conditional probabilities. Note that both costs (3) and (4) have a cross-entropy form. However, a key difference is that the positions of the distributions p_θ(·) and p_LM(·) are swapped. We show that the cost (3) proposed in [7] is a mode-seeking divergence between two distributions, while, by swapping p_θ(·) and p_LM(·), our cost (4) becomes a coverage-seeking divergence (see [25] for a detailed discussion of divergences with these two behaviors). 
To understand this, we consider the following two situations:\n\n• If p_LM(iN | iN−1, . . . , i1) → 0 and p_θ(i1, . . . , iN) > 0 for a certain (i1, . . . , iN), the cross entropy in (3) goes to +∞ and the cross entropy in (4) approaches zero.\n• If p_LM(iN | iN−1, . . . , i1) > 0 and p_θ(i1, . . . , iN) → 0 for a certain (i1, . . . , iN), the cross entropy in (3) approaches zero and the cross entropy in (4) goes to +∞.\n\nFigure 1: The profiles of J(θ) for the OCR dataset on a two-dimensional affine space passing through the supervised solution. The three figures show the same profile from different angles, where the red dot is the supervised solution. The contours of the profiles are shown at the bottom.\n\nTherefore, the cost function (3) heavily penalizes the classifier if it predicts an output that the prior distribution p_LM(·) believes to be improbable, and it does not penalize the classifier when it fails to predict an output that p_LM(·) believes to be probable. That is, the classifier is encouraged to predict a single output mode with high probability under p_LM(·), a behavior called “mode-seeking” in [25]. This probably explains the phenomenon observed in [7]: the training process easily converges to a trivial solution of predicting the same output that has the largest probability under p_LM(·). In contrast, the cost (4) heavily penalizes the classifier if it does not predict the outputs for which p_LM(·) is positive, and penalizes it less if it predicts outputs for which p_LM(·) is zero. That is, this cost encourages p_θ(y|x) to cover as much of p_LM(·) as possible, a behavior called “coverage-seeking” in [25]. Therefore, training the classifier using (4) makes it less inclined to learn trivial solutions than the cost in [7], since such solutions are heavily penalized. 
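The two cases above can be checked numerically. Below is a small sketch (the toy distributions are our own, not from the paper) showing that a model collapsed onto a single mode of a two-mode prior is cheap under the cross-entropy direction of (3) but heavily penalized under the direction of (4):

```python
import numpy as np

def cross_entropy(p, q, eps=1e-12):
    """H(p, q) = -sum_i p_i * ln(q_i), with eps guarding ln(0)."""
    return float(-np.sum(p * np.log(q + eps)))

p_lm = np.array([0.5, 0.5, 0.0])     # prior with two probable outputs
p_model = np.array([1.0, 0.0, 0.0])  # classifier collapsed onto one mode

mode_seeking = cross_entropy(p_model, p_lm)      # direction of cost (3)
coverage_seeking = cross_entropy(p_lm, p_model)  # direction of cost (4)
# mode_seeking stays small, so (3) tolerates the collapse;
# coverage_seeking blows up, so (4) forces the model to cover both modes.
```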
We will verify this fact in Section 4 on experiments.\nIn addition, the coverage-seeking property could make the learning less sensitive to the sparseness of language models (i.e., p_LM being zero for some N-grams), since the cost does not penalize these N-grams. In summary, our proposed cost (1) is more suitable for unsupervised learning than that in [7].\n\n2.4 The difficulties of optimizing J(θ)\n\nThere are two main challenges in optimizing the Empirical-ODM cost J(θ) in (1). The first is that the sample average (over the entire training set) in the expression of p_θ(·) (see (2)) sits inside the logarithmic loss, which is different from traditional machine learning problems, where the average is outside the loss functions (e.g., Σ_t f_t(θ)). This functional form prevents us from applying stochastic gradient descent (SGD) to minimize (1), as the stochastic gradients would be intrinsically biased (see Appendix C for a detailed discussion and Section 4 for the experimental results). The second challenge is that the cost function J(θ) is highly non-convex even with linear classifiers. To see this, we visualize the profile of the cost function J(θ) (restricted to a two-dimensional sub-space) around the supervised solution in Figure 1.5,6 We observe that there are local optimal solutions and that there are high barriers between the local and global optimal solutions. Therefore, besides the difficulty of having the sample average inside the logarithmic loss, minimizing this cost function directly will be difficult, since crossing the high barriers to reach the global optimal solution is hard without proper initialization.\n\n3 The Stochastic Primal-Dual Gradient (SPDG) Algorithm\n\nTo address the first difficulty in Section 2.4, we transform the original cost (1) into an equivalent min-max problem in order to bring the sample average out of the logarithmic loss. 
Then we can obtain unbiased stochastic gradients to solve the problem. To this end, we first introduce the concept of convex conjugate functions. For a given convex function f(u), its convex conjugate function f*(ν) is defined as f*(ν) ≜ sup_u (ν^T u − f(u)) [6, pp. 90-95], where u and ν are called the primal and dual variables, respectively. For the scalar function f(u) = −ln u, the conjugate function can be calculated as f*(ν) = −1 − ln(−ν) with ν < 0. Furthermore, it holds that f(u) = sup_ν (u^T ν − f*(ν)), by which we have −ln u = max_ν (uν + 1 + ln(−ν)).7 Substituting this into (1), the original minimization problem becomes the following equivalent min-max problem:\n\nmin_θ max_{ν_{i1,...,iN} < 0} { L(θ, V) ≜ (1/T) Σ_{n=1}^M Σ_{t=1}^{Tn} L^n_t(θ, V) + Σ_{i1,...,iN} p_LM(i1, . . . , iN) ln(−ν_{i1,...,iN}) } (5)\n\nwhere V ≜ {ν_{i1,...,iN}} is the collection of all the dual variables ν_{i1,...,iN}, and L^n_t(θ, V) is the t-th component function of the n-th sequence, defined as\n\nL^n_t(θ, V) ≜ Σ_{i1,...,iN} p_LM(i1, . . . , iN) ν_{i1,...,iN} ∏_{k=0}^{N−1} p_θ(y^n_{t−k} = i_{N−k} | x^n_{t−k})\n\nIn the equivalent min-max problem (5), we find the optimal solution (θ*, V*) by minimizing L with respect to the primal variable θ and maximizing L with respect to the dual variables V. The obtained optimal solution (θ*, V*) of (5) is called the saddle point of L [6]. Once it is obtained, we keep only θ*, which is also the optimal solution of (1) and thus the model parameter.\nWe further note that the equivalent min-max problem (5) is now in a form that sums over T = T1 + ··· + TM component functions L^n_t(θ, V). Therefore, the empirical average has been brought out of the logarithmic loss, and we are ready to apply stochastic gradient methods. Specifically, we minimize L with respect to the primal variable θ by stochastic gradient descent and maximize L with respect to the dual variables V by stochastic gradient ascent. Accordingly, we name the algorithm the stochastic primal-dual gradient (SPDG) method (see its details in Algorithm 1). \n\nAlgorithm 1 Stochastic Primal-Dual Gradient Method\n1: Input data: D_X = {(x^n_1, . . . , x^n_{Tn}) : n = 1, . . . , M} and p_LM(i1, . . . , iN).\n2: Initialize θ and V, where the elements of V are negative.\n3: repeat\n4: Randomly sample a mini-batch of B subsequences of length N from all the sequences in the training set D_X, i.e., B = {(x^{nm}_{tm−N+1}, . . . , x^{nm}_{tm})}_{m=1}^B.\n5: Compute the stochastic gradients for each subsequence in the mini-batch and average them:\n Δθ = (1/B) Σ_{m=1}^B ∂L^{nm}_{tm}/∂θ, ΔV = (1/B) Σ_{m=1}^B ∂L^{nm}_{tm}/∂V + (∂/∂V) Σ_{i1,...,iN} p_LM(i1, . . . , iN) ln(−ν_{i1,...,iN})\n6: Update θ and V according to θ ← θ − μ_θ Δθ and V ← V + μ_v ΔV.\n7: until convergence or a certain stopping condition is met\n\n5The approach to visualizing the profile is explained in more detail in Appendix F. More slices and a video of the profiles from many angles can be found in the supplementary material.\n6Note that the supervised solution (red dot) coincides with the global optimal solution of J(θ). The intuition is that the classifier trained by supervised learning should also produce an output N-gram distribution that is close to the prior output N-gram distribution given by p_LM(·).\n
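The conjugate identity behind (5), −ln u = max over ν < 0 of (uν + 1 + ln(−ν)), can be verified numerically; the maximizer is ν = −1/u. The sketch below uses our own variable names and a simple grid search, purely as a sanity check of the identity:

```python
import numpy as np

def dual_bound(u, nu):
    """Inner objective of -ln(u) = max over nu < 0 of (u * nu + 1 + ln(-nu))."""
    return u * nu + 1.0 + np.log(-nu)

u = 2.5
nu_grid = -np.linspace(0.01, 5.0, 2001)  # a grid of negative dual values
best = dual_bound(u, nu_grid).max()
# the bound is tight exactly at nu = -1/u, where it recovers -ln(u)
```

Every grid value lower-bounds −ln(u), and the maximum over the grid approaches −ln(u), which is what lets the minimization in (1) be rewritten as the min-max in (5).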
We implement the SPDG algorithm in TensorFlow, which automatically computes the stochastic gradients.8 Finally, the constraints on the dual variables ν_{i1,...,iN} are automatically enforced by the inherent log-barrier, ln(−ν_{i1,...,iN}), in (5) [6]. Therefore, we do not need a separate method to enforce the constraints.\nWe now show that the above min-max (primal-dual) reformulation also alleviates the second difficulty discussed in Section 2.4. Similar to the case of J(θ), we examine the profile of L(θ, V) in (5) (restricted to a two-dimensional sub-space) around the optimal (supervised) solution in Figure 2a (see Appendix F for the visualization details). Comparing Figure 2a to Figure 1, we observe that the profile of L(θ, V) is smoother than that of J(θ) and the barrier is significantly lower. To further compare J(θ) and L(θ, V), we plot in Figure 2b the values of J(θ) and L(θ, V) along the same line θ* + p(θ1 − θ*) for different p. It shows that the barrier of L(θ, V) along the primal direction is lower than that of J(θ). These observations imply that the reformulated min-max problem (5) is better conditioned than the original problem (1), which further justifies the use of the SPDG method.\n\n7The supremum is attainable and is thus replaced by a maximum.\n8The code will be released soon.\n\nFigure 2: The profiles of L(θ, V) for the OCR dataset. (a) The profile on a two-dimensional affine space passing through the optimal solution (red dot). (b) The profile along the line θ* + p(θ1 − θ*) for different values of p ∈ R, where the circles are the optimal solutions.\n\n4 Experiments\n\n4.1 Experimental setup\n\nWe evaluate our unsupervised learning scheme described in the earlier sections on two classification tasks: unsupervised character-level OCR and unsupervised English spelling correction (Spell-Corr). In both tasks, no labels are provided during training; hence, both are unsupervised.\nFor the OCR task, we obtain our dataset from the public UWIII English Document Image Database [27], which contains images of each line of text with the corresponding ground truth. We first use Tesseract [19] to segment the image of each line of text into character tiles and assign each tile one character. We verify the segmentation result by training a simple neural network classifier on the segmented results, achieving a 0.9% error rate on the test set. Then, we select sentence segments that are longer than 100 characters and contain only lowercase English characters and common punctuation (space, comma, and period). As a result, we have a vocabulary of size 29 and obtain 1,175 sentence segments comprising 153,221 characters for our OCR task. To represent the images, we extract VGG19 features of dimension 4096 and project them onto 200-dimensional vectors using Principal Component Analysis. We train the language models (LMs) p_LM(·), which provide the required output sequence statistics, from both in-domain and out-of-domain data sources. The out-of-domain data sources are completely different databases, namely three different language partitions (CNA, NYT, XIN) of the English Gigaword database [26].\nIn the Spell-Corr task, we learn to correct the spelling of misspelled text. From the AFP partition of the Gigaword database, we select 500 sentence segments into our Spell-Corr dataset. 
We select sentences that are longer than 100 characters and contain only English characters and common punctuation, resulting in a total of 83,567 characters. The misspelled texts are generated by substitution simulations and are treated as our inputs. The objective of this task is to recover the original text.\n\n4.2 Results: Comparing optimization algorithms\n\nIn the first set of experiments, we evaluate the effectiveness of the SPDG method described in Section 3, which is designed for optimizing the Empirical-ODM cost in Section 2. The analysis provided in Sections 2 and 3 sheds light on why SPDG is superior to the method in [7] and to the standard stochastic gradient descent (SGD) method: the coverage-seeking behavior of the proposed Empirical-ODM cost helps avoid trivial solutions, and the simultaneous optimization of the primal-dual variables reduces the barriers in the highly non-convex profile of the cost function. Furthermore, we do not include the methods from [30], because their approaches could not achieve satisfactory results without a few labeled data, while we consider only the fully unsupervised learning setting. In addition, the methods in [30] do not optimize the ODM cost and do not exploit the output sequential statistics.\nTable 1 provides strong experimental evidence demonstrating the substantially greater effectiveness of the primal-dual method over SGD and the method in [7] on both tasks. All these results are obtained by training the models until convergence. Let us examine the results on the OCR task in detail. First, SPDG on the unsupervised cost function achieves a 9.21% error rate, much lower than the error rate of any of the mini-batch SGD runs, where the size of the mini-batches ranges from 10 to 10,000. Note that larger mini-batch sizes produce lower errors here because they come closer to the full-batch gradient and thus to lower bias in SGD. 
On the other hand, when the mini-batch size is as small as 10, the high error rate of 83.09% is close to a guess by the majority rule, i.e., predicting the character (space) that has the largest proportion in the training set: 25,499/153,221 = 83.37%. Furthermore, the method from [7] does not perform well no matter how we tune the hyperparameters of its generative regularization. Finally, and perhaps most interestingly, with no labels provided during training, the classification errors produced by our method are only about twice those of supervised learning (4.63%, shown in Table 1). This clearly demonstrates that the unsupervised learning scheme proposed in this paper is an effective one. For the spelling correction dataset (see the Spell-Corr results in Table 1), we observe results consistent with those on the OCR dataset.\n\nTable 1: Test error rates on two datasets: OCR and Spell-Corr. The 2-gram character LM is trained from in-domain data. The numbers inside ⟨·⟩ are the mini-batch sizes of the SGD method.\n\nData sets | SPDG (Ours) | Method from [7] | SGD ⟨10⟩ | SGD ⟨100⟩ | SGD ⟨1k⟩ | SGD ⟨10k⟩ | Supervised Learning | Majority Guess\nOCR | 9.59% | 83.37% | 83.09% | 78.05% | 67.14% | 56.48% | 4.63% | 83.37%\nSpell-Corr | 1.94% | 82.91% | 82.91% | 72.93% | 65.69% | 45.24% | 0.00% | 82.91%\n\n4.3 Results: Comparing orders of language modeling\n\nIn the second set of experiments, we examine to what extent the use of sequential statistics (e.g., 2- and 3-gram LMs) improves over a unigram LM (no sequential information) in unsupervised learning. The unsupervised prediction results are shown in Table 2, using different data sources to estimate the N-gram LM parameters. Consistently across all four ways of estimating reliable N-gram LMs, we observe significantly lower error rates when the unsupervised learning exploits a 2-gram or 3-gram LM as sequential statistics compared with exploiting a prior with no sequential statistics (i.e., 1-gram). 
In three of the four cases, exploiting a 3-gram LM gives better results than a 2-gram LM. Furthermore, the error rate obtained with a 3-gram LM estimated from out-of-domain character data (10.17% in Table 2) is comparable to that obtained with the in-domain LM (9.59% in Table 1), indicating that the effectiveness of the unsupervised learning paradigm presented in this paper is robust to the quality of the LM that acts as the sequential prior.

Table 2: Test error rates on the OCR dataset. Character-level language models (LMs) of different orders are trained from three out-of-domain datasets and from the fused in-domain and out-of-domain data.

           | NYT-LM     | XIN-LM     | CNA-LM    | Fused-LM
No. Sents  | 1,206,903  | 155,647    | 12,234    | 15,409
No. Chars  | 86,005,542 | 18,626,451 | 1,911,124 | 2,064,345
1-gram     | 71.83%     | 72.14%     | 71.51%    | 71.25%
2-gram     | 10.93%     | 12.55%     | 10.56%    | 10.33%
3-gram     | 10.17%     | 12.89%     | 10.29%    | 9.21%

5 Conclusions and future work

In this paper, we study the problem of learning a sequence classifier without labeled training data. The practical benefit of such unsupervised learning is tremendous. For example, in large-scale speech recognition systems, the currently dominant supervised learning methods typically require a few thousand hours of training data, where each utterance in acoustic form needs to be labeled by humans. Although millions of hours of natural speech data are available for training, labeling all of them for supervised learning is infeasible. To make effective use of such huge amounts of acoustic data, the practical unsupervised learning approach discussed in this paper would be called for. Other potential applications such as machine translation and image and video captioning could also benefit from our paradigm, mainly because of their common natural-language output structure, from which we can exploit sequential statistics to learn the classifier without labels.
For other (non-natural-language) applications with a sequential output structure, our proposed approach could be applied in a similar manner. Furthermore, our proposed Empirical-ODM cost function significantly improves over the one in [7] by emphasizing coverage-seeking behavior. Although the new cost function has a functional form that is more difficult to optimize, a novel SPDG algorithm is developed to address the problem effectively. An analysis of the profiles of the cost functions sheds insight into why SPDG works well and why previous methods fail. Finally, we demonstrate on two datasets that our unsupervised learning method is highly effective, producing error rates only about twice those of fully supervised learning, a result no previous unsupervised learning method could achieve without additional steps of supervised learning. While the current work is restricted to linear classifiers, we intend to generalize the approach to nonlinear models (e.g., deep neural networks [16]) in future work. We also plan to extend our method from N-gram LMs to the current state-of-the-art neural LMs. Finally, one remaining challenge is scaling the method to large vocabularies and high-order LMs (i.e., large C and N), where the summation over all (i1, ..., iN) in (5) becomes computationally expensive. A potential solution is to parameterize the dual variable ν_{i1,...,iN} with a recurrent neural network and to approximate the sum using beam search, which we leave as future work.

Acknowledgments

The authors would like to thank all the anonymous reviewers for their constructive feedback.

References

[1] Yoshua Bengio. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1):1–127, January 2009.

[2] Yoshua Bengio, Pascal Lamblin, Dan Popovici, and Hugo Larochelle. Greedy layer-wise training of deep networks.
In Proceedings of the Advances in Neural Information Processing Systems (NIPS), pages 153–160, 2007.

[3] Taylor Berg-Kirkpatrick, Greg Durrett, and Dan Klein. Unsupervised transcription of historical documents. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, pages 207–217, 2013.

[4] Albrecht Beutelspacher. Cryptology. Mathematical Association of America, 1994.

[5] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet allocation. The Journal of Machine Learning Research, 3:993–1022, March 2003.

[6] Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

[7] Jianshu Chen, Po-Sen Huang, Xiaodong He, Jianfeng Gao, and Li Deng. Unsupervised learning of predictors from unpaired input-output samples. arXiv:1606.04646, 2016.

[8] Soumith Chintala and Yann LeCun. A path to unsupervised learning through adversarial networks. https://code.facebook.com/posts/1587249151575490/a-path-to-unsupervised-learning-through-adversarial-networks/, 2016.

[9] George E. Dahl, Dong Yu, Li Deng, and Alex Acero. Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition. IEEE Transactions on Audio, Speech, and Language Processing, 20(1):30–42, 2012.

[10] Andrew M. Dai and Quoc V. Le. Semi-supervised sequence learning. In Proceedings of the Advances in Neural Information Processing Systems (NIPS), pages 3079–3087, 2015.

[11] Li Deng. Deep learning for speech and language processing. Tutorial at Interspeech Conf., Dresden, Germany, https://www.microsoft.com/en-us/research/wp-content/uploads/2016/07/interspeech-tutorial-2015-lideng-sept6a.pdf, Aug-Sept 2015.

[12] Ian Goodfellow. Generative adversarial nets. Tutorial at NIPS, http://www.cs.toronto.edu/~dtarlow/pos14/talks/goodfellow.pdf, 2016.

[13] Ian Goodfellow, Yoshua Bengio, and Aaron Courville.
Deep Learning. MIT Press, 2016.

[14] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Proceedings of the Advances in Neural Information Processing Systems (NIPS), pages 2672–2680, 2014.

[15] Alex Graves. Sequence transduction with recurrent neural networks. arXiv preprint arXiv:1211.3711, 2012.

[16] Geoffrey Hinton, Li Deng, Dong Yu, George E. Dahl, Abdel-Rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N. Sainath, and B. Kingsbury. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 29(6):82–97, November 2012.

[17] Geoffrey E. Hinton, Simon Osindero, and Yee-Whye Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527–1554, 2006.

[18] Geoffrey E. Hinton and Ruslan R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.

[19] Anthony Kay. Tesseract: An open-source optical character recognition engine. Linux Journal, 2007.

[20] Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.

[21] Kevin Knight, Anish Nair, Nishit Rathod, and Kenji Yamada. Unsupervised analysis for decipherment problems. In Proceedings of the COLING/ACL, pages 499–506, 2006.

[22] Quoc Le, Marc'Aurelio Ranzato, Rajat Monga, Matthieu Devin, Kai Chen, Greg Corrado, Jeff Dean, and Andrew Ng. Building high-level features using large scale unsupervised learning. In International Conference on Machine Learning, 2012.

[23] Dennis Luciano and Gordon Prichett. Cryptology: From Caesar ciphers to public-key cryptosystems. The College Mathematics Journal, 18(1):2–17, 1987.

[24] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean.
Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.

[25] Tom Minka. Divergence measures and message passing. Technical report, Microsoft Research, 2005.

[26] Robert Parker et al. English Gigaword Fourth Edition LDC2009T13. Linguistic Data Consortium, Philadelphia, 2009.

[27] Ihsin Phillips, Bhabatosh Chanda, and Robert Haralick. http://isis-data.science.uva.nl/events/dlia//datasets/uwash3.html.

[28] P. Smolensky. Information processing in dynamical systems: Foundations of harmony theory. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1, pages 194–281. 1986.

[29] Russell Stewart and Stefano Ermon. Label-free supervision of neural networks with physics and domain knowledge. In Proceedings of AAAI, 2017.

[30] Ilya Sutskever, Rafal Jozefowicz, Karol Gregor, Danilo Rezende, Tim Lillicrap, and Oriol Vinyals. Towards principled unsupervised learning. arXiv preprint arXiv:1511.06440, 2015.

[31] Pascal Vincent, Hugo Larochelle, Isabelle Lajoie, Yoshua Bengio, and Pierre-Antoine Manzagol. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. The Journal of Machine Learning Research, 11:3371–3408, 2010.