{"title": "A Rate Distortion Approach for Semi-Supervised Conditional Random Fields", "book": "Advances in Neural Information Processing Systems", "page_first": 2008, "page_last": 2016, "abstract": "We propose a novel information theoretic approach for semi-supervised learning of conditional random fields. Our approach defines a training objective that combines the conditional likelihood on labeled data and the mutual information on unlabeled data. Different from previous minimum conditional entropy semi-supervised discriminative learning methods, our approach can be naturally cast into the rate distortion theory framework in information theory. We analyze the tractability of the framework for structured prediction and present a convergent variational training algorithm to defy the combinatorial explosion of terms in the sum over label configurations. Our experimental results show that the rate distortion approach outperforms standard $l_2$ regularization and minimum conditional entropy regularization on both multi-class classification and sequence labeling problems.", "full_text": "A Rate Distortion Approach for Semi-Supervised\n\nConditional Random Fields\n\nYang Wang\u2020\u2217\n\nGholamreza Haffari\u2020\u2217\n\u2020School of Computing Science\n\nSimon Fraser University\n\nBurnaby, BC V5A 1S6, Canada\n\n{ywang12,ghaffar1,mori}@cs.sfu.ca\n\nShaojun Wang\u2021\n\nGreg Mori\u2020\n\n\u2021Kno.e.sis Center\n\nWright State University\nDayton, OH 45435, USA\nshaojun.wang@wright.edu\n\nAbstract\n\nWe propose a novel information theoretic approach for semi-supervised learning\nof conditional random \ufb01elds that de\ufb01nes a training objective to combine the con-\nditional likelihood on labeled data and the mutual information on unlabeled data.\nIn contrast to previous minimum conditional entropy semi-supervised discrimi-\nnative learning methods, our approach is grounded on a more solid foundation,\nthe rate distortion theory in information theory. 
We analyze the tractability of the framework for structured prediction and present a convergent variational training algorithm to defy the combinatorial explosion of terms in the sum over label configurations. Our experimental results show that the rate distortion approach outperforms standard l2 regularization, minimum conditional entropy regularization, as well as maximum conditional entropy regularization on both multi-class classification and sequence labeling problems.

1 Introduction

In most real-world machine learning problems (e.g., for text, image, audio, or biological sequence data), unannotated data is abundant and can be collected at almost no cost. However, supervised machine learning techniques require large quantities of data to be manually labeled so that automatic learning algorithms can build sophisticated models. Unfortunately, manual annotation of a large quantity of data is both expensive and time-consuming. The challenge is to find ways to exploit the large quantity of unlabeled data and turn it into a resource that can improve the performance of supervised machine learning algorithms. Meeting this challenge requires research at the cutting edge of automatic learning techniques, useful in many fields such as language and speech technology, image processing and computer vision, robot control, and bioinformatics. A surge of semi-supervised learning research activity has occurred in recent years to devise various effective semi-supervised training schemes. 
Most of these semi-supervised learning algorithms are applicable only to multi-class classification problems [1, 10, 32], with very few exceptions that develop discriminative models suitable for structured prediction [2, 9, 16, 20, 21, 22].

In this paper, we propose an information theoretic approach for semi-supervised learning of conditional random fields (CRFs) [19], where we use the mutual information between the empirical distribution of unlabeled data and the discriminative model as a data-dependent regularized prior. Grandvalet and Bengio [15] and Jiao et al. [16] have proposed a similar information theoretic approach that used the conditional entropy of their discriminative models on unlabeled data as a data-dependent regularization term and obtained very encouraging results. The minimum entropy approach can be explained by a data-smoothness assumption and is motivated by semi-supervised classification, using unlabeled data to enhance classification; however, its degeneracy is even more problematic and arguable, noting that a minimum entropy of 0 can be achieved by putting all probability mass on one label and zero on the rest of the labels. As far as we know, there is no formal, principled explanation for the validity of this minimum conditional entropy approach. Instead, our approach can be naturally cast into the rate distortion theory framework, which is well known in information theory [14]. The closest work to ours is that of Corduneanu et al. [11, 12, 13, 28]. Both works are discriminative models and do indeed use mutual information concepts. There are two major distinctions between our work and theirs. 

∗These authors contributed equally to this work.
First, their approach is essentially motivated from a semi-supervised classification point of view and formulated as a communication game, while our approach is based on a completely different motivation, semi-supervised clustering that uses labeled data to enhance clustering, and is formulated as a data compression scheme; this leads to a formulation distinct from that of Corduneanu et al. Second, their model is non-parametric, whereas ours is parametric. As a result, their model can be trained by optimizing a convex objective function through a variant of the Blahut-Arimoto alternating minimization algorithm, whereas our model is more complex and the objective function becomes non-convex. In particular, training a simple chain structured CRF model [19] in our framework turns out to be intractable even when using a Blahut-Arimoto type of alternating minimization algorithm. We develop a convergent variational approach to approximately solve this problem. Another relevant work is the information bottleneck (IB) method introduced by Tishby et al. [30]. The IB method is an information-theoretic framework for extracting relevant components of an input random variable X with respect to an output random variable Y. Instead of directly compressing X to its representation Y subject to an expected distortion through a parametric probabilistic mapping as in our proposed approach, the IB method finds a third, compressed, non-parametric and model-independent representation T of X that is most informative about Y. Formally speaking, the notion of compression is quantified by the mutual information between T and X, while the informativeness is quantified by the mutual information between T and Y. The solutions are characterized by the bottleneck equations and can be found by a convergent re-estimation method that generalizes the Blahut-Arimoto algorithm. 
Finally, in contrast to our approach, which minimizes both the negative conditional likelihood on labeled data and the mutual information between the hidden variables and the observations on unlabeled data for a discriminative model, Oliver and Garg [24] have proposed maximum mutual information hidden Markov models (MMIHMM) for semi-supervised training on chain structured graphs. Their objective is to maximize both the joint likelihood on labeled data and the mutual information between the hidden variables and the observations on unlabeled data for a generative model. This is equivalent to minimizing the conditional entropy of a generative HMM on the unlabeled data. Maximum mutual information for generative HMMs was originally proposed by Bahl et al. [4] and popularized in the speech recognition community [23], but it differs from Oliver and Garg's approach in that an individual HMM is learned for each possible class (e.g., one HMM for each word string), and the point-wise mutual information between the choice of HMM and the observation sequence is maximized. This is equivalent to maximizing the conditional likelihood of a word string given the observation sequence to improve the discrimination across different models [18]. Thus in essence, Bahl et al. [4] proposed a discriminative learning algorithm for generative HMMs trained on utterances in speech recognition.

In the following, we first motivate our rate distortion approach for semi-supervised CRFs as a data compression scheme and formulate the semi-supervised learning paradigm as a classic rate distortion problem. We then analyze the tractability of the framework for structured prediction and present a convergent variational learning algorithm to defy the combinatorial explosion of terms in the sum over label configurations. 
Finally, we demonstrate encouraging results on two real-world problems that show the effectiveness of the proposed approach: text categorization as a multi-class classification problem and hand-written character recognition as a sequence labeling problem. Similar ideas have been successfully applied to semi-supervised boosting [31].

2 Rate distortion formulation

Let X be a random variable over data sequences to be labeled, and Y be a random variable over corresponding label sequences. All components $Y_i$ of Y are assumed to range over a finite label alphabet $\mathcal{Y}$. Given a set of labeled examples, $D_l = \{(x^{(1)}, y^{(1)}), \cdots, (x^{(N)}, y^{(N)})\}$, and unlabeled examples, $D_u = \{x^{(N+1)}, \cdots, x^{(M)}\}$, we would like to build a CRF model

$$p_\theta(y|x) = \frac{1}{Z_\theta(x)} \exp\big(\langle\theta, f(x, y)\rangle\big)$$

over sequential input data x, where $\theta = (\theta_1, \cdots, \theta_K)^\top$, $f(x, y) = (f_1(x, y), \cdots, f_K(x, y))^\top$, and $Z_\theta(x) = \sum_y \exp\big(\langle\theta, f(x, y)\rangle\big)$. Our goal is to learn such a model from the combined set of labeled and unlabeled examples, $D_l \cup D_u$. For notational convenience, we assume that there are no identical examples in $D_l$ and $D_u$.

The standard supervised training procedure for CRFs is based on minimizing the negative log conditional likelihood of the labeled examples in $D_l$:

$$CL(\theta) = -\sum_{i=1}^{N} \log p_\theta(y^{(i)}|x^{(i)}) + \lambda U(\theta) \qquad (1)$$

where $U(\theta)$ can be any standard regularizer on $\theta$, e.g. $U(\theta) = \|\theta\|^2/2$, and $\lambda$ is a parameter that controls the influence of $U(\theta)$. Regularization can alleviate over-fitting on rare features and avoid degeneracy in the case of correlated features.

Obviously, Eq. (1) ignores the unlabeled examples in $D_u$. To make full use of the available training data, Grandvalet and Bengio [15] and Jiao et al. 
[16] proposed a semi-supervised learning algorithm that exploits a form of minimum conditional entropy regularization on the unlabeled data. Specifically, they proposed to minimize the following objective:

$$RL_{minCE}(\theta) = -\sum_{i=1}^{N} \log p_\theta(y^{(i)}|x^{(i)}) + \lambda U(\theta) - \gamma \sum_{j=N+1}^{M} \sum_{y} p_\theta(y|x^{(j)}) \log p_\theta(y|x^{(j)}) \qquad (2)$$

where the first term is the negative log conditional likelihood of the labeled data, and the third term is the conditional entropy of the CRF model on the unlabeled data. The tradeoff parameters $\lambda$ and $\gamma$ control the influences of $U(\theta)$ and the unlabeled data, respectively.

This is equivalent to minimizing the following objective (with different values of $\lambda$ and $\gamma$):

$$RL_{minCE}(\theta) = D\big(\tilde{p}_l(x, y), \tilde{p}_l(x) p_\theta(y|x)\big) + \lambda U(\theta) + \gamma \sum_{x \in D_u} \tilde{p}_u(x) H\big(p_\theta(y|x)\big) \qquad (3)$$

where $D\big(\tilde{p}_l(x, y), \tilde{p}_l(x) p_\theta(y|x)\big) = \sum_{(x,y) \in D_l} \tilde{p}_l(x, y) \log \frac{\tilde{p}_l(x, y)}{\tilde{p}_l(x) p_\theta(y|x)}$ and $H\big(p_\theta(y|x)\big) = -\sum_y p_\theta(y|x) \log p_\theta(y|x)$. Here we use $\tilde{p}_l(x, y)$ to denote the empirical distribution of both X and Y on the labeled data $D_l$, $\tilde{p}_l(x)$ to denote the empirical distribution of X on the labeled data $D_l$, and $\tilde{p}_u(x)$ to denote the empirical distribution of X on the unlabeled data $D_u$.

In this paper, we propose an alternative approach for semi-supervised CRFs. Rather than using minimum conditional entropy as a regularization term on unlabeled data, we use minimum mutual information on unlabeled data. This approach has a nice and strong information theoretic interpretation through rate distortion theory.

We define the marginal distribution $p_\theta(y)$ of our discriminative model on unlabeled data $D_u$ to be $p_\theta(y) = \sum_{x \in D_u} \tilde{p}_u(x) p_\theta(y|x)$ over the input data x. 
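To make Eqs. (1)–(3) concrete, here is a minimal sketch (our illustration, not the authors' code) for a flat multi-class model $p_\theta(y|x) \propto \exp(\langle\theta, f(x,y)\rangle)$, with the standard multinomial logistic parameterization of $f$ assumed for brevity; it computes the supervised loss of Eq. (1) and the conditional entropy regularizer that Eq. (2) adds on unlabeled data:

```python
# Minimal sketch (illustration only) of Eq. (1) and the conditional entropy
# regularizer of Eq. (2) for a flat multi-class model.
import numpy as np

def log_p_y_given_x(theta, X):
    """Rows are log p_theta(y|x); theta has shape (n_features, K)."""
    scores = X @ theta
    scores -= scores.max(axis=1, keepdims=True)      # numerical stability
    return scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))

def supervised_loss(theta, X_l, y_l, lam):
    """Eq. (1): negative log conditional likelihood plus lambda * ||theta||^2 / 2."""
    logp = log_p_y_given_x(theta, X_l)
    return -logp[np.arange(len(y_l)), y_l].sum() + lam * 0.5 * (theta ** 2).sum()

def conditional_entropy(theta, X_u):
    """-sum_j sum_y p(y|x^(j)) log p(y|x^(j)); Eq. (2) adds gamma times this."""
    logp = log_p_y_given_x(theta, X_u)
    return -(np.exp(logp) * logp).sum()

rng = np.random.default_rng(0)
theta = rng.normal(size=(3, 2))                      # 3 features, 2 labels (toy data)
X_l, y_l = rng.normal(size=(5, 3)), rng.integers(0, 2, size=5)
X_u = rng.normal(size=(8, 3))
objective = supervised_loss(theta, X_l, y_l, lam=0.1) + 0.5 * conditional_entropy(theta, X_u)
print(objective)
```

Minimizing the entropy term drives $p_\theta(y|x)$ toward confident predictions on unlabeled points, which is exactly the degeneracy risk discussed in the introduction.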
Then the mutual information between the empirical distribution $\tilde{p}_u(x)$ and the discriminative model is

$$I\big(\tilde{p}_u(x), p_\theta(y|x)\big) = \sum_{x \in D_u} \sum_{y} \tilde{p}_u(x) p_\theta(y|x) \log\frac{\tilde{p}_u(x) p_\theta(y|x)}{\tilde{p}_u(x) p_\theta(y)} = H\big(p_\theta(y)\big) - \sum_{x \in D_u} \tilde{p}_u(x) H\big(p_\theta(y|x)\big)$$

where $H\big(p_\theta(y)\big) = -\sum_y \sum_{x \in D_u} \tilde{p}_u(x) p_\theta(y|x) \log\big(\sum_{x \in D_u} \tilde{p}_u(x) p_\theta(y|x)\big)$ is the entropy of the label Y on unlabeled data. Thus, in rate distortion terminology, the empirical distribution of unlabeled data $\tilde{p}_u(x)$ corresponds to the input distribution, the model $p_\theta(y|x)$ corresponds to the probabilistic mapping from X to Y, and $p_\theta(y)$ corresponds to the output distribution of Y.

Our proposed rate distortion approach for semi-supervised CRFs solves the following constrained optimization problem:

$$\min_\theta\; I\big(\tilde{p}_u(x), p_\theta(y|x)\big) \quad \text{s.t.} \quad D\big(\tilde{p}_l(x, y), \tilde{p}_l(x) p_\theta(y|x)\big) + \lambda U(\theta) \le d \qquad (4)$$

The rationale for this formulation can be seen from an information-theoretic perspective using rate distortion theory [14]. Assume we have a source X with a source distribution p(x) and its compressed representation Y through a probabilistic mapping $p_\theta(y|x)$. If there is a large set of features (infinite in the extreme case), this probabilistic mapping might be too redundant, so we should look for its minimum description. What determines the quality of the compression is the information rate, i.e. the average number of bits per message needed to specify an element in the representation without confusion. 
According to the standard asymptotic arguments [14], this quantity is bounded below by the mutual information $I\big(p(x), p_\theta(y|x)\big)$, since the average cardinality of the partitioning of X is given by the ratio of the volume of X to the average volume of the elements of X that are mapped to the same representation Y through $p_\theta(y|x)$: $2^{H(X)}/2^{H(X|Y)} = 2^{I(X;Y)}$. Thus mutual information is the minimum information rate and is used as a good metric for clustering [26, 27]. The true distribution of X should be used to compute the mutual information. Since it is unknown, we use its empirical distribution on the unlabeled data set $D_u$ and the mutual information $I\big(\tilde{p}_u(x), p_\theta(y|x)\big)$ instead. However, the information rate alone is not enough to characterize a good representation, since the rate can always be reduced by throwing away many features in the probabilistic mapping. This likely makes the mapping too simple and leads to distortion. Therefore we need an additional constraint provided through a distortion function which is presumed to be small for good representations. Apparently there is a tradeoff between minimum representation and maximum distortion. Since the joint distribution gives the distribution for the pair of X and its representation Y, we choose the log likelihood ratio, $\log\frac{p(x, y)}{p(x) p_\theta(y|x)}$, plus a regularized complexity term of $\theta$, $\lambda U(\theta)$, as the distortion function. Thus the expected distortion is the non-negative term $D\big(p(x, y), p(x) p_\theta(y|x)\big) + \lambda U(\theta)$. Again the true distributions p(x, y) and p(x) should be used here, but they are unknown. 
In the semi-supervised setting, we have labeled data available which provides valuable information to measure the distortion: we use the empirical distributions on the labeled data set $D_l$ and the expected distortion $D\big(\tilde{p}_l(x, y), \tilde{p}_l(x) p_\theta(y|x)\big) + \lambda U(\theta)$ instead to encode the information provided by labeled data, and add a distortion constraint that we should respect for the data compression to help the clustering. There is a monotonic tradeoff between the rate of the compression and the expected distortion: the larger the rate, the smaller is the achievable distortion. Given a distortion measure between X and Y on the labeled data set $D_l$, what is the minimum rate description required to achieve a particular distortion on the unlabeled data set $D_u$? The answer can be obtained by solving (4).

Following the standard procedure, we convert the constrained optimization problem (4) into an unconstrained optimization problem which minimizes the following objective:

$$RL_{MI}(\theta) = I\big(\tilde{p}_u(x), p_\theta(y|x)\big) + \kappa\Big(D\big(\tilde{p}_l(x, y), \tilde{p}_l(x) p_\theta(y|x)\big) + \lambda U(\theta)\Big) \qquad (5)$$

where $\kappa > 0$, which again is equivalent to minimizing the following objective (with $\gamma = 1/\kappa$)$^1$:

$$RL_{MI}(\theta) = D\big(\tilde{p}_l(x, y), \tilde{p}_l(x) p_\theta(y|x)\big) + \lambda U(\theta) + \gamma I\big(\tilde{p}_u(x), p_\theta(y|x)\big) \qquad (6)$$

If (4) is a convex optimization problem, then for every solution $\theta$ to Eq. (4) found using some particular value of d, there is some corresponding value of $\gamma$ in the optimization problem (6) that will give the same $\theta$. Thus, these are two equivalent re-parameterizations of the same problem. 
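In the flat multi-class case, the regularizer $I\big(\tilde{p}_u(x), p_\theta(y|x)\big)$ appearing in (5)–(6) can be computed directly. A hedged sketch (our toy illustration; $\tilde{p}_u$ is taken uniform over the unlabeled set, as it is for an unlabeled set without duplicates):

```python
# Sketch of the mutual information regularizer I(ptilde_u(x), p_theta(y|x)) =
# H(p_theta(y)) - sum_x ptilde_u(x) H(p_theta(y|x)) for a flat multi-class
# model, with ptilde_u uniform over the unlabeled points (toy assumption).
import numpy as np

def mutual_information(cond):
    """cond[j, y] = p_theta(y | x^(j)); returns I under uniform ptilde_u."""
    p_y = cond.mean(axis=0)                            # marginal p_theta(y)
    h_marginal = -(p_y * np.log(p_y)).sum()
    h_conditional = -(cond * np.log(cond)).sum(axis=1).mean()
    return h_marginal - h_conditional                  # always >= 0

rng = np.random.default_rng(1)
cond = rng.dirichlet(np.ones(4), size=10)              # 10 unlabeled points, 4 labels
mi = mutual_information(cond)
print(mi)
```

Note the extra $H(p_\theta(y))$ term relative to the conditional entropy regularizer of Eq. (3); it is exactly this term that later makes the structured-prediction derivative intractable.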
The equivalence between the two problems can be verified using convex analysis [8] by noting that the Lagrangian for the constrained optimization (4) is exactly the objective in the optimization (5) (plus a constant that does not depend on $\theta$), where $\kappa$ is the Lagrange multiplier. Thus, (4) can be solved by solving either (5) or (6) for an appropriate $\kappa$ or $\gamma$. Unfortunately, (4) is not a convex optimization problem, because its objective $I\big(\tilde{p}_u(x), p_\theta(y|x)\big)$ is not convex. This can be verified using the same argument as in the minimum conditional entropy regularization case [15, 16]. There may be some minima of (4) that do not minimize (5) or (6) whatever the value of $\kappa$ or $\gamma$ may be. This is, however, not essential to motivate the optimization criterion. Moreover, there are generally local minima in (5) or (6) due to the non-convexity of the mutual information regularization term.

Another training method for semi-supervised CRFs is the maximum entropy approach, which maximizes the conditional entropy (minimizes the negative conditional entropy) over the unlabeled data $D_u$ subject to the constraint on the labeled data $D_l$:

$\min_\theta \Big( -\sum_{x \in D_u} \tilde{p}_u(x) H\big(p_\theta(y|x)\big) \Big)$ s.t. 
$D\big(\tilde{p}_l(x, y), \tilde{p}_l(x) p_\theta(y|x)\big) + \lambda U(\theta) \le d$ (7)

Again following the standard procedure, we convert the constrained optimization problem (7) into an unconstrained optimization problem which minimizes the following objective:

$$RL_{maxCE}(\theta) = D\big(\tilde{p}_l(x, y), \tilde{p}_l(x) p_\theta(y|x)\big) + \lambda U(\theta) - \gamma \sum_{x \in D_u} \tilde{p}_u(x) H\big(p_\theta(y|x)\big) \qquad (8)$$

$^1$For the part of unlabeled data, the MMIHMM algorithm [24] maximizes the mutual information, $I(\tilde{p}_u(x), p_\theta(x|y))$, of a generative model $p_\theta(x|y)$ instead, which is equivalent to minimizing the conditional entropy of a generative model $p_\theta(x|y)$, since $I(\tilde{p}_u(x), p_\theta(x|y)) = H(\tilde{p}_u(x)) - H(p_\theta(x|y))$ and $H(\tilde{p}_u(x))$ is a constant.

Again, minimizing (8) is not exactly equivalent to (7); however, this is not essential to motivate the optimization criterion. Comparing the maximum entropy approach with the minimum conditional entropy approach, there is only a sign change on the conditional entropy term.

For non-parametric models, using the analysis developed in [5, 6, 7, 25], it can be shown that the maximum conditional entropy approach is equivalent to the rate distortion approach when we compress code vectors in a mass constrained scheme [25]. But for parametric models such as CRFs, these three approaches are completely distinct.

The difference between our rate distortion approach for semi-supervised CRFs (6) and the minimum conditional entropy regularized semi-supervised CRFs (2) lies not only in the different sign of the conditional entropy on unlabeled data but also in the additional term, the entropy of $p_\theta(y)$ on unlabeled data. It is this term that makes direct computation of the derivative of the objective for the rate distortion approach for semi-supervised CRFs intractable. To see why, we take the derivative of this term with 
To see why, we take derivative of this term with\nrespect to \u03b8, we have:\n\n\u2202\n\n\u2202\u03b8\u201c \u2212 H(p\u03b8(y))\u201d = Xx\u2208Du\n\u2212 Xx\u2208Du\n\n\u02dcpu(x)Xy\n\u02dcpu(x)Xy\n\np\u03b8(y|x)f (x, y) log\u201c Xx\u2208Du\np\u03b8(y|x) log\u201c Xx\u2208Du\n\n\u02dcpu(x)p\u03b8(y|x)\u201d\n\u02dcpu(x)p\u03b8(y|x)\u201dXy\n\n\u2032\n\np\u03b8(y\n\n\u2032|x)f (x, y\n\n\u2032)\n\nIn the case of structured prediction, the number of sums over Y is exponential, and there is a sum\ninside the log. These make the computation of the derivative intractable even for a simple chain\nstructured CRF.\n\nAn alternative way to solve (6) is to use the famous algorithm for the computation of the rate distor-\ntion function established by Blahut [6] and Arimoto [3]. Corduneanu and Jaakkola [12, 13] proposed\na distributed propagation algorithm, a variant of Blahut-Arimoto algorithm, to solve their problem.\nHowever as illustrated in the following, this approach is still intractable for structured prediction in\nour case.\nBy extending a lemma for computing rate distortion in [14] to parametric models, we can rewrite\nthe minimization problem (5) of mutual information regularized semi-supervised CRFs as a double\nminimization,\n\nmin\n\n\u03b8\n\nmin\nr(y)\n\ng(\u03b8, r(y)) where\n\ng(\u03b8, r(y)) = Xx\u2208DuXy\n\n\u02dcpu(x)p\u03b8(y|x) log\n\np\u03b8(y|x)\n\nr(y)\n\n+ \u03ba\u201cD\u201c\u02dcpl(x, y), \u02dcpl(x)p\u03b8(y|x)\u201d + \u03bbU (\u03b8)\u201d\n\nWe can use an alternating minimization algorithm to \ufb01nd a local minimum of RLM I (\u03b8). First, we\nassign the initial CRF model to be the optimal solution of the supervised CRF on labeled data and\ndenote it as p\u03b8(0)(y|x). 
Then we define $r^{(0)}(y)$, and in general $r^{(t)}(y)$ for $t \ge 1$, by

$$r^{(t)}(y) = \sum_{x \in D_u} \tilde{p}_u(x) p_{\theta^{(t)}}(y|x) \qquad (9)$$

In order to define $p_{\theta^{(1)}}(y|x)$, and in general $p_{\theta^{(t)}}(y|x)$, we need to find the $p_\theta(y|x)$ which minimizes g for a given r(y). The gradient of $g(\theta, r(y))$ with respect to $\theta$ is

$$\frac{\partial}{\partial\theta} g(\theta, r(y)) = \sum_{i=N+1}^{M} \tilde{p}_u(x^{(i)}) \Big( \mathrm{cov}_{p_\theta(y|x^{(i)})}\big[f(x^{(i)}, y)\big]\,\theta - \sum_{y} p_\theta(y|x^{(i)}) f(x^{(i)}, y) \log r(y) \qquad (10)$$
$$\qquad + \sum_{y} p_\theta(y|x^{(i)}) \log r(y) \sum_{y'} p_\theta(y'|x^{(i)}) f(x^{(i)}, y') \Big) \qquad (11)$$
$$\qquad - \kappa \sum_{i=1}^{N} \sum_{y} \tilde{p}_l(x^{(i)}, y) \Big( f(x^{(i)}, y) - \sum_{y'} p_\theta(y'|x^{(i)}) f(x^{(i)}, y') \Big) + \kappa\lambda \frac{\partial}{\partial\theta} U(\theta) \qquad (12)$$

Even though the first term in Eq. (10) and the terms in Eq. (12) can be efficiently computed via recursive formulas [16], we run into the same intractability when computing the second term of Eq. (10) and Eq. (11), since the number of summands over Y is exponential and implicitly there is a sum inside the log due to r(y). This makes the computation of the derivative in the alternating minimization algorithm intractable.

3 A variational training procedure

In this section, we derive a convergent variational algorithm to train rate distortion based semi-supervised CRFs for sequence labeling. The basic idea of convexity-based variational inference is to make use of Jensen's inequality to obtain an adjustable upper bound on the objective function [17]. 
Essentially, one considers a family of upper bounds indexed by a set of variational parameters. The variational parameters are chosen by an optimization procedure that attempts to find the tightest possible upper bound.

Following Jordan et al. [17], we begin by introducing a variational distribution q(x) to bound $H(p_\theta(y))$ using Jensen's inequality as follows:

$$H(p_\theta(y)) = -\sum_{y} \sum_{x \in D_u} \tilde{p}_u(x) p_\theta(y|x) \log\Big( \sum_{x \in D_u} q(x) \frac{\tilde{p}_u(x) p_\theta(y|x)}{q(x)} \Big) \le -\sum_{y} \sum_{j=N+1}^{M} \tilde{p}_u(x^{(j)}) p_\theta(y|x^{(j)}) \Bigg[ \sum_{l=N+1}^{M} q(x^{(l)}) \log\Big( \frac{\tilde{p}_u(x^{(l)}) p_\theta(y|x^{(l)})}{q(x^{(l)})} \Big) \Bigg]$$

Thus the desideratum of finding a tight upper bound of $RL_{MI}(\theta)$ in Eq. (6) translates directly into the following alternative optimization problem: $(\theta^*, q^*) = \min_{\theta, q} \mathcal{U}(\theta, q)$, where

$$\mathcal{U}(\theta, q) = -\sum_{i=1}^{N} \tilde{p}_l(x^{(i)}, y^{(i)}) \log p_\theta(y^{(i)}|x^{(i)}) + \lambda U(\theta) - \gamma \sum_{j=N+1}^{M} \sum_{l=N+1}^{M} \tilde{p}_u(x^{(j)}) q(x^{(l)}) \sum_{y} p_\theta(y|x^{(j)}) \log p_\theta(y|x^{(l)}) \qquad (13)$$
$$\qquad + \gamma \sum_{l=N+1}^{M} q(x^{(l)}) \log\frac{q(x^{(l)})}{\tilde{p}_u(x^{(l)})} + \gamma \sum_{j=N+1}^{M} \sum_{y} \tilde{p}_u(x^{(j)}) p_\theta(y|x^{(j)}) \log p_\theta(y|x^{(j)}) \qquad (14)$$

Minimizing $\mathcal{U}$ with respect to q has a closed form solution,

$$q(x^{(l)}) = \frac{ \tilde{p}_u(x^{(l)}) \exp\Big( \sum_{j=N+1}^{M} \sum_{y} \tilde{p}_u(x^{(j)}) p_\theta(y|x^{(j)}) \log p_\theta(y|x^{(l)}) \Big) }{ \sum_{k=N+1}^{M} \tilde{p}_u(x^{(k)}) \exp\Big( \sum_{j=N+1}^{M} \sum_{y} \tilde{p}_u(x^{(j)}) p_\theta(y|x^{(j)}) \log p_\theta(y|x^{(k)}) \Big) } \quad \forall\, x^{(l)} \in D_u \qquad (15)$$

It can be shown that

$$\mathcal{U}(\theta, q) - RL_{MI}(\theta) = \gamma \sum_{y} \Big( \sum_{x \in D_u} \tilde{p}_u(x) p_\theta(y|x) \Big) D\big(q(x), q_\theta(x|y)\big) \ge 0 \qquad (16)$$

where $q_\theta(x|y) = \frac{\tilde{p}_u(x) p_\theta(y|x)}{\sum_{x \in D_u} \tilde{p}_u(x) p_\theta(y|x)}$ for all $x \in D_u$. 
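The Jensen bound above, and hence $\mathcal{U}(\theta, q) \ge RL_{MI}(\theta)$, can be checked numerically in the flat case. A toy sketch (our illustration; uniform $\tilde{p}_u$, random variational distributions q):

```python
# Numeric check of the Jensen bound: for any variational q over the unlabeled
# points, H(p_theta(y)) <= -sum_y p_theta(y) sum_x q(x) log(ptilde(x)p(y|x)/q(x)).
import numpy as np

rng = np.random.default_rng(3)
M, K = 6, 3
cond = rng.dirichlet(np.ones(K), size=M)     # rows are p_theta(y|x)
pt = np.full(M, 1.0 / M)                     # empirical ptilde_u(x)
p_y = pt @ cond                              # marginal p_theta(y)
h_true = -(p_y * np.log(p_y)).sum()

for _ in range(100):
    q = rng.dirichlet(np.ones(M))            # arbitrary variational q(x)
    inner = (q[:, None] * (np.log(pt[:, None] * cond) - np.log(q)[:, None])).sum(axis=0)
    h_bound = -(p_y * inner).sum()
    assert h_true <= h_bound + 1e-12         # the bound never dips below H
print(h_true)
```

The gap between the two sides is exactly the KL term of Eq. (16), which vanishes only when q matches the posteriors $q_\theta(x|y)$.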
Thus $\mathcal{U}$ is bounded below, and the alternating minimization algorithm monotonically decreases $\mathcal{U}$ and converges.

In order to calculate the derivative of $\mathcal{U}$ with respect to $\theta$, we just need to notice that the first term in Eq. (13) is the log-likelihood of the CRF, the first term in Eq. (14) is a constant (with respect to $\theta$), and the second term in Eq. (14) is the conditional entropy in [16]. They can all be efficiently computed [16, 21]. In the following, we show how to compute the derivative of the last term in Eq. (13) using an idea similar to that proposed in [21]. Without loss of generality, we assume all the unlabeled data are of equal lengths in the sequence labeling case. We will describe how to handle the case of unequal lengths in Sec. 4.

We define $A(y, x^{(j)}, x^{(l)}) = \sum_{y} p_\theta(y|x^{(j)}) \log p_\theta(y|x^{(l)})$ in (13) for a fixed (j, l) pair, where we assume $x^{(j)}$ and $x^{(l)}$ form two linear-chain graphs of equal lengths. We can calculate the derivative of $A(y, x^{(j)}, x^{(l)})$ with respect to the k-th parameter $\theta_k$, where all the terms can be computed through standard dynamic programming techniques in CRFs except one term, $\sum_{y} p_\theta(y|x^{(j)}) \log p_\theta(y|x^{(l)}) f_k(x^{(j)}, y)$. Nevertheless, similar to [21], we compute this term as follows: we first define the pairwise subsequence constrained entropy on $(x^{(j)}, x^{(l)})$ (as opposed to the subsequence constrained entropy defined in [21]) as:

$$H^\sigma_{jl}(y_{-(a..b)}|y_{a..b}, x^{(j)}, x^{(l)}) = \sum_{y_{-(a..b)}} p_\theta(y_{-(a..b)}|y_{a..b}, x^{(j)}) \log p_\theta(y_{-(a..b)}|y_{a..b}, x^{(l)})$$

where $y_{-(a..b)}$ is the label sequence with its subsequence $y_{a..b}$ fixed. If we have $H^\sigma_{jl}$ for all (a, b), then the term $\sum_{y} p_\theta(y|x^{(j)}) \log p_\theta(y|x^{(l)}) f_k(x^{(j)}, y)$ can be easily computed. 
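The tractability of such sums rests on the Markov structure of linear-chain CRF conditionals. A simplified sketch (our toy first-order Markov chains standing in for $p_\theta(y|x^{(j)})$ and $p_\theta(y|x^{(l)})$, which is the factorization a linear-chain conditional has) checks that the exponential sum $\sum_y p_\theta(y|x^{(j)}) \log p_\theta(y|x^{(l)})$ reduces to sums over unary and pairwise marginals:

```python
# Check that sum_y p_j(y) log p_l(y) over all K**n label sequences reduces to
# a linear-time dynamic program when p_j and p_l are first-order Markov chains
# (toy stand-ins for two linear-chain CRF conditionals of equal length).
import itertools
import numpy as np

rng = np.random.default_rng(5)
n, K = 5, 3
init_j, init_l = rng.dirichlet(np.ones(K), size=2)
T_j = rng.dirichlet(np.ones(K), size=(n - 1, K))   # T_j[i, a, b] = p_j(y_{i+1}=b | y_i=a)
T_l = rng.dirichlet(np.ones(K), size=(n - 1, K))

def chain_prob(init, T, y):
    p = init[y[0]]
    for i in range(1, len(y)):
        p *= T[i - 1, y[i - 1], y[i]]
    return p

# Brute force over all K**n label sequences (feasible only for tiny chains).
brute = sum(chain_prob(init_j, T_j, y) * np.log(chain_prob(init_l, T_l, y))
            for y in itertools.product(range(K), repeat=n))

# Dynamic program: propagate p_j's marginals and accumulate expected log p_l.
dp = (init_j * np.log(init_l)).sum()
marg = init_j.copy()                               # p_j(y_i)
for i in range(n - 1):
    pair = marg[:, None] * T_j[i]                  # p_j(y_i, y_{i+1})
    dp += (pair * np.log(T_l[i])).sum()
    marg = pair.sum(axis=0)
assert np.isclose(brute, dp)
print(dp)
```

The paper's $H^\alpha$/$H^\beta$ recursions apply the same decomposition while additionally conditioning on a fixed subsequence $y_{a..b}$.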
Using the independence property of linear-chain CRFs, we have the following:

$$\sum_{y_{-(a..b)}} p_\theta(y_{-(a..b)}, y_{a..b}|x^{(j)}) \log p_\theta(y_{-(a..b)}, y_{a..b}|x^{(l)}) = p_\theta(y_{a..b}|x^{(j)}) \log p_\theta(y_{a..b}|x^{(l)}) + p_\theta(y_{a..b}|x^{(j)}) H^\alpha_{jl}(y_{1..(a-1)}|y_a, x^{(j)}, x^{(l)}) + p_\theta(y_{a..b}|x^{(j)}) H^\beta_{jl}(y_{(b+1)..n}|y_b, x^{(j)}, x^{(l)})$$

Given $H^\alpha_{jl}(\cdot)$ and $H^\beta_{jl}(\cdot)$, any sequence entropy can be computed in constant time [21]. Computing $H^\alpha_{jl}(\cdot)$ can be done using the following dynamic programming [21]:

$$H^\alpha_{jl}(y_{1..i}|y_{i+1}, x^{(j)}, x^{(l)}) = \sum_{y_i} p_\theta(y_i|y_{i+1}, x^{(j)}) \log p_\theta(y_i|y_{i+1}, x^{(l)}) + \sum_{y_i} p_\theta(y_i|y_{i+1}, x^{(j)}) H^\alpha_{jl}(y_{1..(i-1)}|y_i, x^{(j)}, x^{(l)})$$

The base case for the dynamic programming is $H^\alpha_{jl}(\emptyset|y_1, x^{(j)}, x^{(l)}) = 0$. All the probabilities (i.e., $p_\theta(y_i|y_{i+1}, x^{(j)})$) needed in the above formula can be obtained using belief propagation. $H^\beta_{jl}(\cdot)$ can be similarly computed using dynamic programming.

4 Experiments

We compare our rate distortion approach for semi-supervised learning with state-of-the-art semi-supervised learning algorithms, the minimum conditional entropy approach and the maximum conditional entropy approach, on two real-world problems: text categorization and hand-written character recognition. The purpose of the first task is to show the effectiveness of the rate distortion approach over the minimum and maximum conditional entropy approaches when no approximation is needed in training. In the second task, a variational method has to be used to train semi-supervised chain structured CRFs. 
We demonstrate the effectiveness of the rate distortion approach over the minimum and maximum conditional entropy approaches even when an approximation is used during training.

4.1 Text categorization
We select different class pairs from the 20 newsgroups dataset$^2$ to construct our binary classification problems. The chosen classes are similar to each other and thus hard for classification algorithms. We use the Porter stemmer to reduce the morphological word forms. For each label, we rank words based on their mutual information with that label (whether it predicts label 1 or 0). Then we choose the top 100 words as our features. For each problem, we select 15% of the training data, about 150 instances, as the labeled training data and select the unlabeled data from the remaining data. The validation set (for setting the free parameters, e.g. $\lambda$ and $\gamma$) contains 100 instances. The test set contains about 700 instances. We vary the ratio between the amount of unlabeled and labeled data, repeat the experiments ten times with different randomly selected labeled and unlabeled training data, and report the mean and standard deviation over the different trials. For each run, we initialize the model parameters for mutual information (MI) regularization and maximum/minimum conditional entropy (CE) regularization using the parameters learned from an l2-regularized logistic regression classifier. Figure 1 shows the classification accuracies of these four regularization methods versus the ratio between the amount of unlabeled and labeled data on the different classification problems. We can see that mutual information regularization outperforms the other three regularization schemes. In most cases, maximum CE regularization outperforms minimum CE regularization and the baseline (logistic regression with l2 regularization), which uses only the labeled data. 
Although the randomly selected labeled instances differ across experiments, we should not see a significant difference in the baseline performance, since for each particular ratio of labeled to unlabeled data the performance is averaged over ten runs. We suspect the performance differences of the baseline models in Figure 1 are due to our feature selection phase.

2http://people.csail.mit.edu/jrennie/20Newsgroups

Figure 1: Results on five different binary classification problems in text categorization (left to right): comp.os.ms-windows.misc vs comp.sys.mac.hardware; rec.autos vs rec.motorcycles; rec.sport.baseball vs rec.sport.hockey; talk.politics.guns vs talk.politics.misc; sci.electronics vs sci.med. [Each panel plots accuracy versus the unlabeled/labeled ratio for MI, minCE, maxCE, and L2.]

Figure 2: Results on hand-written character recognition: (left) sequence labeling; (right) multi-class classification. [Each panel plots accuracy versus the unlabeled/labeled ratio for MI, minCE, maxCE, and L2.]

4.2 Hand-written character recognition

Our dataset for hand-written character recognition contains ~6000 handwritten words with an average length of ~8 characters. Each word was divided into characters, and each character was resized to a 16 × 8 binary image. We choose ~600 words as labeled data, ~600 words as validation data, and ~2000 words as test data. As in text categorization, we vary the ratio between the amount of unlabeled and labeled data, and report the mean and standard deviation of classification accuracies over several trials.

We use a chain-structured graph to model hand-written character recognition as a sequence labeling problem, similar to [29]. Since the unlabeled sequences may have different lengths, we modify the mutual information as I = Σ_ℓ I_ℓ, where I_ℓ is the mutual information computed on all the unlabeled data of length ℓ. We compare our approach (MI) with the other regularizations (maximum/minimum conditional entropy, l2). The results are shown in Fig. 2 (left). As a sanity check, we have also tried solving hand-written character recognition as a multi-class classification problem, i.e., without considering the correlation between adjacent characters in a word. The results are shown in Fig. 2 (right). We can see that MI regularization outperforms maxCE, minCE and l2 regularizations in both the multi-class and sequence labeling cases.
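The length-wise decomposition I = Σ_ℓ I_ℓ used above amounts to grouping the unlabeled sequences by length and summing a per-length MI term. A minimal sketch (the grouping only; `mi_for_group`, which would compute I_ℓ for one same-length group, is a hypothetical callable):

```python
from collections import defaultdict

def total_mutual_information(unlabeled_seqs, mi_for_group):
    """Sum per-length MI terms I_ell over groups of same-length sequences.

    mi_for_group is a hypothetical callable returning I_ell for one group;
    grouping ensures each term compares label configurations of equal length.
    """
    groups = defaultdict(list)
    for x in unlabeled_seqs:
        groups[len(x)].append(x)
    return sum(mi_for_group(g) for g in groups.values())
```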
There are significant gains in the structured learning setting compared with the standard multi-class classification setting.

5 Conclusion and future work

We have presented a new semi-supervised discriminative learning algorithm to train CRFs. The proposed approach is motivated by the rate distortion framework in information theory and uses the mutual information on the unlabeled data as a regularization term, or, more precisely, a data-dependent prior. Even though a variational approximation has to be used during training even for a simple chain-structured graph, our experimental results show that the proposed rate distortion approach outperforms supervised CRFs with l2 regularization as well as state-of-the-art semi-supervised minimum and maximum conditional entropy approaches on both multi-class classification and sequence labeling problems. As future work, we would like to apply this approach to other graph structures, develop more efficient learning algorithms, and illuminate how reducing the information rate helps generalization.

References

[1] S. Abney. Semi-Supervised Learning for Computational Linguistics. Chapman & Hall/CRC, 2007.
[2] Y. Altun, D. McAllester and M. Belkin. Maximum margin semi-supervised learning for structured variables. NIPS, 18:33-40, 2005.
[3] S. Arimoto. An algorithm for computing the capacity of arbitrary discrete memoryless channels. IEEE Transactions on Information Theory, 18(1):14-20, 1972.
[4] L. Bahl, P. Brown, P. de Souza and R. Mercer. Maximum mutual information estimation of hidden Markov model parameters for speech recognition. ICASSP, 11:49-52, 1986.
[5] T. Berger and J. Gibson. Lossy source coding. IEEE Transactions on Information Theory, 44(6):2693-2723, 1998.
[6] R. Blahut. Computation of channel capacity and rate-distortion functions.
IEEE Transactions on Information Theory, 18:460-473, 1972.
[7] R. Blahut. Principles and Practice of Information Theory. Addison-Wesley, 1987.
[8] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
[9] U. Brefeld and T. Scheffer. Semi-supervised learning for structured output variables. ICML, 145-152, 2006.
[10] O. Chapelle, B. Schölkopf and A. Zien. Semi-Supervised Learning. MIT Press, 2006.
[11] A. Corduneanu and T. Jaakkola. On information regularization. UAI, 151-158, 2003.
[12] A. Corduneanu and T. Jaakkola. Distributed information regularization on graphs. NIPS, 17:297-304, 2004.
[13] A. Corduneanu and T. Jaakkola. Data dependent regularization. In Semi-Supervised Learning, O. Chapelle, B. Schölkopf and A. Zien (Editors), 163-182, MIT Press, 2006.
[14] T. Cover and J. Thomas. Elements of Information Theory. Wiley, 1991.
[15] Y. Grandvalet and Y. Bengio. Semi-supervised learning by entropy minimization. NIPS, 17:529-536, 2004.
[16] F. Jiao, S. Wang, C. Lee, R. Greiner and D. Schuurmans. Semi-supervised conditional random fields for improved sequence segmentation and labeling. COLING/ACL, 209-216, 2006.
[17] M. Jordan, Z. Ghahramani, T. Jaakkola and L. Saul. An introduction to variational methods for graphical models. Machine Learning, 37:183-233, 1999.
[18] D. Jurafsky and J. Martin. Speech and Language Processing, 2nd Edition. Prentice Hall, 2008.
[19] J. Lafferty, A. McCallum and F. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. ICML, 282-289, 2001.
[20] C. Lee, S. Wang, F. Jiao, D. Schuurmans and R. Greiner. Learning to model spatial dependency: Semi-supervised discriminative random fields. NIPS, 19:793-800, 2006.
[21] G. Mann and A. McCallum. Efficient computation of entropy gradient for semi-supervised conditional random fields. NAACL/HLT, 109-112, 2007.
[22] G. Mann and A. McCallum. Generalized expectation criteria for semi-supervised learning of conditional random fields. ACL, 870-878, 2008.
[23] Y. Normandin. Maximum mutual information estimation of hidden Markov models. In Automatic Speech and Speaker Recognition: Advanced Topics, C. Lee, F. Soong and K. Paliwal (Editors), 57-81, Springer, 1996.
[24] N. Oliver and A. Garg. MMIHMM: Maximum mutual information hidden Markov models. ICML, 466-473, 2002.
[25] K. Rose. Deterministic annealing for clustering, compression, classification, regression, and related optimization problems. Proceedings of the IEEE, 86:2210-2239, 1998.
[26] N. Slonim, G. Atwal, G. Tkacik and W. Bialek. Information based clustering. Proceedings of the National Academy of Sciences (PNAS), 102:18297-18302, 2005.
[27] S. Still and W. Bialek. How many clusters? An information theoretic perspective. Neural Computation, 16:2483-2506, 2004.
[28] M. Szummer and T. Jaakkola. Information regularization with partially labeled data. NIPS, 1025-1032, 2002.
[29] B. Taskar, C. Guestrin and D. Koller. Max-margin Markov networks. NIPS, 16:25-32, 2003.
[30] N. Tishby, F. Pereira and W. Bialek. The information bottleneck method. The 37th Annual Allerton Conference on Communication, Control, and Computing, 368-377, 1999.
[31] L. Zheng, S. Wang, Y. Liu and C. Lee. Information theoretic regularization for semi-supervised boosting. KDD, 1017-1026, 2009.
[32] X. Zhu. Semi-supervised learning literature survey. Computer Sciences TR 1530, University of Wisconsin-Madison, 2007.