{"title": "Cooperative neural networks (CoNN): Exploiting prior independence structure for improved classification", "book": "Advances in Neural Information Processing Systems", "page_first": 4126, "page_last": 4136, "abstract": "We propose a new approach, called cooperative neural networks (CoNN), which use a set of cooperatively trained neural networks to capture latent representations that exploit prior given independence structure. The model is more flexible than traditional graphical models based on exponential family distributions, but incorporates more domain specific prior structure than traditional deep networks or variational autoencoders. The framework is very general and can be used to exploit the independence structure of any graphical model. We illustrate the technique by showing that we can transfer the independence structure of the popular Latent Dirichlet Allocation (LDA) model to a cooperative neural network, CoNN-sLDA. Empirical evaluation of CoNN-sLDA on supervised text classification tasks demonstrate that the theoretical advantages of prior independence structure can be realized in practice - we demonstrate a 23 percent reduction in error on the challenging MultiSent data set compared to state-of-the-art.", "full_text": "Cooperative neural networks (CoNN): Exploiting\n\nprior independence structure for improved\n\nclassi\ufb01cation\n\nHarsh Shrivastava \u2217\n\nGeorgia Tech\n\nhshrivastava3@gatech.edu\n\nEugene Bart \u2020\n\nPARC\n\nbart@parc.com\n\nBob Price \u2020\n\nPARC\n\nbprice@parc.com\n\nHanjun Dai \u2217\nGeorgia Tech\n\nhanjundai@gatech.edu\n\nBo Dai \u2217\nGeorgia Tech\n\nbodai@gatech.edu\n\nSrinivas Aluru \u2217\nGeorgia Tech\n\naluru@cc.gatech.edu\n\nAbstract\n\nWe propose a new approach, called cooperative neural networks (CoNN), which\nuses a set of cooperatively trained neural networks to capture latent representa-\ntions that exploit prior given independence structure. 
The model is more flexible than traditional graphical models based on exponential family distributions, but incorporates more domain-specific prior structure than traditional deep networks or variational autoencoders. The framework is very general and can be used to exploit the independence structure of any graphical model. We illustrate the technique by showing that we can transfer the independence structure of the popular Latent Dirichlet Allocation (LDA) model to a cooperative neural network, CoNN-sLDA. Empirical evaluation of CoNN-sLDA on supervised text classification tasks demonstrates that the theoretical advantages of prior independence structure can be realized in practice: we achieve a 23% reduction in error on the challenging MultiSent data set compared to state-of-the-art.

1 Introduction

Neural networks offer a low-bias solution for learning complex concepts such as the linguistic knowledge required to separate documents into thematically related classes. However, neural networks typically start with a fairly generic structure, with each level comprising a number of functionally equivalent neurons connected to other layers by identical, repetitive connections. Any structure present in the problem domain must be learned from training examples and encoded as weights. In practice, some domain structure is often known ahead of time; in such cases, it is desirable to pre-design a network with this domain structure in mind. In this paper, we present an approach that allows incorporating certain kinds of independence structure into a new kind of neural learning machine.

The proposed approach is called "Cooperative Neural Networks" (CoNN). This approach works by constructing a set of neural networks, each trained to output an embedding of a probability distribution.
The networks are iteratively updated so that each embedding is consistent with the embeddings of the other networks and with the training data. Like probabilistic graphical models, the representation is factored into components that are independent. Unlike probabilistic graphical models, which are limited to tractable conditional probability distributions (e.g., exponential family), CoNNs can exploit powerful generic distributions represented by non-linear neural networks. The resulting approach allows us to create models that exploit both known independence structure and the expressive power of neural networks to improve accuracy over competing approaches.

∗ Dept. of Comp. Sci. & Eng., Georgia Institute of Technology, Atlanta, GA 30332
† 3333 Coyote Hill Rd, Palo Alto, CA

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

We illustrate the general approach of cooperative neural networks by showing how one can transfer the independence structure of the popular Latent Dirichlet Allocation (LDA) model [2] to a set of cooperative neural networks. We call the resultant model CoNN-sLDA. Cooperative neural networks differ from feed-forward networks in that they use back-propagation to enforce consistency across variables within the latent representation. CoNN-sLDA improves over LDA as it admits more complex distributions for document topics and generalizes better over word distributions. CoNN-sLDA is also better than a generic neural network classifier because the factored representation forces a consistent latent feature representation with a natural relationship between topics, words and documents. We demonstrate empirically that the theoretical advantages of cooperative neural networks are realized in practice by showing that our CoNN-sLDA model beats both probabilistic and neural network-based state-of-the-art alternatives.
We emphasize that although our example is based on LDA, the CoNN approach is general and can be used with other graphical models, as well as other sources of independence structure (for example, physics- or biology-based constraints).

2 Related Work

Text classification has a long history, beginning with the use of support vector machines on text features [11]. More sophisticated approaches integrated unsupervised feature generation and classification in models such as sLDA [17, 6], discriminative LDA (discLDA) [13] and a maximum-margin-based combination [33].

One limitation of LDA-based models is that they pick topic distributions from a Dirichlet distribution and cannot represent the joint probability of topics in a document (e.g., Hollywood celebrities, politics and business are all popular categories, but politics and business appear together more often than their independent probabilities would predict). Models such as pachinko allocation [15] attempt to address this with complex tree-structured priors. Another limitation of LDA stems from the fact that word topics and words themselves are selected from categorical distributions. These admit arbitrary empirical distributions over tokens, but don't generalize what they learn. Learning about the topic for the token "happy" tells us nothing about the token "joyful".

There have been many generative deep learning models, such as Deep Boltzmann Machines [27], NADE [14, 32], variational auto-encoders (VAEs) [31] and variations [18], GANs [9] and other deep generative networks [28, 1, 22, 20], which can capture complex joint distributions of words in documents and surpass the performance of LDA. These techniques have proven to be good generative models. However, as purely generative models, they need a separate classifier to assign documents to classes. As a result, they are not trained end-to-end for the actual discriminative task that needs to be performed.
Therefore, the resulting representation that is learned does not incorporate any problem-specific structure, leading to limited classification performance. Supervised convolutional networks have been applied to text classification [12] but are limited to small fixed inputs and still require significant data to reach high accuracy. Recurrent networks have also been used to handle open-ended text [8]. A supervised approach for LDA with a DNN was developed by [4, 5], which performs end-to-end learning for LDA via mirror-descent back-propagation over a deep architecture called BP-sLDA. To achieve better classification, they have to increase the number of layers of their model, which raises model complexity and limits the model's ability to scale. In summary, there are still significant challenges to creating expressive, but efficiently trainable and computationally tractable models.

In the face of limited data, regularization techniques are an important way of reducing overfitting in neural approaches. The use of pretrained layers is a key regularization strategy; however, training industrial applications with domain-specific language and tasks remains challenging. For instance, classification of field problem reports must handle content with arcane technical jargon, abbreviations and phrasing, and be able to output task-specific categories.

Techniques such as L2 normalization of weights and random drop-out [26] of neurons during training are now widely used but provide little problem-specific advantage. Bayesian neural networks with distributions over weights have been proposed, but independent distributions over weights result in network weight means where the variance must be controlled fairly closely so that the relative relationship of weights produces the desired computation. Variational auto-encoders explicitly model probability distributions and can therefore be integrated over, but are still a largely undifferentiated structure of identical units. They don't provide much prior structure to assist with limited data.

Recently there has been work incorporating other kinds of domain-inspired structure into networks, such as Spatial transformer networks [10], capsule networks [23] and natural image priors [21].

Figure 1: Plate models representing the original LDA and its approximation. (a) LDA summarizes the content of each document m in M as a topic distribution θm. Each word wm,n in Nm has topic zm,n drawn from θm. (b) Variational LDA approximates the posterior topic distribution θm and word topic zm,n with independent distributions.

3 Deriving Cooperative Neural Networks

Application of our approach proceeds in several distinct steps. First, we define the independence structure for the problem. In our supervised text classification example, we incorporate structure from Latent Dirichlet Allocation (LDA) by choosing to factor the distribution over document texts into document topic probabilities and word topic probabilities. This structure naturally enforces the idea that there are topics common across all documents and that documents express a mixture of these topics independently through word choices. Second, a set of inference equations is derived from the independence structure. Next, the probability distributions involved in the variational approximation, as well as the inference equations, are mapped into a Hilbert space to reduce limitations on their functional form. Finally, these mapped Hilbert-space equations are approximated by a set of neural networks (one for each constraint), and inference in the Hilbert space is performed by iterating these networks.
We call the combination of Cooperative Neural Networks and LDA "Cooperative Neural Network supervised Latent Dirichlet Allocation", or CoNN-sLDA. These steps are elaborated in the following sections.

3.1 LDA model

Here, we use the same notation and the same plate diagram (Figure 1a) as in the original LDA description [2]. Let K be the number of topics, N be the number of words in a document, V be the vocabulary size over the whole corpus, and M be the number of documents in the corpus. Given the prior over topics α and topic word distributions β, the joint distribution over the latent topic structure θ, word topic assignments z, and observed words in documents w is given by:

p(θ, z, w | α, β) = p(θ|α) ∏_{i=1}^{N} p(zi|θ) p(wi|zi, β)    (1)

3.2 Variational approximation to LDA

Inference in LDA requires estimating the distribution over θ and z. Using Bayes' rule, this posterior can be written as follows:

p(θ, z | w, α, β) = p(θ, z, w | α, β) / p(w | α, β)    (2)

Unfortunately, directly marginalizing out θ in the original model is intractable. Variational approximation of p(θ, z) is a common work-around. To perform variational approximation, we approximate this LDA posterior with the Probabilistic Graphical Model (PGM) shown in Figure 1b. The joint distribution for the approximate PGM is given by:

q(θ, z) = q(θ) ∏_{i=1}^{N} qi(zi)    (3)

We want to tune the approximate distribution to resemble the true posterior as much as possible. To this end, we minimize the KL divergence between the two distributions. Alternatively, this can be seen as minimizing the variational free energy of the Mean-Field inference algorithm [30]:

min_{q} D_{KL}( q(θ, z) || p(θ, z | w, α, β) )    (4)

To solve this minimization problem, we derive a set of fixed-point equations in Appendix A. These fixed-point equations can be expressed as

log q(θ) = log p(θ|α) + ∑_{i=1}^{N} ∫_{zi} qi(zi) log p(zi|θ) dzi − 1    (5)

log qi(zi) = log p(wi|zi, β) + ∫_{θ} q(θ) log p(zi|θ) dθ − 1    (6)

This set of equations is difficult to solve analytically. In addition, even if it were possible to solve them analytically, they would still be subject to the limitations of the original graphical models, such as the need to use exponential family distributions and conjugate priors for tractability.

Therefore, the next step in the proposed method is to map the probability distributions and the corresponding fixed-point equations into a Hilbert space, where some of these limitations can be relaxed. Section 3.3 gives a general overview of Hilbert space embeddings, and Section 3.4 derives the corresponding equations for our model.

3.3 Hilbert Space Embeddings of Distributions

We follow the notation and procedure defined in [7] for parameterizing Hilbert spaces. By definition, Hilbert space embeddings of probability distributions are mappings of these distributions into potentially infinite-dimensional feature spaces [24]. For any given distribution p(X) and a feature map φ(x), the embedding µX : P → F is defined as:

µX := EX[φ(X)] = ∫_X φ(x) p(x) dx    (7)

For some choice of feature map φ, the above embedding of distributions becomes injective [25]. Therefore, any two distinct distributions p(X) and q(X) are mapped to two distinct points in the feature space.
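For intuition, the embedding in equation (7) can be estimated from samples: with an explicit feature map φ, µX is just the sample mean of the features, and two different distributions land at visibly different points. The sketch below is not from the paper; the random-Fourier-feature choice of φ and all names are illustrative assumptions.

```python
import numpy as np

def rff_feature_map(dim_in, dim_out, rng):
    """A random-Fourier-feature map approximating an RBF kernel's phi
    (a hypothetical concrete choice; the paper leaves phi abstract)."""
    W = rng.normal(size=(dim_in, dim_out))
    b = rng.uniform(0.0, 2.0 * np.pi, size=dim_out)
    return lambda x: np.sqrt(2.0 / dim_out) * np.cos(x @ W + b)

def mean_embedding(samples, phi):
    """Empirical estimate of mu_X = E_X[phi(X)]: average feature vectors."""
    return phi(samples).mean(axis=0)

rng = np.random.default_rng(0)
phi = rff_feature_map(1, 64, rng)

# Two distinct distributions map to clearly distinct points in feature space.
mu_p = mean_embedding(rng.normal(0.0, 1.0, size=(5000, 1)), phi)
mu_q = mean_embedding(rng.normal(3.0, 1.0, size=(5000, 1)), phi)
print(np.linalg.norm(mu_p - mu_q))
```

With enough features and samples, the distance between the two embeddings approximates the kernel MMD between the two distributions, which is what makes the embedding usable as a stand-in for the density itself.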
We can treat the injective embedding µX as a sufficient statistic of the corresponding probability density. In other words, µX preserves all the information of p(X). Using µX, we can uniquely recover p(X), and any mathematical operation on p(X) will have an equivalent operation on µX. These properties lead to the following equivalence relations. We can compute a functional f : P → IR of the density p(X) using only its embedding,

f(p(x)) = ˜f(µX)    (8)

by defining ˜f : F → IR as the operation on µX equivalent to f. Similarly, we can generalize this property to operators. An operator T : P → IR^d applied to a density can also be equivalently carried out using its embedding,

T ◦ p(x) = ˜T ◦ µX    (9)

where ˜T : F → IR^d is again the corresponding equivalent operator applied to the embedding. In our derivations, we assume that there exists a feature space where the embeddings are injective, and we apply the above equivalence relations in subsequent sections.

Figure 2: Visualization of the CoNN-sLDA architecture for a single document. For the i'th word, the latent topic variable is zi. The embedding for the distribution p(zi) is µzi; these embeddings are shown as three-dimensional vectors for illustration. They are accumulated and passed through a non-linearity to obtain µθ, which is the embedding of p(θ), the distribution over the topics for the document. Thus, the embedding µθ is determined (up to the non-linearity) by the average of the embeddings µzi, as in the original LDA model. Similarly, there is feedback from µθ (which happens for T iterations, see Algorithm 1), so that µθ, in turn, influences µzi, again as in the original LDA model.

3.4 Hilbert space embedding for LDA

We consider Hilbert space embeddings of q(θ) and qi(zi), as well as of equations (5) and (6).
By the definition given in equation (7),

µθ = ∫_θ φ(θ) q(θ) dθ,    µzi = ∫_{zi} φ(zi) qi(zi) dzi    (10)

The variational update equations in (5) and (6) provide us with the key relationships between latent variables in the model. We can replace the specific distributional forms in these equations with operators that maintain the same relationships among distributions represented in the Hilbert space embeddings.

q(θ) = f1(θ, {qi(zi)}),    qi(zi) = f2(zi, wi, q(θ))    (11)

Here, f1 and f2 represent the abstract structure of the model implied by (5) and (6) without specific distributional forms. We will provide a specific instantiation of f1 and f2 shortly. Following the same argument as in equation (8), we can write equation (11) as q(θ) = ˜f1(θ, {µzi}) and, similarly, qi(zi) = ˜f2(zi, wi, µθ). Iterating through all values of θ and zi, and using the operator view given in equation (9) as reference, we get the following equivalent fixed-point equations in the Hilbert space:

µθ = T1 ◦ {µzi},    µzi = T2 ◦ [wi, µθ]    (12)

3.5 Parameterization of Hilbert space embedding using Deep Neural Networks

The operators T1 and T2 have complex non-linear dependencies on the unknown true probability distributions and the feature map φ. Thus, we need to model these operators in such a way that we can utilize the available data to learn the underlying non-linear functions. We will use deep neural networks, which are known for their ability to model non-linear functions.

We start by parameterizing the embeddings. We assume that any point in the Hilbert space is a vector µi ∈ IR^D. Next, as the operators are non-linear function maps, we replace them by deep neural networks.
In its simplest form, we use only a single fully connected layer with 'tanh' activations, yielding the following fixed-point update equations:

µθ = tanh( W1 · ∑_{i=1}^{N} µzi )    (13)

µzi = tanh( W2 · word2vec(wi) + W3 · µθ )    (14)

The original work on Hilbert space embeddings required the embeddings to be injective. We observe that we do not need the embedding to be injective on the domain of all distributions. Instead, we only need it to be injective on the sub-domain of distributions used in the training corpus. The supervised training process on the training set will have to find embeddings that allow the model to distinguish documents that occur in the corpus, automatically causing the learned embeddings to be injective for the training domain.

We keep the dimension of the word2vec [19] embedding identical to the Hilbert space embedding, i.e., wi ∈ IR^D. Note that the above parameterization is one example; multiple fully connected layers can be used to achieve denser models.

Assume the parameters word2vec, W1, W2 and W3 are known. We calculate the set of embeddings for a given text corpus by iterating equations (13) and (14). Algorithm 1 summarizes this procedure. We normalize the embeddings after every iteration to avoid overflow. This is the heart of the Cooperative Neural Network paradigm, in which a set of neural networks co-constrain each other to produce an embedding informed by prior structure. In our experience, we found that 'tanh' works better than 'σ' as a choice of non-linearity. Rectified linear 'ReLU' units will not work, as they zero out negative values of the embeddings. We apply dropout [26] to the µzi's, µθ and word2vec for regularization.
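The coupled updates (13) and (14) can be sketched for a single document in a few lines of numpy; this is an illustrative toy (random weights, no dropout, shapes chosen arbitrarily), not the paper's PyTorch implementation.

```python
import numpy as np

def conn_slda_embed(word_vecs, W1, W2, W3, T=5):
    """Iterate the coupled fixed-point updates (13)-(14) for one document.

    word_vecs: (N, D) array of word2vec vectors for the document's N words.
    Returns the document embedding mu_theta of shape (D,).
    """
    N, D = word_vecs.shape
    mu_theta = np.zeros(D)
    for _ in range(T):
        # Eq. (14): each word embedding is co-constrained by the document embedding.
        mu_z = np.tanh(word_vecs @ W2.T + mu_theta @ W3.T)
        mu_z /= np.linalg.norm(mu_z, axis=1, keepdims=True) + 1e-8  # normalize to avoid overflow
        # Eq. (13): the document embedding summarizes the word embeddings.
        mu_theta = np.tanh(W1 @ mu_z.sum(axis=0))
        mu_theta /= np.linalg.norm(mu_theta) + 1e-8
    return mu_theta

rng = np.random.default_rng(0)
D, N = 8, 12
W1, W2, W3 = (rng.normal(scale=0.1, size=(D, D)) for _ in range(3))
doc = rng.normal(size=(N, D))          # stand-in for word2vec vectors
mu = conn_slda_embed(doc, W1, W2, W3)
print(mu.shape)  # (8,)
```

The per-iteration normalization mirrors the overflow guard described above, and the alternation between the two updates is exactly the "cooperation" between the word-level and document-level networks.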
For every document, the algorithm returns the associated µθ embedding, representing the document in the Hilbert space.

Algorithm 1: Getting Hilbert Space Embeddings
  Input: Parameters {W1, W2, W3}
  Initialize {µθ, µzi} = 0 ∈ IR^D
  for t = 1 to T iterations do
      for i = 1 to N words do
          µzi ← tanh( W2 · word2vec(wi) + W3 · µθ )
          Normalize µzi
      end for
      µθ ← tanh( W1 · ∑_{i=1}^{N} µzi )
      Normalize µθ
  end for
  return {µθ}: Document embeddings

Algorithm 2: Training using Hilbert Space Embeddings
  Input: Document corpus D, in which each document d has a set of words [wd,i], i ∈ Nd
  Initialize P(0) = {W(0), u(0), word2vec(0)} with random values; let the learning rate be r
  for t = 1 to T do
      Sample documents from D as {Ds, ys}
      Using Algorithm 1, get Hilbert embeddings {µθd} for Ds
      ypred = H( µθd ; P(t−1) )
      Update: P(t) = P(t−1) − r · ∇_{P(t−1)} L(ypred, ys)
  end for
  return {P(T)}

In practice, the parameters word2vec, W1, W2 and W3 are not known and need to be learned from training data. This requires formulating an objective function and then optimizing it. An additional advantage of the proposed method is that it allows using a wide variety of objective functions. In our case, we trained the model using a discriminative/supervised criterion that relies on the labels associated with each document, and we used binary cross-entropy loss, or cross-entropy loss for multiclass classification.

Algorithm 2 summarizes the training procedure. It uses Algorithm 1 as a subroutine. The H function is chosen to be a single fully connected layer in our implementation, which transforms the input embedding into a vector whose dimension equals the number of classes.
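To make the gradient step concrete, here is a toy numpy sketch of the update loop, treating the document embeddings as given and learning only a linear head H with binary cross-entropy (the full method also back-propagates into W1, W2, W3 and word2vec; the variable names and data are illustrative assumptions).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_step(mu_theta, y, u, lr=0.1):
    """One gradient step on the head H(mu) = sigmoid(mu @ u)
    with binary cross-entropy loss, as in the parameter update loop.

    mu_theta: (B, D) batch of document embeddings from Algorithm 1.
    y: (B,) binary labels; u: (D,) head weights (a stand-in for H's parameters).
    """
    y_pred = sigmoid(mu_theta @ u)
    loss = -np.mean(y * np.log(y_pred + 1e-12)
                    + (1 - y) * np.log(1 - y_pred + 1e-12))
    grad_u = mu_theta.T @ (y_pred - y) / len(y)   # d loss / d u
    return u - lr * grad_u, loss

rng = np.random.default_rng(1)
B, D = 32, 8
mu = rng.normal(size=(B, D))           # stand-in document embeddings
y = (mu[:, 0] > 0).astype(float)       # toy labels correlated with one coordinate
u = np.zeros(D)
for _ in range(200):
    u, loss = train_step(mu, y, u)
print(round(loss, 3))  # loss decreases as the head fits the batch
```

In the real model the same loss is back-propagated through the unrolled iterations of Algorithm 1, so the embedding networks and the classifier head are trained jointly, end-to-end.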
We sample (without replacement) a batch of documents Ds from the corpus, compute their embeddings and update the parameters. The loss function takes in the µθ embeddings and the corresponding document labels. The resulting model, called 'CoNN-sLDA', is schematically illustrated in Figure 2.

The CoNN-sLDA model retains the overall structure of the LDA model by separating the problem into document topic distributions and word topic distributions within each document. As with traditional LDA, one can visualize a document corpus by projecting topic vectors associated with documents into a 2D plane (e.g., using MDS or t-SNE). An advantage of CoNN-sLDA over typical neural network approaches is that standard DNNs produce only a single embedding, whereas CoNN-sLDA elegantly factors the local and global information into separate parts of the model. An advantage of CoNN-sLDA over traditional probabilistic graphical models is that we can use low-bias, highly expressive distributions implied by the neural network implementations of the update operators.

4 Experiments

4.1 Description of Datasets

We evaluated our model 'CoNN-sLDA' on two real-world datasets. The first dataset is a multi-domain sentiment dataset (MultiSent) [3], consisting of 342,104 Amazon product reviews on 25 different types of products (apparel, books, DVDs, kitchen appliances, ···). For each review, we go through the rating given by the customer (between 1 and 5 stars) and label it as positive if the rating is higher than 3 stars, and negative otherwise. We pose this as a binary classification problem. The average length of reviews is roughly 210 words after preprocessing the data. The ratio of positive to negative reviews is ∼ 8 : 1. We use 5-fold cross validation and report the average area under the ROC curve (AUC), in %.

The second dataset is the 20 Newsgroup dataset³.
It has around 19,000 news articles, divided roughly equally into 20 different categories. We pose this as a multiclass classification problem and report accuracy over 20 classes. The dataset is divided into a training set (11,314 articles) and a test set (7,531 articles), approximately maintaining the relative ratio of articles of different categories. The average length of documents after preprocessing is ∼ 160 words. This task is challenging because some categories are highly similar, making their separation difficult. For example, the categories "PC hardware" and "Mac hardware" have quite a lot in common.

We apply standard text preprocessing steps to both datasets. We convert everything to lowercase characters and remove the standard stopwords defined in the 'Natural Language Toolkit' library. We remove punctuation, followed by lemmatization and stemming to further clean the data. However, for the other classifiers, we use the preprocessing techniques recommended by the respective authors.

4.2 Baselines for comparison

We compare 'CoNN-sLDA' with existing state-of-the-art algorithms for document classification. We compare against VI-sLDA [6, 17], which includes the label of the document in the graphical model formulation and then maximizes the variational lower bound. Different from VI-sLDA, the supervised topic model using DiscLDA [13] reduces the dimensionality of the topic vectors θ for classification by introducing a class-dependent linear transformation.

Boltzmann Machines are traditionally used to model distributions, and with the recent development of deep learning techniques, these approaches have gained momentum. We compare with one such Deep Boltzmann Machine developed for modeling documents, called Over-Replicated Softmax (OverRep-S) [27].
Another popular approach is by [4], called BP-sLDA, which does end-to-end learning of LDA by mirror-descent back-propagation over a deep architecture. We also compare with a recent deep learning model developed by [5], called DUI-sLDA.

4.3 Classification Results

Table 1 shows the accuracy results on the newsgroup dataset together with the standard error of the mean (SEM) over 5 folds. For each of the 5 folds, we split the training data into train and validation sets and optimize all parameters. We then evaluate against a fixed common test set. As the number of classes is 20, we found that higher Hilbert space dimensions work better (see the entries for Dim=40 and Dim=80 in the table). A dropout of ∼ 0.8 was applied to the word2vec embeddings. The batch size was fixed at 100 and we trained for around 400 batches. The performance of CoNN-sLDA is better than BP-sLDA and on par with the 5-layer DUI-sLDA model. The cost-sensitive version, CoNN-sLDA (Imb), which balances the misclassification cost for different classes in the loss function, tends to perform slightly better.

The 20 Newsgroups dataset is one of the earliest and most studied text corpora. It is fairly separable, so most modern state-of-the-art methods do well on it, but it is an important benchmark to establish the credibility of an algorithm.

Our CoNN-sLDA model was able to outperform the recently proposed state-of-the-art method, DUI-sLDA, on the large 'MultiSent' dataset (Table 2) of over 300K documents by a significant AUC margin of 2%. This corresponds to a 23% reduction in error rate. We used a single fully connected layer with tanh non-linearity for both the µθ and µzi embeddings.
The Hilbert space dimension and the word2vec dimension are both 10.

³ http://qwone.com/~jason/20Newsgroups/

Table 1: '20 Newsgroups' classification accuracy on 19K documents. SEM over 5-fold CV. Dim indicates Hilbert space dimension.

Classifier        Accuracy (%)   Details
VI-sLDA           73.8 ± 0.49    K=50
DiscLDA           80.2 ± 0.45    K=50
OverRep-S         69.5 ± 0.36    K=512
BP-sLDA           81.8 ± 0.36    K=50, L=5
DUI-sLDA          83.5 ± 0.22    K=50, L=5
CoNN-sLDA         83.4 ± 0.18    Dim=40
CoNN-sLDA (imb)   83.7 ± 0.13    Dim=80

Table 2: 'MultiSent' AUC on 324K documents. SEM over 5-fold CV. Dim indicates Hilbert space dimension.

Classifier        AUC (%)        Details
VI-sLDA           76.8 ± 0.40    K=50 (topics)
DiscLDA           82.1 ± 0.40    K=50
BP-sLDA           88.9 ± 0.36    K=50, L=5
DUI-sLDA          86.0 ± 0.31    K=50, L=1
DUI-sLDA          91.4 ± 0.27    K=50, L=5
CoNN-sLDA         93.3 ± 0.13    Dim=10
CoNN-sLDA (imb)   93.4 ± 0.13    Dim=20

We use a dropout probability of 0.1. Algorithm 1 was unrolled for 1 iteration. The batch size was set at 100, and we ran 3000 batches with optimization done using the Adam optimizer. We also ran the cost-sensitive CoNN-sLDA (Imb) model, with a balancing ratio of 1.4 towards the minority class incorporated in the loss function, and observe a slight improvement in results. CoNN-sLDA consistently outperformed the other models over various choices of model parameters; see Appendix B.

The number of layers required by other deep models like DUI-sLDA and BP-sLDA for good classification is usually quite high, and their performance decreases considerably with fewer layers. CoNN-sLDA outperforms them with a single-layer neural network.

We have vectorized and efficient implementations of CoNN-sLDA in PyTorch and TensorFlow. The results shown above are from the PyTorch version. We ran our experiments on NVIDIA Tesla P100 GPUs.
The runtime for 1 fold of 'MultiSent' with the settings mentioned above is around 5 minutes, while a single fold of the '20 Newsgroup' dataset runs within 2 minutes.

In Appendix B, we report our experiments to optimize the algorithmic and architectural hyperparameters. We use the 'MultiSent' data for this analysis. In general, for training we recommend starting with a small Hilbert space dimension and batch size, then trying to increase the number of fully connected layers, and finally choosing to unroll the model further.

Figure 3: A t-SNE projection of the 40-dimensional embeddings µθ for test documents in the 20-Newsgroups dataset. The colors represent the category label for each document. The embeddings separate categories very well.

5 Discussions & Future extensions

In addition to supervised classification, we can use LDA-style models for visualizing and interpreting the cluster structure of datasets. For example, in the CoNN-sLDA model, we can use t-SNE [16] to visualize the documents using their µθ values. In Figure 3 we see that CoNN-sLDA clearly maps different newsgroups to homogeneous regions of space that help classification accuracy and provide insight into the structure of the domain. Similarly, Figure 4 shows that CoNN-sLDA maps the positive and negative product reviews into different regions, facilitating classification and interpretation.

Figure 4: t-SNE visualization of a random sample of 10-dimensional µθ embeddings for MultiSent documents (blue positive, red negative). The embeddings project distinct categories to highly coherent regions.

An interesting extension of the CoNN-sLDA model would be to map the Hilbert space topic embedding µθ back to the original topic space distribution.
Mapping back to the original topic space would potentially allow us to attach text labels to the discovered clusters, providing an intuitive interpretation of the model learned by our technique. Appendix C discusses an approach for extracting the words in a document most relevant to a discriminative task.
In this work, we obtain the fixed-point update equations using the mean-field inference technique. In general, this procedure extends to other variational inference techniques: for example, we can find embeddings for Algorithm 1 by minimizing the free energies of loopy belief propagation or its variants (e.g., [29]) and use Algorithm 2 to train them end-to-end.

6 Conclusion

Cooperative neural networks (CoNN) are a new theoretical approach for implementing learning systems that can exploit both prior insights about the independence structure of the problem domain and the universal approximation capability of deep networks. We make the theory concrete with an example, CoNN-sLDA, whose performance is superior to both prior work based on the probabilistic graphical model LDA and generic deep networks. While we demonstrated the method on text classification using the structure of LDA, the approach provides a fully general methodology for computing factored embeddings using a set of highly expressive networks. Cooperative neural networks thus expand the design space of deep learning machines in new and promising ways.

Acknowledgements

We are thankful to our colleagues Srinivas Eswar, Patrick Flick and Rahul Nihalani for their careful reading of our submission.

References

[1] Yoshua Bengio, Eric Laufer, Guillaume Alain, and Jason Yosinski. Deep generative stochastic networks trainable by backprop. In International Conference on Machine Learning, pages 226-234, 2014.

[2] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet allocation.
Journal of Machine Learning Research, 3(Jan):993-1022, 2003.

[3] John Blitzer, Mark Dredze, and Fernando Pereira. Biographies, Bollywood, boom-boxes and blenders: Domain adaptation for sentiment classification. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pages 440-447, 2007.

[4] Jianshu Chen, Ji He, Yelong Shen, Lin Xiao, Xiaodong He, Jianfeng Gao, Xinying Song, and Li Deng. End-to-end learning of LDA by mirror-descent back propagation over a deep architecture. In Advances in Neural Information Processing Systems, pages 1765-1773, 2015.

[5] Jen-Tzung Chien and Chao-Hsi Lee. Deep unfolding for topic models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(2):318-331, 2018.

[6] Wang Chong, David Blei, and Fei-Fei Li. Simultaneous image classification and annotation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1903-1910. IEEE, 2009.

[7] Hanjun Dai, Bo Dai, and Le Song. Discriminative embeddings of latent variable models for structured data. In International Conference on Machine Learning, pages 2702-2711, 2016.

[8] Adji Dieng. TopicRNN: A recurrent neural network with long-range semantic dependency. arXiv preprint arXiv:1611.01702, 2016.

[9] Zhe Gan, Changyou Chen, Ricardo Henao, David Carlson, and Lawrence Carin. Scalable deep Poisson factor analysis for topic modeling. In International Conference on Machine Learning, pages 1823-1832, 2015.

[10] Max Jaderberg, Karen Simonyan, Andrew Zisserman, and Koray Kavukcuoglu. Spatial transformer networks. In NIPS, 2015.

[11] Thorsten Joachims. Text categorization with support vector machines: Learning with many relevant features. In ECML, 1998.

[12] Yoon Kim. Convolutional neural networks for sentence classification. arXiv preprint, 2014.

[13] Simon Lacoste-Julien, Fei Sha, and Michael I. Jordan.
DiscLDA: Discriminative learning for dimensionality reduction and classification. In Advances in Neural Information Processing Systems, pages 897-904, 2009.

[14] Hugo Larochelle and Stanislas Lauly. A neural autoregressive topic model. In Advances in Neural Information Processing Systems, pages 2708-2716, 2012.

[15] Wei Li and Andrew McCallum. Pachinko allocation: DAG-structured mixture models of topic correlations. 2006.

[16] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579-2605, 2008.

[17] Jon D. McAuliffe and David M. Blei. Supervised topic models. In Advances in Neural Information Processing Systems, pages 121-128, 2008.

[18] Yishu Miao, Lei Yu, and Phil Blunsom. Neural variational inference for text processing. In International Conference on Machine Learning, 2016.

[19] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111-3119, 2013.

[20] Andriy Mnih and Karol Gregor. Neural variational inference and learning in belief networks. arXiv preprint arXiv:1402.0030, 2014.

[21] Hojjat S. Mousavi, Tiantong Guo, and Vishal Monga. Deep image super resolution via natural image priors. arXiv preprint, 2018.

[22] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082, 2014.

[23] Sara Sabour, Nicholas Frosst, and Geoffrey E. Hinton. Dynamic routing between capsules. In Advances in Neural Information Processing Systems, pages 3856-3866, 2017.

[24] Alex Smola, Arthur Gretton, Le Song, and Bernhard Schölkopf. A Hilbert space embedding for distributions. In International Conference on Algorithmic Learning Theory, pages 13-31.
Springer, 2007.

[25] Bharath K. Sriperumbudur, Arthur Gretton, Kenji Fukumizu, Gert Lanckriet, and Bernhard Schölkopf. Injective Hilbert space embeddings of probability measures. 2008.

[26] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15:1929-1958, 2014.

[27] Nitish Srivastava, Ruslan R. Salakhutdinov, and Geoffrey E. Hinton. Modeling documents with deep Boltzmann machines. arXiv preprint arXiv:1309.6865, 2013.

[28] Yichuan Tang and Ruslan R. Salakhutdinov. Learning stochastic feedforward neural networks. In Advances in Neural Information Processing Systems, pages 530-538, 2013.

[29] Martin J. Wainwright, Tommi S. Jaakkola, and Alan S. Willsky. Tree-reweighted belief propagation algorithms and approximate ML estimation by pseudo-moment matching. In AISTATS, 2003.

[30] Martin J. Wainwright, Michael I. Jordan, et al. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1-2):1-305, 2008.

[31] Zichao Yang, Zhiting Hu, Ruslan Salakhutdinov, and Taylor Berg-Kirkpatrick. Improved variational autoencoders for text modeling using dilated convolutions. arXiv preprint arXiv:1702.08139, 2017.

[32] Yin Zheng, Yu-Jin Zhang, and Hugo Larochelle. A deep and autoregressive approach for topic modeling of multimodal data. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(6):1056-1069, 2016.

[33] Jun Zhu, Amr Ahmed, and Eric P. Xing. MedLDA: Maximum margin supervised topic models for regression and classification. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 1257-1264.
ACM, 2009.