{"title": "Documents as multiple overlapping windows into grids of counts", "book": "Advances in Neural Information Processing Systems", "page_first": 10, "page_last": 18, "abstract": "In text analysis, documents are represented as disorganized bags of words; models of count features are typically based on mixing a small number of topics \\cite{lda,sam}. Recently, it has been observed that for many text corpora documents evolve into one another in a smooth way, with some features dropping and new ones being introduced. The counting grid \\cite{cgUai} models this spatial metaphor literally: it is a multidimensional grid of word distributions learned in such a way that a document's own distribution of features can be modeled as the sum of the histograms found in a window into the grid. The major drawback of this method is that it is essentially a mixture and all the content must be generated by a single contiguous area on the grid. This may be problematic especially for lower dimensional grids. In this paper, we overcome this issue with the \\emph{Componential Counting Grid} which brings the componential nature of topic models to the basic counting grid. We also introduce a generative kernel based on the document's grid usage and a visualization strategy useful for understanding large text corpora. We evaluate our approach on document classification and multimodal retrieval obtaining state of the art results on standard benchmarks.", "full_text": "Documents as multiple overlapping windows into a grid of counts\n\nAlessandro Perina1\n\nNebojsa Jojic1\n\nManuele Bicego2\n\nAndrzej Turski1\n\n1Microsoft Corporation, Redmond, WA\n\n2University of Verona, Italy\n\nAbstract\n\nIn text analysis, documents are often represented as disorganized bags of words; models of such count features are typically based on mixing a small number of topics [1,2]. 
Recently, it has been observed that for many text corpora documents evolve into one another in a smooth way, with some features dropping and new ones being introduced. The counting grid [3] models this spatial metaphor literally: it is a grid of word distributions learned in such a way that a document\u2019s own distribution of features can be modeled as the sum of the histograms found in a window into the grid. The major drawback of this method is that it is essentially a mixture and all the content must be generated by a single contiguous area on the grid. This may be problematic especially for lower dimensional grids. In this paper, we overcome this issue by introducing the Componential Counting Grid which brings the componential nature of topic models to the basic counting grid. We evaluated our approach on document classification and multimodal retrieval obtaining state of the art results on standard benchmarks.\n\n1 Introduction\nA collection of documents, each consisting of a disorganized bag of words, is often modeled compactly using mixture or admixture models, such as Latent Semantic Analysis (LSA) [4] and Latent Dirichlet Allocation (LDA) [1]. The data is represented by a small number of semantically tight topics, and a document is assumed to have a mix of words from an even smaller subset of these topics. There are no strong constraints in how the topics are mixed [5].\nRecently, an orthogonal approach emerged: it has been observed that for many text corpora documents evolve into one another in a smooth way, with some words dropping and new ones being introduced. The counting grid model (CG) [3] takes this spatial metaphor \u2013 of moving through sources of words and dropping and picking new words \u2013 literally: it is a multidimensional grid of word distributions, learned in such a way that a document\u2019s own distribution of words can be modeled as the sum of the distributions found in some window into the grid. 
By using large windows to collate many grid distributions from a large grid, the CG model can be a very large mixture without overtraining, as these distributions are highly correlated. The LDA model does not have this benefit, and thus has to deal with a smaller number of topics to avoid overtraining.\n\nFigure 1: a) A detail of an $E = 30 \\times 30$ componential counting grid $\\pi_i$ learned over a corpus of recipes. In each cell we show the 0-3 most probable words greater than a threshold. The area shaded in red has $\\pi(\\text{'bake'}) > 0$. b) For 6 recipes, we show how their components are mapped onto this grid. The \u201cmass\u201d of each component (e.g., $\\theta$, see Sec.2) is represented with the window thickness. For each component $c = j$ in position $j$, we show the words generated in each window, $c_z \\cdot \\sum_{j \\in W_i} \\pi_j(z)$.\n\nIn Fig.1a we show an excerpt of a grid learned from cooking recipes from around the world. Each position in the grid is characterized by a distribution over the words in a vocabulary and for each position we show the 3 words with highest probability whenever they exceed a threshold. The shaded positions are characterized by the presence, with a non-zero probability, of the word \u201cbake\u201d (which may or may not be in the top three). On the grid we also show the windows $W$ of size $4 \\times 5$ for 5 recipes. Nomi (1), an Afghan egg-based bread, is close to the recipe of the usual pugliese bread (2), as indeed they share most of the ingredients and procedure and their windows largely overlap. Note how moving from (1) to (2) the word \u201cegg\u201d is dropped. Moving to the right we encounter the basic pizza (3) whose dough is very similar to the bread\u2019s. Continuing to the right, words often associated with desserts like sugar, almond, etc. emerge. It is not surprising that baked desserts such as cookies (4), and pastry in general, are mapped here. 
Finally, further up we encounter other desserts which do not require baking, like tiramisu (5), or chocolate crepes. This is an example of a \u201ctopical shift\u201d; others appear in different portions of the full grid which is included in the additional material.\nThe major drawback of counting grids is that they are essentially a mixture model, assuming only one source for all features in the bag, and the topology of the space highly constrains the document mappings, resulting in local minima or suboptimal grids. For example, more structured recipes like Grecian Chicken Gyros Pizza or Tex-Mex pizza would have very low likelihood, as words related to meat, which is abundant in both, are hard to generate in the baking area where the recipes would naturally go.\nAs a first contribution we extend here the counting grid model so that each document can be represented by multiple latent windows, rather than just one. In this way, we create a substantially more flexible admixture model, the componential counting grid (CCG), which becomes a direct generalization of LDA as it does allow multiple sources (e.g., the windows) for each bag, in a mathematically identical way as LDA. But the equivalent of LDA topics are windows in a counting grid, which allows the model to have a very large number of topics that are highly related, as a shift in the grid only slightly refines any topic.\nStarting from the same grid just described, we recomputed the mapping of each recipe which now can be described by multiple windows, if needed. Fig. 1b shows mappings for some recipes. The words generated in each component are also shown. The three pizzas place most of the mass in the same area (dough), but the words related to the topping are borrowed from different areas. 
Another example is the Caesar salad, which has a component in the salad/vegetable area and borrows the croutons from the bread area.\nBy observing Fig.1b, one can also notice how the embedding produced by CCGs yields a similarity measure based on the grid usage of each sample. For example, words relative to the three pizzas are generated from windows that overlap, therefore they share word usage and thus they are \u201csimilar\u201d. As a second contribution we exploited this fact to define a novel generative kernel, which largely outperformed similar classification strategies based on LDA\u2019s topic usage [1,2].\nWe evaluated componential counting grids, and in particular the kernel, on the 20-Newsgroup dataset [6], on a novel dataset of recipes which we will make available to the community, and on the recent \u201cWikipedia picture of the day\u201d dataset [7]. In all the experiments, CCGs set a new state of the art. Finally, for the first time we explore visualization through examples and videos available in the additional material.\n\n2 Counting Grids and Componential Counting Grids\nThe basic Counting Grid $\\pi_i$ is a set of distributions over the vocabulary on the N-dimensional discrete grid indexed by $i$ where each $i_d \\in [1 \\ldots E_d]$ and $E$ describes the extent of the counting grid in $d$ dimensions. The index $z$ indexes a particular word in the vocabulary $z = [1 \\ldots Z]$, $Z$ being the size of the vocabulary. For example, $\\pi_i(\\text{'Pizza'})$ is the probability of the word \u201cPizza\u201d at the location $i$. 
Since $\\pi$ is a grid of distributions, $\\sum_z \\pi_i(z) = 1$ everywhere on the grid. Each bag of words is represented by a list of words $\\{w^t\\}_{t=1}^{T}$ and each word $w^t_n$ takes a value between 1 and $Z$. In the rest of the paper, we will assume that all the samples have $N$ words.\nCounting Grids assume that each bag follows a word distribution found somewhere in the counting grid; in particular, using windows of dimensions $W$, a bag can be generated by first averaging all counts in the window $W_i$ starting at grid location $i$ and extending in each direction $d$ by $W_d$ grid positions to form the histogram $h_i(z) = \\frac{1}{\\prod_d W_d} \\sum_{j \\in W_i} \\pi_j(z)$, and then generating a set of features in the bag (see Fig.1a where we used a $3 \\times 4$ window). In other words, the position of the window $i$ in the grid is a latent variable given which we can write the probability of the bag as\n\n$$p(\\{w\\}|i) = \\prod_n h_i(w_n) = \\prod_n \\Big( \\frac{1}{\\prod_d W_d} \\cdot \\sum_{j \\in W_i} \\pi_j(w_n) \\Big)$$\n\nRelaxing the terminology, $E$ and $W$ are referred to as, respectively, the counting grid and the window size. The ratio of the two volumes, $\\kappa$, is called the capacity of the model in terms of an equivalent number of topics, as this is how many non-overlapping windows can be fit onto the grid. Finally, with $W_i$ we indicate the particular window placed at location $i$.\n\nFigure 2: a) Plate notation representing the CCG model. b) CCG generative process for one word: pick a window from $\\theta$, pick a position within the window, pick a word. c) Illustration of $U^W$ and $\\Lambda^W_\\theta$ relative to the particular $\\theta$ shown in plate b).\n\nComponential Counting Grids As seen in the previous section, counting grids generate words from a distribution in a window $W$, placed at location $i$ in the grid. Windows close in the grid generate similar features because they share many cells: as we move the window on the grid, some new features appear while others are dropped. 
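As a rough illustration of the window histogram h_i(z) and of the capacity defined above, the following Python sketch averages the cell distributions inside every window using cumulative sums, the technique the text later attributes to [3]. The function names, the (E1, E2, Z) array layout and the wrap-around treatment of windows at the grid border are assumptions made for this example; the text here does not state how borders are handled.

```python
import numpy as np


def window_histograms(pi, window):
    # pi: array of shape (E1, E2, Z); pi[i1, i2] is a distribution over Z words.
    # window: (W1, W2), assumed no larger than the grid.
    # Returns h with the same shape as pi, where h[i1, i2] is the average of the
    # cells in the window whose upper-left corner is (i1, i2).
    # Assumption: windows wrap around the grid borders (toroidal grid).
    w1, w2 = window
    e1, e2, n_words = pi.shape
    # Pad by wrapping so that every window is a contiguous block.
    padded = np.concatenate([pi, pi[: w1 - 1]], axis=0)
    padded = np.concatenate([padded, padded[:, : w2 - 1]], axis=1)
    # 2-D cumulative sums with a leading row and column of zeros, so that
    # c[a, b] holds the sum of padded[:a, :b].
    c = np.zeros((padded.shape[0] + 1, padded.shape[1] + 1, n_words))
    c[1:, 1:] = padded.cumsum(axis=0).cumsum(axis=1)
    # Inclusion-exclusion gives every window sum in one vectorised step.
    total = (c[w1:w1 + e1, w2:w2 + e2] - c[:e1, w2:w2 + e2]
             - c[w1:w1 + e1, :e2] + c[:e1, :e2])
    return total / (w1 * w2)


def capacity(grid, window):
    # Ratio of the grid volume to the window volume (kappa in the text).
    return float(np.prod(grid)) / float(np.prod(window))
```

Under these assumptions, a 40 x 40 grid with 4 x 4 windows has capacity 1600 / 16 = 100, which is the figure the text treats as an equivalent number of topics.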
On the other hand, componential models, like [1], represent the standard way of modeling text corpora. In these models each feature can be generated by a different source or topic, and documents are then seen as admixtures of topics.\nComponential counting grids get the best of both worlds: being based on the counting grid geometry they capture smooth shifts of topics, while their componential nature allows documents to be generated by several windows (akin to LDA\u2019s topics). The number of windows need not be specified a priori.\nComponential Counting Grids assume the following generative process (also illustrated by Fig.2b) for each document in a corpus:\n\n1. Sample the multinomial over the locations $\\theta \\sim Dir(\\alpha)$\n2. For each of the $N$ words $w_n$\n\na) Choose a location $l_n \\sim Multinomial(\\theta)$ for a window of size $W$\nb) Choose a location $k_n$ within the window $W_{l_n}$\nc) Choose a word $w_n$ from $\\pi_{k_n}$\n\nAs visible, each word $w_n$ is generated from a different window, placed at location $l_n$, but the choice of the window follows the same prior distribution $\\theta$ for all words. It is worth noticing that when $W = 1 \\times 1$, $l_n = k_n$ and the model becomes Latent Dirichlet Allocation.\nThe Bayesian network is shown in Fig.2a and it defines the following joint probability distribution\n\n$$P = \\prod_{t,n} \\sum_{l_n} \\sum_{k_n} p(w_n|k_n,\\pi) \\cdot p(k_n|l_n) \\cdot p(l_n|\\theta) \\cdot p(\\theta|\\alpha) \\qquad (1)$$\n\nwhere $p(w_n = z|k_n = i,\\pi) = \\pi_i(z)$ is a multinomial over the vocabulary, $p(k_n = i|l_n = k) = U^W(i - k)$ is a distribution over the grid locations, with $U^W$ uniform and equal to $\\frac{1}{|W|}$ in the upper left window of size $W$ and 0 elsewhere (See Fig.2c). 
Finally, $p(l_n|\\theta) = \\theta(l)$ is the prior distribution over the window locations, and $p(\\theta|\\alpha) = Dir(\\theta; \\alpha)$ is a Dirichlet distribution of parameters $\\alpha$.\n\nSince the posterior distribution $p(k, l, \\theta|w, \\pi, \\alpha)$ is intractable for exact inference, we learned the model using variational inference [8].\nWe first introduced the posterior distributions $q$, approximating the true posterior as $q^t(k, l, \\theta) = q^t(\\theta) \\cdot \\prod_n q^t(k_n) \\cdot q^t(l_n)$, with $q(k_n)$ and $q(l_n)$ multinomials over the locations, and $q(\\theta)$ a Dirac function centered at the optimal value $\\hat{\\theta}$.\nThen by bounding (variationally) the non-constant part of $\\log P$, we can write the negative free energy $F$, and use the iterative variational EM algorithm to optimize it.\n\n$$\\log P \\geq -F = \\sum_t \\Big( \\sum_n \\sum_{l_n,k_n} q^t(k_n) \\cdot q^t(l_n) \\cdot \\log \\big[ \\pi_{k_n}(w_n) \\cdot U^W(k_n - l_n) \\cdot \\theta_{l_n} \\cdot p(\\theta|\\alpha) \\big] - H(q^t) \\Big) \\qquad (2)$$\n\nwhere $H(q)$ is the entropy of the distribution $q$.\nMinimizing the free energy in Eq. 2 yields the following update rules:\n\n$$q^t(k_n = i) \\propto \\pi_i(w_n) \\cdot \\exp\\Big( \\sum_{l_n=j} q^t(l_n = j) \\cdot \\log U^W(i - j) \\Big) \\qquad (3)$$\n$$q^t(l_n = i) \\propto \\theta^t(i) \\cdot \\exp\\Big( \\sum_{k_n=j} q^t(k_n = j) \\cdot \\log U^W(j - i) \\Big) \\qquad (4)$$\n$$\\theta^t(i) \\propto \\alpha_i - 1 + \\sum_n q^t(l_n = i) \\qquad (5)$$\n$$\\pi_i(z) \\propto \\sum_t \\sum_n q^t(k_n = i) [w_n = z] \\qquad (6)$$\n\nwhere $[w_n = z]$ is an indicator function, equal to 1 when $w_n$ is equal to $z$. Finally, the parameters $\\alpha$ of the Dirichlet prior can be either kept fixed [9] or learned using standard techniques [10].\nThe minimization procedure described by Eqs.3-6 can be carried out efficiently in $O(N \\log N)$ time using FFTs [11].\n\nSome simple mathematical manipulations of Eq.1 can yield a speed-up. 
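To make the parameter updates of Eqs. 5 and 6 above more concrete, here is a minimal NumPy sketch that applies them to posteriors that have already been computed. It assumes the grid is flattened to L locations, that alpha_i >= 1 so the unnormalised theta is non-negative, and it adds a tiny constant eps (not part of the paper) so that cells receiving no mass do not cause a division by zero. All function and argument names are hypothetical.

```python
import numpy as np


def update_theta(q_l, alpha):
    # Eq. 5 for one document. q_l has shape (N, L): row n holds q(l_n = i)
    # over the L flattened grid locations; alpha has shape (L,).
    # Assumption: alpha_i >= 1, so every unnormalised entry is non-negative.
    unnorm = alpha - 1.0 + q_l.sum(axis=0)
    return unnorm / unnorm.sum()


def update_pi(q_k_docs, words_docs, vocab_size, eps=1e-12):
    # Eq. 6 over a corpus. q_k_docs[t] has shape (N_t, L) and holds q(k_n = i);
    # words_docs[t] has shape (N_t,) and holds integer word ids in [0, Z).
    n_loc = q_k_docs[0].shape[1]
    counts = np.zeros((vocab_size, n_loc))
    for q_k, words in zip(q_k_docs, words_docs):
        # Row w_n of counts receives q(k_n = .) for every word n; np.add.at
        # accumulates correctly when the same word id occurs several times.
        np.add.at(counts, words, q_k)
    # eps is an assumption of this sketch, guarding against empty cells.
    pi = counts.T + eps
    return pi / pi.sum(axis=1, keepdims=True)
```

The returned pi has shape (L, Z), one word distribution per grid location, and theta is a distribution over the same L locations.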
In fact, from Eq.1 one can marginalize the variable $l_n$\n\n$$\\begin{aligned} P &= \\prod_{t,n} \\sum_{l_n=i,\\,k_n=j} p(w_n|k_n = j) \\cdot p(k_n = j|l_n = i) \\cdot p(l_n = i|\\theta) \\cdot p(\\theta|\\alpha) \\\\ &= \\prod_{t,n} \\sum_{l_n=i,\\,k_n=j} \\pi_j(w_n) \\cdot U^W(j - i) \\cdot \\theta(i) \\cdot p(\\theta(i)|\\alpha_i) \\\\ &= \\prod_{t,n} \\sum_{k_n=j} \\pi_j(w_n) \\cdot \\Big( \\sum_{l_n=i} U^W(j - i) \\cdot \\theta(i) \\Big) \\cdot p(\\theta(i)|\\alpha_i) = \\prod_{t,n} \\sum_{k_n=j} \\pi_j(w_n) \\cdot \\Lambda^W_{\\theta^t}(j) \\cdot p(\\theta(i)|\\alpha_i) \\end{aligned} \\qquad (7)$$\n\nwhere $\\Lambda^W_\\theta$ is a distribution over the grid locations, equal to the convolution of $U^W$ with $\\theta$. The update for $q(k)$ becomes\n\n$$q^t(k_n = i) \\propto \\pi_i(w_n) \\cdot \\Lambda^W_\\theta(i) \\qquad (8)$$\n\nIn the same way, we can marginalize the variable $k_n$\n\n$$P = \\prod_{t,n} \\sum_{l_n=i} \\theta(i) \\cdot \\Big( \\sum_{k_n=j} U^W(j - i) \\cdot \\pi_j(w_n) \\Big) \\cdot p(\\theta(i)|\\alpha_i) = \\prod_{t,n} \\sum_{l_n=i} \\theta(i) \\cdot h_i(w_n) \\cdot p(\\theta(i)|\\alpha_i) \\qquad (9)$$\n\nto obtain the new update for $q^t(l_n)$\n\n$$q^t(l_n = i) \\propto h_i(w_n) \\cdot \\theta^t(i) \\qquad (10)$$\n\nwhere $h_i$ is the feature distribution in a window centered at location $i$, which can be efficiently computed in linear time using cumulative sums [3]. Eq.10 highlights further relationships between CCGs and LDA: CCGs can be thought of as an LDA model whose topics live on the space defined by the counting grid\u2019s geometry. The new updates for the cell distribution $q(k)$ and the window distribution $q(l)$ require only a single convolution and, more importantly, they don\u2019t directly depend on each other. The model becomes more efficient and converges faster. This is critical especially when analyzing big text corpora.\nThe most similar generative model to CCG comes from the statistics community. Dunson et al. [12] worked on sources positioned in a plane at real-valued locations, with the idea that sources within a radius would be combined to produce topics in an LDA-like model. 
They used an expensive sampling algorithm that aimed at moving the sources in the plane and determining the circular window size. The grid placement of sources of CCG yields much more efficient algorithms and denser packing.\n\n2.1 A Kernel based on CCG embedding\nHybrid generative discriminative classification paradigms have been shown to be a practical and effective way to get the best of both worlds in approaching classification [13\u201315]. In the context of topic models a simple but effective kernel is defined as the product of the topic proportions of each document. This kernel measures the similarity between the topic usage of each sample and it proved to be effective on several tasks [15\u201317]. Although CCG\u2019s $\\theta$s, the location proportions, can be thought of as LDA\u2019s topic proportions, we propose another kernel, which exploits exactly the same geometric reasoning of the underlying generative model. We observe in fact that by construction, each point in the grid depends on its neighborhood, defined by $W$, and this information is not captured using $\\theta$, but using $\\Lambda^W_\\theta$, which is defined by spreading $\\theta$ in the appropriate window (Eq.7).\nMore formally, given two samples $t$ and $u$, we define a kernel based on CCG embedding as\n\n$$K(t, u) = \\sum_i S\\big(\\Lambda^W_{\\theta^t}(i), \\Lambda^W_{\\theta^u}(i)\\big) \\quad \\text{where} \\quad \\Lambda^W_\\theta(i) = \\sum_j U^W(i - j) \\cdot \\theta(j) \\qquad (11)$$\n\nwhere $S(\\cdot,\\cdot)$ is any similarity measure which defines a kernel.\nIn our experiments we considered the simple product, even if other measures, such as histogram intersection, can be used. The final kernel turns out to be ($\\times$ is the dot-product)\n\n$$K_{LN}(t, u) = \\sum_i \\Lambda^W_{\\theta^t}(i) \\cdot \\Lambda^W_{\\theta^u}(i) = \\mathrm{Tr}\\, \\Lambda^W_{\\theta^t} \\times \\Lambda^W_{\\theta^u} \\qquad (12)$$\n\n3 Experiments\nAlthough our model is fairly simple, it still has multiple aspects that can be evaluated. 
As a generative model, it can be evaluated in left-out likelihood tests. Its latent structure, as in other generative models, can be evaluated as input to classification algorithms. Finally, as both its parameters and the latent variables live in a compact space of dimensionality and size chosen by the user, our learning algorithm can be evaluated as an embedding method that lends itself to data visualization applications. As the latter two have been by far the more important sets of metrics when it comes to real-world applications, our experiments focus on them.\nIn all the tests we considered square grids of size $E = [40 \\times 40, 50 \\times 50, \\ldots, 90 \\times 90]$ and windows of size $W = [2 \\times 2, 4 \\times 4, \\ldots, 8 \\times 8]$. We occasionally compare against a variety of other methods, with slightly different evaluation methods described in the individual subsections when appropriate.\n\n[Figure 3 plots. a) \u201cSame\u201d-20 NewsGroup Results; b) Mastercook Recipes Results; axes: Classification Accuracy against Capacity $\\kappa$ / No. Topics; legend: Componential Counting Grid ($\\theta$), Componential Counting Grid ($\\Lambda$), LDA ($\\theta$), Counting Grid ($q(l)$). c) Wikipedia Picture of the Day Results; axes: Error rate against Percentage; legend: Correspondence LDA, LDA + Discr. Classifier, Multimodal Random Field model, Componential Counting Grid.]\n\nFigure 3: a-b) Results for the text classification tasks. The Mastercook recipes dataset is available on www.alessandroperina.com. We represented the grid size $E$ using gray levels (see the text). 
c) Wikipedia Picture of the day result: average Error rate as a function of the percentage of the ranked list considered for retrieval. Curves closer to the axes represent better performance.\n\nDocument Classification We compared componential counting grids (CCGs) with counting grids [3] (CGs), latent Dirichlet allocation [1] (LDA) and the spherical admixture model [2] (SAM), following the validation paradigm previously used in [2, 3].\nEach data sample consists of a bag of words and a label. The bags were used without labels to train a model that captures covariation in word occurrences, with CGs mostly modeling thematic shifts, LDA and SAM modeling topic mixing, and CCGs both aspects. Then, the label prediction task is performed in a 10-fold cross-evaluation setting, using the linear kernel presented in Eq.12, which for LDA reduces to using a linear kernel on the topic proportions. To show the effectiveness of the spreading in the kernel definition, we also report results obtained by employing CCG\u2019s $\\theta$s instead of $\\Lambda^W_\\theta$. For CGs we used the original strategy [3], Nearest Neighbor in the embedding space, while for SAM we reported the results from the original paper. To the best of our knowledge the strategies just described, based on [3] and [2], are two of the most effective methods to classify text documents. SAM is characterized by the same hierarchical nature as LDA, but it represents bags using directional distributions on a spherical manifold, modeling feature frequency, presence and absence. The model captures fine-grained semantic structure and performs better when small semantic distinctions are important. CCGs map documents on a probabilistic simplex (e.g., $\\theta$) and for $W > [1 \\times 1]$ can be thought of as an LDA model whose topics, $h_i$, are much finer as they are computed from overlapping windows (see also Eq.10); a comparison is therefore natural.\nAs a first dataset we considered the CMU newsgroup dataset (footnote 2). 
Following previous work [2, 3, 6] we reduced the dataset into subsets with varying similarities among the news groups: news-20-different, with posts from rec.sport.baseball, sci.space and alt.atheism; news-20-similar, with posts from rec.talk.baseball, talk.politics.gun and talk.politics.misc; and news-20-same, with posts from comp.os.ms-windows, comp.windows.x and comp.graphics. For the news-20-same subset (the hardest), in Fig.3a we show the accuracies of CCGs and LDA across the complexities. On the x-axis we have the different model sizes, in terms of capacity $\\kappa$, whereas on the y-axis we report the accuracy. The same $\\kappa$ can be obtained with different choices of $E$ and $W$, therefore we represented the grid size $E$ using gray levels: the lighter the marker, the bigger the grid. The capacity $\\kappa$ is roughly equivalent to the number of LDA topics as it represents the number of independent windows that can be fit in the grid, and we compared CCGs with LDA using this parallelism [18].\nComponential counting grids outperform Latent Dirichlet Allocation across the whole spectrum and the accuracy rises regularly with $\\kappa$ independently of the grid size (footnote 3). The priors helped to prevent overtraining for big capacities $\\kappa$. When using CCG\u2019s $\\theta$s to define the kernel, as expected the accuracy dropped (blue dots in Fig.3). Results for all the datasets and for a variety of methods are reported in Tab.1, where we employed 10% of the training data as a validation set to pick a complexity (a different complexity has been chosen for each fold). As visible, CCG outperforms other models, with a larger margin on the more challenging same and similar datasets, where we would indeed expect that quilting the topics to capture fine-grained similarities and differences would be most helpful.\n\nFootnote 2: http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/news20.html\nFootnote 3: This happens for \u201creasonable\u201d window sizes. For small windows (e.g., $2 \\times 2$), the model doesn\u2019t have enough overlapping power and performs similarly to a mixture model.\n\nTable 1: Document classification. The improvements on Similar and Same are statistically significant. The accuracies for SAM are taken from [2] and they represent the best results obtained across the choice of number of topics. BOW stands for classification with a linear SVM on the counts matrices.\n\nDataset | CCG | 2D CG [3] | 3D CG [3] | LDA [1] | BOW | SAM* [2]\nDifferent | 96.49% | 96.51% | 96.34% | 91.8% | 91.43% | 94.1%\nSimilar | 92.81% | 89.72% | 90.11% | 85.7% | 81.52% | 88.1%\nSame | 83.63% | 81.28% | 81.03% | 75.6% | 71.81% | 78.1%\n\nAs a second dataset, we downloaded 10K Mastercook recipes, which are freely available on the web in plain text format. Then we extracted the words of each recipe from its ingredients and cooking instructions, and we used the origin of the recipe to divide the dataset into 15 classes (footnote 4). The resulting dataset has a vocabulary size of 12538 unique words and a total of $\\sim$1M tokens. To classify the recipes we used 10-fold cross-evaluation with 5 repetitions, picking 80 random recipes per class for each repetition. Classification results are illustrated in Fig.3b. As for the previous test, CCG classification accuracies grow regularly with $\\kappa$ independently of the grid size $E$. Componential models (e.g., LDA and CCGs) performed significantly better, since to correctly classify the origin of a recipe, spice palettes, cooking style and procedures must be identified. For example, while most Asian cuisines use similar ingredients and cooking procedures, they definitely have different spice palettes. Counting Grids, being mixtures, cannot capture that, as they map a recipe to a single location which heavily depends on the ingredients used. Among componential models, CCGs work the best.\n\nFootnote 4: We considered the following cuisines: Afghan, Cajun, Chinese, English, French, German, Greek, Indian, Indonesian, Italian, Japanese, Mexican, Middle Eastern, Spanish and Thai.\n\nMultimodal Retrieval We considered the Wikipedia Picture of the Day dataset [7], where the task is multi-modal image retrieval: given a text query, we aim to find images that are most relevant to it. To accomplish this, we first learned a model using the visual words of the training data $\\{w^{t,V}\\}$, obtaining $\\theta^t$, $\\pi^V_i$. Then, keeping $\\theta^t$ fixed and iterating the M-step, we embedded the textual words $\\{w^{t,T}\\}$ obtaining $\\pi^W_i$. For each test sample we inferred the values of $\\theta^{t,V}$ and $\\theta^{t,W}$ respectively from $\\pi^V_i$ and $\\pi^W_i$ and we used Eq.12 to compute the retrieval scores. As in [7] we split the data in 10 folds and we used a validation set to pick a complexity. Results are illustrated in Fig.3c. Although we used this simple procedure without directly training a multimodal model, CCGs outperform LDA, CorrLDA [19] and the multimodal document random field model presented in [7], and set a new state of the art. The area under the curve (AUC) for our method is 21.92$\\pm$0.6, while for [7] it is 23.14$\\pm$1.49 (smaller values indicate better performance). Counting Grids and LDA both fail with AUCs around 40.\n\nVisualization Important benefits of CCGs are that 1) they lay down sources $\\pi_i$ on a 2-dimensional grid, which are ready for visualization, and 2) they enforce that close locations generate similar topics, which leads to smooth thematic shifts that provide connections among distant topics on the grid. This is very useful for sensemaking [20]. To demonstrate this we developed a simple interface. A detail is shown in Fig.4b, relative to the extract of the counting grid shown in Fig.4a. The interface is pannable and zoomable and, at any moment, only the top $N = 500$ words are shown on the screen. To define the importance of each word in each position we weighted $\\pi_i(z)$ with the inverse document frequency. Fig.4b shows the lowest level of zoom: only words from a few cells are visible and the font size reflects their weight. A user can zoom in to see the content of particular cells/areas, until reaching the highest level of zoom, when most of the words generated in a position are visible (Fig.4c).\n\nFigure 4: A simple interface built upon the word embedding $\\pi$.\n\n
Figure 5: Search result for the word \u201cfry\u201d.\n\nWe also propose a simple search strategy: once a keyword $\\hat{z}$ is selected, each word $z$ in each position $j$ is weighted with a word-dependent and a position-dependent weight. The first is equal to 1 if $z$ co-occurs with $\\hat{z}$ in some document, and 0 otherwise, while the latter is the sum of $\\pi_i(\\hat{z})$ over all the $j$s such that there exists a window $W_k$ that contains both $i$ and $j$. Other strategies are of course possible. As a result, this strategy highlights some areas and words related to $\\hat{z}$ on the grid, and in each area words related (similar topic) to $\\hat{z}$ appear. Interestingly, if a search term is used in different contexts, a few islands may appear on the grid. For example, Fig.5 shows the result of the search for $\\hat{z}$ = \u201cfry\u201d: the general frying is well separated from \u201cdeep frying\u201d and \u201cstir frying\u201d, which appear at the extremes of the same island. Presenting search results as islands on a 2-dimensional grid apparently improves on the standard strategy, a linear list of hits, in which recipes relative to the three frying styles would have been mixed, while tempura has little to do with pan fried noodles.\n\n4 Conclusion\nIn this paper we presented the componential counting grid model \u2013 which bridges the topic model and counting grid worlds \u2013 together with a similarity measure based on it. We demonstrated that the hidden mapping variables associated with each document can naturally be used in classification tasks, leading to state of the art performance on a couple of datasets.\nBy means of proposing a simple interface, we have also shown the great potential of CCGs to visualize a corpus. Although the same holds for CGs, this is the first paper that investigates this aspect. Moreover, CCGs subsume CGs as the components are used only when needed. For every restart, the grids qualitatively always appeared very similar, and some of the more salient similarity relationships were captured by all the runs. The word embedding produced by CCG also has advantages w.r.t. other Euclidean embedding methods such as ISOMAP [21], CODE [22] or LLE [23], which are often used for data visualization. In fact, CCG\u2019s computational complexity is linear in the dataset size, as opposed to the quadratic complexity of [21\u201323], which are all based 
on pairwise distances. Moreover, [21,23] only embed documents or words, while CG/CCGs provide both embeddings. Finally, as opposed to previous co-occurrence embedding methods that consider all pairs of words, our representation naturally captures the same word appearing in multiple locations where it has a different meaning based on context. The word \u201cmemory\u201d in the Science magazine corpus is a striking example (memory in neuroscience, memory in electronic devices, immunologic memory).\n\nReferences\n[1] Blei, D., Ng, A., Jordan, M.: Latent Dirichlet allocation. Journal of Machine Learning Research 3 (2003) 993\u20131022\n[2] Reisinger, J., Waters, A., Silverthorn, B., Mooney, R.J.: Spherical topic models. In: ICML \u201910: Proceedings of the 27th international conference on Machine learning. (2010)\n[3] Jojic, N., Perina, A.: Multidimensional counting grids: Inferring word order from disordered bags of words. In: Proceedings of conference on Uncertainty in artificial intelligence (UAI). (2011) 547\u2013556\n[4] Hofmann, T.: Unsupervised learning by probabilistic latent semantic analysis. Machine Learning Journal 42 (2001) 177\u2013196\n[5] Blei, D.M., Lafferty, J.D.: Correlated topic models. In: NIPS. (2005)\n[6] Banerjee, A., Basu, S.: Topic models over text streams: a study of batch and online unsupervised learning. In: Proc. 7th SIAM Intl. Conf. on Data Mining. (2007)\n[7] Jia, Y., Salzmann, M., Darrell, T.: Learning cross-modality similarity for multinomial data. In: Proceedings of the 2011 International Conference on Computer Vision. ICCV \u201911, Washington, DC, USA, IEEE Computer Society (2011) 2407\u20132414\n[8] Neal, R.M., Hinton, G.E.: A view of the EM algorithm that justifies incremental, sparse, and other variants. Learning in graphical models (1999) 355\u2013368\n[9] Asuncion, A., Welling, M., Smyth, P., Teh, Y.W.: On smoothing and inference for topic models. In: Proceedings of Uncertainty in Artificial Intelligence. 
(2009)\n[10] Minka, T.P.: Estimating a Dirichlet distribution. Technical report, Microsoft Research (2012)\n[11] Frey, B.J., Jojic, N.: Transformation-invariant clustering using the EM algorithm. IEEE Trans. Pattern Anal. Mach. Intell. 25 (2003) 1\u201317\n[12] Dunson, D.B., Park, J.H.: Kernel stick-breaking processes. Biometrika 95 (2008) 307\u2013323\n[13] Perina, A., Cristani, M., Castellani, U., Murino, V., Jojic, N.: Free energy score spaces: Using generative information in discriminative classifiers. IEEE Trans. Pattern Anal. Mach. Intell. 34 (2012) 1249\u20131262\n[14] Raina, R., Shen, Y., Ng, A.Y., McCallum, A.: Classification with hybrid generative/discriminative models. In: Advances in Neural Information Processing Systems 16, MIT Press (2003)\n[15] Jebara, T., Kondor, R., Howard, A.: Probability product kernels. J. Mach. Learn. Res. 5 (2004) 819\u2013844\n[16] Bosch, A., Zisserman, A., Mu\u00f1oz, X.: Scene classification using a hybrid generative/discriminative approach. IEEE Trans. Pattern Anal. Mach. Intell. 30 (2008) 712\u2013727\n[17] Bicego, M., Lovato, P., Perina, A., Fasoli, M., Delledonne, M., Pezzotti, M., Polverari, A., Murino, V.: Investigating topic models\u2019 capabilities in expression microarray data classification. IEEE/ACM Trans. Comput. Biology Bioinform. 9 (2012) 1831\u20131836\n[18] Perina, A., Jojic, N.: Image analysis by counting on a grid. In: Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR). (2011) 1985\u20131992\n[19] Blei, D.M., Jordan, M.I.: Modeling annotated data. In: Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval. SIGIR \u201903 (2003) 127\u2013134\n[20] Thomas, J., Cook, K.: Illuminating the Path: The Research and Development Agenda for Visual Analytics. 
IEEE Press (2005)\n[21] Tenenbaum, J.B., de Silva, V., Langford, J.C.: A Global Geometric Framework for Nonlinear Dimensionality Reduction. Science 290 (2000) 2319\u20132323\n[22] Globerson, A., Chechik, G., Pereira, F., Tishby, N.: Euclidean embedding of co-occurrence data. Journal of Machine Learning Research 8 (2007) 2265\u20132295\n[23] Roweis, S.T., Saul, L.K.: Nonlinear dimensionality reduction by locally linear embedding. Science 290 (2000) 2323\u20132326\n", "award": [], "sourceid": 20, "authors": [{"given_name": "Alessandro", "family_name": "Perina", "institution": "Microsoft Research"}, {"given_name": "Nebojsa", "family_name": "Jojic", "institution": "Microsoft Research"}, {"given_name": "Manuele", "family_name": "Bicego", "institution": "University of Verona"}, {"given_name": "Andrzej", "family_name": "Turski", "institution": "Microsoft Research"}]}