{"title": "Distilled Wasserstein Learning for Word Embedding and Topic Modeling", "book": "Advances in Neural Information Processing Systems", "page_first": 1716, "page_last": 1725, "abstract": "We propose a novel Wasserstein method with a distillation mechanism, yielding joint learning of word embeddings and topics. \nThe proposed method is based on the fact that the Euclidean distance between word embeddings may be employed as the underlying distance in the Wasserstein topic model. \nThe word distributions of topics, their optimal transport to the word distributions of documents, and the embeddings of words are learned in a unified framework. \nWhen learning the topic model, we leverage a distilled ground-distance matrix to update the topic distributions and smoothly calculate the corresponding optimal transports. \nSuch a strategy provides the updating of word embeddings with robust guidance, improving algorithm convergence. \nAs an application, we focus on patient admission records, in which the proposed method embeds the codes of diseases and procedures and learns the topics of admissions, obtaining superior performance on clinically-meaningful disease network construction, mortality prediction as a function of admission codes, and procedure recommendation.", "full_text": "Distilled Wasserstein Learning for\n\nWord Embedding and Topic Modeling\n\nHongteng Xu1,2 Wenlin Wang2 Wei Liu3\n\nLawrence Carin2\n\n2Duke University\n\n3Tencent AI Lab\n\n1In\ufb01nia ML, Inc.\n\nhongteng.xu@infiniaml.com\n\nAbstract\n\nWe propose a novel Wasserstein method with a distillation mechanism, yielding\njoint learning of word embeddings and topics. The proposed method is based on\nthe fact that the Euclidean distance between word embeddings may be employed\nas the underlying distance in the Wasserstein topic model. 
The word distributions\nof topics, their optimal transports to the word distributions of documents, and\nthe embeddings of words are learned in a uni\ufb01ed framework. When learning\nthe topic model, we leverage a distilled underlying distance matrix to update the\ntopic distributions and smoothly calculate the corresponding optimal transports.\nSuch a strategy provides the updating of word embeddings with robust guidance,\nimproving the algorithmic convergence. As an application, we focus on patient\nadmission records, in which the proposed method embeds the codes of diseases\nand procedures and learns the topics of admissions, obtaining superior performance\non clinically-meaningful disease network construction, mortality prediction as a\nfunction of admission codes, and procedure recommendation.\n\n1\n\nIntroduction\n\nWord embedding and topic modeling play important roles in natural language processing (NLP), as\nwell as other applications with textual and sequential data. Many modern embedding methods [30,\n33, 28] assume that words can be represented and predicted by contextual (surrounding) words.\nAccordingly, the word embeddings are learned to inherit those relationships. Topic modeling\nmethods [8], in contrast, typically represent documents by the distribution of words, or other \u201cbag-\nof-words\u201d techniques [17, 24], ignoring the order and semantic relationships among words. The\ndistinction between how the word order is (or is not) accounted for when learning topics and word\nembeddings manifests a potential methodological gap or mismatch.\nThis gap is important when considering clinical-admission analysis, the motivating application of\nthis paper. Patient admissions in hospitals are recorded by the code of international classi\ufb01cation\nof diseases (ICD). 
For each admission, one may observe a sequence of ICD codes corresponding\nto certain kinds of diseases and procedures, and each code is treated as a \u201cword.\u201d To reveal the\ncharacteristics of the admissions and relationships between different diseases/procedures, we seek to\nmodel the \u201ctopics\u201d of admissions and also learn an embedding for each ICD code. However, while\nwe want embeddings of similar diseases/procedures to be nearby in the embedding space, learning\nthe embedding vectors based on surrounding ICD codes for a given patient admission is less relevant,\nas there is often a diversity in the observed codes for a given admission, and the code order may\nhold less meaning. Take the MIMIC-III dataset [25] as an example. The ICD codes in each patient\u2019s\nadmission are ranked according to a manually-de\ufb01ned priority, and the adjacent codes are often not\nclinically-correlated with each other. Therefore, we desire a model that jointly learns topics and word\nembeddings, and that for both does not consider the word (ICD code) order. Interestingly, even in the\ncontext of traditional NLP tasks, it has been recognized recently that effective word embeddings may\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\fFigure 1: Consider two admissions with mild and severe diabetes, which are represented by two distributions\nof diseases (associated with ICD codes) in red and orange, respectively. They are two dots in the Wasserstein\nambient space, corresponding to two weighted barycenters of Wasserstein topics (the color stars). The optimal\ntransport matrix between these two admissions is built on the distance between disease embeddings in the\nEuclidean latent space. 
The large value in the matrix (the dark blue elements) indicates that it is easy to transfer\ndiabetes to its complication like nephropathy, whose embedding is a short distance away (short blue arrows).\n\nbe learned without considering word order [37], although that work didn\u2019t consider topic modeling or\nour motivating application.\nAlthough some works have applied word embeddings to represent ICD codes and related clinical\ndata [11, 22], they ignore the fact that the clinical relationships among the diseases/procedures in an\nadmission may not be approximated well by their neighboring relationships in the sequential record.\nMost existing works either treat word embeddings as auxiliary features for learning topic models [15]\nor use topics as the labels for supervised embedding [28]. Prior attempts at learning topics and word\nembeddings jointly [38] have fallen short from the perspective of these two empirical strategies.\nWe seek to \ufb01ll the aforementioned gap, while applying the proposed methodology to clinical-\nadmission analysis. As shown in Fig. 1, the proposed method is based on a Wasserstein-distance\nmodel, in which (i) the Euclidean distance between ICD code embeddings works as the underlying\ndistance (also referred to as the cost) of the Wasserstein distance between the distributions of the\ncodes corresponding to different admissions [26]; (ii) the topics are \u201cvertices\u201d of a geometry in\nthe Wasserstein space and the admissions are the \u201cbarycenters\u201d of the geometry with different\nweights [36]. When learning this model, both the embeddings and the topics are inferred jointly. A\nnovel learning strategy based on the idea of model distillation [20, 29] is proposed, improving the\nconvergence and the performance of the learning algorithm.\nThe proposed method uni\ufb01es word embedding and topic modeling in a framework of Wasserstein\nlearning. 
Based on this model, we can calculate the optimal transport between different admissions and explain the transport by the distances between ICD code embeddings. Accordingly, the admissions of patients become more interpretable and predictable. Experimental results show that our approach is superior to previous state-of-the-art methods in various tasks, including predicting admission type, mortality of a given admission, and procedure recommendation.

2 A Wasserstein Topic Model Based on Euclidean Word Embeddings

Assume that we have M documents and a corpus with N words, e.g., respectively, admission records and the dictionary of ICD codes. These documents can be represented by Y = [y_m] ∈ R^{N×M}, where y_m ∈ Σ^N, m ∈ {1, ..., M}, is the distribution of the words in the m-th document, and Σ^N is an N-dimensional simplex. These distributions can be represented by some basis (i.e., topics), denoted as B = [b_k] ∈ R^{N×K}, where b_k ∈ Σ^N is the k-th base distribution. The word embeddings can be formulated as X = [x_n] ∈ R^{D×N}, where x_n, the embedding of the n-th word for n ∈ {1, ..., N}, is obtained by a model, i.e., x_n = g_θ(w_n) with parameters θ and a predefined representation w_n of the word (e.g., w_n may be a one-hot vector for each word). The distance between two word embeddings is denoted d_{nn′} = d(x_n, x_{n′}), and generally it is assumed to be Euclidean.
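As a minimal numerical sketch of this notation (toy dimensions and the numpy formulation are our own, not the paper's implementation), the embeddings of one-hot words under a linear g_θ and their pairwise Euclidean distance matrix can be computed as:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 6, 4                       # vocabulary size and embedding dimension (toy values)
W = rng.normal(size=(D, N))       # parameters theta of a linear embedding model
one_hot = np.eye(N)               # predefined representations w_n

X = W @ one_hot                   # x_n = g_theta(w_n); for one-hot inputs, X equals W
# Pairwise Euclidean distances d_{nn'} = ||x_n - x_{n'}||_2, collected into D_theta:
D_theta = np.linalg.norm(X[:, :, None] - X[:, None, :], axis=0)

assert D_theta.shape == (N, N)
assert np.allclose(D_theta, D_theta.T)          # a metric is symmetric
assert np.allclose(np.diag(D_theta), 0.0)       # zero self-distance
```

Because g_θ is differentiable in θ, every entry of this matrix is as well, which is what later allows the underlying distance to be updated jointly with the topic model.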
These distances can be formulated as a parametric distance matrix D_θ = [d_{nn′}] ∈ R^{N×N}.

Denote the space of the word distributions as the ambient space and that of their embeddings as the latent space. We aim to model and learn the topics in the ambient space and the embeddings in the latent space in a unified framework. We show that recent developments in the methods of Wasserstein learning provide an attractive solution to achieve this aim.

2.1 Revisiting topic models from a geometric viewpoint

Traditional topic models [8] often decompose the distribution of words conditioned on the observed document into two factors: the distribution of words conditioned on a certain topic, and the distribution of topics conditioned on the document. Mathematically, this corresponds to a low-rank factorization of Y, i.e., Y = BΛ, where B = [b_k] contains the word distributions of different topics and Λ = [λ_m] ∈ R^{K×M}, λ_m = [λ_{km}] ∈ Σ^K, contains the topic distributions of different documents. Given B and λ_m, y_m can be equivalently written as

y_m = Bλ_m = argmin_{y ∈ Σ^N} ∑_{k=1}^K λ_{km} ‖b_k - y‖_2^2,   (1)

where λ_{km} is the probability of topic k given document m. From a geometric viewpoint, {b_k} in (1) can be viewed as vertices of a geometry, whose "weights" are λ_m. Then, y_m is the weighted barycenter of the geometry in the Euclidean space.

Following this viewpoint, we can extend (1) to another metric space, i.e.,

y_m = argmin_{y ∈ Σ^N} ∑_{k=1}^K λ_{km} d^2(b_k, y) = y_{d^2}(B, λ_m),   (2)

where y_{d^2}(B, λ_m) is the barycenter of the geometry, with vertices B and weights λ_m in the space with metric d.

2.2 Wasserstein topic model

When the distance d in (2) is the Wasserstein distance, we obtain a Wasserstein topic model, which has a natural and explicit connection with word embeddings. Mathematically, let (Ω, d) be an arbitrary space with metric d, and let P(Ω) be the set of Borel probability measures on Ω.

Definition 2.1. For p ∈ [1, ∞) and probability measures u and v in P(Ω), their p-order Wasserstein distance [40] is W_p(u, v) = (inf_{π ∈ Π(u,v)} ∫_{Ω×Ω} d^p(x, y) dπ(x, y))^{1/p}, where Π(u, v) is the set of all probability measures on Ω × Ω with u and v as marginals.

Definition 2.2. The p-order weighted Fréchet mean in the Wasserstein space (also called the Wasserstein barycenter) [1] of K measures B = {b_1, ..., b_K} in P ⊂ P(Ω) is q(B, λ) = arg inf_{q ∈ P} ∑_{k=1}^K λ_k W_p^p(b_k, q), where λ = [λ_k] ∈ Σ^K decides the weights of the measures.

When Ω is a discrete state space, i.e., {1, ..., N}, the Wasserstein distance is also called the optimal transport (OT) distance [36].
More specifically, the Wasserstein distance with p = 2 corresponds to the solution of the discretized Monge-Kantorovich problem:

W_2^2(u, v; D) := min_{T ∈ Π(u,v)} Tr(T⊤D),   (3)

where u and v are two distributions of the discrete states and D ∈ R^{N×N} is the underlying distance matrix, whose elements measure the distances between different states. Π(u, v) = {T | T1 = u, T⊤1 = v}, and Tr(·) represents the matrix trace. The matrix T is called the optimal transport matrix when the minimum in (3) is achieved.

Applying the discrete Wasserstein distance in (3) to (2), we obtain our Wasserstein topic model, i.e.,

y_{W_2^2}(B, λ; D) = argmin_{y ∈ Σ^N} ∑_{k=1}^K λ_k W_2^2(b_k, y; D).   (4)

In this model, the discrete states correspond to the words in the corpus, and the distance between different words can be calculated by the Euclidean distance between their embeddings.

In this manner, we establish the connection between the word embeddings and the topic model: the distance between different topics (and different documents) is achieved by the optimal transport between their word distributions built on the embedding-based underlying distance. For arbitrary two word embeddings, the more similar they are, the smaller the underlying distance, and the more easily we can achieve transfer between them. In the learning phase (as shown in the following section), we can learn the embeddings and the topic model jointly. This model is especially suitable for clinical admission analysis. As discussed above, we not only care about the clustering structure of admissions (the relative proportion by which each topic is manifested in an admission), but also want to know the mechanism or the tendency of their transfers at the level of diseases. As shown in Fig.
1, using our model, we can calculate the Wasserstein distance between different admissions at the level of diseases and obtain the optimal transport from one admission to another explicitly. The hierarchical architecture of our model helps represent each admission by its topics, which are the typical diseases/procedures (ICD codes) appearing in a class of admissions.

3 Wasserstein Learning with Model Distillation

Given the word-document matrix Y and a predefined number of topics K, we wish to jointly learn the basis B, the weight matrix Λ, and the model g_θ of word embeddings. This learning problem can be formulated as

min_{B,Λ,θ} ∑_{m=1}^M L(y_m, y_{W_2^2}(B, λ_m; D_θ)),
s.t. b_k ∈ Σ^N for k = 1, ..., K, and λ_m ∈ Σ^K for m = 1, ..., M.   (5)

Here, D_θ = [d_{nn′}] and the element d_{nn′} = ‖g_θ(w_n) - g_θ(w_{n′})‖_2. The loss function L(·,·) measures the difference between y_m and its estimate y_{W_2^2}(B, λ_m; D_θ). We can solve this problem based on the idea of alternating optimization. In each iteration we first learn the basis B and the weights Λ given the current parameters θ. Then, we learn the new parameters θ based on the updated B and Λ.

3.1 Updating word embeddings to enhance the clustering structure

Suppose that we have obtained updated B and Λ. Given the current D_θ, we denote the optimal transport between document y_m and topic b_k as T_{km}. Accordingly, the Wasserstein distance between y_m and b_k is Tr(T⊤_{km} D_θ). Recall from the topic model in (4) that each document y_m is represented as the weighted barycenter of B in the Wasserstein space, and the weights λ_m = [λ_{km}] represent the closeness between the barycenter and the different bases (topics). To enhance the clustering structure of the documents, we update θ by minimizing the Wasserstein distance between the documents and their closest topics. Consequently, the documents belonging to different clusters will be far away from each other. The corresponding objective function is

∑_{m=1}^M Tr(T⊤_{k_m m} D_θ) = Tr(T⊤D_θ) = ∑_{n,n′} t_{nn′} ‖x_{n,θ} - x_{n′,θ}‖_2^2,   (6)

where T_{k_m m} is the optimal transport between y_m and its closest base b_{k_m}. The aggregation of these transports is given by T = ∑_m T_{k_m m} = [t_{nn′}], and X_θ = [x_{n,θ}] are the word embeddings. Considering the symmetry of D_θ, we can replace t_{nn′} in (6) with (t_{nn′} + t_{n′n})/2. The objective function can then be further written as Tr(X_θ L X⊤_θ), where L = diag((T + T⊤)/2 · 1_N) - (T + T⊤)/2 is the Laplacian matrix. To avoid trivial solutions like X_θ = 0, we add a smoothness regularizer and update θ by optimizing the following problem:

min_θ E(θ) = min_θ Tr(X_θ L X⊤_θ) + β ‖θ - θ_c‖_2^2,   (7)

where θ_c denotes the current parameters and β controls the significance of the regularizer. Similar to Laplacian Eigenmaps [6], the aggregated optimal transport T works as the similarity measure between the proposed embeddings. However, instead of requiring the solution of (7) to be the eigenvectors of L, we enhance the stability of the update by ensuring that the new θ is close to the current one.

3.2 Updating topic models based on the distilled underlying distance

Given the updated word embeddings and the corresponding underlying distance D_θ, we wish to further update the basis B and the weights Λ. The problem is formulated as a Wasserstein dictionary-learning problem, as proposed in [36]. Following the same strategy as [36], we rewrite {λ_m} and {b_k} as

λ_{km}(A) = exp(α_{km}) / ∑_{k′} exp(α_{k′m}),   b_{nk}(R) = exp(γ_{nk}) / ∑_{n′} exp(γ_{n′k}),   (8)

where A = [α_{km}] and R = [γ_{nk}] are new parameters.

Algorithm 1 Distilled Wasserstein Learning (DWL) for Joint Word Embedding and Topic Modeling
1: Input: The distributions of words for documents Y. The distillation parameter τ. The number of epochs I. Batch size s. The weight ε in the Sinkhorn distance. The weight β in (7). The learning rate ρ.
2: Output: The parameters θ, basis B, and weights Λ.
3: Initialize θ, A, R ~ N(0, 1), and calculate B(R) and Λ(A) by (8).
4: For i = 1, ..., I
5:   For each batch of documents
6:     Calculate the Sinkhorn gradient with distillation: ∇_B L_τ|_B and ∇_Λ L_τ|_Λ.
7:     R ← R - ρ ∇_B L_τ|_B ∇_R B|_R,  A ← A - ρ ∇_Λ L_τ|_Λ ∇_A Λ|_A.
8:     Calculate B(R), Λ(A) and the gradient ∇_θ E(θ)|_θ of (7), then update θ ← θ - ρ ∇_θ E(θ)|_θ.
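The reparameterization in (8) is a column-wise softmax, so the simplex constraints of (5) hold automatically while A and R move freely in an unconstrained space. A minimal numpy sketch (toy sizes of our own choosing, not the authors' code):

```python
import numpy as np

def col_softmax(Z):
    """Column-wise softmax; each column lands on the probability simplex."""
    E = np.exp(Z - Z.max(axis=0))        # subtract the column max for numerical stability
    return E / E.sum(axis=0)

rng = np.random.default_rng(0)
N, K, M = 6, 3, 4                        # words, topics, documents (toy values)
A = rng.normal(size=(K, M))              # unconstrained parameters for the weights
R = rng.normal(size=(N, K))              # unconstrained parameters for the basis

Lam = col_softmax(A)                     # Lambda(A): topic weights, columns in the K-simplex
B = col_softmax(R)                       # B(R): topic word distributions, columns in the N-simplex

assert np.allclose(Lam.sum(axis=0), 1.0) and (Lam > 0).all()
assert np.allclose(B.sum(axis=0), 1.0) and (B > 0).all()
```

Gradient steps on A and R (lines 6-7 of Algorithm 1) therefore never leave the feasible set: B(R) and Λ(A) remain valid distributions by construction.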
Based on (8), the normalization of {λ_m} and {b_k} is satisfied naturally, and we can reformulate (5) as an unconstrained optimization problem, i.e.,

min_{A,R} ∑_{m=1}^M L(y_m, y_{W_2^2}(B(R), λ_m(A); D_θ)).   (9)

Different from [36], we introduce a model distillation method to improve the convergence of our model. The key idea is that the model with the current underlying distance D_θ works as a "teacher," while the proposed model with new basis and weights is regarded as a "student." Through D_θ, the teacher provides the student with guidance for its updating. We find that if we use the current underlying distance D_θ to calculate the basis B and weights Λ, we encounter a serious "vanishing gradient" problem when solving (7) in the next iteration: because ∑_m Tr(T⊤_{k_m m} D_θ) in (6) is already optimal under the current underlying distance and the new B and Λ, it is difficult to further update D_θ. Inspired by recent model distillation methods [20, 29, 34], we use a smoothed underlying distance matrix to solve the "vanishing gradient" problem when updating B and Λ. In particular, the y_{W_2^2}(B(R), λ_m(A); D_θ) in (9) is replaced by a Sinkhorn distance with the smoothed underlying distance, i.e., y_{S_ε}(B(R), λ_m(A); D_θ^τ), where (·)^τ, 0 < τ < 1, is an element-wise power function of a matrix. The Sinkhorn distance S_ε is defined as

S_ε(u, v; D) = min_{T ∈ Π(u,v)} Tr(T⊤D) + ε Tr(T⊤ ln(T)),   (10)

where ln(·) calculates the element-wise logarithm of a matrix.
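A self-contained Sinkhorn sketch of (10), applied to the distilled cost D^τ (toy sizes and the plain scaling-iteration implementation are our own assumptions, not the paper's code):

```python
import numpy as np

def sinkhorn(u, v, D, eps=0.1, n_iter=1000):
    """Entropic OT (the Sinkhorn distance in (10)) between histograms u and v."""
    K = np.exp(-D / eps)                   # Gibbs kernel
    a = np.ones_like(u)
    for _ in range(n_iter):
        b = v / (K.T @ a)                  # alternating scaling (Bregman projections)
        a = u / (K @ b)
    T = a[:, None] * K * b[None, :]        # transport plan
    return T, float(np.sum(T * D))         # plan and its transport cost Tr(T^T D)

rng = np.random.default_rng(1)
N = 5
u = rng.random(N); u /= u.sum()
v = rng.random(N); v /= v.sum()
D = rng.random((N, N)); D = (D + D.T) / 2  # toy symmetric cost matrix
np.fill_diagonal(D, 0.0)

tau = 0.5                                  # distillation: D**tau compresses large costs,
T_plan, cost = sinkhorn(u, v, D ** tau)    # giving "weaker" teacher guidance

assert np.allclose(T_plan.sum(axis=1), u)              # row marginals match u
assert np.allclose(T_plan.sum(axis=0), v, atol=1e-6)   # column marginals match v
```

Because τ < 1 shrinks the ratios between large and small entries of D, no single word pair dominates the transport problem, which is the "weak guidance" effect exploited when updating B and Λ.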
The parameter τ works as the reciprocal of the "temperature" in the smoothed softmax layer of the original distillation method [20, 29]. The principle of our distilled learning method is that when updating B and Λ, the smoothed underlying distance is used to provide "weak" guidance. Consequently, the student (i.e., the proposed new model with updated B and Λ) will not completely rely on the information from the teacher (i.e., the underlying distance obtained in the previous iteration), and will tend to explore new basis and weights. In summary, the optimization problem for learning the Wasserstein topic model is

min_{A,R} L_τ(A, R) = min_{A,R} ∑_{m=1}^M L(y_m, y_{S_ε}(B(R), λ_m(A); D_θ^τ)),   (11)

which can be solved under the same algorithmic framework as that in [36].

Our algorithm is shown in Algorithm 1. The details of the algorithm and the influence of our distilled learning strategy on the convergence of the algorithm are given in the Supplementary Material. Note that our method is compatible with existing techniques: it can work as a fine-tuning method when the underlying distance is initialized by predefined embeddings, and when the topic of each document is given, k_m in (6) is predefined and the proposed method can work in a supervised way.

4 Related Work

Word embedding, topic modeling, and their application to clinical data. Traditional topic models, like latent Dirichlet allocation (LDA) [8] and its variants, rely on the "bag-of-words" representation of documents. Word embedding [30] provides another choice, which represents documents as the fusion of the embeddings [27]. Recently, many new word embedding techniques have been proposed, e.g., Glove [33] and the linear ensemble embedding in [32], which achieve encouraging performance on word and document representation.
Some works try to combine word embedding\nand topic modeling. As discussed above, they either use word embeddings as features for topic\nmodels [38, 15] or regard topics as labels when learning embeddings [41, 28]. A uni\ufb01ed framework\nfor learning topics and word embeddings was still absent prior to this paper.\nFocusing on clinical data analysis, word embedding and topic modeling have been applied to many\ntasks. Considering ICD code assignment as an example, many methods have been proposed to\nestimate the ICD codes based on clinical records [39, 5, 31, 22], aiming to accelerate diagnoses.\nOther tasks, like clustering clinical data and the prediction of treatments, can also be achieved by\nNLP techniques [4, 19, 11].\nWasserstein learning and its application in NLP The Wasserstein distance has been proven useful\nin distribution estimation [9], alignment [44] and clustering [1, 43, 14], avoiding over-smoothed\nintermediate interpolation results. It can also be used as loss function when learning generative mod-\nels [12, 3]. The main bottleneck of the application of Wasserstein learning is its high computational\ncomplexity. This problem has been greatly eased since Sinkhorn distance was proposed in [13]. Based\non Sinkhorn distance, we can apply iterative Bregman projection [7] to approximate Wasserstein\ndistance, and achieve a near-linear time complexity [2]. Many more complicated models have been\nproposed based on Sinkhorn distance [16, 36]. Focusing on NLP tasks, the methods in [26, 21]\nuse the same framework as ours, computing underlying distances based on word embeddings and\nmeasuring the distance between documents in the Wasserstein space. 
However, the work in [26] does not update the pretrained embeddings, while the model in [21] does not have a hierarchical architecture for topic modeling.

Model distillation. As a kind of transfer learning technique, model distillation was originally proposed to learn a simple model (student) under the guidance of a complicated model (teacher) [20]. When learning the target distilled model, a regularizer based on the smoothed outputs of the complicated model is imposed. Essentially, the distilled complicated model provides the target model with some privileged information [29]. This idea has been widely used in many applications, e.g., textual data modeling [23], healthcare data analysis [10], and image classification [18]. Besides transfer learning, the idea of model distillation has been extended to control the learning process of neural networks [34, 35, 42]. To the best of our knowledge, our work is the first attempt to combine model distillation with Wasserstein learning.

5 Experiments

To demonstrate the feasibility and the superiority of our distilled Wasserstein learning (DWL) method, we apply it to the analysis of admission records of patients, and compare it with state-of-the-art methods. We consider a subset of the MIMIC-III dataset [25], containing 11,086 patient admissions corresponding to 56 diseases and 25 procedures; each admission is represented as a sequence of ICD codes of the diseases and the procedures. Using different methods, we learn the embeddings of the ICD codes and the topics of the admissions and test them on three tasks: mortality prediction, admission-type prediction, and procedure recommendation. For all the methods, we use 50% of the admissions for training, 25% for validation, and the remaining 25% for testing in each task.
For our method, the embeddings are obtained by a linear projection of the one-hot representations of the ICD codes, similar to Word2Vec [30] and Doc2Vec [27], and the loss function L is the squared loss. The hyperparameters of our method are set via cross validation: the batch size s = 256, β = 0.01, ε = 0.01, the number of topics K = 8, the embedding dimension D = 50, and the learning rate ρ = 0.05. The number of epochs I is set to 5 when the embeddings are initialized by Word2Vec, and 50 when training from scratch. The distillation parameter is set empirically to τ = 0.5; its influence on the learning result is shown in the Supplementary Material.

5.1 Admission classification and procedure recommendation

The admissions of patients often have a clustering structure. According to the seriousness of the admissions, they are categorized into four classes in the MIMIC-III dataset: elective, emergency, urgent and newborn. Additionally, diseases and procedures may lead to mortality, and the admissions can be clustered based on whether the patients die or not during their admissions. Even if learned in an unsupervised way, the proposed embeddings should reflect the clustering structure of the admissions to some degree. We test our DWL method on the prediction of admission type and mortality. For the admissions, we can either represent them by the distributions of the codes and calculate the Wasserstein distance between them, or represent them by the average pooling of the code embeddings and calculate the Euclidean distance between them. A simple KNN classifier can be applied under these two metrics, and we consider K = 1 and K = 5.

Table 1: Admission classification accuracy (%) for various methods.

Word Feature   | Doc. Feature      | Metric           | Dim. | Mortality 5-NN | Mortality 1-NN | Adm. Type 5-NN | Adm. Type 1-NN
—              | TF-IDF [17]       | Euclidean        | 81   | 69.98±0.05 | 75.32±0.04 | 82.27±0.03 | 88.28±0.02
—              | LDA [8]           | Euclidean        | 8    | 66.03±0.06 | 69.05±0.06 | 81.41±0.04 | 86.57±0.04
Word2Vec [30]  | Doc2Vec [27]      | Euclidean        | 50   | 57.98±0.08 | 59.80±0.08 | 70.57±0.08 | 79.94±0.07
Word2Vec [30]  | AvePooling        | Euclidean        | 50   | 70.42±0.05 | 75.21±0.04 | 84.88±0.07 | 89.16±0.06
Glove [33]     | AvePooling        | Euclidean        | 50   | 66.94±0.06 | 73.21±0.04 | 81.91±0.05 | 88.21±0.05
DWL (Scratch)  | AvePooling        | Euclidean        | 50   | 71.01±0.12 | 74.74±0.11 | 84.54±0.13 | 89.49±0.12
DWL (Finetune) | AvePooling        | Euclidean        | 50   | 71.52±0.07 | 75.44±0.07 | 85.54±0.09 | 89.28±0.09
Word2Vec [30]  | Topic weight [36] | Euclidean        | 8    | 70.31±0.04 | 74.89±0.04 | 83.63±0.05 | 89.25±0.04
DWL (Scratch)  | Topic weight [36] | Euclidean        | 8    | 70.45±0.08 | 74.88±0.07 | 83.82±0.12 | 88.80±0.12
DWL (Finetune) | Topic weight [36] | Euclidean        | 8    | 70.88±0.07 | 75.67±0.07 | 84.26±0.09 | 89.13±0.08
Word2Vec [30]  | Word distribution | Wasserstein [26] | 81   | 70.61±0.04 | 75.92±0.04 | 84.08±0.05 | 89.06±0.05
Glove [33]     | Word distribution | Wasserstein [26] | 81   | 70.64±0.06 | 75.97±0.05 | 83.92±0.08 | 89.17±0.07
DWL (Scratch)  | Word distribution | Wasserstein [26] | 81   | 71.01±0.10 | 75.88±0.09 | 84.23±0.12 | 89.33±0.11
DWL (Finetune) | Word distribution | Wasserstein [26] | 81   | 70.65±0.07 | 76.00±0.06 | 84.35±0.08 | 89.61±0.07

Table 2: Top-N procedure recommendation results (%) for various methods.

Method         | Top-1 P | Top-1 R | Top-1 F1 | Top-3 P | Top-3 R | Top-3 F1 | Top-5 P | Top-5 R | Top-5 F1
Word2Vec [30]  | 39.95 | 13.27 | 18.25 | 31.70 | 33.46 | 29.30 | 28.89 | 46.98 | 32.59
Glove [33]     | 32.66 | 13.01 | 17.22 | 29.45 | 30.99 | 27.41 | 27.93 | 44.79 | 31.47
DWL (Scratch)  | 37.89 | 12.42 | 17.16 | 30.14 | 29.78 | 27.14 | 27.39 | 43.81 | 30.81
DWL (Finetune) | 40.00 | 13.76 | 18.71 | 31.88 | 33.71 | 29.58 | 30.59 | 48.56 | 34.28
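The average-pooling route evaluated in Table 1 amounts to pooling each admission's code embeddings into one vector and running KNN under the Euclidean metric. A minimal sketch with made-up two-cluster embeddings and labels (toy values of our own, not MIMIC-III data):

```python
import numpy as np

def ave_pool(code_ids, X):
    """Represent an admission by the mean of its ICD-code embeddings (columns of X)."""
    return X[:, code_ids].mean(axis=1)

def knn_predict(query, pooled, labels, k=1):
    """KNN classification under the Euclidean metric on pooled admission vectors."""
    d = np.linalg.norm(pooled - query[:, None], axis=0)
    nearest = np.argsort(d)[:k]
    vals, counts = np.unique(labels[nearest], return_counts=True)
    return int(vals[np.argmax(counts)])            # majority vote

# Toy embeddings of 10 ICD codes: codes 0-4 and 5-9 form two separated clusters.
X = np.zeros((2, 10))
X[0, :5], X[0, 5:] = 1.0, -1.0
X[1] = 0.01 * np.arange(10)

train_adms = [[0, 1, 2], [1, 2, 3], [5, 6, 7], [6, 8, 9]]
labels = np.array([0, 0, 1, 1])                    # e.g., mortality labels
pooled = np.stack([ave_pool(a, X) for a in train_adms], axis=1)

pred = knn_predict(ave_pool([0, 2, 3], X), pooled, labels, k=1)  # → 0
```

Once the embeddings are trained, this test-time pipeline involves only means and Euclidean distances, which is why it deploys so cheaply compared with Wasserstein distances between word distributions.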
We compare the proposed method with the following baselines: (i) bag-of-words-based methods like TF-IDF [17] and LDA [8]; (ii) word/document embedding methods like Word2Vec [30], Glove [33], and Doc2Vec [27]; and (iii) the Wasserstein-distance-based method in [26]. We tested the various methods in 20 trials. In each trial, we trained the different models on a subset of the training admissions, tested them on the same testing set, and calculated the averaged results and their 90% confidence intervals.

The classification accuracies of the various methods are shown in Table 1. Our DWL method is superior to its competitors in classification accuracy. Beyond this encouraging result, we also observe two interesting and important phenomena. First, for our DWL method the model trained from scratch has performance comparable to that fine-tuned from Word2Vec's embeddings, which means that our method is robust to initialization when exploring the clustering structure of admissions. Second, compared with measuring the Wasserstein distance between documents, representing the documents by the average pooling of embeddings and measuring their Euclidean distance obtains comparable results. Considering that measuring the Euclidean distance has much lower complexity than measuring the Wasserstein distance, this phenomenon implies that although our DWL method is time-consuming in the training phase, the trained models can be easily deployed for large-scale data in the testing phase.

The third task is recommending procedures according to the diseases in the admissions. In our framework, this task can be solved by establishing a bipartite graph between diseases and procedures based on the Euclidean distance between their embeddings. The proposed embeddings should reflect the clinical relationships between procedures and diseases, such that the procedures are assigned to the diseases at a short distance.
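A minimal numpy sketch of this bipartite recommendation step, together with per-admission top-L precision, recall and F1 (the embeddings and truth lists below are toy values of our own choosing, not learned from data):

```python
import numpy as np

def recommend(disease_ids, X_dis, X_proc, L=2):
    """Rank procedures by mean Euclidean distance to the admission's disease embeddings."""
    centers = X_dis[:, disease_ids]                                       # (dim, n_dis)
    d = np.linalg.norm(X_proc[:, :, None] - centers[:, None, :], axis=0)  # (n_proc, n_dis)
    return [int(i) for i in np.argsort(d.mean(axis=1))[:L]]

def top_l_scores(recommended, truth):
    """Per-admission top-L precision, recall and F1."""
    hits = len(set(recommended) & set(truth))
    p, r = hits / len(recommended), hits / len(truth)
    return p, r, (0.0 if hits == 0 else 2 * p * r / (p + r))

# Toy 2-D embeddings: procedures 0 and 2 sit near diseases 0 and 1; procedure 1 is far away.
X_dis = np.array([[0.0, 0.0],
                  [0.0, 1.0]])
X_proc = np.array([[0.0, 5.0, 0.1],
                   [0.5, 5.0, 0.5]])

recs = recommend([0, 1], X_dis, X_proc, L=2)   # → [0, 2]
p, r, f1 = top_l_scores(recs, truth=[0])       # → (0.5, 1.0, 0.666...)
```

Averaging these per-admission scores over the testing set gives the aggregate numbers of the kind reported in Table 2.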
For the m-th admission, we recommend a list of procedures of length L, denoted $E_m$, based on its diseases, and evaluate the recommendation against the ground-truth list of procedures, denoted $T_m$. In particular, given $\{E_m, T_m\}_{m=1}^{M}$, we calculate the top-L precision, recall, and F1-score as
$$P = \frac{1}{M}\sum_{m=1}^{M} P_m = \frac{1}{M}\sum_{m=1}^{M} \frac{|E_m \cap T_m|}{|E_m|},\quad R = \frac{1}{M}\sum_{m=1}^{M} R_m = \frac{1}{M}\sum_{m=1}^{M} \frac{|E_m \cap T_m|}{|T_m|},\quad F1 = \frac{1}{M}\sum_{m=1}^{M} \frac{2 P_m R_m}{P_m + R_m}.$$
Table 2 shows the performance of the various methods with L = 1, 3, 5. We find that although our DWL method is not as good as Word2Vec when trained from scratch, which may be caused by the far fewer epochs we executed, it outperforms the other methods when fine-tuned from Word2Vec.

Figure 2: (a) The KNN graph of diseases and procedures with K = 4; an enlarged version is in the Supplementary Material. The ICD codes of diseases carry the prefix "d" (blue nodes), while those of procedures carry the prefix "p" (orange nodes). (b-d) Three enlarged subgraphs corresponding to the red frames in (a).
In each subfigure, the nodes in blue are diseases while the nodes in orange are procedures.

Table 3: Top-3 ICD codes in each topic, associated with the corresponding diseases/procedures. [The eight-column layout of this table was garbled during extraction. The leading code of each topic: topic 1, d_5859 (chronic kidney disease); topic 2, d_4241 (aortic valve disorders); topic 3, d_311 (mycobacteria); topic 4, p_8856 (coronary arteriography); topic 5, d_2449 (hypothyroidism); topic 6, d_7742 (neonatal jaundice); topic 7, p_9904 (cell transfusion); topic 8, d_311 (mycobacteria).]

5.2 Rationality Analysis

To verify the rationality of our learning result, in Fig. 2 we visualize the KNN graph of diseases and procedures. The diseases in Fig. 2(a) exhibit an obvious clustering structure, while the procedures are dispersed according to their connections with matched diseases. Furthermore, the three typical subgraphs in Fig. 2 can be interpreted from a clinical viewpoint. Figure 2(b) clusters cardiovascular diseases like hypotension (d_4589, d_45829) and hyperosmolality (d_2762) with their common procedure, i.e., diagnostic ultrasound of the heart (p_8872). Figure 2(c) clusters coronary artery bypass (p_3615) with typical postoperative responses like hyperpotassemia (d_2767), cardiac complications (d_9971), and congestive heart failure (d_4280). Figure 2(d) clusters chronic pulmonary heart disease (d_4168) with its common procedures like cardiac catheterization (p_3772) and abdominal drainage (p_5491); these procedures are further connected with potential complications like septic shock (d_78552).
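As a sketch of how such a KNN graph over the learned code embeddings can be built, consider the following; the labels and 2-D embeddings are toy stand-ins for the learned disease/procedure embeddings:

```python
import numpy as np

def knn_graph(E, labels, K=4):
    """Build the undirected K-nearest-neighbor graph over code embeddings:
    each code is linked to its K closest codes in Euclidean distance
    (a sketch; the paper's visualization may add layout steps)."""
    D = np.linalg.norm(E[:, None, :] - E[None, :, :], axis=-1)
    np.fill_diagonal(D, np.inf)  # exclude self-edges
    edges = set()
    for i in range(len(E)):
        for j in np.argsort(D[i])[:K]:
            # store each undirected edge once, ordered by index
            edges.add((labels[min(i, j)], labels[max(i, j)]))
    return edges

# Toy embeddings: two tight clusters, each with two diseases and a procedure.
labels = ["d_a", "d_b", "p_a", "d_c", "d_d", "p_b"]
E = np.array([[0., 0.], [0., 1.], [0., 0.5],
              [9., 0.], [9., 1.], [9., 0.5]])
G = knn_graph(E, labels, K=2)
```

With K = 2 the two clusters stay disconnected, mirroring how the full graph in Fig. 2(a) separates groups of related codes while linking procedures to their matched diseases.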
The rationality of our learning result is also demonstrated by the topics shown in Table 3. According to their top-3 ICD codes, some topics have obvious clinical interpretations. Specifically, topic 1 concerns kidney disease and its complications and procedures; topics 2 and 5 concern serious cardiovascular diseases; topic 4 concerns diabetes and its cardiovascular complications and procedures; and topic 6 concerns neonatal diseases and procedures. We provide the mapping between ICD codes and the corresponding diseases/procedures in the Supplementary Material.

6 Conclusion and Future Work

We have proposed a novel method to jointly learn Euclidean word embeddings and a Wasserstein topic model in a unified framework. An alternating optimization method was applied to iteratively update the topics, their weights, and the embeddings of words. We introduced a simple but effective model-distillation method to improve the performance of the learning algorithm. Tested on clinical admission records, our method shows superiority over competitive models on various tasks. Currently, the proposed learning method shows potential for more-traditional textual data analysis (documents), but its computational complexity is still too high for large-scale document applications, because the vocabulary of real documents is typically much larger than the number of ICD codes considered here in the motivating hospital-admissions application.
In the future, we plan to further accelerate the learning method, e.g., by replacing the Sinkhorn-based updating procedure with variants like the Greenkhorn-based updating method [2].
7 Acknowledgments

This research was supported in part by DARPA, DOE, NIH, ONR and NSF. Morgan A. Schmitz kindly helped us by sharing his Wasserstein dictionary learning code. We also thank Prof. Hongyuan Zha at Georgia Institute of Technology for helpful discussions.

References

[1] M. Agueh and G. Carlier. Barycenters in the Wasserstein space. SIAM Journal on Mathematical Analysis, 43(2):904–924, 2011.

[2] J. Altschuler, J. Weed, and P. Rigollet. Near-linear time approximation algorithms for optimal transport via Sinkhorn iteration. arXiv preprint arXiv:1705.09634, 2017.

[3] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein GAN. arXiv preprint arXiv:1701.07875, 2017.

[4] J. M. Bajor, D. A. Mesa, T. J. Osterman, and T. A. Lasko. Embedding complexity in the data representation instead of in the model: A case study using heterogeneous medical data. arXiv preprint arXiv:1802.04233, 2018.

[5] T. Baumel, J. Nassour-Kassis, M. Elhadad, and N. Elhadad. Multi-label classification of patient notes: a case study on ICD code assignment. arXiv preprint arXiv:1709.09587, 2017.

[6] M. Belkin and P. Niyogi. Laplacian Eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6):1373–1396, 2003.

[7] J.-D. Benamou, G. Carlier, M. Cuturi, L. Nenna, and G. Peyré. Iterative Bregman projections for regularized transportation problems. SIAM Journal on Scientific Computing, 37(2):A1111–A1138, 2015.

[8] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan):993–1022, 2003.

[9] E. Boissard, T. Le Gouic, J.-M. Loubes, et al. Distribution's template estimate with Wasserstein metrics. Bernoulli, 21(2):740–759, 2015.

[10] Z. Che, S. Purushotham, R.
Khemani, and Y. Liu. Distilling knowledge from deep networks\n\nwith applications to healthcare domain. arXiv preprint arXiv:1512.03542, 2015.\n\n[11] E. Choi, M. T. Bahadori, E. Searles, C. Coffey, M. Thompson, J. Bost, J. Tejedor-Sojo, and\n\nJ. Sun. Multi-layer representation learning for medical concepts. In KDD, 2016.\n\n[12] N. Courty, R. Flamary, and M. Ducoffe. Learning Wasserstein embeddings. arXiv preprint\n\narXiv:1710.07457, 2017.\n\n[13] M. Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in\n\nneural information processing systems, pages 2292\u20132300, 2013.\n\n[14] M. Cuturi and A. Doucet. Fast computation of Wasserstein barycenters. In International\n\nConference on Machine Learning, pages 685\u2013693, 2014.\n\n[15] R. Das, M. Zaheer, and C. Dyer. Gaussian LDA for topic models with word embeddings. In\n\nACL (1), pages 795\u2013804, 2015.\n\n[16] A. Genevay, G. Peyr\u00e9, and M. Cuturi. Sinkhorn-AutoDiff: Tractable Wasserstein learning of\n\ngenerative models. arXiv preprint arXiv:1706.00292, 2017.\n\n[17] S. Gerard and J. M. Michael. Introduction to modern information retrieval. ISBN, 1983.\n[18] S. Gupta, J. Hoffman, and J. Malik. Cross modal distillation for supervision transfer.\n\nIn\nComputer Vision and Pattern Recognition (CVPR), 2016 IEEE Conference on, pages 2827\u2013\n2836. IEEE, 2016.\n\n[19] H. Harutyunyan, H. Khachatrian, D. C. Kale, and A. Galstyan. Multitask learning and bench-\n\nmarking with clinical time series data. arXiv preprint arXiv:1703.07771, 2017.\n\n[20] G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. arXiv preprint\n\narXiv:1503.02531, 2015.\n\n[21] G. Huang, C. Guo, M. J. Kusner, Y. Sun, F. Sha, and K. Q. Weinberger. Supervised word\nmover\u2019s distance. In Advances in Neural Information Processing Systems, pages 4862\u20134870,\n2016.\n\n[22] J. Huang, C. Osorio, and L. W. Sy. 
An empirical evaluation of deep learning for ICD-9 code assignment using MIMIC-III clinical notes. arXiv preprint arXiv:1802.02311, 2018.

[23] H. Inan, K. Khosravi, and R. Socher. Tying word vectors and word classifiers: A loss framework for language modeling. arXiv preprint arXiv:1611.01462, 2016.

[24] T. Joachims. Learning to classify text using support vector machines: Methods, theory and algorithms, volume 186. Kluwer Academic Publishers, Norwell, 2002.

[25] A. E. Johnson, T. J. Pollard, L. Shen, H. L. Li-wei, M. Feng, M. Ghassemi, B. Moody, P. Szolovits, L. A. Celi, and R. G. Mark. MIMIC-III, a freely accessible critical care database. Scientific Data, 3:160035, 2016.

[26] M. Kusner, Y. Sun, N. Kolkin, and K. Weinberger. From word embeddings to document distances. In International Conference on Machine Learning, pages 957–966, 2015.

[27] Q. Le and T. Mikolov. Distributed representations of sentences and documents. In International Conference on Machine Learning, pages 1188–1196, 2014.

[28] Y. Liu, Z. Liu, T.-S. Chua, and M. Sun. Topical word embeddings. In AAAI, pages 2418–2424, 2015.

[29] D. Lopez-Paz, L. Bottou, B. Schölkopf, and V. Vapnik. Unifying distillation and privileged information. arXiv preprint arXiv:1511.03643, 2015.

[30] T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.

[31] J. Mullenbach, S. Wiegreffe, J. Duke, J. Sun, and J. Eisenstein. Explainable prediction of medical codes from clinical text. arXiv preprint arXiv:1802.05695, 2018.

[32] A. Muromägi, K. Sirts, and S. Laur. Linear ensembles of word embedding models. arXiv preprint arXiv:1704.01419, 2017.

[33] J. Pennington, R. Socher, and C. Manning.
Glove: Global vectors for word representation.\nIn Proceedings of the 2014 conference on empirical methods in natural language processing\n(EMNLP), pages 1532\u20131543, 2014.\n\n[34] G. Pereyra, G. Tucker, J. Chorowski, \u0141. Kaiser, and G. Hinton. Regularizing neural networks\n\nby penalizing con\ufb01dent output distributions. arXiv preprint arXiv:1701.06548, 2017.\n\n[35] A. A. Rusu, N. C. Rabinowitz, G. Desjardins, H. Soyer, J. Kirkpatrick, K. Kavukcuoglu,\nR. Pascanu, and R. Hadsell. Progressive neural networks. arXiv preprint arXiv:1606.04671,\n2016.\n\n[36] M. A. Schmitz, M. Heitz, N. Bonneel, F. Ngole, D. Coeurjolly, M. Cuturi, G. Peyr\u00e9, and\nJ.-L. Starck. Wasserstein dictionary learning: Optimal transport-based unsupervised nonlinear\ndictionary learning. SIAM Journal on Imaging Sciences, 11(1):643\u2013678, 2018.\n\n[37] D. Shen, G. Wang, W. Wang, M. R. Min, Q. Su, Y. Zhang, C. Li, R. Henao, and L. Carin.\nBaseline needs more love: On simple word-embedding-based models and associated pooling\nmechanisms. In ACL, 2018.\n\n[38] B. Shi, W. Lam, S. Jameel, S. Schockaert, and K. P. Lai. Jointly learning word embeddings and\nlatent topics. In Proceedings of the 40th International ACM SIGIR Conference on Research and\nDevelopment in Information Retrieval, pages 375\u2013384. ACM, 2017.\n\n[39] H. Shi, P. Xie, Z. Hu, M. Zhang, and E. P. Xing. Towards automated ICD coding using deep\n\nlearning. arXiv preprint arXiv:1711.04075, 2017.\n\n[40] C. Villani. Optimal transport: Old and new, volume 338. Springer Science & Business Media,\n\n2008.\n\n[41] W. Wang, Z. Gan, W. Wang, D. Shen, J. Huang, W. Ping, S. Satheesh, and L. Carin. Topic\n\ncompositional neural language model. arXiv preprint arXiv:1712.09783, 2017.\n\n[42] Y.-X. Wang and M. Hebert. Learning to learn: Model regression networks for easy small sample\n\nlearning. In European Conference on Computer Vision, pages 616\u2013634. Springer, 2016.\n\n[43] J. Ye, P. Wu, J. Z. Wang, and J. Li. 
Fast discrete distribution clustering using Wasserstein barycenter with sparse support. IEEE Transactions on Signal Processing, 65(9):2317–2332, 2017.

[44] Y. Zemel and V. M. Panaretos. Fréchet means and Procrustes analysis in Wasserstein space. arXiv preprint arXiv:1701.06876, 2017.