{"title": "Hybrid Retrieval-Generation Reinforced Agent for Medical Image Report Generation", "book": "Advances in Neural Information Processing Systems", "page_first": 1530, "page_last": 1540, "abstract": "Generating long and coherent reports to describe medical images poses challenges to bridging visual patterns with informative human linguistic descriptions. We propose a novel Hybrid Retrieval-Generation Reinforced Agent (HRGR-Agent) which reconciles traditional retrieval-based approaches populated with human prior knowledge, with modern learning-based approaches to achieve structured, robust, and diverse report generation. HRGR-Agent employs a hierarchical decision-making procedure. For each sentence, a high-level retrieval policy module chooses to either retrieve a template sentence from an off-the-shelf template database, or invoke a low-level generation module to generate a new sentence. HRGR-Agent is updated via reinforcement learning, guided by sentence-level and word-level rewards. Experiments show that our approach achieves the state-of-the-art results on two medical report datasets, generating well-balanced structured sentences with robust coverage of heterogeneous medical report contents. In addition, our model achieves the highest detection precision of medical abnormality terminologies, and improved human evaluation performance.", "full_text": "Hybrid Retrieval-Generation Reinforced Agent for\n\nMedical Image Report Generation\n\nChristy Y. Li\u2217\nDuke University\nyl558@duke.edu\n\nXiaodan Liang\u2020\n\nCarnegie Mellon University\nxiaodan1@cs.cmu.edu\n\nZhiting Hu\n\nCarnegie Mellon University\nzhitingh@cs.cmu.edu\n\nEric P. Xing\nPetuum, Inc\n\nepxing@cs.cmu.edu\n\nAbstract\n\nGenerating long and coherent reports to describe medical images poses challenges\nto bridging visual patterns with informative human linguistic descriptions. 
We\npropose a novel Hybrid Retrieval-Generation Reinforced Agent (HRGR-Agent)\nwhich reconciles traditional retrieval-based approaches populated with human prior\nknowledge, with modern learning-based approaches to achieve structured, robust,\nand diverse report generation. HRGR-Agent employs a hierarchical decision-making\nprocedure. For each sentence, a high-level retrieval policy module chooses\nto either retrieve a template sentence from an off-the-shelf template database, or\ninvoke a low-level generation module to generate a new sentence. HRGR-Agent\nis updated via reinforcement learning, guided by sentence-level and word-level\nrewards. Experiments show that our approach achieves the state-of-the-art results\non two medical report datasets, generating well-balanced structured sentences with\nrobust coverage of heterogeneous medical report contents. In addition, our model\nachieves the highest detection precision of medical abnormality terminologies, and\nimproved human evaluation performance.\n\n1 Introduction\n\nBeyond the traditional visual captioning task [41, 28, 43, 40, 18] that produces one single sentence,\ngenerating long and topic-coherent stories or reports to describe visual contents (images or videos)\nhas recently attracted increasing research interest [19, 35, 22], posed as a more challenging and\nrealistic goal towards bridging visual patterns with human linguistic descriptions. Particularly, report\ngeneration has several challenges to be resolved: 1) The generated report is a long narrative consisting\nof multiple sentences or paragraphs, which must have a plausible logic and consistent topics; 2) There\nis a presumed content coverage and specific terminology/phrases, depending on the task at hand.\nFor example, a sports game report should describe competing teams, winning points, and outstanding\nplayers [38]. 3) The content ordering is crucial. 
For example, a sports game report usually talks\nabout the competition results before describing teams and players in detail.\n\n\u2217This work was conducted when Christy Y. Li was at Petuum, Inc.\n\u2020Corresponding author.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\nFigure 1: An example of medical image report generation. The middle column is a report written by\nradiologists for the chest x-ray image on the left column. The right column contains three reports\ngenerated by a retrieval-based system (R), a generation-based model (G) and our proposed model\n(HRGR-Agent) respectively. The retrieval-based model correctly detects effusion while the generative\nmodel fails to. Our HRGR-Agent detects effusion and also describes supporting evidence.\n\nAs one of the most representative and practical report generation tasks, medical image\nreport generation must satisfy more critical protocols and ensure the correctness of medical term\nusage. As shown in Figure 1, a medical report consists of a findings section describing medical\nobservations in detail for both normal and abnormal features, an impression or conclusion sentence\nindicating the most prominent medical observation or conclusion, and comparison and indication\nsections that list the patient\u2019s peripheral information. Among these sections, the findings section, as\nthe most important component, ought to cover contents of various aspects such as heart size, lung\nopacity, bone structure; any abnormality appearing at the lungs, aorta and hilum; and potential diseases\nsuch as effusion, pneumothorax and consolidation. In terms of content ordering, the narrative of\nthe findings section usually follows a presumptive order, e.g. 
heart size, mediastinum contour followed\nby lung opacity, remarkable abnormalities followed by mild or potential abnormalities.\nState-of-the-art caption generation models [41, 9, 43, 34] tend to perform poorly on medical report\ngeneration with specific content requirements for several reasons. First, medical reports are usually\ndominated by normal findings, that is, a small portion of majority sentences usually forms a template\ndatabase. For these normal cases, a retrieval-based system (e.g. directly performing classification\namong a list of majority sentences given image features) can perform surprisingly well due to the\nlow variance of language. For instance, in Figure 1, a retrieval-based system correctly detects\neffusion from a chest x-ray image, while a generative model that generates word by word given\nimage features fails to detect effusion. On the other hand, abnormal findings, which are relatively rare\nand remarkably diverse, are of higher importance. Current text generation approaches [16]\noften fail to capture the diversity of such a small portion of descriptions, and pure generation pipelines\nare biased towards generating plausible sentences that look natural by the language model but are poor at\nfinding visual groundings [17]. On the contrary, a desirable medical report usually has to not only\ndescribe normal and abnormal findings, but also support itself by visual evidence such as the location\nand attributes of the detected findings appearing in the image.\nInspired by the fact that radiologists often follow templates for writing reports and modify them\naccordingly for each individual case [5, 12, 10], we propose a Hybrid Retrieval-Generation Reinforced\nAgent (HRGR-Agent) which is the first attempt to incorporate human prior knowledge with learning-based\ngeneration for medical reports. 
HRGR-Agent employs a retrieval policy module to decide\nbetween automatically generating sentences by a generation module and retrieving specific sentences\nfrom the template database, and then sequentially generates multiple sentences via hierarchical\ndecision-making. The template database is built based on human prior knowledge collected from\navailable medical reports. To enable effective and robust report generation, we jointly train the\nretrieval policy module and generation module via reinforcement learning (RL) [30] guided by\nsentence-level and word-level rewards, respectively. Figure 1 shows an example report generated\nby our HRGR-Agent which correctly describes \"a small effusion\" from the chest x-ray image, and\nsuccessfully supports its finding by providing the appearance (\"blunting\") and location (\"costophrenic\nsulcus\") of the evidence.\nOur main contribution is to bridge rule-based (retrieval) and learning-based generation via reinforcement\nlearning, which can achieve plausible, correct and diverse medical report generation. Moreover,\nour HRGR-Agent has several technical merits compared to existing retrieval-generation-based\nmodels: 1) our retrieval and generation modules are updated and benefit from each other via policy\nlearning; 2) the retrieval actions are regarded as a part of the generation whose selection of templates\ndirectly influences the final generated result; 3) the generation module is encouraged to learn diverse\nand complicated sentences while the retrieval policy module learns template-like sentences, driven by\ndistinct word-level and sentence-level rewards, respectively. Other work such as [24] still enforces\nthe generative model to predict template-like sentences.\n\n[Figure 1 content]\nReport written by radiologists:\nComparison:\nIndication: 60-year-old male with seizure, ethanol abuse\nFindings: The heart size and mediastinal contours appear within normal limits. There is blunting of the right lateral costophrenic sulcus which could be secondary to a small effusion versus scarring. No focal airspace consolidation or pneumothorax. No acute bony abnormalities.\nImpression: Blunting of the right costophrenic sulcus could be secondary to a pleural effusion versus scarring.\nGenerated findings:\n[R]: The heart size is normal. There is mild effusion. No acute bony abnormalities.\n[G]: The heart size normal. No pleural effusion or pneumothorax. No acute bony abnormalities.\n[HRGR-Agent]: The heart size and mediastinal contours are normal. There is blunting of costophrenic sulcus suggesting a small effusion. No bony abnormalities.\n\n
We conduct extensive experiments on two medical image report datasets [8]. Our HRGR-Agent\nachieves the state-of-the-art performance on both datasets under three kinds of evaluation metrics:\nautomatic metrics such as CIDEr [33], BLEU [25] and ROUGE [20], human evaluation, and detection\nprecision of medical terminologies. Experiments show that the sentences generated by HRGR-Agent\nstrike a decent balance between concise template sentences and complicated, diverse sentences.\n\n2 Related Work\n\nVisual Captioning and Report Generation. Visual captioning aims at generating a descriptive\nsentence for images or videos. State-of-the-art approaches use CNN-RNN architectures and attention\nmechanisms [27, 41, 43, 28]. The generated sequence is usually short, describing only the dominating\nvisual event, and is primarily rewarded by language fluency in practice. Generating reports that are\ninformative and have multiple sentences [38, 16] poses higher requirements on content selection,\nrelation generation, and content ordering. The task differs from image captioning [43, 23] and\nsentence generation [14, 6] where usually single or few sentences are required, or summarization [2,\n44] where summaries tend to be more diverse without clear template sentences. 
State-of-the-art\nmethods on report generation [16] still largely clone expert behaviour, and are incapable\nof diversifying language and depicting rare but prominent findings. Our approach avoids merely\nmimicking teacher behaviour by relieving the automatic generative model of part of its burden through a template\nselection and retrieval mechanism, which by design promotes language diversity and better content\nselection.\nTemplate Based Sequence Generation. Some recent approaches bridge generative language\napproaches and traditional template-based methods. However, state-of-the-art approaches either\ntreat a retrieval mechanism as latent guidance [44], the impact of which on text generation is limited,\nor still encourage the generation network to mimic template-like sequences [24]. Our method is\nclose to previous copy mechanism work such as pointer-generator [2]; however, we differ\nin that: 1) our retrieval module aims to retrieve from an external common template base, which is\nparticularly effective for the task, as opposed to copying from a specific source article; 2) we formulate\nthe retrieval-generation choices as discrete actions (as opposed to soft weights as in previous work)\nand learn with hierarchical reinforcement learning for optimizing both short- and long-term goals.\nReinforcement Learning for Sequence Generation. Recently, reinforcement learning (RL) has\ngained increasing popularity in sequence generation [27, 3, 13] such as visual captioning [21,\n28, 18], text summarization [26], and machine translation [39]. Traditional methods use cross entropy\nloss which is prone to exposure bias [27, 31] and does not necessarily optimize evaluation metrics such\nas CIDEr [33], ROUGE [20], BLEU [25] and METEOR [4]. In contrast, reinforcement learning can\ndirectly use the evaluation metrics as reward and update model parameters via policy gradient. 
There\nhave been some recent efforts [42] devoted to applying hierarchical reinforcement learning (HRL) [7]\nwhere sequence generation is broken down into several sub-tasks each of which targets a chunk of\nwords. However, HRL for long report generation is still under-explored.\n\n3 Approach\n\nMedical image report generation aims at generating a report consisting of a sequence of sentences\n$Y = (y_1, y_2, \\ldots, y_M)$ given a set of medical images $I = \\{I_j\\}_{j=1}^{K}$ of a patient case. Each sentence\ncomprises a sequence of words $y_i = (y_{i,1}, y_{i,2}, \\ldots, y_{i,N})$, $y_{i,j} \\in V$, where $i$ is the index of sentences,\n$j$ the index of words, and $V$ the vocabulary of all output tokens. In order to generate long and\ntopic-coherent reports, we formulate the decoding process in a hierarchical framework that first\nproduces a sequence of hidden sentence topics, and then predicts words of each sentence conditioning\non each topic.\nIt is observed that doctors writing a report tend to follow certain patterns and reuse templates, while\nadjusting statements for each individual case when necessary. To mimic the procedure, we propose\nto combine retrieval and generation for automatic report generation. In particular, we first compile\nan off-the-shelf template database T that consists of a set of sentences that occur frequently in the\ntraining corpus. Such sentences typically describe general observations, and are often inserted into\nmedical reports, e.g., \"the heart size is normal\" and \"there is no pleural effusion or pneumothorax\".\n(Table 1 provides more examples.)\n\nFigure 2: Hybrid Retrieval-Generation Reinforced Agent. Visual features are encoded by a CNN\nand image encoder, and fed to a sentence decoder to recurrently generate hidden topic states. A\nretrieval policy module decides for each topic state to either automatically generate a sentence, or retrieve\na specific template from a template database. 
Dashed black lines indicate hierarchical policy learning.\nAs described in Figure 2, a set of images for each sample is first fed into a CNN to extract visual\nfeatures which are then transformed into a context vector by an image encoder. Then a sentence\ndecoder recurrently generates a sequence of hidden states $q = (q_1, q_2, \\ldots, q_M)$ which represent\nsentence topics. Given each topic state $q_i$, a retrieval policy module decides to either automatically\ngenerate a new sentence by invoking a generation module, or retrieve an existing template from the\ntemplate database. Both the retrieval policy module (that determines between automatic generation\nor template retrieval) and the generation module (that generates words) make discrete decisions\nand are updated via the REINFORCE algorithm [37, 30]. We devise sentence-level and word-level\nrewards accordingly for the two modules, respectively.\n\n3.1 Hybrid Retrieval-Generation Reinforced Agent\nImage Encoder. Given a set of images $\\{I_j\\}_{j=1}^{K}$, we first extract their features $\\{v_j\\}_{j=1}^{K}$ with a\npretrained CNN, and then average $\\{v_j\\}_{j=1}^{K}$ to obtain $v$. The image encoder converts $v$ into a context\nvector $h_v \\in \\mathbb{R}^{D}$ which is used as the visual input for all subsequent modules. Specifically, the image\nencoder is parameterized as a fully-connected layer, and the visual features are extracted from the\nlast convolution layer of a DenseNet [15] or VGG-19 [29].\nSentence Decoder. The sentence decoder comprises stacked RNN layers which generate a sequence\nof topic states $q$. We equip the stacked RNNs with an attention mechanism to enhance text generation,\ninspired by [32, 41, 23]. Each stacked RNN first generates an attentive context vector $c^s_i$, where\n$i$ indicates time steps, given the image context vector $h_v$ and previous hidden state $h^s_{i-1}$. 
It then\ngenerates a hidden state $h^s_i$ based on $c^s_i$ and $h^s_{i-1}$. The generated hidden state $h^s_i$ is further projected\ninto a topic space as $q_i$ and a stop control probability $z_i \\in [0, 1]$ through non-linear functions\nrespectively. Formally, the sentence decoder can be written as:\n\n$$c^s_i = F^s_{\\mathrm{attn}}(h_v, h^s_{i-1}) \\tag{1}$$\n$$h^s_i = F^s_{\\mathrm{RNN}}(c^s_i, h^s_{i-1}) \\tag{2}$$\n$$q_i = \\sigma(W_q h^s_i + b_q) \\tag{3}$$\n$$z_i = \\mathrm{Sigmoid}(W_z h^s_i + b_z), \\tag{4}$$\n\nwhere $F^s_{\\mathrm{attn}}$ denotes a function of the attention mechanism [28], $F^s_{\\mathrm{RNN}}$ denotes the non-linear functions\nof the stacked RNN, $W_q$ and $b_q$ are parameters which project hidden states into the topic space while\n$W_z$ and $b_z$ are parameters for stop control, and $\\sigma$ is a non-linear activation function. A stop control\nprobability $z_i$ greater than or equal to a predefined threshold (e.g. 0.5) stops the generation of\ntopic states, and thus the hierarchical report generation process.\nRetrieval Policy Module. Given each topic state $q_i$, the retrieval policy module takes two steps.\nFirst, it predicts a probability distribution $u_i \\in \\mathbb{R}^{1+|T|}$ over actions of generating a new sentence\nand retrieving from $|T|$ candidate template sentences. Based on the prediction of the first step, it\ntriggers different actions. If automatic generation obtains the highest probability, the generation\nmodule is activated to generate a sequence of words conditioned on the current topic state (the second\nrow on the right side of Figure 2). If a template in T obtains the highest probability, it is retrieved\nfrom the off-the-shelf template database and serves as the generation result of the current sentence topic\n(the first row on the right side of Figure 2). 
We reserve index 0 to indicate the probability of selecting\nautomatic generation and positive integers in $\\{1, \\ldots, |T|\\}$ to index the probability of selecting templates\nin T. The first step is parameterized as a fully-connected layer with Softmax activation:\n\n$$u_i = \\mathrm{Softmax}(W_u q_i + b_u) \\tag{5}$$\n$$m_i = \\mathrm{argmax}(u_i), \\tag{6}$$\n\nwhere $W_u$ and $b_u$ are network parameters, and the resulting $m_i$ is the index of the highest probability in\n$u_i$.\nGeneration Module. The generation module generates a sequence of words conditioned on the current topic\nstate $q_i$ and image context vector $h_v$ for each sentence. It comprises RNNs which take environment\nparameters and the previous hidden state $h^g_{i,t-1}$ as input, and generate a new hidden state $h^g_{i,t}$ which is\nfurther transformed to a probability distribution $a_{i,t}$ over all words in $V$, where $t$ indicates the $t$-th word.\nWe define environment parameters as a concatenation of the current topic state $q_i$, the context vector $c^g_{i,t}$\nencoded by following the same attention paradigm as in the sentence decoder, and the embedding of the previous\nword $e_{i,t-1}$. 
The procedure of generating each word is written as follows, which is an attentional\ndecoding step:\n\n$$c^g_{i,t} = F^g_{\\mathrm{attn}}(h_v, [e_{i,t-1}; q_i], h^g_{i,t-1}) \\tag{7}$$\n$$h^g_{i,t} = F^g_{\\mathrm{RNN}}([c^g_{i,t}; e_{i,t-1}; q_i], h^g_{i,t-1}) \\tag{8}$$\n$$a_t = \\mathrm{Softmax}(W_y h^g_{i,t} + b_y) \\tag{9}$$\n$$y_t = \\mathrm{argmax}(a_t) \\tag{10}$$\n$$e_{i,t} = W_e O(y_{i,t}), \\tag{11}$$\n\nwhere $F^g_{\\mathrm{attn}}$ denotes the attention mechanism of the generation module, $F^g_{\\mathrm{RNN}}$ denotes the non-linear functions\nof the RNNs, $W_y$ and $b_y$ are parameters for generating the word probability distribution, $y_{i,t}$ is the index of the\nmost probable word, $W_e$ is a learnable word embedding matrix initialized uniformly, and $O$\ndenotes a one-hot vector.\nReward Module. We use the automatic metric CIDEr for computing rewards since recent work on\nimage captioning [28] has shown that CIDEr performs better than many traditional automatic metrics\nsuch as BLEU, METEOR and ROUGE. We consider two kinds of reward functions: sentence-level\nreward and word-level reward. For the $i$-th generated sentence $y_i = (y_{i,1}, y_{i,2}, \\ldots, y_{i,N})$, either\nfrom retrieval or generation outputs, we compute a delta CIDEr score at sentence level, which is\n$R_{\\mathrm{sent}}(y_i) = f(\\{y_k\\}_{k=1}^{i}, \\mathrm{gt}) - f(\\{y_k\\}_{k=1}^{i-1}, \\mathrm{gt})$, where $f$ denotes CIDEr evaluation, and $\\mathrm{gt}$ denotes\nthe ground truth report. This assesses the advantage the generated sentence brings to the existing\nsentences when evaluating the quality of the whole report. For a single word, we use as reward the\ndelta CIDEr score $R_{\\mathrm{word}}(y_t) = f(\\{y_k\\}_{k=1}^{t}, \\mathrm{gt}_s) - f(\\{y_k\\}_{k=1}^{t-1}, \\mathrm{gt}_s)$, where $\\mathrm{gt}_s$ denotes the\nground truth sentence. 
The sentence-level and word-level rewards are used for computing the discounted\nreward for the retrieval policy module and generation module respectively.\n\n3.2 Hierarchical Reinforcement Learning\nOur objective is to maximize the reward of the generated report $Y$ compared to the ground truth report $Y^*$.\nOmitting the condition on image features for simplicity, the loss function can be written as:\n\n$$L(\\theta) = -\\mathbb{E}_{z,m,y}[R(Y, Y^*)] \\tag{12}$$\n$$\\nabla_\\theta L(\\theta) = -\\mathbb{E}_{z,m,y}[\\nabla_\\theta \\log p(z, m, y) R(Y, Y^*)] \\tag{13}$$\n$$= -\\mathbb{E}_{z,m,y}\\Big[\\sum_{i=1} \\mathbb{1}(z_i < \\tfrac{1}{2} \\mid z_{i-1}) \\Big(\\nabla_{\\theta_r} L(\\theta_r) + \\mathbb{1}(m_i = 0 \\mid m_{i-1}) \\nabla_{\\theta_g} L(\\theta_g)\\Big)\\Big], \\tag{14}$$\n\nwhere $\\theta$, $\\theta_r$, and $\\theta_g$ denote parameters of the whole network, retrieval policy module, and generation\nmodule respectively; $\\mathbb{1}(\\cdot)$ is the binary indicator; $z_i$ is the probability of topic stop control in Equation 4;\n$m_i$ is the action chosen by the retrieval policy module among automatic generation ($m_i = 0$) and all\ntemplates ($m_i \\in [1, |T|]$) in the template database. The loss of HRGR-Agent comes from two parts:\nretrieval policy module $L(\\theta_r)$ and generation module $L(\\theta_g)$ as defined below.\nPolicy Update for Retrieval Policy Module. We define the reward for the retrieval policy module $R_r$\nat sentence level. The generated sentence or retrieved template sentence is used for computing the\nreward. 
The discounted sentence-level reward and its corresponding policy update according to the\nREINFORCE algorithm [30] can be written as:\n\n$$R_r(y_i) = \\sum_{j=0}^{\\infty} \\gamma^j R_{\\mathrm{sent}}(y_{i+j}) \\tag{15}$$\n$$L(\\theta_r) = -\\mathbb{E}_{m_i}[R_r(m_i, m_i^*)] \\tag{16}$$\n$$\\nabla_{\\theta_r} L(\\theta_r) = -\\mathbb{E}_{m_i}[\\nabla_{\\theta_r} \\log p(m_i \\mid m_{i-1}) R_r(m_i, m_i^*)], \\tag{17}$$\n\nwhere $\\gamma$ is a discount factor; $y_i$ is the $i$-th generated sequence; and $\\theta_r$ represents parameters of the\nretrieval policy module which are $W_u$ and $b_u$ in Equation 5.\nPolicy Update for Generation Module. We define the word-level reward $R_g(y_t)$ for each word\ngenerated by the generation module as the discounted reward of all generated words after the considered\nword. The discounted reward function and its policy update for the generation module can be written as:\n\n$$R_g(y_t) = \\sum_{j=0}^{\\infty} \\gamma^j R_{\\mathrm{word}}(y_{t+j}) \\tag{18}$$\n$$L(\\theta_g) = -\\mathbb{E}_{y_t}[R_g(y_t, y_t^*)] \\tag{19}$$\n$$\\nabla_{\\theta_g} L(\\theta_g) = -\\mathbb{E}_{y_t}\\Big[\\sum_{t=1} \\nabla_{\\theta_g} \\log p(y_t \\mid y_{t-1}) R_g(y_t, y_t^*)\\Big], \\tag{20}$$\n\nwhere $\\gamma$ is a discount factor, and $\\theta_g$ represents the parameters of the generation module such as $W_y$, $b_y$,\n$W_e$ in Equations 9-11 and the parameters of the attention functions in Equation 7 and RNNs in Equation 8.\nThe detailed policy update algorithm is provided in the supplementary materials.\n\n4 Experiments and Analysis\n\nDatasets. We conduct experiments on two medical image report datasets. First, Indiana University\nChest X-Ray Collection (IU X-Ray) [8] is a public dataset consisting of 7,470 frontal and lateral-view\nchest x-ray images paired with their corresponding diagnostic reports. Each patient has 2 images and\na report which includes impression, findings, comparison and indication sections. 
We preprocess the\nreports by tokenizing, converting to lowercase, and keeping tokens with frequency no less than 3 as\nvocabulary, which results in 1185 unique tokens covering over 99.0% word occurrences in the corpus.\nCX-CHR is a proprietary internal dataset of chest X-ray images with Chinese reports collected from\na professional medical institution for health checking. The dataset consists of 35,500 patients. Each\npatient has one or multiple chest x-ray images in different views such as posteroanterior and lateral,\nand a corresponding Chinese report. We select patients with no more than 2 images and obtain\n33,236 patient samples in total, which covers over 93% of the dataset. We preprocess the reports\nthrough tokenizing by Jieba [1] and keeping tokens with frequency no less than 3 as vocabulary, which\nresults in 1282 unique tokens.\nOn both datasets, we randomly split the data by patients into training, validation and testing by a ratio\nof 7:1:2. There is no overlap between patients in different sets. We predict the \u2019findings\u2019 section as\nit is the most important component of reports. On the CX-CHR dataset, we pretrain a DenseNet with the\npublicly available ChestX-ray8 dataset [36] on classification, and fine-tune it on the CX-CHR dataset on 20\ncommon thorax disease labels. As the IU X-Ray dataset is relatively small, we do not directly fine-tune\nthe pretrained DenseNet on it, and instead extract visual features from a DenseNet pretrained jointly\non the ChestX-ray8 [36] and CX-CHR datasets. Please see Supplementary Material for more\ndetails.\nTemplate Database. We select sentences in the training set whose document frequencies (the number\nof occurrences of a sentence in training documents) are no less than a threshold as template candidates.\nWe further group candidates that express the same meaning but have slight linguistic variations. 
For\nexample, \"no pleural effusion or pneumothorax\" and \"there is no pleural effusion or pneumothorax\"\nare grouped as one template. This results in 97 templates with greater than 500 document frequency\nfor CX-CHR and 28 templates with greater than 100 document frequency for IU X-Ray. Upon\nretrieval, only the most frequent sentence of a template group will be retrieved for HRGR-Agent or\nany rule-based models that we compare with. Although this introduces minor but inevitable error in\nthe generated results, our experiments show that the error is negligible compared to the advantages\nthat a hybrid of retrieval-based and generation-based approaches brings. Besides, separating\ntemplates of the same meaning into different categories diminishes the capability of the retrieval policy\nmodule to predict the most suitable template for a given visual input, as multiple templates share the\nexact same meaning. Table 1 shows examples of templates for the IU X-Ray dataset. More template\nexamples are provided in supplementary materials.\n\nTemplate groups:\n- No pneumothorax or pleural effusion. / No pleural effusion or pneumothorax. / There is no pleural effusion or pneumothorax.\n- The lungs are clear / Lungs are clear. / The lung are clear bilaterally.\n- No evidence of focal consolidation, pneumothorax, or pleural effusion. / no focal consolidation, pneumothorax or large pleural effusion. / No focal consolidation, pleural effusion, or pneumothorax identified.\n- Cardiomediastin silhouett is within normal limit. / The cardiomediastin silhouett is within normal limit. / The cardiomediastin silhouett is within normal limit for size and contour.\ndf(%) values, in the order extracted: 18.36, 23.60, 6.55, 5.12\n\nTable 1: Examples of template database of IU X-Ray dataset. Each template is constructed by a group\nof sentences of the same meaning but slightly different linguistic variations. Top 3 most frequent\nsentences for a template are displayed in the first and third column. 
The second column shows\ndocument frequency (in percentage of training corpus) of each template.\nEvaluation Metrics. We use three kinds of evaluation metrics: 1) automatic metrics including\nCIDEr, ROUGE, and BLEU; 2) medical abnormality terminology detection accuracy: we select the 10\nmost frequent medical abnormality terminologies in medical reports and evaluate average precision\nand average false positive (AFP) of compared models; 3) human evaluation: we randomly select\n100 samples from the testing set for each method and conduct surveys through Amazon Mechanical\nTurk. Each survey question gives a ground truth report, and asks the participant to choose, among reports\ngenerated by different models, the one that best matches the ground truth report in terms of language\nfluency, content selection, and correctness of medical abnormal findings. A default choice is provided\nin case neither or both reports are preferred. We collect results from 20 participants and compute the\naverage preference percentage for each model excluding default choices.\nTraining Details. We implement our model in PyTorch and train on a GeForce GTX TITAN GPU.\nWe first train all models with cross entropy loss for 30 epochs with an initial learning rate of 5e-4,\nand then fine-tune the retrieval policy module and generation module of HRGR-Agent via RL with\na fixed learning rate of 5e-5 for another 30 epochs. We use 512 as the dimension of all hidden states and\nword embeddings, and a batch size of 16. 
We set the maximum number of sentences of a report and\nmaximum number of tokens in a sentence as 18 and 44 for CX-CHR and 7 and 15 for IU X-Ray.\nBesides, as observed from baseline models which overly predict the most popular and normal reports\nfor all testing samples, and given that most medical reports describe normal cases, we add post-processing\nto increase the length and comprehensiveness of the generated reports for both datasets\nwhile maintaining the design of HRGR-Agent to better predict abnormalities. Specifically,\nwe first select the 4 key words most commonly predicted with normal descriptions by\nthe baselines; then, for each key word, if the generated report describes neither an abnormality\nnor normality for it, we add a corresponding sentence that\ndescribes its normal case. The key words for IU X-Ray are \u2019heart size and mediastinal\ncontours\u2019, \u2019pleural effusion or pneumothorax\u2019, \u2019consolidation\u2019, and \u2019lungs are clear\u2019. As observed\nin our experiments, this step maintains the same medical abnormality term detection results, and\nimproves the automatic report generation metrics, especially on BLEU-n metrics.\nBaselines. On both datasets, we compare with four state-of-the-art image captioning models: CNN-\nRNN [34], LRCN [9], AdaAtt [23], and Att2in [28]. Visual features for all models are extracted\nfrom the last convolutional layer of the pretrained DenseNets mentioned in Section 4, yielding\n16 \u00d7 16 \u00d7 256 feature maps for both datasets. We use greedy search and argmax sampling for\nHRGR-Agent and the baselines on both datasets. On the IU X-Ray dataset, we also compare with\nCoAtt [16] which uses different visual features extracted from a pretrained ResNet [11]. 
The authors\nof CoAtt [16] re-trained their model using our train/test split and provided evaluation results for\nautomatic report generation metrics using greedy search and sampling temperature 0.5 at test time. We\nfurther evaluated their predictions to obtain medical abnormality terminology detection precision and\nAFP.\n\nDataset | Model | CIDEr | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | ROUGE\nCX-CHR | CNN-RNN [34] | 1.580 | 0.590 | 0.506 | 0.450 | 0.411 | 0.577\nCX-CHR | LRCN [9] | 1.588 | 0.593 | 0.508 | 0.452 | 0.413 | 0.577\nCX-CHR | AdaAtt [23] | 1.568 | 0.588 | 0.503 | 0.446 | 0.409 | 0.575\nCX-CHR | Att2in [28] | 1.566 | 0.587 | 0.503 | 0.446 | 0.408 | 0.576\nCX-CHR | Generation | 0.361 | 0.307 | 0.216 | 0.160 | 0.121 | 0.322\nCX-CHR | Retrieval | 2.565 | 0.535 | 0.475 | 0.437 | 0.409 | 0.536\nCX-CHR | HRG | 2.800 | 0.629 | 0.547 | 0.497 | 0.463 | 0.588\nCX-CHR | HRGR-Agent | 2.895 | 0.673 | 0.587 | 0.530 | 0.486 | 0.612\nIU X-Ray | CNN-RNN [34] | 0.294 | 0.216 | 0.124 | 0.087 | 0.066 | 0.306\nIU X-Ray | LRCN [9] | 0.284 | 0.223 | 0.128 | 0.089 | 0.067 | 0.305\nIU X-Ray | AdaAtt [23] | 0.295 | 0.220 | 0.127 | 0.089 | 0.068 | 0.308\nIU X-Ray | Att2in [28] | 0.297 | 0.224 | 0.129 | 0.089 | 0.068 | 0.308\nIU X-Ray | CoAtt* [16] | 0.277 | 0.455 | 0.288 | 0.205 | 0.154 | 0.369\nIU X-Ray | HRGR-Agent | 0.343 | 0.438 | 0.298 | 0.208 | 0.151 | 0.322\n\nTable 2: Automatic evaluation results on CX-CHR (upper part) and IU X-Ray (lower part).\nBLEU-n denotes the BLEU score using up to n-grams.\n\nDataset | CX-CHR | CX-CHR | CX-CHR | IU X-Ray | IU X-Ray | IU X-Ray\nModel | Retrieval | Generation | HRGR-Agent | CNN-RNN [34] | CoAtt [16] | HRGR-Agent\nPrec. (%) | 14.13 | 27.50 | 29.19 | 0.00 | 5.01 | 12.14\nAFP | 0.133 | 0.064 | 0.059 | 0.000 | 0.019 | 0.043\nHit (%) | \u2013 | 23.42 | 52.32 | \u2013 | 28.00 | 48.00\n\nTable 3: Average precision (Prec.) and average false positive (AFP) of medical abnormality terminology\ndetection, and human evaluation (Hit). The higher the Prec. and the lower the AFP, the better.\n\n
Due to the relatively large size of CX-CHR, we conduct additional experiments on it to compare\nHRGR-Agent with variants obtained by removing individual components (Retrieval, Generation,\nRL). We train a hierarchical generative model (Generation) without any template retrieval or RL\nfine-tuning, and our model without RL fine-tuning (HRG). To examine the quality of our pre-defined\ntemplates, we separately evaluate the retrieval policy module of HRGR-Agent by masking out the\ngeneration part and using only the retrieved templates as the prediction (Retrieval). Note that Retrieval\nuses the same model as HRGR-Agent, whose training involves automatic generation of sentences;\nits results may therefore be higher than those of a general retrieval-based system (e.g. one that directly performs\nclassification among a list of majority sentences given image features).\n\n4.1 Results and Analyses\n\nAutomatic Evaluation. Table 2 shows the automatic evaluation comparison of state-of-the-art methods\nand our model variants. Most importantly, HRGR-Agent outperforms all baseline models that have\nno retrieval mechanism or hierarchical structure on both datasets by large margins, demonstrating its\neffectiveness and robustness. On the IU X-Ray dataset, HRGR-Agent achieves slightly lower BLEU-1, BLEU-4,\nand ROUGE scores than CoAtt [16]. However, CoAtt uses different pre-processing of\nreports and visual features, jointly predicts \u2019impression\u2019 and \u2019findings\u2019, and uses single-image input,\nwhile our method focuses on \u2019findings\u2019 and uses combined frontal and lateral views of patients. On\nCX-CHR, HRGR-Agent increases the CIDEr score by 0.73 compared to HRG, demonstrating that\nreinforcement fine-tuning is crucial to the performance increase since it directly optimizes the evaluation\nmetric. 
Besides, Retrieval surpasses Generation by relatively large margins, showing that the retrieval-based\nmethod is beneficial for generating structured reports, which boosts the performance of\nHRGR-Agent when combined with a neural generation approach (the generation module). For a closer look at\nHRGR-Agent\u2019s behavior, each generated report at test time has on average 7.2 and 4.8\nsentences for the CX-CHR and IU X-Ray datasets, respectively. The percentages of retrieval vs. generation\nare 83.5 vs. 16.5 on CX-CHR and 82.0 vs. 18.0 on IU X-Ray.\nMedical Abnormality Terminology Evaluation. Table 3 shows the evaluation results of average precision\nand average false positive of medical abnormality terminology detection. HRGR-Agent achieves\nthe highest precision on both datasets; its AFP is the lowest on CX-CHR and only slightly higher than that of CoAtt on IU X-Ray,\ndemonstrating its robustness in detecting rare abnormal findings, which are among the most important components of medical reports.\n\nExample 1\nGround Truth: The cardiomediastinal silhouette is within normal limits. Calcified right lower lobe granuloma. No focal airspace consolidation. No visualized pneumothorax or large pleural effusion. No acute bony abnormalities.\nCoAtt [16]: The heart is normal in size. The mediastinum is unremarkable. The lungs are clear.\nHRGR-Agent: The cardiomediastinal silhouette is normal size and configuration. Pulmonary vasculature within normal limits. There is right middle lobe airspace disease, may reflect granuloma or pneumonia. No pleural effusion. No pneumothorax. No acute bony abnormalities.\n\nExample 2\nGround Truth: Exam limited by patient rotation. Mild rightward deviation of the trachea. Stable cardiomegaly. Unfolding of the thoracic aorta. Persistent right pleural effusion with adjacent atelectasis. Low lung volumes. No focal airspace consolidation. There is severe degenerative changes of the right shoulder.\nCoAtt [16]: The heart size and pulmonary vascularity appear within normal limits. The lungs are free of focal airspace disease. No pleural effusion or pneumothorax. No acute bony abnormality.\nHRGR-Agent: The heart is enlarged. Possible cardiomegaly. There is pulmonary vascular congestion with diffusely increased interstitial and mild patchy airspace opacities. Suspicious pleural effusion. There is no pneumothorax. There are no acute bony findings.\n\nExample 3\nGround Truth: Frontal and lateral views of the chest with overlying external cardiac monitor leads show reduced lung volumes with bronchovascular crowding of basilar atelectasis. No definite focal airspace consolidation or pleural effusion. The cardiac silhouette appears mildly enlarged.\nCoAtt [16]: The heart size and pulmonary vascularity appear within normal limits. The lungs are free of focal airspace disease. No pleural effusion or pneumothorax. no acute bony abnormality.\nHRGR-Agent: The heart is mildly enlarged. The aorta is atherosclerotic and ectatic. Chronic parenchymal changes are noted with mild scarring and/or subsegmental atelectasis in the right lung base. No focal consolidation or significant pleural effusion identified. Costophrenic UNK are blunted.\n\nExample 4\nGround Truth: Apparent cardiomegaly partially accentuated by low lung volumes. No focal consolidation, pneumothorax or large pleural effusion. Right base calcified granuloma. Stable right infrahilar nodular density (lateral view). Negative for acute bone abnormality.\nCoAtt [16]: The heart is normal in size. The mediastinum is unremarkable. The lungs are clear.\nHRGR-Agent: The heart size and pulmonary vascularity appear within normal limits. Low lung volumes. Suspicious calcified granuloma. No pleural effusion or pneumothorax. No acute bony abnormality.\n\nFigure 3: Examples of ground truth report and generated reports by CoAtt [16] and HRGR-Agent.\nHighlighted phrases are medical abnormality terms. Italicized text is retrieved from template database.\nRetrieval vs. 
Generation. It is worth noting that on CX-CHR, Retrieval achieves higher automatic\nevaluation scores (Table 2, 7th row) but lower medical term detection precision (Table 3, 2nd\ncolumn) than Generation. Note that Retrieval evaluates the retrieval policy module of HRGR-Agent by\nmasking out the outputs of the generation module. The result shows that simply composing\ntemplates that mostly describe normal medical findings can lead to high automatic evaluation scores,\nsince the majority of reports describe normal cases. However, this kind of retrieval-based approach\nlacks the capability of detecting significant but rare abnormal findings. On the other hand, the\nhigh medical abnormality term detection precision and low average false positive of HRGR-Agent\nverify that its generation module learns to describe abnormal findings. The win-win combination of the\nretrieval policy module and the generation module leads to the state-of-the-art performance of HRGR-Agent,\nsurpassing a generative model (Generation) that is trained purely without any retrieval mechanism.\nHuman Evaluation. Table 3 (last row) shows the average human preference percentage of HRGR-Agent\ncompared with Generation and CoAtt [16] on CX-CHR and IU X-Ray, respectively, evaluated in terms\nof content coverage, specific terminology accuracy, and language fluency. HRGR-Agent achieves\nmuch higher human preference than the baseline models, showing that it is able to generate natural and\nplausible reports that humans prefer.\nQualitative Analysis. Figure 3 shows qualitative results of HRGR-Agent and baseline models\non the IU X-Ray dataset. The reports of HRGR-Agent are generally longer than those of the baseline\nmodels, and show a good balance of templates and generated sentences. 
Moreover, among the generated\nsentences, HRGR-Agent has a higher rate of detecting abnormal findings.\n\n5 Conclusion\n\nIn this paper, we introduce a novel Hybrid Retrieval-Generation Reinforced Agent (HRGR-Agent)\nto perform robust medical image report generation. Our approach is the first attempt to bridge\nhuman prior knowledge and generative neural networks via reinforcement learning. Experiments\nshow that HRGR-Agent not only achieves state-of-the-art performance on two medical image\nreport datasets, but also generates robust reports that have high precision in detecting abnormal medical findings\nand the best human preference.\n\nReferences\n\n[1] \"jieba\" (chinese for \"to stutter\") chinese text segmentation: built to be the best python chinese word segmentation module. https://github.com/fxsjy/jieba, 2018. Accessed: 2018-05-01.\n\n[2] A. See, P. J. Liu, and C. D. Manning. Get to the point: Summarization with pointer-generator networks. In ACL, 2017.\n\n[3] D. Bahdanau, P. Brakel, K. Xu, A. Goyal, R. Lowe, J. Pineau, A. Courville, and Y. Bengio. An actor-critic algorithm for sequence prediction. In ICLR, 2017.\n\n[4] S. Banerjee and A. Lavie. Meteor: An automatic metric for mt evaluation with improved correlation with human judgments. In ACL workshop, 2005.\n\n[5] J. M. Bosmans, J. J. Weyler, A. M. De Schepper, and P. M. Parizel. The radiology report as seen by radiologists and referring clinicians: results of the cover and rover surveys. Radiology, 2011.\n\n[6] S. R. Bowman, L. Vilnis, O. Vinyals, A. M. Dai, R. Jozefowicz, and S. Bengio. Generating sentences from a continuous space. In CoNLL, 2016.\n\n[7] P. Dayan and G. E. Hinton. Feudal reinforcement learning. In NeurIPS, 1993.\n\n[8] D. Demner-Fushman, M. D. Kohli, M. B. Rosenman, S. E. Shooshan, L. Rodriguez, S. Antani, G. R. Thoma, and C. J. McDonald. 
Preparing a collection of radiology examinations for distribution and retrieval. Journal of the American Medical Informatics Association, 2015.\n\n[9] J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell. Long-term recurrent convolutional networks for visual recognition and description. In CVPR, 2015.\n\n[10] S. K. Goergen, F. J. Pool, T. J. Turner, J. E. Grimm, M. N. Appleyard, C. Crock, M. C. Fahey, M. F. Fay, N. J. Ferris, S. M. Liew, et al. Evidence-based guideline for the written radiology report: Methods, recommendations and implementation challenges. Journal of medical imaging and radiation oncology, 2013.\n\n[11] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.\n\n[12] Y. Hong and C. E. Kahn. Content analysis of reporting templates and free-text radiology reports. Journal of digital imaging, 2013.\n\n[13] Z. Hu, H. Shi, Z. Yang, B. Tan, T. Zhao, J. He, W. Wang, X. Yu, L. Qin, D. Wang, et al. Texar: A modularized, versatile, and extensible toolkit for text generation. arXiv preprint arXiv:1809.00794, 2018.\n\n[14] Z. Hu, Z. Yang, X. Liang, R. Salakhutdinov, and E. P. Xing. Toward controlled generation of text. In ICML, 2017.\n\n[15] G. Huang, Z. Liu, K. Q. Weinberger, and L. van der Maaten. Densely connected convolutional networks. In CVPR, 2017.\n\n[16] B. Jing, P. Xie, and E. Xing. On the automatic generation of medical imaging reports. In ACL, 2018.\n\n[17] A. Karpathy and L. Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In CVPR, 2015.\n\n[18] L. Li and B. Gong. End-to-end video captioning with multitask reinforcement learning. In ICLR, 2017.\n\n[19] X. Liang, Z. Hu, H. Zhang, C. Gan, and E. P. Xing. Recurrent topic-transition gan for visual paragraph generation. In ICCV, 2017.\n\n[20] C.-Y. Lin. Rouge: A package for automatic evaluation of summaries. In ACL workshop, 2004.\n\n[21] S. Liu, Z. Zhu, N. Ye, S. 
Guadarrama, and K. Murphy. Improved image captioning via policy gradient optimization of SPIDEr. In ICCV, 2017.\n\n[22] Y. Liu, J. Fu, T. Mei, and C. W. Chen. Let your photos talk: Generating narrative paragraph for photo stream via bidirectional attention recurrent neural networks. In AAAI, 2017.\n\n[23] J. Lu, C. Xiong, D. Parikh, and R. Socher. Knowing when to look: Adaptive attention via a visual sentinel for image captioning. In CVPR, 2017.\n\n[24] J. Lu, J. Yang, D. Batra, and D. Parikh. Neural baby talk. In CVPR, 2018.\n\n[25] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu. Bleu: a method for automatic evaluation of machine translation. In ACL, 2002.\n\n[26] R. Paulus, C. Xiong, and R. Socher. A deep reinforced model for abstractive summarization. In ICLR, 2018.\n\n[27] M. Ranzato, S. Chopra, M. Auli, and W. Zaremba. Sequence level training with recurrent neural networks. In ICLR, 2016.\n\n[28] S. J. Rennie, E. Marcheret, Y. Mroueh, J. Ross, and V. Goel. Self-critical sequence training for image captioning. In CVPR, 2017.\n\n[29] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.\n\n[30] R. S. Sutton and A. G. Barto. Reinforcement learning: An introduction, volume 1. MIT press Cambridge, 1998.\n\n[31] B. Tan, Z. Hu, Z. Yang, R. Salakhutdinov, and E. Xing. Connecting the dots between mle and rl for text generation. 2018.\n\n[32] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, \u0141. Kaiser, and I. Polosukhin. Attention is all you need. In NeurIPS, 2017.\n\n[33] R. Vedantam, C. Lawrence Zitnick, and D. Parikh. Cider: Consensus-based image description evaluation. In CVPR, 2015.\n\n[34] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and tell: A neural image caption generator. In CVPR, 2015.\n\n[35] X. Wang, W. Chen, Y.-F. Wang, and W. Y. Wang. 
No metrics are perfect: Adversarial reward learning for visual storytelling. In ACL, 2018.\n\n[36] X. Wang, Y. Peng, L. Lu, Z. Lu, M. Bagheri, and R. M. Summers. Chestx-ray8: Hospital-scale chest x-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. In CVPR, 2017.\n\n[37] R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. In Reinforcement Learning, pages 5\u201332. Springer, 1992.\n\n[38] S. Wiseman, S. M. Shieber, and A. M. Rush. Challenges in data-to-document generation. In EMNLP, 2017.\n\n[39] Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, et al. Google\u2019s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016.\n\n[40] Z. Yang, Y. Yuan, Y. Wu, R. Salakhutdinov, and W. W. Cohen. Encode, review, and decode: Reviewer module for caption generation. In NeurIPS, 2016.\n\n[41] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhutdinov, R. Zemel, and Y. Bengio. Show, attend and tell: Neural image caption generation with visual attention. In ICML, 2015.\n\n[42] D. Yarats and M. Lewis. Hierarchical text generation and planning for strategic dialogue. In EMNLP, 2017.\n\n[43] Q. You, H. Jin, Z. Wang, C. Fang, and J. Luo. Image captioning with semantic attention. In CVPR, 2016.\n\n[44] Z. Cao, W. Li, S. Li, and F. Wei. Retrieve, rerank and rewrite: Soft template based neural summarization. In ACL, 2018.\n", "award": [], "sourceid": 778, "authors": [{"given_name": "Yuan", "family_name": "Li", "institution": "Duke University"}, {"given_name": "Xiaodan", "family_name": "Liang", "institution": "Sun Yat-sen University"}, {"given_name": "Zhiting", "family_name": "Hu", "institution": "Carnegie Mellon University"}, {"given_name": "Eric", "family_name": "Xing", "institution": "Petuum Inc. 
/  Carnegie Mellon University"}]}