{"title": "Text-Based Interactive Recommendation via Constraint-Augmented Reinforcement Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 15214, "page_last": 15224, "abstract": "Text-based interactive recommendation provides richer user preferences and has demonstrated advantages over traditional interactive recommender systems. However, recommendations can easily violate preferences of users from their past natural-language feedback, since the recommender needs to explore new items for further improvement. To alleviate this issue, we propose a novel constraint-augmented reinforcement learning (RL) framework to efficiently incorporate user preferences over time. Specifically, we leverage a discriminator to detect recommendations violating user historical preference, which is incorporated into the standard RL objective of maximizing expected cumulative future rewards. Our proposed framework is general and is further extended to the task of constrained text generation. Empirical results show that the proposed method yields consistent improvement relative to standard RL methods.", "full_text": "Reward Constrained Interactive Recommendation with Natural Language Feedback

Ruiyi Zhang1*, Tong Yu2*, Yilin Shen2, Hongxia Jin2, Changyou Chen3, Lawrence Carin1

1 Duke University, 2 Samsung Research America, 3 University at Buffalo

Abstract

Text-based interactive recommendation provides richer user feedback and has demonstrated advantages over traditional interactive recommender systems. However, recommendations can easily violate preferences of users from their past natural-language feedback, since the recommender needs to explore new items for further improvement. To alleviate this issue, we propose a novel constraint-augmented reinforcement learning (RL) framework to efficiently incorporate user preferences over time.
Specifically, we leverage a discriminator to detect recommendations violating user historical preference, which is incorporated into the standard RL objective of maximizing expected cumulative future rewards. Our proposed framework is general and is further extended to the task of constrained text generation. Empirical results show that the proposed method yields consistent improvement relative to standard RL methods.

1 Introduction

Traditional recommender systems depend heavily on user history. However, these approaches, when implemented in an offline manner, cannot provide satisfactory performance due to sparse history data and unseen dynamic new items (e.g., new products, recent movies, etc.). Recent work on recommender systems has sought to interact with users, to adapt to user preferences over time. Most existing interactive recommender systems are designed based on simple user feedback, such as clicking data or updated ratings [6, 29, 32]. However, this type of feedback contains little information to reflect complex user attitudes toward various aspects of an item. For example, a user may like the graphic of a dress but not its color. A click or numeric rating is typically not sufficient to express such a preference, and thus it may lead to poor recommendations. By contrast, allowing a recommender system to use natural-language feedback provides richer information for future recommendation, especially for visual item recommendation [19, 20]. With natural-language feedback, a user can describe features of desired items that are lacking in the current recommended items. The system can then incorporate the feedback and subsequently recommend more suitable items. This type of recommendation is referred to as text-based interactive recommendation.

Flexible feedback with natural language may still induce undesired issues.
For example, a system may ignore previous interactions and keep recommending similar items about which a user has already expressed a preference. To tackle these issues, we propose a reward constrained recommendation (RCR) framework, where one sequentially incorporates constraints from previous feedback into the recommendation. Specifically, we formulate text-based interactive recommendation as a constraint-augmented reinforcement learning (RL) problem. Compared to standard constraint-augmented RL, there are no explicit constraints in text-based interactive recommendation. To this end, we use a discriminator to detect violations of user preferences in an adversarial manner. To further validate our proposed RCR framework, we extend it to constrained text generation to discourage undesired text generation.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

* Equal contribution. Work done while RZ was a part-time research intern at Samsung Research America.

The main contributions of this paper are summarized as follows. (i) A novel reward constrained recommendation framework is developed for text-based interactive recommendation, where constraints work as a dynamically updated critic to penalize the recommender. (ii) A novel way of defining constraints is proposed, in an adversarial manner, with better generalization. (iii) Extensive empirical evaluations are performed on text-based interactive recommendation and constrained text generation tasks, demonstrating consistent performance improvement over existing approaches.

2 Background

2.1 Reinforcement Learning

Reinforcement learning aims to learn an optimal policy for an agent interacting with an unknown (and often highly complex) environment. A policy is modeled as a conditional distribution π(a|s), specifying the probability of choosing action a ∈ A when in state s ∈ S.
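As a toy illustration of such a policy (not the paper's model), a categorical distribution π(a|s) can be produced by a softmax over per-state action scores; the scores below are arbitrary stand-ins:

```python
import math
import random

def softmax(scores):
    # Numerically stable softmax: shift by the max score before exponentiating.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def sample_action(state_scores, rng):
    # pi(a|s): draw an action index from the categorical distribution
    # defined by the softmax of the state-conditioned scores.
    probs = softmax(state_scores)
    u, acc = rng.random(), 0.0
    for a, p in enumerate(probs):
        acc += p
        if u <= acc:
            return a
    return len(probs) - 1  # guard against floating-point round-off

rng = random.Random(0)
probs = softmax([2.0, 1.0, 0.1])  # hypothetical scores for three actions
```

Higher-scored actions receive proportionally higher probability, while every action keeps nonzero mass, which is what lets an RL agent keep exploring.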
Formally, an RL problem is characterized by a Markov decision process (MDP) [38], M = ⟨S, A, P, R⟩. In this work, we consider recommendation for finite-horizon environments with the average reward criterion. If the agent chooses action a ∈ A at state s ∈ S, then the agent receives an immediate reward r(s, a), and the state transits to s' ∈ S with probability P(s'|s, a). The expected total reward of a policy π is defined as [42]:

$$J_R(\pi) = \sum_{t=1}^{\infty} \mathbb{E}_{P,\pi}[r(s_t, a_t)]. \quad (1)$$

In (1) the sum is over infinite time steps, but in practice we will be interested in finite horizons. The goal of an agent is to learn an optimal policy that maximizes J_R(π). A constrained Markov decision process (CMDP) [3] extends the MDP framework by introducing the constraint C(s, a) (mapping a state-action pair to costs, similar to the usual reward) 2 and a threshold α ∈ [0, 1]. Denoting the expectation over the constraint C(s, a) as $J_C(\pi) = \sum_{t=1}^{\infty} \mathbb{E}_{P,\pi}[C(s_t, a_t)]$, the constrained policy optimization thus becomes [1]:

$$\max_{\pi \in \Pi} J_R(\pi), \quad \text{s.t. } J_C(\pi) \le \alpha. \quad (2)$$

2.2 Text-based Interactive Recommendation as Reinforcement Learning

We employ an RL-based formulation for sequential recommendation of items to users, utilizing user feedback in natural language. Denote s_t ∈ S as the state of the recommendation environment at time t and a_t ∈ A as the recommended item from the candidate item set A. In the context of a recommendation system, as discussed further below, the state s_t corresponds to the state of the sequential recommender, implemented via an LSTM [23] state tracker. At time t, the system recommends item a_t based on the current state s_t.
After viewing item a_t, a user may comment on the recommendation in natural language (a sequence of natural-language text) x_t, as feedback. The recommender then receives a reward r_t and perceives the new state s_{t+1}. Accordingly, we can model the recommendation-feedback loop as an MDP M = ⟨S, A, P, R⟩, where P : S × A × S ↦ R is the environment dynamic of recommendation and R : S × A ↦ R is the reward function used to evaluate recommended items. The recommender seeks to learn a policy parameterized by θ, i.e., π_θ(a|s), which corresponds to the distribution of items conditioned on the current state of the recommender. The recommender is represented as an optimal policy that maximizes the expected reward $J_R(\pi) = \sum_t \mathbb{E}_{P,\pi}[r(s_t, a_t)]$. At each time step, the recommender sequentially selects potential desired items via $a_t = \arg\max_{a \in A} \pi_\theta(a|s_t)$.

3 Proposed Method

In text-based interactive recommendation, users provide natural-language-based feedback. We consider the recommendation of visual items [19, 20]. As shown in Figure 2, the system recommends an item to the user, with its visual appearance. The user then views the recommended item and gives feedback in natural language, describing the desired aspects that the current recommended item lacks. The system then incorporates the user feedback and recommends (ideally) more-suitable items, until the desired item is found. While users provide natural-language feedback on recommendations, standard RL methods may overlook the information in the feedback and recommend items that violate the user's previous feedback. To better understand this issue, consider the example in Figure 2. In round 3, the system forgets, and recommends an item that violates the previous user preference on the 'ankle boots'.

2 For simplicity, we here only introduce one constraint function; in practice, there may be many constraint functions.

Figure 1: Overview of the reward constrained recommender model. When receiving the recommended images, the user gives natural-language feedback, and this feedback will be used for the next item recommendation, as well as for preventing future violations.

Figure 2: An example of text-based interactive recommendation. (Feedback in rounds 1-4: "I prefer ankle boots.", "I prefer shoes with suede texture.", "I prefer ankle boots.", "I prefer shoes with moc toe.")

To alleviate this issue, we consider using feedback from users as constraints, and formulate text-based interactive recommendation as a constrained policy optimization problem. The difference between the investigated problem and conventional constrained policy optimization [3, 5] is that constraints are added sequentially, affecting the search space of a policy in a different manner. Our model is illustrated in Figure 1.

3.1 Recommendation as Constrained Policy Optimization

We consider an RL environment with a large number of discrete actions, deterministic transitions, and deterministic terminal returns. Suppose we have the user preference as constraints J_C(π_θ) when making recommendations. The objective of learning a recommender is defined as:

$$J_R(\pi_\theta) = \sum_{t=1}^{\infty} \mathbb{E}_{P,\pi_\theta}[r(s_t, a_t)], \quad \text{s.t. } J_C(\pi_\theta) \le \alpha. \quad (3)$$

If one naively imposes previous user preferences as a hard constraint, i.e., exact attribute matching, it usually leads to a sub-optimal solution. To alleviate this issue, we propose to use a learned constraint function based on the visual and textual information.

Constraint Functions In text-based interactive recommendation, we explicitly use the user preference as constraints.
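Both J_R and J_C in the constrained objective are plain expectations over trajectories, so either can be estimated by Monte Carlo rollouts. A minimal sketch with an invented two-action toy environment (not the recommendation setting); the environment and policy here are illustrative stand-ins:

```python
import random

def rollout_estimates(policy, env_step, horizon, episodes, rng):
    """Monte Carlo estimates of J_R (expected cumulative reward) and
    J_C (expected cumulative constraint cost) under a fixed policy."""
    jr = jc = 0.0
    for _ in range(episodes):
        s = 0  # toy initial state
        for _ in range(horizon):
            a = policy(s, rng)
            s, r, c = env_step(s, a, rng)
            jr += r
            jc += c
    return jr / episodes, jc / episodes

# Toy environment: action 1 earns more reward but incurs a constraint cost.
def env_step(s, a, rng):
    r = 1.0 if a == 1 else 0.5
    c = 1.0 if a == 1 else 0.0
    return s, r, c

def policy(s, rng):
    # Fixed policy: pick the risky action half of the time.
    return 1 if rng.random() < 0.5 else 0

rng = random.Random(0)
jr, jc = rollout_estimates(policy, env_step, horizon=10, episodes=2000, rng=rng)
# For this half-and-half policy, jr is close to 7.5 and jc close to 5.0.
```

With such estimates in hand, one can check whether a candidate policy satisfies J_C(π) ≤ α before (or while) optimizing J_R.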
Specifically, we exploit user feedback and impose it as sequentially added constraints. To generalize well on the constraints, we learn a discriminator C_φ parameterized by φ as the constraint function. We define two distributions on feedback-recommendation pairs, i.e., the non-violation distribution p_r and the violation distribution p_f (details provided in Appendix A.2). The objective of the discriminator is to minimize the following objective:

$$\mathcal{L}(\phi) = -\mathbb{E}_{(s,a)\sim p_f}[\log(C_\phi(s, a))] - \mathbb{E}_{(s,a)\sim p_r}[\log(1 - C_\phi(s, a))]. \quad (4)$$

With the discriminator as the constraint, i.e., $J_{C_\phi}(\pi_\theta) \triangleq \sum_{t=1}^{\infty} \mathbb{E}_{P,\pi_\theta}[C_\phi(s_t, a_t)]$, the constraint is imposed. However, directly solving the constrained-optimization problem in (3) is difficult, and we employ the Lagrange relaxation technique [4] to transform the original objective to an equivalent problem:

$$\min_{\lambda \ge 0} \max_{\theta} L(\lambda, \theta, \phi) = \min_{\lambda \ge 0} \max_{\theta} \left[ J_R(\pi_\theta) - \lambda \cdot (J_{C_\phi}(\pi_\theta) - \alpha) \right], \quad (5)$$
where λ ≥ 0 is a Lagrange multiplier. Note that as λ increases, the solution to (5) converges to that of (3). The goal is to find a saddle point (θ*(λ*), λ*) of (5), which can be achieved approximately by alternating gradient descent/ascent.
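The alternating scheme can be illustrated on a deterministic toy problem (not the recommender objective): maximize −(x−2)² subject to x ≤ 1, with the step sizes chosen for illustration:

```python
def lagrangian_saddle(steps=2000, eta_x=0.1, eta_lam=0.1):
    """Saddle point of L(x, lam) = -(x-2)**2 - lam*(x - 1) by
    gradient ascent on x and projected gradient ascent on lam >= 0."""
    x, lam = 0.0, 0.0
    for _ in range(steps):
        x += eta_x * (-2.0 * (x - 2.0) - lam)      # ascend L in x
        lam = max(0.0, lam + eta_lam * (x - 1.0))  # raise lam on violation
    return x, lam

x, lam = lagrangian_saddle()
# Converges to the constrained optimum x = 1 with multiplier lam = 2.
```

The multiplier grows while the constraint x ≤ 1 is violated and stops changing once it is met, which is the same mechanism the framework uses to trade reward against violations.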
Specifically, the gradient of (5) can be estimated using the policy gradient [42] as:

$$\nabla_\theta L(\theta, \lambda, \phi) = \mathbb{E}_{P,\pi}[(r(s_t, a_t) - \lambda C_\phi(s_t, a_t)) \nabla_\theta \log \pi_\theta(a_t|s_t)], \quad (6)$$

$$\nabla_\lambda L(\theta, \lambda, \phi) = -(\mathbb{E}_{P,\pi}[C_\phi(s_t, a_t)] - \alpha), \quad (7)$$

where C_φ(s_t, a_t) is the general constraint, specified in the following.

Penalized Reward Functions Note that the update in (6) is similar to the actor-critic method [42]. While the original use of a critic in reinforcement learning was for variance reduction [42], here we use it to penalize the policy for constraint violations. To enforce the constraints, λ is also optimized using the policy gradient via (7). The optimization proceeds intuitively as: i) when a violation happens (i.e., C_φ(s, a) > α), λ increases to penalize the policy; ii) if there is no violation (i.e., C_φ(s, a) < α), λ decreases to give the policy more reward.

Model Training We alternately update the constraint function, i.e., the discriminator, and the recommender π_θ, similar to the Generative Adversarial Network (GAN) [15]. Specifically, the parameters are updated via the following rules:

$$\theta_{k+1} = \Gamma_\theta[\theta_k + \eta_1(k) \nabla_\theta L(\lambda_k, \theta_k, \phi_k)], \quad (8)$$

$$\phi_{k+1} = \phi_k + \eta_2(k) \nabla_\phi L(\lambda_k, \theta_k, \phi_k), \quad (9)$$

$$\lambda_{k+1} = \Gamma_\lambda[\lambda_k - \eta_3(k) \nabla_\lambda L(\lambda_k, \theta_k, \phi_k)], \quad (10)$$

where Γ_θ is a projection operator, which keeps the updates stable by constraining the parameters within a trust region, and Γ_λ projects λ into the range [0, λ_max].

We call this a three-timescale Reward Constrained Recommendation process, i.e., the three parts are updated with different frequencies and step sizes: the recommender aims to maximize the expected reward with fewer violations following (8).
As described in Algorithm 1, the discriminator is updated following (9) to detect new violations, and λ is updated following (10).

Algorithm 1 Reward Constrained Recommendation
Input: constraint C(·), threshold α, learning rates η_1(k) > η_2(k) > η_3(k)
Initialize recommender and discriminator parameters with pretrained ones, Lagrange multiplier λ_0 = 0
repeat
  for t = 0, 1, ..., T − 1 do
    Sample action a_t ∼ π, observe next state s_{t+1}, reward r_t and penalty c_t
    R̂_t = r_t − λ_k c_t
    Recommender update with (8)
  end for
  Discriminator update with (9)
  Lagrange multiplier update with (10)
until model converges
return recommender (policy) parameters θ

3.2 Model Details

We discuss details of the model design when applying the proposed framework in a text-based recommender system.

Feature Extractor Our feature extractor consists of encoders for the text and visual inputs. Similar to [20], we consider the case where visual attributes are available. We encode the raw images of the items by ResNet50 [21] and an attribute network, i.e., the visual feature c_t^vis of the item a_t is the concatenation of ResNet(a_t) and AttrNet(a_t). The input of the attribute network is an item's encoding by ResNet50, and the attribute network outputs the item's attribute values. We further encode the user comments in text by an embedding layer, an LSTM and a linear mapping. Given a user comment x_t, the final output of the textual context is denoted c_t^txt. The encoded image and comment are further concatenated as the input to an MLP, and then to the recommender component.

Recommender With the visual feature c_t^vis and textual feature c_t^txt, the recommender perceives the state in an auto-regressive manner. At time t, the state is s_t = f(g([c_t^vis, c_t^txt]), s_{t−1}), where g is an MLP for textual and visual matching, and f is the LSTM unit [23].
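A minimal numpy sketch of such a state tracker; the dimensions, the single-layer g, and the random weights are illustrative stand-ins for the trained networks:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class StateTracker:
    """s_t = f(g([c_vis; c_txt]), s_{t-1}): a one-layer matching network g
    fuses visual and textual features, and an LSTM cell f updates the state."""

    def __init__(self, vis_dim, txt_dim, hid_dim, rng):
        in_dim = vis_dim + txt_dim
        self.Wg = rng.standard_normal((hid_dim, in_dim)) * 0.1     # g
        # LSTM cell: input/forget/output gates and cell candidate, stacked.
        self.W = rng.standard_normal((4 * hid_dim, 2 * hid_dim)) * 0.1
        self.b = np.zeros(4 * hid_dim)
        self.hid_dim = hid_dim

    def step(self, c_vis, c_txt, h_prev, cell_prev):
        x = np.tanh(self.Wg @ np.concatenate([c_vis, c_txt]))  # g([c_vis; c_txt])
        z = self.W @ np.concatenate([x, h_prev]) + self.b
        H = self.hid_dim
        i, f, o = sigmoid(z[:H]), sigmoid(z[H:2*H]), sigmoid(z[2*H:3*H])
        g = np.tanh(z[3*H:])
        cell = f * cell_prev + i * g   # LSTM cell-state update
        h = o * np.tanh(cell)          # new tracker state s_t
        return h, cell

rng = np.random.default_rng(0)
tracker = StateTracker(vis_dim=8, txt_dim=4, hid_dim=6, rng=rng)
h, cell = np.zeros(6), np.zeros(6)
h, cell = tracker.step(rng.standard_normal(8), rng.standard_normal(4), h, cell)
```

Running `step` once per interaction round accumulates the session history into s_t, which is what the policy conditions on.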
Since our goal in each user session is to find items with a set of desired attribute values, we use a policy π_θ with multi-discrete action spaces [22, 12]. For each attribute, the attribute value desired by the user is sampled from a categorical distribution. Given the state s_t, the probability of choosing a particular attribute value is output by a three-layer fully connected neural network with a softmax activation function. The recommender samples the values of the different attributes from π_θ. If K items are recommended at each time step, we select the K items closest to the sampled attribute values under the Euclidean distance in the visual attribute space.

Discriminator The discriminator is designed to discriminate whether a recommended item at time t violates previous user comments in the current session. That is, given the visual feature of the current image c_t^vis and the textual features {c_j^txt}_{j=1}^{t−1}, the discriminator outputs whether the image violates the user comments. In practice, this discriminator is a three-layer fully connected neural network, trained on-the-fly to incrementally learn the multimodal matching between the user comments and item visual features. Following Algorithm 1, we update the discriminator after each user session, where a user interacts with the system for several time steps, or quits. To further enhance the results, when making recommendations, we reject some items based on this discriminator. If an item a_t sampled by the recommender has a high probability of violating the previous comments {x_i}_{i=1}^{t−1}, we ignore this item and sample another item to recommend.

3.3 Extension to Constrained Text Generation

In this section, we describe how to extend our framework to constrained text generation. We consider text generation with specific constraints.
Specifically, we consider the scenario of controlling for negative sentiment. For example, a generator may produce offensive or negative words, which will affect the user experience in some situations, such as an online chatbot for helping consumers. To alleviate this issue, we apply the proposed RCR method to text generation.

We assume each sentence is generated from a latent vector z ∼ p(z), where p(z) is the distribution of a latent code. Text generation is then formulated as the learning of a distribution $p(X) = \int_{z_x} p(X|z_x) q(z_x|X) \mathrm{d}z_x$, where p corresponds to a decoder and q to an encoder model, within the encoder-decoder framework; z is the latent code containing content information. The generator learns a policy π_θ to generate a sequence Y = (y_1, ..., y_T) of length T. Here each y_t is a token from the vocabulary A. The objective is to maximize the expected reward with fewer constraint violations, defined as:

$$L(\theta, \lambda, \phi) = \min_{\lambda \ge 0} \max_{\theta} \mathbb{E}_{Y \sim \pi_\theta}[r(Y) - \lambda (C_\phi(Y) - \alpha)], \quad (11)$$

where r is the reward function, which can be a metric reward (e.g., BLEU) or a learned reward function with a general discriminator [52]; C_φ(·) is the constraint discriminator for the generation. In practice, we pretrain our generator π_θ with a variational autoencoder (VAE) [25], and we only use the decoder as our generator. More details about the pretrained model are provided in Appendix A.1. There is one constraint for the generation, and the framework is illustrated in Figure 3. The general discriminator can be a language model [48], and the constraint is a learned function parameterized by a neural network. During inference, the model generates text based on draws from an isotropic Gaussian distribution, i.e., z ∼ N(0, I). Here we only consider a static constraint with a non-zero final deterministic reward.

Figure 3: Overview of the constrained text-generation model: L_ae is the reconstruction term from the VAE in pretraining. The constraint gives a penalty when generated text violates the constraint discriminator.

4 Related Work

Constrained Policy Optimization Constrained Markov Decision Processes [3] are employed in a wide range of applications, including analysis of electric grids [26] and robotics [8, 17]. Lagrange multipliers are widely used to solve the CMDP problem [43, 5], as adopted in our proposed framework. Other solutions of CMDPs include the use of a trust region [1] and integrating prior knowledge [11]. Additionally, some previous work manually selects the penalty coefficient [13, 31, 37]. In contrast with standard methods, our constraint functions are: (i) sequentially added via natural-language feedback; and (ii) parameterized by a dynamically updated neural network with better generalization.

Text-Based Recommender System Communication between a user and a recommendation system has been leveraged to understand user preference and provide recommendations. Entropy-based methods and bandits have been studied for question selection [34, 10]. Deep learning and reinforcement learning models have been proposed to understand user conversations and make recommendations [2, 9, 16, 41, 30, 53, 56]. Similar to [10, 41, 30, 53], the items are associated with a set of attributes in our recommendation setting.
In the existing works, the content of the conversation serves as the constraint when a system makes recommendations. However, in most existing works, constraints from the conversations are not explicitly modeled. By contrast, this paper proposes a novel constrained reinforcement learning framework to emphasize the constraints when making recommendations.

Interactive Image Retrieval Leveraging user feedback on images to improve image retrieval has been studied extensively [45]. Depending on the feedback format, previous works can be categorized into relevance feedback [39, 47] and relative-attribute feedback [27, 36, 49]. In these works, the attributes describing the images are pre-defined and fixed. To achieve a more flexible and precise representation of the image attributes, Guo et al. [19] propose an end-to-end approach, without pre-defining a set of attributes. Their goal is to improve the ranking of the target item, while we focus on recommending items that do not violate the users' previous comments in the iterative recommendation. Thus, we develop a different evaluation simulator, as detailed in Section 5.1. In [53], it is assumed that an accurate discriminator pretrained on a huge amount of offline data is available at the beginning, which is usually impractical. Instead, our RCR framework learns the discriminator from scratch and dynamically updates the model φ and its weight λ by (9) and (10) online.

Constrained Text Generation Adversarial text generation methods [52, 7, 33, 14, 54, 35] use reinforcement learning (RL) algorithms for text generation. They use the REINFORCE algorithm to provide an unbiased gradient estimator for the generator, and apply a roll-out policy to obtain the reward from the discriminator. LeakGAN [18] adopts a hierarchical RL framework to improve text generation. GSGAN [28] and TextGAN [55, 24] use the Gumbel-softmax and soft-argmax representations, respectively, to deal with discrete data. Wang et al.
[46] put topic-aware priors on the latent codes to generate text on specific topics. All these works consider generating sentences with better quality and diversity, without explicit constraints.

5 Experiments

We apply the proposed methods in two applications, text-based interactive recommendation and constrained text generation, to demonstrate the effectiveness of our proposed RCR framework.

5.1 Text-Based Interactive Recommendation

Dataset and Setup Our approaches are evaluated on the UT-Zappos50K dataset [50, 51]. UT-Zappos50K is a shoe dataset consisting of 50,025 shoe images. This dataset provides rich attribute data, and we focus on shoes category, shoes subcategory, heel height, closure, gender and toe style in our evaluation. Among all the images, 40,020 images are randomly sampled as training data and the rest are used as test data. To validate the generalization ability of our approach, we compare the performance on seen items and unseen items. The seen items are the items in the training data, where the item visual attributes are carefully labeled. The unseen items are the items in the test data. We assume the unseen items are newly collected and have no labeled visual attributes. We train the attribute network on the training data under the cross-entropy loss. The ResNet50 is pretrained on ImageNet and is fixed subsequently. When we report the results on seen and unseen items, their attribute values are predicted by the attribute network. We pretrain the textual encoder, where the labels are the described attribute values, under the cross-entropy loss. The training data consist of the comments collected by annotators, as detailed later in this section. In reinforcement learning, we use Adam [25] as the optimizer. We set α = 0.5 and λ_max = 1.

We define the reward as the visual similarity between the recommended and desired items.
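One way to realize such a similarity reward, sketched with toy precomputed feature vectors standing in for ResNet and AttrNet outputs (the values below are invented):

```python
import math

def visual_reward(feat_rec, feat_des, attr_rec, attr_des, lam_att=0.5):
    """Reward = -L2 distance between image feature vectors
               - lam_att * number of mismatched attribute values (an L0 count)."""
    l2 = math.sqrt(sum((a - b) ** 2 for a, b in zip(feat_rec, feat_des)))
    l0 = sum(1 for a, b in zip(attr_rec, attr_des) if a != b)
    return -l2 - lam_att * l0

# Identical items give the maximal reward of 0; any difference is penalized.
r_same = visual_reward([0.1, 0.2], [0.1, 0.2], ["boot", "high"], ["boot", "high"])
r_diff = visual_reward([0.1, 0.2], [0.4, 0.6], ["boot", "low"], ["boot", "high"])
```

The weight lam_att balances the continuous feature distance against the discrete attribute mismatch count, so neither term dominates.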
Similar to [20], in our task both images and their visual attributes are available to measure the similarity. It is desired that the recommended item becomes more similar to the desired item with more user interactions. Thus, at time t, given the recommended item a_t and the desired item a*, we want to minimize their visual difference. In detail, we maximize the following visual reward:

$$r_t = -\|\mathrm{ResNet}(a_t) - \mathrm{ResNet}(a^*)\|_2 - \lambda_{att} \|\mathrm{AttrNet}(a_t) - \mathrm{AttrNet}(a^*)\|_0,$$

where ‖·‖_2 is the L2 norm, ‖·‖_0 is the L0 norm, and we set λ_att = 0.5 to ensure the scales of the two distances are similar. If the system is not able to find the desired item within 50 interactions, we terminate the user session and the system receives an extra reward of −3 (i.e., a penalty).

Figure 4: Number of Interactions (NI), Number of Violations (NV), and Success Rate@30 (SR@30) with respect to training iterations, and the values of λ in RCR with respect to the number of samples. The RL method converges much slower than RCR.

Method | SR@10 ↑ | SR@20 ↑ | SR@30 ↑ | NI ↓ | NV ↓
RL (Unseen) | 19% | 44% | 63% | 26.75 ± 1.67 | 70.02 ± 6.20
RL + Naive (Unseen) | 52% | 83% | 94% | 12.72 ± 0.93 | 16.47 ± 2.75
RCR (Unseen) | 74% | 86% | 94% | 10.91 ± 1.06 | 11.32 ± 1.98
RCR (Seen) | 78% | 91% | 92% | 10.34 ± 1.18 | 12.25 ± 2.99

Table 1: Comparisons between different approaches. Except for the row RCR (Seen), which reports results on training data, all results are on the test data with unseen items.

Online Evaluation We cannot directly detect violations with the existing text-based interactive recommendation dataset [19], since there are no attribute labels for the images. A recent relevant fashion dataset provides attribute labels 3 derived from the text metadata [20]. Unfortunately, we observe that the users' comments are usually unrelated to the attribute labels. Therefore, we need to collect user comments relevant to attributes with groundtruth, for our evaluation purposes.

3Available at https://github.com/hongwang600/image_tag_dataset/tree/master/tags.

Further, evaluating the proposed system requires the ability to access all user reactions to any possible item at each time step. For the evaluation on the UT-Zappos50K dataset, we use a simulator similar to that of Guo et al. [19]. This simulator acts as a surrogate for real human users by generating their comments in natural language. The generated comments describe the prominent visual attribute differences between any pair of desired and candidate items.

To achieve this, we collect user comments relevant to the attributes with groundtruth and train a user simulator. A training dataset is collected for 10,000 pairs of images with visual attributes. These pairs are prepared such that in each pair there is a recommended item and a desired item. Given a pair of images, one user comment is collected. The data are collected in a scenario in which the customer talks with a shopping assistant to get the desired items. The annotators act as the customers to express the desired attribute values of items. For evaluation purposes, we adopt a simplified setting and instruct the annotators to describe comments related to a fixed set of visual attributes. Thus, the comments in our evaluation are relatively simple compared to real-world sentences. Considering this, we further augment the collected user comment data as follows. From the real-world sentences collected from annotators, we derive several sentence templates. Then, we generate 20,000 labeled sentences by filling these templates with the groundtruth attribute labels.
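The template-filling augmentation can be sketched as follows; the templates and attribute names are invented examples, not the collected data:

```python
import random

# Hypothetical sentence templates; "{value}" is filled with the desired
# attribute value that the candidate item lacks.
TEMPLATES = [
    "I prefer {value}.",
    "Show me something with {value} instead.",
    "I want {value} rather than this.",
]

def simulate_comment(candidate_attrs, desired_attrs, rng):
    """Pick one attribute on which the two items differ and verbalize the
    desired value with a random template (None if nothing differs)."""
    diffs = [k for k in desired_attrs if candidate_attrs.get(k) != desired_attrs[k]]
    if not diffs:
        return None
    attr = rng.choice(diffs)
    return rng.choice(TEMPLATES).format(value=desired_attrs[attr])

rng = random.Random(0)
comment = simulate_comment({"heel": "flat"}, {"heel": "high heel"}, rng)
# e.g. "I prefer high heel."
```

Because the filled-in value is known, every generated sentence comes with a groundtruth attribute label, which is what makes the augmented data usable for training the simulator.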
On the augmented user-comment data, we train the user simulator.
Our user simulator is implemented as a sequence-to-sequence model. The input to the user simulator is the difference in one attribute value between the candidate and desired items. Given this input, the user simulator generates a sentence describing the visual-attribute difference between the candidate item and the desired item. We use two LSTMs as the encoder and decoder. The dimensionality of the latent code is set to 256. We use Adam as the optimizer, with an initial learning rate of 0.001 and a batch size of 64. Note that, for evaluating how well the current recommended item's visual attributes satisfy the user's previous feedback, our user simulator on UT-Zappos50K only generates simple comments on the visual-attribute difference between the candidate image and the desired image: we can calculate how many attributes violate the user's previous feedback based on the visual-attribute ground truth available in UT-Zappos50K.
We define three evaluation metrics: i) task success rate (SR@K), the success rate after K interactions; ii) number of user interactions before success (NI); and iii) number of violated attributes (NV). In each user session, we assume the user aims to find items with a set of desired attribute values sampled from the dataset. We report results averaged over 100 sessions with standard error. We develop an RL baseline by ignoring the constraints (i.e., the discriminator) in RCR. A major difference between our RL baseline and Guo et al. [19] is that we consider the attributes in model learning, whereas [19] ignores them. We compare RCR with RL without constraints, as well as with RL using naive constraints, i.e., naively applying hard constraints.
That is, we track all the visual attributes previously described by the user in this session, and make further recommendations based on matching between them and the items in the dataset.

Analysis  All models are trained for 100,000 iterations (user sessions), and the results with standard errors under different metrics are shown in Table 1. The proposed RCR framework shows consistent improvements on most metrics compared with the baselines. The gap between RL with naive constraints and RCR demonstrates that the learned constraint (discriminator) generalizes better. Figure 4 shows the metrics with standard errors of RL and the proposed RCR over the first 40,000 iterations. RCR converges much faster than RL. The last subfigure shows the value of λ over the number of samples. Interestingly, λ increases at the initial stage because of too many violations. Then, with fewer violations, λ decreases to a relatively small value (λ = 0.04) and remains stable; this is the automatically learned weight of the discriminator. Some examples in Figure 5 show how the constraint improves the recommendation.

5.2 Constrained Text Generation

Experimental Setup  We use the Yelp review dataset [40] to validate the proposed methods. We split the data into 444,000, 63,500, and 127,000 sentences for the training, validation, and test sets, respectively.
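The rise-then-settle behavior of the learned weight λ described above is consistent with a dual gradient-ascent update in the spirit of reward constrained policy optimization [43]. A minimal sketch, where the learning rate, violation target, and function name are illustrative assumptions rather than the paper's exact procedure:

```python
def update_lambda(lam, violation_rate, target=0.05, lr=0.01):
    """One dual-ascent step on the constraint multiplier.

    lam rises while the observed violation rate exceeds the target and
    decays (never below zero) once violations become rare, matching the
    increase-then-stabilize curve of lambda reported for RCR.
    """
    return max(0.0, lam + lr * (violation_rate - target))

# Many early violations push lambda up; few violations let it decay.
lam = 0.0
for rate in [0.9, 0.9, 0.5, 0.1, 0.0, 0.0]:
    lam = update_lambda(lam, rate)
```

Under such an update, λ stabilizes once the policy's violation rate hovers around the target.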
The generator is trained on the Yelp dataset to generate reviews without sentiment labels. We define the reward of a generated sentence as its probability of being real, and the constraint is to generate positive reviews, i.e., the generator receives a penalty if it generates negative reviews. The constraint is a neural network trained on sentences with sentiment labels, achieving a classification accuracy of 97.4% on the validation set. We follow the strategy in [52, 18] and adopt BLEU scores, computed against the test set containing only positive reviews (test-BLEU) and against the generated samples themselves (self-BLEU), to evaluate the quality of generated samples. We also report the violation rate (VR), the percentage of generated negative reviews violating the constraint.

Figure 5: Three use cases, from logged experimental results. (a) and (b) are successful use cases by RCR. (c) is an unsuccessful case by RL, which demonstrates the common challenge of failing to meet the constraint in recommendation.

             Test-BLEU-2   3       4       5       Self-BLEU-2   3       4       VR
RL           0.807         0.622   0.469   0.376   0.658         0.315   0.098   40.36%
RCR (ours)   0.840         0.651   0.492   0.392   0.683         0.348   0.151   10.49%

Table 2: Comparison between RCR and standard RL for constrained text generation on Yelp.

Analysis  As illustrated in Table 2, RCR achieves better test-BLEU scores than standard RL, demonstrating high-quality generated sentences. Further, RCR shows slightly higher but reasonable self-BLEU scores, since we only generate sentences with positive sentiment, leading to lower diversity. Our proposed method shows a much lower violation rate, demonstrating the effectiveness of RCR.
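The reward-plus-constraint setup for text generation can be sketched as follows. The function name, the 0.5 decision threshold, and the fixed penalty weight are illustrative assumptions, not the paper's exact formulation (in RCR the constraint weight λ is learned rather than fixed).

```python
def constrained_reward(p_real, p_negative, lam=0.5, threshold=0.5):
    """Reward for a generated sentence under a sentiment constraint.

    p_real: discriminator's probability that the sentence is real.
    p_negative: sentiment classifier's probability the review is negative.
    A sentence flagged as negative violates the constraint and is penalized.
    """
    violation = 1.0 if p_negative > threshold else 0.0
    return p_real - lam * violation
```

A realistic positive review (p_real = 0.9, p_negative = 0.1) keeps its full reward, while an equally realistic negative review (p_real = 0.9, p_negative = 0.9) is penalized.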
Some randomly generated examples are shown in Table 3.

RL without Constraints:
the ceiling is low , the place smells awful , gambling sucked .
i have been here a few times and each time has been great !
bad food , bad service , takes too much time .
food was good , but overall it was a very bad dining experience .
my entree was a sea bass which was well prepared and tasty .
the food is delicious and very consistently so .
the waitress was horrible and came by maybe once every hour .

RCR:
every dish was so absolutely delicious and seasoned perfectly .
he is the most compassionate vet i have ever met .
compared to other us cities , this place ranks very generous in my book .
then you already know what this tastes like .
thank you my friends for letting us know this finest dining place in lv .
great service and the food was excellent .
the lines can get out of hand sometimes but it goes pretty quick .

Table 3: Randomly selected examples of text generation by the two methods.

6 Conclusions

Motivated by potential constraints in real-world tasks with RL training, and inspired by constrained policy optimization, we propose the RCR framework, in which a neural network is parameterized and dynamically updated to represent constraints during RL training. By applying this framework to constrained interactive recommendation and text generation, we demonstrate that our proposed model outperforms several baselines. The proposed method is a general framework and can be extended to other applications, such as vision-and-dialog navigation [44].
Future work also includes incorporating users' historical information into the recommendation.

References

[1] Joshua Achiam, David Held, Aviv Tamar, and Pieter Abbeel. Constrained policy optimization. In ICML, 2017.

[2] Mohammad Aliannejadi, Hamed Zamani, Fabio Crestani, and W. Bruce Croft. Asking clarifying questions in open-domain information-seeking conversations. In SIGIR, pages 475–484, 2019.

[3] Eitan Altman. Constrained Markov Decision Processes. CRC Press, 1999.

[4] Dimitri P. Bertsekas. Nonlinear programming. Journal of the Operational Research Society, 1997.

[5] Vivek S. Borkar. An actor-critic algorithm for constrained Markov decision processes. Systems & Control Letters, 2005.

[6] Olivier Chapelle and Lihong Li. An empirical evaluation of Thompson sampling. In NIPS, pages 2249–2257, 2011.

[7] Tong Che, Yanran Li, Ruixiang Zhang, R. Devon Hjelm, Wenjie Li, Yangqiu Song, and Yoshua Bengio. Maximum-likelihood augmented discrete generative adversarial networks. CoRR, 2017.

[8] Yinlam Chow, Aviv Tamar, Shie Mannor, and Marco Pavone. Risk-sensitive and robust decision-making: a CVaR optimization approach. In NIPS, 2015.

[9] Konstantina Christakopoulou, Alex Beutel, Rui Li, Sagar Jain, and Ed H. Chi. Q&R: A two-stage approach toward interactive recommendation. In KDD, pages 139–148. ACM, 2018.

[10] Konstantina Christakopoulou, Filip Radlinski, and Katja Hofmann. Towards conversational recommender systems. In KDD, pages 815–824.
ACM, 2016.

[11] Gal Dalal, Krishnamurthy Dvijotham, Matej Vecerik, Todd Hester, Cosmin Paduraru, and Yuval Tassa. Safe exploration in continuous action spaces. arXiv preprint arXiv:1801.08757, 2018.

[12] Prafulla Dhariwal, Christopher Hesse, Oleg Klimov, Alex Nichol, Matthias Plappert, Alec Radford, John Schulman, Szymon Sidor, Yuhuai Wu, and Peter Zhokhov. OpenAI Baselines. https://github.com/openai/baselines, 2017.

[13] Dotan Di Castro, Aviv Tamar, and Shie Mannor. Policy gradients with variance related risk criteria. arXiv preprint arXiv:1206.6404, 2012.

[14] William Fedus, Ian Goodfellow, and Andrew M. Dai. MaskGAN: Better text generation via filling in the _. In ICLR, 2018.

[15] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In NIPS, 2014.

[16] Claudio Greco, Alessandro Suglia, Pierpaolo Basile, and Giovanni Semeraro. Converse-et-impera: Exploiting deep learning and hierarchical reinforcement learning for conversational recommender systems. In Conference of the Italian Association for Artificial Intelligence, pages 372–386. Springer, 2017.

[17] Shixiang Gu, Ethan Holly, Timothy Lillicrap, and Sergey Levine. Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates. In ICRA, 2017.

[18] Jiaxian Guo, Sidi Lu, Han Cai, Weinan Zhang, Yong Yu, and Jun Wang. Long text generation via adversarial training with leaked information. In AAAI, 2018.

[19] Xiaoxiao Guo, Hui Wu, Yu Cheng, Steven Rennie, Gerald Tesauro, and Rogerio Feris. Dialog-based interactive image retrieval. In NIPS, pages 676–686, 2018.

[20] Xiaoxiao Guo, Hui Wu, Yupeng Gao, Steven Rennie, and Rogerio Feris.
The Fashion IQ dataset: Retrieving images by combining side information and relative natural language feedback. arXiv:1905.12794, 2019.

[21] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.

[22] Ashley Hill, Antonin Raffin, Maximilian Ernestus, Adam Gleave, Anssi Kanervisto, Rene Traore, Prafulla Dhariwal, Christopher Hesse, Oleg Klimov, Alex Nichol, Matthias Plappert, Alec Radford, John Schulman, Szymon Sidor, and Yuhuai Wu. Stable Baselines. https://github.com/hill-a/stable-baselines, 2018.

[23] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

[24] Zhiting Hu, Zichao Yang, Xiaodan Liang, Ruslan Salakhutdinov, and Eric P. Xing. Toward controlled generation of text. In ICML, 2017.

[25] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.

[26] Iordanis Koutsopoulos and Leandros Tassiulas. Control and optimization meet the smart power grid: Scheduling of power demands for optimal energy management. In ICECN, 2011.

[27] Adriana Kovashka, Devi Parikh, and Kristen Grauman. WhittleSearch: Image search with relative attribute feedback. In CVPR, pages 2973–2980. IEEE, 2012.

[28] Matt J. Kusner and José Miguel Hernández-Lobato. GANs for sequences of discrete elements with the Gumbel-softmax distribution. arXiv preprint arXiv:1611.04051, 2016.

[29] Branislav Kveton, Csaba Szepesvari, Zheng Wen, and Azin Ashkan. Cascading bandits: Learning to rank in the cascade model. In ICML, pages 767–776, 2015.

[30] Wenqiang Lei, Xiangnan He, Yisong Miao, Qingyun Wu, Richang Hong, Min-Yen Kan, and Tat-Seng Chua. Estimation–action–reflection: Towards deep interaction between conversational and recommender systems. In WSDM, 2020.

[31] Sergey Levine and Vladlen Koltun. Guided policy search.
In ICML, 2013.

[32] Lihong Li, Wei Chu, John Langford, and Robert E. Schapire. A contextual-bandit approach to personalized news article recommendation. In WWW, pages 661–670. ACM, 2010.

[33] Kevin Lin, Dianqi Li, Xiaodong He, Zhengyou Zhang, and Ming-Ting Sun. Adversarial ranking for language generation. In NIPS, 2017.

[34] Nader Mirzadeh, Francesco Ricci, and Mukesh Bansal. Feature selection methods for conversational recommender systems. In IEEE International Conference on e-Technology, e-Commerce and e-Service, pages 772–777. IEEE, 2005.

[35] Weili Nie, Nina Narodytska, and Ankit Patel. RelGAN: Relational generative adversarial networks for text generation. In ICLR, 2019.

[36] Devi Parikh and Kristen Grauman. Relative attributes. In ICCV, pages 503–510. IEEE, 2011.

[37] Xue Bin Peng, Pieter Abbeel, Sergey Levine, and Michiel van de Panne. DeepMimic: Example-guided deep reinforcement learning of physics-based character skills. ACM Transactions on Graphics (TOG), 2018.

[38] Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, 2014.

[39] Yong Rui, Thomas S. Huang, Michael Ortega, and Sharad Mehrotra. Relevance feedback: a power tool for interactive content-based image retrieval. IEEE Transactions on Circuits and Systems for Video Technology, 8(5):644–655, 1998.

[40] Tianxiao Shen, Tao Lei, Regina Barzilay, and Tommi Jaakkola. Style transfer from non-parallel text by cross-alignment. In NIPS, 2017.

[41] Yueming Sun and Yi Zhang. Conversational recommender system. In SIGIR, pages 235–244, 2018.

[42] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 2018.

[43] Chen Tessler, Daniel J. Mankowitz, and Shie Mannor. Reward constrained policy optimization. In ICLR, 2019.

[44] Jesse Thomason, Michael Murray, Maya Cakmak, and Luke Zettlemoyer.
Vision-and-dialog navigation. In CoRL, 2019.

[45] Bart Thomee and Michael S. Lew. Interactive search in image retrieval: a survey. International Journal of Multimedia Information Retrieval, 1(2):71–86, 2012.

[46] Wenlin Wang, Zhe Gan, Hongteng Xu, Ruiyi Zhang, Guoyin Wang, Dinghan Shen, Changyou Chen, and Lawrence Carin. Topic-guided variational autoencoders for text generation. In NAACL, 2019.

[47] Hong Wu, Hanqing Lu, and Songde Ma. WillHunter: interactive image retrieval with multilevel relevance. In ICPR, volume 2, pages 1009–1012. IEEE, 2004.

[48] Zichao Yang, Zhiting Hu, Chris Dyer, Eric P. Xing, and Taylor Berg-Kirkpatrick. Unsupervised text style transfer using language models as discriminators. In NeurIPS, 2018.

[49] Aron Yu and Kristen Grauman. Fine-grained comparisons with attributes. In Visual Attributes, pages 119–154. Springer, 2017.

[50] Aron Yu and Kristen Grauman. Fine-grained visual comparisons with local learning. In CVPR, 2014.

[51] Aron Yu and Kristen Grauman. Semantic jitter: Dense supervision for visual comparisons via synthetic images. In ICCV, 2017.

[52] Lantao Yu, Weinan Zhang, Jun Wang, and Yong Yu. SeqGAN: Sequence generative adversarial nets with policy gradient. In AAAI, 2017.

[53] Tong Yu, Yilin Shen, Ruiyi Zhang, Xiangyu Zeng, and Hongxia Jin. Vision-language recommendation via attribute augmented multimodal reinforcement learning. In ACM Multimedia, 2019.

[54] Ruiyi Zhang, Changyou Chen, Zhe Gan, Wenlin Wang, Liqun Chen, Dinghan Shen, Guoyin Wang, and Lawrence Carin. Improving RL-based sequence generation by modeling the distant future. In RSDM Workshop, ICML, 2019.

[55] Yizhe Zhang, Zhe Gan, Kai Fan, Zhi Chen, Ricardo Henao, Dinghan Shen, and Lawrence Carin. Adversarial feature matching for text generation. In ICML, 2017.

[56] Yu Zhu, Hao Li, Yikang Liao, Beidou Wang, Ziyu Guan, Haifeng Liu, and Deng Cai. What to do next: Modeling user behaviors by Time-LSTM.
In IJCAI, 2017.