{"title": "Bandits Dueling on Partially Ordered Sets", "book": "Advances in Neural Information Processing Systems", "page_first": 2129, "page_last": 2138, "abstract": "We address the problem of dueling bandits defined on partially ordered sets, or posets. In this setting, arms may not be comparable, and there may be several (incomparable) optimal arms. We propose an algorithm, UnchainedBandits, that efficiently finds the set of optimal arms, or Pareto front, of any poset even when pairs of comparable arms cannot be a priori distinguished from pairs of incomparable arms, with a set of minimal assumptions. This means that UnchainedBandits does not require information about comparability and can be used with limited knowledge of the poset. To achieve this, the algorithm relies on the concept of decoys, which stems from social psychology. We also provide theoretical guarantees on both the regret incurred and the number of comparison required by UnchainedBandits, and we report compelling empirical results.", "full_text": "Bandits Dueling on Partially Ordered Sets\n\nJulien Audiffren\n\nCMLA\n\nENS Paris-Saclay, CNRS\n\nUniversit\u00b4e Paris-Saclay, France\n\njulien.audiffren@gmail.com\n\nLiva Ralaivola\n\nLab. Informatique Fondamentale de Marseille\n\nCNRS, Aix Marseille University\nInstitut Universitaire de France\n\nF-13288 Marseille Cedex 9, France\n\nliva.ralaivola@lif.univ-mrs.fr\n\nAbstract\n\nWe address the problem of dueling bandits de\ufb01ned on partially ordered sets, or\nposets. In this setting, arms may not be comparable, and there may be several\n(incomparable) optimal arms. We propose an algorithm, UnchainedBandits,\nthat ef\ufb01ciently \ufb01nds the set of optimal arms \u2014the Pareto front\u2014 of any poset\neven when pairs of comparable arms cannot be a priori distinguished from pairs\nof incomparable arms, with a set of minimal assumptions. 
This means that Un-\nchainedBandits does not require information about comparability and can be\nused with limited knowledge of the poset. To achieve this, the algorithm relies\non the concept of decoys, which stems from social psychology. We also provide\ntheoretical guarantees on both the regret incurred and the number of comparisons\nrequired by UnchainedBandits, and we report compelling empirical results.\n\n1\n\nIntroduction\n\nMany real-life optimization problems pose the issue of dealing with a few, possibly conflicting,\nobjectives: think for instance of the choice of a phone plan, where a right balance between the price,\nthe network coverage/type, and roaming options has to be found. Such multi-objective optimization\nproblems may be studied from the multi-armed bandits perspective (see e.g. Drugan and Nowe\n[2013]), which is what we do here from a dueling bandits standpoint.\nDueling Bandits on Posets. Dueling bandits [Yue et al., 2012] pertain to the K-armed bandit\nframework, with the assumption that there is no direct access to the reward provided by any single\narm and the only information that can be gained is through the simultaneous pull of two arms: when\nsuch a pull is performed the agent is informed about the winner of the duel between the two arms.\nWe extend the framework of dueling bandits to the situation where there are pairs of arms that are not\ncomparable, that is, we study the case where there might be no natural order that could help decide\nthe winner of a duel—this situation may show up, for instance, if the (hidden) values associated with\nthe arms are multidimensional, as is the case in the multi-objective setting mentioned above. The\nnotion of incomparability naturally links this problem with the theory of posets and our approach\ntakes inspiration from works dedicated to selecting and sorting on posets [Daskalakis et al., 2011].\nChasing the Pareto Front. 
In this setting, the best arm may no longer be unique, and we consider\nthe problem of identifying among all available K arms the set of maximal incomparable arms, or\nthe Pareto front, with minimal regret. This objective significantly differs from the usual objective\nof dueling bandit algorithms, which aim to find one optimal arm—such as a Condorcet winner, a\nCopeland winner or a Borda winner—and pull it as frequently as possible to minimize the regret.\nFinding the entire Pareto front (denoted P) is more difficult, but pertains to many real-world applica-\ntions. For instance, in the discussed phone plan setting, P will contain both the cheapest plan and the\nplan offering the largest coverage, as well as any non-dominated plan in-between; therefore, every\ncustomer may then find a suitable plan in P in accordance with her personal preferences.\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\nKey: Indistinguishability. In practice, the incomparability information might be difficult to obtain.\nTherefore, we assume that the underlying incomparability structure is unknown and inaccessible. A\npivotal issue that arises is that of indistinguishability. In the assumed setting, the pull of two arms that\nare comparable and that have close values—and hence a probability for either arm to win a duel close\nto 0.5—is essentially driven by the same random process, i.e. an unbiased coin flip, as the draw of\ntwo arms that are not comparable. This induces the problem of indistinguishability: that of deciding\nfrom pulls whether a pair of arms is incomparable or is made of arms of similar strengths.\nContributions. Our main contribution, the UnchainedBandits algorithm, implements a strategy\nbased on a peeling approach (Section 3). 
We show that UnchainedBandits can find a nearly\noptimal approximation of the set of optimal arms of S with probability at least 1 − δ while\nincurring a regret upper bounded by R ≤ O(K width(S) log(K/δ) Σ_{i ∉ P} 1/Δ_i), where Δ_i is the\nregret associated with arm i, K the size of the poset and width(S) its width, and that this regret\nis essentially optimal. Moreover, we show that with little additional information, Unchained-\nBandits can recover the exact set of optimal arms, and that even when no additional information\nis available, UnchainedBandits can recover P by using decoy arms—an idea stemming from\nsocial psychology, where decoys are used to lure an agent (e.g., a customer) towards a specific\ngood/action (e.g. a product) by presenting her a choice between the targeted good and a degraded\nversion of it (Section 4). Finally, we report results on the empirical performance of our algorithm in\ndifferent settings (Section 5).\nRelated Works. Since the seminal paper of Yue et al. [2012] on dueling bandits, numerous works\nhave proposed settings where the total order assumption is relaxed, but the existence of a Condorcet\nwinner is assumed [Yue and Joachims, 2011, Ailon et al., 2014, Zoghi et al., 2014, 2015b]. More\nrecent works [Zoghi et al., 2015a, Komiyama et al., 2016], which envision bandit problems from the\nsocial choice perspective, pursue the objective of identifying a Copeland winner. Finally, the works\nclosest to our partial order setting are [Ramamohan et al., 2016] and [Dudík et al., 2015]. The former\nproposes a general algorithm which can recover many sets of winners—including the uncovered set,\nwhich is akin to the Pareto front; however, it is assumed the problems do not contain ties while in our\nframework, any pair of incomparable arms is encoded as a tie. The latter proposes an extension of\ndueling bandits using contexts, and introduces several algorithms to recover a Von Neumann winner,\ni.e. 
a mixture of arms that is better than any other—and in our setting, any mixture of arms from\nthe Pareto front is a Von Neumann winner. It is worth noting that the aforementioned works aim to\nidentify a single winner, either Condorcet, Copeland or Von Neumann. This is significantly different\nfrom the task of identifying the entire Pareto front. Moreover, the incomparability property is not\naddressed in previous works; while some algorithms may still be applied if incomparability is encoded as\na tie, they are not designed to fully use this information, which is reflected by their performance\nin our experiments. Moreover, our lower bound illustrates the fact that our algorithm is essentially\noptimal for the task of identifying the Pareto front. Regarding decoys, the idea originates from\nsocial psychology, which introduced the idea that strictly dominated alternatives may\ninfluence the perceived value of items. This has generated an abundant literature that studies decoys\nand their uses in various fields (see e.g. Tversky and Kahneman [1981], Huber et al. [1982], Ariely\nand Wallsten [1995], Sedikides et al. [1999]). From the computer science literature, we may mention\nthe work of Daskalakis et al. [2011], which addresses the problem of selection and sorting on posets\nand provides relevant data structures and accompanying analyses.\n\n2 Problem: Dueling Bandits on Posets\n\nWe here briefly recall base notions and properties at the heart of our contribution.\nDefinition 2.1 (Poset). Let S be a set of elements. (S, ≽) is a partially ordered set or poset if ≽ is a\npartial reflexive, antisymmetric and transitive binary relation on S.\nTransitivity relaxation. Recent works on dueling bandits (see e.g. Zoghi et al. [2014]) have shown\nthat the transitivity property is not required for the agent to successfully identify the maximal element\n(in that case, the Condorcet winner), if it is assumed to exist. 
Similarly, most of the results we\nprovide do not require transitivity. In the following, we dub social poset a transitivity-free poset, i.e.\na partial binary relation which is solely reflexive and antisymmetric.\nRemark 2.2. Throughout, we will use S to denote indifferently the set S or the social poset (S, ≽),\nthe distinction being clear from the context. We make use of the additional notation: ∀a, b ∈ S\n\n• a ∥ b if a and b are incomparable (neither a ≽ b nor b ≽ a);\n• a ≻ b if a ≽ b and a ≠ b;\n\nDefinition 2.3 (Maximal element and Pareto front). An element a ∈ S is a maximal element of S\nif ∀b ∈ S, a ≽ b or a ∥ b. We denote by P(S) := {a : a ≽ b or a ∥ b, ∀b ∈ S}, the set of maximal\nelements or Pareto front of the social poset.\nSimilarly to the problem of the existence of a Condorcet winner, P might be empty for a social poset\n(while for posets there always is at least one maximal element). In the following, we assume that\n|P| > 0. The notions of chain and antichain are key to identify P.\nDefinition 2.4 (Chain, Antichain, Width and Height). C ⊂ S is a chain (resp. an antichain) if\n∀a, b ∈ C, a ≽ b or b ≽ a (resp. a ∥ b). C is maximal if ∀a ∈ S \\ C, C ∪ {a} is not a chain (resp.\nan antichain). The height (resp. width) of S is the size of its longest chain (resp. antichain).\nK-armed Dueling Bandit on posets. The K-armed dueling bandit problem on a social poset\nS = {1, . . . , K} of arms might be formalized as follows. For all maximal chains {i_1, . . . , i_m} of\nm arms there exists a family {Δ_{i_p i_q}}_{1≤p,q≤m} of parameters such that Δ_{i_p i_q} ∈ (−1/2, 1/2) and the pull\nof a pair (i_p, i_q) of arms from the same chain is the independent realization of a Bernoulli random\nvariable B_{i_p i_q} with expectation E(B_{i_p i_q}) = 1/2 + Δ_{i_p i_q}, where B_{i_p i_q} = 1 means that i_p is the winner\nof the duel between i_p and i_q and conversely (note that ∀i, j, Δ_{ji} = −Δ_{ij}). 
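Concretely, the duel model just described can be simulated as follows (a minimal sketch; the names and the toy gap matrix are ours, not the paper's):

```python
import random

rng = random.Random(42)

# Toy gap matrix for 3 arms: Delta[i][j] in (-1/2, 1/2), Delta[j][i] = -Delta[i][j];
# Delta[i][j] == 0 encodes an incomparable pair (the duel is a fair coin flip).
Delta = [
    [0.0, 0.2, 0.0],
    [-0.2, 0.0, 0.0],
    [0.0, 0.0, 0.0],
]

def pull(i, j):
    """One duel: a Bernoulli draw with mean 1/2 + Delta[i][j]; returns 1 iff i wins."""
    return int(rng.random() < 0.5 + Delta[i][j])

# Over many pulls the empirical win rate concentrates around 1/2 + Delta[i][j].
n = 20000
rate = sum(pull(0, 1) for _ in range(n)) / n
```

This makes explicit why incomparable pairs are indistinguishable from evenly matched comparable pairs: both produce duels with mean 1/2.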
In the situation where the\npair of arms (i_p, i_q) selected by the agent corresponds to arms such that i_p ∥ i_q, a pull is akin to the\ntoss of an unbiased coin, that is, Δ_{i_p i_q} = 0. This is summarized by the following assumption:\nAssumption 1 (Order Compatibility). ∀i, j ∈ S, (i ≻ j) if and only if Δ_{ij} > 0.\nRegret on posets. In the total order setting, the regret incurred by pulling an arm i is defined as the\ndifference between the best arm and arm i. In the poset framework, there might be multiple 'best'\narms, and we choose to define the regret as the maximum of the difference between arm i and the best arm\ncomparable to i. Formally, the regret Δ_i is defined as:\n\nΔ_i = max{Δ_{ji}, ∀j ∈ P such that j ≽ i}.\n\nWe then define the regret incurred by comparing two arms i and j as Δ_i + Δ_j. Note that the regret of a\ncomparison is zero if and only if the agent is comparing two elements of the Pareto front.\nProblem statement. The problem that we want to tackle is to identify the Pareto front P(S) of S as\nefficiently as possible. More precisely, we want to devise pulling strategies such that for any given\nδ ∈ (0, 1), we are ensured that the agent is capable, with probability 1 − δ, to identify P(S) with a\ncontrolled number of pulls and a bounded regret.\nε-indistinguishability. In our model, we assumed that if i ∥ j, then Δ_{ij} = 0: if two arms cannot be\ncompared, the outcome of their comparison will only depend on circumstances independent from\nthe arms (like luck or personal tastes). Our encoding of this framework makes us assume that, when\nconsidered over many pulls, the effects of those circumstances cancel out, so that no specific arm\nis favored, whence Δ_{ij} = 0. The limits of this hypothesis, and the robustness of our results when it is not\nsatisfied, are discussed in Section 5.\nThis property entails the problem of indistinguishability evoked previously. 
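The Pareto front of Definition 2.3 and the poset regret just defined can be computed directly from a known gap matrix; a small illustrative sketch (toy values, not from the paper):

```python
# Convention: delta[i][j] > 0 means i strictly dominates j; 0 encodes incomparability.
Delta = [
    [0.0, 0.3, 0.0, 0.2],    # arm 0 dominates arms 1 and 3, incomparable with 2
    [-0.3, 0.0, 0.0, 0.1],
    [0.0, 0.0, 0.0, 0.25],   # arm 2 dominates arm 3, incomparable with 0 and 1
    [-0.2, -0.1, -0.25, 0.0],
]

def pareto_front(delta):
    """Maximal elements: arms that no other arm strictly dominates."""
    K = len(delta)
    return {i for i in range(K) if all(delta[j][i] <= 0 for j in range(K))}

def regret(i, delta, front):
    """Poset regret of arm i: the largest gap to a Pareto arm that dominates i."""
    gaps = [delta[j][i] for j in front if delta[j][i] > 0]
    return max(gaps, default=0.0)
```

Here `pareto_front(Delta)` is `{0, 2}`, and arm 3 has regret 0.25 (its worst gap to a dominating Pareto arm), while Pareto arms have regret 0, so only Pareto-vs-Pareto comparisons are free.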
Indeed, given two arms i\nand j, regardless of the number of comparisons, an agent may never be sure if either the two arms are\nvery close to each other (Δ_{ij} ≈ 0 and i and j are comparable) or if they are not comparable (Δ_{ij} = 0).\nThis raises two major difficulties. First, any empirical estimate Δ̂_{ij} of Δ_{ij} being close to zero is no\nlonger a sufficient condition to assert that i and j have similar values; insisting on pulling the pair\n(i, j) to decide whether they have similar values may incur a very large regret if they are incomparable.\nSecond, it is impossible to ensure that two elements are incomparable—therefore, identifying the\nexact Pareto set is intractable if no additional information is provided. Indeed, the agent might never\nbe sure that the candidate set no longer contains unnecessary additional elements—i.e. arms very close\nto the real maximal elements but nonetheless dominated. This problem motivates the following\ndefinition, which quantifies the notion of indistinguishability:\nDefinition 2.5 (ε-indistinguishability). Let a, b ∈ S and ε > 0. a and b are ε-indistinguishable,\nnoted a ∥_ε b, if |Δ_{ab}| ≤ ε.\nAs the notation ∥_ε implies, the ε-indistinguishability of two arms can be seen as a weaker form of\nincomparability, and note that as ε decreases, previously indistinguishable pairs of arms become distinguishable, and the only 0-indistinguishable pairs of arms are the incomparable pairs. The classical\nnotions of a poset related to incomparability can easily be extended to fit ε-indistinguishability:\nDefinition 2.6 (ε-antichain, ε-width and ε-approximation of P). Let ε > 0. C ⊂ S is an ε-antichain\nif ∀a ≠ b ∈ C, we have a ∥_ε b. Additionally, P′ ⊂ S is an ε-approximation of P (noted P′ ∈ P_ε) if\nP ⊂ P′ and P′ is an ε-antichain. Finally, width_ε(S) is the size of the largest ε-antichain of S.\nFeatures of P_ε. While the Pareto front is always unique, it might possess multiple ε-approximations.\nThe interest of working with P_ε is threefold: i) to find an ε-approximation of P, the agent only\nhas to remove the elements of S which are not ε-indistinguishable from P; thus, if P cannot be\nrecovered in the partially observable setting, an ε-approximation of P can be obtained; ii) any set\nin P_ε contains P, so no maximal element is discarded; iii) for any B ∈ P_ε, all the elements of B\nare nearly optimal, in the sense that ∀i ∈ B, Δ_i < ε. It is worth noting that ε-approximations of\nP may structurally differ from P in some settings, though. For instance, if S includes an isolated\ncycle, an ε-approximation of the Pareto front may contain elements of the cycle and, in such a case,\napproximating the Pareto front using ε-approximations may lead to counterintuitive results.\nFinding an ε-approximation of P is the focus of the next subsection.\n3 Chasing P_ε with UnchainedBandits\n3.1 Peeling and the UnchainedBandits Algorithm\nWhile deciding if two arms are incomparable or very close is intractable, the agent is able to find\nif two arms a and b are ε-indistinguishable, by using for instance the direct comparison process\nprovided by Algorithm 1. \n\nAlgorithm 1 Direct comparison\n\nGiven (S, ≽) a social poset, δ, ε > 0, a, b ∈ S\nDefine p_ab the average number of victories of a over b and I_ab its 1 − δ confidence interval.\nCompare a and b until |I_ab| < ε or 0.5 ∉ I_ab.\nreturn a ∥_ε b if |I_ab| < ε, else a ≻ b if p_ab > 0.5, else b ≻ a.\n\nAlgorithm 2 UnchainedBandits\n\nGiven S = {s_1, . . . , s_K} a social poset, δ > 0, N > 0, (ε_t)_{t=1}^N ∈ R_+^N\nDefine Set S_0 = S. Maintain p̂ = (p̂_ij)_{i,j=1}^K the average number of victories of i against j and\nI = (I_ij)_{i,j=1}^K with I_ij = min(√(log(NK²/δ)/(2 n_ij)), 1) the corresponding 1 − δ/NK² confidence interval.\nPeel P̂:\nfor t = 1 to N do\nS_{t+1} = UBSRoutine(S_t, ε_t, δ/N, A = Algorithm 1).\nreturn P̂ = S_{N+1}\n
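The direct comparison test of Algorithm 1 might be sketched as follows (an illustrative implementation with a Hoeffding-style interval; the oracle name and constants are ours):

```python
import math
import random

def direct_comparison(duel, delta, eps, max_pulls=200_000):
    """Sketch of the direct comparison test: pull the pair (a, b) until the
    (1 - delta) Hoeffding interval around the empirical win rate of a is
    narrower than eps (declare a ||_eps b) or excludes 1/2 (declare a winner).
    `duel()` is an assumed oracle returning 1 when a beats b."""
    wins = 0
    for n in range(1, max_pulls + 1):
        wins += duel()
        # Hoeffding confidence radius for a Bernoulli mean after n pulls.
        radius = math.sqrt(math.log(2.0 / delta) / (2.0 * n))
        p = wins / n
        if 2.0 * radius < eps:
            return "indistinguishable"
        if p - radius > 0.5:
            return "a beats b"
        if p + radius < 0.5:
            return "b beats a"
    return "indistinguishable"

rng = random.Random(0)
# Example: Delta_ab = 0.3, i.e. a wins a duel with probability 0.8.
verdict = direct_comparison(lambda: int(rng.random() < 0.8), delta=0.05, eps=0.05)
```

With a clear gap the winner check fires after few pulls, while a (near-)fair pair keeps the test running until the interval shrinks below `eps`, which is exactly the 1/ε² cost the peeling strategy below seeks to control.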
Our algorithm, UnchainedBandits, follows this idea to efficiently\nretrieve an ε-approximation of the Pareto front. It is based on a peeling technique: given N > 0 and\na decreasing sequence (ε_t)_{1≤t≤N}, it computes and refines an ε_t-approximation P̂_t of the Pareto front,\nusing UBSRoutine (Algorithm 3), which considers ε_t-indistinguishable arms as incomparable.\nPeeling S. Peeling provides a way to control the time spent on pulling indistinguishable arms, and it\nis used to upper bound the regret. Without peeling, i.e. if the algorithm were directly called with ε_N,\nthe agent could use a number of pulls proportional to 1/ε_N² trying to distinguish two incomparable\narms, even though one of them is a regret-inducing arm (e.g. an arm j with a large |Δ_{i,j}| for some\ni ∈ P). The peeling strategy ensures that inefficient arms are eliminated in early epochs, before the\nagent can focus on the remaining arms with an affordable larger number of comparisons.\nAlgorithm subroutine. At each epoch, UBSRoutine (Algorithm 3), called on S_t with parameters\nε > 0 and δ > 0, works as follows. It chooses a single initial pivot—an arm to which other arms are\ncompared—and successively examines all the elements of S_t. The examined element p is compared to\nall the pivots (the current pivot and the previously collected ones), using Algorithm 1 with parameters\nε and δ/K². Each pivot that is dominated by p is removed from the pivot set. If, after being compared\nto all the pivots, p has not been dominated, it is added to the pivot set. At the end, the set of remaining\npivots is returned.\n\nAlgorithm 3 UBSRoutine\n\nGiven S_t a social poset, ε_t > 0 a precision criterion, δ′ > 0 an error parameter\nInitialisation Choose p ∈ S_t at random. 
Define P̂ = {p} the set of pivots.\nConstruct P̂:\nfor c ∈ S_t \\ {p} do\nfor c′ ∈ P̂, compare c and c′ using Algorithm 1 with (δ = δ′/|S_t|², ε = ε_t).\n∀c′ ∈ P̂ such that c ≻ c′, remove c′ from P̂\nif ∀c′ ∈ P̂, c ∥_{ε_t} c′ then add c to P̂\nreturn P̂\n\nReuse of information. To optimize the efficiency of the peeling process, UnchainedBandits\nreuses previous comparison results: the empirical estimates p_ab and the corresponding confidence\nintervals I_ab are initialized using the statistics collected from previous pulls of a and b.\n\n3.2 Regret Analysis\nIn this part, we focus on geometrically decreasing peeling sequences, i.e. ∃γ > 0 such that ε_t =\nγ^t, ∀t ≥ 0. We now introduce Theorem 1¹, which gives an upper bound on the regret\nincurred by UnchainedBandits.\nTheorem 1. Let R be the regret generated by Algorithm 2 applied on S with parameters δ, N and\nwith a decreasing sequence (ε_t)_{t=1}^N such that ε_t = γ^t, ∀t ≥ 0. Then with probability at least 1 − δ,\nUnchainedBandits successfully returns P̂ ∈ P_{ε_N} after at most T comparisons, with\n\nT ≤ O(K width_{ε_N}(S) log(NK²/δ)/ε_N²)   (1)\n\nR ≤ (2K/γ²) log(2NK²/δ) Σ_{i=1}^K 1/Δ_i   (2)\n\nThe 1/γ² reflects the fact that a careful peeling, i.e. γ close to 1, is required to avoid unnecessarily\nexpensive (regret-wise) comparisons: this prevents the algorithm from comparing two incomparable—\nyet severely suboptimal—arms for an extended period of time. Conversely, for a given approximation\naccuracy ε_N = ε, N increases as 1/log γ, since γ^N = ε, which illustrates the fact that unnecessary\npeeling, i.e. peeling that does not remove any arms, leads to a slightly increased regret. In general, γ\nshould be chosen close to 1 (e.g. 0.95), as the advantages tend to surpass the drawbacks—unless\nadditional information about the poset structure is known.\nInfluence of the complexity of S. 
In the bounds of Theorem 1, the complexity of S influences the\nresult through its total size |S| = K and its width. One of the features of UnchainedBandits is\nthat the dependency on S in Theorem 1 is |S| width(S) and not |S|². For instance, if S is actually\nequipped with a total order, then width(S) = 1 and we recover the best possible dependency in\n|S|—which is highlighted by the lower bound (see Theorem 2).\nComparison Lower Bound. We will now prove that the previous result is nearly optimal in order. Let\nA denote a dueling bandit algorithm on hidden posets. We first introduce the following assumption:\nAssumption 2. ∀K > W ∈ N*_+, for all δ > 0, 1/8 > ε > 0, for any poset S such that |S| ≤ K\nand max(|P_ε(S)|) ≤ W, A identifies an ε-approximation of the Pareto front P_ε of S with probability\nat least 1 − δ with at most T^{δ,ε}_A(K, W) comparisons.\nTheorem 2. Let A be a dueling bandit algorithm satisfying Assumption 2. Then for any δ > 0,\n1/8 > ε > 0, K and W two positive integers such that K > W > 0, there exists a poset S such that\n|S| = K, width(S) = |P(S)| = W, max(|P_ε(S)|) ≤ W and\n\nE(T^{δ,ε}_A(K, W) | A(S) = P(S)) ≥ Θ̃(KW log(1/δ)/ε²).\n\nThe main discrepancy between the usual dueling bandit upper and lower bounds for regret is the log K\nfactor (see e.g. [Komiyama et al., 2015]) and ours is arguably the K factor. It is worth noting that\n\n¹The complete proof for all our results can be found in the supplementary material.\n\nAlgorithm 4 Decoy comparison\n\nGiven (S, ≽) a poset, δ, λ > 0, a, b ∈ S\nInitialisation Create a′, b′ the respective λ-decoys of a and b. 
Maintain p_ab the average number of\nvictories of a over b and I_ab its 1 − δ/2 confidence interval.\nCompare a and b′, and b and a′, until max(|I_{ab′}|, |I_{ba′}|) < λ or p_{ab′} > 0.5 or p_{ba′} > 0.5.\nreturn a ∥ b if max(|I_{ab′}|, |I_{ba′}|) < λ, else a ≻ b if p_{ab′} > 0.5, else b ≻ a.\n\nthis additional complexity is directly related to the goal of finding the entire Pareto front, as can be\nseen in the proof of Theorem 2 (see Supplementary).\n\n4 Finding P using Decoys\nIn this section, we discuss several methods to recover the exact Pareto front from an ε-approximation,\nwhen S is a poset. First, note that P can be found if additional information on the poset is available.\nFor instance, if a lower bound c > 0 on the minimum distance of any arm to the Pareto set—defined\nas d(P) = min{Δ_{ij}, ∀i ∈ P, j ∈ S \\ P, such that i ≻ j}—is known, then since P_c = {P}, Un-\nchainedBandits used with ε_N = c will produce the Pareto front of S. Alternatively, if the size k\nof the Pareto front is known, P can be found by peeling S_t until it achieves the desired size. This can\nbe achieved by successively calling UBSRoutine with parameters S_t, ε_t = γ^t, and δ_t = 6δ/(π²t²),\nand by stopping as soon as |S_t| = k.\nThis additional information may be unavailable in practice, so we propose an approach which does\nnot rely on external information to solve the problem at hand. We devise a strategy which rests\non the idea of decoys, that we now fully develop. First, we formally define decoys for posets, and\nwe prove that they are a sufficient tool to solve the incomparability problem (Algorithm 4). We also\npresent methods for building those decoys, both for the purely formal model of posets and for real-life\nproblems. In the following, λ is a strictly positive real number.\nDefinition 4.1 (λ-decoy). Let a ∈ S. Then b ∈ S is said to be a λ-decoy of a if:\n\n1. a ≻ b and Δ_{a,b} ≥ λ;\n2. ∀c ∈ S, a ∥ c implies b ∥ c;\n3. 
∀c ∈ S such that c ≽ a, Δ_{c,b} ≥ λ.\n\nThe following proposition illustrates how decoys can be used to assess incomparability.\nProposition 4.2 (Decoys and incomparability). Let a and b ∈ S. Let a′ (resp. b′) be a λ-decoy of a\n(resp. b). Then a and b are comparable if and only if max(Δ_{b,a′}, Δ_{a,b′}) ≥ λ.\nAlgorithm 4 is derived from this result. The next proposition, which is an immediate consequence of\nProposition 4.2, gives a theoretical guarantee on its performance.\nProposition 4.3. Algorithm 4 returns the correct incomparability result with probability at least\n1 − δ after at most T comparisons, where T = 4 log(4/δ)/λ².\nAdding decoys to a poset. A poset S may not contain all the necessary decoys. To alleviate this, the\nfollowing proposition states that it is always possible to add relevant decoys to a poset.\nProposition 4.4 (Extending a poset with a decoy). Let (S, ≽, Δ) be a dueling bandit problem on a\nposet S and a ∈ S. Define a′, S′, ≽′, Δ′ as follows:\n• S′ = S ∪ {a′}\n• ∀b, c ∈ S, b ≽ c iff b ≽′ c, and Δ′_{b,c} = Δ_{b,c}\n• ∀b ∈ S, if b ≽ a then b ≻′ a′ and Δ′_{b,a′} = max(Δ_{b,a}, λ). Otherwise, b ∥ a′.\n\nThen (S′, ≽′, Δ′) defines a dueling bandit problem on a poset, Δ′|_S = Δ, and a′ is a λ-decoy of a.\nNote that the addition of decoys in a poset does not disqualify previous decoys, so that this proposition\ncan be used iteratively to produce the required number of decoys.\nDecoys in real life. The intended goal of a decoy a′ of a is to have at hand an arm that is known to\nbe lesser than a. Creating such a decoy in real life can be done by using a degraded version of a: for\nthe case of an item in an online shop, a decoy can be obtained by e.g. increasing the price. 
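The construction of Proposition 4.4 can be sketched on a gap-matrix encoding (an illustrative helper under our own conventions, not the paper's code):

```python
def add_decoy(delta, a, lam):
    """Sketch of extending a poset with a lam-decoy a' of arm a, appended
    as the last index.  Convention: delta[i][j] > 0 iff i strictly
    dominates j; 0 encodes an incomparable pair.  The decoy is dominated
    by a and by every arm dominating a, with gap max(delta[b][a], lam);
    it is incomparable with every other arm."""
    K = len(delta)
    new = [row[:] + [0.0] for row in delta]   # a' starts incomparable with all
    new.append([0.0] * (K + 1))
    for b in range(K):
        if b == a or delta[b][a] > 0:         # b is a itself, or b dominates a
            gap = max(delta[b][a], lam)
            new[b][K] = gap
            new[K][b] = -gap
    return new

# Example: arm 0 dominates arm 1; add a 0.1-decoy of arm 1 (new index 2).
toy = [[0.0, 0.3],
       [-0.3, 0.0]]
ext = add_decoy(toy, a=1, lam=0.1)
```

Since the decoy only adds a new bottom element below `a`, the Pareto front of the extended poset is unchanged, which is what makes iterated applications of the proposition safe.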
Note that\nwhile for large values of the parameter λ of the decoys Algorithm 4 requires fewer comparisons (see\nProposition 4.3), in real-life problems, the second point of Definition 4.1 tends to become false: the\nnew option is actually so much worse than the original that the decoy becomes comparable (and inferior)\nto all the other arms, including previously non-comparable arms (example: if the price becomes\nabsurd). In that case, the use of decoys of arbitrarily large λ can lead to erroneous conclusions about\nthe Pareto front and should be avoided. Given a specific decoy, the problem of estimating λ in a\nreal-life problem may seem difficult. However, as decoys are not new—even though the use we make\nof them here is—a number of methods [Heath and Chatterjee, 1995] have been designed to estimate\nthe quality of a decoy, which is directly related to λ, and, with limited work, this parameter may be\nestimated as well. We refer the interested reader to the aforementioned paper (and references therein)\nfor more details on the available estimation methods.\n\nTable 1: Comparison between the five films with the highest average scores (bottom line) and the five films of\nthe computed ε-Pareto set (top line).\n\nPareto Front: Pulp Fiction | Fight Club | Shawshank Redemption | The Godfather | Star Wars Ep. V\nTop Five: Pulp Fiction | The Usual Suspects | Shawshank Redemption | The Godfather | The Godfather II\n\nUsing decoys. As a consequence of Proposition 4.3, Algorithm 3 used with decoys instead of direct\ncomparisons and ε = λ will produce the exact Pareto front. 
But this process can be very costly,\nas the number of required comparisons is proportional to 1/λ², even for strongly suboptimal arms.\nTherefore, our algorithm, UnchainedBandits, when combined with decoys, first produces an\nε-approximation P̂ of P using a peeling approach and direct comparisons, before refining it into\nP by using Algorithm 3 together with decoys. The following theorems provide guarantees on the\nperformance of this modification of UnchainedBandits.\nTheorem 3. UnchainedBandits applied on S with decoys, parameters λ, δ, N and with a\ndecreasing sequence (ε_t)_{t=1}^{N−1} lower bounded by λ√(K/width(S)), returns the Pareto front P of S with\nprobability at least 1 − δ after at most T comparisons, with\n\nT ≤ O(K width(S) log(NK²/δ)/λ²)   (3)\n\nTheorem 4. UnchainedBandits applied on S with decoys, parameters λ, δ, N and with a\ndecreasing sequence (ε_t)_{t=1}^{N−1} such that ε_{N−1} ≤ λ√K, returns the Pareto front P of S with\nprobability at least 1 − δ while incurring a regret R such that\n\nR ≤ (2K/γ²) log(2NK²/δ) Σ_{i=1}^K 1/Δ_i + K width(S) log(2NK²/δ) Σ_{i : Δ_i < ε_{N−1}, i ∉ P} 1/Δ_i,   (4)\n\nCompared to (2), (4) includes an extra term due to the regret incurred by the use of decoys. In this\nterm, the dependency on S is slightly worse (K width(S) instead of K). However, this extra regret is\nlimited to arms belonging to an ε-approximation of the Pareto front, i.e. nearly optimal arms.\nConstraints on ε. Theorem 4 requires that ε_{N−1} ≤ λ√K, which implies that only near-optimal arms\nremain during the decoy step. This is crucial to obtain a reasonable upper bound on the incurred\nregret, as the number of comparisons using decoys is large (≈ 1/λ²) and is the same for every arm,\nregardless of its regret. Conversely, in Theorem 3—which provides an upper bound on the number of\ncomparisons required to find the Pareto front—the ε_t are required to be lower bounded. 
This bound\nis tight in the (worst-case) scenario where all the arms are λ-indistinguishable, i.e. peeling cannot\neliminate any arm. In that case, any comparison done during the peeling is actually wasted, and the\nlower bound on ε_t allows one to control the number of comparisons made during the peeling step. In\norder to satisfy both constraints, ε_N must be chosen such that λ√(K/width(S)) ≤ ε_N ≤ λ√K. In\nparticular, ε_N = λ√K satisfies both conditions and does not rely on the knowledge of width(S).\n\n5 Numerical Simulations\n\n5.1 Simulated Posets\nHere, we test UnchainedBandits on randomly generated posets of different sizes, widths and\nheights. To evaluate the performance of UnchainedBandits, we compare it to three variants of\ndueling bandit algorithms which were naively modified to handle partial orders and incomparability:\n\nFigure 1: Regret incurred by Modified IF2, Modified RUCB, UniformSampling and UnchainedBandits,\nwhen the structure of the poset varies. Dependence on (left:) height, (center:) size of the Pareto front and (right:)\naddition of suboptimal arms.\n\n1. A simple algorithm, UniformSampling, inspired from the successive elimination algo-\nrithm [Even-Dar et al., 2006], which simultaneously compares all possible pairs of arms\nuntil one of the arms appears suboptimal, at which point it is removed from the set of\nselected arms. When only λ-indistinguishable elements remain, it uses λ-decoys.\n\n2. A modified version of the single-pivot IF2 algorithm [Yue et al., 2012]. Similarly to the\nregular IF2 algorithm, the agent maintains a pivot which is compared to every other element;\nsuboptimal elements are removed and better elements replace the pivot. This algorithm is\nuseful to illustrate the consequences of the multi-pivot approach.\n\n3. A modified version of RUCB [Zoghi et al., 2014]. 
This algorithm is useful to provide a non-\npivot-based perspective.\n\nMore precisely, IF2 and RUCB were modified as follows: the algorithms were provided with the\nadditional knowledge of d(P), the minimum gap between one arm of the Pareto front and any other\ngiven comparable arm. When, during the execution of the algorithm, the empirical gap between two\narms reaches this threshold, the arms are concluded to be incomparable. This allows the agent to\nretrieve the Pareto front iteratively, one element at a time.\nThe random posets are generated as follows: a Pareto front of size p is created, and w disjoint chains\nof length h − 1 are added. Then, the tops of the chains are connected to a random number of elements\nof the Pareto front. This creates the structure of the partial order ≽. Finally, the exact values of\nthe Δ_ij's are obtained from a uniform distribution, conditioned to satisfy Assumption 1 and to have\nd(P) ≥ 0.01. When needed, λ-decoys are created according to Proposition 4.4. For each experiment,\nwe changed the value of one parameter, and left the others to their default values (p = 5, w = 2p,\nh = 10). Additionally, we provide one experiment where we study the influence of the quality of\nthe arms (Δ_i) on the incurred regret, by adding clearly suboptimal arms² to an existing poset. The\nresults are averaged over ten runs and reported in Figure 1. By default, we use\nδ = 1/1000, λ = 1/100, γ = 0.9 and N = ⌊log(λ√K)/log γ⌋.\nResult Analysis. While UniformSampling implements a naive approach, it does outperform the\nmodified IF2. This can be explained as follows: in modified IF2, the pivot is constantly compared to all the\nremaining arms, including all the incomparable, and potentially strongly suboptimal, arms. These\nincomparable arms can only be eliminated after the pivot has changed, which can take a large number\nof comparisons, and produces a large regret. 
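The synthetic posets described above can be generated along these lines (a sketch under our own conventions; details such as the exact gap distribution may differ from the paper's generator):

```python
import random

def random_poset(p, w, h, d_min=0.01, seed=0):
    """Sketch of the Section 5.1 generator: a Pareto front of p mutually
    incomparable arms plus w disjoint chains of h - 1 arms; the top of
    each chain is attached to a nonempty random subset of the front.
    Gaps are drawn uniformly in [d_min, 0.5), keeping d(P) >= d_min.
    Returns the front and the strict-domination edges (i, j) -> Delta_ij."""
    rng = random.Random(seed)
    front = list(range(p))
    dominates = {}                      # edge (i, j) with gap Delta_ij > 0
    arm = p
    for _ in range(w):
        chain = list(range(arm, arm + h - 1))
        arm += h - 1
        # connect the chain top to a random subset of the Pareto front
        for f in rng.sample(front, rng.randint(1, p)):
            dominates[(f, chain[0])] = rng.uniform(d_min, 0.5)
        for upper, lower in zip(chain, chain[1:]):
            dominates[(upper, lower)] = rng.uniform(d_min, 0.5)
    return front, dominates

front, edges = random_poset(p=5, w=10, h=10)
```

With the default parameters p = 5, w = 2p, h = 10, this yields K = p + w(h − 1) = 95 arms whose Pareto front is the p seed arms.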
UnchainedBandits and modified RUCB produce much better results than UniformSampling and modified IF2, and their advantage increases with the complexity of S. While UnchainedBandits performs better than modified RUCB in all the experiments, it is worth noting that the difference is particularly important when additional suboptimal arms are added. In RUCB, the general idea is roughly to compare the best optimistic arm available to its closest opponent. While this approach works well in totally ordered sets, in posets it produces many comparisons between an optimal arm i and an incomparable arm j, because in this case p_ij = 0.5, and j appears to be a close opponent of i, even though j can be clearly suboptimal.

2For this experiment, we say that an arm j is clearly suboptimal if ∃c ∈ P s.t. Δ_cj > 0.15.

5.2 MovieLens Dataset
To illustrate the application of UnchainedBandits to a concrete example, we used the 20-million-item MovieLens dataset [Harper and Konstan, 2015], which contains movie evaluations. Movies can be seen as a poset, as two movies may be incomparable because they belong to different genres (e.g. a horror movie and a documentary). To simulate a dueling bandit on a poset, we proceed as follows: we remove all films with fewer than 50,000 evaluations, thus obtaining 159 films, represented as arms. Then, when comparing two arms, we pick at random a user who has evaluated both films, and compare those evaluations (ties are broken with an unbiased coin toss). Since the decoy tool cannot be used on an offline dataset, we restrict ourselves to finding an ε-approximation of the Pareto front, with ε = 0.05, and parameters γ = 0.9, δ = 0.001 and N = ⌊log(ε)/log(γ)⌋ = 28.
Due to the lack of a ground truth for this experiment, no regret estimation can be provided.
Instead, the resulting ε-Pareto front, which contains 5 films, is listed in Table 1 and compared to the five films, among the original 159, with the highest average scores. It is interesting to note that three films are present in both lists, which reflects the fact that the best films in terms of average score have a high chance of being in the Pareto front. However, the films contained in the Pareto front are more diverse in terms of genre, which is expected of a Pareto front. For instance, the sequel of the film "The Godfather" has been replaced by a film of a totally different genre. It is important to remember that UnchainedBandits does not have access to any information about the genre of a film: its results are based solely on the pairwise evaluations, and this result illustrates the effectiveness of our approach.
Limits of the incomparability model. The hypothesis that i ∥ j ⇒ Δ_ij = 0 might not hold true in all real-life settings: for instance, movies of a niche genre will probably get dominated in user reviews by movies of popular genres, even if they are theoretically incomparable, resulting in their elimination by UnchainedBandits. This might explain why only 5 movies are present in our ε-Pareto front. However, even in this case, the algorithm will produce a subset of the Pareto front, made of incomparable movies from popular genres. Hence, while the algorithm fails at finding all the different genres, it still provides significant diversity.
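The comparison protocol of this experiment (duel two films on the ratings of a random user who evaluated both, break ties with a fair coin) and the choice of the number of peeling iterations N can be sketched as follows. The small in-memory ratings dictionary is a hypothetical stand-in for the MovieLens data, and duel is our own illustrative name:

```python
import math
import random

# Hypothetical stand-in for the MovieLens data: ratings[user][film] = score.
ratings = {
    "u1": {"A": 5.0, "B": 3.5},
    "u2": {"A": 4.0, "B": 4.0},
    "u3": {"A": 2.0, "B": 4.5},
}

def duel(film_i, film_j):
    """One noisy comparison: True iff film_i wins the duel against film_j."""
    # pick at random a user who has evaluated both films
    raters = [u for u, r in ratings.items() if film_i in r and film_j in r]
    u = random.choice(raters)
    if ratings[u][film_i] == ratings[u][film_j]:
        return random.random() < 0.5    # tie broken by an unbiased coin toss
    return ratings[u][film_i] > ratings[u][film_j]

# Number of peeling iterations for an eps-approximation of the Pareto front,
# as in the MovieLens experiment: N = floor(log(eps) / log(gamma)).
eps, gamma = 0.05, 0.9
N = math.floor(math.log(eps) / math.log(gamma))
print(N)  # 28
```

With γ = 0.9 and ε = 0.05 this recovers N = 28, the value used in the experiment; the peeling threshold γ^N then sits just below the target precision ε.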
Future work might include the study of the influence of additional hypotheses on the structure of the poset, and whether some of the ideas proposed here carry over to lattices or upper semi-lattices. Additionally, it is an interesting question whether different approaches to dueling bandits, such as Thompson Sampling [Wu and Liu, 2016], could be applied to the partial order setting, and whether results for the von Neumann problem [Balsubramani et al., 2016] remain valid in the poset setting.

Acknowledgements

We would like to thank the anonymous reviewers of this work for their useful comments, particularly regarding the future work section.

References

Nir Ailon, Zohar Karnin, and Thorsten Joachims. Reducing dueling bandits to cardinal bandits. In Proceedings of the 31st International Conference on Machine Learning, pages 856–864, 2014.

Dan Ariely and Thomas S. Wallsten. Seeking subjective dominance in multidimensional space: An explanation of the asymmetric dominance effect. Organizational Behavior and Human Decision Processes, 63(3):223–232, 1995.

Akshay Balsubramani, Zohar Karnin, Robert E. Schapire, and Masrour Zoghi. Instance-dependent regret bounds for dueling bandits. In Conference on Learning Theory, pages 336–360, 2016.

Constantinos Daskalakis, Richard M. Karp, Elchanan Mossel, Samantha J. Riesenfeld, and Elad Verbin. Sorting and selection in posets. SIAM Journal on Computing, 40(3):597–622, 2011.

Madalina M. Drugan and Ann Nowé. Designing multi-objective multi-armed bandits algorithms: a study. In Neural Networks (IJCNN), The 2013 International Joint Conference on, pages 1–8. IEEE, 2013.

Miroslav Dudík, Katja Hofmann, Robert E. Schapire, Aleksandrs Slivkins, and Masrour Zoghi. Contextual dueling bandits. In Conference on Learning Theory, pages 563–587, 2015.

Eyal Even-Dar, Shie Mannor, and Yishay Mansour.
Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems. The Journal of Machine Learning Research, 7:1079–1105, 2006.

Uriel Feige, Prabhakar Raghavan, David Peleg, and Eli Upfal. Computing with noisy information. SIAM Journal on Computing, 23(5):1001–1018, 1994.

F. Maxwell Harper and Joseph A. Konstan. The MovieLens datasets: History and context. ACM Transactions on Interactive Intelligent Systems (TiiS), 5(4):19, 2015.

Timothy B. Heath and Subimal Chatterjee. Asymmetric decoy effects on lower-quality versus higher-quality brands: Meta-analytic and experimental evidence. Journal of Consumer Research, 22(3):268–284, 1995.

Joel Huber, John W. Payne, and Christopher Puto. Adding asymmetrically dominated alternatives: Violations of regularity and the similarity hypothesis. Journal of Consumer Research, pages 90–98, 1982.

Junpei Komiyama, Junya Honda, Hisashi Kashima, and Hiroshi Nakagawa. Regret lower bound and optimal algorithm in dueling bandit problem. In Conference on Learning Theory, pages 1141–1154, 2015.

Junpei Komiyama, Junya Honda, and Hiroshi Nakagawa. Copeland dueling bandit problem: Regret lower bound, optimal algorithm, and computationally efficient algorithm. arXiv preprint arXiv:1605.01677, 2016.

Siddartha Y. Ramamohan, Arun Rajkumar, and Shivani Agarwal. Dueling bandits: Beyond Condorcet winners to general tournament solutions. In Advances in Neural Information Processing Systems, pages 1253–1261, 2016.

Robert B. Ash. Information Theory. 1990.

Constantine Sedikides, Dan Ariely, and Nils Olsen. Contextual and procedural determinants of partner selection: Of asymmetric dominance and prominence. Social Cognition, 17(2):118–139, 1999.

Amos Tversky and Daniel Kahneman. The framing of decisions and the psychology of choice. Science, 211(4481):453–458, 1981.

Huasen Wu and Xin Liu.
Double Thompson sampling for dueling bandits. In Advances in Neural Information Processing Systems, pages 649–657, 2016.

Yisong Yue and Thorsten Joachims. Beat the mean bandit. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 241–248, 2011.

Yisong Yue, Josef Broder, Robert Kleinberg, and Thorsten Joachims. The K-armed dueling bandits problem. Journal of Computer and System Sciences, 78(5):1538–1556, 2012.

Masrour Zoghi, Shimon Whiteson, Rémi Munos, and Maarten de Rijke. Relative upper confidence bound for the K-armed dueling bandit problem. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 10–18, 2014.

Masrour Zoghi, Zohar S. Karnin, Shimon Whiteson, and Maarten de Rijke. Copeland dueling bandits. In Advances in Neural Information Processing Systems, pages 307–315, 2015a.

Masrour Zoghi, Shimon Whiteson, and Maarten de Rijke. MergeRUCB: A method for large-scale online ranker evaluation. In Proceedings of the Eighth ACM International Conference on Web Search and Data Mining, pages 17–26. ACM, 2015b.