{"title": "Dueling Bandits: Beyond Condorcet Winners to General Tournament Solutions", "book": "Advances in Neural Information Processing Systems", "page_first": 1253, "page_last": 1261, "abstract": "Recent work on deriving $O(\\log T)$ anytime regret bounds for stochastic dueling bandit problems has considered mostly Condorcet winners, which do not always exist, and more recently, winners defined by the Copeland set, which do always exist. In this work, we consider a broad notion of winners defined by tournament solutions in social choice theory, which include the Copeland set as a special case but also include several other notions of winners such as the top cycle, uncovered set, and Banks set, and which, like the Copeland set, always exist. We develop a family of UCB-style dueling bandit algorithms for such general tournament solutions, and show $O(\\log T)$ anytime regret bounds for them. Experiments confirm the ability of our algorithms to achieve low regret relative to the target winning set of interest.", "full_text": 
"Dueling Bandits: Beyond Condorcet Winners to General Tournament Solutions\n\nSiddartha Ramamohan, Indian Institute of Science, Bangalore 560012, India, siddartha.yr@csa.iisc.ernet.in\nArun Rajkumar, Xerox Research, Bangalore 560103, India, arun_r@csa.iisc.ernet.in\nShivani Agarwal, University of Pennsylvania, Philadelphia, PA 19104, USA, ashivani@seas.upenn.edu\n\nAbstract\n\nRecent work on deriving $O(\\log T)$ anytime regret bounds for stochastic dueling bandit problems has considered mostly Condorcet winners, which do not always exist, and more recently, winners defined by the Copeland set, which do always exist. In this work, we consider a broad notion of winners defined by tournament solutions in social choice theory, which include the Copeland set as a special case but also include several other notions of winners such as the top cycle, uncovered set, and Banks set, and which, like the Copeland set, always exist. We develop a family of UCB-style dueling bandit algorithms for such general tournament solutions, and show $O(\\log T)$ anytime regret bounds for them. Experiments confirm the ability of our algorithms to achieve low regret relative to the target winning set of interest.\n\n1 Introduction\n\nThere has been significant interest and progress in recent years in developing algorithms for dueling bandit problems [1\u201311]. Here there are $K$ arms; on each trial $t$, one selects a pair of arms $(i_t, j_t)$ for comparison, and receives a binary feedback signal $y_t \\in \\{0,1\\}$ indicating which arm was preferred. Most work on dueling bandits is in the stochastic setting and assumes a stochastic model (a preference matrix $P$ of pairwise comparison probabilities $P_{ij}$) from which the feedback signals $y_t$ are drawn; as with standard stochastic multi-armed bandits, the target here is usually to design algorithms with $O(\\ln T)$ regret bounds, and where possible, $O(\\ln T)$ anytime (or \u2018horizon-free\u2019) regret bounds, for which the algorithm does not need to know the horizon or number of trials $T$ in advance. Early work on dueling bandits often assumed strong conditions on the preference matrix $P$, such as existence of a total order, under which there is a natural notion of a \u2018maximal\u2019 element with respect to which regret is measured. Recent work has sought to design algorithms under weaker conditions on $P$; most work, h
owever, has assumed the existence of a Condorcet winner, which is an arm $i$ that beats every other arm $j$ ($P_{ij} > \\frac{1}{2} \\; \\forall j \\neq i$), and which reduces to the maximal element when a total order exists. Unfortunately, the Condorcet winner does not always exist, and this has motivated a search for other natural notions of winners, such as Borda winners and the Copeland set (see Figure 1).$^1$ Among these, the only work that offers anytime $O(\\ln T)$ regret bounds is the recent work of Zoghi et al. [11] on Copeland sets. In this work, we consider defining winners in dueling bandits via the natural notion of tournament solutions used in social choice theory, of which the Copeland set is a special case. We develop general upper confidence bound (UCB) style dueling bandit algorithms for a number of tournament solutions including the top cycle, uncovered set, and Banks set, and prove $O(\\ln T)$ anytime regret bounds for them, where the regret is measured relative to the tournament solution of interest. Our proof technique is modular and can be used to develop algorithms with similar bounds for any tournament solution for which a \u2018selection procedure\u2019 satisfying certain \u2018safety conditions\u2019 can be designed. Experiments confirm the ability of our algorithms to achieve low regret relative to the target winning set of interest.\n\nFootnote 1: Recently, Dudik et al. [10] also studied von Neumann winners, although they did so in a different (contextual) setting, leading to $O(T^{1/2})$ and $O(T^{2/3})$ regret bounds.\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\nAlgorithm | Condition on $P$ | Target winner | Anytime?\nMultiSBM [5] | U-Lin | Condorcet winner | \u2713\nIF [1] | TO+SST+STI | Condorcet winner | \u00d7\nBTMB [2] | TO+RST+STI | Condorcet winner | \u00d7\nRUCB [6] | CW | Condorcet winner | \u2713\nMergeRUCB [7] | CW | Condorcet winner | \u2713\nRMED [9] | CW | Condorcet winner | \u2713\nSECS [8] | UBW | Borda winner | \u00d7\nPBR-SE [4] | DBS | Borda winner | \u00d7\nPBR-CO [4] | Any $P$ without ties | Copeland set | \u00d7\nSAVAGE-BO [3] | Any $P$ without ties | Borda winner | \u00d7\nSAVAGE-CO [3] | Any $P$ without ties | Copeland set | \u00d7\nCCB, SCB [11] | Any $P$ without ties | Copeland set | \u2713\nUCB-TC | Any $P$ without ties | Top cycle | \u2713\nUCB-UC | Any $P$ without ties | Uncovered set | \u2713\nUCB-BA | Any $P$ without ties | Banks set | \u2713\n\nFigure 1: Summary of algorithms for stochastic dueling bandit problems that have $O(\\ln T)$ regret bounds, together with c
orresponding conditions on the underlying preference matrix $P$, target winners used in defining regret, and whether the regret bounds are \"anytime\". The figure on the left shows relations between some of the commonly studied conditions on $P$ (see Table 1 for definitions). The algorithms in the lower part of the table (shown in red) are proposed in this paper.\n\n2 Dueling Bandits, Tournament Solutions, and Regret Measures\n\nDueling Bandits. We denote by $[K] = \\{1, \\ldots, K\\}$ the set of $K$ arms. On each trial $t$, the learner selects a pair of arms $(i_t, j_t) \\in [K] \\times [K]$ (with $i_t$ possibly equal to $j_t$), and receives feedback in the form of a comparison outcome $y_t \\in \\{0,1\\}$, with $y_t = 1$ indicating $i_t$ was preferred over $j_t$ and $y_t = 0$ indicating the reverse. The goal of the learner is to select as often as possible from a set of \u2018good\u2019 or \u2018winning\u2019 arms, which we formalize below as a tournament solution. The pairwise feedback on each trial is assumed to be generated stochastically according to a fixed but unknown pairwise preference model represented by a preference matrix $P \\in [0,1]^{K \\times K}$ with $P_{ij} + P_{ji} = 1 \\; \\forall i,j$: whenever arms $i$ and $j$ are compared, $i$ is preferred to $j$ with probability $P_{ij}$, and $j$ to $i$ with probability $P_{ji} = 1 - P_{ij}$. Thus for each trial $t$, we have $y_t \\sim \\mathrm{Bernoulli}(P_{i_t j_t})$. We assume throughout this paper that there are no \u201cties\u201d between distinct arms, i.e. that $P_{ij} \\neq \\frac{1}{2} \\; \\forall i \\neq j$.$^2$ We denote by $\\mathcal{P}_K$ the set of all such preference matrices over $K$ arms: $\\mathcal{P}_K = \\{ P \\in [0,1]^{K \\times K} : P_{ij} + P_{ji} = 1 \\; \\forall i,j ; \\; P_{ij} \\neq \\frac{1}{2} \\; \\forall i \\neq j \\}$. For any pair of arms $(i,j)$, we will define the margin of $(i,j)$ w.r.t. $P$ as $\\Delta^P_{ij} = |P_{ij} - \\frac{1}{2}|$. Previous work on dueling bandits has considered a variety of conditions on $P$; see Table 1 and Figure 1. Our interest here is in designing algorithms that have regret guarantees under minimal restrictions on $P$. To this end, we will consider general notions of winners that are derived from a natural tournament associated with $P$, and that are always guaranteed to exist. We will say an arm $i$ beats an arm $j$ w.r.t. $P$ if $P_{ij} > \\frac{1}{2}$; we will express this as a binary relation $\\succ_P$ on $[K]$: $i \\succ_P j \\iff P_{ij} > \\frac{1}{2}$. The tournament associated with $P$ is then simply $T_P = ([K], E_P)$, where $E_P = \\{(i,j) : i \\succ_P j\\}$. Two frequently studied notions of winners in previo
us work on dueling bandits, both of which are derived from the tournament $T_P$ (and which are the targets of previous anytime regret bounds), are the Condorcet winner when it exists, and the Copeland set in general:\n\nDefinition 1 (Condorcet winner). Let $P \\in \\mathcal{P}_K$. If there exists an arm $i^* \\in [K]$ such that $i^* \\succ_P j \\; \\forall j \\neq i^*$, then $i^*$ is said to be a Condorcet winner w.r.t. $P$.\n\nDefinition 2 (Copeland set). Let $P \\in \\mathcal{P}_K$. The Copeland set w.r.t. $P$, denoted $\\mathrm{CO}(P)$, is defined as the set of all arms in $[K]$ that beat the maximal number of arms w.r.t. $P$: $\\mathrm{CO}(P) = \\arg\\max_{i \\in [K]} \\sum_{j \\neq i} \\mathbf{1}(i \\succ_P j)$.\n\nHere we are interested in more general notions of winning sets derived from the tournament $T_P$.\n\nFootnote 2: The assumption of no ties was also made in deriving regret bounds w.r.t. the Copeland set in [3, 4, 11], and exists implicitly in [1, 2] as well.\n\nTable 1: Commonly studied conditions on the preference matrix $P$.\nCondition on $P$ | Property satisfied by $P$\nUtility-based with linear link (U-Lin) | $\\exists u \\in [0,1]^K : P_{ij} = \\frac{1 - (u_i - u_j)}{2} \\; \\forall i,j$\nTotal order (TO) | $\\exists \\sigma \\in S_n : P_{ij} > \\frac{1}{2} \\iff \\sigma(i) < \\sigma(j)$\nStrong stochastic transitivity (SST) | $P_{ij} > \\frac{1}{2}, P_{jk} > \\frac{1}{2} \\implies P_{ik} \\geq \\max(P_{ij}, P_{jk})$\nRelaxed stochastic transitivity (RST) | $\\exists \\gamma \\geq 1 : P_{ij} > \\frac{1}{2}, P_{jk} > \\frac{1}{2} \\implies P_{ik} - \\frac{1}{2} \\geq \\frac{1}{\\gamma} \\max(P_{ij} - \\frac{1}{2}, P_{jk} - \\frac{1}{2})$\nStochastic triangle inequality (STI) | $P_{ij} > \\frac{1}{2}, P_{jk} > \\frac{1}{2} \\implies P_{ik} \\leq P_{ij} + P_{jk} - \\frac{1}{2}$\nCondorcet winner (CW) | $\\exists i : P_{ij} > \\frac{1}{2} \\; \\forall j \\neq i$\nUnique Borda winner (UBW) | $\\exists i : \\sum_{k \\neq i} P_{ik} > \\sum_{k \\neq j} P_{jk} \\; \\forall j \\neq i$\nDistinct Borda scores (DBS) | $\\sum_{k \\neq i} P_{ik} \\neq \\sum_{k \\neq j} P_{jk} \\; \\forall i \\neq j$\n\n
Figure 2: Examples of various tournaments together with their corresponding tournament solutions. Edges that are not explicitly shown are directed from left to right; edges that are incident on subsets of nodes (rounded rectangles) apply to all nodes within. Left: A tournament on 5 nodes with gradually discriminating tournament solutions. Middle: The Hudry tournament on 13 nodes with disjoint Copeland and Banks sets. Right: A tournament on 8 nodes based on ATP tennis match records.\n\nTournament Solutions. Tournament solutions have long been used in social choice and voting theory to define winners in general tournaments when no Condorcet winner exists [12, 13]. Specifically, a tournament solution is any mapping that maps each tournament on $K$ nodes to a subset of \u2018winning\u2019 nodes in $[K]$; for our purposes, we will define a tournament solution to be any mapping $S : \\mathcal{P}_K \\to 2^{[K]}$ that maps each preference matrix $P$ (via the induced tournament $T_P$) to a subset of winning arms $S(P) \\subseteq [K]$.$^3$ The Copeland set is one such tournament solution. We consider three additional tournament solutions in this paper: the top cycle, the uncovered set, and the Banks set, all of which offer other natural generalizations of the Condorcet winner. These tournament solutions are motivated by different considerations (ranging from dominance to covering to decomposition into acyclic subtournaments) and have graded discriminative power, and can therefore be used to match the needs of different applications; see [12] for a comprehensive survey.\n\nDefinition 3 (Top cycle). Let $P \\in \\mathcal{P}_K$. The top cycle w.r.t. $P$, denoted $\\mathrm{TC}(P)$, is defined as the smallest set $W \\subseteq [K]$ for which $i \\succ_P j \\; \\forall i \\in W, j \\notin W$.\n\nDefinition 4 (Uncovered set). Let $P \\in \\mathcal{P}_K$. An arm $i$ is said to cover an arm $j$ w.r.t. $P$ if $i \\succ_P j$ and $\\forall k : j \\succ_P k \\implies i \\succ_P k$. The uncovered set w.r.t. $P$, denoted $\\mathrm{UC}(P)$, is defined as the set of all arms that are not covered by any other arm w.r.t. $P$: $\\mathrm{UC}(P) = \\{ i \\in [K] : \\nexists j \\in [K] \\text{ s.t. } j \\text{ covers } i \\text{ w.r.t. } P \\}$.\n\nDefinition 5 (Banks set). Let $P \\in \\mathcal{P}_K$. A subtournament $T = (V, E)$ of $T_P$, where $V \\subseteq [K]$ and $E = E_P|_{V \\times V}$, is said to be maximal acyclic if (i) $T$ is acyclic, and (ii) no other subtournament containing $T$ is acyclic. Denote by $\\mathrm{MAST}(P)$ the set of all maximal acycli
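c subtournaments of $T_P$, and for each $T \\in \\mathrm{MAST}(P)$, denote by $m^*(T)$ the maximal element of $T$. Then the Banks set w.r.t. $P$, denoted $\\mathrm{BA}(P)$, is defined as the set of maximal elements of all maximal acyclic subtournaments of $T_P$: $\\mathrm{BA}(P) = \\{ m^*(T) : T \\in \\mathrm{MAST}(P) \\}$.\n\nIt is known that $\\mathrm{BA}(P) \\subseteq \\mathrm{UC}(P) \\subseteq \\mathrm{TC}(P)$ and $\\mathrm{CO}(P) \\subseteq \\mathrm{UC}(P) \\subseteq \\mathrm{TC}(P)$. In general, $\\mathrm{BA}(P)$ and $\\mathrm{CO}(P)$ may intersect, although they can also be disjoint. When $P$ contains a Condorcet winner $i^*$, all four tournament solutions reduce to just the singleton set $\\{i^*\\}$. See Figure 2 for examples.\n\nFootnote 3: Strictly speaking, the mapping $S$ must be invariant under permutations of the node labels.\n\nRegret Measures. When $P$ admits a Condorcet winner $i^*$, the individual regret of an arm $i$ is usually defined as $r^{\\mathrm{CW}}_P(i) = \\Delta^P_{i^*,i}$, and the cumulative regret over $T$ trials of an algorithm $A$ that selects arms $(i_t, j_t)$ on trial $t$ is then generally defined as $R^{\\mathrm{CW}}_T(A) = \\sum_{t=1}^T r^{\\mathrm{CW}}_P(i_t, j_t)$, where the pairwise regret $r^{\\mathrm{CW}}_P(i,j)$ is either the average regret $\\frac{1}{2}\\big(r^{\\mathrm{CW}}_P(i) + r^{\\mathrm{CW}}_P(j)\\big)$, the strong regret $\\max\\big(r^{\\mathrm{CW}}_P(i), r^{\\mathrm{CW}}_P(j)\\big)$, or the weak regret $\\min\\big(r^{\\mathrm{CW}}_P(i), r^{\\mathrm{CW}}_P(j)\\big)$ [1, 2, 6, 7, 9].$^4$ When the target winner is a tournament solution $S$, we can similarly define a suitable notion of individual regret of an arm $i$ w.r.t. $S$, and then use this to define pairwise regrets as above. In particular, for the three tournament solutions discussed above, we will define the following natural notions of individual regret:\n\n$r^{\\mathrm{TC}}_P(i) = \\begin{cases} \\max_{i^* \\in \\mathrm{TC}(P)} \\Delta^P_{i^*,i} & \\text{if } i \\notin \\mathrm{TC}(P) \\\\ 0 & \\text{if } i \\in \\mathrm{TC}(P) \\end{cases}$;\n\n$r^{\\mathrm{UC}}_P(i) = \\begin{cases} \\max_{i^* \\in \\mathrm{UC}(P) : i^* \\text{ covers } i} \\Delta^P_{i^*,i} & \\text{if } i \\notin \\mathrm{UC}(P) \\\\ 0 & \\text{if } i \\in \\mathrm{UC}(P) \\end{cases}$;\n\n$r^{\\mathrm{BA}}_P(i) = \\begin{cases} \\max_{T \\in \\mathrm{MAST}(P) : T \\text{ contains } i} \\Delta^P_{m^*(T),i} & \\text{if } i \\notin \\mathrm{BA}(P) \\\\ 0 & \\text{if } i \\in \\mathrm{BA}(P) \\end{cases}$.\n\nIn the special case when $P$ admits a Condorcet winner $i^*$, the three individual regrets above all reduce to the Condorcet individual regret, $r^{\\mathrm{CW}}_P(i) = \\Delta^P_{i^*,i}$. In each case above, the cumulative regret of an algorithm $A$ over $T$ trials will then be given by $R^S_T(A) = \\sum_{t=1}^T r^S_P(i_t, j_t)$, where the pairwise regret $r^S_P(i,j)$ can be the average regret $\\frac{1}{2}\\big(r^S_P(i) + r^S_P(j)\\big)$, the strong regret $\\max\\big(r^S_P(i), r^S_P(j)\\big)$, or the weak regret $\\min\\big(r^S_P(i), r^S_P(j)\\big)$. Our regret bou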
nds will hold for each of these forms of pairwise regret. In fact, our regret bounds hold for any measure of pairwise regret $r^S_P(i,j)$ that satisfies the following three conditions: (i) $r^S_P(\\cdot,\\cdot)$ is normalized: $r^S_P(i,j) \\in [0,1] \\; \\forall i,j$; (ii) $r^S_P(\\cdot,\\cdot)$ is symmetric: $r^S_P(i,j) = r^S_P(j,i) \\; \\forall i,j$; and (iii) $r^S_P(\\cdot,\\cdot)$ is proper w.r.t. $S$: $i,j \\in S(P) \\implies r^S_P(i,j) = 0$. It is easy to verify that for the three tournament solutions above, the average, strong and weak pairwise regrets above all satisfy these conditions.$^{5,6}$\n\n3 UCB-TS: Generic Dueling Bandit Algorithm for Tournament Solutions\n\nAlgorithm. In Algorithm 1 we outline a generic dueling bandit algorithm, which we call UCB-TS, for identifying winners from a general tournament solution. The algorithm can be instantiated to specific tournament solutions by designing suitable selection procedures SELECTPROC-TS (more details below). The algorithm maintains a matrix $U^t \\in \\mathbb{R}_+^{K \\times K}$ of upper confidence bounds (UCBs) $U^t_{ij}$ on the unknown pairwise preference probabilities $P_{ij}$. The UCBs are constructed by adding a confidence term to the current empirical estimate of $P_{ij}$; the exploration parameter $\\alpha > \\frac{1}{2}$ controls the exploration rate of the algorithm via the size of the confidence terms used. On each trial $t$, the algorithm selects a pair of arms $(i_t, j_t)$ based on the current UCB matrix $U^t$ using the selection procedure SELECTPROC-TS; on observing the preference feedback $y_t$, the algorithm then updates the UCBs for all pairs of arms $(i,j)$ (the UCBs of all pairs $(i,j)$ grow slowly with $t$ so that pairs that have not been selected for a while have an increasing chance of being explored). In order to instantiate the UCB-TS algorithm to a particular tournament solution $S$, the critical step is in designing the selection procedure SELECTPROC-TS in a manner that yields good regret bounds for a suitable regret measure w.r.t. $S$. Below we identify general conditions on SELECTPROC-TS that allow for $O(\\ln T)$ anytime regret bounds to be obtained (we will design procedures satisfying these conditions for the three tournament solutions of interest in Section 4).\n\nFootnote 4: The notion of regret used in [5] was slightly different.\n\nFootnote 5: It is also easy to verify that defining the individual regrets as the minimum or average margin relative to all relev
ant arms in the tournament solution of interest (instead of the maximum margin as done above) also preserves these properties, and therefore our regret bounds hold for the resulting variants of regret as well.\n\nFootnote 6: One can also consider defining the individual regrets simply in terms of mistakes relative to the target tournament solution of interest, e.g. $r^{\\mathrm{TC}}_P(i) = \\mathbf{1}(i \\notin \\mathrm{TC}(P))$, and define average/strong/weak pairwise regrets in terms of these; our bounds also apply in this case.\n\nAlgorithm 1 UCB-TS\n1: Require: Selection procedure SELECTPROC-TS\n2: Parameter: Exploration parameter $\\alpha > \\frac{1}{2}$\n3: Initialize: $\\forall (i,j) \\in [K] \\times [K]$: $N^1_{ij} = 0$ // # times $(i,j)$ has been compared; $W^1_{ij} = 0$ // # times $i$ has won against $j$; $U^1_{ij} = \\frac{1}{2}$ if $i = j$, and $1$ otherwise // UCB for $P_{ij}$\n4: For $t = 1, 2, \\ldots$ do:\n5: Select $(i_t, j_t) \\leftarrow$ SELECTPROC-TS$(U^t)$\n6: Receive preference feedback $y_t \\in \\{0,1\\}$\n7: Update counts: $\\forall (i,j) \\in [K] \\times [K]$: $N^{t+1}_{ij} = N^t_{ij} + 1$ if $\\{i,j\\} = \\{i_t, j_t\\}$, and $N^t_{ij}$ otherwise; $W^{t+1}_{ij} = W^t_{ij} + y_t$ if $(i,j) = (i_t, j_t)$, $W^t_{ij} + (1 - y_t)$ if $(i,j) = (j_t, i_t)$, and $W^t_{ij}$ otherwise\n8: Update UCBs: $\\forall (i,j) \\in [K] \\times [K]$: $U^{t+1}_{ij} = \\frac{1}{2}$ if $i = j$; $1$ if $i \\neq j$ and $N^{t+1}_{ij} = 0$; and $\\frac{W^{t+1}_{ij}}{N^{t+1}_{ij}} + \\sqrt{\\frac{\\alpha \\ln t}{N^{t+1}_{ij}}}$ otherwise\n\nRegret Analysis. We show here that if the selection procedure SELECTPROC-TS satisfies two natural conditions w.r.t. a tournament solution $S$, namely the safe identical-arms condition w.r.t. $S$ and the safe distinct-arms condition w.r.t. $S$, then the resulting instantiation of the UCB-TS algorithm has an $O(\\ln T)$ regret bound for any regret measure that is normalized, symmetric, and proper w.r.t. $S$. The first condition ensures that if the UCB matrix $U$ given as input to SELECTPROC-TS in fact forms an element-wise upper bound on the true preference matrix $P$ and SELECTPROC-TS returns two identical arms $(i,i)$, then $i$ must be in the winning set $S(P)$. The second condition ensures that if $U$ upper bounds $P$ and SELECTPROC-TS returns two distinct arms $(i,j)$, $i \\neq j$, then either both $i,j$ are in the winning set $S(P)$, or the UCBs $U_{ij}, U_{ji}$ are still loose (and $(i,j)$ should be explored further).\n\nDefinition 6 (Safe identical-arms condition). Let $S : \\mathcal{P}_K \\to 2^{[K]}$ be a tournament solution. We will say a selection procedure SELECTPRO
C-TS: $\\mathbb{R}_+^{K \\times K} \\to [K] \\times [K]$ satisfies the safe identical-arms condition w.r.t. $S$ if for all $P \\in \\mathcal{P}_K$, $U \\in \\mathbb{R}_+^{K \\times K}$ such that $P_{ij} \\leq U_{ij} \\; \\forall i,j$, we have SELECTPROC-TS$(U) = (i,i) \\implies i \\in S(P)$.\n\nDefinition 7 (Safe distinct-arms condition). Let $S : \\mathcal{P}_K \\to 2^{[K]}$ be a tournament solution. We will say a selection procedure SELECTPROC-TS: $\\mathbb{R}_+^{K \\times K} \\to [K] \\times [K]$ satisfies the safe distinct-arms condition w.r.t. $S$ if for all $P \\in \\mathcal{P}_K$, $U \\in \\mathbb{R}_+^{K \\times K}$ such that $P_{ij} \\leq U_{ij} \\; \\forall i,j$, we have SELECTPROC-TS$(U) = (i,j)$, $i \\neq j \\implies \\{ (i,j) \\in S(P) \\times S(P) \\}$ or $\\{ U_{ij} + U_{ji} \\geq 1 + \\Delta^P_{ij} \\}$.\n\nIn what follows, for $K \\in \\mathbb{Z}_+$, $\\alpha > \\frac{1}{2}$, and $\\delta \\in (0,1]$, we define $C(K,\\alpha,\\delta) = \\left( \\frac{(4\\alpha - 1) K^2}{(2\\alpha - 1) \\delta} \\right)^{1/(2\\alpha - 1)}$. This quantity, which also appears in the analysis of RUCB [6], acts as an initial time period beyond which all the UCBs $U_{ij}$ upper bound $P_{ij}$ w.h.p. We have the following result (proof in Appendix A):\n\nTheorem 8 (Regret bound for UCB-TS algorithm). Let $S : \\mathcal{P}_K \\to 2^{[K]}$ be a tournament solution, and suppose the selection procedure SELECTPROC-TS used in the UCB-TS algorithm satisfies both the safe identical-arms condition w.r.t. $S$ and the safe distinct-arms condition w.r.t. $S$. Let $P \\in \\mathcal{P}_K$, and let $r^S_P(i,j)$ be any normalized, symmetric, proper regret measure w.r.t. $S$. Let $\\alpha > \\frac{1}{2}$ and $\\delta \\in (0,1]$. Then with probability at least $1 - \\delta$ (over the feedback $y_t$ drawn randomly from $P$ and any internal randomness in SELECTPROC-TS), the cumulative regret of the UCB-TS algorithm with exploration parameter $\\alpha$ is upper bounded as $R^S_T\\big(\\text{UCB-TS}(\\alpha)\\big) \\leq C(K,\\alpha,\\delta) + 4\\alpha (\\ln T) \\left( \\sum_{i < j : (i,j) \\notin S(P) \\times S(P)} \\frac{r^S_P(i,j)}{(\\Delta^P_{ij})^2} \\right)$.\n\nFigure 3: Inferences about the direction of preference between arms $i$ and $j$ under the true preference matrix $P$ based on the UCBs $U_{ij}, U_{ji}$, assuming that $P_{ij}, P_{ji}$ are upper bounded by $U_{ij}, U_{ji}$.\n\n4 Dueling Bandit Algorithms for Top Cycle, Uncovered Set, and Banks Set\n\nBelow we give selection procedures satisfying both the safe identical-arms condition and the safe distinct-arms condition above w.r.t. the top cycle, uncovered set, and Banks set, which immediately yield duel
ing bandit algorithms with $O(\\ln T)$ regret bounds w.r.t. these tournament solutions. An instantiation of our framework to the Copeland set is also discussed in Appendix E. The selection procedure for each tournament solution is closely related to the corresponding winner determination algorithm for that tournament solution; however, while a standard winner determination algorithm would have access to the actual tournament $T_P$, the selection procedures we design can only guess (with high confidence) the preference directions between some pairs of arms based on the UCB matrix $U$. In particular, if the entries of $U$ actually upper bound those of $P$, then for any pair of arms $i$ and $j$, one of the following must be true (see also Figure 3):\n\u2022 $U_{ij} < \\frac{1}{2}$, in which case $P_{ij} \\leq U_{ij} < \\frac{1}{2}$ and therefore $j \\succ_P i$;\n\u2022 $U_{ji} < \\frac{1}{2}$, in which case $P_{ji} \\leq U_{ji} < \\frac{1}{2}$ and therefore $i \\succ_P j$;\n\u2022 $U_{ij} \\geq \\frac{1}{2}$ and $U_{ji} \\geq \\frac{1}{2}$, in which case the direction of preference between $i$ and $j$ in $T_P$ is unresolved.\n\nThe selection procedures we design manage the exploration-exploitation tradeoff by adopting an optimism followed by pessimism approach, similar to that used in the design of the RUCB and CCB algorithms [6, 11]. Specifically, our selection procedures first optimistically identify a potential winning arm $a$ based on the UCBs $U$ (by optimistically setting directions of any unresolved edges in $T_P$ in favor of the arm being considered; see Figure 3). Once a putative winning arm $a$ is identified, the selection procedures then pessimistically find an arm $b$ that has the greatest chance of invalidating $a$ as a winning arm, and select the pair $(a,b)$ for comparison.\n\n4.1 UCB-TC: Dueling Bandit Algorithm for Top Cycle\n\nThe selection procedure SELECTPROC-TC (Algorithm 2), when instantiated in the UCB-TS template, yields the UCB-TC dueling bandit algorithm. Intuitively, SELECTPROC-TC constructs an optimistic estimate $A$ of the top cycle based on the UCBs $U$ (line 2), and selects a potential winning arm $a$ from $A$ (line 3); if there is no unresolved arm against $a$ (line 5), then it returns $(a,a)$ for comparison, else it selects the best-performing unresolved opponent $b$ (line 8) and returns $(a,b)$ for comparison. We have the following result (see Appendix B for a proof):\n\nTheorem 9 (SELECTPROC-TC satisfies safety conditions w.r.t. TC). SELECTPROC-TC sati
sfies both the safe identical-arms condition and the safe distinct-arms condition w.r.t. TC.\n\nBy virtue of Theorem 8, this immediately yields the following regret bound for UCB-TC:\n\nCorollary 10 (Regret bound for UCB-TC algorithm). Let $P \\in \\mathcal{P}_K$. Let $\\alpha > \\frac{1}{2}$ and $\\delta \\in (0,1]$. Then with probability at least $1 - \\delta$, the cumulative regret of UCB-TC w.r.t. the top cycle satisfies $R^{\\mathrm{TC}}_T\\big(\\text{UCB-TC}(\\alpha)\\big) \\leq C(K,\\alpha,\\delta) + 4\\alpha (\\ln T) \\left( \\sum_{i < j : (i,j) \\notin \\mathrm{TC}(P) \\times \\mathrm{TC}(P)} \\frac{r^{\\mathrm{TC}}_P(i,j)}{(\\Delta^P_{ij})^2} \\right)$.\n\n4.2 UCB-UC: Dueling Bandit Algorithm for Uncovered Set\n\nThe selection procedure SELECTPROC-UC (Algorithm 3), when instantiated in the UCB-TS template, yields the UCB-UC dueling bandit algorithm. SELECTPROC-UC relies on the property that an uncovered arm beats every other arm either directly or via an intermediary [12]. SELECTPROC-UC optimistically identifies such a potentially uncovered arm $a$ based on the UCBs $U$ (line 5); if it can be resolved that $a$ is indeed uncovered (line 7), then it returns $(a,a)$, else it selects the best-performing unresolved opponent $b$ when available (line 11), or an arbitrary opponent $b$ otherwise (line 13), and returns $(a,b)$. We have the following result (see Appendix C for a proof):\n\nAlgorithm 2 SELECTPROC-TC\n1: Input: UCB matrix $U \\in \\mathbb{R}_+^{K \\times K}$\n2: Let $A \\subseteq [K]$ be any minimal set satisfying $U_{ij} \\geq \\frac{1}{2} \\; \\forall i \\in A, j \\notin A$\n3: Select any $a \\in \\arg\\max_{i \\in A} \\min_{j \\notin A} U_{ij}$\n4: $B \\leftarrow \\{ i \\neq a : U_{ai} \\geq \\frac{1}{2} \\wedge U_{ia} \\geq \\frac{1}{2} \\}$\n5: if $B = \\emptyset$ then\n6: Return $(a,a)$\n7: else\n8: Select any $b \\in \\arg\\max_{i \\in B} U_{ia}$\n9: Return $(a,b)$\n10: end if\n\nAlgorithm 3 SELECTPROC-UC\n1: Input: UCB matrix $U \\in \\mathbb{R}_+^{K \\times K}$\n2: for $i = 1$ to $K$ do\n3: $y(i) \\leftarrow \\sum_j \\mathbf{1}(U_{ij} \\geq \\frac{1}{2}) + \\sum_{j,k} \\mathbf{1}(U_{ij} \\geq \\frac{1}{2} \\wedge U_{jk} \\geq \\frac{1}{2})$\n4: end for\n5: Select any $a \\in \\arg\\max_i y(i)$\n6: $B \\leftarrow \\{ i \\neq a : U_{ai} \\geq \\frac{1}{2} \\wedge U_{ia} \\geq \\frac{1}{2} \\}$\n7: if $\\big( \\forall i \\neq a : (U_{ia} < \\frac{1}{2}) \\vee (\\exists j : U_{ij} < \\frac{1}{2} \\wedge U_{ja} < \\frac{1}{2}) \\big)$ then\n8: Return $(a,a)$\n9: else\n10: if $B \\neq \\emptyset$ then\n11: Select any $b \\in \\arg\\max_{i \\in B} U_{ia}$\n12: else\n13: Select any $b \\neq a$\n14: end if\n15: Return $(a,b)$\n16: end if\n\nAlgorithm 4 SELECTPROC-BA\n1: Input: UCB matrix $U \\in \\mathbb{R}_+^{K \\times K}$\n2: Select any $j_1 \\in [K]$\n3: $J \\leftarrow \\{ j_1 \\}$ // Initialize candidate Banks trajectory\n4: $s \\leftarrow 1$ // Initialize size of candidate Banks traje
ctory\n5: traj_found = FALSE\n6: while NOT(traj_found) do\n7: $C \\leftarrow \\{ i \\notin J : U_{ij} > \\frac{1}{2} \\; \\forall j \\in J \\}$\n8: if $C = \\emptyset$ then\n9: traj_found = TRUE\n10: break\n11: else\n12: $j_{s+1} \\in \\arg\\max_{i \\in C} (\\min_{j \\in J} U_{ij})$\n13: $J \\leftarrow J \\cup \\{ j_{s+1} \\}$\n14: $s \\leftarrow s + 1$\n15: end if\n16: end while\n17: if $\\big( \\forall \\, 1 \\leq q < r \\leq s : U_{j_q, j_r} < \\frac{1}{2} \\big)$ then\n18: $a \\leftarrow j_s$\n19: Return $(a,a)$\n20: else\n21: Select any $(\\tilde{q}, \\tilde{r}) \\in \\arg\\max_{(q,r) : 1 \\leq q < r \\leq s} U_{j_q, j_r}$\n22: $(a,b) \\leftarrow (j_{\\tilde{q}}, j_{\\tilde{r}})$\n23: Return $(a,b)$\n24: end if\n\nTheorem 11 (SELECTPROC-UC satisfies safety conditions w.r.t. UC). SELECTPROC-UC satisfies both the safe identical-arms condition and the safe distinct-arms condition w.r.t. UC.\n\nAgain, by virtue of Theorem 8, this immediately yields the following regret bound for UCB-UC:\n\nCorollary 12 (Regret bound for UCB-UC algorithm). Let $P \\in \\mathcal{P}_K$. Let $\\alpha > \\frac{1}{2}$ and $\\delta \\in (0,1]$. Then with probability at least $1 - \\delta$, the cumulative regret of UCB-UC w.r.t. the uncovered set satisfies $R^{\\mathrm{UC}}_T\\big(\\text{UCB-UC}(\\alpha)\\big) \\leq C(K,\\alpha,\\delta) + 4\\alpha (\\ln T) \\left( \\sum_{i < j : (i,j) \\notin \\mathrm{UC}(P) \\times \\mathrm{UC}(P)} \\frac{r^{\\mathrm{UC}}_P(i,j)}{(\\Delta^P_{ij})^2} \\right)$.\n\n4.3 UCB-BA: Dueling Bandit Algorithm for Banks Set\n\nThe selection procedure SELECTPROC-BA (Algorithm 4), when instantiated in the UCB-TS template, yields the UCB-BA dueling bandit algorithm. Intuitively, SELECTPROC-BA first constructs an optimistic candidate maximal acyclic subtournament (set $J$; also called a Banks trajectory) based on the UCBs $U$ (lines 2\u201316). If this subtournament is completely resolved (line 17), then its maximal arm $a$ is picked and $(a,a)$ is returned; if not, an unresolved pair $(a,b)$ is returned that is most likely to fail the acyclicity/transitivity property. We have the following result (see Appendix D for a proof):\n\nTheorem 13 (SELECTPROC-BA satisfies safety conditions w.r.t. BA). SELECTPROC-BA satisfies both the safe identical-arms condition and the safe distinct-arms condition w.r.t. BA.\n\nAgain, by virtue of Theorem 8, this immediately yields the following regret bound for UCB-BA:\n\nCorollary 14 (Regret bound for UCB-BA algorithm). Let $P \\in \\mathcal{P}_K$. Let $\\alpha > \\frac{1}{2}$ and $\\delta \\in (0,1]$. Then with probability at least $1 - \\delta$, the cumulative regret of UCB-BA w.r.t. the Banks set satisfies $R^{\\mathrm{BA}}_T\\big(\\text{UCB-BA}(\\alpha)\\big) \\leq C(K,\\alpha,\\delta) + 4\\alpha (\\ln T) \\left( \\sum_{i < j : (i,j) \\notin \\mathrm{BA}(P) \\times \\mathrm{BA}(P)} \\frac{r^{\\mathrm{BA}}_P(i,j)}{(\\Delta^P_{ij})^2} \\right)$.\n\nFigure 4: Regret performance of our algorithms compared to BTMB, RUCB, SAVAGE-CO, and CCB. Results are averaged over 10 independent runs; light colored bands represent one standard error. Left: Top cycle regret of UCB-TC on $P^{\\mathrm{MSLR}}$. Middle: Uncovered set regret of UCB-UC on $P^{\\mathrm{Tennis}}$. Right: Banks set regret of UCB-BA on $P^{\\mathrm{Hudry}}$. See Appendix F.2 for additional results.\n\n5 Experiments\n\nHere we provide an empirical evaluation of the performance of the proposed dueling bandit algorithms. We used the following three preference matrices in our experiments, one of which is synthetic and two of which are real-world, and none of which possesses a Condorcet winner:\n\u2022 $P^{\\mathrm{Hudry}} \\in \\mathcal{P}_{13}$: This is constructed from the Hudry tournament shown in Figure 2(b); as noted earlier, this is the smallest tournament whose Copeland set and Banks set are disjoint [14]. Details of this preference matrix can be found in Appendix F.1.1.\n\u2022 $P^{\\mathrm{Tennis}} \\in \\mathcal{P}_8$: This is constructed from real data collected from the Association of Tennis Professionals\u2019 (ATP\u2019s) website on outcomes of tennis matches played among 8 well-known professional tennis players. The tournament associated with $P^{\\mathrm{Tennis}}$ is shown in Figure 2(c); further details of this preference matrix can be found in Appendix F.1.2.\n\u2022 $P^{\\mathrm{MSLR}} \\in \\mathcal{P}_{16}$: This is constructed from real data from the Microsoft Learning to Rank (MSLR) Web10K dataset. Further details can be found in Appendix F.1.3.\n\nWe compared the performance of our algorithms, UCB-TC, UCB-BA, and UCB-UC, with four previous dueling bandit algorithms: BTMB [2], RUCB [6], SAVAGE-CO [3], and CCB [11].$^7$ In each case, we assessed the algorithms in terms of average pairwise regret relative to the target tournament solution of interest (see Section 2), averaged over 10 independent runs. A sample of the results is shown in Figure 4; as can be seen, the proposed algorithms UCB-TC, UCB-UC, and UCB-BA generally outperform existing baselines in terms of minimizing regret relative to the top cycle, the uncovered set, and the Banks set, respectively. Additional results, including results with the Copeland set variant of our algorithm, UCB-CO, can be found in Appendix F.2.\n\n6 Conclusion\n\nIn this paper, we have pr
oposed the use of general tournament solutions as sets of \u2018winning\u2019 arms in stochastic dueling bandit problems, with the advantage that these tournament solutions always exist and can be used to define winners according to criteria that are most relevant to a given dueling bandit setting. We have developed a UCB-style family of algorithms for such general tournament solutions, and have shown $O(\\ln T)$ anytime regret bounds for the algorithm instantiated to the top cycle, uncovered set, and Banks set (as well as the Copeland set; see Appendix E). While our approach has an appealing modular structure both algorithmically and in our proofs, an open question concerns the optimality of our regret bounds in their dependence on the number of arms $K$. For the Condorcet winner, the MergeRUCB algorithm [7] has an anytime regret bound of the form $O(K \\ln T)$; for the Copeland set, the SCB algorithm [11] has an anytime regret bound of the form $O(K \\ln K \\ln T)$. In the worst case, our regret bounds are of the form $O(K^2 \\ln T)$. Is it possible that for the top cycle, uncovered set, and Banks set, one can also show an $\\Omega(K^2 \\ln T)$ lower bound on the regret? Or can our regret bounds or algorithms be improved? We leave a detailed investigation of this issue to future work.\n\nAcknowledgments. Thanks to the anonymous reviewers for helpful comments and suggestions. SR thanks Google for a travel grant to present this work at the conference.\n\nFootnote 7: For all the UCB-based algorithms (including our algorithms, RUCB, and CCB), we set the exploration parameter $\\alpha$ to 0.51; for SAVAGE-CO, we set the confidence parameter $\\delta$ to $1/T$; and for BTMB, we set $\\delta$ to $1/T$ and chose $\\gamma$ to satisfy the $\\gamma$-relaxed stochastic transitivity property for each preference matrix.\n\nReferences\n\n[1] Yisong Yue, Josef Broder, Robert Kleinberg, and Thorsten Joachims. The K-armed dueling bandits problem. Journal of Computer and System Sciences, 78(5):1538\u20131556, 2012.\n[2] Yisong Yue and Thorsten Joachims. Beat the mean bandit. In Proceedings of the 28th International Conference on Machine Learning, 2011.\n[3] Tanguy Urvoy, Fabrice Clerot, Raphael F\u00e9raud, and Sami Naamane. Generic exploration and k-armed voting bandits. In Proceedings of the 30th International Conference on Machine Learning, 2013.\n[4] R\u00f3bert Busa-Fekete, Balazs Szorenyi, Weiwei Cheng, Pa
ul Weng, and Eyke H\u00fcllermeier. Top-k selection based on adaptive sampling of noisy preferences. In Proceedings of the 30th International Conference on Machine Learning, 2013.\n[5] Nir Ailon, Zohar Karnin, and Thorsten Joachims. Reducing dueling bandits to cardinal bandits. In Proceedings of the 31st International Conference on Machine Learning, 2014.\n[6] Masrour Zoghi, Shimon Whiteson, Remi Munos, and Maarten de Rijke. Relative upper confidence bound for the k-armed dueling bandit problem. In Proceedings of the 31st International Conference on Machine Learning, 2014.\n[7] Masrour Zoghi, Shimon Whiteson, and Maarten de Rijke. MergeRUCB: A method for large-scale online ranker evaluation. In Proceedings of the 8th ACM International Conference on Web Search and Data Mining, 2015.\n[8] Kevin Jamieson, Sumeet Katariya, Atul Deshpande, and Robert Nowak. Sparse dueling bandits. In Proceedings of the 18th International Conference on Artificial Intelligence and Statistics, 2015.\n[9] Junpei Komiyama, Junya Honda, Hisashi Kashima, and Hiroshi Nakagawa. Regret lower bound and optimal algorithm in dueling bandit problem. In Proceedings of the 28th Conference on Learning Theory, 2015.\n[10] Miroslav Dud\u0131k, Katja Hofmann, Robert E Schapire, Aleksandrs Slivkins, and Masrour Zoghi. Contextual dueling bandits. In Proceedings of the 28th Conference on Learning Theory, 2015.\n[11] Masrour Zoghi, Zohar S. Karnin, Shimon Whiteson, and Maarten de Rijke. Copeland dueling bandits. In Advances in Neural Information Processing Systems 28, 2015.\n[12] Felix Brandt, Markus Brill, and Paul Harrenstein. Tournament solutions. In Handbook of Computational Social Choice. Cambridge University Press, 2016.\n[13] Felix Brandt, Andre Dau, and Hans Georg Seedig. Bounds on the disparity and separation of tournament solutions. Discrete Applied Mathematics, 187:41\u201349, 2015.\n[14] Olivier Hudry. A smallest tournament for which the Banks set and the Copeland set are disjoint. Social Choice and Welfare, 16(1):137\u2013143, 1999.\n[15] Kenneth A Shepsle and Barry R Weingast. Uncovered sets and sophisticated voting outcomes with implications for agenda institutions. American Journal of Political Science, pages 49\u201374, 1984.\n[16] Kevin Jamieson and Robert Nowak. Active ranking using pairwise comparisons. In Advanc
es in Neural Information Processing Systems, 2011.", "award": [], "sourceid": 680, "authors": [{"given_name": "Siddartha", "family_name": "Ramamohan", "institution": "Indian Institute of Science"}, {"given_name": "Arun", "family_name": "Rajkumar", "institution": "Xerox Research Center, India."}, {"given_name": "Shivani", "family_name": "Agarwal", "institution": "Radcliffe Institute"}, {"given_name": "Shivani", "family_name": "Agarwal", "institution": "University of Pennsylvania"}]}